DLESE Tools
v1.2

org.dlese.dpc.oai.harvester
Class Harvester

java.lang.Object
  extended by org.dlese.dpc.oai.harvester.Harvester
All Implemented Interfaces:
ErrorHandler

public class Harvester
extends Object
implements ErrorHandler

Harvests metadata from an OAI data provider, saving the results to file or returning the raw XML as an array of Strings. The static methods harvest(String baseURL,String metadataPrefix,String set,Date from,Date until,String outdir,HarvestMessageHandler msgHandler,boolean writeHeaders) and harvest(String baseURL,String metadataPrefix,String set,Date from,Date until,String outdir,boolean writeHeaders) are provided for convenience. If not using the static methods, note that a new Harvester instance must be used for each harvest performed.

For information on OAI, see:
OAI v2.0 spec:
http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm
OAI tools:
http://www.openarchives.org/tools/tools.html
Repository Explorer:
http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai

Author:
Steve Sullivan, John Weatherley
See Also:
HarvestMessageHandler

Constructor Summary
Harvester()
          Creates a Harvester that uses no HarvestMessageHandler.
Harvester(HarvestMessageHandler msgHandler)
          Creates a Harvester that uses the given HarvestMessageHandler.
 
Method Summary
 String[][] doHarvest(String baseURL, String metadataPrefix, String set, Date from, Date until, String outdir, boolean writeHeaders)
          Performs the harvest.
 void error(SAXParseException exc)
          Part of ErrorHandler interface.
 void fatalError(SAXParseException exc)
          Part of ErrorHandler interface.
 long getEndTime()
          Gets the endTime when the harvest completed, either because of an error or at the end of a successful harvest.
 String getHarvestedRecordsDir()
          Gets the harvestedRecordsDir attribute of the Harvester object
 long getHarvestUid()
          Returns a unique ID for this harvest.
 int getNumRecordsHarvested()
          Gets the current number of records that have been harvested by this harvester.
 int getNumResumptionTokensIssued()
          Gets the number of resumption tokens that have currently been issued by the data provider.
 long getStartTime()
          Gets the startTime when the harvest began, or 0 if it has not begun yet.
static String[][] harvest(String baseURL, String metadataPrefix, String set, Date from, Date until, String outdir, boolean writeHeaders)
          Harvest the given provider, saving the resulting metadata to file or returning the results as an array of Strings.
static String[][] harvest(String baseURL, String metadataPrefix, String set, Date from, Date until, String outdir, HarvestMessageHandler msgHandler, boolean writeHeaders)
          Harvest the given provider, saving the resulting metadata to file or returning the results as an array of Strings.
 boolean isRunning()
          Determines whether this Harvester is currently running or not.
 void kill()
          Gracefully kills the harvest after the current record is finished being harvested.
static void main(String[] args)
          Command line test interface for the harvester.
 void setNumRecordsForNotification(int numRecords)
          Sets the number of records harvested before statusMessage notifications to the HarvestMessageHandler are made.
 void warning(SAXParseException exc)
          Part of ErrorHandler interface.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Harvester

public Harvester()
Creates a Harvester that uses no HarvestMessageHandler.


Harvester

public Harvester(HarvestMessageHandler msgHandler)
Creates a Harvester that uses the given HarvestMessageHandler.

Parameters:
msgHandler - The HarvestMessageHandler that will receive messages as the harvest progresses, or null if none.
Method Detail

main

public static void main(String[] args)
Command line test interface for the harvester.

Parameters:
args - The command line arguments

harvest

public static String[][] harvest(String baseURL,
                                 String metadataPrefix,
                                 String set,
                                 Date from,
                                 Date until,
                                 String outdir,
                                 HarvestMessageHandler msgHandler,
                                 boolean writeHeaders)
                          throws Hexception,
                                 OAIErrorException
Harvest the given provider, saving the resulting metadata to file or returning the results as an array of Strings. The user may supply a HarvestMessageHandler to capture harvest progress messages. Use a SimpleHarvestMessageHandler to have harvest messages sent to standard out.

Parameters:
baseURL - The baseURL of the data provider.
metadataPrefix - metadataPrefix. e.g., "oai_dc"
set - set. e.g., "testset" or null for none.
from - from date. May be null.
until - until date. May be null.
outdir - path of output dir. If null or "", we return the String[][] array; if specified we return null.
msgHandler - A handler for status messages that occur during harvesting, or null to ignore messages.
writeHeaders - True to have headers written, false not to.
Returns:
If outdir is specified returns null; if outdir is null or "", returns one row for each record harvested. Each row has two elements:
  • identifier, encoded
  • content xml record, or the String deleted if status=deleted.

Throws:
Hexception - If serious error.
OAIErrorException - If OAI error.
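
The two-element row layout described under Returns can be consumed roughly as follows. This is a minimal sketch, not part of the DLESE API: the class HarvestResultSketch and its methods are hypothetical, the sample rows are made up for illustration, and it assumes the deleted marker is the literal String "deleted". The identifier encoding is not specified here, so the identifiers are treated as opaque Strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of org.dlese.dpc.oai.harvester.
public class HarvestResultSketch {

    // Assumption: deleted records carry the literal String "deleted" as content.
    static final String DELETED = "deleted";

    // Collects the identifiers (row[0]) of rows whose content (row[1]) is the deleted marker.
    static List<String> deletedIdentifiers(String[][] rows) {
        List<String> ids = new ArrayList<>();
        if (rows == null) {
            return ids; // harvest returns null when outdir was specified
        }
        for (String[] row : rows) {
            if (DELETED.equals(row[1])) {
                ids.add(row[0]);
            }
        }
        return ids;
    }

    // Counts rows that hold an XML record rather than the deleted marker.
    static int countLive(String[][] rows) {
        if (rows == null) {
            return 0;
        }
        return rows.length - deletedIdentifiers(rows).size();
    }

    public static void main(String[] args) {
        // Made-up rows in the documented shape: { identifier, content }.
        String[][] rows = {
            {"example-id-1", "<record>...</record>"},
            {"example-id-2", "deleted"}
        };
        System.out.println("live records: " + countLive(rows));
        System.out.println("deleted ids: " + deletedIdentifiers(rows));
    }
}
```

Checking for null first matters because the same method returns null whenever records are written to outdir instead of being returned.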

harvest

public static String[][] harvest(String baseURL,
                                 String metadataPrefix,
                                 String set,
                                 Date from,
                                 Date until,
                                 String outdir,
                                 boolean writeHeaders)
                          throws Hexception,
                                 OAIErrorException
Harvest the given provider, saving the resulting metadata to file or returning the results as an array of Strings. Harvest progress messages are output to standard out.

Parameters:
baseURL - The baseURL of the data provider.
metadataPrefix - metadataPrefix. e.g., "oai_dc"
set - set. e.g., "testset" or null for none.
from - from date. May be null.
until - until date. May be null.
outdir - path of output dir. If null or "", we return the String[][] array; if specified we return null.
writeHeaders - True to have headers written, false not to.
Returns:
If outdir is specified returns null; if outdir is null or "", returns one row for each record harvested. Each row has two elements:
  • identifier, encoded
  • content xml record, or the String deleted if status=deleted.

Throws:
Hexception - If serious error.
OAIErrorException - If OAI error.

kill

public void kill()
Gracefully kills the harvest after the current record is finished being harvested.


setNumRecordsForNotification

public void setNumRecordsForNotification(int numRecords)
Sets the number of records harvested before statusMessage notifications to the HarvestMessageHandler are made.

Parameters:
numRecords - The new numRecordsForNotification value

getStartTime

public long getStartTime()
Gets the startTime when the harvest began, or 0 if it has not begun yet.

Returns:
The startTime, or 0 if not started yet.

getHarvestedRecordsDir

public String getHarvestedRecordsDir()
Gets the harvestedRecordsDir attribute of the Harvester object

Returns:
The harvestedRecordsDir value

getHarvestUid

public long getHarvestUid()
Returns a unique ID for this harvest.

Returns:
The harvestId value

getEndTime

public long getEndTime()
Gets the endTime when the harvest completed, either because of an error or at the end of a successful harvest. Returns 0 if the harvest is still in progress.

Returns:
The endTime, or 0 if the harvest is still in progress.
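
Because getStartTime() and getEndTime() both use 0 as a sentinel, computing an elapsed time needs two checks. The sketch below shows one way to combine them. HarvestTimingSketch is a hypothetical helper, and it assumes both values share the unit of the supplied current time (for example, milliseconds from System.currentTimeMillis()), which this page does not state.

```java
// Hypothetical helper, not part of org.dlese.dpc.oai.harvester.
public class HarvestTimingSketch {

    // Returns the elapsed time of a harvest, or -1 if it has not started.
    // startTime: value from getStartTime() (0 means not started yet)
    // endTime:   value from getEndTime() (0 means still in progress)
    // now:       current time, in the same unit as the two values above (assumed)
    static long elapsed(long startTime, long endTime, long now) {
        if (startTime == 0) {
            return -1; // harvest has not begun
        }
        long stop = (endTime == 0) ? now : endTime; // still running: measure up to now
        return stop - startTime;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        System.out.println(elapsed(0, 0, 500));
        System.out.println(elapsed(100, 0, 500));
        System.out.println(elapsed(100, 300, 500));
    }
}
```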

getNumRecordsHarvested

public int getNumRecordsHarvested()
Gets the current number of records that have been harvested by this harvester. This number increases as the harvest progresses.

Returns:
The numRecordsHarvested value

getNumResumptionTokensIssued

public int getNumResumptionTokensIssued()
Gets the number of resumption tokens that have currently been issued by the data provider. This number increases as the harvest progresses. This number gives a rough indication of the progression and duration of the harvest.

Returns:
The numResumptionTokensIssued value.

isRunning

public boolean isRunning()
Determines whether this Harvester is currently running or not.

Returns:
True if the harvest is in progress, false otherwise.

doHarvest

public String[][] doHarvest(String baseURL,
                            String metadataPrefix,
                            String set,
                            Date from,
                            Date until,
                            String outdir,
                            boolean writeHeaders)
                     throws Hexception,
                            OAIErrorException
Performs the harvest. Note that this method is not safe for multiple harvests: a separate Harvester instance should be created for each harvest performed.

Parameters:
baseURL - The baseURL of the data provider.
metadataPrefix - metadataPrefix. e.g., "oai_dc"
set - set. e.g., "testset" or null for none.
from - from date. May be null.
until - until date. May be null.
outdir - path of output dir. If null or "", we return the String[][] array; if specified we return null.
writeHeaders - True to have OAI headers written to file, false not to.
The directory structure under outdir is:
  outdir/set/subset/subset/metadataPrefix/oaiId_hdr.xml    OAI header
  outdir/set/subset/subset/metadataPrefix/oaiId_data.xml   OAI contents
Returns:
If outdir is specified returns null; if outdir is null or "", returns one row for each record harvested. Each row has two elements:
  • identifier, encoded
  • content xml record.

Throws:
Hexception - If serious error.
OAIErrorException - If OAI error was returned by the data provider.
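
The output layout described for outdir can be illustrated by building the expected paths. This is only a sketch under stated assumptions: HarvestLayoutSketch is hypothetical, it assumes a colon-separated OAI setSpec (for example "set:subset") maps to nested directories, and it simply omits the set directories when set is null. How the Harvester itself encodes oaiId or handles a null set is not documented on this page.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of org.dlese.dpc.oai.harvester.
public class HarvestLayoutSketch {

    // Builds outdir/set/subset/.../metadataPrefix.
    // Assumption: each colon-separated part of setSpec becomes one directory level.
    static Path recordDir(String outdir, String setSpec, String metadataPrefix) {
        Path dir = Paths.get(outdir);
        if (setSpec != null && !setSpec.isEmpty()) {
            for (String part : setSpec.split(":")) {
                dir = dir.resolve(part);
            }
        }
        return dir.resolve(metadataPrefix);
    }

    // File holding the OAI header for one record.
    static Path headerFile(Path dir, String encodedId) {
        return dir.resolve(encodedId + "_hdr.xml");
    }

    // File holding the OAI contents for one record.
    static Path dataFile(Path dir, String encodedId) {
        return dir.resolve(encodedId + "_data.xml");
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Path dir = recordDir("harvested", "testset:subset", "oai_dc");
        System.out.println(headerFile(dir, "example-id"));
        System.out.println(dataFile(dir, "example-id"));
    }
}
```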

fatalError

public void fatalError(SAXParseException exc)
Part of ErrorHandler interface.

Specified by:
fatalError in interface ErrorHandler
Parameters:
exc - The SAXParseException reported by the XML parser.

error

public void error(SAXParseException exc)
Part of ErrorHandler interface.

Specified by:
error in interface ErrorHandler
Parameters:
exc - The SAXParseException reported by the XML parser.

warning

public void warning(SAXParseException exc)
Part of ErrorHandler interface.

Specified by:
warning in interface ErrorHandler
Parameters:
exc - The SAXParseException reported by the XML parser.
