Getting started | Documentation | Provider explorer | ODL search
Provider administration | Harvester administration


DLESE OAI Software Documentation

The DLESE OAI software supports the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), v2.0. This is software version 2.0.9.3.

Overview

This software is intended to be simple to set up and use. The data provider works by serving metadata collections that exist as XML files stored on disk. When files are modified or deleted, the data provider automatically updates its output to reflect those changes, allowing the changes to propagate to harvesters. Administrators who have metadata stored in a database may use the data provider by exporting their metadata to an XML file cache and configuring the software to serve from that cache. The software includes a number of search, validation and data viewing features to help support the creation and maintenance of metadata repositories. This software is provided under the GNU General Public License.

List of features

Harvester:

  • Supports protocol versions 2.0 and 1.1.
  • Automatic updating and synchronizing of the local data files with the remote data provider.
  • Supports gzip compression.
  • Ability to monitor harvest activity through the web interface.
  • Saves files to disk in folders by baseURL path, set and format.
  • Planned future features include integrated validation of harvested records, and indexing and search over harvested sets. These are not yet implemented.

Data provider:

  • Supports exploring the repository using OAI-PMH requests. See Provider explorer.
  • Supports keyword searching over the repository. See Provider search and discovery page.
  • Implements an ODL search extension of the OAI-PMH protocol that supports content-based queries.
  • Serves metadata directly from XML files cached on disk and requires no database for operation.
  • Allows users to validate and view each OAI response that is being returned by the provider.
  • Allows users to validate and view individual XML metadata records within the repository.
  • Supports configuration of arbitrary metadata sets.
  • Supports indexing and serving of any schema-based XML metadata format, plus DLESE IMS (which is defined by a DTD rather than a schema).
  • Provides extensible, plug-and-play configuration of XML metadata format converters, with support for XSL or Java.
  • Includes an XSL format converter to convert from DLESE IMS to ADN.
  • Allows a number of administrative options to control the behavior of the provider and the data that is returned.
  • Provides searchable reports that detail validation or critical errors found in metadata files.
  • Provides searchable reports of harvesters that have accessed the data provider.
  • Supports data transmission in compressed gzip format.
  • Supports deletions.
  • Supports modification date granularity down to seconds.

Bug fixes and additional late-breaking information about the software are listed in the README.txt file.

DLESE ODL search specification

The data provider extends the OAI-PMH to support searching similar to the Open Digital Library (ODL) search specification (odlsearch1). The DLESE ODL search exposes the data in this provider as a web service, allowing clients to use the OAI-PMH to perform keyword and fielded Information Retrieval (IR) search queries over the metadata repository. Because ODL search (and OAI in general) leverages existing Internet technologies such as HTTP, it may be defined as a Representational State Transfer (REST) style web service. After issuing a search request, clients receive raw metadata back as an ordered set of results within the standard ListRecords or ListIdentifiers response containers. Results are ordered by relevance, which is determined by relative term frequency and proximity. The metadata may then be used to render custom interfaces or be embedded in remote clients on the fly. See the DLESE ODL search page to issue search requests to this data provider.

Search query syntax

To perform a search, clients must provide a search query in the set argument of either a ListRecords or ListIdentifiers request. The set argument must conform to the following syntax:

dleseodlsearch/[query string]/[set]/[offset]/[length]

where:

  • "dleseodlsearch" is the exact string dleseodlsearch.
  • "query string" indicates a list of keywords upon which to search.
  • "set" indicates the set over which to search. To search over all sets, clients must supply the string "null" in the set field.
  • "offset" indicates the offset into the results list at which to begin the results.
  • "length" indicates the total number of results to return.

Clients must escape all spaces in queries with a plus (+) symbol. The default boolean logic is AND. To request a query using boolean OR, clients must supply the exact string "OR" between each term in the query string.

Examples:

".../provider?verb=ListRecords&metadataPrefix=oai_dc&set=dleseodlsearch/ocean/null/0/10" - Performs a text search for the term "ocean" across all sets in the repository, returning up to 10 matching results, numbers 0 through 9.

".../provider?verb=ListRecords&metadataPrefix=oai_dc&set=dleseodlsearch/ocean/null/10/10" - Performs a text search for the term "ocean" across all sets in the repository, returning up to 10 matching results, numbers 10 through 19.

".../provider?verb=ListRecords&metadataPrefix=oai_dc&set=dleseodlsearch/ocean+weather/dcc/0/10" - Performs a text search for the terms "ocean" AND "weather" across the set dcc, returning up to 10 matching results, numbers 0 through 9.

".../provider?verb=ListRecords&metadataPrefix=oai_dc&set=dleseodlsearch/ocean+OR+weather/dcc/0/10" - Performs a text search for the terms "ocean" OR "weather" across the set dcc, returning up to 10 matching results, numbers 0 through 9.

Tip: you may use this software to construct and submit sample DLESE ODL search queries for you. To do so, go to the DLESE ODL search page and enter a query string into the search interface. The provider will return the matching results to your web browser for inspection. The address (URL) provided to your browser shows the full OAI-PMH and DLESE ODL query string that was issued to the provider (note, the argument "rt" is not part of the protocol and should be ignored).
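
The set argument syntax described above can also be assembled programmatically. The following Python sketch is illustrative only and is not part of this distribution: the function names build_search_set and build_search_url and the base URL http://example.org/oai/provider are hypothetical. It assumes that query terms contain only plain words (no slashes or other characters that would need further escaping).

```python
def build_search_set(terms, set_spec=None, offset=0, length=10, use_or=False):
    """Return a set argument of the form
    dleseodlsearch/[query string]/[set]/[offset]/[length]."""
    # Split multi-word terms so that every space is replaced by a plus sign.
    words = [word for term in terms for word in term.split()]
    if not words:
        raise ValueError("at least one search term is required")
    # The default logic is AND; OR requires the exact string "OR" between terms.
    joiner = "+OR+" if use_or else "+"
    query = joiner.join(words)
    # The string "null" requests a search over all sets.
    target = set_spec if set_spec else "null"
    return f"dleseodlsearch/{query}/{target}/{offset}/{length}"


def build_search_url(base_url, terms, metadata_prefix="oai_dc",
                     verb="ListRecords", **search_options):
    """Return a full request URL (base_url is a hypothetical provider address)."""
    if verb not in ("ListRecords", "ListIdentifiers"):
        raise ValueError("search is only defined for ListRecords and ListIdentifiers")
    set_arg = build_search_set(terms, **search_options)
    return f"{base_url}?verb={verb}&metadataPrefix={metadata_prefix}&set={set_arg}"


if __name__ == "__main__":
    print(build_search_url("http://example.org/oai/provider",
                           ["ocean", "weather"], set_spec="dcc", use_or=True))
```

Calling build_search_set(["ocean"]) with the defaults yields the set argument used in the first example above.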

Flow control

All state is embedded in the search query string, giving clients control over response flow. Clients can "page through" a set of results by issuing the same request in succession and incrementing the offset parameter by the desired amount each time. For example, a client wishing to iterate through three pages of results for a search on the term ocean, retrieving ten records per page, would issue the following three queries:

".../provider?verb=ListRecords&metadataPrefix=oai_dc&set=dleseodlsearch/ocean/null/0/10"

".../provider?verb=ListRecords&metadataPrefix=oai_dc&set=dleseodlsearch/ocean/null/10/10"

".../provider?verb=ListRecords&metadataPrefix=oai_dc&set=dleseodlsearch/ocean/null/20/10"

At the end of each response an empty resumptionToken element is provided for the client that contains the attributes completeListSize and cursor. The completeListSize attribute shows the total number of records in the repository that match the given query. The cursor reflects the offset into the result set where the current response container begins.
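
A client might use these two attributes to decide whether another page is worth requesting. The Python sketch below is an assumption-laden illustration, not code from this distribution: the function names and the sample response are hypothetical, and the sample contains only the resumptionToken element (a real response also carries the matching records). It assumes the response uses the standard OAI-PMH 2.0 namespace and that both attributes are present.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical, heavily trimmed response used only for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <resumptionToken completeListSize="25" cursor="10"/>
  </ListIdentifiers>
</OAI-PMH>"""


def read_paging_info(response_xml):
    """Return (completeListSize, cursor) as integers, or None if no token is found."""
    root = ET.fromstring(response_xml)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    if token is None:
        return None
    return int(token.get("completeListSize")), int(token.get("cursor"))


def next_offset(complete_list_size, cursor, length):
    """Return the offset for the next page, or None when no results remain."""
    candidate = cursor + length
    return candidate if candidate < complete_list_size else None


if __name__ == "__main__":
    size, cursor = read_paging_info(SAMPLE_RESPONSE)
    print(size, cursor, next_offset(size, cursor, 10))
```

With 25 matching records and ten records per page, the client would request offsets 0, 10 and 20 and then stop.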

Development of the DLESE ODL search service is ongoing. Future iterations are proposed to extend the number of fields available for searching. The three fields most likely to be added in the coming months are grade level, resource type and content standards. Please send questions, comments or suggestions to John Weatherley (jweather@ucar.edu).

Configuring the data provider

Quick start instructions

For instructions on setting up the data provider for its first use, see Getting started.

How your metadata files should be formatted

Metadata files should contain valid XML that is defined by a schema. One exception: the DLESE IMS metadata format uses a DTD; however, this software is able to handle that format properly (v2.1).

Plug-and-play format converters

The software can be configured to convert to and from any arbitrary metadata format using an XSL stylesheet or a custom Java class that implements the XMLFormatConverter Interface. To configure a converter, do the following:

1. Create or obtain an XSL stylesheet or Java class that performs the desired format conversion. For Java converters, the class must implement the XMLFormatConverter Interface.

2. If you have an XSL stylesheet, place it in the directory "xsl_files" located in the "WEB-INF" directory of the OAI software context. If you have a Java class converter, place it anywhere within the classpath of the servlet container.

3. Edit the "web.xml" file located in the "WEB-INF" directory. To point the software to your converter, add an init parameter for the OAIProviderServlet. If it is an XSL converter, give the parameter a name that begins with "xslconverter." For a Java class converter, the name must begin with "javaconverter." Each param-name must be unique; otherwise it will not be recognized. For the param-value field, supply a string of the form [convertername]|[from format]|[to format], where convertername is either the name of an XSL file or a fully qualified Java class name, and the from and to formats are the metadataPrefixes for the given formats. For example, if you have an XSL stylesheet named myDCConverter.xsl that converts from ADN to Dublin Core, the param-value would be "myDCConverter.xsl|adn|oai_dc" (quotes omitted). If it were a Java class converter with the full name org.institution.converter.MyDCConverter, the param-value would be "org.institution.converter.MyDCConverter|adn|oai_dc" (quotes omitted).

4. After configuring the web.xml file and placing the converter in the appropriate location, start or restart the software. The software will automatically recognize the converter and add the new format to its list of available formats.
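
The init parameter described in step 3 can be sketched as follows. This Python snippet is an illustration only and is not shipped with the software: the function name converter_init_param and the param-name suffix "mydc" are hypothetical (any suffix that keeps each param-name unique should serve). It prints an init-param element that could be pasted inside the OAIProviderServlet definition in web.xml.

```python
import xml.etree.ElementTree as ET


def converter_init_param(kind, label, converter, from_format, to_format):
    """Build an <init-param> element for an XSL or Java format converter.

    kind is "xsl" or "java"; label is a hypothetical suffix that only
    serves to keep the param-name unique.
    """
    if kind not in ("xsl", "java"):
        raise ValueError("kind must be 'xsl' or 'java'")
    init_param = ET.Element("init-param")
    name = ET.SubElement(init_param, "param-name")
    # The name must begin with "xslconverter." or "javaconverter."
    name.text = f"{kind}converter.{label}"
    value = ET.SubElement(init_param, "param-value")
    # [convertername]|[from format]|[to format]
    value.text = f"{converter}|{from_format}|{to_format}"
    return ET.tostring(init_param, encoding="unicode")


if __name__ == "__main__":
    print(converter_init_param("xsl", "mydc", "myDCConverter.xsl", "adn", "oai_dc"))
```

For the XSL example in step 3, the resulting param-value is myDCConverter.xsl|adn|oai_dc.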

Tip: The format converter module caches the converted metadata to disk for increased performance. You may access these converted files as well. Once an item has been accessed for the first time, its cached file will appear in the directory "WEB-INF/repository_data/converted_xml_cache".

Using the harvester

The harvester included with this software can be used to harvest from any OAI data provider that supports protocol versions 1.1 or 2.0, saving the metadata to files in a directory you choose. You can access the harvester through the Harvester administration page. The harvester has two interfaces - one that allows you to save settings for recurring harvests and one that can be used for a simple one-time harvest of a given data provider.

To set up a recurring harvest, click the Add new recurring harvest button and enter the required information in the page that appears. Once set up, a recurring harvest can be run manually or automatically at regular intervals. To perform a one-time harvest, simply enter the required information in the one-time harvest interface and click harvest.

To view the progression of a harvest as it is taking place or to view the history of harvests, use the view status links in the harvester interface. Harvest log reports are updated continually. To view new log entries, use your browser's refresh button.

The Open Archives Initiative maintains a list of registered data providers.

Known data providers of interest within the DLESE community (baseURL shown):

  • DLESE - http://www.dlese.org/oai/provider
  • NSDL - http://services.nsdl.org:8080/nsdloai/OAI

The harvester may also be used programmatically within your Java code. Examples and documentation are available in the Harvester Javadoc. To use the Harvester programmatically, include DLESETools.jar, found in the lib directory of this distribution, in your classpath.

Enabling access control

To enable password protection for OAI administration in a Tomcat servlet container, uncomment the 'security-constraint' and 'login-config' elements found in the oai 'WEB-INF/web.xml' file. Then copy the following context definition into the <Host> element of the 'server.xml' config file found in the Tomcat 4.x 'conf' directory (the configuration below assumes the OAI context and directory name is 'oai'):

<Context path="/oai" docBase="oai" debug="0" reloadable="true">
      <Realm className="org.apache.catalina.realm.MemoryRealm"
      pathname="webapps/oai/WEB-INF/users.xml" />
</Context>

You will then need to restart Tomcat. The file 'WEB-INF/users.xml' defines user names and passwords for those granted access to the OAI administration pages. A default user of 'admin' with password 'admin' has been defined. You should edit the file and replace this default definition with your own user/password definitions before making your software public. Note that this type of authentication does not encrypt user names and passwords sent over the network, so you may want to access these pages only from a computer that is on the same local network as the host machine.

Obtaining this software

This software is available in pre-compiled binary form ready for installation. The binary distribution is packaged as a Java WAR file, which can be dropped into a servlet container such as Tomcat for rapid deployment.

The binary distribution is available at the DLESE SourceForge portal.

Installation instructions are available at:

  • Linux installation
  • Windows installation
  • Mac OS X installation

Instructions are also included with the distribution: after downloading, unzip the archive and see the file INSTALL.txt.

Building from source

This software can be built directly from source. You may obtain the current source distribution at the DLESE SourceForge portal (recommended). Alternatively, the latest development tree is available via anonymous CVS at the DLESE CVS repository on SourceForge. If obtaining from CVS, note that you will need both the oai-project and dlese-tools-project.

Instructions for building this software from source are available inside the source distribution. After downloading the source package at the DLESE SourceForge portal, unzip the archive and see the file BUILD_INSTRUCTIONS.txt.

General information

This software was built using the Lucene search API and the Struts web application framework.

The full Javadoc for this software is included here.

Send e-mail to support@dlese.org with questions, comments, or suggestions.

Open Archives Initiative
