DLESE OAI Software Documentation
The DLESE OAI software supports the Open Archives Initiative Protocol
for Metadata Harvesting (OAI-PMH), v2.0. This is software version 2.0.9.3.
Overview
This software is intended to be simple to set up and use. The data provider works by serving metadata
collections that exist as XML files stored on disc. When modifications
or deletions are made to the files, the data provider automatically updates
its output to reflect those changes, allowing propagation of the changes
to harvesters. Administrators who have metadata stored in a database
may use the data provider by exporting their metadata to an XML file cache
and configuring the software to serve from that cache. The software includes a number of search, validation
and data viewing features to help support the creation and maintenance
of metadata repositories. This software is
provided under the GNU General Public License.
List of features
Harvester:
- Supports protocol versions 2.0 and 1.1.
- Automatic updating and synchronizing of the local data files with the
remote data provider.
- Supports gzip compression.
- Ability to monitor harvest activity through the web interface.
- Saves files to disk in folders by baseURL path, set and format.
- Planned features (not yet implemented) include integrated validation of
harvested records, and indexing and search over harvested sets.
Data provider:
- Supports exploring the repository using OAI-PMH requests. See
Provider explorer.
- Supports keyword searching over the repository. See
Provider search and discovery page.
- Implements an ODL search extension of the OAI-PMH protocol that supports
content-based queries.
- Serves metadata directly from XML files cached on disc and requires
no database for operation.
- Allows users to validate and view each OAI response that is being
returned by the provider.
- Allows users to validate and view individual XML metadata records
within the repository.
- Supports configuration of arbitrary metadata sets.
- Supports indexing and serving of any schema-based XML metadata format,
plus DLESE-IMS (contains a DTD).
- Provides extensible, plug-and-play configuration of XML metadata format
converters, with support for XSL or Java.
- Includes an XSL format converter to convert from DLESE IMS to ADN.
- Allows a number of administrative options
to control the behavior of the provider and the data that is returned.
- Provides searchable reports
that detail validation or critical errors found in metadata files.
- Provides searchable reports
of harvesters that have accessed the data provider.
- Supports data transmission in compressed gzip format.
- Supports deletions.
- Supports modification date granularity down to seconds.
Bug fixes and additional late-breaking information about the software are listed in the
README.txt file.
DLESE ODL search specification
The data provider extends the OAI-PMH to support searching similar to the
Open Digital Library (ODL)
search specification (odlsearch1). The DLESE ODL search exposes the data in this provider
as a web service, allowing clients to use the
OAI-PMH to perform keyword and fielded Information Retrieval (IR) search
queries over the metadata repository. Because ODL search (and OAI in general) leverages existing
Internet technologies such as HTTP, it may be defined as a
Representational State Transfer (REST) style web service. After issuing a search request,
clients receive raw metadata back
as an ordered set of results within the standard
ListRecords or
ListIdentifiers response containers.
Results are ordered by relevance, which is determined by relative term frequency and proximity.
The metadata may then be used to render custom interfaces or be embedded in remote clients
on the fly. See the DLESE ODL search page to issue
search requests to this data provider.
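As a rough illustration of how a client might consume such a response, the sketch below reads the record identifiers out of a ListRecords (or ListIdentifiers) container in the order the provider returned them. The sample XML, the identifiers in it, and the function name ranked_identifiers are made up for the example; only the OAI-PMH 2.0 namespace and element names come from the protocol.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response documents.
OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical, heavily trimmed search response used only for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:record-1</identifier>
        <datestamp>2004-01-15T12:00:00Z</datestamp>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:record-2</identifier>
        <datestamp>2004-02-01T08:30:00Z</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""


def ranked_identifiers(response_xml):
    """Return record identifiers in document order (the provider's ranking)."""
    root = ET.fromstring(response_xml)
    # Headers appear under ListRecords/record and directly under
    # ListIdentifiers, so iterating over every header covers both verbs.
    return [header.findtext(OAI + "identifier") for header in root.iter(OAI + "header")]


print(ranked_identifiers(SAMPLE_RESPONSE))
```

Because results arrive already ordered by relevance, a client only needs to preserve document order; no re-sorting is required.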
Search query syntax
To perform a search, clients must provide a search query in
the set argument of either a
ListRecords or
ListIdentifiers request. The set argument must
conform to the following syntax: dleseodlsearch/[query string]/[set]/[offset]/[length]
where "dleseodlsearch" is the exact string dleseodlsearch,
"query string" indicates a list of keywords upon which to
search, "set" indicates the set over which to search, "offset"
indicates the offset into the results list to begin the results, and
"length" indicates the total number of results to return.
To search over all sets, clients must supply the string "null"
in the set field. Clients must escape all spaces in queries
with a plus (+) symbol. The default boolean logic is AND. To request
a query using boolean OR, clients must supply the exact string
"OR" between each term in the query string.
Examples:
".../provider?verb=ListRecords&metadataPrefix=oai_dc&set=dleseodlsearch/ocean/null/0/10"
- Performs a text search for the term "ocean" across all sets
in the repository, returning the first ten matching results (numbers 0 through 9).
".../provider?verb=ListRecords&metadataPrefix=oai_dc&set=dleseodlsearch/ocean/null/10/10"
- Performs a text search for the word "ocean" across all sets
in the repository, returning the next ten matching results (numbers 10 through 19).
".../provider?verb=ListRecords&metadataPrefix=oai_dc&set=dleseodlsearch/ocean+weather/dcc/0/10"
- Performs a text search for the terms "ocean" AND "weather"
across the set dcc, returning the first ten matching results (numbers 0 through 9).
".../provider?verb=ListRecords&metadataPrefix=oai_dc&set=dleseodlsearch/ocean+OR+weather/dcc/0/10"
- Performs a text search for the terms "ocean" OR "weather"
across the set dcc, returning the first ten matching results (numbers 0 through 9).
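The syntax rules above can be sketched as a small client-side helper. This is only an illustration: the function names build_search_set and build_search_url are hypothetical, the helper assumes each term is a single word with no characters that need further URL escaping, and the base URL is the DLESE provider address listed elsewhere in this document.

```python
def build_search_set(terms, set_spec=None, offset=0, length=10, use_or=False):
    """Build the value of the 'set' argument for a DLESE ODL search request."""
    words = [term for term in terms if term]
    if not words:
        raise ValueError("at least one search term is required")
    if any(" " in word for word in words):
        raise ValueError("pass each word as a separate term")
    if offset < 0 or length < 1:
        raise ValueError("offset must be >= 0 and length must be >= 1")
    # Spaces are escaped with '+'. The default logic is AND; the exact
    # string OR between terms requests boolean OR.
    joiner = "+OR+" if use_or else "+"
    # The string 'null' in the set field searches across all sets.
    target = set_spec if set_spec else "null"
    return "dleseodlsearch/{}/{}/{}/{}".format(joiner.join(words), target, offset, length)


def build_search_url(base_url, set_argument, verb="ListRecords", metadata_prefix="oai_dc"):
    """Assemble the full request URL for ListRecords or ListIdentifiers."""
    return "{}?verb={}&metadataPrefix={}&set={}".format(
        base_url, verb, metadata_prefix, set_argument)


url = build_search_url(
    "http://www.dlese.org/oai/provider",
    build_search_set(["ocean", "weather"], set_spec="dcc", use_or=True))
print(url)
```

Passing verb="ListIdentifiers" to build_search_url requests only the record headers for the same query.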
Tip: you may use this software to construct and submit sample DLESE ODL search
queries for you. To do so, go to the
DLESE ODL search page and enter a
query string into the search interface. The provider will return
the matching results to your web browser for inspection. The address
(URL) provided to your browser shows the full OAI-PMH and DLESE ODL query
string that was issued to the provider (note, the argument "rt"
is not part of the protocol and should be ignored).
Flow control
All state is embedded in the search query string, giving clients control over response flow.
Clients can "page through" a set of results by issuing the same request repeatedly,
incrementing the offset parameter by the desired page size each time. For example, a client wishing to
iterate through three pages of results for a search on the term ocean, retrieving ten
records per page, would issue the following
three queries:
".../provider?verb=ListRecords&metadataPrefix=oai_dc&set=dleseodlsearch/ocean/null/0/10"
".../provider?verb=ListRecords&metadataPrefix=oai_dc&set=dleseodlsearch/ocean/null/10/10"
".../provider?verb=ListRecords&metadataPrefix=oai_dc&set=dleseodlsearch/ocean/null/20/10"
At the end of each response, the provider includes an empty resumptionToken element that carries the
attributes completeListSize and cursor. The completeListSize attribute shows the total number of records
in the repository that match the given query. The cursor reflects the offset into the result set where the current
response container begins.
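A client might use these two attributes to decide which offsets remain to be requested. The following is a minimal sketch under that assumption; the sample response fragment, its attribute values, and the function names read_paging_info and page_offsets are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response documents.
OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical tail of a search response, trimmed to the paging element.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken completeListSize="25" cursor="10"/>
  </ListRecords>
</OAI-PMH>"""


def read_paging_info(response_xml):
    """Return (completeListSize, cursor) from the response, or None if absent."""
    root = ET.fromstring(response_xml)
    token = next(root.iter(OAI + "resumptionToken"), None)
    if token is None:
        return None
    return int(token.get("completeListSize")), int(token.get("cursor"))


def page_offsets(complete_list_size, page_size):
    """List every offset needed to walk the full result set page by page."""
    return list(range(0, complete_list_size, page_size))


total, cursor = read_paging_info(SAMPLE_RESPONSE)
print(total, cursor)
print(page_offsets(total, 10))
```

Each offset produced this way goes into the offset field of the search query string, with the page size in the length field.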
Ongoing development on the DLESE ODL search service is taking place. Future iterations are proposed to extend the
number of fields available for searching. The three fields that are likely to be added for searching in the coming months
are grade level, resource type and content standards. Please send questions, comments or suggestions to John Weatherley
(jweather@ucar.edu).
Configuring the data provider
Quick start instructions
For instructions on setting up the data provider for its first use,
see Getting started.
How your metadata files should be formatted
Metadata files should contain valid XML that is defined by a schema.
One exception: the DLESE IMS metadata format uses a DTD; however, this
software is able to handle that format properly (v2.1).
Plug-and-play format converters
The software can be configured to convert to and from any arbitrary
metadata format using an XSL stylesheet or a custom Java class that
implements the XMLFormatConverter
Interface. To configure a converter, do the following:
1. Create or obtain an XSL stylesheet or Java class that performs the
desired format conversion. For Java converters, the class must implement
the XMLFormatConverter
Interface.
2. If you have an XSL stylesheet, place it in the directory "xsl_files"
located in the "WEB-INF" directory of the OAI software context.
If you have a Java class converter, place it anywhere within the classpath
of the servlet container.
3. Edit the "web.xml" file located in the "WEB-INF"
directory. To point the software to your converter, add an init parameter
for the OAIProviderServlet. If it is an XSL converter, give the parameter
a name that begins with "xslconverter." For a Java class converter
the name must begin with "javaconverter." Each param-name
must be unique, otherwise it will not be recognized. For the param-value
field, supply a string of the form [convertername]|[from format]|[to format]
where convertername is either the name of an XSL file or the fully qualified
name of a Java class, and the from and to formats are the metadataPrefixes for the
given formats. For example, if you have an xsl stylesheet named myDCConverter.xsl
that converts from ADN to Dublin Core the param-value would be "myDCConverter.xsl|adn|oai_dc"
(quotes omitted). If it were a Java class converter by the full name
org.institution.converter.MyDCConverter, the param-value would be "org.institution.converter.MyDCConverter|adn|oai_dc"
(quotes omitted).
4. After configuring the web.xml file and placing the converter in
the appropriate location, start or restart the software. The software
will automatically recognize the converter and add the new format to
its list of available formats.
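The init parameter described in step 3 can be drafted with a short script before pasting it into web.xml. This is a sketch only: the function names are hypothetical, and the param-name "xslconverter.myDC" is an invented example of a unique name with the required prefix. The init-param, param-name and param-value elements are the standard servlet deployment descriptor elements.

```python
from xml.sax.saxutils import escape


def converter_param_value(converter_name, from_format, to_format):
    """Join the three fields with '|' as the param-value format requires."""
    parts = [converter_name, from_format, to_format]
    for part in parts:
        if not part.strip() or "|" in part:
            raise ValueError("fields must be non-empty and must not contain '|'")
    return "|".join(parts)


def init_param_xml(param_name, param_value):
    """Render an init-param element to place inside the OAIProviderServlet definition."""
    if not param_name.startswith(("xslconverter", "javaconverter")):
        raise ValueError("param-name must begin with xslconverter or javaconverter")
    return (
        "<init-param>\n"
        "  <param-name>{}</param-name>\n"
        "  <param-value>{}</param-value>\n"
        "</init-param>"
    ).format(escape(param_name), escape(param_value))


# 'xslconverter.myDC' is a made-up name; any unique name with the prefix should do.
value = converter_param_value("myDCConverter.xsl", "adn", "oai_dc")
print(init_param_xml("xslconverter.myDC", value))
```

For a Java converter, the same helper would be called with the fully qualified class name and a param-name beginning with javaconverter.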
Tip: The format converter module caches the converted metadata to disc
for increased performance. You may access these converted files as well.
Once an item has been accessed for the first time, its cached file will
appear in the directory "WEB-INF/repository_data/converted_xml_cache."
Using the harvester
The harvester included with this software can be used to harvest from any
OAI data provider that supports protocol versions 1.1 or 2.0, saving the metadata to
files in a directory you choose.
You can access the harvester through the
Harvester administration page. The harvester has two interfaces - one that allows
you to save settings for recurring harvests
and one that can be used for a simple one-time harvest of a given data provider.
To set up a recurring harvest, click the Add new recurring harvest button and
type in the required information in the page that appears. Once set up, a recurring
harvest can be run manually or automatically at regular intervals. To
perform a one-time harvest simply enter the required information in the one-time
harvest interface and click harvest.
To view the progression of a harvest as it is taking place or to view the history
of harvests, use the view status links in the harvester interface. Harvest log reports
are updated continually. To view new log entries, use your browser's refresh button.
The Open Archives Initiative maintains a
list of registered data providers.
Known data providers of interest within the DLESE community (baseURL shown):
- DLESE - http://www.dlese.org/oai/provider
- NSDL - http://services.nsdl.org:8080/nsdloai/OAI
The harvester may also be used programmatically within your Java code. Examples
and documentation are available in the
Harvester Javadoc.
To use the Harvester programmatically, include DLESETools.jar, found in the lib directory of this
distribution, in your classpath.
Enabling access control
To enable password protection for OAI administration in a Tomcat servlet
container, uncomment the 'security-constraint' and 'login-config' elements
found in the oai 'WEB-INF/web.xml' file. Then copy the following context
definition into the <Host> element of the 'server.xml' config file
found in the Tomcat 4.x 'conf' directory (the configuration below assumes
the OAI context and directory name is 'oai'):
<Context path="/oai" docBase="oai" debug="0"
reloadable="true">
<Realm className="org.apache.catalina.realm.MemoryRealm"
pathname="webapps/oai/WEB-INF/users.xml"
/>
</Context>
You will then need to restart Tomcat. The file 'WEB-INF/users.xml' defines
user names and passwords for those granted access to the OAI administration
pages. A default user of 'admin' with password 'admin' has been defined.
You should edit the file and replace this default definition with your
own user/password definitions before making your software public. Note
that this type of authentication does not encrypt user
names and passwords sent over the network; therefore, you may want to access
these pages only from a computer that is on the same local network as the host
machine.
Obtaining this software
This software is available in pre-compiled binary form ready for installation. The
binary distribution is packaged as a Java WAR file, which can be dropped into a
servlet container such as Tomcat for rapid deployment.
The binary distribution is available at the
DLESE SourceForge portal.
Installation instructions are available at:
Linux installation
Windows installation
Mac OS X installation
Instructions are also included with the distribution - after downloading, unzip the archive and see the file
INSTALL.txt.
Building from source
This software can be built directly from source. You may obtain the current source distribution at the
DLESE SourceForge portal (recommended).
Alternatively, the latest development tree is available via anonymous CVS at the
DLESE CVS repository on SourceForge. If obtaining
from CVS, note that you will need
both the oai-project and dlese-tools-project.
Instructions for building this software from source are available inside the source distribution.
After downloading the source package at the
DLESE SourceForge portal,
unzip the archive and see the file BUILD_INSTRUCTIONS.txt.
General information
This software was built using the Lucene search API
and the Struts web application framework.
The full Javadoc for this software is included here.
Send e-mail to support@dlese.org
with questions, comments, or suggestions.