Using the Open Archives Initiative Protocol for Metadata Harvesting

Jonathan Eaton (London Business School, London, UK)

Program: electronic library and information systems

ISSN: 0033-0337

Article publication date: 26 September 2008

Citation

Eaton, J. (2008), "Using the Open Archives Initiative Protocol for Metadata Harvesting", Program: electronic library and information systems, Vol. 42 No. 4, pp. 450-452. https://doi.org/10.1108/00330330810912133

Publisher

Emerald Group Publishing Limited

Copyright © 2008, Emerald Group Publishing Limited


The Open Archives Initiative Protocol for Metadata Harvesting (OAI‐PMH) is a relatively recent initiative, but one that is now widely regarded as an important component of the global, distributed networked information infrastructure. OAI‐PMH enables data providers to “expose” their resources' metadata in a simple‐to‐implement, structured, XML‐based record format to “harvesting” services, which in turn can be used to build new and innovative kinds of aggregated information services. Published by Libraries Unlimited in its Third Millennium Cataloging series, this book provides an authoritative guide to the practicalities and benefits of using OAI‐PMH in the context of the long‐established discipline of descriptive cataloguing. This hybrid approach produces a densely packed yet well‐organised textbook describing how the protocol can support basic cataloguing goals in novel ways, both to enable wider interoperability in digital library systems and to promote wider sharing and discovery of electronic resource metadata. Authors Timothy Cole and Muriel Foulonneau share considerable theoretical and practical expertise in the fields of digital library systems and metadata, and have collaborated at the University of Illinois (Urbana‐Champaign) on a high‐profile metadata harvesting project for leading US research universities' libraries.

The real purpose of this study is to go beyond the mere technical details of the OAI‐PMH protocol to examine what the authors identify as the more interesting (and more complex) implications it poses for both descriptive cataloguing practice and notions of how digital libraries should interoperate. In one key respect, the book's title somewhat understates its goal: the text shows not just how OAI‐PMH can be “used” but effectively what kinds of further work and processing are often needed after metadata are harvested. Harvesting is thus not merely an end in itself, but actually the means of starting and sustaining a number of collaborative processes between data providers (i.e. those data repositories that have exposed their metadata to harvesters) and OAI‐PMH service providers (the search services that provide an aggregated index to that harvested metadata).

Cole and Foulonneau have accordingly structured their book into three parts that describe this trajectory. Each chapter concludes with a lengthy array of URLs under a “Notes” heading, which keeps the main body of the text uncluttered whilst providing many practical illustrations and examples. An extensive list of references to published works follows, and each chapter ends with detailed “Questions and topics for discussion” and “Suggestions for exercises”, which help focus the reader on the practical next steps that they or their colleagues can take to apply this material to their own local situation and practice. The first part sets the context by charting the origins of OAI‐PMH and its relationship to the Open Access movement and institutional repositories. Here the authors take pains to clear up some widespread misconceptions that stem at least partly from the protocol's beginnings as a response to problems that emerged as scholarly self‐archiving e‐print servers became more widespread. The “Open” part of its name does not mean that it is an Open Access application: the protocol supports metadata exchange among multiple kinds of data providers, ranging from institutional repositories to fee‐based suppliers of proprietary content. “Archive” is also potentially misleading in that OAI‐PMH is not prescriptive about archival records; instead, the term comes from the early association with academic e‐print repositories' collections, commonly referred to as “archives” by their creators. Moreover, the protocol is not – as often believed – synonymous with the basic Dublin Core metadata format.

The second section covers the protocol's technical details and how to implement an OAI‐compliant data provider service. This is achieved through a series of examples in which the key verb and argument elements of the protocol are examined both in theory and in practice, with XML response examples printed to illustrate key elements of the actual conversations between OAI service provider and data provider. The third and final section, entitled “Sharable metadata: creating and using”, takes up half the book, reflecting its authors' positioning of descriptive cataloguing at the heart of the successful adoption and practical benefits of the protocol. As they take care to note, for purposes of building digital libraries OAI‐PMH can only be considered as useful as the metadata that content providers are able to share by using it. Three closely detailed chapters cover the Dublin Core metadata formats and also introduce the reader to more specialised and richer metadata formats that have evolved for use with OAI‐PMH in response to the known limitations of (un)qualified Dublin Core, including MARCXML, MODS, VRA, TEI and METS. A further chapter then demonstrates how extra value can be added by the OAI service provider in both normalising (i.e. standardising record structures and values) and augmenting (i.e. adding information to provide an explicit context for metadata harvested from a collection that was originally created on the assumption that its context was implicitly understood).
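To give a flavour of the verb‐and‐argument conversations described above, the following is a minimal sketch in Python and is not taken from the book. It builds a ListRecords request and reads identifiers and titles from a hand‐written response. The base URL, the record identifier and the sample response are hypothetical; only the verb name, the metadataPrefix and from arguments, and the XML namespaces come from the OAI‐PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
OAI_NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_request(base_url, verb, arguments=None):
    """Return a request URL: the base URL plus a verb and its arguments."""
    params = {"verb": verb}
    params.update(arguments or {})
    return base_url + "?" + urlencode(params)


# Hand-written, hypothetical response fragment used instead of a live harvest.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1</identifier>
        <datestamp>2006-12-31</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_records(xml_text):
    """Collect identifier, datestamp and Dublin Core titles for each record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", OAI_NS):
        header = "oai:header/oai:"
        records.append({
            "identifier": record.findtext(header + "identifier", namespaces=OAI_NS),
            "datestamp": record.findtext(header + "datestamp", namespaces=OAI_NS),
            "titles": [t.text for t in record.iterfind(".//dc:title", OAI_NS)],
        })
    return records


if __name__ == "__main__":
    # The base URL is a placeholder, not a real repository.
    print(build_request(
        "https://repository.example.org/oai",
        "ListRecords",
        {"metadataPrefix": "oai_dc", "from": "2006-01-01"},
    ))
    print(extract_records(SAMPLE_RESPONSE))
```

A real harvester would also need to handle error responses and resumption tokens for large result sets, which this sketch omits.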

Simple yet persuasive examples of the need for normalising and augmenting approaches are provided, such as the widespread variance in date formatting. This creates the context for the chapter “Using Aggregated Metadata to Build Digital Library Services”, which appraises both the exciting opportunities and the challenges involved in using OAI‐PMH to create new digital library services in the new Web 2.0 (or “Library 2.0”) era. These can range from the original implementation of the University of Michigan's OAIster service, enabling cross‐institutional repository search, to new subject‐based portals that selectively harvest data based on predefined resource clusters. Federated search is another area where OAI‐PMH could overcome the current limitations of HTML page scraping; beyond that, a more powerful application framework can be envisaged in which indexing of “deep web” content, normally displayed only in response to interactive search, can be integrated with indexing of static web pages. A “Concluding Thoughts” chapter discusses not only the rapid adoption of the protocol between 2001 and the end of 2006, but also some of the practical limitations and lessons learned, including the acceptance that the original belief that OAI‐PMH could be implemented as a largely “automatic” model, requiring little if any human intervention, is now recognised as misplaced. This only reinforces the authors' emphasis on the importance of the manually driven metadata management element.
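The date‐formatting variance cited above as a case for normalisation can be illustrated with a short sketch. It is not the authors' method: the list of input formats is an assumption made for illustration (in particular, that slash‐separated dates are day‐first and that month names are in English), and the function name is hypothetical. Values that match none of the formats are returned as None so that they can be passed to a cataloguer, in keeping with the point about human intervention.

```python
from datetime import datetime

# Assumed input formats, each paired with the ISO 8601 precision to emit.
# Slash-separated dates are assumed to be day-first (UK style).
KNOWN_FORMATS = [
    ("%Y-%m-%d", "%Y-%m-%d"),   # 2008-09-26
    ("%d/%m/%Y", "%Y-%m-%d"),   # 26/09/2008
    ("%d %B %Y", "%Y-%m-%d"),   # 26 September 2008
    ("%B %d, %Y", "%Y-%m-%d"),  # September 26, 2008
    ("%B %Y", "%Y-%m"),         # September 2008
    ("%Y", "%Y"),               # 2008
]


def normalise_date(raw):
    """Return an ISO 8601 date string, or None if no known format matches."""
    text = raw.strip()
    for input_format, output_format in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text, input_format)
        except ValueError:
            continue  # Try the next candidate format.
        return parsed.strftime(output_format)
    return None  # Leave for manual review rather than guess.


if __name__ == "__main__":
    for value in ["26 September 2008", "26/09/2008", "September 2008", "circa 1900"]:
        print(repr(value), "->", normalise_date(value))
```

Keeping the output precision tied to the input format avoids inventing a day or month that the harvested record never stated.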

This is a closely argued, detailed guide to implementing and deriving maximum benefit from the OAI‐PMH protocol, covering the period up to the end of 2006, written by two specialists with both a theoretical and a strong practical grasp of the subject. It can be positively recommended to digital library designers and cataloguers involved in such implementations and in aggregation service projects.
