Project to optimise Web crawling

The Electronic Library

ISSN: 0264-0473

Article publication date: 1 August 2004

Citation

(2004), "Project to optimise Web crawling", The Electronic Library, Vol. 22 No. 4. https://doi.org/10.1108/el.2004.26322dab.005

Publisher: Emerald Group Publishing Limited

Copyright © 2004, Emerald Group Publishing Limited



The Computer Science Department of Old Dominion University and the Research Library of the Los Alamos National Laboratory have announced the launch of the “mod_oai” project. The aim of the project is to create the mod_oai Apache software module that will expose content accessible from Apache Web servers via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). The mod_oai project (www.modoai.org) is funded by the Andrew W. Mellon Foundation.

Apache is an open-source Web server that is used by 63 per cent – approximately 27 million – of the Web sites in the world. The OAI-PMH is a protocol for selectively harvesting metadata from data repositories. The protocol has had a considerable impact in the field of digital libraries, but it has yet to be embraced by the general Web community. The mod_oai project hopes to achieve broader acceptance by making the power and efficiency of the OAI-PMH available to Web servers and Web crawlers. For example, the planned OAI-PMH interface should allow an Apache Web server to respond to requests for all files added or changed since a specified date, or for all files of a specified MIME type.
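As a rough illustration of the kind of selective request described above, the following Python sketch builds an OAI-PMH harvesting URL. The verb `ListRecords` and the arguments `from`, `set` and `metadataPrefix` are defined by the OAI-PMH specification. Everything else is an assumption made for illustration only: the base URL, the function name `build_harvest_url`, and the idea of expressing a MIME type as a set name such as `mime:image:jpeg` are hypothetical and are not taken from the project announcement.

```python
from urllib.parse import urlencode


def build_harvest_url(base_url, from_date=None, mime_set=None,
                      metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL (illustrative helper)."""
    # 'verb' and 'metadataPrefix' are required for a ListRecords request.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # 'from' restricts the harvest to items added or changed since this date.
        params["from"] = from_date
    if mime_set is not None:
        # Assumption: a MIME type is exposed as an OAI-PMH set name.
        params["set"] = mime_set
    return base_url + "?" + urlencode(params)


# Hypothetical server address and set name, for illustration only.
url = build_harvest_url("http://www.example.org/modoai",
                        from_date="2004-06-01",
                        mime_set="mime:image:jpeg")
print(url)
```

A crawler could issue a request of this shape periodically, advancing the `from` date each time, so that only new or changed files are fetched rather than the whole site.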

The Apache Web server defines an extensible module format that allows specific functionality to be incorporated directly into the Web server. The mod_oai project will build such an Apache module, one that is able to respond to OAI-PMH requests pertaining to files made accessible by the Apache server. The mod_oai module will be developed under the GNU General Public License (GPL) and distributed through sourceforge.net upon completion.