Emerald Group Publishing Limited
Copyright © 2001, MCB UP Limited
Digitization of Printed Material: The Metadata Engine Project (METAe)
The METAe project (http://meta-e.uibk.ac.at/) is a highly collaborative research and software development project in which university departments, libraries, archives and software companies from seven European countries and the USA are cooperating to develop application software for the digitization of printed material. Initial prototypes of the software will be available in 2002. The METAe project is co-funded by the European Commission under "Digital heritage and cultural content" (http://www.cordis.lu/ist/ka3/digicult/).
The main objectives of METAe are to:
ease the digitization of books, journals and magazines in terms of cost-effectiveness and degree of automation;
enrich the output of the conversion process in terms of structural metadata capturing; and
enhance the opportunity for successful digital preservation from the very beginning of life-cycle-management by producing highly standardized information objects.
The METAe software is designed to be a comprehensive package in which all tasks within a digitization workflow can be carried out according to the standards currently emerging, such as the Open Archival Information System (OAIS); the NISO working draft for Technical Metadata for Digital Still Images (http://www.niso.org/commitau.html); and the NISO draft standard for Book Item and Component Identifier (http://www.niso.org/pdfs/BICI-DS.pdf). The functionality of the software will include:
image enhancement and pre-processing;
capturing descriptive metadata from electronic library catalogues;
carrying out the OCR-processing;
creating technical and administrative metadata;
extracting structural metadata; and
organizing permanent quality control.
The key technology behind this increase in automation, and behind the richer output of conversion projects, is the introduction of layout and document analysis and capturing techniques. Since the layout and structure of printed material are not arbitrary, but follow strong and often ancient rules, the project partners hope to extract more information from the page images, in a highly automated way, than is usually possible. Page numbers, headlines, footnotes, graphs and caption lines can be extracted. Beyond this, the hierarchical structure of books and journals (for example: periodical; issue; single article; graph within that article) will also be automatically recognized and captured.
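To make the idea of a captured hierarchy concrete, the following is a minimal sketch in Python of how the levels named above (periodical, issue, article, graph) might be held as a tree and written out as XML. It is an illustration only, not the METAe data model: the `Node` class, its fields, the sample titles and the element and attribute names are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Node:
    """One structural unit of a digitized work (hypothetical model)."""
    kind: str                  # e.g. "periodical", "issue", "article", "graph"
    label: str                 # title or caption of the unit
    pages: tuple = ()          # page-image numbers the unit covers
    children: list = field(default_factory=list)

    def add(self, child):
        """Attach a child unit and return it, so calls can be chained."""
        self.children.append(child)
        return child


def to_xml(node):
    """Convert a Node tree into nested XML elements, one element per unit."""
    elem = ET.Element(node.kind, label=node.label)
    if node.pages:
        elem.set("pages", ",".join(str(p) for p in node.pages))
    for child in node.children:
        elem.append(to_xml(child))
    return elem


# Build the hierarchy: periodical > issue > article > graph (sample data).
periodical = Node("periodical", "Example Journal")
issue = periodical.add(Node("issue", "Vol. 1, No. 1"))
article = issue.add(Node("article", "An Example Article", pages=(3, 4, 5)))
article.add(Node("graph", "Figure 1", pages=(4,)))

xml_text = ET.tostring(to_xml(periodical), encoding="unicode")
print(xml_text)
```

Keeping each unit as a node with its own page range means that a page image can be reached from any level of the hierarchy, which is one plausible reason to capture structure alongside the text.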
The METAe software package will also include a specialized Optical Character Recognition (OCR) engine adapted to recognize old typefaces and historical texts. This is an overdue task, especially for the German typeface "Fraktur", a derivative of the Gothic letter that was used in a large majority of printed texts in Central Europe and the Nordic countries until the middle of the twentieth century. Five historical dictionaries representing the historical orthography of the English, French, German, Italian and Spanish languages will support the OCR engine. The software package is completed by an XML/SGML search engine that is intended to perform queries on the full text as well as on the structure of XML documents.
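The difference between a full-text query and a query that also uses document structure can be sketched with Python's standard `xml.etree.ElementTree` module. This is not the METAe search engine; the sample document, its element names and the two helper functions are hypothetical and serve only to illustrate the distinction.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are assumptions.
SAMPLE = """
<periodical title="Example Journal">
  <issue number="1">
    <article title="On Fraktur Typefaces">
      <p>Fraktur was widely used in Central Europe.</p>
      <footnote>See the historical dictionaries.</footnote>
    </article>
    <article title="Layout Analysis">
      <p>Page numbers and headlines follow strong rules.</p>
    </article>
  </issue>
</periodical>
"""


def search_fulltext(root, term):
    """Titles of articles whose text, anywhere inside, contains the term."""
    term = term.lower()
    return [a.get("title") for a in root.iter("article")
            if term in " ".join(a.itertext()).lower()]


def search_structured(root, term, within):
    """Titles of articles where the term occurs inside a `within` element."""
    term = term.lower()
    hits = []
    for article in root.iter("article"):
        for part in article.iter(within):
            if term in "".join(part.itertext()).lower():
                hits.append(article.get("title"))
                break  # one match per article is enough
    return hits


root = ET.fromstring(SAMPLE)
fulltext_hits = search_fulltext(root, "fraktur")
footnote_hits = search_structured(root, "dictionaries", "footnote")
no_hits = search_structured(root, "fraktur", "footnote")
print(fulltext_hits, footnote_hits, no_hits)
```

The first call matches the word wherever it occurs in an article, while the structured calls restrict the match to footnotes, so the same word found only in body text yields no result.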
Simon Tanner works for the Higher Education Digitization Service of the University of Hertfordshire, Hatfield, UK (firstname.lastname@example.org)