Heuristics for identification of bibliographic elements from verso of title pages

A.R.D. Prasad (Associate Professor in the Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, Karnataka, India)
Durga Sankar Rath (Lecturer in the Department of Library and Information Science, Ravindra Bharati University, Kolkata, India)

Library Hi Tech

ISSN: 0737-8831

Publication date: 1 December 2004


This paper presents a methodology to capture bibliographic data from the verso of the title pages of documents. A survey was undertaken to identify the syntactic and semantic features of bibliographic elements on the verso of title pages. These features include font size, line numbers and the appearance of certain strings of characters. Emphasis is given to the study of “cataloguing‐in‐publication” data. The results of the survey are used to develop heuristics that can support a program for automatically identifying the various bibliographic data elements. The versos of the title pages are scanned and stored as HTML pages using optical character recognition (OCR) software. The heuristics are then applied to the HTML pages. A few samples of input and the generated output are presented. Finally, the problems related to OCR and the heuristics are enumerated.
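
The abstract does not spell out the heuristics themselves, so the following is only a minimal sketch of the general idea: scanning OCR'd verso text line by line and using characteristic strings together with line numbers to tag elements. The patterns, the function name `identify_elements`, and the sample lines are all assumptions made for illustration, not the authors' rules, and font-size cues from the HTML are left out entirely.

```python
import re

# Hypothetical string patterns (assumptions, not taken from the paper).
ISBN_RE = re.compile(r"\bISBN(?:-1[03])?:?\s*([0-9][0-9Xx\- ]{8,16}[0-9Xx])")
COPYRIGHT_RE = re.compile(
    r"(?:©|\(c\)|Copyright)\D{0,20}((?:19|20)\d{2})", re.IGNORECASE
)
CIP_RE = re.compile(r"catalog(?:u)?ing[\s\-‐]in[\s\-‐]publication", re.IGNORECASE)


def identify_elements(lines):
    """Return the first match of each element with its 1-based line number."""
    found = {}
    for number, text in enumerate(lines, start=1):
        if "isbn" not in found:
            match = ISBN_RE.search(text)
            if match:
                found["isbn"] = (number, match.group(1).strip())
        if "copyright_year" not in found:
            match = COPYRIGHT_RE.search(text)
            if match:
                found["copyright_year"] = (number, match.group(1))
        if "cip_marker" not in found and CIP_RE.search(text):
            found["cip_marker"] = (number, text.strip())
    return found


# Illustrative input only; not a sample from the paper.
sample = [
    "Copyright © 2004 Example Press",
    "Library of Congress Cataloging-in-Publication Data",
    "ISBN 0-13-110362-8",
]
result = identify_elements(sample)
```

Recording the line number alongside each match reflects the abstract's point that position on the page is itself a feature; a fuller implementation would presumably also need to tolerate OCR misreadings of these marker strings.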



Prasad, A. and Sankar Rath, D. (2004), "Heuristics for identification of bibliographic elements from verso of title pages", Library Hi Tech, Vol. 22 No. 4, pp. 397-403. https://doi.org/10.1108/07378830410570502




Emerald Group Publishing Limited

Copyright © 2004, Emerald Group Publishing Limited
