
Heuristics for identification of bibliographic elements from verso of title pages

A.R.D. Prasad (Associate Professor in the Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, Karnataka, India)
Durga Sankar Rath (Lecturer in the Department of Library and Information Science, Ravindra Bharati University, Kolkata, India)

Library Hi Tech

ISSN: 0737-8831

Article publication date: 1 December 2004



This paper presents a methodology for capturing bibliographic data from the verso of the title pages of documents. A survey was undertaken to identify the syntactic and semantic features of bibliographic elements on the verso of title pages. These features include font size, line numbers and the appearance of certain strings of characters. Emphasis is given to the study of “cataloguing‐in‐publication” data. The results of the survey are used to develop heuristics that can support a program for automatically identifying the various bibliographic data elements. The versos of the title pages are scanned and stored as HTML pages using optical character recognition (OCR) software. The heuristics are then applied to the HTML pages. A few samples of input and the generated output are presented. Finally, the problems related to OCR and the heuristics are enumerated.
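As a rough illustration of what a string-based heuristic of this kind might look like, the sketch below scans the text lines of a verso page for a few character-string cues and records the line number on which each cue appears. It is not the authors' program: the abstract does not state the actual rules, so the patterns, the function name `identify_elements` and the sample lines are all assumptions made for illustration, and font-size cues are left out entirely.

```python
import re

# Hypothetical cue patterns. The paper's actual heuristics are not given
# in the abstract, so these are illustrative assumptions only.
ISBN_RE = re.compile(r"\bISBN(?:-1[03])?:?\s*([0-9][0-9Xx\- ]{8,16}[0-9Xx])")
COPYRIGHT_RE = re.compile(
    r"(?:©|\(c\)|copyright)\D{0,40}?((?:1[5-9]|20)\d{2})", re.IGNORECASE
)
CIP_RE = re.compile(r"catalogu?ing[\s\-‐]*in[\s\-‐]*publication", re.IGNORECASE)


def identify_elements(lines):
    """Return {element: (line_number, value)} for the first match of each cue."""
    found = {}
    for number, line in enumerate(lines, start=1):
        text = line.strip()

        match = ISBN_RE.search(text)
        if match and "isbn" not in found:
            # Normalise by dropping hyphens and spaces inside the number.
            digits = match.group(1).replace("-", "").replace(" ", "")
            found["isbn"] = (number, digits)

        match = COPYRIGHT_RE.search(text)
        if match and "copyright_year" not in found:
            found["copyright_year"] = (number, match.group(1))

        if CIP_RE.search(text) and "cip_heading" not in found:
            # Only the heading is located here; parsing the CIP block
            # that follows it would need further rules.
            found["cip_heading"] = (number, text)
    return found


# Made-up sample lines standing in for OCR output of a verso page.
sample = [
    "Copyright © 2004 Example Press",
    "Library of Congress Cataloging-in-Publication Data",
    "ISBN 0-13-110362-8",
]
result = identify_elements(sample)
```

Recording the line number alongside each value reflects the abstract's remark that line position is one of the surveyed features. Real OCR output would also need tolerance for misrecognised characters, which this sketch does not attempt.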



Prasad, A.R.D. and Sankar Rath, D. (2004), "Heuristics for identification of bibliographic elements from verso of title pages", Library Hi Tech, Vol. 22 No. 4, pp. 397-403.



Emerald Group Publishing Limited

Copyright © 2004, Emerald Group Publishing Limited
