This paper presents a methodology for capturing bibliographic data from the verso of the title pages of documents. A survey was undertaken to identify the syntactic and semantic features of bibliographic elements on the verso of title pages. These features include the font size, line numbers and the appearance of certain strings of characters. Emphasis is given to the study of “cataloguing‐in‐publication” data. The results of the survey are used to develop heuristics that can support a program for automatically identifying the various bibliographic data elements. The versos of the title pages are scanned and stored as HTML pages using optical character recognition (OCR) software. The heuristics are then applied to the HTML pages. A few samples of input and the generated output are presented. Finally, the problems related to OCR and the heuristics are enumerated.
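The abstract does not list the heuristics themselves, so the following Python sketch is only an illustration of the general idea of matching string cues against numbered lines of OCR output. The function name `identify_elements`, the three regular expressions and the sample lines are all assumptions made for this example, not the authors' rules, and font size is left out because it would require parsing the HTML markup.

```python
import re

# Hypothetical cue patterns; the paper's actual heuristics are not given in the abstract.
ISBN_RE = re.compile(r"\bISBN(?:-1[03])?:?\s*([0-9][0-9Xx\- ]{8,16}[0-9Xx])")
COPYRIGHT_RE = re.compile(r"(?:©|\(c\)|copyright)\s*(?:©\s*)?(\d{4})", re.IGNORECASE)
CIP_RE = re.compile(r"catalogu?ing[\s\-‐]+in[\s\-‐]+publication", re.IGNORECASE)


def identify_elements(lines):
    """Return (line_number, element_type, value) tuples for recognised cues.

    `lines` is assumed to be the text of the verso page, one string per
    line, already extracted from the OCR-generated HTML.
    """
    found = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        match = ISBN_RE.search(text)
        if match:
            # Normalise the ISBN by dropping spaces and hyphens.
            isbn = match.group(1).replace(" ", "").replace("-", "")
            found.append((number, "isbn", isbn))
        match = COPYRIGHT_RE.search(text)
        if match:
            found.append((number, "copyright_year", match.group(1)))
        if CIP_RE.search(text):
            # The CIP header marks where the cataloguing block starts.
            found.append((number, "cip_header", text))
    return found


# Made-up sample lines, for illustration only.
sample = [
    "Copyright © 2004 Example Press",
    "Library of Congress Cataloging-in-Publication Data",
    "ISBN 0-306-40615-2",
]

for item in identify_elements(sample):
    print(item)
```

Keeping the line number with each match means a later step could use position on the page as a further cue, for example by treating the lines that follow a CIP header as candidates for the cataloguing record.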
Prasad, A.R.D. and Sankar Rath, D. (2004), "Heuristics for identification of bibliographic elements from verso of title pages", Library Hi Tech, Vol. 22 No. 4, pp. 397-403. https://doi.org/10.1108/07378830410570502
Emerald Group Publishing Limited
Copyright © 2004, Emerald Group Publishing Limited