Search results 1 – 2 of 2
This paper presents a methodology to capture bibliographic data from the verso of the title pages of documents. A survey was undertaken to identify the syntactic and semantic features of bibliographic elements on the verso of title pages. These features include the font size, line numbers and the appearance of certain strings of characters. Emphasis is given to the study of “cataloguing‐in‐publication” data. The results of the survey are used to develop heuristics that support a program for automatically identifying the various bibliographic data elements. The versos of the title pages are scanned and stored as HTML pages using optical character recognition (OCR) software. The heuristics are then applied to the HTML pages. A few samples of the input and the generated output are presented. Finally, the problems related to OCR and the heuristics are enumerated.
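The abstract does not list the heuristics themselves, so the following is only a minimal sketch of what string-based heuristics over the OCR text of a verso page might look like. The patterns, the function name `classify_verso_lines` and the chosen elements (ISBN, copyright year, start of the cataloguing-in-publication block) are assumptions for illustration, not the paper's rules.

```python
import re

# Hypothetical patterns; the paper's actual heuristics are not given in the abstract.
ISBN_RE = re.compile(r"\bISBN(?:-1[03])?:?\s*([0-9][0-9Xx\- ]{8,16}[0-9Xx])")
COPYRIGHT_RE = re.compile(r"(?:©|\(c\)|Copyright)\s*(\d{4})", re.IGNORECASE)
CIP_RE = re.compile(r"catalogu?ing[\s\-‐]in[\s\-‐]publication", re.IGNORECASE)


def classify_verso_lines(lines):
    """Return the first line number and value found for each element.

    `lines` is the OCR text of the verso, one string per line.
    Line numbers are 1-based, since line position is one of the
    features the survey considers.
    """
    found = {}
    for number, line in enumerate(lines, start=1):
        match = ISBN_RE.search(line)
        if match and "isbn" not in found:
            found["isbn"] = (number, match.group(1).strip())
        match = COPYRIGHT_RE.search(line)
        if match and "copyright_year" not in found:
            found["copyright_year"] = (number, match.group(1))
        if CIP_RE.search(line) and "cip_start" not in found:
            found["cip_start"] = (number, line.strip())
    return found


if __name__ == "__main__":
    sample = [
        "Copyright 1998 Example Press",
        "Cataloguing-in-Publication Data",
        "ISBN 0-13-110362-8",
    ]
    print(classify_verso_lines(sample))
```

Real OCR output would likely need more tolerant patterns, for example to cope with misread characters in "ISBN", which is one of the problem areas the abstract mentions.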
This paper presents a methodology for automatic identification of bibliographic data elements from the title pages of books. It also enumerates the steps involved: scanning the title pages, running optical character recognition (OCR) software, generating HTML files from the title pages and applying heuristics to identify the bibliographic data elements. Much of the paper deals with the surveys undertaken to analyze the characteristics of bibliographic descriptive elements such as title, author and publisher. The first survey deals with the sequence of the bibliographic data on the title pages. The second survey deals with the font size, font type and proximity of each bibliographic element on the title pages. The survey results are then used to develop heuristics for a rule‐based expert system that can identify the bibliographic elements on the title pages. The results of the system are presented, along with the problems encountered.
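As an illustration of how font size and sequence could drive such rules, here is a minimal sketch. It assumes the OCR software emits `<font size="N">` tags in its HTML, and it uses two made-up rules (the largest text run is the title, the run that follows it is the author). Neither the markup nor the rules are taken from the paper, and `FontRunParser` and `guess_title_and_author` are hypothetical names.

```python
from html.parser import HTMLParser


class FontRunParser(HTMLParser):
    """Collect (font_size, text) runs from OCR-generated HTML."""

    def __init__(self):
        super().__init__()
        self.size_stack = [3]  # 3 is the default HTML font size
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            size = dict(attrs).get("size")
            # Inherit the enclosing size when the attribute is missing or relative.
            current = self.size_stack[-1]
            self.size_stack.append(int(size) if size and size.isdigit() else current)

    def handle_endtag(self, tag):
        if tag == "font" and len(self.size_stack) > 1:
            self.size_stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.runs.append((self.size_stack[-1], text))


def guess_title_and_author(html):
    """Assumed rules: largest run is the title, the next run is the author."""
    parser = FontRunParser()
    parser.feed(html)
    parser.close()
    runs = parser.runs
    if not runs:
        return None, None
    # max() returns the earliest run when several share the largest size.
    title_index = max(range(len(runs)), key=lambda i: runs[i][0])
    following = runs[title_index + 1:]
    author = following[0][1] if following else None
    return runs[title_index][1], author


if __name__ == "__main__":
    page = (
        '<p><font size="6">Digital Libraries</font></p>'
        '<p><font size="4">A. Author</font></p>'
        '<p><font size="2">Example Press</font></p>'
    )
    print(guess_title_and_author(page))
```

A rule-based system of the kind described would combine many such rules, weighted by the survey findings on sequence, font type and proximity, rather than relying on a single size comparison.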