Development and customization of in-house developed OCR and its evaluation

Rajeswari S. (Homi Bhabha National Institute, Indira Gandhi Center for Atomic Research, Kalpakkam, India)
Sai Baba Magapu (Department of Natural Science and Engineering, National Institute of Advanced Studies, Karnataka, India)

The Electronic Library

ISSN: 0264-0473

Article publication date: 16 October 2018

Issue publication date: 5 November 2018

Abstract

Purpose

The purpose of this paper is to develop a text extraction tool for scanned documents that extracts the text and builds a keyword corpus and a key-phrase corpus for each document without manual intervention.

Design/methodology/approach

For text extraction from scanned documents, a Web-based optical character recognition (OCR) tool was developed. Because OCR is a well-established technology, the tool was built using Microsoft Office Document Imaging tools. Because skew is commonly introduced during scanning, a method to detect and correct it was developed and integrated with the tool. The OCR tool was customized to build a keyword corpus and a key-phrase corpus for every document.
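
The abstract does not say how skew is detected or corrected, so the following is only an illustrative sketch of one common approach, the projection-profile method, and not the authors' method. It assumes the page is already binarized and given as a list of (x, y) ink-pixel coordinates, with a positive angle meaning that text lines drift downward to the right. The function names, the 10-degree search range and the 0.5-degree step are all hypothetical choices.

```python
import math


def row_profile_score(pixels, angle_deg):
    """Sharpness of the row histogram after undoing a candidate skew angle.

    Each ink pixel is sheared back by the candidate angle. When the angle
    matches the true skew, pixels of one text line fall into the same row,
    so the sum of squared row counts is largest.
    """
    t = math.tan(math.radians(angle_deg))
    counts = {}
    for x, y in pixels:
        row = round(y - x * t)
        counts[row] = counts.get(row, 0) + 1
    return sum(c * c for c in counts.values())


def estimate_skew(pixels, max_angle=10.0, step=0.5):
    """Return the candidate angle in degrees with the sharpest row profile."""
    n = int(round(max_angle / step))
    candidates = [i * step for i in range(-n, n + 1)]
    return max(candidates, key=lambda a: row_profile_score(pixels, a))


def deskew(pixels, angle_deg):
    """Correct the skew with a vertical shear (a small-angle approximation)."""
    t = math.tan(math.radians(angle_deg))
    return [(x, round(y - x * t)) for x, y in pixels]


def synthetic_page(angle_deg, width=200, lines=5, spacing=20):
    """Hypothetical test page: thin horizontal text lines skewed by angle_deg."""
    t = math.tan(math.radians(angle_deg))
    return [(x, round(k * spacing + x * t))
            for k in range(1, lines + 1)
            for x in range(width)]


page = synthetic_page(3.0)
angle = estimate_skew(page)
corrected = deskew(page, angle)
```

A production tool would more likely rotate the image rather than shear it and would search the angle range in a coarse-to-fine manner; the shear is used here only to keep the sketch short.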

Findings

The developed tool was evaluated using a 100-document corpus to test the various properties of OCR. The tool had a word read accuracy above 99 per cent for text-only image documents. The customization of the OCR was tested with samples of microfiches, journal pages from back volumes and newspaper clips, and the results are discussed in the summary. The tool was found to be useful for text extraction and processing.
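
The abstract does not define "word read accuracy". A minimal sketch of one plausible definition, assumed here for illustration only, is the percentage of ground-truth words that also appear, in order, in the OCR output. The function name and the alignment via Python's `difflib.SequenceMatcher` are assumptions, not the paper's evaluation procedure.

```python
from difflib import SequenceMatcher


def word_read_accuracy(reference, ocr_output):
    """Percentage of reference words matched, in order, by the OCR output.

    Assumed definition: words are whitespace-separated tokens, compared
    exactly (case and punctuation included).
    """
    ref_words = reference.split()
    out_words = ocr_output.split()
    if not ref_words:
        raise ValueError("reference text contains no words")
    matcher = SequenceMatcher(a=ref_words, b=out_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(ref_words)


# One misread word ("brovvn") out of five gives 80 per cent.
score = word_read_accuracy("the quick brown fox jumps",
                           "the quick brovvn fox jumps")
```

Under this definition, a score above 99 per cent means fewer than one word in a hundred of the ground truth was misread or dropped.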

Social implications

The scanned documents are converted into keyword and key-phrase corpora. The tool could be used to build metadata for scanned documents without manual intervention.

Originality/value

The tool converts unstructured data (image documents) into structured data (a database of keywords and key phrases for each document). In addition, the image document is converted into an editable and searchable document.

Acknowledgements

The authors thank Prof G. Sivakumar, IITB, for his suggestions and useful discussions.

Citation

S., R. and Magapu, S.B. (2018), "Development and customization of in-house developed OCR and its evaluation", The Electronic Library, Vol. 36 No. 5, pp. 766-781. https://doi.org/10.1108/EL-01-2018-0011

Publisher

Emerald Publishing Limited

Copyright © 2018, Emerald Publishing Limited
