Search results

11 – 20 of over 1000
Article
Publication date: 1 April 1993

Niël van der Merwe

Abstract

This paper will discuss the integration of document image processing and text retrieval principles in order to process and load existing paper documents automatically into an electronic document database that broadens the user's capability to retrieve relevant information more accurately, without going through costly processes to get paper documents into electronic text. The principles of document image processing systems, as well as the problems and shortcomings of most of today's document image processing systems, will be discussed. Then concept retrieval, as the latest development in text retrieval, will be discussed, with specific reference to the ability of the TOPIC intelligent text retrieval system to allow users to build up a knowledge base of search objects or concepts that can be used at any point in time by all users of the system. This paper will further look specifically at the automatic processing of paper documents by converting the scanned document image pages to electronic text. It will discuss the use of optical character recognition technology, the indexing and loading of the documents in a text database, the automatic linking of the documents to the related document images, and the retrieval technology available in TOPIC, specifically the TYPO operator, which was developed to handle so‐called dirty data such as common misspellings, character transpositions and ‘dirty’ text received as output from the OCR process. A possible solution for loading paper documents quickly and cost‐effectively into an electronic document database will be discussed and demonstrated in detail. The advantages and disadvantages of this approach will be discussed with specific reference to an electronic news clipping service application.
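The abstract does not say how the TYPO operator works internally. As a rough illustration of the general idea of retrieving terms despite misspellings, character transpositions and OCR errors, the following Python sketch matches a query against indexed words using the optimal string alignment distance (a restricted Damerau-Levenshtein distance). The function names `osa_distance` and `typo_match` and the `max_errors` threshold are assumptions made for this example and are not part of TOPIC.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Insertions, deletions, substitutions and transpositions of two
    adjacent characters each cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "ie" read as "ei".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def typo_match(query: str, words: list[str], max_errors: int = 1) -> list[str]:
    """Return the words within max_errors edits of query (hypothetical helper)."""
    q = query.lower()
    return [w for w in words if osa_distance(q, w.lower()) <= max_errors]


if __name__ == "__main__":
    indexed = ["retreival", "retrieva1", "removal"]
    print(typo_match("retrieval", indexed))
```

A production system would likely avoid scanning every indexed word, for example by pre-filtering candidates with an n-gram index; the linear scan here only keeps the sketch short.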

Details

The Electronic Library, vol. 11 no. 4/5
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 1 September 2003

Details

Sensor Review, vol. 23 no. 3
Type: Research Article
ISSN: 0260-2288

Article
Publication date: 31 August 2012

Tobias Blanke, Michael Bryant and Mark Hedges

Abstract

Purpose

This paper aims to present an evaluation of open source OCR for supporting research on material in small‐ to medium‐scale historical archives.

Design/methodology/approach

The approach was to develop a workflow engine to support the easy customisation of the OCR process towards the historical materials using open source technologies. Commercial OCR often fails to deliver sufficient results here, as its processing is optimised towards large‐scale commercially relevant collections. The approach presented here allows users to combine the most effective parts of different OCR tools.

Findings

The authors demonstrate their application and its flexibility and present two case studies, which show how OCR can be embedded into wider digitally enabled historical research. The first case study produces high‐quality research‐oriented digitisation outputs, utilising services that the authors developed to allow for the direct linkage of digitisation image and OCR output. The second case study demonstrates what becomes possible if OCR can be customised directly within a larger research infrastructure for history. In such a scenario, further semantics can be added easily to the workflow, enhancing the research browse experience significantly.

Originality/value

There has been little work on the use of open source OCR technologies for historical research. This paper demonstrates that the authors' workflow approach allows users to combine commercial engines' ability to read a wider range of character sets with the flexibility of open source tools in terms of customisable pre‐processing and layout analysis. All this can be done without the need to develop dedicated code.

Details

Journal of Documentation, vol. 68 no. 5
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 1 January 1993

John Mackrory

Abstract

Optical character recognition (OCR) is a vital tool for the food and pharmaceutical industries, allowing them to inspect for correct labelling and thereby conform to good manufacturing practices (GMP).

Details

Sensor Review, vol. 13 no. 1
Type: Research Article
ISSN: 0260-2288

Article
Publication date: 1 April 1977

John Ross and Bruce Royan

Abstract

The difficulties of retrospective library catalogue conversion are described. Computer Input Microfilm (CIM) is compared with more conventional Optical Character Recognition (OCR) techniques. After giving examples of known uses of the technique, and listing studies of it in a library context, the paper outlines the scope for such a system within a very large library such as the British Library.

Details

Program, vol. 11 no. 4
Type: Research Article
ISSN: 0033-0337

Article
Publication date: 16 October 2018

Rajeswari S. and Sai Baba Magapu

Abstract

Purpose

The purpose of this paper is to develop a text extraction tool for scanned documents that would extract text and build the keywords corpus and key phrases corpus for the document without manual intervention.

Design/methodology/approach

For text extraction from scanned documents, a Web-based optical character recognition (OCR) tool was developed. Because OCR is a well-established technology, the tool was built using Microsoft Office document imaging tools. Because scanning commonly introduces skew, a method to detect and correct skew in the scanned documents was developed and integrated with the tool. The OCR tool was customized to build a keywords corpus and a key phrases corpus for every document.

Findings

The developed tool was evaluated using a 100-document corpus to test the various properties of OCR. The tool had above 99 per cent word read accuracy for text-only image documents. The customization of the OCR was tested with samples of microfiches, samples of journal pages from back volumes and samples of newspaper clips, and the results are discussed in the summary. The tool was found to be useful for text extraction and processing.
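The abstract does not define how word read accuracy was computed. One common way to express such a figure is the fraction of ground-truth words that also appear, in order, in the OCR output. The Python sketch below illustrates that reading under this assumption, using the standard library's `difflib.SequenceMatcher` on word lists; the function name `word_accuracy` is hypothetical and the metric may differ from the one used in the paper.

```python
from difflib import SequenceMatcher


def word_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of reference words recovered, in order, by the OCR output.

    Assumed definition for illustration only: words are split on
    whitespace and compared exactly (case and punctuation included).
    """
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        # With no reference words, treat an empty OCR output as perfect.
        return 1.0 if not ocr_words else 0.0
    matcher = SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    # Sum the sizes of all in-order matching runs of words.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


if __name__ == "__main__":
    truth = "the quick brown fox"
    scanned = "the quick brovvn fox"  # "brown" misread by the OCR step
    print(f"{word_accuracy(truth, scanned):.2%}")
```

Averaging this value over every document in a test corpus would give a single corpus-level percentage comparable in form to the figure quoted above.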

Social implications

The scanned documents are converted to a keywords corpus and a key phrases corpus. The tool could be used to build metadata for scanned documents without manual intervention.

Originality/value

The tool is used to convert unstructured data (in the form of image documents) to structured data (the document is converted into a keywords and key phrases database). In addition, the image document is converted to an editable and searchable document.

Details

The Electronic Library, vol. 36 no. 5
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 1 January 1993

David Nangle

Abstract

For many years the philosophy of machines reading data from paper into computers, known as Optical Character Recognition or OCR, has been regarded by the unenlightened as a dream which will be wonderful when it happens. OCR has, unfortunately, been an insufficiently publicized reality for 25 years.

Details

Sensor Review, vol. 13 no. 1
Type: Research Article
ISSN: 0260-2288

Article
Publication date: 1 January 1993

David J.M. Elliott

Abstract

The Optical Character Recognition (OCR) system described in this article was developed for an application in a major high street bank requiring the high speed reading of cheque serial numbers.

Details

Sensor Review, vol. 13 no. 1
Type: Research Article
ISSN: 0260-2288

Article
Publication date: 30 October 2009

Hildelies Balk and Lieke Ploeger

Abstract

Purpose

The purpose of this paper is to address the most urgent challenges that libraries face in the mass digitization of historical printed text: the unsatisfactory result of the conversion of scanned images to full featured electronic text by means of automated optical character recognition (OCR); the historical language barrier around 1850, caused by the inadequacy of most existing lexica of historical language for OCR or post‐correction; and a lack of institutional knowledge and expertise in libraries, museums and archives.

Design/methodology/approach

In the EC‐funded project IMPACT (Improving Access to Text), seven libraries, six research institutes and two private sector companies across Europe work together to address the challenges by developing OCR software and technologies that significantly exceed the accuracy of current state‐of‐the‐art software. The IMPACT solutions focus on the entire process of recognition after the document leaves the scanner: image processing, OCR processing (including use of dictionaries), OCR correction and document formatting. IMPACT will also build capacity in mass digitization by sharing best practice and expertise with the cultural heritage communities in Europe.

Findings

Technical results will include toolkits for image enhancement and segmentation, an adaptive OCR engine and several prototypes of experimental OCR engines, computational lexica and several post‐correction modules including a web based collaborative correction system and a parser for structural metadata. Strategic tools include several decision support tools, guidelines, a web site with demonstrator platform, a training programme and ultimately, a sustainable Centre of Competence for mass digitization in Europe.

Originality/value

The IMPACT solutions will, for the first time, allow large amounts of digitized historical text to be transformed into electronic text with a minimum of manual intervention and significantly improved accessibility for the user.

Details

OCLC Systems & Services: International digital library perspectives, vol. 25 no. 4
Type: Research Article
ISSN: 1065-075X

Article
Publication date: 1 January 1992

Abstract

Allen‐Bradley recently announced a new addition to the company's CVIM vision input module offering. The Allen‐Bradley optical character recognition package, OCR‐PAK, allows the CVIM module to read character strings within an image for product identification.

Details

Sensor Review, vol. 12 no. 1
Type: Research Article
ISSN: 0260-2288