Search results
1 – 10 of 503
Mansoor Alghamdi and William Teahan
Abstract
Purpose
The aim of this paper is to experimentally evaluate the effectiveness of the state-of-the-art printed Arabic text recognition systems to determine open areas for future improvements. In addition, this paper proposes a standard protocol with a set of metrics for measuring the effectiveness of Arabic optical character recognition (OCR) systems to assist researchers in comparing different Arabic OCR approaches.
Design/methodology/approach
This paper describes an experiment to automatically evaluate four well-known Arabic OCR systems using a set of performance metrics. The evaluation experiment is conducted on a publicly available printed Arabic dataset comprising 240 text images with a variety of resolution levels, font types, font styles and font sizes.
Findings
The experimental results show that the field of character recognition for printed Arabic still requires further research to reach an efficient text recognition method for Arabic script.
Originality/value
To the best of the authors’ knowledge, this is the first work that provides a comprehensive automated evaluation of Arabic OCR systems with respect to the characteristics of Arabic script and, in addition, proposes an evaluation methodology that can be used as a benchmark by researchers and therefore will contribute significantly to the enhancement of the field of Arabic script recognition.
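The abstract above does not name the metrics in its proposed protocol. As an illustration only, the sketch below computes character error rate (CER), a metric commonly used to score OCR output against a ground-truth transcription. The function names are hypothetical, and treating CER as one of the paper's metrics is an assumption, not something the abstract states.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between ground truth and OCR output, divided by the
    length of the ground truth (hypothetical helper, not from the paper)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def character_accuracy(reference: str, hypothesis: str) -> float:
    """One minus CER; can drop below zero when the OCR output is much
    longer than the reference."""
    return 1.0 - character_error_rate(reference, hypothesis)
```

Because the comparison works on Unicode code points, the same functions apply to Arabic text, although a real protocol would also have to settle normalisation questions such as how to treat diacritics and ligatures.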
Abstract
Purpose
The purpose of this paper is to report results of a formative usability study that investigated first-year student use of an optical character recognition (OCR) mobile application (app) designed to help students find resources for course assignments. The app uses textual content from the assignment sheet to suggest relevant library resources of which students may not be aware.
Design/methodology/approach
Formative evaluation data are collected to inform the production-level version of the mobile application and to understand student use models and requirements for OCR software in mobile applications.
Findings
Mobile OCR apps are helpful for undergraduate students searching known titles of books, general subject areas or searching for help guide content developed by the library. The results section details how student feedback shaped the next iteration of the app for integration as a Minrva module.
Research limitations/implications
This usability paper is not a large-scale quantitative study, but seeks to provide deep qualitative research data for the specific mobile interface studied, the Text-shot prototype.
Practical implications
The OCR application is designed to help students learn about availability of library resources based on scanning (e.g. taking a picture, or “Text-shot”) of an assignment sheet, a course syllabus or other course-related handouts.
Originality/value
This study contributes a new area of application development for libraries, with research methods that are useful for other mobile development studies.
Hrvoje Stančić and Željko Trbušić
Abstract
Purpose
The authors investigate optical character recognition (OCR) technology and discuss its implementation in the context of digitisation of archival materials.
Design/methodology/approach
The typewritten transcripts of the Croatian Writers' Society from the mid-1960s are used as the test data. The optimal digitisation setup is investigated in order to obtain the best OCR results, using a sample of 123 pages digitised at different resolution settings and binarisation levels.
Findings
A series of tests showed that different settings produce significantly different results. The best OCR accuracy achieved on the test sample of typewritten documents was 95.02%. The results show that resolution is significantly more important than the binarisation pre-processing procedure for achieving better OCR results.
Originality/value
Based on the research results, the authors give recommendations for achieving optimal digitisation process setup with the aim of increasing the quality of OCR results. Finally, the authors put the research results in the context of digitisation of cultural heritage in general and discuss further investigation possibilities.
Tobias Blanke, Michael Bryant and Mark Hedges
Abstract
Purpose
This paper aims to present an evaluation of open source OCR for supporting research on material in small‐ to medium‐scale historical archives.
Design/methodology/approach
The approach was to develop a workflow engine to support the easy customisation of the OCR process towards the historical materials using open source technologies. Commercial OCR often fails to deliver sufficient results here, as its processing is optimised towards large‐scale commercially relevant collections. The approach presented here allows users to combine the most effective parts of different OCR tools.
Findings
The authors demonstrate their application and its flexibility and present two case studies, which demonstrate how OCR can be embedded into wider digitally enabled historical research. The first case study produces high‐quality research‐oriented digitisation outputs, utilizing services that the authors developed to allow for the direct linkage of digitisation image and OCR output. The second case study demonstrates what becomes possible if OCR can be customised directly within a larger research infrastructure for history. In such a scenario, further semantics can be added easily to the workflow, enhancing the research browse experience significantly.
Originality/value
There has been little work on the use of open source OCR technologies for historical research. This paper demonstrates that the authors' workflow approach allows users to combine commercial engines' ability to read a wider range of character sets with the flexibility of open source tools in terms of customisable pre‐processing and layout analysis. All this can be done without the need to develop dedicated code.
Kimmo Kettunen, Heikki Keskustalo, Sanna Kumpulainen, Tuula Pääkkönen and Juha Rautiainen
Abstract
Purpose
This study aims to identify user perception of different qualities of optical character recognition (OCR) in texts. The purpose of this paper is to study the effect of different quality OCR on users' subjective perception through an interactive information retrieval task with a collection of one digitized historical Finnish newspaper.
Design/methodology/approach
This study is based on the simulated work task model used in interactive information retrieval. Thirty-two users searched an article collection of the Finnish newspaper Uusi Suometar 1869–1918, which consists of ca. 1.45 million autosegmented articles. The article search database had two versions of each article with different quality OCR. Each user performed six pre-formulated and six self-formulated short queries and subjectively evaluated the top 10 results using a graded relevance scale of 0–3. Users were not informed about the OCR quality differences of the otherwise identical articles.
Findings
The main result of the study is that improved OCR quality affects subjective user perception of historical newspaper articles positively: higher relevance scores are given to better-quality texts.
Originality/value
To the best of the authors’ knowledge, this simulated interactive work task experiment is the first one showing empirically that users' subjective relevance assessments are affected by a change in the quality of an optically read text.
Abstract
The idea of optical character recognition (OCR), in other words the “reading” of documents by other than human means, arose as a practical proposition during the Second World War. Wartime experience of using computers in the United States had revealed the contrast in speed between the transcription of documents to be processed (at that time the punching of cards or tape by operatives working from original documents) and the central processing within the computer itself. Visual output was also slower than central processing but was much speeded up by the introduction of line printers and later of xerography. This “paired” case study, part of a project sponsored by the Science Research Council to examine patterns of success and failure in industrial innovation, is confined to two attempts to innovate in the field of OCR. There were others: one or two were contemporary, but most followed later, have a much more recent history and may be thought to have overtaken, in terms of market penetration, the innovation designated here as a commercial success. The point of this study when it was undertaken was to extract data about the two innovations that would be suitable for general analysis by a computer programme designed to search out significant groups of explanatory factors, so that the characteristics associated with innovative success might be recognised as typical within an industry, or perhaps generally. This study belongs to one of two groups, the instrument industry, the other group investigated being chemical manufacturing.
Zainab Akhtar, Jong Weon Lee, Muhammad Attique Khan, Muhammad Sharif, Sajid Ali Khan and Naveed Riaz
Abstract
Purpose
In artificial intelligence, optical character recognition (OCR) is an active research area driven by well-known applications such as automation and the transformation of printed documents into machine-readable text. The major purpose of OCR in academia and banks is to achieve significant performance in order to save storage space.
Design/methodology/approach
A novel technique is proposed for automated OCR based on multi-properties features fusion and selection. The features are fused using a serial formulation, and the output is passed to a partial least squares (PLS) based selection method. The selection is done based on the entropy fitness function. The final features are classified by an ensemble classifier.
Findings
The presented method was extensively tested on two datasets, the authors' proposed dataset and the Chars74k benchmark, and achieved accuracies of 91.2% and 99.9%. Compared with existing techniques, the proposed method gives improved performance.
Originality/value
The technique presented in this work will help with license plate recognition and with the conversion of printed documents into machine-readable text.
Abstract
Optical character recognition (OCR) technology can be employed to produce an ASCII‐text database for mounting on computer systems. Current technologies and principles of scanning and OCR are discussed. A prototypical “local” project—the creation of a full‐text database of dissertations done at George Mason University—has been undertaken by the Fenwick Library at that institution. Problems encountered with current scanning and OCR technologies are illustrated and discussed, as well as techniques and “filter” programs developed to streamline the scanning and OCR conversion process.
Rajeswari S. and Sai Baba Magapu
Abstract
Purpose
The purpose of this paper is to develop a text extraction tool for scanned documents that would extract text and build the keywords corpus and key phrases corpus for the document without manual intervention.
Design/methodology/approach
For text extraction from scanned documents, a Web-based optical character recognition (OCR) tool was developed. Because OCR is a well-established technology, Microsoft Office document imaging tools were used to build the tool. To account for the common problem of skew in scanned documents, a method to detect and correct the skew was developed and integrated with the tool. The OCR tool was customized to build a keywords corpus and a key phrases corpus for every document.
Findings
The developed tool was evaluated using a 100-document corpus to test the various properties of OCR. The tool had above 99 per cent word read accuracy for text-only image documents. The customization of the OCR was tested with samples of microfiches, journal pages from back volumes and newspaper clips, and the results are discussed in the summary. The tool was found to be useful for text extraction and processing.
Social implications
The scanned documents are converted into a keywords corpus and a key phrases corpus. The tool could be used to build metadata for scanned documents without manual intervention.
Originality/value
The tool is used to convert unstructured data (in the form of image documents) to structured data (the document is converted into a keywords and key phrases database). In addition, the image document is converted to an editable and searchable document.
Optical character recognition (OCR) is a vital tool for the food and pharmaceutical industries, allowing them to inspect for correct labelling and thereby conforming to good…