Search results

1 – 4 of 4
Article
Publication date: 7 January 2020

Omri Suissa, Avshalom Elmalech and Maayan Zhitomirsky-Geffet

Abstract

Purpose

Digitization of historical documents is a challenging task in many digital humanities projects. A popular approach for digitization is to scan the documents into images, and then convert images into text using optical character recognition (OCR) algorithms. However, the outcome of OCR processing of historical documents is usually inaccurate and requires post-processing error correction. The purpose of this paper is to investigate how crowdsourcing can be utilized to correct OCR errors in historical text collections, and which crowdsourcing methodology is the most effective in different scenarios and for various research objectives.

Design/methodology/approach

A series of experiments with different micro-task structures and text lengths was conducted with 753 workers on Amazon’s Mechanical Turk platform. The workers had to fix OCR errors in a selected historical text. To analyze the results, new accuracy and efficiency measures were devised.

Findings

The analysis suggests that, in terms of accuracy, the optimal text length is medium (paragraph-size) and the optimal experiment structure is two-phase with a scanned image. In terms of efficiency, the best results were obtained when using longer text in the single-stage structure with no image.

Practical implications

The study provides practical recommendations to researchers on how to build the optimal crowdsourcing task for OCR post-correction. The developed methodology can also be used to create gold-standard historical texts for automatic OCR post-correction.

Originality/value

This is the first attempt to systematically investigate the influence of various factors on crowdsourcing-based OCR post-correction and propose an optimal strategy for this process.

Details

Aslib Journal of Information Management, vol. 72 no. 2
Type: Research Article
ISSN: 2050-3806

Article
Publication date: 30 October 2009

Hildelies Balk and Lieke Ploeger

Abstract

Purpose

The purpose of this paper is to address the most urgent challenges that libraries face in the mass digitization of historical printed text: the unsatisfactory results of converting scanned images to full-featured electronic text by means of automated optical character recognition (OCR); the historical language barrier around 1850, caused by the inadequacy of most existing lexica of historical language for OCR or post‐correction; and a lack of institutional knowledge and expertise in libraries, museums and archives.

Design/methodology/approach

In the EC‐funded project IMPACT (Improving Access to Text), seven libraries, six research institutes and two private sector companies across Europe work together to address these challenges by developing OCR software and technologies that significantly exceed the accuracy of current state‐of‐the‐art software. The IMPACT solutions focus on the entire process of recognition after the document leaves the scanner: image processing, OCR processing (including use of dictionaries), OCR correction and document formatting. IMPACT will also build capacity in mass digitization by sharing best practice and expertise with the cultural heritage communities in Europe.

Findings

Technical results will include toolkits for image enhancement and segmentation, an adaptive OCR engine and several prototypes of experimental OCR engines, computational lexica and several post‐correction modules, including a web‐based collaborative correction system and a parser for structural metadata. Strategic tools include several decision support tools, guidelines, a web site with a demonstrator platform, a training programme and, ultimately, a sustainable Centre of Competence for mass digitization in Europe.

Originality/value

The IMPACT solutions will, for the first time, make it possible to transform large amounts of digitized historical texts into electronic text with a minimum of manual intervention and significantly improved accessibility for the user.

Details

OCLC Systems & Services: International digital library perspectives, vol. 25 no. 4
Type: Research Article
ISSN: 1065-075X

Article
Publication date: 10 December 2018

Tessel Bogaard, Laura Hollink, Jan Wielemaker, Jacco van Ossenbruggen and Lynda Hardman

Abstract

Purpose

For digital libraries, it is useful to understand how users search in a collection. Investigating search patterns can help them to improve the user interface, collection management and search algorithms. However, search patterns may vary widely in different parts of a collection. The purpose of this paper is to demonstrate how to identify these search patterns within a well-curated historical newspaper collection using the existing metadata.

Design/methodology/approach

The authors analyzed search logs combined with metadata records describing the content of the collection, using this metadata to create subsets in the logs corresponding to different parts of the collection.

Findings

The study shows that faceted search is more prevalent than non-faceted search in terms of number of unique queries, time spent, clicks and downloads. Distinct search patterns are observed in different parts of the collection, corresponding to historical periods, geographical regions or subject matter.

Originality/value

First, this study provides deeper insights into search behavior at a fine granularity in a historical newspaper collection, by the inclusion of the metadata in the analysis. Second, it demonstrates how to use metadata categorization as a way to analyze distinct search patterns in a collection.

Details

Journal of Documentation, vol. 75 no. 2
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 25 July 2023

Priyanka Thakral, Praveen Ranjan Srivastava, Sanket Sunand Dash, Sajjad M. Jasimuddin and Zuopeng (Justin) Zhang

Abstract

Purpose

The growth of the global labor force and business analytics has significantly impacted human resource management (HRM). Human resource (HR) analytics is an emerging field that creates value for employees and organizations. By examining the existing studies on HR analytics, the paper systematically reviews the literature to identify active research areas and establish a roadmap for future studies in HR analytics.

Design/methodology/approach

A portfolio of 503 articles collected from the Scopus database was reviewed. The study adopted a latent Dirichlet allocation (LDA) topic modeling approach to identify significant themes in the literature.

Findings

The HR analytics research domain is classified into four categories: HR functions, statistical techniques, organizational outcomes and employee characteristics. The study also develops a framework for organizations adopting HR analytics. Links between HR and blockchain technology, explainable artificial intelligence and the Metaverse are identified as areas for future research.

Practical implications

The framework will assist practitioners in identifying statistical techniques for optimizing various HR functions. The paper finds that, by implementing HR analytics, HR managers and business partners can run reports, build dashboards and visualizations and make evidence-based decisions.

Originality/value

Previous studies have not applied machine learning techniques to identify the topics in the extant literature. This paper applies machine learning tools, making the review more robust and providing an exhaustive understanding of the domain.

Details

Management Decision, vol. 61 no. 12
Type: Research Article
ISSN: 0025-1747
