Search results
1 – 4 of 4
Omri Suissa, Avshalom Elmalech and Maayan Zhitomirsky-Geffet
Abstract
Purpose
Digitization of historical documents is a challenging task in many digital humanities projects. A popular approach for digitization is to scan the documents into images, and then convert images into text using optical character recognition (OCR) algorithms. However, the outcome of OCR processing of historical documents is usually inaccurate and requires post-processing error correction. The purpose of this paper is to investigate how crowdsourcing can be utilized to correct OCR errors in historical text collections, and which crowdsourcing methodology is the most effective in different scenarios and for various research objectives.
Design/methodology/approach
A series of experiments with different micro-task structures and text lengths was conducted with 753 workers on Amazon's Mechanical Turk platform. The workers had to fix OCR errors in a selected historical text. To analyze the results, new accuracy and efficiency measures were devised.
Findings
The analysis suggests that, in terms of accuracy, the optimal text length is medium (paragraph-size) and the optimal experiment structure is two-phase with a scanned image. In terms of efficiency, the best results were obtained with longer text in the single-stage structure with no image.
Practical implications
The study provides practical recommendations to researchers on how to build the optimal crowdsourcing task for OCR post-correction. The developed methodology can also be used to create gold-standard historical texts for automatic OCR post-correction.
Originality/value
This is the first attempt to systematically investigate the influence of various factors on crowdsourcing-based OCR post-correction and propose an optimal strategy for this process.
Hildelies Balk and Lieke Ploeger
Abstract
Purpose
The purpose of this paper is to address the most urgent challenges that libraries face in the mass digitization of historical printed text: the unsatisfactory results of converting scanned images to full-featured electronic text by means of automated optical character recognition (OCR); the historical language barrier around 1850, caused by the inadequacy of most existing lexica of historical language for OCR or post‐correction; and a lack of institutional knowledge and expertise in libraries, museums and archives.
Design/methodology/approach
In the EC‐funded project IMPACT (Improving Access to Text), seven libraries, six research institutes and two private sector companies across Europe work together to address these challenges by developing OCR software and technologies that significantly exceed the accuracy of current state‐of‐the‐art software. The IMPACT solutions focus on the entire recognition process after the document leaves the scanner: image processing, OCR processing (including the use of dictionaries), OCR correction and document formatting. IMPACT will also build capacity in mass digitization by sharing best practice and expertise with the cultural heritage communities in Europe.
Findings
Technical results will include toolkits for image enhancement and segmentation, an adaptive OCR engine and several prototypes of experimental OCR engines, computational lexica and several post‐correction modules, including a web-based collaborative correction system and a parser for structural metadata. Strategic tools include several decision support tools, guidelines, a website with a demonstrator platform, a training programme and, ultimately, a sustainable Centre of Competence for mass digitization in Europe.
Originality/value
The IMPACT solutions will, for the first time, make it possible to transform large amounts of digitized historical text into electronic text with a minimum of manual intervention and significantly improved accessibility for the user.
Tessel Bogaard, Laura Hollink, Jan Wielemaker, Jacco van Ossenbruggen and Lynda Hardman
Abstract
Purpose
For digital libraries, it is useful to understand how users search in a collection. Investigating search patterns can help them to improve the user interface, collection management and search algorithms. However, search patterns may vary widely in different parts of a collection. The purpose of this paper is to demonstrate how to identify these search patterns within a well-curated historical newspaper collection using the existing metadata.
Design/methodology/approach
The authors analyzed search logs combined with metadata records describing the content of the collection, using this metadata to create subsets in the logs corresponding to different parts of the collection.
Findings
The study shows that faceted search is more prevalent than non-faceted search in terms of number of unique queries, time spent, clicks and downloads. Distinct search patterns are observed in different parts of the collection, corresponding to historical periods, geographical regions or subject matter.
Originality/value
First, this study provides deeper insights into search behavior at a fine granularity in a historical newspaper collection, by the inclusion of the metadata in the analysis. Second, it demonstrates how to use metadata categorization as a way to analyze distinct search patterns in a collection.
Priyanka Thakral, Praveen Ranjan Srivastava, Sanket Sunand Dash, Sajjad M. Jasimuddin and Zuopeng (Justin) Zhang
Abstract
Purpose
The growth of the global labor force and business analytics has significantly impacted human resource management (HRM). Human resource (HR) analytics is an emerging field that creates value for employees and organizations. By examining the existing studies on HR analytics, the paper systematically reviews the literature to identify active research areas and establish a roadmap for future studies in HR analytics.
Design/methodology/approach
A portfolio of 503 articles collected from the Scopus database was reviewed. The study adopted a latent Dirichlet allocation (LDA) topic modeling approach to identify significant themes in the literature.
Findings
The HR analytics research domain is classified into four categories: HR functions, statistical techniques, organizational outcomes and employee characteristics. The study has also developed a framework for organizations adopting HR analytics. Linking HR with blockchain technology, explainable artificial intelligence and the metaverse are areas identified for future research.
Practical implications
The framework will assist practitioners in identifying statistical techniques for optimizing various HR functions. The paper finds that by implementing HR analytics, HR managers and business partners can run reports, build dashboards and visualizations and make evidence-based decisions.
Originality/value
Previous studies have not applied machine learning techniques to identify the topics in the extant literature. This paper applies machine learning tools, making the review more robust and providing an exhaustive understanding of the domain.