Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections
ISSN: 0022-0418
Article publication date: 27 February 2023
Issue publication date: 3 September 2024
Abstract
Purpose
Historical newspaper collections provide a wealth of information about the past. Although digitization significantly improves their accessibility, a large portion of digitized historical newspaper collections, such as those of KBR, the Royal Library of Belgium, is not yet searchable at the article level. However, recent developments in AI-based research methods, such as document layout analysis, have the potential to further enrich the metadata and thereby improve the searchability of these collections. This paper discusses how such methods can be applied to address this gap.
Design/methodology/approach
In this paper, the authors explore how existing computer vision and machine learning approaches can be used to improve access to digitized historical newspapers. To do this, the authors propose a workflow that uses these approaches to (1) provide article-level access to digitized historical newspaper collections using document layout analysis, (2) extract specific types of articles (e.g. feuilletons, or literary supplements, in Le Peuple from 1938), (3) conduct image similarity analysis using (un)supervised classification methods and (4) perform named entity recognition (NER) to link the extracted information to open data.
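As a rough illustration of step (3), the sketch below groups images whose feature vectors are similar. It is not the authors' method: the abstract does not say which features or classifiers were used. The feature vectors, the image identifiers, the function names and the 0.9 threshold are all hypothetical, and the greedy threshold grouping merely stands in for an unsupervised approach.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def group_similar_images(features, threshold=0.9):
    """Greedy single-pass grouping (hypothetical helper).

    Each image joins the first group whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a
    new group.
    """
    groups = []  # list of lists of image ids
    for image_id, vector in features.items():
        for group in groups:
            representative = features[group[0]]
            if cosine_similarity(vector, representative) >= threshold:
                group.append(image_id)
                break
        else:
            groups.append([image_id])
    return groups


# Hypothetical 3-dimensional feature vectors; in practice these would
# come from an image embedding model, which is assumed here.
features = {
    "page1_img1": [1.0, 0.0, 0.0],
    "page2_img3": [0.9, 0.1, 0.0],
    "page5_img2": [0.0, 1.0, 0.0],
    "page7_img1": [0.0, 0.9, 0.1],
}
groups = group_similar_images(features)
print(groups)
```

With these toy vectors, the first two images fall into one group and the last two into another. A production pipeline would more likely use a dedicated clustering library and a threshold tuned on the collection itself.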
Findings
The results show that the proposed workflow improves the accessibility and searchability of digitized historical newspapers, and also contributes to the building of corpora for digital humanities research. The AI-based methods enable automatic extraction of feuilletons, clustering of similar images and dynamic linking of related articles.
Originality/value
The proposed workflow enables automatic extraction of articles, including detection of a specific type of article, such as a feuilleton or literary supplement. This is particularly valuable for humanities researchers as it improves the searchability of these collections and enables corpora to be built around specific themes. Article-level access to, and improved searchability of, KBR's digitized newspapers are demonstrated through the online tool (https://tw06v072.ugent.be/kbr/).
Acknowledgements
This research has been funded by the DATA-KBR-BE project (2020–2024) financed by the Belgian Science Policy Office (Belspo) as part of the Belgian Research Action through Interdisciplinary Networks, BRAIN 2.0 program which is coordinated by KBR. The authors would like to thank KBR for enabling access to the historical newspaper data for this research and Alec Van den broeck for his assistance with the NER evaluation.
Citation
Ali, D., Milleville, K., Verstockt, S., Van de Weghe, N., Chambers, S. and Birkholz, J.M. (2024), "Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections", Journal of Documentation, Vol. 80 No. 5, pp. 1031-1056. https://doi.org/10.1108/JD-01-2022-0029
Publisher
Emerald Publishing Limited
Copyright © 2022, Emerald Publishing Limited