Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections

Dilawar Ali (IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium) (IMEC, Leuven, Belgium)
Kenzo Milleville (IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium) (IMEC, Leuven, Belgium)
Steven Verstockt (IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium) (IMEC, Leuven, Belgium)
Nico Van de Weghe (Department of Geography, Ghent University, Ghent, Belgium)
Sally Chambers (Ghent Centre for Digital Humanities, Ghent University, Ghent, Belgium) (KBR, Brussel, Belgium)
Julie M. Birkholz (Ghent Centre for Digital Humanities, Ghent University, Ghent, Belgium) (KBR, Brussel, Belgium)

Journal of Documentation

ISSN: 0022-0418

Article publication date: 27 February 2023

Issue publication date: 3 September 2024

Abstract

Purpose

Historical newspaper collections provide a wealth of information about the past. Although digitization significantly improves their accessibility, a large portion of digitized historical newspaper collections, such as those of KBR, the Royal Library of Belgium, is not yet searchable at the article level. However, recent developments in AI-based research methods, such as document layout analysis, have the potential to further enrich the metadata and improve the searchability of these collections. This paper examines how such methods can address this gap.

Design/methodology/approach

In this paper, the authors explore how existing computer vision and machine learning approaches can be used to improve access to digitized historical newspapers. To do this, the authors propose a workflow, using computer vision and machine learning approaches to (1) provide article-level access to digitized historical newspaper collections using document layout analysis, (2) extract specific types of articles (e.g. feuilletons – literary supplements from Le Peuple from 1938), (3) conduct image similarity analysis using (un)supervised classification methods and (4) perform named entity recognition (NER) to link the extracted information to open data.
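As a rough illustration of step (3), the sketch below groups images whose feature vectors are close under cosine similarity. The abstract does not say which features or clustering method the authors use, so everything here is an assumption made for illustration: the greedy threshold grouping, the `threshold` value, the function names and the three toy feature vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def group_similar(features, threshold=0.9):
    """Greedy grouping (an assumed stand-in, not the paper's method).

    Each image joins the first group whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new group.
    """
    groups = []
    for image_id, vector in features.items():
        for group in groups:
            representative = features[group[0]]
            if cosine_similarity(vector, representative) >= threshold:
                group.append(image_id)
                break
        else:
            groups.append([image_id])
    return groups


# Hypothetical feature vectors; in practice these might come from an
# image embedding model applied to illustrations cropped from pages.
features = {
    "page1_img1": [0.9, 0.1, 0.0],
    "page2_img3": [0.88, 0.12, 0.01],
    "page5_img2": [0.0, 0.2, 0.95],
}
groups = group_similar(features)
print(groups)
```

With these toy vectors, the first two images fall into one group and the third forms its own. A real collection would likely call for a proper clustering algorithm and learned image features rather than this greedy pass.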

Findings

The results show that the proposed workflow improves the accessibility and searchability of digitized historical newspapers, and also contributes to the building of corpora for digital humanities research. The AI-based methods enable automatic extraction of feuilletons, clustering of similar images and dynamic linking of related articles.

Originality/value

The proposed workflow enables automatic extraction of articles, including detection of a specific type of article, such as a feuilleton or literary supplement. This is particularly valuable for humanities researchers as it improves the searchability of these collections and enables corpora to be built around specific themes. Article-level access to, and improved searchability of, KBR's digitized newspapers are demonstrated through the online tool (https://tw06v072.ugent.be/kbr/).

Acknowledgements

This research has been funded by the DATA-KBR-BE project (2020–2024) financed by the Belgian Science Policy Office (Belspo) as part of the Belgian Research Action through Interdisciplinary Networks, BRAIN 2.0 program which is coordinated by KBR. The authors would like to thank the KBR for enabling access to the historical newspaper data for this research and Alec Van den broeck for his assistance with the NER evaluation.

Citation

Ali, D., Milleville, K., Verstockt, S., Van de Weghe, N., Chambers, S. and Birkholz, J.M. (2024), "Computer vision and machine learning approaches for metadata enrichment to improve searchability of historical newspaper collections", Journal of Documentation, Vol. 80 No. 5, pp. 1031-1056. https://doi.org/10.1108/JD-01-2022-0029

Publisher: Emerald Publishing Limited

Copyright © 2022, Emerald Publishing Limited
