Comparing “parallel passages” in digital archives

Martyn Harris (Department of Computer Science and Information Systems, Birkbeck, University of London, London, UK)
Mark Levene (Department of Computer Science and Information Systems, Birkbeck, University of London, London, UK)
Dell Zhang (Department of Computer Science and Information Systems, Birkbeck, University of London, London, UK)
Dan Levene (Department of History, Southampton University, Southampton, UK)

Journal of Documentation

ISSN: 0022-0418

Article publication date: 16 September 2019

Issue publication date: 7 January 2020

Abstract

Purpose

The purpose of this paper is to present a language-agnostic approach to facilitate the discovery of “parallel passages” stored in historic and cultural heritage digital archives.

Design/methodology/approach

The authors explore a novel and relatively simple approach that combines a character-based statistical language model with a tailored version of the Basic Local Alignment Search Tool (BLAST) to extract exact and approximate string patterns shared between groups of documents.
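As a rough illustration of the alignment idea only, the following minimal sketch shows seed-and-extend matching over character n-grams, the general strategy that BLAST-style tools build on. It is not the authors' implementation: the function name `shared_passages` and the parameters `k` and `min_len` are hypothetical, the sketch finds exact matches only, and it omits both the statistical language model and the scoring of approximate matches mentioned above.

```python
from collections import defaultdict


def shared_passages(a: str, b: str, k: int = 5, min_len: int = 10):
    """Find exact substrings shared by a and b using seed-and-extend.

    k is the seed (character n-gram) length; min_len is the shortest
    match kept. Both are illustrative assumptions, not values from the paper.
    Returns a sorted list of (start_in_a, start_in_b, length) tuples.
    """
    # Index every k-character seed of `a` by its start positions.
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)

    found = set()
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            # Extend the seed to the left while characters agree.
            lo_a, lo_b = i, j
            while lo_a > 0 and lo_b > 0 and a[lo_a - 1] == b[lo_b - 1]:
                lo_a -= 1
                lo_b -= 1
            # Extend the seed to the right while characters agree.
            hi_a, hi_b = i + k, j + k
            while hi_a < len(a) and hi_b < len(b) and a[hi_a] == b[hi_b]:
                hi_a += 1
                hi_b += 1
            # Keep only matches that are long enough; the set removes
            # duplicates reached from different seeds of the same passage.
            if hi_a - lo_a >= min_len:
                found.add((lo_a, lo_b, hi_a - lo_a))
    return sorted(found)
```

Calling `shared_passages(text_a, text_b)` on two document strings returns the offsets and lengths of passages they share verbatim. Working on characters rather than words is what keeps such a sketch independent of any language-specific tokenisation.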

Findings

The approach is applicable to a wide range of languages, and compensates for variability in the text of the documents arising from differences in dialect and authorship, language change over time, and errors introduced during digitisation by inaccurate transcription and optical character recognition.

Research limitations/implications

A number of case studies demonstrate that the approach is practical and generalisable to a wide range of archives with documents in different languages, domains and of varying quality.

Practical implications

The approach described can be applied to any digital archive of modern and contemporary texts. This makes the approach applicable not only to digital archives recording historic texts, but also to those composed of more recent material such as news articles.

Social implications

The analysis of “parallel passages” enables researchers to quantify the presence and extent of text-reuse in a collection of documents, which can provide useful data on author style, text genres and cultural contexts.

Originality/value

The approach is novel and addresses a need among humanities researchers for tools that can identify similar documents, and local similarities represented by shared text sequences, in a potentially vast archive of documents. As far as the authors are aware, no tools currently exist that provide the same level of tolerance to the language of the documents.

Citation

Harris, M., Levene, M., Zhang, D. and Levene, D. (2020), "Comparing “parallel passages” in digital archives", Journal of Documentation, Vol. 76 No. 1, pp. 271-289. https://doi.org/10.1108/JD-10-2018-0175

Publisher

Emerald Publishing Limited

Copyright © 2019, Emerald Publishing Limited