Optical character recognition quality affects subjective user perception of historical newspaper clippings

Purpose – This study aims to identify user perception of different qualities of optical character recognition (OCR) in texts. The purpose of this paper is to study the effect of different quality OCR on users' subjective perception through an interactive information retrieval task with a collection of one digitized historical Finnish newspaper.
Design/methodology/approach – This study is based on the simulated work task model used in interactive information retrieval. Thirty-two users made searches to an article collection of the Finnish newspaper Uusi Suometar 1869–1918, which consists of ca. 1.45 million auto-segmented articles. The article search database had two versions of each article with different quality OCR. Each user performed six pre-formulated and six self-formulated short queries and evaluated subjectively the top 10 results using a graded relevance scale of 0–3. Users were not informed about the OCR quality differences of the otherwise identical articles.
Findings – The main result of the study is that improved OCR quality affects subjective user perception of historical newspaper articles positively: higher relevance scores are given to better-quality texts.
Originality/value – To the best of the authors' knowledge, this simulated interactive work task experiment is the first one showing empirically that users' subjective relevance assessments are affected by a change in the quality of an optically read text.


Introduction
Digitized versions of historical newspaper collections have been both produced and studied extensively during the last few decades in different parts of the world. In the year 2012, it was estimated that digitized newspaper collections in Europe consisted of ca. 129 M pages and 24,000 titles (Dunning, 2012), about 17% of the available content (Gooding, 2018). Since then, several national libraries and other stakeholders, such as publishers, have produced and are currently producing more and more digitized historical content online out of their newspaper collections. In a recent publication describing in detail 10 different digitized historical newspaper collections, Beals and Bell (2020) state that over the past thirty years, national libraries, universities and commercial publishers around the world have made available hundreds of millions of pages of historical newspapers through mass digitisation and currently release over one million new pages per month worldwide. These have become vital resources not only for academics but for journalists, politicians, schools and the general public.
It is evident that more collections and more data will be available in the future and usage of the collections will increase.
Several projects have studied and developed methods to access digitized newspaper collections. Only a few of these projects can be mentioned here. One of the first large European projects was Europeana (http://www.europeana-newspapers.eu/, Neudecker and Antonacopoulos, 2016). Europeana gathered a selection of digitized newspaper content produced in different European libraries, enhanced the searchability of the content, developed an evaluation and quality-assessment infrastructure for newspaper digitization and created best-practice recommendations for newspaper metadata, among other things. Later research projects like NewsEye (newseye.eu, Oberbichler et al., 2021), Impresso (https://impresso-project.ch/app/) and Oceanic Exchanges (https://oceanicexchanges.org/) have studied how to access, develop and enrich the contents of digitized newspaper collections.
In Finland, the most important national project related to the study of digitized historical newspapers has been Computational History and the Transformation of Public Discourse in Finland, which finished at the end of the year 2019. The project used historical newspapers published in Finland during the years 1771-1920 as one of its main research datasets. One part of the project studied metadata related to the newspapers (Mäkelä et al., 2019; Marjanen et al., 2019); another part produced, for example, a database from which reuse of published news stories could be detected and studied (Salmi et al., 2020). The National Library of Finland (NLF) participated in the project by providing data and by working on improvement of the data and on the production of optical character recognition (OCR), named entity and article extraction training and evaluation collections (Kettunen and Koistinen, 2019; Kettunen et al., 2019a, b).
In the current study, a collection of 49 years of the Finnish newspaper Uusi Suometar is used in a user-oriented study of information retrieval and access. The study can be considered a realistic approach to information retrieval: participants of the study make searches in an optically read historical newspaper collection to find newspaper articles related to given topics. This is what scholars and lay users of these collections do daily in different parts of the world, with varying success. In our evaluation, users search the newspaper database both with pre-formulated queries and with queries that they formulate on their own based on the topic descriptions. The database of Uusi Suometar has two different versions of the content: one with the original OCR and one with improved, new OCR. The query engine always searches for the results of queries in the new OCR version of the database and ranks the results accordingly. However, the retrieved texts presented for reading are balloted between the two different optically read qualities of the same articles. Users of the query system were not aware of differences in the OCR quality when they used the query environment.

JD 79,7

During the last decades, vast investments have been made in various national and international archival projects to digitize growing numbers of historical documents and grant end-users access to them. Yet, as recent empirical studies attest (e.g. Kumpulainen and Late, 2022; Late and Kumpulainen, 2021), the use of such collections is prone to be negatively affected by the noise present in digitized documents, caused by errors made during the OCR process. However, the core question of how variation in OCR quality affects the user has proved elusive. In this paper, we study this question experimentally.
The key question is whether a change in the document's OCR quality affects subjective perception regarding the usefulness of the documents (perceived usefulness) when the test person is framed to be performing a simulated research work task.

Related research
Even if new ways of access to large, digitized text collections are under development, basic information retrieval tools are still the main entry point to these collections (Organisciak et al., 2021). Users search collections in a query engine interface using keywords to describe their information needs. Information retrieval of historical newspapers has several challenges, among them OCR quality, spelling variation of historical language, lack of proper tools for natural language processing of older language and lack of structure in the optically read documents (Lopresti, 2009; Gotscharek et al., 2011; Piotrowski, 2012; Järvelin et al., 2016; Karlgren et al., 2019; Pfanzelter et al., 2021; Torget et al., 2022). In their present state, historical newspapers are a hard task for information retrieval engines, and users of the collections, such as researchers of digital humanities, are often not very satisfied with the search possibilities and may have low trust in the search results (Jarlbrink and Snickars, 2017; Pfanzelter et al., 2021).
Textual properties of digitized historical newspapers, such as the quality of OCR and segmentation, are often studied in data-oriented scenarios, which pay attention to the statistical properties of text without consulting the user viewpoint. The effects of subquality OCR on the efficiency of information retrieval have been studied to a fair extent; the results include both simulations, where the quality of the text content has been degraded artificially, and studies with original OCR text. Actual user studies in a controlled query environment, however, have so far been missing. Simulated research settings include, e.g. Taghva et al. (1996), Savoy and Naji (2011) and Bazzo et al. (2020), just to mention a few. The general result of these studies is that worse OCR quality clearly lowers query results. The effect of worse OCR is clearest with short documents and queries of a few words: with these, the query engine has less evidence for matching the document and the query words.
Järvelin et al. (2016) report the results of information retrieval in a laboratory-style collection of digitized Finnish newspapers. Their collection consisted of 180,468 documents (84,512 newspaper pages), for which they had developed 56 search topics with graded relevance assessments. The results of the study show that the low OCR quality of the collection clearly lowered search results, even if heavy fuzzy-matching methods were used in query expansion to improve the results.
If we broaden the scope and look at research outside information science, digital humanists have also paid attention to the problems of low-quality OCR in digital historical newspaper collections. Jarlbrink and Snickars (2017), for example, show how one digital Swedish newspaper collection, Aftonbladet 1830-1862, "contains extreme amounts of noise: millions of misinterpreted words generated by OCR, and millions of texts re-edited by the auto-segmentation tool". Their main contribution is a discussion of low-quality OCR and its effects on using digitized newspapers as research data. Pfanzelter et al. (2021) describe user experiences and needs of digital humanities researchers with three digitized newspaper collections: Austrian ANNO (https://anno.onb.ac.at/), Finnish Digi (digi.kansalliskirjasto.fi) and French Gallica (https://gallica.bnf.fr/) and Retronews (https://gallica.bnf.fr/edit/und/retronews-0). Although their main concern in the paper is related to general functionality demands for interfaces of digitized newspaper collections, they also report experiences related to the searchability of the collections. One of their general findings is that "in some cases, the OCR quality is still very low. After identifying some major issues in this regard, the DH team's reliance on (and trust in) some search results was very low". Slightly differing opinions have also been stated by digital humanities researchers. Strange et al. (2014), for example, state that "The cleaning was thus desirable but not essential", referring by cleaning to the correction of OCR errors in the digitized texts they were studying. van Strien et al. (2020) suggest caution in trusting retrieval results of optically read text. They show that both the rankings of articles and the number of articles returned by the query engine are affected by the text quality. Traub et al. (2018) show that better data quality decreases so-called retrievability bias, which tends to bring certain documents up as a search result more often than others (Azzopardi and Vinay, 2008). Chiron et al. (2017) show, with respect to the French Gallica collection, that low-frequency query words that contain frequent optical character error patterns have a higher risk of producing poor query results. Results of Hill and Hengchen (2019) on the Eighteenth Century Collections Online (ECCO), on the other hand, show that the effects of poor OCR quality vary with respect to different higher-level NLP tasks. They studied topic modelling, collocation analysis, authorial attribution and vector space modelling in a data-oriented setting. Traub et al. (2015) interviewed historians to get an impression of their usage of digital archives and their awareness of possible problems related to them. Researchers were usually aware of the OCR quality problems of digitized collections and the possible bias caused by quality problems, but they could not quantify how the problems could affect their research. Traub et al. also show that problems of OCR quality affect different types of research settings differently. They concluded "that the current knowledge situation on the users' side as well as on the tool makers' and data providers' side is insufficient and needs to be improved" with respect to the data quality. In the interviews of Korkeamäki and Kumpulainen (2019; Late and Kumpulainen, 2021), historians were asked about their task-based information interaction in digital environments. Some of the interviewed researchers worked a lot with historical optically read texts, and some mentioned problems with the quality of data, especially in the analysis phase: "Especially when using big data, historians had to evaluate the impact of the noisy OCR on the results and if the analysis is worth doing" (Korkeamäki and Kumpulainen, 2019).
Some of the interviewed researchers did not trust that search in the optically read collections would retrieve all the relevant hits as results (Late and Kumpulainen, 2021).

Research questions
In this interactive information retrieval study, we seek answers to the following research questions.
(1) Do the search results of pre-formulated queries differ from those of the users' self-formulated queries?
(2) Does the quality of the OCR (old versus new) affect the perceived usefulness of the newspaper clippings? This question has two subquestions: What happens with perceived usefulness in the case of pre-formulated queries?
What happens with perceived usefulness in the case of self-formulated queries?

4. The research setting

4.1 Research method
Interactive information retrieval, in general, is a user-centred approach to doing information retrieval research (e.g. Ingwersen and Järvelin, 2005, p. 191). Our research belongs specifically to the tradition of interactive information retrieval where a simulated work task situation approach with its information needs is used (Borlund, 2000; Ingwersen and Järvelin, 2005, pp. 251-254). Our interactive information retrieval setting meets the three main requirements for this kind of task, as described in Borlund (2000): (1) potential users as test persons, (2) application of dynamic and individual information needs and (3) use of multidimensional and dynamic relevance judgements. The interactive information retrieval environment developed for the task is the tool that triggers the simulated information needs of the users (cf. Section 4.5).
The interactive information retrieval approach has the following four main advantages (cf. Borlund, 2000; Borlund and Ingwersen, 1998): firstly, the use of cover stories triggers information needs provoked by simulated work tasks. Secondly, the setting allows individual test persons to assess the usefulness of the newspaper clippings with respect to their own interpretation. Thirdly, graded and multidimensional relevance assessments are used instead of binary and topical ones, which facilitates both control and repeatability during experimentation by combining the use of static queries with free-form queries. And finally, the setting enables the use of a realistic search interface with actual data.

Participants of the study and their task
To perform the study, we recruited 32 participants for the evaluation task. The student users were recruited from the courses Information Retrieval and Language Technology and Information Retrieval Methods at Tampere University, Faculty of Information Technology and Communication Sciences. Three teachers of information science also participated in the evaluation task. The choice of participants was based mainly on the ease of getting a large enough group to perform the tasks, and the group can be considered a convenience sample (Kelly, 2009, p. 67). A large enough group of real historians would have been harder to gather, but on the other hand, university-level education is an asset for the participants. We did not collect detailed information about the participants, only whether they were students or teachers of information science (cf. Figure 4). Thus, we do not perform any analysis of the results based on different user qualities.
The participants, students and teachers of information science, were given the background that they use the information retrieval system of digitized newspaper clippings to write an article on historical events in Finland or the world in the time span of 1869-1918. Participants needed to evaluate the results of the search in relation to this information need.
Participants were given a one-page instruction leaflet which described the information retrieval task. The leaflet described the general idea of the task, informed the participants about the two sessions, gave them the backgrounding simulated work task story and explained the graded evaluation scale of 0-3. Participants were guided to perform six queries in both sessions with different topics that the query environment balloted one at a time after the user had logged in to the system. Due to COVID-19, users made searches alone using remote net access to the system, and no common session was arranged. Users were instructed to perform the searches on their own schedule in a timeframe of 10 days beginning from March 8, 2021.
Instructions for the search sessions were given to the participants as one A4 page. The instructions gave the participants a background story and formulated the evaluation scale to use in relation to the simulated work task as described in Table 1.

Topic creation for the study
The topics of the searches were created using history timelines from two popular history encyclopedias: Suomen historian pikkujättiläinen ("A small encyclopedia of Finnish history", Zetterberg, 1989) and Maailmanhistorian pikkujättiläinen ("A small encyclopedia of world history", Zetterberg, 1988). After finding suitable topics from the timelines, searches to the newspaper database at digi.kansalliskirjasto.fi were performed to confirm that the database had enough hits related to each topic. During the final creation of the query environment, many original topics were abandoned and new ones were created due to too few hits in the final article extraction database. The final topic descriptions were based on Finnish Wikipedia articles related to the topics. From the topics, we created ready-made short queries for the task. The topics cover the time frame of the historical collection of Uusi Suometar, beginning from the 1870s and ending in 1918. The first year mentioned in the topic descriptions is 1871, the last 1918. The topics cover both domestic and foreign news, with 21 domestic and 9 foreign topics. The demarcation line between foreign and domestic news is not always sharp, and some topics could be classified as both.
Participants performed two separate query sessions: one with pre-formulated queries and another where they could formulate the queries freely based on the topic descriptions, which were available in the user interface of the search environment. In the second session, participants could use, e.g. Google searches to gather background information about the topics. For each query, only the top 10 results were shown to the users. The pre-formulated Finnish queries and their translations are listed in the Appendix.
This study uses a relatively small, digitized collection of one newspaper's 49-year history to study users' access to the collection by means of article extraction on the pages of the collection. Article extraction, often also called segmentation, is not a pervasive property of historical newspaper collections, as good-quality separation of articles has been hard to produce automatically (cf. Clausner et al., 2017, 2019; Dengel et al., 2014; Kise et al., 2014; Jarlbrink and Snickars, 2017). Recent evaluation studies reported at the International Conference on Document Analysis and Recognition 2017 and 2019 (Clausner et al., 2017, 2019) show especially that state-of-the-art systems (commercial and open source), often used by libraries in data production, do not perform very well in page segmentation and region classification of complex-layout document pages. Among the 10 digitized historical newspaper collections gathered in Beals and Bell (2020), some have article extraction, but many do not. Article extraction in our study collection has been produced automatically based on a machine learning model, and the quality of the result has not been assessed on a large scale, only with a small evaluation collection (cf. the section Newspaper data: optical character recognition quality and article structure).
Table 1. The simulated work task and the evaluation scale

The background story: Imagine that you are writing an article related to topics in the history of Finland or world history at the end of the 19th century or the beginning of the 20th century. Evaluate the quality of the clippings you get as search results. Evaluate the quality of each clipping from the viewpoint of how it helps you to proceed with your article writing.

Evaluation of the search results (graded relevance scale of 0-3):
3. The clipping deals with the topic very broadly and its information content corresponds well with the task. The clipping helps well in accomplishing your task.
2. The clipping deals partially with the task or touches on it. The content of the clipping helps to some extent in accomplishing your task.
1. The clipping does not deal with the actual topic but helps to find better search terms and to limit the topic somehow. It helps indirectly in accomplishing your task.
0. The clipping is wholly off topic and does not even help to formulate new queries. The clipping brings no benefit in accomplishing your task.

Our data has two main challenges for the user, as will be described in more detail later. Shortly put, the OCR quality of the data is not optimal even if it has been improved, and the results of article extraction make the evaluation of query results hard: the texts shown to the users may contain irrelevant content from the surrounding context of the newspaper page. These kinds of problems are, however, often a reality with digitized historical newspapers, as testified, for example, in the digital humanities pilot studies of three different digitized newspaper collections in Pfanzelter et al. (2021).
We conduct the study in the framework of interactive information retrieval, but our results are also relevant for the expanding field of digital humanities, which increasingly uses digitized historical newspapers as its research data. We thus try to include some implications for digital humanities in our discussion of the findings of our research.
4.4 Newspaper data: optical character recognition quality and article structure
Our search collection consists of the whole history of Uusi Suometar 1869-1918, ca. 86,000 pages and 306.8 million words (Kettunen and Koistinen, 2019). Uusi Suometar was at the time of its publication one of the most important Finnish-language newspapers in Finland, where newspapers were published in two languages, Finnish and Swedish (Hynynen, 2019). The newspaper data has been scanned and optically read as part of the digitization of Finnish newspapers and journals at the NLF and is part of the digitized content offered by the NLF (https://digi.kansalliskirjasto.fi/etusivu?set_language=en).
The original OCR for Uusi Suometar was performed using a line of ABBYY FineReader® products. The quality of the digitization of the whole collection of Finnish newspapers from 1771 to 1910 has been estimated by Kettunen and Pääkkönen (2016). They conclude that ca. 70-75% of the words in the Finnish-language 2.4-billion-word index database could be recognized by using automatic morphological analysers. In general, this level of word-level quality can be considered quite typical for older digitizations of historical newspaper collections (cf. e.g. Tanner et al., 2009).
Improved OCR for the whole history of Uusi Suometar was achieved with Tesseract v.3.04.01. The improvement over the earlier quality in the recognition of words is approximately 15 percentage points as a mean over the whole period. On average, 83% of the words in the newspaper were recognized by automatic morphological analysers, and the recognition rate varied from ca. 78% to 88% over the 49 years. For the old OCR, the mean word recognition rate was 68.2% (Kettunen and Koistinen, 2019). Even if the improvement in OCR quality is considerable, the improved quality can still be challenging for information retrieval engines, especially with short queries and articles, where the information retrieval engine has less evidence for matching the query words and the collection data in the engine's index (Järvelin et al., 2016; Mittendorf and Schäuble, 2000). In a recent study, Bazzo et al. (2020), for example, found that statistically significant degradation of search results begins already at a word error rate of 5%, when Portuguese PDF texts with artificially induced errors were searched with a state-of-the-art query engine and algorithms. The same kind of results has been achieved in most current studies, as the recent survey paper of Nguyen et al. (2021) shows. It should also be emphasized that even if substantially better-quality OCR for collections can be achieved with the latest tools like Transkribus (Muehlberger et al., 2019), new OCR of large old historical collections is tedious and time-consuming; thus the problem of low-quality optically read text will stay for some time.
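The word recognition rates above are shares of index words accepted by automatic morphological analysers. As a minimal sketch of that metric (the function and the toy lexicon below are our own illustration, not the tooling used in the study):

```python
def recognition_rate(words, is_recognized):
    """Share of words accepted by a morphological analyser.
    `is_recognized` is a stand-in for a real analyser's lookup."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if is_recognized(w))
    return hits / len(words)

# Toy lexicon in place of an analyser; "s0ta" is an OCR-garbled word,
# so 3 of the 4 words are recognized -> 0.75
lexicon = {"sota", "uutinen", "sanomalehti"}
rate = recognition_rate(["sota", "uutinen", "sanomalehti", "s0ta"],
                        lexicon.__contains__)
```

Note that this proxy counts OCR errors that happen to produce valid word forms as recognized, which is one reason it is only an estimate of true OCR quality.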
Newspaper data at the NLF was originally scanned and recognized page by page without article structure information besides the basic layout of the pages. For this study, we used articles that were extracted automatically from the pages of Uusi Suometar using a trained machine learning model with the software PIVAJ (Hebert et al., 2014a, b; Kettunen et al., 2019a). Training of the PIVAJ model was based on 168 pages of manually marked data that had differing numbers of columns (varying from 3 to 9). Kettunen et al. (2019a) reported success percentages of 67.9, 76.1 and 92.2 for an evaluation dataset of 56 pages in three different evaluation scenarios based on Clausner et al. (2011), using layout evaluation software from PRImA (https://www.primaresearch.org/). In the automatic segmentation process, the collection of Uusi Suometar was divided into 1,459,068 articles with PIVAJ. Figure 1 shows an example of PIVAJ's graphical output for the article structure. Different articles on the page are shown in different colours.
In the article extraction of the whole history of Uusi Suometar, article separation is far from optimal, and the articles are perhaps best called automatically extracted clippings of varying lengths. In the search evaluation task, these clippings are the documents that users search and evaluate. It should be emphasized that the article segmentation that could be produced for the whole history of Uusi Suometar is experimental, and its quality adds one layer of difficulty to the evaluation of search results. As Jarlbrink and Snickars (2017) formulate it, auto-segmentation tools create random texts, and the borders of text snippets are fuzzy. Users were informed of this feature in the instructions of the search task.

The query interface
Participants of the evaluation task performed their task using the query engine Elasticsearch (https://www.elastic.co/), version 7.3.2, which is the background engine of the library's presentation system. Queries were performed in AND mode, where every query term is sought in the documents. Hits of the search engine shown to the users needed to be at least 500 characters long to avoid very short text passages, which would be hard to evaluate. The index of the newspaper collection's database is lemmatized, i.e. it contains the base forms of the words, which is crucial as Finnish is a highly inflected language (Järvelin et al., 2016). The search environment and the database built from the Uusi Suometar article extraction results were developed by Evident Solutions Ltd (https://evident.fi/). Figure 2 shows the search environment we created for the evaluation of the search results. Figure 3 depicts the query interface the participants used.
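An AND-mode query with a top-10 cutoff and a minimum clipping length can be expressed in Elasticsearch as a `match` query with `operator: "and"` plus a range filter. The sketch below builds such a request body; the field names (`text_lemmatized`, `char_count`) are our own illustrative assumptions, not the actual index schema:

```python
def build_query(terms, min_chars=500):
    """Build an Elasticsearch request body that requires every query term
    (AND mode) in a lemmatized text field and filters out short clippings."""
    return {
        "size": 10,  # only the top 10 results were shown to users
        "query": {
            "bool": {
                "must": [{
                    "match": {
                        "text_lemmatized": {
                            "query": " ".join(terms),
                            "operator": "and",  # every term must occur
                        }
                    }
                }],
                "filter": [
                    {"range": {"char_count": {"gte": min_chars}}}
                ],
            }
        },
    }

body = build_query(["sota", "uutinen"])
```

Because ranking is computed on the lemmatized new-OCR index, a setup like this lets the result list stay identical regardless of which OCR version of a clipping is later shown to the reader.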

Collection and analysis of the query results
Query results of the user sessions were collected for analysis in a query log, which is shown in Figure 4. The columns in the query log indicate the following data beginning from the left: (1) query words; (2) session: pre-formulated or self-formulated queries, A and B, respectively; (3) number of the topic; (4) OCR quality in the results (0 for the old and 1 for the new); (5) user id; (6) role of the user (student or teacher); (7) id number of the result clipping; (8) user-given evaluation result on the scale of 0-3; (9) date and time of the performance; (10) possible change for the evaluation: time stamp; (11) size of the clipping in characters; and (12) rank (1-10) of the result clipping in the result list.
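The twelve log columns listed above can be read into an analysable structure with a few lines of code. The sketch below assumes a tab-separated log line with the columns in the listed order; the separator and field names are our own assumptions about the log format:

```python
# Column names follow the order of the twelve query-log fields in the text.
FIELDS = ["query", "session", "topic", "ocr", "user_id", "role",
          "clipping_id", "score", "timestamp", "changed_at",
          "clipping_chars", "rank"]

def parse_row(line, sep="\t"):
    """Split one query-log line into a dict keyed by the twelve columns.
    The tab separator is an assumption for illustration."""
    values = line.rstrip("\n").split(sep)
    return dict(zip(FIELDS, values))

row = parse_row(
    "sota uutinen\tA\t12\t1\tu07\tstudent\t345678\t2"
    "\t2021-03-09 10:15\t\t5120\t4"
)
```

From rows like this, per-query mean evaluation scores for old (`ocr == "0"`) and new (`ocr == "1"`) OCR can be aggregated directly.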
The interactive information retrieval system balloted the topics for each user so that out of the 32 users' work we obtained 3,893 evaluations. There were 1,861 evaluations of pre-formulated queries and 2,032 evaluations of self-formulated queries. The differences in these numbers are due to the fact that some of the users did not perform all their tasks, and some did more than required: if all 32 users had each performed 120 evaluations (2 sessions, 6 queries per session and 10 evaluations per query), the log would contain 3,840 evaluations evenly divided between the 2 sessions.
Analysis of the self-formulated queries showed that users formulated slightly longer queries than the pre-formulated ones. The mean length of the pre-formulated queries is 2.87 words, and the mean length of the self-formulated queries is 3.15 words, a 9.8% increase. In general, the queries are short and web-like (cf. Jansen et al., 2000). The clippings the users evaluated were of varying lengths. We had set a minimum length of 500 characters for the results shown to users, but no maximum length. The mean length of the clippings over all the results of the two sessions was 6,116 characters. Query structuring was used very little in the users' self-formulated queries: 111 queries used the AND connector and 90 queries the OR connector. Combined AND and OR was used in 20 queries. The truncation operators ? and * were both used 43 times, in 33 queries.
We assumed that the user interface would take care of the number of queries and evaluations each user finished. However, some of the users did not finish all the queries or evaluations in the pre-formulated queries' session, because the possibility of a user quitting prematurely was not handled by the system. In the self-formulated queries' session, the users could edit their queries and resubmit them after they had performed them once, and the query log stored all the results for the same query with its possible variations. These issues were unexpected and should be taken care of in the user interface and instructions.

Results
As the example of the log shows, the result set is multifaceted and could be analysed in many ways. In this study, we concentrate on the analysis of the main results of the query sessions to be able to answer our research questions stated in the second section of the study. Figure 5 shows mean averages for evaluations of the individual queries in both sessions with different quality OCR.
Mean averages for the relevance scores over the whole query set are shown in Table 2. From Figure 5 and Table 2, we can see that especially the pre-formulated queries benefited from improved OCR. The mean average evaluation score for the improved OCR in the pre-formulated queries is 7.93% higher than with the old OCR. In self-formulated queries, the difference is clearly smaller: 1.71%. Three queries, #3, #18 and #27, obtained low evaluations in all the scenarios. Two queries, #6 and #7, obtained low evaluations in part of the scenarios: #6 with pre-formulated queries with both OCR qualities and #7 with self-formulated queries with both OCR qualities.

Inspection of query-per-query results shows that, with pre-formulated queries, the new OCR results gain better mean evaluation scores in 19 cases out of 30. There is one tie, and in 10 queries the old OCR evaluations gain better mean scores. With self-formulated queries, the result is more even: new OCR results gain better mean evaluation scores in 15 cases and old OCR results in 14 cases; there is one tie. This is depicted in detail in Figure 6.

The mean length of the clippings retrieved by the search engine in the two sessions differs quite a lot. With pre-formulated queries the mean length of the clippings was 5,467 characters, and with self-formulated queries 6,711 characters, a 22.75% difference. It is thus possible that the longer result clippings of the self-formulated queries were harder for users to evaluate, and this is reflected in the lower evaluation results, as was seen in Figure 5 and Table 2. With longer clippings, the users may get tired or frustrated, which may lower their evaluations. The longer clippings may also be fuzzier than the shorter ones and contain text from adjacent text segments. This aspect would need further study.
Wilcoxon's signed-rank test (Croft et al., 2010) showed that, in the case of the pre-formulated queries, the improved OCR quality resulted in statistically significantly higher relevance judgment scores than the lower-quality OCR at the level of assessed individual document pairs (p = 0.002). However, when the two OCR quality situations were compared at the level of topics (N = 30) via the mean average of cumulated gain values for the top-10 documents retrieved, the difference was not statistically significant in Wilcoxon's signed-rank test (p = 0.10).
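For reference, the core of Wilcoxon's signed-rank test is the sum of the ranks of the positive paired differences. The minimal sketch below computes that statistic on hypothetical paired scores (one pair per assessed document, two OCR qualities); in practice a statistics library such as SciPy would also supply the p-value:

```python
# Sketch of the Wilcoxon signed-rank statistic W+ for paired scores.
# The example inputs in the tests are hypothetical, not the study's data.

def signed_rank_statistic(old, new):
    """Sum of ranks of positive differences; tied |d| values share mean ranks,
    and zero differences are dropped, as in the standard procedure."""
    diffs = [n - o for o, n in zip(old, new) if n != o]
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ranked):
        j = i
        # extend j over a run of tied absolute differences
        while j + 1 < len(ranked) and abs(diffs[ranked[j + 1]]) == abs(diffs[ranked[i]]):
            j += 1
        mean_rank = (i + j) / 2 + 1          # ranks are 1-based
        for k in range(i, j + 1):
            ranks[ranked[k]] = mean_rank
        i = j + 1
    return sum(r for d, r in zip(diffs, ranks) if d > 0)
```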
In the self-formulated queries' session, statistically significant differences were not observed. Over all user searchers, the self-formulated queries retrieved a set of unique documents almost four times larger than that of the pre-formulated queries; they also differed to some extent from the pre-defined queries (they were slightly longer), and they returned longer documents. However, we did not study these differences and their effects in more depth.

Discussion
Before discussing our results further, we would like to emphasize the key issues and developments that made the present study possible. As Oberbichler et al. (2021) indicate, it is vital to successfully negotiate both the challenges and the opportunities present in typical interdisciplinary research settings. In our research setting, several theoretical and pragmatic (resource-related) obstacles were present, and advanced work with the collection was not self-evident. In general, we have utilized the NLF's document presentation system, different OCR software, an experimental page segmentation software and a search engine. The realistic web query interface and user logging we developed needed to be adapted to this environment. Only the accumulation of much previous work made this study possible. Firstly, experienced OCR quality was the main target of the study, instead of the effects of data-oriented OCR quality, which have been studied probably tens of times with different collections and languages (e.g. Järvelin et al., 2016; Bazzo et al., 2020). As Kumpulainen and Late (2022) show empirically, noise in OCR quality disturbs researchers of historical newspaper collections in both the searching and the selection phases of their research. In general, bad OCR quality concerns all users of the collections, although it may manifest differently to different user groups. Secondly, the simulated interactive work task method introduced by Borlund (2000) was used for gathering the evaluation data. The method is crucial in gaining insight into the measurement of experienced effects. Thirdly, many years of earlier background work in newspaper digitization and development at the NLF made this kind of work conceivable in the first place. Fourthly, we needed to set up the user interface for the query system, design the collection of feedback data and recruit participants for the simulated work task.
Out of these pieces, we created a whole that can be used as a general model for similar future studies.
To the best of our knowledge, this is the first study showing empirically, based on simulated work task situations, that the subjective relevance assessments of the test persons were affected by a change in the quality of the optically read text presented to them. Earlier studies on the effects of OCR quality have been performed in data-oriented settings, often using laboratory-style tests and artificially tampered data, or have described users' subjective experiences of the effects of OCR quality on their work. Our study presents a unique approach and a re-usable methodology, provided that parallel collections of clearly different OCR quality are available for empirical user evaluation.
The answer to our first research question was that pre-formulated and self-formulated queries differ in their results. Users' self-formulated queries were slightly longer than the pre-formulated queries that had been created for the topics. The result documents of the two query sessions differed clearly in length and number. It is possible that the longer result documents of the self-formulated queries were harder for users to evaluate, and they could also contain more text from adjacent text segments, which would make their evaluation more difficult. This aspect would need a more detailed comparison between the result clippings of the two sessions.
The answer to our second research question and its first subquestion is clearly positive. With pre-formulated queries, the improved-OCR clippings gained a 7.93% better mean relevance score, and the difference from the basic-level OCR quality evaluations was statistically significant. The answer to the second subquestion was not as clear. With self-formulated queries, we found a small difference of 1.71% in the evaluations in favour of the improved OCR. However, this difference was not statistically significant.
Our query environment implementation for the evaluation of two optically read text qualities is the first version of the system. As such it works well, but experience from the user sessions showed that it has features that could be improved.
A clear limitation of our research is the assumption that the user interface would take care of the number of queries and evaluations each user finished. However, some of the users did not finish all the queries or evaluations in the pre-formulated queries' session because the possibility of a user quitting prematurely was not handled by the system. Another limitation is that in the self-formulated queries' session, the users could edit their queries after they had performed them, and the query log stored all the results for the same query with possible query variations. These user behaviours were unexpected and should be taken care of in developing the user interface and instructions. In Kettunen et al. (2022), we have analysed our query environment and its components from the point of view of re-usability and suggest improvements to the user interface's functionality. Another limitation of the research is that we did not consider how display qualities affected reading, nor did we compare reading or proofreading speed between computer screen and paper, which have been studied by, e.g. Gould et al. (1986, 1987) and Dillon (1992). Current studies like Köpper et al. (2016), Ocal et al. (2022) and a recent literature review by Vitello (2022) show that there is no real difference in reading speed or error correction when modern TFT-LCD screens and paper are compared. COVID-19 made it impossible to arrange the query sessions in a computer class, and all the participants performed their assessments using their own computers and screens. It would be advisable to arrange evaluation sessions in a computer class where the equipment is standardized and, especially, the computer screens are identical.
Our user group was also not optimal, as it consisted mainly of students. A large enough group of historians, however, would have been hard to recruit, as, e.g. the experience of Kumpulainen and Late (2022) shows. Kumpulainen and Late examined how researchers used digital newspaper collections in their work and what obstacles they encountered using the collections; they were able to find 13 participants who use Finnish historical newspapers in their work. The use of university-level students in our evaluation task can be considered justified: an evaluation task that involves historical information needs requires a higher-level educational background. But as Kelly (2009, p. 69) emphasizes, we cannot over-generalize results based on a convenience sample.
We also had differences between the two query types. In the pre-formulated queries' session, the search was run in Elastic's AND mode, in which all the query words must be matched. In the self-formulated queries' session, users could use the query operators AND, OR and NOT if they wished. Users' self-formulated queries were also slightly longer than the pre-formulated queries. The effect of these differences on the evaluation results is hard to establish conclusively. Evidently, it would be more consistent to instruct users not to use any operators in the self-formulated queries' session and instead run the search in AND mode.
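As an illustration of the difference between the two modes, they can be expressed as Elasticsearch query bodies. This is a sketch only: the index field name "text" and the helper functions are our assumptions, not the study's actual mapping or code.

```python
# Sketch of the two query modes as Elasticsearch request bodies.
# The field name "text" is assumed for illustration.

def and_mode_query(terms):
    """Pre-formulated session: a match query in AND mode, i.e. every
    query word must be present in the retrieved clipping."""
    return {"query": {"match": {"text": {"query": " ".join(terms),
                                         "operator": "and"}}}}

def user_query_string(raw):
    """Self-formulated session: the user's raw query, which may contain
    AND/OR/NOT operators, passed to a query_string query."""
    return {"query": {"query_string": {"query": raw,
                                       "default_field": "text"}}}
```

A match query with `"operator": "and"` ignores any operator-like words typed by the user, whereas query_string interprets AND, OR and NOT, which is why the two sessions behaved differently.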
The clippings the users evaluated were of varying lengths. The mean length of the clippings over all the results of the two sessions was 6,116 characters. With pre-formulated queries, the mean length of the clippings was 5,467 characters, and with self-formulated queries 6,711 characters, a 22.75% difference. It is thus possible that the longer clipping results of the self-formulated queries were harder for users to evaluate and that this is reflected in the lower evaluation results. With longer clippings, the users may get tired or frustrated, which may lower their evaluations. Longer clippings may also be fuzzier than shorter ones in their content.
One possible development issue for the evaluation could be the evaluation of the clippings' overall textual and segmentation quality by the users. Our article segmentation for the collection is experimental, and many of the clippings may be quite hard to read due to fuzzy boundaries: the clippings may contain text from adjacent segments, which affects evaluations. A combination of relevance evaluation scores and users' scores for the quality of the clipping boundaries might bring new insights to the evaluation and reveal the usability of the optically read texts better than the usage of traditional relevance assessments only.
The well-known simulated work tasks used in interactive information retrieval have been used in this study to answer the question of the effect of optical character recognition quality on the relevance evaluation of retrieval results in a Finnish historical newspaper collection. We have shown that an improvement in the OCR quality of documents leads to higher mean relevance evaluation scores in a simulated work task scenario. This means that the perceived usefulness of historical newspaper clippings increases with better OCR quality. Our results were achieved with one language, one specific collection and one user group, but our method is generalizable to any language and can be evaluated with further users and different collections. As our results were achieved with one collection and one language, further studies with different collections and languages should be conducted to strengthen them. So far, studies like Kumpulainen and Late (2022) and Late and Kumpulainen (2021) have raised the subjective concerns of digital humanities researchers about the OCR quality of historical newspaper collections and its effect on their work progress.
Our results should be seen both in the context of information retrieval and in that of the requirements of digital humanities scholars and lay users of the collections. They lend more weight both to digital humanists' need for higher-quality documents and to efforts to improve the quality of optical character recognition through new software developments. Better quality of optically read historical documents should be pursued for the sake of both research and lay users.