A study on automatic creation of a comparable document collection in cross‐language information retrieval

Tuomas Talvensaari (Department of Computer Sciences, University of Tampere, Finland)

Jorma Laurikkala (Department of Computer Sciences, University of Tampere, Finland)

Kalervo Järvelin (Department of Information Studies, University of Tampere, Finland)

Martti Juhola (Department of Computer Sciences, University of Tampere, Finland)

Journal of Documentation

ISSN: 0022-0418

Article publication date: 1 May 2006

Abstract

Purpose

To present a method for creating a comparable document collection from two document collections in different languages.

Design/methodology/approach

The best query keys were extracted from a Finnish source collection (articles of the newspaper Aamulehti) with the relative average term frequency formula. The keys were translated into English with a dictionary‐based query translation program. The resulting lists of words were used as queries that were run against the target collection (Los Angeles Times articles) with the nearest neighbor method. The documents were aligned with unrestricted and date‐restricted alignment schemes, which were also combined.
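
The alignment step described above can be illustrated with a minimal sketch. It is not the authors' implementation: cosine similarity over bags of words stands in for the nearest neighbor ranking, the `window_days` parameter and the rule used in `combine` (prefer the date‐restricted pair, otherwise fall back to the unrestricted one) are assumptions, and all function and variable names are hypothetical. Key extraction and dictionary‐based translation are assumed to have already produced the translated key lists.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def align(queries, targets, window_days=None):
    """Pair each source document with its most similar target document.

    queries: {source_id: (date, [translated key words])}
    targets: {target_id: (date, [document words])}
    window_days: None means unrestricted alignment; otherwise only target
        documents dated within that many days of the source are candidates
        (an assumed form of date restriction).
    Returns {source_id: (target_id, score)} for sources with a nonzero match.
    """
    pairs = {}
    for sid, (sdate, keys) in queries.items():
        qvec = Counter(keys)
        best = None
        for tid, (tdate, words) in targets.items():
            if window_days is not None and abs((tdate - sdate).days) > window_days:
                continue  # outside the allowed date window
            score = cosine(qvec, Counter(words))
            if score > 0 and (best is None or score > best[1]):
                best = (tid, score)
        if best is not None:
            pairs[sid] = best
    return pairs


def combine(restricted, unrestricted):
    """Assumed combination rule: a date-restricted pair overrides an unrestricted one."""
    return {**unrestricted, **restricted}
```

Calling `align` once with `window_days=None` and once with a small window, then passing both results to `combine`, gives one candidate pair per source document where any match exists. How the original study weighted or thresholded the two schemes is not stated in the abstract.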

Findings

The combined alignment scheme was found to be the best when the relatedness of the document pairs was assessed on a five‐degree relevance scale. Of the 400 document pairs, roughly 40 percent were highly or fairly related and 75 percent included at least lexical similarity.

Research limitations/implications

The number of alignment pairs was small due to the short common time period of the two collections and their geographical (and thus topical) remoteness. In the future, our aim is to build larger comparable corpora in various languages and use them as a source of translation knowledge for the purposes of cross‐language information retrieval (CLIR).

Practical implications

Readily available parallel corpora are scarce. With this method, two unrelated document collections can relatively easily be aligned to create a CLIR resource.

Originality/value

The method can be applied to weakly linked collections and morphologically complex languages, such as Finnish.

Citation

Talvensaari, T., Laurikkala, J., Järvelin, K. and Juhola, M. (2006), "A study on automatic creation of a comparable document collection in cross‐language information retrieval", Journal of Documentation, Vol. 62 No. 3, pp. 372-387. https://doi.org/10.1108/00220410610666510

Publisher

Emerald Group Publishing Limited
