Search results

1 – 10 of 352
Content available
Article

Imad Zeroual and Abdelhak Lakhouaja

Recently, data-driven approaches have increasingly demanded multilingual parallel resources, primarily in cross-language studies. To meet these demands, building multilingual…

Abstract

Recently, data-driven approaches have increasingly demanded multilingual parallel resources, primarily in cross-language studies. To meet these demands, building multilingual parallel corpora has become the focus of many Natural Language Processing (NLP) research groups. Unlike monolingual corpora, the number of available multilingual parallel corpora is limited. In this paper, MulTed, a corpus of subtitles extracted from TEDx talks, is introduced. It is multilingual, Part-of-Speech (PoS) tagged, and bilingually sentence-aligned with English as a pivot language. The corpus is designed for NLP applications in which sentence alignment, PoS tagging, and corpus size are influential, such as statistical machine translation, language recognition, and bilingual dictionary generation. Currently, the corpus contains subtitles covering 1100 talks available in over 100 languages. The subtitles are classified by topic, such as Business, Education, and Sport. For the PoS tagging, TreeTagger, a language-independent PoS tagger, is used; then, to make the tagging maximally useful, the tags are mapped to a universal common tagset. Finally, we believe that making the MulTed corpus publicly available can be a significant contribution to the literature of NLP and corpus linguistics, especially for under-resourced languages.
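
The tagset-mapping step described in the abstract can be sketched as follows. This is a minimal illustration only: the fine-grained tags and the universal labels in the mapping table are toy examples, not the actual TreeTagger or MulTed tagsets.

```python
# Map language-specific PoS tags onto a small universal tagset.
# The mapping below is a toy example; the real per-language TreeTagger
# tagsets and the universal tagset used for MulTed differ.

UNIVERSAL_MAP = {
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBD": "VERB",
    "JJ": "ADJ",
    "DT": "DET",
}

def to_universal(tagged_tokens):
    """Replace each fine-grained tag with its universal equivalent,
    falling back to 'X' for tags missing from the mapping."""
    return [(tok, UNIVERSAL_MAP.get(tag, "X")) for tok, tag in tagged_tokens]

print(to_universal([("dogs", "NNS"), ("bark", "VB"), ("loudly", "RB")]))
```

Mapping to a coarse shared tagset is what makes the per-language taggings comparable across the corpus's many languages.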

Details

Applied Computing and Informatics, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2634-1964


Article

Jelena Andonovski, Branislava Šandrih and Olivera Kitanović

This paper aims to describe the structure of an aligned Serbian-German literary corpus (SrpNemKor) contained in the digital library Bibliša. The goal of the research was to…

Abstract

Purpose

This paper aims to describe the structure of an aligned Serbian-German literary corpus (SrpNemKor) contained in the digital library Bibliša. The goal of the research was to create a benchmark Serbian-German annotated corpus searchable with various query expansions.

Design/methodology/approach

The presented research is particularly focused on the enhancement of bilingual search queries in a full-text search of aligned SrpNemKor collection. The enhancement is based on using existing lexical resources such as Serbian morphological electronic dictionaries and the bilingual lexical database Termi.
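
The query-enhancement idea described above, expanding a search term through morphological forms and a synonym layer, can be sketched as follows. The Serbian entries are invented for illustration; they are not taken from the actual morphological e-dictionaries or the Termi database.

```python
# Expand a query term to all of its inflected forms via a morphological
# dictionary, plus the forms of its synonyms, so that a full-text search
# matches every surface form. All entries below are illustrative.

MORPH_DICT = {
    "knjiga": ["knjiga", "knjige", "knjizi", "knjigu", "knjigom"],
}

SYNONYMS = {"knjiga": ["delo"]}  # synonym layer, as a lexical database might supply

def expand_query(term):
    """Return the term's inflected forms plus those of its synonyms;
    unknown terms pass through unchanged."""
    forms = list(MORPH_DICT.get(term, [term]))
    for syn in SYNONYMS.get(term, []):
        forms += MORPH_DICT.get(syn, [syn])
    return forms

print(expand_query("knjiga"))
```

Morphological expansion of this kind matters for highly inflected languages such as Serbian, where a lemma can surface in many distinct word forms.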

Findings

For the purpose of this research, the lexical database Termi is enriched with a bilingual list of German-Serbian translated pairs of lexical units. The list of correct translation pairs was extracted from SrpNemKor, evaluated and integrated into Termi. Also, Serbian morphological e-dictionaries are updated with new entries extracted from the Serbian part of the corpus.

Originality/value

A bilingual search of SrpNemKor in Bibliša is available within the user-friendly platform. The enriched database Termi enables semantic enhancement and refinement of the user's search query, based on synonyms in both Serbian and German, at a very high level. Serbian morphological e-dictionaries facilitate the morphological expansion of search queries in Serbian, thereby enabling the analysis of concepts and concept structures by identifying terms assigned to a concept and by establishing relations between terms in Serbian and German. This makes Bibliša a valuable Web tool that can support research and analysis of SrpNemKor.

Details

The Electronic Library, vol. 37 no. 4
Type: Research Article
ISSN: 0264-0473

Article

Mehrdad Vasheghani Farahani and Zeinab Amiri

In an effort to bridge the gap between applying translation corpora, specialized terminology teaching and translation performance of undergraduate students, the purpose of…

Abstract

Purpose

In an effort to bridge the gap between applying translation corpora, specialized terminology teaching and the translation performance of undergraduate students, the purpose of this paper is to investigate the possible impacts of teaching the specialized terminology of law, as a specific area of inquiry, on the translation performance of Iranian undergraduate translation students (English–Persian language pair). The null hypothesis of this study is that using specialized terminology does not have a statistically significant impact on the translation performance of the translation students.

Design/methodology/approach

The design of this research was experimental in that it involved a pretest, a treatment, a posttest and random sampling. In other words, this research used a pre-experimental one-group pretest–posttest design, chosen because the number of subjects who participated in the research was limited. Apart from being experimental, this research adopted a corpus-based perspective. As McEnery and Hardie (2012) note, corpus-based research uses the “corpus data in order to explore a theory or hypothesis, typically one established in the current literature, in order to validate it, refute it or refine it” (p. 6). Table I shows the design of this research.

Findings

The results of this research indicated that, on the whole, the posttest results differed statistically significantly from those of the pretest. The quality of the students' translations improved after using the specialized terminology in the form of three types of corpora. Indeed, there was a general trend of improved quality in the novice translators' handling of specialized and subject-field terminology in an English–Persian context.

Originality/value

This paper is original in that it probes one of the less researched areas of Translation Studies and employs corpus methodology.

Details

Journal of Applied Research in Higher Education, vol. 11 no. 3
Type: Research Article
ISSN: 2050-7003

Article

Stelios Piperidis

This paper describes the research and development activities carried out in the framework of the Translearn project. The aim of the project is to build a translation…

Abstract

This paper describes the research and development activities carried out in the framework of the Translearn project. The aim of the project is to build a translation memory tool and the appropriate translation work environment. Translearn's application corpus consists of regulations and directives of the European Union (EU), extracted from the CELEX database, the EU's documentation system on EU law, and the language versions it concentrates on are English, French, Portuguese and Greek. The development of the prototype tool for the envisaged system proves the application's usefulness in the translation process of international multilingual organizations as well as in the localization‐internationalization process of international enterprises.

Details

Aslib Proceedings, vol. 47 no. 3
Type: Research Article
ISSN: 0001-253X

Article

Tuomas Talvensaari, Jorma Laurikkala, Kalervo Järvelin and Martti Juhola

To present a method for creating a comparable document collection from two document collections in different languages.

Abstract

Purpose

To present a method for creating a comparable document collection from two document collections in different languages.

Design/methodology/approach

The best query keys were extracted from a Finnish source collection (articles of the newspaper Aamulehti) with the relative average term frequency formula. The keys were translated into English with a dictionary‐based query translation program. The resulting lists of words were used as queries that were run against the target collection (Los Angeles Times articles) with the nearest neighbor method. The documents were aligned with unrestricted and date‐restricted alignment schemes, which were also combined.
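
The key-extraction and translation steps above can be sketched roughly as follows. The RATF variant used here, its parameter values, and the toy Finnish–English dictionary and frequency figures are illustrative assumptions, not the exact formula or resources used in the study.

```python
import math

def ratf(collection_freq, doc_freq, sp=100, p=3):
    """Relative average term frequency: rewards terms that occur
    often overall but in few documents. One published variant of
    the formula; the parameter values here are illustrative."""
    return (collection_freq / doc_freq) * 1000 / math.log(doc_freq + sp) ** p

# Pick the best query keys from a toy source document's statistics
# (collection frequency, document frequency), then translate them
# with a tiny Finnish -> English dictionary.
stats = {"tulivuori": (8, 2), "ja": (500, 400), "purkaus": (5, 1)}
dictionary = {"tulivuori": ["volcano"], "purkaus": ["eruption"]}

keys = sorted(stats, key=lambda t: ratf(*stats[t]), reverse=True)[:2]
query = [en for k in keys for en in dictionary.get(k, [])]
print(query)
```

Note how the common function word "ja" scores far lower than the rarer content words, so only topical terms survive into the translated query that is run against the target collection.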

Findings

The combined alignment scheme was found to be the best when the relatedness of the document pairs was assessed on a five‐degree relevance scale. Of the 400 document pairs, roughly 40 percent were highly or fairly related, and 75 percent showed at least lexical similarity.

Research limitations/implications

The number of alignment pairs was small due to the short common time period of the two collections and their geographical (and thus topical) remoteness. In the future, our aim is to build larger comparable corpora in various languages and use them as a source of translation knowledge for cross‐language information retrieval (CLIR).

Practical implications

Readily available parallel corpora are scarce. With this method, two unrelated document collections can relatively easily be aligned to create a CLIR resource.

Originality/value

The method can be applied to weakly linked collections and morphologically complex languages, such as Finnish.

Details

Journal of Documentation, vol. 62 no. 3
Type: Research Article
ISSN: 0022-0418

Article

Chengzhi Zhang and Dan Wu

Terminology is the set of technical words or expressions used in specific contexts; it denotes the core concepts of a formal discipline and is usually applied in fields such as…

Abstract

Purpose

Terminology is the set of technical words or expressions used in specific contexts; it denotes the core concepts of a formal discipline and is usually applied in fields such as machine translation, information retrieval, information extraction and text categorization. Bilingual terminology extraction plays an important role in bilingual dictionary compilation, bilingual ontology construction, machine translation and cross‐language information retrieval. This paper aims to address the issues of monolingual terminology extraction and bilingual term alignment based on multi‐level termhood.

Design/methodology/approach

A method based on multi‐level termhood is proposed. The new method computes the termhood of both the terminology candidate and the sentence that contains it by corpus comparison. Since terminologies and general words usually have different distributions in a corpus, termhood can also be used to constrain and enhance term alignment when aligning bilingual terms on the parallel corpus. In this paper, bilingual term alignment based on termhood constraints is presented.
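
A minimal sketch of corpus-comparison termhood follows, using a plain log-ratio of relative frequencies as a stand-in for the authors' multi-level measure; the frequency figures are invented for illustration.

```python
import math

def termhood(term, domain_freq, general_freq, domain_size, general_size):
    """Score a candidate by comparing its relative frequency in a
    domain corpus against a general reference corpus. A plain
    log-ratio is used here as a stand-in for the paper's measure;
    the small epsilon guards against zero counts."""
    d = domain_freq.get(term, 0) / domain_size
    g = general_freq.get(term, 0) / general_size
    return math.log((d + 1e-9) / (g + 1e-9))

# Toy counts: a domain term is overrepresented in the domain corpus,
# while a function word has roughly the same relative frequency in both.
domain = {"ontology": 40, "the": 900}
general = {"ontology": 2, "the": 9000}

for t in domain:
    print(t, round(termhood(t, domain, general, 10_000, 1_000_000), 2))
```

The same score can be aggregated over the words of a sentence to rank sentences by how term-rich they are, which is the intuition behind scoring candidates at more than one level.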

Findings

Experimental results show that multi‐level termhood achieves better performance than the existing method for terminology extraction. If termhood is used as a constraining factor, the performance of bilingual term alignment can be improved.

Originality/value

The termhood of the candidate terminology, and of the sentence that contains it, is used for terminology extraction; this is called multi‐level termhood and is computed by corpus comparison. A bilingual term alignment method based on termhood constraints is put forward, and termhood is used in the task of bilingual terminology extraction. Experimental results show that termhood constraints can improve the performance of terminology alignment to some extent.

Article

John Hutchins

In the 1980s the dominant framework of MT was essentially ‘rule‐based’, e.g. the linguistics‐based approaches of Ariane, METAL, Eurotra, etc.; or the knowledge‐based…

Abstract

In the 1980s the dominant framework of MT was essentially ‘rule‐based’, e.g. the linguistics‐based approaches of Ariane, METAL, Eurotra, etc.; or the knowledge‐based approaches at Carnegie Mellon University and elsewhere. New approaches of the 1990s are based on large text corpora, the alignment of bilingual texts, the use of statistical methods and the use of parallel corpora for ‘example‐based’ translation. The problems of building large monolingual and bilingual lexical databases and of generating good quality output have come to the fore. In the past most systems were intended to be general‐purpose; now most are designed for specialized applications, e.g. restricted to controlled languages, to a sublanguage or to a specific domain, to a particular organization or to a particular user‐type. In addition, the field is widening with research under way on speech translation, on systems for monolingual users not knowing target languages, on systems for multilingual generation directly from structured databases, and in general for uses other than those traditionally associated with translation services.

Details

Aslib Proceedings, vol. 47 no. 10
Type: Research Article
ISSN: 0001-253X

Article

Werner Winiwarter

The purpose of this paper is to address the knowledge acquisition bottleneck problem in natural language processing by introducing a new rule‐based approach for the…

Abstract

Purpose

The purpose of this paper is to address the knowledge acquisition bottleneck problem in natural language processing by introducing a new rule‐based approach for the automatic acquisition of linguistic knowledge.

Design/methodology/approach

The author has developed a new machine translation methodology that only requires a bilingual lexicon and a parallel corpus of surface sentences aligned at the sentence level to learn new transfer rules.

Findings

A first prototype of a web‐based Japanese‐English translation system called Japanese‐English translation using corpus‐based acquisition of transfer (JETCAT) has been implemented in SWI‐Prolog, together with a Greasemonkey user script that analyzes Japanese web pages and translates sentences via Ajax. In addition, linguistic information is displayed at the character, word and sentence levels to provide a useful tool for web‐based language learning. An important feature is customization: the user can simply correct translation results, leading to an incremental update of the knowledge base.

Research limitations/implications

This paper focuses on the technical aspects and user interface issues of JETCAT. The author is planning to use JETCAT in a classroom setting to gather first experiences and will then evaluate a real‐world deployment; work has also started on extending JETCAT to include collaborative features.

Practical implications

The research has a high practical impact on academic language education. It also could have implications for the translation industry by superseding certain translation tasks and, on the other hand, adding value and quality to others.

Originality/value

The paper presents an extended version of the paper receiving the Emerald Web Information Systems Best Paper Award at iiWAS2010.

Details

International Journal of Web Information Systems, vol. 7 no. 1
Type: Research Article
ISSN: 1744-0084

Article

Martyn Harris, Mark Levene, Dell Zhang and Dan Levene

The purpose of this paper is to present a language-agnostic approach to facilitate the discovery of “parallel passages” stored in historic and cultural heritage digital archives.

Abstract

Purpose

The purpose of this paper is to present a language-agnostic approach to facilitate the discovery of “parallel passages” stored in historic and cultural heritage digital archives.

Design/methodology/approach

The authors explore a novel and relatively simple approach: a character-based statistical language model combined with a tailored version of the Basic Local Alignment Search Tool (BLAST) to extract exact and approximate string patterns shared between groups of documents.
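
As a rough stand-in for the character-based language model and BLAST-style alignment, the sketch below uses character n-gram overlap to show why character-level matching tolerates spelling variation; it is not the authors' actual pipeline.

```python
def char_ngrams(text, n=4):
    """Set of overlapping character n-grams. Working at the character
    level keeps the method language-agnostic and tolerant of small
    spelling variation (dialect, OCR noise, language change)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def similarity(a, b, n=4):
    """Jaccard overlap of character n-grams between two passages --
    a simple proxy for detecting approximate 'parallel passages'."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# Variant spellings of the same passage still overlap heavily.
print(round(similarity("in the beginning was the word",
                       "in the beginnyng was the worde"), 2))
```

Because most n-grams survive a one-character spelling change, near-identical passages score high even across transcription and OCR errors, which is the property the paper's approach exploits at scale.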

Findings

The approach is applicable to a wide range of languages and compensates for variability in the text of the documents caused by differences in dialect, authorship, language change over time, inaccurate transcriptions, and optical character recognition errors introduced during digitisation.

Research limitations/implications

A number of case studies demonstrate that the approach is practical and generalisable to a wide range of archives with documents in different languages, domains and of varying quality.

Practical implications

The approach described can be applied to any digital archive of historical or contemporary texts: archives recording historic documents, but also collections of more recent news articles, for example.

Social implications

The analysis of “parallel passages” enables researchers to quantify the presence and extent of text-reuse in a collection of documents, which can provide useful data on author style, text genres and cultural contexts.

Originality/value

The approach is novel and addresses a need among humanities researchers for tools that can identify similar documents, and local similarities represented by shared text sequences, in a potentially vast archive of documents. As far as the authors are aware, no existing tools provide the same level of tolerance to the language of the documents.

Details

Journal of Documentation, vol. 76 no. 1
Type: Research Article
ISSN: 0022-0418
