Search results

1 – 1 of 1
Content available
Article
Publication date: 17 July 2020

Imad Zeroual and Abdelhak Lakhouaja

Recently, more data-driven approaches are demanding multilingual parallel resources primarily in the cross-language studies. To meet these demands, building multilingual…

Abstract

Recently, more data-driven approaches are demanding multilingual parallel resources primarily in the cross-language studies. To meet these demands, building multilingual parallel corpora are becoming the focus of many Natural Language Processing (NLP) scientific groups. Unlike monolingual corpora, the number of available multilingual parallel corpora is limited. In this paper, the MulTed, a corpus of subtitles extracted from TEDx talks is introduced. It is multilingual, Part of Speech (PoS) tagged, and bilingually sentence-aligned with English as a pivot language. This corpus is designed for many NLP applications, where the sentence-alignment, the PoS tagging, and the size of corpora are influential such as statistical machine translation, language recognition, and bilingual dictionary generation. Currently, the corpus has subtitles that cover 1100 talks available in over 100 languages. The subtitles are classified based on a variety of topics such as Business, Education, and Sport. Regarding the PoS tagging, the Treetagger, a language-independent PoS tagger, is used; then, to make the PoS tagging maximally useful, a mapping process to a universal common tagset is performed. Finally, we believe that making the MulTed corpus available for a public use can be a significant contribution to the literature of NLP and corpus linguistics, especially for under-resourced languages.

Details

Applied Computing and Informatics, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2210-8327

Keywords

1 – 1 of 1