Open Access

Article

Publication date: 17 July 2020

MulTed: a multilingual aligned and tagged parallel corpus

Abstract

Data-driven approaches increasingly demand multilingual parallel resources, primarily in cross-language studies. To meet this demand, building multilingual parallel corpora is becoming the focus of many Natural Language Processing (NLP) research groups. Compared with monolingual corpora, few multilingual parallel corpora are available. This paper introduces MulTed, a corpus of subtitles extracted from TEDx talks. It is multilingual, Part of Speech (PoS) tagged, and bilingually sentence-aligned with English as a pivot language. The corpus is designed for NLP applications in which sentence alignment, PoS tagging, and corpus size are influential, such as statistical machine translation, language recognition, and bilingual dictionary generation. Currently, the corpus contains subtitles covering 1100 talks, available in over 100 languages. The subtitles are classified by topic, such as Business, Education, and Sport. For PoS tagging, TreeTagger, a language-independent PoS tagger, is used; then, to make the PoS tagging maximally useful, the tags are mapped to a universal common tagset. Finally, we believe that making the MulTed corpus publicly available can be a significant contribution to the literature of NLP and corpus linguistics, especially for under-resourced languages.
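The tagset-mapping step mentioned in the abstract can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not give the corpus's mapping tables, so the `PENN_TO_UNIVERSAL` table, the `to_universal` function, and the fallback tag `X` are hypothetical, and the table covers only a handful of Penn-Treebank-style tags.

```python
# Hypothetical, partial table from Penn-Treebank-style tags to coarse
# universal PoS categories. This is NOT the mapping used for MulTed;
# the abstract does not list it.
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "DT": "DET",
    "IN": "ADP",
    "CD": "NUM",
}


def to_universal(tagged_tokens, table=PENN_TO_UNIVERSAL, fallback="X"):
    """Map (token, tag) pairs to (token, universal_tag) pairs.

    Tags missing from the table are mapped to `fallback` so that
    unknown tags are kept visible rather than silently dropped.
    """
    return [(token, table.get(tag, fallback)) for token, tag in tagged_tokens]


if __name__ == "__main__":
    sentence = [("The", "DT"), ("talks", "NNS"), ("inspire", "VB")]
    print(to_universal(sentence))
```

Keeping one table per language-specific tagset and one shared target tagset is what would make tags comparable across the aligned languages; a real mapping would need a complete table for each tagger's tagset.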

Details

Applied Computing and Informatics, vol. 18 no. 1/2

Type: Research Article

DOI:

ISSN: 2210-8327
