Search results

1 – 10 of 98
Content available
Article
Publication date: 17 July 2020

Imad Zeroual and Abdelhak Lakhouaja

Abstract

Recently, data-driven approaches have increasingly demanded multilingual parallel resources, primarily in cross-language studies. To meet these demands, building multilingual parallel corpora has become the focus of many Natural Language Processing (NLP) research groups. Unlike monolingual corpora, the number of available multilingual parallel corpora is limited. In this paper, MulTed, a corpus of subtitles extracted from TEDx talks, is introduced. It is multilingual, Part-of-Speech (PoS) tagged, and bilingually sentence-aligned with English as a pivot language. The corpus is designed for NLP applications in which sentence alignment, PoS tagging, and corpus size are influential, such as statistical machine translation, language recognition, and bilingual dictionary generation. Currently, the corpus has subtitles covering 1,100 talks available in over 100 languages. The subtitles are classified by topic, such as Business, Education, and Sport. For PoS tagging, TreeTagger, a language-independent PoS tagger, is used; then, to make the tagging maximally useful, the tags are mapped to a universal common tagset. Finally, we believe that making the MulTed corpus publicly available can be a significant contribution to the literature of NLP and corpus linguistics, especially for under-resourced languages.
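
The tag-mapping step described above, from a tagger's language-specific tags to a universal common tagset, can be sketched as follows. The mapping entries and function name are illustrative, not the actual MulTed mapping tables:

```python
# Map fine-grained, language-specific PoS tags to a universal tagset
# (illustrative subset; the real mapping tables are larger).
TAG_MAP = {
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBD": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
}

def to_universal(tagged_tokens):
    """Replace each language-specific tag with its universal tag;
    unknown tags fall back to the catch-all 'X'."""
    return [(tok, TAG_MAP.get(tag, "X")) for tok, tag in tagged_tokens]

print(to_universal([("talks", "NNS"), ("inspire", "VB")]))
```

A mapping of this shape lets downstream tools treat all languages uniformly regardless of which tagger produced the original annotation.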

Details

Applied Computing and Informatics, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2210-8327

Article
Publication date: 6 April 2012

Wen Zeng

Abstract

Purpose

The paper aims to explore the automatic construction of multilingual thesauri based on freely available digital library resources. The key methods and study results are presented. It also proposes a way to automatically extract terms from a multilingual parallel corpus.

Design/methodology/approach

The study adopted natural language processing techniques to analyze the linguistic characteristics of terms, combined with statistical analyses to extract terms from technological documents. The methods consist of automatically extracting and filtering terms, identifying and building relationships among terms, building the multilingual parallel corpus, and extracting Chinese–foreign term pairs by calculating their association probability. The experiments were run on a Java test platform.
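
The term-pair extraction step, selecting Chinese–foreign pairs by association strength over aligned sentences, can be illustrated with a Dice-style co-occurrence score. This is one common association measure, not necessarily the exact probability model the paper uses:

```python
def dice_score(pairs, zh_term, foreign_term):
    """Association score between a Chinese term and a foreign term,
    computed over sentence-aligned pairs. Dice is one common choice;
    the paper's exact association-probability model is not given here."""
    both = sum(1 for zh, f in pairs if zh_term in zh and foreign_term in f)
    zh_n = sum(1 for zh, _ in pairs if zh_term in zh)
    f_n = sum(1 for _, f in pairs if foreign_term in f)
    return 2 * both / (zh_n + f_n) if (zh_n + f_n) else 0.0

# Toy sentence-aligned corpus (illustrative data only)
corpus = [("数字图书馆资源", "digital library resources"),
          ("多语言叙词表", "multilingual thesaurus")]
print(dice_score(corpus, "数字图书馆", "digital library"))  # → 1.0
```

Pairs scoring above a threshold would then be kept as candidate translation pairs for the thesaurus.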

Findings

The study reaches the following conclusions: similarities and differences between the Chinese thesaurus standard and the international thesaurus standard are identified; methods for automatically extracting terms and building relationships among them are presented; and, finally, multilingual term translation sets are generated from real corpora. The results show that the proposed methods perform well: the automatic term translation alignment method outperforms the traditional IBM model method.

Practical implications

The study results can serve as references for the further study and application of automatic multilingual thesaurus construction using Chinese as a pivot language.

Originality/value

The paper proposes new ideas on automatic thesaurus construction in the digital age. The presented method, based on linguistics and statistics, is a new attempt, and the experimental results suggest it is innovative and valuable. In addition, these ideas and methods provide a good starting point for improving the information services of the PRC's National Science and Technology Digital Library.

Details

The Electronic Library, vol. 30 no. 2
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 1 March 1995

Stelios Piperidis

Abstract

This paper describes the research and development activities carried out in the framework of the Translearn project. The aim of the project is to build a translation memory tool and the appropriate translation work environment. Translearn's application corpus consists of regulations and directives of the European Union (EU), extracted from CELEX, the EU's documentation system on EU law, and the language versions it concentrates on are English, French, Portuguese and Greek. The development of a prototype for the envisaged system demonstrates the application's usefulness in the translation process of international multilingual organizations, as well as in the localization and internationalization processes of international enterprises.

Details

Aslib Proceedings, vol. 47 no. 3
Type: Research Article
ISSN: 0001-253X

Article
Publication date: 1 May 2019

Mehrdad Vasheghani Farahani and Zeinab Amiri

Abstract

Purpose

In an effort to bridge the gap between translation corpora, specialized terminology teaching and the translation performance of undergraduate students, the purpose of this paper is to investigate the possible impact of teaching the specialized terminology of law, as a specific area of inquiry, on the translation performance of Iranian undergraduate translation students (English–Persian language pair). The null hypothesis of this study is that using specialized terminology does not have a statistically significant impact on the translation performance of the translation students.

Design/methodology/approach

The design of this research was experimental, in that it involved a pretest, a treatment, a posttest and random sampling. In other words, this research used a pre-experimental one-group pretest–posttest design, chosen because the number of subjects who participated in the research was limited. Apart from being experimental, this research took a corpus-based perspective. As McEnery and Hardie (2012) note, corpus-based research uses "corpus data in order to explore a theory or hypothesis, typically one established in the current literature, in order to validate it, refute it or refine it" (p. 6). Table I shows the design of this research.

Findings

The results of this research indicated that, on the whole, the posttest results differed statistically significantly from the pretest results. The quality of the students' translations improved after using the specialized terminology in the form of three types of corpora. Indeed, there was a general trend of improved quality in the novice translators' rendering of specialized, subject-field terminology in an English–Persian context.

Originality/value

This paper is original in that it probes one of the less researched areas of Translation Studies and employs corpus methodology.

Details

Journal of Applied Research in Higher Education, vol. 11 no. 3
Type: Research Article
ISSN: 2050-7003

Content available
Article
Publication date: 6 April 2012

Daqing He

Abstract

Details

The Electronic Library, vol. 30 no. 2
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 21 September 2012

Ahmet Soylu, Felix Mödritscher, Fridolin Wild, Patrick De Causmaecker and Piet Desmet

Abstract

Purpose

Mashups have been studied extensively in the literature; nevertheless, the large body of work in this area focuses on service/data-level integration and leaves UI-level integration, hence UI mashups, almost unexplored. The latter generates digital environments in which participating sources exist as individual entities; member applications and data sources share the same graphical space, particularly in the form of widgets. However, true integration can only be realized by enabling widgets to respond to events happening in each other. The authors call such integration "widget orchestration" and the resulting application a "mashup by orchestration". This article aims to explore and address challenges in realizing widget-based UI mashups and UI-level integration, prominently in terms of widget orchestration, and to assess their suitability for building web-based personal environments.

Design/methodology/approach

The authors provide a holistic view of mashups and a theoretical grounding for widget-based personal environments. They identify the following challenges: widget interoperability; end-user data mobility as a basis for manual widget orchestration; user behavior mining, for extracting behavioral patterns, as a basis for automated widget orchestration; and infrastructure. The authors introduce functional widget interfaces for application interoperability, exploit semantic web technologies for data interoperability, and realize end-user data mobility on top of this interoperability framework. For user behavior mining they employ semantically enhanced workflow/process mining techniques, with Petri nets as a formal ground. They outline a reference platform and architecture compliant with these strategies, and extend the W3C widget specification accordingly, prominently with a communication channel, to foster standardization. The solution approaches to interoperability and infrastructure are evaluated through a qualitative comparison with the existing literature, and a computational evaluation of the behavior mining approach is provided. The authors realize a prototype widget-based personal learning environment for foreign language learning to demonstrate the feasibility of their solution strategies; the prototype also serves as a basis for the end-user assessment of widget-based personal environments and widget orchestration.

Findings

The evaluation results suggest that the interoperability framework, platform, and architecture have certain advantages over existing approaches, and the proposed behavior mining techniques are adequate for the extraction of behavioral patterns. User assessments show that widget‐based UI mashups with orchestration (i.e. mashups by orchestration) are promising for the creation of personal environments as well as for an enhanced user experience.

Originality/value

This article provides an extensive exploration of mashups by orchestration and their role in the creation of personal environments. Key challenges are described, along with novel solution strategies to meet them.

Article
Publication date: 2 September 2019

Jelena Andonovski, Branislava Šandrih and Olivera Kitanović

Abstract

Purpose

This paper aims to describe the structure of an aligned Serbian-German literary corpus (SrpNemKor) contained in a digital library Bibliša. The goal of the research was to create a benchmark Serbian-German annotated corpus searchable with various query expansions.

Design/methodology/approach

The presented research is particularly focused on the enhancement of bilingual search queries in a full-text search of aligned SrpNemKor collection. The enhancement is based on using existing lexical resources such as Serbian morphological electronic dictionaries and the bilingual lexical database Termi.

Findings

For the purpose of this research, the lexical database Termi is enriched with a bilingual list of German-Serbian translated pairs of lexical units. The list of correct translation pairs was extracted from SrpNemKor, evaluated and integrated into Termi. Also, Serbian morphological e-dictionaries are updated with new entries extracted from the Serbian part of the corpus.

Originality/value

A bilingual search of SrpNemKor in Bibliša is available within the user-friendly platform. The enriched database Termi enables semantic enhancement and refinement of a user's search query based on synonyms, in both Serbian and German, at a very high level. Serbian morphological e-dictionaries facilitate the morphological expansion of search queries in Serbian, thereby enabling the analysis of concepts and concept structures by identifying the terms assigned to a concept and by establishing relations between terms in Serbian and German. This makes Bibliša a valuable web tool that can support research and analysis of SrpNemKor.
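
The kind of query expansion described here, combining synonyms from a bilingual database with morphological variants from an e-dictionary, can be sketched as follows. The function name and sample data are illustrative, not the actual Termi or Bibliša interfaces:

```python
def expand_query(term, synonyms, morph_forms):
    """Expand a search term with synonyms (as from a bilingual lexical
    database) and morphological variants (as from an e-dictionary).
    Both lookup tables here are illustrative stand-ins."""
    variants = {term}
    variants.update(synonyms.get(term, []))
    # Add morphological forms for the term and each synonym found so far.
    variants.update(f for v in list(variants) for f in morph_forms.get(v, []))
    return sorted(variants)

# German "Haus" expanded via a Serbian synonym and its inflected forms
print(expand_query("Haus", {"Haus": ["kuća"]}, {"kuća": ["kuće", "kući"]}))
```

The expanded set of variants can then be OR-ed together into a single full-text query, which is what makes morphologically rich languages like Serbian searchable from a single citation form.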

Details

The Electronic Library, vol. 37 no. 4
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 1 October 1995

John Hutchins

Abstract

In the 1980s the dominant framework of machine translation (MT) was essentially ‘rule‐based’, e.g. the linguistics‐based approaches of Ariane, METAL, Eurotra, etc., or the knowledge‐based approaches at Carnegie Mellon University and elsewhere. New approaches of the 1990s are based on large text corpora, the alignment of bilingual texts, the use of statistical methods and the use of parallel corpora for ‘example‐based’ translation. The problems of building large monolingual and bilingual lexical databases and of generating good‐quality output have come to the fore. In the past most systems were intended to be general‐purpose; now most are designed for specialized applications, e.g. restricted to controlled languages, to a sublanguage or a specific domain, to a particular organization or a particular user type. In addition, the field is widening, with research under way on speech translation, on systems for monolingual users who do not know the target language, on systems for multilingual generation directly from structured databases, and in general on uses other than those traditionally associated with translation services.

Details

Aslib Proceedings, vol. 47 no. 10
Type: Research Article
ISSN: 0001-253X

Article
Publication date: 3 November 2020

Jagroop Kaur and Jaswinder Singh

Abstract

Purpose

Normalization is an important step in all natural language processing applications that handle social media text. Text from social media poses problems that are not present in regular text. Recently, a considerable amount of work has been done in this direction, but mostly for the English language. People who do not speak English code-mix text with their native language and post it on social media using the Roman script; such text further aggravates the normalization problem. This paper discusses the concept of normalization with respect to code-mixed social media text, and a model is proposed to normalize such text.

Design/methodology/approach

The system is divided into two phases: candidate generation and most-probable-sentence selection. The candidate generation task is treated as a machine translation task in which Roman text is the source language and Gurmukhi text is the target language. A character-based translation system is proposed to generate candidate tokens. Once candidates are generated, the second phase uses beam search to select the most probable sentence based on a hidden Markov model.
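
The second-phase selection can be sketched as a beam search over per-token candidate lists, scored HMM-style with emission and bigram transition probabilities. The probabilities and names below are illustrative placeholders, not the paper's trained model:

```python
import math

def beam_search(candidates, trans_prob, beam_width=3):
    """Pick the most probable token sequence from per-token candidate
    lists. Each candidate is (token, emission_prob); trans_prob maps
    (previous_token, token) bigrams to transition probabilities, with
    a small floor for unseen bigrams."""
    beams = [([], 0.0)]  # (sequence so far, log-probability)
    for options in candidates:
        scored = []
        for seq, score in beams:
            prev = seq[-1] if seq else "<s>"
            for tok, emit_p in options:
                p = emit_p * trans_prob.get((prev, tok), 1e-6)
                scored.append((seq + [tok], score + math.log(p)))
        # Keep only the top beam_width partial sequences.
        beams = sorted(scored, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]
```

With toy candidates `[[("a", 0.9), ("b", 0.1)], [("c", 0.5), ("d", 0.5)]]` and a transition table favoring the bigram `("a", "c")`, the search returns `["a", "c"]`.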

Findings

Character error rate (CER) and bilingual evaluation understudy (BLEU) scores are reported. The proposed system is compared with the Akhar software and the RB_R2G system, which are also capable of transliterating Roman text to Gurmukhi, and it outperforms Akhar. The CER and BLEU scores are 0.268121 and 0.6807939, respectively, for ill-formed text.

Research limitations/implications

It was observed that the system produces dialectal variations of a word, or the word with minor errors such as a missing diacritic. A spell checker could improve the output of the system by correcting these minor errors. Extensive experimentation is needed to optimize the language identifier, which would further improve the output. The language model also merits further exploration. The inclusion of wider context, particularly from social media text, is an important area that deserves further investigation.

Practical implications

The practical implications of this study are: (1) development of a parallel dataset containing Roman and Gurmukhi text; (2) development of a dataset annotated with language tags; (3) development of a normalization system, the first of its kind, that proposes a translation-based solution for normalizing noisy social media text from Roman to Gurmukhi and can be extended to any pair of scripts; and (4) use of the proposed system for better analysis of social media text. Theoretically, our study aids the understanding of text normalization in the social media context and opens the door for further research in multilingual social media text normalization.

Originality/value

Existing research focuses on normalizing monolingual text. This study contributes to the development of a normalization system for multilingual text.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 13 no. 4
Type: Research Article
ISSN: 1756-378X

Article
Publication date: 21 September 2012

Dan Wu and Daqing He

Abstract

Purpose

This paper seeks to examine the further integration of machine translation technologies with cross-language information access, providing web users the capability of accessing information beyond language barriers. Machine translation and cross-language information access are related technologies, yet each makes its own unique contribution to handling information in multiple languages. This paper aims to demonstrate that there are many opportunities to further integrate machine translation with cross-language information access, and that the combination can greatly empower web users in their information access.

Design/methodology/approach

Using English and Chinese as the language pair for study, this paper examines machine translation in query translation-based cross-language information access at multiple important points, including query translation, relevance feedback, interactive cross-language information access, out-of-vocabulary term translation, and data fusion. The goal is to gain more insight into the wide range of uses of machine translation in cross-language information access, and to help the community identify promising future directions for both machine translation and cross-language access.
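
A minimal sketch of the query-translation step with a dictionary back-off for untranslated, out-of-vocabulary terms. The `mt_translate` and `bilingual_dict` inputs are hypothetical stand-ins, not the systems studied in the paper:

```python
def translate_query(query, mt_translate, bilingual_dict):
    """Query-translation CLIR step: translate the query with an MT
    system, then back off to a bilingual dictionary for any terms
    the MT system left untranslated (a common OOV strategy)."""
    translated = mt_translate(query)
    out = []
    for term in translated.split():
        # A term the MT system could not translate comes back unchanged,
        # so a dictionary lookup gets a second chance at it.
        out.append(bilingual_dict.get(term, term))
    return " ".join(out)
```

For example, with a mock MT function that only knows "information" and a dictionary entry for "retrieval", the English query "information retrieval" comes out fully translated into Chinese; the translated query is then submitted to an ordinary monolingual retrieval engine.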

Findings

Machine translation can be applied effectively at many points in the cross-language information access process. Queries translated by a machine translation system are of high quality and more robust in handling potentially untranslated terms. Translation enhancement, a relevance feedback method that uses documents returned via machine translation, is not only a valid technique by itself, but also helps generate more robust cross-language information access performance when combined with other relevance feedback techniques. Machine translation is also found to play a significant role in resolving untranslated terms and in data fusion.

Originality/value

This set of comparative empirical studies on integrating machine translation and cross language information access was performed on a common evaluation framework, and examined integration at multiple points of the cross language access process. The experimental results demonstrate the value of further integrating machine translation in cross language information access, and identify interesting future directions for both machine translation and cross language information access research.
