To read this content please select one of the options below:

Roman to Gurmukhi Social Media Text Normalization

Jagroop Kaur (Department of Computer Science and Engineering, Punjabi University, Patiala, India)
Jaswinder Singh (Department of Computer Science and Engineering, Punjabi University, Patiala, India)

International Journal of Intelligent Computing and Cybernetics

ISSN: 1756-378X

Article publication date: 3 November 2020

Issue publication date: 13 November 2020

186

Abstract

Purpose

Normalization is an important step in all the natural language processing applications that are handling social media text. The text from social media poses a different kind of problems that are not present in regular text. Recently, a considerable amount of work has been done in this direction, but mostly in the English language. People who do not speak English code mixed the text with their native language and posted text on social media using the Roman script. This kind of text further aggravates the problem of normalizing. This paper aims to discuss the concept of normalization with respect to code-mixed social media text, and a model has been proposed to normalize such text.

Design/methodology/approach

The system is divided into two phases – candidate generation and most probable sentence selection. Candidate generation task is treated as machine translation task where the Roman text is treated as source language and Gurmukhi text is treated as the target language. Character-based translation system has been proposed to generate candidate tokens. Once candidates are generated, the second phase uses the beam search method for selecting the most probable sentence based on hidden Markov model.

Findings

Character error rate (CER) and bilingual evaluation understudy (BLEU) score are reported. The proposed system has been compared with Akhar software and RB\_R2G system, which are also capable of transliterating Roman text to Gurmukhi. The performance of the system outperforms Akhar software. The CER and BLEU scores are 0.268121 and 0.6807939, respectively, for ill-formed text.

Research limitations/implications

It was observed that the system produces dialectical variations of a word or the word with minor errors like diacritic missing. Spell checker can improve the output of the system by correcting these minor errors. Extensive experimentation is needed for optimizing language identifier, which will further help in improving the output. The language model also seeks further exploration. Inclusion of wider context, particularly from social media text, is an important area that deserves further investigation.

Practical implications

The practical implications of this study are: (1) development of parallel dataset containing Roman and Gurmukhi text; (2) development of dataset annotated with language tag; (3) development of the normalizing system, which is first of its kind and proposes translation based solution for normalizing noisy social media text from Roman to Gurmukhi. It can be extended for any pair of scripts. (4) The proposed system can be used for better analysis of social media text. Theoretically, our study helps in better understanding of text normalization in social media context and opens the doors for further research in multilingual social media text normalization.

Originality/value

Existing research work focus on normalizing monolingual text. This study contributes towards the development of a normalization system for multilingual text.

Keywords

Citation

Kaur, J. and Singh, J. (2020), "Roman to Gurmukhi Social Media Text Normalization", International Journal of Intelligent Computing and Cybernetics, Vol. 13 No. 4, pp. 407-435. https://doi.org/10.1108/IJICC-08-2020-0096

Publisher

:

Emerald Publishing Limited

Copyright © 2020, Emerald Publishing Limited

Related articles