A set of parameters for automatically annotating a Sentiment Arabic Corpus

Guellil Imane (Laboratoire des Méthodes de Conception des Systèmes, Ecole Nationale Supérieure d’Informatique, Oued-Smar, Alger, Algérie)
Darwish Kareem (Qatar Computing Research Institute (QCRI), Doha, Qatar)
Azouaou Faical (Laboratoire des Méthodes de Conception des Systèmes, Ecole Nationale Supérieure d’Informatique, Oued-Smar, Alger, Algérie)

International Journal of Web Information Systems

ISSN: 1744-0084

Article publication date: 2 September 2019

Issue publication date: 15 October 2019

Abstract

Purpose

This paper proposes an approach for automatically annotating a large corpus in an Arabic dialect. The corpus is used to analyse the sentiments of Arabic-speaking users on social media. The work focuses on the Algerian dialect, a sub-dialect of Maghrebi Arabic. Although Algerian is spoken by roughly 40 million people, few studies address its automated processing in general, or sentiment analysis in particular.

Design/methodology/approach

The approach is based on constructing a sentiment lexicon and using it to automatically annotate a large corpus of Algerian text extracted from Facebook. This significantly increases the size of the training corpus without resorting to manual annotation. The annotated corpus is then vectorized using document embeddings (doc2vec), an extension of word embeddings (word2vec). For sentiment classification, the authors used several classifiers, such as support vector machines (SVM), Naive Bayes (NB) and logistic regression (LR).
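The lexicon-based annotation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy lexicon, its English placeholder words, the whitespace tokenizer and the function name `annotate` are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical toy lexicon: word -> polarity score in [-1, 1].
# The paper's lexicon covers the Algerian dialect; English placeholders are used here.
LEXICON = {"good": 1.0, "great": 1.0, "bad": -1.0, "awful": -1.0}


def annotate(message: str, lexicon: dict[str, float] = LEXICON) -> Optional[str]:
    """Label a message from the summed polarity of its lexicon words.

    Returns "positive", "negative", or None when no label can be assigned.
    """
    tokens = message.lower().split()  # naive whitespace tokenization (assumption)
    score = sum(lexicon.get(tok, 0.0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return None  # no sentiment words, or positive and negative words cancel out


messages = ["great phone good price", "awful service", "see you tomorrow"]
corpus = [(m, annotate(m)) for m in messages]
# Only messages that received a label would go into the training corpus.
labelled = [(m, label) for m, label in corpus if label is not None]
```

In this sketch, messages without any lexicon match are simply left out, which is one plausible way to build a training set without manual labelling.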

Findings

The results suggest that the NB and SVM classifiers generally gave the best results and the multilayer perceptron (MLP) generally gave the worst. Further, the threshold used to select messages for the training set had a noticeable impact on recall and precision, with a threshold of 0.6 producing the best results. Using the PV-DBOW (distributed bag of words) variant of doc2vec led to slightly higher results than using PV-DM (distributed memory). Combining PV-DBOW and PV-DM representations led to slightly lower results than using PV-DBOW alone. The best results were obtained by the NB classifier, with F1 up to 86.9 per cent.
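To make the role of the threshold and of the reported metrics concrete, the sketch below filters automatically annotated messages by a confidence score and computes precision, recall and F1. How the thresholded score is defined is not stated in this abstract, so the confidence values, the function names and the toy data are assumptions.

```python
def select_for_training(scored, threshold=0.6):
    """Keep (text, label) pairs whose confidence reaches the threshold."""
    return [(text, label) for text, label, conf in scored if conf >= threshold]


def precision_recall_f1(gold, predicted, positive="positive"):
    """Compute precision, recall and F1 for one class from parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical (text, label, confidence) triples from an automatic annotator.
scored = [
    ("msg a", "positive", 0.9),
    ("msg b", "negative", 0.5),
    ("msg c", "negative", 0.6),
]
training = select_for_training(scored)  # keeps "msg a" and "msg c"

# Hypothetical gold labels and classifier predictions.
gold = ["positive", "positive", "negative", "negative"]
pred = ["positive", "negative", "positive", "negative"]
p, r, f1 = precision_recall_f1(gold, pred)
```

Under these assumptions, a higher threshold keeps fewer messages for training, which is one way the choice of threshold could shift precision and recall.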

Originality/value

The principal contribution of this paper is determining suitable parameters for automatically annotating an Algerian dialect corpus. The annotation relies on a sentiment lexicon that was itself constructed automatically.

Citation

Imane, G., Kareem, D. and Faical, A. (2019), "A set of parameters for automatically annotating a Sentiment Arabic Corpus", International Journal of Web Information Systems, Vol. 15 No. 5, pp. 594-615. https://doi.org/10.1108/IJWIS-03-2019-0008

Publisher

Emerald Publishing Limited

Copyright © 2019, Emerald Publishing Limited