This paper proposes an approach for automatically annotating a large corpus in Arabic dialect. The corpus is used to analyse the sentiments of Arabic-speaking users on social media. The paper focuses on the Algerian dialect, which is a sub-dialect of Maghrebi Arabic. Although Algerian is spoken by roughly 40 million people, few studies address its automated processing in general or its sentiment analysis in particular.
The approach is based on the construction and use of a sentiment lexicon to automatically annotate a large corpus of Algerian text extracted from Facebook. This approach significantly increases the size of the training corpus without resorting to manual annotation. The annotated corpus is then vectorized using document embeddings (doc2vec), an extension of word embeddings (word2vec). For sentiment classification, the authors used different classifiers such as support vector machines (SVM), Naive Bayes (NB) and logistic regression (LR).
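The lexicon-based annotation step can be illustrated with a minimal sketch. It is not the authors' implementation: the toy English lexicon, the `annotate` function and the way the confidence score is computed (share of matched lexicon words that agree with the majority polarity) are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Optional

# Hypothetical toy lexicon: word -> polarity (+1 positive, -1 negative).
# These entries are illustrative placeholders, not taken from the paper's lexicon.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "awful": -1, "hate": -1}


def annotate(message: str, threshold: float = 0.6) -> Optional[str]:
    """Return 'positive', 'negative', or None when the message is left unlabelled.

    Assumption: confidence is the share of matched lexicon words that agree
    with the majority polarity; messages below the threshold are discarded.
    """
    tokens = message.lower().split()
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    if not hits:
        return None  # no sentiment-bearing word found
    pos = sum(1 for h in hits if h > 0)
    neg = len(hits) - pos
    confidence = max(pos, neg) / len(hits)
    if pos == neg or confidence < threshold:
        return None  # ambiguous message, kept out of the training set
    return "positive" if pos > neg else "negative"


# Only confidently labelled messages would enter the training corpus.
messages = ["good great bad", "good bad", "hello world", "awful awful good"]
training_set = [(m, annotate(m)) for m in messages if annotate(m) is not None]
```

In this sketch, raising `threshold` keeps fewer but less ambiguous messages, which is one plausible way a selection threshold could trade corpus size against label quality.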
The results suggest that the NB and SVM classifiers generally led to the best results and the multilayer perceptron (MLP) generally had the worst. Further, the threshold that the authors use to select messages for the training set had a noticeable impact on recall and precision, with a threshold of 0.6 producing the best results. Using the distributed bag-of-words variant of doc2vec (PV-DBOW) led to slightly higher results than using the distributed memory variant (PV-DM). Combining PV-DBOW and PV-DM representations led to slightly lower results than using PV-DBOW alone. The best results were obtained by the NB classifier, with an F1 of up to 86.9 per cent.
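For readers less familiar with the reported metrics, the relation between precision, recall and F1 can be sketched as follows. The function name and the toy labels are assumptions for illustration; the abstract does not state whether the reported F1 is per class or averaged, so this sketch computes it for a single class only.

```python
def precision_recall_f1(gold, predicted, positive_label="positive"):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels, not data from the paper.
gold = ["positive", "positive", "negative", "negative"]
predicted = ["positive", "negative", "positive", "negative"]
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

Because F1 is a harmonic mean, it drops sharply when either precision or recall is low, which is why a selection threshold that shifts one of them can visibly change the final score.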
The principal originality of this paper lies in determining the right parameters for automatically annotating an Algerian dialect corpus. The annotation relies on a sentiment lexicon that was itself constructed automatically.
Imane, G., Kareem, D. and Faical, A. (2019), "A set of parameters for automatically annotating a Sentiment Arabic Corpus", International Journal of Web Information Systems, Vol. 15 No. 5, pp. 594-615. https://doi.org/10.1108/IJWIS-03-2019-0008
Emerald Publishing Limited
Copyright © 2019, Emerald Publishing Limited