The construction of an accurate Arabic sentiment analysis system based on resources alteration and approaches comparison

Purpose – This paper proposes a multi-facet sentiment analysis system. Design/methodology/approach – This paper uses multidomain resources to build a sentiment analysis system. Features based on a manual lexicon are extracted from the resources and fed into machine learning classifiers, whose performance is then compared. Because constructing the manual lexicon is time consuming, it is replaced with a custom bag-of-words (BOW). To help the system run faster and make the model interpretable, the characteristic vector is optimized by employing different existing and custom approaches such as term occurrence, information gain, principal component analysis, semantic clustering, and POS tagging filters. Findings – The proposed system, featuring automated lexicon extraction and optimization of the characteristic vector size, proved its efficiency when applied to multidomain and benchmark datasets by reaching 93.59% accuracy, which makes it competitive with state-of-the-art systems. Originality/value – The construction of a custom BOW, and the optimization of features based on existing and custom feature selection and clustering approaches.


Introduction
Opinions influence human activities; analyzing them therefore helps to predict the resulting behavior. However, extracting relevant information from massive amounts of data remains a difficult challenge for humans, which raises the need for information technologies such as opinion mining and sentiment analysis.
Sentiment analysis is one of the most active research domains that deal with Web mining studies and data classification. Text analysis requires natural language processing tools and analysis approaches that will be applied to text. The main target of sentiment analysis is to identify the inferred polarity within reviews [1].
Training a model on a characteristic vector of considerable size is time consuming and makes the results hard to analyze. As a result, dimensionality reduction techniques are required. Feature selection is considered a pre-processing step for a machine learning-based system, and its primary target is to reduce data dimensionality. Dimensionality reduction approaches may be linear or non-linear [2,3]. Reducing dimensionality lowers storage requirements as well as computation time [2,4,5] and improves model readability and interpretation by reducing the number of features. It also helps the model to train faster and helps to overcome overfitting.
We can distinguish between the following types of feature selection approaches: filter-based, wrapper-based, embedded, and hybrid. Each approach involves selecting the subset of features that performs best according to a specific algorithm [4].
In this paper, we perform linguistic analysis on an opinionated multi-domain corpus. Afterward, we go through the model characteristics in depth: how they are retrieved, how pertinent features are selected and how their dimensionality is reduced. Then, we build a classifier from them. The feature vector may be shrunk not only by using existing methods but also by introducing custom approaches that cluster the characteristics. We follow a custom approach to reduce dimensionality and perform lexicon semantic clustering by defining a set of sentiment clusters, where each lexicon word is added to the relevant cluster. Moreover, we use a part-of-speech (POS) tagger [6] to cluster the lexicon by defining noun, adjective and verb clusters. The selected features are evaluated and compared to the generated ones in terms of size and performance. Furthermore, the system performance is then compared with state-of-the-art systems.

Previous work
Sentiment analysis is considered a subcomponent technology for other decision-making systems [7] that help to understand people's attitudes and gender expressions [8], improve features, find out strengths and weaknesses based on the online reviews of potential users, and identify problems on social networking sites such as Facebook and Twitter [9,10]. It is also used to predict sentiment changes over time [11].
Subjectivity, sentiment analysis levels, and opinion types, in addition to SA resources, have been outlined in important studies in the realm of sentiment analysis. They emphasized multiple classification approaches and investigated cross-domain and cross-language variations as well as the impact of summarization [1,12].
Sentiment analysis systems rely on linguistic resources, namely sentiment corpora and lexicons. In the labeled corpora, subjective tweets may be positive, negative, neutral or mixed. SA systems require sentiment corpus annotation and lexicon extraction [13]. The Opinion Corpus for Arabic (OCA) is a collection of Arabic movie reviews, from which the English version, EVOCA, is generated [14]. LABR has almost 63,000 book reviews scored from 1 to 5 stars [15]. ArSAS and ArSentD-LEV are, respectively, Arabic speech-act and Levantine multi-topic sentiment analysis corpora [16]. The ArSAS [17], SemEval 2017 [18] and ASTD [19] corpora contain tweets annotated as positive, negative or neutral. Sentiment lexicons contain sentiment terms that are verified manually or obtained automatically by machine translation [20]. SentiWordNet is a lexical resource that assigns numerical scores to each WordNet synset based on objectivity, positivity, and negativity [21]. ArSEL is a comprehensive Arabic Sentiment and Emotion Lexicon [16]. Gold tags and existing lexicons such as SentiWordNet 3.0 help to expand and evaluate other polarity lexicons [22]. Different Bag-of-Words (BOW) aspects have been investigated to detect sentiments [23]. Many studies have been conducted to determine the domain of lexicon words and to disambiguate their meaning based on fuzzy lexico-semantic and word meaning similarities [24][25][26].
Sentiment classification can be bipolar (positive, negative), tripolar (positive, negative, mixed) or fine grained that considers the strength of positive and negative polarities [10,19]. Moreover, it can be applied to document, sentence, phrase and aspect levels based on grammatical and semantic orientation approaches [27][28][29]. Text categorization techniques based on subjectivity summarization can be applied to subjective documents [30].
Sentiment classification approaches are either unsupervised, relying on dictionaries [31] and rules [32], or supervised, building a model from a labeled corpus [33]. Since supervised approaches are domain dependent, algorithms that address domain independence have been proposed [34,35]. Moreover, deep learning models have been applied to multidomain Arabic sentiment analysis [36]. An attention-based bidirectional CNN-RNN deep model that extracts both past and future contexts [37], as well as a convolutional LSTM model [38], have been used. Word embedding parameter variation and hyperparameter tuning for machine learning algorithms have been undertaken to assess their impact on Arabic sentiment analysis performance [39,40]. Many studies provided systems based on word2vec, CNN and LSTM as well as a collection of open-source tools for Arabic natural language processing tasks such as sentiment analysis using AraBERT and mBERT [41,42]. BERT post-training has been performed for aspect-based sentiment analysis [43]. Extensive comparisons of effective approaches [44] and deep learning frameworks [45] for Arabic sentiment analysis have been performed. Different valuable tools [46] as well as challenges and trends of sentiment analysis [45] have been presented. Semi-supervised learning algorithms develop patterns with great generalizability from a limited labeled sample [47]. Semi-supervised learning may be used to predict users' personality traits, which may improve personalized services and human psychology research [48]. Besides the semi-supervised approaches, there are clustering approaches that segment data into different classes without the need for annotated data or pre-trained models [49,50].
The linguistic content generated by Web users is multilingual, since it may contain various languages, combine different dialects or languages, or switch between them within the same expression. Aside from multilingual sentiment analysis, English resources and sentiment classification approaches have been adapted to other languages [51][52][53]. Besides MSA sentiment analysis, many studies have focused on Arabic dialects [54,55] and integrated stem and lemma lexicon morphologies [56]. Other studies carried out an in-depth study of Arabic and multilingual sentiment analysis, and presented their approaches and tools as well as their challenges [57,58]. Sentiment analysis faces many challenges, including spam, polarity fuzziness, sarcasm, domain dependency, fake news, Arabic varieties, language morphology and code-switching [1,57,59,60].
Since sentiment analysis is considered a classification domain, feature selection has gained the interest of researchers, who have presented feature selection algorithms, their applications and categories [2,3,5,61], addressed their strengths and challenges [62], and ranked fundamental dimensionality reduction algorithms [2] according to relevance [63], computation time [4,64], and the degree of matching between the algorithm's result and the known optimal solution [65][66][67].

Linguistic resources
3.1 Corpus collection
The employed corpora, which vary in terms of domain and length, were extracted from different websites by the authors of [68] and cover various domains: Hotels (HTL), Products (PROD), Movies (MOV) and Restaurants (RES). Table 1 summarizes the statistics of the used corpora, which were preprocessed by removing all non-Arabic characters, namely Latin letters, punctuation marks, and digits.

Lexicon
3.2.1 Manual lexicon.
The domain-specific lexicon, whose statistics are given in Table 1, was established by Ref. [68], then re-checked, altered and cleaned by Ref. [56]. We created two further lexicons by browsing the investigated corpora: negation words (NW), containing 167 negation indicators that reverse the polarity of terms, and a set (SW) of 558 stop words that keep the same meaning regardless of the context. Manual lexicon extraction and adjustment is a difficult and time-consuming task. Hence, we describe next the steps followed to construct the Bag-of-Words.
We aim to improve the classical Bag-of-Words extraction by addressing automation, domain dependency and semantic disambiguation. For this, we opt for a custom approach that generates a BOW automatically through several filtering and threshold decision steps. Figure 1 describes the BOW construction process. The custom BOW weighs terms based on their occurrences. After pretreatment, we tokenize positive and negative reviews into raw positive and negative lexicon terms using a space delimiter, and remove stop words, negation words, redundancy and the intersection between the positive and negative lexicons. We automatically define the threshold (Th), used to obtain the BOW size, as the average of lexicon term occurrences in either the positive or the negative corpus. After each filtering operation (F_O), the reduced BOW consists only of terms whose occurrences are greater than the threshold. Table 1 presents the initial and reduced BOW sizes, and the reduction rate obtained with the thresholds computed for each domain.
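The BOW construction steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `count_terms` and `reduce_bow` are hypothetical, English toy tokens stand in for Arabic terms, and computing the threshold after the shared terms have been removed is an assumption, since the text does not fix the order of these two steps.

```python
from collections import Counter


def count_terms(reviews, stop_words, negation_words):
    """Count space-delimited terms, skipping stop words and negation words."""
    counts = Counter()
    for review in reviews:
        for term in review.split(" "):
            if term and term not in stop_words and term not in negation_words:
                counts[term] += 1
    return counts


def reduce_bow(pos_counts, neg_counts):
    """Drop terms shared by both lexicons, then keep terms above the mean count.

    Assumption: one threshold (Th) is computed per polarity as the average
    occurrence of the terms that remain after removing the intersection.
    """
    shared = pos_counts.keys() & neg_counts.keys()
    reduced = []
    for counts in (pos_counts, neg_counts):
        own = {t: c for t, c in counts.items() if t not in shared}
        threshold = sum(own.values()) / len(own) if own else 0.0
        reduced.append({t: c for t, c in own.items() if c > threshold})
    return reduced[0], reduced[1]
```

Counting with a `Counter` removes redundancy implicitly, because each distinct term appears once with its occurrence count. Applying `reduce_bow` repeatedly to its own output would correspond to successive filtering operations, each with a recomputed threshold.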

System methodology
We aim to construct a sentiment analysis system that addresses the characteristics of the research topic and improves the optimization approaches. It can serve as a roadmap for many classification domains whose main target is the separation of data with similar characteristics as well as their interpretation. Figure 2 presents the system architecture.

Classification approaches
Machine learning algorithms prefer well defined fixed-length inputs and outputs. In the following, we describe the extracted features and how models are generated from the data using machine learning approaches.
4.1.1 Unsupervised. The unsupervised approach relies on the dominant score of the sentimental terms within an expression: a review is labeled with the polarity that obtains the higher score.
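As an illustration of this rule, the sketch below labels a review by comparing the number of positive and negative lexicon terms it contains. It is only a sketch under stated assumptions: every term contributes a score of 1, ties are returned as "undecided", and the function name `label_review` is hypothetical; none of these details are specified in the paper.

```python
def label_review(review, positive_terms, negative_terms):
    """Label a review with the polarity whose lexicon terms occur most often."""
    tokens = review.split()
    pos_score = sum(1 for t in tokens if t in positive_terms)
    neg_score = sum(1 for t in tokens if t in negative_terms)
    if pos_score > neg_score:
        return "positive"
    if neg_score > pos_score:
        return "negative"
    # Assumption: the paper does not say how ties are resolved.
    return "undecided"
```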
4.1.2 Classical supervised. We represent each comment by a vector $V_W$ based on lexicon terms. The characteristic vector $V_W = (P_1, \dots, P_p, N_1, \dots, N_n, P_W, N_W, \overline{P}_W, \overline{N}_W)$ of a review $W$ is composed of the occurrences of terms from the positive lexicon, $P_i$ ($1 \le i \le p$), and the negative lexicon, $N_j$ ($1 \le j \le n$), as well as their sums $P_W$ and $N_W$, and the number of times they have been preceded by a negation term, $\overline{P}_W$ and $\overline{N}_W$ [56]. Afterward, we build a model $V_W$-SVM using 80% of each corpus for training and 20% for testing.
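The layout of this characteristic vector can be made concrete with a short sketch. The code below is an illustration only: the function name `characteristic_vector` is hypothetical, and treating a term as negated when the immediately preceding token is a negation word is an assumption about the negation window, which the text does not specify.

```python
def characteristic_vector(tokens, pos_lexicon, neg_lexicon, negation_words):
    """Build [P_1..P_p, N_1..N_n, sum(P), sum(N), negated P, negated N]."""
    pos_index = {term: i for i, term in enumerate(pos_lexicon)}
    neg_index = {term: i for i, term in enumerate(neg_lexicon)}
    pos_occ = [0] * len(pos_lexicon)
    neg_occ = [0] * len(neg_lexicon)
    negated_pos = 0
    negated_neg = 0
    for i, token in enumerate(tokens):
        # Assumption: only the directly preceding token can negate a term.
        negated = i > 0 and tokens[i - 1] in negation_words
        if token in pos_index:
            pos_occ[pos_index[token]] += 1
            negated_pos += int(negated)
        elif token in neg_index:
            neg_occ[neg_index[token]] += 1
            negated_neg += int(negated)
    return pos_occ + neg_occ + [sum(pos_occ), sum(neg_occ), negated_pos, negated_neg]
```

The vector length is p + n + 4, so it grows with the lexicon; this is the property that the later clustering approaches are designed to remove.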
4.1.3 Deep learning. From the training corpus we create a Word2Vec model that transforms the words of the corpus with a frequency greater than 5 into numeric vectors of size 300, using a window size of 10; these values are an empirical choice. We pad the arrays to provide a consistent representation for reviews of different lengths; the mask matrix contains 1 where data is present and 0 otherwise. The input corpus is thus represented by a characteristic matrix, labels (positive or negative) and their masks. For the second model, we use the vector $V_W$ described in section 4.1.2 to represent the corpus. Subsequently, the set of vectors is fed to a neural network for weight estimation. The neural network is made up of an input layer and the inner LSTM and RNN layers, which yields the Word2Vec-RN and $V_W$-RN models. For the LSTM layer, we initialize weights using Xavier initialization, use Adam as the updater, and Tanh as the activation function. Finally, the RNN layer has a softmax activation function that gives a probability distribution over the classes, and the loss is defined using the MCXENT (multi-class cross-entropy) function.
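The padding and masking step can be illustrated with a small, framework-independent sketch. It assumes right-padding to the length of the longest review and uses integer placeholders where the real system would hold 300-dimensional word vectors; the function name `pad_with_mask` is hypothetical.

```python
def pad_with_mask(sequences, pad_value=0):
    """Right-pad sequences to a common length and return matching 0/1 masks."""
    max_len = max((len(seq) for seq in sequences), default=0)
    padded, masks = [], []
    for seq in sequences:
        gap = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * gap)
        # 1 marks a real time step, 0 marks padding to be ignored by the network.
        masks.append([1] * len(seq) + [0] * gap)
    return padded, masks
```

The mask lets a recurrent layer skip the padded positions, so short reviews are not penalized for the filler values appended to them.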

Characteristic vector optimization
We optimize the characteristic vector using existing approaches, namely the term occurrence filter (TO), information gain (IG) and PCA, and custom approaches based on BOW reduction and on semantic and morphological clustering. These approaches help to maintain high accuracy while reducing the number of features, execution time, and storage requirement, and while improving model interpretation.
4.2.1 Classical filtering. We use term occurrence based on the characteristic vector $V_W$, information gain and PCA to perform data filtering. The information gain of an attribute is measured with respect to the class. PCA transforms a dataset into a new dataset of lower dimensionality by identifying the correlations within it.
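For reference, information gain is commonly defined as the reduction in class entropy obtained by conditioning on a feature, $IG(C, A) = H(C) - H(C \mid A)$. The sketch below computes it for a discrete feature; it assumes feature values are used as-is (term counts treated as categories), whereas a practical tool may discretize numeric features first. The function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Class entropy minus the entropy remaining after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

Ranking features by this score and keeping the highest-ranked ones is the usual filter-style use of information gain.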
4.2.2 Custom filtering.
4.2.2.1 BOW size reduction. In this paper, we replace the manual lexicon construction and semantic verification with a BOW constructed using a custom automatic approach. We feed the characteristic vector $V_W$ based on the BOW lexicons into an SVM classifier to build a supervised sentiment analysis system.
4.2.2.2 Lexicon semantic clustering. In order to reduce the size of the characteristic vector $V_W$, we reduce the number of lexicon segments to twelve: $V_{WS} = (P_1, \dots, P_6, N_1, \dots, N_6, P_W, N_W, \overline{P}_W, \overline{N}_W)$, where the last four features are those of $V_W$ (the sums of positive and negative term occurrences and the counts of positive and negative terms preceded by a negation term). The positive segments are Love, Optimism, Joy, Satisfaction, Entertainment, and Relief, whereas the negative segments are Hatred, Pessimism, Sadness, Dissatisfaction, Boredom and Fear.
4.2.2.3 Lexicon POS clustering. We perform classification using a model built from an optimized characteristic vector $V_{WP} = (P_V, P_N, N_V, N_N, P_W, N_W, \overline{P}_W, \overline{N}_W)$ composed of the occurrences of the terms of the four POS classes as well as the last four features of $V_W$. The vector is based on POS clustering, where each of the positive and negative lexicons is segmented using the POS tagger [6] into two classes V and N, where V is the class of verbs and N is that of adjectives and nouns. The segmentation is followed by a light manual check, which confirmed the efficiency of the automatic tagging. Adjectives and nouns are merged into the same category since both tags can apply to the same term in some cases, for instance the term سعيد /sEyd/ (happy). Moreover, adjectives are abundant in the employed lexicon, in line with many previous studies that consider them the most significant class for sentiment analysis because they are the strongest clues for subjectivity.
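The effect of clustering on the vector size can be illustrated as follows. This sketch builds the 16-feature semantic vector (twelve cluster counts plus the four summary features); the mapping `term_cluster` from lexicon terms to clusters is a hypothetical input standing in for the manual semantic assignment, English toy tokens stand in for Arabic terms, and negation is assumed to apply only to the directly following term.

```python
POSITIVE_CLUSTERS = ["Love", "Optimism", "Joy", "Satisfaction", "Entertainment", "Relief"]
NEGATIVE_CLUSTERS = ["Hatred", "Pessimism", "Sadness", "Dissatisfaction", "Boredom", "Fear"]


def semantic_vector(tokens, term_cluster, negation_words):
    """Return 12 cluster counts followed by sum(P), sum(N), negated P, negated N."""
    clusters = POSITIVE_CLUSTERS + NEGATIVE_CLUSTERS
    counts = dict.fromkeys(clusters, 0)
    negated_pos = 0
    negated_neg = 0
    for i, token in enumerate(tokens):
        cluster = term_cluster.get(token)
        if cluster is None:
            continue  # token is not a lexicon term
        counts[cluster] += 1
        # Assumption: a term is negated when the previous token is a negation word.
        if i > 0 and tokens[i - 1] in negation_words:
            if cluster in POSITIVE_CLUSTERS:
                negated_pos += 1
            else:
                negated_neg += 1
    pos_total = sum(counts[c] for c in POSITIVE_CLUSTERS)
    neg_total = sum(counts[c] for c in NEGATIVE_CLUSTERS)
    return [counts[c] for c in clusters] + [pos_total, neg_total, negated_pos, negated_neg]
```

Whatever the lexicon size, the output always has 16 entries. The POS variant would follow the same pattern with four classes (positive and negative terms split into V and N) in place of the twelve semantic clusters, giving 8 features.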

Experimental work and result interpretation
We choose the best performing approach and classifier according to the size and characteristics of the data, then we perform experiments on the characteristic vector optimization.

Performance according to classification approaches
The results of the comparison between classification approaches, namely the unsupervised, classical supervised and deep learning, described in section 4 are given in Table 2.
From the results, we can point out that the best performance is achieved in the HTL domain. The poor results recorded in the MOV domain can be explained by the nature of the reviews and their length (Table 1).
According to Table 2, the supervised approach gave better results than the unsupervised and deep learning approaches. The degraded results of deep learning may be mainly due to the limited size of the used corpora. In addition, when comparing the two deep learning models, the extracted vector $V_W$ outperforms Word2Vec, which shows the relevance of the extracted sentimental terms. Hence, we opt for the supervised approach and the vector $V_W$ to perform the remaining classification tests, which aim to optimize the characteristic vector using various approaches.

Characteristic vector optimization
We based dimensionality reduction on the classical filtering operations TO, IG and PCA (section 4.2.1) and on three custom approaches: BOW lexicon reduction, and the segmentation of the lexicon using either manual semantic clustering or automatic morphological clustering [6]. In Table 3, we give the experimental results of each optimization approach and compare them with the result obtained using the raw lexicon. For brevity, we report the number of features and the execution time (ET) for the HTL domain only.
5.2.1 Classical filtering. Table 3 shows that the classical filtering operations TO, IG and PCA give comparable results, with accuracies that are very close to each other. Moreover, pertinent results are obtained from a reduced set of features. The PCA filter gives degraded results in the PROD and MOV domains, which may be explained by the fact that the principal components calculated for these domains are not easily separable by the SVM classifier.
5.2.2 Custom approaches.
5.2.2.1 BOW size reduction. The results show that execution time improves whereas accuracy degrades when the Bag-of-Words size is reduced, which may be caused by the removal of relevant features when passing from the full BOW to a reduced one. Besides the relevance of its results, the strength of the BOW lexicon over the manual lexicon lies in the fact that it weighs corpus terms based on their occurrences through an automatic process.
Table 3. Accuracies of models based on filtering operation
Interpretation can be easier when using the semantic and POS vectors, since their size is limited and not proportional to the number of lexicon terms: extending the lexicon does not affect the size of these characteristic vectors, which is not the case for the word model. The semantic and POS models achieved the same accuracy in the PROD and MOV domains, which is an advantage for the automatic segmentation using POS tagging in comparison to the manual semantic segmentation.
5.2.2.3 Custom approaches comparison. We compute the information gain of each characteristic on which the word, semantic and POS class models are constructed.
5.2.2.3.1 Feature significance. For the word model, the sum of negative terms $N_W$ is ranked first, followed in the second and third positions by the two positive features, namely the sum of positive terms $P_W$ and the count of positive terms preceded by a negation word $\overline{P}_W$. They are followed by the count of negative terms preceded by a negation word, $\overline{N}_W$, which is of low significance in comparison to the best ranked features. However, this feature improves the performance of our system since it inverts the polarity of the negative terms preceded by a negation word.
For the semantic segmentation (Figure 3, left), the feature with low significance is the count of negated negative terms $\overline{N}_W$, ranked 13th out of 16 features, and the most significant features are the sum of negative terms $N_W$, the sum of positive terms $P_W$ and the count of negated positive terms $\overline{P}_W$, ranked first, third and fifth out of 16 respectively. This shows the rarity of negative terms preceded by a negation term, and the importance for classification of positive terms ($P_W$) and negative terms (true negatives $N_W$ and false positives $\overline{P}_W$). The most significant segments are those of satisfaction and dissatisfaction, $P_4$ and $N_4$, ranked fourth and second out of 16 respectively.
Using POS segmentation (Figure 3, right), we identify the feature with low significance, namely the one with a low information gain, which is the count of negated negative terms $\overline{N}_W$, ranked seventh out of 8, and the most significant feature, the sum of negative terms $N_W$, ranked first out of 8 features. As a result, few terms are preceded by a negation word ($\overline{N}_W$), and the presence of negative terms $N_W$ is discriminant in identifying the class of each analyzed review. Moreover, $P_N$ and $N_N$, the positive and negative terms of the class N, which are ranked third and second respectively, are the strongest subjectivity indicators and convey more information than the remaining features, which suggests that the identification of sentiment is closely related to the presence of this category within comments.
The feature ranking related to word, semantic classes and POS clustering when analyzing the four domain specific models is almost the same with a slight difference in the position.
Analyzing the model built on semantic classes will help to define emotion categories and also to detect hate speech by adding related lexicon categories, as the model clearly identifies the best features based on their information gain. The segmentation based on POS, in turn, will help to define which category of the lexicon has to be added to improve the accuracy of our system.
The lexicon clustering turns out to be pertinent since it summarizes the information dispersed when using the word model, and it may be extended by defining which categories to add. According to Table 3, semantic classes give comparable results to the word lexicon with a gain in execution time and storage requirement. POS segmentation, in turn, gives comparable results to semantic classes and hence helps to overcome the same challenges that the word model faces.
In our case, where the lexicon is of limited size (190 terms), the gain is not significant because the word lexicon is small. However, the segmentation will be helpful for very large lexicons, for instance the BOW of 17,042 terms with an ET of 2679.89, which largely exceeds the size of the semantic and POS class sets, fixed at 12 and 4 respectively. Moreover, the representation based on lexicon segmentation improves the interpretability of a model and preserves the significance of each feature, which is lost when the number of features is proportional to the number of lexicon terms.

Systems comparison
After measuring the performance of our system based on various configurations, we give in Table 4 examples of correctly classified and misclassified comments.
From Table 4, annotation subjectivity may be a cause of misclassification, since a mixed comment can be annotated as positive or negative according to its context. Moreover, the absence of a comment's sentimental terms from the lexicon can lead to its misclassification, hence the need for lexicon extension, which enlarges the characteristic vector. This was the main motivation behind the question of how to keep the information, and thus high accuracy, to which we have responded in this paper by optimizing the characteristic vector.
To position our approach relative to previous work, we compare our system to the Mazajak, CAMeL Tools, SemEval 2017, and Abu Farha and Magdy systems [18,41,42,44], which are based on deep learning and pretrained models. The system results are given in Table 5 using the accuracy, precision, recall, and F-measure metrics.
The comparison is performed on the same datasets, namely ArSAS [17], SemEval 2017 [18], and ASTD [19], in order to reach a fair comparison between the different systems. The results indicate the efficiency of our system, which shows an improvement. Moreover, the low F-measure of 69.95% is obtained because there are few data to train on in comparison to the testing data. Hence, we swapped the training and testing portions and obtained a 93.55% F-measure.
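For readers who want to reproduce the four reported metrics, the sketch below computes them for a single target class from gold and predicted labels, using the standard definitions. It is an illustration under assumptions: the function name `binary_metrics` is hypothetical, and the paper does not state whether per-class scores on the three-class corpora are averaged, or how.

```python
def binary_metrics(gold, predicted, positive_label="positive"):
    """Return accuracy, precision, recall and F-measure for one target class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    correct = sum(1 for g, p in pairs if g == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f_measure
```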

Conclusion and further work
In this paper, we have aimed to optimize the components of a sentiment analysis system. We first collected multi-domain datasets and lexicons. Since the manual construction and verification of a lexicon is time consuming, we have constructed a custom BOW whose size is reduced following a custom threshold. We have performed classification based on the unsupervised approach and on the classical and deep neural supervised approaches. Execution time and storage requirement, as well as model interpretation, have gained interest in data analytics; thus we have opted for existing and custom methods to optimize the characteristic vector of the opinionated reviews. Moreover, to make the interpretation of our models and results easier, the reduced characteristic vector was based on semantic and morphological lexicon segmentation to give significance to its components. The current system proved efficient in comparison to the enhanced state-of-the-art models. As further work, we intend to apply the described approaches to a wide range of classification areas to assess their efficiency, since we believe that the sole requirement is the adaptation of domain categories. Moreover, the automatic annotation of corpora will be one of the main focuses. We also intend to extract cross-domain and cross-lingual features. In order to minimize the effort when performing sentiment analysis, we will base the task on transfer learning.