Clustering as feature selection method in spam classification: uncovering sick-leave sellers

Purpose – This paper aims to propose a novel way of using textual clustering as a feature selection method. It is applied to identify the most important keywords in the profile classification. The method is demonstrated through the problem of sick-leave promoters on Twitter.
Design/methodology/approach – Four machine learning classifiers were used on a total of 35,578 tweets posted on Twitter. The data were manually labeled into two categories: promoter and nonpromoter. Classification performance was compared when the proposed clustering feature selection approach and the standard feature selection were applied.
Findings – Random forest achieved the highest accuracy of 95.91%, higher than the similar work it was compared with. Furthermore, using clustering as a feature selection method improved the sensitivity of the model from 73.83% to 98.79%. Sensitivity (recall) is the most important measure of classifier performance when detecting promoters' accounts that have spam-like behavior.
Research limitations/implications – The method applied is novel; more testing is needed on other data sets before generalizing its results.
Practical implications – The model applied can be used by Saudi authorities to report on the accounts that sell sick leaves online.
Originality/value – The research proposes a new way in which textual clustering can be used in feature selection.


Introduction
When not reporting to work, employees are expected to present proof if they claim to have had a medical condition. Sick leaves are documents provided by medical facilities and issued by a doctor, certifying that the person is suffering from a condition that allows them days off. Some employees and students abuse this allowance and obtain such documents illegally to have free day(s) off. In Saudi Arabia, employee absenteeism has been an issue for some time now. The government is combating the issuance of these documents by designing laws and regulations [1]. Despite these efforts, this type of document is still being circulated. One means of connecting to those who sell these documents is through Twitter. Promoters are accounts that sell these documents on social media. Since it is illegal, most of the accounts are either fake or pseudo accounts. Sick-leave promoters' tweeting behavior is comparable to spamming behavior. Spammers tend to repeat the exact text multiple times within a short period of time [1].
They use multiple hashtags, and they also capitalize on trending hashtags to gain exposure. For these reasons, promoters of sick-leave documents are treated in the same manner as spam accounts in this research.
Machine learning algorithms have been employed to detect spam accounts on Twitter. Some of these approaches rely on features extracted from tweets, while others utilize the textual content of tweets. This paper attempts to uncover those who are involved in the sick-leave deception on Twitter. It contributes to this effort by:
(1) Analyzing a data set of 15,578 tweets downloaded from 2010 until January of 2021 and manually labeling them. This is used as ground truth data.
(2) Using K-means clustering as a feature selection approach to improve the performance of the classification model.
(3) Constructing a classification model from the textual tweets of sick-leave promoters by applying supervised learning algorithms. Four classifiers are used: Decision Tree (DT), Random Forest (RF), Naïve Bayes (NB) and Logistic Regression (LR).
(4) Identifying the list of keywords that are most effective in revealing promoters.
This paper continues as follows: related work is presented, followed by the proposed scheme, experiments and results, evaluation, and finally the conclusion, implications and future work.

Related work
According to Twitter, profile detection is "the attempt to automatically infer the values of user attributes by leveraging observable information such as user behavior, network structure, and the linguistic content of the user's Twitter feed." [2]. Profile detection has been presented in many studies to distinguish the owner of the profile based on their interests and profile information. The detection efforts are mostly binary, where the researchers want to identify whether or not the user is playing a certain role (male or female, bot or human, organization or individual, spam or not spam) [3]. Detecting spam accounts on Twitter can follow one approach or a hybrid of approaches based on: time-series analysis of tweets and interactions [4], features extracted from the user profile [5] and the text posted (tweets) [6,7]. The first type applies time-series analysis to reveal trends. This can be a search for the count of specific terms within a period of time. The second type uses features extracted from the user profile. Researchers investigate, through profile interaction and the content of the tweet, whether the account belongs to a human or a bot.
The third type is known as the content analysis approach, where tweet text is used to detect spam content. The analysis of text starts with Bag-of-Words analysis, a popular approach to identify the top-k words in user groups [8]. Alternatively, studies use n-gram character features, unsupervised learning such as LDA and ensemble approaches [9]. Content analysis of tweets also focuses on the fact that spammers on Twitter use malicious links. Therefore, the use of URL blacklists is another method applied [10].
Feature selection represents an important tool to balance the number of selected attributes, avoiding both underfitting the model (with too few attributes) and expensive computational time (with too many attributes). There are many methods for feature selection, such as wrapper methods [11], filter methods and unsupervised methods. Wrapper and filter methods are considered supervised approaches as they utilize the output to produce the best set of features. With textual data, unsupervised feature selection, namely K-means clustering, has been applied to select the best set of features with high-frequency words [12]. Four corpora were tested using three classifiers. SVM was found to achieve better performance when clustering was applied. Another approach involved selecting a list of features using K-means clustering and correlation analysis [13]. Using two textual data sets, NB showed improvement in accuracy. None of these studies applied K-means clustering as feature selection to classify profiles. Furthermore, all the text used consisted of lengthy documents and news databases. None of the texts used were short text (tweets). Table 1 summarizes these studies.

Proposed scheme
In this section, the proposed scheme is introduced, preceded by an explanation of how the data were collected.

Data collection
The data were retrieved by specifying a list of keywords that were identified using www.hashtagify.me. It is a tool that provides a list of relevant hashtags that are used frequently together. The keyword that was used to start finding the list was "سكليف", a transliteration of the word sick leave that is commonly used to refer to the document that is obtained. The retrieved list included keywords that are either the Arabic version of the word or a variation with similar meaning. A total of nine words were used, and a tweet was retrieved if it contained at least one of the nine. The location of the tweet was set to Saudi Arabia as a condition for it to be selected.
Tweets between January 1, 2010, and January 8, 2021, were downloaded. The data were manually labeled using two categories: promoter and nonpromoter. It was noticed from the data that people write about sick leaves to joke about needing one, to promote their sick-leave business, or to ask for one. The majority of the tweets joked about needing a sick leave; very few asked for one. For that reason, jokers and those who ask for sick leaves are considered as one category (nonpromoters) and the rest are promoters. Tweets were obtained and were ready for cleaning and preprocessing; 2,413 tweets were identified as promoter tweets. The cleaning and preprocessing of the tweets included removing duplicates, unifying characters that carry marks (for example, converting a marked form to ا), and removing links and emojis. The tweet text goes through further preprocessing:
(1) Tokenization: Each tweet is converted to tokens. A token is any word that is preceded and followed by a space.
(2) Stop words and nonuseful words like pronouns and articles are removed [19].
(3) N-grams: n-grams are word sequences that often co-occur. These can be two or more words. The data have been explored for up to 4-grams.
(4) After that, the term frequency-inverse document frequency (TF-IDF) approach is applied to vectorize the textual data. TF-IDF reflects the importance of a keyword in a document by weighting how often the keyword occurs in that document against how common it is across all documents [20].
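The first three preprocessing steps can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization as described above; the stop-word set and the English example are hypothetical placeholders, since the Arabic stop-word list used in the study is not given.

```python
# Hypothetical stop-word set for illustration only; the study's actual
# Arabic stop-word list is not given in the text.
STOP_WORDS = {"the", "a", "an", "i", "you", "he", "she", "it", "we", "they"}


def tokenize(text):
    """Split a tweet on whitespace, so a token is any run of non-space characters."""
    return text.split()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [t for t in tokens if t.lower() not in stop_words]


def ngrams(tokens, max_n=4):
    """Return all word n-grams for n = 1..max_n, each joined with a single space."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def preprocess(text, max_n=4):
    """Tokenize, remove stop words, then expand into n-grams."""
    return ngrams(remove_stop_words(tokenize(text)), max_n=max_n)


if __name__ == "__main__":
    print(preprocess("I need a sick leave today", max_n=2))
```

Stop words are removed before the n-grams are built, following the order of the steps in the list; reversing that order would produce different 2-grams.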
For each term, frequencies are calculated. This is used to prune the word list and specify the list of words with the highest frequencies. The pruning condition is to keep words that were used more than 1,207 times (half the number of promoter tweets).
Inverse document frequency (IDF) is calculated for each term. Each word is considered as a feature for each tweet and will have a weight. The formula for IDF is:

$$\mathrm{IDF} = \log\left(\frac{\text{total number of tweets}}{\text{number of tweets which have that word}}\right)$$

At this stage, a list of eight words was identified. They represent the eight features along with their weights. Two of them are 2-gram features. Table 2 shows the resulting wordlist.
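The pruning rule and the IDF formula can be illustrated with a short sketch. It assumes tweets are already tokenized and uses the natural logarithm, since the base of the logarithm is not stated; the function names and toy data are illustrative only.

```python
import math
from collections import Counter


def term_counts(docs):
    """Total number of occurrences of each term across all tokenized tweets."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


def document_frequency(docs):
    """Number of tweets in which each term appears at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def prune_terms(docs, min_count):
    """Keep only terms used more than min_count times (1,207 in the study)."""
    return {t for t, c in term_counts(docs).items() if c > min_count}


def idf(docs, vocabulary):
    """IDF = log(total number of tweets / number of tweets containing the term).

    Assumes the natural logarithm and that every term in vocabulary
    occurs in at least one tweet.
    """
    n_docs = len(docs)
    df = document_frequency(docs)
    return {t: math.log(n_docs / df[t]) for t in vocabulary}


if __name__ == "__main__":
    toy = [["sick", "leave"], ["sick", "joke"], ["sick", "leave", "sale"], ["joke"]]
    kept = prune_terms(toy, min_count=1)
    print(sorted(kept), idf(toy, kept))
```

A term that appears in every tweet gets an IDF of zero, which is why very common words contribute little weight.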
The resulting data set includes these features along with the ID of each tweet. Figure 1 shows an example of the process that the tweet goes through during cleaning and preprocessing, and Figure 2 shows a sample from the resulting data set.

Classification techniques
Four classification algorithms are tested based on what was obtained from the literature: NB, DT, RF and LR. NB is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. It assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature [21].
DT is a supervised classifier where the data are continuously split according to a certain parameter. In DTs, each leaf is assigned to one class or its probability. Small variations in the training set result in different splits leading to a different DT. Thus, the error contribution due to variance is large [22].
RF consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest outputs a class prediction, and the class with the most votes becomes the model's prediction. RF has powerful properties and is said to be less sensitive to the optimization of method parameters, leading to a simpler training process [21].
Table 2. List of attributes, their types and their description in the data set
LR is a simple and efficient method for binary and linear classification problems. It is a classification model that is very easy to implement and achieves very good performance with linearly separable classes [21]. It is an extensively employed algorithm for classification in industry.
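The voting step of RF described above can be shown in a few lines. This sketch covers only the aggregation of per-tree predictions, not the training of the trees; the function name and labels are illustrative.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


if __name__ == "__main__":
    votes = ["promoter", "nonpromoter", "promoter", "promoter", "nonpromoter"]
    print(majority_vote(votes))
```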

Clustering technique
Clustering is an unsupervised learning technique where similar instances are grouped together. The K-means algorithm is one of the most common approaches to clustering. It has been applied to multiple problems such as recommender systems, image processing and text mining [10]. Compared to other clustering algorithms, it is considered to be time efficient due to its linear complexity. Its running time is O(J*K*m*N) with K clusters and J iterations, where m is the number of instances in the data set and N is the number of features [10]. In problems where the number of clusters is unknown, the algorithm is run multiple times in order to find the optimum value of K. Many approaches are used to find the best K, including the elbow approach, cross-validation and the Silhouette approach [10]. In the current work, K has already been set to 2.
The K-means algorithm starts by randomly selecting K instances (in this case two) as initial centroids of the clusters. After that, the distance between each of the remaining instances and the two centroids is calculated. Each instance is assigned to the cluster whose centroid is closest to it. Once all instances are assigned, the mean of the instances in each cluster is calculated, and it becomes the new centroid. The process is repeated until the optimum clustering is reached. The centroid update is given by Eqn (1) [10]:

$$\mu_k = \frac{1}{N_k} \sum_{i=1}^{N_k} x_i \qquad (1)$$

where μ_k is the mean for cluster k, N_k is the number of instances in cluster k and x_i is one of the instances that belong to cluster k.
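The procedure can be sketched in plain Python as below. This is an illustrative implementation under stated assumptions, not the tool used in the study: it uses squared Euclidean distance, stops when assignments no longer change, and keeps the old centroid if a cluster becomes empty.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors (the centroid update)."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def kmeans(points, k=2, max_iter=100, seed=0):
    """Return (centroids, labels) after iterating until assignments are stable."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(mean_vector(members) if members else centroids[j])
        new_labels = assign(points, new_centroids)
        centroids = new_centroids
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


if __name__ == "__main__":
    data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    print(kmeans(data, k=2))
```

Because the initial centroids are drawn at random, the cluster numbering (0 or 1) depends on the seed, while the grouping itself is stable for well-separated data.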

Study model
The proposed model utilizes the K-means clustering algorithm to identify features (terms) to be used in the classification efforts. This means that K-means clustering is applied to identify the terms that were useful in differentiating between the two clusters. After that, the list of terms is tested with the four classification algorithms to compare their performance with the standard feature selection approach. The clustering algorithm produces a list of features that are considered determinants in the clustering effort. They determine the similarity and dissimilarity between the instances and their centroids. The process is explained in the pseudocode shown in Figure 3.
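One way to read this selection step is to rank terms by the absolute difference between their values in the two cluster centroids and keep the top-ranked ones. The sketch below illustrates that reading; the function names and the toy centroid values are hypothetical and do not come from the study's tables.

```python
def rank_terms_by_centroid_gap(terms, centroid_0, centroid_1):
    """Pair each term with |centroid_0 value - centroid_1 value|, largest gap first."""
    gaps = [(term, abs(a - b)) for term, a, b in zip(terms, centroid_0, centroid_1)]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


def select_top_terms(terms, centroid_0, centroid_1, top_n):
    """Return the top_n terms that best separate the two clusters."""
    ranked = rank_terms_by_centroid_gap(terms, centroid_0, centroid_1)
    return [term for term, _ in ranked[:top_n]]


if __name__ == "__main__":
    # Toy values for illustration only.
    terms = ["term_a", "term_b", "term_c"]
    c0 = [0.0, 0.50, 0.10]
    c1 = [0.6, 0.45, 0.10]
    print(select_top_terms(terms, c0, c1, top_n=2))
```

A term with nearly equal values in both centroids says little about cluster membership, so it ranks low and is dropped before classification.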

Experiments and results
Promoters represent 16.8% of the data set. This means that the ratio of promoter to not-promoter is about 1:5. This shows an imbalance in the data set that needs to be considered when the classification algorithms are run, in order to avoid possible overfitting. Experiments are set up using data sets with ratios of 1:1, 1:2 and 1:3.
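The text does not state how the 1:1, 1:2 and 1:3 data sets were drawn. The sketch below assumes random undersampling of the majority (not-promoter) class, which is one common way to obtain such ratios; the function name and labels are illustrative.

```python
import random


def undersample(rows, labels, ratio, positive="promoter", seed=0):
    """Keep every positive row and at most ratio * n_positive random negatives.

    Assumption: the target ratios are reached by randomly dropping
    majority-class rows; the study's actual sampling method is not stated.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == positive]
    neg = [i for i, y in enumerate(labels) if y != positive]
    keep_neg = rng.sample(neg, min(len(neg), ratio * len(pos)))
    keep = sorted(pos + keep_neg)  # preserve the original row order
    return [rows[i] for i in keep], [labels[i] for i in keep]


if __name__ == "__main__":
    y = ["promoter"] * 2 + ["nonpromoter"] * 10
    x = list(range(len(y)))
    for r in (1, 2, 3):
        _, sampled = undersample(x, y, ratio=r)
        print(r, sampled.count("promoter"), sampled.count("nonpromoter"))
```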
Clustering analysis as feature selection
The K-means algorithm was applied with K = 2. With topic modeling, the TF-IDF operator was able to generate a list of 30,232 words. Nine words were used in clustering based on their [...] sick leave) is more in cluster_1. The performance of K-means clustering is evaluated using the average distance within centroid. The larger the number, the better, and in this case, it is 0.584. Another number to look at is the average within-centroid distance within each cluster. For cluster_0, it is 0, and for cluster_1, it is 0.821. Table 4 presents the list of most influential terms in Arabic along with their translation and the distances between them and the centroid of each cluster. The difference shows the extent to which the term belongs to a certain cluster. If a certain tweet contains one of the terms, it supports its assignment to the cluster centroid closest to it. In the table, five terms have high absolute difference values ranging from 0.108 to 0.503. The other terms show very low differences. The top five terms are used with the classification algorithms as features.
In Table 5, RF is showing the best performance based on most of the measures.

Experimenting with different ratios
The four selected classification algorithms were run using the 10-fold cross-validation technique. The results are shown for the original data and the data after sampling to treat the imbalance (see Table 6). The highest accuracy was achieved using RF with the original ratio, reaching up to 94.92% with the highest specificity of 98.57%. Other significant results were achieved under the ratio 1:1, where DT achieved 95.9% precision. RF also achieved the highest f-measure of 89.11% under the 1:1 ratio. NB achieved the highest recall of 88.73% under the 1:1 ratio. The ratios 1:2 and 1:3 did not achieve any significant results.
Table 3. Wordlist, their translation, frequencies and the number of tweets they appeared in
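The 10-fold cross-validation mentioned above splits the data into ten parts and uses each part once for testing while training on the other nine. A minimal sketch of the index bookkeeping, assuming a shuffled, non-stratified split (the study's exact splitting settings are not stated):

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds.

    Assumption: samples are shuffled once and dealt round-robin into folds,
    so fold sizes differ by at most one.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    for train_idx, test_idx in k_fold_indices(25, k=10):
        print(len(train_idx), len(test_idx))
```

Each performance figure is then averaged over the ten test folds, so every tweet is used for testing exactly once.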
To further attempt to improve the results, the correlation between attributes is calculated and applied in backward elimination. The process of elimination starts by including all attributes, eliminates the least significant attribute and then runs the classifier. The process continues until the best performance is reached. For the purpose of this analysis, backward elimination is applied with only RF since it achieved the best results. The result of backward elimination with RF improved the accuracy, recall and f-measure. However, it slightly reduced the specificity. RF using four features managed to reach 95.01% accuracy, 90.81% [...] sick-leave (transliteration)). The rest of the algorithms showed similar improvement. Figure 5 shows the comparison between the two approaches.
Table 4. The terms and their centroid values based on each cluster and their absolute difference
Table 5. Classification results with clustering as feature selection
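Backward elimination as described above can be sketched as a greedy loop. This version is an assumption-laden illustration: it treats the "least significant" attribute as the one whose removal gives the best score, whereas the study derives significance from attribute correlation. The `score` callable is a hypothetical stand-in for running the classifier and returning a performance measure.

```python
def backward_elimination(features, score):
    """Greedily drop one feature at a time while the score keeps improving.

    features: list of feature names to start from.
    score:    hypothetical callable mapping a feature list to a number
              (for example, cross-validated accuracy of a classifier).
    Returns (selected_features, best_score).
    """
    current = list(features)
    best = score(current)
    while len(current) > 1:
        # Try removing each remaining feature in turn.
        candidates = [[f for f in current if f != drop] for drop in current]
        scored = [(score(c), c) for c in candidates]
        top_score, top_subset = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no single removal improves the score
        best, current = top_score, top_subset
    return current, best


if __name__ == "__main__":
    useful = {"a", "b"}
    # Toy score: reward useful features, penalize every feature slightly.
    toy_score = lambda fs: len(useful & set(fs)) - 0.1 * len(fs)
    print(backward_elimination(["a", "b", "c", "d"], toy_score))
```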

Evaluation criteria
In the literature dealing with spam detection, some standard metrics are used. These include accuracy, precision, sensitivity, f-measure and specificity. Accuracy is the ratio of correctly predicted cases, both promoter and not-promoter, to the total cases (Eqn 2). Sensitivity, also known as recall or true positive rate, is the percentage of actual promoters that were correctly predicted as promoter (Eqn 3). Specificity is the percentage of actual not-promoters that were correctly predicted as not-promoter (Eqn 4). Precision is the percentage of instances predicted as promoter that are actually promoters (Eqn 5). Finally, the f-measure is calculated as the harmonic mean of precision and sensitivity (Eqn 6).
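These five measures follow directly from the confusion-matrix counts: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), with promoter as the positive class. The sketch below uses the standard definitions and assumes all denominators are nonzero; the counts in the example are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive="promoter"):
    """Count TP, FP, TN and FN, treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Standard evaluation measures; assumes every denominator is nonzero."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(metrics(tp=8, fp=2, tn=85, fn=5))
```

The false-negative rate is 1 minus sensitivity, which is why a sensitivity of 73.83% corresponds to missing 26.17% of promoter instances.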

Evaluating the model
As RF achieved the best accuracy, a discussion of the other measures is also important to reflect on the model's performance. RF also showed the highest precision, specificity and f-measure; however, it achieved low sensitivity. This means that the model is likely to generate false negatives 26.17% of the time. On the other hand, the model is able to correctly identify an instance as not-promoter 98.57% of the time. These results were based on the original data set. The results improved when applying feature selection using backward elimination. However, sensitivity remained low at 78.24%. When applying clustering as feature selection, sensitivity improved significantly. Other measures also improved, including accuracy, precision and f-measure. It is also visible that specificity declined from 98.39% to 81.64%. This means that the model is less able to detect tweets belonging to not-promoter. Figure 6 compares the performance of RF without feature selection, with backward elimination and using features identified by clustering.
The difference between the features selected using backward elimination and the ones selected by clustering is in the number of features and the terms included. [...] reports which refers to the sick-leave documents.

Comparing with related work
The focus is on studies that used textual analysis of tweet content to classify spam/nonspam accounts.
All of the studies in Table 7 used Twitter data from either publicly available data sets or data downloaded from Twitter. This study showed the highest performance compared to the previous studies. In fact, recall improvement is considered the most significant contribution as it shows the sensitivity of the model in detecting promoters. According to [7], the majority of studies of spam detection rely on recall as a performance measure.
Figure 6. Comparing the performance of RF: without feature selection, with backward elimination and with clustering
Table 7. Studies that used tweet content for spam classification

Conclusion, implications and future directions
This work deals with the problem of Twitter accounts that sell undeserved sick leaves in Saudi Arabia. The proposed model utilizes K-means clustering as a feature selection approach to identify the most important keywords in determining each cluster. The resulting features are tested with four classification algorithms. When compared with the performance of these algorithms without K-means clustering, clustering was found to improve the classification performance of all the algorithms. Most importantly, the sensitivity of the classification model improved. The study also identified a list of keywords that can be used as determinants in the classification of sick-leave promoters.
The major implication of this work is that it can directly support the efforts of Saudi Arabia to identify the accounts that are engaged in illegally selling sick-leave documents. Detecting and reporting these accounts to Twitter means that the means of communication between those seeking the service and those promoting it is broken. The authors understand that other platforms may be utilized; however, this work is considered a contribution to the other efforts to combat these actions. Future directions include investigating other platforms to compare the behavior of promoters across platforms. Technically, future work can involve experimenting with ensemble machine learning techniques and testing the model with other standard databases for spam detection.