Email classification analysis using machine learning techniques

Purpose – In this digital era, email is the most pervasive form of communication between people. Many users become victims of spam emails and have their data exposed. Design/methodology/approach – Researchers contribute to solving this problem by focusing on advanced machine learning algorithms and improved models for detecting spam emails, but there is still a gap in features. To achieve good results, features also play an important role. To evaluate the performance of the applied classifiers, 10-fold cross-validation is used. Findings – The results confirm that spam emails are correctly classified with an accuracy of 98.00% for the Support Vector Machine and 98.06% for the Artificial Neural Network, which is higher than that of the other applied machine learning classifiers. Originality/value – In this paper, Point-Biserial correlation is applied between each feature and the class label of the University of California Irvine (UCI) spambase email dataset to select the best features. Extensive experiments are conducted on the selected features by training the different classifiers.


Introduction
There are many tools for communication on the Internet. One tool that is used to convey messages to others more formally is email. Spam is a very complicated problem in email services. Spam email is unwanted and unwelcome email sent to users, containing job offers, advertisements for products and services and so forth. According to [1], spam accounts for more than 85% of the emails sent to users. Email is used not only for personal communication but also for resolving client queries, handling job tasks and social activities. In machine learning, the categorization of email as spam or not spam is mostly based on the body of the email, where specific keywords can identify a spam email. For the detection of spam email or text, different types of models and feature selection methods are used, such as structural and social network features [2], a genetic search algorithm for feature selection [3] and Infinite latent feature selection [4].
In this paper, the primary focus is on the selection of features to get better performance for the classification of spam and ham emails. The major contributions of this research are as follows: (1) We experimented with Distance-based (K-Nearest Neighbor (KNN), Support Vector Machine (SVM)), Tree-based (Random Forest (RF), Decision Tree (DT)) and Gradient-based (Artificial Neural Network (ANN), Logistic Regression (LR), Radial Basis Function (RBF)) algorithms on the University of California Irvine (UCI) spambase email dataset.
(2) We select the Point-Biserial feature selection technique to extract the most relevant features for the classification of spam email.
(3) To the best of our knowledge, no previous research study has used this feature selection technique for spam email classification.
(4) We experiment on the extracted relevant features with the Distance-based, Gradient-based and Tree-based algorithms.
Point-Biserial correlation is used to measure the relationship between the class label and each feature. We use a dataset in which the features are continuous and the class labels are nominal, taking the values 1 and 0. Because the Point-Biserial correlation measures the relationship between a continuous variable and a binary variable, it suits the dataset used in this research. The ANN has been applied to the UCI spambase email dataset to get the best result, but no feature selection technique was used to select the best features from the data for the applied model [5]. If an algorithm is applied to data without preprocessing, accuracy tends to be lower. Feature selection and dataset size are both factors that contribute to the accuracy of a machine learning model. The Support Vector Machine has been applied to a dataset having 400 emails for training; the smaller the dataset, the greater the chance of overfitting, in which the training accuracy may be high while the testing accuracy is lower [6]. Overfitting occurs when a complex model is fitted to a simple dataset. Spam email mostly affects users through the time consumed in reading it, the bandwidth it uses and the space required to store it [7]. Users spend a lot of time reading spam emails that are useless to them. For the above reasons, this problem is considered in this article in order to find a better solution to it. Some articles have already been published on this issue using different techniques, but the maximum accuracy achieved by these articles is 95% due to the abovementioned reasons. Machine learning algorithms can solve these issues in minimum time [8].
To handle the issues mentioned above, the proposed methodology uses a large dataset of about 5000 instances to minimize the chances of overfitting. To improve the accuracy of the model, feature selection techniques are applied to preprocess the data before applying the machine learning model. For the evaluation of the model, the 10-fold technique is applied, and overfitting and underfitting of the applied model are checked.
The remainder of the paper is organized as follows: Section 2 is the literature review, which describes previous papers on the selected topic. Section 3 presents the proposed methodology for the classification of spam emails. Section 4 covers the experiments and results, including the evaluation of all the techniques used in classification. Section 5 concludes the paper and summarizes the outcomes of the experiments performed.

Literature review
The spambase UCI dataset was used for the classification of spam and ham emails, with Infinite latent feature selection used for the selection of features. The authors applied ten machine learning algorithms to compare their performance: RF, ANN, Logistic, SVM, RF, KNN, DT, Bayes Net, NB and RBF [4]. The strengths of the suggested work are as follows: (1) The authors used Distance-based, Gradient-based and Tree-based machine learning algorithms.
(2) Normalization is performed to reduce the influence and bias of features that have large values.
The weakness is that the authors did not mention the number of neurons used in the ANN; using too few neurons for a complex problem causes the ANN to give lower accuracy than other algorithms. An ANN can learn by itself, a quality that is not present in Distance-based algorithms. For complex models, increasing the number of neurons in the ANN improves classification performance. The SpamAssassin dataset was used with 24 different machine learning classifiers in the Weka tool, and the best classifier achieved an accuracy of 96.32%. The strength of Sharma and Amit's work is that they used a large variety of machine learning algorithms and measured the performance of each individually. The weakness of this methodology [9] is that no feature selection technique was used to select the more relevant features. Six hundred emails were used in the filtration of spam emails, of which 400 were used as testing data and the remaining 200 as training data. The weighted Support Vector Machine classifier achieved 99.5% accuracy. The weakness is that the dataset includes only a small number of emails compared with the thousands of emails in spambase UCI. A dataset with fewer instances has a higher chance of yielding high accuracy, so researchers must have enough data to test a machine learning model well.
Details of different papers are given in Table 1 according to the dataset they used and the algorithms that perform best on these datasets for email classification. The Enron dataset was downloaded for the classification of spam emails, and the classifiers implemented were J48, a decision tree, and the Multilayer Perceptron, which belongs to the artificial neural network family. J48 and the Multilayer Perceptron achieved an accuracy of 93% and 92%, respectively [11]. Experiments were also performed with data of two different sizes, 1000 emails and 5000 emails. Three classifiers were implemented: Support Vector Machine, Naive Bayes and J48. With 1000 emails, SVM, NB and J48 achieved 92%, 97.2% and 95.8% accuracy, respectively [12]. With 5000 emails, the accuracy of the Support Vector Machine and Naive Bayes dropped by 1.8% and 0.7%, respectively, while that of J48 increased by 1.8%. The spambase UCI dataset was used for the classification of spam emails. Five different experiments were performed and 96.4% accuracy was achieved using EDT [6].
Table 1.
J48, IBK and Naive Bayes: 96.3%
[15] Artificial Neural Network: 85.31%
[16] Naive Bayes: 89.7%

The spam and ham emails were recognized using two different dataset sizes, 400 and 50 emails [13]. The Repeated Incremental Pruning to Produce Error Reduction (RIPPER) technique was used to classify the emails. With 400 emails, RIPPER achieved 90% accuracy, and with 50 emails it achieved 95% accuracy. The Facebook dataset was used for the identification of spam messages. J48, IBK and NB classifiers were applied to the Facebook dataset, and of these three classifiers J48 produced the best results [14]. The spambase UCI dataset was used for the classification of ham and spam emails, with features selected using the Infinite latent feature selection technique [4]. Ten machine learning classifiers were implemented, and the results showed that the RF classifier achieved the best accuracy, 95.45%. In another study, the spambase UCI dataset was passed through a Multilayer Perceptron with a sigmoid function [15]. This method correctly classified spam email with an accuracy of 85.31%. A total of 4601 email records were used, of which 3233 were used for training the model and the remaining 1368, about 30% of the dataset, were used for testing. Spam messages were used for the filtration of spam emails [16]. Five different versions of NB were implemented on fresh spam messages, and the two-stage smoothing version achieved the highest accuracy. Based on the previous articles published on the classification of spam emails, we draw the following conclusions: (1) Many effective classifiers for ham and spam classification of emails have been introduced.
(2) Different articles use different types of datasets related to spam email.
(3) Mostly 20% or 30% of the dataset instances are used for testing and the remaining instances are used to train the specific model.
(4) All the datasets have some gaps, which can be addressed by preprocessing the data and by applying different feature selection techniques for ham and spam email classification.
(5) To date, many researchers are working on different datasets and using different classifiers to improve the results and achieve the best accuracy.
The greatest accuracy achieved on the spambase UCI dataset is 95.45%.
Some issues of the literature discussed above are overcome in the proposed methodology. The main points are as follows: (1) A dataset of spam and ham emails that is more recent and larger than the existing ones is used for experimenting with the proposed methodology, because a proposed methodology can only be properly evaluated on a larger dataset.
(2) A small dataset has a higher probability of yielding good accuracy even with simple methods, whereas it is harder to achieve high accuracy on a larger dataset.
(3) The Point-Biserial feature selection technique is used to find the features that are related to the class label, so that they contribute to achieving the best classification results.
(4) The dimension of the feature set is reduced by eliminating those features that have no relation with the class label.

Proposed methodology
In this article, the proposed methodology consists of different stages. The first stage is data gathering, in which the spambase dataset is downloaded from the UCI repository. The second stage is to normalize all the attributes of the dataset that have a large range of values. In the third stage, the feature selection technique called Point-Biserial correlation is applied. In the fourth stage, eight machine learning classifiers are applied to the selected attributes, as shown in Figure 1.
For the proposed methodology, the hardware and software environment was set up as needed to perform the experiments. An HP laptop with a 4th-generation Core i5 processor and 8 GB of RAM is used for experimentation. PyCharm, an Integrated Development Environment for the Python language, is used to program the experiments. The latest versions of Python libraries such as NumPy and scikit-learn (Sklearn), including its Stratified K-Fold utility, are used for the experiments.

Email dataset
The spambase UCI dataset, downloaded from the UCI machine learning repository, is used in this article. This dataset includes 57 attributes having continuous and discrete values. It consists of 4601 labeled instances. The last column of the dataset holds the class label, which takes the values 1 and 0: 1 means the email is spam and 0 means the email is not spam. Most attributes of the dataset indicate the occurrence of a particular word or special character in the email. The last three attributes indicate the longest, average and total length of capital letter sequences in the email text.

Preprocessing
Most datasets available on the Internet are not preprocessed. The definition of spambase UCI attributes is given in Table 2.
The attributes of the spambase UCI dataset span widely different value ranges. To bring them onto a common scale, normalization is applied using Eqn (1).
$$x_{\text{normalized}} = \frac{x - x_{\text{minimum}}}{x_{\text{maximum}} - x_{\text{minimum}}} \qquad (1)$$
Here $x$ denotes the instance value of the specific attribute, $x_{\text{minimum}}$ and $x_{\text{maximum}}$ are the minimum and maximum values of the attribute to be normalized, and $x_{\text{normalized}}$ is the normalized value.
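The normalization step can be sketched in a few lines of Python. This is only an illustration of min-max scaling as described for Eqn (1), not the authors' code: the function name, the handling of constant attributes and the sample values are assumptions made for the example.

```python
def min_max_normalize(column):
    """Scale a list of numbers to the range [0, 1] (min-max normalization)."""
    x_min = min(column)
    x_max = max(column)
    span = x_max - x_min
    if span == 0:
        # Constant attribute: map every value to 0.0 (an assumed convention).
        return [0.0 for _ in column]
    return [(x - x_min) / span for x in column]


# Hypothetical values of one attribute with a large range.
capital_run_length = [1, 5, 21, 101]
normalized = min_max_normalize(capital_run_length)
print(normalized)
```

In practice each of the 57 attributes would be scaled independently, column by column.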

Feature selection
To select the best attributes from the list of attributes, the Point-Biserial correlation coefficient [17] is applied. Point-Biserial correlation applies where one variable is continuous and the other is dichotomous, that is, binary. The Point-Biserial correlation coefficient is calculated by Eqn (2).
$$r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}} \qquad (2)$$
To calculate $r_{pb}$, the dichotomous variable divides the data into two groups, 1 and 0. $M_1$ is the mean value of all the data points in group 1 and $M_0$ is the mean value of all the data points in group 0. $n_1$ is the number of data points in group 1 and $n_0$ is the number of data points in group 0. $n$ is the total sample size and $s_n$ is the standard deviation of the continuous variable over all $n$ data points. Point-Biserial correlation is applied between each attribute and the class label, and attributes whose $r_{pb}$ value is equal to 0 are not selected. The data used in this research fulfill the requirements of this feature selection technique because the features are continuous and the class labels are dichotomous. To the best of our knowledge from the literature, there is no prior use of this technique in the domain of spam email classification.
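A minimal sketch of this selection rule is given below. It assumes the standard form of the Point-Biserial coefficient with the population standard deviation of the attribute; the function names, the tolerance `tol` used to decide that a coefficient "equals 0" and the toy attribute values are all hypothetical and not taken from the paper.

```python
import math


def point_biserial(values, labels):
    """Point-Biserial correlation between a continuous attribute and 0/1 labels."""
    n = len(values)
    group1 = [v for v, y in zip(values, labels) if y == 1]
    group0 = [v for v, y in zip(values, labels) if y == 0]
    n1, n0 = len(group1), len(group0)
    if n1 == 0 or n0 == 0:
        return 0.0  # correlation is undefined with a single group; treated as 0
    m1 = sum(group1) / n1
    m0 = sum(group0) / n0
    mean_all = sum(values) / n
    # Population standard deviation of the attribute over all n points.
    s_n = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if s_n == 0:
        return 0.0  # a constant attribute carries no information
    return (m1 - m0) / s_n * math.sqrt(n1 * n0 / n ** 2)


def select_features(columns, labels, tol=1e-12):
    """Keep the attributes whose coefficient differs from zero."""
    return [name for name, col in columns.items()
            if abs(point_biserial(col, labels)) > tol]


# Toy data: four emails, 1 = spam, 0 = not spam (values are made up).
labels = [1, 1, 0, 0]
columns = {
    "word_freq_free": [0.9, 0.8, 0.1, 0.2],
    "constant": [0.5, 0.5, 0.5, 0.5],
    "uncorrelated": [0.2, 0.4, 0.4, 0.2],
}
r_free = point_biserial(columns["word_freq_free"], labels)
selected = select_features(columns, labels)
print(round(r_free, 4), selected)
```

On real data a coefficient is rarely exactly zero, so a small threshold on the absolute value would usually play the role of `tol`; the paper does not state one.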

Classification techniques
Classification is the process in which items are grouped based on the similarity between data and the definition of a group. Machine learning classifiers play an important role in classifying large amounts of data. In this article, we use different types of machine learning classifiers to predict class labels using the spambase email dataset. The dataset is split into two parts, training and testing, in a 70:30 ratio. The training dataset is used to train the classifier models and the testing dataset is used to test the trained models. The classifiers applied in this article are Naive Bayes, Random Forest, K-Nearest Neighbor, Radial Basis Function, Decision Tree, Artificial Neural Network, Logistic Regression and Support Vector Classifier. A Naive Bayes classifier was built for phishing email filtering at Microsoft [2]. It is based on probability and is used to solve classification problems. The training dataset is given to the Naive Bayes model to train it. Naive Bayes is calculated using Eqn (3).
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad (3)$$
where A is the class label and B is the attribute.
Table 2. Description of attributes used in spambase UCI dataset
Another classification algorithm is RF, a classifier introduced by Breiman in 2001. It consists of multiple decision trees that together predict the class label. It is used for classification as well as regression problems. It is very effective against noise and outliers in data, and it can handle thousands of input variables without deleting any of them. On classification data, RF measures the quality of a split using the Gini index, defined by Eqn (4).
$$Gini = 1 - \sum_{i=1}^{C} p_i^2 \qquad (4)$$
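As a small worked illustration of Eqn (4), the sketch below computes the Gini index of a set of class labels, assuming the standard form in which the squared relative frequency of every class is subtracted from 1. The function name and the sample label lists are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """Gini impurity: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0  # an empty branch is treated as pure
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# A branch with an even spam/ham mix is maximally impure for two classes,
# while a branch containing a single class is pure.
mixed_branch = gini_index([1, 1, 0, 0])
pure_branch = gini_index([1, 1, 1, 1])
skewed_branch = gini_index([1, 1, 1, 0])
print(mixed_branch, pure_branch, skewed_branch)
```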
The Gini index uses the classes and their probabilities to determine the Gini of each branch. C indicates the number of classes and $p_i$ indicates the relative frequency of class i. The training part of the dataset is used to train the RF model and the testing part is used to test the trained model.
RBF is a type of ANN. Compared with networks that have multiple hidden layers, an RBF network computes quickly. It has many uses, such as classification, time series prediction and system control. In the RBF network, every hidden node represents one kernel function, and the range of each kernel function is defined by its center and width. When the attributes are near the center, the output of the kernel function is high; the output falls toward zero as the distance of the attributes from the center increases. One popular kernel function is the Gaussian function, which is applied to the training data to train the RBF model and to the testing data to test the trained model. The RBF network consists of inputs, a hidden layer and an output. Mathematically, the input $F_k(x)$ to the kth output node is given by Eqn (5), where $Q_k$ represents the number of hidden nodes linked with target k, q refers to the qth hidden node of target k and $G_{kq}(x)$ is the response function of the qth hidden node for target k.
The Decision Tree is a graphical representation of possible solutions. It is a predictive learning model used in classification to predict the categorical class label. It builds a tree that represents different rules for classifying class labels. It considers all attributes to be equally important and independent. The top node of the tree is known as the root node and the last nodes are leaf nodes. The Decision Tree can handle both categorical and numerical data. To build a Decision Tree, we first find the entropy of the target attribute. Second, we find the information gain of every attribute. The attribute that has the greatest information gain is chosen.
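The formulas for the two types of entropy and for the information gain are defined in Eqs (6), (7) and (8).
$$E(S) = \sum_{i=1}^{C} -p_i \log_2 p_i \qquad (6)$$
where Eqn (6) is the entropy computed from the frequency table of one attribute, S represents the target attribute or class, C is the number of classes and $p_i$ is the probability (relative frequency) of class i in the data.
$$E(T, X) = \sum_{c \in X} P(c)\,E(c) \qquad (7)$$
where the entropy is computed from the frequency table of two attributes. T is the target label and X is the attribute. E(c) is the entropy of the target within the subset where X takes the value c, and P(c) is the probability of that value.
The information gain of attribute X is the reduction in entropy obtained by splitting on it:
$$Gain(T, X) = E(T) - E(T, X) \qquad (8)$$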
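The attribute-selection step of the Decision Tree can be sketched as follows. This is an illustrative implementation of the one-attribute entropy, the two-attribute entropy and their difference (the information gain) for categorical attribute values; the function names and the toy attributes are assumptions, and continuous attributes such as those in spambase would first need to be thresholded or binned.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of the target: sum of -p_i * log2(p_i) over the classes."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def conditional_entropy(attribute, labels):
    """Entropy of the target after splitting on the attribute's values."""
    n = len(labels)
    total = 0.0
    for value in set(attribute):
        subset = [y for a, y in zip(attribute, labels) if a == value]
        total += len(subset) / n * entropy(subset)
    return total


def information_gain(attribute, labels):
    """Reduction in target entropy obtained by splitting on the attribute."""
    return entropy(labels) - conditional_entropy(attribute, labels)


labels = [1, 1, 0, 0]
informative = ["high", "high", "low", "low"]    # separates the classes exactly
uninformative = ["a", "b", "a", "b"]            # tells us nothing about the class
gain_informative = information_gain(informative, labels)
gain_uninformative = information_gain(uninformative, labels)
print(gain_informative, gain_uninformative)
```

The tree would split on `informative`, the attribute with the greater gain.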

ANN is an important classifier among machine learning algorithms. It consists of three layers. The first layer is the input layer, which takes all attributes of the data; its size depends on the number of attributes. The second layer is the hidden layer, and its size is chosen based on the results of multiple experiments. The third layer is the output layer, and its size depends on the class label values of the data. The Multilayer Perceptron applied to the data has two hidden layers, each consisting of five nodes, and the learning rate alpha is 0.01. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, which is an optimization method rather than an activation function, is used to train the Multilayer Perceptron. Weights are assigned randomly and multiplied by each attribute value, and the products of attributes and weights are summed. The activation function is then applied to this sum and the result is passed on toward the output layer. The weight update formula is given in Eqn (9),
where $weight_{old}$ is the old weight, α is the learning rate and x is the attribute value of the data.
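The forward computation of a single node, as described above, can be sketched as follows. The sigmoid activation, the bias term, the weight range and the attribute values are assumptions made for illustration, since the text does not state which activation function is used; the weight update of Eqn (9) is not reproduced here.

```python
import math
import random


def sigmoid(z):
    """Logistic activation; an assumed choice for this illustration."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(attributes, weights, bias=0.0):
    """Multiply each attribute by its weight, sum, then apply the activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, attributes)) + bias
    return sigmoid(weighted_sum)


random.seed(0)                                   # reproducible random weights
attributes = [0.2, 0.0, 0.7]                     # hypothetical normalized values
weights = [random.uniform(-1.0, 1.0) for _ in attributes]
output = neuron_output(attributes, weights)
print(output)
```

A full Multilayer Perceptron repeats this computation for every node of every layer, feeding the outputs of one layer to the next.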
LR was used in the biological sciences in its early years. It is used in classification problems where the target variable is categorical. In logistic regression, a specific threshold is defined: one class is assigned above the threshold and the other class below it. Its graph is S-shaped, and its value always lies between 0 and 1. The Sigmoid activation function is used in logistic regression. The equation of the linear model is defined by Eqn (10).
$$y = b_0 + b_1 x \qquad (10)$$
where y is the predicted value, which depends on x, and x is the independent variable. $b_0$ and $b_1$ are constants: $b_0$ moves the curve left or right and $b_1$ determines the steepness of the curve. LR is calculated as follows:
$$p = \frac{1}{1 + e^{-(b_0 + b_1 x)}} \qquad (11)$$
where e is the base of the natural logarithm, approximately 2.7183, and the above equation is the Sigmoid function on which logistic regression depends.
The support vector classifier is a supervised machine learning algorithm. It predicts class labels by finding the hyperplane that maximizes the margin, that is, the distance between the classes. The vectors that define the hyperplane are called support vectors. It can separate both linearly and nonlinearly separable data efficiently. To optimize the result, $\|w\|^2 = w^T w$ is minimized for training samples $x_i$ with labels $y_i \in \{-1, 1\}$ as follows.
$$\min_{w,\,b,\,\zeta} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i \quad \text{subject to} \quad y_i \left( w^T x_i + b \right) \ge 1 - \zeta_i, \;\; \zeta_i \ge 0 \qquad (12)$$
where C is the penalty term that controls the strength of the penalty, and some samples are allowed to be at a distance $\zeta_i$ from their correct margin boundary.
KNN is a supervised machine learning algorithm that can be used for classification as well as regression problems. It stores all training samples and predicts testing samples based on a distance function. The classification of a testing sample is based on the majority vote of its neighbors. The distance function is measured by Eqn (13).
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (13)$$
where d is the Euclidean distance function between two samples x and y with n attributes. This equation is only valid for continuous variables. If k = 1, then the testing case is assigned the class of its nearest neighbor. There are several advantages of KNN given as follows: (1) No additional parameters need to be added to the KNN model.
(2) It is easy to use and implement.
(3) It is a versatile model that can be used for multiple purposes.
The training set is given to the KNN model to train it, and the testing set is given to the trained KNN model for prediction.
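The KNN procedure described above (store the training samples, measure Euclidean distance, take a majority vote) can be sketched as follows. The function names, the choice k = 3 and the two-attribute toy samples are hypothetical and serve only to illustrate the idea.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Euclidean distance between two samples with continuous attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbors."""
    distances = sorted(
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy training set: two normalized attributes per email, 1 = spam, 0 = ham.
train_points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.95)]
train_labels = [0, 0, 1, 1, 1]

prediction_ham = knn_predict(train_points, train_labels, (0.15, 0.15), k=3)
prediction_spam = knn_predict(train_points, train_labels, (0.9, 0.9), k=3)
print(prediction_ham, prediction_spam)
```

With k = 1 the same function simply returns the label of the single closest training sample.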

Evaluation and results
Finally, the performance of all the applied machine learning classifiers (Naive Bayes, Random Forest, Radial Basis Function, Decision Tree, Artificial Neural Network, Logistic Regression, Support Vector Classifier and K-Nearest Neighbor) is evaluated. Evaluation is done using a confusion matrix and by calculating the Precision, Recall, Accuracy and F-Measure of each classifier. These measures are validated using 10-fold cross-validation, in which the whole dataset is split into 10 parts and 10 iterations are performed. In the first iteration, the first part becomes the test data and the other 9 parts become the training data. In the second iteration, the second part becomes the test data and all the other parts (1 and 3 to 10) become the training data, and so on for all 10 iterations. The confusion matrix is described in Figure 2. Figure 3 shows the performance of all the machine learning classifiers in terms of Precision, Recall, Accuracy and F-Measure.
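The evaluation procedure can be sketched as below: one helper yields the train and test indices of each fold, and another derives the four measures from the cells of a confusion matrix. This is a simplified illustration; it splits the data in order without shuffling or stratifying by class (unlike the Stratified K-Fold utility mentioned earlier), and the function names and the confusion-matrix counts are hypothetical rather than results from the paper.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def evaluation_measures(tp, fp, tn, fn):
    """Precision, Recall, Accuracy and F-Measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_measure": f_measure}


folds = list(k_fold_indices(4601, k=10))   # 4601 instances, as in spambase
measures = evaluation_measures(tp=90, fp=10, tn=85, fn=15)  # made-up counts
print(len(folds), measures)
```

In a full experiment the classifier would be trained on `train` and tested on `test` in each iteration, with the ten sets of measures averaged at the end.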

Comparison of results
The Point-Biserial correlation is used to measure the relationship between a continuous variable and a binary variable.
Table 3. Results about confusion matrix of each classifier at 10-fold cross-validation
Table 4. Evaluation measures of each classifier on spambase UCI classification
(See Table 5.) In [4], a total of ten machine learning classifiers are applied to the dataset, and the highest accuracy, 95.45%, is achieved by using Infinite latent feature selection with RF. In [15], an accuracy of 85.31% is achieved by using weighted feature selection with ANN on the spambase UCI dataset. Better accuracy is achieved by using the Point-Biserial feature selection because it helps to extract the relevant features for the classification of spam and ham emails. Our proposed method shows that the highest accuracy is achieved when the features selected by the Point-Biserial feature selection are used as input for the ANN, which achieves an accuracy of 98.06%. The classifiers that give the best results in Table 4 are ANN and Support Vector Machine. In Table 6