Feature selection based on weighted conditional mutual information

Feature selection is an essential step in data mining. Its core is to analyze and quantify the relevancy and redundancy between the features and the classes. The CFR feature selection method rarely considers which feature to choose when two or more features obtain the same value under the evaluation criterion. To address this problem, the standard deviation is employed to adjust the importance between relevancy and redundancy. Based on this idea, a novel feature selection method named Feature Selection Based on Weighted Conditional Mutual Information (WCFR) is introduced. Experimental results on ten data sets show that the proposed method achieves higher classification accuracy.


Introduction
Feature selection is an important step in pattern recognition and data mining. It retains a smaller set of target features, which can reduce the training time and improve the interpretability of the model [1]. Different from other dimension reduction techniques, feature selection does not produce new combinations of features. In other words, it only selects features [2].
Feature selection methods can be broadly classified into three types: filter methods [3,4], wrapper methods [5][6][7] and embedded methods [8]. Feature selection based on mutual information belongs to the first type. Different from the latter two, the filter method is independent of the classifier [9]. Accordingly, the filter method is usually faster than the other two [10].
Mutual information (MI) is usually used to measure the relation between two variables [11]. Feature selection methods based on it take mutual information as the feature selection criterion, which can be written as Eq. (1):

J(X_m) = I(X_m; C | S)    (1)
where X_m is the candidate feature, S denotes the selected feature set and C is the class. Obviously, the higher the score is, the more important the feature is. However, it involves massive calculation of high-dimensional joint probabilities. In some literature, the high-order mutual information is decomposed into the sum of multiple low-order mutual information terms under some independence assumptions [12]. But in the real world, such assumptions are unrealistic. Faced with this, this paper presents a feature selection method, WCFR (Weighted Composition of Feature Relevancy). In the proposed algorithm, the standard deviation is applied to weigh the relations between the features and the feature sets. This paper is organized as follows. Section 2 introduces the related work and some classical feature selection methods based on mutual information. In Section 3, we describe the proposed approach, WCFR. The experimental results on a large amount of data are given in Section 4. Section 5 gives the conclusion and future work.
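As a concrete aside (not part of the original paper), the mutual information terms used throughout these criteria can be estimated directly from empirical frequencies of discrete variables; the function name and the toy arrays below are our own illustration:

```python
import numpy as np

def mutual_information(x, y):
    """Estimate I(X; Y) in bits from two discrete label arrays."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            p_x = np.mean(x == xv)
            p_y = np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

# A feature identical to the class carries H(C) bits; an unrelated one carries 0.
c = np.array([0, 0, 1, 1])
print(mutual_information(c, c))             # 1.0 (H(C) for a balanced binary class)
print(mutual_information([0, 1, 0, 1], c))  # 0.0
```

This plug-in estimator is only valid for discrete data, which is why the experiments later discretize every feature before scoring.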

Related work
Feature selection based on MI belongs to the filter methods. It uses mutual information to select and evaluate the features [9]. Forward search is a greedy algorithm that selects one feature per iteration; in this way, the feature set is obtained after several iterations. Many feature selection methods based on MI have been proposed along these lines.
The earliest method using mutual information is MIM [13], in which MI is used to evaluate the relation between the feature and the class. The evaluation criterion is shown in Eq. (2):

J_MIM(X_m) = I(X_m; C)    (2)
where X_m is the candidate feature and C denotes the class. MIM is simple, but it ignores the relations between the features and the selected feature set, which could lead to a feature subset that involves too many redundant features. Mutual Information based Feature Selection (MIFS) [14], proposed by Battiti, is shown as Eq. (3):

J_MIFS(X_m) = I(X_m; C) − β Σ_{X_s∈S} I(X_m; X_s)    (3)
where X_s denotes a selected feature from the selected feature set S. MIFS considers the redundancy between the candidate feature and the selected features on top of MIM. mRMR [15] is a variant of MIFS. However, there is a potential problem with mutual information: it tends to favor features that take many distinct values. For example, when faced with such features in sequential data, these methods perform poorly. To avoid this situation, some researchers normalized the mutual information by scaling its value to the interval from 0 to 1. La The Vinh has proposed NMIFS (Normalized Mutual Information Feature Selection) [16], which is shown in Eq. (4).
where |Ω_k| denotes the size of the sample space of the variable k. MIFS and NMIFS represent a general idea: the candidate feature should have high relevancy with C and low redundancy with S. Many evaluation criteria of feature selection are developed along this line. Conditional Informative Feature Extraction (CIFE) [17] is proposed by Lin and Tang, and the corresponding criterion is shown in Eq. (5):

J_CIFE(X_m) = I(X_m; C) − Σ_{X_s∈S} [I(X_m; X_s) − I(X_m; X_s | C)]    (5)
where I(X_m; X_s | C) denotes the redundancy between X_m and X_s given C. The description of redundancy in CIFE is more specific than that in MIFS. I(X_m; X_s) − I(X_m; X_s | C) is named the intra-class redundancy [1]. H.Y. Yang proposed JMI [18], which is shown in Eq. (6):

J_JMI(X_m) = I(X_m; C) − (1/|S|) Σ_{X_s∈S} [I(X_m; X_s) − I(X_m; X_s | C)]    (6)
Therefore, JMI can be regarded as CIFE with the added weight 1/|S|. The second item in Eq. (6) uses the average to reflect the central tendency of the redundancy. RelaxFS [12], as shown in Eq. (8), is proposed by Vinh and Zhou; it introduces a new redundancy term containing more redundant information.
where I(X_m; X_i | X_j) denotes the redundancy between X_m and X_i given X_j.
It should be noted that X_i and X_j are both from the selected feature set S. Therefore, the term (1/(|S|(|S|−1))) Σ_{X_j∈S} Σ_{X_i∈S, i≠j} I(X_m; X_i | X_j) captures more redundancy between X_m and S than (1/|S|) Σ_{X_s∈S} [I(X_m; X_s) − I(X_m; X_s | C)]. There is also a feature selection method named CFR [19], proposed by Wanfu Gao. Feature relevancy is composed of two parts in CFR, as shown in Eq. (9):

I(X_m; C) = I(X_m; C | X_s) + I(X_m; C; X_s)    (9)

where I(X_m; C | X_s) denotes the information related to the class and I(X_m; C; X_s) denotes the redundant information. The criterion of CFR maximizes the correlation and minimizes the redundancy, which is shown in Eq. (10):

J_CFR(X_m) = Σ_{X_s∈S} I(X_m; C | X_s) − Σ_{X_s∈S} I(X_m; C; X_s)    (10)
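The decomposition in Eq. (9) rests on the interaction information I(X_m; C; X_s), which can be computed two symmetric ways on any empirical joint distribution. The sketch below (our own estimators, not the paper's code) checks this numerically on random discrete data:

```python
import numpy as np

def mi(x, y):
    """Empirical mutual information (bits) between two discrete arrays."""
    x, y = np.asarray(x), np.asarray(y)
    s = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                s += p_xy * np.log2(p_xy / (np.mean(x == xv) * np.mean(y == yv)))
    return s

def cmi(x, y, z):
    """I(X; Y | Z) as the p(z)-weighted average of I(X; Y) within each Z-slice."""
    x, y, z = map(np.asarray, (x, y, z))
    return sum(np.mean(z == zv) * mi(x[z == zv], y[z == zv]) for zv in np.unique(z))

rng = np.random.default_rng(0)
xm = rng.integers(0, 3, 300)
xs = rng.integers(0, 2, 300)
c = (xm + xs) % 2

# Both differences equal the interaction term I(Xm; C; Xs), so they must agree.
lhs = mi(xm, c) - cmi(xm, c, xs)
rhs = mi(xm, xs) - cmi(xm, xs, c)
assert abs(lhs - rhs) < 1e-9
```

The equality holds exactly (up to floating-point error) for any empirical joint distribution, not just this toy one, since it is an identity of interaction information.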
From the above-mentioned methods, we can see a trend: the relation between features is described more concretely as the criteria move from mutual information to conditional mutual information. However, this involves a lot of computation. For example, the double-sum redundancy term of RelaxFS is computationally more expensive than the single-sum terms of JMI and CIFE: since X_i and X_j both come from the selected feature set S, evaluating the redundant term requires searching S twice. Therefore, how to improve the performance of feature selection without increasing the amount of computation is a problem.
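The cost gap between a single pass and a double pass over S can be made concrete by counting the conditional-MI evaluations needed per candidate feature (a toy count; the set size of 10 is an arbitrary example):

```python
# Illustrative count of redundancy-term evaluations per candidate feature.
# S stands for the selected feature set; 10 is an arbitrary example size.
S = list(range(10))

single_sum_terms = len(S)                                  # JMI/CIFE style: |S| terms
double_sum_terms = sum(1 for j in S for i in S if i != j)  # RelaxFS style: |S|(|S|-1) terms

print(single_sum_terms, double_sum_terms)  # 10 90
```

With |S| = 10 the double sum already costs nine times as many conditional-MI evaluations, and the ratio grows linearly with |S|.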

Problem and method
The newly proposed method, WCFR (Weighted Composition of Feature Relevancy), is an improvement of CFR. Eq. (11), which relates the mutual information among X_m, C and X_s, can be obtained from Eq. (9):

I(X_m; C; X_s) = I(X_m; C) − I(X_m; C | X_s) = I(X_m; X_s) − I(X_m; X_s | C)    (11)

When we use the right-hand side of Eq. (11) in place of the redundancy term in Eq. (10), we obtain Eq. (12):

J(X_m) = Σ_{X_s∈S} I(X_m; C | X_s) − Σ_{X_s∈S} [I(X_m; X_s) − I(X_m; X_s | C)]    (12)
The criterion of CFR is similar to those of MIM, mRMR, JMI, CIFE and RelaxMRMR, because all of them try to find features that are highly relevant to the class and lowly redundant with the selected feature set. Another common point is that relevancy and redundancy are expressed as summations. In fact, this raises a new problem. Suppose two features X_1 and X_2 obtain the same value under Eq. (12); how to distinguish X_1 from X_2 is then a problem. I(X_m; C | X_s) denotes the information that X_m can provide while X_s cannot, and its value is different for each X_s. However, the summation over I(X_m; C | X_s) ignores these differences. Therefore, the difference, measured by the standard deviation, is introduced in the proposed method, as shown in Eq. (13). In Eq. (13), the expressions of the standard deviations δ_1 and δ_2 are given in Eqs. (14) and (16).
The standard deviation is usually used to measure the degree of dispersion: the higher the standard deviation, the higher the degree of dispersion. Hence, we use it to adjust the importance of the relevancy and redundancy terms in WCFR. In this way, WCFR can tackle the above problem of how to choose between X_1 and X_2 when their summations over I(X_m; C | X_s) are equal.
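A greedy forward-selection loop in the spirit of WCFR can be sketched as follows. The exact placement of the standard-deviation weights in Eq. (13) is not reproduced here; scaling each sum by (1 + its standard deviation) is our own reading, and the estimators and data are illustrative:

```python
import numpy as np

def mi(x, y):
    """Empirical mutual information (bits) between two discrete arrays."""
    x, y = np.asarray(x), np.asarray(y)
    s = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                s += p_xy * np.log2(p_xy / (np.mean(x == xv) * np.mean(y == yv)))
    return s

def cmi(x, y, z):
    """Empirical conditional mutual information I(X; Y | Z)."""
    x, y, z = map(np.asarray, (x, y, z))
    return sum(np.mean(z == zv) * mi(x[z == zv], y[z == zv]) for zv in np.unique(z))

def wcfr_select(X, c, k):
    """Greedy forward selection sketch in the spirit of WCFR.

    One possible reading of Eq. (13): the relevancy and redundancy sums are
    each weighted by (1 + standard deviation of their addends). Treat this as
    a sketch, not a reference implementation of the paper's criterion."""
    n_features = X.shape[1]
    # First feature: plain relevancy I(X_m; C), as in most forward schemes.
    selected = [max(range(n_features), key=lambda m: mi(X[:, m], c))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for m in range(n_features):
            if m in selected:
                continue
            rel = [cmi(X[:, m], c, X[:, s]) for s in selected]
            red = [mi(X[:, m], X[:, s]) - cmi(X[:, m], X[:, s], c) for s in selected]
            score = (1 + np.std(rel)) * np.sum(rel) - (1 + np.std(red)) * np.sum(red)
            if score > best_score:
                best, best_score = m, score
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
c = rng.integers(0, 2, 200)
X = np.column_stack([c, rng.integers(0, 2, 200), rng.integers(0, 2, 200)])
print(wcfr_select(X, c, 2))  # feature 0 (a copy of the class) is picked first
```

Note how the loop structure matches the complexity analysis below: one 'while' over k selections, a 'for' over the n candidates, and inner sums over the selected set.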
The pseudo-code of WCFR is shown in Table 1. It contains two parts. The first part is the initialization of the selected feature set S (Lines 1-8), and the second part is the iterative process in which one feature is selected in each iteration by Eq. (13).

Complexity analysis
WCFR contains a 'while' loop and two 'for' loops, and its time complexity is O(k^2 nm), where k is the number of selected features, n is the number of all features and m is the number of samples. The complexity of WCFR is the same as that of CFR, CIFE and JMI; it is higher than that of MIM and lower than that of RelaxFS.

Data sets
To verify the effectiveness of the proposed WCFR, ten data sets from different fields are used in the experiments; all of them can be found in the UCI repository [20]. They include handwritten digit data (Semeion, Mfeatfac), text data (CANE9), voice data (Isolet), image data (ORL, COIL20, WarpPIE10p) and biological data (TOX171). More detailed descriptions can be found in Table 2. Each data set is normalized and discretized, similar to other literature [12,19].
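The normalize-and-discretize step can be sketched as below, following the settings given in the experiment section (scaling into [−1, 1], then five equal-size bins). Reading "equal-size" as equal-width bins is our assumption, and the helper name is ours:

```python
import numpy as np

def preprocess(col, n_bins=5):
    """Scale one feature column to [-1, 1], then discretize into n_bins
    equal-width bins (our reading of 'equal-size'); returns codes 0..n_bins-1."""
    col = np.asarray(col, dtype=float)
    lo, hi = col.min(), col.max()
    # Min-max scaling into [-1, 1]; a constant column maps to all zeros.
    scaled = np.zeros_like(col) if hi == lo else 2 * (col - lo) / (hi - lo) - 1
    edges = np.linspace(-1, 1, n_bins + 1)
    # np.digitize against the inner edges yields integer codes 0..n_bins-1.
    return np.digitize(scaled, edges[1:-1])

x = np.array([0.0, 2.5, 5.0, 7.5, 10.0])
print(preprocess(x))  # [0 1 2 3 4]
```

Discretization like this is what makes the count-based MI estimators applicable to continuous UCI features.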

Experiment settings
In this experiment, we use Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) to evaluate the different feature selection methods. They are two classical and widely used classifiers in the related references [12,19], in which K is set to 3 for KNN and a linear kernel is used for SVM [12]. We adopt the same settings.
The experiment consists of three parts. The first part is data preprocessing. For the validity of the calculation, the value of every feature is scaled into the interval [−1, 1] and is categorized into five equal-size bins. The second part is feature subset generation. If the number of features is less than 50, the size of the feature subset is equal to the number of features; otherwise, the size of the feature subset is set to 50. The feature selection methods are used to generate the feature subsets. The third part is feature subset evaluation. In this experiment, we use the average classification accuracy and Macro-F1 to evaluate the classifiers on the feature subsets. Classification accuracy is the proportion of correctly classified samples to the total number of samples. F1 is defined as follows:

F1 = 2PR / (P + R)    (18)
where P denotes the precision and R the recall. F1 measures binary classification performance. If the number of categories is greater than two, the macro-averaged F1 can be used, which treats the F value of an n-class classification problem as the average of the F values of n binary classification problems. Macro-averaged F1 is defined in Eq. (19):

Macro-F1 = (1/n) Σ_{i=1}^{n} F1_i    (19)

Tables 3 and 4 give the average classification accuracy using KNN and SVM on the ten data sets. Let m be the size of the feature subset; it varies from 1 to 50. We calculate the classification accuracy with 10-fold cross-validation for each m, and the value in each cell is obtained by averaging the accuracy over the different values of m. The maximum value of each row in the table is marked in bold. The row named 'Average' gives the average classification accuracy over all the data sets. WCFR is compared with the other existing MI-based methods using a Kolmogorov-Smirnov test. The Kolmogorov-Smirnov test is a non-parametric test [21] that does not require knowledge of the data distribution. The default significance level of the K-S test is 5%, which we use in our experiment. If the p-value is less than 5%, the two algorithms are considered to have a significant difference; if the p-value is greater than 5%, there is no significant difference. Tables 3 and 4 show the results of the experiment. We employ '+', '=' and '−' to indicate that WCFR performs better than, equal to or worse than the other methods. The last row of Tables 3 and 4, named "W/T/L", gives the counts of how often WCFR wins/ties/loses compared with the other methods. The statistical results are summarized in Figure 1.
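The macro-averaged F1 of Eqs. (18) and (19) can be computed as below; the function name and toy labels are our own, and ill-defined per-class F1 (no predictions or no true samples for a class) is set to 0 here by convention:

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 = 2PR/(P+R), averaged over the classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for cls in np.unique(y_true):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return float(np.mean(scores))

print(macro_f1([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2]))  # 1.0
```

Unlike plain accuracy, this gives every class equal weight, which is why it complements accuracy on the imbalanced data sets above.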
It can be seen from Table 3 that the highest results are obtained by RelaxFS, CFR and WCFR. The possible reason is that these three methods describe the relations between the features and the classes more precisely than the other methods. In Eqs. (8), (10) and (13), relevancy and redundancy in the criteria of RelaxFS, CFR and WCFR are measured by conditional mutual information, which differs from MIM, JMI, mRMR and CIFE. The average classification accuracy of RelaxFS is slightly higher than that of CFR, because RelaxFS can eliminate more redundant information through the second item of Eq. (8). Our proposed WCFR outperforms RelaxFS and CFR on seven data sets; the reason is the weight introduced in WCFR. Table 4 shows the classification results for KNN. We can see that different classifiers have a different influence on the evaluation of the feature subsets. When KNN is used as the classifier, the result of RelaxFS is slightly better than that of CFR, but the new WCFR method is better than RelaxFS. RelaxFS obtains the highest accuracy on TOX171, which means the hypothesis in RelaxFS fits the pattern of the TOX171 data under KNN. As can be seen from Tables 3 and 4, WCFR can improve the classification accuracy compared with the other methods.
In order to observe the influence of the feature subset size on accuracy, the performance on different data sets is given in Figures 2 and 3. In Figures 2(a), 2(b) and 3(a), the trends of WCFR and CFR are similar, especially when the dimensionality of the data goes up, but combined with Table 3, WCFR is actually slightly better than CFR. The accuracy of WCFR is lower than that of CFR on TOX171 while higher than that of RelaxFS, which means that in the worst case the weight added to CFR does not affect the performance of CFR to a large extent. On the other data sets, the accuracy of CFR is higher than that of RelaxFS on Sonar and Isolet, while the accuracy of RelaxFS is higher than that of CFR on Semeion, ORL and COIL20. However, it is obvious that the proposed method outperforms CFR, RelaxFS, MIM, JMI, mRMR and CIFE on these data sets. Tables 5 and 6 show the highest classification accuracy of the seven algorithms for SVM and KNN. In Table 5, the results of WCFR are the same as those of CFR on Vehicle, Sonar and CANE9, and better than those of CFR on the remaining data sets. The highest classification accuracy of WCFR with KNN is worse than that with SVM; this situation is similar to the above experimental results. Therefore, WCFR is more suitable for SVM than for KNN. In Table 6, WCFR is also the best feature selection method except on TOX171. On the whole, the four tables and three figures lead to the same conclusion: the weight used in WCFR works and can improve CFR.

Comparisons on Macro-F1.
In order to measure the influence of the weight in WCFR, Macro-F1 is used to evaluate the results of the classifiers on the different feature subsets. Table 7 shows the results of Macro-F1 with SVM. It can be seen that the F1 of WCFR is higher than that of CFR on all data sets, and is lower than that of RelaxFS only on Semeion and ORL. The evaluation criterion of RelaxFS can eliminate more redundant information, while the weight in WCFR can adjust the importance between relevancy and redundancy. In Table 8, the Macro-F1 result of WCFR is lower than that of CFR on TOX171, and is higher than those of the other methods on Vehicle, Mfeatfac, Isolet, CANE9, COIL20 and WarpPIE10p.

Table 5. Highest classification accuracy of the seven algorithms using SVM.
Table 6. Highest classification accuracy of the seven algorithms using KNN.
Table 7. Macro-F1 of the seven algorithms using SVM on different data sets.

Conclusions and future work
Mutual information is usually used to measure the relations between features and classes. Most feature selection methods based on low-order mutual information try to describe these relations more precisely. We introduce a new method to improve the quality of the feature subset by using standard deviations. The new method, WCFR, is an improvement on CFR that does not increase the time complexity, and the experimental results demonstrate this improvement. WCFR is more effective than the other methods, but the improvement does not solve the essential issue of feature selection based on mutual information: the methods mentioned above are all based on low-order mutual information, which leads to the loss of a lot of information. In the future, we plan to describe the relations among features with high-order mutual information.