Improving the classification accuracy using hybrid techniques


Purpose
Heart disease has become one of the leading causes of death among Egyptians, with 500 deaths per 100,000 occurring annually in Egypt. Medical data are high-dimensional, which lowers the classification accuracy of heart disease data. The purpose of this study is therefore to improve the classification accuracy of heart disease data with a hybrid classification technique, helping doctors diagnose heart disease efficiently.


Design/methodology/approach
This paper uses a new approach that integrates dimensionality reduction techniques, namely multiple correspondence analysis (MCA) and principal component analysis (PCA), with fuzzy c-means (FCM), and then with both multilayer perceptron (MLP) and radial basis function networks (RBFN), which separate patients into categories based on their diagnosis results. A comparative study of performance covers six structures, MLP, RBFN, MLP via FCM–MCA, MLP via FCM–PCA, RBFN via FCM–MCA and RBFN via FCM–PCA, to identify the best classifier.


Findings
The results show that the MLP via FCM–MCA classifier structure has the highest classification accuracy and outperforms the other methods, and that smoking was the most important factor causing heart disease.


Originality/value
This paper shows the importance of integrating statistical methods in increasing the classification accuracy of heart disease data.


has large amounts of complex data about patients, disease diagnosis, electronic patient records (Haq et al., 2018), etc.
The factors affecting heart disease have multiple levels (multiplicity), so it is not easy to discriminate between them, which lowers the classification accuracy of heart disease data. This paper therefore evaluates the integration of dimensionality reduction techniques, fuzzy c-means (FCM), and both multilayer perceptron (MLP) and radial basis function networks (RBFNs) to improve the classification accuracy of heart disease data. The proposed structure comprises three stages. In the first stage, the study applies multiple correspondence analysis (MCA) and principal component analysis (PCA), both dimensionality reduction techniques, to the heart disease data set to arrange the relationships between the variables and reduce them to a smaller number of dimensions that carry most of the variance. In the second stage, the dimensions obtained from PCA and MCA are used as inputs to FCM, which eases the separation of clusters, allows observations to be classified more precisely and raises FCM classifier performance (Hwang et al., 2010; Ziasabounchi and Askerzade, 2014), so that the relationships between the variables are interpreted correctly. In the third stage, the dimensions obtained from MCA and PCA (each separately) form the input layer of the MLP and RBFNs, whereas the clusters obtained from the FCM via MCA and FCM via PCA classifiers act as the output layer, because FCM groups the data with different membership values. Based on these membership values, the MLP backpropagation algorithm and the RBFNs classify the heart disease data into two groups (infected and uninfected), which reduces the training period of the neural networks and increases classification accuracy.
Finally, all the methods used in this paper were compared before and after integration. FCM via MCA was found to be the best primary classifier, and MLP via FCM-MCA the best final classifier in this study.

Literature review
Many studies have focused on the diagnosis and classification of heart disease data. These studies applied different statistical methods to a specific problem and achieved a classification accuracy of 75% or higher, and some integrated classification methods with clustering methods to improve classification accuracy. Some examples of such studies are as follows. Bhatla and Jyoti (2012) review the results of applying neural networks, decision trees, fuzzy logic and genetic algorithms, all of which have been closely associated with heart disease diagnosis; in terms of classification accuracy in recent years, the neural network was the best. Dalvi et al. (2016) integrated neural networks with a dimensionality reduction technique on the Massachusetts Institute of Technology arrhythmia electrocardiogram (ECG) database and achieved a classification accuracy of 96.97%. The results confirmed that using PCA reduced classification complexity without major changes in performance. Deng (2020) sought new insight into improving general clustering algorithms by examining one specific clustering algorithm, FCM. That paper identifies three common problems of clustering algorithms, one of which is noise, and explains that this problem can be addressed by combining adversarial learning with the FCM algorithm.
The study by El-Bialy et al. (2015) integrated the outcomes of machine learning analyses applied to coronary artery heart disease data sets to compare classification accuracy. The results showed that the classification accuracy of the collected data set is 78.06%, which is higher than that of all the separate data sets. Haq et al. (2018) used machine learning algorithms as a hybrid system to diagnose heart disease: logistic regression, k-nearest neighbors, artificial neural network (ANN), support vector machine (SVM), Naive Bayes, decision tree and random forest, combined with three feature selection algorithms (Relief, minimum redundancy maximum relevance, and least absolute shrinkage and selection operator) to select the important features. Logistic regression with Relief was the best predictive system, with a classification accuracy of 89%.
Jabbar et al. (2013) integrated associative classification algorithms with a genetic approach for heart disease prediction; this integration gave higher classification accuracy and better heart disease prediction than the integration of Naive Bayes and neural network methods. Kumar et al. (2018) aimed to build a classification model for heart disease patients; they used four classification methods, Naive Bayes, MLP, Random Forest and Decision Table, to classify whether a patient tested positive or negative for heart disease. The results showed that Naive Bayes has the highest classification accuracy (87.20%) for diagnosing heart patients.
Kumari and Godara (2011) compared four data mining classification techniques, the Ripper algorithm, decision tree, ANN and SVM, for predicting cardiovascular disease from coronary heart illness data. The results show that SVM predicts cardiovascular disease with the lowest error rate, with a classification accuracy of 84.12%. Kurt et al. (2008) used the receiver operating characteristic curve, hierarchical cluster analysis and multidimensional scaling to compare the classification performance of MLP, logistic regression, a classification and regression tree, radial basis function (RBF) and self-organizing feature map in predicting the presence of coronary artery disease (CAD). MLP gave a classification accuracy of 75.3% and was the best technique for predicting the presence of CAD in this data set.
Le (2019) applied a fuzzy c-means clustering interval Type-2 cerebellar model articulation neural network (FCM-IT2CMANN) method to help physicians improve diagnostic accuracy for breast cancer and liver disease. The proposed method combines two classifiers, where the IT2CMANN is the primary classifier and the FCM algorithm is the preclassifier; the results showed that the proposed classifier performs better than other methods. Patra and Pradhan (2008) integrated FCM, independent component analysis (ICA) and neural networks (NN) and compared the result with the ICA-NN structure. The proposed FCM-ICA-NN was faster and more accurate than ICA-NN. Patra et al. (2009) compared the performance of four structures, FCM-NN, PCA-NN, FCM-ICA-NN and FCM-PCA-NN, for the classification of ECG arrhythmias. They confirmed that the FCM-PCA-NN structure was faster and better than the other techniques. Wiharto and Suryani (2020) compared FCM with the k-means clustering algorithm for retinal blood vessel segmentation. The statistical test comparing them on area under the ROC curve values gave p-values < 0.05 at a confidence level of 95%. They confirmed that retinal vascular segmentation with the FCM method is significantly better than with k-means.
Ziasabounchi and Askerzade (2014) integrated fuzzy clustering and k-means with PCA to diagnose heart disease patients; they showed that classification based on k-means via PCA performs best, with a classification accuracy of 87%.

Hybrid techniques
The above results show that studies integrating classification techniques achieved higher classification results for heart disease data and for other data, so we integrate artificial neural networks, fuzzy clustering and dimensionality reduction techniques to help increase the classification accuracy of heart disease data of Egyptian patients. The study compares the integrated algorithms to illustrate the importance of integrating the methods mentioned earlier.

Proposed methodology

Multiple correspondence analysis
MCA is an analytical tool that shows how strong the relationships are between large groups of variables. It is a dimensionality reduction technique that organizes multi-level data into reduced dimensions based on the percentage of explained variance that each variable accounts for; that is, it increases the homogeneity between these variables (Hwang et al., 2006). In addition, the object scores can be used as a preliminary stage for other techniques (Hwang et al., 2010). Initially, applying MCA to the indicator matrix yields the scores for the row and column factors; these scores are then rescaled to obtain another $J \times J$ table called the Burt matrix ($B = X^{T}X$), which is used to obtain the MCA solution. Only dimensions with an eigenvalue greater than 1 are retained, and dimensions with an eigenvalue below 1 are rejected; in this study, six dimensions were obtained. The calculation was done using SPSS 16 and STATA 15 (Abdi and Valentin, 2007).
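To make the indicator matrix and the Burt matrix concrete, the following is a minimal sketch in plain Python. The toy records, the variable names and the helper functions `indicator_matrix` and `burt_matrix` are hypothetical illustrations, not the study's data or the SPSS/STATA procedure.

```python
def indicator_matrix(rows):
    """One-hot encode categorical rows into an indicator matrix X."""
    n_vars = len(rows[0])
    # Each column is one (variable index, level) pair, in sorted order.
    columns = sorted({(j, row[j]) for row in rows for j in range(n_vars)})
    X = [[1 if row[j] == level else 0 for (j, level) in columns] for row in rows]
    return X, columns


def burt_matrix(X):
    """Burt matrix B = X^T X: cross-tabulation of every pair of levels."""
    n_cols = len(X[0])
    return [[sum(row[a] * row[b] for row in X) for b in range(n_cols)]
            for a in range(n_cols)]


# Hypothetical toy records with two categorical variables: (smoker, gender).
records = [("yes", "m"), ("no", "f"), ("yes", "f"), ("no", "m")]
X, columns = indicator_matrix(records)
B = burt_matrix(X)

for (var, level), row in zip(columns, B):
    print(var, level, row)
```

The diagonal of B holds the frequency of each level, and the off-diagonal entries hold the joint frequencies of two levels, which is the table that MCA then decomposes.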

Principal component analysis
PCA is one of the most popular dimensionality reduction techniques. It converts the original data into a new coordinate system of uncorrelated variables, also called a reduced-dimension space, by combining related variables into a single factor that can be used in further analysis, while unnecessary variables that have no effect on the target variable are eliminated without losing much information (Ziasabounchi and Askerzade, 2014).
The steps of PCA are as follows. Given a data set $X = \{x_1, x_2, \ldots, x_n\}$: convert X to standardized variables; calculate the correlation matrix or the covariance matrix; calculate the eigenvectors and eigenvalues; compute the principal components and form a feature vector; and reduce the dimensions of the data set (Bhateja et al., 2018).
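The PCA steps listed above can be sketched in plain Python as follows. This is an illustrative sketch only: the toy data are hypothetical, power iteration is used here as a simple stand-in for a full eigendecomposition, and only the leading component is extracted.

```python
import math


def standardize(data):
    """Center each column and scale it to unit sample variance."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / (n - 1))
            for j in range(p)]
    return [[(row[j] - means[j]) / stds[j] for j in range(p)] for row in data]


def correlation_matrix(Z):
    """Correlation matrix of already standardized data."""
    n, p = len(Z), len(Z[0])
    return [[sum(row[a] * row[b] for row in Z) / (n - 1) for b in range(p)]
            for a in range(p)]


def leading_component(R, iterations=200):
    """Power iteration: leading eigenvalue and unit eigenvector of R."""
    p = len(R)
    v = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-uniform start (assumption)
    for _ in range(iterations):
        w = [sum(R[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Rv = [sum(R[a][b] * v[b] for b in range(p)) for a in range(p)]
    eigenvalue = sum(v[a] * Rv[a] for a in range(p))  # Rayleigh quotient
    return eigenvalue, v


# Hypothetical toy data: 5 observations, 3 numeric variables.
data = [[2.0, 4.1, 1.0], [3.0, 6.2, 0.5], [4.0, 7.9, 1.5],
        [5.0, 10.1, 0.7], [6.0, 12.0, 1.2]]
Z = standardize(data)
R = correlation_matrix(Z)
eigenvalue, vector = leading_component(R)
retained = eigenvalue > 1.0  # eigenvalue-greater-than-1 retention rule
scores = [sum(z * v for z, v in zip(row, vector)) for row in Z]
print(round(eigenvalue, 3), retained)
```

The `scores` list is the reduced one-dimensional representation of the observations; retaining further components would repeat the same step on the deflated matrix.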

Fuzzy c-means method
The FCM method is one of the most widely used fuzzy clustering analysis techniques, especially in medical research; it is based on the idea of partial membership and fuzzy partitioning.
Assume that $X = \{x_1, x_2, \ldots, x_n\}$ represents the set of data elements (inputs) and that k, the number of clusters, is an integer such that $2 \le k \le n$. Each element $x_i$ belongs to cluster j with a membership grade $u_{ij}$.
The membership grade quantifies the grade of membership of the element to the fuzzy set. The value 0 means that it is not a member of the fuzzy set; the value 1 means that it is fully a member of the fuzzy set. The values between 0 and 1 characterize fuzzy members, which belong to the fuzzy set only partially (Hunt, 2012).

REPS 6,3
The FCM technique divides the data set X into k fuzzy clusters. The idea of FCM is to find the center of each cluster and minimize the objective function $J_m$, which takes the following form (Patra et al., 2009):

$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{m} \, \lVert x_i - c_j \rVert^{2}$$

where $d_{ij} = \lVert x_i - c_j \rVert$ is the Euclidean distance between the data point $x_i$ and the center $c_j$; $u_{ij}$ is the degree of membership of the data point $x_i$ in cluster j, which lies in the range [0, 1]; m is the weighting (fuzziness) exponent; the $x_i$ are the d-dimensional data ($n \times p$); the $c_j$ are the d-dimensional cluster centers ($k \times p$); and $\lVert \cdot \rVert$ is the measure of similarity between any measured data point and a center. To minimize the objective function, the memberships $u_{ij}$ and the cluster centers $c_j$ are updated subject to the following conditions (Bezdek, 2013):

$$u_{ij} = \frac{1}{\sum_{l=1}^{k} \left( \dfrac{\lVert x_i - c_j \rVert}{\lVert x_i - c_l \rVert} \right)^{2/(m-1)}}, \qquad c_j = \frac{\sum_{i=1}^{n} u_{ij}^{m} \, x_i}{\sum_{i=1}^{n} u_{ij}^{m}}$$

The iteration stops when $\lVert U^{(t+1)} - U^{(t)} \rVert < \varepsilon$, where $\varepsilon$ is a termination tolerance between zero and one and t is the iteration index. The FCM algorithm consists of the following steps (Liu et al., 2011):
Step 1: Initialize the partition matrix $U = [u_{ij}]$ as $U^{(0)}$.
Step 2: At iteration t, calculate the center vectors $C^{(t)} = [c_j]$ from $U^{(t)}$.
Step 3: Update $U^{(t)}$ to $U^{(t+1)}$.
Step 4: If $\lVert U^{(t+1)} - U^{(t)} \rVert < \varepsilon$, stop; otherwise return to Step 2.
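The FCM iteration described above can be sketched in plain Python as follows. This is a minimal illustration, not the software used in the study: the toy points, the random initialization, the default fuzziness exponent m = 2 and the use of the largest absolute membership change as the stopping norm are all assumptions.

```python
import math
import random


def fcm(points, k, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy c-means; returns the membership matrix U and the cluster centers."""
    rng = random.Random(seed)
    n, p = len(points), len(points[0])
    # Step 1: random partition matrix whose rows sum to 1 (assumed initialization).
    U = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(k)]
        total = sum(row)
        U.append([u / total for u in row])
    for _ in range(max_iter):
        # Step 2: centers as means of the points weighted by u_ij^m.
        centers = []
        for j in range(k):
            weights = [U[i][j] ** m for i in range(n)]
            total = sum(weights)
            centers.append([sum(w * points[i][d] for i, w in enumerate(weights)) / total
                            for d in range(p)])
        # Step 3: update the memberships from the distances to the centers.
        new_U = []
        for x in points:
            dists = [max(math.dist(x, c), 1e-12) for c in centers]
            new_U.append([1.0 / sum((dists[j] / dists[l]) ** (2.0 / (m - 1.0))
                                    for l in range(k))
                          for j in range(k)])
        # Step 4: stop when the memberships barely change (max-norm, an assumption).
        change = max(abs(new_U[i][j] - U[i][j]) for i in range(n) for j in range(k))
        U = new_U
        if change < eps:
            break
    return U, centers


# Hypothetical toy data: two well-separated groups in two dimensions.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
U, centers = fcm(points, k=2)
labels = [max(range(2), key=row.__getitem__) for row in U]
print(labels)
```

Each row of `U` sums to 1, so a point near one center receives a membership close to 1 for that cluster and close to 0 for the other.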

Multilayer networks
MLP is one of the most important feed-forward neural network classification methods. It is trained by the backpropagation algorithm, which trains a nonlinear feed-forward network and generalizes the error-reduction training approach: the algorithm aims to reduce the error between the target outputs and the network outputs by adjusting the weights. It does so by propagating the errors backward through the network to adjust the network weights, and backpropagation is implemented in three stages.
The steps of the backpropagation algorithm (Chakraverty et al., 2019) are as follows: generate small random values for the weights (initial values); present the inputs and target outputs, and prepare the input vector values (x(1), x ...); calculate the actual outputs using the activation function, that is, the output signals $y_1, y_2, \ldots, y_{N^{(M)}}$; and modify the weights ($w_{ij}$) and biases ($b_i$), starting with the weights between the output layer and the hidden layers. In the output and weight-update formulas, $x_j(n)$ is the output of node j in cycle n, l is the layer, k is the number of outputs of the nodes in the neural network, M denotes the output layer, $\varphi$ denotes the activation function and $\mu$ is the learning rate used to modify the weights during training. A larger learning rate speeds up convergence but may also cause the network to oscillate around the extreme values, so that training does not deliver the desired benefit. To achieve faster convergence with minimal oscillation, a momentum term is added to the basic weight-update formula; it improves the efficiency and speed of the training process through continuous adjustment, so that the effect of the learning rate is carried over from one update to the next. After training is complete, the weights of the multilayer network are frozen and ready for use during the testing phase (Patra et al., 2009).
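The backpropagation steps with a momentum term can be illustrated with the sketch below. It uses a textbook form of the update, weight change = mu * delta * input + alpha * previous change, which may differ in detail from the formulas of the cited sources. The single hidden layer, the sigmoid activation, the momentum coefficient `alpha`, the seed and the toy logical-OR data are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_mlp(samples, targets, hidden=3, mu=0.5, alpha=0.5, epochs=2000, seed=1):
    """One-hidden-layer MLP trained by online backpropagation with momentum."""
    rng = random.Random(seed)
    n_in = len(samples[0])
    # Small random initial weights; the last entry of each row is the bias.
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]
    dw_h = [[0.0] * (n_in + 1) for _ in range(hidden)]  # previous weight changes
    dw_o = [0.0] * (hidden + 1)
    history = []  # mean squared error per epoch
    for _ in range(epochs):
        sse = 0.0
        for x, t in zip(samples, targets):
            # Forward pass: hidden outputs, then the network output.
            xb = list(x) + [1.0]
            h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_h]
            hb = h + [1.0]
            y = sigmoid(sum(w * v for w, v in zip(w_o, hb)))
            sse += (t - y) ** 2
            # Backward pass: error terms for the output and hidden nodes.
            delta_o = (t - y) * y * (1.0 - y)
            delta_h = [delta_o * w_o[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            # Weight updates with momentum, output layer first.
            for j in range(hidden + 1):
                dw_o[j] = mu * delta_o * hb[j] + alpha * dw_o[j]
                w_o[j] += dw_o[j]
            for j in range(hidden):
                for i in range(n_in + 1):
                    dw_h[j][i] = mu * delta_h[j] * xb[i] + alpha * dw_h[j][i]
                    w_h[j][i] += dw_h[j][i]
        history.append(sse / len(samples))
    return w_h, w_o, history


# Hypothetical toy problem: logical OR.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
targets = [0.0, 1.0, 1.0, 1.0]
w_h, w_o, history = train_mlp(samples, targets)
print(history[0], history[-1])
```

After training, `w_h` and `w_o` are left unchanged and used for prediction, which corresponds to freezing the weights for the testing phase.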

Radial basis function network
RBFNs are an attractive alternative to other neural network models. They are commonly used for function approximation and have played an important role in medical diagnostics. The idea of the RBFN derives from the theory of function approximation: the Euclidean distance is calculated from the point being evaluated to the center of each neuron, and a radial basis function is applied to that distance to calculate the weight (effect) of each neuron (Riahi-Madvar et al., 2019). The training steps of the radial basis function network are as follows. First, random values are generated for the weights of the layer. Next, each unit j of the hidden layer is calculated, where x is the input vector, $\varphi(\cdot)$ is the basis function (Gaussian activation function), which depends on $\lVert x - c_j \rVert$, $c_j$ is the center vector of hidden neuron j with the same number of dimensions as x, $w_{ij}$ is the weight connecting node j of the hidden layer to node i of the output layer, m is the number of nodes in the output layer and k is the number of nodes in the hidden layer. A sigmoid function is then applied to each node in the output layer, where L is the number of hidden nodes. The weights linking the hidden and output layers are then updated, where $\eta$ is the learning rate, which takes a value between 0 and 1, $W_k(x)$ is the actual output of the network and $t_k(x)$ is the required (target) output for each pair. These steps are repeated from Step 3 until a small, acceptable error rate is reached, if the desired goal has not yet been reached (Liu et al., 2011).
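The RBFN training steps can be illustrated with the following sketch. It is one plausible reading of the description, not the formulas of the cited sources: the Gaussian width `sigma`, the fixed hidden centers, the gradient-style update of the output weights and the toy data are all assumptions.

```python
import math
import random


def gaussian(x, center, sigma):
    """Gaussian basis function of the distance between x and a center."""
    return math.exp(-math.dist(x, center) ** 2 / (2.0 * sigma ** 2))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rbfn_forward(x, centers, weights, sigma):
    """Hidden activations phi_j and sigmoid outputs y_i."""
    phi = [gaussian(x, c, sigma) for c in centers]
    out = [sigmoid(sum(w * p for w, p in zip(row, phi))) for row in weights]
    return phi, out


def train_rbfn(samples, targets, centers, sigma=1.0, eta=0.5, epochs=500, seed=0):
    """Train only the hidden-to-output weights; centers stay fixed (assumption)."""
    rng = random.Random(seed)
    n_out = len(targets[0])
    # weights[i][j] connects hidden node j to output node i.
    weights = [[rng.uniform(-0.1, 0.1) for _ in centers] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            phi, out = rbfn_forward(x, centers, weights, sigma)
            for i in range(n_out):
                # Squared-error gradient through the sigmoid output.
                delta = (t[i] - out[i]) * out[i] * (1.0 - out[i])
                for j in range(len(centers)):
                    weights[i][j] += eta * delta * phi[j]
    return weights


# Hypothetical toy data: two groups, one hidden center placed on each group.
samples = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8)]
targets = [[0.0], [0.0], [1.0], [1.0]]
centers = [(0.0, 0.0), (5.0, 5.0)]
weights = train_rbfn(samples, targets, centers)
_, low = rbfn_forward((0.0, 0.0), centers, weights, 1.0)
_, high = rbfn_forward((5.0, 5.0), centers, weights, 1.0)
print(low, high)
```

Because each hidden unit responds only to points near its center, the output weights can be learned with a simple update rule while the centers stay fixed.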

Processing steps used in the study to reach the hybrid technique
Step 1: Get the input data.
Step 2: Apply PCA or MCA.
Step 3: Obtain the reduced dimensions from the previous step and use them as inputs to the FCM analysis.
Step 4: Use the clusters obtained from the previous step in the analysis by both multilayer networks and RBFN.
Note that the dimensions of PCA or MCA form the input layer of both the multilayer networks and RBFN, and the clusters obtained from the FCM analysis form the output layer.
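The hand-off between the FCM stage and the neural networks can be sketched as follows. The sketch assumes that each observation is assigned to the cluster with the highest membership grade, which the text does not state explicitly; the function names and the toy values are hypothetical.

```python
def harden(memberships):
    """Assign each observation to its highest-membership cluster (assumption)."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]


def build_training_set(dimensions, memberships):
    """Pair reduced dimensions (network inputs) with cluster labels (network targets)."""
    return list(zip(dimensions, harden(memberships)))


# Hypothetical values: two observations, two reduced dimensions, two clusters.
dimensions = [(0.1, 0.2), (0.9, 0.8)]
memberships = [[0.8, 0.2], [0.3, 0.7]]
training_set = build_training_set(dimensions, memberships)
print(training_set)
```

Each pair in `training_set` then serves as one training example, with the reduced dimensions feeding the input layer and the cluster label acting as the target for the output layer.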

Data source
The population of the study is the data obtained from the Hospital for Cardiac, Chest and Vascular Diseases at Ain Shams University on heart disease patients' records from 2010 to 2020.
All available observations in the population were selected, giving a sample size of 216 observations, to reduce sampling error as far as possible, with 17 attributes used, as shown in Table 1.

Multiple correspondence analysis and principal component analysis
First, applying the MCA procedure explains 94.126% of the total variance (by SPSS 16) and 81.58% (by STATA 15) based on six dimensions, as confirmed by the scree plot shown in Figure 1.

Applying the PCA procedure reduced the heart disease patient data set to six principal components, which explained 80.99% and 78.27% of the variance (NCSS 11 and STATA, respectively). The eigenvalues of all six dimensions are greater than 1, as shown in Figure 2.

Fuzzy c-means, fuzzy c-means via multiple correspondence analysis and fuzzy c-means via principal component analysis
There is more than one measure of the goodness of fit of a fuzzy clustering solution. The first is the average silhouette per cluster, which ranges between −1 and +1. An average silhouette ≥ 0.71 means that a strong structure has been found; a value from 0.51 to 0.70 indicates a reasonable structure; a value from 0.26 to 0.50 means that the structure is weak and other methods should be tried on the data; and a value from 0.25 down to −1 means no substantial structure. The second measure is the normalized Dunn partition coefficient Fc(U), which ranges from 0 (completely fuzzy) to 1 (hard clustering), used together with the normalized Kaufman coefficient Dc(U), which ranges from 0 (hard clustering) to 1 − (1/K) (completely fuzzy), where K is the number of clusters. The number of clusters should be chosen so that Fc(U) is large and Dc(U) is small. By this criterion, the results of the FCM procedure alone were unacceptable, and Table 2 shows that merging FCM with MCA and with PCA, and running the analysis on their dimensions, improves the performance of the FCM clustering method. According to Table 2, FCM via MCA performs better than FCM via PCA (although their results are close), and both perform better than FCM alone. SPSS 16 and NCSS 11 statistical software were used for the calculations; the best results were obtained when the analysis was carried out with two clusters.
For FCM via MCA, the average silhouette was 0.72, the normalized Dunn partition coefficient Fc(U) was 0.72 and the normalized Kaufman coefficient Dc(U) was 0.08; these results were better than those of both FCM alone and FCM via PCA.
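The partition coefficients and the silhouette thresholds above can be sketched in plain Python. The formulas below follow the definitions commonly given for Dunn's and Kaufman's partition coefficients (mean squared membership, and mean squared distance to the nearest hard partition); they are an assumption here and may differ in detail from what SPSS or NCSS report.

```python
def dunn_partition_coefficient(U):
    """Return (Fc, normalized Fc); the normalized value is 1 for a hard partition."""
    n, k = len(U), len(U[0])
    fc = sum(u * u for row in U for u in row) / n
    return fc, (fc - 1.0 / k) / (1.0 - 1.0 / k)


def kaufman_coefficient(U):
    """Return (Dc, normalized Dc); Dc is 0 for a hard partition."""
    n, k = len(U), len(U[0])
    total = 0.0
    for row in U:
        winner = max(range(k), key=row.__getitem__)  # nearest hard assignment
        total += sum(((1.0 if j == winner else 0.0) - u) ** 2
                     for j, u in enumerate(row))
    dc = total / n
    return dc, dc / (1.0 - 1.0 / k)


def interpret_silhouette(s):
    """Map an average silhouette to the structure labels used in the text."""
    if s >= 0.71:
        return "strong"
    if s >= 0.51:
        return "reasonable"
    if s >= 0.26:
        return "weak"
    return "no substantial structure"


# Hypothetical membership matrices for two clusters.
hard = [[1.0, 0.0], [0.0, 1.0]]
fuzzy = [[0.5, 0.5], [0.5, 0.5]]
print(dunn_partition_coefficient(hard), kaufman_coefficient(hard))
print(dunn_partition_coefficient(fuzzy), kaufman_coefficient(fuzzy))
print(interpret_silhouette(0.72))
```

With these definitions, a completely fuzzy two-cluster partition gives an unnormalized Dc of 0.5, that is 1 − (1/K), which matches the upper bound quoted in the text.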

Classification by neural networks
First, note that the input layer of the MLP and RBFN consists of the dimensions extracted from the MCA and PCA analyses, six in each case. These six dimensions were then used in FCM to obtain a new variable that separates the study observations into two clusters, and this new variable forms the output layer of the MLP and RBFN. Table 3 shows that classification accuracy increased in all cases of merging, whereas MLP and RBF before merging performed worse, which confirms the importance of integrating these methods. The relative error in the training and testing samples of the MLP via FCM-MCA structure is the lowest in comparison with MLP, RBF, MLP via FCM-PCA, RBF via FCM-MCA and RBF via FCM-PCA. Because MLP via FCM-MCA outperformed the other methods, the normalized importance chart is presented for it only.
The results were compared based on the percentage of relative error in training (2.5%), the percentage of relative error in testing (5.1%) and the ratio of classification

Importance chart
Figure 3 emphasizes that the results are dominated by the third dimension, which has the highest percentage of normalized importance and included only smoking. It is followed by the fifth dimension, which included alcohol abuse; the fourth dimension, which included age, marital status and sleep apnea; the sixth dimension, which included cholesterol and hypertension; and the second dimension, which included gender, obesity rate and physical activity. Finally, the first dimension, which included family history, diabetes, place of residence, number of family members, working hours, level of blood urea and uric acid ratio, has the lowest percentage of normalized importance.

Conclusion
This proposed work, known as the hybrid technique, uses both FCM-MCA and FCM-PCA as a preliminary stage for MLP and RBFN. The FCM classifier combined with PCA or MCA performs better than FCM alone, with FCM-MCA giving the higher performance, and MLP and RBF combined with FCM-MCA and FCM-PCA are more accurate than the MLP and RBFN classification techniques on their own. To validate the proposed system, it was tested on distinguishing those infected and uninfected with heart disease using six structures. The results show that hybrid classifier structures are more accurate than traditional classifiers, and the comparison of the six classifiers (MLP, RBF, MLP via FCM-MCA, MLP via FCM-PCA, RBFN via FCM-MCA and RBFN via FCM-PCA) shows that the proposed hybrid structure MLP via FCM-MCA performs faster, with a high classification accuracy on heart disease data. Finally, the results showed that smoking is the most important variable causing heart disease; hence, the prediction model obtained will help doctors diagnose heart disease efficiently. In the future, the results will be used to create a monitoring plan for heart patients, because heart patients are usually not identified until a later stage of the disease or until complications occur, and other methods will be integrated to further clarify the importance of integration.