Predictive modelling and analytics for diabetes using a machine learning approach

Diabetes is a major metabolic disorder that can adversely affect the entire body. Undiagnosed diabetes can increase the risk of cardiac stroke, diabetic nephropathy and other disorders. Millions of people all over the world are affected by this disease. Early detection of diabetes is very important for maintaining a healthy life. The disease is a matter of global concern because the number of cases is rising rapidly. Machine learning (ML) is a computational method for learning automatically from experience and improving performance to make more accurate predictions. In the current research we have applied machine learning techniques to the Pima Indian diabetes dataset to develop trends and detect patterns with risk factors using the R data manipulation tool. To classify patients as diabetic or non-diabetic we have developed and analyzed five different predictive models using the R data manipulation tool. For this purpose we used supervised machine learning algorithms, namely linear kernel support vector machine (SVM-linear), radial basis function (RBF) kernel support vector machine, k-nearest neighbour (k-NN), artificial neural network (ANN) and multifactor dimensionality reduction (MDR).


Introduction
Diabetes is a very common metabolic disease. The onset of type 2 diabetes usually happens in middle age and sometimes in old age, but nowadays cases of this disease are reported in children as well. Several factors contribute to the development of diabetes, such as genetic susceptibility, body weight, food habits and a sedentary lifestyle. Undiagnosed diabetes may result in a very high blood sugar level, referred to as hyperglycemia, which can lead to complications such as diabetic retinopathy, nephropathy, neuropathy, cardiac stroke and foot ulcer. So, early detection of diabetes is very important to improve the quality of life of patients and to enhance their life expectancy [1-4,22]. Machine learning is concerned with the development of algorithms and techniques that allow computers to learn and gain intelligence based on past experience. It is a branch of Artificial Intelligence (AI) and is closely related to statistics. Learning here means that the system is able to identify and understand the input data, so that it can make decisions and predictions based on it [5,23,24].
The learning process starts with the gathering of data by different means from various resources. The next step is to prepare the data, that is, to pre-process it in order to fix data-related issues and to reduce the dimensionality of the space by removing irrelevant data (or selecting the data of interest). Since the amount of data used for learning is large, it is difficult for the system to make decisions, so algorithms are designed using logic, probability, statistics, control theory etc. to analyze the data and retrieve knowledge from past experience. The next step is testing the model to calculate the accuracy and performance of the system. The final step is optimization of the system, i.e. improving the model by using new rules or a new data set. The techniques of machine learning are used for classification, prediction and pattern recognition. Machine learning can be applied in various areas such as search engines, web page ranking, email filtering, face tagging and recognition, related advertisements, character recognition, gaming, robotics, disease prediction and traffic management [6,7,25]. The essential learning process to develop a predictive model is given in Figure 1. Nowadays, machine learning algorithms are used for automatic analysis of high-dimensional biomedical data. Diagnosis of liver disease, skin lesions, cancer classification, risk assessment for cardiovascular disease and analysis of genetic and genomic data are some examples of biomedical applications of ML [8,9]. For liver disease diagnosis, Hashemi et al. (2012) have successfully implemented the SVM algorithm [10]. In order to diagnose major depressive disorder (MDD) based on an EEG dataset, Mumtaz et al. (2017) have used classification models such as support vector machine (SVM), logistic regression (LR) and Naïve Bayesian (NB) [11].
Our novel model is implemented using supervised machine learning techniques in R for the Pima Indian diabetes dataset to understand patterns for the knowledge discovery process in diabetes. This dataset describes the medical records of the Pima Indian population regarding the onset of diabetes. It includes several independent variables and one dependent variable, i.e. the class value of diabetes, coded as 0 or 1. In this work, we have studied the performance of five different models based upon linear kernel support vector machine (SVM-linear), radial basis kernel support vector machine (SVM-RBF), k-nearest neighbour (k-NN), artificial neural network (ANN) and multifactor dimensionality reduction (MDR) algorithms to detect diabetes in female patients.

Material and method
The dataset, covering female patients of the Pima Indian population aged at least twenty-one years, has been taken from the UCI machine learning repository. This dataset is originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases. It contains 768 instances in total, classified into two classes (diabetic and non-diabetic), with eight different risk factors: number of times pregnant, plasma glucose concentration at two hours in an oral glucose tolerance test, diastolic blood pressure, triceps skin fold thickness, two-hour serum insulin, body mass index, diabetes pedigree function and age, as in Table 1.
We have investigated this diabetes dataset using the R data manipulation tool (available at https://cran.r-project.org). Feature engineering is an important step in applications of the machine learning process. Modern data sets are described with many attributes for practical machine learning model building, and usually most of these attributes are irrelevant to supervised classification. The pre-processing phase of the raw data involved feature selection, removal of outliers and k-NN imputation to predict the missing values.
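The k-NN imputation step mentioned above can be illustrated with a minimal sketch. It is not the authors' R code: the function name `knn_impute`, the toy records and the choice of a Euclidean distance averaged over the columns observed in both rows are all assumptions made for illustration.

```python
import math

def knn_impute(rows, k=3):
    """Fill None entries with the mean of that column over the k nearest rows."""
    def distance(a, b):
        # Root-mean-square difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which column j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# Hypothetical toy records: (glucose, BMI, age); the third BMI is missing.
records = [
    (148.0, 33.6, 50.0),
    (85.0, 26.6, 31.0),
    (150.0, None, 52.0),
    (89.0, 28.1, 21.0),
]
filled = knn_impute(records, k=1)
print(filled[2])  # the missing BMI is copied from the closest record
```

With `k=1` the missing value is taken from the single most similar record; larger values of `k` average over several neighbours and are less sensitive to one unusual record.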
There are various methods for handling irrelevant and inconsistent data. In this work, we have selected the attributes containing highly correlated data. This step is implemented by feature selection, which can be done either manually or with the Boruta wrapper algorithm. The Boruta package provides stable and unbiased selection of important features from an information system, whereas the manual method is error prone. So, feature selection has been done with the help of the R package Boruta (available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=Boruta). This package provides a convenient interface for machine learning algorithms. It is designed as a wrapper built around the random forest classification algorithm implemented in R. The Boruta wrapper was run on the Pima Indian dataset with all the attributes and it yielded four attributes as important. With these attributes, accuracy, precision, recall and other parameters are calculated.
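The idea behind a wrapper of this kind, comparing each real attribute against shuffled "shadow" copies, can be sketched as follows. This is not the Boruta package and not its algorithm in full: as simplifying assumptions, the sketch uses absolute Pearson correlation with the class label as a stand-in for random forest importance, and a simple majority rule in place of Boruta's statistical test. The function names and the toy data are hypothetical.

```python
import math
import random

def abs_pearson(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return abs(cov / (sx * sy)) if sx > 0 and sy > 0 else 0.0

def shadow_feature_selection(columns, labels, rounds=20, seed=0):
    """Keep features that beat the best shuffled shadow copy in most rounds."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(rounds):
        shadow_scores = []
        for values in columns.values():
            shadow = list(values)
            rng.shuffle(shadow)  # shuffling destroys any link with the labels
            shadow_scores.append(abs_pearson(shadow, labels))
        threshold = max(shadow_scores)
        for name, values in columns.items():
            if abs_pearson(values, labels) > threshold:
                hits[name] += 1
    return [name for name, count in hits.items() if count > rounds / 2]

# Hypothetical toy data: 'glucose' tracks the label, 'noise' does not.
labels = [i % 2 for i in range(40)]
columns = {
    "glucose": [100 + 40 * y + (i % 5) for i, y in enumerate(labels)],
    "noise": [(i * 7) % 11 for i in range(40)],
}
selected = shadow_feature_selection(columns, labels)
print(selected)
```

The design point carried over from the wrapper approach is that an attribute is judged against randomized copies of the data rather than against a fixed, hand-chosen cut-off.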
There are a handful of machine learning techniques that can be used to implement the machine learning process. Supervised and unsupervised learning are the most widely used. Supervised learning is used when historical data is available for a certain problem. The system is trained with the inputs and their respective responses and is then used to predict the response for new data. Common supervised approaches include support vector machines and the Naïve Bayes classifier. Unsupervised learning is used when the available training data is unlabeled. The system is not provided with any prior information or training. The algorithm has to explore and identify the patterns in the available data in order to make decisions or predictions. Common unsupervised approaches include k-means clustering, hierarchical clustering, principal component analysis and the hidden Markov model [12,13]. Supervised machine learning algorithms are selected to perform binary classification of the Pima Indian diabetes dataset. For predicting whether a patient is diabetic or not, we have used five different algorithms in our machine learning predictive models: linear kernel and radial basis function (RBF) kernel support vector machine (SVM), k-nearest neighbour (k-NN), artificial neural network (ANN) and multifactor dimensionality reduction (MDR), the details of which are given below.

Support vector machine
Support vector machine (SVM) is used in both classification and regression. In the SVM model, the data points are represented in space and categorized into groups, and points with similar properties fall into the same group. In linear SVM, each data point is treated as a p-dimensional vector, and the classes are separated by a (p-1)-dimensional plane called a hyper-plane. These planes separate the data space or set the boundaries among the data groups for classification or regression problems, as in Figure 2.
The best hyper-plane can be selected among the number of hyper-planes on the basis of distance between the two classes it separates. The plane that has the maximum margin between the two classes is called the maximum-margin hyper-plane [14,15].
For n data points, the training set is defined as: $(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)$, where each $\vec{x}_i$ is a real vector and $y_i$ is either $1$ or $-1$, representing the class to which $\vec{x}_i$ belongs. A hyper-plane that maximizes the distance between the two classes $y = 1$ and $y = -1$ is defined as: $\vec{w} \cdot \vec{x} - b = 0$, where $\vec{w}$ is the normal vector and $b / \lVert \vec{w} \rVert$ is the offset of the hyper-plane along $\vec{w}$.
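As a small illustration, the following sketch classifies points with a hyper-plane written in the form w·x - b = 0 and reports the margin width 2/||w|| and the offset b/||w||. The weight vector and offset are hand-picked, hypothetical values rather than parameters fitted to the diabetes data, and the function names are chosen for this example only.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x - b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b

def classify(w, b, x):
    """Return 1 on the positive side of the hyper-plane, otherwise -1."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))

def margin_width(w):
    """Distance between the hyper-planes w.x - b = 1 and w.x - b = -1."""
    return 2.0 / norm(w)

# Hypothetical parameters, not fitted to any data.
w, b = [3.0, 4.0], 5.0
print(classify(w, b, [3.0, 1.0]))  # score 9 + 4 - 5 = 8, class 1
print(classify(w, b, [0.0, 0.0]))  # score -5, class -1
print(margin_width(w))             # 2 / 5 = 0.4
print(b / norm(w))                 # offset of the hyper-plane along w: 1.0
```

Training an SVM amounts to choosing w and b so that this margin is as wide as possible while the training points stay on the correct side.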

Radial basis function (RBF) kernel support vector machine
Support vector machines have proven efficient on both linear and non-linear data. The radial basis function has been implemented with this algorithm to classify non-linear data. The kernel function plays a very important role in mapping the data into the feature space. Mathematically, the kernel trick is expressed through a kernel function K. The Gaussian function is also known as the radial basis function (RBF) kernel. In Figure 3, the input space is separated by the feature map (Φ). Applying equations (1) and (2), and then substituting equation (3) into equation (4), gives a new function in which N represents the training data.
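For reference, the standard textbook forms of the kernel, the Gaussian RBF kernel and the kernelised decision function are sketched below. They are stated as assumptions based on common usage and may not match the numbered equations of the original article; the symbols $\sigma$ (kernel width) and $\alpha_i$ (learned coefficients) are introduced here for illustration, and the sign of $b$ follows the hyper-plane convention $\vec{w} \cdot \vec{x} - b = 0$.

```latex
% Kernel as an inner product in the feature space induced by \Phi
K(\vec{x}, \vec{x}') = \langle \Phi(\vec{x}), \Phi(\vec{x}') \rangle

% Gaussian (RBF) kernel with width parameter \sigma
K(\vec{x}, \vec{x}') = \exp\!\left( -\frac{\lVert \vec{x} - \vec{x}' \rVert^{2}}{2\sigma^{2}} \right)

% Kernelised decision function over N training points,
% obtained by writing \vec{w} = \sum_{i=1}^{N} \alpha_i y_i \Phi(\vec{x}_i)
f(\vec{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{N} \alpha_i \, y_i \, K(\vec{x}_i, \vec{x}) - b \right)
```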

k-Nearest neighbour (k-NN)
k-Nearest neighbour is a simple algorithm that nevertheless yields very good results. It is a lazy, non-parametric and instance-based learning algorithm. It can be used in both classification and regression problems. In classification, k-NN is applied to find the class to which a new unlabeled object belongs. For this, a value of k is chosen (where k is the number of neighbours to be considered), which is generally odd, and the distance between the new object and the nearest data points is calculated using a measure such as Euclidean distance, Hamming distance, Manhattan distance or Minkowski distance. After calculating the distances, the k nearest neighbours are selected and the class of the new object is decided on the basis of the votes of these neighbours. The k-NN algorithm predicts the outcome with high accuracy [16].
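A minimal sketch of this voting procedure is given below, using Euclidean distance and a majority vote. The function name `knn_predict` and the toy points (glucose, BMI) are hypothetical and chosen only to make the example easy to follow.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (glucose, BMI) with class 1 = diabetic, 0 = non-diabetic.
train_points = [(85, 26), (89, 28), (90, 25), (148, 33), (150, 35), (160, 38)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (152, 34)))  # neighbours are all class 1
print(knn_predict(train_points, train_labels, (88, 27)))   # neighbours are all class 0
```

Because the vote depends directly on distances, attributes measured on very different scales are usually rescaled before k-NN is applied; an odd k avoids ties in a two-class problem.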

Artificial neural network (ANN)
An artificial neural network mimics the functionality of the human brain. It can be seen as a collection of nodes called artificial neurons. All of these nodes can transmit information to one another. Each neuron can be represented by a state (0 or 1), and each node may also have a weight assigned to it that defines its strength or importance in the system. The structure of an ANN is divided into layers of multiple nodes; the data travels from the first layer (input layer) and, after passing through the middle layers (hidden layers), reaches the output layer. Every layer transforms the data into some relevant information, and the network finally gives the desired output [17]. Transfer and activation functions play an important role in the functioning of neurons. The transfer function sums up all the weighted inputs as: $z = \sum_{i=1}^{n} w_i x_i + b$, where $x_i$ are the inputs, $w_i$ are the corresponding weights and $b$ is the bias value, which is usually 1.
The activation function basically squashes the output of the transfer function into a specific range. It can be either linear or non-linear. The simplest activation function is the identity: $f(z) = z$. Since this function does not bound its output, the sigmoid function is used, which can be expressed as: $\sigma(z) = \frac{1}{1 + e^{-z}}$.
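The computation performed by a single neuron, a weighted sum plus bias followed by a sigmoid, can be sketched as follows. The weights and inputs are hypothetical values chosen so that the arithmetic is easy to check by hand; the bias is set to 1 as in the text.

```python
import math

def transfer(weights, inputs, bias):
    """Weighted sum of the inputs plus the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def sigmoid(z):
    """Squash any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, inputs, bias):
    return sigmoid(transfer(weights, inputs, bias))

# Hypothetical values: 0.5*2 + (-0.25)*4 + 1 = 1.0
weights, inputs, bias = [0.5, -0.25], [2.0, 4.0], 1.0
z = transfer(weights, inputs, bias)
out = neuron_output(weights, inputs, bias)
print(z)    # 1.0
print(out)  # sigmoid(1.0), approximately 0.731
```

A full network repeats this computation for every node in every layer, feeding the outputs of one layer in as the inputs of the next.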

Multifactor dimensionality reduction (MDR)
Multifactor dimensionality reduction is an approach for finding and representing combinations of independent variables that can influence the dependent variable. It is designed to find interactions between variables that can affect the output of the system. It does not depend on parameters or on the type of model being used, which is an advantage over traditional parametric methods. It takes two or more attributes and converts them into a single one. This conversion changes the space representation of the data and results in improved performance of the system in predicting the class variable. Several extensions of MDR are used in machine learning, including fuzzy methods, odds ratios, risk scores and covariates.
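The conversion of two attributes into a single one can be illustrated with the commonly described MDR pooling step: each combination of attribute values is labelled high risk when its ratio of cases to controls reaches the overall ratio, and low risk otherwise. The sketch below is a simplified illustration under that assumption, not the authors' implementation; the function name and the toy data (attributes coded 0, 1 and 2) are hypothetical.

```python
from collections import defaultdict

def mdr_collapse(factor_a, factor_b, labels):
    """Collapse two discrete factors into one binary high-risk/low-risk attribute."""
    cases = defaultdict(int)
    controls = defaultdict(int)
    for a, b, y in zip(factor_a, factor_b, labels):
        if y == 1:
            cases[(a, b)] += 1
        else:
            controls[(a, b)] += 1
    # Overall case:control ratio serves as the threshold.
    threshold = sum(cases.values()) / sum(controls.values())
    cells = set(cases) | set(controls)
    # A cell is high risk when cases / controls >= threshold (written without division).
    high_risk = {c for c in cells if cases[c] >= threshold * controls[c]}
    collapsed = [1 if (a, b) in high_risk else 0 for a, b in zip(factor_a, factor_b)]
    return collapsed, high_risk

# Hypothetical toy data with attributes already recoded to 0, 1 and 2.
factor_a = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
factor_b = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
labels   = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

collapsed, high_risk = mdr_collapse(factor_a, factor_b, labels)
print(sorted(high_risk))
print(collapsed)
```

The single collapsed attribute can then be used directly to predict the class, which is how the two-dimensional attribute space is reduced to one dimension.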

Predictive model
In our proposed predictive model (Figure 4), we have pre-processed the raw data and applied different feature engineering techniques to get better results. Pre-processing involved removal of outliers and k-NN imputation to predict the missing values. The Boruta wrapper algorithm is used for feature selection, as it provides unbiased separation of important and unimportant features in an information system. Training on the data obtained after feature engineering plays a significant role in supervised learning. We have used highly correlated variables for better outcomes. Here, input data refers to the test data used for prediction and for building the confusion matrix.

Results and discussions
Early diagnosis of diabetes can help to improve the quality of life of patients and to enhance their life expectancy. Supervised algorithms have been used to develop different models for diabetes detection. Table 2 gives a view of the different machine learning models trained on the Pima Indian diabetes dataset with optimized tuning parameters. All classification techniques were run in the "R" programming studio. The data set has been partitioned into two parts (training and testing). We trained our models with 70% of the data and tested them with the remaining 30%. Five different models have been developed using supervised learning to detect whether a patient is diabetic or non-diabetic. For this purpose, linear kernel support vector machine (SVM-linear), radial basis function (RBF) kernel support vector machine, k-NN, ANN and MDR algorithms are used.
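A 70/30 partition of this kind can be sketched as below. The function name, the random seed and the rounding rule are assumptions made for illustration; the exact partition used in this study is not specified beyond the 70% and 30% proportions.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle row indices reproducibly and split them into training and test sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

# Stand-in for the 768 instances of the dataset (row identifiers only).
rows = list(range(768))
train, test = train_test_split(rows)
print(len(train), len(test))  # 538 230
```

Shuffling before the split avoids any ordering in the source file leaking into the partition, and fixing the seed makes the split reproducible.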
To diagnose diabetes in the Pima Indian population, the performance of all five models is evaluated on parameters such as precision, recall, area under curve (AUC) and F1 score (Table 3). To avoid the problems of over-fitting and under-fitting, tenfold cross-validation is performed. Accuracy indicates how often the classifier is correct in diagnosing whether a patient is diabetic or not. Precision measures the classifier's ability to provide correct positive predictions of diabetes. Recall, or sensitivity, is the proportion of actual positive cases of diabetes correctly identified by the classifier. Specificity measures the classifier's capability of identifying negative cases of diabetes. The F1 score is the harmonic mean of precision and recall, so it takes both into account. Classifiers with an F1 score near 1 are regarded as the best [18]. The receiver operating characteristic (ROC) curve is a well-known tool for visualizing the performance of a binary classifier [19]. It is a plot of the true positive rate against the false positive rate as the threshold for assigning observations to a particular class is varied. The area under the curve (AUC) of a useful classifier lies between 0.5 and 1. A value near 0.5 is comparable to random guessing, meaning the classifier cannot distinguish between true and false cases, whereas an optimal classifier has an AUC near 1.0 [20].
Table 2. Models and tuning parameters (columns: S.No., Model Name, Tuning Parameters). MDR: recode function to convert the values into 0, 1 and 2.
Table 3. Evaluation parameters of different predictive models.
From Table 3, on the basis of all the parameters, SVM-linear and k-NN are the two best models for determining whether a patient is diabetic or not. Further, the accuracy and precision of the SVM-linear model are higher than those of the k-NN model, but the recall and F1 score of the k-NN model are higher than those of the SVM-linear model. On careful examination, our diabetes dataset is found to be an example of class imbalance, with 500 negative instances and 268 positive instances, giving an imbalance ratio of 1.87. Accuracy alone may not provide a very good indication of the performance of a binary classifier when the classes are imbalanced. The F1 score provides better insight into classifier performance in the case of an uneven class distribution, as it balances precision and recall [21,25]. So in this case the F1 score should also be taken into account. Further, the AUC values of the SVM-linear and k-NN models are 0.90 and 0.92 respectively (Figures 5 and 6). Such high AUC values indicate that both SVM-linear and k-NN are optimal classifiers for the diabetes dataset.
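The evaluation parameters discussed above can be computed from a confusion matrix as in the following sketch. The labels and scores are hypothetical toy values, not results from this study, and the AUC is computed with the rank-based formulation (the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case) rather than by integrating a plotted ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / len(y_true)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy example: 4 positive and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
imbalance_ratio = 500 / 268  # negative to positive instances in the dataset
print(metrics)
print(round(area, 3))             # 23 of 24 positive-negative pairs are ranked correctly
print(round(imbalance_ratio, 2))  # 1.87
```

In this toy example accuracy (0.8) is higher than precision, recall and F1 (0.75 each), which illustrates why accuracy alone can look optimistic when the negative class is larger.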

Conclusion
We have developed five different models to detect diabetes using linear kernel support vector machine (SVM-linear), radial basis kernel support vector machine (SVM-RBF), k-NN, ANN and MDR algorithms. Feature selection on the dataset is done with the help of the Boruta wrapper algorithm, which provides unbiased selection of important features. All the models are evaluated on the basis of different parameters: accuracy, recall, precision, F1 score and AUC. The experimental results suggest that all the models achieved good results; the SVM-linear model provides the best accuracy of 0.89 and precision of 0.88 for prediction of diabetes compared with the other models used. On the other hand, the k-NN model provided the best recall and F1 score, of 0.90 and 0.88 respectively. As our dataset is an example of class imbalance, the F1 score may provide better insight into the performance of our models, since it balances precision and recall. Further, the AUC values of the SVM-linear and k-NN models are 0.90 and 0.92 respectively. Such high AUC values indicate that both SVM-linear and k-NN are optimal classifiers for the diabetes dataset. So, from the above studies, it can be said that on the basis of all the parameters linear kernel support vector machine (SVM-linear) and k-NN are the two best models for determining whether a patient is diabetic or not.
This work also suggests that the Boruta wrapper algorithm can be used for feature selection. The experimental results indicate that using the Boruta wrapper feature selection algorithm is better than choosing the attributes manually with limited medical domain knowledge. Thus, with a limited number of parameters selected through the Boruta feature selection algorithm, we have achieved higher accuracy and precision.