IDMPF: intelligent diabetes mellitus prediction framework using machine learning

Purpose – Machine learning is an intelligent methodology used for prediction and has shown promising results in predictive classification. One of the critical areas in which machine learning can save lives is diabetes prediction. Diabetes is a chronic disease and one of the top 10 causes of death worldwide. The total number of people with diabetes is expected to reach 700 million in 2045, a 51.18% increase compared to 2019. These are alarming figures, and it is therefore urgent to provide accurate diabetes prediction. Design/methodology/approach – Health professionals and stakeholders are striving for classification models to support the prognosis of diabetes and to formulate strategies for prevention. The authors conduct a literature review of machine learning models and propose an intelligent framework for diabetes prediction. Findings – The authors provide a critical analysis of machine learning models, and propose and evaluate an intelligent machine learning-based architecture for diabetes prediction. Using the proposed framework, the authors implement and evaluate the decision tree (DT)-based random forest (RF) and support vector machine (SVM) learning models for diabetes prediction, as these are the most used approaches in the literature. Originality/value – This paper provides a novel intelligent diabetes mellitus prediction framework (IDMPF) using machine learning. The framework is the result of a critical examination of prediction models in the literature and their application to diabetes. The authors identify the training methodologies, the model evaluation strategies and the challenges in diabetes prediction, and propose solutions within the framework. The research results can be used by health professionals, stakeholders, students and researchers working in the diabetes prediction area.


Introduction
Machine learning modeling is an intelligent way to extract the hidden relationships among different variables in a dataset. It has been used as a decision-support system for prediction in different application domains such as healthcare, education and industry [1][2][3]. Machine learning models can be classified into three main categories: (1) supervised learning, (2) unsupervised learning and (3) semi-supervised learning [4] (Figure S1 available at https://github.com/Dr-Leila-Ismail). The objective of a machine learning classification model is to predict the class of a given input [5]. Classification models are heavily used in healthcare for disease diagnosis and prognosis, fraud detection, drug efficiency and the development of a nationwide prevention plan [6]. Diabetes has attracted a lot of attention lately due to its proliferation and its dangerous consequences, which may lead to death. Diabetes prediction is a classification problem, where the input feature variables are the risk factors [7], and the aim is to classify an individual, based on class labels, as diabetic or non-diabetic [8].
A few machine learning prediction frameworks have been proposed in the literature for healthcare [9][10][11][12]. However, to the best of our knowledge, there is no comprehensive framework in the literature that depicts the process of diabetes data analytics from domain understanding to model deployment. In this paper, we propose an intelligent diabetes mellitus prediction framework (IDMPF) using machine learning models, as support for allied health professionals, consisting of doctors, dieticians, medical technologists, therapists and pathologists, for better diagnosis and prognosis of diseases and for better patient care. The framework helps stakeholders, such as insurance companies, pharmaceutical firms and the government, to put in place a preventive plan and an effective healthcare strategy. IDMPF is based on the principles of the data analytics lifecycle [13]. The proposed IDMPF is evaluated using the decision tree (DT)-based random forest (RF) and support vector machine (SVM) classification models, as they are the most used in the literature [8,12][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29] from 2010 to 2019, as shown in Figure S2 (https://github.com/Dr-Leila-Ismail). Very few works compare RF and SVM [19,20,22]. While [19] and [20] do not report the number of observations in the considered dataset, [22] uses a dataset consisting of 2500 observations. None of them considers the impact of an imbalanced dataset on the prediction results. In this paper, we evaluate the models in terms of accuracy, precision, recall, F-measure, receiver operating characteristic (ROC) curve, area under the ROC curve (AUC) and execution time using a dataset having 65,839 observations.

Literature review and critical analysis
RF is a DT-based model [30] that uses a tree structure to define the sequences of decisions and the corresponding outcomes [31]. Each risk factor (feature) is represented by a node in the tree (Figure 1 (a)) where the model decides to select a particular branch and traverse down the tree. A node without further branches is called a leaf node that represents the class label, i.e. positive (diabetic) or negative (non-diabetic). DT uses a greedy algorithm for the selection of a risk factor to split the tree. The risk factor having the highest information gain is selected for splitting. The information gain for a feature is calculated using Eqn (1).
where H(class) represents the base entropy calculated using Eqn (2) and H(class|feature) represents the conditional entropy calculated using Eqn (3).
where P(class) is the proportion of observations in the given class relative to the total number of observations and f is the set of values for a feature. SVM [32] creates a decision boundary, known as a hyperplane, that separates the observations into positive (diabetic) and negative (non-diabetic) classes. Figure 1 (b) shows the SVM hyperplane that separates the positive and negative classes for two different features. We evaluate the SVM model using its different kernels: linear, polynomial, radial basis function (RBF) and sigmoid. We obtain the hyperplane using Eqn (4).
where f_i are the features, c_i are the class labels, w is the normal of the hyperplane, and b is the bias. A literature survey was carried out (Table S1 available at: https://github.com/Dr-Leila-Ismail) to compare the performance of RF and SVM for diabetes prediction. Most of the studies use a dataset with fewer than 10,000 observations. Only three works evaluate the models in terms of F-measure. F-measure is important in the case of an imbalanced dataset (very common in the healthcare sector). This is because the F-measure reveals how well the model classifies the minority class, which cannot be detected by accuracy [33]. The present work proposes IDMPF as a support system for accurate diabetes prediction. The study evaluates IDMPF using the RF and SVM models in terms of accuracy, precision, recall, F-measure, ROC curve, AUC and execution time using the UCI diabetes dataset having 12 features and 65,839 observations [34].
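The information-gain criterion described above can be illustrated with a minimal sketch. It assumes the standard textbook definitions, i.e. H(class) as the Shannon entropy of the class labels and H(class|feature) as the entropy within each feature value, weighted by how often that value occurs. The toy data and all function names are hypothetical and are not taken from the authors' implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Base entropy H(class) = -sum_c P(c) * log2 P(c)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def conditional_entropy(feature_values, labels):
    """H(class | feature): entropy of each feature-value group, weighted by group size."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def information_gain(feature_values, labels):
    """Information gain = H(class) - H(class | feature)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)


# Hypothetical toy data: 1 = diabetic, 0 = non-diabetic.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
obesity = [1, 1, 1, 0, 0, 0, 1, 0]                  # separates the classes perfectly
gender = ["F", "M", "F", "M", "F", "M", "M", "F"]   # carries no class information here

gain_obesity = information_gain(obesity, labels)
gain_gender = information_gain(gender, labels)
```

Under these assumptions, a greedy DT would split on `obesity` first, since it has the larger gain of the two features.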

The proposed intelligent diabetes mellitus prediction framework (IDMPF)
A framework for diabetes prediction in terms of stages is presented to describe the characteristics of the data used in diabetes prediction and how this data fits within the framework. The proposed IDMPF is based on the data analytics lifecycle which depicts the process of data collection, organization and analysis to extract correlations, hidden patterns and other invaluable information [13]. Figure 2 presents the stages of IDMPF.
(2) List the potential risk factors by consulting an expert and surveying the literature [36].
(3) State the objective of the prediction model, i.e., binomial classes (diabetic/non-diabetic) or multiple classes (non-diabetic/pre-diabetic/diabetic) prediction, prediction for men and/or women, and comparison of diabetes prevalence between different age groups.

Data collection
(1) Collect data from an online public data repository such as the UCI machine learning repository [37], request it from a critical care database such as MIMIC [38], and/or create it using patients' medical data after consent. This process can be automated by developing an intelligent agent. The inclusion of the risk factors in the dataset should be verified.
(1) Aggregate the dataset if it is divided into multiple files. For instance, one file can contain the demographic data of the patients such as age, gender, ethnicity, education level and marital status, while another file can contain the medication and laboratory data such as BMI, cholesterol level, blood pressure and pulse rate [39].
(2) Refer to the disease coding system (e.g. the International Classification of Diseases (ICD)-9 [40]) if the risk factors are represented by codes.
(3) Decide on the class labels, based on the expert's advice or domain understanding, for each observation in the dataset in case they are not mentioned. For instance, observations having a fasting plasma glucose level <100 mg/dl can be labeled as the non-diabetic class, a level between 100-125 mg/dl can be labeled as the pre-diabetic class and a level >125 mg/dl can be labeled as the diabetic class [18].
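The aggregation and labeling steps above can be sketched as follows. This is a minimal illustration that uses the fasting plasma glucose thresholds quoted in step (3); the patient identifiers, field names and values are hypothetical.

```python
def label_from_fpg(fpg_mg_dl):
    """Map a fasting plasma glucose reading (mg/dl) to a class label."""
    if fpg_mg_dl < 100:
        return "non-diabetic"
    if fpg_mg_dl <= 125:
        return "pre-diabetic"
    return "diabetic"


# Hypothetical contents of two separate files, keyed by a patient identifier.
demographics = {101: {"age": 54, "gender": "F"}, 102: {"age": 37, "gender": "M"}}
labs = {101: {"bmi": 31.2, "fpg": 132}, 102: {"bmi": 24.8, "fpg": 96}}

# Aggregate: keep only patients that appear in both files.
merged = {
    pid: {**demographics[pid], **labs[pid]}
    for pid in demographics.keys() & labs.keys()
}

# Label each aggregated observation.
for record in merged.values():
    record["label"] = label_from_fpg(record["fpg"])
```

Joining on the intersection of identifiers is one possible design choice; observations present in only one file could instead be kept and treated as having missing values.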

Data preparation
3.4.1 Feature selection.
(1) Exclude the features that do not contribute to diabetes to avoid overfitting the model at its building stage. For instance, features, such as data sequence number, hospital ID, time and date should be removed.
(2) Use all the features (risk factors) available in the dataset or select a subset of features by applying feature selection algorithms [41], or taking an expert's advice, or using a hybrid approach. Ideally, researchers should evaluate several feature selection algorithms or a combination of these algorithms along with the classification model and then select the features which provide the highest accuracy, F-measure and AUC.
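As an illustration of step (2), the sketch below ranks features by the absolute Pearson correlation between each feature and the class label. This is only one simple filter-style approach, chosen here as an assumption; it is not necessarily the algorithm used in the cited works, and the column names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / (std_x * std_y)


def rank_features(columns, labels):
    """Return (feature, |correlation with label|) pairs, strongest first."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical columns: one risk factor and two administrative fields.
columns = {
    "bmi": [22.0, 24.5, 31.0, 35.5, 28.0, 33.0],
    "record_id": [3, 1, 2, 6, 5, 4],
    "hospital_id": [7, 7, 7, 7, 7, 7],
}
labels = [0, 0, 1, 1, 0, 1]

ranking = rank_features(columns, labels)
```

In practice, the retained subset would still be validated together with the classification model, as step (2) recommends, rather than chosen from the ranking alone.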

Data preprocessing.
(1) Remove the outliers for better accuracy [42] using manual visualization of the data plot or machine learning [43].
(2) Normalize the numerical features having varying ranges to avoid bias [42]. For example, the model could be biased toward plasma glucose's range of  compared to BMI's range 18.2-67.1.
(3) Identify the missing values (no value or zero) in the dataset, based on domain understanding. For example, if an observation has the value 0 for BMI, then it could be a missing value as BMI cannot be 0, whereas a value of 0 for age could represent a newborn.
(4) Treat the missing values by removing the corresponding observations or adding synthetic values, using statistical or machine learning approaches [44].
(5) Balance the imbalanced dataset by over-sampling, under-sampling, or a hybrid approach [45]. Ideally, evaluate different approaches with the classification model and then select the approach providing the highest accuracy, F-measure and AUC.
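The preprocessing steps above can be sketched in a few lines. The example assumes the simplest variant of each step: dropping observations with a missing BMI, min-max normalization, and random over-sampling of the minority class. The rows, field names and helper functions are hypothetical, and other techniques from [44] and [45] could be substituted.

```python
import random
from collections import Counter


def drop_missing_bmi(rows):
    """Treat BMI of 0 or None as missing and drop those observations."""
    return [r for r in rows if r["bmi"] not in (0, None)]


def min_max(values):
    """Rescale values to [0, 1] so that wide-ranged features do not dominate."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def random_oversample(rows, label_key="label", seed=0):
    """Duplicate randomly chosen minority rows until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(r[label_key], []).append(r)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Hypothetical observations: age 0 is a valid newborn, BMI 0 is a missing value.
rows = [
    {"age": 0, "bmi": 14.1, "glucose": 80, "label": 0},
    {"age": 45, "bmi": 0, "glucose": 150, "label": 1},
    {"age": 52, "bmi": 33.0, "glucose": 160, "label": 1},
    {"age": 61, "bmi": 29.5, "glucose": 140, "label": 1},
    {"age": 38, "bmi": 36.2, "glucose": 170, "label": 1},
]

clean = drop_missing_bmi(rows)
scaled_bmi = min_max([r["bmi"] for r in clean])
balanced = random_oversample(clean)
class_counts = Counter(r["label"] for r in balanced)
```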

Model building
(1) Split the dataset for model training (building) and validation, by dividing it into 70% and 30% respectively, or using the k-fold cross-validation technique [46].
(2) Develop the model using the preprocessed dataset.
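The two splitting strategies in step (1) can be sketched as below. This is a minimal, library-free illustration; the function names, the fixed seed and the unstratified shuffling are assumptions made for the example.

```python
import random


def train_validation_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the observations and split them, e.g. 70% training / 30% validation."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices); each index is validated exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = [i for i in range(n) if i < start or i >= start + size]
        yield training, validation
        start += size


observations = list(range(10))  # stand-ins for preprocessed rows
train, validation = train_validation_split(observations)
folds = list(k_fold_indices(len(observations), k=3))
```

With an imbalanced dataset, a stratified split that preserves the class ratio in each part may be preferable; that refinement is omitted here for brevity.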

Model evaluation
(1) Use the validation dataset to evaluate the developed model.
(2) Select the evaluation metrics [33] to analyze the performance of the developed model. The most commonly used metric is accuracy.
(3) Evaluate the complexity of the developed model by measuring the execution time.
(4) Evaluate F-measure and AUC which are useful in case the dataset is imbalanced.
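Steps (3) and (4) can be illustrated with a short sketch. It assumes the rank-based interpretation of AUC, i.e. the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one, with ties counted as one half. The scores, labels and the `timed` helper are hypothetical.

```python
import time


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def timed(fn, *args):
    """Return the function result together with its wall-clock execution time in seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical imbalanced validation set: four diabetic (1), two non-diabetic (0).
labels = [1, 1, 1, 1, 0, 0]
constant_scores = [1.0] * 6                       # model that always predicts the majority class
informative_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.2]

auc_constant, _ = timed(auc, constant_scores, labels)
auc_informative, elapsed = timed(auc, informative_scores, labels)
```

A model that assigns every observation to the majority class cannot rank any pair correctly and therefore obtains an AUC of 0.5, even though its accuracy on this toy set would be 4/6.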

Model deployment
(1) Apply the developed model to predict diabetes.
(2) Re-develop the model based on updated and/or new data (return to the model building stage, 3.5).
Applying a systematic experimental methodology, as depicted by the above stages, to the problem of diabetes prediction is necessary for the best prediction results, as overlooking a step can lead to inaccurate results. For instance, if the dataset is imbalanced, the model might be very accurate but unable to detect the minority class, which could be life-threatening in the case of a diabetic minority. Table 1 compares our work with the works in the literature on machine learning-based prediction frameworks for healthcare, and for diabetes in particular.

Performance analysis
4.1 Experimental environment
To evaluate the performance of RF and SVM for the prediction of type 2 diabetes, the proposed framework was applied using an imbalanced UCI dataset with the parameters presented in Table 2. The performance of the classifiers was assessed with and without feature selection, before and after data balancing. The correlation attribute evaluator [47] was used for feature selection, as it improves the accuracy of diabetes prediction [22]. The data balancing techniques used in the experiments were adopted from [45].

Experiments
The dataset was preprocessed by removing the irrelevant features such as encounter id, patient number, admission type id, discharge disposition id, hospital time in and time out, and payer code. The feature "weight" was also removed as it has 100% missing values. The resultant dataset includes race, gender, age, diagnosis 1, diagnosis 2, diagnosis 3 and diabetes medication. Diagnoses 1, 2 and 3 represent the results of the primary, secondary and additional secondary diagnoses respectively. A class label for diabetes was created based on the diabetes medication feature. The class value is set to "1", i.e. diabetic, if the corresponding value in the diabetes medication column is "yes"; otherwise it is set to "0", i.e. non-diabetic. All the observations having missing values were removed. For diagnoses 1, 2 and 3, we extracted the ICD-9 code values of the diseases that are risk factors of type 2 diabetes, such as obesity, hypertension and cardiovascular disease. A column for each risk factor was added. For every observation, the value of each risk factor is set to "1" if the disease appears in any of diagnoses 1, 2 or 3; otherwise it is set to "0". 70% of the dataset was used for training and 30% for testing. The study evaluates the data balancing techniques for different values of the involved parameters and selects the parameters that result in the highest AUC value. The accuracy and the F-measure are calculated using Eqs (5) and (6). The values of recall for the positive and negative classes are calculated using Eqs (7) and (8) respectively, and the values of precision for the positive and negative classes are calculated using Eqs (9) and (10) respectively.
$$\text{Precision (negative class)} = \frac{TN}{TN + FN} \tag{10}$$

where TN and FN are the numbers of true negatives and false negatives respectively.

Figure 3 shows the accuracy, precision, recall and F-measure of the RF and SVM models with and without the feature selection algorithm, before and after data balancing. The precision, recall and F-measure values are presented for the diabetic class (+), the non-diabetic class (-) and their weighted averages (A). We present the results for the data balancing techniques that have the highest F-measure value among those which have an AUC value greater than 0.5. Before data balancing, RF outperforms SVM in terms of accuracy and F-measure, suggesting that the DT-based model is more suitable for diabetes prediction, which is consistent with the literature [19,20]. The SVM-linear, polynomial and RBF kernels have higher accuracy than SVM-sigmoid. However, they cannot detect the minority non-diabetic class using the imbalanced UCI dataset. The relative performance of RF and SVM does not change before and after feature selection. The selected features in our experiments, i.e. age, blood pressure, cholesterol, gender and obesity, are the same as the ones in the literature, as shown in Table S2 (https://github.com/Dr-Leila-Ismail). After data balancing, the SVM-linear kernel outperforms the other models under study in terms of accuracy without feature selection, but after feature selection, RF yields the highest accuracy. Moreover, after data balancing, the SVM models with linear, polynomial and RBF kernels can predict the non-diabetic minority class.

Figure 3. Performance of the classification models with and without feature selection before and after data balancing

Figure 4 shows the ROC curves and AUC for the developed models with and without feature selection, before and after data balancing. It shows that before data balancing, the SVM models with linear, polynomial and RBF kernels have an AUC of 0.5, with and without feature selection, revealing that these models assign all the observations to the majority diabetic class. However, after data balancing, the AUC values of the models under study are greater than 0.5, revealing a detection of the minority class. Table 3 shows our experimental results on the execution time of the models with and without feature selection, before and after data balancing. It shows that the execution time of the models decreases after feature selection.

Figure 4. ROC curve and AUC of the classification models with and without feature selection before and after data balancing

Table 3. Execution times of the classification models
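The per-class metrics discussed in this section can be computed from confusion-matrix counts, as in the minimal sketch below. It assumes the standard definitions of accuracy, recall and F-measure, taken to be consistent with the negative-class precision in Eq (10); the function names and the toy predictions are hypothetical, and undefined ratios (zero denominators) are reported as 0.0 by convention.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a metric is undefined (zero denominator)."""
    return numerator / denominator if denominator else 0.0


def class_metrics(y_true, y_pred):
    """Accuracy plus precision, recall and F-measure for the positive and negative classes."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision_pos = safe_div(tp, tp + fp)
    precision_neg = safe_div(tn, tn + fn)   # same form as Eq (10)
    recall_pos = safe_div(tp, tp + fn)
    recall_neg = safe_div(tn, tn + fp)

    def f_measure(precision, recall):
        return safe_div(2 * precision * recall, precision + recall)

    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision_pos": precision_pos,
        "precision_neg": precision_neg,
        "recall_pos": recall_pos,
        "recall_neg": recall_neg,
        "f_pos": f_measure(precision_pos, recall_pos),
        "f_neg": f_measure(precision_neg, recall_neg),
    }


# Hypothetical imbalanced test set: 8 diabetic (1), 2 non-diabetic (0),
# and a model that assigns every observation to the majority class.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10
metrics = class_metrics(y_true, y_pred)
```

In this toy case the accuracy is 0.8 while the recall and F-measure of the minority class are 0, which mirrors the behaviour reported for the SVM kernels before data balancing.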

Conclusions and summary
Because diabetes is a global crisis, it is crucial to predict the disease in an individual to reduce the risk of complications and to save lives. The paper evaluates recent works on diabetes prediction that have used DT-RF and SVM models. In addition, different machine learning-based prediction frameworks for healthcare, and for diabetes in particular, were analyzed. The proposed framework (IDMPF) is the result of a critical analysis of machine learning models in the literature and our implementation of RF and SVM for diabetes prediction. The performance of the models in terms of accuracy, precision, recall, F-measure, ROC curve, AUC and execution time was evaluated. In addition, challenges involved in diabetes prediction are highlighted to guide future research. The present study will help allied health professionals and researchers in the field of diabetes prediction. For an imbalanced dataset, data balancing techniques could help to detect the minority class. However, the performance of the models is data-driven and dependent on the features being used, and therefore cannot be generalized. The IDMPF is evaluated using the two most used classification models in the literature. A larger spectrum of models will be considered in our future work.