Classification models for likelihood prediction of diabetes at early stage using feature selection

Purpose – Diabetes is a life-threatening chronic disease that already affects 422m people globally, based on the World Health Organization (WHO) report as at 2018. It costs individuals, governments and groups a great deal, from the diagnosis stage through to the treatment stage. One reason for this cost, among others, is that the disease requires long-term treatment. The disease is likely to continue to affect more people because of its long asymptomatic phase, which makes its early detection not feasible.
Design/methodology/approach – In this study, the authors present machine learning models with feature selection that can detect diabetes at its early stage. The models presented are also inexpensive and available to everyone, including those in remote areas.
Findings – The study results show that feature selection helps in obtaining a better model, as it prevents overfitting and removes redundant data. Compared with previous research, the study achieved better results when evaluated on metrics such as F-measure, precision-recall curve and receiver operating characteristic area under curve. This finding has the potential to impact clinical practice when health workers aim to diagnose diabetes at its early stage.
Originality/value – This study has not been published anywhere else.


Introduction
Chronic diseases are known to affect quality of life negatively [1] and to impose a financial burden on the individuals, governments and groups that tackle them [2,3]. Diabetes mellitus is one such disease, characterized by high levels of blood glucose [4] that lead to severe damage of the heart, blood vessels, eyes, kidneys and nerves [5,6]. Diabetes is a growing health challenge of this era irrespective of geographic, racial and ethnic background [7,8]. As recorded by the World Health Organization (WHO), the number of diabetic patients globally grew rapidly from 108m in 1980 to 422m in 2014, and it is increasing steadily, especially in low- and middle-income countries [9]. According to the International Diabetes Federation (IDF), about 463m people between the ages of 20 and 79 years are diabetes patients, and the figure is estimated to reach 700m by 2045 [10,11]. Diabetes is also one of the major causes of death, with an annual death toll of 1.6m [5].
Researchers have divided diabetes mellitus into three major types. Type 1 diabetes is a serious, chronic illness [12] in which the immune system wrongly attacks the pancreatic beta cells, causing insufficient or no insulin production. Type 2 diabetes mellitus arises when the body uses insulin ineffectively, while gestational diabetes occurs only during pregnancy [13] as a result of hormonal changes [14].
Governments, individuals and groups bear a heavy financial burden in diagnosing and managing the disease, and its prevalence is growing. A study [11] established that in 2019 diabetes caused at least 760bn dollars in health expenditure, which calls for ways to eliminate this burden or reduce it to the barest minimum. One of the key issues in this regard is identifying the risk of diabetes at its early phase [15], as early diagnosis and suitable therapeutic management support patient compliance and reduce excessive expenses.
The established methods of diagnosing diabetes are the Oral Glucose Tolerance Test (OGTT) and the HbA1c test. They are typically carried out once patients develop certain symptoms, and they require resources and time. In addition, such resources are not available in remote places [8]. Once diabetes is diagnosed, the treatment process is long-term and expensive. Therefore, the earlier the disease is detected, the better it can be managed, both clinically and in terms of expense [8,16].
In recent years, machine learning has been employed for prediction across most, if not all, human activities and natural phenomena, and the health sector is no exception. Machine learning, a branch of artificial intelligence, uses algorithms and models that enable computer systems to perform tasks efficiently without explicit instructions, relying on patterns and inference instead [17-19]. The healthcare industry has gathered a large amount of data [20,21], which, through machine learning and data analytics, can provide insight for prediction, diagnosis, disease prevention and policymaking [13,22].
Authors have proposed various models for the diagnosis of diabetes. One study [1] employed a dataset of 768 female subjects with nine attributes (glucose level, blood pressure level, number of times pregnant, skin thickness, insulin, diabetes pedigree function, age, body mass index (BMI) and outcome) to create prediction models, all coupled with feature selection, that can aid health workers with treatment decisions: an artificial neural network (ANN) gave 75.7% accuracy, random forests gave 74.7% accuracy and K-means clustering gave 73.4% accuracy.
A similar study [18] compared five machine learning algorithms for predicting diabetes (Support Vector Machine, Random Forest, Naïve Bayes, Decision Tree and K-Nearest Neighbor) using a dataset with 11 parameters after feature selection. The study revealed that Naïve Bayes performed best. Using another approach, an easily accessible and cost-effective model was used for early detection of the symptoms of this deadly ailment [8]; it employed a dataset of 520 subjects with 16 attributes (symptoms) on different classification algorithms and reported that random forest performed best.
One of the major challenges of machine learning is the high dimensionality of datasets [18,23]: analyzing many features requires a large amount of memory and leads to overfitting. Weighting the features therefore reduces redundant data and processing time, thereby improving the performance of the algorithm [24-26].
The present study addresses two research questions: Is it possible to have an optimal model that will predict the likelihood of diabetes based on its symptoms? Can a less costly system be developed to diagnose diabetes at an early stage?
Along this line, the main objective of this study is to predict the likelihood of diabetes at an early stage using feature selection, which eliminates the unnecessary and unimportant features in the dataset [20,27-30], in order to obtain better results than previous research studies such as [15].

Materials and methods
The proposed methodology is described as follows:
(1) Preprocessing (data manipulation).
(2) Feature selection: This is done by using four different algorithms coupled with the Ranker search method.
The methodology for this study was implemented using the Waikato Environment for Knowledge Analysis (WEKA), an open-source machine learning software developed at the University of Waikato. The dataset used in this research was obtained from the University of California, Irvine (UCI) Machine Learning Repository [31]; it is a clinical record of symptoms that may indicate diabetes. The dataset by [8] was loaded into WEKA. To obtain better results, feature selection was applied to select the attributes to be used for the classification task. Both the cross-validation and percentage split methods were used. The random forest, J48, Naïve Bayes and K-Nearest Neighbor (KNN, named IBk in WEKA) algorithms were used for this study.

Data description and statistical analysis
The dataset contains records of diabetes-related symptoms of 520 individuals, collected from Sylhet Diabetes Hospital in Sylhet, Bangladesh. It was created from a direct questionnaire given to people who had recently become diabetic or who were still nondiabetic but had some symptoms that may indicate diabetes. The dataset contains 17 attributes, which contain information about diabetes symptoms. The full description of the dataset is available at (https://github.com/OladosuO).
In the present analysis, the data of subjects aged ≤90 who exhibited symptoms were included, while those who declined prior informed consent were excluded. WEKA version 3.8.3 was used for this study, running on Java runtime version 1.8.0_221-b11.

Hypothesis testing
The aim of hypothesis testing is to evaluate whether the attributes of the population suggest the prevalence of diabetes or otherwise. The result is taken to be valid when the null hypothesis is accepted, that is, when the p-value is greater than alpha (0.05). Table 1 shows the result of the hypothesis test using the t-test method.
H0. The attribute value affects the outcome of the diagnosis.
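As an illustration of the kind of test described above, the following minimal sketch computes a two-sample t statistic for one numeric attribute (age) grouped by diagnosis class. The ages are made-up values, not records from the dataset, and the function name `welch_t` is hypothetical. The study does not state which t-test variant was used, so Welch's unequal-variance form is an assumption. Turning the statistic into a p-value additionally requires the t distribution, which is not shown here.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variance (n - 1 denominator) of each group, divided by group size.
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

# Illustrative ages only (hypothetical values, not from the study).
ages_positive = [48.0, 52.0, 39.0, 61.0, 55.0, 47.0, 58.0, 50.0]
ages_negative = [35.0, 42.0, 38.0, 45.0, 40.0, 33.0, 47.0, 41.0]

t_stat, dof = welch_t(ages_positive, ages_negative)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

The statistic would then be compared against the t distribution with `dof` degrees of freedom to obtain the p-value that is checked against alpha (0.05).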

Data preprocessing
The first step in data mining is data cleaning, which involves data preprocessing [32,33]. Preprocessing had already been carried out in previous research on this subject [8], which handled missing values by ignoring the tuples with incomplete values. In the present study, the dataset was found to be imbalanced: the positive class has 320 instances, while the negative class has 200 instances. The Synthetic Minority Oversampling Technique (SMOTE), an oversampling method [34], was used to alleviate the class imbalance problem.
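The core idea of SMOTE can be sketched as follows: each synthetic minority record is placed at a random point on the line segment between a minority record and one of its nearest minority neighbours. This is a simplified illustration on toy numeric vectors, not the WEKA filter used in the study; the function name `smote`, the data and the parameter values are assumptions, and nominal attributes would need separate handling.

```python
import random

def smote(minority, n_synthetic, k=5, seed=1):
    """Create n_synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy numeric data (hypothetical): 4 minority records against 7 majority records.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
majority_size = 7
new_points = smote(minority, majority_size - len(minority), k=3)
balanced_minority = minority + new_points
print(len(balanced_minority), "minority records after oversampling")
```

Because every synthetic point is an interpolation, it stays inside the region already occupied by the minority class rather than duplicating existing records.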

Feature selection
The feature selection process involves understanding the dataset and selecting the attributes that will produce the essential data required to infer the knowledge being sought. Feature selection is thus a procedure for identifying a relevant subset of the data from high-dimensional data [35].
Attributes that contribute more to the development of the model were identified using the WEKA evaluators SymmetricalUncertAttributeEval (SU), InfoGainAttributeEval (IG), GainRatioAttributeEval (GR) and CorrelationAttributeEval (CO), each coupled with the Ranker search method. Table 1 presents a summary of the attributes and how the algorithms ranked them.
Obesity, delayed healing and itching are found to be redundant attributes and not contributing to the model based on how they were ranked by the evaluators; hence, their removal from the classification task. The extracted features are polydipsia, polyuria, gender, sudden weight loss, partial paresis, irritability, polyphagia, age, alopecia, visual blurring, weakness, genital thrush and muscle stiffness.
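The entropy-based rankings described above can be illustrated with a short sketch. It computes information gain, gain ratio and symmetrical uncertainty for nominal attributes using their standard definitions and sorts the attributes by information gain, which mimics what a ranker does with an evaluator's scores. The records are toy values, not the study's data, and the function names are hypothetical; the correlation-based evaluator is not covered.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """H(class) - H(class | attribute) for one nominal attribute."""
    total = len(labels)
    conditional = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional

def gain_ratio(values, labels):
    """Information gain normalised by the entropy of the attribute itself."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info else 0.0

def symmetrical_uncertainty(values, labels):
    """2 * IG / (H(attribute) + H(class)); ranges from 0 to 1."""
    denom = entropy(values) + entropy(labels)
    return 2 * information_gain(values, labels) / denom if denom else 0.0

# Toy records (hypothetical, not from the dataset).
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
attributes = {
    "polyuria": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "itching": ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
}
ranking = sorted(attributes,
                 key=lambda name: information_gain(attributes[name], labels),
                 reverse=True)
print(ranking)
```

In this toy example the first attribute separates the classes perfectly while the second carries no information about the class, so an attribute like the second would be a candidate for removal.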

Classification
After the data preprocessing and feature selection processes were completed, the random forest, Naïve Bayes, J48 and K-nearest neighbor algorithms were applied using WEKA, a tested and trusted open-source software for machine learning developed at the University of Waikato, New Zealand [36]. Cross-validation and percentage split were used as the test modes, with 10 folds for cross-validation and an 80% split for percentage split. The class attribute was set as the target to be predicted, so that the model's output variable states the diagnosis outcome as either positive or negative. For validation purposes, this process was carried out five times, with the random seed changed from 1 to 5.
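The evaluation protocol can be sketched as follows: tenfold cross-validation and a single percentage split, each repeated with random seeds 1 to 5. This is a minimal illustration, not the WEKA procedure itself. It uses a simple 1-nearest-neighbour classifier on toy binary symptom vectors, the folds are not stratified, and reading the 80% figure as the training portion is an assumption; all function names and data are hypothetical.

```python
import random

def k_fold_indices(n, k, seed):
    """Shuffle 0..n-1 with the given seed and deal the indices into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def percentage_split(n, train_fraction, seed):
    """Shuffle 0..n-1 and cut once into (train indices, test indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]

def predict_1nn(train, query):
    """Class of the training record closest to query (Hamming distance)."""
    nearest = min(train, key=lambda rec: sum(a != b for a, b in zip(rec[0], query)))
    return nearest[1]

def cross_validate(data, k, seed):
    """Pooled accuracy over the k held-out folds for one random seed."""
    correct = 0
    for fold in k_fold_indices(len(data), k, seed):
        held_out = set(fold)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        correct += sum(predict_1nn(train, data[i][0]) == data[i][1] for i in fold)
    return correct / len(data)

# Toy data (hypothetical): (symptom vector, class); the class follows the first symptom.
data = [((a, b, c), "Positive" if a else "Negative")
        for a in (0, 1) for b in (0, 1) for c in (0, 1)] * 5  # 40 records

cv_scores = [cross_validate(data, k=10, seed=s) for s in range(1, 6)]
print("mean CV accuracy over seeds 1-5:", sum(cv_scores) / len(cv_scores))

train_idx, test_idx = percentage_split(len(data), 0.8, seed=1)
print(len(train_idx), "training records,", len(test_idx), "test records")
```

Averaging over several seeds reduces the dependence of the reported score on one particular shuffle of the data.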

Results and discussion
The results for the likelihood of diabetes based on the dataset are presented first for the cross-validation method, followed by the results obtained using the percentage split method. The results are then compared with those reported in [8,11]. The algorithms were implemented as discussed in the previous section. The performance metrics used were the Matthews correlation coefficient (MCC) [37]; the precision-recall area under curve (PR AUC) and the receiver operating characteristic area under curve (ROC AUC), which is well suited to evaluating performance on imbalanced datasets because of its robustness [38]; and accuracy and F-measure, which are obtained from the confusion matrix. The confusion matrix is used to determine how well a classification has performed [39] by reporting the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). Table 2 presents the average performance of the classifiers based on F-measure, MCC, ROC AUC, PR AUC and accuracy after the process was repeated five times with random seed values from 1 to 5.
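The confusion-matrix metrics named above follow directly from the four counts. The sketch below uses the standard definitions of accuracy, precision, recall, F-measure and MCC; the counts are illustrative values, not results from this study, and the function name `confusion_metrics` is hypothetical.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Standard metrics from confusion-matrix counts.

    Assumes every row and column of the matrix is non-empty,
    so that no denominator is zero.
    """
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }

# Illustrative counts only (hypothetical).
metrics = confusion_metrics(tp=90, tn=80, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four cells of the matrix, which is why it is often reported alongside accuracy when the classes are not balanced.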
A recent study [8] on the likelihood-of-diabetes prediction dataset was compared with the present investigation (Table 3). The accuracy, MCC, F-measure, ROC AUC and PR AUC of the models were obtained using the tenfold cross-validation and percentage split methods and compared with the previous study.
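For readers unfamiliar with ROC AUC, it can be computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below implements this pairwise formulation on made-up scores; the function name `roc_auc` and the data are hypothetical and are not outputs of the models in this study.

```python
def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Illustrative predicted probabilities and true classes (hypothetical).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
print(f"ROC AUC = {auc:.3f}")
```

A value of 1.0 means every positive case is scored above every negative case, while 0.5 corresponds to random ranking.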
Based on the results obtained, we have shown not only that better accuracy can be obtained by handling the imbalanced dataset but also that more accurate results can be obtained through feature selection. Our results also show that the random forest algorithm performs better than the other algorithms.
This is encouraging for the healthcare industry, as the model is neither expensive nor time-consuming to use. Because it does not require lab reagents and technical skills, unlike the OGTT and HbA1c tests, the model can also be used to detect diabetes, especially at its early stage, in remote areas where health facilities are not accessible.
The polyuria and polydipsia attributes are important because the kidneys are also affected when people have diabetes. In conditions such as fever or diarrhea, a patient's thirst is eliminated once water has been drunk. In diabetes this is not so, because high blood sugar puts pressure on the kidneys. The kidneys in turn produce more urine to remove the excess sugar, causing dehydration, which leads to thirst. The cycle continues until the kidneys are weakened and unable to function properly [14,40]. Sudden weight loss, occasioned by insufficient insulin, is one of the early symptoms of diabetes. Insufficient insulin prevents the body from moving glucose from the bloodstream into the body's cells to be used as energy. The body therefore burns fat and muscle for energy, which reduces total body weight. This in turn leads to polyphagia, because when the body's cells lack glucose, the person feels more and more hungry. Partial paresis arises when the body is not able to control the sugar in the blood, which can damage the blood vessels and nerves [14]. The results confirm that age is significant in diabetes diagnosis, as already reported by other research on this subject [14]. Gender is also crucial in diagnosing diabetes, as stated by a study that examined men and women separately [41] and attributed the difference to the higher amount of visceral fat in men.
Some features generally considered important, such as obesity and BMI, play no role in some models, including the one considered in this study and one of the models proposed by [11]. Similarly, a previous study [42] acknowledged that they are not statistically significantly associated with diabetes.

Conclusion
With the rate at which diabetes is increasing, there is a need to detect diabetes at its early stage. The present study shows the importance of machine learning in the healthcare industry for decision-making and for reducing the cost of diagnosis. The main contribution of the work is a new optimal model for predicting diabetes at its early stage; the work has also emphasized the importance of feature selection and proper handling of data. It would be interesting in future research to examine whether body size, height and BMI could be included in the dataset and what role these parameters play in the detection of diabetes.
The models created with WEKA are available to readers (https://github.com/OladosuO) for future research purposes.