Prediction of new prescription requirements for diabetes patients using big data technologies

Purpose – The study aimed to evaluate the effectiveness of using large data sets to predict new prescription requirements for diabetes patients. Design/methodology/approach – This study included 101,766 individuals who had applied to the hospital with a diabetes diagnosis, were hospitalized for 1–14 days and were subjected to laboratory tests and medication. Findings – With the help of Mahout and Scala, the data mining methods of random forest and multilayer perceptron were used. Accuracy rates of these methods were found to be 0.879 and 0.849 for Mahout and 0.849 and 0.870 for Scala, respectively. Originality/value – The Mahout random forest method provided a better prediction of new prescription requirements than the other methods according to accuracy criteria.


Introduction
Diabetes mellitus (DM) is a complex chronic metabolic disease associated with a high blood glucose level (hyperglycemia), resulting from deficiencies in insulin secretion, insulin action or both. Autoimmune destruction of the pancreatic gland, insulin secretion anomalies and insulin resistance play a role in the development of diabetes. Diabetes is divided into two categories: type 1 diabetes is caused by an insulin secretion disorder, and type 2 diabetes by the development of insulin resistance. The chronic metabolic imbalance associated with this disease puts patients at high risk for long-term macrovascular and microvascular complications and dysfunction of some organs which, if not managed with high-quality care, lead to frequent hospitalization and complications, including elevated risk for cardiovascular diseases (CVDs) and renal diseases [1]. In addition, ketotic and non-ketotic coma may develop.
The International Diabetes Federation estimates that there are approximately 387 million people diagnosed with diabetes across the globe, with two-thirds of them being adults aged 20-65 years and the proportion of deaths before 60 years ranging from 36% to 73% [2]. In general, 1.4 million newly diagnosed cases are reported in the US every year. If this trend continues, it is projected that in 2050, one in three Americans will have diabetes. Diabetes, with its associated side effects, remains the seventh leading cause of mortality in the United States [3]. In addition, epidemiological studies report that diabetes causes more deaths in Americans every year than breast cancer and acquired immunodeficiency syndrome (AIDS) combined [3]. The most important criterion in the treatment of diabetes is glycemic control. Drugs being evaluated to manage DM include oral GLP analogs (glucagon-like peptide), glucokinase activators, glucagon receptor antibodies, metformin and sodium-glucose cotransporter-2 (SGLT-2) inhibitors. The purpose is to keep HbA1c below 7% [4]. However, advances in diabetes treatment continue. New treatment regimens are emerging, especially to reduce cardiovascular mortality and the need for renal transplantation. Therefore, in order to ensure good metabolic control in diabetic patients and to ensure improved health and longevity, a combination of lifestyle changes, pharmacological treatment and prescription change (in cases where the current treatment is not beneficial) should be used [5]. Countries with the highest prevalence of diabetes account for 60% of the world's population, so researching new effective treatment regimens is essential [6].
The fact that at least 95% of the clinical data recorded in the healthcare sector are in video format indicates the importance of multimedia data within big data. Big data technologies have come into use because it is not possible to store and analyze all these data with standard database solutions and classical statistical methods. Big data is defined as "how businesses, states, hospitals, and organizations integrate datasets which are mostly not structured and continue to accumulate endlessly, are away from structurality to the extent that they cannot be analyzed with traditional association-based database techniques and are very big, raw and growing exponentially, and explore information that has remained hidden and surprise correlations through methods of statistics and data mining" [7][8][9]. Today, data are still kept in different formats in hospital systems, and it is almost impossible to collect these data in the same format for multi-center studies. Therefore, apart from a few articles, there are no studies conducted with big data in the literature. Mahout and Scala, used in this study, are software packages like SPSS (Statistical Package for the Social Sciences) that enable analysis of big data. The performance of the two software packages was compared on the same data, and the results they provided for the data mining methods are presented. This study aimed to find the factors affecting the variable "New Prescription" and to determine whether patients need a medication change using the big data technologies of Mahout and Scala.

Datasets
Health Facts is a database that records details of hospital data in the USA, including electronic medical records, demographics, hospital procedures, laboratory findings, pharmacy data and hospital death rates. Diabetes is a disease affected by genetic and environmental factors. There are more than 100 genetic differences in diabetes [10]. This gives us a chance to customize the treatment. However, it may be difficult to apply a fixed treatment given that there are so many factors. Different responses to antidiabetics have increased the importance of pharmacogenetics. The dataset used in our study was retrieved from the Health Facts database (Cerner Corporation, Kansas City, MO). This dataset involved hospitals in the Central (18 hospitals), Northeast (58 hospitals), Southern (28 hospitals) and Western (16 hospitals) regions of the USA between 1999 and 2008. There were 50 variables regarding patient and hospital results in the dataset. Approval was obtained to use the data set [11,12]. Out of this dataset, 101,766 individuals who had applied to the hospital with a diagnosis of diabetes, were hospitalized for 1 to 14 days and were subjected to laboratory tests and medication were included in our study. Variables that had high missing data percentages, that had some categories with very few observations, and that were deemed to have no effect on the dependent variable were excluded from the study, and 10 out of 49 independent variables were included [11]. It must be noted that even when data sets with the same features are taken from two different centers, their characteristics differ. These differences also show up in the statistical tests performed (for example, the average age or the gender ratio is different). For this reason, it is more appropriate to use different data mining methods and/or different data mining software packages on a given data set. We aimed to do this in our study and presented the results in table format. The scope of the research relied on the available data.

Big data technologies
Through the latest technologies, big data provides the opportunity to analyze data types that cannot be analyzed with standard methods, such as text, audio and video [13,14].
Hadoop. Hadoop is an open-source library that was developed in Java and runs the applications necessary for processing and analyzing big data on a cluster formed by multiple servers. Hadoop is composed of the Hadoop Distributed File System (HDFS) and MapReduce.
HDFS combines the disks of multiple servers and uses them as a single virtual disk for storing huge amounts of data that cannot be stored on one server [15][16][17]. MapReduce is used for processing the large-scale data stored on HDFS. It is composed of the Map function, developed to filter data, and the Reduce function, used to produce outputs from the data [18].
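The Map and Reduce steps described above can be illustrated with a minimal single-machine sketch in plain Python. It does not use the Hadoop API; the record layout and the function names (`map_phase`, `shuffle`, `reduce_phase`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical records: (admission_type, needs_new_prescription)
records = [
    ("emergency", True),
    ("emergency", False),
    ("elective", True),
    ("emergency", True),
]

def map_phase(record):
    """Filter one record and emit (key, value) pairs."""
    admission_type, needs_new_prescription = record
    if needs_new_prescription:  # the filtering step of Map
        yield (admission_type, 1)

def shuffle(pairs):
    """Group values by key, as a framework would do between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values of one key into a single output."""
    return key, sum(values)

mapped = chain.from_iterable(map_phase(r) for r in records)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster the map calls run in parallel on the servers that hold the HDFS blocks, and only the grouped key/value pairs travel over the network to the reducers.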
Machine learning. Machine learning is also called automatic modeling and tests the data with several models to achieve the best fit possible. The velocity and volume of big data make the use of machine learning important [19,20].
Machine learning libraries. The most commonly used machine learning libraries are Mahout and Scala. Mahout has features such as data preparation, modeling and accessing information via a model. It is frequently used for classification and clustering. The most used classification algorithms in Mahout are logistic regression, Naïve Bayes and random forest, and the most used clustering algorithms are k-means, Canopy and MinHash [19,20].
Scala is also regarded as a programming language, as it combines object-oriented and functional programming. It has its own compiler, so it can compile and run Java code easily. Since it can use all libraries and features offered by Java, it is possible to produce all Java projects in Scala, too [19,20].

Data pre-processing procedure
Variables were evaluated using the gain ratio, information gain and chi-squared attribute evaluation (variable importance) methods, and the variables that were considered insignificant by all three methods and were judged less important on clinical evaluation were excluded from the data set. Data were randomly divided into two datasets: training data (80%) and test data (20%). Following these procedures, the data were transferred to the big data technologies of Mahout and Scala, and new prescriptions were predicted using 10 independent variables with the help of the random forest and multilayer perceptron algorithms. Mahout and Scala, used in the study, are software applications like SPSS that enable analysis of big data. Although these software algorithms are already in the literature, their use is not common. In the study, these methods were also evaluated to compare the results. Gain ratio, information gain and chi-squared attribute evaluation are methods used routinely in data mining; they give the degree of importance of the independent variables with respect to the outcome variable.
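As an illustration of two of the steps named above, the following sketch computes information gain and gain ratio for one categorical attribute and performs a random 80/20 split. It uses the textbook definitions; the toy data, function names and seed are assumptions for illustration and are not taken from the study's software.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in label entropy after splitting on an attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain normalized by the entropy of the attribute itself."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly divide rows into training and test parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Toy example: 1 = new prescription needed, 0 = not needed (hypothetical values).
new_prescription = [1, 1, 0, 0]
informative = ["high", "high", "low", "low"]   # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]           # carries no information

print(information_gain(informative, new_prescription))
print(gain_ratio(uninformative, new_prescription))
train, test = train_test_split(range(10))
print(len(train), len(test))
```

An attribute that scores near zero on such measures is a candidate for exclusion, which in the study was additionally checked against clinical evaluation.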
We used the multilayer perceptron as a neural network algorithm. In fact, the study also examined methods such as support vector machine, J48 and logistic regression. Since the two methods that gave the best results were multilayer perceptron and random forest, the results of these methods were included in the study.

Ethical issue
The dataset was approved and obtained for use from the Health Facts database (Cerner Corporation, Kansas City, MO) [11,12].

Results
There were 101,766 patients in the study; 78,363 (77.0%) of these patients required a new prescription, while 23,403 (23.0%) did not. The majority of patients applying to the emergency department required new prescriptions (76.4%). The average length of hospital stay for patients requiring new prescriptions was 4.5 days, while patients who did not need a new prescription stayed for an average of 4.1 days. The descriptive statistics of the explanatory variables in the dataset, by level of the dependent variable, are given in Tables 1 and 2.

Mahout random forest
In the first stage of the Mahout random forest method, the training data were run in the Mahout environment and were divided in several different ways to experiment with until the best-fitting model was created. The Map procedure was performed to filter the data and create key/value pairs, and Reduce was performed to reduce the data.
The time taken for the map and reduce procedures to be completed was approximately nine minutes (541,348 ms) in the training model. As for the comparison between the numbers of read and written bytes, the number of bytes read from the file was much lower than the number read during processing on HDFS (FILE Number of bytes read = 3,369, HDFS Number of bytes read = 5,624,479) (Figure 1).
While the model creation required 5,624,095 bytes of data during reading, its processing and writing stage required 4,440,974 bytes of data. Even in this stage, an advantage of about 25% was achieved with the reduction process. The training model was created in 3 minutes 42 seconds. This process would have taken about 30 minutes if it could have been distributed to and processed on five computers with standard hardware used today. The Mahout technology selects the number of nodes and the depth of the decision trees in the random forest method in such a way that they provide the best outcome with the least processing. In our model, the optimum number of nodes (Forest num Nodes) was found to be 269,245, the mean number of nodes (Forest mean num Nodes) 2,692 and the depth (Forest mean max Depth) 26 (Figure 2).

Descriptive statistics of quantitative variables in the dataset
The time taken for the map and reduce procedures performed with the entire hardware was approximately nine seconds (8,724 ms) when creating the Mahout random forest test model. As for the comparison between the numbers of read and written bytes, the numbers of bytes read and written during reading from the file were higher than the numbers during processing on HDFS (FILE Number of bytes read = 4,472,619, HDFS Number of bytes read = 1,396,768). The reason is that the number of processes on HDFS decreased because the model had already been created during the training stage.
For the test model, data read required 1,396,642 bytes, and write required 4,112,43 bytes in the Mahout random forest method. The fact that the number of bytes required for data write decreased by one-third demonstrates the success of the Mahout technology in the reduction process. This significantly shortens the time spent on test data outcomes.
In the last stage, accuracy and F-measure values were obtained by applying the model created from the training data with the Mahout random forest to the test data. The accuracy and F-measure values were found to be 0.879 and 0.662, respectively.
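For reference, accuracy and F-measure can be computed from the counts of a binary confusion matrix as sketched below, using the standard definitions. The counts are hypothetical; the study does not report its confusion matrix or state which class was treated as positive.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for illustration only (not the study's results).
tp, fp, fn, tn = 40, 10, 10, 40
print(round(accuracy(tp, fp, fn, tn), 3))
print(round(f_measure(tp, fp, fn), 3))
```

Because the F-measure ignores true negatives, it can differ noticeably from accuracy when the two classes are unbalanced, as they are in this dataset (77.0% versus 23.0%).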
One of the tree diagrams of the random forest is shown in Figure 3. The accuracy value was calculated as 0.872 and the F-measure value was calculated as 0.659. These values, calculated from the single tree structure, were close to the values calculated by random forest.
By using this tree structure it can be concluded that when the number of medications is greater than eight, a prescription change is probably needed. When the number of medications is less than or equal to eight, the number of diagnoses is greater than five and Readmitted is greater than or equal to thirty, a prescription change is probably needed, and so on (Figure 3).

Scala random forest
Parameters are provided manually within the code in the Scala method. Hence, outcomes of the random forest model were calculated using the parameters created for Mahout. In the test model, the accuracy and F-measure values were found to be 0.849 and 0.604, respectively.

Mahout multilayer perceptron
In the multilayer perceptron analysis conducted with the Mahout technology, the scenarios were similar to those in the random forest and the best-fitting parameters were selected automatically. In the test model, the accuracy and F-measure values were found to be 0.849 and 0.580, respectively.

Scala multilayer perceptron
Parameters are provided manually within the code in the Scala method. Hence, outcomes of the multilayer perceptron model were calculated using the parameters created for Mahout. The accuracy and F-measure values were found to be 0.870 and 0.709, respectively, in the test model.

Performance comparison of classification methods by big data technologies
Accuracy and F-measure values for the random forest and multilayer perceptron methods using the Scala and Mahout technologies are given in Table 3. The accuracies of the random forest and multilayer perceptron methods were found to be 0.879 and 0.849 with Mahout and 0.849 and 0.870 with Scala, respectively. According to these rates, the accuracy of random forest using the Mahout technology was higher than that of the others.
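The comparison can be reproduced from the four pairs of values reported in the preceding sections. The sketch below only ranks those reported numbers; the dictionary layout and the helper name `best_by` are illustrative assumptions.

```python
# Accuracy and F-measure values as reported in the text for the test data.
results = {
    ("Mahout", "random forest"): {"accuracy": 0.879, "f_measure": 0.662},
    ("Mahout", "multilayer perceptron"): {"accuracy": 0.849, "f_measure": 0.580},
    ("Scala", "random forest"): {"accuracy": 0.849, "f_measure": 0.604},
    ("Scala", "multilayer perceptron"): {"accuracy": 0.870, "f_measure": 0.709},
}

def best_by(metric):
    """Return the (technology, method) pair with the highest value of a metric."""
    return max(results, key=lambda key: results[key][metric])

best_accuracy = best_by("accuracy")
best_f_measure = best_by("f_measure")
print("Highest accuracy:", best_accuracy)
print("Highest F-measure:", best_f_measure)
```

The ranking depends on the criterion chosen: by accuracy the Mahout random forest is first, whereas the highest reported F-measure belongs to the Scala multilayer perceptron.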

Discussion
There are many studies on data mining methods for the prediction of diabetes. Malik et al. [21] used linear data logistic regression, ANN, linear support vector machine and radial basis function support vector machine methods in their study to diagnose and predict diabetes with 175 people (87 healthy and 88 with type 2 diabetes). The accuracy values of these methods were found to be 75.86, 80.70, 77.93 and 84.09, respectively. Farran et al. [22] used logistic regression, support vector machine, k nearest neighbors and multifactor dimension reduction methods for predicting diabetes in a study planned with 10,632 patients. The accuracy values of these methods were 80.7, 81.3, 78.6 and 78.3, respectively. Tapak et al. [23], in their study with 6,500 patients, used logistic regression, linear discriminant analysis, fuzzy c-means, support vector machine, neural network and random forest methods and found accuracy values for these methods of 0.935, 0.925, 0.859, 0.986, 0.931 and 0.930, respectively.
In their study of 768 patients, Rajesh et al. [24] used many data mining algorithms and obtained the best results from RMD Tree. However, due to the over-fitting problem of this method, they preferred C4.5, which gave the second-best result with an accuracy value of 0.910. Meng et al. [25], in their study of 1,487 people (735 diabetes or prediabetes patients and 752 controls), used logistic regression, artificial neural network and decision tree methods for the prediction of diabetes or prediabetes. They found the accuracy values of these methods to be 0.761, 0.822 and 0.807, respectively.
El-Sappagh et al. [26] implemented a framework and tested the accuracy of this system in their study with 60 patients diagnosed with diabetes. The accuracy value of the system was found to be 0.977. They compared their framework with existing CBR systems and a set of five machine-learning classifiers, and their system outperformed all of these methods. Deep learning is a method that has recently become popular. It is preferred for analyses that include time data, such as image processing and survival analysis. It gives better results on such data, but on data used for modeling purposes, as in our study, it gives similar results. In our study, data mining methods were preferred because our main purpose was modeling and classification-based analysis.
The data in our study included 101,766 individuals who applied to the hospital with a diagnosis of diabetes. Using the data mining methods of random forest and multilayer perceptron with the help of big data technologies, new prescriptions were predicted on these data. The accuracies of the random forest and multilayer perceptron methods using the Mahout technology were found to be 0.879 and 0.849, respectively, whereas the accuracies of the random forest and multilayer perceptron methods using the Scala technology were found to be 0.849 and 0.870. Mahout and Scala are big data-based software programs that run data mining algorithms. The logic is similar to the Student's t-test being available in both the SPSS software and the R programming language: the same method is offered by different software. For the analysis, five SSD cloud servers were run simultaneously and the results were processed by means of the map and reduce functions. The use of servers with higher vertical configurations only reduces processing time and does not cause any change in the performance criteria.
The results are better because large data sets reveal the relationships between variables, and even events with a low probability of occurrence are represented in the training data and can be learned. This is one of the reasons for recommending the use of big data in studies. These results suggest that big data technologies provide good results for diabetes patients and that it is necessary to use them for new knowledge discovery about this disease.

Conclusion
Big data research will continue to increase since diabetes is a major health problem. Diabetes with hyperglycemia and many complications can be reversed with appropriate approaches. New treatment modalities should be developed, as diabetes and its complications create high costs [27]. Large data sets should be used in the development of these treatment modalities. With the increasing size of data, big data will continue to expand in the years to come, and each data scientist will have to manage more data each year. These data will be more diverse, bigger and faster and can be a potentially exciting opportunity for the future. Thus, big data will be the new frontier for scientific data research and business applications.
The use of big data provides advantages in the healthcare sector by allowing more tests or more attributes to be included in research; this speeds up the validation of studies and provides sufficient examples for training when only a small number of positive examples is available. On the path to medical informatics, big data obtained from all levels of medical data are utilized, and the best possible ways of analyzing them and answering as many medical questions as possible are being sought to improve patients' health [28].
Future research may focus on using all data of patients and diseases to offer clinicians more diagnoses, treatments and ways of helping patients. Each of these technologies can be developed further, and whether the results are the same in different populations can be tested using large-volume and diversified datasets. These technologies give a short overview of the opportunities that can be accessed through big data analysis in the data mining and healthcare fields.

JHR 36,2
In this study, the Mahout random forest method provided a better prediction of new prescription requirements than the other methods according to accuracy criteria. Consequently, when the clinical parameters (risk factors) used in a new prescription prediction are known, a model can be created with the Mahout random forest method, and this model can be used as an alternative in the diagnosis of patients and serve as an assistant tool for clinicians. Models created from analyses using big data can be transformed into software or algorithms and used as an assistant tool in the diagnosis stage. Most hospitals do not have such ancillary software for physicians, and physicians continue diagnosis and treatment based on their experience and knowledge level.

Figure 1. Read, write, byte amounts and processing times when creating the Mahout random forest training model

Figure 2. Time spent, byte quantity, node and depth count of the created tree for the Mahout random forest training model

Figure 3.

Table 1. Descriptive