Significant factors associated with malaria spread in Thailand: a cross-sectional study

Purpose – This paper aims to uncover new factors that influence the spread of malaria. Design/methodology/approach – Historical data related to malaria were collected from government agencies. The data were then cleaned and standardized before analysis. To simplify the large number of factors, factor analysis was executed first to determine the groups of factors related to malaria distribution. Machine learning was then deployed, and the confusion matrices were computed. The results from the machine learning techniques were further analyzed with logistic regression to study the relationship of variables affecting malaria distribution. Findings – This research detected 28 new noteworthy factors. With all the defined factors, a logistic model tree was constructed. The precision and recall of this tree are 78% and 82.1%, respectively. When the significance of all 28 factors was considered under logistic regression with forward stepwise selection, the indispensable factors were found to be the number of houses without electricity (houses), number of irrigation canals (canals), number of shallow wells (places) and number of migrated persons (persons). However, all 28 factors must be included to obtain high accuracy in the logistic model tree. Originality/value – This paper may lead to highly efficient government development plans, including proper financial management for malaria control sections. Consequently, the spread of malaria can be reduced naturally.


Introduction
Malaria is caused by Plasmodium spp. parasites, which are transmitted by Anopheles mosquitoes. This disease is considered a major public health problem, particularly in countries located in the tropical and subtropical regions of the world. Specifically, it is most prevalent in low-income countries [1,2]. As mosquitoes are an important vector that spreads malaria easily, implementing a vector control policy is necessary to prevent malaria transmission. Various methods are implemented for vector control, including insecticide-treated bed nets (ITNs) [3][4][5], indoor residual spraying (IRS) [2,6,7] with residual insecticides, and the combination of IRS and long-lasting ITNs (LLINs) [7]. Nevertheless, the malaria report [1] stated that severe drug-resistant strains of malaria, also known as super malaria, have appeared in the Mekong River Basin and are continuously spreading in many countries in Southeast Asia. Thus, eliminating malaria using various kinds of nets and drugs is not a sustainable solution.
Numerous studies and preventive measures address the spread of malaria, including an evolutionary-epidemiological modeling framework [8], determining the role of chemoprophylaxis treatment for malaria infection and distribution in the risk area [9] and identifying the relation between land use and the mosquitoes' distribution [10]. Moreover, human population movement has been measured, and the associated predictors of travel and the correlation between population movement and self-reported malaria among people living within malaria hotspots have been identified [11]. The United Nations High Commissioner for Refugees (UNHCR) has related this population movement to an action plan to monitor and protect against malaria, resulting in lower expenses and greater efficacy in the long term than solving drug-resistance problems. In addition, the World Health Organization (WHO) [2] suggested that high-risk migrant communities must be mapped and that the distribution of ITNs be continued [4], similarly to [5], which recommended a map to determine the pattern and stability of malaria hotspots in Bangladesh to inform intervention planning for elimination. It is a challenge for the government to continue supporting the project to achieve the goals and future strategies for malaria elimination [12].
This research aims to identify alternative factors that support the sustainability of malaria elimination plans from government organizations, including the relationships among factors, so that the impacts of key factors can be explained. This study mainly considers quantitative factors such as physical and biological data, excluding laboratory data. These factors may raise awareness of unexpected malaria distribution and should be included in the guidelines for malaria control policy.

Data collection
Various research papers have defined the factors that affect the distribution of malaria, such as demographic factors [3], socioeconomic factors [3,13] and the educational background of the population [3]. According to the literature between 2006 and 2018, additional factors included housing characteristics [3], weather data [13][14][15], etc. This research considers all these factors as the base factors that cause malaria distribution. The factors that those malaria papers relate to these base factors are also included. For example, the number of mobile phones, the number of computers, the amount of internet access and the number of communication sources [3] were included under the educational background factor because all of these factors can contribute to education in the community. In total, 294 factors covering the entire 11 years were selected from reliable sources. Most of the data belong to the National Statistical Office (NSO) because it serves as the data center of the government. All data from the NSO are open and free to retrieve at any time via the NSO statistical report website, where the data for each factor are summarized for every province. Additionally, data from the Royal Forest Department and the Land Development Department were added to clarify the existing data from the NSO. The collected data were then filtered and cleaned by checking for missing or duplicate data of the same factors obtained from different sources. When the NSO data were incomplete, data from another source, such as the Royal Forest Department or the Land Development Department, were used for the data set of this research. As a result, only 71 factors remained for the next procedure. Finally, all data were standardized prior to the computing and analyzing procedures.
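The standardization step can be illustrated with a minimal sketch. The text does not state which standardization was applied, so the column-wise z-score below is an assumption, and the function name and the sample values are hypothetical.

```python
import numpy as np

def standardize(X):
    """Column-wise z-score: subtract the mean, divide by the sample standard deviation.

    Assumption: the paper only says the data were "standardized"; the z-score
    is a common choice, not necessarily the one the authors used.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant columns become all zeros instead of dividing by zero
    return (X - mean) / std

# Hypothetical values: 4 provinces x 2 factors measured on very different scales.
raw = np.array([[1200.0, 3.0], [800.0, 5.0], [1500.0, 2.0], [500.0, 6.0]])
Z = standardize(raw)
```

After this step every factor has mean 0 and unit sample standard deviation, so factors counted in houses, canals or persons become comparable.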
Once the 71 factors that were expected to have an impact on malaria spread were collected, the number of malaria patients over 13 years was also collected from the Epidemiology Department, Ministry of Public Health in the same period as the 71 factors. As Thailand has 77 provinces, each province has 13 historical records of the number of malaria patients. The cumulative number of malaria patients over the 13 years was therefore calculated for each province and used as the number of malaria patients in the province.
Data analysis
Variable definitions. To analyze the standardized data, the provinces are classified into two groups, a high-distribution group and a low-distribution group, based on the distribution rank of patients. These two groups rely on the total number of patients within 13 years, from 2006 to 2018, whereas the variable itself represents the number of patients each year. If the number of patients of a province ranks within the first 10 during the 13 years, the province is in the high-distribution group; otherwise, it is in the low-distribution group. According to the number of patients and the data for 13 years, there are 21 provinces in the high-distribution group and 56 provinces in the low-distribution group.
Since the research focused on the distribution of malaria patients in Thailand, two dependent variables, the number of patients and the group of patients, are deployed, while the 71 factors collected from government agencies are classified as independent factors that affect both dependent variables. To obtain the values for these 71 factors, the mean of the recorded data from 2007 to 2017 was calculated for each factor and used as the representative value of that factor for the entire 11 years.
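The variable construction described above can be sketched as follows. The function names and toy data are hypothetical, and the grouping rule is only one possible reading of the text (a province is "high" if it is among the k largest yearly counts in at least one year); the authors' exact rule may differ.

```python
import numpy as np

def summarize_province_data(patients_by_year, factor_by_year):
    """Collapse yearly records (rows = provinces, columns = years) into one value per province."""
    cumulative_patients = np.asarray(patients_by_year).sum(axis=1)   # total over all years
    factor_mean = np.asarray(factor_by_year, dtype=float).mean(axis=1)  # mean over all years
    return cumulative_patients, factor_mean

def high_group_mask(patients_by_year, k=10):
    """Assumed rule: True if a province is among the k highest counts in at least one year."""
    counts = np.asarray(patients_by_year)
    # Province indices of the k largest counts in each year (each column).
    top_rows = np.argsort(-counts, axis=0, kind="stable")[:k, :]
    mask = np.zeros(counts.shape[0], dtype=bool)
    mask[np.unique(top_rows)] = True
    return mask

# Toy data: 3 provinces, 2 years.
patients = np.array([[5, 1], [3, 9], [1, 2]])
factor = np.array([[10.0, 20.0], [4.0, 6.0], [0.0, 2.0]])
cumulative, factor_mean = summarize_province_data(patients, factor)
high = high_group_mask(patients, k=1)
```

With the real data, `patients` would have 77 rows and 13 columns, and each factor array 77 rows and 11 columns.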
Therefore, there are 71 independent factors and 2 dependent variables that should be in the malaria distribution model. Nonetheless, 71 independent factors are too many for creating a model. Thus, the reduction of these 71 factors should be performed using factor analysis, which is described in the following section.
Factor analysis varimax rotation. Factor analysis is a method to classify similar variables that are related or have the same qualities into one group. The structures of relationships between variables are defined and renamed as components. Each component incorporates all variables that are highly correlated with each other, whereas different components have no or very low correlation with each other. Thus, this technique helps to reduce the number of variables in the analysis process without changing the values or meaning of the original data. Moreover, variable grouping by this method can also prevent multicollinearity problems in the regression analysis [16].
Since there are 71 factors to be considered in the malaria distribution model, the model would be too complicated. Thus, the suitable and reasonable factors among the 71 factors should be selected using the factor analysis method. Considering 95% confidence intervals in the factor analysis with IBM Statistical Product and Service Solutions (SPSS) version 22 (IBM Corp. Released 2013; IBM SPSS Statistics for Windows, Version 22.0. Armonk, NY: IBM Corp.), nine non-overlapping categories of factors were identified with different correlation values of variables in the component matrix. The first category from the analysis result is the group with the highest correlation (r) with the spreading of malaria, at least 0.9 for each factor, while the correlations of the other eight categories are less than 0.5. So, this first category with 28 factors was selected as the significant influences on malaria distribution. Consequently, the independent factors for malaria distribution were reduced from 71 to 28 factors according to the results of factor analysis. As a result, many factors defined by other research on malaria distribution, such as forest area and climate situation, have been eliminated from the computing list.
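The rotation step named in the heading can be sketched in a few lines. This is the commonly used SVD-based varimax iteration, not the SPSS implementation used in the study; the loading matrix is hypothetical, and the extraction of the unrotated loadings is assumed to have been done already.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Rotate a (variables x components) loading matrix with an orthogonal matrix R.

    gamma=1.0 gives varimax. Returns the rotated loadings and R.
    """
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # Gradient-like matrix of the varimax criterion at the current rotation.
        G = L.T @ (Lr**3 - (gamma / p) * Lr @ np.diag(np.sum(Lr**2, axis=0)))
        u, s, vt = np.linalg.svd(G)
        R = u @ vt  # nearest orthogonal matrix to G
        d_new = s.sum()
        if d != 0 and d_new < d * (1 + tol):
            break  # criterion stopped improving
        d = d_new
    return L @ R, R

# Hypothetical unrotated loadings: 5 variables on 2 components.
unrotated = np.array([[0.7, 0.3], [0.8, 0.2], [0.6, 0.4], [0.2, 0.7], [0.3, 0.6]])
rotated, R = varimax(unrotated)
```

Because R is orthogonal, the communality of each variable (the row sum of squared loadings) is unchanged; only the way the variance is shared among components changes, which is what makes the groups easier to read.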

Factors that influence the spread of malaria
Logistic model tree (LMT). Searching for a fitting model among all factors using machine learning (ML) can provide high prediction accuracy even when the data are complicated. In this research, Weka 3.8 is used to run ML classification. Although Weka 3.8 provides various classification techniques, such as the k-nearest neighbor method, logistic regression, the naive Bayes classifier, decision trees and support vector machines, the most suitable classification method depends on the characteristics of the collected data. Here, the most suitable classification models were decision trees. As the research of [6] applied mixed linear regressions to estimate the reduction of malaria after IRS, one interesting subcategory of decision trees is the logistic model tree (LMT). An LMT is a classification model that was introduced by Landwehr et al. [17] in 2005. This model combines the logistic regression and decision tree models. Consequently, the structure of an LMT is based on a decision tree that has logistic regression functions at its leaves, analogous to a model tree with linear regression at its leaves that provides a piecewise linear regression model. The method to create an LMT starts by splitting the data set into two groups: a training data set and a testing data set. The LogitBoost algorithm is run on the training data set to find the logistic regression model, and the testing data set is used to validate the accuracy of the obtained model. The first logistic regression model is located at the root node. It is iteratively refined in the same way as the linear models are constructed in stepwise model tree induction (SMOTI). At each lower level of the tree, the model fitted at the parent node is refined, so the models grow from simple ones near the root to ones with more variables toward the leaves.
In this study, the dependent data are the nominal data with two values, so the next step is to split the training data set into two individual subsets based on their nominal values, each of which runs through the LogitBoost algorithm to obtain two other regression models. These processes can be iterated until the best logistic regression models are obtained in every leaf node of the tree.
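As a compact restatement of the leaf models described above, the following follows the general form given by Landwehr et al. [17]. The symbols are notation chosen here for illustration: $V_t$ is the subset of factors used at leaf $t$, $\alpha$ are the fitted coefficients, and $J$ is the number of classes ($J = 2$ in this study, the high- and low-distribution groups).

```latex
F_j(x) = \alpha_0^{j} + \sum_{v \in V_t} \alpha_v^{j}\, v,
\qquad
\Pr(G = j \mid X = x) = \frac{e^{F_j(x)}}{\sum_{k=1}^{J} e^{F_k(x)}}
```

Each leaf therefore returns a class probability rather than only a class label.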
This model also fits the data with two independent groups. As a result of using logistic regression to derive the LMT, an explicit class probability estimation can be calculated, which is a significant advantage compared to the use of logistic regression alone. This model avoids confounding effects by analyzing the association of all the variables together [18].
Though the LMT classification model was chosen, the data from the two different groups must be balanced. Unfortunately, the original data in this research are imbalanced. Thus, a balancing mechanism is required before the data are passed to the LMT classification model. Later, all 28 factors in the LMT model (Table 1) are re-arranged using factor analysis, and the outputs are passed on to derive the logistic regression model, which reduces the complexity of the malaria distribution model and allows the effects of all factors to be elaborated.
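The text does not say which balancing mechanism was applied. As one simple possibility, the sketch below uses random oversampling of the minority class. The function name is hypothetical, and the data only mirror the reported group sizes (21 high-distribution and 56 low-distribution provinces).

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until all classes are equal in size.

    Assumption: this is an illustrative balancing scheme, not necessarily the authors' choice.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        keep.append(idx)
        if count < target:
            # Sample with replacement to fill the gap to the majority class size.
            keep.append(rng.choice(idx, size=target - count, replace=True))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Group sizes from the text: 21 high-distribution (1) and 56 low-distribution (0) provinces.
y = np.array([1] * 21 + [0] * 56)
X = np.arange(77).reshape(-1, 1)  # placeholder feature column
X_bal, y_bal = random_oversample(X, y)
```

Oversampling keeps every original record; undersampling the majority class or generating synthetic minority records are alternatives with different trade-offs.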

Ethical approval
This research comprised secondary data mainly retrieved online from the websites of the NSO, Land Development Department and the Royal Forest Department, Thailand.

Results
The indicators to measure the accuracy of the model include the probability of falsely rejecting the truth value, called a false positive rate (FPR), the ratio of correctly predicted positive observations to the total predicted positive observations (precision), and the ratio of correctly predicted positive observations to all observations in actual class (recall). The data under 28 factors entering the LMT model using Weka 3.8 (Table 1) show that the precision, the recall and FPR values are 78%, 82.1% and 23.2%, respectively. As the recall is over 80%, it indicates that the performance of the LMT model is acceptable.
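The three indicators defined above follow directly from the cells of a binary confusion matrix. The sketch below uses hypothetical counts, not the counts behind the reported values.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall and false positive rate from confusion-matrix counts."""
    precision = tp / (tp + fp)  # correct positives among predicted positives
    recall = tp / (tp + fn)     # correct positives among actual positives
    fpr = fp / (fp + tn)        # false alarms among actual negatives
    return precision, recall, fpr

# Hypothetical counts for illustration only.
precision, recall, fpr = binary_metrics(tp=40, fp=10, fn=10, tn=40)
```

For these counts the function gives a precision of 0.8, a recall of 0.8 and an FPR of 0.2.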
Since the LMT is a type of decision tree classifier, a comparison between the basic decision tree (J48) and the LMT was performed. The results of running the J48 decision tree classifier show that the precision, recall and FPR values are 73.2%, 73.2% and 26.8%, respectively. This indicates that the performance of the J48 is inferior to the LMT model, although the recall is higher than 50%. In Table 1, these 28 factors can be grouped as follows:
(1) Information about the population in the province: population density in a province (persons) [3], number of houses (1:1000 houses) [3], number of farmers' families (families), number of non-migrated persons (persons), number of migrated persons (persons) [4,11].
(2) Information about public utilities: number of houses with tap water (houses), number of houses without electricity (houses), number of houses with electricity (houses), electricity from the government (houses), number of registered vehicles (cars).
(6) Information about the environment [13,14]: provincial area (rai), provincial agricultural area (rai), provincial non-agricultural area (rai), number of rivers (rivers), number of canals (canals), number of shallow wells (places), number of artesian wells (places).
All 28 factors from the LMT model were entered into the forward stepwise process to derive a logistic regression that reveals the preponderant factors (Table 2).
In Table 2, the positive and negative effects of each factor can be seen. When the number of houses without electricity increases, the risk of malaria distribution increases about six times (adjusted OR = 6.41; 95% CI = 1.81–22.65). The results also indicate that increasing the number of irrigation canals is a way to decrease malaria (adjusted OR = 0.20; 95% CI = 0.06–0.66). On the other hand, the number of shallow wells can be counted as a risk factor because each unit increase raises the distribution of malaria about five times (adjusted OR = 4.83; 95% CI = 1.64–14.26). The number of migrated persons is another risk factor because the malaria spread increases about 1.19 times whenever the number of migrated persons increases (adjusted OR = 1.19; 95% CI = 1.05–2.70).
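The link between a logistic regression coefficient and the reported odds ratios can be sketched as follows, assuming the usual Wald interval: OR = exp(β) and 95% CI = exp(β ± 1.96·SE). The function names are hypothetical. The coefficient and standard error are not reported in the text, so they are back-calculated here from the values given for houses without electricity.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald confidence interval from a logistic coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def se_from_ci(lower, upper, z=1.96):
    """Recover the standard error of the coefficient from a reported CI on the odds-ratio scale."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

# Reported for houses without electricity: adjusted OR = 6.41, 95% CI = 1.81-22.65.
beta = math.log(6.41)
se = se_from_ci(1.81, 22.65)
or_, lo, hi = odds_ratio_ci(beta, se)
```

The interval rebuilt this way is close to the reported one, which is what a Wald interval on the log-odds scale would give. An interval that excludes 1, as all four reported intervals do, indicates a statistically significant factor at the 5% level.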

Discussion
Malaria is a severe illness against which individuals must be protected. Unlike other research, this paper examined, analyzed and synthesized historical data from government agencies to find the factors that support the long-lasting prevention and elimination of malaria distribution. Based on the study of historical data from government organizations, 28 new factors were discovered as influences on malaria distribution, with either direct or indirect effects. Thus, new methods of protection against malaria distribution can be implemented according to these new factors. For example, the government should control the quality of rivers or even provide electricity to houses in high-risk areas, which indirectly means delivering a good quality of life. However, using all 28 factors for malaria distribution control is ideal but highly complicated and expensive in real life. The departments responsible for malaria distribution control in each area must select, from these 28 factors, only the significant factors related to their landscape.
Direct preventive factors mainly involve social and demographic factors such as occupation, household size, economic agriculture, etc. These lead to the improvement of protection plans for the affected community. At the same time, the indirect impact factors include various elements such as electricity usage, communication technologies, rivers, etc. Nevertheless, the Thai government puts its efforts into the small details of community prevention rather than into these factors. Therefore, to maintain the reduction of malaria, these indirect impact factors should be defined as parameters in the risk model to raise awareness of the spread of malaria infection, including the completeness of control.
Many studies have identified factors related to the distribution of malaria, such as forests, vegetation cover, temperature, rainfall and humidity, which have an impact on the sustainability of this disease [13][14][15][19][20][21]. In addition, changes in land usage also affect the spread of malaria [22,23]. Although various factors have been identified and monitored in malaria protection programs, the spread of this disease still exists, and drug resistance is increasing. Consequently, the number of ailments is expanding and becoming complicated to manage.
According to the WHO [1,2], the characteristics of the malaria vector can be identified: the mosquito-breeding location must have good-quality water sources, and the mosquitoes prefer hiding in dim places or forests. Therefore, a large provincial area can contain many natural resources, such as agricultural areas, forests and water sources, which are very suitable for the distribution of Anopheles mosquitoes. Moreover, there are places for Anopheles mosquitoes to breed if the number of good-quality shallow wells is large. Once the mosquitoes have matured, they prefer to stay in dark or dim places, such as bathrooms and kitchens. Thus, a community that has plenty of houses without electricity will be at risk. Considering the installation of electricity by the Provincial Electricity Authority (PEA) for electricity from the government, the PEA mainly uses water power from water resources around Thailand, such as irrigation canals, dams and reservoirs. Therefore, building irrigation canals can support good management of water circulation. Consequently, it can reduce the risk of malaria spread because the speed of water draining can be switched between fast and slow. Obstacles that prevent the mosquitoes from laying eggs, including shaking the laid eggs, have also been implemented.
In such a case, the value of electricity from the government varies with the number of water sources around Thailand. Consequently, the malaria vector is likely to spread. In addition, if some of the population in the risk area moves to a new location, the migrated persons act as another malaria vector. On the other hand, a person from a clean area who has moved to an infected area can be infected. Therefore, malaria is a disease that relates to human occupation, including the environment. Hence, it can directly influence human lives.

Conclusion
Malaria is a serious disease that can cause significant harm if patients do not receive proper care after infection. While the number of malaria patients is decreasing, the prevalence of drug-resistant malaria is increasing; this ultimately causes the spread of malaria. As a result of this research, many new vital factors are revealed in the form of an LMT risk model that determines the relations between independent factors and dependent variables, with 78% precision and 82.1% recall. Moreover, the output from the LMT computing model is presented. The effects of the independent factors were evaluated based on the characteristics of Anopheles mosquitoes and population migration. Based on the 28 newly discovered factors, the government sectors can set up policies and make efforts to control some factors according to the risk model to reduce the spread of malaria, such as expanding the areas with electricity and reducing the number of migrants by creating jobs for local people.