Boruta-grid-search least square support vector machine for NO2 pollution prediction using big data analytics and IoT emission sensors

Purpose – This paper seeks to assess the performance of BA-GS-LSSVM compared to popular standalone algorithms used to build NO2 prediction models. The purpose of this paper is to pre-process a relatively large dataset of NO2 readings from Internet of Things (IoT) sensors with time-corresponding weather and traffic data, and to use the data to develop NO2 prediction models using BA-GS-LSSVM and popular standalone algorithms to allow for a fair comparison.
Design/methodology/approach – This research installed and used data from 14 IoT emission sensors to develop machine learning predictive models for NO2 pollution concentration. The authors used big data analytics infrastructure to retrieve the large volume of data collected every 10 seconds over five months. Weather data from the UK Met Office and traffic data from the Department for Transport were collected and merged for the corresponding times and locations of the pollution sensors.
Findings – The results show that the hybrid BA-GS-LSSVM outperforms all other standalone machine learning predictive models for NO2 pollution.
Practical implications – This paper's hybrid model provides a basis for informed decisions in an NO2 pollutant avoidance system.
Originality/value – This research installed and used data from 14 IoT emission sensors to develop machine learning predictive models for NO2 pollution concentration.


Introduction
Air pollution, the release of pollutants into the air, remains one of the significant challenges in the UK and globally, with over 25,000 associated deaths recorded yearly in the UK [1] and around 8.8 million deaths recorded globally [2]. Apart from deaths, air pollution exposure can result in various short- and long-term health challenges [3,4]. Examples of short-term health challenges include eye pain, throat irritation, headaches, allergic reactions, and upper respiratory infections, while lung cancer, brain damage, liver damage, kidney damage, heart disease, respiratory disease, and suchlike are examples of long-term health challenges [5].
Aside from the severe impact of air pollutants on health, air pollution has significant consequences for the UK and the global economy. It costs the UK government approximately £40bn yearly [6] and around £3 trillion in economic costs globally [7]. Recent studies by the Centre for Research on Energy and Clean Air (CREA) link over 1.5 billion days of absence from work, over 3.5 million new cases of asthma, and approximately 2 million preterm births to air pollutants, leading to increased healthcare costs and decreased economic productivity.
Air pollutants are airborne substances, usually of two categories: particulate matter and gases. Of the gases, nitrogen dioxide (NO2) is arguably the most dangerous to human health [8]. The NO2 pollutant emanates from combustion processes such as vehicle emissions, and this was noted during the Covid-19 pandemic, with a 20% decrease in global NO2 concentration [9]. However, a recent finding hypothesized that NO2 will still exceed the Air Quality Index (AQI) limit by 2025 [3]. Therefore, the NO2 AQI estimate poses a responsibility on stakeholders and researchers to devise strategic means to curb exposure to this pollutant in the UK.
Arguably, predicting NO2 concentration is among the most efficient and effective ways to save lives from exposure to this deadly pollutant in different geographical locations. Furthermore, such predictions can help people avoid areas where NO2 concentration levels are high.

Related Work
Studies on NO2 prediction models have thus justifiably increased since the turn of the millennium. However, if such models are to be helpful to users vulnerable to pollution, e.g. coronavirus patients, their effectiveness depends on predictive performance. Poor performance can be misleading and could expose a user to a pollution hotspot, triggering life-threatening attacks.
The performance of a machine learning-built predictive model is, among other factors, vastly dependent on the machine learning algorithm used [10,11]. Several studies have thus compared some of the most popular algorithms (e.g. artificial neural network, support vector machine, and suchlike) in terms of their performance in predicting NO2 [12][13][14], with random forest and support vector machine usually performing better. However, despite clear evidence from the literature that hybrid algorithms perform better than standalone ones, they have not been widely employed in comparison studies [15,16].
One such hybrid is the optimal-hybrid artificial intelligence algorithm based on the least squares support vector machine optimized by grid search, whose features were selected using the Boruta algorithm (BA-GS-LSSVM). The least squares support vector machine (LSSVM) differs from the classical SVM through an improved objective function. LSSVM is widely used for classification and regression problems due to its higher predictive ability compared to classical SVM. Findings from research such as the prediction of gasoline prices [17] and wind speed forecasting [18] indicate that this model offers faster operation and better convergence accuracy. However, some shortcomings are associated with this algorithm's performance, including parameter optimization and feature selection. The BA-GS-LSSVM addresses these shortcomings.
Thus, this paper seeks to assess the performance of BA-GS-LSSVM compared to popular standalone algorithms used to build NO2 prediction models. The objectives are as follows: (1) to pre-process a relatively large dataset of NO2 readings from IoT sensors with time-corresponding weather and traffic data; (2) to use the data to develop NO2 prediction models using BA-GS-LSSVM and popular standalone algorithms to allow for a fair comparison.
It is imperative to describe the symbols used in this research work; Table 1 defines most of the symbols and their descriptions. The rest of this paper is organized as follows: section two presents a brief explanation of the source and volume of the data. Section three presents the feature selection technique used to select the valuable features for developing the hybrid model. Section four presents the hybrid model, detailing its theoretical/mathematical representation and how it differs from classical SVM. Lastly, section five describes the development of the BA-GS-LSSVM and other popular standalone machine learning algorithms for NO2 prediction, together with their performance assessment for comparison; the discussion of results and conclusions also form part of this final section.

Data description and big data analytics
Many UK cities, just like other cities of the world, suffer from air pollution. A significant contributor to air pollution is increasing traffic emissions [19]. Air pollution caused by traffic depends on the type of vehicle (diesel, gasoline, petrol, electric), level of congestion, time spent in the traffic jam, and the atmospheric/geographical features of the environment at a given time.
To monitor/reduce exposure to air pollution, most cities now deploy monitoring sensors for measuring traffic intensity, weather characteristics, and air quality of the environment. The data is collected at specified frequencies (seconds, minutes, hours, days, and suchlike) depending on the users' preference. For this project, a total of 14 Internet of Things (IoT) monitoring sensors for NO 2 and other pollutant concentrations represented as blue circles were deployed across Wolverhampton City in the UK (see Figure 1).
The sensors collected NO2 concentration and other harmful pollutant data every 10 s for five months (i.e. December 2019 to April 2020). Over ten billion (i.e. 10 × 6 × 60 × 60 × 24 × 30 × 5 × 14) data points were generated for this period, which was massive. The data were streamed through a middleware gateway deployed on AWS Elastic Beanstalk and dumped directly into an Amazon Web Services (AWS) Elastic Compute Cloud (EC2) relational database. We used the AWS EC2 infrastructure to run the big data analytics required for this study because of the extensive data. For the development of the BA-GS-LSSVM, the NO2 pollutant concentration from the 14 IoT monitoring sensors was the dependent variable, while weather, other pollutants (e.g. PMx and ozone) and traffic data were the independent variables. The traffic data was sourced from the UK's Department for Transport (DfT) and consisted mainly of vehicle counts, split into various vehicle types (see Table 2). The traffic data, covering the same period as the data from the sensors, were retrieved. The weather data for the same period was obtained from the UK Met Office and included various weather variables such as ambient pressure and humidity (see Table 2). Traffic and weather data were provided hourly, and each had over fifty thousand data points. To match the weather and traffic data with the pollutant data from the sensors, the hourly average of the pollutant concentration was matched to the corresponding hourly weather and traffic data, leading to (24 hrs × 30 days × 5 months × 14 IoTs) data points.
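The hourly matching described above can be sketched with pandas: the 10-second sensor readings are averaged to hourly resolution and joined to the hourly weather/traffic records on the timestamp. The column names and values below are hypothetical stand-ins, not the study's actual schema.

```python
import numpy as np
import pandas as pd

# Simulated 10-second NO2 readings from one IoT sensor (hypothetical schema).
ticks = pd.date_range("2019-12-01", periods=6 * 60 * 3, freq="10s")  # 3 hours
sensor = pd.DataFrame({"timestamp": ticks,
                       "no2": np.random.default_rng(0).uniform(10, 60, len(ticks))})

# Hourly weather/traffic records for the same window (hypothetical columns).
hours = pd.date_range("2019-12-01", periods=3, freq="h")
weather = pd.DataFrame({"timestamp": hours,
                        "temperature": [4.2, 4.0, 3.8],
                        "all_motor_vehicles": [120, 340, 510]})

# Average the 10-second readings to hourly resolution, then join on the hour.
hourly_no2 = (sensor.set_index("timestamp")["no2"]
                    .resample("h").mean().reset_index())
merged = hourly_no2.merge(weather, on="timestamp", how="inner")
print(merged.shape)  # (3, 4)
```

An inner join keeps only hours present in all sources, which mirrors the matching of sensor, weather and traffic records for corresponding times.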
The concentration of NO2 from December 2019 to April 2020 across the 14 installed sensors indicates some interesting trends, with some outliers around the end of 2019 (see Figure 2). These outliers at the end of the year are arguably due to Christmas shopping and other seasonal celebrations. In addition, the pollution concentration is arguably influenced by the national lockdown imposed across UK cities during the COVID-19 pandemic (see Figure 2).
[Table 2. Independent features after matching the three data sources; entries include link length in km and in miles, pedal cycles, two-wheeled motor vehicles, cars and taxis, buses and coaches, LGVs, rigid- and articulated-axle HGV classes, all HGVs, all motor vehicles, IoT zone ID, date, holiday, day of the week, and PM2.5 (mg/m³).]
Another interesting observation is the outliers discovered within some days of the week (see Figure 3). The boxplot suggests these are the days with the highest traffic volumes; hypothetically, these are the days when many people go out to bars, clubs and other gatherings at the end of the week.
After pre-processing was completed on the AWS big data infrastructure, the complete dataset was randomly split into training (60%) and testing (40%) sets to avoid biases and other shortcomings.

Feature selection
The predictive capability of many machine learning models depends on the dimensionality of the features; LSSVM is no exception [20]. Not all features impact the prediction, making feature/variable selection critical when developing/building machine learning predictive models. Dimensionality reduction has been proven to help predictive models perform better [21]. Among the reduction techniques, feature selection picks the most impactful features from the original set of features as the new input features. Since random forest (RF) has consistently proven in past studies, e.g. [21][22][23][24], to be very good at selecting the most impactful features, the wrapper Boruta algorithm (BA), built around RF, was implemented for feature selection in this study. BA uses the same strategy as the classical RF classifier model introduced by [25]. The BA is implemented using the following steps: (1) replicate all input features, i.e. weather and traffic features, and add the copies to form an extended information system (IS); (2) shuffle the copied "shadow" features to remove their correlation with the target; (3) run an RF classifier on the extended IS and compute feature importances; (4) confirm features whose importance exceeds the maximum importance among the shadow features and reject those that fall below it; (5) repeat until all features are confirmed or rejected. After applying the feature selection process, a total of 13 essential features were selected by the BA, namely timestamp, O3, all motor vehicles, humidity, ambient pressure, temperature, PM10, PM2.5, PM1, day of the week, and x, y, z, the 3D geocentric representation of the longitude and latitude. The timestamp arguably suggests some consistency in pollutant levels at specific times; for instance, the morning or end-of-day peak periods result in higher traffic. The 13 selected features were then used to develop an LSSVM predictive model, as discussed in the next section.
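A minimal, single-iteration sketch of the shadow-feature test at the heart of BA is shown below (the full algorithm iterates with statistical tests over many RF runs). The function name and toy data are illustrative, not the study's code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def shadow_feature_screen(X, y, names, n_trees=200, seed=0):
    """One pass of Boruta's core idea: add shuffled copies of every feature
    ('shadows'), fit a random forest on the extended set, and keep a real
    feature only if its importance beats the best shadow feature."""
    rng = np.random.default_rng(seed)
    shadows = X.copy()
    for j in range(shadows.shape[1]):
        rng.shuffle(shadows[:, j])               # break the link with the target
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    rf.fit(np.hstack([X, shadows]), y)
    imp = rf.feature_importances_
    real, shadow = imp[:X.shape[1]], imp[X.shape[1]:]
    return [n for n, i in zip(names, real) if i > shadow.max()]

# Toy data: y depends only on the first feature; the rest are noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=300)
print(shadow_feature_screen(X, y, ["f0", "f1", "f2", "f3"]))
```

Because a single pass is noisy, the real BA repeats this comparison many times and confirms or rejects features only when the result is statistically significant.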
Least square support vector machine
LSSVM was proposed as an improvement to SVM [15]. LSSVM provides a linear-equation solution with an improvement to the objective function of classical SVM.
We use $x_k$ as the 13 features selected with BA and $y_k$ as the NO2 concentration. Then the improved SVM model can be mathematically written as

$$y(x) = \omega^T \phi(x) + b$$

where $\phi(x)$ = nonlinear mapping function, $\omega$ = weight vector, and $b$ = bias.
The equation can be expressed as the optimization problem

$$\min_{\omega, b, e} J(\omega, e) = \frac{1}{2}\, \omega^T \omega + \frac{\gamma}{2} \sum_{k=1}^{N} e_k^2$$

subject to

$$y_k = \omega^T \phi(x_k) + b + e_k, \quad k = 1, \dots, N$$

where $\gamma$ = regularisation parameter and $e_k$ = error term.

The model can be optimized using the Lagrange function as follows:

$$L(\omega, b, e, \alpha) = J(\omega, e) - \sum_{k=1}^{N} \alpha_k \left( \omega^T \phi(x_k) + b + e_k - y_k \right)$$

where $\alpha_k \in \mathbb{R}$ is the Lagrange multiplier. The Karush-Kuhn-Tucker (KKT) conditions are

$$\frac{\partial L}{\partial \omega} = 0 \;\Rightarrow\; \omega = \sum_{k=1}^{N} \alpha_k \phi(x_k), \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{k=1}^{N} \alpha_k = 0,$$
$$\frac{\partial L}{\partial e_k} = 0 \;\Rightarrow\; \alpha_k = \gamma e_k, \qquad \frac{\partial L}{\partial \alpha_k} = 0 \;\Rightarrow\; \omega^T \phi(x_k) + b + e_k - y_k = 0.$$

After eliminating the variables $\omega$ and $e_k$, the optimization problem can be transformed into the linear system (Eqn 6)

$$\begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}, \qquad \Omega_{ij} = K(x_i, x_j).$$
The final equation of the LSSVM model is

$$y(x) = \sum_{k=1}^{N} \alpha_k K(x, x_k) + b$$

where $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel function. The radial basis function (RBF) kernel was used in this research, mathematically expressed as

$$K(x, x_k) = \exp\!\left( -\frac{\lVert x - x_k \rVert^2}{2 \sigma^2} \right).$$
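The closed-form training described above can be sketched in a few lines of NumPy. This is an illustrative implementation of the linear system and RBF kernel defined in this section, not the authors' code; the toy one-dimensional fit at the end is purely for demonstration.

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma2))

def lssvm_fit(X, y, gamma=1000.0, sigma2=10.0):
    """Solve the LSSVM system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate([[0.0], y]))
    return sol[0], sol[1:]                      # bias b, multipliers alpha

def lssvm_predict(X_train, b, alpha, X_new, sigma2=10.0):
    """y(x) = sum_k alpha_k * K(x, x_k) + b."""
    return rbf_kernel(X_new, X_train, sigma2) @ alpha + b

# Toy check: fit a smooth one-dimensional function.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0])
b, alpha = lssvm_fit(X, y)                      # defaults match the paper's gamma, sigma^2
pred = lssvm_predict(X, b, alpha, X)
print(float(np.abs(pred - y).mean()))           # small training error
```

Note that training reduces to solving one dense linear system, which is the source of LSSVM's lower computational cost relative to the quadratic program of classical SVM.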

Model development process and performance measures
Related works on the development of predictive models (e.g. [26][27][28][29]) identified random forest (RF), support vector machine (SVM), decision tree (DT), XGBoost (XGB), AdaBoost, artificial neural network (ANN) and linear regression (LR) as powerful machine learning algorithms for prediction. These popular algorithms were developed and compared with BA-GS-LSSVM. Given that feature selection may not be entirely favourable to some algorithms [11,20], we developed the predictive models for each algorithm in two ways to allow a fairer comparison. The first was to develop the models using all the available variables before the feature selection process; the results were recorded and compared (see section 5 on results). The second was to develop the models using the 13 features selected with the Boruta algorithm; these results were also recorded and compared (see section 5 on results). Finally, the best result for each algorithm (whether from the first or second development) was compared to determine the best algorithm overall. As no dedicated Python package exists for LSSVM regression, the model was implemented with the help of the scikit-learn package in Python. Figure 4 presents the flow chart and overall procedure used to build the hybrid GS-LSSVM predictive model to predict the concentration of NO2.
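The grid-search step of GS-LSSVM can be illustrated with scikit-learn's GridSearchCV. Since scikit-learn has no LSSVM estimator, kernel ridge regression, which shares LSSVM's squared-error loss, stands in here; `alpha` plays the role of $1/\gamma$ and scikit-learn's RBF `gamma` equals $1/(2\sigma^2)$. The data are a toy stand-in, not the study's dataset.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Toy regression data standing in for the merged NO2 feature set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Grid over regularisation strength and RBF kernel width.
param_grid = {"alpha": [1e-3, 1e-2, 1e-1],
              "gamma": [1 / (2 * s2) for s2 in (0.1, 1.0, 10.0)]}
search = GridSearchCV(KernelRidge(kernel="rbf"), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)
print(search.best_params_)
```

Each grid point is evaluated by cross-validated error, and the parameter pair with the lowest error is retained, mirroring how the paper selects the LSSVM hyperparameters.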
To determine the predictive capability of a regression machine learning model, various metrics measure model loss and score. Among these, four metrics, namely the mean absolute error (MAE), mean squared error (MSE), explained variance score (EVS) and R-squared (R²), were used in this paper because of their popularity; each is briefly described below.
MAE is the average absolute variation (error) between each actual observation and the corresponding predicted value. It is a risk metric corresponding to the expected value of the absolute error loss. The best possible score is 0.00; the higher the MAE, the worse the predictive model's performance. The MAE can be mathematically written as

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|.$$

MSE is another risk metric that corresponds to the average of the squared errors between the predicted and actual values of the target variable. It is also referred to as the mean squared deviation. MSE is non-negative, and values closer to zero signify a better predictive model. The mathematical definition of MSE is as follows:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2.$$
Unlike the risk metrics (i.e. MAE, MSE), the explained variance score and R-squared depict a better regression model as the score approaches 1.0 rather than zero. The EVS measures how much of the dispersion (variance) of the test data the model accounts for. The best possible score of EVS is 1.0, and it is mathematically written as

$$\mathrm{EVS} = 1 - \frac{\mathrm{Var}(y - \hat{y})}{\mathrm{Var}(y)}.$$
Lastly, R-squared, referred to as the coefficient of determination, is the proportion of the variance in the target variable explained by the feature(s). It indicates the goodness of fit and helps measure how well unseen data are likely to be predicted by the model. The best possible value for R-squared is 1.0. It is mathematically given as

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}, \qquad \text{where } \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.$$
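All four metrics are available in scikit-learn; a quick check on hypothetical observed and predicted values illustrates their use.

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical NO2 observations
y_pred = np.array([2.5, 5.0, 3.0, 8.0])   # hypothetical model predictions

print(mean_absolute_error(y_true, y_pred))       # 0.5
print(mean_squared_error(y_true, y_pred))        # 0.375
print(explained_variance_score(y_true, y_pred))  # ~0.90
print(r2_score(y_true, y_pred))                  # ~0.88
```

EVS and R² differ only in that EVS ignores systematic bias in the residuals, which is why the two scores diverge slightly here.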

Discussion of results
In this study, an optimal hybrid artificial intelligence algorithm, the least squares support vector machine optimized by grid search with features selected using the Boruta algorithm (BA-GS-LSSVM), was developed to predict NO2 pollutant concentration. We identified the most optimal values for the parametric functions of LSSVM as γ = 1000 and σ² = 10 using grid search.
The models were all developed from the union of the three data sources, including the weather, traffic and IoT data, on a big data platform, considering the many data points recorded (i.e. 24 hrs × 30 days × 5 months × 14 IoTs). The data were merged and matched for the development of the predictive models. We then compared the performance of the proposed hybrid model and other powerful standalone machine learning models in predicting NO2, in two streams to achieve a fair, unbiased comparison: (1) all models implemented without feature selection; (2) all models implemented with feature selection. Table 3 presents the results. As shown in Figure 5, the error measures, including the MAE and MSE for all the developed models, are presented in decreasing order. The ordering shows AdaBoost (AB) to have the maximum error, followed by the LSSVM and linear regression (LR) implemented without feature selection. Meanwhile, GS-LSSVM with feature selection, i.e. BA-GS-LSSVM, has the smallest error score. This demonstrates the higher performance of the hybrid model over the standalone models.
In addition, the doughnut chart shows the R-squared score for all the developed models; the maximum score, i.e. 6.35% (0.82), was achieved by BA-GS-LSSVM. The assertion of bias caused by omitting feature selection was borne out in this paper; for instance, the first approach (i.e. model implementation without feature selection) shows poor model performance.
In addition, the models developed with feature selection were subjected to 10-fold cross-validation to ensure efficient and unbiased evaluation. Figure 6 presents a box-and-whisker plot showing the spread of the different performance metrics across the cross-validation folds for each algorithm. From these results, BA-GS-LSSVM is identified as the best, considering its minimal error metric scores (i.e. lowest MAE and MSE) compared to the other ML models developed, while also scoring the highest EVS and R-squared. Thus, our model conclusively performs best compared to all other standard and powerful standalone ML models developed in this paper.
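The 10-fold cross-validation protocol can be sketched with scikit-learn's `cross_validate`; the data and estimator below are toy stand-ins for the study's dataset and models.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Toy stand-in for the post-feature-selection dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

# 10 shuffled folds; collect per-fold error and score metrics.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(RandomForestRegressor(n_estimators=100, random_state=0),
                        X, y, cv=cv,
                        scoring=("neg_mean_absolute_error", "r2"))
print(scores["test_r2"].mean())   # fold-averaged R-squared
```

The per-fold arrays in `scores` are exactly what a box-and-whisker plot of the kind shown in Figure 6 would summarize.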
The use of the big-data platform reduced the computational complexity for most of the models implemented. Also, the lower computational complexity of the LSSVM over the SVM is another outstanding advantage recognized in this research.

Conclusions
High-precision NO2 prediction is critical to people's well-being, especially those vulnerable to air pollution. The BA-GS-LSSVM model presented in this paper proves to be better than popular standalone algorithms. To demonstrate its advantages, nine different algorithms were compared. From the study, the following inferences can be drawn: (1) Boruta, a dimensionality reduction technique, improves the performance of the ML models