Travel time forecasting on a freeway corridor: a dynamic information fusion model based on the random forests approach

Purpose – Metropolitan areas suffer from frequent road traffic congestion not only during peak hours but also during off-peak periods. Different machine learning methods have been used in travel time prediction; however, such methods often face the problem of overfitting in practice. Tree-based ensembles have been applied in various prediction fields, and such approaches usually produce high prediction accuracy by aggregating and averaging individual decision trees. These approaches not only yield better prediction results but also offer a good bias-variance trade-off, which can help to avoid overfitting. However, the application of tree-based ensemble algorithms in traffic prediction is still limited. This study aims to improve the accuracy and interpretability of the models by using random forest (RF) to analyze and model the travel time on freeways.
Design/methodology/approach – As traffic conditions often change greatly, prediction results are often unsatisfactory. To improve the accuracy of short-term travel time prediction in the freeway network, a practically feasible and computationally efficient RF prediction method for real-world freeways was developed using probe traffic data. In addition, the variables' relative importance was ranked, which provides an investigation platform to gain a better understanding of how different contributing factors might affect travel time on freeways.
Findings – The parameters of the RF model were estimated by using the training sample set. After the parameter tuning process was completed, the proposed RF model was developed. The features' relative importance showed that the travel time 15 min before and the time of day (TOD) contribute the most to the predicted travel time result. The model performance was also evaluated and compared against the extreme gradient boosting method, and the results indicated that the RF consistently produced more accurate travel time predictions.
Originality/value – This research developed an RF method to predict the freeway travel time by using the probe vehicle-based traffic data and weather data. Detailed information about the input variables and data is provided, and the mean absolute percentage errors were computed for different observation segments combined with different prediction horizons ranging from 15 to 60 min.


Introduction
Nowadays, travel time prediction plays a significant role as it can greatly help route planning and also the development of countermeasures to reduce traffic congestion. Metropolitan areas are adversely affected by frequent road traffic congestion not only in peak hours but also in off-peak periods. Therefore, the capability to forecast traffic conditions, particularly travel times, is of utmost importance in traffic management applications aimed at relieving negative social, environmental and economic impacts for people. The definition of travel time is the total time for a vehicle to travel from one point to another over a specified route (Zhu et al., 2009). Travel time has been widely used to measure the effectiveness of transportation systems and increasingly becomes one of the most popular traffic information that travelers are interested in gathering. The ability to accurately predict travel time in transportation networks is a critical component of the traveler information system. Accurate travel time prediction can enhance the performance of the traffic management systems, in which travelers are given the opportunities to react to the traffic proactively (Oh et al., 2015). Furthermore, as an important performance indicator, accurate predicted travel times can be used for quantitatively comparing different traffic management systems. Nowadays, with the explosive availability of abundant data collected by sensors and monitors, the big data storage and processing issues have become more and more relevant (Šemanjski, 2015).
In travel time prediction, a reliable prediction method needs to achieve the following three objectives: accuracy, robustness and adaptability (Van Lint, 2006). Traditional data-based models (e.g. linear regression and time series) have been widely applied to predict travel times based on historical data. However, in terms of effectiveness, accuracy and feasibility, these models may have become outdated and replaceable. Recently, different machine learning approaches (such as neural networks, ensemble learning and support vector machines) have been used by different researchers, and the results indicate that such approaches are adaptable and can perform better than traditional models. Therefore, a machine learning-based approach is selected for the travel time prediction in this study. The purpose of this study is to propose an approach to systematically analyze the relationship between travel time and various traffic features. In that regard, a machine learning-based approach, the random forest (RF) model, is used to predict the freeway travel time. The proposed approach is tested on a freeway corridor in Charlotte, North Carolina, using probe vehicle-based traffic data. The advantages and disadvantages of the proposed model are identified, and its effectiveness and efficiency are evaluated.

Literature review
Transportation researchers and data scientists have developed various techniques in the past three decades to provide more reliable future travel time estimation methods (Oh et al., 2015). Generally speaking, such techniques can be classified into three groups: naive methods, traffic theory-based methods and data-driven methods. As the name indicates, the naive prediction models are very simple methods, which typically do not involve the estimation of model parameters. As the model assumptions are usually restrictive, they are not actually fulfilled in many situations (Wunderlich et al., 2000). Among the traffic theory-based methods, traffic flow simulation and user-optimal dynamic traffic assignment have been widely used in freeway travel time prediction. Examples include Papageorgiou et al. (2010) and Dion et al. (2004). In data-based travel time prediction models, the function that relates traffic factors to the prediction result (dependent variable) is not obtained from predetermined traffic theory; instead, the relationships between variables are derived from the sample data itself by using statistical data mining methods. This approach greatly expands the pool of researchers who can participate in travel time prediction because they no longer have to become experts in traffic theory. However, such data-based methods usually need a lot of data, which is not always available. The data-based models are strongly subject to data availability and accessibility (Van Lint, 2006).
In general, the data-based models can be divided into two categories, which are parametric and non-parametric models. In the parametric models, the parameters can be estimated to define the function, which are predefined and set in a finite-dimensional space. The most widely applied parametric model is linear regression, where the dependent variable is always a linear function of the explanatory input variables. Generally, the independent temporal variables are traffic observations in several past time intervals. The second type of parametric model is the Bayesian net, which assumes that the explanatory variables are always conditionally independent given the dependent variable. The third group of the parametric models in modeling travel time is time series models, of which the most widely used one is the autoregressive integrated moving average model.
In the non-parametric models, the structure of the model is not predefined and the intrinsically complex relationships cannot be expressed by simple functions. Furthermore, the term non-parametric does not mean that there are no parameters to be estimated, but on the contrary, it means that the number and typology of the parameters are unknown a priori and possibly infinite depending on the sample data set (Mori et al., 2015). With the rapid development of data science, the methodologies for non-parametric estimation are also being quickly updated. Along this line, the most widely seen in the literature of travel time prediction is the artificial neural networks (ANN). ANN models are widely used in transportation because of their ability to capture complex relationships in large data sets (Dharia and Adeli, 2003). Unlike multivariable models, ANN models are developed without a predetermined form of function, whereas they can overcome multicollinearity problems. Different types of neural networks have been applied in travel time prediction, from regular multilayer feedforward neural networks (Yildirimoglu and Ozbay, 2012) to more complex spectral basis neural networks (Park et al., 1999). Another choice for travel time prediction is using support vector machine (SVM) methods. This advanced algorithm consists of decision function, the application of the kernel functions and the sparsity of solutions. The SVM models have a good performance on travel time prediction with historical travel time data. Some researchers (Yildirimoglu and Geroliminis, 2013;Wu et al., 2004) used SVM methods to estimate travel times. In the calculation process, the kernel function can map the input data into a higher-dimensional space. In the model generating process, the flattest linear function is identified which relates to the transferred input vectors into the target variables. 
Travel time prediction is then based on this flattest linear function, mapped back into the initial space. Both the ANN and SVM models tend to overfit because of their complicated structures and the large number of parameters that need to be calibrated, which is a serious problem common to non-parametric machine learning algorithms.
The local regression approach is another non-parametric approach that can produce accurate and reliable results. The main idea of local regression is to choose a set of historical data points that have similar properties to the current situation and to predict the travel time using a model constructed from these chosen data points. Various local regression models can be used depending on the type of methods used to select the set of similar historical points and depending on the methodology chosen to fit the model (Mori et al., 2015).
There have also been some semi-parametric models developed, as a combination of parametric and non-parametric methods, in travel time prediction. Some of the strict assumptions of the parametric model are loosened to obtain a more flexible structure (Ruppert et al., 2003). In the application of travel time prediction, semi-parametric models are presented as varying coefficient regression models. The prediction result (travel time) was defined as a linear function of the naive historical and instantaneous predictors; however, the parameters vary depending on the departure time interval and prediction horizon (Schmitt and Jula, 2007).
In summary, with the wide applications of big data in the field of transportation, different machine learning approaches have been deployed in the travel time prediction area. The methodologies include, but are not limited to, the following: SVM regression, neural network approaches (e.g. state-and-space neural network, long short-term memory neural network), nearest neighbor (e.g. k-nearest neighbor) and ensemble learning (e.g. RF and gradient boosting), etc. Table 1 provides a summary of the studies reviewed in chronological order.
3. Data collection and processing
3.1 Data collection
3.1.1 Travel time data. In this study, the raw travel time data are gathered from the regional integrated transportation information system (RITIS), an advanced traffic system that includes segment analysis, probe data analytics and signal analytics. A series of major freeway segments in Charlotte, North Carolina are selected for the case study. As one of the most heavily traveled interstate freeways in the City of Charlotte area, I-485 is an interstate highway loop encircling the city, whose last segment was completed on June 5, 2015. The Charlotte metropolitan area has been growing: in the past 25 years, the Charlotte area population has increased from 688,000 to 1.4 million, and more than 500,000 additional residents are anticipated over the next 20 years. In 2018 alone, there was over $1bn in capital investment in the region. One result of this growth is increased traffic congestion. The I-485 freeway segments in the vicinity of the southern Charlotte area experience massive traffic congestion during weekdays due to heavy commuter and interstate traffic. As the recurrent congestion seriously affects travel and further economic development in this area, the I-485 Express Lanes project will add one express lane in each direction along I-485 between I-77 and US 74 (Independence Boulevard), resulting in a seamless network of express lanes in southern Mecklenburg county that could improve travel time reliability and traffic flow in this critical transportation corridor. The project will also add one general-purpose lane in each direction along I-485 between Rea Road and Providence Road. The estimated cost is $346m; construction began in summer 2019, with a completion date of 2022.
In the RITIS system, the selected section of I-485 Southern loop starts from the interchange with I-77 (Exit 67) and ends at the interchange with US-74 (Exit 51). The directions include clockwise and counter-clockwise and 37 miles of roadways and 32 traffic message channel (TMC) code segments are selected in this study. All the selected segments have uninterrupted coverage in the RITIS data 24 h per day and 365 days a year. The data set is collected from January 1, 2019 to December 1, 2019, and the interval is 15 min. An example of the raw time data used in this study is shown in Table 2 below.
3.1.2 Weather data collection. The historical weather data are also collected at locations that are close to the Charlotte Douglas International airport. The raw weather data includes information on different categories such as temperature, dew point, humidity, pressure, visibility, wind direction, wind speed, gust speed, precipitation and conditions. The raw weather data were recorded on a per hour basis, and as such, the discrepancy in the time intervals was treated by a mapping methodology to combine the traffic data with the weather data. An example of the raw weather data used in this study is shown in Table 3 below.

Data processing
Based on previous studies, it was revealed that travel speed is much more sensitive to severe weather events. The weather conditions in the Charlotte area were originally classified into 30 detailed weather conditions. However, in this study, the weather conditions are further categorized into only three groups including normal, rain and snow/fog/ice. Table 4 presents the detailed classification of the newly grouped weather conditions. To keep the sample size to the extent that is acceptable, "snow," "fog," "ice pellet" and other similar conditions are combined because of their rates of occurrence.

To merge the link travel time data set with the historical weather data set, the different intervals of the two data sets must be reconciled first. The RITIS data are aggregated into 15 min intervals, while the weather data are aggregated into 1 h intervals. Therefore, based on the timestamp, each hourly weather record is applied evenly to the 15 min RITIS records within that hour.
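The timestamp-based mapping can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each 15 min traffic record takes the weather group of the hour it falls in, and the function name `attach_weather`, the record layout and the sample values are all hypothetical.

```python
from datetime import datetime


def attach_weather(traffic_records, hourly_weather):
    """Attach an hourly weather group to each 15 min traffic record.

    traffic_records: list of (timestamp, travel_time) tuples (hypothetical layout).
    hourly_weather: dict mapping the start of each hour to a weather group.
    Records whose hour has no weather entry receive None.
    """
    merged = []
    for ts, travel_time in traffic_records:
        # Floor the 15 min timestamp to the start of its hour.
        hour_start = ts.replace(minute=0, second=0, microsecond=0)
        merged.append((ts, travel_time, hourly_weather.get(hour_start)))
    return merged


# Illustrative values only; they are not taken from the RITIS data.
traffic = [
    (datetime(2019, 1, 1, 8, 0), 62.0),
    (datetime(2019, 1, 1, 8, 15), 64.5),
    (datetime(2019, 1, 1, 8, 30), 70.1),
    (datetime(2019, 1, 1, 8, 45), 66.3),
    (datetime(2019, 1, 1, 9, 0), 61.0),
]
weather = {
    datetime(2019, 1, 1, 8, 0): "rain",
    datetime(2019, 1, 1, 9, 0): "normal",
}
merged = attach_weather(traffic, weather)
```

All four records between 08:00 and 08:45 receive the 08:00 weather group, which is one reading of distributing the hourly conditions evenly over the 15 min intervals.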

Travel time prediction methodology
4.1 Ensemble learning methodology
An ensemble is itself a supervised learning algorithm, which can be trained and used to make predictions. Ensemble learning-based algorithms consist of multiple base models (e.g. decision tree models), each of which provides an alternative solution to the problem. The prediction results tend to be more accurate when there is strong diversity among the models (Kuncheva and Whitaker, 2003). Decision trees often suffer from high variance, which causes instability in the prediction results. Bootstrap aggregating (bagging) is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms. In the bagging process, the algorithm builds multiple models from bootstrap samples of the same original data set to reduce the variance. However, bagging can make the trees highly correlated. RF is an extension of bagging in that, in addition to building trees based on multiple samples of the original training data, it also constrains the features that can be used to build the trees, forcing the trees to be different. To date, RF models have been widely applied in various research fields (Greenhalgh and Mirmehdi, 2012; Jia et al., 2016). For classification tasks, RF typically gives high accuracy while also having a fast classification time. An RF model requires training with large data sets, which are available in this study because of the nature of the travel record data collected. Furthermore, the RF computational process runs efficiently on large data sets, which can reduce model complexity, mitigate overfitting to some extent and improve efficiency. Overfitting means that the estimated model fits the training data too closely. Generally, this happens when the model function is so complicated that it accommodates every data point, including outliers. The RF method builds a large number of random trees and then combines the results from the individual trees. The benefit of using RF methods is that, through averaging, the variance can be reduced.
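The variance-reduction argument can be made concrete with a standard result. It is stated here as a sketch under simplifying assumptions that are not part of the original study: the $B$ tree predictions $\hat{f}_b(x)$ are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$ (all three symbols are introduced here for illustration).

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right)
  = \frac{1}{B^2}\left[B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\right]
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

As $B$ grows, the second term vanishes and the remaining variance is governed by $\rho$. This is why bagging alone is limited when the trees are highly correlated, and why restricting the features available to each tree, which lowers $\rho$, can reduce the variance further.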

Random forest algorithm
RF is an algorithm that can compete with gradient boosted trees in ensemble learning, especially because of its convenient parallel training, which is very attractive in the era of big data and large samples. RF is an ensemble tool that takes subsets of features to build multiple decision trees. Before explaining RF, one needs to mention decision trees. A decision tree is a very simple supervised learning algorithm based on if-then-else rules. It is highly interpretable and in line with human intuitive thinking. For each separate decision tree, the feature selection is conducted randomly, which keeps the correlation between different decision trees low. This low correlation between models is the key: weakly correlated models can produce ensemble predictions that are more accurate than any of the individual predictions. RF is thus an ensemble algorithm composed of decision trees, whose final prediction can be better than that of any single tree. The RF algorithm procedure consists of the following main steps:
Step 1: Randomly draw the samples from a given data set.
Step 2: Construct a decision tree for each sample and predict the result.
Step 3: Voting will be performed from the independent prediction results.
Step 4: Select the popularly voted result for the classification problem or average the results for regression.
Figure 1 shows the prediction process of the RF algorithm, which is described as follows:
The number of training data points is N and the number of variables in the classifier is M;
Select the m variables in the whole variable set M to determine the decision at a node of the tree (note that m is always considerably smaller than M);
To construct the forest, choose a training set k times with replacement from all N training data points. Each of these data sets is called a bootstrap data set. The number k is the number of trees to be trained;
For each tree node, randomly choose m variables on which to make the decision at that node. Calculate the best split based on these m variables in the training set; and
The Gini index is used to determine the best split point; it describes the purity after the split. The Gini index falls between 0 and 1, and the smaller the value, the better the split. If a data set T contains elements from n classes, the Gini index is defined as follows:
$$\mathrm{Gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$$
where $p_j$ is the relative proportion of class j in the data set T and n is the number of classes in data set T.
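The Gini-based split selection described above can be sketched in a few lines. This is an illustrative example rather than the study's code: the helper names `gini` and `best_split`, the class labels and the sample values are hypothetical, and the weighting of the two child nodes by their sizes is the usual convention, assumed here. Regression trees commonly use a variance-based criterion instead; the sketch follows the Gini description in the text.

```python
from collections import Counter


def gini(labels):
    """Gini index 1 - sum_j p_j^2 for a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Find the threshold on one feature that minimizes the size-weighted
    Gini index of the two child nodes. Returns (threshold, weighted_gini)."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical example: travel times (min) labeled by traffic state.
times = [5, 7, 9, 20, 22, 25]
states = ["free", "free", "free", "congested", "congested", "congested"]
threshold, score = best_split(times, states)
```

In an RF, a search of this kind is run at each node over only the m randomly chosen variables, and the variable and threshold with the lowest score define the split.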
5. Proposed travel time prediction approaches
5.1 Feature selection and pre-processing steps
In the prediction model, the southern part of the I-485 freeway is divided into 32 sections by the recorded sensor segments. Traffic data on each segment (from sensor to sensor) contain information on the subject segment and adjacent segment travel times, day of week (DOW), time of day (TOD), segment length and space mean speed. The RITIS real-world travel time data used for this study have a missing rate of less than 0.5% (i.e. 4,246 out of 981,083). Note that in this study, each missing value is simply replaced with the mean of its closest surrounding values. According to previous studies (Wang et al., 2018), the variables that have a significant impact on travel time prediction include the basic variables (such as TOD, DOW, month and weather) and the spatial and temporal characteristics of the adjacent road segments. Furthermore, in this study, the travel times collected several steps ahead of the travel time to be predicted are also accounted for in the model estimation. The prediction model is developed under normal traffic conditions and does not consider unexpected conditions (e.g. special events). The data on each segment are used to train one forest, which consists of decision trees. The RF model prediction includes two major steps: training and prediction. The forests are constructed by using randomly selected parameter combinations and different numbers of trees during the training step. The selected variables include the temporal features, such as the travel time at the prediction segment 15, 30 and 45 min before, defined as $T_{t-1}$, $T_{t-2}$ and $T_{t-3}$, respectively, and the travel time at the prediction segment exactly one week before, defined as $T_{t-w}$. TOD and DOW are also included as important temporal features. The spatial features include the road segment identification (ID) and the segment length. In the data preparation, the temporal-spatial features are also generated, including the travel time of the nearest downstream and upstream road segments 15 min before, which are defined as $T^{i+1}_{t-1}$ and $T^{i-1}_{t-1}$, respectively. The detailed information and definition of the selected variables can be seen in Table 7.
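The two pre-processing steps described in this subsection, filling missing values and building lagged travel time features, can be sketched as follows. This is a hedged illustration, not the authors' code: the function names, the use of `None` for a missing value and the interpretation of "closest surrounding values" as the nearest observed value on each side are assumptions. With 15 min intervals, one week corresponds to 7 × 24 × 4 = 672 steps.

```python
STEPS_PER_WEEK = 7 * 24 * 4  # 672 intervals of 15 min in one week


def fill_missing(series):
    """Replace each None with the mean of the nearest observed value on
    each side (or the single nearest one at the ends of the series)."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = next((series[j] for j in range(i - 1, -1, -1)
                     if series[j] is not None), None)
        right = next((series[j] for j in range(i + 1, len(series))
                      if series[j] is not None), None)
        neighbors = [x for x in (left, right) if x is not None]
        if not neighbors:
            raise ValueError("series contains no observed values")
        filled[i] = sum(neighbors) / len(neighbors)
    return filled


def lag_features(travel_times, t, week=STEPS_PER_WEEK):
    """Temporal features used to predict travel_times[t] on one segment."""
    if t < week:
        raise ValueError("need at least one week of history before index t")
    return {
        "T_t-1": travel_times[t - 1],     # 15 min before
        "T_t-2": travel_times[t - 2],     # 30 min before
        "T_t-3": travel_times[t - 3],     # 45 min before
        "T_t-w": travel_times[t - week],  # exactly one week before
    }


# Hypothetical usage on a toy series.
clean = fill_missing([10.0, None, 14.0])
features = lag_features(list(range(700)), 680)
```

The spatial-temporal features $T^{i+1}_{t-1}$ and $T^{i-1}_{t-1}$ would be built the same way by reading index t - 1 from the series of the neighboring segments.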

Model development
To achieve the best modeling results, it is important to explore the effect of different combinations of parameters on the RF model prediction performance. Based on previous studies, there are primarily three parameters that can be tuned to optimize the predictive power of the model: the maximum number of features (Max_features), the number of trees (N_estimators) and the minimum leaf size (Min_sample_leaf). They are presented as follows:
5.2.1 Max_features. This is the maximum number of features that the RF model is allowed to try in each individual tree. There are multiple options available in Python to assign maximum features. "Auto/None" simply takes all the features in every tree, which puts no restrictions on the individual tree. The "SQRT" option takes the square root of the total number of features in each individual run. For example, if the total number of variables is 100, under this option the system can only take 10 of them in each individual tree. The "log2" option is another similar type of option used for max_features. In this study, after several tests, the random subspace method is applied. The number of features considered at each internal node of the RF is m, which is chosen as $m = \mathrm{INT}(\log_2 M + 1)$, where M is the total number of features, as suggested by Breiman (2001a, 2001b).
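As a quick check of how these options differ, the following sketch computes the number of candidate features for a given M. The function name `max_features_options` is hypothetical, and truncating to an integer is an assumption made to mirror the INT(·) in the rule above; library implementations may round differently.

```python
import math


def max_features_options(M):
    """Number of features tried per split under each option, for M features."""
    return {
        "all": M,                             # no restriction
        "sqrt": int(math.sqrt(M)),            # square-root rule
        "log2": int(math.log2(M)),            # base-2 logarithm rule
        "breiman": int(math.log2(M) + 1),     # m = INT(log2 M + 1)
    }


options = max_features_options(100)
```

For M = 100 the square-root rule gives 10 features, as in the example in the text, while the rule m = INT(log2 M + 1) gives 7, so the latter is the more restrictive choice for large feature sets.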

5.2.2 N_estimators. This is the number of trees that one wants to build before taking the majority vote or the average of the predictions. A larger number of trees generally gives better performance at the cost of computing efficiency. As such, one should choose a value as high as the processor can handle, because this makes the predictions stronger and more stable.

5.2.3 Min_sample_leaf. This is the minimum leaf size. A leaf is an end node of a decision tree, and the leaf size is the number of cases or observations in that leaf. A smaller leaf makes the model more prone to capturing noise in the training data. To optimize the RF model, it is important to estimate the effect of different combinations of parameters on the model's performance. In this study, the tool RandomSearch is applied to optimize the tuning process to achieve a lower prediction error. After several trials of different min_sample_leaf values, a minimum leaf size of 30 is chosen. With 50 as the number of trees and 30 as the minimum leaf size, the mean absolute percentage error (MAPE) reaches its lowest value of 5.97%. This process is shown in Figure 2, and Table 5 demonstrates how the performance varies with different combinations of parameters (i.e. the number of trees and the minimum leaf size). RF models are not sensitive to whether the features are independent or dependent, though many will perform better if the data are preprocessed. A simple way to identify dependence among features is to calculate a correlation coefficient between each feature and all other features. Figure 2 and Table 5 clearly show that once the number of trees reaches 50, the MAPE remains nearly the same. In statistics, overfitting is the production of an analysis that corresponds too closely or exactly to the sample set of data and may, therefore, fail to fit additional data or predict future observations reliably, which is a general problem of traditional ensemble learning methods. For example, the prediction error usually increases when the number of trees increases beyond the optimized point in the tree-based model (Zhang and Haghani, 2015). There is also a need to consider the tradeoff between prediction accuracy and computational time, because when a large number of trees are fitted, model complexity increases and more computational time is required. The "randomness" in an RF means two things: n training samples are randomly extracted from the training set, and subsets of m features are randomly drawn from the M features. The introduction of such randomness is very important to the performance of RF. Because of it, the RF is not prone to overfitting and is very noise-resistant (i.e. insensitive to missing values).
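A random search over the number of trees and the minimum leaf size could look like the sketch below. It is not the authors' setup: the text names only a "RandomSearch" tool, so the use of scikit-learn's `RandomizedSearchCV`, the candidate values, the three-fold cross-validation and the synthetic stand-in data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 400 records, 5 features,
# strictly positive target so that MAPE is well defined.
X = rng.uniform(0.0, 1.0, size=(400, 5))
y = 60.0 + 30.0 * X[:, 0] + 10.0 * X[:, 1] + rng.normal(0.0, 2.0, size=400)

# Candidate values are illustrative, not the grid used in the study.
param_distributions = {
    "n_estimators": [10, 20, 50, 100],
    "min_samples_leaf": [5, 10, 30, 50],
}

search = RandomizedSearchCV(
    RandomForestRegressor(max_features="log2", random_state=0),
    param_distributions=param_distributions,
    n_iter=6,                                      # sample 6 of the 16 combinations
    scoring="neg_mean_absolute_percentage_error",  # higher (less negative) is better
    cv=3,
    random_state=0,
)
search.fit(X, y)

# scikit-learn reports MAPE as a negative fraction; convert to a percentage.
best_mape_percent = -100.0 * search.best_score_
print(search.best_params_, round(best_mape_percent, 2))
```

For time-ordered traffic data, a chronological split such as `TimeSeriesSplit` may be more appropriate than the default folds, since it avoids validating on records that precede the training data.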
It is also important to note that the performance measure used in this study is the MAPE. The MAPE statistic expresses accuracy as a percentage and is calculated as follows:
$$\mathrm{MAPE} = \frac{100\%}{m} \sum_{i=1}^{m} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$
where m = the total number of data points; $\hat{y}_i$ = the predicted travel time value in the test data set of record i; and $y_i$ = the actual travel time value in the test data set of record i.
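The MAPE can be computed directly from the definitions of m, the predicted values and the actual values given above, using the standard form of the statistic (the mean of the absolute relative errors, times 100). The function name `mape` and the sample numbers below are illustrative only.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    actual: observed travel times y_i (must be nonzero).
    predicted: predicted travel times for the same records.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(actual)
    return 100.0 / m * sum(abs((y - y_hat) / y)
                           for y, y_hat in zip(actual, predicted))


# Two hypothetical records, each off by 10% of the actual value.
example = mape([100.0, 200.0], [110.0, 180.0])
```

Because every error is scaled by the actual value, segments with short and long travel times contribute on the same footing, which suits comparisons across segments and prediction horizons.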
To measure the effectiveness of different travel time prediction algorithms, the MAPEs are computed for three different observation segments (where A, B and C are three observation segments along the selected freeway, shown in Figure 3) with different prediction horizons from 15 min to 60 min. According to the comparison shown in Table 6 and Figure 4, the performance of the proposed RF is better than that of eXtreme Gradient Boosting (XGBoost), another widely used tree-based ensemble method, especially when the prediction horizon is long. The MAPEs of the RF model are significantly smaller than those of XGBoost when the horizon is long enough (i.e. longer than 45 min).
In the machine learning area, usually only some of the predictor variables have significant impacts on the prediction results. Exploring the impact of each individual feature can help researchers and policymakers better understand the contributing variables. Higher relative importance indicates a higher influence on travel time. Table 7 presents the relative importance of each variable and its rank in the optimized RF model. As Table 7 shows, the predictor variables have different degrees of impact on the predicted travel time. The model result shows that the variable $T_{t-1}$ (travel time 15 min before) contributes the most (34.85%) to the predicted travel time result. This result is expected and consistent with a previous study (Zhang and Haghani, 2015), which demonstrates that the immediately preceding traffic condition directly influences the traffic condition in the future. TOD is the second-highest ranked variable, with a relative importance value of 30.12%, and this result is also as expected. $T_{t-w}$ is the fourth-highest ranked variable, with an importance value of 9.87%, which can be interpreted as a highly similar pattern of travel times between weeks.

The result in Table 7 also shows that the spatial impact is less than the temporal impact: except for the variable road ID, with a relative importance value of 2.28%, all the relative importance values of the other spatial variables are less than 1%. Several variables, such as the travel times of the two upstream segments (with relative importance values of 0.31% and 0.42%, respectively) and the travel times of the two downstream segments (with relative importance values of 0.35% and 0.61%, respectively) one time-step ahead, are considered in the model. With respect to the travel time change value, the relative importance values of the two upstream segments are both 0.29%, and those of the two downstream segments are 0.79% and 0.37%, respectively. Based on these results, the relative importance values of the downstream segments are higher than those of the upstream segments. This can be explained by the spatial characteristics of the roadway: when a bottleneck occurs at a downstream segment, the upstream segments are affected very shortly afterward.

Conclusions and recommendations
The tree-based ensemble methods are widely used in the field of prediction. By combining simple trees into a forest, RF generally produces high prediction accuracy (Zhang and Haghani, 2015). In this study, the authors applied an RF method to analyze and model freeway travel time to improve the prediction accuracy and model interpretability. Most existing machine learning models can capture the nonlinear pattern of travel time but suffer from overfitting. The study results indicated that the RF model has considerable advantages in freeway travel time prediction: the performance evaluation showed that the RF-based model can produce better predictions in terms of prediction accuracy. The RF model showed a reasonable performance compared with other approaches. When the prediction horizon is no more than 15 min, the RF algorithm is relatively accurate. However, when the prediction horizon is longer than 30 min, the prediction error increases dramatically, as with other methods. Different from other machine learning methods, RF methods provide interpretable results with different types of predictor variables. RF can also handle data with very high dimensions (many features) without specific feature selection (because feature subsets are randomly selected) and identifies which features are more important after the training process. Furthermore, it has an effective way of estimating missing data and maintaining accuracy when a significant proportion of the data are missing. The relative importance of the features shows that the travel time one step ahead (15 min before) contributes the most to the predicted travel time. Features such as the TOD, DOW, the travel time at the prediction segment one week before and weather also have higher relative importance values in the model than other features. Adding up the relative importance values of the eight most important variables ($T_{t-1}$, TOD, Speed, $T_{t-w}$, DOW, Weather, Road ID, Month) in Table 7 gives 94.77%, which means that these eight selected variables include most of the information needed in the travel time prediction. The proposed RF travel time prediction method has considerable advantages over the other tree-based approach.
However, the practice of the RF algorithm and other tree-based ensemble methods in the travel time prediction area is still very limited. The future focus of the research would be hybrid models (combination models), which can combine several models of the same or different types of prediction models to enhance the model performance and prediction. The RF method can be combined with other tree-based methods or another type of machine learning method in the preprocessing step or prediction step. Experimental results showed that the combination methods have a better prediction result than using a method alone (Li et al., 2009). As the combination model method has been proved superior in terms of prediction accuracy, this should be given careful consideration in the future.
Table 7 (partial). Variable; definition; relative importance (%); rank
$T_{t-2}$; the travel time at prediction segment 30 min before; 0.57; 11
$T_{t-3}$; the travel time at prediction segment 45 min before; 0.28; 18
$T_{t-w}$; the travel time at prediction segment one week before; 9.87; 4
$\Delta T_{t-1}$; the travel time change value at $T_{t-1}$; 0.24; 19
$\Delta T_{t-2}$; the travel time change value at $T_{t-2}$; 0.20; 21
$\Delta T_{t-3}$; The ...