Abstract
The concept of travel time reliability was developed to quantify the variability in travel times. As travel time reliability measures are increasingly used in system planning and performance measurement processes at many transportation agencies, predicting travel time reliability measures has become critical. However, it can be challenging because of the dynamic nature of traffic and the variety of factors contributing to unreliable travel times. This paper developed machine learning models to predict travel time reliability at a planning level. Two random forest algorithms, quantile random forests (QRF) and generalized random forests (GRF), were used to develop prediction models while taking account of a variety of variables from multiple data sources simultaneously. The reliability measures studied are the percentiles of travel times as they are a key component for many commonly used travel time reliability measures. Both QRF and GRF models produced accurate predictions; GRF performed better than QRF at predicting the 50th percentile travel time, and QRF achieved slightly better predictions for the 90th percentile. A case study demonstrated the use of the proposed models for estimating the impact on travel time reliability from an improvement project. The results found both models captured the trend in reliability change, and GRF was preferred over QRF for estimating the level of travel time reliability.
Introduction
Average travel times are widely used in traffic operations and planning, but they only represent typical situations and do not show the entire breadth of user experiences. The concept of travel time reliability was developed to quantify the variability in travel times, and it is a critical component of highway system performance evaluation. The U.S. Department of Transportation Federal Highway Administration (FHWA) defines travel time reliability as “the consistency or dependability in travel times, as measured from day-to-day and/or across different times of the day” ( 1 ). Travel time reliability can quantify the benefits of traffic improvement projects that are not well captured by the average travel time, such as the improvement in the worst-case unexpected delay. Travel time reliability is also critical to various roadway users, including drivers, transit riders, and freight shippers, influencing decisions about where, when, and how travel is made ( 2 – 4 ). Several performance metrics have been developed to measure travel time reliability. Commonly used reliability metrics include time-based metrics (e.g., 90th or 95th percentile travel times, buffer time) and index-based metrics such as buffer time index, travel time index, and the level of travel time reliability (LOTTR).
Developing credible forecasts of travel time reliability measures is increasingly becoming a key component of the system planning and performance measurement process at many transportation agencies as they work toward establishing reliability targets and tracking progress toward those targets. However, current practice by most agencies is to predict travel time reliability using historical trend lines or change rates of other congestion measures, which cannot account for or might not be sensitive to the changes in reliability influencing factors ( 5 ). Predicting travel time reliability measures can be challenging because of the dynamic nature of traffic and the variety of factors known to contribute to unreliable travel times, such as traffic incidents, inclement weather, work zones, special events, traffic control devices, fluctuations in demand, and inadequate base capacity ( 6 ). As a consequence, most travel time reliability prediction models developed in the past focused on predicting a single reliability metric using a few variables (e.g., traffic volume, incidents, and weather) with data collected from one corridor or a limited number of segments ( 7 ). As most reliability metrics are based on the distribution of travel times, prediction models were created (often using linear regression modeling techniques) based on an assumption of a single or multi-mode travel time distribution ( 8 ). There is limited agreement on the optimal distribution of travel times in the literature, however. Although a range of statistical methods have been developed, different choices of the type and numbers of distributions result in a lack of consistency in research conclusions. In addition, when the number of independent variables is relatively large, prediction models are at risk of overfitting the data.
This paper aims to predict travel time reliability at a planning level for statewide interstate highways while considering multiple impact factors including traffic incidents, inclement weather, work zones, demand and capacity, roadway geometry, traffic management, and operational strategies. The reliability metrics studied are the percentiles of travel times on individual segments. The 90th and 95th percentile travel times are the simplest metrics of travel time reliability recommended by the FHWA, and most of the commonly used reliability metrics, such as planning time index and LOTTR, are calculated using the percentiles of travel time as inputs ( 1 ). Random forest models were developed to predict multiple percentiles of travel times simultaneously for statewide interstate segments. Random forests, first introduced by Breiman ( 9 ), are one of the most commonly used machine learning techniques with a reputation for good quality predictions. Random forests are suitable for modeling the enormous amounts of data made available through probe vehicles and other sources, and have been shown in previous studies to outperform linear regression models ( 10 ). Two random forest algorithms, generalized random forests (GRF) and quantile random forests (QRF), were used to develop prediction models. The model performance was then compared, and a case study was presented to demonstrate an application of the proposed models.
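To illustrate why travel time percentiles are a convenient prediction target, the following minimal Python sketch derives several index-based metrics from a sample of travel times. It is not taken from this study: the function names, the linear-interpolation percentile rule, and the input values are assumptions chosen for illustration. The metric definitions follow common FHWA usage (LOTTR as the ratio of the 80th to the 50th percentile, planning time index as the 95th percentile over free-flow travel time, and buffer time index as the 95th percentile minus the mean, divided by the mean).

```python
import statistics


def percentile(values, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    rank = (len(ordered) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def reliability_metrics(travel_times, free_flow_time):
    """Derive index-based reliability metrics from travel time percentiles.

    travel_times and free_flow_time must share the same unit (e.g., seconds).
    """
    p50 = percentile(travel_times, 50)
    p80 = percentile(travel_times, 80)
    p95 = percentile(travel_times, 95)
    mean_tt = statistics.fmean(travel_times)
    return {
        "lottr": p80 / p50,
        "planning_time_index": p95 / free_flow_time,
        "buffer_time_index": (p95 - mean_tt) / mean_tt,
        "travel_time_index": mean_tt / free_flow_time,
    }


# Hypothetical hourly travel times (seconds) on one segment.
metrics = reliability_metrics([100, 100, 100, 100, 200], free_flow_time=90)
```

Because every metric above is a simple function of a few percentiles (plus the mean and free-flow time), a model that predicts the percentiles well can feed any of these indices.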
Literature Review
With the widespread application of travel time reliability measures by transportation agencies, prediction of travel time reliability has become increasingly important ( 8 ). Reliability is one of the four focus areas of the Strategic Highway Research Program 2 (SHRP2), which conducted a series of travel time reliability studies and developed prediction methods for freeways and arterials. The SHRP2 Project L03 developed a data-poor model and a data-rich model to predict travel time index at the 10th, 80th, 90th, and 95th percentiles along with the distribution of travel time index (defined as the ratio of the travel time during the peak period to that during free-flow conditions) for certain time periods ( 4 ). Adopting a modification to the data-rich model developed in Project L03, SHRP2 Project L07 predicts the distribution of travel time index for each hour of the day ( 11 ). SHRP2 Project C11 is a modification of the data-poor model developed in Project L03, and it predicts the travel time index at 50th, 80th, and 95th percentiles on a segment using the mean travel time index ( 12 ). A pilot study in Florida found that the prediction of the 50th, 80th, and 90th travel time percentiles as well as the travel time index using SHRP2 methods in Projects L03, L07, and C11 were reasonably accurate for freeways and arterials ( 13 ).
The Highway Capacity Manual (HCM, 6th Edition) provides a probability-based methodology to incorporate the impact factors into reliability analysis at a freeway facility or corridor level ( 14 ). This method was developed in SHRP2 Project L08. The core element of this method is the scenario generator, which provides a large set of different combinations of impact factors (e.g., demand, weather, incidents) with their corresponding probabilities. Then, travel times are inferred for each scenario through the HCM FREEVAL tool to construct the travel time distribution. According to Tufuor et al. ( 15 ), the HCM-6 method was only validated through simulations without calibration with empirical travel time data. Another study by the same authors compared the travel time distribution built by the HCM-6 method and the empirical one using data on a 1.16-mile testbed in Lincoln, Nebraska ( 16 ). The results indicated that these two distributions were statistically different, with the HCM-6 distribution having a standard deviation lower than the empirical distribution by about 67%.
Most commonly used travel time reliability measures are based on the distribution of travel times. Parametric prediction models often assume certain statistical distributions of travel times, and a variety of distributions have been used previously, such as normal and lognormal ( 8 ), gamma and compound gamma ( 17 ), Halphen ( 18 ), and Burr distributions ( 19 ). There is limited agreement in the literature on the best probability distribution for travel times, however. The heterogeneity of the traffic environment makes it challenging to characterize the shape of travel time distribution using a specific statistical distribution. Different goodness-of-fit measures adopted by researchers to decide the best fit distribution also contribute to the lack of consistent results. Plötz et al. ( 20 ) demonstrated that different goodness-of-fit measures might lead to different conclusions of the best fit distribution types even using the same travel time dataset. To overcome this problem, nonparametric modeling approaches that do not rely on specific assumptions of the travel time distribution have been used to estimate travel time reliability.
Li ( 21 ) used a nonparametric method, kernel density estimation, to measure travel time reliability ratio (the ratio of travel time variability to travel time). Simulation analysis showed that the proposed kernel estimators performed better than the Cornish-Fisher approximation and a “naïve estimator.” Chiou et al. ( 22 ) used functional principal component analysis to model travel time reliability on freeways. Nonparametric kernel density estimation was used to estimate the probability density functions of travel times, and the quantiles of travel time distribution were derived from the probability density function.
Zargari et al. ( 7 ) predicted the planning time index (ratio of the 95th percentile travel time to free-flow travel time) for 59 segments on 11 interstate highways in Virginia using a set of machine learning models and five ordinary least squares linear regression models. Annual average daily traffic (AADT), crashes of different injury severities, and weather factors were included as independent variables. The machine learning models tested included neural networks, support vector regression with linear, polynomial, and radial basis function kernels, K-nearest neighbor, and decision trees. The K-nearest neighbor model outperformed all other models based on model accuracy and a stability measure defined as the ratio between the coefficients of determination of training data and testing data.
Random forest is a nonparametric modeling approach that has been demonstrated to provide accurate predictions while minimizing the potential for creating models that may not be transferable because of overfitting—a major concern associated with machine learning models such as the decision tree. The interactive impacts among variables are considered during the model construction process without increasing the risk of overfitting. It has been successfully used in prediction of travel time, speed, and volume ( 23 , 24 ). Sun et al. ( 23 ) developed a random forest method to predict bus service reliability index.
Meinshausen ( 10 ) and Athey et al. ( 25 ) adapted the random forest algorithm developed by Breiman ( 9 ) for quantile estimation. Meinshausen compared the performance of linear quantile regression models with and without interaction terms and quantile random forests (QRF) using various popular datasets ( 10 ). The results show that quantile random forest models provided better prediction accuracy than traditional quantile regression, especially for higher quantiles. These advantages fit the need of travel time reliability prediction since reliability metrics, both time-based and index-based, often involve quantiles or ratios of quantiles, especially toward the tail portion of the travel time distribution. At the time of this study, QRF had not yet been adopted for prediction of travel time percentiles in published studies.
Data Description and Preparation
This study used data for all interstate highways in Virginia from January 2017 to December 2019. Prediction models were developed for AM and PM peak traffic periods (6:00–10:00 and 16:00–20:00 on weekdays) since those are the time periods experiencing the greatest travel time variability.
Probe data from INRIX was the source of travel times. Travel times were provided based on Traffic Message Channel (TMC) segments, a standard network representation used by INRIX for reporting data. TMC segments are classified as internal and external segments by INRIX. Internal segments represent a stretch of road within an interchange (e.g., between an exit ramp and an entrance ramp), and are referred to in this paper as interchange segments. External segments represent a stretch of road between interchanges and are referred to as freeway segments in this paper. TMC segments less than 0.1 miles long were removed since those segments often had lower quality data. Probe travel times were collected at 1-h intervals. Hours with missing data points across all TMCs and all years, about 0.34% of overall data, were discarded from the analysis. This screening process produced a total of 1,853 TMC segments that were used in this study.
Based on a previous study on travel time reliability influencing factors for interstate highways in Virginia ( 26 ), the following variables were considered in modeling.
Roadway factors including TMC segment length, the number of through lanes, urban/rural designation, and the presence of parallel high occupancy vehicle (HOV), high occupancy toll (HOT), or express lanes. These data were obtained from INRIX metadata and the roadway inventory database of the Virginia Department of Transportation (VDOT).
Weather variables including accumulated hourly liquid precipitation volume (the amount of drizzle, rain, thunderstorm precipitation in inches) and accumulated hourly frozen precipitation volume (the amount of snow, snow grains, snow pellets, ice pellets, and hail precipitation in inches). Weather data were obtained from the local climatological data (LCD) from the National Centers for Environmental Information ( 27 ).
Count of crashes by severity level (e.g., severe injury, visible injury, non-visible injury, and property damage only) and count of non-crash incidents by incident type. The non-crash incidents considered in this study were vehicle breakdowns (disabled vehicles on shoulder) and hazards (disruptive events like vehicles on fire). Crash and incident data were obtained from internal VDOT databases.
Frequencies of shoulder and lane closures because of work zones. Work zone related information was also from the internal VDOT database.
Volume-to-capacity (v/c) ratio. Capacity was calculated according to Highway Capacity Manual ( 14 ) methodologies using data provided by VDOT. Hourly traffic volumes were estimated using AADT and hourly volume profiles calculated from traffic counts collected at VDOT’s continuous count stations.
A safety service patrol (SSP) indicator that denotes the availability of SSP during the analysis interval. SSP provides temporary traffic control and supports scene management during incidents. The presence of SSP helps reduce incident clearance times. The SSP schedules and coverage map were obtained from VDOT.
The TMC segment was used as the spatial unit of analysis in this study. Data were aggregated at TMC segment level following a data conflation procedure described in Zhang et al. ( 26 ). The final dataset prepared for model training and testing included more than 2.55 million rows of records. A list of variables and their characteristics is shown in Table 1.
Table 1. Summary of Input Data
Note: PDO = property damage only; v/c = volume-to-capacity; HOV = high occupancy vehicle; HOT = high occupancy toll; SSP = safety service patrol; Min. = minimum; Max. = maximum; SD = standard deviation.
Methods
Random forests were first introduced by Breiman ( 9 ) and have been adopted widely as a competitive option for prediction accuracy and stability. The essential idea of random forests is to grow an ensemble of B single trees, built as follows:
1. Draw a bootstrap sample from the training data.
2. Using this bootstrapped data, conduct the following steps at each node until a leaf is created:
   a. Select a random subset of mtry independent variables as the split variable candidates at each node. Usually, mtry is a third of the total number of independent variables for regression models. However, results are typically nearly optimal over a wide range of this parameter ( 9 ). Also, independent variables could be selected multiple times at different nodes.
   b. Determine the splitting variable and the threshold. The sum of squared residuals (SSR) is calculated using different values of each variable, and the value with the smallest SSR becomes the threshold for that variable. Then, the variable with the smallest SSR at its threshold becomes the split variable at that node.
   c. Repeat steps a and b and continue growing the tree until a terminal criterion (e.g., a minimum number of observations in a leaf or a predefined number of nodes) is met.
3. Repeat steps 1 and 2 B times. In practice, at least B = 200 trees are recommended.
The prediction of the conditional mean is then estimated by averaging the response across the B trees. This averaging reduces the prediction variance of individual trees while maintaining low bias.
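The node-splitting rule in step 2b can be sketched in a few lines of Python. This is a simplified illustration rather than the implementation used by any particular package: the function names, the use of midpoints between observed values as candidate thresholds, and the toy data are all assumptions.

```python
import random


def ssr(values):
    """Sum of squared residuals of values around their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, targets, candidate_features):
    """Return (feature, threshold, total_ssr) with the smallest SSR.

    rows is a list of tuples of numeric features; candidate_features is
    the random subset of feature indices considered at this node.
    Returns None if no candidate feature has two distinct values.
    """
    best = None
    for f in candidate_features:
        levels = sorted({row[f] for row in rows})
        # Candidate thresholds: midpoints between consecutive observed values.
        for a, b in zip(levels, levels[1:]):
            thr = (a + b) / 2.0
            left = [y for row, y in zip(rows, targets) if row[f] <= thr]
            right = [y for row, y in zip(rows, targets) if row[f] > thr]
            total = ssr(left) + ssr(right)
            if best is None or total < best[2]:
                best = (f, thr, total)
    return best


# Toy node: feature 0 separates the targets cleanly, feature 1 is constant.
rows = [(1, 5), (2, 5), (3, 5), (4, 5)]
targets = [10, 10, 20, 20]
mtry = 2
candidates = random.Random(0).sample(range(2), mtry)  # step 2a
split = best_split(rows, targets, candidates)          # step 2b
```

In a full forest, this search would be repeated recursively on the two child nodes (step 2c) and on B bootstrap samples (step 3).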
Quantile Random Forests (QRF)
Lin and Jeon ( 28 ) proposed a prediction method using the perspective of adaptive neighborhood methods. Suppose the training data is
where
Meinshausen ( 10 ) later harnessed this concept and developed the quantile random forest model to extend predictions to all quantiles rather than just the mean. For quantile regression forests, trees are grown as in the standard random forests. The key difference between quantile regression forests and random forests is that random forests keep only the mean of the observations that fall into the leaf nodes and neglect all other information for the leaves. In contrast, quantile regression forests keep the value of all observations in this leaf, not just their mean ( 10 ). Assume the conditional distribution function of
Using the same weights calculated in Equation 1, the conditional quantile could be estimated by:
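For reference, the weighting scheme and the quantile estimator described in this section can be written in the standard notation of Meinshausen ( 10 ). The symbols below are an assumption about the intended notation and may differ from the numbered equations of this paper: $(X_i, Y_i)$, $i = 1, \dots, n$, are the training observations, $\theta_b$ is the random parameter that determines tree $b$, and $R_{\ell(x,\theta_b)}$ is the leaf of tree $b$ that contains the point $x$.

```latex
% Weight of training observation i in tree b, and averaged over B trees
w_i(x, \theta_b) = \frac{\mathbf{1}\{X_i \in R_{\ell(x,\theta_b)}\}}
                        {\#\{j : X_j \in R_{\ell(x,\theta_b)}\}},
\qquad
w_i(x) = \frac{1}{B} \sum_{b=1}^{B} w_i(x, \theta_b)

% Standard random forest: weighted mean of the responses
\hat{\mu}(x) = \sum_{i=1}^{n} w_i(x)\, Y_i

% Quantile random forest: weighted empirical distribution function
\hat{F}(y \mid X = x) = \sum_{i=1}^{n} w_i(x)\, \mathbf{1}\{Y_i \le y\}

% Conditional quantile at level alpha
\hat{Q}_{\alpha}(x) = \inf\{\, y : \hat{F}(y \mid X = x) \ge \alpha \,\}
```

The same weights $w_i(x)$ serve both the mean and the quantile estimates; only the quantity being averaged changes, from $Y_i$ to the indicator $\mathbf{1}\{Y_i \le y\}$.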
The relative importance of different variables can be inferred from random forests. During the growth of each tree, a sample of data that is not used to build the tree, termed out-of-bag (OOB) sample, may be used to calculate the variable importance. First, the OOB sample is used to measure the prediction accuracy. Then, the values of one variable, χ, in the OOB sample are randomly shuffled, keeping all other variables the same. Finally, the changes in prediction accuracy on the shuffled data are measured. The average of this number over all trees in the forest is the importance score for variable χ. Variables could then be ranked in order of importance according to their scores.
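The shuffling procedure described above can be sketched as follows. This is a simplified, assumption-laden illustration: it scores one fitted model on one held-out sample using mean squared error, whereas a forest would repeat the calculation for each tree on its own OOB sample and average the results. The function names and the toy model are hypothetical.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of model predictions on the given rows."""
    return sum((model(r) - y) ** 2 for r, y in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, feature, seed=0):
    """Increase in MSE after shuffling one feature column.

    rows is a list of tuples; all columns other than feature are kept
    exactly as they were, as in the procedure described in the text.
    """
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    shuffled_col = [r[feature] for r in rows]
    rng.shuffle(shuffled_col)
    shuffled_rows = [
        r[:feature] + (v,) + r[feature + 1:]
        for r, v in zip(rows, shuffled_col)
    ]
    return mse(model, shuffled_rows, targets) - baseline


# Toy example: the response depends on feature 0 only.
rows = [(float(i), float(i % 3)) for i in range(20)]
targets = [2.0 * r[0] for r in rows]
model = lambda r: 2.0 * r[0]
scores = {f: permutation_importance(model, rows, targets, f) for f in (0, 1)}
```

In this toy case, shuffling feature 1 leaves the error unchanged because the model ignores it, while shuffling feature 0 increases the error, so feature 0 ranks higher.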
Generalized Random Forests (GRF)
Although the node splitting rules for the standard random forest as described above (step 2b of the random forest algorithm) are widely used, they might not be suitable for quantile estimation because the targeted statistical measures are different (e.g., conditional mean versus quantiles). To adjust the splitting rules and tailor them to quantile estimation, thus improving prediction performance, Athey et al. ( 25 ) proposed GRF for quantile estimation. Specifically, instead of using SSR to measure the quality of a split, the authors used moment conditions in the form of the following equation:
where
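For reference, the moment-condition framework of Athey et al. ( 25 ) can be sketched in its standard form. The notation below follows that reference and is an assumption about the notation intended here: $O_i$ denotes the observed data for unit $i$, $\theta(x)$ is the parameter of interest, $\nu(x)$ is an optional nuisance parameter, and $\alpha_i(x)$ are forest-based weights that measure how often training observation $i$ falls in the same leaf as $x$.

```latex
% Local moment condition that defines the target parameter theta(x)
\mathbb{E}\left[ \psi_{\theta(x),\,\nu(x)}(O_i) \mid X_i = x \right] = 0

% Scoring function for the q-th conditional quantile of Y
\psi_{\theta}(Y_i) = q\, \mathbf{1}\{Y_i > \theta\} - (1 - q)\, \mathbf{1}\{Y_i \le \theta\}

% Forest-weighted estimate
\left( \hat{\theta}(x), \hat{\nu}(x) \right) \in
  \arg\min_{\theta,\,\nu}
  \left\| \sum_{i=1}^{n} \alpha_i(x)\, \psi_{\theta,\,\nu}(O_i) \right\|_2
```

Setting the expectation of the quantile scoring function to zero gives $P(Y_i \le \theta \mid X_i = x) = q$, so $\theta(x)$ is the $q$-th conditional quantile.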
Comparison of QRF and GRF Algorithms
Besides the difference from a statistical theory perspective between QRF and GRF, several steps of the algorithms (as implemented by their corresponding R packages) also vary and are compared below:
Parameter
Sampling: The sampling process of QRF is the same as the original random forest algorithm ( 9 ). It is a bootstrapped version of the input data, meaning that it is the same size and chosen with replacement. For GRF, only a fraction of the full input data will be used in growing trees, and the selection is done without replacement. The specific value of the fraction used is considered as a tuning factor, but a value of 0.5 is typical. Also, instead of using one set of data for both node splitting and prediction as used by QRF, trees are constructed using the honesty method ( 29 ). To reduce the bias in tree predictions, the honesty trees are built using a subset of the data, determined by the tuning factor, for node splits; but the observations will not be saved in leaf nodes. Then the remaining portion of the sample is pushed down these built trees and saved as the observations in the leaf nodes they fall in, which are used for predictions.
Clustering: The input data used in this study includes multiple data points from the same TMC segment. While QRF does not explicitly consider this kind of data structure, GRF provides a feature called “cluster-robust estimation” to accommodate it. After enabling this feature, all sampling processes described above are conducted, treating each cluster (a TMC segment in this paper) as a whole. Specifically, when a cluster is selected, all data points from that cluster are selected.
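The sampling differences listed above can be sketched in Python. This is an illustrative sketch only, not the code of either R package: the function names and the `sample_fraction` and `honesty_fraction` arguments are hypothetical stand-ins for the tuning factors mentioned in the text.

```python
import random


def bootstrap_sample(n, rng):
    """QRF-style sampling: n row indices drawn with replacement."""
    return [rng.randrange(n) for _ in range(n)]


def honest_subsample(n, rng, sample_fraction=0.5, honesty_fraction=0.5):
    """GRF-style sampling: subsample without replacement, then divide it.

    Returns (split_idx, estimate_idx): the first part is used only to
    choose node splits; the second part populates the leaves and is
    used for predictions.
    """
    chosen = rng.sample(range(n), int(n * sample_fraction))
    cut = int(len(chosen) * honesty_fraction)
    return chosen[:cut], chosen[cut:]


def cluster_subsample(cluster_ids, rng, sample_fraction=0.5):
    """Draw whole clusters without replacement; return member row indices."""
    clusters = sorted(set(cluster_ids))
    k = max(1, int(len(clusters) * sample_fraction))
    picked = set(rng.sample(clusters, k))
    return [i for i, c in enumerate(cluster_ids) if c in picked]


rng = random.Random(1)
boot = bootstrap_sample(10, rng)                 # may contain duplicates
split_idx, estimate_idx = honest_subsample(100, rng)
# Hypothetical TMC segment labels, two rows per segment.
tmc_ids = ["a", "a", "b", "b", "c", "c", "d", "d"]
rows_from_clusters = cluster_subsample(tmc_ids, rng)
```

With cluster-level sampling, all rows from a selected segment enter the subsample together, so observations from one segment are never divided between selected and unselected data.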
Theoretically, since GRF aims at maximizing the heterogeneity of the targeted quantiles, it is expected to have better performance. However, it is unclear whether such improvements are noticeable for the particular application in this study. As noted earlier, QRF uses a bagging version of the full input data, and all data is used for tree growth and prediction. In contrast, GRF divides the data into two portions, with one for tree growth and the other for prediction. Previous studies have shown that smaller sample size (e.g., sample sizes like those used in GRF) results in less related and more diverse trees ( 30 ). However, since the accuracy is decreased for the single trees, it might need a bigger forest to achieve the desired prediction bias.
From a statistical perspective, GRF is more tailored to predict quantiles than QRF; however, the comparison between these two algorithms was only conducted through simulation ( 25 ). In this study, both methods were explored to compare their suitability for predicting travel time percentiles.
Model Development
QRF and GRF models were constructed using a training dataset consisting of data for 2017 and 2018. To identify the optimal value for model parameter mtry—the number of randomly selected variable candidates used for node splitting—models using different values of mtry were built and then compared using a testing dataset. Often, forests built with a lower mtry value provide a better opportunity for exploiting variables with moderate effects on the targeted quantiles. Because the chance of such variables being selected simultaneously with variables having strong effects is also lower, they have a higher possibility of being used as the variable for node splitting. The disadvantage of selecting a lower mtry value is that some trees are constructed by variables that do not actually have a significant impact on the targeted quantiles, which reduces the prediction accuracy after averaging the results of all trees. On the contrary, if the value of mtry is set high, it is most likely that variables with moderate effects are masked by variables with strong effects.
This study adopted mtry values ranging from 5 (about one third of the total number of variables) to 17 (the total number of variables) because: (1) most previous studies recommend using one third of the total number of independent variables as the optimal mtry for prediction accuracy, and (2) Bernard et al. ( 31 ) found that the optimal mtry value is highly related to the number of variables that significantly affect the response variable. Although there are other model tuning parameters, such as the minimum number of observations in a leaf node and the number of total trees, they contribute a minimal amount to prediction changes ( 32 ). For these parameters, the commonly used values were adopted. The minimum node size was set to 10, and the number of trees was 2,000 for all models.
The performance of QRF and GRF models was validated using OOB error and then compared using the testing data. Random forests produce OOB predictions during tree growing, and the OOB error is useful in understanding the goodness-of-fit of models. Using OOB error for model validation provides reliable results equivalent to cross-validation ( 33 ). Four error metrics were used to validate and compare model performance: mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), and bias (the average difference between prediction and observation).
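The four error metrics can be computed as in the following minimal sketch. The function name and the sample values are hypothetical; bias is taken as the mean of prediction minus observation, consistent with the definition above, and MAPE is expressed in percent.

```python
def error_metrics(predicted, observed):
    """Return MAE, MSE, MAPE (in percent), and bias for paired values.

    Observed values must be non-zero because MAPE divides by them.
    """
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    errors = [p - o for p, o in zip(predicted, observed)]
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "mse": sum(e * e for e in errors) / n,
        "mape": 100.0 * sum(abs(e) / o for e, o in zip(errors, observed)) / n,
        "bias": sum(errors) / n,
    }


# Hypothetical predicted and observed percentile travel times (seconds).
metrics = error_metrics([110.0, 90.0, 105.0], [100.0, 100.0, 100.0])
```

MAE, MSE, and bias carry the unit of the travel times (or its square), so they grow with segment length, whereas MAPE is scaled by the observed value and is comparable across segments of different lengths.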
Results
The QRF and GRF models were built using the R packages “ranger” ( 34 ) and “grf” ( 35 ). Data from January 1, 2017 to December 3, 2018 were used for training, and data from December 4, 2018 to December 31, 2019 were used for testing. The split of training and testing data was based on an INRIX TMC map update in December 2018 that changed the segmentation of the TMC network. For about 30% of all TMC segments studied, the length and/or segment start/end location changed by at least 0.5 mile in that map update. This split could help evaluate model flexibility. GRF and QRF models were created for freeway and interchange segments separately because a previous study found the independent variables could affect these two types of segments differently ( 26 ). The 50th, 80th, and 90th percentiles of travel times were predicted at the same time in one model run to facilitate easy implementation in practice. In addition, Athey et al. ( 25 ) found that estimating multiple quantiles at the same time produced better predictions at less computational expense.
The model training and testing were performed on a computer cluster consisting of 20 cores with 9 GB memory for each core. The QRF models ran slightly faster than the GRF models, but the difference was minor.
Comparison of QRF and GRF Performances
QRF and GRF models were built using mtry values of 5, 8, 11, 14, and 17 for freeway and interchange segments separately. The results of mtry values of 5 and 8 had minimal differences, as did 11 and 14; therefore, only the results of mtry values 5, 11, and 17 are included in this paper for the sake of brevity. Figures 1 and 2 show the prediction performances of GRF and QRF models with different mtry values using the testing data.

Figure 1. Prediction performances of generalized random forest models.

Figure 2. Prediction performances of quantile random forest models.
For the GRF, almost all error metrics indicate that the prediction accuracy increases with increasing values of mtry. Out of the 24 possible error metrics for all three travel time percentiles and for both freeway and interchange models, GRF models with mtry = 17 performed the best for 14 metrics, models with mtry = 11 achieved the best for 9 metrics, and models with mtry = 5 performed best for 1 metric. Therefore, the value of 17 was considered the best mtry for GRF models. It should be noted that the value of mtry used in GRF is likely to change for each node splitting as it is drawn from a Poisson distribution with a mean equal to mtry. As a result, even when set to the maximum value of 17 (the total number of variables), the GRF models should still maintain a certain level of stability.
For the QRF models, the best performance occurred mostly when mtry was lower than the total number of variables, which supports the finding from several previous studies that the optimal value of mtry is around one third of the total number of variables. QRF models with mtry = 11 performed the best for 11 out of the total 24 error metrics for all three travel time percentiles and for both freeway and interchange models; the model with mtry = 5 outperformed the others for 10 metrics. The value of 11 was selected as the best mtry for QRF models.
The best-performing GRF (mtry = 17) and QRF (mtry = 11) models were validated and compared both with OOB samples from model training and the testing data. The GRF models performed well on the training data, with an R-squared of 0.92 and MSE of 991 from OOB predictions for freeway segments, and an R-squared of 0.56 and MSE of 431 for interchange segments. The OOB predictions from QRF models had an R-squared of 0.91 and MSE of 1,108 for freeway segments as well as an R-squared of 0.74 and MSE of 174 for interchange segments, indicating a good fit of the training data. The QRF interchange model fit the training data better than the GRF one to a small extent.
Figure 3 shows the comparison of error metrics from GRF and QRF models using the testing data. For both GRF and QRF models, the prediction errors increase as the travel time percentile being predicted increases. The 90th percentile travel time is often used to represent the extreme situations, and past studies have found it is more challenging to predict 90th than 80th and 50th percentiles ( 26 ). Because freeway segments were longer than interchange segments and the travel times were longer, non-scaled error metrics—such as MAE, MSE, and bias—of freeway models had higher values than interchange models, but the freeway models were generally more accurate than interchange models.

Figure 3. Performance comparison of GRF and QRF models.
GRF models performed slightly better than QRF for the 50th percentile travel time in five out of eight error metrics (MAE and MSE for interchange segments, bias for freeway segments, and MAPE for all segments); however, the differences between GRF and QRF metrics were generally minor. For the 80th percentile travel time, QRF performed slightly better than GRF in five out of eight error metrics (MAE and MSE for both freeway and interchange segments, and bias for interchange segments), and the differences in bias were more obvious than other error metrics. For the 90th percentile travel time, QRF appeared to have a little advantage over GRF with smaller values of MAE, MAPE, and bias for all segments and MSE for freeway segments.
The distributions of prediction errors were similar for GRF and QRF models with respect to deviation and the frequency of values around zero, as shown in Figure 4. The prediction error in this figure is defined as the difference between predicted value and observed value divided by the observed value. As the predicted travel time percentile increased, the spread of prediction errors became wider and the frequency of errors around zero decreased. The number of errors within 10% was slightly higher for GRF than QRF for the 50th percentile, and the opposite trend was observed for the 90th percentile. For the 80th percentile, the number of prediction errors less than 10% was almost the same for GRF and QRF models.

Figure 4. Distributions of prediction errors.
Both GRF and QRF models produced accurate estimates. Depending on the use case in practice, one error metric might be preferred over the others to select the best-performing model. Also, the results in Figures 3 and 4 are from GRF and QRF models trained to predict the 50th, 80th and 90th percentile travel times simultaneously. It is possible that a model trained to predict a single travel time percentile only could achieve better accuracy as the model could be more tailored to the percentile to be predicted.
Variable Importance Ranking
GRF and QRF models can provide a measure of variable importance that indicates how much the variables were used to make predictions. There are several metrics available for variable importance in random forests. In this paper, the variable importance in GRF was a simple weighted sum based on how many times a particular variable was used for node splitting, and the one in QRF was an impurity measure that quantifies the variance of the responses for quantile regression ( 34 , 35 ). Both metrics were calculated during model training. As all three travel time percentiles were estimated simultaneously, only one set of importance scores was generated for each model, and it was not distinguished by quantile. Even though the methods of metric calculation were different, the variable importance rankings from GRF (mtry = 17) and QRF (mtry = 11) models were mostly consistent, as shown in Figure 5. The length of TMC segment (mile), heavy vehicle percentage (heavy_percent), the number of through lanes (throu_lane), and volume-to-capacity ratio (vc_ratio) were ranked high for both GRF and QRF models, which indicates that these variables have stronger predictive power than other variables for predicting the 50th, 80th, and 90th percentile travel times based on the study dataset. These variables are among the major influencing factors identified in current literature for modeling travel time reliability ( 7 , 26 , 36 ). Compared with all other models, the QRF interchange model gave relatively lower importance to urban/rural designation (rural), and relatively higher importance to v/c ratio, SSP, and crash variables. This might be caused by collinearity, as the variable importance metrics for random forests may be spread across correlated variables and overestimate their importance. Rare events like severe crashes and extreme weather could significantly disrupt travel time reliability, but variables representing those events did not show strong predictive power.
It is likely that these variables were used less often for node splitting in GRF and QRF models. Studies have shown that the variable importance metrics of random forests are biased in favor of variables with more categories and, when predictor variables are highly correlated, inflated toward continuous variables ( 37 , 38 ). The variable importance may be unreliable in models where variables vary in their scale of measurement or their number of categories, which could affect model interpretability. Random forests are more of a predictive tool than a descriptive tool. The variable importance ranking is not intended to explain how the variables may affect travel time reliability.
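To make the split-frequency idea behind the GRF importance metric concrete, the sketch below computes a depth-weighted sum of split frequencies. The weighting scheme (shallower splits count more, with a decay exponent of 2), the function name, and the split counts are all assumptions for illustration; the variable names are those used in this paper, but the numbers are not results from the fitted models.

```python
def split_frequency_importance(split_counts, decay_exponent=2.0):
    """Depth-weighted split-frequency importance.

    split_counts[d][j] is the number of splits on variable j at tree depth d
    (d = 0 is the root), pooled over all trees in the forest.
    """
    n_vars = len(split_counts[0])
    scores = [0.0] * n_vars
    weight_total = 0.0
    for depth, counts in enumerate(split_counts):
        total = sum(counts)
        if total == 0:
            continue  # no splits recorded at this depth
        weight = (depth + 1) ** (-decay_exponent)  # assumed decay with depth
        weight_total += weight
        for j, count in enumerate(counts):
            scores[j] += weight * count / total
    if weight_total == 0.0:
        raise ValueError("no splits recorded at any depth")
    # Normalize so the importance scores sum to one.
    return [s / weight_total for s in scores]


# Hypothetical split counts for three of the study variables.
variables = ["mile", "vc_ratio", "rural"]
split_counts = [
    [8, 2, 0],   # depth 0 (root splits)
    [5, 10, 5],  # depth 1
]
importance = split_frequency_importance(split_counts)
for name, score in zip(variables, importance):
    print(f"{name}: {score:.2f}")
```

A metric of this kind only counts how often a variable is chosen for splitting, which is one reason a variable representing rare events can receive a low score even if it matters greatly on the few days when the event occurs.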

Figure 5. Variable importance for GRF and QRF models.
Case Study
An application of the proposed models was demonstrated through a case study to estimate the change in travel time reliability resulting from a capacity improvement project. The study site is an approximately 5-mile-long section (6 TMC segments) on I-64 in Virginia, shown in Figure 6. The freeway segments between the I-295 interchange and Bottoms Bridge areas (blue lines in Figure 6) were widened from two to three lanes in both travel directions. The upstream and downstream interchange segments (red lines in Figure 6) were also included in the analysis. The construction began in August 2017 and was completed in August 2019. Because of the impact of COVID-19 on traffic, data from 2020 and 2021 were not considered, and the “after” period was set as August 2019 to December 2019. Accordingly, the “before” period was set as March 2017 to July 2017, resulting in 5 months for both periods. The reliability metric estimated was the LOTTR, which is one of the National Highway System performance metrics required by FHWA rulemaking to be reported periodically ( 39 ). The LOTTR metric is expressed as the ratio of the 80th to the 50th percentile travel time of a segment occurring throughout a calendar year. In this case study, only the data in the before and after periods were used to calculate LOTTR for demonstration purposes.
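The LOTTR ratio described above can be sketched for a single segment as follows. The percentile helper uses linear interpolation between order statistics, which is an assumption about the percentile convention; the function names and the travel time sample are hypothetical.

```python
def percentile(values, q):
    """Percentile (q in [0, 100]) with linear interpolation between order statistics."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def lottr(travel_times):
    """Level of travel time reliability: 80th over 50th percentile travel time."""
    return percentile(travel_times, 80) / percentile(travel_times, 50)


# Hypothetical segment travel times in seconds (not data from the study).
travel_times = [60.0, 62.0, 64.0, 66.0, 68.0, 70.0, 72.0, 74.0, 76.0, 78.0, 80.0]
print(round(lottr(travel_times), 4))
```

In the case study, the two percentiles would come from the GRF or QRF predictions rather than from an observed sample, and the same ratio would then be applied.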

Figure 6. Case study site on I-64.
The data sources and data preparation procedure were the same as described earlier for statewide data. The 50th and 80th percentile travel times were predicted using the GRF and QRF models trained with statewide data, and LOTTR metrics were then estimated. The predicted changes in LOTTR and the prediction errors are summarized in Table 2.
Table 2. Summary of LOTTR Predictions from GRF and QRF Models
Note: LOTTR = level of travel time reliability; GRF = generalized random forests; QRF = quantile random forests; MAE = mean absolute error; MAPE = mean absolute percentage error; MSE = mean squared error.
On average, the observed 50th percentile travel time decreased by a little more than 6 s from the before period to the after period on the widened freeway segments, and by about 5 s on interchange segments. The observed 80th percentile travel time dropped by around 4 s for both freeway and interchange segments. As the decreases in the 50th percentile travel times were slightly greater than those in the 80th percentile travel times, the LOTTRs increased by an average of 0.0062 and 0.0079 from the before period to the after period for freeway and interchange segments, respectively. These changes were minimal and thus challenging to predict. Both GRF and QRF captured the trend of change in LOTTR, and GRF performed better with smaller errors. The difference between predicted LOTTRs in the before and after periods could better reflect the actual changes than the difference between the observed LOTTR for the before period and the predicted LOTTR for the after period. Using the predicted values, the GRF model accurately captured the direction and magnitude of change in LOTTR for freeway segments. Both GRF and QRF models tended to overestimate for interchange segments.
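The arithmetic behind these observations can be illustrated with a small sketch: LOTTR rises when the 50th percentile falls proportionally more than the 80th, and a predicted-minus-predicted difference can cancel a consistent model offset that a predicted-minus-observed difference would retain. All travel times below are made-up values chosen for illustration; they are not the case study results, and the function names are hypothetical.

```python
def lottr_from_percentiles(t50, t80):
    """LOTTR as the ratio of the 80th to the 50th percentile travel time."""
    return t80 / t50


def lottr_change(before, after):
    """Change in LOTTR; before and after are (t50, t80) tuples in seconds."""
    return lottr_from_percentiles(*after) - lottr_from_percentiles(*before)


# Hypothetical (t50, t80) pairs in seconds for one segment.
observed_before = (300.0, 330.0)
observed_after = (294.0, 326.0)    # t50 drops 6 s, t80 drops 4 s
predicted_before = (303.0, 339.0)  # assumed model output with a consistent offset
predicted_after = (297.0, 335.0)

observed_change = lottr_change(observed_before, observed_after)
predicted_change = lottr_change(predicted_before, predicted_after)
# Mixed comparison: predicted "after" LOTTR against observed "before" LOTTR.
mixed_change = lottr_change(observed_before, predicted_after)

print(f"observed change:  {observed_change:+.4f}")
print(f"predicted change: {predicted_change:+.4f}")
print(f"mixed change:     {mixed_change:+.4f}")
```

In this illustration the predicted-to-predicted change is much closer to the observed change than the mixed comparison, because the offset in the predicted LOTTR appears in both periods and largely cancels.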
Conclusions
Travel time reliability has been widely used in performance reporting and project evaluation. Reliability metrics are often expressed as a function of the mean and/or percentiles of travel times. This study developed two types of random forest models, GRF and QRF, for predicting different percentiles of travel time simultaneously. The purpose is to produce planning-level estimates of travel time percentiles, from which travel time reliability metrics can then be estimated. The models were constructed using a variety of variables including roadway geometric features, traffic supply and demand, incidents, weather, work zones, and traffic management and operation programs (e.g., managed lanes, safety service patrol). These factors and their interactions were considered at the same time, which makes it possible to apply the proposed models for estimating the reliability impacts of diverse improvement projects and operational strategies. The models were developed using data for all interstate highways in Virginia, aiming to facilitate statewide implementation such as MAP-21 system reliability target setting activities. The travel time percentiles of interest, the 50th, 80th, and 90th percentiles in this paper, could be predicted at the TMC segment level. Separate models were developed for freeway and interchange segments.
GRF and QRF models were built using two years of training data, and analysis was conducted to identify the optimal value for the model tuning parameter mtry. For GRF, the model accuracy increased with increasing mtry values, and the best prediction performance was achieved when mtry was 17, the total number of independent variables. For QRF, the mtry value associated with the best performance was 11. The best-performing GRF (mtry = 17) and QRF (mtry = 11) models were validated and compared using predictions from out-of-bag samples and one year of testing data. In terms of MAE, MSE, MAPE, and bias, GRF models performed better than QRF on the testing data for predicting the 50th percentile travel time, and QRF models were slightly better at predicting the 90th percentile. However, the differences between those error metrics were generally not significant; for the same travel time percentile, a model could be preferred by one error metric but disfavored by another. The context of implementation should be considered when selecting the best-suited model. A case study was conducted to illustrate the use of the proposed models for estimating the reliability impact of a capacity improvement project. Both GRF and QRF models captured the trend in reliability change, but GRF performed better than QRF for estimating LOTTR.
The GRF and QRF models were less accurate for interchange segments than for freeway segments, and the prediction accuracy decreased for the 90th and higher percentiles. This might be associated with the interactions with downstream segments and ramps that were not considered in model construction. Research is needed to further improve the accuracy of the proposed models by including those dynamic interactions. Integrating random forests with other parametric and non-parametric modeling techniques could also be considered to improve model accuracy and stability. It would also be valuable to compare GRF and QRF with other machine learning techniques that might fit quantile regression for a large number of roadway segments, such as quantile gradient boosting decision trees and deep quantile regression.
The proposed models were developed using data for peak traffic periods. To adopt the proposed models for system reliability performance target setting and evaluation of reliability improvement projects, further research is needed to expand and evaluate the models for non-peak periods and for non-interstate highways.
Footnotes
Acknowledgements
The authors appreciate and acknowledge the contributions to this research from Mena Lockwood, Jungwook Jun, Paul Szatkowski, Ramkumar Venkatanarayana, Sanhita Lahiri, Simona Babiceanu, and Katie Felton of VDOT, and Margit Ray of the Office of Intermodal Planning and Investment of the Secretary of Transportation, Commonwealth of Virginia.
Author Contributions
The authors confirm contribution to the paper as follows: study conception and design: X. Zhang, M. Zhao, J. Appiah, M. D. Fontaine; data collection: Zhang, Zhao, Appiah; analysis and interpretation of results: Zhao, Appiah, Zhang, Fontaine; draft manuscript preparation: Zhao, Zhang, Appiah, Fontaine. All authors reviewed the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Data Accessibility Statement
This paper used third-party proprietary probe vehicle data, and the authors are not permitted to share the data under the data use agreement.
