Abstract
With improvements in data collection, storage, and processing, machine learning (ML) is gaining momentum as a behavior prediction method in engineering. Several studies have evaluated the potential of ML algorithms to predict pavement serviceability; however, some challenges limit their use. Training data preprocessing has a great impact on a model’s predictive performance, is highly dependent on the modeler’s experience, and is not typically reported in engineering-related literature. The objective of this study was to assess the effects of data preprocessing, hyperparameter selection, and time series size on the model’s evaluation metrics. To that end, this paper analyzes the performance of three ML algorithms in predicting maximum deflection (D0) and international roughness index (IRI): support vector machine, random forest (RF), and artificial neural network (ANN). An R2 and mean square error (MSE) analysis was conducted on 12 training datasets, with two sizes of historical data and five stages of data preprocessing. The results indicated that ANN was the most accurate technique, with an R2 of 0.99 and MSE of 20 × 10−3 mm for D0 prediction and an R2 of 0.91 and MSE of 0.03 m/km for IRI prediction. RF was also identified as an effective technique, generating similar results with less data preprocessing. The addition of structural and traffic categorical features to the training dataset produced the most significant improvement in the support vector regression and ANN performance metrics; hyperparameter selection was effective only for IRI prediction, especially with the ANN algorithm.
The accurate prediction of pavement performance is essential to define a cost-effective maintenance, rehabilitation, and reconstruction program in pavement management. Overall, the pavement’s performance is measured by indexes for cracking, rutting, patching, and roughness.
The accumulated changes in pavement surface roughness over time are among the most important performance indicators. In fact, studies following the AASHO Road Test indicated that approximately 95% of the information about pavement serviceability can be attributed to surface roughness ( 1 ). The international roughness index (IRI) is the most widely used pavement serviceability indicator, since it is adopted throughout the world as a standard measure of road surface roughness ( 2 , 3 ).
The Highway Development and Management System (HDM-4) is the most used pavement management tool. Its performance prediction models have great flexibility, being able to consider different surface materials, structure types, climates, and traffic characteristics. HDM-4 regression models can forecast surface roughness, cracking, rutting, potholes, and so on ( 4 ), although it is critical to calibrate its models to each network. Considering the robustness of HDM-4 models, but recognizing their limitations in relation to local calibration and to big-data treatment and handling, it is important to develop more current and adaptable solutions.
With improvements in data storage and collection, machine learning (ML) has been gaining visibility as a potential behavior prediction method in engineering studies ( 5 ). ML is the science of developing complex algorithms that iteratively learn from data collected from real-world observations, and then produce reliable, repeatable decisions and results ( 6 , 7 ).
ML’s ability to capture complex relations in the data makes it a promising candidate to address challenging cases in engineering that would be difficult to solve with traditional regression approaches. The prediction models developed using these techniques are only as good as the data used for model training; therefore, the availability of an extensive and complete database is crucial for an ML-based model to thrive.
In addition to the completeness of the database, training data preprocessing has a great impact on a model’s predictive performance, is highly dependent on the modeler’s experience, and is not typically reported in engineering-related literature ( 8 ). In fact, most research focuses on the ML algorithms used and the evaluation metrics, without describing the training database construction and final predictors ( 8 – 12 ).
Therefore, the objective of this study was to assess the effects of data preprocessing, hyperparameter selection, and time series size on the evaluation metrics of each tested ML algorithm on each tested training dataset. To achieve this goal, the three most frequently used ML algorithms in surface deflection (D0) and IRI prediction, namely support vector machine (SVM), random forest (RF), and artificial neural network (ANN), were tested on 12 training datasets composed of 20 years of historical data collected from 23 highways across the state of São Paulo, Brazil.
ML Algorithms in Pavement Performance Prediction
ML is defined as a set of robust algorithms for data analysis and interpretation, based on statistical concepts, which allow pattern recognition, behavior prediction, data classification, and data regression with great precision ( 7 ). These techniques go beyond the boundaries of computer science and statistics: they have been widely used in other fields such as medicine, biology, economics, public safety, and engineering, in tasks that range from speech and image recognition to human behavior prediction.
As indicated in previous studies, the most frequently used ML algorithms in pavement performance prediction are support vector regression (SVR), RF, and ANN ( 8 – 12 ). Therefore, their theoretical bases, characteristics, and calibration hyperparameters are described in the following subsections.
Support Vector Regression
SVM is a set of supervised learning methods used for classification, regression, and outlier detection. SVM algorithms are a generalization of a simple and intuitive classifier called the “maximum margin classifier,” in which data points are arranged in space and separated by an optimal hyperplane, which splits the data with the greatest possible margin. When data classes are not separable by linear functions, maximum margin classifiers do not work efficiently. To accommodate nonlinear boundaries, SVM algorithms increase the number of dimensions of the feature space, using quadratic, cubic, polynomial, sigmoidal, or Gaussian functions (i.e., kernels) to transform the input variables ( 13 ).
SVR uses the same principles for classification as SVM. However, since the result is a real number, the model adds a margin of tolerance to the resulting hyperplane ( 14 ). The most significant hyperparameters in this algorithm are (i) kernel: the kernel type to be used in the algorithm; (ii) C: a regularization parameter; (iii) gamma: the kernel coefficient; and (iv) epsilon: the margin of tolerance to the resulting hyperplane.
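As an illustration of how these four hyperparameters map onto the scikit-learn `SVR` estimator, a minimal sketch is given below. The synthetic data, the feature-scaling step, and the hyperparameter values are assumptions chosen for illustration only and do not reproduce the configuration used in this study.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for a training dataset (assumption, not the study's data).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 2.0 * X[:, 0] ** 2 + X[:, 1] + rng.normal(0.0, 0.05, size=200)

# kernel: kernel type; C: regularization; gamma: kernel coefficient;
# epsilon: width of the tolerance margin around the regression function.
svr_model = make_pipeline(
    StandardScaler(),  # scaling is an added assumption; SVR is scale-sensitive
    SVR(kernel="rbf", C=10.0, gamma="scale", epsilon=0.05),
)
svr_model.fit(X, y)

train_r2 = svr_model.score(X, y)  # coefficient of determination on training data
print(f"Training R2: {train_r2:.3f}")
```

Residuals smaller than `epsilon` do not contribute to the loss, which is why this parameter acts as the margin of tolerance described above.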
Random Forest Regression
RF is a method based on the results of multiple decision trees. Parts of the training data are successively separated and submitted to the decision tree model. The different trees that make up the forest work in parallel, without any interaction between them, and the results are obtained by averaging the values predicted by each tree ( 13 ). In this algorithm, each time the division of a branch is considered, a random sample of the variables is chosen as candidates for the division of that branch. This randomness corrects the tendency of decision trees to overfit the training dataset. In addition, it allows all variables to be considered across the various decision branches ( 13 ). The optimized hyperparameters in this algorithm are (i) N estimators: number of trees in the forest; (ii) max depth: maximum depth of each tree; (iii) min samples split: minimum number of samples required to split an internal node; (iv) min samples leaf: minimum number of samples required to be at a leaf node; (v) max features: number of features to consider when looking for the best split; and (vi) bootstrap: whether bootstrap samples are used when building trees. If false, the whole dataset is used to build each tree.
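The averaging behavior and the six hyperparameters listed above can be sketched with the scikit-learn `RandomForestRegressor`. The synthetic data and the hyperparameter values below are illustrative assumptions, not the values tested in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for a training dataset (assumption, not the study's data).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = 3.0 * X[:, 0] + np.sin(4.0 * X[:, 1]) + rng.normal(0.0, 0.1, size=300)

rf_model = RandomForestRegressor(
    n_estimators=100,      # number of trees in the forest
    max_depth=None,        # grow each tree until the leaf criteria are met
    min_samples_split=2,   # minimum samples needed to split an internal node
    min_samples_leaf=1,    # minimum samples required at a leaf node
    max_features="sqrt",   # random subset of features tried at each split
    bootstrap=True,        # each tree is trained on a bootstrap sample
    random_state=0,
)
rf_model.fit(X, y)

# The forest prediction is the average of the individual tree predictions.
tree_preds = np.stack([tree.predict(X[:5]) for tree in rf_model.estimators_])
forest_pred = rf_model.predict(X[:5])
print(np.allclose(tree_preds.mean(axis=0), forest_pred))
```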
Artificial Neural Network
ANNs are models based on the emulation of the human nervous system ( 15 ). The architecture of neural networks involves an interconnected network of neurons, arranged in layers, which interact by passing information from one layer to another through activation functions ( 16 ). The most common type of ANN consists of an input layer, X, at least one hidden layer, and an output layer, f(X), as shown in Figure 1. Between the input and output layers, the hidden layers are composed of weight matrices, biases, and activation functions.

Figure 1. Artificial neural network model.
A neural network model is trained by gradient descent optimization of the weight matrices, so that the output of the model minimizes the error between the predicted and measured values ( 17 ). The standard ANN model has a significant limitation when applied to problems such as time series analysis. As the sequence size increases, the ANN size grows exponentially, which increases processing time and demands a more robust computational infrastructure. The optimized hyperparameters in this algorithm are (i) hidden layer sizes: the number of neurons in each hidden layer; (ii) solver: the solver for weight optimization; (iii) learning rate: the learning rate schedule for weight updates; and (iv) max iter: the maximum number of iterations allowed for the solver.
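A minimal sketch of these four hyperparameters in the scikit-learn `MLPRegressor` is given below. The synthetic data and the scaling step are assumptions. The hyperparameter values mirror the best D0 configuration reported later in this paper and are used here only as an example.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a training dataset (assumption, not the study's data).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = (2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] * X[:, 3]
     + rng.normal(0.0, 0.05, size=300))

ann_model = make_pipeline(
    StandardScaler(),  # scaling is an added assumption
    MLPRegressor(
        hidden_layer_sizes=(50,),  # one hidden layer with 50 neurons
        solver="lbfgs",            # quasi-Newton weight optimizer
        # In scikit-learn, learning_rate only takes effect with solver="sgd";
        # it is kept here to show where the parameter is set.
        learning_rate="adaptive",
        max_iter=800,              # upper bound on solver iterations
        random_state=0,
    ),
)
ann_model.fit(X, y)

train_r2 = ann_model.score(X, y)
print(f"Training R2: {train_r2:.3f}")
```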
Algorithm Performance Measures
To identify which ML algorithm and training dataset performed best in the regression of the final D0 and IRI, the metrics evaluated were the coefficient of determination (R2) and the mean square error (MSE), both obtained from the cross-validation process.
In supervised ML, cross-validation is a procedure in which the database is randomly divided into k distinct groups. The first group (the validation group) is separated from the others; the remaining k − 1 groups are used to train the model and obtain the regression or classification function. The function is then applied to the validation group, from which the target variable has been withheld; its predictions are compared with the known values, and the metric chosen to evaluate the fit of the function is computed. This procedure is repeated k times, until every group has served as the validation group ( 13 ).
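The procedure can be sketched with the scikit-learn `cross_validate` helper, which returns both metrics for each fold. The value k = 5, the synthetic data, and the choice of estimator are assumptions for illustration; the number of folds used in this study is not stated here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for a training dataset (assumption, not the study's data).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(250, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0.0, 0.1, size=250)

# Randomly split the data into k = 5 groups; each group is used once
# for validation while the other four are used for training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X, y, cv=cv,
    scoring=("r2", "neg_mean_squared_error"),
)

r2_per_fold = scores["test_r2"]
mse_per_fold = -scores["test_neg_mean_squared_error"]  # scorer returns -MSE
print(f"Mean R2:  {r2_per_fold.mean():.3f}")
print(f"Mean MSE: {mse_per_fold.mean():.4f}")
```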
Data Preprocessing
To assess the impact of each step of the data preprocessing on the ML algorithm’s performance, the evaluated models were cross validated with different dataset constructions. The raw data obtained from the São Paulo State Transportation Agency consisted of IRI and deflection basin survey results, conducted on 23 highways during a 20-year period, segmented into 200-m or smaller sections, in all traffic lanes, subjected to different traffic loads, and located in diverse weather zones.
Data preprocessing consisted of data cleansing, database consolidation, target feature balancing, temporal series construction, and feature engineering.
Data Cleansing
All unreliable data were excluded. The raw data reports presented problems with highway segment identification, missing data, survey procedure errors, and parameter value errors. To be included in the study, a segment had to have all the standardized survey information registered in the reports, in addition to annual average daily traffic (AADT) and equivalent single axle load (ESAL) values and maintenance history. Finally, the segments had to have at least two consecutive years of IRI and FWD (Falling Weight Deflectometer) survey results to establish a time series. With this initial data treatment, 54,854 segments were selected for the study.
Database Consolidation
The data needed to construct the training dataset were from distinct sources and configured differently. The IRI surveys were divided into 200
Target Feature Balance
As specified by the São Paulo State Transportation Agency, every highway operated by private enterprises should always have an IRI below 2.69 m/km. This limit would have restricted the variability of the training data, resulting in an imbalanced dataset. Therefore, to reduce the model’s bias, data from nonregulated or recently regulated highways were included in the construction of the training dataset.
Temporal Series
To assess the model’s performance in relation to different temporal series sizes, two training datasets were constructed. The first consisted of 36,155 segments, with two consecutive years of IRI and FWD survey results. The second consisted of 11,471 segments, with three consecutive years of IRI and FWD survey results. Because of the maintenance dynamics and the expected pavement performance specified in the São Paulo State Transportation Agency contracts, the temporal series size was restricted to 3 years. This restriction was put in place so that D0 and IRI degradation over time would not be hidden by maintenance or rehabilitation work. It is worth noting that performance changes caused by environmental conditions were not significant in this particular research because, as a result of contract restrictions, the surveys were conducted during the same period on each highway.
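A minimal sketch of how such fixed-length series could be assembled from a long-format survey table is given below. The column names (`segment_id`, `year`, `iri`), the toy values, and the helper `build_series` are hypothetical; the implementation used in this study is not described.

```python
import pandas as pd

# Toy survey table (hypothetical values): one row per segment and survey year.
surveys = pd.DataFrame({
    "segment_id": ["A", "A", "A", "B", "B", "C"],
    "year":       [2015, 2016, 2017, 2015, 2017, 2016],
    "iri":        [1.8, 1.9, 2.1, 2.0, 2.4, 1.7],
})

def build_series(df: pd.DataFrame, n_years: int) -> pd.DataFrame:
    """Keep only targets preceded by n_years - 1 consecutive yearly surveys."""
    out = df.rename(columns={"iri": "iri_target"})
    for lag in range(1, n_years):
        # Shift the survey year forward so it lines up with the target year.
        lagged = (df.assign(year=df["year"] + lag)
                    .rename(columns={"iri": f"iri_lag{lag}"}))
        # The inner join drops targets whose lagged survey is missing.
        out = out.merge(lagged, on=["segment_id", "year"], how="inner")
    return out

two_year = build_series(surveys, 2)    # one historical survey + target
three_year = build_series(surveys, 3)  # two historical surveys + target
print(two_year)
print(three_year)
```

In this toy table, segment B is dropped because its surveys are not consecutive, which is the same reason the 3-year dataset in this study contains fewer segments than the 2-year dataset.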
Feature Engineering
Feature engineering consists of extracting features from raw data and transforming them into suitable variables for the ML model. This is a crucial step in the ML pipeline, since the right features will produce higher quality results ( 18 ).
Deflection Basin Indexes
In this study, pavement structural robustness and condition were inferred through four deflection basin parameters:
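\[ \mathrm{SCI} = D_0 - D_{30} \]
where
SCI = surface curvature index (10−3 mm),
D0 = maximum deflection (10−3 mm), and
D30 = deflection at a 30-cm offset (10−3 mm).
\[ \mathrm{BDI} = D_{30} - D_{60} \]
where
BDI = base damage index (10−3 mm),
D30 = deflection at a 30-cm offset (10−3 mm), and
D60 = deflection at a 60-cm offset (10−3 mm).
\[ \mathrm{BCI} = D_{60} - D_{90} \]
where
BCI = base curvature index (10−3 mm),
D60 = deflection at a 60-cm offset (10−3 mm), and
D90 = deflection at a 90-cm offset (10−3 mm).
\[ \mathrm{AREA} = \frac{15\,(D_0 + 2D_{30} + 2D_{60} + D_{90})}{D_0} \]
where
AREA = area parameter (cm2/cm),
D0 = maximum deflection (10−3 mm),
D30 = deflection at a 30-cm offset (10−3 mm),
D60 = deflection at a 60-cm offset (10−3 mm), and
D90 = deflection at a 90-cm offset (10−3 mm).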
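The four parameters can be computed from a single deflection basin as sketched below. The sketch assumes the commonly used definitions SCI = D0 − D30, BDI = D30 − D60, BCI = D60 − D90, and AREA = 15(D0 + 2·D30 + 2·D60 + D90)/D0, with sensor offsets in centimeters. The class name, function name, and deflection values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeflectionBasin:
    """Deflections (10^-3 mm) at 0, 30, 60, and 90 cm from the load."""
    d0: float
    d30: float
    d60: float
    d90: float

def basin_indexes(basin: DeflectionBasin) -> dict:
    """Return SCI, BDI, BCI (10^-3 mm) and AREA (cm2/cm) for one basin."""
    sci = basin.d0 - basin.d30    # surface curvature index
    bdi = basin.d30 - basin.d60   # base damage index
    bci = basin.d60 - basin.d90   # base curvature index
    # Trapezoidal area under the basin (30-cm spacing), normalized by D0.
    area = 15.0 * (basin.d0 + 2 * basin.d30 + 2 * basin.d60 + basin.d90) / basin.d0
    return {"SCI": sci, "BDI": bdi, "BCI": bci, "AREA": area}

indexes = basin_indexes(DeflectionBasin(d0=50.0, d30=35.0, d60=22.0, d90=14.0))
print(indexes)
```

Under this definition, AREA reaches its upper bound of 90 cm2/cm when all four deflections are equal, which corresponds to a very stiff pavement.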
Categorical Features
Nonlinear ML regression algorithms tend to generate more accurate results with the use of qualitative variables ( 13 ); therefore, some features were divided into categories. Surface curvature index (SCI) indicates the condition of the pavement surface layer. SCI values greater than 25 × 10−2 mm indicate a thin or weak surface layer ( 19 – 21 ). Base damage index (BDI) indicates the condition of the pavement base layer. BDI values greater than 40 × 10−2 mm indicate pavements that are not very resistant or have structural problems ( 19 – 21 ). Base curvature index (BCI) indicates the condition of the pavement subgrade. BCI values greater than 10 × 10−2 mm indicate that the subgrade has a CBR (California Bearing Ratio) under 10% or has structural problems ( 19 – 21 ). Since the literature defines clear thresholds for these indexes, the parameters SCI, BDI, and BCI were transformed into binary variables.
Asphalt pavements can be classified into four categories according to the area parameter (i.e., AREA) ( 22 ): weak asphalt pavement (30 cm2/cm ≤ AREA < 38 cm2/cm); thin structure or hot mix asphalt (HMA) (38 cm2/cm ≤ AREA < 53 cm2/cm); thick structure or HMA (53 cm2/cm ≤ AREA < 76 cm2/cm); and rigid pavement behavior or sound PCC (Portland Cement Concrete) (AREA ≥ 76 cm2/cm).
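The binary and categorical transformations described above can be sketched as follows. The thresholds are those quoted in the text (SCI, BDI, and BCI in 10−2 mm; AREA in cm2/cm); the function names, the class labels, and the handling of AREA values below 30 cm2/cm are assumptions.

```python
from bisect import bisect_right

# Thresholds quoted in the text, in units of 10^-2 mm.
SCI_LIMIT, BDI_LIMIT, BCI_LIMIT = 25.0, 40.0, 10.0

# Lower bounds of the AREA classes, in cm2/cm.
AREA_EDGES = [30.0, 38.0, 53.0, 76.0]
AREA_LABELS = [
    "below classified range",  # assumption: AREA < 30 is not defined in the text
    "weak asphalt pavement",
    "thin structure or HMA",
    "thick structure or HMA",
    "rigid behavior or sound PCC",
]

def binary_flags(sci: float, bdi: float, bci: float) -> dict:
    """Return 1 where an index exceeds its threshold, else 0."""
    return {
        "sci_flag": int(sci > SCI_LIMIT),
        "bdi_flag": int(bdi > BDI_LIMIT),
        "bci_flag": int(bci > BCI_LIMIT),
    }

def area_class(area: float) -> str:
    """Map an AREA value to its class; lower bounds are inclusive."""
    return AREA_LABELS[bisect_right(AREA_EDGES, area)]

flags = binary_flags(sci=30.0, bdi=12.0, bci=10.0)
print(flags)
print(area_class(53.4))
```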
Finally, a 5-year cumulative traffic load (W18) was transformed into six categories, through the combination of AASHTO ESAL and AADT variables: W18 ≤ 1 × 10^6; 1 × 10^6 < W18 ≤ 5 × 10^6; 5 × 10^6 < W18 ≤ 1 × 10^7; 1 × 10^7 < W18 ≤ 2.5 × 10^7; 2.5 × 10^7 < W18 ≤ 5 × 10^7; and W18 > 5 × 10^7. These thresholds were defined to generate a balanced category distribution for the traffic features, considering the traffic patterns on the studied highways.
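The six traffic classes can be reproduced with right-inclusive bins, for example with `pandas.cut`. The W18 values below are hypothetical, and the one-hot encoding step is an assumption about how the categories might be passed to a regression algorithm.

```python
import pandas as pd

# Hypothetical 5-year cumulative traffic loads (W18, in ESALs).
w18 = pd.Series([5e5, 1e6, 2e6, 1e7, 2.5e7, 5e7, 6e7], name="w18")

# Bin edges from the text; right=True makes each upper bound inclusive.
edges = [0.0, 1e6, 5e6, 1e7, 2.5e7, 5e7, float("inf")]
traffic_class = pd.cut(w18, bins=edges, labels=False,
                       right=True, include_lowest=True)

# One indicator column per traffic class (assumed encoding).
traffic_dummies = pd.get_dummies(traffic_class, prefix="traffic")
print(traffic_class.tolist())
```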
Performance Prediction Model
Three pavement parameter prediction models were tested with each algorithm. The first predicted the D0 in the next year, based on its history, surface layer age, deflection basin surveys, and traffic load. The second predicted the IRI in the next year, also based on its history, surface layer age, deflection basin surveys, and traffic load. The third predicted the IRI in the next year, based on its history, surface layer age, deflection basin surveys, traffic load, and the predicted D0 in the next year.
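The third model chains the two predictions: the D0 predicted for the next year becomes an additional input to the IRI model. A minimal sketch of this arrangement is given below. The synthetic features, the use of random forests for both stages, and the use of out-of-fold D0 predictions (`cross_val_predict`) are assumptions; the exact procedure followed in this study is not described.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Synthetic stand-ins for history, age, basin, and traffic features.
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0.0, 1.0, size=(n, 4))
d0_next = 40.0 + 20.0 * X[:, 0] + rng.normal(0.0, 1.0, size=n)
iri_next = 1.5 + 0.8 * X[:, 1] + 0.005 * d0_next + rng.normal(0.0, 0.05, size=n)

# Stage 1: predict next-year D0. Out-of-fold predictions are used so that
# no row's D0 feature comes from a model that saw that row's target.
d0_model = RandomForestRegressor(n_estimators=50, random_state=0)
d0_pred = cross_val_predict(d0_model, X, d0_next, cv=5)

# Stage 2: append the predicted D0 as an extra feature for the IRI model.
X_with_d0 = np.column_stack([X, d0_pred])
iri_model = RandomForestRegressor(n_estimators=50, random_state=0)
iri_model.fit(X_with_d0, iri_next)

# Fit the stage-1 model on all data for use on new segments.
d0_model.fit(X, d0_next)
print(X_with_d0.shape)
```

Because the second stage inherits the error of the first, any inaccuracy in the predicted D0 propagates into the IRI prediction.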
To improve the algorithm’s performance, a grid search was conducted on each tested model to determine the optimal hyperparameter configuration.
Tested Algorithms
According to the literature review, the three most accurate ML algorithms in IRI predictions are SVM, RF, and ANN. Since the predicted variable is a real number, the regression versions of these algorithms were selected from the scikit-learn library: SVR, RF regression, and the multilayer perceptron (MLP) regressor. Initially, these algorithms were tested with the default hyperparameters.
Hyperparameter Selection
A model’s hyperparameters are the variables used to control the learning process. Grid search is a technique that selects the optimum hyperparameter values through an exhaustive search over a specified set of candidate values for each hyperparameter of a model, evaluating every combination. This study’s ML algorithms were tested using the hyperparameters presented in Table 1.
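A minimal sketch of a grid search with the scikit-learn `GridSearchCV` class is given below. The candidate values in `param_grid` and the synthetic data are hypothetical and are not the values listed in Table 1.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for a training dataset (assumption, not the study's data).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(150, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0.0, 0.05, size=150)

# Hypothetical candidate values; every combination (2 x 2 x 2 = 8) is tested.
param_grid = {
    "kernel": ["rbf", "linear"],
    "C": [1.0, 10.0],
    "epsilon": [0.01, 0.1],
}

search = GridSearchCV(SVR(gamma="scale"), param_grid, scoring="r2", cv=5)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print(f"Best cross-validated R2: {search.best_score_:.3f}")
```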
Table 1. Tested Hyperparameters
Tested Scenarios
To assess the improvement in each algorithm’s performance, their evaluation metrics were obtained for each tested training dataset and each distinct algorithm hyperparameter configuration, resulting in five scenarios, as presented in Table 2.
Table 2. Tested Scenarios
Note: FWD = falling weight deflectometer; D0 = maximum deflection; IRI = international roughness index.
The historical data ends 1 year before the target feature.
Results and Discussion
In this section, the first two subsections present the algorithm performance metrics on each tested D0 and IRI prediction model and training dataset. The third presents a comparison between the ML simulated results and the field results of 10 highway segments that were not part of any training dataset.
Surface Deflection (D0)
As can be observed in Table 3, the SVR algorithm showed significant performance improvement with the introduction of categorical features. The training datasets that contained any raw data generated poorly fitted models, even when considering the 3-year history dataset. The grid search had no effect on the results, since the default hyperparameters already generated highly accurate predictions, especially when using the 3-year history training dataset.
Table 3. Support Vector Regression Performance Metrics: D0 Prediction
Note: MSE = mean square error.
The RF algorithm generated well-fitted results even without any feature engineering or hyperparameter selection. As shown in Table 4, the optimum result (R2 = 0.98 and MSE = 29 × 10−3 mm) occurred for the dataset with a 3-year pavement history and the introduction of categorical features. Since the default hyperparameters had already generated highly accurate predictions, the grid search had no effect on the results.
Table 4. Random Forest Performance Metrics: D0 Prediction
Note: MSE = mean square error.
Finally, as shown in Table 5, the ANN algorithm also needed categorical features to generate accurate predictions. The datasets that contained any raw data did not generate a viable model. This may have occurred because these first attempts were made using default parameters that did not have enough hidden layers or iterations for the model to establish a more complex relationship between the features. The optimum result (R2 = 0.99 and MSE = 20 × 10−3 mm) was for the dataset with a 3-year pavement history and the best hyperparameter selection (hidden_layer_sizes = 50; learning_rate = ‘adaptive’; max_iter = 800; solver = ‘lbfgs’).
Table 5. Artificial Neural Network Performance Metrics: D0 Prediction
Note: MSE = mean square error; inf = infinite error rate.
Overall, the D0 prediction models produced more accurate results when the training dataset contained a 3-year performance history. This result was expected, since longer histories tend to better describe pavement behavior.
IRI
Two types of training datasets were used for the IRI simulations. The first one (marked with an *) was based only on the available data from field surveys. The second included the expected structural behavior (D0), predicted using the models generated in the previous section.
As in the D0 prediction, the SVR algorithm showed significant accuracy improvements with the introduction of categorical features but was not sensitive to hyperparameter selection (Table 6).
Table 6. SVR Performance Metrics: IRI Prediction
Note: SVR = support vector regression; SVR* = SVR without the expected structural behavior feature; MSE = mean square error.
When comparing the results obtained with the two different datasets (i.e., with and without structural behavior prediction), the model failed to show accuracy improvements in most cases. The only exception was the SVR algorithm trained with a 3-year history dataset and expected structural behavior that resulted in R2 = 0.90 and MSE = 0.03 m/km (Table 6).
It can be observed from Table 7 that the RF algorithm generated accurate results from the beginning, even without the introduction of categorical features. Furthermore, this model was not sensitive to the use of predicted structural behavior on the training dataset.
Table 7. RF Performance Metrics: IRI Prediction
Note: RF = random forest; IRI = International roughness index; RF* = RF without the expected structural behavior feature; MSE = mean square error.
Since the raw data and default hyperparameters already generated highly accurate predictions, feature engineering and the grid search process had no effect on the results. However, use of the 3-year history dataset resulted in improvements in the algorithm performance metrics (Table 7).
As in D0 prediction, the ANN algorithm did not generate a viable IRI regression function without the introduction of categorical features. As can be observed from Table 8, the ANN algorithm also showed performance improvements with use of the 3-year history training dataset.
Table 8. ANN Performance Metrics: IRI Prediction
Note: ANN = artificial neural network; IRI = International roughness index; ANN* = ANN without the expected structural behavior feature; MSE = mean square error; inf = infinite error rate.
The resulting ANN model was not sensitive to the use of predicted structural behavior in the training dataset. The optimum result (R2 = 0.91 and MSE = 0.03 m/km) occurred for the dataset with a 3-year pavement history and the best hyperparameter selection (hidden_layer_sizes = 500; learning_rate = ‘invscaling’; max_iter = 800; solver = ‘lbfgs’).
Overall, the IRI prediction models showed more accurate results when the training dataset contained a 3-year performance history. However, use of the expected structural behavior variable, predicted with the D0 regression functions, did not improve the final algorithm’s performance.
ML Simulated and Field Results
This section presents the results of a comparison between the predicted parameters and field survey parameters of 10 highway segments. These segments were selected to represent diverse traffic loads, structural conditions, functional conditions, and surface layer age, as shown in Table 9. It is worth noting that the field data required for this simulation comprised one survey result for the 2-year history models and two survey results for the 3-year history models.
Table 9. Test Segment Characteristics
Note: D0 = Surface Deflection; IRI = International Roughness Index; SCI = Surface Curvature Index; BDI = Base Damage Index; BCI = Base Curvature Index; AREA = Area Parameter.
As observed in Figure 2, and consistent with the results presented in the previous subsections, the predictions based on models trained with a 3-year pavement performance history were more accurate for both parameters.

Figure 2. ML simulated versus field results: (a) D0 (2-year history training dataset), (b) D0 (3-year history training dataset), (c) IRI (2-year history training dataset), (d) IRI (3-year history training dataset).
The RF algorithm, trained with the 3-year history dataset, presented the most accurate D0 prediction, with MSE < 10 × 10−3 mm for all 10 tested segments. This algorithm’s better performance against field results was attributed to the overfitting prevention intrinsic to its functioning.
When using the 2-year history dataset, the first five tested segments presented less accurate predictions for all algorithms tested. Since this set of segments did not present any similar characteristics, this result was attributed only to the shorter history of the training dataset.
Observing the IRI prediction results, it is possible to conclude that the training dataset with a 3-year pavement performance history generated a more accurate IRI regression model. Most predictions made with the 2-year history training dataset had an MSE > 0.1 m/km and most predictions made with the 3-year history training dataset had an MSE < 0.05 m/km.
Tested Segment #1 presented the worst predictions, with an MSE > 0.3 m/km on all tested algorithms. This segment went from an IRI = 1.93 m/km to an IRI = 2.57 m/km the next year, indicating a localized pavement problem or a survey error. Therefore, the poor model accuracy in this case was considered to be a result of external issues.
Tested Segments #4 and #9 also presented poorly predicted results for most cases, with an MSE > 0.15 m/km. It is worth noting that the three segments that had low accuracy predictions had surface layers that were older than 5 years. This result indicated that this particular feature is significant for an ML model’s performance and its influence on pavement functional performance is not currently well described by the training datasets used.
Finally, the RF algorithm generated the most accurate regression models for IRI prediction. Except for Segments #1, #4, and #9, all tested segments presented an average MSE = 0.04 m/km when using the 3-year history training dataset.
Conclusions and Recommendations
First, the construction of the training dataset had a great impact on the regression model’s performance in most cases, with the exception of the RF algorithm, which showed accurate results even without feature engineering. The RF algorithm’s superior performance was attributed to the random nature of its feature selection that, apart from reducing the possibility of overfitting, gave every feature a chance to be part of the predictive function, even when its impact and relationship to other variables were not initially obvious. SVR and ANN showed significant improvement in performance metrics with predictor selection, particularly with the use of categorical features, presenting results similar to those of the RF algorithm with the final training dataset configuration. Hyperparameter selection had little effect on prediction accuracy, improving only the IRI prediction results, especially with the ANN algorithm.
The training datasets based on a 3-year pavement history generated more accurate results on all tested algorithms. This was anticipated, since longer histories tend to better describe pavement behavior. However, it is worth noting that longer pavement performance histories are not always available. In these cases, a shorter time series or data interpolation could be used, generating sufficiently accurate results.
The D0 regression models were the more accurate of the two, in both simulated and field results. This means that the features selected for the training dataset construction were sufficient for structural behavior prediction. These included D0, deflection basin parameter classes, traffic classes, and surface layer age.
The IRI regression models showed less accurate results, especially when tested against field results. The use of future structural behavior in the training dataset did not improve the generated model’s accuracy; therefore, the use of predicted D0 in the IRI regression function was neither necessary nor advised, since it could introduce more error into the model.
In addition to the predictors already used (i.e., current IRI, D0, deflection basin parameter classes, traffic classes, and surface layer age), the insertion of more features into the model, such as surface layer mixture characteristics and climate, is recommended to improve prediction accuracy.
The ANN models produced better performance metrics in the computer-based cross-validation in all prediction models. However, when tested against field results, RF was more accurate. This indicated that the ANN model may be overfitting the results, so use of the RF algorithm is recommended.
Footnotes
Acknowledgements
The authors thank the Pavement Technology Laboratory of the Polytechnic School of the University of São Paulo for the opportunity to develop this research and the São Paulo State Transportation Agency for providing all the data necessary for this study.
Author Contributions
The authors confirm contribution to the paper as follows: study conception and design: A. Aranha, L. Bernucci; data collection: A. Aranha; analysis and interpretation of results: A. Aranha, L. Bernucci, K. Vasconcelos; draft manuscript preparation: A. Aranha, L. Bernucci, K. Vasconcelos. All authors reviewed the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Data Accessibility Statement
The input data and research outcomes of this study are available from the corresponding author on request.
