Abstract
The use of pavement condition data to support maintenance and resurfacing strategies and justify budget needs becomes more crucial as more data-driven approaches are being used by the state highway agencies (SHAs). Therefore, it is important to understand and thus evaluate the influence of data variability on pavement management activities. However, owing to a huge amount of data collected annually, it is a challenge for SHAs to evaluate the influence of data collection variability on network-level pavement evaluation. In this paper, network-level parallel tests were employed to evaluate data collection variability. Based on the data sets from the parallel tests, classification models were constructed to identify the segments that were subject to inconsistent rating resulting from data collection variability. These models were then used to evaluate the influence of data variability on pavement evaluation. The results indicated that the variability of longitudinal cracks was influenced by longitudinal lane joints, lateral wandering, and lane measurement zones. The influence of data variability on condition evaluation for state routes was more significant than that for interstates. However, high variability of individual metrics may not necessarily lead to high variability of combined metrics.
Pavement condition evaluation is a key element in a network-level pavement management system ( 1 ) in that it provides a basic evaluation of overall network condition and is used for identifying needs of maintenance, reconstruction, and rehabilitation. Since huge amounts of pavement condition data are collected and processed annually, data collection variability has become a major challenge for highway management agencies.
Several factors influence data collection variability. Even with well-calibrated equipment and well-trained operators, variability of pavement condition data collected by automated collection equipment is unavoidable because uncertainty is a part of measurement ( 2 ). From the perspective of highway agencies, variability of pavement condition data becomes a concern when these data are used for budget/maintenance needs analysis ( 3 , 4 ) and performance target setting ( 5 ). However, efficient ways to evaluate the influence of pavement condition data variability on a network-level performance evaluation are lacking. It is time-consuming and sometimes even impossible to evaluate data variability by the same method used for project-level performance evaluation for which data variability could be determined by multiple test runs.
As indicated in a previous study ( 6 ), a large total error may occur in any single measurement. Therefore, to obtain a clear picture of the road condition for the entire network, it is of importance for state highway agencies (SHAs) to understand and evaluate the influence of data variability on network-level pavement evaluation. For example, when SHAs set pavement performance targets, the variability of pavement condition data could potentially affect the final distribution of pavement condition which is used to determine the baseline performance. If those segments which are subject to collection variability can be identified and considered in budget/performance analyses, pavement engineers will be able to set more reasonable performance targets and make more reliable budget and resurfacing plans.
Some efforts have been made to evaluate data variability of manual and automated rating methods. Landers et al. introduced Cohen’s weighted kappa statistics into quality assurance/quality control processes to assess the level of agreement beyond chance among raters for the British Columbia Ministry of Transportation ( 7 ). The index served as an overall estimate of level of agreement between a manual benchmark survey and contractors’ ratings. Ong et al. developed a set of performance measures to quantify the quality of data collected and processed by automated techniques. These measures can also be used for assessing the effects of sampling on overall condition ratings ( 8 ). To improve the Northern New Mexico Pavement Evaluation Program, Bogus et al. proposed a framework consisting of assessment of consistency over time and assessment of agreement between evaluators ( 9 ). By investigating data from the Long-Term Pavement Performance (LTPP) program’s distress rater accreditation workshop, Rada et al. found that individual-rater variability was large and increased with distress quantity ( 10 , 11 ). Ong and Sinha evaluated the quality of pavement roughness data collected at the project and network levels ( 12 ). They found the International Roughness Index (IRI) data quality at project level was affected by the individual run, the wheel path, and the lane in which the profile was measured, whereas IRI quality at network level was affected by the wheel path profile. Siabil et al. proposed a computational technique for detecting errors in network-level pavement condition data sets ( 13 ). They also developed an error detection method that considered all the data properties and consistency among multiple performance indicators ( 14 ).
There are also studies focusing on the influence of data variability on network-level pavement management activities. By investigating longitudinal profile data from LTPP, Yin et al. developed measures of the variability between routine network-level profile visits to a site. They also provided recommendations for quantifying and controlling the variability of longitudinal profile data ( 15 ). Saliminejad and Gharaibeh quantified the impact of systematic and random error in pavement condition data on the results of Maintenance and Rehabilitation (M&R) analysis. They concluded that, in practice, both systematic and random errors can highly distort the analysis results even within acceptable error ranges ( 16 ). To understand and quantify the influence of IRI measurement variability on pavement evaluation, Jia et al. constructed a probabilistic relationship to consider the influence of run-to-run variability of IRI for network-level pavement evaluation. The uncertainty of pavement evaluation was introduced in the process ( 17 ). Jia et al. also developed methods to consider the influence of variability of surface distress data on pavement condition evaluation ( 18 ).
In this study, network-level data collection variability was evaluated using parallel testing that was conducted by two testing vehicles equipped with the identical testing device, Laser Crack Measurement System (LCMS), which was developed by Pavemetrics System Inc. in Canada. The results of the analysis provided understanding of network-level data variability and its influences on pavement condition evaluation. The method used in this study is based on state-of-the-art data mining technology and could be used by agencies in their data quality management plans to identify potential segments that could be subject to data variability and to quantify the influence of data variability on network pavement evaluation. A demonstration of how to implement the proposed method into pavement management activities is also provided by this study.
Objective and Scope
The objective of this study was to evaluate the variability of network-level data collection and its influence on pavement condition evaluation. Classification models were constructed using the random forest (RF) method to identify road segments subject to data variability. The pavement condition data used in this study were from the Tennessee Department of Transportation’s (TDOT) pavement management system (PMS).
Parallel Data Collection
Network-level data collection variability was evaluated through parallel tests performed on selected road segments using two vehicles. One vehicle is owned by the data collection vendor and is used for annual network-level data collection. The other is owned by TDOT and is used for the agency's own data collection. Both vehicles are equipped with the identical data collection/processing system. The equipment consisted of a 3-D imaging system that was used to identify surface characteristics, which were then used for extracting data on surface roughness and distresses by the automated distress identification method. The two testing vehicles were validated at the same test site using the same validation procedure. To decrease systematic errors caused by vehicle operators, testing conditions, and surface properties, the parallel tests were performed within 2 weeks under the same environmental conditions by operators who were certified through the validation test. Test routes were selected based on the type and extent of the surface distress. A route comprising 180 centerline miles was selected for parallel testing. It should be noted that, owing to the large effort required, it is difficult to establish ground truth for evaluating network-level data accuracy. Instead, in this study the parallel tests evaluated data repeatability and its influence on pavement evaluation.
The pavement condition data used in this study consisted of roughness and surface defects. The roughness data included the IRI and rut depth, which characterize surface profiles in the longitudinal and transverse directions of the pavement, respectively. The surface defects were cracking-type distresses, which form the main basis of pavement evaluation in Tennessee. There were five major types of cracks from TDOT’s PMS: fatigue cracks, longitudinal wheel-path cracks, longitudinal nonwheel-path cracks, block cracks, and transverse cracks; there was also a national measure: crack percent. All data were reported in 0.1-mi segments, which were used for further analysis in this study.
Roughness and individual surface distresses are generally combined in indices for evaluating pavement condition. Three indices were used in this study: the pavement smoothness index (PSI), pavement distress index (PDI), and the pavement quality index (PQI). PSI is a function of IRI, whereas PDI is a function of individual distress and rut depth. By combining PSI and PDI, PQI can be calculated using the following equation:
Three categories are employed to evaluate the condition of a pavement. The thresholds are defined in relation to PQI, which are listed in Table 1.
Definition of TDOT Performance Measures
Note: TDOT = Tennessee Department of Transportation; PQI = pavement quality index.
As pavement evaluation methods vary between states, national performance measures were proposed to provide each SHA with a standard, facilitating the reporting of pavement condition in the same manner. Four performance metrics are used for pavement condition evaluation: IRI, crack percent, rut depth (for asphalt pavement only), and faulting (for concrete pavement only). In this study, a segment was initially rated according to IRI, rut depth, and cracking percent. For each pavement type, three metrics were used to determine performance measures, comprising three categorical conditions (Good, Fair, and Poor). The thresholds for asphalt pavements are listed in Table 2. The final rating was the combination of these three metrics. With all three metrics rated as “Good,” the overall rating of a segment is Good, whereas if two or more metrics are rated “Poor,” the overall rating is Poor. All other combinations are considered to be “Fair.”
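The combination rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the agency's implementation. The numeric thresholds are assumptions taken from the commonly published federal values for asphalt pavement and stand in for Table 2, which is not reproduced here; the function names are hypothetical.

```python
from collections import Counter

# ASSUMED thresholds for asphalt pavement: (upper limit of Good, upper limit of Fair).
# They stand in for Table 2 and should be checked against it before use.
THRESHOLDS = {
    "iri": (95.0, 170.0),          # in./mi
    "rut_depth": (0.20, 0.40),     # in.
    "crack_percent": (5.0, 20.0),  # %
}


def rate_metric(name, value):
    """Rate a single metric as Good, Fair, or Poor using the assumed thresholds."""
    good_limit, fair_limit = THRESHOLDS[name]
    if value < good_limit:
        return "Good"
    if value <= fair_limit:
        return "Fair"
    return "Poor"


def overall_rating(metric_ratings):
    """Combine metric ratings: all Good -> Good, two or more Poor -> Poor, else Fair."""
    counts = Counter(metric_ratings)
    if counts["Good"] == len(metric_ratings):
        return "Good"
    if counts["Poor"] >= 2:
        return "Poor"
    return "Fair"


def rate_segment(iri, rut_depth, crack_percent):
    """Rate one 0.1-mi segment from its three asphalt metrics."""
    ratings = [
        rate_metric("iri", iri),
        rate_metric("rut_depth", rut_depth),
        rate_metric("crack_percent", crack_percent),
    ]
    return overall_rating(ratings)
```

Because a single Poor metric is not enough to make a segment Poor overall, the combined rating is less sensitive to any one metric than the individual ratings are.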
Definition of National Performance Measures ( 19 )
Note: IRI = international roughness index.
Table 3 lists the statistical summary of pavement condition data from the parallel tests. Surface cracks of differing types and severity levels were observed on the parallel test road segments. Cracks were classified according to three severity levels: low, moderate, and high. IRI from both wheel paths and half car were collected and included in the models. IRI results were then averaged to calculate PSI. Rut depths from both wheel paths were also included. Crack percent was collected and used for determining national performance measures. All other types of surface cracks along with average rut depth were then used to calculate PDI. Therefore, in Table 3, PSI, PDI, and PQI calculated from other pavement condition data are presented. Figure 1 illustrates the histogram of pavement condition data collected from the parallel tests.
Summary of Pavement Condition Data
Note: HPMS = highway performance monitoring system; IRI = international roughness index; PSI = pavement smoothness index; PDI = pavement distress index; PQI = pavement quality index.

Histogram of pavement condition data from two testing devices.
Variability of Network-Level Data Collection
Comparison of Distributions of Pavement Condition Data
The purpose of network-level pavement evaluation is to provide a global view of pavement conditions over the entire network instead of for individual segments. The pavement condition distribution is of interest and importance as it may be used for budget allocation, strategic investment planning, performance target setting, and public reporting. Therefore, distributions of pavement condition data were evaluated using the Kolmogorov–Smirnov test (K-S test).
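The K-S test is used to compare two data sets to determine whether they are significantly different. The K-S statistic, $D_{n,m}$, developed from the K-S test quantitatively evaluated the distance between the cumulative distribution functions of the two data sets collected in the parallel tests. The K-S statistic is a supremum function, which is defined as
$$D_{n,m} = \sup_x \left| F_{1,n}(x) - F_{2,m}(x) \right|$$
where
$F_{1,n}(x)$, $F_{2,m}(x)$ = empirical cumulative distribution functions of the two data sets,
$n$, $m$ = numbers of observations in the two data sets, and
$\sup$ = supremum over all values of $x$.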
Table 4 lists the summary of the K-S tests for each data item collected from the parallel tests. A p-value < 0.05 indicated that the distributions of the longitudinal wheel-path cracks, nonwheel-path cracks, and rut depths differed significantly between the two data sets.
Results of the Kolmogorov–Smirnov Tests for the Parallel Tests
Note: HPMS = highway performance monitoring system; IRI = international roughness index. Bold in “p-value” column indicates a significant difference between two data sets.
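As an illustration of the comparison described in this section, the two-sample K-S statistic can be computed directly from the two empirical cumulative distribution functions. The sketch below uses only the Python standard library. The function names are hypothetical, and the critical value uses the standard large-sample approximation, which is an assumption here rather than the procedure used to produce Table 4.

```python
import bisect
import math


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    n, m = len(a), len(b)
    d = 0.0
    # Both ECDFs are step functions, so the largest gap occurs at an observed value.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / n
        cdf_b = bisect.bisect_right(b, x) / m
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample critical value; D above this suggests different distributions."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha * math.sqrt((n + m) / (n * m))


# Hypothetical rut depths (in.) for the same segments from two vehicles.
vendor = [0.10, 0.12, 0.15, 0.18, 0.20, 0.22]
agency = [0.08, 0.09, 0.11, 0.13, 0.16, 0.17]
d = ks_statistic(vendor, agency)
differs = d > ks_critical_value(len(vendor), len(agency))
```

For real network-level data, a library routine that also returns an exact or asymptotic p-value would normally be preferred over the approximate critical value.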
Evaluation of Data Collection Variability
As the distributions of rut depth and longitudinal cracks between the two data sets were significantly different, further investigations were conducted to find out what might have caused this difference.
Rut Depth
Figure 2 illustrates the cumulative distributions of rut depth from the two data sets. As can be seen, the cumulative distribution curves of rut depth collected by the vendor were slightly higher than those collected by TDOT. However, the cumulative distribution curves from the two data sets were slightly closer to each other on the left than on the right wheel-path.

Comparison of distribution of rut depth from two wheel paths (in.): (a) left rut depth, and (b) right rut depth.
It should be noted that the rut depth in this study was reported in 0.1-mi intervals. For each 0.1-mi segment, the rut depth was measured at 0.005-mi intervals and averaged for reporting. A total of 20 different cross-sections were used for calculating rut depth within 0.1-mi segments. For network-level data collection, it is difficult to ensure individual segment termini matching between different testing vehicles (i.e., 0.1-mi segment) because of the potential cumulative error from distance measuring instruments (DMIs; used to measure and record elapsed distance traveled by a vehicle). Further investigation may be needed to evaluate the influence of DMI measurements on the variability of rutting. As the evaluation of DMI measurements is beyond the scope of this study, no more discussion is provided in this paper.
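The effect of mismatched segment termini can be illustrated with a small sketch. It averages 20 consecutive cross-section readings into one 0.1-mi value, as described above, and then shifts the starting point to mimic a distance offset between vehicles. The readings, the offset, and the function name are hypothetical; the sketch does not model an actual DMI.

```python
from statistics import mean

READINGS_PER_SEGMENT = 20  # 0.1 mi divided by the 0.005-mi measurement interval


def segment_means(readings, offset=0):
    """Average consecutive groups of 20 readings; offset shifts the segment termini."""
    shifted = readings[offset:]
    n_full = len(shifted) // READINGS_PER_SEGMENT
    return [
        mean(shifted[i * READINGS_PER_SEGMENT:(i + 1) * READINGS_PER_SEGMENT])
        for i in range(n_full)
    ]


# Hypothetical rut depths (in.): a short rutted stretch between two sound stretches.
readings = [0.10] * 20 + [0.30] * 20 + [0.10] * 20

aligned = segment_means(readings)           # termini match the change in condition
shifted = segment_means(readings, offset=5)  # termini displaced by 0.025 mi
```

With the termini displaced, part of the rutted stretch is averaged into the neighboring segment, so the same pavement yields different 0.1-mi values even though every individual reading is identical.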
Longitudinal Cracks
Figure 3 illustrates the cumulative distributions of longitudinal wheel-path and nonwheel-path cracks from the two data sets. As the cumulative distributions of cracks at different severity levels exhibited similar patterns, only the distributions of total cracking are illustrated. It was found that the cumulative distribution curves of both longitudinal wheel-path and nonwheel-path cracks collected by the vendor were slightly higher than those collected by TDOT.

Comparison of distributions of longitudinal cracks (%): (a) total longitudinal wheel-path cracking, and (b) total longitudinal nonwheel-path cracking.
Potential contributors to the differing distributions of longitudinal wheel-path and nonwheel-path cracking are summarized in the following.
Longitudinal lane joints were rated as longitudinal nonwheel-path cracks
In Tennessee, longitudinal lane joints are rated separately and are not factored into calculations of pavement condition indices. As illustrated in Figure 4, the longitudinal lane joints might have been rated as longitudinal nonwheel-path cracks because the locations of these lane joints were generally close to nonwheel-path zones, or Zone 1 or Zone 5 as per AASHTO R85 “Standard Practice for Quantifying Cracks in Asphalt Pavement Surfaces from Collected Pavement Images Utilizing Automated Methods.” Once covered by nonwheel-path zones, longitudinal lane joints are generally rated as nonwheel-path cracking by automated image processing systems. Currently, there is no other way of identifying this type of error than by reviewing the downward pavement images manually.

Longitudinal lane joints rated as longitudinal nonwheel-path cracks.
Lateral wandering during data collection
Lateral wandering, which results in incomplete coverage of lane width, may contribute to data variability. Figure 5 illustrates downward images collected from two testing vehicles. The surface distresses for this segment indicated that there was a significant difference in fatigue cracking and longitudinal wheel-path and nonwheel-path cracking between the two testing vehicles. As illustrated in Figure 5a, owing to lateral wandering, the full lane was not captured in the downward image. Some surface information was also missing on the left side of the lane where excessive distresses can be observed (Figure 5b).

Incomplete coverage of lane width because of lateral wandering: (a) downward image from Vehicle #1, and (b) downward image from Vehicle #2.
Variations of AASHTO measurement zones
As per AASHTO PP67, five zones across a pavement section are defined to determine nonwheel-path and wheel-path areas. Zone 1 in Figure 6a includes areas where pattern cracks were observed, whereas Zone 1 in Figure 6b excluded the pattern cracks. As per AASHTO PP67, widths and locations are only specified for Zones 2, 3, and 4. The widths for Zones 1 and 5 mainly depend on the lane boundary. For automated methods, the lane boundaries are generally determined based on the pavement markings or surface features. Lane boundaries may be subject to change where there is a change in surface features. Figure 6 provides an example in which the lane boundaries were different according to the two testing vehicles. Zone 1 in Figure 6a appears to be wider than in Figure 6b. As a result, the distress quantities in Figure 6a were higher than in Figure 6b.

Variation of AASHTO measurement zone: (a) downward image from Vehicle #1, and (b) downward image from Vehicle #2.
Classification Models
Owing to collection variability, the final performance ratings based on the pavement condition data from the two data sets could be different. Therefore, RF classification models were constructed to identify the segments whose ratings may be subject to data collection variability.
Random Forest Classification Model
As one of the most powerful machine learning technologies, the RF method has been used in a wide range of applications. In recent years, RF modeling has been successfully used in pavement management activities. The RF method is based on the classification and regression tree (CART) algorithm, which is used for growing the decision trees of the “forest.” The term “random” means the samples used for growing a tree are randomly selected. A subset of samples is randomly generated, a step also called the “bootstrapping process” or “row subsampling.” The independent variables considered are also randomly selected using “column subsampling.” Through row and column subsampling methods, the risk of overfitting can be reduced and thus model performance can be improved.
The RF algorithm is designed to grow a series of decision trees by selecting several splits such that the training data set is split into several subsets. To grow a single decision tree, a series of nodes are determined by minimizing the difference within each subset and maximizing the difference among subsets. Impurity measures are typically used for quantitatively determining the degree of impurity, which describes the heterogeneity of a data set. For categorical dependent variables, the Gini index is commonly employed. The Gini index was determined by Equation 1.
$$G(t) = 1 - \sum_{i} p(i \mid t)^2 \qquad (1)$$
where
$i$ = category of the target categorical variable,
$p(i \mid t)$ = proportion of observations of category $i$ in node $t$, and
$t$ = node being split.
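For a certain split, a data set can be divided into left node, $t_L$, and right node, $t_R$. Then the decrease in the Gini index, which describes the impurity of a data set, can be written as Equation 2. The split of a node in a decision tree is determined by finding the splitting value that can maximize $D(s,t)$.
$$D(s,t) = G(t) - p_L\,G(t_L) - p_R\,G(t_R) \qquad (2)$$
where
$s$ = split on decision node,
$G(\cdot)$ = Gini index of a node, and
$p_L$, $p_R$ = proportions of the observations in node $t$ sent to $t_L$ and $t_R$, respectively.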
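The impurity calculation and split search described above can be sketched as follows. The sketch assumes the standard CART form of the Gini index (one minus the sum of squared class proportions) and a weighted impurity decrease for a binary split; the function names and the sample IRI values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity decrease when a parent node is split into left and right nodes."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


def best_threshold(values, labels):
    """Scan midpoints between sorted values and return (largest decrease, threshold)."""
    pairs = sorted(zip(values, labels))
    best = (0.0, None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        decrease = gini_decrease(labels, left, right)
        if decrease > best[0]:
            best = (decrease, threshold)
    return best


# Hypothetical example: IRI (in./mi) against the consistency group of each segment.
iri = [80, 90, 100, 110]
group = ["GG", "GG", "GF", "GF"]
decrease, threshold = best_threshold(iri, group)
```

A random forest repeats this search at every node of every tree, each time on a bootstrap sample of the rows and a random subset of the predictors.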
Consistency of Parallel Data Collection
A segment in the road network may be incorrectly rated because of measurement variability. Figure 7 provides an example of determining the probability of a segment being incorrectly rated owing to data variability. For example, Segment #1 was initially rated Good in relation to the threshold of “Good/Fair.” However, the gray area under the curve indicates the probability that this segment may be rated as Fair. Similarly, there was a probability that Segment #2 might be incorrectly rated as Fair as indicated in the gray area under the Segment #2 curve. As constructing probabilistic curves is very costly and time-consuming, parallel testing was employed to evaluate network-level data collection variability. The same road segments were tested and rated separately by two vehicles under the same testing and surface conditions. Generally, ratings from the two vehicles were the same for the same segment. However, there were also some segments for which ratings from the two vehicles were different. The segments with inconsistent ratings from the two testing vehicles were of interest because these segments could potentially influence network-level performance ratings and annual maintenance work plans. On identification of these segments, one might be able to focus on those that are subject to data variability. For example, segments initially rated Good may actually be in Fair condition because of data variability. However, if these segments are still in the early stages of Fair condition, pavement preservation may be an option. Conversely, there might be a few segments that are in or close to Poor condition rated as Fair. The application of pavement preservation to those segments would not be a cost-effective solution.

Variability of performance ratings.
In the parallel tests, individual segments were classified into three consistent and two inconsistent groups. The three consistent groups (GG, FF, and PP) included all the segments with the same ratings from the two testing vehicles. GG means the segment was rated as Good by both vehicles; FF means the segment was rated as Fair by both vehicles; and PP means the segment was rated Poor by both vehicles. The two inconsistent groups, meaning the segment may have been subject to collection variability, were GF and FP: the segment was rated as Good and Fair or as Fair and Poor, respectively, by the two vehicles.
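The grouping above amounts to a small lookup on the pair of ratings. The sketch below is illustrative only and the function name is hypothetical. The text defines no group for a Good/Poor disagreement, so returning `None` for that case is an assumption.

```python
ORDER = {"Good": 0, "Fair": 1, "Poor": 2}
ABBREVIATION = {"Good": "G", "Fair": "F", "Poor": "P"}


def consistency_group(rating_vehicle_1, rating_vehicle_2):
    """Return GG, FF, PP, GF, or FP for a pair of ratings, regardless of vehicle order."""
    better, worse = sorted((rating_vehicle_1, rating_vehicle_2), key=ORDER.__getitem__)
    group = ABBREVIATION[better] + ABBREVIATION[worse]
    if group == "GP":
        # ASSUMPTION: Good/Poor pairs are not among the five groups, so flag them.
        return None
    return group


def is_inconsistent(group):
    """True for the two groups that indicate possible collection variability."""
    return group in ("GF", "FP")
```

For example, `consistency_group("Fair", "Good")` and `consistency_group("Good", "Fair")` both return `"GF"`, so the label does not depend on which vehicle produced which rating.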
Model Input Variables and Parameters
The input variables for the classification models are listed in Table 5. The dependent (categorical) variables mirrored the five aforementioned groups: three levels indicating consistent ratings between testing vehicles (GG, FF, and PP) and two levels indicating inconsistent ratings (GF and FP).
Sampling of Imbalanced Data Sets
Note: IRI = International Roughness Index; PQI = pavement quality index; FF = the segment was rated as “Fair” by both vehicles; FP = the segment was rated as “Fair” by one vehicle and “Poor” by another; GF = the segment was rated as “Good” by one vehicle and “Fair” by another; GG = the segment was rated as “Good” by both vehicles; PP = the segment was rated as “Poor” by both vehicles; NA = not available.
The model input variables (predictors) included roughness data (IRI and rut depth); surface distresses (fatigue cracks, longitudinal wheel-path and nonwheel-path cracks, transverse cracks, block cracks, and cracking percent); and pavement performance indices (PSI, PDI and PQI).
The two parameters for the RF model were the number of trees and the number of variables in the model. Generally, as the number of trees in a model increases, the accuracy of the prediction increases. However, too many trees may result in a significant increase in computation time without improvements in model accuracy. When developing an RF model, the errors in each group need to be balanced. Based on the results from trial tests, the number of trees for this study was determined to be 500. The number of variables was used to specify how many variables (serving as predictors) were to be used for splitting for each “tree” in the “forest.” The default value for the number of variables in a classification tree is the square root of the total number of input variables. Based on the results from trial tests, the number of variables was determined to be five. The variables in the tree were randomly selected by the algorithm.
Resampling of Imbalanced Data Sets
A data set is considered to be imbalanced when there are disproportionate ratios of observations in each class ( 20 ), that is, the number of examples representing certain classes is much lower than others. In this study, the numbers of segments in groups GF, FP, and PP were significantly lower than those in groups GG and FF. It should be noted that models established on imbalanced classes may result in a bias toward the majority class. Like other machine learning models, the RF algorithm is designed to maximize accuracy and minimize error. Therefore, the final model may increase the number of false negatives, which means the model may not perform well in identifying segments with inconsistent ratings.
To reduce the number of false negatives in the models, the data set had to be resampled. The resampling method used in this study was the synthetic minority over-sampling technique (SMOTE). SMOTE generates synthetic samples for existing minority groups through the “k nearest neighbors” method. Each synthetic sample is interpolated between a minority sample and one of its nearest neighbors, which are determined by the Euclidean distance between the data points. Table 5 presents the results of the resampling of the data sets used for the RF models. Note that there were only four classes for rut depth with no samples for the FP class. This was because rut depth for the majority of segments was in Good condition, and the proportion of Poor condition segments was less than 0.5%. Therefore, the average rate after resampling for rut depth was 25%. The average rates for other groups after resampling were 20%.
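The interpolation step of SMOTE can be sketched with the standard library alone. This is a simplified illustration, not the implementation used in the study: it oversamples a single minority class, the function name and the sample points are hypothetical, and the default of five neighbors is an assumption.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic points on line segments between minority points and their neighbors."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are needed to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point by Euclidean distance.
        neighbors = sorted(
            (point for point in minority if point is not base),
            key=lambda point: math.dist(base, point),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(
            tuple(b + gap * (nb - b) for b, nb in zip(base, neighbor))
        )
    return synthetic


# Hypothetical minority-class segments described by (IRI in in./mi, rut depth in in.).
minority = [(160.0, 0.35), (172.0, 0.38), (168.0, 0.41), (181.0, 0.36)]
new_points = smote(minority, n_synthetic=10, k=2)
```

Because every synthetic point lies between two observed minority points, the resampled class stays within the range of the measured data. In practice the predictors would usually be scaled first so that no single variable dominates the distance.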
Training and Validation of Classification Models
The performance of classification models was evaluated by model statistics, such as “sensitivity,” “specificity,” and “accuracy.” In binary classification, the metric “sensitivity” is used to evaluate the model’s ability to predict true positives, whereas “specificity” is used to evaluate the model’s ability to predict true negatives in each category. “Accuracy” is the ratio of correct predictions to the total number of predictions. Generally, the higher the accuracy, the better the model performs. In this study, multiclass classification models were constructed. These model statistics were calculated by a one-versus-all approach. For example, when the metrics for GG are calculated, true positives are GG segments classified as GG; true negatives are non-GG segments classified as non-GG; false positives are non-GG segments classified as GG; and false negatives are GG segments classified as non-GG.
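The one-versus-all calculation can be sketched as follows. The function names and the sample labels are hypothetical and are not results from the study.

```python
def one_vs_all_metrics(actual, predicted, positive_class):
    """Sensitivity and specificity for one class treated as positive, all others as negative."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive_class and p == positive_class:
            tp += 1
        elif a != positive_class and p == positive_class:
            fp += 1
        elif a == positive_class:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"sensitivity": sensitivity, "specificity": specificity}


def overall_accuracy(actual, predicted):
    """Share of segments whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical labels for six test-set segments.
actual = ["GG", "GG", "GF", "GF", "FF", "FF"]
predicted = ["GG", "GF", "GF", "GF", "FF", "GG"]
gf_metrics = one_vs_all_metrics(actual, predicted, "GF")
accuracy = overall_accuracy(actual, predicted)
```

In this example both GF segments are found (sensitivity of 1.0), but one of the four non-GF segments is wrongly flagged as GF (specificity of 0.75).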
To construct classification models, the resampled data set was split into two subsets for each model with 70% of the samples used as the training set and 30% as the testing set. Table 6 presents the summary statistics for each model. From the results in Table 6, it can be observed that the overall accuracy for all classification models was greater than 0.9, which means the models generally performed well and may be used to preliminarily identify those segments subject to variability of performance ratings. For national performance measures, it was found that the overall accuracy of the classification model for rut depth was the highest, followed by IRI and overall rating, whereas the overall accuracy of the crack percent model was lowest (0.92). For the TDOT indices, PQI (state route) model accuracy was slightly higher than that of PQI (interstate). The positive predictions for FP and GF groups were of interest because the two groups were used to identify those segments subject to inconsistent ratings. As can be seen from Table 6, the models generally performed better in classifying the FP than the GF group, except for the classification model for crack percent. It should also be noted that, owing to the limited FP group data, the classification model for rut depth was not capable of identifying segments for this group. However, this model may be improved when more data are available in the future.
Summary of Model Statistics
Note: TDOT = Tennessee Department of Transportation; IRI = International Roughness Index; PQI = pavement quality index; 95% CI = 95% confidence interval; FF = the segment was rated as “Fair” by both vehicles; FP = the segment was rated as “Fair” by one vehicle and “Poor” by another; GF = the segment was rated as “Good” by one vehicle and “Fair” by another; GG = the segment was rated as “Good” by both vehicles; PP = the segment was rated as “Poor” by both vehicles; NA = not available.
Influence of Data Variability on Performance Measures
The classification models developed were applied to 2019 pavement condition data to identify segments whose performance rating was subject to change owing to data variability. To investigate the influence of data variability on performance measures, adjustments to performance ratings were applied, as illustrated in Figure 8. For each segment, the ratings adjustment was made in relation to the original rating and the class determined based on classification models. There are different ways to adjust the performance ratings when considering the influence of data variability on pavement evaluation. In this study, adjustments were made to ensure that no segment missed the optimal timing for pavement preservation. For example, if preventive maintenance is recommended on segments in the early stages of a Fair condition, segments close to Fair may be excluded from preventive maintenance programs if they are rated as Good owing to data variability.

Adjustment of performance ratings by class.
Figure 8 illustrates how performance ratings were adjusted in this study.
For segments with an original rating of Good, there were three possible classes: GG, GF, and FF. Both GF and FF were adjusted to Fair given they may be close to a Fair condition.
There were five classes for segments originally rated as Fair. Segments in Group GG were adjusted to Good because they were close to a Good condition. Similarly, those in Group PP were adjusted to Poor.
There were three classes for segments originally rated as Poor. Segments in Groups FF and FP were adjusted to Fair condition.
In Figure 8, one might find that there are some scenarios for which the original rating was different from the class, for example, an original rating of Good with a class of FF, an original rating of Fair with class of PP, and so forth. These segments were also subject to data collection variability. For example, a segment with an IRI of 93 in./mi, a rut depth of 0.18 in., and a crack percentage of 4% will be rated as Good as per the definition in Table 2. However, based on the RF model, the condition of this segment was closer to Fair. Therefore, the original rating could have been incorrectly determined.
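The adjustment rules described for Figure 8 can be expressed as a lookup table. The sketch below is an interpretation of the text, not the figure itself. The entries marked as assumptions (ratings left unchanged, and PP as the third class for segments originally rated Poor) are not stated explicitly above, and the function name is hypothetical.

```python
from collections import Counter

# Adjusted rating by original rating and predicted class.
ADJUSTMENT = {
    "Good": {"GG": "Good",   # ASSUMPTION: unchanged
             "GF": "Fair", "FF": "Fair"},
    "Fair": {"GG": "Good", "PP": "Poor",
             "GF": "Fair", "FF": "Fair", "FP": "Fair"},  # ASSUMPTION: unchanged
    "Poor": {"FF": "Fair", "FP": "Fair",
             "PP": "Poor"},  # ASSUMPTION: third class is PP and is unchanged
}


def adjust_rating(original, predicted_class):
    """Return the adjusted rating; combinations not listed keep the original rating."""
    return ADJUSTMENT[original].get(predicted_class, original)


def rating_shares(ratings):
    """Percentage of segments in each condition category."""
    counts = Counter(ratings)
    return {name: 100.0 * counts[name] / len(ratings) for name in ("Good", "Fair", "Poor")}


# Hypothetical segments: (original rating, class predicted by the RF model).
segments = [("Good", "GG"), ("Good", "GF"), ("Fair", "FF"), ("Fair", "PP"), ("Poor", "FP")]
before = rating_shares([original for original, _ in segments])
after = rating_shares([adjust_rating(original, cls) for original, cls in segments])
```

Comparing `before` and `after` gives the kind of percentage shift by condition category that the rating comparison reports.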
Figure 9 illustrates the comparison of original and adjusted performance ratings. Figure 9, a–d, illustrate the changes of performance ratings by national metrics, whereas Figure 9e illustrates the changes by PQI-based TDOT measures.

Figure 9. Influence of rating adjustments on final ratings: (a) ratings by IRI, (b) ratings by rut depth, (c) ratings by cracking percent, (d) overall ratings, and (e) PQI ratings.
For national performance measures, Figure 9a illustrates the change in the percentage of performance measures by IRI. The percentages of Good and Poor conditions decreased slightly for both interstates and state routes, whereas the percentage of Fair condition increased. In general, the influence of rating adjustment on performance measures was more significant for state routes than for interstates. Figure 9b illustrates the percentage change in performance measures by rut depth. Rating adjustments caused little change in the percentage of Poor condition, mainly because the rut depths for the majority of the segments were significantly lower than the “Fair/Poor” threshold. In addition, the percentage change of the “Good/Fair” condition for state routes was higher than that for interstates. Figure 9c illustrates the percentage change in performance measures by cracking percent. Rating adjustments caused a slight change in the percentage of Poor condition. Figure 9d illustrates the percentage change in performance measures by overall rating. The changes in “Good/Fair/Poor” percentages for overall ratings were generally smaller than those for individual measures, which means that data variability of a single metric (e.g., crack percent) may not necessarily increase the variability of overall ratings.
For PQI, rating adjustments had little influence on the percentage of Poor segments for interstates. For state routes, the percentage of Fair condition increased significantly (10%), whereas the percentage of Poor decreased slightly. From the perspective of budget allocation, if segments in Poor condition are considered candidates for major rehabilitation and those in Fair condition candidates for preservation and minor rehabilitation, adjustments to performance ratings could slightly shift budgetary priorities from major rehabilitation to a preservation/minor rehabilitation program.
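The comparisons in Figure 9 amount to computing the share of segments in each condition before and after adjustment and taking the difference. The sketch below shows one way to do this, assuming shares are computed by segment count (equal-length segments). The function names are hypothetical, and the input lists are made-up illustrative values, not data from this study.

```python
from collections import Counter

CONDITIONS = ("Good", "Fair", "Poor")


def rating_shares(ratings):
    """Percentage of segments in each condition (by segment count)."""
    total = len(ratings)
    if total == 0:
        raise ValueError("ratings must not be empty")
    counts = Counter(ratings)
    return {c: 100.0 * counts[c] / total for c in CONDITIONS}


def share_change(original, adjusted):
    """Change in percentage points per condition (adjusted minus original)."""
    before = rating_shares(original)
    after = rating_shares(adjusted)
    return {c: after[c] - before[c] for c in CONDITIONS}


if __name__ == "__main__":
    # Illustrative toy network of 20 segments (not study data).
    original = ["Good"] * 12 + ["Fair"] * 6 + ["Poor"] * 2
    adjusted = ["Good"] * 10 + ["Fair"] * 9 + ["Poor"] * 1
    for condition, delta in share_change(original, adjusted).items():
        print(f"{condition}: {delta:+.1f} percentage points")
```

If segment lengths differ, the counts would need to be replaced by length-weighted sums so that the shares reflect lane miles rather than the number of segments.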
It should be noted that the adjusted rating depends on the rule applied (see Figure 8). The rules illustrated in Figure 8 were established to ensure that no segments missed the window of opportunity for preservation. Therefore, segments that may be inconsistently rated as Good or Poor were adjusted to Fair, because “False Good” segments may miss the timing for preservation, whereas “False Poor” segments may be subject to unnecessary rehabilitation. If the data are used for other purposes, such as performance target setting and long-term prediction, other rules may be applied. For example, to identify all the potential segments in Good condition, one could keep all the segments with an original rating of Good unchanged and adjust those in the GF class with an original rating of Fair to Good. Similarly, segments in the FP class with an original rating of Fair may be changed to Poor, with all the Poor segments unchanged. The segments in the GF/FG and FP/PF classes are subject to change because of collection variability. Identifying these segments allows state agencies to evaluate potential data variability, which is crucial for setting reasonable performance targets and interpreting performance measures. Adjusting the definitions of Good, Fair, and Poor could also affect the results.
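The alternative rules described for target setting can be sketched as two simple relabeling functions: one that counts every segment that could plausibly be Good, and one that does the same for Poor. The function names are hypothetical. The text does not say how other class and rating combinations are treated under these rules, so this sketch leaves them unchanged.

```python
def potential_good(original: str, predicted_class: str) -> str:
    """Upper-bound view of Good: Fair segments in class GF become Good.

    Segments originally rated Good stay Good; all other combinations
    keep their original rating (an assumption, not stated in the text).
    """
    if original == "Fair" and predicted_class == "GF":
        return "Good"
    return original


def potential_poor(original: str, predicted_class: str) -> str:
    """Upper-bound view of Poor: Fair segments in class FP become Poor.

    Segments originally rated Poor stay Poor; all other combinations
    keep their original rating (an assumption, not stated in the text).
    """
    if original == "Fair" and predicted_class == "FP":
        return "Poor"
    return original


if __name__ == "__main__":
    # Illustrative (original rating, class) pairs, not study data.
    segments = [("Good", "GF"), ("Fair", "GF"), ("Fair", "FP"), ("Poor", "PF")]
    print([potential_good(o, c) for o, c in segments])
    print([potential_poor(o, c) for o, c in segments])
```

Applying both functions to the same network gives a range for the Good and Poor percentages, which an agency could use to judge whether a proposed performance target lies within the spread attributable to data collection variability.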
Conclusions and Recommendations
This paper presented a study in which the variability of network-level pavement condition data was evaluated. Network-level parallel tests were conducted on selected routes to gather raw data that were then used to train RF classification models to identify segments with inconsistent performance measures. The models were applied to a road network to quantify the influence of data variability on pavement condition evaluation. The case examples provided in this paper will help SHAs evaluate the influence of data collection variability on network-level pavement condition ratings and decision-making. The main conclusions are as follows.
From the parallel tests it was observed that the distributions of longitudinal wheel-path cracks, nonwheel-path cracks, and rut depth from the two data sets were significantly different from each other.
The reasons contributing to the variability in longitudinal cracks were summarized as (1) longitudinal lane joints rated as longitudinal nonwheel-path cracks; (2) lateral wandering; and (3) variations in lane measurement zones.
The influence of data variability on condition evaluation for state routes was more significant than that for interstates.
The variability of an individual metric (e.g., crack percent) may not necessarily increase the variability of overall ratings, which combine all three metrics (cracking percent, rut depth, and IRI).
Some recommendations are as follows:
Further investigations are needed to evaluate the influence of DMI variability on the distribution of rut depth. Certification of transverse profiling systems is also recommended to reduce rut depth measurement uncertainty.
Methods for adjusting the performance ratings could vary depending on the agencies’ priorities when undertaking pavement evaluation. In this study, adjustments were made to ensure that no segment missed the optimal timing for preservation and no segment was subject to unnecessary rehabilitation. However, if the purpose of pavement evaluation were to set performance goals, different rating adjustment rules would be applied in the stage depicted in Figure 8. Accordingly, the results illustrated in Figure 9 could be quite different.
Footnotes
Author Contributions
The authors confirm contribution to the paper as follows: study conception and design: X. Jia, M. Woods, B. Huang; data collection: X. Jia, D. Zhu; analysis and interpretation of results: X. Jia, H. Gong, W. Hu; draft manuscript preparation: X. Jia, M. Woods, D. Zhu. All authors reviewed the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
The contents of this paper reflect the views of the authors, who are responsible for the facts and the accuracy of the data presented here, and do not reflect the views of the Tennessee Department of Transportation. The contents do not constitute a standard, specification, or regulation.
