Abstract
Real-time crash prediction is crucial in freeway traffic management for improving traffic safety. This study presents an optimized freeway crash prediction model that combines over-sampling techniques with a Support Vector Machine (SVM). The model was constructed with traffic data collected by discrete loop detectors on a 48.7 km segment of the G60 Freeway, Shanghai, China. The matched case-control method and SVM were applied to identify high-risk traffic flow states. Two over-sampling techniques were applied to optimize the raw samples. According to nonparametric tests, the adaptive synthetic over-sampling technique performs better than the synthetic minority over-sampling technique. The results indicate that SVM classifiers with the adaptive synthetic over-sampling technique improve accuracy and robustness when dealing with imbalanced data. The Mean Impact Value method was employed to rank the factors contributing to crashes. This research contributes to more targeted strategies for real-time freeway safety management.
Introduction
The concept of measuring and predicting crash risk on freeways is becoming more practical owing to recent advances in information systems and traffic surveillance technology. Many studies have investigated the relationship between the crash mechanism and traffic operating characteristics such as traffic state, weather and road geometry. Meanwhile, there is a growing trend toward developing real-time crash prediction models from historical freeway data. In recent years, several studies on real-time crash prediction have estimated the likelihood of crash occurrence using data collected by traffic surveillance systems such as dual loop detectors [1–8], Automatic Vehicle Identification (AVI) [9–11], and single loop detectors [12–14]. The majority of these studies require high-quality traffic data. For instance, Abdel-Aty et al. utilized traffic data from 7 loop detectors [1–4] to identify contributing factors as significant predictors of crash likelihood.
However, because most developing countries lack sufficient surveillance devices on freeways, these models are unable to dynamically evaluate freeway traffic safety conditions there. In China, for example, surveillance devices are often installed only at locations with high congestion or crash occurrence. Moreover, these models would become invalid if any detector failed. Additionally, the transferability of these models has not been verified for traffic networks in most developing countries, where driving behavior and traffic patterns differ.
So far, most studies have utilized generalized linear models that associate crash precursor variables with crash data. Explanatory variables can then be identified to generate targeted strategies. The modeling techniques most widely applied in developing real-time crash prediction models are the logistic regression model [1, 6], the logit model [7, 8] and the Bayesian approach [9–12].
In these approaches, several traffic characteristics, such as the variation of speed, the upstream occupancy and the variation of flow, were found to be significant predictors of crash likelihood. However, one study [18] showed that a generalized linear model based approach may lead to biased model estimation and interpretation when the independent variables demonstrate strong non-linear features.
Besides linear and non-linear models, other approaches have been applied to associate crash likelihood with real-time traffic data. Neural Network (NN) [2, 14] and Support Vector Machine (SVM) [15, 16] techniques have shown strong performance in predicting crash occurrence on freeways. Despite its innovative characteristics, the neural network method has some limitations. First, the NN model works as a black box and may suffer from over-fitting as well as local extremum issues [20]. Second, the traditional way of selecting samples in case-control studies often applies a crash/non-crash ratio of 1:4, which creates an imbalanced dataset. Mujalli et al. [22] found that data mining algorithms tend to produce high predictive accuracy over the majority class but poor accuracy over the minority class in an imbalanced dataset. SVMs usually address the over-fitting issue by introducing a kernel function and seek the global optimal solution by solving a convex optimization problem. However, SVMs also have difficulty dealing with imbalanced datasets.
The aim of this research is to investigate the possibility of developing real-time crash prediction models based on the traffic data collected from discrete loop detectors. This study also investigates how to employ over-sampling techniques to improve the model performance.
This paper is organized in four sections. Section 2 presents the data and the structure of the proposed method. Section 3 explains the methods, including the SVM modeling technique, two over-sampling techniques and nonparametric tests for the over-sampled data. The results and discussion are presented in Section 4.
Data collection and preparation
Traffic data
The study area is a part of the G60 Freeway in Shanghai, China. The total length of the studied road segment is 48.7 km, with 6 to 10 lanes (3 to 5 lanes in each direction). The primary dataset includes all crash and traffic loop data between January 2014 and September 2015.
Figure 1 shows the locations of the 5 pairs of loop detectors along the G60 freeway (9 pairs are installed in total). Five pairs are located on the mainline and four pairs are installed on the ramps.
For the primary objective of this study, only the five pairs of loop detectors on the mainline of the freeway were utilized. The average distance between the loop detectors is approximately 6.6 km. Several traffic flow characteristics associated with crash occurrence were collected by the loop detectors every 20 s, such as traffic flow, vehicle type, vehicle speed for each vehicle type, and vehicle occupancy. The database also stores the device working state, data validity and timing records.

Locations of the loop detectors on G60 Freeway in Shanghai.
The primary crash dataset includes all 913 crashes that occurred in the study area between January 2014 and September 2015. Only rear-end crashes and sideswipe crashes are utilized in this study. Because the loop detectors are discrete, traffic data from the freeway segment between one on-ramp and the next off-ramp were utilized to predict crash occurrence.
The next step in crash data preparation is data aggregation. Because of random noise in the raw data, M. Ahmed and Abdel-Aty et al. [9] recommended combining 1-minute raw data to the 5-minute level. Other previous studies also found that 5-min aggregated data would be more significant [6, 17].
In this experiment, the raw data were extracted for the period 5–10 minutes prior to each crash. The original 20-s traffic data, Flow ($f_n$), Speed ($V_n$) and Occupancy ($Occ_n$) on each lane ($L_n$), are aggregated to the 5-minute level. The primitive data matrices are listed below:
(1) Flow matrix
In the matrix, n denotes the number of lane.
(2) Speed matrix
(3) Occupancy matrix
By calculating the mean value and the standard deviation value of each matrix within lanes or time slices, 12 variables can be deduced. The variable definitions are shown in Table 1.
Symbols and variable description
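The aggregation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the definitions of Table 1: it assumes each 5-minute window holds fifteen 20-s readings per lane (300 s / 20 s) arranged as a time-slice by lane matrix, and it uses the population standard deviation. The function names `summarize` and `build_features` and the four statistic names are hypothetical.

```python
from statistics import mean, pstdev

def summarize(matrix):
    """Return four summary statistics for one time-slice x lane matrix.

    matrix[t][n] is the reading for time slice t on lane n (assumed layout).
    """
    lane_means = [mean(col) for col in zip(*matrix)]   # one mean per lane
    slice_means = [mean(row) for row in matrix]        # one mean per time slice
    return {
        "mean": mean(slice_means),                     # overall average level
        "std_across_lanes": pstdev(lane_means),        # spread between lanes
        "std_across_time": pstdev(slice_means),        # spread over the 5 minutes
        "mean_within_slice_std": mean(pstdev(row) for row in matrix),
    }

def build_features(flow, speed, occupancy):
    """Combine the three matrices into 3 x 4 = 12 candidate variables."""
    features = {}
    for name, matrix in (("flow", flow), ("speed", speed), ("occ", occupancy)):
        for stat, value in summarize(matrix).items():
            features[f"{name}_{stat}"] = value
    return features
```

Calling `build_features` with three 15 x n matrices yields a dictionary of 12 values, one feature vector per crash or non-crash sample.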
To prepare a crash dataset suitable for model estimation, a few principles were applied to avoid possible bias. A record is defined as ‘no data’ or ‘invalid data’ when there is no loop detector on the corresponding segment of the G60 freeway, or when the loop detectors failed. After such records were excluded, the final dataset, which consists of the traffic flow data corresponding to each crash record, includes 307 observations.
The matched case-control method was utilized to eliminate seasonal effects such as vacation traffic demand and weather variation, and to avoid possible bias resulting from dissimilar traffic patterns on different days of the month and the week.
A 4:1 control-case ratio was recommended by an existing study [17]; thus every crash sample was matched with four non-crash samples. The four control samples were selected at the recorded crash time 14 days before, 7 days before, 7 days after and 14 days after the crash, respectively. Control samples with invalid traffic data were also removed from the control dataset. A final control dataset with 1210 non-crash samples was utilized in this study.
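The control selection rule can be sketched in a few lines. The offsets follow the text; the validity check `has_valid_data` is a hypothetical callable standing in for the detector data-validity lookup.

```python
from datetime import datetime, timedelta

# Day offsets of the four matched controls relative to the crash time
CONTROL_OFFSETS_DAYS = (-14, -7, 7, 14)

def control_times(crash_time):
    """Return the four control timestamps (same weekday and clock time as the crash)."""
    return [crash_time + timedelta(days=d) for d in CONTROL_OFFSETS_DAYS]

def matched_controls(crash_time, has_valid_data):
    """Keep only the controls whose traffic data pass the (hypothetical) validity check."""
    return [t for t in control_times(crash_time) if has_valid_data(t)]
```

Because the offsets are multiples of 7 days, every control falls on the same day of the week and at the same time of day as its crash.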
SVM modeling technique
The SVM modeling technique has been widely applied in machine learning tasks such as text classification, image recognition and voice recognition. The method is often employed for high-dimensional data and non-linear problems.
Based on structural risk minimization theory, SVMs generate an optimal classification hyperplane, which maximizes the margin between the hyperplane and the nearest samples of each class and sets an equal margin on both sides. Meanwhile, SVMs seek the global optimal solution, which helps the classifiers achieve better generalization ability.
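The C-SVC (C-Support Vector Classification) model was employed in this study, and the RBF (Radial Basis Function) kernel function was utilized to deal with the high-dimensional variables. The RBF kernel is given by Equation (4):

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right), \quad \gamma > 0 \tag{4}$$

The decision function based on the RBF kernel can then be written as Equation (5):

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{l} \alpha_i y_i \exp\left(-\gamma \lVert x_i - x \rVert^2\right) + b\right) \tag{5}$$

where $x_i$ are the training samples with labels $y_i \in \{-1, +1\}$, $\alpha_i$ are the Lagrange multipliers obtained from training, $b$ is the bias term and $l$ is the number of training samples.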
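As a numerical illustration, the sketch below evaluates the standard RBF kernel, exp(-γ‖u − v‖²), and the sign of the resulting decision value. The support vectors, coefficients (each the product of a multiplier and a label) and bias passed in are hypothetical inputs; in practice they come from a trained model.

```python
import math

def rbf_kernel(u, v, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, dual_coefs, bias, gamma):
    """Sum of coef_i * K(sv_i, x) plus bias; its sign gives the class."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias

def predict(x, support_vectors, dual_coefs, bias, gamma):
    """Return +1 (e.g. crash) or -1 (e.g. non-crash)."""
    value = decision_value(x, support_vectors, dual_coefs, bias, gamma)
    return 1 if value >= 0 else -1
```

A larger γ makes the kernel decay faster with distance, so each support vector influences a smaller neighborhood.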
As the performance of the SVMs depends on the parameters (C, γ), the grid search method was employed to select the optimal parameters by training the SVMs step by step with cross-validation. The search ranges for C and γ were $[2^{0}, 2^{10}]$ and $[2^{-15}, 2^{-5}]$, respectively. The modeling process was based on the LIBSVM tool developed by Chang and Lin [18] in Matlab.
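A grid search over these ranges can be sketched as below. The exponent step of 1 is an assumption, since the step size is not stated, and `cv_score` is a hypothetical caller-supplied function that would train the classifier with cross-validation and return its mean accuracy.

```python
import itertools

def power_of_two_grid(c_exponents, gamma_exponents):
    """All (C, gamma) pairs on a log2 grid."""
    return [(2.0 ** c, 2.0 ** g)
            for c, g in itertools.product(c_exponents, gamma_exponents)]

def grid_search(cv_score, c_exponents=range(0, 11), gamma_exponents=range(-15, -4)):
    """Return the (C, gamma) pair with the highest cross-validation score.

    cv_score(C, gamma) is hypothetical: it stands in for training with
    cross-validation and returning the mean validation accuracy.
    """
    grid = power_of_two_grid(c_exponents, gamma_exponents)
    return max(grid, key=lambda pair: cv_score(*pair))
```

With the default ranges the grid contains 11 x 11 = 121 candidate pairs, each of which requires one full cross-validation run.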
Data mining algorithms often have difficulty dealing with imbalanced datasets. Under-sampling and over-sampling are data analysis techniques used to adjust the class distribution. Over-sampling was chosen in this study because of its ability to create artificial data points. The Synthetic Minority Over-sampling Technique (SMOTE) and the Adaptive Synthetic Sampling Technique (ADASYN) are two common over-sampling techniques. This section aims to identify the better technique for crash prediction models with an imbalanced dataset.
Synthetic minority over-sampling technique
The synthetic minority over-sampling technique is proposed to over-sample the minority class by creating synthetic examples. The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen [19].
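The interpolation idea behind SMOTE can be sketched with the standard library alone. This is an illustrative simplification rather than the implementation used in the study; the function names and the default k = 5 are assumptions, and at least two minority samples are required.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest_neighbors(sample, pool, k):
    """The k points in pool closest to sample, excluding sample itself."""
    others = [p for p in pool if p is not sample]
    return sorted(others, key=lambda p: squared_distance(sample, p))[:k]

def smote(minority, n_per_sample, k=5, rng=None):
    """Create n_per_sample synthetic points for every minority sample."""
    rng = rng or random.Random()
    synthetic = []
    for sample in minority:
        neighbors = nearest_neighbors(sample, minority, k)
        for _ in range(n_per_sample):
            neighbor = rng.choice(neighbors)   # random one of the k neighbors
            gap = rng.random()                 # position along the segment, in [0, 1)
            synthetic.append([s + gap * (n - s) for s, n in zip(sample, neighbor)])
    return synthetic
```

Every synthetic point lies on the line segment between a minority sample and one of its minority-class neighbors, so it stays inside the region already occupied by the minority class.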
Adaptive synthetic sampling technique
The adaptive synthetic sampling technique uses a weighted distribution for different minority class samples according to their level of difficulty in learning. More synthetic data are generated for minority class samples that are harder to learn than for those that are easier to learn, thus reducing the bias introduced by the imbalanced data distribution [20].
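The weighting step can be sketched as follows. Difficulty is measured here as the share of majority-class points among the k nearest neighbors of each minority sample, and new points are interpolated toward nearby minority samples. This is a simplified illustration; the function names, the default k = 5 and the rounding rule are assumptions.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def adasyn_counts(minority, majority, n_synthetic, k=5):
    """Number of synthetic points to generate around each minority sample."""
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    ratios = []
    for sample in minority:
        others = [(p, y) for p, y in labelled if p is not sample]
        others.sort(key=lambda item: squared_distance(sample, item[0]))
        # Share of majority-class points among the k nearest neighbors:
        # a high share marks a sample that is hard to learn.
        ratios.append(sum(1 for _, y in others[:k] if y == 0) / k)
    total = sum(ratios)
    if total == 0:  # no minority sample has majority-class neighbors
        return [0] * len(minority)
    return [round(r / total * n_synthetic) for r in ratios]

def adasyn(minority, majority, n_synthetic, k=5, rng=None):
    """Generate synthetic minority points, more of them near hard samples."""
    rng = rng or random.Random()
    counts = adasyn_counts(minority, majority, n_synthetic, k)
    synthetic = []
    for sample, count in zip(minority, counts):
        pool = sorted((p for p in minority if p is not sample),
                      key=lambda p: squared_distance(sample, p))[:k]
        for _ in range(count):
            neighbor = rng.choice(pool)
            gap = rng.random()
            synthetic.append([s + gap * (n - s) for s, n in zip(sample, neighbor)])
    return synthetic
```

Because the per-sample counts are rounded, the total number of generated points can differ slightly from the requested `n_synthetic`.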
Nonparametric test
Both SMOTE and ADASYN were applied to reconstruct the dataset. Nonparametric tests were then conducted to check whether the new samples differ from the original ones. As stated above, a case-control ratio of 1:4 was utilized. In the SMOTE strategy, each crash sample was matched with 3 additional synthetic samples, while in the ADASYN strategy all crash samples were used together to generate the additional samples.
The Wilcoxon signed-rank nonparametric test (see Table 2) for related samples was conducted in SPSS to compare the SMOTE samples with the original crash samples.
Wilcoxon test results for SMOTE samples
*The null hypothesis H0 is that the medians are equal. H0 = 0 denotes that the hypothesis cannot be rejected, and H0 = 1 denotes that the difference is significant at the 5% significance level.
The median test and Mann-Whitney U test (see Table 3) for two independent samples were conducted to test the ADASYN samples and the original crash samples.
Median test and Mann-Whitney U test results for ADASYN samples
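For readers unfamiliar with the Mann-Whitney U test, the statistic itself can be computed by counting pairwise comparisons between the two independent samples. The sketch below returns only the U values; it does not compute the p-value that a statistics package reports, and the function name is hypothetical.

```python
def mann_whitney_u(a, b):
    """Return (U_a, U_b) for two independent samples.

    U_a counts the pairs (x in a, y in b) with x > y; ties count as 0.5.
    """
    u_a = 0.0
    for x in a:
        for y in b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5
    u_b = len(a) * len(b) - u_a
    return u_a, u_b
```

When the two samples come from similar distributions, both U values are close to half the number of pairs, which is the situation one hopes for when comparing synthetic and original samples.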
The results in Tables 2 and 3 show that some parameters of the SMOTE samples are significantly different from those of the original samples, which indicates that the SMOTE-generated samples may be invalid for modeling.
However, the nonparametric tests for the ADASYN samples show that ADASYN performs well in the over-sampling process and that the ADASYN strategy can adaptively learn from difficult samples. Hence, the balanced dataset (1202 crash samples and 1210 non-crash samples) obtained with the ADASYN strategy was finally utilized in the SVM modeling process.
Primitive SVM classifier performance and SVM classifier performance with ADASYN
Both the SVM without over-sampling and the SVM with ADASYN were trained to compare the outcomes. In the modeling process, the overall samples were randomly divided into training data and test data at a ratio of 7:3. The training data were used to develop the model, and the test data were used to evaluate the model’s predictive power. Three indexes were employed to evaluate the classifier performance: Accuracy, True Positive Ratio (TPR) and False Positive Ratio (FPR). Accuracy is calculated by Equation (6):

$$\text{Accuracy} = \frac{TP + TN}{P + N} \tag{6}$$

where True Positive (TP) denotes the number of crash samples classified as crashes, True Negative (TN) is the number of non-crash samples classified as non-crashes, P represents the total number of crash samples in the classified dataset, and N denotes the total number of non-crash samples in the classified dataset. TPR and FPR follow their standard definitions:

$$TPR = \frac{TP}{P}, \qquad FPR = \frac{FP}{N}$$

where False Positive (FP) denotes the number of non-crash samples classified as crashes.
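The three indexes can be computed from true and predicted labels as in the sketch below, which uses the standard definitions (TPR as the share of crash samples detected, FPR as the share of non-crash samples flagged). The function name and the label coding (1 for crash, 0 for non-crash) are assumptions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, TPR and FPR for a binary crash / non-crash classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    n_pos = tp + fn   # all crash samples
    n_neg = tn + fp   # all non-crash samples
    return {
        "accuracy": (tp + tn) / (n_pos + n_neg),
        "tpr": tp / n_pos if n_pos else 0.0,
        "fpr": fp / n_neg if n_neg else 0.0,
    }
```

For example, with a 1:4 crash/non-crash ratio a classifier that always predicts non-crash reaches 80% accuracy while its TPR is 0, which is why accuracy alone is misleading on imbalanced data.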
Additionally, the prediction performance can be illustrated by means of a graphical plot, which is called the Receiver Operating Characteristic (ROC) curve [27]. Area Under Curve (AUC) of the ROC curve was utilized to evaluate the classifier performance as well. The larger the area under the ROC curve, the better the prediction accuracy. The results for the test dataset are listed in Table 4.
Classifier performance of primitive SVM and SVM with ADASYN
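The ROC curve and its AUC can be obtained by sweeping the decision threshold over the classifier scores. The sketch below is a minimal illustration using the trapezoidal rule; the function names are hypothetical, and labels are assumed to be 1 for crash and 0 for non-crash, with at least one sample of each class.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) points obtained by lowering the threshold score by score."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda item: -item[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process all samples that share the same score together
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Both coordinates are non-decreasing along the curve, so a lower threshold raises TPR only at the cost of a higher FPR.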
Table 4 indicates that the primitive SVM classifier has a significant tendency to classify samples as non-crash, whereas for the ADASYN-generated samples the SVM classifier tends to be neutral. The AUC value also indicates that the ADASYN technique improves the performance of the SVM classifier. The performance of each SVM classifier is shown as a ROC curve in Fig. 2.

ROC curve of primitive SVM and SVM with ADASYN.
For the SVM with ADASYN, the TPR value of 50.00% is lower than those reported in existing studies, such as 69.4% by Abdel-Aty et al. [2], 57.1% by Pande et al. [19] and 75.93% by Ahmed et al. [9]. However, the FPR value of 9.06% is much lower than those of the existing studies, such as 47.2% by Abdel-Aty et al. [2], 28.8% by Pande et al. [19] and 45.9% by Ahmed et al. [9]. As shown in the ROC figure (Fig. 2), when the TPR value increases, the FPR value increases as well. This leads to unsatisfactory results when a model is applied in real traffic management: once the FPR value exceeds a certain limit, the traffic management system raises alarms frequently. As a result, no traffic management department would want to put into practice a model that combines high crash prediction power with a high false alarm rate. This is one of the main reasons why the existing models have not been widely applied. In terms of the AUC value (0.7868 in this study), the performance of the SVM is quite satisfactory compared with existing studies, such as 0.74 by Yu et al. [20]. In general, the results show the potential value and reliability of the optimized SVM model.
The Mean Impact Value (MIV) method was employed to weight the variables and thereby determine the major crash precursors.
The method calculates the change in the model output when each variable is in turn increased by 10% and decreased by 10% while the values of the other variables are kept unchanged. The final results are presented in Table 5.
MIV and rank of each variable
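The MIV calculation can be sketched as follows. The `model` argument is a hypothetical callable that maps one feature vector to a numeric output (for example a decision value); averaging the output difference over all samples and ranking by absolute value are the usual conventions and are assumed here.

```python
from statistics import mean

def mean_impact_values(model, samples, delta=0.10):
    """Mean change in model output when each variable is scaled up and down by delta."""
    n_features = len(samples[0])
    mivs = []
    for j in range(n_features):
        diffs = []
        for x in samples:
            up = list(x)
            up[j] = x[j] * (1 + delta)      # variable j increased by 10%
            down = list(x)
            down[j] = x[j] * (1 - delta)    # variable j decreased by 10%
            diffs.append(model(up) - model(down))
        mivs.append(mean(diffs))
    return mivs

def rank_by_impact(mivs, names):
    """Variables sorted from largest to smallest absolute MIV."""
    return sorted(zip(names, mivs), key=lambda item: abs(item[1]), reverse=True)
```

The sign of an MIV shows the direction in which the variable moves the output, while its absolute value determines the rank.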
As shown in Table 5, SSV (the accumulated deviation of speed within the lanes during the 5-minute interval), Flow and SSFlow are the three most important factors in the model.
In a real traffic operation environment, the value of SSV represents the magnitude of the speed difference between lanes. As the speed difference increases, drivers tend to change lanes more frequently, or even more quickly. This is an aggressive driving pattern, which can increase the crash risk significantly. The variable Flow ranks second, reflecting that crash risk varies considerably under different traffic flow conditions. As flow increases, more vehicles enter the road and drivers have to pay attention to the growing number of vehicles around them. Once a driving mistake is made, a crash is likely to occur. The variable SSFlow represents the flow difference between lanes in the temporal dimension. It indicates that as the flow difference increases, drivers tend to move to the lane with fewer vehicles to obtain a better driving experience, i.e. it leads to more lane-changing behavior. To some extent, these variables reflect the underlying mechanism linking traffic characteristics and crash potential.
The main objective of this paper was to explore an optimized real-time freeway crash prediction model that combines the ADASYN over-sampling technique with a support vector machine. The matched case-control method was utilized to generate samples from traffic characteristics data collected by discrete loop detectors. The raw data were aggregated to the 5-minute level to reduce random noise. The analysis indicates that ADASYN improves the performance of the SVM classifier and addresses the strong tendency of the SVM toward the majority class on an imbalanced dataset. The results show the potential value of traffic data collected by discrete loop detectors and the advantage of the optimized SVM model over existing studies when considering practical application in a traffic management system.
Meanwhile, ADASYN performs better than SMOTE when dealing with complicated samples. Various variables were selected as potential predictors for estimation. The Mean Impact Value method was employed to identify the factors contributing to crashes. The three most significant factors are SSV, Flow and SSFlow. This work contributes to more targeted strategies for real-time freeway safety management. There is always potential for improving the performance of the developed models by enriching the traffic data and crash data in future studies.
The accuracy and robustness of the prediction models can be validated jointly. Another extension of this study would be the transferability test of the models. By employing more intelligent machine learning techniques, traffic authorities will be more confident to utilize crash prediction models in the real-time safety management system.
Acknowledgments
This research is supported by the National “Twelfth Five-Year” Plan for Science & Technology Support Project in China (2014BAG01B04) and National Natural Science Foundation of China (71671126).
