Abstract
Accurate predictive modeling is often hindered by the prevalent issue of class imbalance within weather-related crash datasets. To address this critical challenge, this study introduces a novel and tailored synthetic data generation technique aimed at effectively handling nominal predictors specific to weather-related crash cases in North Carolina. Data treatment techniques such as the synthetic minority over-sampling technique-nominal (SMOTE-N) and adaptive synthetic-nominal (ADASYN-N) were investigated in this study. A comprehensive comparison of these data treatment techniques is conducted using two prominent machine learning models: the bagging algorithm (random forest [RF]) and the boosting algorithm (extreme gradient boosting [XGBoost]). The findings indicate that the effectiveness of data treatment varies with the severity level and the algorithm used. The ADASYN-N technique was observed to be highly effective for severe and moderate injury crash prediction using both the RF and XGBoost models, while the control dataset and SMOTE-N demonstrated notable performance in property damage only crash prediction using both the RF and XGBoost models. The findings from evaluating the performance of these models with data treatment methods serve as a benchmark for practitioners in selecting appropriate synthetic sample generation techniques, consequently facilitating the development of more accurate crash severity prediction models and contributing to enhanced traffic safety strategies.
Keywords
Weather-related vehicle crashes are particularly hazardous because of their unpredictable nature. Not only can such crashes cause serious physical harm, but they can also have a substantial economic impact. Research indicates that weather-related crashes in the U.S.A. in 2014 cost approximately US$46 billion ( 1 ). Furthermore, weather-related crashes are more likely to be fatal than other types of crashes ( 2 ), with statistics from the same year showing that they accounted for 21% of all fatalities ( 1 ). The psychological repercussions of such crashes can be long-lasting and have a substantial economic impact in addition to reducing the quality of life of those affected ( 3 ).
To mitigate the impact of weather-related crashes, practitioners must be able to accurately predict and identify the factors associated with their severity levels. However, because of the highly imbalanced nature of crash data, traditional machine learning models are not robust enough to effectively classify crash severity ( 4 , 5 ). Thus, it is crucial to address the problem of class imbalance before predicting severity classes on weather-related crash datasets.
Common approaches, such as using only one classifier or the majority class to predict all classes, often lead to inaccurate results ( 6 ). Two of the most popular approaches are undersampling and oversampling techniques. In undersampling, the number of instances from the majority weather-related crash severity class is reduced to match the frequency of the minority class ( 4 , 5 ). However, there is a risk of losing vital information about the majority class. On the other hand, oversampling involves replicating weather-related crash events from the minority class to increase their representation in the dataset. In the context of crash events, the rare occurrence of fatal/severe crashes makes simple oversampling prone to overfitting ( 4 ).
Because of the weakness of undersampling and oversampling, the synthetic minority oversampling technique (SMOTE) ( 7 ) and adaptive synthetic (ADASYN) ( 8 ) methods are used to balance the data, thus avoiding the risk of overfitting or data loss ( 4 , 6 ). The SMOTE works by generating new events that are interpolations of existing minority class events, thereby improving the diversity of instances without simply duplicating data. The ADASYN technique, however, takes this a step further by adaptively generating minority data points based on their density distribution with more weight on the most difficult-to-learn data instances. However, a significant limitation arises when dealing with a dataset that is entirely composed of nominal predictors. These methods struggle because they rely on the concept of distance or interpolation between points.
Modern data sources such as weather-related crash data have resulted in complex datasets that are cumbersome to model with classic statistical methods ( 9 – 12 ). On the other hand, machine learning models are robust enough to deal with complex datasets with high-dimensional features. While these models are becoming popular in predicting traffic crashes, the presence of data imbalance in weather-related crash severity datasets often introduces unintended biases to resultant machine learning models, making it essential to address these imbalances before beginning the model training process ( 13 , 14 ). This problem becomes even more complex with a dataset containing predominantly nominal predictors.
Therefore, this study aims to assess the efficacy of different data treatment techniques and their influence on machine learning models, particularly focusing on handling nominal predictors. Specifically, data treatment techniques involving two synthetic approaches, namely the synthetic minority over-sampling technique-nominal (SMOTE-N) and adaptive synthetic-nominal (ADASYN-N) technique, are investigated for nominal predictors. The study employed two crash severity classification models, the bagging algorithm (random forest [RF]) and the boosting algorithm (extreme gradient boosting [XGBoost]). Evaluating the performance of these models with different data treatment methods provides insights into how such techniques can enhance the accuracy of machine learning models when applied to weather-related traffic crash data.
Background and Motivation
Weather-Related Crashes
Research in crash severity prediction, focusing specifically on weather-related traffic crashes, is vital for understanding how various factors influence the severity of such incidents. Accurate crash severity prediction is crucial for emergency response planning, resource allocation, and the development of effective countermeasures to reduce the impact of these crashes. Numerous studies have been conducted to investigate the factors affecting the crash severity of weather-related traffic crashes ( 15 – 19 ). These factors can be broadly categorized into three main groups: driver-related, vehicle-related, and environmental-related factors, with emphasis on weather conditions ( 19 – 21 ). Driver-related factors include driver age, gender, impairment (e.g., alcohol or drug use), distraction, and fatigue ( 22 ). Vehicle-related factors encompass vehicle type, size, and safety features. Environmental-related factors consist of road conditions, weather conditions, lighting, and traffic characteristics. Understanding the impact of these factors is crucial for developing effective crash severity prediction models. The selection of predictor variables significantly affects the accuracy and interpretability of crash severity prediction models ( 23 ). Previous research has identified a wide range of potential predictor variables, including weather conditions and their interactions with driver characteristics (e.g., age and gender), roadway attributes (e.g., speed limit and road type), other environmental conditions (e.g., lighting), and crash-specific variables (e.g., crash type and time of day) ( 24 – 27 ). Das et al. ( 9 , 10 ) identified other contributing factors, such as locality and road-specific features. Feature selection techniques, such as stepwise regression, principal component analysis, and recursive feature elimination, have been employed to identify the most influential variables for crash severity prediction.
Evaluation metrics play a crucial role in assessing the performance of crash severity prediction models. Commonly used metrics include accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic curve (AUC-ROC) ( 26 ). In addition, confusion matrix analysis provides insights into model performance across different severity levels. Studies have compared the performance of different models and techniques, highlighting the strengths and limitations of each approach. Higher accuracy and AUC-ROC values indicate better model performance ( 28 ).
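As an illustration of how these metrics relate to a confusion matrix, the following minimal Python sketch computes accuracy and per-class precision, recall, and F1 score. The function name `per_class_metrics` and the toy counts are hypothetical and are not taken from this study.

```python
def per_class_metrics(confusion, labels):
    """confusion[i][j] = number of cases with true class i predicted as class j."""
    n = len(labels)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    metrics = {"accuracy": correct / total}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, truly another class
        fn = sum(confusion[k]) - tp                       # truly k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Hypothetical counts for illustration only (not results from this study).
toy_confusion = [
    [5, 3, 2],      # true severe injury
    [4, 30, 16],    # true moderate injury
    [1, 17, 122],   # true PDO
]
toy_metrics = per_class_metrics(toy_confusion, ["severe", "moderate", "PDO"])
```

In this toy case the overall accuracy is high because the PDO class dominates, while the F1 score of the rare severe class is much lower, which is why per-class metrics are informative for imbalanced crash data.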
Accurate crash severity prediction models have practical implications for road safety management. For instance, emergency response systems can use predicted severity levels to dispatch appropriate medical personnel and resources during adverse weather conditions. Transportation agencies can prioritize weather-specific road safety improvements and allocate funding based on predicted crash severity hotspots ( 29 ). Furthermore, crash severity prediction models can aid in the development of intelligent transportation systems and advanced driver assistance systems to prevent or mitigate risks associated with weather-related crashes.
Approaches for Addressing the Imbalance Problem in Crash Data
The phenomenon of class imbalance has been identified as a significant challenge within the realm of statistical modeling and machine learning methodologies. The disproportionate representation of classes often results in the dominance of the majority class, thereby undermining the reliability of the predictions pertaining to the minority class. This accentuates the pivotal role that data plays in these modeling frameworks, given their data-dependent nature. This aspect is particularly salient within the context of traffic crash data analysis. Infrequent crashes, while less common in the dataset, are typically associated with higher severity and concomitant socioeconomic costs ( 14 ). Consequently, the accurate prediction of these minority class crash events assumes critical importance ( 23 ).
Iranitalab and Khattak ( 30 ) undertook a comparative analysis of four distinct statistical and machine learning methodologies, namely the multinomial logit (MNL), nearest neighbor classification (NNC), support vector machine (SVM), and RF, in the context of traffic crash severity prediction. The findings revealed that machine learning algorithms, encompassing the NNC, SVM, and RF, generally outperformed the traditional MNL model with respect to prediction accuracy. This result was further strengthened by a recent report by Ogungbire et al. ( 31 ) comparing deep learning, machine learning, and statistical models for weather-related crash severity prediction. Nonetheless, a common challenge encountered by both studies pertained to the classification of infrequent severe injury crashes, such as those resulting in disabling or fatal injuries. Therefore, there is a pressing need to implement data treatment techniques to mitigate the issues arising from the imbalance in crash data and investigate their effectiveness on different machine learning algorithms.
To address the prevalent issue of class imbalance in crash severity analysis, minority oversampling techniques such as the SMOTE and the ADASYN technique have been developed and explored in the past ( 7 , 8 ). The SMOTE generates synthetic instances of the minority class by employing a bootstrapping approach in combination with the k-nearest neighbors algorithm, and it has been widely used in road safety analysis. Similarly, the ADASYN technique adopts a density distribution-based measure to determine the required number of samples from the minority class, in contrast to the SMOTE’s uniform weight assignment. Besides oversampling the minority class, undersampling of the majority class has also been incorporated in crash analysis. For instance, Fiorentini and Losa ( 32 ) randomly undersampled the majority class to develop models predicting crash type, specifically focusing on two levels of collision severity: property damage only (PDO) and fatal + injury crashes. Their findings demonstrated that the random undersampling of the majority class significantly improved the prediction performance for the minority class compared to the model trained on imbalanced data. Table 1 provides a summary of selected weather-related crash severity studies ( 33 – 38 ) that have explored machine learning techniques in the past.
Summary of Previous Weather-Related Crash Severity Studies Using Machine Learning
Note: RF = random forest.
In summary, class imbalance is a major issue in statistical and machine learning methods, which can be addressed by using data treatment methods such as the SMOTE-N and ADASYN-N technique. However, there is a challenge when the dataset has predominantly nominal predictors with few or no numeric predictors. This study aims to address a critical gap in the existing literature by introducing a novel technique for synthetic data generation specifically tailored for nominal predictors in the context of crash severity analysis. It is hypothesized that these data treatment techniques will improve the prediction accuracy when trained on machine learning models compared to the control or raw dataset. The proposed technique is expected to improve the accuracy of weather-related crash severity classifications. The findings from this study serve as valuable insights into the impact of these techniques on the accuracy and robustness of machine learning models when applied to crash data, contributing to more reliable and effective crash severity analysis methodologies.
Methodology
Data Used and Processing
The data used in this study was obtained from the Highway Safety Information System (HSIS). The HSIS is a comprehensive, multi-state database encompassing a wide array of information. It includes detailed crash data, extensive roadway inventory details, traffic volume statistics, and specialized inventory data that covers aspects such as intersections, interchanges, and roadside hardware elements. Crashes that occurred in North Carolina between January 1, 2015 and December 31, 2017 were extracted. Crashes are reported using case numbers, and observations with the same case number indicate vehicles involved in the same crash incident. To gain a thorough understanding of the crash occurrence process, Washington and Haque ( 39 ) argued that crashes with different causes should be modeled separately. Therefore, only weather-related crashes were extracted, while crashes occurring under clear weather conditions were excluded in this study.
The final dataset only included crashes that happened under non-clear weather conditions (i.e., cloudy, rain, fog/smog, sleet/hail/freezing rain/drizzle, severe crosswinds, or blowing sand conditions, as described in the crash reports). Crash severity is defined in the HSIS database as five different levels (i.e., fatal crashes, injury type A, injury type B, injury type C, and no injury/PDO). In this study, the crash severity was re-categorized into three levels, that is, severe injury (fatal and injury type A), moderate injury (injury type B and injury type C) and PDO/no injury. The re-grouping is done to simplify the analysis by reducing complexity while ensuring that each category is meaningful. It is also intended to capture similar outcomes where closely related interventions are required. Overall, a total of 238,252 weather-related crashes were recorded within the study period with 2952 severe injury crashes, 71,688 moderate injury crashes, and 163,612 PDO crashes.
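The re-grouping and the resulting degree of imbalance can be sketched in a few lines of Python. The dictionary keys below are illustrative labels rather than actual HSIS codes, and the names `SEVERITY_GROUPS` and `regroup` are hypothetical; the class counts are those reported above.

```python
# Mapping from the five HSIS severity levels to the three study classes.
# The key strings are illustrative labels, not actual HSIS field codes.
SEVERITY_GROUPS = {
    "fatal": "severe injury",
    "injury type A": "severe injury",
    "injury type B": "moderate injury",
    "injury type C": "moderate injury",
    "no injury/PDO": "PDO",
}


def regroup(severity):
    """Return the three-level class for a five-level severity label."""
    return SEVERITY_GROUPS[severity]


# Class counts reported in the text.
class_counts = {"severe injury": 2952, "moderate injury": 71688, "PDO": 163612}
total = sum(class_counts.values())
class_shares = {name: count / total for name, count in class_counts.items()}
imbalance_ratio = class_counts["PDO"] / class_counts["severe injury"]

for name, share in class_shares.items():
    print(f"{name}: {share:.2%}")
print(f"PDO crashes per severe injury crash: {imbalance_ratio:.1f}")
```

With these counts, severe injury crashes make up roughly 1.2% of the dataset, moderate injury crashes about 30.1%, and PDO crashes about 68.7%, so there are about 55 PDO crashes for every severe injury crash.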
The summary statistics give the counts and percentages of different types of weather conditions for severe injury, moderate injury, and PDO crashes. Cloudy weather is the most common weather condition in all three categories, with 63.4% of severe injury crashes, 57.6% of moderate injury crashes, and 56.6% of PDO crashes occurring under cloudy conditions. Rain is the second most common weather condition, followed by snow, fog, smog, and smoke. Sleet, hail, and freezing rain/drizzle are less common, with blowing sand and dirt being the least common weather condition for all three severity types.
The dataset presented in Table 2 underwent initial preprocessing to facilitate model selection and evaluation. Initially, the comprehensive dataset was partitioned into two distinct subsets: a training set and a testing set. The test dataset was formed by random sampling with a distribution proportional to the size of each constituent class. Consequently, 20% of the data, drawn proportionally from each class, was extracted to constitute the testing set. A random seed was fixed so that the split is reproducible and observations assigned to the testing set are not drawn into the training data when the data is resampled. The remaining 80% of the data served as the control training set, which was subjected to the different treatment methods to generate the corresponding treated training datasets.
Descriptive Statistics of Variables for Analysis
Note: PDO = property damage only.
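The stratified 80/20 split described in this section can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the study; the function name `stratified_split`, the seed value, and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with each class sampled in proportion."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])    # 20% of this class goes to testing
        train_idx.extend(indices[n_test:])   # the rest is the control training set
    return sorted(train_idx), sorted(test_idx)


# Hypothetical, heavily imbalanced labels for illustration only.
toy_labels = ["severe"] * 10 + ["moderate"] * 50 + ["PDO"] * 140
train_idx, test_idx = stratified_split(toy_labels)
```

Any oversampling would then be applied to the training indices only, so that the testing set keeps the original class distribution.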
Treatments for Imbalanced Data
In the domain of machine learning, there are a variety of approaches that can be employed to tackle imbalanced crash data. The consequence of neglecting this discrepancy can result in a biased model, particularly for the minority class. In this study, two data imbalance treatment techniques were employed to assess how they influence the prediction of crash severity related to weather conditions.
Synthetic Minority Oversampling Technique-Nominal
The SMOTE, introduced by Chawla et al. ( 7 ), addresses class imbalance by oversampling minority instances through the creation of synthetic data points. The SMOTE-N, as presented in Algorithm 1, is an extension designed for nominal datasets, using a modified version of the value difference metric (MVDM) to measure the distance between categorical feature values. The MVDM considers the occurrences of values and their response classes. The distance between feature vectors is calculated using a weighted Euclidean or Manhattan distance. In the case of the SMOTE-N, weights are often disregarded, as it primarily aims to balance data distribution between classes rather than direct classification.
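The following Python sketch illustrates the idea under simplifying assumptions: the value difference between two categories is the summed absolute difference of their class-conditional proportions, feature weights are ignored, and each synthetic feature value is the majority vote over a minority case and its k nearest minority neighbors. It is not a reproduction of Algorithm 1; all function names and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def vdm_tables(rows, labels):
    """For each feature, map each value to its proportions P(class | value)."""
    classes = sorted(set(labels))
    tables = []
    for j in range(len(rows[0])):
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[j]][label] += 1
        tables.append({
            value: [c[cls] / sum(c.values()) for cls in classes]
            for value, c in counts.items()
        })
    return tables


def vdm_distance(a, b, tables, r=1):
    """Unweighted distance: sum over features of delta(a_j, b_j) ** r."""
    dist = 0.0
    for j, table in enumerate(tables):
        delta = sum(abs(p - q) for p, q in zip(table[a[j]], table[b[j]]))
        dist += delta ** r  # r=1 is Manhattan-like, r=2 Euclidean-like
    return dist


def smote_n(rows, labels, minority, n_new, k=5, seed=0):
    """Generate n_new synthetic nominal rows for the minority class."""
    rng = random.Random(seed)
    tables = vdm_tables(rows, labels)
    minority_rows = [row for row, lab in zip(rows, labels) if lab == minority]
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority_rows))
        base = minority_rows[base_idx]
        others = [row for i, row in enumerate(minority_rows) if i != base_idx]
        neighbours = sorted(
            others, key=lambda row: vdm_distance(base, row, tables)
        )[:k]
        # Each feature takes the most common value among base and neighbours.
        voters = [base] + neighbours
        synthetic.append(tuple(
            Counter(row[j] for row in voters).most_common(1)[0][0]
            for j in range(len(base))
        ))
    return synthetic


# Hypothetical toy data: (weather, crash type) with a severity label.
toy_rows = [
    ("rain", "rear-end"), ("rain", "rear-end"), ("fog", "head-on"),
    ("rain", "head-on"), ("fog", "head-on"), ("cloudy", "rear-end"),
    ("cloudy", "rear-end"), ("cloudy", "sideswipe"),
]
toy_labels = ["severe", "PDO", "severe", "severe", "PDO", "PDO", "PDO", "PDO"]
new_rows = smote_n(toy_rows, toy_labels, minority="severe", n_new=3, k=2)
```

Because the synthetic values come from a vote rather than from interpolation, every generated row contains only categories that actually occur among minority cases.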
Adaptive Synthetic-Nominal Technique
The ADASYN-N technique is designed to address class imbalance with nominal predictors as an extension of the ADASYN technique ( 8 ) initially intended for numeric predictors. In the ADASYN-N technique, as presented in Algorithm 2, the aim is to generate synthetic data for the minority class in a way that prioritizes events that are challenging to learn. The process has a training dataset containing n samples, where xi represents data in a p-dimensional feature space X and yi is the class label. The number of synthetic data points to be generated (G) is determined based on the desired level of balance, controlled by the parameter β (ranging from 0 to 1). For each xi in the minority class, the algorithm calculates k-nearest neighbors in the p-dimensional space and computes ri, which represents the ratio of majority events among these nearest neighbors. Higher ri values indicate events that are more challenging to learn. These ri values are normalized, ensuring that the sum of normalized ri values equals 1.
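The allocation step can be sketched in Python as below. The sketch assumes the formulation of the original ADASYN technique ( 8 ), in which G = (number of majority cases − number of minority cases) × β and each minority case receives a share of G proportional to its normalized ri; Algorithm 2 may differ in detail, and the function name and toy inputs are hypothetical.

```python
def adasyn_allocation(majority_neighbour_counts, k, n_majority, n_minority, beta=1.0):
    """Return the number of synthetic samples to create around each minority case.

    majority_neighbour_counts[i] is the number of majority-class cases among
    the k nearest neighbours of minority case i.
    """
    G = (n_majority - n_minority) * beta           # total synthetic samples needed
    r = [count / k for count in majority_neighbour_counts]
    total_r = sum(r)
    if total_r == 0:                               # no minority case is hard to learn
        return [0] * len(r)
    r_hat = [value / total_r for value in r]       # normalized so that sum(r_hat) == 1
    return [round(value * G) for value in r_hat]   # g_i = r_hat_i * G


# Hypothetical example: four minority cases, k = 5, and 24 majority cases.
toy_allocation = adasyn_allocation([5, 3, 1, 1], k=5, n_majority=24, n_minority=4)
```

In this toy example, G = 20 and the minority case surrounded entirely by majority neighbors receives half of the synthetic samples, which is how the technique concentrates on the harder-to-learn cases.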
Feature Selection
A variable selection technique called permutation feature importance (PFI) was employed as presented in Algorithm 3. The idea behind this technique, as presented in Figure 1, is to destroy the information in a feature of interest $x_j$ by randomly permuting its values.

Permutation feature importance.
The errors without permuting the features and with permuted feature values are measured. Repetition of the feature permutation was done 500 times and the average of the differences of both errors was computed. Figure 2 shows how the result of the PFI is interpreted. For example, if crash type is the most important variable, it implies that destroying information about crash type by permuting it increases the error of the model the most. This interpretation of the PFI helps one to select features of the model that are most sensitive and guides model refinement.

Permutation feature importance and its interpretation.
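The procedure described above can be sketched in Python as follows. This is a simplified illustration rather than Algorithm 3: the error measure is assumed to be the misclassification rate, and the function names, the toy model, and the toy data are hypothetical.

```python
import random


def permutation_importance(predict, rows, labels, feature, n_repeats=500, seed=0):
    """Mean increase in misclassification error after permuting one feature."""
    rng = random.Random(seed)

    def error(data):
        wrong = sum(predict(row) != label for row, label in zip(data, labels))
        return wrong / len(labels)

    baseline = error(rows)  # error without permuting any feature
    increases = []
    for _ in range(n_repeats):
        column = [row[feature] for row in rows]
        rng.shuffle(column)  # destroy the link between this feature and the label
        permuted = [
            row[:feature] + (value,) + row[feature + 1:]
            for row, value in zip(rows, column)
        ]
        increases.append(error(permuted) - baseline)
    return sum(increases) / n_repeats


def toy_predict(row):
    """Hypothetical model that only looks at feature 0 (crash type)."""
    return "severe" if row[0] == "head-on" else "PDO"


toy_rows = [
    ("head-on", "rain"), ("rear-end", "fog"),
    ("head-on", "fog"), ("rear-end", "rain"),
] * 5
toy_labels = [toy_predict(row) for row in toy_rows]

importance_used = permutation_importance(
    toy_predict, toy_rows, toy_labels, feature=0, n_repeats=50)
importance_unused = permutation_importance(
    toy_predict, toy_rows, toy_labels, feature=1, n_repeats=50)
```

Permuting the feature that the toy model relies on raises its error, whereas permuting the feature it ignores leaves the error unchanged, which mirrors the interpretation given for Figure 2.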
Machine Learning Techniques
Extreme Gradient Boosting
An XGBoost model is an ensemble of decision trees, where each subsequent tree tries to correct the errors made by the previous ones. It is an iterative process that aims to minimize a loss function. Mathematically, the prediction $\hat{y}$ for an input $x$ from an ensemble of $K$ trees is
$$\hat{y} = \sum_{k=1}^{K} f_k(x)$$
where $f_k(x)$ is the prediction of the $k$th tree.
The objective function that XGBoost tries to minimize is
$$\mathrm{Obj}(\theta) = L(\theta) + \Omega(\theta)$$
where $\theta$ represents the parameters of the model, $L(\theta)$ is the training loss function, and $\Omega(\theta)$ is a regularization term that controls the complexity of the model.
In the case of a classification problem such as traffic crash severity prediction, $L(\theta)$ is typically the log loss for binary classification, or the softmax loss for multiclass classification. The method is called “gradient boosting” because each new tree is trained to predict the negative gradient (or “residual”) of the loss function with respect to the current predictions, so that gradient information is used to boost the performance of the ensemble. The regularization term $\Omega(\theta)$ in the objective function is what distinguishes XGBoost from regular gradient boosting. In XGBoost, $\Omega(\theta)$ is expressed as
$$\Omega(\theta) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
where $T$ is the number of leaves in the tree, $w_j$ represents the scores on the leaves, $\gamma$ controls the complexity of the model (the number of leaves in the trees), and $\lambda$ controls the L2 regularization on the leaf scores. This regularization term helps to prevent overfitting by penalizing complex models.
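As a small worked example, the sketch below evaluates the penalty for a single tree and the additive ensemble prediction. It assumes the standard XGBoost form of the penalty, γT plus one half of λ times the sum of squared leaf scores; the function names and the numbers are hypothetical.

```python
def xgb_regularization(leaf_weights, gamma, lam):
    """Penalty for one tree: gamma * T + 0.5 * lambda * sum(w_j ** 2)."""
    T = len(leaf_weights)  # number of leaves in the tree
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)


def boosted_score(trees, x):
    """Additive ensemble prediction: the sum of the outputs of all trees."""
    return sum(tree(x) for tree in trees)


# Hypothetical tree with three leaves.
omega = xgb_regularization([0.5, -0.5, 1.0], gamma=1.0, lam=2.0)

# Hypothetical ensemble of two constant "trees".
score = boosted_score([lambda x: 1.0, lambda x: 2.0], x=None)
```

Here the penalty is 1.0 × 3 + 0.5 × 2.0 × (0.25 + 0.25 + 1.0) = 4.5, so adding leaves or enlarging leaf scores both increase the objective unless the training loss falls by more.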
Random Forest Model
The RF model is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Suppose a RF model consists of N decision trees ( 40 ). Each tree gives a classification, and the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest). Each tree is grown as follows.
If the number of cases in the training set is n, then n cases are sampled at random, with replacement, from the original data. This sample is the training set for growing the tree.
If there are M input variables, a number m is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
Each tree is grown to the largest extent possible and there is no pruning.
For a given test record, each tree in the forest gives a classification, and the forest chooses the class with the most votes over all the trees; in the case of regression, the outputs of the trees are averaged instead. Mathematically, the classification prediction of an RF model with $N$ trees for an input $x$ can be written as
$$\hat{y} = \operatorname{mode}\{f_1(x), f_2(x), \ldots, f_N(x)\}$$
where $f_i(x)$ is the prediction of the $i$th decision tree. The model would take input features of a traffic incident (such as speed, weather condition, time of day, etc.) and output a crash severity class (severe injury, moderate injury, or PDO). The model would be trained on a labeled dataset, and the aim would be to minimize the discrepancy between the predicted and actual labels. The RF’s ability to combine multiple decision trees helps it to avoid overfitting and generally results in a robust prediction performance.
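The bootstrap sampling and majority voting steps can be sketched in Python as follows. This is a minimal illustration, not a full RF implementation: tree growing and the random selection of m variables at each node are omitted, and the function names and the toy trees are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) cases at random, with replacement, from rows."""
    return [rng.choice(rows) for _ in rows]


def forest_predict(trees, x):
    """Majority vote over the class predicted by each tree."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for three fitted trees.
toy_trees = [
    lambda x: "PDO",
    lambda x: "moderate injury",
    lambda x: "PDO",
]
toy_prediction = forest_predict(toy_trees, x={"weather": "rain"})

# Bootstrap sample of a hypothetical training set of ten case identifiers.
toy_sample = bootstrap_sample(list(range(10)), random.Random(0))
```

Two of the three toy trees vote for PDO, so the forest outputs PDO; the bootstrap sample has the same size as the original set but may repeat some cases and omit others.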
Results
The results of data imbalance treatment using the SMOTE-N and ADASYN-N technique are presented in this section. The datasets produced by both treatment methods are used to train the RF and XGBoost models. The dependent variable has three ordered levels: “1” is severe injury crashes, “2” is moderate injury crashes, and “3” is PDO crashes.
Feature Extraction
Figure 3 represents the PFI as determined by the RF model. Descriptions of the variables on the y-axis are provided in Table 2. The PFI is a technique for estimating the contribution of each feature to the predictive power of a model by observing the effect on model performance of randomly permuting the values of that feature, one feature at a time ( 31 ). In this study, the RF model was trained on a representative subset of the dataset because this technique is computationally demanding. In the plot, each bar corresponds to a specific feature in the dataset, and the length of the bar corresponds to the importance of that feature. Positive values indicate that the performance of the model decreases when the feature is shuffled, suggesting that the model relies on the feature to make accurate predictions: once the feature is shuffled, the model loses access to the information it contains. Conversely, negative values indicate that the performance of the model improves when the feature is shuffled, suggesting that the model might be overfitting to noise in the feature.

Permutation feature importance plot.
The top three features obtained from this analysis are crash type, vehicle type, and locality, suggesting that these features are the most important for predicting the severity of a crash according to the trained model. However, it is important to note a few caveats. Firstly, while PFI provides a useful way to rank the importance of features, it does not provide any information about the nature of the relationship between each feature and the target variable ( 28 ). Secondly, it is possible that important features might appear unimportant if they are highly correlated with other features ( 28 , 33 ). Finally, the results might differ if the analysis were performed on the full dataset or a different sample because this analysis was performed on a sample of the original dataset.
Comparing Treatment Methods and Machine Learning Techniques
The model performance metrics are summarized in Table 3. The confusion matrices for the RF and XGBoost models are summarized in Table 4.
Model Performance Metrics
Note: RF = random forest; XGBoost = extreme gradient boosting; SMOTE-N = synthetic minority over-sampling technique-nominal; ADASYN-N = adaptive synthetic-nominal; PDO = property damage only.
Confusion Matrix for the Random Forest (RF) and Extreme Gradient Boosting (XGBoost)
Note: SMOTE-N = synthetic minority over-sampling technique-nominal; ADASYN-N = adaptive synthetic-nominal; PDO = property damage only.
Best Treatment Method
The focus is to assess the effect of the selected methods on the classification of crash severity into three categories, severe injury (SI), moderate injury (MI), and PDO, as seen in Figure 4. The primary evaluation metric used was the F1 score, which considers both precision and recall, providing a balanced performance assessment. For the RF model, the best-performing data imbalance treatment method for moderate injury crashes was the ADASYN-N technique, which achieved a test F1 score of 45.42 and improved predictions for this category. However, the control dataset proved most effective for PDO crashes, with a test F1 score of 82.47. The ADASYN-N technique was also identified as the best method for addressing the imbalance in the case of severe injury crashes, achieving a test F1 score of 16.14 and improving the model’s ability to predict severe injury crashes.

Performance metrics on the test dataset.
For the XGBoost model, the data imbalance treatment method that yielded the highest test F1 score of 44.96 was the ADASYN-N technique for predicting moderate injury crashes. This data treatment method effectively handled the imbalance in the moderate injury crash category, leading to improved predictions for this severity level compared to the control dataset. The SMOTE-N emerged as the best-performing method for PDO crashes, with a test F1 score of 72.36. The SMOTE-N successfully mitigated the class imbalance issue, resulting in more accurate predictions for PDO crashes. The SMOTE-N was identified as the most effective data imbalance treatment method for severe injury crashes, achieving a test F1 score of 18.21. The SMOTE-N significantly improved the predictive performance for this severity category.
A notable consideration is that the choice of evaluation metric has a significant effect on the assessment of data imbalance treatment methods. While the F1 score was used as the primary metric, other metrics prioritizing different aspects of classification performance may lead to different rankings of the methods. For example, if recall is prioritized over precision, the best-performing data imbalance treatment method might change.
In machine learning, the difference in performance between the training and test datasets can provide valuable insights into how well the model is learning from the data. In this study, the F1 score was used to evaluate this, which balances both precision and recall. A large difference between the training and test F1 scores could indicate overfitting, while a small or negative difference might suggest underfitting or a good fit, depending on the absolute scores.
Model Fit on Datasets
Overfitting is a common issue in machine learning where the model learns the training data so well that it includes noise or random fluctuations. This results in a model that performs extremely well on the training data but poorly on unseen test data (Figure 4). Potential overfitting was observed in some scenarios.
The XGBoost model on the control dataset for severe injury crashes demonstrated the most pronounced case of overfitting, with an F1 score difference of 86.77.
The RF model with the SMOTE-N method for severe injury crashes and the RF model with the ADASYN-N method for severe injury crashes also showed large F1 score differences, indicating potential overfitting.
In these cases, the models achieved very high F1 scores on the training data, suggesting they were able to capture the nuances of the training data very well. However, their performance dropped significantly on the test data (Figure 4), suggesting that they may have overfit to the noise or specific patterns in the training data that do not generalize to the unseen data. On the other hand, a small or negative difference in F1 scores between the training and test datasets could suggest that the model is underfitting or fitting the data well. Underfitting happens when a model is too simple to capture the complexity of the data, performing poorly on both the training and test data. If the model fits the data well, it will have good performance on both datasets. From our analysis, the following scenarios demonstrate good fitting:
the XGBoost model with the ADASYN-N method for PDO crashes has an F1 difference of 7.01;
the XGBoost model with the SMOTE-N method for PDO crashes has an F1 difference of 7.60;
the RF model on the control dataset for PDO crashes has an F1 difference of 13.64.
In these cases, the models achieved reasonable F1 scores on both the training and test data, suggesting they were able to generalize well to the unseen data. These observations highlight the importance of evaluating the performance of a model not only on the training data but also on unseen test data. If a model is overfitting, strategies such as simplifying the model, collecting more data, or using regularization or cross-validation might help to improve its generalization. If a model is underfitting, making the model more complex or engineering better features could help it capture the underlying patterns in the data.
Discussion
The results shed light on the effectiveness of data imbalance treatment methods when used with different machine learning algorithms in the context of weather-related crash severity analysis. The findings demonstrate that the choice of method significantly influences the model’s ability to handle imbalanced data, and that the effect varies depending on the severity category being predicted and the specific machine learning algorithm being used. One noteworthy observation is the consistent effectiveness of the ADASYN-N method for addressing moderate and severe injury crashes in both the RF and XGBoost models. The ADASYN-N method appears to be a robust approach for generating synthetic samples of the minority class, effectively balancing the class distribution and enhancing the model’s performance in predicting severe and moderate injury crashes. The consistent success of the ADASYN-N method across different algorithms indicates its potential as a reliable data imbalance treatment method for these crash severity levels. Fountas et al. ( 41 ) argued that the generation of data for weather-related crashes involves underlying latent processes that should be considered during analysis. Recognizing these processes is essential for enhancing crash severity prediction accuracy, and various statistical methods have been proposed to address this issue ( 41 – 43 ). The ADASYN-N method appears to address this limitation and to be more resilient to the heterogeneity in the dataset, consistently performing well across the algorithms tested.
The ADASYN-N method’s effectiveness in predicting moderate and severe injuries can be attributed to the specific strengths of this approach. Unlike standard oversampling techniques, the ADASYN-N method adaptively adjusts the synthetic sample generation based on the learning difficulty of minority classes. This feature is beneficial for severe and moderate injury crashes, which might have more complex or less consistent patterns than PDO crashes. By focusing more on the harder-to-learn examples, the ADASYN-N method helps create a more balanced and representative training set for these types of crashes.
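The adaptive allocation described above can be sketched as follows. This is a minimal illustration of the generic ADASYN allocation rule (each minority sample receives synthetic samples in proportion to the share of majority-class samples among its k nearest neighbors), not the study's implementation. The Hamming distance for nominal predictors, the function names, and the toy records are all assumptions made for illustration.

```python
def hamming(a, b):
    """Number of nominal attributes on which two records differ (assumed distance)."""
    return sum(x != y for x, y in zip(a, b))


def adasyn_allocation(X, y, minority, k=2, n_synthetic=6):
    """Map each minority-sample index to the number of synthetic samples to seed from it."""
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    ratios = {}
    for i in minority_idx:
        # Rank all other records by distance to record i and keep the k closest.
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: hamming(X[i], X[j]))
        neighbours = others[:k]
        # Learning difficulty: share of neighbours that belong to another class.
        ratios[i] = sum(y[j] != minority for j in neighbours) / k
    total = sum(ratios.values())
    if total == 0:
        # No minority sample has a majority-class neighbour: split evenly instead.
        return {i: n_synthetic // len(minority_idx) for i in minority_idx}
    return {i: round(n_synthetic * r / total) for i, r in ratios.items()}


# Toy records (weather, surface, light) invented for illustration only.
X = [
    ("rain", "wet", "dark"),    # 0: severe, surrounded by PDO records
    ("rain", "wet", "dusk"),    # 1: pdo
    ("rain", "wet", "day"),     # 2: pdo
    ("snow", "icy", "dark"),    # 3: severe, surrounded by severe records
    ("snow", "icy", "dusk"),    # 4: severe, mixed neighbourhood
    ("clear", "dry", "day"),    # 5: pdo
    ("clear", "dry", "dusk"),   # 6: pdo
]
y = ["severe", "pdo", "pdo", "severe", "severe", "pdo", "pdo"]

allocation = adasyn_allocation(X, y, minority="severe")
print(allocation)
```

In this toy example, the severe record whose nearest neighbors are all PDO crashes receives the largest share of synthetic samples, while the record surrounded by other severe crashes receives none, which is the sense in which the method concentrates on harder-to-learn examples.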
The notable performance of the ADASYN-N method in predicting moderate injury crashes suggests that this category, while not as rare as severe injury crashes, still benefits significantly from a more balanced representation in the training data. The method’s data generation process helps the model better capture the variations in moderate injury crashes, improving its predictive performance for this class. While both the ADASYN-N and SMOTE-N techniques improve the model’s ability to predict severe injury crashes, the improvement is marginal. In addition, the model may become too attuned to the nuances of the training data; that is, there is a risk of overfitting, which reduces its generalizability to real-world situations. In contrast, the prediction of moderate injury crashes, supported by a larger dataset, tends to be more robust and generalizable.
The marginal improvement in severe crash prediction achieved in this study, while seemingly small, is nonetheless meaningful, especially considering the high stakes involved in severe crash scenarios and the challenges presented by the highly imbalanced dataset. Even a slight enhancement in predicting severe crashes can be crucial. These crashes, although less frequent, often have the most devastating consequences, including fatalities or serious injuries ( 44 ). Therefore, any improvement in accurately identifying potential severe crashes, no matter how marginal, is valuable as it could contribute to life-saving interventions, more effective emergency response planning, and targeted safety measures ( 45 ). In the scenario of weather-related crashes, a marginal improvement indicates that the model is overcoming some of the inherent biases and learning meaningful patterns related to severe crashes.
High model performance on training data is often achievable with complex models. However, these models may not generalize well to unseen data, which is a phenomenon known as overfitting ( 46 ). Conversely, a model that is too simple may not perform well even on the training data, leading to underfitting, and therefore also lacking generalization ( 46 , 47 ). Pronounced overfitting was observed in some cases, notably the XGBoost model on the control dataset for severe injury crashes and the RF model using both the SMOTE-N and ADASYN-N methods for the same crash category. This is evidenced by high F1 scores on training data but a substantial drop in performance on test data, indicating that these models captured training data nuances, including noise, which did not generalize well to unseen data. This result shows that while these methods can improve model learning for underrepresented classes, they also introduce the risk of overfitting, especially when the synthetic samples do not perfectly represent the real-world data distribution ( 32 ). In such cases, the importance of interpretability becomes paramount; understanding the model’s decision-making process allows practitioners to discern whether predictions are based on genuine patterns or artifacts of the synthetic data ( 48 ). Therefore, one must weigh the benefits of improved predictions against the potential loss of transparency when settling on a machine learning model, ensuring the chosen model maintains a level of interpretability that supports reliable and actionable insights in practical applications.
Furthermore, the study highlights the superior performance of the SMOTE-N for the XGBoost model in predicting PDO crashes. This observation suggests that the SMOTE-N effectively addresses the class imbalance present in the PDO crashes category for the XGBoost algorithm. This finding emphasizes the significance of selecting data imbalance treatment methods tailored to the characteristics of the machine learning algorithm and crash severity type to achieve optimal performance.
Conclusions
This study demonstrates the importance of choosing appropriate data treatment techniques, specifically the SMOTE-N and ADASYN-N techniques, for handling nominal predictors in machine learning models applied to weather-related traffic crash severity prediction. The effectiveness of these methods varies depending on the crash severity level being predicted and the machine learning algorithm used. The ADASYN-N technique proved particularly effective in balancing class distribution and enhancing the accuracy of both the RF and XGBoost models for severe and moderate injury crash predictions. In contrast, the control dataset showed notable efficiency with the RF model for predicting PDO crashes, while the SMOTE-N significantly improved the XGBoost model’s performance in the same category. This indicates that the XGBoost model benefited from a more balanced dataset provided by SMOTE-N, which likely offered more examples of the minority class to learn from. The high prediction accuracy of PDO crashes in the RF model is not unusual, as the model is biased toward predicting the majority class.
The study establishes a clear benchmark for practitioners in selecting suitable techniques to generate synthetic samples for addressing underrepresented crash categories, such as severe and moderate injury crashes. By demonstrating the varying effectiveness of the SMOTE-N and ADASYN-N techniques across different machine learning models and crash severity levels, this study provides valuable insights for optimizing data treatment in crash severity prediction. The insights gained from this study are instrumental in guiding the development of more accurate and reliable crash severity prediction models, ultimately aiding in better informed and more effective traffic safety measures and policy decisions.
Despite these promising findings, further research is warranted to explore the applicability of the introduced synthetic data generation technique to other weather-related datasets and enhance its performance in real-world scenarios. Assessing the technique’s scalability and efficiency in handling large-scale datasets would be crucial, particularly for real-time applications or big data scenarios. In addition, investigating the technique’s effectiveness on more complex and diverse datasets would contribute to a deeper understanding of its potential in various crash severity analysis contexts.
This study demonstrates that the SMOTE-N and ADASYN-N techniques are effective data imbalance treatment methods for improving weather-related crash severity prediction in the RF and XGBoost models. The effectiveness varies by crash severity category, with the ADASYN-N technique excelling in severe and moderate injury crash predictions and the SMOTE-N in PDO crash predictions. This study provides valuable guidance for researchers in choosing suitable techniques to create synthetic samples, particularly for underrepresented categories in weather-related crash severity prediction. The study paves the way for transportation agencies to develop more accurate crash severity prediction models. Such models are essential in enhancing road safety measures and building safer as well as more resilient roadways, especially in adverse weather conditions, ultimately protecting road users.
Footnotes
Author Contributions
The authors confirm contribution to the paper as follows: study conception and design: S.S. Pulugurtha, A. Ogungbire; data collection: A. Ogungbire; analysis and interpretation of results: A. Ogungbire; draft manuscript preparation: A. Ogungbire, S.S. Pulugurtha. All authors reviewed the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This paper is partially prepared based on information collected for a research project funded by the United States Department of Transportation – Office of the Assistant Secretary for Research and Technology (USDOT/OST-R) University Transportation Centers Program (Grant # 69A3551747127).
This paper is disseminated in the interest of information exchange. The views, opinions, findings, and conclusions reflected in this paper are the responsibility of the authors only and do not represent the official policy or position of the USDOT/OST-R, or any other State, or The University of North Carolina at Charlotte or other entity. The authors are responsible for the facts and the accuracy of the data presented here. This paper does not constitute a standard, specification, or regulation.
