Abstract
Road crash prediction is a key component in the design of efficient intelligent transportation systems. In recent years, the transportation safety research community has made pronounced progress in applying machine learning models to crash event assessment. However, little attention has been paid so far to evaluating reduced-visibility crash occurrences within a heuristic ensemble system. This study presents a proactive multicriteria decision-making system that predicts crash occurrences from real-time roadway properties, land zone characteristics, vehicle telemetry, driver inputs and weather conditions collected using a desktop driving simulator. A key novelty of this work is the combination of a genetic algorithm-based feature selection approach with ensemble modeling strategies (AdaBoost, XGBoost and RF) to establish effective crash predictions. Furthermore, since crash events are rare and therefore underrepresented in the dataset, an imbalance-learning methodology was adopted to increase predictive performance, based on several data resampling approaches, namely SMOTE, Borderline-SMOTE, SMOTE-Tomek Links and ADASYN. To our knowledge, there has been limited interest in adopting an ensemble-based imbalance-learning strategy to examine the impact of combinations of real-time features on the prediction of road crash events under reduced visibility settings.
Keywords
Introduction
It is a distressing fact that the number of road traffic accidents is continuously increasing. Road accidents are recognized as one of the most significant and intimidating issues that societies face today, leading to many health problems, financial losses and fatalities. The World Health Organization [96] reports that 1.35 million people die in road traffic crashes every year, and a further 20–50 million are injured or disabled worldwide. In Morocco, an average of 10 civilians are killed and another 33 are seriously wounded every (Ministry of Equipment and Transport 2017). Vehicle telemetry and driver inputs have been shown to have a strong impact on the recognition of crash events [8,23,73]. Likewise, relevant research has outlined the major implications of roadway design and land zone properties for safe driving and road crash analysis [7,93]. On another note, reduced visibility settings such as adverse weather conditions and night-time driving have been considered a significant factor affecting road crash frequency [20,24,37]. However, little to no effort has been made to evaluate the impact of real-time information combining land zone characteristics and roadway properties with weather conditions, driver inputs and vehicle telemetry on the prediction of crash events for drivers at night. Consequently, understanding under what factors road crashes occur, and which aspects increase the likelihood of a car crash in low-visibility configurations, would have a substantial influence on designing productive policy interventions to prevent accidents from happening. Within this context, an effective road crash analysis system under reduced visibility settings is undeniably of great necessity and utility.
A large number of studies have been conducted on the prevention of road traffic accidents. Crash occurrence is a complex phenomenon, affected by several key factors such as driving behavior and environmental conditions [5,22]. As such, the reasons behind motor vehicle collisions are multifaceted. Crash events have been linked to factors such as skill level [44], lack of experience [21], and risk-taking behaviors [68]. Investigations into collision records have also implicated factors like excessive speed and acceleration [98], reckless driving [6], traffic violations [49], driver stress along with other physiological and psychological traits [16,23], and substance use [36]. Collectively, these findings highlight the role of inexperience, lack of skill, the driver’s mental and physical state, as well as traffic violations and risk-taking behaviors in the occurrence of crash events. Furthermore, these contributing factors appear to be influenced by driver gender, as young male drivers are more likely than young females to be involved in collisions due to risk-taking behaviors like excessive speeding and impairment by drugs and alcohol [25,81]. On another note, the effects of reduced visibility settings such as time of day and inclement weather on traffic maneuvers and safety outcomes have become a major issue. In fact, the number of crashes that occur at night has been found to be higher than during the day [3,62]: the average number of vehicle kilometers driven at night accounts for less than 20% of the total, but 40–50% of traffic fatalities occur at night, while crash severity is at least two times higher during night hours than during the day [76]. Moreover, adverse weather conditions are considered among the most perilous settings, since drivers tend to adapt their driving behavior to the conditions presented by inclement weather [1].
For these reasons, various intelligent systems have been integrated into mass-produced vehicles to reduce the likelihood and severity of traffic accidents. These intelligent transportation systems are specifically designed to anticipate potential collisions, issue warnings to drivers, or even take autonomous actions [45,77]. Both experimental studies and market feedback have provided evidence of the effectiveness of crash prediction systems in enhancing driving safety and receiving positive evaluations from drivers [58,80].
Hazardous traffic conditions and unsafe driving behaviors have been examined in numerous previous studies in order to characterize road crashes and develop efficient real-time traffic management strategies [46,61,99]. Although these studies present practical insights into the assessment of unsafe driving behavior, work examining the impact of real-time data captured from the vehicle, driver, weather, roadway design and land zone categories remains relatively limited. Vehicle telemetry and driver inputs have been found to have a strong effect on the analysis of road accidents [4,48]. With respect to land zone characteristics and roadway properties, multiple scholars have examined the underlying factors of route design and different road stretches and how they result in a significant change in driving conduct [26,39]. Weather conditions, in turn, were found to be the primary cause of more than 1.25 million accidents (21% of all vehicle crashes), leading to about 418,000 injuries (19% of crash injuries) and nearly 5000 casualties (16% of all casualties) [31]. Yet most of the research that included weather variables in crash analysis adopted data acquired from police crash reports, which may be inaccurate, as the reported conditions may be what was observed by the person filling in the crash report rather than the actual weather status at the time of the accident [23,71]. In this work, real-time data was captured using a driving simulator. The use of simulation experiments in transportation safety research has been increasing in recent years, as they imitate driving conduct in a safe environment, with the major gain of full experimental control over all contexts and the ability to examine numerous design structures. Moreover, it would be very dangerous to carry out crash analysis experiments in a real driving environment [23,27].
The adopted route scheme incorporated multiple roadway properties, namely mountainous, steep and flat sections, and trials were carried out under six adverse weather patterns: light fog, heavy fog, light rain, heavy rain, light snow and heavy snow. As for land zone characteristics, four types were considered, namely industrial, commercial, residential and rural areas. The driver input actions (e.g. wheel angle and pedal positions) and vehicle telemetry (e.g. acceleration and tire temperature) were continuously recorded during the simulations and preprocessed in order to pinpoint the most relevant crash event precursors.
In crash prediction research, statistical learning-based algorithms such as linear regression [81], Gaussian regression [75] and discriminant analysis [8] have been widely employed. Still, statistical techniques for crash event analysis frequently suffer from data quality issues, require large amounts of historical data, and deliver inconclusive outcomes when handling features with a large number of categories [84,91]. On the other hand, machine learning (ML)-based techniques have been found to outperform statistical modeling in predicting forthcoming events and have shown significant results in multiple transportation systems [57,69]. The major advantages of ML models are (i) their ability to handle strongly non-linear problems autonomously using datasets from multiple sources; (ii) their ability to easily incorporate new data in order to improve estimation performance; and (iii) their predictive and explanatory ability through the extraction of rules. The Adaptive Boosting (AdaBoost), eXtreme Gradient Boosting (XGBoost) and Random Forest (RF) techniques are among the most important machine learning models that have been used for crash event prediction [52,74,78,101]. Adopting AdaBoost to evaluate safety measures in road traffic analysis has shown effective dataset handling, with strong classification capability and few false predictions [23,72]. In parallel, XGBoost is widely known for its powerful predictive performance and fast training speed, and has been reported to outperform other machine learning techniques [60]. On another note, researchers have reported the effectiveness of RF in various fields, as it has been demonstrated to decrease variance compared to a single decision tree, and it is also robust to outliers and missing values [10].
One more instrumental factor in the analysis of road crash events is the ratio of crash to non-crash observations in the dataset. Generally, road accident datasets include relatively few crash samples compared to non-accident data points. Multiple scholars endorse the conventional proportion of 4 non-accident cases for each accident case [100,103,104]. However, this is likely to result in imbalance, as predictive learners prioritize the label with the greater number of instances, creating a bias toward the majority class and leading to an over-prediction of this class [23,24]. To handle this issue, resampling techniques are adopted to balance the class distributions, such as the traditional random oversampling (ROS) and random under-sampling (RUS) methods. However, ROS strategies prevent data loss by duplicating samples of the minority class, which could lead to over-fitting, whereas RUS approaches balance the ratio of the classes by eliminating observations from the majority class, which makes it likely to miss certain important information [63]. The Synthetic Minority Oversampling Technique (SMOTE), presented by [13] and regarded as one of the most powerful resampling algorithms, addresses the imbalance issue by producing synthetic instances from the minority class, and can overcome both information loss and overfitting [30]. In the case of skewed instances, SMOTE is efficient in determining identical but more specific sections in the feature dimension as the decision region for the minority class [12]. SMOTE has gained wide recognition and has a broad scope of practical applications, and many variants of SMOTE have been developed, such as Borderline-SMOTE (BL-SMOTE) [40], SMOTE-Tomek Link (SMOTE-TL) [9] and the ADAptive SYNthetic sampling technique (ADASYN) [42].
To ensure satisfactory performance on the imbalanced dataset, variations of the proposed balancing approaches were combined with the aforementioned prediction models. Furthermore, variable selection was carried out using the Genetic Algorithm (GA) [35], one of the most widely used optimization methods for finding solutions in complex and nonlinear search spaces. It has been naturally employed to solve feature selection problems; in fact, it is the first evolutionary algorithm extensively adopted for feature selection [90]. GA is commonly acknowledged as a powerful search technique, since it can search thoroughly in large spaces and obtain efficient global solutions, in contrast to traditional approaches such as filter and wrapper methods.
To the best of our knowledge, little to no research has examined the impact of several combinations of the five adopted feature categories, namely roadway properties, land zone characteristics, driver actions, vehicle telemetry and multiple weather conditions, on the prediction of road crashes at night. Moreover, a key novelty of this work is that ML models with good prediction performance were developed using a genetic algorithm-based feature selection approach. The work presented in this paper is an extension of our conference paper originally presented at the International Conference on Networking, Information Systems & Security [25], which aimed to analyze road crash occurrences for drivers during rainy conditions. In the present study, we design and validate a multicriteria decision-making system for the prediction of road crash events for novice drivers under reduced visibility settings. Additional material has been included in order to create a more in-depth research paper. The objectives of the proposed system are four-fold:
Include additional comprehensive real-time features (vehicle kinematics, driver inputs, weather conditions, roadway properties and land zone characteristics) acquired during supplemental driving simulations.
Identify the most influential factors contributing to the likelihood of crash events during night-time driving using the Genetic Algorithm.
Explore the effects of various combinations of the five adopted feature categories (vehicle kinematics, driver inputs, weather conditions, roadway characteristics and land zone attributes) on the occurrence of night-time crash events.
Construct numerous prediction models (XGBoost, AdaBoost and RF) using multiple synthetic resampling strategies, namely SMOTE, Borderline-SMOTE, SMOTE-Tomek and ADASYN, to handle the data imbalance problem.
Driving simulator experiment
Apparatus and participants
A total of 93 volunteers (69 males and 24 females) between the ages of 22 and 34 were recruited for the experiments. All were in good health and had normal or corrected-to-normal vision. Participants were informed only of the experiment’s general intentions and were naïve to the specific purpose of the study; all gave informed consent to the recording of their driving performance data.
The study was carried out using a fixed-base driving simulator located at the University of Cadi Ayyad (UCA) facility. Simulator driving experiments hold the significant advantage of reproducing the driving experience in a safe environment with complete experimental control over driving conditions, including all types of climate, terrain, and traffic [27]. Certainly, it would be very risky to carry out such trials in real road settings. The driving simulation was run through the Project Cars 2 simulator by Slightly Mad Studios, which has been adopted in numerous road safety studies [24–26,83,97], on a DELL XPS running Windows 10. Computational analyses were performed on the same platform and on a 2015 MacBook Pro with an i7 2.8 GHz chip, 16 GB RAM and an SSD hard drive. Participants viewed the simulation on a 27-inch LCD monitor with a resolution of 1920 × 1080 pixels positioned approximately 75 cm from the driver’s eyes, and heard audio through a surround speaker system. The computer was fitted with a Logitech® G27 Racing Wheel set (steering wheel, accelerator pedal, and brake pedal) with the adjustable Logitech Evolution® Playseat. Simulations were conducted with automatic gear selection, so a gear shifter was not needed. Figure 1 illustrates the simulator setup.

Experimental setup of the desktop driving simulator at UCA.
The subjects were assigned to a quiet laboratory to virtually drive the vehicle. An overview of the adopted layout of the driving route, with its flat and mountainous stretches, is presented in Fig. 2. The driving scenario was performed at night under different weather conditions (light fog, heavy fog, light rain, heavy rain, light snow and heavy snow) and was intended to simulate the various intricacies and features that real-world driving involves, in order to examine the effect of these factors on driving behavior and to collect enough raw data before the crash. The adopted protocol had similar traffic conditions and an identical number of external events for all participants.
Upon arrival, participants read through and signed an informed consent form to indicate their agreement to participate in the experiment and completed a questionnaire assessing their demographic characteristics and recent activities. The experimental session consisted of two separate visits to the simulator, where drivers were instructed to drive as they usually do in a real driving situation and to follow the traffic rules. The first visit was a practice session prior to the experiments, to ensure that the subjects became familiar with the simulated driving environment and comfortable with the vehicle controls. In the second visit, which was the main trial, drivers navigated the vehicle at night under three different adverse weather conditions along with several hazard scenarios located along the route, such as a surrounding road user (e.g. another vehicle). The simulation protocol incorporated different roadway characteristics, including mountainous sections as well as flat sections and steep segments. This set of climate and route factors serves to expose drivers to varied conditions while operating the vehicle.

Overview of the driving scenario route layout with the flat (a) and mountainous (b) stretches.
Data were continuously recorded throughout each drive with a sampling frequency of 20 Hz through the UDP protocol. The driving environment included representative buildings and landmarks, and participants were instructed to adhere to traffic regulations such as traffic lights and signs. It is important to note that crash events in studies of crash prediction are typically unexpected and infrequent. Thus, the road tests adopted in this study encompassed various elements commonly analyzed in crash events, such as making turns, changing lanes, and using signals. The aim was to investigate the impact of these factors on driving behavior and the occurrence of crash events. The scenarios were randomly generated by a virtual simulator to encourage participants to drive as they would in real-life situations, simulating a range of maneuvers similar to standard on-road drives. The goal was to create a driving environment that closely resembled participants’ everyday experiences.
The simulator collects vehicle telemetry (e.g. speed, acceleration), driver inputs (e.g. throttle/brake pedal position, steering wheel position), roadway properties (e.g. flat, mountain), land zone characteristics (e.g. industrial, residential), and weather data (i.e. season), in which the season feature represents the weather status at the time of the crash, namely fog, rain or snow, as shown in Fig. 3. Time to Collision (TTC) to the nearest road users was also computed and included within the vehicle kinematics. The TTC feature value was calculated on the basis of the method presented in [92], which covers multiple contexts of unconstrained vehicle activity. These are five categories of metrics that may affect traffic safety. It is well acknowledged that the driving simulator is a powerful and useful tool for studying how drivers behave on different types of land [11,27]. This makes it a very promising way to find new solutions to enhance road safety and operations. In this work, the adopted route layout included various land zone characteristics in the form of three urban land types, namely residential, commercial and industrial areas, along with a fourth, rural type, all of which were recorded during the driving trials by the virtual simulator. Residential lands are areas used solely or primarily for residential purposes, on which housing predominates. Commercial areas cover most types of non-residential properties, including restaurants, stores and any land typically used for businesses. Industrial zones refer to lands used for industrial activities such as production, manufacturing and warehousing. Lastly, rural lands are defined as properties located outside urban zones, with a low population density, where any existing structures are spread out. The dependent variable is the crash event, coded as a binary variable with a value of 1 if a crash was identified and 0 if not.
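The TTC feature in this study follows the method of [92], whose details are not reproduced here. As a simpler stand-in, the sketch below uses the textbook constant-velocity definition (gap divided by closing speed) for a vehicle following another in the same lane. The function and parameter names are hypothetical, and the one-dimensional setting is an assumption made only for illustration.

```python
import math


def time_to_collision(gap_m, follower_speed_mps, leader_speed_mps):
    """Constant-velocity TTC in seconds; math.inf if the gap is not closing."""
    closing_speed = follower_speed_mps - leader_speed_mps
    if closing_speed <= 0:
        return math.inf  # the follower is not catching up with the leader
    return gap_m / closing_speed


# A 30 m gap closing at 25 - 15 = 10 m/s.
print(time_to_collision(30.0, 25.0, 15.0))  # 3.0
```

A lower TTC indicates a more critical situation, which is why it is a natural candidate among crash precursors.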
Apart from the categorical features of land zone characteristics, roadway properties and weather conditions, all variables are continuous. Table 1 summarizes the grouping and definitions of all acquired features. A thorough data screening, including cleaning and consistency checks, was carried out to ensure data usability and validity for the analysis.
Within this context, approximately 75,900 samples were recorded during the simulations, of which about 3% are crash observations. This indicates that the data are extremely imbalanced, as has been found in similar studies related to crash prediction [47,54,59], which motivates the adoption of the endorsed data balancing strategies. Following prior work on intervention time [23,26,94,102], we extracted 12 s data segments, from 16 s to 4 s prior to each crash, as the crash data to validate the suitability of the suggested prediction strategy.
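To make the segment arithmetic concrete: at the 20 Hz sampling rate reported earlier, a 12 s segment spans 240 samples. The sketch below shows one plausible way of marking those samples as crash data. The helper names and the labeling convention (1 inside the pre-crash window, 0 elsewhere) are assumptions for illustration, not the authors' preprocessing code.

```python
SAMPLING_HZ = 20      # from the text: telemetry recorded at 20 Hz
WINDOW_START_S = 16   # the segment opens 16 s before the crash
WINDOW_END_S = 4      # the segment closes 4 s before the crash


def crash_window(crash_index, hz=SAMPLING_HZ,
                 start_s=WINDOW_START_S, end_s=WINDOW_END_S):
    """Return (first, last_exclusive) sample indices of the pre-crash segment."""
    first = crash_index - start_s * hz
    last_exclusive = crash_index - end_s * hz
    if first < 0:
        raise ValueError("not enough history before the crash")
    return first, last_exclusive


def label_samples(n_samples, crash_indices):
    """Label a sample 1 if it lies inside any pre-crash segment, else 0."""
    labels = [0] * n_samples
    for crash_index in crash_indices:
        first, last_exclusive = crash_window(crash_index)
        for i in range(first, last_exclusive):
            labels[i] = 1
    return labels


# Hypothetical drive of 2000 samples with one crash at sample 1000.
labels = label_samples(n_samples=2000, crash_indices=[1000])
print(sum(labels))  # 240 samples = 12 s * 20 Hz
```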

Weather conditions for driving scenarios: (a) heavy rain, (b) light rain, (c) heavy snow, (d) light snow, (e) heavy fog, (f) light fog.
Definitions of all the acquired features during driving simulations
The primary purpose of this work is to construct road crash prediction models by examining the most relevant inputs, creating pertinent feature combinations and developing practical machine learning techniques. First, we conduct a feature selection procedure using the GA technique to determine the strongest crash precursors. Then, we outline five feature combinations based on five distinct data inputs, namely Roadway properties (R), Driver inputs (D), Weather conditions (W), Vehicle telemetry (V) and Land zone characteristics (L), in order to compare the performance of the adopted models with different inputs. That is, three well-known classification methods, Adaptive Boosting (AdaBoost), eXtreme Gradient Boosting (XGBoost) and Random Forest (RF), along with four balancing techniques (SMOTE, Borderline-SMOTE, SMOTE-Tomek and ADASYN), are used to develop crash prediction models, which are compared to each other using 10-fold cross-validation. The following feature aggregations were used as input space:
RDWV: Roadway properties + Driver Inputs + Weather Conditions + Vehicle Telemetry.
VDWL: Vehicle Telemetry + Driver Inputs + Weather Conditions + Land zone characteristics.
RVLD: Roadway properties + Vehicle Telemetry + Land zone characteristics + Driver Inputs.
RVWL: Roadway properties + Vehicle Telemetry + Weather Conditions + Land zone characteristics.
ALL: Roadway properties + Driver Inputs + Vehicle Telemetry + Weather Conditions + Land zone characteristics.
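The five aggregations listed above can be expressed as a simple lookup from combination name to feature groups. The sketch below is illustrative only: the column names are hypothetical placeholders, since the actual feature definitions are those given in Table 1.

```python
# Feature groups per combination, in the order given in the text.
COMBINATIONS = {
    "RDWV": ["R", "D", "W", "V"],
    "VDWL": ["V", "D", "W", "L"],
    "RVLD": ["R", "V", "L", "D"],
    "RVWL": ["R", "V", "W", "L"],
    "ALL": ["R", "D", "V", "W", "L"],
}

# Hypothetical column names per group, for illustration only.
COLUMNS_BY_GROUP = {
    "R": ["road_type"],
    "D": ["throttle", "brake", "steering"],
    "W": ["season"],
    "V": ["speed", "acceleration", "ttc"],
    "L": ["land_zone"],
}


def select_columns(combination, columns_by_group=COLUMNS_BY_GROUP):
    """Flatten the column names of every group in one combination."""
    selected = []
    for group in COMBINATIONS[combination]:
        selected.extend(columns_by_group[group])
    return selected


print(select_columns("RVWL"))
```

Each four-group combination leaves out exactly one group, so comparing it against ALL indicates how much that group contributes.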
We developed 12 distinct types of classification models for each combination (three classifiers, each paired with four resampling strategies). Classification performance is evaluated using 10-fold cross-validation. In this methodology, the dataset is first partitioned into 10 roughly equal-sized, disjoint subsets. In each experiment, nine subsets are used for training and the remaining one is used for testing. This procedure is repeated 10 times for each of the 12 model types, so that every subset serves once as the test set. The test results are then aggregated to provide an “unbiased” estimate of the model’s performance. Figure 4 represents the endorsed methodology in this study.
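The evaluation protocol can be sketched with the standard library alone. The code below enumerates the 12 classifier and resampler pairings and produces 10 disjoint folds. The helper name is hypothetical, and the comment about fitting the resampler on training folds only is an assumption about good practice rather than a detail stated in the text.

```python
import itertools
import random

CLASSIFIERS = ["AdaBoost", "XGBoost", "RF"]
RESAMPLERS = ["SMOTE", "Borderline-SMOTE", "SMOTE-Tomek", "ADASYN"]
CONFIGS = list(itertools.product(CLASSIFIERS, RESAMPLERS))  # 3 x 4 pairings


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k roughly equal, disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


for train_idx, test_idx in k_fold_indices(n_samples=103, k=10):
    # Assumption: any resampling would be fitted on train_idx only, so that
    # synthetic samples never leak into the held-out fold.
    assert not set(train_idx) & set(test_idx)

print(len(CONFIGS))  # 12
```

This split is not stratified; with a crash ratio of about 3%, a stratified variant that preserves the class ratio in every fold may be preferable.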

Illustration of the proposed decision-making system.
Variable selection
One of the aims of this analysis is to determine the predictors that contribute most to predicting road crash events, as using a large number of features in a classification without any reduction may increase estimation error and cause overfitting of the model [50], which may affect both the understanding of the correlation between features and, more significantly, the utility of the model during prediction. In machine learning applications, feature or variable selection techniques are employed to identify and eliminate irrelevant, undesired and redundant input features that do not improve the accuracy of the ML model. However, selecting the most relevant features is a complex combinatorial task. Two techniques that have been widely used for feature selection are the filter method and the wrapper method [106]. In the filter method, feature selection is carried out as a pre-processing step before the machine learning model is applied to the selected features. The features are selected based on their relationship with the target and with each other instead of on cross-validation performance, which may mislead the feature selection. In wrapper methods, on the other hand, the actual ML model is trained on each candidate set of features in order to select the most relevant ones. The limitation of this approach is overfitting, due to the repeated training of ML models on different feature sets [107]. As such, we adopted the broadly employed Genetic Algorithm, a metaheuristic method that mimics the natural selection inherent in biological reproduction.
GA [35] has been widely applied in machine learning to find near-optimal solutions to optimization and search problems for which the globally optimal solution would be prohibitively expensive to find. The algorithm accomplishes this by utilizing biologically inspired genetic operations. In GA, a set of candidate solutions is called the population, and a single solution in the population is known as a chromosome. Each chromosome is passed through the fitness function, a criterion that estimates the quality of each chromosome. If the desired objective is attained, the algorithm stops at that point. Otherwise, a new set of offspring (a new population) is generated by the mutation, crossover and selection operators. The algorithm continues in this way until the best result is obtained (the termination point is reached). The best feature subset is then passed to each ML method (Babatunde et al., 2014). Table 2 shows the best GA parameters, while Table 3 summarizes the selected features.
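The loop described above (population, fitness, selection, crossover, mutation, termination) can be sketched as follows for binary feature masks. This is a minimal illustration under stated assumptions: the parameter values are placeholders rather than those of Table 2, tournament selection and elitism are one choice among many, and the toy fitness function stands in for what would in practice be a cross-validated classifier score.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=20,
                              generations=30, crossover_rate=0.8,
                              mutation_rate=0.05, seed=0):
    """Search for a binary feature mask (chromosome) maximizing fitness(mask)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]

    def tournament():
        # Selection: the fitter of two randomly drawn chromosomes wins.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    best = max(population, key=fitness)
    for _ in range(generations):
        offspring = [best[:]]  # elitism: keep the best mask unchanged
        while len(offspring) < pop_size:
            p1, p2 = tournament(), tournament()
            child = p1[:]
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            offspring.append(child)
        population = offspring
        best = max(population, key=fitness)
    return best


RELEVANT = {0, 3, 5}  # hypothetical "truly useful" feature indices


def toy_fitness(mask):
    """Reward relevant features, penalize every extra selected feature."""
    hits = sum(1 for i in RELEVANT if mask[i])
    extras = sum(mask) - hits
    return hits - 0.5 * extras


best_mask = genetic_feature_selection(n_features=8, fitness=toy_fitness)
print(best_mask, toy_fitness(best_mask))
```

Because the best chromosome is carried over unchanged, the best fitness never decreases from one generation to the next; a fixed generation count is used here as the termination point.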
Parameters adopted in Genetic Algorithm
List of the selected variables after feature selection
Road crash analysis often produces an unbalanced data set, as the target classes are not equally represented. Such imbalances induce a bias toward the class with the majority of observations, since classifiers prioritize the class with the higher number of samples, leading to an over-prediction of this class [28]. Balancing approaches are designed to resolve this issue by balancing class distributions in the data set. In this work, the SMOTE technique along with Borderline-SMOTE, SMOTE-Tomek and ADASYN were applied.
SMOTE, presented by [13], generates synthetic minority instances at random points along the line segments between existing minority cases rather than duplicating existing minority cases. The technique first finds the k-nearest neighbors of each minority case. Then, depending on the required amount of over-sampling, multiple iterations are carried out in which one neighbor is randomly chosen from the k-nearest neighbors. The difference between the instance being processed and its neighbor is computed, multiplied by a random number between 0 and 1, and added to the instance; the resulting synthetic instances are included in the data set and assigned to the minority class. Since SMOTE over-samples the minority class without data duplication, the over-fitting problem can be mitigated [28,34]. In comparison to SMOTE, Borderline-SMOTE [40] focuses on the minority class instances that lie close to the borderline (i.e., near majority class examples), which are harder to classify correctly, and gives higher importance to these examples to improve performance. It works in the same way as SMOTE, but generates synthetic examples only from the minority class examples on the borderline. The SMOTE-Tomek Link strategy, on the other hand, combines SMOTE's ability to generate synthetic data for the minority class with the ability of Tomek Links [89] to remove from the majority class the data identified as Tomek links. A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors; the majority-class member of each pair is the one removed. ADASYN is a popular extension of SMOTE [42]. It adopts the same strategy as SMOTE to produce new samples of the minority class. However, ADASYN generates more points around the minority instances that are closer to the majority class. In particular, for each minority instance, its k nearest neighbors are determined and the learning difficulty rate is calculated according to
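In the standard ADASYN formulation [42], this ratio is r_i = Δ_i / k, where Δ_i is the number of majority-class examples among the k nearest neighbors of minority instance x_i; the higher r_i, the more synthetic points are generated around x_i. The sketch below illustrates both the SMOTE interpolation step and this ratio in plain Python. It is a minimal illustration, not the implementation used in the study: the function names, the toy data and the brute-force neighbor search are all assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating toward minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbors = sorted(others, key=lambda m: math.dist(x, m))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # random number in [0, 1)
        # New point = instance + gap * (neighbor - instance).
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, neighbor)))
    return synthetic


def adasyn_ratio(i, minority, majority, k=5):
    """r_i: share of majority points among the k nearest neighbors of minority[i]."""
    x = minority[i]
    pool = [(m, 0) for j, m in enumerate(minority) if j != i]
    pool += [(m, 1) for m in majority]
    nearest = sorted(pool, key=lambda item: math.dist(x, item[0]))[:k]
    return sum(label for _, label in nearest) / k


# Toy 2-D data: minority points on the unit square, majority points nearby.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(3.0, 3.0), (1.5, 1.0), (1.0, 1.5), (4.0, 4.0)]

new_points = smote(minority, n_new=10, k=3)
print(adasyn_ratio(0, minority, majority, k=3))  # 0.0: far from the majority
print(adasyn_ratio(3, minority, majority, k=3))  # 2/3: near the class border
```

In this toy example, the corner (1, 1) has two majority points among its three nearest neighbors, so ADASYN would concentrate synthetic samples there, whereas plain SMOTE treats all minority points alike.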
Prediction models
This work examines three distinct tree-based models, namely Random Forest, Adaptive Boosting, and eXtreme Gradient Boosting. In general, a basic decision tree suffers from high variance, i.e. if the training data is split into several portions, the outcomes obtained from each portion can be quite different [18]. To resolve this problem, the RF technique aggregates several decision trees based on the principle of “bootstrap aggregation” (bagging), which decreases the variance of the machine learning model. In bagging, each training set is created by building a bootstrap replication of the original training set [17]. Conversely, boosting is a process in which the trees are developed sequentially, using knowledge from previously built trees: AdaBoost refits each new tree on a reweighted training set that emphasizes previously misclassified samples, whereas XGBoost fits each new tree on the residuals of the current ensemble rather than on the target variable [23,51,67].
Adaptive Boosting (AdaBoost)
AdaBoost, presented by [32], is one of the most popular boosting algorithms. It is a highly effective and commonly used ensemble classifier. In AdaBoost training, the distribution weights of samples are high when the error is large and low when the error is small; the samples are then trained based on the new weight distribution to improve the predicted output. AdaBoost feeds the input training set to a weak learner algorithm repeatedly. During these repeated calls, the algorithm maintains and updates a set of weights for the training set. Initially, all weights are equal. However, after each call, the weights are updated such that the weights of incorrectly classified examples are increased. This forces the weak learner to focus on the hard examples in the training set.
AdaBoost generates base learner: the first one,
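The weight update described in this subsection can be sketched as follows, using the discrete AdaBoost formulation with labels in {-1, +1}. The one-dimensional threshold stumps are an assumption chosen to keep the example short; they stand in for the tree-based weak learners used in the study, and all function names are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """A one-split weak learner: predict polarity at or above the threshold."""
    return polarity if x >= threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def reweight(weights, ys, preds, alpha):
    """Raise weights of misclassified samples, lower the rest, renormalize."""
    updated = [w * math.exp(-alpha * y * p)
               for w, y, p in zip(weights, ys, preds)]
    total = sum(updated)
    return [w / total for w in updated]


def adaboost_fit(xs, ys, n_rounds=5):
    """Train discrete AdaBoost on 1-D inputs with labels in {-1, +1}."""
    weights = [1.0 / len(xs)] * len(xs)  # initially, all weights are equal
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # learner's vote weight
        ensemble.append((alpha, threshold, polarity))
        preds = [stump_predict(x, threshold, polarity) for x in xs]
        weights = reweight(weights, ys, preds, alpha)
    return ensemble


def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1


xs = [1, 2, 3, 4, 5, 6]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost_fit(xs, ys)
print([adaboost_predict(model, x) for x in xs])
```

A useful property of this update is that, after reweighting, the samples the last learner misclassified carry half of the total weight, which is what forces the next learner to focus on them.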
eXtreme Gradient Boosting (XGBoost)
XGBoost is a scalable tree boosting method that has become one of the most popular machine learning methods [14]. It is an advanced version of the gradient boosting algorithm [33] and was designed to achieve not only high accuracy but also a low risk of overfitting, which is obtained by simplifying the objective function so that it combines predictive and regularization terms while maintaining high computational speed. XGBoost is preferred by data scientists because of its high execution speed and out-of-core computation. It has been widely employed in industry due to its high performance in problem-solving and minimal requirement for feature engineering [19].
The processes of additive learning in XGBoost starts by fiiting the first learner to the whole space of input data, and a second model is then fitted to these residuals for tackling the drawbacks of a weak learner. This fitting process is repeated for a few times until the stopping criterion is met. The ultimate prediction of the model is obtained by the sum of the prediction of each learner. The general function for the prediction at step t is presented as follows:
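In the usual formulation of additive tree boosting, the prediction at step t is written as below. The symbols are assumed here for illustration: $\hat{y}_i^{(t)}$ is the prediction for sample $x_i$ after t steps and $f_k$ is the learner added at step k.

```latex
\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)
```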
Random Forest (RF)
Random forests [10] are ensembles of de-correlated decision trees. The algorithm extends the idea of bootstrap aggregating (bagging) by using a random selection of features to determine the best variable/split-point within the process of growing trees to the bootstrapped samples from the training data (feature bagging). When random forests are used for classification problems, the resulting classification is based on the majority vote derived from all class votes from each tree.
Based on the input vector x, RF constructs a number K of regression trees and calculates the average of the outcomes. When the trees are built, the predictor function of RF is described as follows:
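Using the notation of the text (input vector x and K trees), the averaging predictor can be written in its standard form as below. The symbol $T_k$ for the k-th tree is an assumed name.

```latex
\hat{f}_{\mathrm{RF}}^{K}(x) = \frac{1}{K} \sum_{k=1}^{K} T_k(x)
```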
In order to prevent correlation between the different trees, RF uses bagging to generate trees from different training data subsets, which creates diversity. Bagging creates training data by randomly resampling the original input space with replacement, i.e. the data selected from the input sample are not deleted before generating the next subset. Thus, some data may be used more than once in the training while others might never be used, which makes the model more stable and more robust to slight alterations in the input data [10]. On the other hand, when RF grows a tree, it uses the best feature/split point within a subset of evidential features that has been selected randomly from the overall set of input evidential features. This can decrease the strength of every single tree, but it reduces the correlation between the trees, which reduces the generalization error.
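The effect of sampling with replacement can be illustrated with a short sketch. The sample size, seed and helper name `bootstrap_indices` are arbitrary choices for illustration and are not taken from the study.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices from range(n) uniformly, with replacement."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n = 1000
sample = bootstrap_indices(n, rng)

in_bag = set(sample)                     # indices drawn at least once
out_of_bag = set(range(n)) - in_bag      # indices never drawn

# Because draws are with replacement, some indices repeat and a share of
# roughly (1 - 1/n)^n, about 37% for large n, is never selected.
oob_fraction = len(out_of_bag) / n
```

Each tree therefore sees a slightly different training set, which is the source of the diversity mentioned above.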
Experimental results
Building prediction models
In an effort to demonstrate the validity of the classifiers’ assessment, parameter optimization for each of the classifiers AdaBoost, XGBoost and RF was conducted to select the best performing penalty parameters through cross-validation. The Grid Search method was applied to tune the proposed models; grid search is one of the most widely used methods and has been proven to be an efficient way to determine a model’s hyperparameters [15,24]. For XGBoost, the following hyperparameters were tuned: number of estimators (i.e. the number of boosting trees), maximum depth (i.e. the maximum number of edges from a node to the tree’s root node), minimum child weight (i.e. the minimum sum of weights of all observations required in a child), gamma (i.e. the minimum loss reduction required to make a split), subsample (i.e. the ratio of observations sampled to construct each tree), colsample by tree (i.e. the ratio of columns sampled to construct each tree), the regularization parameters alpha and lambda, and learning rate (i.e. the step size of the weight updates).
As regards RF, two parameters need to be set for building a prediction model: the number of regression trees and the number of evidential features used at each node to grow the regression trees [79]. Breiman [10] showed that the generalization error always converges as the number of trees increases; thus, overtraining is not an issue. On another note, decreasing the number of evidential features leads to a lower correlation among trees, which improves the model’s performance. In order to optimize these parameters, a large number of experiments were carried out using different numbers of trees and split evidential features. The two tuning parameters are ntree, the number of regression trees grown based on a bootstrap sample of the original data set, and mtry, the number of predictors tried at each node.
Lastly, the loss function used for the AdaBoost tuning is the exponential loss function, the base estimator is a decision tree with a depth of five, and the maximum number of base estimators at which boosting is terminated is 200. As there is a trade-off between learning rate and number of estimators, the learning rate is set to 0.01. To cope with the class imbalance issue, data balancing techniques were applied. The proper way to apply rebalancing strategies is to oversample only the training set while the test set is left intact [23,24,65]; this way, the evaluation is more realistic. The optimal tuning parameters adopted for AdaBoost, XGBoost and RF are highlighted in Table 4.
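The idea behind grid search can be sketched in a few lines. The candidate values, the `grid_search` helper and the `toy_score` function below are hypothetical; in an actual run the score would be a cross-validated metric of the fitted model, and the grid searched in this study is not reproduced here.

```python
from itertools import product

# Hypothetical candidate values for three of the tuned hyperparameters.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5],
    "learning_rate": [0.01, 0.1],
}

def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # in practice: mean cross-validated metric
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Stand-in for a cross-validated score; purely illustrative."""
    return (-abs(params["max_depth"] - 5)
            - abs(params["learning_rate"] - 0.01)
            - abs(params["n_estimators"] - 200) / 100)

best_params, best_score = grid_search(param_grid, toy_score)
```

The cost grows multiplicatively with the number of values per hyperparameter (here 2 x 2 x 2 = 8 evaluations), which is why grids are usually kept coarse.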
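The principle of resampling only the training split can be sketched as follows. Plain random oversampling is used here as a simple stand-in for the SMOTE-family methods of this study, which synthesise new minority points rather than copying existing ones; the toy data and helper names are hypothetical.

```python
import random

def train_test_split(rows, test_fraction, rng):
    """Shuffle a copy of rows and split it into (train, test)."""
    rows = rows[:]
    rng.shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

def random_oversample(rows, rng):
    """Duplicate minority rows (label 1) until both classes have equal size.

    Assumes label 1 is the minority class and occurs at least once.
    """
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

rng = random.Random(42)
# Toy data: (feature, label) with 10 crash rows (label 1) and 100 no-crash rows.
data = [(i, 1) for i in range(10)] + [(i, 0) for i in range(10, 110)]

train, test = train_test_split(data, test_fraction=0.2, rng=rng)
train_balanced = random_oversample(train, rng)  # the test split is never resampled
```

Because the test split keeps its original class distribution, the reported metrics reflect the imbalance the model would face in deployment.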
Optimized model parameters
The degree of balance in the data is measured with a criterion called the imbalance ratio (IR), calculated as the number of instances in the majority class (no-crash instances) divided by the number of instances in the minority class (crash instances). A dataset with an imbalance ratio of less than 1.5 is usually considered to be balanced [26,29,53]. The obtained IR for the adopted dataset is about 10.5; the selection of the model evaluation metrics therefore needs to take the imbalance issue into consideration. Within this context, a variety of frequently used performance measures are used to evaluate the quality of the classification models. Recall, also known as true positive rate (TPR) or sensitivity, is defined as the proportion of actual positives that are correctly classified (i.e. crash events correctly classified). Since the primary goal is to correctly predict the rare events of the accident class, recall is a particularly important metric of classifier performance in this case. Precision, on the other hand, is a measure of accuracy outlining the relevance ratio of the predicted instances, i.e. the percentage of correctly predicted events among all predicted events.
Here, True Positive (TP) indicates the number of crash occurrences correctly classified, and False Positive (FP) indicates the number of non-crash events incorrectly classified as crash events. False Negative (FN) indicates the number of crash events incorrectly classified as non-crash events, and True Negative (TN) indicates the number of non-crash occurrences correctly classified. In the following, G-mean is considered as a metric of the balance between the correct classification of the positive class and of the negative class, viewed independently. It is usually adopted in order to resist the imbalances in the dataset [56,64]. On another note, the F1 score is a highly informative measure as it considers both precision and recall, thus taking the class-balance issue into account. Another useful metric is the Matthews correlation coefficient (MCC), which is a measure of the agreement between the observed and predicted binary classifications. It is widely adopted for imbalanced data and its value lies between −1 and 1, where 1 is the result of a perfect prediction. As this is a class imbalance problem, G-mean, F1 score and MCC have all been adopted to assess the prediction models; they are described as follows:
The data were partitioned into training and validation sets for model formation and validation. In this study, for the evaluation of the classification performance of each classifier and with the aim of obtaining a more accurate estimate of crash prediction, 10-fold cross-validation was adopted. This method is recognized for its tendency to yield minimal bias and variance in contrast with other validation methods, including the leave-one-out method [55]. Moreover, k-fold cross-validation is known to prevent the over-fitting issue in the estimation of performance [70,87]. The models were trained using nine subsets of the input space, while the remaining subset was employed to evaluate the performance of the predictive learners. The training was repeated 10 times, each time holding out a different subset for evaluation. The mean performance measures were then obtained by averaging the 10 recorded metrics. This kind of approach reduces the influence of data dependence and enhances the reliability of the resultant evaluation [23,26,95].
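Written in terms of the confusion-matrix counts, the standard definitions of these two measures are:

```latex
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}
```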
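In their standard form, and using the counts TP, FP, FN and TN defined above, the three measures read:

```latex
\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}, \qquad
F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
```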
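The fold rotation described above can be sketched as follows. The helper names and the toy scoring function are hypothetical; in practice the data would typically be shuffled (and often stratified by class) before the folds are formed, which this sketch omits for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train_idx, val_idx) over k folds."""
    scores = [evaluate(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# Toy run: the "score" is simply the fraction of the data used for training.
folds = list(k_fold_indices(105, 10))
mean_score = cross_validate(105, 10, lambda tr, va: len(tr) / 105)
```

Every sample is used for validation exactly once and for training in the remaining nine rounds.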
Modeling performance
When dealing with imbalanced observations, accuracy may be misleading due to bias towards the majority class [38,86]. Therefore, it is essential to adopt suitable performance metrics to assess classifier validity and guide model learning. There are multiple performance evaluation metrics that consider class distributions in the adopted dataset, such as G-mean, MCC and F1-score.
The results of all 12 models for each performance measure across all configurations are listed in Tables 5, 6, 7, 8, 9. Each cell is filled with the average of the relevant performance metric after cross-validation. The RDWV combination based on AdaBoost-ADASYN attained 84.2% and 86% for F1-score and G-mean respectively, whereas XGBoost-SMOTE-TL showed 87.2% MCC. As for VDWL performance, an 82.8% G-mean was obtained using AdaBoost-ADASYN, while XGBoost-SMOTE achieved an 81.9% F1-score and an MCC of 81.6% was reached. AdaBoost-ADASYN also produced good results for the RVLD combination with an 81.9% F1-score; AdaBoost also achieved a high MCC of 82.3%, this time based on SMOTE-TL, whereas XGBoost-ADASYN attained an 83.2% G-mean. On the other hand, ADASYN in RVWL displayed superior performance using AdaBoost and XGBoost with values of 85% and 86% for F1-score and MCC respectively, whilst AdaBoost-Borderline-SMOTE obtained an 89.1% G-mean. The maximum predictive performance was acquired using all features and ADASYN as the resampling strategy for AdaBoost, with 95.4% and 93.7% for F1-score and G-mean respectively, along with XGBoost-SMOTE-TL for MCC with a value of 94.2%. The results for each category of combined features are illustrated in Fig. 5.
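A small numerical sketch makes the point concrete. The labels below are hypothetical and only chosen to match an imbalance ratio of 10.5; they are not the study's data, and the function names are illustrative.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN), treating label 1 (crash) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def _ratio(num, den):
    """Division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    recall = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    specificity = _ratio(tn, tn + fp)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "g_mean": math.sqrt(recall * specificity),
        "mcc": _ratio(tp * tn - fp * fn,
                      math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }

# Hypothetical labels: 210 no-crash and 20 crash instances (IR = 10.5).
y_true = [0] * 210 + [1] * 20
always_no_crash = [0] * 230                         # trivial majority predictor
some_skill = [0] * 200 + [1] * 10 + [0] * 5 + [1] * 15

trivial = metrics(y_true, always_no_crash)
skilled = metrics(y_true, some_skill)
```

The trivial predictor reaches an accuracy above 0.91 without detecting a single crash, while its recall, F1-score, G-mean and MCC are all zero; the second predictor is only slightly more accurate but scores far higher on the imbalance-aware metrics.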
Performance results for the crash events using RDWV features
Performance results for the crash events using VDWL features
Performance results for the crash events using RVLD features
Performance results for the crash events using RVWL features
Performance results for the crash events using ALL the features

Modeling performance for each of the combined feature sets: (a) RDWV, (b) VDWL, (c) RVLD, (d) RVWL, (e) ALL.
When analyzing the performance results of the adopted machine learning models with the employed balancing approaches, most of the superior outcomes were obtained with the ADASYN and SMOTE-TL techniques compared to the other resampling strategies; extensive research has also shown that ADASYN and SMOTE-TL are more efficient than other over-sampling techniques [2,41,66,85]. Furthermore, AdaBoost proved to be more efficient than XGBoost and RF in F1-score and G-mean, whereas XGBoost obtained the best average MCC scores. More details about the overall modeling performance are outlined in Fig. 6.

Average performance measures for classification models.
While inspecting the results of the feature combinations for the adopted models, we observed that better predictions are obtained when land zones characteristics and weather conditions are both considered. Consequently, to better understand the likelihood of crash occurrence, the relationships between crash events and land zones characteristics under the distinct weather seasons have been analyzed; these relationships are illustrated in Fig. 7. As revealed, more than 35% of total crash occurrences were recorded in commercial zones. The rest were registered in the remaining land zones as follows: 29.78% in rural areas, followed by 27.65% in residential regions, while industrial zones came last with 7.44% of road crashes. In addition, 37.23% and 20.21% of all crash observations were recorded during heavy snow and light snow conditions respectively; heavy rain and heavy fog came next with 14.89% and 13.82% respectively, whereas only about 7% were observed in light fog conditions, followed by light rain conditions with about 6% of all accidents. Furthermore, the results also demonstrate that each of the weather conditions has seen crash events throughout all four land zone types, except for light fog and light rain, for which no accidents were recorded in industrial areas.

Crash events for multiple land zones characteristics under adverse weather patterns.
On another note, we further outline the distribution of the multiple roadway properties during the selected weather conditions in Fig. 8. As depicted, more than 30% of all crashes have occurred in mountainous and steep mountainous terrains. Steep sections came next with 24.46%, while flat regions came last with about 8.5% of all accidents. Moreover, the results also show that each of the roadway terrains has seen crash events during all weather conditions, aside from light fog, during which only mountainous accidents occurred, and light snow, in which no flat accidents happened. Finally, neither flat nor mountainous accidents were recorded under light rain conditions.
In view of the dominance of steep, mountainous and steep-mountainous roadway designs in the number of observed crash events, we further outline the average speed distribution between crash and non-crash events during heavy snow and heavy rain, the conditions in which most crash events occurred (Fig. 8). As can be seen in Fig. 9 for steep designs, Fig. 10 for mountainous roadways and Fig. 11 for steep-mountainous ones, the distribution of average speed before crash events tends to be more widely spread than that before non-crash events, particularly in steep designs; this implies that the driving conditions before a crash event tend to be more diverse than those before non-crash events, which is consistent with [26,88,105].

Crash events for multiple roadway properties under adverse weather patterns.

Distribution of average speed between crash and non-crash events in steep designs during (a) heavy rain and (b) heavy snow conditions.

Distribution of average speed between crash and non-crash events in mountainous designs during (a) heavy rain and (b) heavy snow conditions.

Distribution of average speed between crash and non-crash events in steep-mountainous designs during (a) heavy rain and (b) heavy snow conditions.
Table 10 compares the average performance measures of AdaBoost, XGBoost and RF, based on roadway properties, driver inputs, land zone characteristics, vehicle telemetry and weather conditions, with those of other efficient models, including Support Vector Machines (SVM), Bayesian Networks (BN) and Logistic Regression (LR). The three adopted models AdaBoost, XGBoost and RF appear to be the best performing classifiers in terms of average performance for all the modeling metrics. To further explore the findings visually, the aforementioned results are presented in Fig. 12.
Comparison of average crash prediction performance for different modeling techniques

Average performance overview for classification models.
Road traffic crashes have been viewed as one of the major causes of numerous health problems, economic losses and fatalities, particularly during adverse weather conditions and night times. As such, a comprehensive analysis that aims to reduce road crashes and enhance traffic safety for drivers using functional strategies is essential. Relevant research on this topic has relied on traditional statistical techniques, which usually suffer from poor data quality and require a large amount of historical data. On the other hand, machine learning models have been shown to outperform statistical learning in predicting crash occurrences and have produced powerful results in many transportation systems. In this paper, a multicriteria decision-making system for the prediction of road crash events under reduced visibility settings is outlined. It relies on a heuristic genetic algorithm-based feature selection approach and on data balancing techniques to handle the data imbalance issue, using real-time data from multiple sources comprising weather conditions, roadway properties, land zones characteristics, vehicle kinematics and driver inputs.
First, the strongest predictors leading to the occurrence of crash events were identified. Then, the impact of various combinations of the five adopted feature groups (roadway properties, weather conditions, land zones characteristics, vehicle kinematics and driver inputs) on the likelihood of crash events was thoroughly explored. Next, several prediction techniques were built based on AdaBoost, XGBoost and RF models using multiple balancing approaches, namely SMOTE, Borderline-SMOTE, SMOTE-Tomek Links and ADASYN, to address the data imbalance issue, since crash-related observations are frequently imbalanced given that the target classes are not equally represented. Simulation experiments were conducted on a roadway comprising different geometric properties and sites under six inclement weather conditions: light fog, heavy fog, light rain, heavy rain, light snow and heavy snow. Roadway characteristics along with a variety of land zones in adverse weather patterns could exacerbate the effect of driving behavior leading to road accidents; hence, the incorporation of these attributes is critical in the context of active traffic safety systems.
As regards model performance results, different combinations of features were employed in order to compare the outcomes of the adopted models with different inputs. It was found that the best predictive performance was obtained when vehicle kinematics, driver inputs, roadway properties, land zones characteristics and weather conditions were all combined, with ADASYN as the resampling strategy for AdaBoost reaching 95.4% and 93.7% for F1-score and G-mean respectively, along with XGBoost-SMOTE-TL for MCC with a value of 94.2%. Furthermore, the findings also revealed that increased performance is obtained when land zones characteristics and weather conditions are both considered. In this regard, more than 35% of total crash occurrences were recorded in commercial zones. The rest were registered in the remaining land zones as follows: 29.78% in rural areas, followed by 27.65% in residential regions, while industrial zones came last with 7.44% of road crashes. In addition, 37.23% and 20.21% of all crash observations were recorded during heavy snow and light snow conditions respectively. On another note, more than 30% of all crashes have occurred in mountainous and steep mountainous terrains, while 24.46% and 8.5% of crashes occurred in steep and flat areas respectively.
The findings of this research open up new directions to better outline crash occurrences under different weather conditions for multiple roadway properties and sites, which is significant for constructing efficient crash prevention strategies by providing crash risk warnings to drivers and promoting safe driving behavior. Admittedly, however, there are some limitations that need to be addressed. Simulator studies provide a convenient and adjustable environment, especially when it comes to investigating road crash events, as it would be very risky to carry out real-life trials on real road settings; however, the driving simulator is not a full substitute for real-world driving experience. Indeed, the authenticity of the results acquired using a driving simulator relies on the tasks in question within the simulated environment. Nevertheless, crash data recorded under fairly similar conditions in related research could be further processed. Further to this, although the above results have demonstrated the performance of crash occurrence prediction with driver inputs, vehicle telemetry, land zone characteristics, roadway properties and weather conditions, additional, more complex measures such as drivers’ physiological and mental states could provide further insight. Finally, potential future directions of this study may comprise broadening the predictive models to include other undersampling and oversampling techniques to handle the class imbalance issue for better prediction of crash events.
Acknowledgement
This research was jointly supported by the (1) Moroccan Ministry of Equipment, Transport and Logistics and (2) Moroccan National Center for Scientific and Technical Research (CNRST).
Conflict of interest
None to report.
