Intelligent technique for traffic congestion prediction in Internet of Vehicles using Randomized Machine Learning

Abstract

Traffic congestion is a challenging issue faced by people and government traffic agencies. It not only increases travel time but also increases noise pollution, air pollution, and financial losses. Many factors affect the speed of a vehicle, including weather, wind speed, road conditions, and construction work. On highways, the low speed of vehicles can cause traffic congestion or delays. Machine learning can play a vital role in the detection of traffic congestion and hence in avoiding delays. When accurate parameters and a correct structure are fed to the machine learning model, traffic congestion can be predicted accurately. This paper designs a technique to predict traffic congestion states with the help of the Extra Trees Classifier machine learning model. The proposed Extremely Randomized Machine Learning (ERML) system model achieves 94% accuracy for congestion state classification, which is better than the other machine learning models evaluated.

Keywords

Internet of Vehicles; Intelligent Transportation System; Vehicular Ad-hoc Network; Internet of Things; Machine Learning; Traffic Congestion Prediction

1 Introduction

One of the major problems in large cities is traffic congestion. It impacts not only the daily lives of people but also the social development and economy of the country. Government bodies constantly monitor and try to resolve traffic congestion in the cities. Owing to its unpredictable nature, traffic congestion is very difficult to predict.

Traffic congestion is a pervasive and growing problem that affects societies worldwide, with profound impacts on various aspects of daily life. It refers to the situation in which the demand for road space exceeds its capacity, resulting in slower speeds, longer travel times, and often complete standstills. The consequences of traffic congestion extend far beyond mere inconvenience, touching on critical facets of society, including travel time, noise pollution, air pollution, and financial losses. Accurate prediction of traffic congestion is pivotal for mitigating these issues and improving traffic management.

Many factors are responsible for traffic congestion. These may be dynamic or non-dynamic factors such as traffic speed, accidents on roads, construction of new roads, weather conditions, and sudden malfunctioning of traffic lights. A non-dynamic factor like weather conditions directly affects traffic conditions. In heavy rain, road visibility becomes poor for drivers, which decreases the speed of vehicles. Owing to these factors, analyzing traffic congestion is challenging, and many researchers have tried to address this issue. Solutions for congestion prediction can be categorized as hardware-based equipment and vehicular ad-hoc network technology. Advancements in Vehicular Ad-hoc Networks (VANETs) can increase the capability of finding traffic congestion. Integration of VANET with the Internet of Things (IoT) forms a network called the Internet of Vehicles (IoV) [5]. With the help of IoV, it becomes much easier to predict environmental factors like temperature and weather conditions. Intelligent devices can also predict road conditions, accident areas, and jam conditions. IoV can collect data and send it directly to a cloud server, giving drivers advance information so that they can avoid congested roads and follow a faster alternate route. Figure 1 shows the model of IoV for vehicular communication.

Fig. 1

IoV model for vehicular communication.

Machine learning can play a vital role in the detection of traffic congestion and hence in avoiding delays. When accurate parameters and correct structure are fed to the machine learning model, traffic congestion can be predicted accurately. This paper designs a technique to predict traffic congestion states with the help of the Extra Tree Classifier machine learning model.

In this paper, a comparative analysis of various traffic congestion detection techniques is presented in Table 1, and an ERML system model is then proposed for intelligent traffic congestion prediction. The paper is arranged as follows: Section 1 gives the introduction. Section 2 describes the related work. Different machine learning models are presented in Sub-section 2.1. The proposed ERML system model is described in Section 3. Section 4 presents the results and discussion. Section 5 presents the practical challenges in implementing the proposed system for traffic congestion prediction in the Internet of Vehicles (IoV), and Section 6 presents the conclusion and future work of this paper.

Table 1

Comparative analysis of various traffic congestion detection techniques

Author	Approach used	Input Dataset	Type of Traffic Congestion detected	Input Parameters	Output Parameters
Tamimi and Zahoor [18]	Delay estimation using fuzzy logic	Yes	Traffic congestion due to traffic delay	Weather conditions, Road conditions, Visibility, Traffic volume	Delay estimation
Chen, et al. [4]	AMPRFP Approach	Yes	Traffic Congestion Analysis based on Real Time Navigation System	Average speed and incident reports	Accurate travel time
Eswaraprasad and Raja [7]	Traffic profile information using the HANN-HMM approach	Yes	IoT-based traffic management	Time consumption, Traffic rate, No. of vehicles	Accuracy, Precision, Recall & F-measure
Nguyen, et al. [13]	Traffic Congestion Coefficient	Yes	Traffic Congestion based on Fuzzy rules (Sector)	Density performance index, Velocity performance index	Traffic congestion coefficient
Elfar, et al. [9]	Machine learning methods are used for Short-Term traffic congestion prediction	Yes	Short Term traffic congestion prediction	Mean speed and the Speed Standard Deviation	Flow and density
Ata, et al. [1]	Artificial Back Propagation Neural Network	Yes	Traffic congestion based on delay time	Duration of Time, Traffic flow & Speed, Wind speed, Air temperature, Moisture	Delay time
Impedovo, et al. [11]	Traffic congestion classification using deep learning approaches	Yes	Vehicular traffic congestion classification using visual features	Five visual features	Light, Medium, and Heavy traffic congestion classification
Zhang, et al. [23]	LSTM-based recurrent neural network method for traffic congestion forecasting	Yes	Short-Term Traffic Congestion Forecasting	Speed, travel time, and volume	Traffic congestion values
Sun, et al. [17]	Deep learning models used to perform congestion-level prediction	Yes	Traffic congestion prediction based on GPS trajectory data	Taxi GPS trajectory images	Deep features of the input images
Kamble and Kounte [8]	Trajectory-based ML technique	Yes	Traffic congestion on road segment	Location, Speed, Acceleration	Traffic speed over the entire road sector
Vaishnavi and Suseela [20]	Traffic congestion detection using machine learning algorithm	Yes	Geographical detection of traffic congestion based on traffic videos & images	Traffic images of sparse traffic, dense traffic, fire, and accident	Dense traffic, sparse traffic, fire, or accident
Tu, et al. [19]	SG-CNN-based model for deep traffic congestion prediction	Yes	Deep traffic congestion prediction	Traffic flow and vehicle speed	Traffic flow and vehicle speed
Yasir, et al. [25]	A prediction model-based traffic congestion prediction	Yes	Congestion prediction based on day, time, and several weather data	Day, time, temperature, humidity	Root Mean Square Error (RMSE)
Qi and Cheng [24]	Graph Convolutional Neural Network	Yes	Grid region traffic congestion prediction	Spatial & Time features	Traffic Flow

2 Related work

Building an efficient transportation system requires several challenges and limitations to be addressed. Currently, most smart vehicles are equipped with internet connectivity, and vehicles can communicate with each other through smartphones. Thus, the data generated by a vehicle’s sensors can be analyzed easily. Analyzing data on cloud servers can help in the identification of patterns and detection of traffic congestion.

Dureja and Suman [6] discussed various challenges and future aspects related to IoV and presented state-of-the-art advancements and future trends in efficient transportation using IoV.

Qi Liu et al. [19] in their paper “A random projection-based ensemble model for short-term traffic flow prediction” proposed a random projection-based ensemble model for short-term traffic flow prediction. The proposed model uses a combination of random projection and gradient boosting to build an ensemble of decision trees and was shown to achieve state-of-the-art performance on several traffic prediction datasets.

Hao Wu et al. (2021) in their paper “Random Feature Embedding for Traffic Flow Prediction” proposed a random feature embedding method for traffic flow prediction that uses random Fourier features to map the input data to a high-dimensional feature space. The proposed method was shown to outperform several state-of-the-art methods on several traffic prediction datasets.

In 2021, Yidan Zeng et al. proposed a multi-scale spatio-temporal model for traffic forecasting that integrates convolutional and recurrent neural networks with a meta-learning framework. The model was evaluated on large-scale traffic datasets and shown to outperform state-of-the-art methods.

In 2019, a deep learning approach was used to predict traffic flows from camera images captured in the wild [Baojie Yang et al. (2019)]. The approach uses a Fully Convolutional Network (FCN) with residual connections to predict traffic flow at each pixel in the image and was shown to achieve state-of-the-art performance on several benchmark datasets.

Another technique, the graph convolutional neural network (GNN), was used for traffic prediction that takes into account the complex relationships between different locations in the traffic network [Bing Yu et al. (2020)]. The model was evaluated on real-world traffic datasets and shown to outperform several state-of-the-art baselines.

In [8], researchers proposed a trajectory-based machine learning technique for traffic congestion monitoring. For the prediction of traffic congestion, a machine learning approach was used with multiple parameters like hard delay and vehicle speed; the Gaussian approach was used to predict the vehicle’s speed. In [1], the Artificial Back Propagation Neural Network approach was used for traffic congestion monitoring and a smart road traffic congestion control model. Parameters such as time, traffic speed, traffic flow, wind speed, air temperature, and humidity were used as inputs to the neural network, and the delay time was the output parameter used for the prediction of traffic congestion. The Hidden Markov Model (HMM) technique was used to obtain accurate and efficient traffic prediction [7]. Input parameters such as time consumption, number of vehicles, and traffic rate were considered for training the model, and feature selection was done using the Hybrid Ant Colony Glowworm swarm technique in [21]. Visual features can also be used for traffic flow estimation. In [11], researchers divided the main process of traffic flow prediction into three stages. The first stage is vehicle detection using high-quality cameras. The second stage is visual feature extraction, whose output is used as the input parameters for the vehicle classifiers. The final stage is traffic state classification in terms of light, medium, and heavy traffic on the road. Machine learning and deep learning models were used for the prediction of traffic state classification. Tamimi and Zahoor [18] used a fuzzy logic approach for the estimation of traffic delay. They considered road conditions, visibility, traffic volume, and weather conditions as input parameters and estimated delay as the output parameter of the proposed fuzzy inference system. Nguyen et al. [13] proposed a traffic congestion monitoring scheme based on fuzzy rules. Fuzzy rules were used to determine the Traffic Congestion Coefficient (TCC) for each road segment. This TCC is based on the velocity performance index and density performance index.

The proposed method takes less time than other methods. The proposed ERML method uses the Extra Trees classification technique, which is much faster than other ensemble techniques. The proposed method is less prone to overfitting than other methods because it selects random splits, which reduces the model’s variance. ERML-based methods are robust to noisy features because they use random splits at each decision node, which reduces the impact of noisy features on the model’s performance. The proposed technique monitors traffic congestion efficiently and helps drivers avoid congested road segments, which reduces delay while achieving high prediction accuracy.

2.1 Machine learning models

Machine learning also plays an important role in the detection of traffic congestion and hence in avoiding delays. When accurate parameters and correct structure are fed to the machine learning model, the traffic congestion state can be predicted accurately.

Some of the models that give the most promising results are discussed here.

2.1.1 Decision tree machine learning model

It is a supervised machine learning algorithm that is used to solve classification problems efficiently [14]. It works very well for dependent variables. In this algorithm, a tree is described by two units, namely nodes and leaves. The leaves hold the outcomes. The nodes, sometimes called decision nodes, are where the data is split. Decision trees can be categorized into classification trees and regression trees. Classification trees assign data to discrete classes such as fit or unfit, whereas regression trees predict continuous values.

The mathematical formulation of a Decision Tree machine learning model involves a probabilistic model that seeks to maximize the information gain for each split. The basic mathematical formulation of a Decision Tree machine learning model is given below.

Let D be the dataset, where $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ (1)

Let X be the set of predictor variables and Y be the target variable.

The entropy H(S) of a set S is defined as: $H(S) = -\sum_{i} p_i \log_2(p_i)$ (2) where $p_i$ is the proportion of examples in S that belong to class i.

The information gain IG(S, X) of a set S with respect to a predictor variable X is defined as: $IG(S, X) = H(S) - \sum_{j} \frac{|S_j|}{|S|} H(S_j)$ (3) where $S_j$ is the subset of S for which X has value j.

Algorithm (Decision Tree):

The algorithm starts with a single node that represents the entire dataset D.

For each node, the algorithm finds the predictor variable X that maximizes the information gain IG(D, X).

The dataset D is split into subsets Sj for each value j of X.

For each subset Sj, the algorithm repeats the above steps until a stopping criterion is met, such as a maximum depth or a minimum number of examples in a node.

The final decision tree is a tree structure where each internal node represents a predictor variable, and each leaf node represents a class label.

To predict the class label of a new example x, the algorithm traverses the decision tree from the root to a leaf node, following the path that corresponds to the values of the predictor variables in x.

In conclusion, the Decision Tree machine learning model is formulated as a probabilistic model that seeks to maximize the information gain for each split. The algorithm recursively splits the dataset into subsets based on the predictor variables that maximize the information gain, resulting in a tree structure that can be used to predict the class label of new examples.
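As a minimal sketch of Equations 2 and 3, the following Python fragment computes the entropy of a label set and the information gain of a categorical predictor, then picks the predictor with the highest gain. The toy rows, the feature meanings (weather, peak hour), and the function names are illustrative assumptions and are not taken from the dataset used in this paper.

```python
import math
from collections import Counter


def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, feature_index):
    """IG(S, X) = H(S) - sum(|S_j| / |S| * H(S_j)) for one predictor X."""
    subsets = {}
    for row, label in zip(rows, labels):
        # Group the labels by the value j that predictor X takes in this row.
        subsets.setdefault(row[feature_index], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Hypothetical toy data: (weather, peak_hour) -> congestion label.
rows = [("rain", "yes"), ("rain", "no"), ("clear", "yes"), ("clear", "no")]
labels = ["congested", "congested", "free", "free"]

# The root node would split on the predictor with the largest gain.
best_feature = max(range(2), key=lambda i: information_gain(rows, labels, i))
```

In this toy example the weather column separates the two classes perfectly, so it is selected for the first split, while the peak-hour column carries no information about the label.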

2.1.2 Gradient Boost classifier learning model

Gradient Boost classifier is an ensemble method that combines the predictive power of base estimators to improve generalization [Aziz, et al. (2020)]. There are two families of ensemble methods: averaging methods and boosting methods. Gradient Boost classifier is a boosting method that is used when there is a massive amount of data to be classified. This method gives predictions with higher accuracy.

It combines weak predictors to build a strong predictor. This machine learning method can be used for both classification and regression-type problems, and it can handle mixed types of data.

2.1.3 Random forest classifier learning model

Random decision forest is an ensemble method for the classification and regression of objects [3]. The classification of an object is based on its attributes. In this method, each tree produces its own classification, and a collective decision is taken: the forest chooses the classification having the most votes. The algorithm used for classification using random forest is given below:

Algorithm:

Initialize an empty list forest to store the decision trees in the forest.

For each tree in the forest (repeat n_estimators times):

Step 1. Randomly select max_features distinct attributes from the total C attributes. This forms a feature subset.

Step 2. Compute the best attribute a and the best split point for node splitting based on the selected feature subset. The criterion for finding the best split can be “gini impurity” or “entropy” for classification tasks.

Step 3. Split the node on attribute a into two child nodes using the best split point. These child nodes inherit the subset of data points from the parent node.

Step 4. Repeat steps 1 to 3 recursively for each child node until a stopping criterion is met (e.g., max_depth, min_samples_split, or min_samples_leaf).

Step 5. Add the constructed decision tree to the forest list.

Stop once n_estimators decision trees have been constructed.

Output the forest containing the ensemble of decision trees.

Input:

X: The training dataset, consisting of N samples and C attributes/features.

y: The corresponding labels for each sample in X.

n_estimators: The number of decision trees to create in the forest.

max_features: The number of features to consider when looking for the best split. It can be specified as a fixed number or a fraction of total features.

max_depth: The maximum depth of each decision tree, which controls tree complexity and helps prevent overfitting.

min_samples_split: The minimum number of samples required to split a node.

min_samples_leaf: The minimum number of samples required in a leaf node.

random_state: A seed for the random number generator to ensure reproducibility.

Output:

forest: A collection of decision trees that constitute the Random Forest ensemble.
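The voting step described above, in which the forest chooses the classification having the most votes, can be sketched as follows. The three "trees" are hypothetical hand-written rules standing in for trained decision trees, and the sample keys (speed_kmh, weather, peak_hour) are assumptions made only for illustration.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the class that receives the most votes from the trees."""
    votes = Counter(tree(x) for tree in trees)
    # most_common(1) gives the (label, count) pair with the highest count;
    # ties are resolved in favor of the label that was voted for first.
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained trees: each maps a sample to a level.
trees = [
    lambda x: 2 if x["speed_kmh"] < 20 else 0,
    lambda x: 2 if x["weather"] == "rain" else 0,
    lambda x: 1 if x["peak_hour"] else 0,
]

sample = {"speed_kmh": 15, "weather": "rain", "peak_hour": True}
prediction = forest_predict(trees, sample)
```

Two of the three stand-in trees vote for level 2, so the ensemble returns level 2 even though the third tree disagrees.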

2.1.4 Extra Trees classifier learning model

Extra Trees Classifier is an ensemble technique that combines the results of multiple estimators. These estimators are sometimes called de-correlated decision trees [10]. The decision trees are integrated to achieve greater robustness than a single classifier. This learning model is similar to the Random Forest learning model but differs in how the decision trees in the forest are constructed. The original training sample is used for the construction of each decision tree in the forest. At each test node, each tree finds the best feature from a random feature set for splitting the data. To split the data, a special index called the Gini index can be computed. Generally, information gain is used as the decision criterion, and the formula for calculating the information gain is given in Equation 4. $Gain(A, X) = Entropy(A) - \sum_{v \in Values(X)} \frac{|A_v|}{|A|} Entropy(A_v)$ (4) where

Gain(A, X): Information gain of dataset ‘A’ with respect to attribute ‘X’.

Entropy(A): The entropy of dataset ‘A’. It measures the uncertainty or impurity of ‘A’ before any split.

Values(X): The set of possible values that attribute ‘X’ can take.

A_v : The subset of examples in dataset ‘A’ where attribute ‘X’ has the value ‘v’.

Entropy (A_v) : The entropy of the subset ‘A_v’.

The entropy of the data, Entropy(A), is calculated as $Entropy(A) = -\sum_{i=1}^{uc} pr_i \log_2(pr_i)$ (5) where uc is the number of unique class labels and pr_i is the proportion of rows with output label i.
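To illustrate how a randomized split can be scored with the information gain of Equation 4, the sketch below draws one random cut-point per feature in a random feature subset and keeps the candidate with the highest gain. Drawing the cut-point uniformly between the feature's minimum and maximum is the usual Extra-Trees convention and is assumed here; the toy rows (speed, distance) and all function names are hypothetical and do not reproduce the authors' implementation.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Entropy(A) = -sum(pr_i * log2(pr_i)) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_for_threshold(values, labels, threshold):
    """Information gain of the binary split: value < threshold vs. >= threshold."""
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    if not left or not right:
        return 0.0  # a one-sided split separates nothing
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)


def random_split(rows, labels, max_features, rng):
    """Return the best (gain, feature, threshold) among randomly drawn candidates."""
    best = None
    for f in rng.sample(range(len(rows[0])), max_features):
        column = [row[f] for row in rows]
        lo, hi = min(column), max(column)
        if lo == hi:
            continue  # constant feature, nothing to split on
        threshold = rng.uniform(lo, hi)  # random cut-point, not an optimized one
        gain = gain_for_threshold(column, labels, threshold)
        if best is None or gain > best[0]:
            best = (gain, f, threshold)
    return best


# Hypothetical toy data: (speed in km/h, distance in km) -> congestion level.
rows = [(10, 5.0), (15, 3.0), (60, 4.0), (70, 6.0)]
labels = [2, 2, 0, 0]
split = random_split(rows, labels, max_features=2, rng=random.Random(0))
```

Because the cut-points are drawn at random rather than searched exhaustively, building each node is cheap, which is consistent with the speed advantage attributed to this model.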

Table 1 tabularizes various existing techniques for traffic congestion detection.

3 Proposed extremely randomized machine learning system model

IoT devices make it easier to sense environmental data and vehicular data. These devices monitor the environment periodically and send the data to the server, where the collected data can be used for analyzing the traffic congestion state efficiently. Figure 2 shows a system model for intelligent traffic congestion state prediction using machine learning techniques. In this model, the sensor layer generates input parameters such as time, speed, starting location, destination location, and distance traveled by vehicles. These parameters are fed to the machine learning model for the prediction of congestion state levels. Level 0 represents no congestion, level 1 represents a moderately congested route, level 2 represents a congested route, level 3 represents an extremely congested route, and level 4 represents a blockage in the route due to traffic jams, accidents, or any other conditions.

Fig. 2

Proposed ERML system model for intelligent traffic congestion prediction.

3.1 Sensor layer

The sensor layer is responsible for the collection of data such as weather, temperature, speed of the vehicle, starting location, destination, and total distance covered by the vehicle. Sensors in the sensor layer act as an interface between the real world and the digital world. They are responsible for converting analog signals to digital signals. There are different types of sensory devices, such as meters, probes, sensors, and actuators, that can capture data like temperature, distance, location, and humidity. Sensors in the sensor layer cooperatively pass these data through the network to the server.

This layer is also called the perception layer, and it performs two major tasks: perception and montage. Different types of inbuilt vehicle sensors gather information, which is collected at a central location and transferred to cloud servers that focus on big data storage and processing. Handling large-scale Internet of Vehicles (IoV) data while maintaining prediction accuracy as the system scales up requires careful consideration of both data processing techniques and system architecture. Using distributed computing frameworks like Apache Spark or Hadoop facilitates the parallelization of data processing tasks. This allows for the efficient processing of large volumes of IoV data by distributing the workload across multiple nodes.

3.2 Preprocessing phase

Before extracting the features from the input data set, preprocessing of data is required. The preprocessing phase is divided into several sub-phases. The first sub-phase is data transformation, in which data is converted into a particular format. In our model, the location data has been converted into numerical values.

In the second sub-phase, feature binarization is used to obtain Boolean values. In our case, congestion states are represented as numerical values: Level 0 represents No Congestion, Level 1 represents Moderately Congested, and so on. In the third sub-phase, categorical encoding, special features such as “Yes” or “No” for the holiday parameter are coded into the binary values 1 or 0. The last sub-phase of preprocessing is data wrangling, the process of cleaning and unifying dirty and complex data sets in order to make them more accessible and analyzable.
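The encoding sub-phases can be sketched in a few lines of Python. The record layout, the field names (source, destination, holiday, congestion_state), and the level names are assumptions chosen for illustration; the actual dataset columns may be named and ordered differently.

```python
# Hypothetical level names, listed so that the list index is the level number.
LEVELS = [
    "Uninterrupted",
    "Moderately Congested",
    "Congested",
    "Extremely Congested",
    "Blockage",
]


def encode_yes_no(value):
    """Categorical encoding: "Yes" -> 1, anything else -> 0."""
    return 1 if value.strip().lower() == "yes" else 0


def build_location_index(records, keys=("source", "destination")):
    """Data transformation: map each distinct location name to an integer."""
    names = sorted({rec[k] for rec in records for k in keys})
    return {name: i for i, name in enumerate(names)}


def preprocess(record, location_index):
    """Convert one raw record into purely numerical features and label."""
    return {
        "source": location_index[record["source"]],
        "destination": location_index[record["destination"]],
        "holiday": encode_yes_no(record["holiday"]),
        "congestion_level": LEVELS.index(record["congestion_state"]),
    }


# Hypothetical raw records.
records = [
    {"source": "A", "destination": "B", "holiday": "Yes", "congestion_state": "Congested"},
    {"source": "B", "destination": "C", "holiday": "No", "congestion_state": "Uninterrupted"},
]
location_index = build_location_index(records)
encoded = [preprocess(rec, location_index) for rec in records]
```

Building the location index from the data keeps the mapping reproducible, since the same set of location names always yields the same integers.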

As the amount of data and the number of data sources continue to grow, it is becoming increasingly difficult to keep up. Data wrangling therefore usually requires manually transforming and mapping data from one raw format to another to simplify data consumption and organization. Another serious concern is that the precision and dependability of traffic congestion prediction are significantly influenced by the quality of the input data. The model’s performance may be adversely affected if the data gathered from IoT sensors or other sources is noisy, incomplete, or outdated. To address this concern, we have implemented a rigorous data preprocessing and cleaning pipeline. This process involves identifying and handling outliers, imputing missing values, and ensuring that the dataset is up to date through regular updates. Additionally, we have employed advanced filtering techniques to reduce the impact of sensor inaccuracies and outliers in the collected data.

Furthermore, our methodology incorporates a validation framework that allows us to assess the quality of the input data and its influence on the model’s performance. By conducting sensitivity analyses and cross-validations, we can gain insights into the robustness of our predictions and identify any potential weaknesses arising from data quality issues.

3.3 Feature extraction phase

The feature extraction phase is the most important phase of the proposed system model. One important feature is the congestion level, which can be measured in terms of delay. The congestion level can be defined in terms of an index whose value is greater than or equal to zero. This index is called the Congestion Index and can be calculated as: $CI = \frac{t_c - t_l}{t_l}$ (6) where $t_c$ is the current travel time for the road section and $t_l$ is the least travel time for the road section.

Table 2 shows the various congestion index ranges with traffic state levels.

Table 2

Traffic state levels

Congestion Index	Congestion Level
[0, 0.15)	Uninterrupted
[0.15, 0.35)	Moderately Congested
[0.35, 0.65)	Congested
[0.65, 2.0)	Extremely Congested
[2.0, ∞)	Blockage

These congestion labels can be calculated on the basis of road segments. For each road segment, we can calculate the congestion index using Equation 6.
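As a worked example of Equation 6 and Table 2, the sketch below computes the congestion index of a road segment from two travel times and maps it to a level number. The function names and the travel times are illustrative assumptions; the interval bounds are those of Table 2.

```python
import bisect

# Lower bounds of levels 1 to 4, taken from Table 2.
BOUNDS = [0.15, 0.35, 0.65, 2.0]
LEVEL_NAMES = [
    "Uninterrupted",
    "Moderately Congested",
    "Congested",
    "Extremely Congested",
    "Blockage",
]


def congestion_index(current_time, least_time):
    """CI = (tc - tl) / tl for one road segment (both times in the same unit)."""
    if least_time <= 0:
        raise ValueError("least_time must be positive")
    return (current_time - least_time) / least_time


def congestion_level(ci):
    """Map a congestion index to its level number using half-open intervals."""
    # bisect_right counts how many bounds are <= ci, which is the level number.
    return bisect.bisect_right(BOUNDS, ci)


# Hypothetical segment: 13 minutes now versus 10 minutes at best.
ci = congestion_index(13, 10)
level = congestion_level(ci)
name = LEVEL_NAMES[level]
```

A segment that currently takes 13 minutes against a best time of 10 minutes has a congestion index of about 0.3, which falls in the [0.15, 0.35) interval and is therefore labeled Moderately Congested.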

3.4 Prediction analyzer

After extracting features, different randomized machine-learning techniques can be applied as prediction analyzers for the prediction of the congestion states. We have applied various techniques such as decision tree classifier, random forest, KNN, gradient boost, AdaBoost, logistic regression, and extra trees classifier.

3.5 Security concern

The gathering and real-time transmission of data from vehicles gives rise to apprehensions regarding the privacy and security of the information. It is imperative to safeguard sensitive data, such as the locations of vehicles, to prevent unauthorized access. To ensure the protection of individual vehicle identities in real-time data collection, various encryption and anonymization techniques can be applied. We have implemented robust encryption measures and access controls to safeguard the confidentiality and integrity of the collected data. Certificate-based Encryption is used to protect sensitive information. We have considered public-key infrastructure (PKI) and digital certificates to encrypt communication between vehicles and infrastructure. This ensures secure and authenticated data transmission without revealing sensitive details.

4 Results & discussion

Different types of machine learning algorithms have been applied to the dataset [26] for predictions. In an ERML-based model, the input variables for these machine learning algorithms [16] are parameters like weather conditions, holidays, source location, destination, and special conditions. The parameters of the learning models have been adjusted to get the best results. The Extra Trees Classifier has been found to give better results than the other algorithms.

A dataset consisting of nine input variables, namely weekday, current time, weather conditions, peak hours’ time, rare conditions like a traffic jam, road accident, source, destination, fastest route name, and fastest traveling time, has been used for simulation. This dataset also contains one output variable, the congestion state. The congestion state can be non-congested, moderately congested, or extremely congested. This dataset is taken from an Internet source [26]. The input variables and the output variable are shown in Table 3.

Table 3

Input and Output variables

Serial Number	Input or Output Variable Name
Input-1	Weekday
Input-2	Current Time
Input-3	Weather Condition
Input-4	Source Location
Input-5	Destination Location
Input-6	Peak/Non-Peak Hours
Input-7	Fastest Route Name
Input-8	Rare Conditions (Traffic Jam/Road Accident)
Input-9	Holiday
Output	Congestion State

The dataset contains 945673 records and consists of nine columns. System time is the current machine time. Traffic data has been obtained from the Google Maps API. Table 4 shows the prediction accuracy of different machine-learning methods.

Table 4

Accuracy score of different learning models

Machine Learning Model	Accuracy Score
Extra Trees Classifier	94%
Decision Tree Classifier	92%
Random Forest	92%
KNN	91%
Gradient Boost	88%
AdaBoost Classifier	64%
Logistic Regression	50%

K-Fold cross-validation is used as the evaluation procedure. Cross-validation is a procedure for evaluating different machine learning methods. It divides the data samples into a number of groups called folds; each fold is used once for testing while the remaining folds are used for training. The number of folds is chosen so as to obtain higher accuracy. In the proposed model, 10 folds have been used for the evaluation of the different machine learning models. Figure 3 shows that the Extra Trees Classifier gives the highest validation accuracy score compared to the other machine learning models and other traditional supervised algorithms.

Fig. 3

Accuracy score of machine learning algorithms.
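The 10-fold evaluation procedure can be sketched with the standard library alone. The helper names, the shuffling seed, and the majority-class baseline used to exercise the loop are assumptions for illustration; any classifier exposing a fit and a predict function could be plugged in instead.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_val_accuracy(fit, predict, X, y, k=10):
    """Average test accuracy over the k folds."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / k


# Hypothetical baseline: always predict the most frequent training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


X = list(range(20))
y = [0] * 20
mean_accuracy = cross_val_accuracy(fit_majority, predict_majority, X, y, k=10)
```

Every sample lands in exactly one test fold, so each record contributes to the reported score once and is never used to test a model that was trained on it.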

The accuracy of the algorithms varies slightly with weather conditions. The more accurate the weather data, the more accurate the results. Figure 4 shows the impact of weather conditions on the correctness of the algorithms.

Fig. 4

Accuracy score of machine learning algorithms with weather conditions.

Figure 4 shows an increase in the accuracy of the Extra Trees Classifier from 94 percent to 94.4 percent and a major change in the case of the Gradient Boost model.

Figure 5 shows the PR curves for all classes represented for congestion prediction. Level 0 represents Uninterrupted, level 1 represents Moderately Congested, level 2 represents Congested, level 3 represents Extremely Congested, and level 4 represents the road is blocked due to special conditions like accidents, etc.

Fig. 5

PR Curves for all classifiers with accuracy score.

The Extra Trees Classifier gives the best accuracy score among the classifiers. Equations (7)–(10) define the parameters used to measure the performance of the proposed model, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives. $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ (7) $Sensitivity / Recall = \frac{TP}{TP + FN}$ (8) $Precision = \frac{TP}{TP + FP}$ (9) $F1\ Score = \frac{2 \times TP}{2 \times TP + FN + FP}$ (10)
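For a single class treated as the positive class, the four measures can be computed directly from the confusion counts, as in the sketch below. Precision is taken in its standard form TP / (TP + FP). The counts are hypothetical and do not come from the experiments reported here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and F1 score from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        # Equivalent to the harmonic mean of precision and recall.
        "f1": 2 * tp / (2 * tp + fn + fp),
    }


# Hypothetical counts for one congestion level versus all other levels.
metrics = classification_metrics(tp=90, tn=880, fp=10, fn=20)
```

In a multi-class setting such as the five congestion levels, these values are computed once per level by treating that level as positive and all other levels as negative.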

In the PR curve of the Extra Trees Classifier, the curves for the uninterrupted (smooth traffic) and blockage classes almost touch the value 1. This indicates a higher accuracy score.

Table 5 shows the recall, precision, and F1 score of the Extra Tree Classifier model for each congestion level.

Table 5

Recall, Precision, and F1 score of the Extra Tree Classifier

Level	Recall Value	Precision Value	F1 - Score
Level 0	0.94390385	0.92357183	0.93362716
Level 1	0.83233383	0.8178247	0.82501548
Level 2	0.73478729	0.77517398	0.75444053
Level 3	0.92982736	0.94330675	0.93651855
Level 4	0.85617978	0.8369727	0.8339381

The confusion matrix is a powerful method for analyzing classification problems [12]. It describes how the data belonging to a single class are distributed across the predicted classes. The normalized confusion matrix for the traffic congestion classes used in the simulation is shown in Fig. 6. In this figure, the diagonal elements represent the proportion of true positives for each class. Level 0 (uninterrupted traffic) has 0.94 true positives, level 1 (moderately congested) has 0.83, level 2 (congested) has 0.74, and level 3 (extremely congested) has 0.93; level 4 stands for the blocked road.

Fig. 6

Normalized confusion matrix for traffic congestion classes.
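A row-normalized confusion matrix of this kind can be built as in the sketch below, which assumes integer class labels and uses a small made-up set of labels rather than the paper's data. Each row is divided by the number of true samples in that class, so the diagonal entry of a row is the recall for that class.

```python
def normalized_confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: rows are true classes, columns predictions."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        # Guard against classes with no true samples.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Made-up labels for three classes (illustration only).
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]

cm = normalized_confusion_matrix(y_true, y_pred, n_classes=3)
for row in cm:
    print(["%.2f" % v for v in row])
```

Off-diagonal entries show which classes a given level is most often confused with, which is the information a per-class recall value alone does not provide.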

Table 6 compares the proposed ERML system with other methods in terms of accuracy.

To assess the practical viability of the proposed model for real-world traffic management applications, we analyzed its latency characteristics. We measured the time taken for the model to process incoming traffic data and generate predictions under various traffic conditions and deployment scenarios. The measured latency is 0.45 seconds, with inference times consistently within the acceptable range for real-time decision-making in traffic management. Furthermore, we optimized the model architecture and used efficient algorithms to minimize latency without compromising prediction accuracy.

Table 6

Comparison between the proposed ERML system and other algorithms

Algorithm	Training accuracy for traffic state classification
Tamimi and Zahoor [18]	78.12%
Pushpi and Shaw [15]	91.265%
Impedovo et al. [11]	84%
Zafar and Ul Haq [22]	92%
Jilani et al. (2022)	90.59%
Proposed ERML System	94%

The comparison shows that the proposed ERML system achieves higher accuracy than the previously proposed methods.
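The latency measurement described in this section can be approximated with a simple timing harness such as the one below. It is a sketch under stated assumptions: `measure_latency` and `dummy_predict` are hypothetical names, the stand-in predictor and its speed thresholds are invented for the example, and in practice the trained model's prediction call would be passed in instead.

```python
import statistics
import time

def measure_latency(predict, batch, repeats=50):
    """Time repeated calls of predict(batch); return (median, worst) in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(batch)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), max(timings)

# Hypothetical stand-in for a trained model's predict method:
# maps vehicle speeds to a congestion level with made-up thresholds.
def dummy_predict(batch):
    return [0 if speed > 40 else 2 for speed in batch]

median_s, worst_s = measure_latency(dummy_predict, [55.0, 32.5, 12.0])
print(f"median {median_s:.6f} s, worst {worst_s:.6f} s")
```

Reporting both the median and the worst observed time gives a fuller picture than a single figure, since real-time use depends on the slowest predictions as well as the typical ones.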

5 Practical Challenges in Implementing the Proposed System for Traffic Congestion Prediction in the Internet of Vehicles (IoV)

While the proposed system shows promise in enhancing traffic management and reducing congestion, it is essential to acknowledge and address the practical challenges associated with its real-world implementation. This section discusses key challenges, including regulatory compliance, scalability, integration with existing infrastructure, operational considerations, data security and privacy, dynamic traffic conditions, and cost.

Regulatory Compliance:

Challenge: Adhering to existing traffic regulations and obtaining necessary approvals for deploying predictive systems in real-time traffic management.

Solution: Collaborating with regulatory bodies, policymakers, and local authorities to ensure the system aligns with legal frameworks and safety standards.

Scalability:

Challenge: Ensuring the system can handle increased data volume and user demand as the scale of the IoV ecosystem expands.

Solution: Employing scalable architecture and cloud-based solutions to accommodate growing datasets and user interactions.

Integration with Existing Infrastructure:

Challenge: Seamlessly integrating the proposed predictive model with diverse existing IoV technologies and traffic management systems.

Solution: Conducting compatibility assessments, employing standardized communication protocols, and collaborating with stakeholders to facilitate smooth integration.

Operational Considerations:

Challenge: Addressing operational issues such as system maintenance, periodic updates, and user training for effective utilization.

Solution: Implementing robust maintenance protocols, providing regular updates, and offering comprehensive training programs for relevant personnel.

Data Security and Privacy:

Challenge: Safeguarding sensitive information collected from vehicles while ensuring user privacy.

Solution: Implementing encryption protocols, anonymizing data, and adopting privacy-preserving techniques to protect user information.

Dynamic Traffic Conditions:

Challenge: Adapting the predictive model to handle dynamic changes in traffic patterns and unexpected events.

Solution: Integrating real-time data feeds, leveraging adaptive learning algorithms, and incorporating dynamic feedback mechanisms to enhance model accuracy.

Cost:

Challenge: Covering the expenses of establishing the communication infrastructure needed for smooth data transfer among vehicles, roadside units, and central servers, together with the ongoing costs of maintaining hardware and infrastructure components, including software updates, hardware replacements, and system optimization to ensure long-term reliability.

6 Conclusion & future work

IoV includes many vehicle nodes and infrastructure. By communicating among these components, vehicle nodes can obtain information about traffic, helping to reduce traffic congestion. Thus, it is essential to maintain connectivity among these components.

This paper presents an ERML-based methodology to predict the traffic congestion state with higher accuracy. It also introduces an IoV- and machine learning-based model for intelligent traffic congestion prediction.

Intelligent transportation systems have a greater impact in the area of traffic management. They provide the flexibility to minimize traffic congestion, whereas traditional systems do not.

Extremely randomized machine learning techniques have been used for training on the dataset. The proposed system achieves 94% accuracy, which is better than the previously proposed techniques.

We think that this work also opens the door to the following research:

The work can be extended by applying other machine learning techniques that may identify the congestion states more accurately.

With the fast development of Intelligent Traffic Management Systems, infrastructures based on sensor devices generate sensing data at the terabyte level and above. Such an unprecedented volume of data poses considerable difficulties for real-time and fine-grained traffic prediction. Modeling and analysis methods for such situations are highly desired.

Real-time data gathering and analysis is another direction for future work.

References

1. Ata, Khan M.A., Abbas, Ahmad, Fatima, Modeling smart road traffic congestion control system using machine learning techniques, Neural Network World 29(2) (2019), 99–110. doi: 10.14311/NNW.2019.29.008

2. Aziz, Akhir E.A.P., Aziz I.A., Jaafar, Hasan M.H., Abas A.N.C., A Study on Gradient Boosting Algorithms for Development of AI Monitoring and Prediction Systems. In 2020 International Conference on Computational Intelligence (ICCI), (2020), 11–16. IEEE. doi: 10.1109/ICCI51257.2020.9247843

3. Breiman, Random forests, Machine Learning 45(1) (2001), 5–32. doi: 10.1023/A:1010933404324

4. Chen, Makki, Pissinou, Machine Learning on Congestion Analysis Based Real-Time Navigation System, International Journal on Artificial Intelligence Tools 20(04) (2011), 753–781. doi: 10.1142/S0218213011000346

5. Dimitrakopoulos, Intelligent transportation systems based on internet-connected vehicles: Fundamental research areas and challenges. In 2011 11th International Conference on ITS Telecommunications, (2011), 145–151. IEEE. doi: 10.1109/ITST.2011.6060042

6. Dureja, Suman, Efficient transportation: future aspects of IoV, International Journal of Vehicle Information and Communication Systems 5(3) (2020), 290–308. doi: 10.1504/IJVICS.2020.110994

7. Eswaraprasad, Raja, Improved intelligent transport system for reliable traffic control management by adapting internet of things, International Conference on Infocom Technologies and Unmanned Systems (Trends and Future Directions) (ICTUS) (2017), 597–601. IEEE. doi: 10.1109/ICTUS.2017.8286079

8. Kamble S.J., Kounte M.R., Machine learning approach on traffic congestion monitoring system in internet of vehicles, Procedia Computer Science 171 (2020), 2235–2241. doi: 10.1016/j.procs.2020.04.241

9. Elfar, Talebpour, Mahmassani H.S., Machine learning approach to short-term traffic congestion prediction in a connected environment, Transportation Research Record 2672(45) (2018), 185–195. doi: 10.1177/0361198118795010

10. Geurts, Ernst, Wehenkel, Extremely randomized trees, Machine Learning 63(1) (2006), 3–42. doi: 10.1007/s10994-006-6226-1

11. Impedovo, Balducci, Dentamaro, Pirlo, Vehicular traffic congestion classification by visual features and deep learning approaches: a comparison, Sensors 19(23) (2019), 5213. doi: 10.3390/s19235213

12. Mitchell T.M., Machine learning and data mining, Communications of the ACM 42(11) (1999), 30–36.

13. Nguyen D.B., Dow C.R., Hwang S.F., An efficient traffic congestion monitoring system on internet of vehicles, Wireless Communications and Mobile Computing 2018 (2018). doi: 10.1155/2018/9136813

14. Navada, Ansari A.N., Patil, Sonkamble B.A., Overview of use of decision tree algorithms in machine learning. In 2011 IEEE control and system graduate research colloquium, (2011), 37–42. IEEE. doi: 10.1109/ICSGRC.2011.5991826

15. Pushpi, Shaw K.D., Artificial neural networks approach induced by fuzzy logic for traffic, Journal of Engineering Technology 1 (2018), 15.

16. Singh, Thakur, Sharma, A review of supervised machine learning algorithms. In 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), (2016), 1310–1315. IEEE.

17. Sun, Chen, Sun, Traffic congestion prediction based on GPS trajectory data, International Journal of Distributed Sensor Networks 15(5) (2019), 1550147719847440. doi: 10.1177/1550147719847440

18. Tamimi, Zahoor, Link delay estimation using fuzzy logic. In 2010 The 2nd international conference on computer and automation engineering (ICCAE) (Vol. 2, 406–411), (2010). IEEE. doi: 10.1109/ICCAE.2010.5451575

19. , Lin, Qiao, Liu, Deep traffic congestion prediction model based on road segment grouping, Applied Intelligence (2021), 1–23. doi: 10.1007/s10489-020-02152-x

20. Vaishnavi, Suseela, Geographical Detection of Traffic Congestion Using Machine Learning Algorithms, International Journal of Advanced Science and Technology 29(7s) (2020), 4760–4771.

21. , Yang, Full glowworm swarm optimization algorithm for whole-set orders scheduling in single machine, The Scientific World Journal 2013 (2013). doi: 10.1155/2013/652061

22. Zafar, Ul Haq, Traffic congestion prediction based on Estimated Time of Arrival, PloS one 15(12) (2020), e0238200. doi: 10.1371/journal.pone.0238200

23. Zhang, Liu, Cui, Leng, Xie, Zhang, Short-term traffic congestion forecasting using attention-based long short-term memory recurrent neural network. In International Conference on Computational Science, (2019), 304–314. Springer, Cham. doi: 10.1007/978-3-030-22744-9_24

24. , Cheng, Research on Traffic Congestion Forecast Based on Deep Learning, Information 14(2) (2023), 108.

25. Yasir R.M., Nower D.N., Shoyaib D.M., Traffic Congestion Prediction Using Machine Learning Techniques, (2022). arXiv preprint arXiv:2206.10983.

26. https://doi.org/10.1371/journal.pone.0238200.s003