Abstract
Traffic congestion prediction is one of the main components of Intelligent Transportation Systems (ITS). Traffic congestion in metropolitan road networks has a substantial impact on sustainability, and effective congestion management reduces prolonged travel delays and pollutant emissions. However, because traffic patterns are complex, dynamic, and non-linear, accurately predicting the spread of congestion remains difficult. The introduction of Internet of Things (IoT) devices has made available datasets that can support the creation of sustainable and intelligent transportation for contemporary cities. This work presents an Intelligent Deep Learning Framework that combines Bi-LSTM with the TabTransformer architecture for congestion prediction in the Internet of Vehicles (IoV). The Bi-LSTM enhances temporal modeling of traffic flow dynamics, and the TabTransformer employs self-attention to derive high-quality feature representations. The proposed framework is tested on a large-scale dataset of 317,112 traffic instances incorporating a wide range of contextual features such as time indicators, route characteristics, and environmental factors. The model is validated using stratified k-fold cross-validation, and class imbalance is handled by class weighting in each fold. Experimental evaluation indicates that the proposed hybrid framework achieves 99.52% accuracy, outperforming several baseline models, including Decision Tree (92%), Extra Trees (94%), LSTM (86.94%), GRU (87.09%), Bi-LSTM (87.77%), CNN-LSTM (98.41%), and TabTransformer (97.32%). The proposed framework also outperforms the existing algorithms in terms of MSE, MAE, and R². These results indicate that combining BiLSTM with TabTransformer is well suited to predicting congestion patterns in road networks.
Keywords
Introduction
Automobile congestion in urban and rural areas has emerged as a critical challenge for contemporary mobility, resulting in substantial economic losses through prolonged journey durations, fuel inefficiency, and long-term environmental degradation. Forecasts by the International Transportation Forum indicate that traffic delays will increase by more than 25% in the coming decade if intelligent transportation technologies are not rapidly implemented globally. The rapid growth of the automobile sector strains conventional infrastructure, and unregulated urban expansion has exacerbated congestion, particularly in emerging areas. As a consequence, the focus has shifted towards the development of smart cities and route optimization, and dependable and prompt traffic congestion forecasting has become fundamental to intelligent transportation systems (ITS).
Traditional traffic modeling techniques such as the Auto-Regressive Integrated Moving Average (ARIMA) model, the Seasonal ARIMA (SARIMA) model, and Kalman filtering were earlier employed for short-term traffic forecasting. Precise traffic forecasting is challenging because of the stochastic nature of traffic flow and the lack of infrastructure. These traditional models are computationally efficient, but their effectiveness depends on assumptions of linearity and stationarity, which break down for nonlinear patterns caused by events such as accidents, roadworks, or weather fluctuations. In addition, classical statistical models are sensitive to anomalous data and do not generalize well across diverse road networks.
The shift to data-driven models came with the advent of machine learning methods such as Support Vector Regression (SVR), Random Forest (RF), and Gradient Boosting Machines (GBM). These algorithms improve the accuracy of traffic prediction by handling nonlinearity and using different categories of features. However, such models rely on manual feature selection, which prevents them from exploiting the temporal structure of sequential traffic data. The lack of intrinsic temporal modeling limits their effectiveness in highly dynamic traffic situations, such as peak traffic hours and unexpected events.
Deep learning methods such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs) outperform traditional approaches in modeling long-range temporal dependencies and in short-term prediction. They have changed how prediction models work by automatically learning hierarchical and temporal representations from large datasets. LSTM networks are very effective at finding patterns in sequential data because they mitigate the vanishing-gradient problem that affects traditional RNNs. However, these models primarily focus on how traffic patterns change over time and do not take into account factors such as road infrastructure, traffic movement restrictions, weather, and time of day, which have a large impact on traffic patterns.
Several authors have combined convolutional neural networks (CNNs) and graph neural networks (GNNs) with LSTMs to enhance the accuracy of traffic forecasts by capturing spatial correlations across road segments. CNN–LSTM hybrids and Spatio-Temporal Graph Convolutional Networks (ST-GCN) capture spatio-temporal relationships well; however, they require extensive topological data and struggle with non-spatial categorical variables. These approaches largely overlook additional attributes, including road classification, lane count, and meteorological conditions, which provide critical contextual information for understanding traffic congestion trends. Recent advances in transformer architectures have transformed time-series modeling by employing self-attention to capture long-term dependencies efficiently. The Temporal Fusion Transformer (TFT), Informer, and Autoformer models have shown better results in traffic forecasting and other prediction tasks. However, transformer-based frameworks are primarily designed for homogeneous sequence data and lack inherent capabilities for handling tabular contextual information that includes both categorical and continuous variables. Combining these different types of data within attention models therefore remains difficult for traffic prediction.
The TabTransformer model is designed for learning from tabular data and mitigates the aforementioned issues by using contextual embeddings and multi-head self-attention to capture interactions between categorical features. TabTransformer learns how the values of categorical variables relate to each other, which significantly improves the model's ability to represent structured metadata compared with traditional one-hot encoding or embedding concatenation. When such an attention model is combined with a temporal model, the result captures both static environmental conditions and changes in traffic over time. Although there have been major breakthroughs in traffic congestion prediction, current methods still have several drawbacks. Conventional machine learning methods are inadequate for modeling complex nonlinear relationships in traffic congestion data. Although deep learning models like LSTM, GRU, and CNN-LSTM networks have made progress, they mainly concentrate on the temporal relationships in the data without fully exploring the diverse contextual features that are common in Internet of Vehicles (IoV) scenarios.
Recent transformer models have shown excellent performance in processing structured tabular data with attention mechanisms. However, current methods mainly focus on using either transformer models or sequence-based recurrent models. Moreover, many existing studies have been conducted using small-scale datasets or have not considered the difficulties brought by large-scale IoV-generated heterogeneous data, such as high-cardinality categorical attributes, dynamic environmental factors, and multi-factor traffic effects. Thus, there is a research gap in constructing a hybrid deep learning model that can (i) concurrently model bidirectional temporal dependencies in traffic data, (ii) learn contextual embeddings from structured IoV tabular data using attention mechanisms, and (iii) effectively integrate this complementary information to improve predictive performance.
To fill the research gap, this paper proposes an integrated BiLSTM-TabTransformer framework to concurrently model temporal and contextual information for large-scale traffic congestion prediction in IoV-enabled smart transportation systems.
The key contributions of the proposed work are as follows. First, a novel hybrid TabTransformer–BiLSTM framework is introduced for traffic congestion prediction in Internet of Vehicles (IoV) environments. The proposed model effectively processes heterogeneous tabular traffic data containing both categorical and numerical attributes. The TabTransformer module leverages a self-attention mechanism to capture intricate feature dependencies, while the Bi-LSTM component strengthens temporal learning by modeling bidirectional traffic flow patterns. Second, the framework minimizes the need for manual feature engineering and provides a unified solution that overcomes the limitations of conventional machine learning and deep learning approaches. Third, comprehensive experiments demonstrate the superiority of the proposed model over existing baselines, such as LSTM, GRU, CNN-LSTM, decision trees, and extra trees. The hybrid system achieves a state-of-the-art accuracy of 99.52%, accompanied by significantly reduced MSE and MAE and an improved R² score. Overall, the proposed solution offers a robust, interpretable, and sustainable approach for real-time traffic congestion forecasting in smart-city environments.
The remainder of this paper is organized as follows: Section 2 reviews related literature; Section 3 describes the proposed methodology; Section 4 presents the results and discussion; Section 5 presents the practical challenges in implementing the proposed system for traffic congestion prediction in the Internet of Vehicles (IoV); and Section 6 presents the conclusion and future work.
Related Work
Making an automobile traffic system work well requires addressing many problems. Smart vehicles today can connect to the internet, and cellphones let people communicate while traveling, which makes it feasible to analyze the data that vehicle sensors collect. Data stored on cloud servers is now used to find patterns that indicate whether traffic is becoming too heavy.
Traffic congestion forecasting in IoV has attracted significant attention from researchers, who employ a variety of deep learning and machine learning frameworks. Earlier work used recurrent neural networks; for example, Li et al. (2019) proposed an LSTM-based model that was very good at finding temporal dependencies in traffic data to predict congestion. Wang et al. (2020) deployed Graph Convolutional Networks (GCNs) to integrate spatial and temporal correlations within IoV networks. Zhang et al. (2021) designed a hybrid CNN–LSTM model, demonstrating the benefits of combining convolutional feature extraction with temporal learning. Adaptive methods also appeared: Kumar and Singh (2021) proposed a reinforcement learning-based congestion control method to enable real-time decision-making in IoV environments. To improve temporal modeling, Chen et al. (2022) used an attention-based BiLSTM to make short-term forecasting more accurate.
L. Huang et al. (2022) designed federated learning frameworks for decentralized IoV congestion prediction to safeguard privacy across edge devices in the context of widespread distributed learning. Gupta et al. (2022) proposed a deep autoencoder architecture for the identification of congestion anomalies, among other complementary anomaly-based detection methods.
Researchers are increasingly using transformer-based methods to design models that are more accurate than traditional methodologies. Zhao et al. (2023) proposed a transformer model to predict congestion in the IoV, whereas Singh et al. (2023) devised ensemble machine learning techniques for traffic classification. Zou et al. (2024) used deep learning to identify trend deviations in urban networks, whereas Luo et al. (2024) advanced graph-based techniques by implementing a Gated Graph Neural Network (GGNN) for real-time IoV traffic detection.
The survey-oriented studies, including the work of Mystakidis et al. (2025), provided a comprehensive analysis of innovative IoV congestion forecasting approaches and highlighted the shift towards hybrid and interpretable models. Toba et al. (2025) demonstrated the importance of long-term LSTM forecasting, whereas Krishnasamy et al. (2025) integrated BiLSTM with adaptive optimization for sustainable traffic management.
Graph Neural Networks (GNNs) have potential for traffic congestion prediction by using the graph structure of road networks. Lakshman S. V. Sri et al. (2023) have proposed a traffic prediction system consisting of three main components, namely preprocessing, model development, and model evaluation. The proposed system has shown significant improvement in traffic management and reduction in traffic congestion using the GNN approach.
The proposed approach introduces a hybrid TabTransformer and BiLSTM model, which distinguishes it from existing techniques. Recent research primarily focuses on temporal modeling (LSTM/BiLSTM) (Chen et al., 2022; Krishnasamy et al., 2025; Li et al., 2019; Toba et al., 2025), spatial modeling (GCN/GGNN) (Luo et al., 2024; Wang et al., 2020), and feature representation learning (CNN/Transformers) (Zhang et al., 2021; Zhao et al., 2023). In contrast to models such as DeepFM, which mainly target feature interaction through factorization mechanisms, or NODE and similar tree-based neural models that depend on soft decision structures, TabTransformer enables flexible attention-based contextualization and remains compatible with sequence modeling tasks downstream. Moreover, in contrast to FT-Transformer models, TabTransformer provides a stable and well-documented architecture that is specifically optimized for structured tabular data with heterogeneous categorical attributes, which matches the nature of IoV traffic data.
Despite the achievements obtained by the existing studies in the field of traffic prediction, there still exist some gaps. The existing temporal models, such as LSTM and GRU, can effectively handle sequential dependencies but cannot make full use of heterogeneous contextual features in IoV data. On the other hand, deep learning models for tabular data and transformer-based feature learning can effectively handle complex feature interactions but cannot effectively capture bidirectional temporal dynamics in traffic flow.
In this paper, the proposed framework combines sequential modeling (BiLSTM) with feature representation learning (TabTransformer), making it particularly suitable for IoV contexts.
Bidirectional Long Short-Term Memory (BiLSTM)
LSTM networks are a type of recurrent neural network designed to capture long-range dependencies by addressing the vanishing-gradient problem that is common in RNNs. The LSTM model uses memory cells and gating mechanisms, specifically input, forget, and output gates, to control how information flows, making it easier to retain or discard information (Hochreiter & Schmidhuber, 1997). The BiLSTM is an extension of the standard LSTM architecture that processes input sequences in both directions. Unlike typical LSTMs, which only use past information, BiLSTMs use both past and future information to provide a more complete picture of sequential data. Because BiLSTMs process information in both directions, they are well suited for tasks that require understanding context from both sides (Schuster & Paliwal, 1997). A BiLSTM has two LSTM layers: a forward LSTM, which reads the input sequence from t = 1 to t = n, and a backward LSTM, which reads the sequence from t = n to t = 1.
At each time step t, the hidden states from both directions are concatenated to form a context-rich representation:
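In standard BiLSTM notation this concatenation can be written as follows. The symbols are assumptions, since the source does not fix them: the arrows mark the forward and backward hidden states, and d_h is the number of hidden units per direction.

```latex
h_t = \left[\,\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\,\right] \in \mathbb{R}^{2 d_h}
```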
This enables the Bi-LSTM to use information from both past and future time steps simultaneously, giving the model a more complete understanding of the surrounding context. This dual-direction processing makes Bi-LSTM particularly effective for tasks where both past and future information matter, such as Natural Language Processing (NLP), sentiment analysis, speech recognition, named entity recognition, biomedical signal analysis, and time-series forecasting.
The Bi-LSTM architecture is depicted in Figure 1. A single bidirectional LSTM layer with 128 hidden units in each direction is applied to the temporal input sequence. The hidden states from the forward and backward passes are concatenated at each time step, producing a 256-dimensional temporal embedding. Dropout (0.3) and recurrent dropout (0.15) are used to improve generalization. The resulting representation is then combined with contextual features in the hybrid architecture.

The bi-LSTM architecture implemented in the proposed framework. The temporal traffic sequences are processed by a single-layer bidirectional LSTM with 128 hidden units in each direction (256-dimensional output).
The TabTransformer is a recent deep learning architecture introduced specifically to improve learning from tabular data, particularly when the dataset contains high-cardinality categorical features, heterogeneous feature types, and complex feature interactions (X. Huang et al., 2020). Unlike traditional fully connected networks that struggle to capture relationships among categorical variables, the TabTransformer uses the self-attention mechanism of Transformers to produce contextual embeddings of categorical features. Traffic congestion prediction depends not only on temporal trends but also on rich contextual metadata such as road category (highway, arterial, local), number of lanes, time of day, day of week, weather conditions, and traffic management rules (speed limit, tolling zones). Traditional encoding methods (one-hot encoding and label encoding) explode the feature dimensionality, lose semantic relationships between categorical values, and fail to capture interactions across features. The TabTransformer addresses these issues using a Transformer-based contextual embedding scheme.
The TabTransformer architecture consists of three major components:
Categorical Embedding Layer
Each categorical value is mapped to a dense vector representation:
This converts discrete values (e.g., “Highway”, “Expressway”) into continuous embeddings.
Transformer Encoder Block
The embedded categorical tokens are processed through multiple self-attention layers:
This allows the model to learn inter-feature relationships, share information across categorical variables and build contextual feature representations.
Continuous Feature Projection
Continuous features (e.g., speed limit, latitude, rainfall intensity) are normalized and linearly projected into the same embedding space.
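Finally, the TabTransformer performs feature fusion: the final tabular representation concatenates the contextualized categorical embeddings with the projected continuous features.
This fused vector encodes both the contextual interactions across categorical attributes and the numerical environmental and infrastructure attributes.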
Let X = {x_1, x_2, …, x_F} be the set of F categorical features at a given time step. Each feature x_i is associated with a learnable embedding vector $e_i \in \mathbb{R}^{d}$, where $d$ is the embedding dimension. The embedded feature matrix can thus be written as $E \in \mathbb{R}^{F \times d}$. Self-attention is applied along the feature dimension, allowing each feature embedding to attend to the other feature embeddings and learn contextual relationships while maintaining the same dimensionality in the result. Figure 2 shows the architecture of the TabTransformer.

Tab transformer architecture.
Figure 2 represents the TabTransformer-based contextual encoding module. Categorical features are embedded, continuous features are linearly projected, and the resulting feature tokens are concatenated and transformed using a transformer encoder to capture feature dependencies. The contextual representation is obtained after global pooling.
Table 1 shows a comparative analysis of deep learning techniques for traffic congestion prediction in the existing works. The literature included in Table 1 was chosen based on its relevance to traffic congestion prediction, its relevance to intelligent transportation systems or IoV, the adoption of advanced machine learning or deep learning models, and the publication date, so that the selection reflects the latest trends in research. To keep the comparison interpretable, review and survey articles are treated separately from research contributions.
Comparative Analysis of Deep Learning Techniques for Traffic Congestion Prediction.
This section presents the proposed hybrid deep learning framework for short-term traffic congestion prediction.
Problem Formulation
Let Xt denote the multivariate traffic state at time t, consisting of speed, flow, occupancy, and environmental conditions. Let Zt represent a vector of contextual (non–time-series) attributes including road geometry, temporal indicators, and weather metadata. The objective is to predict future congestion levels Yt + Δ for a forecasting horizon Δ ∈ {5,10,15,30} minutes.
Formally, the prediction function is defined as:
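Using the symbols defined above, one consistent way to write this mapping is shown below. The look-back window length L is borrowed from the window notation X_{t−L+1:t} used later for the BiLSTM input; writing the mapping with exactly these arguments is an assumption.

```latex
\hat{Y}_{t+\Delta} = F\!\left(X_{t-L+1:t},\; Z_t\right), \qquad \Delta \in \{5, 10, 15, 30\}\ \text{minutes}
```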
In equation (4), Ŷt+Δ represents the predicted congestion state at the future time step t + Δ, with a forecasting horizon of Δ. The expression
The function F is realized through the proposed BiLSTM–TabTransformer Fusion Network. Figure 3 shows the proposed hybrid architecture.

Proposed hybrid architecture.
The proposed framework consists of four tightly coupled modules (Figure 3): (1) the Temporal Encoding Module (BiLSTM), (2) the Contextual Encoding Module (TabTransformer), (3) the Feature Fusion Layer, and (4) the Prediction Head (Fully Connected Network).
The architecture is designed such that temporal and contextual information are processed in parallel and integrated at the representation level, enabling the system to learn complementary patterns.
Traffic data exhibits temporal autocorrelation, daily periodicity, and sudden anomalies. To capture both forward and backward temporal influences, a Bidirectional Long Short-Term Memory (BiLSTM) network is employed.
The sequential input window Xt−L+1:t is processed by BiLSTM in both forward and reverse directions.
Forward pass:
Backward pass:
The final temporal embedding is:
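A common way to write these three steps is given below. The symbols are assumptions: x_τ is the input at step τ of the window, and the subscripts fwd and bwd mark two LSTMs with separate parameters.

```latex
\begin{aligned}
\overrightarrow{h}_\tau &= \mathrm{LSTM}_{\mathrm{fwd}}\!\left(x_\tau,\ \overrightarrow{h}_{\tau-1}\right), \\
\overleftarrow{h}_\tau &= \mathrm{LSTM}_{\mathrm{bwd}}\!\left(x_\tau,\ \overleftarrow{h}_{\tau+1}\right), \\
h_\tau &= \left[\,\overrightarrow{h}_\tau \,;\, \overleftarrow{h}_\tau\,\right], \qquad \tau = t-L+1, \dots, t .
\end{aligned}
```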
Processing the sequence in both the forward and backward directions lets the model draw on context from both the beginning and the end of the window, which suits the prediction of traffic congestion and rebound transitions. A temporal pooling layer (global average pooling) then reduces the sequence dimension, yielding a fixed-size representation vector:
In our work, Global Average Pooling is used to pool the BiLSTM temporal features into a fixed-size vector. Global Average Pooling is beneficial in that it gives a stable representation of the bidirectional sequence information by focusing on the overall temporal patterns rather than just the last time step. In contrast to attention-based pooling, global average pooling does not add any new training parameters.
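The pooling step can be illustrated with a minimal pure-Python sketch. The function name and the toy values are hypothetical; a real model would apply the same averaging to the 256-dimensional BiLSTM outputs inside a deep learning framework.

```python
def global_average_pool(hidden_states):
    """Average a sequence of hidden-state vectors over the time axis.

    hidden_states: list of L vectors (one per time step), each a list of
    floats of equal length. Returns one vector of that same length.
    """
    if not hidden_states:
        raise ValueError("need at least one time step")
    steps = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[k] for h in hidden_states) / steps for k in range(dim)]


# Toy example (hypothetical values): 3 time steps, 4-dimensional output per step.
H = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
pooled = global_average_pool(H)
print(pooled)  # [2.0, 1.0, 2.0, 2.0]
```

The output length depends only on the hidden size, not on the window length L, and no trainable parameters are involved.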
Contextual encoding accounts for factors such as road infrastructure type, weather, time of day, lane count, and geographic demographics, which have a significant impact on traffic behavior, although they are not the only factors that contribute to traffic congestion. These characteristics generally comprise heterogeneous, high-cardinality categorical variables, posing challenges for traditional neural network models (Hyman et al., 2012).
To address this difficulty, we use TabTransformer, a transformer-based model that is specifically designed for tabular data.
Let C = {c1, c2, …, cm} denote the collection of categorical variables, where each category is embedded into a compact vector:
These embeddings form a token sequence analogous to natural language tokens. To capture inter-feature relationships, TabTransformer applies multi-head self-attention:
In the self-attention mechanism, the input embeddings are linearly transformed into Query (Q), Key (K), and Value (V) using weight matrices. The Query vector identifies which features are looking for context, the Key vector calculates the relevance, and the Value vector holds the context to be aggregated. The attention scores are calculated by scaled dot-product similarity between the Query and Key vectors, and the weighted sum of the Value vector facilitates dynamic feature interactions.
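The Query/Key/Value computation described above can be sketched for a single attention head in pure Python. This is a sketch under stated assumptions: the projection matrices here are identity matrices chosen for readability (in practice they are learned), and a multi-head layer would run several such heads in parallel and concatenate their outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def self_attention(E, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over feature tokens.

    E: m x d matrix of feature embeddings (one row per categorical feature).
    Wq, Wk, Wv: d x d_k projection matrices (hypothetical; normally learned).
    Returns (context, weights): m x d_k outputs and m x m attention weights.
    """
    Q, K, V = matmul(E, Wq), matmul(E, Wk), matmul(E, Wv)
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]           # transpose of K
    scores = matmul(Q, K_t)                        # query-key similarities
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights             # weighted sum of values


# Toy input: 3 feature tokens with 2-dimensional embeddings.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
context, attn = self_attention(E, I2, I2, I2)
```

Each row of `attn` sums to one, and the output keeps one row per feature token, so the number of tokens is preserved while each token absorbs context from the others.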
Stacking multiple transformer encoder layers ensures deep contextual feature learning:
Continuous contextual features Zcont ∈ Rq are normalized and projected:
The tabular representation is obtained by concatenating categorical and continuous embedding:
To integrate sequential and contextual information, we concatenate the learned vectors:
This fused representation unifies temporal dynamics (from BiLSTM) with static or slowly varying attributes (from TabTransformer).
For feature fusion, the concatenation operation is employed, where the temporal embedding obtained through the Bi-LSTM module is fused with the contextual representation obtained through the TabTransformer model. The concatenation operation was chosen based on its simplicity, computational efficiency, and ability to maintain complementary information between two feature spaces without requiring any extra parameters. More complex feature fusion methods, such as gated fusion or cross-modal fusion based on attention, could be employed to capture more complex interactions between temporal and contextual representations. However, this would also increase model complexity, resulting in higher computational costs. The concatenation operation is therefore a good choice for feature fusion of heterogeneous IoV traffic features in this framework.
The fusion output passes through a multi-layer perceptron (MLP) designed to perform either regression (speed/congestion index) or classification (congestion level):
ϕ = ReLU activation, σ = identity (regression) or softmax (classification).
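A minimal sketch of concatenation fusion followed by a prediction head in classification mode is shown below. The layer sizes, weights, and helper names are hypothetical and kept tiny for readability; only the structure (concatenate, hidden layer with ReLU, softmax output) follows the description above.

```python
import math


def dense(x, W, b):
    """Affine layer: W is a list of rows (out_dim x in_dim), b has out_dim entries."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(h_temporal, h_tabular, W1, b1, W2, b2):
    """Concatenate both embeddings, apply one ReLU hidden layer, then softmax."""
    fused = h_temporal + h_tabular        # list concatenation = feature concatenation
    hidden = relu(dense(fused, W1, b1))
    return softmax(dense(hidden, W2, b2))


# Toy values (hypothetical): 2-d temporal and 2-d tabular embeddings, 3 classes.
h_temp = [0.5, -1.0]
h_tab = [1.0, 2.0]
W1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]   # 2 hidden units
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]           # 3 congestion classes
b2 = [0.0, 0.0, 0.0]
probs = predict(h_temp, h_tab, W1, b1, W2, b2)
```

For the regression variant, the final `softmax` would be replaced by the identity, so the head returns the raw output of the last affine layer.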
Regularization techniques include dropout, weight decay, and early stopping to prevent overfitting.
The dataset employed in this research was originally gathered using the Google Maps API (https://doi.org/10.1371/journal.pone.0238200.s003). It contains traffic data from 123 traffic monitoring points on major roads in Islamabad and Rawalpindi, Pakistan, such as Murree Road, Stadium Road, 9th Avenue, and Jinnah Avenue. In November 2019, January 2020, and February 2020, data were collected every 6 to 8 minutes, 24 hours a day, 7 days a week. Some data points were also collected in July 2019. The dataset contains close to one million traffic data points that represent real-time congestion patterns on urban road segments. It is not a benchmark dataset like PeMS or NGSIM but a large-scale, API-collected traffic dataset. The preprocessing pipeline involved label encoding of categorical features, standardization of numerical values, and time feature engineering from the system timestamp. Temporal features such as the current time are converted to a numerical representation (for example, hour of the day and day of the week), while categorical features such as rare conditions are label encoded before model training. Continuous features (for example, travel time and distance) are standardized using fold-wise normalization to avoid data leakage. To maintain the class distribution, the data were divided into training and testing subsets using stratified sampling. To ensure strong generalization, the model is trained with early stopping, adaptive learning rate reduction, and checkpointing. The dataset includes 317,112 traffic samples with varied dates, origin-destination pairs, route distances and times, weather, holiday flags, special conditions, and detailed temporal features. This variability enables the model to be trained on a wide range of traffic patterns in a realistic IoV setting. Table 2 shows the dataset class distribution.
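The interaction between class-preserving splits, fold-wise normalization, and class weighting can be sketched in pure Python. Several details are assumptions rather than the paper's stated procedure: the helper names are hypothetical, the class-weight formula is the common "balanced" heuristic n_samples / (n_classes × class_count), and the toy example uses k = 2 for brevity.

```python
from collections import Counter, defaultdict
import random
import statistics


def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)     # deal indices round-robin across folds
    return folds


def balanced_class_weights(train_labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(train_labels)
    n, c = len(train_labels), len(counts)
    return {cls: n / (c * cnt) for cls, cnt in counts.items()}


def fit_standardizer(values):
    """Return (mean, std) estimated on the training values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0   # guard against zero variance
    return mean, std


# Toy data (hypothetical): travel times and imbalanced congestion labels.
travel_time = [float(v) for v in range(20)]
labels = [0] * 12 + [1] * 6 + [2] * 2
folds = stratified_folds(labels, k=2)

for held_out in range(len(folds)):
    val_idx = folds[held_out]
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    mean, std = fit_standardizer([travel_time[i] for i in train_idx])
    train_x = [(travel_time[i] - mean) / std for i in train_idx]
    val_x = [(travel_time[i] - mean) / std for i in val_idx]   # reuse train statistics
    weights = balanced_class_weights([labels[i] for i in train_idx])
```

Fitting the mean and standard deviation on the training indices only, and reusing them for the held-out fold, is what keeps validation statistics from leaking into training.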
Dataset Class Distribution.
The dataset size is 945,673 records and consists of nine columns. System time is the current machine time. The details of both input and output variables are presented in Table 3. A hybrid deep learning framework was developed by combining the TabTransformer module for categorical variables and the BiLSTM module for temporal and numerical features. The categorical attributes included weekday, destination, fastest route name, holiday, special condition, starting location, and weather condition, while the numerical attributes consisted of fastest route distance, fastest route time, and system time (decomposed into hour, minute, and second). Together, these formed a dataset comprising nine input features and a single output feature indicating the congestion state, classified into free-flow, moderately congested, and heavily congested categories (https://doi.org/10.1371/journal.pone.0238200.s003).
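The decomposition of system time and the label encoding of categorical attributes can be sketched as follows. The timestamp format, helper names, and sample values are assumptions for illustration; they are not taken from the dataset.

```python
from datetime import datetime


def decompose_system_time(stamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a timestamp string into hour, minute, second and weekday (Monday = 0)."""
    dt = datetime.strptime(stamp, fmt)
    return {"hour": dt.hour, "minute": dt.minute,
            "second": dt.second, "weekday": dt.weekday()}


def fit_label_encoder(values):
    """Map each distinct category to an integer, in sorted order for determinism."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}


# Hypothetical records; the field values are illustrative only.
weather = ["Clear", "Rain", "Clear", "Fog"]
encoder = fit_label_encoder(weather)
encoded = [encoder[w] for w in weather]
parts = decompose_system_time("2019-11-04 08:15:30")

print(encoded)  # [0, 2, 0, 1]
print(parts)    # {'hour': 8, 'minute': 15, 'second': 30, 'weekday': 0}
```

Storing the fitted `encoder` mapping alongside the model allows the same integer codes to be reproduced at inference time.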
Input and Output Variables.
Accuracy, precision, recall, F1-score, and confusion matrix analysis are utilized for performance evaluation, whereas the training data were examined through loss and accuracy curves. The experimental results prove the efficiency of the hybrid TabTransformer–BiLSTM architecture in capturing both categorical dependencies and temporal patterns in traffic flow. The proposed hybrid deep learning model showed improved accuracy and resilience compared to traditional machine learning techniques such as the Extra Trees Classifier (Sharaff & Gupta, 2019), making it ideal for real-world traffic flow prediction applications.
For reproducibility and future deployment, all fitted artifacts, such as scalers, encoders, and categorical mappings, were stored.
Table 4 lists the hyperparameters used in the experiments.
Parameters.
Figure 4 shows the flowchart of the proposed Hybrid TabTransformer–BiLSTM model for traffic flow prediction. The framework integrates categorical feature embeddings processed through a TabTransformer module with temporal and numerical features modeled using a BiLSTM network. The fused representations are passed through dense layers to classify traffic states as free-flow, moderately congested, or heavily congested. The preprocessing, training, evaluation, and artifact-saving pipeline is also illustrated.

Flowchart of the proposed Hybrid TabTransformer–BiLSTM model for traffic flow prediction.
Table 5 shows the prediction accuracy of different machine-learning & deep learning methods.
Accuracy Score of Different Learning Models.
K-fold cross-validation has been employed as the primary evaluation methodology in this study. This approach systematically partitions the dataset into k equally sized subsets, or folds, where each fold is iteratively used as a validation set while the remaining folds are utilized for training. Such a procedure provides a more reliable estimate of model generalization compared to a simple train–test split (Hastie et al., 2009). The choice of the fold parameter plays a significant role in balancing bias and variance in performance estimation (Bishop, 2006). In the present work, a 5-fold cross-validation scheme has been adopted to rigorously evaluate and compare the performance of multiple deep learning models. To evaluate traffic congestion prediction in smart city environments, several statistical performance indicators are used. These include Mean Square Error (MSE), Root Mean Square Error (RMSE), Coefficient of Determination (R2), Mean Absolute Percentage Error (MAPE), and Mean Absolute Error (MAE) (Chai & Draxler, 2014). These performance indicators help to compare the accuracy of actual traffic observations pj with the predicted traffic values MSE: MSE indicates the mean of the squared discrepancies between the actual traffic congestion figures and the forecasted values.
RMSE: RMSE is calculated as the square root of the Mean Square Error (MSE) and provides the error in the same scale as the original traffic data.
R2: R2 measures how well the prediction model explains the variability in the actual traffic values. A value closer to 1 indicates a better fit and better predictive performance.
MAPE: MAPE expresses prediction error as a percentage relative to actual traffic values.
MAE: MAE calculates the average of the absolute differences between the predicted and actual values.
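For reference, the five indicators described above can be written in their standard textbook forms. This is a sketch under stated assumptions: the text only names the actual observation pj, so the symbol \hat{p}_j for the predicted value, n for the number of samples, and \bar{p} for the mean of the actual values are notation introduced here for illustration.

```latex
\begin{align*}
\mathrm{MSE}  &= \frac{1}{n}\sum_{j=1}^{n}\left(p_j-\hat{p}_j\right)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
R^2           &= 1-\frac{\sum_{j=1}^{n}\left(p_j-\hat{p}_j\right)^2}{\sum_{j=1}^{n}\left(p_j-\bar{p}\right)^2},
                 \qquad \bar{p}=\frac{1}{n}\sum_{j=1}^{n}p_j \\
\mathrm{MAPE} &= \frac{100\%}{n}\sum_{j=1}^{n}\left|\frac{p_j-\hat{p}_j}{p_j}\right| \\
\mathrm{MAE}  &= \frac{1}{n}\sum_{j=1}^{n}\left|p_j-\hat{p}_j\right|
\end{align*}
```

Note that MAPE is undefined whenever an actual value pj equals zero, so such samples would need to be excluded or handled separately.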
Table 6 shows the validation against modern deep learning models. Figure 5 shows MAE, RMSE & R2 comparison across different models.

(a) MAE, (b) RMSE & (c) R2 comparison across different models.
Performance Comparison of Proposed Model with Existing Techniques.
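The 5-fold evaluation protocol behind these comparisons, combined with the per-fold class weighting mentioned in the abstract, can be sketched in plain Python as follows. This is a minimal illustration and not the authors' implementation: the function names, the round-robin stratification, the "balanced" weighting formula n_samples / (n_classes * class_count), and the toy labels are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets a share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def cross_validate(labels, k=5, seed=0):
    """Yield (train_idx, val_idx, class_weights) for each of the k folds."""
    folds = stratified_folds(labels, k, seed)
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        # Weights are recomputed from the training portion of each fold only.
        weights = balanced_class_weights([labels[j] for j in train_idx])
        yield train_idx, val_idx, weights


if __name__ == "__main__":
    # Hypothetical imbalanced labels: 0 = free-flow, 1 = moderate, 2 = heavy.
    y = [0] * 60 + [1] * 30 + [2] * 10
    for fold, (train, val, w) in enumerate(cross_validate(y), start=1):
        rounded = {c: round(v, 2) for c, v in sorted(w.items())}
        print(fold, len(train), len(val), rounded)
```

Computing the weights inside each fold, from the training indices only, keeps the validation fold from influencing how the classes are weighted during training.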
The proposed Bi-LSTM and TabTransformer technique achieves a higher accuracy score than the other classifiers. Equations (4.1)–(4.4) define the metrics used to measure the performance of the proposed model.
Figures 6–11 show the performance comparison in terms of accuracy, precision, recall, F1-score, training vs. validation accuracy, and training vs. validation loss.

Comparison of classification accuracy across different models.

Precision comparison of the evaluated models.

Recall comparison among the evaluated models.

F1-score comparison of different models.

Training vs. validation accuracy across different modules.

Training vs. validation loss across different modules.
Table 7 shows performance comparison of benchmark deep learning models with the proposed Bi-LSTM + TabTransformer Framework.
Recall, Precision, and F1-Score of Different Deep Learning Models and the Proposed Bi-LSTM + TabTransformer Framework.
The confusion matrix is a powerful tool for analyzing classification problems (Mitchell, 1999). It shows how the samples belonging to each actual class are distributed across the predicted classes. The normalized confusion matrices for the traffic congestion classes are shown in Figure 12.
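As an illustration of how a normalized confusion matrix is obtained, the sketch below counts predictions per actual class and then divides each row by its total. It assumes row-wise normalization (each row sums to 1, so the diagonal holds per-class recall), which the text does not specify; the function names and the toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix: rows are actual classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total so entries are per-class fractions."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # An empty class (row total of zero) is left as all zeros.
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


if __name__ == "__main__":
    # Hypothetical labels: 0 = free-flow, 1 = moderate, 2 = heavy congestion.
    y_true = [0, 0, 0, 0, 1, 1, 2, 2]
    y_pred = [0, 0, 0, 1, 1, 2, 2, 2]
    for row in normalize_rows(confusion_matrix(y_true, y_pred, 3)):
        print(row)
```

Row normalization makes classes of very different sizes directly comparable, which matters when the congestion classes are imbalanced.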
The proposed framework thus provides higher accuracy than the previously proposed methods.
To evaluate the scalability of the proposed BiLSTM-TabTransformer framework, we analyze its computational complexity, parameterization, and scalability.
Model Complexity
The proposed framework has two major parts: (i) a Bidirectional Long Short-Term Memory (BiLSTM) module for modeling temporal patterns and (ii) a TabTransformer module for modeling contextual features. The computational complexity of the BiLSTM module with input sequence length T and hidden size H is given by:
The TabTransformer module uses multi-head self-attention on structured feature embedding. The attention computation has a complexity of:
The overall complexity of the proposed framework is therefore given by:
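As a hedged sketch, the three expressions referred to above take the following standard forms under the usual cost models for recurrent and self-attention layers. T and H are as defined in the text; D (input feature dimension per time step), F (number of structured feature tokens), d (embedding dimension), and L (number of Transformer layers) are symbols introduced here for illustration and are assumptions, not values taken from the study.

```latex
\begin{align*}
\text{BiLSTM:} \quad & O\!\left(T\,(H D + H^{2})\right) \\
\text{TabTransformer self-attention:} \quad & O\!\left(L\,F^{2}\,d\right) \\
\text{Overall:} \quad & O\!\left(T\,(H D + H^{2}) + L\,F^{2}\,d\right)
\end{align*}
```

The constant factors for the two LSTM directions and the four gates are absorbed into the big-O notation. The overall cost is linear in the sequence length T and quadratic in the number of structured features F.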
Training Efficiency
The model is trained using:
• 50 epochs (with early stopping),
• Batch size of 32,
• Learning rate scheduling
• Dropout and L2 regularization.
Early stopping prevents unnecessary epochs once convergence is achieved, thereby limiting computational overhead. In practice, convergence typically occurs well before the maximum epoch threshold.
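The early-stopping behaviour described above can be sketched in plain Python as follows. This is a minimal illustration rather than the training code used in the study: the patience value, the min_delta threshold, and the list of validation losses standing in for real training epochs are all assumptions.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # hypothetical value, not from the paper
        self.min_delta = min_delta    # minimum decrease that counts as progress
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def train(val_losses, max_epochs=50, patience=5):
    """Run up to max_epochs; val_losses stands in for per-epoch validation."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.should_stop(loss):
            return epoch, stopper.best_loss
    return min(len(val_losses), max_epochs), stopper.best_loss


if __name__ == "__main__":
    # Simulated losses that plateau after the third epoch.
    losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73] + [0.75] * 44
    print(train(losses, max_epochs=50, patience=3))
```

With these simulated losses and a patience of 3, training halts at epoch 6 instead of running all 50 epochs.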
Inference Latency and Deployment Feasibility
Unlike transformer-based models applied in natural language processing tasks, the TabTransformer model in this context is applied to structured tabular data of moderate size. Because the number of structured features is fixed and small, the attention memory footprint per sample is constant. The inference cost is linear in sequence length and quadratic in the number of structured features. This makes the model amenable to deployment in centralized IoV traffic management servers and near real-time prediction systems. Nevertheless, deployment at the edge on highly resource-constrained devices could necessitate the development of lightweight models, which is a direction for future research.
Scalability Aspects
Since the dataset has hundreds of thousands of entries and the model has shown stable convergence during k-fold cross-validation, the framework is scalable in terms of data size and training convergence. Further optimization techniques such as pruning, distillation, or lightweight attention mechanisms may be explored to improve computational scalability in the context of smart city infrastructure.
The proposed BiLSTM-TabTransformer framework has been shown to have application value in IoV-enabled intelligent transportation systems, as it enables accurate traffic congestion prediction by modeling temporal dynamics and contextual information. However, the proposed framework is currently validated on a single geographic region, and its performance on multiple regions is a future research direction. Moreover, computational optimization may be required for large-scale or edge-level deployment scenarios.
Conclusion & Future Work
The IoV consists of several interconnected vehicle nodes and supporting infrastructure. Because these components communicate seamlessly, vehicles can access real-time traffic data, which can help reduce traffic congestion and improve mobility.
This work explored the use of a hybrid Bi-LSTM and TabTransformer-based framework to predict traffic congestion states with higher accuracy. The proposed framework captures both temporal traffic characteristics and complex feature relationships within diverse traffic data, resulting in more reliable and interpretable congestion forecasting. The experimental results confirm that the fusion of Bi-LSTM and TabTransformer attains considerably improved performance compared to other approaches, reaching 99.52% prediction accuracy. The results indicate that BiLSTM with TabTransformer can handle complex, non-linear traffic patterns, making it a promising solution.
The dataset was generated from Google Maps API traffic information in a single geographical setting, which could introduce some platform-specific characteristics. Furthermore, while the accuracy of the model is high, the transformer components have moderate computational complexity that may need optimization for real-time processing.
Although the proposed BiLSTM-TabTransformer framework shows robust predictive capability, future studies will focus on lightweight variants of the transformer architecture to further enhance computational efficiency for edge-level deployment in IoV settings. Moreover, federated learning frameworks can be explored to improve privacy preservation by allowing models to be trained in a decentralized manner without aggregating data centrally. Adaptive optimization techniques can also be explored to improve long-term scalability in intelligent transportation systems with dynamic traffic patterns. Intelligent feature selection methods can likewise enhance computational efficiency and generalization ability. Recent metaheuristic optimization techniques, such as Binary Waterwheel Plant Optimization for feature selection (Alhussan et al., 2023), offer an efficient way to remove redundant features.
The use of optimization-driven dimensionality reduction before the BiLSTM-TabTransformer fusion is a promising direction for efficient IoV traffic prediction. The work can also be extended by applying other robust deep learning techniques along with optimization algorithms that can more effectively identify congestion states. In our current approach, feature fusion is carried out by concatenating the temporal (BiLSTM) and contextual (tabular) features. More expressive fusion approaches, such as gated fusion, cross-attention-based fusion, or adaptive weighting schemes, could further improve the modeling of interactions between temporal and contextual embeddings. Future studies will explore these more advanced fusion models. Finally, the dataset used for this study was collected from a particular urban region using the Google Maps API, and traffic characteristics may differ between cities depending on road infrastructure, regulations, and driving habits. Hence, further evaluation on datasets from different cities is necessary.
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
