Abstract
As traffic cameras become prevalent, the automatic analysis of traffic scenes presents new opportunities and challenges. Advances in deep learning allow for automated characterization of traffic in such videos. This work aims to understand traffic flow without human supervision, focusing on the localization of road intersections. For this purpose, a three-stage method is proposed that uses a deep neural network for vehicle detection, an object tracker to recover vehicle trajectories, and unsupervised machine learning to detect potential incoming and outgoing traffic flows. The approach has been tested on a variety of real and synthetic videos, with satisfactory results across different camera positions, traffic patterns, and weather conditions. As a key part of the methodology, five options for clustering starting and ending track points were tested. These options included a basic strategy based on predefined spatially localized clusters, and the K-means algorithm with two methods to determine the optimal number of clusters: the Elbow method and the Silhouette score. Additionally, Mean Shift and the Density-Based Spatial Clustering of Applications with Noise were evaluated. An exhaustive analysis of the proposed clustering methods was conducted, including runtime at each stage, performance metrics, and the addition of noise to simulate tracker failures. The results demonstrated the feasibility of the proposed methodology and concluded that Mean Shift is the most suitable clustering method due to its balance of high performance, low runtime, and stable behavior against abnormal trajectory points.
Introduction
Technological improvements in video cameras, massive and fast computer storage, and network connectivity have enabled the deployment of ever larger numbers of traffic cameras for video surveillance, monitoring, and data collection, both on linear road segments and, more typically, at intersections and roundabouts.1–3 Systems for traffic surveillance are deployed for a range of goals, including recording and monitoring how vehicles use and share the road network (to identify antisocial and/or dangerous behaviors), detecting and predicting congestion, collisions, and other traffic-altering anomalies, and collecting statistics about traffic flow. Research in traffic surveillance can be classified into various areas, such as vehicle detection,4,5 scene analysis,6,7 traffic tracking8–10 and monitoring,11,12 detection 13 and management of emergency situations,14,15 detection of traffic events,16,17 and detection of empty parking slots. 18
Artificial intelligence, and especially deep learning, is a technology that is currently accelerating the automation of many kinds of tasks across many disciplines and industries, especially processes that require unsupervised, automated detection with reduced labeling, such as detection of screen printing defects 19 and lightning prediction. 20 In the context of traffic surveillance, the application of deep neural networks, mainly convolutional neural networks (CNNs), to the large (and growing) amount of traffic video data enables the implementation of automated intelligent systems to monitor and predict traffic flows.
In many cases, the design of intelligent systems for traffic applications is relatively straightforward for video feeds from linear road segments, but becomes significantly more challenging in urban scenarios, such as traffic intersections and roundabouts. Vehicle detection and tracking are more difficult in these scenarios, since the vehicles have more complex spatio-temporal trajectories, with frequent accelerations and slowdowns, waiting periods to cross intersections or at traffic lights, and frequent lane changes and turns. In these urban settings, vehicles are also frequently hidden from the camera viewpoint by other vehicles, trees or buildings, either partially or completely, making traffic tracking even more challenging. 1
Recently, some researchers have applied deep learning to overcome specific difficulties in traffic analysis at intersections. For example, Abdeljaber et al. 21 developed a CNN-based tool for the automatic extraction of traffic trajectories at road intersections, and Tak et al. 22 applied YOLOv4 (a deep learning model) for vehicle detection with a camera installed at an intersection. Other researchers have applied clustering to whole traffic trajectories for various tasks, such as the classification of the different types of trajectories in an intersection, 23 the inference of traffic rules at an intersection from GPS tracks, 24 or tracking of vehicles through multiple traffic cameras with overlapping fields of view. 25
In the context of automatic video analysis at intersections, some recent work has been published. Pan et al. 26 introduced the Physics-Guided Spatio-Temporal Graph Neural Network (PG-STGNN) framework, designed to capture complex spatio-temporal dependencies in traffic networks for traffic flow prediction, i.e., flow count. A multi-intersection-aware traffic flow prognostication architecture was proposed by Shen et al., 27 which employs recent information from nearby roads using a relevance vector machine (RVM). Jakubec et al. 28 implemented a YOLO-based framework to automatically analyze traffic flow, including speed and vehicle gaps, within the observed intersection area. Some approaches chain several tasks. For instance, Tang et al. 29 developed a three-stage system: detection with YOLOv5, tracking with the MS-SORT model, and, finally, integration of trajectories with a re-identification (ReID) method based on the ResNet-50 architecture to match trajectories across different surveillance videos, with the aim of analyzing intersection traffic conditions. The work of Azimjonov et al. 30 presents a vision-based, real-time traffic-monitoring system for collecting statistics on vehicle movements at intersections. The object-tracker and data-association algorithms use bounding-box properties to estimate vehicle trajectories, aiming to compute vehicle counts and instantaneous and average speeds. Song et al. 31 proposed a K-means trajectory clustering method based on NURBS curve fitting to obtain traffic flow parameters. The B-spline quadratic interpolation function is used to fit a smooth NURBS curve to the vehicle trajectory, and the K-means clustering algorithm measures the minimum distance, using the first and last endpoints to automatically divide the intersection area to count the vehicle flow.
In Table 1, the main characteristics of the reported studies are summarized. The related works, which focus on intersection video analysis, share some similarities with the present study, such as the implementation of a three-stage method that includes detection and tracking. However, they differ in the final and crucial step, which defines the purpose of the system. While previous studies have primarily focused on predicting vehicle flow, analyzing intersection conditions, and collecting vehicle movement statistics (mostly using private datasets), this work focuses on identifying the entry and exit points at intersections by clustering vehicle track points. Some of these studies also incorporate unsupervised approaches, which, alongside the reported results, demonstrate the viability of applying these methods to the addressed problem. Additionally, while many studies are limited to specific locations, the proposed method has been designed for general intersection locations and conditions using open-access datasets. Considering the drawbacks and gaps identified in the literature, the proposed methodology offers an innovative, reproducible approach that, as far as the authors know, has no comparable framework.
Summary of studies on automatic video analysis at intersections.
The methodology presented in this paper is intended to facilitate automated analysis of traffic trajectories at intersections with traffic surveillance cameras by applying deep learning-based methods to reduce human supervision. The proposed three-stage method automatically detects points of traffic inflow and outflow at the borders of the scene watched by a video camera. The first stage involves detecting vehicle positions in each video frame using the YOLOv5x6 model. At the second stage, vehicle trajectories are reconstructed using the Norfair tracking method. Finally, the key task is addressed by applying unsupervised clustering to these trajectories to automatically identify the traffic flows entering and exiting the intersection.
In turn, this enables the analysis of road intersection videos without the need for manual labeling of the roads and traffic lanes visible in the intersection. Further analysis of these videos (not addressed in this work) may be enabled by the automated detection of these points of traffic inflows and outflows.
While this work is focused on unsupervised detection of points of entry and departure of traffic flows in video sequences, we have previously published work on other aspects of the processing of traffic videos, from basic detection and tracking of vehicles 32 to unsupervised clustering of vehicle types 33 and detection of anomalous vehicle trajectories. 34
The rest of this work is organized as follows: Section 2 provides a detailed description of the methodology used for detecting potential incoming and outgoing traffic flows in traffic videos at intersections. Section 3 describes the experimental settings and the datasets used to run the experiments and assess the performance of the proposed method. Finally, conclusions are drawn in Section 4.
The following method is proposed for identifying vehicle entry and exit points in traffic videos of an intersection. Initially, an object detection deep neural network, denoted as
Next, the output from the object detection network is fed into a tracking method,
At the start of the video, there are no tracked objects, so the set is initially empty:
The set
Then, the set
Subsequently, the clustering method is applied for clustering the set

Determining the number of clusters using the elbow method for the video Seq1_SK_1 processed with YOLOv5x6. For each possible number of clusters
Let us denote the value chosen by the Elbow method as
The set
When vehicle detection and tracking maintain sufficiently low error rates, the cluster centroids in
This section presents the results obtained by applying the previously described methodology in a series of experiments with various videos of road intersections from the perspective of a traffic camera.
Methods
The system is implemented using OpenCV to read image frames from recorded videos or, if available, to retrieve a live stream from a traffic camera. For vehicle detection, the deep learning network YOLOv5 is used. Specifically, of the range of publicly available sub-models, YOLOv5x6 is used. This sub-model is trained on the COCO dataset 35 and achieves an mAP@0.5 of 72.7. 36 Objects of COCO classes car, motorcycle, and truck are considered to be vehicle detections. Any other detections are discarded. Vehicle detections are fed to Norfair, a publicly available state-of-the-art object tracker with standard assignment heuristics and Kalman filters. 37 Default parameters are used for these components. It should be noted that this method is best suited when the point of view of the camera is high enough to minimize vehicle-vehicle occlusions that might result in tracking failures.
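The class filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the YOLOv5 or Norfair API: the `Detection` tuple and the `vehicle_centers` function are hypothetical names introduced here, and each detection is assumed to be reduced to its bounding-box center in pixel coordinates before tracking.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """Hypothetical container for one detection (not the YOLOv5 output format)."""
    label: str   # COCO class name
    x1: float    # left edge of the bounding box (pixels)
    y1: float    # top edge
    x2: float    # right edge
    y2: float    # bottom edge


# COCO classes that are treated as vehicles in the text above.
VEHICLE_CLASSES = {"car", "motorcycle", "truck"}


def vehicle_centers(detections: List[Detection]) -> List[Tuple[float, float]]:
    """Keep vehicle detections only and return their box centers in pixels."""
    centers = []
    for det in detections:
        if det.label not in VEHICLE_CLASSES:
            continue  # any other detection is discarded
        centers.append(((det.x1 + det.x2) / 2.0, (det.y1 + det.y2) / 2.0))
    return centers


# Illustrative detections for a single frame (made-up values).
frame_detections = [
    Detection("car", 10, 20, 50, 60),
    Detection("person", 0, 0, 10, 30),
    Detection("truck", 100, 100, 200, 180),
]
centers = vehicle_centers(frame_detections)
```

In this example the `person` detection is dropped and the two remaining boxes are reduced to their centers, which is the form of input a point-based tracker would consume.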
For each vehicle trajectory, the starting and ending points are retrieved and clustered using three state-of-the-art clustering algorithms: K-means, 38 Mean Shift, 39,40 and Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise (DBSCAN), 41 employing the
In the case of the K-means algorithm, 41 two methods are used to estimate the optimal number of clusters: the Elbow method and the Silhouette score. The Elbow method consists of repeatedly computing the clustering with different numbers of clusters, measuring clustering performance (see Section 2 for details), and selecting the point of maximum curvature in the clustering metric. In contrast, the Silhouette score is a distance-based criterion that selects the number of clusters that maximizes the Silhouette coefficient, a normalized difference between the intra-cluster distance and the nearest inter-cluster distance. Please note that the other two clustering algorithms (Mean Shift and DBSCAN) do not require specifying the number of clusters.
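The two selection criteria can be illustrated with a short, self-contained sketch. It makes two assumptions that are not taken from this work: the Silhouette coefficient is computed with Euclidean distances in pixel space, and the "point of maximum curvature" is approximated by the largest discrete second difference of the metric curve, which is only one common heuristic. The function names and the sample values are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def silhouette_score(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Mean Silhouette coefficient over all points (Euclidean distance)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("the Silhouette score needs at least two clusters")
    total = 0.0
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster, excluding p.
        mean_dist = {}
        for c in clusters:
            dists = [math.dist(p, q) for j, q in enumerate(points)
                     if labels[j] == c and j != i]
            mean_dist[c] = sum(dists) / len(dists) if dists else None
        a = mean_dist[labels[i]]            # intra-cluster distance
        if a is None:
            continue                        # singleton cluster contributes 0
        b = min(d for c, d in mean_dist.items() if c != labels[i])
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)


def elbow_index(values: Sequence[float]) -> int:
    """Index of the largest discrete second difference (curvature proxy)."""
    if len(values) < 3:
        raise ValueError("need at least three values to locate an elbow")
    curvature = [values[i - 1] - 2 * values[i] + values[i + 1]
                 for i in range(1, len(values) - 1)]
    return 1 + max(range(len(curvature)), key=curvature.__getitem__)


# Two well-separated groups of points (made-up pixel coordinates).
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(pts, [0, 0, 1, 1])   # natural grouping
bad = silhouette_score(pts, [0, 1, 0, 1])    # groups mixed up

# Made-up clustering metric for K = 1..5; the elbow is expected at K = 3.
ks = [1, 2, 3, 4, 5]
metric = [1000.0, 900.0, 200.0, 180.0, 170.0]
k_elbow = ks[elbow_index(metric)]
```

The natural grouping scores close to 1 while the mixed grouping scores below 0, which is the behavior the Silhouette criterion relies on when ranking candidate numbers of clusters.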
The Mean Shift clustering algorithm is an iterative method aiming to discover “blobs” in a smooth density of samples. 40 This algorithm identifies clusters of arbitrary shape and variable size without requiring a priori specification of the number of clusters. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region by using a kernel function that determines the weight of nearby points for re-estimation of the mean. The implementation used in this work is from the Scikit-learn package for Python. 43
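The update rule described above can be sketched with a flat kernel, where every point inside a window of radius `bandwidth` has weight 1 and every other point has weight 0. This is a simplified illustration written for this text, not the Scikit-learn implementation; the function name, the merging rule for nearby modes, and the sample coordinates are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def mean_shift(points: Sequence[Point], bandwidth: float,
               max_iter: int = 100, tol: float = 1e-3) -> List[Point]:
    """Flat-kernel Mean Shift; returns the cluster centers (modes) found."""
    modes: List[Point] = []
    for seed in points:
        center = seed
        for _ in range(max_iter):
            # Points inside the window around the current candidate centroid.
            window = [p for p in points if math.dist(p, center) <= bandwidth]
            new_center = (sum(p[0] for p in window) / len(window),
                          sum(p[1] for p in window) / len(window))
            shift = math.dist(new_center, center)
            center = new_center
            if shift < tol:
                break  # the candidate has stopped moving
        # Keep the mode only if it is not within one bandwidth of a stored one.
        if all(math.dist(center, m) > bandwidth for m in modes):
            modes.append(center)
    return modes


# Two compact groups of track endpoints (made-up pixel coordinates).
endpoints = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
             (10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]
modes = mean_shift(endpoints, bandwidth=3.0)
```

Note that the number of clusters is never passed in: it emerges from the number of distinct modes, and only the bandwidth has to be chosen.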
Finally, the Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, or DBSCAN, is a clustering method that relies on a density-based notion of clusters and is designed to discover clusters of arbitrary shape. 41
The DBSCAN algorithm has 2 hyperparameters to be adjusted in order to get significant results. The first one is the
The goal of this clustering stage is to obtain cluster centroids that mark the positions of roads and/or road lanes at a road intersection. The aim of using all these methods—Quadrant-based, K-means (Elbow), K-means (Silhouette), Mean Shift, and DBSCAN—is to compare clustering performance and to validate the three-stage methodology proposed using several strategies. The overall proposed methodology is illustrated in Figure 2.

Methodology pipeline.
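The hand-off between the tracking and clustering stages can be sketched as follows. The sketch assumes that the tracker output has already been flattened into `(frame_index, track_id, center)` tuples; this format and the `track_endpoints` function are hypothetical and are not the Norfair API.

```python
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]
Observation = Tuple[int, int, Point]  # (frame_index, track_id, center)


def track_endpoints(observations: Iterable[Observation]
                    ) -> Tuple[List[Point], List[Point]]:
    """Return the first and last observed position of every track."""
    first: Dict[int, Tuple[int, Point]] = {}
    last: Dict[int, Tuple[int, Point]] = {}
    for frame, track_id, center in observations:
        # Earliest observation of the track is its starting point.
        if track_id not in first or frame < first[track_id][0]:
            first[track_id] = (frame, center)
        # Latest observation of the track is its ending point.
        if track_id not in last or frame >= last[track_id][0]:
            last[track_id] = (frame, center)
    track_ids = sorted(first)
    starts = [first[t][1] for t in track_ids]
    ends = [last[t][1] for t in track_ids]
    return starts, ends


# Two short tracks over three frames (made-up pixel coordinates).
obs = [
    (0, 1, (5.0, 5.0)),
    (1, 1, (6.0, 6.0)),
    (1, 2, (100.0, 50.0)),
    (2, 1, (7.0, 8.0)),
    (2, 2, (90.0, 55.0)),
]
starts, ends = track_endpoints(obs)
```

The resulting points, all in pixel space, are what the clustering stage receives as input.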
Please note that all values described in this pipeline (also in the formal description in Section 2) are in pixel space (from vehicle bounding boxes to bounding boxes’ centers, tracking and clustering of starting and ending points), without translating them to non-distorted geometric coordinates over the road plane. This means that, depending on the lens geometry and the camera pose, pixel distances of similar magnitude in different parts of the scene will actually represent very different real distances over the road plane. This will influence clustering results, as portions of the road that are far away from the camera (or in greatly compressed areas of the image plane) will naturally present different clustering from similar areas that are near the camera. However, this work is mostly concerned with unsupervised detection of entry/exit points in traffic scenes in qualitative terms. For this, clustering in the pixel space presents reasonably good results. Furthermore, detecting both arbitrary camera poses (relative to the road) and arbitrary lens geometries, and undoing them effectively, consistently and without errors, is very difficult unless using supervised methods, like the one used by García-González et al. 44 In contrast, working in pixel space can be effective even when measuring relative distances and velocities between trajectories, as shown by Fernández-Rodríguez et al. 34 For these reasons, and noting that we are more concerned with qualitative results about entry/exit points in traffic scenes, in our methodology we use coordinates in pixel space.
To evaluate the proposed methods, a diverse set of videos from several well-established open-access datasets has been utilized. A total of 14 videos whose scenes meet the conditions for which the method was designed have been chosen from three state-of-the-art datasets, each serving a different purpose. To validate the proposed method in bad weather conditions, seven videos from the AAU RainSnow Traffic Surveillance Dataset 45 were chosen: Hadsundvej-1, Hadsundvej-2, Hasserisvej-1, Hasserisvej-2, Hasserisvej-3, Hjorringvej-2, and Ostre-3. For each intersection, all videos are from the same camera, but were captured at different times and under varying weather conditions, including rain and snow.
To test the proposed method with different viewing angles of the same vehicles, another six videos were taken from the Ko-PER Intersection dataset 46 : Seq1_SK_1, Seq1_SK_4, Seq2_SK_1, Seq2_SK_4, Seq3_SK_1, and Seq3_SK_4. Of these, videos whose names differ only in the ending number were recorded at the same time in the same intersection, but from cameras in different places and orientations. All videos depict four-way intersections, but some recordings do not include incoming/outgoing vehicles in one of the four ways.
Finally, to check the behavior of the method outside its intended scope of road intersections, a video of a non-branching road segment was added: the video titled Highway from the 2014 CDNET dataset. 47 Additionally, three videos were generated using CARLA, 48 each showcasing a different perspective of the same three-way intersection. These synthetic videos were designed to feature scenarios such as speeding, tailgating, and unnecessary lane changes amid heavy traffic, in order to test the robustness of the proposed method when applied to videos depicting vehicles engaged in hazardous, unconventional behaviors.
Performance metrics
Given that our study explores five distinct clustering methods (Quadrant-based, K-means (Elbow), K-means (Silhouette), Mean Shift, and DBSCAN), it is essential to provide a thorough and structured presentation of the experimental results to adequately capture the unique characteristics and performance of each approach. Clustering algorithms often exhibit significant differences in how they handle data, particularly in terms of how they define clusters, their sensitivity to parameters, and their ability to manage noise or outliers. These variations are likely to influence the outcomes of our experiments, and as such, warrant individual attention and detailed analysis.
Given that the key part of the methodology proposed relies on the clustering task, it is essential to consider the clustering performance measures obtained from the underlying clustering processes. These measures provide critical insights into the quality and effectiveness of the clustering methods used, helping to validate and interpret the results. By evaluating the performance of the clustering processes, we can better assess the reliability and significance of the conclusions drawn from the data. Therefore, several well-known clustering performance measures have been considered:
Mean Squared Error (MSE) quantifies the compactness of clusters by measuring the average squared distance between points and their respective centroids, with lower values indicating tighter, more coherent clusters. The Davies-Bouldin Index (DBI) 49 assesses the separation between clusters by considering both intra-cluster similarity and inter-cluster differences, with lower values signaling better-defined clusters. The Calinski-Harabasz index 49 measures the ratio of between-cluster dispersion to within-cluster dispersion, where higher values indicate better clustering structure. The Silhouette score 50 gauges the separation between clusters by comparing the average intra-cluster distance with the nearest inter-cluster distance, providing insight into how well clusters are formed. The higher the score, the better the clustering.
Together, these performance measures offer a comprehensive understanding of clustering quality, enabling more robust and reliable conclusions in any clustering study. Furthermore, the required runtimes have been recorded in order to take computation time into account as a key constraint for validating and supporting the methodology in real-world scenarios.
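To make three of these measures concrete, the sketch below computes MSE, DBI, and the Calinski-Harabasz index from their textbook definitions. It assumes Euclidean distances in pixel space and takes each centroid to be the mean of its cluster's points, which may differ from the exact conventions used in the experiments; all function names and sample values are introduced here for illustration only.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def _groups(points: Sequence[Point], labels: Sequence[int]) -> Dict[int, List[Point]]:
    """Group points by cluster label."""
    groups: Dict[int, List[Point]] = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return groups


def _centroid(pts: Sequence[Point]) -> Point:
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)


def mse(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Average squared distance between each point and its cluster centroid."""
    groups = _groups(points, labels)
    total = sum(math.dist(p, _centroid(pts)) ** 2
                for pts in groups.values() for p in pts)
    return total / len(points)


def davies_bouldin(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Mean, over clusters, of the worst (scatter_i + scatter_j) / separation."""
    groups = _groups(points, labels)
    cents = {c: _centroid(pts) for c, pts in groups.items()}
    scatter = {c: sum(math.dist(p, cents[c]) for p in pts) / len(pts)
               for c, pts in groups.items()}
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in groups if j != i)
             for i in groups]
    return sum(worst) / len(worst)


def calinski_harabasz(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Ratio of between-cluster to within-cluster dispersion."""
    groups = _groups(points, labels)
    n, k = len(points), len(groups)
    overall = _centroid(points)
    between = sum(len(pts) * math.dist(_centroid(pts), overall) ** 2
                  for pts in groups.values())
    within = sum(math.dist(p, _centroid(pts)) ** 2
                 for pts in groups.values() for p in pts)
    return (between / (k - 1)) / (within / (n - k))


# Two clusters of two points each (made-up pixel coordinates).
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
mse_value = mse(pts, labels)
dbi_value = davies_bouldin(pts, labels)
ch_value = calinski_harabasz(pts, labels)
```

For this toy configuration the centroids are (0, 1) and (10, 1), every point lies one pixel from its centroid, and the two centroids are ten pixels apart, so the values can be checked by hand.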
Results
To facilitate a comprehensive and nuanced comparison of these methods, the experimental results are organized by performance metric, allowing a thorough comparison of each strategy and ensuring that each method’s results are presented in a manner that supports careful reflection on its strengths, limitations, and overall effectiveness. This structure not only enhances clarity but also provides the necessary context for understanding the conditions under which each method excels or faces challenges, thereby offering valuable insights into their comparative performance.
The proposal is a three-stage methodology (detection, tracking, and clustering) where the key part is the clustering task, which, building on the detection and tracking of vehicles, gives their tracks meaning: in this case, the unsupervised detection of inflows and outflows. To assess the real-time feasibility of the methodology, runtimes are computed by stage. In Table 2, the detection and tracking execution time (ms) per frame for each video is reported together with the mean and standard deviation (Std). While the detection task takes between 37 and 42 ms, the tracking task is much faster, taking around 1 ms to track vehicles and return the starting and ending points of each track. Regardless of the nature of the video (high-traffic or quiet intersections), the two stages together require far less than a second per frame.
For each video (first column), number of frames (second column) and per-frame milliseconds spent in detection (YOLOv5x6, third column) and tracking (Norfair, fourth column).
In Table 3, the number of clusters estimated by each strategy is reported. DBSCAN yields the most widely spread values, while K-means with the Elbow method varies the least, with between 3 and 5 clusters. The exception is the Quadrant-based strategy, which, by construction, distributes points into up to 4 clusters; if a quadrant of the frame contains no points, the number of clusters counted is 3 instead of 4. Figure 3 depicts some samples of how these clusters are distributed, allowing a qualitative inspection. The samples shown, from left to right, are the first frame of CARLA1, one of the three videos synthesized using CARLA, Hasserisvej-1 from the AAU RainSnow Dataset, and Seq1_SK_1 from the Ko-PER dataset. The Seq1_SK_1 first frame (third column) shows how the Quadrant-based strategy works (first row): assigning a point to the cluster and splitting the image into a

Samples of the first frame, where the initial and final points for each detected vehicle trajectory are overlaid on the frame. The columns, from left to right, represent the CARLA1, Hasserisvej-1, and Seq1_SK_1 videos, respectively. Each row corresponds to a different clustering method. Points that belong to the same cluster (as determined by the clustering method) are shown in the same color, while cluster centroids are marked in black.
Number of clusters obtained from the clustering process executed as part of the traffic flow detection method proposed.
To quantitatively analyze clustering performance, the results obtained from the metrics for each clustering method across all videos are reported and discussed below. Table 4 presents the Mean Squared Error (MSE) achieved by each clustering method. The Quadrant-based strategy and the DBSCAN algorithm yielded the highest MSE values, suggesting poor clustering performance in terms of fit, while K-means and Mean Shift achieved the lowest. Specifically, optimizing the Silhouette score to estimate the optimal number of clusters produced lower MSE values more frequently than the Elbow method. The difference between the mean MSE of the two K-means approaches (Elbow and Silhouette) arises from videos for which they select different numbers of clusters. In these cases, a larger number of clusters (as shown in Table 3) reduced the MSE by decreasing the distance between the centroid and the points within the cluster; for this reason, the lowest value achieved (256.893 with Ostre-3) is obtained with K-means (Silhouette). Finally, the Mean Shift algorithm achieved the lowest MSE values in 8 out of 17 videos within the dataset. It reported the lowest mean while requiring fewer clusters than K-means (using Silhouette) or DBSCAN. This can be interpreted as a more effective distribution of clusters around the starting and ending points of the tracks.
Mean Squared Error (MSE) obtained from the clustering process executed as part of the traffic flow detection method proposed.
Note. For each video (row), the best value among the clustering methods is highlighted in bold. The lower the better.
The Davies-Bouldin Index (DBI) is the other performance metric to minimize; it improves as the separation between clusters increases and the variation within clusters decreases. The outcomes across clustering methods are reported in Table 5. In this case, the lowest value per video is more spread across methods, with DBSCAN showing the highest values, followed by Quadrant-based, whereas K-means and Mean Shift report more similar values. It is worth noting that the Quadrant-based strategy has CARLA3 marked as the lowest value (along with the two K-means options) with a DBI of 0.110, because it considers 3 clusters in this video. This outstanding result is a consequence of the lack of points in one quadrant of the image; it is due to the distribution of the points, not the approach itself. In this case, the K-means algorithm optimized with the Silhouette score obtained better (lower) results than the Elbow method. The Elbow method yields the same outcome only when the K value matches that from the Silhouette approach, as in the CARLA1 and CARLA3 videos. The DBI obtained by K-means (Silhouette) with a mean and standard deviation of 0.318
Davies-Bouldin Index (DBI) obtained from the clustering process executed as part of the traffic flow detection method proposed.
Note. For each video (row), the best value among the clustering methods is highlighted in bold. The lower the better.
The Calinski-Harabasz index can be interpreted as indicating that well-defined clusters have a large variance between clusters and a small variance within clusters. According to the values obtained, reported in Table 6, the approach that maximizes the Calinski-Harabasz index is K-means (Silhouette), which achieved the highest mean value of 2101.099 and performed best 8 times across all videos. This is followed by the Mean Shift method and the K-means method using the Elbow technique, which achieved the highest values 7 and 5 times, respectively. In contrast, both the Quadrant-based strategy and DBSCAN show no outstanding values.
Calinski-Harabasz obtained from the clustering process executed as part of the traffic flow detection method proposed.
Note. For each video (row), the best value among the clustering methods is highlighted in bold. The higher the better.
The Silhouette score obtained by each clustering method is shown in Table 7. The first fact to highlight is that K-means (Silhouette) achieved the highest values, ranging from 0.632 to 0.908, indicating good performance with a strong structure. This is to be expected, as K-means with Silhouette selects the number of clusters precisely to maximize this metric. The Silhouette score measures how compact clusters are together with how well separated they are. Therefore, when the number of clusters increases, as shown in Table 3, clusters often become smaller and more compact, which results in a lower intra-cluster distance that can boost the score. This could explain why K-means with the Elbow method obtained good performance (0.755
Silhouette score obtained from the clustering process executed as part of the traffic flow detection method proposed.
Note. For each video (row), the best value among the clustering methods is highlighted in bold. The higher the better.
Finally, Table 8 shows the runtime (ms) required by each clustering method in the third stage of the methodology: clustering the starting and ending points retrieved by the first (detection) and second (tracking) stages. These results are considered a key metric because of the importance of real-time operation. In terms of runtime, the fastest clustering approach is DBSCAN (5.061
Runtime (ms) obtained from the clustering process executed as part of the traffic flow detection method proposed.
Note. For each video (row), the best value among the clustering methods is highlighted in bold. The lower the better.
Because the decision involves several criteria, we compare clustering methods by considering the performance metrics (MSE, DBI, Calinski-Harabasz index, and Silhouette score) together with clustering runtime as key factors to optimize. In this multi-objective context, where no method is best on all objectives, trade-offs among them must be made. For this reason, a Pareto front (Figure 4) is generated to summarize the results, where each clustering performance metric (vertical axes) is compared against clustering runtime (horizontal axes). Each data point represented in a light color corresponds to a value from a video, while the dark color represents the mean. The color of the points indicates the clustering strategy used, and the rhombus marker shape identifies the means on the Pareto front. A mean value belongs to the Pareto front if it is non-dominated, meaning that no other solution is at least as good in both objectives and strictly better in at least one. In each plot, one of the two objectives is to minimize runtime, making DBSCAN the primary choice in that respect, as reported in Table 8. However, because DBSCAN attains poor values in the performance metrics, the Mean Shift method is an alternative solution; both algorithms appear on the Pareto front in each plot. Following these two options is K-means with Silhouette optimization, which appears on the front in three of the four plots. Although Mean Shift is not the fastest in terms of runtime, it offers a better balance between speed, with a mean under 17 ms, and performance metrics that indicate well-defined and structured clusters, excelling in MSE while also achieving competitive outcomes in DBI, Calinski-Harabasz, and Silhouette scores.

Trade-off between clustering runtime and each clustering measure (MSE, DBI, Calinski-Harabasz, and Silhouette score) among all clustering methods: Quadrant-based, K-means (with Elbow method and Silhouette score optimization), Mean Shift, and DBSCAN. Colors represent each clustering method. Values attained by videos are shown as a scatter plot with light marker colors, while the mean across all videos is shown as a marker with a darker color. Means are evaluated as a Pareto front, and those in the Pareto front, i.e., the optimized trade-off between runtime and performance, are represented with a rhombus marker. The arrows represent whether the metric should be maximized (
The difference in performance between Mean Shift and DBSCAN lies in how the algorithms reach and expand clusters. Mean Shift is an iterative method that shifts data points to find local maxima of the kernel density estimation. This approach can be computationally intensive but generally produces high-quality clusters. In contrast, the DBSCAN algorithm checks only the points within a specified radius, making it computationally efficient. However, this may result in reduced clustering accuracy. Therefore, DBSCAN is often more suitable for large datasets that require extensive analysis. In the task addressed, clustering entry/exit points at intersections, the Mean Shift approach, although not the fastest, achieves high performance with low runtime.
We evaluate the robustness of the methodology and the clustering methods against tracking failures by including in the clustering process additional points as if they were starting/ending points in a legitimate, actually tracked trajectory, and checking how the clustering process is affected by the additional noise in the input data. To simulate failures in the tracking method, we draw these additional points from the set of all centers of bounding boxes for detected vehicles across all video frames in a scene. In this way, the probability distribution of additional, erroneous points is not uniformly random, but mimics the observed patterns of traffic flow in the video sequence, simulating tracking failures that change the input data to the clustering methods. For each video sequence, we add increasingly higher amounts of simulated tracking failures, computed as percentages of all starting and ending points detected in the video sequence (i.e., before adding any noise). We test three scenarios, setting the amounts of simulated tracking failures as 5%, 10%, and 20% of the total starting and ending points in each video sequence. Silhouette score is chosen to analyze clustering method behavior under increasingly higher noise levels because it ranges
Figure 5 depicts a heatmap illustrating how the Silhouette score changes when adding noise points to the input data for the clustering algorithm, relative to the baseline (i.e., without additional noise). If

Heatmap of the difference in Silhouette score between each percentage of added noise and the noise-free baseline. Since the Silhouette score is to be maximized, negative values (shown in red) denote a decrease when noise is added.

Samples of the first frame from the Ostre-3 video, with the cluster centroids for each percentage of added noise overlaid on the frame. From left to right and top to bottom, the panels correspond to: Quadrant-based, K-means (Elbow), K-means (Silhouette), Mean Shift, and DBSCAN.
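The noise-injection procedure described above can be sketched as follows. This is an illustrative reconstruction, not the experimental code: the function names, sampling without replacement, the use of Mean Shift with a fixed `bandwidth`, and scoring the Silhouette over all points (including the injected ones) are assumptions, and the data in the demo are synthetic.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.metrics import silhouette_score


def add_tracking_noise(endpoints, detection_centers, fraction, rng):
    """Append simulated tracker failures to the real start/end points.

    The extra points are sampled from the bounding-box centres of all
    detections, so they follow the observed traffic rather than a uniform
    distribution. Sampling without replacement is an assumption.
    """
    n_extra = int(round(fraction * len(endpoints)))
    idx = rng.choice(len(detection_centers), size=n_extra, replace=False)
    return np.vstack([endpoints, detection_centers[idx]])


def silhouette_delta(endpoints, detection_centers,
                     fractions=(0.05, 0.10, 0.20), bandwidth=60.0, seed=0):
    """Silhouette score with noise minus the noise-free baseline, per level."""
    rng = np.random.default_rng(seed)

    def score(points):
        labels = MeanShift(bandwidth=bandwidth).fit_predict(points)
        return silhouette_score(points, labels)

    baseline = score(endpoints)
    return {
        f: score(add_tracking_noise(endpoints, detection_centers, f, rng)) - baseline
        for f in fractions
    }


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    ends = np.array([[100.0, 360.0], [640.0, 60.0], [1180.0, 360.0], [640.0, 660.0]])
    endpoints = np.vstack([c + rng.normal(scale=15.0, size=(60, 2)) for c in ends])
    # Synthetic detection centres spread along the segments joining each road
    # end to the middle of the frame (a stand-in for real detections).
    t = rng.uniform(size=(2000, 1))
    centers = ends[rng.integers(0, 4, size=2000)] * t + np.array([640.0, 360.0]) * (1 - t)
    for level, delta in silhouette_delta(endpoints, centers).items():
        print(f"{level:.0%} noise: Silhouette change {delta:+.3f}")
```

Each value in the returned dictionary corresponds to one cell of a heatmap such as Figure 5 for a single video and clustering method; a negative value means the score dropped once noise was added.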
Traffic surveillance videos from road intersections capture traffic patterns that can be analyzed to extract valuable scene information. In this work, a three-stage method is proposed to automatically identify potential incoming and outgoing traffic flows. The process begins by detecting vehicle positions in each video frame with the YOLOv5x6 model. These frame-by-frame detections are then used to reconstruct vehicle trajectories through the Norfair tracking method. By applying unsupervised clustering to these trajectories, the resulting cluster centroids provide an automatic understanding of the traffic flows entering and exiting the intersection.
To provide a comprehensive evaluation of the traffic patterns captured in these road intersection surveillance videos, we compare five clustering approaches: Quadrant-based, which assigns points to predefined spatial regions; two versions of K-means, which select the number of clusters with the Elbow method and by optimizing the Silhouette score, respectively; Mean Shift; and DBSCAN. Including multiple clustering methods allows us to analyze how well each algorithm adapts to the nature of the video data and to identify which performs best in this context. Traffic flows in videos vary in complexity, and each algorithm takes a different approach to clustering, which helps us compare their effectiveness at handling the dynamic movement of vehicles. This comparison helps ensure that the system is adaptable to a variety of real-world traffic scenarios.
To support this evaluation, we report the runtime of each stage and several clustering performance metrics, which provide a quantitative basis for assessing the quality of the clusters produced by the different algorithms. Additionally, the trade-off between clustering runtime and performance metrics has been analyzed as a multi-objective decision problem using a Pareto front. Finally, to test the robustness of the methodology against tracker failures, noise points amounting to a given percentage of the original input were added to the clustering input. The differences between the noisy and noise-free results were computed and reported as a heatmap. The evaluation demonstrates the effectiveness of the proposal, given that performance decreases only slightly.
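As an illustration of the Pareto-front analysis mentioned above, the sketch below selects the non-dominated candidates when runtime is to be minimized and a performance score is to be maximized. The function name `pareto_front` and the numbers in the demo are hypothetical; they are not the measurements reported in this work.

```python
import numpy as np


def pareto_front(runtime, score):
    """Return indices of non-dominated candidates.

    A candidate is dominated if another one is no slower and no worse in
    score, and strictly better in at least one of the two objectives.
    """
    runtime = np.asarray(runtime, dtype=float)
    score = np.asarray(score, dtype=float)
    keep = []
    for i in range(len(runtime)):
        no_worse = (runtime <= runtime[i]) & (score >= score[i])
        strictly_better = (runtime < runtime[i]) | (score > score[i])
        if not np.any(no_worse & strictly_better):
            keep.append(i)
    return keep


if __name__ == "__main__":
    # Made-up (runtime in seconds, score) pairs for five anonymous candidates.
    names = ["A", "B", "C", "D", "E"]
    runtime = [0.5, 2.0, 3.0, 1.0, 0.2]
    score = [0.40, 0.55, 0.50, 0.70, 0.60]
    print([names[i] for i in pareto_front(runtime, score)])
```

Candidates on the front represent different compromises between speed and quality; choosing among them remains a decision about which objective matters more for the application.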
Together, these analyses give a clearer understanding of which algorithm is best suited for extracting meaningful traffic flows from the video data. While DBSCAN runs faster, Mean Shift offers a better balance between computational cost, performance, and robustness in the presence of noise, making it the most suitable clustering method.
The proposed method has been tested with a set of videos, both from publicly available datasets and synthetically generated. Experimental results demonstrate that using better object detection models reduces the number of false negatives and false positives in the placement of cluster centroids at road ends. Since the method is robust to false starting/ending points in vehicle trajectories, it also works when vehicles follow unconventional trajectories (speeding, tailgating, repeatedly switching lanes) and in bad weather (rain and snow). The method can be useful when processing large batches of traffic videos from many different intersections, in order to provide awareness of the layout of the intersection, as well as to help in the detection of vehicles making anomalous entries or exits in each traffic video (i.e., far from any of the clusters, or in the fringes of an existing cluster). These findings highlight the potential of traffic forecasting, providing essential insights for intelligent transportation systems.
For future research on the system proposed in this article, the authors plan to explore two main research lines. The first involves experimenting with different tracking algorithms beyond the one used in the current work to determine whether these alternatives can improve performance. By testing various tracking methods, the authors aim to enhance the accuracy and reliability of vehicle trajectory detection, particularly in complex or crowded traffic scenes. This exploration of new tracking approaches may lead to better overall results and more robust handling of vehicle movements under diverse conditions.
The second research focus is on translating the coordinates of all real-world scenarios presented in this work into GPS coordinates. This step would allow the system to map detected vehicle trajectories to geographic locations, providing a more practical and scalable solution for real-world applications. By aligning the traffic data with GPS coordinates, the system could support advanced traffic analysis and integration with other geo-referenced systems, such as navigation or traffic management platforms. It would also enable data fusion from multiple cameras, which would provide more data points to the clustering algorithms and allow the method to be applied to more complex road layouts that cannot be adequately observed from a single traffic camera.
The methodology currently lacks semantic information, since it only provides clusters that localize roads. A further research line could therefore add a fourth, post-processing stage in which the clusters are semantically labeled. This could be explored with supervised techniques, such as deep learning-based models, or with a visual large model (VLM).
These three research directions will help extend the current system’s applicability and improve its effectiveness in real-world traffic monitoring and analysis scenarios.
Footnotes
Author contributions
All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.
Funding
This work is partially supported by the Autonomous Government of Andalusia (Spain) under projects PPRO-TIC163-G-2023 (TIC163-G-FEDER) and DGP_PIDI_2024_00462; also by the Ministry of Science and Innovation of Spain, grant number PID2022-136764OA-I00, project name Automated Detection of Non Lesional Focal Epilepsy by Probabilistic Diffusion Deep Neural Models. It includes funds from the European Regional Development Fund (ERDF). It is also partially supported by the Fundación Unicaja under project PUNI-003_2023, project name Intelligent System to Help the Clinical Diagnosis of Non-Obstructive Coronary Artery Disease in Coronary Angiography, the Instituto de Investigación Biomédica de Málaga y Plataforma en Nanomedicina-IBIMA Plataforma BIONAND under project ATECH-25-02, and the Instituto de Salud Carlos III, project code PI25/02129 (co-financed by the European Union). The authors thankfully acknowledge the computer resources, technical expertise and assistance provided by the SCBI (Supercomputing and Bioinformatics) center of the University of Málaga. They also gratefully acknowledge the support of NVIDIA Corporation with the donation of an RTX A6000 GPU with 48 GB of memory. The authors also thankfully acknowledge the grant of the Universidad de Málaga and the Instituto de Investigación Biomédica de Málaga y Plataforma en Nanomedicina-IBIMA Plataforma BIONAND.
Conflicts of interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
