Abstract
The cornerstone of autonomous ground driving with the lowest possible risk of collision in real traffic environments is the estimation of obstacle movement. Predicting the trajectories of multiple obstacles in dynamic traffic scenarios is a major challenge, especially when different types of obstacles, such as vehicles and pedestrians, are involved. To address these issues, this work proposes a novel method based on dynamic Bayesian networks to infer the paths of interest objects (IO). Environmental information is obtained through stereo video, the direction vectors of multiple obstacles are computed, and the trajectories with the highest probability of occurrence and possibility of collision are highlighted. The proposed approach was evaluated in test environments with different road layouts and multiple obstacles in real-world traffic scenarios. The results are compared against the ground truth of the paths taken by each detected IO. According to the experimental results, the proposed method obtains a prediction rate of 75% for the change of direction while taking the risk of collision into consideration. The importance of the proposal is that, in contrast with related work, it does not disregard the risk of collision.
Introduction
Autonomous driving for ground vehicles has been an active research field in recent years, and there are still areas of opportunity for driving in real traffic environments without collisions.
The problem to be solved in order to achieve autonomous vehicles is the detection of obstacles present on urban routes and the estimation of their trajectories to avoid collisions.
It is important to mention that a route or path intention of the object of interest is defined as the action of changing direction through the path options that do not involve collision [7].
The effectiveness of the proposed methods depends on the ability to process the environmental information in order to establish the conditions of movement and compute possible changes in the direction of the detected obstacles to avoid the risk of collision [8].
The way the vehicular environment is modeled is important because it determines how the problem is addressed and, consequently, the possible solutions. For example, the traffic scene can be considered as a set of objects in a static environment with discrete time and a continuous interaction space [16]. Under this view, the kinematic state of the objects present in the scene is considered first.
The prediction of directional changes in autonomous driving is a subject of recent research. Models have been developed to predict complete trajectories, i.e., with the intention inferred in the process. Hidden Markov models [12], recurrent neural networks [1], Gaussian models [3], dynamic Bayesian networks [16, 20] and reinforcement learning [10, 22] have been used in such models. There are also works that cover the prediction of intention independently or followed by movement planning [2].
So far, the most advanced models are based on probabilistic computation over the kinematic and/or dynamic conditions of the objects detected on the road. This allows uncertainty and ambiguity to be handled [22].
A desirable method assigns a high probability to the change of direction that the tracked object eventually takes. The model can then be used in risk assessment for motion planning or for modeling the behavior of the tracked object [7].
In this paper, a method for the motion estimation of multiple objects detected in video is presented. The proposed model is based on a Dynamic Bayesian Network (DBN) that takes into account the causal correlations of the scene dynamics and thus calculates the probability of change of the object direction vector with respect to the collision risk.
This paper is organized as follows. Section 2 describes relevant related work. Section 3 presents the details of the proposal and the relevant parameters. Section 4 shows the experiments and the results obtained. Finally, Section 5 presents the conclusions of this work.
Related work
The problem of trajectory estimation in vehicular environments has been widely addressed [18]. For example, Duan et al. [11] present a hierarchical reinforcement learning method for autonomous vehicle decision making that does not rely on a large amount of labeled driving data. This approach models and learns direction change as a Markov decision process where, at each time interval of interest, the vehicle observes a state, performs an action, receives a scalar reward, and finally reaches the next state. However, it exhibits fluctuation due to random initialization and to the number of times that the autonomous car reaches the destination in each epoch. According to the authors, learning speed and performance are comparatively poor.
Xu et al. [4] develop a feature normalization scheme and a strategy to build three-dimensional Gaussian process regression models from two-dimensional trajectory patterns in order to capture the spatio-temporal characteristics of traffic situations. However, since the traffic environment is a dynamic and uncertain system, the subsequent action is no longer optimal when a decision sequence is executed, because velocity is considered invariant in this process, which does not hold in real situations.
On the other hand, Schulz et al. [15] model the process as a DBN that allows the specification of the relationships between obstacles as well as the causal and temporal dependencies for handling measurement uncertainty. The decision-making process for each visible obstacle is composed of three hierarchical layers: route intent, maneuver intent, and continuous action. The network structure is adapted at runtime by creating and removing route-maneuver hypotheses. However, the method has only been tested on 3 different traffic scenarios.
As mentioned in related works, the general problem statement involves determining the set of possible routes and maneuvers given a topological map. In this way, the continuous action of each obstacle is then derived by context-dependent behavioral models given its route and maneuver intentions [6, 24, 25].
It is important to note that recurrently updating the belief with observations of the poses and velocities of a group of interest objects makes it possible to infer whether or not a path change is intended. With this information, a probabilistic trajectory prediction is generated by directly simulating the current belief. Finally, the predicted trajectories explicitly incorporate the highest probability of path change and consider interdependencies between multiple objects [5, 13].
In summary, the above works propose to integrate a probabilistic transition system into sampling-based algorithms, where the input to the model is deterministic (a sample of the belief) while the output is a probability distribution over the actions, from which samples can be drawn again if required. The action is assumed to be normally distributed given a potential path intention and the frame of the situation at the current instant [8, 24].
The method proposed in this work contributes to the detection of multiple IO and the prediction of their changes of direction. The changes of object direction are obtained from a DBN, with the probability distribution expressed as a function of the collision risk. The following section describes the implemented methodology in detail.
Methodological proposal
For this work, a traffic scene consists of a set of objects participating in the traffic in a time-varying space that represent a risk of collision for the vehicle capturing the video (ego-vehicle) [7, 8]. The objects to be detected can be cars, pedestrians, cyclists, animals and others.
The proposed approach estimates the trajectory to be taken by a detected object. The flow and processing of information are carried out through the modules depicted in Fig. 1.

Fig. 1. Information processing modules.
First, the road information is captured with a pair of video cameras (to emulate stereoscopic vision) to determine the regions of interest (ROI). The ROI detection module consists of two cooperating submodules: a multi-layer Convolutional Neural Network (CNN-submodule), which encloses the objects detected in the frame in bounding boxes, and a Disparity Map (DM-submodule), which estimates the approximate distance of these objects from the ego-vehicle by processing the left and right image information.
These ROIs are tracked as consecutive data are acquired over time, which provides physical characteristics of the movement of the objects and related information. Finally, the information is fed into a probabilistic DBN model which, given a dataset of observed objects, predicts possible trajectories for each one.
The CNN-submodule considered in this work is based on the YOLOv5 deep learning framework [14] and the standard CNN architecture. This network predicts 4 coordinates for each bounding box.
The input of the CNN-submodule is the original image of the traffic scene for each frame; the feature map F_m of the m-th layer is computed as in Equation (1).
The CNN-submodule process culminates with the detection of all IO and the two-dimensional coordinates of their bounding boxes. From the bounding-box coordinates, the changes in orientation and translation are then obtained, as well as the changes in the height and width of the object. This information is provided to the DM-submodule.
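As an illustration of the bounding-box quantities described above, the following minimal Python sketch derives the box center, its width and height, and their frame-to-frame changes. The `(x_min, y_min, x_max, y_max)` pixel layout, the `Box` type and the function names are assumptions made for this example; they are not the detector's actual output format.

```python
from typing import Dict, NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixels (layout assumed for illustration)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def box_center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of the box."""
    return ((box.x_min + box.x_max) / 2.0, (box.y_min + box.y_max) / 2.0)


def box_size(box: Box) -> Tuple[float, float]:
    """Return the (width, height) of the box."""
    return (box.x_max - box.x_min, box.y_max - box.y_min)


def box_change(prev: Box, curr: Box) -> Dict[str, float]:
    """Translation of the center and change in size between two frames."""
    (px, py), (cx, cy) = box_center(prev), box_center(curr)
    (pw, ph), (cw, ch) = box_size(prev), box_size(curr)
    return {"dx": cx - px, "dy": cy - py, "dw": cw - pw, "dh": ch - ph}
```

A growing box (positive `dw` and `dh`) is a rough cue that the object is getting closer, which is the kind of information passed on to the distance estimation stage.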
The distance of the IO with respect to the ego-vehicle is calculated in the DM-submodule. A stereo device (two slightly horizontally offset cameras mounted on the ego-vehicle) provides the information needed to calculate the disparity map.
The distance/depth d is computed from the left image frame (F_l) with respect to the right image frame (F_r), which are captured by a pair of cameras with the same focal length (Ф). The configuration is depicted in Fig. 2.

Fig. 2. Two-camera stereoscopic configuration with focal length and aligned centers for computing the disparity map.
The point values φ_l and φ_r are triangulated given the focal length of both cameras and the distance between them (B). The value φ_r is a negative number, measured on the image plane with respect to the center line of the right camera (solid arrow for F_r). On the other hand, φ_l is a positive number, since the central plane of the left camera (solid arrow for F_l) is taken as reference [4].
Equation (2) is used to estimate the depth over all the values of the bounding box.
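Equation (2) is not reproduced in this text. As a hedged illustration, the sketch below uses the textbook pinhole stereo relation d = Ф · B / (φ_l - φ_r), which agrees with the sign convention described above (φ_l positive, φ_r negative). The function name, the parameter names and the choice of units (focal length in pixels, baseline in meters) are assumptions of this example.

```python
def depth_from_disparity(focal_px: float, baseline_m: float,
                         phi_left: float, phi_right: float) -> float:
    """Depth d = focal * baseline / (phi_left - phi_right).

    phi_left and phi_right are the horizontal image coordinates of the same
    scene point in the left and right frames, each measured from the center
    line of its own camera (so phi_right is typically negative).
    """
    disparity = phi_left - phi_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity


# Example with made-up values: 700 px focal length, 0.5 m baseline.
example_depth_m = depth_from_disparity(700.0, 0.5, 10.0, -25.0)
```

Because depth is inversely proportional to disparity, distant objects (small disparity) are estimated less precisely than nearby ones.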
The object information is obtained by processing data from a pair of video cameras and a Global Positioning System (GPS) device mounted on the ego-vehicle. Obstacle movement conditions (relative velocity) are obtained indirectly with the GPS sensor.
The position of the i-th IO detected in the scene is specified by the coordinates (x_i, y_i) of the center of its bounding box. Note that x and y indicate the location relative to the ego-vehicle in the longitudinal and lateral directions, respectively. The origin of the reference system, O = (0, 0), corresponds to the ego-vehicle location. The position of the surrounding IO at time t is denoted as
The trajectory of the i - th object is determined by position sequence
The position coordinates are obtained every Δt seconds. The dynamic scene changes over time, so the sampling instants can be represented as t_i = t_{i-1} + Δt. A temporal change corresponds to the number of consecutive frames considered in the Δt interval; for example, it may correspond to the 30 frames captured in one second (at 30 frames per second, fps) or to a smaller range. Likewise, the GPS sensor data are synchronized with the frame capture.
Vehicle positions are considered within an observable range of -10 to 10 meters (m) in the lateral direction and of approximately 4 to 40 m in the longitudinal direction. These ranges are determined taking into account the valid detection range of the sensors and the road environment where the data are collected.
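The sampling rule t_i = t_{i-1} + Δt and the observable window can be sketched as follows. This is a minimal illustration under stated assumptions: positions are `(x, y)` tuples in meters with x longitudinal and y lateral, as in the text, and all function and constant names are hypothetical.

```python
from typing import List, Tuple

Position = Tuple[float, float]  # (x longitudinal, y lateral) in meters

# Observable ranges quoted in the text.
X_RANGE = (4.0, 40.0)     # longitudinal, meters
Y_RANGE = (-10.0, 10.0)   # lateral, meters


def sample_times(t0: float, frames_per_step: int, fps: float, steps: int) -> List[float]:
    """Timestamps t_i = t_{i-1} + dt, with dt = frames_per_step / fps."""
    dt = frames_per_step / fps
    return [t0 + i * dt for i in range(steps)]


def in_observable_range(pos: Position) -> bool:
    """True if the position lies inside the observable window."""
    x, y = pos
    return X_RANGE[0] <= x <= X_RANGE[1] and Y_RANGE[0] <= y <= Y_RANGE[1]


def observable_trajectory(track: List[Position]) -> List[Position]:
    """Keep only the positions that fall inside the observable window."""
    return [p for p in track if in_observable_range(p)]
```

For instance, sampling every 5 frames at 30 fps gives Δt = 5/30 s, and a track point 3 m ahead of the ego-vehicle would be discarded because it is closer than the 4 m lower bound.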
Proposed DBN
The proposed DBN topology incorporates prior analysis of the state and position of the IO in order to track them and estimate their trajectories in the traffic scene to avoid collisions. A general formulation of the problem involves defining a traffic scene that includes several types of participating objects with different mobility characteristics.
The variables in the discrete state space are defined to facilitate the analysis of the problem. These variables are: relative velocity of displacement
Information from the environment and interaction of some IO (vehicles, pedestrians, cyclists, etc.) are used to design latent states, while causal dependencies are used to design conditional dependencies in traffic. Dynamic models are used to design the motion state space and action-reaction variables [17].
Video capture considers the existence of discrete latent variables Tr_j designed to include the intentions of each participant in the scene, i.e., the path of the j-th IO.
The probability distribution of Tr_j depends on all observable states Tr_1, Tr_2, ..., Tr_n, but it does not depend on the latent states of other scene participants. With this, a conditional discrete distribution is obtained for the j-th object and also the collision probability (

Fig. 3. The Bayesian network model unrolled over two time slices, showing the decomposition of the conditional dependencies of the latent states. Solid and dashed lines denote causal and temporal observational dependencies, respectively.
The inference process of the DBN refers to calculating the probability of a certain maneuvering intention. Based on the established network and the previously learned parameters, it uses all observable states in two consecutive time slices as evidence for inference [23].
The result of the inference is the probability of each intention at time t + 1. The intention with the maximum posterior probability is chosen as the prediction result. The specific computational process is described below.
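The idea of propagating a belief across two time slices and choosing the intention with maximum posterior probability can be sketched with a generic discrete Bayes filter. This is not the topology proposed in this work: the three intention labels, the transition matrix and the likelihood values below are hypothetical placeholders used only to show the predict, update and arg-max steps.

```python
from typing import List

INTENTIONS = ["front", "left", "right"]


def normalize(weights: List[float]) -> List[float]:
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def predict(belief: List[float], transition: List[List[float]]) -> List[float]:
    """Propagate the belief one slice: b'(j) = sum_i b(i) * P(j | i)."""
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]


def update(predicted: List[float], likelihood: List[float]) -> List[float]:
    """Weight the predicted belief by the evidence likelihood and renormalize."""
    return normalize([p * l for p, l in zip(predicted, likelihood)])


def map_intention(posterior: List[float]) -> str:
    """Return the intention with the maximum posterior probability."""
    best = max(range(len(posterior)), key=lambda k: posterior[k])
    return INTENTIONS[best]


# Hypothetical numbers: uniform prior, "sticky" transitions, evidence favoring "left".
prior = [1 / 3, 1 / 3, 1 / 3]
transition = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
likelihood = [0.2, 0.7, 0.1]
posterior = update(predict(prior, transition), likelihood)
prediction = map_intention(posterior)
```

With a uniform prior the prediction step leaves the belief unchanged, so the posterior simply follows the evidence; with an informative prior, the transition matrix smooths abrupt switches between intentions.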
For each set of objects in the scene, the variables of interest are: the velocities of the objects in the frame to be analyzed
Now, consider a set of objects with parameters Z_t = [V_t, D_t, Γ_t, Ψ_t] detected in the scene and sampled at the analyzed frames starting from t. Under initial conditions, the set Z_0 can be drawn (∼) from the probability distribution given by the input data, i.e., Z_0 ∼ P(V_0, D_0, Γ_0, Ψ_0 ∣ C_0). Then, each direction change is predicted according to the transition probability Z_{t+1} ∼ P(Z_{t+1} ∣ Z_t).
The DBN is defined to propagate the conditional dependence relationships between the variables of interest and observe their effect in the time interval to analyze. Therefore, given the interval t = 1, 2, 3,..., T, the proposed DBN topology and the variables in the behavior layer, the joint probability distribution [19] is expressed as Equation (3).
The conditional probability can be written as P(C_t | TR_t), where TR_t can be some combination of the latent spatial variables, resulting in a variance matrix.
As mentioned earlier, discrete latent variables are defined to help estimate intentions. They are also used as indicators for the switching dynamic system and to incorporate prior knowledge of traffic interaction into the DBN.
Prediction in the DBN implies that the model is developed without observing all states; a critical point is to show whether the model is able to capture possible interactions between objects without observing the information of a particular state.
Once the predictor variables and the network structure have been determined, the next step is to estimate the conditional probability distribution between the main and secondary nodes.
The results of the behavioral estimation are obtained according to the historical observation information, which can be expressed as Equation (4).
Learning the network involves estimating the parameters of the conditional probability distributions based on the DBN structure. Since there are hidden nodes in the network, this problem is considered partially observable. In the proposed topology, the hidden layer parameters H = (TR_t) and the observable layer parameters Op = (Z_t, C_t) are defined.
Therefore, in the partially observable case, the log-likelihood is defined by Equation (5).
The Expectation-Maximization (EM) algorithm [23] iterates between an expectation step and a maximization step to find maximum likelihood (ML) parameters, in this case the possible paths in relation to the information provided. These two steps are repeated until convergence is reached, i.e., Z_t converges in probability to K (the approximate path obtained, close to the real path) as the data size increases, Z_t ⟶ K in probability [23].
To optimize the partially observable case, ML estimation theory is used to fit the proposed model and estimate the possible parameters closest to the collision risk value based on the conditional probability distributions with respect to the parameters of the observable variables of multiple objects in the scene.
The EM algorithm uses Jensen's inequality [26] to iteratively maximize a lower bound of the log-likelihood. Therefore, Equation (5) can be rewritten as Equation (6).
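Equations (5) and (6) are not reproduced in this text. For reference, the standard textbook form of the partially observable log-likelihood and of its Jensen lower bound is given below, written with the hidden and observable layers H and Op defined above. The symbols θ (the model parameters) and q (an arbitrary distribution over H) are introduced here for illustration and may differ from the authors' notation.

```latex
\begin{align}
\ell(\theta) &= \log P(O_p \mid \theta)
              = \log \sum_{H} P(O_p, H \mid \theta) \\
             &= \log \sum_{H} q(H)\, \frac{P(O_p, H \mid \theta)}{q(H)}
              \;\ge\; \sum_{H} q(H) \log \frac{P(O_p, H \mid \theta)}{q(H)}
\end{align}
```

In this form, the expectation step sets q(H) = P(H | Op, θ_old), which makes the bound tight at the current parameters, and the maximization step maximizes the right-hand side with respect to θ.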
The proposed interaction model should represent the interaction behavior of multiple traffic participants of arbitrary categories. For example, it is possible to define the state spaces of a typical traffic environment for objects in general. A first inference in the development of the proposed trajectory estimation algorithm is based on global latent state vectors (Z_t), i.e., prior states specific to the traffic participants.
In Algorithm 1, the current state vector and the collision inference state C_t of the traffic participants in the video are provided as input. The expected output is the approximation of the latent state vector at time t + 1.
Three direction vectors (u) are also defined to identify the change of direction, which are: maintaining the frontal displacement, change of displacement to the left or change of displacement to the right.
It is shown that a conditional state P(TR_t | Z_{t+1}) occurs; that is, the path conditions imply that there is a change of direction in the displacements of the participants (j participants). This is denoted by P(TR_{t-1} | Z_{1:T}, Z_0) throughout the course of the video (previous frames t - 1).
The relationships of the variables and the joint probability distributions (JPD) are necessary to process the queries in DBN topology to obtain the set of probable inferences with respect to the normalized collision probability.
With respect to the topology described (Fig. 3), experiments are conducted to evaluate the proposal. Given the conditions of the variables, a query can be performed and the collision probability can be inferred.
The information processed in the different modules of the proposed method (Fig. 1) corresponds to the left and right images of 17 videos from the KITTI Vision Benchmark database [21], each with a duration of 15 seconds, a sampling rate of 30 fps and a resolution of 1242 x 374 pixels. The experiments were run on a Core i7 processor at 2.5 GHz with 8 GB of available RAM, an Intel HD Graphics 520 (Skylake GT2) graphics processor and the Ubuntu 16.04 LTS operating system.
In addition, 16 videos were captured on streets with vehicular and pedestrian traffic, each with a duration of 8 to 9 minutes, a capture rate of 30 fps and a resolution of 1920 x 1080 pixels. It is worth mentioning that videos corresponding to both the left and right sides of the scene are captured. Thus, the amount of information processed corresponds to approximately 460,800 frames.
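The frame count can be checked with a few lines of arithmetic. The sketch below assumes that the quoted figure counts both the left and right views of the 16 captured street videos; under that assumption, 460,800 is the total for 8-minute videos, and 9-minute videos give the upper end of the range. The helper name is hypothetical.

```python
def total_frames(videos: int, minutes: float, fps: int, views: int = 2) -> int:
    """Frames across all videos, counting every camera view."""
    return int(videos * minutes * 60 * fps * views)


# Captured street videos: 16 videos, 8 to 9 minutes each, 30 fps, two views.
lower = total_frames(16, 8, 30)
upper = total_frames(16, 9, 30)
print(lower, upper)
```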
The change of direction of an obstacle can be determined by analyzing its lateral position and its previous location. The characteristics preceding a change of direction should then be determined using statistical data on the lateral relative velocity, direction angle, estimated distance to the obstacle and spatial position, as well as on the other obstacles present in the scene.
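A deliberately simplified sketch of labeling a change of direction from two consecutive lateral positions is shown below. It is only an illustration of the lateral-position comparison described above, not the statistical procedure used in this work: the 0.2 m threshold is a hypothetical tuning value, and the sign convention (positive lateral shift means left) is an assumption that should be flipped if the opposite convention applies.

```python
def direction_label(prev_y: float, curr_y: float, threshold_m: float = 0.2) -> str:
    """Label the lateral movement between two samples as front, left or right.

    Assumes lateral positions in meters and that a positive shift means a
    movement to the left of the ego-vehicle.
    """
    lateral_shift = curr_y - prev_y
    if lateral_shift > threshold_m:
        return "left"
    if lateral_shift < -threshold_m:
        return "right"
    return "front"


labels = [direction_label(a, b) for a, b in [(0.0, 0.5), (0.0, -0.5), (1.0, 1.1)]]
```

A fixed threshold ignores measurement noise and object speed, which is one reason a probabilistic treatment of the direction vectors is preferable in practice.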
Results
In particular, the results obtained with our proposal are the collision probability according to the change of direction vector, the estimation of the spatial position, and the comparison of the path to follow against the ground truth (GT). In the first experiments, only one object is detected and its probable path is calculated; subsequently, the possible paths of multiple objects in the scene are obtained.
For example, the parameters obtained for the variables of interest, in this case for a single detected object in a vehicular scene, are: approximate average relative velocity (v_t) from detection in the range of 20 to 50 km/h; distance (d_t) from detection, from an initial value of 10 m to a final value of 40 m in the frontal direction; and orientation angle (ψ_t), initially a frontal approach (p_t) and, at the end of the trajectory, a movement to the left (p_{t+1}).
The above parameters are depicted in Fig. 4, which shows the displacement traveled and the changes of direction of the object (in this case a car).

Fig. 4. Representation of the information on the displacement of an object of interest.
Table 1 shows the results of processing the information of the given vehicular scene. The event with the highest probability of occurring is no collision, since the object is moving away from the left side of the road with respect to the ego-vehicle position (probability collision(no)_left = 0.75).
Table 1. Experiment results by collision inference for each discretized path direction.
The previous result describes the scene and the conditions of the variables when a single obstacle is detected. When multiple objects in the scene are processed and estimated, the same process is carried out iteratively.
The detected objects with information Z_t relative to ..., (V_{t-1}, Γ_{t-1}, D_{t-1}, Ψ_{t-1}), (V_{t+1}), (V_t, Γ_t,
Figure 5 shows the inferred direction vectors for multiple objects in the scene. It shows (from top to bottom) a series of consecutive frames with the detected objects. The predominant direction vector obtained is colored, and the bounding box is drawn in the same color to highlight it further: green corresponds to the right direction vector, blue to the left direction vector and red to the front direction vector.

Fig. 5. Estimation of the probability of changes in direction of multiple objects.
The inferred paths can be compared against the GT by plotting the probabilities obtained for each direction vector over the course of the video (a given number of frames). To show the behavior (changes) of the direction vector during the interval of interest, the information is displayed in graphs.
In Figs. 6–8, each graph corresponds to the tracking of a single object and shows the normalized percent probability estimate obtained for each direction vector; the most probable direction change is the one with the highest value. The probability of the change of direction (solid line) and the ground truth (dotted line) are shown. The color of the GT line indicates the direction taken by the obstacle in that interval (green for a change of direction to the right, blue for a change of direction to the left and red for keeping the front direction).

Fig. 6. Probability of changes in direction vs. GT (right).

Fig. 7. Estimation of the probability of changes in direction vs. GT (front).

Fig. 8. Probability of changes in direction vs. GT (left).
For example, Fig. 6 shows that the GT of the trajectory of the object in the scene corresponds to a movement in the right lateral direction. The response obtained from the proposal initially distributes the probability of displacement in similar proportions among the 3 direction vectors; however, as more information is obtained over the course of the frames, the probability of a change of direction to the right increases. The same analysis applies to Figs. 7 and 8.
As expected, there is an error with respect to the GT. Fig. 9 shows the variation of the position of the GT path of the object against the estimated position. With each pair of points (x, y) obtained from the spatial position in previous time intervals, as well as the direction (angle of orientation), the route to follow can be described. The key purpose of intention estimation in this work is to give probabilistic results for certain discrete variables that contribute to the decisions of road participants.

Fig. 9. Movement probability obtained vs. GT (position).
Figure 10 shows the comparison of the orientation angle (in degrees) of that object. Both output values are compared against their respective GT.

Fig. 10. Movement probability obtained vs. GT (angle).
The topology proposed in this work is evaluated by quantifying the direction vector of the estimated motion; the quantity evaluated is the normalized error obtained from the difference between the motion estimation and the real path (Z(GT)_t) at t. To evaluate the direction vector, the Mean Squared Error (MSE) is used for both tracking and prediction, as given in Equation (7).
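Equation (7) is not reproduced in this text. The sketch below implements the standard definition of the mean squared error between an estimated sequence and its ground truth, which is assumed here to be the quantity reported; the function name and the sample values are illustrative only.

```python
from typing import Sequence


def mean_squared_error(estimated: Sequence[float], ground_truth: Sequence[float]) -> float:
    """MSE = (1/N) * sum_t (estimated_t - ground_truth_t)^2."""
    if len(estimated) != len(ground_truth) or len(estimated) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(e - g) ** 2 for e, g in zip(estimated, ground_truth)]
    return sum(squared) / len(squared)


# Made-up normalized probabilities for the GT direction over three updates.
example_mse = mean_squared_error([0.5, 0.7, 0.9], [1.0, 1.0, 1.0])
```

Because the errors are squared, a single abrupt deviation from the GT is penalized more heavily than several small ones, which matters when the estimate oscillates.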
Table 2 shows the normalized error obtained for a sequence given to the proposed model to evaluate the motion conditions of the objects in the scene. The information refresh rate corresponds to an interval of 5 frames, i.e., the direction vector is recalculated approximately every 0.166 seconds, and the error associated with the possible change of direction is also adjusted against the GT. If the route of an IO is to be estimated, it would only be necessary to adapt the information of the direction changes to the location on the road map.
Table 2. Experimental results (MSE) between GT and spatial position and direction angle of objects in 10 vehicular scenes.
Similarly, to validate the results obtained, the proposal is compared against the state of the art. The proposed model is compared with the RBDHM model [20] and with the ERI model [15]; the comparison includes data for the variables of interest present in the three models (velocity, position, orientation and separation distance of the objects).
Initially, the spatial location of the IO is compared over a given time interval (in this case 5 seconds) to determine the subsequent location. Fig. 11 shows the results obtained from 16 video sequences (horizontal axis) and the comparison of the spatial position obtained by each method with respect to the GT (vertical axis). This allows a qualitative evaluation of the results obtained.

Fig. 11. Comparison of proposal vs. related work.
The next aspect to compare is the estimation of the displacement vector. As mentioned, the estimated direction vector is the one that, given the parameters of the variables obtained, is most likely to occur.
Figure 12 shows the result (normalized probability percentage) of the direction estimation obtained by the proposed method for an IO in a vehicular scene of 800 frames, together with the results obtained by related works for the same scene and the GT (in this case the frontal direction vector) of the path taken by the IO. From the percentage of probability obtained over the frames, the error of each method with respect to the GT is computed. The estimated error for the inference of the trajectory vs. GT in the interval of interest is 0.12 for RBDHM, 0.16 for ERI and 0.23 for the proposal. The analysis is performed similarly for the information presented in Figs. 13 and 14.

Fig. 12. Comparison of proposal vs. related work.

Fig. 13. Comparison of proposal vs. related work.
For the scene with GT left direction, the errors obtained by RBDHM, ERI and the proposal are 0.33, 0.35 and 0.26, respectively (Fig. 14).

Fig. 14. Comparison of proposal vs. related work.
For the interval of interest in the scene with GT right direction, the estimated error is 0.29 for RBDHM, 0.31 for ERI and 0.25 for the proposal (Fig. 13).
The RBDHM method is quantitatively close to the GT, while the proposal and the ERI method have similar results; however, as the graphs in Figs. 12–14 show, there is oscillation over the course of the interval of the scene. This oscillation is undesirable, as it implies abrupt changes of direction.
It is important to know the execution time of the modules in order to determine whether the implemented methodology can be applied in a real physical system. Therefore, the execution time of each module over the database (frames and videos) was measured.
The box plot in Fig. 15 shows the execution time of each module used to perform the trajectory inference over the 33 videos that make up the database used in this work. For the disparity map module, an average execution time (aet) of 0.448 seconds with a standard deviation (sd) of 0.028 was obtained; for the object detection module, the aet is 0.053 seconds with an sd of 0.015; for determining the spatial position, the aet is 0.059 seconds with an sd of 0.014; for tracking, the aet is 0.153 seconds with an sd of 0.007; finally, for the inference module, the aet is 0.169 seconds with an sd of 0.012.

Fig. 15. Processing time information for each module.
Therefore, at the video capture rate of 30 fps (approximately 0.033 seconds per frame), data are sampled and processed every 5 frames, i.e., there are 5/30 ≈ 0.166 seconds between each detection and the subsequent trajectory estimation. However, the processing time currently required to detect all objects of interest and estimate their trajectories is 0.729 seconds on average. It is worth mentioning that this execution time is due to the considerable number of objects in the video scenes and the calculations that must be performed to estimate the trajectory of each one. Given these parameters, the performance still needs to be improved in order to run in real time, i.e., to process the data in approximately 0.166 seconds per update; this depends on the power of the hardware on which the processing is performed.
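The gap between the available time per update and the measured processing time can be made explicit with a short calculation over the average execution times reported for each module. Note that the reported 0.729 seconds coincides with the sum of all modules except tracking; reading it that way is an assumption of this sketch, as are the dictionary keys and function names.

```python
# Average execution times in seconds, as reported for each module.
MODULE_TIMES = {
    "disparity_map": 0.448,
    "object_detection": 0.053,
    "spatial_position": 0.059,
    "tracking": 0.153,
    "inference": 0.169,
}

FPS = 30
FRAMES_PER_UPDATE = 5


def update_budget(fps: int = FPS, frames: int = FRAMES_PER_UPDATE) -> float:
    """Time available between two consecutive updates, in seconds."""
    return frames / fps


def pipeline_time(modules) -> float:
    """Sequential execution time of the selected modules, in seconds."""
    return sum(MODULE_TIMES[name] for name in modules)


budget = update_budget()
all_modules = pipeline_time(MODULE_TIMES)
without_tracking = pipeline_time(m for m in MODULE_TIMES if m != "tracking")
slowdown = without_tracking / budget  # how many times slower than real time
```

Under these assumptions the pipeline is a little over four times slower than the real-time budget, and the disparity map alone accounts for more than half of the processing time, which makes it the natural first target for optimization.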
Conclusions
In this paper, a novel DBN topology is presented to infer the probabilities of path change from the information obtained in video by modeling the spatio-temporal characteristics of the motion of the detected objects.
The experiments performed so far provide data on the performance of the method as well as on features of the DBN topology implementation. Specifically, the DBN is qualitatively able to determine path changes, and quantitatively it obtains normalized collision probability parameters that differentiate objects with collision risk.
On the other hand, the proposal presents consistent results (slight changes) with a smooth motion close to what a driver observes (according to the results shown in Figs. 6–8 and 12–14 and in Table 1). It determines the probabilities of direction change with an upper limit of 0.75 (75%), because the proposed DBN divides the causal relationships of the path inference not only among the direction vectors but also takes into consideration the risk of collision associated with each change of direction.
Therefore, a probability ≥ 0.75 determined by the proposed method corresponds to a result similar to the 0.80–0.90 obtained by RBDHM or ERI.
As future work, it is proposed to analyze and complement the presented approach to trajectory inference in vehicular environments by obtaining a lower error with respect to the spatial location (distance of the objects with respect to the ego-vehicle), as well as by increasing the number of direction vectors to infer (>3).
The proposal can be improved: although it does not make such abrupt changes in the direction vectors, the estimation of the next position with respect to the GT can still be smoothed, so it is important to analyze and, if necessary, expand the causal relationships of the variables in the DBN topology.
Finally, as complementary future work, run time improvements are considered, both in the disparity mapping stage and in path inference, in order to test our approach in real-time cases with more powerful hardware than the one currently used.
Acknowledgments
The first author is grateful for the support provided by CONACYT scholarship number 700546.
