Moving object detection and trajectory prediction based on image processing

Abstract

Detecting a moving object in video images and predicting its trajectory requires image processing, feature extraction, and localization of the object. This paper therefore designs an optimized Kalman-Elman (KE) algorithm for trajectory prediction. To remove the noise in the measured values of the Kalman filter algorithm and to avoid the random setting of the initial weights and thresholds of the Elman neural network, we encode these parameters and improve the two algorithms by using Particle Swarm Optimization (PSO). Quantitative values of the extracted object features are used as input parameters of the Elman neural network. After extensive training, we obtain the predicted position of the moving object. The experimental results show that the prediction error of this method is significantly smaller than that of previous methods.

Keywords

Image processing, feature extraction, trajectory prediction, Particle Swarm Optimization

1. Introduction

With the development of computer vision, video image processing has been further developed and applied. Relevant methods make it possible to locate a moving object and predict its trajectory [1]. Compared with a static object in a single image, a moving object in video offers more valuable information and has a wider range of applications. In a related study [2], the authors proposed a tracking algorithm based on sparse representation and nonlinear resampling by combining them with an improved particle filtering algorithm. The method obtained the number of duplicates of each particle from its probability and the number of all particles in the set, thus maintaining the diversity of particles effectively. Reference [3] proposed a vision tracking algorithm based on a quantum genetic algorithm, which has high tracking accuracy and fast tracking speed. However, the algorithm needed to increase the number of evolutionary generations to ensure convergence. Reference [4] used first-person video to predict the future position of pedestrians. First-person videos usually contain significant self-motion, which affects the position of the person in future frames. Reference [5] proposed a multi-flow recurrent neural network model to predict the future position of vehicles. The model captured object position, scale, and pixel-level observation data separately. However, this method did not take into account that pedestrians on real roads could occlude the target vehicle. Yang et al. [6] mainly used supervised learning to predict the trajectories of pedestrians. The processed tracking trajectories became smoother and more accurate than the original ones. It is worth noting that the method only considers a linear regression problem, which is the simplest case in reality. Zhang et al. [7] proposed a multi-vehicle tracking and detection model based on super-pixel segmentation. The model combined the difference method with the super-pixel segmentation method and performed well at a low computational cost in open scenarios. Challenges such as occlusion and view variation remain obstacles in the field of vision tracking. Reference [8] proposed a new generic multi-view tracking framework. One of its key innovations for visual object tracking with multi-camera inputs was a cross-camera trajectory prediction network that addressed the problem of objects that go missing, for example because of occlusion. Recent research [9] has explored goal-driven trajectory prediction methods that explicitly estimate the long-term goals of agents to help predict their future behavior. While these models take an important step forward, they adopt the oversimplified assumption that the agent’s intentions are represented by only a single long-term goal.

This paper mainly uses the Elman neural network, which is a type of Recurrent Neural Network (RNN) [10, 11]. As a typical dynamic neural network, the Elman neural network is widely used and has good adaptability and computing capability. In addition, this paper uses PSO to optimize the noise in the measured values of the Kalman filter algorithm and the initial weights and thresholds of the Elman neural network. We then train the Elman neural network in combination with the feature extraction method. Compared with existing object detection and trajectory prediction algorithms, the algorithm proposed in this paper reduces the error and improves the prediction accuracy effectively. Our main contributions are as follows:

1) We propose a feature extraction method that uses the Gaussian Mixture Model (GMM) for foreground image extraction [12].

2) We propose a trajectory prediction method based on image feature fusion. The method combines the Elman neural network, the Kalman filtering algorithm, and PSO.

3) We evaluate the trajectory conformance rate of the original Kalman algorithm, the improved Kalman algorithm, the original Elman algorithm, and our method. The experimental results show that the population size of the KE method affects the trajectory prediction. Moreover, the prediction effect of our method is superior to that of the compared methods.

2. Moving object feature extraction method

2.1 Object feature extraction method

This paper uses the object feature extraction method to detect the moving object in the video image. It focuses on image enhancement and segmentation technologies, such as image graying, image binarization, edge detection [13], filtering and noise reduction, morphological processing, and other image processing operations.

Object feature extraction can be divided into two categories: subject extraction and detail extraction.

The subject extraction part applies the GMM to the moving object. A model is established from an object image and a background image in order to detect the moving object preliminarily. We then eliminate the noise in the image by combining median filtering and Gaussian filtering.

Because the background image of successive frames in the dataset will change slightly, it is necessary to reconstruct and update the background. We take the background image of successive $n$ frames for averaging and use the updated $\bar{B}$ as the background image in the experiment:

$\displaystyle\bar{B}=\frac{1}{n}({B_{1}+B_{2}+\ldots+B_{n}})$ (1)
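
As a minimal sketch of Eq. (1), the following pure-Python snippet averages $n$ equally sized grayscale frames pixel by pixel. The function name `average_background` and the small 2x2 frames are hypothetical and only serve as an illustration; the paper itself reports using Matlab.

```python
def average_background(frames):
    """Pixel-wise mean of n equally sized grayscale frames, as in Eq. (1)."""
    n = len(frames)
    if n == 0:
        raise ValueError("at least one background frame is required")
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [sum(frame[r][c] for frame in frames) / n for c in range(cols)]
        for r in range(rows)
    ]


# Three hypothetical 2x2 background frames with slight illumination drift.
b1 = [[10, 20], [30, 40]]
b2 = [[12, 22], [30, 43]]
b3 = [[14, 18], [30, 37]]
b_bar = average_background([b1, b2, b3])
```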

For each pixel in a video image, its color can be modeled by a Gaussian distribution. When extracting foreground images, however, a single Gaussian distribution does not model the pixel well enough, nor is it sensitive enough to changes in the scene. Therefore, we use a combination of multiple Gaussian distributions to model the distribution.

In the GMM, assuming that the color sequence of each pixel point is represented as $\{{x_{1},x_{2},\ldots,x_{T}}\}$ , the probability density function $p({x_{T}})$ can be expressed as:

$\displaystyle p({x_{T}})=\sum\limits_{i=1}^{j}\omega_{i,T}\ast\eta({x_{T},\mu_{i,T},\tau_{i,T}})$ (2)

where $\eta({x_{T},\mu_{i,T},\tau_{i,T}})$ represents the $i$-th Gaussian distribution at moment $T$, $\omega_{i,T}$ represents the weight of the $i$-th Gaussian distribution at moment $T$ (the weights sum to 1), and $\mu_{i,T}$ and $\tau_{i,T}$ represent the mean and covariance matrix of the $i$-th Gaussian distribution at moment $T$.
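
To make Eq. (2) concrete, the sketch below evaluates the mixture density for a single grayscale pixel value. It assumes scalar intensities, so the covariance matrix $\tau_{i,T}$ reduces to a variance; the component weights, means, and variances are invented for illustration only.

```python
import math


def gaussian(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_density(x, weights, means, variances):
    """Weighted sum of Gaussian densities, mirroring Eq. (2) for a scalar pixel."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances))


# Hypothetical two-component model of one pixel's intensity history.
weights, means, variances = [0.7, 0.3], [100.0, 180.0], [25.0, 100.0]
p_near_background = mixture_density(102.0, weights, means, variances)
p_unusual_value = mixture_density(140.0, weights, means, variances)
```

A pixel value that is well explained by one of the components receives a high density, whereas a value far from every component receives a low density and is a candidate foreground pixel.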

The detail extraction performs the basic image processing steps on the moving object found by the subject extraction. These steps include edge detection, filtering and noise reduction, morphological processing, and object region marking. For edge detection, this paper uses the second-order Canny operator. For filtering and noise reduction, a combination of median filtering and Gaussian filtering is mainly used to eliminate impulse noise and Gaussian noise. For morphological processing, this paper combines dilation and erosion. For object region marking, this paper uses the connected region marking method: in the final foreground image obtained by detail extraction, the minimum rectangle surrounding the object region is obtained by examining each pixel and its connectivity with adjacent pixels. Finally, we track the object and obtain its feature information.
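
The connected region marking step can be sketched as follows. This is an illustrative breadth-first labeling under the assumption of eight-connectivity on a binary mask stored as nested lists; `mark_regions` and the returned field names are hypothetical and not taken from the paper.

```python
from collections import deque


def mark_regions(mask):
    """Return box, area, and centroid of every 8-connected foreground region."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r0 in range(rows):
        for c0 in range(cols):
            if mask[r0][c0] != 1 or seen[r0][c0]:
                continue
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            pixels = []
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                # Visit the eight neighbours (the centre pixel is already seen).
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and mask[nr][nc] == 1 and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            rs = [p[0] for p in pixels]
            cs = [p[1] for p in pixels]
            regions.append({
                "box": (min(rs), min(cs), max(rs), max(cs)),  # minimum rectangle
                "area": len(pixels),                           # pixel count
                "centroid": (sum(rs) / len(rs), sum(cs) / len(cs)),
            })
    return regions


mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
regions = mark_regions(mask)
```

In this toy mask the two diagonal pixels in the lower right form a single region because diagonal neighbours count as connected under eight-connectivity.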

The object feature extraction method proposed in this paper is as follows:

Input: Average background image $\bar{B}$, video image of the current frame $i$

Output: Final foreground image $H_{i}$, object feature information

Subject extraction part

Step 1: Use the GMM to detect the foreground object: input the average background image $\bar{B}$ and the video image of the current frame $i$, construct a GMM model for operation and segmentation, and obtain the binary foreground image $F_{i}$.

Step 2: Perform filtering and noise reduction on the foreground image $F_{i}$ after segmentation.

Detail extraction part

Step 3: Use the Canny operator to detect the background edge and the image edge of the current frame $i$. Denote the generated binary edge images as $A_{i}$ and $B_{i}$, respectively.

Step 4: Add a blank image $C_{i}$ with the same size as $A_{i}$ and $B_{i}$. For each pixel $({j,k})$ of $A_{i}$ and the corresponding pixel of $B_{i}$: if $A_{i}({j,k})=B_{i}({j,k})$, then $C_{i}({j,k})=0$; otherwise $C_{i}({j,k})=1$. Process all pixels in this way until a complete binary image $C_{i}$ is generated.

Step 5: Add each pixel in the foreground image $F_{i}$ and the binary image $C_{i}$ to obtain a new image $D_{i}$.

Step 6: Add each pixel in the foreground image $F_{i}$ and the image $D_{i}$ to obtain a new foreground image $G_{i}$.

Step 7: Obtain the final foreground image $H_{i}$ by filtering and noise reduction and morphological processing of the new foreground image $G_{i}$.

Step 8: Mark the object region of the final foreground image $H_{i}$ and extract the feature information.
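
Step 4 of the procedure amounts to a pixel-wise comparison of the two binary edge images. The sketch below illustrates only that step, with edge maps stored as nested lists; the function name `edge_difference` and the toy edge maps are hypothetical.

```python
def edge_difference(a, b):
    """Step 4: C(j,k) = 0 where the edge maps agree and 1 where they differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("edge images must have the same size")
    return [
        [0 if pa == pb else 1 for pa, pb in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]


# Hypothetical edge maps: the background contains one horizontal edge,
# the current frame additionally contains the outline of a small object.
a_i = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
b_i = [[1, 1, 1, 1], [0, 1, 1, 0], [0, 1, 1, 0]]
c_i = edge_difference(a_i, b_i)
```

The static background edge cancels out, so only the edges introduced by the moving object remain in $C_{i}$.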

2.2 Quantitative methods of object features

In order to improve the accuracy of the Elman neural network algorithm and reduce the prediction error, the input parameters of the Elman neural network need to be set. The specific method is to quantify the extracted object feature, including the geometric and motion features of the object.

Geometric features are used to describe the image characteristics of the object, including the area of the minimum rectangle surrounding the region, the actual number of pixels in the region image, and the number of pixels in the convex region image. The latter two values are obtained from feature extraction.

Motion features are used to describe the position and motion status of the object, including the centroid coordinates $C$, the predicted coordinates of the Kalman filter algorithm, the position information of the minimum rectangle $R_{p}$, the average velocity components $\bar{v}_{x}$ and $\bar{v}_{y}$, the average acceleration $\bar{a}_{x}$, the horizontal displacement $X_{x}$, the vertical displacement $X_{y}$, and the distance $s$. The predicted coordinates of the Kalman filter algorithm are obtained by the KE trajectory prediction method described in Section 3. $C$ and $R_{p}$ are obtained by feature extraction. $\bar{v}_{x}$ is obtained from the difference ${\Delta}x$ between the abscissas of the centroid in adjacent frames and the inter-frame time ${\Delta}t$. $\bar{a}_{x}$ is obtained from the difference ${\Delta}v_{x}$ between the horizontal velocities of adjacent frames and the inter-frame time ${\Delta}t$. $X_{x}$ is the difference ${\Delta}x$ between the abscissas of the centroid in adjacent frames. The vertical quantities $\bar{v}_{y}$ and $X_{y}$ follow in the same way from the ordinates of the centroid.
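
The quantification of the motion features can be illustrated with a short sketch. It assumes that the centroids of consecutive frames are given as $(x, y)$ pairs and that the distance $s$ is the Euclidean distance between adjacent centroids, which the text does not state explicitly; the function name and dictionary keys are hypothetical.

```python
import math


def motion_features(centroids, dt):
    """Per-frame motion features from consecutive centroid positions."""
    features = []
    prev_vx = None
    for (x0, y0), (x1, y1) in zip(centroids, centroids[1:]):
        dx, dy = x1 - x0, y1 - y0          # displacements X_x and X_y
        vx, vy = dx / dt, dy / dt          # average velocities over one interval
        # Acceleration needs two velocity samples, so it is undefined at first.
        ax = None if prev_vx is None else (vx - prev_vx) / dt
        features.append({
            "X_x": dx, "X_y": dy,
            "v_x": vx, "v_y": vy,
            "a_x": ax,
            "s": math.hypot(dx, dy),        # assumed definition of the distance s
        })
        prev_vx = vx
    return features


# Hypothetical centroids of three consecutive frames, 0.5 s apart.
feats = motion_features([(0.0, 0.0), (3.0, 4.0), (9.0, 12.0)], dt=0.5)
```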

The values obtained above are used as the input parameters of the Elman neural network.

3. KE trajectory prediction method

In this section, we will introduce the process of trajectory prediction. In order to eliminate the noise in the measured values of the Kalman filter algorithm and solve the problem of the random setting of initial weights and thresholds of the Elman neural network, the proposed optimized Kalman-Elman (KE) trajectory prediction is made up of three stages:

1) Object feature extraction stage

At this stage, we preprocess the input dataset to obtain the quantitative value of object features and detect the moving object.

2) Kalman optimization stage

At this stage, we encode the noise in the measured values of the Kalman filter algorithm: assuming that there are $m$ sample values, the noise in the $m$ measured values is generated, and particles of size $1\times m$ are obtained.

In PSO, we use the Euclidean distance between the predicted position and the true position as the fitness of each particle:

$\displaystyle\textit{fit}_{1}=\sqrt{({x_{1}-x_{0}})^{2}+({y_{1}-y_{0}})^{2}}$ (3)

where $P_{t}({x_{1},y_{1}})$ represents the object position initially predicted by the Kalman filtering algorithm, $T_{t}({x_{0},y_{0}})$ represents the real object position, and $\textit{fit}_{1}$ represents the distance between the two positions.

After determining the fitness of the particle, it is necessary to calculate the optimal position of the individual particle and the global optimal position. We determine whether the termination condition of PSO has been reached. If so, output the optimal solution and decode the particle. If not, update the particle’s position and velocity. If the updated particles do not satisfy the termination conditions of PSO, iterative updating is required until the condition is met.

The equations used in the process of updating the velocity and position of the particle are shown as:

$\displaystyle S_{1}=c_{1}*r_{1}*({\textit{best}_{1}-X_{i-1}})$ (4)

$\displaystyle S_{2}=c_{2}*r_{2}*({\textit{best}_{2}-X_{i-1}})$ (5)

$\displaystyle V_{i}=W*V_{i-1}+S_{1}+S_{2}$ (6)

$\displaystyle X_{i}=X_{i-1}+V_{i}$ (7)

where $c_{1}$ and $c_{2}$ represent the learning factors, and $r_{1},r_{2}\in[{0,1}]$ are random numbers. $\textit{best}_{1}$ and $\textit{best}_{2}$ represent the best position found by the particle itself and the current global best position, respectively. $S_{1}$ and $S_{2}$ represent the self and global gain values. $W$ represents the inertia weight. $V_{i}$ and $X_{i}$ represent the velocity and position of the particle, respectively.
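
The update rules of Eqs. (4)-(7) can be sketched as a small, generic PSO loop. This is an illustration rather than the configuration used in the experiments: the values of $W$, $c_{1}$, and $c_{2}$, the initialization range, the fixed iteration count as termination condition, and the toy fitness function are all assumptions.

```python
import math
import random


def pso_minimize(fitness, dim, n_particles=10, n_iter=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` with the velocity/position updates of Eqs. (4)-(7)."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    best1 = [x[:] for x in xs]                    # best position of each particle
    best1_fit = [fitness(x) for x in xs]
    g = min(range(n_particles), key=best1_fit.__getitem__)
    best2, best2_fit = best1[g][:], best1_fit[g]  # global best position
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                s1 = c1 * rng.random() * (best1[i][d] - xs[i][d])  # Eq. (4)
                s2 = c2 * rng.random() * (best2[d] - xs[i][d])     # Eq. (5)
                vs[i][d] = w * vs[i][d] + s1 + s2                  # Eq. (6)
                xs[i][d] += vs[i][d]                               # Eq. (7)
            f = fitness(xs[i])
            if f < best1_fit[i]:
                best1[i], best1_fit[i] = xs[i][:], f
                if f < best2_fit:
                    best2, best2_fit = xs[i][:], f
    return best2, best2_fit


# Toy fitness in the spirit of Eq. (3): distance to a hypothetical true position.
target = (3.0, -2.0)
fit = lambda p: math.hypot(p[0] - target[0], p[1] - target[1])
best, best_fit = pso_minimize(fit, dim=2, n_particles=10, n_iter=200)
```

Because the personal and global bests are only replaced by strictly better positions, the returned fitness never exceeds the best fitness of the initial swarm.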

Finally, the particle corresponding to the optimal PSO solution is decoded and used as the noise in the measured values of the improved Kalman algorithm. We obtain the prediction results of the Kalman optimization stage by running the improved Kalman algorithm.
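
The paper does not spell out the Kalman model, so the following is only a sketch under stated assumptions: a constant-velocity filter along one coordinate axis, with the decoded particle interpreted as one measurement-noise variance per update step. The process noise `q`, the initial covariance, and the function name are hypothetical.

```python
def kalman_predict_positions(measurements, noise_vars, dt=1.0, q=1e-3, v0=0.0):
    """One-step-ahead position predictions of a constant-velocity Kalman filter.

    measurements: positions z_0..z_m along one axis.
    noise_vars:   m positive measurement-noise variances (the decoded particle).
    """
    if len(noise_vars) != len(measurements) - 1:
        raise ValueError("need one noise value per measurement after the first")
    pos, vel = float(measurements[0]), float(v0)
    p00, p01, p10, p11 = 1.0, 0.0, 0.0, 1.0   # assumed initial covariance
    predictions = []
    for z, r in zip(measurements[1:], noise_vars):
        # Predict: x = F x and P = F P F^T + Q with F = [[1, dt], [0, 1]].
        pos = pos + dt * vel
        a00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
        a01 = p01 + dt * p11
        a10 = p10 + dt * p11
        a11 = p11 + q
        predictions.append(pos)
        # Update with measurement z, H = [1, 0], and noise variance r.
        s = a00 + r
        k0, k1 = a00 / s, a10 / s
        innovation = z - pos
        pos, vel = pos + k0 * innovation, vel + k1 * innovation
        p00, p01 = (1.0 - k0) * a00, (1.0 - k0) * a01
        p10, p11 = a10 - k1 * a00, a11 - k1 * a01
    return predictions


# Hypothetical abscissas of a centroid moving two pixels per frame.
track = [0.0, 2.0, 4.0, 6.0]
exact = kalman_predict_positions(track, [0.5, 0.5, 0.5], v0=2.0)
trusting = kalman_predict_positions(track, [0.01, 0.01, 0.01])
distrusting = kalman_predict_positions(track, [100.0, 100.0, 100.0])
```

Under this reading, a PSO particle would supply `noise_vars`, and its fitness would be the distance between the resulting predictions and the true positions, as in Eq. (3).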

3) Elman optimization stage

At this stage, we first encode the initial weights and thresholds of the Elman neural network. Assume that the input layer of the Elman neural network has $a$ nodes, the hidden layer has $b$ nodes, and the output layer has $c$ nodes. There are then $(a\ast b+b\ast c)$ weights and $(b+c)$ threshold values, so each particle has a size of $1\ast(a\ast b+b\ast c+b+c)$. We take the first $p\%$ of the total sample as the training sample: the quantitative values of the object features are used as the training input parameters of the Elman neural network, and the corresponding real results are used as the training output values. We use the remaining $(1-p\%)$ of the quantitative values as the test input parameters of the Elman neural network and finally test the network output for the object.
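
The encoding can be illustrated by decoding a flat particle back into weight matrices and threshold vectors. The ordering of the entries inside the particle (input-to-hidden weights, hidden-to-output weights, hidden thresholds, output thresholds) is an assumption made for this sketch, as are the function names.

```python
def particle_length(a, b, c):
    """Number of encoded parameters: a*b + b*c weights plus b + c thresholds."""
    return a * b + b * c + b + c


def decode_particle(particle, a, b, c):
    """Split a flat particle into Elman weights and thresholds (assumed order)."""
    if len(particle) != particle_length(a, b, c):
        raise ValueError("particle has the wrong length")
    it = iter(particle)
    w_in = [[next(it) for _ in range(b)] for _ in range(a)]    # input -> hidden
    w_out = [[next(it) for _ in range(c)] for _ in range(b)]   # hidden -> output
    th_hidden = [next(it) for _ in range(b)]
    th_out = [next(it) for _ in range(c)]
    return w_in, w_out, th_hidden, th_out


def split_samples(samples, p):
    """First p percent of the samples for training, the remainder for testing."""
    cut = int(len(samples) * p / 100)
    return samples[:cut], samples[cut:]


# Hypothetical network with 3 input, 2 hidden, and 2 output nodes.
particle = list(range(particle_length(3, 2, 2)))
w_in, w_out, th_hidden, th_out = decode_particle(particle, 3, 2, 2)
train, test = split_samples(list(range(10)), p=70)
```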

In PSO, we use the Euclidean distance between the predicted position and the true position as the fitness of each particle:

$\displaystyle\textit{fit}_{2}=\sqrt{({x_{1}-x_{0}})^{2}+({y_{1}-y_{0}})^{2}}$ (8)

where $P_{t}({x_{1},y_{1}})$ represents the object position initially predicted by the Elman neural network, $T_{t}({x_{0},y_{0}})$ represents the real object position, and $\textit{fit}_{2}$ represents the distance between the two positions.

After determining the fitness of the particle, the remaining steps are the same as in the Kalman optimization stage. Finally, the particle corresponding to the optimal PSO solution is decoded and used as the initial weights and thresholds of the Elman neural network. We obtain the final prediction result for the object by retraining and testing the Elman neural network.

The flow diagram of the Kalman-Elman method is shown in Fig. 1.

Figure 1.

Diagram of the Kalman-Elman method.

4. Experiment and analysis

The experiment uses the Traffic Intersection video dataset provided by LIMU of Kyushu University, Japan (Advanced Department of Information Technology). The dataset contains video images of multiple vehicles moving at intersections, together with ground truth data for evaluating the movement. We use Matlab for testing. We select some continuous images from the dataset as experimental data and perform trajectory prediction for the moving objects.

4.1 Dataset image preprocessing

Figure 2a represents the grayscale image of the moving object. Figure 2b represents the binary image of the moving object obtained by the GMM model. Figure 2c represents the binary image after filter processing.

Figure 2.

Results of image processing.

Figure 2d represents the final foreground image of a single moving object. Figure 2e shows the result of using the eight-connected method to delineate the minimum rectangle and locate the centroid of the detected object; the geometric center of the object region is taken as the real location.

The object detection and feature extraction method proposed in this paper is compared with the traditional background difference method, the GMM method, and the ViBe [14] method. FPS (frames per second) and FDR are two measurements that can be used to compare different detection methods on continuous frame images:

$\displaystyle\textit{FPS}=\frac{F}{T}$ (9)

$\displaystyle\textit{FDR}=\frac{B_{1}\cap B_{2}}{B_{1}\cup B_{2}}$ (10)

It can be seen in Table 1 that the method proposed in this paper has the best results in foreground detection, with the highest average FDR value. However, it is inferior to other methods in terms of processing speed.

Table 1

Comparison of different object detection methods

Method	Average FPS (frame/s)	Average FDR (%)
Background difference method	52	43.2
GMM method	41	65.6
ViBe method	81	68.9
Our method	39	80.3
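
Eqs. (9) and (10) can be evaluated as in the sketch below. The text does not define the symbols, so the snippet assumes that $F$ is the number of processed frames, $T$ the elapsed time in seconds, and $B_{1}$ and $B_{2}$ the detected and ground-truth binary foreground masks, with the ratio taken over pixel counts. The function names and toy masks are hypothetical.

```python
def fps(frames, seconds):
    """Eq. (9): processed frames divided by elapsed time."""
    return frames / seconds


def fdr(detected, truth):
    """Eq. (10): pixel count of the intersection over that of the union."""
    inter = union = 0
    for row_d, row_t in zip(detected, truth):
        for d, t in zip(row_d, row_t):
            inter += 1 if (d and t) else 0
            union += 1 if (d or t) else 0
    # The ratio is undefined for two empty masks; 0.0 is returned by convention.
    return inter / union if union else 0.0


detected = [[1, 1, 0], [0, 1, 0]]   # hypothetical detection result
truth = [[1, 1, 0], [0, 0, 1]]      # hypothetical ground truth
score = fdr(detected, truth)
rate = fps(120, 3.0)
```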

4.2 Results and analysis of experiments

This experiment focuses on trajectory prediction for different moving objects in the processed dataset. The purpose is to predict the possible position of the object in the next frame based on the object information in the current frame. We try to reduce the error of the predicted position as much as possible and improve the prediction ability. On this basis, we compare and analyze the prediction effects of the Kalman filtering algorithm, the Elman neural network algorithm, and the KE method comprehensively.

Figure 3.

Tendency diagram of PSO fitness.

In this experiment, the PSO fitness value is used as the index for evaluating particles: the smaller the fitness of a particle, the better. Figure 3 shows the fitness of the KE method against the iteration number. As the number of PSO iterations increases, the fitness of each population keeps decreasing until it reaches a steady state. We set the population size to 5, 10, and 20 in turn. It can be concluded that, within a certain range, the larger the particle swarm is, the smaller the final fitness and the better the performance will be.

By summing up the fitness of each particle, the cumulative prediction error $C_{e}$ is calculated as follows:

$\displaystyle C_{e}=\sum\limits_{i=1}^{D}{\sqrt{({x_{i}-x_{0}})^{2}+({y_{i}-y_{0}})^{2}}}$ (11)

where $P_{t}({x_{i},y_{i}})$ represents the predicted object location, $T_{t}({x_{0},y_{0}})$ represents the real object location, and $D$ represents the number of images in the dataset.

Table 2

Trajectory conformance rate

Method of prediction	TC (%)
Original Kalman algorithm	8.2
Improved Kalman algorithm	34.7
Original Elman algorithm	18.4
KE method (population of 5)	77.6
KE method (population of 10)	87.8
KE method (population of 20)	91.8

Figure 4.

Accumulated predictive errors comparison diagram of different methods.

Figure 5.

Predictive abscissas of different methods.

Figure 6.

Predictive ordinates of different methods.

Figure 7.

Comparison diagrams of moving object trajectory prediction.

We compare the predicted trajectory data with the real data. If the cumulative prediction error is within a certain pixel range ($C_{e}\leqslant\alpha$), the trajectory is regarded as highly conforming. The trajectory conformance rate $TC$ is calculated as follows:

$\displaystyle TC=\frac{T_{CD}}{T}$ (12)

where $T_{CD}$ represents the number of conforming trajectory data and $T$ represents the total number of real data.
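
A possible reading of Eqs. (11) and (12) is sketched below. It assumes that each trajectory is a list of predicted positions paired with the real positions of the same frames, and that a trajectory counts as conforming when its cumulative error does not exceed $\alpha$; the function names, the sample trajectories, and the value of $\alpha$ are hypothetical.

```python
import math


def cumulative_error(predicted, real):
    """Eq. (11): sum of Euclidean distances between predicted and real positions."""
    return sum(
        math.hypot(px - rx, py - ry)
        for (px, py), (rx, ry) in zip(predicted, real)
    )


def conformance_rate(tracks, alpha):
    """Eq. (12): share of trajectories whose cumulative error is at most alpha."""
    conforming = sum(
        1 for predicted, real in tracks
        if cumulative_error(predicted, real) <= alpha
    )
    return conforming / len(tracks)


# Two hypothetical trajectories given as (predicted, real) position lists.
tracks = [
    ([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)]),   # cumulative error 5
    ([(1.0, 1.0), (2.0, 2.0)], [(1.0, 1.0), (2.0, 2.0)]),   # cumulative error 0
]
tc = conformance_rate(tracks, alpha=2.0)
```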

After analyzing the cumulative prediction error value $C_{e}$ and the trajectory conformance rate $TC$ of different methods, we compare the prediction effects of these methods comprehensively.

Figure 4 compares the cumulative prediction error $C_{e}$ of the original Kalman algorithm (Method 1), the improved Kalman algorithm (Method 2), the original Elman algorithm (Method 3), and the KE method (Methods 4–6) on the same video dataset. The population size of the KE method is set to 5, 10, and 20, respectively, with 200 iterations. The results show that the prediction error of the KE method decreases with continued optimization. Compared with Methods 1–5, the prediction error of Method 6 is reduced by 82.2%, 65.8%, 68.8%, 12.9%, and 4.0%, respectively, which greatly improves the prediction accuracy.

It can be seen from Table 2 that the trajectory conformance rate of the KE method is higher than that of the other methods. The KE method with a population of 20 has the highest TC value.

Figures 5 and 6 show the predicted and real coordinates of the moving object under different methods. The figures also show that the positions predicted by the KE method match the real positions more closely, and that its predicted movement trends are more similar to the true movement trends.

Figure 7a compares the predicted trajectories of different methods with the real situation. Figures 7b and 7c compare the predicted trajectories of the KE-20 algorithm with the real situation.

Based on these experiments, the KE-20 algorithm has the minimum prediction error, strong network training ability, and a good prediction effect.

5. Conclusion

This paper proposes an optimized KE trajectory prediction method to track and predict the position of the moving object in the video image. First, we perform object feature extraction and object detection on the video image. We then use PSO to improve the Kalman and Elman algorithms. Next, through continuous network training, we predict the position of the moving object at the next moment. The results reveal that the prediction effect of our method is superior to that of the compared methods and that it improves the accuracy of moving object trajectory prediction. However, the proposed method still has some problems that need to be explored and improved upon in the future. On certain traffic roads, the motion of vehicles and pedestrians may be more complex at particular spatial and temporal scales. This makes the motion environment more complex and requires deeper study before the method is applied to different datasets. In addition, the prediction accuracy is also related to the parameter settings of the algorithm. Future work should focus on reducing the training time and extracting more representative features in order to achieve better training and testing results.

References

1. An overview of target tracking technology. Ship Electronic Engineering. 2018; 38(12): 6-9.

2. Fan, Weng, Jiang, et al. Particle filter object tracking algorithm based on sparse representation and nonlinear resampling. Journal of Beijing Institute of Technology. 2018; 27(1): 51-57.

3. Jin, Hou, et al. Target tracking approach via quantum genetic algorithm. IET Computer Vision. 2018; 12(3): 241-251.

4. Yagi, Mangalam, Yonetani, et al. Future Person Localization in First-Person Videos. arXiv:1711.11217v2 [cs.CV] 28 Mar 2018.

5. Yao, Choi, et al. Egocentric Vision-based Future Vehicle Localization for Intelligent Driving Assistance Systems. arXiv:1809.07408v2 [cs.CV] 3 Mar 2019.

6. Yang, Hao, Ding, et al. Anomaly detection and reconciliation of pedestrian tracking trajectory. Control & Decision Conference. 2017.

7. Zhang. An efficient and flexible approach for multiple vehicle tracking in the aerial video sequence. International Journal of Remote Sensing. 2018; (1): 1-32.

8. Ling, et al. Visual tracking with multiview trajectory prediction. IEEE Transactions on Image Processing. 2020; 29: 8355-8367.

9. Yao, Atkins, Johnson-Roberson, et al. Bitrap: Bi-directional pedestrian trajectory prediction with multi-modal goal estimation. IEEE Robotics and Automation Letters. 2021; 6(2): 1463-1470.

10. Sutskever, Martens, Hinton. Generating Text with Recurrent Neural Networks. International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 – July. DBLP, 2016: pp. 1017-1024.

11. Ketkar. Recurrent Neural Networks. Deep Learning with Python. Apress, 2017: 7.

12. Dadi, Pillutla, GKM, Makkena. Face Recognition and Human Tracking Using GMM, HOG and SVM in Surveillance Videos. Annals of Data Science. 2017; (4): 1-23.

13. Magnier. Edge detection: A review of dissimilarity evaluations and a proposed normalized measure. Multimedia Tools & Applications. 2017; 77(5): 1-45.

14. Barnich, Van Droogenbroeck. ViBe: a powerful random technique to estimate the background in video sequences. 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2009: pp. 945-948.