Abstract
For moving object detection and trajectory prediction in video images, it is necessary to perform image processing, feature extraction, and localization of the object. This paper therefore designs an optimized Kalman-Elman (KE) algorithm for trajectory prediction. To remove the noise points in the measured values of the Kalman filter algorithm and to solve the problem of randomly set initial weights and thresholds in the Elman neural network, we encode these parameters and improve the two algorithms by using Particle Swarm Optimization (PSO). Quantitative values of the extracted object features are used as input parameters of the Elman neural network. After a large amount of training, we finally obtain the predicted position of the moving object. The experimental results show that the prediction error of this method is significantly smaller than that of previous methods.
Introduction
With the development of computer vision, video image processing has been further developed and applied. It is now possible to locate a moving object and predict its trajectory [1]. Compared with the analysis of static objects in a single image, this field is more likely to yield valuable information and has a wider range of applications. In a related study [2], the authors proposed a tracking algorithm based on sparse representation and nonlinear resampling by combining an improved particle filtering algorithm. The method obtained the number of duplicates of each particle from its related probability and the number of all particles in the set, thus maintaining the diversity of particles effectively. Reference [3] proposed a vision tracking algorithm based on a quantum genetic algorithm, which has high tracking accuracy and fast tracking speed. However, the algorithm needed to increase the number of evolutionary generations to ensure convergence. Reference [4] used first-person video to predict the future position of pedestrians. First-person videos usually contain significant self-motion, which affects the position of the person in future frames. Reference [5] proposed a multi-flow recurrent neural network model to predict the future position of vehicles. The model captured object position, scale, and pixel-level observation data separately. However, this method did not take into account that real roads may contain pedestrians who occlude the target vehicle. Yang et al. [6] mainly used supervised learning to predict the trajectories of pedestrians. The processed tracking trajectories became smoother and more accurate than the original ones. It is worth noting that the method only considers the case of a linear regression problem, which is the simplest case in reality. Zhang et al. [7] proposed a multi-vehicle tracking and detection model based on super-pixel segmentation. The model combined the difference method with the super-pixel segmentation method, which demonstrated excellent performance and low computational cost in open scenarios. Challenges such as occlusion and view variation remain obstacles in the field of vision tracking. Reference [8] proposed a new generic multi-view tracking framework. One of its key innovations for visual object tracking with multi-camera inputs was a cross-camera trajectory prediction network that addressed the problem of objects missing due to, for example, occlusion. Recent research [9] has explored goal-driven trajectory prediction methods that explicitly estimate an agent’s long-term goals to help predict its future behavior. While these models take an important step forward, they adopt the oversimplified assumption that the agent’s intentions are represented by only a single long-term goal.
This paper mainly uses the Elman neural network, which is a type of Recurrent Neural Network (RNN) [10, 11]. As a typical dynamic neural network, the Elman neural network is widely used and offers good adaptability and computing capability. In addition, this paper uses PSO to optimize the noise points in the measured values of the Kalman filter algorithm and the initial weights and thresholds of the Elman neural network. Then we train the Elman neural network in combination with the feature extraction method. Compared with existing object detection and trajectory prediction algorithms, the algorithm proposed in this paper can effectively reduce the error and improve the prediction accuracy. Our main contributions are as follows:
1) We propose a feature extraction method that uses the Gaussian Mixture Model (GMM) for foreground image extraction [12].
2) We propose a trajectory prediction method based on image feature fusion. The method combines the Elman neural network, the Kalman filtering algorithm, and PSO.
3) We evaluate the trajectory conformance rate of the original Kalman algorithm, the improved Kalman algorithm, the original Elman algorithm, and our method. The experimental results show that the population size used in the KE method affects the trajectory prediction. Moreover, the prediction effect of our method is superior to that of the state-of-the-art methods.
Object feature extraction method
This paper uses the object feature extraction method to detect the moving object in the video image. It focuses on image enhancement and segmentation technologies, such as image graying, image binarization, edge detection [13], filtering and noise reduction, morphological processing, and other image processing steps.
Object feature extraction can be divided into two categories: subject extraction and detail extraction.
The subject extraction part applies the GMM to the moving object. A model is established from an input object image and a background image so that the moving object can be detected preliminarily. We eliminate the noise in the image through filtering and noise reduction that combines median filtering and Gaussian filtering.
Because the background image of successive frames in the dataset will change slightly, it is necessary to reconstruct and update the background. We take the background image of successive
For each pixel in a video image, its color can be modeled by a Gaussian distribution. When extracting foreground images, a single Gaussian distribution does not describe the pixel well enough, nor is it sensitive enough to changes in the scene. Therefore, we use a combination of multiple Gaussian distributions to describe the color distribution more accurately.
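As a minimal sketch of the idea above, the following Python snippet models one grayscale pixel with a weighted mixture of Gaussians and labels a new value as background when it lies close to one of the components. The function names, the 2.5-standard-deviation matching rule, and the example parameters are illustrative assumptions and are not taken from this paper.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_density(x, components):
    """Weighted sum of Gaussian densities; components = [(weight, mean, std), ...]."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)

def is_background(x, components, k=2.5):
    """Assumed matching rule: x is background if it lies within k standard
    deviations of the mean of any mixture component."""
    return any(abs(x - m) <= k * s for _, m, s in components)

# Hypothetical model of one pixel: two background modes (for example road and shadow).
pixel_model = [(0.7, 100.0, 5.0), (0.3, 180.0, 8.0)]

print(is_background(102.0, pixel_model))  # value near the first mode
print(is_background(140.0, pixel_model))  # value far from both modes
```

A single Gaussian would have to stretch over both modes and would therefore accept values such as 140 as background, which is why a mixture is preferred here.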
In the GMM, assuming that the color sequence of each pixel point is represented as
Where
The detail extraction performs the basic image processing steps on the moving object found by the subject extraction. These steps include edge detection, filtering and noise reduction, morphological processing, object region marking, and so on. For edge detection, this paper uses the second-order Canny operator. For filtering and noise reduction, a combination of median filtering and Gaussian filtering is mainly used to eliminate impulse noise and Gaussian noise. For morphological processing, this paper uses a combination of dilation and erosion. For object region marking, this paper uses the connected region marking method: in the final foreground image obtained by detail extraction, the minimum rectangle surrounding the object region is obtained by judging each pixel and its connectivity with adjacent pixels. Finally, we track the object and obtain its feature information.
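The connected region marking step can be illustrated with a short Python sketch. It labels foreground pixels of a binary mask by breadth-first search with 8-connectivity and then derives the minimum axis-aligned rectangle and the centroid of each region. The function names and the toy mask are hypothetical; the paper's own implementation is not shown in the text.

```python
from collections import deque

def label_regions(mask):
    """Group foreground pixels (truthy values) into 8-connected regions.
    Returns a list of regions, each a list of (row, col) pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            queue, region = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                # Visit all eight neighbours of the current pixel.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            regions.append(region)
    return regions

def bounding_rectangle(region):
    """Minimum axis-aligned rectangle as (top, left, bottom, right), inclusive."""
    ys = [p[0] for p in region]
    xs = [p[1] for p in region]
    return min(ys), min(xs), max(ys), max(xs)

def centroid(region):
    """Mean (row, col) position of the region's pixels."""
    n = len(region)
    return sum(p[0] for p in region) / n, sum(p[1] for p in region) / n

# Toy foreground mask with two objects.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
regions = label_regions(mask)
for region in regions:
    print(bounding_rectangle(region), centroid(region))
```

Under 8-connectivity the pixel at (3, 3) joins the block above it because it touches that block diagonally; with 4-connectivity it would form a separate region.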
The object feature extraction method proposed in this paper is as follows:
In order to improve the accuracy of the Elman neural network algorithm and reduce the prediction error, the input parameters of the Elman neural network need to be set. The specific method is to quantify the extracted object feature, including the geometric and motion features of the object.
Geometric features describe the image characteristics of the object, including the area of the minimum rectangle surrounding the region, the actual number of pixels in the region image, and the number of pixels in the convex region image. The latter two values are obtained from feature extraction.
Motion features are used to describe the position and motion status of the object, including the centroid coordinates
The values obtained above are used as the input parameters of the Elman neural network.
KE trajectory prediction method
In this section, we will introduce the process of trajectory prediction. In order to eliminate the noise in the measured values of the Kalman filter algorithm and solve the problem of the random setting of initial weights and thresholds of the Elman neural network, the proposed optimized Kalman-Elman (KE) trajectory prediction is made up of three stages:
1) Object feature extraction stage
At this stage, we preprocess the input dataset to obtain the quantitative value of object features and detect the moving object.
2) Kalman optimization stage
At this stage, we encode the noise in the measured values of the Kalman filter algorithm: Assuming that there are
In PSO, we use the Euclidean distance quantization error between the position prediction and the true value as the fitness of each particle.
Where
After determining the fitness of the particle, it is necessary to calculate the optimal position of the individual particle and the global optimal position. We determine whether the termination condition of PSO has been reached. If so, output the optimal solution and decode the particle. If not, update the particle’s position and velocity. If the updated particles do not satisfy the termination conditions of PSO, iterative updating is required until the condition is met.
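The loop described above can be sketched in Python as follows. This is a generic PSO under stated assumptions: it uses the common inertia-weight update rule, a fixed iteration count as the termination condition, and hypothetical parameter values (`w`, `c1`, `c2`, search bounds). The toy fitness function is the Euclidean distance to a known target, standing in for the prediction error used in this paper.

```python
import random

def pso_minimize(fitness, dim, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0, seed=0):
    """Minimise `fitness` over [lo, hi]^dim; returns (best_position, best_fitness)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # individual optimal positions
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # global optimal position
    for _ in range(n_iter):                     # assumed termination: iteration budget
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull towards personal best + pull towards global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Position update, clamped to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            f = fitness(pos[i])
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f < gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

def euclidean_error(candidate, target=(3.0, -2.0)):
    """Toy fitness: Euclidean distance between a candidate and a known target."""
    return sum((a - b) ** 2 for a, b in zip(candidate, target)) ** 0.5

best, best_fit = pso_minimize(euclidean_error, dim=2)
print(best, best_fit)
```

In the KE setting, each particle would instead encode the parameters being tuned, and the fitness would be computed by running the predictor with the decoded parameters.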
The equations used in the process of updating the velocity and position of the particle are shown as:
Where
Finally, the particle corresponding to the optimal PSO solution is decoded and used as the noise in the measured values of the improved Kalman algorithm. We obtain the prediction results of the Kalman optimization stage by running the improved Kalman algorithm.
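To make the role of the decoded noise value concrete, here is a minimal Python sketch of a one-dimensional constant-velocity Kalman filter whose measurement-noise variance `r` is a free parameter, together with a fitness function that a PSO particle could be scored with. Treating the particle as a single scalar `r`, the constant-velocity model, and the values of `q` and the initial covariance are all assumptions made for illustration.

```python
import random

def kalman_one_step_predictions(measurements, r, q=1e-3, dt=1.0):
    """Constant-velocity Kalman filter on 1-D positions.

    r: measurement-noise variance (the value a decoded particle would supply here).
    q: assumed process-noise variance.
    Returns the position predicted for every frame after the first.
    """
    p, v = measurements[0], 0.0                     # state: position, velocity
    p00, p01, p10, p11 = 100.0, 0.0, 0.0, 100.0     # large initial uncertainty
    predictions = []
    for z in measurements[1:]:
        # Predict: x = F x and P = F P F^T + Q, with F = [[1, dt], [0, 1]].
        p = p + dt * v
        n00 = p00 + dt * (p01 + p10) + dt * dt * p11 + q
        n01 = p01 + dt * p11
        n10 = p10 + dt * p11
        n11 = p11 + q
        predictions.append(p)
        # Update with the measured position z, observation matrix H = [1, 0].
        s = n00 + r                                 # innovation variance
        k0, k1 = n00 / s, n10 / s                   # Kalman gain
        y = z - p                                   # innovation
        p, v = p + k0 * y, v + k1 * y
        p00, p01 = (1 - k0) * n00, (1 - k0) * n01
        p10, p11 = n10 - k1 * n00, n11 - k1 * n01
    return predictions

def fitness(r, measurements, truth):
    """Sum of distances between one-step predictions and the true positions."""
    preds = kalman_one_step_predictions(measurements, r)
    return sum(abs(a - b) for a, b in zip(preds, truth[1:]))

# Synthetic example: an object moving 2 pixels per frame, observed with noise.
truth = [2.0 * k for k in range(30)]
rng = random.Random(1)
noisy = [t + rng.gauss(0.0, 2.0) for t in truth]
print(fitness(4.0, noisy, truth))
```

A PSO run would call `fitness` with different candidate values of `r` and keep the one that gives the smallest accumulated error.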
3) Elman optimization stage
At this stage, we first encode the initial weights and thresholds in the Elman neural network: Assuming that the input layer of the Elman neural network has
In PSO, we use the Euclidean distance quantization error between the position prediction and the true value as the fitness of each particle.
Where
After determining the fitness of the particle, the remaining steps are the same as in the Kalman optimization stage. Finally, the particle corresponding to the optimal PSO solution is decoded and used as the initial weights and thresholds of the Elman neural network. We obtain the final prediction result for the object by retraining and testing the Elman neural network.
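The encoding and decoding of an Elman network into a particle can be sketched as below. The layout of the flat vector (input weights, context weights, output weights, hidden thresholds, output thresholds), the `tanh` hidden activation, the linear output layer, and the treatment of thresholds as additive biases are assumptions chosen for illustration; the paper's exact encoding is not reproduced here.

```python
import math

def particle_length(n_in, n_hid, n_out):
    """Number of values needed to encode all weights and thresholds."""
    return n_in * n_hid + n_hid * n_hid + n_hid * n_out + n_hid + n_out

def decode(particle, n_in, n_hid, n_out):
    """Split a flat particle into the Elman network's weights and thresholds."""
    it = iter(particle)

    def take(rows, cols):
        return [[next(it) for _ in range(cols)] for _ in range(rows)]

    w_in = take(n_hid, n_in)     # input layer   -> hidden layer
    w_ctx = take(n_hid, n_hid)   # context layer -> hidden layer
    w_out = take(n_out, n_hid)   # hidden layer  -> output layer
    b_hid = [next(it) for _ in range(n_hid)]
    b_out = [next(it) for _ in range(n_out)]
    return w_in, w_ctx, w_out, b_hid, b_out

def elman_forward(sequence, params):
    """Run the network over a sequence of input vectors; returns one output per step."""
    w_in, w_ctx, w_out, b_hid, b_out = params
    context = [0.0] * len(b_hid)             # context starts empty
    outputs = []
    for x in sequence:
        hidden = [math.tanh(sum(w * xi for w, xi in zip(w_in[j], x))
                            + sum(w * c for w, c in zip(w_ctx[j], context))
                            + b_hid[j])
                  for j in range(len(b_hid))]
        outputs.append([sum(w * h for w, h in zip(w_out[k], hidden)) + b_out[k]
                        for k in range(len(b_out))])
        context = hidden                     # the context layer copies the hidden state
    return outputs

# Example: 2 input features, 3 hidden units, 2 outputs (predicted x and y).
n_in, n_hid, n_out = 2, 3, 2
particle = [0.0] * (particle_length(n_in, n_hid, n_out) - n_out) + [1.5, -0.5]
outputs = elman_forward([[0.2, 0.4], [0.3, 0.5]], decode(particle, n_in, n_hid, n_out))
print(outputs)
```

With this layout, PSO searches a space of `particle_length(n_in, n_hid, n_out)` dimensions, and the decoded values serve only as the starting point for subsequent network training.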
The flow diagram of the Kalman-Elman method is shown in Fig. 1.
Fig. 1. Diagram of the Kalman-Elman method.
The experiment uses the Traffic Intersection video dataset provided by LIMU of Kyushu University, Japan, which contains video images of multiple vehicles moving at intersections. The dataset, from the Advanced Department of Information Technology at Kyushu University, also provides ground truth data for evaluating the detected movement. We use MATLAB for testing. We select some continuous images from the dataset as experimental data and perform trajectory prediction for the moving objects.
Dataset image preprocessing
Figure 2a represents the grayscale image of the moving object. Figure 2b represents the binary image of the moving object obtained by the GMM model. Figure 2c represents the binary image after filter processing.
Fig. 2. Results of image processing.
Figure 2d represents the final foreground image of a single moving object. In Figure 2e, the eight-connected method is used to delineate the minimum rectangle and locate the centroid of the detected object, and the geometric center of the object region is taken as the real location.
The object detection and feature extraction method proposed in this paper is compared with the traditional background difference method, the GMM method and the ViBe [14] method. FPS and FDR are two measurements that can be used to compare different detection methods in continuous frame images.
It can be seen in Table 1 that the method proposed in this paper has the best results in foreground detection, with the highest average FDR value. However, it is inferior to other methods in terms of processing speed.
Table 1. Comparison of different object detection methods.
This experiment focuses on trajectory prediction for different moving objects in the processed dataset. The purpose is to predict the possible position of the object in the next frame based on the object information in the current frame. We try to reduce the error of the predicted position as much as possible and improve the prediction ability. On this basis, we compare and analyze the prediction effects of the Kalman filtering algorithm, the Elman neural network algorithm, and the KE method comprehensively.
Fig. 3. Tendency diagram of PSO fitness.
In this experiment, a low PSO fitness value is used as the index for identifying good particles: the smaller the fitness of a particle, the better. Figure 3 shows the fitness of the KE method against the iteration number. As the number of PSO iterations increases, the fitness of each population keeps decreasing until it reaches a steady state. We vary the population size, setting it to 5, 10, and 20, respectively. It can be concluded that, within a certain range, the larger the particle swarm is, the smaller the final fitness and the better the performance.
By summing up the fitness of each particle, the cumulative prediction error
Table 2. Trajectory conformance rate.
Fig. 4. Accumulated predictive errors comparison diagram of different methods.
Fig. 5. Predictive abscissas of different methods.
Fig. 6. Predictive ordinates of different methods.
Fig. 7. Comparison diagrams of moving object trajectory prediction.
Where
We compare the predicted trajectory data with the real data. If the cumulative prediction error is within a certain pixel range (
Where
After analyzing the cumulative prediction error value
Figure 4 shows the comparison of the cumulative prediction error value
It can be seen from Table 2 that the trajectory conformance rate of the KE method is higher than that of the other methods. The KE method with a population of 20 has the highest TC value.
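As an illustration of how a conformance-style metric can be computed, the Python sketch below counts the fraction of frames in which the predicted position lies within a pixel threshold of the true position. This per-frame thresholding rule, the function name, and the threshold value are assumptions made for the example; they are one plausible reading of the metric and may differ from the exact TC definition used in this paper.

```python
import math

def conformance_rate(predicted, actual, threshold):
    """Fraction of frames whose Euclidean prediction error is within `threshold` pixels."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    hits = sum(1 for p, a in zip(predicted, actual)
               if math.dist(p, a) <= threshold)
    return hits / len(predicted)

# Hypothetical predicted and true centroid positions over four frames.
predicted = [(0.0, 0.0), (3.0, 3.0), (10.0, 10.0), (1.0, 1.0)]
actual = [(0.0, 0.0), (0.0, 0.0), (10.0, 12.0), (5.0, 5.0)]
print(conformance_rate(predicted, actual, threshold=5.0))
```

A higher value means that more of the predicted trajectory stays close to the real one, which matches the way TC is interpreted in the comparison above.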
Figures 5 and 6 show the predicted and real coordinates of the moving object under different methods. From these figures, it can also be seen that the predicted positions of these methods are better matched with the real positions, and the predicted movement trends are more similar to the true movement trends.
Figure 7a shows the comparison between the predicted trajectories of the different methods and the real trajectory. Figures 7b and 7c show comparisons between the predicted trajectories of the KE-20 algorithm and the real trajectories.
Based on these experiments, the KE-20 algorithm has the minimum prediction error, strong network training ability, and a good prediction effect.
This paper proposes an optimized KE trajectory prediction method to track and predict the position of the moving object in the video image. First, we perform object feature extraction and object detection on the video image. We then use PSO to improve the Kalman and Elman algorithms. Next, through continuous network training, we predict the position of the moving object at the next moment. The results show that the prediction effect of our method is superior to that of the state-of-the-art methods and that it improves the accuracy of moving object trajectory prediction. However, the proposed method still has some problems that need to be explored and improved upon in the future. On certain traffic roads, the motion of vehicles and pedestrians may be more complex at certain spatial and temporal scales. This makes the motion environment more complex, and it needs to be studied more deeply for applications to different datasets. In addition, the prediction accuracy is also related to the parameter settings of the algorithm. Future work should focus on reducing the training time and extracting more representative features in order to achieve better training and testing results.
