Abstract
With the rapid development of deep learning, convolutional neural networks have become the main means of extracting features from dynamic image sequences. Motion vector estimation is central to stabilising image sequences and directly affects the performance of image stabilization systems, so a motion estimation algorithm suited to convolutional neural networks is needed. This study proposes an improved convolutional neural network based on a loss-free function and applies it to the extraction of dynamic image features. On this basis, the motion estimation algorithm is optimised by combining the grey-scale projection and block matching methods. The experimental results show that the new loss-free function-based convolutional neural network has better recognition capability, with an error rate of only 15% in dynamic image recognition. The accuracy of the optimised motion estimation algorithm is as high as 95.1% with a PSNR value of 16.636, which is higher than that of the traditional grey-scale projection algorithm. In video processing, the improved algorithm achieves a higher PSNR value than the search block matching method, the bit-plane matching method and the full search block matching method, with higher image stabilization accuracy and high operational efficiency. Overall, the proposed algorithm improves significantly on the current mainstream algorithms in image accuracy, processing performance and number of operations, and it provides a new research idea for the improvement of motion estimation algorithms.
Introduction
In recent years, image feature extraction using deep convolutional neural networks has developed rapidly [1]. However, current convolutional neural networks are increasingly difficult to train: the number of parameters to be tuned keeps growing, and the loss function converges poorly [2]. In addition, as video quality increases, how to combine convolutional neural networks with motion vector estimation algorithms has gradually become a focus of research. The estimation accuracy and detection rate of motion vector estimation algorithms, which are key factors directly affecting the stability of image sequences, are limited by the algorithms themselves. Therefore, this study addresses the problems of existing image feature extraction techniques by performing effective feature extraction on dynamic image sequences and analysing how motion estimation algorithms can be optimised, so as to improve the accuracy of image feature extraction and enhance the operational performance of motion estimation algorithms.
The article is divided into five main parts. The second part is Related Work, which introduces and sorts out the main research achievements in the field of the research object in recent years. The third part is Methods, which introduces the design and principle of the algorithm and model. The fourth part is Experiments, which introduces the performance experiment results of the proposed model. The fifth part is the Conclusion, which summarizes the research results.
Related work
In recent years, researchers at home and abroad have conducted in-depth research on image sequence motion vector estimation algorithms using neural networks. Li et al. [3] transformed the kernel size estimation problem into a regression problem in order to recover the blur kernel of a single blurred image and a potentially clear image, and solved it by constructing a convolutional neural network. The results showed that the method accurately estimated the size of the motion blur kernel and was able to obtain better images. Li et al. [4] proposed a new dynamic image representation method based on convolutional neural networks by merging convolutional features across a maximum pool of frames and using a linear support vector machine to train the classifier. Experiments demonstrated its high performance. Saif et al. [5] employed convolutional neural networks as the main architecture for traffic estimation, image segmentation and action recognition, and processed feature maps using normalised vector maps with additional spatial and temporal information in the fusion layer. Experiments showed that the method improved the efficiency of behaviour recognition. Ma and Zhao [6] used convolutional neural network representations as an overall image representation, retrieved similar-looking keyframes from the topology map and used a sharpness metric to select keyframes, effectively avoiding blurred keyframes. Bu [7] addressed the problem of poor human motion gesture recognition by extracting feature maps using deep convolutional neural networks, which had lower dimensionality and higher discrimination than traditional feature images extracted from fully connected layers. Peng et al. [8] proposed a new network structure that allows an arbitrary number of frames as network input and uses a feature connectivity layer combining motion information and appearance, while introducing a module consisting of a temporal pyramid pooling layer and an encoding layer. This structure was shown to have high operational efficiency. Suzuki and Ikehara [9] used implicit learning of inter-frame motion to generate intermediate frames directly and learned the differences among them through residual learning. The results showed that the method improved recognition performance and could avoid uncertainty in motion estimation.
Jesi et al. [10] addressed the overfitting problem of traditional hidden conditional neural field classifiers by extracting feature vectors based on a neural network with weighted deviation means. Experiments showed that it reduced execution time and improved accuracy. Ahmed and Khot [11] proposed a directional edge recovery method with a video encoder compression scheme to improve the spiral pixel reconstruction algorithm, and experimental results showed that the method was able to improve video coding efficiency. Traver and Paredes [12] used convolutional neural networks to regress motion parameters in order to perform image motion estimation, and used synthetic image morphing for experiments on existing image datasets. The results showed that the accuracy obtained by the method was better than that obtained with Cartesian images. Duan et al. [13] proposed the application of deep convolutional neural networks to single-person pose estimation to address the low accuracy of multi-person pose estimation. The network was used to generate human heat maps, which improved the generalisation of the extracted features. Shao et al. [14] developed a new neural network structure to handle transient changes, temporal correlation and interference from unknown factors in image sequences. Errors were eliminated through temporal correlation, and experimental results showed that the network converged quickly. Wu et al. [15] proposed a feature fusion strategy integrating image recognition and motion estimation to address the poor real-time performance of neural networks. Experiments showed that it could extract more representative features for multi-task learning. Sun et al. [16] used deep convolutional neural networks for feature extraction to cope with noise and unfamiliar scenes, and experimental results showed that the method was robust to interference from noise.
Recently, there has also been some research on dynamic image sequences. Farokhah [17] studied the performance of a facial emotion recognition model and proposed a dynamic image sequence recognition method based on machine learning. The experimental results show that the method is practical. Tsintotas et al. [18] applied dynamic image sequence technology to robot positioning and mapping; their new method was tested on seven public datasets, and the results showed that it was superior to the original method. In summary, convolutional neural networks are capable of feature extraction and classification of dynamic image sequences. Most researchers have proposed new convolutional neural network models to address the problems of poor accuracy and inefficiency, and the accuracy of convolutional neural networks has improved in some application scenarios. However, the accuracy still has room for improvement, and it is difficult for such techniques to balance operation accuracy, resource consumption and running time, while little research has been done on motion vector estimation algorithms based on neural networks. Therefore, in this study, the neural network-based motion vector estimation algorithm is improved and applied to image feature extraction in order to improve the image feature extraction performance of the algorithm.
Convolutional neural network-based motion vector estimation algorithm for dynamic image sequences
Dynamic images based on convolutional neural networks
With the continuous improvement in the quality of video and images, users place higher demands on dynamic images than ever before. Deep convolutional neural networks are a feasible technology for meeting these demands: they have strong image feature extraction capability and can be applied to a wide range of image recognition algorithms. Their structure is formed by alternately stacking convolutional and pooling layers. Each feature map in a convolutional layer is produced by sliding that layer's convolutional kernel over the output image of the previous layer. Each convolutional layer is immediately followed by a pooling layer, which down-samples the features using either maximum pooling or average pooling [19]. The image features from the convolutional layer need to be quantized and then connected to the fully connected layer. Finally, the output layer passes the features to the classifier, which outputs a probability for each class. The label with the largest output probability is taken as the final recognition result. The model of a deep convolutional neural network is shown in Fig. 1.
Convolutional neural network model.
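As a rough illustration of the pipeline described above (a kernel sliding over the previous layer's output to build a feature map, and the label with the largest output probability being selected), a minimal NumPy sketch follows. The function names, the 3×3 averaging kernel and the toy class scores are assumptions made for illustration and are not taken from the study. As in most deep learning frameworks, the "convolution" here does not flip the kernel.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the window and the kernel, then sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    e = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return e / e.sum()

# Toy input: a 4x4 "image" and a 3x3 averaging kernel (both hypothetical).
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3)) / 9.0
feature_map = conv2d_valid(image, kernel)

# Toy classifier output: the predicted label is the index of the largest probability.
probs = softmax(np.array([1.0, 2.0, 3.0]))
label = int(np.argmax(probs))
```

In a trained network the kernel values are learned rather than fixed, and several kernels per layer produce several feature maps.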
Each layer of the convolutional neural network consists of
In Eq. (1),
In Eq. (2),
Average pool diagram.
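The two down-sampling options mentioned for the pooling layer, maximum pooling and average pooling, can be sketched as follows. This is a minimal illustration assuming non-overlapping square windows; the function name `pool2d` and the window size are hypothetical choices, not the study's configuration.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Down-sample a 2-D feature map with non-overlapping size x size windows."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop trailing rows/columns that do not fill a window
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))   # keep the largest value in each window
    if mode == "avg":
        return blocks.mean(axis=(1, 3))  # keep the mean of each window
    raise ValueError("mode must be 'max' or 'avg'")

fm = np.arange(16, dtype=float).reshape(4, 4)
pooled_max = pool2d(fm, size=2, mode="max")
pooled_avg = pool2d(fm, size=2, mode="avg")
```

Both variants halve each spatial dimension here; they differ only in how each window is summarised.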
The basic structure of a fully connected layer is a multilayer perceptron, consisting of two layers, an input layer and an output layer, connected by a weight matrix. The input layer of the multi-layer perceptron is formed by vectorising the feature map of the previous layer. The weight matrix is inner-producted with the input vector, then mapped by the activation function, and the result is formed into the output layer. Therefore, the fully connected layer is calculated as shown in Eq. (3).
In Eq. (3),
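The fully connected computation described above (vectorise the previous feature maps, take the inner product with the weight matrix, then apply the activation function) might look like the sketch below. The bias term, the ReLU activation and all shapes are assumptions for illustration, since the symbols of Eq. (3) are not reproduced here.

```python
import numpy as np

def relu(x):
    """Rectified linear unit, used here only as an example activation."""
    return np.maximum(x, 0.0)

def fully_connected(feature_maps, weights, bias, activation=relu):
    """Flatten the feature maps into one input vector and apply activation(W x + b)."""
    x = np.concatenate([fm.ravel() for fm in feature_maps])  # vectorise previous layer
    return activation(weights @ x + bias)

# Two hypothetical 2x2 feature maps -> input vector of length 8 -> 3 output units.
maps = [np.ones((2, 2)), np.zeros((2, 2))]
W = np.full((3, 8), 0.5)
b = np.zeros(3)
fc_out = fully_connected(maps, W, b)
```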
At its core, training a convolutional neural network uses machine learning methods to extract image features, which are then transformed into convolutional kernels. The study proposes a new loss-free function convolutional neural network for image feature extraction. Each vector is specified to obtain the image training set and vectorisation set. Each element of the matrix in the intrinsic graph represents the distance between one sample patch and another patch, as shown in Eq. (4).
In Eq. (4),
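To make the idea of a patch-to-patch distance matrix concrete, the sketch below vectorises every patch of a small image and computes all pairwise distances. The Euclidean metric, the patch size and the helper names are assumptions for illustration only; the exact definition used in Eq. (4) is not reproduced here and may differ.

```python
import numpy as np

def extract_patches(image, size):
    """Collect every size x size patch (stride 1) as a flattened row vector."""
    h, w = image.shape
    patches = [image[i:i + size, j:j + size].ravel()
               for i in range(h - size + 1)
               for j in range(w - size + 1)]
    return np.array(patches, dtype=float)

def pairwise_distances(vectors):
    """Return the matrix D with D[a, b] = Euclidean distance between vectors a and b."""
    sq = np.sum(vectors ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (vectors @ vectors.T)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives caused by rounding

patches = extract_patches(np.arange(9, dtype=float).reshape(3, 3), size=2)
D = pairwise_distances(patches)
```

A neighbour-based method would then select, for each patch, its nearest entries in `D`; how neighbours are chosen and weighted is not specified here.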
In Eq. (5),
In Eq. (6),
In Eq. (7),
In Eq. (8),
In Eq. (9),
The second convolutional layer is trained in exactly the same way as the first, by repeating the process described above. The formula for each group of features is then given in Eq. (11).
In Eq. (11),
Improved convolutional neural network model diagram.
After the features of the dynamic images have been extracted by the convolutional network, motion vector estimation is performed. The main motion estimation algorithms include the block matching algorithm, the representative point matching method and the grey-scale projection algorithm. This study focuses on the block matching and grey-scale projection algorithms. The block matching algorithm divides the reference frame image into a series of non-overlapping macroblocks, which serve as the blocks to be matched, and treats all pixels within a block as having the same motion characteristics [21]. The difference between the horizontal and vertical coordinates of a block and those of its best match is the displacement of the matching block, i.e. the translational motion vector of the current frame. The block matching method is shown in Fig. 4.
Schematic diagram of block matching method.
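The block matching idea can be sketched as an exhaustive (full) search that minimises the mean absolute difference between a reference block and candidate blocks in the current frame. The block size, search radius and function names below are assumptions chosen for illustration, not the settings used in the study.

```python
import numpy as np

def mad(a, b):
    """Mean absolute difference between two equally sized blocks."""
    return np.mean(np.abs(a.astype(float) - b.astype(float)))

def full_search(reference, current, top, left, block=8, radius=4):
    """Find the displacement (dy, dx) of the reference block at (top, left).

    Every offset within +/- radius is tried; the one with the lowest MAD wins.
    """
    target = reference[top:top + block, left:left + block]
    h, w = current.shape
    best, best_cost = (0, 0), np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > h or x + block > w:
                continue  # candidate block would fall outside the frame
            cost = mad(target, current[y:y + block, x:x + block])
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best, best_cost

# Synthetic test: the "current" frame is the reference shifted by (2, -3) pixels.
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(32, 32))
current = np.roll(reference, shift=(2, -3), axis=(0, 1))
motion, cost = full_search(reference, current, top=12, left=12)
```

The nested loop makes the cost of full search explicit: each block requires (2·radius + 1)² block comparisons, which is why faster search strategies are attractive.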
The mean absolute difference, the normalised cross-correlation function and the mean square error are the matching criteria of the block matching motion estimation algorithm. They determine the accuracy, data reading complexity, matching operation complexity and memory management complexity of the motion estimation. The best search path is then found on the basis of the matching criterion. The full search method is the most reliable search method: it matches all pixels within a given search range and then takes the optimal motion vector. It is reliable, simple and accurate, but its computational cost is very high. The grey-scale projection algorithm mainly reflects the distribution of image grey levels and is applied to two adjacent frames for rough motion vector estimation. Specifically, the current frame and the reference frame are projected, the row and column correlations between them are calculated separately, and the minimum of each correlation curve is found; the resulting motion vector is the roughly estimated motion vector. The general steps are image mapping, projection filtering and correlation calculation. Image mapping first pre-processes all frames in the image sequence with histogram equalisation and then maps the two-dimensional grey-scale information of each image into two independent one-dimensional signals by row and column projection, i.e. by accumulating the grey-scale values of the pixels in each row and each column of the image.
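A simplified sketch of the grey-scale projection steps is given below: each frame is projected onto its rows and columns, and the shift that minimises the squared difference between the reference and current projections is taken as the coarse motion vector. Histogram equalisation and projection filtering are omitted, and the mean subtraction, the search range and all function names are assumptions for illustration rather than the study's formulation.

```python
import numpy as np

def projections(frame):
    """Row and column grey-scale projections with their means removed."""
    f = frame.astype(float)
    rows = f.sum(axis=1)  # one accumulated value per image row
    cols = f.sum(axis=0)  # one accumulated value per image column
    return rows - rows.mean(), cols - cols.mean()

def best_shift(ref_proj, cur_proj, max_shift):
    """Shift s in [-max_shift, max_shift] minimising the squared projection difference."""
    n = ref_proj.size
    core = ref_proj[max_shift:n - max_shift]  # central part of the reference curve
    costs = [np.sum((cur_proj[max_shift + s:n - max_shift + s] - core) ** 2)
             for s in range(-max_shift, max_shift + 1)]
    return int(np.argmin(costs)) - max_shift  # position of the curve minimum

def coarse_motion(reference, current, max_shift=5):
    """Coarse (dy, dx) estimate from the row and column correlation curves."""
    ref_r, ref_c = projections(reference)
    cur_r, cur_c = projections(current)
    return best_shift(ref_r, cur_r, max_shift), best_shift(ref_c, cur_c, max_shift)

# Synthetic test: the current frame is the reference shifted by (3, -2) pixels.
rng = np.random.default_rng(1)
ref_frame = rng.integers(0, 256, size=(40, 40))
cur_frame = np.roll(ref_frame, shift=(3, -2), axis=(0, 1))
coarse = coarse_motion(ref_frame, cur_frame)
```

Because only two one-dimensional curves are compared, the cost grows with the image side length rather than with the number of candidate blocks, which is the main appeal of the projection approach for a coarse estimate.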
In Eq. (12),
In Eq. (13),
In Eq. (14),
In Eq. (15),
Identification error rate change chart of MFA-1 and MFA-2.
In Eq. (16),
In Eq. (17),
In Eq. (18),
First, the proposed convolutional neural network model based on the loss-free function was tested and analysed on dynamic image recognition. Two MFA convolutional networks, MFA-1 and MFA-2, were built, containing one and two MFA convolutional layers respectively. The number of convolutional kernels in MFA-1 ranged from 4 to 32; the first layer of MFA-2 used the same range, while its second layer was fixed at 8 kernels. The parameters
As can be seen from Fig. 5, both MFA-1 and MFA-2 have increasing recognition rates as the number of convolutional kernels increases. Among them, MFA-2 reaches the lowest error rate at
Recognition accuracy of different methods on ICDAR2017
Change diagram of MFA-2 recognition error rate.
As can be seen from Fig. 6, the recognition error rate of MFA-2 tends to decrease in general as the parameters
As can be seen from Table 1, the MFA convolutional network algorithm proposed in this study achieves an accuracy of 95.1% for extracting image features, which is significantly higher than the recognition accuracy of the other algorithms. In addition the CNN
Row column projection curve of reference frame and current frame.
As can be seen in Fig. 7, the minimum value of the column correlation curve is 2
Motion estimation results of dynamic image sequences
From Table 2, it can be seen that after median filtering of the five local motion vectors, the mean of the middle three motion vectors after sorting is selected to yield a finely estimated motion vector of (0,
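The refinement step described above, sorting five local motion vectors and averaging the middle three, can be sketched as follows. Sorting each component independently is an assumption, and the five example vectors are made up for illustration; they are not the values in Table 2.

```python
import numpy as np

def refine_motion(local_vectors):
    """Sort each component, drop the smallest and largest, average the rest."""
    v = np.sort(np.asarray(local_vectors, dtype=float), axis=0)  # sort dy and dx separately
    return v[1:-1].mean(axis=0)  # for five vectors this keeps the middle three

# Five hypothetical local motion vectors (dy, dx), including two outliers.
local = [(1, 2), (1, 2), (1, 3), (9, 2), (1, -4)]
fine = refine_motion(local)
```

Discarding the extremes before averaging keeps a single badly matched block from dragging the final estimate away from the consensus.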
Comparison of several motion estimation algorithms
PSNR curve.
As can be seen from Table 3, the improved algorithm proposed in the study, which performs coarse estimation before fine estimation, has a PSNR value of 16.636. This is significantly higher than that of the traditional grey-scale projection algorithm and slightly higher than the 16.537 of full-search block matching, indicating higher image stabilization accuracy. The improved algorithm requires 428080 operations, compared with 3326528 for the full-search block matching method and 435562 for the traditional grey-scale projection method, so it needs far fewer operations than full-search block matching and has lower complexity. The full search block matching method, the improved algorithm, the search block matching method and the bit-plane matching method were then compared, and the PSNR curves of the four algorithms for video processing are shown in Fig. 8.
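The size of these differences can be checked with simple arithmetic on the figures quoted from Table 3. The snippet below only restates the reported numbers; the variable names are arbitrary.

```python
# Operation counts and PSNR values as reported in the text for Table 3.
ops = {"improved": 428080, "full_search": 3326528, "grey_projection": 435562}
psnr_values = {"improved": 16.636, "full_search": 16.537}

# How many times more operations full search needs than the improved algorithm.
ratio_vs_full = ops["full_search"] / ops["improved"]

# Fraction of full-search operations saved by the improved algorithm.
saving_vs_full = 1 - ops["improved"] / ops["full_search"]

# Fraction of grey-scale projection operations saved by the improved algorithm.
saving_vs_projection = 1 - ops["improved"] / ops["grey_projection"]

# PSNR advantage over full-search block matching.
psnr_gain = psnr_values["improved"] - psnr_values["full_search"]

print(f"full search needs {ratio_vs_full:.2f}x the operations")
print(f"saving vs full search: {saving_vs_full:.1%}")
print(f"saving vs grey-scale projection: {saving_vs_projection:.1%}")
print(f"PSNR gain over full search: {psnr_gain:.3f}")
```

On these figures the improved algorithm uses roughly one eighth of the operations of full search while its operation count is close to that of the traditional grey-scale projection method.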
As can be seen from Fig. 8, the improved algorithm achieves the highest PSNR value of 30, which is higher than the PSNR values of the other three algorithms, indicating that it is more effective and has higher performance than the other algorithms for processing dynamic image sequences.
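For reference, PSNR between two frames is commonly computed from their mean square error as 10·log10(peak² / MSE). The sketch below uses this standard definition with an assumed 8-bit peak value of 255; how the study pairs frames when evaluating stabilised video is not specified here, so the inputs are purely illustrative.

```python
import numpy as np

def psnr(frame_a, frame_b, peak=255.0):
    """Peak signal-to-noise ratio between two equally sized frames."""
    diff = frame_a.astype(float) - frame_b.astype(float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy frames: every pixel differs by 10 grey levels, so the MSE is 100.
a = np.zeros((4, 4), dtype=np.uint8)
b = np.full((4, 4), 10, dtype=np.uint8)
value = psnr(a, b)
```

A higher PSNR means the two frames are more alike, which is why it serves as a proxy for how well consecutive frames have been aligned.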
This study combines an improved convolutional neural network with a motion estimation algorithm to improve the stability of dynamic image sequences. The experimental results show that the recognition rate of the improved MFA convolutional network model increases, and its error rate gradually decreases, as the number of convolutional kernels increases, and that the recognition error rate decreases as the number of dissimilar and similar neighbours increases. With one fewer convolutional layer and one fewer pooling layer, the accuracy of extracting image features is 95.1%, which is slightly higher than the 94.7% of the CNN+Softmax algorithm. The PSNR value of the improved motion vector estimation algorithm is 16.636, which is significantly higher than that of the traditional grey-scale projection algorithm and slightly higher than the 16.537 of the full-search block matching algorithm, indicating higher image stabilization accuracy. The improved algorithm requires only 428080 operations, far fewer than the 3326528 of the full-search block matching algorithm, indicating lower complexity. In processing dynamic images, the improved motion estimation algorithm has a higher PSNR value than all three commonly used algorithms, indicating superior processing performance. In summary, compared with the main algorithms in the current image feature extraction field, the proposed algorithm offers higher image accuracy and processing performance with fewer operations and lower complexity, which is a significant improvement for the field. However, the study takes little account of changes in depth of field during dynamic image transformation, and further analysis of this is needed.
