Abstract
Digital video stabilization (DVS) makes it possible to acquire video sequences without disturbing jerkiness by removing unwanted camera movements. A good DVS should remove the unwanted camera movements while maintaining the intentional ones. In this article, we propose a novel real-time DVS algorithm with a short delay. The proposed DVS compensates for camera jitter by applying an adaptive fuzzy filter to the global motion of the video frames. The adaptive filter is a non-causal infinite impulse response (IIR) filter that a fuzzy system tunes adaptively to the camera motion characteristics. The fuzzy system itself is also tuned during operation according to the amount of camera jitter. The fuzzy system uses two inputs, which are quantitative representations of the unwanted and the intentional camera movements. The global motion of the video frames is estimated from the block motion vectors produced by the video encoder during motion estimation. Experimental results indicate good performance for the proposed algorithm.
Introduction
Digital video stabilization techniques have been studied for decades to improve the visual quality of image sequences captured by compact, lightweight digital video cameras. When such cameras are hand-held or mounted on unstable platforms, the captured video generally looks shaky because of undesired camera motion. Unwanted video vibrations lead to a degraded viewing experience and also greatly affect the performance of applications such as video encoding [1–4] and video surveillance [5, 6]. With recent advances in wireless technology, video stabilization systems are also being considered for integration into wireless video communication equipment, where acquired sequences are stabilized before transmission, not only to improve visual quality but also to increase compression performance [1]. Solutions to the stabilization problem use either hardware or software to compensate for the unwanted camera motion. Hardware-based stabilizers are generally expensive and lack the compactness that is crucial for today's consumer electronic devices [7, 8]. In contrast, a DVS system implemented in software can easily be miniaturized and updated. Consequently, a DVS system is well suited to portable digital devices such as digital cameras and mobile phones.
In general, a DVS system consists of two principal units: a motion estimation (ME) unit and a motion correction (MC) unit. The ME unit estimates a global motion vector (GMV) between every two consecutive frames of the video sequence. Using the GMVs, the MC unit then generates the smoothing motion vectors (SMVs) needed to compensate for the frame jitter and warps the frames to create a more visually stable image sequence.
According to the motion model considered, the global ME techniques already proposed for DVS systems can roughly be divided into two categories: (1) two-dimensional stabilization techniques, which deal with translational jitter only [9–20], and (2) multi-dimensional stabilization techniques, which aim at stabilizing more complicated fluctuations in addition to translation [21–25]. Most existing algorithms fall into the first category because translation is the most commonly encountered motion and the complexity of estimating translation parameters is low enough for real-time stabilization. In the second category, the majority of algorithms [21, 24] consider a three-dimensional or perspective motion model, while a few consider an affine motion model [23] or use sensors attached to the camera to provide absolute 3D orientation [25].
Regarding the ME task of DVS systems, most previous approaches attempt to reduce the computational cost by using fast ME algorithms, e.g. gray-coded bit-plane matching [9], two-bit transform [10], multiplication-free one-bit transform [11], Laplacian two-bit transform [12], and binary image matching of color weight [13]. In another approach, the global ME is limited to small, pre-defined regions [16, 17]. Such approaches treat DVS and video encoding separately and trade the accuracy of the motion vectors (MVs) for computational efficiency; the gain in efficiency therefore comes at the expense of accuracy in the ME task and, consequently, in the MC task.
Since both the video encoder and the digital stabilizer of a digital video camera use an ME unit, the digital stabilizer can be integrated with the video encoder [2, 26] by making the two modules share a common local motion vector (LMV) estimation process, as shown in Fig. 1. ME can take up 90% of the total computation of a digital stabilizer and 50–70% of that of a video codec. Combined, the operations required for ME can consume more than 70% of the total computation [4]. The ME task in video encoders is usually implemented on frame blocks by a block matching process that estimates an MV for each block (BMV).
The ME unit plays an important role in a DVS system, and its estimation accuracy is a decisive factor for the overall stabilization performance. In video frames with smooth or complex-texture regions, the estimated BMVs may not coincide with the real motion of the blocks. Although such LMVs are adequate for the local motion compensation executed in the encoder, they cannot be used for the global motion compensation executed by the DVS, because the noise they contain degrades the global ME task. Several algorithms for removing the noisy LMVs in these regions are proposed in [27–30]. The remaining valid BMVs are used as LMVs for the global ME and MC in the next steps.
After global ME, the next essential task of a DVS system is MC, in which the unwanted camera jitter is separated and removed from the intentional camera movement. Among the various MC algorithms proposed in the literature, smoothing of the GMV by low-pass filtering is the most popular. For instance, the MV integration method used in [9] and [31] utilizes a first-order infinite impulse response (IIR) low-pass filter to integrate differential motion and to smooth the global movement trajectory. A frame position smoothing (FPS) algorithm, which smooths absolute frame positions and achieves successful stabilization while retaining smooth camera movements, is utilized for MC in [17, 32–39]. Off-line discrete Fourier transform (DFT) domain filtering is proposed for FPS-based stabilization in [32]. Kalman filters and fuzzy systems have been widely used in DVS applications [33–39]. A real-time FPS-based stabilizer using Kalman filtering of absolute frame positions has been proposed in [17, 33]. In the algorithm presented in [34], two Kalman filters operate in parallel: one serves as a reference filter with a constant, high process noise variance, and the other serves as the stabilization filter, whose process noise variance is adjusted by a fuzzy system according to the residual between the outputs of the stabilization and reference filters. The DVS presented in [35] utilizes a fuzzy system in which the membership functions (MFs) are optimized to the motion dynamics. A membership-selective fuzzy stabilization, in which the stabilization system selects among a pre-determined set of MFs according to the instantaneous motion characteristics, is proposed in [36]. An MF-adaptive fuzzy filter based on smoothing of the absolute frame position is presented in [37]. In this method, a short mean filter is first applied to the raw absolute frame displacement as a pre-process to reduce the dynamic range of the fuzzy system input. Fuzzy stabilization is then achieved through fuzzy correction mapping, and the output MFs of the fuzzy system are continuously adapted so as to constitute an MF-adaptive fuzzy filtering process. It is shown in [38] that the performance of the Kalman filter can be improved by a fuzzy system that refines the Kalman filter output.
Almost all fuzzy systems utilized by the algorithms presented in [34–38] have a similar two-input structure and differ only slightly in their fuzzy MFs. In these algorithms, the first input is the difference between the absolute frame displacement and the a priori estimate of the stabilized frame position, which is obtained by a Kalman filter or another predictor. The second input is the change of the first input over the last two frames.
Regarding the MC task of DVS systems, almost all published algorithms try to smooth the global movement trajectory by some kind of low-pass filtering. An important drawback of low-pass filtering is that the smoothed movement trajectory is delayed with respect to the desired camera displacement. Stricter filtering provides more stabilization at the expense of more trajectory delay, and vice versa. More trajectory delay means losing more image content after stabilization.
A good MC unit should remove the unwanted camera motion while tracking the intentional motion without any delay. For this purpose, it should discriminate between the unwanted and the intentional camera motion and adjust the smoothing filter adaptively according to the amount of each. The published MC algorithms we studied lack some of these features. For example, the algorithms presented in [27, 37] do not discriminate between unwanted and intentional camera motion. Moreover, the adaptive algorithm proposed in [27] lacks continuous, fine-grained adaptation: it uses an adaptive filter whose smoothing factor is switched between only two values, which leads to undesirable jumps in the frame position. The algorithm proposed in [38] is a fuzzy Kalman system, consisting of a fuzzy system combined with a Kalman filter, that performs better than the method proposed in [33] but still does not adapt well.
In this article, we propose a DVS algorithm with new features in both the ME and MC units. The ME unit estimates a GMV from the BMVs that are estimated by the video encoder. Therefore, the computational complexity of the DVS is very low, and accurate motion information is used without extra computational cost. Moreover, to improve the accuracy of the GMV estimation, an adaptive thresholding algorithm is used to remove the noisy, invalid LMVs. The MC unit of the proposed DVS system is an adaptive fuzzy filter that is applied to the global motion of the video frames to smooth the camera movement trajectory adaptively. The adaptive fuzzy filter is a non-causal IIR filter that a fuzzy system tunes adaptively to the characteristics of the unwanted and intentional camera motion. The fuzzy system itself is also tuned during operation according to the amount of camera jitter. The fuzzy system adjusts the filter parameters by using two novel inputs, which are quantitative representations of the unwanted and the intentional camera motion. Because the filter is non-causal, it needs to look ahead by a few (about two) upcoming frames, which causes a short delay in a real-time system. Experimental results show good performance for the proposed DVS algorithm.
The remainder of this article is organized as follows. The details of the proposed video stabilization algorithm are presented in Section 2. The method for numerical DVS performance assessment that we proposed in [40] is summarized in Section 3. Experimental results are presented in Section 4, and the article is concluded in Section 5.
The proposed DVS system
A flowchart of the proposed DVS system is depicted in Fig. 2. The details of the proposed system are described in the sequel.
Block-based ME
Block-based ME is used to generate the LMVs. Since the ME is done by the video encoder, the computational complexity of the DVS is very low, and accurate motion information is used without extra computational cost. In this article, to test the proposed DVS system independently of the encoder, a full-search ME algorithm with full-pixel resolution is applied to 8×8 blocks over a search range of 33×33 pixels to obtain the BMVs.
The ME algorithm works as follows. First, the current frame is divided into a number of N×N blocks and an MV is computed for each block. The resulting MV points to the most correlated reference block in the previous frame within the search area. To measure the goodness of each candidate MV (i, j) for the block located at (x, y), the mean absolute difference (MAD) measure is used:

$$\mathrm{MAD}(i, j) = \frac{1}{N^{2}} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} \left| C(x + k,\, y + l) - R(x + i + k,\, y + j + l) \right| \tag{1}$$

where C(x + k, y + l) and R(x + i + k, y + j + l) denote the block pixels in the target frame and the displaced block pixels in the reference frame, respectively. The candidate MV (i, j) with the smallest MAD is chosen as the MV of the current block:

$$\mathrm{MV} = \arg\min_{(i, j)} \mathrm{MAD}(i, j) \tag{2}$$
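The MAD-based full search described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the encoder's implementation: the function names, the list-of-rows frame layout, and the handling of candidates that fall outside the reference frame (they are simply skipped) are assumptions made for the example. A radius of 16 corresponds to the 33×33 search range used in this article.

```python
def mad(cur, ref, x, y, i, j, n):
    """Mean absolute difference between the n x n block at (x, y) in `cur`
    and the block displaced by (i, j) in `ref`.
    Frames are lists of rows, indexed frame[row][col] with row = y, col = x."""
    total = 0
    for l in range(n):
        for k in range(n):
            total += abs(cur[y + l][x + k] - ref[y + j + l][x + i + k])
    return total / (n * n)


def full_search(cur, ref, x, y, n=8, radius=16):
    """Return (best_mv, best_mad) for the block whose top-left corner is (x, y).
    Assumption: candidates that would leave the reference frame are skipped."""
    height, width = len(ref), len(ref[0])
    best_mv, best_mad = (0, 0), mad(cur, ref, x, y, 0, 0, n)
    for j in range(-radius, radius + 1):
        for i in range(-radius, radius + 1):
            inside = (0 <= x + i and x + i + n <= width
                      and 0 <= y + j and y + j + n <= height)
            if not inside:
                continue
            score = mad(cur, ref, x, y, i, j, n)
            if score < best_mad:  # keep the first minimum encountered
                best_mv, best_mad = (i, j), score
    return best_mv, best_mad
```

The MAD values computed inside the loop are the same quantities that the validation tests in the following subsections reuse, which is why those tests add no extra matching cost.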
The ME unit plays an important role in a DVS system, and its estimation accuracy is a decisive factor for the overall performance of the stabilization system. The block ME process typically computes some wrong MVs that do not coincide with the real motion direction of the blocks. Although such MVs can be useful for motion compensation in the encoder, they contain noise that degrades the global ME task and should not be used for global motion compensation and video stabilization. The noisy MVs are mostly obtained from two types of regions: very smooth regions that lack features and very complex, uneven regions [27–30]. Inspired by the algorithm presented in [27], two qualifying tests, namely the “Smoothness Test” and the “Complexity Test”, are used to detect and remove the noisy MVs by an adaptive thresholding method as follows. The valid BMVs are then used as LMVs for the global ME and MC in the next steps.
Smoothness test
The noisy MVs corresponding to smooth regions, such as sky, are detected by thresholding the average of the MAD as:
Here the thresholded quantity is the average of the MADs calculated within the search area during ME of the nth block. The threshold th 1 is also defined as:
Complexity test
The noisy MVs corresponding to complex-texture regions are identified by another thresholding as:
where the threshold th 2 is defined adaptively as:
Note that the MAD is already computed by the encoder during ME. Therefore, the smoothness test and the complexity test add no computational cost to the proposed DVS system.
Results on different video contents show that using fixed thresholds may leave a considerable number of invalid, noisy LMVs in place or remove a notable number of valid LMVs. To solve this problem, the thresholds th 1 and th 2 are adjusted adaptively for each frame based on the video content. Note that if the encoder executes ME with a fast search algorithm rather than a full search, the MADs calculated during that ME are used to adapt the thresholds th 1 and th 2.
Original LMVs and validated LMVs for a sample frame are presented in Fig. 3. This figure shows that many noisy LMVs have been removed by the LMV validation process.
The global ME unit produces a unique GMV for each video frame, which represents the camera movement during the time interval between two frames. Since the LMVs obtained from the image background tend to be very similar in both magnitude and direction, we use a clustering process to classify the motion field into clusters corresponding to the background and the foreground objects. The global motion induced by camera movement is determined by a clustering process that consists of the following steps.
As an example, Fig. 4 shows a case in which the largest histogram bin, located at coordinates (5, 12), yields the GMV.
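The idea that the largest bin of the LMV histogram yields the GMV can be illustrated with a minimal Python sketch. It assumes that the validated LMVs are available as integer (dx, dy) tuples, uses one histogram bin per distinct vector, and returns (0, 0) when no valid LMV remains; the function name and these choices are assumptions for illustration and do not reproduce the individual clustering steps of the proposed method.

```python
from collections import Counter


def global_motion_vector(lmvs):
    """Return the most frequent LMV as the GMV.
    Assumption: an empty input falls back to (0, 0), i.e. no camera motion."""
    if not lmvs:
        return (0, 0)
    histogram = Counter(lmvs)              # one bin per distinct (dx, dy)
    gmv, _count = histogram.most_common(1)[0]
    return gmv
```

Because background blocks usually dominate the frame, the peak bin tends to reflect the camera motion, while vectors from moving foreground objects fall into smaller bins.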
Unwanted ME and correction
An estimated GMV may consist of two major components: an intentional motion component (e.g. corresponding to camera panning) and an unintentional motion component (e.g. corresponding to handshake). A good MC algorithm should remove only the unwanted motion while maintaining and tracking the intentional motion. Assuming that the unwanted motion corresponds to the high-frequency components, the proposed algorithm uses a low-pass filter to remove the unwanted motion component. The filtering yields an SMV that resembles the intentional camera movement. The proposed method calculates the SMV in the form of a third-order autoregression as
The smoothing filter is applied to the vertical and horizontal components of the GMVs separately. The smoothing factor of the filter, α (n), is adjusted continuously by a fuzzy system for the MC of each frame. In fact, two fuzzy systems with a similar structure are used, one for the vertical and one for the horizontal motion component. Each fuzzy system has two inputs (Input1, Input2) and one output. The fuzzy inputs are defined as:
Defining suitable inputs for an adaptive DVS system has a great impact on system performance. Only relevant inputs can discriminate precisely between unwanted and intentional camera motion, which is what the adaptation of the smoothing filter relies on. Different scenarios combining unwanted and intentional camera motion can be considered; some are presented graphically in Fig. 5. In graphs (a) and (b), the camera has an intentional accelerating movement plus noise, i.e. unwanted motion. The noise amplitude is high in (a), while it is negligible in (b). Graph (e) corresponds to a camera path during panning, in which the camera moves with a constant velocity, without any acceleration or noise. The explanations of all graphs are summarized in Table 1.
From the adaptive filtering point of view, it is important to measure the amount of noise and the velocity and acceleration of the intentional camera movement. A stricter smoothing filter is needed to remove the noise when its amplitude is high. On the other hand, a strict smoothing filter prevents the output from following the camera path when the camera has a high intentional acceleration. Therefore, the smoothing factor of the filter should be tuned carefully according to the amount of noise and the camera movement acceleration. Accordingly, we define the fuzzy inputs so that Input1 conveys the amount of noise and Input2 conveys the amount of camera movement acceleration. Note that the camera movement velocity itself places no constraint on the filtering, so it is not measured or used here.
The proposed fuzzy system tunes the smoothing factor of the third-order IIR filter, α (n), adaptively according to the amount of noise and the intentional accelerating movement of the camera. In the proposed fuzzy system, trapezoidal MFs are used for the inputs and triangular MFs for the output. The number of MFs has been selected so as to obtain decent performance with as few MFs as possible, in order to keep the system complexity low. The experimentally designed input and output MFs, together with the surface of desired outputs, are shown in Fig. 6. According to the experimental results, the performance of the smoothing filter is more sensitive to changes in α where α has a large value. Therefore, more MFs of the fuzzy output are concentrated in this operating area. The constructed rule base contains 30 rules, as presented in Table 2. The fuzzy system was implemented with the min function for fuzzy implication and the max function for fuzzy aggregation, and the centroid method was applied for defuzzification. The output of the fuzzy system defines the α of the smoothing filter, i.e. α (n), for the MC of the nth video frame.
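The inference scheme just described (trapezoidal input MFs, triangular output MFs, min implication, max aggregation, centroid defuzzification) can be sketched as a small Mamdani-style system in Python. Everything specific in this sketch is hypothetical: the MF breakpoints, the two labels per input, the three output labels, the four rules, the use of min to combine the two antecedents, the normalization of both inputs to [0, 1], and the reading that a larger α means stronger smoothing. It does not reproduce the MFs of Fig. 6 or the 30-rule base of Table 2.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal MF with feet a, d and shoulders b, c (a <= b <= c <= d)."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def triangle(x, a, b, c):
    """Triangular MF, expressed as a trapezoid with a single peak at b."""
    return trapezoid(x, a, b, b, c)


# Hypothetical MFs on normalized inputs in [0, 1] and on alpha in [0, 1].
INPUT_MFS = {"low": (0.0, 0.0, 0.3, 0.7), "high": (0.3, 0.7, 1.0, 1.0)}
OUTPUT_MFS = {"small": (0.0, 0.2, 0.5),
              "medium": (0.3, 0.6, 0.9),
              "large": (0.7, 0.95, 1.0)}
# Hypothetical rules: (noise label, acceleration label) -> alpha label.
RULES = [("low", "low", "medium"), ("low", "high", "small"),
         ("high", "low", "large"), ("high", "high", "medium")]


def smoothing_factor(noise, accel, samples=101):
    """Min implication, max aggregation, centroid defuzzification."""
    fired = []
    for noise_label, accel_label, out_label in RULES:
        strength = min(trapezoid(noise, *INPUT_MFS[noise_label]),
                       trapezoid(accel, *INPUT_MFS[accel_label]))
        fired.append((strength, OUTPUT_MFS[out_label]))
    num = den = 0.0
    for s in range(samples):               # sample the output universe
        alpha = s / (samples - 1)
        mu = max(min(strength, triangle(alpha, *mf)) for strength, mf in fired)
        num += alpha * mu
        den += mu
    return num / den if den > 0 else 0.5   # 0.5 is an arbitrary fallback
```

With these toy rules, a noisy path with little acceleration yields a larger smoothing factor than a clean path with strong acceleration, which mirrors the trade-off described in the text.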
A study of a number of video sequences has shown that the range of the fuzzy inputs (Input1, Input2) varies considerably across video contents. Therefore, fixed MFs for the inputs of the fuzzy system cannot provide good stabilization performance over all video contents. To obtain good performance for the proposed DVS over different video contents, we propose adjusting the MFs of the fuzzy inputs adaptively to the recently received video frames. The ranges of the MFs for the fuzzy inputs, i.e. (0, Input1(max)) and (0, Input2(max)), are modified adaptively as:
After the fuzzy system has computed the smoothing factor α (n), the camera motion path constructed from the GMVs is filtered by the smoothing filter to compute the smoothed motion vectors. For the first three frames, a fixed large value of α (n) is used. After computing the SMV, the unwanted motion vector (UMV) is obtained by
To restore the current frame to its stabilized position, we offset the current frame by the accumulated UMV, AMV, defined by
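The bookkeeping from SMV to frame offset can be sketched as follows. The sketch rests on two assumptions that are not taken from the formulas of this article: that the UMV is the difference between the GMV and the SMV of the same frame, and that the AMV is the running sum of the UMVs up to the current frame. It handles one motion component (horizontal or vertical) at a time, and the function name is illustrative.

```python
from itertools import accumulate


def unwanted_and_accumulated(gmvs, smvs):
    """Return (umvs, amvs) for one motion component.
    Assumptions: UMV(n) = GMV(n) - SMV(n), and AMV(n) is the running
    sum of UMV(1..n)."""
    if len(gmvs) != len(smvs):
        raise ValueError("gmvs and smvs must have the same length")
    umvs = [g - s for g, s in zip(gmvs, smvs)]
    amvs = list(accumulate(umvs))
    return umvs, amvs
```

For example, with `gmvs = [2, -1, 3, 0]` and `smvs = [1, 1, 1, 1]`, the per-frame unwanted motion is `[1, -2, 2, -1]` and the accumulated offset is `[1, -1, 1, 0]`.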
Numerical performance assessment of the MC unit of a digital stabilizer is a difficult task, since the pure intentional camera motion path is not available as a reference when the camera moves intentionally [4]. To address this lack of a reference, a model of the intentional camera motion was proposed in [40] to produce a synthetic camera motion path that serves as the reference. Statistics collected from real video sequences, each comprising 150 frames, showed that a Gaussian probability distribution function (PDF) can be fitted to the first twelve frequency components of the discrete sine transform (DST) of the camera path [40].
Given a reference path for the intentional camera motion, random noise is added to it to simulate the unwanted camera motion. When the resulting noisy motion path is processed by a DVS algorithm, a distance measure such as the mean-square error (MSE) between the processed motion path and the reference path can be used as a numerical performance measure for the DVS.
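A minimal Python sketch of this assessment procedure is given below. It assumes that a reference path can be synthesized by drawing the first twelve sine-basis coefficients from a zero-mean Gaussian and summing the corresponding sine components. The single standard deviation shared by all coefficients, the Gaussian jitter model, the seeds, and the function names are placeholders chosen for illustration; they are not the fitted parameters of the model in [40].

```python
import math
import random


def synthetic_path(num_frames=150, num_components=12, sigma=1.0, seed=0):
    """Synthesize a smooth reference path from Gaussian sine-basis coefficients.
    Assumption: every coefficient shares the same zero-mean Gaussian."""
    rng = random.Random(seed)
    coeffs = [rng.gauss(0.0, sigma) for _ in range(num_components)]
    return [sum(c * math.sin(math.pi * (k + 1) * (n + 1) / (num_frames + 1))
                for k, c in enumerate(coeffs))
            for n in range(num_frames)]


def add_jitter(path, noise_sigma=0.5, seed=1):
    """Simulate unwanted camera motion by adding zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, noise_sigma) for p in path]


def mse(path_a, path_b):
    """Mean-square error between two motion paths of equal length."""
    if len(path_a) != len(path_b):
        raise ValueError("paths must have the same length")
    return sum((a - b) ** 2 for a, b in zip(path_a, path_b)) / len(path_a)
```

A stabilizer under test would take the output of `add_jitter` as its input path, and the MSE between its output and the reference from `synthetic_path` would serve as its score, with lower values indicating better recovery of the intentional motion.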
Experimental results
The performance of the proposed DVS method is evaluated over 15 video sequences covering different types of scenes. Since there is no standard test sequence in this research field, the algorithm is applied to a number of readily available real video sequences; for example, some of the sequences used are available at [41, 42]. These sequences have a frame rate of 25 fps and a picture size of 352×288 pixels. Sample frames of the video sequences are shown in Fig. 7. We worked with both gray-scale and color test sequences; in both cases ME is performed on the luminance component. Good experimental results are obtained with M = 3 in formulas (8–11). A larger M provides more smoothness at the expense of more tracking delay, and vice versa. The stabilizer performance is assessed according to the smoothness of the resulting global motion compared with that of the original sequence and according to its ability to preserve gross movement. To evaluate the proposed DVS algorithm, it was applied to several data sets with various unwanted motions. Its results are compared with those of the algorithm presented in [43], which uses an adaptive IIR fuzzy filtering technique for MC. Graphical comparisons between the original motions and the smoothed motions produced by our DVS and by the algorithm in [43] are presented in Fig. 8. The results of the algorithm in [43] show that it removes noise acceptably but does not track gross camera movements well in all cases. In contrast, the results demonstrate that the proposed fuzzy system provides stronger stabilization while closely tracking gross camera movements. A small-scale subjective quality test also indicated that viewers rated the videos stabilized by the proposed DVS system above the original videos in all cases.
Using the numerical assessment method explained in Section 3, we compared the performance of the MC unit of our DVS algorithm with that of the algorithm presented in [43]. For this evaluation, we first produced four synthetic signals y 1, y 2, y 3 and y 4 as reference motion paths using the camera motion model presented in [40]. Random noise was then added to the reference signals to simulate the unwanted camera motion. These synthetic signals and the corresponding noisy signals are shown in Fig. 9. The noisy signals were processed by the DVS algorithms, and the performance of the algorithms was compared by computing the MSE between the reference signals and the corresponding processed signals. The processed signals are shown in Fig. 10, and the numerical results are presented in Table 3. According to the numerical results, the proposed DVS system yields much lower MSE values than the anchor DVS in all cases. Moreover, the graphical comparisons show higher stabilization performance and tracking capability for the proposed DVS.
Conclusion
In this article, we proposed a computationally efficient DVS algorithm that uses motion information obtained from a hybrid block-based video encoder. Since some of the obtained MVs are not valid, an adaptive thresholding method was developed to filter out the invalid MVs and to compute an accurate GMV for each frame. The proposed stabilization technique effectively estimates the intentional camera motion by exploiting the characteristics of unwanted motion; an adaptive fuzzy non-causal IIR filter is proposed to fulfill two apparently conflicting requirements: close tracking of the intentional camera movement and removal of the unwanted camera motion. To improve stabilization performance, the input MFs of the fuzzy system are continuously adapted according to the motion properties of a number of recently received video frames. The fuzzy system utilizes two inputs, which are quantitative representations of the unwanted and the intentional camera motion. Implementation and simulation results show high performance for the proposed DVS algorithm.
