Abstract
Dense long-term tracking of a nonrigid object is difficult, and remains a fundamental research issue in the computer vision community. The task often relies on estimating pairwise correspondences between images over time, where errors accumulate and lead to drift. In this paper, we introduce a novel optimisation framework with an Anchor Patch constraint. It is designed to significantly reduce overall errors on long sequences containing nonrigidly deformable objects. Our framework can be applied to any dense tracking algorithm, e.g. optical flow. We demonstrate the success of our approach by showing significant error reduction on 6 popular optical flow algorithms applied to a range of real-world nonrigid benchmarks. We also provide a quantitative analysis of our approach under synthetic occlusions and image noise.
Introduction
Tracking a set of landmark points through multiple images is a fundamental research issue in computer vision. We define tracking in this work as the estimation of corresponding sets of vertices, pixels or landmark points between a reference frame and any other frame in the same image sequence. In the last two decades, optical flow has become a popular approach for tracking through image sequences [4, 29] in a range of fields [12, 32]. Compared with feature matching methods, e.g. [22], optical flow provides subpixel accuracy and dense correspondence between a pair of images. In this work, we focus in particular on improving tracking in image sequences using optical flow, and our contribution applies to this class of algorithms.
One of the main drawbacks of optical flow is drift [6, 16]. Errors accumulated between frames over time result in movement away from the correct tracking trajectory. Between single image pairs, this problem may not be noticeable. However, accumulation when tracking across long sequences can be particularly problematic. Several authors have previously attempted to reduce optical flow drift in tracking. DeCarlo et al. [7] introduce contour information on a human face to improve tracking stability, while Borshukov et al. [4] employ manual correction. More recently, Bradley et al. [5] proposed an optimisation method constrained by additional tracking information from multiview video sequences. Beeler et al. [2] then introduced the concept of anchor frames for human face tracking. In this approach, the sequence is decomposed into several clips based on anchor images which are visually similar to a reference frame. Their optimisation method shortens the tracking distance from reference frames to the target frame to help alleviate errors. However, their approach is domain specific (faces), and assumes that the entire face will return to a neutral expression (the anchor) several times throughout the sequence. In general, it is difficult to label anchor frames on general object sequences with large displacement motion e.g. waving cloth, as there is usually significant deformation between the reference frame and the other frames. In addition, repeated patterns are typically not global as observed in a face (return to a neutral expression). Rather, they occur in smaller local regions at intermittent intervals.
In this work, we focus on tracking long video sequences using optical flow algorithms, and specifically concentrate on reducing drift (see Figure 1). The general strategy of our approach is to shorten tracking distances for local regions throughout a long sequence. Our proposed framework combines long-term feature matching with optical flow estimation. It may be applied to the tracking of general objects with large displacement motion, and results in a significant reduction in drift. We first detect Anchor Frames for a sequence (Section 4). This provides an initial set of start points for tracking the sequence. Our main contribution is to extend this approach by proposing the concept of Anchor Patches (Section 5). These are corresponding points and patches throughout the sequence which are propagated directly from the reference frame. Our framework substantially reduces overall drift on a tracked image sequence, and may be applied to any optical flow algorithm in a straightforward manner. In our evaluation, we apply the proposed optimisation framework to 6 popular optical flow estimation algorithms to illustrate its applicability. We provide an analysis of our method using 6 synthetic benchmark sequences (Section 7) generated using a method similar to [10], three of which are degraded by adding occlusion, Gaussian noise and salt-and-pepper noise. In addition, we show its applicability on a popular publicly available real-world facial sequence with manually annotated ground truth. We show that our proposed optimisation framework significantly improves tracking accuracy and reduces overall drift when compared against the baseline optical flow approaches alone.
This paper is organized as follows: In Section 2, an overview of our proposed optimisation framework is outlined. Sections 3, 4, 5 and 6 give details of the four major steps in our framework. In Section 7, we evaluate our approach using 6 optical flow algorithms tested on 6 synthetic benchmark sequences and a real world facial sequence.
System overview
Our proposed optimisation framework reduces overall optical flow drift given long image sequences, and provides additional robustness against other issues such as large displacements and occlusions. The main procedure is shown in Table 1. The aim of our Anchor Patch Optimisation framework (APO) is to accurately track a mesh $M_R = (V_R, E_R, F_R)$ from a reference frame $I_R$ to every other frame $I_i$ in the sequence. $M_i = (V_i, E_i, F_i)$ denotes the corresponding mesh on frame $I_i$. In the following sections, the four major steps are discussed in detail.
Step one: Computing optical flow fields
The first step is to compute an optical flow field between every frame and its successor over a long video sequence in both forward and backward directions (Fig. 2). In our evaluation, we consider application of our APO framework on a number of dense correspondence optical flow or tracking approaches, e.g. Brox et al. [6], Classic+NL [28] and ITV-L1 [30]. Let
In order to evaluate the optical flow at a specific pixel
where $\alpha_1$, $\alpha_2$ and $\alpha_3$ are weights controlling the contribution of each pixel in the 3 × 3 area. Using this 3 × 3 kernel is intended to give extra robustness against subpixel inaccuracy and illumination changes [14, 17–19]. In our experiments, these weights are set to $\alpha_1 = 1$, $\alpha_2 = 0.25$ and $\alpha_3 = 0.125$, which refer to the distance from the centre pixel
Step two: Detecting anchor frames
After obtaining our optical flow fields, anchor frames are then detected in a similar manner to Beeler et al. [2], with the difference that we employ SIFT for feature matching as opposed to Normalised Cross Correlation (NCC), and additionally use our Error Score function (Section 3) to evaluate matches. The main procedure is as follows (Fig. 3):
After labeling anchor frames that are visually similar to the reference frame, these are used as a basis to partition the entire image sequence into several independent clips. This also allows computation in the next steps to be performed in parallel. In addition, the mesh $M_R$ is propagated from the reference frame $I_R$ to each anchor frame $I_A$ using SIFT matches and a direct optical flow field between them. More detail can be found in Section 6.1. The propagated mesh in an anchor frame is denoted $M_A = (V_A, E_A, F_A)$. Because of large displacement motion between anchor frames, and the fact that many images in a deformable sequence may not return to a reference point, anchor frames alone are typically insufficient to provide reliable tracking. In the next section, the Anchor Patch concept will be introduced to overcome this issue.
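As a minimal sketch of how labelled anchor frames could partition a sequence into independent clips, consider the following. The function name `partition_into_clips`, the zero-based frame indexing, and the choice to treat the first and last frames as clip boundaries are illustrative assumptions, not details taken from the framework itself.

```python
def partition_into_clips(num_frames, anchor_indices):
    """Split frame indices 0..num_frames-1 into clips bounded by anchor frames.

    Assumption: the first and last frames also act as clip boundaries, and
    neighbouring clips share their boundary anchor frame.
    """
    bounds = sorted(set(anchor_indices) | {0, num_frames - 1})
    # Each clip is an inclusive (start, end) pair of frame indices.
    return list(zip(bounds[:-1], bounds[1:]))


if __name__ == "__main__":
    # Ten frames with anchor frames detected at indices 4 and 7.
    print(partition_into_clips(10, [4, 7]))
```

Because each clip depends only on its own boundary frames, the clips could then be handed to separate workers.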
Step three: Labeling anchor patches
The motivation of the original Anchor Frame method [2] is to provide multiple Starting Points for tracking. Since error accumulates, the technique is intended to reduce overall error accumulation across long image sequences. However, as mentioned in the previous section, large displacement motion and complex motion may mean that most images in a video sequence have significant visual differences from the reference frame.
Our main observation in long sequence tracking is that local spatial patterns may be repeated throughout a sequence, i.e. part of a cloth might return to the same position several times throughout a video. We take advantage of these repeating regions in order to track between shorter segments, and thus alleviate error accumulation. Rather than taking an entire image as anchor information, an Anchor Patch is defined as a set of individual vertices or a group of pixels in a non-reference frame (any other frame in the sequence) which corresponds closely to a specific part of the reference frame. The benefit of using anchor patches is that they provide additional information for correcting accumulated errors when tracking using optical flow. This technique can also reduce the impact of a low-quality anchor frame (i.e. one that is too dissimilar from the reference frame). Before anchoring patches on non-anchor frames, we first obtain a set of high-quality SIFT feature matches between the reference frame and the non-anchor frames, i.e. those frames that are not already labelled as the reference frame or an existing anchor frame. This process proceeds as follows (see Figure 5):
The set of matches
Barycentric coordinate mapping
We wish to determine the pixel position in a non-anchor frame which corresponds to the position of a vertex on the reference mesh $M_R$ in $I_R$. These correspondences provide our baseline for stable tracking throughout the image sequence. Figure 6 illustrates the process of anchoring patches, where $v = (x, y)^T$ denotes a vertex in $M_R$ and $f^* = (x^*, y^*)^T$ denotes a SIFT feature in the reference frame $I_R$. Similarly, SIFT features are detected in a non-anchor frame $I_i$, and for this frame we have the previously obtained corresponding SIFT feature matches. We wish to calculate the new vertex position $v' = (x', y')^T$ in the non-anchor frame $I_i$. We do this by searching for the three nearest SIFT features $f^*$ in a small 5 × 5 search window centred on the vertex of interest $v$. Next, $v'$ is calculated by solving the Barycentric Coordinate Mapping equations as:
where $\beta_*$ are intermediate variables that satisfy $\beta_1 + \beta_2 + \beta_3 = 1$. In practice, we found this technique to provide an accurate transformation when applied to a small region (a 5 × 5 pixel block). However, more sophisticated (although slower) interpolation methods could also be used. The process is performed on every vertex in $M_R$.
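The mapping can be illustrated with a short sketch. It assumes the standard barycentric form, in which the weights $\beta_1, \beta_2, \beta_3$ express the vertex $v$ as a weighted sum of three reference-frame features, and the same weights are then applied to the matched features in the non-anchor frame. The function names and the collinearity tolerance are hypothetical, and the search for the three nearest features is assumed to have been done already.

```python
def barycentric_weights(v, f1, f2, f3):
    """Weights (b1, b2, b3), summing to 1, with v = b1*f1 + b2*f2 + b3*f3."""
    (x, y), (x1, y1), (x2, y2), (x3, y3) = v, f1, f2, f3
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if abs(det) < 1e-12:  # tolerance is an illustrative choice
        raise ValueError("the three features are collinear")
    b1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    b2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    return b1, b2, 1.0 - b1 - b2


def map_vertex(v, ref_feats, tgt_feats):
    """Transfer vertex v from the reference frame to a non-anchor frame.

    ref_feats: three feature positions in the reference frame.
    tgt_feats: their matched positions in the non-anchor frame, same order.
    """
    weights = barycentric_weights(v, *ref_feats)
    x_new = sum(b * f[0] for b, f in zip(weights, tgt_feats))
    y_new = sum(b * f[1] for b, f in zip(weights, tgt_feats))
    return x_new, y_new


if __name__ == "__main__":
    ref = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
    tgt = [(2.0, 3.0), (6.0, 3.0), (2.0, 7.0)]  # reference features shifted by (2, 3)
    print(map_vertex((1.0, 1.0), ref, tgt))
```

When the three matched features undergo a pure translation, as in the usage example, the mapped vertex undergoes the same translation.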
After Barycentric Coordinate Mapping, candidate anchor patches are obtained in the non-anchor frames $I_i$. The strength of the corresponding matches can be evaluated using our error equation (1). Using this error, we select the final anchor patches in a non-anchor frame $I_i$ by comparing it against a predefined threshold $\eta$.
Step four: Mesh propagation
The objective of our optimisation framework is to track a mesh $M_R$ from the reference frame to every other frame in an image sequence. Given tracking information from the previous sections, this process is separated into two steps. First, the mesh $M_R$ is propagated from the reference frame to the anchor frames (Sections 4 and 6.1). Second, the propagated mesh $M_A$ is propagated from the anchor frames to the non-anchor frames within each clip (Section 6.2).
Propagating from the reference frame to anchor frames
The mesh propagation process from the reference frame to the anchor frame is as follows:
After this stage, information for every vertex in $M_R$ is established from the reference frame to the anchor frame.
Propagating from anchor frames to non-anchor frames
The entire image sequence is partitioned into clips which are bounded by different anchor frames. The propagation process can be performed independently within these clips in parallel. Within these clips, the anchor patches are intended to improve overall tracking stability and accuracy. In order to use anchor patches in this process, we define the Nearest Anchor Patch as follows. For a vertex $v$ in $M_A$, the Nearest Anchor Patch of $v$ on frame $I_i$ is the anchor patch on the non-anchor frame $I_{i+k}$ which is nearest to $I_i$ in the image sequence. Figure 8 shows an example where frame $I_{i+k}$ is the frame which is nearest to frame $I_i$ in the image sequence and contains an anchor patch matching $v$ in anchor frame $I_A$. The main tracking procedure proceeds (Fig. 7) as follows:
Because the anchor frames divide the overall sequence into smaller clips, the mesh propagation in between can be calculated in parallel. In the next section we perform an evaluation of our framework.
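The Nearest Anchor Patch lookup defined in this section could be sketched as below. The data layout (a dictionary from frame index to the set of vertex identifiers that have an anchor patch on that frame), the function name, and the tie-breaking rule towards the earlier frame are assumptions made for illustration only.

```python
def nearest_anchor_patch_frame(vertex_id, frame_index, anchor_patches):
    """Return the index of the frame nearest to frame_index that holds an
    anchor patch for vertex_id, or None if no frame holds one.

    anchor_patches: dict mapping frame index -> set of vertex ids anchored there.
    Ties in temporal distance are broken towards the earlier frame (assumption).
    """
    candidates = [k for k, vertices in anchor_patches.items() if vertex_id in vertices]
    if not candidates:
        return None
    return min(candidates, key=lambda k: (abs(k - frame_index), k))


if __name__ == "__main__":
    patches = {3: {1, 2}, 8: {2}, 12: {1}}
    print(nearest_anchor_patch_frame(1, 9, patches))
    print(nearest_anchor_patch_frame(2, 5, patches))
```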
Evaluation
We evaluate APO with a range of 6 popular optical flow estimation methods which are publicly available from the Middlebury Evaluation System [1]. Combined local-global Optical Flow (CLG-TV) [8], Large Displacement Optical Flow (LDOF) [6] and Classic+NL [28] are state of the art, while Horn and Schunck (HS) [13], Black and Anandan (BA) [3, 28] and Improved TV-L1 (ITV-L1) [30] are classic optical flow frameworks that are also widely used. CLG-TV is a high-speed approach that uses a combination of bilateral filtering and anisotropic regularization, and is also one of the top three algorithms in the normalized interpolation error test from Middlebury. LDOF integrates rich feature descriptors with variational optical flow, and is one of the best current optical flow estimation algorithms for large displacement motion. Classic+NL provides high performance in the Middlebury evaluation by formalizing the median filtering heuristic and Lorentzian penalty as explicit objective functions in an improved TV-L1 framework. The HS method is a pioneering optical flow technique. BA improves on the HS framework by introducing a robust quadratic error formulation. ITV-L1 is a recent and increasingly popular optical flow framework which uses a similar numerical optimisation scheme to Classic+NL. We choose a mixture of newer, state-of-the-art methods and older traditional approaches to highlight that, irrespective of the approach used, our APO framework provides significantly improved tracking in all cases.
For our evaluation, we compare the optical flow estimation methods previously mentioned, with and without our optimisation framework, on 7 long benchmark sequences with ground truth. Table 3 gives an overview of the benchmark sequences used in our evaluation. In previous work, Garg et al. [10] released to the community a set of ground truth data for evaluating optical flow algorithms over long sequences. This is in contrast to the Middlebury dataset, which only considers optical flow between pairs of images, and is therefore not applicable to our framework. The sequences of Garg et al. contain 60 frames and are generated using interpolated dense Motion Capture (MOCAP) data from real deformations of a waving flag [31]. As shown in Fig. 9, we use the same MOCAP data to generate a long video sequence and three other degraded sequences, each of which contains 237 frames of size 500 × 500 pixels. The three degraded sequences are generated in order to test the robustness of our APO framework under different image conditions. They are generated by individually adding synthetic occlusions, Gaussian noise and salt-and-pepper noise with the same parameters described in [10]. In order to increase the diversity of the sequences, we include three other sequences. One is a Talking Face Video (Frank) sequence which contains 300 frames with 68 ground truth annotation points per frame. The other two are also synthetic benchmark sequences, generated using the MOCAP data of Salzmann et al. [27] from the carton and serviette deformations. One contains 266 frames of size 1024 × 768 while the other contains 307 frames of the same image size. In addition, we also consider the number of SIFT features detected in each frame, and how this affects the overall tracking stability of the APO framework. All optical flow algorithms are applied with the default parameter settings from their original papers.
Our baseline optical flow based tracking strategy, for each of the above algorithms, is performed as follows. First, the optical flow field is computed (in the forward direction) for every pair of adjacent frames in the sequence. We then mark the initial tracking points in the first frame using the ground truth points in that frame of the sequence (Table 3). The corresponding points in the next frame are computed based on the optical flow field in between. This process is repeated until corresponding landmark points are obtained in every frame of the sequence. The average Endpoint Error (EE) [1] is then calculated against the ground truth annotation points. We then apply our APO framework using the same optical flow fields. Note that the parameter values relevant to the APO framework are initially selected experimentally, but then remain constant in all our evaluations.
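The baseline strategy above can be sketched as follows. This is an illustrative approximation rather than the evaluation code: flow fields are represented as nested lists with `flow[y][x] = (u, v)`, the flow at a subpixel position is read from the nearest pixel instead of being interpolated, and the error is averaged over every frame including the first. All function names are hypothetical.

```python
import math


def track_points(points, flows):
    """Chain forward flow fields to carry points through the sequence.

    points: list of (x, y) positions in the first frame.
    flows:  flows[t][y][x] = (u, v), the flow from frame t to frame t + 1.
    Returns one list of positions per frame, starting with the first frame.
    """
    trajectory = [list(points)]
    for flow in flows:
        height, width = len(flow), len(flow[0])
        moved = []
        for x, y in trajectory[-1]:
            # Nearest-pixel lookup, clamped to the image (a simplification).
            xi = min(max(int(round(x)), 0), width - 1)
            yi = min(max(int(round(y)), 0), height - 1)
            u, v = flow[yi][xi]
            moved.append((x + u, y + v))
        trajectory.append(moved)
    return trajectory


def average_endpoint_error(trajectory, ground_truth):
    """Mean Euclidean distance between tracked and ground truth points."""
    errors = [math.hypot(px - gx, py - gy)
              for tracked, truth in zip(trajectory, ground_truth)
              for (px, py), (gx, gy) in zip(tracked, truth)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    flow = [[(1.0, 0.0)] * 4 for _ in range(4)]  # uniform shift of one pixel in x
    traj = track_points([(0.0, 0.0), (1.0, 1.0)], [flow, flow])
    truth = [traj[0], traj[1], [(2.0, 3.0), (3.0, 1.0)]]
    print(average_endpoint_error(traj, truth))
```

Because each position is obtained from the previous estimate, any error in one flow field is carried into all later frames, which is the drift that the framework targets.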
Table 3(a) shows the average Endpoint Error (AEE) in pixels over all the frames of the sequences. We highlight the three best AEE measures for each sequence using superscripts next to the values. Notice that APO significantly reduces the AEE compared to the baseline optical flow methods. Our optimisation framework yields the best AEE measure in all cases. For instance, ITV-L1 with APO performs best on the sequence Original, while LDOF with APO yields the best result on the sequence Frank. We also observe that although the improvement on the Gauss.Noise and S&P.Noise sequences is smaller than on the unaltered sequences, the overall result is still improved by the addition of APO. We also observe that LDOF gives good results even without APO. This is because the LDOF framework takes into account both the regular optical flow energy and feature matching, and the latter contributes additional accuracy.
Table 3(b) shows another experiment, in which we run the methods of Garg et al. [10] and Pizarro et al. [25] on our benchmark sequences using a direct tracking strategy (results on Carton, Serviette and Frank are not available). Here we compute the optical flow fields directly from the reference frame to every other frame of the sequence. The annotation points are then directly tracked to the test frames using those flow fields. Note that the numbers in Table 3(b) may be slightly different from their original work [11]. This is because, first, our sequences are extended to 237 frames, which is around 3 times longer; second, we evaluate the tracking results on only 160 annotation points instead of all the pixels. We observe that both state-of-the-art approaches (Garg et al. and Pizarro et al.) give higher accuracy than any of the baseline methods in Table 3(a). The underlying reasons are that (1) the tracking distance is minimal for Garg et al. and Pizarro et al., which greatly reduces the accumulated errors; and (2) both Garg et al. and Pizarro et al. have shown high accuracy for nonrigid surface tracking in previous work [10, 25], and all our sequences contain a single nonrigid object. However, such a direct tracking strategy cannot handle the situation where objects may be temporarily out of the scene. In addition, the object appearance in the reference frame may be significantly different from that in some other frames of the sequence, which brings extra difficulty to optical flow estimation. In this measure (Table 3(b)), our
While we concern ourselves primarily with tracking over long sequences, shorter sequences are considered as well. In Table 4, the AEE measures of various methods are compared on the first 30 frames of our benchmark sequences. We observe similar AEE measures as in the long sequence case (Table 2). The APO framework significantly increases the tracking accuracy, outperforming the baseline tracking methods in all cases even given degradation (e.g. Gauss.Noise and S&P.Noise). Moreover, BA with APO is also observed to overfit on the noisy sequences, while Classic+NL with APO yields the best measures on both the Gauss.Noise and S&P.Noise sequences.
We also evaluate the effect on tracking accuracy of varying the number of selected features. Different proportions (50% and 0%) of features are randomly selected from the initial full set of detected features before performing Anchor Patch detection. Information on our total number of features can be found in Table 3, e.g. there are on average 364.80 features on each frame of the sequence Original. Table 5 shows an AEE comparison given various numbers of features. We observe that AEE is improved given more features in all cases. Another interesting observation is that our optimisation framework provides lower error than the baseline tracking strategy even given sparse or no features (0% features). Note that in this case, our APO framework defaults to using an optical flow method with just the Anchor Frame approach [2]. Also note, for example by comparing to Table 3, that this indicates that the APO framework provides a significant tracking improvement over using anchor frames alone.
We also make visual comparisons on two of our sequences, Frank and Serviette. The former is a real-world sequence with ground truth annotation points, while the latter is a synthetic sequence overlaid with a ground truth mesh. In Fig. 10, we observe noticeable drift problems given the baseline optical flow tracking strategy. More details can be found in the corresponding video footage, where we visually show that our framework significantly reduces the drift.
The computational cost of our framework relies heavily on the underlying optical flow method, because we need to calculate the optical flow fields twice (forward and backward) for every pair of adjacent images. Apart from this, our framework can be implemented in a parallel fashion. Anchor frames divide the sequence into clips which give multiple start points for tracking. In the implementation, a GPU version of SIFT [9] is applied for feature detection and matching (around 10 frames per second on our benchmarks). The whole framework is built on the CUDA platform. Assuming all optical flow fields have been obtained, our framework processes around 2 frames per second on our benchmarks using a computer with a 2.9 GHz 8-core Xeon, an NVIDIA Quadro FX 580 and 16 GB of memory.
Conclusion
In this paper, we have presented a novel optimisation framework using an Anchor Patch constraint, which improves the tracking of a mesh or sparse points through long image sequences. Our optimisation framework temporally anchors image regions throughout the sequence in order to mitigate the effect of Error Accumulation (Drift). In the evaluation, our approach is combined with 6 popular optical flow algorithms and shows significant improvement against the baseline methods on 7 benchmark sequences. These datasets include 6 synthetic benchmark sequences with real-world deformation and 1 real-world sequence.
Acknowledgments
We thank Ravi Garg and Lourdes Agapito for providing their GT datasets. We also thank Gabriel Brostow and the UCL Vision Group for their comments. The authors are supported by the EPSRC CDE EP/L016540/1 and CAMERA EP/M023281/1; and EPSRC projects EP/K023578/1 and EP/K02339X/1.
