Abstract
This work proposes a system designed to estimate the ego-motion of a synchronized, calibrated stereo camera in scenes containing a moderate number of moving objects. This is particularly useful in busy road scenes and populated urban areas. The key novelty of the proposed approach is that it estimates the motion of clusters of pixels between stereo frames, which allows it to explicitly reject clusters in motion. This is in contrast to current state-of-the-art algorithms, which tend to treat moving elements as outliers to be removed using strategies such as RANSAC or M-estimators. Unfortunately, treating moving pixels as outliers can give poor performance when the moving pixels represent a significant portion of the image. The proposed approach overcomes this when the motion is due to many independently moving objects (such as people or cars). Our experiments show promising results in a variety of urban environments.
Keywords
Introduction
Computer vision from a sequence of images is far more powerful than from just a single frame. However, in order to relate an entire sequence of images, it is important to know the location of the camera at the point of each image’s acquisition. Visual odometry is the name given to the problem of estimating the motion of the camera between frames, based only on image data. The search for good visual odometry solutions is a popular topic in the literature, especially in conjunction with the automation of cars [1, 2, 3], simultaneous localization and mapping (SLAM) [4], and unmanned aerial vehicles (UAV) [5].
This work focuses on estimating the trajectory of a synchronized stereo camera in an urban environment with moving objects. The proposed algorithm specifically seeks to estimate the motion of a free moving (six degrees of freedom) stereo camera between frames captured at a moderate frame rate (30 frames per second).
The challenge in such an environment is that a large number of pixels within an image can be associated with objects that are moving in the scene. Many state-of-the-art algorithms treat moving objects as outliers, which are filtered using a strategy such as RANSAC [6] or M-estimators [7]. Unfortunately, when outlier counts exceed certain thresholds, these algorithms tend to perform poorly.
The strategy, proposed in this work, is to find several clusters of corresponding pixels between images. Then, using a direct linear transform (DLT), each of these clusters is able to yield an independent motion estimate of the camera between images. If the scene consists of some or many independently moving objects with a generally static background, it is expected that there will be a variety of motion estimates. However, those associated with background elements should be relatively consistent, while those associated with several independently moving foreground elements would generally not be. Thus, the estimated motion of the camera, in such cases, may be determined from the set of corresponding clusters of pixels that have the most agreement in terms of estimated camera motion.
A review of current literature is given in Section 2. The proposed visual odometry solution is described in Section 3. The framework used to assess the solution is given in Section 4. Results are in Section 5. This work then concludes with Section 6 and future work is described in Section 7.
Literature review
Visual odometry is the problem of estimating the motion of the camera between frames, based purely on image data. Typically, these estimates are computed from the way the physical positions of specific features evolve between images.
One popular choice of feature is a sparse set of 2-D points obtained by corner detectors such as the Harris detector [3] and the Features from Accelerated Segment Test (FAST) detector [2]. These 2-D image points are typically converted to 3-D space using either the properties of calibrated stereo cameras [6-8] or, in the case of monocular systems, a probabilistic map [4, 5]. Once the 2-D points have been converted to 3-D, camera motion can be estimated using an algorithm such as EPnP [9] if the problem is formulated as a 3-D to 2-D mapping, or the approach of Arun et al. [10] if the problem is formulated as a 3-D to 3-D mapping. If a 3-D to 3-D mapping is chosen in the context of depth from stereo, it is recommended that scale biases be accounted for with an appropriate technique, such as the one proposed by Dubbelman and Groen [11].
Another approach to estimating camera motion between images is to register dense depth maps using the iterative closest point (ICP) algorithm [12]. This approach is typically used in conjunction with algorithms that acquire depth maps using active depth sensor cameras such as Intel
It is also possible to estimate camera motion between images using gray-scale or color information directly. A typical strategy to achieve this is to use the Lucas-Kanade inverse compositional algorithm [15], as can be seen in work by Klose et al. [16] and Forster et al. [5].
All of these approaches assume that the variation of the chosen features between images is purely due to the motion of the camera. Of course, this may not be the case, especially if there are moving elements in the scene. Unfortunately, moving objects are rather common, especially in urban settings. By far the most common approach in the state of the art is to treat moving objects as another source of noise [1], along with lighting variance, occlusions and correspondence mismatches. There are several strategies for dealing with such noise, RANSAC [1, 3, 6, 17] being the most popular. The Kalman filter [8] is another commonly used approach, along with M-estimators [7]. Another popular technique takes advantage of the iterative non-linear solvers that are typically used to refine camera motion estimates, and introduces weighted errors [18] to minimize the effect of outliers on the final result. The perceived problem with these approaches is that most filtering schemes assume that outliers represent a small percentage of the data, which may not be the case in typical congested and cluttered urban scenes.
The few approaches that have explicitly tackled the problem of scene motion with respect to visual odometry employ strategies revolving around basic low-level object recognition. Kitt et al. [8] use a Support Vector Machine (SVM) [19] scheme trained to identify objects traditionally associated with motion (cars), and mask out those objects during ego-motion estimation.
The main advantage of object recognition is that it is able to identify a group of pixels within an image and assign a set of attributes to that group. It is often assumed that a group of pixels that all corroborate a single classification is more reliable than an approach that operates on a pixel-by-pixel basis to filter outliers. However, object recognition can be slow, often requires training data, and sometimes leads to the unnecessary filtering of pixels (a stationary car may be filtered out if cars are assumed to be moving objects). The proposed approach uses the idea of Engel et al. [4] and Yao et al. [20] to track against regions of high gradient change within images, which can be relatively common in man-made uniform environments that are generally devoid of much texture. However, this work exploits the characteristics of these regions, which tend to be clusters of pixels with assigned attributes that determine whether those clusters are reliable for feature tracking.
Proposed approach
Prerequisites
The proposed system is specifically designed to be used in conjunction with a synchronized stereo camera system as
It is also expected that the camera system
Here
where
A distortion model is also extracted for each camera, which in this work, is defined as follows:
where
The extrinsic matrix is defined as
where
Calibration may be performed using the popular Zhang calibration [21]; however, the experimental work featured in this article uses Tsai’s calibration [22], following the approach of Gee et al. [23].
After calibration, each stereo pair
It is assumed that stereo matching [25] is performed on each rectified stereo pair
where
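The relationship between a disparity map and a depth map for a rectified stereo pair is commonly written as Z = f B / d. The following is a minimal sketch of that standard conversion, not the implementation used in this work; the names `disparity_to_depth`, `focal_px` and `baseline_m`, and the convention of marking invalid pixels with NaN, are assumptions made for illustration.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert a disparity map (pixels) to depth (metres) via Z = f * B / d.

    Pixels with non-positive disparity are treated as invalid and set to NaN
    (an assumed convention for this sketch).
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```

With a short baseline, depth resolution falls off quickly with distance, since a fixed disparity error produces a depth error that grows with the square of the depth.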
After a sequence of images has been acquired, rectified and stereo matched, the proposed system attempts to estimate the trajectory of the camera by estimating the motion of the camera between consecutive rectified stereo pairs
Given rectified stereo pairs
The first step in doing this, is to identify a set of pixels
Feature detection pipeline. Feature detection starts with determining the magnitude of the gradients within the image and extracting areas with the largest gradients as candidate feature points. Depth filtering is performed to remove features around edges. Finally, clustering is performed to identify each cluster feature.
Candidate features. The original image is on the left, while the candidate feature map after a threshold has been applied is on the right (candidate features are in white).
Ideal cluster features
Given an image of size
and the gradient in the
which leads to an image gradient estimate of:
The resulting map
After finding the gradient map
where
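The candidate feature step (gradient magnitude followed by a threshold) can be sketched as follows. This is an illustration under stated assumptions rather than the exact formulation of this work: central differences from `np.gradient` stand in for the gradient estimate, and the function name `candidate_features` and the `threshold` parameter are hypothetical.

```python
import numpy as np

def candidate_features(gray, threshold):
    """Return a boolean map of pixels whose gradient magnitude exceeds threshold.

    Assumption: gradients are estimated with central differences; the
    estimator used in the original work may differ.
    """
    gray = np.asarray(gray, dtype=np.float64)
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (columns, x).
    gy, gx = np.gradient(gray)
    magnitude = np.hypot(gx, gy)
    return magnitude > threshold
```

For a vertical step edge, this marks the two pixel columns on either side of the step, which is consistent with candidate features forming thin, edge-like regions.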
Cluster features. (Left) Original Image. (Right) Cluster features overlaid on the original image. Each different cluster is marked in a different color.
As stated before, candidate features on the physical edges of objects are undesirable. These are filtered out from the candidate feature map
Filter candidate features on physical edges.[1] FilterFeatures
where
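One plausible way to remove candidate features that lie on physical edges is to reject pixels whose local depth neighbourhood contains a large discontinuity. The sketch below illustrates that idea only; the 3x3 window, the function name `filter_depth_edges` and the `max_jump` parameter are assumptions and are not taken from the filtering algorithm of this work.

```python
import numpy as np

def filter_depth_edges(candidates, depth, max_jump):
    """Keep candidate pixels whose 3x3 depth neighbourhood varies by at most max_jump.

    Assumption: a physical edge shows up as a large local depth range.
    Pixels with NaN depth in their neighbourhood compare False and are dropped.
    """
    depth = np.asarray(depth, dtype=np.float64)
    padded = np.pad(depth, 1, mode="edge")
    h, w = depth.shape
    # Stack the 3x3 neighbourhood of every pixel along a new leading axis.
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    local_range = windows.max(axis=0) - windows.min(axis=0)
    return np.asarray(candidates, dtype=bool) & (local_range <= max_jump)
```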
Clustering
The final stage in the proposed feature detection is to cluster features into groups. The proposed strategy is to group features into connected regions based on 8-neighbor connectivity. It is hoped that clusters formed in this way represent areas of high gradient variation within the image. Note that since the features described in this system correspond closely with edge features, 8-neighbor connectivity was selected in order to identify cluster groups with a diagonal orientation.
Filter candidate features on physical edges.[1] Cluster
With respect to Algorithm 3.2.4, assume that the function
Figure 3 shows the original image of a scene and the identified clusters in that scene after applying the proposed algorithm. Each cluster is marked with a different color.
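Grouping candidate features by 8-neighbor connectivity can be sketched with a breadth-first flood fill. This is a minimal illustration, not the clustering algorithm listed in this work; the function name `cluster_features` is hypothetical, and the `min_size=7` default is an assumption based on the statement that a cluster feature consists of seven or more pixels.

```python
from collections import deque

import numpy as np

def cluster_features(mask, min_size=7):
    """Group True pixels of mask into 8-connected clusters of at least min_size pixels.

    Returns a list of clusters, each a list of (row, col) tuples.
    """
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    visited = np.zeros_like(mask)
    clusters = []
    for sy, sx in zip(*np.nonzero(mask)):
        if visited[sy, sx]:
            continue
        visited[sy, sx] = True
        queue, pixels = deque([(sy, sx)]), []
        while queue:
            y, x = queue.popleft()
            pixels.append((int(y), int(x)))
            # Visit all eight neighbours, including the diagonals.
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny, nx] and not visited[ny, nx]):
                        visited[ny, nx] = True
                        queue.append((ny, nx))
        if len(pixels) >= min_size:
            clusters.append(pixels)
    return clusters
```

A one-pixel-wide diagonal line is returned as a single cluster here, whereas 4-neighbor connectivity would split it into isolated pixels.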
Cluster feature matching
The main purpose of the proposed cluster feature is to find robust correspondences between consecutive frames, assuming a small amount of rigid motion between frames. Constructing a feature from multiple pixels is intended to yield more robust matching than matching single pixels based on feature descriptors or optical flow. Another advantage of the proposed cluster feature is that it consists of seven or more pixels, which means that it is possible to estimate the motion of the cluster feature across images based on its internal correspondences. This has three advantages:
1. In perfectly static scenes, all cluster feature matches should yield similar rigid transforms. Thus, outliers should be easily identifiable.
2. An inability to find a single rigid transform that adequately describes the motion of a feature cluster can indicate non-rigid motion of the pixels that make up the cluster feature, or inaccurate depth estimates in the depth map. Either way, it identifies the feature as untrustworthy to track against, and therefore the cluster feature can be rejected. This is preferable to rejecting the cluster later on in the tracking process as an outlier.
3. When a small minority of the pixels making up the cluster feature are mismatched, they can be automatically corrected, as the points making up the cluster feature are assumed to be a rigid structure (consistent with a static scene assumption).
Given left images
The estimation of camera motion begins by recognizing that each cluster feature
Assume that the set
Deriving the DLT can follow a similar approach to that of Tsai calibration [22]. Thus, the goal is to solve the system of equations given by:
where
given that
where
Matched features. Image 
As with Tsai calibration, the magnitude of
After camera motion matrix
The degree to which the camera motion estimate of the cluster feature between images accounts for the motion of its pixels can be estimated using an error function that finds the average difference between the corresponding points
If the value of
The goal of the above section was to estimate the camera motion encoded as matrix
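To illustrate how a single cluster of correspondences can yield a motion estimate, the following sketch implements a generic 3-D to 2-D direct linear transform. It is not a reproduction of the system of equations used in this work: it assumes normalized image coordinates (intrinsics already removed), at least six non-coplanar points, and a hypothetical function name `dlt_motion`. The unknown scale is resolved by forcing the rotation block to be orthonormal.

```python
import numpy as np

def dlt_motion(points_3d, points_2d):
    """Estimate (R, t) such that points_2d ~ project(R @ X + t).

    points_3d: (N, 3) points in the previous camera frame, N >= 6.
    points_2d: (N, 2) normalized image coordinates in the current frame.
    """
    X = np.hstack([np.asarray(points_3d, float), np.ones((len(points_3d), 1))])
    u, v = np.asarray(points_2d, float).T
    zeros = np.zeros_like(X)
    # Each correspondence contributes two rows: p1.X - u p3.X = 0 and p2.X - v p3.X = 0.
    A = np.vstack([
        np.hstack([X, zeros, -u[:, None] * X]),
        np.hstack([zeros, X, -v[:, None] * X]),
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    P = vt[-1].reshape(3, 4)
    # Resolve the sign ambiguity so that the rotation block has positive determinant.
    if np.linalg.det(P[:, :3]) < 0:
        P = -P
    # Project the rotation block onto the nearest rotation and recover the scale.
    U, S, Vt = np.linalg.svd(P[:, :3])
    R = U @ Vt
    t = P[:, 3] / S.mean()
    return R, t
```

With noisy correspondences, the spread of the three singular values in `S` gives a rough indication of how far the linear solution is from a true rigid transform.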
So far, a feature consisting of clusters of pixels with high gradient variance has been introduced, and a means of finding corresponding clusters in other images has been explained. The process of matching these cluster features has produced a set of
The goal of this section is to refine this set of estimates into a single estimate. The challenge is, of course, that there is expected motion between the image
The main strategy, used to find the motion with the most consensus, is to group motion estimates based on their similarity to each other, and find the group with the most members. In order to do this, a variant of the RANSAC [29] algorithm is used. Firstly, each motion estimate is converted from a rotation-translation matrix to its associated Lie algebra
The final pose
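The consensus step can be sketched as follows. This is a simplified illustration under several assumptions: each motion estimate is packed into a 6-vector using the so(3) logarithm of the rotation followed by the raw translation (not the full se(3) logarithm), every estimate is tried as a hypothesis rather than sampled at random, and the function names `to_twist` and `largest_consensus` and the `tol` threshold are hypothetical.

```python
import numpy as np

def to_twist(R, t):
    """Pack (R, t) as a 6-vector: rotation vector (so(3) log of R) followed by t.

    Assumes a small inter-frame rotation (well away from 180 degrees).
    """
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    # This vector equals sin(theta) times the unit rotation axis.
    axis_sin = np.array([R[2, 1] - R[1, 2],
                         R[0, 2] - R[2, 0],
                         R[1, 0] - R[0, 1]]) / 2.0
    omega = axis_sin if theta < 1e-8 else axis_sin * theta / np.sin(theta)
    return np.concatenate([omega, np.asarray(t, float)])

def largest_consensus(twists, tol):
    """Return the mean of the largest group of mutually similar estimates and its members."""
    twists = np.asarray(twists, float)
    # Pairwise Euclidean distances between all motion estimates.
    dists = np.linalg.norm(twists[:, None, :] - twists[None, :, :], axis=2)
    inliers = dists <= tol
    # The hypothesis supported by the most estimates wins.
    best = np.argmax(inliers.sum(axis=1))
    members = np.nonzero(inliers[best])[0]
    return twists[members].mean(axis=0), members
```

Note that the distance mixes radians and translation units; in practice the two parts would likely need separate thresholds or a weighting.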
Pose estimate. The original image at time 
In this work, a new feature for ego-motion tracking was introduced, called the cluster feature. The goal of the experimentation proposed below is to verify that there is a significant advantage of using the proposed cluster feature, in the context of moving urban environments, over other state-of-the-art algorithms. This section introduces the equipment and evaluation metric used. The results of the experiments are then covered in Section 5.
Equipment
The experiments performed in this section were conducted using two GoPro Hero 3+ Black Edition cameras with in-built hardware synchronization. These cameras were housed in a waterproof stereo box manufactured by GoPro (see Fig. 6) with a 33 mm baseline. The cameras were set to capture images with size 1720
GoPro camera. The GoPro Hero 3+ camera in the waterproof stereo case used within this section.
Experiments were executed on an Intel
Image database. Selected stills from each of the twelve video sequences used for the presented experiments. Note that this database is expected to be available at 
The proposed algorithm requires three provided threshold parameters for noise filtering. These three parameters are
Two state-of-the-art approaches represent the general variety of ego-motion estimation algorithms available: feature-based and dense pose estimation.
Feature-based algorithms: A common approach to estimating camera motion is to use feature point matching. The basic idea is that two frames with a significant amount of overlap are provided to the algorithm. Our implementation finds corresponding feature points using the FAST feature detector [32] and Lucas-Kanade optical flow [26]. The correspondences are then converted into 3-D points (using the provided depth maps), leading to a pose estimate by alignment using iterative least squares based on Levenberg-Marquardt. Feature point outliers are filtered using a RANSAC-based derivation of the fundamental matrix [33]. Many elements of this implementation mirror state-of-the-art implementations such as FOVIS [34].
Dense pose estimation: Steinbrücker et al. [35] provide a fast, accurate implementation of a dense pose estimation algorithm for RGB-D cameras. While this algorithm was strictly developed for active depth sensor cameras such as Microsoft’s Kinect, it was found to work reasonably well on depth maps derived from stereo matching. It is included in this experimentation because dense pose estimation is quite popular at the moment, with several published works claiming that dense strategies give superior performance over feature-based approaches [36].
In order to test robustness to scene movement, a collection of twelve video sequences was captured. Each of these video sequences featured moving elements. Care was taken to select video sequences from a variety of scenes, including indoor scenes, outdoor scenes, scenes with many man-made structures and scenes that consisted of natural objects (such as trees, plants and landscape). Stills from each of the videos are shown in Fig. 7.
It should be noted that several databases of video sequences already exist for the purpose of validating visual odometry algorithms. However, the authors of this work were particularly interested in scenes with various types of motion, which is typically not the focus of such videos. Therefore, it was decided to create a new database of videos specifically for this work. This database is expected to be available at
Assessment
Experiments were generally performed by processing the captured video sequences with the algorithms. The main measure of accuracy used was photometric error. Photometric error measures the quality of motion estimate
Intuitively, the smaller the photometric error, the more accurate the pose estimate. Naturally, there are other factors as well including lighting changes, movement in the scene and depth map accuracy. However, the approach should still be useful when comparing the results of different algorithms using the same data. Indeed, an algorithm that is able to consistently produce smaller photometric errors than another on the same data is clearly more accurate, despite the fact that it may not achieve “perfect” photometric scores.
The derivation of photometric error may be formulated starting with image
where
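The idea behind photometric error can be sketched as follows: pixels of the previous image are back-projected using their depth, moved by the estimated motion, re-projected into the current image, and compared by intensity. This is an illustration under stated assumptions, not the exact formulation of this work: it assumes a pinhole intrinsic matrix `K` without skew, a pose (`R`, `t`) that maps previous-camera coordinates to current-camera coordinates, nearest-neighbour sampling, and a mean absolute difference. The function name `photometric_error` is hypothetical.

```python
import numpy as np

def photometric_error(img_prev, depth_prev, img_curr, K, R, t):
    """Mean absolute intensity difference after warping img_prev into img_curr."""
    h, w = img_prev.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = np.isfinite(depth_prev) & (depth_prev > 0)
    z = depth_prev[valid]
    # Back-project valid pixels of the previous frame to 3-D camera coordinates.
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1) @ R.T + t
    # Keep only points that land in front of the current camera.
    front = pts[:, 2] > 0
    pts = pts[front]
    # Project into the current image and round to the nearest pixel.
    u2 = np.rint(K[0, 0] * pts[:, 0] / pts[:, 2] + K[0, 2]).astype(int)
    v2 = np.rint(K[1, 1] * pts[:, 1] / pts[:, 2] + K[1, 2]).astype(int)
    inside = (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
    src = img_prev[valid][front][inside].astype(np.float64)
    dst = img_curr[v2[inside], u2[inside]].astype(np.float64)
    return np.abs(src - dst).mean() if src.size else np.nan
```

Pixels that leave the image or have no valid depth are excluded from the average, so two algorithms should be compared on the same frames and depth maps for the scores to be meaningful.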
Database sequence photometric errors
The first set of experiments was performed against scenes with moving elements to test the claim that the proposed algorithm is more robust in scenes with moving elements than other algorithms. The results are shown in Table 1.
Photometric error (avg ± std) in moving scenes
The results show that the proposed algorithm was the top-performing algorithm in all videos except the last one. The last video featured a mostly static indoor scene with two slow-moving human subjects, footage that is well suited to the direct matching algorithm, which was therefore the top performer on this sequence. However, the direct matching algorithm appeared to be the most susceptible to motion in the images. The feature-based approach was generally more robust, but still deteriorated significantly as the level of motion in the images increased. The proposed algorithm was significantly better in videos 8 (a forward-motion scene with a woman walking directly in front of the camera), 10 (a large amount of moving foliage) and 11 (a close-up of bushes), which were the videos with the most motion.
Timings per frame for each of the tested algorithms are provided in Table 2. While the proposed algorithm is slower than the dense RGB-D algorithm, its timing is similar to that of the feature tracking approach. Note that these timings only include the time taken to perform visual odometry, and do not include the preprocessing of the images (i.e. calibration, image acquisition, distortion removal, rectification, stereo matching, and disparity-to-depth conversion). It should be noted that the same preprocessing is required for each of the three presented algorithms.
Average time per ego-motion estimation strategy
A road walk experiment of about 200 m was performed along a segment of Karangahape Road in Auckland, New Zealand. The experiment was conducted on a Friday afternoon, when Karangahape Road, a street in the Auckland CBD, is fairly busy. The experiment was conducted in order to test the proposed system in conditions of moving traffic and pedestrians.
Karangahape walk. Selected stills from the walk down Karangahape Road showing the moving traffic and pedestrians that were present in the scene.
Figure 8 shows some of the moving elements that were present in the scene. The sequence of 4628 calibrated stereo frames was given as input to the proposed algorithm, which estimated the trajectory of the person walking with the camera. The superimposed top view of the path is shown in Fig. 9.
Details of the road walk experiment
Karangahape path. The path (dotted line) acquired using the proposed system superimposed on a map of the walk acquired from Google Maps.
Roadside statue. An image taken from a 300 stereo frame sequence of a statue sitting on the side of Karangahape Road.
Model top view. A top view of the 3-D reconstruction of the camera. The path of the camera, estimated by the proposed algorithm, is shown as a red dotted line.
Model front and back views. A front view (top) and a back view (bottom) of the generated model are shown.
Table 3 shows the details of the road walk experiment. Note that the processing time of 38 minutes 29 seconds is significantly longer than the actual walk time of 2 minutes 30 seconds. The main reason for this is that the algorithm currently processes one frame every half second, while the camera was capturing at 30 frames per second. However, as seen in Section 5.2, our implementation is only slightly slower than the feature point approach. It should also be noted that the algorithm is executed on the CPU without much optimization and without parallelization. Clearly, if the goal of this algorithm is to support real-time applications such as driverless car navigation, an important area of future work is to establish whether the algorithm can be sped up significantly without a significant loss in accuracy.
As a final test, it is important to verify that the proposed strategy for visual odometry is able to support 3-D reconstruction. While the primary focus of the work presented in this paper has been to track a calibrated stereo camera in a moving scene, moving objects still pose a problem in a 3-D reconstruction pipeline after odometry, since it is hard to merge a sequence of point clouds containing moving objects. For this reason, it was decided to focus the 3-D reconstruction test on a relatively static scene. Figure 10 is a frame taken from a sequence of 300 stereo frames of a roadside statue. The images were acquired using the GoPro Hero 3+ Black Edition cameras.
The proposed algorithm was used to estimate the trajectory of the stereo camera as it captured the sequence. Stereo frames were then merged together within a Truncated Signed Distance Function [37]. The marching cubes algorithm [38] was used to extract the final mesh. The top view of the resulting 3-D model is shown in Fig. 11 and the front and back views are shown in Fig. 12.
Conclusions
The aim of this work was to improve ego-motion estimation. The proposed approach was to introduce a new, more robust feature which was named the “cluster feature”.
The cluster feature is composed of multiple feature points, which leads to its improved robustness. This composition also allows small mismatch errors to be corrected, and allows match correctness to be estimated with respect to both internal consistency (the consistency of the matches of the features that make up the cluster feature) and external consistency (the degree to which the cluster feature agrees with other cluster features within the image). Finally, the cluster feature combines the feature-based “reprojection error” metric with the “photometric error” metric and yields a compromise between the two approaches.
Experimental results demonstrated that the proposed feature is capable of performing visual odometry in scenes with moderate amounts of movement. In this realm, the proposed algorithm outperforms the other state-of-the-art approaches tested.
Future work
The approach in this work shows promise as a solution to the problem of determining the motion of a camera from images in scenes with a variety of moving objects. However, there is clearly a lot more research to be done before the proposed solution is viable for modern applications such as augmented reality and real-time 3D reconstruction. Speed is naturally the foremost concern, and the authors are already working on ways to speed up the algorithms using parallelism, SIMD and GPU technologies. There is also some work going into a faster, non-iterative cluster feature matching algorithm. Another project is investigating whether clustering and matching using models acquired through machine learning is a viable fast solution.
The assessment presented in this work is at the level of a proof of concept, using fairly good quality cameras without much motion blur or illumination artifacts, and a single metric with respect to accuracy. A much more thorough investigation needs to be undertaken with a much larger database of images captured with a variety of different camera types and conditions. The algorithm needs to be assessed with several different types of evaluation metrics, such as the Hausdorff distance [39], in both the image plane (in pixels) and the scene domain (in metric units such as millimeters). Such a study could answer research questions about the robustness of the proposed algorithm in difficult conditions and establish the types of situations that favor the proposed algorithm and those that do not. The effect of calibration quality on the algorithm can also be assessed, along with the effect of various types of distortion models and noise models.
A final area of interest is the 3D reconstruction of moving scenes. The 3D reconstruction presented in Section 5.4 was chosen to be static because merging moving elements is difficult. However, there are several ideas as to how this problem can be avoided or overcome. One such idea is to reconstruct only static elements and leave out moving elements; this would entail, for example, generating a 3D reconstruction of a street while ignoring the pedestrians and vehicles. Another idea is to capture scenes with known moving objects and use motion models to predict and deform point clouds so that motion across frames may be reconciled.