Abstract
The aim of this paper is to track objects in a sequence of surveillance video frames by providing accurate object contours. Video analysis is performed in four key phases: (i) modeling the object/background by means of the Gaussian Mixture Model and the Kalman Filter, (ii) extracting a rough contour for the object through separating the moving objects from the still background, (iii) finding the corresponding points and the object location using the shape context matching algorithm, and eventually, (iv) extracting the exact shape contour. The suggested methodology is accurate enough to track objects in terms of scale change, translation, rotation, and partial occlusion and it can be used in real-time applications. Results on different video sequences establish the capability of this technique for a variety of applications such as traffic control cameras, speedometers, CCTV security cameras, photography and etc.
Introduction
Visual object tracking is defined as finding the location of a favorite moving object in a range of video frames; in other words, the objective is to find a region in each frame which is occupied by the target object at any given time [1, 2]. The past thirty years have seen increasingly rapid advances in object tracking algorithms due to developments in computer technology, high quality video cameras, and growing need for automatic analysis of video frames [3]. Visual tracking is a major area of interest within the field of machine vision and it has been applied in many aspects including motion-based recognition, robot vision, video surveillance, human-computer interactions, image compression, etc. Although acceptable results have been reported under controlled situations with fixed cameras and rigid objects, these situations are quite limited in real-life applications. There are still numerous challenges in this issue, such as complex movements of objects, effects of noise, brightness variations, scale changes or rotations of the objects, partial occlusion as well as non-rigid changes of an object, complex environments, and the need for real-time analysis [4, 5, 6, 7, 8, 9, 10, 11]. An important issue in object tracking is choosing optimal, fast and reliable tracking algorithms [5].
The first serious discussions and analysis of the object tracking emerged during the 1980s considering the movement of objects as a continuous and smooth route and there was no abrupt appearance change. However, research has consistently shown that this simple assumption differs from what happens in the real world. Hereupon, recent algorithms aimed to consider abrupt appearance changes and even complicated conditions in which the object leaves the scene. In this regard, successful tracking systems should meet three important conditions [3]: (i) Robustness: the capability of tracking the intended object even in complex situations like cluttered background or complex object movement, (ii) Adaptability: the ability to adapt the system with the changes in the object shape, and (iii) Real-time processing: highly efficient algorithms together with fast equipment for live video streams.
Several approaches have been suggested in order to overcome these situations. These approaches are divided into the deterministic methods based on optimization solutions [6, 7, 8], and the probabilistic methods based on the posterior probability distribution estimation [9, 10, 11, 12, 13, 14]. In most of these tracking approaches, objects are assumed to be in an approximately rectangular or elliptical shape. However, many objects have more complex shapes; for example, the image of a hand with open fingers or the image of the body of some animals like snakes or scorpions cannot be represented by simple shapes. In some recent studies, attempts have been made to find more accurate shape estimate of complex objects [15, 16]. Ran and Malik [15] used figure/ground segmentation by means of the Conditional Random Field (CRF) to compute a mask by which they tracked the object. This technique provides a robust tracking system, which is appropriate for long video streams like a soccer match. However, they regrettably shown that the generalisability of such results to real-time applications is problematic, due to high computational complexities of the figure/ground segmentation.
The most popular tracking method is the active contour model proposed by Kass et al. [17]. The goal of this method is to get a tight contour which enfolds the object. In earlier studies, the contour energy was described by the gradient of the image. But the use of gradients made the algorithm sensitive to noise and resulted in low contrast images. Some techniques solved these problems by integrating the region based criteria into the contour models; however they still have a number of problems in use, including occlusion and cluttered background [18, 19].
Variations in the background (especially the brightness) is another important factor to affect the performance of the segmentation technique. So far, several algorithms present dynamic description for backgrounds, which improve the quality of the detection of moving objects. Some of these algorithms are modeling algorithms such as the Kalman Filter [20] and the Gaussian Mixture Model (GMM) [21, 22]. In these techniques, first, the background is modeled by which, an initial location for the moving object is estimated. Then, the context modeling algorithms determine the exact location of the preferred object [23]. In order to achieve this goal, the context matching algorithms first receive the image of the pattern and according to its features, they search for the target in another image. These techniques handle the background variations in this account.
This paper provides an effective procedure for tracking objects in video sequences, which is robust against translation, rotation, scale change and partial occlusion. In addition, due to its low computational complexity, it can be used in real-time applications and also in cluttered backgrounds. Instead of directly finding the exact contour of the object, this approach first incorporates the object/background modeling algorithm and moving object detection technique, to estimate the initial location of the object. As a result, the computational complexity is significantly decreased and the algorithm can work quite well in cluttered environments. In the second phase, the shape context matching algorithm is adopted to estimate the possible changes in the shape of the moving object including translation, rotation, and scale variations. Through this, the location of object contour is found with a good accuracy. Finally, in order to determine the more accurate contour, the Marching Squares method is employed to extract the boundaries of the intended object. Correspondingly, the methodological approach taken in this study is a mixed methodology based on figure/ground modeling and matching algorithms.
The overall structure of the study takes the form of four sections, including this introduction. Section 2 begins by laying out the steps of the proposed object tracking approach, and looks at how the modifications on the GMM, the Kalman Filter, and the context matching algorithm can improve the performance of the visual object tracking system. Section 3 presents the results of applying this tacking procedure to different videos. Finally, Section 4 gives a brief summary and critique of the findings and includes a discussion of the implication of the results to future research into this area.
The proposed approach
The block diagram of the proposed approach. In each block, the functioning is shown in the normal font and the result is written in the boldface font.
To date various methods have been developed and introduced to track objects in the video sequences. The shape context is one of the more practical descriptors which is fast and robust to small changes of the shape; thus a tracking method based on this descriptor can be used in real-time applications. This section presents the proposed approach integrating the figure/ground modeling and the shape context matching. The general framework includes four repeated phases for each frame. At first, the figure/ground model is found with the use of the Gaussian Mixture Model and the Kalman filter, through which the moving part of the image in determined. Then, a coarse contour is extracted using the pixel edge detection techniques. Following this, the shape context matching is recruited to localize the favorite object and determine its state in order to find the initial boundary, and finally an exact contour enclosing the preferred object is extracted by means of the Marching Squares. In fact, the objective of the first two phases is to extract a coarse contour of the object, and the goal of the two final phases is determining the location and the exact contour of the target.
Shape context is one of the most accurate matching algorithms in separating and detecting objects in images [24, 25]. Despite its high accuracy and capability of extracting the histogram characteristics, the performance of the shape context algorithm can be still affected in the presence of noises and other background disturbances. In fact, to keep the performance high, the shape context algorithm needs to be provided an initial approximate location of the object by other techniques. Hence, instead of searching the whole image, the algorithm searches only within this approximate location for the object; this can significantly reduce the effects of noises and disturbances and increase the accuracy and the speed. As mentioned earlier, in the target images, moving objects are considered as foregrounds and fixed objects are considered as background or disturbances. Once the background and fixed objects are removed, important information about the moving objects and other changes in the image can be obtained. Therefore, in the beginning, the fixed background is modeled and the changes and movements are detected using the initial model. Movement detection is one of the effective methods that can estimate the initial location of the object, without being involved in the conditions of brightness or color of the object. Since there exist various types of modeling systems for the background, one can choose an appropriate technique based on the application and the goal. The computational complexity of these systems is much smaller compared to that of techniques that use the object color characteristics for estimating the initial location of the objects. The procedure of the tracking algorithm of moving objects of our work is shown in Fig. 1, with respect to which, the following parts explain each of the main phases.
Foreground/background separation
Minimal techniques – as common methods for separating the background – create models for the background through the images that are continuously being updated. Background is considered as a part of an image which remains fixed and without considerable changes. In the pixel-based algorithms, each pixel in an image belonging to a video sequence is classified as either still or moving; and at the end, a collection of moving pixels determines the moving part represented by a Binary Movement Map (BMM). Several methods have been proposed for this purpose; nonetheless the GMM and the Kalman Filter have received considerable interest. The next part describes each of these two methods.
The gaussian mixture model (GMM)
One of the complicated cases in modeling the background is when the scenes contain repetitive movements such as movements of branches and leaves of trees or for example, presence of a moving flag. The GMM algorithm, proposed by Grimson and Stauffer [26], is defined for modeling background in outdoor scenes. In this method, for each pixel in the position
where
where,
In Eq. (3),
In which, the value of
In order to estimate the background model using the GMM, first the Gaussian distribution models are arranged in a descending order with respect to the rank
For a new input image, the matching of each pixel against the Gaussian distributions are checked. If the matching distributions is among the first
Kalman Filter is an optimal predictor of the status of a process when the sequence of input samples has a normal distribution. The Kalman filter can be used for segmenting the background in a sequence of frames by modeling each pixel as a dynamical system. Here, it is assumed that the variations of the background is linear and the initial conditions are known. For a pixel at position
And the estimated states are computed as below:
where
And the gain matrix is found using the pre-determin-ed binary map for the input frame:
The parameter
In addition in Eq. (11),
Once the pixels are classified as the moving or the background, a Binary Movement Map (BMM) is constructed in which moving pixels are shown in white and background pixels are represented by the black color. In the segmentation step, some pixels may be misclassified due to noises, which leads to plenty of holes and gaps in the body and the contour of the moving objects. Here, we used a median filter to reconstruct the edges and fill the holes and gaps to attain smooth surroundings in the BMM. Following this, the remaining disturbances need to be removed from the background. These disturbances could be very intense and cause serious difficulties in the matching phase. In order to remove these disturbances, those objects whose number of pixels are less than a pre-determined threshold are omitted. Thus, only big and more integrated objects will remain and the noise and disturbances will be attenuated in an acceptable manner. Since now, we simply refer to this corrected BMM as the BMM while the original one is called theinitial BMM.
Edge detection of a rabbit image: (a) the image of a rabbit, (b) the extracted edges (red points), (c) the selected representative points.
The next step is to select a set of edge points of moving objects and extract a rough estimation of the moving part. To do this, first the edge pixels of whole image are determined using one of the edge detection techniques such as the Canny method. Among these edges, only those that belong to the moving part are important and the others should be eliminated. This selection process is called the edge classification. To do this, we recruit the BMM as a mask to set on the image edge points. Hence, what remains are the edge points selected by the BMM, which belongs to the contours of the moving objects. A simple rectangle enclosing these edge points is called the rough contour. Figure 2a shows an image of a rabbit, as a simple example. Figure 2b illustrates the extracted edges of the rabbit image. Since number of the edge points is too large, next phase needs too many calculations to be performed to find the matched points; even though the accuracy increases. For this reason, a small number of the edge points will be selected as the representative points. Figure 2c shows the selected edge set of the rabbit image.
Edge pixel matching
Shape context is a feature descriptor proposed by Belongie and Malik [28] in 2000 based on the idea of measuring shape similarity and recovering point correspondences. The main objective of the shape context algorithm is to find a favorite pattern in a new image by searching for the best edge point matchings. It deserves to be mentioned that the object inside the image may be undergone some changes such as scale change, rotation or translation, which in turn makes much more difficult to find the best matches. In this part, the functioning of the matching algorithm is described by an example.
Consider a shape represented by a set
Then, mean of the matrix
Next, the entries of
In the next step, the angles between the points are determined in a manner similar to that for the distance and the angle matrix
Now, the values of the angles should also be quantized by considering a number of levels. This quantization is done as follows, where the
The distance and angle matrices are used to form a log-polar histogram for each point in the pattern. For example, consider the rabbit in Fig. 3a with the selected edge points in Fig. 3b. In order to form the histogram, divide the plane around each point into a grid of

Edge detection of a shape: (a) The search area, (b) the extracted edges, (c) the selected edge points.
The matched points of the rabbit pattern. Shown the pattern in red, the true matched shape in the search area in green, and other shapes in white. Connection to the white parts are mismatches. (a) the shape context algorithm, (b) the shape context algorithm with the gradient angles cost function. (c) Determining the location of the intended shape in the target image.
Similar steps are followed for the new image to extract the shape contexts of the edge points. Suppose that the goal is to find the shape pattern of the rabbit in Fig. 4a. In so doing, at first the edges of the new image are extracted as in Fig. 4b and the representative points on the edges are determined as in Fig. 4c. Next, for every point of the pattern and the new image, the shape context is extracted. Following this, the distance between the shape context of the points of the pattern and those of the new image are found. For each point of the pattern, a point from the new image with the minimum distance of the shape contexts is selected as the matched point. Figure 5a shows how these matchings are performed. As it is clear, in most cases the algorithm found the matched points correctly; however, some errors also exist.
As it was mentioned, the shape context matching algorithm is designed for the points on the object edges. The Canny edge detector determine the gradient vector (containing the direction of the derivative vector) in addition to the location of the edge points, by which it is possible to reduce the matching errors in the shape context algorithm. Consider two points on the edges of the pattern and the target image with the gradient angles
To use this measure to reduce the mismatch errors, first the context matching algorithm extracts the matched points, for which then the cost Eq. (18) is computed. If this measure is smaller than a threshold, these two points will remain matched; and if not, they will be removed from the set of matched points. Hence, a large number of incorrect matchings will be removed and the overall performance of the system will significantly improve (Fig. 5b). Once the intended shape is found, a rectangle enclosing the matched points will be drawn to represent the initial outlier (Fig. 5c).
On completion of finding the matched points, the general contour representing the boundary of the object should be estimated and finally the accuracy of this contour needs to be increased. To do this, first the contour of the pattern is determined. Then among the matched points, those with the most matches (matching distance less than a threshold) are selected, for which using the scale, translation and rotation operators, the state of the target object can be determined in the new image. This state helps to draw an initial contour for the found shape by transforming the pattern contour to the new image. Selecting a few best matched points reduces the computational complexity of the step. Although this method can determine the location and the contour surrounding an object, the results of the simulations show that the accuracy is not acceptable. Therefore, in this paper we determine the location and the initial contour of the intended object more accurately by intersecting this contour and the set of moving pixels. This leads to a more accurate estimate of the object location.
The exact contour
When the precise location of the object is determined, finally the Marching Squares method [30] is applied in order to draw an exact contour around the object and separate it from the surrounding environment. This is a simple method with low computational complexity that finds the boundaries of the object quite well.
The simulation results
In the previous section, the proposed algorithm for the visual object tracking was completely described. It was suggested to combine the shape context matching algorithm with the figure/ground modeling methods to achieve acceptable results. In this section, the simulation results of the algorithms are presented with several examples. It is demonstrated that the proposed approach is effective for tracking objects even in complicated backgrounds.
The first phase of the proposed algorithm is modeling the foreground/background and extracting the moving parts. An initial estimate of the object location should be a-priori known; the more accurate the initial estimate is, the more accurately context matching algorithm can identify and extract the matched points. In addition, by attenuating noise and disturbances, number of points which need to be examined by the matching algorithm will be reduced, and as a result the calculation load will decrease. In this paper, the first goal of the modeling is making the background robust against changes such as the environment brightness and the imaging noises. In order to remove the noise and also fill the gaps around and inside a moving object, a median filter was used to remove small sized set of pixels. In this subsection, the figure/ground modeling is evaluated and compared by three methods: (i) the initial BMM resulted from the GMM, (ii) the corrected BMM found from the GMM and (iii) the results of the corrected BMM constructed by the Kalman filter. In Fig. 6, the target object (the cell phone) is being moved by a hand in front of a camera. Here, the objective is to extract the moving parts (the cell phone
The movement scene: (a) the fixed background, (b) a frame containing moving objects.
Object/background modeling using the GMM: (a) the initial BMM with no moving object, (b) the initial BMM with a moving object in the scene.
Modeling the background using the masked edge points: (a) the corrected BMM when there is no moving object at the scene, (b) BMM in the presence of a moving object.
The corrected BMM created by the Kalman filter in frames 212 and 250.
The shape context matching algorithm functions based on extracting log-polar histogram and looks for points in the new image matched to the pattern. In order to obtain a more accurate matching, the gradient angles are also used. Figure 5a showed the matching algorithm looking for the rabbit pattern in the new image, from which, it can be deduced that this algorithm has wrongly matched some points. Meanwhile, in Fig. 5b, the gradient angles cost function together with the context matching algorithm are used. It is evident that this technique has noticeably reduced the number of wrong matchings.
The characteristics of the proposed approach
Defining the shape to be tracked and the environment: (a) the pattern of the glass, (b) the fixed background of the scene and, (c) a frame with the moving glass.
Steps of finding the edge map: (a) moving part, (b) the BMM, (c) Image edge map, (d) rough contour, (e) pattern edge map.
Process of determining the exact contour: (a) point correspondence with the shape context matching, (b) best matches selection, (c) finding the initial outlier around the target object (the yellow rectangular), the initial contour (the red rectangular), and the exact contour (the bold red curve).
Tracking the favorite object in different frames. First row: the edges detection of the moving parts; second row: matching the points using the shape context algorithm. Third row: drawing the outline and contour around the target object.
Tracking the favorite objects. First column: a vase with flowers, second column: a hand with open fingers, and the third column: a hairbrush.
Now, the functioning of the proposed algorithm is illustrated in a tracking application. The objective is to detect and track a “glass” moving by hand of a person in the target scene. The image of this glass is first given as a pattern to the tracking system (Fig. 10a). Figures 10b and c show the fixed background of the scene and a frame containing the moving object. The background scene displays a normal room with a complex environment and objects of different shapes. First, the moving part is separated from the fixed background by one of the figure/ground modeling (Fig. 11a) and the corrected BMM is found (Fig. 11b). The image edge map is obtained by means of the Canny method (Fig. 11c), which masking by the BMM gives the edge points of the moving parts, by which the rough contour is drawn (Fig. 11d). In addition, the edge map of the favorite pattern is determined (Fig. 11e).
The shape context matching algorithm is recruited to find the best matches in Fig. 11d with Fig. 11e. Then the mismatches are removed using the gradient angle cost function (Fig. 12a) and the best matches are selected (Fig. 12b) to determine the state of the object. By intersecting this outlier with the BMM, the initial contour would be extracted. Finally, by means of the Marching Squares method, the final contour of the target object is extracted (Fig. 12c). The proposed method has been applied to several complex objects. In Fig. 13 a glass is detected and tracked in different scenes and in Fig. 14 a vase containing flowers, a hand with open fingers and a hairbrush in different backgrounds are tracked.
The results of the simulations demonstrate the effectiveness of the proposed algorithm in tracking objects. In situations that the objects are rotating or in cases like the movement of a human hand with fingers opening and closing, the system still works well. The high accuracy is due to two factors: first, the figure/ground modeling with the corrected BMM results in a good estimate of the initial object location and reduces the surrounding noise. Second, using the gradient angel cost function, the mismatches are also removed. The simulations showed that the proposed system can detect and track the target object with a rotation of up to thirty degrees. Moreover, the system is robust to variations in the color and the structure of the background (such as being non-flattened or having a special pattern), which is resulted from the modeling and detecting the pattern independent of the background. The GMM is working well for multiple objects of interest even when they overlap each other and the tracking will be recovered if occlusion takes place. Meanwhile, the Kalman filter is faster and more accurate when the probability distributions are normal. Since the shape context descriptors contain rich information, they are robust against small rotation and partial occlusion. Table 1 briefly presents the characteristics of the proposed approach.
The purpose of the current study was to propose a method for tracking objects in a sequence of video frames using the shape context matching algorithm. The main idea was using figure/ground modeling and image descriptors to determine the contour of the intended object in a video frame. The proposed approach was applied on several video sequences. The results of the simulations showed that the technique of this paper performs well in tracking objects with translation, rotation, scale change and partial occlusion. The tracking method accurately determines the contour of the object, accordingly it is appropriate for real-time applications. This paper had the following achievements: i. The GMM and the Kalman filter were combined with the context matching algorithm to detect the target object. Describing the background using the Kalman filter for high speed moving object is more appropriate, while the GMM works better in outdoor spaces or cluttered backgrounds. ii. In the foreground/background modeling, the noise was reduced and the unwanted small objects are omitted, which lead to more connected edge boundaries for the matching phase. iii. In the matching step, with the use of the gradient angel cost function the mismatches were noticeably reduced, which increases the accuracy of the resulted outline of the object. Both the Kalman filter and the GMM consider the intensity change of the background as normal distribution or mixture of Gaussian components. What is now needed is a study involving background with abnormal distributions such as Cauchy or Chebyshev. Furthermore, using color based detectors, may lead to less mismatches and therefore, the accuracy can be improved. Future studies on tracking multiple objects, considering multimodal noises, occlusions and nonlinear movement of the objects are recommended. In general, therefore, it seems that the present approach and its prospects are promising in order to achieve successful algorithms in tracking objects.
