Abstract
Generating class-agnostic object proposals followed by classification has recently become a common paradigm for object detection. Current state-of-the-art approaches typically generate generic objects, which serve as candidates for object classification. Since these object proposals are generic whereas the categories for classification are domain specific, there is a gap between the generation of object proposals and the classification of object proposals. In this paper, by taking advantages of the intrinsic structure and the complexity of each category of objects, we propose a novel tree-based hierarchical model to learn object proposals, from top proposals produced by the existing object proposals generation methods. First, we develop a tree-structured representation for each object to capture its hierarchical structure feature. Second, we propose a 23D compact feature vector to represent objects’ visual features. Third, we formulate a learning schema which evaluates the objectness of each proposal. Experiments demonstrate the significant improvement of the proposed approach over the state-of-the-art method in terms of object detection rate. An application is proposed based on this approach to help children learn and recognize objects by their visual appearances and their sub-parts structures.
Introduction
Object detection, as the task of locating and recognizing object categories in images and videos, is a major research field in computer vision. Multi-class object detection is challenging since it needs to not only find the locations of the objects but also specify the category of the object at each location. A lot of research work has been done in object detection and recognition and achieved a lot of progress in recent two decades. In earlier work, the problem was treated as a binary classification problem for each object category versus non-object category on a huge number of windows (i.e., sliding windows) [7, 19] in different sizes and aspect ratios scanned over each image. This approach has two major drawbacks: First, the number of potential candidate windows in each image is extremely large; Second, to detect objects from n categories, n detectors need to be ran separately on all windows of each image, which makes the computational cost grow linearly with the number of categories. Haar-cascades with boosting techniques (e.g., AdaBoost) [10, 18] significantly boosted the speed with integral images and quickly eliminated a large number of non-object windows. These techniques have achieved successful detection accuracy on certain types of objects such as cars and human faces. Segmentation-based [9, 16] and saliency-based [1, 11] techniques were proposed to reduce the search space for possible windows. These “filtered” windows are also called proposals. In recent years, the class-agnostic approaches [3, 17] were introduced to generate proposals which are likely to contain objects of interest. A (n + 1)-way classification with n object categories and 1 background category is then followed to classify each object. The ever-growing labeled data size and computing power (e.g., GPU computing) made the aforementioned approaches work with the deep neural network-based architectures [5, 8] (e.g., Convolution Neural Network (CNN)). However, this type of approaches rely on the deep architecture in both steps. The class-agnostic generation of proposals is expensive and slow.
Hence, object proposals generation remains a bottleneck in object detection. Due to the rapid development of mobile computing and cloud computing, some applications of object detection needs partial computing to be done on mobile devices [13–15]. Quickly and accurately generating object proposals has been of a great topic of interest. A new BING-feature-based objectiveness measure [2] was proposed to quickly generate object proposals (at 300 frames per second) with a high detection rate(i.e., 90% with 50 proposals and 96.2% with 1000 proposals on the VOC 07 data set [4]). Another approach, named EdgeBox [20], used edges as a sparse yet informative representation, and the number of contours that are wholly contained in a bounding box as a measure to compute the objectness. It achieved the state-of-the-art object recall rate (e.g., 75% recall at Intersection over Union (IoU) of 0.7 using 1000 proposals on the VOC 07 data set [4]) with the speed of approximate 0.25 seconds per image.
Existing methods for object proposals (e.g., BING [2] and EdgeBox [20]) typically use measures of likelihood of a bounding box containing objects to retrieve top n bounding boxes as candidate proposals. Due to the class-agnostic nature of these methods, all generated object proposals are generic objects. On the other hand, during the classification stage, the object categories are usually bounded in a certain domain due to the limited labeled training data sets. For example, the VOC 07 data set [4] has 20 object categories for training and the number of object categories in generated object proposals is significantly more than 20. This difference means that the object proposals usually contain additional categories of objects, which are not in the labeled categories for classification. Figure 1 shows 20 generic object proposals generated by an existing object proposals generation method. These generic object proposals include the aeroplane itself, a “propeller” part, and other parts of the aeroplane. However, the labeled categories in the training data contain the “aeroplane” category instead of the “propeller” category and other parts categories. Similarly, a “bicycle wheel” might be detected as a proposal whereas the classification labels only contain the “bicycle” category; A “train window” might be detected as a proposal but the classification labels only contain the “train” category. The false positives caused by these gaps have significant negative impact on object detection in terms of both recall and precision. Consequently, it will further affect object recognition accuracy in the object proposals classification stage. By reducing these false positives, we could improve precision and retrieve more appropriate candidate proposals to improve recall as well. In the classification stage, all generated object proposals, which are not in the labeled categories, are treated as the “background” category. However, some of these “background” objects are in out-of-domain categories (i.e, object categories are not in the set of object categories in the domain of problems being addressed) and are therefore not truly “background”. Hence, reducing these false positives in the object proposals generation stage will also further improve the classification accuracy.
We observe each object category differs from others not only by its visual appearance as a whole part, but also by its intrinsic structure such as their sub-parts and the complexity of its structures. For example, a bicycle differs from a train when being looked as a whole part, which counts for visual appearance as a whole. In addition, a bicycle contains two “wheels” and one “frame” whereas a train contains “multiple rectangular windows”. A bounding box of the blue “sky” object may not contain any sub-parts as an object since it’s relatively pure and it’s true background. Hence, in additional to visual appearance, sub-part structures and the complexity of an object could be employed to construct richer objectness to better represent the objects to reduce the aforementioned two types of false positives and improve the recall rate for any domain specific object detection task.
Driven by this intuition, we develop a model which takes an object’s visual appearance as a whole and its parts-based hierarchical structure into account. There are four major novelties for the proposed work. First, we develop a compact 23D feature vector as the visual appearance representation of an object; Second, we develop an algorithm to map each object to a hierarchical tree structure representation, which captures the relationships among sub-parts and the complexity of the object structure. Third, we formalize a learning schema to measure each proposal’s objectness by comparing visual features and tree structures using the nearest neighbour method. Last, we introduce a practical application utilizing our approach to help children learn and recognize objects via automatically generated graph patches.
The paper is organized as follows: Section 2 introduces the proposed method in detail. Section 3 presents the experimental settings together with the results. Section 4 describes a real-word application utilizing our approach to help children learn and recognize objects. Section 5 concludes the work with summaries, discussions, and future work.
The proposed method
The proposed method aims to select a few number of proposals to achieve the same recall as obtained by significantly more proposals produced by existing object proposal generation method. As a high-level view, we first employ the existing object proposal generator on each image to produce top n object proposals. Since many of these proposals include generic objects and backgrounds, and only a small portion of them include the objects that belong to domain-specific categories in the labeled training data, we develop a learning schema to compute a confidence score (i.e., objectness) for each object and re-rank top n objects in each image. This learning schema aims to rank category-specific objects towards the top to improve the recall rate for different levels of n. In other words, we will select a few number of proposals from the top n object proposals to achieve the same recall as obtained by n proposals produced by the state-of-the-art methods. The subsections below are organized as follows: Section 2.1 introduces the proposed compact features for object’s appearance as a whole. Section 2.2 describes the methodology to map each object to a hierarchical structure represented by a tree. Section 2.3 formalizes the proposed learning schema to compute a confidence score of each object using the compact features and the tree-based hierarchical model.
Compact feature representation
Even though approaches based on deep neural networks could learn feature representations with high discriminative power, it is quite costly in terms of time. For object proposals generation, we want to develop a fast learning schema to eventually rank object proposals from the category-specific dataset higher than object proposals from other categories and background. Hence, the constructed compact feature should have high discriminative power to distinguish objects from non-objects, and moderate discriminative power across different categories of objects.
We observe that object and non-object proposals differ a lot in terms of their properties. First, non-object proposals usually contain background and object proposals usually have one dominate object present within the bounding box. Second, non-object proposals are generally more evenly distributed and have less variations in terms of their color pixels. In contrast, object proposals are generally more skewly distributed and have more variations due to the few dominate colors in the main object and a small background region outside of the object boundary. Third, from the spatial point of view, background proposals are usually more evenly distributed among different color pixels inside the bounding box while object proposals would have a few dominant colors occupying the majority region of the bounding box. Hence, the variation-related and spatial-related properties could be used to distinguish object proposals from non-object proposals.
To this end, we define the first category of features to measure the color variations by computing variances in red, green, and blue channels (three values) and entropy of the luminance-based histograms within the proposal (one value). We then define the second category of features to measure the spatial evenness of the color distribution. Specifically, we convert the proposals into a histogram in red, green, and blue color channels. We empirically choose the top five bins with the highest frequencies and compute their relative normalized frequencies. This would produce 15 features (five relative normalized frequencies for each of the three channels).
Finally, we observe that the positions of the object proposals are not evenly distributed within an image. For example, the probability for object proposals near the four corners or the edges of the image is far less than the probability for object proposals near the center of an image. Some aspect ratios of the object proposals may occur much more frequent than other aspect ratios due to the limited shapes of objects in the entire set of object categories. Following this intuition, we define the third category of features by incorporating the distribution of the relative locations and aspect ratios of proposals into extracted features to better represent each proposal. Since the image dimension (i.e., the width and the height) may vary among different images, we use the relative positions and dimensions of the proposals within an image to define four additional features: the relative width (ratio of the width of the proposal to the width of the image), the relative height (ratio of the height of the proposal to the height of the image), the relative horizontal coordinate, and the relative vertical coordinate, which are the coordinates of the upper-left corner of the proposal divided by the width and the height of the image, respectively. These four features would capture the properties regarding the relative position and the aspect ratio of a proposal. In all, the compact feature of the length of 23 is extracted for a proposal and will be used to classify object and non-object proposals later on.
To clearly explain the proposed features, we provide the mathematical definition for some of these features. For a given image I with width W and height H and a proposal B in the image with x and y representing the horizontal and vertical coordinates of the upper-left corner of B, respectively, let w and h represent the width and height of B and Pi,j,c represent the pixel value at the horizontal and vertical coordinates (i, j) of I in the channel c of the RGB color space, where c ∈ {R, G, B}.
For the first category of the proposed features, the variance of B in a red color channel could be expressed as:
The variances of B in green and blue channels are similarly computed by substituting c with G and c with B in equation (1), respectively.
To compute the entropy, we compute the histogram based on the pixel counts for each of 768 bins (i.e., 256 possible values times 3 channels), and apply the standard entropy formula on values stored in 768 bins.
For the second category of the proposed features, we use 64-bin histogram on each of the RGB channels to compute the five highest normalized frequencies. Firstly, we compute the histogram based on the pixel counts of 64 bins on each of RGB channels. Secondly, for each channel, we normalize the frequency of each bin by dividing the total pixel counts in the channel. Lastly, we rank relative frequencies and select top 5 relative frequencies in each channel to produce a total of 15 values.
For the third category of the proposed features, we compute the proposed relative horizontal coordinate (Xr), the relative vertical coordinate (Yr), the relative width (Wr), and the relative height (Hr) as follows:
A total of n object proposal bounding boxes may be retrieved by the proposal generator for an image. Since each object or non-object bounding box could include smaller bounding boxes corresponding to sub-parts of an object, we could convert all bounding boxes on each image to several trees in the following way: each node represents a bounding box; If a bounding box B a is mostly included in another bounding box B b (defined by the ratio of the overlap between B a and B b to B a ), the corresponding tree node T a should be a child node of bounding box B b ’s tree node T b . This condition will be applied to all bounding boxes in an image. Figure 2 shows two examples of this type of mapping.
sortedBbsDescend ← sortBySizes(bbs)/*sort bbs by sizes in descending order*/
trees ← new List /*initialize a new list to place all resultant trees*/
insertSuccess ← False
insert(tree, bb, thresh) /*see Algorithm 2 for details of this function*/
insertSuccess ← True
break
end if
end for
continue /* process next bouding box */
newTree ← createNewTreeNode(bb)
trees.append(newTree)
end if
end for
return trees
We then propose a method to build multiple trees by considering all bounding boxes in an image. First, we sort all bounding boxes by their areas (i.e., width * height) in the descending order. Second, we use the first bounding box (i.e., the biggest bounding box) to create a root node as the initial tree and sequentially process the remaining sorted bounding boxes one-by-one. For each bounding box, we check if it can be inserted in any of the existing trees by calculating the ratio of the overlap between the bounding box of the root of the tree and the candidate bounding box to the candidate bounding box. If the computed ratio is larger than a pre-set threshold, we insert the bounding box into the currently examined tree. If the computed ratio is smaller than a pre-set threshold for all existing trees, the bounding box cannot be inserted into any of the existing trees, and we use this bounding box to create a new tree by setting it as the root node of the new tree. The detail of the algorithm is shown in Algorithm 1.
newNode ← createNewTreeNode(bb)
treeNode.addChild(newNode)
return
end if
insert(c, bb, thresh) /*recursive call*/
end if
end for
newNode ← createNewTreeNode(bb) /* all children of treeNode do not have sufficient overlap with bb */
treeNode.addChild(newNode) /* Add newNode as a child of treeNode */
return
The detail of the tree insertion function is summarized in Algorithm 2, where the inputs are a bounding box and a tree (i.e., treeNode) and the output is the altered tree after the insertion. It should be noted that the input bounding box is mostly included in the bounding box corresponding to the root of the input tree based on the pre-condition in Algorithm 1. Therefore, calling this function will insert a bounding box at the right location of the tree, where its parent corresponds to the smallest bounding box which has the sufficient overlap with the inserted bounding box. We use recursion to implement this function as summarized in Algorithm 2. Let’s denote the function by insert(treeNode, bb, thresh). The based case is that the treeNode has no children. In this case, we simply create a new node for bb and add this new node as a child of treeNode. If the treeNode has children, we will check if bb is mostly included in the bounding box corresponding to any of the children. If yes, we make a recursive call by passing this child, bb and thresh as parameters. If bb is not mostly included in the bounding box correspondings to any of the children, we simply create a new node using bb and add this new node as the child of treeNode. The details of this algorithm are shown in Algorithm 2.
Learning schema to compute confidence scores
Based on Sections 2.1 and 2.2, each object proposal bounding box has a tree associated with it. If the root node of a tree T i corresponds to a bounding box B i , all children of T i correspond to all bounding boxes inside of B i where i ranges from 1 to the total number of trees. We compute feature F i from box B i as the root node feature. Hence, for each bounding box Bi,j within B i , Fi,j represents its visual feature and tree Ti,j represents its hierarchical structure, where j ranges from 1 to the total number of nodes in the tree.
Since each bounding box maps to a tree, we can compare two bounding boxes by comparing their corresponding trees and their visual features. We define the distance between two bounding boxes as:
Here, we use the Euclidean distance to compare two visual features and use the edit distance to compare two trees.
For each object B
i
, we compute its distance to each B
k
in the training data using Dist (B
i
, B
k
) and find the n closest B
k
s. Among nB
k
s, there may be p objects and q non-objects (i.e., p + q = n). So, we define a confidence score as:
The higher the confidence score, the more likely the proposal being an object. Last, on each image, we order all object proposals by their confidence scores in the descending order. Suppose there are M bounding boxes in the training set and N bounding boxes in the testing set, similar to the nearest neighbour algorithm, the computation complexity of the proposed algorithm to compute the confidence scores of all testing bounding boxes is O (MN).
The proposed method has been applied to the Youtube-Object Dataset V2.2 [12] since we want to test the proposed method on the video dataset. The dataset contains a total of 720,000 frames with 6,975 annotated bounding boxes from 10 categories of objects. The annotated frames are divided into training and testing sets. Specifically, one object was annotated per image in the training set and multiple objects were annotated per image in the testing set. There are 4306 annotated frames with 4306 bounding box annotations in the training set. There are 1781 annotated frames with 2669 bounding box annotations in the testing set.
Due to the nature of the video dataset, the frame sequence within the same shot changes gradually and the differences between frames within the same shot are small. We select the first labeled image within each shot as the representative image from that shot. Some shots do not have any labeled image, we simply skip them. As a result, all images we selected are from different shots. We call these labeled images as shot representative images. Among these shot representative images, 870 images were from the training set and 334 images were from the testingset.
We apply EdgeBox [20] on every shot representative image and generate top n bounding boxes per image. Since this is the video dataset and the sizes of objects is relatively large in each image, top 50 bounding boxes per image from EdgeBox would detect nearly 70% of true objects. We use top n = 50 in our experiment. We empirically set the parameters for EdgeBox as follows: the step size of sliding window search is 0.8; the non-maximum suppression threshold is set to be 0.55, and min score of boxes to detect is 0.01. As a result of this step, we generate up to 50 bounding boxes for each shot representative images. We will use these bounding boxes as the input to produce the training and the testing sets for the proposed method.
To produce the training set for the proposed method, we compare the generated bounding boxes with the ground truth annotations from the shot representative images in the training set. We labela bounding boxes as positive (i.e., contains object) if Intersection over Union (IoU) is greater than 0.7 and as negative (i.e., does not contain object) if IoU is less than 0.5. For boxes with IoU values between 0.5 and 0.7, we simply ignore them. We then apply the proposed method described in Section 2 on shot representative images in the testing set to compute a new confidence score for every proposal and re-order them in the descending order. Same labeling process has been applied to compute the recall.
To evaluate the performance of the proposed method, we compute the recall rate of top n proposals ranked by EdgeBox and the proposed method. Table 1 shows the comparison of the recall rates between EdgeBox and the proposed method with n ranging from 1 to 50. Figure 3 shows the comparison in curves. We clearly see that the proposed method outperforms EdgeBox for most values of n. The convergence occurs at n = 50 is due to the fact we use all 50 boxes from EdgeBox to re-order proposals. It shows that the proposed method could use much fewer top proposals (e.g., 40 proposals) to achieve the similar recall as EdgeBox, using more top proposals (e.g., 50 proposals). Figure 4 shows the qualitative results of the object proposals generated by the proposed method.
One potential application
Using the method we described, we can design an application to help children learn different categories of objects. Given a set of images and a training set, a number of object proposals can be automatically generated by applying the proposed method on each image using a training set. Since the proposed method uses the hierarchical model to represent object proposals as trees, each object proposal bounding box may also contain a hierarchy of other bounding boxes. We could crop out several bounding boxes from each image. For each cropped out bounding box, we can further crop out several bounding boxes which naturally corresponds to different parts of the objects. We can display these parts at random locations, and promote children to use the parts of objects to fill corresponding object proposal holes by correctly matching the object parts to the right positions. This practice could enable children to gradually get familiar with the concept of each category of “object” and have a basic notion of how different parts form a certain category of objects.
Summary and discussions
In this paper, we propose a tree-based hierarchical model to learn object proposals and design an application to teach children the concepts of objects. The proposed work has the following contributions: Developing a compact feature vector to represent the object’s visual appearance Proposing a tree-based hierarchical model to capture the object’s internal structural features Formalizing a new objectness measure by incorporating both visual and tree feature representations Designing a practical application to teach children to learn the concepts of objects using the proposed method
We demonstrate the proposed method outperforms the state-of-the-art method (i.e., EdgeBox) in terms of the detection rate. Specifically, we use fewer object proposals to achieve comparable recall rate achieved by more object proposals produced by other methods.
Since object proposals generation usually serves first step before classification for object detection task, it will be interesting to use the proposed method to generate the object proposals, followed by classification methods based on deep neural network to test the possible improvement of final object detection rate. This is one direction of our future work. Also, since the proposed method is more suited for video datasets where the objects appear relatively larger in each frame, we will combine the proposed method with the tracking method to extend the object detection from images to videos.
There are also some limitations of the proposed method. Since the proposed method needs to build the hierarchical structure based on the geometric relationships of object proposal bounding boxes, it does not perform well when too many object proposal bounding boxes are generated per image and too much overlap among all bounding boxes. To address these limitations, we can fine tune parameters of existing object proposals generation methods to detect a reasonable number of bounding boxes with good overlap. In this way, the proposed method might capture better hierarchical structure of each object and become more generally applicable to other domains.
