Abstract
The ubiquitous availability of cost-effective cameras has made large-scale collection of street-view data straightforward. Yet the effective use of these data to assist autonomous driving remains a challenge; in particular, stereo images, which carry abundant perceptible depth, have been little explored or exploited. In this paper, we propose a novel Depth-embedded Instance Segmentation Network (DISNet) that improves the performance of instance segmentation by incorporating the depth information of stereo images. The proposed network takes binocular images as input to observe the displacement of objects and estimates the corresponding depth without additional supervision. Furthermore, we introduce a new module for computing the depth cost-volume, which can be integrated with the colour cost-volume to jointly capture useful disparities of stereo images. The shared-weights structure of a Siamese network is applied to learn the intrinsic information of stereo images while reducing the computational burden. Extensive experiments have been carried out on publicly available datasets (i.e., Cityscapes and KITTI), and the results demonstrate the superiority of the proposed method in segmenting instances at different depths.
Introduction
Accurately and expeditiously segmenting countable objects (e.g., individuals, vehicles and bicycles) in stereo imagery, known as instance segmentation in the computer vision community [7, 53], is a core problem for autonomous driving. Conventional segmentation approaches segment amorphous image regions by grouping pixels with similar spatial semantics, such as position and texture. With the advent of object-based detectors, segmentation has come to interpret pixels at the instance level, which captures finer differences between instances and has greatly promoted the development of instance segmentation. Instance segmentation is inseparable from object detection [12, 42] and semantic segmentation [4, 57]: it requires both accurate object localisation and precise instance mask prediction. With the rapid development of depth sensors in autonomous driving and robotics, depth has the potential to serve as additional information that can be incorporated into off-the-shelf segmentation architectures to improve instance segmentation performance, but this direction has received little attention.
Despite the numerous efforts to improve instance segmentation, most of the aforementioned methods extract features from monocular images [55] and therefore capture only limited texture information. Concretely, with monocular vision, the distance between objects is often gauged inaccurately because the depth from one object to another is hard to perceive. Binocular vision can estimate the depth between objects from their overlap and displacement, but it has received little attention in the field of instance segmentation. Inspired by the recent successful use of depth information in stereo matching, we introduce an effective method that applies depth estimation based on a binocular vision system to instance segmentation. Specifically, taking the binocular images as input, we first estimate depth for the paired images and then devise a module to calculate the cost-volume of depth disparities. Analogously, the colour cost-volume is generated and concatenated with the depth cost-volume to form a joint cost-volume, which can be regarded as a pre-processing step and is fed into a standard instance segmentation architecture (i.e., Mask R-CNN [19]). In addition, we employ Siamese-style Convolutional Neural Networks (CNNs) to jointly learn the semantics implicit in stereo images, while ensuring that the computational cost is equivalent to that of ordinary networks based on monocular images. The contributions of our approach can be summarised as follows:
(1) We propose a novel
(2) We design a new module to compute the depth disparities of stereo images, which can be fused into the colour cost-volume and improve the performance of the instance segmentation.
(3) The shared-weights networks are employed to discover the intrinsic semantics implied in the input stereo images and avoid yielding a large number of parameters.
(4) We conduct extensive experiments on the Cityscapes and KITTI datasets to show the effectiveness of our framework. We also present qualitative visualisations to demonstrate the superiority of the proposed method in discovering distant objects.
The rest of this paper is organised as follows. Related work is reviewed in Section 2. The proposed framework is presented and analysed in Section 3. Experimental results and ablation studies are reported in Section 4. Finally, we conclude the paper.
Related work
In this section, we will comprehensively review the literature related to the proposed method from three perspectives, including the RGB image-based instance segmentation methods, the image depth-based instance segmentation methods and Siamese-CNN.
Proposal-based methods generally use object detectors to obtain region proposals and then estimate appropriate masks for the detected outputs. Early methods relied heavily on hand-crafted features and classifiers, such as Random Forests [29] or Support Vector Machines (SVMs) [17]. Recently, with the advent of deep learning, fully convolutional neural networks have gradually replaced traditional methods and become the dominant framework for semantic segmentation. The authors of [7] proposed an end-to-end framework that cascades a multi-task network to implement multiple functions simultaneously, including distinguishing instances, determining masks, and classifying objects. The authors of [28] proposed the first differentiable fully convolutional semantic segmentation framework, which inherits the advantages of FCN and instance mask proposals. Mask R-CNN [19] extended Faster R-CNN [42] to generate a pixel-wise mask prediction for each object by adding a mask segmentation branch.
Proposed approach
In this section, we describe the proposed DISNet in detail. As shown in Fig. 1, DISNet comprises the computation of the depth and RGB cost-volumes, as well as a backbone network for instance-level segmentation. The pipeline of the entire architecture is presented in 3.1. The individual modules are then detailed in 3.2 and 3.3, respectively. In 3.5, we describe how the final objective function is designed.

Overall architecture of our DISNet. Our DISNet takes binocular images as input and generates the corresponding cost-volumes based on the estimated depth and RGB information. The obtained cost-volumes are concatenated and fed into the subsequent backbone to produce instance-level segmentation results.
As illustrated in Fig. 1, the proposed architecture (i.e., DISNet) consists of four parts: the network used to yield the RGB cost-volume, the depth generating module, the depth embedding module, and the segmentation module. Specifically, the RGB cost-volume is generated by following the method introduced in GC-Net [24]. However, the colour cost-volume of monocular images is extremely susceptible to interference from occluded objects (caused by the single shooting angle of the camera) and from objects with similar colours (due to the close proximity of adjacent objects in unstructured scenes).
To tackle the above problems, we explored ways to distinguish objects with similar colours and propose two adjacent modules (namely, the depth generating module and the depth embedding module) to extract depth disparity from binocular images. In contrast to recent segmentation schemes that use only monocular images [14], our input is a binocular image pair composed of the left-view image and the corresponding right-view image, which contains extensive depth disparities. Effective use of these depth disparities can compensate for the lack of object geometry information and helps to distinguish objects of identical colour. This motivates us to introduce dedicated modules to calculate the depth cost-volume from the stereo image.
Once the depth cost-volume is obtained, it is concatenated with the RGB cost-volume and sent to the segmentation network to predict the contours of the instances. The optimisation of the entire network seeks the solution that best balances disparity estimation and segmentation.
Depth generating
The depth generation process of DISNet is rendered in blue in Fig. 1. The depth generation is inspired by the structure of GC-Net [24]. Specifically, the input left-view and right-view images are processed by a feature extractor similar to ResNet18 [20]. The dimension of the resulting feature is
To obtain the cost volume from the left view to the right view, the features are shifted to the right along the epipolar line of the left tensor until the maximum disparity is reached. The left-view and right-view features are concatenated in a pixel-wise manner to yield a 4D cost volume (the dimension is
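The shift-and-concatenate construction can be sketched on a single scanline as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `build_concat_cost_volume`, the convention that left pixel x is paired with right pixel x - d, the zero padding for pixels that fall outside the view, and the toy feature sizes are all assumptions made for the example.

```python
def build_concat_cost_volume(left, right, max_disp):
    """Build a concatenation cost volume for one scanline.

    left, right: lists of per-pixel feature vectors (length W, each of length C).
    Returns volume[d][x], a vector of length 2 * C holding the left feature at x
    followed by the right feature at x - d (zeros when x - d is out of view).
    """
    width = len(left)
    channels = len(left[0])
    zeros = [0.0] * channels  # assumed padding for out-of-view pixels
    volume = []
    for d in range(max_disp + 1):
        row = []
        for x in range(width):
            shifted = right[x - d] if x - d >= 0 else zeros
            row.append(list(left[x]) + list(shifted))
        volume.append(row)
    return volume


# Toy features: 4 pixels, 2 channels per view.
left_feat = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
right_feat = [[2.0, 1.0], [3.0, 1.0], [4.0, 1.0], [5.0, 1.0]]
volume = build_concat_cost_volume(left_feat, right_feat, max_disp=2)
```

For full images the same loop runs over every row, which gives one axis each for disparity, height, width and the doubled feature channels.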
Once the above process is complete, the natural way to determine the optimal disparity for each pixel is to apply the differentiable Soft-argmax [36] to the adjusted cost volume. However, the original Soft-argmax function returns the index of the maximum value over the entire tensor, which causes larger errors when processing pixel-level data. To alleviate this problem, we implemented a multi-layer argmax function based on multi-layer convolution to generate a single value for each pixel. In addition to the multi-layer argmax function, a Sigmoid activation is required to accurately estimate the pixel disparity in actual training.
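For reference, the standard soft-argmax applied to the scores of a single pixel can be sketched as below. This shows only the textbook operation (a softmax-weighted mean of the candidate indices); it is not the multi-layer convolutional variant described above, and the function name and toy scores are assumptions for illustration.

```python
import math


def soft_argmax(scores):
    """Differentiable surrogate for argmax: softmax-weighted mean of indices."""
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(i * w for i, w in enumerate(weights)) / total


# Scores of one pixel over four candidate disparities; index 2 dominates.
pixel_scores = [0.0, 0.0, 10.0, 0.0]
estimate = soft_argmax(pixel_scores)

# With uniform scores the estimate falls back to the mean index.
uniform_estimate = soft_argmax([1.0, 1.0, 1.0])
```

Because the output is a weighted average rather than a hard index, it yields sub-pixel estimates and passes gradients to every candidate.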
After obtaining the learned left depth map, we follow the method introduced in [40] to transform it into the right depth map. In particular, the transformation must satisfy the following left-right consistency loss [14]
After extracting the depth features from the stereo image, we immediately considered how to effectively integrate the obtained information into the existing network. To achieve this goal, we designed a module of depth cost-volume by imitating the calculation process of the RGB cost-volume. Specifically, this module is constructed by a series of convolution and deconvolution layers (see the yellow part in Fig. 1). Furthermore, a Siamese-style CNN is employed to reduce the computational cost and the number of model parameters.
Module of detection and segmentation
The depth cost-volume generated by the above process is concatenated with the conventional RGB cost-volume to form a joint representation that serves as the input of the detection and segmentation network. To show the effectiveness of our method, we select Mask R-CNN [19] as the backbone architecture. It comprises two components: a Region Proposal Network (RPN) and a mask classifier. The RPN typically defines 9 anchor boxes to generate multiple regions of interest (RoIs) and uses a lightweight binary classifier to predict the confidence scores of detected objects. Non-maximum suppression is then applied to the anchors with high objectness scores. The RoI Align layer maps the resulting bounding boxes to features of a fixed size. These features are then fed into fully connected layers followed by a softmax function for classification, while regression models further improve the quality of the bounding box predictions. In addition, the extracted features are sent to a mask classifier composed of two CNNs, which outputs a binary mask for each RoI. In this way, the mask classifier allows the network to generate a mask for each category without interference between categories.
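The non-maximum suppression step mentioned above can be sketched as a greedy procedure over scored boxes. This is a minimal plain-Python illustration under stated assumptions: boxes are given as (x1, y1, x2, y2) corners, the IoU threshold of 0.5 is an arbitrary choice for the example, and the helper names `iou` and `nms` are hypothetical rather than taken from the Mask R-CNN code base.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping proposals and one separate proposal.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of 81/119 and is suppressed, so only the first and third proposals survive.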
Objective function
The objective function adopted by our method consists of two parts, one of which aims to regularise the accuracy of the calculated disparity, and the other is the segmentation loss function in the Mask R-CNN [19]. Considering that the ground truth of disparity map is available, we choose a smooth
The loss function introduced in Mask R-CNN
In this section, we describe the experimental datasets and the evaluation metrics, followed by the implementation details of the proposed network. The results are compared with extensive baselines and state-of-the-art methods. In addition, we offer qualitative visualisations and ablation studies to comprehensively evaluate the superiority of the proposed framework.
Cityscapes [6] is a large-scale dataset released for tasks related to urban scene understanding and autonomous driving. It provides rectified stereo image pairs and the corresponding disparity maps pre-calculated by the SGM algorithm [21]. It contains 5,000 high-quality left-view images that are finely annotated at the pixel level, together with the corresponding right-view images, which are not annotated. These images are split into three non-overlapping sets of 2,975, 500 and 1,525 images, used for training, validation and testing, respectively. Eight semantic categories are selected from the 19 formally annotated classes to evaluate and report the performance of instance-level segmentation: person, rider, car, bus, truck, train, motorcycle and bicycle. Since the depth data is not directly accessible in the Cityscapes dataset, we employ the method introduced in [37] to preprocess the depth map.
The KITTI dataset [11] is an outdoor dataset comprising 32 training scenes and 29 test scenes under the Eigen split [10]. The RGB images and depth values are captured by car-mounted stereo cameras and a rotating Velodyne laser scanner, respectively. We evaluated our method using 3,712 training images and report performance on the 120 validation images and 144 test images split by [41]. In addition, for disparity images, we follow the work of [37].
Comparison with baselines and state-of-the-art methods
To show the effectiveness of the proposed method, we compare two variants of our method with state-of-the-art methods in Table 1. As can be observed, DISNet (Mask R-CNN) outperforms the original Mask R-CNN by 6.7%, and DISNet (UPSNet) outperforms the original UPSNet [49] by 4.4%, in terms of mean average precision. Specifically, the segmentation results improve significantly in most categories, including person, train, truck, motorcycle and bicycle. These improvements support the effectiveness of exploiting the depth information of stereo image pairs.
Overall AP and AP50 and precision of each class on Cityscapes validation set. The replicated results are indicated by *
To obtain an intuitive understanding of our models, we visualise the segmented images obtained by our DISNet and by state-of-the-art methods (i.e., Mask R-CNN [19] and UPSNet [49]) in Fig. 2 and Fig. 3, respectively. These qualitative results show that our method can correctly segment objects even in complex scenes. Note that our method accurately detects more objects than the compared methods, owing to the guidance provided by depth information. In particular, in the second row of Fig. 2 our method produces smoother masks for the cars in the right part of the image, and in the third row of the same figure our network recognises the bicycle under the person, which Mask R-CNN fails to detect. Moreover, in the last row, the person and bicycles that are not detected by the other methods are clearly recognised in our results. Similarly, our method finds many objects that are missed by the original UPSNet, such as the person in the right part of the scene in the last row of Fig. 3.

Visualisation results on Cityscapes val set. From left to right, each column of the image represents the input images, the ground-truths, the results of Mask R-CNN [19] and our DISNet (Mask R-CNN), respectively. Best viewed in colour.

Visualisation results on Cityscapes val set. From left to right, each column of the image represents the input images, the ground-truths, the results of UPSNet [49] and our DISNet (UPSNet), respectively. Best viewed in colour.
Since the annotations provided by the official KITTI dataset cannot be directly converted to the COCO-style format, we only present qualitative results. From Fig. 4, we can see that most objects are detected and assigned the correct mask. In particular, objects in the background with similar colours are also clearly distinguished.

Qualitative visualisations on KITTI dataset [11]. From the left to right, each column of the image represents the input RGB images, the ground truth of instance train id and our results.
Depth generation module
We also investigated the impact of the depth generating module on performance. From Fig. 5, we found that learning depth from stereo images with an independent network significantly improves segmentation. The DISNet variants of Mask R-CNN and UPSNet both bring an improvement of more than 4%. This indicates that the depth map obtained by the convolutional network makes better use of the depth information and has a positive impact on the segmentation results.

Effectiveness of adding depth generating module.
To further illustrate the effectiveness of the proposed DISNet, we designed an ablation study on whether the depth cost-volume is employed during training. The results are summarised in Table 2. There are notable improvements when the depth cost-volume is adopted in the Mask R-CNN and UPSNet architectures. For Mask R-CNN, the Average Precision is relatively improved by 6.1%, and the Average Precision50 is improved by 11%. Improvements are also seen for UPSNet, where the Average Precision and Average Precision50 are improved by 3.8% and 4.1%, respectively. It is worth noting that UPSNet makes several major modifications to Mask R-CNN, changing the internal structure of the layer used to generate the instance mask; therefore, its improvement is smaller than that of Mask R-CNN.
Performance with or without the depth cost-volume, where GDC stands for Generate Depth Cost-volume
Since the objective function introduced earlier is composed of two parts, their weights determine which part dominates during model optimisation and thereby affect the final result. To choose appropriate weights for each part, we designed ablation experiments on the hyper-parameters. As shown in Table 3, the best results are achieved when λ_d and λ_m are set to 0.4 and 0.6, respectively. This indicates that the results improve when the detection module loss plays the major role in the loss term.
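How the two weights trade off the two parts can be sketched as follows. This is an illustration under stated assumptions rather than the exact objective: it assumes a smooth L1 penalty for the disparity term, a simple weighted sum of the two parts, and a placeholder scalar for the Mask R-CNN loss; the function names and the `beta` parameter are hypothetical.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 penalty: quadratic for small errors, linear for large ones."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def disparity_loss(pred_disp, gt_disp):
    """Mean smooth L1 error over all pixels with ground-truth disparity."""
    errors = [smooth_l1(p, g) for p, g in zip(pred_disp, gt_disp)]
    return sum(errors) / len(errors)


def total_loss(l_disp, l_mask, lambda_d=0.4, lambda_m=0.6):
    """Assumed weighted sum of the disparity and Mask R-CNN loss terms."""
    return lambda_d * l_disp + lambda_m * l_mask


# Toy values: three pixels of predicted vs. ground-truth disparity,
# and a placeholder value standing in for the Mask R-CNN loss.
l_d = disparity_loss([10.5, 12.0, 7.0], [10.0, 10.0, 7.0])
loss = total_loss(l_d, l_mask=2.0)
```

With the weights 0.4 and 0.6, the segmentation term contributes more to the gradient than the disparity term whenever the two losses are of similar magnitude.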
Comparison of accuracy using different hyper-parameters in the loss function
Taking into account the characteristics of different networks, we provide a comparison of segmentation results using three mainstream network structures. From Table 4, we can see that the best accuracy is achieved with ResNeXt [20]. The reason is that the ResNeXt architecture replaces the standard residual block with the "split-transform-merge" strategy used in the Inception model [45]. The input of the block is projected into a series of low-dimensional representations, and several convolution filters are applied to them separately before the results are merged, instead of performing convolution on the entire input feature map.
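One consequence of applying filters separately to low-dimensional groups is a smaller weight count, which can be checked with simple arithmetic. The sketch below compares a dense convolution with a grouped one; the channel width (128), kernel size (3) and group count (32) are illustrative assumptions and are not taken from Table 4, and the helper name `conv_params` is hypothetical.

```python
def conv_params(in_ch, out_ch, kernel, groups=1):
    """Number of weights in a 2D convolution (bias ignored).

    With groups > 1, each output channel only sees in_ch / groups inputs.
    """
    assert in_ch % groups == 0 and out_ch % groups == 0
    return kernel * kernel * (in_ch // groups) * out_ch


dense = conv_params(128, 128, 3)               # one filter bank over all channels
grouped = conv_params(128, 128, 3, groups=32)  # 32 separate low-dimensional paths
ratio = dense // grouped
```

Under these assumed sizes the grouped layer needs 32 times fewer weights, which lets a block of this kind use wider layers within a comparable parameter budget.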
Comparison of accuracy using different backbone networks
In this paper, we have proposed a novel network architecture that takes advantage of the depth information extracted from stereo images to improve the quality and accuracy of instance segmentation. To this end, we have introduced a new module for computing the depth cost-volume, which can be integrated with the colour cost-volume to jointly capture useful disparities of stereo images. Experiments conducted on challenging datasets suggest that our method outperforms the baselines and state-of-the-art methods on different metrics. In future work, we will explore ways to deploy our framework in tasks related to 3D instance segmentation.
Acknowledgment
This work was supported in part by National Natural Science Foundation of China under the Grant Number 61872187, 62077023 and 62072246, in part by Natural Science Foundation of Jiangsu Province under the Grant Number BK20201306, and in part by the “111 Program” under Grant No. B13022.
