Semi-supervised semantic segmentation using an improved generative adversarial network

Abstract

This paper proposes a semi-supervised semantic segmentation network built on an improved generative adversarial network. A pixel-level discriminator should indicate how reliably each pixel of the predicted probability map has been classified. However, the confidence map generated by current pixel-level discriminators is not correlated with the actual credibility of the prediction. We study this problem and propose a new network which includes one generator and two discriminators. One of the discriminators outputs more reliable confidence maps on the pixel level; the other is trained to output a probability on the image level, which is used as a dynamic threshold in the semi-supervised module instead of being set manually. In addition, the trusted region shared by the two discriminators is used to provide the semi-supervised reference. Experiments on the PASCAL VOC 2012 and Cityscapes datasets show that the proposed network improves mean IOU over the compared method, supporting its effectiveness.

Keywords

Semi-supervised, semantic segmentation, generative adversarial network, confidence map

1 Introduction

Semantic segmentation has developed rapidly and is widely used in autonomous driving, biomedical and other fields [1–4]. Although convolutional neural networks based on the fully convolutional network (FCN) [6] greatly improve the performance of semantic segmentation, they require accurate per-pixel annotations for each training image, which are costly and time-consuming to obtain. To simplify the work of obtaining high-quality data, semi-supervised and weakly-supervised methods have been applied to semantic segmentation tasks. These methods usually assume additional annotations on the image level [7–11], box level [12, 13] or pixel level [14–17].

For semi-supervised semantic segmentation, some work uses generative adversarial network (GAN) [18, 19] which includes a generator network and a discriminator network. The generator network uses a semantic segmentation network to generate the probability maps of the semantic labels. The discriminator network generates the confidence map by distinguishing generated samples from target ones. For unlabeled data, the ground truth is obtained according to the manually set threshold and the confidence map generated by the discriminator network [20].

In this article, we focus on the discriminator network and the confidence map it generates, which supervise the generator to produce more accurate semantic segmentation results. Our work proposes a method for generating more reliable confidence maps, which is extremely important to improve the performance of semi-supervised semantic segmentation. First, for the generator network, an effective discriminator network is required to generate the reliable confidence maps, so as to further improve the performance of the generator. Secondly, for unlabeled data, it is necessary to generate the reliable pseudo label masks based on the confidence maps, which can be used to infer sufficiently close areas from the ground truth distribution. If the confidence maps are unreliable, it may generate unreliable pseudo label masks, which will affect the results of the generator.

For the image-level discriminator, its purpose is to distinguish whether the generated sample is real, and its output can represent the probability that the input is real. It regards generator samples as negative samples and target samples as positive samples. In most pixel-level discriminator networks [20, 21], the purpose should be to distinguish whether the pixels in the generated sample are real. However, it is unreasonable to regard all pixels in the generated samples as negative samples regardless of whether they are correct. As a consequence, the pixels in the generator sample cannot be distinguished and the generated confidence map does not reflect the correct probability of the segmentation result. In Fig. 1, each heat map is an overlay of the image and the confidence map, where blue means low reliability and red means high reliability. The heat maps in the second and third columns are generated by [20] and our network, respectively. For [20], the values of the confidence map are generally low regardless of whether the prediction is correct, which illustrates the inconsistency between the confidence map generated by the previous discriminator and the actual confidence probability. Our method aims to generate more reliable confidence maps, which in turn can improve the semantic segmentation results.

Fig. 1

Practical examples that have lower confidence in labeled pixels.

To improve the performance of semi-supervised semantic segmentation, our method focuses on the discriminator network and generates a reliable confidence map. To achieve this goal, we design a new discriminator network for generating the confidence map, covering the network inputs, the network structure, the ground truth of the confidence map and the loss function. We extend the structure of the previous discriminator with an encoder-decoder structure [22, 23]. The ground truth of the confidence map is generated based on the consistency between the generator sample and the target sample. The cross-entropy loss is used as the discriminator’s loss function. At the same time, in the semi-supervised module, the credibility threshold is determined by another discriminator network. Finally, based on the results of the two discriminators, a common trusted region is generated as the ground truth of unlabeled data.

In summary, the contributions of this work are the following:

We propose a semi-supervised semantic segmentation network, which is the first network to solve the problem of confidence map and ground truth of unlabeled data. It explores a new direction for semi-supervised semantic segmentation by GAN. By considering the credibility of the confidence map, the semi-supervised semantic segmentation result on mIOU is 2.1% higher than [20].

For the discriminator network, we improve the structure of the discriminator, which uses the encoder-decoder structure and the proposed new loss function to generate a more reliable confidence map. The experimental results show that the results after using the discriminator branch are positively correlated with the true confidence.

For the generator network, we use the DeepLabv3+ [24] network instead of DeepLabv2 [25] as the semantic segmentation network to further improve the performance of the generator network.

For the semi-supervised module, we use the other discriminator to generate the total probability as threshold on the image level. Based on the confidence map and the probability as threshold generated by the two discriminators, a common region is used to obtain more reliable ground truth for unlabeled data. Experimental results on the PASCAL VOC 2012 [26] and Cityscapes [39] datasets validate the effectiveness of the proposed semi-supervised semantic segmentation network using GAN.

2 Related work

Semantic Segmentation. Current semantic segmentation tasks are growing rapidly and many key innovations have been proposed in recent years. To apply deep learning to pixel-by-pixel image semantic segmentation, FCN [6] serves as the foundation of modern semantic segmentation methods by replacing the fully connected layer of the image classification network with the convolutional layer. Dilated convolution [27] achieves an expanded receptive field while maintaining the same feature resolution. PSPNet [28] uses spatial pyramid pooling to obtain multi-scale features in order to get multi-scale context.

However, pixel-level ground truth for segmentation is very difficult to label. To reduce labeling costs, many weakly-supervised and semi-supervised methods have been proposed in recent years. The method in [5] proposes a decoupled deep neural network to solve this problem. [23] revisits dilated convolution and applies it to semi-supervised semantic segmentation. The method in [29] explores the relationship between image labels and segmentation. [30] refines the segmentation by fusing object localization. Moreover, to make up for the loss of detailed boundary information caused by sparse annotation, some studies adopt semi-supervision: [31] and [32] perform semantic segmentation using a generative adversarial network with a few fully-annotated images.

Generative adversarial networks. GANs have been applied to many aspects of image processing, such as image generation [33, 34] and image enhancement [35], and have recently been widely used in semantic segmentation. [32] presents the first application of adversarial training for semantic segmentation. [20] takes the regions of the confidence map that are sufficiently close to the actual label map as the ground truth of unlabeled data for the semi-supervised part. However, the threshold used to judge whether a region is credible is refined by several hand-designed constraints before training.

3 Method

3.1 Motivation

When using GAN for segmentation tasks, the discriminator needs to supervise the generator to generate more realistic data through the output confidence map. For unlabeled data in semi-supervised learning, the pseudo label mask is generated based on the confidence map generated by the discriminator, so the reliability of the generated confidence map is very important. However, the value of confidence map generated by the current discriminator does not reflect the probability of whether the pixels are real. To quantify this problem, we compare the confidence probability of the discriminator output in [20] with the true confidence probability. Specifically, we conduct experiments using the network [20] with 1/8 images as labeled data on the PASCAL VOC 2012 validation dataset. The distribution of the predicted probability for each pixel in the confidence map output by the discriminator and the distribution of true confidence probability is shown in Fig. 2(a). Probability represents the value for each pixel on each confidence map. Proportion indicates the percentage of pixels that actually predicted correctly or incorrectly. TPred denotes that the pixel on the predicted label map is true, which is the same as the one on the ground truth label map, otherwise expressed by FPred. We can see that the actual number of correct pixels is relatively high when the predicted value is low. This indicates that the score of the confidence map generated by the discriminator is inconsistent with the true confidence probability.

Fig. 2

Relationship between the probability distribution of the confidence map and the true probability proportion in [20].
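The comparison behind Fig. 2 can be sketched as follows. This is a minimal illustration, assuming flattened per-pixel lists and equal-width probability bins; the function name `correctness_by_confidence`, the bin count and the toy values are illustrative assumptions and not part of the original experiments.

```python
def correctness_by_confidence(conf, pred, truth, n_bins=10):
    """Return, per confidence bin, the fraction of pixels predicted correctly.

    conf, pred and truth are flat sequences of equal length (one entry per pixel).
    Bins that contain no pixels are reported as None.
    """
    hits = [0] * n_bins
    totals = [0] * n_bins
    for p, y_hat, y in zip(conf, pred, truth):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        totals[b] += 1
        hits[b] += int(y_hat == y)
    return [h / t if t else None for h, t in zip(hits, totals)]


# Toy data (made up): four pixels, two bins.
conf = [0.1, 0.2, 0.8, 0.9]
pred = [1, 2, 3, 3]
truth = [1, 0, 3, 3]
proportions = correctness_by_confidence(conf, pred, truth, n_bins=2)
```

For a well-behaved confidence map, the returned proportions should rise from the low-probability bins to the high-probability bins.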

When GAN is applied to segmentation and semi-supervised learning, if an unreasonable reference probability map is used to train the discriminator, the confidence map generated by the discriminator cannot represent the prediction probability on the pixel level. This motivates us to generate more realistic confidence maps based on a reliable reference. Thus, we propose a new network which improves the structure of the discriminator network and learns a more reliable confidence map. The distribution of the pixels’ probabilities in the confidence map generated by our framework is shown in Fig. 2(b); it is positively correlated with the proportion of actually correct predictions.

3.2 Network architecture

Based on the semi-supervised semantic segmentation network [20], the overall structure after adding the new discriminator network is shown in Fig. 3. We refer to the image-level discriminator as D₁ and to the pixel-level discriminator as D₂. The details of the framework are presented in the following sections.

Fig. 3

Network architecture of the proposed semi-supervised semantic segmentation.

Define the size of the input RGB image X_n as H × W × 3 and we use G (X_n) to denote the class probability map over C classes of size H × W × C that the generator network produces, with the input given X_n. We denote one discriminator network as D₁ (·) that outputs 0 or 1 and the other discriminator network as D₂ (·) that outputs the confidence map of size H × W × 1. L_n is the ground truth label map of X_n, of which the size is H × W × 1 and the pixel values represent the categories they belong to. Y_n of size H × W × C is the ground truth of L_n after one-hot encoding.

Discriminator network D₁. This discriminator aims to classify a predicted label map as real or fake on the image level. Its structure is similar to that of most discriminator networks such as [32]. We use convolution layers with a 3 × 3 kernel and a stride of 2, followed by a ReLU layer. It takes the probability maps output by the generator as input and outputs a probability that represents the credibility of the label map generated by the generator. Furthermore, this probability is used as the threshold of the semi-supervised module. For the loss function of discriminator D₁, we use the binary cross-entropy loss:

$L_{D_{1}} = - (1 - y_{n}) \log (1 - D_{1} (G (X_{n}))) - y_{n} \log (D_{1} (Y_{n}))$ (1)

where y_n = 0 if the sample is generated from the generator network and y_n = 1 if the sample is the ground truth label map.

Discriminator network D₂. The role of this discriminator network is to generate a more reliable confidence map. We adopt 5 convolution layers, 4 Leaky-ReLU layers and 2 up-sampling layers. For the 5 convolution layers, we set the kernel size to 3 and the stride to 2, and the channel numbers are {64, 128, 256, 512, 1}. Each convolution layer except the last one is followed by a Leaky-ReLU layer parameterized by 0.2. For the up-sampling layers, we set the up-sampling rate to 2 and add them to the last layer and the third layer. The probability maps output by the generator are used as input. The innovation of the discriminator lies in the referenced confidence map and the loss function.

a) Referenced confidence map: We define C_map as the confidence probability of the generator output G (X_n). The discriminator judges the credibility of the predicted label map generated by the generator, so its output is represented by a confidence map. For each pixel, the category with the maximum probability in the generator output is selected, which gives a pixel-level predicted label map. The resulting predicted label map is defined as:

$G^{'} (X_{n})^{(h, w)} = \underset{c}{\arg\max}\, G (X_{n})^{(h, w, c)}$ (2)

so the ideal C_map is defined as follows:

$C_{map}^{(h, w)} = \begin{cases} 1, & G^{'} (X_{n})^{(h, w)} = L_{n}^{(h, w)} \\ 0, & G^{'} (X_{n})^{(h, w)} \neq L_{n}^{(h, w)} \end{cases}$ (3)

That is, the confidence map is based on whether the predicted label map is consistent with the ground truth, so that each pixel is binarized.
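The construction of the referenced confidence map in Eqs. (2) and (3) can be sketched as follows. This is a minimal illustration on nested Python lists of shape H × W × C; the function names and the toy values are assumptions made for the example.

```python
def predicted_label_map(prob):
    """Eq. (2): per-pixel argmax over the class axis of an H x W x C nested list."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row] for row in prob]


def reference_confidence_map(prob, labels):
    """Eq. (3): 1 where the predicted class equals the ground-truth class, else 0."""
    pred = predicted_label_map(prob)
    return [
        [int(p == t) for p, t in zip(pred_row, label_row)]
        for pred_row, label_row in zip(pred, labels)
    ]


# Toy 1 x 3 image with C = 2 classes (values are made up).
prob = [[[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]]
labels = [[0, 1, 1]]
c_map = reference_confidence_map(prob, labels)  # third pixel is mispredicted
```

Only the third pixel disagrees with the ground truth, so it is the only entry of the referenced confidence map set to 0.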

b) Loss function: When using GAN for classification tasks, the discriminator network acts as a classifier by using the 0-1 distribution as the overall distribution, just like D₁. The discriminator maximizes the probability of real images and minimizes the probability of generated images, so the probability represents the similarity between the generated image and the ground truth distribution. When GAN is used in semantic segmentation, however, the discriminator D′ in [20] applies this scheme on the pixel level, and the loss function of the discriminator is:

$L_{D^{'}} = - \sum_{(h, w)} \left[ (1 - y_{n}) \log (1 - D^{'} (G (X_{n}))^{(h, w)}) + y_{n} \log (D^{'} (L_{n})^{(h, w)}) \right]$ (4)

Here, y_n = 0 if the sample is drawn from the generator network and y_n = 1 if the sample is from the ground truth label. However, semantic segmentation operates on single pixels, so the confidence probability of each pixel is needed, and the overall distribution does not reflect the distribution of individual pixels. Taking generator samples as an example, works such as [20] use 0 as the reference value in the loss function no matter whether the prediction is true or false. As a result, the probabilities of the pixels cannot be regressed correctly, so the confidence map output by the discriminator is inaccurate. Therefore, in this paper, C_map is used as ground truth and the cross-entropy loss is used to regress the pixel-level probability. The loss function is:

$L_{D_{2}} = - \sum_{(h, w)} C_{map}^{(h, w)} \log (D_{2} (G (X_{n}))^{(h, w)})$ (5)
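A minimal sketch of this loss is given below. It implements the C_map-weighted log term of Eq. (5), written with the conventional leading minus sign of a cross-entropy loss so that minimizing it raises the output of D₂ on correctly predicted pixels. The function name `d2_loss`, the clamping constant `eps` and the nested-list layout are assumptions made for illustration.

```python
import math


def d2_loss(conf_pred, c_map, eps=1e-12):
    """Negative sum over pixels of C_map * log(D2 output), following Eq. (5).

    conf_pred and c_map are H x W nested lists. The leading minus sign is the
    usual cross-entropy convention and is an assumption of this sketch.
    """
    total = 0.0
    for pred_row, ref_row in zip(conf_pred, c_map):
        for d, c in zip(pred_row, ref_row):
            total -= c * math.log(max(d, eps))  # eps avoids log(0)
    return total


# Toy 1 x 2 confidence map (values are made up): only the first pixel is correct.
loss = d2_loss([[0.5, 0.2]], [[1, 0]])
```

Pixels whose referenced confidence is 0 do not contribute to this term, so in the toy case the loss depends only on the first pixel.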

In the discriminator module, we use the above two discriminators. One is to classify input images as true or false on the image level. The other is to infer the correct probability on the pixel level. Combining the two discriminators above, we give the adversarial loss:

$L_{ad} = - \log (D_{1} (G (X_{n}))) - \sum_{(h, w)} \log (D_{2} (G (X_{n}))^{(h, w)})$ (6)

Semi-supervised module. For unlabeled data there is no corresponding ground truth label map, so the loss function of the generator network cannot be calculated using the cross-entropy against the ground truth. However, as described for discriminator D₂, the confidence map generated by the discriminator can represent the similarity between the predicted label map and the ground truth label map. Further, we use the overall probability output by discriminator D₁ as a threshold for inferring the trusted region. Therefore, we use this threshold to binarize the confidence map and use the intersection to highlight the trusted areas in the predicted label map, so as to generate a reliable ground truth for unlabeled data.

Let G″ (X_n) ^(h,w,c) be the one-hot encoded label map of G′ (X_n) ^(h,w), i.e. G″ (X_n) ^(h,w,c) = 1 if c = G′ (X_n) ^(h,w) and 0 otherwise. The ground truth of unlabeled data $Y_{n}^{'}$ of size H × W × C is defined as:

$Y_{n}^{' (h, w, c)} = I (D_{2} (G (X_{n}))^{(h, w)} > D_{1} (G (X_{n}))) \cdot G^{''} (X_{n})^{(h, w, c)}$ (7)

where I (·) is the indicator function. Thus the cross-entropy loss for the semi-supervised module is:

$L_{s} = - \sum_{(h, w)} \sum_{c \in C} Y_{n}^{' (h, w, c)} \log (G (X_{n})^{(h, w, c)})$ (8)
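The pseudo-label construction of Eq. (7) and the loss of Eq. (8) can be sketched as follows. This is a minimal illustration on nested lists, assuming the generator output has shape H × W × C, the output of D₂ has shape H × W and the output of D₁ is a single number; the function names, `eps` and the toy values are assumptions made for the example.

```python
import math


def pseudo_label_mask(prob, d2_map, d1_out):
    """Eq. (7): one-hot argmax of G(X), kept only where D2's confidence exceeds D1's output."""
    mask = []
    for prob_row, conf_row in zip(prob, d2_map):
        out_row = []
        for px, conf in zip(prob_row, conf_row):
            best = max(range(len(px)), key=px.__getitem__)
            trusted = conf > d1_out  # dynamic threshold from the image-level discriminator
            out_row.append([int(trusted and c == best) for c in range(len(px))])
        mask.append(out_row)
    return mask


def semi_supervised_loss(prob, mask, eps=1e-12):
    """Eq. (8): cross-entropy between the pseudo labels and the generator output."""
    return -sum(
        y * math.log(max(p, eps))
        for prob_row, mask_row in zip(prob, mask)
        for px, ypx in zip(prob_row, mask_row)
        for p, y in zip(px, ypx)
    )


# Toy 1 x 2 image with C = 2 classes (values are made up).
prob = [[[0.8, 0.2], [0.3, 0.7]]]
d2_map = [[0.9, 0.4]]
mask = pseudo_label_mask(prob, d2_map, d1_out=0.6)
loss = semi_supervised_loss(prob, mask)
```

In the toy case the second pixel falls below the threshold, so its pseudo label is all zeros and it contributes nothing to the loss.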

Generator network. We improve the generator network by replacing DeepLabv2 with DeepLabv3+. We adopt the ResNet-101 [36] model with dilated convolution as the generator network backbone. Unlike [20], we apply dilated convolution in the last three convolution layers of the backbone with strides of 2, 4 and 8. In addition, we apply Atrous Spatial Pyramid Pooling (ASPP), which differs from that in DeepLabv2 by using several distinct dilation rates. Besides, we use image-level pooling to enlarge the receptive field. In particular, we replace the up-sampling layer with a decoder module in order to recover accurate object segmentation details. In the end, the output of the decoder module matches the size of the input image.

For the generator’s loss, we apply a weighted hybrid loss function which includes the multi-class cross-entropy loss L_ce, the adversarial loss L_ad provided by the discriminators and the semi-supervised loss L_s. λ_ad and λ_s are the two weights of the proposed multi-task loss function. The objective function to be minimized is:

$L_{g} = L_{ce} + \lambda_{ad} L_{ad} + \lambda_{s} L_{s}$ (9)

where L_ad and L_s are defined above and the multi-class cross-entropy loss L_ce is:

$L_{ce} = - \sum_{(h, w)} \sum_{c \in C} Y_{n}^{(h, w, c)} \log (G (X_{n})^{(h, w, c)})$ (10)
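The hybrid objective of Eqs. (9) and (10) can be sketched as follows. This is a minimal illustration on nested lists; the function names and `eps` are assumptions, the default weights are the values of λ_ad and λ_s reported in the implementation details, and the adversarial and semi-supervised losses are passed in as plain numbers.

```python
import math


def cross_entropy(prob, one_hot, eps=1e-12):
    """Eq. (10): -sum over pixels and classes of Y * log G(X), for H x W x C nested lists."""
    return -sum(
        y * math.log(max(p, eps))
        for prob_row, y_row in zip(prob, one_hot)
        for px, ypx in zip(prob_row, y_row)
        for p, y in zip(px, ypx)
    )


def generator_loss(l_ce, l_ad, l_s, lambda_ad=0.01, lambda_s=0.1):
    """Eq. (9): weighted sum of the three generator loss terms."""
    return l_ce + lambda_ad * l_ad + lambda_s * l_s


# Toy 1 x 1 image with C = 2 classes; l_ad and l_s are made-up numbers.
l_ce = cross_entropy([[[0.5, 0.5]]], [[[1, 0]]])
total = generator_loss(l_ce, l_ad=2.0, l_s=3.0)
```

With these weights the supervised cross-entropy dominates the objective, while the adversarial and semi-supervised terms act as smaller corrections.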

4 Experiment

We conduct experiments on the PASCAL VOC 2012 and Cityscapes segmentation datasets and use the mean intersection-over-union (mean IOU) as the evaluation metric. We follow the PASCAL VOC 2012 settings, using the 10,582-image train split for training and the 1,449-image validation split for validation. For the Cityscapes dataset, the numbers of annotated images for training, validation and testing are 2975, 500 and 1525, respectively. The numbers of classes in PASCAL VOC 2012 and Cityscapes are 20 and 19, respectively.
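For reference, the mean IOU metric can be sketched as follows. This is a simplified illustration on flat label lists; the function name `mean_iou` and the toy values are assumptions, and benchmark evaluation toolkits additionally accumulate counts over the whole dataset and handle ignore labels, which this sketch does not.

```python
def mean_iou(pred, truth, num_classes):
    """Average the per-class intersection-over-union over the classes that occur.

    pred and truth are flat sequences of class indices of equal length.
    Classes absent from both pred and truth are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example (made up): four pixels, two classes.
pred = [0, 0, 1, 1]
truth = [0, 1, 1, 1]
score = mean_iou(pred, truth, num_classes=2)  # class IoUs are 1/2 and 2/3
```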

4.1 Implementation details

Our implementation is based on [20]. For the PASCAL VOC 2012 dataset, input images are resized to 321 × 321 by random scaling and cropping during training, and we train all networks for 20k iterations with a batch size of 10. For the Cityscapes dataset, the size of the input images is 512 × 1024, the number of iterations is 40k and the batch size is 2.

For the generator network, we use ResNet-101 as the backbone and set the initial learning rate to 0.001. Stochastic Gradient Descent (SGD) [37] with momentum 0.9 and weight decay 10^-4 is used as the optimizer. The two discriminators D₁ and D₂ are both trained with the Adam optimizer [38] and a learning rate of 10^-4. To train the semi-supervised module, we randomly divide the training dataset into labeled data and unlabeled data. With labeled data, both the generator and the discriminators are trained; with unlabeled data, only the generator network is trained. For the hyper-parameters, we set λ_ad to 0.01 and λ_s to 0.1.

For the training process, we first train the generator and discriminators with the labeled images for 5000 iterations. Then we fix the parameters of discriminators and train the generator with the unlabeled images. As described in Section 3, the pseudo label masks are generated from the outputs of the two discriminators.

4.2 Experimental results

For the semi-supervised evaluation, we randomly take 1/8, 1/4 and 1/2 of the images as labeled data and use the rest as unlabeled data. We use our proposed generator network as the baseline and compare it with the baseline of [20] to show that our baseline model performs better.

Table 1 shows the results on the PASCAL VOC 2012 dataset. Our baseline achieves a stable improvement (about 1.5%) in mean IOU. Besides, we evaluate the discriminator D′ of [20] and our proposed discriminator D₂ on both baselines. The results show that our discriminator D₂ achieves a stable improvement on both baselines and improves the baseline by 1.3% to 3.1%. Furthermore, for the combination of baseline and discriminator, the mIOU of our proposed network is about 1.5% higher than that of [20].

Table 1

Comparison of the baseline and discriminator on PASCAL VOC 2012 validation results

Method	D′	D₂	Data amount
			1/8	1/4	1/2	Full
[20]			66.0	68.3	69.8	73.6
	✓		67.6	71.0	72.6	74.9
		✓	68.1	71.3	72.9	75.3
our baseline			67.2	69.4	71.3	75.2
	✓		68.6	71.9	73.6	76.2
		✓	69.0	72.5	74.1	76.5

Table 2 shows the results on the Cityscapes dataset. Both our baseline and our discriminator behave consistently with the results on PASCAL VOC 2012. Our baseline is about 1% higher than the baseline in [20]. The proposed discriminator D₂ brings a gain of 0.8% to 1.9% over the baseline, and the network achieves about 1% gain in mIOU over [20].

Table 2

Comparison of the baseline and discriminator on Cityscapes validation results

Method	D′	D₂	Data amount
			1/8	1/4	1/2	Full
[20]			55.5	59.9	64.1	66.4
	✓		57.1	61.8	64.6	67.7
		✓	57.3	61.8	64.9	67.8
our baseline			56.4	61.0	64.9	67.5
	✓		58.0	62.7	65.6	68.6
		✓	58.3	62.8	65.8	68.9

Table 3 reports the effects of different modules on performance on PASCAL VOC 2012. The discriminator D₁ alone brings a small improvement, like most image-level discriminators [32]. For the semi-supervised module, we manually set the threshold to 0.9 in the experiments where the discriminator D₁ is not used. The proposed semi-supervised learning module brings an overall gain of 0.8% to 2.1%. As the experimental results show, using the output probability of discriminator D₁ as the dynamic threshold performs better than the manually set static threshold, and the segmentation results gain about 0.5%. Table 4 shows the performance of each module on Cityscapes. Both the discriminator D₁ and the semi-supervised method improve performance, each with a gain of 0.3% to 0.5%.

Table 3

Comparison of the different combinations we proposed on PASCAL VOC 2012 validation results

Method	D₁	D₂	Semi-S	Data amount
				1/8	1/4	1/2	Full
Base-line				67.2	69.4	71.3	75.2
	✓			67.3	69.6	71.5	75.3
		✓		69.0	72.5	74.1	76.5
		✓	✓	71.1	73.7	74.9	N/A
	✓	✓		69.8	72.9	74.5	76.8
	✓	✓	✓	71.6	74.1	75.4	N/A

Table 4

Comparison of the different combinations we proposed on Cityscapes validation results

Method	D₁	D₂	Semi-S	Data amount
				1/8	1/4	1/2	Full
Base-line				56.4	61.0	64.9	67.5
	✓			56.4	61.1	65.0	67.5
		✓		58.3	62.8	65.8	68.9
		✓	✓	58.5	63.2	66.1	N/A
	✓	✓		58.3	63.1	65.9	69.2
	✓	✓	✓	58.8	63.4	66.4	N/A

Figure 4 shows the results of different combinations of the proposed modules. The combination of D₁, D₂ and the semi-supervised module gives the most accurate and detailed results. Moreover, as shown in Table 5, compared with [20], our network improves the overall performance by about 2% in semi-supervised semantic segmentation.

Fig. 4

Visual results on the PASCAL VOC 2012 dataset using 1/8 labeled data.

Table 5

Performance comparison of semi-supervised semantic segmentation with [20] on the PASCAL VOC 2012 validation set

Method	Data amount
	1/8	1/4	1/2	Full
[20]	69.5	72.1	73.8	N/A
Our	71.6	74.1	75.4	N/A

4.3 Ablation study

The choices of the discriminator network input: For the discriminator network, we first study the design choices of its input. We explore several designs by combining the RGB images with the predicted label map G (X_n) or the ground truth label map Y_n. These designs are shown in Fig. 5 and can be described as follows:

(a) C classes prediction maps.

(b) C classes prediction maps concatenated with RGB images: All the C classes prediction maps are concatenated with the RGB images.

(c) Label map concatenated with RGB images: The label map of the target class is taken and concatenated with the RGB images.

(d) C classes prediction maps multiplied with RGB images: All the C classes prediction maps are multiplied with the RGB images.

Fig. 5

Designs for input of discriminator.

The results are shown in Table 6. All designs improve on the baseline, and the differences among the four designs are small. We therefore use the C classes prediction maps as the input, since they give the best result.

Table 6

Results of designs for the discriminator D₂ input

Setting	Fully Supervised
baseline	75.2
(a) prediction map	76.2
(b) prediction map+image	76.1
(c) label map+image	76.0
(d) prediction map × image	76.1

The design of discriminator network. To capture sharper object boundaries, instead of directly up-sampling as in traditional methods, we add a simple and effective decoder module to the discriminator. Depending on the output stride and the up-sampling factor, there are several design choices for the decoder module, which are shown in Fig. 6:

output stride is 16 and up-sampling factor is 16.

output stride is 4 and up-sampling factor is 4.

output stride is 16 and up-sampling factor is 2.

output stride is 16 and the up-sampling factor is 4.

Fig. 6

Designs for structure of discriminator.

Table 7 shows the results for the above discriminator structures. The third and the fourth network structures perform better than the first two. The third structure (output stride 16, up-sampling factor 2) gives a slightly better result than the fourth, but at a higher computational complexity. Thus, the last decoder structure is used as our default choice.

Table 7

Results of structures for the discriminator D₂

Setting	Fully supervised
(a) baseline	75.2
(a) stride 4 + factor 4	75.4
(b) stride 16 + factor 2	76.6
(c) stride 16 + factor 4	76.5

Hyper-parameter analysis: To determine the effect of the new discriminator on the generator and the semi-supervised module, two hyper-parameters need to be set: λ_ad and λ_s. We use the control variable method, varying one hyper-parameter at a time, and report the results in Table 8. We first set λ_s = 0 and set λ_ad to 0.001, 0.01 and 0.1. The best result is obtained when λ_ad = 0.01. Then, we fix λ_ad = 0.01 and try different values of λ_s. As a result, we achieve the best performance when λ_ad = 0.01 and λ_s = 0.1.

Table 8

Results on hyper parameter

Data Amount	λ_ad	λ_s	mIOU
1/8	0.001	0	68.7
1/8	0.01	0	69.0
1/8	0.1	0	68.8
1/8	0.01	0.05	70.4
1/8	0.01	0.1	71.6
1/8	0.01	0.11	70.9
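The two-stage selection described in the hyper-parameter analysis can be replayed on the values of Table 8. This is a minimal sketch; the dictionary layout and variable names are assumptions, while the numbers are copied from the table.

```python
# (lambda_ad, lambda_s) -> mIOU with 1/8 labeled data, copied from Table 8.
results = {
    (0.001, 0.0): 68.7,
    (0.01, 0.0): 69.0,
    (0.1, 0.0): 68.8,
    (0.01, 0.05): 70.4,
    (0.01, 0.1): 71.6,
    (0.01, 0.11): 70.9,
}

# Stage 1: keep lambda_s = 0 and pick the best lambda_ad.
stage1 = {k: v for k, v in results.items() if k[1] == 0.0}
best_ad = max(stage1, key=stage1.get)[0]

# Stage 2: keep lambda_ad at that value and pick the best lambda_s.
stage2 = {k: v for k, v in results.items() if k[0] == best_ad}
best_ad, best_s = max(stage2, key=stage2.get)
```

Varying one weight at a time needs only six runs here, at the cost of not exploring combinations away from the selected λ_ad.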

5 Conclusions

In this paper, we propose an improved network structure for semi-supervised semantic segmentation which focuses on the confidence map output by the discriminator and the ground truth of unlabeled data. By adding a new discriminator network branch, a more reliable confidence map is generated and the performance of semi-supervised semantic segmentation is superior to that of the compared method. For the semi-supervised module, we replace the manually set threshold with the output probability of the other discriminator, which further improves the performance. Experiments on the PASCAL VOC 2012 and Cityscapes datasets confirm the effectiveness of the network.

References

Feng

, Haase-Schuetz

, Rosenbaum

, Hertlein

, Glaeser

, Timm

, et al., Deep multi-modal object detection and semantic segmentation for autonomous driving: datasets, methods, and challenges, (2019).

Lin

, Shen

, Hengel

A.V.D.

and Reid

, Efficient piecewise training of deep structured models for semantic segmentation, (2016).

Liu

, Li

, Luo

, Loy

C.C.

and Tang

, Semantic image segmentation via deep parsing network, (2016).

Zhou

X.Y.

and Yang

G.Z.

, Normalization in Training U-Net for 2D Biomedical Semantic Segmentation[J], 2018.

Hong,

Seunghoon

, Noh

Hyeonwoo

and Han

Bohyung

, Decoupled deep neural network for semi-supervised semantic segmentation, arXiv preprint arXiv:1506.04924 (2015).

Long,

Jonathan

, Shelhamer,

Evan

, Darrell and Trevor. , Fully convolutional networks for semantic segmentation, IEEE Transactions on Pattern Analysis & Machine Intelligence, (2017).

, Song

, Sun

, Ku

, Yang

, Liu

, et al., CAMEL: A Weakly Supervised Learning Framework for Histopathology Image Segmentation, 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, (2020).

Sun

and Li

, Saliency guided deep network for weakly-supervised image segmentation[J], 2018.

Hong

, Noh

and Han

, Decoupled Deep Neural Network for Semi-supervised Semantic Segmentation[J], 2015.

10.

Papandreou

, Chen

L.C.

, Murphy

and Yuille

A.L.

, Weakly- and semi-supervised learning of a dcnn for semantic image segmentation, (2015).

11.

Pathak

, Krhenbühl

and Darrell

, Constrained Convolutional Neural Networks for Weakly Supervised Segmentation[J], 2015.

12.

Song

, Huang

, Ouyang

and Wang

, Box-driven class-wise region masking and filling rate guided loss for weakly supervised semantic segmentation, (2019).

13.

Dai

, He

and Sun

, BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation[J], 2015.

14.

Ahn

, Cho

and Kwak

, Weakly Supervised Learning of Instance Segmentation with Inter-pixel Relations[J], 2019.

15.

, Ouyang

, Principe

J.C.

, Farrington

and Li

, Weakly supervised learning of point-level annotation for coral image segmentation, OCEANS 2019 MTS/IEEE SEATTLE. IEEE, (2019).

16. , Wu, Huang, Yi, Yan, Li, et al., Weakly supervised deep nuclei segmentation using partial points annotation in histopathology images, IEEE Transactions on Medical Imaging PP(99) (0), 1–1.

17. Bearman, Russakovsky, Ferrari and Fei-Fei, What’s the point: semantic segmentation with point supervision, (2016).

18. Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, ... and Bengio, Generative adversarial nets, In Advances in neural information processing systems (pp. 2672–2680), (2014).

19. Mirza and Osindero, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784 (2014).

20. Hung W.C., Tsai Y.H., Liou Y.T., Lin Y.Y. and Yang M.H., Adversarial learning for semi-supervised semantic segmentation, (2018).

21. Mittal, Tatarchenko and Brox, Semi-Supervised Semantic Segmentation with High- and Low-level Consistency, IEEE Transactions on Pattern Analysis and Machine Intelligence PP(99) (2019), 1–1.

22. Badrinarayanan, Kendall and Cipolla, SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, IEEE Transactions on Pattern Analysis & Machine Intelligence (2017), 1–1.

23. Wei, et al., Revisiting dilated convolution: A simple approach for weakly-and semi-supervised semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018).

24. Chen L.C., Zhu, Papandreou, Schroff and Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, In Proceedings of the European conference on computer vision (ECCV) (pp. 801–818), (2018).

25. Chen L.C., Papandreou, Kokkinos, Murphy and Yuille A.L., Semantic image segmentation with deep convolutional nets and fully connected crfs, arXiv preprint arXiv:1412.7062 (2014).

26. Everingham, Gool L.V., Williams C.K.I., Winn and Zisserman, The pascal visual object classes (voc) challenge, International Journal of Computer Vision 88(2) (2010), 303–338.

27. and Koltun, Multi-Scale Context Aggregation by Dilated Convolutions, International Conference on Learning Representations (ICLR), (2016).

28. Zhao, Shi, Qi, Wang and Jia, Pyramid Scene Parsing Network, (2016).

29. Wei, Liang, Chen, Jie, Xiao, Zhao, et al., Learning to segment with image-level annotations, Pattern Recognition 234–244, (2016).

30. Wei, Xiao, Shi, Jie, Feng and Huang T.S., Revisiting dilated convolution: a simple approach for weakly- and semi-supervised semantic segmentation, (2018).

31. Xiao, Wei, Liu, Zhang and Feng, Transferable semi-supervised semantic segmentation, (2017).

32. Luc, Couprie, Chintala and Verbeek, Semantic segmentation using adversarial networks, (2016).

33. Bao, Chen, Wen, Li and Hua, Cvae-gan: fine-grained image generation through asymmetric training, (2017).

34. Zhai, Chen, Tung, He, Nawhal and Mori, Lifelong GAN: Continual Learning for Conditional Image Generation, 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2020).

35. Jung, Yang and Cremers, Multi-Frame GAN: Image Enhancement for Stereo Visual Odometry in Low Light, (2019).

36. , Zhang, Ren and Sun, Deep residual learning for image recognition, (2016).

37. Hariharan, Arbelaez, Bourdev L.D., Maji and Malik, Semantic contours from inverse detectors, International Conference on Computer Vision, IEEE, (2011).

38. Kingma D.P. and Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).

39. Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth and Schiele, The cityscapes dataset for semantic urban scene understanding, In CVPR, (2016).


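Table 1
Comparison of the baseline and discriminators on the PASCAL VOC 2012 validation set

| Method | D′ | D₂ | 1/8 data | 1/4 data | 1/2 data | Full data |
|---|---|---|---|---|---|---|
| [20] | | | 66.0 | 68.3 | 69.8 | 73.6 |
| | ✓ | | 67.6 | 71.0 | 72.6 | 74.9 |
| | | ✓ | 68.1 | 71.3 | 72.9 | 75.3 |
| our baseline | | | 67.2 | 69.4 | 71.3 | 75.2 |
| | ✓ | | 68.6 | 71.9 | 73.6 | 76.2 |
| | | ✓ | 69.0 | 72.5 | 74.1 | 76.5 |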