Fully convolutional networks semantic segmentation based on conditional random field optimization

Abstract

Semantic segmentation classifies every pixel in an image and yields pixel-level results that follow the contour of the target object. However, the segmentation results produced by fully convolutional networks (FCNs) often lose detail information. This paper proposes a CRF-FCN model based on conditional random field (CRF) optimization. First, targets in the original image are detected with feature pyramid networks, and the extracted target region information is used to train the high-order potential function of the CRF. Then, the high-order CRF is used as the back-end of the fully convolutional network to refine the semantic image segmentation. Comparative experiments show that our algorithm makes target details clearer and improves the accuracy and efficiency of semantic segmentation.

Keywords

Conditional random field (CRF), fully convolutional networks (FCNs), semantic segmentation

1. Introduction

As computer vision technology develops, researchers increasingly apply deep learning to problems in image processing and recognition. Convolutional neural networks can combine image segmentation and image recognition, using image features extracted by the convolutional layers. The image is divided into a group of regions with certain semantics, and the category of each region is identified. Finally, a semantic image with a label for each pixel is obtained, which completes image analysis and understanding.

Traditional image segmentation divides a digital image into several specific regions with unique properties. These regions do not intersect, and each region meets some similarity criterion on gray level, texture, color or other features. The target is separated from the background based on these features. This approach needs neither complex model construction nor large-scale training samples, so it is more intuitive. However, target recognition based on feature points has to match the feature descriptors correctly and estimate the model that maps the template image to the image to be matched. The matching accuracy is affected by image noise and by transformations of the parameter space, and the algorithm is relatively inefficient. Later, scholars proposed target recognition based on random sample consensus and on global probabilistic models. For example, Fischler and Bolles proposed the RANSAC algorithm [1], which follows a hypothesize-and-verify framework: it iteratively estimates the model parameters from a data set that contains outliers and identifies the correct inliers. The whole process is random and data-driven. Matas et al. [2] first test each hypothesis on a small subset selected from the sample points. Only when all points in the subset are inliers is the remaining set of points verified, which improves the efficiency of the algorithm. Later, Lafferty et al. [3] proposed the CRF model, which integrates multiple types of local image features, links global information, models the posterior probability directly, classifies pixels effectively and accurately, and integrates the image segmentation and recognition tasks.

In early deep-learning approaches to semantic image segmentation, a pre-segmentation map was first generated by traditional image segmentation methods, and the pre-segmented regions were then classified by a trained CNN. Today, convolutional neural network models are more often designed so that the semantic segmentation result is obtained directly from the input image after training. Compared with target detection, semantic segmentation is a deeper and more important level of image understanding. For example, pixel-level image segmentation based on the fully convolutional network (FCN) is the most widely used model for semantic image segmentation and recognition. Reference [4] proposed converting a deep neural network into a fully convolutional network (FCN) for image recognition and obtained more accurate semantic segmentation results. Compared with the SDS method proposed in reference [5], both accuracy and speed are improved. The convolutional neural network models SegNet [6] and DeconvNet [7] train pixel features through convolution and deconvolution to address the segmentation problems caused by changes in object size. Other approaches include target recognition based on multi-scale deep segmentation networks [8, 14], the use of CRFs or region proposals to assist inference, and the training of multi-scale convolutional neural networks. This paper focuses on how to optimize deep-learning-based semantic segmentation to improve its accuracy and efficiency.

2. Semantic segmentation based on conditional random field

A conditional random field (CRF) [3] models the label sequence conditioned on the observation sequence. It is an undirected graphical model that computes the joint probability distribution of the whole label sequence given the observation sequence to be labeled. For semantic segmentation, the basic approach is to convert the image segmentation problem into an image labeling problem.

For an image $I$ of size $W\times H$, we define the random variable $Y=\{y_{1},y_{2},\cdots,y_{W\times H}\}$ as the observation variable, where $y_{i}$ is the observation at pixel $i$, and the random variable $X=\{x_{1},x_{2},\cdots,x_{W\times H}\}$, where $x_{i}$ is the category label of pixel $i$. Each pixel is regarded as a node, and the relationship between two pixels is regarded as an edge. The conditional random field is then expressed by $(X,Y)$. Given $Y$, the conditional probability distribution of $X$ is expressed by $P(X|Y)$.

Let $V$ be the set of pixel indices, $L$ the set of category labels, and $N_{i}$ the set of all pixels in the neighborhood of pixel $i$. If the random variable $X$ constitutes a Markov random field represented by the undirected graph $G=(V,N_{i})$, then each random variable $x_{i}$ satisfies the Markov property:

$\displaystyle V=\{1,2,\cdots,N\}\ (N=W\times H),\qquad L=\{l_{1},l_{2},\cdots,l_{W\times H}\}$

$\displaystyle P(x_{i}|x_{u}:u\in V-i)=P(x_{i}|x_{u}:u\in N_{i})$ (1)

Where $V-i$ denotes all pixels in $V$ other than pixel $i$, and $N_{i}$ contains the pixels connected to pixel $i$ by an edge in the undirected graph $G=(V,N_{i})$.
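As a minimal sketch of the neighborhood structure above, the set $N_{i}$ can be enumerated on the pixel grid. The 4-connected neighborhood, the row-major numbering starting from 1, and the function name `neighbors` are assumptions made for illustration; the text does not fix the connectivity.

```python
def neighbors(i, width, height):
    """Return the 4-connected neighborhood N_i of pixel i (1-based, row-major)."""
    row, col = divmod(i - 1, width)
    result = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        # Keep only neighbors that fall inside the W x H grid.
        if 0 <= r < height and 0 <= c < width:
            result.append(r * width + c + 1)
    return sorted(result)
```

With this numbering, a corner pixel has two neighbors and an interior pixel has four, so the Markov property in Eq. (1) conditions each $x_{i}$ on at most four other variables.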

Equation (1) indicates that, in the Markov random field represented by the undirected graph $G=(V,N_{i})$, the conditional probability of each node depends only on the nodes in its neighborhood. Therefore, the task of semantic segmentation with a conditional random field is to assign each random variable $x_{i}$ the correct category label $l_{i}$ from the label set $L$, such that the labeling $X$ agrees best with the observation $Y$. Once every pixel has a category label, the image segmentation problem is solved. According to the Bayes criterion, the labeling $\hat{x}$ that maximizes the posterior probability is the optimal segmentation result:

$\displaystyle\hat{x}=\arg\max_{X}P(X|Y)$ (2)

The posterior probability of conditional random field conforms to the Gibbs distribution, and its expression is defined as follows:

$\displaystyle P(X|Y)=\frac{1}{Z(Y)}\exp(-E(X|Y))$ (3)

Where $E(X|Y)$ is the energy of the labeling $X$ given the observation $Y$, and $Z(Y)$ is a normalization factor.
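To make Eqs. (2) and (3) concrete, the following sketch enumerates every labeling of a tiny image, converts energies into Gibbs probabilities and picks the maximum a posteriori labeling. Brute-force enumeration is only feasible for a handful of pixels and is used here purely for illustration; the function names and the toy energy are hypothetical.

```python
import math
from itertools import product

def gibbs_posterior(energy, num_pixels, labels):
    """Enumerate all labelings X and return {X: P(X|Y)} following Eq. (3)."""
    labelings = list(product(labels, repeat=num_pixels))
    weights = [math.exp(-energy(x)) for x in labelings]
    z = sum(weights)  # normalization factor Z(Y)
    return {x: w / z for x, w in zip(labelings, weights)}

def map_labeling(posterior):
    """Return the labeling with maximum posterior probability, as in Eq. (2)."""
    return max(posterior, key=posterior.get)

# Toy energy (an assumption): every pixel labeled 1 costs 1.0.
def toy_energy(x):
    return float(sum(x))

posterior = gibbs_posterior(toy_energy, num_pixels=2, labels=(0, 1))
best = map_labeling(posterior)
```

Because the exponential is monotone, maximizing $P(X|Y)$ is equivalent to minimizing $E(X|Y)$, which is why practical solvers work on the energy directly.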

The energy formula is as follows:

$\displaystyle E(X)=\theta_{u}E_{u}(X)+\theta_{p}E_{p}(X)=\theta_{u}\sum_{i\in V}\psi_{u}(x_{i})+\theta_{p}\sum_{i\in V,j\in N_{i}}\psi_{p}(x_{i},x_{j})$ (4)

Where $\psi_{u}(x_{i})$ is the univariate (unary) potential function and $\psi_{p}(x_{i},x_{j})$ is the binary (pairwise) potential function:

$\displaystyle\psi_{u}(x_{i})=-\log P(x_{i}|y)$ (5)

$\displaystyle\psi_{p}(x_{i},x_{j})=\begin{cases}0&\text{if }x_{i}=x_{j},\ \forall i,j\in V\\ \theta_{p}+\theta_{v}\exp(-\theta_{\beta}\|I_{i}-I_{j}\|^{2})&\text{otherwise}\end{cases}$ (6)
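The energy of Eqs. (4) to (6) can be evaluated directly for a small example. The sketch below assumes scalar gray-level intensities and a one-dimensional chain of pixels. The weight names `w_u` and `w_p` stand for the outer weights $\theta_{u}$ and $\theta_{p}$ of Eq. (4), and all numeric parameter values are arbitrary illustrative choices.

```python
import math

def unary(prob):
    """Eq. (5): negative log-probability of the label assigned to a pixel."""
    return -math.log(prob)

def pairwise(x_i, x_j, I_i, I_j, theta_p=1.0, theta_v=1.0, theta_beta=1.0):
    """Eq. (6): zero for equal labels, contrast-sensitive penalty otherwise."""
    if x_i == x_j:
        return 0.0
    return theta_p + theta_v * math.exp(-theta_beta * (I_i - I_j) ** 2)

def energy(labels, probs, intensities, neighbors, w_u=1.0, w_p=1.0):
    """Eq. (4): weighted sum of the unary and pairwise terms.

    probs[i][l] is P(x_i = l | y); neighbors[i] lists the indices in N_i.
    """
    e_u = sum(unary(probs[i][labels[i]]) for i in range(len(labels)))
    e_p = sum(
        pairwise(labels[i], labels[j], intensities[i], intensities[j])
        for i in range(len(labels))
        for j in neighbors[i]
    )
    return w_u * e_u + w_p * e_p

# Three pixels on a line; the third is much brighter than the first two.
probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
intensities = [0.1, 0.1, 0.9]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
e_smooth = energy([0, 0, 0], probs, intensities, neighbors)
e_edge = energy([0, 0, 1], probs, intensities, neighbors)
```

The penalty for a label change is largest between pixels of similar intensity and decays as the intensity difference grows, so label boundaries are encouraged to coincide with image edges.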

To strengthen the constraint between pixels and the regions they belong to, Kohli [8] proposed a high-order potential function as an additional constraint, which greatly improved the efficiency of the algorithm. The higher-order energy function is defined as:

$\displaystyle E(X)=\sum_{i\in V}\psi_{u}(x_{i})+\alpha\sum_{i\in V,j\in N_{i}}\psi_{p}(x_{i},x_{j})+\beta\sum_{c\in S}\psi_{c}(x_{c})$ (7)

Where $S$ is the set of regions to be segmented, and $c$ is a superpixel block composed of a certain number of pixels.

Kohli defines two kinds of high-order potential functions: a region-consistency potential function and a segmentation-quality-sensitive potential function. One of the definitions is as follows:

$\displaystyle\psi_{c}^{p}(x_{c})=\begin{cases}0&\text{if }x_{i}=l_{k},\ \forall i\in c\\ \theta_{p}^{h}|c|^{\theta_{\alpha}}&\text{otherwise}\end{cases}$ (8)

Another definition is as follows:

$\displaystyle\psi_{c}(x_{c})=\begin{cases}0&\text{if }x_{i}=l_{k},\ \forall i\in c\\ |c|^{\theta_{\alpha}}\left(\theta_{p}^{h}+\theta_{v}^{h}\exp\left(-\theta_{\beta}^{h}\frac{\left\|\sum_{i\in c}(f(i)-\mu)^{2}\right\|}{|c|}\right)\right)&\text{otherwise}\end{cases}$ (9)

The higher-order potential function introduces a higher-level consistency criterion: by judging the quality of a group of superpixel segments, a better semantic segmentation result can be obtained.
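Both high-order potentials, Eqs. (8) and (9), can be sketched for a single superpixel $c$. The sketch assumes that $f(i)$ is a scalar feature of pixel $i$ and that $\mu$ is the mean of $f$ over $c$, which the text does not state explicitly; the function names and default parameter values are hypothetical.

```python
import math

def potts_high_order(labels_c, theta_p_h=1.0, theta_alpha=0.8):
    """Eq. (8): zero if the whole superpixel shares one label, else a size-scaled cost."""
    if len(set(labels_c)) == 1:
        return 0.0
    return theta_p_h * len(labels_c) ** theta_alpha

def quality_high_order(labels_c, features_c, theta_p_h=1.0, theta_v_h=1.0,
                       theta_beta_h=1.0, theta_alpha=0.8):
    """Eq. (9): the cost of splitting a superpixel depends on its feature variance."""
    if len(set(labels_c)) == 1:
        return 0.0
    size = len(labels_c)
    mu = sum(features_c) / size  # assumed: mean feature of the superpixel
    variance = sum((f - mu) ** 2 for f in features_c) / size
    return size ** theta_alpha * (
        theta_p_h + theta_v_h * math.exp(-theta_beta_h * variance)
    )

# A homogeneous superpixel is more expensive to split than a heterogeneous one.
cost_uniform = quality_high_order([0, 1], [0.5, 0.5])
cost_varied = quality_high_order([0, 1], [0.0, 1.0])
```

Under these assumptions, a superpixel with low feature variance is treated as a reliable segment, so assigning it mixed labels is penalized more heavily.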

Figure 1.

Semantic segmentation results by conditional random field. (a) Original picture (b) univariate potential function (c) binary potential function (d) higher-order potential function.

Fig. 1 shows the segmentation results obtained with different potential functions. The results with the binary potential function are clearly better than those with only the univariate potential function, and the target details are more distinct. The high-order potential function improves the segmentation slightly further and improves the efficiency of the algorithm.

The conditional random field model is mathematically well founded, drawing on probability, expectation and optimization, and it performs well in natural language processing. However, the segmentation accuracy of this kind of algorithm depends to some extent on the image labels. Its execution speed and accuracy need to be improved to better meet the real-time requirements of robot vision.

3. Semantic image segmentation based on fully convolutional networks

Long et al. [4] proposed the fully convolutional network (FCN) model at CVPR 2015, which realizes end-to-end semantic image segmentation. FCN replaces the fully connected layers of a CNN with convolutional layers and uses deconvolution layers, initialized as bilinear interpolation, to upsample the feature map back to the size of the input image, so that a prediction is produced for each pixel. Finally, it classifies the upsampled feature map pixel by pixel to obtain the semantic segmentation result.

In a CNN, several fully connected layers usually follow the convolutional layers. They map the feature map generated by convolution to a feature vector in the sample label space, which gives the predicted probability of each category for the whole image and suits classification and regression tasks. For example, the output of the final fully connected layer of AlexNet [10] is a 1000-dimensional vector. A softmax classifier turns it into the probability that the input image belongs to each category, which achieves 1000-category classification. Although this structure can accurately determine the category of the objects in an image, it cannot classify each pixel or outline the object, so it is hard to achieve precise semantic image segmentation.
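The softmax step mentioned above can be sketched in a few lines. The three-element score vector is a toy stand-in for the 1000-dimensional output of the final fully connected layer, and the function name is illustrative.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to one."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

In an image classifier this is applied once per image; a fully convolutional network applies the same operation independently at every pixel of the upsampled score map.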

The fully convolutional network is an end-to-end image segmentation method that lets the network make pixel-level predictions and output the label map directly. It accepts input images of any size and classifies pixel by pixel on the upsampled feature map. However, the convolution and pooling operations in the front end of the network shrink the image and reduce its resolution. Although upsampling guarantees an end-to-end output, the segmentation accuracy is limited by the loss of information. To classify pixels more accurately, the high-resolution features of the earlier layers can be combined with the low-resolution features of the later layers.

Figure 2.

Upsampling optimization graph based on skip layer.

As shown in Fig. 2, each convolution and pooling stage halves the resolution of its input. After five such stages, the feature map is reduced to 1/32 of the original image size, and the original size can be recovered by 32x upsampling. We call this network FCN-32s. If a 1 $\times$ 1 convolution layer is added on top of the pool4 layer to generate an additional class prediction, the FCN-16s model is obtained by adding this prediction to the 2x upsampled prediction of the conv7 layer and then upsampling the sum 16x back to the image size. The FCN-8s model fuses the prediction of the pool3 layer with the 2x upsampled prediction of the pool4 layer and the 4x upsampled prediction of the conv7 layer, and then upsamples the result 8x back to the image size. We compare the semantic segmentation of images from the Pascal VOC2012 dataset using FCN-32s, FCN-16s and FCN-8s (see Fig. 3). The edges in the FCN-32s results are relatively smooth and details are poorly reflected, while the results of FCN-8s are more accurate than those of FCN-16s and FCN-32s. In general, however, the segmentation results are still not fine enough, and there is still a gap between the edge details and the ground truth segmentation.
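The stride arithmetic and the skip-layer fusion described above can be illustrated with toy score maps. This is a simplified sketch: nearest-neighbor 2x upsampling stands in for the bilinear or learned deconvolution used in FCN, only a single class channel is shown, and the helper names are hypothetical.

```python
def upsample2x(score_map):
    """Nearest-neighbor 2x upsampling of a 2-D list (stand-in for deconvolution)."""
    out = []
    for row in score_map:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def fuse(coarse, fine):
    """Upsample the coarse map 2x and add it element-wise to the finer map."""
    up = upsample2x(coarse)
    return [[a + b for a, b in zip(r_up, r_fine)] for r_up, r_fine in zip(up, fine)]

# Downsampling factor after n stride-2 pooling stages.
strides = {name: 2 ** n for name, n in (("pool3", 3), ("pool4", 4), ("conv7", 5))}

# Toy score maps for a 32x32 input: conv7 is 1x1, pool4 is 2x2, pool3 is 4x4.
conv7 = [[1.0]]
pool4 = [[0.0, 1.0], [2.0, 3.0]]
pool3 = [[0.0] * 4 for _ in range(4)]

fcn16_scores = fuse(conv7, pool4)        # stride 16, needs 16x upsampling
fcn8_scores = fuse(fcn16_scores, pool3)  # stride 8, needs 8x upsampling
```

Each fusion step halves the remaining upsampling factor, which is why FCN-8s only has to recover a factor of 8 in the final step and preserves more spatial detail than FCN-32s.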

Although fully convolutional networks achieve good semantic image segmentation, the high-dimensional features obtained after convolution are relatively abstract and the upsampling operation is relatively simple. As a result, detailed information about the target structure is lost, small target objects are easily ignored or mislabeled, the category correlation between adjacent pixels is poorly expressed, and spatial consistency is lacking.

Figure 3.

Comparison of semantic image segmentation results. (a) Original image (b) FCN-32s (c) FCN-16s (d) FCN-8s (e) Ground truth.

4. CRF-FCN network based on conditional random field optimization

To improve the accuracy of segmentation, inspired by CRF-as-RNN, we use a high-order conditional random field as the back-end of the fully convolutional network to optimize semantic image segmentation. First, we detect targets in the original image with feature pyramid networks, extract the target region information, and construct the high-order potential function of the conditional random field. Next, we feed the images to be segmented into the fully convolutional network to generate a coarse prediction map. Finally, the prediction map is upsampled, and the high-order conditional random field is used for iterative optimization.

4.1 Target detection based on feature pyramid networks

To detect small objects, this paper uses feature pyramid networks for target detection. The results of the network trained on images from the VOC2012 database are shown in Fig. 4. Target detection based on feature pyramid networks can also detect small objects in the image, with relatively high accuracy and speed. The detected regions of interest are then used to train the parameters of the high-order potential function of the conditional random field.

Figure 4.

Target detection based on feature pyramid networks.

4.2 Algorithm flow of high-order conditional random field model

The iterative algorithm flow of CRF-FCN is as follows:

(1)
The samples $\{(X^{(n)},Y^{(n)})\}$ obtained by upsampling the prediction map of the fully convolutional network are used as the input training samples of the high-order CRF model;
(2)
The unary potential $\psi_{u}(x_{i})$ is initialized from the input samples, and the approximate distributions of the random variables $X_{i}$ and $Y_{d}$ are initialized as:

$\displaystyle Q^{0}(X_{i}=l)=\frac{1}{Z_{i}}\exp(-\psi_{u}(l))\quad Q^{0}(Y_{d}=b)=h_{d}^{b}(1-h_{d})^{1-b}$
(3)
According to the current model parameters, the energy function of the $t$-th iteration is updated:

$\displaystyle E(X)=\sum_{i\in V}\psi_{u}(x_{i})+\alpha\sum_{i\in V,j\in N_{i}}\psi_{p}(x_{i},x_{j})+\beta\sum_{c\in S}\psi_{c}(x_{c})$
(4)
Update $\psi_{u}(x_{i})$ , $\psi_{p}(x_{i},x_{j})$ and $\psi_{c}(x_{c})$ ;
(5)
Calculate the approximate distribution of random variables $X_{i}$ and $Y_{d}$ :

$\displaystyle Q^{t+1}(X_{i}=l)=\frac{1}{Z_{i}}\exp(-E^{t}(X_{i}=l))\quad Q^{t+1}(Y_{d}=b)=\frac{1}{Z_{d}}\exp(-E^{t}(Y_{d}=b))$
(6)
If the maximum number of iterations $T$ has not been reached, go to step (3) and repeat the cycle.
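The iteration above can be sketched as a mean-field style update. The sketch is deliberately reduced and is not the authors' implementation: it keeps only the unary term and a plain Potts pairwise term, and omits the high-order term $\psi_{c}$ and the detection variables $Y_{d}$. The function names, the weight `alpha` and the toy data are assumptions.

```python
import math

def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def mean_field(unary, neighbors, alpha=1.0, max_iters=5):
    """Iteratively refine the per-pixel label distributions Q.

    unary[i][l] is psi_u for pixel i and label l; neighbors[i] lists N_i.
    The pairwise term is a plain Potts penalty (an assumption of this sketch).
    """
    num_labels = len(unary[0])
    # Step (2): Q^0(X_i = l) = exp(-psi_u(l)) / Z_i
    q = [normalize([math.exp(-u) for u in row]) for row in unary]
    for _ in range(max_iters):  # step (6): stop after T iterations
        new_q = []
        for i, row in enumerate(unary):
            # Steps (3)-(4): expected energy of giving pixel i the label l
            energies = [
                row[l] + alpha * sum(1.0 - q[j][l] for j in neighbors[i])
                for l in range(num_labels)
            ]
            # Step (5): Q^{t+1}(X_i = l) = exp(-E^t(X_i = l)) / Z_i
            new_q.append(normalize([math.exp(-e) for e in energies]))
        q = new_q
    return q

# Three pixels in a row: the middle one weakly prefers label 1,
# while both of its neighbors strongly prefer label 0.
unary = [[0.1, 2.0], [0.8, 0.6], [0.1, 2.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
q = mean_field(unary, neighbors)
labels = [max(range(2), key=row.__getitem__) for row in q]
```

In this toy case the middle pixel starts with a slight preference for label 1 but is pulled to label 0 by its neighbors after the first update, which is the smoothing effect the CRF back-end is meant to provide.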

4.3 Training process of CRF-FCN

In this paper, the fully convolutional network is optimized with the conditional random field model. FCN-8s is used as the main FCN. The high-order conditional random field model, trained with the target detection information from feature pyramid networks, is written as a network layer similar to a convolution layer and added after the softmax layer of the fully convolutional network. This avoids interfering with the forward and backward propagation of the network. The network structure and training process are shown in Fig. 5.

Figure 5.

CRF-FCN convolution neural network model.

Figure 6.

Comparison of segmentation results. (a) Original image (b) FCN-8s (c) Ours (d) Ground truth.

The steps of network training are as follows:

(1)

The input image is resized to 256 $\times$ 256;

(2)

Build the image data sets (training data set and test data set) and the label data set (in LMDB or LEVELDB format);

(3)

The ground truth images in the dataset are processed to generate the label images for training, and the LMDB is then generated;

(4)

The conditional random field algorithm, written in C++, is compiled, wrapped as a network layer according to the definition requirements of the relevant network layer, and added after the softmax layer of the fully convolutional network;

(5)

Images from the VOC2012 dataset are selected as experimental data. Starting from the trained FCN-8s, the CRF-FCN network is trained and tested to verify the improvement in semantic image segmentation performance, as shown in Fig. 6.

The segmentation results show that, compared with FCN-8s, the fully convolutional network optimized by the high-order conditional random field achieves better semantic image segmentation. Because the high-order potential function of the conditional random field is trained on the detections of feature pyramid networks, the recognition and segmentation of small-scale targets is clearly better than with the ordinary conditional random field, details are segmented more accurately, and edges are recognized more clearly.

4.4 Comparison and analysis

In semantic segmentation experiments, the mean pixel accuracy and the Intersection over Union (IoU) of each target class are generally used to evaluate the segmentation accuracy of a model quantitatively, and the time each algorithm takes to process one image is used to evaluate its execution efficiency.

The mean pixel accuracy is defined as:

$\displaystyle\text{mean accuracy}=\frac{1}{N}\sum_{i=1}^{N}\frac{\textit{obj}_{tp_{i}}}{\textit{obj}_{tp_{i}}+\textit{obj}_{fn_{i}}}\times 100\%$ (10)

Intersection over Union is defined as:

$\displaystyle\text{IoU}=\frac{\textit{obj}_{tp_{i}}}{\textit{obj}_{tp_{i}}+\textit{obj}_{fp_{i}}+\textit{obj}_{fn_{i}}}\times 100\%$ (11)

Where $N$ is the number of target classes and $i\in\{1,\cdots,N\}$ indexes a class. $\textit{obj}_{tp_{i}}$ is the number of true positive pixels, $\textit{obj}_{fn_{i}}$ is the number of false negative pixels, and $\textit{obj}_{fp_{i}}$ is the number of false positive pixels of class $i$. The larger the IoU value, the better the accuracy of the model.
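Equations (10) and (11) can be computed from a predicted and a ground truth label map as sketched below. The label maps are flattened toy lists, the function names are hypothetical, and skipping classes that do not occur in the ground truth when averaging is an assumption of this sketch.

```python
def per_class_counts(pred, truth, num_classes):
    """Count true positives, false positives and false negatives per class."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for p, t in zip(pred, truth):
        if p == t:
            tp[t] += 1
        else:
            fp[p] += 1  # pixel wrongly assigned to class p
            fn[t] += 1  # pixel of class t that was missed
    return tp, fp, fn

def mean_accuracy(tp, fn):
    """Eq. (10): average over classes of tp / (tp + fn), in percent.

    Classes absent from the ground truth are skipped (an assumption).
    """
    ratios = [t / (t + f) for t, f in zip(tp, fn) if t + f > 0]
    return 100.0 * sum(ratios) / len(ratios)

def iou(tp, fp, fn, i):
    """Eq. (11): intersection over union of class i, in percent."""
    return 100.0 * tp[i] / (tp[i] + fp[i] + fn[i])

# Flattened toy label maps with two classes (0 = background, 1 = object).
truth = [0, 0, 0, 0, 1, 1, 1, 1]
pred = [0, 0, 0, 1, 1, 1, 1, 0]
tp, fp, fn = per_class_counts(pred, truth, num_classes=2)
```

For these toy maps each class has three correct pixels, one false positive and one false negative, so the mean accuracy is 75% while the IoU of each class is only 60%, which illustrates that IoU is the stricter measure.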

The experiments were run on a computer with an 8-core AMD FX-8300 CPU and 8 GB of memory. The experimental platform is Caffe (Python 3.6) under Windows 10, and an NVIDIA GeForce GTX 1080 Ti graphics card is used to accelerate the training of the convolutional network.

Table 1

Comparison of the mean IoU (%) of each algorithm for different object classes

Algorithm	Dog	Chair	Motorbike
FCN-8s	70.28	23.42	58.22
FCN-8s+CRF	73.14	24.35	62.42
Ours	75.33	29.89	68.54

Table 1 compares the accuracy of each algorithm for the different object classes.

Table 2

Comparison of segmentation accuracy and running time of the algorithms

Image	Algorithm	Pixel accuracy (%)	Mean IoU (%)	Inference time (ms)
Dog	FCN-8s	91.15	70.28	378.8
	FCN-8s+CRF	96.81	83.14	23079.2
	Ours	99.04	85.33	468.8
Chair	FCN-8s	90.51	23.42	380.4
	FCN-8s+CRF	92.23	38.35	24341.7
	Ours	93.22	43.89	641.9
Motorbike	FCN-8s	92.59	58.22	387.4
	FCN-8s+CRF	95.32	68.42	25210.3
	Ours	96.79	70.54	747.2
Mean value	FCN-8s	91.42	50.64	382.20
	FCN-8s+CRF	94.8	63.30	24210.4
	Ours	96.35	66.59	619.3

Table 2 compares the average segmentation accuracy and running time of each algorithm.
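The "Mean value" rows of Table 2 can be recomputed from the per-image rows with the short check below. The ratio of mean inference times computed at the end is derived here from the table and is not a figure stated in the text; the variable and function names are illustrative.

```python
# Per-image rows of Table 2: (pixel accuracy %, mean IoU %, inference time ms).
table2 = {
    "FCN-8s": [(91.15, 70.28, 378.8), (90.51, 23.42, 380.4), (92.59, 58.22, 387.4)],
    "FCN-8s+CRF": [(96.81, 83.14, 23079.2), (92.23, 38.35, 24341.7), (95.32, 68.42, 25210.3)],
    "Ours": [(99.04, 85.33, 468.8), (93.22, 43.89, 641.9), (96.79, 70.54, 747.2)],
}

def column_means(rows):
    """Average each column over the rows of one algorithm."""
    return tuple(sum(col) / len(col) for col in zip(*rows))

means = {name: column_means(rows) for name, rows in table2.items()}

# Ratio of mean inference times, FCN-8s+CRF relative to the proposed model.
time_ratio = means["FCN-8s+CRF"][2] / means["Ours"][2]
```

The recomputed means agree with the "Mean value" rows after rounding, and the inference-time ratio shows that the separate CRF post-processing dominates the running time of FCN-8s+CRF.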

The experiments use the trained network models. Compared with the FCN-8s model, our model improves the segmentation accuracy by refining the semantic segmentation results with CRFs, as the mean IoU shows. Compared with the FCN-8s+CRF model, our model has similar segmentation accuracy. However, in the segmented images, details are segmented more accurately and edges are recognized more clearly.

In our model, the high-order conditional random field model trained with FPN target detection information is written as a network layer similar to a convolution layer and added after the softmax layer of the FCN. Therefore, compared with the FCN-8s+CRF model, the inference time of our model is greatly reduced. Using the trained model for semantic segmentation thus improves efficiency and greatly shortens the image processing time.

5. Conclusion

This paper analyzes in detail the structure of CRFs and FCNs for semantic segmentation. Based on this discussion, we propose a CRF-FCN network based on conditional random field optimization. It uses a high-order conditional random field, trained with target detection information, as the back-end network layer of the fully convolutional network and iteratively refines the network's pre-segmentation results to obtain better segmentation accuracy. Image segmentation experiments verify the effectiveness and accuracy of the method in semantic image segmentation and recognition.

Acknowledgments

This research was funded by the National Natural Science Foundation of China (Grant: 51875266).

References

1. Bolles R.C. and Fischler M.A., A RANSAC-based approach to model fitting and its application to finding cylinders in range data, IJCAI, 1981, 637–643.

2. Matas and Chum, Randomized RANSAC with $T_{d,d}$ test, Image and Vision Computing 22(10) (2004), 837–842.

3. Lafferty J.D., McCallum and Pereira F.C.N., Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, in: Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 282–289.

4. Long, Shelhamer and Darrell, Fully convolutional networks for semantic segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.

5. Hariharan, Arbelaez and Girshick, Simultaneous detection and segmentation, Computer Vision-ECCV, 2014, 297–312.

6. Badrinarayanan, Handa and Cipolla, SegNet: a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling, Computer Vision and Pattern Recognition (cs.CV), 2015, 1254–1264.

7. Noh, Hong and Han, Learning Deconvolution Network for Semantic Segmentation, in: IEEE International Conference on Computer Vision, 2015, pp. 1520–1528.

8. Chen L.C., Papandreou and Kokkinos, Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs, Computer Science, 2016, 357–361.

9. Long, Shelhamer and Darrell, Fully convolutional networks for semantic segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.

10. Krizhevsky, Sutskever and Hinton G.E., ImageNet classification with deep convolutional, in: Conference on Neural Information Processing Systems, 2012, pp. 1097–1105.

11. Sun, Zhang et al., Gastric histopathology image segmentation using a hierarchical conditional random field, Biocybernetics and Biomedical Engineering, 2020, 1535–1555.

12. Golpardaz, Helfroush M.S. and Danyal, Nonsubsampled contourlet transform-based conditional random field for SAR images segmentation, Signal Processing 174 (2020), 107623.

13. Qiu, Gao and Han, Saliency detection using a deep conditional random field network, Pattern Recognition 103 (2020), 170266.

14. Zhao, Zhang and Hu, A multi-scale strategy for deep semantic segmentation with convolutional neural networks, Neurocomputing, 2019, 273–284.

15. Shaaban A.M., Salem N.M. and Al-atabany W.I., A semantic-based scene segmentation using convolutional neural networks, AEU – International Journal of Electronics and Communications 125 (2020), 153364.

16. Sang, Zhou and Zhao, PCANet: pyramid convolutional attention network for semantic segmentation, Image and Vision Computing, 2020, 103997.