Abstract
Deep convolutional neural networks (CNNs) have shown outstanding performance in salient object detection. However, two conundrums remain under-explored. 1) High-level features help locate salient objects, while low-level features contain fine-grained details. How to combine these two types of features to improve accuracy is the first conundrum. 2) Previous CNN-based methods apply a convolutional layer after feature extraction to infer saliency maps. When the input images differ greatly from the training dataset, a convolutional layer used as a classifier is not robust enough to detect all salient objects. In addition, the limited receptive field and the lack of spatial correlation leave salient objects incomplete and blur their boundaries. In this paper, a Lateral Hierarchically Refining Network (LHRNet) is put forward for accurate salient object detection. First, LHRNet efficiently integrates multi-level features, simultaneously incorporating coarse semantics and fine details. Then a coarse saliency prediction is made from low-resolution features by convolution. Finally, a series of nearest neighbor classifiers is learned to hierarchically restore the missing parts of salient objects while refining their boundaries, yielding a more reliable final prediction. Comprehensive experiments demonstrate that this network performs favorably against state-of-the-art approaches on six datasets.
Introduction
Salient object detection aims to identify the most visually distinctive objects or regions in an image, which commonly serves as the first step for a variety of computer vision applications such as semantic segmentation [17], image and video compression [18], object recognition [19, 20], visual tracking [21] and scene classification [22].
In many traditional methods [23–26], low-level hand-crafted features, center-surround contrast and heuristic priors play essential roles in determining the position of a prominent object. However, these methods struggle to detect salient objects with diverse scales, shapes and locations in complex scenes, especially when the image background is similar to the foreground.
Recently, deep convolutional neural networks (CNNs) have delivered remarkable performance in many recognition tasks. CNNs can be trained end-to-end through back propagation without extracting traditional hand-crafted features or designing many heuristic priors. However, the repeated subsampling operations adopted in earlier CNN-based models, such as pooling layers and strided convolutional layers, significantly reduce the initial image resolution, which is fatal for dense prediction tasks. To deal with this problem, the Fully Convolutional Network (FCN) [27] adopts de-convolutional layers to recover spatial information and achieved leading results in semantic segmentation. Motivated by the strengths of FCN, several attempts have been made to apply FCN to salient object detection.
Most previous FCN-based models for salient object detection face two conundrums.
a) Some models [5, 7] predict saliency maps mainly from high-level semantic features of deep convolutional layers, without taking low-level spatial details into consideration. However, high-level features help locate salient objects, while low-level features help the saliency maps retain fine object boundaries. Although features from different levels all contribute to better results, simply concatenating them is not the optimal way to use them. Hou et al. [13] add short connections between multiple side output layers to combine features of different levels. Lin et al. [29] propose a top-down architecture with lateral connections to build high-level semantic feature maps at all scales. Zhang et al. [8] integrate multi-level feature maps into multiple resolutions and then adaptively combine these feature maps to predict saliency maps. Zhang et al. [9] propose a model that passes messages among features from different levels and design a mechanism to adaptively fuse these messages. All of these models improve accuracy by constructing communication between different side output layers, which essentially expands the parameter space and complexity of their networks, but their results are still not satisfactory. How to effectively use features from different levels is the first conundrum.
b) Many methods [5, 13] adopt convolutional layers with kernel size 1x1 or 3x3 as classifiers at the end of the network to infer saliency maps, which is sub-optimal for detecting all salient objects with diverse scales, shapes and locations, especially those with complex boundaries in complex scenes. This problem can be explained in three ways. (1) The parameters of the convolutional kernels depend on the patterns in the training dataset, and they are fixed once training ends. When the network encounters a pattern that differs greatly from the training dataset, it is likely to produce a poor saliency map because it can no longer adapt the kernel parameters to the input. (2) The receptive field of a convolutional kernel is limited. When a kernel covers only a small region, it cannot determine which of the covered pixels lie in the foreground and which lie in the background. Salient object detection needs global semantic information to decide which objects belong to the foreground and which to the background. This situation is illustrated in Figure 1. However, it is impractical to make a convolutional kernel as large as the input image, since this would lead to an explosion of memory and computation.

The red solid dot represents a certain pixel, and the red dashed square boxes represent receptive fields of different sizes. The saliency of the pixel cannot be determined from a small receptive field. In other words, it is unclear whether the green area or the orange area belongs to the salient object. As the receptive field increases, it becomes clear that the pixel belongs to a dog running on the lawn. In this case, the pixel should undoubtedly be marked as salient.
(3) Convolutional layers share weights through a sliding window, but this operation results in a lack of spatial correlation between distant parts of the same salient object. For example, in a complicated scene, a person's arm may be properly segmented while the legs are lost. If global spatial correlation is introduced in the inference process, the correlation between different parts of the same salient object is enhanced. As long as one of the parts is correctly segmented, the remaining ones can be correctly segmented regardless of their distances. Rather than adopting a convolutional layer to infer results, LPS [12] learns a nearest neighbor classifier to refine the results generated by previous methods, which can be considered a substitute for the conditional random field (CRF) [30]. The nearest neighbor classifier (NNC) has no parameters, and it obtains the saliency of each pixel from global contrast, which achieves global spatial correlation to some extent. However, the structure proposed in LPS is not an end-to-end model because it needs the results produced by other models as initial saliency maps.

Visual examples of the proposed LHRNet, Amulet [8], DCL [6] and GBMPM [9]. The red dashed square boxes indicate the missing saliency information. Amulet, DCL, GBMPM and our S4 all adopt a convolutional layer at the end of the network to infer saliency maps. Although the salient objects in S4 are incomplete and have fuzzy boundaries, their missing parts are restored and their boundaries are refined through a series of NNCs.
In this paper, motivated by the strengths of NNC, a Lateral Hierarchically Refining Network (LHRNet) is proposed for accurate salient object detection. First, LHRNet efficiently integrates multi-level features, simultaneously incorporating coarse semantics and fine details. Then a coarse saliency map is predicted from low-resolution features by a convolutional layer. Finally, a series of nearest neighbor classifiers is learned to hierarchically restore the missing parts of salient objects while refining their boundaries, yielding a more reliable final prediction. Figure 2 lists some saliency maps predicted by our LHRNet and other state-of-the-art methods. The main contributions of this paper are summarized as follows:
1) A well-designed information flow structure is constructed to better integrate high-level semantic features and low-level fine-grained details.
2) A lateral hierarchical optimization structure implemented by nearest neighbor classifiers is proposed to restore the missing parts of salient objects while refining their boundaries, yielding a more reliable final prediction.
3) A Lateral Hierarchically Refining Network (LHRNet) consisting of the two structures described above is put forward for accurate salient object detection. Apart from taking full advantage of abundant multi-level features, LHRNet can avoid, to some extent, the problems caused by adopting convolutional layers as classifiers after feature extraction. Comprehensive experiments demonstrate that LHRNet performs favorably against state-of-the-art approaches under different evaluation metrics on six datasets.
Salient object detection
Previous conventional methods [23–26] implement an unsupervised structure to segment salient regions from the background. These methods typically extract local pixel-wise or region-wise hand-crafted features such as RGB, grayscale and texture, and then compare these features with global features. For example, Yang et al. [24] rank the similarities of the image elements with foreground cues or background cues via graph-based manifold ranking. The saliency of an image element is defined based on its relevance to the given seed or query. Cheng et al. [25] introduce a regional contrast based salient object detection algorithm which simultaneously evaluates global contrast differences and spatially weighted coherence scores. However, these methods perform poorly when encountering salient objects with diverse scales, shapes and locations in complex scenes.
Deep convolutional neural networks (CNNs), which are trained end-to-end through back propagation, have made a major breakthrough in many computer vision tasks. Considerable research effort has gone into developing deep architectures for salient object detection. Wang et al. [1] construct two deep neural networks to integrate local pixel estimation and global proposal search. Li et al. [2] extract multi-scale features for each super-pixel and then append fully connected layers on top of the model to predict the saliency degree. Zhao et al. [3] design a deep model to extract multiple contexts, which are finally integrated into a multi-context deep learning framework for salient object detection. These models only adopt high-level features obtained by stacking convolutional layers and fully connected layers, disregarding the spatial information of input images. To deal with this problem, Lee et al. [4] encode a low-level distance map of super-pixels and high-level semantic features of CNNs, then combine them to predict the saliency of each super-pixel. Li et al. [6] propose two complementary components, a pixel-level fully convolutional stream and a segment-wise spatial pooling stream, to predict saliency maps. Apart from these methods, some works aggregate multi-level convolutional feature maps to detect salient regions. Zhang et al. [8] first integrate multi-level feature maps into multiple resolutions, then combine these maps to predict several saliency maps, which are fused to generate the final result. Zhang et al. [9] take into consideration information transmission from shallow side output layers to deep ones, and propose a bi-directional message passing model to integrate multi-level features. Luo et al. [11] combine local and global information through a multi-resolution 4x5 grid structure to promote features with strong local contrast. Hou et al. [13] introduce a series of short connections between shallower and deeper side output layers instead of directly connecting loss layers to the last layer of each stage.
Boundary refinement of salient object detection
The saliency maps generated by convolutional neural networks alone tend to have very blurred boundaries. The conditional random field (CRF) [10] has been used as a post-processing tool to optimize the boundaries of salient regions. CRF is an undirected graph model combining the characteristics of the maximum entropy model and the hidden Markov model. Besides, Liu et al. [5] first make a coarse global prediction by learning various global structured saliency cues, then construct a network to hierarchically refine the details of the saliency map step by step. Wang et al. [7] construct a deep recurrent network that incorporates a coarse predicted map as a saliency prior and then automatically learns to refine the map by iteratively correcting its previous errors. In addition, recent methods attempt to optimize the loss function to obtain finer boundaries. Luo et al. [11] add an extra loss function while training their network to penalize errors on the boundary. Cai et al. [35] construct a new topological metric space, with the implicit metric being determined by a deep network. However, all of the CNN-based methods mentioned above still adopt a single-scale convolutional layer with kernel size 1x1 or 3x3 to infer saliency maps, which cannot fundamentally solve the problems discussed in the introduction. Different from these methods, Zeng et al. [12] argue that a unified detector such as a convolutional layer can hardly detect all varieties of salient objects. Given an input image and a coarse saliency map produced by a previous method, Zeng et al. first construct a DNN as an embedding function and then learn a nearest neighbor classifier to refine the coarse saliency map, which can be considered a substitute for the conditional random field.
Lateral hierarchically refining network
The proposed Lateral Hierarchically Refining Network, dubbed LHRNet, restores the missing parts of salient objects while refining their boundaries step by step. Figure 3 illustrates the overall architecture of LHRNet. The structure of the proposed network is described in Section 3.1. The detailed formulas of the nearest neighbor classifier (NNC) for salient object detection are given in Section 3.2. Section 3.3 provides the loss function used to train LHRNet.

The architecture of the LHRNet. All additional convolutional layers in the information flow structure use 3x3 convolution kernels with padding. Each H*W*C annotation denotes the height, width and number of channels of the corresponding feature maps. For conciseness, ReLU layers and up-sampling operations are not drawn in this diagram. Up-sampling is used to match the spatial scales of different convolutional features. In practice, bilinear interpolation is used as the up-sampling function. First, a coarse saliency map is predicted in the deepest side output layer by a convolutional layer with kernel size 3x3. Then the map is refined step by step by learning a series of NNCs.
The architecture of LHRNet is built on the pre-trained VGG16 [36], which has been widely used as a feature extraction backbone in many deep learning models due to its simplicity and good generalization properties. VGG16 stacks fully connected layers to infer results for image classification, which is not suitable for dense prediction tasks, so all the fully connected layers and the last pooling layer are removed. The input image is resized to 256x256, and feature maps at five levels, Conv1_2, Conv2_2, Conv3_3, Conv4_3 and Conv5_3, are selected, which are denoted as f_i, i = 1, 2, 3, 4, 5. VGG16 appends a subsampling operation (a pooling layer) to each convolutional block, so the resolution of f_i is (256/2^(i-1)) x (256/2^(i-1)).
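As a quick check of these resolutions, the following sketch computes the spatial size of each selected feature map for a 256x256 input. It assumes the standard VGG16 layout, in which a 2x2 stride-2 max pooling sits between consecutive convolutional blocks and the pooling after the last block has been removed; the function name `vgg16_side_output_sizes` is hypothetical and not part of the original network code.

```python
def vgg16_side_output_sizes(input_size=256, num_levels=5):
    """Return (level, height, width) for the feature maps f_1 .. f_5.

    Assumption: standard VGG16, where each convolutional block keeps the
    spatial size (3x3 kernels with padding) and a 2x2 stride-2 pooling
    halves it between consecutive blocks.
    """
    sizes = []
    size = input_size
    for level in range(1, num_levels + 1):
        sizes.append((level, size, size))
        size //= 2  # pooling between block `level` and block `level + 1`
    return sizes


for level, height, width in vgg16_side_output_sizes():
    print(f"f_{level}: {height} x {width}")
```

Under these assumptions the deepest feature map f_5 is 16 times smaller than the input along each side, which is why the coarse prediction made from it has to be refined at higher resolutions.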
Visual context is essential for detecting salient regions with large variations in scale, shape and position. Szegedy et al. [31] introduce the Inception module, which contains multiple CNN branches with different convolutional kernels to extract features from different receptive fields. Inspired by Inception, its variants [32, 33] achieve competitive results in object detection and classification. Yu et al. [34] propose dilated convolution, which supports exponential expansion of the receptive field without loss of resolution or coverage. They apply four parallel dilated convolutions with different dilation rates to systematically aggregate multi-scale contextual information. Zhang et al. [9] design an MCFEM to capture multi-scale contextual information by stacking dilated convolutions that share the same 3x3 kernel size but use different dilation rates, set to 1, 3, 5 and 7. In the LHRNet, MCFEM is added to f_i in order to obtain abundant multi-scale information. This operation will generate feature maps
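To make the effect of the dilation rates concrete, the sketch below computes how large a region a 3x3 kernel covers at each rate, using the standard relation k_eff = k + (k - 1)(r - 1) for a kernel of size k and dilation rate r. The second helper shows the receptive field that would result if stride-1 dilated convolutions were applied one after another; whether the MCFEM branches are arranged this way is an assumption here, and both function names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Side length of the region covered by one dilated convolution."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of stride-1 dilated convolutions applied in sequence.

    Assumption: the layers are chained; for parallel branches, each branch
    simply has the receptive field given by effective_kernel_size.
    """
    receptive_field = 1
    for dilation in dilations:
        receptive_field += effective_kernel_size(kernel_size, dilation) - 1
    return receptive_field


rates = (1, 3, 5, 7)
for rate in rates:
    size = effective_kernel_size(3, rate)
    print(f"rate {rate}: covers {size} x {size}")
print("stacked:", stacked_receptive_field(3, rates))
```

The number of weights per kernel stays at 3x3 for every rate, so the context grows without the memory cost of a genuinely larger kernel.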
Given an input image I with its feature maps h and coarse saliency map S, the average feature values of the pixels in the foreground region and in the background region are extracted separately, yielding two vectors that serve as anchors. This process is performed as Eq. (4).

Visualization of adopting nearest neighbor classification to refine saliency maps. (a) Input image. (b) Feature maps. For conciseness, the structure that produces the feature maps is omitted. (c) and (d) represent the coarse foreground map and the coarse background map respectively; both are grayscale maps. Note that (d) = 255 - (c). (e) and (f) denote the foreground anchor and the background anchor. (g) denotes the refined saliency map. After the average feature values of the pixels in the foreground and in the background are extracted to obtain two anchor vectors, the Euclidean distances between each pixel and the two anchors are calculated separately. Finally, the softmax function is adopted to generate the refined saliency map.
The NNC has two advantages: a) NNC computes feature similarity based on a distance metric rather than on a nonlinear mapping by convolution kernels. In essence it is a non-parametric decision mechanism, which can greatly reduce the interference of differing data distributions and improve generalization. When optimizing the initial result, NNC calculates the feature distance between each pixel and each anchor. As long as the feature vector of a pixel is correctly encoded by the preceding information flow structure and lies close to its true anchor, the pixel can be correctly labeled. Therefore, the NNC can recover missed regions and obtain fine boundaries. b) NNC is based on global contrast. For each image, the anchors are generated by averaging features over the whole initial saliency map. The foreground anchor therefore encodes the central feature of the salient regions, while the background anchor encodes the central feature of the background. The saliency value of each pixel is determined by the feature similarity between the pixel and the anchors, so NNC relies on global contrast rather than local contrast. This decision mechanism is more in line with the nature of salient object detection, which should assign labels to pixels based on global context rather than just their neighborhood.
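The refinement step described above can be sketched in a few lines. This is a minimal illustration rather than the exact formulation of Eqs. (4) to (6): it assumes that the anchors are averages weighted by the coarse saliency S (foreground) and by 1 - S (background), and that the softmax is taken over the negative Euclidean distances so that the nearer anchor receives the higher probability. Pixels are stored as a flat list of feature vectors, and the names `weighted_mean` and `nnc_refine` are hypothetical.

```python
import math


def weighted_mean(features, weights):
    """Weighted average of feature vectors, used here as an anchor."""
    total = max(sum(weights), 1e-8)  # guard against an empty region
    dim = len(features[0])
    return [sum(w * f[c] for f, w in zip(features, weights)) / total
            for c in range(dim)]


def nnc_refine(features, coarse):
    """Refine a coarse saliency map with a two-anchor nearest neighbor rule.

    features: list of per-pixel feature vectors (H*W entries).
    coarse:   list of coarse saliency values in [0, 1], same length.
    """
    fg_anchor = weighted_mean(features, coarse)
    bg_anchor = weighted_mean(features, [1.0 - s for s in coarse])
    refined = []
    for f in features:
        d_fg = math.dist(f, fg_anchor)
        d_bg = math.dist(f, bg_anchor)
        # Softmax over (-d_fg, -d_bg), written in its logistic form.
        refined.append(1.0 / (1.0 + math.exp(d_fg - d_bg)))
    return refined


# Toy example: the second pixel looks like foreground in feature space
# but was missed by the coarse prediction.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
coarse = [0.9, 0.2, 0.1, 0.1]
print([round(s, 2) for s in nnc_refine(features, coarse)])
```

In this toy case the missed pixel is pulled back above 0.5 because its feature vector is close to the foreground anchor, which mirrors the claim that a correctly encoded pixel can be recovered wherever it lies in the image.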
The network now produces three refined saliency maps S1, S2, S3 and a coarse saliency map S4. Cross entropy (CE) is adopted to optimize the network. For S_k, the loss function is written as Eq. (7).
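For reference, a pixel-wise cross entropy over the four maps can be sketched as follows. The form used here is the standard binary cross entropy averaged over pixels; summing the four per-map losses with equal weights, and comparing every map with the ground truth at a common resolution, are assumptions of this sketch. The names `cross_entropy` and `total_loss` are hypothetical.

```python
import math


def cross_entropy(pred, gt, eps=1e-7):
    """Mean binary cross entropy between a saliency map and its ground truth.

    pred: list of predicted saliency values in [0, 1].
    gt:   list of ground-truth labels (0 or 1), same length.
    """
    total = 0.0
    for p, g in zip(pred, gt):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
    return total / len(pred)


def total_loss(side_outputs, gt):
    """Sum of the per-map losses for S1..S4 (equal weights assumed)."""
    return sum(cross_entropy(s, gt) for s in side_outputs)


gt = [1.0, 1.0, 0.0, 0.0]
s4 = [0.6, 0.4, 0.4, 0.3]   # coarse map
s1 = [0.9, 0.8, 0.1, 0.1]   # refined map
print(cross_entropy(s4, gt), cross_entropy(s1, gt))
```

Supervising every S_k, rather than only the final map, gives each refinement stage its own training signal.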
Experimental setup
1: Augment I by horizontal flipping and vertical flipping
2: last is the loss of the last iteration;
cur is the loss of the current iteration;
LHR denotes the proposed network;
L_CE denotes the cross entropy loss;
3: Initialize the model parameters of the first 13 convolutional layers θ1 by pre-trained VGG16 net and the model parameters of other convolutional layers θ2 by using truncated normal method, θ = θ1 ∪ θ2;
Initialize last = 0, cur = 0;
4:
5:
6: Obtain the corresponding predicted saliency maps P_i = LHR(I_i), where P_i includes {S1, S2, S3, S4};
7: Calculate the cross entropy loss L_CE(G_i, P_i);
8: Update the model parameters θ through back propagation;
9: cur ← cur + L_CE(G_i, P_i);
10: end for
11: if iter == 1 or cur < last
12: last = cur;
13: cur = 0;
14:
15: Break;
16:
17:
18:
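The stopping rule in the procedure above can be sketched as plain Python. Several steps of the listing are not recoverable here, so this sketch assumes that steps 4 and 5 open an outer loop over iterations and an inner loop over training pairs (I_i, G_i), and that step 14 is the else branch leading to the break. The callables `forward`, `loss_fn` and `update`, the `max_iters` cap and all function names are hypothetical placeholders, not the authors' implementation.

```python
def augment_pair(image, gt):
    """Step 1: add horizontally and vertically flipped copies.

    Images are nested lists (rows of pixels) in this sketch.
    """
    def hflip(m):
        return [row[::-1] for row in m]

    def vflip(m):
        return m[::-1]

    return [(image, gt), (hflip(image), hflip(gt)), (vflip(image), vflip(gt))]


def train_until_loss_stops_decreasing(samples, forward, loss_fn, update,
                                      max_iters=100):
    """Train until the accumulated loss of an iteration stops decreasing.

    Returns the accumulated loss of every iteration that was run.
    """
    last = 0.0
    history = []
    for iteration in range(1, max_iters + 1):
        cur = 0.0                      # accumulated loss of this iteration
        for image, gt in samples:
            preds = forward(image)     # {S1, S2, S3, S4} in the paper
            loss = loss_fn(gt, preds)
            update(loss)               # back propagation step
            cur += loss
        history.append(cur)
        if iteration == 1 or cur < last:
            last = cur
        else:
            break                      # loss no longer decreasing
    return history


# Toy run with a scripted loss sequence instead of a real network.
scripted = iter([3.0, 2.0, 1.0, 1.5, 0.5])
samples = [([[0, 1]], [[0, 1]])]
print(train_until_loss_stops_decreasing(
    samples, lambda x: x, lambda g, p: next(scripted), lambda l: None))
```

In the toy run the loop stops after the fourth iteration, the first one whose accumulated loss exceeds that of the previous iteration.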
First, the saliency maps S_k are compared with one another. Then LHRNet is compared with 11 state-of-the-art methods, including PiCANet [14], GBMPM [9], Amulet [8], DCL [6], DHS [5], DSS [13], ELD [4], NLDF [11], RFCN [7], SRM [15] and UCF [16]. For a fair comparison, the saliency maps of the different methods are either provided by the authors or obtained by running the released code.

Examples of saliency maps in different side output layers. S4 only roughly locates the salient regions and has blurred boundaries. Some tiny regions that were missing before, such as the legs of scolopendras, the feet of ants, and the horns of deer, are correctly segmented through successive refinements. This precision is not achieved by many previous deep learning models.
The maximum F-measure and MAE of S_k.
The performance of different methods under the metrics of maximum F-measure and MAE. The top three results are highlighted in red, green, and blue, respectively. LHRNet performs favorably against 11 state-of-the-art approaches in terms of maximum F-measure and MAE on six datasets.
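The two metrics can be sketched for a single image as follows. MAE is the mean absolute difference between the predicted map and the ground truth. For the F-measure, the weight β² = 0.3 and the sweep over 256 evenly spaced thresholds are common conventions in the salient object detection literature and are assumptions here, as is computing the maximum per image rather than over a dataset-level curve. The function names are hypothetical.

```python
def mae(pred, gt):
    """Mean absolute error between a saliency map and its ground truth."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def max_f_measure(pred, gt, beta2=0.3, num_thresholds=256):
    """Maximum F-measure over a sweep of binarization thresholds.

    pred: list of saliency values in [0, 1]; gt: list of 0/1 labels.
    beta2 = 0.3 is an assumed, conventional weighting of precision.
    """
    best = 0.0
    for t in range(num_thresholds):
        thr = t / (num_thresholds - 1)
        tp = sum(1 for p, g in zip(pred, gt) if p >= thr and g == 1)
        fp = sum(1 for p, g in zip(pred, gt) if p >= thr and g == 0)
        fn = sum(1 for p, g in zip(pred, gt) if p < thr and g == 1)
        if tp == 0:
            continue  # precision and recall are both zero or undefined
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall)
        best = max(best, f)
    return best


pred = [0.9, 0.8, 0.2, 0.1]
gt = [1, 1, 0, 0]
print(mae(pred, gt), max_f_measure(pred, gt))
```

A higher maximum F-measure and a lower MAE both indicate a better saliency map.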

PR curves on six datasets. LHRNet performs better than the other 11 state-of-the-art methods.

Qualitative comparison of the LHRNet and state-of-the-art algorithms.
In this paper, a Lateral Hierarchically Refining Network (LHRNet) is put forward for accurate salient object detection. Apart from taking full advantage of abundant multi-level features, LHRNet can avoid, to some extent, the problems caused by adopting convolutional layers as classifiers after feature extraction, such as insufficient generalization due to kernel parameters being fixed after training, the limited receptive field and the lack of spatial correlation. Comprehensive experiments demonstrate that the proposed LHRNet performs favorably against state-of-the-art approaches under different evaluation metrics on six datasets.
Acknowledgments
This research was supported by National Key R&D Program of China (no. 2017YFC0806000) and National Natural Science Foundation of China (Grant Nos. 11627802, 51678249).
