Abstract
Although convolutional neural networks (CNNs) lead the way in semantic segmentation, standard methods still have two flaws. First, they suffer from feature redundancy and weakly discriminative feature representations. Second, the number of effective multi-scale features is limited. In this paper, we address these limitations with a network that uses two effective pre-trained models as an encoder. We develop a cross-form attention pyramid that acquires semantically rich multi-scale information from local and global priors. A spatial-wise attention module is introduced to further enhance the segmentation results: it highlights the more discriminative regions of low-level features to focus on significant location information. We demonstrate the efficacy of the proposed network on three datasets: IDD Lite, PASCAL VOC 2012, and CamVid. Our model achieves an mIoU of 70.7% on IDD Lite, 83.98% on PASCAL VOC 2012, and 73.8% on CamVid.
Introduction
Since the beginning of the digital image processing field, image segmentation has been a major issue in computer vision. Semantic segmentation assigns per-pixel labels to object categories in a given image, which is a difficult process that requires precise prediction of object category, position, and shape. Due to the rich semantic and spatial information it provides, semantic segmentation has become a fundamental component of many visual understanding systems. It has a large impact on a diverse range of intelligent applications such as video surveillance, urban planning, traffic analysis, autonomous driving, augmented reality, and robot sensing. Many image segmentation methods have been developed in the past, from the simplest, such as thresholding, k-means clustering, and watershed methods, to more sophisticated ones, such as graph cuts and conditional and Markov random fields.
Recently, deep convolutional neural networks (CNNs) have been widely used in semantic segmentation for their powerful feature extraction, representation learning, and end-to-end trainability. In particular, fully convolutional networks (FCNs) [27] have become the foundation for CNN-based semantic segmentation, and many state-of-the-art (SOTA) methods have since been built on FCN. In many of these methods, the backbone is taken from a classification model such as VGGNet [33], ResNet [17], DenseNet [19], or Xception [8]. These classification models extract similar features several times at different levels of the network, resulting in redundant information flow. Furthermore, previous approaches leave open how to efficiently extract multi-scale information and how to focus on crucial channel and spatial information to improve the discriminative ability of the learned features.
Several approaches such as pyramid pooling [48] and dilated convolution [4,5,7] have been introduced to extract multi-scale information. Furthermore, encoder-decoder architectures [2,31] have been proposed to capture objects of varying scale by combining low-level and high-level semantic features. However, these approaches pay little attention to the interdependence between distinct channels in multi-scale features. As a result, the model overlooks the difference between local representation and contextual dependencies for various categories. Modeling the dependencies between channel maps can significantly improve how well the feature maps reflect specific semantics.
In response to the problems discussed above, we propose an encoder-decoder architecture named cross-form efficient attention pyramidal network (CEAPNet) for semantic segmentation. Unlike previous studies that use a single pre-trained model for the encoder, we combine two effective pre-trained models as the encoder backbone: a deep residual network (ResNet) [17] and an efficient channel attention network (ECANet) [36]. The ResNet architecture is deep enough to extract semantic information while still managing the vanishing gradient problem, and ECANet models local cross-channel dependencies, providing rich, informative features. Further, we propose an enhanced spatial pyramid module named context-aware cross-form attention pyramid (CCFAP) that combines multi-scale extraction and channel attention to generate effective contextual features at different scales. Generally, spatial pyramid modules [5,48] extract multi-scale features using either parallel dilated convolutions or pooling layers. Dilated convolution uses local information, whereas pooling uses global information to obtain multi-scale features. Our spatial pyramid module is extended to a cross-form level with four parallel branches, each including a dilated convolution layer and an average pooling layer in parallel. Thus, multi-scale feature extraction through the local-to-global pyramid aggregates both global and local information. Moreover, we employ an efficient channel-wise attention (ECA) strategy to model the channel interdependencies of multi-scale features. Unlike existing channel attention [18,38], ECA captures local cross-channel interactions through a fast 1D convolution without dimensionality reduction (DR). Similar to DeeplabV3+ [7], we fuse low-level and high-level features in the decoder. Before the fusion, we use spatial-wise attention to filter the low-level features using spatial information.
As a result of processing through distinct attention mechanisms, the network has improved channel and spatial feature dependencies and has obtained more accurate results. The following is a summary of our major contributions:
(i) We propose a cross-form efficient attention pyramidal network (CEAPNet) that processes the encoded high-level and low-level features to obtain the segmentation prediction. As an encoder, it uses two pre-trained models: ResNet 101, winner of the ILSVRC 2015 competition, and ECANet. (ii) For high-level features, a context-aware cross-form pyramid module is proposed to extract more effective multi-scale features. Moreover, channel attention is employed to fully utilize the semantic information of multi-scale features by giving more weight to the more informative channels. (iii) For low-level features, we apply spatial-wise attention to obtain more effective location information, which we then combine with the high-level multi-scale features to ensure precise segmentation results. (iv) We validate the proposed network on three datasets: PASCAL VOC 2012, IDD Lite, and CamVid.
Related works
In this section, we will review several studies on image segmentation that are closely related to our work based on the effectiveness of the multi-scale context and the attention mechanism.
Multi-scale feature fusion
The first way is to combine features from several scales using deep neural networks. For this, encoder-decoder networks have proven to be effective. FCN [27] and U-Net [31] have utilized skip connections to gradually concatenate the decoder and corresponding encoder features from multiple scales. Exfuse [46] uses enhanced feature fusion to lessen the gap between encoder and decoder features. To improve the multi-scale feature learning capability of U-Net, FCDenseNet [21] and GC-DenseNet [34] replace the original blocks with densely connected blocks. UNet
However, with only this encoder-decoder structure, it is difficult to interpret visual context and multi-scale information adequately.
Multi-scale context extraction
Another option is to use a module to collect multi-scale context from the same feature maps. An image pyramid is a simple method of obtaining multi-scale image representations by feeding several resized input images into the same model and then fusing the results [12,44]. However, the image pyramid approaches are inefficient in particular due to the additional time spent in training and inference. Some works [1,5,7,30] have exploited dilated convolutions with various rates to embed multi-scale contexts. Meanwhile, PSPNet [48] has adopted a pyramid pooling module to capture contextual data at various scales for object recognition. Furthermore, RefineNet [26] and ICNet [47] both combine the feature maps generated from the input images of multiple resolutions for multi-scale information. By concatenating residual feature maps extracted from consecutive convolution blocks, ESSN [22] is capable of generating multi-scale information. CE-Net [14] utilizes a context extractor module that consists of a multi-kernel pooling and dense atrous convolution block to obtain wider context and global information, respectively.
These solutions can solve the issue of scale variations of objects in complicated scenarios to some extent, but they are less effective due to a lack of understanding of contextual dependencies.
Multi-scale attention mechanism
A multi-scale network with an attention mechanism not only obtains a larger receptive field but also helps the model retain important information at different scales. Several studies [9,11,23,25,39,40,49] have exploited the attention mechanism in different ways to achieve remarkable performance. Specifically, Chen et al. [6] use an attention mechanism to select more discriminative multi-scale features and predict each pixel in a single image individually. Zhang et al. [45] propose an attention-directed network that uses channel and spatial attention for saliency detection. The most recent non-local/self-attention approaches connect each pair of pixels to enhance context aggregation [16,37,43]. By summing the two attention modules from distinct branches, DANet [13] adaptively merges local features with their global dependencies. CCNet [20] introduces recurrent criss-cross attention that gathers context information from long-range dependencies more efficiently and effectively, thus reducing the computational complexity of self-attention. However, these approaches are usually memory- and compute-intensive because of the large number of pixels and the connections between them. EMANet [24] rethinks the attention mechanism from the perspective of the expectation-maximization (EM) algorithm, computing the attention map iteratively in the same way that the EM algorithm does. EAPNet [41] uses an efficient channel attention pyramid module for high-level features, which acquires effective channel attention maps by skipping dimensionality reduction, and a residual attention fusion block for low-level features to improve performance.
Different from the above-discussed papers, we use an attention mechanism not only in the multi-scale feature extraction module but also in the encoder to enhance the feature representations.
Proposed architecture
For semantic segmentation, researchers have preferred encoder-decoder architectures with pre-trained models as the encoder. We likewise propose an encoder-decoder architecture, named CEAPNet, for semantic segmentation. The overall network is illustrated in Fig. 1. Our proposed network has three main parts: (i) a combined pre-trained encoder for feature extraction, (ii) a context-aware cross-form attention pyramid (CCFAP) module for multi-scale feature extraction through global and local pyramids, and (iii) a decoder. The following sub-sections discuss each part in detail.

Overview of our proposed CEAPNet.
We employ a pre-trained model for feature extraction in the encoder rather than training from scratch. The literature offers various models pre-trained on the large ImageNet dataset. Previous works such as [1,5,7,13,22,34,35,41,42,48] have used a single pre-trained model as an encoder, whereas our proposed architecture combines two pre-trained models: residual network (ResNet) [17] and efficient channel attention network (ECANet) [36]. The ResNet architecture has a depth of 101 layers, so the pre-trained model is referred to as ECA-ResNet 101, as shown in Fig. 1. Generally, the final feature responses of deep CNNs used for image classification have an output stride of 32, i.e. the features are 32 times smaller than the original input. Such low-resolution feature maps are not very useful for semantic segmentation. We therefore change the stride of a convolution layer from 2 to 1 in the last encoder block and adopt

Pre-trained architecture with residual units and the size of their outputs. (a) ResNet 101: going deeper without dilated convolution in the last block, (b) ECA-ResNet 101: going deeper with dilated convolution and setting dilation rates to 2, 4, 2 for an output stride = 16. Key: the notation

Basic building block for, (a) ResNet 101 model, (b) ECA-ResNet 101 model.

Structure of efficient channel attention (ECA) block.
In the research community, there is a growing consensus that deeper networks are better at learning complicated visual representations, which could lead to greater performance. However, simply stacking layers on top of each other does not improve performance, because the familiar vanishing gradient problem makes deep networks challenging to train. As the network grows deeper, its performance saturates and can even deteriorate severely. He et al. [17] proposed a solution to this problem by introducing residual blocks, which allow training hundreds or even thousands of layers while still achieving excellent results. The main idea is to introduce an “identity shortcut” or “skip connection” that bypasses one or more layers, as indicated in Fig. 3. With ResNets, the gradients can travel directly from later layers back to earlier layers through the skip connections. Without the skip connection, the input ‘x’ is multiplied by the weights (w) of the layer, a bias term (b) is added, and an activation function is applied. Mathematically, we can represent it as follows:
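The equations themselves are not reproduced in this text. A minimal reconstruction from the verbal description above, using the standard residual formulation of He et al. [17], might read as follows. The symbols $\sigma$ (activation function), $F$ (the stacked layers that the shortcut bypasses), and $H$ (the mapping computed by the unit) are notation assumed here and may differ from the original.

```latex
\begin{align}
  H(x) &= \sigma(w x + b)
    && \text{(plain layer, no skip connection)} \\
  H(x) &= F(x) + x
    && \text{(residual unit with identity shortcut)} \\
  \frac{\partial H}{\partial x} &= \frac{\partial F}{\partial x} + 1
    && \text{(the identity term lets gradients bypass } F\text{)}
\end{align}
```

The last line is the reason the shortcut helps: even when the gradient through $F$ becomes very small, the identity term still carries the gradient back to earlier layers.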
ECANet architecture
The channel-wise attention mechanism has recently been proven to hold a lot of promise for improving deep CNN performance [18,29,36,38]. To enhance the characterization capability of feature maps at different levels of the encoder, ECANet [36] introduces efficient channel attention (ECA) block which can be integrated with any model without increasing the model complexity, see Fig. 3(b). For convenience, we consider high-level features of a ResNet block as
The channel dimension C is proportional to the kernel size k of
Context-aware cross-form attention pyramid module
The main purpose of this module is to capture efficient multi-scale features through global and local information to identify objects of varied sizes. After the last encoder block, many works have utilized different pyramid methods [5,7,48] which either focus on dilated convolutions or pooling operations for multi-scale feature extraction. On the other hand, our suggested pyramid module integrates both strategies as shown in Fig. 5. The proposed cross-form attention pyramid is made up of four parallel cross-form branches that correspond to the four levels of the pyramid. Every branch is made up of two parallel sub-branches namely dilated sub-branch and pooling sub-branch, and the channel-wise attention mechanism.
Dilated sub-branch: It applies a
Pooling sub-branch: It consists of an average pooling layer with corresponding pyramid scales. Similar to dilated branch, we apply a
Channel-wise attention mechanism: After merging the outputs of both sub-branches, the ECA block is used to rescale important feature maps and discard unnecessary ones.
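The channel rescaling performed by the ECA block can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the implementation used in the paper: the function names are hypothetical, the 1D convolution weights would be learned in practice, and the adaptive kernel-size rule with `gamma=2`, `b=1` is taken from the defaults of the ECA-Net paper [36], which this work is assumed to follow.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size: nearest odd value of (log2(C) + b) / gamma.

    gamma and b are the ECA-Net defaults (an assumption for this paper).
    """
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1


def eca(feature_map, weights):
    """Rescale each channel of a C x H x W nested list by an attention weight.

    weights: the k weights of the 1D convolution (hypothetical values here).
    """
    k = len(weights)
    pad = k // 2
    # Global average pooling: one scalar descriptor per channel.
    desc = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]
    # 1D convolution across the channel axis (zero padded), so each channel
    # interacts with its k nearest neighbours and no dimensionality
    # reduction is involved.
    padded = [0.0] * pad + desc + [0.0] * pad
    attn = []
    for c in range(len(desc)):
        s = sum(weights[j] * padded[c + j] for j in range(k))
        attn.append(1.0 / (1.0 + math.exp(-s)))  # sigmoid gate
    # Channel-wise rescaling of the input.
    out = [[[v * attn[c] for v in row] for row in ch]
           for c, ch in enumerate(feature_map)]
    return out, attn


if __name__ == "__main__":
    fmap = [[[0.0, 0.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]
    _, attention = eca(fmap, weights=[0.0, 1.0, 0.0])
    print(attention)             # one gate value per channel
    print(eca_kernel_size(256))  # 5
```

Channels with a low gate value are down-weighted rather than removed outright, which is how "discarding" unnecessary feature maps is realized in this kind of soft attention.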

Structure of context-aware cross-form attention pyramid (CCFAP) module.
The dilated sub-branch allows us to extract dense features at multiple scales without increasing the number of parameters. This is done by inserting holes (zeros) into the filter’s weight matrix, which enlarges the filter’s field of view. In 2D, the dilated convolution can be computed as follows [5]:
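As an illustration of the operation, the sketch below implements a 2D dilated convolution in plain Python, following the usual definition in [5] where each output is the sum of x[i + r·m][j + r·n]·w[m][n] over the kernel indices, with r the dilation rate. The function names are hypothetical, no padding is applied, and the 3×3 kernel size used in the example is an assumption, since the kernel size of the dilated sub-branch is not recoverable from this text.

```python
def effective_kernel_size(k, rate):
    """Field of view of a k-tap filter with (rate - 1) holes between taps."""
    return k + (k - 1) * (rate - 1)


def dilated_conv2d(x, w, rate):
    """'Valid' 2D dilated convolution on nested lists (no padding).

    Uses the cross-correlation convention common in deep learning:
    y[i][j] = sum over m, n of x[i + rate*m][j + rate*n] * w[m][n]
    """
    kh, kw = len(w), len(w[0])
    eh = effective_kernel_size(kh, rate)
    ew = effective_kernel_size(kw, rate)
    height, width = len(x), len(x[0])
    out = []
    for i in range(height - eh + 1):
        row = []
        for j in range(width - ew + 1):
            row.append(sum(x[i + rate * m][j + rate * n] * w[m][n]
                           for m in range(kh) for n in range(kw)))
        out.append(row)
    return out


if __name__ == "__main__":
    # The parameter count stays at 3 x 3 = 9 while the field of view grows
    # with the rate (rates taken from the ablation study in this paper).
    for r in (3, 6, 12, 18):
        print(r, effective_kernel_size(3, r))  # 7, 13, 25, 37
```

With rate 1 the function reduces to an ordinary convolution, which makes the "holes" interpretation easy to check on small inputs.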
The main purpose of the decoder is to restore the resolution of the down-sampled feature maps. We use a simple decoder structure like that of DeeplabV3+ [7] with an additional spatial attention strategy, as shown in Fig. 1. Similar to DeeplabV3+, the high-level features obtained from the above-mentioned pyramidal module are upsampled by a factor of four using bilinear interpolation and then merged with the corresponding low-level features of the encoder. Before that, these low-level features are passed through

Design of spatial attention (SA) block.
Since low-level features carry a lot of location information, it is best to concentrate on the salient regions that help to locate the object and identify the target structure. For this, we apply spatial-wise attention to the low-level features, focusing on ‘where’ the important regions for prediction are. As illustrated in Fig. 6, we first perform max pooling, average pooling, and
We aggregate input tensor
Experiments
We perform extensive experiments on three demanding semantic segmentation and scene parsing datasets including IDD Lite [28], PASCAL VOC 2012 [10], and CamVid [3] to validate our proposed architecture. We begin with the description of the datasets in Section 4.1, followed by a description of the implementation in Section 4.2. Thereafter, we conduct many ablation experiments on the IDD Lite validation set to demonstrate the efficacy of every module of our proposed design in Section 4.3. Finally, in subsequent sections, we compare our method to SOTA models in terms of mean intersection over union (mIoU) on all three datasets.
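For reference, mIoU can be computed from per-class intersection and union counts. The sketch below is a minimal plain-Python version on flattened label sequences. The function name, the optional `ignore_index` argument, and the choice to average only over classes that occur in the prediction or the ground truth are assumptions for illustration and may differ from the evaluation protocol used for the reported numbers.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean intersection over union for flattened integer label sequences."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel, excluded from the score
        if p == t:
            inter[t] += 1  # true positive for class t
            union[t] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[t] += 1  # false negative for the true class
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Class 0: IoU = 1/2, class 1: IoU = 2/3, mean = 7/12
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

In benchmark evaluation the intersection and union counts are usually accumulated over the whole validation or test set before the per-class ratios are taken, rather than averaged per image.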
Datasets
IDD Lite: It is an unstructured road scene dataset that accurately reflects the road infrastructure of Asian countries. It includes 1404 training images, 204 validation images, and 408 test images. Only training and validation images have their respective ground truths. The original images have a resolution of
PASCAL VOC 2012: It is a semantic segmentation benchmark dataset that originally included 1,464 images for training, 1,449 for validation, and 1,456 for testing. It has a total of 20 foreground object classes and one background class. We use the augmentation provided by [15] to expand the training set to 10,582 images. During training, we use a crop size of
CamVid: This is another challenging urban street scene dataset in the domain of autonomous driving. It contains a total of 701 images, including 367 for training, 101 for validation, and 233 for testing. It has 11 classes such as building, car, tree, fence, and sky, and the original image resolution is
Implementation details
All the experiments are executed in the PyTorch framework on a 16GB NVIDIA Tesla P100 GPU. We use ECA-ResNet 101 pre-trained on the ImageNet dataset as the starting point for all our networks. As discussed in Section 3, we modify the backbone network so that output features are
Ablation experiments
Ablation for context-aware cross-form attention pyramid module
The first experiment was conducted to validate the effectiveness of the proposed CCFAP block by comparing it with different pyramid methods. For a fair comparison, we derive three spatial pyramid structures from the CCFAP module, following [5], [48], and [4], respectively, while keeping the remaining parameters constant: modified ASPP (mASPP) deletes all pooling branches; modified Spatial Pyramid Pooling (mSPP) deletes all dilated branches; and modified Atrous Image Cascade (mAtIC) connects the dilated and pooling branches sequentially. The outcomes of the experiments are shown in Table 1. The results demonstrate that the CCFAP module outperforms the other methods. Compared to mSPP and mASPP, CCFAP improves the performance by 2% and 1.2% mIoU, respectively. Although mAtIC uses both operations in series, it does not perform well. A possible reason is that the pooling operation is applied after the dilated convolution rather than on the input itself, so it cannot add global information about the original input. In our module, the different sizes of objects and the relationship between an object (e.g. car) and its sub-regions (e.g. windows, tires, etc.) are learned effectively. This is due to the inclusion of both the ASPP and SPP structures, which helps to extract rich multi-scale features from local and global information. It can also be seen in Fig. 7 that global information captures the context of the whole image, allowing CCFAP to correctly classify the ‘drivable’ and ‘roadside object’ classes. Also, vehicles and riders of different sizes are clearer in the segmentation results of the CCFAP module than in those of the other methods.
Comparison of different spatial pyramid methods on IDD lite validation set

Segmentation results of different pyramid methods on IDD lite validation set. Here, the first row represents the ground truths and the remaining rows represent the predicted outputs of specified bounding box areas (in the first row) for compared methods.
We further run several experiments to study how the dilation rates and pooling scales affect the CCFAP module. Different combinations of dilation rates and pooling scales are applied, and the results are recorded in Table 2. We settle on dilation rates of 3, 6, 12, 18 and corresponding pooling scales of 4, 8, 16, 32, which produce the best results for our network (we explored a variety of dilation rates and pooling scales, but only the most accurate are reported).
In this section, we investigate the importance of channel attention and spatial attention for the segmentation performance of the proposed network. Specifically, we use ECA for the high-level features in the CCFAP module and spatial attention for the low-level features. The results are reported in Table 3. They demonstrate that adding channel attention to the pyramid structure improves performance by 2.5% compared to the baseline. We may conclude that the higher-level features of certain channels have a significant impact on segmentation results and that channel attention amplifies this impact, hence boosting network performance. When the SA block is added, the mIoU increases by 1.1%, from 67.4% to 68.5%. Spatial attention tells the network which regions to focus on and which to suppress. Thus, it improves the low-level feature representations, which have an abundance of spatial information. By visually examining the segmentation results in Fig. 8, we can see that the attention modules significantly reduce the misclassification of pixels.
Results of ablation study on IDD lite validation set for different pooling scales and dilation rates
Results of ablation study on IDD lite validation set for the effect of attention mechanisms on network performance

Segmentation results with or without attention mechanisms on IDD lite validation set.
Further, we have conducted several experiments to choose the kernel size of spatial attention, and results are recorded in Table 4. We steadily increased the kernel size and discovered that performance improved as the kernel size rose to 9. Further increment degraded the performance, so we settled on 9 as kernel size which achieved the best result.
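To make the role of the kernel size concrete, the sketch below shows a simplified, CBAM-style [38] spatial attention in plain Python: channel-wise max and average maps are filtered by a single k×k convolution and passed through a sigmoid to give one weight per location. This is an assumption-laden illustration, not the SA block of this paper, whose description names a third operation next to max and average pooling that is not recoverable from this text. The function name and the kernel weights are hypothetical.

```python
import math


def spatial_attention(feature_map, kernel, bias=0.0):
    """Simplified CBAM-style spatial attention on a C x H x W nested list.

    kernel: 2 x k x k weights (hypothetical; learned in practice), one
    k x k slice for the max map and one for the average map.
    """
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    k = len(kernel[0])
    pad = k // 2
    # Pool across the channel axis: two H x W descriptor maps.
    max_map = [[max(feature_map[c][i][j] for c in range(channels))
                for j in range(width)] for i in range(height)]
    avg_map = [[sum(feature_map[c][i][j] for c in range(channels)) / channels
                for j in range(width)] for i in range(height)]
    pooled = [max_map, avg_map]
    attn = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            s = bias
            # k x k convolution with zero padding ('same' output size);
            # a larger k lets each location see a wider neighbourhood.
            for p in range(2):
                for m in range(k):
                    for n in range(k):
                        y, x = i + m - pad, j + n - pad
                        if 0 <= y < height and 0 <= x < width:
                            s += kernel[p][m][n] * pooled[p][y][x]
            attn[i][j] = 1.0 / (1.0 + math.exp(-s))  # sigmoid gate
    # Every channel is rescaled by the same spatial weight map.
    out = [[[feature_map[c][i][j] * attn[i][j] for j in range(width)]
            for i in range(height)] for c in range(channels)]
    return out, attn
```

A kernel size of 9, the value selected in the ablation above, corresponds to passing a 2×9×9 weight tensor as `kernel`.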
Results of ablation study on IDD lite validation set for kernel size of spatial attention
In this section, we compare the ECA block with three attention modules: squeeze-and-excitation (SE) [18], convolutional block attention module (CBAM) [38], and triplet attention (TA) [29]. The mIoU results are reported in Table 5. Although the SE and CBAM blocks provide significant performance improvements, they do not account for efficient cross-dimension interaction. Both approaches first project channel descriptors into a low-dimensional space before mapping them back into a high-dimensional space, which causes a loss of inter-channel interaction. On the other hand, TA incorporates cross-dimensional interaction without any dimensionality reduction but has more parameters than ECA. As discussed in Section 3.1.2, ECA not only captures cross-dimensional interdependence and interactions of features for rich feature representations but also avoids channel dimensionality reduction without increasing the parameters. It has utilized
Comparison of different channel methods on IDD lite validation set. (Here, C: channel dimension, r: reduction ratio, and k: kernel size)
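The difference in parameter count between a reduction-based block and ECA can be checked with simple arithmetic. The sketch below counts only the attention weights, assuming an SE block built from two bias-free fully connected layers (C → C/r → C) and an ECA block built from a single bias-free 1D convolution of size k. The function names, the reduction ratio r = 16, and the example values C = 2048 and k = 7 are illustrative assumptions, not figures taken from the paper's table.

```python
def se_params(channels, reduction=16):
    """Weights of two bias-free fully connected layers: C -> C/r -> C."""
    hidden = channels // reduction
    return channels * hidden + hidden * channels  # equals 2 * C^2 / r


def eca_params(kernel_size):
    """Weights of one bias-free 1D convolution across the channel axis."""
    return kernel_size


if __name__ == "__main__":
    c, r, k = 2048, 16, 7
    print(se_params(c, r))  # 524288
    print(eca_params(k))    # 7
```

The SE count grows quadratically with the channel dimension, whereas the ECA count depends only on k, which is why the ECA block adds almost nothing to the model size.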

Comparison of segmentation results of different models on IDD lite validation set. Specific areas are highlighted in bounding boxes for better comparison.
The proposed network is compared with several SOTA models on the IDD Lite dataset. All of these models were originally benchmarked on other datasets, so we re-evaluated them on IDD Lite and report the validation-set results in Table 6. Our proposed model yields 70.7% mIoU, which is 5.1% and 4.6% higher than DRN DeepLabV3+ [1] and ERFNet [30], respectively. Previous works have mainly used a single pre-trained architecture as the backbone for feature extraction, whereas CEAPNet exploits two effective pre-trained models, which gives it an advantage. Among previous works, DANet [13], DF-DAM [39], and GLNet [25] adopt attention mechanisms to improve performance, but these are not very effective. Although EAPNet [41] adopts an efficient channel attention mechanism, the absence of global information during multi-scale feature extraction makes it inferior to our proposed model by 2.6% mIoU. For visual comparison, we consider only the best two models among previous works; the results are shown in Fig. 9. As can be seen from the figure, the CEAPNet predictions are more detailed than those of GLNet and EAPNet. For example, light poles and small vehicles are more visible in the segmentation results of CEAPNet. The other two models confuse roadside objects and distant objects, whereas CEAPNet distinguishes between them more clearly.
Results on the IDD lite validation set
We also run experiments on the PASCAL VOC 2012 dataset to verify that the proposed CEAPNet generalizes. Comparisons with previous SOTA models on the validation set are recorded in Table 7. As can be observed, CEAPNet produces results that are comparable to previous attention-based models including DF-DAM [39], DANet [13], and EAPNet [41]. We also compare our model with other models such as HFCN [42], DeeplabV3 [5], DeeplabV3+ [7], Exfuse [46], and APCNet [16]. The results show that our model obtains the best performance of 83.98% mIoU without pre-training on the MS-COCO dataset and outperforms the DeeplabV3+ baseline by 4.13% mIoU. We also compare the segmentation results of the best two SOTA models with those of CEAPNet in Fig. 10. It can be seen that CEAPNet segments the target objects more clearly than APCNet and EAPNet. It classifies objects such as person, horse, sofa, and sheep with better edge segmentation.
Results on the Pascal VOC 2012 validation set

Comparison of segmentation results of different models on Pascal VOC 2012 validation set.
Results on the CamVid test set
We also evaluated CEAPNet on another challenging road scene dataset, CamVid, to further examine the validity and reliability of the proposed model. Our model is compared with several SOTA models including ESSN [22], Dense Decoder [2], GC-DenseNet [34], and FC-DenseNet [21], as shown in Table 8. Compared with these models, our proposed method achieves the best performance of 73.8% mIoU. It also outperforms attention-based methods including PPANet [9], AGLNet [49], and Fan et al. [11] by 3.7%, 4.4%, and 13.7% mIoU, respectively. Note that some of the approaches in Table 8 make use of high-resolution images, which would be expected to further improve our results as well. From Fig. 11, we can see that our model improves accuracy in some important categories such as sidewalk and car, and in small categories such as sign, pedestrian, and bicyclist.

Comparison of segmentation results of different models on the CamVid test set. Specific areas are highlighted in bounding boxes for better comparison.
In this paper, we proposed the CEAPNet segmentation model, which integrates two effective pre-trained models, ResNet and ECANet, into the encoder and channel attention into the pyramid module. Using two pre-trained models as the backbone for feature extraction has a positive impact on segmentation performance. Unlike other works that use pyramid structures for multi-scale feature extraction without giving importance to relevant features, our pyramid structure with channel attention further improves performance. Additionally, spatial attention complements the low-level features before their fusion with the high-level features. Finally, experiments show that our proposed network achieves 70.7%, 83.98%, and 73.8% mIoU on the IDD Lite, PASCAL VOC 2012, and CamVid datasets, respectively. In the future, we plan to combine efficient channel attention with alternative CNN architectures to improve semantic segmentation accuracy.
Acknowledgements
The authors would like to acknowledge the University Grants Commission (UGC) for fellowship throughout this research work.
