Abstract
Urban street scene analysis is an important problem in computer vision, and many off-line models achieve outstanding semantic segmentation results. However, it remains an ongoing challenge for the research community to develop and optimize deep neural architectures that meet real-time, low-compute requirements whilst maintaining good performance. Balancing model complexity against performance has been a major hurdle: many models lose too much accuracy for a slight reduction in model size and are unable to handle high-resolution input images. This study aims to address this issue with a novel model, named M2FANet, that provides a much better balance between efficiency and accuracy for scene segmentation than other alternatives. The proposed optimised backbone increases the model’s efficiency, whereas the proposed Multi-level Multi-path (M2) feature aggregation approach enhances the model’s performance in a real-time environment. By exploiting a multi-feature scaling technique, M2FANet produces state-of-the-art results in resource-constrained situations while handling the full input resolution. On the Cityscapes benchmark data set, the proposed model achieves 68.5% and 68.3% class accuracy on the validation and test sets respectively, whilst having only 1.3 million parameters. Compared with all real-time models of fewer than 5 million parameters, the proposed model is the most competitive in both performance and real-time capability.

Introduction
Semantic scene segmentation is an important task in scene analysis wherein each pixel of an input image is assigned a class label. These labels or classes could include a wide range of objects, such as people, car, table, building, train, etc. depending on the data sets. This study focuses on urban street images which contain objects relevant to outdoor street scenes, such as traffic light, traffic sign, car, train, rider, etc. As with any research in this area, segmentation of partially occluded objects will be limited to the parts of these objects visible in the image.
Over the last decade, extensive research has been conducted in this field, and semantic segmentation has been shown to play a key role in the robotics, video surveillance and autonomous driving industries. It helps a machine segment an image automatically, but how efficiently and accurately the machine can perform this task is always a critical question. Scene segmentation models can be divided into two categories: off-line models [4, 5, 6, 32, 46] that require a large number of parameters and real-time models [24, 27, 37, 38, 41, 45] that have far fewer parameters. Real-time models are therefore more suitable for resource-constrained devices and are the main focus of our work. Developing real-time models that achieve good segmentation performance remains a research challenge, which we attempt to address in this work.
Previous studies have shown that deep convolutional neural networks (DCNNs) are widely used in various domains such as classification [20, 34], object detection [35], and instance and semantic segmentation [1, 14, 18, 21, 32]. Owing to their robustness and their ability to handle rich semantic features, they often produce promising results on various publicly available benchmarks [2, 8], mostly under an offline setting with very expensive GPU computing clusters. However, achieving good performance in a real-time environment, especially on resource-constrained devices, is still a big challenge. Over the last decade, several real-time semantic segmentation models [23, 24, 41, 45] have been developed to address semantic segmentation on embedded computer vision devices, but their inability to handle high-resolution input images hampers model performance. High-resolution input demands large resources if the model is too deep. To maintain a balance between model complexity and input resolution, the authors of [26, 41, 45] introduce a multi-branch approach in which the deep branch captures rich contextual information whereas the shallow branch retains boundary and edge details. Nevertheless, mutually independent branches hardly contribute to the learning ability of the model, and adding a shallow branch that takes a high-resolution input image slows the model down. To increase inference speed, some studies [1, 24] prune redundant channels from the model architecture. Such an approach boosts the inference speed of the model at the cost of the model’s performance.
For semantic segmentation, rich contextual details along the spatial and channel dimensions are important. This contextual information can be extracted by a series of convolution layers. The common approach is to adopt a popular DCNN model as a feature extractor. Many segmentation models use ResNet [36] as a backbone due to its high scalability and robustness. Moreover, the weights of ResNet models trained on the ImageNet [9] dataset are readily available on the web and can boost segmentation performance. Due to its popularity in the field of image classification, several variants of the ResNet model have been proposed [39, 44]. However, as the depth and width of the model increase, the number of parameters and floating point operations (FLOPs) also increases drastically, which makes the model incapable of handling high-resolution input in real-time environments. Various off-line segmentation models, such as OCR [42] and DeepLab (all variants) [4, 5], use ResNet as an encoder. For real-time execution, however, a segmentation model needs a lightweight backbone to reduce the computational cost and memory usage. The MobileNet architecture [13, 30] fulfills this requirement and is thus suitable for embedded devices with low hardware specifications. The optimised structure of MobileNet’s residual blocks results in fewer parameters and FLOPs, and hence less memory usage. For that reason, many real-time scene segmentation models, such as ContextNet [26], FAST-SCNN [27] and FANet [33], have adopted MobileNet residual blocks. MobileNet is also widely used in the fields of image classification [34] and object detection [35]. Inspired by the optimized and robust design of MobileNet’s architecture, this research designs a lightweight backbone network by assembling a series of residual bottleneck blocks found in MobileNet [13]. This helps the proposed semantic segmentation model achieve the desired balance for real-time applications on resource-constrained devices.
Traditionally, an encoder-decoder design is employed in semantic segmentation: rich contextual feature maps are extracted by the encoder, and a series of upsampling layers in the decoder reconstructs the scene and produces an output of the same size as the input. When using upsampling layers, a model can lose certain rich contextual details due to the large semantic gap between levels. To reduce this semantic gap and localize the contextual information better, this research proposes a new technique for semantic segmentation, called the Multi-level Multi-path Feature Fusion (M2-FF) method. This technique is discussed in detail subsequently.
Number of parameters vs class mIoU on the Cityscapes test set – the proposed M2FANet sets a new state-of-the-art performance among semantic segmentation models with fewer than 5M parameters.
In pursuit of better contextual representation, this study exploits different feature scaling techniques such as the Pyramid Pooling Module (PPM) [46] and the Atrous Spatial Pyramid Pooling (ASPP) module [5]. Due to its optimized design and its use of dilated convolutions, ASPP performs better than PPM, and it also provides better object localization by maximizing the size of the receptive field. The model’s performance using both PPM and ASPP is discussed in detail in the ablation study section.
The key contributions of the proposed study are as follows. Firstly, this research presents an optimized lightweight backbone that exploits the residual blocks of MobileNetV2 to extract rich contextual details from high-resolution input images. Due to its optimized architecture and small number of parameters, the proposed model is suitable for resource-constrained real-time semantic scene segmentation applications. Secondly, it introduces a new, yet effective, feature aggregation technique (M2-FF) at the decoder end for better region localization and context absorption. The top-down paths in the M2-FF module ensure better context assimilation, whereas the bottom-up path enhances the entire feature hierarchy by propagating the local signals from the bottom layer to the top layer. Thirdly, by introducing a feature scaling module, the proposed model shows a better way to maximize the receptive field for capturing the contextual details from the feature map without adding any extra parameters. Fourthly, the proposed model is tested and evaluated on two different publicly available benchmarks of urban street scenes. Finally, the results of the study show that, compared with existing semantic segmentation models of fewer than 5 million parameters, the proposed model achieves state-of-the-art results on the Cityscapes [8] dataset. The plot in Fig. 1 displays the position of the proposed model in the real-time category. The number of parameters and GFLOPs of other competitive real-time models are at least five times higher than those of the proposed model, which makes the proposed model (M2FANet) superior to the existing real-time scene segmentation models for resource-constrained embedded devices.
The following table shows all the acronyms used in this paper.
Acronyms
Semantic scene segmentation models typically follow a pyramid architecture. In the encoder stage, a deep CNN computes a feature hierarchy layer by layer and develops an inherent multi-scale pyramid shape. At the decoder end, the high-semantic feature map is up-sampled and fused with the previous layers’ feature maps through lateral connections to recover higher spatial dimensions. Once spatial details are completely extracted, the model predicts the class label for each pixel to complete the segmentation process. Different DCNN models, such as VGG [32], ResNet [36], Xception [7], and MobileNet [13], are often used as the feature extractor at the encoder end. Due to their deep network design, most of these DCNNs need extensive hardware support to generate output. However, due to the overwhelming use of mobile devices, developing real-time semantic segmentation models has become an important research area in the field of computer vision. Therefore, in the following subsections, we first discuss the models designed for offline computation and then provide the details of semantic segmentation models targeting real-time computation.
Off-line segmentation
Traditional semantic scene segmentation models are off-line models due to their deep architectural design. Inspired by the outstanding performance of DCNNs in the field of image classification, the Fully Convolutional Network (FCN) [21] introduced a revolutionary approach to deploying a DCNN model as an encoder. It replaces the fully connected layers at the top of the DCNN with a convolution layer to generate a spatial map and feeds it to the decoder in order to produce the segmentation output. Based on this foundation, several offline models, such as UNet [29], SharpMask [25], DeepLabV2 [4], RefineNet [16], PSPNet [46], DeepLabV3 [5], OCR [42] and HANet [6], were then developed by introducing lateral connections between the low-level feature maps across resolutions and semantic levels. Most of these off-line semantic segmentation models use ResNet-101 or higher variants [38] as the feature extractor due to their capability to capture many vision features. The literature has also shown that the performance of the model improves as the network depth and width increase [5, 39, 46]. For instance, with the deep ResNet-101 as the encoder, DeepLab [5] achieves better accuracy than with VGG-16 [32], ResNet-50 [36] or Xception [7]. However, due to its large depth (101 layers) and large width (2048 channels), ResNet-101 contributes a large number of parameters and GFLOPs, which makes all semantic segmentation models using it [5, 6, 16, 42, 46] computationally inefficient to run in real time. This raises the need for lightweight scene segmentation models for resource-constrained embedded devices.
Real-time segmentation
Due to the growing demand for real-time applications in the fields of classification [22], clustering [3], segmentation [14, 24] and object detection [35], considerable research has been conducted to optimize the existing off-line models and to produce acceptable real-time performance on resource-constrained devices. SegNet [1] was one of the pioneering models targeting real-time computation by introducing a small architecture. Although it reduced the overall number of parameters compared with offline models, the model still requires extensive hardware support to run. Later on, ENet [24] proposed an extremely efficient framework by reducing the number of down-sampling operations and introducing a high dilation rate in the bottleneck blocks. This reduces the number of parameters and GFLOPs drastically; however, the accuracy of this model is severely compromised. A few more studies, such as ICNet [45], ContextNet [26], BiSeNet [41] and GUN [43], were developed to improve the real-time performance. All these models introduced a new approach, called the multi-branch approach, targeting real-time computation. Figure 2b demonstrates the design of multi-branch encoders. In this method, one dedicated deep branch is introduced in the encoder to extract rich contextual information from the input. To reduce the computational cost caused by the depth of the branch, it accepts a lower resolution input, whereas the other branches, named shallow branches, receive a higher resolution input. The shallow branch is mainly deployed for capturing boundary and texture details from the input, which are fused with the deep features at the decoder. Thus, this multi-branch approach satisfies the goal of real-time computation with high-resolution input images. However, independent branches at the encoder do not contribute to the learning ability of the model. Moreover, the independent multi-branch approach introduces a large semantic gap between shallow and deep features.
Different approaches. From left to right: (a) One-branch encoder, (b) Multi-branch encoder, (c) Feature reuse in sub-encoder, (d) M2-FF decoder. The M2-FF approach uses semantic features at different levels from the encoder network and progressively maps them onto the multiple paths at the decoder end. Dotted arrows (green color) in the last path denote the feature aggregation path, in which features from the encoder are mapped onto the features of the current path by skip connections.
Considering the drawback of independent branches, FAST-SCNN [27] introduced a new technique, called Down-Sampling. Instead of using a completely separate branch from the beginning, it deployed a down-sampling module at the initial stage of the pipeline. This module reduced the input resolution to a quarter of the original input size. After this module, it created two branches – a deep branch for feature extraction and a shallow branch for preserving texture and pattern details. This approach reduced the computational cost and enhanced the performance. However, FAST-SCNN could not reduce the large semantic gaps between local and global features. Moreover, the model loses its ability to retain boundary information because the final feature is scaled up by a large factor at the decoder.
In contrast to these approaches, DFANet [15] introduces the concept of sub-networks in the encoder. Traditionally, an encoder follows a pyramid structure in which a stack of convolution layers processes an input image and reduces its spatial dimensions as the input approaches the end of the encoder. This low-resolution, high-semantic feature map is used as the input for decoding. In contrast to this traditional approach, DFANet [15] uses this semantic feature as the input for the next sub-encoder network, and the process is repeated for the third sub-encoder. Layers at the same level but in different sub-encoders share their gradient information among themselves. This approach produced new state-of-the-art results in real-time scene segmentation. However, it still has as many as 7.8 million parameters and large GFLOPs due to the multiple sub-networks at the encoder end.
The first three diagrams in Fig. 2 show the architectural differences between existing approaches. They show that all existing methods upsample the final semantic feature map by a large factor on the decoder side, which causes a loss of spatial details. Thus, a boundary degeneration effect can be noticed in the output. To address this issue, a few existing real-time models, such as FAST-SCNN [27] and DFANet [15], introduced a simple feature fusion module which fuses deep features with shallow features at a particular stage before upsampling the final feature map. This approach helps the model fuse contextual details with spatial details at lower resolution. However, it could not resolve the boundary degeneration effect. Moreover, mapping shallow features at higher resolution onto deep features at lower resolution introduces a large semantic gap. To reduce this semantic gap between the spatial dimensions of the local features and global features, features at different levels need to be added in multiple directions. Hence, this study proposes a new technique, called M2-FF, on the decoder side to fuse features from the top five stages in multiple directions. Furthermore, this module eliminates the need to upsample the final feature map by a large factor. Thus, it addresses the boundary degeneration effect in scene parsing. The far right diagram (d) of Fig. 2 illustrates the proposed approach.
The idea of fusing features at different levels was first introduced by FPN [17] as a multi-level feature fusion technique. By adding a top-down path and lateral connections among layers at the same level, it tries to reduce the semantic gap between semantic feature maps of different levels. Later on, PAN [18] showed that adding a top-down path at the decoder reduces the semantic gap but introduces a localization issue. To boost the localization capability of the entire feature hierarchy, [18] proposes another path from bottom to top to fuse local features with global features. By adopting this technique, an efficient, scalable object detection model, named EfficientDet [35], has recently been introduced by the Google Brain team. Both studies [18, 35] have shown that the addition of a bottom-up path provides better object localization and context accumulation in the feature maps. Motivated by this approach, the current study introduces a novel Multi-level Multi-path feature fusion approach at the decoder end, which is designed to address the issues of the existing real-time scene segmentation models discussed above.
In this section, the complete architecture of the proposed model is discussed. By exploiting dilated convolution, depth-wise separable convolution (DsConv), feature scaling and multi-stage feature fusion techniques, this study significantly extends the preliminary research [33] without a drastic increase in parameters and computational cost.
Network architecture
The overall architecture of the proposed M2FANet model is shown in Fig. 3. In what follows, the detail of every component of the model, including the backbone network, M2-FF, ASPP, classifier module, dilated convolution, depth-wise separable convolution, and non-linearity functions is addressed.
Complete pipeline of M2FANet.
Since the main focus of this research is to design an optimized scene segmentation model for resource-constrained devices, it employs the mobile residual bottleneck block (MBConv) of MobileNetV2 [30]. In the preliminary investigation, FANet [33], a down-sampling module is deployed at the beginning of the network to reduce the input resolution to 1/8 of the original input size before passing the tensor to the residual blocks. In the proposed design, however, it is replaced by MBConv blocks with fewer channels. The down-sampling module generates fewer parameters and GFLOPs, but it mainly controls the input resolution and boundary details and contributes little to preserving the spatial details of the input scene. On the other hand, it has been shown that deploying MBConv blocks of different expansion ratios at the initial stage preserves more contextual and spatial details due to its squeeze and excitation architecture [30]. Although an MBConv block generates more parameters and GFLOPs than the down-sampling module, it still has fewer channels at the initial stage and hence limits the increase in model parameters and GFLOPs.
The proposed study uses two types of bottleneck residual block from MobileNetV2 [30], MBConv1 and MBConv6, which are named after their expansion ratio. The expansion ratio is the ratio between the size of the input bottleneck and the inner size, and is either 1 or 6. Each block contains an input followed by several bottlenecks, then followed by an expansion. To reduce the number of parameters, a depth-wise convolution (DwConv) layer is deployed at the expansion stage. Table 2 displays the structure of an MBConv block. Here, a non-linear activation is deployed in the first two layers, as suggested in [30]. However, to retain meaningful information, it is skipped after the last layer of each MBConv. The filter size of the first block is 24 and it is progressively increased in the successive blocks. It is shown in [30] that the use of width multipliers from 0.35 to 1.25 for all resolutions generally produces better performance. Based on this strategy, the proposed study uses multipliers between 0.35 and 0.5 to set the number of channels in the successive MBConv blocks. It avoids higher values of the width multiplier to control the model width and keep the model computationally efficient for real-time applications. To exploit the architectural advantage of [12], this research also uses a squeeze and excitation module in each residual bottleneck block and tests the model’s performance. In the result section, it compares the performance of MobileNetV2 and MobileNetV3 residual blocks. Previous publications [12, 40] demonstrate that adding the squeeze and excitation module to the bottleneck architecture improves performance; however, the complete architecture of the network depends entirely on Network Architecture Search (NAS) [10], which makes the network design unpredictable. It also requires high CPU/GPU power to leverage neural architecture search for automatically predicting the feature network design.
To keep the proposed network design simple, predictable and usable for real-time computation, the final proposed model uses MBConv blocks of MobileNetV2. Experimental results also confirm that the inverted residual block of MobileNetV2 performs better than MobileNetV3 blocks. Moreover, the addition of squeeze and excitation module of MobileNetV3 results in a large number of parameters and GFLOPs, which makes the model more difficult to run in a real-time environment.
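As a rough illustration of how the expansion ratio and the width multiplier affect model size, the following Python sketch counts the convolution weights of a single inverted residual block. It is a minimal sketch under stated assumptions, not the implementation used in this study: biases and batch-normalization parameters are ignored, the 1×1 expansion layer is assumed to be omitted when the expansion ratio is 1, the rounding rule in `make_divisible` (multiples of 8) is assumed from common MobileNetV2 implementations, and the channel sizes in the usage example are hypothetical.

```python
def make_divisible(value, divisor=8):
    """Round a channel count to a multiple of `divisor` (assumed rounding rule)."""
    rounded = max(divisor, int(value + divisor / 2) // divisor * divisor)
    # Do not let rounding shrink the channel count by more than 10 %.
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded


def scaled_channels(base_channels, width_multiplier):
    """Apply a width multiplier to a base channel count."""
    return make_divisible(base_channels * width_multiplier)


def mbconv_params(in_ch, out_ch, expansion, kernel=3):
    """Convolution weights of one inverted residual block (no bias, no BN)."""
    hidden = in_ch * expansion
    # 1x1 expansion conv; assumed to be skipped when expansion == 1.
    expand = in_ch * hidden if expansion != 1 else 0
    # k x k depth-wise conv: one k x k filter per hidden channel.
    depthwise = kernel * kernel * hidden
    # 1x1 linear projection back to the bottleneck width.
    project = hidden * out_ch
    return expand + depthwise + project


if __name__ == "__main__":
    # Hypothetical channel sizes, chosen only for illustration.
    print(mbconv_params(24, 24, expansion=1))  # 792
    print(mbconv_params(24, 32, expansion=6))  # 9360
    for multiplier in (0.35, 0.5):
        print(multiplier, scaled_channels(64, multiplier))
```

Under these assumptions, the cost of a block grows roughly linearly with the expansion ratio, which is consistent with the choice to keep the channel counts small in the early stages.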
Bottleneck residual block
Layer architecture of backbone network
The layer architecture of the backbone network is displayed in Table 3. It shows that an input image passes through at most seven stages on the encoder side to produce a global feature map. For an input image of 1024
The literature has shown that fusing features at different levels helps the model combine rich contextual information with spatial details and reconstruct the image. Higher-level neurons respond strongly to entire objects, while low-level neurons are more likely to capture local texture and patterns. This motivates adding a top-down path that propagates rich semantic features from high to low levels and enriches the features at all levels with sensible classification knowledge.
Traditionally, a high-level rich semantic feature map
Design of Multi-level Multi-path feature fusion module at decoder. a) Top-down path for multi-scale feature fusion (blue arrows), b) Bottom-up path for better localization (green arrows), c) Top-down path for better feature aggregation with the help of skip connections (red arrows) from backbone. There are lateral connections (black arrows) among the layers at same levels.
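The three paths in the figure can be pictured with the following toy Python sketch. It is only an illustration under strong assumptions and does not reproduce the M2-FF module itself: feature maps are 1-D lists whose length halves at each level, fusion is assumed to be element-wise addition, nearest-neighbour upsampling and stride-2 subsampling stand in for the learned resizing layers, and the separable convolution blocks applied after each fusion are omitted. All function names are hypothetical.

```python
def upsample2(x):
    """Nearest-neighbour upsampling by a factor of 2."""
    return [v for v in x for _ in range(2)]


def downsample2(x):
    """Stride-2 subsampling (stand-in for a learned down-sampling layer)."""
    return x[::2]


def add(a, b):
    """Element-wise addition of two equally sized feature lists."""
    return [p + q for p, q in zip(a, b)]


def multi_path_fusion(features):
    """Fuse features ordered from shallow (high-res) to deep (low-res)."""
    n = len(features)

    # Path 1 (top-down): push deep semantics towards the shallow levels.
    top_down = [None] * n
    top_down[-1] = features[-1]
    for i in range(n - 2, -1, -1):
        top_down[i] = add(features[i], upsample2(top_down[i + 1]))

    # Path 2 (bottom-up): push local detail back towards the deep levels.
    bottom_up = [None] * n
    bottom_up[0] = top_down[0]
    for i in range(1, n):
        bottom_up[i] = add(top_down[i], downsample2(bottom_up[i - 1]))

    # Path 3 (top-down with skip connections from the backbone features).
    fused = [None] * n
    fused[-1] = add(bottom_up[-1], features[-1])
    for i in range(n - 2, -1, -1):
        lateral = add(bottom_up[i], features[i])
        fused[i] = add(lateral, upsample2(fused[i + 1]))
    return fused


if __name__ == "__main__":
    pyramid = [[1, 1, 1, 1], [2, 2], [3]]  # hypothetical three-level pyramid
    for level in multi_path_fusion(pyramid):
        print(level)
```

Each level keeps its own resolution throughout, so the shallowest fused output is already at the highest resolution in the pyramid and no single large upsampling step is needed at the end.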
Formally, given a list of semantic features
To reduce the number of operations and parameters, the standard convolution layer is replaced by a depth-wise separable convolution layer.
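The saving from this replacement can be estimated with the standard weight and multiply-accumulate counts for the two layer types. The Python sketch below is a back-of-the-envelope calculation that assumes stride 1, 'same' padding and no bias terms; the kernel size, channel counts and feature-map size in the usage example are hypothetical and are not taken from the proposed model.

```python
def standard_conv_cost(k, c_in, c_out, h, w):
    """Weights and multiply-accumulates of a k x k standard convolution."""
    params = k * k * c_in * c_out
    macs = params * h * w  # every output position uses every weight once
    return params, macs


def separable_conv_cost(k, c_in, c_out, h, w):
    """Weights and multiply-accumulates of a depth-wise separable convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing the channels
    params = depthwise + pointwise
    macs = params * h * w
    return params, macs


if __name__ == "__main__":
    # Hypothetical layer: 3 x 3 kernel, 128 -> 128 channels, 64 x 64 feature map.
    std_params, std_macs = standard_conv_cost(3, 128, 128, 64, 64)
    sep_params, sep_macs = separable_conv_cost(3, 128, 128, 64, 64)
    print(std_params, sep_params)             # 147456 17536
    print(round(sep_params / std_params, 4))  # equals 1/c_out + 1/k^2
```

For a 3 x 3 kernel the ratio approaches 1/9 as the number of output channels grows, so the separable layer needs roughly an order of magnitude fewer weights and operations.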
ASPP, introduced by [5], is a powerful tool to explicitly control the spatial dimensions of semantic feature maps produced by DCNNs. It also enhances the ability of DCNNs to handle both large and small objects efficiently by providing a robust mechanism to control the field-of-view of filters in convolution layers.
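To make the field-of-view control concrete, the following Python sketch applies a dilated (atrous) filter to a 1-D signal and reports the effective kernel size for several dilation rates. It is a minimal illustration, assuming 'valid' padding and a 1-D input; the rates listed are illustrative values, not necessarily those used in the proposed model.

```python
def effective_kernel_size(kernel_size, rate):
    """Span covered by a kernel whose taps are `rate` samples apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def dilated_conv1d(signal, kernel, rate):
    """1-D dilated convolution (cross-correlation form, 'valid' padding)."""
    span = (len(kernel) - 1) * rate
    return [
        sum(kernel[j] * signal[i + j * rate] for j in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


if __name__ == "__main__":
    for rate in (1, 6, 12, 18):  # illustrative dilation rates
        print(rate, effective_kernel_size(3, rate))
    # The same three weights cover a span of five samples when rate = 2.
    print(dilated_conv1d(list(range(10)), [1, 0, -1], rate=2))
```

The number of weights stays fixed at the kernel size while the covered span grows with the rate, which is why parallel branches with different rates can capture objects of different sizes without extra parameters per branch.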
A dataset usually contains objects of different classes with various sizes. Although DCNNs have shown astonishing capability of classifying an arbitrary region of scene by employing a small kernel, typically 3
where the dilation rate
Typically, ASPP is deployed on top of the encoder network to control the spatial resolution of the resulting feature maps if the output stride of encoder is
This is the last module of the whole pipeline. Features produced by the ASPP module are upsampled by a factor of 4 and then simply mapped with the second level (S2) features (
Depth-wise separable convolution
To achieve the target of designing an efficient scene segmentation model for resource-constrained devices, this study replaces the standard Conv layer by DsConv in many places. In the feature aggregation module, after fusing features of two consecutive levels, it deploys a separable convolution block in which a depth-wise separable convolution is performed, followed by a batch-normalization layer. DsConv factorizes a standard convolution into two stages – a depth-wise convolution (DwConv) and a point-wise convolution (1
Nonlinearities
The selection of the activation function in DCNN models has a significant impact on model prediction. It helps the model preserve non-linearity while passing knowledge from one layer to the next layer. The most widely used non-linearity function is the Rectified Linear Unit (ReLU) due to its strong convergence rate during gradient descent optimization. Although other activation functions such as Sigmoid and Tanh are smoother than ReLU, they have not been as popular as ReLU. Recently, the Google Brain team has proposed a new activation, called
where
To evaluate the proposed model in urban street scene analysis, extensive experiments are carried out on two different benchmark datasets: Cityscapes [8] and CamVid [2]. Experimental results illustrate that the proposed model outperforms many existing semantic segmentation models which have fewer than 5 million parameters. In the following subsections, this paper first discusses the datasets and implementation details, followed by a series of ablation studies on the Cityscapes benchmark dataset. Finally, it compares the proposed model with some existing off-line and real-time segmentation models and reports the results on both validation and test sets. Consistent with previous work, this study reports model parameters, GFLOPs, and class and category mean IoU.
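For reference, the mean IoU reported here is the average over classes of TP / (TP + FP + FN), accumulated over all evaluated pixels. The Python sketch below computes it from flat lists of integer labels. It is a minimal illustration and not the official evaluation script: it assumes that ignored (void) pixels have already been removed and it skips classes that never occur in either the ground truth or the prediction.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1
    return matrix


def per_class_iou(matrix):
    """IoU = TP / (TP + FP + FN) per class; None for absent classes."""
    ious = []
    for c in range(len(matrix)):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp
        fp = sum(row[c] for row in matrix) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom else None)
    return ious


def mean_iou(matrix):
    """Average IoU over the classes that occur at least once."""
    valid = [iou for iou in per_class_iou(matrix) if iou is not None]
    return sum(valid) / len(valid)


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]  # toy ground-truth labels
    pred = [0, 1, 1, 1, 2, 0]   # toy predictions
    print(mean_iou(confusion_matrix(truth, pred, num_classes=3)))
```

Because every class contributes equally to the average, small and rare classes influence the score as much as large ones such as road or sky.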
Datasets
The proposed study mainly uses the finely annotated images and considers only 19 classes for pixel annotations. The whole dataset is divided into three parts: a training set (2,975 images), a validation set (500 images) and a test set (1,525 images). The labels for the training and validation sets are supplied by the benchmark, whereas the labels for the test set are not provided. Therefore, test set predictions are submitted to the Cityscapes evaluation server, and the results are discussed in the results section.
Implementation details
To conduct this experiment, this study uses a system with two Nvidia GeForce RTX 2080Ti GPUs, each with 11 GB of memory. To exploit the parallel processing power of the GPUs, it uses CUDA 10.2. The proposed model is developed using
Finding optimal learning rate.
Inspired by [4, 13, 46], the proposed study uses the ‘poly’ learning rate policy with a base value of 0.045 and a power of 0.9. To find a suitable learning rate range for training, the model is first trained for 5 epochs using a polynomial scheduler, and the corresponding losses are plotted against the different learning rates. The upper and lower bounds of the learning rate for training are then set from this plot. Figure 5 illustrates the plot of learning rate vs model loss as an example. The categorical cross-entropy loss function is used to calculate the model loss.
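The ‘poly’ policy is commonly written as lr = base_lr × (1 − step / total_steps)^power. The Python sketch below evaluates this form with the base value and power stated above. Whether the decay is applied per iteration or per epoch, and the total number of steps, are not specified here, so the step counts in the usage example are hypothetical.

```python
def poly_learning_rate(step, total_steps, base_lr=0.045, power=0.9):
    """Polynomial ('poly') decay from base_lr at step 0 to 0 at total_steps."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return base_lr * (1.0 - step / total_steps) ** power


if __name__ == "__main__":
    total = 1000  # hypothetical number of training steps
    for step in (0, 250, 500, 750, 1000):
        print(step, poly_learning_rate(step, total))
```

With a power below 1, the rate stays close to its base value for most of training and drops quickly towards zero near the end.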
Following the suggestion by MobileNetV2, the proposed model uses
A previous study [33] has shown the effectiveness of a modified BiFPN design for region localization and context aggregation. Taking that as a starting point, this study updates the design of FANet for better scene segmentation. Table 4 shows the evaluation of the proposed M2FANet model. It produces better results than the existing real-time scene segmentation models with fewer than 5 million parameters.
Results of ablation study on Cityscapes validation set
The first three rows of Table 4 show the preliminary results of FANet already presented in [33]. From the fourth row onwards, the table shows the additional results obtained from the ablation study using the proposed model. At the initial stage, the backbone of FANet is replaced by an ImageNet pre-trained MobileNetV2 model and performance is measured on the Cityscapes validation set. MobileNetV2 [30] has 12 MBConv blocks with increasing channel sizes (16 to 320). The literature shows that increasing the depth and width of the model will likely enhance the performance. However, it also increases the model parameters and the number of operations. Specifically, MobileNetV2 has more than 2.6M parameters, which increases the model’s overall parameters and GFLOPs. Similarly, the fifth row of Table 4 shows that using MobileNetV3 [12] as the feature extractor of the new design improves model performance, but at the cost of a drastic increase in computation. Moreover, the addition of squeeze and excitation modules in the residual blocks of MobileNetV3 is determined by Neural Architecture Search (NAS) [40], which makes the model structure unpredictable. Therefore, optimizing the backbone architecture is difficult for MobileNetV3.
The above observation motivates this study to stick with the MBConv blocks of MobileNetV2. The depth and width of the FANet model are modified by increasing the number of residual blocks and the width multiplier. The proposed model replaces the down-sampling module of FANet with three MBConv blocks of different expansion ratios, and the last two MaxPooling layers of the FANet model are also replaced by three MBConv blocks. Table 4 illustrates that the addition of new blocks increases the number of parameters by 0.2M but boosts the model’s performance by 2.6%. To control the number of parameters of the MBConv blocks, the proposed design uses a width multiplier between 0.35 and 0.5 to increase the model’s width. Table 4 also shows that, without using any feature scaling technique, M2FANet improves the performance by 1.5% compared to FANet. However, it has been shown empirically that feature scaling at different branches with different scaling rates can further improve the model’s performance.
This study explores two powerful feature scaling techniques: PPM [46] and ASPP [5]. Table 4 shows that both techniques boost the performance by a noticeable margin. Compared to PPM, utilizing ASPP at the decoder side is more effective due to its design and the presence of dilated convolution branches. In PPM, features are processed by four ImagePooling branches with different rates whereas in ASPP, one ImagePooling branch, one 1
This section presents the performance of the proposed model on the Cityscapes dataset and compares it with existing off-line and real-time scene segmentation models. It reports model parameters, GFLOPs, pixel accuracy, and class and category mean Intersection over Union (mIoU) on the validation and test sets. It also reports the inference time and FPS (frames per second) of the models that are trained under the same system configuration. This study did not pre-train the model on the ImageNet [9] dataset, because the domain of ImageNet differs from that of urban street scenes. Instead, this study focuses on urban street scene benchmarks and trains the model on datasets from the related domain. Cityscapes [8] provides 5,000 finely annotated and 20,000 coarsely annotated images for training the model. The current study reports results both for the finely annotated data alone and for the finely annotated data combined with the weakly annotated data (Table 5) at 512
Model performance on Cityscapes fine and coarse datasets
This study also trains the proposed M2FANet on the finely annotated data at different input resolutions and presents the results in Table 6. At 512
Class-wise M2FANet performance on validation set at different input resolutions
Category-wise M2FANet performance on Cityscapes validation dataset
Table 7 displays the category-wise model performance at different input resolutions. In the Cityscapes dataset, all classes are grouped into seven categories: flat, construction, object, nature, sky, human, and vehicle. The proposed model’s performance is outstanding in five categories (flat, construction, nature, sky and vehicle) across all resolutions. However, the object category has a low accuracy (59.7–62.4%) across all input sizes, possibly due to the imbalanced class distribution and the tiny size of traffic signs, traffic lights and poles.
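The class-wise and category-wise scores above are both intersection-over-union measures. The sketch below shows, on toy numbers, how such scores can be derived from a confusion matrix and why a category score can exceed the corresponding class scores. It assumes the standard definition IoU = TP / (TP + FP + FN); the function names and the three-class confusion matrix are hypothetical and are not taken from the experiments in this study.

```python
def iou_per_label(confusion):
    """confusion[t][p] = pixels with ground-truth label t predicted as label p."""
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                         # missed pixels of k
        fp = sum(confusion[t][k] for t in range(n)) - tp    # wrongly predicted as k
        denom = tp + fp + fn
        scores.append(tp / denom if denom else float("nan"))
    return scores


def mean_iou(confusion):
    """Mean IoU over labels that actually occur (NaN entries are skipped)."""
    scores = [s for s in iou_per_label(confusion) if s == s]
    return sum(scores) / len(scores)


def merge_labels(confusion, group_of, n_groups):
    """Collapse a class-level confusion matrix into a category-level one."""
    merged = [[0] * n_groups for _ in range(n_groups)]
    for t, row in enumerate(confusion):
        for p, count in enumerate(row):
            merged[group_of[t]][group_of[p]] += count
    return merged


def pixel_accuracy(confusion):
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)


if __name__ == "__main__":
    # Toy example: classes road, pole, traffic sign; categories flat, object.
    toy = [[90, 5, 5],
           [10, 20, 10],
           [5, 10, 15]]
    category_of_class = [0, 1, 1]
    print("class mIoU:   ", mean_iou(toy))
    print("category mIoU:", mean_iou(merge_labels(toy, category_of_class, 2)))
    print("pixel acc.:   ", pixel_accuracy(toy))
```

In this toy case, confusing a pole with a traffic sign lowers both class scores but leaves the object category untouched, since both classes belong to the same category.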
Results on validation set
To compare the proposed model with other existing models, this study trained one off-line model (DeepLab) and three real-time segmentation models (SegNet, ContextNet, FAST-SCNN) under the same system configuration. The results of the comparison on the validation set are displayed in Table 8. Models marked with the sign * are trained during this study, where either public code is available or sufficient implementation details are known. For the other models, results are taken from the literature. Because of their large number of parameters and GFLOPs, the standard convolution layers of DeepLab and SegNet are replaced by DsConv layers. For DeepLab, instead of ResNet101 or VGG-16, a pre-trained Xception model is used as the backbone. This study also incorporated ASPP on top of the encoder as a dense feature extractor. Due to the large size of DeepLab, this study could not train the model at full input resolution. It trained DeepLab with 512
Performance evaluation of different models on Cityscapes validation set
Efficiency comparison of all trained models
Table 8 shows that the proposed M2FANet generates better segmentation results on the Cityscapes validation set than the other listed models. Although ICNet [45] produces an accuracy (67.7%) very close to that of M2FANet, it has much higher GFLOPs and many more parameters than the proposed model. This makes the proposed M2FANet superior to other real-time scene segmentation alternatives. This study achieves 68.5% mean IoU on the Cityscapes validation set without using any post-processing techniques. In [15], the authors claimed around 70% class accuracy on the Cityscapes validation set; however, that model has six times as many parameters as M2FANet and is thus not as efficient as the proposed model. Moreover, a study of the architecture in [15] indicates that this model has much higher GFLOPs than the proposed model. For a reasonable comparison, the current work compares the proposed model with the DFANet-B variant, which has 4.8 million parameters. Table 8 shows that the proposed model performs much better while having fewer parameters and GFLOPs. This study also found that the authors of [15] claimed a much smaller GFLOPs count for their best model. However, the current study suggests that it is not possible to have such a low GFLOPs count with a large number of parameters (7.8 million). It therefore calculated the GFLOPs count itself and reports it in Tables 8–10. Among off-line scene segmentation models, PSPNet [46] and HANet [6, 42] achieve outstanding results (above 81%) due to their deep neural architectures.
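The link between parameters, input resolution and operation count discussed above can be sketched for a single convolution layer. The example below is an assumption-laden illustration rather than the counting procedure used in this study: it counts multiply-accumulate operations (MACs) only, conventions for converting MACs to FLOPs differ between papers, and the function names and layer sizes are hypothetical.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weights of a 2D convolution without bias."""
    return kernel * kernel * (c_in // groups) * c_out


def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_macs(height, width, c_in, c_out, kernel, stride=1, padding=None, groups=1):
    """Multiply-accumulates of one convolution: weights times output positions."""
    if padding is None:
        padding = kernel // 2  # 'same'-style padding for odd kernels
    h_out = conv_output_size(height, kernel, stride, padding)
    w_out = conv_output_size(width, kernel, stride, padding)
    return h_out * w_out * conv_params(c_in, c_out, kernel, groups)


if __name__ == "__main__":
    # Hypothetical 3x3 layer with 32 input and 64 output channels.
    full = conv_macs(1024, 2048, 32, 64, kernel=3)
    half = conv_macs(512, 1024, 32, 64, kernel=3)
    print("weights:", conv_params(32, 64, 3))
    print("MACs at 1024x2048:", full)
    print("MACs at 512x1024: ", half)
    print("ratio:", full / half)
```

For a fixed layer, the weights do not change with the input, while the MACs scale with the number of output positions, so halving both image dimensions cuts the operation count of that layer by a factor of four. This is why operation counts reported at different input resolutions are not directly comparable.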
The current work also presents the inference time and the frame rate, in FPS (frames per second), of the models that are trained under the same system configuration. In the literature, such comparisons of inference time and FPS are often based on different hardware, which makes them less insightful because these indicators can vary significantly. They also depend on the size of the input image and on different hyper-parameters of the model. Comparing the inference times of models run on different hardware platforms with different input resolutions therefore does not give a clear picture of a model’s superiority in terms of efficiency. To present a balanced comparison, this study measures the inference time of all the trained models under the same system configuration and with similar input size. The results are reported in Table 9. They show that ContextNet [26] is quite efficient overall among all the trained models, which is understandable as ContextNet has the fewest parameters among the models listed in Table 9. It can process 11 frames per second on average, whereas the proposed model can process 9.1 frames per second. The inference time is measured on a single Nvidia GeForce RTX 2080Ti GPU system with the same input size. Table 9 also presents the model size, that is, the size of the model after the whole training process is complete.
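Inference time and FPS are two views of the same measurement, which the following minimal sketch makes explicit. It is not the benchmarking code used in this study: the function names, the warm-up and run counts and the dummy workload are assumptions, and only the frame rates of 11 and 9.1 FPS are taken from the text above.

```python
import time


def measure_latency(run_once, warmup=5, runs=20):
    """Mean seconds per call of `run_once`, measured after some warm-up calls."""
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(runs):
        run_once()
    return (time.perf_counter() - start) / runs


def fps_from_latency(seconds_per_frame):
    return 1.0 / seconds_per_frame


def latency_ms_from_fps(fps):
    return 1000.0 / fps


if __name__ == "__main__":
    def dummy_model():
        # Stand-in workload; replace with one forward pass of the model under test.
        sum(i * i for i in range(10_000))

    latency = measure_latency(dummy_model)
    print(f"{latency * 1000:.2f} ms/frame, {fps_from_latency(latency):.1f} FPS")
    # Frame rates quoted in the text, converted to per-frame latency.
    for fps in (11.0, 9.1):
        print(f"{fps} FPS is about {latency_ms_from_fps(fps):.1f} ms per frame")
```

When the workload runs on a GPU, frameworks that launch kernels asynchronously usually need a device synchronisation before each timer read; otherwise the measured time covers only the launch and not the computation.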
Performance evaluation of different models on Cityscapes test set
To compare with other existing models, this study presents the test set results in Table 10. The results of the other models are taken from their original papers and the Cityscapes leaderboard. The table shows that all off-line and real-time semantic segmentation models except ContextNet [26], FAST-SCNN [27] and ENet [24] have at least four times as many parameters and GFLOPs as the proposed model. In addition, these three models all achieve a lower prediction accuracy than M2FANet. Table 10 also demonstrates that very few existing real-time models can handle full-resolution input images. Among the listed models, ICNet [45] produced 69.5% class accuracy on full-resolution test images of Cityscapes, whereas the proposed model achieves 68.3%. However, ICNet has almost five times as many parameters as M2FANet, which makes it less favourable for real-time applications. The proposed M2FANet has fewer parameters and fewer GFLOPs than most models.
Table 10 also presents the GFLOPs count of all the models. Instead of counting GFLOPs at full input resolution, most of the models report their GFLOPs count at a low input resolution. As the GFLOPs count depends on the input size, it is expected to be lower in this setting. This study reports a GFLOPs count of 37.3 for the proposed model with 1024
Performance comparison on CamVid
Current research also evaluates M2FANet on the CamVid dataset. Table 11 shows the class accuracy and pixel accuracy of M2FANet and the other compared models on this dataset. M2FANet outperforms the other models: it achieves 60.0% class accuracy despite the small number of training images. To have a fair and reproducible comparison between the different models based on their core architecture, the present study did not use any data augmentation techniques or any other similar datasets in this experiment. Of course, one would expect an increase in performance should any such techniques be used. For example, some works achieved such an improvement through many extras, one of which was an additional 3,000 training images from other datasets outside CamVid. However, these results are difficult to reproduce due to the lack of details in these papers. For completeness, this study quotes the original figures from previous publications for those models which are not trained by the current study. The proposed model is expected to achieve even better performance when trained with these extra images from related data with similar annotations.
Performance evaluation on validation set of CamVid dataset
Output by different models using Cityscapes validation image.
Output by M2FANet using Cityscapes test images.
Output by FAST-SCNN, FANet and M2FANet using CamVid test image.
Prediction on Cityscapes benchmark
Figure 6 shows the segmented output produced by different models on the validation set of the Cityscapes benchmark. The first image is the RGB input, followed by the ground truth, the colour code and the outputs predicted by the different models. The output generated by M2FANet is visibly better than that of the other models, which agrees with the quantitative results in Table 8. A similar observation can be drawn from Fig. 7. The first row shows RGB test images from Cityscapes, whereas the second row displays the segmented output of M2FANet. The proposed model is capable of identifying and localizing every object in the scene whilst not overlooking tiny objects such as traffic lights, traffic signs and poles.
Prediction on CamVid benchmark
Figure 8 shows the output produced by FAST-SCNN, FANet and the proposed M2FANet on selected images from the CamVid validation set. In contrast to Cityscapes, the models are trained with 12 classes (including the background class) on this dataset. The output of M2FANet is visibly better than the output generated by FAST-SCNN and FANet. In particular, the boundaries of every class are much smoother in the output of the proposed model than in the others. The current study notes, however, that CamVid is much smaller than other scene segmentation datasets, which generally reduces the performance of any model, an observation that has also been established in previous works. Again, one may address this limitation by using extra images from other datasets or data augmentation techniques, which will benefit all models. However, the focus of the current research is on a fair comparison between the models, and thus this direction could be pursued elsewhere.
Summary and conclusion
In semantic segmentation, off-line models often produce outstanding results. However, due to their large network designs, these models are inefficient on mobile devices. To fulfill the real-time computational requirements of resource-constrained embedded devices, several lightweight scene parsing models have been proposed over the last few years. Although these models produce competitive results in real-time environments, noisy features, class overlapping, misclassification and boundary degeneration effects can still be seen in their output. Moreover, some existing real-time models producing state-of-the-art results still have a large number of parameters and GFLOPs. Therefore, by exploiting modern techniques such as feature scaling and multi-directional feature fusion, semantic segmentation model architectures can be optimized without hampering model performance and efficiency.
Thus, focusing on our objectives, this study presented a light semantic scene segmentation model, named M2FANet, for resource-constrained devices, capable of handling high-resolution input images in the real-time environment. The goal of this research was to develop a model that achieves the best performance among models with low computational requirements, defined here as having fewer than 5 million parameters. The optimized backbone network and the proposed multi-level multi-path feature aggregation module at the decoder efficiently produce new state-of-the-art results compared to existing models in the same category. Current research also demonstrates the usefulness of feature scaling and multi-directional feature fusion techniques for better object localization and contextual engrossment. The proposed model is evaluated on two public scene segmentation benchmarks. The results show that the proposed model is suitable for urban street scene analysis in real-time mode. In the future, this study plans to extend the proposed model to indoor scene analysis and expand the evaluation to other public benchmark datasets. We believe that the results on indoor scene datasets will be consistent with our current results, as our proposed model is designed for semantic segmentation of any complex scene. To allow the results presented in this work to be reproduced, this study has made the implementation of the proposed model and of selected models available in the official GitHub repository.
