Abstract
Instance segmentation has a wide range of applications, including video surveillance, autonomous driving, and behavior analysis. Nevertheless, as a type of pixel-level segmentation, its prediction performance in practice is substantially affected by low-resolution (LR) images resulting from the limitations of image acquisition equipment and poor acquisition conditions. Moreover, because their immense computational costs prevent the implementation of existing segmentation models on embedded devices, the development of a lightweight segmentation model has become an urgent necessity. However, it is challenging to achieve sound results with high efficiency and portability. From another perspective, to improve understanding of detailed objects, an architecture is needed that promotes an advanced interpretation of the segmentation, that is, a refined mask with texture. Our main contribution, called TextureMask, consists of the MobileNet-FPN for Mask R-CNN methods, segmentation with cropping, and a gradient sensitivity map, which are then merged into a unified map to refine and enrich the mask with texture information. Furthermore, preprocessing and post-processing algorithms are incorporated. Experiments demonstrated that our technique exhibits good pixel-level segmentation performance in terms of both accuracy and computational efficiency for a given LR input, and it can be easily implemented in embedded platforms.
Introduction
Video images represent one of the most common data types in modern life, and instance segmentation is a key research area in the field of human-computer interaction. Nevertheless, various real-world factors (such as poor image acquisition equipment, unfavorable acquisition conditions, and lens shake) impair the quality of the collected images. Therefore, the obtained data are mostly low-resolution (LR) images. These obstacles have restricted further research on instance segmentation considerably, including that for applications such as video surveillance, autonomous driving, and behavior analysis. However, few studies have focused on segmentation for LR data so far. Furthermore, LR segmentation algorithms must be lightweight for practical front-end applications. These necessities make LR segmentation a serious challenge.
In general, previous works focused on normal images, which are usually of high quality. In contrast, this paper proposes models that are compatible with LR input, a common characteristic of many real-world image acquisition devices. In addition, compared to detection and recognition, instance segmentation is a difficult pixel-level task because of the large number of parameters involved and the considerable computing power needed.
Recently, convolutional neural networks (CNNs) [1] have been employed for a variety of object detection tasks and have achieved strong performance. Although some researchers have attempted to increase system accuracy by adding layers and complicating the network, the resulting elaborate model structure usually requires substantial memory and computing power for storage and processing. These requirements have considerably constrained the adoption of such models. Consequently, few of the aforementioned algorithms can be readily applied on real-world platforms such as mobile or embedded devices, and few studies have addressed this challenge so far. In this light, the aim of this study was not only enhanced accuracy, but also lightweight algorithms for practical front-end applications.
Technically, the existing typical segmentation models, such as Mask R-CNN [2], U-Net [3], PSPNet [4], and other networks, do not perform well on LR images, and the predicted mask is often incomplete and of low quality. Practical applications such as behavior analysis and object tracking require not only a single mask, but also more detailed information for further analysis. Therefore, directly using a single mask is not ideal for practical applications and further research. In view of this, we aimed to establish a novel type of high-quality mask with fused texture information in order to better understand and analyze images.
Several papers on instance segmentation [5, 6, 7, 8, 9, 10, 11, 12, 13, 14] have proposed methods that embrace either instance or semantic segmentation; our algorithm, in essence, merges and keeps the merits of both ideas. To overcome the limitations of existing approaches, a novel mask generation framework is proposed, called TextureMask. More specifically, the operation of the mask combination refines and enriches the pixel-level detailed mask with texture information. Our study addressed the problems of difficulty in building a lightweight instance segmentation model for LR and the pursuit of obtaining merged masks with texture-level details. Accordingly, we constructed a compound model that consists of MobileNet-FPN for Mask R-CNN, segmentation with cropping, and a gradient sensitivity map. All these pieces are then integrated into a unified model to refine and enrich the mask with the inclusion of texture information. Furthermore, for LR, preprocessing is employed, and a motion vector is proposed as post-processing for the challenge of imputing incomplete mask prediction.
Related work
To address the lightweight application concern, two streams of ideas have been exploited. The first is to compress and simplify the trained CNN model to remove redundant parameters and increase the processing speed. The optimization techniques involve pruning [15, 16, 17], quantization [18, 19, 20], and Huffman coding [21]. The second idea is to replace the standard convolutional network with lightweight frameworks such as MobileNet V1 [22], MobileNet V2 [23], ShuffleNet [24, 25], and EfficientNet [26]. The goal is again to reduce the network parameter set while maintaining performance. Currently, MobileNet V1 [22] and EfficientNet [26] are widely adopted in portable devices because they are easy to implement without sacrificing accuracy. From another perspective, the CNN has also motivated a cohort of novel and influential detection methods, including R-CNN [27], Fast R-CNN [28], Faster R-CNN [29], and Mask R-CNN [2]. R-CNN (regions with CNN features) is a classical algorithm for object detection. Fast R-CNN improves R-CNN by incorporating SPPNet [30]; this combination largely solves the issue of high memory and processing requirements. Faster R-CNN builds upon Fast R-CNN by adding a region proposal network (RPN) [29], which generates the region proposals. Subsequently, Mask R-CNN extends Faster R-CNN by adding a segmentation branch.
The Mask R-CNN method is regarded as a state-of-the-art instance segmentation method, owing to its high precision and versatility. Moreover, by adding different branches, it meets various demands such as instance segmentation and human pose recognition. Nonetheless, its immense computation and memory requirements are still a hindrance.
U-Net is a small-footprint, state-of-the-art network for semantic segmentation; it is an expansion and modification of the fully convolutional network. The network consists of two parts: a contracting path to gather contextual information, and a symmetrical expanding path for precise localization. U-Net has the merits of light weight and high semantic segmentation performance for large targets.
In addition to these solutions, a wide range of related studies has been published on segmentation in video sequences. Ventura et al. [31] introduced an end-to-end recurrent network for video segmentation. Zhu et al. [32] presented a segmentation model that improves semantic segmentation by video propagation and label relaxation. Other algorithms have also exhibited superior performance [33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]. Although there is a body of high-performing algorithms for common video datasets, few mature LR segmentation frameworks with easy portability and high efficiency have been designed.
Proposed approach
Overall pipeline
To address the aforementioned challenges, we sequentially integrated MobileNet-FPN for Mask R-CNN, semantic segmentation with cropping, and a gradient sensitivity map. The first two components, Mask R-CNN and U-Net, are widely known as state-of-the-art instance and semantic segmentation models, respectively, and they do not interfere with each other. Therefore, the merits of each method are preserved: the former exhibits a high detection rate, and the latter has excellent semantic segmentation. In addition, a sensitivity map, which provides the texture characteristics of the target, is a well-recognized and popular approach for attaching texture information to mask prediction. Our combined model, therefore, takes advantage of the component at each step.
To be more specific, we aimed to incorporate the following criteria as the core TextureMask architecture, as shown in Fig. 1:
MobileNet-FPN for Mask R-CNN
Segmentation with cropping
Gradient sensitivity map
Beyond the power of these three elements, we also incorporated a preprocessing step, namely super-resolution (SR) reconstruction (to address low-resolution data), and a post-processing step, called SR mask prediction and refinement (using temporal correlation among consecutive video frames), to further enhance the reliability of mask prediction.
Overall pipeline.
Because of the lack of high-quality physical equipment, lens shaking, and other issues, most acquired images have poor quality or even mosaic characteristics. Moreover, the entire architecture is faced with the challenge of LR data, which directly affects the segmentation results. These conditions motivated us to rebuild a high-resolution (HR) dataset by utilizing SR reconstruction on the initial data.
We adopt deep networks for image super-resolution with sparse prior (SCN) [44] as the preprocessing step for the reconstruction. SCN takes advantage of both sparse priors and deep learning, and incorporates three independently optimized modules (sparse representation, mapping, and sparse reconstruction) into a single sparse network. Technically, training SCN is equivalent to co-optimizing the three modules in order to approach a globally optimal solution, and the reconstructed image has acceptable visual quality. The sparse coding network makes full use of the prior information of images. First, the method acquires the sparse prior information from images through the feature extraction layer. Thereafter, it establishes a feed-forward SCN for sparse encoding and decoding of data through the learned iterative shrinkage and thresholding algorithm (LISTA) [45]. Finally, the approach utilizes a cascade network to reconstruct the images. In terms of network structure, SCN retains the image block extraction, representation, and reconstruction layers of the super-resolution convolutional neural network (SRCNN) [46], and it includes the LISTA network in the middle layer.
As shown in Fig. 2, the input image
where
SCN pipeline.
In the first stage, Mask R-CNN is a cutting-edge model for segmentation. Nonetheless, MobileNet V1 [22] is employed as the backbone to reduce the amount of calculation, because ResNet-based approaches require excessive computing resources. Specifically, MobileNet V1 is a type of lightweight network that sets up an efficient factorization to shrink the model size and meet the hardware limitations on small devices. Its core unit is a depthwise separable convolution that integrates the filtering step (by 3
The computational details are illustrated in Fig. 3. Suppose there is an input data of size
This comparison shows that the depthwise convolution can considerably reduce the amount of calculation. With regard to the structure, MobileNet can be divided into the “Depthwise Conv Block” and “Conv Block,” as shown in Fig. 4. In short, MobileNet is designed with five stages, and the output channel widths across the network are 32, 64, 128, 256, 512, and 1024. The details of the MobileNet architecture are listed in Table 1.
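The saving can be checked with a short calculation. The sketch below uses the standard multiply-add counts for a standard convolution and for a depthwise separable convolution; the symbol names (`dk` for the kernel size, `m` and `n` for input and output channels, `df` for the feature-map size) and the example layer sizes are assumptions chosen for illustration and are not taken from the original equations.

```python
def standard_conv_cost(dk, m, n, df):
    """Multiply-adds of a standard convolution.

    dk: kernel size, m: input channels, n: output channels,
    df: spatial size of the (square) output feature map.
    """
    return dk * dk * m * n * df * df


def depthwise_separable_cost(dk, m, n, df):
    """Multiply-adds of a depthwise (dk x dk) plus pointwise (1 x 1) convolution."""
    depthwise = dk * dk * m * df * df   # one dk x dk filter per input channel
    pointwise = m * n * df * df         # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


def cost_ratio(dk, m, n, df):
    """Ratio separable / standard; algebraically equal to 1/n + 1/dk**2."""
    return depthwise_separable_cost(dk, m, n, df) / standard_conv_cost(dk, m, n, df)


if __name__ == "__main__":
    # Hypothetical layer: 3 x 3 kernel, 64 -> 128 channels, 56 x 56 feature map.
    print(cost_ratio(3, 64, 128, 56))
```

For a 3 x 3 kernel the ratio is dominated by the 1/9 term, so the separable form needs roughly one eighth to one ninth of the multiply-adds of the standard convolution.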
Details of convolution.
MobileNet architecture.
Details of MobileNet architecture
In practical applications, when implementing ResNet-based and MobileNet-based algorithms (the two candidate networks in this study), experiments showed that the two candidates often produce considerably different segmentation outcomes on the same image. In the experiment, ResNet-FPN and MobileNet-FPN exhibited three possible outcomes: (i) the former is superior in segmentation results, (ii) the latter has better results, and (iii) both networks deliver almost indistinguishable results. A natural question is how to select one of the two networks. In this section, we present a pre-classifier branch designed to select the more favorable segmentation network.
The feature extraction process is as follows. Images may possess a rich set of information. To better extract the details of the images (such as the texture and the edge gradient features), we extract the significant features from two perspectives: the gray level co-occurrence matrix (GLCM) [47] and the histogram of oriented gradients (HOG) [48]. The GLCM is a texture analysis method that explores the spatial distribution relationship between pixels. In our algorithm, GLCM considers the following four features:
Contrast: The value represents the gray level pixel difference to gauge how pixel pairs look, contrasting to each other. It is proportional to the deviation of the off-diagonal from diagonal elements, expressed as
MobileNet-FPN for Mask R-CNN.
The training procedure.
In total, the features consist of these four properties with four angles (0
The HOG characterizes the local gradient direction and the gradient intensity distribution of an image. To construct the HOG, we first partition the image into small connected regions. For each region, the algorithm collects a histogram of the gradient or edge directions of the pixels inside it, and then merges these histograms to generate the HOG. We combine the texture features of GLCM and the gradient features of HOG as the final features (Eq. (3)). Thus, employing these two complementary aspects in combination lets us capture the overall features of the image effectively.
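As an illustration of the GLCM part of this feature vector, the following minimal sketch builds a normalized co-occurrence matrix for one pixel offset and computes the contrast feature using the standard definition, the sum of (i - j)^2 p(i, j) over all gray-level pairs. It is assumed here that this matches the paper's contrast formula and that the four angles are the usual 0°, 45°, 90°, and 135°; the function names and the single-pixel distance are hypothetical choices.

```python
import numpy as np


def glcm(image, dy, dx, levels):
    """Normalized gray level co-occurrence matrix for the pixel offset (dy, dx)."""
    img = np.asarray(image)
    h, w = img.shape
    mat = np.zeros((levels, levels), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                mat[img[y, x], img[y2, x2]] += 1
    total = mat.sum()
    return mat / total if total > 0 else mat


def glcm_contrast(p):
    """Standard GLCM contrast: sum over (i, j) of (i - j)^2 * p(i, j)."""
    i, j = np.indices(p.shape)
    return float(((i - j) ** 2 * p).sum())


# Assumed offsets for 0, 45, 90, and 135 degrees at a distance of one pixel.
ANGLE_OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]


def four_angle_contrast(image, levels):
    """Contrast at the four assumed angles, as one slice of the feature vector."""
    return np.array([glcm_contrast(glcm(image, dy, dx, levels))
                     for dy, dx in ANGLE_OFFSETS])
```

The remaining GLCM properties and the HOG descriptor would be concatenated to this vector in the same way before it is passed to the pre-classifier.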
We chose four high-performing classifiers as pre-classifier candidates in order to measure performance, and then selected the best of them for the proposed framework: naive Bayes (NB) [49], random forest (RF) [50], adaptive boosting (AdaBoost) [51], and support vector machine (SVM) [52]. We designated two labels, 0 and 1, indicating that the image is better handled by MobileNet-FPN or ResNet-FPN, respectively. In particular, when MobileNet and ResNet behave similarly, we label the target as 0 because MobileNet usually requires less time and fewer resources. We then split the dataset into training and testing sets and apply the four classifiers to the dataset as follows.
The entire role of stage one, namely, MobileNet-FPN for Mask R-CNN with pre-classifiers, is presented in Fig. 5. In detail, the pre-classifier outputs 0 or 1, indicating which candidate network to use as the subsequent segmentation network for a given input image. The initial segmentation results of stage one are called Mask 1. The training procedure is shown in Fig. 6.
In the second stage, U-Net is used because of its considerable virtues, including light weight, fast training speed, suitability for large-object segmentation, easy implementation, and high robustness. In turn, EfficientNet is a novel type of lightweight model with high efficiency; its modeling process balances the depth, width, and resolution dimensions of the network to improve accuracy and efficiency. Considering the advantages of portability and easy transplantation, EfficientNet was adopted as the encoder feature extractor and decoder of U-Net.
Gradient sensitivity maps
The gradient sensitivity map in the third stage is employed in extracting the texture feature of each object. The technical details are as follows. Given the input image
For any image
Here
Gradient sensitivity maps. (a) Data. (b) Vanilla gradient. (c) Smoothed vanilla gradient. (d) Guided backpropagation. (e) Smoothed guided backpropagation.
We can calculate a simple random approximation. Random samples are taken near the input
Where
The activation equation is defined as:
The backpropagation:
The guided backpropagation is shown:
Merging and post-processing.
Results of the mask sequences. (a)–(b) list two sets of the results with and without post-processing.
Finally, the smoothed gradient and guided backpropagation are combined into smoothed guided backpropagation, which we call the gradient sensitivity map. It is applied to extract the texture features in stage three. As illustrated in Fig. 7, the comparison of the four methods reveals that the smoothed guided backpropagation algorithm captures the texture information best.
The sub-results from each of the three branches are then integrated into a unified result to refine the mask with the texture information. Technically, for each part, we first perform erosion followed by dilation to remove noise. Then, the final result is obtained by multiplying and adding the outcomes. Specifically, as illustrated in Fig. 8, the first stage can predict the basic mask, but ignores detailed information about certain locations, such as wheels, hands, and feet. The second stage pays more attention to these details and supplements them. Furthermore, the sensitivity map describes the texture features of each object. As shown in Fig. 8, merging the information from the three levels further enriches the mask with advanced, texture-level detail.
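The two ingredients can be sketched independently of any deep learning framework. In the minimal example below, `smoothgrad` averages a gradient function over Gaussian-perturbed copies of the input, and `guided_relu_backward` applies the guided backpropagation rule at a ReLU, passing a gradient only where both the forward input and the incoming gradient are positive. The function names, the noise level `sigma`, and the sample count are illustrative assumptions, and `grad_fn` stands in for the gradient of a real network output with respect to its input.

```python
import numpy as np


def smoothgrad(grad_fn, x, sigma=0.1, n_samples=50, seed=0):
    """Average grad_fn over n_samples noisy copies of x (Gaussian noise, std sigma)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=np.float64)
    total = np.zeros_like(x)
    for _ in range(n_samples):
        total += grad_fn(x + rng.normal(0.0, sigma, size=x.shape))
    return total / n_samples


def relu_backward(x, grad_out):
    """Ordinary ReLU backward pass: keep gradients where the forward input was positive."""
    return grad_out * (x > 0)


def guided_relu_backward(x, grad_out):
    """Guided backpropagation: additionally drop negative incoming gradients."""
    return grad_out * (x > 0) * (grad_out > 0)
```

Smoothed guided backpropagation then amounts to calling `smoothgrad` with a `grad_fn` whose ReLU layers use `guided_relu_backward`.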
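The merging step can be sketched with plain NumPy. Erosion followed by dilation is a morphological opening, implemented here with a 3 x 3 structuring element. The exact way the three outcomes are multiplied and added is not given in the text, so the combination below (union of the two cleaned masks, then multiplication by the sensitivity map) is only an assumed example, and all function names are hypothetical.

```python
import numpy as np


def _neighborhoods(mask):
    """Stack the nine 3 x 3 shifted views of a zero-padded binary mask."""
    padded = np.pad(mask, 1, mode="constant", constant_values=0)
    h, w = mask.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])


def erode(mask):
    """A pixel stays 1 only if its whole 3 x 3 neighborhood is 1."""
    return _neighborhoods(mask).min(axis=0)


def dilate(mask):
    """A pixel becomes 1 if any pixel in its 3 x 3 neighborhood is 1."""
    return _neighborhoods(mask).max(axis=0)


def opening(mask):
    """Erosion followed by dilation; removes isolated noise pixels."""
    return dilate(erode(mask))


def merge_masks(mask1, mask2, sensitivity):
    """Assumed merge: add the cleaned masks (clipped to 0/1), then multiply by texture."""
    refined = np.clip(opening(mask1).astype(np.int64)
                      + opening(mask2).astype(np.int64), 0, 1)
    return refined * sensitivity
```

With this choice, the output is zero outside the refined mask and carries the sensitivity values inside it, which is one simple way to obtain a mask with texture.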
Post-processing
An additional problem remains to be addressed: undetected and incomplete objects. To solve this challenge, we further design a post-processing algorithm called SR mask prediction and refinement, as shown in Fig. 9, which is applied after merging the branches. Fundamentally, considering the time-series nature of the frame-by-frame images of the video, one can predict subsequent images by relying on the strong correlation within images (i.e., intra-frame) and among images (i.e., inter-frame), defined as the spatio-temporal features in different frames [55, 56, 57]. In particular, we apply a motion vector [58, 59] to predict and refine these incomplete and missing masks to build up the succeeding mask sequences. Moreover, for each object, we produce the initial mask sequence by finding the center of gravity of each mask and matching the masks in adjacent frames by the minimum distance between their centers of gravity. Prediction is then performed on this sequence. Our matching algorithm locates the pixel block inside
Because it can easily find the globally best-matched block, the block-based full search motion estimation method [60] is employed in this case. Similarly, we obtain the final mask sequence from the prediction and refinement of the neighboring frames. The results, including those with and without post-processing, are shown in Fig. 9. The two panels present two examples of how incomplete (the dotted blue rectangle) or undetected (the yellow block) mask predictions are improved or imputed by the post-processing stage of our model. The first and second rows of each panel illustrate the model output without and with post-processing, respectively, and the images from left to right in each row are consecutive frames in time. In detail, in the third column of the second row, the human body mask was completed from its original incomplete version (i.e., the blue rectangle) according to the predicted masks in the preceding and subsequent frames. In addition, the two masks in the third and last columns of the second row of the bottom panel were also imputed on the basis of predicted masks in the surrounding frames.
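A block-based full search can be sketched as follows. For a block in the previous frame, every displacement inside a search window of the next frame is tried, and the displacement with the smallest sum of absolute differences (SAD) is returned as the motion vector. The SAD criterion, the block size, and the search range are assumptions for illustration, since the text does not state them, and the function name is hypothetical.

```python
import numpy as np


def full_search(prev_frame, next_frame, top, left, block=8, search=4):
    """Exhaustively search next_frame for the block at (top, left) of prev_frame.

    Returns ((dy, dx), sad): the best displacement and its SAD.
    """
    prev_frame = np.asarray(prev_frame, dtype=np.int64)  # avoid uint8 overflow
    next_frame = np.asarray(next_frame, dtype=np.int64)
    h, w = next_frame.shape
    ref = prev_frame[top:top + block, left:left + block]
    best_vec, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidate blocks that would fall outside the frame.
            if y < 0 or x < 0 or y + block > h or x + block > w:
                continue
            sad = int(np.abs(next_frame[y:y + block, x:x + block] - ref).sum())
            if best_sad is None or sad < best_sad:
                best_vec, best_sad = (dy, dx), sad
    return best_vec, best_sad
```

Because every candidate in the window is evaluated, the result is the global optimum within that window, at a cost that grows quadratically with the search range.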
Dataset
We used the three datasets, UCF Sport Action, Street_View_MB, and Visdrone2018, for the experiments because they exhibit a diverse collection of segmentation examples with LR and varied scenes, such as street views, vehicles, pedestrians, and sports action. These datasets are LR because of lens shake, hardware conditions, and other situations (see Fig. 10 for sample images). The datasets were obtained from BBC, ESPN’s TV news, drone video, and other sources. We randomly partitioned each category into training, verification, and test sets with a fixed ratio of
Details of datasets
Examples of dataset.
Experimental comparison of the SR reconstruction
Examples of SR reconstruction.
In our experiments, we used the NVIDIA Jetson TX2 as the mobile device because of its 256-core NVIDIA Pascal architecture and its energy efficiency for embedded AI computing.
Preprocessing
For preprocessing, we further evaluated SCN reconstruction using three metrics: peak signal-to-noise ratio (PSNR) [61], structural similarity index (SSIM) [62], and FeatureSIM [63]. PSNR is most commonly used for measuring the quality of reconstruction from lossy compression codecs, and it is efficient for image quality assessment (IQA). A higher PSNR generally indicates higher-quality reconstruction. Similarly, for the other two IQA measures (SSIM and FeatureSIM), a higher value indicates better image quality.
As shown in Table 3, SCN reconstruction outperforms three traditional methods, namely, nearest interpolation (NEA), bilinear interpolation (BIL), and bicubic interpolation (BIC), under all three criteria. The images generated by SCN also have visibly higher quality than those of the other methods; this comparison is shown in Fig. 11. As seen in Fig. 12, compared to LR, more objects are detected and more complete masks are achieved in the segmentation of SCN-reconstructed HR images. The dotted blue rectangles mark incomplete objects, the yellow blocks mark undetected masks, and the red ones mark the improved results on HR images.
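For reference, PSNR can be computed directly from the mean squared error between a reference image and its reconstruction. The sketch below assumes 8-bit images, so the peak value defaults to 255; the function name is illustrative.

```python
import numpy as np


def psnr(reference, reconstructed, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

SSIM and FeatureSIM compare local structure rather than raw pixel error and are therefore more involved to implement.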
Pre-classifiers
We further randomly partitioned each category into training data and test data with a fixed ratio of 8:2. We used three indicators, precision (Eq. (11)), recall (Eq. (12)), and F1-score [64], to measure the performance of the classifiers. Defined for each label, precision is the proportion of true positives among all predicted positives (true positives plus false positives), and recall is the proportion of true positives among all actual positives (true positives plus false negatives). As shown in Table 4, because SVM achieves the best scores under all three criteria (precision, recall, and F1-score) among all candidate classifiers and also enjoys the numerous advantages discussed in Section 3.3.1, we chose SVM as the pre-classifier.
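The three indicators can be computed from the counts of true positives, false positives, and false negatives for a given label. The sketch below uses the standard definitions, with F1 as the harmonic mean of precision and recall; the function name and the example labels are illustrative.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1-score for one label treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Label 1 means ResNet-FPN is preferred, label 0 means MobileNet-FPN.
    print(precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0], positive=1))
```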
Results of pre-classifiers
Examples of segmentation in LR and HR.
Experimental comparison for two candidate networks
Experimental comparison of mAP and mIoU
Details of mAP results.
Examples of comparison results.
In our experiment, we set the learning rate to 0.0001 and the number of training steps to 1000. For the first stage, the comparison of the two candidate networks is listed in Table 5. The results of MobileNet-FPN and ResNet-FPN are close to each other, but the MobileNet-based model offers several advantages. First, running the pre-classifier takes little time and few computing resources compared to the main segmentation, so its cost can be ignored. Second, the MobileNet-based approach is small; it contains only 48 kernels, whereas the uncompressed model has 125 kernels. Third, our algorithm improves throughput and running time. Finally, the method consumes only 98 MB of memory after compression, whereas the original takes 260 MB. The results imply that the MobileNet-based approach can be easily ported to mobile devices, owing to its low computation and memory cost. The details of the statistical results are shown in Fig. 13.
A comparison of the Mask R-CNN, U-Net, refined mask, and refined mask with texture approaches is illustrated in Fig. 14. As demonstrated in Fig. 14, U-Net exhibits excellent refinement of the detailed target. Furthermore, the texture details in each object are extracted well by the gradient sensitivity maps. In particular, note that the refined mask is more complete than the mask predicted by Mask R-CNN alone. The textured mask not only presents strong segmentation ability, but also shows the object details well. Overall, the texture part helps users understand the detailed information about each object, which can support further research.
Additional results (Table 6) show that the MobileNet-FPN model has slightly higher mean average precision (mAP) than ResNet-FPN, which suggests that the lighter MobileNet-based model does not sacrifice accuracy. Moreover, by adding post-processing, the accuracy can be further improved. Meanwhile, the mean intersection over union (mIoU) of the proposed method is improved compared to the other models. Hence, these results indicate that our model exhibits superior performance and robustness.
In brief, the proposed TextureMask model represents a new approach to drawing the mask by concatenating five major modules: preprocessing, MobileNet for Mask R-CNN, segmentation with cropping, sensitivity mapping, and merging with post-processing. Experiments show that the model achieves superior performance in masking with texture in terms of efficiency, accuracy, and resource utilization, so that it can be easily ported to front-end applications.
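As a reminder of how mIoU is obtained, the sketch below computes the intersection over union for each class from a predicted and a ground-truth label map and averages over the classes that appear in either map. Skipping absent classes is one common convention and is an assumption here, as the text does not specify the evaluation protocol; the function name is illustrative.

```python
import numpy as np


def mean_iou(pred, gt, num_classes):
    """Mean intersection over union across classes present in pred or gt."""
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")
```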
Conclusion
We constructed a robust merged segmentation architecture (TextureMask) by integrating three different perspectives to generate textured masks, which can provide an advanced explanation for understanding objects from a new perspective. Meanwhile, TextureMask combines the merits of different deep learning algorithms. Our experiments demonstrated that the proposed model produces high-quality textured masks on the LR datasets. Furthermore, TextureMask can be easily ported to mobile devices for segmentation because of its low memory cost and high precision. Therefore, the algorithm proposed in this paper has both high efficiency and portability. In future research, we aim to make the model even more lightweight and to improve the effectiveness of the masks.
Acknowledgments
This research was funded by the National “13th Five-Year” Science and Technology Projects in China (funding number JSZL2017203B023).
