Abstract
The rise of surveillance systems has led to exponential growth in collected data, enabling several advances in Deep Learning to exploit them and automate tasks for autonomous systems. Vehicle detection is a crucial task in the fields of Intelligent Vehicle Systems and Intelligent Transport Systems, making it possible to control traffic density or detect accidents and potential risks. This paper presents an optimal meta-method that can be applied to any instance segmentation model, such as Mask R-CNN or YOLACT++. Using the initial detections obtained by these models and super-resolution, an optimized re-inference is performed, allowing the detection of elements not identified a priori and improving the quality of the rest of the detections. The direct application of super-resolution is limited because instance segmentation models process images according to a fixed dimension. Therefore, in cases where the super-resolved images exceed this fixed size, the model will rescale them again, thus losing the desired effect. The main advantage of this meta-method is that it requires neither modifying the model architecture nor re-training it. Regardless of the size of the images given as input, super-resolved areas that fit the defined dimension of the object segmentation model will be generated. After applying our proposal, experiments show an improvement of up to 8.1% for the YOLACT++ model used in the Jena sequence of the CityScapes dataset.
Introduction
In recent years, instance segmentation has emerged as one of the most popular applications of deep learning and computer vision. This proliferation has been driven mainly by the decreasing cost of video surveillance systems and their increasingly widespread installation, which have expanded the generation and collection of data. Technological advances in the field of artificial intelligence and image recognition related to visual scene understanding have led to significant progress in research for several disciplines, such as the medical field or autonomous driving. Detection and surveillance systems in particular have benefited from these advances.
The application of convolutional neural networks is widely extended for many problems and contexts, usually image-related. Some of the uses are medical issues detection [1, 2, 3], pattern recognition [4], and scene understanding [5]. In the field of instance segmentation, CNN application has allowed the identification of the objects captured in the scene. These models locate each object with a mask that contains it, together with its class. Compared to classical techniques, instance segmentation requires the perception of the elements appearing in the scene to associate each image pixel with the object to which it belongs.
One of the areas in which the detection of elements in video surveillance systems is relevant is the management of road networks. Due to the growing interest in vehicle automation, there is an increasing interest in automatic road understanding applications, such as hazard detection [6], vehicle communication [7], or controlling traffic flow and density [8, 9, 10]. This task is complex because it must deal with several challenges related to the scope of the scene. Data challenges include the diversity of backgrounds determined by each video surveillance system's location, orientation, angle, and distance from the vehicles. In terms of detection, elements can appear in a wide variety of positions and scales. In addition, occlusion and motion blur are frequent. These problems make it challenging to apply instance segmentation models in this field. Beyond video surveillance on road networks, image segmentation is useful to identify anomalies or defects on surfaces [11, 12, 13] and understand complex scenes [14, 15].
Several instance segmentation models have proliferated based on convolutional neural networks (CNN). In image segmentation, two models prevail mainly: Mask R-CNN [16], and Yolact++ [17].
The architecture of Mask R-CNN [16] is grouped into two main stages. The first uses a Region Proposal Network (RPN) to identify candidate regions. Subsequently, it uses a convolutional neural network, named the backbone, as a feature extractor. The bounding box, its class, and the binary mask for each region of interest (RoI) are finally predicted in the second stage. On the other hand, Yolact++ [17] is a fully-convolutional model for real-time instance segmentation. This model first sets up a series of tentative masks and then predicts each, avoiding repooling. While these models achieve good results in datasets such as COCO (Common Objects in Context) [18], they present deficiencies in datasets that contain elements of different sizes. One of them is Cityscapes [19], where the models achieve a low mean average precision (mAP).
This work develops a novel meta-method that improves the accuracy of instance segmentation models on the problems presented above. It combines several concepts, approaches, techniques and components, such as instance segmentation models (Mask R-CNN and YOLACT++), super-resolution and vehicle detection. The segmentation masks generated from the initially detected elements are used as a starting point. After applying super-resolution to the initial image, the method generates an optimal set of sub-images on which to re-infer, increasing the number of pixels of the elements captured by these systems. An optimization algorithm reduces the number of sub-images processed by the model. The re-inference makes it possible to detect elements not identified a priori.
On the other hand, those elements detected multiple times are processed according to their binary mask, obtaining much more accurate detections with a higher degree of confidence. Unlike other proposed approaches, our method does not require modifications to the architecture of the model to be used. Moreover, it does not require retraining either. This feature is significant and can be considered critical if we apply a detection system to road networks using the thousands of cameras that currently exist, since it is not feasible to retrain the model for each camera manually. These improvements allow the model to detect a higher number of elements more quickly. We first demonstrate the effectiveness of our approach on different sequences of the Cityscapes dataset [19]. The mean average precision of the employed instance segmentation models has been improved. For example, the rate obtained by applying Mask R-CNN rises from 19.4% to as much as 26.8%. Additionally, the total number of detected elements and the number of sub-images to be processed have also been determined.
The rest of this article is organized as follows. Related work is discussed in Section 2. Section 3 details the improvements developed and explains in more detail the workflow. Section 4 includes the applied instance segmentation models, the used dataset, the employed metrics, the defined hyperparameters, and the tests performed together with their results. In Section 5, conclusions and future work are presented.
Related work
Given the relevance of instance segmentation, it has been an ongoing field of work since good results have been achieved on simpler problems such as image classification and object detection. The problem is commonly approached as the natural evolution of object detection, as seen in the proposal [20] based on the novel R-CNN model [21]. Since then, many methods for segmenting object instances have been proposed, such as the three-phase in cascade (from bounding box to mask and then to instances) proposal of [22]. Another is the famous Mask R-CNN [16] that extends the Faster R-CNN detection model [23] by adding a parallel branch to the bounding box extraction for the prediction of the mask of each object. There are also proposals such as [24], that base their segmentations on the shape of the objects to reduce the dependency on the quality of the bounding box to avoid failures caused by erroneous proposals inherited from object detection.
The close relationship of this problem with object detection means their solutions face similar weaknesses. To date, the general approach to solving the problem is to use supervised learning, which requires vast amounts of data to train neural models. This data requires pixel-level image annotation, which makes it exceptionally costly in person-hours to obtain very specific datasets such as [25]. As in other areas, alternative solutions have been sought to automatically increase the data on which to train. One example is the generation of artificial data based on cutting and pasting instances of objects in other images [26], which in this problem seems to work particularly well. Different approaches train using more easily available data, such as [27], which trains from image-level labeled data by obtaining pseudo masks before training a Mask R-CNN, or [28], which requires annotating a single pixel for each object in the image.
To improve the results of instance segmentation, some propose using other image recording technologies, such as [29], which takes advantage of information obtained from stereo cameras to perform 3D segmentation to help separate instances. However, a more popular scheme is integrating a segmentation model into a meta-algorithm to improve output quality. [30] applies polymorphic analysis to adapt a previously obtained segmentation mask to fit object shapes better. [31] represents image regions as a quadtree to identify and correct inconsistencies between the proposed mask and the input image. There are proposals based on image scaling techniques, such as [32]. Slicing Aided Hyper Inference (SAHI) is a computer vision library for performing large-scale object detection and instance segmentation. It is based on a generic slicing-aided inference and fine-tuning pipeline. According to a dimension selected by the user, it generates windows, which are upscaled before the model re-infers on them.
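The slicing idea mentioned above can be illustrated with a minimal sketch. It is not SAHI's API: the function name `slice_windows`, the `overlap` parameter, and the (x1, y1, x2, y2) box convention are assumptions made here for illustration only. The sketch just shows how a user-selected window dimension turns one frame into a grid of regions on which a model could re-infer.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); convention assumed for this sketch


def slice_windows(img_w: int, img_h: int, win_w: int, win_h: int,
                  overlap: float = 0.0) -> List[Box]:
    """Cover an image with fixed-size windows (hypothetical helper, not SAHI code)."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # Stride between consecutive windows; overlap shrinks the stride.
    step_x = max(1, int(win_w * (1.0 - overlap)))
    step_y = max(1, int(win_h * (1.0 - overlap)))
    windows: List[Box] = []
    y = 0
    while True:
        # Clamp the last row so the window stays inside the image.
        y1 = min(y, max(0, img_h - win_h))
        x = 0
        while True:
            x1 = min(x, max(0, img_w - win_w))
            windows.append((x1, y1, min(x1 + win_w, img_w), min(y1 + win_h, img_h)))
            if x + win_w >= img_w:
                break
            x += step_x
        if y + win_h >= img_h:
            break
        y += step_y
    return windows


# A 2048x1024 frame (the Cityscapes resolution) cut into 512x512 windows.
print(len(slice_windows(2048, 1024, 512, 512)))
```

Because the grid is fixed in advance, the number of windows does not depend on where the objects are, which is the point on which the present proposal differs.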
This proposal is framed within these methods that use an existing model (Mask R-CNN and Yolact++) integrated into a meta-method. The objective is to improve instance segmentation models when dealing with objects of different sizes. Several recent proposals work on this line. [33] improves Mask R-CNN by using a multi-scale Region Proposal Network structure. [34] studies the relation between the receptive field’s spatial information and the object’s size with the segmentation and proposes MR R-CNN, a network with convolutional layers attached to the proposed semantic segmentation layer containing feature pyramids. [35] proposes SCMask R-CNN, a Mask R-CNN improvement to enhance detection in high-resolution images with dense targets and complex backgrounds by modifying the ResNet101 [36] backbone to obtain more discriminative information and adding dilated convolutions. Our proposal differs from these approaches on one fundamental point: instead of proposing a new network structure to address the problems, we propose a meta-algorithm that can be used with any instance segmentation model. This is an advantage, as it will allow the proposal to improve the results as new and better instance segmentation models emerge. Likewise, our proposal will also benefit from faster novel network approaches.
One of the critical elements in our proposal is the use of a super-resolution method, for which there is a wide range of options, such as DRCN [37], VDSR [38], or FSRCNN [39]. We have chosen to use FSRCNN due to the widespread use of the method and the good results obtained. This model extends and speeds up SRCNN [40]: the initial interpolation is removed, new convolutional layers with a smaller filter size are added, and a new deconvolutional layer is introduced at the end.
Methodology
Architecture of the proposed system.
The proposed system for instance segmentation enhancement is composed of several modules. The system architecture is shown in Fig. 1. Next, the modules are described.
A deep learning neural network
where
The first step of our procedure (stage 1 in Fig. 1) consists in applying a super-resolution network
The second step (stage 2 in Fig. 1) is to process the input image
For each potential detection
where
After that, a graph
Next, it is found which sets of detections could be managed with a single run of the image segmentation network
The cliques are sorted in decreasing number of nodes to yield the list
A counter is initialized to zero for each node to keep track of how many times a node appears in the cliques of
The reduced set of cliques
The first clique of
All nodes whose counters are higher than 3 are removed from the cliques in
All cliques of
If
For each clique
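The window selection described in this subsection can be sketched as follows. Since the formal definitions are not reproduced in this text, everything below is an assumption made for illustration: two detections are linked when their joint bounding box fits inside one window, maximal cliques are enumerated with the basic Bron-Kerbosch recursion, and a simple greedy pass keeps the cliques that cover new detections. The counter-based pruning mentioned above is not reproduced, and all function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); convention assumed here


def fits_together(a: Box, b: Box, win_w: int, win_h: int) -> bool:
    """True if the joint bounding box of a and b fits in one window (assumed edge rule)."""
    return (max(a[2], b[2]) - min(a[0], b[0]) <= win_w
            and max(a[3], b[3]) - min(a[1], b[1]) <= win_h)


def build_graph(boxes: Sequence[Box], win_w: int, win_h: int) -> Dict[int, Set[int]]:
    """One node per detection; an edge when both detections fit in the same window."""
    adj: Dict[int, Set[int]] = {i: set() for i in range(len(boxes))}
    for i, j in combinations(range(len(boxes)), 2):
        if fits_together(boxes[i], boxes[j], win_w, win_h):
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj: Dict[int, Set[int]]) -> List[Set[int]]:
    """Basic Bron-Kerbosch enumeration (no pivoting), adequate for small graphs."""
    if not adj:
        return []
    cliques: List[Set[int]] = []

    def expand(r: Set[int], p: Set[int], x: Set[int]) -> None:
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def select_windows(cliques: List[Set[int]]) -> List[Set[int]]:
    """Greedy stand-in for the reduction step: largest cliques first, keep those adding nodes."""
    chosen: List[Set[int]] = []
    covered: Set[int] = set()
    for c in sorted(cliques, key=lambda c: (-len(c), sorted(c))):
        if not c <= covered:
            chosen.append(c)
            covered |= c
    return chosen


# Four toy detections and a 128x128 window: three sub-images cover them all.
boxes = [(0, 0, 50, 50), (60, 0, 110, 50), (120, 0, 170, 50), (1000, 500, 1050, 550)]
groups = select_windows(maximal_cliques(build_graph(boxes, 128, 128)))
print(groups)
```

With axis-aligned boxes, pairwise fitting implies that the whole clique fits in one window, because the joint extent along each axis is always set by some pair of members. This is why each selected clique can be handled with a single run of the segmentation network.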
Segmentation refinement module
Next, the image segmentation network
where
The object detections of
where
Consequently, the set of object detections for window
The final step of our procedure (stage 14 in Fig. 1) is detailed next. A cluster
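The unification of repeated detections can be sketched as below. The exact clustering rule and mask fusion are not recoverable from this text, so the sketch assumes a simple stand-in: detections of the same class whose masks overlap above an IoU threshold are grouped, the fused mask is the union of the group, and the highest confidence is kept. The names `Detection`, `merge_detections`, and `iou_thr` are hypothetical, and masks are assumed to be already mapped back to the coordinates of the original frame.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Tuple

Pixel = Tuple[int, int]  # (row, col) in original-frame coordinates


class Detection(NamedTuple):
    label: str
    score: float
    mask: FrozenSet[Pixel]


def mask_iou(a: FrozenSet[Pixel], b: FrozenSet[Pixel]) -> float:
    """Intersection over Union of two binary masks stored as pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def merge_detections(dets: Iterable[Detection], iou_thr: float = 0.5) -> List[Detection]:
    """Fuse detections of the same object (assumed rule: same label, mask IoU >= iou_thr)."""
    merged: List[Detection] = []
    # Highest score first, so each fused detection keeps its best confidence.
    for det in sorted(dets, key=lambda d: d.score, reverse=True):
        for k, kept in enumerate(merged):
            if kept.label == det.label and mask_iou(kept.mask, det.mask) >= iou_thr:
                merged[k] = Detection(kept.label, kept.score, kept.mask | det.mask)
                break
        else:
            merged.append(det)
    return merged


def rect(r0: int, c0: int, r1: int, c1: int) -> FrozenSet[Pixel]:
    """Toy rectangular mask covering rows r0..r1-1 and columns c0..c1-1."""
    return frozenset((r, c) for r in range(r0, r1) for c in range(c0, c1))


# The same car seen in two overlapping windows, plus one unrelated person.
dets = [
    Detection("car", 0.9, rect(0, 0, 10, 10)),
    Detection("car", 0.6, rect(0, 2, 10, 12)),
    Detection("person", 0.8, rect(20, 20, 30, 25)),
]
for d in merge_detections(dets):
    print(d.label, d.score, len(d.mask))
```

Other fusion rules (for example, a per-pixel vote across the group) would fit the same structure; the union is used here only because it is the simplest to state.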
We tested the proposed meta-method on images for instance segmentation, a task that is challenging because of the wide size range of the elements. The following subsections present the selected instance segmentation models and the super-resolution model, the experiments, the dataset, evaluation metrics, the experimental setup, and results.
Instance segmentation models
The proposed meta-algorithm aims to improve the segmentation of the elements detected by re-inference on super-resolved sub-images. One advantage of this meta-method is that it does not modify the selected instance segmentation model. In addition, it does not require re-training to improve the mean average precision (mAP). Two of the most commonly used models have been selected to determine the improvement obtained after applying the presented proposal.
Mask R-CNN [16]: It is an extension of Faster R-CNN [23]. This model has been selected from the TensorFlow Model Zoo repository.1
Yolact++ [17]: A fully-convolutional model for real-time instance segmentation. It has several improvements over its predecessor YOLACT [42], such as incorporating deformable convolutions into the backbone network.
Both models have been trained with the COCO dataset (Common Objects in Context) [18]. This dataset is composed of a large variety of scenes. The advantage of this dataset lies mainly in the number of different classes labeled, with up to 91 different classes with diverse scales. Besides the two segmentation models selected previously, our proposal can be applied using any deep convolutional neural network instance-based segmentation model.
Super-resolution model
The Fast Super-Resolution Convolutional Neural Network (FSRCNN) is a super-resolution model that has been designed to be fast and efficient, with a processing time that is significantly faster than many other models. This makes it an excellent choice for real-time applications or for use on devices with limited processing capabilities. In addition to its speed, FSRCNN is also known for its high reconstruction accuracy, consistently producing high-quality images with good visual quality and detailed structures. It is relatively simple to implement and does not require many parameters or computational resources, making it easy to deploy in various settings. FSRCNN is also highly flexible and can enhance a wide range of image types, including natural, text, and medical images. Lastly, this model is robust and able to handle noisy or low-quality images effectively, making it a reliable choice for use in challenging situations.
Differences between upscaling and the application of Super-Resolution with a scaling factor of x2 on a vehicle extracted from the CityScapes – Zurich sequence.
Figure 2 shows the differences between direct upscaling and super-resolution applications with a scaling factor of x2 according to a vehicle detected in the frame. Comparing the two extracted areas, the super-resolved one features a clear enhancement in terms of the quality and sharpness of the image.
A comparison has been made among the following methods:
Original Model (RAW): The direct application of the unmodified raw instance segmentation model.
Super-Resolution Only (SR Only): The image is processed by applying super-resolution and is given directly as input to the instance segmentation model.
Slicing Aided Hyper Inference (SAHI) [32]: Re-inference from rescaled zones using Mask R-CNN R-50-FPN model.
Our Proposal: The presented proposal aims to generate new sub-images by applying super-resolution optimally based on a parameter denoted as window size
The Super-Resolution Only proposal applies super-resolution to the image and gives the result as input to the object segmentation model. However, depending on the dimensions of these images, this will not significantly impact the accuracy in some instances. The object segmentation models set a fixed input dimension. If the processed images exceed this size, the model will downscale them again. On the other hand, SAHI [32] generates several regions based on the previously defined dimensions, and then the model performs an upscaling based on its input size dimension, which may cause a worse result in cases where the images have low resolution.
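The cancellation effect described above can be made concrete with a small arithmetic sketch. It assumes a simplified resizing rule that is not taken from any specific model: the longer image side is downscaled to `model_size` pixels only when it exceeds that value, keeping the aspect ratio. The function name and the input size of 550 pixels are illustrative assumptions.

```python
import math


def effective_scale(width: int, height: int, sr_factor: int, model_size: int) -> float:
    """Net magnification of scene content seen by the model (assumed resize rule).

    The image is first super-resolved by `sr_factor`; the model then downscales it
    only if its longer side exceeds `model_size`.
    """
    longer_side = max(width, height) * sr_factor
    model_resize = min(1.0, model_size / longer_side)
    return sr_factor * model_resize


MODEL_SIZE = 550  # illustrative fixed input dimension, an assumption of this sketch

# Whole 2048x1024 frame: super-resolution x2 is undone by the model's own downscaling.
print(effective_scale(2048, 1024, 1, MODEL_SIZE))
print(effective_scale(2048, 1024, 2, MODEL_SIZE))

# A 275x275 region super-resolved x2 fits the input exactly, so the gain is preserved.
print(effective_scale(275, 275, 2, MODEL_SIZE))
```

Under this rule the first two values coincide, while the cropped region keeps its full factor of 2. This is the motivation for generating super-resolved sub-images that fit the model dimension instead of super-resolving the whole frame.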
The Cityscapes dataset [19] has been used to evaluate the presented proposal. This dataset consists of complex real-world urban scenes, an enabling factor for many applications. It is composed of 5000 2048
This dataset stands out for the great variety of scenes that it contains. The images are divided according to the place where they were collected. The sequences used in this section were Zurich, Ulm, Munster, Monchengladbach, Lindau, Krefeld, Jena, Erfurt, Darmstadt, Bochum, and Frankfurt. Due to the large number of different objects and distances, this dataset is a challenge for typical CNN-based instance segmentation approaches. They tend to miss small objects, blend them, or change their class due to the small number of pixels they represent in the whole image.
Evaluation metrics
Selected values of the hyperparameters
Mean average precision obtained for the Ulm sequence of the CityScapes Dataset according to different window sizes
A series of criteria have been selected to determine the improvement obtained after applying the presented proposal.
First, the mean average precision (mAP) obtained by the COCO evaluator is determined.2 The average precision is calculated over multiple IoU (Intersection over Union) thresholds, which range from 0.5 to 0.95 with a step size of 0.05. The IoU threshold is the minimum overlap required between the annotation set as ground truth (GT) and the mask obtained by the instance segmentation model to define a match as positive. The mAP is also reported as a function of the scale of the objects, to determine how the improvement depends on object size. The sizes considered are as follows:
Additionally, the total number of detected elements in the selected sequence, the average elements detected per frame, and the number of sub-images to be re-inferred for each sequence are collected to determine the quantitative improvement obtained.
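The matching criterion behind the mAP can be sketched in a few lines. This is not the COCO evaluator itself: it only shows the mask IoU and the ten thresholds from 0.5 to 0.95. The size buckets use the standard COCO area ranges (32 by 32 and 96 by 96 pixels), which are assumed here to be the ones the evaluation refers to; boundary handling is simplified and all function names are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

Pixel = Tuple[int, int]

# IoU thresholds 0.50, 0.55, ..., 0.95 over which the precision is averaged.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mask_iou(pred: FrozenSet[Pixel], gt: FrozenSet[Pixel]) -> float:
    """Intersection over Union between a predicted mask and a ground-truth mask."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0


def matches_per_threshold(pred: FrozenSet[Pixel], gt: FrozenSet[Pixel]) -> Dict[float, bool]:
    """For each threshold, whether this prediction counts as a positive match."""
    iou = mask_iou(pred, gt)
    return {t: iou >= t for t in IOU_THRESHOLDS}


def size_bucket(area: int) -> str:
    """Standard COCO object scales by mask area in pixels (boundaries simplified)."""
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"


def rect(r0: int, c0: int, r1: int, c1: int) -> FrozenSet[Pixel]:
    """Toy rectangular mask used only for this example."""
    return frozenset((r, c) for r in range(r0, r1) for c in range(c0, c1))


gt = rect(0, 0, 10, 10)
pred = rect(0, 2, 10, 12)  # same shape shifted two columns to the right
print(mask_iou(pred, gt))
print(sum(matches_per_threshold(pred, gt).values()), "of", len(IOU_THRESHOLDS))
print(size_bucket(len(gt)))
```

A detection that is only roughly aligned with its annotation therefore counts as correct at the loose thresholds but not at the strict ones, which is why more accurate masks raise the averaged metric even when no new objects are found.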
During the experimental testing phase, we selected a set of hyper-parameters specified in Table 1. Eleven different
Mean average precision obtained for the Jena sequence of the CityScapes Dataset according to different window sizes
using the YOLACT++ model. A detection is valid when its confidence exceeds the set threshold of 50%
Total number of elements detected in the sequence, average and standard deviation of the number of detected elements, and average and standard deviation of the number of sub-images processed per frame for different values of the windows size
In this study, the proposed methodology is evaluated to determine the improvement in the accuracy rate obtained by the selected models. The presented proposal aims to reduce the number of sub-images the model needs to re-infer to improve its accuracy. Therefore, it is necessary to previously determine the factor called window size, denoted as
Regardless of the model on which the presented proposal has been applied, the mAP value obtained will increase compared to the direct application of the model. Referring to Table 2, we can see that for the general mAP, corresponding to the first column of the table, the model reaches 19.4%. On the other hand, applying our proposal in the less strict option, defining a window size of 30%, an increase of 7.4% is achieved, reaching a total accuracy of 26.8%. Additionally, we present the competitor, SAHI [32], with different window sizes. In the least restrictive case, in which SAHI generates 25 zones on which to re-infer, our model achieves, even in the most restrictive case (
Total number of elements detected in the sequence, average and standard deviation of the number of detected elements, and average and standard deviation of the number of sub-images processed per frame for different values of the windows size
. Our proposal is compared with the YOLACT++ model
Balance between precision and computational complexity. Average Sub-Images Processed are presented on the X-axis. The Y-axis represents the Mean Average Precision (mAP). Both axis coordinates have been calculated using the mean of 11 different sequences of the CityScapes Dataset (Zurich, Ulm, Munster, Monchengladbach, Lindau, Krefeld, Jena, Erfurt, Darmstadt, Bochum, and Frankfurt) using the Raw model and our proposal with different Windows Sizes 
Balance between precision and number of sub-images generated on which to re-infer for the Zurich sequence. Total Sub-Images Processed are presented on the left axis (bars). The right axis (diamond markers) represents the Mean Average Precision (mAP). The X-axis represents the applied methodology (Our proposal vs SAHI). Our proposal and SAHI are represented with different window size configurations. In our case, through the percentage, and for SAHI, through its width and length. The model used was Mask R-CNN.
An example applied to frame 000084_000019_leftImg8bit of the Darmstadt sequence. The left side shows the results obtained by the raw model, while the right side shows the detections after applying our proposal with a window size of 0% using the Mask R-CNN model.
An example applied to frame 000000_002083_leftImg8bit of the Krefeld sequence. The left side shows the results obtained by the raw model, while the right side shows the detections after applying our proposal with a window size of 0% using the Yolact++ model.
An example applied to frame 000001_002353_leftImg8bit of the Monchengladbach sequence. The left side shows the results obtained by the raw model, while the right side shows the detections after applying our proposal with a window size of 50% using the Yolact++ model.
According to
For the Mask R-CNN model, Table 4 shows the reduction of the required computing according to the selected model, going from the least restrictive case, with a total of approximately 23 sub-images per frame to 9 with
Figure 3 shows the balance between accuracy and computational complexity based on the number of total sub-images to be re-inferred. The X-axis represents the average number of sub-images processed per frame, while the Y-axis represents the mean average precision. The direct application of the model gives the worst results, obtaining the lowest mAP. The best solutions are those closest to the upper left corner, since they obtain the highest mAP while processing the fewest sub-images. According to the results, after applying the presented proposal, it is possible to considerably reduce the number of sub-images to re-infer without significantly sacrificing the mean average precision obtained.
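The trade-off read from this kind of plot can be expressed as a Pareto selection. The sketch below is illustrative only: the configuration names and numbers are made-up placeholders, not measurements from these experiments, and `pareto_front` is a hypothetical helper. A configuration is kept when no other one is at least as cheap and at least as accurate while being strictly better in one of the two.

```python
from typing import List, Tuple

Config = Tuple[str, float, float]  # (name, avg sub-images per frame, mAP)


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep configurations not dominated in (fewer sub-images, higher mAP)."""
    front = []
    for name, cost, score in configs:
        dominated = any(
            other_cost <= cost and other_score >= score
            and (other_cost < cost or other_score > score)
            for _, other_cost, other_score in configs
        )
        if not dominated:
            front.append((name, cost, score))
    # Sorted by cost, the front runs from cheapest to most accurate.
    return sorted(front, key=lambda c: c[1])


# Placeholder values chosen only to exercise the function (NOT results from the paper).
toy = [
    ("raw", 0.0, 10.0),
    ("window_a", 5.0, 15.0),
    ("window_b", 8.0, 14.0),   # costs more than window_a and scores lower
    ("window_c", 12.0, 18.0),
]
for name, cost, score in pareto_front(toy):
    print(name, cost, score)
```

In this toy input, `window_b` is discarded because `window_a` is both cheaper and more accurate; the remaining points are the ones among which a window size would be chosen according to the available computing budget.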
Figure 4 shows how increasing the window size
To visually show the improvement obtained after the application of the presented proposal, a series of qualitative results are included in Figs 5, 6, and 7. The number of detected elements is higher than with the direct application of the model.
Therefore, according to the obtained results, our proposal allows obtaining a Pareto front in which, by altering the
This paper presents an optimized meta-algorithm to enhance the performance of different image segmentation models. The proposal involves three key elements. The first one consists of performing super-resolution of the input video using a suitable super-resolution neural network. As a result, the level of detail is enhanced, increasing the number of pixels for objects in the scene to facilitate subsequent object segmentation. Secondly, the original low-resolution video is processed by an image segmentation model to produce a set of potential detections. Based on these detections, a graph will be calculated where each node represents an element. Its connections will be defined according to the distance between the two elements. As a third step, a heuristic is applied to determine the optimal number of sub-images on which the model should be re-inferred, according to a previously established window size. The image segmentation network gets new detections for each optimized window. Finally, the detections for the same object are clustered together, and a unified segmentation mask is then calculated from them.
Experimental results show that our approach significantly improves on the precision of the base image segmentation model. Specifically, our proposal detects more objects with a higher mean average precision in a wide variety of real-world sequences of the CityScapes dataset without modifying the layers that compose the original model or re-training it. Models such as YOLACT++ obtain a maximum gain of up to 8.1%, increasing the average number of elements detected per frame and reducing the number of sub-images where the model has to re-infer.
In future work, we plan to compare the accuracy obtained by instance segmentation models applying different algorithms and super-resolution models such as GANs or fuzzy models to determine if there is a significant improvement by using other techniques. In addition, the proposed meta-method could enhance the performance of other deep learning models recently applied to generate novel perceptual metrics [43], generate composite images [44], identify dementia using magnetic resonance images of brain asymmetries [45] based on the re-inference of several super-resolved zones by the instance segmentation model, or even for short-term scheduling road works [46].
Acknowledgments
This work is partially supported by the Ministry of Science, Innovation and Universities of Spain [grant number RTI2018-094645-B-I00], project name Automated detection with low-cost hardware of unusual activities in video sequences. It is partially supported by the Autonomous Government of Andalusia (Spain) under project UMA18-FEDERJA-084, project name Detection of anomalous behavior agents by deep learning in low-cost video surveillance intelligent systems. All of them include funds from the European Regional Development Fund (ERDF). It is also partially supported by the University of Málaga (Spain) under grants B1-2019_01, project name Anomaly detection on roads by moving cameras, and B1-2019_02, project name Self-Organizing Neural Systems for Non-Stationary Environments. The authors thankfully acknowledge the computer resources, technical expertise, and assistance provided by the SCBI (Supercomputing and Bioinformatics) center of the University of Málaga. The authors acknowledge the funding from the Universidad de Málaga. Iván García-Aguilar is funded by a scholarship from the Autonomous Government of Andalusia (Spain) under the Young Employment operative program [grant number SNGJ5Y6-15].
