Abstract
Limitations in both algorithms and datasets keep the accuracy of X-ray security image inspection low. In the literature, few studies and few datasets address the segmentation of dangerous objects. In this work, we contribute a purely manual segmentation labeling of an existing X-ray security inspection dataset, SIXray, with pixel-level semantic information for dangerous objects. We also propose a composition method for X-ray security inspection images to augment the positive samples. This method quickly produces positive sample images using affine transformation and the HSV features of X-ray images. Furthermore, to improve recognition accuracy, especially for adjacent and overlapping dangerous objects, we combine the softer non-maximum suppression algorithm (Softer-NMS) from target detection with Mask RCNN; the resulting model is named Softer-Mask RCNN. Compared with the original Mask RCNN, Softer-Mask RCNN improves accuracy (mAP) by 3.4%, and by 6.2% when synthetic data are added. The results indicate that the proposed method effectively improves the recognition of dangerous objects depicted in X-ray security inspection images.
Introduction
Nowadays, millions of people take public transportation such as subways and civil aviation. As the urban population grows, the number of people using public transportation increases steadily, which brings convenience but also considerable security risks. Security inspection of baggage therefore helps protect public places from terrorism and other threats, and plays a critical role in safety precautions [1, 2]. Manual observation depends strongly on the experience of the security inspectors, and dangerous objects may be missed or falsely detected.
A rapid, accurate and automatic approach to assist inspectors in detecting dangerous objects in X-ray scanned images is therefore highly desirable. However, traditional automatic detection methods either have low detection effectiveness or cover too few types of dangerous objects [3], and so cannot comprehensively solve the security inspection problem. Deep learning can help solve problems that are hard for traditional image processing methods, and it has had a large impact on the design of image processing algorithms for security inspection. The convolutional neural networks (CNN) designed by the winners of the ImageNet competition have brought great improvements to target recognition in images [4]. As an important application of deep learning, target detection provides valuable information for image and video understanding, and so it can also be applied effectively to dangerous object recognition in X-ray security inspection images.
In existing research on target detection, the region-CNN (RCNN) [5] by Girshick et al. uses a selective search algorithm to find bounding boxes around the objects of interest, but training this model is very inefficient. A few months later, Girshick et al. [6] improved the RCNN model by updating the selective search algorithm, which reduced the training time. The improved model is called Fast RCNN [6]. One year later, Ren et al. [7] removed the selective search algorithm, introduced the Region Proposal Network (RPN), and designed a new target detection model, the so-called Faster RCNN [8], which greatly reduced the training time. More recently, He et al. [9] proposed Mask RCNN, an instance segmentation network [10]. This architecture performs target detection and semantic segmentation simultaneously, and was the first to achieve end-to-end instance segmentation.
The proposed method makes three main contributions: (1) a purely manual segmentation labeling of the X-ray security inspection dataset; (2) to alleviate missing and false detections caused by overlapping dangerous objects, and thereby increase segmentation accuracy, the Mask RCNN model is improved by using ResNet-101 as the backbone network for feature extraction and by combining it with the softer non-maximum suppression (Softer-NMS) [11, 12]; and (3) a data augmentation method that composites X-ray security inspection images is proposed to further enhance the performance of the improved Mask RCNN model (i.e., Softer-Mask RCNN).
Methodology
Instance segmentation is a relatively complex task in the computer vision (CV) field because it combines the characteristics of semantic segmentation and target detection, and every instance needs to be located accurately. Research on instance segmentation has therefore long combined semantic segmentation and target detection methods. Such combined methods are top-down: the area where the instance is located is first determined by a target detection method (e.g., as a bounding box), semantic segmentation is then performed inside the detection box, and each segmentation result is output as an instance. One of the most typical methods is Mask RCNN [9].
Instance segmentation method based on Mask RCNN
Mask RCNN is a region-based CNN framework in which an additional branch is added at the end of the Faster RCNN model to segment the target object. Mask RCNN also uses the ROI Align method to preserve the spatial correspondence of each pixel, reducing the error caused by the quantization in ROI Pooling. Feature extraction in Mask RCNN can be improved by changing the backbone network and designing different weight layers to build neural network models of different depths. At present, the main CNN models include AlexNet [14], VGG [15], GoogleNet [16], ResNet [17], and so on. Because the residual structure improves the convergence of the model, this study uses ResNet as the backbone network for feature extraction. As the images to be segmented may contain a variety of dangerous objects, ResNet-101, which performs well, is selected as the backbone of Mask RCNN. With the ResNet-101-FPN backbone, Mask RCNN is trained end-to-end with strong generalization performance, and the inference speed is 5 fps. Furthermore, we replace the NMS with the Softer-NMS in this paper. The network structure of the improved Mask RCNN (i.e., Softer-Mask RCNN) is shown in Fig. 1.

Fig. 1. The network structure of the Softer-Mask RCNN.
As shown in Fig. 1, the framework of Mask RCNN consists of the following three portions. First, the backbone network extracts the feature maps from the input image. Second, the feature maps exported from the backbone network are sent to the RPN, and then undergo the Softer-NMS and ROI Align processing to generate regions of interest (ROI). Third, both the ROI generated by the RPN and the feature maps are input into the fully-connected layers (FC) and the fully convolutional network (FCN).
In this way, the input image is processed by three branches to get the final outputs of the respective category, candidate boxes, and segmentation result.
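To make the difference between ROI Pooling and ROI Align concrete, the following minimal Python sketch samples a toy feature map at a fractional location. It is an illustration only, not the implementation used in this study: the function names and the 2×2 feature map are hypothetical, the sampling point is assumed to lie inside the grid, and a full ROI Align additionally averages several such samples per output bin.

```python
import math


def roi_pool_sample(feature, y, x):
    """ROI Pooling style lookup: the continuous coordinate is quantized to a cell."""
    return feature[int(math.floor(y))][int(math.floor(x))]


def roi_align_sample(feature, y, x):
    """ROI Align style lookup: bilinear interpolation at the continuous coordinate."""
    h, w = len(feature), len(feature[0])
    y0 = min(max(int(math.floor(y)), 0), h - 1)
    x0 = min(max(int(math.floor(x)), 0), w - 1)
    y1 = min(y0 + 1, h - 1)
    x1 = min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


# Hypothetical 2x2 feature map, sampled halfway between the four cells.
feature = [[0.0, 10.0], [20.0, 30.0]]
pooled = roi_pool_sample(feature, 0.5, 0.5)    # snaps to the top-left cell
aligned = roi_align_sample(feature, 0.5, 0.5)  # blends all four cells
```

In this toy case the quantized lookup returns the top-left value alone, while the interpolated lookup returns the average of the four neighbours, which is the kind of misalignment that ROI Align is meant to avoid.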
At present, the NMS algorithm [11] is mainly used to filter out redundant prediction boxes for the same target. In a specific target detection task, the confidence of each prediction box is determined by the classification branch. Because too many prediction boxes are generated, several boxes of varying quality cover the same target, so the NMS algorithm eliminates the prediction boxes with non-maximum scores. However, the standard NMS algorithm directly removes lower-scoring detection boxes of overlapping targets, resulting in missing detections.
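The greedy procedure described above can be sketched in plain Python as follows. This is a minimal illustration under assumed conventions (boxes given as (x1, y1, x2, y2) tuples, an IOU threshold of 0.5, and the hypothetical helper names `iou` and `nms`), not the code used in the experiments.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Every remaining box that overlaps the kept box too much is deleted.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Toy example: boxes 0 and 1 overlap heavily, box 2 is far away.
boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this toy example the second box overlaps the first with an IOU of about 0.67 and is discarded. If the two boxes belonged to two different, overlapping objects, this deletion would be a missing detection.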
To address this problem, the Soft-NMS algorithm [11] was proposed as an improvement. Instead of simply deleting a detection box whose IOU with the selected box exceeds the threshold, the Soft-NMS lowers its score. The procedure is the same as in NMS, except that a decay function is applied to the original confidence score to reduce it. However, this algorithm is still essentially greedy and cannot guarantee finding the globally optimal detection box.
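A minimal sketch of this score decay is given below, assuming the Gaussian variant in which a score is multiplied by exp(-IOU^2 / sigma). The values sigma = 0.5 and score_threshold = 0.001, as well as the function names, are illustrative assumptions rather than settings taken from this study.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of deleting them."""
    scores = list(scores)  # work on a copy so the caller's list is untouched
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        keep.append(best)
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
        # Only boxes whose decayed score became negligible are dropped.
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return keep, scores


boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
keep, new_scores = soft_nms(boxes, scores)
```

With the same toy boxes as before, the heavily overlapping second box is kept with a reduced score instead of being removed, so a later confidence threshold decides whether it survives.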
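In response to these problems, the Softer-NMS algorithm was proposed [12, 13]. The main idea is to learn the box regression together with a localization confidence: the ground-truth location is treated as a Dirac delta distribution, and the loss measures how far the predicted distribution is from it. Specifically, instead of predicting only a bounding box location, the Softer-NMS algorithm predicts a Gaussian distribution over each box coordinate, whose variance expresses the localization uncertainty.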
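The expression for this Gaussian is not reproduced in the text above. Following the Softer-NMS formulation in [12], and with symbols that are assumptions here (x_e for the estimated coordinate, σ for its predicted standard deviation, x_g for the ground-truth coordinate), the predicted distribution and the Dirac delta target for a single box coordinate can be written as:

```latex
P_{\Theta}(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}
  \exp\!\left(-\frac{(x - x_{e})^{2}}{2\sigma^{2}}\right),
\qquad
P_{D}(x) = \delta(x - x_{g})
```

As σ tends to zero the predicted distribution approaches a delta function at x_e, which corresponds to a network that is fully confident in its estimated location.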
The Softer-NMS algorithm builds on the NMS and the Soft-NMS by modeling the probability distribution of the detection box position. For overlapping detection boxes, a more accurate box is obtained by weighting the candidates: boxes with high overlap and small variance in their position distribution receive large weights. In this way, the inaccuracy caused by the NMS directly deleting highly overlapping detection boxes is reduced. Consequently, false and missing detections in X-ray security inspection images are reduced and the detection accuracy is increased.
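One way to realize this weighting is the variance voting step of the Softer-NMS formulation in [12], sketched below in plain Python. The sketch assumes that each box comes with one predicted variance per coordinate; the temperature sigma_t = 0.02 and all function names are hypothetical choices for illustration, not values reported in this study.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def variance_vote(selected, boxes, variances, sigma_t=0.02):
    """Refine box `selected` as a weighted average of the boxes overlapping it.

    A neighbour weighs more when it overlaps the selected box strongly and
    when its predicted variance for the coordinate is small.
    """
    overlaps = [iou(boxes[selected], box) for box in boxes]
    refined = []
    for k in range(4):  # x1, y1, x2, y2
        num = den = 0.0
        for box, var, overlap in zip(boxes, variances, overlaps):
            if overlap <= 0:
                continue  # non-overlapping boxes do not vote
            weight = math.exp(-((1.0 - overlap) ** 2) / sigma_t) / var[k]
            num += weight * box[k]
            den += weight
        refined.append(num / den)
    return tuple(refined)


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
# The second box is assumed to be more certain about its x coordinates.
variances = [(1.0, 1.0, 1.0, 1.0), (0.25, 1.0, 0.25, 1.0), (1.0, 1.0, 1.0, 1.0)]
refined = variance_vote(0, boxes, variances)
```

Here the x coordinates of the selected box are pulled towards the neighbour with the smaller variance, while the distant third box has no influence.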
Deep learning has great potential for solving problems faced by traditional image processing and has strongly influenced the design of image processing algorithms for X-ray security inspection. As an important topic in deep learning, target detection provides valuable information for image analysis and understanding and can be applied effectively to dangerous object recognition in X-ray security inspection images.
However, one hard problem is that training a deep learning model requires a large set of hand-labeled security inspection images. Too few images, or images with too simple content, reduce the accuracy of the prediction model. Therefore, a composition method is desired to augment the existing security inspection image set.
We propose a method for compositing new and realistic X-ray security inspection images that has few steps and is convenient to implement. Based on digital image processing, the method composites a new X-ray security inspection image from two arbitrary input images, and can be used to augment the image set for training a deep learning model for dangerous object recognition. The flow diagram for compositing a new X-ray security inspection image is shown in Fig. 2.

Fig. 2. Flow chart for compositing a new and realistic image from two X-ray security inspection images.
As shown in Fig. 2, a new and realistic X-ray security inspection image is composited through the following steps: (1) arbitrarily select two color images from the existing X-ray security inspection image set, at least one of which contains dangerous objects; (2) carry out an affine transformation on either of them; (3) since compositing images directly from RGB (red, green and blue) components is complicated, transform the RGB components into HSV (hue, saturation and value) components; (4) properly fuse the HSV components of the two selected images to obtain the HSV components of the composite image; (5) convert the composite image from HSV components back to RGB components; and (6) output the composite image.
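The steps above can be sketched with the Python standard library alone. The sketch is hedged in two respects. First, the fusion rule is not specified in the text, so `fuse_hsv` uses a placeholder assumption: per pixel, keep the one that looks more attenuated (lower value, ties broken by higher saturation). Second, the white background, the nearest-neighbour sampling, the equal image sizes and all function names are likewise assumptions made for illustration.

```python
import colorsys

WHITE = (255, 255, 255)  # assumed background colour of an empty scan


def affine_transform(image, matrix, fill=WHITE):
    """Apply a 2x3 affine matrix by inverse mapping with nearest-neighbour sampling."""
    (a, b, tx), (c, d, ty) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine matrix is not invertible")
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Invert (x', y') = A (x, y) + t to find the source pixel.
            sx = (d * (x - tx) - b * (y - ty)) / det
            sy = (-c * (x - tx) + a * (y - ty)) / det
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = image[iy][ix]
    return out


def to_hsv(pixel):
    """8-bit RGB tuple -> (h, s, v) with components in [0, 1]."""
    r, g, b = (v / 255.0 for v in pixel)
    return colorsys.rgb_to_hsv(r, g, b)


def to_rgb(hsv):
    """(h, s, v) in [0, 1] -> 8-bit RGB tuple."""
    return tuple(int(round(v * 255)) for v in colorsys.hsv_to_rgb(*hsv))


def fuse_hsv(p, q):
    """Placeholder fusion rule (assumption): keep the more attenuated-looking pixel."""
    return min(p, q, key=lambda hsv: (hsv[2], -hsv[1]))


def composite(image_a, image_b, matrix):
    """Warp image_b, fuse it with image_a in HSV space, and return an RGB image."""
    warped = affine_transform(image_b, matrix)
    return [
        [to_rgb(fuse_hsv(to_hsv(pa), to_hsv(pb))) for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, warped)
    ]


# Toy 2x2 images: one coloured pixel each on a white background.
image_a = [[(0, 0, 255), WHITE], [WHITE, WHITE]]
image_b = [[(255, 128, 0), WHITE], [WHITE, WHITE]]
shift_right = [[1, 0, 1], [0, 1, 0]]  # translate image_b by one pixel in x
result = composite(image_a, image_b, shift_right)
```

When a composite image is used as a positive training sample, the segmentation mask of the transformed image would need to go through the same affine transformation so that the labels stay aligned with the pixels.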
An example of compositing a new and realistic image from two X-ray security inspection images is shown in Fig. 3.

Fig. 3. An illustrative instance of compositing a new and realistic image from two X-ray security inspection images.
Experiment platform
The experiments were run under Ubuntu 20.04. The machine has 256 GB of memory, an Intel(R) Xeon(R) CPU 4216 @ 2.40GHz, and an RTX 2080 Ti GPU with 11 GB of video memory. The Python version is 3.6, and CUDA 10.0 and cuDNN 7.6.5 are installed to support the GPU. The neural network is built with the TensorFlow framework, version 1.14.0.
Dataset
The dataset used in this study is the latest security inspection image dataset, SIXray [18], published by the Chinese Academy of Sciences. It consists of X-ray scanned baggage images acquired in the real setting of a subway security inspection station, as shown in Fig. 4.

Fig. 4. Typical image instances from the X-ray security inspection image set.
The dataset contains 1,059,231 images, of which 1,050,302 are scanned images of baggage without dangerous objects and 8,929 are scanned images of baggage containing dangerous objects. Dangerous objects are divided into five categories. Because some images contain more than one dangerous object, the sum of the objects over all categories is greater than the number of images containing dangerous objects. As shown in Fig. 5, the five categories are Gun, Knife, Wrench, Pliers and Scissors, with 3,131, 1,943, 2,199, 3,961 and 983 objects, respectively. This study uses the 8,929 scanned images containing dangerous objects together with 17,674 scanned images without dangerous objects, a ratio of approximately 1 : 2. About 10% of these images are selected as the test set, and the remaining images form the training set.

Fig. 5. Statistical chart of the numbers of dangerous objects in the different categories.
For the existing dataset, a number of positive samples are composited with the proposed X-ray image composition method, so that the ratio of positive to negative samples in the dataset is close to 1 : 1. The statistical distributions of the datasets used for the different models are listed in Table 1.
Table 1. The statistical distributions of the datasets used for the different models
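As a rough check of the figures quoted in the text, the short calculation below derives the positive-to-negative ratio, the approximate test set size, and the number of composite positives that an exact 1 : 1 balance over the whole image set would require. The last number is only an upper-bound estimate under that assumption; the counts actually used are those of Table 1, which are not reproduced here.

```python
positives = 8929    # images containing dangerous objects (from the text)
negatives = 17674   # images without dangerous objects (from the text)

total = positives + negatives
ratio = negatives / positives               # negatives per positive image
test_size = round(0.10 * total)             # about 10% held out for testing
composites_for_balance = negatives - positives  # positives missing for 1 : 1
```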
For the X-ray security inspection images in this dataset, an image labeling tool, Labelme, is used to label the contours and category information of dangerous objects in the images, and a JSON file is then generated to store the labeling information of each image. Four files (i.e., image, label viz, mask and yaml) are obtained by parsing the JSON file, of which the mask carries the semantic contour information used to train the model, as shown in Fig. 6.

Fig. 6. Typical annotation instances from the X-ray security inspection image set.
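The conversion from a Labelme annotation to per-object masks can be sketched as follows. The sketch assumes a Labelme JSON file that carries the `imageHeight`, `imageWidth` and `shapes` fields with polygon shapes; the rasterization by a point-in-polygon test at pixel centres and the function names are illustrative choices, not the parsing script used in this study.

```python
import json


def point_in_polygon(x, y, polygon):
    """Even-odd rule: count how many polygon edges a ray to the right crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def labelme_to_masks(annotation):
    """Return a list of (label, binary mask) pairs, one per polygon shape."""
    h, w = annotation["imageHeight"], annotation["imageWidth"]
    instances = []
    for shape in annotation["shapes"]:
        if shape.get("shape_type", "polygon") != "polygon":
            continue  # other shape types are skipped in this sketch
        mask = [
            [1 if point_in_polygon(x + 0.5, y + 0.5, shape["points"]) else 0
             for x in range(w)]
            for y in range(h)
        ]
        instances.append((shape["label"], mask))
    return instances


# Hypothetical annotation: one square "Knife" polygon in a 4x4 image.
example = json.loads(
    '{"imageHeight": 4, "imageWidth": 4, "shapes": [{"label": "Knife", '
    '"shape_type": "polygon", "points": [[1, 1], [3, 1], [3, 3], [1, 3]]}]}'
)
instances = labelme_to_masks(example)
label, mask = instances[0]
```

Keeping one mask per polygon, rather than a single merged label image, preserves the separation between overlapping objects that instance segmentation requires.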
This study uses the Mask RCNN trained on the COCO dataset as a pre-training model, and carries out three experiments with X-ray security inspection images, namely: the Mask RCNN experiment, the Softer-Mask RCNN experiment, and the Softer-Mask RCNN experiment with the composite images. The parameters used in the three experiments remain unchanged.
ResNet-101 is selected as the feature extraction network in this paper. First, the Mask RCNN model is used to recognize security inspection images; the model is then modified into the Softer-Mask RCNN model; finally, composite images are added for data augmentation. Pre-trained weights are loaded before training and optimized with the SGD optimizer. The momentum is 0.9, the initial learning rate is 0.001, and the batch size is 16; training runs for 150 epochs in total.
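For reference, one update step of SGD with momentum using the hyperparameters above can be sketched as below. The common non-Nesterov form is assumed; the text does not state which variant or framework optimizer class was used, and the function name and toy gradient are hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update: v <- momentum * v - lr * g, then w <- w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy example: a single weight with a constant gradient of 2.0 for two steps.
weights, velocity = [1.0], [0.0]
weights, velocity = sgd_momentum_step(weights, [2.0], velocity)
weights, velocity = sgd_momentum_step(weights, [2.0], velocity)
```

With a constant gradient the step size grows from one update to the next, which illustrates how the momentum term accelerates descent along a consistent direction.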
Experimental results
The loss curves of the training process are shown in Fig. 7; the loss values of the three experiments all decline rapidly with almost the same trend. Before epoch 20, the loss decreases rapidly. Between epochs 20 and 80, it decreases relatively slowly. After epoch 80, it tends to be stable. The final loss values of the three experiments all converge to ∼0.01. The experimental results show that all the models converge, and that the third model gives the best training result.

Fig. 7. Loss curves of the three network models during training.
In order to quantitatively evaluate the performance of the trained models, mAP is used as the metric; it measures the average recognition accuracy over all categories of dangerous objects, i.e., the mean of the per-category average precision (AP) values.
The experiments in this study use the same test set, and the test results for the trained models are listed in Table 2. Using Mask RCNN alone already performs well in detecting dangerous objects: it recognizes the categories and locations of dangerous objects fairly accurately, but its performance is slightly worse than that of the second model (i.e., the Softer-Mask RCNN). The essential difference between the two is the use of the NMS versus the Softer-NMS. The test results show that the accuracy of the model combined with the Softer-NMS is 3.4% higher than that of the original detection model. This shows that the Softer-NMS effectively alleviates missing and false detections, which improves the target detection performance. The third experiment builds on the second and adds a certain proportion of composite images to augment the data. As listed in Table 2, the overall detection accuracy improves by a further 2.8% after data augmentation, and the recognition accuracy of each category of dangerous objects also improves. This shows that the improved model trained with the composite images performs best, which is beneficial to improving the detection accuracy of dangerous objects in X-ray security inspection.
Table 2. The results of comparative experiments
To better show the training results, we plot the confusion matrices for the different models in Fig. 8(a1)-(a3). These confusion matrices show the number of correctly and incorrectly detected dangerous objects for each model. In addition, the P-R curves for the five categories of dangerous objects are given in Fig. 8(b1)-(b5), for which the IOU threshold was 0.5. The horizontal coordinate is the precision (P) and the vertical coordinate is the recall (R). The area under the P-R curve is the average precision (AP), and the mean of the AP values over all categories is the mAP. With TP, FP and FN denoting the numbers of true positives, false positives and false negatives, precision and recall are defined as P = TP / (TP + FP) and R = TP / (TP + FN).

Fig. 8. The confusion matrices for (a1) Mask RCNN; (a2) Softer-Mask RCNN; (a3) Softer-Mask RCNN+Composite images; and the P-R curves for (b1)-(b5) the different dangerous objects.
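The evaluation quantities used here (precision, recall, AP as the area under the P-R curve, and mAP as the mean AP over categories) can be computed as in the following sketch. It assumes that each detection has already been marked as a true or false positive at the IOU threshold of 0.5, and it uses the common all-point interpolation of the P-R curve; the exact AP protocol of this study is not stated in the text, and the example numbers are made up for illustration.

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN), with empty denominators mapped to 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(detections, num_ground_truth):
    """AP for one category from a list of (score, is_true_positive) pairs."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        p, r = precision_recall(tp, fp, num_ground_truth - tp)
        precisions.append(p)
        recalls.append(r)
    # Replace each precision by the best precision reached at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the interpolated P-R curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-category AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Made-up example: three ranked detections for a category with two objects.
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
m_ap = mean_average_precision({"Gun": 1.0, "Knife": 0.5})
```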
The visualization of the experimental results is shown in Fig. 9, wherein Fig. 9(a)-(b) are the original images and their corresponding label images. Figure 9(c)-(e) are the segmentation results of the Mask RCNN, the Softer-Mask RCNN and the Softer-Mask RCNN+Composite images, respectively.

Fig. 9. The results of the three methods. (a) the original images; (b) the label images; (c) the Mask RCNN; (d) the Softer-Mask RCNN; and (e) the Softer-Mask RCNN+Composite images.
By comparing Fig. 9(c) and (e), it can be seen that the Softer-Mask RCNN model effectively alleviates the false detection problem of the original model. By comparing Fig. 9(c)-(d) and (d)-(e), it can be observed that the Softer-Mask RCNN model detects not only the dangerous objects missed by the original model but also the overlapping dangerous objects, which indicates that this model is effective to some degree in recognizing overlapping dangerous objects.
In order to compare the effectiveness and efficiency of the different models, their performances are evaluated and listed in Table 3.
Table 3. Performance comparison of different networks
As listed in Table 3, the mAP of the Softer-Mask RCNN model is 7.7% higher than that of the single shot multibox detector (SSD) [19], 6% higher than that of you only look once v3 (YOLOv3) [20], and 5.7% higher than that of Faster RCNN. With composite images added, the mAP of the Softer-Mask RCNN model is 10.5% higher than that of SSD, 8.8% higher than that of YOLOv3, and 8.5% higher than that of Faster RCNN. The improved model is clearly superior to the other models, which indicates that it effectively improves the recognition of dangerous objects in X-ray security inspection images. However, as the depth of the network increases, the computation speed of the model drops substantially. Our model is slower than the lightweight networks (e.g., SSD and YOLOv3), which remains a major challenge for our future work.
Although most current research on the recognition of dangerous objects in security inspection uses target detection algorithms [21, 22], this study applies a classical instance segmentation model, which segments the contours of dangerous objects and predicts their categories more intuitively. From the experimental results shown in Fig. 8, it can be observed that the proposed model has good recognition performance. To further improve the segmentation accuracy of the Mask RCNN model, we combined the instance segmentation model with the Softer-NMS algorithm, so that the model effectively alleviates missing and false detections, thereby improving accuracy. We also proposed a new and realistic image composition method to address the imbalance of the X-ray security inspection dataset. The method is designed for X-ray RGB images and composites new X-ray images through affine transformation and conversion between the RGB and HSV color spaces, which makes the dataset more balanced. Moreover, the proposed method can increase the number of overlapping dangerous objects in the dataset, improving the accuracy of the model on overlapping dangerous objects. The experimental and analysis results preliminarily verify the performance of the proposed method. A few missing detections remain to be solved in future research.
Real X-ray security images contain various other types of dangerous objects, such as toxic powders and explosives, which are not covered in this work. The dataset in this paper is designed only for five typical categories of dangerous objects, namely guns, knives, wrenches, pliers, and scissors, mainly for the purpose of algorithmic validation. This is also due to the limited access to other types of dangerous objects, and we will try to legally acquire their images in the future. Furthermore, the dangerous objects in the dataset used in this paper are unbalanced; for example, the number of scissors is relatively small among all dangerous objects considered. We therefore used the proposed composition method to augment the data, so that the recognition accuracy for scissors improved noticeably, even though it remains slightly lower than for the other dangerous objects. The underlying reason may be that the shape of scissors is more variable than that of other items, making them difficult to distinguish in X-ray images.
The proposed method can effectively segment the dangerous objects. However, the network model is slightly large, and the segmentation accuracy requires further improvement. In the future, we will collect more images for different categories of security inspection dangerous objects and study the methods to further simplify the network structure and improve the segmentation accuracy.
Conclusion
In this paper, we investigated the recognition of dangerous objects in X-ray security inspection images, a promising industrial application that remains little studied. To facilitate research in this field, we contribute a high-quality, purely manual segmentation labeling of an existing X-ray security inspection dataset, SIXray. Additionally, we proposed a deep learning method for recognizing dangerous objects in X-ray security inspection images. The method takes the Mask RCNN instance segmentation algorithm as the original model and combines it with the Softer-NMS to obtain the Softer-Mask RCNN model, which improves the overall performance compared with the original model. Finally, we proposed a data augmentation method specifically for X-ray security inspection images, which augments the dataset to enhance prediction performance. The experimental results showed that the proposed method performs well in recognizing dangerous objects in X-ray security inspection images, effectively alleviating missing and false detections and improving the detection accuracy.
Acknowledgments
The work is supported in part by Shaanxi Provincial Natural Science Foundation of China (Grant No. 2020SF377), Xi’an Key Laboratory of Advanced Controlling and Intelligent Processing, China (2019220714SYS022CG044), and National Natural Science Foundation of China (No. 62071378).
