Intrusion detection of railway clearance from infrared images using generative adversarial networks

Abstract

The intrusion detection of railway clearance is crucial for avoiding railway accidents caused by the invasion of abnormal objects, such as pedestrians, falling rocks, and animals. However, detecting intrusions using deep learning methods from infrared images captured at night remains a challenging task because of the lack of sufficient training samples. To address this issue, a transfer strategy that migrates daytime RGB images to the nighttime style of infrared images is proposed in this study. The proposed method consists of two stages. In the first stage, a data generation model is trained on the basis of generative adversarial networks using RGB images and a small number of infrared images, and then, synthetic samples are generated using a well-trained model. In the second stage, a single shot multibox detector (SSD) model is trained using synthetic data and utilized to detect abnormal objects from infrared images at nighttime. To validate the effectiveness of the proposed method, two groups of experiments, namely, railway and non-railway scenes, are conducted. Experimental results demonstrate the effectiveness of the proposed method, and an improvement of 17.8% is achieved for object detection at nighttime.

Keywords

Railway clearance infrared image detection CycleGAN SSD

1 Introduction

The safe operation of railways has become challenging with the rapid development of railway transportation networks. To ensure the safety of the track, Yang et al. proposed a void-disease identification algorithm to promote the application of ground-penetrating radars to detect ballastless track subgrade diseases in high-speed railways [1]. In contrast, we address railway safety from the perspective of effective railway intrusion detection [2, 3]. Approaches for the intrusion detection of railway clearance can be divided into contact and noncontact approaches. Contact detection methods include dual grid and fiber grating detection [4]. Noncontact methods include infrared radiation technology [5] and image processing-based detection [6]. Among these methods, strategies based on image processing exhibit application potential. Target detection techniques, such as scale-invariant feature transform (SIFT) [7], histogram of oriented gradients (HOG) [8], and Haar features [9], are used to process images captured by video surveillance systems. However, these methods for obtaining features through image processing are susceptible to interference from the complex environment of a railway scene, such as uneven lighting, occlusion, and noise. Because of this poor anti-interference ability, their detection performance fails to meet the requirements of actual use. Therefore, more reliable and effective detection methods are needed.

Deep learning has recently dominated the fields of computer vision, natural language processing, and automatic control. The hierarchical features extracted using deep learning are more discriminative and representative compared with those extracted using other methods. Hence, several prominent object detection frameworks, such as regions with convolutional neural networks [10], you only look once [11], and single shot multibox detector [12], have been proposed on the basis of convolutional neural networks (CNNs) [13]. These deep learning-based methods with enhanced anti-interference ability have been successfully applied to object detection in the railway scene. In these methods, a deep learning model is derived by training on a large number of railway scene images; the model is then optimized to achieve good results in accordance with the particularity of a railway scene [14–16].

Although object detection using visible RGB images has been considerably addressed for the intrusion detection of daytime railway clearance, object detection using infrared images at nighttime remains a challenging task due to the lack of sufficient training samples of infrared images of railway scenes. A comparison between visible RGB and infrared images of a railway scene is shown in Fig. 1. A normal deep learning model experiences difficulty in detecting objects, such as falling rocks, people, and trains, from infrared images because of the lack of color texture information. To solve this problem, we propose a transfer learning method that generates composite images in the style of infrared images and then uses these composite images to train the SSD model. From the perspective of transfer learning, visible RGB images captured during the day are used as the source domain, and infrared images captured at night are used as the target domain. Our method includes two stages: generating training samples and detecting targets. The first stage is inspired by the image generation function of a generative adversarial network (GAN) [17], and a sample augmentation approach based on CycleGAN [18] is proposed. This method generates a sufficient number of composite images provided that a large number of visible RGB images are available in the source domain. A composite image with a style similar to an infrared image in the target domain is then used as the training sample. In the second stage, the successful SSD model is utilized. We modify the SSD network to make it suitable for railway intrusion detection, and then the model is trained using the labeled samples obtained during the first stage.

Fig. 1

Image comparison. (a) RGB railway image. Lighting conditions are good, and objects are easily visible. (b) Infrared railway image. Images are acquired at night. The light is uneven, and the objects are unclear.

The remainder of this article is organized as follows. Related studies are presented in Section 2. A data augmentation method based on the CycleGAN algorithm is proposed in Section 3. A method for railway intrusion detection based on the SSD model is introduced in Section 4. The experimental results are discussed and analyzed in Section 5. Lastly, the conclusions of our work are drawn in Section 6.

2 Related work

Deep learning algorithms owe their effectiveness to their ability to learn expressive features from data; however, they require the support of a large amount of data [19]. In practical applications, collecting a large amount of data in advance is a necessary condition for training deep learning models, and only a sufficient amount of data can meet the needs of practical applications. However, obtaining training data for special scenarios, such as rain, snow, and night, is difficult. The performance of a model trained without sufficient data is inadequate. Therefore, investigating how deep learning models can be trained using a small amount of data is important. Common data expansion approaches are mostly based on image processing methods, such as mirroring, rotation, and random cropping. These strategies are simple, easy to implement, and can increase the amount of data. However, these methods can occasionally exert negative effects, for example, when a vertical flip is applied to images used for facial recognition. Therefore, a data expansion method based on image processing cannot fully solve the problem of small training data and can only be used as a method for producing supplementary data.

With the rapid development of deep learning, new methods with stronger practicability and effectiveness have provided faster solutions for data expansion. In the classification of tumor gene expression data, Jian et al. indicated that the application of deep learning is rare due to insufficient training samples for gene expression data [20]. Accordingly, they proposed a sample expansion method to solve this problem. Inspired by the concept of a denoising autoencoder (DAE) [21], their method obtains a large number of samples by randomly cleaning partially corrupted inputs. The extended samples obtained using this method can not only maintain the advantages of corrupted data in DAE but also solve the problem of insufficient training samples for gene expression data to a certain extent. Li et al. investigated the object detection problem in small and complex indoor scenes [22]. They proposed a target detector based on the deep learning of small samples by considering the small sample size of an indoor scene, the complex background, and the poor effect of object detection. This method uses a synthetic sample generator to automatically enhance the training samples. First, the target area is extracted from the training set. Then, the target area is segmented to obtain the foreground target. Finally, the random foreground target and the background image are fused to generate a composite image.

Data expansion has progressed considerably with the emergence and development of GANs [23–25]. Goodfellow et al. proposed GAN, which consists of two networks, namely, a generator network and a discriminator network. The generator takes a D-dimensional noise vector as input and converts it into an image. Zhu et al. proposed a data enhancement method using GAN to address the small amount of labeled data and the uneven label distribution [26]. Their method can complement and complete data manifolds and find good margins between neighboring classes. With the further development and improvement of GAN, the DCGAN [27] and CGAN [28] architectures were presented. DCGAN replaces the generator and discriminator structures in GAN with two CNNs and modifies the CNN framework accordingly. Diaz-Pinto et al. explored the glaucoma assessment system and found that its performance was highly influenced by the number of labeled images used in the training phase. They synthesized retinal fundus images by training a variational autoencoder [29] and the DCGAN model on 2357 retinal images, which improved system performance [30]. CGAN adds conditional constraints to an ordinary GAN; for example, adding “train” as a condition makes the GAN generate train images. Generating a required picture from input text used as the condition can therefore be realized through CGAN. With the emergence of the CycleGAN network, the capabilities of GANs have been further enhanced. Zhu et al. used a pair of GANs to construct a ring network structure that realizes conversion between images of two different domains, such as between landscapes and oil paintings and between horses and zebras. This feature provides a good direction for expanding data. Fang Liu applied CycleGAN as the core of a method to perform unpaired image-to-image translation between different MR image datasets [31]. The new technique further improved the applicability and efficiency of CNN-based segmentation of medical images. Liang et al. used CycleGAN to synthesize CT images from CBCT images [32]. They also compared the CycleGAN model with DCGAN and PGGAN and showed that CycleGAN is superior to the other two models. A modified CycleGAN that generates an even distribution of heterogeneous face data was proposed in [33]; combined with other methods proposed by its authors, it improved the accuracy of age estimation.

In the current study, the SSD algorithm is used for target detection in night railway scenes. Samples of daytime railway scenes are converted using the CycleGAN network into samples of nighttime railway scenes to augment the data. This approach solves the problem of insufficient training samples and allows the SSD model to be trained efficiently to improve detection accuracy.

3 Generation of synthetic images using GANs

3.1 CycleGAN

A GAN generates images from noise so that the generated images follow a specific data distribution. The traditional GAN is unidirectional and includes a generator G and a discriminator D. The generator and the discriminator compete with each other during the training process: the generator continuously improves its ability to generate fake data, and the discriminator continuously improves its ability to distinguish between true and fake data. In the end, the network reaches a dynamic balance and produces images that meet requirements.

CycleGAN is essentially a ring network composed of two pairs of mirrored GANs. The structure contains two generators G and F and two discriminators D_A and D_B. The core idea of CycleGAN involves the use of the paired GANs to realize mutual conversion between two domain data. Figure 3 shows the network structure of CycleGAN.

Fig. 2

Flow chart of the proposed approach.

Fig. 3

Structure diagram of CycleGAN. The CycleGAN network includes generators G and F and discriminators D_A and D_B. Generator G converts the A domain image into the B domain image, and generator F converts the B domain image into the A domain image, i.e., the fake images B^f and A^f are generated. The generators can also convert the fake images back to the corresponding original domains, i.e., the reconstructed images A′ and B′ are created. Generator loss is calculated using A, A′ and B, B′. Discriminator loss is calculated using A, A^f and B, B^f. False and true data are denoted by 0 and 1, respectively.

In CycleGAN, if generator G converts the image style of the A domain to the B domain and generator F converts the image style of the B domain to the A domain, then G and F should be inverses of each other. That is, after an A domain image a is converted to G (a) by G, generator F should convert G (a) back to the A domain, so that F (G (a)) ≈ a. Similarly, a B domain image b should satisfy G (F (b)) ≈ b. A cycle consistency loss is used to incentivize this behavior, and it is expressed as $\begin{matrix} L_{cyc} (G, F) = E_{a \sim p_{data} (a)} [\| F (G (a)) - a \|_{1}] \\ + E_{b \sim p_{data} (b)} [\| G (F (b)) - b \|_{1}] \end{matrix}$ (1)
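As a minimal sketch of Eq. (1), the expectations can be approximated by averages over small batches. The toy "generators" below (simple brightness shifts on flat pixel lists) and all variable names are hypothetical illustrations, not the networks used in this study.

```python
def l1_norm(x, y):
    """L1 distance between two equally sized flat images (lists of floats)."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def cycle_consistency_loss(G, F, batch_a, batch_b):
    """Batch estimate of Eq. (1): mean ||F(G(a)) - a||_1 + mean ||G(F(b)) - b||_1."""
    forward = sum(l1_norm(F(G(a)), a) for a in batch_a) / len(batch_a)
    backward = sum(l1_norm(G(F(b)), b) for b in batch_b) / len(batch_b)
    return forward + backward


# Hypothetical toy generators: G brightens every pixel by 0.5, F darkens it.
G = lambda img: [p + 0.5 for p in img]
F_exact = lambda img: [p - 0.5 for p in img]    # exact inverse of G
F_biased = lambda img: [p - 0.25 for p in img]  # imperfect inverse of G

batch_a = [[0.0, 0.25], [0.5, 0.75]]  # two 2-pixel "images" from domain A
batch_b = [[1.0, 1.25]]               # one 2-pixel "image" from domain B

loss_exact = cycle_consistency_loss(G, F_exact, batch_a, batch_b)
loss_biased = cycle_consistency_loss(G, F_biased, batch_a, batch_b)
print(loss_exact, loss_biased)
```

The loss vanishes only when F undoes G (and vice versa) on the sampled images, which is the behavior the cycle consistency term rewards.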

For the mapping function F : B → A and its discriminator D_A, the adversarial loss is expressed as $\begin{matrix} L_{GAN} (F, D_{A}, B, A) = E_{a \sim p_{data} (a)} [\log D_{A} (a)] \\ + E_{b \sim p_{data} (b)} [\log (1 - D_{A} (F (b)))] \end{matrix}$ (2)

Similarly, for the mapping function G : A → B and its discriminator D_B, the adversarial loss is expressed as $\begin{matrix} L_{GAN} (G, D_{B}, A, B) = E_{b \sim p_{data} (b)} [\log D_{B} (b)] \\ + E_{a \sim p_{data} (a)} [\log (1 - D_{B} (G (a)))] \end{matrix}$ (3)

All the losses of the final network are added and expressed as follows: $\begin{matrix} L (G, F, D_{A}, D_{B}) = L_{GAN} (F, D_{A}, B, A) \\ + L_{GAN} (G, D_{B}, A, B) + L_{cyc} (G, F) \end{matrix}$ (4)
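The way the three terms of Eq. (4) combine can be sketched as follows. This is an illustrative batch estimate under simplifying assumptions: images are reduced to single scalars, and the generators and discriminators are hypothetical stand-in functions rather than trained networks. In adversarial training, the discriminators try to increase the two GAN terms while the generators try to decrease the total.

```python
import math


def gan_loss(D, real_batch, fake_batch):
    """Batch estimate of E[log D(real)] + E[log(1 - D(fake))], as in Eqs. (2) and (3)."""
    real_term = sum(math.log(D(x)) for x in real_batch) / len(real_batch)
    fake_term = sum(math.log(1.0 - D(x)) for x in fake_batch) / len(fake_batch)
    return real_term + fake_term


def full_objective(G, F, D_A, D_B, batch_a, batch_b):
    """Eq. (4): both adversarial terms plus the cycle consistency term."""
    fake_a = [F(b) for b in batch_b]  # B -> A translations, judged by D_A
    fake_b = [G(a) for a in batch_a]  # A -> B translations, judged by D_B
    cyc = (sum(abs(F(G(a)) - a) for a in batch_a) / len(batch_a)
           + sum(abs(G(F(b)) - b) for b in batch_b) / len(batch_b))
    return gan_loss(D_A, batch_a, fake_a) + gan_loss(D_B, batch_b, fake_b) + cyc


# Hypothetical stand-ins: scalar "images", shift generators, undecided discriminators.
G = lambda x: x + 1.0
F = lambda x: x - 1.0
D_A = lambda x: 0.5
D_B = lambda x: 0.5

total = full_objective(G, F, D_A, D_B, batch_a=[0.0, 0.5], batch_b=[1.0, 1.5])
print(total)
```

With discriminators that always output 0.5 and generators that invert each other exactly, the cycle term is zero and the total reduces to 4 log 0.5.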

The generator consists of an encoder, a converter, and a decoder. The encoder is a CNN that extracts features from the input image. The input image size is [1, 256, 256, 3]. After the three-layer convolution module, the output is a feature map of size 1×64×64×256. The converter uses nine ResNet modules. Each ResNet module is a neural network layer composed of two convolutional layers that can retain the original image features during the conversion process. The output size is still 1×64×64×256. The decoder consists of two deconvolution modules and one convolution module for recovering low-level features from the feature vectors. The output is generated by using the Tanh activation function to achieve the conversion between the source and target domains.
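The feature map sizes quoted above can be checked with the standard output-size formulas for convolution and transposed convolution. The kernel sizes, strides, paddings, and channel counts below are assumptions taken from the commonly used CycleGAN generator configuration; the text itself only states the input size, the 1×64×64×256 intermediate size, and the number of modules.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding, output_padding):
    """Spatial output size of a transposed convolution (deconvolution)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Assumed encoder layers: (kernel, stride, padding, out_channels).
ENCODER = [(7, 1, 3, 64), (3, 2, 1, 128), (3, 2, 1, 256)]
# Assumed decoder deconvolutions: (kernel, stride, padding, output_padding, out_channels).
DECODER = [(3, 2, 1, 1, 128), (3, 2, 1, 1, 64)]
# Assumed final convolution back to three channels: (kernel, stride, padding, out_channels).
FINAL_CONV = (7, 1, 3, 3)


def trace_generator(size=256, channels=3, num_res_blocks=9):
    """Return the [batch, height, width, channels] shape after each generator stage."""
    shapes = [(1, size, size, channels)]
    for k, s, p, c in ENCODER:
        size, channels = conv_out(size, k, s, p), c
        shapes.append((1, size, size, channels))
    for _ in range(num_res_blocks):
        # Two 3x3, stride-1, padding-1 convolutions per block keep the size unchanged.
        size = conv_out(conv_out(size, 3, 1, 1), 3, 1, 1)
    shapes.append((1, size, size, channels))
    for k, s, p, op, c in DECODER:
        size, channels = deconv_out(size, k, s, p, op), c
        shapes.append((1, size, size, channels))
    k, s, p, c = FINAL_CONV
    size, channels = conv_out(size, k, s, p), c
    shapes.append((1, size, size, channels))
    return shapes


shapes = trace_generator()
for shape in shapes:
    print(shape)
```

Under these assumed settings, the trace reproduces the 1×64×64×256 encoder and converter outputs and returns to the 256×256 three-channel input size after the decoder.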

The discriminator is a convolutional network. It extracts features from the image and then adds a convolutional layer that produces a 1D output to determine whether the extracted features belong to a particular category. The discriminator thus predicts whether an input image is an original image or an image output by the generator.

3.2 Generation of synthetic images

The CycleGAN network must be trained to generate images. We use daytime RGB and nighttime infrared images as source and target domain data, respectively. These two types of images are used as training samples to train the CycleGAN model. The source domain data comprise 2255 daytime RGB railway images, including 1800 training and 455 test images. The target domain data contain 400 nighttime infrared images as training data.

3.3 Model training and parameter setting

The CycleGAN model is used to transform the image style to ensure that data are changed from the daytime scene to the nighttime scene. The training effect of the CycleGAN model determines the amount of information contained in the training data of the SSD model, which will affect the learning effect of the SSD model.

The input image of the CycleGAN model is a three-channel RGB image with a default size of 256×256. Image preprocessing includes random cropping, random mirroring, and normalization operations. Most parameters in the CycleGAN training process are default values. The initial learning rate is set to $2 \times 10^{-4}$, the decay_epoch is set to 100, after which the learning rate decays linearly to zero, and the maximum epoch is set to 200. The optimizer is the Adam [34] algorithm with hyperparameters β1 = 0.5 and β2 = 0.999. Model training is completed in Windows 7 with PyTorch 0.4.1 [35].
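One plausible reading of these settings is the schedule commonly used with CycleGAN: a constant learning rate until decay_epoch, followed by a linear decay that reaches zero at the maximum epoch. The sketch below implements that reading as an assumption; the function name and the exact epoch indexing are illustrative and are not taken from the training code of this study.

```python
def learning_rate(epoch, base_lr=2e-4, decay_epoch=100, max_epoch=200):
    """Constant learning rate before decay_epoch, then linear decay to zero at max_epoch."""
    if epoch < decay_epoch:
        return base_lr
    remaining = (max_epoch - epoch) / (max_epoch - decay_epoch)
    return base_lr * max(0.0, remaining)


for epoch in (0, 99, 100, 150, 200):
    print(epoch, learning_rate(epoch))
```

In PyTorch, such a function is typically turned into a multiplicative factor and passed to a scheduler; that wiring is omitted here to keep the sketch dependency-free.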

Figure 6 shows that image quality gradually improves and the capabilities of the generator and discriminator increase as the number of iterations increases. After training is completed, we obtain CycleGAN models from different training stages and use the generator models to generate a large amount of target domain data. The target domain data generated by models saved at different iterations are considerably different. This substantial difference can enrich the characteristics of the data, make the target domain data sufficient, provide additional information, and improve learning results. Figure 7 compares the original image with the generated image. The CycleGAN algorithm runs on a computer equipped with an NVIDIA GeForce RTX 2080 graphics card. Converting one RGB picture to an infrared-style picture takes 78.2 ms.

Fig. 4

Generator structure.

Fig. 5

Discriminator structure.

Fig. 6

Image generation process. The daytime RGB image is at the top row. The middle row image is generated after 20 training iterations. The bottom row image is generated after 60 training iterations.

Fig. 7

Images generated using our data augmentation method. (a) Daytime images of non-railway scenes. (b) Generated infrared-style images of non-railway scenes. (c) Daytime images of railway scenes. (d) Generated infrared-style images of railway scenes.

4 Railway intrusion detection using SSD model

This study trains the SSD model by using a large number of generated nighttime infrared-style images of railway scenes to realize the detection of railway foreign objects in a nighttime scene.

The SSD algorithm is a one-stage detection method based on a feed-forward convolutional network. This algorithm uses multiscale feature maps for predictions at different scales and predicts on the basis of default boxes with different aspect ratios to achieve fast and effective detection. The SSD object detection procedure is described as follows. First, a picture is input to the SSD network and passed through a convolutional neural network (CNN) to extract features and generate feature maps. Then, six layers of feature maps are extracted, and default boxes are generated at each point of each feature map. Finally, all the generated default boxes are collected, and the final boxes are selected by non-maximum suppression (NMS) and used as the output result. Figure 9 shows the network structure of the SSD model.

Fig. 8

Flowchart of SSD.

Fig. 9

SSD network structure.

The SSD network adopts VGG-16 as the basic network. The VGG-16 network is characterized by its small convolution kernels and sufficient number of network layers, which can contribute to achieving a good feature extraction function. VGG16 contains 13 convolutional layers, 3 fully connected layers and 5 pooling layers. The size of the convolution kernel used in the convolution layer is 3, that is, both width and height are 3. The pooling layer uses the method of maximum pooling. Both the convolutional layer and the fully connected layer use Rectified Linear Unit (ReLU) as the activation function. The diagram of VGG16 is shown in Fig. 10.

Fig. 10

Diagram of VGG16 Network.

After establishing VGG-16, SSD is connected to a plurality of convolutional feature layers of different sizes, including conv4_3, fc7, conv6_2, conv7_2, conv8_2, and conv9_2. The prediction values of multiple scales can also be obtained using these auxiliary convolutional layers to detect targets of different sizes. The number of a priori boxes set for each feature map is different. The settings of an a priori box include scale and aspect ratio. The scale of the a priori box increases linearly as the size of the feature map decreases and is calculated as follows: $\begin{matrix} S_{i} = S_{\min} + \frac{S_{\max} - S_{\min}}{m - 1} (i - 1), i \in [1, m], \end{matrix}$ (5) where S_max represents the scale of the a priori boxes on the highest-level feature map, S_min denotes the scale of the a priori boxes on the lowest-level feature map, and m refers to the number of feature maps.
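Eq. (5) can be evaluated directly. The sketch below assumes S_min = 0.2 and S_max = 0.9, the values commonly used with SSD, and m = 6 feature maps; the two scale bounds are assumptions, because the text does not state the values used in this study.

```python
def prior_scales(s_min, s_max, m):
    """Scales S_1..S_m of the a priori boxes according to Eq. (5)."""
    return [s_min + (s_max - s_min) / (m - 1) * (i - 1) for i in range(1, m + 1)]


# Assumed bounds (relative to the input image size) and six feature maps.
scales = prior_scales(s_min=0.2, s_max=0.9, m=6)
print([round(s, 2) for s in scales])
```

The scale grows by the same step from one feature map to the next, so the deepest (smallest) feature map is responsible for the largest boxes.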

To determine aspect ratio, a_r∈ { 1, 2, 3, 1/2, 1/3 } is generally selected. For a specific aspect ratio, the formula for calculating the width and height of the a priori box is expressed as follows: $\begin{matrix} W_{i}^{a} = S_{i} \sqrt{a_{r}}, H_{i}^{a} = S_{i} / \sqrt{a_{r}}, \end{matrix}$ (6) where S_i refers to the actual scale of the a priori box. In addition, when the aspect ratio is 1, an a priori box is added on the basis of the five original a priori boxes with the scale of $S_{i}^{'} = \sqrt{S_{i} S_{i + 1}}$ and a_r = 1 to ensure that morphologically rich targets can be accurately detected.
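A short sketch of Eq. (6) and of the additional square box follows. The example scales 0.2 and 0.34 are illustrative assumptions (the first two scales obtained from Eq. (5) with S_min = 0.2, S_max = 0.9, and m = 6), and the function name is hypothetical.

```python
import math


def prior_box_shapes(s_i, s_next, aspect_ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """Return (width, height) pairs for one feature map location.

    The first five boxes follow Eq. (6); the sixth is the extra box with
    a_r = 1 and scale sqrt(S_i * S_{i+1}).
    """
    shapes = [(s_i * math.sqrt(a), s_i / math.sqrt(a)) for a in aspect_ratios]
    s_extra = math.sqrt(s_i * s_next)
    shapes.append((s_extra, s_extra))
    return shapes


shapes = prior_box_shapes(s_i=0.2, s_next=0.34)
for width, height in shapes:
    print(round(width, 3), round(height, 3))
```

Every box from Eq. (6) has the same area S_i squared; only its elongation changes with the aspect ratio, whereas the extra box is a slightly larger square.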

The loss function of SSD is defined as the weighted sum of localization loss and confidence loss as follows: $\begin{matrix} L (x, c, l, g) = \frac{1}{N} (L_{conf} (x, c) + \alpha L_{loc} (x, l, g)) \end{matrix}$ (7) where N is the number of positive (matched) a priori boxes, and the weight coefficient α is set to 1 through cross-validation. The localization loss is the Smooth_L1 loss as follows: $\begin{matrix} L_{loc} (x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} {smooth}_{L1} (l_{i}^{m} - {\hat{g}}_{j}^{m}), \\ {\hat{g}}_{j}^{cx} = \frac{g_{j}^{cx} - d_{i}^{cx}}{d_{i}^{w}}, {\hat{g}}_{j}^{cy} = \frac{g_{j}^{cy} - d_{i}^{cy}}{d_{i}^{h}}, \\ {\hat{g}}_{j}^{w} = \log (\frac{g_{j}^{w}}{d_{i}^{w}}), {\hat{g}}_{j}^{h} = \log (\frac{g_{j}^{h}}{d_{i}^{h}}), \\ {smooth}_{L1} (x) = \begin{cases} 0.5 x^{2} & \text{if } | x | < 1 \\ | x | - 0.5 & \text{otherwise} \end{cases} \end{matrix}$ (8)
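The localization term of Eq. (8) can be sketched for a single matched pair of a priori box and ground-truth box. Boxes are written as (cx, cy, w, h) tuples; the example coordinates and the function names are illustrative assumptions.

```python
import math


def smooth_l1(x):
    """Smooth L1 function from Eq. (8)."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def encode(gt, prior):
    """Encode a ground-truth box g relative to an a priori box d, both as (cx, cy, w, h)."""
    g_cx = (gt[0] - prior[0]) / prior[2]
    g_cy = (gt[1] - prior[1]) / prior[3]
    g_w = math.log(gt[2] / prior[2])
    g_h = math.log(gt[3] / prior[3])
    return (g_cx, g_cy, g_w, g_h)


def localization_loss(predicted, gt, prior):
    """Sum of smooth L1 terms over cx, cy, w, h for one matched box pair."""
    target = encode(gt, prior)
    return sum(smooth_l1(l - g) for l, g in zip(predicted, target))


prior = (0.5, 0.5, 0.2, 0.2)       # hypothetical a priori box
gt = (0.6, 0.5, 0.2, 0.4)          # hypothetical ground-truth box
predicted = (0.0, 0.0, 0.0, 0.0)   # network predicts "no offset"

loss = localization_loss(predicted, gt, prior)
print(round(loss, 4))
```

The quadratic branch keeps gradients small for nearly correct offsets, and the linear branch limits the influence of boxes that are far from their targets.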

The confidence loss is the softmax loss: $\begin{matrix} L_{conf} (x, c) = - \sum_{i \in Pos}^{N} x_{ij}^{p} \log ({\hat{c}}_{i}^{p}) - \sum_{i \in Neg} \log ({\hat{c}}_{i}^{0}), \\ \text{where } {\hat{c}}_{i}^{p} = \frac{\exp (c_{i}^{p})}{\sum_{p} \exp (c_{i}^{p})} \end{matrix}$ (9)
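A minimal sketch of Eq. (9) follows. It assumes that class index 0 denotes the background, that each positive box comes with the class index of its matched ground truth, and that each negative box should be classified as background; the data layout and names are illustrative choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(c - peak) for c in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_loss(positives, negatives):
    """Eq. (9): positives is a list of (logits, class_index); negatives is a list of logits."""
    loss = 0.0
    for logits, label in positives:
        loss -= math.log(softmax(logits)[label])
    for logits in negatives:
        loss -= math.log(softmax(logits)[0])  # index 0 is the background class
    return loss


# Hypothetical scores for three classes: background, person, train.
positives = [([0.0, 0.0, 0.0], 1)]
negatives = [[0.0, 0.0, 0.0]]
loss = confidence_loss(positives, negatives)
print(round(loss, 4))
```

With uniform scores, every class receives probability 1/3, so each box contributes log 3 to the loss.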

SSD obtains the target category and confidence score of each object after matching the a priori boxes and produces the final detection result through non-maximum suppression.
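The non-maximum suppression step can be sketched as a greedy procedure over score-sorted boxes. Boxes are written as (x1, y1, x2, y2) corners, and the IoU threshold of 0.45 is an assumed value that is commonly used with SSD; the text does not state the threshold used in this study.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.45):
    """Greedy NMS over (score, box) pairs of one class; returns the kept pairs."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep a box only if it does not overlap too much with a higher-scoring kept box.
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


# Hypothetical detections: two overlapping boxes on one object and one separate box.
detections = [
    (0.8, (1, 1, 11, 11)),
    (0.9, (0, 0, 10, 10)),
    (0.7, (20, 20, 30, 30)),
]
kept = nms(detections)
print(kept)
```

In this example, the 0.8 box overlaps the 0.9 box with an IoU of about 0.68 and is therefore suppressed, whereas the distant 0.7 box is kept.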

5 Experimental results and discussion

To verify the feasibility and effectiveness of the proposed method, we conduct two sets of experiments. Experiment 1 is the detection of humans and cars in non-railway scenes at nighttime. Experiment 2 is the detection of trains and humans in railway scenes at nighttime.

5.1 Experiment 1: Detecting pedestrians and cars in a non-railway scene

The data set of Experiment 1 is provided in Table 1. The image processing methods used to produce data set 1-E include horizontal mirroring, Gaussian noise, and Gaussian blur.

Table 1
Data set of Experiment 1

Data set	Description	Quantity of images
Training set: 1-A	Daytime non-railway images	2000
Training set: 1-B	Generated from 1-A by CycleGAN	2000
Training set: 1-C	Nighttime non-railway images	500
Training set: 1-D	Consists of 1-B and 1-C	2500
Training set: 1-E	Generated from 1-D by image processing	10000
Test set: 1-T	Nighttime non-railway images	300

The model is trained after preparing the data set. The SSD model plays a decisive role in the actual detection effect, and the parameter settings used in training are crucial for the detection result. We choose SSD300, whose input is a three-channel RGB image with dimensions of 300×300 after preprocessing. The initial learning rate, lr_policy, stepvalue, gamma, and maximum number of iterations are set to 5 × 10^-4, multistep, 60000 and 80000, 0.1, and 100000, respectively. After repeated trials, we find that setting the learning rate and step values to the above values makes the objective function converge to a minimum within an appropriate time and yields a better-trained model. The maximum number of iterations is set to 100000 to ensure that the model is fully trained without underfitting or overfitting. The optimization algorithm is SGD, and the momentum is set to 0.9. The mAP of the SSD model is computed with an 11-point interpolated average precision. The model is trained on a Linux 16.04 system. The SSD model runs on a computer equipped with an NVIDIA GTX 1080 graphics card. It takes 25 ms to process a picture.
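The 11-point interpolated average precision mentioned above can be sketched as follows: for each recall threshold 0, 0.1, ..., 1.0, the highest precision reached at any recall at or above that threshold is taken, and the eleven values are averaged. The precision-recall points in the example are made up for illustration; mAP is then the mean of this quantity over all object classes.

```python
def eleven_point_ap(recalls, precisions):
    """11-point interpolated average precision for one class.

    recalls and precisions are paired points along the precision-recall curve.
    """
    total = 0.0
    for i in range(11):
        threshold = i / 10
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        total += max(candidates) if candidates else 0.0
    return total / 11


# Hypothetical curve: precision 1.0 up to recall 0.5, precision 0.5 at recall 1.0.
ap = eleven_point_ap(recalls=[0.5, 1.0], precisions=[1.0, 0.5])
print(round(ap, 4))
```

For this curve, six thresholds (0 to 0.5) see a precision of 1.0 and five thresholds (0.6 to 1.0) see 0.5, so the result is 8.5/11.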

All the final models trained on the training sets listed in Table 1 are tested on 1-T. Some detection results are presented in Fig. 11. Figure 11(a) shows some results of the SSD model trained only on the daytime dataset, in which pedestrians and some small cars are missed. The detection results of the SSD model trained on generated images are shown in Fig. 11(b), in which all the objects are correctly detected. Figure 11(b) demonstrates that the images generated by CycleGAN are effective in improving the detection rate for infrared images.

Fig. 11

Test results of Experiment 1. (a) Detection results of the model trained on dataset 1-A, with missed detections. (b) Detection results of the model trained on dataset 1-E.

The result comparison is reported in Table 2. Table 2 shows that the model trained on training set 1-A obtains low detection accuracy on test set 1-T primarily due to the apparent difference between day and night images. The considerable difference between the training and test sets produces poor learning results, model performance, and detection effect. We use data set 1-B generated by CycleGAN to train SSD and obtain a significantly better test result than that of training set 1-A. Table 2 also shows that the models trained on datasets 1-B and 1-C achieve comparable test results (50.0% and 57.7%). We combine datasets 1-B and 1-C to obtain dataset 1-D. The models trained on dataset 1-D and on dataset 1-E, which is obtained from 1-D via image processing, are also tested. The results show that the detection effect is further improved, reaching 68.2% and 71.2%, respectively.

Table 2

Comparison of the test results of the models trained on different datasets on dataset 1-T

Training data set	1-A	1-B	1-C	1-D	1-E
mAP	28.2%	50.0%	57.7%	68.2%	71.2%

5.2 Experiment 2: Detecting pedestrians in a railway scene

The data sets used in Experiment 2 are reported in Table 3. To demonstrate the effectiveness of our method, a data augmentation method based on PixelDA [36] is included in Experiment 2 for comparison. PixelDA is a pixel-level domain adaptation algorithm. A PixelDA model was trained using RGB railway scene images and non-railway scene infrared images, which are the same data as those used for the CycleGAN model. A data set entitled 2-C was generated by the trained PixelDA model conditioned on the daytime railway images. In test set 2-T, we only have samples labeled as “person.” The environment and parameters of the training model are approximately the same as those in Experiment 1. However, in Experiment 2, the step values are set to 80000 and 100000, and the maximum number of iterations is set to 120000.

Table 3
Data set of Experiment 2

Data set	Description	Quantity
Training set: 2-A	Daytime railway images	2255
Training set: 2-B	Generated from 2-A via CycleGAN	2255
Training set: 2-C	Generated from 2-A via PixelDA	2255
Test set: 2-T	Nighttime railway images	300

The final models trained on training sets 2-A and 2-B are tested on 2-T. Several sample detection results are presented in Fig. 12. In Fig. 12, the SSD model trained on the daytime dataset fails to detect the pedestrian, whereas the SSD model trained using our CycleGAN strategy works well. The comparison of Fig. 12(a) and Fig. 12(b) demonstrates that the proposed data augmentation strategy via CycleGAN is effective.

Fig. 12

Test results of Experiment 2. (a) Detection results of the model trained on dataset 2-A, with missed detections. (b) Detection results of the model trained on dataset 2-B.

The result comparison is reported in Table 4. We first evaluate the model trained on dataset 2-A on test set 2-T. Given the considerable difference between day and night image features, the training and test sets are poorly matched, and the detection effect is poor. We then use the style-transformed dataset 2-B to train the SSD model and evaluate it on test set 2-T. From Table 4, we can clearly see that an improvement of 17.8% is achieved compared with the SSD model trained only on the daytime dataset (81.4% versus 63.6%). Compared with the data augmentation strategy based on PixelDA, an improvement of 9.3% is obtained (81.4% versus 72.1%). This finding indicates that the model exhibits good detection ability at the railway site at night.

Table 4

Comparison of the test results of the models trained on different datasets on dataset 2-T

Training dataset	2-A	2-B	2-C
mAP	63.6%	81.4%	72.1%

5.3 Discussion

The analysis of Tables 2 and 4 demonstrates that differences in the scenes seriously influence the effectiveness of target detection. When image detection in a night scene is performed using a model trained on daytime data, the detection effect is poor and cannot meet the needs of practical applications. The collection of training samples for night scenes is difficult. CycleGAN can transform the style of daytime data that we already have and enable us to obtain data in the same style as nighttime infrared images. The conversion function of CycleGAN can reduce the workload of data collection. Simultaneously, the combination of image processing and expansion can considerably increase the amount of training set data. The proposed method is practical and improves detection accuracy and the effect of SSD model training.

Given that the difference between night and day scenes is too large and many interference factors affect the detection process, the ideal result cannot be obtained in target detection. The imaging quality of the test image is the major factor that influences the detection effect. The imaging results of commonly used infrared cameras are poor. Uneven light, insufficient color information, and other factors exert serious impact on the detection effect. For the test samples of infrared images used in the experiment, train lights, streetlights, and other strong light sources cause the images to lose local information and also interfere with the detection of surrounding objects. Hence, the effect of target detection is significantly affected. In addition, most cameras are installed in fixed positions and do not have an autofocus function. The resulting incomplete image display, poor feature extraction, or loss of target information will affect the detection results. Furthermore, the target position, size of the occluded part, and light interference will change the original characteristics that the target should exhibit. Therefore, the parameters learned by the model do not play an efficient role, resulting in poor detection effect. Accordingly, using as many nighttime scene samples as possible to train the model can enrich the diversity of the training samples, improve the learning ability of the model, enhance the detection of complex night scenes, and obtain satisfactory results.

Unlike the case of a regular training set with a sufficient amount of data, the SSD detection model cannot achieve good detection results in night scenes and can only be used to analyze experimental results. To address this limitation, we use CycleGAN to transform existing data into data with the desired style. The overall style of the generated images exhibits the characteristics of nighttime infrared images. The features extracted during model training are therefore biased toward actual night scenes, which improves detection performance. At the same time, the amount of training data increases, the adaptability and anti-interference ability of the model improve, and good detection results are obtained in complex night scenes.
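The way the converted data feed the detector can be sketched as follows. This is only an illustration under stated assumptions: `placeholder_generator` is a hypothetical stand-in that averages the RGB channels and is not a trained CycleGAN generator, the `Sample` container and `build_training_set` helper are invented for the example, and the reuse of daytime bounding boxes assumes that style conversion leaves object positions unchanged.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (x_min, y_min, x_max, y_max, class label); this layout is an assumption
Box = Tuple[int, int, int, int, str]

@dataclass
class Sample:
    image: list            # nested lists of pixels
    boxes: List[Box]       # annotations for the detector
    synthetic: bool = False

def placeholder_generator(rgb_image):
    """Hypothetical stand-in for a trained day-to-infrared generator.

    It only averages the three channels into one; a real generator
    is a learned network, not this fixed rule.
    """
    return [[sum(px) // 3 for px in row] for row in rgb_image]

def build_training_set(day_samples, night_samples, generator: Callable):
    """Convert each daytime image, keep its boxes, then append the
    small set of real nighttime infrared samples."""
    converted = [Sample(generator(s.image), list(s.boxes), True)
                 for s in day_samples]
    return converted + list(night_samples)

day = [Sample([[(90, 120, 150)]], [(0, 0, 1, 1, "pedestrian")])]
night = [Sample([[40]], [(0, 0, 1, 1, "pedestrian")])]
train = build_training_set(day, night, placeholder_generator)
print(len(train), train[0].synthetic, train[0].image)
```

Keeping a `synthetic` flag on each sample makes it straightforward to report results separately for generated and real images, or to vary the mixing ratio between them.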

The style conversion function of the CycleGAN model can solve the problem of insufficient samples, improve the ability of the SSD model to detect night images, and achieve good detection effects.

6 Conclusion

The invasion of railways by abnormal targets is an important issue that threatens railway safety. Most existing detection methods for railway scenes are suitable only for daytime, and their detection effect on nighttime images is poor. However, the amount of infrared railway image data captured at night is small, and using such limited data as the training set is challenging. To solve this problem, we propose a method for generating samples using CycleGAN. Daytime railway images are converted into synthetic images with the same style as nighttime infrared images, and the generated images are used as the training set for the SSD model. The accuracy of the proposed method for detecting infrared images at night is 17.77% higher than that of the daytime model. The overall detection ability of this method is better than that of the traditional approaches.

In the future, we plan to detect railway foreign objects in other abnormal scenarios and will adjust the SSD network to further improve its accuracy in target detection.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (62071006), the Beijing Natural Science Foundation (4182020), and the Key Laboratory for Health Monitoring and Control of Large Structures (KLLSHMC1901), Shijiazhuang, 050043.
