Traffic sign detection and recognition based on multi-size feature extraction and enhanced feature fusion module

Abstract

Deep learning dominates research on traffic sign detection, but deep-learning-based detectors have difficulty solving the two tasks of localization and classification simultaneously on realistic and complex traffic scene images. In addition, the images and the types of traffic signs provided by the public datasets used by these algorithms do not cover the situations encountered in realistic traffic scenes. To solve these problems, this paper creates a new road traffic sign dataset and, based on the YOLOv4 algorithm, designs a multi-size feature extraction module and an enhanced feature fusion module. These modules improve the algorithm’s ability to locate and classify traffic signs simultaneously, given the complexity of realistic traffic scene images and the large variation of traffic sign sizes in them. The experimental results on the newly created dataset show that the improved algorithm achieves 83.63% mean Average Precision (mAP), which is higher than several major deep-learning-based object detection algorithms on the same task. The newly created dataset is publicly available at https://github.com/zhang1018/Traffic-sign-dataset-for-public.

Keywords

Traffic sign detection and recognition, traffic sign datasets, autonomous driving, convolutional neural networks, intelligent traffic system

1 Introduction

With the rise of intelligent transportation systems and autonomous driving technology, road traffic problems are becoming increasingly serious. Traffic sign detection is a major component of both technologies, so its reliability is crucial.

Traditional methods detect traffic signs using their color and geometric features. This approach not only takes a lot of time to design hand-crafted features for the specific colors [1, 2] and shapes [3–5] of traffic signs in images, but is also sensitive to external factors such as weather changes and occlusions. To solve these problems, related research [6, 7] added machine learning to the traditional method and divided detection into two steps: first locating the regions of traffic signs in the image using the traditional method, and then classifying the traffic signs in those regions using a Support Vector Machine (SVM) classifier. The SVM classifier can effectively mitigate the effect of external factors, but this method still requires hand-crafted features to be designed for different traffic signs. After AlexNet [8] won the ILSVRC image classification and object recognition competition, Convolutional Neural Networks (CNNs), one of the representative algorithms of deep learning, reappeared in the object detection field. The powerful learning capability of CNNs allowed them to rapidly dominate object detection and produced many excellent algorithms: R-CNN [9], Fast R-CNN [10], Faster R-CNN [11], SSD [12], and YOLO [13]. The goal of traffic sign detection is to enable the computer to locate and recognize all traffic signs within a scene, which makes it a special case of object detection. Therefore, CNN-based methods are also applicable to traffic sign detection. CNN-based methods eliminate the need for the hand-crafted features of traditional traffic sign detection methods and are robust to background variation caused by external factors such as lighting and weather changes.

To transfer the achievements of CNNs in object detection to traffic sign detection, some research [14–16] attempted to directly apply excellent object detection algorithms to traffic signs, but failed to achieve satisfactory results. Object detection algorithms are trained and tested on the PASCAL VOC [17] and COCO [18] datasets, whose images contain mostly large objects, which leaves these algorithms weak at detecting small objects. Traffic scene images, however, contain both large and small traffic signs, so general object detection algorithms do not perform well on traffic sign detection. Other research [19–22] constructed dedicated detection networks using CNNs, borrowing the structure of excellent object detection algorithms and using classical feature extraction networks such as VGGNet [23], GoogLeNet [24] and ResNet [25] as the base network. These algorithms cannot effectively solve the two tasks of localization and classification simultaneously when faced with realistic traffic scene images, because they do not provide an effective way to handle images with complex backgrounds and large differences in object sizes. On the one hand, the difficulty of traffic sign detection lies in the complexity of the background of traffic scene images, which can interfere with the algorithm during detection. On the other hand, the traffic sign sizes in the images vary greatly and most signs are small, and small signs are usually more difficult to detect than large ones. Therefore, to solve the problems in traffic sign detection, it is necessary to design an algorithm that can detect both large and small traffic signs in complex traffic scenes.

Currently, the datasets commonly used in related research are GTSRB [26] and GTSDB [27], of which only GTSDB provides data suitable for research on traffic sign detection. However, GTSDB only provides images and labels for three major categories of common traffic signs, far fewer than the types of traffic signs encountered in reality. GTSRB provides 43 types of traffic signs, but the sign occupies such a large proportion of each image that some images contain only one type of traffic sign, so these images can only be used for research on traffic sign classification. For these reasons, the data provided by GTSRB and GTSDB cannot meet the requirements of traffic sign detection in realistic traffic scenes, and research results obtained on them cannot be applied to actual traffic scenes. To address these problems, the main contributions of this paper are as follows.

To make up for the shortage of public datasets, this paper creates a new dataset based on Chinese road traffic scenes with data from realistic traffic scene images. This dataset contains 77 categories of common traffic signs, and the information of categories and locations of traffic signs in each image are labeled.

The Multi-size Feature Extraction Module (MsFEM) and the Enhanced Feature Fusion Module (EFFM) are designed based on the YOLOv4 algorithm. MsFEM helps the feature extraction network efficiently extract feature semantic information from the upper-level feature maps. EFFM retains and enhances the feature semantic information of small objects in multi-scale prediction and improves the algorithm’s ability to detect small objects. Experiments show that these two modules improve the ability of the YOLOv4 algorithm to simultaneously locate and classify traffic signs on the newly created dataset.

Comparison experiments are conducted with several different parameter sets for the MsFEM1, MsFEM2 and MaxPooling (size_n × size_n) modules of the improved algorithm, and we finally obtain the parameter set with the best performance.

The remainder of the paper is organized as follows. Section 2 introduces the related work in the field of traffic sign detection and recognition. Section 3 introduces the newly created dataset. Section 4 introduces the proposed method. Section 5 and Section 6 are the experimental and concluding sections, respectively.

2 Related work

CNN-based methods have become the mainstream in traffic sign detection and recognition, and they follow the same ideas as methods in the field of object detection. Object detection essentially has two tasks: one is detection (localization) and the other is classification. At present, CNN-based object detection methods can be divided into two-stage and one-stage methods. Two-stage methods divide detection and classification into two steps, first predicting the regions where objects exist in the image and then predicting the categories of the objects in those regions. One-stage methods treat detection and classification as one step, directly predicting the locations and categories of the objects in the image.

R-CNN was the first two-stage method. It uses selective search [28] to generate object proposals and then recognizes them; because each generated proposal is processed by the convolutional neural network separately, R-CNN is inefficient. Subsequently, Girshick et al. proposed Fast R-CNN, which replaces the SVM classifier of R-CNN with a softmax layer at the end of the network, but it still did not solve the efficiency problem. To solve it, Ren et al. proposed Faster R-CNN. The highlight of Faster R-CNN is the region proposal network (RPN), a network structure that efficiently locates object regions. The RPN generates object proposals from the feature maps extracted by VGG16 or ResNet101, uses a softmax layer to determine whether each proposal belongs to the foreground (containing objects) or the background (not containing objects), and performs bounding box regression on the foreground proposals to localize the object regions. All object proposals in Faster R-CNN are generated by the RPN, so individual proposals no longer need to be processed by the convolutional neural network one by one, which accelerates detection. However, Faster R-CNN with the RPN still does not meet the real-time requirement.

To further speed up detection, one-stage algorithms such as YOLOv3 [29] and SSD emerged in succession. Their design borrows the idea of full convolution from FCN [30] and of multi-scale prediction from FPN [31]. Multi-scale prediction takes feature maps of different sizes from the feature extraction network, first fuses features between the high and low layers, and then performs prediction on each scale independently. Full convolution differs from the traditional CNN, which ends with a fully connected layer for classification: it uses a 1×1 convolution kernel at the end of the network instead. The output of full convolution is a feature map of the same size as the input from the previous layer, and each value in it is the network’s prediction for a region of the original input image. The purpose of full convolution is to detect and classify the image at the pixel level and to calculate the loss function pixel by pixel, which is equivalent to one training sample per pixel. With full convolution, the network reduces the amount of computation, speeds up detection, and predicts the result directly from the input image. SSD and YOLOv3 are faster than Faster R-CNN, but show no significant improvement in detection precision. To improve detection precision and the ability to detect small objects, the authors of YOLOv4 [32] used CSPDarknet53 as the base feature extraction network and PANet [33] for multi-scale prediction, and these techniques proved effective on the PASCAL VOC and COCO datasets.
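The statement that a 1×1 convolution can stand in for a fully connected layer can be illustrated with a minimal sketch. The function name `conv1x1`, the toy feature map, and the weights below are hypothetical and chosen only for illustration; they are not taken from any of the cited networks.

```python
def conv1x1(feature_map, weights, bias):
    """Apply a 1x1 convolution to an H x W x C_in feature map.

    weights has shape C_out x C_in. The same weights are applied at every
    spatial position, which is what a fully connected layer would compute
    if it were evaluated independently at each cell of the map.
    """
    out = []
    for row in feature_map:
        out_row = []
        for cell in row:  # cell is a list of C_in channel values
            out_row.append([
                sum(w * x for w, x in zip(w_row, cell)) + b
                for w_row, b in zip(weights, bias)
            ])
        out.append(out_row)
    return out


# Hypothetical 2 x 2 feature map with 3 channels, mapped to 2 output channels.
fmap = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
        [[2.0, 0.0, 1.0], [1.0, 1.0, 1.0]]]
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, 1.0]

pred = conv1x1(fmap, W, b)
print(pred[0][0])  # one prediction vector per spatial position
```

The output keeps the spatial size of the input, so each position carries its own prediction, which matches the per-region prediction described above.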

As described in the previous section, traffic scene images have large variation in sign sizes and complex backgrounds, which causes object detection algorithms to perform poorly when applied directly to traffic signs. However, these algorithms provide very advanced ideas; in particular, the authors of YOLOv4 give ideas for improving the detection of small objects. Therefore, this paper builds on YOLOv4 to construct an algorithm that effectively detects and recognizes traffic signs by taking into account the distribution characteristics of traffic signs in traffic scene images.

3 Traffic sign dataset

3.1 Data collection

The images in the dataset are taken from Chinese road traffic scenes. 40% of the images are provided by Zhu et al. [34] and Zhang et al. [35]. As the second data source, we used cameras to take a large number of images of urban road traffic scenes, and kept those that contain traffic signs after selection and cropping. To preserve the realistic road scenes, the images are resized to 800×800 pixels; some of them are shown in Fig. 1. The images are taken under different road scenes, and their backgrounds and traffic sign sizes are consistent with the situations encountered in reality.

Fig. 1

Traffic scene images.

3.2 Data annotation

As shown in Fig. 2, the dataset contains 77 categories of common traffic signs. Each colored box in the figure represents a major category; from left to right, these are mandatory signs, prohibitory signs, warning signs and traffic signals. The characters under each traffic sign are its unique label. The traffic signs in each image are labeled with rectangular boxes, and each traffic sign is labeled with a specific category. After all the traffic signs in an image are labeled, the label information is saved as an xml file. During training and testing, these xml files provide the required label information, as shown in Fig. 3.
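As an illustration of how such xml label files might be read, the following is a minimal sketch. It assumes a PASCAL VOC style layout (`object`, `name`, `bndbox`, `xmin`, and so on); the actual schema of the dataset's files may differ, and the sample file content and coordinates below are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample in an assumed PASCAL VOC style layout.
SAMPLE_XML = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>800</width><height>800</height><depth>3</depth></size>
  <object>
    <name>pl50</name>
    <bndbox><xmin>412</xmin><ymin>230</ymin><xmax>446</xmax><ymax>265</ymax></bndbox>
  </object>
  <object>
    <name>w57</name>
    <bndbox><xmin>120</xmin><ymin>301</ymin><xmax>171</xmax><ymax>349</ymax></bndbox>
  </object>
</annotation>
"""


def parse_annotation(xml_text):
    """Return one dict per labeled sign: category label and box corners."""
    root = ET.fromstring(xml_text.strip())
    signs = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        signs.append({
            "label": obj.findtext("name"),
            "box": tuple(int(box.findtext(tag))
                         for tag in ("xmin", "ymin", "xmax", "ymax")),
        })
    return signs


annotations = parse_annotation(SAMPLE_XML)
for sign in annotations:
    print(sign["label"], sign["box"])
```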

Fig. 2

77 categories of common traffic signs.

Fig. 3

Label information.

3.3 Data statistics

After selecting and cropping 120,000 traffic images, the dataset finally consists of 11,000 images containing 15,000 traffic sign instances. The distribution of traffic sign sizes (in pixels) in the dataset is shown in Fig. 4. In this paper, traffic signs that differ only in a number (e.g., pl50, il60, etc.) are treated as one family (pl stands for speed limit signs, il stands for minimum speed signs), and the number of instances of each traffic sign is shown in Fig. 5. The number of instances inevitably varies between categories because road conditions are distributed unevenly in reality. For example, road conditions with "continuous curves" signs are less common than those with "watch out for pedestrians" signs.
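The grouping of numbered signs into families can be sketched as follows. This is only an illustration: the prefix list contains just the two families named in the text (`pl` and `il`), the helper name `family_of` is hypothetical, and the label list is invented rather than drawn from the dataset.

```python
import re
from collections import Counter

# Only the two families named in the text; the full list is not given there.
FAMILY_PREFIXES = ("pl", "il")


def family_of(label):
    """Map a numbered label such as 'pl50' to its family, else keep the label."""
    for prefix in FAMILY_PREFIXES:
        if re.fullmatch(prefix + r"\d+", label):
            return prefix
    return label


# Invented labels for illustration.
labels = ["pl50", "pl80", "il60", "w57", "i2", "pl50"]
counts = Counter(family_of(label) for label in labels)
print(counts)
```

Labels such as `i2` or `w57` are left unchanged because they are distinct categories rather than members of a numbered family.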

Fig. 4

Traffic sign sizes distribution.

Fig. 5

Traffic sign instances distribution.

In general, the data in the newly created dataset are derived from real traffic scene images, and the categories and locations of traffic signs in each image are labeled. The real scenes and detailed label information are consistent with the research on traffic sign detection and recognition in this paper.

4 Proposed method

In our experiments, the unmodified YOLOv4 algorithm is not sufficiently effective at locating and recognizing traffic signs. For this reason, this paper improves the YOLOv4 algorithm to address the problems that exist in traffic sign detection and recognition. The improved algorithm is called ME-YOLOv4, and the structures of YOLOv4 and ME-YOLOv4 are introduced in the following two sections, respectively.

4.1 YOLOv4 structure

The structure of YOLOv4 is as follows: CSPDarknet53 is the base network; SPP [36] is the additional module of the neck; PANet is the feature fusion module of the neck; and the head of YOLOv3 is used as the head.

The specific structure of YOLOv4 is shown in Fig. 6. The base network, CSPDarknet53, is obtained by adding CSP [37] to each large residual block of Darknet53 [29], and consists of a convolution module and five BLOCKs. The convolution module is composed of a Conv2D layer, a BN layer, and a Mish activation function. Each BLOCK contains several Resblocks and convolution modules. As an additional module in the neck, the SPP module performs MaxPooling of different sizes on the feature map extracted by CSPDarknet53, which aims to increase the perceptual field of the network. As the feature fusion module in the neck, PANet performs a downsampling operation after the upsampling operation of FPN to increase the location semantic information from the lower layer. The head of YOLOv4 follows the head structure of YOLOv3. It first performs feature extraction on the down-sampled feature map using the convolution module, and then applies a full convolution operation [30] to the extracted feature map to obtain the final prediction results. The convolution module in the neck and head is composed of a Conv2D layer, a BN layer, and a Leaky ReLU activation function.
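For reference, the two activation functions named above can be written in a few lines. Mish is defined as x · tanh(softplus(x)). The Leaky ReLU slope of 0.1 used below is an assumption (it is the value commonly used in Darknet-style networks); the paper does not state the slope.

```python
import math


def mish(x):
    """Mish activation: x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x)."""
    return x * math.tanh(math.log1p(math.exp(x)))


def leaky_relu(x, alpha=0.1):
    """Leaky ReLU; alpha=0.1 is an assumed slope, not stated in the paper."""
    return x if x >= 0 else alpha * x


for value in (-2.0, 0.0, 1.0):
    print(value, round(mish(value), 4), leaky_relu(value))
```

Unlike Leaky ReLU, Mish is smooth everywhere, which is the usual motivation given for using it in the backbone.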

Fig. 6

YOLOv4 structure.

4.2 ME-YOLOv4 structure

ME-YOLOv4 is obtained by making several improvements to YOLOv4 so that it can effectively locate and recognize traffic signs. The improvements are as follows.

Due to the different shooting angles, there are large differences in the sizes of different traffic signs in the traffic scene images. When constructing the feature extraction network, if the network uses only one size of convolution kernel for each layer of feature extraction, the extracted feature map cannot fully and effectively contain the feature semantic information of different size traffic signs in the upper layer feature map. Inspired by the inception network [38], this paper designs the multi-size feature extraction module (MsFEM) and uses it in the feature extraction network. The specific structure of MsFEM is shown in Fig. 7, which uses two different sizes of convolution kernel to extract features from the upper layer images. MsFEM uses different sizes of convolution kernel, which means that different sizes of perceptual fields are used to extract the semantic information of traffic signs of different sizes. MsFEM concatenates the extracted feature maps together in order to transfer the semantic information of traffic signs of different sizes in the image to the deeper layers of the feature extraction network.
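The idea of extracting features with two kernel sizes and concatenating the results can be sketched as below. This is not the MsFEM implementation: the kernel sizes 3 and 5 are assumed for illustration, fixed averaging kernels stand in for learned convolution weights, and the input is a single-channel toy map.

```python
def conv2d_same(image, kernel):
    """Single-channel 2D convolution with zero padding and stride 1."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - pad, x + dx - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy][dx] * image[yy][xx]
            out[y][x] = acc
    return out


def box_kernel(k):
    """Placeholder k x k averaging kernel standing in for learned weights."""
    return [[1.0 / (k * k)] * k for _ in range(k)]


def multi_size_features(image, small=3, large=5):
    """Run two branches with different kernel sizes and stack them as channels."""
    branch_small = conv2d_same(image, box_kernel(small))
    branch_large = conv2d_same(image, box_kernel(large))
    return [branch_small, branch_large]  # channel-first: 2 x H x W


image = [[float((x + y) % 4) for x in range(6)] for y in range(6)]
features = multi_size_features(image)
print(len(features), len(features[0]), len(features[0][0]))
```

Both branches keep the spatial size of the input, so they can be concatenated along the channel axis and passed on to deeper layers.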

The background of traffic scene images is complex, and it contains many other signs that resemble traffic signs in color or shape. The feature extraction network also extracts the semantic information of these signs, and this misleading information interferes with the network during training. If these signs can be effectively suppressed when the network extracts features, the extracted feature map will be more representative, which will lead to better training results. The traffic signs in this paper can be divided into mandatory signs, prohibitory signs, warning signs and traffic signals. The warning signs are mostly yellow triangles with black borders. The prohibitory signs are mostly white circles with red borders. The mandatory signs are mostly circles or rectangles with blue backgrounds. The traffic signals are rectangular boxes containing circles or arrows of different colors. According to the feature extraction invariance of CNNs, these features of traffic signs are extracted completely when the feature extraction network processes the traffic scene images. Therefore, a series of MaxPooling operations with different sizes and a stride of 1 is applied to the extracted feature map, which reduces the feature semantic information of signs other than traffic signs in the feature map. In this paper, this technique is used to remove interference from the traffic scene images. The size of the feature map in the feature extraction network decreases as downsampling proceeds, and a smaller feature map corresponds to a larger perceptual field. To adapt to this change, this paper uses different MaxPooling sizes for feature maps of different sizes, and each MaxPooling operation occurs in one downsampling process. The specific process is shown in Fig. 9.
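A MaxPooling operation with a stride of 1 can be sketched as follows. The padding scheme (the window is clipped at the borders so that the output keeps the input size) and the toy activation values are assumptions for illustration. The sketch only shows the mechanics of the operation; whether it suppresses non-traffic-sign responses is the paper's claim and is not demonstrated by this example.

```python
def maxpool_stride1(fmap, size):
    """Max pooling with stride 1; windows are clipped at the borders so the
    output has the same height and width as the input."""
    h, w = len(fmap), len(fmap[0])
    lo = (size - 1) // 2      # cells covered above / left of the center
    hi = size - 1 - lo        # cells covered below / right of the center
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [fmap[yy][xx]
                      for yy in range(max(0, y - lo), min(h, y + hi + 1))
                      for xx in range(max(0, x - lo), min(w, x + hi + 1))]
            row.append(max(window))
        out.append(row)
    return out


# Invented 4 x 4 activation map with one strong response at (1, 1).
fmap = [[0.1, 0.2, 0.1, 0.0],
        [0.2, 0.9, 0.3, 0.1],
        [0.1, 0.3, 0.2, 0.4],
        [0.0, 0.1, 0.4, 0.1]]
pooled = maxpool_stride1(fmap, 3)
for row in pooled:
    print(row)
```

With a stride of 1 the map is not downsampled: each output cell simply takes the strongest activation in its neighborhood.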

In FPN, the feature semantic information at the higher layer is fused with that at the lower layer by upsampling, so that the higher layer can compensate for feature semantic information not extracted at the lower layer. This technique enriches the feature semantic information of the feature map at each size, but it also has limitations. During downsampling, the feature extraction network scales down the feature map, so that a pixel in the feature map corresponds to a perceptual field that maps to a region of the original image, and the size of that region is determined by the downsampling multiplier. When the detected object is smaller than the current downsampling multiplier, its feature semantic information is lost during downsampling. In this case, the upsampling operation cannot fuse feature semantic information between the high and low layers for objects that were lost because of their small size. As a result, the network detects small objects poorly. To solve this problem and improve the overall detection capability of the network for traffic signs in the dataset, this paper designs the enhanced feature fusion module (EFFM), whose structure is shown in Fig. 8. First, before fusing the high-layer and low-layer feature maps, EFFM applies a convolution module to the low-layer feature map to further extract effective feature semantic information, especially for smaller objects. Then, EFFM fuses the feature map extracted by the convolution module in the lower layer with the feature map from the higher layer, and compresses the fused feature map with a convolution module. Finally, to maintain the location semantic information in the low-layer feature map, EFFM concatenates the unconvolved low-layer feature map with the compressed feature map through a shortcut. With the EFFM structure, the feature map obtained from the lower layer is enriched with both the original feature semantic information and the feature semantic information from the higher layer.
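The relation between object size and downsampling multiplier described above can be made concrete with a small calculation. The strides 8, 16 and 32 are assumed here because they are the usual prediction strides of YOLO-style heads; the paper does not list the multipliers used in ME-YOLOv4, and the object sizes are invented.

```python
# Assumed downsampling multipliers of the three prediction scales.
HEAD_STRIDES = (8, 16, 32)


def cells_covered(object_px, stride):
    """Side length of the object measured in feature-map cells."""
    return object_px / stride


def strides_resolving(object_px, strides=HEAD_STRIDES):
    """Strides at which the object still spans at least one full cell."""
    return [s for s in strides if cells_covered(object_px, s) >= 1.0]


for size in (6, 12, 24, 64):
    print(size, "px ->", strides_resolving(size))
```

Under these assumptions, a 12-pixel sign spans a full cell only at the finest scale, and a 6-pixel sign is smaller than one cell at every scale, which is the situation in which upsampling from higher layers has nothing left to recover.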

Fig. 7

MsFEM structure.

Fig. 8

EFFM structure.

After adding the above improvements to the YOLOv4 structure, the structure of ME-YOLOv4 is shown in Fig. 9. The size of the convolution kernel in MsFEM and the MaxPooling are adapted to the size of the traffic signs in the newly created dataset. Besides, other parts of the network continue to use parts of YOLOv4.

Fig. 9

ME-YOLOv4 structure (The red solid box is the improvement of this paper).

5 Experiment

5.1 Training

The newly created dataset provides complete label information on the category and location of each traffic sign in the images, so all our experiments are conducted on this dataset. The dataset is divided into a training set and a test set in the ratio of 8:2. In addition, the mosaic [32] technique is used to augment the data for categories with a small sample size, so that each category can be trained a certain number of times in one iteration.
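The split can be sketched as follows, using the 11,000 images reported for the dataset, the 8:2 training/test ratio, and the 7:3 training/validation ratio used during training. The random shuffling, the seeds, and the helper name `split` are assumptions; the paper does not say how images were assigned to each subset.

```python
import random


def split(items, ratio, seed=0):
    """Shuffle a copy of items and cut it into two parts at the given ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = round(len(items) * ratio)
    return items[:cut], items[cut:]


image_ids = list(range(11000))                 # stand-ins for image file names
train_val, test = split(image_ids, 0.8)        # 8:2 training/test split
train, val = split(train_val, 0.7, seed=1)     # 7:3 training/validation split

print(len(train), len(val), len(test))
```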

To verify the effectiveness of the improved method, SSD, YOLOv3, YOLOv4 and Faster R-CNN are selected from the typical algorithms in the object detection field, and a set of comparison experiments is set up between these algorithms and ME-YOLOv4. In addition, several different parameter sets are used for the MsFEM1, MsFEM2 and MaxPooling (size_n × size_n) modules in ME-YOLOv4, with the aim of determining the parameter set that performs best on the dataset. These parameters are chosen to suit the sizes of the traffic signs in the newly created dataset, and the specific settings for each group are shown in Table 1. For training, all algorithms use the same hyperparameter settings: the initial learning rate is set to 0.001; the Adam optimizer with default parameters is used; and a validation set is split from the training set in the ratio of 7:3 to monitor the whole training process. All experiments are run on a Linux server with an Intel Xeon(R) Silver 4210 CPU, 128 GB of RAM, and two NVIDIA TITAN RTX GPUs, using the TensorFlow deep learning framework.

Table 1
Parameters setting

Setting	MsFEM1	MsFEM2	MaxPooling (size_n × size_n)
ME-YOLOv4 I	s=1,m=3	s=1,m=3	size1=2,size2=3,size8=3,size8=5,size4=7
ME-YOLOv4 II	s=3,m=5	s=1,m=3	size1=2,size2=3,size8=3,size8=5,size4=7
ME-YOLOv4 III	s=3,m=5	s=1,m=3	size1=2,size2=3,size8=3,size8=5,size4=9
ME-YOLOv4 IV	s=3,m=5	s=1,m=3	size1=3,size2=3,size8=5,size8=5,size4=7

5.2 Experimental results and analysis

During training, we set 1000 epochs for each algorithm. At the end of each epoch, the weights obtained from training are evaluated on the validation set to calculate the value of the loss function. The validation loss serves as the monitor for the whole training process: if the loss value decreases, the previously recorded weights are overwritten. If the loss value does not decrease within 50 consecutive epochs, the learning rate is adjusted to 0.1 times the original one; if it does not decrease within 100 consecutive epochs, training is terminated and the weights recorded at the lowest loss value become the final weights. Under this training strategy, SSD, YOLOv3 and YOLOv4 terminate training after 705, 735 and 720 epochs, respectively, while our improved algorithm terminates after 680 epochs. The test set is then evaluated with the trained weights of each algorithm, using the PASCAL VOC evaluation metrics. The mean average precision (mAP) values obtained by each algorithm on the dataset are shown in Table 2, and the average precision (AP) values obtained by each algorithm on each category are shown in Table 3.
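The training strategy described above (checkpoint on improvement, learning-rate reduction after 50 stale epochs, termination after 100) can be sketched as a small state machine. The function name, the return format, and the synthetic loss curve are hypothetical; the sketch also assumes the learning rate is reduced once per plateau. In TensorFlow, similar behaviour is commonly obtained with the Keras callbacks `ModelCheckpoint`, `ReduceLROnPlateau` and `EarlyStopping`, although the paper does not say which mechanism was used.

```python
def run_schedule(val_losses, lr=0.001, lr_patience=50, stop_patience=100,
                 factor=0.1):
    """Replay a sequence of validation losses through the training strategy."""
    best_loss, best_epoch, stale = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: this is where the checkpoint would be overwritten.
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale == lr_patience:
                lr *= factor          # reduce the learning rate once per plateau
            if stale >= stop_patience:
                return {"stopped_at": epoch, "best_epoch": best_epoch, "lr": lr}
    return {"stopped_at": len(val_losses), "best_epoch": best_epoch, "lr": lr}


# Synthetic curve: the loss improves for 600 epochs and then stays flat.
losses = [1.0 / e for e in range(1, 601)] + [1.0] * 400
result = run_schedule(losses)
print(result)
```

With this synthetic curve the best weights come from epoch 600, the learning rate is reduced at epoch 650, and training stops at epoch 700, well before the 1000-epoch limit.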

Table 2
The mAP values obtained by each group of algorithms on the new dataset

Algorithms	mAP/%
SSD	76.22
YOLOv3	75.72
YOLOv4	80.37
Faster R-CNN	77.13
ME-YOLOv4 I	82.55
ME-YOLOv4 II	82.90
ME-YOLOv4 III	83.63
ME-YOLOv4 IV	83.48

Table 3

AP values obtained by each group of algorithms on the new dataset

Algorithms	i2	i3	i4	i5	i10	i11	i12	i14	i15	i17	i18	i19	i20	i21
SSD	0.92	0.81	0.92	0.89	0.83	0.37	0.8	0.77	0.43	0.96	0.94	0.54	0.49	0.63
YOLOv3	0.83	0.75	0.96	0.86	0.72	0.4	0.68	0.86	0.3	0.53	0.73	0.79	0.93	0.77
YOLOv4	0.86	0.83	0.94	0.92	0.64	0.35	0.6	0.75	0.37	0.73	1.0	1.0	0.95	0.85
Faster R-CNN	0.89	0.72	0.9	0.91	0.9	0.45	0.78	0.73	0.59	0.95	1.0	0.74	0.57	0.59
ME-YOLOv4 I	0.82	0.75	0.98	0.92	0.87	0.4	0.62	0.62	0.34	0.65	1.0	1.0	0.99	0.98
ME-YOLOv4 II	0.82	0.75	0.96	0.91	0.86	0.4	0.62	0.62	0.36	0.7	1.0	1.0	0.99	0.98
ME-YOLOv4 III	0.85	0.78	0.96	0.91	0.74	0.4	0.61	0.72	0.4	0.62	0.96	0.99	0.91	0.92
ME-YOLOv4 IV	0.82	0.75	0.96	0.89	0.66	0.4	0.69	0.78	0.47	0.68	1.0	0.98	0.96	1.0
Algorithms	i22	i23	i25	il	ip	p1	p3	p5	p6	p9	p10	p11	p12	p19
SSD	0.51	1.0	0.67	0.98	0.85	0.84	0.93	0.92	0.87	0.91	0.9	0.92	0.68	0.9
YOLOv3	0.84	0.72	0.78	0.93	0.7	0.6	0.86	0.81	0.73	0.92	0.76	0.79	0.58	0.89
YOLOv4	0.9	1.0	0.76	0.95	0.81	0.8	0.88	0.82	0.92	0.83	0.85	0.89	0.58	1.0
Faster R-CNN	0.64	1.0	0.78	1.0	0.87	0.89	1.0	0.9	0.9	0.88	0.98	0.9	0.64	0.9
ME-YOLOv4 I	0.95	1.0	0.78	0.98	0.73	0.8	1.0	0.95	0.99	1.0	1.0	0.93	0.7	1.0
ME-YOLOv4 II	0.95	1.0	0.78	0.95	0.73	0.9	1.0	0.95	1.0	1.0	1.0	0.93	0.67	0.96
ME-YOLOv4 III	0.86	1.0	0.78	0.93	0.73	0.9	1.0	0.94	0.92	0.99	0.92	0.95	0.83	0.92
ME-YOLOv4 IV	0.95	1.0	0.78	0.98	0.65	0.9	0.91	1.0	0.9	0.9	0.99	0.97	0.58	1.0
Algorithms	p23	p26	p27	p28	pb	pg	ph4	ph4.5	ph5	pl	pm	pn	pne	pr
SSD	0.95	0.85	0.78	0.73	0.98	0.94	0.92	0.91	0.35	0.95	0.86	0.9	0.87	0.96
YOLOv3	0.9	0.64	0.7	0.89	1.0	0.81	0.78	0.82	0.29	0.91	0.37	0.91	0.81	0.96
YOLOv4	0.85	0.98	0.75	0.78	1.0	1.0	0.9	0.93	0.62	0.98	0.85	0.95	0.81	0.98
Faster R-CNN	0.97	0.87	0.63	0.69	1.0	0.9	0.94	0.5	0.39	0.9	0.7	0.9	0.93	0.99
ME-YOLOv4 I	0.94	0.96	0.8	0.67	1.0	1.0	0.94	0.95	0.71	0.97	0.97	0.94	0.89	1.0
ME-YOLOv4 II	0.94	0.96	0.8	0.65	1.0	1.0	0.88	0.94	0.82	0.96	0.96	0.92	0.85	1.0
ME-YOLOv4 III	0.98	0.95	0.7	0.86	1.0	1.0	0.9	1.0	0.76	0.98	0.89	0.94	0.85	1.0
ME-YOLOv4 IV	0.86	0.93	0.9	0.82	1.0	1.0	0.9	1.0	0.86	0.96	0.94	0.91	0.85	1.0
Algorithms	ps	w10	w13	w18	w21	w22	w30	w31	w32	w41	w42	w43	w45	w47
SSD	0.99	0.73	0.83	0.4	0.23	0.73	0.78	0.81	0.72	0.82	0.51	0.68	0.57	0.66
YOLOv3	1.0	0.8	0.8	0.86	0.48	0.8	0.82	0.8	0.73	0.97	0.57	0.85	0.58	0.6
YOLOv4	0.8	0.75	0.83	0.86	0.67	0.66	0.9	0.69	0.86	0.82	0.71	0.66	0.64	0.58
Faster R-CNN	1.0	0.63	0.85	0.47	0.26	0.71	0.8	0.89	0.72	0.81	0.65	0.72	0.54	0.72
ME-YOLOv4 I	0.6	0.76	0.92	0.86	0.85	0.76	0.99	0.7	0.73	0.85	0.76	0.92	0.58	0.76
ME-YOLOv4 II	0.62	0.76	0.92	0.81	0.85	0.78	0.97	0.72	0.73	0.85	0.76	0.9	0.58	0.76
ME-YOLOv4 III	0.8	0.79	0.99	0.82	0.67	0.76	0.91	0.91	0.73	1.0	0.66	0.83	0.58	0.72
ME-YOLOv4 IV	0.8	0.9	0.81	0.82	0.78	0.89	0.9	0.84	0.67	0.91	0.73	0.85	0.58	0.58
Algorithms	w55	w57	w58	w59	lred	lgreen	strred	strgreen	rred	rgreen	strgreennum	strrednum
SSD	0.86	0.87	0.63	0.84	0.78	0.49	0.46	0.8	0.28	0.7	0.76	0.77
YOLOv3	0.89	0.85	0.8	0.82	0.61	0.67	0.55	0.66	0.86	0.55	0.81	0.95
YOLOv4	0.93	0.97	0.72	0.9	0.77	0.43	0.55	0.79	0.59	0.68	0.78	0.95
Faster R-CNN	0.87	0.89	0.79	0.8	0.81	0.39	0.52	0.82	0.27	0.75	0.65	0.81
ME-YOLOv4 I	0.89	0.8	0.7	0.81	0.76	0.42	0.58	0.71	0.54	0.67	0.77	0.95
ME-YOLOv4 II	0.87	0.93	0.7	0.87	0.74	0.47	0.59	0.71	0.67	0.67	0.77	0.95
ME-YOLOv4 III	0.89	0.98	0.66	0.76	0.72	0.82	0.61	0.81	0.82	0.66	0.78	0.95
ME-YOLOv4 IV	0.87	0.88	0.7	0.74	0.8	0.7	0.58	0.73	0.8	0.76	0.72	0.95

According to Table 3, over all categories, the minimum AP values obtained by the four ME-YOLOv4 settings of Table 1 are 0.34, 0.36, 0.40, and 0.40, respectively, while the minimum AP values obtained by SSD, YOLOv3, YOLOv4, and Faster R-CNN are 0.23, 0.29, 0.35, and 0.26, respectively. In addition, the proportions of categories with AP values greater than 0.7 are 82%, 83%, 85%, and 90% for the four ME-YOLOv4 settings, compared with 73%, 74%, 77%, and 73% for SSD, YOLOv3, YOLOv4, and Faster R-CNN, respectively. According to Table 2, the mAP values obtained on the dataset by the four ME-YOLOv4 settings are 82.55%, 82.90%, 83.63%, and 83.48%, respectively, and those obtained by SSD, YOLOv3, YOLOv4, and Faster R-CNN are 76.22%, 75.72%, 80.37%, and 77.13%, respectively. The best mAP among the four settings, 83.63% for ME-YOLOv4 III, is 7.41, 7.91, 3.26, and 6.50 percentage points higher than the mAP of SSD, YOLOv3, YOLOv4, and Faster R-CNN, respectively. These results verify that the proposed method improves the ability of the YOLOv4 algorithm to locate and recognize traffic signs simultaneously. They also indicate that MsFEM helps the feature extraction network extract feature semantic information of traffic signs of different sizes, and that EFFM retains and enhances the feature semantic information of small objects in multi-scale prediction. The experimental results also show that ME-YOLOv4 III outperforms the other algorithms for traffic sign detection and recognition on this dataset, and that it can be used to detect and recognize traffic signs in realistic traffic scene images.
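The mAP margins quoted above can be reproduced directly from the values in Table 2. The short script below uses only those reported numbers; the variable names are chosen for illustration.

```python
# mAP values (%) as reported in Table 2.
MAP = {
    "SSD": 76.22, "YOLOv3": 75.72, "YOLOv4": 80.37, "Faster R-CNN": 77.13,
    "ME-YOLOv4 I": 82.55, "ME-YOLOv4 II": 82.90,
    "ME-YOLOv4 III": 83.63, "ME-YOLOv4 IV": 83.48,
}
baselines = ["SSD", "YOLOv3", "YOLOv4", "Faster R-CNN"]
ours = [name for name in MAP if name.startswith("ME-YOLOv4")]

best_name = max(ours, key=MAP.get)
# Differences are in percentage points.
gains = {name: round(MAP[best_name] - MAP[name], 2) for name in baselines}

print(best_name, gains)
```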

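To further verify the effectiveness of the improved algorithm, we compare the results obtained by ME-YOLOv4 III on the GTSDB with recent research results, as shown in Table 4. ME-YOLOv4 III achieves the highest mAP (97.6%) and the highest AP on the Mandatory category (95.1%), while Serna et al. [41] obtain the highest AP on the Prohibitory category (99.9%) and Liu et al. [39] on the Warning category (98.8%). The improved algorithm is therefore competitive with these methods.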

Table 4

Performance of each group of algorithms on the GTSDB

Algorithms	Mandatory AP/%	Prohibitory AP/%	Warning AP/%	mAP/%
Liu et al. [39]	93.5	97.9	98.8	96.7
Ren et al. [40]	71.4	90.3	82.4	81.4
Serna et al. [41]	93.9	99.9	98.2	97.3
ME-YOLOv4 III	95.1	98.9	98.7	97.6
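
As a consistency check, the mAP column in Table 4 appears to equal the unweighted mean of the three per-category AP values. The sketch below recomputes it under that assumption; the data layout and the function name are illustrative and are not taken from any of the compared methods.

```python
from statistics import mean

# Per-category AP (%) on the GTSDB as listed in Table 4, in the order
# (Mandatory, Prohibitory, Warning). The names are illustrative.
GTSDB_AP = {
    "Liu et al. [39]": (93.5, 97.9, 98.8),
    "Ren et al. [40]": (71.4, 90.3, 82.4),
    "Serna et al. [41]": (93.9, 99.9, 98.2),
    "ME-YOLOv4 III": (95.1, 98.9, 98.7),
}


def mean_average_precision(per_category_ap):
    """Unweighted mean of per-category AP values, rounded to one decimal."""
    return round(mean(per_category_ap), 1)


if __name__ == "__main__":
    for method, aps in GTSDB_AP.items():
        print(f"{method}: mAP = {mean_average_precision(aps):.1f}%")
```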

6 Conclusion and future work

To address the difficulty of locating and classifying traffic signs simultaneously in realistic and complex traffic scenes, this paper creates a new traffic sign dataset and, based on the YOLOv4 algorithm, designs a multi-size feature extraction module (MsFEM) and an enhanced feature fusion module (EFFM). Experiments on the new dataset verify that the proposed method improves the ability of the YOLOv4 algorithm to locate and classify traffic signs simultaneously, raising the mAP from 80.37% to 83.63%. In future research, we will continue to expand the categories and numbers of traffic signs in the dataset, and continue to investigate how to improve the algorithm's ability to locate and recognize traffic signs with large size differences in traffic images with complex backgrounds.

Acknowledgments

This work is supported in part by the Anhui Provincial Key R&D Program of China under Grant 202004a05020040, and in part by the National Key Research and Development Program of China under Grant 2018YFC0604404.

References

1. de la Escalera, Armingol and Mata, Traffic sign recognition and analysis for intelligent vehicles, Image and Vision Computing 21(3) (2003), 247–258.

2. Benallal and Meunier, Real-time color segmentation of road signs, in CCECE 2003 - Canadian Conference on Electrical and Computer Engineering: Toward a Caring and Humane Technology (Cat. No.03CH37436), vol. 3, 2003, pp. 1823–1826.

3. Khan J.F., Bhuiyan S.M.A. and Adhami R.R., Image segmentation and shape analysis for road-sign detection, IEEE Transactions on Intelligent Transportation Systems 12(1) (2011), 83–96.

4. Loy and Barnes, Fast shape-based road sign detection for a driver assistance system, in 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566), vol. 1, 2004, pp. 70–75.

5. Lee J.-H. and Jo K.-H., Traffic sign recognition by division of characters and symbols regions, in 7th Korea-Russia International Symposium on Science and Technology, Proceedings KORUS 2003 (IEEE Cat. No.03EX737), vol. 2, 2003, pp. 324–328.

6. Fleyeh, Biswas and Davami, Traffic sign detection based on AdaBoost color segmentation and SVM classification, in Eurocon 2013, 2013, pp. 2005–2010.

7. Maldonado-Bascon, Lafuente-Arroyo, Gil-Jimenez, Gomez-Moreno and Lopez-Ferreras, Road-sign detection and recognition based on support vector machines, IEEE Transactions on Intelligent Transportation Systems 8(2) (2007), 264–278.

8. Krizhevsky, Sutskever and Hinton G.E., ImageNet classification with deep convolutional neural networks, Commun ACM 60(6) (2017), 84–90.

9. Girshick, Donahue, Darrell and Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.

10. Girshick R.B., Fast R-CNN, in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, 2015, pp. 1440–1448.

11. Ren, He, Girshick and Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(6) (2017), 1137–1149.

12. Liu, Anguelov, Erhan, Szegedy, Reed S.E., Fu and Berg A.C., SSD: Single shot multibox detector, in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, Proceedings, Part I, vol. 9905, 2016, pp. 21–37.

13. Redmon, Divvala, Girshick and Farhadi, You only look once: Unified, real-time object detection, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.

14. Qiao, Gu, Liu and Liu, Optimization of traffic sign detection and classification based on Faster R-CNN, in 2017 International Conference on Computer Technology, Electronics and Communication (ICCTEC), 2017, pp. 608–611.

15. Rajendran S.P., Shine, Pradeep and Vijayaraghavan, Real-time traffic sign recognition using YOLOv3 based detector, in 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 2019, pp. 1–7.

16. Tian, Gelernter, Wang, Li and Yu, Traffic sign detection using a multi-scale recurrent attention network, IEEE Trans Intell Transp Syst 20(12) (2019), 4466–4475.

17. Everingham, Gool L.V., Williams C.K.I., Winn J.M. and Zisserman, The PASCAL visual object classes (VOC) challenge, Int J Comput Vis 88(2) (2010), 303–338.

18. Lin, Maire, Belongie S.J., Hays, Perona, Ramanan, Dollár and Zitnick C.L., Microsoft COCO: Common objects in context, in Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, Proceedings, Part V, vol. 8693, 2014, pp. 740–755.

19. Liu, Du, Tian and Wen, MR-CNN: A multi-scale region-based convolutional neural network for small traffic sign recognition, IEEE Access 7 (2019), 57120–57128.

20. Tabernik and Skocaj, Deep learning for large-scale traffic-sign detection and recognition, IEEE Transactions on Intelligent Transportation Systems 21(4) (2020), 1427–1440.

21. Lee H.S. and Kim, Simultaneous traffic sign detection and boundary estimation using convolutional neural network, IEEE Trans Intell Transp Syst 19(5) (2018), 1652–1663.

22. Yang, Luo, Xu and Wu, Towards real-time traffic sign detection and classification, IEEE Transactions on Intelligent Transportation Systems 17(7) (2016), 2022–2031.

23. Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, Computer Science (2014).

24. Szegedy, Liu, Jia, Sermanet, Reed S.E., Anguelov, Erhan, Vanhoucke and Rabinovich, Going deeper with convolutions, in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, 2015, pp. 1–9.

25. He, Zhang, Ren and Sun, Deep residual learning for image recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 2016, pp. 770–778.

26. Stallkamp, Schlipsing, Salmen and Igel, The German traffic sign recognition benchmark: A multi-class classification competition, in International Joint Conference on Neural Networks, 2011.

27. Stallkamp, Schlipsing, Salmen, et al., Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition, Neural Netw 32 (2012), 323–332.

28. Uijlings J.R.R., Sande K.E.A.V.D., Gevers and Smeulders A.W.M., Selective search for object recognition, International Journal of Computer Vision 104(2) (2013), 154–171.

29. Redmon and Farhadi, YOLOv3: An incremental improvement, CoRR abs/1804.02767 (2018).

30. Shelhamer, Long and Darrell, Fully convolutional networks for semantic segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4) (2017), 640–651.

31. Lin, Dollár, Girshick R.B., He, Hariharan and Belongie S.J., Feature pyramid networks for object detection, in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 2017, pp. 936–944.

32. Bochkovskiy, Wang and Liao H.M., YOLOv4: Optimal speed and accuracy of object detection, CoRR abs/2004.10934 (2020).

33. Liu, Qi, Qin, Shi and Jia, Path aggregation network for instance segmentation, in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 2018, pp. 8759–8768.

34. Zhu, Liang, Zhang, Huang, Li and Hu, Traffic-sign detection and classification in the wild, in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 2016, pp. 2110–2118.

35. Zhang, Huang, Jin and Li, A real-time Chinese traffic sign detection algorithm based on modified YOLOv2, Algorithms 10(4) (2017), 127.

36. He, Zhang, Ren and Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 37(9) (2015), 1904–1916.

37. Wang, Liao H.M., Wu, Chen, Hsieh and Yeh, CSPNet: A new backbone that can enhance learning capability of CNN, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2020, Seattle, WA, USA, 2020, pp. 1571–1580.

38. Szegedy, Ioffe, Vanhoucke and Alemi A.A., Inception-v4, Inception-ResNet and the impact of residual connections on learning, in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, California, USA, 2017, pp. 4278–4284.

39. Liu, Qian, Li, Wang and Zhang, CAFFNet: Channel attention and feature fusion network for multi-target traffic sign detection, Int J Pattern Recognit Artif Intell 35(7) (2021), 2152008:1–2152008:20.

40. Ren, Huang, Fan, Han and Deng, Real-time traffic sign detection network using DS-DetNet and lite fusion FPN, J Real Time Image Process 18(6) (2021), 2181–2191.

41. Serna C.G. and Ruichek, Traffic signs detection and classification for European urban environments, IEEE Trans Intell Transp Syst 21(10) (2020), 4388–4399.
