Abstract
In this paper, we propose and develop a Convolutional Neural Network (CNN) classifier, “WAF-LeNet”, for traffic sign recognition and identification in support of autonomous driving technologies. The implemented architecture is a deep fifteen-layer network, selected after extensive trials to be fast enough for the designated application. The CNN is trained using the Adam optimization algorithm, a variant of the Stochastic Gradient Descent (SGD) technique. The learning process is carried out using the well-known “German Traffic Sign Dataset – GTSRB”. The data is partitioned into training, validation and testing data sets. Additionally, random traffic sign images are collected from the web and used to test the robustness of the proposed CNN classifier. The paper describes the development process in detail and presents the image processing pipeline used in the development. The proposed approach correctly identified 96.5% of the testing data set and 100% of the robustness data set with a much smaller and faster network than its counterparts.
Introduction
In the past decade, the automobile industry has shifted toward intelligent vehicles equipped with driving assistance systems [1, 2, 3, 4, 5, 6]. Recently, the industry has introduced vision systems in its high-end cars. In the research community, vision systems for traffic sign recognition have also been published [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. All these systems are designed for their specific application as a pipeline of carefully tuned algorithms [7, 8, 9, 10]. In the first step of the pipeline, the input is pre-processed with fixed algorithms such as lighting correction [9], histogram stretching [10], color segmentation [7], edge detection [11], Hough transform [12], etc. The result of these steps is then classified by a matching algorithm or by machine learning techniques such as Artificial Neural Networks (ANNs) [13, 14, 15] or Support Vector Machines (SVMs) [7]. Designing such a pipeline of algorithms is very time consuming, and the design must be redone to support new road signs or other objects. Mapping these computationally complex algorithms to onboard vision platforms is also a time-consuming task. This is especially true if the system is already deployed and the “vendor” wants to sell new functionalities for existing platforms.
In the last decade, deep learning has made substantial progress and has come to dominate the field of machine learning compared to classical methods, especially after the recent advancements in GPUs (Graphics Processing Units) [18]. Accordingly, deep learning has produced astounding results in applications related to image recognition, natural language processing, autonomous cars, etc.
Distribution of training data per each traffic sign.
In this paper, our proposed solution differs from the current state-of-the-art systems in that it uses a platform-independent, flexible approach. The proposed vision system is based on a fully trainable CNN that is able to detect traffic signs of any shape: circular, rectangular or triangular. The CNN is trained for 43 different traffic signs, but this number can easily be increased through transfer learning (by adding training data for the new traffic signs and reusing the currently trained network after modifying the output layer). The inputs of the CNN are the raw pixel values, and the outputs are direct confidence values representing the likelihood of a specific traffic sign.
As an example for comparison purposes, the well-known work of Sermanet et al. [13] achieved 99.17% accuracy, but by using a much larger multi-scale convolutional neural network that is almost 100 times bigger than the one proposed in this paper. This of course makes its real-time implementation extremely costly or almost infeasible. Moreover, Qian et al. [16] proposed a CNN, also trained on the same data set (GTSRB), that extracts the visual attributes using what is called “Max Pooling Positions”. However, that CNN is almost 4 times the size of “WAF-LeNet” (which means it is slower) and achieves only 95.53% accuracy.
Samples of training images from different classes.
Distribution of validation data per each traffic sign.
The proposed CNN is trained and validated using “The German Traffic Sign Recognition Benchmark” (GTSRB) [19]. The whole dataset consists of 51,839 traffic sign image samples. The data is split into three partitions. The first partition is the training data, which consists of 34,799 examples (67%). Each example is an image of size 32 × 32 pixels.
The second partition is the validation data set, which consists of 4,410 examples (8.5%). Each example is an image of size 32 × 32 pixels.
Samples of training images from different classes.
The third partition is the test data set, which consists of 12,630 examples (24.4%). Each example is an image of size 32 × 32 pixels.
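The partition shares quoted above can be checked with a few lines of arithmetic. The following is only a minimal sketch that uses the example counts stated in the text; the helper name `partition_shares` is hypothetical.

```python
# Partition sizes as stated in the text.
partitions = {"training": 34_799, "validation": 4_410, "test": 12_630}
total = sum(partitions.values())


def partition_shares(sizes):
    """Return each partition's share of the whole dataset, in percent."""
    whole = sum(sizes.values())
    return {name: 100.0 * count / whole for name, count in sizes.items()}


shares = partition_shares(partitions)
for name, pct in shares.items():
    print(f"{name:<10} {partitions[name]:>6} examples  {pct:5.1f}%")
print(f"{'total':<10} {total:>6} examples")
```

The three counts add up to the 51,839 samples reported for the whole dataset.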
Distribution of test data per each traffic sign.
Samples of training images from different classes.
Before the training/validation images are used, they need to be pre-processed to make them more useful and convenient for the learning process. The pre-processing steps are meant to improve the training results and reduce the computation as much as possible. The implemented pre-processing steps are described below in order of execution:
Normalization (color): Min-max scaling is applied to the color image data, so that each RGB channel is scaled into the range 0.1 to 0.9.
Converting the color training images to grey: This is done using the OpenCV [20] function “cvtColor”. This step reduces the training computation to 1/3.
Normalization (grey): Min-max scaling (between 0.1 and 0.9) is applied to the grey image data.
Filtering the noise: This is done using the OpenCV [20] function “GaussianBlur”, which executes the Gaussian filter algorithm, with a kernel size of 5.
Shuffling the training data: This is done once per training epoch to avoid pattern memorization and, consequently, getting trapped in local minima.
All the above steps have been tried, including several pre-processing variants on color images and Gaussian blur filtering. Across our many trial-and-error experiments, grey images with no blur filtering gave the best performance (which supports the results in [9]).
Finally, the pre-processing pipeline actually used is: color image → grey image → normalized grey image → shuffling.
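The retained pre-processing steps (grey conversion, min-max scaling to the range 0.1 to 0.9, and per-epoch shuffling) can be sketched in plain Python as follows. This is an illustrative sketch and not the authors' code: the function names are hypothetical, the grey conversion uses the standard luminance weights instead of calling OpenCV, and computing the minimum and maximum per image is an assumption.

```python
import random


def to_grey(rgb_image):
    """Convert rows of (R, G, B) tuples to grey levels.

    Uses the standard luminance weights 0.299, 0.587 and 0.114
    (assumption: equivalent to the RGB-to-grey conversion in the text).
    """
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def min_max_scale(grey_image, low=0.1, high=0.9):
    """Linearly rescale pixel values of one image into [low, high]."""
    pixels = [p for row in grey_image for p in row]
    p_min, p_max = min(pixels), max(pixels)
    span = p_max - p_min
    if span == 0:  # flat image: map every pixel to the middle of the range
        return [[(low + high) / 2 for _ in row] for row in grey_image]
    return [[low + (p - p_min) * (high - low) / span for p in row]
            for row in grey_image]


def shuffle_together(images, labels, rng):
    """Shuffle images and labels with the same permutation (once per epoch)."""
    order = list(range(len(images)))
    rng.shuffle(order)
    return [images[i] for i in order], [labels[i] for i in order]


# Toy 2x2 colour image, used only to exercise the functions.
tiny = [[(0, 0, 0), (255, 255, 255)], [(255, 0, 0), (0, 0, 255)]]
scaled = min_max_scale(to_grey(tiny))
images, labels = shuffle_together([scaled, scaled], [0, 1], random.Random(0))
```

Keeping the shuffle as a separate function makes it easy to call at the start of every epoch while the grey conversion and scaling are done only once.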
The proposed convolutional network consists of four main types of layers: the convolutional layer, the ReLU layer, the pooling layer, and the fully connected layer [13, 21].
The convolutional layer: A kernel, also called a feature detector or filter, slides over the pixel matrix of the input image. At each position, the dot product between the filter and the overlapping sub-matrix of the image is computed, and the results form a feature map. The detected features depend on the values of the filter. The size of the feature maps depends on the number of filters (depth), the step size with which the filter is shifted (stride), and the presence of zero-padding (padding the input with zeros at the borders).
The ReLU: The Rectified Linear Unit (ReLU) is a non-linear operation applied directly after the convolution that replaces every negative value with zero. The purpose of ReLU is to introduce non-linearity into the CNN, since most real-world data, including the traffic sign data that the CNN needs to learn, is non-linear.
The pooling layer: Pooling is a nonlinear down-sampling operation that uses square filters (such as 2 × 2).
The fully connected layer: The fully connected layer is responsible for classification and decision making. During training, the classification error is computed at its output and back-propagated, and the weights of all layers are updated repeatedly until the error reaches its minimum. This layer is a traditional Multi-Layer Perceptron that uses a Softmax activation function in the output layer. The term “Fully Connected” implies that every neuron in the previous layer is connected to every neuron in the next layer.
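To make the first three layer types concrete, the sketch below implements a single-channel convolution without padding, ReLU, and max pooling in plain Python. It is only an illustration under stated assumptions: the function names are hypothetical, images and kernels are assumed square, max pooling is assumed as the pooling operation, and the 5 × 5 kernel used in the size example is borrowed from the classic LeNet, not taken from Table 1.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a feature map: (W - F + 2P) / S + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


def conv2d(image, kernel, stride=1):
    """Slide a square kernel over a square image (no padding)."""
    k = len(kernel)
    out = conv_output_size(len(image), k, stride)
    return [[sum(image[i * stride + a][j * stride + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(out)]
            for i in range(out)]


def relu(feature_map):
    """Replace every negative value with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    n = len(feature_map) // size
    return [[max(feature_map[i * size + a][j * size + b]
                 for a in range(size) for b in range(size))
             for j in range(n)]
            for i in range(n)]


# A 4x4 ramp image and a 3x3 vertical-edge kernel (toy values).
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
pooled = max_pool(relu(conv2d(image, kernel)))

# Size example: a 32-pixel-wide input and an assumed 5x5 kernel
# give a 28-pixel-wide feature map, halved to 14 by 2x2 pooling.
width_after_conv = conv_output_size(32, 5)
```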
WAF-LeNet architecture
WAF-LeNet detailed structure.
The implemented neural network model is an upgraded version of the LeNet architecture [22] and is given the name WAF-LeNet. Figure 7 depicts the detailed structure of WAF-LeNet, and Table 1 below presents the layer-specific information.
Two Drop-out layers are added and the 1
The learning algorithm
In this work, the Adam learning algorithm [23] is used to train the proposed WAF-LeNet and update its weights iteratively based on the GTSRB training data [19]. Adam is an optimization algorithm that is used instead of the classical Stochastic Gradient Descent (SGD) learning algorithm [24]. The method computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.
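Adaptive Moment Estimation (Adam) [23] is a technique that calculates an adaptive learning rate for each parameter in the neural network individually. It maintains an exponentially decaying average of past squared gradients $v_t$ and, in addition, an exponentially decaying average of past gradients $m_t$:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
where $g_t$ is the gradient at time step $t$, and $m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively. Since $m_t$ and $v_t$ are initialized as zero vectors, they are biased toward zero, especially during the initial time steps.
These biases are counteracted by calculating bias-corrected first and second-moment estimates instead:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Then, the above estimates ($\hat{m}_t$ and $\hat{v}_t$) are used to update the parameters:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$
for all neural network model’s parameters $\theta$, where $\eta$ is the learning rate.
In the main paper that presents the Adam technique [24], it was proposed to use the default values of 0.9 for $\beta_1$, 0.999 for $\beta_2$, and $10^{-8}$ for $\epsilon$.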
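The Adam update can be illustrated with a short, self-contained sketch. It follows the standard published form of the algorithm (moving averages of the gradient and squared gradient, bias correction, then the parameter update) and is applied to a toy quadratic rather than to WAF-LeNet. The function name `adam_minimize`, the toy objective, and the learning rate of 0.05 are illustrative assumptions, not values from the paper.

```python
import math


def adam_minimize(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimise a function with Adam, given a function returning its gradient."""
    theta = list(theta)
    m = [0.0] * len(theta)  # first-moment estimate (mean of gradients)
    v = [0.0] * len(theta)  # second-moment estimate (mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1 - beta2 ** t)  # bias-corrected second moment
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def toy_gradient(theta):
    """Gradient of f(a, b) = (a - 3)^2 + (b + 1)^2, minimised at (3, -1)."""
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]


solution = adam_minimize(toy_gradient, [0.0, 0.0], lr=0.05, steps=2000)
print(solution)  # expected to end up close to [3, -1]
```

Each parameter keeps its own pair of moment estimates, which is what gives every weight an individual effective step size.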
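The output layer of the proposed WAF-LeNet is a softmax function. The softmax function, or normalized exponential function [26], is a generalization of the logistic function that “squashes” a K-dimensional vector $\mathbf{z}$ of arbitrary real values into a K-dimensional vector $\sigma(\mathbf{z})$ of real values in the range (0, 1) that add up to 1:
$$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K$$
In probability theory, the output of the softmax function can be used to represent a categorical distribution, that is, a probability distribution over $K$ different possible outcomes.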
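A minimal sketch of the softmax function in plain Python is given below. Subtracting the largest score before exponentiating is a common numerical-stability measure and an addition of this sketch; it does not change the result.

```python
import math


def softmax(scores):
    """Map K real-valued scores to K probabilities that sum to 1."""
    shift = max(scores)  # stabilises exp() for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The ordering of the scores is preserved, so the class with the largest score also receives the largest probability.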
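The data loss, which arises in supervised learning problems, measures the compatibility between a prediction (e.g. the class scores in classification) and the ground truth label. The data loss takes the form of an average over the data losses for every individual example. That is,
$$L = \frac{1}{N} \sum_{i=1}^{N} L_i$$
where $N$ is the number of training examples.
Moreover, the multi-class cross entropy can then be formulated as follows [26, 27, 28]:
$$L_i = -\sum_{k=1}^{K} y_{i,k} \log p_{i,k}$$
where the labels $y_{i,k}$ are one-hot encoded (1 for the true class of example $i$ and 0 otherwise) and $p_{i,k}$ is the softmax probability assigned to class $k$ for example $i$.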
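The cross-entropy data loss can be sketched as follows for one-hot labels and softmax probabilities. This is the standard textbook form, not the authors' implementation; the helper names and the small clipping constant `eps` (which avoids taking the logarithm of zero) are assumptions of this sketch.

```python
import math


def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot vector."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def cross_entropy(probs, target, eps=1e-12):
    """Multi-class cross entropy for a single example."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, target))


def mean_data_loss(batch_probs, batch_labels):
    """Average the per-example losses over a batch."""
    num_classes = len(batch_probs[0])
    losses = [cross_entropy(p, one_hot(y, num_classes))
              for p, y in zip(batch_probs, batch_labels)]
    return sum(losses) / len(losses)


# Toy batch: the first prediction is confident and correct, the second is not.
batch_probs = [[0.88, 0.10, 0.02], [0.30, 0.30, 0.40]]
batch_labels = [0, 1]
loss = mean_data_loss(batch_probs, batch_labels)
```

With one-hot labels only the probability of the true class contributes, so the per-example loss reduces to the negative logarithm of that probability.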
Raw traffic sign samples from the web.
The WAF-LeNet model is trained with the Adam optimization algorithm using the parameters listed in Tables 2–4. The training results are presented in Table 5.
WAF-LeNet training parameters (learning rate)
WAF-LeNet training parameters (PKeep)
WAF-LeNet training parameters (others)
WAF-LeNet training results
Illustration of one-hot activation.
The training of WAF-LeNet was carried out over several trials to achieve the presented results. The following observations were made during the training process:
Several trials were carried out to train the current WAF-LeNet architecture using the normalized RGB images. However, the results were not good enough, and the validation accuracy did not exceed 91%.
The training trials were then switched to the normalized grey-image data, with more success: the validation accuracy reached 93% but did not go much beyond that.
Using Gaussian filtering for both the grey and RGB images did not improve the accuracy; on the contrary, it made the accuracy considerably worse. For this reason, this technique was dropped.
In order to improve the performance further, two drop-out layers have been added as shown in Table 1 as well as increasing the breadth of the 1
For further improvement, the learning rate and the keep probability of the drop-out layers were updated according to the profiles shown in Tables 2 and 3. This approach significantly improved the validation accuracy, which reached 96.4% as shown in Table 5.
After training, WAF-LeNet was applied to the testing data, resulting in 94.5% accuracy.
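The role of the keep probability in a drop-out layer can be sketched as below. This is a generic illustration and not the authors' implementation: the function name is hypothetical, and rescaling the surviving activations by 1/p_keep (so-called inverted drop-out) is an assumption about how the layer is realised.

```python
import random


def dropout(activations, p_keep, rng):
    """Keep each activation with probability p_keep, zero it otherwise.

    Kept activations are divided by p_keep so that the expected value of
    every output equals its input (assumption: inverted drop-out).
    """
    if not 0.0 < p_keep <= 1.0:
        raise ValueError("p_keep must be in (0, 1]")
    return [a / p_keep if rng.random() < p_keep else 0.0 for a in activations]


rng = random.Random(42)
layer_output = [0.5, 1.0, 1.5, 2.0]
training_pass = dropout(layer_output, p_keep=0.5, rng=rng)
inference_pass = dropout(layer_output, p_keep=1.0, rng=rng)  # nothing dropped
```

Setting the keep probability to 1.0 disables the layer, which is the usual setting when the network is evaluated on validation or test data.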
Cropped and centered traffic web sign samples.
Internal generated probabilities for different classes.
Ten new random traffic sign images have been downloaded from the web for further robustness testing. Samples of these raw images are shown in Fig. 8.
The fully trained WAF-LeNet has been tested (as illustrated in Fig. 9) on these 10 raw images (after being resized to 32
To give more insight into the performance of the proposed WAF-LeNet, Fig. 11 shows how the classifier identified the traffic sign by showing the internally generated top five probabilities (the Softmax output) in “layer 15” before applying the one-hot concept (taking the maximum as the final output). The graph shows that the classifier identified the traffic sign image on the left side as “ID#1 – 30 km/h speed limit” with 88% probability, as “ID#0 – 20 km/h speed limit” with 10% probability, and as “ID#38 – keep right” with 2% probability. This example shows how robust the proposed classifier is. For more insight into how the CNN works, Fig. 12 depicts the 1
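Ranking the softmax outputs and taking the maximum as the final decision can be sketched as follows. The probabilities for classes 1, 0 and 38 are the ones quoted in the text for Fig. 11; setting all other classes to zero and the function names are assumptions of this sketch.

```python
def top_k(probs, k=5):
    """Return the k most probable (class_id, probability) pairs."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def predict(probs):
    """Final decision: the class with the highest probability."""
    return max(range(len(probs)), key=lambda class_id: probs[class_id])


# 43 classes; only the three values quoted in the text are non-zero here.
probs = [0.0] * 43
probs[1], probs[0], probs[38] = 0.88, 0.10, 0.02

for class_id, p in top_k(probs, k=3):
    print(f"ID#{class_id}: {p:.0%}")
print("predicted class:", predict(probs))
```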
Furthermore, Fig. 13 shows the generated internal top five probabilities in “layer 15” with the traffic sign “ID#18 general caution” identified with a probability of almost 100%. Moreover, for additional elaboration, the feature map of the 1
Internal generated probabilities for different classes.
The presented results and further analysis identify some shortcomings, which are listed below:
Further investigation of the use of RGB images instead of grey ones needs to be carried out. Although using greyscale images reduces complexity and computation, it deprives the CNN of precious information (such as color) that is believed to be crucial in classifying traffic signs.
Given how critical traffic sign recognition is to the success of autonomous driving, much more training data is required to improve the performance. This is specifically true for classes that have a relatively low number of samples, like Class 0, Class 41, and Class 42.
There is a very large difference between classes in the number of samples available in the training data, which results in a tendency to recognize the classes with more samples better. Therefore, more work is needed to narrow this gap.
Based on the 94.5% accuracy achieved on the original testing data set (Table 5), WAF-LeNet in its current state performs well when the tested images are well-centered and well-cropped (meaning the traffic sign fills 75% or more of the image), as shown in Figs 2 and 10. Otherwise, the network's performance is insufficient.
Based on the achieved results and the above findings, the below points summarize the suggested improvements:
Increasing the depth of the CNN (adding more convolutional layers and feature-extraction filters) and using RGB images instead of greyscale ones should improve the model performance significantly, as converting images to grey makes the data lose vital information about the traffic signs. Therefore, more trials need to be carried out using color images with deeper and wider networks.
Equalizing the number of samples in each class by generating more data from the available data through data augmentation. Data augmentation is believed to be a good way to enlarge the training set. It can be done by applying image processing operations to the current data samples, such as skewing, rotation, flipping, adding random bright and dark spots, adding dusty noise, etc. This should significantly improve the classification robustness.
Adding “Skip Connections” to the CNN (connections that skip layers) to extract the “global” and invariant shapes and structures [13] of the traffic signs may considerably improve the performance.
WAF-LeNet is mainly used for recognition. In order to provide the complete functionality, it should be augmented with another model/network that focuses on detecting and localizing the traffic sign within the larger image [29, 30].
In this paper, a CNN-based classifier, “WAF-LeNet”, has been proposed. The architecture of the CNN is presented in detail. The structure of the training, validation, and testing data is described. The image processing algorithms involved are described as well, and their contributions are analyzed. WAF-LeNet has shown very good performance in recognizing 43 different traffic sign classes. Its validation accuracy reached 96.4%, and it achieved 100% correct identifications in the robustness test. The shortcomings of the proposed approach have been discussed, and improvement actions for future work have been proposed. The presented solution is a step toward making fully autonomous cars a reality in the near future.
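The suggested class equalization through data augmentation can be sketched as follows. This is an illustrative sketch under stated assumptions and not part of the reported experiments: images are assumed to be grey images scaled to the range 0 to 1, random pixel noise stands in for the richer augmentations listed above, and the function names are hypothetical.

```python
import random
from collections import Counter


def add_noise(image, rng, amplitude=0.05):
    """Jitter every pixel by a small random amount, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.uniform(-amplitude, amplitude)))
             for p in row]
            for row in image]


def balance_classes(images, labels, rng):
    """Oversample minority classes with augmented copies until all match."""
    target = max(Counter(labels).values())
    by_class = {}
    for image, label in zip(images, labels):
        by_class.setdefault(label, []).append(image)
    out_images, out_labels = list(images), list(labels)
    for label, members in by_class.items():
        for _ in range(target - len(members)):
            out_images.append(add_noise(rng.choice(members), rng))
            out_labels.append(label)
    return out_images, out_labels


# Toy data: class 0 has three 2x2 images, class 1 has only one.
rng = random.Random(7)
images = [[[0.2, 0.4], [0.6, 0.8]]] * 3 + [[[0.9, 0.1], [0.5, 0.5]]]
labels = [0, 0, 0, 1]
balanced_images, balanced_labels = balance_classes(images, labels, rng)
```

Augmentations that change the meaning of a sign (for example, mirroring a directional sign) would need to be excluded per class in a real pipeline.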
Acknowledgments
This work used the HPC facilities of the American University of the Middle East, Kuwait.
