Abstract
End-to-end deep learning has attracted considerable interest in autonomous driving. End-to-end autonomous driving uses a deep convolutional neural network to establish an input-to-output mapping. However, existing end-to-end driving models only predict the steering angle from front-facing camera data and extract spatial-temporal information poorly. Based on deep learning and the attention mechanism, we propose an end-to-end driving model that combines a multi-stream attention module with a multi-stream network. As a multimodal multitask model, the proposed model not only extracts spatial-temporal information from multiple modalities, but also adopts multitask learning with hard parameter sharing to predict the steering angle and speed. Furthermore, the proposed multi-stream attention module predicts the attention weights of the streams based on the multimodal feature fusion, which encourages the model to attend to the streams that positively affect the prediction result. We evaluate the proposed driving model on the public Udacity dataset against existing models. Experimental results show that it performs better than the existing methods.
Introduction
As the development direction of road transportation, autonomous driving can effectively reduce traffic accidents and make rational use of traffic resources.
The traditional rule-based autonomous driving method consists of three modules: the perception system, the decision system and the control system [1]. Its advantages lie in the clear division of labor among the modules, strong interpretability and high system stability. However, decision-making and control in the rule-based autonomous driving method are complicated. Because its decisions depend strongly on preset rules, this method cannot learn on its own. A reliable method should make different decisions in different scenarios and learn from new situations.
With the rapid development of deep learning, combining deep learning with autonomous driving has become a new research direction. Based on deep learning, the end-to-end autonomous driving method takes environmental information as input and generates car control signals. Compared with the traditional rule-based method, the end-to-end method has a strong learning ability and effectively reduces hardware costs. It therefore has both academic significance and commercial value.
End-to-end driving models usually use a single modality as input to predict the steering angle [2–4]. Multimodality shows promising results in subtasks of autonomous driving, such as target detection [5], image segmentation [6] and other fields [7, 8]. We study the effect of multimodality in end-to-end autonomous driving. Moreover, the attention mechanism has achieved excellent results in many fields [9–11]. However, there is a lack of research on applying the attention mechanism to multimodal fusion. Hence, we apply the attention mechanism to multimodal fusion.
In this paper, we propose a novel end-to-end driving model that combines a multi-stream attention module with a multi-stream network. The proposed model extracts spatial-temporal information from multiple modalities, predicts the steering angle and speed, and allocates attention to the streams effectively. The contributions of this paper can be summarized as follows. First, to address the problem that a single modality cannot meet the needs of autonomous driving, we propose a multimodal end-to-end driving model. It takes an RGB frame sequence, an optical flow sequence and a dynamic image as modalities that represent spatial-temporal information from multiple angles, and it performs better than an end-to-end driving model with a single modality. Second, in view of the risk of overfitting in existing end-to-end driving models, the proposed model adopts multitask learning with hard parameter sharing to predict the steering angle and speed simultaneously, which improves the stability and prediction accuracy of the model. Third, after studying the effect of the attention mechanism on multimodal fusion, we propose a multi-stream attention module that predicts the attention weights of the streams based on the multimodal feature fusion, which helps the model allocate attention reasonably across streams. Finally, the proposed driving model has been tested on the public Udacity dataset, and experimental results show that it can accurately predict the steering angle and speed.
Related work
Multitask learning [14] and multimodal learning [15] offer new ideas to researchers seeking end-to-end autonomous driving with better performance. Multitask learning is a machine learning method based on shared representation, which puts multiple related tasks together for learning. Yang et al. [16] proposed a multimodal multitask model, which used the historically predicted driving speed as feedback input to predict the steering angle and speed. This model adopts multitask learning with soft parameter sharing. However, compared to hard parameter sharing, soft parameter sharing has a higher risk of overfitting.
Multimodal learning refers to learning better feature representation through the complementarity between multiple modalities. Based on the multimodal early fusion method, Abou-Hussein et al. [17] used the stacked image and optical flow as the input of the model, in which the steering angle prediction performance is improved. But the early fusion method cannot fully extract the multimodal feature information. Fernandez et al. [18] used the two-stream model for steering angle prediction, which mapped the original image and pre-calculated optical flow to the steering angle. However, it only takes a single RGB frame and a single optical flow as input, which leads to insufficient spatial-temporal information.
In the field of image-sequence-to-text generation, Ng et al. [19] studied a variety of input methods that use explicit motion information as temporal information. The explicit motion information refers to the optical flow [20, 21] computed from consecutive frames. The experimental results of this research and similar research [22, 23] show that it performs better than single-modal CNN and LSTM networks. Simonyan et al. [8] first proposed a two-stream network in the field of action recognition. The two streams extract the spatial and temporal information from the video respectively, and their results are fused to obtain a good classification result.
Inspired by the above work, we research multimodal learning and multitask learning in the field of end-to-end autonomous driving. The proposed end-to-end driving model is a multimodal multitask model. It extracts spatial-temporal information from multimodality and simultaneously predicts the steering angle and speed.
Multimodal learning has become a popular research direction in artificial intelligence due to its excellent effects in various applications. Multimodal fusion [24] is an important research direction in the field of multimodal research, which makes full use of the complementary information in multimodality. How to fuse multimodal information to improve model performance and robustness is the challenge of multimodal fusion. Multimodal fusion is divided into early fusion (based on feature), late fusion (based on the decision) and hybrid fusion.
Early fusion includes feature fusion and pixel data fusion. Feature fusion extracts the representation of feature from each modality and then fuses features at the feature level, which can alleviate the inconsistency between the raw data in each modality. Since deep learning essentially involves learning specific representations of raw data feature, it is sometimes necessary to fuse pixel data before extracting the feature. There are multiple methods for early fusion, including multimodal multiplication or addition of elements at the same position, construction of the encoder-decoder structure and LSTM neural network for information integration.
Late fusion is also called the decision-level fusion. The deep learning model first trains different modalities and then integrates the output results of multiple modalities. Late fusion mainly uses rules to determine the combination of different results, such as maximum fusion, average fusion, Bayesian rule fusion, ensemble learning and other rule fusion methods [25]. However, the late fusion ignores the interaction between multiple modalities.
Hybrid fusion combines early fusion and late fusion. However, it also increases the structural complexity and training difficulty of the model. Hybrid fusion methods are often used for flexible deep learning models. They are widely used in visual question answering, gesture recognition [26] and other fields.
The attention mechanism has been applied in various fields and has achieved satisfactory results. Based on the attention mechanism, Wang et al. [27] proposed ResNet, a widely used deep neural network structure. Hu et al. [9] proposed the channel attention network SENet, which learns the importance of the feature channels in the feature map and, according to that importance, enhances the channel features useful for the current recognition task. Choi et al. [10] proposed a fine-grained attention mechanism for neural machine translation. Vaswani et al. [11] proposed the Transformer, a sequence model built on the attention mechanism. However, there is a lack of research on applying the attention mechanism to multimodal fusion.
The proposed end-to-end driving model extracts multimodal spatial-temporal information and fuses multimodal feature. Based on the multimodal feature fusion, we propose a multi-stream attention module that predicts the attention weights of streams. The multi-stream attention module considers the different effects of streams corresponding to the multimodal input in the multi-stream network, which makes multimodal fusion more effective.
Proposed method
The proposed end-to-end driving model combines the multi-stream attention module with the multi-stream network. The multi-stream network extracts multimodal spatial-temporal information and predicts the steering angle and speed of the vehicle. The multi-stream attention module helps the multi-stream network better distribute attention to streams. This section is divided into three parts. The first part introduces optical flow and dynamic image. The second part introduces the multi-stream network in detail. The third part introduces the attention module in detail.
Preliminaries
Optical flow [20, 21] is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene; it captures the motion changes between two frames. We use the RAFT method [28] to calculate the optical flow. We pre-compute the optical flow and store the optical flow fields as JPG images. The RGB frame at time $t$ is denoted as $I_t$. The optical flow at time $t$, $F_t$, is calculated from the consecutive RGB frames $\{I_{t-1}, I_t\}$, which is denoted as $F_t = f(\{I_{t-1}, I_t\})$.
Dynamic image [29] is an RGB image that summarizes the appearance and dynamic information of a given video sequence. The dynamic image is obtained by optimizing a rank function over a series of frames, in which frames closer to the last frame are supposed to have a higher rank. If $n$ is the length of the RGB frame sequence, the dynamic image at time $t$, $D_t$, is calculated from the RGB frame sequence $\{I_{t-n+1}, I_{t-n+2}, \ldots, I_t\}$, which is denoted as $D_t = d(\{I_{t-n+1}, I_{t-n+2}, \ldots, I_t\})$. Samples of the optical flow and dynamic image of the dataset are shown in Fig. 1.

Left: Original frame. Middle: Optical flow. Right: Dynamic image using 6 frames.
Many end-to-end driving models use a 3-channel RGB frame obtained by the car camera as the input of the model. We propose a single-stream network with the RGB frame sequence as a single modality. When predicting the steering angle and speed at time $t$, if $n$ is the length of the RGB frame sequence, the sequence is stacked into a $3n$-channel tensor as input, which is denoted as $I_t^s = \mathrm{stack}(\{I_{t-n+1}, I_{t-n+2}, \ldots, I_t\})$. Compared with the original single-frame input, the RGB frame sequence captures more spatial information and improves the prediction performance.
The single-stream network mentioned above only obtains spatial information, but driving behavior is closely related to temporal information. Many studies [8, 23] have shown that providing explicit motion information as input makes it easier for the model than learning temporal information implicitly from the original images. As explicit motion information, optical flow and dynamic images improve the understanding of the motion in the scene. Optical flow accurately captures short-term motion information of objects in images [21], and dynamic images comprehensively summarize the appearance and long-term motion information of image sequences [29]. To fully obtain spatial-temporal information, we propose a multimodal multitask multi-stream network that adds an optical flow sequence stream and a dynamic image stream to the single-stream network. The multi-stream network uses the RGB frame sequence, the optical flow sequence and the dynamic image as modalities, so it obtains spatial information from the RGB frame sequence stream, low-level motion information from the optical flow sequence stream and high-level motion information from the dynamic image stream. The optical flow sequence and the dynamic image are generated from the RGB frame sequence. If $n$ is the length of the RGB frame sequence, $n - 1$ frames of optical flow and one dynamic image can be generated. When predicting the steering angle and speed at time $t$, the multimodal input of the multi-stream network includes the stacked RGB frame sequence $I_t^s = \mathrm{stack}(\{I_{t-n+1}, I_{t-n+2}, \ldots, I_t\})$, the stacked optical flow sequence $F_t^s = \mathrm{stack}(\{F_{t-n+2}, F_{t-n+3}, \ldots, F_t\})$ and the dynamic image $D_t$. It is worth noting that the optical flow sequence and the RGB frame sequence are stacked in the same way.
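The indexing of the three modalities can be made concrete with a small sketch. It only computes which frame indices feed each stream for a given $t$ and $n$, following the definitions above; the function name and the returned dictionary layout are illustrative assumptions, not part of the original model code.

```python
def multimodal_indices(t, n):
    """Return the frame indices that feed each stream when predicting at time t.

    n is the length of the RGB frame sequence (hypothetical helper for illustration).
    """
    if n < 2:
        raise ValueError("n must be at least 2 so that one optical flow exists")
    # Stacked RGB input: I_{t-n+1}, ..., I_t (n frames).
    rgb = list(range(t - n + 1, t + 1))
    # F_k is computed from (I_{k-1}, I_k), so the earliest flow is F_{t-n+2}.
    flow = list(range(t - n + 2, t + 1))
    flow_pairs = [(k - 1, k) for k in flow]
    # The dynamic image D_t summarizes the same n RGB frames.
    return {"rgb": rgb, "flow": flow, "flow_pairs": flow_pairs, "dynamic": list(rgb)}


example = multimodal_indices(t=100, n=6)
print(example["rgb"])   # six RGB frames ending at frame 100
print(example["flow"])  # five optical flow fields ending at F_100
```

With $n$ frames the RGB stream always receives $n$ frames and the optical flow stream $n - 1$ fields, which matches the counts stated in the text.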
The multi-stream network extracts information from multimodality through appropriately complex convolutional neural networks. After the multimodal fusion, the multi-stream network adopts multitask learning method with hard parameter sharing to realize the control of the vehicle.
As shown in Fig. 2, the multi-stream network contains multiple streams. Each stream has the same structure, which consists of four convolutional blocks. After the fourth convolutional block, a 4 × 4 pooling layer aggregates information of multiple dimensions into a single number. The network flattens the feature maps into feature vectors and then concatenates them. The multi-stream network outputs the vehicle speed and steering angle predictions through two independent fully connected layer structures. The last fully connected layer uses a linear activation function, so the speed and steering angle predictions are expressed as regression values. Mean square error (MSE) is used as the training loss function for both steering angle and speed. To help model training, we keep the two training losses at the same order of magnitude by adjusting the loss weighting parameters.
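A minimal sketch of the weighted two-task loss follows, written in plain Python for clarity. The weight names `w_angle` and `w_speed` and the example values are assumptions for illustration; the source does not report the weights actually used.

```python
def mse(predictions, targets):
    """Mean square error between two equally long sequences of numbers."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(predictions)


def multitask_loss(angle_pred, angle_true, speed_pred, speed_true,
                   w_angle=1.0, w_speed=1.0):
    """Weighted sum of the steering-angle MSE and the speed MSE.

    w_angle and w_speed are hypothetical weighting parameters; they would be
    tuned so that both terms have the same order of magnitude.
    """
    return w_angle * mse(angle_pred, angle_true) + w_speed * mse(speed_pred, speed_true)


# Speed errors are numerically larger than angle errors here, so the speed
# term is scaled down to bring both terms to a comparable magnitude.
loss = multitask_loss([0.1, -0.2], [0.0, 0.0], [10.0, 12.0], [11.0, 14.0],
                      w_angle=1.0, w_speed=0.01)
print(loss)
```

Because both heads share the convolutional streams (hard parameter sharing), a single scalar loss of this form lets one backward pass update the shared layers for both tasks.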

The architecture of the proposed multi-stream network.
In a multimodal model, the contributions of different modalities to the prediction results are different and dynamic. A multimodal model should suppress relatively unimportant modalities and pay attention to modalities that play a positive role in the prediction results and contain rich feature information. The attention mechanism can effectively allocate attention to different modalities, which is critical to improving the performance of the multimodal model.
A soft attention mechanism is differentiable, so it can be embedded in the model and trained together with the other parts of the model by backpropagation and gradient descent. We design the attention module based on the soft attention mechanism. From the multiple modalities, the multi-stream network extracts feature vectors that represent spatial-temporal information from different angles. The attention module weights these feature vectors to pay more attention to crucial streams.
For the design of the attention module, the simplest approach is to weight the streams by manually adjusting the weighting parameters. However, this method is coarse and cannot allocate attention accurately. Therefore, we propose a weighted attention module that allocates attention to the streams through self-learning attention weighted parameters. Moreover, we further propose a multi-stream attention module based on the weighted attention module. These attention modules are introduced below.
Weighted attention module
The feature vectors extracted by the multi-stream network are denoted as $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^{D}$, where $x_i$ is the output of the $i$-th stream and $D$ is the dimensionality of the feature vector. The streams are given self-learning attention weighted parameters $B = \{b_1, b_2, \ldots, b_n\}$, $b_i \in \mathbb{R}^{1}$, where $b_i$ is the attention weighted parameter of the feature vector $x_i$ and is learned through backpropagation. Then $b_i$ is normalized by the Softmax function, as shown in formula (1):
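Formula (1) does not appear in this text. A reconstruction under the standard definition of Softmax, writing $a_i$ for the normalized attention weight of the $i$-th stream (a symbol assumed here, not taken from the source), would be:

```latex
a_i = \frac{\exp(b_i)}{\sum_{j=1}^{n} \exp(b_j)}, \qquad i = 1, 2, \ldots, n
```

Under this form the weights are positive and sum to one across the $n$ streams.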
The new feature vector accounts for the contribution of each stream in the model. Then, the new feature vectors $Y = \{y_1, y_2, \ldots, y_n\}$ are concatenated to predict the steering angle and speed.
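The weighted attention module can be sketched in a few lines. The version below uses plain Python lists and assumes that each new feature vector is the stream's feature vector scaled by its Softmax-normalized weight; that scaling rule, the function names, and the example values are assumptions for illustration. In a PyTorch implementation, the parameters `b` would presumably be registered as learnable parameters (for example with `torch.nn.Parameter`).

```python
import math


def softmax(values):
    """Numerically stable Softmax over a list of scalars."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_attention(features, b):
    """Scale each stream's feature vector by its normalized weight and concatenate.

    features: list of n feature vectors (one per stream).
    b: list of n scalar attention parameters (learned in the real model).
    Returns the normalized weights and the concatenated weighted vector.
    """
    if len(features) != len(b):
        raise ValueError("one attention parameter is needed per stream")
    weights = softmax(b)
    weighted = [[w * v for v in x] for w, x in zip(weights, features)]
    fused = [v for y in weighted for v in y]
    return weights, fused


# Three streams with equal parameters receive equal weights of 1/3.
weights, fused = weighted_attention([[3.0, 6.0], [9.0, 12.0], [0.0, 3.0]],
                                    [0.0, 0.0, 0.0])
print(weights)
print(fused)
```

Since `b` does not depend on the input, the resulting weights are the same for every sample once training has finished.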
The weighted attention module has the disadvantage that the trained attention weights are constant. It is difficult for the weighted attention module to distribute attention in a targeted way based on the differences among the extracted feature vectors. Therefore, we further propose a multi-stream attention module based on the weighted attention module. The multi-stream attention module predicts the attention weights of the streams from the fused feature vector, which makes the attention distribution more flexible and effective. The architecture of the proposed end-to-end driving model is shown in Fig. 3. The details of the multi-stream attention module are as follows:

The architecture of the proposed end-to-end driving model. The lower part of the figure is the proposed multi-stream network. Three streams correspond to convolution structures. The upper part of the figure is the proposed multi-stream attention module. The multi-stream attention module concatenates the feature vectors to obtain the fused feature vector, and then divides the fused feature vector into k feature clusters. The attention weight of each modality is calculated by a series of operations on the feature clusters. The old feature vector is weighted by the corresponding attention weight to get the new feature vector.
The feature vectors extracted by the multi-stream network are denoted as $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^{D_i}$, where $x_i$ is the output of the $i$-th stream and $D_i$ is the dimensionality of the feature vector $x_i$. Then we concatenate $x_1, x_2, \ldots, x_n$ to get $x$, $x \in \mathbb{R}^{D}$,
Based on the difference of the extracted feature vectors, the proposed end-to-end driving model can give more attention to crucial streams and suppress relatively unimportant streams. Compared with the weighted attention module, the multi-stream attention module is more effective and targeted to extract multimodal spatial-temporal information, which improves the performance of the model.
Datasets
The Udacity dataset used in this paper is provided by Udacity [30] and was collected with the NVIDIA Dave-2 system by driving in various traffic and lighting conditions. Three cameras (left, right, and center) installed behind the windshield of an automobile are used to collect image frames at 20 FPS. Time-stamped video from the cameras is captured simultaneously with the steering angle and speed applied by the human driver. These commands are obtained by tapping into the vehicle's Controller Area Network (CAN) bus.
The images of the Udacity dataset have been extracted from 6 different videos recorded by the automobile. The dataset contains 33808 frames for model training and 5614 frames for model testing. Since the test set of the Udacity dataset has no speed labels, we use the training set of the Udacity dataset for the experiments that need steering angle and speed labels: 80 percent of it is used as the training set, and 20 percent of it is used as the test set.
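The 80/20 split can be sketched as below. The source does not say whether frames are assigned randomly or in recording order; this sketch assumes a split in recording order, and the function name is hypothetical.

```python
def chronological_split(frame_ids, train_fraction=0.8):
    """Split an ordered list of frame identifiers into training and test parts.

    Assumption: the first train_fraction of the frames is used for training
    and the remainder for testing (the source does not specify the order).
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    n_train = int(len(frame_ids) * train_fraction)
    return frame_ids[:n_train], frame_ids[n_train:]


train_ids, test_ids = chronological_split(list(range(33808)))
print(len(train_ids), len(test_ids))
```

Because each sample is built from several consecutive frames, the way frames are assigned to the two subsets affects how much the training and test inputs overlap.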
The steering angles are modelled as 1/r, where r is the turning radius to make the system independent of car geometry. The data was extracted from ROSbag files using a docker interface. The original resolution of the image is 480 × 640.
Data pre-processing
Images taken by the central camera are used for training and testing. The optical flow is calculated with the RAFT method. For example, the optical flow $F_{100}$ at time 100 is calculated from the consecutive frames $\{I_{99}, I_{100}\}$. The dynamic image that corresponds to the current frame is computed from the current frame and the consecutive frames before it. For example, if the length of the RGB frame sequence is 3, the dynamic image $D_{100}$ at time 100 is calculated from the RGB frame sequence $\{I_{98}, I_{99}, I_{100}\}$.
Random brightness jitter is applied as a data augmentation strategy to mimic the changes in sunlight and shadow during driving. The brightness jitter ratio range is set to ±5%. All the images are normalized to [-1, 1] and the angles are normalized to [-π, π] for better convergence during training. Before feeding the images into the model, we resize them to 120 × 280 to expand the road and shrink the background.
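The brightness jitter and pixel normalization can be sketched as follows. The sketch assumes 8-bit pixel values, a multiplicative jitter factor drawn uniformly from the ±5% range, and a linear mapping from [0, 255] to [-1, 1]; none of these details are stated in the source, and the function names are hypothetical.

```python
import random


def jitter_brightness(pixels, max_ratio=0.05, rng=random):
    """Scale all pixel values by one random factor in [1 - max_ratio, 1 + max_ratio].

    Values are clipped to the 8-bit range [0, 255] (assumed input range).
    """
    factor = 1.0 + rng.uniform(-max_ratio, max_ratio)
    return [min(255.0, max(0.0, p * factor)) for p in pixels]


def normalize_pixels(pixels):
    """Map pixel values from [0, 255] linearly to [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]


rng = random.Random(0)  # fixed seed so the example is repeatable
augmented = jitter_brightness([100.0, 200.0], rng=rng)
print(normalize_pixels(augmented))
```

Applying one factor to the whole image changes overall brightness without altering local contrast, which is one plausible way to mimic changing sunlight.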
Experiments and results
Implementation details
The experiments are implemented on a workstation with two GeForce GTX Titan GPUs. All code is written in the PyTorch framework. The model is trained for a total of 150 epochs with a batch size of 64. Adam is used as the optimizer, with β1 = 0.9, β2 = 0.99 and ε = 1e-8. The initial learning rate is set to 1e-4 for experiments on the Udacity dataset. Weight decay is used as regularization, with a weight decay factor of 1e-4. We follow existing studies and use root mean square error (RMSE) as the metric.
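For reference, the RMSE metric can be written as a short function. This is a generic sketch of the standard definition in plain Python, not the evaluation script used in the experiments.

```python
import math


def rmse(predictions, targets):
    """Root mean square error between predicted and true values."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - y) ** 2 for p, y in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


print(rmse([3.0, -3.0], [0.0, 0.0]))
```

RMSE is expressed in the same unit as the predicted quantity, so the values for steering angle and speed are not directly comparable with each other.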
Evaluation of multi-stream network
During driving, experienced drivers make decisions depending on historical information. To explore the influence of the length of historical information on the multi-stream network, we conducted experiments on the RGB frame sequence with the length of 4-7. It is worth noting that when the length of the RGB frame sequence is changed, the optical flow sequence and the dynamic image will also be changed accordingly.
Table 1 shows that as the length of historical information increases, the prediction performance of the network improves rapidly. However, after the length of the historical information is 6, the prediction performance of the network begins to decline. The best performance is achieved when the length of the historical information is 6. We use historical information with the length of 6 for the following experiments.
Performance comparison (RMSE) of the multi-stream network with different lengths of historical information
Since people's driving behaviors are time-dependent, it is intuitive that more historical information can improve the prediction performance of the network. When the length of historical information is 7, however, the prediction performance of the network decreases. This indicates that when there is too much historical information, the network may be misled by redundant historical information.
To explore the influence of the individual streams in the multi-stream network, we designed an ablation experiment. The networks compared in the ablation experiment are as follows: Network 1): Single-stream network (a single RGB frame); Network 2): Single-stream network (RGB frame sequence); Network 3): Two-stream network (RGB frame sequence and optical flow sequence); Network 4): Two-stream network (RGB frame sequence and dynamic image); Network 5): Multi-stream network (RGB frame sequence, optical flow sequence and dynamic image).
As shown in Table 2, the multi-stream network achieves the best prediction performance among the compared networks. The multi-stream network uses the RGB frame sequence, the optical flow sequence, and the dynamic image as modalities, which allows it to obtain spatial-temporal information comprehensively and improves the prediction performance of the model. The comparison of network 1) and network 2) shows that the RGB frame sequence gives better prediction performance than the single RGB frame, which indicates that the RGB frame sequence contains more spatial information. Comparing networks 3) and 4) with network 2), it can be found that when the input includes both spatial and temporal information, the prediction performance of the network improves significantly. Comparing network 3) with network 4), it can be found that the network with the optical flow sequence is better than the network with the dynamic image, which indicates that the optical flow sequence contains more temporal information.
Ablation experiment performance comparison (RMSE) of the multi-stream network
Figure 4 shows the training losses of different networks. As the epoch increases, the training loss of the multi-stream network is always the lowest and decreases rapidly. The multi-stream network is easy to train and realizes accurate prediction of steering angle and speed.

Comparison of training losses of different networks.
Figure 5 shows the steering angle and speed prediction curve of the multi-stream network. The red solid line is the true value and the green dashed line is the predicted value of the multi-stream network. It can be found that the predicted curve of the steering angle is consistent with the true value curve and stable. The multi-stream network accurately predicts the steering angle and effectively ensures the stability of autonomous vehicles. The changing trend of the speed prediction curve is basically consistent with the changing trend of the true value curve, which indicates that the network can learn the changing trend of the speed. Based on the same environmental conditions, the available driving speed is not unique and has an adjustment range. Therefore, the predicted value of the speed of the multi-stream network has some deviation from the true value.

Prediction curve of the multi-stream network.
The proposed end-to-end driving model (called 'MS-MSNet' in experiments) combines the multi-stream attention module with the multi-stream network. We explore the influence of the number of feature clusters on the prediction performance of MS-MSNet. The feature map generated by each stream has 24 channels, so the feature maps generated by MS-MSNet have a total of 72 channels. As shown in Table 3, when the number of feature clusters is small, the attention learning ability of the multi-stream attention module is insufficient. In this case, each feature cluster contains too many feature elements, which weakens the correlation among the elements within a cluster, and each feature cluster represents the local spatial-temporal information only roughly.
Performance comparison (RMSE) of the proposed driving model with different numbers of feature clusters
As the number of feature clusters increases, the attention learning ability of the multi-stream attention module gradually improves. The best performance is achieved when the number of feature clusters is 72. In this case, the number of feature clusters corresponds to the number of feature map channels, so the feature elements within each feature cluster are strongly correlated and each feature cluster is an effective representation.
When the number of feature clusters becomes too large, the prediction performance of the model becomes worse and the number of model parameters grows. The number of feature elements within each feature cluster is then too small, which makes it difficult for each feature cluster to represent the local spatial-temporal information accurately.
We studied the influence of the attention module. As shown in Table 4, the multi-stream network improves the prediction performance after combining different attention modules, which verifies the effectiveness of the attention module. The attention module helps the multi-stream network distribute attention. The multi-stream attention module has better performance than the weighted attention module because the multi-stream attention module predicts the attention weights of streams based on the multimodal feature fusion. The multi-stream attention module helps the network to allocate attention more flexibly and reasonably.
Performance comparison (RMSE) of the multi-stream network with different attention modules
The self-driving problem is closely linked with safety. Thus, how a CNN "sees" the world should be explainable to human beings. To tackle this problem, we use the VisualBackProp (VBP) method proposed by NVIDIA [31]. The procedure is as follows: 1) The activation maps of each convolutional layer are averaged, resulting in one average map per layer. 2) The deepest average map is up-scaled to the size of the average map of the previous layer by deconvolution. 3) The up-scaled map is multiplied pointwise with the average map of the previous layer to obtain a new map. 4) Steps 2) and 3) are repeated with the new map until a mask with the size of the input image is obtained. Finally, the mask is overlaid on the input image to visualize the result.
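The four steps can be illustrated with a small sketch on nested Python lists. It is a simplified illustration rather than the VBP implementation used in the experiments: nearest-neighbor up-scaling stands in for the deconvolution of step 2, and all function names are hypothetical.

```python
def average_map(layer):
    """Step 1: average a layer's activation maps (channels x H x W) over channels."""
    c, h, w = len(layer), len(layer[0]), len(layer[0][0])
    return [[sum(layer[k][i][j] for k in range(c)) / c for j in range(w)]
            for i in range(h)]


def upscale_nearest(mask, out_h, out_w):
    """Step 2 stand-in: nearest-neighbor up-scaling (VBP itself uses deconvolution)."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]


def visual_backprop(activations, input_h, input_w):
    """Combine averaged maps from the deepest layer back to the input size.

    activations: list of layers ordered from shallowest to deepest,
    each layer given as channels x H x W nested lists.
    """
    averages = [average_map(layer) for layer in activations]
    mask = averages[-1]
    for prev in reversed(averages[:-1]):
        up = upscale_nearest(mask, len(prev), len(prev[0]))
        # Step 3: pointwise product with the previous layer's average map.
        mask = [[u * p for u, p in zip(up_row, prev_row)]
                for up_row, prev_row in zip(up, prev)]
    # Step 4 ends when the mask reaches the input resolution.
    return upscale_nearest(mask, input_h, input_w)


shallow = [[[2.0] * 4 for _ in range(4)], [[0.0] * 4 for _ in range(4)]]
deep = [[[1.0, 0.0], [0.0, 2.0]]]
for row in visual_backprop([shallow, deep], 4, 4):
    print(row)
```

Regions that are inactive in any layer are multiplied to zero, so only areas that stay active across all layers remain in the final mask.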
The visualization result of the proposed driving model is shown in Fig. 6. The activated area shows that the model is interested in roads and lane lines, which reveals that the changing trend of road curves is an essential reference for the model. Besides, some fast-moving vehicles on the side are highlighted, which are the observation objects of humans in daily driving. It means that the proposed end-to-end driving model learns to drive within the boundaries of the lane and stay away from surrounding vehicles, which is imitating human driving behavior. It is worth noting that the learning focus of the model is not pre-defined, which also proves the superiority of the end-to-end driving model.

Visualization result of the proposed driving model.
We conduct experiments on the Udacity dataset to compare our model with existing end-to-end driving models. Since the other end-to-end autonomous driving models only predict steering angles, we compare the prediction accuracy of the steering angle. All models use the same training set and test set. The compared models include PilotNet proposed by NVIDIA [2], ST-LSTM proposed by Chi et al. [32], MSINet proposed by Wu et al. [4], Multimodal LSTM proposed by Abou-Hussein et al. [17], and FM-Net and 3D-ResNet+LSTM proposed by Yuenan Hou et al. [13].
As shown in Table 5, MS-MSNet achieves the best performance, while the proposed multi-stream network without the attention module is slightly worse than FM-Net. Although the improvement in RMSE is not large, MS-MSNet has many advantages. MS-MSNet is the multi-stream network combined with the multi-stream attention module. FM-Net is based on a trained 3D-ResNet+LSTM and is further enhanced with heterogeneous feature mimicking. To make a comprehensive comparison between MS-MSNet and FM-Net, we first compare the multi-stream network with 3D-ResNet+LSTM.
Performance comparison (RMSE) with existing research
3D-ResNet+LSTM has excellent spatial-temporal information extraction capability. It was developed on the basis of CNN, CNN+LSTM and 3D-CNN+LSTM. However, 3D-ResNet+LSTM is computationally expensive and difficult to train. Because it combines 3D-ResNet50 with an LSTM network, its structure is very complicated. In addition, 3D networks are harder to converge, and LSTMs are prone to vanishing gradients. Moreover, the initial parameters of 3D-ResNet+LSTM are obtained by pre-training on the ImageNet dataset [33]. In contrast, the floating point operations (FLOPs) of the multi-stream network are one-fiftieth of those of 3D-ResNet+LSTM. Through careful model design and stable 2D convolutions, the proposed multi-stream network effectively extracts complementary multimodal spatial-temporal information. The multi-stream network has a lighter structure and a faster computation speed.
While retaining the advantages of the multi-stream network over 3D-ResNet+LSTM, MS-MSNet gains further advantages over FM-Net. FM-Net is the heterogeneous feature mimicking version of a trained 3D-ResNet+LSTM. For a student network, choosing the teacher networks and the features to mimic requires many experiments. In the heterogeneous feature mimicking stage, the multiple teacher models of 3D-ResNet+LSTM require a huge amount of computation, which greatly increases the training time of the model. In contrast, through the design of the feature cluster, the multi-stream attention module only needs one experimental parameter to be tuned and adds very little computation. Moreover, the multi-stream attention module can be embedded directly into the multi-stream network and trained jointly with it. With the help of the multi-stream attention module, MS-MSNet allocates attention to the different streams more reasonably and thus better extracts spatial-temporal information from the multiple modalities.
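To make the idea of allocating attention across streams concrete, the following generic soft-attention sketch may help. It is an illustration under assumptions rather than the module described in this paper: the fusion is a plain concatenation, the scoring function is a hypothetical linear layer (`score_weights`), the feature cluster is not modeled, and the streams are combined by a weighted sum.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax: turns raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stream_attention(
    stream_features: List[List[float]],
    score_weights: List[List[float]],
) -> Tuple[List[float], List[float]]:
    """Weight each stream by a softmax score predicted from the fused features.

    stream_features: one feature vector per stream, all of the same length.
    score_weights:   one row per stream (hypothetical linear scoring layer),
                     each row as long as the concatenated feature vector.
    """
    # Assumed fusion: concatenate the features of all streams.
    fused = [value for features in stream_features for value in features]
    # One raw score per stream, computed from the fused representation.
    scores = [sum(w * x for w, x in zip(row, fused)) for row in score_weights]
    weights = softmax(scores)
    # Assumed combination: attention-weighted sum of the stream features.
    dim = len(stream_features[0])
    combined = [
        sum(weights[i] * stream_features[i][d] for i in range(len(stream_features)))
        for d in range(dim)
    ]
    return weights, combined
```

Because the weights come from a softmax, they are positive and sum to one, so a stream that contributes little to the prediction can be down-weighted without being discarded. In a trained model the scoring weights would be learned jointly with the rest of the network.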
RMSE reflects the end-to-end autonomous driving capability of a model: the smaller the RMSE, the stronger the autonomous driving capability. The RMSEs of the models in Table 5 are acceptable under different conditions. Depending on their RMSEs, the models can handle autonomous driving tasks of different difficulty. PilotNet was tested on a real-time autonomous vehicle, which could drive autonomously most of the time under typical road conditions. It also drove 10 miles on a multi-lane divided highway with on- and off-ramps. MSINet also achieved good performance and drove smoothly for 4.2 km on a complex campus.
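For reference, RMSE over N test frames is the square root of the mean squared difference between predicted and ground-truth steering angles. A minimal sketch of the computation follows; the function name `rmse` and the sample values in the usage note are hypothetical and are not taken from Table 5.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean squared error between predicted and ground-truth values."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

For example, `rmse([0.10, -0.05, 0.20], [0.12, -0.01, 0.15])` compares three hypothetical steering predictions with their labels. Since the errors are squared before averaging, a few large steering mistakes raise the RMSE more than many small ones.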
In this paper, we propose a novel end-to-end driving model for predicting steering angle and speed. The proposed driving model is the multi-stream network combined with the multi-stream attention module. It fully extracts effective information from multiple modalities, which represent spatial-temporal information from multiple angles. Based on the soft attention mechanism, the multi-stream attention module is designed to help the proposed driving model reasonably allocate attention to the streams and improve its prediction performance. The visualization results show that the proposed driving model can autonomously learn to pay attention to crucial information during driving. The performance of the proposed driving model has been validated on the public Udacity dataset. Experimental results show that our model performs better than other existing models. In future work, we will improve driving smoothness in order to enhance vehicle stability and safety.
Acknowledgments
This work was supported in part by the Anhui Provincial Key R&D Program of China (202004a05020040), in part by the National Key Research and Development Program of China (2018YFC0604404), in part by the Intelligent Network and New Energy Vehicle Special Project of Intelligent Manufacturing Institute of HFUT under Grant (IMIWL2019003), and in part by the Fundamental Research Funds for the Central Universities under Grant (PA2021GDGP0061).
