Abstract
Road detection algorithms that are both robust and fast are the basis for developing intelligent driver assistance systems. To improve the robustness and timeliness of unstructured road detection, a new algorithm is proposed in this paper. First, for the first frame in the video, the homography matrix H is estimated for different regions in the image using an improved random sample consensus (RANSAC) algorithm, and the features of H are automatically extracted using a convolutional neural network (CNN), which in turn enables road detection. Second, to speed up detection in subsequent similar frames, the color and texture features of the road are extracted from the detection result of the first frame, the corresponding Gaussian mixture models (GMMs) are constructed with the Orchard-Bouman algorithm, and the Gibbs energy function is then used to detect the road in subsequent frames. Finally, the algorithm is verified in real unstructured road scenes, and the experimental results show that it reaches an accuracy of 98.4% and processes 58 frames per second at 1024×960 pixels.
Introduction
Studies show that there are about 9.23 million traffic accidents caused by driver fatigue and inattentiveness worldwide each year, with more than 1.4 million fatalities [1]. To reduce the incidence of traffic accidents and enhance driving safety, many artificial intelligence research institutions and automobile manufacturers have invested substantial manpower and resources in developing advanced driver assistance systems (ADAS) in recent years [2]. ADAS is an active safety technology that uses a variety of sensors installed in the vehicle to collect data from the environment inside and outside the vehicle and to perform processing such as static and dynamic object recognition, detection and tracking, so that the driver is alerted to possible hazards as early as possible and safety is improved [3]. Real-time road detection algorithms based on vision sensors are the basis for developing driver assistance systems: they help drivers determine whether the area ahead is a drivable road area and whether it contains obstacles, vehicles, pedestrians or non-road areas, and they also provide guidance for subsequent path planning [4, 5].
Roads can generally be divided into structured roads and unstructured roads. Structured roads, such as highways and urban roads, have clear road markings, clear road boundaries and distinctive color information [6]. Structured road detection can therefore be reduced to the detection of road markings, road boundaries and road color [7]. Unstructured roads have no obvious road markings, road boundaries or fixed colors, and the road environment is complex and variable (e.g., shadows, water stains and complex obstacles), so detection methods for unstructured roads are still at the research stage [8].
Current methods for structured road detection include the following. Yao et al. [9] reduced the free-space estimation task to an inference problem on a one-dimensional graph by defining the potential function of the graph based on the image features of edges, colors, and vector machine, and then partitioned the road space from the non-road space using the labels of the nodes. Wang et al. [10] transformed the image into the inverse perspective space, constructed a variable road model using a Bézier spline curve, and detected the road by solving the model parameters with the RANSAC algorithm. Cheng et al. [11] proposed a road surface segmentation method that fuses road geometric and appearance features, which adapts to different road scenes and significantly improves the segmentation accuracy compared to methods that use geometric cues alone. Wang et al. [12] combined the Hough transform with a parabolic model for lane detection. The algorithm consists of two parts: road edges are first detected using the Hough transform, and then a mid-to-side strategy is used to detect road boundary points and a parabolic model is fitted to the road boundary for tracking. Kong et al. [13] used a confidence-weighted Gabor filter to calculate the principal texture direction of each pixel, estimated the vanishing point of the road with an adaptive soft voting scheme based on variable voting regions, and then segmented the corresponding road region according to the detected vanishing point. Fern
Current research on unstructured road detection includes the following. Cai et al. [15] proposed a detection method that fuses road images with their corresponding local road maps to improve the robustness of drivable road region detection in unstructured road environments; it can detect drivable road regions in areas without clear lane markings or boundaries. Geng et al. [16] first segmented unstructured road images into uniformly sized superpixels using the simple linear iterative clustering (SLIC) algorithm, then classified road and non-road areas with a fully convolutional neural network (CNN), and finally optimized the CNN classification results using the relationship between superpixel neighborhoods and a Markov random field (MRF). Mendes et al. [17] proposed a detection method combining a context window with a multilayer CNN that is capable of detecting drivable areas in a clear road image. Brostow et al. [18] extracted motion and structural features of objects in 3D space, projected these features into 2D space, and implemented road segmentation using a random decision forest (RDF). Yuan et al. [19] proposed an online framework for feature vector-based structural cue mining, which first discriminates between road boundary and non-boundary instances by means of a structural support vector machine (SSVM); the discriminated boundaries are then used to fit a complete road boundary. On this basis, the road area is extrapolated accordingly, and the obtained results are taken as ground truth values.
Although the above unstructured road detection algorithms are robust, they are difficult to apply widely in real road scenarios because their detection models are complex and computationally expensive, which limits their timeliness. To overcome both the poor robustness caused by the uncertainty of road surface and structural features and the poor timeliness caused by complex models and high computational complexity, a new real-time unstructured road detection and tracking algorithm is proposed in this paper.
Real-time detection of the drivable area of unstructured roads therefore needs to overcome the following two problems: (1) poor robustness caused by the uncertainty of road surface features and structural features; (2) poor timeliness caused by complex detection models and high computational complexity. To overcome these two problems, this paper proposes a real-time drivable area detection method for unstructured roads based on a CNN and the Gibbs energy function.
To solve the first problem, the homography matrix features of different regions in the image are used as the basis for distinguishing road regions from non-road regions. Because flatness is the only invariant characteristic of unstructured roads, the presence of a large flat area ahead in the field of view can be used as the criterion for detecting the drivable area. Using flatness as the criterion has two advantages: (1) flatness does not change with road type, color, illumination or driving environment, so the algorithm is highly robust; (2) relying on a single feature simplifies the algorithm model and reduces the computational complexity of the whole algorithm, so the algorithm is fast. As shown in [46], points on the same plane in the image share a similar homography matrix. Therefore, the homography matrix features of different regions in the image can be used to judge whether the region ahead is flat.
To solve the second problem, this paper first estimates the homography matrix H of different regions in the first frame using the improved RANSAC algorithm, and uses a CNN to automatically extract the features of H and obtain the large flat region in the image, that is, the road region. Compared with a traditional CNN that operates on road pixel features, a CNN that operates on regional homography matrix features has lower computational complexity. However, homography matrix estimation and CNN-based road detection are still time-consuming. Because the road surface features differ little between the first frame and adjacent subsequent frames, a road detection algorithm based on road surface features can replace the above algorithm for adjacent subsequent frames and so increase the detection speed. This algorithm consists of two steps: (1) the color and texture features of the road are extracted from the detection result of the first frame, and the corresponding GMMs are constructed with the Orchard-Bouman algorithm; (2) the Gibbs energy function is minimized to separate the road area from the non-road area.
The remainder of this paper is organized as follows. Sections 2 and 3 present the main components of the proposed method: Section 2 introduces the planar homography and CNN method, and Section 3 describes the method based on color and texture features. Section 4 describes the experiments conducted on the data set. Finally, Section 5 provides concluding remarks and directions for future work.
Road detection based on planar homography and CNN
The overall process of the proposed method is shown in Fig. 1. The road detection algorithm based on road surface features consists of two steps: (1) the road color and texture features are extracted from the detection result of the first frame, and the corresponding GMMs are constructed with the Orchard-Bouman algorithm; (2) the Gibbs energy function is minimized to separate the road area from the non-road area.

System description.
Since we use the flatness of the road as the criterion for detecting whether the image contains a road area, we need to determine whether the first frame of a new road scene contains a large drivable planar area. There are two main methods of plane estimation. The first uses 3D reconstruction to calculate the world coordinates of pixels from dense disparity maps and camera parameters [21, 22]. It can accurately calculate the homography matrix between world coordinate points and pixels, but it is computationally intensive and requires repeated camera calibration. The second uses feature points on the road plane that are matched between the left and right camera images to estimate the homography matrix between pixels [23, 24]. This method is simpler and more efficient than the first, which relies on stereo matching. To save computation time, the second method is chosen to estimate the homography matrix in this paper.
The homography matrix describes the correspondence between three-dimensional points in the same plane in space in two images, while the stereo vision homography matrix describes the correspondence between the same point in space in two images. When the feature points in the scene fall on the same plane in the world coordinate system, the affine relationship between the feature points in the left image and the right image can be expressed by the homography matrix

Homography of plane.

Comparison between
The matrix
Define
Where
The internal matrices of camera A and camera B are A_a and A_b, and the rotation matrix and translation matrix corresponding to the conversion from a 3D point in the world coordinate system to camera A and camera B are [
Eq. (3) can be transformed into:
By substituting Eq. (5) into Eq. (4), we can get:
Eq. (6) can be simplified to:
Expanding the Eq. (8) to get three equations, and substituting the third equation into the first two equations:
Expanding Eq. (9) and (10) by multiplying the denominators:
Assuming that N pairs of feature points can be obtained from the images of camera A and camera B, the following system of linear equations can be obtained:
The Eq. (22) is the Nth feature point pair in the two images
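As an illustration of how such a linear system can be solved, the following minimal Python sketch estimates H from exactly four point pairs, which is also the minimal sample size used by RANSAC later in this section. It assumes the common normalization h33 = 1, so that eight unknowns remain and each point pair contributes two linear equations. The function names and the normalization are assumptions made for illustration and are not taken from the paper; with N > 4 pairs, a least-squares solution of the stacked system would be used instead.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def homography_from_4_pairs(src, dst):
    """Estimate a 3x3 H with h33 = 1 (assumed) mapping src[i] to dst[i]."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u * (h31 x + h32 y + 1) = h11 x + h12 y + h13
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        # v * (h31 x + h32 y + 1) = h21 x + h22 y + h23
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]


def project(h, point):
    """Map a pixel through H and divide by the homogeneous coordinate."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)
```

For example, four corners of a unit square and their copies shifted by (2, 3) yield an H whose `project` maps (0.5, 0.5) to approximately (2.5, 3.5).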
The traditional RANSAC algorithm can estimate the homography matrix from a data set containing a large number of outliers by iteratively selecting random samples and retaining the inliers, which suppresses the influence of gross errors, eliminates most mismatched points, and thus improves the accuracy of the homography matrix [25]. However, RANSAC has the following two shortcomings: (1) when there are many feature point mismatches in the image, the number of iterations increases greatly, which reduces the efficiency of the algorithm and affects the accuracy of the homography matrix [26]; (2) since RANSAC obtains a credible model only with a certain probability, and this probability grows with the number of iterations, there is no natural upper limit on the number of iterations, and if an upper bound is set, the final homography matrix may not be the optimal result. Subsequent improvements of RANSAC, such as MLESAC [47] and PROSAC [48], are also based on resampling. These methods estimate a predefined transformation model between the matching point sets through repeated sampling of the initial matches, so as to find the largest inlier set that satisfies the estimated model. They rely heavily on the accuracy of sampling: when there are a large number of outliers in the initial matches, the required number of samples increases significantly, which greatly reduces efficiency.
In view of the above two shortcomings, this paper improves the traditional RANSAC algorithm as follows: (1) the improved FAST algorithm is used to extract stable and oriented feature points from the difference of Gaussian (DOG) pyramid; the SIFT matching method is then used to optimize the matching results and eliminate mismatched points, which reduces the number of iterations of homography matrix estimation; (2) the Levenberg-Marquardt (LM) nonlinear optimization method is introduced to improve the accuracy of the homography matrix by iteratively minimizing the cost function. The specific process of homography estimation based on the improved RANSAC algorithm is as follows:
(1) Construction of DOG
The difference image contains stable features in different blur degrees and scales; therefore, the DOG can provide great convenience for the subsequent feature point extraction. Firstly, the low-pass filter is used to smooth the image, and the smooth image is down sampled to obtain the Gaussian pyramid. The specific construction process is shown in Fig. 4.

Construction process of Gaussian pyramid.
The Gaussian pyramid contains O groups of L layers each at different scales, so its scale space is indexed by (o, l). Subtracting layer l from layer l + 1 of group o of the Gaussian pyramid gives layer l of group o of the DOG. The scale space of the Gaussian pyramid and the construction process of the DOG are shown in Fig. 5.

Scale space of Gaussian pyramid and construction process of DOG.
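The construction described above can be sketched in a few lines of Python. The sketch below blurs one group of the pyramid at increasing scales and subtracts adjacent layers to obtain the DOG layers. The base standard deviation 1.6 is the value given in the text; the scale schedule σ0 · 2^(l/L), the 3σ kernel truncation, the edge replication and all function names are assumptions borrowed from the usual SIFT-style construction, not details confirmed by the paper.

```python
import math


def gaussian_kernel(sigma):
    """1D Gaussian kernel truncated at 3 sigma and normalized to sum to 1."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image, sigma):
    """Separable Gaussian blur with edge replication; image is a list of rows."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])

    def clamp(i, upper):
        return min(max(i, 0), upper - 1)

    # Horizontal pass, then vertical pass on the intermediate result.
    rows = [[sum(kernel[k + radius] * image[y][clamp(x + k, width)]
                 for k in range(-radius, radius + 1))
             for x in range(width)] for y in range(height)]
    return [[sum(kernel[k + radius] * rows[clamp(y + k, height)][x]
                 for k in range(-radius, radius + 1))
             for x in range(width)] for y in range(height)]


def dog_octave(image, sigma0=1.6, layers=3):
    """Return the DOG layers of one group: blurred(l + 1) - blurred(l)."""
    # Assumed scale schedule: sigma0 * 2^(l / layers).
    scales = [sigma0 * 2 ** (layer / layers) for layer in range(layers + 1)]
    blurred = [blur(image, s) for s in scales]
    return [[[b1 - b0 for b0, b1 in zip(r0, r1)]
             for r0, r1 in zip(lo, hi)]
            for lo, hi in zip(blurred, blurred[1:])]


def downsample(image):
    """Halve the resolution to start the next group of the pyramid."""
    return [row[::2] for row in image[::2]]
```

A constant image produces DOG layers that are zero everywhere, which is a quick sanity check: only intensity structure at the scale between two adjacent layers survives the subtraction.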
The original image of size M × N is defined as I(x, y), and the image of group o is I_o(x, y). In Figs. 5 and 6, the Gaussian smoothing function of layer 1 in group o is as follows:
Where δ is the standard deviation of normal distribution, and δ = 1.6. The scale space expression of the l layer of the group o is as follows:
Therefore, the scale space expression of the l layer of group O is as follows:
It can be transformed into:
(2) The improved FAST algorithm is used to extract and match the feature points
Feature point detection algorithms mainly include Harris [27], SIFT [28, 29], SUSAN [30] and SURF [31], together with improved variants such as PCA-SIFT, ICA-SIFT, P-ASURF, R-ASURF and Radon SIFT [32]. The Harris and SUSAN algorithms have difficulty locating points on uniform areas or object contours repeatably and accurately across different images. In addition, these two algorithms provide rotation invariance but not scale invariance [33]. Road images, however, are generally composed of uniform regions and simple contours, and there are scale changes between the left and right images, so the Harris and SUSAN algorithms are not suitable for feature point detection in road images. The feature points detected by the SIFT and SURF algorithms are invariant to rotation, scaling and brightness changes, and are also stable to a certain extent under changes of viewing angle and noise [34]. However, the heavy feature computation makes feature point detection very time-consuming, so these algorithms are difficult to apply to real-time road detection.
Therefore, to ensure that the feature extraction algorithm is scale invariant and computationally efficient, this paper first uses the improved FAST feature point detection algorithm to quickly extract the feature points in the left and right images, then obtains the feature point descriptors with the ORB algorithm, and finally obtains the initial matching results with the FLANN matching algorithm and optimizes them with the SIFT matching method.
FAST is a fast method for extracting feature points, but it has two disadvantages: (1) the detected feature points are not scale invariant; (2) the detected feature points have no orientation. To solve these two problems, this paper proposes an improved FAST feature point detection algorithm. To address the first shortcoming, the DOG of the image is constructed using the method in step (1) above, and the feature points are detected in the DOG, which makes them scale invariant. To address the second shortcoming, the gradient directions of all points in the neighborhood of a feature point are counted to obtain a histogram of gradient directions, from which the orientation of the feature point is determined, so that the feature points are invariant to image rotation. The specific process is shown in Fig. 6.

Direction determination process of feature points.
The gradient direction of the neighboring points of p in C is calculated as follows:
Where: θ (x, y) is the gradient direction of the neighboring points, L (x, y) is the brightness of pixel (x, y).
Gradient normalization method: the gradient direction is defined in the range [0°, 360°] and divided into 36 equal parts [49]. Define Δθ = θ (x, y) - n; if |Δθ| ⩽ 5, then θ (x, y) = n. The histogram of the normalized gradient directions of the neighborhood points is then computed. The direction of the highest bin in the histogram is used as the principal direction of the feature point p. If another bin in the histogram is greater than 80% of the peak of the principal direction, the direction corresponding to that bin is an auxiliary direction of the feature point p; otherwise p has no auxiliary direction.
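The orientation assignment described above can be sketched as follows. The sketch computes the gradient direction of a pixel from central differences of the brightness L(x, y), snaps each direction to the nearest multiple of 10° (36 bins), and returns the principal direction together with any auxiliary directions whose bin count exceeds 80% of the peak. The use of central differences with `atan2` to cover the full [0°, 360°) range, the handling of ties, and the function names are assumptions made for illustration.

```python
import math


def gradient_direction(img, x, y):
    """Gradient direction in degrees in [0, 360) from central differences."""
    dx = img[y][x + 1] - img[y][x - 1]
    dy = img[y + 1][x] - img[y - 1][x]
    return math.degrees(math.atan2(dy, dx)) % 360.0


def dominant_directions(angles, bins=36, aux_ratio=0.8):
    """Return (principal direction, list of auxiliary directions) in degrees."""
    width = 360.0 / bins
    hist = [0] * bins
    for angle in angles:
        # Snap to the nearest bin direction n (|theta - n| <= width / 2).
        hist[int(math.floor((angle % 360.0) / width + 0.5)) % bins] += 1
    peak = max(hist)
    main_bin = hist.index(peak)
    auxiliary = [b * width for b, count in enumerate(hist)
                 if b != main_bin and count > aux_ratio * peak]
    return main_bin * width, auxiliary
```

With six neighborhood directions near 10°, five near 200° and one at 90°, the principal direction is 10° and 200° is reported as an auxiliary direction, because 5 exceeds 80% of 6.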
We randomly captured 560 images from the road video (280 pairs, one image of each pair from the left camera and one from the right). After obtaining the oriented FAST feature points from these images, the descriptors of the feature points were extracted using the SURF, ORB and SIFT algorithms in OpenCV, and the FLANN matching algorithm was used to obtain the initial matching results of each of the three algorithms, as shown in Fig. 7. On this basis, the matching results were optimized using the SIFT matching method: a match is kept only if the ratio of the nearest distance to the second-nearest distance is less than the threshold of 0.6, as shown in Fig. 8. The extraction results and time consumption of the three algorithms are compared in Table 1.
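The nearest to second-nearest distance test with threshold 0.6 can be illustrated with a brute-force sketch. Descriptors are represented here as plain tuples compared by Euclidean distance, which is an assumption for illustration: ORB descriptors are binary and are normally compared with the Hamming distance, and the paper uses FLANN rather than an exhaustive search. The function name is hypothetical.

```python
import math


def ratio_test_matches(desc_left, desc_right, ratio=0.6):
    """Keep (i, j) only if the best distance is below ratio * second best."""
    matches = []
    for i, d in enumerate(desc_left):
        dists = sorted((math.dist(d, e), j) for j, e in enumerate(desc_right))
        if len(dists) < 2:
            continue  # the test needs two candidates
        (best, j), (second, _) = dists[0], dists[1]
        if best < ratio * second:
            matches.append((i, j))
    return matches
```

A left descriptor with one clearly closest candidate is kept, whereas a left descriptor with two almost equally close candidates is rejected as ambiguous, which is how the test removes many false matches.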

Initial matching results.
Parameters of different network structures
It can be seen from Fig. 7 that the number of false matches in the initial matching results is high. It can be seen from Fig. 8 that the SIFT matching method removes unqualified matching points from the matching results, and the accuracy of all three algorithms is greatly improved. Moreover, the ORB algorithm takes the least time: its matching speed is about 90 times that of the SIFT algorithm and 8 times that of the SURF algorithm.

Optimized matching results.
(3) Homography matrix estimation based on improved RANSAC algorithm
Firstly, the left reference image is divided into N regions with an area of w × h. The set of feature points corresponding to the nth region is I_n, which contains S_n pairs of feature points, n = 0, 1, 2, . . . , N; then, the homography matrix
1) Judge whether the number of feature point pairs S_n in the region n is at least 4. If S_n ⩾ 4, randomly select 4 feature point pairs from the region as the inlier set
2) The projection error of all feature point pairs in
3) If the number of interior points in
4) Repeat 1) ∼ 3) k times, and evaluate the result of each iteration through the projection error to get the optimal homography matrix
5) Calculate the projection error
6) If
7) Calculate the partial derivative
The calculation formula of the minimum number of iterations k in the above algorithm is as follows:
Where ɛ is the ratio of the number of inliers to S_n, min is the minimum number of samples needed to calculate the model, and p is the confidence, generally set to 0.95∼0.99. In this paper, we set min = 4 and p = 0.95. Since the proportion of inliers in the data set is unknown, ɛ is initially set to the ratio of 4 to S_n and then updated during the iterations to the ratio of the current maximum number of inliers to S_n.
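For reference, the standard RANSAC relation between these quantities is k = log(1 - p) / log(1 - ɛ^min): a sample of min pairs is free of outliers with probability ɛ^min, and k is the smallest number of samples for which at least one outlier-free sample is drawn with probability p. The following sketch evaluates it; the function name is hypothetical, and this standard form is assumed to match the formula intended in the text.

```python
import math


def ransac_iterations(inlier_ratio, sample_size=4, confidence=0.95):
    """Minimum number of RANSAC samples for the given confidence."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    p_clean = inlier_ratio ** sample_size  # chance that a sample has no outlier
    if p_clean >= 1.0:
        return 1
    # log1p keeps the denominator accurate when p_clean is very small.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_clean))
```

With min = 4 and p = 0.95 as in this paper, an inlier ratio of 0.5 requires 47 iterations while an inlier ratio of 0.8 requires only 6, which shows why removing mismatches before RANSAC reduces the iteration count so strongly.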
The calculation formulas of projection error
Where
Where
With the improved RANSAC algorithm, the homography matrices of the N regions in the left reference image can be estimated. Whether two homography matrices have similar features can be used as a criterion to judge whether the corresponding regions belong to the same plane, so feature extraction and classification of the homography matrices become the key to road plane detection. Traditional classifiers require multiple hand-designed features, and it takes a lot of time to verify whether these features can distinguish the road plane. A CNN instead learns its features: it transforms the representation of samples in the original space into a new feature space through layer-by-layer feature transformation, which yields richer high-level features and more accurate classification and recognition. CNNs have also been widely used in target detection [35], target recognition [36], segmentation [37] and other fields.
Structure of CNN
Common CNN is mainly composed of input layer (I layer), convolution layer (C layer), pooling layer (S layer), full connection layer (F layer) and output layer (O layer). Its structure is shown in Fig. 9.

Structure of CNN.
(1) Convolution layer
The convolution layer mainly extracts features by the convolution operation, which both enhances the features of the original signal and reduces noise. To extract richer features, a convolution layer often contains multiple filters, that is, F_depth > 1. Therefore, the output of the convolution layer has F_depth component matrices, that is, the output depth of the convolution layer is D = F_depth. After convolution of the input data with the filter, the element value in the i-th row and j-th column of the output matrix at output depth d is as follows:
Where R and C are the height and width of the filter respectively, and b_d is a constant.
The output of the current convolution layer activated in depth is obtained through the activation function:
The matrix at multiple output depths together constitutes the output
(2) Pooling layer
After convolution, each region and its adjacent regions in the image carry correlated feature information, which leads to information redundancy. To improve the performance and robustness of the algorithm, the output of the convolution layer is down-sampled. The pooling layer exploits the local correlation of the image to subsample it, and uses a fixed size filter to move on the matrix
(3) Fully connected layer
After the convolution layer and pooling layer processing, the higher-level features are extracted, and then the classification is completed through a number of fully connected layers. Finally, the Softmax function transforms the classification results into probability distribution to obtain the road plane area.
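The three layer types described above can be illustrated for a single input channel with a few short functions. This is a minimal sketch rather than the network used in this paper: the choice of ReLU as the activation function is an assumption (the activation is not named in the surviving text), and all function names are hypothetical.

```python
import math


def conv2d(image, kernel, bias=0.0, stride=1):
    """Valid convolution (cross-correlation) of one channel with an R x C filter."""
    r, c = len(kernel), len(kernel[0])
    out_h = (len(image) - r) // stride + 1
    out_w = (len(image[0]) - c) // stride + 1
    return [[bias + sum(kernel[m][n] * image[i * stride + m][j * stride + n]
                        for m in range(r) for n in range(c))
             for j in range(out_w)] for i in range(out_h)]


def relu(feature_map):
    """Assumed activation: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = len(feature_map) // size, len(feature_map[0]) // size
    return [[max(feature_map[i * size + m][j * size + n]
                 for m in range(size) for n in range(size))
             for j in range(w)] for i in range(h)]


def softmax(scores):
    """Turn class scores into a probability distribution."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

In a two-class setting such as road versus non-road, `softmax` is applied to the two outputs of the last fully connected layer, and the larger probability determines the label of the region.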
CNN is essentially an input-output mapping, it can learn a lot of input-output mapping relationship. In this paper, the homography matrix of each region in the image is used as the input of CNN, and the road plane region is used as the final output. In the process of CNN training, the current network input and network weight are used to calculate the network output, that is, the road plane area detected by this round of training; then the error between the training result and the sample label is calculated; finally, the back-propagation algorithm is used to calculate the network weight error, and the weight update method is used to update the weight. After many times of training, the network parameters are constantly optimized, which makes the network structure tend to be stable.
The specific training process of the CNN is as follows:
(1) The homography matrix of each region in the left reference image is extracted using the improved RANSAC algorithm;
(2) The bounding rectangle of each region, centered on the region center, is retained;
(3) Each rectangle is labeled (road area is labeled 1, non-road area is labeled 0), and all rectangles and their corresponding labels are used as training data;
(4) The CNN takes the homography matrices of the regions as the input for network training, and the network parameters are initialized;
(5) Forward propagation stage: the corresponding actual output is calculated;
(6) Backward propagation stage: the difference between the actual output and the label is calculated, and the weight matrix is adjusted so as to minimize the error.
The process of road detection based on the CNN is as follows:
(1) The homography matrix of each region in the left reference image is extracted using the improved RANSAC algorithm;
(2) The bounding rectangle of each region, centered on the region center, is retained;
(3) The trained CNN is used to determine the road area in the image.
Road detection based on color and texture features
The above algorithm can accurately detect the road area in the image, but it is time-consuming, and the surface features of the road differ little between adjacent frames. Therefore, the above algorithm is used to obtain the road area in the first frame; on this basis, a road surface feature model is constructed and then used to detect the adjacent subsequent frames, which reduces the time consumption of the whole algorithm. Because the road region generally has distinctive color and texture features, road detection in subsequent frames can be realized based on these two features and the Gibbs energy function.
Building color GMMs model
As shown in [38], the Orchard-Bouman binary splitting algorithm is twice as fast as the K-means algorithm for constructing GMMs. Therefore, this paper uses the Orchard-Bouman binary splitting algorithm to build the color and texture GMMs of road pixels and non-road pixels; its principle is shown in Fig. 10. The construction process of the color GMMs is as follows:

Principle of Orchard-Bouman binary splitting algorithm.
(1) In the first frame, the pixels of the road region are defined as road pixels, and the pixels of the non-road region are defined as non-road pixels;
(2) Two GMMs are defined: GMM1 and GMM0 represent the road and non-road color distributions respectively, and contain K1 and K0 components respectively;
(3) The components of GMM1 and GMM0 are obtained with the Orchard-Bouman binary splitting algorithm, whose flow is shown in Fig. 11;
(4) The GMMs are obtained from the classification results, and each model can be represented by K triples.

Orchard–Bouman binary splitting algorithm.
In the above triples,
The construction process of texture GMMs model is the same as that of color GMMs model, but before the construction of texture GMMs model, it is necessary to convert the color value or gray value of each pixel into the feature value that can describe the texture. LBP (local binary patterns) is a non-parametric local texture feature descriptor with high computational efficiency. Because of its high feature discrimination and low computational complexity, LBP is selected to obtain the texture feature values of road and non-road regions.
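The basic 8-neighbor LBP code can be computed as in the following sketch, which compares each of the eight neighbors with the center pixel and packs the comparison results into one byte. The neighbor ordering (clockwise from the top-left pixel) and the use of "greater than or equal" are common conventions assumed here; the paper does not state which LBP variant it uses, and the function names are hypothetical.

```python
def lbp_value(img, x, y):
    """8-neighbor LBP code of pixel (x, y); img is a list of rows of gray levels."""
    center = img[y][x]
    # Neighbors visited clockwise, starting at the top-left pixel (assumed order).
    offsets = [(-1, -1), (0, -1), (1, -1), (1, 0),
               (1, 1), (0, 1), (-1, 1), (-1, 0)]
    code = 0
    for bit, (dx, dy) in enumerate(offsets):
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def lbp_image(img):
    """LBP codes of all interior pixels (the one-pixel border is skipped)."""
    return [[lbp_value(img, x, y) for x in range(1, len(img[0]) - 1)]
            for y in range(1, len(img) - 1)]
```

The resulting codes lie in the range 0 to 255 and take the place of the color value when the texture GMMs are built.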
Road detection based on Gibbs energy function
We simplify the road detection problem into a binary label problem. An image can be represented by a set
Where: E (
Where: p(Value(x, y) | Label(x, y)) is the probability that the value Value(x, y) of the pixel matches the road model or the non-road model.
Where: p(LBPValue(x, y) | Label(x, y)) is the probability that the value LBPValue(x, y) of the pixel matches the road model or the non-road model.
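To make the role of these two probabilities concrete, the following sketch evaluates a Gibbs-style energy for a candidate labeling. Each pixel contributes the negative log-likelihood of its color value and of its LBP value under the GMM of its label. Several simplifications are assumptions made for illustration: features are scalar, each GMM is a list of (weight, mean, variance) triples, and the Potts smoothness term over 4-neighbors is borrowed from GrabCut-style formulations rather than taken from the paper. All names are hypothetical.

```python
import math


def gmm_pdf(value, gmm):
    """Density of a scalar value under a GMM of (weight, mean, variance) triples."""
    return sum(w * math.exp(-(value - mu) ** 2 / (2 * var))
               / math.sqrt(2 * math.pi * var)
               for w, mu, var in gmm)


def data_energy(value, lbp, label, color_gmms, texture_gmms, eps=1e-12):
    """Negative log-likelihood of one pixel for a label (1 = road, 0 = non-road)."""
    return (-math.log(gmm_pdf(value, color_gmms[label]) + eps)
            - math.log(gmm_pdf(lbp, texture_gmms[label]) + eps))


def smoothness_energy(labels, weight=1.0):
    """Assumed Potts penalty: count 4-neighbor pairs with different labels."""
    h, w = len(labels), len(labels[0])
    cuts = sum(labels[y][x] != labels[y][x + 1]
               for y in range(h) for x in range(w - 1))
    cuts += sum(labels[y][x] != labels[y + 1][x]
                for y in range(h - 1) for x in range(w))
    return weight * cuts


def gibbs_energy(values, lbps, labels, color_gmms, texture_gmms, weight=1.0):
    """Total energy of a labeling: data terms plus the smoothness term."""
    data = sum(data_energy(values[y][x], lbps[y][x], labels[y][x],
                           color_gmms, texture_gmms)
               for y in range(len(labels)) for x in range(len(labels[0])))
    return data + smoothness_energy(labels, weight)
```

A labeling that assigns each pixel to the model that explains its color and texture has a lower energy than the opposite labeling, so minimizing the energy over all labelings separates road from non-road pixels.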
As shown in Fig. 12, Fig. 12(a) is the first frame of the video. Plane detection is performed on the first frame using the algorithm in Section 1, and the detection result is shown in Fig. 12(b). In Fig. 12(b), the blue area is the road area, and the gray area is the non-road area. Figure 12(c), (d) and (e) are the minimum values of E (

The road detection process based on road color, texture and Gibbs energy function.
When detecting subsequent frames, if the E (U) value of the image is in the range [0, P], the frame is detected with the existing color and texture GMMs. Otherwise, the frame is treated as a new first frame: the algorithm in the first section is used to re-detect the plane in the image and determine the drivable plane area of the road, and the color and texture GMMs of the road are recalculated. In this way, unstructured roads are detected continuously in real scenes.
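The switching rule described above can be summarized as a short control loop. In this sketch the three callables stand in for the components of the method (the homography and CNN detector, the GMM construction, and the energy-based segmentation); they are hypothetical placeholders supplied by the caller, not functions defined in this paper, and `energy_limit` plays the role of P.

```python
def track_road(frames, detect_with_cnn, build_gmms, segment_with_gmms, energy_limit):
    """Return one road mask per frame, re-initializing when E(U) leaves [0, P]."""
    models = None
    masks = []
    for frame in frames:
        if models is not None:
            # Fast path: reuse the color and texture models of the last key frame.
            mask, energy = segment_with_gmms(frame, models)
            if 0 <= energy <= energy_limit:
                masks.append(mask)
                continue
        # First frame, or the models no longer fit: run the slow detector
        # and rebuild the models from its result.
        mask = detect_with_cnn(frame)
        models = build_gmms(frame, mask)
        masks.append(mask)
    return masks
```

The slow detector therefore runs only on the first frame and on frames whose energy falls outside the accepted range, which is what keeps the average processing time per frame low.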
Experiment
Experimental data set
Experimental software and hardware: vehicle camera (DF-8066), compilation environment (VS2012), operating system (64-bit Windows 10 Professional), memory (3 GB), processor (2.27 GHz Intel(R) Core(TM) i3-350M). Experimental images: the images collected by the camera show unstructured roads with different road types, backgrounds and lighting conditions, and are affected by shadows, vehicles and pedestrians under different traffic conditions, as shown in Fig. 13. The original image size is 1024×960. There are 31706 images in total, of which 23924 are used for training and 7782 for testing. Training takes about 20 hours, and the average test time per image is 18 ms.

Road images.
To obtain a higher accuracy of road plane area detection, we tested the various network structures described in Section 1.3 and finally chose the road detection convolutional neural network (RD-CNN). RD-CNN includes three C layers, two S layers and two F layers. The three C layers use 16, 32 and 64 convolution kernels respectively to extract image features. The convolution kernels in each C layer are 3×3; the stride of the first C layer is 3, and the stride of the second and third C layers is 1. Both S layers use max pooling. The first F layer has 256 neurons and the second F layer has 2 neurons. The parameters of the different network structures are shown in Table 1. The network structure 2C-0S-16-32 has no S layer, so its recognition performance is not ideal. The network structure 2C-2S-16-32 adds two S layers to 2C-0S-16-32, and the recognition rate improves significantly. The network structure 3C-2S-4-8-16 replaces the 5×5 convolution kernel with two 3×3 convolution kernels, which increases the depth of the network; the recognition rate improves, but not significantly. On this basis, RD-CNN (3C-2S-16-32-64) increases the number of convolution kernels, and the recognition rate further improves to 98.6%.
In order to visually compare the influence of different network structure parameters on the recognition rate, Fig. 14 shows the recognition rate over 300000 iterations for each network. It can be seen from Fig. 14 that when the number of iterations is small, the recognition rate fluctuates greatly. As the number of iterations increases, the adjustable parameters of the network are gradually optimized, and the recognition rate rises and then stabilizes. Increasing the number of S layers reduces the dimension of the output features of the C layers, which reduces information redundancy and improves network efficiency. Increasing the number of C layers or convolution kernels can improve the accuracy of road detection within a certain range, but it also increases the computing time of the network. Therefore, considering both accuracy and computational efficiency, this paper selects RD-CNN to perform road detection on the first frame of the video.

Recognition rate of different network structures.
In order to verify the robustness of the algorithm, this paper compares the algorithms in [39–45] with the proposed algorithm using four evaluation indexes: recall (R), precision (P), F-measure (F) and accuracy (A). The comparison results are shown in Table 2, and the road area detection results are shown in Fig. 15. The first row of Fig. 15 shows the input images, the second row shows the actual drivable road area, and the third row shows the road result obtained with the vanishing point (VP) method [40]. The fourth row shows the result of the fully convolutional network (FCN) [41], the fifth row shows the output of the random decision forest (RDF) [42], and the sixth row shows the road result obtained with the conditional random field (CRF) [43]. The seventh row shows the detection result of the method combining a fully convolutional network with a Markov random field (CNN-MRF) [45], and the last row shows the output of the proposed algorithm.
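The four evaluation indexes can be computed from pixel-level confusion counts. The sketch below uses the standard definitions, with road pixels as the positive class and F taken as the harmonic mean of P and R; the exact F-measure weighting used for Table 2 is not stated in this section, so that choice, like the function names and the toy masks, is an assumption.

```python
def confusion_counts(predicted, truth):
    """Count TP, FP, FN, TN for flat 0/1 masks (1 = road pixel)."""
    if len(predicted) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 0)
    return tp, fp, fn, tn


def road_metrics(tp, fp, fn, tn):
    """Recall, precision, F-measure (harmonic mean) and accuracy."""
    total = tp + fp + fn + tn
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f_measure = 2.0 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"R": recall, "P": precision, "F": f_measure, "A": accuracy}


if __name__ == "__main__":
    # Toy six-pixel masks, purely illustrative.
    predicted = [1, 1, 0, 0, 1, 0]
    truth = [1, 0, 0, 0, 1, 1]
    print(road_metrics(*confusion_counts(predicted, truth)))
```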

Unstructured road detection results based on different algorithms.
Comparison of different algorithms
The specific analysis is as follows: because the road type, road background, lighting conditions and traffic conditions in unstructured road images are uncertain, the methods based on the road intrinsic model [11, 39], road features [12–14] and road classifiers [40–45] cannot detect the road well. However, this algorithm first extracts the feature value
In order to verify the timeliness of the algorithm, 558 test images are arranged in the order in which they appear in the video stream, and the running time of each step of the algorithm during testing is recorded, as shown in Table 3. First, the homography matrix is estimated for every test image and the road is detected with the CNN. It can be seen from Table 3 that the road detection algorithm based on homography matrix estimation and CNN is slow: it needs 53.04 ms/frame, while the video is output at 30 frames/s. Therefore, this algorithm alone cannot be applied well to real scenes. Next, the road and non-road areas are manually marked in the test images, the road GMMs are constructed, and the road area is re-detected in the test images based on the GMMs and the Gibbs energy function. It can be seen from Table 3 that the road detection algorithm based on GMMs and the Gibbs energy function is fast, at 13.26 ms/frame, but the GMMs must be constructed from marked images, which is not feasible in a real driving scene. Finally, the road detection algorithm based on CNN and Gibbs energy is applied to the unlabeled test images. It takes 17.24 ms/frame, which is about one third of the time needed by the road plane detection algorithm based on homography matrix estimation and CNN, and shorter than the frame interval of the 30 frames/s video (about 33.3 ms). Therefore, the road detection algorithm based on CNN and Gibbs energy can meet the needs of real scenes.
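The real-time argument above reduces to simple arithmetic on the per-frame times. The short sketch below converts the three times quoted from Table 3 into frame rates and checks each against the 30 frames/s video rate; the dictionary labels and function names are illustrative and not taken from the paper.

```python
VIDEO_RATE_FPS = 30.0  # video output rate stated in the text

# Per-frame processing times quoted from Table 3 (ms/frame).
TIMES_MS = {
    "homography + CNN": 53.04,
    "GMMs + Gibbs (manually marked)": 13.26,
    "CNN + Gibbs": 17.24,
}


def frames_per_second(ms_per_frame):
    """Convert a per-frame processing time into a throughput."""
    return 1000.0 / ms_per_frame


def meets_real_time(ms_per_frame, video_rate=VIDEO_RATE_FPS):
    """True when one frame is processed within one frame interval."""
    return ms_per_frame <= 1000.0 / video_rate


if __name__ == "__main__":
    for name, ms in TIMES_MS.items():
        print(f"{name}: {frames_per_second(ms):.1f} frames/s, "
              f"real-time: {meets_real_time(ms)}")
```

A time of 17.24 ms/frame corresponds to roughly 58 frames/s, consistent with the frame rate reported for the full algorithm, whereas 53.04 ms/frame gives fewer than 19 frames/s.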
Algorithm time consumption per step
In order to further verify the effectiveness of the algorithm, this paper carries out supplementary comparison experiments on an open source dataset. The dataset comes from [51] and covers four weather conditions: dusk, rain, night and sunny, with 1399, 819, 2167 and 1995 pictures, respectively. The four weather conditions are tested and compared separately, and the final results are shown in Table 4.
Results on UAS
It can be seen from Table 4 that the proposed algorithm achieves the highest accuracy on the Dusk, Rain and Sun sets, and also achieves a high accuracy on the Night set.
The real-time unstructured road drivable area detection algorithm based on CNN and the Gibbs energy function not only overcomes the poor robustness caused by the uncertainty of road surface and structural features, but also solves the poor timeliness caused by complex detection models and high computational complexity. The deficiency of this paper is that detection performs poorly on inclined roads: when the road is inclined, the road area is easily detected as a non-road area.
(1) In this paper, the smoothness of the road is used as the criterion for judging the drivable area. Therefore, the algorithm is not affected by the inherent appearance characteristics of the road, the illumination or the driving environment. In real unstructured road detection scenes, the algorithm is both robust (accuracy: 98.3%) and fast (58 frames/s), so it can meet the needs of real scenes and realize online detection.
(2) In order to improve the accuracy and efficiency of homography matrix estimation, this paper first uses the improved fast feature point extraction algorithm to quickly extract stable and directional feature points from DoG; then, the SIFT matching method is used to optimize the matching results and eliminate the mismatched points; finally, the LM nonlinear optimization method is introduced to iteratively optimize
(3) In order to improve the detection accuracy for the first road image, this paper uses a CNN to automatically extract the high-dimensional features of the homography matrix in different regions, classify them and output the road regions. RD-CNN reaches a high accuracy in training, and its simple network structure keeps the detection time short.
(4) In order to speed up the detection of subsequent similar frames, this paper uses the detection results of the first image to construct road color and texture GMMs based on Orchard-Bouman. The Gibbs energy function is then used to assign each pixel a road or non-road label. Finally, the first image is redefined according to the update strategy of the GMMs to realize continuous detection of unstructured roads in real scenes.
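The labelling step in (4) can be illustrated with a deliberately reduced sketch. It assumes a Gibbs energy of the common form E = Σ U_i(l_i) + γ Σ [l_i ≠ l_{i+1}], where the data term U is the negative log-likelihood of a pixel under the road or non-road GMM and γ is a constant smoothness weight. The exact energy, the features and the GMM parameters used in the paper are not given in this section, so everything here is hypothetical: pixels are scalar intensities on a 1-D scan line, the mixtures are hand-picked, and dynamic programming stands in for the 2-D optimizer an image would need.

```python
import math


def gmm_nll(x, gmm):
    """Negative log-likelihood of scalar x under a 1-D Gaussian mixture.

    gmm is a list of (weight, mean, variance) tuples.
    """
    p = sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
            for w, m, v in gmm)
    return -math.log(max(p, 1e-300))  # guard against log(0)


def min_gibbs_labels(values, road_gmm, other_gmm, gamma=2.0):
    """Exactly minimise the chain energy; label 1 = road, 0 = non-road."""
    if not values:
        return []
    # unary[i][l] is the data term of pixel i under label l.
    unary = [(gmm_nll(x, other_gmm), gmm_nll(x, road_gmm)) for x in values]
    cost = list(unary[0])  # best energy of a prefix ending in label 0 / 1
    back = []              # back[i][l]: best label at i given label l at i + 1
    for u in unary[1:]:
        new_cost, choice = [], []
        for label in (0, 1):
            stay = cost[label]
            switch = cost[1 - label] + gamma  # pay gamma for a label change
            if stay <= switch:
                new_cost.append(stay + u[label])
                choice.append(label)
            else:
                new_cost.append(switch + u[label])
                choice.append(1 - label)
        cost = new_cost
        back.append(choice)
    # Backtrack from the cheaper final label.
    label = 0 if cost[0] <= cost[1] else 1
    labels = [label]
    for choice in reversed(back):
        label = choice[label]
        labels.append(label)
    labels.reverse()
    return labels


if __name__ == "__main__":
    road = [(1.0, 100.0, 100.0)]   # hypothetical road intensity model
    other = [(1.0, 200.0, 100.0)]  # hypothetical non-road intensity model
    line = [98, 105, 160, 102, 97, 195, 205, 210]
    print(min_gibbs_labels(line, road, other, gamma=0.0))
    print(min_gibbs_labels(line, road, other, gamma=6.0))
```

With γ = 0 every pixel simply takes the label of its more likely mixture, so the outlier value 160 is marked as non-road; with γ = 6 the smoothness term outweighs that single pixel's data term and it is relabelled as road, which is the behaviour the pairwise term is meant to provide.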
