Abstract
Automatic road crack detection is a challenging task. To address it, we propose a novel approach that uses a multi-tasking Faster R-CNN to detect and classify road cracks. In the present study, we collected a dataset of 19,300 road images from the Outer Ring Road of Chennai, Tamil Nadu, India. The collected images were pre-processed using conventional image processing techniques to identify the ground-truth bounding-box labels for the cracks. The proposed multi-tasking Faster R-CNN uses Global Average Pooling (GAP) and Region of Interest (RoI) Align to detect road cracks. RoI Align avoids quantizing the stride, which minimizes information loss, and uses bilinear interpolation to map each proposal to the input image. The features produced by RoI Align are given as input to the GAP layer, which drastically reduces the multi-dimensional features by averaging each feature map into a single value. The output of the GAP layer is given to a fully connected layer for classification (softmax) and to a regression model that predicts the crack location as a bounding box. F1-measure, precision, and recall were used to evaluate the classification and detection results. On the MIT-CHN-ORR dataset, the proposed model achieves an accuracy of 97.97%, a precision of 99.12%, and a recall of 97.25% for classification. The experimental results show that the proposed approach outperforms other state-of-the-art methods.
Introduction
Pavement cracks and damage are the most common problems found on road surfaces. As the majority of goods are shipped by road, it is generally believed that well-paved roads contribute to the economic growth of a country. Maintaining road quality has always been a challenge for public authorities such as road engineers, whose most important duties are the maintenance, inspection, and reporting of road conditions. However, the detection, identification, and classification of damaged or cracked roads still rely on human experts [22].
The members of the road maintenance department report the actual status of the road, such as the severity and the locations of the road damage. It has been claimed that the manual assessments made by road surveyors are inaccurate most of the time. Whenever a road engineer inspects a road for maintenance, the road has to be blocked for a period, which affects normal usage and inconveniences the public [34]. So far, the statistical data have been maintained manually by the users. In this work, we propose a novel approach for automatic crack detection using a multi-tasking Faster R-CNN, and we claim that it can overcome the shortcomings of manual records.
In the present study, the different types of cracks and damage were categorized based on their shapes. Various image processing techniques and machine learning concepts have been proposed for predicting cracks and damage on road surfaces. The proposed approach for crack and damage detection and classification can roughly be divided into two major parts. The first part uses traditional image processing techniques with human intervention to derive ground-truth labels for classification and detection. The second part performs automatic crack and damage detection and classification for road surfaces using a Convolutional Neural Network (CNN).
In this work, we collected road crack images from the Outer Ring Road, Chennai. While collecting the images, we also captured the longitude, latitude, date, and time at the specific location of each crack. These statistical data are stored and updated automatically for future reference.
Many approaches have been proposed for road crack detection. They can be classified into two types: (1) traditional image processing approaches and (2) CNN-based deep learning approaches. Traditional image processing approaches use various techniques for feature extraction, and the extracted features are then given as input to a machine learning classifier. CNN-based approaches instead develop an end-to-end crack detection model, and many of them have been proposed for crack detection or segmentation.
The main contributions of this paper are as follows. We propose a novel multi-tasking Faster R-CNN for the detection and classification of road crack images. The proposed approach uses RoI Align in Faster R-CNN, which does not quantize the stride. This reduces the error caused by stride quantization in traditional RoI mapping, so the proposed model maps the region proposal to the original image more accurately. The multi-tasking Faster R-CNN is designed using a GAP layer, which maps the output of the convolution layers to the final fully connected layer for classification and road crack detection.
The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 explains the pre-processing techniques applied to the collected images and the identification of ground-truth labels for the cracks. Section 4 introduces the proposed approach for road surface crack detection and classification. Section 5 discusses the dataset collection, the experiments, and the analysis of the results for the proposed method. Section 6 concludes the paper.
Related works
Road crack detection is an important application of computer vision, and many approaches have been proposed for road crack detection, segmentation, and classification. Road crack detection algorithms can be broadly classified into two types: (1) image processing approaches and (2) deep learning approaches. Image processing techniques were used in the early period of developing road crack detection algorithms. One such approach, proposed in [33], uses a traditional image processing algorithm for segmentation and classification of road cracks. Edges are detected from the input images using the Otsu algorithm and threshold division. However, the basic Otsu thresholding method detects only the edges in the images. Thus, it is not suitable for noisy images, and it is unable to detect low-intensity cracks.
In another image processing based algorithm for road crack detection [24], the input images are first pre-processed and then given as input to the crack detection algorithm. It uses a variance discriminant analysis technique that extracts the crack regions based on pixel variance. The thresholding method applied here is similar to Otsu’s binarization, but the crack regions are extracted using the variance of the pixels: based on the variance calculation, a threshold value is set to classify each pixel as a crack pixel or a non-crack pixel.
Other approaches use image processing techniques to pre-process the images and then apply a path selection algorithm to detect cracks [1, 41]. The approach in [16] uses image processing techniques for pavement crack detection. The input images are pre-processed using contrast stretching, morphological operations, and noise reduction, and threshold conversion is then applied to detect the cracks. The detected broken cracks are connected and classified using a crack edge searching algorithm. The CrackTree method [41] was developed to detect cracks automatically after removing shadows from pavement images. To identify crack connectivity, a crack probability map is constructed. These maps are connected using a minimum spanning tree to segment the complete crack regions.
A minimum path selection based approach was proposed in [1] for detecting cracks. The image is divided into smaller regions, and one pixel, called an endpoint, is selected from each region. Dijkstra’s algorithm is then used to select paths from one endpoint to another based on the intensity of the pixels. Cracks were detected in [13] using an adaptive line detector in association with the Hidden Markov Random Field and Expectation-Maximization (HMRF-EM) algorithms. The input images are pre-processed using traditional image processing techniques such as a bandpass filter, which reduces high-frequency noise. The HMRF-EM algorithm calculates the spatial correlation and spatial constraints between the pixels. Then, a conditional connection algorithm is used to connect the crack segments.
The initial deep learning approaches classify images as crack or non-crack [5, 39]. The cracks are then detected from the classified crack images using image processing methods. In [5], a deep learning approach is proposed for the classification of crack images. After classification, segmentation is carried out using image processing: the image is smoothed using a bilateral filter, and adaptive thresholding is used to segment the cracks.
In [17], a spatially tuned robust multi-feature classifier is proposed for crack classification. It first extracts the curve regions from the given images, and features are then extracted from these curve regions. These features are classified using classifiers such as Random Forest and SVM. From the classified images, the crack regions are identified by constructing a density map from the mosaic obtained from the images.
To identify cracks, a multi-scale dilated attention based convolution module was introduced in [30]. A ResNet model is used to extract features from the images, and an attention mechanism is then used to extract high-level features using upsampling. These features are used for the classification of crack images.
Recent deep learning approaches use end-to-end crack detection algorithms, which calculate the bounding boxes of the cracks or generate a binary segmented image [3, 40]. The SDDNet model [3] detects the damaged region of a concrete surface using standard convolutions and a densely separable (DenSep) module along with modified spatial pooling for the segmentation of cracks. The model is trained using a modified intersection over union loss, so it performs binary segmentation of the crack regions. The approach in [4] performs pixel-level crack detection using the U-Net architecture. The U-Net is trained along with features extracted from pre-trained ResNet-34 and ResNet-50 models. The approach in [7] also uses U-Net for pixel-level crack identification; here, an auxiliary interaction loss is proposed to alleviate the problem of fractured crack regions.
The DeepCrack approach [15] was developed for pixel-wise segmentation of crack regions. It combines a deep convolutional network with Deeply-Supervised Nets, which learn multi-scale and multi-level features from the convolutional layers. After the crack regions are detected, guided filtering and Conditional Random Fields are used to refine the final prediction. In [19], an automated road surface distress analysis system was designed using the YOLO v2 framework. It predicts the bounding box locations for the given input images. Another approach, in [12], uses YOLO models for road crack detection.
Although many image processing techniques have been proposed for crack identification and detection, they perform worse than convolutional neural network based approaches. In this work, a novel end-to-end road crack detection and classification method was developed using a multi-tasking Faster R-CNN that incorporates a global average pooling layer and Region of Interest Align.
Ground truth label identification
Ground-truth labeling was carried out after pre-processing the road crack images using various traditional image processing techniques such as image enhancement, image transformation, image filtering, contrast stretching, and edge detection. Figure 1 shows the techniques used to identify the ground-truth labels for the crack images. After identifying the cracks in the pre-processed images, we manually annotate the crack labels and create the bounding boxes. The image processing techniques used in the present work are explained in detail in the following sub-sections.

Flow of deriving ground truth image from the obtained results of various traditional image processing techniques.
Image enhancement is the basic operation performed on an image to improve the intensity level or to suppress unwanted distortions. It also corrects the degraded portions of an image. To pre-process the images, a few image enhancement techniques [25] were employed and analyzed.
Various image enhancement techniques are available in the literature, such as sharpening, log transformation, histogram equalization, adaptive histogram equalization, and dynamic fuzzy histogram equalization. Sharpening improves the contrast between bright and dark regions to highlight features. In [31], sharpening is used for multispectral images in remote sensing, where it was found to improve the spatial resolution of the images. Another image enhancement technique is the log transformation, performed using equation (1).
Performance metrics such as Mean Square Error (MSE), Peak Signal to Noise Ratio (PSNR), Signal to Noise Ratio (SNR), and Structural Similarity Index Measure (SSIM) were computed for the various image enhancement techniques. Table 1 compares the results of these techniques on the MIT-CHN-ORR dataset. It was inferred that sharpening achieves better results than the other image enhancement techniques.
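Equation (1) is not reproduced in this text. As an illustration only, the sketch below assumes the commonly used form of the log transformation, s = c * log(1 + r), with the constant c chosen so that the brightest input pixel maps to 255. The function name `log_transform` and the 8-bit range are assumptions made for this example, not details taken from the study.

```python
import math

def log_transform(pixels, max_value=255):
    """Apply s = c * log(1 + r) to a 2-D list of grayscale intensities.

    Assumed form of the log transformation; c scales the brightest
    input pixel to max_value.
    """
    peak = max(max(row) for row in pixels)
    if peak == 0:
        # An all-black image stays black; avoids division by log(1) = 0.
        return [[0 for _ in row] for row in pixels]
    c = max_value / math.log(1 + peak)
    return [[round(c * math.log(1 + r)) for r in row] for row in pixels]

enhanced = log_transform([[0, 10], [100, 255]])
```

Under this form, dark pixels are spread over a wider output range than bright ones, which is why the transformation is often used to reveal detail in dark regions.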
Various image enhancement techniques - results comparison based on MSE, PSNR, SNR, and SSIM
A noise removal filter is used to remove unwanted and irrelevant information from an input image. Various noise removal approaches, such as salt-and-pepper noise filtering, the guided filter, and the Wiener filter, were explored. Table 2 compares the results of these noise removal techniques. The Wiener filter performed better than the other noise removal methods in the present study and was therefore used for noise removal. It also helps in removing the noise with a compression operation.
Comparison results of various noise removal filters
The main objective of contrast stretching is to rescale the intensity values of an image so that it makes use of all possible intensity values, which improves the dynamic range of the image. It is also known as normalization of an image. Two types of contrast stretching are used: global contrast stretching and linear contrast stretching. The approaches in [14, 35] and [6] use contrast stretching in applications such as remote sensing, biomedical image processing, and image enhancement. In this work, we use contrast stretching to pre-process the input images for developing ground-truth labels and also to pre-process the images given as input to the Faster R-CNN model.
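A minimal sketch of linear contrast stretching is given below, assuming the usual min-max form in which the darkest pixel is mapped to 0 and the brightest to 255. The function name `stretch_contrast` and the output range are illustrative assumptions; the exact variant used in the study is not specified in the text.

```python
def stretch_contrast(pixels, out_min=0, out_max=255):
    """Linearly rescale a 2-D list of intensities to [out_min, out_max]."""
    flat = [p for row in pixels for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat image has no contrast to stretch.
        return [[out_min for _ in row] for row in pixels]
    scale = (out_max - out_min) / (hi - lo)
    return [[round(out_min + (p - lo) * scale) for p in row] for row in pixels]

stretched = stretch_contrast([[50, 100], [150, 200]])
```

In this example the input occupies only the range 50 to 200, and the output occupies the full range 0 to 255.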
Edge detection
Edge detection is the process of identifying the boundaries or edges in an image. It plays a vital role in applications such as object recognition, analysis of moving objects, and pattern recognition. In [20], edge detection was performed in two stages: stage one identified the edge points, location, and orientation of the cubic facet model, and stage two decreased the noise and enhanced the edge structure.
The Sobel operator was used in [26] to identify the edges of an input image by applying a derivative approximation method. A new double-threshold technique was developed in [28] using a hysteresis threshold to find the edges of an image. Based on those threshold values, an improved Canny edge detector was obtained.
In [2], the Canny edge detector was employed in a scale multiplication method, which uses two scales of detection filter responses. Edge maps and detection criteria were used to identify the edges based on the localization criterion. The MM-Sobel operator and region growing were used in [29, 37] and [9] to isolate the identified edges from the given image.
Edge detection techniques work based on a threshold value calculated for the neighboring pixels with their respective deviation [18], and the quality of edge detection can be measured as described in [38]. The Sobel operator, Prewitt operator, Roberts operator, Laplacian of Gaussian, zero-cross edge detector, K-means, and Canny edge detector were implemented, and the resulting images are displayed in Fig. 2. From the results, it was concluded that Canny edge detection provides better results than the other algorithms. Hence, it was chosen in the present study to identify the edges while pre-processing the images. From the edge-detected images, we manually annotated the ground-truth labels for the road crack images in the MIT-CHN-ORR dataset.

Edge detection comparison results obtained for our proposed dataset.
In this work, we propose a multi-tasking Faster R-CNN for the detection and classification of crack images. The input image is passed through the convolution and pooling layers of the CNN model to extract the feature maps. The output of the pooling layers is passed through the ReLU activation function, and some batch normalization layers are also used. From these feature maps, 12 anchor points are derived. The anchor points are again passed through various convolutional and pooling layers, followed finally by the Global Average Pooling (GAP) layer. The extracted features are passed through softmax to classify the cracks into various types. The same features are passed through the regressor to predict the coordinates of the detected region, called the object proposal.
The feature maps are compared with the fixed-size feature maps and aligned using RoI Align. Consider an input of size 34 x 34 that must produce a 7 x 7 output: the stride is 34/7 ≈ 4.86. RoI Align does not round this value to 5; instead, it uses the stride value as obtained to generate the 7 x 7 output. Figure 3 shows the proposed architecture of the multi-tasking Faster R-CNN using global average pooling and RoI Align.
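The effect of rounding the stride can be illustrated with the 34 x 34 example. The sketch below is an illustration under stated assumptions, not the implementation used in the study: the helper `bin_edges` is hypothetical, and quantization is modeled as simple rounding of the stride, as described in the text.

```python
def bin_edges(roi_size, bins, quantize):
    """Return the bin boundaries along one axis of a region of interest.

    With quantize=True the stride is rounded to an integer (modeling the
    quantized RoI mapping); with quantize=False the fractional stride is
    kept, as in RoI Align.
    """
    stride = roi_size / bins
    if quantize:
        stride = round(stride)
    return [i * stride for i in range(bins + 1)]

aligned = bin_edges(34, 7, quantize=False)   # stride 34/7, about 4.86
rounded = bin_edges(34, 7, quantize=True)    # stride rounded to 5
```

With the fractional stride, the last boundary coincides with the edge of the 34-pixel region. With the rounded stride, the seven bins span 35 pixels, so the bins no longer line up with the region they are meant to cover.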

Proposed architecture using Multi-tasking Faster R-CNN in association with Global Average Pooling and Region of Interest Align.
In this work, the term 'object' refers to the crack or damaged part of the road image. The object proposal network works based on a feature map. The feature map contains an anchor region that points to a location in the image, which is calculated by Intersection over Union (IOU). The $IOU_a$ threshold value for mapping the anchor region was finalized as 0.75 after fine-tuning all the parameters. The $IOU_a$ was applied on an image using equation (2).
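Equation (2) is not reproduced in this text. The sketch below uses the standard definition of Intersection over Union, the area of overlap divided by the area of the union of two boxes, together with the 0.75 threshold stated above. The box format `(x1, y1, x2, y2)` and the helper names are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_positive_anchor(anchor, ground_truth, threshold=0.75):
    """Treat an anchor as matching a crack when its IoU reaches the threshold."""
    return iou(anchor, ground_truth) >= threshold
```

For example, two 10 x 10 boxes shifted by one pixel overlap in 90 of 110 pixels (IoU of about 0.82) and would pass the 0.75 threshold, while boxes shifted by five pixels (IoU of 1/3) would not.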
A total of 6 anchors were obtained for the single dimension for all 16 filters based on the fourth layer of convolution. For a base size of 16, the anchors have the ratios 1:1 (16:16), 1:2 (16:32), and 2:1 (32:16); similarly, for a base size of 8, the anchors have the ratios 1:1 (8:8), 1:2 (8:16), and 2:1 (16:8). After ignoring the cross-boundary anchors, approximately 7,500 anchors remain. Then, by applying Non-Maximum Suppression (NMS), the number of anchors is reduced to approximately 1,200. The output is then given as input to the global average pooling layer, which calculates the average of every single feature map. It returns a single average value per feature map, which is then passed to the fully connected layer. The softmax function is used to determine the type of the crack, and the regressor is used to identify the x and y coordinates, height, and width for generating a single bounding box.
The object proposal uses two loss functions. One loss function is used at the classifier to identify the object class. The other is used at the regressor to obtain the x and y coordinates, width, and height for constructing the bounding box.
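The six anchor shapes and the averaging performed by the GAP layer can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, each anchor is written as a pair of side lengths in the order given in the text, and feature maps are represented as plain nested lists rather than tensors.

```python
def anchor_shapes(base_sizes=(16, 8), ratios=((1, 1), (1, 2), (2, 1))):
    """Enumerate anchor side lengths: each ratio a:b at base s gives (s*a, s*b)."""
    return [(base * a, base * b) for base in base_sizes for a, b in ratios]

def global_average_pool(feature_maps):
    """Reduce each 2-D feature map to the mean of its values."""
    pooled = []
    for fmap in feature_maps:
        count = sum(len(row) for row in fmap)
        pooled.append(sum(sum(row) for row in fmap) / count)
    return pooled

shapes = anchor_shapes()
features = global_average_pool([[[1, 2], [3, 4]], [[0, 0], [0, 8]]])
```

Here `shapes` holds the six pairs 16:16, 16:32, 32:16, 8:8, 8:16, and 16:8, and `features` holds one average per feature map, which is the vector that would be passed on to the fully connected layer.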
The loss function $L(C_i)$ of the classifier is given by equation (3).
$C_i$ - represents the value of the predicted probability.
$N_{cl}$ - number of anchors available in the mini-batch.
The loss function for the regressor, $L(R_i)$, is calculated using equation (4).
$C$ - constant value.
$N_{reg}$ - total number of anchors (approximately 1200).
The smooth function is given by equation (6).
The overall loss function was calculated using equation (8).
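The loss equations themselves are not reproduced in this text. As a hedged illustration, the sketch below assumes the standard Faster R-CNN formulation: a log loss averaged over the mini-batch anchors, a smooth L1 regression loss summed over positive anchors and divided by the number of anchors, and a constant weighting the regression term. The function names and this exact form are assumptions and may differ from equations (3) to (8) of the study.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1 (assumed form)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def classification_loss(probs, labels):
    """Binary log loss averaged over the anchors in the mini-batch."""
    eps = 1e-12  # keeps log() finite for probabilities of exactly 0 or 1
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def regression_loss(pred_boxes, target_boxes, labels, n_reg):
    """Smooth L1 over box coordinates, counted only for positive anchors."""
    total = 0.0
    for pred, target, y in zip(pred_boxes, target_boxes, labels):
        if y == 1:
            total += sum(smooth_l1(p - t) for p, t in zip(pred, target))
    return total / n_reg

def total_loss(probs, labels, pred_boxes, target_boxes, n_reg, c=1.0):
    """Classification loss plus c times the regression loss."""
    return (classification_loss(probs, labels)
            + c * regression_loss(pred_boxes, target_boxes, labels, n_reg))
```

In this formulation, anchors labeled as background contribute to the classification term only, so the box regressor is trained exclusively on anchors that overlap a crack.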
The softmax function was used to predict the class of an object. Figure 4 shows a sample calculation carried out for normalizing the output values. It also shows how the probability of a pixel being crack or non-crack is calculated from the output value.
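The normalization performed by softmax can be sketched as below. The input scores are arbitrary illustrative values, not the values of the sample calculation in Figure 4, and subtracting the maximum score is a common numerical-stability step added for this example.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([1.0, 2.0, 3.0])
predicted_class = probabilities.index(max(probabilities))
```

The class with the largest score receives the largest probability, so the predicted class is simply the index of the maximum.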

Sample calculation for normalization.
The Region of Interest is used to select the feature map from the given input image. The input image had a size of 102 x 102 and was reduced to an 8 x 8 size as the RoI (102/8 = 12.75). The stride was calculated as 12.75/3 = 4.25. The stride value is not rounded off; the obtained value is used as such. All the features were then mapped to 3 x 3 bins for the region of interest, with each bin spanning 4.25 based on the stride. Each bin is divided into four sub-cells at the top left, top right, bottom left, and bottom right, and each sub-cell is sampled using bilinear interpolation, giving 4 values per cell. Finally, the cell value is computed as the average or the maximum over the 4 sub-values. Figure 5 shows the working principle of RoI Align. RoI Align preserves the spatially oriented features with no data loss.
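The procedure can be sketched in plain Python as follows. This is a simplified illustration, not the implementation used in the study: it handles a single-channel feature map stored as a nested list, takes the region as `(y_start, x_start, y_end, x_end)` in fractional coordinates, places the 2 x 2 sample points at the centers of the four sub-cells, and averages them. All names are assumptions made for this example.

```python
import math

def bilinear(fmap, y, x):
    """Bilinearly interpolate a 2-D feature map at fractional position (y, x)."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(fmap, roi, out_size=3, samples=2):
    """Pool a region into out_size x out_size bins without rounding bin edges."""
    y_start, x_start, y_end, x_end = roi
    bin_h = (y_end - y_start) / out_size  # fractional stride, e.g. 12.75 / 3 = 4.25
    bin_w = (x_end - x_start) / out_size
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            values = []
            for sy in range(samples):
                for sx in range(samples):
                    # Sample at the center of each sub-cell of the bin.
                    y = y_start + (i + (sy + 0.5) / samples) * bin_h
                    x = x_start + (j + (sx + 0.5) / samples) * bin_w
                    values.append(bilinear(fmap, y, x))
            row.append(sum(values) / len(values))  # average of the 4 sub-values
        output.append(row)
    return output

# A 16 x 16 map whose value equals its column index, pooled over a 12.75-wide region.
feature_map = [[float(x) for x in range(16)] for _ in range(16)]
pooled = roi_align(feature_map, (0.0, 0.0, 12.75, 12.75))
```

Because the feature map in this example varies linearly with the column index, each pooled value equals the horizontal center of its bin (2.125, 6.375, and 10.625), which shows that the fractional bin positions are retained rather than rounded.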

An illustration of the Region of Interest align.
In this section, we discuss the various results obtained from the experiments carried out in this work. The results obtained for pre-processing of images were evaluated using various metrics like PSNR, MSE, and SSIM. The results obtained for the crack detection and the classification were evaluated using various metrics like precision, recall, accuracy, and F1-measure.
Dataset collection and generation
In the present study, the dataset was collected from the Outer Ring Road, Chennai, and is denoted as the MIT-CHN-ORR dataset. While capturing the images, information such as the file name, latitude, longitude, elevation, altitude, accuracy, date, and time was saved automatically, as shown in Table 3. The dataset consists of 19,300 images: 9,875 crack images and 9,425 non-crack images. The crack images are further classified into four types: linear cracks (3,847 images), non-linear cracks (2,894 images), alligator cracks (2,108 images), and damaged roads with pitfalls (1,026 images). Together with the non-crack class, this gives five classes. The resolution of the images was 256 x 256 pixels. To compare the performance of our model, we also used the RDD dataset [17], which consists of 9,053 road damage images with 15,435 instances of road surface damage.
Dataset collection and information maintenance for statistical data
Metrics used for ground label identification
PSNR
PSNR is calculated by comparing the original image with the reconstructed image. It is the ratio of the maximum possible power of the signal to the power of the distorting noise. A higher value indicates better image quality.
MSE
MSE measures the error as the average of the squared differences between the original and the reconstructed image. A lower value indicates better image quality. It is also referred to as Mean Squared Deviation (MSD).
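The two metrics above can be sketched as follows, using their standard definitions. The sketch assumes 8-bit grayscale images stored as nested lists, so the maximum signal value is 255; the function names are chosen for this example.

```python
import math

def mse(original, reconstructed):
    """Mean of the squared pixel differences between two equally sized images."""
    diffs = [(a - b) ** 2
             for row_o, row_r in zip(original, reconstructed)
             for a, b in zip(row_o, row_r)]
    return sum(diffs) / len(diffs)

def psnr(original, reconstructed, max_value=255):
    """Peak signal-to-noise ratio in decibels: 10 * log10(max^2 / MSE)."""
    error = mse(original, reconstructed)
    if error == 0:
        return math.inf  # identical images have no distortion
    return 10 * math.log10(max_value ** 2 / error)

reference = [[0, 0], [0, 0]]
processed = [[0, 0], [0, 2]]
error = mse(reference, processed)      # one pixel differs by 2, so MSE = 4 / 4
quality = psnr(reference, processed)
```

Since PSNR is inversely related to MSE, a technique that lowers the MSE necessarily raises the PSNR for the same pair of images.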
SSIM
SSIM is calculated by comparing the original and the reconstructed image. It is measured based on luminance, contrast, and structure. A value close to 1 indicates that the images are highly similar, whereas a value close to 0 indicates little similarity.
Metrics used for crack detection and classification
The proposed work calculates the F1-measure of the results based on the confusion matrix. The matrix consists of four fields, namely,
True Positive (TP) - the number of crack images predicted accurately.
True Negative (TN) - the number of non-crack images predicted accurately.
False Positive (FP) - the number of images predicted as crack that do not have a crack.
False Negative (FN) - the number of images predicted as non-crack that do have a crack.
Precision, Recall and F1-Measure are calculated as follows:
Precision
It is the ratio of the number of correct positive predictions to the total number of positive predictions, calculated using equation (9).
Recall
It is the ratio of the number of correct positive predictions to the total number of positive samples in the dataset calculated using equation (10).
F1–Measure
It is the harmonic mean of precision and recall, computed using equation (11).
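Equations (9) to (11) are not reproduced in this text. The sketch below uses the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 * precision * recall / (precision + recall). The function names are chosen for this example.

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    """Fraction of actual positives that are detected."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def f1_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# F1 implied by the classification precision and recall reported in the abstract.
f1 = f1_measure(0.9912, 0.9725)
```

Under these definitions, the reported precision of 99.12% and recall of 97.25% correspond to an F1-measure of approximately 98.18%.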
The multi-tasking Faster R-CNN model for crack detection was implemented in the Keras environment with a TensorFlow backend. We used a dataset split of 60% for training, 10% for validation, and 30% for testing. From the MIT-CHN-ORR dataset, 11,580 images were used for training, 1,930 for validation, and 5,790 for testing. The input images consist of both crack and non-crack images. The stride for the CNN is set to 7. Batch Normalization (BN) is used to regularize the model, and ReLU is used as the activation function. The model was trained for 2,500 epochs with a batch size of 64 and a learning rate of 0.002. We used the Adam optimizer with beta1 = 0.7 and beta2 = 0.99 for training the model.
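The split sizes follow directly from the stated percentages, as the short check below illustrates. The helper `split_sizes` is hypothetical, and integer arithmetic is used so that the three parts always sum to the dataset size.

```python
def split_sizes(total, train_pct=60, val_pct=10):
    """Return (train, validation, test) image counts for a percentage split."""
    train = total * train_pct // 100
    val = total * val_pct // 100
    test = total - train - val  # the remainder goes to the test set
    return train, val, test

train_count, val_count, test_count = split_sizes(19300)
```

For the 19,300 images of the MIT-CHN-ORR dataset, this gives 11,580 training, 1,930 validation, and 5,790 test images, matching the counts stated above.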
Analysis of results
The proposed multi-tasking Faster R-CNN is evaluated using various evaluation metrics. Table 4 shows the comparison of detection results on the MIT-CHN-ORR dataset. Table 5 shows the comparison of detection and classification results between the existing approaches and the proposed approach on the existing RDDC dataset.
Comparison of results obtained for road crack detection using our MIT-CHN-ORR dataset
Comparison of results obtained for road crack detection using the existing RDDC dataset ([17])
The following observations are made from the results shown in Tables 4 and 5. The proposed approach provides better precision, recall, and F1-measure scores than traditional image processing based algorithms such as Otsu’s thresholding [33]. This is due to the end-to-end training of the model to detect the cracks. The proposed approach also performs better than other state-of-the-art approaches such as the YOLO v2 framework [19], Supervised Deep CNN [39], and Faster R-CNN [17]. This is attributed to the pre-processing of the images using traditional image processing techniques, together with the use of RoI Align and the GAP layer in the proposed multi-tasking Faster R-CNN model. The deep learning based approaches such as [17, 39] provide better performance than traditional image processing approaches such as Otsu’s thresholding [33]. For both datasets, the proposed multi-tasking Faster R-CNN provides better performance than the basic CNN designed using only the GAP layer for segmentation. Thus, the use of RoI Align and the multi-task training help improve the performance of the model. The use of GAP and RoI Align increases the performance compared to the model that uses neither. This is due to the round-off error in calculating the strides in a normal RoI layer, whereas RoI Align does not quantize the stride and hence reduces the error. The multi-tasking Faster R-CNN provides better performance than Mask R-CNN with RoI Pooling. This is due to the use of the GAP layer and RoI Align, which results in a lower error in RoI mapping.
Table 6 compares the results across the various datasets available in the literature. The following observations are made from Table 6. The proposed multi-tasking Faster R-CNN on the MIT-CHN-ORR dataset provides better results than all the other approaches. This is partly due to the large number of images collected in our MIT-CHN-ORR dataset. The traditional Random Structured Forests [27] provide better results for smaller dataset sizes such as 118 and 38 images, whereas the deep neural network based approaches [13, 39] use a larger number of images for training the model. Thus, the larger number of images in the dataset collected by us contributes to better performance. Although the Faster R-CNN approach in [13] uses a larger dataset of 9,053 images, it provides a lower performance score. The proposed approach uses 11,580 images for training and provides higher performance, which is due to the use of RoI Align and the GAP layer. We have also proposed a multi-task Faster R-CNN, which increases the performance of both classification and segmentation.
Comparison of results obtained for road crack detection for the existing methods with the proposed method
Table 7 compares the classification results on the MIT-CHN-ORR dataset. Here, ResNet+SVM is an SVM model trained using ResNet-152 features extracted from the road crack images. The multi-tasking Faster R-CNN provides higher performance than the basic ResNet+SVM and CNN models. Figure 6 shows the confusion matrix obtained for the classification performed using the multi-tasking Faster R-CNN. Most of the images are correctly classified. Misclassification is most frequent for non-crack images classified as one of the crack types. This is due to the large number of non-crack images compared to each type of crack image.
Comparison of classification results on MIT-CHN-ORR dataset

Confusion matrix obtained using Multi-tasking Faster RCNN for classification.
Figure 7 shows the Receiver Operating Characteristic (ROC) curve for the classification of the road crack images. The proposed multi-tasking Faster R-CNN provides better performance and a larger ROC-AUC value than the basic ResNet+SVM and CNN models. Figure 8 shows sample road images with cracks and without cracks, ground-truth images with bounding boxes on the cracks, and the classified images for our proposed MIT-CHN-ORR dataset. The figure shows that the proposed multi-tasking Faster R-CNN precisely calculates the bounding boxes of the cracks from the crack labels and correctly classifies the types of crack.

ROC-AUC curve for classification of road crack images.

The first row shows a few sample non-crack images from the MIT-CHN-ORR dataset, the second row shows a few sample crack images, the third row shows a few sample ground-truth crack images with bounding boxes, and the fourth and fifth rows show a few sample output images obtained using the multi-tasking Faster R-CNN (linear crack, non-linear crack, alligator crack, road damage, and non-crack).
Maintaining roadways is a very tedious task because a country has long stretches of road. Many approaches have been developed for road crack identification and detection. The present study puts forth a multi-tasking Faster R-CNN for crack detection and classification. We have also proposed the use of global average pooling and Region of Interest Align in the multi-tasking Faster R-CNN to perform classification and detection jointly. RoI Align reduces the error caused by stride quantization when mapping the region proposal to the original image. The proposed models are evaluated using the MIT-CHN-ORR and RDDC datasets. The performance of the proposed method was compared with existing methods using various evaluation metrics. The evaluation results show that the proposed multi-tasking Faster R-CNN outperforms the other state-of-the-art approaches for road crack detection and classification.
