Abstract
Cracks are a potential threat to the safety and durability of civil infrastructure, and therefore careful and regular structural crack inspection is needed throughout the long-term service period. Many image-processing approaches have been developed for structural crack detection. However, like traditional edge detection algorithms, these methods are easily disturbed by environmental effects. Convolutional neural networks are recently developed methods that perform excellently in image-classification tasks. This study proposes a fully convolutional network called Ci-Net for structural crack identification. Pixel-level labeled training images are obtained from online data sets. Four indices are adopted to evaluate the performance of the trained Ci-Net. Crack images from an indoor concrete beam test are adopted to validate its structural crack recognition capacity. The recognition results are also compared with those obtained by edge detection methods. The comparison indicates that Ci-Net performs better than the edge detection methods in structural damage detection.
Introduction
Civil infrastructure provides the shelter and transportation that human activities depend on. For example, bridges are among the most important infrastructure for keeping traffic running smoothly. So far, China has constructed more than 800,000 highway bridges and 200,000 railway bridges, and construction is still growing with the demand of economic development. In-service bridges are subjected to the coupled effects of a harsh natural environment (temperature change, humidity, acid rain, etc.), stochastic external loads (overload and vehicle impact), material deterioration, fatigue, and so on. As time goes on, bridges develop all kinds of problems, which has drawn worldwide attention to structural health monitoring (SHM) and inspection methods (Lin et al., 2017; Ni et al., 2010, 2012; Ye et al., 2012, 2013, 2016, 2018; Yi et al., 2013).
Crack detection has been one of the most important tasks in the inspection of civil infrastructure. Traditional crack inspection relies mainly on manual work that is empirical, subjective, time consuming, and not quantitative. Since cracks are visible, many studies have been conducted on vision-based methods for automated crack detection, and many image processing techniques have been developed for this purpose. Abdel-Qader et al. (2003) compared four edge detection methods (fast Haar transform, fast Fourier transform, Sobel edge detector, and Canny edge detector) for their capacity to detect concrete cracks. Yu et al. (2007) applied the geometric properties and patterns of cracks in a structure to the image-processing routine for detecting cracks on a tunnel wall. Fujita and Hamamoto (2011) employed a median filter and a multi-scale line filter to emphasize cracks, followed by probabilistic relaxation and a locally adaptive threshold to detect cracks. Li et al. (2013) used the image segmentation algorithm of the modified snake active contour model combined with the distant sensor information to detect cracks in concrete bridges. Lins and Givigi (2016) developed a vision-based method to estimate the crack dimension in the field. However, crack images with changing brightness, distortion, spotted surfaces, or other disturbing factors pose a great challenge to edge detection and feature extraction algorithms.
For image recognition tasks, convolutional neural networks (CNNs) have been shown to perform better in calculation time and recognition rate owing to shared weights, receptive fields, and downsampling operations (Han et al., 2018; Muller et al., 2016; Yang et al., 2018). Hubel and Wiesel (1962) proposed the concept of the receptive field in research on a cat’s visual cortex system and found that the visual cells only interacted with local elements. Inspired by this idea, Fukushima and Miyake (1982) proposed the neocognitron for pattern recognition. LeCun et al. (1998) proposed LeNet-5 for document recognition, which was one of the first CNNs with a modern structure. CNNs usually combine convolution layers, pooling or subsampling layers, fully connected layers, and some auxiliary layers followed by a classifier (Chen and Jahanshahi, 2018; Galea and Farrugia, 2017). They can classify many classes and extract plenty of features through a proper increase in the number of layers and a suitable training method (Hinton et al., 2006; Krizhevsky et al., 2012). Cha et al. (2017) proposed a method combining a CNN with a sliding window technique to scan images, and it was tested on 55 images from different structures with 5888 × 3584 pixel resolution. Kim and Cho (2018) proposed an automated detection technique for crack morphology on concrete surfaces under an on-site environment based on CNNs. Tong et al. (2018) applied k-means clustering analysis to pre-extract crack properties in images, and a deep CNN was adopted to calculate the length of cracks in the selected images. In these studies, recognition accuracy of more than 90% was obtained, and CNNs exhibited robustness against noise caused by illumination conditions, blots, or other factors.
However, CNNs usually require input images of a fixed size because of their fully connected layers. This is inconvenient when images are taken by cameras with different pixel resolutions.
Specifically, for crack detection on civil infrastructure surfaces, inspectors are concerned not only with detection but also with the geometry of cracks, including location, shape, orientation, width, and length. This industrial demand requires pixel-level identification of cracks. Fully convolutional networks (FCNs) are recently developed neural networks based on CNNs (Ronneberger et al., 2015). Because they have no fully connected layers with fixed dimensions, images of arbitrary size can be processed without preprocessing. Recognition is made pixel by pixel on the output, which provides pixel-wise identification capacity and end-to-end semantic segmentation (Shelhamer et al., 2017). In the case of crack recognition, FCNs are able to reveal the geometric properties of the cracks and therefore make accurate calculation of crack length and width possible.
In this study, an FCN called Ci-Net is proposed for the detection of structural cracks in images. The proposed FCN contains convolution layers, pooling layers, upsampling layers, deconvolution layers, and a softmax classifier. Pixel-level labeled training images are collected from the CrackForest data set (Shi et al., 2016), available at https://github.com/cuilimeng/CrackForest-dataset, and the TITS 2016 data set (Chambon and Moliard, 2011), available at https://www.irit.fr/~Sylvie.Chambon/Crack_Detection_Database.html. Crack images from a concrete beam loading test are collected to test the recognition capacity of the proposed FCN, Ci-Net. The output results are also compared with those obtained by edge detection methods.
Architecture of the proposed Ci-Net
The proposed FCN, Ci-Net, consists of seven convolution layers, two max-pooling layers, two upsampling layers, six deconvolution layers, and a softmax layer, as illustrated in Figure 1, inspired by the studies of Ronneberger et al. (2015) and Shelhamer et al. (2017). The initial weight values of the kernels and biases are randomly generated from a normal distribution and are then modified by the training process for structural crack detection. The proposed FCN is distinguished by its pixel-wise identification capacity, that is, the output predicts the classification of each pixel in the input image, and the size of input images is arbitrary. In addition, it has no fully connected layers, which usually exist in a CNN such as LeNet-5, as shown in Figure 2.

Architecture of the proposed Ci-Net.

Architecture of LeNet-5.
Feature extraction
The convolution layer performs element-wise multiplication and summation to extract features from the input images. The convolution layer in the proposed FCN has a 3 × 3 kernel (i.e. the filter or receptive field) to filter the input images. Figure 3 demonstrates the process of convolution with a stride of 1 pixel. Different filters can be constructed to extract different features in a convolution layer, and after the convolution operation, a new matrix called a feature map is generated. Zero padding is adopted to maintain the height and width of a feature map. Because the weight values of a filter are shared across the whole image, the computational demand of convolution is reduced dramatically. The convolution and activation process is operated according to equation (1). A rectified linear unit (ReLU) is applied after the filter as the activation function because of its convergence rate and low computational demand (Krizhevsky et al., 2012)
$$V = f\left(\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}x_{ij} + b\right) \tag{1}$$
where $w_{ij}$ is the weight value, $x_{ij}$ is the input element, $b$ is the bias, $n$ is the size of the kernel, $f(x)$ is the activation function (for ReLU, $f(x) = \max(0, x)$), and $V$ is the output value.

Convolution process.
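The convolution and activation step described above can be sketched in a few lines of NumPy. This is a minimal single-channel illustration under the stated settings (3 × 3 kernel, stride of 1 pixel, zero padding, ReLU), not the Ci-Net implementation; the function name `conv3x3_relu` and the averaging kernel are hypothetical.

```python
import numpy as np

def conv3x3_relu(image, kernel, bias=0.0):
    """Zero-padded, stride-1 convolution of a 2-D array followed by ReLU."""
    n = kernel.shape[0]                           # kernel size (3 in this sketch)
    pad = n // 2
    padded = np.pad(image, pad, mode="constant")  # zero padding keeps height and width
    out = np.empty(image.shape, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            window = padded[r:r + n, c:c + n]
            out[r, c] = np.sum(kernel * window) + bias  # sum of w_ij * x_ij, plus b
    return np.maximum(out, 0.0)                   # ReLU: f(x) = max(0, x)

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.full((3, 3), 1.0 / 9.0)               # hypothetical averaging filter
feature_map = conv3x3_relu(image, kernel)
```

The same nine weights are reused at every position, which is the weight sharing that keeps the computational demand low.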
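A batch normalization layer is added to address internal covariate shift, that is, the change in the distributions of the internal nodes of a deep network during training (Ioffe and Szegedy, 2015). Following Ioffe and Szegedy (2015), it can be expressed by
$$\mu_m = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_m^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_m\right)^2$$
$$\hat{x}_i = \frac{x_i - \mu_m}{\sqrt{\sigma_m^2 + \delta}}, \qquad \mathrm{BN}(x_i) = \gamma\hat{x}_i + \beta$$
where $x_i$ is the input value of a mini batch, $m$ is the number of training samples, $\mu_m$ is the mean of the mini-batch input values, $\sigma_m^2$ is their variance, $\delta$ is a small constant for numerical stability, $\hat{x}_i$ is the normalized value, and $\gamma$ and $\beta$ are the learnable scale and shift parameters.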
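A minimal sketch of the batch-normalization step, assuming the standard transform of Ioffe and Szegedy (2015): subtract the mini-batch mean, divide by the mini-batch standard deviation, then apply a learnable scale and shift. The names `gamma`, `beta`, and `eps` and their default values are illustrative assumptions, not settings reported for Ci-Net.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) over the mini batch (rows)."""
    mu = x.mean(axis=0)                     # mini-batch mean
    var = x.var(axis=0)                     # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, roughly unit variance
    return gamma * x_hat + beta             # learnable scale and shift

# Four samples with two features on very different scales.
batch = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
normalized = batch_norm(batch)
```

After normalization both features share the same scale, regardless of how the distribution of the incoming values drifts during training.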
A pooling layer, or downsampling layer, reduces the width and height of an incoming feature map. Usually, the convolution layer increases the depth of a feature map, which increases the total number of parameters to be modified. Without a proper reduction, the computational demand becomes a burden on the hardware. Typically, there are two kinds of pooling methods, that is, max pooling and mean pooling, as described in previous research (Amirshahi et al., 2016; Liu et al., 2015). The max-pooling method preserves the maximum value of a subarray in an incoming feature map, while the mean-pooling method averages the values in a subarray. According to previous research on crack identification (Cha et al., 2017; Chen and Jahanshahi, 2018), the max-pooling method exhibits better performance and is adopted in this study. The pooling layer has a 2 × 2 receptive field to implement the downsampling operation on the subarray, and the stride is 2 pixels. Figure 4 illustrates the process of the max-pooling method.

Max-pooling process.
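The max-pooling operation with a 2 × 2 receptive field and a stride of 2 pixels can be sketched as follows. This is a single-channel NumPy illustration; the function name `max_pool_2x2` and the sample values are hypothetical.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2 x 2 max pooling with stride 2 on a 2-D array (odd edges are dropped)."""
    h2, w2 = feature_map.shape[0] // 2, feature_map.shape[1] // 2
    # Split the map into non-overlapping 2 x 2 blocks, then take each block's maximum.
    blocks = feature_map[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]], dtype=float)
pooled = max_pool_2x2(fm)   # 4 x 4 input becomes a 2 x 2 output
```

Each output element keeps only the strongest response in its block, so the width and height are halved while the dominant features survive.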
Information restoration
After the convolution and pooling operations, a conventional CNN follows a contracting path, and the input image is converted into a vector for classification by the classifier. In the proposed FCN, by contrast, the deconvolution layer receives the incoming feature map from the prior layer and implements an operation to restore the information (Zeiler et al., 2011). In this study, the restored information consists of the features of the cracks and their locations. Together with the upsampling operation, the deconvolution operation helps to expand the feature map to the original size. The deconvolution layer has a 3 × 3 kernel to process the input array.
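To illustrate how the feature map returns to the original size, the sketch below traces the spatial dimensions through two 2 × 2 max-pooling steps and two upsampling steps. The upsampling type is not stated in the text, so nearest-neighbour repetition is assumed here, and the learned 3 × 3 deconvolution kernels are omitted; both function names are hypothetical.

```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling with stride 2 (contracting step)."""
    h2, w2 = x.shape[0] // 2, x.shape[1] // 2
    return x[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2).max(axis=(1, 3))

def upsample_2x(x):
    """Nearest-neighbour upsampling by a factor of 2 (assumed upsampling type)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

patch = np.random.default_rng(0).random((80, 80))   # one 80 x 80 training patch
contracted = max_pool_2x2(max_pool_2x2(patch))      # 80 -> 40 -> 20
restored = upsample_2x(upsample_2x(contracted))     # 20 -> 40 -> 80
```

Upsampling alone only repeats values; in the network, the deconvolution layers that follow are what refine the enlarged map into crack features and locations.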
Classification
A softmax classifier is adopted for the pixel-wise classification. The softmax function is widely used in artificial neural networks (ANNs) and takes the form
$$S_i = \frac{e^{V_i}}{\sum_{j=1}^{k} e^{V_j}}$$
where $k$ is the number of classes, $V_i$ is the input value of the $i$th class, and $S_i$ is the predicted probability of the $i$th class. The loss is calculated by
$$H(p, q) = -\sum_{i} p(i)\log q(i)$$
where $H(p, q)$ is the cross-entropy loss used to modify the weight values through the backpropagation algorithm and the Adam algorithm, $p(i)$ is the label, and $q(i)$ is the prediction of the samples. Modification of the weight values is discussed later in the training process.
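The pixel-wise softmax and cross-entropy loss can be sketched as follows for two classes (non-crack and crack). The array layout (height × width × classes), the function names, and the sample scores are assumptions made for illustration only.

```python
import numpy as np

def softmax(v, axis=-1):
    """Softmax over the class axis; subtracting the max avoids overflow."""
    e = np.exp(v - v.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(p, q, tiny=1e-12):
    """Mean over pixels of H(p, q) = -sum_i p(i) * log q(i)."""
    return float(np.mean(-np.sum(p * np.log(q + tiny), axis=-1)))

# Class scores for a 2 x 2 image: last axis is (non-crack, crack).
scores = np.array([[[2.0, 0.0], [0.0, 3.0]],
                   [[1.0, 1.0], [-1.0, 2.0]]])
labels = np.array([[[1.0, 0.0], [0.0, 1.0]],
                   [[1.0, 0.0], [0.0, 1.0]]])   # one-hot ground truth per pixel
probs = softmax(scores)
loss = cross_entropy(labels, probs)
```

Each pixel receives its own probability distribution over the classes, which is what allows the output to be read as a crack-probability map.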
Training and performance evaluation
Data set
A sufficient number of images is required to train a neural network, typically several thousand or more (Lee and Kwon, 2017). This requirement often limits practical applications because not enough training images are available for many specific problems. In order to obtain a satisfactory recognition rate, images are collected from several sources. Images from CrackForest and TITS 2016 with pixel-level labels serve as the training data set. The total number of raw training images is 762 (324 images with 1024 × 500 pixel resolution and 438 images with 500 × 300 pixel resolution). These raw training images are cropped into patches of 80 × 80 pixels to build the training data set. The cropped images in which the cracks are located near the center are selected. After the cropping and selection, 14,000 images with cracks are obtained for training.
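The cropping and selection step might look like the sketch below. The text does not state the cropping stride or the exact rule for "near the center", so a non-overlapping grid and a central 40 × 40 window containing at least one crack pixel are assumed here; `crop_patches` and its parameters are hypothetical names.

```python
import numpy as np

def crop_patches(image, mask, size=80, min_center_crack_pixels=1):
    """Tile an image and its label mask into size x size patches and keep
    those whose central region contains crack pixels (assumed criterion)."""
    patches = []
    h, w = mask.shape
    q = size // 4                                    # central window: middle half of each side
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            m = mask[top:top + size, left:left + size]
            if m[q:size - q, q:size - q].sum() >= min_center_crack_pixels:
                patches.append((image[top:top + size, left:left + size], m))
    return patches

# Synthetic 500 x 300 image with one horizontal crack along row 120.
image = np.full((300, 500), 200, dtype=np.uint8)
mask = np.zeros((300, 500), dtype=np.uint8)
image[120, :] = 30
mask[120, :] = 1
selected = crop_patches(image, mask)
```

Overlapping crops (a stride smaller than the patch size) would yield more patches from the same raw images.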
Training algorithm
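The purpose of training is to adjust the weight and bias values of the receptive fields of the functional layers. The stochastic gradient descent (SGD) principle and the backpropagation algorithm are widely used for training purposes (Shin et al., 2016). In this study, a modified SGD algorithm called Adam, proposed by Kingma and Ba (2014), is adopted for its performance in optimizing the training process. Adam uses an adaptive learning rate for each parameter in the network, which makes it better suited to sparse data. It also keeps an exponentially decaying average of past gradients. In its standard form, the Adam algorithm can be expressed by
$$\theta_t = \theta_{t-1} + \Delta\theta_t$$
where the subscripts $t$ and $t-1$ stand for the time step, $\theta$ is the parameter to be modified, $\theta_t$ is the value of $\theta$ in time step $t$, and $\Delta\theta_t$ is the modification for $\theta_t$, calculated by
$$\Delta\theta_t = -\varepsilon\,\frac{\hat{s}_t}{\sqrt{\hat{r}_t} + \delta}$$
where $\varepsilon$ is the learning rate with a value of 0.001, $\delta$ is a small constant for numerical stability, and $\hat{s}_t$ and $\hat{r}_t$ are the bias-corrected first-moment and second-moment estimates of the gradient, calculated by
$$s_t = \rho_1 s_{t-1} + (1-\rho_1)\,g_t, \qquad r_t = \rho_2 r_{t-1} + (1-\rho_2)\,g_t^2$$
$$\hat{s}_t = \frac{s_t}{1-\rho_1^{\,t}}, \qquad \hat{r}_t = \frac{r_t}{1-\rho_2^{\,t}}$$
$$g_t = \frac{1}{m}\sum_{i=1}^{m}\nabla_\theta L\left(\hat{y}_i, y_i\right)$$
where $\rho_1$ and $\rho_2$ are the exponential decay rates of the two moment estimates, $g_t$ is the mean value of the gradient, $m$ is the number of training samples, $y_i$ is the ground truth value, $\hat{y}_i$ is the corresponding prediction of the network with parameters $\theta_{t-1}$, and $L$ is the loss function.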
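A single Adam update can be sketched as below, following the standard algorithm of Kingma and Ba (2014). Only the learning rate of 0.001 comes from the text; the decay rates `rho1` and `rho2` and the constant `delta` are set to the defaults suggested by Kingma and Ba, which is an assumption about the training setup, and the quadratic objective is a toy stand-in for the network loss.

```python
import numpy as np

def adam_step(theta, grad, s, r, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update; returns the new parameter and moment estimates."""
    s = rho1 * s + (1.0 - rho1) * grad          # decaying average of gradients
    r = rho2 * r + (1.0 - rho2) * grad ** 2     # decaying average of squared gradients
    s_hat = s / (1.0 - rho1 ** t)               # bias correction for early steps
    r_hat = r / (1.0 - rho2 ** t)
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

# Toy objective (theta - 3)^2 with gradient 2 * (theta - 3).
theta = np.array([0.0])
s = np.zeros_like(theta)
r = np.zeros_like(theta)
history = []
for t in range(1, 1001):
    grad = 2.0 * (theta - 3.0)
    theta, s, r = adam_step(theta, grad, s, r, t)
    history.append(float(theta[0]))
```

Because the update is divided by the root of the second-moment estimate, the first step has a magnitude close to the learning rate whatever the size of the gradient, which is the per-parameter adaptive behaviour described above.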
Training and validation
The training and validation process is based on the data set of pixel-level labeled images. The total data set contains 12,500 images with 80 × 80 pixel resolution; 12,000 of them are split between training and validation at a ratio of 0.7:0.3, and the remaining 500 are left unused for the performance evaluation. The batch size is 32, and after the 40th epoch, the accuracy reaches 93.6%, as shown in Figure 5. Because recognition is made at the pixel level, the recognition rate is defined as the mean accuracy over all pixels.

Training accuracy.
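The split sizes and the pixel-level accuracy can be worked through as follows. The counts follow from the figures given in the text (12,000 images, a 0.7:0.3 ratio, batch size 32); the sample masks and the function name `pixel_accuracy` are hypothetical.

```python
import math
import numpy as np

n_images = 12_000
n_train = n_images * 7 // 10                          # 0.7 share for training
n_val = n_images - n_train                            # 0.3 share for validation
batch_size = 32
batches_per_epoch = math.ceil(n_train / batch_size)   # last batch is partly filled

def pixel_accuracy(predicted_mask, true_mask):
    """Fraction of pixels whose predicted class matches the label."""
    return float(np.mean(predicted_mask == true_mask))

true_mask = np.zeros((80, 80), dtype=bool)
true_mask[40, :] = True                               # one-pixel-wide horizontal crack
predicted_mask = true_mask.copy()
predicted_mask[40, :8] = False                        # prediction misses 8 crack pixels
accuracy = pixel_accuracy(predicted_mask, true_mask)
```

In this example the accuracy stays above 99% even though a tenth of the crack is missed, because crack pixels are a small share of each image; this is one reason the precision, recall, IoU, and F-measure indices are informative alongside accuracy.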
Performance evaluation
After the training process, the trained Ci-Net is evaluated with 500 unused images randomly selected from the data set. The indices of precision rate, recall rate, intersection over union (IoU), and F-measure are taken for evaluation, as expressed by
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
$$\text{IoU} = \frac{TP}{TP + FP + FN}, \qquad F_\beta = \frac{(1+\beta^2)\,\text{Precision}\times\text{Recall}}{\beta^2\,\text{Precision} + \text{Recall}}$$
where TP is the number of true positives, FN is the number of false negatives, FP is the number of false positives, and β is a coefficient that sets the trade-off between precision and recall. A higher precision rate means that a larger share of the pixels predicted as crack are actually crack pixels. A higher recall rate means that a larger share of all the actual crack pixels are correctly classified. Fβ is a weighted harmonic average that comprehensively reflects the precision and recall rates. β is set to 1 here to give the precision rate and recall rate the same weight. A higher IoU means a higher overlap between the predicted classification and the ground truth for crack pixels. The ideal value of each index is 1, and a higher value means better performance in the specific function. The precision rate is 0.84, recall rate is 0.82, IoU is 0.727, and F1 is 0.604. The results show that the proposed FCN performs well in a structural crack recognition task.
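The four indices can be computed from pixel counts as in the sketch below, using the standard definitions of precision, recall, IoU, and the F-measure. The counts in the example are hypothetical and are not the evaluation results of Ci-Net.

```python
def crack_metrics(tp, fp, fn, beta=1.0):
    """Precision, recall, IoU, and F-beta for the crack class from pixel counts."""
    precision = tp / (tp + fp)     # correct share of predicted crack pixels
    recall = tp / (tp + fn)        # detected share of actual crack pixels
    iou = tp / (tp + fp + fn)      # overlap between prediction and ground truth
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, iou, f_beta

# Hypothetical counts: 80 crack pixels found, 20 false alarms, 20 missed.
precision, recall, iou, f1 = crack_metrics(tp=80, fp=20, fn=20)
```

With β = 1 the F-measure reduces to the harmonic mean of precision and recall, so both error types are weighted equally.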
Testing and discussions
An indoor concrete beam test is carried out to create structural surface cracks on a concrete structure. The size of the test beam is 1400 mm × 100 mm × 160 mm, and the concrete has a standard compressive strength of 30 MPa. The test beam is poured and cured under standard conditions to simulate the construction of concrete structures. A hydraulic jack and a steel beam are used to impose and distribute the load on the test beam, as shown in Figure 6. The test is photographed using a Canon 600D digital single-lens reflex camera (5184 × 3456 pixel resolution, complementary metal oxide semiconductor sensor) at a distance of 1.5 m. Each pixel covers a distance of 0.3 mm.

Indoor test of the concrete beam.
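The stated scale of 0.3 mm per pixel allows pixel measurements to be converted into physical dimensions. The short calculation below is an illustration only; the 4-pixel crack width is a hypothetical value, and lens distortion and perspective effects are ignored.

```python
MM_PER_PIXEL = 0.3                                   # scale stated for the test setup
SENSOR_WIDTH_PX, SENSOR_HEIGHT_PX = 5184, 3456       # camera resolution from the text

def pixels_to_mm(n_pixels, mm_per_pixel=MM_PER_PIXEL):
    """Convert a length in pixels to millimetres at a uniform scale."""
    return n_pixels * mm_per_pixel

field_width_mm = pixels_to_mm(SENSOR_WIDTH_PX)       # horizontal coverage of one photo
field_height_mm = pixels_to_mm(SENSOR_HEIGHT_PX)     # vertical coverage of one photo
patch_side_mm = pixels_to_mm(80)                     # side of one 80 x 80 pixel patch
crack_width_mm = pixels_to_mm(4)                     # hypothetical crack 4 pixels wide
```

At this scale one photograph spans slightly more than the 1400 mm beam length, and a crack narrower than 0.3 mm occupies less than one pixel.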
The ultimate load of the concrete beam is calculated according to the mechanics of materials, and each load step is set to 10% of the ultimate load. After each step, a 5-min break is taken to allow the beam to deform. The beam is loaded until it collapses, and images are taken during the steps.
Some of the original crack images are fed into the proposed FCN for recognition, and the result is shown in Figure 7. The non-crack area is marked in purple, the crack area is marked in yellow, and darker yellow stands for a higher probability of cracking. In the black and white crack area of the input image, the result is mostly marked yellow with a probability of around 90%. The thinner part of the crack has a lower probability of around 40%. The disturbing white area at the top middle of the image is successfully eliminated. The output not only recognizes the existence of the crack but also shows its geometry, which is important for the inspection of concrete cracks.

Detection of structural crack.
For a comparative study, the Sobel and Canny edge detection methods are applied to detect the cracks, as shown in Figure 8. For the Sobel method, the threshold is set to the gradient value that exceeds 95% of the gradient values in the image (i.e. the 95th percentile). For the Canny method, the low threshold is 12 and the high threshold is 24. The results show that the edge detection methods identify the existence of the crack but with plenty of noise. For example, the white spot region of the original image is identified as a crack by the edge detection methods, while the proposed FCN detects cracks with little noise in the output image. This indicates that the proposed FCN performs better than the edge detection methods in structural crack detection.

Comparative study for structural crack detection: (a) original image, (b) Canny edge detection, (c) Sobel edge detection, and (d) Ci-Net detection.
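The Sobel baseline with a percentile threshold can be sketched in NumPy as follows. The implementation used for the comparison is not stated in the text, so the border handling (edge replication) and the synthetic test image are assumptions; Canny detection is omitted because it is normally taken from an image-processing library rather than written by hand.

```python
import numpy as np

SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def filter3x3(image, kernel):
    """Slide a 3 x 3 kernel over the image; borders are replicated (assumed)."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for dr in range(3):
        for dc in range(3):
            out += kernel[dr, dc] * padded[dr:dr + h, dc:dc + w]
    return out

def sobel_edges(image, percentile=95.0):
    """Mark pixels whose gradient magnitude exceeds the given percentile."""
    magnitude = np.hypot(filter3x3(image, SOBEL_X), filter3x3(image, SOBEL_Y))
    return magnitude > np.percentile(magnitude, percentile)

# Synthetic bright surface with a dark, one-pixel-wide vertical crack.
image = np.full((40, 40), 200.0)
image[:, 20] = 30.0
edges = sobel_edges(image)
```

In this example the detector marks the two columns on either side of the crack rather than the crack itself, and any bright spot with a sharp boundary would be marked in the same way, which is consistent with the noise reported for the edge detection results.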
Conclusion
A deep learning–based FCN called Ci-Net was proposed and trained for the detection of structural cracks. A total of 762 pixel-level labeled images from online data sets were collected and cropped into 80 × 80 images for training and validation, and the accuracy reached 93.6% after the 40th epoch. Four indices were adopted to evaluate the recognition performance of Ci-Net. Crack images from an indoor concrete beam test were processed by Ci-Net to assess its structural crack recognition capacity. The results indicated that Ci-Net is robust in crack detection and reveals the crack geometry, although there were false negatives for thin cracks. The comparative study showed that Ci-Net performs better than the edge detection methods for structural crack detection.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The study was jointly supported by the National Science Foundation of China (Grant Nos. 51822810 and 51778574), the Zhejiang Provincial Natural Science Foundation of China (Grant No. LR19E080002), and the Fundamental Research Funds for the Central Universities of China.
