Structural crack detection using deep learning–based fully convolutional networks

Abstract

Cracks are a potential threat to the safety and durability of civil infrastructure, so careful and regular crack inspection is needed throughout the service life of a structure. Many image-processing approaches have been developed for structural crack detection. However, like traditional edge detection algorithms, these methods are easily disturbed by environmental effects. Convolutional neural networks are a more recent development and perform excellently in image-classification tasks. This study proposes a fully convolutional network called Ci-Net for structural crack identification. Pixel-level labeled training images are obtained from online data sets. Four indices are adopted to evaluate the performance of the trained Ci-Net. Crack images from an indoor concrete beam test are used to validate its structural crack recognition capacity. The recognition results are also compared with those obtained by edge detection methods. The comparison indicates that Ci-Net outperforms the edge detection methods in structural crack detection.

Keywords

convolutional neural networks, deep learning, fully convolutional networks, structural crack detection, structural health monitoring

Introduction

Civil infrastructure provides the shelter and transportation on which human activities depend. Bridges, for example, are among the most important structures for keeping traffic running smoothly. So far, China has constructed more than 800,000 highway bridges and 200,000 railway bridges, and construction continues to grow with the demands of economic development. In-service bridges are subjected to the combined effects of a harsh natural environment (temperature change, humidity, acid rain, etc.), stochastic external loads (overload and vehicle impact), material deterioration, fatigue, and so on. Over time, bridges develop a wide range of problems, which has drawn worldwide attention to structural health monitoring (SHM) and inspection methods (Lin et al., 2017; Ni et al., 2010, 2012; Ye et al., 2012, 2013, 2016, 2018; Yi et al., 2013).

Crack detection is one of the most important tasks in the inspection of civil infrastructure. Traditional crack inspection relies mainly on manual work that is empirical, subjective, time consuming, and not quantitative. Because cracks are visible, many studies have investigated vision-based methods for automated crack detection, and many image-processing techniques have been developed for this purpose. Abdel-Qader et al. (2003) compared four edge detection methods (fast Haar transform, fast Fourier transform, Sobel edge detector, and Canny edge detector) for their capacity to detect concrete cracks. Yu et al. (2007) applied the geometric properties and patterns of cracks in a structure to an image-processing routine for detecting cracks on a tunnel wall. Fujita and Hamamoto (2011) employed a median filter and a multi-scale line filter to emphasize cracks, followed by probabilistic relaxation and a locally adaptive threshold to detect them. Li et al. (2013) used a modified snake active contour segmentation algorithm combined with distance sensor information to detect cracks in concrete bridges. Lins and Givigi (2016) developed a vision-based method to estimate crack dimensions in the field. However, crack images with varying brightness, distortion, spotted surfaces, or other disturbing factors pose a great challenge to edge detection and feature extraction algorithms.

For image recognition tasks, convolutional neural networks (CNNs) have been shown to perform better in computation time and recognition rate owing to shared weights, receptive fields, and downsampling operations (Han et al., 2018; Muller et al., 2016; Yang et al., 2018). Hubel and Wiesel (1962) proposed the concept of the receptive field in research on the cat’s visual cortex and found that visual cells interact only with local elements. Inspired by this idea, Fukushima and Miyake (1982) proposed the neocognitron for pattern recognition. LeCun et al. (1998) proposed LeNet-5 for document recognition, one of the first networks with a modern CNN structure. CNNs usually combine convolution layers, pooling or subsampling layers, fully connected layers, and some auxiliary layers, followed by a classifier (Chen and Jahanshahi, 2018; Galea and Farrugia, 2017). With a suitable increase in the number of layers and an appropriate training method, they can classify many classes and extract a large number of features (Hinton et al., 2006; Krizhevsky et al., 2012). Cha et al. (2017) combined a CNN with a sliding window technique to scan images and tested it on 55 images from different structures with 5888 × 3584 pixel resolution. Kim and Cho (2018) proposed a CNN-based technique for automated detection of crack morphology on concrete surfaces under on-site conditions. Tong et al. (2018) applied k-means clustering to pre-extract crack properties from images and then used a deep CNN on the selected images to calculate crack length. These studies reported recognition accuracies above 90%, and the CNNs were robust to noise caused by illumination conditions, blots, and other factors. However, because of their fully connected layers, CNNs usually require input images of a fixed size, which is inconvenient when images are taken by cameras with different pixel resolutions.
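The dependence of fully connected layers on image size can be illustrated with a short parameter count. The sketch below is only an illustration: the channel count (16) and the number of output units (10) are hypothetical values that do not come from this study.

```python
def conv_weight_count(kernel, in_channels, out_channels):
    """Parameters of a convolution layer: independent of the image size."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_weight_count(height, width, channels, units):
    """Parameters of a fully connected layer fed by a flattened feature map."""
    return height * width * channels * units + units


# Hypothetical layer sizes, chosen only for illustration.
for size in (80, 160):
    conv = conv_weight_count(3, 16, 16)
    dense = dense_weight_count(size, size, 16, 10)
    print(f"{size} x {size} input: conv {conv} parameters, dense {dense} parameters")
```

The convolution layer keeps the same number of parameters for both image sizes, whereas the weight matrix of the fully connected layer changes shape with the input, so a trained fully connected layer cannot be reused on an image of a different size.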

Specifically, for crack detection on civil infrastructure surfaces, inspectors are concerned not only with whether a crack exists but also with its geometry, including location, shape, orientation, width, and length. This industrial demand requires pixel-level identification of cracks. Fully convolutional networks (FCNs) are a more recent class of neural networks based on CNNs (Ronneberger et al., 2015). Because they have no fully connected layers, images of arbitrary size can be processed without resizing. Recognition is made pixel by pixel on the output, which provides pixel-wise identification and end-to-end semantic segmentation (Shelhamer et al., 2017). For crack recognition, FCNs can reveal the geometric properties of cracks and thereby make accurate calculation of crack length and width possible.

In this study, an FCN called Ci-Net is proposed for the detection of structural cracks in images. The proposed FCN contains convolution layers, pooling layers, upsampling layers, deconvolution layers, and a softmax classifier. Pixel-level labeled training images are collected from the CrackForest data set (Shi et al., 2016), available at https://github.com/cuilimeng/CrackForest-dataset, and the TITS 2016 data set (Chambon and Moliard, 2011), available at https://www.irit.fr/~Sylvie.Chambon/Crack_Detection_Database.html. Crack images from a concrete beam loading test are collected to test the recognition capacity of the proposed FCN, Ci-Net. The output results are also compared with those obtained by edge detection methods.

Architecture of the proposed Ci-Net

The proposed FCN, Ci-Net, consists of seven convolution layers, two max-pooling layers, two upsampling layers, six deconvolution layers, and a softmax layer, as illustrated in Figure 1; it is inspired by the studies of Ronneberger et al. (2015) and Shelhamer et al. (2017). The initial kernel weights and biases are randomly generated from a normal distribution and are then modified by the training process for structural crack detection. The proposed FCN is distinguished by its pixel-wise identification capacity, that is, the output predicts the class of each pixel in the input image, and the size of the input image is arbitrary. In addition, it has no fully connected layers, which usually exist in a CNN such as LeNet-5, as shown in Figure 2.

Figure 1.

Architecture of the proposed Ci-Net.

Figure 2.

Architecture of LeNet-5.

Feature extraction

The convolution layer multiplies the kernel weights element-wise with the input and sums the products to extract features from the input images. Each convolution layer in the proposed FCN uses a 3 × 3 kernel (i.e. the filter or receptive field). Figure 3 demonstrates the convolution process with a stride of 1 pixel. Different filters can be constructed to extract different features in a convolution layer, and the convolution operation generates a new matrix called a feature map. Zero padding is adopted to maintain the height and width of the feature map. Because the weights of a filter are shared across all positions, the computational demand of convolution is reduced dramatically. The convolution and activation process follows equation (1). A rectified linear unit (ReLU) follows the filter as the activation function because of its convergence rate and low computational demand (Krizhevsky et al., 2012)

V = f (\sum_{i = 1}^{n} \sum_{j = 1}^{n} w_{ij} x_{ij} + b)

(1)

where w_ij is the weight value, x_ij is the input element, b is the bias, n is the size of the kernel, f(x) is the activation function, and V is the output value.
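Equation (1) can be sketched for a single-channel image as follows. This is a minimal illustration in plain Python rather than the implementation used in this study; the function names and the example values are assumptions.

```python
def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return x if x > 0.0 else 0.0


def conv2d_same(image, kernel, bias=0.0):
    """Apply equation (1) at every pixel with stride 1 and zero padding.

    `image` is a 2D list, `kernel` an n x n list with n odd (3 in this study).
    The output has the same height and width as the input.
    """
    n = len(kernel)
    pad = n // 2
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = bias
            for i in range(n):
                for j in range(n):
                    rr, cc = r + i - pad, c + j - pad
                    # Positions outside the image contribute zero (zero padding).
                    if 0 <= rr < height and 0 <= cc < width:
                        total += kernel[i][j] * image[rr][cc]
            out[r][c] = relu(total)
    return out


ones = [[1.0] * 3 for _ in range(3)]          # example kernel, not a trained one
feature_map = conv2d_same([[1.0, 2.0], [3.0, 4.0]], ones)
```

With the all-ones kernel, every output element of the 2 × 2 example is the sum of all four input values, because each 3 × 3 window covers the whole image once the zero padding is taken into account.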

Figure 3.

Convolution process.

A batch normalization layer is added to address internal covariate shift, that is, the change in the distributions of the internal nodes of a deep network during training (Ioffe and Szegedy, 2015). It can be expressed by

μ_{m} = \frac{1}{m} \sum_{i = 1}^{m} x_{i}

(2)

σ_{m}^{2} = \frac{1}{m} \sum_{i = 1}^{m} (x_{i} - μ_{m})^{2}

(3)

{\hat{x}}_{i} = \frac{x_{i} - μ_{m}}{\sqrt{σ_{m}^{2} + ε}}

(4)

z_{i} = γ {\hat{x}}_{i} + ϕ

(5)

where x_i is an input value in a mini-batch, m is the number of training samples in the mini-batch, μ_m is the mini-batch mean, $σ_{m}^{2}$ is the mini-batch variance, ${\hat{x}}_{i}$ is the normalized value of x_i, ε is a constant with a value of 10⁻⁸ that keeps the denominator positive, z_i is the scaled and shifted value of ${\hat{x}}_{i}$, γ is the scale ratio with an initial value of 1, and ϕ is the shift with an initial value of 0.
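Equations (2) to (5) can be traced numerically with a few lines of code. The sketch below normalizes a one-dimensional mini-batch with the initial values of the scale and shift stated above; the function name and the sample values are assumptions made for illustration.

```python
import math


def batch_norm(xs, gamma=1.0, phi=0.0, eps=1e-8):
    """Normalize a mini-batch following equations (2) to (5)."""
    m = len(xs)
    mu = sum(xs) / m                                   # equation (2)
    var = sum((x - mu) ** 2 for x in xs) / m           # equation (3)
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]  # equation (4)
    return [gamma * v + phi for v in x_hat]            # equation (5)


z = batch_norm([1.0, 2.0, 3.0, 4.0])
```

With γ = 1 and ϕ = 0 the output has approximately zero mean and unit variance; during training, γ and ϕ are learned so that the network can undo the normalization where that helps.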

A pooling (downsampling) layer reduces the width and height of an incoming feature map. Usually, the convolution layer increases the depth of a feature map, which increases the total number of parameters to be trained. Without a suitable reduction step, the computational demand becomes a burden on the hardware. Typically, there are two kinds of pooling methods, that is, max pooling and mean pooling, as described in previous research (Amirshahi et al., 2016; Liu et al., 2015). The max-pooling method preserves the maximum value of a subarray of the incoming feature map, while the mean-pooling method averages the values in the subarray. According to previous research on crack identification (Cha et al., 2017; Chen and Jahanshahi, 2018), the max-pooling method performs better and is adopted in this study. The pooling layer has a 2 × 2 receptive field to implement the downsampling operation on the subarray, and the stride is 2 pixels. Figure 4 illustrates the process of the max-pooling method.

Figure 4.

Max-pooling process.
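The 2 × 2 max-pooling operation with a stride of 2 pixels can be written compactly. The following is a minimal sketch for a single-channel feature map; the function name and the example values are illustrative only.

```python
def max_pool_2x2(fmap):
    """Keep the maximum of each non-overlapping 2 x 2 subarray (stride 2).

    A trailing odd row or column, if any, is dropped in this sketch.
    """
    height, width = len(fmap), len(fmap[0])
    return [[max(fmap[r][c], fmap[r][c + 1],
                 fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, width - 1, 2)]
            for r in range(0, height - 1, 2)]


pooled = max_pool_2x2([[1, 3, 2, 0],
                       [4, 2, 1, 1],
                       [0, 0, 5, 6],
                       [1, 2, 7, 8]])
```

The 4 × 4 example is reduced to a 2 × 2 map, so both the height and the width are halved while the strongest response in each subarray is preserved.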

Information restoration

After the convolution and pooling operations, a conventional CNN follows a contracting path, and the input image is converted into a vector for classification by the classifier. In the proposed FCN, by contrast, the deconvolution layer receives the incoming feature map from the prior layer and restores information (Zeiler et al., 2011). In this study, the restored information comprises the features of the cracks and their locations. Together with the upsampling operation, the deconvolution operation expands the feature map back to the original size. The deconvolution layer has a 3 × 3 kernel to process the input array.
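How the feature-map size contracts and is then restored can be traced with a short sketch. The interpolation used by the upsampling layers is not stated here, so nearest-neighbour repetition with a factor of 2 is assumed purely for illustration, and both function names are hypothetical.

```python
def upsample_2x(fmap):
    """Nearest-neighbour upsampling (assumed): repeat each value in a 2 x 2 block."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def trace_size(size, n_pool=2, n_up=2):
    """Side length after each pooling step and then each upsampling step."""
    sizes = [size]
    for _ in range(n_pool):
        size //= 2          # 2 x 2 max pooling with stride 2
        sizes.append(size)
    for _ in range(n_up):
        size *= 2           # factor-2 upsampling
        sizes.append(size)
    return sizes


print(trace_size(80))
print(upsample_2x([[1, 2], [3, 4]]))
```

For the 80 × 80 training patches, two pooling layers followed by two upsampling layers return a map of the original size, which is what allows a class to be assigned to every input pixel.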

Classification

A softmax classifier is adopted for the pixel-wise classification. The softmax function is widely used in an artificial neural network (ANN) with a form as expressed by

P_{V_{i}} = \frac{e^{V_{i}}}{\sum_{j = 1}^{k} e^{V_{j}}}

(6)

where k is the number of classes, V_i is the input value of the ith class, and $P_{V_{i}}$ is the estimated probability of the ith class. $P_{V_{i}}$ is a number between 0 and 1, and a larger $P_{V_{i}}$ means a higher probability that the pixel belongs to the ith class. The cross-entropy loss is defined as

H (p, q) = - \sum_{i = 1}^{k} p (i) \log q (i)

(7)

where H(p, q) is the cross-entropy loss used to modify the weight values through the backpropagation and Adam algorithms, p(i) is the label, and q(i) is the predicted probability of the ith class. Modification of the weight values is discussed later in the training process.
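The softmax function and the cross-entropy loss can be evaluated for a single pixel with two classes (non-crack and crack). The sketch below is illustrative: the input values are arbitrary, and the max-subtraction and the small constant inside the logarithm are common numerical safeguards rather than part of equations (6) and (7).

```python
import math


def softmax(values):
    """Equation (6); subtracting the maximum avoids overflow without changing the result."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q, tiny=1e-12):
    """Equation (7); `tiny` guards against log(0)."""
    return -sum(p_i * math.log(q_i + tiny) for p_i, q_i in zip(p, q))


q = softmax([2.0, 0.0])            # scores for [crack, non-crack], arbitrary values
loss = cross_entropy([1.0, 0.0], q)  # one-hot label: the pixel is a crack pixel
```

For a one-hot label the loss reduces to the negative logarithm of the probability assigned to the true class, so it approaches 0 as that probability approaches 1.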

Training and performance evaluation

Data set

Training a neural network requires a sufficient number of images, typically several thousand or more (Lee and Kwon, 2017). This requirement often limits practical applications because many specific problems lack enough images for training. To obtain a satisfactory recognition rate, images are collected from several sources. Images from CrackForest and TITS 2016 with pixel-level labels serve as the training data set. The total number of raw training images is 762 (324 images with 1024 × 500 pixel resolution and 438 images with 500 × 300 pixel resolution). These raw training images are cropped into 80 × 80 pixel patches to build the training data set, and the cropped images in which the cracks are located near the center are selected. After the cropping and selection, 12,500 images with cracks are obtained.
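The cropping and selection step can be sketched as a sliding window over each labeled image. The stride and the rule for deciding that a crack lies near the center are not specified in the text, so the values below (`stride=80`, `margin=20`) and all function names are hypothetical; the resolutions are also assumed to be given as width × height.

```python
def crop_positions(height, width, patch=80, stride=80):
    """Top-left corners of all patch x patch windows that fit in the image."""
    return [(top, left)
            for top in range(0, height - patch + 1, stride)
            for left in range(0, width - patch + 1, stride)]


def crack_near_center(mask, top, left, patch=80, margin=20):
    """True if a labeled crack pixel lies in the central part of the window.

    `mask` is a 2D list of 0/1 pixel labels; `margin` (hypothetical) is the
    width of the border band ignored on every side of the window.
    """
    for r in range(top + margin, top + patch - margin):
        for c in range(left + margin, left + patch - margin):
            if mask[r][c]:
                return True
    return False


def select_patches(mask, patch=80, stride=80, margin=20):
    """Windows of one labeled image that would be kept for training."""
    height, width = len(mask), len(mask[0])
    return [(top, left)
            for top, left in crop_positions(height, width, patch, stride)
            if crack_near_center(mask, top, left, patch, margin)]
```

Under these assumptions a 1024 × 500 image yields 72 non-overlapping windows and a 500 × 300 image yields 18, of which only those with crack pixels in the central region are retained; a smaller stride would produce more, overlapping, candidates.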

Training algorithm

The purpose of training is to adjust the weight and bias values of the kernels in the functional layers. The stochastic gradient descent (SGD) principle and the backpropagation algorithm are widely used for training (Shin et al., 2016). In this study, a modified SGD algorithm called Adam, proposed by Kingma and Ba (2014), is adopted for its performance in optimizing the training process. Adam uses an adaptive learning rate for each parameter in the network, which makes it well suited to sparse data. It also keeps an exponentially decaying average of past gradients. The Adam algorithm can be expressed by

θ_{t} \leftarrow θ_{t - 1} + Δ θ_{t}

(8)

where the subscript t and t – 1 stand for the time step, θ is the parameter to be modified, θ_t is the value of θ in time step t, and Δθ_t is the modification for θ_t, calculated by

Δ θ_{t} = - ε \frac{{\hat{s}}_{t}}{\sqrt{{\hat{r}}_{t}} + δ}

(9)

{\hat{s}}_{t} \leftarrow \frac{s_{t}}{1 - ρ_{1}^{t}}

(10)

{\hat{r}}_{t} \leftarrow \frac{r_{t}}{1 - ρ_{2}^{t}}

(11)

where ε is the learning rate with a value of 0.001, $\hat{s}$ is the bias-corrected value of the first-order momentum s, $\hat{r}$ is the bias-corrected value of the second-order momentum r, δ is a small constant with a value of 10⁻⁸ that prevents division by zero, ρ₁ is the decay coefficient of the first-order momentum with a value of 0.9, and ρ₂ is the decay coefficient of the second-order momentum with a value of 0.999. The values of s and r are initialized to 0 and updated by

s_{t} \leftarrow ρ_{1} s_{t - 1} + (1 - ρ_{1}) g_{t}

(12)

r_{t} \leftarrow ρ_{2} r_{t - 1} + (1 - ρ_{2}) {g_{t}}^{2}

(13)

g_{t} = \frac{1}{m} \nabla_{θ} \sum_{i = 1}^{m} L (f (x_{i}; θ), y_{i})

(14)

where g_t is the mean gradient over the mini-batch at time step t, m is the number of training samples in the mini-batch, y_i is the ground truth value, $L (f (x_{i}; θ), y_{i})$ is the loss function, and $f (x_{i}; θ)$ is the predicted probability.
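A single Adam update for one scalar parameter can be sketched as follows, using the hyperparameter values given above. The bias correction divides by 1 − ρ₁ᵗ and 1 − ρ₂ᵗ, with the decay coefficients raised to the power of the time step t, as in Kingma and Ba (2014). The function name and the toy objective f(θ) = θ² are assumptions for illustration.

```python
import math


def adam_step(theta, grad, s, r, t,
              lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update; returns the new parameter and both moment estimates."""
    s = rho1 * s + (1.0 - rho1) * grad            # equation (12)
    r = rho2 * r + (1.0 - rho2) * grad * grad     # equation (13)
    s_hat = s / (1.0 - rho1 ** t)                 # bias-corrected first moment
    r_hat = r / (1.0 - rho2 ** t)                 # bias-corrected second moment
    theta = theta - lr * s_hat / (math.sqrt(r_hat) + delta)  # equations (8), (9)
    return theta, s, r


# Toy example: minimize f(theta) = theta**2, whose gradient is 2 * theta.
theta, s, r = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, s, r = adam_step(theta, 2.0 * theta, s, r, t)
```

In the first step the corrected moments equal the gradient and its square, so the parameter moves by almost exactly the learning rate regardless of the gradient magnitude; this is the per-parameter scaling that makes the step size adaptive.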

Training and validation

The training and validation process is based on the data set of pixel-level labeled images. The total data set contains 12,500 images with 80 × 80 pixel resolution, of which 12,000 images are divided between training and validation at a ratio of 0.7:0.3. The batch size is 32, and after the 40th epoch, the accuracy reaches 93.6%, as shown in Figure 5. Because recognition is made at the pixel level, the recognition rate is defined as the mean per-pixel accuracy.
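The split sizes and the pixel-level accuracy described above can be worked through in a few lines. The function names are hypothetical, and the iteration count assumes that the last, partially filled batch of each epoch is still used.

```python
import math


def split_counts(n_images, train_fraction=0.7):
    """Number of training and validation images for a given split ratio."""
    n_train = round(n_images * train_fraction)
    return n_train, n_images - n_train


def pixel_accuracy(pred, label):
    """Fraction of pixels whose predicted class matches the label."""
    total = correct = 0
    for pred_row, label_row in zip(pred, label):
        for p, l in zip(pred_row, label_row):
            total += 1
            correct += int(p == l)
    return correct / total


n_train, n_val = split_counts(12000)
iterations_per_epoch = math.ceil(n_train / 32)   # batch size of 32
```

A 0.7:0.3 split of 12,000 images gives 8,400 training and 3,600 validation images, and a batch size of 32 then corresponds to 263 parameter updates per epoch under the stated assumption.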

Figure 5.

Training accuracy.

Performance evaluation

After the training process, the trained Ci-Net is evaluated with 500 unused images randomly selected from the data set. The indices of precision rate, recall rate, intersection over union (IoU), and F-measure are taken for evaluation, as expressed by

Precision = \frac{TP}{TP + FP}

(15)

Recall = \frac{TP}{TP + FN}

(16)

F_{β} = (1 + β^{2}) \frac{Precision \times Recall}{β^{2} \times Precision + Recall}

(17)

IoU = \frac{TP}{TP + FN + FP}

(18)

where TP is the number of true positives, FN is the number of false negatives, FP is the number of false positives, and β is a coefficient that sets the trade-off between precision and recall. A higher precision rate means that a larger share of the pixels predicted as crack are truly crack pixels. A higher recall rate means that a larger share of all true crack pixels are correctly identified. F_β is a weighted harmonic mean that reflects both the precision and recall rates; β is set to 1 here to give them the same weight. A higher IoU means a greater overlap between the predicted classification and the ground truth for crack pixels. The ideal value of each index is 1, and a higher value indicates better performance. The precision rate is 0.84, recall rate is 0.82, IoU is 0.727, and F₁ is 0.604. The results show that the proposed FCN performs well in a structural crack recognition task.
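Equations (15) to (18) can be computed directly from a predicted mask and its ground truth. The sketch below counts crack pixels (label 1) in small binary masks; the function name and the example masks are illustrative and are not data from this study.

```python
def crack_metrics(pred, label, beta=1.0):
    """Precision, recall, F-beta, and IoU for the crack class of binary masks."""
    tp = fp = fn = 0
    for pred_row, label_row in zip(pred, label):
        for p, l in zip(pred_row, label_row):
            if p and l:
                tp += 1
            elif p and not l:
                fp += 1
            elif l and not p:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    iou = tp / (tp + fn + fp) if tp + fn + fp else 0.0
    return precision, recall, f_beta, iou


precision, recall, f1, iou = crack_metrics([[1, 1, 0], [0, 1, 0]],
                                           [[1, 0, 0], [0, 1, 1]])
```

In this example TP = 2, FP = 1, and FN = 1, which gives a precision, recall, and F₁ of 2/3 and an IoU of 0.5. For a single set of counts the two overlap measures are linked by IoU = F₁ / (2 − F₁).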

Testing and discussions

An indoor concrete beam test is carried out to create structural surface cracks on a concrete structure. The size of the test beam is 1400 mm × 100 mm × 160 mm, and the concrete has a standard compressive strength of 30 MPa. The test beam is cast and cured under standard conditions to simulate the construction of concrete structures. A hydraulic jack and a steel beam are used to impose and distribute the load on the test beam, as shown in Figure 6. The test is photographed using a Canon 600D digital single-lens reflex camera (5184 × 3456 pixel resolution, complementary metal oxide semiconductor sensor) at a distance of 1.5 m. Each pixel covers a distance of 0.3 mm.
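The stated scale of 0.3 mm per pixel allows pixel measurements to be converted into physical dimensions. The short sketch below shows the conversion; the function name and the 4-pixel crack width are hypothetical examples.

```python
MM_PER_PIXEL = 0.3  # scale stated for the test setup


def pixels_to_mm(n_pixels, mm_per_pixel=MM_PER_PIXEL):
    """Convert a length measured in pixels into millimetres."""
    return n_pixels * mm_per_pixel


frame_width_mm = pixels_to_mm(5184)   # horizontal extent of one photograph
crack_width_mm = pixels_to_mm(4)      # hypothetical crack that is 4 pixels wide
```

At this scale one photograph spans roughly 1.56 m horizontally, slightly more than the 1400 mm length of the beam, and a crack that is 4 pixels wide corresponds to about 1.2 mm.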

Figure 6.

Indoor test of the concrete beam.

The ultimate load of the concrete beam is calculated according to the mechanics of materials, and one load step is set to 10% of the ultimate load. After each step, a 5-min break is taken to allow the deformation of the beam to develop. The beam is loaded until it fails, and images are taken during the load steps.

A selection of the original crack images is fed into the proposed FCN for recognition, and the result is shown in Figure 7. The non-crack area is marked in purple, the crack area is marked in yellow, and a darker yellow stands for a higher probability of cracking. In the black and white crack area of the input image, the result is mostly marked yellow with a probability of around 90%. The thinner part of the crack has a lower probability of around 40%. The disturbing white area at the top middle of the image is successfully eliminated. The output not only recognizes the existence of the crack but also shows its geometry, which is important for the inspection of concrete cracks.

Figure 7.

Detection of structural crack.

For comparison, the Sobel and Canny edge detection methods are applied to detect the cracks, as shown in Figure 8. For the Sobel method, the threshold is set to the gradient value that exceeds 95% of the gradient values in the image. For the Canny method, the low threshold is 12 and the high threshold is 24. The results show that the edge detection methods identify the crack but with plenty of noise; for example, the white spot region of the original image is detected as a crack. The proposed FCN, by contrast, detects cracks with little noise in the output image. This indicates that the proposed FCN outperforms the edge detection methods in structural crack detection.

Figure 8.

Comparative study for structural crack detection: (a) original image, (b) Canny edge detection, (c) Sobel edge detection, and (d) Ci-Net detection.
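The Sobel baseline with a percentile-based threshold can be sketched in plain Python. This is one reading of the threshold description, namely that a pixel is kept when its gradient magnitude reaches the value exceeding 95% of all magnitudes; the function names, the treatment of border pixels, and the synthetic test image are assumptions.

```python
import math


def sobel_magnitude(img):
    """Gradient magnitude from the 3 x 3 Sobel operators (border pixels left at 0)."""
    height, width = len(img), len(img[0])
    mag = [[0.0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = (img[r - 1][c + 1] + 2 * img[r][c + 1] + img[r + 1][c + 1]
                  - img[r - 1][c - 1] - 2 * img[r][c - 1] - img[r + 1][c - 1])
            gy = (img[r + 1][c - 1] + 2 * img[r + 1][c] + img[r + 1][c + 1]
                  - img[r - 1][c - 1] - 2 * img[r - 1][c] - img[r - 1][c + 1])
            mag[r][c] = math.hypot(gx, gy)
    return mag


def percentile_threshold(mag, fraction=0.95):
    """Magnitude below which roughly `fraction` of all pixels fall."""
    flat = sorted(value for row in mag for value in row)
    return flat[min(len(flat) - 1, int(fraction * len(flat)))]


def sobel_edges(img, fraction=0.95):
    """Binary edge map: 1 where the magnitude is at or above the threshold."""
    mag = sobel_magnitude(img)
    threshold = percentile_threshold(mag, fraction)
    return [[1 if value >= threshold else 0 for value in row] for row in mag]


# Synthetic 10 x 10 image with a vertical step from 0 to 100.
step = [[0] * 5 + [100] * 5 for _ in range(10)]
edges = sobel_edges(step)
```

Because the threshold depends only on the gradient magnitude, any sharp intensity change, such as the border of a bright spot, is marked in the same way as a crack, which is consistent with the noise seen in the edge detection results.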

Conclusion

A deep learning–based FCN called Ci-Net was proposed and trained for the detection of structural cracks. A total of 762 pixel-level labeled images from online data sets were collected and cropped into 80 × 80 images for training and validation, and the accuracy reached 93.6% after the 40th epoch. Four indices were adopted to evaluate the recognition performance of Ci-Net. Crack images from an indoor concrete beam test were processed by Ci-Net to assess its structural crack recognition capacity. The results indicated that Ci-Net is robust in crack detection and reveals the crack geometry, although false negatives occurred for thin cracks. The comparative study showed that Ci-Net outperforms the edge detection methods in structural crack detection.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The study was jointly supported by the National Science Foundation of China (Grant Nos. 51822810 and 51778574), the Zhejiang Provincial Natural Science Foundation of China (Grant No. LR19E080002), and the Fundamental Research Funds for the Central Universities of China.

ORCID iD

Xiao-Wei Ye

References

Abdel-Qader, Abudayyeh, Kelly (2003) Analysis of edge-detection techniques for crack identification in bridges. Journal of Computing in Civil Engineering 17(4): 255–263.

Amirshahi, Pedersen (2016) Image quality assessment by comparing CNN features between images. Journal of Imaging Science and Technology 60(6): 1–10.

Cha, Choi, Buyukozturk (2017) Deep learning-based crack damage detection using convolutional neural networks. Computer-Aided Civil and Infrastructure Engineering 32(5): 361–378.

Chambon, Moliard (2011) Automatic road pavement assessment with image processing: Review and comparison. International Journal of Geophysics 2011: 989354.

Chen, Jahanshahi (2018) NB-CNN: Deep learning-based crack detection using convolutional neural network and naive Bayes data fusion. IEEE Transactions on Industrial Electronics 65(5): 4392–4400.

Fujita, Hamamoto (2011) A robust automatic crack detection method from noisy concrete surfaces. Machine Vision and Applications 22(2): 245–254.

Fukushima, Miyake (1982) Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition 15(6): 455–469.

Galea, Farrugia (2017) Matching software-generated sketches to face photos with a very deep CNN, morphed faces, and transfer learning. IEEE Transactions on Information Forensics and Security 13(6): 1421–1431.

Han, Liu, Fan (2018) A new image classification method using CNN transfer learning and web data augmentation. Expert Systems with Applications 95: 43–56.

Hinton, Osindero, Teh (2006) A fast learning algorithm for deep belief nets. Neural Computation 18(7): 1527–1554.

Hubel, Wiesel (1962) Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. Journal of Physiology 160(1): 106–154.

Ioffe, Szegedy (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd international conference on machine learning (eds Bach, Blei), Lille, 6–11 July.

Kim, Cho (2018) Automated vision-based detection of cracks on concrete surfaces using a deep learning technique. Sensors 18(10): 3452.

Kingma (2014) Adam: A method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations, San Diego, CA, 7–9 May.

Krizhevsky, Sutskever, Hinton (2012) ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 60(2): 1097–1105.

LeCun, Bottou, Bengio, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11): 2278–2324.

Lee, Kwon (2017) Going deeper with contextual CNN for hyperspectral image classification. IEEE Transactions on Image Processing 26(10): 4843–4855.

(2013) Image-based method for concrete bridge crack detection. Journal of Information and Computational Science 10(8): 2229–2236.

Lin, et al. (2017) Damage detection in the cable structures of a bridge using the virtual distortion method. Journal of Bridge Engineering 22(8): 04017039.

Lins, Givigi (2016) Automatic crack detection and measurement based on image analysis. IEEE Transactions on Instrumentation and Measurement 65(3): 583–590.

Liu, Lin, Shen (2015) CRF learning with CNN features for image segmentation. Pattern Recognition 48(10): 2983–2992.

Muller, Tetzlaff (2016) NEROvideo: A general-purpose CNN-UM video processing system. Journal of Real-Time Image Processing 12(4): 763–774.

(2010) Monitoring-based fatigue reliability assessment of steel bridges: analytical model and application. Journal of Structural Engineering 136(12): 1563–1573.

(2012) Modeling of stress spectrum using long-term monitoring data and finite mixture distributions. Journal of Engineering Mechanics 138(2): 75–183.

Ronneberger, Fischer, Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In: Proceedings of the 18th international conference on medical image computing and computer-assisted intervention, Munich, 5–9 October.

Shelhamer, Long, Darrell (2017) Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4): 640–651.

Shi, Cui, et al. (2016) Automatic road crack detection using random structured forests. IEEE Transactions on Intelligent Transportation Systems 17(12): 3434–3445.

Shin, Roth, Gao, et al. (2016) Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging 35(5): 1285–1298.

Tong, Gao, Han, et al. (2018) Recognition of asphalt pavement crack length using deep convolutional neural networks. Road Materials and Pavement Design 19(6): 1334–1349.

Yang, Gao, Song, et al. (2018) Aurora image search with contextual CNN feature. Neurocomputing 281: 67–77.

Dong, Liu (2016) Image-based structural dynamic displacement measurement using different multi-object tracking algorithms. Smart Structures and Systems 17(6): 935–956.

Wai, et al. (2013) A vision-based system for dynamic displacement measurement of long-span bridges: Algorithm and verification. Smart Structures and Systems 12(3–4): 363–379.

Wong, et al. (2012) Statistical analysis of stress spectra for fatigue life assessment of steel bridges with structural health monitoring data. Engineering Structures 45: 166–176.

et al. (2018) Stochastic characterization of wind field characteristics of an arch bridge instrumented with structural health monitoring system. Structural Safety 71: 47–56.

Sun (2013) Multi-stage structural damage diagnosis method based on “energy-damage” theory. Smart Structures and Systems 12(3–4): 345–361.

Jang, Han (2007) Auto inspection system using a mobile robot for detecting concrete cracks in a tunnel. Automation in Construction 16(3): 255–261.

Zeiler, Taylor, Fergus (2011) Adaptive deconvolutional networks for mid and high level feature learning. In: Proceedings of the 2011 international conference on computer vision, Barcelona, 6–13 November, pp. 2018–2025. New York: IEEE.