Abstract
The intelligent detection of distress in concrete is a research hotspot in structural health monitoring. In this study, Att-Unet, an improved fully convolutional neural network with an attention mechanism, is proposed for end-to-end pixel-level crack segmentation. Att-Unet consists of three parts: an encoding module, a decoding module, and an AG (Attention Gate) module. Together, these modules extract multi-scale crack features, focus on critical areas, and reconstruct semantics, which significantly improves the crack segmentation capability of the model. Two mainstream semantic segmentation models (FCN and Unet) were trained on the same data set for comparison. For crack images under different conditions, Att-Unet achieved better accuracy, precision, and F1 scores than FCN and Unet. In addition, Att-Unet showed higher feature extraction accuracy and better generalization ability in the crack segmentation task.
Keywords
Introduction
External loads and internal material defects can cause structural damage and stiffness degradation in building structures (Yamaguchi and Hashimoto, 2009; Behmanesh et al., 2015). One of the earliest signs of stiffness degradation is the occurrence of cracks (Abdeljaber et al., 2017; Ren et al., 2020), which not only reduce the structural stiffness but also expose the reinforcement to air, leading to continuous corrosion of the reinforcement and a reduction in strength. Eventually, material damage and structural deformation occur. Cracks threaten the safe use of the structure and can lead to significant losses or even complete collapse (Nhat-Duc et al., 2018). Therefore, research on crack detection and identification has high scientific and practical value for the health monitoring of building structures (Abdeljaber et al., 2017; Cha et al., 2017; Liu and Yang, 2017; Spencer et al., 2019).
Traditional structural crack identification and detection are usually carried out manually. However, manual detection faces the following problems: (1) It is time-consuming and inefficient. (2) The results are strongly influenced by subjective factors, and the error rate is relatively high because missed and false detections are common. (3) When cracks develop, structural safety is significantly reduced, so the crack inspector is exposed to a dangerous environment. Therefore, traditional manual crack identification no longer meets the requirements of structural monitoring (Nhat-Duc et al., 2018; Arena et al., 2014; Sattar et al., 2018; Ren et al., 2020).
In recent years, deep learning has driven the rapid development of artificial intelligence and has gradually penetrated various industries, including engineering. The identification of structural cracks using computer vision has become a research hotspot in structural health monitoring (Chen and Jahanshahi, 2018; Li et al., 2018; Fan et al., 2019). Arena et al. (2014) proposed the theory of crack image quantization. The recognition and semantic segmentation of structural cracks have been carried out with remarkable results using image binarization, morphological processing, segmentation, filtering, and other methods. The FCN network proposed by Dung and Anh (2018) for concrete crack segmentation was verified on a labeled concrete crack data set and achieved sound accuracy. Cha et al. (2017) built a crack classifier using a convolutional neural network, trained on 40,000 cracked and non-cracked image samples. However, this model recognizes cracks only at the level of image patches and does not reach pixel-level semantic segmentation. Huang et al. (2018) successfully applied the FCN algorithm to crack detection and identification in subway tunnels, where it identified cracks quickly and accurately. Based on multi-scale neighborhood features and crack pixel features, Ai et al. (2018) proposed a pixel-level pavement crack detection method. Ye et al. (2019a, 2019b) proposed the CI-NET model for structural crack identification based on FCN theory, and a concrete bridge loading test showed the model to be reasonably practical.
Wang et al. (2020) put forward a new framework for crack monitoring based on a modified Alexnet, which helps to overcome sample imbalance in the crack data set.
The imbalance of positive and negative samples is a major difficulty in crack semantic segmentation. FCN and its variant networks have made significant progress in crack segmentation research. In these frameworks, features are extracted by a multi-level cascade of CNNs (convolutional neural networks) during crack recognition. As the network deepens, similar low-level features are repeatedly extracted by multiple cascade levels, which directly leads to a multiplication of learnable parameters and an enormous waste of computing resources (Ren et al., 2020; Oktay et al., 2018). Therefore, FCN and Unet show clear limitations when applied to the crack semantic segmentation task. In this study, an improved convolutional neural network with an attention mechanism (Att-Unet) is proposed and applied to crack semantic segmentation. Compared with the current mainstream semantic segmentation networks FCN and Unet, the Att-Unet model was found to have stronger crack feature extraction ability and better robustness.
Data
The crack image data used to train the semantic segmentation model came from open-source public data sets on GitHub (Shi et al., 2016; Yang et al., 2019; Eisenbach et al., 2017; Amhaz et al., 2016; Zou et al., 2012). The total number of crack images was 1695, with each image corresponding to a binary label image. In the label image, crack pixels were marked by 0 and background pixels were marked by 1. The original image data set was divided into a training set and a validation set in a ratio of 8:2. The training set and the validation set were used to train the model's parameters and to evaluate the accuracy of the model, respectively. In the image preprocessing stage, the original images were normalized. Image normalization included image size normalization and pixel normalization. Since the size of the data set images was not uniform, all images were resized to 512 × 512 pixels. To improve the convergence speed and accuracy of the model and to prevent gradient explosion, pixel normalization was performed using Formula 1 (Li et al., 2019); the normalized pixel values lie in [0,1]. To enhance the generalization ability of the neural network model and to prevent overfitting, data augmentation was implemented by horizontal flip, vertical flip, brightness, and rotation transformations. For the rotation transformation, the rotation angle was a random value in [−45°,45°]. Examples of normalization and data augmentation are shown in Figure 1. The total number of samples after data augmentation was 6780.
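The preprocessing steps described above can be illustrated with a short sketch. It is not the authors' pipeline: it uses plain Python lists instead of an image library, and it assumes that pixel normalization divides 8-bit intensities by 255, which is only one common way of reaching the range [0,1] and may differ from Formula 1. All function names are hypothetical. Rotation is represented only by drawing the random angle, since the interpolation itself is library-specific.

```python
import random


def normalize_pixels(image, max_value=255.0):
    """Scale raw intensities into [0, 1]. Dividing by 255 is an assumption."""
    return [[pixel / max_value for pixel in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Mirror the image top to bottom."""
    return [list(row) for row in reversed(image)]


def adjust_brightness(image, factor):
    """Scale normalized intensities by `factor` and clip the result to [0, 1]."""
    return [[min(1.0, max(0.0, pixel * factor)) for pixel in row] for row in image]


def random_rotation_angle(rng, limit=45.0):
    """Draw a rotation angle in degrees, uniformly from [-limit, limit]."""
    return rng.uniform(-limit, limit)


raw = [[0, 51], [102, 255]]
image = normalize_pixels(raw)
flipped = horizontal_flip(image)
angle = random_rotation_angle(random.Random(0))
```

Geometric transforms (flips and rotation) would have to be applied identically to the image and to its label image so that the pixel labels stay aligned; the brightness transform applies to the image only.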

Figure 1. Training data normalization and augmentation.
Where:
Method
Structure of convolutional neural networks
A convolutional neural network comprises three types of layers with different functions: convolutional layers, activation layers, and pooling layers. The function of the convolutional layers is mainly to extract the features of the object. As the number of layers increases, the features learned by the convolutional layers are continuously strengthened, so that deeper layers learn the object's advanced features. The lower convolutional layers mainly extract the details and overall structure of the object, while the features extracted by the higher convolutional layers mainly carry the semantic information of the image (Ren et al., 2020). Taking the convolution of a single-channel image as an example, the convolution expression is shown in Formula (2), and the schematic diagram of the convolution principle is shown in Figure 2. A filter slides over the input image with a specific stride; at each position, the filter weights are multiplied element-wise with the corresponding image values and summed, which gives the value at the corresponding position of the output image.
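The sliding multiply-and-sum operation described above can be sketched for a single channel as follows. This is a minimal illustration that assumes no padding and uses the unflipped-kernel (cross-correlation) form common in deep learning frameworks; the function name and argument names are hypothetical and are not the notation of Formula (2).

```python
def conv2d_single_channel(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the output map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            # Element-wise product of the kernel and the image window, then sum.
            for m in range(k_h):
                for n in range(k_w):
                    total += image[i * stride + m][j * stride + n] * kernel[m][n]
            row.append(total)
        output.append(row)
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
feature_map = conv2d_single_channel(image, kernel)  # [[6.0, 8.0], [12.0, 14.0]]
```

With an input of height H, a kernel of height k, and stride s, the output height is (H − k) // s + 1, so a larger stride yields a smaller output map.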

Figure 2. Convolution operation principle of the image.
Where:
The activation layer introduces a nonlinear activation function so that the model can represent nonlinear relationships. This nonlinearity is essential for a convolutional neural network, and the activation layer therefore greatly enhances the generalization ability of the model. This study mainly uses two activation functions: (1) ReLU (Rectified Linear Unit), as shown in Formula (3); (2) the Sigmoid function, as shown in Formula (4).
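For reference, the two activation functions have the standard definitions ReLU(x) = max(0, x) and Sigmoid(x) = 1 / (1 + e^(−x)). A minimal scalar sketch is given below; the branch in `sigmoid` is an implementation detail added here to avoid floating-point overflow for large negative inputs and is not part of the definition.

```python
import math


def relu(x):
    """Rectified linear unit: pass positive values through, zero out the rest."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), evaluated in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x < 0, so exp(x) cannot overflow
    return z / (1.0 + z)


activations = [relu(v) for v in (-2.0, 0.0, 3.0)]  # [0.0, 0.0, 3.0]
probability = sigmoid(0.0)                         # 0.5
```

ReLU leaves positive responses unchanged, whereas the Sigmoid function maps any real value into (0, 1), which makes it suitable for producing weights or probabilities.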
The functions of the pooling layer are as follows: (1) Characteristic invariance. (2) Dimension reduction of features. Each pixel in the down-sampling results represents the corresponding sub-region information in the original image, which is equivalent to the dimension reduction of the image information. (3) The pooling layer effectively inhibits the occurrence of model overfitting and improves the generalization ability of the model. In Figure 3, the pooling operation principle is demonstrated by taking the commonly used maximum pooling as an example.
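The down-sampling performed by maximum pooling can be sketched as follows. The 2 × 2 window with stride 2 is an assumed, commonly used configuration, and the function name is hypothetical.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the maximum of each `size` x `size` window of the feature map."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [[1, 3, 2, 4],
               [5, 6, 7, 8],
               [9, 2, 1, 0],
               [3, 4, 5, 6]]
pooled = max_pool2d(feature_map)  # [[6, 8], [9, 6]]
```

Each output value summarizes one 2 × 2 sub-region, so the 4 × 4 map is reduced to 2 × 2 while the strongest response in every sub-region is preserved.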

Figure 3. Principle of MaxPooling.
Semantic segmentation model
As mentioned above, a convolutional neural network (CNN) can accurately classify whole images but is not designed to classify every pixel in an image. Therefore, the fully convolutional network (FCN) was proposed (Long et al., 2015), which removes the fully connected layers of the CNN and adds skip connections. FCN classifies each pixel of the image. Moreover, its skip connections effectively combine the semantic information of the deep layers with the location information of the shallow layers to achieve more accurate prediction.
FCN has led to the development of many variants for semantic segmentation, among which Unet is the most successful framework. Ronneberger et al. (2015) obtained the symmetric Unet network by improving FCN. The Unet architecture looks like a “U” and consists of three parts: encoding block, bottleneck, and decoding block. The encoding block of Unet is similar to the classic VGG16, and the decoding block contains up-sampling layers and skip connections. The bottleneck is a transition structure from the encoding block to the decoding block (Ronneberger et al., 2015). A bottleneck is usually a module that first reduces the dimension of its input and then increases it again, which can achieve denoising and improve the accuracy of the model (Paszke et al., 2016; He et al., 2016). Unet inherits the advantages of FCN and, at the same time, enables the decoding structure to output more accurate pixel segmentation results by means of dense skip connections and up-sampling. The framework of Unet is shown in Figure 4. The horizontal number at the top of each map represents the channel number of the map, and the vertical number represents the map size.

Figure 4. The framework of Unet.
Attention mechanisms
The attention mechanism of deep learning can be traced back to the study of human vision. After receiving external visual information, humans transmit signals through neurons and analyze them in the brain, which focuses attention on the regions of interest and attenuates the attention weight of unrelated regions. When the attention mechanism is applied in deep learning, this processing of visual information is simulated by introducing an attention module: during feature extraction from the input image, the information weight of the region of interest is increased and attention to unrelated areas is suppressed. The calculation process of the attention gate (AG) module is shown in Figure 5. According to Figure 5, the attention module generates an attention weight value
where

Figure 5. The framework of attention gate.
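The computation of an attention weight can be sketched for a single pixel as follows. The sketch follows the general additive attention gate of Oktay et al. (2018), in which the skip-connection features x and the gating signal g are linearly projected, summed, passed through ReLU, projected to a scalar, and squashed by a Sigmoid to give a weight in (0, 1) that rescales x. It is not the authors' implementation: all parameter names (`W_x`, `W_g`, `b`, `psi`, `b_psi`) are hypothetical, x and g are assumed to be at the same spatial resolution, and the 1 × 1 convolutions are written as per-pixel matrix-vector products.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def attention_gate(x, g, W_x, W_g, b, psi, b_psi):
    """Return (alpha, gated_x) for one pixel.

    x: skip-connection feature vector, g: gating feature vector.
    alpha = sigmoid(psi . relu(W_x x + W_g g + b) + b_psi); gated_x = alpha * x.
    """
    hidden = [
        max(0.0, xs + gs + bias)
        for xs, gs, bias in zip(_matvec(W_x, x), _matvec(W_g, g), b)
    ]
    alpha = _sigmoid(sum(p * h for p, h in zip(psi, hidden)) + b_psi)
    return alpha, [alpha * value for value in x]


# Toy parameters chosen only for illustration.
alpha, gated = attention_gate(
    x=[1.0, 2.0], g=[0.5],
    W_x=[[1.0, 0.0]], W_g=[[2.0]], b=[0.0],
    psi=[1.0], b_psi=-2.0,
)
```

In this toy case the hidden response is 1.0 + 1.0 = 2.0 and the scalar before the Sigmoid is 2.0 − 2.0 = 0, so the weight is 0.5 and the skip features are halved. Pixels whose weight approaches 0 are suppressed, and pixels whose weight approaches 1 pass through almost unchanged.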
Establishing the Att-Unet model
The main problem in the crack semantic segmentation task is the imbalance of positive and negative samples. The purpose of introducing the attention mechanism is therefore to allow the model to learn to focus on the crack region while weakening the weight of unrelated regions. The model of Oktay et al. (2018) obtained good results when applied to medical pathological image segmentation. However, because the imbalance between positive and negative samples is more prominent, the crack segmentation task in this study differs significantly from that in medical imaging. To adapt to the particularity of crack segmentation, this study made two improvements: (1) a bottleneck was added to the Unet network to enhance the feature extraction ability of the crack segmentation model; (2) the resampling in the AG module was removed. The framework of the improved attention-mechanism Unet, referred to as Att-Unet, is shown in Figure 6. The pre-processed input image was 512 × 512 pixels with RGB color channels, and the output of the model was 512 × 512 pixels with 2 channels, one representing the crack and the other representing the background.

Figure 6. The framework of Att-Unet.
Model training
Computer parameters
Given a large amount of data, many parameters, and the complex structure of the Att-Unet model in this study, GPU acceleration was used to train the neural network model. The computer parameters used in this task are shown in Table 1:
Table 1. Computer parameters.
Hyperparameters
The loss function measures the closeness of the true values and the predicted values. During training, the value of the loss function is computed through forward propagation, and the obtained loss is then backpropagated so that the relevant parameters are updated according to the gradient descent algorithm. Crack segmentation on the data set used in this study is a binary classification problem for each pixel, so binary cross-entropy (BCE) was used as the loss function for training the model. The expression of binary cross-entropy is shown in Formula (8):
Where N is the number of pixels and y is the ground-truth class probability.
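Binary cross-entropy has a standard form: the negative mean, over the N pixels, of y·log(p) + (1 − y)·log(1 − p), where p denotes the predicted probability. The sketch below implements this standard form; the names `y_pred` and `eps` are assumptions introduced here, and the clipping by `eps` is a numerical safeguard against log(0) rather than part of the definition.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over all pixels.

    y_true: ground-truth probabilities (0 or 1 per pixel).
    y_pred: predicted probabilities in [0, 1], clipped to avoid log(0).
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(y_true)


uncertain = binary_cross_entropy([1, 0], [0.5, 0.5])  # log(2), about 0.693
confident = binary_cross_entropy([1, 0], [0.9, 0.1])  # smaller loss
```

The loss shrinks as the predicted probabilities approach the ground truth and grows without bound as a confident prediction becomes wrong, which is what drives the parameter updates during backpropagation.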
The Att-Unet model was trained with the SGD optimizer, with an initial learning rate of 0.0001, a weight decay of 0.00001, and a momentum of 0.99. The learning rate was adjusted every 20 training steps with a decay coefficient of 0.1. Considering GPU memory and image size, the number of samples in each batch was 6. The total number of training steps was set to Epoch = 100. Based on the above hyperparameter settings, the training process of the model is shown in Figure 7. Figure 7 indicates that the training process of the Att-Unet model was reasonable and that there was no overfitting, which suggests that the training parameters were reasonably selected.
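The learning-rate schedule and the optimizer update can be sketched as follows. Two assumptions are made: the "training steps" of the schedule are read as epochs (the text gives the total as Epoch = 100), and the momentum update uses the formulation found in PyTorch's SGD, which is one common variant. The source does not state which framework or formulation was used, and the function names are hypothetical.

```python
def step_learning_rate(epoch, initial_lr=1e-4, step_size=20, gamma=0.1):
    """Multiply the learning rate by `gamma` once every `step_size` epochs."""
    return initial_lr * gamma ** (epoch // step_size)


def sgd_momentum_step(weight, grad, velocity, lr, momentum=0.99, weight_decay=1e-5):
    """One scalar SGD update with L2 weight decay and momentum.

    Assumed formulation:
        grad <- grad + weight_decay * weight
        velocity <- momentum * velocity + grad
        weight <- weight - lr * velocity
    """
    grad = grad + weight_decay * weight
    velocity = momentum * velocity + grad
    return weight - lr * velocity, velocity


# Learning rate at the start of each 20-epoch stage over 100 epochs.
schedule = [step_learning_rate(epoch) for epoch in range(0, 100, 20)]

weight, velocity = 1.0, 0.0
weight, velocity = sgd_momentum_step(weight, grad=0.5, velocity=velocity,
                                     lr=step_learning_rate(0))
```

Under these assumptions the learning rate falls by a factor of ten at epochs 20, 40, 60, and 80, from 1e-4 to about 1e-8 in the final stage.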

Figure 7. Process of model training.
Evaluation
By comparing the true value with the predicted value of the model, the prediction result for each pixel of the image falls into one of four types: true-positive (TP), false-positive (FP), true-negative (TN), and false-negative (FN). Based on these four per-pixel outcomes, the evaluation indexes PA, IOU, precision, recall, and F1 score were introduced to evaluate the segmentation ability of the model.
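The five indexes can be computed from the four pixel counts as in the sketch below. It uses the usual textbook definitions, which may differ in detail from the authors' implementation (for example, IOU is computed here for the crack class only rather than averaged over classes). The crack class is treated as positive, and the default `positive=0` follows the label convention stated in the Data section (crack pixels marked 0). Function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=0):
    """Count TP, FP, TN, FN over flattened label lists."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def _safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def segmentation_metrics(tp, fp, tn, fn):
    """Pixel accuracy, IoU, precision, recall, and F1 for the positive class."""
    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    return {
        "PA": _safe_div(tp + tn, tp + fp + tn + fn),
        "IoU": _safe_div(tp, tp + fp + fn),
        "precision": precision,
        "recall": recall,
        "F1": _safe_div(2 * precision * recall, precision + recall),
    }


y_true = [0, 0, 1, 1, 1, 0]  # 0 = crack, 1 = background
y_pred = [0, 1, 1, 1, 0, 0]
scores = segmentation_metrics(*confusion_counts(y_true, y_pred))
```

Because background pixels dominate crack images, PA can stay high even when many crack pixels are missed, which is why IOU, precision, recall, and F1 are reported alongside it.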
Because GPU training is highly parallel, the Nvidia libraries do not guarantee reproducibility unless random seeds are set. To rule out chance effects from a single run, the test was repeated five times with all parameters unchanged. In each test, the Att-Unet model obtained the highest score. Table 2 shows the average scores of the five repeated tests for the different models. As can be seen from the results, the IoU, PA, Precision, Recall, and F1 of the Att-Unet model on the test set were 73.65%, 98.01%, 75.94%, 67.77%, and 65.12%, respectively. The Att-Unet model performed better than FCN and Unet. It should be noted that the data set has a great impact on the results of a model (Ye and Dai, 2020; Wu et al., 2019). The crack image data set used in this study has the following characteristics: (1) many crack images were not clear; (2) the imbalance between positive and negative samples was prominent; (3) there were many crack images under strong interference conditions. Therefore, although the objects of the semantic segmentation tasks were all crack images, the model's evaluation scores in this study differed from those reported in the literature (Zhang et al., 2020; Choi and Cha, 2020; Dung and Anh, 2018). By introducing the attention mechanism into crack semantic segmentation, Att-Unet achieves better performance than Unet and is more suitable for data with imbalanced positive and negative samples.
Table 2. Comparison of crack segmentation accuracy in different models.
The PR curves of the different models are shown in Figure 8. The PR curve reflects the relative merits of a model to a certain extent: the area enclosed by the curve and the coordinate axes is positively correlated with the expressive ability of the neural network model. Figure 8 shows that Att-Unet had the best expressive ability, followed by Unet and FCN. This is also confirmed by the intersection point of the line y = x with each PR curve: the larger the coordinate values of the intersection point, the stronger the model's ability to extract features. The single-image processing time of Att-Unet was 0.078 s, which is 15.19% higher than that of FCN and 10.13% higher than that of Unet. Given the excellent performance of Att-Unet in crack segmentation tasks, this small disadvantage in time consumption is acceptable, and it can be reduced with improved computer hardware.
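A PR curve and its intersection with the line y = x can be obtained by sweeping a decision threshold over the predicted crack probabilities, as in the sketch below. The threshold grid, the function names, and the convention of reporting precision as 1.0 when nothing is predicted positive are assumptions made for illustration. Ground truth is given as booleans (`True` for a crack pixel) for readability.

```python
def precision_recall_curve(scores, is_crack, thresholds):
    """Return (recall, precision) points, one per threshold.

    scores: predicted crack probability per pixel.
    is_crack: ground-truth booleans per pixel.
    """
    points = []
    for threshold in thresholds:
        tp = sum(1 for s, c in zip(scores, is_crack) if s >= threshold and c)
        fp = sum(1 for s, c in zip(scores, is_crack) if s >= threshold and not c)
        fn = sum(1 for s, c in zip(scores, is_crack) if s < threshold and c)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((recall, precision))
    return points


def break_even_point(points):
    """Pick the sampled point closest to the line precision = recall."""
    return min(points, key=lambda point: abs(point[0] - point[1]))


scores = [0.9, 0.8, 0.4, 0.2]
is_crack = [True, False, True, False]
curve = precision_recall_curve(scores, is_crack, thresholds=[0.5, 0.3])
balance = break_even_point(curve)
```

A model whose curve lies further toward the upper right has a break-even point with larger coordinates, which corresponds to the comparison made between the curves in Figure 8.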

Figure 8. Comparative analysis of P-R curve.
The prediction results of the different models on the same test set are shown in Figure 9, which allows a more intuitive comparison of their crack recognition ability. In Figure 9, there are ten sets of test set identification results under different conditions. Figure 9(1)–(3) are crack segmentation results under normal visual conditions, Figure 9(4) and (5) are crack recognition results under interference, Figure 9(6) and (7) are crack segmentation results with inconspicuous features, and Figure 9(8) shows crack segmentation results under low brightness. In Figure 9(1)–(3), the predictions of the Att-Unet model are very close to the label images, which reflects Att-Unet's reliable crack segmentation ability. In Figure 9(4)–(8), the conditions for crack recognition are more demanding, yet Att-Unet still performs better than the other models. Overall, Att-Unet showed the best crack segmentation performance, followed by Unet, with FCN being the weakest. In particular, the predictions of FCN were extremely unreliable where the cracks were barely visible, while the Unet segmentation results were not ideal under strong interference. The Att-Unet model shows good robustness, strong generalization ability, and accurate crack segmentation results, enabling it to complete crack segmentation and detection tasks reliably.

Figure 9. Comparison of crack segmentation results in different models: (1)–(3) are crack segmentation results under normal visual conditions; (4) and (5) are crack recognition results under interference; (6) and (7) are crack segmentation results with inconspicuous features; (8) shows crack segmentation results under low brightness.
Conclusion
In this study, a fully convolutional neural network based on the attention mechanism was proposed, comprising an encoding module, a decoding module, and an AG (Attention Gate) module. By introducing the attention mechanism, the Att-Unet network alleviates the imbalance of positive and negative samples to a certain extent. Att-Unet shows better generalization ability and robustness than the mainstream semantic segmentation networks FCN and Unet. In addition, Att-Unet scored highest in the evaluation system consisting of PA, IOU, Precision, Recall, and F1. With the rapid development of deep learning, various fields are integrating deep learning algorithms to innovate. The Att-Unet model based on the attention mechanism was implemented in this study, and excellent results were obtained on the crack data set. Therefore, this model can serve as a new method for crack identification in structural health monitoring in the engineering field.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The study was jointly supported by the National Science Foundation of China (Grant No. 51768033)
