Abstract
Concrete surface crack detection based on computer vision, specifically via a convolutional neural network, has drawn increasing attention for replacing manual visual inspection of bridges and buildings. This article proposes a new framework for this task and a sampling and training method based on active learning to treat class imbalances. In particular, the new framework includes a clear definition of two categories of samples, a relevant sliding window technique, data augmentation and annotation methods. The advantages of this framework are that data integrity can be ensured and a very large amount of annotation work can be saved. Training datasets generated with the proposed sampling and training method not only are representative of the original dataset but also highlight samples that are highly complex, yet informative. Based on the proposed framework and sampling and training strategy, AlexNet is re-tuned, validated, tested and compared with an existing network. The investigation revealed outstanding performances of the proposed framework in terms of the detection accuracy, precision and F1 measure due to its nonlinear learning ability, training dataset integrity and active learning strategy.
Introduction
Surface cracks, which adversely influence the safety and durability of structures, are significant early indicators of concrete structural damage. Manual visual inspection is often conducted to examine crack characteristics, such as existence, width, distribution and growth. These characteristics play a critical role in structural maintenance and management. However, manual visual inspection is costly, time-consuming, subjective and sometimes dangerous (Liu et al., 2016; Li et al., 2017; Kim et al., 2018; Li et al., 2018; Shi et al., 2016; Modarres et al., 2018; Xu et al., 2018). To overcome these issues, automatic crack detection methods based on image processing techniques (IPT) have been proposed and studied as a partial replacement for manual visual inspection of structures and buildings (Zou et al., 2012; Yang and Wang, 2012; Yeum and Dyke, 2015; Wang et al., 2018). However, these methods rely heavily on hand-crafted features owing to the inherent shortcomings of IPT, which constrains their widespread adoption.
Recently, the convolutional neural network (CNN) has been applied to image classification and object detection tasks (Khan et al., 2018) and is widely employed in facial recognition (Ko, 2018), natural language processing (Shen et al., 2014), and even the board game Go (Silver et al., 2016). A CNN exhibits great potential for challenging computer vision applications owing to its distinguishing features, such as local connections, shared weights and pooling (LeCun et al., 2015), and its automatic feature extraction. A CNN has also been found capable of effectively detecting concrete surface cracks and has drawn increasing attention compared with other methods (Cha and Choi, 2017; Cha et al., 2018; Dorafshan et al., 2018; Yang et al., 2018). More broadly, deep learning-based techniques are extensively studied for applications in structural health monitoring (Fan et al., 2019a; Ye et al., 2019a).
Concrete crack detection methods based on deep learning are often designed by means of image classification techniques. Cha and Choi (2017) proposed a vision-based method using a CNN for detecting concrete cracks. The network included three convolutional layers, two pooling layers, rectified linear unit (ReLU) activation, dropout and batch normalization layers, followed by one fully connected classifier layer. The data bank consisted of 332 raw images, among which 277 images were used for training and validation and 55 for testing. The study showed that the network achieved a high degree of accuracy and robustness against distortion, strong light spots and shadows. Dorafshan et al. (2018) compared, for the first time, the relative performance of common edge detectors and CNNs for image-based concrete crack detection, using a single dataset that contained 19 high-definition images. Applying transfer learning to AlexNet (Krizhevsky et al., 2012) showed promising results, with a test accuracy of 98% and the ability to detect cracks wider than 0.04 mm.
Concrete surface crack detection is a typical application of object detection. In image classification tasks, particularly in multi-class scenarios, determining the optimal size of the sliding window is critical and challenging because images and objects may have various sizes and scales. Object detection techniques avoid this problem by searching over thousands of region proposals. Cha et al. (2018) proposed an autonomous visual inspection method based on a faster region-based CNN. Five types of damage were considered: concrete cracks, two levels of steel corrosion (medium and high), bolt corrosion and steel delamination. The method was promising for its remarkably fast test speed, and hence, its framework could be applied to real-time damage detection via videos.
Crack detection can also be regarded as a semantic segmentation application, as pointed out by Khan et al. (2018). The advantage of segmentation methods is that they provide pixel-level label prediction and hence localize cracks more accurately. Ye et al. (2019b) proposed a fully convolutional network (FCN) called Ci-Net for structural crack identification. This network was trained with an online dataset labelled at the pixel level and was validated using images collected in an indoor concrete beam test. Investigations showed that this network performs better than traditional edge detection methods in structural damage detection. Yang et al. (2018) obtained crack topology, crack length, maximum width and mean width by applying an FCN, which indicated that an FCN was feasible and suitable for crack identification and measurement.
These references indicate that image classification is the foundation of crack detection and localization based on object detection methods (Cha et al., 2018; Khan et al., 2018). Meanwhile, although new networks, for example, FCN, are feasible, given the extreme similarity between cracks and crack-like non-crack disturbances, image classification-based methods are more suitable for crack detection tasks due to their moderate amount of dataset collection work and relatively high testing accuracy. Consequently, this article concentrates on the crack detection framework using image classification techniques.
Conventional crack image classification methods have shortcomings. First, edge crack samples are often discarded from the training dataset. This practice compromises the integrity of the dataset, and hence, the trained model tends to classify at random those validation and test samples that resemble the discarded training samples. Second, raw images are often cropped into samples that are manually annotated and separated into different categories (Cha and Choi, 2017; Dorafshan et al., 2018). This procedure entails a large amount of tedious annotation work, particularly because data augmentation may change sample labels and thus require re-annotation. This article proposes a new CNN framework for crack detection that avoids these shortcomings.
Another issue is that the ratio of positive to negative samples is significantly smaller than 1 in the detection task. As a result, the trained classifier focuses more on the negative category, which has the larger number of samples, while ignoring the positive category with fewer samples. Nevertheless, a category with a small number of samples, for example, cracks, plays a critical role in the training process. In general, the common solutions to the imbalance problem are over-sampling and under-sampling. For small-size categories, more samples are obtained by replication or data augmentation, while for large-size categories, a portion of the samples is removed by under-sampling. Zhong et al. (2016) proposed a label shuffling method to solve such problems, which may result in overfitting for small-size datasets and long training times for large-size datasets. Inspired by active learning, this article conceives a sampling and training method to resolve the class imbalance issue.
This article studies concrete surface crack detection based on image classification techniques using a CNN, proposes a new CNN framework for this task and conceives a sampling and training method based on cross entropy ranking to address the training class imbalance issue. To be more specific, this article is organized as follows: ‘Crack detection based on a CNN’ section presents our new framework in detail, including the definition of two categories of samples based on the sliding window technique, and proposed data annotation and augmentation method; ‘Sampling and training based on active learning’ section elaborates on the class imbalance issue and proposes a remedy via a new sampling and training method. The framework and sampling method are described and evaluated in ‘Training and validation’ section; ‘Testing and comparative study’ section describes the testing results of fine-tuned models and comparative study results between two frameworks; ‘Conclusion’ section concludes the article.
Crack detection based on a CNN
We first provide definitions of centre crack samples (CCS) and non-centre crack samples (NCCS), which are the basis for the sliding window design technique. With these definitions, all samples that are cropped from raw images are placed in a specific category, and thus, integrity of the dataset is ensured. Moreover, a data annotation and augmentation method applied to crack detection is presented with the purpose of saving labelling cost.
Category definition
Surface cracks with a clean background are easily detected by the CNN classifier. However, in real-world situations, crack detection is often influenced by various disturbances (Kim et al., 2018), including water stains, shadows, potholes and stripes. Figure 1 demonstrates typical samples of various types, where the disturbances in Figure 1(c) to (f) exhibit extremely similar features to the real cracks in Figure 1(b). These crack-like non-cracks often render the detection a challenging task. Precisely detecting concrete surface cracks in samples that are contaminated by a wide range of disturbances is the ultimate objective of this research field. To achieve this objective, one first needs to determine the number of categories into which the samples should be divided. Separating the samples into many categories results in a significant workload: a large number of samples must be collected and labelled for each category, and training a more complicated network model takes longer. These steps are unnecessary, even from an accuracy perspective. The task goal is to detect surface cracks, which means that the objective is achieved once cracks are effectively identified and accurately located.

Figure 1. Typical samples of various types and categories: (a) intact, (b) crack, (c) paint, (d) attachment, (e) pockmark, (f) shadow, (g) centre crack samples (CCS) and (h) non-centre crack samples (NCCS).
The specific types of disturbances are outside the scope of our concern. From this perspective, one can divide collected samples into two categories, namely, samples with and without cracks. Nevertheless, in the category of samples without cracks, an adequate number of samples with various disturbances is still required for the neural network to capture discriminative disturbance features. This issue is resolved through data augmentation, as illustrated in the following subsections. To give edge crack samples an unambiguous category, we define CCS by introducing the concept of a sample centre region, which is the area at the sample centre enclosed by a square whose side is 1/2 of the entire sample side length. Samples in which a crack intersects the centre region are placed in the CCS category; all other samples belong to the NCCS category. Evidently, NCCS contain intact samples, edge crack samples and samples with various disturbances. Figure 1(g) and (h) show typical samples of CCS and NCCS with the centre region square depicted in white. One should note that the square is depicted only to clearly describe the class definition; no sample used for training, validation or testing contains such a square.
Studies suggest that cracks distributed on sample edges are usually difficult to effectively detect (Cha and Choi, 2017; Wang et al., 2018). Moreover, an edge crack of a sample is often at the centre of an adjacent sample when raw images are cropped using the sliding window technique with a stride that is shorter than the sample length, where the position of the adjacent sample determines the crack location more accurately. As a result, one can classify edge crack samples into the category of disturbance samples, which raises the issue of a clear definition of these samples.
In conventional crack detection methods, edge crack samples, whose short crack lengths make it difficult to capture discriminative sample features, are often discarded because they are prone to misclassification. However, discarding these negative samples may cause misclassification problems during network validation and application, where the network may encounter samples similar to those discarded prior to training. This statement is confirmed by a closer investigation in the ‘Comparative study’ section. In our framework, each sample has a clear category definition and, consequently, no training samples are subjectively discarded. Meanwhile, cracks in CCS are usually longer than 1/4 of the sample side length, which means that the shortest centre crack length is constrained, and hence, misclassification due to a very short crack length is mitigated. In summary, clear definitions of CCS and NCCS reduce annotation subjectivity and misclassification probability and therefore help ensure training dataset integrity and trained model accuracy.
Sliding window
Similar to the convolution kernel, a sliding window sweeps across raw images to obtain corresponding samples. In crack detection, the stride is often set as 1/2 of the side length of the sample, which increases the possibility that cracks appear in the image sample centre. However, by doing so, most of the raw image is scanned four times, and the category of the scanned area needs to be reasonably determined based on classification results of the four samples.
When the sliding window technique is applied in combination with the definition of CCS and NCCS, the category of any raw image area depends only on the sample in which that area forms the centre region. Figure 2 shows the principle of the sliding window technique. When the window moves along the horizontal direction from W1 to W2 with a stride equal to 1/2 of the side length of the sample, neither gaps nor overlaps occur between the centre regions of W1 and W2. The same holds in the vertical direction from W1 to W3. Since the sliding window stride equals the centre region side length, the centre regions of adjacent samples do not overlap. Consequently, any area of the raw image occurs as the centre region of exactly one sample. Contrary to the conventional method, the category of the centre region determined by this method is not influenced by the categories of adjacent samples.

Figure 2. Sliding window technique with a non-overlapping centre region: (a) sample, an image patch of 256 × 256 pixels with a centre region of 128 × 128 pixels, and (b) sliding window technique, conducted horizontally (W2) and vertically (W3) from the starting point (SP).
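The tiling property described above can be checked with a short sketch. It assumes the 256-pixel window and 128-pixel stride stated in the text; the function names and the 1024-pixel axis length are hypothetical and serve only as illustration.

```python
def window_origins(image_size, window=256, stride=128, start=0):
    """Top-left coordinate of every full window along one image axis."""
    return list(range(start, image_size - window + 1, stride))


def centre_region(origin, window=256):
    """Half-open interval [lo, hi) covered by the centre region along one axis.

    The centre region has half the window side, so it starts a quarter of the
    window after the window origin.
    """
    quarter = window // 4
    return origin + quarter, origin + window - quarter


# Hypothetical 1024-pixel axis (not a value from the article).
origins = window_origins(1024)
regions = [centre_region(o) for o in origins]

# Distance between the end of one centre region and the start of the next:
# zero everywhere means neither gaps nor overlaps.
gaps = [regions[k + 1][0] - regions[k][1] for k in range(len(regions) - 1)]
print(origins)
print(regions)
print(gaps)
```

Because the stride equals the centre region side length, every entry of `gaps` is zero; the same argument applies independently to the vertical axis.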
Annotation
Prior to network training, all training samples need to be annotated; that is, the category of each training sample has to be determined. One annotation method is to first crop collected raw images into small pieces and then manually label them. This method suffers from several weaknesses. For example, certain samples are difficult to categorize by visual inspection based on local information alone. Moreover, data augmentation, such as translation and rotation of samples, may change the sample categories. As a result, the augmented samples need to be annotated again, which is a tedious and time-consuming task. Therefore, we apply a different annotation scheme to circumvent these drawbacks.
The annotation procedure is shown in Figure 3. To mark the ground truth of cracks, a new transparent image layer is created, on which we manually trace cracks in red with a width of 10 pixels; the new layer is then saved to obtain the ground truth. This process is illustrated as step 1 in Figure 3. Subsequently, in step 2, the sliding window sweeps across the raw image to crop it into small-size samples. Step 3 consists of cropping the ground truth into small samples with the same sliding window. To annotate each image sample, the centre region of its corresponding ground truth sample is searched for red pixels. If the number of red pixels is greater than a certain threshold, the image sample is assigned to the CCS category; otherwise, it is assigned to the NCCS category.

Figure 3. Annotation procedure.
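The labelling rule of the annotation procedure can be sketched as follows. This is a minimal illustration, assuming the ground truth crop is already available as a square boolean mask (true where a red pixel was traced); the function names, the toy 16 × 16 masks and the threshold value are hypothetical, since the article does not state the threshold.

```python
def label_sample(mask, threshold):
    """Return 'CCS' if the centre region holds more than `threshold` crack pixels."""
    side = len(mask)
    lo, hi = side // 4, side - side // 4   # centre region spans the middle half
    crack_pixels = sum(1 for row in mask[lo:hi] for value in row[lo:hi] if value)
    return "CCS" if crack_pixels > threshold else "NCCS"


def blank_mask(side):
    """Square mask without any traced crack pixels."""
    return [[False] * side for _ in range(side)]


# Toy 16 x 16 masks standing in for 256 x 256 ground truth crops.
centre_crack = blank_mask(16)
for col in range(16):
    centre_crack[8][col] = True    # horizontal crack through the middle

edge_crack = blank_mask(16)
for col in range(16):
    edge_crack[1][col] = True      # crack confined to the top edge

print(label_sample(centre_crack, threshold=4))
print(label_sample(edge_crack, threshold=4))
```

The first mask places eight crack pixels inside the centre region and is labelled CCS; the second places none there and is labelled NCCS, even though it contains a crack.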
One advantage of the proposed method is that it is easy to implement. Instead of annotating thousands of cropped image samples, one needs only to trace several hundred raw images based on each image’s overall information. Therefore, this method is highly cost-effective. The conventional method is time-consuming because image samples have to be labelled one by one even though most of them are intact. Conversely, the proposed method focuses on cracks directly and automatically ignores intact areas. Furthermore, the conventional method has to annotate all samples, including original and augmented samples. The proposed method requires tracing cracks only on original images, and the resulting ground truth can be scanned to determine the sample labels of augmented images. Given the very large number of augmented samples, the proposed method saves a great deal of annotation work compared with conventional methods.
Annotation is a critical step because it provides the training samples on which the trained model relies to determine the location of cracks. With the proposed annotation method, a high degree of correctness of the annotated training samples can be ensured within a much shorter period of time.
Data augmentation
Since the crack dataset differs greatly from the ImageNet dataset used for training AlexNet (Krizhevsky et al., 2012), a considerable number of raw training images are required to obtain a satisfactory fine-tuned network performance. When the number of images is limited, data augmentation can play a significant role in enriching the dataset. Particularly for the crack detection task, the number of positive samples (CCS) is relatively small, so data augmentation is a necessary step to enlarge the training dataset. On the other hand, the negative sample dataset (NCCS) contains a larger number of sub-categories, and it is quite likely that some of the sub-categories have an inadequate number of samples. To improve the generalization performance of the classifier, the negative set requires augmentation as well.
This article considers three data augmentation methods: translation, rotation and noise injection. Translating a cropped sample requires cutting off some of its edge areas, which can change the sample category; for example, a CCS becomes an NCCS when the crack is shifted to the sample edge. Rotating a cropped sample requires padding, which unnecessarily increases the complexity of the sample set, and may change the sample label as well. In this study, both translation and rotation are therefore performed on raw images, thereby avoiding these issues. Translation is achieved by randomly generating the starting point (denoted by SP in Figure 2) of the sliding window. In particular, unlike the conventional case where the sliding window sweeps from the upper left corner of the raw image, the starting point is offset from the image’s upper left corner by a distance uniformly distributed between 1 and 128 pixels (the upper bound equals the sliding window stride), and the sliding window technique is then employed to crop the image. The sample category is determined from the ground truth cropped with the same starting point. Rotation augmentation is also conducted on raw images, with angles uniformly distributed between -29° and 30°, and the edges are padded with black pixels. When the sliding window crops the rotated image, a sample is discarded if its number of black pixels reaches a specific threshold. This procedure ensures that padded regions are not introduced to the sample dataset. To introduce noise, salt-and-pepper noise is added to raw images with a noise density uniformly distributed in (0, 0.1).
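The augmentation steps can be illustrated with a small sketch. It assumes the 256-pixel window, 128-pixel stride and 1 to 128 pixel offset range given in the text; all function names, the 1024 × 1024 image size and the flat grey-value list used for the noise example are hypothetical, and the padding threshold is left as a parameter because the article does not state it.

```python
import random


def random_start(stride=128, rng=random):
    """Draw the starting point (SP): row and column offsets uniform over 1..stride."""
    return rng.randint(1, stride), rng.randint(1, stride)


def crop_origins(height, width, start, window=256, stride=128):
    """Top-left corners of all full windows for a given starting point."""
    row0, col0 = start
    return [(r, c)
            for r in range(row0, height - window + 1, stride)
            for c in range(col0, width - window + 1, stride)]


def keep_sample(padded_pixel_count, max_padded):
    """Rotation case: discard a crop once its black padding reaches the threshold."""
    return padded_pixel_count < max_padded


def salt_and_pepper(pixels, density, rng=random):
    """Replace roughly a fraction `density` of grey values by 0 or 255."""
    noisy = list(pixels)
    for k in range(len(noisy)):
        if rng.random() < density:
            noisy[k] = rng.choice((0, 255))
    return noisy


rng = random.Random(0)                       # fixed seed for repeatability
start = random_start(rng=rng)
origins = crop_origins(1024, 1024, start)    # hypothetical image size
noisy = salt_and_pepper([128] * 1000, density=rng.uniform(0.0, 0.1), rng=rng)
print(start, len(origins))
```

Applying the same `start` to the traced ground truth keeps image crops and label crops aligned, which is what allows labels of translated samples to be derived without re-annotation.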
To annotate translation- and rotation-augmented samples, the ground truth marked in the ‘Annotation’ section should undergo the same transformations as the augmented raw images. The sample labels can then be obtained by running a programme that implements the procedure in the ‘Annotation’ section on the transformed ground truth. In the new framework, the ground truth is marked only once by tracing the original raw images. Compared with the conventional method, the proposed method therefore saves a large amount of annotation work.
Network model
We apply the transfer learning method on top of AlexNet to train our neural network. The advantage of transfer learning is that we can benefit from the already trained network, which needs only to be fine-tuned for our dataset (Gao and Mosalam, 2018; Fu et al., 2013). At the same time, convergence is more easily attained, and the method is more robust to hyper-parameter settings. Dorafshan et al. (2018) revealed that the transfer learning mode outperformed both the fully trained mode and the classifier mode. AlexNet led to a renaissance of the CNN in computer vision owing to its techniques, such as dropout and data augmentation to reduce overfitting, ReLU activation to mitigate the vanishing gradients of sigmoid and tanh activations, and graphics processing units (GPUs) to greatly improve computational efficiency. This article aims to propose a crack detection framework based on AlexNet and to resolve the class imbalance issue with active learning, rather than to design new networks for crack detection.
When AlexNet was first proposed, it was divided into two parts that ran on different GPUs because the GPU capacity at that time was small. Instead, we train the network on a single GPU in this article. The network has five convolutional layers and three fully connected layers together with ReLU activation, maximum pooling, dropout and local response normalization. The convolutional layers of the CNN are often referred to as feature extraction layers, and the latter fully connected layers are applied to output the classification result. For the binary classification problem in this article, AlexNet’s output layer is modified to two outputs, namely, CCS and NCCS.
Sampling and training based on active learning
In reality, the number of intact samples is often much larger than that of crack samples. The positive samples in this article contain only CCS, while the negative samples contain extensive sub-categories, including intact samples. Therefore, in this study, the positive dataset is much smaller than the negative dataset. In the investigation by Cha and Choi (2017), the ratio of positive to negative samples likewise reaches 1:10. Models trained with an unbalanced dataset tend to focus on the category with the larger number of samples while ignoring categories with fewer samples, which can result in poor model performance. To address this class imbalance, we adjust the class ratio of the dataset, an approach chosen for its simplicity of implementation and its suitable performance in balancing training set categories.
To change the class ratio, the simplest method is to add several sample copies of the category with fewer samples, known as over-sampling. However, this process induces potential overfitting issues to this category, that is, too many positive samples that are exactly the same. A more reasonable method is to increase the number of positive training samples through the aforementioned data augmentation process. Consequently, this article re-trains the CNN model according to the AlexNet structure using the dataset consisting of augmented positive samples and original negative samples.
Moreover, the negative set in this article is not only large in size but also complex, that is, the negative set contains several sub-categories, and certain samples, for example, edge crack samples, are similar to positive samples. Therefore, without data augmentation, certain negative features that are relevant to fewer samples are likely to be ignored. To attain a satisfactory network, both positive and negative samples can be augmented. However, augmentation again requires solving the class imbalance problem. Although it is still possible to re-augment positive samples to achieve balance, too many samples result in great difficulties in network training in terms of GPU capacity, memory size and consumed time. An alternative solution is to sample the negative set to generate an efficient training set.
Inspired by the idea of active learning, this article proposes a sample screening method based on cross entropy ranking. Active learning is often used to actively query sample labels to reduce the tedious, costly work of sample annotation and to quickly improve the model performance (Feng et al., 2017; Fu et al., 2013; Wang et al., 2017). The philosophy is to query labels of the most informative samples to the current network and most representative samples in the dataset, thereby efficiently improving the network performance by learning from such samples (Fu et al., 2013). Similarly, this article selects the most informative negative samples to the current network and compiles a training set in combination with positive samples to avoid training the network with a large amount of data.
In active learning, entropy is used to measure the amount of information of samples contributing to the network (Wang et al., 2017). The higher the entropy, the more uncertain the sample category is within the network, and the more necessary it is to learn from this high entropy sample. Therefore, a sample with higher entropy tends to be first queried for labels. For classification of m categories, the entropy of the i-th sample is defined as follows
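$$E_i = -\sum_{j=1}^{m} p_{ij} \log p_{ij}$$

where $p_{ij}$ is the probability, predicted by the network, that the $i$-th sample belongs to the $j$-th category.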
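The entropy measure described above can be sketched numerically. The snippet assumes the standard Shannon entropy of the predicted class probabilities and, for samples whose label is known, the usual cross entropy against a one-hot label; both function names and the example probabilities are hypothetical, and the exact formulas used by the authors may differ in notation.

```python
import math


def entropy(probabilities):
    """Shannon entropy of one sample's predicted class probabilities."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def cross_entropy(probabilities, true_index, eps=1e-12):
    """Cross entropy against a one-hot label: minus the log of the
    probability assigned to the true class (clamped to avoid log(0))."""
    return -math.log(max(probabilities[true_index], eps))


confident = [0.99, 0.01]   # network is nearly sure about the category
uncertain = [0.50, 0.50]   # network cannot decide between the two categories

print(entropy(confident), entropy(uncertain))
# For a negative sample (index 1 taken as NCCS here), a low NCCS probability
# gives a high cross entropy, marking the sample as hard for the current model.
print(cross_entropy([0.8, 0.2], true_index=1))
```

For two categories the entropy is largest, equal to log 2, when both probabilities are 0.5, which matches the statement that a higher entropy corresponds to a more uncertain sample category.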
The flowchart of the proposed method is shown in Figure 4. First, negative samples equal in number to the positive samples are randomly selected and combined with the positive samples to compile a training dataset. Owing to the random nature of their selection, the selected negative samples are representative of the original complete negative dataset; however, features that are shared by only a few samples are likely to be missed. AlexNet is re-tuned with this training dataset and then used to predict the categories of all negative samples. The cross entropy of each sample is evaluated to rank the samples, and the samples with the highest cross entropies are selected and added to the previous training set to form a new one. The new dataset thus contains both the randomly selected representative samples and highly complex samples. This process can be repeated until training converges and a favourable network model is obtained. The training, validation and testing described in the following sections demonstrate the effectiveness of the proposed sampling and training method in reducing the training dataset size and improving the network performance.

Figure 4. Training flowchart based on the proposed sampling method.
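The sampling loop in the flowchart can be outlined as below. This is only a sketch under stated assumptions: `fit` and `score` are hypothetical callables standing in for network fine-tuning and per-sample cross entropy evaluation, ranking is restricted to negatives not yet selected, and the toy data replace real images with numbers.

```python
import random


def select_negatives(positives, negatives, fit, score, rounds=2, seed=0):
    """Return the negative indices used in each successive training set.

    fit(positives, chosen_negatives) -> model       (hypothetical callable)
    score(model, negative_sample)    -> cross entropy (hypothetical callable)
    """
    rng = random.Random(seed)
    batch = len(positives)
    # Round 0: random negatives, as many as there are positives.
    selected = rng.sample(range(len(negatives)), batch)
    history = [list(selected)]
    for _ in range(rounds):
        model = fit(positives, [negatives[i] for i in selected])
        taken = set(selected)
        remaining = [i for i in range(len(negatives)) if i not in taken]
        # Rank the not-yet-selected negatives, hardest (highest score) first.
        remaining.sort(key=lambda i: score(model, negatives[i]), reverse=True)
        selected = selected + remaining[:batch]
        history.append(list(selected))
    return history


# Toy data: each negative is a number that directly plays the role of its
# cross entropy, so the "model" returned by fit is unused.
positives = ["p1", "p2", "p3"]
negatives = [0.10, 0.90, 0.30, 0.80, 0.20, 0.70, 0.40, 0.60, 0.50, 0.05, 0.95, 0.15]
history = select_negatives(positives, negatives,
                           fit=lambda pos, neg: None,
                           score=lambda model, sample: sample)
print([len(h) for h in history])
```

Each round adds one batch of the highest-scoring remaining negatives while keeping the earlier ones, so the positive to negative ratio moves from 1:1 to 1:2 and then 1:3.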
Training and validation
To examine the framework and sampling method and to train an applicable classifier, the proposed methods were carried out step by step as described in this section. The implementation environment was MATLAB R2018a (MathWorks, MA, USA) on a Lenovo workstation equipped with an Nvidia GeForce GTX 1080 Ti GPU.
Image acquisition
With the aid of a mobile phone camera, 125 raw images with and without cracks were taken from buildings in a residential area. Crack detection for these raw images was very complicated due to various crack-like non-crack disturbances, as shown in Figure 1, which were taken from these images. The image resolution was 3120 × 4060 pixels.
Datasets
The 125 images were divided into two groups, 35 for the validation set and 90 for the training set. All cracks were manually traced following the method in the ‘Annotation’ section to generate the ground truth. Without any data augmentation, the sliding window swept across 90 raw images for the training set, and created 4172 positive samples (CCS) and 56,433 negative samples (NCCS) with a resolution of 256 × 256 pixels. The ratio of positive versus negative samples was approximately 1:14. The sliding window was also applied to raw images for the validation set, and generated 1445 positive and 19,945 negative samples for the validation set. This process is demonstrated in Figure 5.

Figure 5. Flowchart of the generation of training set 1 and validation set.
Data augmentation was carried out on the raw images of the training set by means of the method described in the ‘Data augmentation’ section, yielding a total of 1350 large images. The sliding window technique was adopted to divide these images into samples of 256 × 256 pixels, thereby producing a total of 67,778 positive and 866,260 negative samples. This process is shown in Figure 5 as well. It should be noted that these samples could not be separated into a training set and a validation set afterwards, as such a separation would result in data leakage: augmented samples derived from the same raw image would be placed in both sets. Since such samples are highly similar, performance indicators of the validation set, for example, accuracy and loss, could not represent the actual generalization ability of the model. This is the reason why the 125 images were first divided into two groups and augmentation was implemented only on the images of the training set.
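The leakage-free ordering (split first, augment afterwards) can be sketched as follows. The counts 125, 35 and 1350 come from the text; the figure of 15 augmented images per training image is simply 1350 / 90 and is an inference, and the function name, random split and seed are hypothetical.

```python
import random


def split_then_augment(image_ids, n_validation, variants_per_image, seed=0):
    """Split raw images first, then create augmented variants of training images only."""
    rng = random.Random(seed)
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    validation = shuffled[:n_validation]
    training = shuffled[n_validation:]
    # Each augmented image is identified by (source image id, variant number).
    augmented = [(image_id, k)
                 for image_id in training
                 for k in range(variants_per_image)]
    return training, validation, augmented


training, validation, augmented = split_then_augment(range(125), 35, 15)
sources = {image_id for image_id, _ in augmented}
print(len(training), len(validation), len(augmented))
print(sources.isdisjoint(validation))
```

Because every augmented image inherits its source identifier from the training group, no variant of a validation image can enter the training set.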
Three training sets were considered for comparison. The first one, referred to as training set 1, contained the 67,778 augmented positive samples and original 56,433 negative samples, as shown in Figure 5. Clearly, data augmentation was applied to the positive set to effectively balance the class sample size.
The second and third training sets were generated by applying the proposed sampling method to the augmented negative dataset, namely, by screening samples according to cross entropy ranking. In particular, 67,778 samples randomly selected from the 866,260 negative samples, along with the 67,778 positive samples, formed an initial training set. With this training set, AlexNet was fine-tuned and utilized to predict the categories of all negative samples. The negative samples were ranked according to their cross entropy values, and the 67,778 samples with the largest cross entropy values were selected and added to the initial set to form a new training set. This new set was referred to as training set 2, as shown in Figure 6. AlexNet, re-tuned with this set, was again employed to rank the remaining negative samples, from which another group of 67,778 samples was chosen and integrated into training set 2 to produce training set 3. Thus, the ratios of positive to negative samples in these two sets were 1:2 and 1:3, respectively. It is worth emphasizing that the randomly selected negative samples remained in the new training sets. The advantage of retaining these negative samples was that they represented the feature distribution of the original negative dataset as a whole. At the same time, the negative samples selected according to their cross entropy values were highly complex but more informative. The new training sets, consisting of these two components, not only represented the original dataset but also highlighted samples that were highly complex.

Generation of training dataset 2.
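One selection round of the sampling method can be sketched as follows. This is a minimal illustration under stated assumptions: the network output is taken to be the predicted probability of the positive (crack) class, the cross entropy of a negative sample is computed as the negative logarithm of the probability assigned to the negative class, and samples that are already in the training set are excluded from the ranking. The function names and the toy values are hypothetical.

```python
import math

def cross_entropy_negative(p_positive, eps=1e-12):
    """Cross entropy of a sample whose true label is negative."""
    return -math.log(max(1.0 - p_positive, eps))

def select_hard_negatives(p_positive_by_id, already_selected, k):
    """Return ids of the k unselected negatives with the largest cross entropy."""
    candidates = [
        (cross_entropy_negative(p), sample_id)
        for sample_id, p in p_positive_by_id.items()
        if sample_id not in already_selected
    ]
    candidates.sort(reverse=True)  # largest cross entropy first
    return [sample_id for _, sample_id in candidates[:k]]

# Toy predictions: probability of "crack" assigned to six negative samples.
predictions = {"n0": 0.02, "n1": 0.90, "n2": 0.40, "n3": 0.75, "n4": 0.10, "n5": 0.60}
random_part = {"n0", "n2"}  # negatives drawn at random for the first round
hard_part = select_hard_negatives(predictions, random_part, k=2)
training_negatives = random_part | set(hard_part)
print(hard_part)  # ['n1', 'n3']
```

Repeating the call with the enlarged set as `already_selected`, after re-tuning the network, corresponds to the step from training set 2 to training set 3.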
Fine-tuning the model
In this study, the CNN was re-trained following the overall architecture of AlexNet using the three training datasets described above. The learning rate and mini-batch size were set to 0.001 and 256, respectively. To comply with the network input size, all samples were resized to 227 × 227 × 3 pixels before being fed into AlexNet. During the training process, the network performance was validated every 50 iterations with the validation set shown in Figure 5. Training was terminated if the validation accuracy did not improve for 40 consecutive validations.
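The stopping rule can be sketched as a framework-independent loop. The callables `train_step` and `validate` and the cap `max_iterations` are hypothetical placeholders introduced for illustration; only the validation interval of 50 iterations and the patience of 40 validations come from the text.

```python
def train_with_early_stopping(train_step, validate, validate_every=50,
                              patience=40, max_iterations=100_000):
    """Train until validation accuracy fails to improve for `patience` checks."""
    best_accuracy = float("-inf")
    checks_without_improvement = 0
    for iteration in range(1, max_iterations + 1):
        train_step(iteration)
        if iteration % validate_every != 0:
            continue
        accuracy = validate(iteration)
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break
    return iteration, best_accuracy

# Toy run: accuracy rises for 10 validations and then stays flat.
def fake_validate(iteration):
    return min(iteration // 50, 10) / 10.0

stopped_at, best = train_with_early_stopping(lambda i: None, fake_validate)
print(stopped_at, best)  # 2500 1.0
```

In the toy run the last improvement occurs at the 10th validation (iteration 500), so training stops 40 validations later, at iteration 2500.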
Training and validation results
Figure 7 compares the validation accuracies during the AlexNet fine-tuning process using the three training sets. For clarity, the networks re-tuned using training sets 1 to 3 are referred to as AlexNet 1 to 3, respectively. The validation accuracy fluctuated more severely when training set 1 was used, whereas training sets 2 and 3 resulted in stable validation accuracies. In addition, the validation accuracy of AlexNet 1 was lower than those of AlexNet 2 and 3. It should be stressed again that the same validation set was applied to the three cases; hence, the differences in validation accuracy are attributed to the characteristics that the models acquired by learning from different training sets. Models that had been re-trained with training sets 2 and 3 performed similarly. These results reveal the necessity of data augmentation on negative training sets and indicate that training sets compiled with the proposed method provided a better representation and a larger amount of information, which resulted in models that were more accurate and stable.

Comparison of the validation accuracy: (a) AlexNet 1, (b) AlexNet 2 and (c) AlexNet 3.
Figure 8 depicts the training and validation losses of the three cases. The training loss of AlexNet 1 was smaller than those of AlexNet 2 and 3 because a larger number of informative but complex samples were selected to build the latter two training sets. Particularly at the beginning of training, the training losses corresponding to training sets 2 and 3 were approximately twice as large as that corresponding to training set 1. However, the validation losses corresponding to training sets 2 and 3 were smaller and more stable than that corresponding to training set 1. These results indicated that training sets built according to the proposed sampling method enabled the model to acquire more features and to better predict the classification of the validation set. Therefore, by learning from the features of complex samples, these models outperformed the model that was trained with only positive augmented samples. In contrast to the 866,260 samples of the original negative set, sets 2 and 3 included only 135,556 and 203,334 negative samples, respectively, which indicates the efficiency of the proposed sampling method.
Precision and recall are two measures of network performance. However, they often trade off against each other; that is, when one measure is high, the other is often low. The F1 measure combines precision and recall and hence is more appropriate for evaluating network performance. AlexNet 2 and 3 provided more stable and higher F1 measures, which indicated a better performance than that of AlexNet 1. Although the best F1 measure was approximately 0.82, this result was considered favourable given the complexity and limited number of raw images. Owing to space limitations, the corresponding figure is not shown here.

Comparison of the training and validation losses: (a) AlexNet 1, (b) AlexNet 2 and (c) AlexNet 3.
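For reference, the relation between precision, recall and the F1 measure discussed above can be written as a short sketch using the standard definitions, with F1 as the harmonic mean of precision and recall. The confusion counts in the example are hypothetical and are not taken from this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 90 true positives, 30 false positives, 10 false negatives.
p, r, f1 = precision_recall_f1(tp=90, fp=30, fn=10)
print(round(p, 2), round(r, 2), round(f1, 3))  # 0.75 0.9 0.818
```

Because the negative class is much larger than the positive class in this task, even a small false positive rate produces many false positives, which lowers precision and hence F1 while leaving accuracy high.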
To quantitatively assess the model performance, several mean values were evaluated for each model. We first identified the validation point corresponding to the maximum validation accuracy and then took the 9 validation points that followed it. The validation accuracies of these 10 points were averaged and referred to as the mean validation accuracy. Mean values of the validation loss, training accuracy and training loss over these 10 points were evaluated as well for each model. All results are listed in Table 1. The mean validation accuracies of AlexNet 2 and 3 were 97.46% and 97.55%, respectively, approximately 1.2 percentage points higher than that of AlexNet 1 (96.34%), which indicated that the errors were reduced from 3.64% to 2.54% and 2.45%, respectively; these represented relative reductions of 30.22% and 32.70%, respectively. The mean values of the validation loss were also reduced by more than 20%, from 0.1076 to 0.0860 and 0.0848, respectively. On the other hand, the mean values of the training loss of AlexNet 2 and 3 were 0.0952 and 0.1115, respectively, larger than that of AlexNet 1, owing to the complex samples in the training sets. For the same reason, the mean values of training accuracy corresponding to AlexNet 2 and 3 were lower.
Testing results (%).
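The averaging procedure and the relative reductions quoted above can be reproduced with a small sketch. The helper names and the toy accuracy curve are assumptions; the error rates and loss values passed to `relative_reduction` are those quoted in the text.

```python
def mean_after_peak(values, peak_metric, window=10):
    """Average `values` over the first peak of `peak_metric` and the points after it."""
    start = max(range(len(peak_metric)), key=peak_metric.__getitem__)
    chunk = values[start:start + window]  # shorter if the run ends early
    return sum(chunk) / len(chunk)

def relative_reduction(before, after):
    """Relative reduction in percent."""
    return 100.0 * (before - after) / before

# Toy validation curve with a window of 3 instead of 10.
val_accuracy = [0.90, 0.95, 0.97, 0.96, 0.97, 0.95]
print(round(mean_after_peak(val_accuracy, val_accuracy, window=3), 4))  # 0.9667

# Values quoted in the text: validation error and validation loss.
print(round(relative_reduction(3.64, 2.54), 2))      # 30.22
print(round(relative_reduction(0.1076, 0.0860), 2))  # 20.07
```

Passing a different series as `values`, for example the validation loss, while keeping the validation accuracy as `peak_metric` gives the other mean values evaluated at the same 10 points.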
Testing and comparative study
With newly acquired raw images, the fine-tuned models were tested and compared with another framework, as described in this section. The investigations showed that the proposed framework performed well owing to its nonlinear learning ability, training dataset integrity and active learning strategy.
Testing of the trained networks
Another 15 raw images of 4000 × 6000 pixels were taken of a different building and cropped into 1708 CCS and 18,542 NCCS. These samples were classified by network models corresponding to the highest validation accuracy of the three training processes in the ‘Training and validation’ section. Different measures were evaluated and are listed in Table 1. Clearly, the models using training sets 2 and 3, that is, AlexNets 2 and 3, performed better in terms of accuracy, precision, and F1 measure, which was consistent with results of the previous section.
A group of selected results is illustrated in Figure 9. Disturbances can be seen clearly in Figure 9(c). Each transparent red patch represents the centre region of a sample classified as CCS, while the patches of other colours represent the centre regions of NCCS. All three models exhibited excellent crack detection and localization performance, even though disturbances were present. AlexNets 2 and 3 detected the cracks accurately, and the distribution of red patches showed a high degree of geometric similarity to the ground truth. Conversely, AlexNet 1 produced several false positive patches, two of which are marked in Figure 9(b), resulting in a relatively low precision. These results indicated the effectiveness and necessity of the proposed sampling and training methods.

Testing results with various models: (a) ground truth, (b) AlexNet 1, (c) AlexNet 2 and (d) AlexNet 3.
Comparative study
To further investigate the performance of the proposed framework, it was compared with an existing framework, ChaNet, proposed by Cha and Choi (2017), using the raw testing images mentioned in the ‘Testing of the trained networks’ section. First, definitions of four sample categories covering both frameworks are clarified in this subsection. Next, new training and validation datasets for ChaNet are established from the cropped samples in accordance with these definitions. Subsequently, ChaNet is trained and validated with these datasets. Finally, the raw testing images are cropped, separated into the four categories and classified by ChaNet and AlexNets 1 to 3. The classification results of all sets are collected and analysed to evaluate the different networks and frameworks.
Prior to dataset establishment, four sample categories were defined as CCS, intact samples (IS), corner crack samples (CorCS) and edge crack samples (ECS) for the purpose of the comparative study of the two frameworks mentioned above. As mentioned in the ‘Crack detection based on a CNN’ section, categories of samples must be defined according to crack characteristics that have been marked on counterparts of the raw images. The CCS category meets the relevant condition mentioned in the ‘Crack detection based on a CNN’ section. ECS are image samples in which the corresponding ground truth contains a marked crack that is longer than 100 pixels and is located outside of the sample centre region. IS contains no marks on the ground truth. The fourth set, CorCS, contains those samples corresponding to ground truth patches that are endowed with short marks (less than 100 pixels) outside of the central region. The ‘Crack detection based on a CNN’ section provides helpful information regarding the ground truth and sample centre region. It is worth mentioning that CorCS has been discarded due to recognition difficulties of ChaNet. As opposed to the proposed framework, where CCS comprised the positive set and all CorCS, IS and ECS belonged to the negative set, in the case of ChaNet, the positive set, that is, the cracks, consisted of both CCS and ECS, while the negative set consisted of IS.
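The difference between the two annotation schemes can be summarized in a short sketch that maps each of the four categories to a label under each framework. The dictionary and function names are hypothetical; the mapping itself follows the description above, with `None` marking the category that is discarded in the case of ChaNet.

```python
# Label per sample category: 1 = positive (crack), 0 = negative,
# None = discarded from the dataset.
PROPOSED_LABELS = {"CCS": 1, "ECS": 0, "CorCS": 0, "IS": 0}
CHANET_LABELS = {"CCS": 1, "ECS": 1, "CorCS": None, "IS": 0}

def build_dataset(samples, labels):
    """Map (sample_id, category) pairs to (sample_id, label), dropping discarded ones."""
    dataset = []
    for sample_id, category in samples:
        label = labels[category]
        if label is not None:
            dataset.append((sample_id, label))
    return dataset

samples = [("s1", "CCS"), ("s2", "ECS"), ("s3", "CorCS"), ("s4", "IS")]
proposed = build_dataset(samples, PROPOSED_LABELS)
chanet = build_dataset(samples, CHANET_LABELS)
print(proposed)  # [('s1', 1), ('s2', 0), ('s3', 0), ('s4', 0)]
print(chanet)    # [('s1', 1), ('s2', 1), ('s4', 0)]
```

The proposed scheme keeps every cropped sample, whereas the ChaNet scheme yields a smaller dataset in which one category never appears.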
Since training dataset 1 of AlexNet 1 was constructed with augmented positive samples, the same strategy had to be adopted for ChaNet for the sake of fairness. Consequently, ECS were selected from the 866,260 NCCS (refer to Figure 6) and added to the 67,778 CCS to form the positive training set, while IS were collected from the 56,433 NCCS to form the negative training set. In addition, validation samples were selected and separated similarly to the validation set in Figure 6. The resulting training set contained 110,859 positive and 51,696 negative samples, and the validation set contained 2355 positive and 18,387 negative samples. The raw testing images in this section were also cropped and annotated in accordance with the sample definitions, which generated 1381 positive and 8413 negative samples.
ChaNet was constructed, trained and validated under the same environment as that of AlexNet. Validation and testing results are presented in Table 2. Both types of results were consistent with each other even though levels of precision were very low, which indicated that numerous negative samples had been incorrectly classified. In contrast to Table 1, these results were deemed somewhat unsatisfactory.
Validation and testing results of ChaNet (%).
To analyse the network performance more precisely, the testing images were classified into the four sets and predicted with ChaNet, AlexNet 1, AlexNet 2 and AlexNet 3; the results are presented in Table 3. The percentages indicate the proportion of each sample set that was classified as positive by the relevant network. For ChaNet, 16.80% of IS were incorrectly classified as positive, and CorCS were classified almost randomly, with 41.69% predicted as positive. The former issue could be attributed to the low learning ability of ChaNet, while the latter could be viewed as the result of discarding samples. ChaNet had only one nonlinear activation layer and three convolutional layers, which made it more likely than AlexNet to make errors in complex detection tasks. Given the large number of IS, this error greatly affected network performance indicators, such as the precision in Table 2. Because all CorCS were discarded from the training dataset, ChaNet was never exposed to them during training, which resulted in the almost random classification. This result illustrates the importance of ensuring the integrity of the training dataset. In contrast to ChaNet, the proposed framework had an outstanding nonlinear learning ability owing to AlexNet and ensured data integrity, since no cropped samples were excluded from the training dataset. Only 2.23% of IS were predicted to be positive by AlexNet 1, which represented a greatly enhanced prediction accuracy, albeit at the cost of a considerable prediction error for ECS. Fortunately, the number of ECS was usually small, so this error had a limited influence on the performance indicators. By introducing AlexNet and annotating samples reasonably, AlexNet 1 outperformed ChaNet in terms of accuracy, precision and F1 measure; for example, the prediction accuracy increased from 83.61% to 95.56%, as listed in Tables 2 and 1, respectively.
Meanwhile, the use of AlexNet 2 and 3 effectively reduced the prediction error of ECS from 21.61% to approximately 7%, which resulted from the networks being trained with the active learning strategy. The active learning strategy fed the most difficult samples into the network for learning, which promptly and substantially enhanced the network performance.
Classification percentages.
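The percentages of the kind reported in Table 3 can be computed as the share of each category that a network predicts to be positive. The sketch below assumes a list of category names and a parallel list of boolean predictions; the function name and the toy data are hypothetical and do not reproduce the values in the table.

```python
from collections import Counter

def positive_rate_by_category(categories, predicted_positive):
    """Percentage of samples in each category that were predicted positive."""
    totals, positives = Counter(), Counter()
    for category, is_positive in zip(categories, predicted_positive):
        totals[category] += 1
        positives[category] += int(is_positive)
    return {c: 100.0 * positives[c] / totals[c] for c in totals}

# Toy data: four IS, two ECS and two CCS samples.
categories = ["IS", "IS", "IS", "IS", "ECS", "ECS", "CCS", "CCS"]
predicted = [False, False, True, False, True, False, True, True]
rates = positive_rate_by_category(categories, predicted)
print(rates)  # {'IS': 25.0, 'ECS': 50.0, 'CCS': 100.0}
```

For a category labelled positive under a given framework, this percentage is the recall on that category; for a category labelled negative, it is the false positive rate.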
Conclusion
This article presents practices of concrete surface crack detection using a CNN. A new CNN-based crack detection framework and a sampling and training method inspired by active learning, which resolves the class imbalance issue, are proposed. In particular, the new framework includes a clear definition of two categories of samples, a relevant sliding window technique, and annotation and data augmentation methods. The advantages of this framework are that it ensures data integrity and saves a large amount of annotation work. Training datasets generated with the proposed sampling method contain all positive samples, randomly selected negative samples and negative samples with higher cross entropy values. These sets not only represent the original dataset but also highlight those samples that are very complex. A comparison of the performance measures of training processes using three different sets indicates that the training sets resulting from the proposed method lead to more stable and accurate models for crack detection tasks. Model testing results demonstrate that models re-tuned with datasets generated by the proposed sampling method outperform those trained with datasets in which only the positive samples were augmented, in terms of accuracy, precision and F1 measure. A comparative study reveals an outstanding performance of the proposed framework compared to one existing framework in terms of detection accuracy, precision and F1 measure owing to its nonlinear learning ability, training dataset integrity and active learning strategy.
This article focuses on the overall framework of CNN-based crack detection and on the training dataset. In future work, more attention will be paid to investigating the impact of various neural network models on the performance. Different pre-trained models, such as AlexNet, GoogLeNet (Ioffe and Szegedy, 2015; Szegedy et al., 2015) and ResNet, proposed by He et al. (2016), as well as new models, have to be studied and compared in terms of crack detection performance. We also plan to propose novel models that target various types of cracks and to apply crack detection to seismic testing (Wang et al., 2020) to capture the damage process.
Footnotes
Authors’ Note
Zhen Wang is also affiliated with School of Civil Engineering and Architecture, Wuhan University of Technology, Wuhan, China.
Acknowledgements
The authors gratefully acknowledge this support.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research work was financially supported by the National Key Research and Development Programme of China (Grant No. 2017YFC0703603 and 2016YFC0701106). The work was also supported by the Hebei Provincial Transport Bureau Research Programme TH-201902 and TH-201919.
