Abstract
To reduce the reliance of existing mainstream vision-based crack detection algorithms on annotated datasets, this paper presents an unsupervised semantic segmentation-based approach. The proposed method first employs the Felzenszwalb-Huttenlocher algorithm to pre-segment the image and generate superpixels. Subsequently, an autoencoder model is designed to progressively approximate the superpixel segmentation results, and the optimal model is obtained by Bayesian optimization. Comparative experiments with existing algorithms demonstrate that the proposed method achieves performance comparable to supervised algorithms without the need for labeled data. As a result, the deployment complexity of the algorithm is significantly reduced and its applicability is expanded.
Keywords
Introduction
Crack identification is crucial for maintaining structural safety and durability in rapidly developing infrastructure (Abdel-Qader et al., 2003; Amorim et al., 2014). Traditional crack identification methods mainly rely on manual visual inspection, which is not only time-consuming and labor-intensive but also easily influenced by human factors, resulting in unstable detection results and large errors (Spencer et al., 2019). In recent years, computer vision technology (Yan et al., 2022) has been increasingly applied in the field of civil engineering (Xu and Liu, 2022a), including crack identification (Liu and Xu, 2022). Bae and An (2023) used computer vision technology for statistical crack quantification in concrete structures. In particular, the rapid development of deep learning technology has provided a more efficient and accurate method for crack identification. Deep learning-based crack identification methods, including convolutional neural networks (CNN) (He et al., 2016; Yu et al., 2022a) and transformers (Vaswani et al., 2017), can learn complex feature representations and effectively distinguish between crack and non-crack areas. In the field of civil engineering, they are used for the detection of cracks in steel structures (Xu et al., 2019), road surfaces (Chen et al., 2021; Tang et al., 2021), and bridge surfaces (Li et al., 2021), and have achieved high accuracy. Kim et al. (2019) used a CNN and speeded-up robust features (SURF) to identify cracks and determine their specific location in the image. Xu and Liu (2022b) identified cracks by using CNNs and a generative adversarial network. Feng and Feng (2018) proposed a lightweight encoder-decoder network for automatic bridge crack assessment. Yu et al. (2022b) proposed a hybrid detection framework for concrete cracks that considers the influence of model noise.
Existing supervised deep learning-based methods for crack identification predominantly rely on training a semantic segmentation model at the pixel level (Guo et al., 2018; Yu et al., 2018). The efficacy of these methods hinges upon the availability of substantial pixel-level annotated datasets to ensure optimal performance of the semantic segmentation model. However, constructing such datasets necessitates significant allocation of manpower and resources. Additionally, the model’s effectiveness is subject to the influence of the specific scene in which the dataset is collected, thereby posing difficulties in establishing a universally applicable model. The deployment of existing supervised algorithms to detection systems or devices necessitates extensive preparatory work including data collection, data annotation, and model training. Consequently, this significantly impedes the efficiency of algorithm deployment and presents a formidable challenge for the practical application of deep learning technology. To address this problem, in the field of computer vision, researchers have begun to explore semi-supervised (Xiang et al., 2023), weakly supervised (Huang et al., 2018) and unsupervised learning methods (Kanezaki, 2018; Van Gansbeke et al., 2021) to reduce the dependence of crack identification tasks on annotated data.
Unsupervised semantic segmentation algorithms are a class of image segmentation methods that do not rely on annotated data. Their main purpose is to learn semantic information in the image without the need for labeled data. In recent years, unsupervised semantic segmentation algorithms have made significant progress, including clustering-based methods (Pu et al., 2020), graph-based methods (Li et al., 2020), and autoencoder-based methods (Kanezaki, 2018), among others. Therefore, this paper proposes an unsupervised semantic segmentation-based crack identification method to overcome the shortcomings of existing methods.
The proposed method first uses the Felzenszwalb-Huttenlocher (FH) algorithm (Felzenszwalb and Huttenlocher, 2004) to pre-segment the image and obtain its superpixels. Superpixel segmentation can effectively reduce the number of elements to be processed in the image while retaining structural information. Then, we design a convolutional autoencoder neural network model to gradually approximate the superpixel segmentation results. To optimize the model's performance, we adopt a Bayesian optimization algorithm (Shahriari et al., 2015) to automatically adjust the model's hyperparameters.
This paper compares the performance of the proposed algorithm with a traditional digital image processing algorithm (threshold segmentation) and a supervised deep learning model (U-Net) (Ronneberger et al., 2015) on the test set. The test results show that our proposed unsupervised semantic segmentation-based crack identification method can achieve high accuracy in structural crack segmentation without any labeled data. Compared with classical deep learning-based crack identification methods, our method has lower data dependence and wider applicability. Therefore, the proposed method improves the performance of existing methods and provides a new and effective framework for the field of structural crack identification.
This paper presents two main innovations and contributions. Firstly, an algorithm for structural crack identification based on an unsupervised semantic segmentation model is proposed. This algorithm allows for the construction of a crack detection model at low cost, even in the absence of annotated datasets. Consequently, it significantly reduces the deployment difficulty and expands the scope of deep learning in practical applications. Secondly, a Bayesian optimization algorithm is employed to optimize the key hyperparameters of the deep learning model. This approach obtains the best-performing model with fewer iterations, greatly enhancing the optimization efficiency compared to traditional methods.
The structure of this paper is as follows: Section Crack image pre-segmentation based on superpixels introduces the superpixel algorithm and uses it to preprocess crack images; Section Unsupervised semantic segmentation algorithm based on superpixel pre-segmentation describes the proposed unsupervised semantic segmentation-based crack identification method and obtains the optimal model using the Bayesian optimization algorithm; Section Model testing presents the test results and compares them with existing methods; Section Conclusion summarizes the entire paper and proposes future research directions. The overall idea of the paper is shown in Figure 1.
Figure 1. Overall idea of the paper.
Crack image pre-segmentation based on superpixels
Superpixel algorithm
Compared to complex scenarios such as autonomous driving (Feng et al., 2020), in the semantic segmentation task of crack images, only binary classification of cracks and background is needed for each pixel, and cracks generally have clear edges, making this task relatively simple. Therefore, this paper first adopts the superpixel algorithm to preprocess crack images.
Superpixels are compact, homogeneous regions, each formed by a group of neighboring pixels in an image (Wang et al., 2017). These regions are created by grouping similar pixels based on color, texture, and proximity. Generated superpixels are usually irregular in shape and can vary in size. Superpixels are commonly used as a preprocessing step for various computer vision tasks such as object recognition, segmentation, and tracking. By reducing the number of elements to be processed in an image, the superpixel algorithm can significantly reduce the computational complexity of these tasks while maintaining high accuracy. It is worth noting that, unlike supervised learning, superpixel algorithms do not require large, high-quality annotated datasets for training, so their deployment costs are much lower than those of deep learning methods based on annotated data.
Commonly used superpixel algorithms include the Simple Linear Iterative Clustering (SLIC) method (Achanta et al., 2010) based on K-means clustering algorithm, the Quick Shift (Vedaldi and Soatto, 2008) algorithm for processing complex texture images, the Felzenszwalb-Huttenlocher (FH) algorithm and the Compact Watershed (CW) algorithm (Neubert and Protzel, 2014) for processing images of different scales.
Superpixels have a wide range of applications in computer vision, including object recognition (Wang et al., 2017), image segmentation (Ibrahim and El-kenawy, 2020), and image compression (Fracastoro et al., 2015). Superpixel algorithms have allowed the development of more efficient and accurate computer vision systems, making it possible to process large amounts of data in real-time.
Felzenszwalb-Huttenlocher algorithm
Among the numerous superpixel segmentation algorithms available, three mainstream algorithms are commonly used: Quickshift (Vedaldi and Soatto, 2008), SLIC (Achanta et al., 2012), and Felzenszwalb-Huttenlocher (Felzenszwalb and Huttenlocher, 2004):
Quickshift algorithm
The Quickshift algorithm is a superpixel segmentation method based on color and spatial distances in an image. It merges pixels by calculating their similarity, resulting in the formation of superpixels. The Quickshift algorithm uses color information and spatial distances between pixels to determine their similarity.
SLIC algorithm
The SLIC algorithm is a superpixel segmentation method based on K-means clustering. It divides the image into superpixels with similar color and spatial proximity. The SLIC algorithm has a fast computation speed and good boundary preservation.
Felzenszwalb-Huttenlocher algorithm
The Felzenszwalb-Huttenlocher algorithm is a superpixel segmentation method based on variations in edges and regions in an image. It uses edge information in the image to define superpixel boundaries and merges adjacent superpixels based on region similarity. This algorithm adapts well to textures and color variations in the image.
Advantages and disadvantages of three superpixel segmentation algorithms.
Given that the objective of this paper is crack detection, the demand for real-time image processing is not exceptionally high. Additionally, cracks exhibit intricate textures and possess significant randomness, necessitating algorithms with robust edge processing capabilities. Considering these factors, the Felzenszwalb-Huttenlocher algorithm has been chosen in this study as the foundation for unsupervised semantic segmentation.
The core of the Felzenszwalb-Huttenlocher algorithm is to use a graph-based method to construct superpixel regions, where each pixel is regarded as a node and the connection between these nodes is determined by the adjacency relationship between pixels. By segmenting this graph, a set of independent superpixel regions can be obtained. During the segmentation process, the algorithm calculates edge weights based on the similarity and adjacency relationships between pixels and uses the minimum spanning tree algorithm to generate superpixel regions.
The process of the Felzenszwalb-Huttenlocher algorithm is as follows:
1. Initialization: Each pixel is treated as a region and assigned a unique identifier.
2. Calculate edge weights: For each adjacent pair of pixels, their color distance and spatial distance are calculated and used as the weight of the edge. Specifically, for adjacent pixels p and q, their edge weight $w_{p,q}$ is calculated as in equation (1).
3. Construct the minimum spanning tree: Sort all the edges according to their weights from small to large, and add them to the minimum spanning tree one by one until the number of connected components reaches the preset maximum value or all the edges have been considered.
4. Segment the image: For each edge in the minimum spanning tree, if the identifiers of the two connected regions are different, these two regions are merged into one region, and their identifiers are set to the same value. This process continues to merge regions until all edges have been considered.
5. Output the results: The final output is a set of non-overlapping superpixel regions, each composed of adjacent pixels that are continuous and compact. These regions can be visualized using different colors or borders.
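The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in this paper: it takes a grayscale image, uses the absolute intensity difference between 4-connected neighbors as a stand-in for the edge weight of equation (1), and applies the merge criterion of the original graph-based formulation (Felzenszwalb and Huttenlocher, 2004), in which `scale` sets the merge threshold and `min_size` absorbs small regions in a second pass. The function name `fh_segment` is hypothetical.

```python
def fh_segment(img, scale=1.0, min_size=1):
    """Graph-based segmentation sketch. img is a 2D list of gray values.

    Returns a 2D list in which pixels of the same region share one label.
    """
    h, w = len(img), len(img[0])
    parent = list(range(h * w))      # step 1: every pixel is its own region
    size = [1] * (h * w)
    internal = [0.0] * (h * w)       # largest edge weight merged into a region

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b, weight):
        parent[b] = a
        size[a] += size[b]
        internal[a] = max(internal[a], internal[b], weight)

    # Step 2: edge weights between 4-connected neighbors (assumed weight:
    # absolute intensity difference).
    edges = []
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                edges.append((abs(img[y][x] - img[y][x + 1]), y * w + x, y * w + x + 1))
            if y + 1 < h:
                edges.append((abs(img[y][x] - img[y + 1][x]), y * w + x, (y + 1) * w + x))

    # Step 3: process edges from the smallest weight to the largest.
    edges.sort()

    # Step 4: merge two regions when the edge is no heavier than the
    # internal variation of either region plus a size-dependent tolerance.
    for weight, p, q in edges:
        a, b = find(p), find(q)
        if a == b:
            continue
        threshold = min(internal[a] + scale / size[a],
                        internal[b] + scale / size[b])
        if weight <= threshold:
            union(a, b, weight)

    # Second pass: absorb regions smaller than min_size into a neighbor.
    for weight, p, q in edges:
        a, b = find(p), find(q)
        if a != b and (size[a] < min_size or size[b] < min_size):
            union(a, b, weight)

    # Step 5: output one label per pixel.
    return [[find(y * w + x) for x in range(w)] for y in range(h)]


# A 4 x 4 image with a dark left half and a bright right half.
image = [[10, 10, 200, 200] for _ in range(4)]
labels = fh_segment(image, scale=1.0, min_size=1)
```

On this toy image the two flat halves stay separate because the edge between them is far heavier than the tolerance `scale / size`. Raising `min_size` above the size of a half (for example to 20) forces the second pass to merge them into a single region.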
In summary, the Felzenszwalb-Huttenlocher algorithm calculates edge weights, constructs the minimum spanning tree, and then merges regions based on the minimum spanning tree to obtain a set of continuous and compact superpixel regions. The time complexity of this algorithm is O(N log N), where N is the number of pixels, so it can quickly process large images. The main advantages of the Felzenszwalb-Huttenlocher algorithm are its fast speed, good segmentation effect, and few tunable parameters. Compared with traditional pixel-based segmentation methods, this algorithm can better handle complex features such as details and textures in images, and can produce more continuous and compact superpixel regions. Therefore, it is a suitable algorithm for crack image preprocessing in this paper, which can effectively reduce the computational complexity of subsequent crack detection tasks while maintaining high accuracy.
The superpixel segmentation effect of the Felzenszwalb-Huttenlocher algorithm is influenced by two parameters: scale and min_size. The scale parameter specifies the scale used for calculating boundaries, which is the neighborhood size used during the calculation process. A larger scale will result in smoother boundaries, while a smaller scale will result in more detailed boundaries. The min_size parameter specifies the minimum size of the segmented regions, which controls the granularity of the algorithm. A smaller value will result in smaller regions, while a larger value will result in larger regions. Figure 2 shows the segmentation results of the Felzenszwalb-Huttenlocher algorithm under different parameters. Figure 2 shows that scale and min_size have a significant impact on the segmentation precision, and that the algorithm is capable of effectively segmenting crack areas in the image, providing a good foundation for further processing by the subsequent deep learning modules.
Figure 2. Image segmentation by the Felzenszwalb-Huttenlocher algorithm.
Unsupervised semantic segmentation algorithm based on superpixel pre-segmentation
Weakly supervised and unsupervised semantic segmentation algorithms
In Section Crack image pre-segmentation based on superpixels, the Felzenszwalb-Huttenlocher algorithm clusters pixels with similar color and texture features and close distances into the same superpixel, which allows us to assume that all pixels within a superpixel share the same semantic label. Although superpixels can segment the image into continuous regions with obvious semantic information, they cannot determine the category of each superpixel, making it difficult to achieve semantic segmentation of the entire image. On the other hand, supervised image segmentation models, such as U-Net, can accurately segment images pixel by pixel, but such models require a large amount of pixel-level labeled data to train.
Therefore, weakly supervised (Chen et al., 2020) and unsupervised (Souly et al., 2017) semantic segmentation algorithms have been proposed. There are two basic approaches. The first is weakly supervised semantic segmentation based on class activation maps (CAM) (Selvaraju et al., 2017). Compared with pixel-level semantic segmentation datasets, the annotation cost of image-level classification datasets is relatively low. In the field of interpretable machine learning, the CAM algorithm can characterize the areas of an image that the CNN model pays the most attention to by outputting the weighted feature maps of the CNN model for image classification. The area with the highest heat value carries the most significant semantic information, so weakly supervised semantic segmentation models use it as the seed region and continuously expand the pixel regions until the semantic information of all pixels in the image is determined. This method has achieved good results on public datasets, but it has problems in crack segmentation. Because cracks are relatively thin and long, the seed region obtained by CAM usually exceeds the range of the crack, causing the subsequent expanded segmentation results to deviate significantly from the ground truth.
The other approach is based on superpixel pre-segmentation, represented by an unsupervised semantic segmentation algorithm (Kanezaki, 2018). This algorithm aims to combine the pre-segmentation results of superpixels with the pixel-level semantic segmentation function of autoencoders. By optimizing the parameters of the autoencoder to make its output results match the pre-segmentation model, accurate semantic segmentation of crack images can be achieved. This paper will adopt this algorithm to achieve unsupervised semantic segmentation of crack images.
Structure of unsupervised semantic segmentation algorithm
The proposed unsupervised semantic segmentation algorithm mainly consists of two parts. Firstly, a classic clustering algorithm (i.e., the Felzenszwalb-Huttenlocher algorithm in Section Crack image pre-segmentation based on superpixels) is used to preprocess the image and segment it into several superpixels, each assumed to have a single semantic label. Secondly, based on the autoencoder structure in the field of deep learning (referred to as AEseg in this paper), the input image is initially segmented to obtain the label of each pixel, and this segmentation result is defined as I1. At the same time, based on the Felzenszwalb-Huttenlocher preprocessing results, the number of pixels of each category within each superpixel is counted, and all pixels in each superpixel are assigned the category that appears most frequently. The processed segmentation result is defined as I2. In an ideal situation, the segmentation result of AEseg accurately determines the semantic information of each pixel, and I1 and I2 are the same. Therefore, the optimization goal of AEseg is to minimize the distance between I1 and I2.
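The relation between I1 and I2 can be made concrete with a small sketch. It assumes that the network prediction and the superpixel map are given as nested lists of integer labels, and it measures the distance between I1 and I2 as the fraction of disagreeing pixels, which is only one possible choice (the distance actually used for training is not specified here). The function names `refine_by_superpixel` and `disagreement` are hypothetical.

```python
from collections import Counter


def refine_by_superpixel(pred, superpixels):
    """Build I2: give every pixel the most frequent predicted label
    found inside its superpixel."""
    votes = {}
    for pred_row, sp_row in zip(pred, superpixels):
        for label, sp in zip(pred_row, sp_row):
            votes.setdefault(sp, Counter())[label] += 1
    majority = {sp: counts.most_common(1)[0][0] for sp, counts in votes.items()}
    return [[majority[sp] for sp in sp_row] for sp_row in superpixels]


def disagreement(a, b):
    """Fraction of pixels whose labels differ between two label maps."""
    total = sum(len(row) for row in a)
    differing = sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return differing / total


# I1: per-pixel labels predicted by the network (toy values).
i1 = [[1, 0, 0],
      [1, 0, 0]]
# Superpixel identifiers from the pre-segmentation (toy values).
superpixels = [[0, 0, 1],
               [0, 1, 1]]

i2 = refine_by_superpixel(i1, superpixels)
gap = disagreement(i1, i2)
```

In this toy case superpixel 0 contains two pixels labeled 1 and one labeled 0, so all three become 1 in I2, and one pixel out of six differs between I1 and I2. Training drives this gap toward zero.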
Figure 3 shows the unsupervised semantic segmentation algorithm for crack images.
Figure 3. Unsupervised semantic segmentation algorithm for crack images.
AEseg is a fully convolutional neural network. Its input is a 256 × 256 color image. The image passes through four convolution modules, which output the semantic segmentation prediction as a single-channel 256 × 256 image. The four convolution modules are referred to as Conv module1, Conv module2, Conv module3, and Conv module4. The first three convolution modules each have three layers: a convolution layer, a batch normalization layer, and a ReLU layer. The last convolution module has only a convolution layer and a batch normalization layer.
The purpose of the batch normalization layer and ReLU layer is to prevent problems such as gradient explosion or overfitting. The batch normalization layer mitigates changes in the distribution of intermediate-layer data, which helps prevent gradient vanishing or explosion.
The structure of the autoencoder is shown in Figure 4.
Figure 4. Structure of the autoencoder.
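To illustrate how a stack of four convolution modules can map a 256 × 256 color input to a 256 × 256 single-channel output, the sketch below traces the spatial size and parameter count through each module using the standard convolution output-size formula. The kernel sizes, padding values, and channel widths in `MODULES` are assumptions chosen for illustration only; they are not stated in the text.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical settings: 3x3 convolutions with padding 1 keep the
# spatial size, and a final 1x1 convolution reduces to one channel.
MODULES = [
    {"name": "Conv module1", "in": 3,  "out": 64, "kernel": 3, "padding": 1, "relu": True},
    {"name": "Conv module2", "in": 64, "out": 64, "kernel": 3, "padding": 1, "relu": True},
    {"name": "Conv module3", "in": 64, "out": 64, "kernel": 3, "padding": 1, "relu": True},
    {"name": "Conv module4", "in": 64, "out": 1,  "kernel": 1, "padding": 0, "relu": False},
]


def trace(size=256):
    """Return (name, out_channels, spatial_size, n_params) per module."""
    rows = []
    for m in MODULES:
        size = conv_output_size(size, m["kernel"], padding=m["padding"])
        conv_params = m["in"] * m["out"] * m["kernel"] ** 2 + m["out"]  # weights + biases
        bn_params = 2 * m["out"]                                        # scale + shift
        rows.append((m["name"], m["out"], size, conv_params + bn_params))
    return rows


rows = trace(256)
for name, channels, size, params in rows:
    print(f"{name}: {channels} x {size} x {size}, {params} parameters")
```

Under these assumed settings every module preserves the 256 × 256 resolution, so the network output can be compared pixel by pixel with the superpixel-refined labels.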
During the training process, the Felzenszwalb-Huttenlocher algorithm is used to preprocess the training images and generate superpixels. The pixels in each superpixel are labeled with the same category according to the most frequent pixel category in the superpixel. The training images and the resulting superpixel-refined label maps (I2) are used to train the AEseg model, so no manually annotated segmentation masks are required. The training process is carried out by minimizing the loss function using the stochastic gradient descent algorithm.
Hyperparameter optimization based on Bayesian optimization algorithm
As a computer vision model that combines deep learning and traditional machine learning algorithms, the unsupervised semantic segmentation algorithm proposed in this paper achieves semantic segmentation of cracks by allowing the autoencoder to approximate the pre-segmented superpixel results. Therefore, the quality of the pre-segmentation by the Felzenszwalb-Huttenlocher algorithm is crucial to the results. According to Section Crack image pre-segmentation based on superpixels, segmentation in the Felzenszwalb-Huttenlocher algorithm is mainly affected by two factors: scale and min_size. To obtain the optimal model, it is necessary to optimize these two hyperparameters of the segmentation model.
Therefore, this paper manually annotated 1000 crack images from the public dataset SDNET2018 (Maguire et al., 2019) as ground truth, as the basis for hyperparameter optimization. In the optimization process, the results obtained by this paper’s unsupervised semantic segmentation method will be compared with the ground truth results to find the optimal combination of (scale, min_size) hyperparameters that make the segmentation results as close as possible to the ground truth.
It should be noted that the range of variation of the above two hyperparameters is quite large, so the range of variation of the combination of the two hyperparameters will become even larger. Traversing the parameter combinations will consume a lot of computing power and time. To solve this problem, this paper uses a more efficient Bayesian optimization algorithm to optimize the hyperparameters.
The Bayesian optimization algorithm (Frazier, 2018) is an optimization method based on Bayes' theorem. It is mainly used for optimizing black-box functions, that is, functions without a clear analytical expression that can only be studied through the relationship between their inputs and outputs.
The Bayesian optimization algorithm has a wide range of applications in the fields of machine learning (Snoek et al., 2012) and computer vision (Zhang et al., 2015). One of its main applications is hyperparameter tuning (Victoria and Maragatham, 2021). Hyperparameters refer to the parameters that need to be manually set in the machine learning model, such as learning rate, regularization coefficient, etc. The selection of these parameters has an important impact on the performance and efficiency of the model, but usually needs to be adjusted through trial and error. The Bayesian optimization algorithm can be used to automatically discover the best combination of hyperparameters, thereby improving the performance and efficiency of the model. Specifically, the Bayesian optimization algorithm can build a model based on existing sample data, predict the performance of the next hyperparameter combination, and select the optimal hyperparameter combination for the next round of training. This method is usually more efficient and accurate than methods such as random search and grid search.
In the field of computer vision, the Bayesian optimization algorithm is also used for optimization tasks such as image classification, object detection, and image generation. For example, in image classification tasks, it can be used for model architecture selection, weight optimization, and hyperparameter tuning, thereby improving the performance and efficiency of the model. In object detection tasks, it can be used for detector architecture selection, weight optimization, and threshold selection. In image generation tasks, it can be used for generator architecture selection, weight optimization, and hyperparameter tuning. These optimizations can help the model converge faster and improve its performance and efficiency.
Compared to genetic algorithms, simulated annealing, and ant colony algorithms, Bayesian optimization algorithm has several prominent advantages in the task of optimizing hyperparameters for deep learning models. Firstly, it employs an efficient sampling strategy by using Gaussian process models to model the search space of hyperparameters and utilizes existing sampling information for inference and prediction. In contrast, genetic algorithms and ant colony algorithms require the generation of a large number of individuals or ants for searching, while simulated annealing requires a substantial amount of random transformations. By optimizing the sampling strategy, the Bayesian optimization algorithm can find better combinations of hyperparameters with fewer sampling iterations.
Secondly, it employs an adaptive search strategy. The Bayesian optimization algorithm dynamically adjusts the sampling range and weights of each hyperparameter in the search space by continuously updating the confidence of the Gaussian process model. This adaptability enables the algorithm to better explore and utilize the information in the search space, quickly adapting to the interactions between different hyperparameters and changes in optimization objectives.
Thirdly, it possesses robust noise handling capabilities. The Bayesian optimization algorithm models the confidence of the Gaussian process model and performs modeling and elimination of noisy data. This effectively reduces search biases caused by data noise and enables more robust optimization in the presence of noise and uncertainty.
Lastly, it exhibits stronger global optimization capabilities. By maintaining a global history of hyperparameter combinations during the search process, the Bayesian optimization algorithm is capable of escaping local optima and searching for global optima in the search space. In contrast, genetic algorithms, simulated annealing, and ant colony algorithms are prone to getting trapped in local optima and require more iterations to find global optima.
In summary, the Bayesian optimization algorithm outperforms genetic algorithms, simulated annealing, and ant colony algorithms in terms of more efficient sampling strategy, adaptive search strategy, robust noise handling, and stronger global optimization capabilities in the task of optimizing hyperparameters for deep learning models. Therefore, in this paper, we adopt the Bayesian optimization algorithm to obtain the optimal unsupervised crack detection model.
The Bayesian optimization algorithm describes the characteristics of the function to be optimized by building a probability model. It assumes that the hyperparameters and evaluation indicators follow a Gaussian process model, that is, any finite number of hyperparameter combinations and their corresponding evaluation indicators follow a joint normal distribution. Based on this assumption, Bayes' theorem is used to estimate the value of the evaluation indicator at each point in the parameter space, the model is updated, and the next attempt is guided by the previous results. This is repeated until the optimal solution is found.
The process of Bayesian optimization in this paper is shown in Figure 5.
Figure 5. Bayesian optimization algorithm.
The hyperparameters to be optimized are scale and min_size, and the optimization goal is to minimize the distance between the output of the unsupervised crack semantic segmentation model and the ground truth, that is, to maximize their agreement. This is the same as the semantic segmentation evaluation task in supervised learning models, so this paper uses the evaluation indicators commonly used in the semantic segmentation field: Precision, Recall, and Intersection over Union (IoU). These three indicators are commonly used in image segmentation to evaluate the performance and accuracy of segmentation algorithms.
Confusion matrix of a binary classification model.
In the confusion matrix, TP (true positive) represents the number of correctly identified crack pixels, FP (false positive) represents the number of background pixels incorrectly identified as crack pixels, FN (false negative) represents the number of crack pixels that were not identified as such, and TN (true negative) represents the number of correctly identified background pixels.
Through the confusion matrix, various classification metrics such as accuracy, recall, and F1 score can be calculated. In image semantic segmentation models, the most important metrics are precision, recall, and IoU.
Precision is the proportion of correctly classified positive samples to the total number of samples classified as positive by the classifier. In image segmentation, Precision refers to the proportion of pixels correctly labeled as foreground by the algorithm to all pixels labeled as foreground by the algorithm. A higher Precision indicates that fewer background pixels are mistaken for foreground. The definition of Precision is shown in equation (2):
$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$
Recall is the proportion of positive samples correctly classified by the classifier to the total number of actual positive samples. In image segmentation, Recall refers to the proportion of pixels correctly labeled as foreground by the algorithm to all actual foreground pixels. A higher Recall indicates that the algorithm misses fewer actual foreground pixels. The definition of Recall is shown in equation (3):
$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$
IoU (Intersection over Union) refers to the ratio of intersection to union between the segmentation result and the ground truth label. In image segmentation, IoU represents the proportion of the overlapping area between the algorithm’s segmentation result and the ground truth label to the area of their union. A higher IoU indicates a closer match between the algorithm’s segmentation result and the ground truth label. The definition of IoU is shown in equation (4):
$$\text{IoU} = \frac{TP}{TP + FP + FN} \tag{4}$$
These three metrics are the most important in image semantic segmentation tasks and are typically used together to comprehensively evaluate the performance of the algorithm. Therefore, in this paper, we use the sum of the above three metrics as the objective function, as shown in equation (5):
$$\text{Target} = \text{Precision} + \text{Recall} + \text{IoU} \tag{5}$$
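The three metrics and their sum can be computed directly from the confusion-matrix counts. The sketch below assumes binary masks given as nested lists in which 1 marks a crack pixel and 0 marks background; the function names are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not one taken from the paper.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN for two binary masks (1 = crack, 0 = background)."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a mask contains no positives."""
    return numerator / denominator if denominator else 0.0


def segmentation_scores(pred, truth):
    """Precision, Recall, IoU, and their sum used as the optimization target."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    iou = safe_div(tp, tp + fp + fn)
    return {"precision": precision, "recall": recall, "iou": iou,
            "target": precision + recall + iou}


pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
scores = segmentation_scores(pred, truth)
```

For these toy masks TP = 2, FP = 1, FN = 1, and TN = 2, which gives Precision = Recall = 2/3, IoU = 1/2, and a target value of about 1.83. Because each metric lies in [0, 1], the target lies in [0, 3].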
After defining the objective function, the Bayesian optimization algorithm can be used to optimize the hyperparameters. The specific steps are as follows:
1. Initialize 5 points in the search space. In this paper, the search space is defined as in equations (6) and (7).
2. Build a probabilistic model. In this paper, a Gaussian process model is used to model the objective function. The Gaussian process model assumes that any finite number of hyperparameter combinations and their corresponding evaluation indicators conform to a joint normal distribution.
3. Choose the next hyperparameter combination to try. The Bayesian optimization algorithm uses an acquisition function to determine which hyperparameter combination to try next. The acquisition function balances exploration and exploitation. In this paper, the expected improvement (EI) acquisition function is used.
4. Evaluate the objective function at the chosen hyperparameter combination. The unsupervised semantic segmentation algorithm is used to segment the crack images with the chosen hyperparameters, and the precision, recall, and IoU are calculated.
5. Update the probabilistic model. The new hyperparameter combination and its evaluation indicators are added to the existing sample data, and the Gaussian process model is updated.
6. Repeat steps 3 to 5 thirty times. The optimization process stops when the maximum number of iterations is reached, and the best hyperparameter combination found is taken as the optimum.
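The loop described in these steps can be sketched with the standard library alone. This is an illustrative sketch, not the implementation used in this paper: the Gaussian process uses an RBF kernel with a fixed length scale on inputs rescaled to the unit square, candidates for the EI acquisition are drawn at random, and `toy_objective` is a hypothetical smooth function standing in for the expensive step of training and scoring the segmentation model. The search bounds in the usage line are placeholders, since the actual ranges of equations (6) and (7) are not reproduced here.

```python
import math
import random


def rbf(a, b, length=0.2):
    """RBF (squared exponential) kernel between two points."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * length ** 2))


def cholesky(A):
    """Lower-triangular factor L of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def forward(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i, bi in enumerate(b):
        y.append((bi - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def backward(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def fit_gp(X, y, jitter=1e-4):
    """Steps 2 and 5: condition the GP on all observations so far."""
    mean = sum(y) / len(y)
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = backward(L, forward(L, [v - mean for v in y]))
    return L, alpha, mean


def predict(gp, X, x_new):
    """Posterior mean and standard deviation at x_new."""
    L, alpha, mean = gp
    k_star = [rbf(a, x_new) for a in X]
    mu = mean + sum(k * a for k, a in zip(k_star, alpha))
    v = forward(L, k_star)
    var = max(1.0 - sum(t * t for t in v), 1e-12)
    return mu, math.sqrt(var)


def expected_improvement(mu, sigma, best):
    """EI for maximization."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf


def bayes_opt(objective, bounds, n_init=5, n_iter=30, n_candidates=100, seed=0):
    rng = random.Random(seed)
    sample = lambda: [rng.uniform(lo, hi) for lo, hi in bounds]
    unit = lambda x: [(v - lo) / (hi - lo) for v, (lo, hi) in zip(x, bounds)]
    X = [sample() for _ in range(n_init)]            # step 1
    y = [objective(x) for x in X]
    for _ in range(n_iter):                          # step 6
        U = [unit(x) for x in X]
        gp = fit_gp(U, y)                            # steps 2 and 5
        best = max(y)
        candidates = [sample() for _ in range(n_candidates)]
        nxt = max(candidates,                        # step 3
                  key=lambda c: expected_improvement(*predict(gp, U, unit(c)), best))
        X.append(nxt)
        y.append(objective(nxt))                     # step 4
    i = max(range(len(y)), key=y.__getitem__)
    return X[i], y[i]


def toy_objective(x):
    """Hypothetical stand-in for the target value; peaks at (30, 140)."""
    scale, min_size = x
    return -((scale - 30.0) / 100.0) ** 2 - ((min_size - 140.0) / 200.0) ** 2


best_x, best_y = bayes_opt(toy_objective, bounds=[(1.0, 100.0), (1.0, 200.0)])
```

With 5 initial points and 30 iterations the objective is evaluated only 35 times, which is the practical appeal when each evaluation means training a model. In a real run, `toy_objective` would be replaced by a function that trains the segmentation model for a given (scale, min_size) pair and returns its target value on the annotated images.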
Bayesian optimization process.
Figure 6 shows the relationship between (scale, min_size) and the target value in the parameter space, obtained by Gaussian process regression. Gaussian process regression is a statistical framework for modeling and predicting complex, non-linear relationships in data. It places a distribution over functions such that any finite set of function values follows a multivariate Gaussian distribution. This distribution is defined by a mean function and a covariance function, also known as a kernel. The mean function represents the overall trend of the data, while the covariance function captures the similarity between different data points. This makes the model both flexible and probabilistic: every prediction comes with a quantified uncertainty. Gaussian process regression is applied in domains ranging from finance and engineering to healthcare and environmental sciences, where understanding and predicting complex systems is crucial.
Figure 6. Hyperparameter optimization target relationship based on Gaussian process regression.
The choice of kernel function determines the shape and characteristics of the Gaussian process. A common kernel function is the Radial Basis Function (RBF), also known as the squared exponential kernel, which assumes that the modeled function is smooth. This paper uses the RBF kernel for the Gaussian process model, as shown in equation (8):
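For reference, the RBF kernel is commonly written in the form below. The symbols are the conventional ones and are assumed here, so they may differ from the notation of equation (8): x and x' are two hyperparameter combinations (scale, min_size), σ_f² is the signal variance, and ℓ is the length scale that controls how quickly the correlation decays with distance.

```latex
k(\mathbf{x}, \mathbf{x}') = \sigma_f^{2}
  \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}}{2\ell^{2}} \right)
```

Two nearby hyperparameter combinations therefore have a covariance close to σ_f², while distant ones are nearly uncorrelated. This is what lets the model interpolate the target value smoothly between the evaluated points.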
As can be seen from Table 1 and Figure 6, when (scale, min_size) = (10, 112), the target value reaches the maximum of 2.388, with precision = 0.757, recall = 0.924, and IoU = 0.706. Therefore, the model trained with this hyperparameter combination was used as the optimal model for crack semantic segmentation.
Model testing
Comparative experiments and analysis
To verify the effectiveness of the proposed algorithm, we compared it with a traditional digital image processing algorithm and a supervised semantic segmentation algorithm. Threshold segmentation was used as the traditional algorithm, and the classic U-Net model from the computer vision field was used as the supervised algorithm.
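A threshold segmentation baseline of this kind can be sketched as follows. The text does not state how the threshold is chosen, so Otsu's method is assumed here purely for illustration. The function names and the small grayscale image are hypothetical. Cracks are assumed to be darker than the surrounding surface.

```python
def otsu_threshold(gray):
    """Return the 8-bit threshold that maximizes the between-class variance."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the variance is undefined
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def threshold_segment(gray):
    """Label pixels at or below the Otsu threshold as crack (1), others as 0."""
    t = otsu_threshold(gray)
    return [[1 if v <= t else 0 for v in row] for row in gray]


# Hypothetical 3 x 4 grayscale patch with a dark diagonal crack.
image = [[200, 210, 40, 205],
         [198, 35, 45, 202],
         [30, 38, 199, 207]]
mask = threshold_segment(image)
```

A single global threshold relies on intensity alone, so stains, shadows, and joints that are as dark as the crack end up in the same class. This limitation motivates the comparison with learning-based methods.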
U-Net (Ronneberger et al., 2015) is a deep neural network architecture mainly used for image segmentation tasks. It was originally developed for semantic segmentation in medical image analysis and aims to cope with problems such as small datasets and sample imbalance. U-Net is characterized by a symmetric encoder-decoder structure, in which the encoder is a conventional convolutional neural network that extracts feature representations of the image, and the decoder is a symmetric deconvolutional network that restores the feature maps extracted by the encoder to a segmentation result at the original image resolution. In addition, U-Net adopts skip connections, which allow the decoder to use feature representations from different levels of the encoder, thereby improving segmentation accuracy. The U-Net structure is simple and easy to implement, and has been widely used in medical image segmentation, natural image segmentation, and remote sensing image segmentation. The structure of U-Net is shown in Figure 7.
Figure 7. The structure of the U-Net model.
We used the dataset of 1000 labeled images that served for the Bayesian optimization of the parameters in the section "Unsupervised semantic segmentation algorithm based on superpixel pre-segmentation" as the training set for the U-Net model, with training epochs = 10, learning rate = 10^-5, and the cross-entropy loss function. The loss curves of the training set and test set during training are shown in Figure 8. It can be seen from Figure 8 that the loss of the U-Net model has converged after 10 epochs of training.
Figure 8. The training process of the U-Net model.
After training the U-Net model, we manually annotated 500 images from the SDNET dataset (Maguire et al., 2019) as the test set and segmented them using the traditional digital image processing method, the U-Net model, and the method proposed in this paper. Figure 9 shows the segmentation results of some images obtained by the different methods. It can be seen from Figure 9 that the segmentation quality of the proposed method is better than that of the traditional digital image processing method and is similar to that of the U-Net method.
Figure 9. Comparative crack segmentation example.
To quantitatively assess the performance of the three methods, this study computed the precision, recall, and IoU metrics along with their respective averages on the test set, as illustrated in Figure 10. As depicted in Figure 10, the proposed unsupervised semantic segmentation algorithm is more accurate than the conventional algorithm, although it falls short of the performance achieved by the supervised learning method. Nevertheless, because it requires no pixel-level labels, the proposed algorithm is well suited to scenarios where annotated datasets are scarce or labeling costs are prohibitively high.
Figure 10. Performance comparison of the three methods on the test set.
From Figure 10, it can be seen that the unsupervised algorithm proposed in this paper does exhibit a performance gap compared to existing mainstream supervised learning algorithms on the test set. In terms of evaluation metrics, the proposed algorithm’s accuracy in crack detection tasks is currently inferior to that of the classical semantic segmentation network U-Net. In fact, when it comes to segmentation performance, both weakly supervised and unsupervised semantic segmentation algorithms in the field of computer vision are currently unable to surpass supervised algorithms.
However, the characteristics of the proposed algorithm make it valuable in practical applications. The main distinction between the method proposed in this paper and mainstream methods is that it is unsupervised. In other words, the algorithm presented in this paper does not require training on annotated data. It achieves reasonably accurate segmentation results solely by pre-segmenting the image and determining the semantic information within the resulting superpixels. Although this method falls slightly short of existing supervised deep learning methods, its unsupervised nature makes it easier to deploy and applicable to a wider range of scenarios.
It is well known that mainstream supervised deep learning algorithms heavily rely on annotated data as training sets to build prediction models. Consequently, constructing a practical deep learning-based detection system requires extensive data annotation, which consumes significant manpower and resources. This is especially true for image semantic segmentation tasks discussed in this paper. Annotating a 256 × 256 resolution image takes between 2 and 5 min, and training a supervised image semantic segmentation model requires annotating thousands of such images. This becomes a significant obstacle when it comes to rapidly deploying a detection model into a detection system or device. Additionally, there is the issue of diverse infrastructure types with significant differences in surface structure, color, texture, and other features. It is challenging to find a universal dataset that covers various structural surface forms. Therefore, the common practice is to establish specialized datasets. However, for newly constructed structures, there is no readily available dataset for defects, making the acquisition of the original dataset itself a challenge. To gather a sufficient number of crack images, one must wait for structures to develop a significant number of cracks, which hampers the timely detection of structural defects in the early stages, thereby limiting the model's applicability.
Unsupervised algorithms offer a viable solution to the aforementioned issues. Although their performance currently falls short of that achieved by supervised models, these algorithms do not rely on annotated data. Consequently, they can be directly deployed in detection systems or devices without the time-consuming and labor-intensive process of data collection and annotation. Therefore, we contend that the proposed unsupervised semantic segmentation method is meaningful: it significantly reduces the deployment complexity of the algorithm and expands its range of applications, albeit at the cost of some accuracy.
Generalization and robustness analysis
To evaluate the proposed algorithm comprehensively, we analyze the model's generalization performance and robustness. Specifically, we select real-world crack images from engineering applications to verify whether the algorithm can achieve the desired results against more complex backgrounds. This evaluation assesses whether the algorithm can be applied in scenarios beyond publicly available datasets, that is, its generalization capability. Additionally, we test whether the algorithm maintains good performance when noise is added. To this end, we add varying levels of Gaussian noise to the original crack images and use the proposed algorithm to detect the cracks in the noisy images. This evaluation determines whether the algorithm can maintain its accuracy in the presence of noise, that is, its robustness against noise interference.
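The noise perturbation used in such a robustness test can be sketched as follows, assuming 8-bit grayscale images and zero-mean additive Gaussian noise. The noise levels, function name, and example image are hypothetical, since the values used in the experiments are not stated in the text.

```python
import random


def add_gaussian_noise(gray, sigma, seed=None):
    """Add zero-mean Gaussian noise (std sigma) to an 8-bit image, clipped to 0..255."""
    rng = random.Random(seed)
    noisy = []
    for row in gray:
        noisy.append([min(255, max(0, round(v + rng.gauss(0.0, sigma))))
                      for v in row])
    return noisy


image = [[200, 210, 40, 205],
         [198, 35, 45, 202]]

# Hypothetical noise levels, expressed as standard deviations in gray levels.
noisy_versions = {sigma: add_gaussian_noise(image, sigma, seed=0)
                  for sigma in (5, 15, 30)}
```

Each noisy version would then be passed through the same segmentation pipeline as the clean image, so that the results at different noise levels can be compared side by side.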
The results of applying the algorithm proposed in this paper to detect cracks in an image captured on a bridge are shown in Figure 11. The image exhibits a complex background environment. As observed from Figure 11, the algorithm proposed in this study demonstrates a relatively accurate ability to identify cracks in complex background environments. However, in a few areas with interfering noise, the proposed method may misclassify superpixels. Overall, the model maintains good generalization performance and can be applied in engineering applications.
Figure 11. Detection results of the proposed algorithm on complex background images.
Figure 12 illustrates the detection results for images under different levels of noise. It can be observed that, for most images, the algorithm is still able to detect the cracks after noise is added. However, as the noise level increases, the edges of the detected cracks become progressively rougher, which indicates that noise still affects the accuracy of the detection. Additionally, for some more complex images (the last row in Figure 12), added noise may prevent the finer cracks from being detected. Therefore, it can be concluded that the proposed algorithm exhibits robustness in the presence of noise, but there is still room for improvement.
Figure 12. Detection results of images with added noise.
Conclusion
This paper proposes a crack recognition method based on an unsupervised semantic segmentation algorithm, which can achieve high-precision crack segmentation without any annotated data. By using the Felzenszwalb-Huttenlocher algorithm for image pre-segmentation and designing a convolutional neural network model with an autoencoder structure that gradually approximates the superpixel segmentation results, the proposed method effectively reduces the dependence on annotated data and has wider applicability. In addition, this paper uses a Bayesian optimization algorithm to efficiently obtain the model that maximizes algorithm performance.
This paper compared the proposed algorithm with a traditional digital image processing method and the supervised deep learning model U-Net. On the test set, the performance of the proposed algorithm is significantly better than that of the traditional method and is close to that of U-Net. Therefore, it can be considered that the proposed method significantly improves on the accuracy of traditional methods without requiring annotated datasets. The proposed method provides a new framework for structural crack recognition algorithms and a reference for future related research.
Although the proposed algorithm in this paper achieves good results, there are still some issues with the current method. The experimental results indicate that the unsupervised algorithm proposed in this study falls short of the accuracy achieved by supervised learning algorithms. The main reason for this is that unsupervised models lack any prior information, leading to misclassification of certain superpixels. If more prior information regarding cracks can be introduced into the unsupervised algorithm, it could achieve improved performance. In future research, we will integrate the algorithm proposed in this paper with theories related to deep learning interpretability, such as class activation mapping techniques, to enhance the performance of the unsupervised model by utilizing prior information.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
