Abstract
Regular inspection of damaged areas of concrete bridges is critical to their safe service. However, the complexity of bridge environments and the variety of defect types pose challenges to deep learning-based inspection methods. To address these issues, this study proposes a novel segmentation model for multiple defects in concrete bridges, named ConcreteBridgeSeg-Net (CBS-Net). First, it employs a convolutional neural network (CNN) to extract local features, followed by a novel Transformer integrated with a Multi-Dconv Head Transposed Attention (MDTA) mechanism for global feature extraction. Subsequently, an auxiliary feature extraction branch parallel to the Transformer is designed to enhance local feature extraction, and a channel feature fusion module (CFFM) is employed to fuse the feature information from both. Lastly, Efficient Channel Attention (ECA) is embedded within the skip connections to minimize interference from redundant information. The experimental results show that the mean intersection over union (mIoU), mean pixel accuracy (mPA), and mean precision (mPre) of CBS-Net on the test set are 78.51%, 88.79%, and 86.76%, respectively, all higher than those of other state-of-the-art segmentation models. Notably, CBS-Net achieves an inference speed of 38.72 frames per second (FPS) with an input image size of 512 × 512, demonstrating a fast segmentation rate while maintaining segmentation accuracy. Furthermore, it exhibits excellent generalization performance on public datasets, thereby confirming its applicability to real-world bridge inspection scenarios.
Introduction
Bridges play a critical role in transportation networks. Over time, concrete bridges undergo various forms of deterioration such as cracking, spalling, and exposed rebar due to factors like environmental corrosion, material aging, and heavy loads (Deng et al., 2019). Failure to promptly address these issues can compromise the stability and safety of the bridge structure. Therefore, regular inspection of the concrete structures’ surfaces and prompt maintenance of damaged areas are essential.
The traditional method of detecting surface defects on bridges relies heavily on manual inspection. Inspectors determine the type and severity of defects through visual examination and manual marking. However, manual inspections are labor-intensive, time-consuming, and prone to inaccuracies. Consequently, the transport industry has an urgent need for more automated methods of bridge surface defect detection.
Deep learning is an artificial intelligence technique based on multi-layered neural networks. It can automatically extract key features from massive data and has been used to address challenges in structural health monitoring and defect detection (Cha et al., 2024). In recent years, researchers in civil engineering have used deep learning methods to automatically detect surface defects in infrastructure, including concrete defect image classification (Cha et al., 2017), defect detection (Cha et al., 2018), and crack segmentation (Kang et al., 2020). Among them, deep learning-based concrete defect segmentation achieves pixel-level results, which are the basis for quantitatively assessing the severity of defects, and is therefore more suitable for concrete damage detection. The convolutional neural network (CNN), a well-established architecture in deep learning, excels at extracting image features and demonstrates high computational efficiency in processing image data (Krizhevsky et al., 2017). Therefore, semantic segmentation models that use a CNN as the fundamental architecture, such as Fully Convolutional Networks (FCN) (Shelhamer et al., 2016), U-Net (Ronneberger et al., 2015), LinkNet (Chaurasia and Culurciello, 2017), and SegNet (Badrinarayanan et al., 2017), have been extensively used for surface defect segmentation in concrete structures, including cracking and spalling of building structures (Kumar et al., 2021), tunnel lining cracks (Lee et al., 2020), and bridge structure cracks (Wang et al., 2020a). However, these semantic segmentation models are not specifically designed for concrete damage segmentation and still need to be improved in identifying complex concrete defects. Several researchers have proposed methods to improve the feature extraction capability of CNNs (Yang et al., 2023). For instance, Deng et al. (2020) enhanced segmentation accuracy by integrating spatial pyramid pooling modules into LinkNet to expand the receptive field of the feature map. Karaaslan et al. (2021) employed attention-guided techniques to enhance SegNet performance in segmenting defects in concrete infrastructure. Zoubir et al. (2024) introduced a convolutional attention module into U-Net to augment crack feature extraction. Because the receptive field of a CNN remains limited to a localized area even after multiple down-sampling steps, CNNs excel at capturing local features but have difficulty precisely capturing long-range dependencies.
To address the challenge of long-range dependency modeling, Transformer networks based on self-attention mechanisms have become a research focus (Han et al., 2022). These networks improve the understanding of contextual information by capturing dependencies between different positions in the input sequence. Some scholars have applied the Transformer architecture to crack segmentation in concrete structures. For example, Wang and Su (2022) proposed the SegCrack model, which uses a hierarchical Transformer structure as the encoder to achieve pixel-level segmentation of complex cracks. Similarly, Li et al. (2024a) introduced Segformer, a Transformer-based model for concrete crack detection in diverse scenarios. This model demonstrated superior segmentation performance compared to CNN-based approaches. Although Transformer-based models have achieved significant breakthroughs, Transformers exhibit high computational complexity in shallow networks and demand substantial computational resources. Furthermore, they require large amounts of training data to achieve satisfactory segmentation accuracy. Consequently, several studies have proposed combining CNN and Transformer to enhance model performance by leveraging their respective advantages. Ali and Cha (2022) proposed the IDSNet network, which combines a CNN with a self-attention module for segmentation of internal damage in concrete; similarly, Kang and Cha (2022) proposed the lightweight STRNet network, which improves crack recognition accuracy by adding a multi-head self-attention module to the decoder. Experiments show that the mIoU of these models exceeds 90% and that the FPS at 1280 × 800 resolution exceeds 45. Zhou et al. (2023) incorporated Swin-Transformer (Liu et al., 2021), which uses a hierarchical structure, local self-attention, and a window shift mechanism to capture multi-scale features and reduce computation, into DeepLabv3+ to construct a joint backbone network for extracting local and global information, significantly improving crack segmentation accuracy. Li et al. (2024b) proposed a hybrid U-shaped CrackTrNet model for concrete dam crack detection by embedding a standard Transformer into the CNN layer, demonstrating superior segmentation accuracy over existing state-of-the-art networks.
Current hybrid networks integrating CNN-Transformer architectures have achieved notable advancements in concrete defect detection. However, these models are primarily designed for crack segmentation, with their Transformer modules relying on local windows or standard self-attention to capture spatial global dependencies. Such approaches, constrained by fixed window sizes, impede the effective capture of long-range information, limiting their performance in segmenting complex concrete structures with multiple defects. Additionally, the complexity and multi-scale nature of concrete bridge damage backgrounds pose challenges for the precise segmentation of multiple defects. To address these challenges, this study proposes a multi-defect segmentation model for concrete bridges, termed Concrete Bridges Segmentation Network (CBS-Net). The key contributions are as follows: (1) A transformer module integrating the Multi-Dconv Head Transposed Attention (MDTA) mechanism is proposed. It generates self-attention through cross-channel information and can extract contextual information efficiently, which is superior to traditional self-attention mechanisms. (2) The designed auxiliary feature extraction branch and channel feature fusion module effectively integrate detailed and global semantic information through deep fusion, enhancing the model’s segmentation accuracy for multi-scale defects such as cracks and spalling. (3) An Efficient Channel Attention (ECA) mechanism is embedded in the decoder’s skip connections to better convey information about the characteristics of concrete defects, and to reduce interference from complex environmental backgrounds. (4) The proposed CBS-Net can accurately segment various types of defects, with precision metrics outperforming other mainstream segmentation models, making it suitable for detecting multiple concrete defects.
The paper’s structure is outlined as follows: Section Network structure provides an overview of the proposed model’s structure and details of each module, while Section Dataset and experimental configuration covers the dataset, evaluation metrics, and experimental parameters. Section Experimental results and analysis presents the experimental results and analysis, discussing the effectiveness of each module and comparing the models. Finally, Section Conclusion encapsulates the main conclusions.
Network structure
The overall framework of CBS-Net, as shown in Figure 1, consists of encoder and decoder structures. A hybrid architecture of CNN and Transformer is used as the encoder. The CNN is adept at swiftly extracting primary features, while the Transformer focuses on capturing global feature correlations to efficiently extract the features of the concrete defect images. To prevent loss of detailed features during global feature extraction by the Transformer, an auxiliary feature extraction branch (AFEB) is introduced to run in parallel with the Transformer, enhancing local feature extraction after CNN feature extraction. Subsequently, a channel feature fusion module (CFFM) integrates the feature information from the Transformer and AFEB. The decoder restores information lost during feature extraction, primarily through up-sampling modules that recover the original size of the feature map and skip connections that concatenate features of different scales. Furthermore, an ECA module is introduced in the skip connections to enhance feature transfer capability.
Overall structure of CBS-Net.
CNN encoder
The CNN encoder utilized in this study employs the first four stages of the ResNet50 network (He et al., 2016) to capture local features and edge information. As shown in Figure 2, Stage 1 represents preprocessing and consists of a 7 × 7 convolution with a stride of 2 followed by max pooling. The input image (H × W × 3) is thus down-sampled twice: first by the 7 × 7 convolution with a stride of 2, and second by max pooling. The number of output channels is 64, and the height and width of the resulting feature map are each 1/4 of those of the original image.
CNN network structure.
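The 1/4 reduction can be checked with the usual output-size arithmetic for convolution and pooling windows. The sketch below is illustrative only: the padding of 3 for the 7 × 7 convolution and the 3 × 3 pooling window with padding 1 are assumptions taken from the common ResNet50 configuration and are not stated in the text, and the function names are hypothetical.

```python
def conv_out_size(n, kernel, stride, padding):
    """Output length along one axis for a convolution or pooling window."""
    return (n + 2 * padding - kernel) // stride + 1


def stem_out_size(n):
    """Apply the two Stage 1 down-sampling steps along one axis."""
    # 7x7 convolution, stride 2; padding 3 is an assumed (standard ResNet50) value.
    n = conv_out_size(n, kernel=7, stride=2, padding=3)
    # Max pooling, stride 2; the 3x3 window and padding 1 are assumed values.
    n = conv_out_size(n, kernel=3, stride=2, padding=1)
    return n


print(stem_out_size(512))  # 128, i.e. 1/4 of 512
```

Under these assumptions a 512 × 512 input yields a 128 × 128 feature map after Stage 1.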
The latter three stages primarily consist of multiple bottleneck units, which comprise two 1 × 1 convolutions, one 3 × 3 convolution, and shortcut branches. As depicted in Figure 2, the Bottleneck adjusts the channel number through a 1 × 1 convolution in the shortcut branch and then adds it to the feature map of the main path. Bottleneck1 represents a convolutional residual module that reduces the size of the feature map by half through a 1 × 1 convolutional kernel with a stride of 2 in the shortcut branch and adjusts the number of channels for the next layer’s residual structure. Bottleneck2 represents a standard residual module where the input feature layer is directly added to the output of the main path via a shortcut branch without altering the size of the feature map. This construction method can mitigate gradient vanishing and information loss in deep neural networks.
Transformer encoder
The Transformer block is shown in Figure 3 and mainly consists of a self-attention mechanism, a feedforward network (FFN) layer, and layer normalization (LN). The self-attention mechanism is used to capture the correlation between different locations and acquire global features. The FFN performs nonlinear transformations, helping the model learn more abstract features and further enhancing its fitting capability. Additionally, layer normalization improves the stability and convergence speed of the model.
Transformer structure.
The computational overhead of Transformers primarily arises from the self-attention mechanism, where the cost of computing the self-attention weight matrix grows quadratically with the number of pixels and therefore rises rapidly with increasing image resolution. Existing methods address this problem mainly by dividing the image into multiple patches. However, this approach is limited by the size of the patches and cannot adequately capture long-range information interactions. Therefore, this study employs the Multi-Dconv Head Transposed Attention (MDTA) (Zamir et al., 2022) to compute self-attention across channels for generating global feature information.
The structure of MDTA is depicted in Figure 4. Given the input tensor x ∈ R^(H × W × C), after layer normalization, the context feature information across channels is first aggregated using a 1 × 1 convolution, followed by encoding of the spatial context within each channel using a 3 × 3 depth-wise convolution to obtain the linear projections of query (Q), key (K), and value (V). Subsequently, the dot-product of Q and transposed K generates an attention map of size C × C, reducing computational complexity compared to the traditional self-attention map of size HW × HW.
Illustration of MDTA.
The self-attention computation incorporates a multi-head mechanism, enabling the simultaneous and independent computation of each attention head. This approach assigns the features of different channels to separate self-attention heads, with each attention head being weighted by a parameter matrix to yield varying attention outputs. The results of the parallel computation of multiple attention heads are combined through the Concat function to get the final output. The calculation method is shown below:
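As an illustration of the channel-wise attention described above, the following NumPy sketch computes a multi-head transposed attention map from given Q, K, and V tensors. It is a simplified sketch under stated assumptions, not the implementation used in this study: the 1 × 1 and depth-wise convolutions that produce Q, K, and V and the final output projection are omitted, the scaling factor `alpha` is an assumed parameter, and the function names are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along one axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def transposed_attention(q, k, v, num_heads=2, alpha=1.0):
    """Channel-wise multi-head attention on (H, W, C) tensors."""
    h, w, c = q.shape
    assert c % num_heads == 0, "channels must divide evenly across heads"
    d = c // num_heads

    def split(t):
        # Flatten the spatial axes and assign channel groups to heads:
        # (H, W, C) -> (heads, d, H*W).
        return t.reshape(h * w, num_heads, d).transpose(1, 2, 0)

    qh, kh, vh = split(q), split(k), split(v)
    # Q times transposed K gives a (d, d) map per head; its size depends on
    # the channel count only, not on the spatial resolution H*W.
    attn = softmax(qh @ kh.transpose(0, 2, 1) / alpha, axis=-1)
    out = attn @ vh  # (heads, d, H*W)
    # Concatenate the heads back into an (H, W, C) tensor.
    out = out.transpose(2, 0, 1).reshape(h, w, c)
    return out, attn


rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16, 8))
out, attn = transposed_attention(x, x, x, num_heads=2)
print(out.shape, attn.shape)  # (16, 16, 8) (2, 4, 4)
```

For a 16 × 16 × 8 input the attention map per head is 4 × 4, whereas spatial self-attention on the same input would require a 256 × 256 map.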
Auxiliary feature extraction branch
The Transformer encoder primarily extracts global features. However, considering the relatively small size of cracks and exposed rebar defects in concrete, an auxiliary branch is added to extract subtle features and texture information. In this auxiliary feature extraction branch, the Convolutional Modulation Block (Hou et al., 2024) is utilized to extract local features. This module effectively interacts with local features while maintaining computational efficiency, thereby improving the recognition accuracy of subtle defects.
The convolutional modulation block is illustrated in Figure 5. Given an input feature tensor x ∈ R^(H × W × C), it is initially divided into two branches x1 and x2. Branch x1 performs depth-wise convolution with an 11 × 11 kernel to generate a local feature map A1, while branch x2 generates a feature map V1 through a linear transformation between channels using 1 × 1 convolution. Subsequently, the feature map A1 and the feature map V1 are multiplied element by element (Hadamard product), where the output of A1 is used as weights to modulate the feature V1. Finally, a 1 × 1 convolution is applied to map the generated weight information back to the original feature dimensions, thereby obtaining the final output of the module.
Convolutional modulation block structure.
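The modulation step can be sketched in NumPy as follows. This is a minimal illustration of the three operations named above (depth-wise convolution, 1 × 1 convolution, Hadamard product); zero padding, the absence of normalization and activation layers, and all function and parameter names are assumptions made for the example rather than details of the model.

```python
import numpy as np


def depthwise_conv(x, kernels):
    """Depth-wise 2D convolution with zero padding.

    x has shape (H, W, C); kernels has shape (k, k, C) with odd k.
    """
    k = kernels.shape[0]
    p = k // 2
    h, w, _ = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(k):
        for j in range(k):
            # Each channel is filtered only by its own kernel slice.
            out += xp[i:i + h, j:j + w, :] * kernels[i, j, :]
    return out


def conv_modulation(x, dw_kernels, w_v, w_out):
    """A1 = depth-wise conv(x), V1 = 1x1 conv(x), output = 1x1 conv(A1 * V1)."""
    a1 = depthwise_conv(x, dw_kernels)  # local feature map A1
    v1 = x @ w_v                        # 1x1 conv as a per-pixel channel mixing
    return (a1 * v1) @ w_out            # Hadamard product, then 1x1 projection


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 32, 4))
y = conv_modulation(
    x,
    dw_kernels=rng.normal(size=(11, 11, 4)) * 0.01,
    w_v=rng.normal(size=(4, 4)),
    w_out=rng.normal(size=(4, 4)),
)
print(y.shape)  # (32, 32, 4)
```

Because A1 depends on an 11 × 11 neighborhood of each pixel, the weights applied to V1 carry local context while the cost stays linear in the number of pixels.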
Channel feature fusion module
The Transformer and the auxiliary branch are structures with different characteristics. Therefore, the channel feature fusion module (CFFM) is employed in this study to effectively integrate their respective features. This module aims to minimize redundant and irrelevant channel information, thereby facilitating more precise segmentation results through enhanced interaction between local and global feature information.
The specific process is illustrated in Figure 6. First, the feature maps of the Transformer and AFEB are concatenated (Cat) in the channel dimension. Subsequently, a 1 × 1 convolution halves the number of channels. The SE (squeeze-and-excitation) module (Hu et al., 2018) is then incorporated to facilitate a deep fusion of local and global features. The SE module employs global average pooling to compress the input feature map into a 1 × 1 × C feature vector. After the vector passes through two fully connected layers, the Sigmoid function maps it to channel weights in the range (0, 1). Finally, each channel of the feature map is multiplied by its corresponding weight to yield a feature map with channel attention.
Channel feature fusion module.
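The fusion steps can be summarized in a short NumPy sketch. It is illustrative only: the ReLU between the two fully connected layers follows the standard SE design and is an assumption here, bias terms are omitted, and the weight shapes and names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cffm(f_trans, f_aux, w_reduce, w_fc1, w_fc2):
    """Fuse two (H, W, C) feature maps with concatenation, 1x1 conv, and SE."""
    x = np.concatenate([f_trans, f_aux], axis=-1)  # (H, W, 2C)
    x = x @ w_reduce                               # 1x1 conv: 2C -> C channels
    s = x.mean(axis=(0, 1))                        # global average pooling, (C,)
    e = np.maximum(s @ w_fc1, 0.0)                 # first FC layer + ReLU (assumed)
    weights = sigmoid(e @ w_fc2)                   # second FC layer + Sigmoid, (C,)
    return x * weights, weights                    # channel-wise reweighting


rng = np.random.default_rng(0)
c, r = 8, 4  # r is a hypothetical bottleneck width for the FC layers
fused, weights = cffm(
    rng.normal(size=(16, 16, c)),
    rng.normal(size=(16, 16, c)),
    w_reduce=rng.normal(size=(2 * c, c)),
    w_fc1=rng.normal(size=(c, r)),
    w_fc2=rng.normal(size=(r, c)),
)
print(fused.shape, weights.shape)  # (16, 16, 8) (8,)
```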
Decoder
The decoder is utilized for reinstating contextual information and primarily comprises an up-sampling block along with skip connections. The up-sampling block doubles the size of the input feature map by bilinear interpolation. Skip connections facilitate the fusion of concrete defect features output at various scales.
The transmission process of skip connection paths for features of different scales is relatively lengthy, which hinders the effective conveyance of detailed features such as small cracks and rebars to the decoder, thereby impacting the accuracy of concrete defect image segmentation. To tackle this issue, we embed the Efficient Channel Attention (ECA) (Wang et al., 2020b) mechanism, which prioritizes essential detail features by assigning weights to the feature maps. High weights are allocated to emphasize significant information, while low weights are utilized to filter out irrelevant details.
The structure of ECA is depicted in Figure 7. First, Global Average Pooling (GAP) is performed on the input feature map to obtain weights W1 of size 1 × 1 × C, achieving the fusion of global feature information. Then, an adaptive one-dimensional convolution with a kernel size of k is used to obtain channel weights W2. Subsequently, the Sigmoid activation function maps the channel weights W2 to attention weights W3 in the range (0, 1). Finally, the attention weights W3 are multiplied by the input feature map to obtain the weighted feature map.
ECA module.
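The four steps above can be sketched in NumPy as follows. The kernel-size rule is the adaptive mapping proposed with ECA (Wang et al., 2020b), shown with its default values γ = 2 and b = 1; whether CBS-Net uses these defaults is not stated in the text and is assumed here, and the function names are hypothetical.

```python
import math

import numpy as np


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive odd kernel size derived from the channel count."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def eca(x, conv_w):
    """Apply ECA to an (H, W, C) feature map.

    conv_w is a 1D kernel of odd length k shared by all channels.
    """
    k = len(conv_w)
    w1 = x.mean(axis=(0, 1))           # GAP -> W1, shape (C,)
    padded = np.pad(w1, k // 2)        # zero padding keeps the output length C
    # 1D convolution across neighboring channels -> W2
    w2 = np.array([padded[i:i + k] @ conv_w for i in range(len(w1))])
    w3 = 1.0 / (1.0 + np.exp(-w2))     # Sigmoid -> attention weights W3
    return x * w3, w3                  # reweight each channel of the input


rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16, 64))
k = eca_kernel_size(64)
y, w3 = eca(x, rng.normal(size=k))
print(k, y.shape, w3.shape)  # 3 (16, 16, 64) (64,)
```

Unlike the two fully connected layers of an SE module, the shared 1D kernel adds only k parameters per ECA module.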
Dataset and experimental configuration
Dataset
Owing to the absence of publicly accessible semantic segmentation datasets for multiple surface defects of concrete bridges, this study collected images of surface defects by photographing 17 concrete bridges. A single device (with a camera resolution of 12 megapixels) was used for all images, and the shooting angle was kept as parallel to the defect as possible. Lighting was not fixed and included both sunny and cloudy conditions, so the captured images closely resemble actual engineering scenes. Following meticulous manual screening, 1365 high-resolution images capturing clear and comprehensive representations of three prevalent types of concrete defects (spalling, exposed rebar, and cracks) were selected. All images were resized to 512 × 512 × 3.
Number of concrete defect data sets.
The Labelme software is used for annotating concrete defects. In the labeling process, the concrete defect image is first enlarged to the maximum extent and then labeled along the defect contour using the polygon labeling tool to form a label file, which contains the defect location, label name, and pixel value. The image of concrete structure defects and the image annotation process are depicted in Figure 8.
Concrete defect image and annotation method. (a) Concrete structure defect image. (b) Image labeling process.
Experimental environment and parameter configuration
The semantic segmentation models are implemented in Python 3.8 with PyTorch 1.13 as the deep learning framework and CUDA 11.7. The computer configuration includes 32 GB of RAM, an Intel Core i5-13400 central processing unit (CPU), and an NVIDIA RTX 4070 graphics processing unit (GPU).
The model was trained for 120 epochs using the Adam optimizer with a momentum parameter of 0.9 and cosine learning rate decay. The initial learning rate was set to 0.001, and the batch size was 6 images. To ensure reproducible training results and avoid randomness in experimental outcomes, the order of data loading was fixed by setting a specific random seed.
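For reference, a cosine decay schedule with the settings above can be written as a small function. This is a sketch of the common cosine annealing form, not necessarily the exact schedule used in training: the minimum learning rate of 0 and per-epoch updating are assumptions, and the function name is hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs=120, lr_init=1e-3, lr_min=0.0):
    """Learning rate at a given epoch under cosine decay.

    lr_min = 0 is an assumed floor; the text does not state one.
    """
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_init - lr_min) * cos_term


for epoch in (0, 30, 60, 90, 120):
    print(epoch, f"{cosine_lr(epoch):.6f}")
```

The rate starts at 0.001, reaches half that value at epoch 60, and decays smoothly toward the floor at epoch 120.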
Evaluation metrics
Confusion matrix.
CPAi is the ratio of the number of pixels correctly predicted as category i to the total number of pixels predicted as category i (the sum of category i columns in Table 2). The calculation formula is as follows:
CPrei represents the ratio of the number of predicted pixels that correctly belong to class i to the number of pixels that actually belong to class i. The calculation formula is as follows:
IoU is the ratio of the intersection to the union of the prediction and the ground truth for a given category. For category i, the intersection is the number of pixels correctly predicted as category i, and the union is the number of pixels predicted as category i (the sum of the class i column in Table 2) plus the number of pixels actually belonging to class i (the sum of the class i row in Table 2) minus the number of pixels correctly predicted as class i. The calculation formula is as follows:
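The three per-class metrics can be computed directly from a confusion matrix, as in the NumPy sketch below. It follows the wording of the definitions above, with rows as actual classes and columns as predicted classes; reading CPre as a ratio against the row sum is an interpretation of the text, and the function name and the example matrix are hypothetical.

```python
import numpy as np


def segmentation_metrics(cm):
    """Per-class CPA, CPre, and IoU from a confusion matrix.

    cm[i][j] is the number of pixels of actual class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    correct = np.diag(cm)          # pixels correctly predicted as class i
    predicted = cm.sum(axis=0)     # column sums: pixels predicted as class i
    actual = cm.sum(axis=1)        # row sums: pixels actually in class i
    cpa = correct / predicted      # as worded in the text for CPA
    cpre = correct / actual        # as worded in the text for CPre (assumed reading)
    iou = correct / (predicted + actual - correct)
    return {
        "CPA": cpa, "CPre": cpre, "IoU": iou,
        "mPA": cpa.mean(), "mPre": cpre.mean(), "mIoU": iou.mean(),
    }


# Hypothetical two-class pixel counts, for illustration only.
metrics = segmentation_metrics([[3, 1], [2, 4]])
print(metrics["IoU"], metrics["mIoU"])
```

The mean metrics (mPA, mPre, mIoU) are taken here as unweighted averages over classes.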
Frames per second (FPS) is selected to evaluate the real-time performance of the model, where FPS is calculated as follows:
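A simple way to measure FPS is to divide the number of processed images by the total inference time. The sketch below assumes this definition, since the formula itself is not reproduced here; `infer` is a hypothetical stand-in for a single forward pass of the model.

```python
import time


def measure_fps(infer, images):
    """Return processed images per second for a callable `infer`."""
    start = time.perf_counter()
    for image in images:
        infer(image)
    # When timing GPU inference, the device should be synchronized before
    # reading the clock so that queued work is included in the measurement.
    elapsed = time.perf_counter() - start
    return len(images) / elapsed


def dummy_infer(image):
    # Placeholder workload standing in for model inference.
    return sum(range(1000))


fps = measure_fps(dummy_infer, list(range(100)))
print(f"{fps:.2f} FPS")
```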
Loss function
During the process of training a neural network, the loss function is employed to calculate the discrepancy between the predicted and actual values. Utilizing this calculated discrepancy, the model iteratively optimizes its training parameters to minimize the disparity between predicted and actual values.
The size of concrete defects varies, and the pixel count of images of cracks and exposed rebar defects is relatively low. To address this issue of pixel imbalance in such images, in this paper, cross-entropy (CE) loss (Zhang and Sabuncu, 2018), Focal loss (Lin et al., 2017) and Dice loss (Sudre et al., 2017) were set respectively for comparative experiments.
The Focal loss function introduces weight coefficients α and γ into the CE loss function to adjust the weights of hard and easy samples. This reduces the contribution of easily classified samples to the overall loss, allowing the model to focus more on samples that are difficult to classify. The Dice loss function addresses pixel ratio imbalances by disregarding a large number of background pixels when calculating the ratio of the intersection and union. The formulas for CE loss, Focal loss, and Dice loss are as follows:
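The three losses can be illustrated with a short NumPy sketch operating on flattened pixel predictions. These are the common textbook forms and may differ in detail from the formulas used in this study: α = 0.25 and γ = 2 are the defaults suggested by Lin et al. (2017) and are assumed here, the Dice loss is averaged over classes, and the small constant `eps` is added for numerical stability.

```python
import numpy as np


def ce_loss(p, y, eps=1e-7):
    """Cross-entropy; p is (N, K) class probabilities, y is (N, K) one-hot."""
    return float(-(y * np.log(p + eps)).sum(axis=1).mean())


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss: CE down-weighted by (1 - pt)^gamma for well-classified pixels."""
    pt = (p * y).sum(axis=1)  # predicted probability of the true class
    return float((-alpha * (1.0 - pt) ** gamma * np.log(pt + eps)).mean())


def dice_loss(p, y, eps=1e-7):
    """Dice loss: 1 minus the mean per-class overlap coefficient."""
    inter = (p * y).sum(axis=0)
    denom = p.sum(axis=0) + y.sum(axis=0)
    dice = (2.0 * inter + eps) / (denom + eps)
    return float(1.0 - dice.mean())


# Two pixels, three classes; hypothetical probabilities for illustration.
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(ce_loss(p, y), focal_loss(p, y), dice_loss(p, y))
```

Because the Dice term is computed from per-class overlap rather than per-pixel error, classes with few pixels, such as cracks, contribute as much as large background regions.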
Experimental results and analysis
The performance of the proposed CBS-Net model is evaluated through experiments in this section. To ensure a fair comparison, all experiments are conducted under the same experimental environment and dataset, with performance evaluation carried out using the test set. Unless specified otherwise, consistent parameters are employed across all experiments. The experiments primarily encompass: (1) comparing different loss functions to determine the most suitable one for concrete structure multi-defect scenarios; (2) conducting ablation experiments to assess the efficacy of each module within CBS-Net; (3) comparing CBS-Net with current state-of-the-art models to assess its relative performance; and (4) validating the generalization performance of the proposed CBS-Net on public datasets.
Loss function comparison results
Comparison results of three different loss functions for CBS-Net configuration.
Notes: S denotes spalling, R denotes exposed rebar, and C denotes crack.
Ablation experiment
Ablation experiment results of different components.

Example of segmentation in ablation experiment.
Table 4 demonstrates that the mIoU of CNN + Trans has increased by 2.96% and the mPre has increased by 3.29% compared to CNN. As shown in Figure 9, when utilizing only the CNN encoder for segmentation of concrete bridges with multiple defects, issues such as partial spalling and discontinuous crack segmentation arise. The incorporation of Transformer into a serial encoder structure improves the problem of discontinuous cracks, indicating that connecting CNN and Transformer in series can effectively extract global information of concrete bridges defects. Furthermore, CNN + Trans + Aux-branch exhibits a further improvement in mIoU by 1.25% compared to CNN + Trans. While segmenting more global defects, the model with an auxiliary branch can further segment local concrete spalling, indicating that the auxiliary feature extraction branch is adept at capturing local features, and the channel feature fusion module effectively integrates the auxiliary branch with Transformer, thereby enhancing the segmentation accuracy of concrete defects. Additionally, after adding the ECA module to CNN + Trans + Aux-branch, the mIoU increases by 0.59%, and the mPre increases by 0.42%, enhancing the model’s focus on the characteristics of concrete defects.
The addition of these modules improves segmentation accuracy but inevitably reduces the segmentation speed of the model. Therefore, the impact of each module on segmentation speed was tested in the ablation experiments. The inference speed of each model is shown in Table 4. Under the same experimental conditions, the inference speed of the network with MDTA reaches 40.71 FPS, a relative improvement of 30.15% over the 31.28 FPS achieved by the network with standard self-attention. This indicates that MDTA effectively reduces computational complexity while maintaining model performance. CBS-Net reaches 38.72 FPS, slightly slower than the baseline CNN model. Because real-time segmentation is typically taken to require more than 30 FPS, the proposed network meets the demands of real-time detection of concrete defect images.
Model comparison
To verify the superior performance of the constructed CBS-Net, this study compares it with currently advanced segmentation models (such as PSPNet (Zhao et al., 2017), DeepLabv3+ (Chen et al., 2018), Res-Unet (Xiao et al., 2018), HRNet (Wang et al., 2020c), Segformer (Xie et al., 2021), UTCD-Net (Zhang et al., 2023), SDDNet (Choi and Cha, 2019)). To ensure the fairness of the comparative experiments, all semantic segmentation models used the Dice loss function and employed the concrete structure multi-defect dataset established in this paper. Comparative experiments were conducted in the same experimental environment, where the training set was used to train model parameters, and the test set was used to evaluate various performance metrics of the models.
Figure 10(a) illustrates the loss function curves of the different models during the training phase, while Figure 10(b) depicts the validation error curves. It is evident from Figure 10 that after 100 training iterations, both the training and validation losses for each model stabilize without significant fluctuations, indicating convergence. Notably, CBS-Net exhibits superior learning ability, as evidenced by its lower loss and validation error values compared to several comparative models.
Error curve. (a) Training loss curve. (b) Validation error curve.
Comparison of recognition accuracy
Comparison results of accuracy evaluation metrics for different models.
Notes: S denotes spalling, R denotes exposed rebar, and C denotes crack.
Comparing the semantic segmentation images output by the models gives an intuitive evaluation of segmentation performance. Figure 11 illustrates the segmentation results of the models. As shown in Figure 11, all models accurately segment the contours of spalling defects in cases with no obvious background interference and a single damaged area. For exposed rebar, CBS-Net excels at segmenting even small exposed areas. For crack segmentation, all models display varying degrees of missed detections on subtle net-like cracks because such fine cracks are difficult to identify. Among these models, the CBS-Net segmentation results align most closely with the labeled images and demonstrate better continuity, indicating that CBS-Net is more effective in extracting global crack characteristics.
The segmentation results of different models.
As illustrated in Figure 12, when tested on the segmentation of multiple defects, the other comparative models demonstrate notable omissions or erroneous segmentations. Specifically, in Figure 12(a), DeepLabv3+ incorrectly recognizes a portion of exposed rebar as spalling; in Figure 12(b), ResU-Net fails to recognize small-scale spalling; in Figure 12(c), PSPNet exhibits several instances of discontinuous crack segmentation; and in Figure 12(d), HRNet misses the exposed rebar. Additionally, while Segformer is capable of segmenting a wider range of overall defects, its precision in segmenting local, detailed defects is limited. In contrast to the other comparative models, CBS-Net does not exhibit significant missed detections or false detections. It not only accurately segments overall defects but also precisely identifies local, minute defects, indicating that CBS-Net is more suitable for multi-defect segmentation of concrete structures.
Comparison of multiple defect segmentation results.
Efficiency comparison
Comparison of efficiency indicators of different models.
Notes: The image size in the efficiency test was 512 × 512 pixels and a total of 100 images were tested.
As shown in Table 6, CBS-Net has 28.63 M parameters and an inference speed of 38.72 FPS. Compared to other models, CBS-Net has fewer parameters and the fastest inference speed. The combined Transformer and CNN architecture maintains recognition accuracy, while the MDTA self-attention mechanism within the Transformer structure effectively reduces computational complexity. In summary, CBS-Net exhibits superior comprehensive performance.
Public dataset testing
Public dataset validation results.

Comparison of segmentation results on public datasets.
Conclusion
This paper proposes a multi-defect segmentation model for concrete bridges, namely CBS-Net. By conducting ablation experiments on CBS-Net, comparing it with currently advanced models, and testing it on public datasets, the main conclusions are as follows: (1) The ablation experiments demonstrate that incorporating a Transformer encoder into the CNN network enhances the recognition capability of global features, and adding an auxiliary feature extraction branch effectively improves the precision of local detail segmentation. Furthermore, introducing the ECA attention mechanism through skip connections reduces the interference from background information, thus further enhancing the recognition accuracy. (2) The recognition accuracy indicators of CBS-Net for multiple concrete defects, including mIoU, mPA, and mPre, have achieved 78.51%, 88.79%, and 86.76% respectively. These accuracy indicators surpass those of other currently advanced semantic segmentation models. Furthermore, the segmentation results of concrete structure defect images illustrate that CBS-Net excels at accurately segmenting mixed and minute defects, indicating its suitability for segmenting multiple defects in concrete bridges. (3) Segmentation speed tests show that CBS-Net achieves an FPS of 38.72 f/s, indicating good detection speed while ensuring segmentation accuracy. (4) When tested on public datasets, CBS-Net exhibits excellent accuracy indicators, demonstrating its outstanding generalization performance.
There are some limitations to this paper. First, the training data cover only a limited range of defect types, which may affect the segmentation accuracy of the model when applied to other specific defects. Second, model performance was not validated on mobile devices. Future work will concentrate on two main aspects: first, expanding the diversity of the dataset to improve the model’s adaptability to more complex defects; second, optimizing the model’s structure by employing pruning techniques to reduce computational complexity while maintaining accuracy, thereby enabling easier deployment on mobile inspection devices.
Footnotes
Acknowledgments
The author gratefully acknowledges the financial support of the National Natural Science Foundation of China (Grant No. 51708188) for this study. The author sincerely appreciates the editors and anonymous reviewers for their valuable suggestions in further improving this manuscript.
Author contributions
Zihang Yu: Writing – original draft, Methodology, Investigation, Validation. Caiping Huang: Writing – review & editing, Funding acquisition, Validation. Hui Li: Visualization, Data curation.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (Grant No. 51708188).
Data Availability Statement
Data will be made available on request.
