Abstract
Rapid and accurate damage assessment of railway viaducts is critical following an earthquake. This study proposes a novel two-phase framework for comprehensive damage assessment for post-earthquake reinforced concrete components in railway viaducts, which integrates image classification and instance segmentation. High-resolution images captured by cameras are first cropped or resized into standardized image blocks. In the first phase, a classification model categorizes these image blocks into background, concrete crack, and exposed rebar, enabling preliminary damage assessment of the concrete components. In the second phase, instance segmentation is applied only to the image blocks identified as containing concrete crack or exposed rebar to precisely locate the damage and achieve detailed assessment. By employing image classification to exclude non-damaged areas, the method significantly reduces the computational cost while maintaining high accuracy. The proposed method is validated using the Tokaido dataset of post-earthquake railway viaducts, demonstrating superior performance in both detection accuracy and processing efficiency compared to conventional methods.
Introduction
The railway system is an integral component of the modern transportation infrastructure. With the rapid development of the economy and society, the increasing transportation demands have led to the expansion and complexity of railway networks. To alleviate urban traffic congestion or facilitate railway construction, the structural form of railways has evolved towards a higher proportion of elevated bridges, among which viaducts have emerged as a significant structural form.
Seismic loading represents a significant challenge to the structural integrity of railway viaducts (Montenegro et al., 2016; Siringoringo and Fujino, 2018; Wei et al., 2022). Effective and rapid assessment of component and structural damage to railway viaducts following an earthquake is of critical importance (Chen et al., 2019; Ozakgul et al., 2024), as the assessment results serve as a crucial basis for decision-making regarding post-earthquake structural reinforcement and traffic resumption.
In recent years, with the rapid development of machine learning technologies and their significant advantages, machine learning has been extensively applied in the field of civil engineering, including railways (Gerum et al., 2019; Mosleh et al., 2023a, 2023b; Yin et al., 2022; Zhang et al., 2025a), bridges (Guo et al., 2025; Hurtado et al., 2024; Lei et al., 2023; Sun et al., 2023b; Wang et al., 2022; Zhang et al., 2025b), buildings and other structures (Jin et al., 2025; Luo et al., 2025; Maddalena et al., 2020; Wan et al., 2025; Wan and Ni, 2019; Xiong et al., 2024; Yang et al., 2025). Application of computer vision can be dated back to detection of human walking motions on a pedestrian bridge using recorded video (Fujino et al., 1993). As an important application domain of machine learning, computer vision has the potential to significantly enhance the efficiency, reliability, and automation of assessment processes and has achieved notable successes (Cha et al., 2017; Cha et al., 2018; Chang et al., 2026; Chen et al., 2022; Spencer et al., 2019; Levine et al., 2022; Pan and Yang, 2020; Sun et al., 2023a; Xu et al., 2019). Image processing tasks based on computer vision can generally be categorized into three types: classification (Sun et al., 2022), detection, and segmentation (Vodrahalli and Bhowmik, 2017).
Existing literature on damage detection in railway viaducts using computer vision.
Note. CD: Concrete damage, CC: Concrete crack, CS: Concrete spalling, ER: Exposed rebar, FCN: Fully convolutional network, RefineNet-AM: Refinement network with attention mechanism, CNN: Convolutional neural network, STF-PointRend: Search-transfer learning-PointRend, HRNet: High-resolution net.
The classical algorithms employed by the aforementioned scholars each possess unique strengths. Among the numerous existing algorithms, the YOLO series (Redmon et al., 2016) has garnered extensive attention due to its exceptional accuracy, real-time performance, and detection speed, making it highly suitable for dynamic, multi-object detection (Liu and Zeng, 2025). With continuous updates and iterations, YOLO has undergone effective architectural improvements and performance optimizations (Ali and Zhang, 2024). In particular, the latest version, YOLO11 (Ultralytics, 2024), represents the state-of-the-art in the YOLO series and has demonstrated significant advancements in enhancing performance across various computer vision tasks.
Considering the critical importance of rapidly and accurately conducting damage analysis and assessment of post-earthquake railway viaducts, this study adopts YOLO11 as the core algorithm for its real-time performance, detection speed, and accuracy. We propose a damage assessment method for post-earthquake reinforced concrete components of railway viaducts based on a two-phase framework. This method first uses image classification to quickly perform a preliminary damage assessment, then excludes background content to reduce the computational burden of the subsequent instance segmentation. This enables more rapid and accurate identification and segmentation of the specific locations of two types of damage, concrete crack and exposed rebar, thereby achieving comprehensive damage assessment.
The proposed method follows a systematic workflow. First, high-resolution images obtained from unmanned aerial vehicles (UAVs) and cameras are preprocessed through cropping or resizing to meet the requirements of subsequent operations. The processed images are then divided into standardized image blocks, which undergo image classification to identify background, concrete crack, and exposed rebar. This classification provides a preliminary damage assessment of the original high-resolution images. Subsequently, instance segmentation is carried out on the standardized image blocks classified as concrete crack and exposed rebar to identify the exact locations of the damage for detailed assessment. The effectiveness and accuracy of this method are validated using the Tokaido dataset of post-earthquake railway viaducts.
Two-phase deep learning method for post-earthquake viaduct damage assessment
YOLO11 model architecture
Building on the foundation of its predecessors, YOLO11 has made significant architectural and training method improvements. It adopts a more efficient architecture, with major innovations including the C3k2 block and the C2PSA block, among other advanced attention mechanisms. These architectural enhancements and optimizations boost feature extraction and spatial information processing capabilities, while maintaining high-speed inference and high accuracy.
The detailed architecture of YOLO11 is shown in Figure 1. The overall structure can be divided into three key components: the backbone, the neck, and the head. The backbone performs the core task of extracting multi-scale features from the input images. Its structure consists of a series of alternating convolutional (Conv) and C3k2 blocks stacked together, progressively downsampling to generate feature maps at different resolutions. At the end of the backbone, an SPPF block is employed to expand the receptive field and fuse multi-scale contextual information. This is followed by a C2PSA block that further enhances features and performs spatial attention operations to improve the extraction of key features.
Figure 1. Architecture diagram of YOLO11 model.
The neck constructs an efficient bidirectional feature pyramid network. Its workflow proceeds as follows. First, a top-down path is built through two consecutive cycles of Upsample, Concat, and C3k2 blocks. Deep features are upsampled and concatenated with shallow features from the backbone, thereby infusing high-level semantic information into the detail-rich mid-to-shallow features. Subsequently, a bottom-up path is constructed through two consecutive cycles of Conv, Concat, and C3k2 blocks. Here, the Conv block performs downsampling, so that the enhanced shallow information is fused with the deep features again and precise positional details are fed back to the high-semantic layer. This symmetric bidirectional architecture ensures thorough interaction and complementarity between deep and shallow features across different resolutions.
The head serves as the final prediction generation unit, inheriting the detection head design from YOLOv8. It employs three independent detection layers (Detect) to classify and localize objects at different scales. Despite its streamlined structure, the head efficiently and accurately completes the final detection task by directly operating on the multi-scale feature pyramid provided by the neck, which has undergone thorough fusion.
Framework of the proposed method
Railway viaducts are vulnerable to seismic activity, making rapid and accurate damage assessments of post-earthquake reinforced concrete components critical. Effective assessments provide crucial support for decision-making regarding structural repair and traffic resumption.
This study proposes a two-phase framework based on YOLO11 for rapid and precise damage assessment of reinforced concrete components in post-earthquake railway viaducts. As shown in Figure 2, this framework integrates YOLO11’s classification and instance segmentation capabilities. The first phase employs a classification model to rapidly categorize images into background, concrete crack, or exposed rebar, completing preliminary screening. The second phase performs instance segmentation exclusively on damaged images, precisely locating crack and exposed rebar areas to achieve detailed assessment.
Figure 2. The proposed two-phase framework.
By employing a cascading strategy of classification followed by segmentation, the approach fully leverages the classification model’s high-efficiency processing capabilities to rapidly complete image classification. This enables swift identification of the approximate damage area and determination of the damaged region’s percentage coverage. Although this result is coarse-grained, it provides crucial support for rapid post-earthquake assessment and emergency decision-making. Following classification, detailed damage segmentation is performed to achieve high-precision localization and quantification of damage. This is essential for detailed post-earthquake assessment and reinforcement efforts.
Additionally, the classification model acts as a gate at the front end of the cascade, eliminating a significant proportion of undamaged background images in a single pass. This allows the subsequent segmentation model to focus exclusively on potential damage areas. This strategy yields three benefits. First, computational load decreases as background images are pre-filtered. Second, pre-filtering reduces false detections on background, minimizing false positives during segmentation and improving the contour accuracy of cracks and exposed rebar. Third, when network connectivity is constrained at disaster sites, lightweight classification can be performed locally and only the damaged image blocks transmitted. This conserves data transfer traffic while meeting privacy and bandwidth constraints in emergency scenarios. Consequently, the framework balances speed, accuracy, and resource consumption, providing a scalable engineering pathway for large-scale, high-throughput safety screening of railway viaducts after an earthquake.
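As an illustration only, the gating logic of the cascade can be sketched in Python as follows. The function name two_phase_assess and the callables classify and segment are hypothetical placeholders for the trained classification and segmentation models; they are not taken from the original implementation.

```python
from typing import Callable, Dict, List, Tuple

BlockIndex = Tuple[int, int]      # (row, column) of a standardized image block
BACKGROUND = "background"         # class name used in the text


def two_phase_assess(
    blocks: Dict[BlockIndex, object],
    classify: Callable[[object], str],
    segment: Callable[[object], List[object]],
) -> Tuple[Dict[BlockIndex, str], Dict[BlockIndex, List[object]]]:
    """Classify every block, then segment only the blocks flagged as damaged."""
    labels: Dict[BlockIndex, str] = {}
    instances: Dict[BlockIndex, List[object]] = {}
    for index, block in blocks.items():
        label = classify(block)            # phase 1: inexpensive screening
        labels[index] = label
        if label != BACKGROUND:            # phase 2: skipped for background
            instances[index] = segment(block)
    return labels, instances
```

Because the segmentation callable is never invoked for background blocks, the cost of phase 2 scales with the number of damaged blocks rather than with the total number of blocks.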
Workflow for damage assessment
The workflow for damage assessment consists of two parts: the model part and the application part. As shown in Figure 3, the model part primarily encompasses the establishment of a standard database and model generation. The standard database, which provides data for model training and generation, is established based on existing high-resolution images of post-earthquake railway viaducts. These high-resolution images can be obtained through UAVs and cameras. Given that the pixel count of high-resolution images often reaches millions or even tens of millions, a single high-resolution image can be cropped into many standardized image blocks. This enables model training to be based on a small number of high-resolution images. Furthermore, 224 × 224 pixels is the input size of many well-known network architectures (Alipour and Harris, 2020). Hence, 224 × 224 pixels is chosen as the size of the standardized image blocks.
Figure 3. Workflow for damage assessment.
The standardized image blocks were classified into three types: background, concrete crack, and exposed rebar. For images classified as concrete crack and exposed rebar, the locations of the crack and rebar exposure were annotated separately to obtain labeled data of damage locations. Data from the standard database were split, with 80% allocated to the training dataset and the remaining 20% to the testing dataset. After training and testing the models with various parameter settings, the model with the best performance was selected as the optimal model.
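The 80/20 split described above can be illustrated with the following minimal sketch. The text does not state whether the split was random or stratified by class, so the seeded random shuffle and the function name split_dataset are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[object], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[object], List[object]]:
    """Shuffle the items reproducibly and cut them into training and testing parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)      # assumption: simple random split
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


train, test = split_dataset(range(7590))
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so that models trained with different parameter settings are compared on the same testing data.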
The application part can be systematically divided into five steps.
In step 1, UAVs and cameras are used to photograph the post-earthquake railway viaduct to obtain high-resolution images. It is worth noting that when assessing the concrete components of the railway viaduct, aftershocks may still occur. Inspectors or UAV operators can conduct inspections from relatively safe locations, effectively ensuring personnel safety. To address potential visual blind zones in UAV image acquisition, cameras can be used as auxiliary tools for supplementary photography.
Step 2 involves cropping or resizing the high-resolution images to meet the requirements of subsequent operations. If the length and width of the high-resolution image in pixels are both integer multiples of 224, Step 2 is skipped. As shown in Figure 4, when the length and width of the high-resolution image are L and W pixels respectively, the largest integers n and m such that 224n ≤ L and 224m ≤ W are taken, as given by the following equations:

$$n = \left\lfloor \frac{L}{224} \right\rfloor, \qquad m = \left\lfloor \frac{W}{224} \right\rfloor$$

Figure 4. Schematic diagram of cropping and resizing.
The cropping process is centered on the high-resolution image, using 224n and 224m as the length and width pixels respectively, and cropping the excess areas around the image. It should be noted that, due to the large number of pixels in high-resolution images, the proportion of the cropped area relative to the retained area is small. Moreover, when shooting high-resolution images, the damaged area is usually placed in the center of the image, and the importance of the surrounding areas is relatively low.
Resizing involves resampling the high-resolution image so that its length and width become 224n and 224m pixels, respectively. Unlike cropping, this process discards no region of the image. Although it changes the aspect ratio, the change is small and has a minimal impact on the original high-resolution image.
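The geometry of Step 2 can be sketched as below, under the assumption that L is the horizontal and W the vertical pixel count. The function names are hypothetical, and the crop box is returned in (left, top, right, bottom) order, the convention used by common imaging libraries such as Pillow.

```python
from typing import Tuple

BLOCK = 224  # side length of a standardized image block, in pixels


def grid_shape(length_px: int, width_px: int, block: int = BLOCK) -> Tuple[int, int]:
    """Return (n, m): the number of whole blocks along the length and the width."""
    n, m = length_px // block, width_px // block
    if n == 0 or m == 0:
        raise ValueError("image is smaller than one block")
    return n, m


def center_crop_box(length_px: int, width_px: int, block: int = BLOCK) -> Tuple[int, int, int, int]:
    """Centered crop box of size 224n x 224m as (left, top, right, bottom)."""
    n, m = grid_shape(length_px, width_px, block)
    left = (length_px - n * block) // 2
    top = (width_px - m * block) // 2
    return left, top, left + n * block, top + m * block


def resize_target(length_px: int, width_px: int, block: int = BLOCK) -> Tuple[int, int]:
    """Target size (224n, 224m) when resizing instead of cropping."""
    n, m = grid_shape(length_px, width_px, block)
    return n * block, m * block


print(grid_shape(1920, 1080), center_crop_box(1920, 1080), resize_target(1920, 1080))
```

When both dimensions are already multiples of 224, the crop box covers the whole image and the resize target equals the original size, which corresponds to skipping Step 2.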
Step 3 involves cropping the images that have undergone Step 2 operations into standardized image blocks. Specifically, the images are cropped in sequence into m*n standardized image blocks of 224 × 224 pixels, and all standardized image blocks are numbered based on their row and column positions, such as I34 representing the standardized image block cropped from the 3rd row and 4th column.
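A minimal sketch of the block extraction and numbering in Step 3 is given below. It represents an image as a plain list of pixel rows to stay dependency-free; the function name tile_blocks and the dictionary keyed by (row, column) are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def tile_blocks(image: List[List[int]], block: int = 224) -> Dict[Tuple[int, int], List[List[int]]]:
    """Cut an image (list of pixel rows) into square blocks keyed by 1-based (row, column)."""
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("image size must be a multiple of the block size (apply Step 2 first)")
    tiles = {}
    for i in range(height // block):
        for j in range(width // block):
            rows = image[i * block:(i + 1) * block]
            # key (i + 1, j + 1) plays the role of the index I_ij in the text
            tiles[(i + 1, j + 1)] = [row[j * block:(j + 1) * block] for row in rows]
    return tiles


# Small demonstration with 2 x 2 blocks on a 4 x 6 "image"
demo = [[10 * r + c for c in range(6)] for r in range(4)]
print(tile_blocks(demo, block=2)[(2, 3)])
```

Keeping the (row, column) key with each block is what later allows the classification and segmentation results to be placed back at the correct position in the full image.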
Step 4 utilizes the trained optimal classification model to perform classification prediction and preliminary damage assessment on the m*n standardized image blocks. The classification prediction results include three types: background, concrete crack, and exposed rebar. Based on the classification prediction results and the numbering of each standardized image block (Iij), the complete high-resolution image is reconstructed to assess the distribution of damage on the image.
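The reconstruction in Step 4 can be sketched as follows: the per-block labels are arranged into an m × n grid, and the share of blocks in each class gives a coarse coverage estimate. The function names label_grid and coverage are hypothetical, and block-count share is only one possible way to express coverage.

```python
from collections import Counter
from typing import Dict, List, Tuple


def label_grid(labels: Dict[Tuple[int, int], str], m: int, n: int) -> List[List[str]]:
    """Arrange per-block labels into m rows and n columns using the 1-based (i, j) indices."""
    return [[labels[(i, j)] for j in range(1, n + 1)] for i in range(1, m + 1)]


def coverage(labels: Dict[Tuple[int, int], str]) -> Dict[str, float]:
    """Fraction of blocks assigned to each class (a coarse, block-level estimate)."""
    counts = Counter(labels.values())
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


demo = {
    (1, 1): "background", (1, 2): "concrete crack",
    (2, 1): "background", (2, 2): "exposed rebar",
}
print(label_grid(demo, m=2, n=2))
print(coverage(demo))
```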
Step 5 involves using the optimal segmentation model to perform instance segmentation and detailed damage assessment on the standardized image blocks that were predicted to have concrete crack and exposed rebar in Step 4. It has been found that most areas in the high-resolution images obtained from photography are undamaged, i.e., background. When performing damage segmentation prediction, the standardized image blocks predicted to be background are excluded based on the classification prediction results.
After the proposed framework has been applied to the high-resolution images, the prediction results of image classification and instance segmentation can be checked for correctness. Once verified, these prediction results can serve as new data for model training and testing.
Image dataset preparation of reinforced concrete components in railway viaduct
Dataset description
Image filtering criteria.
Data processing
In order to maintain the original aspect ratio of the images as much as possible, the 880 high-resolution images obtained after screening were divided into standardized image blocks of 224 × 224 pixels. Given that the images in the dataset were 1920 × 1080 pixels, each image could be divided into 8 × 4 = 32 standardized image blocks. To facilitate the labeling of the standardized image blocks, 7590 standardized image blocks with relatively good image quality were selected.
It should be noted that although the Tokaido dataset contains label information on damage locations for high-resolution images, these labels are assigned on a per-pixel basis. When applying such label information to the YOLO algorithm, a corresponding transformation is required, which may lead to errors or inaccuracies in the label information. Therefore, in this study, the standardized image blocks used were meticulously classified and labeled with damage locations in a format suitable for the YOLO algorithm.
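For reference, a polygon label in the YOLO segmentation format is a single text line consisting of a class index followed by the polygon vertices, with coordinates normalized to the range 0 to 1. The sketch below writes such a line for one block; the function name and the class indices (0 for concrete crack, 1 for exposed rebar) are assumptions, since the text does not list the indices used.

```python
from typing import List, Tuple


def yolo_polygon_line(class_id: int, polygon_px: List[Tuple[float, float]], block: int = 224) -> str:
    """Format one polygon (pixel coordinates within a block) as a YOLO segmentation label line."""
    if len(polygon_px) < 3:
        raise ValueError("a polygon needs at least three vertices")
    coords = []
    for x, y in polygon_px:
        coords.append(f"{x / block:.6f}")   # normalize x by the block width
        coords.append(f"{y / block:.6f}")   # normalize y by the block height
    return " ".join([str(class_id)] + coords)


# Hypothetical class index 0 (concrete crack) and a triangular outline
print(yolo_polygon_line(0, [(0, 0), (112, 0), (112, 224)]))
```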
The 7590 standardized image blocks were classified into three types: background, concrete crack, and exposed rebar. As shown in Figure 5(a), after classifying the standardized image blocks, 3612 background, 2299 concrete crack, and 1679 exposed rebar standardized image blocks were obtained. They were then used for the training and testing of the classification model. In addition, 1000 images were randomly selected from the standardized image blocks classified as concrete crack and exposed rebar, respectively, to label the damage locations.
Figure 5. Example results of image processing. (a) Classification results, (b) Results of damage location labelling.
Dataset specifications for model training and testing.
Results and discussion
Training procedure
The training procedure is designed to optimize model performance while maintaining computational efficiency. Both phases of the framework’s training workflow are executed within Ultralytics’ YOLO module. The phase 1 classification task uses the official pre-trained weights yolo11n-cls, employs the adaptive moment estimation (Adam) optimizer with cosine learning rate scheduling, and trains for 500 epochs. No additional data annotation is required, and images are stored in separate folders categorized as background, concrete crack, and exposed rebar. For the segmentation task in phase 2, the yolo11-seg network architecture is loaded first, followed by the yolo11n-seg pre-trained weights. Adam and cosine scheduling are also used, with a batch size of 16 and 500 epochs of training. Damage location labels are formatted as YOLO polygons. After each training epoch, the models of both phases are evaluated on precision and loss metrics to monitor training progress and performance.
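A configuration along these lines could be expressed with the Ultralytics Python API roughly as follows. This is a sketch under stated assumptions rather than the authors' script: the file names with .pt and .yaml extensions, the dataset paths passed by the caller, and imgsz=224 are assumptions; the classification batch size is not given in the text and is therefore left at the library default.

```python
# Hyperparameters stated in the text; imgsz=224 is an assumption based on the block size.
CLS_ARGS = {"epochs": 500, "optimizer": "Adam", "cos_lr": True, "imgsz": 224}
SEG_ARGS = {"epochs": 500, "optimizer": "Adam", "cos_lr": True, "batch": 16, "imgsz": 224}


def train_two_phases(cls_data_dir: str, seg_data_yaml: str):
    """Train the phase 1 classifier and the phase 2 segmenter (paths are supplied by the caller)."""
    # Imported lazily so the settings above can be inspected without the package installed.
    from ultralytics import YOLO

    # Phase 1: start from the pre-trained classification weights.
    cls_model = YOLO("yolo11n-cls.pt")
    cls_model.train(data=cls_data_dir, **CLS_ARGS)

    # Phase 2: build the segmentation architecture, then load the pre-trained weights.
    seg_model = YOLO("yolo11n-seg.yaml").load("yolo11n-seg.pt")
    seg_model.train(data=seg_data_yaml, **SEG_ARGS)
    return cls_model, seg_model
```

For the classification task, the data argument points to a directory whose subfolders are named after the classes, which matches the folder-per-class storage described above.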
The training was carried out on a workstation equipped with an Intel(R) Core(TM) i9-14900 processor with a base speed of 3.20 GHz, 24 cores, and 96 GB of memory. In addition, the workstation used an NVIDIA GeForce RTX 4080 SUPER GPU to accelerate image processing and computation. This hardware configuration allowed both the classification model and the segmentation model to be trained on the dataset within reasonable timeframes.
Evaluation metrics
In the training process, selecting appropriate metrics for model evaluation is crucial to comprehensively assess performance. The selected metrics effectively capture both the classification and segmentation capabilities of the two-phase framework.
Accuracy, precision, recall, and intersection over union (IoU) (Zheng et al., 2020) are commonly used metrics for evaluating model prediction performance. Accuracy is the proportion of correct predictions among all samples, precision is the proportion of truly positive samples among those predicted as positive, recall is the proportion of correctly predicted positive samples among those that are truly positive, and IoU is the ratio of the intersection to the union of the predicted and actual regions for each type. With TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, and A and B denoting the predicted and actual regions, the standard forms of these metrics are:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{IoU} = \frac{|A \cap B|}{|A \cup B|}$$
In this study, accuracy, precision, and recall are used as the evaluation metrics for image classification, while precision, recall, and IoU are used for instance segmentation.
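The verbal definitions of the four metrics can be made concrete with a short sketch. The confusion-matrix orientation (rows for the true class, columns for the predicted class), the representation of regions as sets of pixel coordinates, and all function names are assumptions chosen for illustration.

```python
from typing import List, Set, Tuple


def accuracy(confusion: List[List[int]]) -> float:
    """Share of correct predictions; confusion[t][p] counts true class t predicted as p."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


def precision_recall(confusion: List[List[int]], k: int) -> Tuple[float, float]:
    """Precision and recall of class k, treating k as the positive class."""
    true_positive = confusion[k][k]
    predicted_k = sum(row[k] for row in confusion)   # column sum: all predicted as k
    actual_k = sum(confusion[k])                     # row sum: all truly of class k
    precision = true_positive / predicted_k if predicted_k else 0.0
    recall = true_positive / actual_k if actual_k else 0.0
    return precision, recall


def mask_iou(predicted: Set[Tuple[int, int]], actual: Set[Tuple[int, int]]) -> float:
    """Intersection over union of two pixel sets."""
    union = predicted | actual
    return len(predicted & actual) / len(union) if union else 1.0


demo = [[8, 1, 1], [0, 9, 1], [2, 0, 8]]   # made-up counts, not results from the paper
print(accuracy(demo), precision_recall(demo, 0))
```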
Damage classification results
The classification model was trained and tested, with its training and testing performance shown in Figure 6. Figure 6(a)–(c) illustrate the changes in evaluation metrics as the number of epochs increases. Over epochs 0 to 150, the training loss and testing loss first decrease rapidly, after which the decline gradually slows and the losses eventually stabilize. The testing accuracy follows the opposite trend, remaining above 95% in the middle and later epochs.
Figure 6. Damage classification performances of training and testing. (a) Training loss, (b) Testing loss, (c) Testing accuracy, (d) Confusion matrix, (e) Normalized confusion matrix.
It should be noted that when the number of epochs exceeds 150, the model’s performance during both training and testing becomes highly stable. To better present the changes in model performance, the figure only shows the model’s performance for epochs below 150. All the above results indicate that the classification model has been effectively trained and can achieve good predictive performance.
Figure 6(d) and (e) present the confusion matrix and the normalized confusion matrix of the test results, respectively. These matrices intuitively illustrate the prediction performance of the classification model on the test set. The vast majority of the standardized image blocks were accurately classified, with the classification accuracy for each category exceeding 96%. Notably, the classification accuracy for image blocks categorized as background and concrete crack was above 99%.
Performance on test set of classification model.
The standardized image blocks, after damage classification, were reassembled according to their numbering Iij. Combined with the damage classification results, this process yielded a preliminary damage assessment of the original high-resolution image. As illustrated in the example of Figure 7, the upper half of the figure displays the original high-resolution image, while the lower half presents the preliminary damage assessment. By comparing the original high-resolution image with the preliminary damage assessment, it can be observed that the latter accurately identifies the composition and general locations of various damage types in the original image. This enables a preliminary determination of whether a particular location in the image is background, concrete crack, or exposed rebar, thereby providing a basis for damage assessment.
Figure 7. Example results of preliminary damage assessment.
Damage segmentation results
After the damage classification was completed, the standardized image blocks classified as concrete crack and exposed rebar were subjected to damage segmentation to accurately obtain the specific locations of the damage. The training and testing performance of the segmentation model is shown in Figure 8. The training loss decreased rapidly at first and then decreased at a relatively constant rate as the number of epochs increased. The testing loss also decreased rapidly at first, then decreased slowly, and eventually stabilized. As the number of epochs increased, the precision quickly rose from 50% to around 80%, and then gradually increased and stabilized near 85%.
Figure 8. Damage segmentation performances of training and testing. (a) Training loss, (b) Testing loss, (c) Testing precision.
Performance on test set of segmentation model.
The standardized image blocks, after damage segmentation, were reassembled according to their numbering Iij, thereby obtaining the detailed damage assessment of the original high-resolution images. As shown in the example of Figure 9, for ease of comparison and analysis, the figure is divided into three parts: top, middle, and bottom. These three parts represent the original high-resolution image, the preliminary damage assessment, and the detailed damage assessment, respectively. By comparison, it can be found that the damage locations obtained from the detailed damage assessment are very close to the actual damage locations in the original high-resolution image. This indicates that the detailed damage assessment can accurately identify the specific locations of the damage, providing precise quantitative results for damage assessment.
Figure 9. Example results of detailed damage assessment.
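Reassembly of the segmentation results amounts to shifting each block-local polygon by the offset of its block. A minimal sketch is given below, assuming polygons are expressed in normalized block coordinates and blocks are indexed by 1-based (row, column); the function name to_image_coords is hypothetical.

```python
from typing import List, Tuple


def to_image_coords(
    block_index: Tuple[int, int],
    polygon_norm: List[Tuple[float, float]],
    block: int = 224,
) -> List[Tuple[float, float]]:
    """Map a polygon in normalized block coordinates to pixel coordinates of the tiled image."""
    i, j = block_index                 # 1-based row i and column j, as in I_ij
    x0 = (j - 1) * block               # horizontal offset of the block
    y0 = (i - 1) * block               # vertical offset of the block
    return [(x0 + x * block, y0 + y * block) for x, y in polygon_norm]


# Block in the 3rd row and 4th column
print(to_image_coords((3, 4), [(0.0, 0.0), (0.5, 1.0)]))
```

The resulting coordinates refer to the image after Step 2; if the image was cropped, the crop offset must be added to recover positions in the original photograph.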
Comparison of different methods for damage segmentation of exposed rebar.
The proposed method achieves precision and recall of 91.1% and 96.3%, respectively, both exceeding 90%. Comparative analysis reveals that the method proposed by Wang et al. (2023) ranks second in precision at 72.4%, but its relatively low recall suggests limitations in identifying positive samples. Conversely, the method proposed by Shu et al. (2025) achieves a higher recall (86.5%) at the expense of reduced precision.
Existing methods primarily rely on convolutional neural network architectures. Their inherent structure, combined with extremely imbalanced positive and negative samples, easily leads to an imbalance between precision and recall. Models tend to over-detect to boost recall, sacrificing precision in the process. Even introducing specific attention modules fails to reconcile the balance between these two metrics. The proposed two-phase framework design effectively filters out extensive background regions, significantly reducing false detections. Concurrently, YOLO11 incorporates advanced attention mechanisms that enhance the recognition of small objects and complex scene features. This enables the proposed method to achieve an optimal balance between precision and recall.
The analysis and discussion in this section validate the effectiveness and accuracy of the proposed method, which can serve as a reference for the research and application of damage assessment of concrete components in post-earthquake railway viaducts.
Computational efficiency analysis
Computational efficiency results of classification model and segmentation model on the test set.
It is worth noting that the proportion of image types directly impacts computational efficiency. Assuming the ratio of background, concrete crack, and exposed rebar images is nb:nc:1, the frame rate and single-image inference time for any given ratio can be calculated based on Table 7. Figure 10 illustrates the computational efficiency comparison between the proposed method and the direct segmentation method, where RF and RI denote the ratio of frame rate and single-image inference time of the former to the latter, respectively. The black straight lines in Figure 10(a) and (b) represent the baseline where RF or RI equals one, with the area above the line indicating computational performance superiority of the proposed method over the direct segmentation method, and the area below indicating the opposite. Solving the equation corresponding to the baseline reveals that the proposed method outperforms the direct segmentation method in frame rate and single-image inference time when the proportion of background images exceeds 43.5% and 18.7% of the total image count, respectively.
Figure 10. Comparison of computational efficiency between the proposed method and direct segmentation method. (a) Ratio of frame rate (RF), (b) Ratio of inference time (RI), (c) RF and RI at a 3:1:1 image type ratio.
Although the proposed method requires background images to exceed a certain threshold proportion to demonstrate its advantages, in practical engineering scenarios, the area of damaged pixels is far smaller than that of intact background. Background images often constitute the overwhelming majority (Nayyeri et al., 2019; Zou et al., 2019), and in some cases may even exceed 90% (Xie et al., 2022). Taking Figure 10(c) as an example, when nb:nc:1 = 3:1:1 (background images constitute 60% of the total), the proposed method achieves a 20% increase in frame rate and a 44% reduction in single-image inference time compared to direct segmentation. When processing large-scale image datasets, this approach significantly reduces computational cost.
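The dependence of the inference-time advantage on the background share can be illustrated with a deliberately simplified cost model. It assumes one classification time t_cls for every block and one segmentation time t_seg for every segmented block regardless of damage type; the timings used below are hypothetical, and the model is not the calculation behind Table 7 or Figure 10, so it does not reproduce the thresholds reported above.

```python
def inference_time_ratio(p_background: float, t_cls: float, t_seg: float) -> float:
    """Mean per-block time of the cascade divided by that of direct segmentation."""
    if not 0.0 <= p_background <= 1.0:
        raise ValueError("p_background must lie in [0, 1]")
    proposed = t_cls + (1.0 - p_background) * t_seg   # classify all, segment damaged only
    direct = t_seg                                    # segment every block
    return proposed / direct


def break_even_background_fraction(t_cls: float, t_seg: float) -> float:
    """Background share above which the cascade is faster under this simplified model."""
    return t_cls / t_seg


# Hypothetical timings in milliseconds, chosen only for illustration
print(break_even_background_fraction(1.0, 4.0))
print(inference_time_ratio(0.6, 1.0, 4.0))
```

Under this model the cascade pays the classification cost on every block and recovers it only through the segmentation calls it avoids, which is why a minimum background share is needed before it becomes faster.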
Conclusions
This study proposes a novel two-phase deep learning-based framework for post-earthquake damage assessment of reinforced concrete components in railway viaducts. The key findings and contributions can be summarized as follows.
(1) The proposed two-phase framework integrates image classification and instance segmentation techniques to achieve both preliminary and detailed damage assessments. By first classifying image blocks into background, concrete crack, or exposed rebar categories, and then selectively applying instance segmentation only to damaged areas, the method substantially reduces computational cost while maintaining high assessment accuracy.
(2) The systematic workflow from image acquisition to final detailed damage assessment provides a practical implementation procedure that prioritizes both assessment accuracy and operational efficiency. By standardizing image blocks to 224 × 224 pixels, the method enables efficient processing of high-resolution images, facilitating model computation and rapid damage assessment.
(3) The classification model demonstrated exceptional performance, achieving average accuracy, precision, and recall rates all exceeding 98%. Particularly noteworthy was the classification accuracy for background and concrete crack categories, which surpassed 99%, enabling reliable preliminary damage assessment and effective filtering of undamaged regions.
(4) The segmentation model showed strong performance with average mAP50 over 87%, particularly for exposed rebar detection, with precision and recall rates of 91.1% and 96.3%, respectively. Comparative analysis with state-of-the-art methods verified that the proposed method outperforms existing approaches, achieving optimal balance between segmentation precision and recall for critical infrastructure damage assessment.
While the current implementation demonstrates good performance for concrete crack and exposed rebar detection, future work could explore integration with structural health monitoring systems. Real-time assessment could further enhance the practical utility in post-earthquake decision-making regarding structural repair priorities.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: this work was supported by the National Key Research and Development Program of China (grant no. 2024YFC3810504), and the Southeast University’s Start-up Research Fund (grant no. RF1028623304).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
