Abstract
Automated computer vision-based inspection of railway infrastructure, covering component type, damage status, and location, has been actively investigated using task-specific deep learning models. However, task-specific models that handle these inspection tasks separately have hit bottlenecks in inference accuracy and incur high computational costs. Multi-task deep learning, which can fulfill these inspection tasks concurrently, has yet to be fully investigated in the context of structural inspection. In this study, a multi-scale task interaction deep learning strategy is presented for component recognition, damage segmentation, and depth estimation, enabling a comprehensive post-earthquake inspection of high-speed railway viaducts. Three modules for multi-scale task interaction are proposed to modify the multi-task deep neural network, taking full advantage of task commonalities at multiple scales. The proposed method was validated on a large-scale image dataset of high-speed railway viaducts. Component recognition and depth estimation were combined for multi-task learning since they have higher pattern affinities at multiple scales. On the testing samples, the mean Intersection over Union (mIoU) of the component and damage tasks was 91.2% and 72.1%, respectively, and the RMSE of depth estimation was 1.54 m. Compared with the single-task cases, the training time, inference duration, and FLOPs of the multi-task model were reduced by 23%, 30%, and 27%, respectively. The results show improvements in both inference accuracy and training efficiency, substantiating the advantage of the proposed strategy.
Keywords
Introduction
Viaducts are widely adopted as the basic form of infrastructure by high-speed railways to reduce the segmentation of near-ground space and reach reliable structural stability. In past decades, numerous high-speed railway viaducts have been built worldwide, and some critical ones have already reached their nominal service life. This signifies pressing structural maintenance and condition inspection issues, especially after catastrophic disasters such as earthquakes. Component recognition is the preliminary means and first step of post-earthquake condition inspection that can assess the structural system integrity. Damage recognition is a crucial inspection step for assessing the destruction extent after earthquakes. In monocular depth maps, each pixel is assigned a number indicating the distance between the object represented and the picture-shooting point in the real world. Consequently, depth estimation can provide relative geographic location and geometric structure of the surrounding environment, which bears the potential of providing guidance for the navigation of unmanned aerial vehicle (UAV) and rescue of victims after disasters.
To the majority of railway infrastructures, human-conducted visual inspection dominates the practice, which is laborious and cost-consuming, and manual on-site inspection may cause service disturbance and even interruption. Contact-sensing measurements have also been attempted in both research communities and practical engineering. Dense sensor arrays, albeit technologically mature and reliable in precision, are also limited by their low cost-effectiveness (Spencer et al., 2019) and malfunction risk during extreme events. In the past few decades, the rise of computer vision (CV) techniques shed light on overcoming inherent obstacles of sensing measurement and holds promise towards replacing human-oriented inspection. Nevertheless, captured images are subject to extensively varying backgrounds and the quality cannot be guaranteed. Consequently, there is a pressing need to introduce deep learning solutions to automate these CV issues of railway viaduct inspection so that assessment and structural health monitoring (SHM) can be facilitated (Adibimanesh et al., 2023; Bagherzadeh et al., 2023).
Prior studies have been dedicated to component recognition or detection resorting to deep learning models. Gao and Mosalam (2018) introduced VGGNet and transfer learning to classify images containing components into two classes: beam/column and wall. Liang (2019) adopted object detection convolutional neural network (CNN) to conduct component-level bridge column detection and observed promising results. Resorting to semantic segmentation networks, architectures aiming to extract component from images were developed by Narazaki et al. (2017, 2020), Sajedi and Liang (2021), and Park et al. (2021). Besides, component can also be recognized from video data by deep learning models, representative contributions were made by Narazaki et al. (2018) and Karim et al. (2022). Deep learning models have shown great potential for structural component recognition from images.
Damage recognition has likewise received considerable attention, with convincing results. Cracks are important signals of structural degradation and may even mark the inception of catastrophic failure. Consequently, much prior work has focused on crack detection (Calderón and Bairán, 2017). Cha et al. (2017) trained a CNN with a sliding window to classify concrete image patches as crack or non-crack, regardless of crack size. This model is regarded as a patch-wise architecture, a family that covers classification and object detection architectures. Similar methods were thoroughly investigated by Kim et al. (2019), Dorafshan et al. (2018), Jang et al. (2019), Chen and Jahanshahi (2018), Karaaslan et al. (2019), and so forth. Although computationally efficient, these approaches tend to lose the details of the crack profile (Dong and Catbas, 2021). In contrast, pixel-wise approaches directly extract cracks from backgrounds at the pixel level using semantic segmentation networks. Such models, with satisfactory results, were reported by Dung and Anh (2019), Ni et al. (2019a), Ren et al. (2020), Zhang et al. (2019a), Ni et al. (2019b), Zhao et al. (2022), Ding et al. (2023), and so forth. Previous studies also made significant efforts toward detecting multiple types of damage (Cha et al., 2018; Li et al., 2019; Mundt et al., 2019). It can be concluded that deep learning architectures are competent to recognize and segment structural damage from images.
Although component and damage recognition methods resorting to deep learning models have been actively investigated, largely to this date, depth estimation is still the exception and has yet to be fully researched in the context of structural inspection.
Several critical limitations of the aforementioned task-specific deep neural networks also remain unsolved. Conventional single-task architectures face a bottleneck beyond which accuracy in component or damage recognition cannot be further improved. As typical data-driven models, these task-specific models also require huge computational costs. In the CV community, multi-task learning is attracting increasing attention, bearing the potential of improving generalization ability and inference accuracy. The main idea of multi-task deep neural networks is that desired predictions can be inferred from related tasks by leveraging domain-specific information. Shared representations can be used across the supervisory signals of multiple tasks, so that computational efficiency can be improved. Moreover, in current practice, post-earthquake assessments should be carried out cautiously, taking multiple kinds of inspection information into account concurrently. Consequently, it is crucial for post-earthquake inspections that component recognition, damage recognition, and depth estimation are implemented concurrently by multi-task deep learning models.
Regrettably, multi-task deep learning solutions that include depth estimation have yet to be fully researched. Prior works presented some preliminary attempts at multi-task learning for CV-based structural inspection. Hoskere et al. (2020) proposed a deep CNN to carry out multi-task learning in terms of material and damage. Ye et al. (2022) adopted a multi-task HRNet to recognize components and damage by weighting and combining the loss functions of task-specific models. Li et al. (2023) developed a multi-task CNN to conduct structural component detection and damage state estimation, with three sub-tasks incorporated as auxiliary tasks. The common scheme followed by these multi-task deep learning architectures is that tasks interact at a predesignated scale or a specific receptive field. Nevertheless, it is observed that segmentation and depth estimation tasks can be related to each other to different extents at varying receptive fields (Vandenhende et al., 2020). The inference accuracy can be further improved if multi-task models take full advantage of related representations at multiple scales.
This paper presents a state-of-the-art task interaction strategy at multiple scales for component recognition, damage segmentation, and depth estimation in post-earthquake inspection of high-speed railway viaducts. The strategy is implemented on a multi-task deep neural network architecture that improves computational simplicity. A comprehensive post-earthquake inspection framework for high-speed railway viaducts is proposed based on this strategy. A large-scale synthetic image dataset was employed to validate the proposed strategy. The novelty of the presented strategy is that commonalities and differences among tasks at varying scales can be fully used, so that the accuracy advantage of multi-task models is further improved. Filling the current research gap, the contributions of the proposed strategy are (1) improved inference accuracy, attributed to associated tasks sharing multi-scale complementary representations and features, and (2) increased training and inference speed as well as computational simplicity, attributed to avoiding repeated calculation in shared layers.
Methodology
The proposed inspection framework is summarized as a flow chart in Figure 1. After a seismic excitation is captured, UAVs navigate the affected area to capture the required images of the target railway viaducts. Three tasks (component recognition, damage segmentation, and depth estimation) are involved in the comprehensive post-earthquake inspection. Task affinities at each scale were quantified. Component recognition and depth estimation presented the highest affinities at every scale and were therefore combined for multi-task learning. The damage segmentation task was implemented in isolation by a single-task network. Three modules were introduced to implement multi-scale task interaction. Subsequently, the two networks output prediction images in parallel, providing crucial guidance and reference for structural assessments and decisions.
Figure 1. The proposed post-earthquake inspection framework of high-speed railway viaducts.
Multi-task deep learning in computer vision
Existing multi-task deep learning architectures in CV for dense predictions can be broadly categorized into two main groups: encoder-focused and decoder-focused (Vandenhende et al., 2022). Encoder-focused architectures share task features in the encoding stage, before processing them with task-specific heads. Inferences are generated once, in parallel or sequentially. In contrast, decoder-focused architectures exchange information in the decoding stage. Features predicted initially are used to improve each task output in a one-off or recursive manner. Encoder-focused architectures fail to capture commonalities and differences among tasks since they directly output all predictions in one processing cycle. Such commonalities and differences, however, are likely to be fruitful for improving prediction accuracy. Consequently, a decoder-focused architecture is utilized in the present study. Multi-task architectures of this kind bring definite merits (Vandenhende et al., 2022). First, the memory footprint is substantially reduced owing to the shared layers. Second, learning and prediction are visibly faster because features in shared layers are not calculated repeatedly. Moreover, they have shown the potential to report higher accuracy and robustness when associated tasks share complementary information or act as a regularizer for one another.
Pattern affinities of component, damage, and depth tasks at multiple scales
Which tasks should be incorporated to conduct multi-task learning and how to define task affinities are still critical questions. How many and which scales should be predesignated to carry out multi-scale task interaction should also be clarified, so that tasks could take full advantage of features and representations from each other.
The first assumption is that component recognition and depth estimation tasks have higher pattern affinity, since the way depth values change often implies a difference in component type. For example, depth values of pixels representing the rails often change linearly along the direction in which the rails extend. To substantiate this assumption and quantify the degree to which tasks share common local structures with respect to scale, pixel affinities of each pair of tasks were calculated at different scales on the label space (Zhang et al., 2019b). For semantic segmentation tasks (component and damage), a pair of pixels was regarded as similar when both belong to the same class. For depth estimation, a pair of pixels whose relative difference is below a threshold was regarded as similar. After the pixel affinities were calculated for each task, how well similar and dissimilar pairs of pixels are matched across tasks was quantified. This calculation was repeated at every scale. Pattern affinities of different pairs of tasks at varying scales are illustrated in Figure 2. Component recognition and depth estimation presented the highest affinities at every scale, substantiating our intuitive assumption. Consequently, component recognition and depth estimation tasks were combined for multi-task learning. The damage segmentation task was implemented in isolation.
Figure 2. Affinities of different pairs of tasks across scales.
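The affinity calculation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 5% relative depth threshold, and the use of right and bottom neighbours at a fixed pixel offset `step` as a stand-in for "scale" are all assumptions made for the example.

```python
def seg_similar(a, b):
    # Segmentation labels: a pair is similar when both pixels share a class.
    return a == b

def depth_similar(a, b, rel_threshold=0.05):
    # Depth: a pair is similar when the relative difference is small.
    # The 5% threshold is an assumed value, not taken from the paper.
    return abs(a - b) / max(a, b) <= rel_threshold

def pair_affinities(label_map, similar, step):
    """Binary affinity of each pixel with its right and bottom neighbour."""
    h, w = len(label_map), len(label_map[0])
    flags = []
    for y in range(h):
        for x in range(w):
            if x + step < w:
                flags.append(similar(label_map[y][x], label_map[y][x + step]))
            if y + step < h:
                flags.append(similar(label_map[y][x], label_map[y + step][x]))
    return flags

def task_affinity(map_a, sim_a, map_b, sim_b, step=1):
    """Fraction of pixel pairs whose similar/dissimilar flag agrees across tasks."""
    flags_a = pair_affinities(map_a, sim_a, step)
    flags_b = pair_affinities(map_b, sim_b, step)
    matches = sum(1 for p, q in zip(flags_a, flags_b) if p == q)
    return matches / len(flags_a)

# Toy 2 x 4 label maps (hypothetical values).
component = [[0, 0, 1, 1], [0, 0, 1, 1]]
depth = [[10.0, 10.1, 20.0, 20.2], [10.0, 10.1, 20.0, 20.2]]
damage = [[0, 1, 0, 0], [0, 0, 0, 0]]

comp_depth = task_affinity(component, seg_similar, depth, depth_similar)
comp_damage = task_affinity(component, seg_similar, damage, seg_similar)
```

In this toy case the depth map changes exactly where the component class changes, so the component/depth pair agrees on every pixel pair, while the component/damage pair does not.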
Another conclusion reported by Figure 2 is that affinities across tasks are highly dependent on the receptive field (i.e., scale). Satisfactory agreements are reached between quantified task affinities and our initial consideration of the necessity for utilizing multi-scale task interaction strategy. Moreover, it can be concluded that tasks head towards lower affinities when the receptive field gets larger, and scale 1/64 has the highest affinities for each task pair. On the other hand, the image size of the dataset used in the present study is 1920 pixels ×1080 pixels, and the images were resized to 640 pixels ×360 pixels (refer to “Dataset” Section for more details). For the resized image size, the scale 1/64 leads to small image patches and patches with too small sizes lose semantic context information. Moreover, patches with large scale such as 1/2 have low task affinities and lead to huge computational expenses. Consequently, scales 1/2 and 1/64 were discarded, and four scales: 1/4, 1/8, 1/16, and 1/32 were selected to carry out multi-scale task interaction.
Multi-scale task interaction strategy
The aforementioned decoder-focused architectures follow a pattern in which task interactions and multi-modal distillation are implemented at a specific scale. This depends on the strict assumption that all relevant task interactions can be modeled solely through a designated filter operation with a specific receptive field (Vandenhende et al., 2022). Nevertheless, this assumption does not always hold and does not even agree with intuition. Consider an example using the component recognition and depth estimation tasks. Intuitively, as illustrated in Figure 3, local patches in the depth map provide less information for component recognition than patches at a more global scale in this scene. The shape of the component is revealed when the receptive field is enlarged, hinting at the semantic information of the scene. However, local patches cannot be ignored since they can improve the alignment of component edges for the segmentation task. Consequently, we contend that high task affinity at a specific scale may not be retained at other scales, and vice versa. This necessitates a multi-scale task interaction strategy that makes multi-task learning architectures more sophisticated and robust. Three modules were introduced to implement multi-scale task interaction for the multi-task structural inspection (Vandenhende et al., 2020).
Figure 3. Example of the implications of the scale on task interaction.
Multi-scale multi-modal distillation module
As shown in Figure 4, a multi-scale feature representation is extracted from the input image by an off-the-shelf backbone network and is used to make initial task inferences at each scale through task-specific heads. A set of task-specific representations at varying scales is thus obtained. Subsequently, task features are refined by distilling information from the other tasks, which is called “multi-modal distillation”. A spatial attention mechanism (Mnih et al., 2014) was employed to implement the distillation, guiding the information transfer between the feature maps generated from different modalities for different tasks. The attention mechanism is proven to excel at choosing useful information; it is utilized as a gate function for flow control, enabling the network to automatically concentrate on or ignore semantic information from other features. For the ith training sample, when the information is passed to task t at scale s, the attention map
Figure 4. Proposed multi-task neural network: (a) multi-scale task interaction architecture, (b) backbone: HRNet.
Figure 5. Multi-modal distillation with attention mechanism.
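The gating idea behind multi-modal distillation can be illustrated with a small sketch. It assumes one common form of attention-gated message passing, in which a sigmoid gate computed from each source task's features controls how much of those features is added to the target task's features. The scalar `gate_weight` is a hypothetical stand-in for a learned convolution, and single-channel maps are used only to keep the example short.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def distill(target, sources, gate_weight=1.0):
    """Refine `target` (H x W) with gated messages from other tasks.

    sources: list of H x W feature maps from the other tasks at the same scale.
    gate_weight: hypothetical scalar standing in for a learned convolution.
    """
    refined = [list(row) for row in target]
    for source in sources:
        for y, row in enumerate(source):
            for x, value in enumerate(row):
                # Spatial attention: a value in (0, 1) per pixel.
                gate = sigmoid(gate_weight * value)
                # The gate decides how much of the source feature passes through.
                refined[y][x] += gate * value
    return refined

# Toy features at one scale (hypothetical values).
component_feat = [[0.0, 0.0]]
depth_feat = [[0.0, 2.0]]
refined_component = distill(component_feat, [depth_feat])
```

A pixel where the source feature is zero contributes nothing, while a strongly activated pixel passes most of its value to the target task.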

Cross-scale feature propagation module
A feature harmonization block is utilized to combine the task features from the previous, lower-resolution scale into a shared representation, which is then used to refine those task features before they are passed to the task-specific head at the next, higher-resolution scale. The cross-scale feature propagation module is illustrated in Figure 6.
Figure 6. Cross-scale feature propagation module.
To be specific, the first step is feature harmonization. Features of N tasks are input in this module with the shape of C × H × W and concatenated to obtain a representation with the shape of C·N × H × W. This representation is further processed by a non-linear function, and the output is split into N blocks along the channel dimension. Softmax function is then utilized along the task dimension to generate a task attention mask. The concatenated representation is further processed to reduce the number of channels from C·N to C. The final output of the feature harmonization block is a shared representation that implies information from all tasks.
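The harmonization step can be sketched for a single spatial position as follows. This is an illustrative simplification: `tanh` stands in for the learned non-linear function, and a mask-weighted sum across tasks stands in for the learned reduction from C·N to C channels; neither choice is taken from the paper.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def harmonize(task_features):
    """Combine N task feature vectors (C channels each) into one shared vector.

    task_features: list of N lists, each holding C channel values
    at one spatial position.
    """
    n_tasks = len(task_features)
    n_channels = len(task_features[0])
    shared = []
    for c in range(n_channels):
        # Non-linear scores per task (tanh is an assumed stand-in).
        scores = [math.tanh(task_features[t][c]) for t in range(n_tasks)]
        # Softmax along the task dimension gives the task attention mask.
        mask = softmax(scores)
        # Assumed reduction: mask-weighted sum over tasks, C*N -> C channels.
        shared.append(sum(mask[t] * task_features[t][c] for t in range(n_tasks)))
    return shared

# Two tasks, two channels (hypothetical values).
shared_equal = harmonize([[1.0, 2.0], [1.0, 2.0]])
shared_mixed = harmonize([[0.0, 3.0], [2.0, 1.0]])
```

Because the mask sums to one over tasks at every channel, each shared value lies between the smallest and largest task value for that channel.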
Subsequently, the second step is the refinement. The refinement is implemented by selecting relevant information from the shared representation through a task-specific channel-gating function. In this architecture, the channel gating mechanism is used as a squeeze-and-excitation (SE) block (Hu et al., 2020), as shown in Figure 6. SE block is a computational unit that transforms the number of channels of a given input
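The channel gating performed by a squeeze-and-excitation block can be sketched as follows. The weight matrices `w_reduce` and `w_expand` are hypothetical placeholders for the two learned fully connected layers, and biases are omitted for brevity.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_gate(channels, w_reduce, w_expand):
    """Rescale each channel map by a gate in (0, 1).

    channels: list of C feature maps, each an H x W nested list.
    w_reduce: hidden x C weights; w_expand: C x hidden weights
    (hypothetical values, normally learned).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in channels]
    # Excitation: bottleneck with ReLU, then one sigmoid gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_expand]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, channels)]

# Two channels of size 1 x 2; zero weights make every gate equal to 0.5.
features = [[[2.0, 4.0]], [[1.0, 1.0]]]
gated = se_gate(features, w_reduce=[[0.0, 0.0]], w_expand=[[0.0], [0.0]])
```

With trained weights, the gates differ per channel, so each task can emphasize the channels of the shared representation that are relevant to it.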
Feature aggregation module
Distilled features at every scale after multi-scale multi-modal distillation are up-sampled to the highest scale and concatenated. The final inferences can be then obtained by decoding these ultimate feature representations for every task by task-specific heads again.
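The aggregation step can be sketched as below. Nearest-neighbour up-sampling is an assumption made to keep the example dependency-free; the interpolation method actually used is not stated in the text, and the function names are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Repeat every value `factor` times along both axes."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def aggregate(features_by_scale):
    """Bring all scales to the finest resolution and stack their channels.

    features_by_scale: list of (channels, factor) pairs, where `channels`
    is a list of H x W maps and `factor` is the up-sampling factor needed
    to reach the finest resolution.
    """
    stacked = []
    for channels, factor in features_by_scale:
        stacked.extend(upsample_nearest(ch, factor) for ch in channels)
    return stacked

# One channel at the finest scale (2 x 2) and one at half resolution (1 x 1).
fine = [[[1, 2], [3, 4]]]
coarse = [[[9]]]
fused = aggregate([(fine, 1), (coarse, 2)])
```

The stacked result is what a task-specific head would decode into the final prediction for that task.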
To sum up, starting from an off-the-shelf backbone that extracts multi-scale features, initial task inferences are predicted at each scale. Task features are distilled separately to capture task interactions at multiple scales. After distillation, the distilled task features from all scales are aggregated to make final task predictions. To further boost performance, a feature propagation module is introduced to extend the framework that passes distilled representations from smaller scales to larger ones. Image dataset in this study is in high resolution. Consequently, HRNet (Sun et al., 2019; Wang et al., 2021) was adopted as the backbone of the multi-task neural network to extract task-specific low-level features. The proposed multi-scale task interaction architecture for multi-task deep learning is visualized in Figure 4.
Dataset
A large-scale image dataset of high-speed railway viaducts, termed Tokaido Dataset (Narazaki et al., 2021), was employed in the present study. Target structures and damage scenarios were generated randomly in the context of post-earthquake utilizing a unified system with synthetic environments. 200 different environments that contain 2000 high-speed railway viaducts with the standard design were developed. Generated images were annotated automatically and associated with ground truth pixel-wise information of structural component types, damage types, and depth values.
Component recognition and depth estimation
For multi-task learning of component recognition and depth estimation, 7575 images with the size of 1920 × 1080 pixels in Tokaido dataset associated with their ground truth annotations were adopted for training and testing. Seven regular component types for high-speed railway viaducts are involved in the dataset, that is, non-bridge, slab, beam, column, non-structural, rail, and sleeper. The range of depth value is [0.5 m, 30 m]. It is worth noting that the “sleeper” component exists in fewer images than the other components. Moreover, for images including sleeper, the number of sleeper pixels is relatively small compared with other types of components. Data imbalance is contended to be a critical implication of the inferior performance of neural networks for semantic segmentation. To eliminate the influence of the few pixels of sleeper, data augmentation was carried out for images with such component. Approaches for data augmentation include rotation, horizontal flipping, vertical flipping, and adjustment of saturation and brightness.
The dataset was expanded to 9282 images after data augmentation and was randomly divided into training and testing sets after shuffling. The training set accounted for 70% and the testing set for 30%, that is, 6498 images in the training set and 2784 in the testing set. Original images and annotations were resized to 640 × 360 pixels. Representative images and annotations for multi-task learning of the component and depth tasks are shown in Figure 7.
Figure 7. Examples of component and depth dataset: (a) original images, (b) component annotations, (c) depth annotations.
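A shuffled split with the reported set sizes can be sketched as follows. The function name and the fixed seed are assumptions for illustration; image indices stand in for the actual image files.

```python
import random

def shuffle_and_split(items, n_train, seed=0):
    """Shuffle a copy of `items` and split it into training and testing parts."""
    rng = random.Random(seed)  # fixed seed (assumed) for a reproducible split
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# 9282 augmented images, 6498 of which go to the training set.
train_ids, test_ids = shuffle_and_split(range(9282), n_train=6498)
```

Passing the training count explicitly avoids any rounding ambiguity when the percentage does not divide the dataset size evenly.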
Damage segmentation
Network for damage segmentation was trained in isolation. 7081 images in Tokaido Dataset along with their ground truth annotations were utilized for training and testing. Regular and close-up scene images along with pure texture images were trained together. Three types of damage regularly appear on high-speed viaducts involved in the dataset, that is, non-damage, concrete damage, and exposed rebar.
The dataset was randomly divided into training and testing sets after shuffling, of which the training set accounted for 65% (4602 images) and the testing set for 35% (2479 images). Original images and annotations were resized to 640 × 360 pixels. Typical images and damage annotations are depicted in Figure 8.
Figure 8. Examples of damage dataset: (a) original images, (b) damage annotations.
Experiments and results
All experiments in the present study were run on a computational platform equipped with an Intel Xeon E5-2678 v3 CPU @ 2.50 GHz with 64 GB of RAM and an NVIDIA RTX 2080 Ti GPU with 11 GB of memory.
Implementation details
Four different scales:1/4, 1/8, 1/16 and 1/32 mentioned above were adopted to carry out task interaction strategy. Cross-entropy loss Lc was utilized for component recognition and L1 loss for depth estimation, which are defined as
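As a hedged sketch, the standard per-pixel forms of these two losses can be written as below. The symbols $P$ (number of pixels), $K$ (number of component classes), $y_{p,k}$ and $\hat{y}_{p,k}$ (one-hot ground truth and predicted class probability), $d_p$ and $\hat{d}_p$ (ground-truth and predicted depth), and the name $L_d$ for the depth loss are notational assumptions introduced here; only $L_c$ is named in the text.

```latex
L_c = -\frac{1}{P} \sum_{p=1}^{P} \sum_{k=1}^{K} y_{p,k} \log \hat{y}_{p,k},
\qquad
L_d = \frac{1}{P} \sum_{p=1}^{P} \left| d_p - \hat{d}_p \right|
```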
For the damage segmentation task, notably, thin and subtle positive areas are often exhibited for concrete damage and exposed rebar. In analogy to the sleeper in component task, the network is prone to report unsatisfactory accuracies owing to such few positive pixels. To remedy this potential network failure, class weights were introduced to loss function by pixel-wise weighting strategy (Yang et al., 2021). Inverse frequency weighting was adopted to overcome the obstacle of data imbalance. Original pixel ratios of each damage type and that after weighting are 1: 0.0366: 0.0040 and 1: 27: 250 (non-damage: concrete damage: exposed rebar) respectively. Evidently, the weights of positive damage pixels were significantly increased after inverse frequency weighting. HRNet was adopted as the backbone and cross-entropy for loss calculation.
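The relation between the reported pixel ratios and the reported class weights can be checked with a short sketch of inverse frequency weighting. Normalizing the weights so that the non-damage class has weight 1 is an assumption consistent with the 1: 27: 250 ratio quoted above.

```python
def inverse_frequency_weights(ratios):
    """Weight each class by the inverse of its pixel ratio.

    Weights are scaled so the first class (non-damage) has weight 1.
    """
    base = ratios[0]
    return [base / r for r in ratios]

# Pixel ratios: non-damage, concrete damage, exposed rebar.
pixel_ratios = [1.0, 0.0366, 0.0040]
class_weights = inverse_frequency_weights(pixel_ratios)
rounded_weights = [round(w) for w in class_weights]
```

Rounding the computed weights reproduces the reported values, so rare damage pixels contribute far more to the loss than background pixels.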
The multi-task neural network is named “Multi-CDNet (Multi-task Network for Component and Depth tasks)” for further discussion for convenience. The network for damage task is termed “DamNet” for further discussion. Moreover, for the sake of a convincing illustration of the superiority of the proposed multi-scale multi-task strategy, component and depth tasks were trained separately by conventional single-task HRNet network. The two single-task networks for component recognition and depth estimation are named “CompNet (Network for Component task)” and “DepNet (Network for Depth task)” respectively for further discussion. After the trial-and-error process, the optimal hyperparameters and optimizer are determined. The batch sizes of Multi-CDNet and DamNet are 4 and 8 respectively, and initial learning rates are 0.001 and 0.0005 respectively. The number of training epoch of all networks is 300. Learning rate was dropped every 50 epochs, and the drop factor is 0.5. Dataset was shuffled each epoch. Adam was employed as the optimizer. For CompNet and DepNet, training scheme was set consistent with Multi-CDNet. Results compared hereinafter were reported by the network trained with their optimal hyperparameters.
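The step-wise learning rate schedule described above can be written as a small function. It assumes zero-indexed epochs, so that the first drop takes effect at epoch 50; the function name is hypothetical.

```python
def step_lr(initial_lr, epoch, step_size=50, drop_factor=0.5):
    """Learning rate at a given zero-indexed epoch under step decay."""
    return initial_lr * drop_factor ** (epoch // step_size)

# Multi-CDNet starts at 0.001 and DamNet at 0.0005; both train for 300 epochs.
multi_cdnet_lrs = [step_lr(0.001, e) for e in range(300)]
damnet_lrs = [step_lr(0.0005, e) for e in range(300)]
```

Over 300 epochs the rate is halved five times, so the final learning rate is 1/32 of the initial one.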
Results and evaluations of component recognition and depth estimation
For Multi-CDNet, component-wise Precision, Recall, and Intersection over Union (IoU) and their mean values (mPrecision, mRecall, mIoU) were utilized for evaluations of component recognition task, and Root Mean Squared Error (RMSE) for depth estimation task (Kazemi et al., 2024). Abovementioned evaluation metrics are defined as
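These metrics follow their standard definitions, which can be sketched as below for flattened label and depth arrays. The function names are hypothetical, and returning 0 for a class that never appears is an assumed convention, not something stated in the text.

```python
import math

def segmentation_metrics(pred, truth, num_classes):
    """Per-class (precision, recall, IoU) from flattened label lists."""
    results = {}
    for k in range(num_classes):
        tp = sum(1 for p, t in zip(pred, truth) if p == k and t == k)
        fp = sum(1 for p, t in zip(pred, truth) if p == k and t != k)
        fn = sum(1 for p, t in zip(pred, truth) if p != k and t == k)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        results[k] = (precision, recall, iou)
    return results

def mean_metric(results, index):
    """Mean over classes; index 0, 1, 2 gives mPrecision, mRecall, mIoU."""
    return sum(values[index] for values in results.values()) / len(results)

def rmse(pred, truth):
    """Root mean squared error between predicted and ground-truth depths."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))

# Toy example with two classes and four pixels (hypothetical values).
metrics = segmentation_metrics([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
miou = mean_metric(metrics, 2)
depth_error = rmse([1.0, 3.0], [1.0, 1.0])
```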
Training curves of Multi-CDNet in terms of mIoU for component recognition and RMSE for depth estimation are visualized in Figure 9. For the component task, evaluation metrics of Multi-CDNet and CompNet are listed in Tables 1 and 2.
Figure 9. Mean IoU for component task and RMSE for depth task of Multi-CDNet during training.
Table 1. Training and testing precision, recall, and IoU of Multi-CDNet.
Table 2. Training and testing IoU (%) of CompNet.
Training and testing RMSE (m) of Multi-CDNet and DepNet.
Analogously, the proposed Multi-CDNet reported much smaller RMSEs on both training and testing sets, presenting superior performance. Representative component recognition results and depth estimation results are illustrated in Figures 10 and 11. In the predicted depth map, pixel-wise absolute errors (in meters) between the predicted depth values and the ground truth values were mapped to the corresponding colors, which are arranged in the color gradation on the right. The learning and generalization ability of the proposed multi-scale task interaction strategy is visibly demonstrated by these two figures.
Figure 10. Examples of component recognition results: (a) original image, (b) ground truth annotation, (c) segmentation results presented by Multi-CDNet.
Figure 11. Examples of depth estimation results: (a) original image, (b) ground truth annotation, (c) our estimation results, (d) visualized absolute error.

Results and evaluations of damage segmentation
Evaluation metrics of training and testing sets of DamNet.

IoU for damage segmentation of DamNet during training.
Figure 12 and Table 4 indicate that DamNet reported reliable segmentation results. The fluctuations of IoU during the early training period can be attributed to the small number of pixels of the two damage types relative to the background. Although relatively significant fluctuations were present, the network reached a steady level before the training process completed. The mIoUs on the training and testing sets were 77.3% and 72.1%, respectively. The high Recall values show that the trained network is effective for damage segmentation, even when damage appears against extensively varying backgrounds. In contrast, the Precision and IoU values were lower, indicating that accurate pixel-wise damage localization remains to be investigated. Typical damage segmentation results are visualized in Figure 13.
Figure 13. Examples of damage segmentation results: (a) original image, (b) ground truth annotation, (c) segmentation results presented by DamNet.
Discussions
Comparative study of component and depth tasks reported by Multi-CDNet and the reference.
Comparative study of damage segmentation task reported by DamNet and the reference.
It is demonstrated by above two tables that proposed Multi-CDNet showed better results for component recognition task and more remarkable performance for depth estimation. RMSE of predicted depth value is reduced by more than 16%. Sleeper is deemed to be the component with the worst recognizability in the current task since it has the fewest pixels. Reference (Narazaki et al., 2021) reported the IoU of sleeper to be 66.0% while our Multi-CDNet presented 69.8%. Outstanding superiority of the proposed strategy is further corroborated by such comparison. For damage segmentation, Reference (Narazaki et al., 2021) proposed a strategy that only detected nearby damages in regions of reliable damage recognition (RRDR), and damages outside RRDR were discarded. Our DamNet reported higher metrics values without removing any damages.
Quantification results of computational simplicity and efficiency.
Loss values during training at the four scales of Multi-CDNet are plotted in Figure 14. A detailed local-level diagram is plotted on the right since the curves of the scales 1/32 and 1/16 are too close to be distinguished. It is worth noting that the scale 1/4 reported the maximum loss and 1/32 the minimum. The loss value increased with increasing patch size (receptive field). In other words, the loss increased as the scale dilated. Moreover, it is commonly contended that a scale with high pattern affinity is most likely to have small loss values, because the tasks can benefit each other to a large extent. Consequently, satisfactory agreement was observed between loss values at varying scales (Figure 14) and task affinities with respect to patch scale (Figure 2). The choice of which tasks should be combined and trained concurrently in the multi-task architecture is thus further corroborated.
Figure 14. Training loss values of four scales of Multi-CDNet.
Component recognition results predicted by Multi-CDNet and CompNet for the same image are visualized in Figure 15. The beam, slab, and non-structural components in the yellow box are noteworthy. Evidently, for these far-away component parts, the segmentation results reported by Multi-CDNet are better than those of the single-task CompNet. This can be explained by the fact that, in the multi-task model, the depth values of these components change linearly, and such linear change provides information for the component segmentation task. The commonalities and differences between the component and depth tasks at four scales were captured by Multi-CDNet and brought benefits to both tasks, especially for far-away component recognition. The performance of the multi-scale task interaction strategy was thereby improved. The effectiveness and superiority of the proposed strategy are further substantiated by this observation.
Figure 15. Comparison of component recognition results of a designated image predicted by: (a) Multi-CDNet, and (b) CompNet.
Conclusions
In the present study, a novel multi-scale task interaction deep learning strategy was proposed for component recognition, damage segmentation, and depth estimation in post-earthquake inspection of high-speed railway viaducts. The presented strategy was implemented on a multi-task deep neural network. The key insight behind this strategy lies in leveraging domain-specific information and shared representations at different scales. The proposed strategy is achieved by three modules for task interaction. Component and depth tasks were combined for multi-task learning since they have higher quantified task affinities at multiple scales. A comprehensive framework tailored to the practice of post-earthquake inspection is proposed based on this strategy. The novelty of the presented strategy is that commonalities and differences among tasks at varying scales can be fully used, so that the accuracy advantage of multi-task models is further improved while computational efficiency is enhanced. Conclusions are drawn as follows:
• Mean IoU for component recognition and RMSE for depth estimation reported by the proposed multi-task architecture reached 91.2% and 1.54 m respectively, showing a significant rise for the former and a drop for the latter compared with the single-task cases. As demonstrated by the evaluation metrics, the inference accuracies of both tasks are improved by sharing complementary representations and features of the two associated tasks, substantiating the learning and generalization ability of the multi-task model.
• The depth estimation task, which can be regarded as the auxiliary task of the component recognition task, provided crucial information for the latter, especially for far-away parts. Commonalities and differences between the component and depth tasks at different scales were fully leveraged by the proposed multi-scale task interaction strategy.
• Compared with the sequential training of the two single-task networks, the training time, inference duration, and FLOPs were reduced by 23%, 30%, 27% respectively. Computational simplicity and efficiency were improved by avoiding repeated calculation in shared layers in the multi-task model.
The proposed method has the potential to provide a beneficial complement to human-conducted manual inspection or contact sensing measurement. Three critical inspection tasks can be carried out concurrently, providing guidance and reference for structural assessment and decision-making. Future work could build on the development of a more complex dataset containing real-world scenes, and on transfer learning of the proposed method to other application scenarios such as highways or tunnels, so as to further facilitate a more sophisticated, robust, automatic, and intelligent inspection system for structures and infrastructure.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Key R&D Program of China (grant number 2023YFE0115000); Science & Technology Specific Project of Jiangsu Province (grant number BZ2024047); Key R&D Program of Zhejiang Province (grant number 2023C03182); and National Natural Science Foundation of China (grant number 52361165658).
Data availability statement
The data used to support the findings of this study are available from the corresponding author upon reasonable request.
