Abstract
Wind energy is a sustainable and renewable energy source, valued and invested in by national governments. Wind power currently has significant installed capacity and growth, and wind turbines will face substantial maintenance demands in the future. However, inspection and assessment of wind turbine towers lack sufficient research attention and effective automated methods. Fine cracks in large-scale images of wind turbine towers are often obscured by noise, making accurate identification difficult. The projection of the same crack in adjacent images is susceptible to double counting, leading to incorrect evaluations. This paper proposes a crack assessment method for concrete wind turbine towers based on unmanned aerial vehicle (UAV) imaging, utilizing computer vision and artificial intelligence (AI). The flight trajectory and photographic strategy for high-rise structures are designed for data acquisition. An attention-enhanced grid-based convolutional neural network (CNN) for crack identification is developed, along with an incremental crack projection algorithm to address overlapping regions in adjacent images. Field tests demonstrate that the proposed method achieves excellent accuracy and efficiency in crack identification, localization, and quantification. The image acquisition strategy ensures reliable crack identification and three-dimensional reconstruction. The classification model, optimized by the global receptive field and attention mechanism, combined with digital image processing (DIP) adapted to grid-based classification, effectively filters noise and extracts fine cracks. Three-dimensional reconstruction and incremental projection based on structure from motion (SfM) provide accurate and efficient crack localization, allowing precise crack parameters to be obtained from the surface model for structural assessment. 
The proposed method integrates drone imaging, AI-based crack recognition algorithms, and three-dimensional reconstruction and projection techniques, effectively enabling automatic crack assessment for concrete wind turbine towers, thereby assisting more informed maintenance decisions.
Introduction
Under the global goal of controlling the Earth’s temperature and promoting low-carbon, green, sustainable development, vigorously developing wind power as a clean energy source has become a global consensus. According to the International Renewable Energy Agency’s (IRENA) annual report, wind energy accounted for 26% of global renewable generation capacity, reaching 733 GW by the end of 2020 (Civera and Surace, 2022). Governments are expected to further expand financial and policy investments in wind energy. Wind turbines are facing enormous operation and maintenance pressures, which are expected to intensify in the future. The tower is the primary component of a wind turbine’s support structure, accounting for about 15% to 30% of the total cost (Blanco, 2009; Hyers et al., 2006). Cracking in the tower reduces its fundamental frequency, causing resonance (Van Zyl and Van Zijl, 2015), and can even lead to structural failure in severe cases (Ribrant and Bertling, 2007), such as the concrete tower accident in Germany (Civera and Surace, 2022). However, the inspection of wind turbine towers still relies on manual visual inspection, which is labor-intensive and time-consuming. Therefore, more automated, efficient, and accurate inspection and assessment technologies are urgently needed for wind turbines.
In China, the proportion of concrete wind turbine towers is relatively low. However, hundreds of concrete towers still exist due to their large foundations. Cracks are a critical indicator of concrete structure health, and several traditional nondestructive techniques (NDTs) have been developed for crack identification. The most basic strategy involves using strain differences between adjacent strain sensors to identify cracks (Benedetti et al., 2011). Additionally, distributed fiber optic sensors (FOS) can be used to measure continuous strains, replacing conventional strain sensors (Lee et al., 2010). Acoustic emissions and normalized accumulation parameters are used together to characterize damage in wind turbine towers. Research shows that the normalized cumulative energy parameter is sensitive to the fracture stage (Tang et al., 2021). An infrared laser vibrometer identifies dynamic structural parameters to enable condition monitoring for wind turbine towers (Dilek et al., 2019).
With the rapid development of imaging systems and computer vision, image-based technology has been widely applied in various fields, such as pavement (Li et al., 2023), tunnels (Huang et al., 2018), and bridges (Xu et al., 2019). It also holds significant potential for wind turbine towers. In particular, the rapid advancement of deep learning and convolutional neural networks (CNNs) has greatly accelerated the automation of infrastructure inspection and assessment. Deep learning CNNs have been extensively researched and applied in areas such as corrosion and coating defect assessment of coal handling and preparation plants and automated ultrasonic diagnosis of compressive damage in concrete under temperature variations, demonstrating significantly superior performance compared to traditional methods (Wang et al., 2024; Yu et al., 2023). Additionally, lightweight algorithms have been developed for the identification of multi-damage in complex environments (Jiang et al., 2024). Deep learning approaches have also achieved notable progress in crack identification, where crack identification tasks can be categorized into classification, detection, and segmentation based on task objectives. The classification task divides the image into multiple patches and identifies the presence of cracks in each patch using a sliding window method across the entire image (Cha et al., 2017). The detection task involves directly identifying the category and position of cracks within the entire image, and obtaining a rectangular area containing the cracks through bounding box regression (Cha et al., 2018). The segmentation task extracts cracks at the pixel level, determining whether each pixel belongs to the crack (Dung and Anh, 2019). The topological structure of cracks indicates that detection-based methods are unsuitable due to the self-similarity between local and global crack patterns. 
In segmentation-based methods, dataset annotation is costly, and datasets are often small in scale, leading to limited generalizability. Therefore, classification-based methods are a feasible approach, identifying regions of interest and then extracting cracks by combining them with digital image processing (DIP) technology.
Various robotic inspection platforms equipped with different sensors have been developed for wind turbine tower inspections. Imaging systems can be mounted on robots to acquire data. A climbing ring robot, which uses spring forces to grip the tower, has been developed for inspections. The adhesive forces between the robot and the surface are provided entirely by mechanical means (Sattar et al., 2009). A magnetic climbing robot with high adsorption force has been proposed for inspecting metal surfaces (Yan et al., 2016). These climbing robots are expensive, inefficient, and less adaptable, making them difficult to scale up for wind turbine towers.
Unmanned aerial vehicles (UAVs) have become an ideal platform for automatic and periodic inspections of wind turbine towers due to their mobility and flexible observation capabilities. UAVs have been successfully applied in civil engineering for bridge inspections (Seo et al., 2018), pavement inspections (Zhang and Elaksher, 2012), photogrammetric mapping (Martínez-Carricondo et al., 2018), and earthwork surveys (Siebert and Teizer, 2014). Research and practice show that UAV-based methods offer excellent performance and significantly reduce inspection costs (Perry et al., 2020). Path planning algorithms reduce manual intervention and enable fully automated inspections, giving UAV-based methods significant potential (Phung et al., 2017).
The imaging system equipped on UAVs can be used not only for crack identification but also for 3D reconstruction. UAVs capture data from multiple perspectives to overcome occlusions and provide a comprehensive and accurate 3D model (Khaloo et al., 2018). Structure from motion (SfM) generates a dense point cloud by using overlapping images from multiple viewpoints (Liu et al., 2016). Surface reconstruction further generates a mesh model from the dense point cloud, providing the object’s topological structure (Wei et al., 2022). By combining crack identification and projection techniques, the crack location and distribution are clearly visualized, and modified crack parameters can be obtained to evaluate the structural condition (Liu et al., 2020).
However, UAV-based crack assessments still face the following three challenges. (1) The long shooting distance of UAVs often results in images with a large field of view, making cracks appear relatively small. In such cases, image-based classification methods struggle to identify cracks, while patch-based methods are easily affected by local noise. (2) The mainstream 3D model for wind turbine towers is the point cloud model, which lacks topological information about the structural surface (Mirzazade et al., 2021). The lack of spatial constraints may result in misalignment of crack locations and measurements. (3) In image-based 3D reconstruction, crack projection in overlapping regions between adjacent images remains under-researched and under-discussed. Repeated projection can lead to overestimation of damage and reduced efficiency, requiring further optimization.
To address these challenges, this paper proposes a crack assessment method for wind turbine towers using computer vision and AI based on drone imaging. First, a data acquisition scheme, including flight trajectory and photography strategy, is designed for the high-rise structure of wind turbine towers. Second, an attention-enhanced, grid-based classification method with a global receptive field is developed for crack identification. Third, image-based 3D reconstruction is performed to obtain the surface model of the wind turbine tower. Fourth, an incremental projection algorithm is introduced to handle crack projection in the overlapping regions of adjacent images. Field tests on a real wind turbine tower are conducted and discussed for case illustration, technology validation, and error analysis of the proposed methodology.
The primary innovations and contributions of this study are as follows: (1) A UAV-based digital image acquisition strategy tailored for wind turbine tower inspections is developed. This strategy optimizes shooting distance, overlap rate, and flight trajectory to acquire higher-quality data, enhancing the foundation for precise analysis. (2) A multi-scale progressive crack identification approach is established, combining an attention-enhanced grid-based CNN with DIP. This architecture offers superior noise robustness and multi-scale sensitivity, enabling highly accurate crack identification even in large-scale images with significant noise interference. (3) An incremental crack projection algorithm is proposed to handle overlapping regions between adjacent images, effectively preventing crack double-counting. This ensures more accurate crack parameter calculation and improves the reliability of the damage assessment. (4) A comprehensive automated crack assessment solution for wind turbine towers is implemented, integrating image acquisition, crack identification, model reconstruction, damage projection, parameter quantification, and structural evaluation. This fully automated workflow significantly streamlines the inspection process, reducing manual intervention and enhancing the overall efficiency and accuracy of the assessment.
Methodology framework
The proposed methodology consists of five parts: data acquisition, crack identification, 3D reconstruction, projection positioning, and field testing. The methodology framework is illustrated in Figure 1.
Figure 1. Methodology framework.
In the data acquisition stage, a UAV equipped with a high-resolution camera captures sequential images along a pre-planned flight trajectory. The trajectory and photography strategy are designed to maintain an optimal distance between the UAV and the structure, ensuring sufficient resolution for crack identification while covering a broad area. Additionally, an adequate overlap between adjacent images is ensured to facilitate accurate image alignment during later stages of processing. These high-resolution images form the foundational dataset for subsequent analysis.
The captured images are processed using a CNN for grid-level crack segmentation. This step primarily serves to eliminate noise interference by identifying grid regions with potential cracks, effectively narrowing the focus for detailed analysis. In the subsequent stage, DIP techniques are applied to the identified grids for pixel-level crack segmentation. This refined segmentation enhances the precision of crack boundaries and facilitates the extraction of crack width feature pairs.
Using structure-from-motion (SfM) algorithms, a sparse 3D point cloud is generated from the overlapping images. This point cloud is further refined into a triangular mesh surface model that accurately represents the geometry of the structure. The mesh serves as the spatial framework for mapping crack information from the 2D images onto the 3D model.
An incremental projection algorithm is developed to integrate 2D crack data into the 3D model. The algorithm maps crack pixels from 2D images onto the corresponding locations on the triangular mesh, taking into account the camera parameters, image orientation, and the structure’s geometry. Feature points are used to calibrate crack parameters, ensuring high accuracy in the final projection.
Field tests are conducted to validate the overall framework. Real-world scenarios are used to evaluate the accuracy and reliability of crack identification, localization, and quantification. By integrating the above steps, the framework provides a comprehensive solution that not only identifies and characterizes cracks but also accurately maps their spatial positions within the structure. This integration enhances the reliability and utility of crack assessments for structural maintenance and safety evaluations.
The organization of this paper is as follows. The section on Digital image data acquisition describes the UAV flight trajectories and photography strategies. The section on Crack identification and feature extraction introduces techniques including coarse segmentation, fine segmentation, and parameter calculation. The section on Crack localization and feature integration presents the methods for 3D reconstruction, crack projection, crack deduplication, and parameter correction. The Field tests section systematically demonstrates the proposed methodology through an on-site application. Finally, the Conclusions section summarizes the findings and discusses directions for future work.
Digital image data acquisition
Flight trajectory for high-rise structure
To meet the requirements for crack identification and 3D reconstruction, the flight trajectory and photography strategy of the UAV system must follow specific principles (Liu et al., 2020). The geometric shape of wind turbine towers is typically a cylinder or a frustum of a cone. It is essential to design a UAV flight trajectory that suits the regular structural surface of wind turbine towers, reducing the frequency of direction changes and making the flight trajectory easier to control.
The flight trajectory is divided into vertical and circumferential directions, resulting in two cruise schemes. One scheme involves thorough vertical cruising with incremental circumferential adjustments, while the other involves thorough circumferential cruising with incremental vertical adjustments. For wind turbine towers with high-rise structures, the slenderness ratio is relatively large. The latter cruise scheme requires frequent adjustments to the UAV’s position and posture at each altitude, making it difficult to operate. The former cruise scheme (Figure 2) is recommended. This scheme requires only a few adjustments in position and posture throughout the process, making it highly maneuverable.
Figure 2. UAV cruise scheme.
Specifically, for a cylinder with height
In addition, several strategies were implemented during data acquisition to mitigate the impact of UAV vibration on the detection accuracy of fine cracks. First, the UAV was stopped at each designated shooting point before capturing images, allowing the system to stabilize and significantly reducing motion blur caused by flight vibration. Second, three images were captured at each stop so that only the clearest and most stable image was used for analysis. Third, an appropriate shutter speed was selected to prevent motion blur during image capture. Finally, during the data processing phase, images exhibiting motion blur were automatically filtered out. As a result, the influence of UAV vibration on crack detection accuracy was minimized, maintaining the overall reliability of the inspection process.
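The recommended vertical cruise scheme can be illustrated with a small waypoint generator. This is a minimal sketch under stated assumptions rather than the planner used in this study: the function name, the parameters (tower radius, standoff distance, vertical step, number of columns), the serpentine ordering, and the cylinder idealization are all hypothetical.

```python
import math


def vertical_cruise_waypoints(tower_radius, standoff, height, vertical_step, n_columns):
    """Return (x, y, z, yaw_deg) waypoints for vertical passes around a cylinder.

    All arguments are hypothetical planning inputs in metres; none of them
    are values taken from the paper.
    """
    orbit_radius = tower_radius + standoff
    n_levels = int(math.floor(height / vertical_step)) + 1
    waypoints = []
    for col in range(n_columns):
        theta = 2.0 * math.pi * col / n_columns
        x = orbit_radius * math.cos(theta)
        y = orbit_radius * math.sin(theta)
        # Yaw turns the optical axis back toward the tower axis.
        yaw_deg = (math.degrees(theta) + 180.0) % 360.0
        levels = list(range(n_levels))
        # Assumed serpentine order: climb on even columns, descend on odd
        # ones, so each column needs only one circumferential move.
        if col % 2 == 1:
            levels.reverse()
        for lvl in levels:
            waypoints.append((x, y, lvl * vertical_step, yaw_deg))
    return waypoints


# Hypothetical example: 2 m radius, 8 m standoff, 100 m tower, 5 m steps, 8 columns.
waypoints = vertical_cruise_waypoints(2.0, 8.0, 100.0, 5.0, 8)
```

Position and yaw change only between columns, which is the property that makes the vertical scheme easier to control than the circumferential one.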
Photography strategy
Suitable distance for crack identification
Given the camera and lens parameters, a shorter working distance allows the detection of narrower cracks. Using the thin lens imaging similarity principle, the physical width of a detectable crack can be calculated as follows:
Digital image-based crack assessment requires an adequate ground sample distance (GSD). The UAV typically uses a prime lens, and
Calculation parameters for working distance and rotation angle.
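The relationship between working distance and detectable crack width can be illustrated with the standard similar-triangles approximation. This is a minimal sketch, not the paper's own formula: the function names, the pixel pitch, the focal length, and the number of pixels a crack must span are assumptions introduced for illustration.

```python
def ground_sample_distance(pixel_pitch_mm, focal_length_mm, working_distance_mm):
    """Object-space size of one pixel (mm per pixel) from similar triangles.

    Assumes the working distance is much larger than the focal length.
    """
    return pixel_pitch_mm * working_distance_mm / focal_length_mm


def max_working_distance(target_width_mm, pixels_across, pixel_pitch_mm, focal_length_mm):
    """Largest distance at which a crack of the target width still spans
    `pixels_across` pixels (a hypothetical detectability criterion)."""
    return target_width_mm * focal_length_mm / (pixels_across * pixel_pitch_mm)


# Hypothetical camera: 0.005 mm pixel pitch, 50 mm prime lens, 10 m distance.
gsd_mm = ground_sample_distance(0.005, 50.0, 10_000.0)
limit_mm = max_working_distance(1.0, 1.0, 0.005, 50.0)
```

Halving the working distance halves the ground sample distance, which is why closer flights resolve narrower cracks at the cost of a smaller field of view.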
Sufficient overlap for 3D reconstruction
Sufficient overlap between images is essential for successful 3D reconstruction. For the circular section of a wind turbine tower, when the UAV flies vertically, adjusting only the altitude simplifies controlling the overlap between adjacent images. In contrast, during circumferential flights, the UAV must not only adjust the horizontal position but also the angle to ensure that the optical axis passes through the center of the circular section. This makes the operation more complex.
The calculation of the overlap zone can be reduced to a planar problem. As illustrated in Figure 2, where
For effective reconstruction, the overlap ratio
Crack identification and feature extraction
Crack identification consists of three steps: coarse segmentation using an attention-enhanced grid-based CNN, fine segmentation using DIP, and the calculation of crack parameters. Fine crack segmentation can be inefficient and prone to noise interference. Accuracy and efficiency can be improved by first extracting the area of interest (AOI) through coarse segmentation, followed by fine crack segmentation within the AOI.
Coarse segmentation based on CNN
The attention-enhanced grid-based CNN for coarse crack segmentation has a global receptive field. A patch-based CNN, by contrast, has only a local receptive field, which makes it prone to noise interference and less accurate. The patch-based method also requires a sliding window, and to avoid splitting cracks at window edges the stride is typically reduced to half the window size, which lowers efficiency. The grid-based method avoids both limitations.
Crack dataset
The UAV-captured images are 5456 × 3632 pixels, which is too large for network training. Since the cracks are tiny, downscaling the image would make them undetectable. Therefore, each image is divided into a 16 × 16 grid of patches for neural network input. Each patch is 341 × 227 pixels, a size that does not require a high-performance GPU and is suitable for network training.
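The tiling arithmetic can be checked with a short sketch. The helper name `grid_boxes` is hypothetical, and the code only computes patch boundaries. It does not read or crop image data.

```python
def grid_boxes(width, height, cols, rows):
    """Return (left, top, right, bottom) pixel boxes for a cols x rows grid,
    listed row by row."""
    if width % cols or height % rows:
        raise ValueError("image size must divide evenly into the grid")
    cell_w, cell_h = width // cols, height // rows
    return [
        (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
        for r in range(rows)
        for c in range(cols)
    ]


# 5456 x 3632 image split into a 16 x 16 grid of 341 x 227 patches.
boxes = grid_boxes(5456, 3632, 16, 16)
```

Because 5456 = 16 × 341 and 3632 = 16 × 227, the grid covers the image exactly with no padding or overlap.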
A dataset of 3750 RGB images with dimensions of 341 × 227 × 3 is constructed for crack identification. Grid-based crack annotation software is developed to expedite marking. The dataset is split into training, validation, and test sets in a 6:2:2 ratio, with 2250 images for feature extraction, 750 for hyperparameter tuning, and 750 for model evaluation.
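The 6:2:2 split can be reproduced with a seeded shuffle. This is a minimal sketch: the function name and the seed are assumptions, and the paper does not state how images were assigned to each subset.

```python
import random


def split_dataset(items, ratios=(6, 2, 2), seed=0):
    """Shuffle items reproducibly and split them into train/val/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seed is a hypothetical choice
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Image indices stand in for the 3750 annotated patches.
train_ids, val_ids, test_ids = split_dataset(range(3750))
```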
Network architecture
An attention-enhanced grid-based classification model built on ResNet is proposed. ResNet is a widely used backbone network in computer vision. It addresses deep network degradation by incorporating residual units through shortcut connections (He et al., 2016). The network architecture (Figure 3(a)) consists of eight modules: a pre-processing module, an input module, four downsampling modules, an output module, and a post-processing module. The proposed method takes an image as input and outputs the grid mask directly. Unlike the patch-based method, the grid-based method requires only a single forward pass to classify all grid cells.
Figure 3. Network architecture for attention-enhanced grid-based CNN.
The pre-processing module includes resizing and normalization. The input is an RGB image of size 341 × 227 × 3. The wind turbine tower’s surface color is monotonous, so the image is converted to grayscale to speed up image reading during tests. As the network down-samples four times, the input image size must be a multiple of 16. Using bicubic interpolation, the final image size is 496 × 304 × 1. Additionally, pixel values are normalized to a range of 0 to 1 to accelerate convergence during network training.
The Input_Conv1 module integrates low signal-to-noise ratio input data. The module consists of a standard convolution block (Figure 3(b)). CBA contains a convolution layer, batch normalization, and ReLU activation. Padding is half the size of the convolution kernel, rounded down. Since batch normalization follows the convolution layer, the convolution layer’s bias is set to False by default. The convolution kernel size is 7, the output channel is 64, and the stride is 1.
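The padding rule can be verified with the usual convolution output-size formula. This is a small sketch with hypothetical helper names. It shows that a 7 × 7 kernel with stride 1 and padding 3 preserves the 496 × 304 input size.

```python
def same_padding(kernel_size):
    """Padding equal to half the kernel size, rounded down."""
    return kernel_size // 2


def conv_output_size(n, kernel_size, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel_size) // stride + 1


pad = same_padding(7)
out_w = conv_output_size(496, 7, 1, pad)
out_h = conv_output_size(304, 7, 1, pad)
```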
The downsampling module uses residual units to construct the deep neural network (DNN) and leverages the DNN’s hierarchical structure to process semantic information. The module consists of Bottleneck blocks, which are divided into downsampling and identity blocks based on whether downsampling is performed (Figure 3(c)). Three convolution blocks in the Bottleneck backbone network handle dimension compression and expansion, ensuring accuracy while significantly reducing computation. The four downsampling modules have 64, 128, 256, and 512 channels, with repetition counts of 3, 4, 6, and 3, respectively.
The Out_Conv6 module classifies grid cells. As shown in Figure 3(d), the initial convolution block reduces dimensionality and computation. The second attention block, C3TR, employs a global receptive field and reduces noise from a broader perspective. The final convolution layer integrates channel information and outputs crack confidence via a sigmoid activation function. The Transformer (TR) block in C3TR incorporates learnable positional encoding and embedding. Figure 3(e) shows that positional encoding is learned through a fully connected layer, while embedding is learned via multi-head attention. Query (
The multi-head attention mechanism maps
The post-processing module visualizes the neural network’s output. Based on the model’s output confidence and the threshold for positive (crack) and negative (background) samples, the grid cell category is determined, and a 16x downsampled grid mask is generated. Finally, the grid mask is projected onto the original image to produce the crack coarse segmentation prediction.
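The thresholding and projection steps can be sketched with nested lists. The threshold of 0.5, the function names, and the nearest-neighbor replication by an integer factor are assumptions. The actual pipeline also has to undo the resizing applied during pre-processing, which is omitted here.

```python
def grid_mask_from_confidence(confidence, threshold=0.5):
    """Label each grid cell 1 (crack) or 0 (background) from its confidence."""
    return [[1 if p >= threshold else 0 for p in row] for row in confidence]


def upsample_mask(mask, factor=16):
    """Replicate every grid cell into a factor x factor block of pixels."""
    out = []
    for row in mask:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


# Hypothetical 2 x 2 confidence map standing in for the network output.
confidence = [[0.9, 0.1], [0.4, 0.7]]
grid_mask = grid_mask_from_confidence(confidence)
pixel_mask = upsample_mask(grid_mask, factor=16)
```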
Loss function
In grid-based crack identification, the imbalance between positive and negative samples is addressed using Focal loss and Dice loss. Focal loss and Dice loss operate at different scales: Focal loss targets specific parts, while Dice loss focuses on the overall structure.
Focal loss addresses the imbalance between positive and negative samples by mining hard samples as follows:
Dice loss addresses the imbalance by mining the foreground area. Dice loss, an area-based loss, is defined as:
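For orientation, the commonly used binary forms of the two losses can be written as follows. This is a sketch of the standard definitions, not necessarily the exact variants used in this study: the weighting factor `alpha`, the focusing parameter `gamma`, and the smoothing term are assumed values.

```python
import math


def focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss; alpha and gamma are assumed hyperparameters."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)           # guard against log(0)
        p_t = p if y == 1 else 1.0 - p            # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t) ** gamma down-weights easy, well-classified cells.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss over a whole grid; smooth avoids division by zero."""
    intersection = sum(p * y for p, y in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


# Flattened grid predictions and labels (hypothetical values).
probs = [0.9, 0.2, 0.6, 0.1]
targets = [1, 0, 1, 0]
combined = focal_loss(probs, targets) + dice_loss(probs, targets)
```

Summing the two terms with equal weight is itself an assumption. Focal loss acts per grid cell, while Dice loss is computed over the whole foreground area, which is the difference in scale noted above.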
Evaluation metrics
Confusion matrix.
At the grid level, AP is used. AP represents the area under the PR curve, calculated based on the updated PASCAL VOC standard (Everingham et al., 2015). In the PR curve, the horizontal axis represents Recall (Re), and the vertical axis represents Precision (Pr). Precision assesses the accuracy of identified cracks (equation (8)). Recall evaluates the completeness of the identified cracks (equation (9)). AP combines both metrics, as demonstrated in equation (10).
At the image level, Dice score is used. Dice score is a set similarity metric that measures the similarity between ground truth and predictions, as follows:
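The all-point interpolation used by the later PASCAL VOC protocol can be sketched as follows. This is an illustrative implementation under assumptions: the function name is hypothetical, and each grid cell is treated as one scored prediction with a binary label.

```python
def average_precision(scores, labels):
    """Area under the precision-recall curve with a monotone precision envelope."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    tp = fp = 0
    recalls, precisions = [0.0], [0.0]
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_pos)
        precisions.append(tp / (tp + fp))
    recalls.append(1.0)
    precisions.append(0.0)
    # Make precision non-increasing when read from low to high recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas wherever recall increases.
    return sum(
        (recalls[k] - recalls[k - 1]) * precisions[k]
        for k in range(1, len(recalls))
    )


# Hypothetical confidences and ground-truth labels for three grid cells.
ap = average_precision([0.9, 0.8, 0.7], [1, 0, 1])
```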
Dice score is equivalent to the F1 score in terms of correct prediction of positive and negative samples, as shown below:
The F1 score, a comprehensive metric, is the harmonic mean of Precision and Recall.
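The equivalence between the Dice score and the F1 score can be checked numerically from confusion-matrix counts. The function names and the example counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2.0 * precision * recall / (precision + recall)


def dice_score(tp, fp, fn):
    """Dice score written directly in terms of TP, FP, and FN."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = precision_recall_f1(8, 2, 4)
dice = dice_score(8, 2, 4)
```

Both expressions reduce to 2TP / (2TP + FP + FN), so the two values agree up to floating-point rounding.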
Experimental results
The training starts from scratch using a mini-batch size of 48 over 100 epochs. The Adam optimizer is used, with learning rate schedules including warm-up and cosine decay. During the first five warm-up epochs, the learning rate increases linearly from 0 to 0.0001. Over the subsequent 95 epochs, the learning rate decays to 0.001% of its initial value. Exponential moving average (EMA) is applied during training to enhance the model’s robustness and generalization. Training is performed on a high-performance workstation equipped with two NVIDIA Ampere A100 GPUs (80 GB VRAM each) and two Intel Xeon Gold 6342 CPUs (512 GB RAM).
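The learning rate schedule described above can be sketched as a function of the epoch index. Two points are assumptions: the schedule is updated per epoch rather than per iteration, and "0.001% of its initial value" is read as a final rate of 1e-5 times the peak rate of 0.0001.

```python
import math


def learning_rate(epoch, base_lr=1e-4, warmup_epochs=5, total_epochs=100,
                  final_fraction=1e-5):
    """Linear warm-up followed by cosine decay (per-epoch granularity assumed)."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    min_lr = base_lr * final_fraction
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start of each of the 100 epochs.
schedule = [learning_rate(e) for e in range(100)]
```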
The best-performing weights on the validation set are selected as the final model weights. As shown in Figure 4, the model performance improves rapidly during training and then stabilizes. A comparison of training and validation loss shows minimal overfitting. Model generalization is tested on the test set, yielding an AP of 0.92 and a Dice score of 0.83. The model has 23,647,649 parameters and achieves 98 FPS (frames per second), meeting real-time identification requirements.
Figure 4. The training process.
Fine segmentation
Once the grid mask is obtained, a series of fine segmentation steps is applied to extract the cracks. The fine segmentation procedure is illustrated in Figure 5. Threshold segmentation, combined with multi-scale filtering and connected component analysis, is used. The extracted cracks are then used for subsequent parameter calculations.
Figure 5. Fine segmentation procedure.
Grid level filtering
The grid-based CNN provides an Area of Interest (AOI) that accurately reflects crack distribution, though some misidentification may occur. These misidentifications can be filtered by crack morphology. At the grid level, crack morphology filtering involves three steps: connected component analysis, speckle filtering, and line filtering.
Connected component analysis adopts the seed filling algorithm. The algorithm uses a stack based on neighborhood relations to label connected regions, requiring only a single traversal of the image.
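A stack-based seed filling pass can be sketched as follows. The 8-neighborhood and the function name are assumptions. The paper does not state which neighborhood relation is used.

```python
def label_components(mask):
    """Label connected regions of a binary grid with stack-based seed filling.

    Returns (labels, count); 8-connectivity is an assumed choice.
    """
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or labels[r][c]:
                continue
            count += 1                      # new seed starts a new component
            labels[r][c] = count
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            stack.append((ny, nx))
    return labels, count


# Hypothetical 4 x 3 grid mask with two separate crack regions.
grid = [[1, 0, 0],
        [0, 1, 0],
        [0, 0, 0],
        [1, 0, 0]]
labels, n_components = label_components(grid)
```

Each cell is labeled once when it is first reached, so the outer scan visits the image a single time.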
Crack topology is characterized by irregular curves of specific lengths. A length threshold is set, and connected components shorter than this threshold are filtered out as speckles. The least squares method then fits the shape of each remaining connected component, and linear components are filtered based on area and the sum of squared errors (SSE) as follows:
Adaptive threshold segmentation
An additional advantage of the grid-based method is the natural existence of local windows for threshold segmentation. While the global histogram is typically smooth, the histograms within the AOI exhibit a distinct bimodal distribution. In this context, the Otsu algorithm, which is effective for threshold segmentation, can achieve satisfactory results (Otsu, 1979).
The Otsu algorithm segments the image into foreground and background. The optimal threshold is the one that maximizes the separation between the two classes, measured by the inter-class variance. The Otsu threshold is given by:
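The search for the threshold that maximizes the inter-class variance can be sketched in plain Python. The function name is hypothetical, the input is assumed to be a flat list of 8-bit gray values from one AOI window, and pixels at or below the returned threshold form one class.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level that maximizes the inter-class variance."""
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w0 = 0          # number of pixels with value <= t
    sum0 = 0.0      # sum of gray values with value <= t
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Unnormalized inter-class variance; the scale does not affect the argmax.
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Hypothetical bimodal window: dark crack pixels and bright background.
window = [10] * 50 + [200] * 50
threshold = otsu_threshold(window)
```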
Pixel level filtering
The threshold algorithm relies on gray distribution for image segmentation, neglecting spatial information. Therefore, after pixel-level segmentation using the threshold algorithm, additional fine filtering should incorporate crack morphology. Pixel-level filtering involves two steps: first, filtering out small noise using connected component analysis, and second, applying principal component analysis (PCA) to filter out components with non-crack topologies (Wang and Hu, 2017).
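One way to test whether a connected component has a crack-like, elongated shape is to compare the two eigenvalues of the covariance matrix of its pixel coordinates. The sketch below is an illustration under assumptions, not the criterion of Wang and Hu (2017): the function name and the use of the eigenvalue ratio as the shape measure are hypothetical.

```python
import math


def principal_axis_ratio(points):
    """Ratio of minor to major PCA eigenvalue for 2D pixel coordinates.

    Values near 0 indicate an elongated component and values near 1 a
    compact blob. At least two distinct points are expected.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Closed-form eigenvalues of the 2 x 2 covariance matrix.
    half_trace = 0.5 * (sxx + syy)
    root = math.sqrt((0.5 * (sxx - syy)) ** 2 + sxy ** 2)
    major = half_trace + root
    minor = max(half_trace - root, 0.0)
    return minor / major if major > 0.0 else 1.0


# Hypothetical components: a thin diagonal segment and a square speckle.
line_ratio = principal_axis_ratio([(i, i) for i in range(10)])
blob_ratio = principal_axis_ratio([(x, y) for x in range(5) for y in range(5)])
```

A component could then be kept when its ratio falls below a chosen cutoff. The cutoff value would need to be tuned on real data.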
After binarization, each cell is treated as a matrix
Parameter calculation
Crack parameters derived from two-dimensional images may be distorted due to perspective effects and geometric distortions from non-flat structural surfaces. To obtain accurate crack parameters, it is necessary to project the cracks back into the three-dimensional scene.
The key parameter for projection is the crack width feature pair, as shown in Figure 5. Each feature pair includes the crack segment ID, the coordinates of the skeleton point, and the coordinates of the crack width points. These parameters are obtained through crack decomposition, edge detection, and other operations to facilitate crack projection (Liu et al., 2014).
Crack localization and feature integration
Quantitative information about local structural cracks is effectively extracted through coarse segmentation, fine segmentation, and parameter calculation. However, for large structures like wind turbine towers, locating cracks solely based on local information is insufficient. Therefore, a three-dimensional model of the wind turbine tower is reconstructed using Structure from Motion (SfM) with multiple view geometry, and crack positioning is achieved through projection methods. Additionally, information about the same crack from different images is integrated to provide more accurate indicators for crack assessment.
3D reconstruction based on SfM
3D reconstruction using SfM generally involves three steps: feature matching, sparse reconstruction, and dense reconstruction. Feature matching begins with using algorithms (e.g., SIFT, SURF) to extract feature points and descriptors from images. Key points are then matched based on descriptor distances, and epipolar geometry operations are performed between each image pair. During sparse reconstruction, bundle adjustment is used for parameter optimization, resulting in camera parameter estimation and sparse scene features. Dense reconstruction uses camera parameters and positional relationships from SfM to estimate depth maps and generate dense point clouds.
Surface reconstruction is further performed on dense point clouds. The continuous triangular mesh model produced is suitable for crack projection and localization. Applying spatial constraints to dense point clouds allows the surface model to achieve greater accuracy in crack localization and measurement.
Crack projection
The location of specific cracks on a wind turbine tower is determined by projecting width feature points onto the continuous triangular mesh surface model. The projection ray
The width feature points on each image and crack are projected, and the intersection points of all projection rays with the triangular mesh model represent the real coordinates of the corresponding crack width feature points. To speed up the calculation, a collision detection acceleration algorithm using the axis-aligned bounding box (AABB) tree is employed, reducing time complexity to
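The intersection of a projection ray with a triangular mesh can be sketched with the Möller–Trumbore ray-triangle test, a common choice that is swapped in here for illustration. The paper does not name the intersection routine it uses. The sketch checks every triangle by brute force and omits the AABB tree acceleration, and all function names are hypothetical.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle_distance(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: ray parameter t of the hit, or None if missed."""
    e1, e2 = _sub(v1, v0), _sub(v2, v0)
    p = _cross(direction, e2)
    det = _dot(e1, p)
    if abs(det) < eps:                  # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    s = _sub(origin, v0)
    u = _dot(s, p) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    q = _cross(s, e1)
    v = _dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = _dot(e2, q) * inv_det
    return t if t > eps else None       # ignore hits behind the camera


def project_to_mesh(origin, direction, triangles):
    """Return the nearest intersection point of a ray with a triangle list."""
    best_t = None
    for v0, v1, v2 in triangles:
        t = ray_triangle_distance(origin, direction, v0, v1, v2)
        if t is not None and (best_t is None or t < best_t):
            best_t = t
    if best_t is None:
        return None
    return tuple(o + best_t * d for o, d in zip(origin, direction))


# Hypothetical mesh of two parallel triangles; the nearer one should be hit.
mesh = [((0, 0, 10), (1, 0, 10), (0, 1, 10)),
        ((0, 0, 5), (1, 0, 5), (0, 1, 5))]
hit = project_to_mesh((0.25, 0.25, 0.0), (0.0, 0.0, 1.0), mesh)
```

Keeping only the nearest hit matters on a closed tower surface, where a ray can also intersect the far side of the mesh.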
Crack deduplication
During crack projection, overlapping areas between two adjacent images are projected repeatedly from both cameras. Crack projection in overlapping areas presents two main issues. The first issue is merging the same crack from multiple images when deviations are small and projections are close. The second issue arises when a crack shows significant deviations across images, potentially leading to incorrect projections that need correction. Experiments indicate that the projection error of the same crack across different camera poses using SfM is negligible. Therefore, this paper primarily addresses the first issue. Repeated projections of the same crack area also account for a significant portion of calculations. These repeated projections decrease computational efficiency and complicate crack assessment. To resolve these issues, a deduplication algorithm is proposed, which significantly enhances projection calculation efficiency through incremental projection (as shown in Figure 6).
Figure 6. Crack deduplication.
Feature matching and overlap extraction
To extract overlaps, feature extraction must be optimized based on application scenarios and engineering requirements. Feature matching and mismatch elimination are then employed to obtain matched feature pairs and the homography matrix for adjacent images. Finally, the overlapping region is extracted using the homography matrix. Figure 6(a) illustrates the process of feature matching and overlap extraction.
Feature matching is performed using SURF (Speeded Up Robust Features) (Bay et al., 2008). Since cracks are very fine and crack pixels are unlikely to be detected as feature points, directly applying SURF to find feature points and calculate descriptors in crack images is impractical. This paper proposes processing the segmented crack binary image and sampling points on the crack skeleton line with a 20% probability to serve as crack feature points. SURF descriptors are then computed for the crack skeleton points on the original image to enhance feature richness and recognition.
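The skeleton sampling step can be sketched as below. The sketch assumes the skeleton is already available as a list of pixel coordinates and that each point is kept independently with probability 0.2; the function name, the fixed seed, and the toy skeleton are illustrative assumptions.

```python
import random

def sample_skeleton_points(skeleton_points, probability=0.2, seed=0):
    """Keep each skeleton pixel independently with the given probability."""
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    return [p for p in skeleton_points if rng.random() < probability]

# Toy skeleton: 1000 pixels along a straight line.
skeleton = [(x, 2 * x) for x in range(1000)]
sampled = sample_skeleton_points(skeleton)
```

The sampled points would then be passed to the descriptor computation on the original image as additional keypoints.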
This optimized feature extraction process is used to obtain feature points and descriptors for matching. Random Sample Consensus (RANSAC) is then employed to iteratively compute the homography matrix between adjacent images, eliminating mismatches. The optimized homography matrix is finally used to determine the positional relationship between adjacent images and extract overlapping regions.
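Once a homography is available, the overlap test reduces to mapping a pixel from one image into the other and checking the image bounds. The sketch below assumes a 3 × 3 homography `H_a_to_b` that maps pixels of image A to pixels of image B; the RANSAC estimation itself is not shown, and the pure-translation matrix is a hypothetical example corresponding to a 50% horizontal overlap.

```python
def apply_homography(H, point):
    """Map a pixel (x, y) through a 3x3 homography given as nested lists."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def in_overlap(H_a_to_b, point_a, width_b, height_b):
    """A pixel of image A is in the overlap if it maps inside image B."""
    u, v = apply_homography(H_a_to_b, point_a)
    return 0 <= u < width_b and 0 <= v < height_b

# Image size from the camera described in the field tests (5456 x 3632 pixels).
WIDTH, HEIGHT = 5456, 3632
# Hypothetical homography: image B is shifted by half an image width.
H_a_to_b = [[1.0, 0.0, -2728.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

inside = in_overlap(H_a_to_b, (3000, 100), WIDTH, HEIGHT)
outside = in_overlap(H_a_to_b, (1000, 100), WIDTH, HEIGHT)
```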
Incremental projection
The computational efficiency of the incremental projection algorithm is evaluated using 10 adjacent images with overlapping regions. Testing reveals that the full projection algorithm (Liu et al., 2020) requires 36.019 seconds for crack projection and display, whereas the proposed incremental projection algorithm takes 27.901 seconds, improving efficiency by 22.5%.
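The idea of incremental projection can be illustrated with a simplified sketch. It assumes that feature points lying in the overlap with the previous image have already been projected and can therefore be skipped; the paper's full procedure is given in Figure 6 and may differ. The callbacks `in_previous_overlap` and `project`, and the one-dimensional toy geometry, are hypothetical.

```python
def incremental_projection(images, in_previous_overlap, project):
    """Project only the points that were not covered by the previous image.

    images: list of per-image feature point lists.
    in_previous_overlap(k, p): True if point p of image k lies in the overlap
        with image k - 1.
    project(k, p): ray-casts point p of image k onto the surface model.
    """
    projected, casts = [], 0
    for k, points in enumerate(images):
        for p in points:
            if k > 0 and in_previous_overlap(k, p):
                continue  # already projected from image k - 1
            projected.append(project(k, p))
            casts += 1
    return projected, casts

# Toy setup: 100-pixel-wide images, each shifted by 50 pixels (50% overlap).
images = [[10, 60, 90], [10, 40, 70], [20, 80]]
projected, casts = incremental_projection(
    images,
    in_previous_overlap=lambda k, p: p < 50,
    project=lambda k, p: k * 50 + p,
)
full_casts = sum(len(points) for points in images)

# Efficiency gain reported in the text, recomputed from the two timings.
gain_percent = (36.019 - 27.901) / 36.019 * 100
```

In this toy case, five ray casts replace eight, and every skipped point coincides with a point already projected from the previous image.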
Parameter correction
Localization and distortion issues arise in crack parameter calculations based on two-dimensional planar images. The high overlap rate between adjacent images in the sequence leads to significant redundant calculations when determining crack length in a single image. Additionally, the crack width calculated from the image can be distorted due to various factors. Projecting cracks onto the mesh model allows for parameter correction, leading to more accurate structural performance assessment.
Crack width is calculated using the crack width points in each feature pair. The mesh model reconstructs the wind turbine tower’s structural surface, allowing the projection points to reflect the actual spatial distribution of cracks and correct any distortion. The corrected crack width is determined by calculating the Euclidean distance between crack width points in the set of projected feature pairs.
Crack length is determined by filtering redundant information and matching continuous cracks in adjacent images using the crack deduplication algorithm described above. For each crack, the skeleton points in the projected width feature pairs are connected sequentially based on the crack segment sequence ID, forming the crack skeleton line. The length of the crack skeleton line represents the corrected crack length.
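Both corrected parameters reduce to Euclidean distances between projected points. The following sketch assumes the projected width feature pairs and skeleton points are available as 3D coordinates in millimetres; the function names and the sample coordinates are hypothetical.

```python
import math

def crack_width(projected_pair):
    """Width is the distance between the two projected crack-edge points."""
    edge_a, edge_b = projected_pair
    return math.dist(edge_a, edge_b)

def crack_length(skeleton):
    """Length is the polyline length of skeleton points ordered by sequence ID.

    skeleton: list of (segment_sequence_id, (x, y, z)) tuples.
    """
    ordered = [point for _, point in sorted(skeleton)]
    return sum(math.dist(a, b) for a, b in zip(ordered, ordered[1:]))

# Hypothetical projected points, in millimetres.
width = crack_width(((0.0, 0.0, 0.0), (0.3, 0.4, 0.0)))
length = crack_length([(2, (3.0, 4.0, 12.0)),
                       (0, (0.0, 0.0, 0.0)),
                       (1, (3.0, 4.0, 0.0))])
```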
Field tests
Field tests on wind turbine towers are performed to demonstrate and validate the proposed methodology.
Introduction to test conditions
Equipment and setup
The UAV is equipped with an industrial camera and a prime lens with a focal length of 85 mm. The industrial camera features an APS-C image sensor with a size of 23.5 × 15.6 mm and a resolution of 5456 × 3632 pixels.
The UAV’s flight plan is designed to meet the requirements for crack identification and 3D reconstruction. The UAV’s maximum operating distance is approximately 2 m to ensure accurate crack identification (as shown in Table 1). The overlap between adjacent images should be at least 50%, with a rotation angle of approximately 9.19° to ensure accurate 3D reconstruction (as shown in Table 1).
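The link between operating distance and resolvable crack width can be checked with a short calculation. The sketch assumes a pinhole camera model with the object distance much larger than the focal length, and uses the sensor size, resolution, focal length, and 2 m distance stated above; the function names are hypothetical.

```python
def ground_sampling_distance(sensor_width_mm, image_width_px,
                             focal_length_mm, distance_mm):
    """Size on the tower surface covered by one pixel, in mm per pixel."""
    pixel_pitch_mm = sensor_width_mm / image_width_px
    return pixel_pitch_mm * distance_mm / focal_length_mm

def footprint(sensor_size_mm, focal_length_mm, distance_mm):
    """Width and height of the surface area covered by one image, in mm."""
    return tuple(s * distance_mm / focal_length_mm for s in sensor_size_mm)

gsd = ground_sampling_distance(23.5, 5456, 85.0, 2000.0)
cover_w, cover_h = footprint((23.5, 15.6), 85.0, 2000.0)
```

Under these assumptions, one pixel covers roughly 0.1 mm of the surface at a distance of 2 m, and a single image covers an area of about 0.55 m by 0.37 m.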
The field tests were conducted in clear, cloudless weather conditions. Favorable weather conditions ensure consistent solar illumination. The images acquired during the tests exhibited uniform brightness.
Site and subject
The wind turbine project is located in Nanyang City, Henan Province, China, with a capacity of 105 MW. The hub height is approximately 130 m, and the foundation is buried to a depth of 5.2 m. The concrete tower consists of three types of precast straight cylinder segments and three types of transitional conical cylinder segments.
Constructed in 2018, the wind turbine project has experienced issues with falling blocks, cracks, and other damage to the concrete tower’s outer wall. To assess and understand the structural condition and safety, the structure must be inspected and evaluated according to relevant national standards. Based on the on-site inspection results and structural safety calculations, the current safety status of the concrete tower is assessed, and recommendations for treatment are provided. Crack inspection and assessment are key components of this process.
Data results
Crack identification
Grid-level crack segmentation results are obtained from coarse segmentation using CNN, while pixel-level crack segmentation results are achieved through DIP, as illustrated in Figure 7(a). The identification of cracks can be influenced by linear interferences such as joints and stains on the structural surface of the wind turbine tower. The proposed network model and filtering algorithm effectively eliminate such noise, demonstrating strong robustness.
Figure 7. Crack assessment process and results.
Model reconstruction
Initially, a sparse point cloud of the structure is generated using SfM. This sparse point cloud is then used to create a dense point cloud. A triangular mesh model is constructed from the dense point cloud, and textures are mapped onto it. The model provides clear structural information, and joints between the precast and transitional segments are distinctly visible in the surface model (as shown in Figure 7(b)). This reconstructed model serves as a fundamental platform for crack localization.
Crack localization, assessment, and decision
Cracks are incrementally projected onto the surface model using the camera positions and poses estimated by SfM, as depicted in Figure 7(b). The surface model, reconstructed from the image sequence, provides a clear view of the spatial location and distribution of cracks. Additionally, the length and width of the cracks can be calculated to assess structural performance.
Cracks should be sealed and the affected areas repaired to ensure durability. The proposed method offers a reliable approach for crack inspection. Notably, cracks are concentrated near the joint between the precast and transitional segments. These joints should be maintained promptly. Furthermore, grouting should be applied to the joint to ensure that the gap is fully filled and the external surface transition is smooth, preventing future damage near the joint.
Discussions
Crack recall and robustness against noise
In the project, it is crucial to ensure that all cracks are identified. A significant number of missed cracks could lead to a serious underestimation of the structural damage, potentially resulting in an overestimation of the structural performance and leading to unreasonable decisions.
The traditional approach is manual visual inspection by rope-access workers (so-called “spider men”). Manual inspection is dangerous, and inspectors working at height find it difficult to concentrate, so cracks are easily missed. UAV imaging reduces the burden of field work and turns crack identification into an office task, significantly improving efficiency and accuracy.
The proposed method demonstrates strong robustness against noise effects through multi-scale progressive filtering and crack identification, as illustrated in Figure 5. At the grid scale, the CNN-based coarse segmentation achieves an AP of 0.92 and a Dice score of 0.83, showcasing excellent coarse segmentation capabilities. At both the macro and micro scales, corresponding to the grid and pixel levels, a series of operations effectively preserve crack information while removing noise. Furthermore, the grid-based CNN architecture inherently forms AOIs for localized thresholding, resulting in clearer threshold boundaries and sharper crack edges during fine segmentation. As shown in Figure 8, the proposed algorithm exhibits strong robustness against noise.
Figure 8. Segmentation results of cracks against noise.
As illustrated in Figure 8, this robustness is maintained even in challenging environments. Precision (Pr) and recall (Re) are balanced at approximately 90%, although some crack segments are filtered out by the threshold settings. These missed segments usually belong to a crack that is otherwise detected, so they do not significantly affect the overall assessment.
In comparison, manual vision inspection identified only four cracks, missing more than 90% of them. The proposed method, by overcoming these limitations, meets the rigorous requirements of engineering applications.
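The precision, recall, and Dice scores quoted in this subsection can be computed from grid-level labels as in the following sketch. It assumes binary labels per grid cell (1 for crack, 0 for background); the function name and the sample labels are hypothetical and are not data from the field tests.

```python
def grid_metrics(predicted, truth):
    """Return (precision, recall, dice) for binary grid-cell labels."""
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, dice

# Hypothetical labels for eight grid cells.
predicted = [1, 1, 1, 0, 0, 1, 0, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]
precision, recall, dice = grid_metrics(predicted, truth)
```

A missed crack lowers recall, which is the quantity that matters most when the aim is to avoid underestimating structural damage.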
Model performance comparison
Model performance comparison.
Among the baseline models, MobileNet, as a lightweight architecture, demonstrates the highest computational efficiency with the lowest parameter count (1.01 M) and FLOPs (0.56 G). However, this efficiency comes at the expense of lower segmentation accuracy, with an AP of 0.81 and a Dice score of 0.75. In contrast, DarkNet and VGGNet provide better segmentation accuracy but require higher computational resources.
The experimental results demonstrate that the proposed network achieves the highest AP (0.92) and Dice score (0.83) among all the models. While the proposed network has the largest parameter count (23.65 M) and FLOPs (49.30 G), the increase in computational cost is relatively modest compared to DarkNet and VGGNet. Furthermore, the frame rate of 98 FPS confirms that the proposed network is computationally efficient and feasible for practical engineering applications.
Additionally, an ablation study was performed by removing the Transformer block from our proposed network to further validate its contribution. The results highlight the effectiveness of the Transformer block, which enhances the network’s ability to focus on critical features, leading to improved segmentation performance.
In summary, the proposed network outperforms the baseline models in terms of segmentation accuracy while maintaining a reasonable computational cost, making it well-suited for real-world applications.
Crack positioning error of the incremental projection
To verify the positioning error of the incremental projection, a quantitative analysis of the projection error in overlapping regions of adjacent images is conducted. Once the overlapping regions in adjacent images are identified, crack feature points within these regions are projected from the respective adjacent cameras, and their corresponding world coordinates are calculated.
World coordinates for 97 feature points associated with five cracks in two adjacent images are analyzed (as shown in Figure 9). The average Euclidean distance is used to measure the projection error, yielding a result of 0.86 mm. This indicates that the positioning error of projections in overlapping regions is negligible. Consequently, eliminating repeated projection operations can enhance calculation efficiency, and adjacent continuous cracks can be integrated into a single crack.
Figure 9. Feature point projection of overlapping region.
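The error metric used here, the average Euclidean distance between the two projections of each feature point, can be written as a short sketch. The coordinates below are hypothetical values in millimetres and are not the 97 measured points.

```python
import math

def mean_projection_error(points_cam_a, points_cam_b):
    """Average distance between the same feature points projected from two cameras."""
    if len(points_cam_a) != len(points_cam_b) or not points_cam_a:
        raise ValueError("point lists must be non-empty and of equal length")
    distances = [math.dist(a, b) for a, b in zip(points_cam_a, points_cam_b)]
    return sum(distances) / len(distances)

# Hypothetical world coordinates (mm) of three feature points from two cameras.
from_cam_a = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 5.0, 0.0)]
from_cam_b = [(0.6, 0.8, 0.0), (10.0, 0.0, 1.0), (20.0, 5.0, 0.5)]
error_mm = mean_projection_error(from_cam_a, from_cam_b)
```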
Crack parameter error
Reliable crack assessment requires accurate crack parameters. The calculated crack parameters are compared with manually measured values to analyze errors and verify the reliability of the proposed method. Measuring crack parameters manually at high altitude is challenging (as shown in Figure 10), resulting in limited data for analysis. Parameter error analysis is presented in Table 4.
Figure 10. Manual data acquisition of concrete wind turbine tower.
Table 4. Parameter error analysis.
The lengths of three cracks are calculated. The total length of each crack is determined through crack deduplication and compared with measured values. The maximum relative error is approximately 5%, and the average relative error is 3.7%. The analysis indicates that the crack length error is small and acceptable for engineering purposes.
Measuring most cracks directly is difficult because of shaking at high altitude and the limited operating space imposed by safety equipment. Measuring crack width to a precision of 0.01 mm is particularly difficult. The crack width that could be measured was recorded and compared with the calculated value: the absolute error is 0.06 mm, and the relative error is 18.8%. According to relevant technical specifications, the relative accuracy of DIP-based crack width calculations should be within 20%. Thus, the proposed method satisfies the engineering requirements for structural applications.
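The acceptance check described above is a relative error compared against the 20% tolerance. In the sketch below, the two width values are hypothetical and were chosen only so that the resulting errors are of the same size as those reported; they are not measurements from the field tests. The sketch also assumes the relative error is taken with respect to the manually measured value.

```python
def relative_error(calculated, measured):
    """Relative error with respect to the manually measured value (assumed)."""
    return abs(calculated - measured) / measured

TOLERANCE = 0.20  # 20% limit for DIP-based crack width, as cited in the text

# Hypothetical widths in millimetres, for illustration only.
measured_width_mm = 0.32
calculated_width_mm = 0.38

abs_error_mm = abs(calculated_width_mm - measured_width_mm)
rel_error = relative_error(calculated_width_mm, measured_width_mm)
within_tolerance = rel_error <= TOLERANCE
```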
Conclusions
This paper proposes a crack assessment method utilizing UAV imaging for wind turbine towers, significantly enhancing the accuracy and efficiency of inspections through computer vision and AI. A UAV cruise and imaging strategy tailored to wind turbine towers and other high-rise structures is designed for data acquisition. For crack identification, a data-driven AI method is employed, in which the CNN is optimized with attention mechanisms and a grid classification strategy to extract AOIs effectively. An incremental projection algorithm is developed to address repeated projections in overlapping regions of adjacent images; it improves the efficiency of crack projection and positioning and meets engineering requirements. Field tests show that the proposed method effectively identifies, locates, and quantifies cracks. The main conclusions of this paper are as follows:
(1) The attention-enhanced grid-based classification CNN accurately and efficiently identifies cracks. The model achieves an AP of 0.86 on the test set, demonstrating excellent generalization, and reaches an FPS of 98, meeting real-time identification requirements. Coarse segmentation using CNN and fine segmentation with DIP effectively extract cracks.
(2) The complete process of data acquisition, 3D reconstruction, and crack projection provides accurate crack localization across the entire structure. The triangular mesh surface model constrains projection positioning, with an average positioning error of 0.86 mm for the same crack across different camera positions and poses.
(3) The incremental projection algorithm addresses repeated projection issues in overlapping regions of adjacent images, improving efficiency by 22.5% compared to full projection. It also matches the same crack across different images, avoiding redundant calculations and ensuring reliable crack assessment.
The proposed method enables automatic crack inspection and assessment for wind turbine towers. Crack projections on the reconstructed textured surface model provide decision support for maintenance. Future research could explore the integration of automated technologies, such as UAV autonomous cruising, super-resolution reconstruction for identifying tiny cracks, and multi-source data fusion for 3D reconstruction. Additionally, advancements in artificial intelligence techniques, including unsupervised learning, image generation, and multi-task learning, could be applied for real-time crack detection. Incorporating environmental factors, such as wind loads and temperature variations, into the assessment framework will further refine the accuracy of the analysis. The development of predictive models for crack propagation, alongside the use of edge computing to process data locally and reduce latency, could enhance system efficiency. Combined with periodic inspections, this approach can enable crack growth monitoring and contribute to the development of a digital twin platform for the entire lifecycle of wind turbine towers.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is supported by the National Natural Science Foundation of China (52192662).
Disclaimer
The authors express their sincere appreciation for their support.
