Abstract
In recent decades, much of the existing infrastructure heritage has been approaching the end of its nominal design life, mainly because of aging, deterioration, and degradation phenomena that threaten the safety of these strategic communication routes. Among civil engineers and researchers devoted to structural health monitoring (SHM) of existing structures, the demand for innovative indirect non-destructive testing (NDT) methods aided by artificial intelligence (AI) is steadily growing. In the present study, the authors analyzed the use of various deep learning models to increase the productivity of classifying ground penetrating radar (GPR) images for SHM purposes, focusing in particular on the evaluation of road tunnel linings. Specifically, the authors present a comparative study employing two convolutional models, i.e. ResNet-50 and EfficientNet-B0, and a recent transformer model, i.e. the Vision Transformer (ViT). In particular, the authors evaluated the effects of training the models with and without data pre-processed through the bi-dimensional Fourier transform. Despite the theoretical advantages expected from this kind of pre-processing of GPR images, the best classification performance was still achieved by the classifiers trained without the Fourier pre-processing.
Keywords
Introduction
Nowadays, existing strategic infrastructures such as bridges and tunnels are experiencing a substantial reduction in safety levels because of deterioration phenomena caused by the long-term degradation of their constitutive materials [1, 2]. To extend the service life of the existing heritage, the most widespread approach is structural health monitoring (SHM), which makes it possible to effectively plan and prioritize the preventive maintenance or rehabilitation interventions that are crucial over the lifecycle [3, 4, 5]. Since a total replacement of existing infrastructures would be economically unsustainable [6], efficient and innovative monitoring techniques have been developed in recent decades [7]. Periodic direct testing of specimens (e.g. concrete core drilling) is a reliable way to directly assess the quality, mechanical properties, and temporal changes of the in-situ structural materials. However, these tests provide local, albeit detailed, information, which does not always reflect the actual state of the entire structure [8]. Moreover, direct testing procedures are often lengthy and costly. Therefore, to increase the productivity and speed of periodic inspections, non-destructive evaluation (NDE), also known as non-destructive testing (NDT), has recently become more prominent, reliable, and widely adopted [9, 10, 11, 12]. NDT techniques are often employed in combination with direct testing to speed up inspections, reduce costs, and, in general, mutually overcome the limits of each approach [13]. Focusing on SHM for road tunnel structures [14, 15], some of the most adopted NDT techniques are rebound hammer testing [16], ultrasonic pulse testing [17], rebar scanning with a pachometer [18], concrete resistivity [19], passive acoustic emission monitoring for micro-crack detection [20], infrared thermography [21, 22], and laser scanner and lidar devices to monitor tunnel lining deformations [21]. Some innovative approaches rely on emerging technologies such as distributed fiber optic sensors [23] or internet of things (IoT) edge devices [24, 25, 26]. In the current study, the authors concentrated on indirect testing with ground penetrating radar (GPR) devices for the detection and annotation of concrete lining defects [27, 28], even though, in the literature, GPR is often adopted to determine the thickness of the tunnel lining concrete layer [21]. GPR instrumentation overcomes the limitations of visual inspections, which can only catch superficial defects [15]. Similarly to other geophysical methods [29], the GPR device probes the tunnel linings by propagating high-frequency electromagnetic wave impulses (10–2600 MHz) and analyzing the reflected signals [30]. The impulses’ penetration level or reflection rate depends on the dielectric features of the inspected material and on the possible presence of certain agents (e.g. water, reinforcement bars, the interface between concrete linings and surrounding ground, lining defects). A GPR system is composed of emitting and receiving units, a single or dual frequency antenna, and display, control, and storage units [30]. The GPR outputs images named profiles, where the abscissa represents the progressive distance from the beginning of the probing (i.e. the beginning of the tunnel), whereas the ordinate represents the examined lining depth. As depicted in Fig. 1, in a traditional GPR indirect testing pipeline, specialist staff decode lining defects from the surveyed profiles in a manual, lengthy, and costly post-processing phase [31].
GPR lining defects recognition by specialist staff.
To improve the efficiency, reliability, and productivity of the traditional GPR monitoring process, artificial intelligence (AI) offers innovative tools that accomplish the above-mentioned task by leveraging computer vision and image processing methods [32, 33, 34, 35, 36]. Specifically, deep learning (DL) techniques such as convolutional neural networks (CNNs) have been extensively employed for SHM applications [37]. In the existing literature, some innovative DL-based procedures have recently been introduced in GPR indirect monitoring of tunnel linings [38, 39, 40, 41, 42, 43]. In the review paper [44], the authors pointed out that, although GPR devices were first adopted in the tunnel-related field in the late 1970s, only a limited number of research studies have employed deep learning techniques so far, which motivates the present work within this active research field. In [45], the authors adopted deep learning models only to recognize the presence of rebars and to determine the thickness of the concrete layer, without taking into account any other defects or damage. In [46, 47], the authors employed a region proposal CNN named Faster R-CNN for specific target detection in tunnel lining GPR images. Specifically, in [47] the DL models were trained for a very limited purpose, i.e. for detecting only rebars in tunnel lining structures. The inversion of tunnel liner dielectric properties (permittivity maps) and object identification tasks have been addressed in [48] through a CNN model combined with a recurrent neural network (RNN) composed of bidirectional convolutional long short-term memory (LSTM) blocks. In [49], the automatic classification of tunnel lining defects has been accomplished with two convolutional models, i.e. the visual geometry group (VGG) network VGG-16 and the residual neural network (ResNet) with 34 convolutional layers, i.e. ResNet-34.
This study is rather limited because the DL model simply separates healthy sample images from those with any defect, without specifying which types of defects were considered. Furthermore, this study offers little generality because the models were trained only on GPR data coming from the same tunnel, which strongly restricts any direct transfer of the trained model to different tunnels. Similarly, another CNN-based automatic defects classification has been proposed in [50] by adopting the rotational region deformable convolutional neural network (R
In the present study, the authors compared three different DL models: two convolutional models, i.e. ResNet-50 and EfficientNet-B0, and a recent transformer model in the version suited to image data, i.e. the Vision Transformer (ViT). To the authors’ knowledge, the present work introduces transformer models to the GPR tunnel lining defects classification task for the first time. In particular, to provide reliable, automatic, and AI-aided post-processing of GPR profiles, the authors employed the hierarchical multi-level classification tree proposed in [42]. The main goal of the present work is to compare the classification performance of the three analyzed DL models with and without a prior pre-processing of the GPR image dataset through the bi-dimensional Fourier transform, acting as a compressive sensing tool. Compressing information reduces data transmission and computational effort [53], which are critical aspects for future real-time implementations. The present document is organized as follows. Section 2 briefly describes image processing with the bi-dimensional Fourier transform. Section 3 illustrates the AI-aided tunnel lining investigation methodology with DL-based automatic defects classification. Finally, Section 4 provides the comparative analysis among the various DL models trained with and without Fourier pre-processing.
Within the signal processing field, the discrete Fourier transform (DFT) represents the most acknowledged and widespread tool to investigate real-world propagation phenomena and more [54, 6]. The generality of the Fourier analysis provides the ability to analyze and decompose also higher dimensional signals, and thus any digital image which is actually a discrete ordered spatial bi-dimensional distribution of tensors of pixels [55, 56]. Considering a digital image in the spatial domain
where:
The outcomes of digital image Fourier analysis are assembled into a complex matrix, whose components are usually expressed in terms of phase (
in which
The factor
The DL models denoted as convolutional neural networks (CNNs) are essentially based on convolution, correlation, and, in general, filtering operations. A thorough understanding of these operations within the Fourier analysis of digital images suggested to the authors the possible advantages of adopting the bi-dimensional Fourier pre-processing technique. Within the present study, the authors mainly focused on the convolution theorem, which states that convolving two functions
Since the correlation operation is closely related to the convolutional one, a correlation theorem holds [61]:
being
For the duality property, the convolution operation is substantially a correlation in which the filter mask is rotated with a straight angle, i.e. using a flipped kernel
Resulting magnitude pre-processed images with bi-dimensional Fourier transform of two samples belonging to class C4 (reinforcement bars) and C13 (excavation) respectively.
Data collection, preliminary preparations and the final obtained dataset with and without Fourier pre-processing.
The dataset used in the current study is based on a series of NDT campaigns conducted by the authors on several tunnel linings with the GPR device. The data have been collected on tunnels spread throughout Italy, whose construction era is between the 1960s and 1980s. To provide a proper dataset to feed a subsequent DL classifier, some basic data preparations were needed after collecting GPR profiles. Firstly, every long output image generated by the GPR testing was interpreted by specialist staff to decode linings defects as the current traditional GPR post-processing workflow [31]. The long images were subsequently cropped with constant pixels step along the abscissa, which represents the progressive distance from the beginning to the end of the tunnel lining profile. This constant pixels step was calibrated in order to provide that each image sample width generally corresponds to about five meters on the real scale length of the tunnel progressive distance. However, in order to avoid some defects that were only placed across the cropping line and consequently end up on different images, the cropping line was occasionally manually adjusted. This latter operation was done on occasion with the minimum invasive intervention, providing a new defect-centered sample image, acting as a sort of local data augmentation. Nevertheless, all the sample images will be subjected to a resizing operation to homogeneously feed the DL models always with the same resolution images. In this way, a total number of 8728 GPR sample images were obtained for the subsequent innovative AI-based paradigm based on DL tunnel lining defects hierarchical classifiers.
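The constant-step cropping described above can be sketched in a few lines. This is only a minimal illustration in Python with NumPy, not the authors' procedure: the function name `crop_profile`, the step of 200 pixels, and the synthetic profile size are assumptions, and neither the occasional manual adjustment of the cropping line nor the final resizing is modeled.

```python
import numpy as np

def crop_profile(profile, step_px):
    """Split a long GPR profile (depth x progressive distance) into
    fixed-width samples along the abscissa.

    Assumption: an incomplete window at the end of the profile is discarded.
    """
    n_full = profile.shape[1] // step_px
    return [profile[:, i * step_px:(i + 1) * step_px] for i in range(n_full)]

# Synthetic stand-in for a surveyed profile (hypothetical size: 256 x 1050 pixels).
profile = np.random.default_rng(0).random((256, 1050))

# Hypothetical step: 200 pixels, standing in for "about five meters" of tunnel.
samples = crop_profile(profile, step_px=200)
```

In this sketch the last 50 columns are dropped. A real workflow would decide explicitly how to treat such a remainder and any defect lying across a cropping line.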
Afterward, to assess the expected effects of the bi-dimensional Fourier transform as an image pre-processing tool, the entire dataset of 8728 GPR sample images was pre-processed with the 2D-FFT algorithm of the MATLAB environment [64]. Specifically, after computing the bi-dimensional FFT as in Eq. (1), the magnitude of each pixel was computed from the resulting complex matrix with Eq. (4), followed by the logarithmic transformation of Eq. (3). Only the magnitude information was retained [58], thus producing the final pre-processed GPR sample image. Two sample images are presented in Fig. 2 to show the effects of the bi-dimensional Fourier pre-processing compared with the original raw GPR images. In the left column, two raw GPR sample images illustrate two different defects, which are identified by interpreting their specific patterns, similarly to Fig. 1. In the right column, the same images have undergone the Fourier pre-processing procedure, which delivers images of the magnitude of the complex terms after the logarithmic manipulation of Eq. (3). In the remainder of the present document, for the sake of clarity, whenever the authors refer to the dataset of sample images without any Fourier pre-processing, the adjective raw will be explicitly stated, e.g. raw dataset, raw images, etc.
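The sequence of steps above (2D FFT, magnitude, logarithmic transformation) can be illustrated with a short sketch. The authors worked in MATLAB; the Python/NumPy version below is only an assumed equivalent. The function name, the centering with `fftshift`, the `log(1 + |F|)` form of the logarithm, and the rescaling to 8-bit gray levels are assumptions that may differ from Eqs. (1), (3), and (4) of the paper.

```python
import numpy as np

def fourier_log_magnitude(image):
    """Return an 8-bit log-magnitude spectrum of a grayscale image."""
    spectrum = np.fft.fft2(image)          # bi-dimensional DFT
    spectrum = np.fft.fftshift(spectrum)   # assumption: zero frequency moved to the center
    magnitude = np.abs(spectrum)           # phase information is discarded
    log_mag = np.log1p(magnitude)          # assumption: log(1 + |F|) compression
    span = log_mag.max() - log_mag.min()
    if span == 0:
        return np.zeros(image.shape, dtype=np.uint8)
    normalized = (log_mag - log_mag.min()) / span
    return np.rint(255 * normalized).astype(np.uint8)

# Synthetic stand-in for a raw GPR sample image (hypothetical 64 x 64 size).
raw = np.random.default_rng(0).random((64, 64))
processed = fourier_log_magnitude(raw)
```

Because the phase is dropped, the original image cannot be recovered from `processed`. This is consistent with the information loss that the authors later suspect to be behind the reduced accuracy.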
Dataset folder organization both for raw images and Fourier pre-processing ones representing the adopted hierarchical multi-level classification tree.
The previously described datasets of GPR sample images with and without bi-dimensional Fourier pre-processing have been classified by adopting three different DL models, briefly described in the current section. As summarized in Fig. 4, each dataset of 8728 images in total has been rearranged in a series of 14 folders in order to construct a classification tree composed of six main levels, noting that the total number of available samples gradually decreases from level 1 to level 6. To accurately classify every single defect, this procedure is based on a cascade sequence of binary classifications to produce both a first skimming division in the first levels between healthy and damaged samples, whilst accurately classifying the typology (class) of the identified defect in the other next levels. Specifically, binary classification in level 1 distinguishes between class C1, i.e. healthy samples (4130 images), and class C2, i.e. damaged samples (4598 images). Level 2 is subdivided into levels 2a and level 2b. Level 2a is devoted to categorizing between class C3, i.e. healthy samples without reinforcement bars (3638 images), and class C4, i.e. samples with the presence of reinforcement bars (492 images). Level 2b is devoted to categorizing between class C5, i.e. samples with generic possible warning mix (574 images), and class C6, i.e. samples with more specific warnings which can be further accurately categorized (4024 images). In particular, class C6 contains specific patterns that permit further automatic classifying into specific defect typologies typical in tunnel linings assessment, as evidenced in Fig. 4. This means that samples in class C6 may be later categorized as cracks, or anomalies, simple voids, excavations, and detachments. On the other hand, samples belonging to class C5 may contain multiple overlayed defects or other specific patterns that are not directly interpretable with respect to the above-mentioned standard tunnel lining defects. 
In those cases, the current GPR approach produces warnings that require special care from the tunnel managers. Consequently, the inspectors have to deepen the investigation to identify which kind of defect, or mix of defects, is occurring in those critical areas, e.g. by carrying out in situ direct testing or other indirect inspections. Binary classification in level 3 distinguishes between class C7, i.e. samples with lining cracks (900 images), and class C8, i.e. samples with other types of damage (3124 images). Level 4 is devoted to categorizing between class C9, i.e. samples with anomalies in the linings (936 images), and class C10, i.e. samples with other types of defects (2188 images). Binary classification in level 5 distinguishes between class C11, i.e. samples with a simple void in the linings (1108 images), and class C12, i.e. samples with other types of voids (1080 images). Finally, level 6 is devoted to categorizing between class C13, i.e. samples with an excavation defect (408 images), and class C14, i.e. samples with detachment between the linings and the surrounding ground (672 images). For the adopted convolutional models, a balanced training approach was enforced, driven by the class with the minimum number of samples. To avoid biasing the training of the CNNs toward the class with more samples, the training set of that class was reduced to a smaller set, whose size was defined according to the number of samples of the smaller class. This guarantees fair training of the classification model and avoids a biased classification due to the unbalanced number of images at each level.
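The class counts listed above can be checked for internal consistency, and the balancing rule can be illustrated, with a short sketch. The dictionary layout and the function names are assumptions introduced for illustration. The counts are those reported in the text, and `balanced_size` only shows the "size of the smallest class" rule, not the authors' exact sampling procedure.

```python
# Image counts per class at each level of the classification tree (from the text).
LEVELS = {
    "1":  {"C1": 4130, "C2": 4598},
    "2a": {"C3": 3638, "C4": 492},
    "2b": {"C5": 574,  "C6": 4024},
    "3":  {"C7": 900,  "C8": 3124},
    "4":  {"C9": 936,  "C10": 2188},
    "5":  {"C11": 1108, "C12": 1080},
    "6":  {"C13": 408, "C14": 672},
}

# For each level, the (level, class) whose samples it further subdivides.
PARENT = {
    "2a": ("1", "C1"), "2b": ("1", "C2"), "3": ("2b", "C6"),
    "4": ("3", "C8"), "5": ("4", "C10"), "6": ("5", "C12"),
}

def tree_is_consistent():
    """Check that each level's classes add up to the parent class they split."""
    return all(
        sum(LEVELS[level].values()) == LEVELS[p_level][p_class]
        for level, (p_level, p_class) in PARENT.items()
    )

def balanced_size(level):
    """Per-class sample cap at a level: the size of its smallest class."""
    return min(LEVELS[level].values())

total_images = sum(LEVELS["1"].values())
```

With these counts, the cap would be 492 images per class in level 2a and 408 in level 6, which shows how strongly the smaller class limits the usable data in the most unbalanced levels.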
Graphical illustrative representation of the neural models with hyperparameters adopted in the present study.
The CNNs are essentially based on the convolution operation to provide an automatic hierarchical feature extraction procedure [65]. As depicted in Fig. 5, the ResNet-50 model [66] is based on a deep residual learning process that relies on identity mapping, i.e. skip or shortcut connections across the convolutional layer blocks. These shortcut paths improve training speed and help avoid vanishing gradients [67], mitigating the issues of excessive network depth [65]. In the present workflow, the dataset of GPR sample images has been previously resized to a resolution of 224
For the sake of comparison, the authors adopted the state-of-the-art convolutional model EfficientNet. Presented in 2019 [75], it effectively combines multiple techniques and previously existing strategies in an innovative way. A still widespread methodology to achieve the best accuracy while containing the required computational effort of a CNN is depth scaling, i.e. varying the number of layers. The ResNet-152 model was developed with 152 layers [66]; however, [70] demonstrated that limiting the total number of layers, e.g. to 50 (ResNet-50), provides overall beneficial effects both in terms of accuracy and computational effort [76]. On the other hand, scaling up CNN models enlarges the receptive field [75]. Alternative scaling approaches can be found in [75, 77, 78]. The authors of [75] developed a uniform and balanced scaling that aims to optimize the computational effort in terms of floating-point operations (FLOPS), thus providing the EfficientNet family of models. For the current tunnel defects classification, the authors adopted the base model EfficientNet-B0 [75] provided in the MATLAB R2021a environment [64]. As illustrated in Fig. 5, this implementation relies on 7 building blocks which employ the inverted residual blocks of MobileNetV2 [75, 79], resulting in a less connected network than the ResNet models. Indeed, the residual shortcuts connect only those layers in which the number of inputs and outputs is the same [80]. For the record, MobileNet denotes smaller and more efficient neural models initially developed for the limited resources of mobile hardware [81]. Their efficiency lies in depthwise separable convolutions, used within the blocks denoted as MBconv in Fig. 5, which factor a standard convolution into a per-channel (depthwise) convolution followed by a 1×1 pointwise convolution across the channels, i.e. the tensor depth (e.g. the three RGB color channels of the input image).
A deeper insight into the MBconv1 and MBconv6 modules is given in [82]. Furthermore, the EfficientNet building blocks adopt the swish activation function, a smooth variant of ReLU that takes slightly negative values for small negative inputs [83, 80], in combination with squeeze-and-excitation block units [84, 80]. Figure 5 illustrates the hyperparameter set, found empirically by trial and error, adopted to train the current EfficientNet-B0 model.
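Two of the ingredients mentioned above can be made concrete with a small sketch: the swish activation, commonly defined as x·sigmoid(x), and the weight count of a depthwise separable convolution compared with a standard one. The function names and the example layer sizes (3×3 kernel, 32 input and 64 output channels) are assumptions chosen for illustration and are not taken from the EfficientNet-B0 configuration. Bias terms are ignored.

```python
import math

def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def standard_conv_weights(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_weights(k, c_in, c_out):
    """Weights of a k x k depthwise convolution plus a 1 x 1 pointwise one."""
    return k * k * c_in + c_in * c_out

# Hypothetical layer: 3 x 3 kernel, 32 input channels, 64 output channels.
n_standard = standard_conv_weights(3, 32, 64)
n_separable = depthwise_separable_weights(3, 32, 64)
```

For this hypothetical layer, the separable form needs roughly one eighth of the weights of the standard convolution, which is the kind of saving that motivated the MobileNet family.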
Vision transformer
To address the tunnel defects classification problem, the authors also focused on neural transformers. First presented in [85] for natural language processing (NLP) tasks, they represent a major breakthrough in the DL field, with a completely different structure from CNNs. Transformers are encoder-decoder structures that rely entirely on self-attention and multi-head attention mechanisms, without requiring convolutional layers, and adopt positional embedding to account for token positions [73]. Attention gives the network the ability to focus on specific parts of the input embedding [85]. Multi-head attention leverages self-attention to process the embedded input tokens in parallel and concatenates the outcomes of the heads through a projection layer to compute the scored output [73]. The authors of [86] analyzed the relationship between the convolution operation and the self-attention mechanism, highlighting the ability of the latter to capture even long-range relationships in the sequence, whereas the former is mainly limited to its receptive field. Recent developments have fostered the adoption of the encoder part of transformers alone [87]; thus, the authors of [88] proposed the Vision Transformer (ViT) to deal with image data. In the current study, the ViT large model with 307M parameters and 16×16 pixel patches (ViT-L/16) has been employed. To properly feed the transformer encoder, each input image resized to 224
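How an image becomes a token sequence for a ViT encoder can be sketched as follows. In the usual ViT naming, the "16" of ViT-L/16 denotes the side of each square patch in pixels. The sketch assumes a 224×224 RGB input, and the function name is hypothetical. The linear projection, class token, and positional embedding that follow in a real ViT are omitted.

```python
import numpy as np

def image_to_patches(image, patch=16):
    """Flatten an H x W x C image into a sequence of non-overlapping patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)   # (rows, cols, patch, patch, channels)
    return grid.reshape(-1, patch * patch * c)

# Assumption: 224 x 224 RGB input, here filled with synthetic values.
image = np.random.default_rng(0).random((224, 224, 3))
tokens = image_to_patches(image)
```

Under these assumptions the encoder would receive 14 × 14 = 196 patch tokens, each holding 16 × 16 × 3 = 768 raw values before projection.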
Confusion matrices and classification metrics for ResNet-50 model trained with raw image data
Confusion matrices and classification metrics for ResNet-50 model trained with bi-dimensional Fourier pre-processed image data
Confusion matrices and classification metrics for EfficientNet model trained with raw image data
Confusion matrices and classification metrics for EfficientNet model trained with bi-dimensional Fourier pre-processed image data
Confusion matrices and classification metrics for ViT model trained with raw image data
Confusion matrices and classification metrics for ViT model trained with bi-dimensional Fourier pre-processed image data
To investigate and compare the effects of Fourier pre-processing on DL-based classification for indirect tunnel monitoring, the three previously described DL models have been trained with both datasets illustrated in Section 2.1, i.e. with the raw GPR sample images and with the bi-dimensional Fourier pre-processed GPR sample images. In the following, the obtained results are discussed for each DL model individually, and the closing Section 4.4 compares the results across the various employed techniques.
Classification results for ResNet-50
Concerning the ResNet-50 model described in Section 3.1, the authors have split the dataset with a proportion of 80% for the training set and 20% for the test set. Furthermore, the authors adopted the k-fold cross-validation method with
Table 1 reports the confusion matrices of the averaged classification results, expressed in percentages, for the models trained with the raw GPR samples dataset. The table also reports the overall accuracy of each level and the per-class precision, recall, and F1-score. It is worth noting that every level achieved a good accuracy above 90%, with a peak of 98.30% in level 5 and a minimum of 90.40% in level 2b. Averaging the accuracies of all levels, the ResNet-50 model trained with the raw dataset, i.e. without any Fourier pre-processing, reached a global classification accuracy of 94.51%. On the other hand, Table 2 reports the confusion matrices of the averaged classification results, expressed in percentages, for the models trained with the bi-dimensional Fourier pre-processed GPR samples dataset. In this case, level 2b stands out for the worst accuracy, stuck at 76.30%. In the other levels, however, the ResNet-50 achieved a good accuracy above 85% in virtually all cases, with a peak of 90.55% in level 6. Averaging the accuracies of all levels, the ResNet-50 model trained with the bi-dimensional Fourier pre-processed dataset reached a global classification accuracy of 85.60%, about 8.91 percentage points below the global accuracy of the ResNet-50 model trained with the raw dataset. These results show that, notwithstanding the expected advantages of Fourier pre-processing of the GPR sample images for the convolution operation, the ResNet-50 model is not able to reach the accuracy levels of the previous case, i.e. training with the raw GPR dataset. In light of these results, the authors suppose that the Fourier pre-processing probably introduced an excessive information compression, producing images that are too similar to each other, with detrimental effects on the classification accuracy.
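For reference, the metrics reported in the tables follow from a binary confusion matrix through standard definitions, and the global accuracy is the plain average over the levels. The sketch below uses made-up counts, not values from Tables 1 or 2, and the function names are assumptions.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1-score of the positive class."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def global_accuracy(level_accuracies):
    """Unweighted mean of the per-level accuracies."""
    return sum(level_accuracies) / len(level_accuracies)

# Hypothetical confusion matrix counts, for illustration only.
metrics = binary_metrics(tp=90, fn=10, fp=5, tn=95)
```

Note that an unweighted mean gives every level the same weight regardless of how many test samples it contains, which is how the global figures in the text appear to be computed.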
To check for possible overfitting during the training of the ResNet-50 models with and without the Fourier pre-processed dataset, the convergence curves are reported in Appendix A in Fig. A.1. These graphs show the trend of the loss, the accuracy, the validation loss, and the validation accuracy over the training iterations. Since each level accounts for 10 different trained models because of the k-fold cross-validation procedure, the authors plotted the average curves over the 10 models. However, to retain the variability among the ten models, the shaded area around the average curve represents the envelope between the maximum and minimum curves. Excluding level 1, in which a slightly increasing trend of the average validation loss appears around iteration 400, the ResNet-50 trained with the raw dataset shows an overall excellent behavior without any evidence of overfitting. Concerning the convergence curves of the ResNet-50 model trained with the Fourier pre-processed GPR images dataset, a noticeable overfitting problem is evident in level 2b from around iteration 50, which explains the poor classification accuracy of that level, as illustrated in Table 2. Moreover, slight overfitting is visible in level 1 from around iteration 400 and in level 4 from around iteration 80.
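The averaging of the ten fold-wise curves and the min-max envelope described above can be sketched as follows. The fold generator, the function names, and the synthetic loss curves are assumptions for illustration. How the k-fold procedure was combined with the 80%/20% split is not specified here and is not modeled.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

def summarize_folds(curves):
    """Mean curve and min/max envelope over folds.

    curves: array-like of shape (n_folds, n_iterations).
    """
    curves = np.asarray(curves, dtype=float)
    return curves.mean(axis=0), curves.min(axis=0), curves.max(axis=0)

# Synthetic validation-loss curves for 10 folds and 50 iterations (made-up data).
rng = np.random.default_rng(1)
iterations = np.arange(1, 51)
fold_curves = 1.0 / iterations + 0.05 * rng.random((10, 50))
mean_curve, lower, upper = summarize_folds(fold_curves)
```

Plotting `mean_curve` with the band between `lower` and `upper` shaded would reproduce the kind of representation described for Fig. A.1.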
Classification results for EfficientNet-B0
Regarding the EfficientNet-B0 model described in Section 3.2, similarly to before, the authors have split the dataset with a proportion of 80% for the training set and 20% for the test set. In a similar manner, the authors adopted the k-fold cross-validation method also for this convolutional model with
Conversely, Table 4 reports the confusion matrices of the averaged classification results, expressed in percentages, for the EfficientNet-B0 models trained with the bi-dimensional Fourier pre-processed GPR samples dataset. In the present case, level 2b showed, once again, the worst accuracy, stuck at 73.87%, i.e. 7.14 percentage points below the counterpart EfficientNet-B0 trained with the raw dataset. In the other levels, however, the EfficientNet-B0 achieved a good accuracy above 80% in virtually all cases. The maximum accuracy of 93.06% was reached in level 3. Averaging the accuracies of all levels, the EfficientNet-B0 model trained with the bi-dimensional Fourier pre-processed dataset reached a global classification accuracy of 85.94%, about 5.75 percentage points below the global accuracy of the same model trained with the raw dataset. Even in this case, the obtained results indicate that the bi-dimensional Fourier pre-processing had detrimental effects on the classification accuracy. Both the ResNet-50 and the EfficientNet-B0 models exhibit a worse classification behavior with the bi-dimensional Fourier pre-processed dataset, despite the expected beneficial effects on the computation of the convolution operation.
To check for possible overfitting during the training of the EfficientNet-B0 models with and without the Fourier pre-processed dataset, the convergence curves over the training iterations are reported in Appendix A in Fig. A.2. Although the EfficientNet-B0 models trained with the raw dataset do not show clear signs of overfitting, level 2b revealed a barely noticeable increasing trend of the average validation loss around iteration 100. Concerning the convergence curves of the EfficientNet-B0 model trained with the Fourier pre-processed GPR images dataset, slight overfitting is evident in level 1 from around iteration 400, in level 2b from around iteration 80, and in level 4 from around iteration 150.
Classification results for ViT
Comparative analysis of the various DL models’ classification accuracy with and without Fourier pre-processing among the classification levels.
Concerning the ViT model described in Section 3.3, the authors have split the dataset with a proportion of 90% for the training set and 10% for the test set. Furthermore, because of the nearly prohibitive computational cost of training the ViT model from scratch, the authors adopted a pre-trained model and fine-tuned only the head of the network, as illustrated in Fig. 5. Because of the same computational demands, the k-fold cross-validation method has not been employed with the transformer models of the present study. Table 5 reports the confusion matrices of the classification results expressed in absolute terms, i.e. the number of samples from the test set of the raw GPR samples dataset predicted for each class. The table also reports the overall accuracy of each level and the per-class precision, recall, and F1-score. It is worth noting that every level achieved excellent accuracy above 94%, with a peak of 100.00% in level 3 and a minimum of 95.42% in level 2b, which was also the worst level for the above-mentioned convolutional models. Averaging the accuracies of all levels, the ViT model trained with the raw dataset, i.e. without any Fourier pre-processing, reached a global classification accuracy of 98.10%. On the other hand, Table 6 reports the confusion matrices of the classification results for the ViT models trained with the bi-dimensional Fourier pre-processed GPR samples dataset. In this case, the worst level is the first one, with an accuracy of 86.14%. In the other levels, the ViT still achieved a good accuracy greater than 90% in virtually all cases, with a noticeable maximum of 99.07% in level 6.
However, averaging the accuracies of all levels, the ViT model trained with the bi-dimensional Fourier pre-processed dataset reached a lower global classification accuracy of 93.65%, i.e. a reduction of 4.45 percentage points with respect to the counterpart ViT trained with the raw dataset. Again, these results show that, notwithstanding the expected advantages of Fourier pre-processing of the GPR sample images, the ViT model is also unable to reach the accuracy levels obtained with the raw GPR dataset. Since ViT, unlike CNNs, is not based on the convolution operation, the obtained results strengthen the authors’ hypothesis of an excessive information compression produced by the Fourier pre-processing procedure, with fairly deleterious effects on the classification capacity of the analyzed DL models.
To check for possible overfitting during the training of the ViT models, with and without the Fourier pre-processed dataset, the convergence curves are reported in Fig. A.3 of Appendix A. These graphs show the trend of the loss, the accuracy, the validation loss, and the validation accuracy during the training epochs. The convergence curves do not always reach the maximum of 20 epochs because an early-stopping criterion was adopted: training is interrupted when no further improvement occurs, both to save computational resources and to avoid overfitting. Although the validation loss curves appear quite noisy during the training epochs, their globally descending trends indicate that the ViT model trained with the raw GPR images dataset does not incur overfitting at any level. For the ViT models trained with the Fourier pre-processed dataset, the validation curves reveal overfitting in level 1 and level 4, and slight evidence of it in level 3; they also appear noisier than in the previous case.
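The early-stopping behaviour described above can be sketched as follows. This is only an illustration under stated assumptions: the text specifies the 20-epoch cap and the use of early stopping, but not the monitored quantity, the patience, or the tolerance, so monitoring the validation loss with `patience=3` and `min_delta=0.0` is hypothetical, as is the function name.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0, max_epochs=20):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs, or when the
    epoch cap is reached.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best - min_delta:
            best = loss   # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement in this epoch
            if wait >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Hypothetical noisy validation-loss history (not data from the paper).
noisy = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6]
stop = early_stopping_epoch(noisy, patience=3)
print(stop)
```

With a noisy validation curve, a small patience may halt training before a later improvement is seen, which is one reason convergence curves can end well before the epoch cap.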
Global average accuracy for the three analyzed neural models
In this closing section, the authors compare the results of the various trained DL models. Figure 6 provides a comparative overview of the obtained accuracy results, organized by GPR defect classification level. The graph is arranged according to the three analyzed DL models and is depicted as two juxtaposed histograms, one for training with the raw dataset and one for training with the bi-dimensional Fourier pre-processed dataset. At first sight, the ViT architecture delivered the highest accuracy values among the DL models for virtually all levels, both with and without Fourier pre-processing. However, the ResNet-50 reached an accuracy of 88.25% with the Fourier pre-processed dataset, thus exceeding the ViT model. As evidenced by the convergence curves, the ViT trained with Fourier pre-processed images shows slight overfitting in level 1. Together with the excessive data compression of the Fourier operation, as visually demonstrated in Fig. 2, this led the ViT model to produce the worst accuracy in level 1 with Fourier pre-processing compared with the other models. The EfficientNet-B0 model produced the worst results overall, in almost all levels and in both cases under comparison. On closer inspection, however, the ResNet-50 provided the worst results in level 1 with the raw images dataset, and in levels 2a and 6 in the Fourier pre-processed case. It is worth mentioning that all three DL models generally struggled to reach high accuracy values in level 2b. A closer inspection of the convergence curves reported in the appendix reveals overfitting issues in the ResNet-50 with the Fourier pre-processed dataset, in the EfficientNet-B0 in both analyzed cases, and in the ViT model with the Fourier pre-processed dataset. 
The difficulties in level 2b may be related to the critical imbalance in the number of GPR image samples between classes C5 and C6. It is worth recalling that samples belonging to class C5 may contain multiple overlaid defects or other specific patterns that are not directly interpretable with respect to the standard tunnel lining defects classification of Fig. 4. It is therefore not possible to exclude a priori that parts of those special patterns sometimes resemble specific defect patterns belonging to class C6. Those parts of class C5 samples may thus mislead the neural models and produce misclassifications. Another possible reason is a critically high degree of similarity between the images of the two classes C5 and C6. This is especially plausible in the Fourier case, which may produce overly similar images due to excessive compression of the information.
Finally, Table 7 reports the global average accuracy results across the various levels. It is worth noting the average accuracy reductions of the three DL models between the raw dataset case and the Fourier pre-processed dataset case. The ViT model recorded the lowest average accuracy reduction, equal to 4.45%, whereas the EfficientNet-B0 exhibited a reduction of 5.75%. The highest reduction, 8.91%, was suffered by the ResNet-50 model. Although the ResNet-50 is the second-best model in terms of accuracy with the raw dataset, it appeared to be the architecture least robust to the effects of the Fourier pre-processed dataset, delivering the largest average accuracy reduction.
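The arithmetic behind these reductions can be reproduced with a short sketch. Only figures quoted in the text are used: the two global ViT accuracies (98.10% and 93.65%) and the three average reductions. The helper name `accuracy_drop` is hypothetical. The ViT figure matches an absolute difference between the two averages, i.e. percentage points rather than a relative change.

```python
def accuracy_drop(raw_acc, fourier_acc):
    """Absolute accuracy reduction, in percentage points."""
    return round(raw_acc - fourier_acc, 2)


# Global ViT accuracies quoted in the text (raw vs. Fourier pre-processed).
vit_drop = accuracy_drop(98.10, 93.65)

# Average reductions quoted in the text for the three models.
reductions = {"ViT": 4.45, "EfficientNet-B0": 5.75, "ResNet-50": 8.91}

# Rank models from most to least robust to Fourier pre-processing.
ranking = sorted(reductions, key=reductions.get)

print(vit_drop)
print(ranking)
```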
This paper focuses on GPR testing of tunnel lining profiles using a DL-based image recognition framework. The authors compare the performance of three DL models for indirect classification of tunnel defects. Tunnel monitoring with innovative NDT methods is now widespread and demands increasing automation from DL methods. The authors adopted a hierarchical binary classification approach to group the types of defects identified in the GPR profiles. The main findings of this paper can be summarized as follows:
Three DL models have been employed: two convolutional models, i.e. the ResNet-50 and the EfficientNet-B0, and a recent transformer architecture, i.e. the ViT. The authors trained all the models with two different datasets, prepared so as to compare the effects of a common, widespread image processing technique, i.e. the bi-dimensional Fourier transform. The Fourier pre-processing of GPR images caused a significant accuracy reduction compared to the raw dataset. Therefore, despite its computational advantages, Fourier pre-processing introduced excessive data compression. The related information loss leads to overly similar images, with detrimental effects on the final classification accuracy. The ViT model delivered the highest classification accuracy values for virtually all levels, both with and without Fourier pre-processing.
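For readers unfamiliar with the pre-processing step summarized above, the following minimal NumPy sketch shows one common way to turn a grayscale image into a bi-dimensional Fourier representation. It rests on assumptions: using the centred log-magnitude spectrum, the min-max normalization, and the name `fourier_preprocess` are all illustrative, and the random array is only a stand-in for a GPR sample. It is not necessarily the exact pipeline used in this study.

```python
import numpy as np


def fourier_preprocess(image):
    """Map a 2-D grayscale array to its centred log-magnitude spectrum in [0, 1]."""
    image = np.asarray(image, dtype=np.float64)
    # 2-D FFT, with the zero-frequency component shifted to the centre.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # Log scaling compresses the large dynamic range of the magnitudes.
    log_mag = np.log1p(np.abs(spectrum))
    span = log_mag.max() - log_mag.min()
    if span == 0:
        return np.zeros_like(log_mag)
    return (log_mag - log_mag.min()) / span


# Synthetic stand-in for a GPR image patch (not real data).
rng = np.random.default_rng(0)
sample = rng.random((64, 64))
processed = fourier_preprocess(sample)
print(processed.shape, processed.min(), processed.max())
```

Under this assumption only the magnitude is retained and the phase is discarded, which is one way such a representation can lose spatial information and make different samples look alike.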
The current AI-aided approach for indirect GPR tunnel monitoring mainly deals with the defect classification and detection task, which is the first level of an ideal SHM paradigm [93]. Future research efforts will be directed towards the remaining three SHM steps, i.e. damage localization, damage severity quantification, and the assessment of the actual structural safety state. The primary purpose of SHM is to provide a reliable and exhaustive diagnosis of existing structures and infrastructures [93]. A promising research path in that direction may naturally leverage the attention maps provided by the outputs of transformer models. Further developments addressing defect localization may also leverage the potential of the object detection task, e.g. employing a Faster R-CNN. Future improvements may also involve other compressive sensing techniques and transforms, e.g. wavelets [94, 95, 96].
Acknowledgments
Computational resources provided by hpc@polito (
Appendix A. Convergence curves
Loss versus accuracy during the training iterations. (a-g) ResNet-50 trained with raw images. (h-n) ResNet-50 trained with Fourier pre-processed images.
Loss versus accuracy during the training iterations. (a-g) EfficientNet-B0 trained with raw images. (h-n) EfficientNet-B0 trained with Fourier pre-processed images.
Loss versus accuracy during the training iterations. (a-g) ViT trained with raw images. (h-n) ViT trained with Fourier pre-processed images.
