Application of data augmentation techniques in tunnel health assessment

Abstract

Tunnel health assessment is an important part of ensuring the structural safety and extending the service life of tunnels. However, limited by the problems of insufficient data and class imbalance in the monitoring of tunnel defects, the model may face the prediction bias during the training process. Therefore, this study introduces a tunnel health assessment method based on data augmentation to improve the classification performance and generalization ability of the model. First, the defects monitoring data of the left line of the Huilongshan Tunnel in Shaoguan City, Guangdong Province were collected, a true dataset containing 13 defect indicators was established, and preprocessing operations such as feature transformation, outlier detection and handling, missing value filling, and normalization were performed on it. Then, three data augmentation methods, CTGAN, SMOTE, and CVAE, were used to augment the dataset to generate the synthetic datasets, D_s1, D_s2, and D_s3, respectively. The similarity between the synthetic datasets and the true dataset was assessed using statistical methods, including statistical indicators, boxplots, and Q-Q plots. The effectiveness of data augmentation was then validated using three machine/deep learning models, BP neural networks, SVM, and XGBoost. The experimental results show that the synthetic dataset D_s1 generated by CTGAN performed the best in terms of accuracy (98.47%), precision (98.05%), recall (98.10%), and F1 score (98.06%), significantly improving the model’s classification performance while effectively mitigating the problems of insufficient data and class imbalance. Overall, this study demonstrates the superiority of the CTGAN method in tunnel health assessment tasks and provides a reliable data augmentation solution for tunnel health assessment.

Keywords

data augmentation tunnel health assessment CTGAN machine learning

Introduction

With the speedy development of transportation infrastructure, the number of tunnel constructions has been increasing. Tunnels face various defects during actual operation, such as cracks, water leakage, and lining damage, which threaten the structural safety and long-term stability of the tunnel.^1–3 Therefore, regular defect monitoring and health assessment have become an important part of tunnel maintenance, which is crucial to ensure the safe operation of tunnels and extend their service life. However, the problems of insufficient data and class imbalance in the monitoring of tunnel defects often affect the performance of health assessment models.⁴ Due to limitations in monitoring data collection, such as constraints from equipment and time, some health levels have limited data. This leads to difficulties for the assessment model in sufficiently learning the characteristics of minority classes during training, ultimately affecting the accurate assessment of tunnel health.

Machine learning and deep learning models have increasingly been utilized for tunnel health assessment in recent years.^5,6 Researchers have introduced algorithms such as SVM,⁷ Random Forest (RF),⁸ and XGBoost⁹ for tunnel health level prediction. In addition, CNN, DNN, and LSTM have also been used to model tunnel health status to improve prediction accuracy.¹⁰

Although these models have achieved certain results, they are highly dependent on large-scale and high-quality training data. Due to the limitation of monitoring equipment, high cost of data collection, and scarcity of samples of certain health levels, the true data often cannot meet the model training requirements, thus limiting the generalization ability of models. Especially in cases of class imbalance, models may tend to predict the majority class, neglecting the features of minority classes, which affects the reliability of defect assessment.¹¹

In order to address the problems of insufficient data and class imbalance, data augmentation techniques have become an essential approach to improving model performance. Common augmentation methods include data interpolation-based methods, such as SMOTE,¹² and generative model-based methods, such as Variational Autoencoder (VAE)^13,14 and Generative Adversarial Network (GAN).^15,16 Among them, GAN has made significant progress in various fields, including image, text, and video.^17,18 CTGAN (Conditional Tabular Generative Adversarial Network) is a derivative model of GAN tailored for tabular data and has been employed for oversampling of tabular data¹⁹ and in predicting multi-axis fatigue life.²⁰ VAE as a probabilistic generative model, can model data distributions.¹³ CVAE (Conditional Variational Autoencoder) builds upon VAE, designed to generate samples under specific conditions in the data space.²¹

This study aims to apply CTGAN, SMOTE, and CVAE techniques to tunnel health assessment, addressing the problems of insufficient data and class imbalance in the monitoring of tunnel defects through data augmentation. By generating synthetic datasets, the scale of training data is extended, especially supplementing samples of the minority class, thereby enhancing the classification performance and generalization ability of the health assessment model. This study provides scientific decision support for tunnel management authorities to help them effectively remediate tunnel defects and ensure the long-term safe operation of tunnels.

Dataset establishment

Data collection

In this study, the site monitoring data of defects on the left line of the Huilongshan Tunnel in Shaoguan City, Guangdong Province were collected. The starting mileage of the left line of Huilongshan Tunnel is K003+790, and the ending mileage is K004+956, spanning 1166 m. In order to carry out tunnel health assessment, this study divided the left line of the tunnel into 117 assessment units with 10 m as a unit of assessment, and the monitoring data of each unit contains 13 defect indicators. These defect indicators served as characteristic variables, specifically including: crack length, crack width, crack depth ratio, cavity longitudinal length, cavity depth, cavity circumferential range, water leakage status, water leakage pH value, deformation rate, inner limit ratio, lining strength ratio, lining thickness ratio, and protective layer thickness ratio. The target variable was the health level of each assessment unit of the tunnel, which was categorized into four levels of tunnel health according to the severity of the defects: Level 1 (no or minor defects), Level 2 (average defects), Level 3 (more severe defects), and Level 4 (severe defects). Finally, the true dataset contained 13 feature variables and 1 target variable, totaling 117 samples.

Data preprocessing

Since true data may contain noise, missing values, outliers, and other issues that can affect the effectiveness of data augmentation and the model training. Therefore, it is necessary to perform effective data cleaning and transformation to guarantee the integrity of the augmented data and the accuracy of model training.²² For the true dataset, this section sequentially performed preprocessing operations, including feature transformation, outlier detection and handling, missing value filling, and normalization.

Feature transformation

Of the defect indicators, water leakage status is qualitative. Based on factors such as leakage location, pressure, and flow rate, it is divided into four levels: infiltration, dripping, flowing, and spraying,²³ as shown in Table 1. To accommodate the numerical requirements of data augmentation, qualitative indicators need to be transformed into quantitative indicators. The levels of seepage, dripping, flowing, and spraying corresponded to Levels 1, 2, 3, and 4, respectively, and were assigned values of 0.125, 0.375, 0.625, and 0.875. No water leakage defect was assigned a value of 0.

Table 1.

Assessment standards for water leakage status.

Leakage location	Water leakage status
Leakage location	Spraying	Flowing	Dripping	Seepage
Crown	4	3	2	1
Side wall	3	2	2	1
Impact on traffic	Yes	Yes	Yes	No

Accordingly, when there is no water leakage defect, there is also no water leakage pH value, which needs to be reasonably filled and cannot simply be assigned a value of 0. The impact level of the water leakage pH value on tunnel lining corrosion is divided into four levels,²⁴ with the assessment standards shown in Table 2. The health level corresponding to no water leakage defect is Level 1, with a pH value range of 6.1–7.9; missing values were randomly filled within this range.

Table 2.

Assessment standards for water leakage pH value.

Health level	pH value	Effect on concrete
4	<4.0	Cement dissolution and spalling
3	4.1∼5.0	Uneven surface within a short time
2	5.1∼6.0	Prone to surface defect
1	6.1∼7.9	Attention needed during early use of concrete

Outlier detection and handling

Outliers are data points that significantly deviate from the overall trend of the data, potentially caused by anomalies, measurement errors, or extreme phenomena. A boxplot is a simple and widely used statistical chart that visualizes data distribution characteristics by displaying the five-number summary of the data. It can effectively detect outliers.²⁵

Outlier detection was conducted on the 13 defect indicators, with the results presented in Figure 1. It can be seen that, except for the crack depth ratio and water leakage pH value indicators, all other indicators exhibited varying degrees of outliers. Among them, indicators such as cavity longitudinal length and cavity depth showed a relatively high number of outliers. For the detected outliers, an initial manual review was conducted to retain significant outliers while removing obviously erroneous data. Subsequently, the remaining outliers were replaced using the extreme value method, where data pointed smaller than the lower limit value were replaced with the lower limit value, and those greater than the upper limit value were replaced with the upper limit value.²⁶

Figure 1.

Boxplot detection results.

Missing value filling

After removing significantly deviating outliers, some of the samples showed missing values. If the samples containing missing values were directly deleted, it will reduce the number of samples, thus affecting the effect of data augmentation. Therefore, the missing values need to be filled reasonably.²⁷ In order to ensure the data quality, the range of missing value filling was limited to samples containing at least 7 defect indicators, and samples with less than 7 defect indicators were deleted to avoid filling bias caused by too many defects. After removing the outliers, the data distribution is shown in Table 3. There were 10 samples with missing values, among which 1 sample contains 6 defect indicators, which didn’t meet the filling requirements and is deleted, and the remaining 9 samples met the filling requirements.

Table 3.

Data distribution after outlier removal.

Number of samples	Defect indicators
Number of samples	1	2	3	4	5	6	7	8	9	10	11	12	13
107	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
5	✓		✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
1	✓		✓	✓		✓	✓	✓	✓	✓	✓	✓	✓
1	✓	✓	✓			✓	✓	✓	✓	✓	✓	✓	✓
1		✓	✓	✓		✓	✓	✓	✓	✓	✓	✓	✓
1	✓	✓	✓	✓	✓	✓	✓	✓		✓
1	✓		✓				✓	✓	✓	✓

1-crack length; 2-crack width; 3-crack depth ratio; 4-cavity longitudinal length; 5-cavity depth; 6-cavity circumferential range; 7-water leakage status; 8-water leakage pH value; 9-deformation rate; 10-inner limit ratio; 11-lining strength ratio; 12-lining thickness ratio; 13-protective layer thickness ratio.

According to Table 3, missing values were present in 9 defect indicators, namely, crack length, crack width, cavity longitudinal length, cavity depth, cavity circumferential range, deformation rate, lining strength ratio, lining thickness ratio, and protective layer thickness ratio. The data distribution histograms for each indicator are shown in Figure 2. Observing the histograms, it can be seen that crack length, crack width, cavity longitudinal length, cavity depth, cavity circumferential range, and deformation rate had more obvious bias or outliers, and the use of median filling can reduce the influence of these extreme values. The data distribution of lining strength ratio, lining thickness ratio and protective layer thickness ratio was more centralized, and the mean value can better represent the overall data trend.

Figure 2.

Histograms of data distribution for each indicator.

Normalization

In order to ensure a consistent range of values for different features and to avoid feature imbalance during data augmentation due to different magnitudes, the data were usually normalized by Min-Max method, scaling the feature values to the range of [0, 1].

Analysis of data class distribution

The preprocessed dataset contained 13 feature variables and 1 target variable, totaling 116 samples. The target variable represented the health level of each tunnel assessment unit. The statistical results of the health level distribution are shown in Table 4. As observed from the table, the distribution of health levels was highly imbalanced. Samples with Level 1 accounted for the majority (75.86%), while Level 3 and Level 4 were severely underrepresented, with Level 4 completely absent. Additionally, the total number of samples was relatively small (only 116), which may impact the generalization performance of the subsequent model training and affect the stability of classification results.

Table 4.

Health level distribution statistics of Huilongshan Tunnel.

Health level	Number of samples (count)	Percentage (%)
1	88	75.86
2	24	20.69
3	4	3.45
4	0	0

Methodology

Study step

This study utilized three data augmentation techniques—CTGAN, SMOTE, and CVAE—to address the problems of insufficient data and class imbalance. First, the preprocessed tunnel defect data was analyzed to identify the problems of insufficient data and class imbalance. Next, the data augmentation techniques (CTGAN, SMOTE, and CVAE) were applied to the preprocessed dataset to generate synthetic datasets, and statistical analysis was performed to assess the quality of the synthetic data. Finally, machine/deep learning models were used to validate that the data augmentation techniques effectively addressed the problems of insufficient data and class imbalance.

Data augmentation

In this study, the defects monitoring data of the left line of the Huilongshan Tunnel was used as the true dataset D_t, which contains a total of 117 samples. Before data augmentation, the true dataset D_t was preprocessed to obtain the preprocessed true dataset D_t', which contains 116 samples. Data augmentation techniques were then applied to D_t' to generate synthetic datasets D_s (with synthetic datasets generated by CTGAN, SMOTE, and CVAE being D_s1, D_s2, D_s3, respectively), resulting in a total of 800 samples.

Similarity validation

To validate the quality of the synthetic dataset D_s, statistical methods were applied to perform a similarity analysis between D_t' and D_s. The descriptive statistics and distribution characteristics of the two datasets were compared to ensure the effectiveness of the synthetic data.

Tunnel health assessment model training and evaluation

After confirming the effectiveness of the synthesized data, the datasets D_t' and D_s were split into a training set at a ratio of 70% and a test set at a ratio of 30%, respectively, for training and testing the machine/deep learning models. The performance evaluation on the test set was conducted to verify whether data augmentation effectively alleviated the problems of insufficient data and class imbalance, and to assess whether it improved the model’s generalization ability and evaluation accuracy.

Data augmentation techniques

CTGAN

CTGAN is an extension of the GAN specifically designed to generate synthetic data that adheres to certain conditional constraints.²⁸ CTGAN consists of a conditional generator (G) and a discriminator (D), and it generates synthetic data that has a similar distribution to true data through adversarial training. The conditional generator (G) takes a batch of random vectors generated from random noise as input, and, in combination with the conditional information, generates corresponding synthetic data. The discriminator (D) is responsible for determining whether the input data is true or synthetic. By comparing the distributions of true data and synthetic data, the discriminator calculates its loss function and updates the model parameters.²⁹

The core of CTGAN lies in the design of a conditional adversarial training mechanism. To ensure that the generated data is both truthfulness and adheres to specific conditional constraints, CTGAN adopts a loss function based on WGAN-GP (Wasserstein GAN with Gradient Penalty).³⁰ Specifically, the discriminator’s loss function seeks to maximize the difference between the distributions of true data and generated data. The generator’s loss function, on the other hand, minimizes the objective function, making the synthetic data as close as possible to the true data and satisfying the constraints of the conditional information.

SMOTE

SMOTE is an oversampling technique based on interpolation between samples, commonly used to address problems of class imbalance, particularly when there is a shortage of minority class samples in classification tasks. SMOTE works by generating new synthetic samples between existing samples of the minority class, thereby expanding its distribution. This approach helps to equalize the class proportions in the dataset, enhancing the model’s capability to identify the minority class.^12,31 Specifically, SMOTE works in the following steps:

(1) For each minority class sample, the Euclidean distance between it and other samples in the dataset is computed to determine the k nearest neighbors;

(2) A random sample is chosen from the k nearest neighbors, and interpolation is performed between the current sample and the selected neighbor to generate a new synthetic sample;

(3) The procedure continues until enough synthetic samples are generated to reach the desired number of minority class samples.

Through the interpolation-based generation strategy, SMOTE effectively alleviates the class imbalance problem. However, since the sample generation process only considers linear relationships between samples, for datasets with more complex distributions, the synthetic samples generated by SMOTE may deviate from the true data distribution.

CVAE

CVAE is a deep learning model that adopts an autoencoder architecture and combines variational inference methods for probabilistic modeling and generating new datasets. Unlike traditional autoencoders, CVAE introduces latent variable distribution assumptions and additional conditional information during the encoding process. The goal is to learn the probabilistic distribution of the conditional latent space.³² The structure of CVAE consists of three parts: the encoder, latent space, and decoder. The encoder receives input data along with conditional information, maps them to the parameters of the latent space, and generates the probabilistic distribution of the latent variables. The latent space generates latent variables as compressed representations of the input data. The decoder reconstructs the input data based on latent variables sampled from the latent space and the conditional information.³³

During CVAE training, the goal is to minimize the loss function, ensuring that the generated latent space accurately reflects the structure of the input data and utilizes conditional information to generate data that matches specific classes. The loss function in CVAE consists of two components. The first component is the reconstruction loss, which measures the discrepancy between the input data and the reconstructed data, guiding the decoder to generate samples that resemble the original data. The second component is the KL divergence, which assesses the discrepancy between the latent variable distribution and the standard normal distribution. This encourages the latent space distribution to approach the standard normal distribution while ensuring that the generated data adheres to the given conditional information.³³

Evaluation criteria

Once the synthetic dataset is generated, the quality of the synthetic data is validated using statistical methods and machine/deep learning models to ensure that the synthetic data effectively addresses the problems of insufficient data and class imbalance in tunnel health assessment.²⁹

Statistical Methods include descriptive statistics and comparison of distribution features. For descriptive statistics, the minimum, maximum, mean, and standard deviation of both true and synthetic datasets are computed, and boxplots are drawn to reflect the data’s dispersion. For distribution features, Q-Q plots are used to visually analyze the distribution of each defect indicator in both true and synthetic data. The Q-Q plot is a common tool for testing normality, with the x-axis representing the theoretical normal distribution quantiles and the y-axis representing the normalized actual data quantiles. Based on tunnel health grades, this method visually verifies whether the distribution of synthetic data for each class matches that of the true data.

In addition, machine/deep learning methods can assess whether the synthetic dataset enables the creation of a more effective predictive model compared to the true dataset. The performance of these models serves as a measure to validate the effectiveness of the synthetic data. In this study, three machine/deep learning models were used to achieve this goal. The specific models are described below.

BP neural network

The BP neural network model employs a Multi-Layer Perceptron (MLP) structure, consisting of multiple fully connected layers. It introduces nonlinearity through the ReLU activation function and maps the input features to classification results layer by layer. During the training process, the cross-entropy loss function is selected as the objective to measure the difference between the model’s predicted probability distribution and the true labels. By minimizing the loss function, the model progressively adjusts the network parameters and updates the weights based on feedback information in each iteration until training is complete.^34,35

SVM

SVM is a highly effective machine learning algorithm, particularly effective in solving nonlinear and high-dimensional data pattern recognition problems. In classification tasks, SVM operates by identifying an optimal hyperplane that maximizes the separation between data points from different classes, thus achieving effective classification and improving the model’s generalization ability.⁷ The key hyperparameters of SVM are the kernel function and the penalty coefficient. The kernel function maps the data from a lower-dimensional space to a higher-dimensional space, while the penalty coefficient is used to balance the model’s complexity with the training error.

XGBoost

XGBoost is an improved version of the GBDT, which iteratively builds multiple decision trees through the Boosting method to correct the prediction errors of previous models.³⁶ XGBoost minimizes prediction errors via a gradient descent-based optimization process. This process minimizes the objective function, which is composed of a loss function and regularization terms. The loss function measures the error between the predicted and true values, assessing the model’s fit to data. The regularization term controls the number of leaf nodes and their weights, balancing the model’s complexity and generalization ability.

To evaluate the performance of the above models, four evaluation metrics are used: accuracy, precision, recall, and F1 score. These four metrics are widely adopted and well-suited for multi-class classification tasks. Accuracy provides an overall performance measure, while precision, recall, and F1 score offer detailed insights into classification effectiveness. In contrast, metrics such as AUC-ROC and MCC are typically designed for binary classification and are less interpretable in the multi-class context of tunnel health assessment. Therefore, we adopted these four metrics to ensure consistency and interpretability across the evaluation process.

Results and analysis

Design parameters of data augmentation models

CTGAN model

In this study, the CTGAN model was used for data augmentation, and the model was trained using the Python library ctgan. The model design parameters are as follows:

• epochs = 5000: The model is trained for 5000 epochs;

• batch_size = 100: In each iteration, 100 samples will be randomly selected from the data;

• generator_lr = 2e-4, discriminator_lr = 2e-4: The learning rates for both the generator and discriminator are 2e-4;

• generator_dim = (256, 256), discriminator_dim = (256, 256): Both the generator and discriminator networks consist of two layers, each containing 256 neurons.

SMOTE model

In this study, the SMOTE model was used for data augmentation, implemented using the SMOTE class from the imblearn library. The model design parameters are as follows:

• k_neighbors = 4: The number of nearest neighbor samples for generating synthetic samples is 4;

• samples_strategy = {1: 488, 2: 264, 3: 164}: Specifies the number of samples to be added to each class. Class 1 is increased by 400 samples, Class 2 by 240 samples, and Class 3 by 160 samples.

CVAE model

In this study, the CVAE model was used for data augmentation, implemented using the PyTorch library. The model design parameters are as follows:

• nput_dim = 13: The feature variables are 13;

• num_classes = 3: The target variable consists of 3 classes;

• fc1 = 256, fc2 = 128, fc3 = 128, fc4 = 256: The encoder contains two fully connected layers: 256 neurons in the first and 128 neurons in the second. The decoder also contains two fully connected layers: 128 neurons in the first and 256 neurons in the second;

• epochs = 100: The model is trained for 100 epochs;

• optimizer = Adam (lr = 0.0005): The Adam optimizer is used with a learning rate of 0.0005.

Validation of synthetic datasets

Based on statistical methods

As described in “Study step”, this study applied three data augmentation techniques to the preprocessed true dataset D_t' (116 samples), generating the synthetic dataset D_s (800 samples). Therefore, this section primarily compared the statistical features of D_t' and D_s, specifically the comparison of descriptive statistics and distribution characteristics. Since the dataset contains 13 damage indicators, only a subset of representative indicators is presented in this study.

Tables 5 and 6 present the statistical indicators for the true dataset D_t' and the synthetic datasets D_s (D_s1, D_s2, D_s3). From the tables, it can be observed that there are certain differences in the statistical indicators between the two datasets, with the performance varying across different indicators. Specifically, for crack length and crack depth ratio, the mean and standard deviation of D_s3 were closer to the true data; for cavity depth, the mean and standard deviation of D_s2 were similar to the true data; for lining strength ratio, the mean and standard deviation of D_s1 aligned better with the true data; for water leakage pH value, the mean of D_s3 was identical to the true data, though its standard deviation was slightly smaller; and for deformation rate, the mean of D_s1 was the closest to the true data, but its standard deviation showed greater dispersion. Overall, D_s3 performed better across multiple indicators and was closer to the statistical properties of the true data.

Table 5.

The statistical indicators of D_t'.

Defect indicators	Max	Mean	Std.
1	1	0.16	0.29
3	1	0.19	0.31
5	1	0.16	0.26
8	1	0.58	0.25
9	1	0.22	0.25
11	1	0.78	0.26

1-crack length; 3-crack depth ratio; 5-cavity depth; 8-water leakage pH value; 9-deformation rate; 11-lining strength ratio.

Table 6.

The statistical indicators of D_s.

Defect indicators	Min			Max			Mean			Std.
Defect indicators	D _s1	D _s2	D _s3	D _s1	D _s2	D _s3	D _s1	D _s2	D _s3	D _s1	D _s2	D _s3
1	0	0	0	1	1	1	0.36	0.42	0.34	0.37	0.33	0.34
3	0	0	0	1	1	1	0.36	0.55	0.33	0.36	0.36	0.33
5	0	0	0	1	1	1	0.32	0.26	0.37	0.36	0.29	0.27
8	0	0	0	1	1	1	0.48	0.49	0.58	0.27	0.27	0.17
9	0	0	0	1	1	1	0.35	0.48	0.37	0.31	0.34	0.26
11	0	0	0	1	1	1	0.66	0.51	0.54	0.27	0.34	0.22

Figure 3 presents the boxplots for the dataset D_t' and D_s. As shown in the figure, D_s1 performed the best cross multiple features. For crack length, crack depth ratio, and cavity depth, although Ds1 had a wider distribution range, its median aligned well with the true dataset, and no outliers were present. For water leakage pH value, deformation rate, and lining strength ratio, D_s1's distribution range closely matched that of the true dataset, with no significant outliers and lower data dispersion.

Figure 3.

The boxplots of D_t' and D_s. (a) D_t' and (b) D_s.

Figure 4 shows the Q-Q plots for the dataset D_t' and D_s. According to the plots, it was clear that D_s1 significantly improved the balance of the data’s class distribution and ensured that the generated data fitted the true dataset with high consistency. This was especially evident in features such as crack length and water leakage pH value, where the generated data exhibited smooth distributions that closely aligned with the statistical properties of the true data. In contrast, while D_s2 effectively increased the number of samples, its feature distribution showed higher dispersion, particularly in the fitting of the higher and lower percentiles. D_s3 appeared as three parallel straight lines, failing to capture the complexity present in the true data, resulting in overly simplified distributions that lack the ability to simulate the diversity of the true data. Overall, D_s1 better modeled the complexity of the true data, particularly excelling in terms of distribution balance and fit.

Figure 4.

The Q-Q plots of D_t' and D_s. (a) D_t', (b) D_s1, (c) D_s2, and (d) D_s3.

Based on machine/deep learning methods

As described in Section 3.3, this section used three machine/deep learning models (BP neural network, SVM, and XGBoost) to validate the effectiveness of the synthetic datasets. The true dataset D_t' and the synthetic dataset D_s were split into a training set at a ratio of 70% and a test set at a ratio of 30%, respectively. Grid search combined with 10-fold cross-validation is used for model training.

Tables 7 and 8 show the performance of the models on the dataset D_t' and D_s. Based on the performance results of each model on the different datasets, the synthetic dataset D_s1 performed the best, with an accuracy of 98.47%, precision of 98.05%, recall of 98.10%, and an F1 score of 98.06%. It significantly outperformed the true dataset and other synthetic datasets, proving that the synthetic data generated by the CTGAN method was highly effective in improving model performance, leading to a significant enhancement in the model’s classification ability.

Table 7.

The performance of the models on D_t'.

Defect indicators	BP neural network	SVM	XGBoost	Average
Accuracy (%)	97.14	94.29	91.43	94.29
Precision (%)	62.50	60.71	58.81	60.67
Recall (%)	66.67	61.90	57.14	61.90
F1 score (%)	64.44	61.30	57.78	61.17

Table 8.

The performance of the models on D_s.

Defect indicators	BP neural network			SVM			XGBoost			Average
Defect indicators	D _s1	D _s2	D _s3	D _s1	D _s2	D _s3	D _s1	D _s2	D _s3	D _s1	D _s2	D _s3
Accuracy (%)	98.75	98.33	97.92	98.75	97.92	97.50	97.92	97.08	97.50	98.47	97.78	97.64
Precision (%)	98.77	98.50	97.78	98.32	97.95	97.37	97.07	97.22	96.46	98.05	97.89	97.20
Recall (%)	98.62	97.95	96.67	98.47	97.56	96.00	97.21	96.53	96.76	98.10	97.35	96.48
F1 score (%)	98.67	98.18	97.10	98.39	97.72	96.50	97.13	96.80	96.59	98.06	97.57	96.73

Furthermore, the performance improvements from data augmentation varied across the three models. The BP neural network showed the most significant improvement when trained with the augmented datasets, particularly with D_s1 generated by CTGAN. This may be attributed to its reliance on large and diverse datasets to capture nonlinear feature relationships. SVM also benefited from data augmentation but appears more sensitive to the distribution of synthetic samples, possibly due to its dependence on clearly defined decision boundaries. XGBoost exhibited relatively stable performance across all datasets, reflecting its inherent robustness and ability to handle small or imbalanced datasets effectively. Among the three augmentation methods, CTGAN consistently enhanced model performance across all classifiers, whereas SMOTE and CVAE showed model-dependent effectiveness. These results suggested that both the choice of augmentation method and classifier affected the model’s sensitivity to synthetic data and overall performance.

Results

Based on the comprehensive analysis using statistical methods and model validation, the synthetic dataset D_s1 performed the best in terms of data augmentation and model training evaluation. Through the comparison of statistical indicators, boxplots, and Q-Q plots, D_s1 outperformed the other synthetic datasets (D_s2, D_s3) and the true dataset D_t' in simulating the statistical properties of true data and fitting the data. In the training and evaluation of machine/deep learning models, D_s1 achieved the best performance in accuracy, precision, recall, and F1 score, significantly improving the model’s classification performance. This validated that the CTGAN method effectively enhanced the model’s generalization ability and evaluation accuracy.

Moreover, the defect data was classified into three health levels: Level 1, Level 2, and Level 3. The data distribution for D_t' was highly imbalanced, with 75.86% (88 samples), 20.69% (24 samples), and 3.45% (4 samples) for the three classes, respectively. In contrast, D_s1 adjusted the data distribution to 50.75% (406 samples), 32.38% (259 samples), and 16.88% (135 samples), significantly increasing the number of samples in the minority classes. This effectively alleviated the problems of insufficient data and class imbalance.

In summary, the synthetic data generated by the CTGAN method not only improved the statistical rationality of the data but also enhanced the model’s classification performance and generalization ability. It provided a more stable and reliable data foundation for tunnel health assessment tasks.

Conclusion

This study applied three data augmentation techniques—CTGAN, SMOTE, and CVAE—to address the problems of insufficient data and class imbalance in tunnel health assessment. The results demonstrate the effectiveness of data augmentation in tunnel health assessment, particularly in improving model performance and data quality. The main conclusions are as follows.

First, applying data augmentation method proves to be an effective solution for addressing insufficient data and class imbalance in tunnel health assessment datasets. By generating synthetic data with a similar distribution to true data, data augmentation significantly expands the dataset size and addresses the challenges of manual data collection, particularly in situations where sample availability is limited.

Second, the synthetic dataset D_s1 generated by CTGAN exhibits high consistency with the statistical properties and distribution of the true dataset. The comparisons of statistical indicators, boxplots, and Q-Q plots indicate that the synthetic data not only preserves the inherent characteristics of the true data but also improves dataset balance and diversity, providing more comprehensive training data for tunnel health assessment models.

Finally, through the training and evaluation of machine/deep learning models (BP neural network, SVM, and XGBoost), it is verified that the synthetic data D_s1 generated by CTGAN improves model classification performance and prediction accuracy. This demonstrates its significant potential for application in tunnel health assessment tasks.

Supplemental Material

Supplemental Material - Application of data augmentation techniques in tunnel health assessment

Supplemental Material for Application of data augmentation techniques in tunnel health assessment by Jinglong Li, Xiufang Zheng, Diyang Chen, Lin Yang, Yupeng Cui, Kejin Li, Zengliang Song and Hui Chou in Journal of Computational Methods in Sciences and Engineering

Footnotes

ORCID iD

Xiufang Zheng

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors acknowledge financial support from the National Key Research and Development Program of China (No. 2024YFF0507902) and the University Research Project on Intelligent Detection and Rapid Remediation Technologies and Equipment for Tunnel Secondary Lining Quality (No. zy20240101).

Conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online.

References

Zhang

Zhu

Ren

. Analysis and study on crack characteristics of highway tunnel lining. Civ Eng J 2019; 5(5): 1119–1123.

Han

Jiang

Wang

, et al. Review of health inspection and reinforcement design for typical tunnel quality defects of voids and insufficient lining thickness. Tunn Undergr Space Technol 2023; 137: 105110.

Jiang

Wang

, et al. Estimation of reinforcing effects of FRP-PCM method on degraded tunnel linings. Soils Found 2017; 57(3): 327–340.

Zhao

Yang

Wang

, et al. Long-term safety evaluation of soft rock tunnel structure based on knowledge decision-making and data-driven models. Comput Geotech 2024; 169: 106244.

Melchiorre

Rosso

Cirrincione

, et al. Compact convolutional transformer Fourier analysis for GPR tunnels assessment. In 2023 international conference on control, Paris, 2023, pp. 1–6. Automation and Diagnosis (ICCAD).

Jin

Qiao

, et al. Intelligent safety evaluation of tunnel lining cracks based on machine learning. Eng Fail Anal 2025; 167: 109082.

Valero-Carreras

Alcaraz

Landete

. Comparing two SVM models through different metrics based on the confusion matrix. Comput Oper Res 2023; 152: 106131.

Kim

Cho

, et al. Investigation of pile group response to adjacent twin tunnel excavation utilizing machine learning. Geomech Eng 2024; 38(5): 517–528.

Nguyen

BQV

Kim

. The effect of soil physical properties on predicting shear strength parameters based on comparing ensemble learning, deep learning, and support vector machine models. Geomech Eng 2024; 39(3): 241–256.

10.

Zhang

Pan

, et al. Energy-driven TBM health status estimation with a hybrid deep learning approach. Expert Syst Appl 2024; 249: 123701.

11.

Yin

Liu

Huang

, et al. Perception model of surrounding rock geological conditions based on TBM operational big data and combined unsupervised-supervised learning. Tunn Undergr Space Technol 2022; 120: 104285.

12.

Elreedy

Atiya

. A comprehensive analysis of synthetic minority oversampling technique (SMOTE) for handling class imbalance. Inf Sci 2019; 505: 32–64.

13.

Islam

Abdel-Aty

Cai

, et al. Crash data augmentation using variational autoencoder. Accid Anal Prev 2021; 151: 105950.

14.

Fan

SKS

Tsai

Yeh

. Effective variational-autoencoder-based generative models for highly imbalanced fault detection data in semiconductor manufacturing. IEEE Trans Semicond Manuf 2023; 36(2): 205–214.

15.

Tran

Nguyen

, et al. On data augmentation for GAN training. IEEE Trans Image Process 2021; 30: 1882–1897.

16.

Gao

Deng

Yue

. Data augmentation in fault diagnosis based on the Wasserstein generative adversarial network with gradient penalty. Neurocomputing 2020; 396: 487–494.

17.

Alqahtani

Kavakli-Thorne

Kumar

. Applications of generative adversarial networks (GANs): an updated review. Arch Comput Methods Eng 2021; 28: 525–552.

18.

Baek

Kim

Park

, et al. Conditional generative adversarial networks with adversarial attack and defense for generative data augmentation. J Comput Civ Eng 2022; 36(3): 04022001.

19.

Engelmann

Lessmann

. Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning. Expert Syst Appl 2021; 174: 114582.

20.

Zhao

Yan

. Application of tabular data synthesis using generative adversarial networks on machine learning-based multiaxial fatigue life prediction. Int J Pres Ves Pip 2022; 199: 104779.

21.

Zhang

Jiang

. Conditional variational autoencoder with Gaussian process regression recognition for parametric models. J Comput Appl Math 2024; 438: 115532.

22.

Albahra

Gorbett

Robertson

, et al. Artificial intelligence and machine learning overview in pathology & laboratory medicine: a general review of data preprocessing and basic supervised concepts. Semin Diagn Pathol 2023; 40(2): 71–87.

23.

JTG/H12-2015 . Technical specifications of maintenance for highway tunnel. Beijing: China Communications Press, 2015.

24.

Luo

. Study of diagnosis method and system for health condition of highway tunnel. Shanghai: Tongji University, 2007.

25.

Sim

Gan

Chang

. Outlier labeling with boxplot procedures. J Am Stat Assoc 2005; 100(470): 642–652.

26.

Silva

Hermsdorf

Guedes

, et al. Outliers treatment to improve the recognition of voice pathologies. Procedia Comput Sci 2019; 164: 678–685.

27.

Abnane

Idri

Abran

. Fuzzy case-based-reasoning-based imputation for incomplete data in software engineering repositories. J Software Evolu Process 2020; 32(9): e2260.

28.

Liu

Zhou

Zhang

, et al. Fault diagnosis method of box-type substation based on improved conditional tabular generative adversarial network and AlexNet. Appl Sci 2024; 14(7): 3112.

29.

Armaghani

Lai

, et al. Applying data augmentation technique on blast-induced overbreak prediction: resolving the problem of data shortage and data imbalance. Expert Syst Appl 2024; 237: 121616.

30.

Gulrajani

Ahmed

Arjovsky

, et al. Improved training of Wasserstein GANs. Adv Neural Inf Process Syst 2017; 30: 5767–5777.

31.

Douzas

Bacao

Last

. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf Sci 2018; 465: 1–20.

32.

Özdenizci

Wang

Koike-Akino

, et al. Transfer learning in brain-computer interfaces with adversarial variational autoencoders. In 2019 9th international IEEE/EMBS conference on neural engineering (NER), San Francisco, CA: 2019, pp. 207–210.

33.

Yonekura

Wada

Suzuki

. Generating various airfoils with required lift coefficients by combining NACA and Joukowski airfoils using conditional variational autoencoders. Eng Appl Artif Intell 2022; 108: 104560.

34.

Cheng

Shi

, et al. Brief introduction of back propagation (BP) neural network algorithm and its improvement. Adv Computer Sci Inf Eng 2012; 2: 553–558.

35.

Jin

Wei

, et al. The improvements of BP neural network learning algorithm. In WCC 2000-ICSP 2000. 2000 5th international conference on signal processing proceedings. 16th world computer congress 2000, Beijing, China, 2000, vol 3, pp. 1647–1649.

36.

Cao

Wang

, et al. Research on mining maximum subsidence prediction based on genetic algorithm combined with XGBoost model. Sustainability 2022; 14(16): 10421.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.15 MB