Abstract
Seismic vulnerability assessment is crucial for ensuring the structural safety of buildings, particularly in earthquake-prone regions. While Nonlinear Time History Analysis (NLTHA) provides high accuracy, its computational demands make it impractical for rapid assessments. Machine learning (ML) models, especially Artificial Neural Networks (ANN), offer an efficient alternative but often require large datasets for reliable predictions. This study introduces a hybrid Principal Component Analysis-Artificial Neural Network (PCA-ANN) model to enhance seismic response prediction by reducing input dimensionality while preserving critical information. A dataset of over one million seismic responses was generated using NLTHA on three reinforced concrete (RC) frame buildings subjected to various ground motions. Comparative analysis between PCA-ANN and conventional ANN models reveals that PCA-ANN significantly improves both predictive accuracy and computational efficiency. The PCA-ANN model achieved a correlation coefficient (R2) of 99.1% and reduced the Mean Squared Error (MSE) by 87% compared to the standalone ANN. Additionally, PCA-ANN maintained robust performance with limited dataset sizes, achieving an R2 above 75% using only 25% of the dataset, whereas ANN failed under similar conditions. Further validation through Incremental Dynamic Analysis (IDA) and fragility curves shows that PCA-ANN exhibits discrepancies below 2% compared to NLTHA. The model also achieves the lowest Root Mean Square Error to Standard Deviation Ratio (RSR) (18%, 21%, and 24% for low-, mid-, and high-rise buildings, respectively) and the lowest Percentage Bias (PBias) (1.7%, 0.5%, and 0.3% for the same building types) when utilizing the full dataset. These results highlight PCA-ANN’s superior reliability across varying structural heights and dataset sizes. This study demonstrates that PCA-ANN is an efficient and accurate tool for seismic risk assessment, reducing computational costs while maintaining predictive reliability.
Keywords
Introduction
Performance-based design is a sophisticated approach to seismic design. It is based on estimating the behavior of the structure under seismic loads, which makes it possible to predict the potential structural damage resulting from a seismic event (Benazouz et al., 2017; Dukes et al., 2018; Zhang et al., 2022). It entails comparing the seismic demand to the building’s capacity and assuring its functionality during and after an earthquake. The accuracy of the estimation depends on the approach used. Nonlinear Time History Analysis (NLTHA) is one of the many techniques used to analyze structural response to ground motions (GMs). It is based on the analytical and numerical solution of the differential equations of motion (Sun et al., 2023). Compared to Nonlinear Static Procedures (NSP) (Mebarki et al., 2024), NLTHA is a complicated procedure requiring extensive calculations.
Machine learning (ML) has emerged as a promising tool for rapid seismic response prediction. Once trained, these models can capture complex relationships between structural and seismic parameters, allowing for near-instantaneous predictions.
Lagaros and Fragiadakis (2007) proposed an artificial neural network methodology to quickly assess the seismic demand of steel buildings from GM records. Giovanis et al. (2016) developed an artificial neural network model to quickly generate the median IDA curve using a Monte Carlo approach for a case study involving a 9-story steel frame. They demonstrated that using more dataset samples during training can reduce the error between the actual and predicted values from 10% to 2.2%. Khojastehfar et al. (2014) introduced an ANN-based method to construct fragility curves for the collapse damage state of a steel frame building. Mitropoulou and Papadrakakis (2011) proposed a soft computing framework to estimate fragility curves by predicting the seismic demand for four limit states, using three 3D RC frame buildings as a case study. Benbokhari et al. (2023) used an ANN model to predict the seismic response of an equivalent single degree of freedom (ESDOF) system to enhance the target displacement prediction. That study compared the proposed model with existing approaches adopted by codes such as FEMA-356 (2000), FEMA-440 (FEMA-440, 2009), and ATC-40 (Aabbas and Jarallah, 2021). The results demonstrated the capability of the proposed model relative to the existing approaches for seismic response prediction, with a mean absolute error (MAE) below 0.005.
This type of ML model has been used in several applications. Mangalathu et al. (2018) estimated the vulnerability of skewed concrete bridges in California using an ANN, and Rachedi et al. (2021) assessed the seismic risk of existing bridges considering soil interaction using an ANN. Asgarkhani et al. (2024) used a machine learning model to predict the maximum inter-story drift ratio (MIDR) and the roof drift ratio (RDR) for steel-braced frames, achieving performance levels of 98.7% and 93.5%, respectively. Kazemi et al. (2023) proposed an ML model to predict the seismic MIDR of reinforced concrete buildings; 94,400 samples of RC buildings were used to train the model, and the best ML model achieved a performance of 96.3%. Harirchian et al. (2020) used a classification ML model to assess the seismic hazard safety of RC buildings; the best ML model classified the damage class with a performance of 68%. Tang et al. (2022) proposed an ML classification model to rapidly assess the seismic damage classes of RC buildings, achieving a performance of 96.8%.
In addition, several other works demonstrate that ANNs can be used as an alternative to current numerical approaches, especially in earthquake engineering (Abdellatif et al., 2024; Barkhordari and Jawdhari, 2023; Derakhshani and Foruzan, 2019; Hou et al., 2024; Kazemi and Jankowski, 2023; Li et al., 2022; Liao et al., 2023; Noureldin et al., 2022; Rojas-Mercedes et al., 2022; Sajan et al., 2023; Wen et al., 2022; Zhang et al., 2024).
However, the training time of ANNs can be a significant issue, especially when analysts work with large datasets and tune hyperparameters. Additionally, the size of the dataset can affect the complexity of the ANN, potentially leading to decreased performance (Khan et al., 2021; Lu et al., 2021; Qi et al., 2022). Principal component analysis (PCA) is a common way to reduce the dimensionality of the input features of an ANN. Various studies have used this hybrid approach, which combines unsupervised and supervised learning. Asencio-Cortés et al. (2015) used PCA to enhance earthquake prediction in Chile. The study found that PCA significantly improved machine learning performance, increasing accuracy from 57%, 71%, and 65.80% to 69%, 74%, and 75%, respectively. Abbasi et al. (2023) estimated the earthquake-induced settlement of dams using a combination of ANN, PCA, and wavelet-artificial neural networks, and reduced the dimensionality from eight to five using PCA.
Existing literature demonstrates that hybrid learning significantly influences machine learning (ML) models in terms of accuracy, neural network complexity, and training efficiency. However, there is a critical gap in understanding its impact on seismic response prediction. While previous studies have explored ML applications in earthquake engineering, they have not comprehensively examined how hybrid learning techniques, particularly the integration of Principal Component Analysis (PCA) with Artificial Neural Networks (ANN), can enhance seismic response predictions.
This study addresses this gap by investigating the effect of hybrid learning on predicting seismic responses, particularly how it improves accuracy while reducing the required dataset size. The research aims to determine whether hybrid learning can achieve reliable seismic response predictions with a limited dataset, addressing challenges related to data availability and computational cost in earthquake engineering applications.
To achieve this, a dataset will be generated using OpenSees software through nonlinear time-history analyses (NLTHA) on randomly selected reinforced concrete (RC) framed structures subjected to various ground motions (GMs). The dataset will consist of one million samples, which will be used to train two ML models for predicting the maximum inter-story drift ratio (MIDR). Additionally, the study will evaluate the influence of dataset size by training the models with 25%, 50%, 75%, and 100% of the original dataset. This analysis will provide insights into the effectiveness of hybrid learning in seismic response prediction and its potential to optimize data efficiency without compromising accuracy.
Methodology
The proposed hybrid learning approach integrates both unsupervised and supervised learning to reduce the input features. As illustrated in Figure 1, a dataset will be generated using the Finite Element Method (FEM) in OpenSees software.
Figure 1. The proposed methodology for structural seismic response prediction.
Initially, a nonlinear time history analysis (NLTHA) will be conducted on randomly selected reinforced concrete (RC) framed structures. The input features, including structural and ground motion (GM) parameters and the output (Maximum Inter-story Drift Ratio, MIDR), will be stored in a single file. Following this, Principal Component Analysis (PCA) will be performed to align the dataset along principal component axes. These new principal components (PCs) will serve as the input layer for the artificial neural network (ANN) model.
For the comparison study, a separate ANN model will be trained using the original input features (structural and GM parameters). The comparison will include analyses of the Incremental Dynamic Analysis (IDA) curves, fragility curves, and the effects of dataset size on the results.
Dataset generation
The performance of any ML algorithm depends directly on the collected or generated dataset. Its quality and size directly affect the predictive capability of the model. For this work, OpenSees is used to perform more than 1 million NLTHA runs using 80 GMs. The structures are generated randomly from selection ranges of the structural geometric characteristics, as shown in Table 1 and illustrated in Figure 2. The random selection must also be practical, that is, consistent with engineering design concepts:
• The height of the first story (He) is always greater than or equal to the height of the subsequent stories (Hs).
• The depth of the beam (h1) is always greater than or equal to its width (b1).
Table 1. Geometric parameters and interval values for each input.
Figure 2. The structural geometry and material models of an RC-framed building.
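The constrained random selection described above can be sketched with simple rejection sampling. This is only an illustration: the ranges in `RANGES` are hypothetical placeholders (the values of Table 1 are not reproduced here), and the helper names `is_practical` and `sample_structure` are assumptions, not part of the authors' OpenSees workflow.

```python
import random

# Hypothetical ranges for illustration only; these are NOT the values of Table 1.
RANGES = {
    "He": (3.0, 4.5),    # first-story height (m)
    "Hs": (2.8, 3.6),    # height of the subsequent stories (m)
    "b1": (0.25, 0.40),  # beam width (m)
    "h1": (0.30, 0.60),  # beam depth (m)
}

def is_practical(s):
    """Engineering constraints stated in the text: He >= Hs and h1 >= b1."""
    return s["He"] >= s["Hs"] and s["h1"] >= s["b1"]

def sample_structure(rng):
    """Draw each parameter uniformly and reject impractical combinations."""
    while True:
        s = {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
        if is_practical(s):
            return s

rng = random.Random(0)  # fixed seed so the sketch is reproducible
structures = [sample_structure(rng) for _ in range(1000)]
```

One consequence of rejecting impractical draws is that the accepted values are no longer exactly uniform over each individual range; whether that matters depends on how the ranges overlap.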

The heatmap shown in Figure 3 presents the correlation matrix of the dataset, illustrating the relationships between the variables. The diagonal values are all 1, indicating a perfect correlation of each variable with itself. There are strong correlations observed between PGV and PGD (0.94), Ecum and Is (1.00), and CAV and Is (0.96), suggesting redundancy among these variables. Moderate correlations are noted between the output and Ns (0.55), Sa(T1,%) (0.54), and PGA (0.39), indicating their potential influence on the target variable. Moreover, Sa(T1,%) exhibits a strong correlation with PGA (0.72), which may imply a dependency. Most other variables display weak correlations with the output, suggesting they may not significantly impact predictive modeling. The high correlations among certain inputs could lead to multicollinearity, which may affect the performance of regression models. Therefore, identifying and potentially removing redundant variables could enhance model efficiency.
Figure 3. Correlation heatmap between input and output features.
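As a rough illustration of how redundant inputs can be flagged from a correlation matrix, the sketch below lists feature pairs whose absolute Pearson correlation exceeds a threshold. The data are synthetic stand-ins (the labels PGV, PGD, and PGA are reused only as names), and the 0.9 threshold and the function name `correlated_pairs` are assumptions.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.9):
    """Return (name_i, name_j, r) for every pair with |r| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns of X are the variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Synthetic example: the second column is almost a copy of the first.
rng = np.random.default_rng(0)
pgv = rng.normal(size=500)
pgd = pgv + 0.1 * rng.normal(size=500)  # nearly redundant with pgv
pga = rng.normal(size=500)              # independent of the other two
X = np.column_stack([pgv, pgd, pga])

pairs = correlated_pairs(X, ["PGV", "PGD", "PGA"])
```

Pairs flagged this way are candidates for removal or, as in the hybrid approach studied here, for being combined by PCA.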
It is important to note that the values of the input features were sampled randomly and uniformly, so that all values within each range had an equal probability of being selected.
Selection of ground motions
The selection of ground motions in the IDA approach is a crucial step, as it provides the simulated motion representative of actual earthquake events (Wasti and Özcebe, 2003). The selection process is influenced by several factors, including intensity, frequency content, duration, amplitudes, and target response spectra. It is important to note that making an appropriate selection helps the analyst avoid errors that may arise from insufficient stimulation of the ground motions (GMs) (Chen and Yi, 2015). Additionally, an ANN requires a diverse set of ground motions for effective training. By capturing patterns within the data, the ANN learns from various examples, which enhances its ability to develop robust models and respond to complex inputs. A wider range of data increases the likelihood that the ANN will generalize its findings and accurately represent different types of ground motion. In this case, 80 ground motion records were selected and matched from the PEER database (Center, 2013); they are listed in Table A1. Figure 4 presents the selected ground motions from the PEER database. Figure 4(a) illustrates the target seismic response spectrum along with the 16th, 50th, and 84th percentiles of the response spectra for the selected ground motions. Figures 4(b)–(d) show the relationships between earthquake magnitude and rupture distance, magnitude and shear wave velocity, and shear wave velocity and rupture distance, respectively. These figures highlight the variability of the ground motions chosen to train the machine learning model.
Figure 4. Ground motion selection from the PEER database: (a) selected response spectra scaled to a target response spectrum, (b) magnitude (Mw) versus the closest distance to the rupture (Rrup), (c) magnitude versus shear wave velocity (Vs30), and (d) Rrup versus Vs30.
The selected GM parameters.
Principal component analysis (PCA)
The PCA is an unsupervised machine learning algorithm used for dimensionality reduction, feature selection, and data visualization. It is a statistical procedure that is based on converting a set of correlated data into linearly uncorrelated variables called principal components (PC) (Maćkiewicz and Ratajczak, 1993).
The number of principal components is less than or equal to the number of input features. The first PC has the largest possible variance, and each succeeding component has the highest variance possible under the constraint that it is orthogonal to the preceding components.
The primary objective of PCA is to identify the axes that exhibit the largest variances and provide the most informative features of the data. This process involves transforming the original dataset into a new dataset with fewer dimensions. Figure 5 illustrates the reorientation of the dataset from its original axes to the new principal component axes. By multiplying the standardized matrix by the matrix of eigenvectors, the new dataset is generated, with its size depending on the number of principal components selected.
Figure 5. Dataset reorientation from the original axes using the eigenvectors matrix.
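The reorientation step (standardized matrix multiplied by the eigenvector matrix) can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, assuming the eigen-decomposition is taken on the correlation matrix of the inputs; the function name `pca_transform` is hypothetical and this is not the authors' implementation.

```python
import numpy as np

def pca_transform(X, n_components):
    """Project standardized data onto the leading principal component axes."""
    # Standardize each feature to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Eigen-decomposition of the (symmetric) correlation matrix.
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    # eigh returns eigenvalues in ascending order; sort them descending.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # New dataset = standardized matrix times the retained eigenvectors.
    scores = Z @ eigvecs[:, :n_components]
    return scores, eigvals

# Synthetic correlated data with 5 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
scores, eigvals = pca_transform(X, n_components=3)
```

The columns of `scores` are mutually uncorrelated, and the variance of each column equals the corresponding eigenvalue, which is why the eigenvalues can be read as the variability carried by each PC.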
Figure 6 illustrates the principal components, their eigenvalues (EV), and the cumulative variability (CV). PC1 has the highest EV (4.861) and therefore the highest variability (27%). If a fixed threshold of 90% is applied, the dataset can be reduced to 11 dimensions (CV = 91.261%), which retain most of the variability.
Figure 6. The principal components versus the eigenvalues of each PC. The red line represents the cumulative variability (%) as a function of the number of PCs.
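The link between an eigenvalue and its share of the variability can be written out explicitly. The relations below are the standard PCA definitions, under the assumption that PCA is applied to the standardized inputs (so the eigenvalues sum to the number of features, here 18); the symbols $\lambda_k$, $v_k$, $\mathrm{CV}_m$, and $m^{*}$ are introduced only for this illustration.

```latex
% Share of variability carried by the k-th principal component
v_k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j},
\qquad
\mathrm{CV}_m = \sum_{k=1}^{m} v_k .

% With p = 18 standardized inputs, \sum_j \lambda_j = 18, so for PC1:
v_1 = \frac{4.861}{18} \approx 0.270 = 27\% .

% Threshold rule: keep the smallest m whose cumulative variability reaches 90%
m^{*} = \min \{\, m : \mathrm{CV}_m \ge 0.90 \,\} = 11
\quad (\mathrm{CV}_{11} = 91.261\%) .
```

The value $v_1 \approx 27\%$ matches the variability reported for PC1, which is consistent with the standardization assumption.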
The difference between the ANN and the hybrid PCA-ANN is that the former uses the inputs as they are (18 input features), so its input dimension depends on the variables used. The latter uses the eigenvectors to transform the original data and reduce its dimensionality, so the number of ANN inputs is less than or equal to the number of original variables (<= 18).
Artificial neural networks for seismic response prediction
ANNs have been increasingly applied in civil and earthquake engineering to predict events and quantify seismic risk (Khan et al., 2021; Lu et al., 2021). Their ability to find the relation between inputs and outputs has made regression and classification tasks easier and more efficient in terms of performance (Chen et al., 2023; Shivani and Rooban, 2021).
In this section, two ANN models are used. The first is an ANN model trained on the generated dataset with 18 input features, including the structural characteristics and earthquake parameters. The second model uses PCA to reorient the generated dataset and uses the principal components as input features, as depicted in Figure 7. The performance of both models is investigated after optimizing the hyperparameters.
Figure 7. PCA-ANN architecture for seismic response prediction.
The ANN is constructed through a series of sequential procedures, as shown in Figure 7. The first phase involves dataset preprocessing, which encompasses data-cleaning procedures such as the elimination of missing, duplicate, and infinite values. The input features should be normalized so that all inputs contribute on a comparable scale. Here, the input features are scaled to the range −1 to +1 (Al Shalabi and Shaaban, 2006).
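A scaling to the range −1 to +1 can be sketched as follows, assuming a min-max transformation (the text does not state the exact formula used). The function names are hypothetical.

```python
import numpy as np

def fit_minmax(X):
    """Per-feature minimum and maximum, taken from the data used for fitting."""
    return X.min(axis=0), X.max(axis=0)

def scale_to_unit_range(X, lo, hi):
    """Map each feature linearly so that lo -> -1 and hi -> +1."""
    return 2.0 * (X - lo) / (hi - lo) - 1.0

# Synthetic features with very different magnitudes (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform([0.0, 100.0], [1.0, 500.0], size=(50, 2))
lo, hi = fit_minmax(X)
X_scaled = scale_to_unit_range(X, lo, hi)
```

In practice the minimum and maximum are usually fitted on the training set only and then reused for the testing and validation sets, so that no information leaks from held-out data.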
The data will be divided into three distinct sets: training, testing, and validation, with proportions of 80%, 10%, and 10%, respectively. Cross-validation is essential at this stage to determine the average performance of the trained model, taking into account the effects of random shuffling and the selection of the training, testing, and validation data.
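The 80%/10%/10% partition can be sketched with a shuffled index split. This is a minimal illustration; the function name `split_indices` is an assumption, and repeating the split with different seeds is one simple way to average out the effect of random shuffling mentioned above.

```python
import numpy as np

def split_indices(n_samples, rng, fractions=(0.8, 0.1, 0.1)):
    """Shuffle sample indices and cut them into train/test/validation parts."""
    idx = rng.permutation(n_samples)
    n_train = round(fractions[0] * n_samples)
    n_test = round(fractions[1] * n_samples)
    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    val = idx[n_train + n_test:]  # whatever remains goes to validation
    return train, test, val

rng = np.random.default_rng(0)
train, test, val = split_indices(1000, rng)
```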
To determine the optimal hyperparameters for the ANN and PCA-ANN models, a series of training sessions were conducted where the correlation coefficient (R2) and Mean Squared Error (MSE) were calculated after each trial. The study explored the number of neurons (NN), the number of hidden layers (HL), and the activation functions for both models. Figure 8 illustrates the best combinations of [NN: HL] in terms of R2 and MSE for each model. It is important to note that the most effective activation functions identified were the ReLU function for the hidden layers and the linear function for the output layer.
Figure 8. The hyperparameter investigation for the ANN and PCA-ANN models: (a) number of neurons and (b) number of hidden layers.
According to Figure 8, the best [NN: HL] combinations for the ANN and PCA-ANN models are [90:7] and [70:4], respectively. They correspond to the highest R2 (91.9% and 99.26%) and the lowest MSE (4e-3 and 3e-4). The ANNs are trained with the “Adam” optimization algorithm combined with backpropagation (BP), which proceeds in three phases: a forward pass, a backward pass, and a parameter update. Figure 9 illustrates the performance of the PCA-ANN approach after training on the training, testing, and validation sets, as well as the MSE at each iteration.
Figure 9. The performance of the PCA-ANN model in predicting the building’s response in terms of R2: (a) testing, (b) training, (c) validation, and (d) the mean squared error at each iteration.
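The three phases (forward pass, backward pass, update) can be illustrated with a deliberately small NumPy network using ReLU in the hidden layer, a linear output, and Adam updates. Everything here is an assumption made for illustration: the data are synthetic, the network has a single hidden layer of 16 neurons rather than the [70:4] or [90:7] architectures, and the learning rate and Adam constants are commonly used defaults, not the authors' settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative only, not the paper's dataset).
X = rng.uniform(-1.0, 1.0, size=(256, 4))
y = X[:, :1] ** 2 + 0.5 * X[:, 1:2]  # target of shape (256, 1)

# One ReLU hidden layer and a linear output layer.
params = {
    "W1": rng.normal(0.0, 0.5, size=(4, 16)), "b1": np.zeros(16),
    "W2": rng.normal(0.0, 0.5, size=(16, 1)), "b2": np.zeros(1),
}
m1 = {k: np.zeros_like(p) for k, p in params.items()}  # Adam first moments
m2 = {k: np.zeros_like(p) for k, p in params.items()}  # Adam second moments
lr, beta1, beta2, eps = 1e-2, 0.9, 0.999, 1e-8

losses = []
for t in range(1, 501):
    # Phase 1: forward pass.
    h_pre = X @ params["W1"] + params["b1"]
    h = np.maximum(h_pre, 0.0)            # ReLU activation
    pred = h @ params["W2"] + params["b2"]  # linear output
    err = pred - y
    losses.append(float(np.mean(err ** 2)))  # MSE at this iteration

    # Phase 2: backward pass (gradients of the MSE by backpropagation).
    d_pred = 2.0 * err / len(X)
    grads = {"W2": h.T @ d_pred, "b2": d_pred.sum(axis=0)}
    d_h = (d_pred @ params["W2"].T) * (h_pre > 0)  # ReLU derivative
    grads["W1"] = X.T @ d_h
    grads["b1"] = d_h.sum(axis=0)

    # Phase 3: parameter update with bias-corrected Adam moments.
    for k in params:
        m1[k] = beta1 * m1[k] + (1 - beta1) * grads[k]
        m2[k] = beta2 * m2[k] + (1 - beta2) * grads[k] ** 2
        m_hat = m1[k] / (1 - beta1 ** t)
        v_hat = m2[k] / (1 - beta2 ** t)
        params[k] -= lr * m_hat / (np.sqrt(v_hat) + eps)
```

The list `losses` plays the role of the MSE-per-iteration curve: it should decrease as training proceeds.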
Effect of hybrid learning on the prediction performance
The use of PCA for dimension reduction can have both positive and negative effects on the performance of ANNs. By capturing the principal components of a dataset, PCA can simplify the data, potentially speeding up the training process and improving performance in certain cases. Chen et al. (2020) found that combining PCA and ANN improved the accuracy of aircraft cost estimation compared to using ANN alone. The previous section supported these findings and indicated that hybrid learning could help reduce the hyperparameters, ultimately decreasing the training time from 5 minutes to 2 minutes per operation.
This section explores how the number of principal components (PC) can influence the performance of the ANN in terms of the R2 and mean squared error (MSE). Figure 10 compares the mean R2 (the average correlation coefficient for training, testing, and validation) as well as the MSE for each number of PCs against the ANN model that does not utilize PCA.
Figure 10. The number of used principal components, along with the corresponding mean correlation coefficient R2 and MSE.
Figure 10 illustrates that as the number of PCs increases, the R2 value rises from 38% to 99.26%, while the MSE decreases from 0.47 to 0.000543. The best R2 achieved with the ANN is 91.9%. Additionally, by using smaller hyperparameter values (NN: HL), the performance of the ANN can be enhanced, which leads to a significant reduction in training time. Furthermore, with just 13 PCs the hybrid model achieves a higher performance than the ANN, which requires substantially more time and more hidden layer units.
Case study
Three RC-frame buildings (low-, mid-, and high-rise) are selected as case studies to compare the hybrid learning (PCA-ANN) and ANN performance in terms of IDA, fragility assessment, and data size effect. The IDA and fragility curves obtained from PCA-ANN are compared to the numerical solutions (NLTHA). The accuracy is estimated using three statistical criteria: the correlation coefficient (R2), as written in equation (1) (Benesty et al., 2009); the Root Mean Square Error to Standard Deviation Ratio (RSR), as written in equation (2) (Alouache et al., 2019); and the percentage bias (PBias), as written in equation (3) (Moriasi et al., 2007).
R2 is used to assess the degree of correlation between the actual data (NLTHA) and the predicted data. Higher R2 values indicate a stronger correlation between the two sets of data. RSR measures the dispersion between the predicted values and the actual values. An RSR value of 0 signifies a perfect simulation with the lowest variability. PBias evaluates the average tendency of the predicted results to be larger or smaller than the actual values. A PBias value of 0 indicates that, on average, the ANN predictions neither overestimate nor underestimate the NLTHA values.
Statistical criteria for the performance evaluation (Annad and Lefkir, 2022).
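Since equations (1)–(3) are not reproduced in this text, the sketch below uses commonly cited definitions of the three criteria and should be read as an assumption rather than the authors' exact formulas: R2 as the squared Pearson correlation, RSR as the RMSE divided by the standard deviation of the reference values, and PBias as the summed difference relative to the summed reference values, in percent. The sample values are hypothetical.

```python
import numpy as np

def r_squared(obs, pred):
    """Squared Pearson correlation between reference and predicted values."""
    r = np.corrcoef(obs, pred)[0, 1]
    return float(r ** 2)

def rsr(obs, pred):
    """RMSE divided by the standard deviation of the reference values."""
    rmse = np.sqrt(np.mean((obs - pred) ** 2))
    return float(rmse / np.std(obs))

def pbias(obs, pred):
    """Percentage bias; with this sign convention, negative means overestimation."""
    return float(100.0 * np.sum(obs - pred) / np.sum(obs))

# Hypothetical values: predictions that are uniformly 10% too large.
obs = np.array([1.0, 2.0, 3.0, 4.0])
pred = 1.1 * obs
metrics = (r_squared(obs, pred), rsr(obs, pred), pbias(obs, pred))
```

This example also shows why the criteria are used together: a uniform 10% overestimation leaves R2 at 1 while PBias and RSR both register the error.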
The characteristics of the buildings
To check the predictive capability of the ANN, three RC frame buildings are selected to perform the IDA using the NLTHA and the ANN method. Figure 11 and Table 4 show the elevation views and characteristics of these buildings, respectively.
Figure 11. The geometrical characteristics of the studied buildings: (a) low-rise, (b) mid-rise, and (c) high-rise.
Table 4. Characteristics of the case study buildings (low-, mid-, and high-rise RC frame buildings).
Impact of dataset size on machine learning performance
As noted by Alwosheel et al. (2018), the dataset size should be optimal; simply having more data does not always lead to better performance. The relationship between size and accuracy is complex. An increase in data can lead to overfitting or slower training if the data includes irrelevant or noisy features. Additionally, there are no established guidelines for determining the optimal dataset size for achieving the best performance.
This subsection explores how dataset size impacts the performance of the ANN and PCA-ANN models. It examines four distinct datasets, with models built using 25%, 50%, 75%, and 100% of the generated dataset, respectively. The performance of both PCA-ANN and ANN is evaluated using statistical criteria such as R2, PBias, and RSR. Furthermore, the relative error, calculated using equation (4), and the mean relative error between the predicted and reference median Incremental Dynamic Analysis (IDA) curves are also assessed.
Figure 12. Median IDA curves and relative errors of the PCA-ANN model and the NLTHA: (a) low-rise, (b) mid-rise, and (c) high-rise IDA curves; (d) low-rise relative errors, (e) mid-rise relative errors, and (f) high-rise relative errors.
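The dataset-size study and the relative-error measure can be sketched as follows. Equation (4) is not reproduced in this text, so the relative error below uses the usual definition (absolute difference divided by the reference value, in percent), which is an assumption; the function names and the sample curve values are hypothetical.

```python
import numpy as np

def subsample(n_samples, fraction, rng):
    """Indices of a random subset containing the given fraction of the data."""
    size = round(fraction * n_samples)
    return rng.choice(n_samples, size=size, replace=False)

def relative_error(predicted, reference):
    """Pointwise relative error in percent (assumed form of equation (4))."""
    return 100.0 * np.abs(predicted - reference) / np.abs(reference)

def mean_relative_error(predicted, reference):
    """Average of the pointwise relative errors along a curve."""
    return float(np.mean(relative_error(predicted, reference)))

rng = np.random.default_rng(0)
subsets = {f: subsample(1000, f, rng) for f in (0.25, 0.50, 0.75, 1.00)}

# Hypothetical median IDA ordinates (not results from the paper).
reference = np.array([1.0, 2.0, 4.0])
predicted = np.array([1.1, 1.8, 4.0])
mre = mean_relative_error(predicted, reference)
```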

Figure 12(d)–(f) provide valuable analytical insights. For the 100% dataset, the Mean Relative Error (MRE) is 1.02% for the low-rise structure, 0.36% for the mid-rise structure, and 0.27% for the high-rise structure. The highest MRE occurs with the 25% dataset across all building categories. Furthermore, accuracy increases with dataset size, highlighting the crucial role that sufficient data plays in refining predictive results. This analysis confirms the relationship between dataset characteristics and the effectiveness of ANN median predictions in seismic response prediction. To enhance accuracy, it is advisable to create a deterministic dataset that contains no irrelevant or noisy data, as such extraneous information can adversely affect the ANN’s performance, particularly when the dataset is large. However, this may increase the training time of the machine learning model.
Additionally, the results presented in Figure 13 examine the effects of dataset size and of the algorithm used (hybrid PCA-ANN or ANN) in terms of R2, PBias, and RSR, along with their acceptable limits.
Figure 13. Comparison between PCA-ANN and ANN results for low-, mid-, and high-rise buildings using different dataset sizes: (a) R2 (ANN), (b) RSR (ANN), (c) PBIAS (ANN), (d) R2 (PCA-ANN), (e) RSR (PCA-ANN), (f) PBIAS (PCA-ANN).
The effect of integrating PCA on IDA curve prediction
This section examines the impact of using PCA on the performance of predicting IDA curves. The IDA curves are determined by calculating the maximum inelastic seismic response through NLTHA for each ground motion and its respective intensities. Figures 14–16 display the IDA curves for low-, mid-, and high-rise buildings using the NLTHA, PCA-ANN, and ANN approaches (Miari and Jankowski, 2022).
Figure 14. IDA curves of the low-rise building using: (a) NLTHA, (b) PCA-ANN, (c) ANN, and (d) 50% fractile using the NLTHA, PCA-ANN, and ANN approaches.
Figure 15. IDA curves of the mid-rise building using: (a) NLTHA, (b) PCA-ANN, (c) ANN, and (d) 50% fractile using the NLTHA, PCA-ANN, and ANN approaches.
Figure 16. IDA curves of the high-rise building using: (a) NLTHA, (b) PCA-ANN, (c) ANN, and (d) 50% fractile using the NLTHA, PCA-ANN, and ANN approaches.


Figures 14–16 present IDA curves for the low-, mid-, and high-rise buildings, obtained with three different methodologies: NLTHA, PCA-ANN, and ANN. These figures illustrate the variation of the MIDR with PGA across various fractiles, with a particular focus on the 50% fractile to analyze the median response more effectively. Figure 14(a)–(c), Figure 15(a)–(c), and Figure 16(a)–(c) depict the IDA curves along with the 16%, 50%, and 84% fractiles for the low-, mid-, and high-rise buildings, respectively. Each set of figures enables a comparison of the seismic response across different building heights and analytical approaches.
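The fractile curves can be computed by taking percentiles across the ground motion records at each intensity level. The sketch below assumes the fractiles are taken on MIDR at fixed PGA levels (the text does not say which convention was used) and uses synthetic responses; the function name `ida_fractiles` is hypothetical.

```python
import numpy as np

def ida_fractiles(midr, percentiles=(16, 50, 84)):
    """Percentile curves across records.

    midr has shape (n_records, n_intensity_levels); each returned curve
    has one value per intensity level.
    """
    return {p: np.percentile(midr, p, axis=0) for p in percentiles}

# Synthetic IDA responses: 80 records, 10 PGA levels, lognormal scatter
# around a linear trend (illustrative only).
rng = np.random.default_rng(0)
pga = np.linspace(0.1, 1.0, 10)
midr = 0.02 * pga * rng.lognormal(mean=0.0, sigma=0.3, size=(80, 10))

fractiles = ida_fractiles(midr)
median_curve = fractiles[50]
```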
According to the results depicted in Figures 14–16, the NLTHA approach exhibits a nearly linear trend, with the median (50% fractile) curve aligning well with the expected structural response, while the 16% and 84% bounds capture uncertainty. The PCA-ANN model closely follows the NLTHA results, preserving critical drift limits and structural behavior, though minor deviations appear in the spread of the uncertainty bounds. The ANN model, however, shows greater variability, with a broader scatter of data points and a slight overestimation of MIDR at higher PGA values. When comparing the 50% fractile curves, PCA-ANN demonstrates strong agreement with NLTHA, while ANN deviates slightly at higher intensities. This suggests that PCA-ANN provides a more reliable approximation, effectively balancing accuracy and computational efficiency. Overall, PCA-ANN outperforms ANN in capturing structural response trends, making it a viable surrogate model for seismic analysis, while NLTHA remains the most accurate but computationally demanding approach.
The effect of integrating PCA on fragility assessment
Fragility assessment is a crucial process for evaluating the seismic vulnerability and condition of any structure located in a known hazard area. Fragility curves illustrate the probability of exceeding a specific limit state, such as a performance level or damage state, as a function of intensity measures like PGA, PGV, or PGD. This section examines the probabilistic accuracy derived from IDA curves using the NLTHA, PCA-ANN, and ANN models applied to the case study structures. Three performance levels are defined according to the FEMA 356 guidelines (2000): immediate occupancy (IO), life safety (LS), and collapse prevention (CP), corresponding to MIDR values of 1%, 2%, and 4%, respectively.
The fragility curves are derived using the IDA method which is represented with a lognormal cumulative distribution function as shown in equation (5):
Where P represents the probability of exceedance and x is a specific value of the PGAi.
Fragility curves of the IO, LS, and CP performance levels for: (a) low-rise building, (b) mid-rise building, and (c) high-rise building.
Table 5. The MAD of fragility curves for the IO, LS, and CP performance levels.
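Because equation (5) is not reproduced in this text, the sketch below uses the standard lognormal fragility form, P(x) = Φ((ln x − μ) / β), where μ and β are the mean and standard deviation of the logarithms of the PGA values at which a limit state is first reached. This parametrization, the function names, and the capacity values are assumptions for illustration. A mean-absolute-difference helper of the kind reported in Table 5 is included, again under an assumed definition.

```python
import math
import numpy as np

def fit_lognormal(capacity_pga):
    """Mean and standard deviation of ln(PGA) at which the limit state is reached."""
    logs = np.log(capacity_pga)
    return float(logs.mean()), float(logs.std(ddof=1))

def fragility(x, mu_ln, beta):
    """Lognormal CDF: probability of exceeding the limit state at PGA = x."""
    z = (math.log(x) - mu_ln) / beta
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mean_absolute_difference(curve_a, curve_b):
    """Average absolute gap between two fragility curves on the same PGA grid."""
    return float(np.mean(np.abs(np.asarray(curve_a) - np.asarray(curve_b))))

# Hypothetical PGA capacities (in g) for one limit state, one per record.
capacity_pga = np.array([0.30, 0.40, 0.50, 0.60, 0.80])
mu_ln, beta = fit_lognormal(capacity_pga)

pga_grid = np.linspace(0.05, 1.5, 30)
curve = np.array([fragility(x, mu_ln, beta) for x in pga_grid])
```

With this form, the probability of exceedance equals 0.5 at the median capacity, x = exp(μ), and the curve rises monotonically with PGA.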

Table 5 presents the MAD values that compare the fragility curves derived from the PCA-ANN and ANN models to the results from NLTHA across various performance levels (IO, LS, CP) and building heights (low-rise, mid-rise, and high-rise).
For the IO level, PCA-ANN consistently shows lower MAD values than ANN, with the lowest error observed in low-rise buildings (6e-4) and the highest in high-rise structures (2e-2). In contrast, ANN exhibits larger discrepancies, particularly in mid-rise buildings (9e-2). This indicates that PCA-ANN provides a more accurate representation of fragility curves compared to ANN alone.
At the LS level, PCA-ANN also demonstrates lower MAD values, with errors ranging from 0.0107 for low-rise buildings to 0.0118 for high-rise buildings. Conversely, ANN shows higher deviations, with mid-rise structures presenting the largest MAD value (0.0862). This pattern suggests that PCA-ANN better aligns with the NLTHA results.
For CP, PCA-ANN maintains lower MAD values across all building heights, with the lowest error in high-rise buildings (0.0036). ANN, in contrast, shows significantly higher discrepancies, particularly in low-rise (0.0735) and high-rise structures (0.0875). This further confirms that PCA-ANN outperforms ANN in capturing fragility behavior.
Overall, PCA-ANN consistently produces fragility curves closer to NLTHA results compared to ANN, making it a more reliable approach for seismic vulnerability assessment. The differences are more pronounced in mid-rise and high-rise buildings, where ANN struggles to match NLTHA accuracy.
Conclusion
This study develops a hybrid learning model that combines principal component analysis (PCA) and artificial neural networks (ANNs) to enhance the prediction of seismic responses. It focuses on understanding how this hybrid learning approach impacts both the training process and the performance of the model, particularly for predicting the seismic responses of reinforced concrete (RC) framed structures. To conduct the investigation, a dataset of 1 million samples was generated using OpenSees software. Two machine learning models, ANN and PCA-ANN, were trained to predict seismic responses. The two models were compared in terms of the incremental dynamic analysis (IDA) curves, the fragility curves, and the relationship between dataset size and hybrid learning. The most significant outcomes of the study are:
• PCA-ANN enhances predictive accuracy, achieving an R2 of 99.1% and reducing MSE by 87% compared to ANN.
• The model maintains strong performance even with reduced dataset sizes, improving computational efficiency compared to the supervised learning model.
• Fragility curves generated by PCA-ANN closely match NLTHA results, with discrepancies below 2%.
• PCA-ANN demonstrates lower Mean Absolute Difference (MAD) values across all performance levels and building heights, outperforming ANN.
• The method provides a practical balance between computational cost and accuracy, making it suitable for large-scale seismic risk assessment.
These findings highlight the potential of PCA-ANN as an efficient and accurate tool for seismic vulnerability analysis, effectively balancing computational cost with predictive reliability. Future research could investigate the impact of hybrid learning on performance-based seismic design, particularly focusing on how the model’s performance affects structural design decisions and earthquake response behaviors. This direction would further enhance the application of machine learning in earthquake engineering, contributing to more robust and resilient structural design practices.
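For readers who wish to reproduce comparisons of this kind, the accuracy measures reported in the study (R², MSE, RSR, and PBias) can be sketched as follows. The definitions are the commonly used ones and are an assumption here: R² is taken as the coefficient of determination, RSR as the ratio of the root-mean-square error to the standard deviation of the observations, and PBias with the sign convention of observed minus predicted. The sample values and function names are hypothetical.

```python
import math


def _sums(observed, predicted):
    """Return (sum of squared errors, total sum of squares)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return sse, sst


def mse(observed, predicted):
    """Mean squared error."""
    sse, _ = _sums(observed, predicted)
    return sse / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination (assumed meaning of R²)."""
    sse, sst = _sums(observed, predicted)
    return 1.0 - sse / sst


def rsr(observed, predicted):
    """RMSE divided by the standard deviation of the observations."""
    sse, sst = _sums(observed, predicted)
    return math.sqrt(sse / sst)


def pbias(observed, predicted):
    """Percentage bias, using the (observed - predicted) convention."""
    return 100.0 * sum(o - p for o, p in zip(observed, predicted)) / sum(observed)


def mse_reduction_percent(mse_baseline, mse_candidate):
    """Relative MSE reduction of a candidate model over a baseline."""
    return 100.0 * (mse_baseline - mse_candidate) / mse_baseline


# Hypothetical responses: reference values and two sets of predictions.
observed = [1.0, 2.0, 3.0, 4.0]
pred_a = [1.1, 1.9, 3.2, 3.8]   # stands in for the more accurate model
pred_b = [1.5, 1.5, 3.6, 3.3]   # stands in for the less accurate model

print("R2    :", r_squared(observed, pred_a))
print("RSR   :", rsr(observed, pred_a))
print("PBias :", pbias(observed, pred_a))
print("MSE reduction (%):",
      mse_reduction_percent(mse(observed, pred_b), mse(observed, pred_a)))
```

Under these definitions RSR² equals 1 − R² on the same data, so the two measures carry the same information when computed on one dataset; PBias adds the direction of any systematic over- or under-prediction.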
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Appendix
The selected ground motions from the PEER database (Center, 2013).
| Result ID | Earthquake name | Year | Station name | Magnitude | Mechanism | Rjb (km) | Rrup (km) | Vs30 (m/s) |
|---|---|---|---|---|---|---|---|---|
| 1 | Imperial Valley-02 | 1940 | El Centro Array #9 | 6.95 | Strike slip | 6.09 | 6.09 | 213.44 |
| 2 | Northwest Calif-02 | 1941 | Ferndale City Hall | 6.6 | Strike slip | 91.15 | 91.22 | 219.31 |
| 3 | Borrego | 1942 | El Centro Array #9 | 6.5 | Strike slip | 56.88 | 56.88 | 213.44 |
| 4 | Kern County | 1952 | LA - Hollywood stor FF | 7.36 | Reverse | 114.62 | 117.75 | 316.46 |
| 5 | Kern County | 1952 | Pasadena - CIT Athenaeum | 7.36 | Reverse | 122.65 | 125.59 | 415.13 |
| 6 | Kern County | 1952 | Santa Barbara Courthouse | 7.36 | Reverse | 81.3 | 82.19 | 514.99 |
| 7 | Kern County | 1952 | Taft Lincoln School | 7.36 | Reverse | 38.42 | 38.89 | 385.43 |
| 8 | Northern Calif-03 | 1954 | Ferndale City Hall | 6.5 | Strike slip | 26.72 | 27.02 | 219.31 |
| 9 | El Alamo | 1956 | El Centro Array #9 | 6.8 | Strike slip | 121 | 121.7 | 213.44 |
| 10 | Borrego Mtn | 1968 | El Centro Array #9 | 6.63 | Strike slip | 45.12 | 45.66 | 213.44 |
| 11 | Borrego Mtn | 1968 | LA - Hollywood stor FF | 6.63 | Strike slip | 222.42 | 222.42 | 316.46 |
| 12 | Borrego Mtn | 1968 | LB - Terminal Island | 6.63 | Strike slip | 199.84 | 199.84 | 217.92 |
| 13 | Borrego Mtn | 1968 | Pasadena - CIT Athenaeum | 6.63 | Strike slip | 207.14 | 207.14 | 415.13 |
| 14 | Borrego Mtn | 1968 | San Onofre - so Cal Edison | 6.63 | Strike slip | 129.11 | 129.11 | 442.88 |
| 15 | San Fernando | 1971 | 2516 via Tejon PV | 6.61 | Reverse | 55.2 | 55.2 | 280.56 |
| 16 | San Fernando | 1971 | Anza post Office | 6.61 | Reverse | 173.16 | 173.16 | 360.45 |
| 17 | San Fernando | 1971 | Bakersfield - Harvey Aud | 6.61 | Reverse | 111.88 | 113.02 | 241.41 |
| 18 | San Fernando | 1971 | Borrego Springs Fire sta | 6.61 | Reverse | 214.32 | 214.32 | 338.54 |
| 19 | San Fernando | 1971 | Buena Vista - Taft | 6.61 | Reverse | 111.37 | 112.52 | 385.69 |
| 20 | San Fernando | 1971 | Carbon Canyon dam | 6.61 | Reverse | 61.79 | 61.79 | 235 |
| 21 | San Fernando | 1971 | Castaic - Old Ridge Route | 6.61 | Reverse | 19.33 | 22.63 | 450.28 |
| 22 | San Fernando | 1971 | Cedar Springs Pumphouse | 6.61 | Reverse | 92.25 | 92.59 | 477.22 |
| 23 | San Fernando | 1971 | Cedar Springs, Allen Ranch | 6.61 | Reverse | 89.37 | 89.72 | 813.48 |
| 24 | San Fernando | 1971 | Cholame - Shandon Array #2 | 6.61 | Reverse | 217.54 | 218.13 | 184.75 |
| 25 | San Fernando | 1971 | Cholame - Shandon Array #8 | 6.61 | Reverse | 218.17 | 218.75 | 256.82 |
| 26 | San Fernando | 1971 | Colton - so Cal Edison | 6.61 | Reverse | 96.81 | 96.81 | 301.95 |
| 27 | San Fernando | 1971 | Fairmont dam | 6.61 | Reverse | 25.58 | 30.19 | 634.33 |
| 28 | San Fernando | 1971 | Fort Tejon | 6.61 | Reverse | 59.52 | 61.64 | 394.18 |
| 29 | San Fernando | 1971 | Gormon - Oso Pump Plant | 6.61 | Reverse | 43.95 | 46.78 | 308.35 |
| 30 | San Fernando | 1971 | Hemet Fire Station | 6.61 | Reverse | 139.14 | 139.14 | 328.09 |
| 31 | San Fernando | 1971 | Isabella dam (Aux Abut) | 6.61 | Reverse | 130 | 130.98 | 591 |
| 32 | San Fernando | 1971 | LA - Hollywood stor FF | 6.61 | Reverse | 22.77 | 22.77 | 316.46 |
| 33 | San Fernando | 1971 | LB - Terminal Island | 6.61 | Reverse | 58.99 | 58.99 | 217.92 |
| 34 | San Fernando | 1971 | Lake Hughes #1 | 6.61 | Reverse | 22.23 | 27.4 | 425.34 |
| 35 | San Fernando | 1971 | Lake Hughes #12 | 6.61 | Reverse | 13.99 | 19.3 | 602.1 |
| 36 | San Fernando | 1971 | Lake Hughes #4 | 6.61 | Reverse | 19.45 | 25.07 | 600.06 |
| 37 | San Fernando | 1971 | Lake Hughes #9 | 6.61 | Reverse | 17.22 | 22.57 | 670.84 |
| 38 | San Fernando | 1971 | Maricopa Array #1 | 6.61 | Reverse | 193.25 | 193.91 | 303.79 |
| 39 | San Fernando | 1971 | Maricopa Array #2 | 6.61 | Reverse | 108.56 | 109.73 | 443.85 |
| 40 | San Fernando | 1971 | Maricopa Array #3 | 6.61 | Reverse | 109.01 | 110.18 | 441.25 |
| 41 | San Fernando | 1971 | Pacoima dam (upper left abut) | 6.61 | Reverse | 0 | 1.81 | 2016.1 |
| 42 | San Fernando | 1971 | Palmdale Fire Station | 6.61 | Reverse | 24.16 | 28.99 | 452.86 |
| 43 | San Fernando | 1971 | Pasadena - CIT Athenaeum | 6.61 | Reverse | 25.47 | 25.47 | 415.13 |
| 44 | San Fernando | 1971 | Pasadena - Old Seismo Lab | 6.61 | Reverse | 21.5 | 21.5 | 969.07 |
| 45 | San Fernando | 1971 | Pearblossom Pump | 6.61 | Reverse | 35.54 | 38.97 | 529.09 |
| 46 | San Fernando | 1971 | Port Hueneme | 6.61 | Reverse | 68.84 | 68.84 | 248.98 |
| 47 | San Fernando | 1971 | Puddingstone dam (Abutment) | 6.61 | Reverse | 52.64 | 52.64 | 421.44 |
| 48 | San Fernando | 1971 | San Diego Gas & Electric | 6.61 | Reverse | 205.77 | 205.77 | 354.06 |
| 49 | San Fernando | 1971 | San Juan Capistrano | 6.61 | Reverse | 108.01 | 108.01 | 459.37 |
| 50 | San Fernando | 1971 | San Onofre - so Cal Edison | 6.61 | Reverse | 124.79 | 124.79 | 442.88 |
| 51 | San Fernando | 1971 | Santa Anita dam | 6.61 | Reverse | 30.7 | 30.7 | 667.13 |
| 52 | San Fernando | 1971 | Santa Felita dam (Outlet) | 6.61 | Reverse | 24.69 | 24.87 | 389 |
| 53 | San Fernando | 1971 | Tehachapi Pump | 6.61 | Reverse | 61.75 | 63.79 | 669.48 |
| 54 | San Fernando | 1971 | UCSB - Fluid Mech Lab | 6.61 | Reverse | 124.38 | 124.41 | 322.42 |
| 55 | San Fernando | 1971 | Upland - San Antonio dam | 6.61 | Reverse | 61.72 | 61.73 | 487.23 |
| 56 | San Fernando | 1971 | Wheeler Ridge - ground | 6.61 | Reverse | 68.38 | 70.23 | 347.67 |
| 57 | San Fernando | 1971 | Whittier Narrows dam | 6.61 | Reverse | 39.45 | 39.45 | 298.68 |
| 58 | San Fernando | 1971 | Wrightwood - 6074 Park Dr | 6.61 | Reverse | 61.64 | 62.23 | 486 |
| 59 | Friuli, Italy-01 | 1976 | Barcis | 6.5 | Reverse | 49.13 | 49.38 | 496.46 |
| 60 | Friuli, Italy-01 | 1976 | Codroipo | 6.5 | Reverse | 33.32 | 33.4 | 249.28 |
| 61 | Friuli, Italy-01 | 1976 | Conegliano | 6.5 | Reverse | 80.37 | 80.41 | 352.05 |
| 62 | Friuli, Italy-01 | 1976 | Feltre | 6.5 | Reverse | 102.05 | 102.15 | 356.39 |
| 63 | Friuli, Italy-01 | 1976 | Tolmezzo | 6.5 | Reverse | 14.97 | 15.82 | 505.23 |
| 64 | Gazli, USSR | 1976 | Karakyr | 6.8 | Reverse | 3.92 | 5.46 | 259.59 |
| 65 | Tabas, Iran | 1978 | Bajestan | 7.35 | Reverse | 119.77 | 120.81 | 377.56 |
| 66 | Tabas, Iran | 1978 | Boshrooyeh | 7.35 | Reverse | 24.07 | 28.79 | 324.57 |
| 67 | Tabas, Iran | 1978 | Dayhook | 7.35 | Reverse | 0 | 13.94 | 471.53 |
| 68 | Tabas, Iran | 1978 | Ferdows | 7.35 | Reverse | 89.76 | 91.14 | 302.64 |
| 69 | Tabas, Iran | 1978 | Kashmar | 7.35 | Reverse | 193.91 | 194.55 | 280.26 |
| 70 | Tabas, Iran | 1978 | Sedeh | 7.35 | Reverse | 150.33 | 151.16 | 354.37 |
| 71 | Tabas, Iran | 1978 | Tabas | 7.35 | Reverse | 1.79 | 2.05 | 766.77 |
| 72 | Imperial Valley-06 | 1979 | Aeropuerto Mexicali | 6.53 | Strike slip | 0 | 0.34 | 259.86 |
| 73 | Imperial Valley-06 | 1979 | Agrarias | 6.53 | Strike slip | 0 | 0.65 | 242.05 |
| 74 | Imperial Valley-06 | 1979 | Bonds Corner | 6.53 | Strike slip | 0.44 | 2.66 | 223.03 |
| 75 | Imperial Valley-06 | 1979 | Brawley Airport | 6.53 | Strike slip | 8.54 | 10.42 | 208.71 |
| 76 | Imperial Valley-06 | 1979 | Calexico Fire Station | 6.53 | Strike slip | 10.45 | 10.45 | 231.23 |
| 77 | Imperial Valley-06 | 1979 | Calipatria Fire Station | 6.53 | Strike slip | 23.17 | 24.6 | 205.78 |
| 78 | Imperial Valley-06 | 1979 | Cerro Prieto | 6.53 | Strike slip | 15.19 | 15.19 | 471.53 |
| 79 | Imperial Valley-06 | 1979 | Chihuahua | 6.53 | Strike slip | 7.29 | 7.29 | 242.05 |
| 80 | Imperial Valley-06 | 1979 | Coachella Canal #4 | 6.53 | Strike slip | 49.1 | 50.1 | 336.49 |
