Abstract
In recent years, with the development of flash memory technology, storage systems in large data centers have typically been built upon thousands or even millions of solid-state drives (SSDs). At this scale, SSD failures are inevitable. An SSD failure may cause unrecoverable data loss or service unavailability, with potentially catastrophic consequences. Active fault detection technologies can detect device problems in advance, so they are gaining popularity. Recent work has turned toward applying AI algorithms to SSD SMART data for fault detection. However, the SMART data of new SSDs contains a large number of features, and this high dimensionality degrades the accuracy of AI algorithms for fault detection.
To tackle this problem, we improve the structure of the traditional Auto-Encoder (AE) with Gated Recurrent Units (GRUs) and, exploiting the temporal characteristics of SSD SMART data, propose GAL, an SSD fault detection method based on dimensionality reduction with a GRU sparse autoencoder (GRUAE). The proposed method first trains the GRUAE model on SSD SMART data and then adopts the encoder of the GRUAE model as the dimensionality reduction tool for the original high-dimensional SSD SMART data, aiming to reduce the influence of noise features in the original SSD SMART data and to highlight the features more relevant to data characteristics, thereby improving the accuracy of fault detection. Finally, LSTM is adopted for fault detection on the low-dimensional SSD SMART data. Experimental results on a real SSD dataset from Alibaba show that the fault detection accuracy of various AI algorithms improves to varying degrees after dimensionality reduction with the proposed method, and that GAL performs best among all methods.
Keywords
Acronyms
SSD: Solid State Drives
GRU: Gated Recurrent Unit
AE: Auto-Encoder
GRUAE: GRU Sparse Auto-Encoder
LSTM: Long Short-Term Memory
GAL: GRUAE + LSTM
SMART: self-monitoring, analysis and reporting technology
ML: machine learning
DR: dimensionality reduction
PCA: Principal Component Analysis
SVD: Singular Value Decomposition
FA: Factor Analysis
penalty term
Nomenclature
sparse penalty coefficient
average activation amount of j-th unit
parameter set
activation amount of the j-th unit in the hidden layer
offset vector
activation function
d′-dimensional latent representation of input data with low dimensionality with sample number k
cost function
sparse constant
number of neurons at layer l
weight matrix
d-dimensional sample set with sample number k
data reconstructed by GRU-decoder
Introduction
With the continuous development of storage technology, large-scale data centers usually deploy a large number of solid-state drives (SSDs) as the underlying storage devices to improve the data processing efficiency of the storage system; examples include the data servers of Alibaba Pangu [1], Amazon [2], Google [3], Facebook [4], and Microsoft Azure [5]. In such data centers, ensuring high availability and reliability is an extremely challenging undertaking for IT management, as various drive failures constantly occur in the field. Data centers usually adopt data protection mechanisms such as data replication or erasure codes [5, 6]. If the lost data cannot be recovered despite these protection mechanisms, permanent data loss occurs and the system becomes unusable, which would be disastrous for the data center. SSDs have limited endurance, so failure is inevitable; they may fail with different severity and manifestations for a variety of reasons, as observed in many data centers [7–10]. Compared with traditional passive fault-tolerant techniques such as Erasure Code (EC) and Redundant Arrays of Independent Disks (RAID), active fault detection techniques can safeguard the reliability and availability of large-scale storage systems in advance. Thus, the risk of data loss can be reduced by successfully identifying drive failures.
In order to monitor the health status of SSDs, manufacturers generally implement self-monitoring, analysis and reporting technology (SMART) in the firmware of devices. The SMART attributes contain the drive state information and possible defects. Internally, drives use the so-called "threshold method" based on SMART values to evaluate failures: a drive raises an alarm if the value of one or more SMART attributes crosses the corresponding predefined threshold. However, this "threshold method" achieves only a 3-10% failure detection rate (FDR) at a 0.1% false alarm rate (FAR) in practice [11]; in other words, it is too conservative and misses opportunities to detect disk failures.
In the literature, many prior studies [9, 12–16] have investigated fault detection for hard disk drives (HDDs) based on SMART data. In contrast, due to the complicated characteristics of SSDs [7, 17–19], only a few studies [8, 21] address SSD failure detection, and they rely on private data from Google, which limits their scope. Therefore, it is necessary to study SSD fault detection based on SMART data from production environments.
Machine learning (ML) algorithms have been widely adopted in fault detection research. The detection performance of these algorithms depends greatly on the quality of the data. As SSD manufacturing technologies advance, more and more SSD reliability data can be collected, and this data contains a large number of features. In this context, the performance of many traditional machine learning algorithms is negatively affected by the high dimensionality of the data [22]. The main reason lies in the "curse of dimensionality" [23]: the performance of traditional classifiers decreases as the dimension of the input data increases.
Hence, processing the original data to eliminate the negative effects of the high-dimensional feature space is a prerequisite for the effectiveness of the algorithms. Dimensionality reduction (DR) can reduce the impact of noise features in the original data and highlight features more related to data characteristics [24]. Principal Component Analysis (PCA) achieves dimensionality reduction by projecting the original data into a linear subspace spanned by the principal eigenvectors of the data covariance matrix [25]. Locality discriminant analysis [26] and marginal Fisher analysis [27] are local methods that construct a low-dimensional subspace based on different graphs of samples. However, these traditional dimensionality reduction methods are mostly designed for linear data and have limitations when dealing with high-dimensional nonlinear data.
Auto-Encoder (AE) is known as an effective tool for learning data coding in an unsupervised manner [28]. An auto-encoder has two parts: the encoder and the decoder. The encoder operator aims at learning low-dimensional representations for a set of data typically for dimensionality reduction, while the decoder attempts to generate reconstructions from the low-dimensional representations that can be close enough to the original inputs. As a data-driven method, an autoencoder can nonlinearly extract the most important features of data instead of relying on manually specified features; thus, it enables us to obtain an optimal dimensionality reduction performance.
Original SSD SMART data often contains many parameters, a large share of which are ineffective in determining SSD failures. Modeling directly on the original SSD SMART data may cause the model to learn many useless features and consume considerable training time. The high data dimensionality and the redundant parameters normally have negative impacts on the accuracy and efficiency of the model. Building the fault detection model on low-dimensional SSD SMART data obtained by dimensionality reduction can reduce the impact of invalid, redundant, or incorrect features, improve the efficiency and accuracy of modeling, and enable the model to detect SSD faults more accurately. Because SSD SMART data has temporal characteristics, we improve the structure of the traditional AE with Gated Recurrent Units (GRUs) and propose GAL, an SSD fault detection method based on dimensionality reduction with a GRU sparse autoencoder (GRUAE). The proposed method first trains the GRUAE model on SSD SMART data and then adopts the encoder of the GRUAE model as the dimensionality reduction tool for the original high-dimensional SSD SMART data, aiming to remove redundant features, reduce the influence of noise features in the original SSD SMART data, and highlight the features more relevant to data characteristics. Finally, Long Short-Term Memory (LSTM) is adopted for fault detection on the low-dimensional SSD SMART data. We evaluate the effectiveness of the method on real SSD datasets. The experimental results show that the proposed method detects SSD faults more accurately than approaches without dimensionality reduction or feature extraction.
The rest of this paper is organized as follows. Section II reviews related work. Section III presents the background knowledge. Section IV describes the proposed method. The dataset and the experimental results are presented in Section V and Section VI, respectively. Finally, Section VII concludes the paper.
Related works
Our work is mainly related to two lines of study: dimensionality reduction and fault detection. For dimensionality reduction, PCA [25], one of the most popular unsupervised dimensionality reduction algorithms, projects the original data into a linear subspace spanned by the principal eigenvectors of the data covariance matrix. SVD [34] computes singular values and singular vectors to generate an approximate matrix that can replace the original matrix, ranks the singular value representations of the dataset by importance, and discards unimportant feature vectors to achieve dimensionality reduction. However, SVD handles noise poorly, and many works have studied the combination of SVD with other methods, such as locally linear embedding [35] and Hankel matrix-based Fast SVD [36]. By finding the hidden representative factors among different features, FA [37] reduces the number of features by grouping features of the same nature into one factor, and can also test hypotheses about the relationships between variables. Traditional dimensionality reduction methods such as PCA and SVD are commonly used for linear data; they suffer from limitations such as slow processing speed and poor dimensionality reduction effect when processing high-dimensional nonlinear data. Therefore, many AE-based dimensionality reduction methods have been proposed in recent years. SSDSAE [29] is proposed for the fault detection of rotating machines: feature extraction and dimensionality reduction are first conducted on the vibration spectrum of the collected signal, and the faults are then detected based on the low-dimensional data. SSDSAE has stronger robustness and information structure extraction ability than other deep learning algorithms. Guo et al. [30] proposed a stacked sparse autoencoder model, S-SAE, with three hidden layers to solve the problem of excessive dimension and excessive scattering of parameters in multi-temporal quad-pol SAR images for crop classification. The classification accuracy of CNN is effectively improved by dimensionality reduction and feature extraction based on S-SAE. Shen et al. [31] applied stacked contractive auto-encoder technology to the automatic feature extraction of rotating machinery and improved the robustness of the fault diagnosis method. ClEnDAE [32] combines ensemble-based classifiers with denoising AEs to reduce the dimension of the input space. The work in [33] converts the original high-dimensional seismic inversion problem into a low-dimensional one by using the dimensionality reduction characteristics of AE, and effectively solves the low-dimensional seismic inversion problem through global optimization.
For fault detection, many prior works [9, 12–16] have studied HDD fault prediction. Xu et al. [9] develop a cost-sensitive machine learning model based on ranking, which learns the characteristics of past failed HDDs and classifies HDDs according to their future error tendency. Zhang et al. [13] propose a tier-scrubbing scheme based on LSTM and an Adaptive Scrubbing Rate Controller, which locates high-risk areas of HDDs according to errors in local sectors in order to detect HDD faults. Xiao et al. [14] propose an HDD fault prediction model based on Online Random Forests, which evolves automatically as data is collected and is highly adaptable to changes in the SMART data distribution over time. In recent years, some optimization methods for HDD fault detection have also been studied. Ircio et al. [38] use a sliding window mechanism to extract HDD SMART data focusing on the failed HDDs: the method extracts the data close to the failures and then uses the extracted data for fault detection. The detection accuracy is improved when the ratio of positive to negative samples is unbalanced. Chhetri et al. [39] convert HDD SMART data from tabular format to triple format through a Knowledge Graph (KG) and then use a Relational Graph Convolution Network for fault detection, offering a new approach to HDD fault detection. Mamoutova et al. [40] propose an ontology-based method to automatically analyze and diagnose system logs based on a semantic model. The method uses an ontology to represent fault symptoms, and then analyzes data and diagnoses HDD faults in combination with ML algorithms. Wang et al. [41] model the degradation process of HDDs based on a Rao-Blackwellized particle filter, and iteratively adjust the difference between the real observed value and the value estimated by the model. Their method can analyze HDD SMART data in real time and evaluate the current degradation degree of HDDs, effectively reducing the false alarm rate of HDD failure prediction. Reference [42] improves the network structure of Generative Adversarial Networks (GANs) based on LSTM, and expands the total amount of data by generating virtual failed HDD data through the GAN. The accuracy of HDD fault detection is effectively improved in the case of small samples.
However, work on SSD fault detection is quite limited. Narayanan et al. [7] study the importance of different SMART features for SSD failure prediction based on random forest (RF). Mahdisoltani et al. [15] predict sector errors based on Google SSD data and analyze the possibility of improving storage system reliability through the sector error predictor. Alter et al. [20] also use data from Google to construct SSD failure prediction models with various machine learning algorithms such as RF and support vector machine (SVM), and analyze the influence of different workloads on SSD failure. Sarkar et al. [21] study SSD fault prediction based on features provided by firmware functions. Existing works generally focus on analyzing the original SSD SMART data and constructing the prediction model directly, which leaves limitations and room for optimization in the data pre-processing stage of model construction. Chandranil et al. [43] propose a one-class model to address the unbalanced ratio of positive and negative samples in SSD fault detection, which trains only on the majority class. It reduces the risk of over-fitting caused by the imbalance of positive and negative samples and improves the accuracy of prediction.
The main difference between our work and other studies is the following. Although existing works have proposed optimization methods for SSD fault detection from different perspectives, these methods generally analyze and model the original SSD SMART data directly. The original SSD SMART data, however, often contains many parameters, a large share of which do not help determine whether an SSD is faulty, and the redundant parameters normally have negative impacts on the accuracy and efficiency of the model. Traditional dimensionality reduction methods such as PCA and SVD are commonly used for linear data, but suffer from limitations such as slow processing speed and poor dimensionality reduction effect when processing high-dimensional nonlinear data. Moreover, SSD SMART data has strong temporal correlation, and high-dimensional temporal data greatly reduces the effectiveness of traditional dimensionality reduction methods. Considering these problems, the method proposed in this paper does not directly model the original SSD SMART data, but proposes a GRU-based GRUAE model that exploits the temporal characteristics of SSD SMART data. The method first trains the GRUAE model on the original SSD SMART data; the encoder of the GRUAE model is then used to reduce the dimensionality of the original SSD SMART data, weakening the impact of noise features and highlighting features more relevant to data characteristics. Finally, LSTM is adopted to detect faults based on the low-dimensional SSD SMART data. As a result, the accuracy of SSD fault detection by machine learning and deep learning algorithms is effectively enhanced.
Preliminary knowledge
Gated recurrent unit
Gated Recurrent Unit (GRU) [44] was proposed to make each recurrent unit adaptively capture dependencies at different time scales. Similar to the LSTM unit, the GRU has gating units that modulate the information flow within the unit, but it has no separate memory cell. The graphical illustration of the GRU is displayed in Fig. 1.

Gated recurrent unit.
The activation
This procedure of taking a linear sum between the existing state and the newly computed state is similar to the LSTM unit. However, the GRU does not contain any mechanisms to control the degree to which its state is exposed, but exposes the whole state each time.
The candidate activation
The reset gate
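For reference, the three statements above (activation, candidate activation, reset gate) correspond to the standard GRU formulation. The notation below is an assumption and may differ from the symbols used in the original equations: W and U denote input and recurrent weight matrices, σ the logistic sigmoid, ⊙ the element-wise product, and the superscript j the j-th unit.

```latex
\begin{aligned}
h_t^j &= (1 - z_t^j)\, h_{t-1}^j + z_t^j\, \tilde{h}_t^j
  && \text{(activation: interpolation of old and candidate state)} \\
z_t^j &= \sigma\left(W_z x_t + U_z h_{t-1}\right)^j
  && \text{(update gate)} \\
\tilde{h}_t^j &= \tanh\left(W x_t + U\,(r_t \odot h_{t-1})\right)^j
  && \text{(candidate activation)} \\
r_t^j &= \sigma\left(W_r x_t + U_r h_{t-1}\right)^j
  && \text{(reset gate)}
\end{aligned}
```

When the reset gate is close to 0, the candidate activation ignores the previous state, which lets the unit drop information that is irrelevant for later time steps.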
AE is a learning model, aiming at extracting a representation feature from a piece of data through unsupervised learning. The structure of an AE contains two parts: encoder and decoder, as illustrated in Fig. 2. Encoder intends to extract a latent code from input data and map the input data to a low-dimensional feature space. Decoder tries to reconstruct a piece of data from the latent code as close to the original input data as possible.

Structure of auto-encoder.
The formulations of the encoder and decoder are as follows:
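A common way to write a basic auto-encoder, given here as a sketch under assumed notation, uses a weight matrix W and offset vector b for the encoder, W′ and b′ for the decoder, and activation functions f and g. The symbols are assumptions chosen to match the entries of the Nomenclature and are not taken from the original equations.

```latex
\begin{aligned}
h &= f\left(W x + b\right)            && \text{(encoder: input } x \text{ to latent code } h\text{)} \\
\hat{x} &= g\left(W' h + b'\right)    && \text{(decoder: latent code to reconstruction } \hat{x}\text{)} \\
\theta^{*} &= \arg\min_{\theta} \frac{1}{k} \sum_{i=1}^{k} \left\lVert x_i - \hat{x}_i \right\rVert^2,
\qquad \theta = \{W, b, W', b'\}
\end{aligned}
```

Training minimizes the reconstruction error over the k samples, so the latent code h is forced to retain the information needed to rebuild x.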
In real environments, SSD faults tend to develop gradually over time. Therefore, the SSD reliability characteristics have a strong temporal correlation. Compared with traditional data generation methods, GRU is better at capturing the temporal characteristics of samples and extracting time-related features. Therefore, we adopt GRU as the encoder of the AE model to fit the probability distribution function of SSD samples, so that the encoder can better learn the temporal characteristics of SSD SMART data, extract the latent code, and map the original input data into a low-dimensional feature space. GRU is also adopted as the decoder to better capture the temporal characteristics of the data and reconstruct the data from the low-dimensional space to its original state. The original SSD SMART data often contains many parameters, a large share of which do not help determine whether an SSD is faulty. The high dimensionality and redundant parameters normally have negative impacts on the accuracy and efficiency of the model. The proposed method first trains the GRUAE model on the original SSD SMART data, and then uses the encoder of the GRUAE model to reduce the dimensionality of the original SSD SMART data, aiming to remove the redundant features, weaken the impact of noise features, and highlight the features more relevant to data characteristics. Building the fault detection model on low-dimensional SSD SMART data after dimensionality reduction can reduce the impact of invalid, redundant, or incorrect data on modeling, improve the efficiency and accuracy of modeling, and enable the model to detect SSD faults more accurately. As shown in Fig. 3, our method mainly includes the training of GRUAE, dimensionality reduction, and fault detection. The main steps are as follows:

Architecture of proposed approach.
The typical structure of AE is encoder-decoder structure. The proposed GRUAE model contains two independent GRU layers, as shown in Fig. 4. For d-dimensional sample set
For the structure of the model, the learning ability of AE will be affected if the hidden layer contains more neurons than the input layer, so it is necessary to impose certain constraints on hidden layer neurons. We add a sparse penalty term to the hidden layer to suppress the output of hidden layer neurons, so that the network achieves sparse effect. In this way, AE can still learn important features of input data even if there are plentiful hidden layer neurons.

Architecture of GRUAE model.
Specifically, the expression of encoder is as follows:
Similarly, the reconstructed output is expressed as:
Assuming that a_j(x) represents the activation amount of the j-th unit in the hidden layer, then the average activation amount of the j-th unit is:
where k is the number of samples. To ensure that most neurons are in the "inactive" state, assume that
The optimization objective function of the GRUAE model with sparse penalty terms is defined as follows:
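The sparse penalty can be illustrated with the standard KL-divergence formulation for sparse auto-encoders. This is a minimal sketch under stated assumptions: the penalty is taken to be the sum of KL(ρ || ρ̂_j) over hidden units, activations are assumed to lie in (0, 1), and the names `rho` (sparse constant), `beta` (sparse penalty coefficient) and all function names are hypothetical rather than taken from the paper.

```python
import math
from typing import List, Sequence


def average_activations(hidden: Sequence[Sequence[float]]) -> List[float]:
    """Mean activation of each hidden unit over the k samples (one row per sample)."""
    k = len(hidden)
    n_units = len(hidden[0])
    return [sum(row[j] for row in hidden) / k for j in range(n_units)]


def kl_sparsity_penalty(rho: float, rho_hat: Sequence[float], eps: float = 1e-8) -> float:
    """Sum over hidden units of KL(rho || rho_hat_j) between Bernoulli distributions."""
    total = 0.0
    for r in rho_hat:
        r = min(max(r, eps), 1.0 - eps)  # clamp so that both logarithms stay finite
        total += rho * math.log(rho / r) + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - r))
    return total


def sparse_cost(reconstruction_loss: float, beta: float, rho: float,
                hidden: Sequence[Sequence[float]]) -> float:
    """Assumed objective: reconstruction loss plus beta times the sparsity penalty."""
    return reconstruction_loss + beta * kl_sparsity_penalty(rho, average_activations(hidden))
```

The penalty is zero when every average activation equals the sparse constant and grows as units become more active on average, which is what pushes most hidden neurons toward the "inactive" state.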
Algorithm-1 summarizes the training process of the GRUAE model. The network parameters are initialized before the iterations start. For the original high-dimensional SSD SMART data, in each epoch, forward propagation first calculates the hidden layer output and then calculates the output layer results from the hidden layer results. During back propagation, the network parameters are updated by Adam to minimize the loss function with sparse penalty terms, and the iteration is repeated until convergence. When setting the parameters of the algorithm, if the amount of data is large enough, the number of epochs can be set relatively small, together with a moderate batch size, to avoid unnecessary iterations and reduce training time. When initializing the network, to prevent the output of the activation function from tending to 0, the selected initialization method should make the input and output follow the same distribution as far as possible. The adjustment parameter of the regularization term, the learning rate, and the other parameters should not be too large.
The more complex the model is, the larger its parameter values tend to become as it tries to fit all the sample points. If the data contains abnormal samples, the model output fluctuates strongly within small ranges, leading to over-fitting of the training data. Hence, we adopt weight decay (L2 regularization) in the design of Algorithm-1, which simplifies the model, shrinks the parameter values, avoids overfitting, and improves the robustness of the algorithm. A GRU contains two gate structures, the update gate and the reset gate, and additionally computes a candidate state. Each of these three components has its own weights over the input and the previous hidden state and its own bias, so a GRU maintains three sets of parameters. The total number of parameters is 3 * ((m + n) * n + n), where m is the size of the input layer and n is the size of the hidden layer. The GRUAE model in this paper consists of an encoder and a decoder, each consisting of a GRU. Therefore, the total number of parameters of the GRUAE model is 6 * ((m + n) * n + n), which represents the complexity of the model and determines the convergence speed and calculation speed of the algorithm. Training the GRUAE model takes six control parameters in total: the number of epochs, the batch size, the adjustment parameter of the regularization term, the learning rate, the dropout, and the sparse penalty coefficient. For ease of implementation, the proposed algorithm is implemented on the common open-source framework PyTorch.
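The parameter-count formulas above can be checked with a few lines of arithmetic. The function names and the example sizes (m = 20 input features, n = 10 hidden units) are hypothetical and only illustrate the formulas as stated in the text.

```python
def gru_param_count(m: int, n: int) -> int:
    """3 * ((m + n) * n + n): three parameter sets, each with an
    (m + n) x n weight block and an n-dimensional bias."""
    return 3 * ((m + n) * n + n)


def gruae_param_count(m: int, n: int) -> int:
    """Encoder plus decoder, both counted with the same m and n as in the text."""
    return 2 * gru_param_count(m, n)


if __name__ == "__main__":
    # Hypothetical sizes: 20 input features, 10 hidden units.
    print(gru_param_count(20, 10))    # 3 * (30 * 10 + 10) = 930
    print(gruae_param_count(20, 10))  # 6 * (30 * 10 + 10) = 1860
```

Note that PyTorch's `nn.GRU` stores separate input and recurrent bias vectors for each of the three parameter sets, so a count read from an actual PyTorch module would be larger by 3n per GRU layer than the formula in the text.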
After the GRUAE model is trained, for d-dimensional sample set
LSTM Model
To capture long-term temporal dependencies, the LSTM defines and maintains a cell state to regulate the information flow. The cell state C_{t-1} interacts with the intermediate output h_{t-1} and the subsequent input x_t to determine which elements of the internal state vector should be updated, maintained, or removed based on the output of the previous time step and the input of the current time step. The formulas of the LSTM network are described as follows:
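The standard LSTM formulation, given here as a reference sketch, uses a forget gate f_t, an input gate i_t, a candidate cell state, and an output gate o_t. The weight and bias symbols (W_f, b_f, and so on) are assumed notation; [h_{t-1}, x_t] denotes concatenation, σ the logistic sigmoid, and ⊙ the element-wise product.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right)          && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right)          && \text{(input gate)} \\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right)   && \text{(candidate cell state)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t             && \text{(cell state update)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right)          && \text{(output gate)} \\
h_t &= o_t \odot \tanh\left(C_t\right)                       && \text{(intermediate output)}
\end{aligned}
```

The forget gate decides which elements of C_{t-1} are removed, the input gate decides which are updated with new content, and the output gate controls how much of the cell state is exposed as h_t.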
Datasets
To evaluate the performance and limitations of the proposed method, we use the SSD dataset from Alibaba [45]. The dataset contains SMART data about the SSDs and some basic information, such as timestamps and device serial numbers. It is important to note that the SMART attributes provided by different manufacturers may differ, and some attributes may have different meanings depending on the type of device. Therefore, for the experiment we extract the 2019 SMART data of SSD models MA2 (MLC) and MC1 (3-D TLC), which use different flash types and have relatively complete data records. The SSDs in the original dataset fall into two classes: Healthy and Failed. However, because the original dataset covers a long time span, some SSD SMART data records are incomplete and some related parameters are missing. Therefore, we remove the SSD SMART data with incomplete records when processing the original dataset. The basic information about the dataset is shown in Table 1. The sampling interval is 24 h.
Overview of the dataset
The SMART data in the dataset contains raw values, denoted as Raw, and normalized values, denoted as Norm. The Norm values are calculated from the Raw values according to the manufacturer’s nonpublic custom formulas. As some Norm values may lose precision while the corresponding Raw values may be highly sensitive to changes in disk health, both Raw and Norm values are used in the experiments, as shown in Table 2.
Feature Information
The range of values spanned by different features varies widely. To avoid bias towards features with large values, we apply feature scaling for data normalization according to the following formula:
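A common choice for this kind of feature scaling is min-max normalization, x' = (x - x_min) / (x_max - x_min), which maps each feature to [0, 1]. The sketch below assumes that this is the scaling meant here; the function name and the handling of constant features are illustrative choices, not details from the paper.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Scale one feature column to [0, 1] with x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    # Hypothetical raw counter values for one SMART attribute.
    print(min_max_scale([10.0, 20.0, 30.0]))  # [0.0, 0.5, 1.0]
```

In practice the minimum and maximum would be computed on the training set only and reused for the test set, so that no information from the test data leaks into the scaling.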
Experiment conditions and methods
We use Precision, Recall and F0.5-score to measure the effect of different machine learning algorithms on fault detection. Precision indicates the proportion of true positives (TPs) among all predicted failures. Recall represents the proportion of TPs among all actually failed disks. In practice, once an SSD is detected as failed (whether correctly or falsely), administrators decommission the SSD for further inspection. Since the cost of replacing a healthy SSD that is falsely detected as failed is higher than that of missing a failed SSD that is falsely detected as healthy [45], we use the F0.5-score instead of the F1-score, which treats precision as twice as important as recall. These metrics are defined as:
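The three metrics follow the standard definitions: Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall) with β = 0.5. The sketch below computes them from confusion-matrix counts; the function names and the example counts are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted failures that are real failures."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of real failures that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_beta(p: float, r: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
    p, r = precision(90, 10), recall(90, 30)
    print(p, r, f_beta(p, r))
```

With these example counts the precision (0.9) is higher than the recall (0.75), so the F0.5-score lies closer to the precision than the F1-score does.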
To evaluate the effectiveness of the proposed method, the dataset is randomly divided into a training set and a test set with the ratio of 7:3. The encoder of GRUAE model is a GRU network with one hidden layer. The number of cells in the input layer is equal to the dimension of SSD SMART data. The hidden layer has 10 cells with time-step of 3, the number of cells in the output layer is 10, and the dropout is set as 0.2. The decoder is also a GRU network with one hidden layer. The number of cells in the input layer is 10 and the number of cells in hidden layer is 10. The output layer has linear mapping and tanh is used as the activation function. The learning rate is 0.001.
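The data preparation implied by this setup can be sketched as follows. The sketch assumes that the time-step of 3 means each model input is a window of three consecutive daily records from one drive, and that the 7:3 split is applied to the resulting samples; both are assumptions about details the text does not spell out, and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def make_windows(records: Sequence[Sequence[float]], time_step: int = 3) -> List[List[Sequence[float]]]:
    """Turn one drive's chronologically ordered daily records into overlapping windows."""
    return [list(records[i:i + time_step]) for i in range(len(records) - time_step + 1)]


def split_7_3(samples: Sequence, seed: int = 0) -> Tuple[list, list]:
    """Randomly split samples into a 70% training set and a 30% test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Five days of hypothetical two-feature records for one drive.
    days = [[0.1, 0.2], [0.2, 0.2], [0.3, 0.1], [0.4, 0.1], [0.5, 0.0]]
    windows = make_windows(days)
    print(len(windows))  # 3 windows of 3 consecutive days each
```

Splitting by drive instead of by window would avoid placing overlapping windows of the same drive in both sets; which variant was used is not stated in the text.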
When setting the relevant parameters of Algorithm-1, the number of epochs for training the GRUAE model is set to 2, considering that the dataset used in the experiment contains tens of millions of records, which is a relatively large amount. The larger the batch size during training, the smaller the total number of update steps and the corresponding total update amount. Because the optimal solution usually lies at a certain distance from the initialization point, a small total update amount confines the exploration of the optimization algorithm to a limited region of the parameter space near the initialization point, and the solutions found may be poor. Therefore, the batch size is set to the commonly used value of 128 [46]. Because the activation function in the output layer is tanh, Xavier initialization is adopted for network initialization [47]. To make the model fit the training data better while keeping the parameter values small, we set σ to 0.1 to adjust the regularization terms [48]. Algorithm-2 uses the GRUAE-encoder to perform dimensionality reduction on the original high-dimensional data, so Xavier initialization is also adopted when initializing the network.
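Xavier (Glorot) uniform initialization draws each weight from U(-a, a) with a = gain · sqrt(6 / (fan_in + fan_out)), which keeps the variance of activations roughly constant across layers. The following plain-Python sketch illustrates the rule; the function name, the seed, and the layer sizes in the example are hypothetical, and the actual experiments would rely on the framework's built-in initializer.

```python
import math
import random
from typing import List


def xavier_uniform(fan_in: int, fan_out: int, gain: float = 1.0, seed: int = 0) -> List[List[float]]:
    """Sample a fan_out x fan_in weight matrix from U(-a, a),
    with a = gain * sqrt(6 / (fan_in + fan_out))."""
    bound = gain * math.sqrt(6.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]


if __name__ == "__main__":
    # Hypothetical layer: 20 inputs, 10 outputs, so the bound is sqrt(6 / 30).
    weights = xavier_uniform(20, 10)
    print(len(weights), len(weights[0]))
```

Because the bound shrinks as the layer gets wider, the summed input to a tanh unit stays in the range where the activation is neither saturated nor close to zero.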
The structure of LSTM consists of an input layer, two LSTM hidden layers and an output layer. The output layer has linear mapping and Sigmoid is used as the activation function. The number of cells in the input layer is the dimension of SSD SMART data after dimensionality reduction. The number of cells in the hidden layer is 100, the number of cells in the output layer is 1, the dropout is set as 0.2, and the learning rate is 0.001.
The number of cells in the output layer of the encoder is 10, which is the dimension of the data after dimensionality reduction. We use PCA to determine this value by setting the explained variance ratio of the retained principal components to 99%; that is, the retained principal components explain 99% of the variance of the original data, and the number of components is calculated automatically.
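The selection rule can be sketched as picking the smallest number of principal components whose cumulative explained variance reaches the target ratio. The function name and the eigenvalues in the example are hypothetical; in scikit-learn, passing a fraction such as `n_components=0.99` to `PCA` applies a rule of this kind, although whether that exact option was used here is an assumption.

```python
from typing import Sequence


def components_for_variance(eigenvalues: Sequence[float], target: float = 0.99) -> int:
    """Smallest k such that the k largest eigenvalues of the covariance matrix
    explain at least `target` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)


if __name__ == "__main__":
    # Hypothetical eigenvalues: the first four components explain 99.5% of the variance.
    print(components_for_variance([50.0, 30.0, 15.0, 4.5, 0.5]))  # 4
```

The value returned for the real SMART data then serves as the size of the encoder's output layer.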
The optimizer in the training process is Adam. Fig. 5 depicts the changing process of the loss of the model in MA2 and MC1 SSD datasets during training. It can be observed that after a period of training, the loss of the model has decreased to close to 0, indicating that the models have converged.

The loss of GRUAE model for (a) MA2, (b) MC1 during training, where x-axis is the number of iterations.
We conduct the experiment on two SSD datasets, MA2 and MC1, following the same process, which is divided into two stages: dimensionality reduction and fault detection. In the dimensionality reduction stage, the corresponding GRUAE model is trained on the original high-dimensional SSD SMART data. After training, the GRU-encoder is extracted from the GRUAE model as the tool for dimensionality reduction. The dimensionality of the original high-dimensional SSD SMART data is then reduced with four dimensionality reduction methods for comparison: GRUAE-encoder, Principal Component Analysis (PCA), Singular Value Decomposition (SVD) and Factor Analysis (FA). In the fault detection stage, the fault detection performance of seven AI algorithms is compared on the data after dimensionality reduction: Multilayer Perceptron (MLP), Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), GRU and LSTM. We also compare with the approach in reference [45]. The schematic diagram of the experimental procedure is shown in Fig. 6.

Schematic diagram of the experiment process.
The hardware platform used in the experiment is a server, including two Intel(R) Xeon(R) CPUs E5-2620 V4 @ 2.10GHz, 94G RAM and 4T storage. For the software, the operating system is Ubuntu 18.04.5 LTS; the kernel version is Linux version 4.15.0-171-generic, and the compiler version is gcc version 9.3.0. We implement the proposed method based on PyTorch 1.7.1 and scikit-learn 0.24.1.
The precision for MA2 is shown in Fig. 7. Without dimensionality reduction, the two deep learning algorithms with advantages in processing temporal data, GRU and LSTM, are significantly better than the other five traditional machine learning algorithms. After dimensionality reduction, the precision improves to some degree for almost all algorithms. Among them, MLP and DT show large increases, close to 20% on average, while SVM shows an increase of 13%. The other methods show relatively small increases, mostly within 10%. For MLP, RF, SVM and LSTM, the improvement brought by GRUAE is larger than that of the other dimensionality reduction methods. SVM improves little with PCA, SVD and FA, by less than 3%, whereas GRUAE brings an improvement of 13%. SVD improves DT the most, while FA improves GRU the most. Although GRUAE is not the best dimensionality reduction method for these two algorithms, the gap to the best method is only within 3%. For LR, the improvement from the dimensionality reduction methods is not obvious, less than 3% on average. Although the baseline results of GRU and LSTM are already relatively good, dimensionality reduction still yields an improvement of about 3-4%. Because the network structure of LSTM is more complex than that of GRU, LSTM performs slightly better. The precision of the proposed GAL (GRUAE+LSTM) exceeds 97%, which is the best among all methods.

The precision of MA2.
The recall on MA2 is shown in Fig. 8. As with precision, GRU and LSTM are significantly better than the five traditional machine learning algorithms when there is no dimensionality reduction. After dimensionality reduction, the recall of some algorithms improves greatly; DT, for example, gains nearly 40% on average. Other algorithms, such as LR and SVM, improve less, by about 3% on average. FA is the best dimensionality reduction method for DT. GRUAE improves DT slightly less than the traditional methods, but the results are close: GRUAE trails FA by about 4% and the other methods by even less. The situation for GRU is similar: SVD yields the largest improvement, and GRUAE trails it by less than 2%. For MLP, LR and SVM, GRUAE gives a slightly better improvement than the other dimensionality reduction methods. GRUAE improves the recall of MLP and RF by about 4% and 7%, respectively, and that of both GRU and LSTM by about 5%.

The recall of MA2.
The F0.5-score on MA2 is shown in Fig. 9, and the overall trend is similar to that of precision. After dimensionality reduction, the results of nearly all algorithms improve. GRUAE brings the largest improvement to MLP, DT and SVM, about 15% on average. GRU and LSTM improve by a similar amount, about 4%, and both remain better than the other algorithms. Since the precision and recall of GAL are the best among all methods, its F0.5-score is also the best.

The F0.5-score of MA2.
The precision on MC1 is shown in Fig. 10. As on MA2, GRU and LSTM are significantly better than the five traditional machine learning algorithms. Compared with no dimensionality reduction, nearly all algorithms improve after dimensionality reduction; the gain is particularly large for RF, about 28% on average. For RF, SVD brings the largest improvement, while GRUAE is slightly behind the other three dimensionality reduction algorithms, about 4% below SVD. For MLP, LR, DT and SVM, however, GRUAE brings a considerable improvement: precision increases by 14%, 5.8%, 8.5% and 12.5%, respectively, with LR showing the smallest gain. On average, GRUAE improves these algorithms more than the other dimensionality reduction methods do. GRU and LSTM have the best baseline results, with precision of 92% and 94%, respectively, yet GRUAE still improves them by 5% and 4%, and LSTM remains slightly better than GRU. GAL achieves the best precision, almost 98%.

The precision of MC1.
The recall on MC1 is shown in Fig. 11. GRU and LSTM still have the best baseline results without dimensionality reduction, with recall of 90% and 91%, respectively. DT and MLP perform best among the five traditional machine learning algorithms, with a baseline recall of 55%. After dimensionality reduction, DT is again the algorithm with the largest improvement, nearly 20% on average. As on MA2, GRUAE is not the best method for DT: SVD brings the largest improvement, 24%. GRUAE nevertheless raises the recall of DT by 21%, only about 3% behind SVD. GRUAE also improves MLP significantly, by 9%, whereas the improvement from the other dimensionality reduction algorithms on MLP is negligible. For RF, LR and SVM, the overall performance of the three algorithms is similar. GRUAE improves them less than it improves MLP and DT: RF and SVM gain 4% and 3%, respectively, while LR gains less than 1%. For GRU, as on MA2, SVD brings the largest improvement, but GRUAE trails it by only 1.5%. GRUAE improves the recall of GRU and LSTM by 5% and 6%, respectively, so GAL is still the best overall.

The recall of MC1.
The F0.5-score on MC1 is shown in Fig. 12, and the overall trend is similar to that of precision. After dimensionality reduction, the results of nearly all algorithms improve. GRUAE brings the largest improvement to MLP, DT and SVM, about 9% on average, which is significantly better than the other dimensionality reduction methods. Since the precision and recall of GAL are the best among all methods, its F0.5-score is also the best.

The F0.5-score of MC1.
We also evaluate the efficiency of GRUAE, PCA, SVD and FA in our experiments. The amount of data processed per second by each algorithm is shown in Table 3. Owing to the simple gating structure of GRU, the encoder of the GRUAE model performs dimensionality reduction quickly. The throughput of GRUAE is similar on the two datasets, reaching 1008 and 1025 records per second on MA2 and MC1, respectively. Among the compared methods, SVD is the fastest on MA2 at 792 records per second, and FA is the fastest on MC1 at 689 records per second. The three compared methods average about 700 records per second, while GRUAE averages 1017, about 1.5 times as many, which demonstrates its advantage in computational performance.
Performance comparison of GRUAE and other DR approaches
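The throughput comparison can be reproduced in outline as follows. The helper `records_per_second` is a hypothetical timing routine, not the measurement procedure used in the experiments, and the toy reducer is only a stand-in; the arithmetic at the end uses the figures quoted in the text.

```python
import time


def records_per_second(reduce_fn, records, repeats=3):
    """Best-of-N throughput of reduce_fn over a list of records (hypothetical helper)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        reduce_fn(records)
        best = min(best, time.perf_counter() - start)
    return len(records) / best if best > 0 else float("inf")


# Toy stand-in for a reducer: keep the first 8 of 40 features per record.
toy_records = [[float(i)] * 40 for i in range(1000)]
toy_rate = records_per_second(lambda rows: [row[:8] for row in rows], toy_records)

# Figures quoted in the text for GRUAE on each dataset.
gruae = {"MA2": 1008, "MC1": 1025}
gruae_mean = sum(gruae.values()) / len(gruae)
baseline_mean = 700  # approximate mean quoted for PCA, SVD and FA
speedup = gruae_mean / baseline_mean
print(f"GRUAE mean: {gruae_mean}, speedup: {speedup:.2f}x")
```

With the quoted figures, the GRUAE mean is 1016.5 and the ratio is about 1.45, which rounds to the stated value of about 1.5 times.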
To further evaluate the effect of GAL, we also compare it with WEFR [45]. Because we improve the structure of the traditional AE based on GRU, the encoder of the GRUAE model is able to better learn the temporal characteristics of SSD SMART data and extract the latent code, with the aim of reducing the influence of noise features in the original high-dimensional SSD SMART data and highlighting the features more relevant to the data characteristics. The raw input data is mapped to a low-dimensional feature space and then fed to the classifiers for fault detection. As a result, the fault detection performance of the various AI algorithms improves with dimensionality reduction.
As shown in Figs. 13 and 14, the precision, recall and F0.5-score of WEFR on the MA2 dataset are 57%, 32% and 49%, respectively. Except for RF and LR, the methods based on GRUAE achieve better results. LSTM, the best performer, reaches a precision, recall and F0.5-score of 97.8%, 95% and 97%, exceeding those of WEFR by 40.8%, 63% and 48%, respectively. On the MC1 dataset, the precision, recall and F0.5-score of WEFR are 49%, 18% and 36%, respectively. All algorithms based on dimensionality reduction by GRUAE outperform WEFR. LSTM, again the best performer, reaches a precision, recall and F0.5-score of 97%, 96% and 96.8%, exceeding those of WEFR by 48%, 78% and 60.8%, respectively. Because SSD SMART data has strong temporal characteristics and WEFR is based on RF, which has certain limitations in processing temporal data, WEFR leaves considerable room for improvement in precision and recall.

Comparison with WEFR of MA2.

Comparison with WEFR of MC1.
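The reported F0.5-scores can be cross-checked against the precision and recall figures above. The sketch below uses the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), with beta = 0.5; the function name `f_beta` is only illustrative.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall quoted in the text for GAL and for WEFR [45].
reported = {
    "GAL / MA2": (0.978, 0.95),
    "GAL / MC1": (0.97, 0.96),
    "WEFR / MA2": (0.57, 0.32),
    "WEFR / MC1": (0.49, 0.18),
}
scores = {name: f_beta(p, r) for name, (p, r) in reported.items()}
for name, score in scores.items():
    print(f"{name}: F0.5 = {score:.3f}")
```

The computed values are approximately 0.972, 0.968, 0.493 and 0.364, which agree with the reported 97%, 96.8%, 49% and 36% after rounding. Because beta is below 1, the score sits closer to precision than to recall, which is why WEFR's low recall lowers its F0.5-score less than it would lower an F1-score.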
The overall experimental results show that the GRUAE model within GAL is able to learn the temporal characteristics of SSD SMART data and reduce the dimensionality of the original high-dimensional SSD SMART data. With the reduced data retaining the characteristics of the original data, the fault detection accuracy of the various AI algorithms improves. In addition, GAL performs best on a variety of metrics, which demonstrates its effectiveness.
To mitigate the negative impact of high-dimensional SSD SMART data on fault detection with traditional machine learning algorithms, we propose a novel SSD fault detection method, GAL (GRUAE + LSTM), based on dimensionality reduction with a GRU sparse autoencoder. GAL first trains the GRUAE model on SSD SMART data and then adopts the encoder of the GRUAE model as the dimensionality reduction tool for the original high-dimensional SSD SMART data, aiming to reduce the influence of noise features in the original SSD SMART data and to highlight the features more relevant to the data characteristics, so as to improve the accuracy of fault detection. Finally, LSTM is adopted for fault detection on the low-dimensional SSD SMART data. The experiments are conducted on a publicly available industrial SSD dataset. Experimental results show that the proposed approach is able to reduce the dimensionality of SSD SMART data while preserving the characteristics of the original data. In most cases, the GRUAE model reduces dimensionality more effectively than the other dimensionality reduction methods used for comparison, and its average computational performance is about 1.5 times that of the compared methods. The fault detection accuracy of the various AI algorithms improves after dimensionality reduction. For family “MA2”, compared with no dimensionality reduction, GAL improves the precision, recall and F0.5-score by about 4%, 5% and 4%, reaching 97.8%, 95% and 97%, respectively. For family “MC1”, GAL improves the precision, recall and F0.5-score by about 4%, 8% and 5%, reaching 97%, 96% and 96.8%, respectively. GAL also performs best among all compared methods. In future work, we intend to optimize the architecture of the proposed LSTM model to further improve the detection performance.
Acknowledgements
This research was funded by the National Key Research and Development Plan of China under grant No. 2016YFB1000303.
