A novel SSD fault detection method using GRU-based Sparse Auto-Encoder for dimensionality reduction

Abstract

In recent years, with the development of flash memory technology, storage systems in large data centers have come to be built on thousands or even millions of solid-state drives (SSDs). At this scale, SSD failures are inevitable. An SSD failure may cause unrecoverable data loss or service unavailability, with potentially catastrophic results. Active fault detection technologies can detect device problems in advance, so they are gaining popularity. Recent work has turned toward applying AI algorithms to SSD SMART data for fault detection. However, the SMART data of new SSDs contains a large number of features, and this high dimensionality degrades the accuracy of AI algorithms for fault detection.

To tackle this problem, we improve the structure of the traditional Auto-Encoder (AE) with Gated Recurrent Units (GRUs) and, exploiting the temporal characteristics of SSD SMART data, propose GAL, an SSD fault detection method based on dimensionality reduction with a GRU sparse autoencoder (GRUAE). The proposed method first trains the GRUAE model on SSD SMART data and then uses the encoder of the trained GRUAE as a dimensionality reduction tool for the original high-dimensional SSD SMART data, aiming to reduce the influence of noise features in the original SSD SMART data and to highlight the features more relevant to data characteristics, thereby improving the accuracy of fault detection. Finally, LSTM is adopted for fault detection on the low-dimensional SSD SMART data. Experimental results on a real SSD dataset from Alibaba show that the fault detection accuracy of various AI algorithms improves to varying degrees after dimensionality reduction with the proposed method, and that GAL performs best among all methods.

Keywords

Fault detection, dimensionality reduction, sparse auto-encoder, solid state drives, gated recurrent unit

Acronyms

SSDs: Solid State Drives

GRU: Gated Recurrent Unit

AE: Auto-Encoder

GRUAE: GRU Sparse Auto-Encoder

LSTM: Long Short-Term Memory

GAL: GRUAE + LSTM

SMART: Self-Monitoring, Analysis and Reporting Technology

ML: Machine Learning

DR: Dimensionality Reduction

PCA: Principal Component Analysis

SVD: Singular Value Decomposition

FA: Factor Analysis

PN: Penalty Term

Nomenclature

β: sparse penalty coefficient

${\hat{p}}_{j}$: average activation amount of the j-th unit

θ: parameter set

$a_{j} (x)$: activation amount of the j-th unit in the hidden layer

b: offset vector

$E_{f}$, $E_{g}$: activation functions of the encoder and decoder

$h_{k}^{d^{'}}$: d′-dimensional low-dimensional latent representation of the input data with sample number k

J: cost function

p: sparse constant

$S_{l}$: number of neurons at layer l

W: weight matrix

$x_{k}^{d}$: d-dimensional sample set with sample number k

$z_{k}^{d}$: data reconstructed by the GRU-decoder

1 Introduction

With the continuous development of storage technology, large-scale data centers usually deploy a large number of solid-state drives (SSDs) as the underlying storage devices to improve the data processing efficiency of the storage system; examples include the data servers of Alibaba Pangu [1], Amazon [2], Google [3], Facebook [4], and Microsoft Azure [5]. In such data centers, ensuring high availability and reliability is an extremely challenging undertaking for IT management, as various drive failures constantly occur in the field. Data centers usually adopt data protection mechanisms such as data replication or erasure codes [5, 6]. If lost data cannot be recovered despite these protection mechanisms, permanent data loss occurs and the system becomes unusable, which would be disastrous for the data center. SSDs have limited endurance, so failure is inevitable, and SSDs may fail with different severities and manifestations for a variety of reasons, as observed in many data centers [7–10]. Compared with traditional passive fault-tolerant techniques such as Erasure Codes (EC) and Redundant Arrays of Independent Disks (RAID), active fault detection techniques identify impending drive failures in advance, which helps safeguard the reliability and availability of large-scale storage systems and reduces the risk of data loss.

In order to monitor the health status of SSDs, manufacturers generally implement self-monitoring, analysis and reporting technology (SMART) in the firmware of devices. The SMART attributes contain the drive state information and possible defects. Internally, drives use the so-called “threshold method” based on SMART values to evaluate the failures, which means the drives would raise alarms if the values of one or more of the SMART attributes cross the corresponding predefined threshold. However, this "threshold method" only achieved a 3-10% failure detection rate (FDR) and a 0.1% false alarm rate (FAR) in practice [11]; in other words, this method is too conservative and misses opportunities to detect disk failures.
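To make the threshold rule and the two metrics concrete, the following Python sketch flags a drive when any monitored attribute crosses its threshold and then computes FDR and FAR. The attribute names, threshold values, and sample records are hypothetical and are not taken from any vendor specification. The sketch assumes normalized SMART values where lower is worse, takes FDR as the fraction of failed drives that are flagged, and takes FAR as the fraction of healthy drives that are flagged.

```python
# Hypothetical thresholds: an alarm is raised when a normalized value
# falls to or below its threshold (lower is assumed to be worse).
THRESHOLDS = {"reallocated_sectors_norm": 10, "wear_leveling_norm": 5}


def threshold_alarm(smart_record, thresholds=THRESHOLDS):
    """Return True if any monitored attribute crosses its threshold."""
    return any(smart_record.get(name, float("inf")) <= limit
               for name, limit in thresholds.items())


def fdr_far(records, labels, thresholds=THRESHOLDS):
    """Compute (FDR, FAR) for records labelled 1 (failed) or 0 (healthy)."""
    alarms = [threshold_alarm(r, thresholds) for r in records]
    failed = sum(labels)
    healthy = len(labels) - failed
    detected = sum(1 for a, y in zip(alarms, labels) if a and y == 1)
    false_alarms = sum(1 for a, y in zip(alarms, labels) if a and y == 0)
    fdr = detected / failed if failed else 0.0
    far = false_alarms / healthy if healthy else 0.0
    return fdr, far


records = [
    {"reallocated_sectors_norm": 8, "wear_leveling_norm": 90},    # failed, flagged
    {"reallocated_sectors_norm": 95, "wear_leveling_norm": 80},   # failed, missed
    {"reallocated_sectors_norm": 100, "wear_leveling_norm": 99},  # healthy, not flagged
    {"reallocated_sectors_norm": 97, "wear_leveling_norm": 4},    # healthy, false alarm
]
labels = [1, 1, 0, 0]
fdr, far = fdr_far(records, labels)
```

In this toy example one of the two failed drives never crosses a threshold, which is the kind of missed detection that a conservative threshold rule produces.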

Many prior studies [9, 12–16] have investigated fault detection for hard disk drives (HDDs) based on SMART data. For SSDs, however, whose characteristics are more complicated [7, 17–19], only a few studies [8, 21] address failure detection, and they rely on private data from Google, which limits their scope. Therefore, it is necessary to study SSD fault detection based on SMART data from production environments.

Machine learning (ML) algorithms have been widely adopted in fault detection research. The detection performance of these algorithms depends greatly on the quality of the data. As SSD manufacturers advance their technologies, more and more SSD reliability data can be collected, and this data contains a large number of features. In this context, the performance of many traditional machine learning algorithms is negatively affected by the high dimensionality of the data [22]. The main reason lies in the "curse of dimensionality" [23]: the performance of traditional classifiers decreases as the dimension of the input data increases.

Hence, processing the original data to eliminate the negative effects of the high-dimensional feature space is a prerequisite for the effectiveness of the algorithms. Dimensionality reduction (DR) can reduce the impact of noise features in the original data and highlight features more related to data characteristics [24]. Principal Component Analysis (PCA) achieves dimensionality reduction by projecting the original data onto a linear subspace spanned by the principal eigenvectors of the data covariance matrix [25]. Locality discriminant analysis [26] and marginal Fisher analysis [27] are local methods that construct low-dimensional subspaces based on different graphs of samples. However, these traditional dimensionality reduction methods are mostly designed for linear data and have limitations when dealing with high-dimensional nonlinear data.

Auto-Encoder (AE) is known as an effective tool for learning data coding in an unsupervised manner [28]. An auto-encoder has two parts: the encoder and the decoder. The encoder operator aims at learning low-dimensional representations for a set of data typically for dimensionality reduction, while the decoder attempts to generate reconstructions from the low-dimensional representations that can be close enough to the original inputs. As a data-driven method, an autoencoder can nonlinearly extract the most important features of data instead of relying on manually specified features; thus, it enables us to obtain an optimal dimensionality reduction performance.

Original SSD SMART data often contains multiple parameters, many of which are ineffective in determining SSD failures. Modeling directly on the original SSD SMART data may cause the model to learn many useless features and to require a long training time. High data dimensionality and redundant parameters normally have negative impacts on the accuracy and efficiency of the model. Building the fault detection model on low-dimensional SSD SMART data after dimensionality reduction can reduce the impact of invalid, redundant, or incorrect features, improve the efficiency and accuracy of modeling, and enable the model to detect SSD faults more accurately. Since SSD SMART data has temporal characteristics, we improve the structure of the traditional AE with Gated Recurrent Units (GRUs) and propose GAL, an SSD fault detection method based on dimensionality reduction with a GRU sparse autoencoder (GRUAE). The proposed method first trains the GRUAE model on SSD SMART data and then uses the encoder of the GRUAE model as the dimensionality reduction tool for the original high-dimensional SSD SMART data, aiming to remove redundant features, reduce the influence of noise features in the original SSD SMART data, and highlight the features more relevant to data characteristics. Finally, Long Short-Term Memory (LSTM) is adopted for fault detection on the low-dimensional SSD SMART data. Experiments on real SSD datasets evaluate the effectiveness of the method, and the results show that the proposed method detects SSD faults more accurately than methods without dimensionality reduction or feature extraction.

The rest of this paper is organized as follows. Section 2 reviews related works. Section 3 introduces the background knowledge. Section 4 describes the proposed method. The dataset description and the experimental results are presented in Section 5 and Section 6, respectively. Finally, Section 7 concludes the paper.

2 Related works

Our work is mainly related to two lines of study: dimensionality reduction and fault detection. For dimensionality reduction, PCA [25], one of the most popular unsupervised dimensionality reduction algorithms, projects the original data onto a linear subspace spanned by the principal eigenvectors of the data covariance matrix. SVD [34] computes singular values and singular vectors to generate an approximate matrix that can replace the original matrix; it ranks the singular components of the dataset by importance and discards the unimportant ones to achieve dimensionality reduction. However, SVD handles noise poorly, and many works have studied combining SVD with other methods, such as locally linear embedding [35] and Hankel matrix-based fast SVD [36]. By finding hidden representative factors among different features, FA [37] reduces the number of features by grouping features of the same nature into a factor, and it can also test hypotheses about the relationships between variables. Traditional dimensionality reduction methods such as PCA and SVD are commonly used for linear data, and they suffer from limitations such as slow processing speed and poor dimensionality reduction when applied to high-dimensional nonlinear data. Therefore, many AE-based dimensionality reduction methods have been proposed in recent years. SSDSAE [29] was proposed for fault detection in rotating machines: feature extraction and dimensionality reduction are first conducted on the vibration spectrum of the collected signal, and faults are then detected from the low-dimensional data. SSDSAE has stronger robustness and information structure extraction ability than other deep learning algorithms. Guo et al. [30] proposed S-SAE, a stacked sparse autoencoder with three hidden layers, to address the excessive dimensionality and scattering of parameters in multi-temporal quad-pol SAR images for crop classification. The classification accuracy of a CNN is effectively improved by dimensionality reduction and feature extraction based on S-SAE. Shen et al. [31] applied a stacked contractive auto-encoder to automatic feature extraction for rotating machinery and improved the robustness of the fault diagnosis method. ClEnDAE [32] combines ensemble-based classifiers with denoising AEs to reduce the dimension of the input space. The work in [33] converts the original high-dimensional seismic inversion problem into a low-dimensional one by exploiting the dimensionality reduction capability of AEs, and effectively solves the low-dimensional seismic inversion problem through global optimization.

For fault detection, many prior works [9, 12–16] have studied HDD fault prediction. Xu et al. [9] develop a cost-sensitive machine learning model based on ranking, which learns the characteristics of past failed HDDs and classifies HDDs according to their future error tendency. Zhang et al. [13] propose a tier-scrubbing scheme based on LSTM and an Adaptive Scrubbing Rate Controller, which locates high-risk areas of HDDs according to errors in local sectors in order to detect HDD faults. Xiao et al. [14] propose an HDD fault prediction model based on Online Random Forests, which evolves automatically as data arrives sequentially and adapts well to changes in the SMART data distribution over time. In recent years, some optimization methods for HDD fault detection have also been studied. Ircio et al. [38] use a sliding window mechanism to extract HDD SMART data focused on failed HDDs: the method extracts the data close to the failures and then uses the extracted data for fault detection, which improves detection accuracy when the ratio of positive to negative samples is unbalanced. Chhetri et al. [39] convert HDD SMART data from tabular format to Knowledge Graph (KG) triples and then use a Relational Graph Convolution Network for fault detection, offering a new approach to HDD fault detection. Mamoutova et al. [40] propose an ontology-based method to automatically analyze and diagnose system logs based on a semantic model. The method uses an ontology to represent fault symptoms, and then analyzes data and diagnoses HDD faults in combination with ML algorithms. Wang et al. [41] model the degradation process of HDDs with a Rao-Blackwellized particle filter and iteratively adjust the difference between the real observed value and the value estimated by the model. The method can analyze HDD SMART data in real time and evaluate the current degree of degradation of HDDs, effectively reducing the false alarm rate of HDD failure prediction. Reference [42] improves the network structure of Generative Adversarial Networks (GANs) with LSTM and expands the total amount of data by generating virtual failed-HDD data through the GAN, which effectively improves the accuracy of HDD fault detection when samples are scarce.

However, work on SSD fault detection is quite limited. Narayanan et al. [7] study the importance of different SMART features for SSD failure prediction based on random forest (RF). Mahdisoltani et al. [15] predict sector errors based on Google SSD data and analyze the possibility of improving storage system reliability through the sector error predictor. Alter et al. [20] also use data from Google to construct SSD failure prediction models with various machine learning algorithms such as RF and support vector machine (SVM), and analyze the influence of different workloads on SSD failure. Sarkar et al. [21] study SSD fault prediction based on features provided by firmware functions. Chandranil et al. [43] propose a 1-class model to address the unbalanced ratio of positive and negative samples in SSD fault detection; it is trained only on the majority class, which reduces the risk of over-fitting caused by the imbalance and improves prediction accuracy. Existing works generally analyze the original SSD SMART data and construct the prediction model directly, which leaves limitations and room for optimization in the data pre-processing stage of model construction.

The main difference between our work and other studies is as follows. Although existing works have proposed optimization methods for SSD fault detection from different perspectives, these methods generally analyze and model the original SSD SMART data directly. However, the original SSD SMART data often contains multiple parameters, many of which provide little help in determining whether an SSD is faulty, and the redundant parameters normally have negative impacts on the accuracy and efficiency of the model. Traditional dimensionality reduction methods such as PCA and SVD are commonly used for linear data, and they suffer from limitations such as slow processing speed and poor dimensionality reduction when processing high-dimensional nonlinear data. Moreover, SSD SMART data has strong temporal correlation, and high-dimensional temporal data greatly reduces the effectiveness of traditional dimensionality reduction methods. Considering these problems, the method proposed in this paper does not model the original SSD SMART data directly; instead, it builds a GRU-based GRUAE model that exploits the temporal characteristics of SSD SMART data. The method first trains the GRUAE model on the original SSD SMART data, and then uses the encoder of the GRUAE model to reduce the dimensionality of the original SSD SMART data, weakening the impact of noise features and highlighting features more relevant to data characteristics. Finally, LSTM is adopted to detect faults based on the low-dimensional SSD SMART data. In this way, the accuracy of SSD fault detection by machine learning and deep learning algorithms is effectively enhanced.

3 Preliminary knowledge

3.1 Gated recurrent unit

Gated Recurrent Unit (GRU) [44] was proposed to make each recurrent unit adaptively capture dependencies at different time scales. Similar to the LSTM unit, the GRU has gating units that modulate the flow of information inside the unit, but it has no separate memory cell. The GRU is illustrated in Fig. 1.

Fig. 1

Gated recurrent unit.

The activation $h_{t}^{j}$ of the GRU at time t is a linear interpolation between the previous activation $h_{t - 1}^{j}$ and the candidate activation ${\hat{h}}_{t}^{j}$ :

$h_{t}^{j} = (1 - z_{t}^{j}) h_{t - 1}^{j} + z_{t}^{j} {\hat{h}}_{t}^{j}$ (1) where an update gate $z_{t}^{j}$ decides how much the unit updates its activation, or content. The update gate is computed by:

$z_{t}^{j} = σ (W_{z} x_{t} + U_{z} h_{t - 1})^{j}$ (2)

This procedure of taking a linear sum between the existing state and the newly computed state is similar to the LSTM unit. However, the GRU does not contain any mechanisms to control the degree to which its state is exposed, but exposes the whole state each time.

The candidate activation ${\hat{h}}_{t}^{j}$ is computed similarly to that of the traditional recurrent unit: ${\hat{h}}_{t}^{j} = \tanh (W x_{t} + U (r_{t} ⊙ h_{t - 1}))^{j}$ (3) where $r_{t}$ is a set of reset gates and ⊙ is element-wise multiplication. When off ($r_{t}^{j}$ close to 0), the reset gate effectively makes the unit act as if it is reading the first symbol of an input sequence, allowing it to forget the previously computed state.

The reset gate $r_{t}^{j}$ is computed similarly to the update gate: $r_{t}^{j} = σ (W_{r} x_{t} + U_{r} h_{t - 1})^{j}$ (4)
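Eqs. (1)-(4) can be traced step by step in a small pure-Python sketch. It is only an illustration under stated assumptions: the input size m, hidden size n, random weights, and input sequence are hypothetical, and biases are omitted because the equations above do not include them.

```python
import math
import random


def sigmoid(v):
    """Logistic function used for both gates."""
    return 1.0 / (1.0 + math.exp(-v))


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gru_step(x_t, h_prev, p):
    """One GRU time step following Eqs. (1)-(4), without bias terms."""
    # Eq. (2): update gate z_t
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x_t), matvec(p["Uz"], h_prev))]
    # Eq. (4): reset gate r_t
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x_t), matvec(p["Ur"], h_prev))]
    # Eq. (3): candidate activation from the input and the reset-gated previous state
    gated = [r_j * h_j for r_j, h_j in zip(r, h_prev)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["W"], x_t), matvec(p["U"], gated))]
    # Eq. (1): linear interpolation between previous and candidate activation
    return [(1.0 - z_j) * h_j + z_j * c_j for z_j, h_j, c_j in zip(z, h_prev, cand)]


def random_matrix(rows, cols, rng):
    """Small random weights, for illustration only."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
m, n = 4, 3  # hypothetical input and hidden sizes
params = {
    "Wz": random_matrix(n, m, rng), "Uz": random_matrix(n, n, rng),
    "Wr": random_matrix(n, m, rng), "Ur": random_matrix(n, n, rng),
    "W": random_matrix(n, m, rng), "U": random_matrix(n, n, rng),
}
sequence = [[rng.uniform(0.0, 1.0) for _ in range(m)] for _ in range(5)]
h = [0.0] * n
for x_t in sequence:
    h = gru_step(x_t, h, params)  # the state is fully exposed at every step
```

Because the new state is a convex combination of the previous state and a tanh output, every component of h stays inside (-1, 1) when the initial state is zero.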

3.2 Auto-Encoder

AE is a learning model that aims to extract representative features from data through unsupervised learning. An AE consists of two parts, an encoder and a decoder, as illustrated in Fig. 2. The encoder extracts a latent code from the input data, mapping the input to a low-dimensional feature space. The decoder tries to reconstruct data from the latent code that is as close as possible to the original input.

Fig. 2

Structure of auto-encoder.

The encoder and decoder are formulated as follows: $h = σ_{1} (WX + b)$ (5)

$Y = σ_{2} (W^{T} h + d)$ (6) where $X = (x_{1}, \ldots, x_{n})$ denotes the input data, n is the number of samples, and $σ_{1}$ and $σ_{2}$ are the activation functions. The feature vector is denoted as h and the reconstruction as Y. The objective of the AE is to minimize the reconstruction error:

$L_{AE} = \frac{1}{n} \sum_{i = 1}^{n} ‖ y_{i} - x_{i} ‖^{2}$ (7)
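As a minimal sketch of Eqs. (5)-(7), the following Python code evaluates a tied-weight auto-encoder on two toy samples. The choice of a sigmoid for $σ_{1}$ and the identity for $σ_{2}$, as well as all weights and samples, are assumptions made for illustration; the text does not fix them.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def encode(W, b, x):
    """Eq. (5): h = sigma_1(W x + b), assuming a sigmoid for sigma_1."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]


def decode(W, d, h):
    """Eq. (6): y = sigma_2(W^T h + d), assuming the identity for sigma_2."""
    return [sum(W[j][i] * h[j] for j in range(len(W))) + d[i]
            for i in range(len(d))]


def reconstruction_loss(W, b, d, X):
    """Eq. (7): mean squared reconstruction error over the n samples."""
    total = 0.0
    for x in X:
        y = decode(W, d, encode(W, b, x))
        total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    return total / len(X)


# Hypothetical 3-dimensional samples compressed to a 2-dimensional code.
W = [[0.2, -0.1, 0.4], [0.3, 0.5, -0.2]]
b = [0.0, 0.1]
d = [0.0, 0.0, 0.0]
X = [[0.1, 0.5, 0.9], [0.4, 0.4, 0.2]]
code = encode(W, b, X[0])
loss = reconstruction_loss(W, b, d, X)
```

Training would adjust W, b, and d to reduce `loss`; the sketch only evaluates the objective for fixed parameters.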

4 Proposed method

In real environments, SSD faults tend to develop gradually over time, so SSD reliability characteristics have a strong temporal correlation. Compared with traditional data generation methods, GRU is better at capturing the temporal characteristics of samples and extracting time-related features. Therefore, we adopt GRU as the encoder of the AE model to fit the probability distribution of SSD samples, so that the encoder can better learn the temporal characteristics of SSD SMART data, extract the latent code, and map the original input data into a low-dimensional feature space. GRU is also adopted as the decoder to better capture the temporal characteristics of the data and reconstruct the data from the low-dimensional space back to its original state. The original SSD SMART data often contains multiple parameters, many of which provide little help in determining whether an SSD is faulty, and high dimensionality and redundant parameters normally have negative impacts on the accuracy and efficiency of the model. The proposed method first trains the GRUAE model on the original SSD SMART data, and then uses the encoder of the GRUAE model to reduce the dimensionality of the original SSD SMART data, aiming to remove redundant features, weaken the impact of noise features, and highlight the features more relevant to data characteristics. Building the fault detection model on low-dimensional SSD SMART data after dimensionality reduction can reduce the impact of invalid, redundant, or incorrect data on modeling, improve the efficiency and accuracy of modeling, and enable the model to detect SSD faults more accurately. As shown in Fig. 3, our method consists of three stages: training of GRUAE, dimensionality reduction, and fault detection. The main steps are as follows:

Training of GRUAE. The GRUAE model for dimensionality reduction is first trained on the original input data. The GRU-encoder encodes the original input data and maps it to the low-dimensional feature space of the hidden layer, so that dimensionality reduction preserves the characteristics of the original input data. To ensure that the encoded data in the low-dimensional feature space retains the characteristics of the original input data and the relationships among its features, the output of the hidden layer is decoded by the GRU-decoder, which reconstructs the data from the low-dimensional state to the original high-dimensional state. In each training epoch, the mean square error (MSE) and the Kullback-Leibler (KL) divergence are used as the cost measurement: the outputs of the hidden layer neurons are suppressed by sparse penalty terms to make the network sparse, and the error between the reconstructed output and the original input is calculated. In this way, the GRUAE model can still learn the important features of the input data even if the hidden layer contains many neurons.

Dimensionality Reduction. After the training of the GRUAE model is completed, the GRU-encoder is adopted as the dimensionality reduction tool. After pre-processing such as regularization and normalization, the original SSD SMART data is mapped to a low-dimensional feature space by the GRU-encoder to reduce the influence of noise features in the original SMART data and highlight features more relevant to data characteristics. The low-dimensional SSD SMART data forms a new dataset for the experiments.

Fault Detection. SSD SMART data has strong temporal characteristics, and LSTM is good at capturing the temporal characteristics of samples and extracting time-related features. We therefore adopt LSTM to detect faults and output diagnosis results based on the low-dimensional dataset. As described in the subsequent experimental sections, LSTM performs best.

Fig. 3

Architecture of proposed approach.

4.1 GRUAE Model

The typical structure of an AE is an encoder-decoder structure. The proposed GRUAE model contains two independent GRU layers, as shown in Fig. 4. For a d-dimensional sample set $x \in \{x_{k}^{d}\}$ with sample number k, the GRU-encoder extracts the latent representation of the input data and maps the input to a low-dimensional d′-dimensional vector $h \in \{h_{k}^{d^{'}}\}$; the GRU-decoder reconstructs data $z \in \{z_{k}^{d}\}$ that is as close as possible to the d-dimensional original input from the low-dimensional feature space $h \in \{h_{k}^{d^{'}}\}$. After training, the GRU-encoder is used to generate low-dimensional datasets for subsequent fault detection.

Regarding the structure of the model, the learning ability of an AE is affected if the hidden layer contains more neurons than the input layer, so it is necessary to impose constraints on the hidden layer neurons. We add a sparse penalty term to the hidden layer to suppress the output of hidden layer neurons, so that the network becomes sparse. In this way, the AE can still learn important features of the input data even if the hidden layer contains many neurons.

Fig. 4

Architecture of GRUAE model.

Specifically, the expression of encoder is as follows:

$h = f_{θ} (x) = E_{f} (Wx + b)$ (8) where the parameter set of encoder is expressed as θ = {W, b}, W is the weight matrix, b is the offset vector, and E_f is the encoder activation function.

Similarly, the reconstructed output is expressed as:

$z = g_{θ^{'}} (h) = E_{g} (W^{'} h + b^{'})$ (9) where the parameter set of decoder is expressed as θ′ = {W′, b′}, W′ is the weight matrix, b′ is the offset vector, E_g is the decoder activation function. Thus, the cost function is defined as:

$J = min \frac{1}{k} \sum_{i = 1}^{k} {∥ z_{i} - x_{i} ∥}^{2}$ (10)

Assuming that $a_{j} (x)$ represents the activation amount of the j-th unit in the hidden layer, the average activation amount of the j-th unit is:

${\hat{p}}_{j} = \frac{1}{k} \sum_{i = 1}^{k} a_{j} (x^{(i)})$ (11)

where k is the number of samples and $x^{(i)}$ is the i-th sample. To ensure that most neurons are in the "inactive" state, ${\hat{p}}_{j}$ is constrained to stay close to p, a constant close to 0 called the sparse constant, and a sparse penalty is added to the cost function to penalize any ${\hat{p}}_{j}$ that deviates from p. We choose the KL divergence as the expression of the penalty term (PN):

$PN = \sum_{j = 1}^{s_{2}} KL (p ∥ {\hat{p}}_{j})$ (12) where $s_{2}$ represents the number of neurons in the hidden layer and $KL (p ∥ {\hat{p}}_{j})$ is the KL divergence, given by:

$KL (p ∥ {\hat{p}}_{j}) = p ln \frac{p}{{\hat{p}}_{j}} + (1 - p) ln \frac{1 - p}{1 - {\hat{p}}_{j}}$ (13)
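The penalty in Eqs. (11)-(13) can be computed directly from a batch of hidden activations, as in the Python sketch below. The sparse constant p = 0.05, the clipping constant `eps`, and the toy activations are assumptions for illustration; the text only states that p is close to 0.

```python
import math


def average_activation(activations):
    """Eq. (11): mean activation of each hidden unit over the k samples.

    `activations` is a list of k samples, each a list of hidden-unit
    activations assumed to lie in (0, 1), e.g. sigmoid outputs.
    """
    k = len(activations)
    return [sum(column) / k for column in zip(*activations)]


def kl_bernoulli(p, p_hat, eps=1e-12):
    """Eq. (13): KL divergence between Bernoulli(p) and Bernoulli(p_hat)."""
    p_hat = min(max(p_hat, eps), 1.0 - eps)  # guard against log(0)
    return p * math.log(p / p_hat) + (1.0 - p) * math.log((1.0 - p) / (1.0 - p_hat))


def sparse_penalty(activations, p=0.05):
    """Eq. (12): sum of the KL terms over all hidden units."""
    return sum(kl_bernoulli(p, p_hat) for p_hat in average_activation(activations))


# Two samples, two hidden units: the first unit is far too active,
# the second one already matches the sparse constant.
acts = [[0.9, 0.05], [0.7, 0.05]]
penalty = sparse_penalty(acts, p=0.05)
```

Only the first unit contributes to `penalty` here, since the KL term vanishes when the average activation equals p and grows as the two diverge.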

The objective function of the GRUAE model with weight decay is defined as follows:

$J (W, b) = \frac{1}{k} \sum_{i = 1}^{k} {∥ x^{(i)} - \hat{x}^{(i)} ∥}^{2} + \frac{λ}{2} \sum_{l = 1}^{n_{l} - 1} \sum_{i = 1}^{S_{l}} \sum_{j = 1}^{S_{l + 1}} (W_{ij}^{(l)})^{2}$ (14) where $\hat{x}^{(i)}$ is the reconstruction of the i-th sample $x^{(i)}$, λ is the weight decay coefficient, $n_{l}$ is the number of network layers, $S_{l}$ is the number of neurons at layer l, and $S_{l + 1}$ is the number of neurons at layer l + 1. Therefore, the cost function containing the sparse penalty expression is expressed as:

$J_{sparse} (W, b) = J (W, b) + β PN$ (15) where β is the sparse penalty coefficient which is set as 3 in our experiment.

Algorithm-1 summarizes the training process of the GRUAE model. The network parameters are initialized before the iterations start. For the original high-dimensional SSD SMART data, in each epoch the forward propagation process first calculates the hidden layer output and then calculates the output layer results from the hidden layer results. In back propagation, the network parameters are updated by Adam to minimize the loss function with sparse penalty terms, and the iteration is repeated until convergence. When setting the parameters of the algorithm, if the amount of data is large enough, the number of epochs can be set relatively small, together with a moderate batch size, to avoid unnecessary iterations and reduce training time. When initializing the network, to keep the output of the activation function from tending to 0, the selected initialization method should make the input and output follow the same distribution as far as possible. The adjustment parameter of the regularization term, the learning rate, and other parameters should not be too large.

Algorithm 1 GRU-based Sparse Auto-Encoder.
Input: Data samples $X_{L}$; Epoch; Batch
Output: The trained network Model (•) = $θ^{*} (W, b)$
Initialization: The weights W and the bias b of the network
1: for e = 1 to EpochNumber do
2:   for bn = 1 to BatchNumber do
3:     Calculate the reconstruction constraint matrix $J_{L} (t_{i}, y_{i}) = \sum_{i = 1}^{l} (1 - t_{i}) * y_{i} + l (\log (1 + \exp (- |y_{i}|)) + \max (- y_{i}, 0)) + PN$
4:     Calculate the local constraint matrix $SL = \frac{1}{2} \sum_{i = 1}^{n} \sum_{j = 1}^{n} S_{i, j}^{L} (f_{i} - f_{j})^{2}$, where f means the low-dimensional embedded sample.
5:     for k = 1 to K do
6:       Forward propagation process, calculate: $h^{(k + 1)} = w^{(k)} x + b^{(k)}$, $x^{(k + 1)} = f (h^{(k + 1)})$, $J_{L}^{(k)}$
7:       Calculate the target constraint matrix $J_{ss}^{(k)} (θ) = σ J_{L}^{(k)} (θ)$, where σ is the adjustment parameter that controls the contribution of the corresponding regularization term.
8:       Backward propagation process, optimize the target: $θ^{*} = \arg \min σ J_{L}^{(k)} (θ)$; $\frac{\partial J_{SS}}{\partial θ^{*}} = 0$, $W_{ij}^{k} = W_{ij}^{k} - l \frac{\partial J_{SS} (W, b)}{\partial W_{ij}^{k}}$, $b_{i}^{k} = b_{i}^{k} - l \frac{\partial J_{SS} (W, b)}{\partial b_{i}^{k}}$, where l is the learning rate.
9:       Update network parameters (loss, W, b)
10:      Any gradient-based learning rule can be used to update the gradient of the network. Adam is adopted in the experiment.
11: return result

The more complex the model is, the larger its parameter values tend to be, as it tries to fit all the sample points. If the data contains abnormal samples, the model output fluctuates greatly within small ranges, leading to over-fitting to the training data. Hence, we adopt weight decay (L2 regularization) in the design of Algorithm-1, which simplifies the model, avoids over-fitting, and keeps the parameters as small as possible to improve the robustness of the algorithm.

GRU contains two gate structures: the update gate and the reset gate. The update gate corresponds to a weight matrix and a bias, and the reset gate corresponds to two weight matrices and a bias. Therefore, GRU needs to maintain three sets of parameters, and the total number of parameters is 3 * ((m + n) * n + n), where m is the size of the input layer and n is the size of the hidden layer. The GRUAE model in this paper consists of an encoder and a decoder, each consisting of a GRU. Therefore, the total number of parameters of the GRUAE model is 6 * ((m + n) * n + n), which represents the complexity of the model and determines the convergence speed and computation speed of the algorithm. Training the GRUAE model takes six control parameters in total: the number of epochs, the number of batches, the adjustment parameter of the regularization term, the learning rate, the dropout rate, and the sparse penalty coefficient. For ease of implementation, the proposed algorithm is implemented with PyTorch, a widely used open-source framework.
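The parameter-count formulas above can be checked with a few lines of Python. The sizes m = 20 and n = 8 are hypothetical and chosen only to give a worked number; the functions simply encode the formulas as stated in the text.

```python
def gru_param_count(m, n):
    """Parameters of one GRU per the formula 3 * ((m + n) * n + n).

    m is the input size and n is the hidden size. Each of the three
    parameter sets holds an (m + n) x n block of weights and n biases.
    """
    return 3 * ((m + n) * n + n)


def gruae_param_count(m, n):
    """Encoder plus decoder, i.e. 6 * ((m + n) * n + n)."""
    return 2 * gru_param_count(m, n)


# Hypothetical sizes: 20 SMART features compressed to 8 hidden units.
single = gru_param_count(20, 8)   # 3 * (28 * 8 + 8) = 696
total = gruae_param_count(20, 8)  # 2 * 696 = 1392
```

Counts reported by a framework may differ from this formula. For example, PyTorch's `nn.GRU` stores two bias vectors per parameter set, so its count for the same m and n is slightly larger.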

4.2 Dimensionality Reduction

After the GRUAE model is trained, the encoder is used to extract the latent features of the input data. For a d-dimensional sample set $x \in \{x_{k}^{d}\}$ with sample number k, the input is mapped into a low-dimensional d′-dimensional sample set $h \in \{h_{k}^{d^{'}}\}$, and the calculation follows Eq. (8). Algorithm-2 summarizes the process of dimensionality reduction. When initializing the network, to keep the output of the activation function from tending to 0, the selected initialization method should make the input and output follow the same distribution as far as possible. The GRU-encoder used in Algorithm-2 comes from the GRUAE model and is part of the output of Algorithm-1. Algorithm-2 only invokes the GRU-encoder to perform dimensionality reduction, so it has no external control parameters.

Algorithm 2 Dimensionality Reduction.
Input: The original high dimensionality data samples, d-dimensional $x \in \{x_{k}^{d}\}$; GRU-encoder
Output: The low dimensionality data samples, d′-dimensional $h \in \{h_{k}^{d'}\}$
Initialization: The weights W and the bias b of network
1: Load the model state of GRU-encoder;
2: Calculate $h^{(k+1)} = W^{(k)} x^{(k)} + b^{(k)}$, $x^{(k+1)} = f(h^{(k+1)})$, $J_{L}^{k}$;
3: return result
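The encoding step of Algorithm 2 can be sketched as a plain GRU forward pass that returns the final hidden state as the low-dimensional code. This is only an illustration under stated assumptions: it uses the standard GRU gate equations (formula 8 of the paper is not reproduced here, and sign conventions for the update gate vary), the helper `random_params` is a hypothetical stand-in for loading the trained encoder state, and the sizes (40 features, code size 10, time step 3) are example values.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_encode(window, p):
    """Encode one window of shape (time_steps, d) into a d'-dimensional code."""
    h = np.zeros(p["Uz"].shape[0])
    for x_t in window:
        z = sigmoid(x_t @ p["Wz"] + h @ p["Uz"] + p["bz"])  # update gate
        r = sigmoid(x_t @ p["Wr"] + h @ p["Ur"] + p["br"])  # reset gate
        h_cand = np.tanh(x_t @ p["Wh"] + (r * h) @ p["Uh"] + p["bh"])
        h = (1.0 - z) * h + z * h_cand  # blend old state and candidate
    return h


def random_params(d, d_low, rng):
    """Hypothetical stand-in for step 1 (loading the trained encoder weights)."""
    p = {}
    for gate in "zrh":
        p["W" + gate] = rng.normal(scale=0.1, size=(d, d_low))
        p["U" + gate] = rng.normal(scale=0.1, size=(d_low, d_low))
        p["b" + gate] = np.zeros(d_low)
    return p


rng = np.random.default_rng(0)
params = random_params(d=40, d_low=10, rng=rng)
windows = rng.random(size=(5, 3, 40))  # 5 samples, time step 3, 40 scaled features
codes = np.stack([gru_encode(w, params) for w in windows])
print(codes.shape)  # (5, 10)
```

In a PyTorch implementation the same role would be played by the encoder module of the trained autoencoder, run in evaluation mode.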

4.3 LSTM Model

To capture long-term temporal dependence, the LSTM defines and maintains a cell state to regulate information flow. The cell state $C_{t-1}$ interacts with the intermediate output $h_{t-1}$ and the subsequent input $x_{t}$ to determine which elements of the internal state vector should be updated, maintained, or removed, based on the output of the previous time step and the input of the current time step. The formulas of the LSTM network are as follows:

$i_{t} = σ (x_{t} U^{i} + h_{t - 1} W^{i})$ (16)

$f_{t} = σ (x_{t} U^{f} + h_{t - 1} W^{f})$ (17)

$o_{t} = σ (x_{t} U^{o} + h_{t - 1} W^{o})$ (18)

${\hat{C}}_{t} = \tanh (x_{t} U^{g} + h_{t - 1} W^{g})$ (19)

$C_{t} = f_{t} * C_{t - 1} + i_{t} * {\hat{C}}_{t}$ (20)

$h_{t} = \tanh (C_{t}) * o_{t}$ (21)

where the operator * represents elementwise multiplication and σ represents the sigmoid activation function; i, f, and o denote the input, forget and output gates, respectively; $W^{i}$, $W^{f}$, $W^{o}$, and $W^{g}$ are the weight matrices applied to the previous output $h_{t-1}$, which are learned during training; $U^{i}$, $U^{f}$, $U^{o}$, and $U^{g}$ are the coefficient matrices applied to the current input $x_{t}$; ${\hat{C}}_{t}$ is a "candidate" hidden state that is calculated from the current input and the previous hidden state; $C_{t}$ is the internal memory of the unit; $h_{t}$ represents the final output of the memory unit. Through its gates, an LSTM memory unit can capture complicated correlations within time series over both the short and the long term, which is a remarkable improvement over other RNNs.
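Equations (16) to (21) translate almost line by line into code. The sketch below is a NumPy illustration of a single time step, written without bias terms to match the formulas as given; the dictionary layout of `U` and `W`, the sizes and the random weights are assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, U, W):
    """One LSTM step following Eqs. (16)-(21).

    U[k] has shape (input_dim, hidden_dim) and W[k] has shape
    (hidden_dim, hidden_dim) for k in {"i", "f", "o", "g"}.
    """
    i_t = sigmoid(x_t @ U["i"] + h_prev @ W["i"])     # input gate, Eq. (16)
    f_t = sigmoid(x_t @ U["f"] + h_prev @ W["f"])     # forget gate, Eq. (17)
    o_t = sigmoid(x_t @ U["o"] + h_prev @ W["o"])     # output gate, Eq. (18)
    c_hat = np.tanh(x_t @ U["g"] + h_prev @ W["g"])   # candidate state, Eq. (19)
    c_t = f_t * c_prev + i_t * c_hat                  # cell state, Eq. (20)
    h_t = np.tanh(c_t) * o_t                          # output, Eq. (21)
    return h_t, c_t


# Example with hypothetical sizes: 10 input features, 4 hidden cells.
rng = np.random.default_rng(0)
U = {k: rng.normal(scale=0.1, size=(10, 4)) for k in "ifog"}
W = {k: rng.normal(scale=0.1, size=(4, 4)) for k in "ifog"}
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.random(size=(3, 10)):  # a sequence of 3 time steps
    h, c = lstm_step(x_t, h, c, U, W)
print(h.shape, c.shape)  # (4,) (4,)
```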

5 Dataset description and preprocessing

5.1 Datasets

To evaluate the performance and limitations of the proposed method, we use the SSD dataset from Alibaba [45]. The dataset contains SMART data for the SSDs together with basic information such as timestamps and device serial numbers. Note that the SMART attributes provided by different manufacturers may differ, and some attributes have different meanings depending on the type of device. Therefore, for the experiments we extract the 2019 SMART data of SSD models MA2 (MLC) and MC1 (3D-TLC), which use different flash types and have relatively complete data records. The SSDs in the original dataset fall into two classes: Healthy and Failed. However, because the original dataset spans a long time, some SSD SMART data records are incomplete and some related parameters are missing. Therefore, we remove the SSD SMART data with incomplete records when processing the original dataset. Basic information about the dataset is shown in Table 1. The sampling interval is 24 h.

Table 1
Overview of the dataset

	MA2	MC1
Flash Tech.	MLC	3D-TLC
Capacity	800G	1920G
Duration	12 months	12 months
Healthy Disks amount	84076	163338
Failed Disks amount	452	8546
Positive Items	30687985	59618418
Negative Items	45692	1855058
Size after processed	13G	27G

5.2 Feature Information and pre-processing

The SMART data in the dataset contains raw values, denoted as Raw, and normalized values, denoted as Norm. The Norm values are calculated from the Raw values according to the manufacturers' non-public custom formulas. Because normalization may lose precision and the corresponding Raw values may be highly sensitive to changes in disk health, both Raw and Norm are used in the experiments, as shown in Table 2.

Table 2
Feature Information

Attribute Name	MA2	MC1
Read Error Rate		Norm&Raw
Reallocated Sectors Count	Norm&Raw	Norm&Raw
Power On Hours	Norm&Raw	Norm&Raw
Power Cycle Count	Norm&Raw	Norm&Raw
Available Reserved Space	Norm&Raw	Norm&Raw
SSD Program Fail Count	Norm&Raw	Norm&Raw
SSD Erase Fail	Norm&Raw	Norm&Raw
SSD Wear Leveling Count		Norm&Raw
Unexpected Power Loss Count	Norm&Raw	Norm&Raw
Unused Reserved Block Count Total		Norm&Raw
SATA Downshift Error Count	Norm&Raw	Norm&Raw
End-to-End error	Norm&Raw	Norm&Raw
Reported Uncorrectable Errors	Norm&Raw	Norm&Raw
Command Timeout		Norm&Raw
Airflow Temperature	Norm&Raw
Unsafe Shutdown Count	Norm&Raw
Temperature Celsius	Norm&Raw	Norm&Raw
Hardware ECC Recovered		Norm&Raw
Reallocation Event Count		Norm&Raw
Current Pending Sector Count	Norm&Raw	Norm&Raw
Uncorrectable Sector Count		Norm&Raw
UltraDMA CRC Error Count	Norm&Raw
Media Wearout Indicator	Norm&Raw
Total LBAs Written	Norm&Raw
Total LBAs Read	Norm&Raw
Power Loss Protection Failure	Norm&Raw

The range of values spanned by different features varies widely. To avoid bias towards features with large values, we apply feature scaling for data normalization according to the following formula:

$x_{n} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$ (22)

where x is the original value of a feature, and $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of this feature, respectively.
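A minimal sketch of the min-max scaling in formula (22), applied to one feature column at a time. The function name and the sample values are hypothetical; mapping a constant feature (where the maximum equals the minimum) to 0.0 is an assumption added here to avoid division by zero, since the paper does not say how that case is handled.

```python
def min_max_scale(values):
    """Scale one feature column to [0, 1] following formula (22)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw values of one SMART attribute across five records.
power_on_hours = [1200, 3400, 5600, 7800, 10000]
print(min_max_scale(power_on_hours))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

In practice the minimum and maximum would be computed on the training set and reused for the test set, so that both are scaled consistently.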

6 Experiment results

6.1 Experiment conditions and methods

We use Precision, Recall and F0.5-score to measure the effect of different machine learning algorithms on fault detection. Precision is the proportion of true positives (TPs) among all predicted failures. Recall is the proportion of TPs among all actually failed disks. In practice, once an SSD is detected as failed (whether correctly or falsely), administrators decommission the SSD for further inspection. The cost of replacing a healthy SSD that is falsely detected as failed is higher than that of missing a failed SSD that is falsely detected as healthy [45]. Thus, we use the F0.5-score instead of the F1-score, so that precision is weighted twice as heavily as recall. These metrics are defined as:

$Precision = \frac{TP}{TP + FP}$ (23)

$Recall = \frac{TP}{TP + FN}$ (24)

$F_{0.5}\text{-score} = \frac{(1 + 0.5^{2}) \times Precision \times Recall}{0.5^{2} \times Precision + Recall}$ (25)

where TP is "true positive", FP is "false positive" and FN is "false negative".
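The three metrics in Eqs. (23) to (25) can be computed from confusion-matrix counts as follows. This is a small sketch with hypothetical counts; returning 0.0 when a denominator is zero is a convention chosen here, not something specified in the paper.

```python
def precision(tp: int, fp: int) -> float:
    """Eq. (23): share of predicted failures that are real failures."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Eq. (24): share of real failures that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_beta(p: float, r: float, beta: float = 0.5) -> float:
    """Eq. (25) for beta = 0.5; beta < 1 favours precision over recall."""
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


# Hypothetical counts: 90 true positives, 10 false positives, 60 false negatives.
p, r = precision(90, 10), recall(90, 60)
print(round(p, 3), round(r, 3), round(f_beta(p, r), 3))
```

With a precision of 0.9 and a recall of 0.6, the F0.5-score lies closer to the precision than the F1-score does, which is the behaviour the metric is chosen for.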

To evaluate the effectiveness of the proposed method, the dataset is randomly divided into a training set and a test set with a ratio of 7:3. The encoder of the GRUAE model is a GRU network with one hidden layer. The number of cells in the input layer equals the dimension of the SSD SMART data. The hidden layer has 10 cells with a time step of 3, the output layer has 10 cells, and the dropout is set to 0.2. The decoder is also a GRU network with one hidden layer, with 10 cells in the input layer and 10 cells in the hidden layer. The output layer is a linear mapping with tanh as the activation function. The learning rate is 0.001.

When setting the parameters of Algorithm-1, the number of epochs for training the GRUAE model is set to 2, because the dataset used in the experiment contains tens of millions of records, which is a relatively large amount. The larger the batch size used during training, the smaller the total number of update steps and the corresponding total update amount. Because the optimal solution is usually some distance from the initialization point, a small total update amount confines the optimization algorithm to a limited region of the parameter space near the initialization point, and the solutions found may be poor. Therefore, the batch size is set to the commonly used value of 128 [46]. Because the activation function in the output layer is tanh, Xavier initialization is adopted for network initialization [47]. To make the model fit the training data better while keeping the parameter values small, we set σ, the adjustment parameter of the regularization term, to 0.1 [48]. Algorithm-2 uses the GRUAE-encoder to perform dimensionality reduction on the original high-dimensional data, so Xavier initialization is also adopted when initializing the network.
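For reference, the uniform variant of Xavier initialization draws each weight from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)), which keeps the variance of activations roughly constant across layers. The sketch below is a plain-Python illustration; the function name, the use of the uniform rather than the normal variant, and the layer sizes are assumptions, since the paper only states that Xavier initialization is used.

```python
import math
import random


def xavier_uniform(fan_in: int, fan_out: int, rng: random.Random):
    """Return a fan_in x fan_out weight matrix drawn from U(-a, a)."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)]
            for _ in range(fan_in)]


# Hypothetical layer: 40 inputs mapped to 10 hidden cells.
rng = random.Random(0)
weights = xavier_uniform(40, 10, rng)
print(len(weights), len(weights[0]))  # 40 10
```

PyTorch provides the same scheme as `torch.nn.init.xavier_uniform_`, which would be the natural choice in an implementation built on that framework.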

The LSTM consists of an input layer, two LSTM hidden layers and an output layer. The output layer is a linear mapping with Sigmoid as the activation function. The number of cells in the input layer is the dimension of the SSD SMART data after dimensionality reduction. The number of cells in the hidden layer is 100, the number of cells in the output layer is 1, the dropout is set to 0.2, and the learning rate is 0.001.

The number of cells in the output layer of the encoder is 10, which is the dimension of the data after dimensionality reduction. We determine this value with PCA by setting the target explained-variance ratio to 99%: the number of principal components needed to explain 99% of the variance of the original data is calculated automatically and used as the output dimension.
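The selection rule described above, taking the smallest number of principal components whose cumulative explained variance reaches 99%, can be sketched with NumPy as follows. The function name and the synthetic data are illustrative assumptions; the paper does not show how the computation was carried out.

```python
import numpy as np


def components_for_variance(X, threshold=0.99):
    """Smallest k such that the top-k principal components explain
    at least `threshold` of the total variance of X (samples x features)."""
    cov = np.cov(X - X.mean(axis=0), rowvar=False)
    eigvals = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)  # descending
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(ratio, threshold)) + 1
    return min(k, X.shape[1])


# Synthetic example: 12 observed features driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 12)) + 1e-3 * rng.normal(size=(500, 12))
print(components_for_variance(X, 0.99))
```

scikit-learn's `PCA` accepts a fractional `n_components` (for example 0.99 with the full SVD solver) for the same purpose; whether the authors used that option is not stated.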

The optimizer used in training is Adam. Fig. 5 shows how the training loss of the model evolves on the MA2 and MC1 SSD datasets. After a period of training, the loss decreases to nearly 0, indicating that the models have converged.

Fig. 5

The loss of GRUAE model for (a) MA2, (b) MC1 during training, where x-axis is the number of iterations.

We conduct the experiment on two SSD datasets, MA2 and MC1, using the same two-stage process: dimensionality reduction followed by fault detection. In the dimensionality reduction stage, a GRUAE model is trained on the original high-dimensional SSD SMART data of each dataset. After training, the GRU-encoder is extracted from the GRUAE model as the tool for dimensionality reduction. For comparison, the dimensionality of the original data is reduced with four methods: GRUAE-encoder, Principal Component Analysis (PCA), Singular Value Decomposition (SVD) and Factor Analysis (FA). In the fault detection stage, seven AI algorithms are compared on the reduced data: Multilayer Perceptron (MLP), Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), GRU and LSTM. We also compare with the approach in reference [45]. The schematic diagram of the experimental procedure is shown in Fig. 6.

Fig. 6

Schematic diagram of the experiment process.

The hardware platform used in the experiment is a server with two Intel(R) Xeon(R) E5-2620 v4 CPUs @ 2.10 GHz, 94 GB of RAM and 4 TB of storage. On the software side, the operating system is Ubuntu 18.04.5 LTS, the kernel version is Linux 4.15.0-171-generic, and the compiler is gcc 9.3.0. We implement the proposed method based on PyTorch 1.7.1 and scikit-learn 0.24.1.

6.2 MA2 Results

The precision on MA2 is shown in Fig. 7. Without dimensionality reduction, GRU and LSTM, the two deep learning algorithms suited to temporal data, are significantly better than the other five traditional machine learning algorithms. After dimensionality reduction, precision improves to some degree for almost all algorithms. MLP and DT show the largest gains, close to 20% on average, and SVM gains 13%. The other methods gain less, mostly within 10%. For MLP, RF, SVM and LSTM, GRUAE brings a larger improvement than the other dimensionality reduction methods. SVM gains less than 3% from PCA, SVD and FA, but 13% from GRUAE. For DT and GRU, SVD improves DT the most and FA improves GRU the most; GRUAE is not the best method in these two cases, but the gap is within 3%. For LR, none of the dimensionality reduction methods has an obvious effect, and the average improvement is less than 3%. Although the baseline results of GRU and LSTM are already relatively good, dimensionality reduction still yields an improvement of about 3-4%. Because the network structure of LSTM is more complex than that of GRU, LSTM performs slightly better. The precision of the proposed GAL (GRUAE+LSTM) exceeds 97%, the best among all methods.

Fig. 7

The precision of MA2

The recall on MA2 is shown in Fig. 8. As with precision, GRU and LSTM are significantly better than the other five traditional machine learning algorithms when there is no dimensionality reduction. After dimensionality reduction, the recall of some algorithms improves greatly; DT, for example, gains nearly 40% on average. Other algorithms, such as LR and SVM, improve relatively little, by about 3% on average. FA is the best dimensionality reduction method for DT. GRUAE improves DT slightly less than the traditional dimensionality reduction methods do, but the results are close: the gap between GRUAE and FA is about 4%, and the gap to the other methods is even smaller. For GRU, SVD gives the best improvement; as with DT, GRUAE does not give the largest gain, but the difference is less than 2%. For MLP, LR and SVM, GRUAE gives a slightly better improvement than the other dimensionality reduction methods. GRUAE improves the recall of MLP and RF by about 4% and 7%, respectively, and improves both GRU and LSTM by about 5%.

Fig. 8

The recall of MA2.

The F0.5-score on MA2 is shown in Fig. 9, and the overall trend is similar to that of the precision on MA2. After dimensionality reduction, the results of nearly all algorithms improve. GRUAE brings the largest improvement to MLP, DT and SVM, averaging about 15%. The improvements for GRU and LSTM are very similar, at about 4%, and both algorithms remain better than the others. Since the precision and the recall of GAL are the best among all methods, its F0.5-score is also the best.

Fig. 9

The F0.5-score of MA2

6.3 MC1 Results

The precision on MC1 is shown in Fig. 10. As on MA2, GRU and LSTM are significantly better than the other five traditional machine learning algorithms. Compared with the case without dimensionality reduction, nearly all algorithms improve after dimensionality reduction. The improvement of RF is particularly obvious, with an average increase of about 28%. For RF, SVD gives the largest improvement, while GRUAE gives a slightly smaller improvement than the other three dimensionality reduction algorithms, about 4% below SVD. For MLP, LR, DT and SVM, however, GRUAE brings considerable improvement: precision increases by 14%, 5.8%, 8.5% and 12.5%, respectively, with LR showing the smallest gain. This shows that the average improvement of GRUAE is better than that of the other dimensionality reduction methods. Although GRU and LSTM have the best baseline results, with precision of 92% and 94%, respectively, their precision still improves by 5% and 4% after dimensionality reduction by GRUAE, and LSTM remains slightly better than GRU. The precision of GAL is the best, at almost 98%.

Fig. 10

The precision of MC1.

The recall on MC1 is shown in Fig. 11. GRU and LSTM still have the best baseline results when there is no dimensionality reduction, with recall of 90% and 91%, respectively. DT and MLP perform the best among the five traditional machine learning algorithms, with an original recall of 55%. After dimensionality reduction, DT is still the algorithm with the largest improvement, nearly 20% on average. As on MA2, GRUAE is not the best method for DT: SVD gives the largest improvement, 24%. However, GRUAE also brings a 21% increase for DT, only about 3% below SVD. The improvement of GRUAE on MLP is also significant, at 9%, whereas the improvements of the other dimensionality reduction algorithms on MLP are negligible. RF, LR and SVM behave similarly to one another. GRUAE improves them less than it improves MLP and DT: RF and SVM gain 4% and 3%, respectively, while LR gains less than 1%. For GRU, the situation is similar to that on MA2, and SVD gives the best improvement. Although GRUAE does not improve GRU the most, the difference from SVD is only 1.5%. GRUAE improves the recall by 5% and 6% for GRU and LSTM, respectively, so GAL is still the best overall.

Fig. 11

The recall of MC1.

The F0.5-score on MC1 is shown in Fig. 12, and the overall trend is similar to that of precision. After dimensionality reduction, the results of nearly all algorithms improve. GRUAE brings the largest improvement to MLP, DT and SVM, about 9% on average, which is significantly better than the other dimensionality reduction methods. Since the precision and recall of GAL are the best among all methods, its F0.5-score is also the best.

Fig. 12

The F0.5-score of MC1.

6.4 Performance Results

We also evaluate the efficiency of GRUAE, PCA, SVD and FA in our experiments. The amount of data processed per second by each algorithm is shown in Table 3. Owing to the simple gating structure of GRU, the encoder of the GRUAE model performs dimensionality reduction quickly. The throughput of GRUAE is similar on the MA2 and MC1 datasets, reaching 1008 and 1025 records per second, respectively. Among the compared methods, SVD is the fastest on MA2, at 792 records per second, and FA is the fastest on MC1, at 689 records per second. The average throughput of the three compared methods is about 700 records per second, while that of GRUAE is 1017, about 1.5 times the average of the other three methods, which demonstrates its advantage in computational performance.

Table 3
Performance comparison of GRUAE and other DR approaches

Amount of data processed per second	MA2	MC1	Average
PCA	687	658	672
SVD	792	664	728
FA	724	689	706
GRUAE	1008	1025	1017
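The "about 1.5 times" figure can be rechecked from the Table 3 entries. The short sketch below recomputes the per-method averages and the ratio; the variable names are illustrative, and the numbers are copied from the table.

```python
# Throughput from Table 3 as (MA2, MC1) pairs.
table3 = {
    "PCA": (687, 658),
    "SVD": (792, 664),
    "FA": (724, 689),
    "GRUAE": (1008, 1025),
}

# Average over the two datasets for each method.
per_method = {name: sum(v) / len(v) for name, v in table3.items()}

# Mean throughput of the three compared methods, and GRUAE's ratio to it.
baseline = sum(per_method[name] for name in ("PCA", "SVD", "FA")) / 3
speedup = per_method["GRUAE"] / baseline
print(round(baseline, 1), round(speedup, 2))  # 702.3 1.45
```

The ratio comes out at roughly 1.45, consistent with the rounded statement in the text.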

6.5 Comparison with WEFR

To further evaluate the effect of GAL, we also compare it with WEFR [45]. Because we improve the structure of the traditional AE with GRU, the encoder of the GRUAE model can better learn the temporal characteristics of SSD SMART data and extract the latent code, which reduces the influence of noise features in the original high-dimensional SSD SMART data and highlights the features more relevant to the data characteristics. The raw input data is mapped to a low-dimensional feature space and then fed to the classifiers for fault detection. As a result, the fault detection effect of the various AI algorithms improves with dimensionality reduction.

As shown in Figs. 13 and 14, the precision, recall and F0.5-score of WEFR on the MA2 dataset are 57%, 32% and 49%, respectively. With the exception of RF and LR, the methods based on GRUAE achieve better results. LSTM gives the best results, with precision, recall and F0.5-score of 97.8%, 95% and 97%, which exceed those of WEFR by 40.8, 63 and 48 percentage points, respectively. The precision, recall and F0.5-score of WEFR on the MC1 dataset are 49%, 18% and 36%, respectively. All algorithms based on dimensionality reduction by GRUAE outperform WEFR. LSTM again gives the best results, with precision, recall and F0.5-score of 97%, 96% and 96.8%, which exceed those of WEFR by 48, 78 and 60.8 percentage points, respectively. SSD SMART data has strong temporal characteristics, and WEFR is based on RF, which has certain limitations in processing temporal data, so WEFR leaves considerable room for improvement in precision and recall.
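As a quick arithmetic check, substituting the reported precision and recall into Eq. (25) approximately reproduces the reported F0.5-scores. The inputs are themselves rounded percentages, so agreement is expected only to about one decimal place.

$F_{0.5}^{WEFR,MA2} = \frac{1.25 \times 0.57 \times 0.32}{0.25 \times 0.57 + 0.32} = \frac{0.228}{0.4625} \approx 0.493$

$F_{0.5}^{GAL,MA2} = \frac{1.25 \times 0.978 \times 0.95}{0.25 \times 0.978 + 0.95} = \frac{1.1614}{1.1945} \approx 0.972$

$F_{0.5}^{WEFR,MC1} = \frac{1.25 \times 0.49 \times 0.18}{0.25 \times 0.49 + 0.18} = \frac{0.1103}{0.3025} \approx 0.364$

$F_{0.5}^{GAL,MC1} = \frac{1.25 \times 0.97 \times 0.96}{0.25 \times 0.97 + 0.96} = \frac{1.164}{1.2025} \approx 0.968$

These values match the reported 49%, 97%, 36% and 96.8% after rounding.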

Fig. 13

Comparison with WEFR of MA2.

Fig. 14

Comparison with WEFR of MC1.

The overall experimental results show that the GRUAE model within GAL is able to learn the temporal characteristics of SSD SMART data and reduce the dimensionality of original high-dimensional SSD SMART data. On the condition that the data after dimensionality reduction contains the original data characteristics, the fault detection accuracy of various AI algorithms is improved. In addition, GAL performs the best in a variety of metrics, which fully demonstrates the effectiveness of GAL.

7 Conclusions

To mitigate the negative impact of high-dimensional SSD SMART data on fault detection with traditional machine learning algorithms, we propose a novel SSD fault detection method, GAL (GRUAE + LSTM), based on dimensionality reduction with a GRU sparse autoencoder. GAL first trains the GRUAE model with SSD SMART data, and then adopts the encoder of the GRUAE model as the dimensionality reduction tool for the original high-dimensional SSD SMART data, aiming to reduce the influence of noise features in the original SSD SMART data and to highlight the features more relevant to the data characteristics, thereby improving the accuracy of fault detection. Finally, LSTM is adopted for fault detection with the low-dimensional SSD SMART data. The experiment is conducted on a publicly available industrial SSD dataset. Experimental results show that the proposed approach reduces the dimensionality of SSD SMART data while preserving the characteristics of the original data. In most cases, the dimensionality reduction effect of the GRUAE model is better than that of the other dimensionality reduction methods used for comparison, and the average computational performance of the GRUAE model is about 1.5 times that of the compared methods. The fault detection accuracy of various AI algorithms improves after dimensionality reduction. For family "MA2", compared with no dimensionality reduction, GAL improves the precision, recall and F0.5-score by about 4%, 5% and 4%, reaching 97.8%, 95% and 97%, respectively. For family "MC1", GAL improves the precision, recall and F0.5-score by about 4%, 8% and 5%, reaching 97%, 96% and 96.8%, respectively, compared with no dimensionality reduction. GAL also performs the best among all compared methods. In future work, we intend to optimize the architecture of the proposed LSTM model to further improve the detection performance.

Acknowledgements

This research was funded by the National Key Research and Development Plan of China under grant No. 2016YFB1000303.

References

1. Clouder, Pangu – the high performance distributed file system by Alibaba Cloud, 2018.

2. Palankar M.R., Iamnitchi, Ripeanu, Garfinkel, Amazon S3 for Science Grids: A Viable Solution?, in: Proceedings of the 2008 International Workshop on Data-Aware Distributed Computing, DADC '08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 55–64. ISBN 9781605581545. doi:10.1145/1383519.1383526.

3. Ghemawat, Gobioff, Leung, The Google file system, in: Proceedings of the 19th ACM Symposium on Operating Systems Principles 2003, SOSP 2003, Bolton Landing, NY, USA, October 19-22, 2003, M.L. Scott and L.L. Peterson, eds, ACM, 2003, pp. 29–43. doi:10.1145/945445.945450.

4. Subramanian, Lloyd, Roy, Hill, Lin, Liu, Pan, Shankar, Viswanathan, Tang, Kumar, f4: Facebook's Warm BLOB Storage System, in: 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI '14, Broomfield, CO, USA, October 6-8, 2014, J. Flinn and H. Levy, eds, USENIX Association, 2014, pp. 383–398. https://www.usenix.org/conference/osdi14/technical-sessions/presentation/muralidhar.

5. Calder, Wang, Ogus, Nilakantan, Skjolsvold, McKelvie, Xu, Srivastav, Wu, Simitci, Haridas, Uddaraju, Khatri, Edwards, Bedekar, Mainali, Abbasi, Agarwal, ul Haq M.F., ul Haq M.I., Bhardwaj, Dayanand, Adusumilli, McNett, Sankaran, Manivannan and Rigas, Windows Azure Storage: a highly available cloud storage service with strong consistency, in: Proceedings of the 23rd ACM Symposium on Operating Systems Principles 2011, SOSP 2011, Cascais, Portugal, October 23-26, 2011, T. Wobber and P. Druschel, eds, ACM, 2011, pp. 143–157. doi:10.1145/2043556.2043571.

6. Huang, Simitci, Xu, Ogus, Calder, Gopalan, Li, Yekhanin, Erasure Coding in Windows Azure Storage, in: 2012 USENIX Annual Technical Conference, Boston, MA, USA, June 13-15, 2012, G. Heiser and W.C. Hsieh, eds, USENIX Association, 2012, pp. 15–26. https://www.usenix.org/conference/atc12/technical-sessions/presentation/huang.

7. Narayanan, Wang, Jeon, Sharma, Caulfield, Sivasubramaniam, Cutler, Liu, Khessib B.M., Vaid, SSD Failures in Datacenters: What? When? and Why?, in: Proceedings of the 9th ACM International on Systems and Storage Conference, SYSTOR 2016, Haifa, Israel, June 6-8, 2016, ACM, 2016, pp. 7:1–7:11. doi:10.1145/2928275.2928278.

8. Schroeder, Lagisetty, Merchant, Flash Reliability in Production: The Expected and the Unexpected, in: 14th USENIX Conference on File and Storage Technologies, FAST 2016, Santa Clara, CA, USA, February 22-25, 2016, A.D. Brown and F.I. Popovici, eds, USENIX Association, 2016, pp. 67–80. https://www.usenix.org/conference/fast16/technical-sessions/presentation/schroeder.

9. Xu, Sui, Yao, Zhang, Lin, Dang, Li, Jiang, Zhang, Lou, Chintalapati, Zhang, Improving Service Availability of Cloud Systems by Predicting Disk Error, in: 2018 USENIX Annual Technical Conference, USENIX ATC 2018, Boston, MA, USA, July 11-13, 2018, H.S. Gunawi and B. Reed, eds, USENIX Association, 2018, pp. 481–494. https://www.usenix.org/conference/atc18/presentation/xu-yong.

10. Gunawi H.S., Suminto R.O., Sears, Golliher, Sundararaman, Lin, Emami, Sheng, Bidokhti, McCaffrey, Srinivasan, Panda, Baptist, Grider, Fields P.M., Harms, Ross R.B., Jacobson, Ricci, Webb, Alvaro, Runesha H.B., Hao and Li, Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems, ACM Trans Storage 14(3) (2018), 23:1–23:26. doi:10.1145/3242086.

11. Murray J.F., Hughes G.F. and Kreutz-Delgado, Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application, J Mach Learn Res 6 (2005), 783–816. http://jmlr.org/papers/v6/murray05a.html.

12. Han, Lee P.P.C., Shen, He, Liu, Huang, Toward Adaptive Disk Failure Prediction via Stream Mining, in: 40th IEEE International Conference on Distributed Computing Systems, ICDCS 2020, Singapore, November 29 - December 1, 2020, IEEE, 2020, pp. 628–638. doi:10.1109/ICDCS47774.2020.00044.

13. Zhang, Wang, Zhou, Schelter, Huang, Cheng, Ji, Tier-Scrubbing: An Adaptive and Tiered Disk Scrubbing Scheme with Improved MTTD and Reduced Cost, in: 57th ACM/IEEE Design Automation Conference, DAC 2020, San Francisco, CA, USA, July 20-24, 2020, IEEE, 2020, pp. 1–6. doi:10.1109/DAC18072.2020.9218551.

14. Xiao, Xiong, Wu, Yi, Jin, Hu, Disk Failure Prediction in Data Centers via Online Learning, in: Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018, Eugene, OR, USA, August 13-16, 2018, ACM, 2018, pp. 35:1–35:10. doi:10.1145/3225058.3225106.

15. Mahdisoltani, Stefanovici I.A., Schroeder, Proactive error prediction to improve storage system reliability, in: 2017 USENIX Annual Technical Conference, USENIX ATC 2017, Santa Clara, CA, USA, July 12-14, 2017, D.D. Silva and B. Ford, eds, USENIX Association, 2017, pp. 391–402. https://www.usenix.org/conference/atc17/technical-sessions/presentation/mahdisoltani.

16. Lu, Luo, Patel, Yao, Tiwari, Shi, Making Disk Failure Predictions SMARTer!, in: 18th USENIX Conference on File and Storage Technologies, FAST 2020, Santa Clara, CA, USA, February 24-27, 2020, S.H. Noh and B. Welch, eds, USENIX Association, 2020, pp. 151–167. https://www.usenix.org/conference/fast20/presentation/lu.

17. Han, Lee P.P.C., Xu, Liu, He, Liu, An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers, in: 19th USENIX Conference on File and Storage Technologies, FAST 2021, February 23-25, 2021, M.K. Aguilera and G. Yadgar, eds, USENIX Association, 2021, pp. 417–429. https://www.usenix.org/conference/fast21/presentation/han.

18. Xu, Zheng, Qin, Xu, Wu, Lessons and Actions: What We Learned from 10K SSD-Related Storage System Failures, in: 2019 USENIX Annual Technical Conference, USENIX ATC 2019, Renton, WA, USA, July 10-12, 2019, D. Malkhi and D. Tsafrir, eds, USENIX Association, 2019, pp. 961–976. https://www.usenix.org/conference/atc19/presentation/xu.

19. Schroeder, Merchant and Lagisetty, Reliability of NAND-Based SSDs: What Field Studies Tell Us, Proc IEEE 105(9) (2017), 1751–1769. doi:10.1109/JPROC.2017.2735969.

20. Alter, Xue, Dimnaku, Smirni, SSD failures in the field: symptoms, causes, and prediction models, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019, Denver, Colorado, USA, November 17-19, 2019, M. Taufer, P. Balaji and A.J. Peña, eds, ACM, 2019, pp. 75:1–75:14. doi:10.1145/3295500.3356172.

21. Sarkar, Peterson, Sanayei, Machine-learned assessment and prediction of robust solid state storage system reliability physics, in: IEEE International Reliability Physics Symposium, IRPS 2018, Burlingame, CA, USA, March 11-15, 2018, IEEE, 2018, pp. 3. doi:10.1109/IRPS.2018.8353565.

22. Pulgar F.J., Charte, Rivera A.J. and del Jesus M.J., Choosing the proper autoencoder for feature fusion based on data complexity and classifiers: Analysis, tips and guidelines, Inf Fusion 54 (2020), 44–60. doi:10.1016/j.inffus.2019.07.004.

23. Bengio and Bengio, Taking on the curse of dimensionality in joint distributions using neural networks, IEEE Trans Neural Networks Learn Syst 11(3) (2000), 550–557. doi:10.1109/72.846725.

24. Gündüz, Çataltepe and Yaslan, Stock daily return prediction using expanded features and feature selection, Turkish J Electr Eng Comput Sci 25 (2017), 4829–4840. doi:10.3906/elk-1704-256.

25. Wang, Gao and Nie, ℓ2,p-Norm Based PCA for Image Recognition, IEEE Trans Image Process 27(3) (2018), 1336–1346. doi:10.1109/TIP.2017.2777184.

26. Duan, Ren and Yang, A Gesture Recognition System Based on Time Domain Features and Linear Discriminant Analysis, IEEE Trans Cogn Dev Syst 13(1) (2021), 200–208. doi:10.1109/TCDS.2018.2884942.

27. Yan, Xu, Zhang, Yang and Lin, Graph Embedding and Extensions: A General Framework for Dimensionality Reduction, IEEE Trans Pattern Anal Mach Intell 29(1) (2007), 40–51. doi:10.1109/TPAMI.2007.250598.

28. Liou, Cheng, Liou and Liou, Autoencoder for words, Neurocomputing 139 (2014), 84–96. doi:10.1016/j.neucom.2013.09.055.

29. Zhao, Jia and Liu, Semisupervised Deep Sparse Auto-Encoder With Local and Nonlocal Information for Intelligent Fault Diagnosis of Rotating Machinery, IEEE Trans Instrum Meas 70 (2021), 1–13. doi:10.1109/TIM.2020.3016045.

30. Guo, Li, Ning, Han, Zhang and Zhou, Feature Dimension Reduction Using Stacked Sparse Auto-Encoders for Crop Classification with Multi-Temporal, Quad-Pol SAR Data, Remote Sens 12(2) (2020), 321. doi:10.3390/rs12020321.

31. Shen, Qi, Wang, Cai and Zhu, An automatic and robust features learning method for rotating machinery fault diagnosis based on contractive autoencoder, Eng Appl Artif Intell 76 (2018), 170–184. doi:10.1016/j.engappai.2018.09.010.

32. Pulgar F.J., Charte, Rivera A.J. and del Jesus M.J., ClEn-DAE: A classifier based on ensembles with built-in dimensionality reduction through denoising autoencoders, Inf Sci 565 (2021), 146–176. doi:10.1016/j.ins.2021.02.060.

33. Gao, Li, Liu, Pan, Gao and Xu, Large-Dimensional Seismic Inversion Using Global Optimization With Autoencoder-Based Model Dimensionality Reduction, IEEE Trans Geosci Remote Sens 59(2) (2021), 1718–1732. doi:10.1109/TGRS.2020.2998035.

34. Kanjilal P.P., Palit and Saha, Fetal ECG extraction from single-channel maternal ECG using singular value decomposition, IEEE Transactions on Biomedical Engineering 44(1) (1997), 51–59.

35. Liu, Yu, Zeng and Zhang, LLE for submersible plunger pump fault diagnosis via joint wavelet and SVD approach, Neurocomputing 185 (2016), 202–211.

36. Govindarajan, Subbaiah, Cavallini, Krithivasan and Jayakumar, Partial discharge random noise removal using Hankel matrix-based fast singular value decomposition, IEEE Transactions on Instrumentation and Measurement 69(7) (2019), 4093–4102.

37. Chen, Liao, Zhang and Du, Mixture factor analysis with distance metric constraint for dimensionality reduction, Pattern Recognit 121 (2022), 108156. doi:10.1016/j.patcog.2021.108156.

38. Ircio, Lojo, Mori and Lozano J.A., A Multivariate Time Series Streaming Classifier for Predicting Hard Drive Failures [Application Notes], IEEE Computational Intelligence Magazine 17(1) (2022), 102–114. doi:10.1109/MCI.2021.3129962.

39. Chhetri T.R., Kurteva, Adigun J.G. and Fensel, Knowledge Graph Based Hard Drive Failure Prediction, Sensors 22(3) (2022). doi:10.3390/s22030985.

40. Mamoutova, Uspenskiy, Smirnov and Bolsunovskaya, Ontological Approach to Automated Analysis of Enterprise Data Storage Systems Log Files, Acta Polytechnica Hungarica 18(9) (2021), 27–47.

41. Wang, He, Jiang and Chow T.W.S., Failure Prediction of Hard Disk Drives Based on Adaptive Rao–Blackwellized Particle Filter Error Tracking Method, IEEE Transactions on Industrial Informatics 17(2) (2021), 913–921. doi:10.1109/TII.2020.3016121.

42. Wang, Dong, Wang, Chen and Zhang, Optimizing Small-Sample Disk Fault Detection Based on LSTMGAN Model, ACM Trans Archit Code Optim 19(1) (2022). doi:10.1145/3500917.

43. Chakraborttii and Litz, Improving the Accuracy, Adaptability, and Interpretability of SSD Failure Prediction Models, in: Proceedings of the 11th ACM Symposium on Cloud Computing, SoCC'20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 120–133. ISBN 9781450381376. doi:10.1145/3419111.3421300.

44. Cho, van, On the Properties of Neural Machine Translation: Encoder-Decoder Approaches, in: Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014, D. Wu, M. Carpuat, X. Carreras and E.M. Vecchi, eds, Association for Computational Linguistics, 2014, pp. 103–111. doi:10.3115/v1/W14-4012. https://aclanthology.org/W14-4012/.

45. , Han, Lee P.P.C., Liu, He and Liu, General Feature Selection for Failure Prediction in Large-scale SSD Deployment, in: 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021, Taipei, Taiwan, June 21-24, 2021, IEEE, 2021, pp. 263–270. doi:10.1109/DSN48987.2021.00039.

46. Hoffer, Hubara and Soudry, Train longer, generalize better: closing the generalization gap in large batch training of neural networks, in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H.M. Wallach, R. Fergus, S.V.N. Vishwanathan and R. Garnett, eds, 2017, pp. 1731–1741. https://proceedings.neurips.cc/paper/2017/hash/a5e0ff62be0b08456fc7f1e88812af3d-Abstract.html.

47. Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, Y.W. Teh and D.M. Titterington, eds, JMLR Proceedings, Vol. 9, JMLR.org, 2010, pp. 249–256. http://proceedings.mlr.press/v9/glorot10a.html.

48. , Liu and Tao, Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence, in: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H.M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E.B. Fox and R. Garnett, eds, 2019, pp. 1141–1150. https://proceedings.neurips.cc/paper/2019/hash/dc6a70712a252123c40d2adba6a11d84-Abstract.html.

Table 1
Overview of the dataset

| | MA2 | MC1 |
|---|---|---|
| Flash technology | MLC | 3D-TLC |
| Capacity | 800 GB | 1920 GB |
| Duration | 12 months | 12 months |
| Healthy disks | 84076 | 163338 |
| Failed disks | 452 | 8546 |
| Positive items | 30687985 | 59618418 |
| Negative items | 45692 | 1855058 |
| Size after processing | 13 GB | 27 GB |

Table 3
Performance comparison of GRUAE and other DR approaches (amount of data processed per second)

| Method | MA2 | MC1 | Average |
|---|---|---|---|
| PCA | 687 | 658 | 672 |
| SVD | 792 | 664 | 728 |
| FA | 724 | 689 | 706 |
| GRUAE | 1008 | 1025 | 1017 |