G-CNN and double-referenced thresholding for detecting time series anomalies

Abstract

Anomaly detection based on time series data is of great importance in many fields. Time series data produced by man-made systems usually comprise two parts: monitored data, which are the object of detection, and exogenous data, which carry the control/feedback information. In this paper, we propose G-CNN, an architecture that combines gated recurrent units (GRU) with a convolutional neural network (CNN), which focus on the monitored and exogenous data, respectively. Most importantly, a complementary double-referenced thresholding approach is introduced that processes the prediction errors and calculates the threshold, balancing the minimization of false positives against that of false negatives. The performance and broad applicability of our model are demonstrated by experiments on two public aerospace datasets and a new server machine dataset from an Internet company. It is also found that the monitored data are closely associated with the exogenous data, when the latter are available, and the interpretability of G-CNN is discussed by visualizing the intermediate outputs of the neural networks.

Keywords

Anomaly detection, CNN, GRU, time series, deep learning

1 Introduction

Time series data are increasingly collected in real-world application systems. Detecting anomalies in such data is of great significance for predicting subsequent behavior and ensuring the normal operation of these systems. For instance, in the aerospace field, anomalous patterns in monitored data can reflect problems in the performance, quality, or design of a spacecraft or space shuttle [1]; in the medical field, electrocardiogram (ECG) signals are widely used to assess a person’s heart health and to detect any arrhythmia that a patient may suffer from [2]. Generally speaking, abnormal patterns are detected by analyzing the temporal relationships in the monitored time series data produced by the application system.

A man-made application system (such as a spacecraft) also requires control and feedback signals, which can be regarded as exogenous data relative to the monitored data (such as the telemetry values of a spacecraft). As shown in Figure 1, the encoded command information of a spacecraft [1], which serves as exogenous data, is usually binary, unlike the floating-point telemetry values (monitored data). Given this difference in data type and attributes, independent modules designed for the monitored and the exogenous data may work more efficiently, and combining the two modules in one architecture has the potential to clarify the correlation between the two types of information.

Fig. 1

The data of spacecraft, where the line chart corresponds to the monitored telemetry data.

Much attention has been paid to time series anomaly detection in recent decades. Figure 2 shows the primary approaches to anomaly detection, with spacecraft as an example; they can be roughly classified into two groups: traditional and intelligent methods. Traditional anomaly detection combines manual monitoring with threshold interpretation, model matching, rule testing and expert systems [3, 4]. Although easy to operate and convenient for technicians, traditional methods rely heavily on expert experience, with the scope of normal sequence data defined and updated by hand. With the increasing complexity of system structure and the exponentially growing scale of sequence data, the effectiveness of traditional methods has been challenged.

Fig. 2

The architectures for spacecraft anomaly detection.

With the rapid development of software and hardware, intelligent models, including machine learning and deep learning, have become research hotspots in time series anomaly detection. For instance, Codetta-Raiteri and Portinale used dynamic Bayesian networks to capture the complex relationships in spacecraft telemetry data and to detect spacecraft anomalies [5]; however, the correlation between the exogenous and monitored data was not considered. Support vector machines and the k-nearest neighbor algorithm have also been applied to time series anomaly detection with distance-based classification [6, 7], but for high-dimensional time series the distance calculation is very expensive. Recently, deep learning has been actively studied for time series anomaly detection, since neural networks are good at fitting complex functions and capturing complicated nonlinear relationships [8]. Among them, the convolutional neural network (CNN) and the recurrent neural network (RNN) have each achieved great success in time series prediction [9–11]. RNNs are well suited to capturing temporal dependency, whereas CNNs excel at representing spatial features. Long Short-Term Memory (LSTM) [12] and the Gated Recurrent Unit (GRU) [13] are RNN variants that can capture longer-term temporal dependency in sequence data. Chauhan et al. applied LSTM to predict ECG signals and detected anomalies by comparing the prediction error with a threshold [2]. However, the threshold in that work was set manually and did not generalize well, resulting in low recall. For time series affected by exogenous factors or variables, Malhotra et al. developed an LSTM-based Encoder-Decoder framework that reconstructs ‘normal’ time series behavior and uses the reconstruction error to detect anomalies [14]. However, the fixed-length vector in the Encoder is not sufficient to represent long-term temporal dependency as the size of the data increases, and the low recall in the experimental results reflects this limitation.

As discussed above, individual methods can handle specific problems, but each has limitations. In general anomaly detection, threshold calculation plays an important role, and there is still considerable room for improvement. For example, a Gaussian distribution with parametric assumptions is commonly adopted to fit the prediction errors and to estimate their mean and variance [15]. If a new prediction error does not follow the fitted Gaussian distribution, the corresponding data point is regarded as an anomaly. However, in practice the monitored data are large and complex, and the parametric assumptions are easily violated. Therefore, a more versatile nonparametric threshold calculation method is necessary.

In this paper, a dual-input hybrid neural network, G-CNN, is proposed, which combines GRU and CNN components. GRU is used to represent the long-term temporal dependency of the monitored data, while CNN focuses on the spatial correlation and short-term temporal dependency in the exogenous data. The outputs of GRU and CNN are fused to complete the feature extraction, so that the correlation between the monitored and exogenous data can be quantified in our approach. The prediction errors between the values predicted by G-CNN and the ground truth received from the application system are assessed by a novel dynamic double-referenced thresholding approach. This approach uses the prediction error distribution to set the error threshold dynamically, which overcomes the inaccuracy caused by manually set thresholds and parametric assumptions. Finally, the interpretability of G-CNN is examined by visualizing its intermediate outputs.

The rest of this paper is organized as follows. Section 2 describes our proposed architecture in detail. Section 3 discusses the experiments and compares our model with strong baselines on three real-world datasets. Finally, the findings of this paper are discussed in Section 4.

2 The proposed method

The logical structure of the proposed detection method and the components of the proposed architecture are shown in Figure 3. The present work centers on the feature representation of the monitored and exogenous data, and on threshold calculation based on the prediction errors.

Fig. 3

The proposed detection logical structure, where e represents the prediction error.

2.1 G-CNN architecture

For clarity, some notation is introduced. Time series data are given as a sequence of vectors $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$, where each step $x^{(t)} = \{x_{1}^{(t)}, x_{2}^{(t)}, \ldots, x_{m}^{(t)}\}$, $x^{(t)} \in R^{m}$, and the elements correspond to the input variables of the hybrid neural network G-CNN. The input $x^{(t)}$ consists of the monitored data $x_{m}^{(t)}$ received from the application system and the exogenous data $\{x_{1}^{(t)}, x_{2}^{(t)}, \ldots, x_{m - 1}^{(t)}\}$. We set $x^{(t)} = \{X_{tru}^{(t)}, X_{cmd}^{(t)}\}$, where $X_{tru}^{(t)} = \{x_{m}^{(t)}\}$ is the monitored data in which anomalies are to be detected, and $X_{cmd}^{(t)} = \{x_{1}^{(t)}, x_{2}^{(t)}, \ldots, x_{m - 1}^{(t)}\}$ is the exogenous data that supports the anomaly detection task. The present work aims at predicting the future value in a rolling forecasting fashion. In other words, $T = \{x_{m}^{(t - L)}, x_{m}^{(t - L + 1)}, \ldots, x_{m}^{(t)}\}$ and $C = \{X_{cmd}^{(t - L)}, X_{cmd}^{(t - L + 1)}, \ldots, X_{cmd}^{(t)}\}$ are used as the inputs of G-CNN to train the network and to predict $x_{m}^{(t + 1)}$. L is selected empirically for each dataset, as further discussed in the experiment section.

Learning the spatio-temporal relationship between the monitored and exogenous data is the key to prediction. The exogenous data C may be multivariate with spatial correlation, while the monitored data T are time-dependent; if each is handled by a neural network suited to its data type and nature, the two networks can be tuned and improved separately. In the present work, a dual-input hybrid neural network, G-CNN, consisting of CNN and GRU layers is put forward for this purpose. In G-CNN, the spatial correlation and short-term dependency of the exogenous data C within the time series window are extracted by the CNN layers, and the long-term temporal dependency in the monitored data T is extracted by the GRU layers. That is, $\tilde{y} = F (G (T), H (C))$ is a nonlinear function fitted by the hybrid neural network G-CNN, where G (T) is a function learned by GRU over the monitored data T and H (C) is another function represented by CNN on the exogenous data C. Processing data with different characteristics in this targeted way benefits the feature extraction of time series data.
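The rolling-forecast inputs described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function name `make_windows`, the row layout (exogenous columns first, monitored column last) and the toy series are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# (monitored history T, exogenous history C, value to predict)
Window = Tuple[List[float], List[List[float]], float]


def make_windows(series: Sequence[Sequence[float]], L: int) -> List[Window]:
    """Build (T, C, target) triples for rolling one-step-ahead forecasting.

    Each row of `series` is one time step [x_1, ..., x_m]; by assumption the
    last column x_m is the monitored value and the first m-1 are exogenous.
    """
    samples: List[Window] = []
    # t is the last index inside the window; t + 1 is the step to predict.
    for t in range(L, len(series) - 1):
        rows = series[t - L : t + 1]          # steps t-L, ..., t (L + 1 rows)
        T = [row[-1] for row in rows]         # monitored history
        C = [list(row[:-1]) for row in rows]  # exogenous history
        target = series[t + 1][-1]            # x_m at step t + 1
        samples.append((T, C, target))
    return samples


# Toy series (hypothetical): two binary command channels plus one telemetry channel.
toy = [[i % 2, 1 - i % 2, 0.1 * i] for i in range(6)]
windows = make_windows(toy, L=2)
```

Each triple pairs the two G-CNN inputs with the value to be predicted, so T would feed the GRU branch and C the CNN branch.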

2.1.1 GRU for monitored data

First, for monitored data with temporal dependency, GRU is adopted to fit and represent the features and to generate the function G (T). The vanishing gradient problem of the vanilla RNN is effectively alleviated in GRU by a gating mechanism [13], which allows each unit to selectively retain and forget information. The structure of GRU is simpler than that of LSTM, yet it can still process sequence data efficiently. In G-CNN, the GRU component contains two GRU layers and a dropout layer, which automatically extract higher-level temporal features of the time series. Each GRU layer is followed by a RELU activation function, which allows the GRU to capture complex features in the input signal. The hidden state of each GRU unit is updated through the activation of its gates. Each k-th unit contains an update gate $z_{k}^{(t)}$ and a reset gate $r_{k}^{(t)}$ at step t. The output $h_{k}^{(t)}$ of each neuron is computed as:

$z_{k}^{(t)} = σ (W_{z} \cdot T + U_{z} \cdot h_{k}^{(t - 1)} + b_{z})$ (1)

$r_{k}^{(t)} = σ (W_{r} \cdot T + U_{r} \cdot h_{k}^{(t - 1)} + b_{r})$ (2)

${\tilde{h}}_{k}^{(t)} = tanh (W_{h} \cdot [r_{k}^{(t)} \cdot h_{k}^{(t - 1)}, T])$ (3)

$h_{k}^{(t)} = z_{k}^{(t)} \cdot h_{k}^{(t - 1)} + (1 - z_{k}^{(t)}) \cdot {\tilde{h}}_{k}^{(t)}$ (4)

where · denotes the product (element-wise between gate and state vectors), σ is the sigmoid function, tanh is the hyperbolic tangent activation function, T is the input of the GRU layer, W and U are weight matrices and b is a bias vector. The reset gate $r_{k}^{(t)}$ determines how much of the state information from the previous moment is ignored; the update gate $z_{k}^{(t)}$, computed from the current input and the previous hidden state $h_{k}^{(t - 1)}$, determines how much of the previous state is kept relative to the candidate state. ${\tilde{h}}_{k}^{(t)}$ represents the candidate (reset) information of the k-th neuron at step t.

As shown in Figure 3, a two-layer GRU is used to extract the dependency features of the monitored data T and obtain $G (T) = [{\tilde{h}}_{1}^{(t)}, {\tilde{h}}_{2}^{(t)}, \ldots, {\tilde{h}}_{k}^{(t)}]$, which is fed into the fully connected layer. To prevent over-fitting, a dropout layer is inserted between the two GRU layers. The parameters of the GRU are further discussed in Section 3.
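To make the gate equations concrete, the following sketch evaluates one GRU step with plain Python lists. It is an illustration under stated assumptions rather than the Keras layer used in the experiments: the parameter dictionary, the helper names and the all-zero demo weights are hypothetical, and `x` stands for the input at a single step rather than the whole window T.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W: Matrix, x: Vector) -> Vector:
    """Matrix-vector product for plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gate(W: Matrix, U: Matrix, b: Vector, x: Vector, h_prev: Vector) -> Vector:
    """sigma(W x + U h_prev + b), the shared form of Eqs. (1) and (2)."""
    wx, uh = matvec(W, x), matvec(U, h_prev)
    return [sigmoid(a + c + d) for a, c, d in zip(wx, uh, b)]


def gru_step(x: Vector, h_prev: Vector, p: Dict[str, list]) -> Vector:
    """One GRU step; `p` holds hypothetical weights keyed by name."""
    z = gate(p["Wz"], p["Uz"], p["bz"], x, h_prev)  # Eq. (1): update gate
    r = gate(p["Wr"], p["Ur"], p["br"], x, h_prev)  # Eq. (2): reset gate
    reset_state = [ri * hi for ri, hi in zip(r, h_prev)]
    # Eq. (3): candidate state from the concatenation [r * h_prev, x]
    h_tilde = [math.tanh(v) for v in matvec(p["Wh"], reset_state + x)]
    # Eq. (4): blend the previous state and the candidate state
    return [zi * hp + (1.0 - zi) * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]


def zeros(rows: int, cols: int) -> Matrix:
    return [[0.0] * cols for _ in range(rows)]


# Demo with all-zero weights: both gates equal 0.5 and the candidate state is 0.
hidden, inputs = 2, 1
params = {
    "Wz": zeros(hidden, inputs), "Uz": zeros(hidden, hidden), "bz": [0.0] * hidden,
    "Wr": zeros(hidden, inputs), "Ur": zeros(hidden, hidden), "br": [0.0] * hidden,
    "Wh": zeros(hidden, hidden + inputs),
}
h_demo = gru_step([0.7], [1.0, -2.0], params)
```

With zero weights the step simply halves the previous state, which makes the role of the update gate in Eq. (4) easy to check by hand.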

2.1.2 CNN for exogenous data

While the monitored data T are fed to the GRU, the corresponding exogenous data C are input to the other component of the model, the CNN, which contains a convolution layer, a dropout layer and a pooling layer. Figure 3 shows the details of the convolution operation. Convolution reduces the number of connections in the network and thereby the risk of overfitting. Each layer is followed by an activation function to capture more complex features. The exogenous data C are first input into a convolutional layer to extract the main feature map. Eq. (5) derives the output $a_{u, v}^{(l)}$ of the convolutional layer from the input C, where l is the index of the convolutional layer, u and v index positions along the width and height of the feature map, i and j run over the kernel size of the filter $k_{i, j}^{(l)}$, and b is the bias. Same padding is used in the convolution, so that much of the original information is retained in the feature map, which helps to fit the data.

$a_{u, v}^{(l)} = RELU (\sum_{i = 1}^{\infty} \sum_{j = 1}^{\infty} a_{i + u, j + v}^{(l - 1)} \cdot C + b^{(l)})$ (5)

Subsequently, a dropout layer is applied to reduce the complexity of G-CNN and avoid overfitting; the output of the dropout layer then enters an average pooling layer. The average pooling layer can be seen as a special convolution, which aims at reducing the size of the feature map and the number of parameters while keeping the significant features for the next operation. Eq. (6) gives the operation of the average pooling layer, where β^(l+1) is the weight of each pooling filter and r is the size of the pooling filter. $p_{u, v}^{(l + 1)}$ is the output of the average pooling layer and the input of the feature fusion layer.

$p_{u, v}^{(l + 1)} = RELU (β^{(l + 1)} \cdot \sum_{u = ir}^{(i + 1) r - 1} \sum_{v = jr}^{(j + 1) r - 1} a_{u, v}^{(l)} + b^{(l + 1)})$ (6)

RELU (x) = max(x, 0) is the activation function. The CNN extracts the local features of the exogenous data C to obtain $H (C) = p_{u, v}^{(l + 1)}$, which is input into the fully connected layer. The parameters of the CNN are further discussed in Section 3.
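As a rough illustration of the convolution and pooling steps, the sketch below applies a same-padded convolution followed by non-overlapping average pooling to a small binary command matrix. Several choices here are assumptions rather than details from the paper: the kernel is centred on each position for the same padding, the pooling weight defaults to 1/r² so that the weighted sum equals the mean, and the kernel and input values are invented for the example.

```python
from typing import List, Optional

Matrix = List[List[float]]


def relu(v: float) -> float:
    return max(v, 0.0)


def conv2d_same(a: Matrix, kernel: Matrix, bias: float = 0.0) -> Matrix:
    """RELU(sum of kernel * input + bias) with zero 'same' padding."""
    rows, cols = len(a), len(a[0])
    kh, kw = len(kernel), len(kernel[0])
    top, left = (kh - 1) // 2, (kw - 1) // 2  # offsets that centre the kernel
    out: Matrix = []
    for u in range(rows):
        out_row = []
        for v in range(cols):
            acc = bias
            for i in range(kh):
                for j in range(kw):
                    y, x = u + i - top, v + j - left
                    if 0 <= y < rows and 0 <= x < cols:  # zero padding outside
                        acc += a[y][x] * kernel[i][j]
            out_row.append(relu(acc))
        out.append(out_row)
    return out


def avg_pool(a: Matrix, r: int, beta: Optional[float] = None,
             bias: float = 0.0) -> Matrix:
    """Non-overlapping r x r pooling in the spirit of Eq. (6)."""
    if beta is None:
        beta = 1.0 / (r * r)  # assumption: this weight turns the sum into a mean
    out: Matrix = []
    for i in range(len(a) // r):
        out_row = []
        for j in range(len(a[0]) // r):
            total = sum(a[u][v]
                        for u in range(i * r, (i + 1) * r)
                        for v in range(j * r, (j + 1) * r))
            out_row.append(relu(beta * total + bias))
        out.append(out_row)
    return out


# Hypothetical input: 4 time steps x 2 binary command channels.
commands = [[1, 0], [0, 1], [1, 1], [0, 0]]
features = conv2d_same(commands, [[1.0], [1.0], [1.0]])  # 3 x 1 kernel over time
pooled = avg_pool(features, r=2)
```

The feature map keeps the input shape because of the same padding, while pooling halves each dimension.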

Finally, the temporal features captured by GRU and the spatial and local features captured by CNN are combined to produce the final predicted values of the hybrid G-CNN. Fusing the different extracted features improves the robustness of the model and helps it fit the relationships in the monitored data. The feature fusion layer combines the features extracted by GRU and CNN, i.e., G (T) and H (C) respectively, to form a joint feature. Before fusion, the features extracted by GRU and CNN are mapped to the same feature space through fully connected layers. The fused feature is then input to another fully connected layer to predict the monitored data $\tilde{y}$ at the next time stamp, as shown in Figure 3. Mathematically, the prediction of G-CNN is expressed as Eq. (7).

$\tilde{y} = f (G (T), H (C)) = f (W^{T} \cdot f_{1} (G (T)) + W^{C} \cdot f_{2} (H (C)))$ (7)

where f, f₁ and f₂ denote fully connected layers, and W^T and W^C are the weights corresponding to G (T) and H (C). G (T) denotes the vectors fitted by the stacked GRU based on T, while H (C) denotes the feature maps produced by CNN based on C.
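The fusion in Eq. (7) can be sketched with small dense layers. This is only one possible reading: treating W^T and W^C as scalar weights `w_t` and `w_c`, using RELU inside f₁ and f₂, and all parameter values below are assumptions made for the illustration.

```python
from typing import Callable, Dict, List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: float) -> float:
    return max(v, 0.0)


def dense(W: Matrix, b: Vector, x: Vector,
          act: Callable[[float], float] = lambda v: v) -> Vector:
    """Fully connected layer: act(W x + b)."""
    return [act(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def fuse_and_predict(g: Vector, h: Vector, p: Dict[str, object]) -> float:
    """Weighted sum of the projected GRU and CNN features, then a final layer."""
    g_proj = dense(p["W1"], p["b1"], g, relu)  # f1(G(T))
    h_proj = dense(p["W2"], p["b2"], h, relu)  # f2(H(C))
    fused = [p["w_t"] * gi + p["w_c"] * hi for gi, hi in zip(g_proj, h_proj)]
    return dense(p["Wf"], p["bf"], fused)[0]   # f(...): scalar prediction


identity = [[1.0, 0.0], [0.0, 1.0]]
demo_params = {
    "W1": identity, "b1": [0.0, 0.0],   # hypothetical projection for G(T)
    "W2": identity, "b2": [0.0, 0.0],   # hypothetical projection for H(C)
    "w_t": 0.5, "w_c": 0.5,             # scalar stand-ins for W^T and W^C
    "Wf": [[1.0, 1.0]], "bf": [0.0],    # final fully connected layer
}
y_hat = fuse_and_predict([1.0, 2.0], [3.0, 4.0], demo_params)
```

Projecting both branches to the same dimension before the weighted sum mirrors the statement that the features are mapped to the same feature space before fusion.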

2.2 Double-referenced threshold

For each time step t, the prediction error is defined as $e^{(t)} = | y^{(t)} - {\tilde{y}}^{(t)} |$ , where y^(t) is the ground truth (monitored data) and ${\tilde{y}}^{(t)}$ represents the predicted monitored data. In most complex application systems, the monitored data will change with the environment. A fast and accurate threshold calculation method is important for anomaly detection.

Inspired by [1], a new method, the double-referenced thresholding approach, is developed for setting the error threshold. It derives the threshold from the distribution of the prediction errors both in a window and in the entire sequence. A sliding window is introduced to divide the prediction errors of an entire sequence into multiple vectors. The prediction errors in an entire sequence are denoted as a vector $E = [e^{(1)}, \ldots, e^{(t - 1)}, e^{(t)}, \ldots, e^{(n)}]$ and the prediction errors in a sliding window as a vector $e = [e^{(t - h)}, \ldots, e^{(t - 1)}, e^{(t)}]$, where h is the size of the sliding window. To reduce the impact of imperfect predictions caused by abrupt changes in values during the prediction phase, the prediction errors are further smoothed. The exponentially-weighted moving average (EWMA) [16] is adopted to generate the smoothed errors, that is, $e \to e_{s}$ with $e_{s} = [e_{s}^{(t - h)}, \ldots, e_{s}^{(t - 1)}, e_{s}^{(t)}]$, and $E \to E_{s}$ with $E_{s} = [e_{s}^{(1)}, e_{s}^{(2)}, \ldots, e_{s}^{(t)}, \ldots, e_{s}^{(n)}]$. EWMA is a moving average whose weights decrease exponentially with time: more recent data receive larger weights, while older data still retain some weight. The degree of weighting is determined by a constant λ between 0 and 1. Eqs. (8) and (9) give the details of EWMA, where $λ = \frac{2}{t + 1}$.

$e_{s}^{(t)} = λ e^{(t - 1)} + (1 - λ) M^{(t - 1)}$ (8)

$M^{(t - 1)} = \frac{e^{(t - 1)} {(1 - λ)}^{t - 2} + e^{(t - 2)} {(1 - λ)}^{t - 3} + \ldots + e^{(2)} (1 - λ) + e^{(1)}}{{(1 - λ)}^{t - 2} + {(1 - λ)}^{t - 3} + \ldots + (1 - λ) + 1}$ (9)

The threshold in each sliding window is computed dynamically from the error distribution. Besides the error distribution in the corresponding window, the error distribution in the entire sequence is also considered in the present model. If a smoothed error in the corresponding window exceeds the threshold, it is classified as anomalous; otherwise it is classified as nominal. The threshold is selected from the set ɛ.

$ɛ = [μ (e_{s}) α + μ (E_{s}) (1 - α)] + z [σ (e_{s}) α + σ (E_{s}) (1 - α)]$ (10)

where $z \in \mathbf{z}$, and $\mathbf{z}$ is an ordered set of positive values denoting the number of standard deviations above the mean; μ denotes the mean and σ the standard deviation. The value of z generally depends on context information, and a reasonable range of z can be found with some criterion. To balance the minimization of false positives and false negatives, z is restricted to the range (2, 10).

In Eq. (10), α is the weight given to the prediction errors in the current sliding window, while 1-α is the weight given to those in the entire sequence. When α=1, only the prediction error distribution in the current sliding window is considered, which corresponds to Nonparametric Dynamic Thresholding (abbreviated N-Par here) [1]; with 0<α<1, both the current sliding window and the entire sequence are taken into account, hence the name Double-referenced (abbreviated D-Ref) Thresholding. It is worth noting that α is a dynamic weight, so each threshold has its own value of α. The value of α, which lies between 0 and 1, is found dynamically according to Eq. (13) and the error distribution in the current sliding window, so as to balance the minimization of false positives and false negatives.
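The smoothing step and the candidate thresholds of Eq. (10) can be sketched as follows. Two points are assumptions rather than the paper's exact procedure: the smoother is the textbook recursive EWMA with a fixed λ, which need not coincide term by term with Eqs. (8) and (9), and the population standard deviation (`statistics.pstdev`) is used for σ. The function names and sample values are invented for the example.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def ewma(errors: Sequence[float], lam: float) -> List[float]:
    """Textbook recursive EWMA: s_t = lam * e_t + (1 - lam) * s_(t-1)."""
    smoothed: List[float] = []
    for e in errors:
        prev = smoothed[-1] if smoothed else e  # seed with the first error
        smoothed.append(lam * e + (1.0 - lam) * prev)
    return smoothed


def candidate_thresholds(window: Sequence[float], sequence: Sequence[float],
                         alpha: float, zs: Sequence[float]) -> List[float]:
    """Eq. (10): blend window and whole-sequence statistics with weight alpha."""
    mu_w = alpha * mean(window) + (1.0 - alpha) * mean(sequence)
    sigma_w = alpha * pstdev(window) + (1.0 - alpha) * pstdev(sequence)
    return [mu_w + z * sigma_w for z in zs]


smoothed_demo = ewma([0.0, 1.0, 0.0, 1.0], lam=0.5)

window_demo = [1.0, 3.0]               # smoothed errors in the current window
sequence_demo = [0.0, 0.0, 4.0, 4.0]   # smoothed errors in the entire sequence
d_ref = candidate_thresholds(window_demo, sequence_demo, alpha=0.5, zs=[2.0])
n_par = candidate_thresholds(window_demo, sequence_demo, alpha=1.0, zs=[2.0])
```

Setting `alpha=1.0` drops the whole-sequence terms, which is how the N-Par case falls out of the same formula.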

To simplify the formulas, we set:

$μ (w) = μ (e_{s}) α + μ (E_{s}) (1 - α)$ (11)

$σ (w) = σ (e_{s}) α + σ (E_{s}) (1 - α)$ (12)

An objective function is defined as follows:

$ɛ = \arg\max_{ɛ} \frac{(Δ μ (w)) / μ (w) + (Δ σ (w)) / σ (w)}{| e_{a} | + {| E_{seq} |}^{2}}$ (13)

where ɛ is the threshold, and

$Δ μ (w) = μ (w) - (μ (\{e_{s}^{(i)} \in e_{s} \mid e_{s}^{(i)} < ɛ\}) α + μ (\{e_{s}^{(i)} \in E_{s} \mid e_{s}^{(i)} < ɛ\}) (1 - α))$ (14)

$Δ σ (w) = σ (w) - (σ (\{e_{s}^{(i)} \in e_{s} \mid e_{s}^{(i)} < ɛ\}) α + σ (\{e_{s}^{(i)} \in E_{s} \mid e_{s}^{(i)} < ɛ\}) (1 - α))$ (15)

$e_{a} = \{e_{s}^{(i)} \in e_{s} \mid e_{s}^{(i)} > ɛ\}$ (16)

$E_{seq} = \text{continuous sequences of } e_{a}^{(i)} \in e_{a}$ (17)

Eq. (13) is used to find a suitable z and hence the threshold ɛ. | e_a | is the number of anomalous points in the corresponding sliding window (usually very small), and | E_seq | is the number of continuous anomalous sequences. Among the candidates in the set ɛ, the one that maximizes Eq. (13) is taken as the threshold; because | e_a | appears in the denominator, candidates that flag fewer points are favored.
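A possible way to evaluate the objective in Eq. (13) over a grid of z values is sketched below for a fixed α. This is a hedged illustration, not the authors' code: the grid of z values, the use of the population standard deviation, the rule of skipping candidates for which the objective is undefined (no flagged points, or empty below-threshold sets), and the sample errors are all assumptions. A search over α could wrap this function in an outer loop.

```python
from statistics import mean, pstdev
from typing import Callable, Optional, Sequence, Tuple


def blend(window_vals: Sequence[float], sequence_vals: Sequence[float],
          alpha: float, stat: Callable[[Sequence[float]], float]) -> float:
    """alpha * stat(window) + (1 - alpha) * stat(sequence), as in Eqs. (11)-(12)."""
    return alpha * stat(window_vals) + (1.0 - alpha) * stat(sequence_vals)


def count_runs(flags: Sequence[bool]) -> int:
    """Number of maximal runs of consecutive True values (|E_seq|)."""
    runs, prev = 0, False
    for flag in flags:
        if flag and not prev:
            runs += 1
        prev = flag
    return runs


def select_threshold(window: Sequence[float], sequence: Sequence[float],
                     alpha: float, zs: Sequence[float]
                     ) -> Optional[Tuple[float, float, float]]:
    """Return (score, threshold, z) maximizing Eq. (13), or None if undefined."""
    mu_w = blend(window, sequence, alpha, mean)
    sigma_w = blend(window, sequence, alpha, pstdev)
    best: Optional[Tuple[float, float, float]] = None
    for z in zs:
        eps = mu_w + z * sigma_w                      # Eq. (10)
        below_w = [e for e in window if e < eps]
        below_s = [e for e in sequence if e < eps]
        flags = [e > eps for e in window]             # Eq. (16)
        n_anom = sum(flags)
        if n_anom == 0 or not below_w or not below_s or mu_w == 0 or sigma_w == 0:
            continue  # assumption: skip candidates where Eq. (13) is undefined
        d_mu = mu_w - blend(below_w, below_s, alpha, mean)         # Eq. (14)
        d_sigma = sigma_w - blend(below_w, below_s, alpha, pstdev)  # Eq. (15)
        score = (d_mu / mu_w + d_sigma / sigma_w) / (n_anom + count_runs(flags) ** 2)
        if best is None or score > best[0]:
            best = (score, eps, z)
    return best


# Hypothetical smoothed errors: a flat baseline with one short burst.
window_errors = [0.1] * 20 + [5.0, 5.2] + [0.1] * 8
sequence_errors = [0.1] * 30 + window_errors
z_grid = [2.0 + 0.5 * k for k in range(17)]           # 2.0, 2.5, ..., 10.0
result = select_threshold(window_errors, sequence_errors, alpha=0.9, zs=z_grid)
```

In this toy case every admissible candidate flags the same two-point burst, so ties are broken in favour of the first z in the grid.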

3 Experiments

The proposed method is implemented in the Keras framework. In this section, the datasets and baseline methods are first described. Then, the experimental setup and evaluation metrics are introduced. Next, comparative experiments show that the proposed method is superior to five baseline methods over three datasets and that it is widely applicable. Finally, the interpretability of G-CNN is examined by visualizing the intermediate outputs of the neural networks.

3.1 Datasets

To demonstrate the effectiveness of the proposed method, three labeled real-world datasets are adopted: SMD (Server Machine Dataset), SMAP (Soil Moisture Active Passive satellite) and MSL (Mars Science Laboratory rover). Among them, SMAP [1] and MSL [1] are two spacecraft datasets from NASA. A spacecraft generally has thousands of channels that produce monitored data, including its power, voltage, angle activities, and others. SMD [18] is a new 5-week-long dataset from a large Internet company. SMD includes an interpretation label, which we regard as exogenous data and input into the CNN component. Table 1 summarizes the spacecraft datasets. Table 2 summarizes the three datasets as used during training.

Table 1
Experimental SMAP and MSL Data Information

Dataset	Anomaly sequences	Point anomalies (% tot.)	Contextual anomalies (% tot.)	Monitored channels	Monitored values
SMAP	69	43 (62%)	26 (38%)	55	429,735
MSL	36	19 (53%)	17 (47%)	27	66,709
Total	105	62 (59%)	43 (41%)	82	496,444

Table 2

Experimental Data Information

Dataset	Entities	Dimensions	Training set size	Testing set size	Anomaly ratio(%)
SMAP	55	25	135183	427617	13.13
MSL	27	55	58317	73727	10.72
SMD	28	38	708405	708420	4.16

3.2 Baseline methods

LSTM+N-Par [1]: This LSTM-based method was proposed to detect spacecraft anomalies; the nonparametric dynamic thresholding developed in that work achieves a balance between false positives and false negatives.

DAGMM [5]: This method, based on a deep autoencoding Gaussian mixture model, focuses on anomaly detection for multivariate data without temporal information between observations.

LSTM-VAE [19]: This method simply combines LSTM and VAE by replacing the feed-forward network in a VAE with LSTM.

OmniAnomaly [18]: This method uses a stochastic recurrent neural network to capture the normal patterns of time series, reconstructs the input data from the learned representations, and uses the reconstruction probabilities to determine anomalies.

3.3 Setup

The following hardware and software are used for the time series anomaly detection task. An Intel(R) Xeon(R) CPU with 20 GB of storage is used for G-CNN training. High-performance hardware can speed up the training of deep neural networks. The software is installed on the CentOS operating system. The programming language is Python 3.6.8, which is widely used for machine learning and deep learning. NumPy 1.14.6 is used to simplify matrix operations. Keras with the TensorFlow backend is used to run G-CNN. Compared with other deep learning libraries, Keras is convenient for experimentation because it is easy to use and well documented. G-CNN is trained with a batch size of 64 for 35 epochs using the above hardware and software.

Grid search is used to tune all hyper-parameters on a validation set (20%). The architecture parameters tuned for G-CNN in the experiment are summarized in Table 3. Each of the two hidden layers of the GRU component contains 80 units; the single layer in the CNN has 8 convolution kernels. Dropout layers and an early stopping strategy are adopted to prevent G-CNN from overfitting. Concretely, training is terminated when the validation loss has not decreased for 10 consecutive epochs. The input sequence length L is set to 250 and the size of the sliding window is initialized as h=2100, balancing performance against computational cost. When the number of monitored values is less than 2100, h should be decreased to ensure that at least one sliding window is available. Finally, when the prediction error is greater than the threshold, the corresponding sequence is classified as an abnormal sequence and compared with the label, and TP, FP and FN are then counted. Table 4 shows the confusion matrix.
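The last step, turning thresholded errors into sequences and counting TP, FP and FN, might look like the sketch below. The matching rule is an assumption: a labeled anomaly counts as detected when any predicted sequence overlaps it, and a predicted sequence counts as a false positive when it overlaps no label. The function names and sample values are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Interval = Tuple[int, int]  # inclusive start and end indices


def anomalous_sequences(errors: Sequence[float], threshold: float) -> List[Interval]:
    """Group consecutive indices whose error exceeds the threshold."""
    intervals: List[Interval] = []
    start: Optional[int] = None
    for i, e in enumerate(errors):
        if e > threshold and start is None:
            start = i                           # a new run begins
        elif e <= threshold and start is not None:
            intervals.append((start, i - 1))    # the run ended at i - 1
            start = None
    if start is not None:
        intervals.append((start, len(errors) - 1))
    return intervals


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def count_outcomes(predicted: Sequence[Interval],
                   labeled: Sequence[Interval]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) under the overlap rule assumed above."""
    tp = sum(any(overlaps(lab, p) for p in predicted) for lab in labeled)
    fn = len(labeled) - tp
    fp = sum(not any(overlaps(p, lab) for lab in labeled) for p in predicted)
    return tp, fp, fn


demo_errors = [0.1, 0.9, 0.8, 0.1, 0.1, 0.7, 0.1, 0.1]
predicted = anomalous_sequences(demo_errors, threshold=0.5)
tp, fp, fn = count_outcomes(predicted, labeled=[(2, 3), (7, 7)])
```

Counting at the level of sequences rather than individual points keeps one long anomaly from dominating the metrics.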

Table 3
Model Parameters

	GRU	CNN
hidden layers	2	1
units in hidden layers	80	8(filters)
sequence length(L)	250
training iterations	35
dropout	0.3
batch size	64
optimizer	Adam

Table 4

The Confusion Matrix

The Ground Truth	The Predicted Value
	Positive	Negative
Positive	TP	FN
Negative	FP	TN

For convenience in performance comparison, precision, recall and F_0.5-score are employed to evaluate all experimental results. The performance metrics are defined as follows.

$Precision = \frac{TP}{TP + FP}$ (18)

$Recall = \frac{TP}{TP + FN}$ (19)

$F_{0.5}\text{-score} = \frac{(1 + {0.5}^{2}) \cdot Precision \cdot Recall}{{0.5}^{2} \cdot Precision + Recall}$ (20)
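The three metrics translate directly into code. The counts in the example are illustrative only and are not reported in the paper; they were chosen so that precision and recall equal 87.50% and 80.00%.

```python
def precision(tp: int, fp: int) -> float:
    """Eq. (18); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Eq. (19); returns 0.0 when there are no positive labels."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_beta(p: float, r: float, beta: float = 0.5) -> float:
    """Eq. (20) for beta = 0.5; beta < 1 weights precision more than recall."""
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


# Illustrative counts (hypothetical): 84 detected, 12 false alarms, 21 missed.
p_demo = precision(tp=84, fp=12)   # 0.875
r_demo = recall(tp=84, fn=21)      # 0.8
f_demo = f_beta(p_demo, r_demo)    # about 0.859
```

Because β = 0.5, a gain in precision raises the score more than an equal gain in recall, which suits settings where false alarms are costly.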

3.4 Results and discussion

In this section, the effectiveness of the proposed method is demonstrated by comparing its performance metrics with those of the baseline methods on three datasets. First, G-CNN and Double-referenced thresholding are evaluated against the baseline method on two spacecraft datasets, and the correlation between the exogenous and monitored data is examined. Next, for different spacecraft anomaly types, the performance of our method is compared with the state-of-the-art method for spacecraft anomaly detection. Then, the overall performance of the proposed method is compared with four baseline methods over three datasets. Finally, the interpretability of G-CNN is discussed by visualizing the intermediate outputs.

3.4.1 Model comparison (LSTM v.s. G-CNN)

The experiment pairs (No.1 vs. No.2 and No.3 vs. No.4) in Table 5 provide a direct comparison between the basic models based on LSTM and G-CNN. G-CNN increases the overall precision by up to 8% (87.50% @No.1 to 95.60% @No.2) and the overall recall by up to 2.8% (80.00% @No.1 to 82.85% @No.2). On the SMAP dataset, the hybrid G-CNN model is clearly superior; the maximum increase in precision is 13% (85.50% @No.1 to 98.39% @No.2) and the increase in recall is 3% (85.50% to 88.41%).

LSTM excels at capturing the long-term temporal dependency of sequence data, but cannot quantify the correlation between monitored and exogenous data. In G-CNN, GRU focuses on the monitored data to capture temporal features, while CNN focuses on the exogenous data to capture spatial features. The experimental results also indicate that, for data with different types and attributes, processing them with more targeted modules gives better results. Moreover, it is worth pointing out that anomaly detection performance on the MSL dataset is always slightly lower than on the SMAP dataset. This effect may be associated with two facts. Firstly, the MSL dataset is smaller than SMAP, which reduces the effectiveness of feature extraction on MSL. Secondly, many attributes are not fully represented in the training data, although MSL executes a wide variety of commands with different regularities. In addition, a small training dataset may not cover the essential features encountered under real conditions, which reduces the effectiveness of models with a higher degree of fit.

3.4.2 Double-referenced v.s. Non-parametric thresholding

The threshold approach proposed here considers not only the error distribution in the sliding window but also the error distribution of the entire sequence, and dynamically finds an optimized weight α, which makes the threshold calculation more flexible. The experiment pairs (No.1 vs. No.3 and No.2 vs. No.4) in Table 5 show that our threshold method significantly improves precision on both datasets. The maximal improvement in precision and F_0.5-score occurs on the basic LSTM model (No.1 vs. No.3): precision improves by 5.8% overall (=93.3% -87.5%), by 7.4% (=100% -92.6%) on the MSL dataset and by 5.3% (=90.8% -85.5%) on the SMAP dataset. This shows that the new threshold method is more effective in reducing false positives.

Table 5
Comparison of results for different models and thresholding approaches. Values marked with * are recalculated with Eq. (20) from the precision and recall; the best performance in each case is displayed in boldface. No. is the serial number of each experiment

No.	Threshold	Model	Dataset	Precision(%)	Recall(%)	F_0.5-score
1	N-Par	LSTM	MSL	92.60	69.40	0.87*
			SMAP	85.50	85.50	0.85*
			Total	87.50	80.00	0.86*
2		G-CNN	MSL	89.65	72.22	0.85
			SMAP	98.39	88.41	0.96
			Total	95.60	82.85	0.93
3	D-Ref	LSTM	MSL	100.00	66.70	0.91
			SMAP	90.80	85.50	0.90
			Total	93.30	79.00	0.90
4		G-CNN	MSL	96.29	72.22	0.90
			SMAP	98.30	84.05	0.95
			Total	97.67	80.00	0.93
5		GRU	MSL	90.91	55.56	0.81
			SMAP	96.49	79.71	0.92
			Total	94.94	71.43	0.89

When the weight α is found dynamically, a larger α means that the error distribution in the sliding window carries more weight (see Eq. 10), while a smaller α means that the error distribution in the entire sequence carries more weight. To achieve parameter-free threshold calculation based on the distribution of prediction errors, the weight α is best found dynamically, and our method finds an optimized value of α that can be reported during the experiment. In most cases, the automatically calculated value of α is 0.9, which indicates that taking the prediction errors of the entire sequence into account can help improve the performance of anomaly detection.

3.4.3 Correlation between X_cmd and X_tru

The correlation between the monitored data (X_tru) and the exogenous data (X_cmd) can be verified with the present G-CNN model, because they are handled separately by GRU and CNN. If there were no correlation between them (i.e., the exogenous data did not affect the monitored data), removing the CNN component would not affect the final results. No.5 in Table 5 shows the experimental result when only the GRU works in the G-CNN model. Comparing No.4 and No.5, when the exogenous data are not used to predict the next monitored values, anomaly detection performance degrades significantly for all three metrics on both datasets. In particular, the overall precision, recall, and F_0.5-score drop by 3% (=97.67% -94.94%), 8% (=80.00% -71.43%), and 0.04 (=0.93-0.89), respectively; the largest degradation is 16.66% in recall on the MSL dataset. Therefore, there is a close correlation between the monitored and exogenous data. The proposed model has the potential to locate the relationship between them, and hence helps to identify which part of the system is abnormal in practice.

3.4.4 Performance for different anomaly types

Table 6 details the precision and recall for point and contextual anomalies on the spacecraft datasets obtained by LSTM and by our model. Relative to LSTM+N-Par, the proposed method gains more on contextual anomalies than on point anomalies. For contextual anomaly detection, the overall precision increases by 26.3 percentage points (73.7% to 100%) and the overall recall by 7.7 percentage points (69.0% to 76.7%).

Table 6
Precision and Recall for different anomaly types

Model	Dataset	Precision (%), point	Precision (%), contextual	Recall (%), point	Recall (%), contextual
LSTM+N-Par	MSL	100.0	90.9	78.9	58.8
	SMAP	97.6	66.7	95.3	76.0
	Total	98.2	73.7	90.3	69.0
G-CNN+D-Ref	MSL	93.3	100.0	73.3	70.6
	SMAP	100.0	100.0	88.4	80.8
	Total	98.1	100.0	83.9	76.7

3.4.5 Whole performance of our method

Table 5 shows that, on all three metrics, the newly proposed method (G-CNN+D-Ref, see No. 4) is beneficial for spacecraft anomaly detection. Compared with the cutting-edge method (No. 1), the overall precision improves by 11.4 percentage points (= 98.90% − 87.50%), and the overall F_0.5-score reaches its maximum of 0.93. On the SMAP dataset in particular, the precision reaches 100% and the corresponding F_0.5-score reaches 0.95, the highest among all methods.

To further study the applicability and overall performance of our method, six groups of comparative experiments are conducted on three datasets, as listed in Table 7. In OmniAnomaly, a stochastic recurrent neural network is trained to reconstruct the time series, and the reconstruction probabilities are used to determine anomalies. However, the fixed-length vector in the encoder is not sufficient to represent the temporal dependency as the data size increases. DAGMM ignores the temporal information between observations, which makes it inferior at reconstructing the current observations. LSTM-VAE simply combines LSTM and VAE to represent normal patterns and uses the reconstruction error distribution to detect anomalies; however, it does not consider temporal dependence. L-CNN denotes the combination of LSTM and CNN: we replace the GRU in G-CNN with an LSTM to complete the comparison experiment. For G-CNN, we input the monitored data into the GRU and CNN layers simultaneously to capture the temporal and local spatial relationships of the input data. The structure of the GRU is simpler than that of the LSTM, and the results also show that the GRU performs better than the LSTM on the current datasets. Furthermore, Double-referenced thresholding achieves parameter-free threshold calculation. To show the training process of G-CNN more intuitively, the training and validation loss curves of G-CNN are plotted in Fig. 4.

Table 7
Comparison of results for different methods on the SMAP, MSL, and SMD datasets; the best performance in each case is displayed in boldface

Methods	Dataset	Precision	Recall	F_0.5-score
LSTM+N-Par [1]	SMAP	0.85	0.85	0.85
	MSL	0.92	0.69	0.87
	SMD	0.59	0.64	0.60
DAGMM [15]	SMAP	0.58	0.90	0.62
	MSL	0.54	0.99	0.59
	SMD	0.59	0.88	0.63
LSTM-VAE [19]	SMAP	0.85	0.64	0.79
	MSL	0.53	0.95	0.58
	SMD	0.79	0.71	0.77
OmniAnomaly [18]	SMAP	0.74	0.98	0.78
	MSL	0.89	0.91	0.89
	SMD	0.83	0.94	0.85
L-CNN+D-Ref	SMAP	0.87	0.81	0.81
	MSL	0.92	0.69	0.86
	SMD	0.90	0.93	0.90
G-CNN+D-Ref(ours)	SMAP	0.98	0.84	0.95
	MSL	0.96	0.72	0.90
	SMD	0.99	0.99	0.99
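
Which method is best per dataset and metric can be read off Table 7 programmatically. The snippet below simply transcribes the table and takes the maximum of each column per dataset; the helper name `best` is illustrative.

```python
# Table 7 transcribed as {method: {dataset: (precision, recall, F_0.5)}}.
table7 = {
    "LSTM+N-Par": {"SMAP": (0.85, 0.85, 0.85), "MSL": (0.92, 0.69, 0.87), "SMD": (0.59, 0.64, 0.60)},
    "DAGMM": {"SMAP": (0.58, 0.90, 0.62), "MSL": (0.54, 0.99, 0.59), "SMD": (0.59, 0.88, 0.63)},
    "LSTM-VAE": {"SMAP": (0.85, 0.64, 0.79), "MSL": (0.53, 0.95, 0.58), "SMD": (0.79, 0.71, 0.77)},
    "OmniAnomaly": {"SMAP": (0.74, 0.98, 0.78), "MSL": (0.89, 0.91, 0.89), "SMD": (0.83, 0.94, 0.85)},
    "L-CNN+D-Ref": {"SMAP": (0.87, 0.81, 0.81), "MSL": (0.92, 0.69, 0.86), "SMD": (0.90, 0.93, 0.90)},
    "G-CNN+D-Ref": {"SMAP": (0.98, 0.84, 0.95), "MSL": (0.96, 0.72, 0.90), "SMD": (0.99, 0.99, 0.99)},
}
METRICS = ("precision", "recall", "f05")


def best(dataset: str, metric: str) -> str:
    """Return the method with the highest value of `metric` on `dataset`."""
    idx = METRICS.index(metric)
    return max(table7, key=lambda method: table7[method][dataset][idx])


for dataset in ("SMAP", "MSL", "SMD"):
    print(dataset, {metric: best(dataset, metric) for metric in METRICS})
```

On these numbers, G-CNN+D-Ref has the highest precision and F_0.5-score on all three datasets, while the highest recall on SMAP and MSL belongs to OmniAnomaly and DAGMM, respectively.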

To further verify the coupling effect of the GRU and the CNN in G-CNN anomaly detection, ablation experiments are performed on the Space Shuttle dataset, as shown in Table 8. The results show that G-CNN performs better than either component alone, although the individual GRU and CNN also work rather well.

Table 8

Comparison of results for different networks with D-Ref on the Space Shuttle dataset; the best performance in each case is displayed in boldface

Network	TP	FP	FN	TN	Accuracy
GRU	985	0	159	848	0.92
CNN	982	0	162	848	0.91
G-CNN	1000	0	144	848	0.93
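
The accuracy column of Table 8 follows from the confusion-matrix counts. The sketch below recomputes it, together with precision and recall, using the usual definitions; the function name `metrics` is illustrative. The recomputed accuracies agree with the table to within 0.01.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision and recall from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# (TP, FP, FN, TN) from Table 8.
table8 = {
    "GRU": (985, 0, 159, 848),
    "CNN": (982, 0, 162, 848),
    "G-CNN": (1000, 0, 144, 848),
}

results = {name: metrics(*counts) for name, counts in table8.items()}

for name, m in results.items():
    print(f"{name:6s} acc={m['accuracy']:.4f} prec={m['precision']:.4f} rec={m['recall']:.4f}")
```

Since no network produces false positives here, the three variants differ only in recall, which is where the combined G-CNN gains over the single components.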

We conduct additional experiments to show that the dual-input mechanism of G-CNN is reasonable and effective. As shown in Table 9, we compare G-CNN with a variant without the dual-input mechanism (WODIM) that uses the same parameters, and we measure the time cost of both. G-CNN takes less time and performs better than the WODIM variant: handing different types of data to different network components allows the features to be captured more accurately.

Table 9

Comparison of results for G-CNN without the dual-input mechanism (WODIM) and G-CNN (ours); the best performance in each case is displayed in boldface. P refers to precision, R to recall, and T to time cost

Methods	Dataset	P	R	F_0.5-score	T(s)
G-CNN+D-Ref(WODIM)	SMAP	0.88	0.80	0.86	21739
	MSL	0.89	0.69	0.84	12375
	SMD	0.91	0.92	0.92	248820
G-CNN+D-Ref(ours)	SMAP	0.98	0.84	0.95	19140
	MSL	0.96	0.72	0.90	9396
	SMD	0.99	0.99	0.99	220540
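
The time saving implied by the T(s) column of Table 9 can be expressed as a fraction of the WODIM time. The snippet below only rearranges the tabulated values; the variable names are illustrative.

```python
# Time cost in seconds from Table 9: (WODIM variant, dual-input G-CNN).
time_cost = {
    "SMAP": (21739, 19140),
    "MSL": (12375, 9396),
    "SMD": (248820, 220540),
}

# Fraction of the WODIM time saved by the dual-input model.
saving = {ds: (wodim - ours) / wodim for ds, (wodim, ours) in time_cost.items()}

for ds, frac in saving.items():
    print(f"{ds:4s} saves {frac:.1%} of the WODIM time")
```

On these figures the dual-input model is roughly 11% to 24% faster, with the largest relative saving on MSL.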

3.4.6 Interpretability of G-CNN model

The most widely known shortcoming of neural networks with a large number of parameters is their "black box" nature. The training process of a neural network is not intuitive, but it can be understood by visualizing the output of each layer in the network. Figure 5 shows the intermediate outputs from four kernel filters in the CNN of the G-CNN network. The CNN layer consists of 8 kernels, each of which has the same size but a different weight matrix, so that they capture the short-term temporal dependency and spatial correlation of the input signals and produce different outputs under the convolution operation. As shown in Figure 5, for the same input the outputs of the four filters differ, but they all reflect the basic characteristics of the original input to varying degrees. "Same" (zero) padding is adopted in the convolution operation. Therefore, much of the original information of the input data is retained in the feature map, and the output feature map has the same size as the test data. Visualizing the intermediate outputs is useful for analyzing the feature extraction process of the G-CNN.
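
The effect of "same" (zero) padding on the feature-map size can be sketched in a few lines of plain Python. The kernel size of 3, the random weights, and the toy input below are assumptions made for illustration; only the number of filters (8) is taken from the text.

```python
import random


def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero ("same") padding.

    The input is padded with zeros so that the output has exactly
    as many samples as the input.
    """
    k = len(kernel)
    left = (k - 1) // 2
    right = k - 1 - left
    padded = [0.0] * left + list(signal) + [0.0] * right
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


random.seed(0)
signal = [float(t % 5) for t in range(20)]  # toy input sequence
# 8 filters of identical size but different weights (size 3 is assumed).
kernels = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(8)]
feature_maps = [conv1d_same(signal, kernel) for kernel in kernels]

print(len(feature_maps), "feature maps, each of length", len(feature_maps[0]))
```

Each filter yields a different feature map, yet every map has the same length as the input, which is what allows the intermediate outputs to be plotted against the original signal.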

Fig. 4

Training and validation loss of G-CNN on three datasets: (a) SMAP dataset, (b) MSL dataset, (c) SMD dataset.

Fig. 5

Intermediate outputs of various filter kernels on the SMAP dataset.

3.4.7 Discussion

In summary, dealing with the monitored data and the exogenous data separately and then fusing their features not only significantly improves anomaly detection performance, but also helps to verify the correlation between the monitored data and the exogenous data. For datasets without exogenous data, the monitored data are input simultaneously into the GRU and CNN layers of G-CNN, which also works very well. In addition, introducing the prediction error distribution of the entire sequence into the threshold calculation effectively improves precision as well as the performance of contextual anomaly detection.

The experimental results above show that our model performs well on different types of time series data from several fields. The interpretability of the CNN component in the G-CNN model is illustrated by visualizing its intermediate outputs. The experiments also show that threshold setting plays an important role in the anomaly detection algorithm. In future work, we will focus on the limited interpretability of the neural network and on further exploring methods of threshold setting.

4 Conclusion

In this paper, a novel dual-input hybrid neural network model, G-CNN, for representing features in time series data and a Double-referenced thresholding approach for threshold setting are presented. For time series anomaly detection involving exogenous data, the exogenous data and the monitored data are handled by a one-layer CNN and a two-layer GRU, respectively, which not only improves the overall performance of anomaly detection but also offers the potential to locate the relationship between the monitored data and the exogenous data. Our approach improves precision by roughly 10 to 40 percentage points over most of the compared methods. Compared with DAGMM, our method achieves a better balance between precision and recall. Compared with the recent OmniAnomaly method, the precision of our method on the three datasets increases by 7 to 24 percentage points (Table 7), the recall on the SMD dataset increases by 5 percentage points, and the F_0.5-score increases by 1 to 17 percentage points. The Double-referenced thresholding approach considers the prediction errors both in a sliding window and in the entire sequence, and calculates the threshold dynamically from the prediction errors, which markedly improves the performance of anomaly detection. Experimental results show that the proposed method has strong scalability and applicability, and the interpretability of the G-CNN is demonstrated by visualizing the intermediate outputs of the neural network.

Acknowledgment

The research was partially funded by the National Natural Science Foundation of China (Grant Nos. 51661005, U1612442 and 61672221), the National Outstanding Youth Science Program of the National Natural Science Foundation of China (Grant No. 61625202), and the China Scholarship Council (Grant No. 201706130080).

References

1. Hundman, Constantinou, Laporte, Colwell and Soderstrom, Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 387–395. ACM, (2018).

2. Chauhan and Vig, Anomaly detection in ECG time signals via deep long short-term memory networks. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–7. IEEE, (2015).

3. Mills A.R. and Kadirkamanathan, Sensing for aerospace combustor health monitoring, Aircraft Engineering and Aerospace Technology 92(1) (2020), 37–46.

4. Mills A.R. and Kadirkamanathan, Sensing for aerospace combustor health monitoring, Aircraft Engineering and Aerospace Technology 92(1) (2020), 37–46.

5. Codetta-Raiteri and Portinale, Dynamic Bayesian networks for fault detection, identification, and recovery in autonomous spacecraft, IEEE Transactions on Systems, Man, and Cybernetics: Systems 45(1) (2014), 13–24.

6. Xiong, Ma H.-D., Fang H.-Z., Zou K.-X. and Yi D.-W., Anomaly detection of spacecraft based on least squares support vector machine. In 2011 Prognostics and System Health Management Conference, pages 1–6. IEEE, (2011).

7. Seetha, Sunitha K.V.N. and Malini Devi, Performance assessment of neural network and k-nearest neighbour classification with random subwindows, International Journal of Machine Learning and Computing 2(6) (2012), 844.

8. Qin, Langari, Wang, Xiang and Dong, Road excitation classification for semi-active suspension system with deep neural networks, Journal of Intelligent & Fuzzy Systems 33(3) (2017), 1907–1918.

9. Ullah, Haydarov and Muhammad, Deep learning assisted buildings energy consumption profiling using smart meter data, Sensors 20 (2020), 873.

10. Xie, Li, Sun and Lin, Enhancing sentence embedding with dynamic interaction, Applied Intelligence (2019), 1–10.

11. Min Ullah F.U., Ullah, Haq, Rho and Baik, Short-term prediction of residential power energy consumption via CNN and multilayer bi-directional LSTM networks, IEEE Access (2019), 1–1.

12. Khan, Hussain, Ullah, Rho, Lee and Baik, Towards efficient electricity forecasting in residential and commercial buildings: A novel hybrid CNN with a LSTM-AE based framework, Sensors 20 (2020), 1399.

13. Sajjad, Khan, Ullah, Hussain, Ullah, Lee and Baik, A novel CNN-GRU based hybrid approach for short-term residential load forecasting, IEEE Access (2020), 1–1.

14. Malhotra, Ramakrishnan, Anand, Vig, Agarwal and Shroff, LSTM-based encoder-decoder for multi-sensor anomaly detection, arXiv preprint arXiv:1607.00148, (2016).

15. Zong, Song, Min M.R., Cheng, Lumezanu, Cho and Chen, Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, (2018).

16. Reunanen, Räty, Jokinen J.J., Hoyt and Culler, Unsupervised online detection and prediction of outliers in streams of sensor data, International Journal of Data Science and Analytics (2020), 1–30.

17. Cui, Surpur, Ahmad and Hawkins, A comparative study of HTM and other neural network models for online sequence learning with streaming data. In 2016 International Joint Conference on Neural Networks (IJCNN), pages 1530–1538. IEEE, (2016).

18. …, Zhao, Niu, Liu, Sun and Pei, Robust anomaly detection for multivariate time series through stochastic recurrent neural network, (2019), 2828–2837.

19. Park, Hoshi and Kemp C.C., A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder, IEEE Robotics and Automation Letters 3(3) (2018), 1544–1551.
