Abstract
Anomaly detection based on time series data is of great importance in many fields. Time series data produced by man-made systems usually comprise two parts: monitored data, which are the object of detection, and exogenous data, which carry control and feedback information. In this paper, an architecture called G-CNN is proposed that combines gated recurrent units (GRU) with a convolutional neural network (CNN), which focus on the monitored and exogenous data, respectively. Most importantly, a complementary double-referenced thresholding approach is introduced that processes prediction errors and calculates the threshold, balancing the minimization of false positives against that of false negatives. The performance and broad applicability of our model are demonstrated by experiments on two public aerospace datasets and a new server machine dataset from an Internet company. It is also found that the monitored data are closely associated with the exogenous data, where present, and the interpretability of G-CNN is discussed by visualizing the intermediate outputs of the neural networks.
Introduction
Time series data are increasingly collected in various real-world application systems. Detecting anomalies in such data is of great significance for predicting future behavior and ensuring the normal operation of these systems. For instance, in the aerospace field, anomalous patterns in monitored data can reflect problems in the performance, quality, or design of a spacecraft or space shuttle [1]; in the medical field, electrocardiogram (ECG) signals are widely used to assess a person’s heart health and to detect any arrhythmia that patients may suffer [2]. Generally speaking, abnormal patterns are detected by analyzing the temporal relationships in the monitored time series data produced by an application system.
A man-made application system (such as a spacecraft) requires control and feedback signals, which can be regarded as exogenous data relative to the monitored data (such as the telemetry values of a spacecraft). As shown in Figure 1, the encoded command information of a spacecraft [1], which serves as exogenous data, is usually binary, unlike the floating-point telemetry values (monitored data). Given this difference in data type and attributes, independent modules designed to handle the monitored and exogenous data separately may work more efficiently, and combining the two modules in the same architecture has the potential to clarify the correlation between the two types of information.

The data of spacecraft, where the line chart corresponds to the monitored telemetry data.
Much attention has been paid to time series anomaly detection in recent decades. Figure 2 shows the primary approaches to anomaly detection, with spacecraft as an example; they can be roughly classified into two groups: traditional and intelligent methods. Traditional anomaly detection combines manual monitoring with threshold interpretation, model matching, rule testing and expert systems [3, 4]. Although easy to operate and convenient for technicians, traditional methods rely heavily on expert experience to manually define and update the range of normal sequence data. With the increasing complexity of system structure and the exponentially growing scale of sequence data, the effectiveness of traditional methods has been challenged.

The architectures for spacecraft anomaly detection.
With the rapid development of software and hardware, intelligent models, including machine learning and deep learning, have become research hotspots in time series anomaly detection. For instance, Codetta-Raiteri and Portinale used dynamic Bayesian networks to capture the complex relationships in spacecraft telemetry data and to detect spacecraft anomalies [5]; however, the correlation between the exogenous and monitored data was not considered. Support vector machines and the k-nearest neighbor algorithm have also been applied to time series anomaly detection with distance-based classification [6, 7], but for high-dimensional time series the cost of distance calculation is very high. Recently, deep learning has been actively studied for time series anomaly detection, since neural networks are good at fitting complex functions and capturing complicated nonlinear relationships [8]. Among them, the convolutional neural network (CNN) and the recurrent neural network (RNN) have each achieved great success in time series prediction [9–11]. RNNs are well suited to capturing temporal dependency, and CNNs excel at representing spatial features. Long Short-Term Memory (LSTM) [12] and the Gated Recurrent Unit (GRU) [13] are variants of the RNN that can capture longer-term temporal dependency in sequence data. Chauhan et al. applied LSTM to predict ECG signals and detected anomalies by comparing the prediction error with a threshold [2]. However, the threshold in that work was set manually and lacked general applicability, resulting in low recall. For time series affected by exogenous factors or variables, Malhotra et al. developed an LSTM-based Encoder-Decoder framework that reconstructs ‘normal’ time-series behavior and uses the reconstruction error to detect anomalies [14]. However, the fixed-length vector in the Encoder is not sufficient to represent long-term temporal dependency as the size of the data increases, and the low recall in the experimental results reflects this limitation.
As mentioned above, individual methods can handle specific problems, but each has limitations. In anomaly detection generally, threshold calculation plays an important role, and there is still considerable room for improvement. For example, a Gaussian distribution with parametric assumptions is commonly adopted to fit the prediction errors and to estimate their mean and variance [15]. If a new prediction error does not follow the fitted Gaussian distribution, the corresponding data point is regarded as an anomaly. In practice, however, monitored data are large and complex and easily violate such parametric assumptions. Therefore, a more versatile nonparametric threshold calculation method is necessary.
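For reference, the parametric baseline described above can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure of [15]: the function names, the sample error values and the cut-off of k = 3 standard deviations are assumptions made for the example.

```python
import statistics

def fit_gaussian(errors):
    """Estimate the mean and standard deviation of prediction errors on normal data."""
    return statistics.mean(errors), statistics.pstdev(errors)

def is_anomaly(error, mean, std, k=3.0):
    """Flag an error that lies more than k standard deviations from the mean.
    The cut-off k is an assumed parameter, which is exactly what a
    nonparametric method tries to avoid."""
    return abs(error - mean) > k * std

# Hypothetical prediction errors observed on anomaly-free data.
train_errors = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.08, 0.11]
mu, sigma = fit_gaussian(train_errors)

# Two new prediction errors: one typical, one far outside the fitted range.
flags = [is_anomaly(e, mu, sigma) for e in (0.11, 0.45)]
```

The sketch makes the weakness visible: both the Gaussian assumption and the choice of k must be fixed in advance, and neither is guaranteed to suit large, complex monitored data.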
In this paper, a dual-input hybrid neural network, G-CNN, is proposed, which combines GRU and CNN components. The GRU represents the long-term temporal dependency of the monitored data, while the CNN focuses on the spatial correlation and short-term temporal dependency in the exogenous data. The outputs of the GRU and CNN are fused to complete the feature extraction, so that the correlation between the monitored and exogenous data can be quantified in our approach. The prediction errors between the values predicted by G-CNN and the ground truth received from the application system are assessed by a novel dynamic double-referenced thresholding approach. This approach uses the distribution of prediction errors to set the error threshold dynamically, which overcomes the inaccuracy caused by manually set thresholds and parametric assumptions. Finally, the interpretability of G-CNN is examined by visualizing its intermediate outputs.
The rest of this paper is organized as follows. Section 2 describes the details of the proposed architecture. Section 3 describes the experiments and compares our model with strong baselines on three real-world datasets. Finally, the findings of this paper are discussed in Section 4.
The proposed detection logic and the components of the proposed architecture are shown in Figure 3. The present work centers on the feature representation of the monitored and exogenous data, and on threshold calculation based on the prediction errors.

The proposed detection logical structure, where e represents the prediction error.
For clarity in expression, some notations are introduced. More formally, given time series data as a sequence of vectors X={
Learning the spatio-temporal relationship between the monitored data and the exogenous data is the key to prediction. The exogenous data C may be multivariate with spatial correlation, while the monitored data T are time-dependent; if they are handled by different neural networks, each suited to the corresponding data type and nature, each network can be tuned and improved separately. In the present work, a dual-input hybrid neural network, G-CNN, consisting of CNN and GRU layers is put forward for this purpose. In G-CNN, the spatial correlation and short-term dependency of the exogenous data C within the time series window are extracted by the CNN layers, and the long-term temporal dependency in the monitored data T is extracted by the GRU layers. That is,
GRU for monitored data
First, for the monitored data with temporal dependency, a GRU is adopted to fit and represent the features and to generate the function G (T). The vanishing gradient problem of the vanilla RNN is effectively alleviated in the GRU by a gating mechanism [13], which allows each unit to selectively retain and forget information. The structure of the GRU is simpler than that of the LSTM, yet it can still process sequence data efficiently. In G-CNN, the GRU component contains two GRU layers and a dropout layer, which automatically extract higher-level temporal features of the time series. Each GRU layer is followed by a RELU activation function, which allows the GRU to capture complex features in the input signal. The state of each GRU unit is updated by the activation of each gate. Each k-th unit contains an update gate
When the monitored data T are fed to the GRU, the corresponding exogenous data C are input to the other component of the model, the CNN, which contains a convolutional layer, a dropout layer and a pooling layer. Figure 3 shows the details of the convolution operation. Convolution reduces the number of connections in the network and thus the risk of overfitting. Each layer is followed by an activation function to capture more complex features. The exogenous data C are first input into a convolutional layer to extract the main feature map; Eq. (5) derives the output
Subsequently, a dropout layer is applied to reduce the complexity of G-CNN and avoid overfitting; the output of the dropout layer then enters an average pooling layer. The average pooling layer can be seen as a special convolution, which aims to reduce the size of the feature map and the number of parameters while retaining the significant features for the next operation. Eq. (6) gives the operation of the average pooling layer, where β(l+1) represents the weight of each pooling filter and r refers to the size of the pooling filter.
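The convolution and average pooling steps described above can be illustrated with a small pure-Python sketch. It is not a transcription of Eq. (5) and Eq. (6): the kernel values, the binary input channel and the unweighted average (i.e., ignoring the pooling weight β) are assumptions made for illustration, and dropout is omitted because it is inactive at inference time.

```python
def relu(x):
    """RELU(x) = max(x, 0)."""
    return max(x, 0.0)

def conv1d_same(signal, kernel, bias=0.0):
    """1-D convolution (cross-correlation, as in most deep learning libraries)
    with zero 'same' padding, followed by RELU. Output length equals input length."""
    k = len(kernel)
    left = (k - 1) // 2
    padded = [0.0] * left + list(signal) + [0.0] * (k - 1 - left)
    return [
        relu(sum(w * padded[i + j] for j, w in enumerate(kernel)) + bias)
        for i in range(len(signal))
    ]

def average_pool(feature_map, r):
    """Non-overlapping average pooling with pool size r (a trailing remainder is dropped)."""
    return [
        sum(feature_map[i:i + r]) / r
        for i in range(0, len(feature_map) - r + 1, r)
    ]

# Hypothetical binary exogenous channel (e.g., an encoded command flag).
commands = [0, 1, 1, 0, 0, 1, 0, 0]
feature_map = conv1d_same(commands, kernel=[0.5, 1.0, 0.5])
pooled = average_pool(feature_map, r=2)
```

With same padding the feature map keeps the length of the input (8 values here), and pooling with r = 2 halves it to 4 values while preserving where the command activity is concentrated.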
RELU (x) = max(x, 0) is the activation function. We use CNN to extract the local features of exogenous data C to obtain the
Finally, the temporal features captured by the GRU and the spatial and local features captured by the CNN are combined to produce the final predicted values of the hybrid G-CNN. Fusing the different extracted features improves the robustness of the model and helps it better fit the relationships in the monitored data. The feature fusion layer combines the features extracted by the GRU and the CNN, i.e., G (T) and H (C) respectively, to form a joint feature. The features extracted by the GRU and the CNN are mapped to the same feature space through fully connected layers before fusion. After that, the fused feature is input to another fully connected layer to predict the monitored data
Here, f, f1 and f2 denote fully connected layers, and W_T and W_C represent the corresponding weights of G (T) and H (C). G (T) denotes the vectors fitted by the stacked GRU based on T, while H (C) denotes the feature maps produced by the CNN based on C.
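One plausible reading of the fusion step is sketched below in pure Python. The exact fusion formula is not reproduced here, so the form f(W_T · f1(G(T)) + W_C · f2(H(C))), the use of scalar weights for W_T and W_C, and all numeric values are assumptions made purely for illustration.

```python
def dense(x, weights, bias):
    """Fully connected layer without activation: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def fuse(g_t, h_c, f1, f2, w_t, w_c, f):
    """Map both feature vectors to a shared space, combine them with the
    weights w_t and w_c, and pass the joint feature to a final dense layer."""
    a = dense(g_t, *f1)   # GRU features projected by f1
    b = dense(h_c, *f2)   # CNN features projected by f2
    joint = [w_t * u + w_c * v for u, v in zip(a, b)]
    return dense(joint, *f)

# Hypothetical feature vectors: 2 GRU features and 3 CNN features.
g_t = [1.0, 2.0]
h_c = [0.5, 0.5, 1.0]

# Hypothetical layer parameters as (weights, bias); f1 and f2 both map to 2 dimensions.
f1 = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
f2 = ([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0])
f = ([[1.0, 1.0]], [0.0])

prediction = fuse(g_t, h_c, f1, f2, w_t=0.5, w_c=0.5, f=f)
```

The point of the sketch is the ordering of operations: the two branches may have different output sizes, so each is projected to a common dimension before the weighted combination, and only the joint feature is used for prediction.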
For each time step t, the prediction error is defined as
Inspired by [1], a new method, the double-referenced thresholding approach, is developed for setting the error threshold; it works out the threshold based on the distribution of the prediction errors both in a window and in the entire sequence. A sliding window is introduced to divide the prediction errors of an entire sequence into multiple sets of vectors. The prediction errors in an entire sequence can be denoted as a vector
In Eq. (10), α determines the weight of the prediction error distribution in the current sliding window, while 1-α represents the weight of that in the entire sequence. When α=1, only the prediction error distribution in the current sliding window is considered, which is called Nonparametric Dynamic(Non Parametric, assigned
To simplify formula, we set:
An objective function is defined as follows:
Eq. (13) is defined to find the suitable z and the threshold ɛ. |
The proposed method is implemented in the Keras framework. In this section, the datasets and baseline methods are first described. Then, the experimental setup and evaluation metrics are introduced. Next, comparative experiments show that the proposed method outperforms five baseline methods on three datasets and is widely applicable. Finally, the interpretability of G-CNN is discussed by visualizing the intermediate outputs of the neural networks.
Datasets
To evaluate the proposed method, three labeled real-world datasets are adopted: SMD (Server Machine Dataset), SMAP (Soil Moisture Active Passive satellite) and MSL (Mars Science Laboratory rover). Among them, SMAP [1] and MSL [1] are two spacecraft datasets from NASA. A spacecraft generally has thousands of channels that produce monitored data, including its power, voltage, angle activities, and others. SMD [18] is a new 5-week-long dataset from a large Internet company. SMD includes interpretation labels, which we regard as exogenous data and input into the CNN component. Table 1 summarizes the information on the spacecraft datasets, and Table 2 summarizes the information on the three datasets used during training.
Experimental SMAP and MSL Data Information
Experimental Data Information
LSTM+N-Par [1]: An LSTM-based method proposed to detect spacecraft anomalies; the nonparametric dynamic thresholding developed in that work achieves a balance between false positives and false negatives.
DAGMM [5]: This method, based on a deep autoencoding Gaussian mixture model, focuses on anomaly detection for multivariate data without temporal information between observations.
LSTM-VAE [19]: This method simply combines LSTM and VAE by replacing the feed-forward network in a VAE with LSTM.
OmniAnomaly [18]: This method uses a stochastic recurrent neural network to capture the normal patterns of time series, reconstructs the input data from the learned representations, and uses the reconstruction probabilities to determine anomalies.
Setup
The following hardware and software are used for the time series anomaly detection task. An Intel(R) Xeon(R) CPU with 20GB storage is adopted for G-CNN training. High-performance hardware can speed up the training of deep neural networks. The software is installed on the CentOS operating system. The programming language is Python 3.6.8, which is widely used for machine learning and deep learning. Numpy 1.14.6 is used to simplify matrix operations. Keras with a TensorFlow backend is used to run G-CNN. Compared with other deep learning libraries, Keras is convenient for experimentation because it is easy to use and well documented. G-CNN is trained with a batch size of 64 for 35 epochs using the above hardware and software.
Grid search is adopted to tune all hyper-parameters on a validation set (20%). The architecture parameters tuned for G-CNN in the experiment are summarized in Table 3. Each of the two hidden layers of the GRU component contains 80 units; the single convolutional layer in the CNN has 8 convolution kernels. Dropout layers and an early stopping strategy are adopted to prevent G-CNN from overfitting. Concretely, training is terminated when the validation loss has not decreased for 10 consecutive epochs. The input sequence length L is set to 250 and the size of the sliding window is initialized as h=2100, considering the balance between performance and computational cost. When the number of monitored values is less than 2100, h should be decreased to ensure that at least one sliding window is available. Finally, when the prediction error is greater than the threshold, the corresponding sequence is classified as abnormal and compared with the label, and the TP, FP and FN counts are then calculated. Table 4 shows the confusion matrix.
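The early stopping rule and the window-size adjustment described above can be written out explicitly. The sketch below is plain Python rather than the Keras code used in the experiments; the function names and the validation-loss values are hypothetical, and shrinking h to the sequence length is one simple way to guarantee at least one window.

```python
def stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training stops: the first epoch at which
    the validation loss has failed to improve for `patience` consecutive epochs,
    or the last epoch if that never happens."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)

def window_size(n_values, h=2100):
    """Shrink the sliding window so that at least one full window fits the sequence."""
    return min(h, n_values)

# Hypothetical validation losses over 35 epochs: improvement stops after epoch 3.
losses = [1.0, 0.8, 0.7] + [0.75] * 32
stop = stopping_epoch(losses, patience=10)

short_h = window_size(1500)   # sequence shorter than the default window
long_h = window_size(5000)    # default window is kept
```

In this example the best loss is reached at epoch 3, so training ends at epoch 13 instead of running the full 35 epochs.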
Model Parameters
The Confusion Matrix
For convenience in performance comparison, we employ precision, recall and F0.5-score to evaluate all experimental results. The performance metrics are defined as follows.
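The three metrics can be computed from the TP, FP and FN counts as sketched below. The sketch assumes the standard definitions, precision = TP/(TP+FP), recall = TP/(TP+FN) and F_β = (1+β²)·P·R/(β²·P+R) with β = 0.5; the function names and the sample flags are hypothetical.

```python
def confusion_counts(predicted, labels):
    """Count true positives, false positives and false negatives
    from boolean per-sequence predictions and ground-truth labels."""
    tp = sum(1 for p, l in zip(predicted, labels) if p and l)
    fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
    fn = sum(1 for p, l in zip(predicted, labels) if not p and l)
    return tp, fp, fn

def precision_recall_fbeta(tp, fp, fn, beta=0.5):
    """Precision, recall and F-beta score; beta < 1 weights precision more than recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta

# Hypothetical predictions and labels for five sequences.
predicted = [True, True, False, True, False]
labels = [True, False, False, True, True]
tp, fp, fn = confusion_counts(predicted, labels)
p, r, f05 = precision_recall_fbeta(tp, fp, fn)
```

Choosing β = 0.5 reflects a preference for precision: a false alarm is penalized more heavily than a missed anomaly.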
In this section, the effectiveness of the proposed method is demonstrated by comparing its performance metrics with those of the baseline methods on three datasets. First, G-CNN and double-referenced thresholding are evaluated against the baseline method on the two spacecraft datasets, and the correlation between the exogenous and monitored data is examined. Next, for different spacecraft anomaly types, the performance of our method is compared with the state-of-the-art method for spacecraft anomaly detection. Then, the overall performance of the proposed method is compared with four baseline methods on three datasets. Finally, the interpretability of G-CNN is discussed by visualizing the intermediate outputs.
Model comparison (LSTM vs. G-CNN)
The experiment pairs (No.1 vs. No.2 and No.3 vs. No.4) in Table 5 provide a straightforward comparison between the basic models based on LSTM and G-CNN. G-CNN increases the overall precision by up to 8% (87.5% @No.1 to 95.60% @No.2) and the overall recall by up to 2.8% (80.00% @No.1 to 82.85% @No.2). On the SMAP dataset, the advantage of the hybrid G-CNN model is larger: precision increases by up to 13% (85.50% @No.1 to 98.39% @No.2) and recall by up to 3% (85.50% to 88.41%).
LSTM excels at capturing the long-term temporal dependency of sequence data, but cannot quantify the correlation between monitored data and exogenous data. In G-CNN, the GRU focuses on the monitored data to capture temporal features, while the CNN focuses on the exogenous data to capture spatial features. The experimental results also show that, for data of different types and attributes, processing them with more targeted modules gives better results. Moreover, it is worth pointing out that the anomaly detection performance on the MSL dataset is always slightly lower than on the SMAP dataset. This may be associated with two facts. First, the MSL dataset is smaller than SMAP, which weakens feature extraction on MSL. Second, although MSL executes a wide variety of commands with different regularities, many attributes are not fully represented in the training data. In addition, a small training dataset may not cover the essential features seen under real conditions, which reduces the effectiveness of models with higher fitting capacity.
Double-referenced vs. Non-parametric thresholding
The threshold approach proposed here considers not only the error distribution in the sliding window but also the error distribution of the entire sequence, and dynamically finds an optimized weight α, which makes the threshold calculation more flexible. The experiment pairs (No.1 vs. No.3 and No.2 vs. No.4) in Table 5 show that our threshold method significantly improves precision on both datasets. The maximal improvement in precision and F0.5-score occurs with the basic LSTM model (the comparison between No.1 and No.3): precision improves by 5.8% (=93.3% -87.5%) over the two datasets combined, by 7.4% (=100% -92.6%) on the MSL dataset and by 5.3% (=90.8% -85.5%) on the SMAP dataset. This shows that the new threshold method is more effective in reducing false positives.
Result Comparison of different models and thresholding approaches, * recalculated with Eq. (20) based on the precision and recall, and the best performance is displayed in boldface in each case. The No. indicates the serial number of each experiment
When the weight α is found dynamically, a larger α places more emphasis on the error distribution in the sliding window (see Eq. 10), whereas a smaller α places more emphasis on the error distribution of the entire sequence. To achieve parameter-free threshold calculation based on the distribution of prediction errors, α is best determined dynamically. Our method finds the optimized value of α, which can be output during the experiment. In most cases, the automatically calculated value of α is 0.9, which indicates that taking the prediction errors of the entire sequence into account, even with a small weight, helps improve the performance of anomaly detection.
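The role of α can be illustrated with a short sketch. It does not reproduce Eq. (10) or the objective of Eq. (13): it assumes that each reference has the mean-plus-z-standard-deviations form used in [1] and that the two references are combined as a convex combination with weight α. The function name, the value z = 2 and the error values are hypothetical, and z and α are fixed here rather than optimized.

```python
import statistics

def double_referenced_threshold(window_errors, sequence_errors, alpha, z):
    """Blend a window-level and a sequence-level reference threshold.
    alpha = 1 uses only the current sliding window; alpha = 0 uses only
    the entire sequence. Each reference is mean + z * std (an assumed form)."""
    def reference(errors):
        return statistics.mean(errors) + z * statistics.pstdev(errors)
    return alpha * reference(window_errors) + (1 - alpha) * reference(sequence_errors)

# Hypothetical smoothed prediction errors: a quiet stretch followed by a noisier one.
sequence_errors = [0.1, 0.1, 0.1, 0.1, 0.3, 0.3, 0.3, 0.3]
window_errors = sequence_errors[:4]   # current window covers the quiet stretch

window_only = double_referenced_threshold(window_errors, sequence_errors, alpha=1.0, z=2.0)
blended = double_referenced_threshold(window_errors, sequence_errors, alpha=0.9, z=2.0)
```

In this toy case the window-only threshold sits exactly at the level of the quiet window, so any slight increase in error would be flagged, whereas the blended threshold with α = 0.9 is lifted a little by the sequence-level reference. Under these assumptions, that is one way the entire-sequence term can reduce false positives.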
The correlation between the monitored data (
Performance for different anomaly types
Table 6 details the precision and recall for point and contextual anomalies on the spacecraft datasets obtained by LSTM and our model. The proposed method performs better in identifying contextual anomalies than point anomalies. For contextual anomaly detection, the overall precision increases by 26.3% (73.7% to 100%) and the overall recall by 7.7% (69.0% to 76.7%).
Precision and Recall for different anomaly types
Table 5 shows that, on all three metrics, the newly proposed method (G-CNN+D-Ref, see No.4) benefits spacecraft anomaly detection. Compared with the cutting-edge method (No.1), overall precision improves by 11.4% (=98.90% -87.50%), and the overall F0.5-score reaches its maximum of 0.93. On the SMAP dataset in particular, the precision reaches 100% and the corresponding F0.5-score reaches 0.95, the maximum among all methods.
To further study the applicability and overall performance of our method, six groups of comparative experiments are conducted on three datasets, as listed in Table 7. In OmniAnomaly, a stochastic recurrent neural network is trained to reconstruct the time series, and the reconstruction probabilities are used to determine anomalies. However, the fixed-length vector in the Encoder is not sufficient to represent the temporal dependency as the data size increases. DAGMM ignores the temporal information between observations, which makes it inferior in reconstructing current observations. LSTM-VAE simply combines LSTM and VAE to represent the normal patterns and uses the reconstruction error distribution to detect anomalies. However, LSTM-VAE does not consider temporal dependence. L-CNN denotes the combination of LSTM and CNN: we replace the GRU in G-CNN with LSTM to complete the comparison experiment. For G-CNN, we input the monitored data into the GRU and CNN layers simultaneously to capture the temporal and local spatial relationships of the input data. The structure of the GRU is simpler than that of the LSTM, and the results also show that the GRU performs better than the LSTM on the current datasets. Furthermore, double-referenced thresholding achieves parameter-free threshold calculation. To show the training process of G-CNN more intuitively, the training and validation loss of G-CNN is plotted in Fig. 4.
Result comparison of different methods over SMAP, MSL, SMD datasets, the best performance is displayed in boldface in each case
To further verify the coupling effect of the GRU and CNN in G-CNN anomaly detection, ablation experiments are performed on the Space Shuttle dataset, as shown in Table 8. The results show that G-CNN performs better than either single-component model, although the GRU and the CNN individually also work rather well.
Result comparison of different network with D-Ref over Space Shuttle, the best performance is displayed in boldface in each case
We conduct additional experiments to show that the dual-input mechanism of G-CNN is reasonable and effective. As shown in Table 9, we compare G-CNN with a GRU-CNN without the dual-input mechanism (WODIM) that uses the same parameters, and measure the time cost of both. G-CNN takes less time and performs better than the WODIM variant. Handing different types of data to different network components allows features to be captured more accurately.
Result comparison of G-CNN without dual-input mechanism (WODIM) and G-CNN (ours); the best performance is displayed in boldface in each case. P refers to precision, R refers to recall and T refers to time cost.
The most widely known shortcoming of neural networks with a large number of parameters is their "black box" nature. The training process of a neural network is not intuitive, but it can be better understood by visualizing the output of each layer in the network. Figure 5 shows the intermediate outputs of four kernel filters in the CNN of the G-CNN network. The CNN layer consists of 8 kernels, each of the same size but with a different weight matrix; they capture the short-term temporal dependency and spatial correlation of the input signals and produce different outputs through the convolution operation. As shown in Figure 5, for the same input, the outputs of the four filters differ, but they all reflect the basic characteristics of the original input to varying degrees. Same (zero) padding is adopted in the convolution operation. Therefore, much of the original information of the input data is retained in the feature map, and the output feature map has the same size as the test data. The visualization of intermediate outputs is useful for analyzing the feature extraction process of G-CNN.

The loss plot of G-CNN for three datasets: (a) SMAP dataset, (b) MSL dataset, (c) SMD dataset.

Intermediate outputs of various kernel filters on the SMAP dataset.
In summary, dealing with the monitored data and the exogenous data separately and then fusing their features not only significantly improves anomaly detection performance, but also helps to verify the correlation between the monitored data and the exogenous data. For datasets without exogenous data, the monitored data are input simultaneously into the GRU and CNN layers of G-CNN, which also works very well. In addition, introducing the prediction error distribution of the entire sequence into the threshold calculation effectively improves precision and the performance of contextual anomaly detection.
The experimental results above show that our model performs well on different types of time series data from several fields. The interpretability of the CNN component of G-CNN is illustrated by visualizing its intermediate outputs. The experiments also show that threshold setting plays an important role in the anomaly detection algorithm. In future work, we will focus on improving the explainability of the neural network and further exploring methods of threshold setting.
Conclusion
In this paper, a novel dual-input hybrid neural network model, G-CNN, for feature representation in time series data, and a double-referenced thresholding approach for threshold setting are presented. For time series anomaly detection involving exogenous data, the exogenous data and the monitored data are handled by a one-layer CNN and a two-layer GRU, respectively, which not only improves the overall performance of anomaly detection but also offers the potential to characterize the relationship between the monitored data and the exogenous data. Our approach improves precision by 10% to 40% compared with other methods. Compared with the results of DAGMM, our method achieves a balance between precision and recall. Compared with the recent OmniAnomaly method, the precision of our method on the three datasets is increased by 10% to 20%, the recall on the SMD dataset is increased by 5%, and the overall F-score is increased by 1% to 17%. The double-referenced thresholding approach considers the prediction errors both in a sliding window and in the entire sequence, and calculates the threshold dynamically based on the prediction errors, which remarkably improves the performance of anomaly detection. Experimental results show that the proposed method has strong scalability and applicability, and the interpretability of G-CNN is demonstrated by visualizing the intermediate outputs of the neural networks.
Acknowledgment
The research was partially funded by the National Natural Science Foundation of China (Grant Nos. 51661005, U1612442 and 61672221), the National Outstanding Youth Science Program of National Natural Science Foundation of China (Grant No. 61625202), and the China Scholarship Council (Grant No. 201706130080).
