Abstract
Multivariate time series anomaly detection has made significant progress and has been studied in many fields. One of the difficulties in time-series data analysis is the complex nonlinear dependencies between multiple time steps and multiple variables, which make detecting anomalies in these data challenging. Although many studies have used classical attention mechanisms to model the temporal patterns of data, few have combined multiple attention mechanisms to analyze both the temporal characteristics and the feature correlations of the data. Therefore, we propose an autocorrelation and attention mechanism-based anomaly detection (ACAM-AD) framework that combines three components to model the complex dependencies of data in the temporal and feature dimensions: an autocorrelation mechanism based on the Autoformer model, which is superior to the self-attention mechanism; a multi-head graph attention network; and a dot-product attention mechanism. An autoregressive model runs in parallel with the neural network, and a sparse autocorrelation mechanism and a sparse graph attention network are used to reduce model complexity. Experiments on public datasets show that the model is effective and performs better than the baseline models.
Introduction
Sensor devices are used increasingly in many fields in the Big Data era for monitoring and generating time-series data [19, 29]. Research on multivariate time-series data has been conducted on spacecraft [22], server machines [32], and water treatment systems [18], among others. The use of these data to identify anomalies [7, 25] is critical because accurate anomaly detection can trigger early warning systems to enable timely troubleshooting to avoid revenue loss [10, 14].
However, detecting anomalies in time-series data is challenging due to complex patterns and data inconsistencies. Complex nonlinear dependencies exist between various variables and time steps. In addition, the dependencies may change dynamically over time, increasing the difficulty of analysis. Therefore, a major challenge for anomaly detection in multivariate time series data is capturing the dynamic dependencies between multiple time steps and variables. This problem has not been adequately addressed in many papers.
Due to the popularity of the transformer model [4], the self-attention mechanism has been increasingly utilized in various fields. Many improved models applicable to the temporal domain have been proposed, such as the Autoformer model [15]. This model uses an autocorrelation mechanism that models the temporal correlation of data more efficiently than the self-attention mechanism.
To model the data information comprehensively, we propose an autocorrelation and attention mechanism-based anomaly detection (ACAM-AD) framework that analyzes time-series data from the feature and temporal perspectives to improve anomaly detection accuracy. In ACAM-AD, the data are input into three parallel branches. The autocorrelation layer models the temporal patterns of the data, the multi-head graph attention layer and gated recurrent unit (GRU) attention layer model the complex feature and temporal patterns, and the autoregressive (AR) layer learns the overall trend of the data. The model then predicts the values and uses extreme value theory to select a threshold for detecting anomalies. Each layer has a specific focus, and the framework comprehensively models the complex relationships of the data from multiple perspectives. To the best of our knowledge, this is the first time that an autocorrelation mechanism has been combined with a graph attention network (GAT) and a dot-product attention mechanism to model the temporal and feature characteristics of data for time-series anomaly detection.
The main contributions of this paper are as follows: (1) The autocorrelation mechanism, GAT, and GRU-Attention mechanism are used to model the correlation between the temporal characteristics and features of time-series data. (2) The graph attention layer is preceded by a graph structure learning layer, which assigns K neighbors to each feature node to reduce the number of operations, unlike the traditional GAT that considers all features as neighbors. (3) In the autocorrelation module, after the vector similarity of the data is assessed, only the top K results with the strongest correlation are retained and assigned weights. This sparse processing reduces model complexity. (4) A statistical AR model based on ensemble learning theory is established in parallel with the neural network to improve model robustness.
The remainder of this paper is organized as follows:
Section 2 describes related work in time-series anomaly detection. Section 3 explains the architecture of the ACAM-AD model. Section 4 presents the experimental setup and results and discusses the results. Section 5 is the conclusion.
Related work
Time-series anomaly detection is a complex task and has been extensively studied. Commonly used traditional methods are the 3-sigma criterion method and the local outlier factor (LOF) algorithm [2], which regard outliers in the data as anomalies. In addition to traditional methods, deep learning methods have received significant attention due to their improved ability to model complex relationships between time series. The most widely used methods are recurrent neural networks (RNNs) [20, 23] and their variants, long short-term memory (LSTM) networks [30], and GRUs [17, 21].
Unsupervised anomaly detection methods based on deep learning have developed rapidly in recent years. The LSTM-based variational autoencoder (LSTM-VAE) [8] was proposed in 2017. It combines the LSTM and VAE [9] to project multivariate time-series data into a latent space and has shown excellent performance on multiple datasets. In 2018, the LSTM network with nonparametric dynamic thresholding (LSTM-NDT) [22] was developed for automatic anomaly detection in telemetry data sent back from spacecraft. In the same year, the deep autoencoding Gaussian mixture model (DAGMM) [6] was proposed for anomaly detection in multivariate data without temporal correlation. In 2019, Su et al. [32] proposed OmniAnomaly, a stochastic model for multivariate time-series anomaly detection that captures the normal patterns in the data by learning feature representations. In 2020, Zhao et al. [16] proposed multivariate time-series anomaly detection via a graph attention network (MTAD-GAT), which uses two parallel GATs to process the data separately and achieve higher accuracy. In the same year, a model for multivariate time-series forecasting with graph neural networks (MTGNN) was developed [33]. It learns the unidirectional relationships between variables using graph neural network (GNN) [24] modules to better capture the dependencies in multivariate data. In 2021, Deng and Hooi [1] proposed a graph deviation network (GDN) that uses a GNN to learn the relationships between variables and enhance the interpretability of the model. The literature review indicates that most recent approaches analyze the data from a single perspective. Few models consider both temporal characteristics and feature patterns when modeling the information contained in the data.
Attention mechanisms have been used increasingly in the analysis of time-series data. Self-attention mechanisms have replaced recurrent models, such as LSTM, for time-series prediction due to their superior performance. The transformer model was proposed in 2017 for processing sequence data in natural language processing [4]. It uses a self-attention mechanism rather than RNNs. The dual self-attention network (DSANet) proposed in 2019 [31] has a two-branch architecture that combines a self-attention mechanism with convolutional neural networks for time-series prediction and has achieved good results. The Autoformer model [15] was proposed in 2021. It improves on the transformer model by replacing self-attention with an autocorrelation mechanism that is better suited to long time series, and it reported state-of-the-art performance for long-term time-series forecasting.
The excellent performance of the autocorrelation mechanism and parallel frameworks has inspired our work. In this study, we combine these methods with attention mechanisms that have performed well in time-series prediction and establish a three-branch framework (ACAM-AD) to model data dependencies by considering temporal and feature dimensions to detect anomalies. In addition, many methods have obtained good results at the expense of training speed; therefore, model efficiency is considered in this study. We use sparse attention mechanisms in our model to minimize the training time.
Proposed framework
We define the problem addressed in this study in Section 3.1 and outline the proposed framework in Section 3.2. In Section 3.3, we describe our approach to data preprocessing. We elaborate on the components of the prediction part of the model in Section 3.4 and describe the anomaly detection aspect in Section 3.5.
Problem statement
The goal of this study is anomaly detection in multivariate time-series data. We denote the multivariate time-series data as $X = \{x_1, x_2, \ldots, x_{T_{train}}\}$, where $x_t \in \mathbb{R}^k$, $T_{train}$ is the maximum length of the timestamp, and $k$ is the number of input variables. Next, we define $X_t$, the input time window with length $T$ at a given moment $t$. The input data matrix is denoted as $X_t = \{x_{t-T}, x_{t-T+1}, \ldots, x_{t-1}\} \in \mathbb{R}^{k \times T}$, which is used to predict the anomaly at time $t$, i.e., the label $y_t \in \{0, 1\}$, where $y_t = 1$ indicates an anomaly.
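To illustrate the windowing in the problem statement, the following sketch builds the input windows from a series of shape (T_train, k). It is a minimal example under stated assumptions, not the authors' implementation; the function name `make_windows` and the toy data are introduced here for illustration only.

```python
import numpy as np

def make_windows(series: np.ndarray, T: int):
    """Build windows X_t = [x_{t-T}, ..., x_{t-1}] and the targets x_t.

    series has shape (T_train, k); each window is a (k, T) matrix.
    """
    n_steps, _ = series.shape
    windows, targets = [], []
    for t in range(T, n_steps):
        windows.append(series[t - T:t].T)  # transpose to shape (k, T)
        targets.append(series[t])          # value to be predicted at time t
    return np.stack(windows), np.stack(targets)

# Toy series with 6 time steps and k = 2 variables (hypothetical data).
series = np.arange(12, dtype=float).reshape(6, 2)
X_win, x_next = make_windows(series, T=3)
print(X_win.shape, x_next.shape)
```

Each window contains only values before time t, so the prediction for time t never sees the value it is supposed to predict.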
Overview
The process of the ACAM-AD model is summarized in Algorithm 1, and the framework is shown in Fig. 1.

Structure of the ACAM-AD framework for multivariate time-series anomaly detection.
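Algorithm 1: Training of the ACAM-AD model
X ← normalize(X)
Autocorrelation, graph learning, GAT, GRU-Attention, AR ← initialize weights
n ← 1
while n ≤ number of training iterations do
    L ← loss between the predicted and true values
    Autocorrelation, graph learning, GAT, GRU-Attention, AR ← update weights using L
    n ← n + 1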
The steps are as follows: The temporal pattern of the data is modeled using the autocorrelation mechanism, and the data are decomposed into seasonal and trend parts. These are predicted separately and combined. We use the graph structure layer to generate K-nearest neighbors of each feature node and use the multi-head GAT and GRU-Attention for the analysis of the features and temporal characteristics, respectively. The outputs of the autocorrelation and GAT layers are concatenated and combined with the linear AR model results after the fully connected layer to obtain the model predictions.
The predicted and true values of the model are processed using the peaks over threshold (POT) selection method to obtain the threshold. The anomaly scores are then compared with this threshold to determine the outliers.
We use maximum-minimum normalization to normalize the data because the variables have different data ranges.
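A minimal sketch of maximum-minimum normalization is given below. Two details are assumptions made here and are not stated in the text: the minimum and maximum are taken from the training set only and reused for the testing set, and a small `eps` guards against constant features.

```python
import numpy as np

def min_max_normalize(train: np.ndarray, test: np.ndarray, eps: float = 1e-8):
    """Scale each feature to [0, 1] using the training-set range.

    train and test have shape (time steps, k).
    """
    lo = train.min(axis=0)
    hi = train.max(axis=0)
    scale = np.maximum(hi - lo, eps)  # avoid division by zero for constant features
    return (train - lo) / scale, (test - lo) / scale

train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[20.0, 10.0]])
train_n, test_n = min_max_normalize(train, test)
print(train_n)
print(test_n)
```

Test values outside the training range fall outside [0, 1] under this choice, which keeps unusually large readings visible to the detector.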
Auto-correlation layer
The autocorrelation layer is used to model the time-series data from a temporal perspective. This layer consists of the autocorrelation mechanism and sequence decomposition and has a similar progressive decomposition architecture as the Autoformer model. The model performs alternating decomposition and optimization during the prediction process. The trend is extracted gradually using a cumulative approach and is predicted separately from the seasonal term. The result is the sum of the trend and seasonal components.
Unlike the self-attention mechanism, which aims to determine the similarity between two sequences, the autocorrelation mechanism calculates the correlation between the historical and present sequences. Its coefficient is defined as the similarity between the sequence {Xt,i} and its τ-step delay {Xt-τ,i} of the feature i:
Due to the sequential aspect of time-series data, we add a mask tensor to the autocorrelation mechanism with values of 1 and 0, representing masked or unmasked positions. The objective of the mask tensor is to prevent the use of future information. The mask structure ensures that the current output value depends only on the current and past data, i.e., the output
The autocorrelation coefficients are calculated using the fast Fourier transform (FFT) [26] based on the Wiener-Khinchin theorem to improve the computational efficiency and reduce the training time. The data are subsequently inverse-transformed. It has been shown [15] that this method reduces the time complexity from $O(L^2)$ to $O(L \log L)$. It is calculated as follows:
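The FFT-based computation can be sketched as follows for a single feature. The sketch assumes the circular (periodic) form of the autocorrelation and a 1/L normalization; the function names are hypothetical. The direct O(L^2) version is included only to check that both give the same values.

```python
import numpy as np

def autocorr_fft(x: np.ndarray) -> np.ndarray:
    """Circular autocorrelation R(tau) = (1/L) * sum_t x[t] * x[(t - tau) mod L].

    Uses the Wiener-Khinchin theorem: R is the inverse FFT of the power spectrum.
    """
    L = len(x)
    spectrum = np.fft.rfft(x)
    power = spectrum * np.conj(spectrum)
    return np.fft.irfft(power, n=L) / L

def autocorr_direct(x: np.ndarray) -> np.ndarray:
    """Reference O(L^2) computation of the same quantity."""
    L = len(x)
    return np.array([np.dot(x, np.roll(x, tau)) / L for tau in range(L)])

t = np.arange(64)
x = np.sin(2 * np.pi * t / 8)  # toy series with period 8
R = autocorr_fft(x)
print(np.allclose(R, autocorr_direct(x)))
```

For this toy series the autocorrelation peaks at delays that are multiples of the period, which is the property the autocorrelation mechanism exploits to find period-based dependencies.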
We sparsify the autocorrelation mechanism because not every pair of sequences is correlated; this prevents irrelevant sequences from affecting the result and reduces the memory requirements. Only the K1 sequences with the largest correlation values $R_{XX}$ are selected, i.e., the attention weights R corresponding to these K1 sequences are retained, and the remaining attention values are set to 0.
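The sparsification step can be sketched as below. Normalizing the retained values with a softmax follows the Autoformer design and is an assumption here, since the text only states that the top K1 weights are retained; the function name is hypothetical.

```python
import numpy as np

def sparse_topk_weights(corr: np.ndarray, k1: int) -> np.ndarray:
    """Keep the k1 largest correlation values and set all others to 0.

    The retained values are normalized with a softmax (assumption).
    """
    corr = np.asarray(corr, dtype=float)
    top_idx = np.argsort(corr)[-k1:]                    # indices of the k1 largest values
    weights = np.zeros_like(corr)
    shifted = np.exp(corr[top_idx] - corr[top_idx].max())  # numerically stable softmax
    weights[top_idx] = shifted / shifted.sum()
    return weights

corr = np.array([0.1, 0.9, -0.3, 0.5, 0.7])
w = sparse_topk_weights(corr, k1=2)
print(w)
```

Only the two strongest delays receive nonzero weight in this example, so later aggregation touches K1 terms instead of all of them.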
Since many data have complex temporal patterns, we use the Autoformer model [15] and perform decomposition [28] to divide the series into a more regular seasonal part and a smoother trend part. These two parts reflect the short-term volatility and long-term trend of the time series, respectively. The schematic diagram is shown in Fig. 2.

Schematic diagram of sequence decomposition.
We use the moving average method for smoothing and perform a padding operation to keep the series length constant. The calculation is as follows.
We use $X_{se}, X_{tr} = \mathrm{seriesDecomp}(X)$ to summarize the sequence decomposition process, which is the second internal block of the model in the autocorrelation layer.
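A minimal sketch of the decomposition is shown below. It assumes that the padding replicates the first and last values (as in the Autoformer model) and that the seasonal part is the residual after subtracting the moving-average trend; `kernel_size` is a hypothetical parameter name.

```python
import numpy as np

def series_decomp(x: np.ndarray, kernel_size: int):
    """Split x of shape (length, k) into seasonal and trend parts.

    The trend is a moving average; edge values are replicated so that the
    output keeps the input length (padding scheme is an assumption).
    """
    front = (kernel_size - 1) // 2
    back = kernel_size - 1 - front
    padded = np.concatenate(
        [np.repeat(x[:1], front, axis=0), x, np.repeat(x[-1:], back, axis=0)],
        axis=0,
    )
    kernel = np.ones(kernel_size) / kernel_size
    trend = np.stack(
        [np.convolve(padded[:, i], kernel, mode="valid") for i in range(x.shape[1])],
        axis=1,
    )
    seasonal = x - trend
    return seasonal, trend

t = np.arange(40, dtype=float)
x = np.stack([0.1 * t + np.sin(2 * np.pi * t / 8), 0.05 * t], axis=1)
x_se, x_tr = series_decomp(x, kernel_size=9)
print(x_se.shape, x_tr.shape)
```

By construction the two parts sum back to the input, so no information is lost when they are predicted separately and recombined.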
Incorporating the above modules, such as the autocorrelation mechanism, increases the complexity of the network structure, which may reduce stability and cause vanishing gradients. Therefore, we incorporate residual connections, which alleviate vanishing gradients and substantially improve model performance.
The flow in the autocorrelation layer is defined as follows:
A cumulative approach is used to extract the trend information step by step as follows:
The value predicted by the Auto-Correlation layer contains two components: the cumulative result of the trend component and the weighted sum of the seasonal component:
We treat each feature as a node and use the sparse multi-head GAT layer to capture the similarity between different nodes. First, we use graph structure learning to calculate the most relevant K2 vector sequences of each sequence, i.e., the K2 neighbor nodes. Then we use the multi-head GAT mechanism to integrate the information of the K2 neighbors.
We use an embedding vector z for each feature [1]. The embedding vectors are randomly initialized, input into the model, and trained together to update z. They are used to learn the graph structure to determine which features are related. The similarity between the embedding vectors represents the similarity between features.
Our original data have k features. These data are regarded as a graph representing the relationship between features, where the nodes represent features and the edges represent the dependencies between features. An edge from one node to another represents the influence of the first feature on the second feature.
Because the influence between features is generally asymmetric, an undirected graph is not sufficiently accurate. Thus, we use a directed graph, represented by the adjacency matrix $A$, where $A_{ij}$ denotes the presence or absence of a directed edge from node $i$ to node $j$. Each feature $i$, with $i$ ranging from 1 to $k$, is given a set of candidate relationships $C_i$ with other features, i.e., features that may be related. In the absence of a priori information, the initial candidate set $C_i$ for feature $i$ contains all features.
We want to learn a graph structure with K2 neighbors per node because a standard GAT assumes interactions among all nodes, which may introduce interfering information. We first compute $e_{ij}$, i.e., the normalized dot product between the embedding vector $z_i$ of feature $i$ and the embedding vector $z_j$ of the candidate correlated feature $j$. Then, we select the K2 candidates with the largest normalized dot products as the relevant features of $i$. If $j$ is among them, $A_{ij}$ is 1, i.e., $j$ is a neighbor of $i$. The formula is expressed as follows.
The value of K2 can be chosen according to the desired sparsity level. The GAT module utilizes this newly learned adjacency matrix A.
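The graph structure learning step can be sketched as follows. The sketch assumes that a feature is not counted as its own neighbor, which the text does not state explicitly; the function name and toy embeddings are hypothetical.

```python
import numpy as np

def learn_adjacency(z: np.ndarray, k2: int) -> np.ndarray:
    """Return A with A[i, j] = 1 if j is among the k2 features most similar to i.

    z has shape (k, d): one embedding vector per feature.
    """
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    e = (z @ z.T) / (norms * norms.T)     # normalized dot product e_ij
    np.fill_diagonal(e, -np.inf)          # assumption: exclude self-loops
    top = np.argsort(e, axis=1)[:, -k2:]  # top-k2 candidates for each feature
    A = np.zeros(e.shape, dtype=int)
    rows = np.arange(z.shape[0])[:, None]
    A[rows, top] = 1
    return A

z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
A = learn_adjacency(z, k2=1)
print(A)
```

Each row of A has exactly K2 ones, so the subsequent attention step aggregates over K2 neighbors per node rather than over all k features.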
The GAT mechanism [27] is a GNN model. It performs better for capturing and modeling the correlations between different types of graph nodes than traditional attention mechanisms.
The i-th feature, whose corresponding graph has K2 nodes, is denoted as $\{v_1, \cdots, v_{K_2}\}$, representing K2 features. The input vector of the i-th node is $v_i$, and the output vector
Here ⊕ denotes the combination of two vectors, LeakyReLU is a nonlinear activation function [5], and w is a learnable weight parameter.
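Because the attention equations did not survive in this section, the following sketch shows one common form of graph attention that is consistent with the symbols named above. It is an assumption, not necessarily the authors' exact formulation: the attention logit is LeakyReLU(w · (v_i ⊕ v_j)) with ⊕ taken as concatenation, the logits are normalized with a softmax over the neighbors, a sigmoid is applied to the aggregated vector, and heads are averaged. Every node is assumed to have at least one neighbor.

```python
import numpy as np

def leaky_relu(x: np.ndarray, slope: float = 0.2) -> np.ndarray:
    return np.where(x > 0, x, slope * x)

def gat_layer(V: np.ndarray, A: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One attention head. V: (k, d) node vectors, A: (k, k) adjacency, w: (2d,)."""
    k, d = V.shape
    out = np.zeros((k, d))
    for i in range(k):
        nbrs = np.flatnonzero(A[i])                    # neighbors of node i
        v_i = np.repeat(V[i][None, :], len(nbrs), axis=0)
        pairs = np.concatenate([v_i, V[nbrs]], axis=1)  # v_i concatenated with v_j
        e = leaky_relu(pairs @ w)                      # attention logits
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                           # softmax over the neighbors
        out[i] = 1.0 / (1.0 + np.exp(-(alpha @ V[nbrs])))  # sigmoid of weighted sum
    return out

def multi_head_gat(V: np.ndarray, A: np.ndarray, heads: list) -> np.ndarray:
    """Average the outputs of several heads with different weight vectors."""
    return np.mean([gat_layer(V, A, w) for w in heads], axis=0)

V = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
A = np.ones((3, 3), dtype=int) - np.eye(3, dtype=int)
rng = np.random.default_rng(0)
heads = [rng.normal(size=4) for _ in range(3)]
print(multi_head_gat(V, A, heads).shape)
```

With an all-zero weight vector the attention becomes uniform, so the layer reduces to a sigmoid of the neighbor mean; this is a convenient sanity check.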
A multi-head GAT layer is used to improve its generalization ability; its structure with 6 nodes and 3 heads is schematically shown in Fig. 3.

Example of a feature-based multi-head graph attention layer.

Schematic of the GRU-Attention layer.
Multiple sets of GAT models with different weights are averaged to obtain the multi-head GAT, and multiple $v_j$ are combined to obtain the final
We input the output of the GAT layer into the GRU-Attention layer. This layer consists of a GRU, followed by an attention mechanism layer with an inner product kernel based on the temporal dimension. The GRU is a simplified variant of the LSTM that reduces the internal complexity of the gates and simplifies the model structure while still capturing long-term temporal information. It also helps alleviate the vanishing gradient problem.
A temporal attention mechanism is added after the GRU layer. The inner product kernel function used to measure the similarity between the two vectors $h_t$ and $h_{t-\tau}$ is as follows:
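A minimal sketch of temporal attention with an inner product kernel is given below. The similarity between $h_t$ and each earlier state $h_{t-\tau}$ is their dot product; normalizing the similarities with a softmax and forming a weighted sum of the earlier states are assumptions made here, and the function name is hypothetical.

```python
import numpy as np

def temporal_attention(H: np.ndarray):
    """H: (T, d) GRU hidden states; the last row is h_t.

    Returns the context vector (d,) and the attention weights (T - 1,).
    """
    h_t = H[-1]
    past = H[:-1]
    scores = past @ h_t                      # inner product <h_t, h_{t-tau}>
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over past time steps (assumption)
    context = weights @ past                 # weighted sum of the past hidden states
    return context, weights

H = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
context, weights = temporal_attention(H)
print(weights, context)
```

In the toy example the past state that points in the same direction as $h_t$ receives the larger weight.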
Due to the nonlinear nature of the attention mechanism component and the recurrent component, the neural network model may be complex and unstable. To reduce model complexity and enhance model robustness, we use the method in Ref. [12] and decompose the final prediction result of the model into the sum of the linear component, which is based on the overall trend, and the neural network component, which is based on specific fluctuations. We use the AR model as the linear component; it is computed as
The output of the AR module is
The outputs of the GRU-Attention layer
The loss function of the model uses the root mean square error between the predicted value
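The linear component and the loss can be sketched as follows. The sketch assumes an AR model whose coefficients are shared across features and applied to the most recent p values of each feature, and it adds the AR output to the neural network output as described above; `ar_predict`, `coeffs`, and `nn_out` are hypothetical names and the numbers are toy values.

```python
import numpy as np

def ar_predict(window: np.ndarray, coeffs: np.ndarray, bias: float) -> np.ndarray:
    """window: (k, T) with the most recent value in the last column.

    coeffs[0] multiplies x_{t-1}, coeffs[1] multiplies x_{t-2}, and so on.
    """
    p = len(coeffs)
    recent = window[:, -p:][:, ::-1]   # (k, p), column 0 holds x_{t-1}
    return recent @ coeffs + bias

def rmse(pred: np.ndarray, true: np.ndarray) -> float:
    """Root mean square error between predicted and true values."""
    return float(np.sqrt(np.mean((pred - true) ** 2)))

window = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
ar_out = ar_predict(window, coeffs=np.array([0.5, 0.25]), bias=0.1)
nn_out = np.array([0.4, -0.2])        # stand-in for the neural network output
y_hat = nn_out + ar_out               # final prediction: nonlinear plus linear part
print(y_hat, rmse(y_hat, np.array([2.5, 4.0])))
```

Because the AR part is linear in the inputs, it follows changes in the scale of the data directly, which is the robustness argument made for adding it in parallel.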
The detection phase of the model is described in Algorithm 2.
s_{t,i} ← anomaly score of feature i at time t
s_t ← max_i s_{t,i}
if s_t > th then y_t ← 1
else y_t ← 0
The predicted value of each feature at time t computed by the model is denoted as
When one of the features has an anomaly, that moment is regarded as an anomalous moment. We use the highest feature anomaly score $s_{t,i}$ at that moment as the anomaly score $s_t$. The resulting anomaly scores of the observed points at each moment are denoted as $\{s_1, s_2, \cdots, s_{T_{test}}\}$, where $T_{test}$ is the number of timestamps in the testing set.
The POT method [32] has been used increasingly for threshold selection in temporal data. It is based on extreme value theory and maximum likelihood estimation. Unlike a fixed threshold, the POT method selects the threshold automatically from the data given a small number of parameters, which is more effective for subsequent anomaly detection. The root mean square error between the predicted and true values at each timestep in the testing set is denoted as $\{l_1, l_2, \cdots, l_{T_{test}}\}$ and is used in the POT method to select the threshold automatically.
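The idea behind POT can be sketched as below. To keep the example short and dependency-free, the generalized Pareto parameters are estimated with the method of moments instead of the maximum likelihood estimation mentioned above; this is a deliberate simplification. The initial quantile `init_quantile` and the toy scores are assumptions.

```python
import numpy as np

def pot_threshold(scores, q: float = 1e-4, init_quantile: float = 0.98) -> float:
    """Peaks-over-threshold estimate of the level exceeded with probability q.

    A generalized Pareto distribution is fitted to the excesses over an
    initial high quantile (method of moments, not maximum likelihood).
    """
    scores = np.asarray(scores, dtype=float)
    t = np.quantile(scores, init_quantile)       # initial threshold
    excess = scores[scores > t] - t              # peaks over the initial threshold
    n, n_t = len(scores), len(excess)
    mean, var = excess.mean(), excess.var()
    gamma = 0.5 * (1.0 - mean ** 2 / var)        # shape parameter
    sigma = 0.5 * mean * (mean ** 2 / var + 1.0)  # scale parameter
    ratio = q * n / n_t
    if abs(gamma) < 1e-8:                        # exponential-tail limit
        return float(t - sigma * np.log(ratio))
    return float(t + sigma / gamma * (ratio ** (-gamma) - 1.0))

rng = np.random.default_rng(0)
toy_scores = rng.exponential(size=20000)         # hypothetical error scores
th = pot_threshold(toy_scores, q=1e-3)
print(th)
```

The parameter q plays the role of the desired exceedance probability: a smaller q yields a higher threshold and fewer alarms.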
The calculated anomaly score for each timestamp of the testing set is compared with the threshold. If the anomaly score $s_t$ corresponding to the $t$-th moment is greater than the threshold $th$, it is an anomaly, i.e., $y_t = 1$.
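The detection rule described above can be sketched in a few lines. The per-feature scores and the threshold value are toy numbers chosen for illustration, and the function name is hypothetical.

```python
import numpy as np

def detect(feature_scores: np.ndarray, th: float):
    """feature_scores: (T_test, k) per-feature anomaly scores s_{t,i}.

    Returns s_t (the maximum over features) and the labels y_t.
    """
    s = feature_scores.max(axis=1)   # anomaly score of each timestamp
    y = (s > th).astype(int)         # 1 marks an anomalous timestamp
    return s, y

feature_scores = np.array([[0.1, 0.2], [0.9, 0.3], [0.4, 0.5]])
s, y = detect(feature_scores, th=0.45)
print(s, y)
```

Taking the maximum means a single strongly deviating feature is enough to flag the timestamp.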
Sections 4.1 and 4.2 describe the datasets, metrics, and parameters used in the experiments. Section 4.3 compares the proposed model with other state-of-the-art methods to evaluate its performance and effectiveness. Section 4.4 investigates the properties of the model’s modules using an ablation study.
Datasets
We used five public datasets: a Mars Science Laboratory (MSL) rover dataset, a dataset obtained from the Soil Moisture Active Passive (SMAP) instrument, a server machine dataset (SMD), a water distribution (WADI) dataset, and a secure water treatment (SWaT) dataset.
The anomalous values in the MSL rover and SMAP datasets (https://github.com/khundman/telemanom) have been flagged by NASA experts [22]. The MSL and SMAP datasets have 55 and 25 indicator features, respectively.
The SMD (https://github.com/NetManAIOps/OmniAnomaly/tree/master/ServerMachineDataset) is a 5-week server dataset collected by a large Internet company [32]. It contains data from 28 server machines with 33 features. The SMD is divided into two subsets of equal size: the first half is the training set, and the second half is the testing set.
The SWaT dataset (https://itrust.sutd.edu.sg/testbeds/secure-water-treatment-swat/) consists of data obtained from industrial water treatment systems producing filtered water [13]. It was collected during 11 days of continuous operation: 7 days under normal operation and 4 days under attack scenarios [33].
The WADI dataset (https://itrust.sutd.edu.sg/testbeds/water-distribution-wadi/) was obtained from the WADI testbed, an extension of the SWaT testbed [3]. It consists of data collected during 16 days of continuous operation: 14 days under normal operation and 2 days under attack scenarios.
The statistics of these five datasets are listed in Table 1.
Dataset information
The precision (P), recall (R), and F1 values are used to evaluate the performances of the algorithms. They are calculated as follows.
We use the point adjustment method [32] to measure model effectiveness, as in other anomaly detection studies. In the original method, if the algorithm detects any point in an anomalous region, all points in the region are assumed to be correctly detected, even if only a few points were detected. We made the point adjustment method stricter by using a ratio of 50%, i.e., all points in an anomalous region are assumed to be correctly detected only if at least 50% of the anomalous points are detected.
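The evaluation can be sketched as follows, using the standard definitions P = TP / (TP + FP), R = TP / (TP + FN), and F1 = 2PR / (P + R). How regions below the 50% ratio are treated is an assumption here: their raw predictions are left unchanged. The function names and toy labels are hypothetical.

```python
def adjust_predictions(pred, label, ratio=0.5):
    """Mark a whole anomalous region as detected if at least `ratio`
    of its points were detected; otherwise keep the raw predictions."""
    adjusted = list(pred)
    i, n = 0, len(label)
    while i < n:
        if label[i] == 1:
            j = i
            while j < n and label[j] == 1:   # find the end of the region
                j += 1
            if sum(pred[i:j]) >= ratio * (j - i):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted

def precision_recall_f1(pred, label):
    tp = sum(1 for p, l in zip(pred, label) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(pred, label) if p == 1 and l == 0)
    fn = sum(1 for p, l in zip(pred, label) if p == 0 and l == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

label = [0, 1, 1, 1, 1, 0, 1, 1, 0, 0]
pred = [0, 1, 1, 0, 0, 1, 0, 0, 0, 0]
adjusted = adjust_predictions(pred, label)
print(adjusted, precision_recall_f1(adjusted, label))
```

In the toy example the first region has 2 of 4 points detected and is therefore fully credited, whereas the second region has none detected and stays missed.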
Parameter selection is a critical step in deep learning. We set the model parameters as follows.
The sliding window size T is 20, the number of GAT heads is 4, the GRU hidden size is 64, $q = 10^{-4}$ in the POT threshold selection, K1 = 14, and K2 = 20. The model is trained using the Adam optimizer with an initial learning rate of $10^{-4}$ for 10 iterations. All experiments were performed on an NVIDIA GeForce RTX 3090 GPU.
We compared the proposed model with five other state-of-the-art models: LSTM-NDT [22], DAGMM [6], OmniAnomaly [32], MTAD-GAT [16], and GDN [1]. The results are listed in Table 2.
Performance of the proposed ACAM-AD model and five baseline approaches
The ACAM-AD model outperforms the baselines in terms of the F1 values on all datasets except MSL, where GDN has the highest F1 value (0.9591). These results demonstrate the high accuracy and generalization ability of the proposed model. The results also show good consistency between the actual and predicted values and the overall trend on the training set.
All models perform relatively poorly on WADI due to the long sequence lengths and data modality. Specifically, the ACAM-AD model has a 17.17% higher F1 score than the state-of-the-art baseline models. The baseline method LSTM-NDT only scores high on the SMAP dataset but performs poorly on the other datasets, indicating that the model may be sensitive to the information in the datasets. The DAGMM model performs very well for short datasets like SMAP, but its scores are significantly lower for the other datasets with longer sequences. The reason is that this method does not use sequence windows but only a single GRU model. In contrast, the autocorrelation and GRU-Attention mechanisms of the proposed model accurately capture the temporal dependence of the time-series data. Recent models, such as MTAD-GAT and GDN, are the most advanced baselines. They use graph attention mechanisms to focus on specific modes of the data, and both perform well on the five datasets. However, MTAD-GAT relies only on GAT as its attention mechanism, and GDN does not model the temporal perspective. The proposed multi-module, multi-perspective model is more comprehensive than the other models and achieved the best F1 value on four of the five datasets.
Because this model performs anomaly detection by conducting prediction followed by detection, we provide visualizations of the prediction performance of the ACAM-AD model in Fig. 5 using features No. 1 and No. 8 of the SMD 1-1 dataset.

Actual and predicted values obtained from the ACAM-AD model on the SMD 1-1 training set.
We also use the SMD 1-1 sensor data set as an example and plot the predicted and true value fluctuation of features 1 and 8 on the testing set and the anomaly score (Fig. 6) to show the anomaly detection performance of the ACAM-AD model.

Anomaly detection performance of the ACAM-AD model on the SMD 1-1 testing set.
The results of plots one and two indicate some fluctuations due to outliers. The third plot shows the anomaly score and threshold. The fourth and fifth plots indicate high consistency between the true and predicted anomalies on the testing set. Figures 5 and 6 demonstrate the effectiveness and high accuracy of the model.
We analyzed the training time of the ACAM-AD and compared it to the model without sparse operations and another method using GAT. We measured the average time taken per epoch on the 5 public datasets. Table 3 lists the average training times in seconds for all models on all datasets per epoch, and Fig. 7 also shows some results. The training time of ACAM-AD is lower than those of the baseline methods.
Comparison of training times in seconds per epoch

Training time.
Four sets of ablation experiments were conducted to determine the influence of different modules on model performance. The experiments were conducted by removing the autocorrelation layer, GAT, attention mechanism, and AR layer from the ACAM-AD model and calculating the F1 values. The results are listed in Table 4. All other parameters had the same values to avoid interfering with our validity analysis.
The results show that each module is required in the model, and that removing a module affects model accuracy to a different degree depending on the module. Omitting the attention mechanism has the smallest impact on accuracy in most cases because the GRU and the autocorrelation mechanism also model the temporal dimension. Removing the AR model reduces the F1 value by less than 10%; its main role is to improve model robustness. Removing the GAT has the largest influence on model accuracy because it is the only module performing correlation analysis between features. Because of the large number of modules, the removal of one module does not degrade the performance significantly. The ablation study results show that each module of the model is necessary and effective.
Results of ablation study
We proposed the novel ACAM-AD framework for anomaly detection in multivariate time series data. The model consists of an autocorrelation mechanism, GAT network, and a GRU-Attention mechanism. It models the correlation between the temporal characteristics and features of time-series data. Sparse attention mechanisms were used to minimize memory usage and training time. Extensive experiments on spacecraft, server, and water treatment datasets show that the proposed algorithm provides higher accuracy than other state-of-the-art methods. Future research should focus on enhancing the anomaly diagnosis capability and interpretability of the models.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Nos. 62072024, 41971396), the R&D Program of the Beijing Municipal Education Commission (Nos. KM202210016002, KM202110016001), the Scientific Research Foundation of Beijing University of Civil Engineering and Architecture (Nos. KYJJ2017017, Y19-19), and the Projects of the Beijing Advanced Innovation Center for Future Urban Design, Beijing University of Civil Engineering and Architecture (Nos. UDC2019033324).
