Abstract
With the continued development of deep learning, long sequence time-series forecasting (LSTF) has attracted growing attention in power consumption, traffic, and stock prediction. Recent studies favor various improved Transformer models. Although these models have made breakthroughs in reducing the time and space complexity of Transformer, some problems remain: their predictive power is slightly lower than that of Transformer, and they ignore the importance of special values in the time series. To solve these problems, we designed a more concise network named Resformer, which has four significant characteristics: (1) The fully sparse self-attention mechanism achieves
Introduction
Time-series forecasting plays an important role in daily life, for example in sales analysis, enterprise decision-making, and smart grid management. It gives people a clearer understanding of the trend over the next period and so helps them make better decisions. In particular, long sequence time-series forecasting (LSTF) has received increasing attention. As the name suggests, LSTF uses past data to make predictions over a longer horizon into the future. Many methods have been proposed for time-series forecasting, but most studies consider only short-term prediction. RNN [1], LSTM [2] and their variants [3, 4] are often used and work well for short-term forecasting. As the length of the prediction sequence increases, the demands on the network model become higher, and the predictive capabilities of RNN and LSTM models no longer meet the requirements. In addition, many studies based on temporal convolutional networks (TCN) [5, 6, 7] attempt to model temporal causality using causal convolution.
(a) Comparison of our model with other models in the case of prediction of multiple variables. (b) Comparison of our model with other models in the case of predicting a single variable. 
In recent studies, Transformer models are increasingly used for long-term series forecasting. The Transformer shows better performance than the RNN in capturing long-range dependencies. The self-attention mechanism reduces the maximum length of the network signal propagation path to a constant. Compared with the sequential structure of RNN, Transformer parallelizes well, so it has great potential for LSTF problems. However, the Transformer has inherent limitations. First, its running time depends strongly on the lengths of the input and output. In long-term series forecasting, the computation and memory consumption of the Transformer, which are quadratic in the sequence length L, are unacceptable. As the length of the predicted series increases, obtaining results requires resources that many practitioners cannot afford. Reducing the time complexity of the Transformer is therefore the key to solving the problem. Second, the dynamic decoding of the vanilla Transformer makes step-by-step inference slow. The time and resource consumption of the self-attention mechanism becomes the bottleneck for its application in long time-series prediction. The long-tail information redundancy caused by global attention is also an unavoidable defect.
Several studies have addressed some of the problems inherent in the self-attention mechanism. LogSparse Transformer [8] allows each cell to attend only to itself and to its previous cells at exponentially growing steps. Sparse Transformer [9] divides the self-attention operation into two parts according to specific conditions. Longformer [10] reduces the time complexity of the self-attention mechanism to
At the same time, the above models only make some adjustments to the self-attention mechanism in order to reduce time complexity. As the number of model layers increases, the network still becomes very complex. Therefore, we hope to use a simpler way to replace the self-attention mechanism, rather than modify the self-attention mechanism. In addition, the powerful reasoning ability of transformer has made it attract great attention in the field of NLP recently. This general model is used to solve various NLP and imaging tasks. However, in time series forecasting, the dependence on the self-attention mechanism neglects the importance of some special values such as the mean or maximum value, and this information is easy to learn. Therefore, we hope to design a module that can be used to obtain this information, so as to improve the accuracy of time series forecasting while reducing the complexity of the model.
For this reason, we designed a simpler module based on the existing research to solve these existing problems. Experiments on a large number of real data sets prove that our method is useful for predicting long time series. This paper mainly does the following work:
We designed a fully sparse self-attention mechanism. It retains only the header information that benefits the result and discards all tail information that does not. It keeps the model complexity to
We designed an information extraction module for the average and maximum value of the time series (AMS). This module learns the trend information contained in the time series more easily. It combines well with the fully sparse self-attention mechanism to significantly improve predictive ability.
We designed a quadratic linear module (LT) with a simple structure to replace the self-attention mechanism. While improving forecasting ability, it simplifies the network structure and reduces the running time.
We designed a self-selected pooling mechanism based on the distribution of data, called DistPooling. DistPooling purifies data more gently, reduces the sequence dimension, and improves prediction speed.
In this section, we first introduce the fully sparse self-attention mechanism, which is based on the ProbSparse self-attention mechanism. We then explain in detail the trend learning module, called AMS, and the linear transformation used to replace self-attention. Finally, we introduce the DistPooling method, which is based on the original distribution of the sequences. In the next section, we go through our network structure in detail to demonstrate how the proposed modules work.
Fully sparse self-attention mechanism
Self-attention Mechanism
The canonical self-attention in Scaled Dot-Product Attention [14] transforms the time series encoded as feature vectors into the queries and keys of dimension
In order to make the self-attention mechanism more specific, we further discuss the elements of each row in the query matrix. In the mechanism of attention, let
where
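As a minimal sketch of the canonical scaled dot-product attention referred to above, the following NumPy code computes the attention weights and output for small random matrices. The function names, the matrix sizes, and the use of NumPy are illustrative assumptions; this is the standard textbook formulation, not the exact implementation used in our experiments.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (L_Q, d), K: (L_K, d), V: (L_K, d_v)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # pairwise query-key scores, (L_Q, L_K)
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights

# Illustrative sizes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))
K = rng.normal(size=(8, 4))
V = rng.normal(size=(8, 4))
out, weights = scaled_dot_product_attention(Q, K, V)
```

The score matrix has one entry per query-key pair, which is the source of the quadratic cost in sequence length discussed in the introduction.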
Query Sparsity Measurement
According to the Eq. (2), query
where
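For illustration, the sketch below computes the query sparsity measurement in the form defined by Informer [13]: the log-sum-exp of a query's scaled scores minus their arithmetic mean. Whether the measurement in this section matches that form term by term is an assumption, and the function name and test matrices are hypothetical.

```python
import numpy as np

def query_sparsity(Q, K):
    # Informer-style measurement (assumed form):
    # M(q_i, K) = ln(sum_j exp(q_i k_j^T / sqrt(d))) - mean_j(q_i k_j^T / sqrt(d))
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (L_Q, L_K)
    m = scores.max(axis=1, keepdims=True)              # for a stable log-sum-exp
    lse = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    return lse - scores.mean(axis=1)

rng = np.random.default_rng(0)
K = rng.normal(size=(8, 4))
# First query is all zeros, so its attention over the keys is exactly uniform.
Q = np.vstack([np.zeros(4), 5.0 * rng.normal(size=4)])
M = query_sparsity(Q, K)
```

A query with uniform attention attains the minimum value ln(L_K); larger values indicate that a few keys dominate, which is the property used to rank queries.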
ProbSparse Self-attention
ProbSparse self-attention uses the value of the
where
Fully sparse self-attention mechanism
To ensure that the outputs of the attention mechanism are dimensionally consistent with the inputs,
where
Recently, the self-attention mechanism has been widely used for long-term sequence prediction. As the prediction length increases, the high time complexity of the self-attention mechanism becomes an obstacle. Transformer-based models have to use a sparse form of the attention mechanism to deal with the quadratic complexity, but doing so destroys the trend information. Some recent studies [16] have also focused on the use of averages. In addition, as research has deepened, the lag of the predicted sequence has been alleviated. However, predicting changes in the sequence trend remains a great challenge, especially judging the maximum value, which existing prediction methods cannot solve well. To solve these problems, we propose an AMS module for fast learning of the mean and maximum information of the input sequences.
For the input sequence
where
Resformer model overview. Right: The whole flow of encoder and decoder. The fully sparse self-attention mechanism and AMS module process the input sequence in parallel. The encoder connects LT module to simplify network structure and improve prediction accuracy. The decoder connects to the standard self-attention mechanism and instantly predicts the output elements in a generated style (orange sequence). Left: The fully sparse self-attention mechanism determines whether q needs to be processed according to the category to which q belongs.
In this section, we will introduce a quadratic linear network structure that combines linear transformation and full connection layers.
In the previous section, we considered using the AMS module to improve prediction accuracy. The stacking of multiple layers of fully sparse self-attention results in repeated learning of information. The redundancy of information makes the network structure become complicated, and causes a large amount of time consumption. Therefore, we hope to replace the self-attention mechanism with a simpler network.
Inspired by recent MLP-related methods [17, 18, 19], we use linear transformation instead of the self-attention mechanism in long time-series prediction to reduce the complexity of the network. This method does not use any self-attention and relies only on stacking linear layers and fully connected layers. This architecture trains more stably than self-attention and does not require batch or cross-channel normalization (such as BatchNorm, GroupNorm, or LayerNorm). We also introduce the idea of the residual network: the output of each linear layer is joined to the output of the previous linear layer by a skip connection. In the experimental part, we compare the sparse attention mechanism with linear transformation. The results show that stacking linear and fully connected layers clearly improves prediction accuracy.
In addition, while discarding the self-attention mechanism, we also discarded Layer Normalization. Relying only on linear transformation, the training results become very stable. The following is the formula for linear transformation:
where
In addition, inspired by the self-attention mechanism, we find that nonlinear multi-level connection is helpful for deep-level information mining. In order to learn the integrity of long-term dependence inside the input sequence, we use the combination of Linear and Aff to complete the construction of quadratic Linear structure. The overall processing is as follows:
where A is the weight matrix that can be learned. The Linear function in the formula ensures that the input dimension is the same as the output dimension. Therefore, the entire network structure will not change the size of the input data. This feature allows the network to be flexibly placed in any place in the entire project, and it can also be flexibly combined with other network structures. It can be seen from the above two formulas that we embed the Linear layer in the linear transformation implemented by the first formula. The Linear layer, that is, the full connection operation is also equivalent to a linear transformation. At the same time, the superposition of two linear transformations is similar to the definition of the traditional quadratic curve. Therefore we name this process quadratic linear transformation.
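The structure described above can be sketched roughly as follows. This is only one plausible reading, under stated assumptions: a dimension-preserving fully connected layer (`linear`) is embedded inside an outer learnable transformation with weight matrix `A` and an offset `beta`, a skip connection adds the input back, and no normalization layer is used. The class name, the initialization, the offset term, and the exact order of operations are hypothetical and are not confirmed by the formulas of this section.

```python
import numpy as np

class QuadraticLinearSketch:
    """Hypothetical LT-style block: outer linear map applied to an inner Linear layer."""

    def __init__(self, d_model, rng):
        scale = 1.0 / np.sqrt(d_model)
        # Inner fully connected layer; keeps the feature dimension unchanged.
        self.W = rng.normal(scale=scale, size=(d_model, d_model))
        self.b = np.zeros(d_model)
        # Outer learnable weight matrix A (identity at initialization, an assumption).
        self.A = np.eye(d_model)
        self.beta = np.zeros(d_model)

    def linear(self, x):
        return x @ self.W + self.b

    def __call__(self, x):
        inner = self.linear(x)                # first linear transformation
        outer = inner @ self.A + self.beta    # second linear transformation on top
        return x + outer                      # residual skip connection, no LayerNorm

rng = np.random.default_rng(0)
block = QuadraticLinearSketch(d_model=8, rng=rng)
x = rng.normal(size=(16, 8))   # (sequence length, feature dimension), illustrative
y = block(x)
```

Because every step preserves the input shape, a block of this kind can be stacked or placed between other layers without reshaping.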
The figure shows the flow chart of the self-attention distillation we propose (DistPooling). The second column of squares represents a DistPooling window of size 4, where lighter squares indicate values closer to the minimum in the window and darker squares indicate values closer to the maximum. The pooling results largely retain the distribution of the original sequence.
A deep neural network can easily map a time series from a low dimension to a high dimension and effectively acquire local and global information. However, increasing the dimension also increases noise and information redundancy. Pooling purifies data by reducing the dimension of the input sequence. Average Pooling and Max Pooling are the two most common pooling methods and are widely used in NLP and image tasks, but they still have defects. Average Pooling smooths the input sequence, which reduces its distinctiveness and makes it difficult to capture the dependencies inside the time series. Max Pooling can effectively suppress noise in image tasks. For time series, however, and especially for long series forecasting, a downsampling strategy that relies only on the maximum value destroys the original distribution of the data, resulting in loss of information.
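The two defects can be seen on a toy series. The sketch below applies non-overlapping average and max pooling with a window of 4; the helper name `pool1d` and the sample values are illustrative assumptions.

```python
import numpy as np

def pool1d(x, window, reduce_fn):
    # Non-overlapping pooling: drop the incomplete tail, then reduce each window.
    x = np.asarray(x, dtype=float)
    n = len(x) // window * window
    return reduce_fn(x[:n].reshape(-1, window), axis=1)

series = np.array([1.0, 9.0, 2.0, 8.0, 5.0, 5.0, 0.0, 4.0])
avg = pool1d(series, 4, np.mean)   # [5.0, 3.5]: the swings between 1 and 9 vanish
mx = pool1d(series, 4, np.max)     # [9.0, 5.0]: the dip to 0.0 is lost entirely
```

Average pooling flattens the strongly oscillating first window to a single mid-range value, while max pooling shifts the second window upward and hides its minimum.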
In order to solve these problems, we selectively use the maximum or minimum value as the result of pooling according to the original distribution of data. This approach uses edge values to reduce the dimension of the original sequence, which not only inhibits noise, but also improves the identification of the input sequence and preserves more internal connections between time sequences.
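A rough sketch of this idea is given below. The selection rule is an assumption made only for illustration: within each window, the maximum is kept when the window mean lies closer to the maximum than to the minimum, and the minimum is kept otherwise. The rule actually used by DistPooling is the one given by the formula of this section; the function name and sample values are hypothetical.

```python
import numpy as np

def dist_pooling_sketch(x, window):
    # Hypothetical selection rule (assumption): choose the edge value, max or min,
    # that lies closer to the window mean.
    x = np.asarray(x, dtype=float)
    n = len(x) // window * window
    w = x[:n].reshape(-1, window)
    lo, hi, mean = w.min(axis=1), w.max(axis=1), w.mean(axis=1)
    pick_max = (hi - mean) <= (mean - lo)
    return np.where(pick_max, hi, lo)

series = [1.0, 9.0, 8.0, 8.0, 5.0, 1.0, 0.0, 1.0]
pooled = dist_pooling_sketch(series, window=4)   # [9.0, 0.0]
```

In the first window most values are high, so the maximum is kept; in the second most values are low, so the minimum is kept. The output therefore follows where the mass of each window lies instead of always taking one extreme.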
For the sequence
where
The existing long series time series prediction methods can be roughly divided into two categories. One is prediction based on traditional methods [20, 21, 22]. The other is deep learning methods based on encoder-decoder structures, such as RNN and Transformer and their variants [23, 24, 25]. Our entire framework is based on the standard encoder-decoder structure.
Encoder
The function of the encoder is to extract features from the input sequence to reduce redundancy, so as to obtain the correlation between the input sequences. In this section, we will introduce in detail how to combine the fully sparse self-attention mechanism with the AMS and LT to complete the work.
For the encoder, the input
where Probspare represents the fully sparse self-attention mechanism, and Conv1d is a one-dimensional convolution operation, the value of kernel is 3, and the ELU activation function is used.
As shown in Fig. 2, the decoder consists of a fully sparse self-attention mechanism and a complete self-attention mechanism. At the same time, we designed two input sequences for the decoder. One of the vectors is as follows:
where
We first put
We conducted extensive experiments on four public data sets: three ETT electricity data sets from different counties and cities in China and one public benchmark data set.
Univariate long sequence time-series forecasting result on four cases
Multivariate long sequence time-series forecasting result on four cases
Runtime analysis
ETT1
ETT dataset was acquired at https://github.com/zhouhaoyi/ETDataset .
ETT is an important indicator of long-term power configuration. ETT includes three real data sets named ETTh1, ETTh2, and ETTm1. They come from two different counties in China. ETTh1 and ETTh2 record power data every hour, and ETTm1 records power data every 15 minutes. Each data set contains seven characteristics, namely oil temperature and six power loads. Training/validation/test is 12/4/4 months.
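As an illustration of the 12/4/4 month split, the sketch below converts months into row index ranges for hourly and 15-minute data. It assumes 30-day months and contiguous, chronologically ordered rows; the helper name `month_split_bounds` is hypothetical.

```python
def month_split_bounds(points_per_hour, months=(12, 4, 4), days_per_month=30):
    # Assumption: every month is treated as 30 days of evenly sampled rows.
    rows_per_month = days_per_month * 24 * points_per_hour
    bounds, start = [], 0
    for m in months:
        end = start + m * rows_per_month
        bounds.append((start, end))   # half-open range [start, end)
        start = end
    return bounds

# Hourly data (ETTh1, ETTh2): one row per hour.
hourly = month_split_bounds(points_per_hour=1)
# -> [(0, 8640), (8640, 11520), (11520, 14400)]

# 15-minute data (ETTm1): four rows per hour.
quarter_hourly = month_split_bounds(points_per_hour=4)
# -> [(0, 34560), (34560, 46080), (46080, 57600)]
```

Splitting chronologically in this way keeps the validation and test periods strictly after the training period.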
Weather2
Weather dataset was acquired at http://www.ncei.noaa.gov/data/local-climatological-data/ .
The data set contains three years of weather data from nearly 1,600 different locations across the United States, recorded every hour. Each row contains 12 features, consisting of wet bulb and other climatic features. The train/val/test split is 28/10/10 months.
Our proposed method uses the Adam optimizer for optimization, and its learning rate starts from
We chose eight time series prediction methods for comparison, including Informer [13], LogTrans [8], Reformer [11], ARIMA [26], Prophet [27], LSTMa [28], LSTnet [29] and DeepAR [30].
Ablation study of the LT module
Without LT uses a fully sparse self-attention mechanism instead of the LT module.
Univariate prediction
We tested the performance of our model for univariate prediction. For example, in the ETT data set, we use the oil temperature in the time series to predict the oil temperature. Table 1 compares several time series forecasting methods. We observe: (1) The proposed Resformer model significantly improves inference performance on all data sets, and its prediction error rises steadily and slowly as the prediction horizon grows. (2) The prediction ability of the Resformer model is superior to Informer, LSTMa and Reformer, and the MSE decreases by 29% (at 24), 25% (at 48), 32% (at 168), 41% (at 336), and 30% (at 720) on average. The Resformer model also shows better results than Reformer, with the MSE decreasing by 94% on average.
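The percentages above are relative MSE reductions. As a small sketch of how such a figure is computed, the code below defines MSE and the relative reduction with respect to a baseline. The numbers are purely illustrative and are not taken from Table 1; the function names are hypothetical.

```python
import numpy as np

def mse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def relative_reduction(baseline_mse, model_mse):
    # Fraction by which the model lowers the baseline error.
    return (baseline_mse - model_mse) / baseline_mse

# Illustrative values only (not results from this paper).
example_mse = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])   # (0 + 0 + 4) / 3
reduction = relative_reduction(baseline_mse=0.20, model_mse=0.15)
```

With these illustrative values the reduction is 0.25, that is, a 25% decrease in MSE relative to the baseline.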
Multivariate prediction
We also tested the power of our model in multivariate prediction. In this setting, we not only predict the target feature, but also predict other features in the time series. This kind of prediction is easy to implement, and it can be achieved only by modifying the output length of the final fully connected layer on the basis of univariate prediction. The experimental results can be seen in Table 2. It can be seen from the results: (1) Our model shows good predictive ability when predicting multiple features. (2) In previous studies, LSTM has been widely used for long series time series prediction. Experimental results show that compared with LSTMa, our method reduced MSE by 56% (at 24), 21% (at 48), 38% (at 168 and 336), and 48% (at 720). (3) compared with other Transformer variants. When metric
Ablation study of the AMS module
Ablation study of the fully sparse self-attention mechanism
Ablation study of the DistPooling
Runtime analysis
We compared the model proposed in this paper with Informer and Autoformer [16] on the ETTm1 dataset in terms of prediction accuracy and running time. The experimental results can be seen in Table 3. It can be seen that when the prediction length is short, our model achieves better results, and when the prediction length increases, the Autoformer gradually shows advantages. In terms of running time, the model proposed in this paper takes the least amount of time.
The performance of the LT module
To verify the function of the linear module, we replace the linear transformation module in the network with a fully sparse self-attention layer and compare the two network structures, which differ only in whether linear transformation is used. The comparison verifies that the linear transformation module both improves prediction accuracy significantly and reduces the running time. The experimental results are shown in Table 4. They show: (1) In terms of accuracy, the model using the linear transformation module improves by 10% to 20% compared with the model without it, and as the prediction length increases, the linear transformation module shows growing advantages in long-term sequence prediction. (2) In terms of running time, Table 4 gives the average time of a single epoch for the two models. Compared with the sparse self-attention mechanism, the linear transformation module reduces the average time of a single epoch by 12% to 15%.
Performance of the fully sparse self-attention mechanism
To verify the accuracy of the model, we replace the fully sparse self-attention mechanism in the model with the ProbSparse self-attention mechanism of Informer and compare the two models experimentally. The fully sparse self-attention mechanism and the standard self-attention mechanism are compared in the same way. The comparison results of the three models are shown in Table 6. The time complexity of
The performance of the AMS module
In order to verify the function of the AMS module, we set up two different network structures with and without AMS module. A series of experiments were performed on the ETT data set for comparison. The results of the experiment are recorded in Table 5. It shows that the extracted mean information is helpful to the prediction result, especially when the length of the prediction sequence is large. In a series of experiments, we also found that when the trend information is fused with the output of fully sparse self-attention layer, the proportion of the two will have a great influence on the prediction results. In the subsequent parameter adjustment experiments, we will give detailed comparison instructions.
The performance of the DistPooling
In order to verify the effectiveness of DistPooling, Max Pooling and Avg Pooling are used to replace DistPooling in the Resformer model on ETTh1 dataset, and the experimental results are shown in Table 7. By comparing the experimental results of predicting a variety of lengths, the DistPooling method obtained better results than other commonly used pooling methods, which proves that the pooling method we adopted is effective in time series prediction.
The parameter sensitivity of three components in our network
Visualization of results using different pooling methods
We use the data set ETTh1 to analyze the sensitivity of parameters under multivariable conditions.
Model analysis
Time complexity of fully sparse self-attention mechanism
The self-attention mechanism includes three parts: query, key and value. Usually the input length of query and key is the same, i.e.,
Visual Analysis of DistPooling
To explore how DistPooling preserves the information of the original time series during downsampling, we perform a visual analysis on three different pooling methods. The visualization results are shown in Fig. 5. AvgPooling uses the mean of itself and surrounding time points to get a smoother downsampling curve. It can be seen that the AvgPooling ignores the local small fluctuations in the time series, and pays more attention to the acquisition of the overall trend. MaxPooling is more sensitive to fluctuations in time series, but the reflection of MaxPooling on details is still not obvious enough. DistPooling selectively obtains down-sampling results by using the distribution of time series within a certain range. This allows DistPooling to preserve the details of the time series to the greatest extent possible.
Conclusions
This paper deals with the prediction of long series time series. On the basis of the sparse self-attention mechanism, we designed the fully sparse self-attention mechanism and AMS module to fuse the multidimensional information of the internal dependence of time series. At the same time, we tested the effectiveness of LT module in replacing the self-attention mechanism, and we will continue to explore the prediction of long series time series using only quadratic linear transformation. Through experiments on a series of real data sets, it is proved that our study is helpful for improving the prediction of long series time series.
Acknowledgments
This work was supported in part by the following: the National Natural Science Foundation of China under Grant Nos. 62272281, 62007017, and 61902220, Youth Innovation Technology Project of Higher School in Shandong Province under Grant No. 2019KJN042, and the project ZR2021QF134 supported by Shandong Provincial Natural Science Foundation.
