Few-shot time series forecasting in a meta-learning framework

Abstract

Time series forecasting has a wide range of applications in various fields. To reduce the dependence on large volumes of time series data, a meta-learning-based few-shot time series forecasting method is proposed. This method uses a residual stack module as its backbone and connects the residuals forward and backward through a multilayer fully connected network so that the model and the meta-learning framework can be seamlessly combined. Empirical knowledge of different time series tasks is obtained through meta-training. To enable fast adaptation to new prediction tasks, a small meta-network is introduced to adaptively and dynamically generate the learning rate and weight decay coefficient of each step in the network. This method can use sequences with different data distribution characteristics for cross-task learning, and each training task only needs a small number of time series to achieve sequence prediction for the target task. The results show that, compared with the two baselines, the proposed method improves performance on 67.07% and 58.53% of the evaluated tasks, respectively. Thus, this method can effectively alleviate the problems caused by insufficient data during training and has broad application prospects in the field of time series.

Keywords

Time series forecasting; few-shot learning; meta-learning; residual stack model

1 Introduction

Time series prediction (TSP) has been a hot research topic for decades, and it has always played a vital role in energy systems [1, 2], medical care [3], finance and many other fields. Analyzing time series data and extracting valuable insights from it allows future trends to be forecast, thereby informing decision-making strategies across various fields.

With the rapid development of deep learning technology in various fields, deep neural networks (DNNs) have made great progress in the field of time series forecasting. However, the success of deep learning technology hinges on the availability of large-scale training data. To promote the further application of deep learning in the field of time series prediction, two critical challenges must be addressed: 1) In practical scenarios, it is not always possible to acquire sufficient training data. Typical scenarios include the diagnosis of rare diseases, traffic flow prediction for new roads, and power consumption prediction for new equipment in a power system. In these scenarios, data scarcity leads to model overfitting. Consequently, the prediction performance decreases, which limits reliable decision-making. 2) Traditional time series prediction algorithms often have difficulty adapting to different prediction tasks and reach their best prediction performance only through repeated hyperparameter adjustment, which requires considerable computational resources and time. Therefore, implementing general and efficient deep learning models with small sample data is still challenging.

To facilitate learning with limited data, few-shot learning (FSL) has received much attention in recent years. FSL aims to train deep learning models with better generalization performance using a small amount of labeled data. Building on this idea, researchers proposed meta-learning approaches that realize cross-task training by learning the empirical knowledge of multiple different tasks and then transferring this knowledge to the target task, allowing for rapid adaptation to the new task. This effectively reduces the number of training samples the model needs and has prompted researchers to apply few-shot learning to time series forecasting. Iwata et al. proposed a meta-learning-based bidirectional LSTM model to solve the few-shot time series problem [4]. This model applies an attention mechanism to the support set to minimize the error on the query set and achieves better prediction performance with small datasets. Subsequently, Feng Xiao et al. proposed a few-shot prediction mechanism based on meta-learning [5] that has been shown to outperform traditional prediction methods in few-shot time series prediction. However, existing few-shot meta-learning algorithms cannot tailor their hyperparameters to different time series tasks, resulting in slow convergence.

To address these problems, this paper proposes a few-shot time series prediction algorithm based on a meta-learning framework. The method utilizes knowledge shared across different tasks to learn generic time series modeling capabilities. Additionally, we introduce a multilayer perceptron (MLP) as a small meta-network to adaptively adjust hyperparameters for internal training tasks, facilitating the optimal solution of hyperparameter combinations and realizing generic small-sample time series forecasting. The main contributions of this paper include the following:

1) To address the overfitting caused by insufficient data in time series tasks, we integrate the residual stack module of the N-BEATS model with the model-agnostic meta-learning (MAML) algorithm. MAML adapts to different tasks by learning model initialization parameters shared across tasks, and the residual mechanism continuously "corrects" the prediction results to better capture the intrinsic time series features. The experimental results show that the meta-learning approach using the residual stack module significantly improves the prediction accuracy in few-shot time series tasks compared to that of traditional deep learning models.

2) To find the optimal hyperparameter combination in each time series task, we propose meta-learning for hyperparameter tuning (MLHT), which introduces a small meta-network on top of MAML to dynamically generate network hyperparameters for each task, yielding an adaptive learning strategy that achieves state-of-the-art prediction performance. The experimental results show faster convergence than classical MAML.

3) The variation among task samples in cross-task learning poses several challenges, such as inconsistent lengths of time series data from different domains. As a result, the data cannot be fed directly into the model for training and must be preprocessed. To solve this problem, this paper uses a bidirectional gated recurrent unit (Bi-GRU) neural network as an encoder to deal with disparate data lengths across tasks. The encoder maps the raw data from different tasks to a unified representation space to satisfy the requirements of cross-task learning.

The remainder of this paper is organized as follows. Section 2 reviews the current state of research. Section 3 describes the proposed few-shot time series forecasting method. In Section 4, we compare our method with existing methods.

2 Related work

2.1 Time series forecasting

Time series forecasting predicts future development by analyzing time series to find internal regularities. In the early stages of time series forecasting, statistical models such as the autoregressive moving average (ARMA) and its variant, the autoregressive integrated moving average (ARIMA), were extensively utilized. However, these models fell short in characterizing the nonlinear variations inherent in time series data. To solve this problem, researchers have used nonlinear models such as support vector machines (SVM) [6], Gaussian processes (GP) [7], and hidden Markov models [8] for time series forecasting, which can handle complex time series. Nevertheless, these methods have limitations, primarily in modeling the sequential dependencies among input data, which compromises their efficiency in time series forecasting tasks.

Deep learning applications are progressively expanding across various domains, including time series prediction tasks. The two most prevalent network architectures in these tasks are the long short-term memory (LSTM) network [9] and the convolutional neural network (CNN) [10]. CNNs use convolutional and pooling layers as the feature extractor of the input vector to fully extract the local correlation of temporal data and are widely used in traffic flow prediction [11], stock price prediction [10], power load forecasting [12] and other fields. Their disadvantage is that they cannot extract long-term dependent features of time series data. LSTM, on the other hand, can retain long-term sequence features and mitigates the problems of gradient vanishing and gradient explosion. In addition, LSTM shows excellent prediction performance by introducing the mechanisms of the input gate, output gate and forget gate. However, LSTM needs to optimize many weights and bias parameters, which slows network training. Recently, the multilayer self-attention architecture (Transformer) proposed in the field of natural language processing has also been applied to time series prediction tasks [13] and has shown excellent prediction results. The disadvantage of Transformers is their high computational cost. In recent years, the N-BEATS model has garnered attention. It is composed of a deep stack of fully connected layers connected by forward and backward residual links [14]. The stacked fully connected layers learn the dependencies in the data, while the residual structure guarantees the depth of the network and the ability of the model to process data. Compared with other models, this model has lower computational complexity and trains faster. N-BEATS demonstrates state-of-the-art performance on several large datasets. Therefore, we apply the residual block of the N-BEATS model to the few-shot time series problem and verify its performance.

2.2 Few-shot learning

Few-shot learning has recently attracted considerable attention for data-scarce problems in new domains. Several methodologies, including transfer learning [15, 16], domain adaptation [17], and multitask learning [18, 19], use the knowledge or experience in the source domain to assist learning tasks in the target domain. Among them, Oreshkin et al. [20] proposed a zero-shot time series forecasting framework based on transfer learning. This framework trains a neural network model with many time series datasets in the source domain and then directly applies the model to new time series prediction tasks. Orozco et al. [21] proposed a few-shot time series prediction method based on a recurrent neural network, which solves the few-shot prediction problem by learning a shared feature embedding from a large amount of time series data. However, the above methods use a large amount of data in the source domain for training. To reduce the number of sequential tasks needed, researchers have proposed many meta-learning ideas to address the problems faced in the field of few-shot learning.

Fig. 1

The overall structure framework diagram.

Currently, research on meta-learning is roughly classified into three groups: metric-based [22], model-based [23] and gradient-based meta-learning [24]. Among them, gradient-based methods have attracted particular attention due to their wide-ranging adaptability. The most classical method is the MAML algorithm [22] proposed by Finn et al., which optimizes the initial network parameters through training and then fine-tunes these parameters in one or more subsequent steps. Using the MAML algorithm, Feng Xiao et al. investigated few-shot time series forecasting and greatly improved the prediction performance. However, their method does not consider the tuning of meta-hyperparameters for different tasks. Therefore, many efforts have been made in hyperparameter optimization for meta-learning. To reduce the sensitivity to hyperparameters, Li et al. [25] proposed a novel meta-learning optimizer (Meta-SGD) for few-shot learning. The experimental results demonstrate the effectiveness of meta-learner hyperparameter optimization. Ravi et al. [26] learned the entire inner-loop optimization directly through LSTMs to generate updated weights. However, these methods lack adaptive properties in inner-loop optimization, and the hyperparameters cannot be adapted to each task. Therefore, Sungyong et al. [27] proposed an adaptive learning update rule (ALFA) within the meta-learning framework, which automatically adjusts the hyperparameters of the learning algorithm and thereby improves the performance of meta-learning. In this paper, we build on MAML by introducing a trainable learning rate and weight decay coefficient to update the model hyperparameters adaptively. This aims to speed up adaptation to new tasks in time series prediction and thereby improve generalization.

3 Few-shot time series prediction

3.1 Problem statement

Meta-learning allows a network to adapt quickly to new tasks through cross-task training. Assume that there are N different time series tasks in cross-task learning, $T = \{T_{1}, T_{2}, \dots, T_{N}\}$. The model takes the historical time series data of the i-th task as input, $X_{i} = \{x_{i}^{1}, x_{i}^{2}, \dots, x_{i}^{t}\}, X_{i} \in T$. The true values of the i-th task over the next h steps are expressed as $Y_{i} = \{x_{i}^{t + 1}, x_{i}^{t + 2}, \dots, x_{i}^{t + h}\}$. The time series obtained after prediction is ${\hat{Y}}_{i} = \{{\hat{x}}_{i}^{t + 1}, {\hat{x}}_{i}^{t + 2}, \dots, {\hat{x}}_{i}^{t + h}\}$. Each time, we hold out one task as the new test task and train the meta-network with few-shot time series data from the remaining N - 1 tasks. The goal is to let the machine learn the hyperparameters that are otherwise set by hand and automatically find the optimal initial parameters $ω^{*}$ and hyperparameters α and β by learning from different tasks:

$ω^{*} = \arg \min_{ω} L^{task} (ω, α, β)$ (1)
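As a minimal sketch of the notation above, the following Python snippet splits one series into a history window X_i of length t and a target window Y_i of length h. The function name `make_window` and the toy series are assumptions introduced for illustration only; they are not part of the original method.

```python
def make_window(series, t, h):
    """Split the first t + h points of a series into history X and target Y."""
    if len(series) < t + h:
        raise ValueError("series is shorter than t + h")
    history = series[:t]         # X_i = {x^1, ..., x^t}
    target = series[t:t + h]     # Y_i = {x^(t+1), ..., x^(t+h)}
    return history, target


# Toy series (hypothetical values, not taken from the paper's datasets).
series = [0.1 * k for k in range(30)]
X, Y = make_window(series, t=10, h=5)
```

The forecasting model is then asked to map `X` to an estimate of `Y`.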

The network quickly generalizes meta-knowledge to new tasks when encountering unseen tasks. It only needs to fine-tune the learned initialization parameters on few-shot time series data to better predict future time series trends.

3.2 Proposed method

The model training is handled by two parts: the backbone network (base-learner) and the small meta-network (meta-learner). The backbone network corresponds to a specific episode source task; it learns knowledge about each task from the few-shot labeled support set time series of multiple time series tasks and predicts the query set sequence. The small meta-network improves the prediction performance of the base-learner on cross-episode tasks by learning general meta-knowledge from multiple episode tasks. The backbone network uses the residual stack module of the N-BEATS model. Each block consists of four stacked fully connected layers $FC_{1} \sim FC_{4}$ and two residual branch linear layers. The small meta-network is a 3-layer perceptron (MLP) with the rectified linear unit (ReLU) activation function between the layers. The overall framework of the model structure is shown in Fig. 1.

The backbone network is a stack composed of multiple blocks. The core idea is that each block learns part of the historical information in the time series; the information learned by a block is removed from its input through the residual connection, and the remainder is used as the input of the next block. The information that has not yet been learned is thus used to refine the prediction further. Each block is composed of 4 fully connected layers. After the output of the last hidden layer is obtained, the data enter two residual branches, which generate the forward expansion coefficients θ^f and the backward expansion coefficients θ^b through linear layers and then pass them through the functions g^f and g^b. The outputs of the two residual branches are the backcast over the current input window and the forecast over the future prediction window. Fig. 2 shows a schematic diagram of the residual stack module.

Fig. 2

Residual stack module diagram.

For block 1, the input is the entire input sequence of the model. For each remaining block, the input is the difference between the previous block's input and the backcast it generated (i.e., the residual), and the final prediction is the sum of all partial predictions:

$x_{k} = x_{k - 1} - {\hat{x}}_{k - 1}$ (2)

$\hat{y} = \sum_{k} {\hat{y}}_{k}$ (3)

The internal operations of the k-th block are:

$h_{k, 1} = FC_{k, 1} (x_{k}), \dots, h_{k, i} = FC_{k, i} (h_{k, i - 1})$ (4)

$θ_{k}^{b} = {Linear}_{k}^{b} (h_{k, 4}), θ_{k}^{f} = {Linear}_{k}^{f} (h_{k, 4})$ (5)

The functions g^b and g^f are used to remove some irrelevant information from the input time series data, making the prediction more detailed and accurate. The network uses these functions to amplify the point weight coefficients and obtain the desired prediction at each time point. The output of the k-th block is:

${\hat{y}}_{k} = V_{k}^{f} θ_{k}^{f} + b_{k}^{f}, {\hat{x}}_{k} = V_{k}^{b} θ_{k}^{b} + b_{k}^{b}$ (6)
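The forward pass described by Eqs. (2) to (6) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the class `Block`, the helper names, the layer widths and the random (untrained) weights are all assumptions chosen to keep the example small.

```python
import random


def relu(v):
    return [max(0.0, a) for a in v]


def linear(W, b, x):
    """Affine map W x + b, with W stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, x)) + bias for row, bias in zip(W, b)]


def init_layer(n_out, n_in, rng):
    """Normally distributed weights and zero bias (hypothetical initialization)."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


class Block:
    """One generic block: 4 FC layers followed by backcast and forecast branches."""

    def __init__(self, d, h, hidden, theta_dim, rng):
        sizes = [d, hidden, hidden, hidden, hidden]
        self.fc = [init_layer(sizes[i + 1], sizes[i], rng) for i in range(4)]
        self.lin_b = init_layer(theta_dim, hidden, rng)   # Linear^b -> theta^b
        self.lin_f = init_layer(theta_dim, hidden, rng)   # Linear^f -> theta^f
        self.g_b = init_layer(d, theta_dim, rng)          # V^b and b^b
        self.g_f = init_layer(h, theta_dim, rng)          # V^f and b^f

    def __call__(self, x):
        hid = x
        for W, b in self.fc:                        # Eq. (4)
            hid = relu(linear(W, b, hid))
        theta_b = linear(*self.lin_b, hid)          # Eq. (5)
        theta_f = linear(*self.lin_f, hid)
        backcast = linear(*self.g_b, theta_b)       # Eq. (6)
        forecast = linear(*self.g_f, theta_f)
        return backcast, forecast


def stack_forward(blocks, x):
    """Residual stacking: subtract each backcast, sum the partial forecasts."""
    residual = list(x)
    total = None
    for block in blocks:
        backcast, forecast = block(residual)
        residual = [r - b for r, b in zip(residual, backcast)]        # Eq. (2)
        if total is None:
            total = forecast
        else:
            total = [a + f for a, f in zip(total, forecast)]          # Eq. (3)
    return total, residual


rng = random.Random(0)
blocks = [Block(d=8, h=3, hidden=16, theta_dim=4, rng=rng) for _ in range(3)]
y_hat, residual = stack_forward(blocks, [float(k) for k in range(8)])
```

Because every block only sees what earlier blocks failed to explain, the forecast is built up additively from the partial predictions.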

3.3 Data preprocessing

Since the time series data from different tasks have unequal input lengths, the Bi-GRU neural network is used to process time series data of any length, and the time series of each task is processed into a unified dimension. The Bi-GRU network can not only maintain the temporal order of the original time series but also use the hidden state of each time step to obtain the past and future information in the sequence as the characteristics of the current moment.

Therefore, the Bi-GRU neural network can be regarded as an encoder. The sequence of length t in task i is expressed as $X_{i} = \{x_{i}^{1}, x_{i}^{2}, \dots, x_{i}^{t}\}$. The sequence is passed as the encoder input to two GRU networks running in opposite directions, and the hidden states $\overrightarrow{h_{t}}, \overleftarrow{h_{t}}$ at different moments in the forward and backward directions are obtained. Note that only the historical part of the time series is encoded. The number of hidden layer neurons is 100. The concatenation of the forward and backward state vectors at the last moment is expressed as $h_{i} = \overrightarrow{h_{t}} \oplus \overleftarrow{h_{t}}$. The new output sequence is defined as $X_{i}^{'} = \{x_{i}^{1^{'}}, x_{i}^{2^{'}}, \dots, x_{i}^{d^{'}}\}$. In the experiments in this paper, h_i is the initial input of the prediction model.
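To illustrate how a bidirectional GRU maps sequences of different lengths to a vector of fixed size, the sketch below implements a minimal GRU cell for scalar inputs in plain Python. It is not the authors' encoder: the weights are random and untrained, biases are omitted, the hidden size is 4 instead of 100, and the names `GRUCell`, `run` and `bi_gru_encode` are assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class GRUCell:
    """Minimal GRU cell for scalar inputs (random, untrained weights, no bias)."""

    def __init__(self, hidden, rng):
        self.hidden = hidden
        # One input weight vector and one recurrent matrix per gate: z, r, n.
        self.w = {g: [rng.uniform(-0.5, 0.5) for _ in range(hidden)] for g in "zrn"}
        self.u = {g: [[rng.uniform(-0.5, 0.5) for _ in range(hidden)]
                      for _ in range(hidden)] for g in "zrn"}

    def _affine(self, gate, x, h):
        return [self.w[gate][i] * x + sum(u * hj for u, hj in zip(self.u[gate][i], h))
                for i in range(self.hidden)]

    def step(self, x, h):
        z = [sigmoid(a) for a in self._affine("z", x, h)]           # update gate
        r = [sigmoid(a) for a in self._affine("r", x, h)]           # reset gate
        gated = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a) for a in self._affine("n", x, gated)]     # candidate state
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def run(cell, sequence):
    """Return the hidden state after the last element of the sequence."""
    h = [0.0] * cell.hidden
    for x in sequence:
        h = cell.step(x, h)
    return h


def bi_gru_encode(sequence, fwd, bwd):
    """Concatenate the final forward and backward hidden states."""
    return run(fwd, sequence) + run(bwd, list(reversed(sequence)))


rng = random.Random(0)
fwd, bwd = GRUCell(4, rng), GRUCell(4, rng)
short_code = bi_gru_encode([math.sin(0.3 * k) for k in range(12)], fwd, bwd)
long_code = bi_gru_encode([math.sin(0.3 * k) for k in range(60)], fwd, bwd)
```

Both `short_code` and `long_code` have length 8 (twice the hidden size), even though the inputs have 12 and 60 points, which is the property cross-task learning relies on.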

Fig. 3

Bi-GRU encoder.

The new sequence $X_{i}^{'} = \{x_{i}^{1^{'}}, x_{i}^{2^{'}}, \dots, x_{i}^{d^{'}}\}$ in each task is obtained through the Bi-GRU encoder. First, all time series tasks are divided into training tasks T_train and new prediction tasks T_test. The historical time series data in each training task dataset are divided into a support set and a query set. The support set in the training phase is composed of few-shot labeled time series samples, and the query set is composed of unlabeled time series samples used to update the meta-network parameters. The prediction task is also composed of two similar datasets. In the test phase, the support set is used to fine-tune the prediction model, while the query set is used to predict the future time series and evaluate the prediction performance on the new task. In the N-way K-shot learning setting, N tasks are extracted from the training tasks T_train, and each task has K time series. The support set S and the query set Q for the time series in the task are defined as follows:

$S = \{(x^{d}, y^{d + h})\}_{p = 1}^{N \times K}$ (7)

$Q = \{(x^{d}, y^{d + h})\}_{p = 1}^{N \times M}$ (8)

In these formulas, $x^{d}$ is the historical time series data of each task, $y^{d + h}$ is the time series of h steps in the future, N is the number of tasks extracted from the total tasks, K is the number of labeled time series for each task in the support set, and M is the number of unlabeled samples per task in the query set. In this experiment, we randomly select five tasks from D_meta-train, i.e., we perform 5-way 5-shot learning. In the meta-training process, a mini-batch is defined as $T_{i}^{'} = T_{i}^{support - set} \cup T_{i}^{query - set}$, also known as an episode.
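The episode construction in Eqs. (7) and (8) can be sketched as below. The helper `sample_episode`, the value M = 3 and the toy task dictionary are assumptions for illustration; the paper does not state M or how samples are drawn within a task.

```python
import random


def sample_episode(train_tasks, n_way, k_shot, m_query, rng):
    """Draw N tasks, then split each task's samples into K support and M query pairs."""
    support, query = [], []
    for name in rng.sample(sorted(train_tasks), n_way):
        windows = rng.sample(train_tasks[name], k_shot + m_query)
        support += [(name, w) for w in windows[:k_shot]]    # contributes to S, Eq. (7)
        query += [(name, w) for w in windows[k_shot:]]      # contributes to Q, Eq. (8)
    return support, query


rng = random.Random(0)
# Hypothetical toy data: 8 tasks, each with 12 (history, target) pairs.
tasks = {f"task{i}": [([i + j, i + j + 1], [i + j + 2]) for j in range(12)]
         for i in range(8)}
S, Q = sample_episode(tasks, n_way=5, k_shot=5, m_query=3, rng=rng)
```

With N = 5 and K = 5, the support set holds N × K = 25 samples and the query set holds N × M = 15, with no sample shared between the two.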

3.4 Meta-training

In the meta-training stage, an optimal set of network parameters is trained through multitask bilevel optimization. The purpose of this stage is to increase the generalization ability of the model across multiple tasks and to continuously adjust the model parameters to adapt to different tasks. The network does this by learning task characteristics, which improves the sensitivity of the model and lets it adapt to new tasks faster. For convenience of description, the initial parameters of the backbone network are denoted θ₀, and the parameters of the small meta-network are denoted φ, where θ = [w₁, . . . , w_j]^T collects the weight vectors of all layers in the network and w_j is the weight vector of the j-th layer. In the experimental parameter update, θ is used to represent all weight vectors. Note that, because the meta-learning algorithm is used, the weights need to be initialized with a normal distribution and added to the parameter list. The sequence $S_{i} = \{x_{i}^{1}, x_{i}^{2}, \dots, x_{i}^{d}\}$ is taken as the current input, and x₁ denotes the sequence S_i, where d is the length of the time series. Each block can be represented internally as follows:

$h_{k, 1} = σ (w_{1} x_{1} + b_{1})$ (9)

$h_{k, p} = σ (w_{p} h_{k, p - 1} + b_{p})$ (10)

$h_{k, j} = Linear (h_{k, p}, w_{j})$ (11)

where σ is the activation function, here the ReLU function, $h_{k, p}$ is the output of the p-th fully connected layer of the k-th block, and $h_{k, j}$ is the output of the j-th (linear) layer of the k-th block. Here, k = 1 denotes the first block. Only block 1 receives the entire input sequence of the model. The final output of a block includes the backcast sequence $[x_{k}^{1}, \dots, x_{k}^{d}]$ and the forecast sequence $[x_{k}^{d + 1}, \dots, x_{k}^{d + h}]$. The input of the next block is the input of the previous block minus the backcast that block produced, i.e., the remaining residual component x₂ = (x₁ - backcast), and so on for the subsequent blocks. Therefore, the blocks decompose the input time series feature information layer by layer, and each block fits only part of the feature information of the predicted time series. Each layer deals with residuals that were not fitted correctly by the previous layer. In the experiment, there are a total of n blocks, and the sum of the predicted values generated by all blocks is the final predicted sequence ${\hat{y}}_{i} = {\hat{y}}_{i}^{1} + \dots + {\hat{y}}_{i}^{n}$.

The output sequence is ${\hat{y}}_{i} = [{\hat{x}}_{i}^{d + 1}, \dots, {\hat{x}}_{i}^{d + h}] = [{\hat{y}}_{i}^{1}, \dots, {\hat{y}}_{i}^{h}]$. The network computes the loss function $L_{T_{i}}^{S_{i}} f (θ_{t})$ on the support set of the training task T_i with the current parameters θ_t, where $f_{θ_{t}}$ is the function parameterized by θ_t. Next, the network updates the parameters and quickly generalizes and transfers meta-knowledge on multiple episode tasks to adapt to different meta-tasks. Figure 4 shows the few-shot series prediction parameter update process.

Fig. 4

Block diagram of the few-shot time series prediction parameter update process.

3.4.1 Inner-loop update

First, the learning rate and weight decay coefficient are generated layer by layer by the small meta-network, which thus acts as a hyperparameter generator. The cumulative mean of the gradients and the network parameters determine the learning rate of each layer of weights. Task-specific hyperparameters are generated based on the learning state of each task to control the direction and magnitude of the weight updates. The backbone network weights and the stacked gradients are used as the inputs of the meta-learner. The hyperparameters are generated as shown in Eq. (12).

$(α, β) = g_{φ} [\nabla_{θ_{t}} L_{T_{i}}^{S_{i}} f (θ_{t}), θ_{t}]$ (12)

Note that the parameters φ of the small meta-network are only updated in the outer loop optimization. The learning rate and weight decay coefficient generated for each layer are then used to update the model weight parameters. Since there are multiple episodes of source task learning, the weight parameters need to be updated multiple times. After several inner loop updates, the adaptive network weight parameters are obtained for each task, as shown in Eq. (13).

$θ_{t} \leftarrow β ⊙ θ - α ⊙ \nabla_{θ} L_{T_{i}}^{S_{i}} f (θ)$ (13)
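The inner-loop step of Eqs. (12) and (13) can be sketched as follows. This is a simplified illustration under stated assumptions: the meta-network input is taken to be the layer-wise mean of the gradients and of the weights, one scalar α and one scalar β are produced per layer, and the meta-network weights `phi` are random and untrained. The function names are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def mlp(phi, state):
    """Small meta-network g_phi: ReLU between layers, linear output layer."""
    out = state
    for idx, (W, b) in enumerate(phi):
        out = [sum(w * a for w, a in zip(row, out)) + bias for row, bias in zip(W, b)]
        if idx < len(phi) - 1:
            out = [max(0.0, a) for a in out]
    return out


def generate_hyperparameters(phi, grads, theta):
    """Eq. (12): map layer-wise gradient and weight means to (alpha, beta) per layer."""
    state = [mean(g) for g in grads] + [mean(w) for w in theta]
    out = mlp(phi, state)
    n_layers = len(theta)
    return out[:n_layers], out[n_layers:]


def inner_update(theta, grads, alpha, beta):
    """Eq. (13): theta_t <- beta * theta - alpha * grad, applied layer by layer."""
    return [[b * w - a * g for w, g in zip(layer_w, layer_g)]
            for layer_w, layer_g, a, b in zip(theta, grads, alpha, beta)]


rng = random.Random(0)
theta = [[0.5, -0.2, 0.1], [1.0, 0.3]]       # two layers with flattened weights (toy values)
grads = [[0.1, 0.0, -0.1], [0.2, -0.4]]      # support-set gradients (toy values)
sizes = [4, 8, 8, 4]                         # 3-layer MLP: 2 * n_layers inputs and outputs
phi = [([[rng.gauss(0.0, 0.1) for _ in range(sizes[i])] for _ in range(sizes[i + 1])],
        [0.0] * sizes[i + 1]) for i in range(3)]

alpha, beta = generate_hyperparameters(phi, grads, theta)
theta_adapted = inner_update(theta, grads, alpha, beta)

# With beta = 1 and a constant alpha, the rule reduces to plain gradient descent.
plain_sgd = inner_update(theta, grads, alpha=[0.1, 0.1], beta=[1.0, 1.0])
```

Setting β to 1 and α to a constant recovers the standard MAML inner step, which is why the rule can be read as a generalization of it.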

3.4.2 Outer-loop update

To evaluate the network parameters, the query set is used to verify the previously trained model. The updated parameters θ_t are passed into the N-BEATS model, and the loss $L_{T_{i}}^{Q_{i}} f (θ_{t})$ on the query set is calculated. Note that the parameters θ_t are not updated during this process. Each source task, over multiple episodes, produces its own loss value. Therefore, the losses on multiple source tasks are accumulated, and the goal is to minimize the sum of the loss values over all tasks, as shown in Eq. (14).

$L = \sum_{T_{i}} L_{T_{i}}^{Q_{i}} f (θ_{t})$ (14)

In the outer loop, the total loss gradient of multiple source tasks Q_i is calculated, and the gradient descent algorithm is used to update the outer loop parameters θ iteratively. The gradient descent adaptive step size is η, and the update process is shown in equation (15).

$θ \leftarrow θ - η \nabla_{θ} \sum_{T_{i}} L_{T_{i}}^{Q_{i}} f (θ_{t})$ (15)

Additionally, the outer loop uses the gradient descent algorithm to update the parameters φ of the small meta-network and thereby finely control the learning rate α and decay rate β corresponding to each layer in the inner loop update rule. The hyperparameter γ is a fixed learning rate, and the update process is shown in Eq. (16).

$φ \leftarrow φ - γ \nabla_{φ} \sum_{T_{i}} L_{T_{i}}^{Q_{i}} f (θ_{t})$ (16)
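The bilevel structure of Eqs. (13) to (16) can be made concrete on a deliberately tiny problem. The sketch below is not the paper's model: it replaces the residual stack with a scalar linear model f(x) = θx, uses a single inner step, and treats α and β as directly learnable scalars that stand in for the meta-network parameters φ. These simplifications make it possible to write the outer-loop gradients analytically with the chain rule instead of relying on an automatic differentiation library. All names and values are hypothetical.

```python
def mse(theta, xs, ys):
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_grad(theta, xs, ys):
    """Derivative of the mean squared error with respect to theta."""
    return sum(2.0 * x * (theta * x - y) for x, y in zip(xs, ys)) / len(xs)


def mse_hess(xs):
    """Second derivative of the mean squared error (constant for a linear model)."""
    return sum(2.0 * x * x for x in xs) / len(xs)


def adapt(theta, alpha, beta, xs, ys):
    """Inner loop, Eq. (13): one step on the support set."""
    return beta * theta - alpha * mse_grad(theta, xs, ys)


def meta_loss(theta, alpha, beta, tasks):
    """Eq. (14): sum of query losses after adaptation."""
    return sum(mse(adapt(theta, alpha, beta, *s), *q) for s, q in tasks)


def meta_step(theta, alpha, beta, tasks, eta, gamma):
    """Outer loop: Eq. (15) for theta, and the analogue of Eq. (16) for (alpha, beta)."""
    g_theta = g_alpha = g_beta = 0.0
    for (xs_s, ys_s), (xs_q, ys_q) in tasks:
        g_s = mse_grad(theta, xs_s, ys_s)
        adapted = beta * theta - alpha * g_s
        g_q = mse_grad(adapted, xs_q, ys_q)                  # dL_Q / d(adapted)
        g_theta += g_q * (beta - alpha * mse_hess(xs_s))     # d(adapted) / d(theta)
        g_alpha += g_q * (-g_s)                              # d(adapted) / d(alpha)
        g_beta += g_q * theta                                # d(adapted) / d(beta)
    return theta - eta * g_theta, alpha - gamma * g_alpha, beta - gamma * g_beta


# Toy tasks y = a * x with different slopes (not from the paper).
slopes = [1.0, 2.0, 3.0]
support_x, query_x = [0.5, 1.0], [0.75, 1.25]
tasks = [((support_x, [a * x for x in support_x]),
          (query_x, [a * x for x in query_x])) for a in slopes]

theta, alpha, beta = 0.0, 0.1, 1.0
loss_before = meta_loss(theta, alpha, beta, tasks)
for _ in range(200):
    theta, alpha, beta = meta_step(theta, alpha, beta, tasks, eta=0.005, gamma=0.005)
loss_after = meta_loss(theta, alpha, beta, tasks)
```

In this toy setting the query loss after adaptation drops as the outer loop tunes the initialization together with the inner-loop step size and decay, which mirrors the role the meta-network plays in the full method.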

3.5 Meta-testing

The optimal network parameters ω^* are obtained by learning the meta-knowledge of different source tasks. Therefore, the network can quickly adapt to the target task with only few-shot data, where ω^* represents the best initial parameters θ of the network and the optimal learning rate α′ and decay rate β′ in the inner loop update. Additionally, there are support sets and query sets in the new task. Through the fine-tuning method, the previously trained parameters are used to learn and fine-tune the model parameters with the support set, where $L_{T}^{S} f (θ)$ represents the loss function on the new task support set. The fine-tuning process is shown in Equation (17).

$θ \leftarrow β^{'} ⊙ θ - α^{'} ⊙ \nabla_{θ} L_{T}^{S} f (θ)$ (17)

Fine-tuning is used to find the network parameters $θ^{*}$ adapted to the new task so that the time series of unlabeled samples can be quickly predicted. The meta-learning parameter update algorithm is as follows:

Algorithm 1 Meta-parameter updating framework

Require: Task distribution p(T), learning rate λ, hyperparameters α, β

1: randomly initialize φ, θ₀

2: while not done do

3: Sample a batch of tasks T from p(T)

4: for all T_i = {S_i, Q_i} ∈ T do

5: Compute the loss $L_{T_{i}}^{S_{i}} f (θ_{t})$ on S_i of T_i

6: Compute the gradient $\nabla_{θ_{t}} L_{T_{i}}^{S_{i}} f (θ_{t})$

7: Generate hyperparameters $(α, β) = g_{φ} [\nabla_{θ_{t}} L_{T_{i}}^{S_{i}} f (θ_{t}), θ_{t}]$

8: Obtain adaptive parameters $θ_{t}^{'} = β ⊙ θ_{t} - α ⊙ \nabla_{θ_{t}} L_{T_{i}}^{S_{i}} f (θ_{t})$

9: end for

10: Compute the loss $L_{T_{i}}^{Q_{i}} f (θ_{t}^{'})$ on Q_i of T_i

11: Update the parameters $θ \leftarrow θ - λ \nabla_{θ} \sum_{T_{i}} L_{T_{i}}^{Q_{i}} f (θ_{t}^{'})$

12: end while

13: Sample a few time series S from a new task

14: Compute the loss $L_{new}^{S}$ and fine-tune the parameters θ

4 Experiment

In this section, we report the results of an extensive experimental study to evaluate the effectiveness of our proposed algorithm relative to the baseline algorithms. Two metrics are chosen to evaluate the prediction model in the experiment, the root mean square error (RMSE) and the symmetric mean absolute percentage error (sMAPE), as shown in Eqs. (18) and (19), where $y_{i}$ denotes the actual value, ${\hat{y}}_{i}$ denotes the predicted value, and H represents the prediction sequence range.

$RMSE = \sqrt{\frac{1}{H} \sum_{i = 1}^{H} {(y_{i} - {\hat{y}}_{i})}^{2}}$ (18)

$sMAPE = \frac{100}{n} \sum_{i = 1}^{n} \frac{| {\hat{y}}_{i} - y_{i} |}{(| {\hat{y}}_{i} | + | y_{i} |) / 2}$ (19)
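The two metrics translate directly into code. The sketch below follows Eqs. (18) and (19); treating a term with a zero denominator as zero is an assumption added here to avoid division by zero, since the text does not say how that case is handled.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error, Eq. (18)."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error in percent, Eq. (19)."""
    terms = []
    for y, p in zip(y_true, y_pred):
        denom = (abs(p) + abs(y)) / 2.0
        # Assumption: a point where both values are zero contributes no error.
        terms.append(abs(p - y) / denom if denom > 0.0 else 0.0)
    return 100.0 / len(terms) * sum(terms)


# Toy values for illustration only.
error_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
error_smape = smape([100.0, 200.0], [110.0, 180.0])
```

RMSE is expressed in the units of the series, whereas sMAPE is scale-free, which makes it the more natural choice for comparing tasks from different domains.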

4.1 Dataset

The time series datasets used in the experiments are taken from the UCR Time Series Classification Archive [22], which contains 128 different subsets of time series data, including speech signals, financial data, medical data, etc. To ensure the accuracy of the experiment, we remove sequences with missing values and sequences shorter than 50 points. Finally, 82 time series datasets from different domains are selected to evaluate the model, which can be regarded as 82 prediction tasks. In addition, we divide each task data subset into a training set and a test set.

4.2 Baseline models

In the experiments, our residual stack model is built by stacking multiple blocks, and each block is a sequence of fully connected layers with a forecast/backcast branch at the end. The architecture runs residual recursion over the entire input window and sums the block outputs to make its final prediction. We selected the following models as baselines:

LSTM The LSTM model is a recurrent neural network variant with the ability to capture long-term dependencies. Its structure includes an input gate, a forget gate, an output gate, and a memory unit. With this structure, LSTM can efficiently process time series data.

CNN The CNN architecture generally comprises convolutional and pooling layers. It was originally applied in the field of computer vision, but it also shows certain potential in time series prediction. We convert time series data into image form to extract local features of sequence data, which provide valuable information for predicting time series tasks.

MLP A multilayer perceptron (MLP) is a classical feedforward neural network structure consisting of multiple fully connected hidden and output layers. Each neuron is connected to all the neurons in the previous layer, and the information is transferred through a nonlinear activation function.

BiLSTM The bidirectional LSTM (BiLSTM) is a model that combines LSTM units in both forward and backward directions. It not only captures the contextual information of the current location but also can utilize the information of the future location. Thus it can capture the dynamic change patterns in time series data more comprehensively.

We also compare three types of training frameworks. The first group combines the MAML framework with the small meta-network (MLHT) to generate per-step hyperparameters adapted to each prediction task. The second group removes the small meta-network (MAML) and performs cross-task learning on the training tasks, so that the generalized optimal initialization parameters quickly adapt to the prediction task. The third group removes the MAML framework (Base) and directly uses the time series data of the prediction task for training.

4.3 Training setup

In this experiment, the 82 prediction tasks are divided into a training task set and a prediction task set: one task is selected as the prediction task, and the rest are used as training tasks. Every five tasks form a batch, and each task uses 5 time series for training. The experiment uses the Adam optimizer. The base learning method is trained directly on the prediction task; its learning rate settings are shown in Table 1. For the MAML framework, Table 1 lists the hyperparameter settings in the meta-training process for different models, where α is the learning rate of the backbone network and β is the learning rate of the meta-network. For our proposed meta-learning method, the inner loop adaptively generates the learning rate and weight decay rate of each step. To achieve more fine-grained control, the inner loop parameter that updates the learning rate inside the optimizer is set to $10^{-2}$; therefore, no additional settings are needed. The training processes of the three learning frameworks are consistent, and the sequence prediction range is set to H=20.

Table 1
Training process hyperparameter settings

Learning rate	Ours	LSTM	CNN	MLP	BiLSTM
Base	0.02	0.2	0.03	0.2	0.01
MAML-α	0.001	0.001	0.01	0.002	0.001
MAML-β	0.02	0.2	0.02	0.3	0.02
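
The task protocol described in this section can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the placeholder task identifiers, and the handling of the final incomplete batch (81 training tasks do not divide evenly by five) are assumptions introduced here.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_task_out(tasks: Sequence[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (prediction_task, training_tasks), holding out each task once."""
    for i, held_out in enumerate(tasks):
        yield held_out, [t for j, t in enumerate(tasks) if j != i]


def task_batches(train_tasks: Sequence[str], batch_size: int = 5) -> List[List[str]]:
    """Group training tasks into meta-batches of `batch_size` tasks.

    Assumption: the last batch is simply allowed to be smaller.
    """
    return [
        list(train_tasks[i:i + batch_size])
        for i in range(0, len(train_tasks), batch_size)
    ]


# Hypothetical task identifiers standing in for the 82 prediction tasks.
tasks = [f"task_{i:02d}" for i in range(82)]
splits = list(leave_one_task_out(tasks))

# First split: one prediction task and 81 training tasks in batches of five.
target, train = splits[0]
batches = task_batches(train)
```

Each of the 82 splits yields one prediction task and 81 training tasks; under the assumption above, these form 16 full batches and one batch containing a single task.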

4.4 Experimental results

To verify that the proposed method improves the prediction performance on few-shot time series tasks, one task is held out as the test task and the remaining tasks are used as training tasks for multitask training. Each task serves as the test task in turn.

Figure 5 shows the time series prediction results of the residual stack model combined with different meta-learning methods. The horizontal axis represents the prediction range (H=20), and the vertical axis represents the values predicted under each meta-learning method. The predictions of the MLHT framework are closest to the real values, which suggests that the method finds hyperparameters suited to each task faster in cross-domain scenarios. The adaptability of the network parameters therefore plays an important role in cross-task few-shot time series prediction.

Fig. 5

Comparison of the prediction effects of different meta-learning frameworks.

Figure 6 shows the time series prediction results of different models using the same meta-learning method. The prediction task exhibits a certain periodicity and trend. The circles represent the real values of the sequence, and the squares represent the values predicted by the proposed model. The predictions of the proposed residual stack model are closer to the real values than those of the baseline models.

Fig. 6

Comparison of the time series forecasting effects of different models.

Table 2 reports, for each model, the proportion of tasks on which MLHT improves (↑) or degrades (↓) the sMAPE relative to the other two methods. A task is counted only when the difference between the two sMAPE values is greater than 0.1, that is, when the performance changes noticeably. Therefore, the improved and degraded proportions may sum to less than 100%. Overall, applying our method improves performance on most tasks for every model. Compared with the second group (MAML), the proposed method improves performance on 67.07% of the tasks, indicating a clear gain in the sMAPE metric for most tasks.

Table 2

Task improvement and decrease ratio measured by sMAPE

	Base	MAML
Ours	(58.53%) ↑ (28.04%) ↓	(67.07%) ↑ (15.85%) ↓
LSTM	(68.29%) ↑ (23.17%) ↓	(66.51%) ↑ (28.04%) ↓
CNN	(59.75%) ↑ (40.24%) ↓	(56.09%) ↑ (43.90%) ↓
MLP	(53.09%) ↑ (41.63%) ↓	(58.54%) ↑ (37.80%) ↓
BiLSTM	(65.85%) ↑ (25.61%) ↓	(69.51%) ↑ (30.48%) ↓
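
The counting rule behind Table 2 can be illustrated with a short sketch. It assumes the common sMAPE definition on a 0 to 200 scale and treats a task as improved when the reference sMAPE exceeds the MLHT sMAPE by more than the threshold; the function names and the toy values are hypothetical and are not taken from the experiments.

```python
from typing import Sequence, Tuple


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Symmetric MAPE on a 0-200 scale (assumed definition)."""
    terms = []
    for y, y_hat in zip(actual, forecast):
        denom = abs(y) + abs(y_hat)
        # Convention assumed here: a 0/0 term contributes zero error.
        terms.append(0.0 if denom == 0 else abs(y - y_hat) / denom)
    return 200.0 * sum(terms) / len(terms)


def up_down_ratio(
    smape_mlht: Sequence[float],
    smape_ref: Sequence[float],
    threshold: float = 0.1,
) -> Tuple[float, float]:
    """Fractions of tasks where MLHT is better or worse by more than `threshold`.

    Lower sMAPE is better; tasks within the threshold are not counted,
    so the two fractions may sum to less than 1.
    """
    n = len(smape_mlht)
    up = sum(1 for a, b in zip(smape_mlht, smape_ref) if b - a > threshold)
    down = sum(1 for a, b in zip(smape_mlht, smape_ref) if a - b > threshold)
    return up / n, down / n


# Toy per-task sMAPE values (illustrative only).
mlht_scores = [10.0, 20.0, 30.0, 40.0]
ref_scores = [12.0, 20.05, 29.0, 40.0]
up, down = up_down_ratio(mlht_scores, ref_scores)
```

In the toy example one task improves, one degrades, and two fall within the threshold, so the two fractions are 0.25 each. With 82 tasks, the 67.07% and 15.85% reported for the proposed model against MAML correspond to 55 and 13 tasks, respectively.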

Figure 7 compares the MLHT and MAML methods for the different models. The horizontal axis represents the 82 prediction tasks, and the vertical axis represents the difference in sMAPE between the two methods. A difference greater than zero (Δ > 0) indicates that the model combined with the MLHT method outperforms the same model combined with the MAML method. The prediction performance improves on most tasks and degrades on only a few.

Fig. 7

sMAPE comparison between MLHT and MAML fusion meta-network.

Table 3 reports the RMSE between the predicted and real sequences for a randomly selected subset of tasks. Using different models as the backbone network, ablation experiments are carried out with the three prediction methods. Overall, MAML outperforms the traditional Base method, while our method outperforms the MAML framework. This indicates that the MLHT method adapts to new tasks faster in cross-task time series training. Adding the small meta-network to the MAML framework allows the learning rate and weight decay coefficient to be adjusted adaptively and dynamically, which helps to find suitable network hyperparameters and improves the generalization ability. In addition, compared with the LSTM and CNN models, the proposed model achieves better predictive performance on most time series tasks under the MLHT meta-learning method, and performance degradation is observed on only a few prediction tasks. Residual stacking allows the network to capture time series features effectively, so the proposed model can adapt to different time series forecasting tasks. Thus, the proposed residual stack model combined with the meta-learning method achieves the best prediction results among the compared models on most tasks.

Table 3

The RMSE values for the prediction task with different models

Model		Ours		LSTM	BiLSTM	CNN	MLP
Method	MLHT	Base	MAML	MLHT	MLHT	MLHT	MLHT
Adiac	0.1532	0.1557	0.1569	0.1606	0.1563	0.1634	0.1554
ArrowHead	0.2405	0.2413	0.241	0.2412	0.2496	0.3081	0.3151
BME	0.8952	0.5921	0.8978	0.8857	0.9044	0.867	0.918
Beef	0.5909	0.9002	0.5913	0.5849	0.5866	0.5503	0.6078
CBF	0.1621	0.7774	0.1631	0.1999	0.1654	0.1661	0.1621
Coffee	0.0909	0.0905	0.0908	0.1046	0.1035	0.1093	0.1283
CricketX	0.6431	0.6449	0.6444	0.6498	0.6337	0.6504	0.6472
DiatomSizeReduction	0.1147	0.1187	0.1215	0.1634	0.2469	0.1597	0.1235
ECG5000	0.9044	0.6999	0.9042	0.9037	0.9056	0.9049	0.9044
EthanolLevel	0.0126	0.0216	0.0158	0.0267	0.0138	0.0169	0.0151
FiftyWords	0.2999	0.3017	0.3007	0.3078	0.2983	0.3166	0.3007
FreezerRegularTrain	0.3922	0.412	0.4103	0.4305	0.4273	0.4329	0.4288
FreezerSmallTrain	0.4499	0.5168	0.5137	0.5171	0.6549	0.5639	0.628
Fish	0.141	0.1412	0.1425	0.1568	0.1399	0.2205	0.1519
GunPointOldVersusYoung	0.3132	0.3133	0.3133	0.3142	0.3151	0.315	0.3196
GunPoint	0.2953	0.2956	0.2952	0.2953	0.3054	0.3038	0.323
Herring	0.0961	0.0961	0.0962	0.1053	0.0962	0.2038	0.1002
InlineSkate	0.6831	0.6833	0.6833	0.6837	0.6843	0.6845	0.6848
InsectEPGRegularTrain	0.9107	0.9192	0.9164	0.9414	0.899	0.8947	0.9021
InsectWingbeatSound	0.2442	0.2421	0.2434	0.2406	0.2466	0.2471	0.2526
Mallat	0.1189	0.1205	0.1195	0.1257	0.1417	0.1441	0.1645
Meat	0.0123	0.0126	0.0126	0.013	0.0268	0.1189	0.0622
MedicalImages	0.3571	0.3537	0.3537	0.3535	0.3578	0.3633	0.3636
Plane	0.349	0.3504	0.3503	0.3619	0.3526	0.3638	0.3723
OSULeaf	0.0059	0.8336	0.0058	0.0212	0.0218	0.0334	0.0338
SemgHandGenderCh2	0.6449	0.6455	0.6451	0.6451	0.6498	0.6493	0.647
StarLightCurves	0.5501	0.5485	0.55	0.5469	0.5523	0.5583	0.5505
Strawberry	0.2663	0.2881	0.2704	0.269	0.2729	0.3155	0.3018
UMD	0.6522	0.6572	0.6556	0.6696	0.6278	0.655	0.6398
Trace	0.0918	0.0942	0.0951	0.1042	0.0926	0.1038	0.0936
UWaveGestureLibraryY	0.8898	0.8905	0.89	0.8903	0.9113	0.8931	0.8996
Wine	0.0142	0.0148	0.0146	0.0126	0.0527	0.2103	0.2047

To analyze these results in further detail and draw reliable conclusions from them, we performed the Friedman test on the sMAPE performance of all the models in our experiments over the 82 tasks. Because the p-value of the test is less than the significance level (α = 0.05), we rejected the null hypothesis, indicating that there is a significant difference in the performance of these methods. We therefore applied the Nemenyi test as a post hoc analysis to investigate the relative performance of the model algorithms. Figure 8 shows the critical difference diagram for the sMAPE. The top line in the figure is the axis along which the average rank of each model algorithm is plotted, from the lowest rank on the left (best performance) to the highest rank on the right (worst performance). The diagram shows that our model performs best and that the MLHT meta-learning algorithm achieves a better average rank than the MAML and Base methods, which supports the conclusion that the MLHT method generalizes better.

Fig. 8

Ranking and critical differences diagram of the performance metric sMAPE on all tasks.
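
The quantities behind a critical difference diagram can be sketched as follows. This is an illustrative computation under stated assumptions, not the authors' analysis code: the function names and the toy score table are hypothetical, the Friedman statistic is given without tie correction, and the critical values are the two-tailed Nemenyi values at α = 0.05 tabulated by Demšar (2006) for up to ten methods.

```python
import math
from typing import List, Sequence

# Nemenyi critical values q_0.05, indexed by the number of compared methods k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def rank_row(scores: Sequence[float]) -> List[float]:
    """Rank the methods on one task (1 = lowest error); ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for p in range(i, j + 1):
            ranks[order[p]] = mean_rank
        i = j + 1
    return ranks


def average_ranks(table: Sequence[Sequence[float]]) -> List[float]:
    """Average rank of each method; rows are tasks, columns are methods."""
    k = len(table[0])
    totals = [0.0] * k
    for row in table:
        for j, r in enumerate(rank_row(row)):
            totals[j] += r
    return [t / len(table) for t in totals]


def friedman_statistic(avg_ranks: Sequence[float], n_tasks: int) -> float:
    """Friedman chi-square statistic (no tie correction)."""
    k = len(avg_ranks)
    spread = sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    return 12 * n_tasks / (k * (k + 1)) * spread


def nemenyi_cd(k: int, n_tasks: int) -> float:
    """Critical difference: two methods differ if their average ranks differ by more."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6 * n_tasks))


# Toy sMAPE table: 4 tasks x 3 methods (illustrative values only).
toy = [[1.0, 2.0, 3.0], [1.0, 3.0, 2.0], [1.0, 2.0, 3.0], [2.0, 2.0, 3.0]]
ranks = average_ranks(toy)
stat = friedman_statistic(ranks, n_tasks=len(toy))

# Critical difference for k = 5 methods over 82 tasks (k = 5 is an assumption).
cd = nemenyi_cd(5, 82)
```

For example, if five methods were compared over the 82 tasks, two methods would differ significantly when their average ranks differ by more than about 0.67. The number of methods actually compared in Figure 8 determines the value that applies there.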

4.5 Scalability

Table 4 reports the running time, RMSE, and sMAPE for different numbers of training tasks. To examine the scalability of our model, we reduce the original task set and train with 20, 30, 40, 50, and 80 tasks before predicting new target tasks. As the number of training tasks increases, the running time grows from 2.934 s to 9.210 s while the RMSE decreases from 0.3210 to 0.3115, which suggests that the proposed model for small-data tasks scales to larger task sets. The model already performs well with only 40 training tasks (RMSE 0.3185, sMAPE 32.91).

Table 4
Scalability under different number of training tasks

	RMSE	SMAPE	TIME(s)
nums_20	0.3210	33.89	2.934
nums_30	0.3203	36.75	4.791
nums_40	0.3185	32.91	5.652
nums_50	0.3187	33.41	8.935
nums_80	0.3115	32.90	9.210

Table 5 compares the running time and memory usage of the model under different meta-learning methods. For the same number of training tasks, the MLHT method uses more memory (0.45 GB versus 0.21 GB), but it achieves lower RMSE and sMAPE and a shorter running time than MAML. Therefore, MLHT offers better overall performance than MAML when the amount of data is small.

Table 5

Comparison of different meta-learning methods

Method	MAML		MLHT
Nums	nums_20	nums_40	nums_20	nums_40
RMSE	0.3306	0.3261	0.321	0.3185
SMAPE	36.73	35.02	33.89	32.91
TIME	5.22s	7.42s	2.93s	5.65s
memory	0.21GB	0.21GB	0.45GB	0.45GB

5 Conclusions

This paper combines a meta-learning adaptive method with a residual stack model to form a new time series prediction method. Cross-domain multitask learning is used to learn meta-knowledge through the MLHT framework so that the model can adapt to new tasks quickly. This mitigates the overfitting that arises in few-shot time series forecasting. The proposed method shows clear performance improvements on most tasks. Furthermore, the residual stack model outperforms the other models in time series prediction, and its architecture can capture the shared characteristics of multiple time series and obtain better prediction results.

Although this study focuses on single-dimensional few-shot time series, many real-world problems involve complex relationships among multiple related variables. In future work, we expect to extend the proposed method to multidimensional time series forecasting. By integrating the changes of multiple variables, we expect to capture the dynamics of the system more accurately and provide decision-makers with more comprehensive forecasting results.

Acknowledgments

This work is supported by the Zhejiang Provincial Natural Science Foundation of China under grant LQ22F010008.

References

1. Mostafa Majidpour, Hamidreza Nazaripouya, Peter Chu, Hemanshu Pota, Rajit Gadh, Fast univariate time series prediction of solar power for real-time control of energy storage system, Forecasting 1(1) (2018), 107–120.

2. Md Mijanur Rahman, Mohammad Shakeri, Sieh Kiong Tiong, Fatema Khatun, Nowshad Amin, Jagadeesh Pasupuleti, Mohammad Kamrul Hasan, Prospective methodologies in hybrid renewable energy systems for energy prediction using artificial neural networks, Sustainability 13(4) (2021), 2393.

3. Hye Jin Kam, Jin Ok Sung, Rae Woong Park, Prediction of daily patient numbers for a regional emergency medical center using time series analysis, Healthcare Informatics Research 16(3) (2010), 158–165.

4. Tomoharu Iwata, Atsutoshi Kumagai, Few-shot learning for time-series forecasting, arXiv preprint arXiv:2009.14379, 2020.

5. Feng Xiao, Lu Liu, Jiayu Han, Degui Guo, Shang Wang, Hai Cui, Tao Peng, Meta-learning for few-shot time series forecasting, Journal of Intelligent & Fuzzy Systems 43(1) (2022), 325–341.

6. Yingjun Chen, Yongtao Hao, A feature weighted support vector machine and k-nearest neighbor algorithm for stock market indices prediction, Expert Systems with Applications 80 (2017), 340–355.

7. Roger Frigola, Carl Edward Rasmussen, Integrated preprocessing for Bayesian nonlinear system identification with Gaussian processes. In 52nd IEEE Conference on Decision and Control, pages 5371–5376. IEEE, 2013.

8. Sang-Ho Park, Ju-Hong Lee, Jae-Won Song, Tae-Su Park, Forecasting change directions for financial time series using hidden Markov model. In Rough Sets and Knowledge Technology: 4th International Conference, RSKT 2009, Gold Coast, Australia, July 14-16, 2009. Proceedings 4, pages 184–191. Springer, 2009.

9. Yong Yu, Xiaosheng Si, Changhua Hu, Jianxun Zhang, A review of recurrent neural networks: LSTM cells and network architectures, Neural Computation 31(7) (2019), 1235–1270.

10. Jou-Fan Chen, Wei-Lun Chen, Chun-Ping Huang, Szu-Hao Huang, An-Pin Chen, Financial time-series data analysis using deep convolutional neural networks. In 2016 7th International Conference on Cloud Computing and Big Data (CCBD), pages 87–92. IEEE, 2016.

11. Weibin Zhang, Yinghao Yu, Yong Qi, Feng Shu, Yinhai Wang, Short-term traffic flow prediction based on spatiotemporal analysis and CNN deep learning, Transportmetrica A: Transport Science 15(2) (2019), 1688–1711.

12. Maryam Imani, Electrical load-temperature CNN for residential load forecasting, Energy 227 (2021), 120480.

13. Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, Xifeng Yan, Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting, Advances in Neural Information Processing Systems 32 (2019).

14. Boris Oreshkin, Dmitri Carpov, Nicolas Chapados, Yoshua Bengio, N-BEATS: Neural basis expansion analysis for interpretable time series forecasting, arXiv preprint arXiv:1905.10437, 2019.

15. Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, Chunfang Liu, A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7, 2018, Proceedings, Part III 27, pages 270–279. Springer, 2018.

16. Mingsheng Long, Han Zhu, Jianmin Wang, Michael Jordan, Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pages 2208–2217. PMLR, 2017.

17. Xiaoyong Jin, Youngsuk Park, Danielle Maddix, Hao Wang, Yuyang Wang, Domain adaptation for time series forecasting via attention sharing. In International Conference on Machine Learning, pages 10280–10297. PMLR, 2022.

18. Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, Aram Galstyan, Multitask learning and benchmarking with clinical time series data, Scientific Data 6(1) (2019), 96.

19. Razvan-Gabriel Cirstea, Darius-Valer Micu, Gabriel-Marcel Muresan, Chenjuan Guo, Bin Yang, Correlated time series forecasting using multi-task deep neural networks. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1527–1530, 2018.

20. Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, Yoshua Bengio, Meta-learning framework with applications to zero-shot time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021), 9242–9250.

21. Bernardo Perez Orozco, Stephen Roberts, Zero-shot and few-shot time series forecasting with ordinal regression recurrent neural networks, arXiv preprint arXiv:2003.12162, 2020.

22. Chelsea Finn, Pieter Abbeel, Sergey Levine, Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.

23. Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra et al., Matching networks for one shot learning, Advances in Neural Information Processing Systems 29 (2016).

24. Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, Pieter Abbeel, A simple neural attentive meta-learner, arXiv preprint arXiv:1707.03141, 2017.

25. Zhenguo Li, Fengwei Zhou, Fei Chen, Hang Li, Meta-SGD: Learning to learn quickly for few-shot learning, arXiv preprint arXiv:1707.09835, 2017.

26. Sachin Ravi, Hugo Larochelle, Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2016.

27. Sungyong Baik, Myungsub Choi, Janghoon Choi, Heewon Kim, Kyoung Mu Lee, Meta-learning with adaptive hyperparameters, Advances in Neural Information Processing Systems 33 (2020), 20755–20765.