Abstract
Time series forecasting has a wide range of applications in various fields. To reduce the dependence on large volumes of time series data, a meta-learning-based few-shot time series forecasting method is proposed. This method uses a residual stack module as its backbone and connects the residuals forward and backward through a multilayer fully connected network so that the model and the meta-learning framework can be seamlessly combined. Empirical knowledge of different time series tasks is obtained through meta-training. To enable fast adaptation to new prediction tasks, a small meta-network is introduced to adaptively and dynamically generate the learning rate and weight decay coefficient of each step in the network. This method can use sequences with different data distribution characteristics for cross-task learning, and each training task needs only a small number of time series to achieve sequence prediction for the target task. The results show that, compared with the two baselines, the proposed method improves performance on 67.07% and 58.53% of the evaluated tasks, respectively. Thus, this method can effectively alleviate the problems caused by insufficient data during training and has broad application prospects in the field of time series.
Introduction
Time series prediction (TSP) has been a hot research topic for decades, and it has always played a vital role in energy systems [1, 2], medical care [3], finance and many other fields. Analyzing time series data and extracting valuable insights from them makes it possible to forecast future trends, thereby informing decision-making across these fields.
With the rapid development of deep learning technology in various fields, deep neural networks (DNNs) have made great progress in the field of time series forecasting. However, the success of deep learning technology hinges on the availability of large-scale training data. To promote the further application of deep learning in the field of time series prediction, two critical challenges must be urgently addressed: 1) In practical scenarios, it is not always possible to acquire sufficient training data. Typical scenarios include the diagnosis of rare diseases, traffic flow prediction for new roads, and power consumption prediction for new equipment in a power system. In these scenarios, data scarcity leads to model overfitting. Consequently, the prediction performance decreases, which limits reliable decision-making. 2) Traditional time series prediction algorithms often have difficulty adapting to different prediction tasks and reach their best prediction performance only through repeated hyperparameter adjustment, which requires considerable computational resources and time. Therefore, implementing general and efficient deep learning models with small sample data is still challenging.
To facilitate learning with limited data, few-shot learning (FSL) has received much attention in recent years. FSL aims to train deep learning models with better generalization performance using a small amount of labeled data. Based on this, researchers proposed meta-learning approaches that realize cross-task training by learning the existing empirical knowledge of multiple different tasks and then transferring this knowledge to the target task, allowing for rapid adaptation to the new task. This effectively reduces the number of training samples that the model needs and has prompted researchers to employ small-sample learning for time series forecasting. Among them, Iwata et al. proposed a meta-learning-based bidirectional LSTM model to solve the small-sample time series problem [4]. This model utilizes the attention mechanism on the support set to minimize the error on the query set and achieves better prediction performance with small datasets. Subsequently, Feng Xiao et al. proposed a small-sample prediction mechanism based on meta-learning [5] that has been shown to outperform traditional prediction methods in small-sample time series prediction. However, existing pure small-sample meta-learning algorithms are unable to tune the hyperparameters for different time series tasks, resulting in slow convergence.
To address these problems, this paper proposes a few-shot time series prediction algorithm based on a meta-learning framework. The method utilizes knowledge shared across different tasks to learn generic time series modeling capabilities. Additionally, we introduce a multilayer perceptron (MLP) as a small meta-network to adaptively adjust hyperparameters for internal training tasks, facilitating the optimal solution of hyperparameter combinations and realizing generic small-sample time series forecasting. The main contributions of this paper include the following:
1) To solve the problem of insufficient data in time series tasks, which leads to overfitting of the trained model, we integrate the residual stack module of the N-BEATS model with the model-agnostic meta-learning (MAML) algorithm. MAML adapts to different tasks by learning model initialization parameters shared across tasks, and the residual mechanism continuously "corrects" the prediction results to better capture the intrinsic temporal feature information. The experimental results show that the meta-learning approach using the residual stack module significantly improves the prediction accuracy in small-sample time series tasks compared to that of traditional deep learning models.
2) To find the optimal hyperparameter combination for each time series task, we propose meta-learning for hyperparameter tuning (MLHT), which adds a small meta-network to MAML to dynamically generate network hyperparameters for each task, yielding an adaptive learning strategy that improves prediction performance. The experimental results show faster convergence compared to that of classical MAML.
3) The variation among task samples in cross-task learning poses several challenges, such as inconsistent lengths of time series data from different domains. As a result, the data cannot be directly fed into the model for training and must be preprocessed. To solve this problem, this paper uses a bidirectional gated recurrent unit (Bi-GRU) neural network as an encoder to deal with disparate data lengths across tasks and to map the raw data from different tasks to a unified representation space, which satisfies the requirements of cross-task learning.
The remainder of this paper is organized as follows. Section 2 introduces the current state of research. Section 3 introduces the work related to the few-shot time series forecasting method. In Section 4, we compare our method with currently existing methods.
Related work
Time series forecasting
Time series forecasting predicts future development by analyzing time series to find internal regularities. In the early stages of time series forecasting, statistical models such as the autoregressive moving average (ARMA) and its variant, the autoregressive integrated moving average (ARIMA), were extensively utilized. However, these models fell short in characterizing the nonlinear variations inherent in time series data. To solve this problem, researchers have used nonlinear models such as support vector machines (SVM) [6], Gaussian processes (GP) [7], and hidden Markov models [8] for time series forecasting, which are able to handle complex time series. Nevertheless, these methodologies have some limitations, primarily in handling the sequential dependencies among input data, which compromises their efficiency in time series forecasting tasks.
Deep learning applications are progressively expanding across various domains, including time series prediction tasks. The two most prevalent network architectures in these tasks are the long short-term memory (LSTM) network [9] and the convolutional neural network (CNN) [10]. CNNs use convolutional and pooling layers as the feature extractor of the input vector to fully extract the local correlations of temporal data and are widely used in traffic flow prediction [11], stock price prediction [10], power load forecasting [12] and other fields. Their disadvantage is that they cannot extract long-term dependent features of time series data. On the other hand, LSTM can retain long-term sequence features and alleviates the problems of gradient vanishing and gradient explosion. In addition, LSTM shows excellent prediction performance by introducing the mechanisms of the input gate, output gate and forget gate. However, LSTM needs to optimize many weights and bias parameters, which makes network training slow. Recently, the multilayer self-attention mechanism (Transformer) proposed in the field of natural language processing has also been applied to time series prediction tasks [13] and has shown excellent prediction results. The disadvantage of Transformers is their high computational cost. In recent years, the N-BEATS model has garnered attention. It is composed of a deep stack of fully connected layers connected by forward and backward residual links [14]. The stacked fully connected layers learn the dependencies in the data, while the residual structure allows the network to be deep and preserves the ability of the model to process data. Compared with other models, this model has reduced computational complexity and faster training speed. N-BEATS demonstrates state-of-the-art performance on several large datasets. Therefore, we apply the residual block of the N-BEATS model to the few-shot time series problem and verify its performance.
Few-shot learning
Few-shot learning has recently attracted considerable attention for data-scarce problems in new domains. Several methodologies, encompassing transfer learning [15, 16], domain adaptation [17], and multitask learning [18, 19], utilize the knowledge or experience in the source domain as assistance for learning tasks in the target domain. Among them, Oreshkin et al. [20] proposed a zero-shot time series prediction framework based on transfer learning. This framework trains a neural network model with many time series datasets in the source domain and then directly applies the model to new time series prediction tasks. Orozco et al. [21] proposed a small-sample time series prediction method based on a recurrent neural network, solving the small-sample time series prediction problem by learning a shared feature embedding from a large amount of time series data. However, the above methods use a large amount of data in the source domain for training. To reduce the number of time series tasks needed, researchers have proposed many meta-learning ideas to solve the problems faced in the field of small-sample learning.

The overall structure framework diagram.
Currently, research on meta-learning is roughly classified into three groups: metric-based [22], model-based [23] and gradient-based meta-learning [24]. Among them, gradient-based methods attract particular attention due to their wide-ranging adaptability. The most classical method is the MAML algorithm [22] proposed by Finn et al., which optimizes the initial network parameters through training and then fine-tunes these parameters in one or more subsequent steps. By utilizing the MAML algorithm, Feng Xiao et al. investigated small-sample time series forecasting, and the prediction performance was greatly improved. However, their method does not consider the tuning of meta-hyperparameters for different tasks. Therefore, many efforts have been made in hyperparameter optimization for meta-learning. To reduce the sensitivity to hyperparameters, Li et al. [25] proposed a novel meta-learning optimizer (MetaSGD) for few-shot learning. The experimental results demonstrate the effectiveness of meta-learner hyperparameter optimization. Ravi et al. [26] learned the entire inner-loop optimization directly through LSTMs to generate updated weights. However, these methods lack adaptive properties in inner-loop optimization, and the hyperparameters cannot be adapted to each task. Therefore, Sungyong et al. [27] proposed an adaptive learning update rule (ALFA) within the meta-learning framework, which automatically adjusts the hyperparameters of the learning algorithm in meta-learning and thereby improves its performance. In this paper, we build on MAML by introducing a trainable learning rate and weight decay coefficient to update the model hyperparameters adaptively. This innovation aims to speed up adaptation to new time series prediction tasks, thereby improving generalization capabilities.
Problem statement
Meta-learning can allow a network to quickly adapt to new tasks through cross-task training. Assume that there are N different time series tasks in cross-task learning, $T = \{T_1, T_2, \ldots, T_N\}$, and that the model takes the historical time series data of the i-th task as input.
The network quickly generalizes meta-knowledge to new tasks when encountering unseen tasks. It only needs to fine-tune the learned initialization parameters on few-shot time series data to better predict future time series trends.
The model training is handled by two parts: the backbone network (base-learner) and the small meta-network (meta-learner). The backbone network corresponds to a specific episode source task. It needs to learn knowledge about each task from the few-shot labeled support set time series of multiple time series tasks and to predict the query set sequence. The small meta-network improves the prediction performance of the base-learner on cross-episode tasks by learning general meta-knowledge from multiple episode tasks. The backbone network uses the residual stack module of the N-BEATS model. Each block consists of four stacked fully connected layers, FC1 to FC4, and two residual branch linear layers. The small meta-network is a 3-layer perceptron (MLP) with the rectified linear unit (ReLU) activation function between the layers. The model structure is shown in the overall structure framework diagram.
A stack in the backbone network is composed of multiple blocks. The core idea is that each block learns part of the historical information in the time series, removes the information it has learned through the residual connection, and passes the remainder on as the input for the next block, so that the unlearned information is used to further update the prediction. Each block is composed of four fully connected layers. After the output of the last hidden layer is obtained, the data enter two residual branches, which generate the forward expansion coefficients $\theta^f$ and the backward expansion coefficients $\theta^b$ through linear layers and then pass them through the functions $g^f$ and $g^b$. The outputs of the two residual branches are the backcast over the current input window and the forecast over the future prediction window. The following is a schematic diagram of the residual stack module.

Residual stack module diagram.
For block 1, the input is the entire input sequence of the model, whereas the input of each subsequent block is the difference between the previous block's input and the backcast that block generated (i.e., the residual), and the final predicted value is the sum of all partial predicted values. This is described by the following equation:
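The residual recursion described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the helper names (`make_block`, `block_forward`, `stack_forward`), the hidden width, and the random initialization are assumptions, and the basis functions are taken to be the identity, so each linear head outputs the backcast or partial forecast directly.

```python
import random

def linear(weights, bias, x):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(v, 0.0) for v in x]

def init_layer(rng, n_in, n_out):
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_block(rng, backcast_len, forecast_len, hidden=8):
    """One block: four fully connected layers plus two linear residual branches."""
    sizes = [backcast_len] + [hidden] * 4
    return {
        "fc": [init_layer(rng, a, b) for a, b in zip(sizes[:-1], sizes[1:])],
        "backcast": init_layer(rng, hidden, backcast_len),
        "forecast": init_layer(rng, hidden, forecast_len),
    }

def block_forward(block, x):
    h = x
    for weights, bias in block["fc"]:
        h = relu(linear(weights, bias, h))
    # Assumption: identity basis, so the heads emit backcast and forecast directly.
    return linear(*block["backcast"], h), linear(*block["forecast"], h)

def stack_forward(blocks, x):
    """Each block sees the residual of the previous one; partial forecasts are summed."""
    residual, forecast = list(x), None
    for block in blocks:
        backcast, partial = block_forward(block, residual)
        residual = [r - b for r, b in zip(residual, backcast)]
        forecast = partial if forecast is None else [f + p for f, p in zip(forecast, partial)]
    return forecast, residual

rng = random.Random(0)
blocks = [make_block(rng, backcast_len=10, forecast_len=4) for _ in range(3)]
forecast, residual = stack_forward(blocks, [0.1 * t for t in range(10)])
```

Here `forecast` has one entry per step of the prediction window, and `residual` is what remains of the input window after every block has subtracted its backcast.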
Since the time series data from different tasks have unequal input lengths, the Bi-GRU neural network is used to process time series data of any length, and the time series of each task is processed into a unified dimension. The Bi-GRU network can not only maintain the temporal order of the original time series but also use the hidden state of each time step to obtain the past and future information in the sequence as the characteristics of the current moment.
Therefore, the Bi-GRU neural network can be regarded as an encoder. The sequence of time step t in task i is expressed as

Bi-GRU encoder.
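To make the role of the encoder concrete, the following sketch runs a bidirectional GRU over univariate series of different lengths and returns both per-step features and a fixed-size summary. It is an illustration under stated assumptions, not the paper's encoder: the parameters are random and untrained, the hidden size is arbitrary, and using the final forward and backward states as the unified representation is an assumed pooling choice.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def init_gru(rng, hidden):
    """Random parameters for a GRU cell with scalar input (univariate series)."""
    def vec():
        return [rng.gauss(0.0, 0.3) for _ in range(hidden)]
    return {g: {"w": vec(), "u": [vec() for _ in range(hidden)], "b": [0.0] * hidden}
            for g in ("z", "r", "h")}

def gru_step(params, x, h):
    def pre(gate, state):
        g = params[gate]
        return [g["w"][i] * x + sum(u * s for u, s in zip(g["u"][i], state)) + g["b"][i]
                for i in range(len(h))]
    z = [sigmoid(v) for v in pre("z", h)]                      # update gate
    r = [sigmoid(v) for v in pre("r", h)]                      # reset gate
    cand = [math.tanh(v) for v in pre("h", [ri * hi for ri, hi in zip(r, h)])]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def run_gru(params, series, hidden):
    h, states = [0.0] * hidden, []
    for x in series:
        h = gru_step(params, x, h)
        states.append(h)
    return states

def bigru_encode(fwd, bwd, series, hidden):
    """Return per-step features (past + future context) and a fixed-size summary."""
    f_states = run_gru(fwd, series, hidden)
    b_states = run_gru(bwd, series[::-1], hidden)[::-1]
    per_step = [f + b for f, b in zip(f_states, b_states)]
    summary = f_states[-1] + b_states[0]   # assumed pooling: final state of each direction
    return per_step, summary

rng = random.Random(0)
hidden = 4
fwd, bwd = init_gru(rng, hidden), init_gru(rng, hidden)
short_series = [math.sin(0.2 * t) for t in range(50)]
long_series = [math.cos(0.1 * t) for t in range(120)]
steps_short, code_short = bigru_encode(fwd, bwd, short_series, hidden)
steps_long, code_long = bigru_encode(fwd, bwd, long_series, hidden)
```

Both `code_short` and `code_long` have length `2 * hidden` even though the input series differ in length, which is the property cross-task learning relies on.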
The new sequence
In the meta-training stage, an optimal set of network parameters is trained according to multitask learning bilevel optimization. The purpose of this stage is to increase the generalizability of the model across multiple tasks and to continuously adjust the model parameters to adapt to different tasks. The network does this by learning task characteristics to improve the sensitivity of the model and adapt to new tasks faster. For the convenience of description, the initial parameters of the backbone network are represented as $\theta_0$, and the parameters of the small meta-network are represented as $\varphi$, where $\theta = [w_1, \ldots, w_j]^T$ represents the weight vector of each layer in the network, and $w_j$ is the weight vector of the j-th layer. In the experimental parameter update, $\theta$ is used to represent all weight vectors. Note that due to the use of the meta-learning algorithm, the weights need to be initialized with a normal distribution and added to the parameter list. The sequence
The output sequence is

Block diagram of the few-shot time series prediction parameter update process.
First, the learning rate and weight decay coefficient are generated layer by layer by the small meta-network, which therefore acts as a hyperparameter generator. The backbone network weight parameters and the accumulated gradient means are used as the inputs of the meta-learner. Based on this learning state of a specific task, task-specific hyperparameters are generated for each layer of weights to control the direction and magnitude of the weight updates. We generate hyperparameters as shown in (12).
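As a rough sketch of this step, the code below feeds layer-wise means of the gradients and weights into a 3-layer MLP and uses its outputs as a per-layer learning rate and weight decay coefficient in one inner-loop update. Several details are assumptions rather than the paper's formulation: the function names, the exponential mapping that keeps both hyperparameters positive, the base values `base_lr` and `base_wd`, and the update rule, which follows the general form of adaptive update rules such as ALFA (scale the weights by a decay factor, then subtract the scaled gradient).

```python
import math
import random

def init_mlp(rng, sizes):
    """Dense layers as (weights, bias) pairs; three layers for sizes of length 4."""
    return [([[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def mlp_forward(mlp, x):
    for idx, (weights, bias) in enumerate(mlp):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
        if idx < len(mlp) - 1:
            x = [max(v, 0.0) for v in x]          # ReLU between layers
    return x

def mean(values):
    return sum(values) / len(values)

def generate_hyperparameters(mlp, layers, grads, base_lr=1e-2, base_wd=5e-4):
    """Map the learning state (mean gradient and mean weight per layer) to (alpha, beta)."""
    state = [mean(g) for g in grads] + [mean(w) for w in layers]
    out = mlp_forward(mlp, state)
    n = len(layers)
    alphas = [base_lr * math.exp(o) for o in out[:n]]          # assumed positive mapping
    betas = [1.0 - base_wd * math.exp(o) for o in out[n:]]     # assumed decay mapping
    return alphas, betas

def inner_step(layers, grads, alphas, betas):
    """Assumed ALFA-style update: w <- beta * w - alpha * grad, one (alpha, beta) per layer."""
    return [[b * w - a * g for w, g in zip(layer, grad)]
            for layer, grad, a, b in zip(layers, grads, alphas, betas)]

rng = random.Random(0)
layers = [[0.5, -0.2, 0.1], [1.0, 2.0]]           # toy backbone with two layers
grads = [[0.1, 0.1, -0.3], [0.0, 0.4]]            # toy gradients of the support loss
meta_net = init_mlp(rng, [4, 8, 8, 4])            # input: 2 stats per layer; output: 2 per layer
alphas, betas = generate_hyperparameters(meta_net, layers, grads)
updated = inner_step(layers, grads, alphas, betas)
```

In a full implementation the meta-network parameters would themselves be trained in the outer loop; here they are left at their random initial values.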
To evaluate the network parameters, the query set is used to verify the previously trained model. The updated parameters $\theta_t$ are passed into the N-BEATS model, and the query set loss is computed.
In the outer loop, the total loss gradient over the query sets $Q_i$ of multiple source tasks is calculated, and the gradient descent algorithm is used to update the outer loop parameters $\theta$ iteratively. The gradient descent adaptive step size is $\eta$, and the update process is shown in equation (15).
The optimal network parameters ω* are obtained by learning the meta-knowledge of different source tasks. Therefore, the network can quickly adapt to the target task with only few-shot data, where ω* represents the best initial parameters θ of the network and the optimal learning rate α′ and decay rate β′ in the inner loop update. Additionally, there are support sets and query sets in the new task. Through the fine-tuning method, the previously trained parameters are used to learn and fine-tune the model parameters with the support set, where
Require: Task distribution p(T), learning rate λ, hyperparameters α, β
1: randomly initialize φ, θ0
2:
3: Sample a task T from p(T)
4:
5: Compute the loss
6: Compute the gradient on
7: Generate hyperparameters
8: Obtain adaptive parameters
9:
10: Compute the loss
11: Update the parameters
12:
13: Sample a few time series S from a new task
14: Compute loss
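The overall procedure (inner-loop adaptation on the support set, outer-loop update from the query loss, then fine-tuning on a new task) can be illustrated with a deliberately simplified toy. The sketch below makes several assumptions that differ from the method in this paper: the backbone is a linear predictor over a window of three past values, the inner loop uses a fixed learning rate instead of generated hyperparameters, the tasks are synthetic sinusoids, and the outer update uses the first-order approximation of MAML (the query gradient is taken at the adapted parameters and second-order terms are ignored).

```python
import math
import random

def make_windows(series, width=3):
    """Turn a series into (past window, next value) training pairs."""
    return [(series[i:i + width], series[i + width]) for i in range(len(series) - width)]

def mse_and_grad(theta, samples):
    """Mean squared error of a linear predictor and its gradient w.r.t. theta."""
    loss, grad = 0.0, [0.0] * len(theta)
    for x, y in samples:
        err = sum(w * v for w, v in zip(theta, x)) - y
        loss += err * err
        for j, v in enumerate(x):
            grad[j] += 2.0 * err * v
    n = len(samples)
    return loss / n, [g / n for g in grad]

def adapt(theta, support, alpha=0.05, steps=3):
    """Inner loop: a few gradient steps on the support set of one task."""
    for _ in range(steps):
        _, grad = mse_and_grad(theta, support)
        theta = [w - alpha * g for w, g in zip(theta, grad)]
    return theta

def sample_task(rng, length=40):
    """Hypothetical task distribution: sinusoids with random frequency and phase."""
    freq, phase = rng.uniform(0.2, 0.6), rng.uniform(0.0, math.pi)
    windows = make_windows([math.sin(freq * t + phase) for t in range(length)])
    half = len(windows) // 2
    return windows[:half], windows[half:]          # support set, query set

def meta_train(rng, iterations=200, batch=5, eta=0.05):
    theta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        meta_grad = [0.0] * len(theta)
        for _ in range(batch):
            support, query = sample_task(rng)
            adapted = adapt(theta, support)
            _, grad = mse_and_grad(adapted, query)  # first-order approximation
            meta_grad = [m + g / batch for m, g in zip(meta_grad, grad)]
        theta = [w - eta * m for w, m in zip(theta, meta_grad)]   # outer-loop update
    return theta

rng = random.Random(0)
theta0 = meta_train(rng)
new_support, new_query = sample_task(rng)           # unseen target task
fine_tuned = adapt(theta0, new_support)             # fine-tune on few-shot data
loss_before, _ = mse_and_grad(theta0, new_support)
loss_after, _ = mse_and_grad(fine_tuned, new_support)
```

Replacing the fixed `alpha` with per-step, per-layer values produced by a small meta-network, and training that network in the outer loop, is what distinguishes MLHT from this plain MAML-style loop.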
In this section, we report the results of an extensive experimental study to evaluate the effectiveness of our proposed algorithm relative to the baseline algorithms. Two metrics are chosen to evaluate the prediction model in the experiment: the root mean square error (RMSE) and the symmetric mean absolute percentage error (sMAPE), as shown in Eq. (19). Both are computed from the actual sequence and the predicted sequence of the i-th prediction task, where H represents the prediction sequence range.
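For reference, the two metrics can be computed as in the sketch below. Because Eq. (19) is not reproduced here, the exact sMAPE variant is an assumption: the code uses the common definition with a factor of 200 over the sum of absolute values, expressed in percent, and skips steps where both values are zero.

```python
import math

def rmse(actual, predicted):
    """Root mean square error over a prediction range of H = len(actual) steps."""
    h = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / h)

def smape(actual, predicted):
    """Symmetric MAPE in percent (assumed variant: 200 * |a - p| / (|a| + |p|), averaged)."""
    h = len(actual)
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = abs(a) + abs(p)
        if denom > 0.0:                 # treat the 0/0 case as zero error
            total += 2.0 * abs(a - p) / denom
    return 100.0 * total / h

actual = [0.0, 0.0, 0.0, 0.0]
predicted = [2.0, 2.0, 2.0, 2.0]
print(rmse(actual, predicted))          # 2.0
print(smape([1.0], [3.0]))              # 100.0
```

Other sMAPE conventions (for example, values in the range 0 to 2 rather than percentages) differ only by a constant factor.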
The experiments use time series datasets from the UCR Time Series Classification Archive [22], which contains 128 different types of time series data subsets, including speech signals, financial data, medical data, etc. To ensure the accuracy of the experiment, we remove the sequences with missing values and those with lengths of less than 50. Finally, 82 time series datasets from different domains are selected to evaluate the model, which can be regarded as 82 prediction tasks. In addition, we divide each task data subset into a training set and a test set.
Baseline models
In the experiments, we use a residual stack model built from multiple blocks, where each block is a sequence of fully connected layers with a forecast/backcast branch at the end. The architecture runs the residual recursion over the entire input window and sums the block outputs to make its final prediction. We selected the following models as baselines:
We also compare three types of training frameworks. The first (MLHT) combines the MAML framework with the small meta-network to generate per-step hyperparameters adapted to each prediction task. The second (MAML) removes the small meta-network and performs cross-task learning on the training tasks, so that the generalized initialization parameters can quickly adapt to the prediction task. The third (Base) removes the MAML framework and trains directly on the time series data of the prediction task.
Training setup
In this experiment, the 82 prediction tasks are divided into a training task set and a prediction task set. One of the tasks is selected as the prediction task, and the rest are used as training tasks. Every five tasks form a batch, and each task uses five time series for training. The experiment uses the Adam optimizer, and the base learning method is trained directly on the prediction task. The learning rate settings are shown in Table 1. For the MAML framework, Table 1 lists the hyperparameter settings in the meta-training process for different models, where α is the learning rate of the backbone network and β is the learning rate of the meta-network. For our proposed meta-learning method, the inner loop adaptively generates the learning rate and weight decay rate of each step. To achieve more fine-grained control, the inner loop parameter that updates the learning rate inside the optimizer is set to $10^{-2}$; therefore, no additional settings are needed. The training processes of the three learning frameworks are consistent, and the sequence prediction range is set to H=20.
Training process hyperparameter settings
To prove that the proposed method promotes the improvement of the prediction performance of few-shot time series tasks, one task is selected as a test in the experiment, and the rest are used as training tasks to realize multitask training. Each task is used as a test task in turn for cyclic iterative prediction.
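The evaluation protocol (each task held out in turn, the remaining tasks grouped into batches of five) can be sketched as follows. The function name and the integer task identifiers are illustrative assumptions.

```python
def leave_one_task_out(task_ids, batch_size=5):
    """Yield (held-out prediction task, list of training-task batches) for every task."""
    for held_out in task_ids:
        train = [t for t in task_ids if t != held_out]
        batches = [train[i:i + batch_size] for i in range(0, len(train), batch_size)]
        yield held_out, batches

splits = list(leave_one_task_out(list(range(82))))
held_out, batches = splits[0]
print(len(splits), len(batches), len(batches[-1]))   # 82 17 1
```

With 82 tasks, each split leaves 81 training tasks, which gives 16 full batches of five and one final batch containing a single task.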
Figure 5 shows the time series prediction effect of the residual-stack model combined with different meta-learning methods. The horizontal axis represents the time series prediction range H=20, and the vertical axis represents the predicted value under different meta-learning methods. The result of the MLHT framework is closer to the real value. The method can be used to find hyperparameters suitable for each task faster in cross-domain scenarios. The adaptability of network parameters plays an important role in cross-task few-shot time series prediction.

Comparison of the prediction effects of different meta-learning frameworks.
Figure 6 shows the time series prediction effect of different models using the same meta-learning method. The prediction task has a certain periodicity and trend. The circles represent the real values of the sequence, and the squares represent the predicted values of the model sequence in this paper. The prediction result of the proposed residual stack model is better than that of the baseline model and is closer to the real value.

Comparison of the time series forecasting effects of different models.
Table 2 compares MLHT with the other two methods in terms of the proportion of tasks whose sMAPE improves or degrades. A task is counted only when the difference between the two sMAPE values is greater than 0.1, that is, when the performance of the model changes noticeably. Therefore, the sum of the proportions of improved and degraded tasks may be less than 1. As a whole, our method yields a significant performance improvement for most tasks with different models. Compared with the second group (MAML), the proposed method improves performance on 67.07% of the tasks, indicating that for most tasks there is a significant improvement in the sMAPE metric.
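The counting rule described above is simple enough to state in code. In this sketch the function name and the toy sMAPE values are made up for illustration; only the 0.1 threshold comes from the text.

```python
def improvement_ratios(smape_ours, smape_other, threshold=0.1):
    """Fraction of tasks where our sMAPE is lower (up) or higher (down) by more than threshold."""
    n = len(smape_ours)
    up = sum(1 for ours, other in zip(smape_ours, smape_other) if other - ours > threshold)
    down = sum(1 for ours, other in zip(smape_ours, smape_other) if ours - other > threshold)
    return up / n, down / n

# Toy values: one clear improvement, one clear degradation, two changes below the threshold.
ours = [1.0, 2.0, 3.0, 4.0]
other = [1.5, 2.05, 2.0, 4.0]
up, down = improvement_ratios(ours, other)
print(up, down)   # 0.25 0.25
```

Because tasks whose difference stays within the threshold are counted in neither group, the two ratios need not sum to 1.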
Task improvement and decrease ratio measured by sMAPE
Figure 7 shows the difference between the MLHT and MAML methods with different models. The horizontal axis represents the 82 prediction tasks, and the vertical axis represents the difference in sMAPE between the two prediction methods. A difference greater than zero (Δ > 0) indicates that the model combined with the MLHT method is superior to the model combined with the MAML method. The prediction performance of most tasks is significantly improved, and the performance degrades for only a few tasks.

sMAPE comparison between MLHT and MAML fusion meta-network.
Table 3 shows the RMSE between the predicted sequence and the real sequence for each prediction task. Using different models as the backbone network, ablation experiments are carried out with the three prediction methods. Some tasks are randomly selected to show the prediction results. MAML outperforms the traditional base prediction method, while our method outperforms the MAML framework. This indicates that the MLHT method adapts to new tasks faster in cross-task training on time series. The addition of the small meta-network to the MAML framework makes it possible to adaptively and dynamically adjust the learning rate and weight decay coefficient, find suitable network hyperparameters, and significantly improve generalizability. At the same time, compared with the LSTM and CNN models, the model proposed in this paper has better predictive performance on most time series tasks with the MLHT meta-learning method, and performance degradation is observed in only a few prediction tasks. The proposed residual stack model combined with the meta-learning method has a better forecasting effect. Residual stacking allows the network to effectively capture time series features, and the proposed model can adapt to different time series forecasting tasks. Thus, the proposed method achieves the best prediction results among the compared models.
The RMSE values for the prediction task with different models
To analyze these results in further detail and draw reliable conclusions from them, we performed the Friedman hypothesis test on the sMAPE performance of all the models in our experiments for 82 tasks. Because the test p-value is less than the significance level (α = 0.05), we rejected the null hypothesis, indicating that there is a significant difference in the performance of these methods. Therefore, we applied the Nemenyi test as a post hoc analysis to investigate the relative performance of the model algorithms. Figure 8 shows the critical difference diagram for sMAPE. The top line in the figure is the axis along which the average rank of each model algorithm is plotted, from the lowest rank on the left (best performance) to the highest rank on the right (worst performance). In the diagram, it can be observed that our model shows superior performance, and the models trained with the MLHT meta-learning algorithm rank better on average than those trained with the MAML and base methods, which confirms that the MLHT method has better generalization performance.
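The statistics behind such an analysis can be sketched as below: per-task ranks, the Friedman chi-square statistic, and the Nemenyi critical difference. This is an illustrative outline, not the authors' analysis script. The function names and the toy score table are assumptions, no tie correction is applied to the statistic, and the q values are the standard two-tailed Nemenyi critical values at α = 0.05 as tabulated by Demšar (2006).

```python
import math

# Critical values q_alpha for the Nemenyi test at alpha = 0.05, indexed by number of methods k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}

def rank_row(scores):
    """Rank scores in ascending order (1 = lowest error); ties receive average ranks."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def friedman_statistic(table):
    """table[t][m] is the sMAPE of method m on task t. Returns (chi-square, average ranks)."""
    n, k = len(table), len(table[0])
    ranks = [rank_row(row) for row in table]
    avg = [sum(r[m] for r in ranks) / n for m in range(k)]
    chi2 = 12.0 * n / (k * (k + 1)) * (sum(a * a for a in avg) - k * (k + 1) ** 2 / 4.0)
    return chi2, avg

def nemenyi_cd(k, n, q_alpha=Q_ALPHA_005):
    """Two methods differ significantly if their average ranks differ by more than this."""
    return q_alpha[k] * math.sqrt(k * (k + 1) / (6.0 * n))

# Toy table: 4 tasks, 3 methods, the first method always best.
toy = [[1.0, 2.0, 3.0]] * 4
chi2, avg_ranks = friedman_statistic(toy)
print(chi2, avg_ranks)     # 8.0 [1.0, 2.0, 3.0]
cd = nemenyi_cd(k=3, n=82)
```

The chi-square statistic is compared against a chi-square distribution with k - 1 degrees of freedom to obtain the p-value; the critical difference is the bar length drawn in a critical difference diagram.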

Ranking and critical differences diagram of the performance metric sMAPE on all tasks.
Table 4 shows the running time, RMSE, and sMAPE under different numbers of training tasks. To verify the scalability of our model, the number of training tasks is reduced to 20, 30, 40, 50, and 80 when predicting new target tasks. As can be seen from the table, the running time increases as the number of training tasks increases, and the proposed model already shows excellent performance when the number of tasks is 40.
Scalability under different number of training tasks
Table 5 compares the model running time and memory usage under different meta-learning methods. Under the same number of training tasks, although the MLHT method takes up more memory, its prediction performance and running time are better than those of MAML. Therefore, the overall performance of MLHT is better than that of MAML with smaller amounts of data.
Comparison of different meta-learning methods
This paper combines a meta-learning adaptive method with a residual stack model to form a new time series prediction method. Cross-domain multitask learning is used to learn meta-knowledge through the MLHT framework so that the model can adapt to new tasks quickly. This mitigates the model overfitting caused by the scarcity of data in few-shot time series forecasting. The proposed method shows significant performance improvements in most tasks. Furthermore, the residual stack model performs better than the other models in time series prediction, and its architecture can capture the shared characteristics of multiple time series and obtain better prediction results.
Although this study focuses on single-dimensional few-shot time series, many problems in the real world involve complex relationships among multiple related variables. In future work, we expect to extend the proposed method to multidimensional time series forecasting. By integrating the changes of multiple variables, we expect to capture the dynamics of the system more accurately and provide decision-makers with more comprehensive forecasting results.
Acknowledgments
This work is supported by the Zhejiang Provincial Natural Science Foundation of China under grant LQ22F010008.
