Abstract
With the rapid advancements in industrial big data, the Internet of Things, and sensor acquisition technologies, the similarity measurement of multivariate time series has emerged as a pivotal research area in data mining and machine learning. To enhance the accuracy and efficacy of multivariate time series similarity measurement, this paper proposes a Transformer-based sliding window approach. Specifically, each dimension of the multivariate time series is processed through sliding windows and input into a Transformer for feature extraction. By using multiple window sizes, the method captures features of localized temporal segments at several scales and identifies local patterns within the time series. The encoded window features of each sample are combined to form a comprehensive feature sequence that represents the global characteristics of the entire time series. These global features are then used to compute the final similarity measure through Dynamic Time Warping (DTW). This approach captures both local and global features of multivariate time series, significantly improving similarity measurement precision. The effectiveness of the proposed method is validated through 1-Nearest Neighbor (1NN) classification experiments, in which it achieved the highest classification accuracy on ten of the sixteen datasets.
Keywords
Introduction
Time series are a prevalent data type, widely found in fields such as finance, healthcare, electronic information, and meteorology. 1 Examples include daily stock market transactions, product sales data, meteorological records, and flight data recorded by “black boxes.” With the continuous improvement in data storage and processing capabilities, the scale of time series data storage in real-world applications has grown exponentially over time. 2 Consequently, time series data mining has become a prominent research focus in the field of data mining. 3 Similarity measurement is one of the core techniques in time series data mining, as its accuracy directly impacts the effectiveness of data mining. 4 In recent years, the study of time series data mining (TSDM) has attracted researchers from various disciplines. Compared with univariate time series, the research on similarity measurement for multivariate time series is relatively limited, with many unresolved challenges. 5 In real-world scenarios, multivariate time series are more common. For instance, stock transactions can be described using attributes such as opening price, closing price, highest price, lowest price, and trading volume. Moreover, multimedia data (e.g., audio, images) can also be converted into multivariate time series. Therefore, studying the similarity measurement of multivariate time series holds significant theoretical value and broad application potential.
Currently, prominent methods for measuring multivariate time series similarity include Euclidean Distance (ED), 6 Singular Value Decomposition (SVD), 7 Point Distribution-based Features (PD), 8 Dynamic Time Warping (DTW) Distance, 9 and Trend Distance (TD). 10 Each method has its respective advantages and limitations. We adopt DTW distance because it can measure similarity between time series of different lengths, accommodates stretching and warping along the time axis, and offers strong accuracy and robustness. For the same reasons, DTW is widely employed in this domain.
Multivariate time series often encompass many variables and data points, making direct processing susceptible to the curse of dimensionality, which significantly increases computational complexity and storage requirements. Feature extraction serves as an effective approach to project high-dimensional data into a lower-dimensional space, retaining essential information while eliminating redundancy. 11 This not only simplifies data representation but also enhances computational efficiency and mitigates the impact of noise and redundant features. Given the high dimensionality (multiple variables) and extensive data points characteristic of multivariate time series, directly computing similarity can be computationally intensive and prone to noise interference. 12 By leveraging feature extraction, high-dimensional time series data can be transformed into a lower-dimensional feature space, thereby streamlining similarity computations. This approach reduces computational complexity, minimizes data redundancy and noise, and ultimately ensures more efficient and robust similarity measurements.
Feature extraction is particularly important when dealing with multivariate time series for several key reasons:
Dimensionality Reduction and Redundancy Removal: Multivariate time series often involve numerous variables spanning long observation periods. Direct similarity computation across all variables is not only computationally expensive but also susceptible to interference from irrelevant or redundant information. Feature extraction compresses the data by retaining only the most salient information, thereby effectively reducing dimensionality and alleviating computational burden.
Highlighting Key Patterns and Relationships: Beyond summarizing individual variable behaviors, multivariate time series analysis requires capturing the complex interactions and couplings between variables. Feature extraction techniques are designed to detect and encode these cross-variable relationships, which are critical for precise and meaningful similarity measurement. As demonstrated in prior works analyzing EEG data, carefully selected features can reveal hidden temporal and spatial patterns across channels that are crucial for accurate classification and recognition tasks.13,14
Enhancing Robustness and Noise Resistance: Raw multivariate time series data are often contaminated by noise, outliers, and missing values. With feature extraction methods such as sliding window-based statistics, frequency-domain transformations, or deep learning-based embeddings, the resulting representations emphasize stable and representative signals while filtering out transient disturbances. This robust feature transformation improves the reliability of similarity computations, a critical advantage noted in advanced EEG analysis applications. 15
Providing Better Inputs for Downstream Algorithms: While algorithms like Dynamic Time Warping (DTW) are powerful for aligning temporal sequences, they struggle with raw high-dimensional inputs due to the curse of dimensionality. Feeding DTW or other similarity measures with carefully extracted, compact feature representations dramatically lowers computational costs, improves alignment quality, and enhances downstream task performance. For example, brain-computer interface systems that leverage optimized feature spaces, as highlighted in recent EEG classification studies, achieve both higher efficiency and accuracy compared to raw-signal-based approaches.
In summary, the challenge of similarity measurement in multivariate time series can be defined as determining and computing a distance or similarity metric that effectively captures the resemblance between two sequences across both temporal and multivariate feature dimensions. Feature extraction plays a pivotal role in this process by transforming complex raw data into meaningful, compact representations that enable efficient, robust, and interpretable similarity computations.
In Section 2.2, we extensively discuss the feature extraction methods and present the rationale for selecting Transformer as the feature extraction technique. Based on the characteristics of multivariate time series feature extraction, the key conclusions of this work are summarized as follows:
A sliding window is applied to each dimension of the multivariate time series to extract features between variables, enabling the capture of local patterns.
Transformer-encoded sequences of varying window sizes are merged to form a feature sequence, representing the global characteristics of the entire time series.
Global features are utilized for Dynamic Time Warping (DTW) similarity computation, ultimately providing the final similarity measure.
The effectiveness of the feature extraction and similarity integration is validated through 1NN experiments.
From these findings, we conclude that Transformer proves to be effective in time series feature extraction, while DTW shows robustness in similarity computation. The combination of these two methods helps to accurately measure the similarity of time series.
Related work
Definitions
A series of observations recorded in chronological order, denoted as
A multivariate time series is defined as
T denotes the number of time steps. D represents the number of variables (dimensions), with each variable
Feature extraction formalization
The process of extracting the feature representation F can be defined as a mapping:
Here,
Univariate Feature Extraction: For each variable
Multivariate Interaction Features: Modeling the relationships between variables to generate interaction features
Global Features: Extract the global representation
The goal is to extract the feature representation

The process of feature extraction.
Given two multivariate time series,
Here, S represents the similarity score function, and g denotes the similarity measurement method, such as Euclidean distance, cosine similarity, Dynamic Time Warping (DTW), or kernel-based approaches.
Univariate similarity Compute similarity on a per-variable basis:
Similarity of inter-variable relationships Calculate similarity for interaction features:
Comprehensive similarity Define a weighted similarity score by combining univariate features and interaction features:
Here,
Traditional sequence feature representation methods typically account for the temporal dependencies and multidimensional characteristics of time series data. Below are some classical time series feature representation methods:
Application of Transformer in time series
The application of Transformer models in time series analysis has garnered increasing attention due to their powerful feature representation and long-range dependency modeling capabilities, 30 which allow them to excel in handling time series data. Transformers demonstrate robust performance across a wide range of tasks in time series analysis, including prediction, classification, anomaly detection, similarity measurement, data generation, multivariate modeling, and preprocessing. 33
In prediction tasks, Transformers capture long-term dependencies and complex patterns through their self-attention mechanism, enabling efficient multi-step forecasting. 34 In classification, the model enhances discrimination ability by leveraging global feature extraction and high-dimensional encoding. 35 For anomaly detection, Transformers effectively identify outliers using reconstruction errors and attention mechanisms. 36 For data generation, Transformer-based generative models simulate sequences with similar characteristics. 37 In multivariate modeling, multi-head attention is employed to capture interaction relationships and model variable dependencies. During preprocessing, Transformers utilize masking techniques to fill in missing values and reduce noise, thereby improving data quality.
The application of Transformers in time series analysis capitalizes on the advantages of their self-attention mechanism and hierarchical feature learning, enabling them to handle complex temporal dependencies, capture nonlinear patterns, and improve performance across prediction, classification, anomaly detection, and data generation tasks. 38 As research progresses, the use of Transformers and their variants in time series analysis will continue to expand and deepen.
Dynamic time warping (DTW) similarity measure
Dynamic Time Warping (DTW) 39 is an algorithm used to compute the similarity between time series by allowing nonlinear alignment of the time axis. It is particularly effective in comparing sequences that may vary in speed or exhibit misalignments between time steps. DTW has a wide range of applications in time series data analysis, especially in fields such as speech recognition, handwriting recognition, and biological data analysis. The ability to handle such temporal variations makes DTW a powerful tool in domains where time series exhibit complex patterns that are not perfectly aligned.
DTW is a method for measuring the similarity between two time series by finding an optimal alignment path that minimizes the difference between the sequences at each aligned time step. This approach allows for nonlinear adjustments along the time axis to achieve the best possible match between the sequences. By considering such non-linear temporal shifts, DTW can effectively handle time series that are out of phase or vary in speed.
To compute the DTW similarity between two sequences
The distance between each time step is:
The final similarity distance is:
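As a minimal sketch of the computation described here, the following Python function implements the standard DTW recurrence: an accumulated-cost table is filled so that each cell adds the local distance between two time steps to the cheapest of its three predecessor cells. The Euclidean local cost, the absence of a warping-window constraint, and the function name `dtw_distance` are assumptions made for illustration and may differ from the exact formulation used in this paper.

```python
import math

def dtw_distance(a, b):
    """DTW distance between two sequences of equal-dimension vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean local cost (assumed)
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats a step
                cost[i][j - 1],      # b advances, a repeats a step
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# Two sequences of different lengths that differ only by a repeated step
x = [(0.0,), (1.0,), (2.0,)]
y = [(0.0,), (1.0,), (1.0,), (2.0,)]
print(dtw_distance(x, y))
```

Because the repeated step in `y` can be absorbed by the warping path, the two sequences align at zero cost even though their lengths differ.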
Problem description and modeling
Transformer architecture
The core of our approach is built upon the Transformer encoder architecture, following the original design introduced by Vaswani et al. 40 and shown in Figure 2, but explicitly omitting the decoder component. This design choice is intentional, as the decoder is primarily tailored for generative tasks where the output sequence length is not predetermined, such as machine translation, text summarization, or time series forecasting. Notably, the decoder requires access to the (masked) ground-truth output sequence during training, making it unsuitable for tasks like classification or extrinsic regression.
In contrast, our objective is to develop a unified framework capable of addressing a wide range of tasks, including classification, regression, inference, and, when needed, generative prediction. The encoder-only architecture provides this versatility, enabling it to handle both discriminative and generative tasks effectively. Moreover, removing the decoder substantially reduces the overall number of model parameters—by approximately half—thereby improving computational efficiency, accelerating training, and lowering the risk of overfitting. This streamlined design ultimately enhances both the performance and generalizability of the proposed framework.

Transformer encoder architecture.
Input sequence where Input Embedding: The input is mapped to a specific dimension through a linear transformation:
where Add positional information to the embedded time steps:
Here, Features are then extracted through a multi-head self-attention mechanism, residual connections combined with layer normalization, a feedforward neural network, and multiple stacked layers, ultimately producing the output features:
In the feature extraction process for multivariate time series based on Transformer, the first step involves input preprocessing, which standardizes the data, segments the time series, and handles missing values with masking, ensuring uniform feature scales and consistent input lengths. Next, the self-attention mechanism in Transformer is employed to capture the relationships between time steps in the time series, assigning higher weights to important time steps to extract key features. The multi-head attention mechanism concurrently attends to different subspace features of the time series, enhancing the model's expressive power. Following the self-attention layer, the feedforward neural network performs further feature transformation and nonlinear combinations, enriching the feature representation. During the feature representation and encoding stage, positional encoding is introduced to address the time-step awareness issue in Transformer, with common methods including sine-cosine encoding and learnable positional embeddings. By combining positional encoding with input features, the self-attention mechanism generates a global representation of the time series that reflects the relationships and sequential information between time steps. Through the stacking of multiple Transformer modules, higher-order features are progressively extracted, enabling hierarchical feature learning and producing the final feature sequence.
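To make two of these steps concrete, the sketch below adds sine-cosine positional encoding to a window and applies one scaled dot-product self-attention layer with a residual connection. It is a deliberately reduced illustration in plain Python, not the model used in the experiments: it assumes a single head, uses the input itself as queries, keys, and values (no learned projection matrices), and omits the linear input embedding, layer normalization, feedforward network, and layer stacking. All function names are hypothetical.

```python
import math

def positional_encoding(length, d_model):
    """Sine-cosine positional encoding: one vector of size d_model per time step."""
    pe = []
    for pos in range(length):
        row = []
        for k in range(d_model):
            angle = pos / (10000 ** (2 * (k // 2) / d_model))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head scaled dot-product attention with Q = K = V = x (assumed)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)  # attention paid by this step to every step
        out.append([sum(w * v[c] for w, v in zip(weights, x)) for c in range(d)])
    return out

def encode(window):
    """Add positional information, then one attention layer with a residual connection."""
    pe = positional_encoding(len(window), len(window[0]))
    z = [[a + b for a, b in zip(row, p)] for row, p in zip(window, pe)]
    attended = self_attention(z)
    return [[a + b for a, b in zip(r, s)] for r, s in zip(z, attended)]

# A window of 3 time steps with 2 variables each
features = encode([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(len(features), len(features[0]))
```

The output keeps one feature vector per time step, which is the shape the later DTW comparison operates on.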
In this method, DTW is used to measure the similarity between feature representations extracted by Transformer, enabling the capture of both local and global similarities within the sequences. This approach effectively accommodates sequence comparisons with varying lengths and speed changes. Finally, the similarity analysis between multivariate time series
In this approach, we first choose a window size and step size and use a sliding window to partition each multivariate time series into overlapping subsequences. In the first part, each subsequence is encoded by a Transformer model to extract meaningful local features, and these window-level features are then aggregated to represent the complete sequence. In the second part, to compute the similarity between two sequences, we measure the optimal alignment cost by applying dynamic time warping (DTW) to their respective feature sequences. By repeating this process across all sample pairs, we obtain a similarity matrix that holds the DTW-based distance between every pair of time series in the dataset.
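The two-part procedure can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: `window_feature` is a placeholder that returns the per-variable mean of a window and merely stands in for the Transformer encoder, window features from different window sizes are merged by simple concatenation, and `distance_matrix` accepts any sequence distance (DTW in the proposed method) as a callable. All names are hypothetical.

```python
def sliding_windows(series, window, stride):
    """Split a series (list of time steps, each a list of D values) into overlapping windows."""
    return [series[s:s + window] for s in range(0, len(series) - window + 1, stride)]

def window_feature(win):
    """Placeholder encoder: per-variable mean of the window (stand-in for the Transformer)."""
    d = len(win[0])
    return [sum(step[k] for step in win) / len(win) for k in range(d)]

def feature_sequence(series, window_sizes, stride):
    """Merge window features from every window size into one feature sequence."""
    feats = []
    for w in window_sizes:
        feats.extend(window_feature(win) for win in sliding_windows(series, w, stride))
    return feats

def distance_matrix(samples, distance):
    """Pairwise distances between feature sequences for a given distance callable."""
    n = len(samples)
    return [[distance(samples[i], samples[j]) for j in range(n)] for i in range(n)]

series = [[0.0], [1.0], [2.0], [3.0], [4.0]]
print(sliding_windows(series, 3, 2))           # windows starting at t = 0 and t = 2
print(len(feature_sequence(series, [2, 3], 1)))
```

Smaller strides produce more heavily overlapping windows and therefore longer feature sequences, which increases the cost of the subsequent DTW comparison.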
Experimental validation
This section focuses on validating the proposed method for multivariate time series similarity measurement, which combines sliding window-based Transformer feature extraction with DTW. Time series data mining finds applications across various domains, including clustering, classification, fault detection, pattern recognition, and prediction. 38 In these research tasks, measuring the similarity between two sequences is crucial, especially in time series classification, where efficient and accurate classification relies heavily on precise similarity measurement. Therefore, we primarily use classification to validate the feasibility and accuracy of the proposed similarity measurement.
Among various classification methods, the 1-nearest neighbor classifier (1-NN) is widely used due to its simplicity, efficiency, and parameter-free nature. 39 When combined with an appropriate distance or similarity metric, such as Dynamic Time Warping (DTW), 1-NN classification has proven to be effective and has shown significant progress in time series classification research. In fact, the choice of distance or similarity measure is critical to the accuracy of the 1-NN classifier. Thus, we use 1-NN to validate the proposed similarity measurement.
Datasets
In this section, we describe the datasets used for evaluating the proposed method. The datasets used in time series analysis cover various domains and provide different challenges, such as high dimensionality, temporal dependencies, and varying data patterns. The selected datasets allow us to test the performance of the sliding window-based Transformer feature extraction method combined with DTW for similarity measurement across multiple tasks.
In our experiment, we used the public UEA 2018 multivariate time series classification archive 41 to validate the performance of our method. The UEA archive has been widely used as a benchmark for evaluating time series data mining algorithms. Its 2018 release includes 30 multivariate datasets collected from various domains, such as medicine, meteorology, and computer vision. The datasets vary in terms of time series length, training/test set size, and the number of classes. Each dataset is divided into two parts: one for training and the other for testing.
Table 1 presents the relevant parameters for each dataset.
Relevant parameters for each dataset.
We use the 1NN classifier as the evaluation method in our experiments. Although the 1NN classifier is simple, it demonstrates strong effectiveness when combined with an appropriate similarity measure. Additionally, the accuracy of the 1NN classifier is highly dependent on the choice of distance measure, making it an excellent method for validating the effectiveness of time series similarity metrics. DTW is frequently used as a similarity measure in the 1NN classifier.
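The evaluation protocol can be sketched in a few lines: each test sample receives the label of its nearest training sample under the chosen distance, and accuracy is the fraction of correct labels. The function names are hypothetical, and `distance` is any callable on two samples (for example, a DTW distance on feature sequences); ties are broken in favor of the earliest training sample, which is an assumption.

```python
def one_nn_predict(train_x, train_y, query, distance):
    """Return the label of the training sample closest to the query."""
    best = min(range(len(train_x)), key=lambda i: distance(train_x[i], query))
    return train_y[best]

def one_nn_accuracy(train_x, train_y, test_x, test_y, distance):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(
        one_nn_predict(train_x, train_y, q, distance) == y
        for q, y in zip(test_x, test_y)
    )
    return correct / len(test_y)

# Toy example with scalar samples and absolute difference as the distance
dist = lambda a, b: abs(a - b)
print(one_nn_predict([1.0, 5.0], ["low", "high"], 4.2, dist))
```

Since 1NN has no parameters to tune, any change in accuracy between two runs on the same split can be attributed to the distance measure and the features it is applied to.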
All experiments were conducted on a machine with the following specifications: 64-bit Windows 11 OS, 12th Gen Intel(R) Core(TM) i7-12700H 2.30 GHz, Anaconda3, and Python 3.8.1.
Feature extraction experimental results and analysis
To evaluate the effectiveness of Transformer-based feature extraction, we compare it with other feature extraction methods, including Autoencoders, Temporal Convolution Networks (TCN), RNN Encoders, LSTM, VAE, PAA, and Local Extrema. These methods cover both deep learning-based feature extraction and time-domain feature extraction. All experiments were conducted on a single machine. Since deep learning feature extraction methods involve the selection of network depth (number of layers), we performed a hyperparameter search for the number of layers, ranging from 1 to 10, to mitigate the influence of this parameter on the results. The final choice of the number of layers n corresponds to the one that achieves the highest classification accuracy. The selected values for this parameter across different datasets are summarized in Table 2.
Number of network layers in deep learning models.
1NN classification accuracy based on different feature extraction methods combined with DTW.
As shown in Table 3, where dataset names are abbreviated, the Transformer combined with DTW achieved the highest classification accuracy on 10 of the 16 datasets, demonstrating its superior performance in feature extraction compared to other methods. To check that this result is not due to chance, we conducted a Wilcoxon signed-rank test on the classification error rates obtained by the Transformer and each comparative method. The detailed statistical outcomes are presented in Table 4. The Wilcoxon signed-rank test was chosen because it suits small samples and non-normally distributed data, requires no parametric assumptions, and, being rank-based, is less sensitive to outliers when evaluating the significance of paired differences. At a significance level of α = 0.05 with N = 16 (the number of datasets), the test rejected the null hypothesis for every comparative method (autoencoders, temporal convolutions, RNN encoders, LSTM, VAE, PAA, and local extrema), indicating a significant difference in accuracy between the Transformer-based approach and the other methods. This agrees with the plot in Figure 3, where the paired error-rate differences lean predominantly toward the negative side, meaning lower error rates for the Transformer. However, further analysis is required to confirm whether this method can be definitively deemed superior.
The generated difference plot, shown in Figure 4, has the data predominantly distributed on the negative side (less than 0), suggesting that the Transformer generally outperforms the other methods.
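For reference, the rank sums behind the Wilcoxon signed-rank test can be computed as in the sketch below: zero differences are dropped, the absolute differences are ranked with tied values sharing their average rank, and the ranks are summed separately for positive and negative differences. This is an illustrative plain-Python version with a hypothetical function name; it returns only the two rank sums and does not compute a p-value. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used, and whether the reported results were produced that way is not stated here.

```python
def wilcoxon_rank_sums(errors_a, errors_b):
    """Return (W_plus, W_minus) for paired samples; zero differences are dropped."""
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over all entries tied with position i in absolute difference
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Method A has the lower error on every pair, so all differences are negative
print(wilcoxon_rank_sums([1, 2, 3, 4], [2, 4, 6, 8]))
```

When one method has lower error on nearly every dataset, almost all of the rank mass falls on one side, which is the pattern the negative-leaning difference plot reflects.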
Results of Wilcoxon signed rank test.

Wilcoxon signed rank test stat result plot.

Distribution of differences between transformer and other methods.
The validation above indicates that DTW similarity computation performs best when combined with Transformer feature extraction. For a more comprehensive evaluation, we compared our proposed method against each of the other feature extraction methods individually, each combined with DTW. The results, illustrated in Figure 5, depict each dataset as a point on the scatter plot. A higher density of points in the lower-right region indicates superior accuracy for the corresponding method. The results indicate that our method outperforms autoencoders, temporal convolutions, RNN encoders, LSTMs, VAE, PAA, and local extrema on most datasets.

Comparison of 1NN classification results between Transformer and other feature extraction methods combined with DTW similarity.
For the selection of sliding windows, we only iterate through the dimensions of the series to determine the best-performing window size and stride based on classification results. The results are summarized in Table 5 below.
Window and step size of the optimal results for classification of each dataset.
To evaluate the effectiveness of DTW in multivariate time series similarity measurement, we also ran the classification experiments with Euclidean distance and cosine similarity in place of DTW, keeping the Transformer-based feature extraction unchanged. This allows a direct comparison of DTW with these traditional measures in terms of their impact on classification performance for multivariate time series. The final results are presented in Table 6.
Table of comparison results for the three distance classifications.

Comparison chart of the three distance classifications.

Comparison of classification accuracy under three models.
Figure 6 clearly illustrates that the classification results of the three distance metrics are relatively close, with DTW outperforming the other two similarity measures. Furthermore, the classification outcomes achieved by combining similarity measures with Transformer-based feature extraction surpass those obtained with other feature extraction methods, further substantiating the effectiveness of Transformer in feature extraction.
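For comparison with DTW, the two traditional measures can be sketched as follows. Both operate on flattened feature sequences and, unlike DTW, require the two sequences to have the same length. Converting cosine similarity into a distance as one minus the similarity, so that it can drive a nearest-neighbor classifier, is an assumption made for this illustration, as are the function names.

```python
import math

def flatten(seq):
    """Concatenate the feature vectors of a sequence into one flat vector."""
    return [v for step in seq for v in step]

def euclidean_distance(a, b):
    """Euclidean distance between two feature sequences of equal length."""
    fa, fb = flatten(a), flatten(b)
    if len(fa) != len(fb):
        raise ValueError("Euclidean distance needs sequences of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))

def cosine_distance(a, b):
    """One minus the cosine similarity of two feature sequences of equal length."""
    fa, fb = flatten(a), flatten(b)
    if len(fa) != len(fb):
        raise ValueError("cosine similarity needs sequences of equal length")
    norm = math.sqrt(sum(x * x for x in fa)) * math.sqrt(sum(y * y for y in fb))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(fa, fb))
    return 1.0 - dot / norm

print(euclidean_distance([[0.0, 0.0]], [[3.0, 4.0]]))
print(cosine_distance([[1.0, 0.0]], [[0.0, 1.0]]))
```

Both measures compare time steps strictly position by position, so a small temporal shift between two otherwise similar sequences raises the distance, whereas DTW can absorb the shift through warping.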
To assess the contribution of each key component in the proposed framework, we conducted an ablation study designed to isolate and evaluate the impact of the sliding window mechanism, the Transformer-based feature extraction module, and the integration with Dynamic Time Warping (DTW) for similarity measurement. The study compares three experimental settings:
No Transformer (Windowed Features + DTW): In this setting, the raw windowed features are directly used for similarity measurement with DTW, without passing through the Transformer encoder. This configuration assesses the baseline value of windowed feature extraction alone.
No Sliding Window (Direct Transformer + DTW): Here, the entire multivariate time series is directly input into the Transformer encoder without any sliding window segmentation, and the extracted global features are compared using DTW. This setup examines the impact of removing localized temporal segmentation.
Sliding Window + Transformer + DTW (Full Model): This is the complete proposed framework, where multi-scale sliding window features are fed into the Transformer encoder, and the resulting representations are used with DTW for similarity measurement.
The results of the ablation experiments, summarized in Figures 7–10, reveal the critical importance of integrating all components. Specifically, removing the Transformer leads to a notable decline in performance, highlighting the essential role of deep feature extraction in modeling complex temporal dependencies and cross-variable interactions. When the sliding window component is removed, the model's ability to capture fine-grained local patterns diminishes, resulting in lower classification accuracy. Finally, the full model—combining sliding windows, Transformer-based feature extraction, and DTW—consistently achieves the highest accuracy and robustness, confirming that the interplay between local and global features is key to precise multivariate time series similarity measurement.

Comparison of classification precision under three models.

Recall comparison under three models.

Comparison of F1 scores under three models.
The ablation study results reveal a clear and consistent trend across all evaluated datasets: there is a strong positive correlation between classification accuracy and the other three performance metrics—precision, recall, and F1 score. Specifically, datasets where the proposed full model achieves high accuracy also consistently exhibit elevated precision, recall, and F1 scores, indicating that improvements in overall correctness are accompanied by balanced and robust detection of both positive and negative classes.
For instance, in datasets such as BM, Cricket, NATOPS, and AWR, the full model not only achieves the highest accuracy but simultaneously yields superior precision, indicating fewer false positives; higher recall, reflecting the model's ability to detect most true positives; and elevated F1 scores, demonstrating a well-balanced trade-off between precision and recall. This pattern suggests that, under the proposed framework, accuracy improvements are not achieved at the cost of sacrificing the quality of positive class identification or error control but instead reflect genuine model improvements across multiple dimensions.
Moreover, comparative analysis across ablation settings shows that when the accuracy drops (as seen in No Transformer or No Sliding Window configurations), the corresponding precision, recall, and F1 scores also decline, reinforcing the observed correlation. This alignment indicates that in the evaluated multivariate time series datasets, accuracy serves as a reliable summary indicator of overall model performance, reflecting not only correct predictions but also balanced precision-recall behavior.
These findings highlight the effectiveness of the proposed sliding window + Transformer + DTW framework in delivering improvements that are not isolated to a single metric but extend comprehensively across all critical evaluation dimensions.
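The four metrics compared in the ablation study can be computed from true and predicted labels as in the sketch below. Macro averaging over classes is an assumption, since the averaging scheme is not specified here, and the function name is hypothetical. A class with no predicted or no true samples contributes a score of zero rather than raising an error.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

print(macro_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

On balanced datasets, macro-averaged recall equals accuracy, which is one reason the metrics tend to move together across the ablation settings.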
Conclusion
In this study, we proposed a novel approach for multivariate time series similarity measurement, integrating sliding window input, Transformer-based feature extraction, and DTW similarity metrics. By incorporating the sliding window technique, we effectively decomposed multivariate time series into multiple subsequences, facilitating the capture of inter-dimensional features within the time series. The robust feature extraction capabilities of the Transformer model enabled us to identify complex patterns and long-range dependencies in the time series, thereby enhancing the quality of feature representation. Finally, DTW was employed as the similarity measurement method, ensuring high accuracy even under temporal distortions, particularly when handling nonlinear variations and asynchronous time scales. The results of the 1-NN classification experiments demonstrate that using sliding windows for subsequence segmentation effectively captures local variations in the time series, while Transformer-based feature extraction strengthens the understanding of global structures, thereby improving feature extraction performance. As a similarity measure, DTW excels in aligning time series, providing more precise distance metrics between series of differing forms. Compared to traditional measures such as Euclidean distance and cosine similarity, DTW exhibits superior adaptability and robustness in time series classification tasks.
In summary, the proposed method demonstrates outstanding performance in multivariate time series similarity measurement, particularly in handling complex and morphologically diverse time series. It not only improves measurement accuracy but also offers a new perspective on time series similarity analysis, opening new avenues for the application of Transformer models in time series analysis. Future work can explore the integration of additional feature extraction methods and distance measurement strategies to further enhance the model's performance and address more complex real-world challenges.
Footnotes
Acknowledgements
This research work is supported by the National Natural Science Foundation of China (Grant No.62192751 and 61425027).
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research work is supported by the National Natural Science Foundation of China (Grant No.62192751 and 61425027).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
