A similarity measurement method with sliding window approach based on transformer for multivariate time-series

Abstract

With the rapid advancements in industrial big data, the Internet of Things, and sensor acquisition technologies, the similarity measurement of multivariate time series has emerged as a pivotal research area in data mining and machine learning. To enhance the accuracy and efficacy of multivariate time series similarity measurement, this paper proposes a sliding window approach based on Transformer. Specifically, each dimension of the multivariate time series is processed through sliding windows and input into a Transformer for feature extraction. By using multiple window sizes, the method simultaneously captures localized temporal segment features and identifies local patterns within the time series. Encoded window features for each sample are combined to form a comprehensive feature sequence that represents the global characteristics of the entire time series. These global features are then used to compute the final similarity measure through Dynamic Time Warping (DTW). This approach effectively captures both local and global features of multivariate time series, significantly improving similarity measurement precision. The effectiveness of the proposed method is validated through 1-Nearest Neighbor (1NN) classification experiments. In these experiments, the proposed method achieved the best classification accuracy on ten of the sixteen datasets.

Keywords

Multivariate time series; similarity measurement; Transformer; sliding window; feature extraction; dynamic time warping (DTW)

1. Introduction

Time series are a prevalent data type, widely found in fields such as finance, healthcare, electronic information, and meteorology.¹ Examples include daily stock market transactions, product sales data, meteorological records, and flight data recorded by “black boxes.” With the continuous improvement in data storage and processing capabilities, the scale of time series data storage in real-world applications has grown exponentially over time.² Consequently, time series data mining has become a prominent research focus in the field of data mining.³ Similarity measurement is one of the core techniques in time series data mining, as its accuracy directly impacts the effectiveness of data mining.⁴ In recent years, the study of time series data mining (TSDM) has attracted researchers from various disciplines. Compared with univariate time series, the research on similarity measurement for multivariate time series is relatively limited, with many unresolved challenges.⁵ In real-world scenarios, multivariate time series are more common. For instance, stock transactions can be described using attributes such as opening price, closing price, highest price, lowest price, and trading volume. Moreover, multimedia data (e.g., audio, images) can also be converted into multivariate time series. Therefore, studying the similarity measurement of multivariate time series holds significant theoretical value and broad application potential.

Currently, prominent methods for measuring multivariate time series similarity include Euclidean Distance (ED),⁶ Singular Value Decomposition (SVD),⁷ Point Distribution-based Features (PD),⁸ Dynamic Time Warping (DTW) Distance,⁹ and Trend Distance (TD).¹⁰ Each method has its respective advantages and limitations. Ultimately, we adopt DTW distance due to its ability to handle similarity measurement between time series of different lengths, its flexibility in accommodating stretching and warping along the time axis, and its superior accuracy and robustness. Consequently, DTW is widely employed in this domain.

Multivariate time series often encompass many variables and data points, making direct processing susceptible to the curse of dimensionality, which significantly increases computational complexity and storage requirements. Feature extraction serves as an effective approach to project high-dimensional data into a lower-dimensional space, retaining essential information while eliminating redundancy.¹¹ This not only simplifies data representation but also enhances computational efficiency and mitigates the impact of noise and redundant features. Given the high dimensionality (multiple variables) and extensive data points characteristic of multivariate time series, directly computing similarity can be computationally intensive and prone to noise interference.¹² By leveraging feature extraction, high-dimensional time series data can be transformed into a lower-dimensional feature space, thereby streamlining similarity computations. This approach reduces computational complexity, minimizes data redundancy and noise, and ultimately ensures more efficient and robust similarity measurements.

Feature extraction is particularly important when dealing with multivariate time series for several key reasons:

Dimensionality Reduction and Redundancy Removal: Multivariate time series often involve numerous variables spanning long observation periods. Direct similarity computation across all variables is not only computationally expensive but also susceptible to interference from irrelevant or redundant information. Feature extraction compresses the data by retaining only the most salient information, thereby effectively reducing dimensionality and alleviating computational burden.

Highlighting Key Patterns and Relationships: Beyond summarizing individual variable behaviors, multivariate time series analysis requires capturing the complex interactions and couplings between variables. Feature extraction techniques are designed to detect and encode these cross-variable relationships, which are critical for precise and meaningful similarity measurement. As demonstrated in prior works analyzing EEG data, carefully selected features can reveal hidden temporal and spatial patterns across channels that are crucial for accurate classification and recognition tasks.¹³,¹⁴

Enhancing Robustness and Noise Resistance: Raw multivariate time series data are often contaminated by noise, outliers, and missing values. By using feature extraction methods—such as sliding window-based statistics, frequency-domain transformations, or deep learning-based embeddings—the resulting representations emphasize stable and representative signals while filtering out transient disturbances. This robust feature transformation improves the reliability of similarity computations, a critical advantage noted in advanced EEG analysis applications.¹⁵

Providing Better Inputs for Downstream Algorithms: While algorithms like Dynamic Time Warping (DTW) are powerful for aligning temporal sequences, they struggle with raw high-dimensional inputs due to the curse of dimensionality. Feeding DTW or other similarity measures with carefully extracted, compact feature representations dramatically lowers computational costs, improves alignment quality, and enhances downstream task performance. For example, brain-computer interface systems that leverage optimized feature spaces, as highlighted in recent EEG classification studies, achieve both higher efficiency and accuracy compared to raw-signal-based approaches.

In summary, the challenge of similarity measurement in multivariate time series can be defined as determining and computing a distance or similarity metric that effectively captures the resemblance between two sequences across both temporal and multivariate feature dimensions. Feature extraction plays a pivotal role in this process by transforming complex raw data into meaningful, compact representations that enable efficient, robust, and interpretable similarity computations.

In Section 2.2, we extensively discuss the feature extraction methods and present the rationale for selecting Transformer as the feature extraction technique. Based on the characteristics of multivariate time series feature extraction, the key conclusions of this work are summarized as follows:

A sliding window is applied to each dimension of the multivariate time series to extract features between variables, enabling the capture of local patterns.

Transformer-encoded sequences of varying window sizes are merged to form a feature sequence, representing the global characteristics of the entire time series.

Global features are utilized for Dynamic Time Warping (DTW) similarity computation, ultimately providing the final similarity measure. The effectiveness of the feature extraction and similarity integration is validated through 1NN experiments.

From these findings, we conclude that Transformer proves to be effective in time series feature extraction, while DTW shows robustness in similarity computation. The combination of these two methods helps to accurately measure the similarity of time series.

2. Related work

2.1. Definitions

A series of observations recorded in chronological order, denoted as $x_{t} (j)$ , is referred to as a time series, where $t (t = 1, 2, \dots, T)$ represents the t-th moment, and $j (j = 1, 2, \dots, d)$ indicates the j-th variable. $x_{t} (j)$ specifically represents the recorded value of the j-th variable at time t.³ When d = 1, $x_{t} (j)$ constitutes a univariate time series (UTS); when d > 1, $x_{t} (j)$ is classified as a multivariate time series (MTS).

A multivariate time series is defined as $X = [x_{1}, x_{2}, \dots, x_{D}]$ , where:

T denotes the number of time steps. D represents the number of variables (dimensions), with each variable $x_{d} = [x_{d} (1), x_{d} (2), \dots, x_{d} (T)]$ corresponding to time series data. X can be expressed in matrix form as: $X \in R^{D \times T}$ .

2.1.1. Feature extraction formalization

The process of extracting the feature representation $Z$ can be defined as a mapping:

\begin{aligned} Φ : X \mapsto Z, Z \in R^{m \times D} \end{aligned}

(1)

Here, $Φ$ is the feature extraction function, $Z$ is the reduced-dimensional feature representation, and m is the number of features.

Univariate Feature Extraction:

For each variable $x_{d}$ , extract the feature vector $z_{d}$ , i.e., $Φ_{d} : x_{d} \mapsto z_{d}$ .

Multivariate Interaction Features:

Modeling the relationships between variables to generate interaction features $Φ_{i j} (x_{i}, x_{j})$ .

Global Features:

Extract the global representation $Φ_{global} (X)$ from the entire time series $X$ .

The goal is to extract the feature representation $Z$ for similarity analysis. Figure 1 illustrates a simplified diagram of feature extraction from a sequence.

Figure 1.

The process of feature extraction.

2.1.2. Similarity analysis formalization

Given two multivariate time series, $X^{(1)}$ and $X^{(2)}$ , with feature representations $Z^{(1)}$ and $Z^{(2)}$ respectively, define a similarity measure function $S (X^{(1)}, X^{(2)})$ to calculate the similarity between them. The similarity analysis can be expressed as:

\begin{aligned} S (X^{(1)}, X^{(2)}) = g (Z^{(1)}, Z^{(2)}) \end{aligned}

(2)

Here, S represents the similarity score function, and g denotes the similarity measurement method, such as Euclidean distance, cosine similarity, Dynamic Time Warping (DTW), or kernel-based approaches.

Univariate similarity

Compute similarity on a per-variable basis:

\begin{aligned} S_{d} = g_{d} (z_{d}^{(1)}, z_{d}^{(2)}), \quad d = 1, 2, \dots, D \end{aligned}

(3)

Similarity of inter-variable relationships

Calculate similarity for interaction features:

\begin{aligned} S_{i j} = g_{i j} (Φ_{i j} (x_{i}^{(1)}, x_{j}^{(1)}), Φ_{i j} (x_{i}^{(2)}, x_{j}^{(2)})) \end{aligned}

(4)

Comprehensive similarity

Define a weighted similarity score by combining univariate features and interaction features:

\begin{aligned} S_{overall} = α \sum_{d = 1}^{D} S_{d} + β \sum_{i, j} S_{i j} + γ g_{global} (Φ_{global} (X^{(1)}), Φ_{global} (X^{(2)})) \end{aligned}

(5)

Here, $α, β, γ$ represent the weight parameters.
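As a minimal sketch of Equation (5), the weighted combination can be written as a small function. The function name, argument names, and default weights below are illustrative assumptions; the text does not specify how $α, β, γ$ are chosen.

```python
def overall_similarity(per_variable, pairwise, global_score,
                       alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted combination in the form of Eq. (5).

    per_variable : iterable of S_d values, one per variable d
    pairwise     : iterable of S_ij values, one per variable pair (i, j)
    global_score : the single global term g_global(...)
    alpha, beta, gamma : weight parameters (default values are assumptions)
    """
    return (alpha * sum(per_variable)
            + beta * sum(pairwise)
            + gamma * global_score)


# Example with two variables, one interaction term, and one global term.
score = overall_similarity([0.5, 0.25], [0.125], 1.0,
                           alpha=2.0, beta=4.0, gamma=1.0)
```

Setting one weight to zero removes the corresponding term, which makes it easy to compare, for instance, a purely global score against a combined one.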

2.2. Feature extraction and representation methods

Traditional sequence feature representation methods typically account for the temporal dependencies and multidimensional characteristics of time series data. Below are some classical time series feature representation methods:

Statistical feature-based methods offer simple and intuitive advantages in time series feature extraction, such as mean, variance, maximum, and minimum, making them easy to compute and understand.¹⁶ However, these methods have significant drawbacks: they fail to capture the temporal dependencies within the time series and only reflect overall statistical measures, overlooking the sequential relationships and dynamic patterns between data points. Moreover, these methods are insensitive to the temporal order, unable to represent trends or abrupt changes in the sequence, and are particularly vulnerable to noise and outliers, which may lead to feature distortion.¹⁷ Additionally, statistical features have limited capacity for extracting information from complex time series, often losing critical temporal information, making it challenging for models to effectively capture the complexity and diversity of the sequence. This limitation is especially evident in high-dimensional time series, where such methods cannot adequately describe the interactions and dependencies across different dimensions, hindering their performance in complex temporal tasks.¹⁸
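The insensitivity to temporal order can be seen in a small example. The sketch below uses only the Python standard library, and the helper name `summary_features` is an assumption introduced for illustration. A rising series and its reversed, falling counterpart yield identical statistical features even though their trends are opposite.

```python
import statistics


def summary_features(series):
    """Return (mean, population variance, maximum, minimum) of a series."""
    return (
        statistics.fmean(series),
        statistics.pvariance(series),
        max(series),
        min(series),
    )


rising = [1.0, 2.0, 3.0, 4.0, 5.0]
falling = list(reversed(rising))

# The two series differ as sequences but are indistinguishable
# from their summary statistics alone.
same_features = summary_features(rising) == summary_features(falling)
```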

Model-based approaches: Autoregressive (AR)¹⁹ models, Autoregressive Integrated Moving Average (ARIMA)²⁰ models, and Hidden Markov Models (HMM)²¹ are classical time series analysis methods, yet they exhibit certain limitations when handling complex temporal tasks. ARIMA assumes that the series is linear and stationary, which makes it difficult to effectively capture nonlinear and intricate dynamic patterns. Additionally, ARIMA is sensitive to parameter selection and preprocessing (such as differencing), and performs poorly when addressing long-range dependencies or multivariate time series.²² HMM assumes that the relationship between the observed sequence and hidden states is independent and follows a Markov process, which limits its ability to model long-term dependencies and complex interactions.²³ Furthermore, HMM is sensitive to dataset size and initialization, often getting stuck in local optima, with high computational complexity, making it unsuitable for large-scale or high-dimensional time series analysis. These limitations constrain their applicability in modern, complex time series analysis tasks.

Dimensionality reduction approaches: Principal Component Analysis (PCA) is a classical dimensionality reduction method²⁴ that projects multivariate time series onto a smaller set of principal components, reducing data complexity while preserving most of the variance. However, the limitations of PCA are quite evident: it can only capture linear relationships in the data and struggles to handle complex nonlinear structures. PCA is also sensitive to outliers, as anomalous data points can significantly distort the reduction results. The principal components obtained after dimensionality reduction often lack clear interpretability, making them difficult to interpret.²⁵ Additionally, PCA is sensitive to feature scaling, requiring data normalization, and cannot directly handle missing values, necessitating preprocessing of incomplete data. During the reduction process, PCA may discard information with low variance that could still be meaningful for the analysis, limiting its effectiveness when dealing with complex, high-dimensional, and nonlinear data.

Deep learning-based feature extraction methods²⁶: Autoencoders²⁷ perform feature extraction and dimensionality reduction through unsupervised learning, allowing them to learn effective representations from unlabeled data. They are well-suited for anomaly detection and generative tasks. However, their performance is limited when dealing with time series data, as they struggle to effectively model temporal dependencies and adapt to long or complex time series. Additionally, the interpretability of the features they output is low. Compared to Transformers, autoencoders excel in feature compression but lack the ability to capture global and long-range dependencies. Temporal convolutional networks (TCNs)²⁸ efficiently extract local features and possess strong parallel processing capabilities, making them suitable for modeling periodic or frequency-based features. They have a relatively simple network structure and fewer parameters. However, their receptive field is limited, making it difficult to capture long-term dependencies, and they struggle with modeling non-stationary time series patterns. In comparison to Transformers, TCNs are more efficient but are restricted to local feature extraction and perform poorly in global dependency modeling. Recurrent Neural Networks (RNNs)²⁹ excel in processing sequential data and capturing dynamic dependencies in time series through sequential updates of the hidden state, making them ideal for tasks that require strict temporal ordering. However, their computational inefficiency, lack of parallelization, and vulnerability to gradient vanishing or explosion problems limit their ability to model long-range dependencies in long sequences. Compared to Transformers, RNNs are less efficient and scalable, although they remain competitive for small-scale sequence tasks. 
Long Short-Term Memory (LSTM)³⁰ networks address the gradient vanishing issue of RNNs with memory cells and gating mechanisms, enabling them to capture long-range dependencies, making them well-suited for time series prediction and sequence classification tasks. However, the sequential nature of their computation limits training efficiency, and the large number of parameters increases optimization complexity. LSTMs also have limited memory capacity and struggle with extremely long sequences. In comparison to Transformers, LSTMs still perform well in capturing long-range dependencies but fall short in terms of efficiency and global relationship modeling.

Transformers, relying on self-attention mechanisms, excel at modeling global dependencies and offer high parallelism, making them ideal for handling long sequences and tasks involving various data types.³¹ Their flexibility is exceptional. However, their high computational complexity and significant memory overhead for long sequences, as well as the need for additional positional encoding due to the lack of inherent sequential order, pose challenges. Furthermore, they may not perform as well in small sample scenarios compared to traditional methods. Overall, Transformers outperform in large-scale tasks, though optimization through integration with other models may be necessary in specific scenarios.³²

2.3. Application of transformer in time series

The application of Transformer models in time series analysis has garnered increasing attention due to their powerful feature representation and long-range dependency modeling capabilities,³⁰ which allow them to excel in handling time series data. Transformers demonstrate robust performance across a wide range of tasks in time series analysis, including prediction, classification, anomaly detection, similarity measurement, data generation, multivariate modeling, and preprocessing.³³

In prediction tasks, Transformers capture long-term dependencies and complex patterns through their self-attention mechanism, enabling efficient multi-step forecasting.³⁴ In classification, the model enhances discrimination ability by leveraging global feature extraction and high-dimensional encoding.³⁵ For anomaly detection, Transformers effectively identify outliers using reconstruction errors and attention mechanisms.³⁶ For data generation, Transformer-based generative models simulate sequences with similar characteristics.³⁷ In multivariate modeling, multi-head attention is employed to capture interaction relationships and model variable dependencies. During preprocessing, Transformers utilize masking techniques to fill in missing values and reduce noise, thereby improving data quality.

The application of Transformers in time series analysis capitalizes on the advantages of their self-attention mechanism and hierarchical feature learning, enabling them to handle complex temporal dependencies, capture nonlinear patterns, and improve performance across prediction, classification, anomaly detection, and data generation tasks.³⁸ As research progresses, the use of Transformers and their variants in time series analysis will continue to expand and deepen.

2.4. Dynamic time warping (DTW) similarity measure

Dynamic Time Warping (DTW)³⁹ is an algorithm used to compute the similarity between time series by allowing nonlinear alignment of the time axis. It is particularly effective in comparing sequences that may vary in speed or exhibit misalignments between time steps. DTW has a wide range of applications in time series data analysis, especially in fields such as speech recognition, handwriting recognition, and biological data analysis. The ability to handle such temporal variations makes DTW a powerful tool in domains where time series exhibit complex patterns that are not perfectly aligned.

DTW is a method for measuring the similarity between two time series by finding an optimal alignment path that minimizes the difference between the sequences at each aligned time step. This approach allows for nonlinear adjustments along the time axis to achieve the best possible match between the sequences. By considering such non-linear temporal shifts, DTW can effectively handle time series that are out of phase or vary in speed.

Consider two sequences $X = [x_{1}, x_{2}, \dots, x_{n}]$ and $Y = [y_{1}, y_{2}, \dots, y_{m}]$ , where each time step $x_{i} = [x_{i 1}, x_{i 2}, \dots, x_{i D}]$ and $y_{j} = [y_{j 1}, y_{j 2}, \dots, y_{j D}]$ is a D-dimensional vector.

The distance between time steps $x_{i}$ and $y_{j}$ is:

\begin{aligned} d (x_{i}, y_{j}) = \sqrt{\sum_{d = 1}^{D} {(x_{i d} - y_{j d})}^{2}} \end{aligned}

(6)

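The final similarity distance is the average point-wise distance along the optimal warping path $(i_{1}, j_{1}), \dots, (i_{L}, j_{L})$ of length $L$ :

\begin{aligned} DTW (X, Y) = \frac{1}{L} \sum_{k = 1}^{L} d (x_{i_{k}}, y_{j_{k}}) \end{aligned}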

(7)
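Equations (6) and (7) can be sketched with standard dynamic programming. The sketch below makes two assumptions that the text does not state: the warping path is the one that minimises the cumulative distance under the usual three step moves (diagonal, vertical, horizontal), and ties between equal-cost predecessors are broken in favour of the shorter path. The function names are illustrative, and both sequences are assumed to be non-empty.

```python
import math


def euclidean(a, b):
    """Eq. (6): Euclidean distance between two D-dimensional time steps."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def dtw_distance(X, Y):
    """Eq. (7): cumulative DTW cost divided by the warping path length L."""
    n, m = len(X), len(Y)
    inf = float("inf")
    # cost[i][j]: minimal cumulative distance aligning X[:i] with Y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    # steps[i][j]: number of aligned pairs on that minimal-cost path
    steps = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Best predecessor: lowest cost first, then fewest steps.
            prev_cost, prev_steps = min(
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),  # diagonal
                (cost[i - 1][j], steps[i - 1][j]),          # vertical
                (cost[i][j - 1], steps[i][j - 1]),          # horizontal
            )
            cost[i][j] = euclidean(X[i - 1], Y[j - 1]) + prev_cost
            steps[i][j] = prev_steps + 1
    return cost[n][m] / steps[n][m]


# Two sequences of different lengths that trace the same shape.
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
Y = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
distance = dtw_distance(X, Y)
```

In the example, `Y` repeats one time step of `X`; the warping path absorbs the repetition, which is the behaviour that makes DTW suitable for sequences of different lengths. The table has $n \times m$ entries, so the cost grows with the product of the sequence lengths.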

3. Method

3.1. Problem description and modeling

Problem Description: The goal of multivariate time series analysis is to extract meaningful and informative features from sequences composed of multiple interrelated variables, enabling tasks such as similarity measurement, classification, and prediction. Traditional feature extraction methods—such as statistical measures, autoregressive models (AR, ARIMA), and dimensionality reduction techniques like Principal Component Analysis (PCA)—are often limited in their ability to capture complex, nonlinear relationships and long-range temporal dependencies inherent in multivariate time series. To address these limitations, this paper adopts a deep learning-based approach, leveraging the Transformer architecture to model multivariate time series and enhance the accuracy and expressiveness of extracted features.

Modeling Objective: The primary objective is to construct a model capable of automatically learning deep, high-level representations of multivariate time series data. By employing the self-attention mechanism, the model effectively captures both temporal dependencies and interactions between variables across time. Furthermore, the extracted feature representations are integrated with Dynamic Time Warping (DTW) for similarity measurement, enabling precise and robust analysis of multivariate time series across various application domains.

3.2. Transformer architecture

The core of our approach is built upon the Transformer encoder architecture shown in Figure 2, which follows the original design introduced by Vaswani et al.⁴⁰ but explicitly omits the decoder component. This design choice is intentional, as the decoder is primarily tailored for generative tasks where the output sequence length is not predetermined, such as machine translation, text summarization, or time series forecasting. Notably, the decoder requires access to the (masked) ground-truth output sequence during training, making it unsuitable for tasks like classification or external regression.

In contrast, our objective is to develop a unified framework capable of addressing a wide range of tasks, including classification, regression, inference, and, when needed, generative prediction. The encoder-only architecture provides this versatility, enabling it to handle both discriminative and generative tasks effectively. Moreover, removing the decoder substantially reduces the overall number of model parameters—by approximately half—thereby improving computational efficiency, accelerating training, and lowering the risk of overfitting. This streamlined design ultimately enhances both the performance and generalizability of the proposed framework.

Figure 2.

Transformer encoder architecture.

Sliding Window Description: The sliding window runs along the dimension axis rather than the time axis. The window slides across each dimension, extracting fixed-length sequences that form a new feature representation, and consecutive windows overlap. This method is effective in capturing localized features and patterns within each dimension. By leveraging the dimensional segmentation capability of the sliding window and the temporal modeling capability of the Transformer encoder, this approach effectively captures complex dynamic relationships in multivariate time series.
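One possible reading of this description is sketched below: each dimension (variable) is processed separately, and overlapping fixed-length windows are extracted from that variable's series. The function names, the $D \times T$ list-of-lists layout, and the example window sizes and step are assumptions made for illustration.

```python
def sliding_windows(series, window_size, step):
    """Return overlapping fixed-length windows of one variable's series."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    last_start = len(series) - window_size
    return [series[s:s + window_size] for s in range(0, last_start + 1, step)]


def window_each_dimension(X, window_size, step):
    """Apply the sliding window to every dimension of X.

    X is assumed to be a list of D variables, each a list of T values.
    The result holds, for each dimension, its list of windows.
    """
    return [sliding_windows(x_d, window_size, step) for x_d in X]


# Two variables observed over five time steps.
X = [
    [1, 2, 3, 4, 5],
    [10, 20, 30, 40, 50],
]

# Several window sizes can be applied in turn; a step smaller than the
# window size makes consecutive windows overlap.
windows_by_size = {w: window_each_dimension(X, w, step=1) for w in (2, 3)}
```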

Steps for Implementing Transformer Feature Extraction:

Input sequence $X \in R^{T \times D}$ : Perform data standardization to normalize each dimension and eliminate the influence of scale differences:

\begin{aligned} X^{'} = \frac{X - μ}{σ} \end{aligned}

(8)

where $μ$ is the mean, and $σ$ is the standard deviation of the respective dimension.
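A minimal sketch of Equation (8) using only the Python standard library follows. It assumes a $T \times D$ list-of-rows layout, uses the population standard deviation, and maps a constant dimension (where $σ = 0$) to zeros; these choices are assumptions, since the text does not specify them.

```python
import statistics


def standardize(X):
    """Z-score each dimension (column) of a T x D sequence, as in Eq. (8)."""
    columns = list(zip(*X))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [
            (value - mu) / sigma if sigma > 0 else 0.0  # guard constant columns
            for value, mu, sigma in zip(row, means, stds)
        ]
        for row in X
    ]


# Two time steps, two dimensions; the second dimension is constant.
X_std = standardize([[1.0, 10.0], [3.0, 10.0]])
```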

Input Embedding: The input is mapped to a specific dimension through a linear transformation:

\begin{aligned} Z = X^{'} W_{e} + b_{e}, Z \in R^{T \times d_{model}} \end{aligned}

(9)

where $W_{e}$ is the weight matrix, $b_{e}$ is the bias term, and $d_{model}$ represents the model's embedding dimension.

Add positional information to the embedded time steps:

\begin{aligned} Z_{p} = Z + PE \end{aligned}

(10)

Here, $PE$ is computed using sine and cosine functions.
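The text states only that sine and cosine functions are used. The sketch below assumes the fixed sinusoidal scheme of the original Transformer design, in which even feature indices use a sine and odd indices a cosine with a base of 10000. The function names are illustrative.

```python
import math


def sinusoidal_positional_encoding(T, d_model):
    """Return a T x d_model table of sinusoidal positional encodings."""
    pe = [[0.0] * d_model for _ in range(T)]
    for pos in range(T):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even index: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd index: cosine
    return pe


def add_positional_encoding(Z):
    """Eq. (10): add the encoding element-wise to a T x d_model embedding."""
    pe = sinusoidal_positional_encoding(len(Z), len(Z[0]))
    return [[z + p for z, p in zip(z_row, p_row)] for z_row, p_row in zip(Z, pe)]


# An all-zero embedding makes the added encoding directly visible.
Z_p = add_positional_encoding([[0.0] * 4 for _ in range(3)])
```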

Features are then extracted through a multi-head self-attention mechanism, residual connections combined with layer normalization, a feedforward neural network, and multiple stacked layers, ultimately producing the output features:

\begin{aligned} Z_{output} \in R^{T \times d_{model}} \end{aligned}

(11)

In the feature extraction process for multivariate time series based on Transformer, the first step involves input preprocessing, which standardizes the data, segments the time series, and handles missing values with masking, ensuring uniform feature scales and consistent input lengths. Next, the self-attention mechanism in Transformer is employed to capture the relationships between time steps in the time series, assigning higher weights to important time steps to extract key features. The multi-head attention mechanism concurrently attends to different subspace features of the time series, enhancing the model's expressive power. Following the self-attention layer, the feedforward neural network performs further feature transformation and nonlinear combinations, enriching the feature representation. During the feature representation and encoding stage, positional encoding is introduced to address the time-step awareness issue in Transformer, with common methods including sine-cosine encoding and learnable positional embeddings. By combining positional encoding with input features, the self-attention mechanism generates a global representation of the time series that reflects the relationships and sequential information between time steps. Through the stacking of multiple Transformer modules, higher-order features are progressively extracted, enabling hierarchical feature learning and producing the final feature sequence.
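To make the role of self-attention concrete, the following sketch implements a single head of scaled dot-product attention in plain Python. It is a deliberate simplification and not the full encoder: the query, key, and value projections are taken to be the identity (an assumption), and multi-head splitting, residual connections, layer normalization, and the feedforward network are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Z):
    """Single-head scaled dot-product self-attention with Q = K = V = Z.

    Z is a T x d list of rows; the output has the same shape. Each output
    row is a weighted average of all time steps, with weights given by the
    scaled similarity between the current step and every other step.
    """
    d = len(Z[0])
    output = []
    for query in Z:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in Z
        ]
        weights = softmax(scores)
        output.append(
            [sum(w * value[c] for w, value in zip(weights, Z)) for c in range(d)]
        )
    return output


attended = self_attention([[0.0], [2.0]])
```

Because every output row mixes information from all time steps, the representation at each position reflects the whole window, which is the "global" property the paragraph above refers to.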

3.3. Similarity measurement combining DTW

In this method, DTW is used to measure the similarity between feature representations extracted by Transformer, enabling the capture of both local and global similarities within the sequences. This approach effectively accommodates sequence comparisons with varying lengths and speed changes. Finally, the similarity analysis between multivariate time series $X^{(1)}$ and $X^{(2)}$ , with feature representations $Z^{(1)}$ and $Z^{(2)}$ respectively, can be expressed as:

\begin{aligned} S (X^{(1)}, X^{(2)}) & = g (Z^{(1)}, Z^{(2)}) \end{aligned}

(12)

\begin{aligned} g (Z^{(1)}, Z^{(2)}) & = DTW (Z^{(1)}, Z^{(2)}) \end{aligned}

(13)

\begin{aligned} DTW (Z^{(1)}, Z^{(2)}) & = \frac{1}{L} \sum_{k = 1}^{L} d (z_{i_{k}}^{(1)}, z_{j_{k}}^{(2)}) \end{aligned}

(14)

\begin{aligned} S (X^{(1)}, X^{(2)}) & = \frac{1}{L} \sum_{k = 1}^{L} d (z_{i_{k}}^{(1)}, z_{j_{k}}^{(2)}) \end{aligned}

(15)

The final step is the calculation of the similarity between the two sequences.

3.4. Algorithmic process

In this approach, we first choose the window size and step size and use a sliding window to partition each multivariate time series into overlapping subsequences. In the first part, each subsequence is encoded with a Transformer model to extract meaningful local features, and these window-level features are then aggregated to represent the complete sequence. In the second part, the similarity between two sequences is computed as the best-fit alignment cost obtained by applying Dynamic Time Warping (DTW) to their respective feature sequences. By repeating this process across all sample pairs, we obtain a similarity matrix that captures the DTW-based distance between time series in the dataset.
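The overall flow can be sketched end to end as follows. The Transformer encoder is replaced by a placeholder (`encode_window`, which simply averages each dimension over the window) so that the sketch stays self-contained; it is a stand-in, not the proposed encoder. The $T \times D$ list-of-rows layout, the function names, and the path-normalised DTW variant are likewise assumptions.

```python
import math


def windows(sequence, size, step):
    """Partition a T x D sequence into overlapping subsequences."""
    return [sequence[s:s + size] for s in range(0, len(sequence) - size + 1, step)]


def encode_window(window):
    """PLACEHOLDER for the Transformer encoder: per-dimension mean."""
    return [sum(column) / len(column) for column in zip(*window)]


def feature_sequence(sequence, size, step):
    """Aggregate window-level features into one feature sequence."""
    return [encode_window(w) for w in windows(sequence, size, step)]


def dtw(A, B):
    """Cumulative DTW cost divided by the warping path length."""
    n, m = len(A), len(B)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    steps = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev_cost, prev_steps = min(
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),
                (cost[i - 1][j], steps[i - 1][j]),
                (cost[i][j - 1], steps[i][j - 1]),
            )
            local = math.sqrt(sum((p - q) ** 2 for p, q in zip(A[i - 1], B[j - 1])))
            cost[i][j] = local + prev_cost
            steps[i][j] = prev_steps + 1
    return cost[n][m] / steps[n][m]


def distance_matrix(samples, size, step):
    """Pairwise DTW distances between the feature sequences of all samples."""
    features = [feature_sequence(s, size, step) for s in samples]
    n = len(features)
    matrix = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            matrix[a][b] = matrix[b][a] = dtw(features[a], features[b])
    return matrix


# Three univariate samples (D = 1); the first and third are identical.
samples = [
    [[0.0], [1.0], [2.0], [3.0]],
    [[5.0], [5.0], [5.0], [5.0]],
    [[0.0], [1.0], [2.0], [3.0]],
]
matrix = distance_matrix(samples, size=2, step=1)
```

Only the upper triangle is computed and then mirrored, since the distance between a pair does not depend on the order of its arguments.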

4. Experimental validation

This section primarily focuses on the validation of the method for multivariate time series similarity measurement, which combines sliding window-based Transformer feature extraction with DTW. Time series data mining finds applications across various domains, including clustering, classification, fault detection, pattern recognition, and prediction.³⁸ In these research tasks, measuring the similarity between two sequences is crucial, especially in time series classification tasks. Efficient and accurate classification of sequences heavily relies on the precise measurement of sequence similarity. Therefore, we primarily use classification methods to validate the feasibility and accuracy of the proposed similarity measurement.

Among various classification methods, the 1-nearest neighbor (1-NN) classifier is widely used due to its simplicity, efficiency, and parameter-free nature.³⁹ When combined with an appropriate distance or similarity metric, such as Dynamic Time Warping (DTW), 1-NN classification has proven to be effective. In fact, the choice of distance or similarity measure is critical to the accuracy of the 1-NN classifier. This method has shown significant progress in time series classification research. Thus, we use 1-NN to validate the proposed similarity measurement.

4.1. Datasets

In this section, we describe the datasets used for evaluating the proposed method. The datasets used in time series analysis cover various domains and provide different challenges, such as high dimensionality, temporal dependencies, and varying data patterns. The selected datasets allow us to test the performance of the sliding window-based Transformer feature extraction method combined with DTW for similarity measurement across multiple tasks.

In our experiment, we used the public UEA 2018 multivariate time series classification archive⁴¹ to validate the performance of our method. The UEA archive has been widely used as a benchmark for evaluating time series data mining algorithms. In its previous version (2018), the UEA repository includes 128 datasets collected from various domains, such as medicine, meteorology, and computer vision. The datasets vary in terms of time series length, training/test set size, and the number of classes. Each dataset is divided into two parts: one for training and the other for testing.

Table 1 presents the relevant parameters for each dataset.

Table 1.
Relevant parameters for each dataset.

Dataset	Train Cases	Test Cases	Dimensions	Length	Classes
AWR	275	300	9	144	25
AF	15	15	2	640	3
BM	40	40	6	100	4
Cricket	108	72	6	100	4
EC	261	263	3	1751	4
FM	316	100	28	50	2
HMD	320	147	10	400	4
Handwrit	150	850	3	152	26
Libras	180	180	2	45	15
LSST	2495	2466	6	36	14
NATOPS	180	180	24	51	6
RS	151	152	6	30	4
SRSCP1	269	293	6	896	2
SRSCP2	200	180	7	1152	2
SWJ	12	15	4	2500	3
UWGL	120	320	3	315	8

4.2. Validation method

We use the 1NN classifier as the evaluation method in our experiments. Although the 1NN classifier is simple, it demonstrates strong effectiveness when combined with an appropriate similarity measure. Additionally, the accuracy of the 1NN classifier is highly dependent on the choice of distance measure, making it an excellent method for validating the effectiveness of time series similarity metrics. DTW is frequently used as a similarity measure in the 1NN classifier.
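The evaluation protocol can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: it assumes each sample is already a sequence of equal-length feature vectors, uses the Euclidean distance between vectors as the local DTW cost, and applies no warping-window constraint. The names `dtw_distance`, `predict_1nn`, and `accuracy_1nn` are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two sequences of vectors.

    Assumption: the local cost is the Euclidean distance between vectors.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def predict_1nn(train_x, train_y, query, distance=dtw_distance):
    """Return the label of the training sample closest to `query`."""
    nearest = min(range(len(train_x)), key=lambda i: distance(train_x[i], query))
    return train_y[nearest]


def accuracy_1nn(train_x, train_y, test_x, test_y, distance=dtw_distance):
    """Fraction of test samples whose nearest training neighbor has the right label."""
    hits = sum(
        predict_1nn(train_x, train_y, q, distance) == y
        for q, y in zip(test_x, test_y)
    )
    return hits / len(test_y)
```

Because the classifier has no parameters of its own, a change in accuracy can be attributed to the function passed as `distance`, which is what makes 1NN a convenient probe for similarity measures.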

All experiments were conducted on a machine with the following specifications: 64-bit Windows 11, a 12th Gen Intel(R) Core(TM) i7-12700H CPU at 2.30 GHz, Anaconda3, and Python 3.8.1.

4.3. Feature extraction experimental results and analysis

To evaluate the effectiveness of Transformer-based feature extraction, we compare it with other feature extraction methods: Autoencoders, Temporal Convolutional Networks (TCN), RNN Encoders, LSTM, VAE, PAA, and Local Extrema. These methods cover both deep learning-based and time-domain feature extraction. All experiments were conducted on a single machine. Because the deep learning methods require a choice of network depth (number of layers), we searched over depths from 1 to 10 for each method and dataset to mitigate the influence of this parameter on the results, and kept the number of layers n that achieved the highest classification accuracy. The selected values are summarized in Table 2.

Table 2.
Number of network layers in deep learning models.

Dataset	Autoencoders	TCN	RNN Encoders	LSTM	VAE	Transformer
AWR	1	7	5	3	2	1
AF	2	9	7	1	9	5
BM	3	4	1	6	7	4
Cricket	6	3	4	1	2	1
EC	3	1	7	1	8	9
FM	2	8	4	9	2	4
HMD	2	2	9	2	4	4
Handwrit	4	7	4	1	4	4
Libras	2	3	4	1	2	2
LSST	6	4	3	1	5	5
NATOPS	8	3	2	3	3	5
RS	2	4	6	1	4	7
SRSCP1	9	6	4	3	1	6
SRSCP2	2	8	4	3	4	7
SWJ	3	3	5	1	5	4
UWGL	4	6	3	2	3	3
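The depth selection behind Table 2 amounts to an exhaustive search over the candidate layer counts. The sketch below illustrates that loop under stated assumptions: `evaluate` is a hypothetical callable that trains a model with the given number of layers and returns its 1NN classification accuracy, and ties are resolved in favor of the shallower network, a choice the text does not specify.

```python
def select_depth(evaluate, depths=range(1, 11)):
    """Return (best_depth, best_accuracy) over the candidate depths.

    `evaluate` is a hypothetical callable: depth -> classification accuracy.
    The strict comparison keeps the shallowest depth when accuracies tie.
    """
    best_depth, best_accuracy = None, float("-inf")
    for depth in depths:
        accuracy = evaluate(depth)
        if accuracy > best_accuracy:
            best_depth, best_accuracy = depth, accuracy
    return best_depth, best_accuracy
```

Running this once per method and dataset would yield one entry of a table like Table 2.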

Table 3.

1NN classification accuracy based on different feature extraction methods combined with DTW.

Dataset	Autoencoders	TCN	RNN Encoders	LSTM	VAE	PAA	Local Extrema	Transformer
AWR	0.98	0.75	0.77	0.70	0.98	0.98	0.7	0.98
AF	0.33	0.53	0.80	0.40	0.53	0.47	0.6	0.53
BM	0.82	0.95	0.62	0.65	0.70	0.75	0.8	0.9
Cricket	0.96	0.75	0.65	0.58	0.96	0.96	0.89	0.96
EC	0.32	0.30	0.29	0.30	0.33	0.34	0.27	0.32
FM	0.53	0.62	0.53	0.55	0.54	0.51	0.5	0.63
HMD	0.36	0.32	0.36	0.39	0.38	0.36	0.32	0.38
Handwrit	0.36	0.23	0.20	0.16	0.29	0.32	0.12	0.36
Libras	0.80	0.73	0.77	0.75	0.67	0.81	0.51	0.83
LSST	0.47	0.4	0.26	0.40	0.08	0.49	0.48	0.42
NATOPS	0.84	0.68	0.79	0.62	0.88	0.83	0.54	0.84
RS	0.82	0.63	0.59	0.63	0.83	0.8	0.60	0.84
SRSCP1	0.80	0.77	0.82	0.75	0.78	0.74	0.72	0.87
SRSCP2	0.52	0.56	0.53	0.54	0.53	0.49	0.48	0.59
SWJ	0.27	0.67	0.47	0.60	0.47	0.6	0.4	0.67
UWGL	0.88	0.71	0.86	0.74	0.86	0.89	0.23	0.89
Win	3	2	1	1	3	4	0	10
Average accuracy	0.628	0.6	0.582	0.545	0.613	0.6462	0.51	0.6881

Across the 16 datasets (dataset names are abbreviated in the table), the Transformer combined with DTW achieved the highest classification accuracy in 10 cases, as shown in Table 3, indicating stronger feature extraction than the other methods. To check that this result is not an artifact of chance, we conducted a Wilcoxon signed-rank test on the classification error rates of the Transformer and of each comparative method; the detailed outcomes are presented in Table 4 and the test statistics are plotted in Figure 3. We chose the Wilcoxon signed-rank test because it suits small samples and non-normally distributed data, makes no parametric assumptions, and, being rank-based, is less sensitive to outliers when evaluating paired differences.

At a significance level of α = 0.05 with N = 16 datasets, the test rejected the null hypothesis for every comparison (autoencoders, TCN, RNN encoders, LSTM, VAE, PAA, and Local Extrema), indicating a significant difference in error rate between the Transformer-based approach and each of the other methods. Rejecting the null hypothesis establishes only that a difference exists, so the direction of the difference has to be checked separately. The distribution of error-rate differences in Figure 4 lies predominantly on the negative side (less than 0), that is, the Transformer has the lower error rate on most datasets, which suggests that it generally outperforms the other methods.

Table 4.

Results of Wilcoxon signed rank test.

Methods	Stat	P-Value	Hypothesis Result
Autoencoders	5.0	0.007602	Reject Null Hypothesis
TCN	5.0	0.001778	Reject Null Hypothesis
RNN Encoders	15.0	0.002090	Reject Null Hypothesis
LSTM	1.0	0.000031	Reject Null Hypothesis
VAE	5.5	0.005142	Reject Null Hypothesis
PAA	12.5	0.012016	Reject Null Hypothesis
Local Extrema	7.0	0.000290	Reject Null Hypothesis
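To make the test statistic concrete, the sketch below computes the Wilcoxon signed-rank statistic in plain Python for two columns of Table 3. It rests on assumptions that the text does not state: zero differences are discarded, tied magnitudes receive their average rank, and the statistic is the smaller of the positive and negative rank sums. Accuracies are entered as integer percentages to avoid floating-point ties, and the p-value is not computed.

```python
def wilcoxon_statistic(errors_a, errors_b):
    """Smaller of the positive and negative rank sums of paired differences."""
    # Assumption: pairs with zero difference are dropped before ranking.
    diffs = sorted((a - b for a, b in zip(errors_a, errors_b) if a != b), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        # Positions i..j-1 share one magnitude: assign the mean of ranks i+1..j.
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2
        i = j
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)


# Accuracy columns of Table 3 in percent, in the row order AWR ... UWGL.
transformer = [98, 53, 90, 96, 32, 63, 38, 36, 83, 42, 84, 84, 87, 59, 67, 89]
lstm = [70, 40, 65, 58, 30, 55, 39, 16, 75, 40, 62, 63, 75, 54, 60, 74]
tcn = [75, 53, 95, 75, 30, 62, 32, 23, 73, 40, 68, 63, 77, 56, 67, 71]


def error_rates(accuracies):
    """Convert percentage accuracies to percentage error rates."""
    return [100 - acc for acc in accuracies]


stat_lstm = wilcoxon_statistic(error_rates(transformer), error_rates(lstm))
stat_tcn = wilcoxon_statistic(error_rates(transformer), error_rates(tcn))
print(stat_lstm, stat_tcn)
```

Under these conventions the statistics for the LSTM and TCN comparisons come out as 1.0 and 5.0, in agreement with the corresponding rows of Table 4. A small statistic means that almost all of the rank mass lies on one side, here the side where the Transformer has the lower error rate.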

Figure 3.

Wilcoxon signed rank test stat result plot.

Figure 4.

Distribution of differences between transformer and other methods.

The validation above indicates that DTW similarity computation performs best when combined with Transformer feature extraction. For a more comprehensive evaluation, we compared our method against each of the other feature extraction methods individually. In Figure 5, each dataset is a point on a scatter plot of the accuracies of the two methods being compared, and a higher density of points in the lower-right region indicates superior accuracy for the corresponding method. The results show that our method outperforms autoencoders, TCN, RNN encoders, LSTM, VAE, PAA, and Local Extrema on most datasets.

Figure 5.

Comparison of 1NN classification results between Transformer and other feature extraction methods combined with DTW similarity.

For the selection of sliding windows, we iterate only through the dimensions of the series to determine the best-performing window size and step size based on the classification results. The results are summarized in Table 5.

Table 5.

Window and step size of the optimal results for classification of each dataset.

Dataset	Best window size	Best step size
AWR	4	1
AF	2	1
BM	3	1
Cricket	5	4
EC	3	1
FM	6	1
HMD	7	5
Handwrit	3	1
Libras	2	1
LSST	5	1
NATOPS	7	4
RS	6	2
SRSCP1	5	2
SRSCP2	6	1
SWJ	4	1
UWGL	3	1
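The effect of a given window size and step size on the input can be illustrated with a short sketch. It assumes a multivariate series is stored as a list of dimensions, each a list of values, and that every dimension is windowed independently with incomplete trailing windows discarded; the function names are hypothetical and the Transformer encoding of each window is omitted.

```python
def sliding_windows(values, window, step):
    """Return every full window of `window` points, advancing by `step`."""
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive")
    last_start = len(values) - window
    return [values[s:s + window] for s in range(0, last_start + 1, step)]


def window_multivariate(series, window, step):
    """Apply the same windowing to each dimension independently."""
    return [sliding_windows(dimension, window, step) for dimension in series]


# Toy series shaped like an AWR sample in Table 1: 9 dimensions, length 144.
toy = [[float(t + d) for t in range(144)] for d in range(9)]
windows = window_multivariate(toy, window=4, step=1)
print(len(windows), len(windows[0]), len(windows[0][0]))  # prints: 9 141 4
```

A series of length L yields (L - window) // step + 1 windows per dimension, so a larger step size shortens the feature sequence that is later compared with DTW.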

4.4. Similarity comparison

To evaluate the effectiveness of DTW in multivariate time series similarity measurement, we repeated the classification experiments with Euclidean distance and cosine similarity in place of DTW, keeping the Transformer-based feature extraction unchanged. This allows a direct comparison of DTW with the two traditional measures in terms of their impact on classification performance. The final results are presented in Table 6.

Table 6.
Table of comparison results for the three distance classifications.

Dataset	Euclidean	Cosine	DTW
AWR	0.98	0.98	0.98
AF	0.4	0.4	0.53
BM	0.9	1	0.9
Cricket	0.96	0.94	0.96
EC	0.32	0.29	0.32
FM	0.63	0.57	0.63
HMD	0.38	0.36	0.38
Handwrit	0.35	0.35	0.36
Libras	0.83	0.81	0.83
LSST	0.42	0.4	0.42
NATOPS	0.86	0.86	0.84
RS	0.84	0.8	0.84
SRSCP1	0.87	0.91	0.87
SRSCP2	0.63	0.58	0.59
SWJ	0.6	0.67	0.67
UWGL	0.89	0.88	0.89
Average accuracy	0.6787	0.675	0.6881

Figure 6.

Comparison chart of the three distance classifications.

Figure 7.

Comparison of classification accuracy under three models.

Figure 6 shows that the three measures give relatively close classification results, with DTW achieving the highest average accuracy (0.6881, versus 0.6787 for Euclidean distance and 0.675 for cosine similarity; Table 6). Furthermore, the average accuracy of each of the three measures with Transformer-based feature extraction exceeds the averages obtained with the other feature extraction methods in Table 3, further substantiating the effectiveness of the Transformer in feature extraction.
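For reference, the two traditional measures can be written in a few lines. The sketch below assumes that the extracted features of each sample have been flattened into a single vector of equal length, and it converts cosine similarity into a distance as 1 minus the similarity so that smaller values mean more similar samples in both cases; the text does not state how either point was handled in the experiments.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)
```

Both functions compare the vectors position by position, so a pattern that is shifted in time contributes to the distance, whereas DTW can absorb such shifts by warping the alignment.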

4.5. Ablation experiment

To assess the contribution of each key component in the proposed framework, we conducted a series of ablation experiments designed to isolate and evaluate the impact of the sliding window mechanism, the Transformer-based feature extraction module, and the integration with Dynamic Time Warping (DTW) for similarity measurement. The study compares three experimental settings:

(1) No Transformer (Windowed Features + DTW): In this setting, the raw windowed features are directly used for similarity measurement with DTW, without passing through the Transformer encoder. This configuration assesses the baseline value of windowed feature extraction alone.
(2) No Sliding Window (Direct Transformer + DTW): Here, the entire multivariate time series is directly input into the Transformer encoder without any sliding window segmentation, and the extracted global features are compared using DTW. This setup examines the impact of removing localized temporal segmentation.
(3) Sliding Window + Transformer + DTW (Full Model): This is the complete proposed framework, where multi-scale sliding window features are fed into the Transformer encoder, and the resulting representations are used with DTW for similarity measurement.

The results of the ablation experiments, summarized in Figures 7–10, reveal the critical importance of integrating all components. Specifically, removing the Transformer leads to a notable decline in performance, highlighting the essential role of deep feature extraction in modeling complex temporal dependencies and cross-variable interactions. When the sliding window component is removed, the model's ability to capture fine-grained local patterns diminishes, resulting in lower classification accuracy. Finally, the full model—combining sliding windows, Transformer-based feature extraction, and DTW—consistently achieves the highest accuracy and robustness, confirming that the interplay between local and global features is key to precise multivariate time series similarity measurement.

Figure 8.
Comparison of classification precision under three models.

Figure 9.
Recall comparison under three models.

Figure 10.
Comparison of F1 scores under three models.

The ablation study results reveal a clear and consistent trend across all evaluated datasets: there is a strong positive correlation between classification accuracy and the other three performance metrics—precision, recall, and F1 score. Specifically, datasets where the proposed full model achieves high accuracy also consistently exhibit elevated precision, recall, and F1 scores, indicating that improvements in overall correctness are accompanied by balanced and robust detection of both positive and negative classes.
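The four metrics can be computed from the true and predicted labels as sketched below. Since the datasets are multi-class and the text does not say how precision, recall, and F1 are aggregated across classes, the sketch assumes macro averaging (an unweighted mean of the per-class scores); `macro_scores` is a hypothetical helper.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Undefined ratios (no predictions or no true samples) count as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    n = len(labels)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

With balanced classes, macro recall equals accuracy, which is one reason the metrics tend to move together; on imbalanced datasets they can diverge.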

For instance, in datasets such as BM, Cricket, NATOPS, and AWR, the full model not only achieves the highest accuracy but simultaneously yields superior precision, indicating fewer false positives; higher recall, reflecting the model's ability to detect most true positives; and elevated F1 scores, demonstrating a well-balanced trade-off between precision and recall. This pattern suggests that, under the proposed framework, accuracy improvements are not achieved at the cost of sacrificing the quality of positive class identification or error control but instead reflect genuine model improvements across multiple dimensions.

Moreover, comparative analysis across ablation settings shows that when the accuracy drops (as seen in No Transformer or No Sliding Window configurations), the corresponding precision, recall, and F1 scores also decline, reinforcing the observed correlation. This alignment indicates that in the evaluated multivariate time series datasets, accuracy serves as a reliable summary indicator of overall model performance, reflecting not only correct predictions but also balanced precision-recall behavior.

These findings highlight the effectiveness of the proposed sliding window + Transformer + DTW framework in delivering improvements that are not isolated to a single metric but extend comprehensively across all critical evaluation dimensions.

5. Conclusion

In this study, we proposed a novel approach for multivariate time series similarity measurement, integrating sliding window input, Transformer-based feature extraction, and DTW similarity metrics. By incorporating the sliding window technique, we effectively decomposed multivariate time series into multiple subsequences, facilitating the capture of inter-dimensional features within the time series. The robust feature extraction capabilities of the Transformer model enabled us to identify complex patterns and long-range dependencies in the time series, thereby enhancing the quality of feature representation. Finally, DTW was employed as the similarity measurement method, ensuring high accuracy even under temporal distortions, particularly when handling nonlinear variations and asynchronous time scales. The results of the 1-NN classification experiments demonstrate that using sliding windows for subsequence segmentation effectively captures local variations in the time series, while Transformer-based feature extraction strengthens the understanding of global structures, thereby improving feature extraction performance. As a similarity measure, DTW excels in aligning time series, providing more precise distance metrics between series of differing forms. Compared to traditional measures such as Euclidean distance and cosine similarity, DTW exhibits superior adaptability and robustness in time series classification tasks.

In summary, the proposed method demonstrates outstanding performance in multivariate time series similarity measurement, particularly in handling complex and morphologically diverse time series. It not only improves measurement accuracy but also offers a new perspective on time series similarity analysis, opening new avenues for the application of Transformer models in time series analysis. Future work can explore the integration of additional feature extraction methods and distance measurement strategies to further enhance the model's performance and address more complex real-world challenges.

Footnotes

Acknowledgements

This research work is supported by the National Natural Science Foundation of China (Grant Nos. 62192751 and 61425027).

ORCID iDs

Aiping Pang

Qianchuan Zhao

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research work is supported by the National Natural Science Foundation of China (Grant Nos. 62192751 and 61425027).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Ruiz, Flynn, Large, et al. The great multivariate time series classification bake off: a review and experimental evaluation of recent algorithmic advances[J]. Data Min Knowl Discov 2021; 35: 401–449.
2. Hamilton. Time series analysis[M]. Princeton, NJ: Princeton University Press, 2020.
3. Zhao, Gao, et al. A similarity measurement for time series and its application to the stock market[J]. Expert Syst Appl 2021; 182: 115217.
4. Garg. Multi-variate time series similarity measures and their robustness against temporal asynchrony[M]. Tempe, AZ: Arizona State University, 2015.
5. Wei WWS. Multivariate time series analysis and applications[M]. Hoboken, NJ: John Wiley & Sons, 2019.
6. Malkauthekar. Analysis of Euclidean distance and manhattan distance measure in face recognition[C]. Third International Conference on Computational Intelligence and Information Technology (CIIT 2013). IET, 2013: 503–507.
7. Zermi, Khaldi, Kafi, et al. A DWT-SVD based robust digital watermarking for medical image security[J]. Forensic Sci Int 2021; 320: 110691.
8. Boots, Getls. Point pattern analysis[J]. Geogr Anal 2020; 52: 543–561.
9. Rabiner, Schmidt. Application of dynamic time warping to connected digit recognition[J]. IEEE Trans Acoust Speech Signal Process 1980; 28: 377–388.
10. Liu, Zheng, et al. A survey of trajectory distance measures and performance evaluation[J]. VLDB J 2020; 29: 3–32.
11. Barandas, Folgado, Fernandes, et al. TSFEL: time series feature extraction library[J]. SoftwareX 2020; 11: 100456.
12. Fan, Zhu, Zhang. Multivariate time series feature extraction and clustering framework for multi-function radar work mode recognition[J]. Electronics (Basel) 2024; 13: 1412.
13. Maleki, Manshouri, Kayikçioğlu. Fast and accurate classifier-based brain-computer interface system using single channel EEG data[C]. 2018 26th Signal Processing and Communications Applications Conference (SIU). IEEE, 2018: 1–4.
14. Manshouri, Kayıkçıoğlu. Classification of 2D and 3D videos based on EEG waves[C]. 2016 24th Signal Processing and Communication Application Conference (SIU). IEEE, 2016: 949–952.
15. Manshouri, Kayikcioglu. A comprehensive analysis of 2D&3D video watching of EEG signals by increasing PLSR and SVM classification results[J]. Comput J 2020; 63: 425–434.
16. Chiarot, Silvestri. Time series compression survey[J]. ACM Comput Surv 2023; 55: 1–32.
17. Islam, Lima, Das, et al. A comprehensive survey on the process, methods, evaluation, and challenges of feature selection[J]. IEEE Access 2022; 10: 99595–99632.
18. Schmidl, Wenig, Papenbrock. Anomaly detection in time series: a comprehensive evaluation[J]. Proceedings of the VLDB Endowment 2022; 15: 1779–1797.
19. Kaur, Parmar, Singh. Autoregressive models in environmental forecasting time series: a theoretical and application review[J]. Environmental Science and Pollution Research 2023; 30: 19617–19641.
20. Dong, Guo, Reichgelt, et al. Predictive power of ARIMA models in forecasting equity returns: a sliding window method[J]. Journal of Asset Management 2020; 21: 549–566.
21. Mor, Garhwal, Kumar. A systematic review of hidden Markov models and their applications[J]. Arch Comput Methods Eng 2021; 28: 1429–1448.
22. Chatfield, Xing. The analysis of time series: an introduction with R[M]. Boca Raton, FL: Chapman and Hall/CRC, 2019.
23. Prado, West. Time series: modeling, computation, and inference[M]. Boca Raton, FL: Chapman and Hall/CRC, 2010.
24. Reddy, MPK, Lakshmanna, et al. Analysis of dimensionality reduction techniques on big data[J]. IEEE Access 2020; 8: 54776–54788.
25. Kurita. Principal component analysis (PCA)[M]. In: Computer vision: a reference guide. Cham: Springer International Publishing, 2021: 1013–1016.
26. Dara, Tumma. Feature extraction by using deep learning: a survey[C]. 2018 Second international conference on electronics, communication and aerospace technology (ICECA). IEEE, 2018: 1795–1801.
27. Bank, Koenigstein, Giryes. Autoencoders[J]. In: Machine learning for data science handbook: data mining and knowledge discovery handbook. Cham, Switzerland: Springer, 2023: 353–374.
28. Lea, Flynn M D, Vidal, et al. Temporal convolutional networks for action segmentation and detection[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 156–165.
29. Schmidt. Recurrent neural networks (rnns): A gentle introduction and overview[J]. arXiv preprint 2019; arXiv:1912.05911: 1–18.
30. et al. A review of recurrent neural networks: LSTM cells and network architectures[J]. Neural Comput 2019; 31: 1235–1270.
31. Wen, Zhou, Zhang, et al. Transformers in time series: a survey[J]. arXiv preprint 2022; arXiv:2202.07125: 1–35.
32. Lin, Wang, Liu, et al. A survey of transformers[J]. AI open 2022; 3: 111–132.
33. Zerveas, Jayaraman, Patel, et al. A transformer-based framework for multivariate time series representation learning[C]. Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 2021: 2114–2124.
34. Zhou, Zhang, Peng, et al. Informer: beyond efficient transformer for long sequence time-series forecasting[C]. Proceedings of the AAAI Conference on Artificial Intelligence 2021; 35: 11106–11115.
35. Liu, Ren, et al. Gated transformer networks for multivariate time series classification[J]. arXiv preprint 2021; arXiv:2103.14438.
36. Tuli, Casale, Jennings. Tranad: Deep transformer networks for anomaly detection in multivariate time series data[J]. arXiv preprint 2022; arXiv:2201.07284: 1–10.
37. Metsis, Wang, et al. Tts-gan: a transformer-based time-series generative adversarial network[C]. International conference on artificial intelligence in medicine. Cham: Springer International Publishing, 2022: 133–143.
38. Zeng, Chen, Zhang, et al. Are transformers effective for time series forecasting?[C]. Proceedings of the AAAI Conference on Artificial Intelligence 2023; 37: 11121–11128.
39. Sakoe, Chiba. Dynamic programming algorithm optimization for spoken word recognition[J]. IEEE Trans Acoust Speech Signal Process 1978; 26: 43–49.
40. Ashish. Attention is all you need[J]. Adv Neural Inf Process Syst 2017; 30: I.
41. The UEA multivariate time series classification archive, 2018.