The influence of the Sakoe–Chiba band size on time series classification

Abstract

A key component of many time series classification methods is an appropriate dissimilarity measure. Elastic measures such as DTW, LCS, ERP and EDR have long been known in the time series community. They have flourished, particularly in the last decade, and have been applied to many real problems in a variety of fields. All of these measures work in the time domain and compensate for possible localized misalignment through some elastic adaptation. However, because they have quadratic time complexity, global constraints are often used to speed up computation. Apart from this advantage, simulations have shown that constrained measures can also give better classification results than unconstrained ones.

In this paper, our aim is to verify experimentally the effects of the Sakoe–Chiba band on classification (by the 1NN method) with the above-mentioned measures. Using rigorous statistical analysis we demonstrate that it is possible to find global constraints which lead to improvement of the classification accuracy for all methods. Additionally, for each measure we suggest the best values of the parameter r (the size of the band).

Keywords

Time series classification, global constraints, elastic measures, dynamic time warping, longest common subsequence, edit distance with real penalty, edit distance on real sequence, Sakoe–Chiba band, derivative dynamic time warping

1 Introduction

A key element in dealing with time series is the use of an appropriate dissimilarity measure. There exist a large number of elastic dissimilarity measures used for time series analysis. The most popular are Dynamic Time Warping (DTW), Longest Common Subsequence (LCS), Edit Distance with Real Penalty (ERP), Edit Distance on Real Sequence (EDR) and Derivative Dynamic Time Warping (DDTW).

The implementation of each of these measures relies on dynamic programming, which is quite slow and becomes a serious limitation when dealing with very large data. To improve the calculation time, various methods have been developed. It seems that, because of its simplicity, the most popular is the Sakoe–Chiba band [30]. This constraint narrows the warping window around the diagonal using a single parameter R (Figure 1). The Sakoe–Chiba band is classified as a global constraint. Another common constraint of this kind is the Itakura parallelogram [19]. Global constraints are different from local constraints, because they do not provide local restrictions [21]. Ref. [23] reported that appropriately selected global constraints can significantly reduce the computation time of DTW and LCS. Additionally they noted that constrained variants of the above measures are qualitatively different from their unconstrained equivalents. Apart from reducing the computation time, the use of some constraints can lead to improved accuracy of classification with a nearest neighbor classifier, compared with unconstrained measures [2, 33]. Other recent papers devoted to the comparison of constrained dissimilarity measures are [10, 24].

Fig.1

Time series alignment and the corresponding warping path.

The remainder of the paper is organized as follows. Firstly, we give an overview of the dissimilarity measures used (Section 2). The data sets used are described at the beginning of Section 3. Later in that section we describe the experimental setup. The analysis of results is illustrated with tables and figures in Section 4. The same section contains the results of rigorous statistical analysis. We conclude the paper in Section 5.

2 Methodology

2.1 Classifier

The most popular time series classification method involves the use of a nearest-neighbor-based classifier with various distance measures [34]. The most popular specific method is 1NN-DTW, a special case of the k-nearest neighbor classifier with k = 1 and the DTW distance, which demonstrates superior performance for time series classification [4, 31].

2.2 Dynamic time warping (DTW)

The Dynamic Time Warping distance measure (DTW) is a very popular and efficient dissimilarity measure for time series [6]. To compute the DTW value for two one-dimensional time series of length n ∈ N, $x = \{x(i) : i = 1, 2, \dots, n\}, \quad y = \{y(i) : i = 1, 2, \dots, n\},$ we use the following procedure. We define a local cost function d. This is a real-valued function of two real variables which computes a distance between two points x (i) and y (j) of the time series x and y. For the standard DTW we usually define it as $d(x(i), y(j)) = (x(i) - y(j))^{2}.$ (1)

Then we construct a square matrix D with dimensions n × n containing values of the local cost function D (i, j) = d (x (i) , y (j)). The matrix element D (i, j) corresponds to the alignment between values x (i) and y (j) of the time series. Then we construct a warping path W = {w₁, w₂, …, w_K} (K ∈ N) of elements of the matrix D. For classical DTW the warping path is required to meet the following criteria:

w₁ = D (1, 1) and w_K = D (n, n) (boundary conditions),

if w_k = D (i_k, j_k) and w_{k+1} = D (i_{k+1}, j_{k+1}) then i_{k+1} - i_k ≤ 1 and j_{k+1} - j_k ≤ 1 (continuity),

i_{k+1} - i_k ≥ 0 and j_{k+1} - j_k ≥ 0 (monotonicity).

To obtain a warping path we start from the element D (1, 1) and, moving each index forward by at most one at every step, we end at the element D (n, n) (Figure 1). The warping path with the minimal total cost gives the value of DTW: $\mathrm{DTW}(x, y) = \sqrt{\min_{W} \sum_{k=1}^{K} w_{k}}.$

In practice we often compute DTW by building a cumulative distance matrix Γ. We can use dynamic programming with the following recurrence: $\Gamma(i, j) = D(i, j) + \min\{\Gamma(i-1, j-1), \Gamma(i-1, j), \Gamma(i, j-1)\}$ with initial conditions: $\Gamma(0, 0) = 0, \quad \Gamma(0, i) = \infty, \quad \Gamma(i, 0) = \infty \quad (i = 1, 2, \dots, n).$ Then we obtain the value of DTW at position (n, n) of the matrix Γ (usually the square root of the value): $\mathrm{DTW}(x, y) = \sqrt{\Gamma(n, n)}.$

The DTW distance measure does not satisfy the triangle inequality, so it is not a metric. It does, however, vanish on identical series and it is symmetric: DTW (x, x) = 0 and DTW (x, y) = DTW (y, x) for the cost function (1).
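The recurrence for Γ and the failure of the triangle inequality can be checked on a small case. The following is a minimal Python sketch, not the authors' implementation; the function name `dtw` and the three toy series are assumptions introduced for illustration. It also accepts series of unequal length, which the text does not discuss.

```python
import math


def dtw(x, y):
    """Unconstrained DTW with the squared local cost of Eq. (1)."""
    n, m = len(x), len(y)
    # Row 0 and column 0 hold the initial conditions:
    # gamma[0][0] = 0 and infinity elsewhere on the border.
    gamma = [[math.inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            gamma[i][j] = cost + min(
                gamma[i - 1][j - 1], gamma[i - 1][j], gamma[i][j - 1]
            )
    return math.sqrt(gamma[n][m])


# Three hand-picked toy series (hypothetical data, for illustration only).
a = [0.0, 1.0, 1.0]
b = [0.0, 0.0, 1.0]
c = [0.0, 0.0, 0.0]

d_ab, d_bc, d_ac = dtw(a, b), dtw(b, c), dtw(a, c)
# d_ab = 0, d_bc = 1 and d_ac = sqrt(2), so d_ac > d_ab + d_bc.
triangle_violated = d_ac > d_ab + d_bc
```

Here a and b are warped versions of each other, so their distance is zero, yet their distances to c differ, which is exactly what a metric would forbid.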

2.3 Elastic measures with global constraints

To decrease the computation time and to increase the accuracy of the classification we can use global constraints for distance measures. One of the most popular for DTW and other elastic measures is the Sakoe–Chiba band [30] with radius R (Figure 1). The same technique can be used for the other elastic dissimilarity measures presented below.

2.3.1 DTW

We will use an integer percentage radius r ∈ [0%, 100%] with a step of 1%. To obtain the absolute radius R, which depends on the length n of a time series, we compute $R = \mathrm{round}\left(\frac{r}{100} (n - 1)\right),$ where round takes the nearest integer, so that R ∈ {0, 1, 2, …, n - 1}. We denote $j_{\min}(i) = \max(1, i - R), \quad j_{\max}(i) = \min(n, i + R)$ for i = 1, 2, …, n and we compute Γ (i, j) only for $i = 1, 2, \dots, n, \quad j = j_{\min}(i), j_{\min}(i) + 1, \dots, j_{\max}(i).$

To compute DTW we usually use dynamic programming with the following recursion: $\begin{aligned} D(i, j) &= (x_{i} - y_{j})^{2}, \\ \Gamma(i, j) &= D(i, j) + \min\{\Gamma(i-1, j-1), \Gamma(i-1, j), \Gamma(i, j-1)\} \end{aligned}$ with initial conditions (Figure 2): $\begin{aligned} \Gamma(0, 0) &= 0, \\ \Gamma(0, j) &= \infty, \quad j = 1, 2, \dots, R + 1, \\ \Gamma(i, 0) &= \infty, \quad i = 1, 2, \dots, R + 1, \\ \Gamma(i, j_{\min}(i) - 1) &= \infty, \quad i = R + 2, R + 3, \dots, n, \\ \Gamma(i, j_{\max}(i) + 1) &= \infty, \quad i = 1, 2, \dots, n - R - 1. \end{aligned}$

Fig.2

The matrix Γ with initial conditions for DTW.

Then the value of DTW is $\mathrm{DTW}(x, y) = \sqrt{\Gamma(n, n)}.$
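The banded computation can be sketched as follows. This is an illustrative Python version under stated assumptions: the names `absolute_radius` and `dtw_band` are hypothetical, r is assumed to be an integer percentage, and halves are rounded up using integer arithmetic (the text only says "nearest integer"). Cells outside the band are left at infinity, which plays the role of the boundary conditions on j_min(i) - 1 and j_max(i) + 1.

```python
import math


def absolute_radius(r_percent, n):
    """R = round((r / 100) * (n - 1)) for an integer percentage r."""
    # floor(r (n - 1) / 100 + 1/2), computed without floating point.
    return (2 * r_percent * (n - 1) + 100) // 200


def dtw_band(x, y, r_percent):
    """DTW restricted to a Sakoe-Chiba band of percentage radius r."""
    n = len(x)
    if len(y) != n:
        raise ValueError("this sketch assumes series of equal length")
    R = absolute_radius(r_percent, n)
    gamma = [[math.inf] * (n + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        j_min = max(1, i - R)
        j_max = min(n, i + R)
        for j in range(j_min, j_max + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            gamma[i][j] = cost + min(
                gamma[i - 1][j - 1], gamma[i - 1][j], gamma[i][j - 1]
            )
    return math.sqrt(gamma[n][n])


# Toy series (hypothetical data): y is a warped copy of x.
x = [0.0, 1.0, 1.0]
y = [0.0, 0.0, 1.0]
d_diagonal = dtw_band(x, y, 0)    # R = 0: only the diagonal is allowed
d_full = dtw_band(x, y, 100)      # R = n - 1: unconstrained DTW
```

With r = 0 the measure reduces to the Euclidean distance and gives 1 for this pair, while the full window recovers the perfect warping and gives 0.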

2.3.2 LCS

The Longest Common Subsequence (LCS) distance measure is based on the edit distance used for string comparison. It is computed from the longest matching subsequence [32]. The strict string condition x_i = y_j is weakened for time series to |x_i - y_j| ≤ ɛ with parameter 0 < ɛ < 1. To compute LCS we usually use dynamic programming with the following recursion: $\Gamma(i, j) = \begin{cases} 1 + \Gamma(i-1, j-1) & \text{if } |x_{i} - y_{j}| \leq \varepsilon, \\ \max\{\Gamma(i-1, j), \Gamma(i, j-1)\} & \text{if } |x_{i} - y_{j}| > \varepsilon \end{cases}$ with initial conditions: $\begin{aligned} \Gamma(0, 0) &= 0, \\ \Gamma(0, j) &= 0, \quad j = 1, 2, \dots, R + 1, \\ \Gamma(i, 0) &= 0, \quad i = 1, 2, \dots, R + 1, \\ \Gamma(i, j_{\min}(i) - 1) &= 0, \quad i = R + 2, R + 3, \dots, n, \\ \Gamma(i, j_{\max}(i) + 1) &= 0, \quad i = 1, 2, \dots, n - R - 1. \end{aligned}$

Then the value of LCS is $\mathrm{LCS}(x, y) = 1 - \frac{\Gamma(n, n)}{n}.$
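A minimal Python sketch of the banded LCS recursion is given below. The function name `lcs_band` and the toy series are assumptions for illustration; the default ɛ = 0.25 is the value used later in the experiments, and the percentage radius is converted to R with halves rounded up.

```python
def lcs_band(x, y, r_percent, eps=0.25):
    """LCS dissimilarity inside a Sakoe-Chiba band of percentage radius r."""
    n = len(x)
    R = (2 * r_percent * (n - 1) + 100) // 200  # nearest-integer rounding
    # All border and out-of-band cells are 0, as in the initial conditions.
    gamma = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(max(1, i - R), min(n, i + R) + 1):
            if abs(x[i - 1] - y[j - 1]) <= eps:
                gamma[i][j] = 1 + gamma[i - 1][j - 1]
            else:
                gamma[i][j] = max(gamma[i - 1][j], gamma[i][j - 1])
    return 1.0 - gamma[n][n] / n


# Toy series (hypothetical data): y is x shifted by one position.
x = [0.0, 1.0, 2.0, 3.0]
y = [9.0, 0.0, 1.0, 2.0]
lcs_diagonal = lcs_band(x, y, 0)   # no shift allowed: nothing matches
lcs_full = lcs_band(x, y, 100)     # common subsequence 0, 1, 2 is found
```

In this example a band of radius R = 1 (for instance r = 33) is already wide enough to find the shifted match, so it returns the same value as the full window.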

2.3.3 ERP

The Edit Distance with Real Penalty (ERP) uses the L₁ distance between elements of time series as the penalty for local shifting of time series [7, 8]. ERP is used in many areas of classification and clustering of time series data [27, 35]. To compute ERP we usually use dynamic programming with the following recursion: $\begin{aligned} D_{1}(i, j) &= \Gamma(i-1, j) + |x_{i}|, \\ D_{2}(i, j) &= \Gamma(i, j-1) + |y_{j}|, \\ D_{12}(i, j) &= \Gamma(i-1, j-1) + |x_{i} - y_{j}|, \\ \Gamma(i, j) &= \min\{D_{1}(i, j), D_{2}(i, j), D_{12}(i, j)\} \end{aligned}$ with initial conditions: $\begin{aligned} \Gamma(0, 0) &= 0, \\ \Gamma(0, j) &= \Gamma(0, j-1) + |y_{j}|, \quad j = 1, 2, \dots, R + 1, \\ \Gamma(i, 0) &= \Gamma(i-1, 0) + |x_{i}|, \quad i = 1, 2, \dots, R + 1, \\ \Gamma(i, j_{\min}(i) - 1) &= \infty, \quad i = R + 2, R + 3, \dots, n, \\ \Gamma(i, j_{\max}(i) + 1) &= \infty, \quad i = 1, 2, \dots, n - R - 1. \end{aligned}$ Then the value of ERP is $\mathrm{ERP}(x, y) = \Gamma(n, n).$
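The ERP recursion differs from DTW mainly in its border values, which accumulate the gap penalties instead of being infinite. The following Python sketch follows the initial conditions as stated above; the name `erp_band` and the toy series are assumptions, and the border is filled only up to index R + 1 (capped at n).

```python
import math


def erp_band(x, y, r_percent):
    """ERP inside a Sakoe-Chiba band of percentage radius r."""
    n = len(x)
    R = (2 * r_percent * (n - 1) + 100) // 200  # nearest-integer rounding
    gamma = [[math.inf] * (n + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    # Border cells accumulate |y_j| and |x_i| up to index R + 1.
    for k in range(1, min(n, R + 1) + 1):
        gamma[0][k] = gamma[0][k - 1] + abs(y[k - 1])
        gamma[k][0] = gamma[k - 1][0] + abs(x[k - 1])
    for i in range(1, n + 1):
        for j in range(max(1, i - R), min(n, i + R) + 1):
            d1 = gamma[i - 1][j] + abs(x[i - 1])
            d2 = gamma[i][j - 1] + abs(y[j - 1])
            d12 = gamma[i - 1][j - 1] + abs(x[i - 1] - y[j - 1])
            gamma[i][j] = min(d1, d2, d12)
    return gamma[n][n]


# Toy series (hypothetical data): the same pattern 1, 2 at different offsets.
x = [0.0, 1.0, 2.0]
y = [1.0, 2.0, 0.0]
erp_diagonal = erp_band(x, y, 0)   # 1 + 1 + 2 along the diagonal
erp_full = erp_band(x, y, 100)     # zeros are skipped at no cost
```

Because a gap costs the absolute value of the skipped element, skipping the zeros is free here and the unconstrained distance is 0, while the diagonal-only band gives 4.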

2.3.4 EDR

The Edit Distance on Real sequence (EDR) is a variation of the edit distance that finds the minimal number of edit operations to convert one time series to another [1, 8]. The strict condition x_i = y_j is weakened for time series to |x_i - y_j| ≤ ɛ with parameter 0 < ɛ < 1. To compute EDR we usually use dynamic programming with the following recursion: $\begin{aligned} D_{1}(i, j) &= \Gamma(i-1, j) + 1, \\ D_{2}(i, j) &= \Gamma(i, j-1) + 1, \\ D_{12}(i, j) &= \Gamma(i-1, j-1) + \begin{cases} 0 & \text{if } |x_{i} - y_{j}| \leq \varepsilon, \\ 1 & \text{if } |x_{i} - y_{j}| > \varepsilon, \end{cases} \\ \Gamma(i, j) &= \min\{D_{1}(i, j), D_{2}(i, j), D_{12}(i, j)\} \end{aligned}$ with initial conditions: $\begin{aligned} \Gamma(0, 0) &= 0, \\ \Gamma(0, j) &= j, \quad j = 1, 2, \dots, R + 1, \\ \Gamma(i, 0) &= i, \quad i = 1, 2, \dots, R + 1, \\ \Gamma(i, j_{\min}(i) - 1) &= \infty, \quad i = R + 2, R + 3, \dots, n, \\ \Gamma(i, j_{\max}(i) + 1) &= \infty, \quad i = 1, 2, \dots, n - R - 1. \end{aligned}$

Then the value of EDR is $\mathrm{EDR}(x, y) = \Gamma(n, n).$
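The EDR recursion can be sketched in the same way. In this illustrative Python version the name `edr_band` and the toy series are assumptions, and the default ɛ = 0.25 is the value fixed in the experimental setup.

```python
import math


def edr_band(x, y, r_percent, eps=0.25):
    """EDR inside a Sakoe-Chiba band of percentage radius r."""
    n = len(x)
    R = (2 * r_percent * (n - 1) + 100) // 200  # nearest-integer rounding
    gamma = [[math.inf] * (n + 1) for _ in range(n + 1)]
    gamma[0][0] = 0
    # Border cells count the insertions or deletions made so far.
    for k in range(1, min(n, R + 1) + 1):
        gamma[0][k] = k
        gamma[k][0] = k
    for i in range(1, n + 1):
        for j in range(max(1, i - R), min(n, i + R) + 1):
            mismatch = 0 if abs(x[i - 1] - y[j - 1]) <= eps else 1
            gamma[i][j] = min(
                gamma[i - 1][j] + 1,
                gamma[i][j - 1] + 1,
                gamma[i - 1][j - 1] + mismatch,
            )
    return gamma[n][n]


# Toy series (hypothetical data): y is x shifted by one position.
x = [0.0, 1.0, 2.0, 3.0]
y = [9.0, 0.0, 1.0, 2.0]
edr_diagonal = edr_band(x, y, 0)   # four substitutions on the diagonal
edr_full = edr_band(x, y, 100)     # one insertion plus one deletion
```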

2.4 Distance measures with derivatives

We will also examine two distance measures based on derivatives with global constraints of radius r.

2.4.1 DDTW

Derivative Dynamic Time Warping [20] is the DTW distance measure computed on the data transformed by the first (discrete) derivative: $x'(i) = x(i + 1) - x(i), \quad i = 1, 2, \dots, n - 1.$

Then we can simply compute the value of DDTW with radius r by: $\mathrm{DDTW}(x, y) = \mathrm{DTW}(x', y'),$ where DTW is the constrained DTW described in Section 2.3.1.

2.4.2 DD_DTW

Parametric Derivative Dynamic Time Warping was introduced in [16] and examined in several papers [15, 25]. It is a convex combination of the distances DTW and DDTW: $\mathrm{DD}_{\mathrm{DTW}}(x, y) = (1 - \alpha)\, \mathrm{DTW}(x, y) + \alpha\, \mathrm{DDTW}(x, y),$

where α ∈ [0, 1] is a parameter chosen in the learning phase of the classification. To obtain a constrained version of DD_DTW with radius r we use DTW and DDTW (Sections 2.3.1 and 2.4.1) with the same radius r.
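Both derivative-based measures reduce to calls of the constrained DTW, as the following Python sketch shows. The function names are hypothetical, and one detail is an assumption: for DDTW the absolute radius is recomputed from the length n - 1 of the differenced series, which the text does not specify.

```python
import math


def dtw(x, y, r_percent=100):
    """DTW restricted to a Sakoe-Chiba band given as an integer percentage."""
    n = len(x)
    R = (2 * r_percent * (n - 1) + 100) // 200  # nearest-integer rounding
    gamma = [[math.inf] * (n + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - R), min(n, i + R) + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            gamma[i][j] = cost + min(
                gamma[i - 1][j - 1], gamma[i - 1][j], gamma[i][j - 1]
            )
    return math.sqrt(gamma[n][n])


def derivative(x):
    """First discrete derivative x'(i) = x(i + 1) - x(i)."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def ddtw(x, y, r_percent=100):
    return dtw(derivative(x), derivative(y), r_percent)


def dd_dtw(x, y, alpha, r_percent=100):
    """Convex combination of DTW and DDTW with the same radius."""
    return (1.0 - alpha) * dtw(x, y, r_percent) + alpha * ddtw(x, y, r_percent)


# Toy series (hypothetical data): same shape, different level.
x = [0.0, 1.0, 2.0, 3.0]
y = [5.0, 6.0, 7.0, 8.0]
```

For these two series DTW is large because of the level shift, while DDTW is 0 because the derivatives coincide; α moves the combined distance between the two extremes.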

3 Data sets and experimental setup

We performed experiments on 85 data sets from the UCR Time Series Classification Archive [9]. This is a database with labeled time series data from a very broad range of fields, including medicine, finance, multimedia and engineering. Each data set from the database is split into training and testing subsets. All time series instances are z-normalized.

Constrained versions (with radius r) of the following elastic distance measures were used in the experiments: DTW, LCS, ERP, EDR, DDTW, and DD_DTW. For LCS and EDR the parameter ɛ was fixed at the value ɛ = 0.25.

For the classification process the nearest neighbor method (1NN) is used for all of the compared distance measures. We use leave-one-out cross-validation to find the best value of r on the training subset, where r ∈ [0, 100] with a step of 1. If there is more than one optimal value of r, we may choose the minimum, median or maximum of those values. These three methods of tie-breaking are analyzed in the results of the experiments. We study two different strategies for selecting the best value of the radius r. First, we try to fix the same optimal value of r for all data sets. Second, we try to select the best value for each data set separately, from 0 to a fixed r_max.

For the parametric distance DD_DTW a finite subset of values of the parameter α is considered, ranging from 0 to 1 with a fixed step size of 0.01. This means that, for a fixed r, we find the best α by leave-one-out cross-validation on the training subset.

4 Results

4.1 Methods of tie-breaking

We find a set of radii r with the same minimal error on the training subset. Then we examine three methods of tie-breaking by choosing the minimal, median, or maximal r and computing the error on the test subset.
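The selection of r by leave-one-out cross-validation and the three tie-breaking rules can be sketched as follows. All names here are hypothetical, `toy_dist` is a placeholder and not one of the measures studied in this paper, and taking the lower median for an even number of tied radii is an assumption, since the text does not say how that case is handled.

```python
import statistics


def loo_error(series, labels, dist):
    """Leave-one-out 1NN error rate for a distance callable dist(a, b)."""
    errors = 0
    for i, s in enumerate(series):
        others = (j for j in range(len(series)) if j != i)
        nearest = min(others, key=lambda j: dist(s, series[j]))
        errors += labels[nearest] != labels[i]
    return errors / len(series)


def loo_errors_by_radius(series, labels, dist_r, radii):
    """Map each radius r to its leave-one-out error; dist_r(a, b, r)."""
    return {
        r: loo_error(series, labels, lambda a, b: dist_r(a, b, r))
        for r in radii
    }


def tie_break(errors_by_r, method="minimum"):
    """Pick one radius among those with the minimal training error."""
    best = min(errors_by_r.values())
    tied = sorted(r for r, e in errors_by_r.items() if e == best)
    if method == "minimum":
        return tied[0]
    if method == "maximum":
        return tied[-1]
    if method == "median":
        return statistics.median_low(tied)  # assumption: lower median
    raise ValueError(f"unknown method: {method}")


def toy_dist(a, b, r):
    """Placeholder distance: Euclidean on the first r + 1 points only."""
    k = min(len(a), r + 1)
    return sum((p - q) ** 2 for p, q in zip(a[:k], b[:k])) ** 0.5


# Hypothetical training data: the classes differ only in the second point.
train = [[0, 0], [0, 9], [0, 1], [0, 8]]
labels = ["a", "b", "a", "b"]
errors = loo_errors_by_radius(train, labels, toy_dist, range(0, 5))
```

In this toy case every radius from 1 to 4 reaches zero training error, so the three rules return 1, 2 and 4 respectively.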

As a first step we created histograms of r (selected by the leave-one-out method on the training data set) for the different methods of tie-breaking and the different dissimilarity measures (Figures 3–8). We also include a non-parametric estimate of the density of r. Additionally, in each figure we give the probabilities that r is greater than 10% and 20%. For each dissimilarity measure we have four plots: one for each of the three methods of tie-breaking and one for the optimum value of the parameter r found on the test data set. We would like to find the tie-breaking method whose selected values of r best match these optimum values.

Fig.3

Histograms and densities of r for tie-breaking methods for DTW distance.

Fig.4

Histograms and densities of r for tie-breaking methods for LCS distance.

Fig.5

Histograms and densities of r for tie-breaking methods for ERP distance.

Fig.6

Histograms and densities of r for tie-breaking methods for EDR distance.

Fig.7

Histograms and densities of r for tie-breaking methods for DDTW distance.

Fig.8

Histograms and densities of r for tie-breaking methods for DD_DTW distance.

We can observe that for each dissimilarity measure the histogram of the “Minimum” method looks the most similar to that of the “Optimum” method. For the “Median” method of tie-breaking we notice a high peak in the middle of the distribution of r. This peak is not observed for the “Optimum” method. Similar behavior can be observed for the “Maximum” method. In this case the peak is at the end of the distribution. The probability P (r > 10) is less than 30% and P (r > 20) is less than 10% for the optimum value of r. The “Minimum” method approximates these probabilities quite well (it is definitely the best method) for each dissimilarity measure. Hence, looking at the densities we can recommend the “Minimum” method of tie-breaking. Additionally, it seems that r less than 20% is sufficient for most of the data sets.

Finally, to distinguish statistically between the methods of tie-breaking for each distance measure, we performed a detailed statistical comparison. We tested the null hypothesis that there are no differences between the tie-breaking methods. We used the Iman & Davenport test [18], which is a less conservative variant of Friedman’s repeated-measures ANOVA. This test is recommended by [11, 13] as the best test to compare several different classifiers. In this test we rank the classifiers for each data set separately. Let R_ij be the rank of the jth of K methods on the ith of N data sets and let $R_{j} = \frac{1}{N} \sum_{i=1}^{N} R_{ij}$. The test compares the mean ranks R₁, R₂, …, R_K of the classifiers, and is based on the statistic $S = \frac{(N - 1) S_{1}}{N (K - 1) - S_{1}},$ where $S_{1} = \frac{12 N}{K (K + 1)} \sum_{j=1}^{K} R_{j}^{2} - 3 N (K + 1)$ is the Friedman statistic. The statistic S is distributed according to the F-distribution with K - 1 and (K - 1) (N - 1) degrees of freedom. In our case N = 85 and K = 3 for each comparison. We give the p-values from the Iman & Davenport test in Table 1. The p-values are much higher than the usual significance level of 5%, so we have no evidence to reject the null hypothesis. Hence, for each distance measure we have no evidence of a difference between the tie-breaking methods. In such a situation we should select the simplest method. Based on this, we decided to use the “Minimum” method of tie-breaking in our further experiments.
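The two statistics can be computed directly from a table of error rates. The sketch below is illustrative only: the function names and the small error table are hypothetical, rank 1 is assigned to the lowest error with tied values sharing the average rank (an assumed convention, which does not affect S₁), and no tie correction is applied to the Friedman statistic.

```python
def average_ranks(errors):
    """Mean rank of each method; errors[i][j] is method j on data set i."""
    n, k = len(errors), len(errors[0])
    totals = [0.0] * k
    for row in errors:
        order = sorted(range(k), key=lambda j: row[j])
        pos = 0
        while pos < k:
            end = pos
            # Extend the block of tied values starting at position pos.
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            rank = (pos + end) / 2 + 1  # average rank of the tied block
            for t in range(pos, end + 1):
                totals[order[t]] += rank
            pos = end + 1
    return [t / n for t in totals]


def iman_davenport(errors):
    """Return (mean ranks, Friedman statistic S1, Iman-Davenport S)."""
    n, k = len(errors), len(errors[0])
    ranks = average_ranks(errors)
    s1 = 12 * n / (k * (k + 1)) * sum(r * r for r in ranks) - 3 * n * (k + 1)
    # Undefined when s1 == n * (k - 1), i.e. identical rankings everywhere.
    s = (n - 1) * s1 / (n * (k - 1) - s1)
    return ranks, s1, s


# Hypothetical error rates: N = 4 data sets (rows), K = 3 methods (columns).
errors = [
    [0.1, 0.2, 0.3],
    [0.1, 0.2, 0.3],
    [0.1, 0.3, 0.2],
    [0.2, 0.1, 0.3],
]
ranks, s1, s = iman_davenport(errors)
```

For this table the mean ranks are 1.25, 2.0 and 2.75, giving S₁ = 4.5 and S = 27/7. The p-value is then the upper tail of the F-distribution with K - 1 and (K - 1)(N - 1) degrees of freedom, which could be obtained, for example, from `scipy.stats.f.sf(s, k - 1, (k - 1) * (n - 1))` if SciPy is available.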

Table 1

P-values from the Iman & Davenport test for tie-breaking methods

Distance	p-value
DTW	0.75
LCS	0.79
ERP	0.99
EDR	0.37
DDTW	0.29
DD_DTW	0.52

4.2 r analysis

4.2.1 Fixed r

The first way to choose the radius r is to fix the same r for all data sets. We try to find an r which is statistically better (higher rank, lower error) than the full radius r = 100. To test this statistically we used the Iman & Davenport test [18], introduced earlier. For each comparison we obtained p-values equal to 0. In such a case we should apply a post hoc test to discover the structure of groups of similar radii. The test statistic for comparing the ith and jth algorithm is $Z = \frac{R_{i} - R_{j}}{\sqrt{\frac{K (K + 1)}{6 N}}}.$ This statistic is asymptotically normal with zero mean and unit variance. When comparing multiple algorithms, to retain an overall significance level α, one has to adjust the value of α for each post hoc comparison. There are different methods for this. Ref. [13] compared various correction algorithms, showing that, although it requires intensive computation, the Bergmann and Hommel procedure [5] has the highest statistical power. Unfortunately, in our situation we have too many algorithms (101 different radii) to use this procedure. Hence, we used the Nemenyi test [26], in which all classifiers are compared with each other. This test is recommended for comparing a large number of algorithms [11, 13]. The performance of two methods is significantly different at the experimentwise error rate α if $|R_{i} - R_{j}| > q(\alpha, K, \infty) \left(\frac{K (K + 1)}{12 N}\right)^{1/2}, \quad i = 1, \dots, K - 1, \ j = i + 1, \dots, K,$ where the values of q (α, K, ∞) are based on the Studentized range statistic [11].
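The scale of both post hoc quantities is set by K and N alone, which the following short Python sketch makes explicit. The function names are hypothetical, and the quantile q (α, K, ∞) is passed in as an argument because it has to be taken from tables of the Studentized range; no value for it is assumed here.

```python
import math


def z_statistic(rank_i, rank_j, k, n):
    """Z for comparing two mean ranks among k methods on n data sets."""
    return (rank_i - rank_j) / math.sqrt(k * (k + 1) / (6.0 * n))


def critical_difference(q, k, n):
    """Smallest significant |R_i - R_j| for a given Studentized range q."""
    return q * math.sqrt(k * (k + 1) / (12.0 * n))


# Setting of this section: K = 101 radii compared on N = 85 data sets.
scale = critical_difference(1.0, 101, 85)  # multiplier of q
```

With K = 101 and N = 85 the multiplier is the square root of 10.1, about 3.18, so two radii differ significantly only if their mean ranks are more than about 3.18 q apart.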

In Figures 9–14 we have groups of radii with similar ranks (left) and values of the rank depending on r (right). Each group is statistically separated from the others. Using these graphs we can see which r is the best for which distance measure.

Fig.9

Multiple comparisons of radii r for DTW. Groups of radii with similar ranks (left). Ranks of radii (right).

Figure 9 shows the groups and ranks for DTW. We can see that the best group, with the highest ranks and the lowest errors, is the interval [6, 12]. The group is clearly separated from all groups with r = 100 included. A value of r around 10 seems to be the best choice in this case.

Figure 10 shows the groups and ranks for LCS. The best group is the interval [7, 20] (with a few values excluded). In this case the best group is not separated from the groups with r = 100. Only the values r = 13, 15, 16 are not included in other groups with r = 100. Therefore we can only say that values of r around 15 give results not worse than other radii including the full radius r = 100.

Fig.10

Multiple comparisons of radii r for LCS. Groups of radii with similar ranks (left). Ranks of radii (right).

Figure 11 shows the groups and ranks for ERP. There is a fairly wide group with the highest ranks, but only a few radii are excluded from the groups with r = 100. Values of r from around 5 to around 20 give the same and sometimes better results than r = 100. It seems that a value of around r = 10 is a good choice in this case.

Fig.11

Multiple comparisons of radii r for ERP. Groups of radii with similar ranks (left). Ranks of radii (right).

Figure 12 shows the groups and ranks for EDR. The graph is very clear: there is one very wide group that includes all values of r from 14 to 100. Since as usual the smallest r is preferred, a value around 15 is the best choice.

Fig.12

Multiple comparisons of radii r for EDR. Groups of radii with similar ranks (left). Ranks of radii (right).

Figure 13 shows the groups and ranks for DDTW. The continuous group with the highest ranks is the interval [5, 18], which is clearly separated from the groups with r = 100. Moreover, r = 12, 13 belong only to the best group. It seems that values from around 10 to 15 are the best choice.

Fig.13

Multiple comparisons of radii r for DDTW. Groups of radii with similar ranks (left). Ranks of radii (right).

Figure 14 shows the groups and ranks for DD_DTW. The group with the lowest errors is the interval [7, 17], where radii from 7 to 12 are not included in groups with r = 100. The value r = 11 belongs only to the best group. It seems that r around 10 is the best choice.

Fig.14

Multiple comparisons of radii r for DD_DTW. Groups of radii with similar ranks (left). Ranks of radii (right).

4.2.2 Choosing the best r from 0 to r_max

The second way to choose a radius r is to fix a maximal radius r_max and find the best r in the interval [0, r_max]. We find the best r (with the smallest error) on the training subset using the leave-one-out cross-validation method. To statistically test algorithms for different radii we used the methodology described above.

In Figures 15–20 we have groups of maximal radii with similar ranks (left) and values of the rank depending on r_max (right). Each group is statistically separated from the others. Using these graphs we can see which r_max is the best for which distance measure.

Fig.15

Multiple comparisons of maximal radii r_max for DTW. Groups of maximal radii with similar ranks (left). Ranks of maximal radii (right).

Fig.16

Multiple comparisons of maximal radii r_max for LCS. Groups of maximal radii with similar ranks (left). Ranks of maximal radii (right).

Fig.17

Multiple comparisons of maximal radii r_max for ERP. Groups of maximal radii with similar ranks (left). Ranks of maximal radii (right).

Fig.18

Multiple comparisons of maximal radii r_max for EDR. Groups of maximal radii with similar ranks (left). Ranks of maximal radii (right).

Fig.19

Multiple comparisons of maximal radii r_max for DDTW. Groups of maximal radii with similar ranks (left). Ranks of maximal radii (right).

Fig.20

Multiple comparisons of maximal radii r_max for DD_DTW. Groups of maximal radii with similar ranks (left). Ranks of maximal radii (right).

The graphs are similar for all distance measures. There is one wide group including r_max = 100 with the highest ranks, and there is no better group. The best groups start at the following maximal radii: 16 for DTW, 13 for LCS, 16 (6) for ERP, 11 for EDR, 14 for DDTW, and 12 for DD_DTW. If we have to take one universal r_max for all of the examined distances, a value around 20 seems to be the best choice.

4.2.3 r vs r_max

We compared a fixed choice of r with the method of finding the best r from 0 to r_max. For every distance measure we take the minimal r and the minimal r_max from the best group (Table 2).

Table 2
Lowest r and r_max in the best group, p-values of the Wilcoxon test, and win/tie/loss numbers

Distance	r	r_max	p-value	W/T/L
DTW	6	16	0.011	27/14/44
LCS	7	13	0.369	29/20/36
ERP	4	16	0.970	33/22/30
EDR	14	11	0.001	42/26/17
DDTW	5	14	0.002	20/15/50
DD_DTW	7	12	0.105	29/13/43

As a first step we decided to prepare a graphical comparison of methods (Figures 21–23).

Fig.21

Comparison of test errors.

Fig.22

Comparison of test errors.

Fig.23

Comparison of test errors.

Finally, we present a statistical comparison. To statistically compare two classifiers over multiple data sets, [11] recommends the Wilcoxon signed-ranks test, which is a non-parametric alternative to the paired t-test. The obtained p-values and numbers of wins/ties/losses are presented in Table 2. As we can see, the p-values and W/T/L numbers show that for the distances LCS, ERP, and DD_DTW the methods with fixed r and with tuned r ∈ [0, r_max] are statistically the same. For DTW and DDTW the tuned r_max method is better than the fixed r method. The only case for which the fixed r method is better than the tuned r_max method is that of the EDR distance measure. The graphical comparison in Figures 21–23 confirms the results.
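The ingredients of this comparison, the win/tie/loss counts and the signed-rank sums, can be sketched in a few lines of Python. The function names and the error values below are hypothetical and are not taken from Table 2. Dropping zero differences before ranking is an assumption (other conventions exist), and a method is counted as winning on a data set when its error is lower.

```python
def win_tie_loss(err_a, err_b):
    """Count data sets where method a is better, equal, or worse than b."""
    wins = sum(a < b for a, b in zip(err_a, err_b))
    ties = sum(a == b for a, b in zip(err_a, err_b))
    losses = sum(a > b for a, b in zip(err_a, err_b))
    return wins, ties, losses


def signed_rank_sums(err_a, err_b):
    """Return (W_plus, W_minus) for the differences b - a, zeros dropped."""
    diffs = [b - a for a, b in zip(err_a, err_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        # Tied absolute differences share their average rank.
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        for t in range(pos, end + 1):
            ranks[order[t]] = (pos + end) / 2 + 1
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical test errors (in percent) of two methods on four data sets.
err_a = [10, 20, 30, 25]
err_b = [15, 20, 20, 45]
wtl = win_tie_loss(err_a, err_b)
w_plus, w_minus = signed_rank_sums(err_a, err_b)
```

The test statistic is the smaller of the two rank sums; a p-value could then be obtained from a library routine such as `scipy.stats.wilcoxon`, if SciPy is available.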

5 Conclusions

First, we sum up the results concerning tie-breaking methods for finding r on a training subset with leave-one-out cross-validation. Both the analysis of probability distributions of r (Figures 3–8) and statistical tests show that the proposed methods of choosing r are statistically indistinguishable. Since the value of r has a significant influence on the computational complexity of the distance measures, the best choice is the minimal r for all studied distances.

Next, we tested whether for the fixed r method we can take r < 100 with the same or better result than for r = 100. It turned out that the minimal r values in the best group (with highest ranks/lowest classification errors) are very small and are lower than 10 for all of the distance measures except EDR (Table 2). Furthermore, for all distances, these minimal radii give better (the same in the case of EDR) results (lower classification errors) than r = 100. This shows that by fixing these minimal r we can greatly reduce the computational cost of the distances. For example, if we take approximately r = 10, then only the cells of Γ inside the band are evaluated, which is roughly a fraction 2ρ - ρ² ≈ 0.19 of the n² cells of the full matrix (with ρ = r/100), so the computation time falls to about one fifth of that for r = 100 for all distance measures.
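How much work a given radius saves can be checked by counting the cells of Γ that the banded recursion actually evaluates. This is a small illustrative sketch; the function name `band_cells` is hypothetical and halves are rounded up when converting r to R.

```python
def band_cells(n, r_percent):
    """Number of cells (i, j) with j_min(i) <= j <= j_max(i)."""
    R = (2 * r_percent * (n - 1) + 100) // 200  # nearest-integer rounding
    return sum(
        min(n, i + R) - max(1, i - R) + 1 for i in range(1, n + 1)
    )


# Series of length n = 100 at three radii.
cells_r0 = band_cells(100, 0)
cells_r10 = band_cells(100, 10)
cells_r100 = band_cells(100, 100)
```

For n = 100 this gives 100 cells at r = 0, 1990 cells at r = 10 and 10000 cells at r = 100.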

On the other hand, in the case of selecting the best r ∈ [0, r_max], there is no better group than that with r_max = 100. However, the minimal values of r_max in these groups are still much lower than r_max = 100 (Table 2). It seems that a universal value of r_max for all of the examined distance measures may be r_max = 20. Since we select r by the leave-one-out method, we can also achieve a large reduction in computational time. Finally, we tested whether the method of choosing the best r ∈ [0, r_max] always gives a better result than the method with fixed r. For DTW and DDTW choosing r on the training subset gives a much better result than fixed r, while for LCS, ERP, and DD_DTW the two methods are statistically the same. For EDR the fixed r was even better than the method with choice of the best r; the reason for this behavior may lie in the weak correspondence of the error rates on the learning (cross-validation) and testing subsets.

In summary:

When we have many r with the same error rate on the training subset we select the minimal one.

In the method with fixed r, the standard maximal value r = 100 represents a significant waste of computing power. Table 2 shows the minimal values of r that we can select in the best groups for each distance measure. For all distances these minimal values of r guarantee results not worse than for r = 100, and sometimes much better. To allow for scaling to new data sets we can select values of r slightly higher than the minimal ones. It seems that a recommended universal value for all of the examined distance measures (excluding the case of EDR) may be r = 10 (r = 15 for EDR).

For the method of selecting the best r ∈ [0, r_max], although we do not have better results than for r_max = 100, the group with r_max = 100 is always very wide and the minimal value of r_max in this group is very low (Table 2). It seems that the value r_max = 20 is a good choice for all of the examined distance measures.

In accordance with intuition, the method of selecting r ∈ [0, r_max] is not worse than the method with fixed r (excluding the case of EDR). For DTW and DDTW we substantially increase the quality of classification by choosing the best r ∈ [0, r_max] instead of fixed r. For other distance measures the results are not conclusive. The special case of EDR seems to be explained by overfitting, that is, by the weak correspondence of the choices of best r on the training and testing data sets for this distance measure.

References

Abul

, Bonchi

and Nanni

, Anonymization of moving objects databases by clustering and perturbation, Information Systems 35 (2010), 884–910.

Bagnall

and Lines

, Technical Report CMP-C14-01: An Experimental Evaluation of Nearest Neighbour Time Series Classification, arXiv:1406.4757v1, 2014.

Bagnall

, Bostrom

, Large

and Lines

, The great time series classification bake off: An experimental evaluation of recently proposed algorithms, Extended Version (2016), arXiv:1602.0711v1.

Batista

, Wang

and Keogh

, A complexityinvariant distance measure for time series, Proceedings of the Eleventh SIAM Conference on Data Mining (SDM) (2011).

Bergmann

, Hommel

, Improvements of general multiple test procedures for redundant systems of hypotheses, in Bauer

, Hommel

and Sonnemann

(Eds), Multiple Hypotheses Testing, Springer, 1988, pp. 110–115.

Berndt

D.J.

and Clifford

, Using dynamic time warping to find patterns in time series, AAAI Workshop on Knowledge Discovery in Databases, 1994, pp.229–248.

Chen

and Ng

, On The Marriage of Lp-norms and Edit Distance, In: Proceedings of the Thirtieth International Conference on Very Large Databases 30, 2004, pp. 792–803.

Chen

, Özsu

M.T.

, Oria

, Robust and fast similarity search for moving object trajectories, In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, New York, ACM, NY, USA, pp. 491–502.

Chen

, Keogh

, Hu

, Begum

, Bagnall

, Mueen

and Batista

, The UCR Time Series Classification Archive, www.cs.ucr.edu/ eamonn/time_series_data/, (2015).

10.

Dau

H.A.

, Silva

D.F.

, Petitjean

, Forestier

, Bagnall

, Mueen

and Keogh

, Optimizing dynamic time warping’s window width for time series data mining applications, Data Mining and Knowledge Discovery (2018). 10.1007/s10618-018-0565-y.

11.

Demšar

, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006), 1–30.

12.

Ding

, Trajcevski

, Scheuermann

, Wang

and Keogh

, Querying and mining of time series data: Experimental comparison of representations and distance measures, Proc VLDB Endow 1 (2008), 1542–1552.

13.

García

and Herrera

, An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comons, Journal of Machine Learning Research 9 (2008), 2677–2694.

14.

Geler

, Kurbalija

, Radovanović

and Ivanović

, Impact of the sakoe-chiba band on the DTWTime-series distance measure for kNN classification, In: Knowledge Science, Engineering and Management, Lecture Notes in Computer Science 8793 (2014), 105–114.

15.

Górecki

, Using derivatives in a longest common subsequence dissimilarity measure for time series classification, Pattern Recognition Letters 45C (2014), 99–105.

16.

Górecki

and Łuczak

, Using derivatives in time series classification, Data Mining and Knowledge Discovery 26(2) (2013), 310–331.

17. Górecki and Łuczak, Multivariate time series classification with parametric derivative dynamic time warping, Expert Systems with Applications 42(5) (2015), 2305–2312.

18. Iman and Davenport, Approximations of the critical region of the Friedman statistic, Communications in Statistics - Theory and Methods 9(6) (1980), 571–595.

19. Itakura, Minimum prediction residual principle applied to speech recognition, IEEE Trans Acoustics Speech Signal Process 23 (1975), 52–72.

20. Keogh and Pazzani, Dynamic Time Warping with Higher Order Features, In: Proc SIAM International Conference on Data Mining (SDM’2001), Chicago, USA, 2001.

21. Keogh and Ratanamahatana C.A., Exact indexing of dynamic time warping, In: Proceedings of the 26th Int’l Conference on Very Large Data Bases, Hong Kong, 2002, pp. 406–417.

22. Keogh, Zhu, Hu, Hao, Xi, Wei and Ratanamahatana C.A., The UCR Time Series Classification/Clustering Homepage: http://www.cs.ucr.edu/~eamonn/time_series_data/ (2011).

23. Kurbalija, Radovanović, Geler and Ivanović, The Influence of Global Constraints on DTW and LCS Similarity Measures for Time-Series Databases, In: Dicheva, Markov and Stefanova (Eds.), Third International Conference on Software Services and Semantic Technologies S3T 2011 SE - 10, Springer Berlin Heidelberg, 2011, pp. 67–74.

24. Kurbalija, Radovanović, Geler and Ivanović, The influence of global constraints on similarity measures for time-series databases, Knowledge-Based Systems 56(1) (2014), 49–67.

25. Łuczak, Hierarchical clustering of time series data with parametric derivative dynamic time warping, Expert Systems with Applications 62 (2016), 116–130.

26. Nemenyi P.B., Distribution-free multiple comparisons, Ph.D. thesis, Princeton University, 1963.

27. Pelekis, Kopanakis, Kotsifakos E.E., Frentzos and Theodoridis, Clustering Trajectories of Moving Objects in an Uncertain World, In: Data Mining 2009 ICDM’09 Ninth IEEE International Conference On, 2009, pp. 417–427.

28. Radovanović, Nanopoulos and Ivanović, Time-series classification in many intrinsic dimensions, In: 10th SIAM International Conference on Data Mining (SDM’2010), Columbus, USA, 2010, pp. 677–688.

29. Ratanamahatana C.A. and Keogh, Three myths about dynamic time warping data mining, In: Proceedings of SIAM International Conference on Data Mining (SDM’05), 2005, pp. 506–510.

30. Sakoe and Chiba, Dynamic programming algorithm optimization for spoken word recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing 26(1) (1978), 43–49.

31. Tomašev and Mladenić, Nearest Neighbor voting in high dimensional data: Learning from past occurrences, Computer Science and Information Systems 9 (2012), 691–712.

32. Vlachos, Kollios and Gunopulos, Discovering similar multidimensional trajectories, In: Proceedings 18th International Conference on Data Engineering, IEEE Comput Soc, 2002, pp. 673–684.

33. , Keogh, Shelton, Wei and Ratanamahatana C.A., Fast time series classification using numerosity reduction, In: Proceedings of the 23rd International Conference on Machine Learning, ACM, New York, NY, USA, 2006, pp. 1033–1040.

34. , Kumar, Quinlan J.R., Ghosh, Yang, Motoda, McLachlan G.J., Ng, Liu, Yu P.S., Zhou, Steinbach, Hand D.J. and Steinberg, Top 10 algorithms in data mining, Knowledge and Information Systems 14(1) (2008), 1–37.

35. Zhang, Zuo, Zhang and Li, Classification of pulse waveforms using edit distance with real penalty, EURASIP Journal on Advances in Signal Processing 28 (2010), 1–28.



Table 2
Lowest r and r_max in the best group, p-values of the Wilcoxon test, and win/tie/loss (W/T/L) numbers

Distance   r    r_max   p-value   W/T/L
DTW        6    16      0.011     27/14/44
LCS        7    13      0.369     29/20/36
ERP        4    16      0.970     33/22/30
EDR        14   11      0.001     42/26/17
DDTW       5    14      0.002     20/15/50
DDDTW      7    12      0.105     29/13/43
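
The W/T/L column of Table 2 can be read as a per-data-set tally. The sketch below is only an illustration. It parses the W/T/L entries copied from the table and sums each row, so that one can check whether every measure was tallied over the same number of data sets. The helper `win_tie_loss` is hypothetical: it assumes the counts come from comparing the error rates of two variants on each data set, which the table caption itself does not spell out.

```python
# W/T/L entries copied verbatim from Table 2.
TABLE2_WTL = {
    "DTW": "27/14/44",
    "LCS": "29/20/36",
    "ERP": "33/22/30",
    "EDR": "42/26/17",
    "DDTW": "20/15/50",
    "DDDTW": "29/13/43",
}


def parse_wtl(entry):
    """Split a 'W/T/L' string into three integer counts."""
    wins, ties, losses = (int(part) for part in entry.split("/"))
    return wins, ties, losses


def win_tie_loss(errors_a, errors_b, tol=0.0):
    """Hypothetical helper: tally data sets on which variant A has a lower,
    equal (within tol), or higher error rate than variant B.
    This is an assumed reading of W/T/L, not the paper's stated procedure."""
    wins = ties = losses = 0
    for a, b in zip(errors_a, errors_b):
        if abs(a - b) <= tol:
            ties += 1
        elif a < b:
            wins += 1
        else:
            losses += 1
    return wins, ties, losses


# Number of data sets covered by each row of the table.
totals = {name: sum(parse_wtl(entry)) for name, entry in TABLE2_WTL.items()}

if __name__ == "__main__":
    for name, total in totals.items():
        print(name, total)
```

If the printed totals agree across all six rows, the rows are at least mutually consistent. A mismatch would point to a transcription error in the table.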