Combining raw and normalized data in multivariate time series classification with dynamic time warping

Abstract

Data normalization is one of the most common processing methods applied to raw data before its subsequent use in data mining algorithms, classification, or clustering methods. Many procedures, particularly those that use any statistical analysis, require that data be normalized in one way or another. In the case of time series, a standard method of processing raw data is z-normalization of each time series instance in the data set. For multivariate (multidimensional) time series we z-normalize each dimension (variable) individually. Although normalization brings many advantages, it is easy to find examples of data sets where normalization destroys information contained in the raw data. In this paper we demonstrate that for multivariate time series (MTS) both raw and normalized components give some information about the data, and that the best way of mining the data is to combine them. We focus here on multidimensional time series and their classification using the nearest neighbor method with the dynamic time warping (DTW) distance measure. We construct a parametric distance measure that is a combination of DTW on raw and z-normalized time series data. It turns out that the combined distance measure carries more information about the data than the two distance components separately. By determining an individual parameter for each data set it is possible to obtain a lower classification error than the errors of both component distance measures. We perform experiments on real data sets from many fields of science and technology. The advantage of the combined approach is confirmed by graphical and statistical comparisons.

Keywords

Multivariate time series classification, dynamic time warping, parametric distance measure, combining raw and normalized data

1 Introduction

Data normalization is the most common transformation of raw data in the theory and applications of data mining and machine learning [15, 29]. There are many works demonstrating theoretically and empirically the necessity and usefulness of data normalization with applications in various fields of science and technology [2, 31]. For time series data [1, 28], both one- and multidimensional, the most common basic method is z-normalization, where we normalize each data series instance (each dimension for multivariate time series) in the data set separately. Z-normalization of a time series is a transformation resulting in a new time series of the same length whose arithmetic mean is equal to zero and whose standard deviation is equal to one. The need for time series normalization is often emphasized in classification methods with the dynamic time warping and other distance measures [10, 23].

Despite the undeniable advantages of time series normalization, it is easy to identify artificial data where z-normalization destroys information contained in the classes of a given data set. An example may be two classes of similar or even identical waveform signals which differ strongly in amplitude. The natural discriminatory factor for these classes is the amplitude of the signals, and the process of normalization completely destroys this information. After the normalization process both classes will be indistinguishable. Especially in the case of multivariate time series, an amplitude distinction between specific dimensions of the MTS can have a large impact on the discrimination process. Z-normalization of all dimensions in all instances of an MTS data collection can remove information, resulting in a deterioration of the classification results.

In this paper we show how information from raw and z-normalized data can be used in the classification of multivariate time series. We focus on the classification of multivariate (multidimensional) time series using the nearest neighbor method (1NN) with the dynamic time warping (DTW) distance measure. We construct a parametric measure that is a combination of the DTW distance on the raw MTS data and the DTW distance on the z-normalized instances of the same data set. A single real parameter controls the contribution of each of the distance components to the final value of the combined distance. Similar techniques have been used in previous work for combinations of other distances; for example, for multivariate time series with a combination of DTW and a derivative distance DDTW or an integral distance IDTW [14, 22], or for one-dimensional time series data classification [11–13] and clustering [21]. In this paper, we show that a combination of raw and normalized data gives better classification results for multivariate time series with 1NN and DTW than the same classification using only raw data or only normalized data. The main idea of the paper is to show that both raw and normalized data contain useful information that can improve the performance of classifiers. We also show that the combining approach is essential here: only a combination of the two distances can outperform the component measures. Furthermore, we show that the parametric approach is important. The parameter of the combined distance varies greatly depending on the data set, determining the strength of the contributions of the respective distance components to the final distance. Empirical experiments were carried out on 16 real multivariate time series data sets from many fields of science, technology and medicine.

The remainder of the paper is organized as follows. Section 2 contains basic definitions of the DTW distance measure for univariate and multivariate time series. The new parametric distance as a combination of DTW on raw and z-normalized data is presented, and some efficiency optimizations of the tested method are described. Section 3 contains computational experiments. Classification error rates for the combined method and for its components are presented followed by a detailed discussion of the results and graphical and statistical comparisons. Finally, Section 4 contains conclusions and sums up the experiments performed.

2 Multivariate combined DTW

A univariate (one-dimensional) time series is a sequence of observations in time [7]. In this paper we assume that the time series is discrete, i.e. it is a finite sequence of real numbers: $x = \{x(i) \in \mathbb{R} : i = 1, 2, \dots, n\},$ where $n \in \mathbb{N}$ is the length of the time series.

A multivariate (multidimensional) time series is defined as a finite sequence of univariate series: $X = (x_{1}, x_{2}, \dots, x_{m}),$ where $m \in \mathbb{N}$ is the dimension of the multi-series X, i.e. the number of variables of the time series. In this work we assume that all series in a multi-series have the same length n for each instance of the data set.

2.1 Dynamic time warping

The dynamic time warping (DTW) distance measure is a very popular distance measure used to compute the similarity/dissimilarity of time series data [4]. To compute the value of DTW for two one-dimensional time series of length $n \in \mathbb{N}$, $x = \{x(i) \in \mathbb{R} : i = 1, 2, \dots, n\}$ and $y = \{y(i) \in \mathbb{R} : i = 1, 2, \dots, n\},$ we proceed as follows. We define a local cost function d, a real-valued function of two real variables, which computes the distance between two points of the time series x and y. For standard DTW the local cost function is usually defined by: $d(x(i), y(j)) = (x(i) - y(j))^{2}.$ (1)

Then we construct a square n × n matrix D consisting of the values of the local cost function, D (i, j) = d (x (i), y (j)). The matrix element D (i, j) corresponds to the alignment between the values x (i) and y (j) of the time series. Then we construct a warping path $W = \{w_{1}, w_{2}, \dots, w_{K}\}$ ($K \in \mathbb{N}$) of elements of the matrix D. In the standard DTW distance measure the warping path is required to meet three criteria:

$w_{1} = D(1, 1)$ and $w_{K} = D(n, n)$ (boundary conditions);

if $w_{k} = D(i_{k}, j_{k})$ and $w_{k+1} = D(i_{k+1}, j_{k+1})$ then $i_{k+1} - i_{k} \le 1$ and $j_{k+1} - j_{k} \le 1$ (continuity);

$i_{k+1} - i_{k} \ge 0$ and $j_{k+1} - j_{k} \ge 0$ (monotonicity).

To obtain the warping path we begin at the element D (1, 1) and, moving at most one index up or right at each step, we finish at the element D (n, n) (Fig. 1). The path that minimizes the warping cost gives the value of the DTW distance: $DTW(x, y) = \min_{W} \left\{ \sum_{k=1}^{K} w_{k} \right\}.$

Fig.1

Time series alignment and the corresponding warping path.

Sometimes, the DTW is defined as the square root of this value.

In practice, we often compute the value of the DTW distance by building the cumulative distance matrix Γ. For this we use dynamic programming with the following recursion: $\Gamma(i, j) = D(i, j) + \min\{\Gamma(i-1, j-1), \Gamma(i-1, j), \Gamma(i, j-1)\}$ with the start conditions: $\Gamma(0, 0) = 0, \quad \Gamma(0, i) = \infty, \quad \Gamma(i, 0) = \infty \quad (i = 1, 2, \dots, n).$

Then the DTW value is found at position (n, n) of the matrix Γ: $DTW(x, y) = \Gamma(n, n) \quad \left(\text{or } DTW(x, y) = \sqrt{\Gamma(n, n)}\right).$

The DTW distance measure is not a metric, as it does not satisfy the triangle inequality. However, it is the case that DTW (x, x) = 0 and DTW (x, y) = DTW (y, x) if we use the cost function (1).
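The dynamic programming recursion of this subsection can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `dtw` and the `take_sqrt` flag are hypothetical, and the code also accepts series of different lengths, although the text assumes a common length n.

```python
import math


def dtw(x, y, take_sqrt=False):
    """DTW between two univariate series via the cumulative distance matrix."""
    n, m = len(x), len(y)
    # gamma[i][j] is the cheapest cumulative cost of aligning x[:i] with y[:j];
    # row 0 and column 0 hold the start conditions (0 at the corner, infinity elsewhere).
    gamma = [[math.inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2  # local cost function (1)
            gamma[i][j] = cost + min(
                gamma[i - 1][j - 1],  # diagonal step
                gamma[i - 1][j],      # advance in x only
                gamma[i][j - 1],      # advance in y only
            )
    value = gamma[n][m]
    return math.sqrt(value) if take_sqrt else value


# Toy data: y is x delayed by one step.
x = [0.0, 1.0, 2.0, 1.0, 0.0]
y = [0.0, 0.0, 1.0, 2.0, 1.0]
print(dtw(x, x))               # 0.0
print(dtw(x, y), dtw(y, x))    # 1.0 1.0
```

In the toy example the only unavoidable cost comes from the boundary condition, which forces the last elements (0 and 1) to be aligned; the delay itself is absorbed by the warping.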

2.2 Multivariate dynamic time warping

Since we assume that the time series length for each dimension of the multi-series in a given data set is the same, we can regard an MTS as a one-dimensional trajectory in m-dimensional Euclidean space:

$X = \{X(i) = (x_{1}(i), x_{2}(i), \dots, x_{m}(i)) \in \mathbb{R}^{m} : i = 1, 2, \dots, n\}.$ (2)

Then we can define the DTW distance measure between two multi-series X and Y [14] in the same way as for one-dimensional series with the local cost function d defined by: $d(X(i), Y(j)) = \sum_{k=1}^{m} (x_{k}(i) - y_{k}(j))^{2}.$ (3)

This means that the local cost function is the squared Euclidean distance of m-dimensional vectors generated by taking the values along the dimensions of the multi-series at positions i and j (Fig. 2).

Fig.2

MTS alignment and the cost function.
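A minimal Python sketch of the multivariate case is given below. It assumes that an MTS is stored as a list of m univariate series of equal length; the names `squared_euclidean` and `multivariate_dtw` are hypothetical. The only change with respect to the univariate recursion is the local cost (3).

```python
import math


def squared_euclidean(p, q):
    """Local cost (3): squared Euclidean distance between two points of R^m."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def multivariate_dtw(X, Y):
    """DTW between two MTS, each given as a list of m univariate series."""
    # Re-arrange each MTS into a trajectory of points in R^m, as in (2).
    px, py = list(zip(*X)), list(zip(*Y))
    n, k = len(px), len(py)
    gamma = [[math.inf] * (k + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            gamma[i][j] = squared_euclidean(px[i - 1], py[j - 1]) + min(
                gamma[i - 1][j - 1], gamma[i - 1][j], gamma[i][j - 1]
            )
    return gamma[n][k]


# Toy data with m = 2 and n = 4; Y is X delayed by one step in both dimensions.
X = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 0.0, 1.0]]
Y = [[0.0, 0.0, 1.0, 2.0], [0.0, 0.0, 1.0, 0.0]]
print(multivariate_dtw(X, X))  # 0.0
print(multivariate_dtw(X, Y))  # 2.0
```

Note that a single warping path is shared by all dimensions, which is what treating the MTS as one trajectory in m-dimensional space implies.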

2.3 Combining raw and normalized data

We define the normalized MTS as an MTS for which each of the m dimensions is z-normalized separately, i.e. $\operatorname{norm}(X) = (\operatorname{z-norm}(x_{1}), \operatorname{z-norm}(x_{2}), \dots, \operatorname{z-norm}(x_{m})),$ where $\operatorname{z-norm}(x_{i}) = \frac{x_{i} - \mu(x_{i})}{\sigma(x_{i})} \quad (i = 1, 2, \dots, m),$ μ (x_i) is the mean of the (univariate) time series x_i, and σ (x_i) is the standard deviation of x_i. For two MTS X, Y and their normalizations norm (X), norm (Y), we can compute both the DTW distance between the raw multi-series and the DTW distance between their normalizations (normDTW): $DTW(X, Y), \quad normDTW(X, Y) = DTW(\operatorname{norm}(X), \operatorname{norm}(Y)).$

We define a parametric combined dynamic time warping distance measure (combDTW) as a convex combination of the DTW and normDTW distance measures: $combDTW(X, Y) = (1 - \alpha)\, DTW(X, Y) + \alpha\, normDTW(X, Y),$ where α is a real-valued parameter with α ∈ [0, 1].

The distance function combDTW can be used in the nearest neighbor classification method, where the parameter α is chosen in the learning phase (on the learning data set). In this paper α will be found by the leave-one-out cross-validation method on the training data set.
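The construction of the combined distance can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the population standard deviation is assumed for σ (the text does not say which estimator is used), and mapping constant series to zeros is an added assumption to avoid division by zero.

```python
import math
from statistics import fmean, pstdev


def z_norm(series):
    """Z-normalize one dimension (population standard deviation assumed)."""
    mu, sigma = fmean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)  # assumption: constant series become zeros
    return [(v - mu) / sigma for v in series]


def dtw(X, Y):
    """Multivariate DTW with the squared Euclidean local cost."""
    px, py = list(zip(*X)), list(zip(*Y))
    n, k = len(px), len(py)
    gamma = [[math.inf] * (k + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            cost = sum((a - b) ** 2 for a, b in zip(px[i - 1], py[j - 1]))
            gamma[i][j] = cost + min(
                gamma[i - 1][j - 1], gamma[i - 1][j], gamma[i][j - 1]
            )
    return gamma[n][k]


def norm_dtw(X, Y):
    """DTW between the per-dimension z-normalized versions of X and Y."""
    return dtw([z_norm(d) for d in X], [z_norm(d) for d in Y])


def comb_dtw(X, Y, alpha):
    """Convex combination of the raw and the normalized DTW distances."""
    return (1 - alpha) * dtw(X, Y) + alpha * norm_dtw(X, Y)


# Same waveform, different amplitude (one dimension, four samples).
X = [[0.0, 1.0, 0.0, -1.0]]
Y = [[0.0, 3.0, 0.0, -3.0]]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, comb_dtw(X, Y, alpha))
```

The toy pair reproduces the amplitude example from the introduction: the raw distance is 8, the normalized distance is zero up to rounding, and α interpolates between the two.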

Since the parameter α is located outside the DTW and normDTW distances, to compute the value of combDTW for all α ∈ [0, 1] we need to compute DTW and normDTW only once. This allows us to make some computational optimizations in the learning phase of the nearest neighbor method. The optimized algorithm for leave-one-out cross-validation on the training set is presented in Fig. 3.

Fig.3

Implementation of the optimized algorithm for the leave-one-out cross-validation routine (Matlab code).
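The idea behind the optimization can be illustrated with a short Python sketch. It is not the Matlab routine of Fig. 3: the function names, the toy distance matrices and the labels are hypothetical, and the two distance matrices are assumed to have been computed once beforehand, so that the sweep over α involves only arithmetic on stored values.

```python
def loo_error_rates(raw_dist, norm_dist, labels, alphas):
    """Leave-one-out 1NN error rate for every alpha, reusing stored distances."""
    n = len(labels)
    errors = []
    for alpha in alphas:
        mistakes = 0
        for i in range(n):
            best_j, best_d = None, float("inf")
            for j in range(n):
                if j == i:
                    continue  # instance i is left out of its own neighbor search
                d = (1 - alpha) * raw_dist[i][j] + alpha * norm_dist[i][j]
                if d < best_d:
                    best_j, best_d = j, d
            if labels[best_j] != labels[i]:
                mistakes += 1
        errors.append(mistakes / n)
    return errors


def select_alpha(raw_dist, norm_dist, labels, steps=100):
    """Return the smallest alpha on the grid 0, 1/steps, ..., 1 with minimal error."""
    alphas = [k / steps for k in range(steps + 1)]
    errors = loo_error_rates(raw_dist, norm_dist, labels, alphas)
    best = errors.index(min(errors))  # first index, hence the smallest alpha
    return alphas[best], errors[best]


# Hypothetical precomputed DTW and normDTW matrices for four training instances.
labels = [0, 0, 1, 1]
raw_dist = [[0, 2, 1, 2], [2, 0, 2, 1], [1, 2, 0, 2], [2, 1, 2, 0]]
norm_dist = [[0, 1, 3, 3], [1, 0, 3, 3], [3, 3, 0, 1], [3, 3, 1, 0]]
print(select_alpha(raw_dist, norm_dist, labels))  # (0.34, 0.0)
```

In this toy setting the raw distances alone misclassify every instance, and the nearest neighbor becomes correct as soon as α exceeds 1/3, so the first grid value with zero error is 0.34.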

2.4 Metric conditions and lower bounds

Since the DTW distance measure is not a metric, the new combined distance combDTW is not a metric either. However, as for the DTW distance, it is the case that $combDTW(X, X) = 0, \quad combDTW(X, Y) = combDTW(Y, X),$ for each fixed parameter α ∈ [0, 1].

To decrease the computation time of the 1NN method the lower bound technique is often used. If LB is a lower bound for DTW and normLB is a lower bound for normDTW then the function

$LB_{\alpha}(X, Y) = (1 - \alpha)\, LB(X, Y) + \alpha\, normLB(X, Y)$ (4) is a lower bound for the distance combDTW (for each fixed α ∈ [0, 1]). We can find many good lower bounds for the DTW distance on one-dimensional time series, for example LB_Keogh [16] or LB_Improved [20]. By the definition of DTW for multivariate time series (2), (3) we can easily transform the univariate lower bounds into multivariate lower bounds for the multivariate DTW. Therefore (by (4)) we can find good lower bounds for combDTW as well.
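The lower bound property of (4) follows in one step from the nonnegativity of the two weights. The short derivation below spells this out, using only the assumptions stated in the text, namely that LB ≤ DTW and normLB ≤ normDTW for all X, Y, and that α ∈ [0, 1]:

```latex
\begin{aligned}
LB_{\alpha}(X, Y)
  &= (1 - \alpha)\, LB(X, Y) + \alpha\, normLB(X, Y) \\
  &\le (1 - \alpha)\, DTW(X, Y) + \alpha\, normDTW(X, Y)
     && \text{since } 1 - \alpha \ge 0 \text{ and } \alpha \ge 0 \\
  &= combDTW(X, Y).
\end{aligned}
```

As with DTW and normDTW themselves, the two component bounds do not depend on α, so they can be computed once and reused for every value of the parameter.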

3 Results

3.1 Experimental setup

Experiments were performed on 16 data sets, all of which are non-normalized and have known class labels. The data sets originate from different domains, including medicine, robotics, handwriting recognition, etc. Information on the time series used is presented in Table 1 (UCI: [3], CMU MOCAP: [5, 25]). The number of time series per data set varies from 47 to 10992, the number of variables varies from 2 to 62 and the number of classes varies from 2 to 95.

Table 1
Datasets

Dataset	# instances	# variables	# classes	Min length	Max length	Source
ArabicDigits	8800	13	10	4	93	UCI
AUSLAN	2565	22	95	45	136	UCI
BCI	416	28	2	500	500	Blankertz
CharacterTrajectories	2858	3	20	109	205	UCI
CMUsubject16	58	62	2	127	580	CMU MOCAP
ECG	200	2	2	39	152	Olszewski
Graz	140	3	3	1152	1152	Leeb
JapaneseVowels	640	12	9	7	29	UCI
Libras	360	2	15	45	45	UCI
PenDigits	10992	2	10	8	8	UCI
RobotFailure LP1	88	6	4	15	15	UCI
RobotFailure LP2	47	6	5	15	15	UCI
RobotFailure LP3	47	6	4	15	15	UCI
RobotFailure LP4	117	6	3	15	15	UCI
RobotFailure LP5	164	6	5	15	15	UCI
Wafer	1194	6	2	104	198	Olszewski

The MTS samples in a data set may be of different lengths. For each data set, the MTS samples are extended to the length of the longest MTS sample in the data set. We extend all variables of the MTS to the same length. A short TS instance x of length n is enlarged to a long instance y of length n_max by $y(j) = x(i), \quad i = \left\lceil \frac{j - 1}{n_{\max} - 1}(n - 1) + 0.5 \right\rceil, \quad j = 1, 2, \dots, n_{\max}.$

Some of the values in the MTS sample are duplicated in order to extend it. In this way, all of the values in the original MTS sample appear in the extended MTS sample.
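The extension rule can be sketched in Python as follows. The function name `extend_series` is hypothetical, and the index is computed in exact integer arithmetic (an implementation choice made here to avoid floating-point rounding inside the ceiling).

```python
def extend_series(x, n_max):
    """Stretch a series of length n to length n_max by duplicating values."""
    n = len(x)
    if n == 0 or n_max < n:
        raise ValueError("need 1 <= len(x) <= n_max")
    if n_max == 1:
        return list(x)  # nothing to stretch (and avoids dividing by zero)
    out = []
    for j in range(1, n_max + 1):
        # i = ceil((j - 1) / (n_max - 1) * (n - 1) + 1/2), written as one fraction
        num = 2 * (j - 1) * (n - 1) + (n_max - 1)
        den = 2 * (n_max - 1)
        i = -(-num // den)  # integer ceiling division
        out.append(x[i - 1])  # the formula is 1-based, Python lists are 0-based
    return out


print(extend_series([1, 2, 3], 5))   # [1, 1, 2, 2, 3]
print(extend_series([10, 20], 4))    # [10, 10, 20, 20]
```

For a multivariate sample the same function would be applied to each of the m dimensions with the same n_max, so that all variables keep a common length.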

For the classification process the nearest neighbor method (1NN) is used for all compared distances: DTW, normDTW and combDTW. We use the leave-one-out cross-validation method to find the best parameter α for our classifier combDTW on a training subset. If the minimal error rate is the same for more than one value of α, we choose the smallest such value. A finite subset of parameters α is considered, from 0 to 1 with fixed step 0.01. For each data set we calculated the classification error rate using 10-fold cross-validation (1NN classifier).

3.2 Experimental results

Classification error rates for the 1NN method with the distance measures DTW, normDTW, and combDTW are presented in Table 2. The examined parametric combined distance combDTW is at least as good as both DTW and normDTW for almost all data sets, and strictly better than both on several of them. The only exception is the data set Libras, where combDTW performs slightly worse than DTW (8.89% vs. 8.61%), but much better than normDTW. A graphical comparison of the results for the pairs of classifiers DTW vs. combDTW and normDTW vs. combDTW (Fig. 4) confirms that the combined method outperforms both components.

Table 2
Test errors (10CV) of the compared methods (in %)

Dataset	DTW	normDTW	combDTW
ArabicDigits	0.19	0.22	0.14
AUSLAN	18.05	23.20	12.20
BCI	44.89	46.54	43.95
CharacterTrajectories	1.36	1.50	1.26
CMUsubject16	3.67	0.00	0.00
ECG	18.50	16.00	16.00
Graz	37.14	34.29	31.43
JapaneseVowels	2.03	36.09	2.03
Libras	8.61	18.61	8.89
PenDigits	0.65	0.63	0.63
RobotFailure LP1	12.64	28.06	12.64
RobotFailure LP2	32.00	34.00	32.00
RobotFailure LP3	29.00	48.00	29.00
RobotFailure LP4	10.08	16.21	10.08
RobotFailure LP5	29.30	36.54	29.30
Wafer	2.01	3.85	2.01

Fig.4

Graphical comparison of error rates.

The comparison DTW vs. normDTW shows, surprisingly, that the raw DTW is better than the normalized DTW. This paradox appears to be caused by the RobotFailure data sets, for which DTW is always much better than normDTW. If we report only one of the RobotFailure sets, then DTW and normDTW will have comparable results. It should be noted that the examined combined method combDTW performs very well on those data sets, and correctly detects that the DTW error rate is lower than the error rate of normDTW.

We present here a statistical comparison of the examined methods. For statistical comparison of two classifiers over multiple data sets, Demšar [9] recommends the Wilcoxon signed-ranks test. This is a non-parametric alternative to the paired t-test, which ranks the differences in the performances of two classifiers for each data set, ignoring the signs, and compares the ranks for the positive and the negative differences. The p-values of the Wilcoxon test are 0.0273 for DTW vs. combDTW and 0.0002 for normDTW vs. combDTW. Both are below 0.05, so at the 5% significance level the classifier with the combined distance combDTW outperforms the component methods DTW and normDTW.
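The test can be illustrated on the error rates of Table 2 with a small self-contained Python sketch. The text does not state which variant of the test was used, so the following are assumptions: an exact two-sided test, with data sets on which the two classifiers tie (zero difference) dropped before ranking. The function name `wilcoxon_exact_p` is hypothetical.

```python
def wilcoxon_exact_p(a, b, scale=100):
    """Exact two-sided Wilcoxon signed-rank p-value for paired samples."""
    # Integer hundredths, so that equal table entries give exact zeros.
    diffs = [round(u * scale) - round(v * scale) for u, v in zip(a, b)]
    diffs = [d for d in diffs if d != 0]  # assumption: zero differences dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    ranks2 = [0] * n  # doubled average ranks: integers even when values tie
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        for k in range(pos, end + 1):
            ranks2[order[k]] = pos + end + 2  # 2 * mean of ranks pos+1 .. end+1
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks2, diffs) if d < 0)
    w = min(w_plus, w_minus)
    # counts[s]: number of sign assignments whose (doubled) rank sum equals s
    total = sum(ranks2)
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return min(1.0, 2 * sum(counts[: w + 1]) / 2 ** n)


# Error rates from Table 2, in the order of the table rows.
dtw_err = [0.19, 18.05, 44.89, 1.36, 3.67, 18.50, 37.14, 2.03,
           8.61, 0.65, 12.64, 32.00, 29.00, 10.08, 29.30, 2.01]
norm_err = [0.22, 23.20, 46.54, 1.50, 0.00, 16.00, 34.29, 36.09,
            18.61, 0.63, 28.06, 34.00, 48.00, 16.21, 36.54, 3.85]
comb_err = [0.14, 12.20, 43.95, 1.26, 0.00, 16.00, 31.43, 2.03,
            8.89, 0.63, 12.64, 32.00, 29.00, 10.08, 29.30, 2.01]

print(round(wilcoxon_exact_p(dtw_err, comb_err), 4))   # 0.0273
print(round(wilcoxon_exact_p(norm_err, comb_err), 4))  # 0.0002
```

Under these assumptions the sketch gives values that round to the figures quoted above: nine data sets have a nonzero DTW vs. combDTW difference, of which only Libras favors DTW, and all thirteen nonzero normDTW vs. combDTW differences favor combDTW.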

We also examine the influence of the component distance measures DTW and normDTW on the combined value of the combDTW distance (Fig. 5). For each data set we choose one of the 10-fold splits into training and testing subsets to illustrate the contribution of the components and the correspondence between the cross-validation (leave-one-out) and test error rates. It is clear that there is no single universally best value of the parameter α for all data sets. The value of α corresponding to the minimal error rate is different for each data set. On the other hand, we can see that the minimum of the error is well localized: there is only one minimum for each error curve. The test error rate curve generally corresponds to the cross-validation error rate curve, so we can predict quite well the best value of the parameter α.

Fig.5

Correspondence of the parameter α and error rates. Dashed line: CV error; solid line: test error.

4 Conclusions

This paper has presented a classifier of multivariate time series based on the nearest neighbor method with the distance measure as a combination of the DTW distances on raw and z-normalized data. The research has shown that such a combined approach makes it possible to use the information contained in both raw and normalized data. The resulting classification errors were almost always (excluding one data set) lower than (or equal to) the classification errors of the component methods. It has been shown that for different data sets, the lowest classification error is obtained for different values of the parameter of the combined method. Also, in cases where the single component distance works best on raw or normalized data, the examined combined method can predict this by selecting the appropriate extreme value (0 or 1) of the parameter α. At the same time, there was no excessive effect of overfitting. The parametric approach of the method requires a parameter selection phase, which affects the calculation time on the learning subset (relative to nonparametric DTW). However, in situations where the calculation time of the learning phase is not critical, the combined approach presented here appears to be a good way to utilize the information contained in both raw and normalized data giving a clear improvement in the classification error rate on the examined data sets.

References

1. Aghabozorgi, Shirkhorshidi A.S. and Wah T.Y., Time-series clustering – A decade review, Information Systems 53 (2015), 16–38.

2. Auckenthaler, Carey and Lloyd-Thomas, Score normalization for text-independent speaker verification systems, Digital Signal Processing 10(1–3) (2000), 42–54.

3. Bache and Lichman, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], University of California, School of Information and Computer Science, Irvine, CA, 2013.

4. Berndt D.J. and Clifford, Using dynamic time warping to find patterns in time series, AAAI Workshop on Knowledge Discovery in Databases (1994), 229–248.

5. Blankertz, Curio and Müller K.R., Classifying single trial EEG: Towards brain computer interfacing, In: Dietterich T.G., Becker and Ghahramani (Eds.), Advances in Neural Inf Proc Systems 14 (NIPS 01), 2002. Available from: http://www.bbci.de/competition/ii/

6. Bolstad B.M., Irizarry R.A., Astrand and Speed T.P., A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, Bioinformatics 19(2) (2003), 185–193.

7. Box G.E.P., Jenkins G.M. and Reinsel G.C., Time series analysis: Forecasting and control, Wiley, 2008.

8. Carnegie Mellon University Motion Capture Database (2014). Available from: http://mocap.cs.cmu.edu/.

9. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006), 1–30.

10. Ertuğrul Ö.F. and Tağluk M.E., A novel version of k nearest neighbor: Dependent nearest neighbor, Applied Soft Computing 55 (2017), 480–490.

11. Górecki and Łuczak, Using derivatives in time series classification, Data Mining and Knowledge Discovery 26(2) (2013), 310–331.

12. Górecki and Łuczak, First and second derivative in time series classification using DTW, Communications in Statistics-Simulation and Computation 43(9) (2014a), 2081–2092.

13. Górecki and Łuczak, Non-isometric transforms in time series classification using DTW, Knowledge-Based Systems 61 (2014b), 98–108.

14. Górecki and Łuczak, Multivariate time series classification with parametric derivative dynamic time warping, Expert Systems with Applications 42(5) (2015), 2305–2312.

15. Han and Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, USA, 2001.

16. Keogh, Exact indexing of dynamic time warping, In: 28th International Conference on Very Large Data Bases, 2002, pp. 406–417.

17. Keogh and Kasetty, On the need for time series data mining benchmarks: A survey and empirical demonstration, Data Mining and Knowledge Discovery 4(7) (2003), 349–371.

18. Larose D.T. and Larose C.D., Discovering Knowledge in Data: An Introduction to Data Mining, Wiley, 2014.

19. Leeb, Lee, Keinrath, Scherer, Bischof and Pfurtscheller, Brain-computer communication: Motivation, aim, and impact of exploring a virtual apartment, IEEE Transactions on Neural Systems and Rehabilitation Engineering 15 (2007), 473–482. Available from: http://www.bbci.de/competition/iv/

20. Lemire, Faster retrieval with a two-pass dynamic time-warping lower bound, Pattern Recognition 42(9) (2009), 2169–2180.

21. Łuczak, Hierarchical clustering of time series data with parametric derivative dynamic time warping, Expert Systems with Applications 62 (2016), 116–130.

22. Łuczak, Univariate and multivariate time series classification with parametric integral dynamic time warping, Journal of Intelligent and Fuzzy Systems 33(4) (2017), 2403–2413.

23. Merigó J.M., Palacios-Marqués and Soto-Acosta, Distance measures, weighted averages, OWA operators and Bonferroni means, Applied Soft Computing 50 (2017), 356–366.

24. Morrison D.F., Multivariate statistical methods, McGraw-Hill, 1990.

25. Olszewski R.T., Generalized Feature Extraction for Structural Pattern Recognition in Time-Series Data, Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, 2001. Available from: http://www.cs.cmu.edu/~bobski

26. Seber G.A.F., Multivariate Observations, Wiley, 1984.

27. Sola and Sevilla, Importance of input data normalization for the application of neural networks to complex industrial problems, IEEE Transactions on Nuclear Science 44(3), 1464–1468.

28. Warren Liao, Clustering of time series data – a survey, Pattern Recognition 38(11) (2005), 1857–1874.

29. Zhai, Xu and Wang, Dynamic ensemble extreme learning machine based on sample entropy, Soft Computing 16(9) (2012), 1493–1502.

30. Zhai, Wang and Pang, Voting-based instance selection from large data sets with mapreduce and random weight networks, Information Sciences 367 (2016), 1066–1077.

31. Zhai, Zhang and Wang, The classification of imbalanced large data sets based on mapreduce and ensemble of ELM classifiers, Journal of Machine Learning and Cybernetics 8(3) (2017), 1009–1017.