Abstract
Data normalization is one of the most common processing methods applied to raw data before its subsequent use in data mining algorithms, classification, or clustering methods. Many procedures, particularly those that use any statistical analysis, require that data be normalized in one way or another. In the case of time series a standard method of processing raw data is z-normalization of each time series instance in the data set. For multivariate (multidimensional) time series we z-normalize each dimension (variable) individually. Although normalization brings a lot of advantages, it is easy to find examples of data sets where normalization destroys information contained in the raw data. In this paper we demonstrate that for multivariate time series (MTS) both raw and normalized components give some information about the data, and that the best way of mining it is to combine them. We focus here on multidimensional time series and their classification using the nearest neighbor method with the dynamic time warping (DTW) distance measure. We construct a parametric distance measure that is a combination of DTW on raw and z-normalized time series data. It turns out that the combined distance measure carries more information about the data than the two distance components separately. By determining an individual parameter for each data set it is possible to obtain a lower classification error than the errors of both component distance measures. We perform experiments on real data sets from many fields of science and technology. The advantage of the combined approach is confirmed by graphical and statistical comparisons.
Introduction
Data normalization is one of the most common transformations of raw data in the theory and applications of data mining and machine learning [15, 29]. Many works demonstrate, theoretically and empirically, the necessity and usefulness of data normalization in various fields of science and technology [2, 31]. For time series data [1, 28], both one- and multidimensional, the most common basic method is z-normalization, where we normalize each data series instance (each dimension for multivariate time series) in the data set separately. Z-normalization of a time series is a transformation resulting in a new time series of the same length, for which the arithmetic mean of the time series values is equal to zero and the standard deviation is equal to one. The need for time series normalization is often emphasized in classification methods with dynamic time warping and other distance measures [10, 23].
Despite the undeniable advantages of time series normalization, it is easy to identify artificial data where z-normalization destroys information contained in the classes of a given data set. An example may be two classes of similar or even identical waveform signals which differ strongly in amplitude. The natural discriminatory factor for these classes is the amplitude of the signals, and the process of normalization completely destroys the information in the classes. After the normalization process both classes will be indistinguishable. Especially in the case of multivariate time series an amplitude distinction between specific dimensions of the MTS can have a large impact on the discrimination process. Z-normalization of all dimensions in all instances of MTS data collection can cut out information resulting in deterioration of the classification results.
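The amplitude example above can be illustrated with a minimal sketch. It assumes the population standard deviation (division by n) and a sine waveform chosen only for illustration; the helper name `z_normalize` is hypothetical and not taken from the paper.

```python
import math

def z_normalize(series):
    """Return a copy of `series` with zero mean and unit standard deviation."""
    n = len(series)
    mean = sum(series) / n
    # Population standard deviation (divide by n); an assumption of this sketch.
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if std == 0.0:
        # A constant series has no scale; map it to all zeros.
        return [0.0] * n
    return [(v - mean) / std for v in series]

# Two signals with the same waveform but a tenfold difference in amplitude.
low = [math.sin(2 * math.pi * t / 20) for t in range(40)]
high = [10.0 * v for v in low]

# Largest pointwise gap between the two signals before and after normalization.
raw_gap = max(abs(a - b) for a, b in zip(low, high))
norm_gap = max(abs(a - b) for a, b in zip(z_normalize(low), z_normalize(high)))
print(raw_gap, norm_gap)
```

The raw signals are clearly separated, while their normalized versions coincide up to rounding error, so any distance computed on normalized data alone cannot tell the two classes apart.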
In this paper we show how information from raw and z-normalized data can be used together in the classification of multivariate time series. We focus on the classification of multivariate (multidimensional) time series using the nearest neighbor method (1NN) with the dynamic time warping (DTW) distance measure. We construct a parametric measure that is a combination of the DTW distance on the raw MTS data and the DTW distance on the z-normalized instances of the same data set. A single real parameter controls the contribution of each of the distance components to the final value of the combined distance. Similar techniques have been used in previous work for combinations of other distances; for example, for multivariate time series with a combination of DTW and a derivative distance DDTW or an integral distance IDTW [14, 22], or for one-dimensional time series data classification [11–13] and clustering [21]. In this paper, we show that a combination of raw and normalized data gives better classification results for multivariate time series with 1NN and DTW than the same classification using only raw data or only normalized data. The main idea of the paper is that both raw and normalized data contain useful information that can improve the performance of the classifiers. We also show that the combining approach is essential here: only a combination of the two distances can outperform the component measures. Furthermore, we show that the parametric approach is important. The parameter of the combined distance varies greatly depending on the data set, determining the strength of the contributions of the respective distance components to the final distance. Empirical experiments were carried out on 16 real data sets of multivariate time series from many fields of science, technology and medicine.
The remainder of the paper is organized as follows. Section 2 contains basic definitions of the DTW distance measure for univariate and multivariate time series. The new parametric distance as a combination of DTW on raw and z-normalized data is presented, and some efficiency optimizations of the tested method are described. Section 3 contains computational experiments. Classification error rates for the combined method and for its components are presented followed by a detailed discussion of the results and graphical and statistical comparisons. Finally, Section 4 contains conclusions and sums up the experiments performed.
Multivariate combined DTW
A univariate (one-dimensional) time series is a sequence of observations in time [7]. In this paper we assume that the time series is discrete, i.e., it is a finite sequence of real numbers:
$$x = (x(1), x(2), \ldots, x(n)).$$
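A multivariate (multidimensional) time series is defined as a finite sequence of univariate series:
$$X = (x_1, x_2, \ldots, x_m),$$
where each $x_k = (x_k(1), x_k(2), \ldots, x_k(n))$, $k = 1, \ldots, m$, is a univariate time series.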
Dynamic time warping
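The dynamic time warping (DTW) distance measure is a very popular distance measure used to compute the similarity/dissimilarity of time series data [4]. To compute the value of DTW for two one-dimensional time series x and y with length n, we first define a local cost function d, here the squared difference of the aligned values:
$$d(x(i), y(j)) = (x(i) - y(j))^2. \tag{1}$$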
Then we construct a square matrix D with dimension n × n consisting of the values of the local cost function, $D(i, j) = d(x(i), y(j))$. The matrix element $D(i, j)$ corresponds to the alignment between the values $x(i)$ and $y(j)$ of the time series. Then we construct a warping path $W = \{w_1, w_2, \ldots, w_K\}$ subject to the following conditions: $w_1 = D(1, 1)$ and $w_K = D(n, n)$ (boundary conditions); if $w_k = D(i_k, j_k)$ and $w_{k+1} = D(i_{k+1}, j_{k+1})$ then $i_{k+1} - i_k \leq 1$ and $j_{k+1} - j_k \leq 1$ (continuity); $i_{k+1} - i_k \geq 0$ and $j_{k+1} - j_k \geq 0$ (monotonicity).
To obtain the warping path we begin at the element D(1, 1) and, moving at most one index up or right at each step, finish at the element D(n, n) (Fig. 1). The path that minimizes the warping cost gives the value of the DTW distance:
$$\mathrm{DTW}(x, y) = \min_{W} \sum_{k=1}^{K} w_k.$$

Fig. 1. Time series alignment and the corresponding warping path.
Sometimes, the DTW is defined as the square root of this value.
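In practice, we often compute the value of the DTW distance by building the cumulative distance matrix Γ. For this we use dynamic programming with the following recursion:
$$\Gamma(i, j) = D(i, j) + \min\{\Gamma(i-1, j),\ \Gamma(i, j-1),\ \Gamma(i-1, j-1)\}, \quad i, j = 1, \ldots, n,$$
with $\Gamma(0, 0) = 0$ and $\Gamma(i, 0) = \Gamma(0, j) = \infty$ for $i, j \geq 1$.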
Then the DTW value is found at position (n, n) of the matrix Γ:
$$\mathrm{DTW}(x, y) = \Gamma(n, n).$$
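The dynamic programming scheme can be sketched as follows. This is a minimal, unoptimized illustration that assumes the squared difference as the local cost and pads the cumulative matrix with an extra row and column of infinities; the function name `dtw` is hypothetical.

```python
import math

def dtw(x, y):
    """DTW value of two univariate series via the cumulative distance matrix."""
    n, p = len(x), len(y)
    # gamma[i][j] is the minimal cumulative cost of aligning x[:i] with y[:j].
    gamma = [[math.inf] * (p + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, p + 1):
            # Local cost: squared difference (an assumption of this sketch).
            cost = (x[i - 1] - y[j - 1]) ** 2
            gamma[i][j] = cost + min(gamma[i - 1][j],      # step in x only
                                     gamma[i][j - 1],      # step in y only
                                     gamma[i - 1][j - 1])  # step in both
    return gamma[n][p]

print(dtw([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))
```

In this usage example the two series differ only by a shift in time, which the warping absorbs. The sketch needs O(n²) time and memory per pair of series.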
The DTW distance measure is not a metric, as it does not satisfy the triangle inequality. However, it is the case that DTW(x, x) = 0 and DTW(x, y) = DTW(y, x) if we use the cost function (1).
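Since we assume that the time series length for each dimension of the multi-series in a given data set is the same, we can regard an MTS as a one-dimensional trajectory in m-dimensional Euclidean space:
$$X = (X(1), X(2), \ldots, X(n)), \qquad X(i) = (x_1(i), x_2(i), \ldots, x_m(i)) \in \mathbb{R}^m.$$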
Then we can define the DTW distance measure between two multi-series X and Y [14] in the same way as for one-dimensional series, with the local cost function d defined by:
$$d(X(i), Y(j)) = \sum_{k=1}^{m} (x_k(i) - y_k(j))^2.$$
This means that the local cost function is the squared Euclidean distance between the m-dimensional vectors obtained by taking the values along the dimensions of the multi-series at positions i and j (Fig. 2).

Fig. 2. MTS alignment and the cost function.
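A minimal sketch of the multivariate case follows. It assumes that an MTS is stored as a list of m dimensions, each a list of equal length, and it reuses the standard dynamic programming recursion; the names `local_cost` and `mts_dtw` are hypothetical.

```python
import math

def local_cost(X, Y, i, j):
    """Squared Euclidean distance between the m-dimensional points X(i) and Y(j)."""
    return sum((xk[i] - yk[j]) ** 2 for xk, yk in zip(X, Y))

def mts_dtw(X, Y):
    """DTW value of two MTS given as lists of m dimensions of equal length."""
    n, p = len(X[0]), len(Y[0])
    gamma = [[math.inf] * (p + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, p + 1):
            # One warping path is shared by all dimensions.
            gamma[i][j] = local_cost(X, Y, i - 1, j - 1) + min(
                gamma[i - 1][j], gamma[i][j - 1], gamma[i - 1][j - 1])
    return gamma[n][p]

X = [[0.0, 0.0, 1.0], [0.0, 0.0, 2.0]]  # m = 2 dimensions, n = 3 points
Y = [[0.0, 1.0, 1.0], [0.0, 2.0, 2.0]]
print(mts_dtw(X, Y))
```

Because the local cost is evaluated on whole m-dimensional points, all dimensions are warped together rather than aligned independently.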
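We define the normalized MTS as an MTS in which each of the m dimensions is z-normalized separately, i.e.,
$$\tilde{x}_k(i) = \frac{x_k(i) - \mu_k}{\sigma_k}, \quad k = 1, \ldots, m, \quad i = 1, \ldots, n,$$
where $\mu_k$ and $\sigma_k$ are the arithmetic mean and the standard deviation of the k-th dimension. The distance normDTW(X, Y) is then the DTW distance between the normalized multi-series $\tilde{X}$ and $\tilde{Y}$.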
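We define a parametric combined dynamic time warping distance measure (combDTW) as a convex combination of the DTW and normDTW distance measures:
$$\mathrm{combDTW}_\alpha(X, Y) = (1 - \alpha)\,\mathrm{DTW}(X, Y) + \alpha\,\mathrm{normDTW}(X, Y), \quad \alpha \in [0, 1].$$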
The distance function combDTW can be used in the nearest neighbor classification method, where the parameter α is chosen in the learning phase (on the learning data set). In this paper α will be found by the leave-one-out cross-validation method on the training data set.
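The combined distance can be sketched as follows. The sketch assumes that the weight 1 − α is placed on the raw component and α on the normalized one, that an MTS is a list of m dimensions, and that the base distance is passed in as a function `dist`. The names `z_normalize_mts`, `comb_distance` and `lockstep` are hypothetical, and `lockstep` is only a simple stand-in for a multivariate DTW implementation.

```python
import math

def z_normalize_mts(X):
    """Z-normalize each dimension of an MTS (list of m lists) separately."""
    out = []
    for dim in X:
        n = len(dim)
        mean = sum(dim) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in dim) / n)
        # Constant dimensions are mapped to zeros to avoid division by zero.
        out.append([0.0] * n if std == 0.0 else [(v - mean) / std for v in dim])
    return out

def comb_distance(X, Y, alpha, dist):
    """(1 - alpha) * dist on raw data + alpha * dist on normalized data.

    Which component receives alpha is an assumption of this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    raw = dist(X, Y)
    norm = dist(z_normalize_mts(X), z_normalize_mts(Y))
    return (1.0 - alpha) * raw + alpha * norm

def lockstep(X, Y):
    """Stand-in base distance: sum of squared pointwise differences."""
    return sum((a - b) ** 2 for xk, yk in zip(X, Y) for a, b in zip(xk, yk))

X, Y = [[0.0, 1.0]], [[0.0, 10.0]]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, comb_distance(X, Y, alpha, lockstep))
```

In practice `dist` would be a multivariate DTW function; the two component distances do not depend on α, so they can be computed once and reused for every candidate value.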
Since the parameter α is located outside the DTW and normDTW distances, to compute the value of combDTW for all α ∈ [0, 1] we need to compute DTW and normDTW only once. This allows us to make some computational optimizations in the learning phase of the nearest neighbor method. The optimized algorithm for leave-one-out cross-validation on the training set is presented in Fig. 3.

Fig. 3. Implementation of the optimized algorithm for the leave-one-out cross-validation routine (Matlab code).
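The idea behind the optimization can also be sketched in Python. This is an independent illustration, not a translation of the Matlab routine in the figure. It assumes that the pairwise distance matrices `raw` (DTW) and `norm` (normDTW) of the training set have already been computed, that the weight α is placed on the normalized component, and that ties between candidate values are resolved in favor of the smallest α. The function name `loo_select_alpha` is hypothetical.

```python
import math

def loo_select_alpha(raw, norm, labels, step=0.01):
    """Choose alpha by leave-one-out 1NN error on precomputed distance matrices."""
    n = len(labels)
    n_steps = round(1 / step)
    best_alpha, best_errors = 0.0, n + 1
    for s in range(n_steps + 1):
        alpha = s / n_steps
        errors = 0
        for i in range(n):
            nearest, nearest_d = None, math.inf
            for j in range(n):
                if j == i:
                    continue  # leave instance i out of its own neighbor search
                # Only a cheap weighted sum is recomputed for each alpha.
                d = (1.0 - alpha) * raw[i][j] + alpha * norm[i][j]
                if d < nearest_d:
                    nearest, nearest_d = j, d
            if labels[nearest] != labels[i]:
                errors += 1
        if errors < best_errors:  # strict comparison keeps the smallest alpha
            best_alpha, best_errors = alpha, errors
    return best_alpha, best_errors / n

# Toy matrices: `good` separates the two classes, `bad` confuses them.
good = [[0, 1, 5, 5], [1, 0, 5, 5], [5, 5, 0, 1], [5, 5, 1, 0]]
bad = [[0, 5, 1, 5], [5, 0, 5, 1], [1, 5, 0, 5], [5, 1, 5, 0]]
labels = [0, 0, 1, 1]
print(loo_select_alpha(good, bad, labels))
print(loo_select_alpha(bad, good, labels))
```

The expensive DTW computations happen once, outside the loop over α, so scanning the whole parameter grid adds only O(n²) arithmetic per candidate value.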
Since the DTW distance measure is not a metric, the new combined distance combDTW is not a metric either. However, as for the DTW distance, it is the case that combDTW(X, X) = 0 and combDTW(X, Y) = combDTW(Y, X).
To decrease the computation time of the 1NN method, the lower bound technique is often used. If LB is a lower bound for DTW and normLB is a lower bound for normDTW, then the convex combination of LB and normLB with the same weights as in combDTW is a lower bound for combDTW.
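The lower bound property follows directly from the non-negativity of the weights. A short derivation is given below, assuming that the weight 1 − α is placed on the raw component and α on the normalized one; the argument is identical for the opposite placement.

```latex
\begin{align*}
\mathrm{LB}(X,Y) &\le \mathrm{DTW}(X,Y), \qquad
\mathrm{normLB}(X,Y) \le \mathrm{normDTW}(X,Y), \qquad \alpha \in [0,1] \\
\Longrightarrow\quad
(1-\alpha)\,\mathrm{LB}(X,Y) + \alpha\,\mathrm{normLB}(X,Y)
  &\le (1-\alpha)\,\mathrm{DTW}(X,Y) + \alpha\,\mathrm{normDTW}(X,Y)
   = \mathrm{combDTW}_\alpha(X,Y).
\end{align*}
```

Both component bounds are independent of α, so they can be computed once per pair of series and reused for every candidate value of the parameter.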
Experimental setup
Experiments were performed on 16 labeled, non-normalized data sets. The data sets originate from different domains, including medicine, robotics, handwriting recognition, etc. Information on the time series used is presented in Table 1 (UCI: [3]; CMU MOCAP: [5, 25]). The number of time series per data set varies from 47 to 10992, the number of variables varies from 2 to 62 and the number of classes varies from 2 to 95.
Table 1. Datasets
The MTS samples in each data set are of different lengths. For each data set, the MTS samples are extended to the length of the longest MTS sample in the data set. We extend all variables of the MTS to the same length. For a short TS instance x with length n we enlarge it to a long instance y with length nmax by
Some of the values in the MTS sample are duplicated in order to extend it. In this way, all of the values in the original MTS sample appear in the extended MTS sample.
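One way of extending a series by duplication is sketched below. The index mapping used here, y(j) = x(⌊j·n/nmax⌋) with zero-based indices, is an assumption chosen only because it has the property described above (every original value is kept and the order is preserved); it is not necessarily the mapping used in the experiments. The function name `extend_series` is hypothetical.

```python
def extend_series(x, n_max):
    """Stretch a series of length n to length n_max by duplicating values."""
    n = len(x)
    if n_max < n:
        raise ValueError("n_max must be at least the length of x")
    # Zero-based mapping j -> floor(j * n / n_max); it is non-decreasing and
    # reaches every index 0..n-1, so no original value is lost.
    return [x[(j * n) // n_max] for j in range(n_max)]

print(extend_series([1, 2, 3], 5))
```

For an MTS the same mapping would be applied to every dimension, so that all variables of a sample keep a common length.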
For the classification process the nearest neighbor method (1NN) is used for all compared distances: DTW, normDTW and combDTW. We use the leave-one-out cross-validation method to find the best parameter α for our classifier combDTW on a training subset. If the minimal error rate is the same for more than one value of α, we choose the smallest such value. A finite subset of parameters α is considered, from 0 to 1 with fixed step 0.01. For each data set we calculated the classification error rate using 10-fold cross-validation (1NN classifier).
Classification error rates for the 1NN method with the distance measures DTW, normDTW, and combDTW are presented in Table 2. The examined parametric combined distance combDTW is better than DTW and normDTW for almost all data sets. The only exception is the data set Libras, where combDTW performs slightly worse than DTW, but much better than normDTW. A graphical comparison of results for the pairs of classifiers: DTW vs. combDTW and normDTW vs. combDTW (Fig. 4) confirms that the combined method outperforms both components.
Table 2. Test errors (10CV) of the compared methods (in %)

Fig. 4. Graphical comparison of error rates.
The comparison DTW vs. normDTW shows, surprisingly, that the raw DTW is better than the normalized DTW. This paradox appears to be caused by the RobotFailure data sets, for which DTW is always much better than normDTW. If we report only one of the RobotFailure sets, then DTW and normDTW have comparable results. It should be noted that the examined combined method combDTW performs very well on those data sets, and faultlessly detects that the DTW error rate is lower than the error rate of normDTW.
We present here a statistical comparison of the examined methods. For the statistical comparison of two classifiers over multiple data sets, [9] recommends the Wilcoxon signed-ranks test. This is a non-parametric alternative to the paired t-test, which ranks the differences in the performances of two classifiers for each data set, ignoring the signs, and compares the ranks for the positive and the negative differences. The p-values of the Wilcoxon test are 0.0273 for DTW vs. combDTW and 0.0002 for normDTW vs. combDTW. Both values are below 0.05, so at the 5% significance level the classifier with the combined distance combDTW outperforms the component methods DTW and normDTW.
We also examine the influence of the component distance measures DTW and normDTW on the combined value of the combDTW distance (Fig. 5). For each data set we choose one of the 10-fold splits into training and testing subsets to illustrate the contribution of the components and the correspondence between the cross-validation (leave-one-out) and test error rates. It is clear that there is no single universal best value of the parameter α for all data sets. The value of α corresponding to the minimal error rate is different for each data set. On the other hand, we can see that the minimum of the error is well positioned: there is only one minimum for each error curve. The test error rate curve generally corresponds to the cross-validation error rate curve, so we can predict quite well the best value of the parameter α.

Fig. 5. Correspondence of the parameter α and error rates. Dashed line: CV error; solid line: test error.
This paper has presented a classifier of multivariate time series based on the nearest neighbor method with the distance measure as a combination of the DTW distances on raw and z-normalized data. The research has shown that such a combined approach makes it possible to use the information contained in both raw and normalized data. The resulting classification errors were almost always (excluding one data set) lower than (or equal to) the classification errors of the component methods. It has been shown that for different data sets, the lowest classification error is obtained for different values of the parameter of the combined method. Also, in cases where the single component distance works best on raw or normalized data, the examined combined method can predict this by selecting the appropriate extreme value (0 or 1) of the parameter α. At the same time, there was no excessive effect of overfitting. The parametric approach of the method requires a parameter selection phase, which affects the calculation time on the learning subset (relative to nonparametric DTW). However, in situations where the calculation time of the learning phase is not critical, the combined approach presented here appears to be a good way to utilize the information contained in both raw and normalized data giving a clear improvement in the classification error rate on the examined data sets.
