Abstract
The traditional time series data clustering for landslide displacement prediction is based on Euclidean distance measure. The time series data is clustered by distance calculation of two vectors. The correlation between components is not considered. The multiple components with single feature will interfere with the clustering results, and the accuracy of clustering results is greatly reduced. To address this problem, an intelligent clustering algorithm for time series data in landslide displacement prediction based on nonlinear dynamic time bending is proposed in this paper. By reconstructing the phase space of the landslide displacement time series, the phase space transposed matrix is obtained as the time series reconstruction matrix. After embedding dimension processing, the time series of landslide displacement is predicted by SVM data mining model. Dynamic time warping calculation is based on the correlation of time series sequence and the components. The local optimal solution is obtained by recursive search, and the whole curve path is obtained. Clustering calculation of time series data set is carried out by using hierarchical clustering algorithm according to bending path. The intelligent clustering results of time series data in landslide displacement prediction is obtained. Experimental results show that the proposed algorithm has better clustering effect and higher clustering accuracy.
Keywords
Introduction
The landslide disaster causes serious damage to mankind. The reason is the complexity of the forming conditions and the diversity and randomness of the external factors [1], which make it difficult for human beings to know exactly the time, place, intensity and effect of their occurrence in advance. Therefore, it is an important measure to prevent and reduce the disaster by taking the necessary means to monitor, and then predict the landslide disaster scientifically and effectively [2]. Time series is a common form of data in scientific research and business applications [3]. In the research of landslide displacement evolution, due to the different shape, type and scale of landslide, a single landslide is usually selected to research the change of displacement time curve [4]. In the traditional displacement time series research, the time series analysis method is used to research the time evolution of the landslide directly from the displacement sequence [5].
Recently, researchers have begun to pay attention to the research of time series data mining algorithm [6]. Many statistical methods have been applied to the analysis of time series, but the good results cannot be obtained with the concept of similarity of time series and its search method [7]. In fact, almost all time series algorithms are related to the calculation of similarity between the sequences [8]. Euclidean distance is used in the typical similarity measure [9]. But after analysis, some limitations of the Euclidean distance measure will be found. The main reason is that Euclidean distance as a similarity measure, does not identify the shape distortion of time series data on time axis, and is not robust to data noise [10]. With the rapid increase of data storage, especially the emergence of data warehouse [11], the demand for time series data mining algorithm is becoming increasingly urgent.
In the field of data mining, the research work should focus on the efficient and practical clustering analysis of large databases. The unique structure of time series makes the vast majority of traditional machine learning algorithms do not achieve good performance in time series data analysis [12]. Especially under the high dimensional, multi complex and noisy conditions, the research of time series data clustering mining algorithm has become a hot topic. The basis of clustering is to search the similarity of the different objects in the data set to find out the similarity between the data [13]. Therefore, the focus of the research is also turned to provide a novel, feasible and highly credible similarity measure for clustering algorithm. Based on this, a hierarchical clustering algorithm based on nonlinear dynamic time warping is proposed in this paper. Experimental results show that the clustering results obtained by the proposed algorithm are not subject to multi-component interference with single feature, and have high recognition and matching accuracy, and strong robustness to noise, phase drift and amplitude difference.
Material and methods
Time series data intelligent clustering algorithm for landslide displacement prediction based on nonlinear dynamic time warping
Extraction of time series of landslide displacement
Phase space reconstruction of time series of landslide displacement. There is an interaction between components in the phase space reconstruction system, and the change of any component is inseparable from the other components [14]. The information of these components can be hidden in the process of any component change. Therefore, the chaotic behavior of the system can be researched by any univariate time data of the long term evolution of the system.
Assume a known univariate time series is { x i , i = 0, 1, 2, …, N }, the time sampling interval is ΔT, reconstruction delay is τ, x0 = x i , x p = xi σ +pτ. The set of time delays { X i , i =, 1, 2, …, N - (m - 1) τ } describes the evolution trajectory of the system in phase space, where X i = (x i , xi+1, …, xi+m-1).
The phase space of time series of the landslide displacement is reconstructed, the delay time and the embedding dimension are introduced, and a proper model is established. The dynamic characteristics of the landslide displacement system can be researched according to multidimensional phase space transformed by one-dimensional displacementsequence.
Embedding dimension. The phase space transposed matrix of reverse sequence reconstruction is the time series reconstruction matrix given by Equation (1). m is set to the integer N/2 and the matrix includes all the samples.
For the convenience of calculation, first, the restructured displacement time series matrix is expanded based on the original sequence according to the bottom-up order, and the zero is filled in the blank position. Assume d
ij
is the displacement record value of the Ith row and the jth column in the expanded displacement time series matrix. The entropy value is calculated by using
Find out the dimension of the corresponding peak point. For any three continuous dimensions defined in the change curve of entropy, if the corresponding value of the middle dimension is greater than the value of the two sides of dimension, then the entropy value corresponding to the middle dimension is the peak value, and the middle dimension is the peak dimension, given by
Time series prediction by using SVM data mining model. Support vector machine (SVM) is a good way to achieve structural risk minimization. It seeks trade-off between the approximation accuracy of the given data and the complexity of approximation function, in order to obtain the best generalization ability. In theory, the support vector machine solves the local extreme problem that cannot be avoided in the neural network method [15]. The real problem is converted to high dimensional feature space by nonlinear transformation. Linear decision function in high dimensional space is constructed to realize nonlinear decision function in the original space. Oracle data mining (ODM) powerful data mining capabilities are provided by native SQL functions in the Oracle database. Oracle SVM achieves linear expansion based on the complexity of the algorithm and small size of samples. But before creating the model, in order to satisfy the smoothness of the data, there is also a need for trend movement, target conversion, and attribute selection dataprocessing.
Data processing. The logarithmic transformation of the known observation data sequence can reduce the correlation between the unobservable error and the predictive variable. Difference can eliminate its trend and reduce its fluctuation. The normalization of the Z-score method can make it fluctuate near the zero value and become a stationary sequence, that is,td
i
= log(d
i
), tdi+1 = tdi+1 - td
i
,
Building of SVM data mining model. ODM can implement basic functions such as modeling, testing, and application models of data mining through the call of PL/SQL API and other interfaces. ODM SVM regression can support time series modeling through time delay or lags methods, and simplify training. In the building of landslide displacement time series model, the displacement time series data is the goal of the model building, while the displacement sample data will reserve a small part as the test data and the rest of the data is the input part of the model as the past value of the displacement. The univariate displacement sequence is transformed into the training set of the model after the transformation of the time series reconstruction matrix. As the nonstationarity of the displacement data, a series of data processing is required before the training model. Then the SVM data mining model is created by using PL/SQL API training on the processed training set. Meanwhile, the multi-step prediction is carried out by using the obtained model, and the actual landslide displacement prediction time series after n time of the prediction result is obtained after the reduction operation of the predicted result contrary to the data processing
Dynamic time warping. Assume the obtained perdition time series in the above section are Q and C with the data length n and m, which is given by
In order to align the two time series by dynamic time warping, a distance difference matrix isdefined.

Example of dynamic time warping.
According to definition 2, the warping path meets the following conditions.
Boundedness.
Boundary condition. w1 = D _ matrix (q1, c1) and w K = D _ matrix (q n , c m ), that is, the beginning and stop elements of the warping path are both ends of the diagonal of the distance matrix.
Continuity. For given
Monotonicity. For given
In addition to the above constraints, it is necessary to restrict the walking area of the warping path in practical applications. For the convenience of matrix processing, the swing range of the warping path should be restricted near the diagonal of the distance matrix, shown in the dotted line in Fig. 1. Therefore, the subset of the warping path is called a warping window without considering the distance similarity factors outside the region. There are two main reasons for the setting of the warping window. The first is to accelerate the running speed of DTW algorithm. The computational complexity of DTW algorithm is O(mn). In the application of time series data mining, whole sequence matching and subsequence matching are carried out. For these two cases, the computational complexity will be up to O(n2X). Therefore, it is necessary to limit the swing of the warping path. The second is to prevent the logically ill-conditioned warping path. In the time series analysis, the time factor is very important, the large time offset is of no practical significance. Logically, it should be ignored.
From the analysis of the distance matrix, it can be known that, there is a multiple solution to the warping path. But what we care about is actually the path with the least total length. The path with the highest similarity a (the minimum distance) is taken as the criterion for similar search, given by
In theory, the exhaustive search method is used to find the warping path that satisfies the condition. But it is often unrealistic for large database analysis. According to dynamic programming theory, assume the point (i, j) is on the best path, then the subpath from the point (1, 1) to (i, j) is local optimal solution, that is, the best path from the point (1, 1) to the point (m, n) can be obtained by recursively searching the local optimal solution.
The minimum cumulative value of the final warping path of time series is Sm,n. Starting from Sm,n, the whole warping path is obtained through iteration with minimum cumulative value until back to the start point S1,1.
Intelligent clustering of time series data. The clustering of time series is to group the time series objects in the database into multiple classes or clusters according to the similarity or dissimilarity measure. It is to mine an event or other event state in the same attribute time series database of different time domains [16]. Cluster analysis can also be used as a preprocessing step for other algorithms (feature extraction, classification).
In this paper, the clustering includes two parts. The Euclidean distance between any two sequences in the time series database is calculated according to the obtained time series data to form the distance matrix D_matrix. In the distance matrix, dynamic programming method is used to obtain the dynamic time warping path as similarity measure. According to the dynamic time warping path, the clustering algorithm with the hierarchical method is used to cluster the time series data set, and the hierarchical tree is generated.
The algorithm first examines the location of the data. If it falls outside the warping window, it is replaced with NaN. Then the recursive call function given by Equation (8) is used to find the minimum warping path length. The extract path function is called to find out the distance matrix elements of the warping path. In this way, the results of different alignment of time series data can be obtained, as shown in Fig. 2.

Dynamic time warping measure.
The selection of clustering algorithms depends on the type of data, the purpose and application of clustering [17]. The hierarchical clustering method is selected in this paper, which decomposes all the records hierarchically. The hierarchical method can be divided into two types: the condensation mode and the segmentation mode, which depends on whether the formation of the hierarchical structure is top-down or bottom-up. In this paper, the segmentation mode is selected. In this mode, the entire database is taken as a large class, and then divided into small classes according to some rules until some predetermined conditions are met. Generally speaking, there is no best classification method for a database. The best combination point for clustering is to be found between the similarity of the object in the class and the number of classes [18]. Figure 3 shows the flow chart of the proposed algorithm.

Flow chart of the proposed algorithm.
Experimental results and analysis
To verify the proposed clustering algorithm, experiments are carried out. 3 GPS equipment deployed on the three landslides of loess slope, Baishui river, and Yuhuangge to obtain cumulative displacement. The systematic monitoring of the 3 landslides began between 2013 and 2014. Landslide displacement data are recorded in month. In the experiment, the length of the time series is selected for 61 months, 101 months and 101 months, respectively. The data is shown in Table 1 and Fig. 4. By using the proposed algorithm, displacement perdition of the three landslides is carried out and the results are shown in Fig. 5.
Measurement time and length of 3 groups of monitoring data
Measurement time and length of 3 groups of monitoring data
Test data properties

Displacement monitoring data of 3 landslides.

Displacement prediction results of 3 landslides.
From Fig. 5, it can be seen that, the proposed algorithm can predict the displacement of 3 landslides effectively, which shows the feasibility of the proposed algorithm.
An instance time sequence data Synthetic Control C Hart Time Series (Robert 1999) in the UCI KDD database is selected as the research object. These data are from the industrial production control chart, consisting of 600 sequence samples.
To test the performance of the measure algorithm used in this paper, the DTW-based distance measure algorithm and the Euclidean-based distance measure algorithm are used to calculate the experimental object. For the calculation of the distance of two classes, 3 methods are provided forselection.
Single-link method. The nearest recorded distance of two classes is the distance of the class. This method can generate snake-shaped class, which is not suitable for the case of records gathered together.
Complete-link method. The farthest recorded distance of two classes is the distance of the class. This method is easy to generate a class of small records gathered together.
Group-average link method. Calculate the average distance of two classes.
To save time, 20 samples are randomly selected from 600 samples for calculation. They are the time series with the number of [70 146 166 351 457 200 575 275 379 243 478 80 491 108 393 533 39 147 317 489]. These sequence vectors consist of a test data matrix and numbered 1–20. The results of the two algorithms are shown in Fig. 6.

Experimental results.
From Fig. 6, it can be seen that, Up-ward shift, Increasing trend and Decreasing trend, Downward shift are clustered as two subclasses in the root node. The attributes Normal and Cyclic are assigned to an arbitrary cluster. In Fig. 6 (a), the first, seventeenth time series with the Normal attribute are split into two different clusters. However, the correct result is obtained with the proposed algorithm. Therefore, the clustering result obtained with the proposed algorithm is more accurate.
The experiment platform uses Pentium 1.73 G and 1 G memory, the operating system is Window XP Home Edition, the program is written in JAVA language, and the execution environment is JRE1.5.0
From the theoretical analysis of the algorithm, it can be known that, its time complexity increases linearly [19]. The conclusion of theoretical analysis can be verified by experiments. The data used in the experiment are the eensus-income customer data set in the machine learning data at the University of California at Irvine. The data set has 48842 pieces of data and each record has 14 attributes, of which 6 are continuous numeric attributes and 8 category attributes. Through the average method of random sampling, 2 K, 16 K until 48 K records is taken as the data sample. The sample data are extracted 4K, 8K, respectively. The algorithm is executed 5 times. Figure 7 shows the scatter plot of the size of the data and the execution time of the algorithm.

Scatter plot of Data scale and algorithm execution time.
From Fig. 7, it can be seen that, with the increase of the data scale, the execution time of the algorithm is approximately linear. Experimental result shows that the efficiency of the proposed algorithm can meet the requirements of data mining to a great extent.
The proposed algorithm is applied in the perdition of Lijiawan landslide. Comparison of the original value and the predicted value of the horizontal displacement monitoring points with the landslide profile of BD735, BD736, and BD737 is shown in Fig. 8. The precision analysis is shown in Table 3.

Comparison of original and predicted values.
Prediction accuracy
From Fig. 8, it can be seen that, for the three landslide profiles, the trend of the predicted value of horizontal displacement obtained with the proposed paper is in accordance with the trend of the measured value of the project. From Table 3, it can be seen that, the precision is above 94%, the maximum is 99.6%, which is very close to the measured value.
Comparison of the prediction effect between the proposed algorithm, the Euclidean-based distance measure clustering algorithm, and K-means clustering algorithm are carried out for displacement prediction of the Baijiabao landslide. The prediction results are shown in Fig. 9. The error of the predicted value and the actual value is shown in Table 4.

Prediction results of ZG324 total displacement of monitoring point.
Comparison of prediction error of different landslide displacement prediction methods
From Fig. 9 and Table 4, it can be seen that, the prediction results obtained with the proposed algorithm are in good agreement with the actual displacement, and the error is minimal. The second good is the Euclidean-based distance measure clustering algorithm [20–22]. The prediction results of K-means clustering algorithm are lowest. Comparison results show that the prediction results obtained with the proposed algorithm are best.
The landslide causes serious damage to human beings. The Euclidean-based distance measure clustering algorithm is used in the traditional landslide displacement prediction. In traditional method, the correlation between components is not considered, and the multiple components reflecting single feature will interfere with the results. The clustering results have the disadvantages of low precision and poor stability. To address this problem, an intelligent clustering algorithm for time series data in landslide displacement prediction based on nonlinear dynamic time warping is proposed in this paper. Dynamic time warping algorithm is discussed in detail. Experimental results of the displacement prediction of different landslides show that the proposed algorithm is of high accuracy and the prediction results are optimal. It can predict the landslide displacement more accurately, so as to prevent the serious damage of the environment and resources and ensure the safety of the human life and property.
Footnotes
Acknowledgments
This work was financially supported by the National Key Research and Development Plan (2016YFC0501103), National Natural Science Foundation of China (51774271), National Natural Science Foundation of China (No. 51674245), Natural Science Foundation of Jiangsu Province (BK20160259), Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD), the Fundamental Research Funds for the Central Universities (NO. 2014XT01).
