Time-series data dynamic density clustering

Abstract

In many clustering problems, the data are not static. Over time, part of the data is likely to change, for example by being updated or erased. Because of this, the timeline can be divided into multiple time slices, and the data within each time slice can be treated as static. The data along the timeline then form a series of dynamic intermediate states. The union of the data from all time slices is called the time-series data. Traditional clustering processes do not apply directly to time-series data, and repeating the clustering process at every time slice is very costly. In this paper, we analyze the transition rules of the data set and the cluster structure when one time slice shifts to the next. We find a distinct correlation between the data sets of two adjacent time slices and a succession of their cluster structures, which can be used to reduce the cost of the whole clustering process. Inspired by this, we propose a dynamic density clustering method (DDC) for time-series data. In the simulations, we choose 6 representative problems to construct the time-series data for testing DDC. The results show that DDC achieves high accuracy on all 6 problems while markedly reducing the overall cost.

Keywords

Time-series data, dynamic clustering, density clustering

1. Introduction

Time-series data refers to a sequence of data sets in chronological order. It expresses a series of internal states of data that change with time, and it is characterized by its numerical and continuous nature. Time-series data analysis is used in various fields [1]. For example, in a city, taxis generate a large amount of location information at every moment, and the position of each taxi changes between adjacent moments. By clustering these position data, we can explore people’s travel behavior patterns and provide differentiated services for different spatial units, which supports smart living [2]. To deal with such time-series data, we abstract the above scene into a time-series model: the data set is composed of data on multiple time slices, and as time passes, the data on each time slice change locally. We mainly explore clustering methods for this model.

For clustering such time-series data, there are two main ideas: one is to extract the overall features of the time series for clustering, and the other is to process the data on each time slice directly. Following the first idea, some time-series clustering algorithms project the time-series data as a whole into an n-dimensional space, or use deep learning methods to extract a feature vector of the time-series data directly, and then use static clustering algorithms to handle the result [3]. This type of method uses only the holistic or abstract characteristics of the time series. It does not analyze the data state of adjacent time slices, so it cannot reveal the evolution trend of the data and clusters over time. Because these methods cluster the time-varying data as a whole, they cannot reflect the dynamic characteristics of the time series, and their results lack interpretability. Following the second idea, some time-series clustering methods run a static clustering process at every time slice [4], which is easy to implement but costly when there are many time slices. Furthermore, the data on two adjacent time slices are conspicuously similar, because generally only a few data points are updated between adjacent time slices. The above clustering process therefore performs many repetitive operations, which consumes a large amount of time [5].

To increase the interpretability of time-series clustering, we explore clustering methods of the second category. Since time-series data change with time, we split the data into time slices for analysis. By comparing the data status on adjacent time slices, we found that most data points retain the attributes and categories they had in the previous time slice, and only a small number of data points change their attributes and categories. Therefore, under certain conditions, the data sets on adjacent time slices are strongly correlated, and there is a certain degree of inheritance between their cluster structures.

Based on the above findings, in this article, we propose a dynamic density clustering method (DDC) to deal with the time-series data. The main contributions of this paper are summarized as follows:

1. We analyze the shifting process of the data set and the cluster structure between two adjacent time slices, and we prove that there is a distinct correlation between the data sets and a succession of the cluster structure.
2. We propose a dynamic density clustering method (DDC) to deal with the time-series data. By making full use of the inheritance information contained in the time-series data, DDC improves the clustering performance while significantly reducing the cost compared to the traditional clustering mode. Meanwhile, it can reveal the changing trend of the cluster structure and of the attributes of data points.
3. We evaluate the proposed approach with extensive experiments on six benchmark data sets, and the experimental results validate its effectiveness.

The rest of this paper is organized as follows. In Section 2, we describe related techniques and representative works. In Section 3, we analyze the variation rules of the cluster structure between two adjacent time slices and the defects of the traditional clustering mode in processing time-series data. In Section 4, the proposed DDC is presented in detail. In Section 5, comprehensive experiments on benchmark data sets are carried out to evaluate the performance of our approach. In the last section, we draw the conclusions of this paper and point out possible future work.

2. Related works

In general, both the form of the data and the clustering process can be divided into two categories: dynamic and static. From this point of view, clustering analysis methods basically fall into four types [6].

Firstly, both the data and the clustering process are static. For example, the K-means algorithm, proposed by MacQueen [7], is a typical static clustering algorithm with the advantages of high speed and good stability. However, its results are strongly affected by the choice of K (a fixed integer giving the number of clusters) and of the initial cluster center points. To improve it, Rodriguez et al. [8] proposed to select cluster center points in subregions with high density while keeping a relatively large distance between them. The BIRCH algorithm [9] is a hierarchical clustering algorithm with high clustering efficiency that can handle noise points effectively. However, the data applied to this algorithm must be static, and the method is unsuitable for finding clusters of arbitrary shape. Ester [10] proposed the DBSCAN algorithm, a density-based method that can efficiently discover clusters of arbitrary shape, but it is sensitive to the neighborhood radius (Eps) and to the data distribution. To address this, Qin [11] proposed an adaptive approach that regulates the cluster radius dynamically. Kohonen [12] proposed the SOM (Self-organizing Maps) neural network, a typical unsupervised clustering algorithm. However, the network structure of SOM is fixed, so the number of neurons cannot change, which limits its flexibility in clustering [13].

Secondly, the data is static, but the number of clusters can be updated during the clustering process. For example, Zhang et al. [14] proposed an improved K-means algorithm that deletes the redundant information generated by changes of the cluster structure to reduce the total computation requirements. However, this algorithm applies only to static data.

Thirdly, the data is dynamic and time-varying, but the cluster structure is static. For example, Agrawal et al. [15] proposed a feature representation method that transforms time-series data from the time domain into the frequency domain by the discrete Fourier transform. Similarly, based on the K-means algorithm, Benítez et al. [16] used a similarity distance calculated from the Hausdorff similarity degree to perform clustering analysis of dynamic electricity data. Wang et al. [17] selected a subset of feature points containing important information to reduce the dimension of the time series. However, it is hard to ensure the reliability of the results.

In the final category, both the data and the cluster structure change constantly during the clustering process. For example, considering the advantages of dynamic time warping and the time extended method for unequal time-series data, Luczak [18] proposed a new distance metric that is more suitable for time-series data clustering. Genolini and Falissard [19] used the Haar wavelet transform to represent time-series data and adopted the K-means algorithm with the Euclidean distance to cluster the data in the new feature space. Min et al. [20] described an improved fuzzy C-means algorithm, an extended version of fuzzy C-means (FCM) for time-series data. Izakian and Pedrycz [21] proposed an augmented Euclidean distance for fuzzy clustering of time-series data, but it has high complexity. Xie [3] implemented a time-series clustering method based on the Euclidean distance and the fuzzy C-means clustering algorithm. In this method, equidistant processing of each time slice is used to reduce the implementation complexity, but it also makes the method more sensitive to the segmentation of the time frame. Besides, Sun and Li [22] proposed a time-series hierarchical clustering algorithm based on the dynamic bending distance, but its similarity measure has high computational complexity, which leads to comparatively low efficiency. Combining the sliding window idea with the K-means clustering algorithm, Liu Qin [23] proposed a clustering algorithm for short and non-equidistant time-series data. With this method, however, it is difficult to determine the best cluster number. Shukri [24] proposed a dynamic clustering algorithm based on a nature-inspired search algorithm called Multi-verse Optimizer (MVO), in which the number of clusters is detected automatically without any prior information. However, this method is not suitable for high-dimensional data.

The problem studied in this article belongs to the fourth category. From the above discussion, we can see that most traditional clustering methods apply only to static data, and that the potential regularity of time-series data has not been fully exploited. In this study, we analyze the internal rules of the data and the cluster structure when one time slice shifts to the next, so as to construct a more efficient clustering method for time-series data.

3. Time-series data clustering process analysis

3.1 Data correlation and cluster structure inheritance

Without loss of generality, a typical clustering process at one time slice can be described as follows. The data set $\xi$ consists of $n$ points, that is, $\xi=\{X_{1},X_{2},\ldots,X_{n}\}$ , in which each point is represented by a d-dimensional vector $X_{i}=\{X_{i1},X_{i2},\ldots,X_{id}\}$ . For $\xi$ , the algorithm tries to find a cluster set $C=\{C_{1},C_{2},\ldots,C_{k}\}$ such that the clusters are clearly distinct from each other while the points within each cluster are similar. The centroid point of each group is represented by $CE_{j}$ , $j=1,2,\ldots,k$ . For time-series data, we have the following. In time slice $t_{i}$ , the data set $\xi^{i}$ includes $n_{i}$ elements and constitutes $k_{i}$ groups. In the next time slice $t_{i+1}$ , the data set $\xi^{i+1}$ includes $n_{i+1}$ elements and constitutes $k_{i+1}$ groups. In this process, each element takes one of four statuses: removed, added, displaced, or unchanged, which affect $\xi^{i+1}$ and $C^{i+1}$ at $t_{i+1}$ . The four statuses are defined as follows:

Definition 1. Element removed status: $X_{j}^{i}\in\xi^{i}$ at $t_{i}$ while $X_{j}^{i+1}\notin\xi^{i+1}$ at $t_{i+1}$ .

Definition 2. Element added status: $X_{j}^{i}\notin\xi^{i}$ at $t_{i}$ and $X_{j}^{i+1}\in\xi^{i+1}$ at $t_{i+1}$ .

Definition 3. Element displaced status: $X_{j}^{i}\in\xi^{i}$ at $t_{i}$ and $X_{j}^{i+1}\in\xi^{i+1}$ , $X_{j}^{i}\neq X_{j}^{i+1}$ at $t_{i+1}$ .

Definition 4. Element unchanged status: $X_{j}^{i}\in\xi^{i}$ at $t_{i}$ and $X_{j}^{i+1}\in\xi^{i+1}$ , $X_{j}^{i}=X_{j}^{i+1}$ at $t_{i+1}$ .
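As an illustration of Definitions 1 to 4, the following minimal Python sketch labels every element with one of the four statuses. It assumes, purely for illustration, that each time slice is stored as a dictionary from an element identifier to its coordinate tuple; the function name `classify_statuses` and this data layout are hypothetical and not part of the original method description.

```python
def classify_statuses(prev, curr):
    """Label each element id as removed, added, displaced, or unchanged.

    prev, curr: dicts mapping an element id to a coordinate tuple
    (an assumed layout for the data sets at t_i and t_{i+1}).
    """
    status = {}
    for key in prev.keys() | curr.keys():
        if key not in curr:
            status[key] = "removed"      # Definition 1
        elif key not in prev:
            status[key] = "added"        # Definition 2
        elif prev[key] != curr[key]:
            status[key] = "displaced"    # Definition 3
        else:
            status[key] = "unchanged"    # Definition 4
    return status


slice_i = {"a": (0.0, 0.0), "b": (1.0, 1.0), "c": (2.0, 2.0)}
slice_next = {"a": (0.0, 0.0), "b": (1.0, 2.0), "d": (5.0, 5.0)}
statuses = classify_statuses(slice_i, slice_next)
```

Counting the labels in `statuses` gives the quantities $\omega_{1}$ to $\omega_{4}$ used in the analysis that follows.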

Clearly, the data turnover leads to a change of the cluster structure. We assume that $\omega_{1}$ elements are unchanged, $\omega_{2}$ elements are removed, $\omega_{3}$ elements are added and $\omega_{4}$ elements are displaced from $t_{i}$ to $t_{i+1}$ , and that the probabilities of these four actions are $p_{1}$ , $p_{2}$ , $p_{3}$ , and $p_{4}$ respectively. Then the probability of converting $\xi^{i}$ to $\xi^{i+1}$ is

$\displaystyle\alpha=p_{1}^{\omega_{1}}p_{2}^{\omega_{2}}p_{3}^{\omega_{3}}p_{4}^{\omega_{4}}$ (1)

For convenience of analysis, the above four statuses can be sorted into two types: changed elements and unchanged elements. We assume that the probability of an element being changed is $p$ ; since the two types are complementary, the probability of an element being unchanged is $1-p$ . Hence, $\alpha$ is

$\displaystyle\alpha=({1-p})^{\omega_{1}}p^{\omega_{2}+\omega_{3}+\omega_{4}}$ (2)

Since the number of elements at $t_{i+1}$ is $n_{i+1}$ , and $n_{i+1}=\omega_{1}+\omega_{2}+\omega_{3}+\omega_{4}$ , Eq. (2) can be expressed as

$\displaystyle\alpha=({1-p})^{\omega_{1}}p^{n_{i+1}-\omega_{1}}$ (3)

Obviously, the composition of the data set at $t_{i+1}$ is directly related to $\omega_{1}$ . Therefore, we take the partial derivative of $\alpha$ with respect to $\omega_{1}$ :

$\displaystyle\frac{\partial\alpha}{\partial\omega_{1}}=({1-p})^{\omega_{1}}\cdot p^{n_{i+1}-\omega_{1}}[{\ln({1-p})-\ln p}]$ (4)

When $p=0.5$ , Eq. (4) equals 0, so $p=0.5$ is a critical point. When $p<0.5$ , we have $\ln(1-p)>\ln p$ , so $\alpha$ increases monotonically with $\omega_{1}$ . When $p>0.5$ , the opposite holds. From this we obtain a conclusion about the data correlation between $\xi^{i}$ and $\xi^{i+1}$ .

Conclusion 1. When $p<0.5$ , there will be a relatively obvious correlation of data composition between two adjacent time slices.
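The sign behaviour of Eq. (4) can be checked numerically. The short Python sketch below evaluates Eq. (3) and Eq. (4) directly; the function names and the sample values of $p$ , $\omega_{1}$ and $n_{i+1}$ are chosen here only for illustration.

```python
import math


def alpha(p, omega1, n_next):
    """Eq. (3): probability of the transition with omega1 unchanged elements."""
    return (1 - p) ** omega1 * p ** (n_next - omega1)


def dalpha_domega1(p, omega1, n_next):
    """Eq. (4): partial derivative of alpha with respect to omega1."""
    return alpha(p, omega1, n_next) * (math.log(1 - p) - math.log(p))


# Illustrative values only: 10 elements at t_{i+1}, 5 of them unchanged.
slope_low_p = dalpha_domega1(0.3, 5, 10)    # p < 0.5
slope_half = dalpha_domega1(0.5, 5, 10)     # p = 0.5
slope_high_p = dalpha_domega1(0.7, 5, 10)   # p > 0.5
```

Under these sample values the derivative is positive for $p=0.3$ , zero for $p=0.5$ and negative for $p=0.7$ , which matches the monotonicity argument above.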

Under the condition of $p<0.5$ , the correlation between $\xi^{i}$ and $\xi^{i+1}$ leads to a significant inheritance between $C^{i}$ and $C^{i+1}$ , which becomes stronger as $\omega_{1}$ increases. On the other hand, the types of the changed elements also affect cluster inheritance. We discuss the three types below.

3.1.1 Removed elements predominate the changed elements

When $\omega_{3}$ and $\omega_{4}$ are small, the removed elements can be considered to predominate among the changed elements, that is, $\omega_{2}\approx n_{i+1}-\omega_{1}$ . In this case, we focus on the relationship between the number of removed elements in each group and the compositional stability of the whole cluster structure. There are two extreme cases of removing elements from the set. Firstly, the removed elements are distributed relatively evenly over the groups, so the number of removed elements in any single group is small. Secondly, the removed elements are mainly concentrated in one group, which seriously damages the structure of that group. In general, the fewer elements are removed from a group, the lower the probability that the group centroid moves. Consequently, in the first case, the probability of changing the whole cluster structure is low, and in the second case it is high.

Under the condition of $p<0.5$ , we suppose that $\pi_{1}$ is the probability that the removed elements are concentrated in one group, $k$ is the total number of groups at $t_{i}$ , and the group $C_{j}^{i}$ includes $n_{j}^{i}$ ( $1\leqslant n_{j}^{i}\leqslant n_{i+1}-\omega_{1}$ ) elements. Then the probability of one group being selected is $\beta=\frac{1}{k}$ , the probability of all elements being removed from the group is $\gamma=\prod_{y=0}^{n_{j}^{i}-1}\frac{1}{n_{j}^{i}-y}$ , and the probability of $n_{i+1}-\omega_{1}$ elements being changed is $p^{n_{i+1}-\omega_{1}}$ . Then $\pi_{1}$ is

$\displaystyle\pi_{1}=\left[{\prod_{y=0}^{n_{j}^{i}-1}\frac{1}{({n_{j}^{i}-y})\cdot k}}\right]\cdot p^{n_{i+1}-\omega_{1}}$ (5)

Here $y$ refers to the number of elements removed from the group.

Similarly, let $\pi_{2}$ be the probability that the removed elements are evenly distributed over the groups. In general, $\pi_{2}$ and $\pi_{1}$ are complementary, so $\pi_{2}$ can be expressed as

$\displaystyle\pi_{2}=1-\pi_{1}$ (6)

It can be seen from Eq. (5) that $\pi_{1}$ decreases monotonically as $(n_{j}^{i}-y)\times k$ increases when $(n_{j}^{i}-y)\times k>0$ . Therefore, $\pi_{1}$ is maximal when $(n_{j}^{i}-y)\times k$ is minimal. Since $n_{j}^{i}\geqslant 1$ , $k\geqslant 1$ , and $0\leqslant y\leqslant n_{j}^{i}-1$ , the term $(n_{j}^{i}-y)\times k$ achieves its minimum, which is 1, when $n_{j}^{i}=1$ , $k=1$ , and $y=0$ . In conclusion, the maximum of Eq. (5) is:

$\displaystyle\pi_{1}=p^{n_{i+1}-\omega_{1}}$ (7)

In Eq. (7), we know that $n_{i+1}-\omega_{1}\geqslant 1$ . Therefore, under the premise of $p<0.5$ , the maximum of $\pi_{1}$ is at most $p$ , which is below 0.5. Combining this with Eq. (6), we get $\pi_{1}\leqslant\pi_{2}$ .

In addition, when $n_{j}^{i}$ or $k$ is large, we can find that $\pi_{1}$ will be small and $\pi_{2}$ will be large as they are complementary. In other words, the probability of removed elements concentrating in one group is low. Therefore, the cluster structure between two adjacent time slices has a strong inheritance when a few removed elements predominate the changed elements.
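To make the behaviour of Eqs. (5) and (6) concrete, the sketch below evaluates them as written. The function names and the numerical values are illustrative assumptions; `n_changed` stands for $n_{i+1}-\omega_{1}$ .

```python
def pi1(n_j, k, p, n_changed):
    """Eq. (5): probability that the removed elements concentrate in one group."""
    product = 1.0
    for y in range(n_j):                    # y = 0, ..., n_j - 1
        product *= 1.0 / ((n_j - y) * k)
    return product * p ** n_changed


def pi2(n_j, k, p, n_changed):
    """Eq. (6): complementary probability of an even distribution."""
    return 1.0 - pi1(n_j, k, p, n_changed)


# Extreme case of Eq. (7): a single group holding a single element.
upper_bound = pi1(n_j=1, k=1, p=0.4, n_changed=1)
# A larger group and more groups shrink pi1 quickly.
typical = pi1(n_j=3, k=2, p=0.4, n_changed=3)
```

With these sample values, `upper_bound` equals $p$ and `typical` is several orders of magnitude smaller, in line with the observation that $\pi_{1}$ is small when $n_{j}^{i}$ or $k$ is large.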

3.1.2 Added elements predominate the changed elements

When $\omega_{2}$ and $\omega_{4}$ are small, the added elements can be considered to predominate among the changed elements, that is, $\omega_{3}\approx n_{i+1}-\omega_{1}$ . In this case, we focus on the relationship between the number of added elements and the stability of the whole cluster structure. There are two typical cases: a) All of the added elements fall within the existing groups. In this case, the group centroids are unlikely to move much, so the existing cluster structure is stable. b) Some added elements are not in any existing group. If their number is small, as shown in the left part of Fig. 1, they just become outliers and the cluster structure is not damaged much. If their number is not small, the existing cluster structure is likely to change. In the right part of Fig. 1, several added elements are not in any existing cluster, which is enough to damage the current cluster structure.

Figure 1.

Added elements may damage the existing cluster structure. The left part shows a case in which the added elements are distributed in existing groups. The right part shows a case in which several added elements are not in existing groups.

We assume that $q$ is the probability that an added element is not in any existing group, and that the number of such elements is $r$ ( $0\leqslant r\leqslant n_{i+1}-\omega_{1}$ ). Then the probability $\pi_{3}$ of breaking the current cluster structure is

$\displaystyle\pi_{3}=p^{n_{i+1}-\omega_{1}}\cdot q^{r}$ (8)

And, the probability of keeping the existing cluster is

$\displaystyle\pi_{4}=1-\pi_{3}$ (9)

Obviously, under the condition of $p,q<0.5$ and $1\leqslant r\leqslant n_{i+1}-\omega_{1}$ , the maximum of $\pi_{3}$ is $pq$ , which is below 0.25. It follows that $\pi_{3}$ is much less than $\pi_{4}$ . So, a small number of added elements will not visibly break the inheritance of the cluster structure between two adjacent time slices.
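Eqs. (8) and (9) can be evaluated in the same way. The following sketch is only a numerical illustration; the function names and sample values are assumptions, and `n_changed` again stands for $n_{i+1}-\omega_{1}$ .

```python
def pi3(p, q, n_changed, r):
    """Eq. (8): probability that added elements break the cluster structure."""
    return p ** n_changed * q ** r


def pi4(p, q, n_changed, r):
    """Eq. (9): probability that the existing cluster structure is kept."""
    return 1.0 - pi3(p, q, n_changed, r)


# Largest value for p = 0.4, q = 0.3: one changed element, lying outside all groups.
largest = pi3(0.4, 0.3, n_changed=1, r=1)
# More changed elements make a break even less likely under this model.
smaller = pi3(0.4, 0.3, n_changed=5, r=2)
```

For these values `largest` equals $pq=0.12$ , so the probability of keeping the structure is 0.88, and `smaller` is far below that.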

3.1.3 Displaced elements predominate the changed elements

The core operation of element displacement is that an element is moved to a new position when the time slice switches from $t_{i}$ to $t_{i+1}$ . Therefore, element displacement can be considered as a superposition of two processes: removing one element and adding a new one. Based on the above analysis, under the condition of $p<0.5$ , when displaced elements predominate among the changed elements, there is still a strong inheritance of cluster structure between two adjacent time slices.

Conclusion 2. Under the condition of $p<0.5$ , there is an obvious inheritance of cluster structure between $C^{i}$ and $C^{i+1}$ when elements are changed.

3.2 Analysis of traditional clustering algorithm for time-series data

In this section, we discuss the computation process of the K-means algorithm (partition clustering mode), the BIRCH algorithm (hierarchical clustering mode) and the DBSCAN algorithm (density clustering mode) for time-series data, in order to determine which one is more suitable for dealing with time-series data. In the following, we take the process of removing elements as an example to compare them. As shown in Fig. 2, the element F has been removed at $t_{i+1}$ . The red elements indicate cluster 1, the blue ones indicate cluster 2, and the “ $\times$ ” marks indicate the cluster centroids. When element F is removed, the centroid of group 2 shifts to the left. Element G is then likely to be merged into group 2 because it is closer to the new centroid of group 2, so the remaining elements need to be reclassified. It can be seen that the K-means algorithm needs a complete recalculation to obtain the new cluster structure when the changed elements cause a considerable shift of an existing cluster centroid.

Figure 2.

Clustering result of the K-means algorithm for two adjacent time slices when $k=2$ . The left figure is the clustering result at $t_{i}$ while the right figure is the clustering result at $t_{i+1}$ .

Figure 3.

Clustering result of BIRCH algorithm for two adjacent time slices. Left figure is the merging process at $t_{i}$ and right figure is the merging process at $t_{i+1}$ .

In Fig. 3, we show the clustering results of the BIRCH algorithm at $t_{i}$ and $t_{i+1}$ . The number in each circle represents the order in the merging process. As shown in Fig. 3, when element F is removed, the second merging step changes: element E is merged with element G and element D. The disappearance of F invalidates the distance matrix among the elements, which has to be recreated to build the new clusters and causes extra computation. All in all, the BIRCH algorithm cannot achieve a dynamic clustering process by local computation.

In the DBSCAN algorithm, Eps represents the radius of the neighborhood centered on a point in the problem space, and MinPts represents a given threshold. If the number of elements in the Eps-neighborhood of a point is larger than MinPts, that point is a core point. Boundary points are located in the Eps-neighborhood of a core point but are not core points themselves, and noise points are neither core points nor boundary points. If two core points are close enough, the two core points and their boundary points belong to the same group. Figure 4 shows the clustering result obtained in this way. When element F is removed, the attributes of elements G, D, and E are updated naturally, which directly yields a new cluster structure.

Figure 4.

Clustering result of DBSCAN algorithm for two adjacent time slices. The left figure is the clustering result at $t_{i}$ , while the right figure is the clustering result at $t_{i+1}$ .

In summary, the density-based clustering process is able to rebuild the cluster flexibly with a relatively small cost when a few elements have been changed, which provides a better way to handle the time-series data.

4. Dynamic density clustering method for time-series data

In this section, we propose a density-based dynamic clustering method (DDC) for time-series data, which includes 3 processes for updating the cluster structure: element removing (ER), element adding (EA) and element displacing (ED). Figure 5 shows the processing flow of DDC. The input of the algorithm contains $\xi^{i}$ (the data set of $t_{i}$ ), Queue (the queue of elements changed from $t_{i}$ to $t_{i+1}$ ), Eps (the density clustering radius parameter) and MinPts (the threshold parameter). The output includes the cluster label and the attribute of every object.

Figure 5.

The processing flow of DDC.

4.1 Elements removing process

When an element is removed, the other elements within its neighborhood defined by Eps are affected directly, so we need to update the attributes of the elements within that neighborhood one by one. This in turn causes a chain reaction: the changed attributes of those elements are likely to affect other elements in their own neighborhoods. All of these changes must be handled to obtain the new cluster structure. We assume that Fig. 6 is the original state of all elements at $t_{i}$ . Fig. 7 then shows the state of the elements at $t_{i+1}$ after A has been removed. Element F has been transformed from a core point to a boundary point, and, due to the chain reaction, element E and element Z have been transformed from boundary points to noise points. As a result, one group is divided into two new groups whose core points are element D and element G respectively.

Figure 6.

State of elements when element A exists. The circle represents the Eps neighborhood of one element, and MinPts $=4$ .

Figure 7.

State of elements when element A does not exist.

All in all, the element removing process (ER) consists of the following main steps: (1) delete the element; (2) check whether the removed element affects other elements within its neighborhood; (3) if some attributes have changed, recalculate the elements within the Eps neighborhood of each changed element one by one until no attribute in the neighborhood changes any more.

ER Process:

Input:
$\xi^{i}$ : the data set of $t_{i}$
QueueOne: the queue of the removed elements

Output:
$\xi^{i+1}$ : the data set of $t_{i+1}$
index: the cluster label for the object after element removing
attribute: the attribute label for the object after element removing

1. Array = [], num = QueueOne.length, n = 0;
2. while QueueOne != null do
3.     A = QueueOne.pop();
4.     if n < QueueOne.length
5.         delete A from $\xi^{i}$ ;
6.     end if
7.     push the elements located in the neighbourhood of A into Array;
8.     for i = 0 to Array.length do
9.         if attribute of Array[i] changed & Array[i] has no label
10.            change the attribute of Array[i];
11.            add Array[i] to QueueOne;
12.            mark Array[i];
13.            delete Array[i] from Array;
14.        end if
15.    end for
16.    n++;
17. end while
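The ER steps above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it tracks only the attribute of each element (1 for core, 0 for boundary, $-1$ for noise) and leaves out the relabelling of groups. It also assumes that a point is core when its Eps-neighborhood, counting the point itself, holds at least `min_pts` elements, which is the common DBSCAN convention. All function names and the dictionary-based data layout are hypothetical.

```python
import math
from collections import deque

CORE, BORDER, NOISE = 1, 0, -1  # attribute codes: core, boundary, noise


def neighbors(points, pid, eps):
    """Ids of all points within eps of pid (pid itself included)."""
    center = points[pid]
    return [q for q, xy in points.items() if math.dist(center, xy) <= eps]


def attribute(points, pid, eps, min_pts):
    """Attribute of pid, computed from the current geometry."""
    nbrs = neighbors(points, pid, eps)
    if len(nbrs) >= min_pts:  # assumption: count includes pid, threshold is >=
        return CORE
    for q in nbrs:
        if q != pid and len(neighbors(points, q, eps)) >= min_pts:
            return BORDER
    return NOISE


def remove_element(points, attrs, pid, eps, min_pts):
    """Delete pid, then re-check only elements reached by the chain reaction."""
    seeds = [q for q in neighbors(points, pid, eps) if q != pid]
    del points[pid]
    del attrs[pid]
    queue, marked, changed = deque(seeds), set(seeds), set()
    while queue:
        q = queue.popleft()
        new_attr = attribute(points, q, eps, min_pts)
        if new_attr != attrs[q]:
            attrs[q] = new_attr
            changed.add(q)
            # A changed element may affect its own neighbourhood in turn.
            for r in neighbors(points, q, eps):
                if r not in marked:
                    marked.add(r)
                    queue.append(r)
    return changed


# Four collinear points: b and c are core, a and d are boundary points.
points = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0), "d": (3.0, 0.0)}
attrs = {pid: attribute(points, pid, 1.0, 3) for pid in points}
changed = remove_element(points, attrs, "c", eps=1.0, min_pts=3)
```

In this toy example, removing the core point `c` turns `b` from a core point into a noise point, and the chain reaction then turns `a` and `d` into noise points as well. Elements outside the affected neighborhoods are never revisited, which is the source of the cost saving.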

4.2 Elements adding process

The element adding process (EA) is the inverse of ER. When an element is added, the other elements within its Eps-neighborhood are affected, and chain reactions are triggered as well. We assume that Fig. 7 is the state of the elements at $t_{i}$ before element A is added, and that Fig. 6 is the state of the elements at $t_{i+1}$ . Then element F is transformed from a boundary point to a core point, and element E is transformed from a noise point to a boundary point. Furthermore, the elements located in the above two neighborhoods are merged into one group.

In conclusion, the key of the EA process is also to find the elements in the Eps-neighborhood whose attributes are changed by the added element, and then to check the other affected elements within their Eps-neighborhoods until no element’s attribute changes any more.

EA Process:

Input:
$\xi^{i}$ : the data set of $t_{i}$
QueueOne: the queue of the added elements

Output:
$\xi^{i+1}$ : the data set of $t_{i+1}$
index: the cluster label for the object after element adding
attribute: the attribute label for the object after element adding

1. Array = [], num = QueueOne.length, n = 0;
2. while QueueOne != null do
3.     A = QueueOne.pop();
4.     if n < QueueOne.length
5.         add A to $\xi^{i}$ ;
6.     end if
7.     push the elements located in the neighbourhood of A into Array;
8.     for i = 0 to Array.length do
9.         if attribute of Array[i] changed & Array[i] has no label
10.            change the attribute of Array[i];
11.            add Array[i] to QueueOne;
12.            mark Array[i];
13.            delete Array[i] from Array;
14.        end if
15.    end for
16.    n++;
17. end while
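The EA steps can be sketched in the same spirit. The code below is an illustrative simplification under stated assumptions, not the authors' implementation: it updates only the attributes (1 for core, 0 for boundary, $-1$ for noise), omits group relabelling, and treats a point as core when its Eps-neighborhood, counting the point itself, holds at least `min_pts` elements. The function names and data layout are hypothetical.

```python
import math
from collections import deque

CORE, BORDER, NOISE = 1, 0, -1  # attribute codes: core, boundary, noise


def neighbors(points, pid, eps):
    """Ids of all points within eps of pid (pid itself included)."""
    center = points[pid]
    return [q for q, xy in points.items() if math.dist(center, xy) <= eps]


def attribute(points, pid, eps, min_pts):
    """Attribute of pid, computed from the current geometry."""
    nbrs = neighbors(points, pid, eps)
    if len(nbrs) >= min_pts:  # assumption: count includes pid, threshold is >=
        return CORE
    for q in nbrs:
        if q != pid and len(neighbors(points, q, eps)) >= min_pts:
            return BORDER
    return NOISE


def add_element(points, attrs, pid, xy, eps, min_pts):
    """Insert pid at xy, then re-check only elements reached by the chain reaction."""
    points[pid] = xy
    attrs[pid] = attribute(points, pid, eps, min_pts)
    seeds = [q for q in neighbors(points, pid, eps) if q != pid]
    queue, marked, changed = deque(seeds), set(seeds) | {pid}, {pid}
    while queue:
        q = queue.popleft()
        new_attr = attribute(points, q, eps, min_pts)
        if new_attr != attrs[q]:
            attrs[q] = new_attr
            changed.add(q)
            # An element that became core can promote its noise neighbours.
            for r in neighbors(points, q, eps):
                if r not in marked:
                    marked.add(r)
                    queue.append(r)
    return changed


# Three collinear points that are all noise before the insertion.
points = {"a": (0.0, 0.0), "b": (1.0, 0.0), "d": (3.0, 0.0)}
attrs = {pid: attribute(points, pid, 1.0, 3) for pid in points}
changed = add_element(points, attrs, "c", (2.0, 0.0), eps=1.0, min_pts=3)
```

Here the added point `c` becomes a core point, `b` is promoted from noise to core, and `a` and `d` become boundary points. Chaining a removal step and this adding step for the same element gives one possible reading of the displacing process described in Section 4.3.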

Table 1
The category and attribute of elements at $t_{i}$ and $t_{i+1}$ time slice

Elements	A	D	E	F	G	H	L	M	N	R	S	U	V	Z
Category at ${t}_{i}$	⟀	⟀	⟀	⟀	⟀	⟀	⟀	⟀	⟀	⟀	⟀	⟀	⟀	⟀
Category at ${t}_{{i}+1}$	⟁	⟀	⟂	⟀	⟁	⟀	⟀	⟀	⟀	⟁	⟁	⟁	⟁	⟂
Attribute at ${t}_{i}$	1	1	0	1	1	0	0	0	0	0	0	0	0	0
Attribute after A removing	Nan	1	$-$ 1	0	1	0	0	0	0	0	0	0	0	$-$ 1
Attribute at ${t}_{{i}+1}$	0	1	$-$ 1	0	1	0	0	0	0	0	0	0	0	$-$ 1

Figure 8.

A chain reaction caused by element A which has been moved from point $\times$ to its current position.

4.3 Elements displacing process

The element displacing process (ED) can be considered as the superposition of EA and ER. We use an example to illustrate this process. Firstly, we assume that the data in Fig. 6 is the data at $t_{i}$ . Then element A is moved to the new position shown in Fig. 8 to form the data at $t_{i+1}$ . We use the ED process to deal with the two different data sets at the two time slices. The changes of element categories and attributes are shown in Table 1. The category symbols represent the IDs of the groups. In addition, in the description of attributes, 1, 0, $-1$ and Nan represent the core point, the boundary point, the isolated point, and the removed element respectively.

From Table 1 and Fig. 6, we can see that all elements belong to group ⟀ at $t_{i}$ . Under the influence of the displacement of element A, elements A, G, R, S, U, V have been moved into group ⟁, and elements E and Z have been moved into the group of noise points. Therefore, one group has been divided into 3 subgroups. From Table 1, we can also see that the elements A, D, F, and G are core points at $t_{i}$ . After the ED operation for element A, element F has been transformed from a core point to a boundary point, and the elements E and Z have been transformed from boundary points to noise points. The attributes of the other elements remain unchanged. After the ED operation, the final state at $t_{i+1}$ is shown in Fig. 8.

In summary, DDC can clearly show how the attributes of each element change and how the cluster structure varies accordingly across time slices. This is something that traditional clustering methods do not provide.

4.4 Time complexity analysis

We discuss the time complexity of DDC in the worst case and in the ordinary case. We assume that there are N elements in total and that M elements have changed from $t_{i}$ to $t_{i+1}$ .

In the worst case, the change of the M elements affects the attributes of all elements at $t_{i+1}$ . DDC then degenerates to the traditional DBSCAN algorithm, which needs to determine the attribute of every element and find the other elements in its Eps neighborhood. So, the worst-case time complexity of DDC is the worst-case time complexity of DBSCAN, that is $O({N^{2}})$ .

In the ordinary case, the M changed elements affect the attributes of some other elements at $t_{i+1}$ , and DDC needs to recalculate the elements within the Eps neighborhoods of the changed elements one by one until no attribute in the neighborhood changes any more. We assume that the time of searching the elements within the Eps neighborhood of each of the M elements is T. The time complexity of DDC in the ordinary case is then $O({M\times T})$ .

All in all, when the probability of element change is small, there is a relatively obvious correlation of data composition and cluster structure between two adjacent time slices. By taking full advantage of the inheritance of the cluster structure in the time series, DDC only needs to examine the elements within the Eps neighborhoods of a few changed elements, which greatly reduces redundant calculations and improves clustering efficiency compared to the DBSCAN algorithm.

5. Experiments

We implemented DDC in MATLAB 8.3 and performed a variety of experiments on a PC (2.2 GHz, 4 GB RAM, Windows 7). The experiments cover three classic data sets [25] (Iris, Seeds, and Wine) and three larger data sets: Mushroom [25], Anuran Calls [25], and a subset of the SEQUOIA 2000 benchmark database [10]. Iris, Seeds, Wine, Mushroom, and Anuran Calls are all taken from the UCI machine learning repository [25].

5.1 Time-series data generation

The Iris dataset contains 3 groups: Setosa, Versicolor, and Virginica. Each group contains 50 instances, and each instance has four attributes: sepal length, sepal width, petal length, and petal width. The Seeds dataset, created by the Institute of Agricultural Physics of the Polish Academy of Sciences, describes three different types of wheat. The Wine dataset contains statistics on three varieties of wine produced in the same region of Italy. The Mushroom dataset describes hypothetical samples of 23 species of mushrooms; it contains 2 groups and 8124 instances with 22 attributes. The Anuran Calls (MFCCs) dataset consists of acoustic features extracted from the syllables of anuran calls; it contains 4 groups and 7195 instances with 22 attributes. Table 2 summarizes these 5 problems. In addition, the subset of the SEQUOIA 2000 benchmark database contains the names of 62556 Californian landmarks, extracted from the US Geological Survey's Geographic Names Information System, together with their locations.

In this paper, the test data sets are converted to time series as follows. At each time slice, we randomly select 30% of the elements, and each selected element is changed by one of 3 randomly chosen operations: disappearing, adding, or displacing. In the disappearing operation, we delete the element from the data set directly. In the adding operation, a new element is selected randomly in the space spanned by all elements. In the displacing operation, a new point is selected randomly in the neighborhood of the original element. This procedure is performed four times, which yields data sets for 5 consecutive time slices. Through this mechanism, we construct time-series data from an ordinary data set so that it differs substantially from the original data. Meanwhile, the constructed data simulate the specific scenarios handled by DDC, so experiments on these data sets can illustrate the effectiveness of the proposed algorithm.
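The generation procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the `radius` of the displacement neighborhood, and the choice to keep the selected element when a new one is added are all hypothetical, since the text does not fix these details. New elements are drawn uniformly from the bounding box of the current data.

```python
import random


def next_time_slice(points, change_ratio=0.3, radius=0.5, rng=None):
    """Build the data set of the next time slice from the current one.

    points is a list of equal-length coordinate tuples. Returns the new
    list and a count of how often each operation was applied.
    """
    rng = rng or random.Random()
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    n_change = round(change_ratio * len(points))
    chosen = set(rng.sample(range(len(points)), n_change))

    result = []
    counts = {"delete": 0, "add": 0, "displace": 0}
    for i, p in enumerate(points):
        if i not in chosen:
            result.append(p)  # unchanged element is inherited as-is
            continue
        op = rng.choice(("delete", "add", "displace"))
        counts[op] += 1
        if op == "delete":
            continue  # the element disappears
        if op == "add":
            # Assumption: the selected element stays and a new element is
            # drawn uniformly from the bounding box of all elements.
            result.append(p)
            result.append(tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)))
        else:
            # Assumption: the element moves to a random point within
            # +/- radius of its old position along every axis.
            result.append(tuple(x + rng.uniform(-radius, radius) for x in p))
    return result, counts


def make_series(points, n_slices=5, seed=None):
    """Apply the change step n_slices - 1 times, starting from points."""
    rng = random.Random(seed)
    series = [list(points)]
    for _ in range(n_slices - 1):
        nxt, _counts = next_time_slice(series[-1], rng=rng)
        series.append(nxt)
    return series
```

Calling `make_series(data, n_slices=5)` mirrors the four repetitions described above: the first slice is the original data set and each later slice is derived from its predecessor.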

Table 2
Information of 5 problems

Dataset	Number of instances	Number of attributes	Number of groups	Number of instances in each group
Iris	150	4	3	50-50-50
Seeds	210	7	3	70-70-70
Wine	178	13	3	59-71-48
Mushroom	8124	22	2	4208-3916
Anuran Calls	7195	22	4	68-542-2165-4420

5.2 Evaluation metrics

The experimental results were evaluated with six metrics: Accuracy (ACC), Precision (PR), Recall (RE), Adjusted Rand Index (ARI), running time (T, in ms), and the number of data elements involved in the calculation (NUM). ACC is the proportion of correctly clustered elements. PR measures how many of the samples predicted as positive are actually positive. RE measures how many of the positive samples are predicted correctly. ARI measures the degree of agreement between two data distributions. T is the running time of an algorithm on a single time slice, and NUM is the number of elements involved in one clustering computation. ACC, PR, RE, ARI, and T are static indicators of the clustering result at a time slice, while NUM shows the operational overhead of the clustering process.

The closer the clustering result is to the true division of the data set, the closer ACC, PR, RE, and ARI are to 1. In other words, larger values of these four criteria indicate better cluster quality. The 4 indicators are defined as follows:

$\displaystyle \textit{ACC}=\frac{1}{n}\max_{j_{1},j_{2},\ldots,j_{k}\in S}\sum_{i=1}^{k}n_{ij_{i}}$ (10)

$\displaystyle PR=\frac{1}{k}\sum_{i=1}^{k}\frac{n_{ij_{i}^{\ast}}}{b_{i}}$ (11)

$\displaystyle RE=\frac{1}{k}\sum_{i=1}^{k}\frac{n_{ij_{i}^{\ast}}}{c_{i}}$ (12)

$\displaystyle \textit{ARI}=\frac{\sum_{ij}\binom{n_{ij}}{2}-\left[\sum_{i}\binom{b_{i}}{2}\sum_{j}\binom{c_{j}}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{b_{i}}{2}+\sum_{j}\binom{c_{j}}{2}\right]-\left[\sum_{i}\binom{b_{i}}{2}\sum_{j}\binom{c_{j}}{2}\right]\Big/\binom{n}{2}}$ (13)

where $n$ is the number of elements in data set $\xi$ , ${k}$ is the number of clusters, $b_{i}$ and $c_{j}$ are the numbers of elements in groups $B_{i}$ and $C_{j}$ , $n_{ij}=|{B_{i}\cap C_{j}}|$ , and $n_{ij_{i}^{\ast}}$ is the number of elements assigned correctly to cluster $i$ . Meanwhile, the smaller T is, the faster the clustering is, and the smaller NUM is, the lower the clustering cost is. These two indicators can be counted directly during testing.
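As an illustration of Eqs. (10)–(13), the following Python sketch computes the four quality indicators from two label lists. It rests on several assumptions that the text does not state: $B_{i}$ is taken to be the predicted clusters and $C_{j}$ the true groups, both partitions have the same number $k$ of groups, the best matching $j_{i}^{\ast}$ is found by brute force over all permutations (practical only for small $k$), and RE divides by the size of the matched true group. Noise points are not treated specially. The function names are hypothetical.

```python
from itertools import permutations
from math import comb


def contingency(pred, true):
    """n_ij = number of elements in predicted cluster i and true group j."""
    clusters = sorted(set(pred))
    classes = sorted(set(true))
    table = [[0] * len(classes) for _ in clusters]
    for p, t in zip(pred, true):
        table[clusters.index(p)][classes.index(t)] += 1
    return table


def clustering_scores(pred, true):
    """Return ACC, PR, RE and ARI for two labelings of the same elements."""
    n_ij = contingency(pred, true)
    k = len(n_ij)
    if any(len(row) != k for row in n_ij):
        raise ValueError("this sketch assumes equally many clusters and groups")
    n = len(pred)
    b = [sum(row) for row in n_ij]                                # cluster sizes
    c = [sum(n_ij[i][j] for i in range(k)) for j in range(k)]     # group sizes

    # Best one-to-one matching of clusters to groups (Eq. 10).
    best = max(permutations(range(k)),
               key=lambda js: sum(n_ij[i][js[i]] for i in range(k)))
    matched = [n_ij[i][best[i]] for i in range(k)]

    acc = sum(matched) / n
    pr = sum(m / b[i] for i, m in enumerate(matched)) / k
    re = sum(m / c[best[i]] for i, m in enumerate(matched)) / k

    # Adjusted Rand index (Eq. 13); undefined when the denominator is zero.
    pairs = sum(comb(x, 2) for row in n_ij for x in row)
    sum_b = sum(comb(x, 2) for x in b)
    sum_c = sum(comb(x, 2) for x in c)
    expected = sum_b * sum_c / comb(n, 2)
    ari = (pairs - expected) / (0.5 * (sum_b + sum_c) - expected)
    return {"ACC": acc, "PR": pr, "RE": re, "ARI": ari}
```

Because the matching step is independent of label names, a clustering that reproduces the true groups under different labels scores 1 on all four indicators.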

5.3 Experiment results

To compare the performance of DDC with state-of-the-art designs, 4 representative algorithms were chosen as competitors: the improved K-means algorithm based on density Canopy [26], the improved BIRCH clustering algorithm based on connectivity distance and intensity [27], the adaptive and fast density-based spatial clustering of applications with noise algorithm (AF-DBSCAN) [28], and the multi-density clustering algorithm [29].

We used 3 small-scale datasets and 2 large-scale datasets to compare the performance of DDC with that of the competitors. In addition, the subset of the SEQUOIA 2000 database was used as a large-scale dataset to compare the time efficiency of DDC with that of state-of-the-art designs.

The experimental procedure is as follows. For each time slice, each of the four competitors performed an independent and complete clustering operation on the dataset belonging to the current time slice and output the results and related indicators. In contrast, because no element has changed dynamically in the initial state at $t_{1}$ , DDC uses DBSCAN to initialize the clusters at $t_{1}$ and then performs dynamic clustering in the following time slices based on that initial result. In short, DDC did not perform dynamic clustering at $t_{1}$ but did so in the next four time slices, and output the results and related indicators.

First, we compared the clustering accuracy on 5 time slices between DDC and the 4 algorithms from the literature [26, 27, 28, 29]. The accuracy rates are shown in Table 3. The algorithm from Han et al. [29] has the highest accuracy on all 5 datasets. The accuracy of DDC on the 5 test datasets is slightly lower than that of Han et al. [29] and very close to that of AF-DBSCAN from Zhou et al. [28]. Meanwhile, DDC is clearly more accurate than the algorithms of Zhang et al. [26] and Fan et al. [27].

Table 3
Comparisons of accuracy among 5 algorithms

Datasets	Time slice	Literature [26]	Literature [27]	AF-DBSCAN [28]	Literature [29]	DDC
Iris	$t_{1}$	0.883	0.897	0.932	0.943	0.926
	$t_{2}$	0.871	0.886	0.927	0.946	0.934
	$t_{3}$	0.877	0.901	0.933	0.938	0.929
	$t_{4}$	0.877	0.891	0.925	0.944	0.924
	$t_{5}$	0.881	0.887	0.930	0.944	0.932
Seeds	$t_{1}$	0.837	0.847	0.886	0.905	0.882
	$t_{2}$	0.845	0.853	0.885	0.923	0.879
	$t_{3}$	0.851	0.860	0.895	0.910	0.872
	$t_{4}$	0.833	0.854	0.890	0.902	0.881
	$t_{5}$	0.842	0.864	0.895	0.924	0.885
Wine	$t_{1}$	0.861	0.878	0.895	0.931	0.887
	$t_{2}$	0.858	0.883	0.902	0.935	0.890
	$t_{3}$	0.864	0.869	0.880	0.925	0.879
	$t_{4}$	0.871	0.873	0.903	0.934	0.897
	$t_{5}$	0.856	0.868	0.901	0.931	0.891
Mushroom	$t_{1}$	0.837	0.859	0.933	0.951	0.921
	$t_{2}$	0.831	0.863	0.936	0.959	0.923
	$t_{3}$	0.829	0.856	0.935	0.948	0.921
	$t_{4}$	0.835	0.851	0.938	0.947	0.921
	$t_{5}$	0.841	0.863	0.936	0.949	0.922
Anuran Calls	$t_{1}$	0.837	0.863	0.931	0.933	0.919
	$t_{2}$	0.832	0.849	0.913	0.925	0.902
	$t_{3}$	0.836	0.841	0.933	0.942	0.921
	$t_{4}$	0.842	0.847	0.921	0.937	0.922
	$t_{5}$	0.834	0.849	0.931	0.936	0.921

Table 4

Comparisons of PR, RE, and ARI for Iris with 5 time slices

Time slice	Algorithm	PR	RE	ARI
$t_{1}$	Literature [26]	0.894	0.875	0.853
	Literature [27]	0.901	0.863	0.874
	AF-DBSCAN [28]	0.943	0.892	0.923
	Literature [29]	0.952	0.921	0.916
$t_{2}$	Literature [26]	0.863	0.851	0.841
	Literature [27]	0.905	0.856	0.865
	AF-DBSCAN [28]	0.936	0.893	0.916
	Literature [29]	0.958	0.926	0.919
	DDC	0.946	0.897	0.921
$t_{3}$	Literature [26]	0.883	0.861	0.837
	Literature [27]	0.899	0.879	0.875
	AF-DBSCAN [28]	0.941	0.901	0.911
	Literature [29]	0.949	0.931	0.924
	DDC	0.937	0.902	0.922
$t_{4}$	Literature [26]	0.902	0.864	0.849
	Literature [27]	0.896	0.873	0.866
	AF-DBSCAN [28]	0.944	0.928	0.913
	Literature [29]	0.956	0.925	0.922
	DDC	0.940	0.897	0.919
$t_{5}$	Literature [26]	0.886	0.871	0.847
	Literature [27]	0.892	0.882	0.871
	AF-DBSCAN [28]	0.947	0.928	0.913
	Literature [29]	0.953	0.929	0.922
	DDC	0.937	0.901	0.917

Table 5

Comparisons of PR, RE, and ARI for Seeds with 5 time slices

Time slice	Algorithm	PR	RE	ARI
$t_{1}$	Literature [26]	0.859	0.841	0.836
	Literature [27]	0.856	0.834	0.822
	AF-DBSCAN [28]	0.902	0.894	0.880
	Literature [29]	0.916	0.911	0.898
$t_{2}$	Literature [26]	0.863	0.853	0.798
	Literature [27]	0.869	0.847	0.832
	AF-DBSCAN [28]	0.889	0.902	0.877
	Literature [29]	0.931	0.924	0.913
	DDC	0.873	0.858	0.844
$t_{3}$	Literature [26]	0.871	0.863	0.812
	Literature [27]	0.853	0.862	0.841
	AF-DBSCAN [28]	0.907	0.899	0.881
	Literature [29]	0.911	0.906	0.897
	DDC	0.871	0.864	0.845
$t_{4}$	Literature [26]	0.857	0.846	0.824
	Literature [27]	0.861	0.853	0.844
	AF-DBSCAN [28]	0.906	0.894	0.879
	Literature [29]	0.913	0.905	0.867
	DDC	0.891	0.859	0.878
$t_{5}$	Literature [26]	0.849	0.854	0.835
	Literature [27]	0.872	0.851	0.839
	AF-DBSCAN [28]	0.907	0.899	0.881
	Literature [29]	0.933	0.925	0.914
	DDC	0.902	0.894	0.880

Table 6

Comparisons of PR, RE, and ARI for Wine with 5 time slices

Time slice	Algorithm	PR	RE	ARI
$t_{1}$	Literature [26]	0.879	0.884	0.852
	Literature [27]	0.884	0.874	0.859
	AF-DBSCAN [28]	0.897	0.901	0.887
	Literature [29]	0.944	0.915	0.926
$t_{2}$	Literature [26]	0.845	0.832	0.827
	Literature [27]	0.890	0.881	0.867
	AF-DBSCAN [28]	0.909	0.905	0.891
	Literature [29]	0.942	0.939	0.928
	DDC	0.899	0.886	0.873
$t_{3}$	Literature [26]	0.871	0.876	0.843
	Literature [27]	0.878	0.861	0.949
	AF-DBSCAN [28]	0.891	0.889	0.877
	Literature [29]	0.939	0.924	0.919
	DDC	0.883	0.876	0.878
$t_{4}$	Literature [26]	0.878	0.882	0.869
	Literature [27]	0.878	0.875	0.864
	AF-DBSCAN [28]	0.917	0.902	0.895
	Literature [29]	0.945	0.928	0.933
	DDC	0.902	0.901	0.884
$t_{5}$	Literature [26]	0.868	0.867	0.857
	Literature [27]	0.871	0.863	0.855
	AF-DBSCAN [28]	0.911	0.904	0.887
	Literature [29]	0.943	0.927	0.929
	DDC	0.898	0.903	0.889

Table 7

Comparisons of PR, RE, and ARI for Mushroom with 5 time slices

Time slice	Algorithm	PR	RE	ARI
$t_{1}$	Literature [26]	0.851	0.844	0.831
	Literature [27]	0.875	0.862	0.841
	AF-DBSCAN [28]	0.937	0.941	0.923
	Literature [29]	0.963	0.952	0.947
$t_{2}$	Literature [26]	0.857	0.849	0.838
	Literature [27]	0.876	0.868	0.854
	AF-DBSCAN [28]	0.941	0.935	0.926
	Literature [29]	0.957	0.946	0.949
	DDC	0.932	0.924	0.914
$t_{3}$	Literature [26]	0.865	0.878	0.864
	Literature [27]	0.871	0.861	0.846
	AF-DBSCAN [28]	0.942	0.938	0.928
	Literature [29]	0.961	0.955	0.947
	DDC	0.930	0.925	0.912
$t_{4}$	Literature [26]	0.877	0.884	0.871
	Literature [27]	0.864	0.852	0.838
	AF-DBSCAN [28]	0.940	0.937	0.928
	Literature [29]	0.962	0.948	0.940
	DDC	0.929	0.926	0.910
$t_{5}$	Literature [26]	0.869	0.867	0.861
	Literature [27]	0.875	0.863	0.850
	AF-DBSCAN [28]	0.940	0.938	0.927
	Literature [29]	0.963	0.949	0.941
	DDC	0.931	0.924	0.911

Table 8

Comparisons of PR, RE, and ARI for Anuran Calls with 5 time slices

Time slice	Algorithm	PR	RE	ARI
$t_{1}$	Literature [26]	0.849	0.859	0.823
	Literature [27]	0.876	0.867	0.849
	AF-DBSCAN [28]	0.942	0.939	0.929
	Literature [29]	0.931	0.941	0.938
$t_{2}$	Literature [26]	0.851	0.826	0.821
	Literature [27]	0.861	0.863	0.851
	AF-DBSCAN [28]	0.924	0.936	0.928
	Literature [29]	0.922	0.943	0.939
	DDC	0.928	0.925	0.918
$t_{3}$	Literature [26]	0.861	0.867	0.823
	Literature [27]	0.866	0.855	0.843
	AF-DBSCAN [28]	0.946	0.937	0.926
	Literature [29]	0.953	0.952	0.948
	DDC	0.933	0.925	0.918
$t_{4}$	Literature [26]	0.856	0.851	0.858
	Literature [27]	0.868	0.874	0.860
	AF-DBSCAN [28]	0.943	0.936	0.926
	Literature [29]	0.956	0.953	0.946
	DDC	0.929	0.927	0.920
$t_{5}$	Literature [26]	0.842	0.852	0.844
	Literature [27]	0.874	0.864	0.856
	AF-DBSCAN [28]	0.941	0.938	0.928
	Literature [29]	0.953	0.951	0.939
	DDC	0.928	0.924	0.917

The comparisons of PR, RE, and ARI between DDC and the 4 competitors for the 5 test problems are shown in Tables 4–8. The method from Han et al. [29] obtains the best values of PR, RE, and ARI in most cases. The PR, RE, and ARI values of DDC range from 0.844 to 0.946, which means the clustering results of DDC are close to the true partition. However, the results of DDC are slightly lower than those of Han et al. [29]. The algorithm in Han et al. [29] is an improved density clustering method that uses region division and an adaptive Eps in every region to improve its clustering performance. By comparison, DDC relies only on the basic density clustering operation yet achieves similar performance, which shows that making effective use of the inheritance information in time-series data is valuable.

Table 9 compares NUM and T between DDC and the 4 competitors. Since all 4 competitors perform an independent and complete clustering process at each of the 5 time slices, their NUM is the same at all 5 time slices. In contrast, DDC can markedly reduce NUM and T from $t_{2}$ to $t_{5}$ by making full use of the clustering result obtained at $t_{1}$ .

Table 9

Comparisons of clustering cost (NUM and T)

Datasets	Time slices	Literature [26]		Literature [27]		AF-DBSCAN [28]		Literature [29]		DDC
		NUM	T	NUM	T	NUM	T	NUM	T	NUM	T
Iris	$t_{1}$	150	6.7	150	18.7	150	5.8	150	6.4	150	16.3
	$t_{2}$	150	7.1	150	18.8	150	5.6	150	6.3	46	4.8
	$t_{3}$	150	6.8	150	18.4	150	5.5	150	6.3	47	5.1
	$t_{4}$	150	6.7	150	18.7	150	5.6	150	6.2	51	5.1
	$t_{5}$	150	18.7	150	16.6	150	5.5	150	6.4	48	4.9
Seeds	$t_{1}$	210	9.6	210	20.3	210	8.1	210	8.7	210	22.1
	$t_{2}$	210	9.1	210	20.5	210	8.3	210	8.6	65	6.8
	$t_{3}$	210	9.2	210	20.5	210	7.9	210	8.7	67	7.1
	$t_{4}$	210	10.1	210	20.7	210	8.1	210	8.8	68	6.7
	$t_{5}$	210	9.6	210	20.6	210	8	210	8.6	67	7.2
Wine	$t_{1}$	178	8.3	178	19.1	178	7.6	178	8	178	19.2
	$t_{2}$	178	8.7	178	18.8	178	7.5	178	8.2	54	7.2
	$t_{3}$	178	8.1	178	19.2	178	7.7	178	8.1	55	7.6
	$t_{4}$	178	8.2	178	19.5	178	7.6	178	8.3	58	7.7
	$t_{5}$	178	8.5	178	19.1	178	7.9	178	8.5	55	7.6
Mushroom	$t_{1}$	8124	2875.4	8124	7277.6	8124	3266.3	8124	3748.7	8124	5733.7
	$t_{2}$	8124	2866.8	8124	7284.4	8124	3269.2	8124	3746.5	2455	1554.2
	$t_{3}$	8124	2871.7	8124	7271.7	8124	3266.3	8124	3744.8	2431	1551.6
	$t_{4}$	8124	2876.1	8124	7274.5	8124	3264.7	8124	3746.5	2509	1553.2
	$t_{5}$	8124	2870.3	8124	7267.1	8124	3263.3	8124	3751.7	2471	1587.2
Anuran Calls	$t_{1}$	7195	2580.3	7195	6981.8	7195	3179.8	7195	3153.5	7195	5574.4
	$t_{2}$	7195	2586.5	7195	6976.8	7195	3174.3	7195	3151.7	7195	1519.3
	$t_{3}$	7195	2581.7	7195	6985.8	7195	3178.5	7195	3154.2	7195	1523.2
	$t_{4}$	7195	2574.4	7195	6987.8	7195	3180.3	7195	3157.9	7195	1521.2
	$t_{5}$	7195	2579.6	7195	6979.8	7195	3177.7	7195	3149.6	7195	1526.4

In these five test problems, AF-DBSCAN has the lowest T value among the 4 competitors at most time slices. AF-DBSCAN adaptively determines the optimal global parameters Eps and MinPts based on a KNN distribution and a mathematical-statistical analysis. It also improves the way representative seeds are selected for region queries. These mechanisms help AF-DBSCAN reduce its overall time cost. In contrast, the T values of DDC from $t_{2}$ onward are generally lower than those of the 4 competitors, including AF-DBSCAN, and markedly lower on the two large data sets, because DDC avoids a large amount of repeated computation across the series of clustering processes by using the inheritance of cluster structure between two adjacent time slices. This makes DDC more efficient for time-series data.

Overall, using only basic density clustering operations, DDC achieves a clustering quality very close to that of Han et al. [29] and AF-DBSCAN. Moreover, it significantly reduces the cost, both in the number of elements involved in the clustering process and in overall time consumption.
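The size of the NUM reduction can be read directly from Table 9. The short Python sketch below copies the DDC values at $t_{2}$ for four of the data sets from the table and expresses them as a fraction of the full data set; the variable and function names are only illustrative.

```python
# (NUM of DDC at t_2, total number of elements), copied from Table 9.
ddc_num_t2 = {
    "Iris": (46, 150),
    "Seeds": (65, 210),
    "Wine": (54, 178),
    "Mushroom": (2455, 8124),
}


def num_fraction(processed, total):
    """Share of the data set that DDC involved in the computation."""
    return processed / total


fractions = {name: num_fraction(*pair) for name, pair in ddc_num_t2.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of the elements were involved")
```

For these four data sets the fraction lies between 30% and 31%, close to the 30% of elements that are changed at each time slice, while every competitor processes 100% of the elements at every slice.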

To further illustrate the efficiency of DDC, the DBSCAN algorithm, the AF-DBSCAN algorithm, and the method from Han et al. [29] were also run on the subset of the SEQUOIA 2000 database. The subset size was set to 7 values: 200, 500, 1000, 2000, 5000, 10000, and 15000. The time-series data were constructed in the same way as for the above 5 problems, so the numbers of changed elements (30% of each subset) in the 7 tests are 60, 150, 300, 600, 1500, 3000, and 4500.

We used T to compare the time efficiency of DDC and the 3 competitors, where T is the average running time on the subset of the SEQUOIA 2000 database over the 5 time slices. The T values are shown in Fig. 9.

Figure 9.

T values of 4 algorithms for the SEQUOIA 2000 database.

As shown in Fig. 9, the running times of all 4 algorithms increase sharply as the data size grows, but the T value of DDC grows visibly more slowly than the others. The running time of DDC is about 33% of that of DBSCAN, 51% of that of AF-DBSCAN, and 42% of that of the method from Han et al. [29]. So, for large time-series data, DDC can significantly reduce the time cost, especially for data with more than 5000 elements.

6. Conclusions

In this paper, we propose a dynamic density clustering method (DDC) for time-series data. By analyzing the transition rules of the data set and the cluster structure between two adjacent time slices, we found a distinct correlation of the data sets and succession of the cluster structure. Inspired by this, we designed DDC to reduce the cost of the whole clustering process for time-series data. Experiments on 6 representative problems show that:

Using only basic density clustering operations, DDC obtains high-accuracy results on 5 problems.

Compared to 4 state-of-the-art algorithms, DDC reduces the overall cost significantly, especially for large-scale problems.

Thus, DDC is a promising candidate for solving time-series data clustering problems. In future work, we will try to apply it to real-life problems.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61203311, in part by the Natural Science Basic Research Program of Shaanxi Province of China under Grant 2019JM-365, in part by the Key Research and Development Program of Shaanxi under Grant 2020SF-375, and in part by the Scientific Research Program Funded by Shaanxi Provincial Education Department of China under Grant 17JK0701.

References

1. T.C., A review on time series data mining, Engineering Applications of Artificial Intelligence 24(1) (2011), 164–181.

2. Zhao J.Y. and Li H.W., Time series spectral clustering analysis of taxi data, Bulletin of Surveying and Mapping 8 (2020), 112–116.

3. Xie F.D. et al., A time series dynamic clustering algorithm, Application Research of Computers 10 (2012), 3677–3680.

4. X.X., Research on Key Issues in Time Series Data Mining, Ph.D. Dissertation, University of Science and Technology of China, 2014.

5. Shan Z.N., Weng X.Q. and Ma C.H., Review of time series semi-supervised classification, Journal of the Hebei Academy of Sciences 35(2) (2018), 49–54.

6. Oliveira V.D. and Pedrycz, Advances in Fuzzy Clustering and its Applications, John Wiley & Sons, Inc. 2007, 1–454.

7. MacQueen J.B., Some Methods for Classification and Analysis of Multivariate Observation, in: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.

8. Rodriguez and Laio, Clustering by fast search and find of density peaks, Science 344(6191) (2014), 1492–1496.

9. Zhang, Ramakrishnan and Livny, BIRCH: an efficient data clustering method for very large databases, in: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, 25(2) (1996), 103–114.

10. Ester et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.

11. Qin J.R. et al., Self-adaptive local eps DBSCAN, Journal of Chinese Computer Systems 39(10) (2018), 2186–2190.

12. Kohonen, Self-organization and associative memory, Springer-Verlag, 1984.

13. Zheng S.P., An improved dynamic SOM algorithm and its application in clustering, Ph.D. Dissertation, South China University of Technology, 2010.

14. Zhang and Zhu H.D., An improved k-means dynamic clustering algorithm, Journal of Chongqing Normal University (Natural Science) 33(1) (2016), 97–101.

15. Agrawal, Faloutsos and Swami, Efficient similarity search in sequence databases, in: Proc of the 4th International Conference on Foundations of Data Organization and Algorithms, 1993, pp. 69–84.

16. Benítez et al., Dynamic clustering of residential electricity consumption time series data based on Hausdorff distance, Electric Power Systems Research 140(6) (2016), 517–526.

17. Wang and Meng J.Y., Similarity dynamical clustering algorithm based on multidimensional shape features for time series, Chinese Journal of Engineering 39(7) (2017), 1114–1122.

18. Łuczak, Hierarchical clustering of time series data with parametric derivative dynamic time warping, Expert Systems with Applications 62(12) (2016), 116–130.

19. Genolini and Falissard, KmL: k-means for longitudinal data, Computational Statistics 25(2) (2016), 317–328.

20. Min, Fan and Xie F.D., Clustering time series based on orthogonal function system, System Science and Mathematics 36(1) (2016), 53–60.

21. Izakian and Pedrycz, Agreement-based fuzzy C-means for clustering data with blocks of features, Neurocomputing 127 (2014), 266–280.

22. Sun and Li Z.H., Clustering algorithm for time series based on locally extreme point, Computer Engineering 41(5) (2015), 33–37.

23. Liu, Wang K.L. and Rao W.X., Non-equal time series clustering algorithm with sliding window STS distance, Journal of Frontiers of Computer Science and Technology 9(11) (2015), 1301–1313.

24. Shukri et al., Evolutionary static and dynamic clustering algorithms based on multi-verse optimizer, Engineering Applications of Artificial Intelligence 72(3) (2018), 54–66.

25. Bache and Lichman, UCI machine learning repository, 2013. Available: http://archive.ics.uci.edu/ml.

26. Zhang and Zhang, Improved k-means algorithm based on density canopy, Knowledge Based Systems (2018), 289–297.

27. Fan Z.X., Wang and Miao C.S., Improved BIRCH clustering algorithm based on connectivity distance and intensity, Journal of Computer Applications 39(4) (2019), 1027–1031.

28. Zhou Z.P. et al., An improved adaptive and fast AF-DBSCAN clustering algorithm, CAAI Transactions on Intelligent Systems 11(1) (2016), 93–98.

29. Han L.Z. et al., Multi-density clustering algorithm DBSCAN based on region division, Application Research of Computers 35(6) (2018), 1668–1671.
