Abstract
In many clustering problems, the whole data is not always static. Over time, part of it is likely to be changed, such as updated, erased, etc. Suffer this effect, the timeline can be divided into multiple time segments. And, the data at each time slice is static. Then, the data along the timeline shows a series of dynamic intermediate states. The union set of data from all time slices is called the time-series data. Obviously, the traditional clustering process does not apply directly to the time-series data. Meanwhile, repeating the clustering process at every time slices costs tremendous. In this paper, we analyze the transition rules of the data set and cluster structure when the time slice shifts to the next. We find there is a distinct correlation of data set and succession of cluster structure between two adjacent ones, which means we can use it to reduce the cost of the whole clustering process. Inspired by it, we propose a dynamic density clustering method (DDC) for time-series data. In the simulations, we choose 6 representative problems to construct the time-series data for testing DDC. The results show DDC can get high accuracy results for all 6 problems while reducing the overall cost markedly.
Introduction
Time-series data refers to a data set sequence in chronological order, which expresses a series of internal states of data that is changed with time, is characterized by its numerical and continuous nature. Time-series data analysis is used in various fields [1]. For example, in a city, taxis generate a lot of location information at every moment. Between adjacent moment, the position information of the taxi also change. By clustering those position data, we could explore the people’s travel behavior patterns, so as to provide differentiated services for different spatial units, which will realize smart lives [2]. To deal with this time-series data, we need to abstract the above scene into a time series model, which means that the data set is composed of data on multiple time slices, and as time changes, the data on every time slice will change locally. We mainly explore the clustering methods for the above model.
For the above time series data clustering, there are two main ideas: one idea is to extract the overall features of the time series for clustering, the other is to directly process the data on each time slice. Driven by the first idea, some time-series clustering algorithms project time-series data as a whole into an n-dimensional space or use deep learning methods to directly extract the feature vector of the time series data, and used static cluster algorithms to handle those data [3]. This type of method lacks the analysis of the data state of adjacent time slices and cannot dig out the evolvement trend of data and cluster in the time sequence, only utilizing the holistic or abstract characteristics of the time series. However, time-series data changes over time, these methods cluster them as a whole, result in cannot reflect the dynamic characteristics of time series and of lack interpretability of results. Affected by the second idea, some time-series clustering methods implement a static clustering process at every time slice [4], which is easy to operate but costly when there are multiple time slices. Furthermore, there is a conspicuous data similarity between two adjacent time slices because generally only a few data points would be updated between adjacent time series, so there will be lots of repetitive operations in the above clustering process that causes large time consumption [5].
In order to increase the interpretability of time series clustering, we explore clustering methods in the second category of ideas. Since time series data will change with time, we split it into each time slice for analysis. By comparing the data status on adjacent time slices, we found that not all the attributes and categories of data points will change, which remain in the attributes and categories of the previous time slice, and only a small number of data points change their attributes and categories. Therefore, under certain conditions, time series data has a strong correlation between data sets between adjacent time slices, and there is a certain degree of inheritance between cluster structures.
Based on the above findings, in this article, we propose a dynamic density clustering method (DDC) to deal with the time-series data. The main contributions of this paper are summarized as follows:
We analyze the shifting process of data set and cluster structure between two adjacent time slices, and we prove that there are a distinct correlation and succession of data set and cluster structure. We propose a dynamic density clustering method (DDC) to deal with the time-series data. By making full use of the inheritance information containing in the time-series data, DDC improves the clustering performance while reducing the cost significantly compared to the traditional clustering mode. Meanwhile, it can reveal the changing trend of the cluster structure and the attributes of data points. We evaluated the proposed approach using extensive experiments on six benchmark datasets, and the experimental results validate the effectiveness of our approach.
The rest of this paper is organized as follows. In Section 2, we describe related techniques and representative works. In Section 3, we analyze the variation rules of cluster structure between two adjacent time slices and the defects of traditional clustering mode in processing time-series data. In Section 4, the proposed DDC is presented in detail. In Section 5, the comprehensive experiments are implemented using benchmark functions to evaluate the performance of our approach. In the last section, we draw the conclusions from this paper and point out the possible future work.
In general, we can divide the form of data and clustering process into two categories: dynamic and static. From the point of view of the above, the clustering analysis methods can basically fall into four types [6].
Firstly, all of the data and the clustering process are static. For example, the K-means algorithm, proposed by MacQueen [7], is a typical static clustering algorithm possessing the advantages of high-speed and good stability. But if we modify the K (a static integer value for the cluster number) and the initial cluster center points, the computational results will be affected strongly. To improve it, Rodriguez et al. [8] proposed to select cluster center points in some subregions with high density and keep a relatively large distance between them. The BIRCH algorithm [9] is a kind of hierarchical clustering algorithm that has high clustering efficiency. This algorithm can handle noise points effectively. But the data applied to this algorithm must be static. And, this method is unsuitable for finding arbitrary shape clusters. Ester [10] proposed the DBSCAN algorithm, a density-based method, which can efficiently discover arbitrary shape clusters, but it is sensitive to the Neighborhood radius (Eps) and the data distribution. To solve it, Qin [11] proposed an adaptive approach to regulate the cluster radius dynamically. Kohonen [12] proposed SOM (Self-organizing Maps) neural network, which is a typical no-mentor clustering algorithm. However, the network structure of SOM is fixed, so the number of neurons is unchanged, which cannot be used to clustering freely [13].
Secondly, the data is static, but the number of clusters can be updated in the clustering process. Such as, Zhang et al. [14] proposed an improved K-means algorithm, which can delete the redundant information generated by the change of cluster structure to reduce the total computation requirements. However, this algorithm just applies to static data.
Thirdly, the data is dynamic and time-varying, but the cluster structure is static. For example, Agrawal et al. [15] proposed a feature representation method to translate the time domain of time-series data into a frequency domain by the discrete Fourier transform. Similarly, based on the K-means algorithm, Benítez I et al. [16] used a similarity distance calculated by the Hausdorff similarity degree to implement the clustering analysis for the dynamic electricity data. And, Wang et al. [17] selected part of feature points containing important information to reducing the time-series dimension. But it is hard to ensure the reliability of results.
In the final category, both the data and the cluster structure are constantly changing during the clustering process. Such as, considering the advantages of dynamic time warping and time extended method for the unequal time-series data, Luczak [18] proposed a new distance metric, which is more suitable for the time-series data clustering. In Genolini and Falissard [19] the Haar wavelet transform was used to represent time-series data, as well as the K-means algorithm and the Euclidean distance were adopted to data clustering in the new feature space. Min et al. [20] described an improved fuzzy C-means algorithm, which is an extended version of fuzzy C-means (FCM) for time-series data. Izakian and Pedrycz [21] proposed an augmented method of Euclidean distance for fuzzy clustering in time-series data, but with high complexity. Xie [3] implemented a time-series clustering method based on Euclidean distance and fuzzy C-means clustering algorithm. In this method, equidistant processing for each time slice is used to reduce the implementation complexity. But it also makes this method more sensitive to the segmentation of time frame. Besides, Sun and Li [22] proposed a time-series hierarchical clustering algorithm based on dynamic bending distance, but its similarity measuring method has high computation complexity, which causes comparatively low efficiency. Combining the sliding window idea into the K-means clustering algorithm, Liu Qin [23] proposed a short and un-equidistance time-series data clustering algorithm. This method is difficult to determine the best cluster number. Shukri [24] proposed a dynamic clustering algorithm based on a nature-inspired search algorithm called Multi-verse Optimizer (MVO), in which the number of clusters is automatically detected without any prior information. However, this method is not suitable for high dimensional data.
The problem studied in this article belongs to the fourth category. From the above discussion, we can find most traditional clustering methods apply only to the static data, and the potential regularity of time-series data has not been exploiting fully. In this study, we start to analyze the internal rules of data and cluster structure when the time slice shifts from one to the next so as to construct a more efficient time-series data clustering method.
Time-series data clustering process analysis
Data correlation and cluster structure inheritance
Without loss of generality, a typical clustering process at one time slice can be described as follows: The data set
Apparently, the data turnover will lead to cluster structure change. We assumed that there are
For the convenience of analysis, the above four statuses can be sorted into two types: elements changed and elements unchanged. We assume the probability of elements changed is
Since the number of elements at
Obviously, the construct of data set has a direct relationship with
When
Under the condition of
When
Under the condition of
Similarly,
It can be seen from Eq. (5) that
In Eq. (7), it can be known that
In addition, when
When
Added elements may cause damage to the existing cluster structure. The left part is a case that added elements are distributed in existing groups. The right part is a case that several added elements are not in existing groups.
We assume
And, the probability of keeping the existing cluster is
Obviously, under the condition of
The core operation of element displacement is an element has been moved to a new position when the time slice switch from
Analysis of traditional clustering algorithm for time-series data
In this section, we focus on the discussion about the computation process of the K-means algorithm (partition clustering mode), the BIRCH algorithm (hierarchical clustering mode) and the DBSCAN algorithm (density clustering mode) for time-series data. What we want to know is which one is more suitable for dealing with time-series data. In the following, we take the process of removing elements as an example to compare them. As Fig. 2 shown, the element F has been removed at
Clustering result of K-means algorithm for two adjacent time slices when 
Clustering result of BIRCH algorithm for two adjacent time slices. Left figure is the merging process at 
In Fig. 3, we show the situation of clustering results of the BIRCH algorithm at
In the DBSCAN algorithm, Eps represents the radius length centered on one point in problem space, and MinPts represents a given threshold. If the quantity of elements in an Eps is bigger than MinPts, the central point will be a core point. The boundary points locate in the Eps-Neighbourhood of one core point, and the noise points do not belong to any core point or boundary point. If any two core points are close enough, the two core points and their boundary points will belong to the same group. Figure 4 is the clustering result as above. But, when element F has been removed, the attributes of Element G, element D, and element E will be updated naturally, which generates a new cluster structure directly.
Clustering result of DBSCAN algorithm for two adjacent time slices. The left figure is the clustering result at 
In summary, the density-based clustering process is able to rebuild the cluster flexibly with a relatively small cost when a few elements have been changed, which provides a better way to handle the time-series data.
In this section, we propose a density-based dynamic clustering method (DDC) for time-series data, which includes 3 processes for updating cluster structure: element removing (ER), element adding (EA) and element displacing (ED). Figure 5 shows the processing flow of DDC. The input of the algorithm contains
The processing flow of DDC.
When an element was removed, the rest of the elements within its neighborhood defined by Eps would be affected directly. Suffering this, we need to update the attributes of the elements within the neighborhood one by one. Meanwhile, it causes a chain reaction. The changed attributes of those elements are also likely to affect other elements in their neighborhoods. Obviously, we should handle all of these to get a new cluster structure. We assume that Fig. 6 is the original state of all elements at
State of elements when element A exists. The circle represents the Eps neighborhood of one element, and MinPts 
State of elements when element A not exists.
All in all, in the element removing process (ER), the main procedure includes the following steps: (1) deleting the element; (2) checking whether the removed elements affect other elements within its neighborhood or not; (3) If there is evidence that some attributes have been changed, we will recalculate the elements within the Eps neighborhood one by one until there is no changed attribute in the neighborhood.
The elements adding process (EA) is an inverse process of ER. When an element has been added, the other elements within its Eps-neighborhood will be affected and chain reactions will be activated too. We assume Fig. 7 is the state of elements at
In conclusion, the key of EA process is also to find the elements located in the Eps-neighborhood whose attribute is changed of because the added element. Then check other affected elements within the Eps-neighborhood until no element’s attributes have been changed in this process.
The category and attribute of elements at
and
time slice
The category and attribute of elements at
A chain reaction caused by element A which has been moved from point 
The element displacing process (ED) can be considered as the superposition of EA and ER. We use an example to specify this process. Firstly, we assume the data in Fig. 6 is the data at
From Table 1 and Fig. 6, we can find that all elements belong to group ⟀ at
In summary, the DDC can clearly show the details of changing attributes for each element and the corresponding variation of cluster structure at different time slices. This is just what the traditional clustering method does not have.
Time complexity analysis
We will discuss the time complexity of DDC in the worst case and the ordinary situation. We assume that there are N elements in total and M elements have changed from
In the situation of the worst case, m elements’ change has affected all elements’ attributes at
In the situation of the ordinary situation, M changed elements will affect the attributes of other elements at
All in all, when the probability of element change is small, there will be a relatively obvious correlation of data composition and cluster structure between two adjacent time slices, therefore, the DDC algorithm merely needs to find the elements within the Eps neighborhood of a few changed elements by taking full advantage of the inheritance of the class structure in the time series, which could greatly reduce redundant calculations and improve clustering efficiency compared to DBSCAN algorithm.
Experiments
We implemented the DDC in MATLAB 8.3, and performed a variety of experiments on PC (2.2 GHz, 4 GB RAM, Windows 7) including three classic data sets [25]: Iris, Seeds, Wine, and three larger data sets [25]: Mushroom, Anuran Calls, a subset of the SEQUOIA 2000 benchmark database [10]. Both Iris, Seeds, Wine, Mushroom and the Anuran Calls are derived from the machine learning UCI database [25].
Time-series data generation
The Iris dataset contains 3 groups, which are Setosa, Versicolor and Virginica. Meanwhile, each group contains 50 instances, and each instance has four attributes: sepal length, sepal width, petal length and petal width. The Seeds dataset contains three different types of wheat, which was created by the Institute of Agricultural Physics of the Polish Academy of Sciences. The Wine dataset is the statistics of three different varieties of wine produced in the same region of Italy. The mushroom dataset is the descriptions of 23 species of mushrooms’ hypothetical samples, which includes 2 groups and 8124 instances with 22 attributes. The Anuran Calls (MFCCs) dataset is acoustic features extracted from Anuran’s syllables, which contains 4 groups, 7195 instances with 22 attributes. Table 2 is the information of the above 5 problems. Besides, the subset of the SEQUOIA 2000 benchmark database contains 62556 Californian names of landmarks, extracted from the US Geological Surveys Geographic Names Information System, together with their location.
In this paper, the testing data sets are time serialized as follows: at each time slice, we selected 30% of elements by a random number, and those elements will be changed randomly by 3 kinds of processing that are disappearing elements, adding elements, and displacing elements. In the disappearing process, we directly delete the elements from the data set. In the added process, new elements in the space formed by all elements are randomly selected. And in the displacing process, we will randomly select new points in the field of the original element. This operation is performed four times, and we get 5 continuous time-series datasets. Through the above mechanism, we could construct the time series data based on an ordinary data set and made it very different from the original data. Meanwhile, the constructed data can simulate the specific scenarios handled by DDC. Experiments on these data sets can well illustrate the effectiveness of our proposed algorithm.
Information of 5 problems
Information of 5 problems
The experimental results were evaluated across six metrics: Accuracy (ACC), Precision (PR), Recall (RE), Adjusted Rand index (ARI), Running time (T, unit ms) and the number of data involved in the calculation (NUM). ACC is the correct proportion of clustering results. PR defines how many the samples predicted positive is correct. RE measures how many positive samples are predicted correctly. ARI measures the degree of agreement between two data distributions. T is the running time of an algorithm on a single time slice, and NUM is the number of elements involved in one clustering computation. Overall, both ACC, PR, RE, ARI, and T are static indicators for the clustering result at a time slice. NUM of the clustering process is used to show the operational overhead.
If the clustering result is closer to the true division of the data set, the value of ACC, PR, RE, ARI will be close to 1. In other words, the larger value of the above four evaluation criteria is, the better the cluster quality and clustering efficiency of DDC is. The following are the definitions of the 4 indicators:
where
To compare the performance of DDC with state-of-the-art designs, 4 representative algorithms were chosen as competitors: the improved K-means algorithm based on density Canopy [26], the improved BIRCH clustering algorithm based on connectivity distance and intensity [27], the adaptive and the fast density-based spatial clustering of applications with noise (AF-DBSCAN) [28], and the multi-density clustering algorithm [29].
We used 3 small-scale datasets and 2 large-scale datasets to compare the performance of DDC with competitors. Meanwhile, the subset of the SEQUOIA 2000 database as a large-scale dataset was used to compare the time efficiency of DDC with 4 state-of-the-art designs.
The following is the experimental procedure: For each time slice, the four competitors implemented an independent and complete clustering operation for the dataset belong to the current time slice, and outputted results and related indicators. In contrast, because the element does not dynamically change in the initial state at
Firstly, we compared the clustering accuracy on 5 time slices between DDC and 4 algorithms from literature [26, 27, 28, 29]. The accuracy rates are shown in Table 3. We can find that the algorithm from Han et al. [29] has the highest correct proportion in clustering results for 5 datasets. The accuracies of DDC for 5 testing data are slightly lower than Han et al. [29] and very close to AF–DBSCAN from Zhou et al. [28]. Meanwhile, DDC is much better than the results from Zhang et al. [26] and Fan et al. [27].
Comparisons of accuracy among 5 algorithms
Comparisons of accuracy among 5 algorithms
Comparisons of PR, RE, and ARI for Iris with 5 time slices
Comparisons of PR, RE, and ARI for Seeds with 5 time slices
Comparisons of PR, RE, and ARI for Wine with 5 time slices
Comparisons of PR, RE, and ARI for Mushroom with 5 time slices
Comparisons of PR, RE, and ARI for Aunran Calls with 5 time slices
The comparisons of PE, RE, and ARI between DDC and 4 competitors for 5 testing problems are shown in Tables 4–8. We can find that the method from Han et al. [29] gets the best values for PE, RE, and ARI. The PE, RE, and ARI of DDC are close to 1, which means the clustering results of DDC are very similar to the true answers. However, the results of DDC are slightly lower than those of Han et al. [29]. The algorithm in Han et al. [29] is an improved density clustering method, which uses region division and adaptive Eps in every region to improve its clustering performance. Relatively, DDC is just based on the basic density clustering operation but gets a close performance, which shows making effective use of the inheritance information in the time-series data is valuable.
Table 9 are the comparisons of NUM and T between DDC and 4 competitors. Since all 4 competitors have performed an independent and complete clustering process at 5 time slices, they have the same NUM at all 5 time slices. In contrast, DDC can markedly reduce the NUM and T from
Comparisons of clustering cost (NUM and T)
In these five testing problems, the results of AF–DBSCAN have the lowest T value among the 4 competitors at most of time slices. AF-DBSCAN can adaptively determine the optimal global parameters of Eps and MinPts based on a KNN distribution and mathematical-statistical based analysis process. Meanwhile, AF-DBSCAN improved the way of selecting the representative seed to query region. All of the above mechanisms help AF-DBSCAN reduces its overall time cost. In contrast, the values of T of DDC are significantly less than 4 competitors including AF-DBSCAN because it can save lots of double counting in a series of clustering processes by using the inheritance of cluster structure between two adjacent time slices. That makes DDC is more efficient for the time series data.
Above all, DDC can get a very close clustering effect to Han et al. [29] and AF-DBSCAN just based on basic density clustering operations. Moreover, it significantly reduces the cost no matter the number of elements involved in the clustering process or overall time consumption.
To further illustrate the operation efficiency of DDC, the DBSCAN algorithm, the AF-DBSCAN algorithm, and the method from Han et al. [29] were used to test the subset of the SEQUOIA 2000 database. The number of subsets has been set in 7 sizes including 200, 500, 1000, 2000, 5000, 10000, and 15000. The time-series data constructing procedure is the same as the above 5 problems. So, the number of changed elements in 7 tests will be 60, 150, 300, 600, 1500, 3000, 4500.
We used the value of T to compare the difference of time efficiency between DDC and 3 competitors, in which the value of T is the average value of the running time of the subset of the SEQUOIA 2000 database on 5 time slices. The result of T values is shown in Fig. 9.
T values of 4 algorithms for SEQUO IA 2000 database.
As it is shown, the running times of 4 algorithms mark a big escalation with the growth of data size. Apparently, the increment of T value of DDC is visibly slower than others. And, the running time of DDC is about 33% of DBSCAN, 51% of AF-DBSCAN, and 42% of the method from Han et al. [29]. So, for large time-series data, the DDC can significantly reduce the time cost, especially for the data whose element number is more than 5000.
In this paper, we propose a dynamic density clustering method (DDC) for the time-series data. Based on analyzing the transition rule of data set and cluster structure between two adjacent time slices, we found there is a distinct correlation of data set and succession of cluster structure. Inspired by it, we design the DDC to reduce the cost of the whole clustering process for the time-series data. Experiments for 6 representative problems show that:
Using basic density clustering operations, DDC can get high accuracy results for 5 problems. Compared to 4 state-of-the-art algorithms, DDC can reduce the overall cost significantly, especially for the largescale problems.
Thus, the proposed technique DDC is a potential candidate to solve time-series data clustering problems. In the following works, we will try to apply it to some real-life problems.
Footnotes
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 61203311, in part by the Natural Science Basic Research Program of Shaanxi Province of China under Grant 2019JM-365, in part by the Key Research and Development Program of Shaanxi under Grant 2020SF-375, in part by the Scientific Research Program Funded by Shaanxi Provincial Education Department of China under Grant 17JK0701.
