Abstract
Data clustering is a well studied problem, where the aim is to partition a group of data instances into a number of clusters. Various methods have been proposed for the problem. K-means and its variants are the most well known examples. A common characteristic shared by the clustering algorithms is that they are all based on distance calculations between data points, or between data points and centroids. Hence, the efficiency of the proposed methods decline when big data is clustered. Clustering algorithms based on cellular automata have also been proposed in the literature. However, these methods are based on distance calculations, too. In this study, a new approach is proposed for the clustering problem. The method is based on the formation of clusters in a cellular automata by the interaction of neighborhood cells. The data points are mapped to fixed cellular automata cells, and the clusters are formed in a parallel fashion. The initial clusters formed spread in the cellular automata by uniting neighborhood cells in the same cluster. The rules utilized to compose clusters in the automata are inspired by the heat transfer process in nature. No distance calculation is used during the procedure. Therefore, it is possible to cluster huge datasets within a reasonable amount of time with the method proposed.
Introduction
Various clustering techniques have been proposed and utilized in fields such as data mining, machine learning and pattern recognition. The clustering problem can be defined as partitioning
The problem is in the NP-hard class, however various heuristic approaches have been proposed which can converge to a local optimum quickly. K-means [1] is a well known example, where the data points are assigned to clusters based on their distance to the cluster centroids. These centroids are chosen randomly at the beginning of the procedure, however they are updated throughout the iterations, so that more homogeneous clusters can be formed. Even though k-means utilizes a greedy approach that can return a sub-optimal solution, in practice k-means and its variants have been successfully used to cluster various kinds of data [2]. As noted above, the distance from each element to all of the centroids has to be calculated in each iteration of the algorithm. Hence, the number of elements in the dataset is one of the factors that determine the time complexity of the algorithm, and the efficiency of the method declines when huge datasets are clustered.
In contrast, cellular automata (CA) based clustering techniques have been proposed in the literature [3, 4]. These techniques map the initial dataset to the cells of a cellular automaton. However the formation of clusters is carried out again by distance calculations among the elements in the dataset. Similar elements are moved to neighborhood locations in the automaton and clusters emerge as more elements gather in a certain region [3].
In this study, a stochastic cellular automata algorithm named SCA-clustering is proposed in order to cluster a dataset without performing any distance calculations. The method first maps the elements in the dataset to fixed cells in a CA. Initially, each cell that has a data point is considered as a separate cluster in the automaton. Then, by using the local interactions between the cells, larger clusters are formed. The process of spreading the clusters in the automaton is inspired by the heat transfer process in nature. The CA cells that have a data point are considered as heat sources that generate heat energy continuously. This energy is spread to the neighborhood cells and after a while, the regions that consist of the natural clusters warm up in the automaton. On the other side, a second cellular automata rule is utilized simultaneously, which combines hot neighborhood cells into the same cluster. Hence, the initial clusters represented by the data cells start to merge and spread in the CA.
The rest of the paper is organized as follows: In Section 2, related work about clustering approaches and background information for the cellular automata model are given. In Section 3, the cellular automata model utilized in this study and the rules used for formation of clusters are presented. The experimental results can be found in Section 4. Lastly, we conclude in Section 5.
Related work
Density based clustering algorithms are popular in the literature. The method is based on grouping together the instances that exist in high-density regions of the data [5]. As expected, such instances are detected again based on distance calculations and the approach has the same disadvantage as k-means, for large datasets. Variations of the initial algorithm exist in the literature. In [6], GDBSCAN, which is a generalized version of DBSCAN, is introduced. The algorithm is used in some real world applications in the study. In [7], the authors propose a new DBSCAN algorithm; P-DBSCAN which is utilized to analyze places and events via geo-tagged photos. It should be noted that these alternative methods are also density based, hence the clustering process is again based on distance calculations.
In a recent study presented in [8], the cluster centroids are determined again by using density peaks. The proposed approach is based on a previous work, Clustering by Fast Search and Find of Density Peaks (CFSFDP). CFSFDP has some limitations in terms of determining density peaks so authors propose a novel CFSFDP method that utilizes a heat diffusion. This is derived from the concept of Weiner Process where a Gaussian kernel is used to estimate the next state.
Other methods have also been proposed for the clustering problem. For instance, a genetic algorithm is used in [9] and an artificial bee colony in [10]. In [11], the authors propose a method to determine the number of clusters which exist in a dataset before the k-means process starts. Genetic algorithms are utilized to determine the optimum number of clusters in this study. However, similar distance calculations are utilized in these studies, too. Even though clustering is a well studied problem, to the authors’ knowledge, no study attempts to form clusters without distance calculations among the elements in a dataset.
As noted in Section 1, cellular automata based clustering approaches have been proposed in the literature. However there are vital differences between the ideas proposed in these studies and the framework utilized in our work. For instance in [3], multi-dimensional data instances are mapped into a linear cellular automaton and the clustering process is carried out on this linear structure. Although there are neighborhood interactions among the cells, the clusters are formed by moving the data items autonomously in the automaton. In [4], an ant clustering algorithm (ACA) is proposed. Each ant represents a data item, and all ants are placed in a cellular automaton. Based on a fitness value, the ants propagate in the CA to form clusters. However, the fitness value is again determined by calculating the distance values between the data item represented by the ant and data items represented by other agents in the CA.
A similar idea is proposed in [12]. The data items are randomly mapped to a two-dimensional CA. Based on some stochastic rules, the data items move concurrently in the CA. A harmony function is defined for the cells which measures the similarity of the data item in the cell and the data items in neighborhood cells. As expected, this similarity measure is again based on the distance between the data items.
In contrast, a classification method based on CA is proposed in [13]. An energy function is utilized in this study. However, the model proposed for the classification process is quite different to the standard CA applications. Each feature of the dataset is associated with a column in the CA. The cells of a column might have low or high energy values and these energy states are utilized to denote whether the training sample is classified correctly or not. Certainly, the learning procedure iterates based on the interaction between the current and neighbor cells. The main disadvantage of the proposed method is that the energy parameter and threshold values utilized have to be tuned for each dataset separately. The work presented in [13] has been enhanced by using a Moore neighborhood in [14].
Cellular automata
Cellular automata (CA) is a discrete system composed of interconnected cells. The computation in a CA is based on the interaction among neighborhood cells. Each cell can be in one of the predefined states and the next state of the cell is based on the neighborhood states. Hence all transitions occur locally in a cellular automata where a massive parallelism could be achieved. The most well known example to CA is the Conway’s Game of Life [15]. Various CA applications have also been proposed mainly as simulation tools for various disciplines [16, 17, 18, 19, 20].
Neighborhood types.
A CA can be considered as a lattice in
As mentioned in Section 1, a stochastic cellular automata (SCA) is proposed in this study for clustering huge datasets efficiently. The SCA-clustering method updates the states of individual cells asynchronously by using some transition rules. The cells are chosen randomly for the update procedure.
The SCA-clustering first assigns the instances in the dataset to the cells of an
Certainly, more than one data point can be assigned to the same cell if there are data points which are close to each other in the dataset.
In a standard CA application, each cell in the automata can be in one of a finite number of states and, based on a predetermined rule, each cell can change its state depending on its own state and the states of the neighborhood cells. In this study, certain updates have been carried out on this standard framework in order to utilize the CA model for the clustering task.
The method aims to represent the various clusters in the dataset with individual states in the CA cells. Hence, if a group of CA cells are in the same state, then the data points they contain are considered to be in the same cluster. Initially, each CA cell that contains a data point is assigned a distinct state. Each state is denoted with a unique integer value. The cells that do not contain data points are accepted to be in state 0. Hence, if there are
In the proposed method, the cells can change their state again based on the states of the neighborhood cells. The grouping of adjacent cells into the same state is expected to reveal the cluster formation that exists in the dataset. At the end of the procedure, the aim is to obtain
As noted in the previous section, the formation of the clusters in the CA is inspired by the heat transfer process in nature. That is why; a temperature value is also kept for each cell in our model besides the state information. The cells change their state based on their – and neighborhood – cell temperatures. The cells that have a data point inside are considered to be heat sources. Such cells have a fixed temperature of 100
In Fig. 2, the initial configuration of a simple example CA is given. Each cell contains two integers, where the first one denotes the temperature and the second one the initial state of the cell. The cells that are in state 0 that do not contain a data point hence they also have the temperature 0
Initial configuration.
Last configuration.
The clustering process.
In Fig. 4, the same procedure is illustrated, this time on a real dataset. The initial configuration is given in Fig. 4a. Here, only the distribution of the data is shown in the CA. In Fig. 4b, an intermediate configuration of the CA is presented; the initial clusters represented by the data points have started to spread into the neighborhood cells. Figure 4c contains the last configuration with two clusters. The state information of each cell is denoted through the use of various colors and the temperature information is represented by color tones, where darker tones represent higher temperatures. As illustrated, the temperature has a tendency to increase towards the center of the clusters.
[h] Heat Transfer in CA[1] HEAT–TRANSFER (Cell C)
The method used for the heat transfer procedure can be seen in Algorithm 4. The procedure is applied on a randomly chosen cell
[h] State Transfer in CA[1] STATE–TRANSFER (Cell C)
[h] The General Algorithm[1] SCA–CLUSTERING (CELLULAR AUTOMATON CA) iteration
A second transfer rule is utilized in our CA model. The first rule enables the system to transfer the heat energy produced by the data cells to other cells in the CA, but the second rule is utilized for changing the states of the CA cells. Note that each state represents an individual cluster in the CA. This second rule is presented in Algorithm 4. The rule is simultaneously executed with the heat transfer rule, again on randomly chosen cells. When the cells warm up sufficiently, they start to change their states based on this second rule. Initially, each cell containing a data point is in a unique state and all other cells are in state 0. As seen in the algorithm, the neighbors of the randomly selected cell
The general algorithm used for SCA-clustering is given in Algorithm 4. In the algorithm, a loop is utilized which repeatedly applies the heat transfer and state transfer rules on randomly selected cells. This continues until a termination criterion is reached. One possible criterion is to continue with the process until the number of states fall down to
The number of cells in the CA would be very effective in the success of the proposed approach. If a sufficient number of cells are not utilized, a data distribution would be obtained in the CA where the clusters would be very close to each other. In such a case, the risk of merging clusters by the state transfer rule would increase. The time complexity of the method is based on the number of cells in the CA, hence, it is not possible to increase the size of the CA in an unrestrained manner.
The initial experiments have been carried out on two-dimensional datasets. It has been observed that a 200
where
Experiments are conducted by using datasets utilized in other studies or created by a synthetic cluster generator. This section contains two subsections. In the first subsection, the general analysis of SCA-clustering is presented. The method is tested on datasets with various numbers of clusters and dimensions, and with clusters of varying topology. In the next subsection, the comparison of SCA-clustering with k-means and DBSCAN algorithms is carried out. A Unix machine that has a 2, 6 GHz Intel Core i5 and 8 Gb memory is used in the experiments.
Experimental results for datasets in Fig. 5
Experimental results for datasets in Fig. 5
Different types of datasets are used during experiments. (a)–(e) are in 2-Dimension, (f) in 3-Dimension.
In Fig. 5, datasets utilized in multiple studies are presented [21, 22, 23, 24, 25]. The datasets are two-dimensional except the last one which includes two rings in three dimensions. Note that some of the datasets have uniform distribution. It is well known that k-means and other classical clustering approaches are not suitable for finding clusters that do not have a hyper-spherical topology. The natural clusters presented in Fig. 5b, c and f are examples where k-means and its variants would fail. However, the formation of clusters is based on the local interactions of neighborhood cells in SCA-clustering. Hence, the method is invariant to any variance in shapes and topology; the clusters might have. An example that consists of clusters with considerably varying sizes (Fig. 5a) is also included in the experimental set. Table 1 presents the experimental results for the datasets shown in Fig. 5.
As seen in Table 1, the SCA-clustering algorithm can cluster the datasets accurately even though they have multiple characteristics. The success rate and the runtime statistics are obtained by averaging 20 runs on each dataset. The success rate is determined by considering the percentage of correctly clustered instances. The rand index values are also calculated and presented in the table.
The CA utilized in the experiments have a fixed size 200
The box/quartile plots of these results are given in Fig. 6. Again, an expected result can be seen in the figure where the deviation of the runtime is minimal for datasets that contain compact clusters.
Runtime change based on
Runtime change based on
Box/quartile plot of results obtained on datasets in Fig. 5.
As noted above, the number of cells (
Average runtime of different size CA.
In this section, the comparison of SCA-clustering with k-means and and DBSCAN is presented. As noted in the previous section, k-means is expected to have a worse performance on datasets that do not contain hyper-spherical clusters. Again, 20 runs have been averaged in order to determine the performance of each method on a dataset. The same seed values are utilized in both algorithms. The comparison with k-means is displayed in Table 3.
Comparison between k-means and SCA-clustering
Comparison between k-means and SCA-clustering
In Table 3, the first column consists of the names of the datasets. The first six datasets are the ones utilized in the previous section. The rest are generated by using a Gaussian Generator which is publicly available.1
The variation in runtime can be seen in Fig. 8, for the two methods. The results are from datasets that have three dimensions. The average runtime of k-means increases considerably on the largest dataset, the variation is also huge for k-means on this last dataset.
When we compare the success rate of the two algorithms, we observe that SCA-clustering has quite a satisfactory performance against k-means. For instance, the success rate of k-means goes down to 64% for the Chainlink dataset presented in Fig. 5f. SCA-clustering has a worse result (68%) on the dataset
In Section 3.2, it was noted that the dimension of the dataset creates a severe limitation for SCA-clustering. The success rate of SCA-clustering goes down to 54% on the dataset
Silhouette scores
Runtime changes depending on number of instances in 3-dimensions.
Comparison between DBSCAN and SCA-clustering
As mentioned in the introduction section, density based clustering algorithms are also popular in the literature. SCA-clustering is also compared to a DBSCAN implementation which is utilized in [26]. In Table 5, the performance of the two algorithms on the datasets utilized is presented. DBSCAN has a perfect performance on the datasets created by the Gaussian Generator which is expected to form spherical clusters, but it is also heavily dependent on distance calculations among the elements in the dataset, like k-means. This results in a considerably worse runtime performance from the generated datasets compared to SCA-clustering and even to k-means. For the large datasets like
For the datasets; Banana, Jain,
A common characteristic of the classical clustering algorithms is that they are all based on distance calculations between the data instances. In this paper, a novel approach, based on CA, is proposed for the clustering problem. The method is inspired by the heat transfer process in nature. The clusters are formed by local interactions between the CA cells, hence no distance calculation is needed for the procedure.
The approach is tested on datasets of differing characteristics. It has been observed that the success of the method is not limited to datasets that include only hyper-spherical clusters. Differing datasets which include clusters of varying shape and sizes can be clustered successfully with the method proposed. The only limitation that exists for the approach is the number of dimensions in the dataset. It has been observed that the approach is applicable only to low dimensional data and it becomes impractical with high dimensional datasets.
Some future work can be considered to improve the algorithm. Firstly, the method is utilized with CA that has an equal number of cells in each dimension. This factor limits the number of dimensions that can be handled by the method. A more dynamic model could be proposed which would utilize a variable number of cells in each dimension based on the characteristics of the data. If such an approach is used, it might be possible to cluster higher dimensional data compared to the datasets utilized in this study. Also, all datasets utilized in the experiments consist of only numerical attributes. Since the number of cells in each dimension is fixed, the current model is not appropriate for categorical data. A dynamic model could be beneficial for categorical data, too. Furthermore, the proposed approach has been tested on datasets with minimal noise and without missing values. Noisy data might affect the performance of the system considerably. Therefore, an analysis is needed in order to gain insight into the behavior of the system on noisy data.
In this study, SCA-clustering has been compared with two clustering approaches; DBSCAN and k-means. These are robust and reliable clustering methods that have been utilized for many clustering applications. However, there are many other variations of k-means and density based methods proposed in the literature. It is possible to enhance the analysis of SCA-clustering by comparing it with other state-of-the-art methods in the literature.
In other future work, the approach can be used as an ensemble with other clustering methods. For instance, SCA-clustering can be used to determine the initial centroids of huge datasets quickly, then the fine tuning that will obtain the final clustering can be carried out by k-means or its variants. It is also possible to use the approach for the classification problem. The heat transfer procedure and the state transition rules utilized in this study might create efficient separators in the input space when the classification problem is considered.
