Data clustering with stochastic cellular automata

Abstract

Data clustering is a well studied problem, where the aim is to partition a group of data instances into a number of clusters. Various methods have been proposed for the problem. K-means and its variants are the most well known examples. A common characteristic shared by these clustering algorithms is that they are all based on distance calculations between data points, or between data points and centroids. Hence, the efficiency of the proposed methods declines when big data is clustered. Clustering algorithms based on cellular automata have also been proposed in the literature. However, these methods are based on distance calculations, too. In this study, a new approach is proposed for the clustering problem. The method is based on the formation of clusters in a cellular automaton by the interaction of neighborhood cells. The data points are mapped to fixed cells of the cellular automaton, and the clusters are formed in a parallel fashion. The initial clusters spread in the cellular automaton by uniting neighborhood cells in the same cluster. The rules utilized to compose clusters in the automaton are inspired by the heat transfer process in nature. No distance calculation is used during the procedure. Therefore, it is possible to cluster huge datasets within a reasonable amount of time with the proposed method.

Keywords

Cellular automata, clustering, big data

1. Introduction

Various clustering techniques have been proposed and utilized in fields such as data mining, machine learning and pattern recognition. The clustering problem can be defined as partitioning $n$ individual data points into $r$ mutually disjoint subsets where $r$ would be much smaller than $n$ . The partitioning process is expected to form clusters where variation of the elements in the same cluster would be minimal, whereas variation of the elements in distinct clusters would be maximal. Hence, homogeneous clusters which are well separated from each other should be formed by a successful clustering technique.

The problem is NP-hard; however, various heuristic approaches have been proposed which can converge to a local optimum quickly. K-means [1] is a well known example, where the data points are assigned to clusters based on their distance to the cluster centroids. These centroids are chosen randomly at the beginning of the procedure and are updated throughout the iterations, so that more homogeneous clusters can be formed. Even though k-means utilizes a greedy approach that can return a sub-optimal solution, in practice k-means and its variants have been successfully used to cluster various kinds of data [2]. As noted above, the distance from each element to all of the centroids has to be calculated in each iteration of the algorithm. Hence, the number of elements in the dataset is one of the factors that determine the time complexity of the algorithm, and the efficiency of the method declines when huge datasets are clustered.

In contrast, cellular automata (CA) based clustering techniques have been proposed in the literature [3, 4]. These techniques map the initial dataset to the cells of a cellular automaton. However the formation of clusters is carried out again by distance calculations among the elements in the dataset. Similar elements are moved to neighborhood locations in the automaton and clusters emerge as more elements gather in a certain region [3].

In this study, a stochastic cellular automata algorithm named SCA-clustering is proposed in order to cluster a dataset without performing any distance calculations. The method first maps the elements in the dataset to fixed cells in a CA. Initially, each cell that has a data point is considered as a separate cluster in the automaton. Then, by using the local interactions between the cells, larger clusters are formed. The process of spreading the clusters in the automaton is inspired by the heat transfer process in nature. The CA cells that have a data point are considered as heat sources that generate heat energy continuously. This energy is spread to the neighborhood cells and after a while, the regions that consist of the natural clusters warm up in the automaton. On the other side, a second cellular automata rule is utilized simultaneously, which combines hot neighborhood cells into the same cluster. Hence, the initial clusters represented by the data cells start to merge and spread in the CA.

The rest of the paper is organized as follows: In Section 2, related work about clustering approaches and background information for the cellular automata model are given. In Section 3, the cellular automata model utilized in this study and the rules used for formation of clusters are presented. The experimental results can be found in Section 4. Lastly, we conclude in Section 5.

2. Related work

Density based clustering algorithms are popular in the literature. The method is based on grouping together the instances that exist in high-density regions of the data [5]. As expected, such instances are detected again based on distance calculations and the approach has the same disadvantage as k-means, for large datasets. Variations of the initial algorithm exist in the literature. In [6], GDBSCAN, which is a generalized version of DBSCAN, is introduced. The algorithm is used in some real world applications in the study. In [7], the authors propose a new DBSCAN algorithm; P-DBSCAN which is utilized to analyze places and events via geo-tagged photos. It should be noted that these alternative methods are also density based, hence the clustering process is again based on distance calculations.

In a recent study presented in [8], the cluster centroids are determined again by using density peaks. The proposed approach is based on a previous work, Clustering by Fast Search and Find of Density Peaks (CFSFDP). CFSFDP has some limitations in terms of determining density peaks, so the authors propose a novel CFSFDP method that utilizes heat diffusion. This is derived from the concept of the Wiener process, where a Gaussian kernel is used to estimate the next state.

Other methods have also been proposed for the clustering problem. For instance, a genetic algorithm is used in [9] and an artificial bee colony in [10]. In [11], the authors propose a method to determine the number of clusters which exist in a dataset before the k-means process starts. Genetic algorithms are utilized to determine the optimum number of clusters in this study. However, similar distance calculations are utilized in these studies, too. Even though clustering is a well studied problem, to the authors’ knowledge, no study attempts to form clusters without distance calculations among the elements in a dataset.

As noted in Section 1, cellular automata based clustering approaches have been proposed in the literature. However there are vital differences between the ideas proposed in these studies and the framework utilized in our work. For instance in [3], multi-dimensional data instances are mapped into a linear cellular automaton and the clustering process is carried out on this linear structure. Although there are neighborhood interactions among the cells, the clusters are formed by moving the data items autonomously in the automaton. In [4], an ant clustering algorithm (ACA) is proposed. Each ant represents a data item, and all ants are placed in a cellular automaton. Based on a fitness value, the ants propagate in the CA to form clusters. However, the fitness value is again determined by calculating the distance values between the data item represented by the ant and data items represented by other agents in the CA.

A similar idea is proposed in [12]. The data items are randomly mapped to a two-dimensional CA. Based on some stochastic rules, the data items move concurrently in the CA. A harmony function is defined for the cells which measures the similarity of the data item in the cell and the data items in neighborhood cells. As expected, this similarity measure is again based on the distance between the data items.

In contrast, a classification method based on CA is proposed in [13]. An energy function is utilized in this study. However, the model proposed for the classification process is quite different to the standard CA applications. Each feature of the dataset is associated with a column in the CA. The cells of a column might have low or high energy values and these energy states are utilized to denote whether the training sample is classified correctly or not. Certainly, the learning procedure iterates based on the interaction between the current and neighbor cells. The main disadvantage of the proposed method is that the energy parameter and threshold values utilized have to be tuned for each dataset separately. The work presented in [13] has been enhanced by using a Moore neighborhood in [14].

2.1 Cellular automata

A cellular automaton (CA) is a discrete system composed of interconnected cells. The computation in a CA is based on the interaction among neighborhood cells. Each cell can be in one of a set of predefined states, and the next state of the cell is based on the neighborhood states. Hence, all transitions occur locally in a cellular automaton, so massive parallelism can be achieved. The most well known example of a CA is Conway’s Game of Life [15]. Various CA applications have also been proposed, mainly as simulation tools for various disciplines [16, 17, 18, 19, 20].

Figure 1. Neighborhood types.

A CA can be considered as a lattice in $\mathbb{R}^{n}$. The number of neighbors of a cell in a CA varies based on the number of dimensions and the type of neighborhood relationship utilized. The Moore and Von Neumann neighborhoods of a center cell can be seen in Fig. 1a and b. A two-dimensional Moore neighborhood comprises the eight cells surrounding the center cell. In an $n$-dimensional CA $C$, if the center cell is $C_{i_{1},i_{2},...,i_{n}}$, then all other cells $C_{j_{1},j_{2},...,j_{n}}$ where $|i_{k}-j_{k}|\leqslant 1$ for every $k$ would be in the Moore neighborhood. Hence $3^{n}-1$ neighbor cells exist in total. On the other hand, a Von Neumann neighborhood consists of only the four cells that orthogonally surround the central cell in two dimensions. In $n$ dimensions, the neighborhood cells can have an index difference of $1$ in only one of the dimensions. Again, if the center cell is $C_{i_{1},i_{2},...,i_{n}}$, then all cells $C_{j_{1},j_{2},...,j_{n}}$ where $\exists k,|i_{k}-j_{k}|=1$ and $\forall s,s\neq k,|i_{s}-j_{s}|=0$ would be in the neighborhood. Hence, $2n$ neighborhood cells exist for a cell in an $n$-dimensional CA. Since the number of neighbors then increases linearly with the number of dimensions rather than exponentially, a Von Neumann neighborhood is utilized in this study.
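The two neighborhood definitions above can be illustrated with a minimal Python sketch. The function names are hypothetical, and boundary cells are ignored here (an assumption; the text does not say how the edges of the lattice are handled).

```python
from itertools import product


def moore_neighbors(cell):
    """All cells whose index differs by at most 1 in every dimension."""
    offsets = product((-1, 0, 1), repeat=len(cell))
    return [
        tuple(c + o for c, o in zip(cell, off))
        for off in offsets
        if any(off)  # skip the all-zero offset, i.e. the center cell itself
    ]


def von_neumann_neighbors(cell):
    """All cells whose index differs by exactly 1 in exactly one dimension."""
    neighbors = []
    for k in range(len(cell)):
        for step in (-1, 1):
            shifted = list(cell)
            shifted[k] += step
            neighbors.append(tuple(shifted))
    return neighbors


# Neighbor counts follow 3^n - 1 (Moore) and 2n (Von Neumann).
for n in (2, 3, 6):
    center = (5,) * n
    print(n, len(moore_neighbors(center)), len(von_neumann_neighbors(center)))
```

For six dimensions the Moore neighborhood already has 728 cells against 12 for Von Neumann, which shows why the linear growth matters.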

3. Methodology

As mentioned in Section 1, a stochastic cellular automata (SCA) is proposed in this study for clustering huge datasets efficiently. The SCA-clustering method updates the states of individual cells asynchronously by using some transition rules. The cells are chosen randomly for the update procedure.

The SCA-clustering first assigns the instances in the dataset to the cells of an $n$ -dimensional CA where $n$ is determined by the dimension in the dataset. The cell index in each dimension is based on the corresponding attribute value of the data instance. Certainly the instance with the minimum attribute value would have the smallest index and the maximum attribute would have the largest. Determining the number of cells in each dimension is going to be discussed in detail in Section 3.2. For now, assume that there are $m$ cells in dimension $d$ and $x^{(d)}_{\min}$ and $x^{(d)}_{\max}$ are the minimum and maximum values for the attribute associated with this dimension. A data instance that has the value $x^{(d)}$ for the corresponding attribute would have the cell index $i_{d}$ in this dimension as given by Eq. (1).

$\displaystyle i_{d}=\left\lfloor\frac{x^{(d)}-x^{(d)}_{\min}}{\bigl(x^{(d)}_{\max}-x^{(d)}_{\min}\bigr)/m}\right\rfloor+1$ (1)

Certainly, more than one data point can be assigned to the same cell if there are data points which are close to each other in the dataset.
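As a hedged illustration of Eq. (1), the mapping of one attribute value to a cell index could be sketched as follows. The function names are hypothetical. Note that Eq. (1) as written sends the maximum attribute value to index $m+1$; the sketch clamps it into the last cell, which is an assumption and not something the text states.

```python
import math


def cell_index(x, x_min, x_max, m):
    """1-based cell index of attribute value x along one dimension, per Eq. (1)."""
    width = (x_max - x_min) / m  # width of a single cell in this dimension
    i = math.floor((x - x_min) / width) + 1
    # Assumption (not in the text): the maximum value would land in cell
    # m + 1, so it is clamped into the last cell.
    return min(i, m)


def map_point(point, mins, maxs, m):
    """Map one data point to a tuple of per-dimension cell indices."""
    return tuple(cell_index(x, lo, hi, m) for x, lo, hi in zip(point, mins, maxs))


# Toy example: attribute range [0, 8] split into m = 4 cells of width 2.
print(map_point((0.0, 3.0), mins=(0.0, 0.0), maxs=(8.0, 8.0), m=4))  # (1, 2)
print(map_point((8.0, 7.9), mins=(0.0, 0.0), maxs=(8.0, 8.0), m=4))  # (4, 4)
```

Only a subtraction, a division and a floor are needed per attribute, so the mapping is linear in the number of data points and involves no pairwise distances.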

3.1 Formation of clusters

In a standard CA application, each cell in the automata can be in one of a finite number of states and, based on a predetermined rule, each cell can change its state depending on its own state and the states of the neighborhood cells. In this study, certain updates have been carried out on this standard framework in order to utilize the CA model for the clustering task.

The method aims to represent the various clusters in the dataset with individual states in the CA cells. Hence, if a group of CA cells are in the same state, then the data points they contain are considered to be in the same cluster. Initially, each CA cell that contains a data point is assigned a distinct state. Each state is denoted with a unique integer value. The cells that do not contain data points are accepted to be in state 0. Hence, if there are $n$ instances in the dataset, the cells could be in one of the $n+1$ states at the beginning of the procedure. If more than one data point is assigned to some CA cells, then the total number of distinct states in the initial CA would decrease.

In the proposed method, the cells can change their state based on the states of the neighborhood cells. The grouping of adjacent cells into the same state is expected to reveal the cluster formation that exists in the dataset. At the end of the procedure, the aim is to obtain $k+1$ distinct states in the CA, where $k$ is the number of clusters in the dataset. Here, $k$ is a user defined parameter of the algorithm. Some cells could still be in state 0 after the execution, which is why $k+1$ states would exist in the CA when the operation terminates.

As noted in Section 1, the formation of the clusters in the CA is inspired by the heat transfer process in nature. For this reason, a temperature value is also kept for each cell in our model besides the state information. The cells change their state based on their own temperature and the temperatures of their neighborhood cells. The cells that have a data point inside are considered to be heat sources. Such cells have a fixed temperature of 100$^{\circ}$ and they do not cool down. A simple mechanism is used to transfer the heat energy generated by these source cells to other cells in the CA. As the CA cells start to warm up, an update rule is simultaneously executed on the states of the CA cells, in order to gather them in a set of states that would represent the clusters in the data. Note that the procedure starts with at most $n+1$ states, $n$ being the size of the dataset, and when the number of states in the CA falls to the number of desired clusters, the procedure ends.

In Fig. 2, the initial configuration of a simple example CA is given. Each cell contains two integers, where the first one denotes the temperature and the second one the initial state of the cell. The cells in state 0 do not contain a data point; hence, they also have a temperature of 0$^{\circ}$. The cells with a data point have a temperature of 100$^{\circ}$ and each one is assigned a different state. Note that there are six instances in this example, hence six separate states are utilized, one for each data point. The final configuration of the CA is presented in Fig. 3. This configuration is obtained by the heat and state transfer procedures applied to the CA cells. These procedures are explained in detail in the Heat Transfer and State Transfer algorithms in this section. In Fig. 3, all cells have converged to two distinct states in the automaton, representing two separate clusters. The final temperature values are also shown in Fig. 3.
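The initial configuration described for Fig. 2 could be built roughly as in the sketch below. The function name and the dictionary-based storage are illustrative assumptions; the text only specifies that occupied cells receive distinct integer states and a temperature of 100, while all other cells are in state 0 at temperature 0.

```python
def initial_configuration(cell_indices):
    """Assign a distinct integer state to every occupied cell.

    cell_indices holds one cell index tuple per data point. Cells missing
    from the returned dictionaries are implicitly in state 0 at 0 degrees.
    """
    state, temperature = {}, {}
    for cell in cell_indices:
        if cell not in state:  # data points sharing a cell share one state
            state[cell] = len(state) + 1  # states 1, 2, 3, ...
            temperature[cell] = 100.0  # data cells act as fixed heat sources
    return state, temperature


# Three data points, two of which fall into the same cell.
state, temperature = initial_configuration([(1, 1), (1, 1), (2, 3)])
print(state)        # {(1, 1): 1, (2, 3): 2}
print(temperature)  # {(1, 1): 100.0, (2, 3): 100.0}
```

With three data points but only two occupied cells, two non-zero states are created, matching the remark that shared cells reduce the number of initial states.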

Figure 2. Initial configuration.

Figure 3. Last configuration.

Figure 4. The clustering process.

In Fig. 4, the same procedure is illustrated, this time on a real dataset. The initial configuration is given in Fig. 4a. Here, only the distribution of the data is shown in the CA. In Fig. 4b, an intermediate configuration of the CA is presented; the initial clusters represented by the data points have started to spread into the neighborhood cells. Figure 4c contains the last configuration with two clusters. The state information of each cell is denoted through the use of various colors and the temperature information is represented by color tones, where darker tones represent higher temperatures. As illustrated, the temperature has a tendency to increase towards the center of the clusters.

Algorithm: Heat Transfer in CA

HEAT-TRANSFER (Cell C)
    $N$ $\longleftarrow$ getNeighbour (C)
    AverageTemperature $=$ calculateAverageTemperature (C, N)
    if empty (C) then
        $C_{\textit{temperature}}$ $=$ AverageTemperature
    for each Cell $K$ $\in$ $N$ do
        if empty (K) then
            $K_{\textit{temperature}}$ $=$ AverageTemperature

The method used for the heat transfer procedure can be seen in the Heat Transfer algorithm. The procedure is applied on a randomly chosen cell $C$. In the first step, the neighborhood cells of cell $C$ are determined and then the average temperature of cell $C$ and its neighbors is calculated. This average is set as the temperature of cell $C$ and its neighbors, provided the cells are empty, in other words, if no data point is assigned to them. Note that the cells with a data point have a fixed temperature of 100$^{\circ}$. This transfer rule enables the neighborhood cells to share the heat energy that exists in the nearby environment. The rule has the tendency to equalize the temperature in all cells in the long run. However, the cells with a data point constantly provide heat energy to the system; hence, such cells increase the temperatures of the nearby cells. When this procedure is applied repeatedly on randomly chosen cells, the regions that have more data points inside become warmer compared to other regions in the CA.
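A minimal two-dimensional Python sketch of the heat transfer rule described above is given below. It is not the authors' implementation: the list-of-lists grid, the helper `von_neumann` and the treatment of boundary cells (neighbors outside the grid are simply dropped) are assumptions made for illustration.

```python
import random


def von_neumann(cell, shape):
    """In-bounds Von Neumann neighbors of a 2-D cell (edge handling is an assumption)."""
    r, c = cell
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < shape[0] and 0 <= j < shape[1]]


def heat_transfer(temperature, has_data, cell):
    """Average the temperature over a cell and its neighbors; data cells stay fixed."""
    shape = (len(temperature), len(temperature[0]))
    group = [cell] + von_neumann(cell, shape)
    average = sum(temperature[i][j] for i, j in group) / len(group)
    for i, j in group:
        if not has_data[i][j]:  # only empty cells take the averaged value
            temperature[i][j] = average


# 3 x 3 toy grid with a single data cell (heat source) in the middle.
temperature = [[0.0] * 3 for _ in range(3)]
has_data = [[False] * 3 for _ in range(3)]
temperature[1][1], has_data[1][1] = 100.0, True

heat_transfer(temperature, has_data, (1, 1))
print(temperature[0][1], temperature[1][1])  # 20.0 100.0

# Repeated application on randomly chosen cells, as in the text.
random.seed(0)
for _ in range(1000):
    heat_transfer(temperature, has_data, (random.randrange(3), random.randrange(3)))
```

After the first call the four orthogonal neighbors share the average of 100 and four zeros, while the source cell keeps its fixed temperature.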

Algorithm: State Transfer in CA

STATE-TRANSFER (Cell C)
    if $C_{\textit{temperature}}>\textit{threshold}$ then
        $N$ $\longleftarrow$ getNeighbour (C)
        for each cell $K$ $\in$ $N$ do
            if $K_{\textit{temperature}}>\textit{threshold}$ then
                $K_{\textit{state}}=C_{\textit{state}}$
                STATE-TRANSFER (K)

Algorithm: The General Algorithm

SCA-CLUSTERING (CELLULAR AUTOMATON CA)
    iteration $=$ 1
    percentage $=$ 0
    while percentage $<$ 0.8 do
        if iteration%100 $=$ 0 then
            percentage $=$ CalculatePercentage (CA)
        Cell C $=$ randomCell (CA)
        HEAT-TRANSFER (C)
        Cell C $=$ randomCell (CA)
        STATE-TRANSFER (C)
        iteration++
    iteration $=$ 1
    while percentage $<$ 1 do
        if iteration%100 $=$ 0 then
            percentage $=$ CalculatePercentage (CA)
        enlargeClusters (CA)
        iteration++

A second transfer rule is utilized in our CA model. The first rule enables the system to transfer the heat energy produced by the data cells to other cells in the CA, whereas the second rule is utilized for changing the states of the CA cells. Note that each state represents an individual cluster in the CA. This second rule is presented in the State Transfer algorithm. The rule is executed simultaneously with the heat transfer rule, again on randomly chosen cells. When the cells warm up sufficiently, they start to change their states based on this second rule. Initially, each cell containing a data point is in a unique state and all other cells are in state 0. As seen in the algorithm, the temperature of the randomly selected cell $C$ is first compared with a threshold, and its neighbors are determined if the threshold is exceeded. The same threshold is then utilized for changing the state of the neighborhood cells. This threshold is experimentally determined as 80$^{\circ}$. If the temperature of a neighbor cell is above this threshold, then the neighbor cell is moved to the state of the selected cell. The function is recursively called on the neighbor cell, too. A group of nearby cells with temperatures above the threshold can thus combine in the same cluster/state with a single call to the State Transfer method. Hence, the system is able to spread the clusters in the CA very quickly whenever sufficient warming is achieved in a certain region.
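The state transfer rule can be sketched in Python as follows, under several stated assumptions. An explicit stack replaces the recursion to avoid Python's recursion limit, cells already in the spreading state are skipped so that the traversal terminates, and an unlabeled cell (state 0) does not spread its state. None of these three choices is given in the text; they are illustrative guesses, as are the function name and the list-of-lists grid.

```python
THRESHOLD = 80.0  # threshold value reported in the text


def state_transfer(state, temperature, start, threshold=THRESHOLD):
    """Spread the state of `start` over connected cells hotter than `threshold`."""
    rows, cols = len(state), len(state[0])
    r, c = start
    if temperature[r][c] <= threshold:
        return  # the selected cell is not warm enough
    new_state = state[r][c]
    if new_state == 0:
        return  # assumption: unlabeled cells do not propagate state 0
    stack = [start]  # explicit stack instead of the recursive call
    while stack:
        r, c = stack.pop()
        for i, j in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= i < rows and 0 <= j < cols):
                continue
            # Assumption: cells already in the spreading state are skipped,
            # otherwise the traversal would revisit them forever.
            if temperature[i][j] > threshold and state[i][j] != new_state:
                state[i][j] = new_state
                stack.append((i, j))


# 1 x 5 strip: two data cells at the ends, a cold gap at index 3.
temperature = [[100.0, 90.0, 85.0, 50.0, 100.0]]
state = [[1, 0, 0, 0, 2]]
state_transfer(state, temperature, (0, 0))
print(state)  # [[1, 1, 1, 0, 2]]
```

The cold cell at index 3 blocks the spread, so the two initial clusters stay separate until the region between them warms up.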

The general algorithm used for SCA-clustering is given in the General Algorithm listing. In the algorithm, a loop is utilized which repeatedly applies the heat transfer and state transfer rules on randomly selected cells. This continues until a termination criterion is reached. One possible criterion is to continue with the process until the number of states falls to $k+1$, where $k$ is the number of desired clusters. However, it has been observed that using this termination criterion increases the risk of combining the large clusters formed, hence producing an unsuccessful run. Therefore, after the algorithm is started, the largest $k$ clusters that appear in the CA are examined after every 100 iterations. If the largest $k$ clusters contain 80% of the dataset, the process is ended. Hence, a safety zone is created in the CA that prevents the large clusters from merging. However, up to 20% of the data points will not be assigned to any of these clusters yet. Therefore, a final enlargement process is applied to the clusters until all data points are covered. This process selects CA cells, again randomly. If the cell does not belong to one of the largest $k$ clusters, but it has a neighbor that is in a large cluster, then the cell is added to this cluster. With this method, the clusters are spread in the cellular automaton without considering the temperature values. The unassigned data points will be added to one of the nearby clusters when this last procedure is applied.
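The termination check and the enlargement step described above could look like the following sketch. The names `covered_fraction` and `enlarge_cell` are hypothetical stand-ins for CalculatePercentage and one step of enlargeClusters. Taking the first large neighbor found when several large clusters touch a cell is an assumption, since the text does not define a tie-breaking rule.

```python
from collections import Counter


def covered_fraction(point_states, k):
    """Fraction of data points whose cell is in one of the k most populated states."""
    counts = Counter(point_states)
    return sum(size for _, size in counts.most_common(k)) / len(point_states)


def enlarge_cell(state, large_states, cell):
    """Absorb `cell` into a neighboring large cluster, ignoring temperatures.

    Returns True when the cell changed its state.
    """
    rows, cols = len(state), len(state[0])
    r, c = cell
    if state[r][c] in large_states:
        return False  # already part of a large cluster
    for i, j in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= i < rows and 0 <= j < cols and state[i][j] in large_states:
            state[r][c] = state[i][j]  # assumption: the first match wins
            return True
    return False


# States of the cells holding ten data points; the two largest clusters cover 50%.
print(covered_fraction([1, 1, 1, 2, 2, 3, 4, 5, 6, 7], k=2))  # 0.5

# Enlargement on a 1 x 3 strip where only state 1 is a large cluster.
state = [[1, 5, 0]]
enlarge_cell(state, {1}, (0, 1))
enlarge_cell(state, {1}, (0, 2))
print(state)  # [[1, 1, 1]]
```

In the full procedure the cell passed to `enlarge_cell` would be drawn at random, and the main loop would stop once `covered_fraction` reaches 0.8.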

3.2 Size of the CA utilized

The number of cells in the CA has a strong effect on the success of the proposed approach. If a sufficient number of cells is not utilized, a data distribution would be obtained in the CA where the clusters are very close to each other. In such a case, the risk of merging clusters by the state transfer rule would increase. On the other hand, the time complexity of the method depends on the number of cells in the CA; hence, it is not possible to increase the size of the CA without limit.

The initial experiments have been carried out on two-dimensional datasets. It has been observed that a 200 $\times$ 200 CA is sufficient to cluster each kind of real and generated dataset. This corresponds to 40000 cells in the CA. Yet, the number of cells in a CA is not limited by this number, and it can be increased. For high dimensional data, it is not possible to use 200 cells in each dimension. The size of the CA would be unmanageable in such a case, impeding the runtime performance of the method. Therefore, the number of cells in a single dimension is decreased for high dimensional datasets. This count is determined such that the total number of cells always adds up to $\approx$ 40000 in the CA; hence, the following equation is utilized.

$\displaystyle C_{i}=\sqrt[n]{m},\quad i=1,2,\ldots,n$ (2)

where $n$ is the dimension of the dataset, $m$ is the fixed size 40000 and $C_{i}$ denotes the number of cells in dimension $i$. Restricting the total size of the CA keeps the method operable on datasets with more than two dimensions. However, this restriction imposes an important limitation on the proposed model. The number of cells in a single dimension decreases exponentially when the number of dimensions in the data increases. Hence, the system cannot be used for high dimensional datasets. In the following section, it can be seen that the model has been tested with datasets of up to six dimensions. For higher dimensional datasets, the CA utilized cannot reveal the structure of the natural clusters in the dataset; hence, it becomes impossible to cluster the dataset without increasing the size of the CA.
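To make the effect of Eq. (2) concrete, the following short Python sketch evaluates the number of cells per dimension for a fixed total of 40000 cells. The function name is hypothetical, and how the fractional result is rounded to an integer is not stated in the text, so the raw value is printed.

```python
def cells_per_dimension(n, total=40000):
    """Cells along each of n dimensions so that the CA holds about `total` cells."""
    return total ** (1.0 / n)


for n in range(2, 7):
    print(n, round(cells_per_dimension(n), 1))
```

The resolution drops from 200 cells per dimension for two dimensions to about 34 for three and fewer than 6 for six, which illustrates why the structure of the clusters is hard to capture in higher dimensions.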

4. Experimental results

Experiments are conducted by using datasets utilized in other studies or created by a synthetic cluster generator. This section contains two subsections. In the first subsection, the general analysis of SCA-clustering is presented. The method is tested on datasets with various numbers of clusters and dimensions, and with clusters of varying topology. In the next subsection, the comparison of SCA-clustering with the k-means and DBSCAN algorithms is carried out. A Unix machine that has a 2.6 GHz Intel Core i5 processor and 8 GB of memory is used in the experiments.

Table 1
Experimental results for datasets in Fig. 5

Dataset	Rand index	Success rate (%)	Runtime (sec)	Dimensions	# of clusters
Aggregation	99.26 $\pm$ 0.186	99.68 $\pm$ 0.12	5.16 $\pm$ 0.20	2	7
Banana	100 $\pm$ 0	100.0 $\pm$ 0.00	2.66 $\pm$ 0.26	2	2
Jain	82.34 $\pm$ 6.276	95.72 $\pm$ 1.54	12.57 $\pm$ 0.74	2	2
R15	98.35 $\pm$ 0.608	99.32 $\pm$ 0.22	1.49 $\pm$ 0.07	2	15
Sizes1	94.99 $\pm$ 0.271	98.02 $\pm$ 0.15	6.13 $\pm$ 0.25	2	4
Chainlink	100.0 $\pm$ 0	99.83 $\pm$ 0.74	0.75 $\pm$ 0.13	3	2

Figure 5. Different types of datasets used during the experiments: (a)–(e) are two-dimensional and (f) is three-dimensional.

4.1 Performance of SCA-clustering

In Fig. 5, datasets utilized in multiple studies are presented [21, 22, 23, 24, 25]. The datasets are two-dimensional except the last one, which includes two rings in three dimensions. Note that some of the datasets have a uniform distribution. It is well known that k-means and other classical clustering approaches are not suitable for finding clusters that do not have a hyper-spherical topology. The natural clusters presented in Fig. 5b, c and f are examples where k-means and its variants would fail. However, the formation of clusters is based on the local interactions of neighborhood cells in SCA-clustering. Hence, the method is insensitive to variations in the shape and topology that the clusters might have. An example that consists of clusters with considerably varying sizes (Fig. 5a) is also included in the experimental set. Table 1 presents the experimental results for the datasets shown in Fig. 5.

As seen in Table 1, the SCA-clustering algorithm can cluster the datasets accurately even though they have multiple characteristics. The success rate and the runtime statistics are obtained by averaging 20 runs on each dataset. The success rate is determined by considering the percentage of correctly clustered instances. The rand index values are also calculated and presented in the table.
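For readers unfamiliar with the Rand index, the sketch below computes it in its standard pair-counting form: the fraction of point pairs on which two labelings agree. The function name is hypothetical, and the quadratic loop over all pairs is only meant for small illustrations. The function returns a value in [0, 1], whereas Table 1 appears to report the index on a 0 to 100 scale.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs treated the same way by both labelings."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Identical partitions with swapped label names still score 1.0.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# A partition that cuts across the true clusters agrees on 2 of 6 pairs.
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because only pair memberships are compared, the measure does not depend on which integer state ends up representing which cluster in the CA.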

The CA utilized in the experiments has a fixed size of 200 $\times$ 200 for two dimensions. However, there is still a deviation in the runtime statistics, which is a result of the various distributions in the datasets. The sparsity of the clusters seems to be the main factor affecting the runtime of the algorithm. The worst performance is obtained on the Jain dataset, which includes quite a sparse cluster compared to the other datasets. In this case, the warming procedure needs more time, resulting in a delay in the formation of the final clusters. However, this is an expected outcome.

The box/quartile plots of these results are given in Fig. 6. Again, an expected result can be seen in the figure where the deviation of the runtime is minimal for datasets that contain compact clusters.

Table 2
Runtime change based on $m$

$m$	Success Rate (%)	Runtime (sec)
100	98.01 $\pm$ 0.24	0.69 $\pm$ 0.09
125	97.95 $\pm$ 0.28	1.08 $\pm$ 0.10
150	97.95 $\pm$ 0.22	2.04 $\pm$ 0.26
175	97.99 $\pm$ 0.16	3.73 $\pm$ 0.23
200	98.02 $\pm$ 0.15	6.13 $\pm$ 0.25
225	97.98 $\pm$ 0.15	9.36 $\pm$ 0.25
250	97.94 $\pm$ 0.20	15.15 $\pm$ 0.61

Figure 6. Box/quartile plot of results obtained on datasets in Fig. 5.

As noted above, the number of cells ($m$) in a dimension is set to 200 for two dimensions. This number is determined experimentally. However, this is a critical parameter for the proposed model. In Table 2, the effect of this parameter is analyzed on a selected dataset. As expected, the average runtime varies depending on the value of $m$. When $m$ is larger, the number of empty cells between data instances grows. Thereby, the framework needs more time to increase the temperature of the cells. The variation of the runtime statistics is also presented in Fig. 7; there is a consistent increase in average runtime when the $m$ value is increased. The system has not been tested with very small $m$ values. In such a case, it is not possible to capture the structure of the dataset in the CA, and the whole procedure fails. Within the tested range, however, the system is robust to changes in $m$ in terms of clustering performance. There is no significant change in the success rate when multiple $m$ values are tested, as seen in Table 2.

Figure 7. Average runtime of different size CA.

4.2 Comparison with k-means and DBSCAN

In this section, the comparison of SCA-clustering with k-means and DBSCAN is presented. As noted in the previous section, k-means is expected to have a worse performance on datasets that do not contain hyper-spherical clusters. Again, 20 runs have been averaged in order to determine the performance of each method on a dataset. The same seed values are utilized in both algorithms. The comparison with k-means is displayed in Table 3.

Table 3
Comparison between k-means and SCA-clustering

Dataset	# of instances	K-means success rate (%)	K-means runtime (sec)	SCA-clustering success rate (%)	SCA-clustering runtime (sec)
Aggregation	788	78.15 $\pm$ 5.36	0.02 $\pm$ 0.02	99.68 $\pm$ 0.12	5.16 $\pm$ 0.20
Banana	4811	81.51 $\pm$ 0.03	0.05 $\pm$ 0.03	100.0 $\pm$ 0.00	2.66 $\pm$ 0.26
Jain	373	88.20 $\pm$ 0.00	0.01 $\pm$ 0.01	95.72 $\pm$ 1.54	12.57 $\pm$ 0.74
R15	600	80.85 $\pm$ 8.51	0.01 $\pm$ 0.01	99.32 $\pm$ 0.22	1.49 $\pm$ 0.07
Sizes1	1000	98.20 $\pm$ 2.84	0.02 $\pm$ 0.01	98.02 $\pm$ 0.15	6.13 $\pm$ 0.25
Chainlink	1000	64.27 $\pm$ 2.88	0.02 $\pm$ 0.02	99.83 $\pm$ 0.74	0.75 $\pm$ 0.13
2d-4c	1000000	85.63 $\pm$ 73.46	15.71 $\pm$ 14.66	82.5 $\pm$ 23.84	32.1 $\pm$ 43.0
3d-2c	70930	99.99 $\pm$ 0.00	0.58 $\pm$ 0.12	99.99 $\pm$ 0.00	1.62 $\pm$ 0.16
3d-4c	144824	80.69 $\pm$ 17.47	3.38 $\pm$ 2.60	97.39 $\pm$ 10.41	1.37 $\pm$ 0.93
3d-6c	247087	73.59 $\pm$ 9.87	10.58 $\pm$ 3.65	94.53 $\pm$ 12.90	2.82 $\pm$ 1.05
3d-8c	405419	72.83 $\pm$ 12.46	25.17 $\pm$ 18.31	68.16 $\pm$ 14.86	5.44 $\pm$ 2.64
4d-2c	117565	100.0 $\pm$ 0.00	1.04 $\pm$ 0.16	100.0 $\pm$ 0.00	1.26 $\pm$ 0.14
4d-4c	137973	82.89 $\pm$ 20.29	4.35 $\pm$ 3.77	70.26 $\pm$ 29.75	2.47 $\pm$ 1.12
4d-5c	1250000	77.05 $\pm$ 57.20	107.7 $\pm$ 139.5	100.0 $\pm$ 0.0	0.881 $\pm$ 0.19
4d-6c	178736	72.75 $\pm$ 10.45	9.08 $\pm$ 2.50	93.30 $\pm$ 15.96	1.55 $\pm$ 0.78
5d-2c	111541	100.0 $\pm$ 0.00	1.08 $\pm$ 0.23	100.0 $\pm$ 0.00	3.10 $\pm$ 1.47
5d-4c	168799	70.46 $\pm$ 14.68	9.69 $\pm$ 4.54	99.93 $\pm$ 0.30	2.37 $\pm$ 1.75
5d-6c	162358	71.76 $\pm$ 13.65	17.49 $\pm$ 16.14	77.46 $\pm$ 16.13	1.53 $\pm$ 0.36
6d-2c	58140	100.0 $\pm$ 0.000	0.57 $\pm$ 0.16	92.97 $\pm$ 21.67	1.49 $\pm$ 0.66
6d-4c	98116	75.58 $\pm$ 17.80	5.40 $\pm$ 4.48	91.12 $\pm$ 26.54	0.58 $\pm$ 0.20
6d-5c	79127	83.59 $\pm$ 13.73	4.88 $\pm$ 4.44	54.91 $\pm$ 18.18	1.28 $\pm$ 0.21

In Table 3, the first column lists the names of the datasets. The first six datasets are the ones utilized in the previous section. The rest are generated by using a publicly available Gaussian generator (http://personalpages.manchester.ac.uk/mbs/julia.handl/generators.html). In the naming convention utilized (Xd-Yc), X is the number of dimensions and Y is the number of clusters in the dataset. The second column gives the number of instances in each dataset, and then the performance of k-means and SCA-clustering is presented. As expected, the runtime of k-means depends on the number of instances in the dataset. For smaller datasets, k-means is even faster than SCA-clustering. However, SCA-clustering exhibits a similar runtime regardless of the dataset size. When we consider the larger datasets utilized in the experiments, the runtime performance of k-means declines considerably compared to SCA-clustering. As noted in the previous section, the variation in the runtime performance of SCA-clustering depends more on the distribution of the data than on the number of instances.

The variation in runtime for the two methods can be seen in Fig. 8. The results are from datasets that have three dimensions. The average runtime of k-means increases considerably on the largest dataset, and the variation is also very large for k-means on this last dataset.

When we compare the success rates of the two algorithms, we observe that SCA-clustering has quite a satisfactory performance against k-means. For instance, the success rate of k-means goes down to 64% for the Chainlink dataset presented in Fig. 5f. SCA-clustering has a worse result (68%) on the dataset $3d-8c$ . However, k-means also has a relatively poor performance (72%) on this dataset. The dataset $4d-4c$ is the second dataset where the performance of SCA-clustering is worse than that of k-means. For all other datasets up to five dimensions, SCA-clustering is remarkably successful compared to k-means.

In Section 3.2, it was noted that the dimension of the dataset creates a severe limitation for SCA-clustering. The success rate of SCA-clustering goes down to 54% on the dataset $6d-5c$ . Note that this is a six dimensional dataset. Even though the performance of the method is high for the two other datasets that have six dimensions, we have observed that SCA-clustering is robust only up to five dimensions. The method becomes unreliable for higher dimensional datasets with the current CA size utilized. Future work could carry out experiments on a larger CA in order to analyze the performance in higher dimensions. Silhouette scores for both approaches are also presented in Table 4. The scores are presented only for datasets with ellipsoid clusters, since silhouette scores provide a reasonable comparison on such datasets.
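For readers unfamiliar with the measure, the following is a minimal sketch of how a mean silhouette score can be computed. It assumes Euclidean distance and plain Python tuples as data points; the function name `silhouette_score` and the toy data are illustrative assumptions, not the implementation used to produce Table 4. The computation is quadratic in the number of instances and is needed only for evaluation, not for the clustering itself.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the given members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_dist(i, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy data (hypothetical): two well separated pairs of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(toy_points, [0, 0, 1, 1])
bad = silhouette_score(toy_points, [0, 1, 0, 1])
print(good, bad)
```

A score close to 1 indicates compact, well separated clusters, while a negative score indicates that points lie closer to another cluster than to their own.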

Table 4

Silhouette scores

Dataset	SCA-Clustering		K-Means
	Mean	Std	Mean	Std
R15	0.75	9.46e-4	0.60	6.13e-2
Sizes1	0.59	4.29e-4	0.59	1.11e-16
2d-4c	0.85	9.66e-2	0.92	5.85e-4
3d-2c	0.68	2.74e-3	0.68	1.55e-3
3d-4c	0.86	6.23e-2	0.79	8.75e-2
3d-6c	0.86	6.31e-2	0.73	9.66e-2
3d-8c	0.71	8.03e-2	0.73	9.33e-2
4d-2c	0.58	3.28e-3	0.58	2.41e-3
4d-4c	0.71	1.73e-1	0.72	1.81e-1
4d-5c	0.95	2.67e-4	0.68	1.26e-1
4d-6c	0.89	7.15e-2	0.65	1.14e-1
5d-2c	0.94	3.34e-4	0.94	3.00e-4
5d-4c	0.93	8.12e-3	0.55	1.71e-1
5d-6c	0.75	1.22e-1	0.59	1.59e-1
6d-2c	0.81	1.52e-1	0.89	4.66e-4
6d-4c	0.87	1.80e-1	0.62	2.25e-1
6d-5c	0.49	1.41e-1	0.66	1.57e-1

Figure 8.

Runtime changes depending on number of instances in 3-dimensions.

Table 5

Comparison between DBSCAN and SCA-clustering

Dataset	# of Instances	DBSCAN		SCA-Clustering
		Success rate (%)	Runtime (sec)	Success rate (%)		Runtime (sec)
Aggregation	788	79.19	0.04	99.68	$\pm$ 0.12	5.16	$\pm$ 0.20
Banana	4811	98.94	0.72	100.0	$\pm$ 0.00	2.66	$\pm$ 0.26
Jain	373	80.16	0.03	95.72	$\pm$ 1.54	12.57	$\pm$ 0.74
R15	600	6.67	0.07	99.32	$\pm$ 0.22	1.49	$\pm$ 0.07
Sizes1	1000	39.40	0.05	98.02	$\pm$ 0.15	6.13	$\pm$ 0.25
Chainlink	1000	100.0	0.08	99.83	$\pm$ 0.74	0.75	$\pm$ 0.13
2d-4c	1000000	NA	$>$ 1 day	82.5	$\pm$ 23.84	32.1	$\pm$ 43.0
3d-2c	70930	98.83	163.02	99.99	$\pm$ 0.00	1.62	$\pm$ 0.16
3d-4c	144824	99.99	1054.03	97.39	$\pm$ 10.41	1.37	$\pm$ 0.93
3d-6c	247087	100.0	3764.82	94.53	$\pm$ 12.90	2.82	$\pm$ 1.05
3d-8c	405419	100.0	11434.76	68.16	$\pm$ 14.86	5.44	$\pm$ 2.64
4d-2c	117565	99.56	817.97	100.0	$\pm$ 0.00	1.26	$\pm$ 0.14
4d-4c	137973	100.0	1984.32	70.26	$\pm$ 29.758	2.47	$\pm$ 1.12
4d-5c	1250000	NA	$>$ 1 day	100.0	$\pm$ 0.0	0.881	$\pm$ 0.19
4d-6c	178736	100.0	3507.25	93.30	$\pm$ 15.96	1.55	$\pm$ 0.78
5d-2c	111541	100.0	1659.80	100.0	$\pm$ 0.00	3.10	$\pm$ 1.47
5d-4c	168799	100.0	4474.26	99.93	$\pm$ 0.30	2.37	$\pm$ 1.75
5d-6c	162358	100.0	4344.80	77.46	$\pm$ 16.13	1.53	$\pm$ 0.36
6d-2c	58140	100.0	317.78	92.97	$\pm$ 21.67	1.49	$\pm$ 0.66
6d-4c	98116	100.0	1869.26	91.12	$\pm$ 26.54	0.58	$\pm$ 0.20
6d-5c	79127	99.99	1293.21	54.91	$\pm$ 18.18	1.28	$\pm$ 0.21

As mentioned in the introduction section, density based clustering algorithms are also popular in the literature. SCA-clustering is therefore also compared to the DBSCAN implementation utilized in [26]. Table 5 presents the performance of the two algorithms on the datasets utilized. DBSCAN has a near-perfect performance on the datasets created by the Gaussian generator, which is expected to form spherical clusters, but, like k-means, it is heavily dependent on distance calculations among the elements in the dataset. This results in a considerably worse runtime on the generated datasets compared to SCA-clustering and even to k-means. For large datasets such as $3d-8c$ , $4d-6c$ and $5d-6c$ , the runtime of DBSCAN increases to roughly one to three hours, while SCA-clustering completes the runs in a few seconds. Note that the three approaches have also been tested on two very large datasets ( $2d-4c$ , $4d-5c$ ) consisting of around one million instances. The result of the DBSCAN algorithm is reported as NA in Table 5 because a result was not available within an admissible amount of time ( $<$ 10 hours). However, SCA-clustering has a promising performance on these large datasets, and its performance is superior to that of k-means on the $4d-5c$ dataset, as seen in Table 3.
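To make the size of this runtime gap concrete, the short sketch below computes the ratio of DBSCAN runtime to SCA-clustering runtime for the three datasets named above. The numbers are copied from Table 5; the helper name `speedup` is an illustrative assumption. Note that the DBSCAN column reports a single value, whereas the SCA-clustering column reports a mean, so the ratios are only indicative.

```python
# Runtimes in seconds, copied from Table 5: (DBSCAN, SCA-clustering mean).
runtimes = {
    "3d-8c": (11434.76, 5.44),
    "4d-6c": (3507.25, 1.55),
    "5d-6c": (4344.80, 1.53),
}


def speedup(dbscan_sec, sca_sec):
    """Ratio of DBSCAN runtime to SCA-clustering runtime."""
    return dbscan_sec / sca_sec


for name, (dbscan_sec, sca_sec) in runtimes.items():
    hours = dbscan_sec / 3600.0  # convert seconds to hours
    ratio = speedup(dbscan_sec, sca_sec)
    print(f"{name}: DBSCAN {hours:.2f} h, ratio {ratio:.0f}x")
```

On these three datasets the ratio is on the order of a few thousand.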

For the datasets Banana, Jain, $R15$ , $\textit{Sizes}1$ , and Chainlink, DBSCAN is very efficient in terms of runtime, since these are low dimensional datasets with a limited number of instances. However, the algorithm may fail to cluster some of these datasets successfully. $R15$ is such a dataset, where DBSCAN cannot detect the clusters in the data. SCA-clustering has a significantly better performance than DBSCAN on the datasets $R15$ , Aggregation, Jain, and $\textit{Sizes}1$ . Note that these datasets contain clusters of varying shapes and sizes. When clusters reside in close regions, this may turn into a big disadvantage for density based methods. That is why the success rate of DBSCAN varies widely across these datasets. However, the same disadvantage does not exist for SCA-clustering and, as noted before, the approach has a satisfactory performance on these low dimensional datasets.

5. Conclusion

A common characteristic of the classical clustering algorithms is that they are all based on distance calculations between the data instances. In this paper, a novel approach, based on CA, is proposed for the clustering problem. The method is inspired by the heat transfer process in nature. The clusters are formed by local interactions between the CA cells, hence no distance calculation is needed for the procedure.

The approach is tested on datasets of differing characteristics. It has been observed that the success of the method is not limited to datasets that include only hyper-spherical clusters. Datasets which include clusters of varying shapes and sizes can also be clustered successfully with the method proposed. The main limitation of the approach is the number of dimensions in the dataset. It has been observed that the approach is applicable only to low dimensional data and that it becomes impractical with high dimensional datasets.

Some future work can be considered to improve the algorithm. Firstly, the method is utilized with a CA that has an equal number of cells in each dimension. This factor limits the number of dimensions that can be handled by the method. A more dynamic model could be proposed which would utilize a variable number of cells in each dimension based on the characteristics of the data. If such an approach is used, it might be possible to cluster data of higher dimension than the datasets utilized in this study. Also, all datasets utilized in the experiments consist of only numerical attributes. Since the number of cells in each dimension is fixed, the current model is not appropriate for categorical data. A dynamic model could be beneficial for categorical data, too. Furthermore, the proposed approach has been tested on datasets with minimal noise and without missing values. Noisy data might affect the performance of the system considerably. Therefore, an analysis is needed in order to gain insight into the behavior of the system on noisy data.
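As an illustration of what a variable number of cells per dimension could look like, the sketch below maps a data point to a cell index when each dimension has its own cell count. The linear min-max binning, the function name `cell_index` and its parameters are assumptions made for this example only; the sketch does not describe how the proposed method maps data to CA cells.

```python
def cell_index(point, mins, maxs, cells_per_dim):
    """Map a point to a tuple of cell coordinates, one per dimension.

    mins/maxs give the observed value range of each dimension and
    cells_per_dim gives the (possibly different) number of cells per dimension.
    """
    index = []
    for x, lo, hi, n in zip(point, mins, maxs, cells_per_dim):
        if hi == lo:
            index.append(0)  # a constant attribute needs only one cell
            continue
        k = int((x - lo) / (hi - lo) * n)  # linear binning into n cells
        index.append(min(max(k, 0), n - 1))  # clamp so the maximum stays inside
    return tuple(index)


# Hypothetical two-dimensional example: 4 cells on the first axis,
# 2 cells on the second axis.
mins, maxs, cells = (0.0, 0.0), (1.0, 10.0), (4, 2)
print(cell_index((0.5, 2.5), mins, maxs, cells))
```

With such a mapping, a dimension that carries little structure could be given fewer cells, which keeps the total number of CA cells manageable as the number of dimensions grows.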

In this study, SCA-clustering has been compared with two clustering approaches: DBSCAN and k-means. These are robust and reliable clustering methods that have been utilized in many clustering applications. However, there are many other variations of k-means and density based methods proposed in the literature. The analysis of SCA-clustering can be enhanced by comparing it with other state-of-the-art methods in the literature.

As further future work, the approach can be used in an ensemble with other clustering methods. For instance, SCA-clustering can be used to determine the initial centroids of huge datasets quickly, and the fine tuning that yields the final clustering can then be carried out by k-means or its variants. It is also possible to use the approach for the classification problem. The heat transfer procedure and the state transition rules utilized in this study might create efficient separators in the input space when the classification problem is considered.
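The ensemble idea can be sketched as follows, under the assumption that a first stage (for instance SCA-clustering) has already produced a rough label for every instance. The function names `centroids_from_labels` and `refine_with_kmeans`, as well as the toy data, are hypothetical; the refinement is plain Lloyd-style k-means seeded with the centroids implied by the initial labels.

```python
import math


def centroids_from_labels(points, labels):
    """Compute one centroid per label as the mean of its member points."""
    sums, counts = {}, {}
    for p, lab in zip(points, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(p)
            counts[lab] = 0
        for d, x in enumerate(p):
            sums[lab][d] += x
        counts[lab] += 1
    return {lab: tuple(s / counts[lab] for s in sums[lab]) for lab in sums}


def refine_with_kmeans(points, initial_labels, max_iter=100):
    """Refine an initial labelling with standard k-means iterations."""
    labels = list(initial_labels)
    for _ in range(max_iter):
        centroids = centroids_from_labels(points, labels)
        # Reassign every point to its nearest centroid. A cluster that
        # loses all its points simply disappears in the next iteration.
        new_labels = [
            min(centroids, key=lambda lab: math.dist(p, centroids[lab]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
    return labels, centroids_from_labels(points, labels)


# Toy data (hypothetical): the third point is mislabelled by the first stage.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
rough_labels = [0, 0, 1, 1, 1, 1]
final_labels, final_centroids = refine_with_kmeans(toy_points, rough_labels)
print(final_labels)
```

Because the refinement starts from centroids that are already close to the final ones, only a few distance-based iterations should be needed, which is the motivation for such an ensemble on huge datasets.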

References

1. Hartigan J.A., Clustering Algorithms, John Wiley & Sons, Inc., New York, NY, USA, 99th edition, 1975.

2. Patel B.C. and Sinha, An adaptive k-means clustering algorithm for breast image segmentation, International Journal of Computer Applications 10 (2010), 35–38.

3. de Lope and Maravall, Data clustering using a linear cellular automata-based algorithm, Neurocomputing 114 (2013), 86–91.

4. Chen and He, A novel ant clustering algorithm based on cellular automata, Web Intelligence and Agent Systems: An International Journal 5 (2007), 1–14.

5. Ester, Kriegel H.P., Sander and Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, AAAI Press, 1996, 226–231.

6. Sander, Ester, Kriegel H.-P. and Xu, Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications, Data Mining and Knowledge Discovery 2 (1998), 169–194.

7. Kisilevich, Mansmann and Keim, P-DBSCAN: a density based clustering algorithm for exploration and analysis of attractive areas using collections of geo-tagged photos, in: Proceedings of the 1st international conference and exhibition on computing for geospatial research & application, ACM, p. 38.

8. Mehmood, Zhang, Bie, Dawood and Ahmad, Clustering by fast search and find of density peaks via heat diffusion, Neurocomputing 208 (2016), 210–217. SI: BridgingSemantic.

9. Maulik and Bandyopadhyay, Genetic algorithm-based clustering technique, Pattern Recognition 33 (2000), 1455–1465.

10. Karaboga and Ozturk, A novel clustering approach: Artificial bee colony (ABC) algorithm, Applied Soft Computing 11 (2011), 652–657.

11. Rahman M.A. and Islam M.Z., A hybrid clustering technique combining a novel genetic algorithm with k-means, Knowledge-Based Systems 71 (2014), 345–365.

12. Shuai, Dong and Shuai, A new data clustering approach: Generalized cellular automata, Information Systems 32 (2007), 968–977. Special Issue on Intelligent Information Processing.

13. Kokol, Povalej, Lenic and Stiglic, Building classifier cellular automata, in: Sloot P.M.A., Chopard, Hoekstra A.G. (Eds.), ACRI, volume 3305 of Lecture Notes in Computer Science, Springer, 2004, pp. 823–830.

14. Adwan, Huneiti, Ayyal Awwad, Al Damari, Ortega, Abu Dalhoum A.L. and Alfonseca, Utilizing an enhanced cellular automata model for data mining, International Review on Computers and Software (2013).

15. Gardner, Mathematical games: The fantastic combinations of John Conway’s new solitaire game “life”, Scientific American 223 (1970), 120–123.

16. Boerlijst and Hogeweg, Self-structuring and selection: Spiral waves as a substrate for prebiotic evolution, Artificial Life 2 (1991), 255–276.

17. Ermentrout G.B. and Edelstein-Keshet, Cellular automata approaches to biological modeling, Journal of Theoretical Biology 160 (1993), 97–133.

18. Langton C.G., Self-reproduction in cellular automata, Physica D: Nonlinear Phenomena 10 (1984), 135–144.

19. Mai and Von Niessen, A cellular automaton model with diffusion for a surface reaction system, Chemical Physics 165 (1992), 57–63.

20. Margolus, Toffoli and Vichniac, Cellular-automata supercomputers for fluid-dynamics modeling, Physical Review Letters 56 (1986), 1694.

21. Gionis, Mannila and Tsaparas, Clustering aggregation, ACM Transactions on Knowledge Discovery from Data (TKDD) 1 (2007), 4.

22. Ultsch, Clustering with SOM: U*C, in: Proc. Workshop on Self-Organizing Maps, Paris, France, pp. 75–82.

23. Jain and Law, Data clustering: A user’s dilemma, Pattern Recognition and Machine Intelligence, Proceedings 3776 (2005), 1–10.

24. Veenman C.J., Reinders M.J. and Backer, A maximum variance cluster algorithm, Pattern Analysis and Machine Intelligence, IEEE Transactions on 24 (2002), 1273–1280.

25. Handl and Knowles, Multiobjective clustering with automatic determination of the number of clusters, UMIST, Manchester, Tech. Rep. TR-COMPSYSBIO-2004-02 (2004).

26. Daszykowski, Walczak and Massart, Looking for natural patterns in data: Part 1. Density-based approach, Chemometrics and Intelligent Laboratory Systems 56 (2001), 83–92.