Empirical evaluation of five algorithms for the initialization phase of the k-Means algorithm

Abstract

A recurring problem in a wide variety of research areas, such as pattern recognition, machine learning, data mining and statistics, among others, is characterized as a clustering problem. Such a problem can be described in a simplified way as: given a set of data (observations, objects, points, etc.), group similar data into clusters (groups). A clustering of a given data set is then characterized as a set of clusters, in which elements belonging to a cluster are similar to each other and elements belonging to distinct clusters are not similar. Clustering algorithms are unsupervised algorithms and, among the many available in the literature, the k-Means, which uses a random initialization process, can be considered one of the most popular and successful. The performance of the k-Means, however, is highly dependent on a ‘good’ initialization of the $k$ cluster centers (centroids), as well as on the value assigned to the number ( $k$ ) of clusters the final clustering should have. This paper describes experiments using five initialization algorithms available in the literature, namely the Method1, the k-Means++, the CCIA, the Maedeh and Suresh and the SPSS algorithms, to empirically evaluate their contribution to improving the k-Means performance.

Keywords

Unsupervised learning, k-Means, initialization algorithms

1. Introduction and motivations

Considering the vast number of machine learning (ML) algorithms available, several taxonomies aiming at organizing such algorithms, by grouping them according to some criteria, can be found in the literature (see, for instance [20, 43]). Taxonomies based on the level of supervision required by the algorithm usually group them into three categories: (1) supervised learning, (2) unsupervised learning and (3) semi-supervised learning.

If the learning algorithm needs the value of the class attribute associated with each data instance to induce the expression of the concept, it is a supervised algorithm, in the sense that it ‘needs’ external information (in this case, the class value attached to the description of each data instance, provided by something or someone) to generalize the (training) set of data it receives as input. Semi-supervised algorithms use for training both data instances that have an associated class and data instances that do not have the class information in their descriptions. Semi-supervised algorithms are usually a convenient choice in situations where the number of unlabeled data instances is much larger than the number of labeled ones. An extensive survey on semi-supervised algorithms can be found in [45].

In many real-world situations, however, the class associated with each instance of the available training set is unknown. ML algorithms which deal with this type of data are known as unsupervised algorithms; they usually learn by identifying subsets of data that share certain similarities. The so-called clustering algorithms are the ones that most accurately characterize unsupervised algorithms. Given a data set X, the task of grouping the data of X into groups, such that data in the same group are more similar to each other than to data belonging to any other group, is referred to as a clustering process and is usually conducted by a clustering algorithm.

The process of grouping data instances based on measures of similarity (or dissimilarity) between them can be trivially performed by humans, but designing an algorithm for performing the task is not trivial. An algorithm for this purpose should identify groups of instances based only on their descriptions. As pointed out in [10], the design of efficient clustering techniques is considered a great challenge, mainly due to the fact that they do not have external supervision; this somehow implies they should work under the constraint of a total lack of prior knowledge about the internal structure of the data (such as spatial distribution, volume, density, geometric shapes of groups, etc.). In this scenario, automatic learning becomes an exploratory activity, aiming at identifying statistically separable data groups, detecting the most evident groups and their relation to what one wishes to discriminate, in an attempt to highlight the underlying structure of the data set, only having as information the data instance descriptions, each of them represented by a vector of attribute values.

The solution to a clustering problem can be addressed in several different ways. A large number of clustering algorithm proposals can be found in the literature [5], many of them supported by different mathematical formalisms and implementing a diverse range of strategies. Due to the broad range of characteristics that can be associated with clustering, several taxonomies have been proposed in an attempt to organize clustering algorithms into categories according to several criteria, such as those found in [1, 6, 7, 9, 36, 37].

Two categories of clustering algorithms found in most taxonomies are known as partitional and hierarchical; they mainly differ from each other in the way they approach the clustering process, namely in whether the clusters are nested or not. While a partitional clustering algorithm implements an iterative procedure that divides the data set into disjoint groups (clusters) until a final clustering is obtained, a hierarchical algorithm produces a sequence of clusterings whose clusters are nested, usually organized as a tree.

Among the several partitional algorithms available in the literature, the k-Means [21] is considered the most successful and has been used in a large number of applications. It is known that the k-Means suffers from the so-called initialization problem, related to the initial set of group centers (or centroids) from which the iterative process conducted by the algorithm begins. Several algorithms that attempt to solve the problem of centroid initialization can be found in the literature.

This article is an extension of the work described in [2] which reports an empirical investigation of five algorithms available in the literature that claim to provide better initializations to the k-Means algorithm, than the random initialization which is an intrinsic part of the original algorithm. This article expands the previous work by giving a more detailed contextualization of the area, by providing a more refined description of the five initialization algorithms, by introducing two new sections, one with focus on the validation indices used in the experiments, and the other on the computational system developed for the experiments and, also, by extending the number of experiments conducted considering six more data domains.

The remainder of this paper is organized as follows. Section 2 briefly revisits the k-Means algorithm. Section 3 presents and discusses the main aspects of the five initialization algorithms used to initialize the k-Means, whose performance is the focus of this work; the algorithms are (1) Method1 [27], (2) k-Means++ [12], (3) SPSS [24, 25], (4) Maedeh and Suresh [8] and, finally, (5) CCIA [42]. Section 4 introduces the main characteristics of the validation indices used in the experiments, namely Dunn [22], Silhouette [26, 33] and Rand [44]. Section 5 describes several relevant details of a user-friendly computational system, named I-k-Means (Initializing k-Means), that was developed to support the evaluation of the impact of the initialization procedure on the k-Means performance. Section 6 describes the data domains used and the methodology adopted for the experiments, and presents the results, discussing the contribution of each initialization proposal to the k-Means performance and commenting on the values of the employed validation indices. Section 7 summarizes the work done, presents a few conclusions based on the results, and ends the article by considering a few directions in which the work may proceed.

Figure 1.

Simplified pseudocode of the k-Means, evidencing both phases, initialization and iterative, based on the description found in [23].

2. Revisiting the k-Means algorithm

As mentioned before, the k-Means algorithm is a partitional algorithm and, as such, it induces a partition (clustering) on the given set of instances, into $k$ groups (clusters), where the value of $k$ is also input to the algorithm.

Using a generic notation and considering a given data set of N instances, DIS $=$ { $I_{1}$ , $I_{2}$ , $\ldots$ , $I_{N}$ }, and an integer $k$ , a k-clustering of DIS, denoted as AG ${}_{k}$ , is defined as a partition of DIS in $k$ clusters (groups), $G_{1}$ , $G_{2}$ , $\ldots$ , $G_{k}$ i.e., AG ${}_{k}=$ { $G_{1}$ , $G_{2}$ , $\ldots$ , $G_{k}$ }. It is assumed that data instances in a cluster $G_{i}$ ( $i=$ 1, $\ldots$ , $k$ ) are “more similar” to each other than to instances belonging to other clusters.

According to the mathematical definition of a $k$ partition of a given set S, P(S) $=$ { $P_{1}$ , $P_{2}$ , $\ldots$ , $P_{k}$ }, the following three conditions must be satisfied:

(1) $P_{i}\neq\emptyset$ , for $i=1,\ldots,k$ ,
(2) the union of all $P_{i}$ s recomposes the initial set, i.e., $S=\bigcup^{k}_{i=1}P_{i}$ ,
(3) $P_{i}\cap P_{j}=\emptyset$ for $i\neq j$ and $i,j=1,\ldots,k$ .
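
The three conditions above can be checked mechanically. The following is a minimal Python sketch, not part of the original work; the function name `is_k_partition` and the representation of the parts as Python sets are assumptions made only for illustration.

```python
def is_k_partition(parts, s):
    """Return True if `parts` (a list of sets) is a partition of the set `s`."""
    s = set(s)
    # Condition (1): no part is empty.
    if any(len(p) == 0 for p in parts):
        return False
    # Condition (2): the union of all parts recomposes s.
    if set().union(*parts) != s:
        return False
    # Condition (3): the parts are pairwise disjoint. Given condition (2),
    # this holds exactly when the part sizes add up to |s|.
    return sum(len(p) for p in parts) == len(s)
```

For instance, `is_k_partition([{1, 2}, {3}], {1, 2, 3})` holds, whereas `[{1, 2}, {2, 3}]` fails condition (3) because the two parts share the element 2.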

Figure 2.
Pseudocode of the Method1, as described in [27].

Since 1967, the year in which it was published, the k-Means algorithm [21] has remained the most widely known and widespread clustering algorithm. Since its initial proposal it has been used in a large number of computational systems dealing with the most diverse knowledge domains. Much of the success of k-Means is due to its simplicity, ease of understanding and implementation, and fast execution. However, as pointed out in [29, 36], the algorithm has some disadvantages: it can only detect compact, well-separated hyperspherical clusters, it is sensitive to noise and to outlier instances, and there is no guarantee that it converges to a global optimum. Possible solutions for some of these problems are also presented in those references. Figure 1 describes the pseudocode of the k-Means algorithm based on the description found in [23].
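
As a complement to the pseudocode, the two phases of the k-Means can be sketched in Python as follows. This is an illustrative sketch and not the implementation used in the experiments: the function name `k_means`, the optional `centroids` argument (which lets an external initialization algorithm replace the random choice), the Euclidean distance, the `max_iter` limit and the rule that an empty cluster keeps its previous centroid are all assumptions.

```python
import math
import random


def k_means(data, k, centroids=None, max_iter=100, seed=None):
    """Cluster `data` (a list of equal-length numeric tuples) into k groups.

    Returns the final centroids and the cluster label of each instance.
    """
    rng = random.Random(seed)
    # Initialization phase: random choice of k instances, unless an
    # initial set of centroids is supplied by the caller.
    if centroids is None:
        centroids = rng.sample(data, k)
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each instance goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(x, centroids[j]))
            for x in data
        ]
        if new_labels == labels:
            break  # no instance changed cluster: the process has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster
        # (assumption: an empty cluster keeps its previous centroid).
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:
                centroids[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, labels
```

Passing `centroids` explicitly makes the run deterministic, which is convenient when comparing the effect of different initialization procedures on the same data.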

Many works reported in the literature mention that both the convergence of the iterative process and the performance of the clustering induced by the k-Means depend on the initial set of centroids [39]. Both the number of groups ( $k$ ) and the initialization of the centroids are relevant aspects that deeply affect the performance of the algorithm.

As Fig. 1 shows, in the initialization phase carried out by the original k-Means, $k$ data instances from the given data set are randomly selected as the initial centroids of the $k$ clusters. This can be considered a deficiency, since there are more suitable ways of selecting an appropriate set of $k$ centroids than choosing them randomly. Section 3 addresses five algorithms found in the literature, which are based on a diverse set of mathematical formalisms and aim at providing the k-Means with a ‘good’ set of $k$ initial centroids.

3. A brief introduction to five initialization algorithms

The five algorithms considered for replacing the random initialization phase of the k-Means, which chooses the initial set of centroids, are briefly discussed next in the following sequence: the Method1 [27], the k-Means++ [12], the Single Pass Seed Selection (SPSS) [24, 25], the Maedeh and Suresh algorithm [8] (named in this paper after its creators) and the Cluster Center Initialization Algorithm (CCIA) [42].

In [27] two k-Means initialization algorithms are proposed and the Method1, the first to be proposed, is one of the five algorithms considered for the experiments. The Method1 can be characterized as a grid-based algorithm, since it considers the space defined by the data divided into a certain number of cells, all of them with the same dimensions. The process of selecting the initial set of centroids is done all at once and, according to the authors, there is no need for further attempts to define them. Also, as the authors point out, the algorithm does not require a lower limit on the number of data instances.

The Method1 is based on the idea of selecting the initial $k$ centroids (of clusters) according to the distribution of data instances at a macro level, leaving the clustering algorithm (the k-Means, in this case) in charge of the grouping task itself, which is expected to improve and refine the initial solution given by Method1. The algorithm distributes the centroids directly, driven by the density of the data instances. Its pseudocode is presented in Fig. 2, as described in [27]. As pointed out in [29] and confirmed in the experiments in Section 6, one of the main disadvantages of the Method1 algorithm is the difficulty of establishing an appropriate value for M (a variable representing the number of cells considered for the grid, as can be seen in the pseudocode presented in Fig. 2).

The k-Means variant known as k-Means++ [12] is the original k-Means itself, with a change in its initialization phase. The k-Means++ initialization process still chooses the initial $k$ centroids randomly, but weights each candidate instance by the square of its distance to the closest centroid among those already chosen.

In their paper, the authors conducted preliminary empirical comparative studies based on four datasets, two synthetic and two real-world. Due to the random nature of the algorithms (k-Means and k-Means++), 20 trials were run for each case. The results show that the k-Means++ performs better than the original k-Means in both accuracy and speed, generally by a substantial margin.

The authors also present a formal proof that augmenting k-Means with a simple randomized seeding technique yields an algorithm that is O(log k)-competitive with the optimal clustering. Figure 3 presents the pseudocode of the k-Means++, which is based on the one described in [12] and partially adapted to the notation adopted in this paper, where X is the set of data instances to be clustered. The algorithm uses a function D: X $\to\Re$ , where the real value associated with x, D(x), denotes the distance from x to the closest centroid already chosen.
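
The seeding step just described, in which each instance x is drawn with probability proportional to D(x) squared, can be sketched in Python as below. This is a hedged illustration rather than the code of [12]: the function name `kmeans_pp_seeds` is hypothetical, the Euclidean distance is assumed, and the data set is assumed to contain at least $k$ distinct instances (otherwise all weights become zero and the weighted draw fails).

```python
import math
import random


def kmeans_pp_seeds(data, k, seed=None):
    """Choose k initial centroids from `data` (a list of numeric tuples)."""
    rng = random.Random(seed)
    # First centroid: chosen uniformly at random from the data.
    centroids = [rng.choice(data)]
    # d2[i] holds D(x_i)^2, the squared distance from x_i to the closest
    # centroid chosen so far.
    d2 = [math.dist(x, centroids[0]) ** 2 for x in data]
    while len(centroids) < k:
        # Draw the next centroid with probability proportional to D(x)^2;
        # instances already chosen have weight zero.
        nxt = rng.choices(data, weights=d2, k=1)[0]
        centroids.append(nxt)
        d2 = [min(old, math.dist(x, nxt) ** 2) for x, old in zip(data, d2)]
    return centroids
```

Since already chosen instances have D(x) = 0, they are never drawn again, so the returned centroids are distinct data instances.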

Figure 3.

Pseudocode of k-Means++, a variant of k-Means with a proper initialization phase, based on its description presented in [12].

Figure 4.

Simplified pseudocode of the SPSS, based on the description found in [24].

Authors in [24, 25] point out that, considering the fact that the k-Means++ [12] starts the initialization process by randomly choosing the first centroid, this can affect the number of iterations at each execution, eventually giving rise to different results, for the same set of data instances to be clustered.

The authors also state that for k-Means++ to reach a good result, it has to be run a certain number of times. Their proposed SPSS algorithm can be considered a slightly modified k-Means++ that selects, as the first centroid, the highest density data instance (note that the first centroid is a deterministic choice).

Figure 5.

Initial pseudocode of the Maedeh and Suresh algorithm [8], with calls for two procedures (1) Maedeh_Suresh_phase1( $X, k, C$ ), for choosing the initial set of centroids ( $C$ ) and (2) Maedeh_Suresh_phase2(X,C,AG), for the assignment of instances to their corresponding clusters.

Figure 6.

Simplified pseudocode of the initialization phase by the Maedeh and Suresh algorithm, based on the description found in [8].

According to the authors, the initial set of centroids selected by the SPSS has several advantages, since it benefits the quality of the clusters induced by k-Means, the number of iterations performed by k-Means and the number of distance calculations that have to be performed. Figure 4 presents the pseudocode of the SPSS algorithm, based on the descriptions found in [24, 25].

Still comparing k-Means++ and SPSS, the authors of the latter comment in [24] that “for selecting y, the number of passes executed by the k-Means++, in the worst case, is $\max\{d(x_{1})^{2}+d(x_{2})^{2}+\ldots+d(x_{N})^{2}\}$ , while it is equal to one in SPSS and therefore, the SPSS is a single pass algorithm with a unique solution while the k-Means++ is not.”

The Maedeh and Suresh algorithm [8] can be considered a modified k-Means. The algorithm has the same two phases as the original k-Means, where the purposes of both phases have been maintained i.e., they refer to the initialization and iteration processes, respectively.

Similarly to the k-Means, in the initialization phase of the Maedeh and Suresh algorithm $k$ centroids of clusters are chosen and, in the iterative phase, the algorithm assigns each data instance to one of the $k$ clusters, depending on how distant it is from the cluster centroids. Since the main goal of this paper concerns initialization algorithms, in what follows only the initialization phase of the Maedeh and Suresh algorithm will be taken into account.

In [8] the authors state that their algorithm has greater accuracy when compared to the accuracy of the original k-Means and, when using the same set of data instances, has the advantage of always producing the same clustering.

Figure 7.

Pseudocode of the first part of the CCIA, for identifying the initial centroids to be passed on to the k-Means (as in [42]).

Figure 5 shows the initial pseudocode of the algorithm (procedure Maedeh_Suresh), where each phase is implemented by an auxiliary algorithm, identified in this paper as Maedeh_Suresh_phase1 and Maedeh_Suresh_phase2, in charge, respectively, of the initialization and iteration processes. As previously mentioned, only the Maedeh_Suresh_phase1 procedure is of interest in the work described in this paper and its pseudocode is presented in Fig. 6.

In the initialization phase implemented by procedure Maedeh_Suresh_phase1, in Fig. 6, the distance from each data instance to the Cartesian origin is determined (step (1)); instances are then sorted in ascending order, based on their corresponding distance values to the origin (step (2)) and the first and the last instances are chosen as first and second cluster centroids, respectively (step (3)). All the remaining data instances are then assigned to their closest centroid and the process enters an iterative mode, until all $k$ centroids have been chosen.

Considering a dataset with N data instances, the procedure Maedeh_Suresh_phase1 uses two N-dimensional vectors: (a) ClusterId and (b) NearestDist. For each position i (1 $\leqslant$ i $\leqslant$ N), (a) ClusterId holds the number of the cluster to which the data instance $x_{i}$ has been assigned and (b) NearestDist holds the distance from $x_{i}$ to the centroid of that cluster.

Next the algorithm identifies in the NearestDist vector, the position that has the highest value and chooses its corresponding data instance as the next centroid. The process iterates until $k$ centroids have been chosen. As mentioned before, in the experiments described in Section 6, the procedure Maedeh_Suresh_phase1 just described was used as the initialization process of the k-Means.
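
The initialization steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions and not the authors' code: the function name mirrors the procedure name used in this paper, the Euclidean distance is assumed, $k\geqslant 2$ is assumed, and the way the two vectors are updated after each new centroid (an instance is reassigned only when the new centroid is closer) is an assumption, since the text above does not detail it.

```python
import math


def maedeh_suresh_phase1(data, k):
    """Deterministic choice of k initial centroids (k >= 2 is assumed)."""
    origin = (0.0,) * len(data[0])
    # Steps (1)-(2): sort the instances by their distance to the origin.
    ordered = sorted(data, key=lambda x: math.dist(x, origin))
    # Step (3): the first and the last instances are the first two centroids.
    centroids = [ordered[0], ordered[-1]]
    # cluster_id[i]: index of the closest chosen centroid of ordered[i];
    # nearest_dist[i]: distance from ordered[i] to that centroid.
    cluster_id, nearest_dist = [], []
    for x in ordered:
        dists = [math.dist(x, c) for c in centroids]
        j = dists.index(min(dists))
        cluster_id.append(j)
        nearest_dist.append(dists[j])
    while len(centroids) < k:
        # The instance farthest from its own centroid is the next centroid.
        far = nearest_dist.index(max(nearest_dist))
        centroids.append(ordered[far])
        # Assumed update rule: reassign the instances that are closer to
        # the new centroid than to their current one.
        for i, x in enumerate(ordered):
            d = math.dist(x, ordered[far])
            if d < nearest_dist[i]:
                cluster_id[i], nearest_dist[i] = len(centroids) - 1, d
    return centroids
```

Because no random choice is involved, the same data set always yields the same initial centroids, which is consistent with the determinism claimed in [8]. The `cluster_id` vector is kept only to mirror the ClusterId vector of the description.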

According to Khan and Ahmad [42], the Cluster Center Initialization Algorithm (CCIA) was proposed with the intent of obtaining a good initial set of centroids to be used by the k-Means. The algorithm was based on two observations made by the authors about clustering processes under the k-Means:

(1) some data instances are very similar to each other and, due to that, they end up belonging to the same cluster, regardless of the way the initial choice of centroids is made;

(2) attributes, individually, may also provide some information regarding the initial centroids.

The CCIA exclusively aims at initializing centroids for the k-Means and is suitable only for numerical data.

The detailed description of the CCIA algorithm given in [42] is divided into two parts, where the second part is conditionally executed, depending on the results of the first part. The first part generates $k^{\prime}$ cluster centroids and, if $k^{\prime}>k$ , the second part of the algorithm is executed to merge similar clusters, aiming to get $k$ cluster centroids, which will initialize the k-Means instead of the default procedure that chooses them randomly.

The CCIA is presented in [42] with a long and detailed description which, at its final stage, may require the use of an algorithm for merging clusters (in [42] the DBMSDC (Density-based multiscale data condensation) [32] was used) when the number of centroids obtained, $k^{\prime}$ , is larger than the given value $k$ . The merged clusters are those whose corresponding centroids are close to each other. However, since the focus of this paper is on algorithms for the initialization of the k-Means and considering that the DBMSDC is a well-known algorithm, only the first part of the CCIA is presented, in Fig. 7.

4. Assessing the quality of induced clustering – validation indices

An important issue when using clustering algorithms is to evaluate the resulting clusterings. Several validation indices have been proposed in the literature for measuring the quality of the clusterings induced by unsupervised clustering algorithms.

As discussed in [16, 17, 30, 31, 34, 37], cluster validity can be assessed via different approaches, namely (1) internal criteria, based only on statistical measures associated with the patterns themselves, (2) external criteria, based on a pre-defined structure associated with the set of patterns and (3) relative criteria, based on comparing the induced clustering with others, induced by the same algorithm on the same data but employing, e.g., different parameter values, or even by other algorithms.

As suggested in various works found in the literature, a strategy to induce a good clustering for a given set of instances is to run the clustering algorithm a certain number of times, specifying a different number of clusters at each time and, then, selecting the clustering that optimizes the validation index, as the final result [40, 46].

In the experiments described in Section 6 the results obtained with the clustering algorithms were evaluated using two internal validation indices namely, the Dunn’s index (D) [22] and the Silhouette index (S) [26, 33]. Also, to quantify the number of data instances incorrectly assigned (taking into account the visually identifiable clusters in non-supervised sets of data instances), an external validation index, the Rand index (R) [44], was also used.

The main goal of most validation indices is to identify clusterings whose clusters are compact and well-separated. A drawback of using validation indices is the computational cost involved, when the number of clusters in the clusterings increases and the dimensionality of the data instances also increases. Consider the following notation for the descriptions that follow:

$X$ : set of data instances to be clustered into clustering $C=$ { $C_{1}$ , $C_{2}$ , $\ldots$ , $C_{k}$ }
$k$ : number of clusters
$N$ : number of data instances in $X$
$C_{i}$ : the $i^{\rm th}$ cluster of the clustering
$n_{i}$ : number of data instances in $C_{i}$
$I_{p}$ , $I_{q}$ , $I_{x}$ , $I_{y}$ : generic data instances in $X$

Let the distance between two data instances, $I_{x}$ and $I_{y}$ , be represented as dist( $I_{x}$ , $I_{y}$ ).

4.1 The Dunn’s index

The Dunn’s index can be approached as the ratio of the smallest distance between two data instances from different clusters (implicitly involving the separation between clusters) to the largest distance between two data instances from the same cluster (implicitly involving cluster density). Compact and well separated clusters have a large value for the smallest distance between two data instances from different clusters and a small value for the largest distance between two data instances from the same cluster. The Dunn’s index (D) is defined by Eq. (1).

$\displaystyle D=\min_{i=1,\ldots,k}\{\min_{j=i+1,\ldots,k}(A/B)\},\ \text{where}$
$\displaystyle A=\min_{I_{p}\in C_{i},\,I_{q}\in C_{j}}\text{dist}(I_{p},I_{q})\ \text{and}$
$\displaystyle B=\max_{h=1,\ldots,k}\{\max_{I_{p},I_{q}\in C_{h}}\text{dist}(I_{p},I_{q})\}$ (1)

The Dunn’s index can also be defined by Eq. (2) where $\textit{dist}_{\textit{min}}$ is the smallest distance between two data instances belonging to different clusters and $\textit{dist}_{\textit{max}}$ is the largest distance between two data instances that belong to the same cluster.

$\displaystyle D=\textit{dist}_{\textit{min}}/\textit{dist}_{\textit{max}}$ (2)

The Dunn’s index takes into consideration the density associated with each cluster as well as the distances that separate clusters. Thus D $\in$ [0, $\infty$ ) and the larger its value, the better the induced clustering is, in terms of both the separability between clusters and the compactness of its clusters.
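
The definition in Eq. (2) translates almost directly into code. The Python sketch below is an illustration only (it is not taken from the I-k-Means system); the function name `dunn_index`, the representation of a clustering as a list of lists of numeric tuples and the Euclidean distance are assumptions. It also assumes at least two clusters and at least one cluster containing two distinct instances, so that $\textit{dist}_{\textit{max}}>0$.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """`clusters` is a list of clusters, each a list of numeric tuples."""
    # dist_min: smallest distance between instances of different clusters.
    dist_min = min(
        math.dist(p, q)
        for ci, cj in combinations(clusters, 2)
        for p in ci
        for q in cj
    )
    # dist_max: largest distance between instances of the same cluster.
    dist_max = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    return dist_min / dist_max
```

Every pair of instances is visited once, so the cost grows quadratically with the number of data instances, which illustrates the computational drawback mentioned at the beginning of this section.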

4.2 The Silhouette index

Consider a clustering $C$ on $X$ , and instance $I_{x}\in X$ such that $I_{x}\in C^{*}$ , where $C^{*}$ is one of the clusters of $C$ .

For each data instance $I_{x}\in X$ , let $a(I_{x})$ represent the average distance between $I_{x}$ and all the other data instances in $C^{*}$ .

The value $a(I_{x})$ can be interpreted as a measure of how well $I_{x}$ has been assigned to $C^{*}$ ; the smaller the value, the better the assignment.

Let the average dissimilarity of data instance $I_{x}$ to a cluster $C^{w}$ $\in C$ be the average distance from $I_{x}$ to all data instances in $C^{w}$ .

Let $b(I_{x})$ be the smallest average distance of $I_{x}$ to all data instances in any other cluster $C^{\prime}\in C$ , $C^{\prime}\neq C^{*}$ . The cluster that attains this smallest average distance is considered to be the “neighbouring cluster” of $I_{x}$ . Then the silhouette value of data instance $I_{x}$ is given as [41]:

$\displaystyle s(I_{x})=(b(I_{x})-a(I_{x}))/\max\{a(I_{x}),b(I_{x})\}$

which can be written as

$\displaystyle s(I_{x})=\left\{\begin{array}{ll}1-a(I_{x})/b(I_{x})&\text{if}\ a(I_{x})<b(I_{x})\\ 0&\text{if}\ a(I_{x})=b(I_{x})\\ b(I_{x})/a(I_{x})-1&\text{if}\ a(I_{x})>b(I_{x})\end{array}\right.$

and from the above definition it can be inferred that $-1\leqslant s(I_{x})\leqslant 1$ .

The average of the silhouette values $s(I_{x})$ over all data instances of a cluster $C^{\prime}\in C$ is a measure of how tightly grouped the data instances in $C^{\prime}$ are. Averaging $s(I_{x})$ over the data instances in $X$ gives a measure of how appropriately $X$ has been clustered; it is computed by Eq. (3) as the mean of the per-cluster averages.

$\displaystyle S=\frac{1}{k}\sum_{i=1}^{k}\left\{\frac{1}{n_{i}}\sum_{I_{x}\in C_{i}}\frac{b(I_{x})-a(I_{x})}{\max\{b(I_{x}),a(I_{x})\}}\right\},\ \text{where}$
$\displaystyle a(I_{x})=\frac{1}{n_{i}-1}\sum_{I_{y}\in C_{i},\,I_{y}\neq I_{x}}\text{dist}(I_{x},I_{y})\ \text{and}$
$\displaystyle b(I_{x})=\min_{j,\,j\neq i}\left(\frac{1}{n_{j}}\sum_{I_{y}\in C_{j}}\text{dist}(I_{x},I_{y})\right)$ (3)

In [26, 33] the creators of the Silhouette index give a subjective interpretation of their index: values between 0.71 and 1.00 are interpreted as a clustering having a strong structure, between 0.51 and 0.70 as having a reasonable structure, between 0.26 and 0.50 as having a weak structure which could be artificial, and $\leqslant$ 0.25 as an indication that no substantial structure has been found.
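
The computation of $S$ as the mean over the clusters of the per-cluster average silhouette can be sketched in Python as follows. This is an illustrative sketch, not the code used in the experiments; the function name `silhouette_index`, the Euclidean distance and the representation of a clustering as a list of lists of numeric tuples are assumptions. Two further assumptions are needed for cases the definitions above leave open: an instance in a singleton cluster is given $s=0$, and the clustering is assumed to have at least two clusters and no instance for which $a$ and $b$ are both zero.

```python
import math


def silhouette_index(clusters):
    """Mean over the clusters of the per-cluster average silhouette."""
    def mean_dist(x, points):
        return sum(math.dist(x, y) for y in points) / len(points)

    cluster_means = []
    for i, ci in enumerate(clusters):
        values = []
        for idx, x in enumerate(ci):
            if len(ci) == 1:
                values.append(0.0)  # assumption: s = 0 for singleton clusters
                continue
            # a: average distance to the other instances of the same cluster.
            a = mean_dist(x, ci[:idx] + ci[idx + 1:])
            # b: smallest average distance to the instances of another cluster.
            b = min(mean_dist(x, cj) for j, cj in enumerate(clusters) if j != i)
            values.append((b - a) / max(a, b))
        cluster_means.append(sum(values) / len(values))
    return sum(cluster_means) / len(cluster_means)
```

When the clusters have very different sizes, this mean of per-cluster means generally differs from the plain average of $s(I_{x})$ over all instances, since small clusters weigh as much as large ones.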

4.3 The Rand index

In data clustering the value of the Rand index [44] can be seen as a measure of the similarity between two data clusterings. In order to use the Rand index in the experiments described in Section 6, one of the clusterings is induced by the clustering algorithm and the other is provided externally by the user, taking into account the inherent separation that can be visually perceived, and by assigning each pattern to a perceptual cluster. This assignment was conducted only for the unsupervised datasets; for the supervised datasets employed, the information about the class was the criterion for creating the supervised clustering.

In a general setup, considering that one of the clusterings of the set of data instances in $X$ is given as $Y=$ { $Y_{1}$ , $Y_{2}$ , $\ldots$ , $Y_{NY}$ } and the other as $Z=$ { $Z_{1}$ , $Z_{2}$ , $\ldots$ , $Z_{NZ}$ }, the Rand index value is calculated based on the following values:

(i) $a$ : number of pairs of patterns in $X$ that share the same cluster in $Y$ and share the same cluster in $Z$ ;
(ii) $b$ : number of pairs of patterns in $X$ that are placed in different clusters in $Y$ and that are placed in different clusters in $Z$ ;
(iii) $c$ : number of pairs of patterns in $X$ that share the same cluster in $Y$ and are in different clusters in $Z$ and
(iv) $d$ : number of pairs of patterns in $X$ that are in different clusters in $Y$ and share the same cluster in $Z$ .

Once the values $a$ , $b$ , $c$ and $d$ have been calculated, the Rand index ( $R$ ) is defined by Eq. (4).

$\displaystyle R=(a+b)/(a+b+c+d)$ (4)

Intuitively ( $a+b$ ) can be thought of as the number of agreements between clusterings $Y$ and $Z$ and ( $c+d$ ) as the number of disagreements between $Y$ and $Z$ . More details about clustering indices can be seen in [30, 46].
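
The counting of the four pair types and Eq. (4) can be sketched in Python as below. This is a minimal illustration, not the code of the I-k-Means system; the function name `rand_index` and the representation of each clustering as a list holding one cluster label per data instance (in the same instance order) are assumptions, and at least two data instances are assumed.

```python
from itertools import combinations


def rand_index(labels_y, labels_z):
    """Rand index of two clusterings given as per-instance label lists."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_y)), 2):
        same_y = labels_y[i] == labels_y[j]
        same_z = labels_z[i] == labels_z[j]
        if same_y and same_z:
            a += 1  # same cluster in Y and same cluster in Z
        elif not same_y and not same_z:
            b += 1  # different clusters in Y and in Z
        elif same_y:
            c += 1  # same cluster in Y, different clusters in Z
        else:
            d += 1  # different clusters in Y, same cluster in Z
    return (a + b) / (a + b + c + d)
```

Only the co-membership of pairs matters, so the index is unaffected by how the clusters are numbered: `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` equals 1.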
5. The I-k-Means computational system

During the research work described in [3], which was partially presented in [2], a computational system named I-k-Means (Initializing k-Means) was designed and implemented aiming at providing a suitable computational environment for running initialization methods intended to improve the performance of k-Means.

The I-k-Means system was developed in the Java programming language (version 1.7), using the Eclipse Luna development platform, under the Ubuntu Linux operating system (version 17.04).

The I-k-Means can be considered a multiplatform system, taking into account that its implementation in Java allows the system to run on any platform (operating system) without changes to the source code. To run the I-k-Means system, the Java Runtime Environment (JRE), which provides the Java virtual machine, must be installed. The architecture of I-k-Means is divided into three main modules, namely:

Figure 8.

Plotting of four synthetic data sets used in the experiments namely MSD, LongSquare, Aggregation and 3MC.

Figure 9.

Plotting of the remaining three synthetic data sets out of the seven synthetic data used in the experiments namely, Ruspini, Mouse-Like and Spherical_6_2.

Table 1

Characteristics of the data sets, where ID: data set identification, #NI: no. of data instances, #NA: no. of attributes, #NG: no. of groups and G_id $=$ #NI: no. of instances per group, where G_id represents the group identification. The first seven rows refer to synthetic data and the last seven to real data

ID	#NI	#NA	#NG	G_id $=$ #NI
MSD	30	2	5	1 $=$ 4, 2 $=$ 7, 3 $=$ 10, 4 $=$ 4, 5 $=$ 5
LongSquare	900	2	6	1 $=$ 147, 2 $=$ 155, 3 $=$ 150, 4 $=$ 148, 5 $=$ 150, 6 $=$ 150
Aggregation	788	2	7	1 $=$ 45, 2 $=$ 170, 3 $=$ 102, 4 $=$ 273, 5 $=$ 34, 6 $=$ 130, 7 $=$ 34
3MC	400	2	2	1 $=$ 120, 2 $=$ 170, 3 $=$ 170
Ruspini	75	2	4	1 $=$ 20, 2 $=$ 23, 3 $=$ 17, 4 $=$ 15
Mouse-Like	1,000	2	5	1 $=$ 200, 2 $=$ 200, 3 $=$ 600
Spherical_6_2	250	2	5	1 $=$ 50, 2 $=$ 50, 3 $=$ 50, 4 $=$ 50, 5 $=$ 50
Iris	150	4	3	1 $=$ 50, 2 $=$ 50, 3 $=$ 50
Fossil	87	6	3	1 $=$ 40, 2 $=$ 24, 3 $=$ 13
Wine	178	13	3	1 $=$ 59, 2 $=$ 71, 3 $=$ 48
Seeds	210	7	3	1 $=$ 70, 2 $=$ 70, 3 $=$ 70
Blood Transfusion	748	4	2	1 $=$ 178, 2 $=$ 570
Diabetes	768	8	2	1 $=$ 268, 2 $=$ 500
E.coli	336	8	8	1 $=$ 143, 2 $=$ 77, 3 $=$ 2, 4 $=$ 2, 5 $=$ 35, 6 $=$ 20, 7 $=$ 5, 8 $=$ 52

Table 2

MSD

AG	D	S	R	I
	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$
++	0.4245/0.2301	0.9586/0.0457	0.9586/0.0457	2.400/0.4245
M1	0.4018/0.2064	0.6782/0.1095	0.9587/0.0715	2.850/0.4018
k-M	0.2579/0.2498	0.6041/0.0958	0.9302/0.0642	4.500/2.1563
CCIA	0.014	0.3403	0.8391	7
SPSS	0.1701	0.6754	0.908	1
MS	0.6328	0.6920	1	2

Table 3

LongSquare

AG	D	S	R	I
	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$
++	0.0202/0.0062	0.5162/0.0237	0.9367/0.0290	8.9000/4.1821
M1	0.0203/0.0057	0.5153/0.0237	0.9338/0.0259	12.4500/4.7061
k-M	0.0152/0.0064	0.4974/0.0216	0.9240/0.0205	12.3000/6.1081
CCIA	0.0230	0.4645	0.8602	30
SPSS	0.0242	0.5309	0.9445	2
MS	0.0242	0.5309	0.9445	4

(1)

Import & Preprocess, which, among other functionalities, loads the set of data instances to be clustered, normalizes attribute values and removes duplicate data instances. The module expects the data instances to be provided in an ARFF (Attribute-Relation File Format) file;

Table 4

Aggregation

AG	D	S	R	I
	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$
++	0.0294/0.0048	0.4716/0.0255	0.9099/0.0158	13.8500/3.6094
M1	0.0292/0.0062	0.4644/0.0301	0.9030/0.0157	15.5000/5.7576
k-M	0.0314/0.0057	0.4582/0.0175	0.8898/0.0089	18.0000/6.8920
CCIA	0.0339	0.4843	0.9202	20
SPSS	0.0340	0.4919	0.9110	13
MS	0.0319	0.4047	0.8830	28

Table 5

3MC

AG	D	S	R	I
	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$
++	0.0224/0.0019	0.5105/0.0051	0.8970/0.0618	6.8500/2.6509
M1	0.0234/0.0027	0.5066/0.0074	0.8541/0.0843	7.8/3.9949
k-M	0.0239/0.0028	0.5051/0.0074	0.8367/0.0863	6.9500/2.6921
CCIA	0.0216	0.5126	0.9230	0
SPSS	0.0270	0.4983	0.7498	8
MS	0.0270	0.4983	0.7498	7

(2)

Trace & Validation makes available the implementation of the five initialization algorithms that are the focus of the research work, namely Method1, k-Means++, Maedeh and Suresh, SPSS and CCIA (see Section 3), as well as the implementation of the initialization process via random choice (k-M). The module also makes available the implementation of three validation indices namely the Dunn, Silhouette and the Rand indices (see Section 4).

Table 6

Ruspini

AG	D	S	R	I
	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$
++	0.4894/0.1061	0.7159/0.0762	0.9843/0.0470	1.700/1.0535
M1	0.4018/0.2064	0.6782/0.1095	0.9587/0.0715	2.8500/1.0618
k-M	0.3110/0.2334	0.6313/0.1218	0.9292/0.0787	2.6500/1.1521
CCIA	0.521	0.7413	1	0
SPSS	0.521	0.7413	1	2
MS	0.521	0.7413	1	1

Table 7

Mouse-like

AG	D	S	R	I
	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$
++	0.0194/0.0015	0.5041/0.0003	0.7339/0.0017	6.8000/2.9597
M1	0.0191/0.0014	0.5040/0.0002	0.7334/0.0012	7.7500/2.1650
k-M	0.0197/0.0016	0.5041/0.0003	0.7341/0.0016	7.2500/2.2555
CCIA	0.0183	0.5039	0.7328	5
SPSS	0.0216	0.5047	0.7368	6
MS	0.0216	0.5047	0.7368	5

Table 8

Spherical_6_2

AG	D	S	R	I
	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$
++	0.3028/0.2344	0.6894/0.0661	0.9686/0.0346	3.1000/2.7730
M1	0.5149/0.0000	0.7481/0.0000	1.0000/0.0000	2.3500/0.7262
k-M	0.1240/0.1693	0.6188/0.0645	0.9291/0.0389	5.1500/2.7730
CCIA	0.5149	0.7481	1.00	0
SPSS	0.0204	0.5961	0.9305	5
MS	0.5149	0.7481	1.00	1

Table 9

Iris

AG	D	S	R	I
	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$
++	0.0601/0.0085	0.4879/0.0216	0.8556/0.0454	5.4000/1.4966
M1	0.0564/0.0077	0.4783/0.0255	0.8397/0.0600	5.5000/2.2693
k-M	0.0609/0.0085	0.4889/0.0218	0.8559/0.0455	6.2500/2.4469
CCIA	0.069	0.5048	0.8737	3
SPSS	0.053	0.4838	0.8679	7
MS	0.053	0.4838	0.8679	9

Table 10

Fossil

AG	D	S	R	I
	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$
++	0.3433/0.0856	0.4664/0.0731	0.9619/0.0784	5.3000/3.3926
M1	0.2570/0.1070	0.3902/0.0987	0.8900/0.0955	4.1000/1.7578
k-M	0.2927/0.1153	0.4325/0.0903	0.9318/0.0880	4.7500/2.6433
CCIA	0.386	0.5022	1	3
SPSS	0.133	0.3839	0.8851	5
MS	0.386	0.5022	1	2

Table 11

Wine

AG	D	S	R	I
	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$
++	0.1444/0.0153	0.3030/0.0015	0.9342/0.0083	4.8500/1.7684
M1	0.1424/0.0112	0.3030/0.0011	0.9346/0.0062	5.6000/2.3958
k-M	0.1387/0.0036	0.3024/0.0005	0.9313/0.0030	4.7000/1.6462
CCIA	0.142	0.3025	0.9318	10
SPSS	0.135	0.3028	0.9349	6
MS	0.135	0.3028	0.9349	5

(3)

Plotting is responsible for displaying the set of data instances loaded in memory, as a graphic representation, in a two-dimensional Cartesian plane. Plotting can be done after the data file has been loaded, before or after the execution of any algorithm. The user has the option to choose two attributes (among those that describe the set of data instances) to be plotted. In situations where data instances have associated classes before execution or associated cluster labels after execution, instances from each class or label are identifiable in the plotting by having the same color.
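The preprocessing offered by the Import & Preprocess module, item (1) above, can be illustrated with a minimal sketch. The paper does not state which normalization I-k-Means applies, so min-max scaling to [0, 1] is an assumption here, and the function names `remove_duplicates` and `min_max_normalize` are hypothetical. Python is used for brevity, although I-k-Means itself is written in Java.

```python
def remove_duplicates(instances):
    """Drop repeated data instances, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in instances:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(list(row))
    return unique


def min_max_normalize(instances):
    """Rescale every attribute to [0, 1] (assumed scheme, not confirmed by the paper)."""
    columns = list(zip(*instances))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in instances:
        normalized.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant attribute maps to 0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return normalized


data = [[1.0, 10.0], [3.0, 30.0], [1.0, 10.0], [2.0, 20.0]]
clean = min_max_normalize(remove_duplicates(data))
print(clean)  # [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
```

Removing duplicates before normalizing keeps repeated instances from weighting the centroids twice; the order of the two operations does not affect the minimum and maximum of each attribute.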

Table 12

Seeds

AG	D	S	R	I
	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$
++	0.1029/0.0229	0.4247/0.0004	0.8667/0.0025	6.1000/1.8947
M1	0.1006/0.0228	0.4247/0.0004	0.8670/0.0025	7.3000/2.3043
k-M	0.0937/0.0210	0.4245/0.0004	0.8677/0.0023	6.4500/2.1324
CCIA	0.08	0.4243	0.8693	4
SPSS	0.08	0.4243	0.8693	3
MS	0.126	0.4252	0.8642	6

Table 13

Blood transfusion

AG	D	S	R	I
	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$
++	0.120/0.0000	0.4442/0.0000	0.5149/0.0000	6.8500/2.2198
M1	0.0302/0.0000	0.2195/0.0000	0.8214/0.0000	10/1.5644
k-M	0.0390/0.0000	0.2009/0.0000	0.8266/0.0000	5/1.5524
CCIA	0.012	0.4442	0.5149	0
SPSS	0.012	0.4442	0.5149	9
MS	0.012	0.4442	0.5149	10

6. Experiments, results and analysis

To evaluate the five algorithms, seven synthetic data sets and seven real world data sets were used. The characteristics of the fourteen data sets are presented in Table 1; in the table synthetic and real world data sets are separated by a thicker horizontal line.

Table 14

Diabetes

AG	D	S	R	I
	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$
++	0.0719/0.0000	0.2415/0.0003	0.5558/0.0000	10.7000/4.3023
M1	0.0776/0.0200	0.2544/0.0330	0.5534/0.0060	9.0500/4.9038
k-M	0.0719/0.0000	0.2416/0.0003	0.5559/0.0003	10.2000/5.6445
CCIA	0.0719	0.2414	0.5558	12
SPSS	0.0719	0.2422	0.5558	5
MS	0.0719	0.2414	0.5558	11

Table 15

E.coli

AG	D	S	R	I
	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$	$\mu/\sigma$
++	0.0436/0.0110	0.2692/0.0242	0.8083/0.0207	11.7000/3.9000
M1	0.0378/0.0055	0.2483/0.0320	0.7986/0.0193	15.0500/6.4921
k-M	0.0404/0.0066	0.2365/0.0365	0.7914/0.0187	12.4500/5.5720
CCIA	0.038	0.2634	0.8295	15
SPSS	0.0289	0.1746	0.7635	25
MS	0.1037	0.3106	0.8808	11

The plots of four synthetic data sets are in Fig. 8 and the plots of the remaining three are in Fig. 9. Out of the seven real world data sets, six (Iris, Wine, Seeds, Blood Transfusion, Diabetes and E.coli) were downloaded from the UCI Repository [13], and one, the Fossil data set, is available in [19]. The synthetic data sets used in the experiments were: MSD (the data employed by Maedeh and Suresh in [8] to evaluate their algorithm, referred to in this paper as Maedeh and Suresh or simply MS), LongSquare [14], Aggregation [4], 3MC [28], Ruspini [15], Mouse-Like [18] and Spherical_6_2 [38].

In what follows, the five initialization algorithms are referred to by the abbreviations given in parentheses after their names: k-Means++ (++), CCIA (CCIA), Method1 (M1), Maedeh and Suresh (MS) and SPSS (SPSS). The random initialization, which is the default initialization of the original k-Means, is referred to as k-M and is included in the results for comparison.

Tables 2 to 15 show the results of the k-Means algorithm, represented by the values of three validation indices, in the 14 data sets. For each data set the k-Means was initialized by each of the five initialization algorithms considered, as well as randomly (k-M), as in the original k-Means.

In the fourteen tables, the quality of a clustering induced by k-Means was measured by two internal validation indices, the Dunn index (D) [22] and the Silhouette index (S) [26, 33], as well as by an external validation index, the Rand index (R) [44]. In the tables, $\mu$ / $\sigma$ represents the average/standard deviation of each index value over 20 runs, as described in the methodology, taking into account the induced clusterings; column I gives the number of iterations performed by k-Means to converge.
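As an illustration of the external index, the Rand index can be computed by counting the pairs of instances on which the reference partition and the induced clustering agree (both place the pair together, or both place it apart). The sketch below follows that standard pair-counting definition; the function name `rand_index` and the toy labels are illustrative only and are not taken from I-k-Means.

```python
from itertools import combinations


def rand_index(true_labels, cluster_labels):
    """Fraction of instance pairs on which the two partitions agree."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label lists must have the same length")
    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_in_truth = true_labels[i] == true_labels[j]
        same_in_clustering = cluster_labels[i] == cluster_labels[j]
        if same_in_truth == same_in_clustering:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs


truth = [0, 0, 0, 1, 1, 1]
print(rand_index(truth, [1, 1, 1, 0, 0, 0]))  # 1.0: the label names do not matter
print(rand_index(truth, [0, 0, 1, 1, 1, 1]))  # 10 of the 15 pairs agree
```

Because the index only compares pair memberships, a value of 1 means the induced clustering reproduces the reference groups exactly, whatever cluster identifiers k-Means happened to assign.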

The methodology for comparing the performances of the five k-Means initialization algorithms was implemented by sequentially executing the steps described next, for each data set (generically identified as X) described in Table 1 and for each algorithm, generically identified by Y. As mentioned before, the original k-Means using the default random initialization was also used for comparative purposes. The methodology approached the 6 algorithms (random choice (k-M) included) split into two sets: {k-M, ++, M1} and {CCIA, SPSS, MS}.

The justification for that division was the fact that three algorithms, k-M, ++ and M1, involve some random choice while the other three, CCIA, SPSS and MS, do not. The random choices occur when the original k-M chooses the k centroids in a random manner, when the ++ chooses the first centroid, and when the M1 makes a random selection of instances within a cell, as well as at its end, when certain conditions are not satisfied.
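To make the source of randomness in ++ concrete, the sketch below follows the standard k-Means++ seeding as published in [12]: the first centroid is drawn uniformly at random and each subsequent one is sampled with probability proportional to its squared distance to the nearest centroid already chosen. It is an illustration under that assumption and may differ from the variant implemented in I-k-Means; the name `kmeans_pp_seeds` and the toy data are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, rng):
    """Standard k-Means++ seeding: D^2-weighted sampling of k initial centroids."""
    centroids = [list(rng.choice(points))]  # first centroid: uniform random choice
    while len(centroids) < k:
        # squared distance from each point to its nearest already-chosen centroid
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:  # fewer distinct points than k: fall back to uniform
            centroids.append(list(rng.choice(points)))
            continue
        chosen = rng.choices(points, weights=weights, k=1)[0]
        centroids.append(list(chosen))
    return centroids


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0], [10.1, 0.0]]
seeds = kmeans_pp_seeds(points, 3, random.Random(42))
print(seeds)
```

Passing an explicit `random.Random` instance makes a run reproducible, which is convenient when the same initialization has to be repeated 20 times with different seeds.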

For the k-M, ++ and M1 algorithms it was decided that steps (2.1), (2.2) and (2.3) (described next) should be performed 20 times; such number was chosen based on the experiments described in [12, 24, 25].

For the CCIA, SPSS and MS algorithms, which always select the same centroids for a given data set when executed more than once, only one execution was conducted for each of the three algorithms. The methodology adopted for the experiments has the following steps:

(1)

assign to $k$ the number of visually identifiable groups (or classes, in case of supervised data sets) in X;

(2)

for the algorithms {k-M, ++, M1}, perform steps (2.1), (2.2) and (2.3) 20 times (due to the random choices involved) and, for the algorithms {CCIA, SPSS, MS}, since none of them makes random choices, perform the three steps just once.

(2.1)

execute the initialization algorithm Y, using the set X, whose execution result is a set of $k$ initial centroids C;

(2.2)

execute k-Means without its original initialization step using, as input, in addition to set $X$ and $k$ , the set of centroids C created in step (2.1), obtaining the clustering AG;

(2.3)

calculate the values of the Dunn, Silhouette and Rand indices for the induced clustering AG obtained in (2.2), and store them;

(3)

for the k-M, ++ and M1 algorithms, calculate the mean and standard deviation of the validation values, as well as of the number of iterations performed by k-Means to converge, over the 20 runs performed in steps (2.1), (2.2) and (2.3). For the CCIA, SPSS and MS algorithms, the values found in a single execution of steps (2.1), (2.2) and (2.3) are used, since the three algorithms always produce the same result when executed more than once on the same data set. The analysis of the results will focus on the values of the validation indices as well as on the number of iterations performed by k-Means initialized by each algorithm.
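The steps above can be sketched as a small evaluation loop. This is an illustration of the protocol, not the I-k-Means code: only the Rand index is computed, the population standard deviation (`statistics.pstdev`) is an assumption, and so is the counting convention for I (passes that change at least one label), which was chosen so that an initial partition that is already stable yields I = 0. All function names are hypothetical.

```python
import random
import statistics
from itertools import combinations


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points]


def update(points, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it."""
    updated = []
    for c, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            updated.append([sum(col) / len(members) for col in zip(*members)])
        else:
            updated.append(list(old))  # an empty cluster keeps its centroid
    return updated


def kmeans_from(points, centroids, max_iter=100):
    """Step (2.2): k-Means started from the given centroids."""
    labels = assign(points, centroids)
    iterations = 0
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # partition is stable: converged
            break
        labels = new_labels
        iterations += 1  # assumed convention: count passes that change labels
    return labels, iterations


def rand_index(truth, labels):
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (labels[i] == labels[j]) for i, j in pairs)
    return agree / len(pairs)


def random_init(points, k, rng):
    """Step (2.1) for k-M: k distinct instances chosen at random."""
    return rng.sample(points, k)


def evaluate(points, truth, k, init, runs, seed=0):
    """Steps (2) and (3): repeat (2.1)-(2.3), then summarize R and I."""
    rng = random.Random(seed)
    scores, iters = [], []
    for _ in range(runs):
        labels, n_iter = kmeans_from(points, init(points, k, rng))
        scores.append(rand_index(truth, labels))
        iters.append(n_iter)
    return (statistics.mean(scores), statistics.pstdev(scores),
            statistics.mean(iters), statistics.pstdev(iters))


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]]
truth = [0, 0, 0, 1, 1, 1]

# Randomized initializer: 20 runs, reported as mean/standard deviation.
print(evaluate(points, truth, 2, random_init, runs=20))

# Deterministic initializer (hypothetical stand-in for CCIA, SPSS or MS): one run.
fixed_init = lambda pts, k, rng: [[0.0, 0.0], [9.0, 9.0]]
print(evaluate(points, truth, 2, fixed_init, runs=1))  # (1.0, 0.0, 0, 0.0)
```

With a single run the standard deviation is trivially zero, which is why the tables report one value per index for CCIA, SPSS and MS rather than a $\mu/\sigma$ pair.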

As shown in Table 2, in the MSD data the ++ algorithm had one of the best performances among the six algorithms (i.e., ++, M1, k-M (random), CCIA, SPSS and MS), taking into account the values of the D, S and R validation indices as well as the number of iterations of k-Means when initialized by ++.

As shown in Table 3, the values of the R index for LongSquare are close to 1, evidence that the induced clusterings are a very good approximation to those visually detected. LongSquare has 6 groups with approximately 150 instances per group; k-Means required, on average, around 12 iterations to converge when initialized randomly or via M1, and approximately 9 iterations when initialized by ++.

In the LongSquare data domain, the R index does not agree much with the D and S indices. While the R index points to the induction of good clusterings, although sometimes at the cost of a large number of k-Means iterations, the S index values indicate clusterings of average quality; the low values of D indicate clusterings that do not conform to compact, spherical and well separated clusters, which is partially the case.
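The behaviour of the S index can be illustrated with a minimal sketch of the silhouette width as defined by Rousseeuw [33]: for each instance, a is its mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the width is (b - a) / max(a, b). The Euclidean distance and the function name `silhouette` are assumptions made for this example; the exact formulation used by I-k-Means is the one given in Section 4.

```python
import math


def silhouette(points, labels):
    """Mean silhouette width over all instances (Rousseeuw's definition)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
print(silhouette(points, [0, 0, 1, 1]))  # close to 1: compact, well separated clusters
print(silhouette(points, [0, 1, 0, 1]))  # negative: instances sit nearer the other cluster
```

Values around 0.5, as reported for LongSquare, therefore fall between these two extremes, which is consistent with reading them as clusterings of average quality.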

Taking into account the data in the Aggregation data set, 7 groups can be visually detected. On the one hand, the mean values of the S index, in Table 4, suggest that the clusterings obtained by k-Means, initialized by ++, M1 and randomly, were just average clusterings. The low values for D are probably due to the fact that the visually identifiable groups of instances are not well separated.
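The link between poorly separated groups and low D values can be seen in a minimal sketch of the Dunn index in its original form [22]: the smallest distance between two instances of different clusters divided by the largest cluster diameter. Several variants of the index exist, so this form, the Euclidean distance and the name `dunn_index` are assumptions made for the example.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(list(clusters.values()), 2)
        for p in a for q in b
    )
    max_diameter = max(
        (math.dist(p, q)
         for members in clusters.values()
         for p, q in combinations(members, 2)),
        default=0.0,
    )
    if max_diameter == 0.0:
        raise ValueError("all clusters are single points; the index is undefined")
    return min_separation / max_diameter


separated = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
adjacent = [[0.0, 0.0], [0.0, 4.0], [1.0, 0.0], [1.0, 4.0]]
print(dunn_index(separated, [0, 0, 1, 1]))  # 10.0
print(dunn_index(adjacent, [0, 0, 1, 1]))   # 0.25
```

In the second case the clusters are elongated and lie next to each other, so the numerator is small and the denominator large; a single pair of close instances from different groups is enough to pull the index down.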

On the other hand, however, the mean values of the R index show the induction of clusterings similar to those visually detected. The configuration and arrangement of the data in Aggregation are not particularly suitable for a grid-based approach, such as the one adopted by M1. When the M1 algorithm is not able to find centroids through its grid-based approach, it resorts to random choice, which is the same procedure adopted by the original k-Means.

In the Aggregation data domain in particular, the M1 performance is similar to that of the original k-Means. Results related to CCIA, SPSS and MS follow the same trend as those of ++, M1 and k-M.

Considering the 3MC data set, the mean values of the validation index R, in Table 5, point to the induction of good clusterings by the 6 algorithms.

The values of the S index, however, suggest average clusterings, while those of the D index indicate poor clusterings. Meanwhile, the number of iterations performed by k-Means points to ++ and SPSS as the algorithms that provided better initializations. Note that CCIA provided k-Means with a set of initial centroids that did not require any iteration (i.e., $I=$ 0) for k-Means to converge.

Considering the numbers in the first three lines of Table 6, in the Ruspini data set k-Means initialized with the centroids found by ++ induced clusterings with the highest value of the R index and, at the same time, converged in the smallest number of iterations, in comparison with k-Means initialized by M1 and k-M.

The values of R associated with clusterings induced by k-Means initialized by M1 and randomly can also be considered good, since they are close to 1.0. However, the number of iterations performed by k-Means initialized by these two algorithms was, on average, noticeably higher than when using ++ (2.85 and 2.65 against 1.70 in Table 6), although still small. In relation to the indices D and S, associated with clusterings induced by k-Means using the three different initializations, the values are reasonably close although, again, the numbers suggest that clusterings induced by ++ are better.

The values of the indices in the last three lines of Table 6, which report the results related to CCIA, SPSS and MS, respectively, suggest that the same clustering has been induced by k-Means when initialized by any of the three algorithms. This, in turn, suggests that the same (or very similar) initial centroids have been produced by the three algorithms; this conclusion may be corroborated by the number of iterations k-Means needed to converge.

The results of the experiments with the Ruspini data are evidence that algorithms that do not rely on random choice produce sets of initial centroids that help k-Means converge quickly and, moreover, to the same clustering, as reflected by the identical S index values obtained with CCIA, SPSS and MS.

Table 7 gives the results of the experiments conducted with the initialization algorithms on the Mouse-Like data set. Taking into account the numbers in the first three lines of the table and considering the averages of the three validation indices, it can be conjectured that the initialization algorithms produced initial centroids that led k-Means to quite similar clusterings.

This similarity is also reflected, although less strongly, in the average number of iterations k-Means needed to reach convergence. A similar analysis applies to the values shown in the last three lines of the table, obtained when k-Means was initialized by CCIA, SPSS and MS. Although the data have three compact spherical groups of instances, the groups are not separated, which has contributed to the low values of the D index.

In the Spherical_6_2 data set, considering the first three lines of Table 8, the average values of the R index for clusterings induced by k-Means initialized by each of the three algorithms, ++, M1 and k-M, are close to each other; the sets of initial centroids can be considered good, given that k-Means induced clusterings that are similar to each other and close to those visually identified.

As the average number of iterations performed by k-Means up to convergence differed slightly, it may be conjectured that the initial centroids provided by the algorithms were slightly different, which ended up influencing the number of iterations required for k-Means to converge.

Focusing on the results presented in the last three lines of Table 8, it can be seen that CCIA, SPSS and MS present similar results, particularly if the value of the R index is considered. Notice that the initial centroids obtained by CCIA did not require any iteration of k-Means to reach convergence.

In relation to the Iris data set, the results presented in the first three lines of Table 9, related to validation indices as well as to the number of iterations performed by k-Means, initialized by ++, M1 and randomly, are not statistically very different from each other.

In the last three lines of Table 9, except for the number of iterations, all the other values follow the same trend as those presented in the first three lines; that is, they do not differ significantly.

It can be observed, however, that the CCIA algorithm provided k-Means with the best set of 3 initial centroids, considering that k-Means required only 3 iterations to achieve convergence, while with the initial centroids provided by SPSS and by MS, the k-Means performed 7 and 9 iterations, respectively, to achieve convergence. The SPSS and MS initialization methods did not contribute much, considering that the original k-Means with the random initialization needed, on average, 6.25 iterations to converge.

Analyzing the data results shown in Table 10 obtained using the Fossil data set, it can be noted that, with respect to the values of the R index, the initialization provided by ++, on average, slightly favored the quality of the obtained clusterings.

However, the average number of iterations required for k-Means to converge was slightly higher than the average required when initialized by M1 or randomly.

During the experiment the M1 was not successful when using its grid-based approach, which caused the algorithm, in all the 20 runs, to finish the search for initial centroids using random choice, meaning that in the Fossil domain, the M1 behaved exactly like the original k-Means.

With respect to the average values of the R index associated with the clusterings obtained with the initializations provided by ++, M1 and random, in the Wine data set, shown in Table 11, it can be said that all the induced clusterings were very close to the one visually detected, with a difference of 0.007, at most. However, neither of the two methods, ++ and M1, improved the k-Means performance with respect to the number of iterations, which was lowest when the random initialization of k-Means itself was used.

The validation values of the D, S and the R indices, shown in Table 12, in the Seeds data set, are close to each other for the five algorithms plus the k-M.

The recurrent problem of the M1 algorithm, that of not being able to select a set of initial centroids by means of its grid-based approach, occurred again in the Seeds data, and M1 ended up adopting the random selection, making its execution very similar to that of the original k-Means.

In the Blood Transfusion data, the results in the first three lines of Table 13 show that M1 and k-M have similar values as far as the D, S and R indices are concerned. They differ only in the number of iterations of k-Means, which was, on average, 10 and 5, respectively. So, in the Blood Transfusion domain, among the initialization algorithms that involve some random process, the random choice implemented by the original k-Means is the best choice.

Results shown in the last three lines of the table indicate that CCIA, SPSS and MS are tied, except for the number of iterations required by k-Means, which favors CCIA. However, the value of the R index points to clusterings whose agreement with the given classes of the data is only about 50%. The two classes in this data set are quite unbalanced.

Results from experiments using the Diabetes data set are shown in Table 14. The numbers in the first three lines of the table are practically the same, and the value of the R index indicates clusterings that only partially agree with the classes given in this supervised data set.

The same trend is observed with clusterings induced by the CCIA, SPSS and MS. In this specific domain any of the five algorithms would be as good as the original random choice used by k-Means, shown in the third line of the table. The two classes in the Diabetes data set are unbalanced.

Experiments using the E.coli data set, shown in Table 15, indicate that, although the results from the algorithms do not differ much from each other, on average those obtained with MS are the best, closely followed by those from ++.

7. Conclusions

This paper discusses and empirically evaluates the performance of five initialization algorithms as replacements for the original random initialization process of k-Means. Taking into account the data sets used, it can be said that the results were not conclusive enough to support a particular algorithm as the best one. However, based on the experiments, it can be said that the CCIA and the SPSS initialization algorithms could be a good choice.

Based on the work described in this paper and taking into account all the experience acquired during the development of a software system that implements the five algorithms, as described in Section 5, a few possibilities for continuing in this line of research have been considered:

(1)

to pre-process the data to remove irrelevant attributes;

(2)

to use a combination of the centroids found by each of the five algorithms and by k-M (e.g., the arithmetic mean of the values found by the algorithms), and initialize k-Means with these values;

(3)

to implement a variation of the process described in (2), weighting the individual performance of the five algorithms/data domain and

(4)

to investigate several other initialization algorithms proposed in the literature, particularly, the algorithms proposed in [11, 35].

Acknowledgments

The authors thank UNIFACCAMP and CNPq for their support and also express their gratitude to Prof. Shehroz S. Khan for his support in relation to the algorithm CCIA. His kindness was very much appreciated considering it was an important contribution to the experiments conducted in the work described in this paper. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) - Finance Code 001.

References

1. Fahad, Alshatri, Tari, Alamri, Khalil, Zomaya A.Y., Foufou and Bouras, A survey of clustering algorithms for big data: taxonomy and empirical analysis, IEEE Transactions on Emerging Topics in Computing 2(3) (2014), 267–279.

2. Oliveira A.F. and Nicoletti M.C., (2018) Favouring the k-means algorithm with initialization methods, In: Abraham, Cherukuri, Melin, Gandhi (eds), Intelligent Systems Design and Applications. ISDA 2018. Advances in Intelligent Systems and Computing, v. 940, Springer, Cham.

3. Oliveira A.F., Favouring the performance of k-Means via centroid initialization methods, M.Sc. dissertation, UNIFACCAMP, C.L. Paulista, Brazil, 2018 (in Portuguese).

4. Gionis, Mannila and Tsaparas, Clustering aggregation, ACM Transactions on Knowledge Discovery from Data (ACM TKDD), v1. Article 4 (2007), p. 30.

5. Jain A.K., Data clustering: 50 years beyond K-means, Pattern Recognition Letters 31(8) (2010), 651–666.

6. Jain A.K., Murty M.N. and Flynn P.J., Data clustering: a review, ACM Computing Surveys 31(3) (1991), 264–323.

7. Jain A.K. and Law M.H.C., Data clustering: a user’s dilemma, Lecture Notes in Computer Science 3776 (2005), 1–10.

8. Maedeh and Suresh, Design of efficient k-Means clustering algorithm with improved initial centroids, International Journal of Engineering and Technology 5(1) (2013), 33–38.

9. Everitt B.S., Landau, Leese and Stahl, (2011) Cluster Analysis, UK: John Wiley & Sons Ltd.

10. Aggarwal C.C. and Reddy C.K., Data clustering algorithms and applications, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, CRC Press, 2013.

11. Pizzuti, Talia and Vonella, A divisive initialisation method for clustering algorithms, Proc. of The 3rd. European Conference on Principles and Practice of Knowledge Discovery in Databases, 1999, pp. 484–491.

12. Arthur and Vassilvitskii, K-Means++: the advantages of careful seeding, Proc. of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007, pp. 1027–1035.

13. Dua and Graff, UCI Machine Learning Repository [http://archive.ics.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2019.

14. Hand D.J., Daly, Lunn A.D., McConway K.J. and Ostrowski, Handbook of Small Data Sets, Chapman and Hall/CRC, 1st. edition, 1993.

15. Ruspini E.H., Numerical methods for fuzzy clustering, Information Sciences 2(3) (1970), 319–350.

16. Kovács, Legány and Babos, Cluster validity measurement techniques, Proc. of the Fifth WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases, 2006, pp. 388–393.

17. Gan and Wu, Data Clustering – Theory, Algorithms and Applications, Philadelphia, USA: SIAM, 2007.

18. GMUM.r, Group of Machine Learning Research, Faculty of Mathematics and Computer Science of Jagiellonian University, Kraków, Poland [online] http://r.gmum.net/samples/cec.basic.html.

19. Chernoff, The use of faces to represent points in n-dimensional space graphically, Technical Report no. 71, Department of Statistics, Stanford University, 1971.

20. Brownlee, Master Machine Learning Algorithms, Ebook, 2018, https://machinelearningmastery.com/master-machine-learning-algorithms/.

21. MacQueen J.B., Some methods for classification and analysis of multivariate observations, Proc. of 5th. Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 1967, pp. 281–297.

22. Dunn, Well separated clusters and optimal fuzzy partitions, Journal of Cybernetics 4 (1974), 95–104.

23. Han, Kamber and Pei, Data mining – concepts and techniques, 3rd Ed., Amsterdam: Morgan Kaufmann Publishers, 2012.

24. Pavan K.K., Rao A.A., Rao A.V.D. and Sridhar G.R., Robust seed selection algorithm for k-means type algorithms, International Journal of Computer Science & Information Technology (IJCSIT) 3(5) (2011), 147–163.

25. Pavan K.K., Rao A.A., Rao A.V.D. and Sridhar G.R., Single pass seed selection algorithm for k-Means, Journal of Computer Science 6(11) (2010), 60–66.

26. Kaufman and Rousseeuw P.J., Finding Groups in Data, USA: John Wiley & Sons, Inc., 2005.

27. Al-Daoud and Roberts S.A., New methods for the initialisation of clusters, Pattern Recognition Letters 17 (1996), 451–455.

28. M.C., Chou C.H. and Hsieh C.C., Fuzzy C-Means algorithm with a point symmetry distance, International Journal of Fuzzy Systems 7(4) (2005), 175–181.

29. Celebi M.E., Kingravi H.A. and Vela P.A., A comparative study of efficient initialization methods for the k-means clustering algorithm, Expert Systems with Applications 40 (2013), 200–210.

30. Halkidi, Batistakis and Vazirgiannis, On clustering validation techniques, Journal of Intelligent Information Systems 17(2–3) (2001), 107–145.

31. Berthold M.R., Borgelt, Höppner and Klawonn, Guide to Intelligent Data Analysis, London: Springer-Verlag, 2010.

32. Mitra, Murthy C.A. and Pal S.K., Density-based multiscale data condensation, IEEE Transactions on Pattern Analysis and Machine Intelligence 24(6) (2002), 734–747.

33. Rousseeuw P.J., Silhouettes: a graphical-aid to the interpretation and validation of cluster analysis, Computational and Applied Mathematics 20 (1987), 53–65.

34. Tan P.-N., Steinback and Kumar, Introduction to Data Mining, Pearson Education, Inc., 2006.

35. Erisoglu, Calis and Sakallioglu, A new algorithm for initial cluster centers in k-Means algorithms, Pattern Recognition Letters 3 (2011), 1701–1705.

36. and Wunch D.C., II, Survey of clustering algorithms, IEEE Transactions on Neural Networks 16 (2005), 645–678.

37. Theodoridis and Koutroumbas, Pattern Recognition, 4th ed., USA: Elsevier, 2009.

38. Bandyopadhyay and Maulik, Genetic clustering for automatic evolution of clusters and application to image classification, Pattern Recognition 35 (2002), 1197–1208.

39. Burks, Harrell and Wang, On initial effects of the k-Means clustering, Proc. of The 2015 World Congress in Computer Science, Computer Engineering, & Applied Computing, USA, 2015, pp. 200–205.

40. Günter and Bunke, Validation indices for graph clustering, Pattern Recognition Letters 24(8) (2003), 1107–1113.

41. Silhouette (clustering) https://en.wikipedia.org/wiki/Silhouette_(clustering).

42. Khan S.S. and Ahmad, Cluster center initialization algorithm for k-Means clustering, Pattern Recognition Letters 25 (2004), 1293–1302.

43. Mitchell T.M., Machine Learning, USA: McGraw-Hill, 1997.

44. Rand W.M., Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association 66(336) (1971), 846–850.

45. Zhu, Semi-supervised learning literature survey, Technical Report 1530, University of Wisconsin-Madison, 2006.

46. Liu, Xiong, Gao and Wu, Understanding of internal clustering validation measures, Proc. of the 10th International IEEE Conference on Data Mining (ICMD), 2010, pp. 911–916.
