Abstract
A recurring problem in a wide variety of research areas such as pattern recognition, machine learning, data mining and statistics, among others, is characterized as a clustering problem. Such a problem can be described in a simplistic way as: given a set of data (observations, objects, points, etc.), group similar data into clusters (groups). A clustering of a given data set is then characterized as a set of clusters, in which elements belonging to a cluster are similar to each other and elements belonging to distinct clusters are not similar. Clustering algorithms are non-supervised algorithms and, among the many available in the literature, the k-Means, that uses a random initizalization process, can be considered one of the most popular and successful. The performance of the k-Means, however, is highly dependent on a ‘good’ initialization of the
Introduction and motivations
Considering the vast number of machine learning (ML) algorithms available, several taxonomies aiming at organizing such algorithms, by grouping them according to some criteria, can be found in the literature (see, for instance [20, 43]). Taxonomies based on the level of supervision required by the algorithm usually group them into three categories: (1) supervised learning, (2) unsupervised learning and (3) semi-supervised learning.
If the learning algorithm needs the value of the attribute class associated with each data instance, to induce the expression of the concept, such algorithm is a supervised algorithm, in the sense that it ’needs’ external information (in this case, the class value attached to the description of each data instance, provided by something or someone), to generalize the (training) set of data it receives as input. Semi-supervised algorithms use for training both, data instances that have an associated class and data instances that do not have the class information in their descriptions. Usually semi-supervised algorithms are a convenient choice in situations where the number of unlabeled data instances is much larger than the number of labeled data. An extensive survey on semi-supervised algorithms can be seen in [45].
In many real-world situations, however, the class associated with each instance of the available training set is unknown. ML algorithms which deal with such type of data are known as unsupervised algorithms; they usually learn by identifying subsets of data that share certain similarities. The so called clustering algorithms are the ones that most accurately characterize unsupervised algorithms. Given a data set X, the task of grouping the data of X into groups, such that data in the same group are more similar among themselves than to data belonging to any other group is referred to as a clustering process and is, usually, conducted by a clustering algorithm.
The process of grouping data instances based on measures of similarity (or dissimilarity) between them can be trivially performed by humans, but designing an algorithm for performing the task is not trivial. An algorithm for this purpose should identify groups of instances based only on their descriptions. As pointed out in [10], the design of efficient clustering techniques is considered a great challenge, mainly due to the fact that they do not have external supervision; this somehow implies they should work under the constraint of a total lack of prior knowledge about the internal structure of the data (such as spatial distribution, volume, density, geometric shapes of groups, etc.). In this scenario, automatic learning becomes an exploratory activity, aiming at identifying statistically separable data groups, detecting the most evident groups and their relation to what one wishes to discriminate, in an attempt to highlight the underlying structure of the data set, only having as information the data instance descriptions, each of them represented by a vector of attribute values.
The solution to a clustering problem can be addressed in several different ways. In the literature it can be found a large number of clustering algorithm proposals [5], many of them supported by different mathematical formalisms as well as implementing a diverse range of different strategies. Due to the broad range of different characteristics that can be associated with clustering, several taxonomies have been proposed in the literature, in an attempt to organize clustering algorithms into categories, according to several criteria, such as those found in [1, 6, 7, 9, 36, 37].
Two categories of clustering algorithms found in most taxonomies are known as partitional and hierarchical; they mainly differ from each other in the way they approach the clustering process considering the fact whether the clusters are nested or not. While a partitional clustering algorithm implements an iterative procedure that divides the data set into disjoint groups (clusters) until a final clustering is obtained, a hierarchical algorithm produces a sequence of clusterings, where their corresponding clusters are nested clusters, usually organized as a tree.
Among the several partitional algorithms available in the literature, the k-Means [21] is considered the most successfully and has been used in a large number of applications. It is known that the k-Means suffers from a problem identified as initialization problem, related to the initial set of group centers (or centroids), from which the iterative process conducted by the algorithm begins. It can be found in the literature several algorithms that attempt to solve the problem of centroid initialization.
This article is an extension of the work described in [2] which reports an empirical investigation of five algorithms available in the literature that claim to provide better initializations to the k-Means algorithm, than the random initialization which is an intrinsic part of the original algorithm. This article expands the previous work by giving a more detailed contextualization of the area, by providing a more refined description of the five initialization algorithms, by introducing two new sections, one with focus on the validation indices used in the experiments, and the other on the computational system developed for the experiments and, also, by extending the number of experiments conducted considering six more data domains.
The remainder of this paper is organized as follows. Section 2 briefly revisits the k-Means algorithm. Section 3 presents and discusses the main aspects of the five initialization algorithms used to initialize the k-Means, whose performance is the focus of this work; the algorithms are (1) Method1 [27], (2) k-Means++ [12], (3) SPSS [24, 25], (4) Maedeh and Suresh [8] and finally, (5) CCIA [42]. Section 4 introduces the main characteristics of the validation indices used in the experiments namely, Dunn [22], Silhouette [26, 33] and Rand [44]. Section 5 describes several relevant details of a user-friendly computational system, named I-k-Means (Initializing k-Means), that was developed to support the evaluation of the impact of the inicialization procedure on the k-Means performance. Section 6 describes the data domains used, the adopted methodology for the experiments and presents the results of the experiments, discussing the contribution of each inicialization proposal on the k-Means performance, as well as comments on the values of the employed validation indices. Section 7 resumes the work done, presents a few conclusions based on the results, and ends the article by considering a few directions in which the work may proceed.
Simplified pseudocode of the k-Means, evidencing both phases, initialization and iterative, based on the description found in [23].
As mentioned before, the k-Means algorithm is a partitional algorithm and, as such, it induces a partition (clustering) on the given set of instances, into
Using a generic notation and considering a given data set of N instances, DIS
According to the mathematical definition of a
The union of all
Pseudocode of the Method1, as described in [27].
Since 1967, the year in which it was published, the k-Means algorithm [21] remains the most widely known and widespread clustering algorithm. Since its initial proposal it has been used in a large number of computational systems dealing with the most diverse knowledge domains. Much of the success of k-Means is due to its simplicity, easy understanding and easy implementation, as well as its fast execution. However, as pointed out in [29, 36], the algorithm has some disadvantages such as it can only detect compact hyperspherical clusters that are well separated, it is sensitive to noise and to outlier instances and there is no guarantee of its convergence to a global optimum. Possible solutions for some of these problems are also presented in the previous references. Figure 1 describes the pseudocode of the k-Means algorithm based on the description found in [23].
Many works reported in the literature mention that both, the convergence of the iterative process and the performance of the clustering induced by the k-Means depend on the initial set of centroids [39]. Both, the number of groups (
As shows Fig. 1, in the initialization phase carried out by the original k-Means,
The five algorithms considered for replacing the randomly conducted initialization phase of the k-Means, that chooses the initial set of centroids, are briefly considered next in the following sequence: the Method1 [27], the k-Means++ [12], the Single Pass Seed Selection (SPSS) [24, 25], the Maedeh and Suresh algorithm [8] (which has been named in this paper by the names of its creators) and the Cluster Center Initialization Algorithm (CCIA) [42].
In [27] two k-Means initialization algorithms are proposed and the Method1, the first to be proposed, is one of the five algorithm considered for the experiments. The Method1 can be characterized as grid-based algorithm, since it considers the space defined by the data divided into a certain number of cells, all of them with the same dimensions. The process of selecting the initial set of centroids is done all at once and, according to the authors, there is no need for further attempts to define them. Also, as the authors point out, the algorithm does not require a lower limit in relation to the number of data instances.
The Method1 is based on the idea of selecting the initial
The k-Means variant known as k-Means++ [12] is the original k-Means itself, with a change in its initialization phase. The k-Means++ initialization process still randomly chooses the initial
In their paper the authors conducted empirical preliminary comparative studies based on four datasets, two of them having synthetic data and the other two were real-world datasets. Due to the randomly aspect of the algorithms (k-Means and k-Means++), 20 trials for each case was ran. Results show that the k-Means++ performs better than the original k-Means taking into account both, accuracy and speed, generally, by a substantial margin.
The authors in their paper also present a formalism which proves that, by augmenting k-Means with a simple randomized seeding technique, an algorithm that is O(log k) competitive with the optimal clustering is obtained. Figure 3 presents the pseudocode of the K-Means++ which is based on the one described in [12] and partially adapted to the notation adopted in this paper, where X is the dataset of data instances to be clustered. The algorithm uses a function D: X
Pseudocode of k-Means++, a variant of k-Means with a proper initialization phase, based on its description presented in [12].
Simplified pseudocode of the SPSS, based on the description found in [24].
Authors in [24, 25] point out that, considering the fact that the k-Means++ [12] starts the initialization process by randomly choosing the first centroid, this can affect the number of iterations at each execution, eventually giving rise to different results, for the same set of data instances to be clustered.
The authors also state that for k-Means++ to reach a good result, it has to be run a certain number of times. Their proposed SPSS algorithm can be considered a slightly modified k-Means++ that selects, as the first centroid, the highest density data instance (note that the first centroid is a deterministic choice).
Initial pseudocode of the Maedeh and Suresh algorithm [8], with calls for two procedures (1) Maedeh_Suresh_phase1(
Simplified pseudocode of the initialization phase by the Maedeh and Suresh algorithm, based on the description found in [8].
According to the authors the initial set of centroids selected by the SPSS has several advantages, since it promotes both, the quality of the clusters induced by k-Means and the number of iterations performed by k-Means, as well as the number of times that distance calculations have to be performed. Figure 4 presents the pseudocode of the SPSS algorithm, based on the descriptions found in [24, 25].
Still comparing both, k-Means++ and SPSS, the authors of the latter in [24] comment that “for selecting y, the number of passes executed by the k-Means++, in the worst case, is
The Maedeh and Suresh algorithm [8] can be considered a modified k-Means. The algorithm has the same two phases as the original k-Means, where the purposes of both phases have been maintained i.e., they refer to the initialization and iteration processes, respectively.
Similarly to the k-Means, in the initialization phase of the Maedeh and Suresh algorithm
In [8] the authors state that their algorithm has greater accuracy when compared to the accuracy of the original k-Means and, when using the same set of data instances, has the advantage of always producing the same clustering.
Pseudocode of the first part of the CCIA, for identifying the initial centroids to be passed on to the k-Means (as in [42]).
Figure 5 shows the initial pseudocode of the algorithm (procedure Maedeh_Suresh), where each phase is implemented by an auxiliary algorithm, identified in this paper as Maedeh_Suresh_phase1 and Maedeh_ Suresh_phase2, in charge, respectively, of the initialization and iteration processes. As previously mentioned, only the Maedeh-Suresh_phase1 procedure is of interest in the work described in this paper and its pseudocode is presented in Fig. 6.
In the initialization phase implemented by procedure Maedeh_Suresh_phase1, in Fig. 6, the distance from each data instance to the Cartesian origin is determined (step (1)); instances are then sorted in ascending order, based on their corresponding distance values to the origin (step (2)) and the first and the last instances are chosen as first and second cluster centroids, respectively (step (3)). All the remaining data instances are then assigned to their closest group centroid and the process enters in a iterative mode, until all
Considering a dataset with N data instances, the procedure Maedeh_Suresh_phase1 uses two N-dimensi- onal vectors: (a) ClusterId and (b) NearestDist. Each position i (1
Next the algorithm identifies in the NearestDist vector, the position that has the highest value and chooses its corresponding data instance as the next centroid. The process iterates until
According to Khan and Ahmad [42], the Cluster Center Initialization Algorithm (CCIA) was proposed with the intent of obtaining a good startup of centroids, to be used by the k-Means. The algorithm was based on two comments made by the authors, associated with clustering processes under the k-Means:
some data instances are very similar to each other and, due to that, they end up belonging to the same cluster, regardless of the way the initial choice of centroids is made; attributes, individually, may also provide some information regarding the initial centroids. The CCIA exclusively aims at initializing centroids for the k-Means and is suitable only for numerical data.
The detailed description of the CCIA algorithm given in [42] is approached divided into two parts, where the second part is conditionally executed, depending on the results of the first part. The first part generates
The CCIA is presented in [42] with a detailed and long description which, at its ending, may eventually require the use of an algorithm for merging clusters (in [42] the DBMSDC (Density-based multiscale data condensation) [32] was used), when the number of centroids obtained,
An important issue when using clustering algorithms is to evaluate the resulting clusterings. Several validation indices have been proposed in the literature for measuring the quality of the clusterings induced by unsupervised clustering algorithms.
As discussed in [16, 17, 30, 31, 34, 37], cluster validity can be implemented via different approaches, namely (1) internal criteria, based only on statistics measures associated with the patterns themselves, (2) external criteria, based on a pre-defined structure associated with the set of patterns and (3) relative criteria, which is based on comparisons between the induced clustering with others, induced by the same algorithm on the same data, but employing e.g., different parameter values or, even, other algorithms.
As suggested in various works found in the literature, a strategy to induce a good clustering for a given set of instances is to run the clustering algorithm a certain number of times, specifying a different number of clusters at each time and, then, selecting the clustering that optimizes the validation index, as the final result [40, 46].
In the experiments described in Section 6 the results obtained with the clustering algorithms were evaluated using two internal validation indices namely, the Dunn’s index (D) [22] and the Silhouette index (S) [26, 33]. Also, to quantify the number of data instances incorrectly assigned (taking into account the visually identifiable clusters in non-supervised sets of data instances), an external validation index, the Rand index (R) [44], was also used.
The main goal of most validation indices is to identify clusterings whose clusters are compact and well-separated. A drawback of using validation indices is the computational cost involved, when the number of clusters in the clusterings increases and the dimensionality of the data instances also increases. Consider the following notation for the descriptions that follow:
Let the distance between the two data instances,
The Dunn’s index
The Dunn’s index can be approached as the ratio of the smallest distance between two data instances from different clusters (implicitly involving the separation between clusters) to the largest distance between two data instances from the same cluster (implicitly involving cluster’s density). Compact and well separated clusters have a large value for the smallest distance between two data instances from different clusters and a small value for the largest distance between two data instances from the same cluster. The Dunn’s index (D) is defined by Eq. (4.1).
The Dunn’s index can also be defined by Eq. (2) where
The Dunn’s index takes into consideration the density associated to each cluster as well as the distances that separates clusters. So D is a value such that D
Consider a clustering
For each data instance
The value
Let the average dissimilarity of data instance
Consider
which can be written as
and from the above definition it can be inferred that
The average of the silhouette value
In [26, 33] the creators of the Silhouette index give a subjective interpretation to their index when say that values between 0.71–1.00 are interpreted as a clustering having a strong structure, between 0.51–0.70, as having a reasonable structure, between 0.26–0.50, as having a weak structure which could be artificial and
In data clustering the value of the Rand index [44] can be seen as a measure of the similarity between two data clusterings. Considering that the experiments will be dealing with clusterings, in order to use the Rand index in the experiments described in Section 6, one of the clusterings is induced by the clustering algorithm and the other is provided externally by the user, taking into account the inherent separation that visually can be perceived, and by assigning each pattern to a perceptual cluster. The assignment was conducted only for the unsupervised datasets; for the supervised datasets employed, the information about the class was the criteria for creating the supervised clustering.
In a general setup and considering that one of the clustering of the set of data instances in
Once the values
Intuitively (
During the research work described in [3], which was partially presented in [2], a computational system named I-k-Means (Initializing k-Means) was designed and implemented aiming at providing a suitable computational environment for running initialization methods intended to improve the performance of k-Means.
The I-kMeans system was developed in JAVA (version 1.7) programming language using the Eclipse Luna version development platform, under the Linux operating system Ubuntu (version 17.04).
The I-k-Means can be considered a multiplatform system, taking into account that its implementation in Java allows the system to run on any platform (operating system) without needing changes in the source code. In order to run the I-k-Means system it is necessary the Java virtual machine JRE (Java Runtime Environment) be installed. The architecture of I-k-Means is divided into three main modules namely:
Plotting of four synthetic data sets used in the experiments namely MSD, LongSquare, Aggregation and 3MC.
Plotting of the remaining three synthetic data sets out of the seven synthetic data used in the experiments namely, Ruspini, Mouse-Like and Spherical_6_2.
Data sets characteristics where ID: data set identification, #NI: no. of data instances, #NA: no. of attributes, #NG: no. of groups and G_Id
MSD
LongSquare
Import & Preprocess which, among its functionalities, allows loading the set of data instances to be clustered, to normalize attribute values and to remove duplicate data instances. The module expects data instances been provided via an ARFF (Attribute-Relation File Format) file; Aggregation
3MC
Trace & Validation makes available the implementation of the five initialization algorithms that are the focus of the research work, namely Method1, k-Means++, Maedeh and Suresh, SPSS and CCIA (see Section 3), as well as the implementation of the initialization process via random choice (k-M). The module also makes available the implementation of three validation indices namely the Dunn, Silhouette and the Rand indices (see Section 4).
Ruspini
Mouse-like
Spherical_6_2
Iris
Fossil
Wine
Plotting is responsible for displaying the set of data instances loaded in memory, as a graphic representation, in a two-dimensional Cartesian plane. Plotting can be done after the data file has been loaded, before or after the execution of any algorithm. The user has the option to choose two attributes (among those that describe the set of data instances) to be plotted. In situations where data instances have associated classes before execution or associated cluster labels after execution, instances from each class or label are identifiable in the plotting by having the same color.
Seeds
Blood transfusion
To evaluate the five algorithms, seven synthetic data sets and seven real world data sets were used. The characteristics of the fourteen data sets are presented in Table 1; in the table synthetic and real world data sets are separated by a thicker horizontal line.
Diabetes
Diabetes
E.coli
The plottings of four synthetic data sets are in Fig. 8 and the plottings of the remaining three are in Fig. 9. Out of the seven real world data sets, six were downloaded from the UCI Repository [13] namely, the Iris, Wine, Seeds, Blood Transfusion, Diabetes, E.coli and one, the Fossil data set, was available in [19]. The synthetic data sets used in the experiments were: MSD (refers to the data employed by the authors Maedeh and Suresh in [8], to evaluate their algorithm named, in this paper as Maedeh and Suresh or simply MS), LongSquare [14], Aggregation [4], 3MC [28], Ruspini [15], Mouse-Like [18] and Spherical_6_2 [38].
In what follows the five initialization algorithms are referred to by their abbreviations, that follow their names, stated between parenthesis: k-Means++ (++), CCIA (CCIA), Method1 (M1), Maedeh and Suresh (MS) and SPSS (SPSS). The random initialization, which is the default initialization of the original k-Means is referred to as k-M and is added to the results for comparative purposes.
Tables 2 to 15 show the results of the k-Means algorithm, represented by the values of three validation indices, in the 14 data sets. For each data set the k-Means was initialized by each of the five initialization algorithms considered, as well as randomly (k-M), as in the original k-Means.
In the fourteen tables the quality of an induced clustering by the k-Means was measured by two internal validation indices, the Dunn index (D) [22] and the Silhouette index (S) [26, 33], as well by the external validation index, the Rand index (R) [44]. In the tables
The methodology for comparing the performances of the five k-Means initialization algorithms was implemented by sequentially executing the steps described next, for each data set (generically identified as X) described in Table 1 and for each algorithm, generically identified by Y. As mentioned before, the original k-Means using the default random initialization was also used for comparative purposes. The methodology approached the 6 algorithms (random choice (k-M) included) split into two sets: {k-M, ++, M1} and {CCIA, SPSS, MS}.
The justification for that division was the fact that three algorithms, k-M, ++ and M1, involve some random choice while the other three, CCIA, SPSS and MS, do not. The random choices occur when the original k-M chooses the k centroids in a random manner, when the ++ chooses the first centroid, and when the M1 makes a random selection of instances within a cell, as well as at its end, when certain conditions are not satisfied.
For the k-M, ++ and M1 algorithms it was decided that steps (2.1), (2.2) and (2.3) (described next) should be performed 20 times; such number was chosen based on the experiments described in [12, 24, 25].
For the CCIA, SPSS and MS algorithms, which always select the same centroids for a given data set when executed more than once, only one execution was conducted for each of the three algorithms. The methodology adopted for the experiments has the following steps:
assign to for algorithms {k-M, ++, M1} perform steps (2.1), (2.2) and (2.3) 20 times (due to the random choices involved) and, for the algorithms {CCIA, SPSS, MS}, since none of them use of random choices, perform the three steps just once.
execute the initialization algorithm Y, using the set X, whose execution result is a set of execute k-Means without its original initialization step using, as input, in addition to set calculate the values of indices Silhouette and Rand, in the induced clustering AG, obtained in (2.2), and store them; for the k-M, ++ and M1 algorithms, calculate the mean and standard deviation of the validation values as well as of the number of iterations performed by the k-M to converge, in the 20 runs performed in steps (2.1), (2.2) and (2.3). For the CCIA, SPSS and MS algorithms the values found in a single execution of steps (2.1), (2.2) and (2.3) are used, considering the three algorithms always have the same result when executed more than once having the same data set as input. The analysis of the results will focus on the values of both indices as well as on the number of iterations performed by k-Means initialized by each algorithm.
As shown in Table 2, in the MSD data the ++ algorithm had one of the best performances among the six algorithms (i.e., ++, M1, k-M (random), CCIA, SPSS and MS), taking into account the values of D, S and R validation indices as well the number of iterations of the k-Means, when initialized by ++.
As shown in Table 3, results related to the R index for the LongSquare have values close to 1, an evidence that the induced clusterings are a very good approximation to those visually detected. Since the LongSquare has 6 groups, with approximately 150 instances/group, the average number of iterations required for the k-Means to converge, when initialized randomly or via M1, was around 12 iterations, whereas when initiated by ++, required on average approximately 9 iterations.
In the LongSquare data domain the validation indices R and D or S do not agree much. While the R index points to the induction of good clusterings, in spite of, sometimes, with a large number of iterations by the k-Means, the S index values indicate clusterings with average quality; the low values for D indicate clusterings that do not conform to compact and spherical well separated clusters, which is partially the case.
Taking into account the data in the Aggregation data set, 7 groups can be visually detected. On the one hand, the mean values of the S index, in Table 4, suggest that the clusterings obtained by k-Means, initialized by ++, M1 and randomly, were just average clusterings. The low values for D are probably due to the fact that the visually identifiable groups of instances are not well separated.
On the other hand, however, the mean values of the R index show the induction of clusterings similar to those visually detected. The configuration and arrangement of data in the Aggregation is not particularly suitable for a grid-based approach, such as the one adopted by M1. When the M1 algorithm is not able to find centroids through its grid-based approach, it adopts a random choice, which is the same procedure adopted by the original K-Means.
Particularly, in the Aggregation data domain, the M1 performance is similar to that of the original k-Means. Results related to the CCIA, SPSS and MS follow the same tendency of those of the ++, M1 and k-M.
Considering the 3MC data set, the mean values of the validation index R, in Table 5 point out to the induction of good clusterings, by the 6 algorithms.
The values of the S index, however, suggest average clusterings while those of the D index indicate not good clusterings. Meanwhile, the number of iterations performed by the k-Means point out to algorithm ++ and SPSS as those that provided better initializations. Note that the CCIA provided an initialization set of centroids to k-Means that did not require any iteration (i.e.,
Considering the numbers given in the three first lines of Table 6, in the Ruspini data set the k-Means, initialized with centroids found by ++, induced clusterings with the highest value of the R index and, at the same time, reached convergence with the smallest number of iterations, in comparison with the k-Means performance, when initialized with the M1 and k-M.
The values of R, associated with clusterings induced by the k-Means initialized by M1 and random can also be considered good, since their values are close to 1.0. However, the number of iterations performed by k-Means initialized by the two algorithms were, on average, twice the number of iterations when using the ++, although still small. In relation to the indices D and S, associated with clusterings induced by the k-Means using three different initializations, the values are reasonably close although, again, the numbers suggest that clusterings induced by the ++ are better.
Taking into account the values of indices in the last three lines of Table 6 reporting results related to CCIA, SPSS and MS, respectively, they suggest that the same clustering has been induced by the k-Means initialized by any of the three algorithms. This, somehow, implies that the same (or very similar) initial centroids have been induced by the three algorithms; this conclusion may be corroborate by the number of iterations of the k-Means, for converging.
The results of the experiments in the Ruspini data are evidence that algorithms that do not use an approach based on random choice, produce a set of initial centroids that helps k-Means to converge quickly and, also, for the same clustering which, particularly when using the CCIA, SPSS and MS have their associated S index with the same value.
Table 7 gives the results from the experiments conducted with the initialization algorithms using the data instances from the Mouse-Like data set. Taking into account the numbers in the three first line of the table and considering the average of the three validation indices, it can be conjectured that the initialization algorithms produced initial centroids that made the k-Means produce clusterings quite similar.
Also, such similarity reflects, but not so strongly, on the average of the number of iterations of k-Means to reach convergence. Similar analysis applies to values obtained, shown in the three last lines of the table, when the k-Means was initialized by CCIA, SPSS and MS. In spite the data having three compact spherical groups of instances the groups are not separated, which has contributed for the low values of the D index.
In the Spherical_6_2 data set and considering the three first lines of Table 8, it can be seen that the average values of the R index in relation to clusterings induced by k_Means, initialized by each of the three algorithms, ++, M1 and k-M, are close to each other; they can be considered good sets of centroids, considering that k-Means induced similar clusterings and close to those visually identified.
As the number of iterations performed by k-Means up to convergence was, on average, slightly different, it may be conjectured that the initial centroids provided by the algorithms were slightly different, which ended up influencing, the number of iterations required for k-Means to converge.
Having as focus the results presented in the three last lines of Table 8, it can be seen that the CCIA, SPSS and MS present similar results, particularly if the value of the R index is considered. Notice that the initial centroids obtained by CCIA have not required any iteration of k-Means, to reach convergence.
In relation to the Iris data set, the results presented in the first three lines of Table 9, related to validation indices as well as to the number of iterations performed by k-Means, initialized by ++, M1 and randomly, are not statistically very different from each other.
In the last three lines of Table 9, except for the number of iterations, all the other values follow the same trend as those presented in the first three lines that is, they do not differ significantly.
It can be observed, however, that the CCIA algorithm provided k-Means with the best set of 3 initial centroids, considering that k-Means required only 3 iterations to achieve convergence, while with the initial centroids provided by SPSS and by MS, the k-Means performed 7 and 9 iterations, respectively, to achieve convergence. The SPSS and MS initialization methods did not contribute much, considering that the original k-Means with the random initialization needed, on average, 6.25 iterations to converge.
Analyzing the data results shown in Table 10 obtained using the Fossil data set, it can be noted that, with respect to the values of the R index, the initialization provided by ++, on average, slightly favored the quality of the obtained clusterings.
However, in contrast, the average number of iterations required for k-Means to converge was slightly higher when compared to the average of iterations required when initialized by M1 or, then, randomly.
During the experiment the M1 was not successful when using its grid-based approach, which caused the algorithm, in all the 20 runs, to finish the search for initial centroids using random choice, meaning that in the Fossil domain, the M1 behaved exactly like the original k-Means.
With respect to the average values of the R index associated to the clusterings obtained with the initializations provided by ++, M1 and random, in the Wine data set, shown in Table 11, it can be said that all the induced clusterings were very close to the one visually detected, with a difference of 0.007, at most, of the clustering visually detected. However, neither of the two methods, ++ and M1, helped to promote the k-Means performance, in respect to the number of iterations, which reached better values when the random initialization of the k-Means itself, was used.
The validation values of the D, S and the R indices, shown in Table 12, in the Seeds data set, are close to each other for the five algorithms plus the k-M.
The recurrent problem of the M1 algorithm, that of not being able to select a set of initial centroids by means of its grid-based approach, has again occurred in the Seeds data, and the M1 ended up adopting the random selection, turning its execution very similar to that of the original k-Means.
In the Blood Transfusion data the results in the three first lines of the Table 13 show that both, M1 and k-M, have similar values, as far as the D, S and R indices are concerned. They only differ in relation to the number of iterations of the k-Means which were, on average, 10 and 5, respectively. So, in the Blood Transfusion domain and taking into account initialization algorithms that have some randomly process involved, the random choice implemented by the original k-Means is the best choice.
Results shown in the last three lines of the table indicate that CCIA, SPSS and MS are in a die, except for the number of iterations required by the k-Means, with favors the CCIA. However the value of the R index points out to clusterings which agree with the given classes of the data in 50% only. The two classes in this data set are quite unbalanced.
Results from experiments using the Diabetes data set are shown in Table 14. The numbers in the three first lines of the table are practically the same and the value of the R index is an indication of clusterings that only partially agree with the classes informed in this supervised data set.
The same trend is observed with clusterings induced by the CCIA, SPSS and MS. In this specific domain any of the five algorithms would be as good as the original random choice used by k-Means, shown in the third line of the table. The two classes in the Diabetes data set are unbalanced.
Experiments using data from the E.coli data set, shown in Table 15, point out that, although the results from the algorithms do not differ much from each other, on average the those obtained with the MS are the best and are closely followed by results from the ++.
This paper discusses and empirically evaluates the performance of five initialization algorithms as replacements for the k-Means original random initialization process. Taking into account the data sets used, it can be said that the results were not too conclusive so to support a particular algorithm as the best one. However, based on the experiments, it can be said that the CCIA and the SPSS initialization algorithms could be a good choice.
Based on the work described in this paper and taking into account all the experience acquired during the development of a software system that implements the five algorithms, as described in Section 5, a few possibilities for continuing in this line of research have been considered:
to pre-process the data to remove irrelevant attributes; to use a combination of the centroids found by each of the five algorithms and by k-M (e.g., the arithmetic mean of the values found by the algorithms), and initialize k-Means with these values; to implement a variation of the process described in (2), weighting the individual performance of the five algorithms/data domain and to investigate several other initialization algorithms proposed in the literature, particularly, the algorithms proposed in [11, 35].
Footnotes
Acknowledgments
The authors thank UNIFACCAMP and CNPq for their support and also express their gratitude to Prof. Shehroz S. Khan for his support in relation to the algorithm CCIA. His kindness was very much appreciated considering it was an important contribution to the experiments conducted in the work described in this paper. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) - Finance Code 001.
