Abstract
This paper proposes a novel evolutionary clustering algorithm and prepares eligible initial centroids for K-Means algorithm by global search approach of efficient hybrid knowledge of swarm intelligence algorithms. Clustering performs data grouping into subsets with common features, So that useful information can be retrieved from them. Swarm intelligence algorithms with evolutionary optimization approach have been very efficient performance in these matters. In this paper, a novel hybrid algorithm called IFAPSO has been proposed which uses swarm hybrid knowledge of Firefly and PSO intelligence algorithms to make an effective data clustering. Performance improvement of PSO and Firefly swarm algorithms and resolve the deficiency of each of them is applied and hybrid of them has been used to benefit from both algorithms into clustering problem effectively. Also this hybrid swarm algorithm overcomes the initial centers’ selection sensitivity and the limitation of local optima in K-Means. We used five benchmarks with several samples and features to evaluate our work. The comparison between proposed methods with the traditional algorithms and previous hybrid methods, suggest that there is more compactness of the resulting clusters and promising accuracy is achieved.
Introduction
Data clustering is a data classification process to some subsets (clusters), So that through identifying intrinsic patterns in objects, we can retrieve useful information of them. These patterns can be displayed in multi-dimensional vector space in numerical form. In partitional clustering, the algorithms determine clusters centers by having the number of clusters (k) and similarity evaluation criteria, So that the intra-cluster similarity between objects will be maximum and the similarity between the clusters will be minimum. K-Means clustering algorithm [1] is a partitional clustering technique that has a great reputation because of low complexity and simplicity of implementation. The K-Means algorithm is one of the most popular methods for clustering multivariate quantitative data. However there is high sensitively to the choice of initial clusters centers and it often traps into local optima. In this algorithm, each object is only belongs to the closest cluster and centers are updated by the objects assigned to them. So centers cannot take out of a local density of data. Therefore, the initial selection of the cluster centers decides the main processing of K-Means algorithm and the partition result of the dataset as well.
Stochastic search methods, especially insects and animals self-organizing of community algorithms and population behavior based algorithms which are called Swarm Intelligence (SI) algorithms, were proposed because they deal with global optimization problems effectively as an effective tool for traditional clustering methods. These algorithms produce optimal solutions using the evolution mechanism.
SI algorithms have been considered as an appropriate solution for traditional clustering methods. A common method of implementing SI algorithms for clustering is considering the set of initial cluster centers (possible solutions) and then finding the best division on data sets. These algorithms recover the best solution through combining and comparing different solutions.
Among the various population based optimization algorithms, PSO algorithm has been widely studied. Great tendency to use this method is because of its easy implementation, low computational cost, absence of complex evolutionary parameters such as crossover and mutation in GA, low sensitivity to objects and parameters and sustainable solution with a high quality. Despite of its many advantages, when the search space is large (multi-dimensional data); its convergence rate is very slow. For this purpose, multiple versions of the PSO algorithm are proposed to solve its early convergence problem in the semi optimal points.
On the other hand, Firefly algorithm is a swarm Intelligence technique for optimization problems. It is based on the assumption that we can consider the optimal solution of a problem as a Firefly that gives light of itself with the extent of its quality in the problem. Naturally, the brighter fireflies attract its partner that leads to the exploration of the search space. FA was constructed by Yang to solve the multi objective optimization problem [2]. In his research, efficacy of FA was validated using various classical benchmarks and its powerful and relatively efficient was demonstrated.
In clustering problems, it has been indicated that the global search capability of FA surpassed many other techniques such as artificial bee colony, multilayer perceptrons, PSO, Bayes net and radial basis function in most cases [3]. Also, usage of FA in various domains, represent promising results [4, 5]. FA can find optimal solution using its exploration in most cases. However because of its randomness search approach, it cannot always achieve global best results.
Hybridization of evolutionary algorithms is useful to optimize the performance of the direct evolutionary approach [6]. Studies show that hybrid optimization algorithms have largely overcome the problems. Using them along with traditional clustering algorithms such as K-Means, overcome limitations and also they have more accuracy and more efficiency compared with traditional algorithms [7].
In this paper, we use efficient hybrid of PSO and Firefly algorithms for clustering purpose in order to overcome the limitations of both algorithms. In this hybrid algorithm, an improvement of PSO functionality is added to the FA in order to improve diversity of fireflies (Improved Firefly), which can be treated as a mutation operator. Also by combining the principle of PSO and FA, we look for the best initial centers for K-Means algorithm and reach better performance of centroid-based clustering methods.
The purpose of the proposed algorithm is to minimize intra-cluster and to maximize the distance between the clusters. Results obtained from the proposed methods are compared with the K-Means, PSO, FA, PSO-KM, FA-KM algorithms and five benchmark with several features and instances. Comparing the results shows more compactness and well efficiency of the proposed method than the compared algorithms.
The rest of this paper is organized as follows: the revision of the literature about previous works is presented in Section 2. There is a brief explanation of the algorithms K-Means, PSO and the Firefly in Section 3. The details of the proposed methods and its implementation are described in the Section 4. Section 5 represents the experimental datasets and compares the results. Finally, in Section 6, a summary of the manuscript and future works will be explained.
Related works
Here are various optimization algorithms inspired by nature and employed for data clustering in order to solve the problems of center based clustering algorithms such as K-Means, like Simulated Annealing (SA) [8], Particle Swarm Optimization (PSO) [7, 10], Genetic Algorithm (GA) [11], Ant Colony Optimization (ACO) [12], Tabu search (TA) [13], Cuckoo Search algorithm (CS) [14], Bat Algorithm (BA) [15] and Firefly Algorithm (FA) [16–18].
Meanwhile, PSO has been extensively explored. Many researches have been done in PSO based clustering to optimize clustering challenges. Various approaches of hybrid of PSO and K-Means algorithm are proposed to overcome both algorithms’ limitations.
In [8], a new hybrid sequential clustering approach was presented. They used PSO and K-Means algorithm in sequence for data clustering. This approachwas proposed to overcome the drawbacks of both algorithms and to improve clustering result. Moreover it aimed to avoid stagnation by good convergence speed and reducing error rate. Their proposed algorithm generated more accurate and robust clustering results.
A hybrid Particle Swarm Optimization named Subtractive+ (PSO) clustering algorithm proposed in [9]. Subtractive clustering is a fast, one-pass algorithm for estimating the number of clusters and cluster centers for any dataset. The results illustrated that the Subtractive+ (PSO) clustering algorithm can generate the most compact clustering results as compared to other algorithms.
The proposed approach by [10] was a combination of PSO and K-Means. This method applied meta-optimization technique for overcoming the limitations of K-Means with the help of PSO which offers a globalized search approach but suffers from slow convergence near optimal solution. Here the proposed technique applied the result of PSO as the input seed of K-Means to obtain better result. Comparative analysis represented that proposed approach had better convergence to lower quantization errors, larger inter-cluster distances, smaller intra-cluster distances and approx, same execution time and more accuracy measurement.
Despite of popularity of PSO, it suffers from fast and premature convergence in mid optimum points and don’t have good performance in multi dimensional data. In order to solve these problems, we use the benefit of evolutionary algorithms hybridization to solve them. Especially we use Firefly algorithm that is a nature inspired optimization algorithm. FA is an efficient, reliable and robust method, which can be applied successfully to generate optimal cluster centers [16–18].
In this research, a novel hybrid of FA and PSO swarm intelligence algorithms is presented to solve drawbacks of these algorithms and use global optimization advantages of both of them to the clustering problem. Then we use this efficiency hybrid as a solution to overcome local optima in K-means.
Background
K-Means clustering algorithm
K-Means is one of the simplest unsupervised learning algorithms which it solves the well-known clustering problem because of its fast execution and easy implementation [1]. Its idea is splitting a dataset samples into K groups (clusters), in which each object has common characteristics with the others objects on the same cluster, maximizing the inter-cluster distances and minimizing the intra-cluster distances. For each cluster a centroid is associated, which is placed in any location of the problem space. The K-means algorithm can be summarized as: Initialize k cluster centers randomly. Assign each data vector to the closest cluster centroids (less distance), where the distance to the centroid vectors is determined as Equation (1).
X
i
is ith data vector and C
j
is jth cluster centroid vector and d is the number of dimensions in each vector. Recalculate the clusters centroids vector c
j
using Equation (2).
n
j
is the number of data vectors that belongs to cluster S
j
. Repeat Steps 2 and 3 until the convergence is achieved (the fixed iteration number, or the cluster result does not change after a certain number of iterations).
The aim of K-Means algorithm is to optimize the objective function (f) where N is the number of objects, illustrated in Equation (3).
PSO algorithm [19] is a population-based optimization technique. Each population consists of N particles, each particle moves in a d-dimensional space. Each particle finds the best solution based on its movement history and the knowledge gained from the swarm. pbest
i
is the best position obtained by particle i and gbest represents the best global position of all pbest
i
s.
Position and velocity of every particle is shown by x i and v i respectively. They are updated in each iteration by pbest i and gbest.
In Equation (4), r1 and r2 are random numbers in the range (0, 1). c1 and c2 are used to control the motion of a particle in each iteration. w controls the impact of previous rate of a particle on its current value in the optimization process. PSO procedure is described in Fig. 1. One of the most important issues in PSO is balancing local and global search capabilities in a population because it is vital in the success of PSO.

Pseudocode of PSO algorithm.
FA algorithm is a swarm intelligence algorithm that has been inspired by the behavior of fireflies [2]. This optimization technique is based on the fact that every firefly is absorbed to the other fireflies based on the light amount. That is a firefly with the less brightness is absorbed to the case with more brightness. So the search space has been explored efficiently. FA algorithm follows these three rules: Assumes that all the fireflies are a genus. That is their attraction to each other is independent of their gender. Each firefly is absorbed to the other one based on its brightness. So the distance between them is reduced. This brightness is a criteria of the objective function which is determined depending on any issue.
Attractiveness of any Firefly is calculated by Equation (6).
Where r represents the Euclidean distance between the two fireflies and β0 is the Attractiveness value when r = 0. γ represents the light absorption coefficient. Firefly i’s movement which is absorbed toward firefly j, is calculated by Equation (7).
The amount of α is considered for randomness control of x value and diversity of the solutions. It is determined as Equation (8).
α0 is the criteria of amount of initial randomness and δ is a moderating factor that its value is determined in range (0.95, 0.97) in most cases [19]. Procedure of Firefly algorithm is described inFig. 2.

Pseudocode of firefly algorithm.
In order to use the benefits of Firefly and PSO optimization algorithms, as described before, and increasing the accuracy and quality of clustering results, IFAPSO (Improved Firefly Algorithm base on PSO) hybrid clustering algorithm is proposed that is firstly used for clustering problem directly. Then this method has been used with K-Means algorithm to overcome its drawbacks.
Improved Firefly Algorithm based on PSO (IFAPSO)
In each iteration of the standard firefly algorithm, brighter firefly influences on the others and attract them to itself (local optimization). In fact, fireflies move towards each other regardless of the global optimal and best their previous position. This reduces the ability of the algorithm to find the optimal solution. To overcome this weakness and to improve collective movement of fireflies, we used the improved firefly algorithm inspired by PSO algorithm. In this algorithm, instead of only a firefly (brighter) impact and attract neighbors, firefly with the highest or lowest value (global optimization) can also affect others’ movement in each iteration. Also, each firefly’s previous best position can be effective in its Motion.
In this section, a part of PSO is used in the FA to increase convergence and also to enhance its capability for not falling into the local optima. So the position vector of IFAPSO is modified by Equation (9) [20, 21].
Also, if the firefly i is brighter than the firefly j (or be a better choice than the j-th), it randomly changes the location hoping to achieve the global optima:
IFAPSO algorithm has been described in Fig. 3. Like most population-based algorithm, a population of possible solutions is randomly generated at first.

Pseudocode of improved firefly algorithm based on PSO (IFAPSO).
Each solution in the population is coded as a string of finite length consisting of k cluster centers, according to Fig. 4. Each solution in this structure has D features (D dimensions). So every solution stores an array with size of (K × D).

Individual’s structure for IFAPSO swarm algorithm.
First, the fitness function is computed for each solution. The fitness of each solution is evaluated based on the sum of Euclidean distances of each data vector to the closest cluster centers, by Equation (3). The solution which has the minimum fitness function, is considered as the best current solution (gbest). Then, the population is sorted based on fitness evaluations and is divided in to two sub population. The first sub population consists of solutions which have better function evaluation (well sub population). We apply improved firefly algorithm to this sub population. The fitness of solution i in this sub population is compared with the jth solution in each iteration. If the solution j has better fitness evaluation, solution i move toward it by Equation (9) and Vice versa, position of solution i is updated by Equation (11).
In the other hand, second sub population consists of solutions which has worse function evaluation and is not desirable (worse sub population). In order to improve these sub population, each solution is updated as Equation (12).
x r is the randomly selected solution from well sub population and gbest is the best solution in total population. Finally two improved sub population are combined and this procedure is executed until the stop criterion is met. The best individual (with the best fitness value) contains the best coordinates for the centroids of all clusters.
Upon execution of IFAPSO algorithm, we can still make a post-processing to gain better visualizeof the results. The purpose of this method is to combine swarm intelligence with traditional partitional clustering algorithms, such as K-Means. In this way, we used the advantages of two population-based algorithms, PSO and FA, along with classical clustering methods. First step inK-Means and other partitional clustering algorithms is randomly selection of initial clusters centers which has direct effect on quality of clustering. Because of sensitivity of initial centers, optimal selection of them using swarm intelligence algorithms can have a significant impact on the quality ofclustering.
At first, the population of initial clusters centers is generated randomly. Then using advantages of background knowledge of IFAPSO swarm algorithm to find global optima, best selection of initial centers is achieved. K-Means iterative procedure is started by these optimal centers. So that assignment of objects to the clusters and updating the clusters centers, is based on the IFAPSO gbest result. Operation of this method is described in Fig. 5.

Flowchart of IFAPSOKM algorithm.
In this section, we evaluate the performance of IFAPSO swarm clustering and combination of this with K-Means algorithm, IFAPSOKM, based on the datasets selected from UCI machine learning repository [22]. Then we compare the result of these methods with standard K-Means, PSO, GA, FA, and previously hybrid algorithms consist of PSO-KM, FA-KM [8, 16–18]. In all these algorithms, allocation of objects to the closest cluster is based on Euclidean distance. Also the parameters of PSO, Firefly and KM algorithms in the proposed method have been set according to Table 1.
Parameters of the proposed method
Parameters of the proposed method
Five different datasets from the UCI repository have been used to test the IFPSO and IFPSOKM algorithms in order to comparative analysis can be performed properly. The datasets include different number of clusters (k), with several number of features or dimensions (d) and different sample sizes (n), so that we can obtain accurate information about the performance of these algorithms. These datasets are commonly used in evaluating the performance of data clustering algorithms and can control correct allocation of objects in clusters after running the algorithm. The characteristics of the selected datasets are summarized in Table 2.
Evaluated datasets
Evaluated datasets
The criteria used to assess the quality of clustering are in accordance with two internal evaluation [14, 21] (Intra and Inter cluster distance) and one external evaluation (F-Measure). When a clustering result is evaluated based on the data that was clustered itself, this is called internal evaluation. In external evaluation, clustering results are evaluated based on data that was not used for clustering, such as known class labels and external benchmarks.
N j , N i and N i j are respectively the number of observations assigned to cluster j, number of observations in class i, and number of observations of class i within cluster j. A better clustering solution will be indicated by a higher F-measure, where the score 1 means that the perfect clustering is attained. A better clustering solution will be indicated by a higher F-measure, where the score 1 means that the perfect clustering is attained.
The results of the implementation of the proposed methods and the previous five algorithms are summarized in the Tables 3 and 4. In each execution of these algorithms, new population is generated. Results are based on the average performance of each of these algorithms in 10 executions of them. In these tables, proposed methods (IFAPSO and IFAPSOKM) were compared with the previous related algorithms based on intra and inter cluster distances. Previous algorithms consist of two types of evolutionary clustering algorithms and hybrid clustering based on K-Means algorithm.
Comparison of Intra-cluster distance
Comparison of Intra-cluster distance
Comparison of Inter-cluster distance
Performance of different algorithms based on Intra cluster distance is considered in Table 3. As it is obvious from comparing values, in almost all cases the IFAPSO proposed methods is relatively lower intra-cluster distances than all other algorithms. This means that similarity of objects within a cluster is more toward previous cases and represents more compactness in output clusters. Graphical comparison of Intra-cluster distance for each dataset is in Fig. 6.

Intra-cluster distance bar plots for each dataset.
Performance of the methods based on Inter-cluster distance is considered in Table 4. In this criterion, more distance between clusters represent more dissimilarity between them and better performance of clustering. Based on the results, a significant increase of the distances between clusters of the both proposed methods (IFAPSO and IFAPSOKM) can be found over other algorithms. Graphical comparison of Intra-cluster distance for each dataset is shown in Fig. 7.

Inter-cluster distance bar plots for each dataset.
Generally, we understood that using hybrid knowledge of swarm intelligence algorithms to achieve the benefits of each approach and resolve the deficiency of them, will have a significant influence on increasing the quality of clustering.
Also, comparison of two proposed methods results based on intra and inter cluster distances show that both of them have appropriate separation between the clusters results but IFAPSO swarm clustering method has more performance than the IFAPSOKM in case of intra similarity into clusters.
In Table 5, we compared proposed methods with previous algorithms based on F-Measure external criteria. As shown, both the proposed approaches have higher F-measure than the others and this means that a significant increase in the accuracy and quality of clustering algorithms is obtained.
Comparison of F-Measure evaluation criteria
Chart of different methods execution based on F-measure evaluation criteria is shown at the Fig. 8. With comparison of IFAPSO with evolutionary clustering algorithms like FA and PSO, we found that in datasets with fewer instances (Iris, Wine & Glass), IFAPSO has better F-Measure values and so better performance. For datasets with larger instances, IFAPSO has satisfactory results and is comparable with best previous method. In case of K-Means based clustering algorithm, IFAPSOKM almost has best F-Measure result among them. This means that using hybrid knowledge of swarm intelligent algorithms, not only perfect k-means algorithm operation, but also has more performances rather than hybrid of each algorithm of PSO and FA with K-Means clustering algorithm. It is because of using advantages of both swarm algorithms and cover limitations of them with appropriate hybridization of them.

F-Measure evaluation criteria for all datasets.
In this paper, IFAPSO swarm clustering algorithm and combination of this, with K-Means algorithm and IFAPSOKM algorithm, are suggested and implemented. IFAPSO algorithm is proposed with the aim of using evolutionary benefits of both FA and PSO algorithms and also removing the limitations of them. It has been done through FA reforms in order to benefit from the global optima to improve the centers. Also the IFAPSOKM hybrid clustering algorithm that is based on K-Means, is proposed to eliminate the sensitivity of K-Means algorithm to the initial centers and local optimality limitations. The results of the comparison of these algorithms with traditional and previous hybrid algorithms showed that the proposed algorithms are much satisfactory. For future works, these algorithms can be applied on other common data sets of UCI repository to check their Integrity. Also they can be used to solve different data mining techniques like classification and other optimization problems.
