Clustering mixed numeric and categorical data with artificial bee colony strategy

Abstract

Data objects with both numeric and categorical attributes are prevalent in many real-world applications. However, most partitional clustering algorithms dealing with such data may become trapped in local optima. To further improve performance, a novel clustering algorithm, called ABC-K-Prototypes (Artificial Bee Colony clustering based on K-Prototypes), is presented on the basis of the K-Prototypes algorithm, the search strategy of the artificial bee colony, and chaos theory. In the presented approach, the one-step k-prototypes procedure is first given, and then this procedure is combined with the search strategy of the artificial bee colony to handle mixed numeric and categorical data. In the search process of scout bees, a chaotic map is utilized to generate chaotic sequences that substitute for random numbers. To accelerate the convergence of the ABC-K-Prototypes algorithm, the multi-source search is adopted in the search process of scout bees. Finally, the performance of the ABC-K-Prototypes algorithm is demonstrated by a series of experiments on mixed numeric and categorical data in comparison with that of other popular algorithms.

Keywords

Clustering; numeric attribute; categorical attribute; mixed data; artificial bee colony

1 Introduction

Clustering analysis is one of the most important techniques in data mining [11, 17]. Many fields, including information retrieval [10, 27], social media analysis [26], privacy preserving [31], image analysis [9], text analysis [6], and bioinformatics [7, 28], benefit from clustering algorithms. The target of clustering is to allocate a set of data objects into clusters such that the data objects in the same cluster are more similar to each other than to those in other clusters [21]. Clustering algorithms can generally be categorized into two types: hierarchical and partitional. In hierarchical clustering algorithms, data objects are organized into a dendrogram of nested partitions according to a divisive or agglomerative strategy [14]. In partitional clustering algorithms, data objects are divided into a given number of clusters by optimizing an objective cost function.

The k-means algorithm is an extensively utilized center-based partitional clustering algorithm owing to its simplicity and high efficiency [18]. Taking into account the uncertainty of data objects, Bezdek et al. proposed the fuzzy k-means algorithm [4]. The k-means algorithm and its fuzzy version are designed for datasets with numeric attributes. In many real-world applications, the collected data are described by both numeric and categorical attributes. The k-prototypes algorithm proposed by Huang is one of the well-known algorithms for clustering this type of data [15]. Considering the fuzzy nature of the data objects, the fuzzy k-prototypes algorithm was given by Bezdek et al. [5]. Several extensions of the k-prototypes algorithm have been proposed by taking the significance of attributes and the representation of cluster centers into account [1, 12, 19, 21]. However, one issue associated with these algorithms is that they are prone to becoming trapped in local optima.

Over the last decade, several approaches that imitate the foraging behavior of social animals, including birds and ants, have been introduced for optimization problems [20, 25]. In swarm intelligence, investigating the collective behavior of honeybees, such as foraging, learning, memorizing, and information sharing, has become an interesting research issue [34]. Lucic and Teodorović presented a bee colony optimization metaheuristic, which is inspired by the foraging behavior of a bee swarm in the real world [29]. This metaheuristic has been utilized to solve various engineering and management problems. Karaboga and Basturk devised an artificial bee colony (ABC) algorithm [22] to address numerical optimization problems. Alatas introduced chaotic bee colony algorithms for numeric optimization [2]. Karaboga and Ozturk presented an artificial bee colony clustering approach on the basis of the ABC optimization strategy [23]. Zhang, Ouyang and Ning proposed an artificial bee colony clustering approach, which adopts Deb’s rules to guide the search for candidate food sources [34]. However, most of these heuristic approaches are devised for data with numeric attributes, and they might be unsuitable for data with both numeric and categorical attributes. It is necessary to develop an ABC-based clustering algorithm for data with both numeric and categorical attributes since this type of data is prevalent in real-world applications.

Chaos describes nonlinear systems that have deterministic dynamic behavior [13]. This behavior is ergodic and appears stochastic, and it is sensitive to the initial conditions. Many chaotic maps combine certainty, ergodicity, and the stochastic property. Chaotic sequences have been adopted to replace random sequences, and have been successfully applied in many applications including secure transmission, DNA computing, and image processing [3]. Many optimization results show that chaotic sequences work better than random sequences [32].

In this paper, a novel artificial bee colony clustering approach for mixed numeric and categorical data is presented. In the proposed approach, the one-step k-prototypes procedure is given first, and then this procedure is integrated with the artificial bee colony heuristic to cluster mixed data. In the search process of scouts, chaotic sequences substitute for random numbers wherever a random-based choice has to be made. Then, the time complexity, space complexity, and convergence of the proposed approach are analyzed. Finally, the proposed approach is applied to cluster mixed data.

The rest of this paper is organized as follows: some related work is briefly reviewed in Section 2. Then, the proposed approach is described in Section 3, and the experimental results are reported in Section 4. The conclusion and future direction of this paper are given in Section 5.

2 Related work

2.1 The k-prototypes algorithm

This algorithm was first introduced by Huang in [15] for clustering data with both numeric and categorical attributes. Let X = {X₁, X₂, . . . , X_n} denote a dataset of n data objects, and let X_i (1 ≤ i ≤ n) be a data object with m attributes. Each attribute A_j (1 ≤ j ≤ m) has a domain of values denoted by Dom(A_j). In mixed data, the attribute domains are of two types: numeric and categorical. A numeric domain consists of continuous values, whereas a categorical domain is a finite set without any natural ordering (such as color or gender), which is commonly denoted as $Dom(A_{j}) = \{a_{j}^{1}, a_{j}^{2}, \dots, a_{j}^{t}\}$. Here t is the number of category values of the categorical attribute A_j. For simplicity, the data object X_i is represented as a vector $[x_{i1}, x_{i2}, \dots, x_{im}]$. The k-prototypes algorithm aims to partition the dataset X into k clusters by minimizing the cost function given by:

$E(U, Q) = \sum_{l = 1}^{k} \sum_{i = 1}^{n} u_{il}\, d(x_{i}, Q_{l}),$ (1)

where Q_l is the prototype of the cluster l; u_il (0 ≤ u_il ≤ 1) is an element of the partition matrix U_n×k, and d (x_i, Q_l) is the dissimilarity measure which is formulated as:

$d (x_{i}, Q_{l}) = \sum_{j = 1}^{m} d (x_{i j}, q_{l j}),$ (2)

In the above,

$d(x_{ij}, q_{lj}) = \begin{cases} (x_{ij} - q_{lj})^{2}, & \text{if } A_{j} \text{ is a numeric attribute}, \\ \mu_{l}\, \delta(x_{ij}, q_{lj}), & \text{if } A_{j} \text{ is a categorical attribute}, \end{cases}$ (3)

where A_j is the jth attribute; δ(p, q) = 0 if the values of p and q are the same, while δ(p, q) = 1 if they are different; μ_l is the weight for categorical attributes in the cluster l. When x_ij is a numeric value, q_lj is the mean of the jth numeric attribute in the cluster l; when x_ij is a categorical value, q_lj is the mode of the jth categorical attribute in the cluster l. The procedure of the k-prototypes algorithm is given by:

Step 1. Randomly select k data objects from the dataset X as the initial prototypes of clusters.

Step 2. Allocate each data object in the dataset X to the cluster which has the nearest prototype according to Eq. (2). Update the prototype of the cluster after each assignment.

Step 3. Once all data objects have been allocated, reevaluate the dissimilarity of each data object against the current prototypes. If the nearest prototype of a data object belongs to a cluster other than its current one, reallocate this data object to that cluster and update the prototypes of both clusters.

Step 4. If no data object has changed clusters after a full pass over X, terminate the algorithm; otherwise, go to Step 3.
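As an illustration of Eqs. (2) and (3), the following minimal Python sketch computes the mixed dissimilarity between a data object and a prototype and finds the nearest prototype as required in Step 2. The function names, the `is_numeric` flag list, and the per-cluster `weights` list (one μ_l per cluster) are assumptions introduced for this example and are not taken from the original algorithm description.

```python
def mixed_dissimilarity(x, prototype, is_numeric, mu):
    """Eqs. (2)-(3): squared difference on numeric attributes,
    mu-weighted simple matching on categorical attributes."""
    total = 0.0
    for xj, qj, numeric in zip(x, prototype, is_numeric):
        if numeric:
            total += (xj - qj) ** 2
        else:
            total += mu * (0.0 if xj == qj else 1.0)
    return total


def nearest_prototype(x, prototypes, is_numeric, weights):
    """Return the index l of the prototype with the smallest d(x, Q_l).
    weights[l] plays the role of mu_l (hypothetical representation)."""
    distances = [
        mixed_dissimilarity(x, q, is_numeric, mu)
        for q, mu in zip(prototypes, weights)
    ]
    return distances.index(min(distances))


# One numeric attribute followed by one categorical attribute.
x = [0.5, "red"]
prototypes = [[0.4, "blue"], [0.9, "red"]]
print(nearest_prototype(x, prototypes, [True, False], [1.0, 1.0]))
```

In this toy case the categorical mismatch with the first prototype outweighs its smaller numeric distance, so the object is assigned to the second prototype.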

2.2 The artificial bee colony algorithm

The artificial bee colony (ABC) algorithm was introduced by Karaboga and Basturk to solve numeric optimization problems [22]. This algorithm is well-known for its simplicity and robustness [23]. In the ABC algorithm, the artificial bees are divided into three types: employed bees, onlookers, and scouts. An employed bee exploits a particular food source and shares the information about this food source with onlookers; a scout seeks a new food source in the search space; an onlooker waits in the nest and chooses a food source via the shared information. The artificial bee colony is divided into two halves: the first half consists of the employed bees and the other half consists of the onlookers. There are three essential components (i.e., food sources, employed foragers, and unemployed foragers) and two modes of behavior (recruitment to a food source and abandonment of a food source) in the model of forage selection. The value of a food source depends on many factors, including its proximity to the nest, its nectar amount, and the ease of extracting its nectar. The unemployed foragers are divided into two types: scouts and onlookers. Each food source is exploited by exactly one employed bee, so the number of employed bees equals the number of food sources. Onlookers fly to a food source according to a probability-based selection strategy. An employed bee becomes a scout once the nectar of its food source is exhausted. The exploitation and exploration processes are carried out together in the ABC algorithm. Concretely speaking, the exploitation process is executed by the employed bees and onlookers, and the exploration process is carried out by the scouts. The bee colony exploits and explores the food sources so as to maximize the nectar stored in the nest.
In an optimization problem, a food source represents a possible solution, the nectar amount of a food source indicates the quality of the corresponding solution, and the aim is to achieve the optimal value of the objective function. The main steps of ABC algorithm are described as follows:

Step 1. Initialize the population of food sources.

Step 2. Dispatch the employed bees onto the food sources and assess the nectar amount of these food sources.

Step 3. Calculate the probabilities of all food sources to be picked up by the onlooker bees.

Step 4. Dispatch the onlookers onto the food sources: each onlooker will select a food source according to the probabilities obtained from Step 3, exploit this food source, and assess the nectar amount of the obtained food source.

Step 5. If a food source is exhausted, the corresponding employed bee ceases its exploitation process and becomes a scout bee.

Step 6. Dispatch the scouts into the search space to forage for new food sources randomly.

Step 7. Memorize the best food source obtained so far.

Step 8. If the requirements are satisfied, terminate the algorithm and output the best food source; otherwise go to Step 2.

2.3 Chaos theory

Chaos theory, proposed by Edward Lorenz, focuses on the behavior of nonlinear dynamical systems. Chaotic behavior is a deterministic, random-like process that is highly sensitive to its initial condition [3, 13]. The important characteristics of chaos include ergodicity, pseudo-randomness, irregularity, and a strange attractor with a self-similar fractal pattern [32]. Due to its ergodicity, chaos provides great diversity. Chaotic sequences have been utilized to replace random sequences, and have achieved good results in many applications including secure transmission, DNA computing and image processing [2, 3]. Many chaotic maps have been proposed to generate chaotic sequences. As a discrete-time dynamical system, the general form of a chaotic map is given by

$y_{t + 1} = f(y_{t}),$ (4)

where 0 < y_t < 1 and t = 0, 1, 2, . . . . The obtained chaotic sequence is denoted by:

$\{y_{t} : t = 0, 1, 2, \dots\}.$ (5)

It is unnecessary to store long chaotic sequences since they are easy and fast to regenerate: given the chaotic map and the initial condition, the whole sequence can be reproduced. One-dimensional chaotic maps are the simplest systems able to generate chaotic motion [32]. One of the simplest one-dimensional chaotic maps is the Kent map [2, 32], which is defined by:

$y_{t + 1} = \begin{cases} \frac{y_{t}}{\beta}, & y_{t} < \beta, \\ \frac{1 - y_{t}}{1 - \beta}, & \beta \leq y_{t} \leq 1. \end{cases}$ (6)

Here, the control parameter β is within the interval (0,1). For simplicity, the parameter β is taken as 0.7 in our work.
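The Kent map of Eq. (6) can be sketched in a few lines of Python. The function names and the seed value used below are assumptions made for illustration only; the default β = 0.7 follows the text.

```python
def kent_map(y, beta=0.7):
    """One iteration of the Kent map, Eq. (6)."""
    if y < beta:
        return y / beta
    return (1.0 - y) / (1.0 - beta)


def kent_sequence(y0, length, beta=0.7):
    """Generate `length` chaotic numbers starting from the seed y0."""
    sequence = []
    y = y0
    for _ in range(length):
        y = kent_map(y, beta)
        sequence.append(y)
    return sequence


# The seed 0.123 is an arbitrary illustrative choice.
print(kent_sequence(0.123, 5))
```

Note that the seed should not be chosen as β itself: y = β maps to 1, which maps to 0, and 0 is a fixed point of the map, so the sequence would stop being chaotic.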

3 Our proposed ABC-K-Prototypes algorithm

In this section, the proposed ABC-K-Prototypes clustering approach is first described, and then its complexity and convergence are analyzed.

3.1 The proposed approach

In this subsection, the novel clustering algorithm, which is based on the k-prototypes approach, the search strategy of an artificial bee colony, and chaos theory, is introduced. As mentioned above, the swarm of artificial bees has three types of bees: employed bees, onlookers, and scouts. In an optimization problem, a food source is a possible solution, and the nectar amount of this food source reflects the quality of the corresponding solution. In clustering, the clustering results are determined by the positions of the cluster centers. The clustering problem can therefore be regarded as the optimization of the cluster centers, and a group of cluster centers is a possible solution. For clustering data with both numeric and categorical attributes, let f_i = {C₁, C₂, . . . , C_k} denote a food source, where C_l is the prototype of the cluster l; $E(U, f_{i}) = \sum_{l = 1}^{k} \sum_{i = 1}^{n} u_{il}\, d(x_{i}, C_{l})$ is the objective cost function, where the symbols have the same meaning as in Eq. (1). The nectar amount [23] of a food source f_i is expressed as follows:

$NA(f_{i}) = \frac{1}{E(f_{i}) + 1}$ (7)

In the proposed algorithm, the artificial bees are divided into two parts: the first half of the artificial bees are the employed bees, and the remaining ones are the onlookers. There is only one employed bee on a food source, and therefore the number of employed bees equals the number of solutions in the population. Let P_fs = {f₁, f₂, . . . , f_T} denote the population of food sources, where T is the number of food sources, and f_i is the ith food source. The probability [22] of the ith food source being selected by an onlooker is formulated as:

${prob}_{i} = \frac{NA(f_{i})}{\sum_{j = 1}^{T} NA(f_{j})}$ (8)
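To make Eqs. (7) and (8) concrete, the sketch below converts a list of cost values E(f_i) into nectar amounts and selection probabilities, and then picks a food source by roulette-wheel selection. The roulette-wheel routine and the argument `u` (a number in [0, 1), which could be a random or a chaotic number) are assumptions of this example; the text only states that selection depends on prob_i.

```python
def nectar_amount(cost):
    """Eq. (7): NA(f) = 1 / (E(f) + 1)."""
    return 1.0 / (cost + 1.0)


def selection_probabilities(costs):
    """Eq. (8): normalize the nectar amounts over all food sources."""
    nectar = [nectar_amount(c) for c in costs]
    total = sum(nectar)
    return [value / total for value in nectar]


def roulette_select(probabilities, u):
    """Return the index of the food source whose cumulative
    probability interval contains u (0 <= u < 1)."""
    cumulative = 0.0
    for index, p in enumerate(probabilities):
        cumulative += p
        if u < cumulative:
            return index
    return len(probabilities) - 1  # guard against rounding error


probs = selection_probabilities([0.0, 1.0, 3.0])
print(probs)
print(roulette_select(probs, 0.6))
```

With costs 0, 1, and 3 the nectar amounts are 1, 0.5, and 0.25, so lower-cost food sources are proportionally more likely to attract onlookers.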

To obtain a candidate food source from the current one, the One-step K-Prototypes procedure, abbreviated as OKP, is presented first. Essentially, the OKP procedure is one iteration step in the search process of the k-prototypes algorithm. This OKP procedure is utilized to look for the neighbor food source of the current food source in the exploitation process. The exploitation process is executed by employed bees and onlookers. Let f_i be the current food source, then the procedure of the OKP contains two steps:

1) For each data object in the dataset X, allocate it to the cluster with the nearest prototype, and therefore generate a partition matrix U; concretely speaking, if the ith data object is a member of the lth cluster, then u_il = 1; otherwise u_il = 0, where u_il is an element of U;

2) Obtain the prototypes according to the partition matrix U, and thus generate a candidate food source $f_{i}^{'} = \{C_{1}^{'}, C_{2}^{'}, \dots, C_{k}^{'}\}$.
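The two steps of the OKP procedure can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: a single weight `mu` is used for all clusters, ties in the mode are broken by first occurrence, and an empty cluster simply keeps its previous prototype.

```python
from collections import Counter


def dissimilarity(x, q, is_numeric, mu=1.0):
    """Mixed dissimilarity in the spirit of Eqs. (2)-(3)."""
    return sum(
        (xj - qj) ** 2 if numeric else mu * (xj != qj)
        for xj, qj, numeric in zip(x, q, is_numeric)
    )


def okp_step(data, prototypes, is_numeric, mu=1.0):
    """One-step k-prototypes: assign objects, then recompute prototypes."""
    k = len(prototypes)
    # Step 1: hard assignment to the nearest prototype (u_il = 1).
    labels = [
        min(range(k), key=lambda l: dissimilarity(x, prototypes[l], is_numeric, mu))
        for x in data
    ]
    # Step 2: mean for numeric attributes, mode for categorical ones.
    candidate = []
    for l in range(k):
        members = [x for x, label in zip(data, labels) if label == l]
        if not members:
            candidate.append(list(prototypes[l]))  # assumption: keep old prototype
            continue
        prototype = []
        for j, numeric in enumerate(is_numeric):
            column = [x[j] for x in members]
            if numeric:
                prototype.append(sum(column) / len(column))
            else:
                prototype.append(Counter(column).most_common(1)[0][0])
        candidate.append(prototype)
    return labels, candidate


data = [[0.0, "a"], [0.2, "a"], [1.0, "b"], [0.8, "b"]]
labels, candidate = okp_step(data, [[0.0, "a"], [1.0, "b"]], [True, False])
print(labels, candidate)
```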

In the OKP procedure, we adopt the normalization scheme of Witten and Frank (WF normalization scheme) [30] to bring the different numeric attributes onto the same scale. The WF normalization scheme is given by:

$x_{ij}^{'} = \frac{x_{ij} - v_{j, \min}}{v_{j, \max} - v_{j, \min}},$ (9)

where v_j,min (v_j,max) is the minimum (maximum) value of the jth attribute, and $x_{ij}$ ($x_{ij}^{'}$) is the original (normalized) value.
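Eq. (9) is ordinary min-max scaling of one numeric attribute. A minimal sketch is shown below; the handling of a constant column (where the denominator would be zero) is an assumption of this example, since the text does not discuss that case.

```python
def min_max_normalize(column):
    """Eq. (9): rescale the values of one numeric attribute to [0, 1]."""
    lowest, highest = min(column), max(column)
    if highest == lowest:
        # Assumption: a constant attribute carries no information,
        # so every value is mapped to 0.0 to avoid division by zero.
        return [0.0 for _ in column]
    return [(value - lowest) / (highest - lowest) for value in column]


print(min_max_normalize([2.0, 4.0, 6.0]))
```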

Once a food source is exhausted, the corresponding employed bee becomes a scout. In our algorithm, the parameter NT, which is the given number of trials, is introduced to control the abandonment of a food source. More precisely, if a food source cannot be improved further within NT trials, this food source is abandoned, and the relevant employed bee becomes a scout. The scout then searches for a new food source in the search space. The Kent map, which is one of the chaotic maps, is introduced in the search process of a scout due to its ergodicity, irregularity and stochastic property. The Kent map [2, 32] with the parameter β = 0.7 is given as follows:

${CN}_{l + 1} = \begin{cases} \frac{{CN}_{l}}{0.7}, & {CN}_{l} < 0.7, \\ \frac{10}{3} (1 - {CN}_{l}), & \text{otherwise}, \end{cases}$ (10)

where CN_l is the lth chaotic number. The chaotic numbers generated by the Kent map are in the range (0,1). As mentioned above, the food source in the clustering scenario is a group of cluster centers, and each cluster center is a set of attribute values. In the scenario of mixed numeric and categorical attributes, the search process of a scout is different for these two types of attributes. For a categorical attribute j, the search operation of a scout is performed as follows: the scout selects a categorical value in a chaotic way from the collection of attribute values by the following equation:

${catVal}_{j} = C_{j}[index],$ (11)

where catVal_j is the categorical value, C_j is the collection of categorical values for the jth categorical attribute in a dataset, and the index is given by:

$index = floor({CN}_{l} \times len).$ (12)

Here, len denotes the number of categorical values in the collection C_j, and floor(s) means the greatest integer that is less than or equal to s. For a numeric attribute j, the value is determined by:

${numVal}_{j} = {\min}_{j} + {CN}_{l} \times ({\max}_{j} - {\min}_{j}),$ (13)

where numVal_j denotes the jth attribute value, and $\min_{j}$ ($\max_{j}$) is the minimum (maximum) value of the jth attribute. Let the abandoned food source be f_i; then the search operation of a scout foraging for a new food source is given by:

$f_{i}^{'} = scoutSearch(X),$ (14)

where i ∈ {1, 2, . . . , T}, and scoutSearch(X) is the search operation of a scout. The pseudo-code of the scoutSearch(X) process is given by:

Input: dataset X, the number of attributes m, the number of clusters k.

Output: food source $f_{i}^{'} = \{C_{1}, C_{2}, \dots, C_{k}\}$.

1) Randomly initialize the chaotic variables; let C_r denote the rth cluster center, and initialize r = 1;

Repeat

2) For the rth cluster center C_r, let A_j denote the jth attribute, and set j = 1;

Repeat

3) For the jth attribute:

a) If the jth attribute is a numeric attribute: update the chaotic variable for this attribute according to Eq. (10); get the jth attribute value for cluster center C_r by using Eq. (13).

b) If the jth attribute is a categorical attribute: update the chaotic variable for this attribute according to Eq. (10); get the jth attribute value for cluster center C_r by using Eq. (11) and Eq. (12).

4) j = j + 1.

Until (j > m)

5) r = r + 1.

Until (r > k)

The multi-source search [20] is adopted to accelerate the convergence of the proposed algorithm. The process of the multi-source search is as follows: a scout bee forages for H candidate food sources at a time, and then selects the best one as the new food source.
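The scout search of Eqs. (10)-(14) and the multi-source search can be sketched as below. Several details are assumptions made for this example rather than part of the original description: attributes are encoded as tuples `("num", min, max)` or `("cat", values)`, a single chaotic variable is shared across all attributes instead of one per attribute, the index of Eq. (12) is clamped to guard against a chaotic number equal to 1, and candidate quality is compared through a user-supplied `cost` function (the lowest cost corresponds to the highest nectar amount under Eq. (7)).

```python
import math


def kent_map(cn, beta=0.7):
    """Eq. (10) with beta = 0.7."""
    return cn / beta if cn < beta else (1.0 - cn) / (1.0 - beta)


def scout_search(attributes, k, cn):
    """Build k cluster centers attribute by attribute from chaotic numbers.
    Returns the new food source and the updated chaotic variable."""
    food_source = []
    for _ in range(k):
        center = []
        for attribute in attributes:
            cn = kent_map(cn)
            if attribute[0] == "num":
                _, lowest, highest = attribute
                center.append(lowest + cn * (highest - lowest))  # Eq. (13)
            else:
                values = attribute[1]
                index = math.floor(cn * len(values))             # Eq. (12)
                index = min(index, len(values) - 1)              # guard for cn == 1
                center.append(values[index])                     # Eq. (11)
        food_source.append(center)
    return food_source, cn


def multi_source_search(attributes, k, cn, h, cost):
    """Generate H candidates and keep the one with the lowest cost."""
    best, best_cost = None, float("inf")
    for _ in range(h):
        candidate, cn = scout_search(attributes, k, cn)
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best, cn


attributes = [("num", 0.0, 10.0), ("cat", ["a", "b", "c"])]
source, cn = scout_search(attributes, k=2, cn=0.35)
print(source)
```

In a full implementation the `cost` argument would be the objective E evaluated on the dataset after assigning objects to the candidate centers.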

Having presented the calculation formulas for all relevant variables, the pseudo-code of the proposed ABC-K-Prototypes algorithm for the mixed data is described as follows:

Input: The size of bee colony T, the maximum cycle number MCN, the number of clusters k, and the number of trials NT.

Output: The best food source.

1) Initialize the population of food sources P_fs = {f₁, f₂, . . . , f_T} in a random way; concretely speaking, for each food source f_i (1 ≤ i ≤ T), pick k data objects randomly from the dataset X as the prototypes of clusters; set the exploitation numbers En_i = 0 (1 ≤ i ≤ T) for these food sources.

2) Calculate the nectar amounts NA (f₁) , NA (f₂) ,. . . , NA (f_T) for these food sources according to Eq. (7);

3) set the cycles number CN=1;

Repeat

4) For each employed bee

a) Adopt the procedure OKP to obtain a new food source $f_{i}^{'}$ from the current food source f_i, and set En_i = En_i + 1;

b) Calculate the nectar amount of the obtained food source, that is, $NA(f_{i}^{'})$, according to Eq. (7);

c) If $NA(f_{i}) < NA(f_{i}^{'})$, the current food source f_i is replaced by the new food source $f_{i}^{'}$; otherwise the current food source f_i is unchanged.

5) Assess the probability prob_i for each food source f_i according to Eq. (8);

6) For each onlooker bee

a) Choose a food source f_i as the current food source depending on the probability prob_i;

b) Adopt the procedure OKP to obtain a new food source $f_{i}^{'}$ from the current food source f_i, and set En_i = En_i + 1;

c) Calculate the nectar amount $NA(f_{i}^{'})$ for the food source $f_{i}^{'}$;

d) If $NA(f_{i}) < NA(f_{i}^{'})$, the current food source f_i is replaced by the new food source $f_{i}^{'}$; otherwise the current food source f_i is retained;

e) Update the probability prob_i (1 ≤ i ≤ T) for all food sources according to Eq. (8).

7) For each food source f_i, if the exploitation number En_i is equal to or larger than the number of trials NT, this food source is abandoned, and the corresponding employed bee becomes a scout.

8) If there exists an abandoned food source f_i,

a) Dispatch the scout in the search space to forage for H candidate food sources ${f_{i}^{1}, f_{i}^{2}, . . ., f_{i}^{H}}$ according to Eq. (14);

b) Calculate the nectar amounts ${NA (f_{i}^{1}), NA (f_{i}^{2}), . . ., NA (f_{i}^{H})}$ for these food sources ${f_{i}^{1}, f_{i}^{2}, . . ., f_{i}^{H}}$ ;

c) Pick the food source with the highest nectar amount as the new food source $f_{i}^{'}$, and initialize its exploitation number En_i = 0;

d) If $NA(f_{i}) < NA(f_{i}^{'})$, the current food source f_i is replaced by the new food source $f_{i}^{'}$; otherwise the current food source f_i is retained.

9) CN=CN+1;

Until (CN = MCN)
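The control flow of the pseudo-code above (employed phase, onlooker phase, abandonment after NT exploitations, greedy replacement) can be sketched generically as follows. This is only a skeleton under stated assumptions: `cost`, `neighbor`, and `scout` are hypothetical callables standing in for the objective E, the OKP procedure, and the multi-source scout search; one onlooker is dispatched per food source; and Python's pseudo-random generator is used for onlooker selection. Comparing costs is equivalent to comparing nectar amounts because Eq. (7) is decreasing in E.

```python
import random


def abc_search(food_sources, cost, neighbor, scout, mcn, nt, rng):
    """Generic ABC loop: returns the best food source and its cost."""
    sources = list(food_sources)
    costs = [cost(f) for f in sources]
    trials = [0] * len(sources)  # exploitation numbers En_i
    indices = range(len(sources))

    def exploit(i):
        candidate = neighbor(sources[i])
        trials[i] += 1
        candidate_cost = cost(candidate)
        if candidate_cost < costs[i]:  # same as NA(f_i) < NA(f_i')
            sources[i], costs[i] = candidate, candidate_cost

    for _ in range(mcn):
        for i in indices:  # employed bees
            exploit(i)
        for _ in indices:  # onlookers, probabilities refreshed each time
            nectar = [1.0 / (c + 1.0) for c in costs]
            exploit(rng.choices(indices, weights=nectar)[0])
        for i in indices:  # scouts replace exhausted food sources
            if trials[i] >= nt:
                candidate = scout()
                candidate_cost = cost(candidate)
                trials[i] = 0
                if candidate_cost < costs[i]:
                    sources[i], costs[i] = candidate, candidate_cost
    best = min(indices, key=lambda i: costs[i])
    return sources[best], costs[best]


# Toy usage on a one-dimensional problem (not a clustering task).
result = abc_search(
    [0.0, 10.0],
    cost=lambda x: (x - 3.0) ** 2,
    neighbor=lambda x: x + 0.5 * (3.0 - x),
    scout=lambda: 0.0,
    mcn=10,
    nt=5,
    rng=random.Random(0),
)
print(result)
```

Because a food source is only ever replaced by a better one, the best cost in the population never increases from one cycle to the next.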

3.2 Complexity analysis

In this section, the time and space complexities of the proposed ABC-K-Prototypes approach are analyzed. The time complexity of the proposed method mainly consists of five parts: the initialization, the search of employed bees, the calculation of the probabilities of food sources, the search of scouts, and the search of onlookers. The computational costs of these five parts are O (Tk), O (T (nkm + nkp + nkC (m - p))), O (T), O (Hkm), and O (T (nkm + kpn + (m - p) Cn)), respectively. Here n is the number of data objects in the dataset X; m is the number of attributes; k is the number of clusters; p is the number of numeric attributes; T is the number of employed bees or food sources; C is the maximum number of category values over all categorical attributes; H is the number of candidate food sources for the scout bee. Therefore, the overall time complexity of the proposed approach is O (Tk + s (Hkm + T (nkm + nkp + nkC (m - p)))), where s is the number of iterations. As for space complexity, O (mn) is required to store the dataset X, O ((T + H) km) to store the food sources, and O (nk) to store the partition matrix. Therefore, the overall space complexity of our proposed method is O (mn + (T + H) km + nk).

3.3 Convergence analysis

In our approach, the search process includes an exploration process and an exploitation process, both of which are performed within the ABC search strategy. The current solution is replaced by a new solution only if the new solution is better than the current one in the exploitation or exploration process. Therefore, each possible solution appears in the current solution list at most once. If the value of MCN (maximum cycle number) is high enough, the global optimal solution is very likely to be found; otherwise, the algorithm may converge to a local optimum. In other words, the higher the value of MCN, the greater the probability that ABC-K-Prototypes converges to the global optimum, and this probability approaches 100% as MCN tends to infinity. Therefore, the probability that our algorithm converges to a global optimal solution can be made arbitrarily high by choosing MCN large enough.

4 Experimental results and discussion

To evaluate its performance, the proposed clustering algorithm ABC-K-Prototypes is executed on three mixed datasets: Zoo, Heart Disease, and Credit Approval, all of which are obtained from the well-known UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html). In this work, Yang’s accuracy measure [33], which is one of the most commonly used criteria, is utilized to evaluate the obtained clustering results. In Yang’s method, the accuracy (AC) is defined as:

$AC = \frac{\sum_{i = 1}^{k} a_{i}}{n},$ (15)

where a_i is the number of data objects that are correctly assigned to the class C_i, and n is the number of data objects in a dataset. According to this definition, the AC has the same meaning as the clustering accuracy r given in [16].
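A small sketch of Eq. (15) is given below. The text does not spell out how clusters are matched to classes, so this example assumes that each cluster is matched to its majority class and that a_i is the number of objects of that class in the cluster; other matching schemes (for example, a one-to-one assignment) would give different values in some cases.

```python
from collections import Counter


def clustering_accuracy(cluster_labels, class_labels):
    """Eq. (15) under the majority-class matching assumption."""
    n = len(class_labels)
    correct = 0
    for cluster in set(cluster_labels):
        members = [
            cls for clu, cls in zip(cluster_labels, class_labels) if clu == cluster
        ]
        # a_i: size of the most frequent class inside this cluster.
        correct += Counter(members).most_common(1)[0][1]
    return correct / n


clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "a"]
print(clustering_accuracy(clusters, classes))
```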

According to this measure, a higher value of AC implies a better clustering result. Four well-known algorithms, i.e., the K-Prototypes algorithm [15], the EKP algorithm [35], the SBAC algorithm [24], and the KL-FCM-GM algorithm [12], are selected for comparison with the proposed algorithm. For the performance analysis, the proposed ABC-K-Prototypes algorithm, the K-Prototypes algorithm, the EKP algorithm, the SBAC algorithm, and the KL-FCM-GM algorithm are executed on three different datasets, and each algorithm is run for twenty trials on each dataset. Then the clustering results of the proposed ABC-K-Prototypes algorithm are compared with those of the other four algorithms according to the average AC. All algorithms are implemented in Java and run on an Intel(R) Core(TM) i7, 3.4GHz, 16GB RAM computer. The parameters of the proposed ABC-K-Prototypes algorithm in all experiments are set as follows: T=20, which is the typical value adopted in the original ABC algorithm [23]; MCN=100, NT=5 and H=5 are set by rule of thumb. In all five algorithms, the number of clusters k is set to the number of classes supplied by the class information of the dataset. It is worth noting that, apart from the number of classes, no class information is utilized in the clustering process. The other parameters of the k-prototypes algorithm, the EKP algorithm, the SBAC algorithm, and the KL-FCM-GM algorithm are set the same as those given in their original papers.

The experiments begin with the Zoo dataset. This dataset has 101 data objects, each of which is described by one numeric attribute and 16 categorical attributes. The last categorical attribute is the class attribute, and has seven values. Therefore, each data object in the Zoo dataset belongs to one of seven classes. Table 1 summarizes the clustering results of the ABC-K-Prototypes, the K-Prototypes, the EKP, the SBAC, and the KL-FCM-GM algorithms on the Zoo dataset according to AC. The K-Prototypes, the EKP, the SBAC, and the KL-FCM-GM algorithms give AC values of 0.806, 0.566, 0.426, and 0.864, respectively. In contrast, the proposed ABC-K-Prototypes algorithm gives a higher AC value of 0.886.

Table 1

The AC of the five algorithms on the Zoo dataset

Algorithms	AC
K-Prototypes	0.806
EKP	0.566
SBAC	0.426
KL-FCM-GM	0.864(α = 1.3)
ABC-K-Prototypes	0.886

The Heart Disease dataset comprises 303 patient instances, each of which has six numeric attributes and nine categorical attributes. The last two attributes are the class attributes. When the 15th attribute is selected as the class attribute, the data objects in this dataset belong to one of five classes (s1, s2, s3, s4, and H), and each of them is described by 14 attributes; when the 14th attribute is chosen as the class attribute, the data objects in this dataset belong to one of two classes (buff, sick), and each of them is described by 13 attributes. For the first case, Table 2 lists the comparison of the clustering results of the ABC-K-Prototypes and the other four well-known algorithms on the Heart Disease dataset according to AC. The K-Prototypes, the EKP, the SBAC, and the KL-FCM-GM algorithms give AC values of 0.546, 0.545, 0.545, and 0.653, respectively. The proposed ABC-K-Prototypes algorithm gives an AC value of 0.648, which is higher than those of the K-Prototypes, the EKP, and the SBAC algorithms and slightly lower than that of the KL-FCM-GM algorithm.

Table 2

The AC of the five algorithms on the Heart Disease dataset (5 classes and 14 attributes)

Algorithms	AC
K-Prototypes	0.546
EKP	0.545
SBAC	0.545
KL-FCM-GM	0.653(α = 1.3)
ABC-K-Prototypes	0.648

For the second case, where each data object in the Heart Disease dataset has 13 attributes and the 14th attribute is taken as the class attribute, Table 3 lists the comparison of the clustering results of ABC-K-Prototypes and the other four well-known algorithms according to AC. The K-Prototypes, the EKP, the SBAC, and the KL-FCM-GM algorithms give AC values of 0.577, 0.577, 0.752, and 0.758, respectively. In contrast, the proposed ABC-K-Prototypes algorithm gives a higher AC value of 0.809.

Table 3

The AC of the five algorithms on the Heart Disease dataset (2 classes and 13 attributes)

Algorithms	AC
K-Prototypes	0.577
EKP	0.577
SBAC	0.752
KL-FCM-GM	0.758(α = 1.7)
ABC-K-Prototypes	0.809

The Credit Approval dataset consists of 690 data objects from credit card organizations, where each data object has ten categorical attributes and six numeric attributes (the last categorical one is the class attribute). The data objects in this dataset belong to one of two classes: negative (383) and positive (307). Table 4 summarizes the comparison of the clustering results of ABC-K-Prototypes and the other four well-known algorithms on this dataset according to AC. The K-Prototypes, the EKP, the SBAC, and the KL-FCM-GM algorithms give AC values of 0.562, 0.560, 0.555, and 0.584, respectively. In contrast, the proposed ABC-K-Prototypes algorithm gives a higher AC value of 0.794.

Table 4

The AC of the five algorithms on the Credit Approval dataset

Algorithms	AC
K-Prototypes	0.562
EKP	0.560
SBAC	0.555
KL-FCM-GM	0.584(α = 2.3)
ABC-K-Prototypes	0.794

In the proposed ABC-K-Prototypes algorithm, the multi-source search is adopted to accelerate the convergence of the algorithm. To illustrate the efficiency of the multi-source search, we run the ABC-K-Prototypes algorithm with and without the multi-source search for twenty trials on each of the three different datasets. Table 5 lists the average number of iterations of the proposed ABC-K-Prototypes with and without the multi-source search on the different datasets. From this table, we can see that the average number of iterations of the ABC-K-Prototypes algorithm with the multi-source search is lower than that of the same algorithm without the multi-source search on every dataset.

Table 5

The average number of iterations of the ABC-K-Prototypes algorithm with and without the multi-source search on the different datasets

Datasets	With the multi-source search	Without the multi-source search
Zoo	1.95	2.55
Heart Disease (5 classes and 14 attributes)	2.55	2.8
Heart Disease (2 classes and 13 attributes)	1.05	1.15
Credit Approval	1.1	1.6
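
The size of the saving can be read off Table 5 as a relative reduction. The short sketch below only rearranges the averages reported in the table; the names `table5` and `relative_reduction` are illustrative assumptions and not part of the paper.

```python
# Average iteration counts copied from Table 5: (with, without) multi-source search.
table5 = {
    "Zoo": (1.95, 2.55),
    "Heart Disease (5 classes and 14 attributes)": (2.55, 2.8),
    "Heart Disease (2 classes and 13 attributes)": (1.05, 1.15),
    "Credit Approval": (1.1, 1.6),
}


def relative_reduction(with_search, without_search):
    """Share of iterations saved by the multi-source search."""
    return (without_search - with_search) / without_search


reductions = {name: relative_reduction(w, wo) for name, (w, wo) in table5.items()}
for name, r in reductions.items():
    print(f"{name}: {100 * r:.1f}% fewer iterations on average")
```

On these figures the saving ranges from roughly 9% on the two Heart Disease cases to roughly 31% on Credit Approval. The absolute differences are all below one iteration, so the percentages should be read with that scale in mind.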

Table 6 summarizes the number of iterations of the ABC-K-Prototypes algorithm and the EKP algorithm on each of the three datasets. The results show that ABC-K-Prototypes requires fewer iterations than EKP in every case except Credit Approval, where EKP needs 1.0 iterations against 1.1 for ABC-K-Prototypes.

Table 6

The number of iterations of the ABC-K-Prototypes algorithm and the EKP algorithm on the different datasets

Datasets	ABC-K-Prototypes	EKP
Zoo	1.95	12.5
Heart Disease (5 classes and 14 attributes)	2.55	7.85
Heart Disease (2 classes and 13 attributes)	1.05	4.15
Credit Approval	1.1	1.0

The experimental results in Tables 1-6 show that the proposed ABC-K-Prototypes algorithm achieves higher AC values on most datasets and therefore outperforms the other four algorithms according to the AC measure. Furthermore, the ABC-K-Prototypes algorithm requires fewer iterations in most cases. The reason for this success is as follows: by combining the OKP operator with the ABC optimization framework, the approach is capable of both global search (exploration) and local search (exploitation). More specifically, the employed and onlooker bees perform the local search with the OKP operator, and the scout bees carry out the global search in a chaotic way. Therefore, the proposed ABC-K-Prototypes algorithm can obtain optimal or near-optimal results.
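
The local search step can be illustrated with a single k-prototypes pass. The sketch below is not the paper's OKP procedure, which is defined in an earlier section; it assumes the standard k-prototypes scheme, in which the dissimilarity is the squared Euclidean distance on numeric attributes plus a weighted count of categorical mismatches, and prototypes are updated with means and modes. The names `mixed_distance`, `one_step_k_prototypes`, and `gamma`, and the toy data, are illustrative assumptions.

```python
from collections import Counter


def mixed_distance(obj, proto, num_idx, cat_idx, gamma):
    """Squared Euclidean part plus gamma times the number of categorical mismatches."""
    numeric = sum((obj[j] - proto[j]) ** 2 for j in num_idx)
    categorical = sum(1 for j in cat_idx if obj[j] != proto[j])
    return numeric + gamma * categorical


def one_step_k_prototypes(data, prototypes, num_idx, cat_idx, gamma=1.0):
    """Assign every object to its nearest prototype, then update the prototypes once."""
    labels = [
        min(range(len(prototypes)),
            key=lambda c: mixed_distance(obj, prototypes[c], num_idx, cat_idx, gamma))
        for obj in data
    ]
    new_prototypes = []
    for c, old in enumerate(prototypes):
        members = [obj for obj, lab in zip(data, labels) if lab == c]
        proto = list(old)
        if members:  # an empty cluster keeps its previous prototype
            for j in num_idx:
                proto[j] = sum(m[j] for m in members) / len(members)  # mean
            for j in cat_idx:
                proto[j] = Counter(m[j] for m in members).most_common(1)[0][0]  # mode
        new_prototypes.append(proto)
    return labels, new_prototypes


# Toy mixed data: attribute 0 is numeric, attribute 1 is categorical (hypothetical values).
data = [[1.0, "a"], [1.2, "a"], [0.8, "b"], [5.0, "b"], [5.5, "b"], [4.5, "a"]]
start = [[1.0, "a"], [5.0, "b"]]
labels, updated = one_step_k_prototypes(data, start, num_idx=[0], cat_idx=[1])
print(labels, updated)
```

In an ABC-style search, a pass of this kind would refine a candidate solution (a set of prototypes) locally, while a separate mechanism generates new candidates elsewhere in the search space.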

6 Conclusions and Future Work

Data objects with both numeric and categorical attributes are ubiquitous in many real-world applications. Algorithms of the k-prototypes type are well known for clustering this type of data efficiently. However, such algorithms are prone to becoming trapped in local optima.

To address this issue, this paper presents a novel clustering algorithm, ABC-K-Prototypes, which is based on the traditional k-prototypes algorithm, the ABC optimization strategy, and chaos theory. In the proposed algorithm, the employed bees and onlookers use the OKP procedure to explore food sources around the existing food source, and the scouts explore food sources in the entire search space in a chaotic way. To accelerate the convergence of ABC-K-Prototypes, the multi-source search is used in the search process of the scout bees. The time complexity, space complexity, and convergence of the ABC-K-Prototypes algorithm are analyzed, and the algorithm is tested on three datasets with both numeric and categorical attributes. The experimental results validate the performance of the proposed algorithm.

For simplicity, the Kent map is adopted to generate the chaotic sequences in this paper. In future work, we will focus on applying other chaotic maps and swarm intelligence algorithms to the clustering of mixed data.
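
As a reminder of how such a sequence is generated, the following is a minimal sketch of the Kent map in its usual piecewise-linear form: x / m for x <= m and (1 - x) / (1 - m) otherwise. The control parameter value m = 0.7, the starting value, and the function names are illustrative assumptions; the parameterization used in the paper is given in an earlier section and may differ.

```python
def kent_map(x, m=0.7):
    """One iteration of the Kent map with control parameter m in (0, 1)."""
    if not 0.0 < m < 1.0:
        raise ValueError("m must lie strictly between 0 and 1")
    return x / m if x <= m else (1.0 - x) / (1.0 - m)


def chaotic_sequence(x0, length, m=0.7):
    """Iterate the Kent map from x0 and return the generated values."""
    if not 0.0 < x0 < 1.0:
        raise ValueError("x0 must lie strictly between 0 and 1")
    values = []
    x = x0
    for _ in range(length):
        x = kent_map(x, m)
        values.append(x)
    return values


# Hypothetical starting value; such values could stand in for uniform random numbers.
sequence = chaotic_sequence(0.3123, 10)
print([round(v, 4) for v in sequence])
```

Swapping in another chaotic map would only require replacing `kent_map`, since the rest of the search consumes the values exactly as it would consume random numbers in the unit interval.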

Acknowledgments

This work was supported by the National Key R&D Program of China under Grant No. 2017YFC0909200, the National Natural Science Foundation of China (NSFC) under Grant Nos. 61403077, 61502093, 11501095, 81502291, 61802057, and 61872076, the Natural Science Foundation of the Education Department of Jilin Province under Grant Nos. 2016504 and 2016505, and the Science and Technology Development Plan of Jilin Province under Grant Nos. 20170520058JH, 20170520051JH, 20180414006GH, 20180520028JH, and 20150101057JC.

References

1. Ahmad and Dey, A k-mean clustering algorithm for mixed numeric and categorical data, Data & Knowledge Engineering 63 (2007), 503–527.

2. Alatas, Chaotic bee colony algorithms for global numerical optimization, Expert Systems with Applications 37 (2010), 5682–5687.

3. Alatas, Akin and A.B. Ozer, Chaos embedded particle swarm optimization algorithms, Chaos, Solitons & Fractals 40 (2009), 1715–1734.

4. J.C. Bezdek, Ehrlich and Full, FCM: the fuzzy c-means clustering algorithm, Computers & Geosciences 10 (1984), 191–203.

5. J.C. Bezdek, Keller, Krishnapuram and N.R. Pal, Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer Academic Publishers, Boston, 1999.

6. K.K. Bharti and P.K. Singh, Opposition chaotic fitness mutation based adaptive inertia weight BPSO for feature selection in text clustering, Applied Soft Computing 43 (2016), 20–34.

7. Blomstedt, Dutta, Seth, Brazma and Kaski, Modelling-based experiment retrieval: a case study with gene expression clustering, Bioinformatics 32 (2016), 1388–1394.

8. Boeing, Visual Analysis of Nonlinear Dynamical Systems: Chaos, Fractals, Self-Similarity and the Limits of Prediction, Social Science Electronic Publishing 4 (2016).

9. Bogner, B.T.Y. Widemann and Lange, Characterising flow patterns in soils by feature extraction and multiple consensus clustering, Ecological Informatics 15 (2013), 44–52.

10. Bordogna and Pasi, A quality driven hierarchical data divisive soft clustering for information retrieval, Knowledge-Based Systems 26 (2012), 9–19.

11. M.E. Celebi, H.A. Kingravi and P.A. Vela, A comparative study of efficient initialization methods for the k-means clustering algorithm, Expert Systems with Applications 40 (2013), 200–210.

12. S.P. Chatzis, A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional, Expert Systems with Applications 38 (2011), 8684–8689.

13. L.-Y. Chuang, C.-J. Hsiao and C.-H. Yang, Chaotic particle swarm optimization for data clustering, Expert Systems with Applications 38 (2011), 14555–14563.

14. Han, Kamber and Pei, Data mining: concepts and techniques, 3rd ed., Morgan Kaufmann (2012).

15. Huang, Clustering large data sets with mixed numeric and categorical values, The First Pacific-Asia Conference on Knowledge Discovery and Data Mining (1997), 21–34.

16. Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2 (1998), 283–304.

17. A.K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognition Letters 31 (2010), 651–666.

18. A.K. Jain and R.C. Dubes, Algorithms for clustering data, Prentice Hall (1988).

19. Ji, Bai, Zhou, Ma and Wang, An improved k-prototypes clustering algorithm for mixed numeric and categorical data, Neurocomputing 120 (2013), 590–596.

20. Ji, Pang, Zheng, Wang and Ma, A novel artificial bee colony based clustering algorithm for categorical data, PLoS ONE 10 (2015), e0127125.

21. J.C. Ji, Pang, C.G. Zhou, Han and Wang, A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data, Knowledge-Based Systems 30 (2012), 129–135.

22. Karaboga and Basturk, On the performance of artificial bee colony (ABC) algorithm, Applied Soft Computing 8 (2008), 687–697.

23. Karaboga and Ozturk, A novel clustering approach: Artificial Bee Colony (ABC) algorithm, Applied Soft Computing 11 (2011), 652–657.

24. Li and Biswas, Unsupervised learning with mixed numeric and nominal data, IEEE Transactions on Knowledge and Data Engineering 14 (2002), 673–690.

25. Li and Yin, Modified cuckoo search algorithm with self adaptive parameter method, Elsevier Science Inc., 2015.

26. Luo, Pang and Wang, Semi-supervised Clustering on Heterogeneous Information Networks, Proceedings of the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’14), Taiwan (2014), 548–559.

27. Naouar, Hlaoua and M.N. Omri, Collaborative Information Retrieval Model based on Fuzzy Clustering, 2017 International Conference on High Performance Computing & Simulation (HPCS), Genoa (2017), 495–502.

28. Saeed, Salim and Abdo, Information theory and voting based consensus clustering for combining multiple clusterings of chemical structures, Molecular Informatics 32 (2013), 591–598.

29. Teodorović, Bee Colony Optimization (BCO), Innovations in Swarm Intelligence, Springer, Berlin Heidelberg (2009), 39–60.

30. I.H. Witten and Frank, Data mining: practical machine learning tools and techniques with Java implementations, ACM SIGMOD Record 31 (2002), 76–77.

31. Xin, Z.Q. Xie and Yang, The privacy preserving method for dynamic trajectory releasing based on adaptive clustering, Information Sciences 378 (2017), 131–143.

32. Yang, Liu and Zhou, Chaos optimization algorithms based on chaotic maps with different probability distribution and search speed for global optimization, Communications in Nonlinear Science and Numerical Simulation 19 (2014), 1229–1246.

33. Yang, An evaluation of statistical approaches to text categorization, Journal of Information Retrieval 1 (1999), 67–88.

34. Zhang, Ouyang and Ning, An artificial bee colony approach for clustering, Expert Systems with Applications 37 (2010), 4761–4767.

35. Zheng, Gong, Ma, Jiao and Wu, Unsupervised evolutionary clustering algorithm for mixed type data, Proceedings of the IEEE Congress on Evolutionary Computation (CEC) (2010), 1–8.