Abstract
Partitioning a set of objects into groups or clusters is a fundamental task in data mining, and clustering is a popular approach to implementing partitioning. Among several clustering algorithms, the k-means algorithm is well-known and widely applied in several areas that only handle numerical attributes. The k-modes algorithm is an extension of the k-means algorithm that deals with categorical variables, which has several variations such as fuzzy methods. This paper presents a new attribute weighting method for the k-modes algorithm that utilizes impurity measures such as entropy and Gini impurity. The proposed algorithm considers both the distribution of categories of attributes within the same cluster and between different clusters. By doing this, categorical variables defined as more important that others by the new algorithm have a significant influence on the similarity calculation, and this results in improved clustering performance, which was confirmed by experiments.
Introduction
Clustering is an unsupervised learning method that identifies groups of objects similar to each other in such a way that generated groups are distinct from other groups. Clustering is one way to gain insight from data distribution and has been applied to many real-world applications such as marketing, psychology, and bioinformatics. Because of the usefulness of clustering, various clustering algorithms have been developed such as k-means algorithm [29, 30], spectral clustering algorithm [35], hierarchical clustering algorithm [26], density-based spatial clustering algorithm of applications with noise [14, 33],two-stage genetic clustering algorithm [21] and others. However, clustering of categorical data has been less extensively studied compared with clustering of numerical data. Recently, clustering of categorical data has gained increasing attention considering practical requirements from fields such as statistics and psychology.
Studies on the clustering of categorical data are limited because of certain problems that need to be solved [7, 9]. Categorical attributes are lacking in inherent order of categories in each attribute, which prevents the definition of similarity or dissimilarity measures. The difficulty on defining distance measures is critical in evaluating the clustering quality during the clustering process, which does not occur when dealing with only numerical attributes. Several dissimilarity measures for categorical data have been proposed to overcome this problem and utilize similar clustering techniques for numerical data [12].
Based on dissimilarity measures for categorical data instead of distance metrics such as Euclidean distance, k-modes clustering algorithm [22, 23], which similar to the k-means clustering algorithm for numerical data was proposed and became one of the most efficient clustering methods for datasets including categorical features [3, 20]. Like the k-means clustering algorithms, which have been widely used in real-world applications because of its simplicity and strength in high-dimensional space, the k-modes clustering algorithm groups a given dataset with k clusters to minimize dissimilarities between a data point and the centroid of the cluster to which the data point belongs for all data points. The simplest way to do this is to count the number of mismatching attributes for dissimilarity calculation.
Furthermore, a fuzzy version of the k-modes clustering algorithm was also proposed in [24], in which each data point has membership degrees for k clusters instead of being delegated to one of the k clusters. Similar to the way in which the performance of the k-means clustering was improved by introducing fuzziness, the fuzzy k-modes clustering algorithm has achieved improved clustering results [19, 24].
In addition, attribute weighting has been introduced in clustering algorithms to overcome the problems in high-dimensional spaces. In such spaces, data sparseness, skewness, and irrelevancy or redundancy of some attributes can degrade the clustering performance because of small differences between similar and dissimilar pairs of objects. Similar to other clustering algorithms for categorical data, attribute weighting techniques were tested first for numerical data. For numerical data, attribute weighted clustering has been intensively studied by utilizing common weights for the entire dataset [11, 32] and extended versions of conventional algorithms, which provide different feature weights for each cluster, also called as subspace in such algorithms, have been newly developed [8, 15]. Although these weighting algorithms have been successfully applied to different areas, the application areas are limited to numerical data. Therefore, the clustering algorithms proposed in [22, 23] were modified for datasets with categorical attributes; their effectiveness has been confirmed in calculation of the feature weights of datasets with categorical attributes.
Because of the enhancement induced by weighting in early researches, new weighting algorithms have been continuously proposed. In [1], each categorical variable calculates a similarity measure in a weighted manner, which means that a weighting scheme is embedded in the similarity calculation. This method presents nonzero dissimilarity value, 1 - w kl when x il = x jl where k represents the specific cluster, l denotes one of the categorical variables and i and j represent data points. This is opposite to the Gower similarity [18], which gives zero when x il = x jl . w kl is generally proportional to the number of objects with the certain category of the lth attribute in the kth cluster if xi and xj belong to the kth cluster. This forces a certain category of the lth attribute to have a larger weight when objects with this value are concentrated in the specific cluster. In other studies, the weights of features were calculated based on complementary entropy for the k-modes algorithm [6]. Both [1] and [6] consider distributions of values of categorical attributes to count different effects of features during clustering. However, these works only focus on the individual cluster.
In the present paper, the new attribute weighting method using impurity measures has been proposed in different views to consider the relations among cluster. The main characteristics of the proposed algorithm are as follows: A new weighting method uses impurity measures such as entropy and Gini impurity to determine which attributes are more relevant to separate clusters. A new weighting method considers the distinct number of values of each attribute to calculate weights. Because of the existence of the upper bound of entropy and Gini impurity, a large weight might be applied on categorical attributes with large categories. To remove this effect on weights, the proposed method applies relative entropy and Gini impurity to the upperbound. A new weighting method considers distributions of categories within clusters and between clusters. Conventional weighting methods usually reflect distributions of categories within clusters. They put additional weight on attributes such that values of attributes in each cluster are highly imbalanced. However, the proposed method incorporates the distribution of each value of categorical attributes across clusters.
The rest of the paper is organized as follows. In Sections 2 and 3, detailed reviews of the k-modes and fuzzy k-modes clustering algorithms are provided. In Section 4, the proposed weighting method is described in detail. In Section 5, the experiment results obtained from the proposed algorithm and other comparative algorithms are presented. Finally, conclusion of the paper is given in Section 6.
The k-modes clustering algorithm
The k-modes clustering algorithm, proposed in [23], is an extension of the k-means clustering algorithm that cat be applied for categorical data. The objective of clustering a dataset of n observations with d categorical variables into k clusters, C1, …, C
k
is to minimize the following objective function F.
W = [w
ji
] is a k-by-n matrix and w
ji
is a binary variable that indicates whether object d ( Problem P1: Fix , solve the reduced problem
Problem P2: Fix , solve the reduced problem
Similar to k-means clustering, k-modes clustering also minimizes the objective function in such a way as to use partial optimization for Z and W. In the k-modes clustering, two steps of finding the best Z and W are repeated until convergence.
The solution of problem P1 is given by
The solution of problem P2 is given by
The fuzzy k-modes clustering algorithm is a fuzzy version of the k-modes clustering [24]. In this algorithm, the objective function is not different from that of the k-modes clustering algorithm, but w ji is not a binary variable. In the original k-modes clustering, an object belongs to one of k clusters. However, the fuzzy k-modes clustering algorithm calculates the probability that an object belongs to a cluster. The probability is computed based on dissimilarity between objects and centroids. If an object is close to the specific centroid, it is more probable that this object belongs the cluster represented by the centroid. To reflect this assumption, the probability to belong to certain cluster increases as the dissimilarity between an object and a cluster’s centroiddecreases.
In [24], the objective function of the k-modes clustering algorithm is modified slightly to introduce the fuzzy concept.
If d (
At first, the fuzzy k-modes clustering algorithm only modified the method to calculate w
ji
and each attribute of the cluster’s centroid was set to the specific category if the summation of the membership degree of the objects with the category for that attribute was higher than that of the other categories within the clusters.
In [27], an improved version of the fuzzy k-modes clustering was proposed by introducing fuzzy centroids. The proposed k-modes clustering algorithm calculates the centroids of clusters in a fuzzy way in addition to w
ji
. Instead of hard assignment of the value on z
jl
, fuzzy centroids also calculate the probability that z
jl
is . Similar to the membership degree, the probability, that z
jl
is is proportional to the sum of the jth membership degree of objects whose lth attribute’s value is .
Dissimilarity calculation should be tuned when the fuzzy concept is used for centroids, because it is impossible to determine whether a value of a categorical attribute of an object matches with that of a centroid or not. Therefore, dissimilarity is calculated by averaging dissimilarity of every possible pairs in weighted manner and weights is set to . By doing this, dissimilarity can be obtained as followingway.
Weighting method using within-cluster information
Generally, in the k-modes clustering algorithm, each categorical attribute has an equal effect on the dissimilarity calculation because the dissimilarity of two objects with categorical variables is computed by summation of the dissimilarity of each categorical variable. However, certain attributes are more relevant to separate objects into different clusters. To reflect this situation into k-modes clustering, a weighting scheme should be introduced.
The proposed weighting algorithm is based on the entropy [34] or the Gini impurity [5]. Both the entropy and the Gini impurity are measures of impurity and are widely used measurements in decision tree algorithms; they can be used to obtain the best split at each node of decision trees. These two measures have the maximum value when every outcome have equal probabilities and produce zero when only one outcome exists, which is the reason that they are called impurity measures.
The entropy is defined as
Compared with the entropy, the Gini impurity tends to locate a larger group in a set to minimize misclassification errors (misclassification error is defined as ). In addition, calculating the Gini impurity is slightly faster than calculating the entropy because computing the log requires more computation power. Otherwise, the entropy tends to find groups of classes that make up 50% of data samples.
The main concept of the proposed algorithm is to give a large weight to the variable with the low entropy or the low Gini impurity because if categories of a categorical variable are homogeneously distributed in a cluster, this categorical variable is meaningless when objects are divided into clusters, and the two measures become zero when only one category exists for a categorical attribute in a cluster.
To compute weights for categorical variables, first, calculate the entropy, IE,w (f
lj
) or the Gini impurity, IG,w (f
lj
) of each categorical variable based on the distribution of categories within the clusters.
f
lj
represents a distribution of categories of the lth categorical attribute in the jth cluster
, frequency ratio of the rth category of the lth categorical attribute in the jth cluster is defined as where and n·j is the number of objects in the cluster C
j
From IE,w (f
lj
) or IG,w (f
lj
), the weights, λl,w for categorical variables are computed using the weighted average. For clarity, both IE,w (f
lj
) and IG,w (f
lj
) are represented by I
w
(f
lj
) in common equations because the scheme of calculating weights is the same for both the entropy and the Gini impurity.
In addition, we further modified the Equation (17) to consider the number of categories in a categorical variable. In cases in which the number of categories in a categorical variable is large, I
w
(f
lj
) has a greater probability of being large. Therefore, the possible maximum I
w
(f
lj
) of normalized I
w
(f
lj
) is defined as the function of the number of categories in a categorical variable.
Distributions of categories within clusters are only considered to obtain appropriate weights for categorical variables in many k-modes clustering algorithms. However, in cases where original categories are already inhomogeneously distributed, I w (f lj ) is more likely to be high. For example, when the portion of zero in a binary categorical variable is 90%, Gini impurity is 0.18; the same value is obtained even though every cluster has the same distribution, which means that this categorical variable is not significant for the separation of clusters.
To avoid this drawback in weighting variables, distributions of values of categorical variables across clusters must be considered.
is the frequency ratio of objects whose category value of the lth categorical attribute is in the jth cluster and computed by where and
β
lr
is a weight for each category in the lth variable to account for the number of objects with a certain category value and is calculated by
Similar to the weighting method using within-cluster information, an upper bound of I
b
(g
l
) is set on the basis of the number of clusters k and the number of categories m
l
for the lth categorical variable.
Weights from are obtained in the same way as weights obtained from .
Overall weights of attributes are obtained by the weighted summation of two different weights λl,w(or ) and λl,b (or ); the amount of contribution of each weight on the total weight is adjusted by parameter c (0 ≤ c ≤ 1).
In the previously introduced weighting method, and are computed based on objects that belong to the jth cluster. However, and can be interpreted as the probability that in the jth cluster and the probability that an object with belongs to the jth cluster respectively. Therefore, and can be computed using membership degree. In fuzzy weighting method, is defined as
The proposed k-modes or fuzzy k-modes clustering algorithm considers the different significance of each categorical attribute on clustering. Thus, the purpose of the proposed algorithm is to optimize the objective functions, which include attribute weights. The objective function of the proposed k-modes algorithm is defined as
Step 1: Fix , update W
Step 2: Fix , update Z
Step 3: Fix , update Λ
These steps are repeated until any terminal condition is fulfilled; terminal conditions generally include the convergence of Z, W, and Λ, or the reaching of the maximum iteration defined by a user in advance.
Datasets
For experiments to evaluate the proposed algorithm, the six standard datasets obtained from the UCI Machine Learning Repository [28] were used. These datasets are introduced as follows. Soybean data: This dataset has 47 data samples and 35 attributes. The class output is labeled as one of the four diseases, Diaporthe Stem Canker, Charcoal Rot, Phizoctonia Root Rot, and Phytophthora Rot. Zoo data: This datasets consists of 101 instances and 15 categorical variables. The purpose of this dataset is to classify the type of animal based on characteristics. Each animal is classified into 7 different categories. Horse colic data: Original dataset has 368 samples and 22 attributes; it is a binary classification problem that tries to determine whether a lesion of colic of a horse is surgical. However, some attributes have too many missing values and the 7 features are not categorical. Therefore, attributes with more than 100 missing values and numerical attributes are removed. In addition, rows with any missing values are also eliminated after removing certain features. Finally, 215 samples remain with 9 categorical attributes. Among them, the number of samples that requires surgical treatment is 134. Heart disease data: This dataset has 303 instances with eight categorical and five numerical attributes generated at the Cleveland Clinic. It is a binary classification dataset with normal patients (164 objects) and heart patients (139 objects). Credit approval data: It consists of 690 customers divided into two classes, negative (383 objects) and positive (307 objects), collected from credit card organizations. It has eight categorical and six numeric features. In this paper, only categorical attributes are selected for experiments. Breast cancer data: It consists of 699 data objects and 9 categorical attributes. It has two clusters, benign (458 objects) and malignant (241 objects). This dataset was obtained from the University Medical Center, Institute of Oncology, Ljubljana, Yugoslavia.
The number of objects, categorical features, and clusters for the datasets are summarized in Table 2.
Similarity measure
Unlike numerical variables, the values of categorical variables represent different categories; thus, difference or distance in values is irrelevant. Therefore, dissimilarity measures for categorical variables are based on matching. If the number of matched categorical attributes of two objects increases, the two objects are more likely to be similar.
Dissimilarity is the opposite of similarity. This value increases when two objects are less similar. Usually, dissimilarity is defined as d (xi, xj) =1 - sim (xi, xj), which justifies the concept of complementarity in similarity and dissimilarity measures. Through this relation, the decision to one of the measures determines the remaining measure.
To improve the clustering performance, several studies have investigated new similarity measures to reflect the characteristics of categorical variables, such as Gower similarity [18], Eskin similarity [13], inverse occurrence frequency similarity [10], occurrence frequency similarity [4], Goodall similarity [17], and so on.
In the present paper, the similarity measure to calculate the dissimilarity between two objects is determined as Gower similarity. It is the most popular similarity measure for categorical variables. It has only a binary output, 0 or 1, depending on matching of an attribute [18].
To ensure a fair comparison, the parameter setting for the experiments is determined in the following steps: The number of clusters, k is set to the number of classes for each dataset. Because clustering performance depends on the initial centroids of the clusters, initial centroids are randomly selected and the same procedures are repeated 100 times. In each iteration, the same initial centroids are used in the different clustering algorithms. For fuzzy k-modes clustering, parameter e, controlling the fuzziness, should be set in advance; it also affects the performance of the clustering results. In several papers related to fuzzy k-modes clustering with categorical attributes [2, 27], the clustering performance for categorical data with smaller e is usually better than that with larger e. Generally, the best clustering performance has been reported when e is 1.1. For this reason, e is set at 1.1. Λ is initialized as (1, 1,. . . , 1). To control the balance between within-cluster information and between-cluster information in the calculation of the weights, parameter c is changed. The effectiveness of c is tested during experiments.
Evaluation metrics
In the performance analysis, four widely used metrics were adopted to evaluate the results of the proposed clustering algorithm. category utility function: The category utility (CU) function is a measure of the probability that two objects in the same cluster have the same attribute values and the probability that objects from different clusters have different attribute values. CU function is defined as
accuracy: The accuracy (AC) is an external criterion that takes advantage of true class labels to evaluate the performance of clustering. AC is defined as follows
adjusted Rand index: The adjusted Rand index (ARI) is an external performance measure, which means that it requires knowledge of the ground truth cluster assignments [25]. ARI is a function that measures the similarity of the two assignments, ignoring chances of normalization and permutation. To obtain ARI, the raw Rand index should be computed first; it is given by
partition entropy coefficient: The partition entropy (PE) coefficient is a validity index for fuzzy clustering. PE measures the fuzziness of the clustering results, and has a smaller value when the clustering result has less uncertainty. This metric is only computed for fuzzy algorithms.
In this paper, to prove the value of the proposed method, the k-modes clustering and fuzzy k-modes clustering algorithms are compared with the k-modes clustering algorithm combined with the proposed weighting method. The same experimental procedures with the same parameter setting are applied to the different clustering algorithms, with weights calculated using the entropy and the Gini impurity. The effectiveness of fuzziness in the calculation of weights is verified through experiments. For clarity, the clustering algorithms tested in this paper are denoted as follows: k-modes clustering as
In addition, the balancing parameter c is changed in the range of [0,1] with interval 0.1. Clustering performance is evaluated using four different metrics.
In Tables 3–4, the clustering results for the datasets are summarized. In each cell, the first number represents the average value for 100 runs; the second number is the standard deviation. Results from the clustering algorithms with the feature weighting are reported separately with respect to weightingmethods.
Based on the experimental results, in can be seen that fuzzy k-modes clustering generally outperforms k-modes clustering in all datasets and for all evaluation metrics except PE. In addition, common k-modes and fuzzy k-modes clustering algorithms can be improved to utilize the proposed methods, which calculate weights of categorical variables in terms of CU, AC, and ARI. After introducing weights for calculation of similarity, PE increases in some datasets depending on the weighting methods, which means that the membership degrees of objects become less certain. However, the membership degrees do not have a large effect on the final clustering results because each object is assigned to one of the clusters by calculating the maximum member degrees in actual situations. Between entropy and the Gini impurity, entropy usually provides a smaller PE, possibly because the same measure, entropy, is used to obtain weights.
In terms of standard deviation, the proposed algorithm reduces the standard deviation compared with original methods, which means that the proposed algorithm groups objects into stable clusters regardless of random initial centroids. This fact is an advantage in real-world applications because users do not need to apply algorithms to the same data several times to obtain improved clustering results.
In Tables 4, the best balancing parameter c, the average, and the standard deviation of the four evaluation metrics regardless of c are summarized. The average and standard deviation are calculated from 1,100 clustering results (11 different c × 100). The optimal value of c is different according to datasets and weighting methods. However, the average values of the evaluation metrics regardless of c degraded slightly from the optimum; the deviations are also not significant. In addition, the best c usually is not located in the center in the range of [0,1]; for most cases, it is close to 0 or 1. Therefore, selecting small c and large c, and selecting c based on CU or PE instead of applying several values of c within [0,1] is a good approach to save time. Moreover, for fuzzy k-modes clustering, PE is usually more correlated with AC and ARI with CU; thus, PE may need to be considered further to determine the appropriate c.
Conclusion
This paper proposed a new weighting method for categorical attributes to improve k-modes clustering results in both ordinal and fuzzy algorithms. The key concept of the proposed algorithm is to increase the certainty of the values of categorical attributes within the cluster. To measure this certainty, the entropy and the Gini impurity were introduced to weighting features. These two metrics have high values when attributes are not very useful to separate clusters; thus, weights are proportional to the subtraction of the entropy or the Gini impurity from 1. This method is simple and easy to implement as well as universal for categorical variables.
Experiments were systematically conducted to evaluate the performance of the proposed method with k-modes and fuzzy k-modes clustering algorithms. To this end, experiments were conducted over six different datasets; and four different evaluation metrics were used for comparison. The results of the experiments confirmed that the performance of the k-modes and fuzzy k-modes clustering algorithms was enhanced overall by introducing the proposed weighting method. For some datasets, PE increased with the introduction of weights, although the final clustering assignment was not much affected by PE. Other metrics, such as AC and ARI, related to cluster assignment, showed improved results in the proposed method. In addition, the best performance of the proposed algorithm was achieved when c > 0 for all the datasets, which proves that between-cluster information is helpful to calculate appropriate weights.
Even though the k-modes clustering algorithm with the proposed weighting method provides improved clustering results, selection of a balancing parameter c is important; however, in this paper, a rough guideline to select c was presented. Therefore, a more detailed analysis of c and the construction of guidelines to choose the optimal value of c should be performed in the future. In addition, the proposed algorithm can be extended to numerical attributes by introducing inhomogeneous measure for numerical features and can be applied to data with mixed attributes.
Footnotes
Acknowledgments
This study was financially supported by Seoul National University of Science & Technology.
