Abstract
In this paper we present a new c-means clustering algorithm for combined continuous-nominal data. We use spherical representation of nominal data. The impact of specific features is modeled with corresponding weights in metric definition. To solve constrained minimization problem we transfer the methodology of reformulation functions to the considered context. As a result we obtain a clustering algorithm with adaptation of weights. Series of numerical experiments on real and synthetic data show that the algorithm can successfully cluster raw, non-normalized data.
Keywords
Introduction and notations
The purpose of this article is further development of algorithms for analysis of combinations of continuous and nominal data by embedding of the discrete nominal part into continuous metric space.
The general method is described in [7]. Assume that we are given some problem, defined on points of a metric space
In this paper we assume that different features has distinct impact on the clusters’ structure. So, the initial metric on the space
We develop a new variant of the c-means clustering algorithm under the above assumption. Namely, we assume that clusters are defined by the nearest-prototype condition and obtain an algorithm for determining prototypes and metric weights that minimizes the objective function Eq. (10). Our method uses the framework of reformulation functions [9], which was adapted to the case of non-linear metric space.
Besides, we deduce an explicit formula for the complete residual error estimation. This estimation appeared to be very useful in our numerical experiments. Namely, we applied the following simple heuristics: start with five randomly initial prototypes and chose the one with the less error estimation. This heuristics allows us to decrease the strong dependence of the c-means clustering algorithm on initial data, which is known to be the main shortcoming of such algorithms.
The paper is organized as follows. The second section is devoted to a short description of background and related works. In the next section we introduce all the necessary concepts and define an embedding of the data space into a metric space. The reformulation function is defined and analyzed in the Section 4. The fifth and sixth sections describe, correspondingly, algorithms of prototypes and weights determining. The complete residual error of the algorithm is estimated in the Section 7. The next section presents numerical experiments. Finally, we give some conclusions and remarks. The article has an appendix, in which we prove that the proposed embedding of the weighted dataset space into the weighted continuous metric space has the same distortion as embedding without weights, introduced in [4].
Let us fix some conventions and notations, used in the article.
In this paper we will consider the data set
In the Section 3 we define embedding of the nominal part into the unit sphere and will keep the same notations for the embedded data. So, the components of the nominal part have the following representation:
For two vectors
will always denote the standard Euclidean scalar product.
We consider the problem of partitioning the data to
For a record
where
Clustering is an important part of various data analysis algorithms. It is of active study in recent decades. Many different approaches were worked out in clustering.
Our study concerns data with continuous and nominal features. Such data occur naturally in several contexts, for instance, in medical or social investigations [13].
Many contemporary approaches to clustering of continuous-nominal data rely on a similarity matrix: hierarchical clustering, partitioning around medoids [10], spectral clustering [14]. However, constructing of the similarity matrix itself needs very subtle and crucial preprocessing stage.
Our approach does not demand any preprocessing. It can be applied directly to raw, unprocessed data in case when no domain knowledge is available. Nevertheless, some domain knowledge can be useful. Specifically, it can suggest certain initial guess, that is close to the sought metric. In such a way algorithm has good chances to avoid local minimums. Falling into a local minimum in know to be a disadvantage of gradient-based methods (Section 5).
The proposed algorithm emerges from two works. It is essentially based on the methodology of reformulation functions. This methodology was developed by Karayiannis and Randolph-Gips in [9] for adaptation of weights in case of continuous data.
The second methodology is embedding of the nominal part of data into a sphere.
In our previous works [3, 4] we considered the standard Hamming space as a nominal part of
Embedding into a sphere and weighted metric
In this section we will define an embedding of the Hamming metric space into the sphere, and the weighted quadrance distance.
The standard Hamming metric on the set of nominal data is defined as follows:
where
Combining the Hamming metric with the standard Euclidean metric on
We do not assume that any additional structure is given for continuous data, so we use the Euclidean metrics as the most common and straightforward way for measuring distances between continuous data [10].
We assume that different features have distinct impact into the clusters’ structure. Let us introduce the weights vector:
where
The assumption is that clusters are formed with respect to the weighted distance:
Now let us remind the embedding of the finite discrete space into the unit sphere according to [3]. Consider at first a discrete set
Consider now the case when the set of nominal data is a product
.
The embedding of the space
where
To determine the weights vector
.
[4] Let two points
where
where
We chose the quadrance metric because the embedding Eq. (7) from the standard Hamming space into the sphere has less distortion than embedding into the sphere with the standard metric [4]. On other hand, it is possible to generalize to that space linear methods of clustering.
.
Let two points
We show in the appendix that the distortion of the embedding Eq. (7) from the space with weighted Hamming metric into the sphere has the same estimate as the one obtained in [4] for the standard Hamming and quadrance metrices.
Combining Eq. (8) with the weighted Euclidean metric we obtain the product metric on the cylinder
where
The c-means clustering algorithm is based on minimizing of the following objective function [2]:
To overcome analytical difficulties, that are caused by the minimum operator, we will use the method of reformulation functions.
The reformulation function is defined as follows [9]:
The method is based on the following relation:
So, instead of minimizing
Differentiation of
where
So,
Let us rewrite the coefficient at
and calculate its limit as
Note that
For the second fraction we have:
Since each fraction in the above equality is less then 1,
Summarizing up, we have the following formula for the limit of the derivative of the reformulation function:
To determine the prototypes we assume that the weights vector
Projection onto constraints surface, i.e. normalization of
Consider at first continuous part of prototype
where
Taking in consideration Eq. (14), the first step of the method becomes
One can chose
i.e. the usual c-means.
For the nominal part:
So, the first step of the method takes the following form:
Again, by choosing
followed by normalization:
where
Note that the Eq. (16) contains the previous centroids, so the above calculation followed by normalization should be repeated iteratively to approximately minimize the objective function Eq. (10).
in practical implementation.
The weight vector
where
Note that the parameter
Following the methodology of reformulation functions, we replace the objective function with the reformulation function Eq. (11) and consider appropriate constrained minimization problem in the limit
The optimal weight vector is determined from the following conditions:
where
Using Eq. (14) and the following relations
Equation (19) is rewritten as follows:
The Eq. (20) can be resolved with respect to
where
Equation (21) implies an analogous formula for
The vector
where
for
Note that the matrix
Substituting Eqs (23) and (25) into Eq. (22), one can eliminate Lagrange multiplier
Now, let us discuss methods to solve the Eq. (26). Note that this equation has iterative form and immediately implies the simple fixed-point iteration method. However, if the data are distributed close to the cluster centroids, one can expect that the sum Eq. (27) is very small, so the norm of the matrix
So, we propose a relaxation iterative method:
where
where
It turned out in all our numerical experiments that the proposed method provides satisfactory approximation.
The residual error can be obtained using Eq. (12) as
where
where
For the continuous part we have the following representation:
where
Due to usage of the quadrance metric, the nominal part of the residual error does not split with respect to particular features as explicitly as the continuous part. Nevertheless, it is possible to obtain for it the similar representation:
where matrix
Summarizing, we get the following formula for the total residual error:
Let us make the following substitution:
where
does not depend on
The formula for the total residual error will be transformed as follows:
where
The estimate (Eq. (33)) was used in numerical experiments (Section 8).
Let us make one more remark concerning the meaning of parameter
Since the generalized mean
i.e., every feature has the same residual error. Hence, for small values of
The above considerations are summarized up in the Algorithm 8.
[htb] C-means clustering algorithm
We applied the following simple heuristic: start with five random collections of
Since the Eq. (28) involve parameter
Following the original work of Karayiannis and Randolph-Gips [9], we accept the following two measures of clustering quality. Both are based on the coverage of predefined groups by the computed clusters.
Let two finite sets
The first measure (cv) is the mean coverage degree of the intended predefined groups by the corresponding computed clusters.
As the second measure (suc) we consider the percentage of least-half covering, i.e. the percentage of initial sets of centroids from 30 random initial collections of five prototypes, for which the algorithm computes clusters that cover the predefined groups in at least one-half degree. This measure makes sense for such algorithms, that computation results depend on some randomly chosen initial data. This is the case for the c-means clustering.
In order to demonstrate the improvement of clustering, we compare our new algorithm with the previous one [4], without weight optimization. Besides, we compare the weighted metric clustering algorithm with other approaches to clustering of mixed data: hierarchical clustering, partitioning around medoids and spectral clustering.
Our new algorithm works on raw data and does not require any normalization. For the previous one we applied the normalization to
As a benchmark sets we took four real training sets with decision categories as predefined groups: Australian Credit Approval, Heart Disease, Hepatitis, and Bank Marketing. These datasets are described particularly in [1]. Specifically, let us note that missing data in Hepatitis and Bank Marketing data sets were reinserted according to probability distribution – a suitable normal distribution for continuous coordinates and appropriate discrete distribution for nominal features.
Heart disease
The Heart Disease dataset consists of 6 continuous, 7 nominal attributes, 370 records, 2 decision categories. The results of experiments are presented in the Table 1. One can see that particular high coverage degree is achieved for
The results of numerical experiments for the Heart Disease data set. The mean coverage degree (cv), the percentage of least-half covering (suc), average number of iterations (
), the average total residual error (
)
The results of numerical experiments for the Heart Disease data set. The mean coverage degree (cv), the percentage of least-half covering (suc), average number of iterations (
This dataset consists of 6 continuous, 8 nominal attributes, 690 records, 2 decision categories. The results of experiments for this dataset are presented in the Table 2. One can see significant improvement with respect to the algorithm based on unweighted quadrance metric.
The results of numerical experiments for the Australian Credit Approval data set. The mean coverage degree (cv), the percentage of least-half covering (suc), average number of iterations (
), the average total residual error (
)
The results of numerical experiments for the Australian Credit Approval data set. The mean coverage degree (cv), the percentage of least-half covering (suc), average number of iterations (
Two continuous features in this dataset have very large variance. It is caused primarily by the fact that the numbers contained in corresponding two columns are very big comparing to other continuous data. However, we found out that these two features have observably bigger entropy than the remaining continuous features. We observe that corresponding computed weights are very small and recompense variance of these two features.
The Hepatitis dataset contains 6 continuous, 13 nominal attributes, 155 records, 2 decision categories. The results of experiments are presented in the Table 3. One can see that different values of
The results of numerical experiments for the Hepatitis data set. The mean coverage degree (cv), the percentage of least-half covering (suc), average number of iterations (
), the average total residual error (
)
The results of numerical experiments for the Hepatitis data set. The mean coverage degree (cv), the percentage of least-half covering (suc), average number of iterations (
This dataset contains 7 continuous, 9 nominal attributes, 4521 records, 2 decision categories. The results of experiments are presented in the Table 4. In this case the measure suc has considerably improvement from 38% to 100%.
The results of numerical experiments for the Bank Marketing data set. The mean coverage degree (cv), the percentage of least-half covering (suc), average number of iterations (
), the average total residual error (
)
The results of numerical experiments for the Bank Marketing data set. The mean coverage degree (cv), the percentage of least-half covering (suc), average number of iterations (
We have tested our algorithm on artificial dataset as well. Each record has 5 continuous and 5 nominal features. The cardinalities of nominal domains are arbitrary chosen as 2, 4, 2, 3 and 5. We use the weighted metric Eq. (6) with the following weights: 4.5, 2.8, 1.5, 0.0001, 0.0001 for continuous part of data and 0.01, 0.01, 1.15 and 2.5 for nominal part respectively. Predefined intended groups are two balls with random centers in this metric space. The radius of each ball is equal to 1/2 of distance between the centers. Analyzed data, 200 for each group are randomly chosen from corresponding balls.
The results of numerical experiments for the Artificial data set. The mean coverage degree (cv), the percentage of least-half covering (suc), average number of iterations (
), the average total residual error (
)
The results of numerical experiments for the Artificial data set. The mean coverage degree (cv), the percentage of least-half covering (suc), average number of iterations (
The results of experiments are presented in the Table 5. The analyzed dataset was generated randomly 10 times for each value of
The two features with little weights have small impact to the structure of predefined groups. We observe that our algorithm stably adapts the sought weights in appropriate way.
We see also that the values of both measures (cv, suc) are significantly better than that of clustering without weights adaptation (
The Table 6 contains comparison of the weighted metric clustering algorithm with other approaches to clustering of mixed data: hierarchical clustering, partitioning around medoids and spectral clustering, which are very important approach in most real-world applications. We do not show the measure suc for algorithms Diana, Agnes, PAM and SC since it has sense only for algorithms, for which the result of computation depends on random choice of some initial data.
Comparison with other clustering approaches. Diana – the divisive hierarchical clustering algorithm, Agnes – the agglomerative hierarchical clustering algorithm, PAM – partitioning of the data into clusters “around medoids”, SC and SCOpt – spectral clustering
Comparison with other clustering approaches. Diana – the divisive hierarchical clustering algorithm, Agnes – the agglomerative hierarchical clustering algorithm, PAM – partitioning of the data into clusters “around medoids”, SC and SCOpt – spectral clustering
For hierarchical clustering and partitioning around medoids we used their implementations in R-package cluster [12].
Diana – the divisive hierarchical clustering algorithm, implemented in function diana.
Agnes – the agglomerative hierarchical clustering algorithm, implemented in function agnes.
PAM – partitioning of the data into clusters “around medoids”, implemented in function pam.
The dissimilarity matrix for all this algorithms was calculated with the standard daisy function from the cluster package.
For spectral clustering we used its implementation in R-package kernlab [8]. We tested two variants: SC – the algorithm with the kernel function
The scaling coefficient
SCOpt – the algorithm with the kernel function
The results of experiments with Diana, Agnes, PAM, SC and SCOpt algorithms were presented in the paper [4] and we recall them here.
Continuous parts of the analyzed data sets were standardized to
For the artificial data sets defined in Subsection 8.5 Diana, Agnes, Pam, SC, SCOpt algorithms produce very bad results (again
The error analysis, performed at the end of the Section 7, implies that the optimal value of parameter
The results of experiments show that the weighted metric clustering can be considered as a comparable alternative for other modern algorithms.
In this work we transferred the methodology of reformulation functions [9] to the case of combined continuous-nominal data. We used the spherical representation of nominal data [4]. The impact of specific features is modeled with corresponding weights in metric definition (Eq. (6)).
As a result we obtained a clustering algorithm with adaptation of weights. It turns out that this algorithm can successfully cluster raw, non-normalized data.
Additionally, we obtained an estimation of total residual error of clustering (Eq. (33)). This estimation allowed us to improve the algorithm by diminishing of dependence of c-means clustering on guess of initial prototypes.
The algorithm, developed in our previous paper [4], was based on the metric of the form
In order to achieve reasonable results of clustering, the parameter
Let us mention that as a side effect of the used methodology, some new characteristics of the data appeared: matrix
As a possible application of the proposed algorithm, let us mention that adapted metric could be considered as a starting point for the other machine learning algorithms based on a dissimilarity matrix.
Footnotes
Appendix
Distortion of embedding of the weighted Hamming metric space into a sphere
In this section we estimate the distortion of embedding (Eq. (7)) with respect to the weighted Hamming metric on the discrete space and the weighted quadrance metric on the sphere.
Let
where
The weighted quadrance metric after the embedding Eq. (7) into the unit sphere is defined in Eq. (8) as
The distortion of a mapping measures how the metric changes after the mapping. The formal definition is as follows.
The following theorem estimates the distortion of embedding Eq. (7).
where
