Embedding of the Hamming space into a sphere with weighted quadrance metric and c-means clustering of nominal-continuous data

Abstract

In this paper we present a new c-means clustering algorithm for combined continuous-nominal data. We use a spherical representation of the nominal data. The impact of specific features is modeled by corresponding weights in the metric definition. To solve the constrained minimization problem, we transfer the methodology of reformulation functions to the considered context. As a result, we obtain a clustering algorithm with adaptation of weights. A series of numerical experiments on real and synthetic data shows that the algorithm can successfully cluster raw, non-normalized data.

Keywords

C-means clustering; weighted metric; nominal-continuous data; low-distortion embedding; quadrance; reformulation

1. Introduction and notations

The purpose of this article is the further development of algorithms for the analysis of combinations of continuous and nominal data by embedding the discrete nominal part into a continuous metric space.

The general method is described in [7]. Assume that we are given some problem, defined on points of a metric space $U$ . To solve this problem, we embed the space $U$ into some simpler metric space $U^{\prime}$ and solve the problem there. If the embedding is not isometric, the developed algorithm is only an approximate solution of the initial problem. The measure of quality of the approximation is the distortion of embedding.

In this paper we assume that different features have distinct impact on the clusters’ structure. So the initial metric on the space $U$ , in both the continuous and the nominal parts, is perturbed by unknown weights that correspond to the impact of each feature, Eq. (6). The metric on the continuous space $U^{\prime}$ is modified in an appropriate way, Eq. (9).

We develop a new variant of the c-means clustering algorithm under the above assumption. Namely, we assume that clusters are defined by the nearest-prototype condition and obtain an algorithm for determining prototypes and metric weights that minimizes the objective function Eq. (10). Our method uses the framework of reformulation functions [9], which was adapted to the case of non-linear metric space.

Besides, we deduce an explicit formula for the estimate of the complete residual error. This estimate proved very useful in our numerical experiments. Namely, we applied the following simple heuristic: start from five random initializations of the prototypes and choose the result with the smallest error estimate. This heuristic reduces the strong dependence of the c-means clustering algorithm on initial data, which is known to be the main shortcoming of such algorithms.

The paper is organized as follows. The second section is devoted to a short description of background and related works. In the next section we introduce all the necessary concepts and define an embedding of the data space into a metric space. The reformulation function is defined and analyzed in Section 4. The fifth and sixth sections describe, respectively, the algorithms for determining the prototypes and the weights. The complete residual error of the algorithm is estimated in Section 7. The next section presents numerical experiments. Finally, we give some conclusions and remarks. The article has an appendix, in which we prove that the proposed embedding of the weighted dataset space into the weighted continuous metric space has the same distortion as the embedding without weights, introduced in [4].

Let us fix some conventions and notations, used in the article.

In this paper we will consider the data set $\mathbb{X}=\{\,\mathbf{X}_{1},\ldots,\mathbf{X}_{M}\,\}$ of $M$ records. Each record consists of two parts $\mathbf{X}_{i}=(X_{i},Y_{i})$ , where $X_{i}=(x_{i}^{1},\ldots,x_{i}^{n})\in\mathbb{R}^{n}$ are the continuous data, and $Y_{i}=(y_{i,1},\ldots,y_{i,m})$ are the nominal data, $i=1,\ldots,M$ . We assume that nominal feature $y_{\beta}$ has $n_{\beta}+1$ variants, where $n_{\beta}\geqslant 1$ , $\beta=1,\ldots,m$ , i.e. $n_{\beta}+1$ is equal to the cardinality of the $\beta$ -th nominal domain.

In Section 3 we define an embedding of the nominal part into the unit sphere and keep the same notation for the embedded data. So the components of the nominal part have the following representation: $y_{\beta}=(y_{\beta}^{1},\ldots,y_{\beta}^{n_{\beta}})\in\mathbb{R}^{n_{\beta}}$ , and the nominal feature vector $Y$ belongs to the unit sphere $\mathbb{S}^{s-1}\subset\mathbb{R}^{s}$ , where $s=n_{1}+\ldots+n_{m}$ , i.e. $\|Y\|^{2}=\sum_{\beta=1}^{m}\|y_{\beta}\|^{2}=1$ . Here and thereafter $\|y_{\beta}\|$ always represents the Euclidean norm of the vector: $\|y_{\beta}\|^{2}=\sum_{\alpha=1}^{n_{\beta}}(y_{\beta}^{\alpha})^{2}$ . In this way the data can be considered as points of a cylinder $\mathbb{D}=\mathbb{R}^{n}\times\mathbb{S}^{s-1}$ .

For two vectors $X,Y\in\mathbb{R}^{d}$

$\displaystyle\langle X,Y\rangle=\sum_{i=1}^{d}X_{i}Y_{i}$ (1)

will always denote the standard Euclidean scalar product.

We consider the problem of partitioning the data to $c$ clusters, $\mathbb{X}=C_{1}\cup\ldots\cup C_{c}$ , where $c<M$ . Clusters are represented by prototypes (or centroids) $\mathbf{Y}_{j}\in\mathbb{D}$ , $j=1,\ldots,c$ . We will use the following convention for notations of continuous and nominal parts of prototypes: $\mathbf{Y}_{j}=(\chi_{j},\nu_{j})$ , where $\chi_{j}=(\chi_{j}^{1},\ldots,\chi_{j}^{n})\in\mathbb{R}^{n}$ , $\nu_{j}=(\nu_{j,1},\ldots,\nu_{j,m})\in\mathbb{S}^{s-1}$ , and $\nu_{j,\beta}=(\nu_{j,\beta}^{1},\ldots,\nu_{j,\beta}^{n_{\beta}})$ for $\beta=1,\ldots,m$ . The array of prototypes will be denoted by $\mathbf{Y}=(\mathbf{Y}_{1},\ldots,\mathbf{Y}_{c})$ .

For a record $\mathbf{X}_{i}$ and a prototype $\mathbf{Y}_{j}$ we define the indicator function with respect to a distance:

$\displaystyle\operatorname{ind}_{i,j}=\begin{cases}1,&\text{ if }\operatorname{dist}(\mathbf{X}_{i},\mathbf{Y}_{j})<\operatorname{dist}(\mathbf{X}_{i},\mathbf{Y}_{l}),\quad\forall l\neq j,\\ 0,&\text{ otherwise,}\end{cases}$ (2)

where $i=1,\ldots,M$ , $j=1,\ldots,c$ . The indicator function will always be considered with respect to the weighted distance, defined in Eq. (9).
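The nearest-prototype rule of Eq. (2) can be sketched in a few lines of Python. The function name `indicator_row` and the sample distances are illustrative assumptions. With the strict inequality of Eq. (2), a tie for the smallest distance gives a row of zeros.

```python
def indicator_row(dists):
    """Return [ind_{i,1}, ..., ind_{i,c}] for one record, given its distances to the c prototypes."""
    best = min(dists)
    # Eq. (2) uses a strict inequality, so the minimum must be attained exactly once.
    unique = dists.count(best) == 1
    return [1 if (unique and d == best) else 0 for d in dists]


print(indicator_row([3.0, 1.0, 2.0]))  # unique nearest prototype: the second one
print(indicator_row([1.0, 1.0, 2.0]))  # tie: no prototype satisfies Eq. (2)
```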

2. Background and related work

Clustering is an important part of various data analysis algorithms. It has been studied actively in recent decades, and many different approaches have been developed.

Our study concerns data with continuous and nominal features. Such data occur naturally in several contexts, for instance, in medical or social investigations [13].

Many contemporary approaches to clustering of continuous-nominal data rely on a similarity matrix: hierarchical clustering, partitioning around medoids [10], spectral clustering [14]. However, constructing the similarity matrix itself requires a very subtle and crucial preprocessing stage.

Our approach does not demand any preprocessing. It can be applied directly to raw, unprocessed data when no domain knowledge is available. Nevertheless, some domain knowledge can be useful. Specifically, it can suggest an initial guess that is close to the sought metric. In this way the algorithm has a good chance of avoiding local minima. Falling into a local minimum is known to be a disadvantage of gradient-based methods (Section 5).

The proposed algorithm emerges from two works. It is essentially based on the methodology of reformulation functions. This methodology was developed by Karayiannis and Randolph-Gips in [9] for adaptation of weights in case of continuous data.

The second methodology is embedding of the nominal part of data into a sphere.

In our previous works [3, 4] we considered the standard Hamming space as the nominal part of $U$ and a sphere as the corresponding metric space $U^{\prime}$ . The continuous part of the data is assumed to be embedded into the standard Euclidean space. We analyzed different embeddings, reformulated the c-means clustering algorithm for a sphere, and showed that an embedding with lower distortion yields a better-quality algorithm. Specifically, in [4], instead of the standard metric on the sphere, we proposed to use the quadrance distance. It turned out that the embedding into such a metric space has very low distortion and the corresponding algorithm has a quality comparable to other modern approaches to clustering of combined continuous-nominal data. The algorithm proposed in this article is as simple to implement as the previous one, but outperforms it when tested on different data sets.

3. Embedding into a sphere and weighted metric

In this section we will define an embedding of the Hamming metric space into the sphere, and the weighted quadrance distance.

The standard Hamming metric on the set of nominal data is defined as follows:

$\displaystyle\operatorname{dist}_{h}(Y_{1},Y_{2})=\frac{1}{m}\Bigl|\{\,\beta=1,\ldots,m\mid y_{1,\beta}\neq y_{2,\beta}\,\}\Bigr|=\frac{1}{m}\sum_{\beta=1}^{m}\operatorname{diff}(y_{1,\beta},y_{2,\beta}),$ (3)

where

$\displaystyle\operatorname{diff}(t_{1},t_{2})=\begin{cases}1&\text{ if }t_{1}\neq t_{2},\\ 0&\text{ if }t_{1}=t_{2}.\end{cases}$ (4)

Combining the Hamming metric with the standard Euclidean metric on $\mathbb{R}^{n}$ , we define the product metric on the data space $\mathbb{X}$ :

$\displaystyle\operatorname{dist}^{2}(\mathbf{X}_{1},\mathbf{X}_{2})=\operatorname{dist}^{2}_{e}(X_{1},X_{2})+\operatorname{dist}_{h}^{2}(Y_{1},Y_{2})=\sum_{\alpha=1}^{n}(x_{1}^{\alpha}-x_{2}^{\alpha})^{2}+\left(\frac{1}{m}\sum_{\beta=1}^{m}\operatorname{diff}(y_{1,\beta},y_{2,\beta})\right)^{2}.$

We do not assume that any additional structure is given for continuous data, so we use the Euclidean metric as the most common and straightforward way of measuring distances between continuous data [10].

We assume that different features have distinct impact on the clusters’ structure. Let us introduce the weights vector:

$\displaystyle\mathbf{W}=(W,U)=(w_{1},\ldots,w_{n},u_{1},\ldots,u_{m}),$ (5)

where $w_{\alpha}>0$ , $u_{\beta}>0$ for $\alpha=1,\ldots,n$ , $\beta=1,\ldots,m$ .

The assumption is that clusters are formed with respect to the weighted distance:

$\displaystyle\operatorname{dist}_{\mathbf{W}}^{2}(\mathbf{X}_{1},\mathbf{X}_{2})=\operatorname{dist}_{W,e}^{2}(X_{1},X_{2})+\operatorname{dist}_{U,h}^{2}(Y_{1},Y_{2})=\sum_{\alpha=1}^{n}w_{\alpha}^{2}(x_{1}^{\alpha}-x_{2}^{\alpha})^{2}+\left(\sum_{\beta=1}^{m}u_{\beta}\operatorname{diff}(y_{1,\beta},y_{2,\beta})\right)^{2}.$ (6)

Now let us recall the embedding of a finite discrete space into the unit sphere according to [3]. Consider first a discrete set $A=\{\,a_{1},\ldots,a_{n}\,\}$ of equidistant points. Place each element of $A$ at a vertex of a regular $(n-1)$ -simplex centered at the origin of $\mathbb{R}^{n-1}$ . Let $(e_{1},\ldots,e_{n-1})$ be the coordinates of a vertex of the simplex. We will assume that the radius of the simplex is equal to one, i.e. $\sum e_{i}^{2}=1$ .
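One possible way to obtain such simplex coordinates is sketched below. The construction (fixing the vertices one coordinate at a time so that all pairwise scalar products equal $-1/(k-1)$ ) is a standard one chosen here for illustration, not necessarily the one used in [3]. The function name `simplex_vertices` and the three-variant example are assumptions.

```python
import math


def simplex_vertices(k):
    """Return k unit vectors in R^(k-1) forming a regular simplex centered at the origin."""
    dim = k - 1
    verts = [[0.0] * dim for _ in range(k)]
    for i in range(dim):
        # Put vertex i on the unit sphere using coordinate i.
        used = sum(verts[i][t] ** 2 for t in range(i))
        verts[i][i] = math.sqrt(1.0 - used)
        # All later vertices must have scalar product -1/dim with vertex i.
        for j in range(i + 1, k):
            dot = sum(verts[i][t] * verts[j][t] for t in range(i))
            verts[j][i] = (-1.0 / dim - dot) / verts[i][i]
    return verts


# Hypothetical nominal feature with three variants, e.g. {"red", "green", "blue"}.
for v in simplex_vertices(3):
    print([round(t, 4) for t in v])
```

Since every vertex has unit norm and all pairwise scalar products are equal, the variants of the feature are pairwise equidistant, as the Hamming metric requires.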

Consider now the case when the set of nominal data is a product $A=A_{1}\times\ldots\times A_{m}$ , where $|A_{\beta}|=n_{\beta}+1$ for $\beta=1,\ldots,m$ . The desired embedding into the unit sphere is defined as follows.

Definition 1.

The embedding of the space $A$ into the unit sphere $\mathbb{S}^{s-1}\subset\mathbb{R}^{s}$ is defined by the following formula:

$\displaystyle\phi:A=A_{1}\times\ldots\times A_{m}\to\mathbb{R}^{s},$ (7)

$\displaystyle\phi:Y=(y_{1},\ldots,y_{m})\mapsto\frac{1}{\sqrt{s}}(y^{1}_{1},\ldots,y^{n_{1}}_{1},\ldots,y^{1}_{m},\ldots,y^{n_{m}}_{m}),$

where $s=\sum_{\beta=1}^{m}n_{\beta}$ , $n_{\beta}=|A_{\beta}|-1$ and $y^{1}_{\beta},\ldots,y^{n_{\beta}}_{\beta}$ are the coordinates of $y_{\beta}$ on the corresponding simplex ( $\beta=1,\ldots,m$ ).

To determine the weights vector $\mathbf{W}$ we embed the discrete space of nominal data into the sphere and use the weighted quadrance similarity function as corresponding continuous metric.

Definition 2 [4]. Let two points $Y_{1},Y_{2}\in\mathbb{S}^{s-1}\subset\mathbb{R}^{s}$ be given,

$Y_{i}=(y_{i,1},\ldots,y_{i,m})=(y_{i,1}^{1},\ldots,y_{i,1}^{n_{1}},\ldots,y_{i,m}^{1},\ldots,y_{i,m}^{n_{m}}),$

where $s=n_{1}+\ldots+n_{m}$ , $\|Y_{i}\|=1$ , $i=1,2$ . The quadrance distance $\operatorname{dist}_{q}(Y_{1},Y_{2})$ is defined by the following formula:

$\operatorname{dist}_{q}(Y_{1},Y_{2})=\frac{1-\langle Y_{1},Y_{2}\rangle}{2}=\frac{1}{4}\|Y_{1}-Y_{2}\|^{2}=\frac{1}{4}\sum_{\beta=1}^{m}\|y_{1,\beta}-y_{2,\beta}\|^{2},$

where $\langle Y_{1},Y_{2}\rangle$ is the standard Euclidean scalar product Eq. (1).

Remark 1. One can see that $\operatorname{dist}_{q}$ coincides, up to the factor $\frac{1}{4}$ , with the square of the Euclidean distance and therefore does not satisfy the triangle inequality [5]. Nevertheless, it is positive definite, symmetric, and continuous, and thus can be used as a measure of “closeness” of points on the sphere. For the sake of simplicity, we will call $\operatorname{dist}_{q}$ “the distance”.
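The failure of the triangle inequality mentioned in Remark 1 is easy to check numerically. The sketch below evaluates the quadrance distance for three points on the unit circle. The function name `quadrance` and the chosen points are illustrative assumptions.

```python
import math


def quadrance(y1, y2):
    """Quadrance distance between two unit vectors: (1 - <y1, y2>) / 2."""
    return (1.0 - sum(a * b for a, b in zip(y1, y2))) / 2.0


# Three illustrative points on the unit circle: 0, 45 and 90 degrees.
a = (1.0, 0.0)
b = (math.cos(math.pi / 4), math.sin(math.pi / 4))
c = (0.0, 1.0)

direct = quadrance(a, c)                    # equals 1/2
detour = quadrance(a, b) + quadrance(b, c)  # about 0.293
print(direct, detour, direct > detour)
```

The direct "distance" from `a` to `c` exceeds the sum over the detour through `b`, so the triangle inequality does not hold for these points.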

We chose the quadrance metric because the embedding Eq. (7) of the standard Hamming space into the sphere has less distortion than the embedding into the sphere with the standard metric [4]. On the other hand, linear methods of clustering can be generalized to that space.

Definition 3. Let two points $Y_{1},Y_{2}\in\mathbb{S}^{s-1}\subset\mathbb{R}^{s}$ as in Definition 2 and a weight vector $U=(u_{1},\ldots,u_{m})$ be given ( $u_{\beta}>0$ for $\beta=1,\ldots,m$ ). The weighted quadrance distance $\operatorname{dist}_{U,q}(Y_{1},Y_{2})$ is defined by the following formula:

$\displaystyle\operatorname{dist}_{U,q}(Y_{1},Y_{2})=\sum_{\beta=1}^{m}u_{\beta}\|y_{1,\beta}-y_{2,\beta}\|^{2}.$ (8)

We show in the appendix that the distortion of the embedding Eq. (7) of the space with the weighted Hamming metric into the sphere has the same estimate as the one obtained in [4] for the standard Hamming and quadrance metrics.

Combining Eq. (8) with the weighted Euclidean metric we obtain the product metric on the cylinder $\mathbb{D}$ :

$\displaystyle\operatorname{dist}_{\mathbf{W}}^{2}(\mathbf{X}_{1},\mathbf{X}_{2})=\operatorname{dist}_{W,e}^{2}(X_{1},X_{2})+\operatorname{dist}_{U,q}^{2}(Y_{1},Y_{2})=\sum_{\alpha=1}^{n}w_{\alpha}^{2}(x_{1}^{\alpha}-x_{2}^{\alpha})^{2}+\left(\sum_{\beta=1}^{m}u_{\beta}\|y_{1,\beta}-y_{2,\beta}\|^{2}\right)^{2},$ (9)

where $\mathbf{W}$ is the weight vector (Eq. (5)).
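A minimal sketch of Eqs (8) and (9) in Python follows. It assumes that the nominal part of a record is stored as a list of blocks, one block of coordinates per nominal feature. The function names and the sample record are illustrative.

```python
import math


def weighted_quadrance(y1, y2, u):
    """Eq. (8): sum over nominal features of u_beta * ||y1_beta - y2_beta||^2."""
    return sum(
        ub * sum((p - q) ** 2 for p, q in zip(b1, b2))
        for ub, b1, b2 in zip(u, y1, y2)
    )


def dist2_w(x1, y1, x2, y2, w, u):
    """Eq. (9): weighted Euclidean part plus the squared weighted quadrance part."""
    cont = sum((wa * (a - b)) ** 2 for wa, a, b in zip(w, x1, x2))
    return cont + weighted_quadrance(y1, y2, u) ** 2


# Two binary nominal features: each block is one of the two points +1 or -1,
# scaled so that the whole nominal vector has unit norm.
h = 1.0 / math.sqrt(2.0)
d2 = dist2_w([1.0, 0.0], [[h], [h]], [0.0, 0.0], [[h], [-h]], w=[2.0, 1.0], u=[1.0, 0.5])
print(d2)  # approximately 5: continuous part 4, nominal part (0.5 * 2)^2 = 1
```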

4. Reformulation function

The c-means clustering algorithm is based on minimizing the following objective function [2]:

$\displaystyle H(\mathbf{W},\mathbf{Y})=\frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{c}\operatorname{ind}_{i,j}\operatorname{dist}_{\mathbf{W}}^{2}(\mathbf{X}_{i},\mathbf{Y}_{j})=\frac{1}{M}\sum_{i=1}^{M}\min_{1\leqslant j\leqslant c}\{\,\operatorname{dist}_{\mathbf{W}}^{2}(\mathbf{X}_{i},\mathbf{Y}_{j})\,\}.$ (10)

To overcome the analytical difficulties caused by the minimum operator, we will use the method of reformulation functions.

The reformulation function is defined as follows [9]:

$\displaystyle R_{p}(\mathbf{W},\mathbf{Y})=\frac{1}{M}\sum_{i=1}^{M}\left(\frac{1}{c}\sum_{j=1}^{c}\bigl(\operatorname{dist}_{\mathbf{W}}^{2}(\mathbf{X}_{i},\mathbf{Y}_{j})\bigr)^{p}\right)^{\frac{1}{p}}.$ (11)

The method is based on the following relation:

$\displaystyle H(\mathbf{W},\mathbf{Y})=\lim_{p\to-\infty}R_{p}(\mathbf{W},\mathbf{Y}).$ (12)

So, instead of minimizing $H(\mathbf{W},\mathbf{Y})$ , one can minimize $R_{p}$ , which does not involve the minimum operator, and then pass to the limit for $p$ tending to minus infinity. We refer to the article [9] for discussion on reformulation functions and clustering.
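The relation Eq. (12) can be illustrated numerically. The sketch below evaluates $H$ and $R_{p}$ on a small table of squared distances for increasingly negative $p$ . The function names and the table `d2` are illustrative assumptions, and the direct evaluation of powers is only suitable for moderate values of $p$ .

```python
def objective_h(d2):
    """Eq. (10): mean over records of the smallest squared distance to a prototype."""
    return sum(min(row) for row in d2) / len(d2)


def reformulation_r(d2, p):
    """Eq. (11) for a table d2[i][j] of squared distances and a negative exponent p."""
    total = 0.0
    for row in d2:
        total += (sum(d ** p for d in row) / len(row)) ** (1.0 / p)
    return total / len(d2)


# Illustrative squared distances of two records to two prototypes.
# Values are kept >= 1 so that d ** p does not overflow for very negative p.
d2 = [[1.0, 4.0], [9.0, 2.0]]
h = objective_h(d2)
for p in (-1, -10, -200):
    print(p, reformulation_r(d2, p), h)
```

In this example $R_{p}$ approaches $H$ from above, and the gap shrinks as $p$ decreases.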

Differentiation of $R_{p}(\mathbf{W},\mathbf{Y})$ is a part of the minimization algorithm. We now deduce a formula for the limit of the derivative of $R_{p}$ as $p\to-\infty$ . Our computation essentially coincides with that of [9], but it is performed in a somewhat more general framework. Let $\nabla$ denote differentiation. Rewrite Eq. (11) as

$R_{p}(\mathbf{W},\mathbf{Y})=\frac{1}{M}\sum_{i=1}^{M}\left(S_{i}\right)^{\frac{1}{p}},$

where

$S_{i}=\frac{1}{c}\sum_{j=1}^{c}\bigl(\operatorname{dist}_{\mathbf{W}}^{2}(\mathbf{X}_{i},\mathbf{Y}_{j})\bigr)^{p}.$

So,

$\nabla R_{p}=\frac{1}{Mc}\sum_{i=1}^{M}\sum_{j=1}^{c}S_{i}^{\frac{1}{p}-1}\left(\operatorname{dist}^{2}_{\mathbf{W}}(\mathbf{X}_{i},\mathbf{Y}_{j})\right)^{p-1}\nabla\operatorname{dist}_{\mathbf{W}}^{2}(\mathbf{X}_{i},\mathbf{Y}_{j}).$

Let us rewrite the coefficient at $\nabla\operatorname{dist}^{2}_{\mathbf{W}}(\mathbf{X}_{i},\mathbf{Y}_{j})$ in the following way:

$\frac{S_{i}^{\frac{1}{p}}}{\operatorname{dist}^{2}_{\mathbf{W}}(\mathbf{X}_{i},\mathbf{Y}_{j})}\cdot\frac{\bigl(\operatorname{dist}^{2}_{\mathbf{W}}(\mathbf{X}_{i},\mathbf{Y}_{j})\bigr)^{p}}{S_{i}},$

and calculate its limit as $p\to-\infty$ .

Note that

$\lim_{p\to-\infty}S_{i}^{\frac{1}{p}}=\lim_{p\to-\infty}\left(\frac{1}{c}\sum_{j=1}^{c}\bigl(\operatorname{dist}^{2}_{\mathbf{W}}(\mathbf{X}_{i},\mathbf{Y}_{j})\bigr)^{p}\right)^{\frac{1}{p}}=\min_{1\leqslant j\leqslant c}\operatorname{dist}^{2}_{\mathbf{W}}(\mathbf{X}_{i},\mathbf{Y}_{j}).$

For the second fraction we have:

$\displaystyle\frac{\bigl(\operatorname{dist}_{\mathbf{W}}^{2}(\mathbf{X}_{i},\mathbf{Y}_{j})\bigr)^{p}}{S_{i}}=\left(\frac{1}{c}\sum_{l=1}^{c}\left(\frac{\operatorname{dist}_{\mathbf{W}}^{2}(\mathbf{X}_{i},\mathbf{Y}_{l})}{\operatorname{dist}_{\mathbf{W}}^{2}(\mathbf{X}_{i},\mathbf{Y}_{j})}\right)^{p}\right)^{-1}=c\left(1+\sum_{l\neq j}\left(\frac{\operatorname{dist}_{\mathbf{W}}^{2}(\mathbf{X}_{i},\mathbf{Y}_{l})}{\operatorname{dist}_{\mathbf{W}}^{2}(\mathbf{X}_{i},\mathbf{Y}_{j})}\right)^{p}\right)^{-1}.$

If $\mathbf{Y}_{j}$ is the unique nearest prototype of $\mathbf{X}_{i}$ , each ratio in the last sum is greater than 1, so its $p$ -th power tends to zero as $p\to-\infty$ ; otherwise at least one ratio is less than 1 and its $p$ -th power tends to infinity. Hence

$\displaystyle\lim_{p\to-\infty}\frac{\bigl(\operatorname{dist}_{\mathbf{W}}^{2}(\mathbf{X}_{i},\mathbf{Y}_{j})\bigr)^{p}}{S_{i}}=\begin{cases}c&\text{ if }\operatorname{dist}_{\mathbf{W}}^{2}(\mathbf{X}_{i},\mathbf{Y}_{j})=\min\limits_{1\leqslant l\leqslant c}\operatorname{dist}_{\mathbf{W}}^{2}(\mathbf{X}_{i},\mathbf{Y}_{l}),\\ 0&\text{ otherwise.}\end{cases}$ (13)

Summing up, we have the following formula for the limit of the derivative of the reformulation function:

$\displaystyle\lim_{p\to-\infty}\nabla R_{p}(\mathbf{W},\mathbf{Y})=\frac{1}{M}\sum_{i=1}^{M}\sum_{l=1}^{c}\operatorname{ind}_{i,l}\nabla\operatorname{dist}^{2}_{\mathbf{W}}(\mathbf{X}_{i},\mathbf{Y}_{l}).$ (14)

Remark 2. In this calculation we assumed that each record $\mathbf{X}_{i}$ belongs to exactly one cluster, i.e. there exists only one $j\in\{\,1,\ldots,c\,\}$ such that $\operatorname{dist}^{2}_{\mathbf{W}}(\mathbf{X}_{i},\mathbf{Y}_{j})=\min\limits_{1\leqslant l\leqslant c}\operatorname{dist}^{2}_{\mathbf{W}}(\mathbf{X}_{i},\mathbf{Y}_{l})$ . If there are $k\geqslant 2$ such centroids, the right side of Eq. (13) is multiplied by $\frac{1}{k}$ . One should make the appropriate correction to all the developed algorithms if the probability of non-uniqueness is high.

5. Determining the prototypes

To determine the prototypes we assume that the weights vector $\mathbf{W}$ is fixed and use the method of constrained gradient descent to minimize the function $R_{p}(\mathbf{W},\mathbf{Y})$ subject to $\|\nu_{j}\|=1$ for $j=1,\ldots,c$ , where $\mathbf{Y}_{j}=(\chi_{j},\nu_{j})\in\mathbb{D}$ , and $p$ tends to minus infinity. Each iteration of the method has two steps:

1. $\mathbf{Y}_{\mathrm{next}}=\mathbf{Y}-\eta\nabla_{\mathbf{Y}}R_{p}(\mathbf{W},\mathbf{Y})$ , where $\eta$ is a relaxation parameter that can be chosen independently for each coordinate of $\mathbf{Y}$ .
2. Projection onto the constraint surface, i.e. normalization of $\nu_{j}$ for $j=1,\ldots,c$ .

Consider first the continuous part $\chi_{j}^{\alpha}$ of the prototype $\mathbf{Y}_{j}$ , where $\alpha=1,\ldots,n$ :

$\frac{\partial\operatorname{dist}^{2}_{\mathbf{W}}(\mathbf{X}_{i},\mathbf{Y}_{l})}{\partial\chi_{j}^{\alpha}}=-2w_{\alpha}^{2}(x_{i}^{\alpha}-\chi_{j}^{\alpha})\delta_{l}^{j},$

where $\delta_{l}^{j}$ is the Kronecker delta:

$\delta_{l}^{j}=\begin{cases}1,&l=j,\\ 0,&l\neq j.\end{cases}$

Taking into consideration Eq. (14), the first step of the method becomes

$\chi_{j,\mathrm{next}}^{\alpha}=\chi_{j}^{\alpha}+w_{\alpha}^{2}\eta_{j}^{\alpha}\frac{2}{M}\sum_{i=1}^{M}\operatorname{ind}_{i,j}x_{i}^{\alpha}-\left(w_{\alpha}^{2}\eta_{j}^{\alpha}\frac{2}{M}\sum_{i=1}^{M}\operatorname{ind}_{i,j}\right)\chi_{j}^{\alpha}.$

One can choose $\eta_{j}^{\alpha}=\left(w_{\alpha}^{2}\frac{2}{M}\sum_{i=1}^{M}\operatorname{ind}_{i,j}\right)^{-1}$ , so the first step of the method will be as follows:

$\displaystyle\chi_{j,\mathrm{next}}^{\alpha}=\frac{\sum_{\mathbf{X}_{i}\in C_{% j}}x_{i}^{\alpha}}{|C_{j}|},$ (15)

i.e. the usual c-means.

For the nominal part:

$\frac{\partial\operatorname{dist}^{2}_{\mathbf{W}}(\mathbf{X}_{i},\mathbf{Y}_{l})}{\partial\nu_{j,\beta}^{\alpha}}=-4\operatorname{dist}_{U,q}(y_{i},\nu_{l})u_{\beta}(y_{i,\beta}^{\alpha}-\nu_{l,\beta}^{\alpha})\delta_{l}^{j}.$

So, the first step of the method takes the following form:

$\displaystyle\nu_{j,\beta,\mathrm{next}}^{\alpha}=\nu_{j,\beta}^{\alpha}+u_{\beta}\eta_{j,\beta}^{\alpha}\frac{4}{M}\sum_{i=1}^{M}\operatorname{ind}_{i,j}\operatorname{dist}_{U,q}(y_{i},\nu_{j})y_{i,\beta}^{\alpha}-u_{\beta}\eta_{j,\beta}^{\alpha}\frac{4}{M}\left(\sum_{i=1}^{M}\operatorname{ind}_{i,j}\operatorname{dist}_{U,q}(y_{i},\nu_{j})\right)\nu_{j,\beta}^{\alpha}.$

Again, by choosing $\eta_{j,\beta}^{\alpha}=\left(u_{\beta}\frac{4}{M}\sum_{i=1}^{M}\operatorname{ind}_{i,j}\operatorname{dist}_{U,q}(y_{i},\nu_{j})\right)^{-1}$ , one can simplify this formula to

$\displaystyle v_{j,\beta}^{\alpha}=\frac{\sum_{\mathbf{X}_{i}\in C_{j}}\operatorname{dist}_{U,q}(y_{i},\nu_{j})y_{i,\beta}^{\alpha}}{\sum_{\mathbf{X}_{i}\in C_{j}}\operatorname{dist}_{U,q}(y_{i},\nu_{j})},$ (16)

followed by normalization:

$\displaystyle\nu_{j,\mathrm{next}}=V_{j}/\|V_{j}\|,$ (17)

where $V_{j}=(v_{j,1},\ldots,v_{j,m})$ , $j=1,\ldots,c$ .

Note that the Eq. (16) contains the previous centroids, so the above calculation followed by normalization should be repeated iteratively to approximately minimize the objective function Eq. (10).

Remark 3. Since calculation of $V$ is followed by normalization, one can simplify the Eq. (16) to

$v_{j,\beta}^{\alpha}=\sum_{\mathbf{X}_{i}\in C_{j}}\operatorname{dist}_{U,q}(y_{i},\nu_{j})y_{i,\beta}^{\alpha}$

in practical implementation.
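The update of a single prototype by Eqs (15)–(17), in the simplified form of Remark 3, might look as follows. This is a minimal sketch: the function name `update_prototype`, the block-wise storage of nominal vectors, the handling of the degenerate zero-norm case, and the sample cluster are all assumptions made for illustration.

```python
import math


def update_prototype(xs, ys, nu, u):
    """One pass of Eqs (15)-(17) for a single cluster.

    xs: continuous parts of the cluster's records; ys: nominal parts, each a list
    of blocks; nu: current nominal prototype (list of blocks); u: nominal weights.
    """
    # Eq. (15): the continuous prototype is the cluster mean.
    chi_next = [sum(col) / len(xs) for col in zip(*xs)]
    # Weighted quadrance of every record to the current nominal prototype, Eq. (8).
    q = [
        sum(ub * sum((a - b) ** 2 for a, b in zip(yb, nb)) for ub, yb, nb in zip(u, y, nu))
        for y in ys
    ]
    # Unnormalized numerator of Eq. (16), as in Remark 3.
    v = [
        [sum(qi * y[b][a] for qi, y in zip(q, ys)) for a in range(len(nu[b]))]
        for b in range(len(nu))
    ]
    norm = math.sqrt(sum(t * t for block in v for t in block))
    if norm == 0.0:
        # Degenerate case (an assumption of this sketch): keep the old prototype.
        return chi_next, nu
    # Eq. (17): projection back onto the unit sphere.
    return chi_next, [[t / norm for t in block] for block in v]


# Hypothetical cluster: one continuous feature, one nominal feature with three variants.
xs = [[1.0], [3.0]]
ys = [[[1.0, 0.0]], [[-0.5, math.sqrt(3) / 2]]]
chi, nu = update_prototype(xs, ys, nu=[[0.0, 1.0]], u=[1.0])
print(chi, nu)
```

Because the right-hand side depends on the current `nu`, the call would be repeated until the prototype stops changing.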

6. Determining the weights

The weight vector $\mathbf{W}$ can be determined by assuming that the prototypes $\mathbf{Y}$ are fixed and minimizing the objective function Eq. (10) with respect to $\mathbf{W}$ . Since the objective function $H(\mathbf{W},\mathbf{Y})$ is homogeneous with respect to $\mathbf{W}$ , we need to introduce an additional constraint in order to work out an effective minimization procedure. We assume that the generalized average of the weights is constant:

$\displaystyle\left(\frac{1}{n+m}\left(\sum_{\alpha=1}^{n}w_{\alpha}^{r}+\sum_{\beta=1}^{m}u_{\beta}^{r}\right)\right)^{\frac{1}{r}}=1,$ (18)

where $r>0$ . The constant on the right side of Eq. (18) is chosen to be compatible with the case $\mathbf{W}=(1,\ldots,1)$ , that was considered in [4].

Note that the parameter $r$ affects the determination of the weights. We also analyze its influence on the estimate of the residual error in Section 7.

Following the methodology of reformulation functions, we replace the objective function with the reformulation function Eq. (11) and consider appropriate constrained minimization problem in the limit $p\to-\infty$ . We will use the method of Lagrange multipliers. The reformulation of the Lagrange function is as follows:

$\mathcal{L}_{p}(\mathbf{W},\lambda)=R_{p}(\mathbf{W},\mathbf{Y})-\lambda\left(\sum_{\alpha=1}^{n}w_{\alpha}^{r}+\sum_{\beta=1}^{m}u_{\beta}^{r}-(m+n)\right).$

The optimal weight vector is determined from the following conditions:

$\displaystyle\frac{\partial\mathcal{L}_{p}(\mathbf{W},\lambda)}{\partial w_{\alpha}}=\frac{\partial\mathcal{L}_{p}(\mathbf{W},\lambda)}{\partial u_{\beta}}=\frac{\partial\mathcal{L}_{p}(\mathbf{W},\lambda)}{\partial\lambda}=0,$ (19)

where $\alpha=1,\ldots,n$ , $\beta=1,\ldots,m$ .

Using Eq. (14) and the relations

$\displaystyle\frac{\partial\operatorname{dist}^{2}_{\mathbf{W}}(\mathbf{X}_{i},\mathbf{Y}_{l})}{\partial w_{\alpha}}=2w_{\alpha}(x_{i}^{\alpha}-\chi_{l}^{\alpha})^{2},$

$\displaystyle\frac{\partial\operatorname{dist}^{2}_{\mathbf{W}}(\mathbf{X}_{i},\mathbf{Y}_{l})}{\partial u_{\beta}}=2\operatorname{dist}_{U,q}(y_{i},\nu_{l})\|y_{i,\beta}-\nu_{l,\beta}\|^{2},$

we rewrite Eq. (19) in the limit $p\to-\infty$ as follows:

$\displaystyle w_{\alpha}\cdot\frac{2}{M}\sum_{i=1}^{M}\sum_{l=1}^{c}\operatorname{ind}_{i,l}(x_{i}^{\alpha}-\chi_{l}^{\alpha})^{2}=\lambda rw_{\alpha}^{r-1},$ (20)

$\displaystyle\frac{2}{M}\sum_{i=1}^{M}\sum_{l=1}^{c}\operatorname{ind}_{i,l}\sum_{\gamma=1}^{m}u_{\gamma}\|y_{i,\gamma}-\nu_{l,\gamma}\|^{2}\|y_{i,\beta}-\nu_{l,\beta}\|^{2}=\lambda ru_{\beta}^{r-1},$ (21)

$\displaystyle\sum_{\alpha=1}^{n}w_{\alpha}^{r}+\sum_{\beta=1}^{m}u_{\beta}^{r}=m+n.$ (22)

Eq. (20) can be solved for $w_{\alpha}$ :

$\displaystyle w_{\alpha}=\left(\frac{\lambda r}{2}\right)^{-\frac{1}{r-2}}s_{\alpha},$ (23)

where

$\displaystyle s_{\alpha}=\left(\frac{1}{M}\sum_{i=1}^{M}\sum_{l=1}^{c}\operatorname{ind}_{i,l}(x_{i}^{\alpha}-\chi_{l}^{\alpha})^{2}\right)^{\frac{1}{r-2}}.$ (24)

Equation (21) implies an analogous formula for $u_{\beta}$ :

$\displaystyle u_{\beta}=\left(\frac{\lambda r}{2}\right)^{-\frac{1}{r-2}}z_{\beta}.$ (25)

The vector $z=(z_{1},\ldots,z_{m})$ satisfies equation

$\displaystyle z_{\beta}^{r-1}=\sum_{\gamma=1}^{m}A_{\beta\gamma}z_{\gamma},$ (26)

where $m\times m$ matrix $A$ has the following elements:

$\displaystyle A_{\beta\gamma}=\frac{1}{M}\sum_{i=1}^{M}\sum_{l=1}^{c}\operatorname{ind}_{i,l}\|y_{i,\gamma}-\nu_{l,\gamma}\|^{2}\|y_{i,\beta}-\nu_{l,\beta}\|^{2}$ (27)

for $\beta,\gamma=1,\ldots,m$ . We will discuss how to solve the Eq. (26) later in this section.

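Note that the matrix $A$ is symmetric and positive semidefinite, being an average of outer products of the vectors $\bigl(\|y_{i,1}-\nu_{l,1}\|^{2},\ldots,\|y_{i,m}-\nu_{l,m}\|^{2}\bigr)$ . It can be considered as a covariance-type matrix of the particular features with respect to the corresponding centroids.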
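For a given hard partition, the quantities $s_{\alpha}$ of Eq. (24) and the matrix $A$ of Eq. (27) can be accumulated in a single pass over the records. The sketch below assumes that cluster membership is stored as a list `labels` of prototype indices (equivalent to the indicator function) and that nominal vectors are stored block-wise. All names and the sample data are illustrative.

```python
def scatter_statistics(xs, ys, labels, chis, nus, r):
    """Return (s, A) of Eqs (24) and (27) for a hard partition given by labels."""
    big_m = len(xs)
    n, m = len(xs[0]), len(ys[0])
    sigma = [0.0] * n
    a_mat = [[0.0] * m for _ in range(m)]
    for x, y, lab in zip(xs, ys, labels):
        for al in range(n):
            sigma[al] += (x[al] - chis[lab][al]) ** 2 / big_m
        # Squared block distances ||y_{i,beta} - nu_{l,beta}||^2 to the assigned prototype.
        d = [sum((p - q) ** 2 for p, q in zip(y[b], nus[lab][b])) for b in range(m)]
        for b in range(m):
            for g in range(m):
                a_mat[b][g] += d[b] * d[g] / big_m
    # Eq. (24); requires every within-cluster scatter sigma to be positive.
    s = [v ** (1.0 / (r - 2.0)) for v in sigma]
    return s, a_mat


# Hypothetical data: one cluster, one continuous and one binary nominal feature.
s, a_mat = scatter_statistics(
    xs=[[1.0], [3.0]], ys=[[[1.0]], [[-1.0]]], labels=[0, 0],
    chis=[[2.0]], nus=[[[1.0]]], r=0.5,
)
print(s, a_mat)
```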

Substituting Eqs (23) and (25) into Eq. (22), one can eliminate Lagrange multiplier $\lambda$ , and, finally, obtain the following formulas for the weight vector:

$\displaystyle w_{\alpha}=\left[\frac{1}{n+m}\left(\sum_{\gamma=1}^{n}\left(\frac{s_{\gamma}}{s_{\alpha}}\right)^{r}+\sum_{\beta=1}^{m}\left(\frac{z_{\beta}}{s_{\alpha}}\right)^{r}\right)\right]^{-\frac{1}{r}}\text{ for }\alpha=1,\ldots,n,$ (28)

$\displaystyle u_{\beta}=\left[\frac{1}{n+m}\left(\sum_{\alpha=1}^{n}\left(\frac{s_{\alpha}}{z_{\beta}}\right)^{r}+\sum_{\gamma=1}^{m}\left(\frac{z_{\gamma}}{z_{\beta}}\right)^{r}\right)\right]^{-\frac{1}{r}}\text{ for }\beta=1,\ldots,m.$

Remark 4. In case of continuous data ( $m=0$ ) Eq. (28) coincides with that of [9], up to notations.
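Pulling $s_{\alpha}$ (respectively $z_{\beta}$ ) out of the brackets in Eq. (28) shows that all weights share one common factor: $w_{\alpha}=k\,s_{\alpha}$ and $u_{\beta}=k\,z_{\beta}$ with $k=\bigl(\frac{1}{n+m}(\sum_{\gamma}s_{\gamma}^{r}+\sum_{\gamma}z_{\gamma}^{r})\bigr)^{-1/r}$ . The sketch below uses this form and checks the constraint Eq. (18). The function name and the sample values of `s` and `z` are illustrative assumptions.

```python
def weights_from_scales(s, z, r):
    """Eq. (28), using w_alpha = k * s_alpha and u_beta = k * z_beta with a common factor k."""
    mean_r = (sum(v ** r for v in s) + sum(v ** r for v in z)) / (len(s) + len(z))
    k = mean_r ** (-1.0 / r)
    return [k * v for v in s], [k * v for v in z]


# Illustrative values of s (continuous) and z (nominal); r is the exponent of Eq. (18).
r = 0.5
w, u = weights_from_scales(s=[0.5, 2.0], z=[1.0], r=r)
constraint = ((sum(v ** r for v in w) + sum(v ** r for v in u)) / (len(w) + len(u))) ** (1.0 / r)
print(w, u, constraint)
```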

Now, let us discuss methods to solve the Eq. (26). Note that this equation has iterative form and immediately implies the simple fixed-point iteration method. However, if the data are distributed close to the cluster centroids, one can expect that the sum Eq. (27) is very small, so the norm of the matrix $A$ is close to zero. That is why the simple iteration method is not appropriate in case of $0<r<1$ .

So, we propose a relaxation iterative method:

$\displaystyle z_{\beta,\mathrm{next}}=z_{\beta}-\omega\left(z_{\beta}^{r-1}-\sum_{\gamma=1}^{m}A_{\beta\gamma}z_{\gamma}\right),$ (29)

where $\omega$ is a relaxation parameter. The standard terminating criterion can be used:

$\|z_{\mathrm{next}}-z\|<\varepsilon,$

where $\varepsilon$ is the desired accuracy.

In all our numerical experiments the proposed method provided a satisfactory approximation.
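A minimal sketch of the relaxation iteration Eq. (29) is given below. The matrix `a_mat`, the starting point, the value of `omega`, the use of the maximum norm in the stopping test, and the positivity safeguard are all assumptions of this sketch rather than values taken from the experiments. In this toy example with $0<r<1$ the residual $z_{\beta}^{r-1}-\sum_{\gamma}A_{\beta\gamma}z_{\gamma}$ decreases as $z$ grows, so a negative `omega` is what makes the step contract here.

```python
def solve_z(a_mat, r, omega, z0, eps=1e-10, max_iter=10000):
    """Iterate Eq. (29) until the step is shorter than eps (maximum norm); return z."""
    z = list(z0)
    for _ in range(max_iter):
        az = [sum(a * t for a, t in zip(row, z)) for row in a_mat]
        z_next = [t - omega * (t ** (r - 1.0) - s) for t, s in zip(z, az)]
        # Keep the iterate positive so that fractional powers stay real (sketch-only safeguard).
        z_next = [max(t, 1e-12) for t in z_next]
        step = max(abs(p - q) for p, q in zip(z_next, z))
        z = z_next
        if step < eps:
            break
    return z


# Illustrative symmetric matrix with small entries, r in (0, 1).
a_mat = [[0.04, 0.01], [0.01, 0.09]]
r = 0.5
z = solve_z(a_mat, r, omega=-5.0, z0=[1.0, 1.0])
residual = [t ** (r - 1.0) - sum(a * v for a, v in zip(row, z)) for t, row in zip(z, a_mat)]
print(z, residual)
```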

7. Error estimation

The residual error can be obtained using Eq. (12) as

$E=\lim_{p\to-\infty}R_{p}(\mathbf{W},\mathbf{Y}),$

where $\mathbf{W}$ and $\mathbf{Y}$ are optimally determined as indicated in the previous sections. After each iteration it can be measured as

$\displaystyle E=\frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{c}\operatorname{ind}_{i,j}\operatorname{dist}^{2}_{\mathbf{W}}(\mathbf{X}_{i},\mathbf{Y}_{j})$

$\displaystyle\qquad=\frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{c}\operatorname{ind}_{i,j}\left[\sum_{\alpha=1}^{n}w_{\alpha}^{2}(x^{\alpha}_{i}-\chi_{j}^{\alpha})^{2}+\left(\sum_{\beta=1}^{m}u_{\beta}\|y_{i,\beta}-\nu_{j,\beta}\|^{2}\right)^{2}\right]=E_{\mathrm{c}}+E_{\mathrm{n}},$

where $E_{\mathrm{c}}$ and $E_{\mathrm{n}}$ are, correspondingly, continuous and nominal parts of error.

For the continuous part we have the following representation:

$E_{\mathrm{c}}=\sum_{\alpha=1}^{n}w_{\alpha}^{2}\left[\frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{c}\operatorname{ind}_{i,j}(x_{i}^{\alpha}-\chi_{j}^{\alpha})^{2}\right]=\sum_{\alpha=1}^{n}w_{\alpha}^{2}s_{\alpha}^{r-2}=\sum_{\alpha=1}^{n}E_{\mathrm{c},\alpha},$

where $s_{\alpha}$ was defined in Eq. (24) and $E_{\mathrm{c},\alpha}$ represents the contribution of the continuous feature $\alpha$ to the total residual error. Eq. (28) implies the following representation of $E_{\mathrm{c},\alpha}$ :

$\displaystyle E_{\mathrm{c},\alpha}=\left(\frac{\sum_{\gamma=1}^{n}s_{\gamma}^{r}+\sum_{\beta=1}^{m}z_{\beta}^{r}}{n+m}\right)^{-\frac{2}{r}}s_{\alpha}^{r}.$ (30)

Due to the use of the quadrance metric, the nominal part of the residual error does not split with respect to particular features as explicitly as the continuous part. Nevertheless, it is possible to obtain a similar representation for it:

$\displaystyle E_{\mathrm{n}}=\frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{c}\operatorname{ind}_{i,j}\left(\sum_{\beta=1}^{m}u_{\beta}\|y_{i,\beta}-\nu_{j,\beta}\|^{2}\right)\left(\sum_{\gamma=1}^{m}u_{\gamma}\|y_{i,\gamma}-\nu_{j,\gamma}\|^{2}\right)=\sum_{\beta=1}^{m}u_{\beta}\left(\sum_{\gamma=1}^{m}u_{\gamma}\left[\frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{c}\operatorname{ind}_{i,j}\|y_{i,\beta}-\nu_{j,\beta}\|^{2}\|y_{i,\gamma}-\nu_{j,\gamma}\|^{2}\right]\right)=\sum_{\beta=1}^{m}u_{\beta}\sum_{\gamma=1}^{m}A_{\beta\gamma}u_{\gamma}=\sum_{\beta=1}^{m}E_{\mathrm{n},\beta},$

where the matrix $A$ is defined in Eq. (27). Using Eq. (26), the relation Eq. (25) between $u_{\beta}$ and $z_{\beta}$ , and Eq. (28), we obtain the following representation for $E_{\mathrm{n},\beta}$ :

$\displaystyle E_{\mathrm{n},\beta}=\left(\frac{\sum_{\alpha=1}^{n}s_{\alpha}^{r}+\sum_{\gamma=1}^{m}z_{\gamma}^{r}}{n+m}\right)^{-\frac{2}{r}}z_{\beta}^{r}.$ (31)

Summarizing, we get the following formula for the total residual error:

$E=\left(\frac{\sum_{\alpha=1}^{n}s_{\alpha}^{r}+\sum_{\beta=1}^{m}z_{\beta}^{r}}{m+n}\right)^{-\frac{2}{r}}\left(\sum_{\alpha=1}^{n}s_{\alpha}^{r}+\sum_{\beta=1}^{m}z_{\beta}^{r}\right).$

Let us make the following substitution:

$\displaystyle s_{\alpha}=\sigma_{\alpha}^{\frac{1}{r-2}},\quad z_{\beta}=\xi_{\beta}^{\frac{1}{r-2}},$ (32)

where $\alpha=1,\ldots,n$ and $\beta=1,\ldots,m$ . Note that

$\sigma_{\alpha}=s_{\alpha}^{r-2}=\frac{1}{M}\sum_{i=1}^{M}\sum_{l=1}^{c}\operatorname{ind}_{i,l}(x_{i}^{\alpha}-\chi_{l}^{\alpha})^{2}$

does not depend on $r$ .

The formula for the total residual error will be transformed as follows:

$\displaystyle E=(m+n)\left(\frac{1}{m+n}\left(\sum_{\alpha=1}^{n}\sigma_{\alpha}^{\frac{r}{r-2}}+\sum_{\beta=1}^{m}\xi_{\beta}^{\frac{r}{r-2}}\right)\right)^{\frac{r-2}{r}}=(m+n)\left(\frac{1}{m+n}\left(\sum_{\alpha=1}^{n}\sigma_{\alpha}^{p}+\sum_{\beta=1}^{m}\xi_{\beta}^{p}\right)\right)^{\frac{1}{p}}=(m+n)M_{p}(\sigma_{1},\ldots,\sigma_{n},\xi_{1},\ldots,\xi_{m}),$ (33)

where $p=r/(r-2)$ and $M_{p}$ is the generalized mean [6]. When $r$ increases from $0$ to $1$ , $p$ decreases from $0$ to $-1$ . Since the generalized mean is an increasing function of $p$ , in the case of purely continuous data ( $m=0$ ) one deduces that the total residual error decreases as $r$ increases on the interval $(0,1)$ [9]. However, in our case the components $\xi_{\beta}$ depend on $r$ and the above consideration cannot be directly applied. At this moment we cannot prove that $E$ decreases in $r$ on $(0,1)$ in the general case, but we observe it in numerical experiments and state it as a conjecture.
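For purely continuous data the monotonicity argument can be checked directly, because the scatters $\sigma_{\alpha}$ do not depend on $r$ . The sketch below evaluates Eq. (33) with $m=0$ for several values of $r$ . The function names and the sample values of `sigma` are illustrative assumptions.

```python
def generalized_mean(values, p):
    """Power mean M_p of positive values for p != 0."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


def total_error(sigma, r):
    """Eq. (33) for purely continuous data (m = 0) with fixed scatters sigma."""
    p = r / (r - 2.0)
    return len(sigma) * generalized_mean(sigma, p)


# Illustrative within-cluster scatters of three continuous features.
sigma = [0.5, 2.0, 1.0]
errors = [total_error(sigma, r) for r in (0.2, 0.5, 0.9)]
print(errors)
```

The printed values decrease as $r$ grows, in agreement with the argument for $m=0$ . The sketch does not cover the nominal case, where $\xi_{\beta}$ depends on $r$ .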

The estimate (Eq. (33)) was used in numerical experiments (Section 8).

Let us make one more remark concerning the meaning of the parameter $r$. Note that Eqs (30) and (31) for the errors related to individual features contain a common factor, which can be expressed through a generalized mean:

$\left(\frac{\sum_{\alpha=1}^{n}s_{\alpha}^{r}+\sum_{\gamma=1}^{m}z_{\gamma}^{r}}{n+m}\right)^{-\frac{2}{r}}=\left[M_{r/2}\bigl(s_{1}^{2},\ldots,s_{n}^{2},z_{1}^{2},\ldots,z_{m}^{2}\bigr)\right]^{-1}.$

Since the generalized mean $M_{p}$ tends to the geometric mean as $p\to 0$ [6], while the remaining factors $s_{\alpha}^{r}$ and $z_{\beta}^{r}$ tend to $1$, we see that

$E_{\mathrm{c},\alpha}(0)=E_{\mathrm{n},\beta}(0)=\left(\prod_{\gamma=1}^{n}s_{\gamma}(0)\prod_{\gamma=1}^{m}z_{\gamma}(0)\right)^{-\frac{2}{m+n}},$

i.e., every feature has the same residual error. Hence, for small values of $r$ the impact of different features on the structure of clusters is equalized, while for greater values of $r$ the impact of each feature rises proportionally to $s_{\alpha}^{r}$ for continuous features and $z_{\beta}^{r}$ for nominal features ($\alpha=1,\ldots,n$, $\beta=1,\ldots,m$). A similar phenomenon was observed in [9] for purely continuous data ($m=0$).
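The limit of the generalized mean used in this remark can be checked directly. The following short derivation is the standard argument, written for generic positive numbers $x_{1},\ldots,x_{N}$ (a notation introduced here for illustration only):

$\displaystyle \log M_{p}(x_{1},\ldots,x_{N})=\frac{1}{p}\log\Bigl(\frac{1}{N}\sum_{i=1}^{N}e^{p\log x_{i}}\Bigr)=\frac{1}{p}\log\Bigl(1+\frac{p}{N}\sum_{i=1}^{N}\log x_{i}+O(p^{2})\Bigr)\xrightarrow[p\to 0]{}\frac{1}{N}\sum_{i=1}^{N}\log x_{i},$

so that $M_{p}(x_{1},\ldots,x_{N})\to\bigl(\prod_{i=1}^{N}x_{i}\bigr)^{1/N}$, the geometric mean.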

8. Experimental results

The above considerations are summarized in Algorithm 8.

Algorithm 8. C-means clustering algorithm
Input: $\mathbb{X}=\{\,\mathbf{X}_{1},\ldots,\mathbf{X}_{M}\,\}$, the set of records embedded into the cylinder $\mathbb{D}\times\mathbb{S}^{s-1}$
Output: the resulting clusters $C_{1},\ldots,C_{c}$ for the set of records $\mathbb{X}$
1. Choose randomly initial centroids $\mathbf{Y}_{1},\ldots,\mathbf{Y}_{c}\in\mathbb{D}$
2. Choose the initial weight vector as $\mathbf{W}=(1,\ldots,1)$
3. Compute initial clusters $C_{1},\ldots,C_{c}\subset\mathbb{X}$ and the indicator function according to Eq. (2)
4. While the partition into clusters is not stabilized, do:
4.1. Compute new centroids $\mathbf{Y}_{1},\ldots,\mathbf{Y}_{c}\in\mathbb{D}$ with Eqs (15)–(17)
4.2. Compute new clusters $C_{1},\ldots,C_{c}\subset\mathbb{X}$ and the indicator function according to Eq. (2)
4.3. Compute the new weight vector $\mathbf{W}$ by Eq. (28)
5. Return $C_{1},\ldots,C_{c}$
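The control flow of the algorithm can be sketched in Python as follows. This is only an illustration under strong assumptions, not the implementation used in the experiments: it handles continuous features only, so the centroid step is a plain mean instead of Eqs (15)–(17), and the weight update `stand_in_weights` is a hypothetical stand-in inferred from Eqs (31) and (32), not Eq. (28) itself. All function names are introduced here for illustration.

```python
import numpy as np

def assign(X, Y, w):
    """Nearest-prototype labels under the weighted squared Euclidean distance."""
    d = ((X[:, None, :] - Y[None, :, :]) ** 2 * w).sum(axis=2)
    return d.argmin(axis=1)

def feature_dispersion(X, Y, labels):
    """sigma_alpha: mean squared deviation of feature alpha from its centroid."""
    return ((X - Y[labels]) ** 2).mean(axis=0)

def stand_in_weights(sigma, r):
    """ASSUMPTION: weights chosen so that w * sigma = (mean s^r)^(-2/r) * s^r,
    with s = sigma^(1/(r-2)); a stand-in for Eq. (28), which is not shown here."""
    s = sigma ** (1.0 / (r - 2.0))
    return s ** 2 / np.mean(s ** r) ** (2.0 / r)

def weighted_c_means(X, c, r=0.55, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-3: random initial centroids, unit weights, initial clusters.
    Y = X[rng.choice(len(X), size=c, replace=False)].astype(float)
    w = np.ones(X.shape[1])
    labels = assign(X, Y, w)
    for _ in range(max_iter):
        # Step 4.1: new centroids (empty clusters keep their old centroid).
        for l in range(c):
            members = X[labels == l]
            if len(members):
                Y[l] = members.mean(axis=0)
        # Step 4.2: new clusters with the current weights.
        new_labels = assign(X, Y, w)
        # Step 4.3: new weights from the per-feature dispersions.
        sigma = feature_dispersion(X, Y, new_labels)
        w = stand_in_weights(np.maximum(sigma, 1e-12), r)
        # Stop once the partition no longer changes.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, Y, w
```

With equal dispersions in all features the stand-in rule returns unit weights, and features with large dispersion receive small weights.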

We applied the following simple heuristic: start with five random collections of $c$ initial prototypes (centroids) and choose the one that, after clustering, produces the minimal total residual error (Eq. (33)). In this way we reduce the strong dependence of c-means clustering on the initial guess of centroids, which is known to be a serious shortcoming of the algorithm. The results without this heuristic are visibly worse.
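The restart heuristic can be sketched as follows. The callable `run_once` is hypothetical: it is assumed to perform one clustering run from a random initial collection of prototypes and to return the result together with its total residual error.

```python
import random

def best_of_restarts(run_once, n_restarts=5, seed=0):
    """Run `run_once(rng)` several times and keep the run with minimal error.

    `run_once` is a hypothetical callable returning a pair (result, error).
    """
    rng = random.Random(seed)
    best_result, best_error = None, float("inf")
    for _ in range(n_restarts):
        result, error = run_once(rng)
        if error < best_error:
            best_result, best_error = result, error
    return best_result, best_error
```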

Since Eq. (28) involves the parameter $r$, the results of clustering with Algorithm 8 depend on it. Thus we present results for different values of $r$.

Following the original work of Karayiannis and Randolph-Gips [9], we adopt the following two measures of clustering quality. Both are based on the coverage of predefined groups by the computed clusters.

Let two finite sets $A$ and $B$ be given. The degree of coverage of $A$ by $B$ is defined as $|A\cap B|/|B|$ .

The first measure (cv) is the mean coverage degree of the intended predefined groups by the corresponding computed clusters.

As the second measure (suc) we consider the percentage of least-half covering, i.e. the percentage of initial sets of centroids from 30 random initial collections of five prototypes, for which the algorithm computes clusters that cover the predefined groups to a degree of at least one-half. This measure is meaningful only for algorithms whose results depend on randomly chosen initial data, as is the case for c-means clustering.
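Both measures can be sketched in a few lines of Python. The coverage function follows the definition given above. How a predefined group is matched with its "corresponding" cluster is not specified in this section, so the sketch assumes the best-matching cluster is taken; this matching rule and all function names are assumptions made for illustration.

```python
def coverage(a, b):
    """Degree of coverage of A by B as defined above: |A ∩ B| / |B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(b) if b else 0.0

def best_coverage(group, clusters):
    """ASSUMPTION: a group corresponds to the cluster giving maximal coverage."""
    return max(coverage(group, cluster) for cluster in clusters)

def mean_coverage(groups, clusters):
    """cv: mean coverage degree of the predefined groups."""
    return sum(best_coverage(g, clusters) for g in groups) / len(groups)

def least_half_rate(runs, groups):
    """suc: percentage of runs in which every group is covered to degree >= 1/2."""
    ok = sum(all(best_coverage(g, clusters) >= 0.5 for g in groups)
             for clusters in runs)
    return 100.0 * ok / len(runs)
```

Here `runs` is a list of clusterings, one per random initial collection of prototypes, and each clustering is a list of sets of record indices.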

In order to demonstrate the improvement of clustering, we compare our new algorithm with the previous one [4], without weight optimization. Besides, we compare the weighted metric clustering algorithm with other approaches to clustering of mixed data: hierarchical clustering, partitioning around medoids and spectral clustering.

Our new algorithm works on raw data and does not require any normalization. For the previous one we normalized the continuous part of the data to the $N(0,1)$ distribution. Without normalization the results of the previous algorithm, as well as the results of the other algorithms (Section 8.6), are very weak. Most other known clustering algorithms require normalization as a preliminary step. Note that normalization can sometimes disturb the original structure of clusters [10].

As benchmark sets we took four real training sets with decision categories as predefined groups: Australian Credit Approval, Heart Disease, Hepatitis, and Bank Marketing. These datasets are described in detail in [1]. Let us note that missing data in the Hepatitis and Bank Marketing data sets were imputed according to a probability distribution: a suitable normal distribution for continuous coordinates and an appropriate discrete distribution for nominal features.
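One possible way to fill in missing values in this manner is sketched below. The section does not state how the distributions were chosen, so the sketch assumes that the normal parameters and the category frequencies are estimated from the observed values of the same column; the function names are hypothetical.

```python
import numpy as np

def impute_continuous(column, rng):
    """Replace NaNs by draws from a normal law fitted to the observed values."""
    column = np.asarray(column, dtype=float).copy()
    missing = np.isnan(column)
    observed = column[~missing]
    column[missing] = rng.normal(observed.mean(), observed.std(),
                                 size=int(missing.sum()))
    return column

def impute_nominal(column, rng, missing=None):
    """Replace `missing` markers by draws from the empirical category frequencies.

    Assumes that at least one value in the column is observed.
    """
    observed = [v for v in column if v is not missing]
    cats, counts = np.unique(observed, return_counts=True)
    n_missing = len(column) - len(observed)
    draws = iter(rng.choice(cats, size=n_missing, p=counts / counts.sum()))
    return [next(draws) if v is missing else v for v in column]
```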

8.1 Heart disease

The Heart Disease dataset consists of 6 continuous and 7 nominal attributes, 370 records and 2 decision categories. The results of experiments are presented in Table 1. One can see that a particularly high coverage degree is achieved for $r=0.05$.

Table 1
The results of numerical experiments for the Heart Disease data set. The mean coverage degree (cv), the percentage of least-half covering (suc), average number of iterations ( $N$ ), the average total residual error ( $E(r)$ )

Metric	Data	$r$	cv	suc (%)	$N$	$E(r)$
Unweighted	Normalized	n/a	0.78	98	9.8	n/a
Weighted	Unnormalized	0.95	0.70	100	11.1	6.75
Weighted	Unnormalized	0.75	0.70	100	11.7	7.76
Weighted	Unnormalized	0.55	0.70	100	13.6	10.35
Weighted	Unnormalized	0.35	0.71	50	7.5	18.09
Weighted	Unnormalized	0.15	0.79	80	12.9	31.49
Weighted	Unnormalized	0.05	0.82	100	13.8	40.12

8.2 Australian credit approval

This dataset consists of 6 continuous and 8 nominal attributes, 690 records and 2 decision categories. The results of experiments for this dataset are presented in Table 2. One can see a significant improvement with respect to the algorithm based on the unweighted quadrance metric.

Table 2
The results of numerical experiments for the Australian Credit Approval data set. The mean coverage degree (cv), the percentage of least-half covering (suc), average number of iterations ( $N$ ), the average total residual error ( $E(r)$ )

Metric	Data	$r$	cv	suc (%)	$N$	$E(r)$
Unweighted	Normalized	n/a	0.66	0	11.1	n/a
Weighted	Unnormalized	0.95	0.86	100	8.7	7.05
Weighted	Unnormalized	0.75	0.86	100	11.7	8.20
Weighted	Unnormalized	0.55	0.86	100	9.3	10.80
Weighted	Unnormalized	0.35	0.73	100	12.4	19.43
Weighted	Unnormalized	0.15	0.73	100	12.8	47.56
Weighted	Unnormalized	0.05	0.73	100	16.2	82.23

Two continuous features in this dataset have very large variance. This is caused primarily by the fact that the numbers in the corresponding two columns are very large compared to the other continuous data. However, we also found that these two features have noticeably larger entropy than the remaining continuous features. We observe that the corresponding computed weights are very small and compensate for the variance of these two features.

8.3 Hepatitis

The Hepatitis dataset contains 6 continuous and 13 nominal attributes, 155 records and 2 decision categories. The results of experiments are presented in Table 3. One can see that different values of $r$ give similar results.

Table 3
The results of numerical experiments for the Hepatitis data set. The mean coverage degree (cv), the percentage of least-half covering (suc), average number of iterations ( $N$ ), the average total residual error ( $E(r)$ )

Metric	Data	$r$	cv	suc (%)	$N$	$E(r)$
Unweighted	Normalized	n/a	0.78	99	8.8	n/a
Weighted	Unnormalized	0.95	0.80	100	7.10	6.63
Weighted	Unnormalized	0.75	0.80	100	7.14	6.99
Weighted	Unnormalized	0.55	0.81	100	6.70	7.80
Weighted	Unnormalized	0.35	0.81	100	7.40	9.89
Weighted	Unnormalized	0.15	0.79	100	6.90	14.54
Weighted	Unnormalized	0.05	0.80	100	8.20	18.78

8.4 Bank marketing

This dataset contains 7 continuous and 9 nominal attributes, 4521 records and 2 decision categories. The results of experiments are presented in Table 4. In this case the measure suc improves considerably, from 38% to 100%.

Table 4
The results of numerical experiments for the Bank Marketing data set. The mean coverage degree (cv), the percentage of least-half covering (suc), average number of iterations ( $N$ ), the average total residual error ( $E(r)$ )

Metric	Data	$r$	cv	suc (%)	$N$	$E(r)$
Unweighted	Normalized	n/a	0.56	4	13.42	n/a
Weighted	Unnormalized	0.95	0.58	100	7.5	7.81
Weighted	Unnormalized	0.75	0.58	100	7.4	9.04
Weighted	Unnormalized	0.55	0.58	100	7.8	12.03
Weighted	Unnormalized	0.35	0.58	100	7.6	19.88
Weighted	Unnormalized	0.15	0.58	100	8.4	50.06
Weighted	Unnormalized	0.05	0.52	0	9.6	92.26

8.5 Artificial dataset

We have also tested our algorithm on an artificial dataset. Each record has 5 continuous and 5 nominal features. The cardinalities of the nominal domains are arbitrarily chosen as 2, 4, 2, 3 and 5. We use the weighted metric Eq. (6) with the following weights: 4.5, 2.8, 1.5, 0.0001, 0.0001 for the continuous part of the data and 0.01, 0.01, 1.15 and 2.5 for the nominal part, respectively. The predefined intended groups are two balls with random centers in this metric space. The radius of each ball is equal to 1/2 of the distance between the centers. The analyzed records, 200 for each group, are randomly chosen from the corresponding balls.

Table 5
The results of numerical experiments for the Artificial data set. The mean coverage degree (cv), the percentage of least-half covering (suc), average number of iterations ( $N$ ), the average total residual error ( $E(r)$ )

$r$	cv	suc (%)	$N$	$E(r)$
0.95	0.90	99	8.13	2.30
0.75	0.83	83	9.67	3.15
0.55	0.99	100	9.66	3.36
0.35	0.91	93	14.32	4.45
0.15	0.81	82	13.90	5.97
0.05	0.78	86	17.73	7.08

The results of experiments are presented in Table 5. The analyzed dataset was generated randomly 10 times for each value of $r$. The table contains average values.

The two features with small weights have little impact on the structure of the predefined groups. We observe that our algorithm stably adapts the sought weights in an appropriate way.

We also see that the values of both measures (cv, suc) are significantly better than those of clustering without weight adaptation ($\text{cv}=0.52$, $\text{suc}\in[47,58]$).

8.6 Comparison with other methods

Table 6 compares the weighted metric clustering algorithm with other approaches to clustering of mixed data: hierarchical clustering, partitioning around medoids and spectral clustering, which are very important approaches in most real-world applications. We do not show the measure suc for the algorithms Diana, Agnes, PAM and SC, since it is meaningful only for algorithms whose result depends on a random choice of some initial data.

Table 6
Comparison with other clustering approaches. Diana – the divisive hierarchical clustering algorithm, Agnes – the agglomerative hierarchical clustering algorithm, PAM – partitioning of the data into clusters “around medoids”, SC and SCOpt – spectral clustering

Dataset	Diana cv	Agnes cv	PAM cv	SC cv	SCOpt cv	SCOpt suc (%)	WQC cv	WQC suc (%)
Heart disease	0.83	0.82	0.78	0.50	0.82	100	0.82	100
Australian	0.86	0.50	0.80	0.50	0.65	85	0.86	100
Hepatitis	0.80	0.80	0.80	0.51	0.58	27	0.81	100
Bank marketing	0.59	0.50	0.57	0.50	0.50	0	0.58	100

For hierarchical clustering and partitioning around medoids we used their implementations in R-package cluster [12].

Diana – the divisive hierarchical clustering algorithm, implemented in function diana.

Agnes – the agglomerative hierarchical clustering algorithm, implemented in function agnes.

PAM – partitioning of the data into clusters “around medoids”, implemented in function pam.

The dissimilarity matrix for all these algorithms was calculated with the standard daisy function from the cluster package.

For spectral clustering we used its implementation in R-package kernlab [8]. We tested two variants: SC – the algorithm with the kernel function $k(\mathbf{X_{1}},\mathbf{X_{2}})=e^{-\operatorname{dist}^{2}(\mathbf{X_{1}},\mathbf{X_{2}})/2}$, where

$\displaystyle\operatorname{dist}^{2}(\mathbf{X_{1}},\mathbf{X_{2}})=\operatorname{dist}^{2}_{e}(X_{1},X_{2})+\kappa\operatorname{dist}_{h}^{2}(Y_{1},Y_{2}).$

The scaling coefficient $\kappa$ was adjusted heuristically to each data set.

SCOpt – the algorithm with the kernel function $k(\mathbf{X_{1}},\mathbf{X_{2}})=e^{-\sigma\operatorname{dist}_{e}^{2}(\mathbf{X_{1}},\mathbf{X_{2}})/2}$. In this case the kernlab function performs additional iterative optimization with respect to the parameter $\sigma$. The table contains average data from 100 tests (for the Bank Marketing data set we used only 10 tests).
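For illustration, the SC kernel on a pair of mixed records can be evaluated as in the following Python sketch. It assumes that $\operatorname{dist}_{h}$ is the plain Hamming distance (the number of differing nominal coordinates), so that $\operatorname{dist}_{h}^{2}$ is its square; this reading, the function names and the sample value of $\kappa$ are assumptions. The experiments themselves used the R package kernlab, not this code.

```python
import numpy as np

def mixed_sq_dist(x1, y1, x2, y2, kappa):
    """dist^2 = dist_e^2 + kappa * dist_h^2 for records (continuous x, nominal y)."""
    d_e2 = float(np.sum((np.asarray(x1, float) - np.asarray(x2, float)) ** 2))
    # ASSUMPTION: dist_h counts the nominal coordinates in which y1 and y2 differ.
    d_h = sum(a != b for a, b in zip(y1, y2))
    return d_e2 + kappa * d_h ** 2

def sc_kernel(x1, y1, x2, y2, kappa=1.0):
    """Kernel k = exp(-dist^2 / 2) of the SC variant."""
    return float(np.exp(-mixed_sq_dist(x1, y1, x2, y2, kappa) / 2.0))
```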

The results of experiments with Diana, Agnes, PAM, SC and SCOpt algorithms were presented in the paper [4] and we recall them here.

The continuous parts of the analyzed data sets were standardized to the $N(0,1)$ normal distribution for the Diana, Agnes, PAM, SC and SCOpt algorithms. Without this standardization the results computed by these algorithms are very weak ($\text{cv}\in(0.5,0.6]$). The weighted metric algorithm does not need such standardization and acts on the raw data.

For the artificial data sets defined in Subsection 8.5 the Diana, Agnes, PAM, SC and SCOpt algorithms produce very poor results (again $\text{cv}\in(0.5,0.6]$). This is quite understandable, since the artificial data sets were defined in terms of the weighted metric, and these algorithms have no mechanism for adapting weights.

The error analysis performed at the end of Section 7 implies that the optimal value of the parameter $r$ depends on interrelations between different features. Specifically, in our experiments the best clustering is achieved for $r=0.55$ for the Australian, Hepatitis and Bank Marketing datasets and for $r=0.05$ for the Heart Disease dataset. This reflects that the last dataset has a more balanced impact of features. We think that optimization with respect to $r$ can be performed heuristically on the basis of domain knowledge.

The results of experiments show that the weighted metric clustering can be considered as a comparable alternative for other modern algorithms.

9. Conclusion and final remarks

In this work we transferred the methodology of reformulation functions [9] to the case of combined continuous-nominal data. We used the spherical representation of nominal data [4]. The impact of specific features is modeled with corresponding weights in metric definition (Eq. (6)).

As a result we obtained a clustering algorithm with adaptation of weights. It turns out that this algorithm can successfully cluster raw, non-normalized data.

Additionally, we obtained an estimate of the total residual error of clustering (Eq. (33)). This estimate allowed us to improve the algorithm by reducing the dependence of c-means clustering on the initial guess of prototypes.

The algorithm, developed in our previous paper [4], was based on the metric of the form

$\operatorname{dist}^{2}(\mathbf{X_{1}},\mathbf{X_{2}})=\operatorname{dist}^{2}_{e}(X_{1},X_{2})+\kappa\operatorname{dist}_{h}^{2}(Y_{1},Y_{2}).$

In order to achieve reasonable results of clustering, the parameter $\kappa$ had to be tuned ( $\kappa\in[1,30]$ ). The considerable advantage of our new algorithm is that no tuning is needed.

Let us mention that, as a side effect of the methodology used, some new characteristics of the data appeared: the matrix $A$ (Eq. (27)) and the vectors $\sigma$ and $\xi$ (Eq. (32)). The matrix $A$ can be extended in a natural way to pairs of continuous-continuous and continuous-nominal features. This extension and a possible statistical interpretation of these parameters will be the subject of our future work.

As a possible application of the proposed algorithm, let us mention that adapted metric could be considered as a starting point for the other machine learning algorithms based on a dissimilarity matrix.

Appendix

Distortion of embedding of the weighted Hamming metric space into a sphere

In this section we estimate the distortion of embedding (Eq. (7)) with respect to the weighted Hamming metric on the discrete space and the weighted quadrance metric on the sphere.

Let $A=A_{1}\times\ldots\times A_{m}$ be a discrete space with $|A_{\beta}|=a_{\beta}$ for $\beta=1,\ldots,m$ . Let $U=(u_{1},\ldots,u_{m})$ be a weight vector ( $u_{\beta}>0$ ). The weighted Hamming metric is defined as (Eq. (6))

$\operatorname{dist}_{U,h}(Y_{1},Y_{2})=\frac{1}{m}\sum_{\beta=1}^{m}u_{\beta}\operatorname{diff}(y_{1,\beta},y_{2,\beta}),$

where $\operatorname{diff}$ is the difference function Eq. (4).
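The weighted Hamming metric above can be computed as in the following short Python sketch. It assumes that the difference function of Eq. (4), which is not reproduced in this section, equals 1 for distinct values and 0 for equal ones; the function name is hypothetical.

```python
def weighted_hamming(y1, y2, u):
    """(1/m) * sum over beta of u_beta * diff(y1_beta, y2_beta)."""
    m = len(u)
    # ASSUMPTION: diff(a, b) is 1 if a != b and 0 otherwise.
    return sum(w * (a != b) for w, a, b in zip(u, y1, y2)) / m
```

For example, two records that differ in the second and third of three nominal coordinates, with weights 0.5, 2.0 and 1.0, are at distance $(2.0+1.0)/3=1.0$.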

The weighted quadrance metric after the embedding Eq. (7) into the unit sphere is defined in Eq. (8) as

$\operatorname{dist}_{U,q}(Y_{1},Y_{2})=\sum_{\beta=1}^{m}u_{\beta}\|y_{1,\beta}-y_{2,\beta}\|^{2}.$

The distortion of a mapping measures how the metric changes after the mapping. The formal definition is as follows.
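For orientation, one standard formulation from the literature on low-distortion embeddings is recalled here; it is an assumption that it coincides with the definition intended in this section. A mapping $f\colon(X,d)\to(X^{\prime},d^{\prime})$ between metric spaces has distortion at most $D\geq 1$ if there is a scaling constant $\lambda>0$ such that

$\displaystyle\lambda\,d(x,y)\leq d^{\prime}(f(x),f(y))\leq D\,\lambda\,d(x,y)\quad\text{for all }x,y\in X,$

and the distortion of $f$ is the smallest such $D$. An isometric embedding, possibly after rescaling, has distortion $1$.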

The following theorem estimates the distortion of embedding Eq. (7).

Theorem 1. The distortion of embedding Eq. (7) is not greater than

$\displaystyle\frac{a_{\min}}{a_{\max}}\frac{a_{\max}-1}{a_{\min}-1},$ (34)

where $a_{\min}=\min\limits_{\beta=1,\ldots,m}a_{\beta}$ , $a_{\max}=\max\limits_{\beta=1,\ldots,m}a_{\beta}$ .
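As a worked example of the bound (34), take the cardinalities 2, 4, 2, 3 and 5 of the nominal domains used for the artificial dataset in Section 8.5. Then $a_{\min}=2$, $a_{\max}=5$ and

$\displaystyle\frac{a_{\min}}{a_{\max}}\frac{a_{\max}-1}{a_{\min}-1}=\frac{2}{5}\cdot\frac{5-1}{2-1}=\frac{8}{5}=1.6.$

If all nominal domains have the same cardinality, $a_{\min}=a_{\max}$, the bound equals $1$.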

This estimate shows that the embedding (Eq. (7)) gives much better distortion than embeddings into the Euclidean space or into a sphere with the standard spherical metric. We refer to the article [4] for discussion on importance of low-distortion embeddings and further references.

References

1. Asuncion and Newman D.J., UCI machine learning repository, 2007.

2. Bezdek J.C., Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer Academic Publishers, Norwell, MA, USA, 1981.

3. Denisiuk and Grabowski, A variant of the k-means clustering algorithm for continuous-nominal data, in: Burduk, Jackowski, Kurzyński, Woźniak and Żołnierek, editors, Proceedings of the 9th International Conference on Computer Recognition Systems CORES 2015, volume 403 of Advances in Intelligent Systems and Computing, Springer, 2016, pages 17–26.

4. Denisiuk and Grabowski, Low distortion embedding of the Hamming space into a sphere with quadrance metric and k-means clustering of nominal-continuous data, Fundamenta Informaticae 153(3) (2017), 221–233.

5. Deza and Deza M.M., Encyclopedia of Distances, Springer-Verlag, Berlin Heidelberg, 2009.

6. Dyckhoff and Pedrycz, Generalized means as model of compensative connectives, Fuzzy Sets and Systems 14(2) (1984), 143–154.

7. Indyk and Matoušek, Low-distortion embeddings of finite metric spaces, in: Handbook of Discrete and Computational Geometry, CRC Press, 2004, pages 177–196.

8. Karatzoglou, Smola, Hornik and Zeileis, kernlab – an S4 package for kernel methods in R, Journal of Statistical Software 11(9) (2004), 1–20.

9. Karayiannis N.B. and Randolph-Gips M.M., Non-euclidean c-means clustering algorithms, Intell. Data Anal 7(5) (2003), 405–425.

10. Kaufman and Rousseeuw P.J., Finding Groups in Data: An Introduction to Cluster Analysis, volume 344 of Wiley Series in Probability and Statistics, John Wiley, 2008.

11. Linial, Finite metric spaces – combinatorics, geometry and algorithms, in: Ta-Tsien, editor, Proceedings of the International Congress of Mathematicians 2002, volume III, 2002, pages 573–586.

12. Maechler, Rousseeuw, Struyf, Hubert and Hornik, Cluster: Cluster Analysis Basics and Extensions, 2016. R package version 2.0.5. For new features, see the 'Changelog' file (in the package source).

13. Mirkin, Clustering: A Data Recovery Approach, Second Edition, Chapman & Hall/CRC Computer Science & Data Analysis, CRC Press, 2016.

14. von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17(4) (2007), 395–416.
