Cluster Width of probability Density functions

Abstract

This study establishes new results for the Cluster Width of probability Density functions (CWD). We derive upper and lower bounds for CWD and its relationships to other measures used in statistical discriminant analysis. The CWD of two and of more than two probability density functions is determined in different cases. Based on CWD, we propose a measure called the similar coefficient to evaluate the quality of the established clusters. Furthermore, CWD is used as a criterion to build two algorithms: one to determine a suitable number of clusters and one to analyse fuzzy clusters. Numerical examples illustrate the proposed algorithms and show their advantages over existing methods.

Keywords

Cluster, density, distance, fuzzy, width

1. Introduction

Clustering, which partitions a large unlabelled data set into groups whose elements share similar properties, is a basic method in data mining and statistics. It is an important first step for extracting basic information from data before carrying out more in-depth analysis. Storing, extracting and analyzing data now play an important role in, and have a great influence on, the development of many scientific fields. For this reason, cluster analysis has attracted the interest of many statisticians [1, 8]. Clusters can be built for discrete elements (CDE) [2, 4] and for probability density functions (CPD) [21]. CPD, which is useful for mining high-volume data from uncertain sources, has attracted considerable interest from researchers. It also provides more advantages than CDE in many applications [1, 21]. In both cases, the most important problem is to find a suitable criterion for building clusters and for evaluating their quality. For CDE, distance is the main criterion: the Euclidean distance, Chebyshev distance, city block distance, etc. for two elements, and the minimum distance, maximum distance, mean distance, etc. for two groups [22]. For two pdfs, several distances have been widely used, such as the $L^{p}$-distance, the Bhattacharyya distance, the divergence distance and the Hellinger distance [2, 22]. For more than two pdfs, some measures have also been introduced: the affinity of Matusita [10] and Toussaint [19], and the separated measure of Glick [5]. However, these measures are not used as criteria for CPD. Because they are defined by integrating a weighted product of pdfs, their computation is rather complex, and they are not easy to visualize. According to [11, 14], criteria for building CPD have not yet been studied much.
Based on the distance between two pdfs, Goh and Vidal [6] made an initial contribution to CPD with a new algorithm, which was later improved by Montanari and Calo [11]. Starting from Glick's concept of the separated measure, Pham-Gia et al. [15] first proposed a definition of the $L^{1}$-distance between more than two pdfs. From this definition, Vovan and Pham-Gia [21] introduced the concept of the cluster width of pdfs and used it as a criterion to build CPD, establishing hierarchical and non-hierarchical approaches based on it. However, CWD was not fully investigated in that work. The bounds of CWD and the relations between CWD and other measures and related quantities in statistical discriminant analysis, for two and for more than two pdfs, were not established. CWD was only approximated by the quasi Monte-Carlo method, without deriving a specific expression. This article supplements the theory of CWD with new results and examines how to determine it. In addition, CWD is used as a criterion to build fuzzy clusters.

In fuzzy cluster analysis, one of the most challenging problems is how to determine a suitable number of clusters. In the literature, the number of clusters is often determined by prior knowledge about the data. Vovan and Pham-Gia [21] proposed an algorithm with a bound on CWD to solve this problem. However, the result of this algorithm still depends on the CWD and on the intercluster distances, so it is not suitable in many cases. Chen and Hung [1] and Thao and Vovan [18] also gave a method to determine the number of clusters automatically. Many examples show that this method has advantages over previously proposed ones. However, it is suitable only when the degree of overlap of the pdfs is not too great; in many applications, the number of clusters it determines is incorrect. Based on CWD, we propose an algorithm to determine the Suitable Number of Clusters (SNC) that has advantages over existing methods. Another problem is evaluating the quality of the established clusters. To our knowledge, this problem has not been addressed so far. We need a criterion to compare the similarity level of clusters whose numbers of elements are distinctly different. For this reason, we propose a new measure called the Cluster Similar Coefficient (CSC) of pdfs. The advantages of the two proposed algorithms are tested on numerical examples. All computations for the proposed algorithms are performed by Matlab procedures, which handle the complicated calculations of the numerical examples effectively.

The remainder of the paper is arranged as follows. Section 2 presents the definition of CWD and related concepts, and determines the CWD of two and of more than two pdfs. Section 3 gives some results about CWD. Section 4 proposes two new related algorithms for fuzzy clustering of pdfs. Section 5 presents numerical examples, including synthetic data, benchmark data and real data. The final section concludes the paper.

2. Cluster width of probability density functions

2.1 Some definitions

Let $F=\{f_{1}(x),f_{2}(x),\ldots,f_{k}(x)\}$ be a set of pdfs on $R^{n}$ $(n\geqslant 1)$, and let $f_{\max}(x)=\max\{f_{1}(x),f_{2}(x),\ldots,f_{k}(x)\}$. Then we have the following definitions:

Definition.

CWD of $F$ is defined by the following expression:

$\displaystyle w(F)=\int\limits_{{R^{n}}}{{f_{\max}}(x)dx-1}.$ (1)

If $F$ has only one pdf then CWD will equal 0.

Definition.

Let $f,f_{1},f_{2},\ldots,f_{k},g_{1},g_{2},\ldots,g_{l}$ be pdfs. Then the CWD of $\{(f),(f_{1},f_{2},\ldots,f_{k})\}$ and of $\{(f_{1},f_{2},\ldots,f_{k}),(g_{1},g_{2},\ldots,g_{l})\}$ are defined by

$\displaystyle w\left[\{f\}\cup\{f_{1},f_{2},\ldots,f_{k}\}\right]\ {\rm and}\ w\left[\{f_{1},f_{2},\ldots,f_{k}\}\cup\{g_{1},g_{2},\ldots,g_{l}\}\right].$ (2)

From Eqs (1) and (2), we have the following results:

(i)

$w({f_{1}},{f_{2}},\ldots,{f_{k}})$ is a non-decreasing function in $k$ and $0\leqslant w(F)\leqslant k-1$ . The equality on the left occurs when ${f_{{i}}}\left(x\right),i={{1}},{{2}},\ldots,k$ are identical and on the right occurs when ${f_{{i}}}\left(x\right)$ have disjoint supports. We also see that the smaller the cluster width is, the higher the similarity degree of the pdfs is.

(ii)

The relations between the CWD of two consecutive clusters that differ by only one element, and between the CWD of two clusters and that of their union, are as follows:

$\displaystyle w(f_{1},f_{2},\ldots,f_{k+1})-w(f_{1},f_{2},\ldots,f_{k})=1-\int\limits_{R^{n}}\min\{h_{1}(x),f_{k+1}(x)\}dx,$

$\displaystyle w(f_{1},f_{2},\ldots,f_{k})=w(f_{1},f_{2},\ldots,f_{m})+w(f_{m+1},f_{m+2},\ldots,f_{k})+1-B,$

where

$\displaystyle h_{1}(x)=\max\{f_{1}(x),f_{2}(x),\ldots,f_{k}(x)\},\quad B=\int\limits_{R^{n}}\min\{h_{2}(x),h_{3}(x)\}dx,\quad m<k,$

$\displaystyle h_{2}(x)=\max\{f_{1}(x),f_{2}(x),\ldots,f_{m}(x)\},$

$\displaystyle h_{3}(x)=\max\{f_{m+1}(x),f_{m+2}(x),\ldots,f_{k}(x)\}.$

(iii)

For $k=2$ , we have

$\displaystyle w(f_{1},f_{2})=\frac{1}{2}\left\|f_{1}-f_{2}\right\|_{1}=\frac{1}{2}\int\limits_{R^{n}}|f_{1}(x)-f_{2}(x)|dx=\int\limits_{R^{n}}f_{\max}(x)dx-1.$ (3)
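Properties (i) to (iii) can be checked numerically. The following Python sketch is not part of the original study, which used Matlab. It approximates Eq. (1) by a midpoint rule on a truncated interval for univariate normal pdfs. The helper names `normal_pdf`, `integrate` and `cluster_width`, the integration interval and the chosen parameters are illustrative assumptions.

```python
import math

def normal_pdf(mu, sigma):
    """Density of N(mu, sigma^2), returned as a function of x."""
    def f(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return f

def integrate(g, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))

def cluster_width(pdfs, lo, hi):
    """Eq. (1): w(F) = integral of f_max(x) dx - 1, truncated to [lo, hi]."""
    return integrate(lambda x: max(f(x) for f in pdfs), lo, hi) - 1.0

# Illustrative pdfs (assumed): f3 is almost disjoint from f1 and f2.
f1, f2, f3 = normal_pdf(0, 1), normal_pdf(1, 1), normal_pdf(50, 1)
lo, hi = -10.0, 60.0

w_same = cluster_width([f1, f1], lo, hi)      # identical pdfs: width close to 0
w12 = cluster_width([f1, f2], lo, hi)
w123 = cluster_width([f1, f2, f3], lo, hi)

# Property (iii): for k = 2 the width is half the L1 distance.
half_l1 = 0.5 * integrate(lambda x: abs(f1(x) - f2(x)), lo, hi)

# Property (ii): adding f3 increases the width by 1 - integral of min{h1, f3}.
overlap = integrate(lambda x: min(max(f1(x), f2(x)), f3(x)), lo, hi)

print(w_same, w12, half_l1, w123, w12 + 1.0 - overlap)
```

In this setting `w12` and `half_l1` should agree up to integration error, and `w123` should lie between `w12` and the upper bound $k-1=2$.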

Definition.

CSC of $F$ is defined by

$\displaystyle c\left(F\right)=1-\frac{1}{{k-1}}w(F),k\geqslant 2.$ (4)

The CSC of a cluster with only one pdf is defined as 1.

We have

$\displaystyle f_{i}(x)\leqslant\max\{f_{1}(x),f_{2}(x),\ldots,f_{k}(x)\}\leqslant\sum\limits_{j=1}^{k}f_{j}(x),\quad i=1,2,\ldots,k.$

Because $\int_{{R^{n}}}{f_{i}}(x)dx=1$ ,

$\displaystyle 1\leqslant\int\limits_{R^{n}}f_{\max}(x)dx\leqslant k\ \text{ or }\ 0\leqslant 1-\frac{1}{k}\int\limits_{R^{n}}f_{\max}(x)dx\leqslant 1-\frac{1}{k}.$

From the above results, $0\leqslant w(F)\leqslant k-1$, and Eq. (4) then gives

$\displaystyle 0\leqslant c(F)\leqslant 1.$ (5)

The equality on the right of Eq. (5) occurs when all ${f_{{i}}}\left(x\right),i=1,2,\ldots,k$ are identical, and on the left of Eq. (5) occurs when ${f_{{i}}}\left(x\right)$ have disjoint supports.

From Eq. (4), we see that CSC determines the overlapping level of pdfs, and it has been standardized on [0,1]. We may consider that the CSC is the mean of overlapping coefficients of pdfs. The larger the CSC is, the higher the similarity degree of the pdfs is.
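As an illustration of Eq. (4), the sketch below evaluates the CSC for three hypothetical clusters of univariate normal pdfs: tightly overlapping, moderately overlapping and almost disjoint. The function names, the midpoint-rule integration and the parameters are assumptions made for this example, not the authors' Matlab code.

```python
import math

def normal_pdf(mu, sigma):
    """Density of N(mu, sigma^2), returned as a function of x."""
    def f(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return f

def cluster_width(pdfs, lo, hi, n=20000):
    """Eq. (1) by the midpoint rule on [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(max(f(lo + (i + 0.5) * h) for f in pdfs) for i in range(n)) - 1.0

def cluster_similar_coefficient(pdfs, lo, hi):
    """Eq. (4): c(F) = 1 - w(F)/(k-1); defined as 1 for a single pdf."""
    k = len(pdfs)
    if k == 1:
        return 1.0
    return 1.0 - cluster_width(pdfs, lo, hi) / (k - 1)

# Three illustrative clusters of three pdfs each (assumed parameters).
tight = [normal_pdf(mu, 1.0) for mu in (0.0, 0.1, 0.2)]
loose = [normal_pdf(mu, 1.0) for mu in (0.0, 2.0, 4.0)]
apart = [normal_pdf(mu, 1.0) for mu in (0.0, 40.0, 80.0)]

csc_tight = cluster_similar_coefficient(tight, -10.0, 90.0)
csc_loose = cluster_similar_coefficient(loose, -10.0, 90.0)
csc_apart = cluster_similar_coefficient(apart, -10.0, 90.0)
print(csc_tight, csc_loose, csc_apart)
```

The three values should decrease from close to 1 for the tight cluster to close to 0 for the almost disjoint one, in line with Eq. (5).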

2.2 Determining the cluster width

(i) For two pdfs

Let $f_{1}(x)$ and $f_{2}(x)$ be two pdfs, $f_{{\text{min}}}={\text{ min}}\{f_{1}(x),f_{2}(x)\}$ , $f_{{\text{max}}}={\text{ max}}\{f_{1}(x),f_{2}(x)\}$ and let ${\lambda_{1,2}}$ be the overlap area measure of ${f_{{1}}}\left(x\right),{f_{{2}}}\left(x\right).$ From Eq. (3), we have

$\displaystyle w(f_{1},f_{2})=1-\lambda_{1,2}=\frac{1}{2}\left[\int\limits_{R^{n}}f_{\max}(x)dx-\int\limits_{R^{n}}f_{\min}(x)dx\right]=\int\limits_{R^{n}}f_{\max}(x)dx-1.$

From the above result, we can establish the specific expressions for CWD where $f_{1}(x),f_{2}(x)$ are two normal pdfs.
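The text does not list the specific expressions here. As one illustrative special case (a standard fact, not taken from the source): for two univariate normal pdfs with a common standard deviation $\sigma$, the overlap is $\lambda_{1,2}=2\Phi(-d/2)$ with $d=|\mu_{1}-\mu_{2}|/\sigma$, so that $w(f_{1},f_{2})=2\Phi(d/2)-1=\mathrm{erf}\left(d/(2\sqrt{2})\right)$. The Python sketch below compares this closed form with a midpoint-rule evaluation of Eq. (3); the function names and integration limits are assumptions.

```python
import math

def width_equal_variance_normals(mu1, mu2, sigma):
    """Closed form of w(f1, f2) for N(mu1, sigma^2) and N(mu2, sigma^2)."""
    d = abs(mu1 - mu2) / sigma
    return math.erf(d / (2.0 * math.sqrt(2.0)))   # equals 2*Phi(d/2) - 1

def width_numeric(mu1, mu2, sigma, n=20000):
    """Eq. (3) by the midpoint rule: half the L1 distance between the pdfs."""
    lo = min(mu1, mu2) - 10.0 * sigma             # truncation is an assumption
    hi = max(mu1, mu2) + 10.0 * sigma
    h = (hi - lo) / n
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        f1 = c * math.exp(-0.5 * ((x - mu1) / sigma) ** 2)
        f2 = c * math.exp(-0.5 * ((x - mu2) / sigma) ** 2)
        total += abs(f1 - f2)
    return 0.5 * h * total

w_closed = width_equal_variance_normals(0.0, 1.0, 1.0)
w_num = width_numeric(0.0, 1.0, 1.0)
print(w_closed, w_num)
```

When the variances differ, the two densities cross at two points and the expression involves the normal distribution function evaluated at both crossings; that case is not covered by this sketch.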

(ii) For more than two pdfs

Theorem.

Let $F=\{f_{1}(x),f_{2}(x),\ldots,f_{k}(x)\}$, $k\geqslant 3$, be $k$ pdfs defined on $R^{n}$, $n\geqslant 1$, and let $q_{i}\in(0,1)$, $\sum_{i=1}^{k}q_{i}=1$. Define

$\displaystyle\left\{\begin{array}{l}R_{1}^{n}=\left\{x\in R^{n}:q_{1}f_{1}(x)>q_{j}f_{j}(x),\ 2\leqslant j\leqslant k\right\},\\ R_{k}^{n}=\left\{x\in R^{n}:q_{k}f_{k}(x)>q_{j}f_{j}(x),\ 1\leqslant j\leqslant k-1\right\},\\ R_{l}^{n}=\left\{x\in R^{n}:q_{l}f_{l}(x)>q_{i}f_{i}(x),\ 1\leqslant i\neq l\leqslant k\right\},\ 2\leqslant l\leqslant k-1.\end{array}\right.$ (6)

The cluster width of $F$ is determined by

$\displaystyle w(F)=\int\limits_{R_{1}^{n}}f_{1}(x)dx+\sum\limits_{l=2}^{k-1}\int\limits_{R_{l}^{n}}f_{l}(x)dx+\int\limits_{R_{k}^{n}}f_{k}(x)dx-1.$ (7)

Proof.

To obtain Eq. (7), we need to prove the following two results:

$\displaystyle R_{i}^{n}\cap R_{j}^{n}=\emptyset,\quad(1\leqslant i\neq j\leqslant k)$

and

$\displaystyle\bigcup\limits_{i=1}^{k}R_{i}^{n}=R_{1}^{n}\cup\left(\bigcup\limits_{l=2}^{k-1}R_{l}^{n}\right)\cup R_{k}^{n}=R^{n},\quad f_{\max}(x)=f_{i}(x),\ \forall x\in R_{i}^{n}.$

Let $\bar{A}={R^{n}}\backslash A,$ we have

$\displaystyle\bar{R}_{ij}=\left\{x\in R^{n}:f_{i}(x)\leqslant f_{j}(x)\right\},\quad R_{ij}=\left\{x\in R^{n}:f_{i}(x)>f_{j}(x)\right\},\quad(1\leqslant i,j\leqslant k).$

From Eq. (6), we obtain

$\displaystyle R_{1}^{n}=\bigcap\limits_{j=2}^{k}R_{1j},\quad R_{l}^{n}\subset\bigcap\limits_{i\neq l}\bar{R}_{il},\quad(2\leqslant l\leqslant k-1).$

Therefore,

$\displaystyle R_{1}^{n}\cap R_{l}^{n}\subset\left(\bigcap\limits_{j=2}^{k}R_{1j}\right)\cap\left(\bigcap\limits_{i\neq l}\bar{R}_{il}\right)\subset R_{1l}\cap\bar{R}_{1l}=\emptyset\Rightarrow R_{1}^{n}\cap R_{l}^{n}=\emptyset.$

Similarly,

$\displaystyle R_{k}^{n}\cap R_{l}^{n}=\emptyset,\quad(2\leqslant l<k),\quad R_{1}^{n}\cap R_{k}^{n}=\emptyset.$

On the other hand, by De Morgan's laws,

$\displaystyle\overline{\bigcup\limits_{i=1}^{k}R_{i}^{n}}=\bigcap\limits_{i=1}^{k}\overline{R_{i}^{n}},$

and a point $x$ belongs to this set only if the maximum $f_{\max}(x)$ is attained by at least two of the pdfs. Assuming that such ties occur only on a set of measure zero, the regions cover $R^{n}$ up to a null set, which does not affect the integrals in Eq. (7):

$\displaystyle\bigcup\limits_{i=1}^{k}R_{i}^{n}=R_{1}^{n}\cup\left(\bigcup\limits_{l=2}^{k-1}R_{l}^{n}\right)\cup R_{k}^{n}=R^{n}.$

In addition, from Eq. (6) we can directly find out

$\displaystyle{f_{\max}}(x)={f_{i}}(x),\forall x\in R_{i}^{n},\left({1\leqslant i% \leqslant k}\right).$

∎

(iii) In real applications, for the one-dimensional case, we have built an algorithm to find the specific expression of $f_{\max}(x)=\max\{f_{1}(x),f_{2}(x),\ldots,f_{k}(x)\}$, from which CWD as well as CSC can be calculated. The pseudocode of this algorithm is presented in Algorithm 1.

Algorithm 1. Finding the $f_{\max}(x)$ function
Input: Set of pdfs $\{f_{1},f_{2},\ldots,f_{k}\}$, $\varepsilon>0$ (a very small positive number).
Output: The $f_{\max}(x)$ function.
Find all roots of the equations $f_{i}(x)=f_{j}(x)$, $i=1,2,\ldots,k-1$; $j=i+1,\ldots,k$. Let $B$ be the set of all roots.
for $x_{lm}\in B$ do
  for $p\in\{1,2,\ldots,k\}\backslash\{l,m\}$ do
    if $f_{l}(x_{lm})<f_{p}(x_{lm})$ then $B=B\backslash\{x_{lm}\}$;
    end if
  end for
end for
Arrange the elements of $B$ in increasing order:
$B=\{x_{1},x_{2},\ldots,x_{h}\}$, $x_{1}<x_{2}<\ldots<x_{h}$.
(Determine $f_{\max}(x)$ on the interval $(-\infty;x_{1}]$)
for $i=1$ to $k$ do
  if $f_{i}(x_{1}-\varepsilon)=\max\{f_{1}(x_{1}-\varepsilon),f_{2}(x_{1}-\varepsilon),\ldots,f_{k}(x_{1}-\varepsilon)\}$ then
    $f_{\max}(x)=f_{i}(x)$ for all $x\in(-\infty;x_{1}]$;
  end if
end for
(Determine $f_{\max}(x)$ on the intervals $(x_{j};x_{j+1}]$, $j=1,2,\ldots,h-1$)
for $i=1$ to $k$ do
  for $j=1$ to $h-1$ do
    if $f_{i}(x_{j}+\varepsilon)=\max\{f_{1}(x_{j}+\varepsilon),f_{2}(x_{j}+\varepsilon),\ldots,f_{k}(x_{j}+\varepsilon)\}$ then
      $f_{\max}(x)=f_{i}(x)$ for all $x\in(x_{j};x_{j+1}]$;
    end if
  end for
end for
(Determine $f_{\max}(x)$ on the interval $(x_{h};+\infty)$)
for $i=1$ to $k$ do
  if $f_{i}(x_{h}+\varepsilon)=\max\{f_{1}(x_{h}+\varepsilon),f_{2}(x_{h}+\varepsilon),\ldots,f_{k}(x_{h}+\varepsilon)\}$ then
    $f_{\max}(x)=f_{i}(x)$ for all $x\in(x_{h};+\infty)$;
  end if
end for
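The steps of Algorithm 1 can be sketched in Python for the special case of univariate normal pdfs. This is an illustration under stated assumptions, not the authors' Matlab procedure: each pdf is a hypothetical `(mu, sigma)` pair, the roots of $f_{i}(x)=f_{j}(x)$ are obtained from the quadratic equation that results from taking logarithms, a probe point inside each interval replaces the $x\pm\varepsilon$ evaluations, and the pieces are integrated with the normal distribution function via `math.erf`.

```python
import math

def pdf(p, x):
    mu, s = p
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def cdf(p, x):
    mu, s = p
    return 0.5 * (1.0 + math.erf((x - mu) / (s * math.sqrt(2.0))))

def crossings(p, q):
    """Real roots of f_p(x) = f_q(x) for two normal pdfs (quadratic in x)."""
    (m1, s1), (m2, s2) = p, q
    a = 0.5 / s2 ** 2 - 0.5 / s1 ** 2
    b = m1 / s1 ** 2 - m2 / s2 ** 2
    c = 0.5 * m2 ** 2 / s2 ** 2 - 0.5 * m1 ** 2 / s1 ** 2 + math.log(s2 / s1)
    if abs(a) < 1e-14:                       # equal variances: linear equation
        return [] if abs(b) < 1e-14 else [-c / b]
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return []
    r = math.sqrt(disc)
    return [(-b - r) / (2.0 * a), (-b + r) / (2.0 * a)]

def f_max_pieces(params):
    """Intervals (left, right, i) on which f_max coincides with pdf number i."""
    roots = []
    for i in range(len(params)):
        for j in range(i + 1, len(params)):
            for x in crossings(params[i], params[j]):
                top = max(pdf(p, x) for p in params)
                # Keep a root only if the crossing lies on the upper envelope.
                if pdf(params[i], x) >= top * (1.0 - 1e-9):
                    roots.append(x)
    edges = [-math.inf] + sorted(roots) + [math.inf]
    pieces = []
    for left, right in zip(edges[:-1], edges[1:]):
        if math.isinf(left) and math.isinf(right):
            probe = 0.0
        elif math.isinf(left):
            probe = right - 1.0
        elif math.isinf(right):
            probe = left + 1.0
        else:
            probe = 0.5 * (left + right)
        winner = max(range(len(params)), key=lambda i: pdf(params[i], probe))
        pieces.append((left, right, winner))
    return pieces

def cluster_width_exact(params):
    """w(F) = sum over pieces of the integral of the dominating pdf, minus 1."""
    total = sum(cdf(params[i], right) - cdf(params[i], left)
                for left, right, i in f_max_pieces(params))
    return total - 1.0

print(cluster_width_exact([(0.0, 1.0), (2.0, 0.5), (5.0, 2.0)]))
```

The sketch assumes that no two pdfs are identical and that ties for the maximum occur only at isolated points.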

Figure 1. The graph of seven one-dimensional pdfs and their $f_{\max}(x)$.

Figure 2. The graph of three bivariate normal pdfs and their intersections.

From this algorithm, we have written a Matlab procedure to find $f_{\max}(x)$. For example, Fig. 1 shows the $f_{\max}(x)$ found by the program for 7 normal pdfs. Once $f_{\max}(x)$ is determined, CWD and CSC are easily calculated. In the multi-dimensional case, it is very complicated to obtain a specific expression for CWD and CSC. The difficulty comes from the widely varying forms of the intersection curves between the density surfaces, which may be normal or non-normal [13]. For example, Fig. 2 shows the intersections of three normal surfaces. This problem has been studied in [15, 20, 21], but an optimal solution has not been found. In this article, we do not derive an expression for CWD and CSC in the multi-dimensional case; they are computed by the quasi Monte-Carlo method as in [15]. An algorithm for these calculations has been constructed, and a corresponding Matlab procedure has been written.

3. Some results of the cluster width

3.1 Upper and lower bounds

Theorem.

Given the set of pdfs $F=\{f_{1}(x),f_{2}(x),\ldots,f_{k}(x)\}$, $k\geqslant 2$, the following upper and lower bounds of $w(F)$ hold:

$\displaystyle{\rm i)}\ \max\limits_{i<j}\{w(f_{i},f_{j})\}\leqslant w(f_{1},f_{2},\ldots,f_{k})\leqslant\frac{2}{k}\sum\limits_{i<j}w(f_{i},f_{j}).$ (8)

$\displaystyle{\rm ii)}\ \max\limits_{i<j}\{w(f_{i},f_{j})\}\leqslant\sum\limits_{i=1}^{k}w(f_{i},\bar{f})\leqslant\frac{2}{k}\sum\limits_{i<j}w(f_{i},f_{j}),$ (9)

where $\bar{f}=\frac{1}{k}\sum_{i=1}^{k}f_{i}$ is the mean of the pdfs.

Proof.

i) We have

$\displaystyle\int\limits_{R^{n}}f_{\max}(x)dx\geqslant\max\limits_{i<j}\left\{\int\limits_{R^{n}}\max\{f_{i}(x),f_{j}(x)\}dx\right\}=\max\limits_{i<j}\left\{\frac{1}{2}\int\limits_{R^{n}}|f_{i}(x)-f_{j}(x)|dx+\frac{1}{2}\int\limits_{R^{n}}[f_{i}(x)+f_{j}(x)]dx\right\}=\max\limits_{i<j}\{w(f_{i},f_{j})\}+1.$

Therefore,

$\displaystyle w(f_{1},f_{2},\ldots,f_{k})\geqslant\max\limits_{i<j}\left\{w(f_{i},f_{j})\right\}.$ (10)

In addition, for every $x$,

$\displaystyle\sum\limits_{i<j}|f_{i}(x)-f_{j}(x)|\geqslant\sum\limits_{j=1}^{k}[f_{\max}(x)-f_{j}(x)]=kf_{\max}(x)-\sum\limits_{j=1}^{k}f_{j}(x).$

As a result,

$\displaystyle f_{\max}(x)\leqslant\frac{1}{k}\sum\limits_{i<j}|f_{i}(x)-f_{j}(x)|+\frac{1}{k}\sum\limits_{j=1}^{k}f_{j}(x).$

Because $\int_{R^{n}}\sum_{j=1}^{k}f_{j}(x)dx=k$, integrating the above inequality gives

$\displaystyle\int\limits_{R^{n}}f_{\max}(x)dx-1\leqslant\frac{1}{k}\sum\limits_{i<j}\int\limits_{R^{n}}|f_{i}(x)-f_{j}(x)|dx=\frac{2}{k}\sum\limits_{i<j}w(f_{i},f_{j}),$

that is,

$\displaystyle w(f_{1},f_{2},\ldots,f_{k})\leqslant\frac{2}{k}\sum\limits_{i<j}w(f_{i},f_{j}).$ (11)

From Eqs (10) and (11), we obtain Eq. (8).

ii)

We have

$\displaystyle\max\limits_{i<j}|f_{i}-f_{j}|\leqslant\max\limits_{i<j}\left(|f_{i}-\bar{f}|+|f_{j}-\bar{f}|\right)\leqslant\sum\limits_{i=1}^{k}|f_{i}-\bar{f}|=\sum\limits_{i=1}^{k}\left|f_{i}-\frac{1}{k}\sum\limits_{j=1}^{k}f_{j}\right|=\frac{1}{k}\sum\limits_{i=1}^{k}\left|\sum\limits_{j=1}^{k}(f_{i}-f_{j})\right|\leqslant\frac{1}{k}\sum\limits_{i=1}^{k}\sum\limits_{j=1}^{k}|f_{i}-f_{j}|=\frac{2}{k}\sum\limits_{i<j}|f_{i}-f_{j}|.$

Integrating on ${R^{n}}$ both sides of inequality, we will obtain Eq. (9).

∎

On the other hand, we have

$\displaystyle\frac{2}{k}\sum\limits_{i<j}w(f_{i},f_{j})\leqslant\sum\limits_{i<j}w(f_{i},f_{j}),\quad\forall k\geqslant 2.$

From the above result, the two inequalities in Eqs (8) and (9) respectively become:

$\displaystyle\max\limits_{i<j}\{w(f_{i},f_{j})\}\leqslant w(f_{1},f_{2},\ldots,f_{k})\leqslant\sum\limits_{i<j}w(f_{i},f_{j}).$

$\displaystyle\max\limits_{i<j}\{w(f_{i},f_{j})\}\leqslant\sum\limits_{i=1}^{k}w(f_{i},\bar{f})\leqslant\sum\limits_{i<j}w(f_{i},f_{j}).$

These two results show that $w(f_{1},f_{2},\ldots,f_{k})$ and $\sum_{i=1}^{k}w(f_{i},\bar{f})$ are separated measures in the sense of Glick [5].
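The bounds in Eqs (8) and (9) can be checked numerically. The sketch below is an illustration only: it samples four hypothetical univariate normal pdfs on a grid, evaluates every width with a midpoint rule, and compares the quantities appearing in the two inequalities. The names `sample_normal` and `width`, the grid and the parameters are assumptions.

```python
import math
from itertools import combinations

def sample_normal(mu, sigma, xs):
    """Values of the N(mu, sigma^2) density at the grid points xs."""
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [c * math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in xs]

def width(samples, h):
    """Eq. (1) on a grid: integral of the pointwise maximum, minus 1."""
    return h * sum(max(vals) for vals in zip(*samples)) - 1.0

n, lo, hi = 8000, -12.0, 16.0
h = (hi - lo) / n
xs = [lo + (i + 0.5) * h for i in range(n)]

# Four illustrative pdfs (assumed parameters).
F = [sample_normal(mu, s, xs) for mu, s in [(0, 1), (1, 1.5), (4, 1), (2, 0.7)]]
k = len(F)

pair = [width([F[i], F[j]], h) for i, j in combinations(range(k), 2)]
f_bar = [sum(vals) / k for vals in zip(*F)]           # mean pdf

lower = max(pair)                                      # left side of (8), (9)
upper = 2.0 / k * sum(pair)                            # right side of (8), (9)
w_F = width(F, h)                                      # middle term of (8)
w_mean = sum(width([f, f_bar], h) for f in F)          # middle term of (9)

print(lower, w_F, w_mean, upper)
```

Both `w_F` and `w_mean` should fall between `lower` and `upper`, up to integration error.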

3.2 Relations between the cluster width and other measures

Theorem.

The relations between the CWD of $k$ pdfs $\{f_{1}(x),f_{2}(x),\ldots,f_{k}(x)\}$, $k\geqslant 2$, and some other measures are as follows:

$\displaystyle w(f_{1},f_{2},\ldots,f_{k})\geqslant\frac{k}{k-1}\left[1-D_{T}(f_{1},f_{2},\ldots,f_{k})^{(\alpha)}\right]-1,$ (12)

$\displaystyle w(f_{1},f_{2},\ldots,f_{k})\geqslant k-1-\sum\limits_{i=1,i\neq j}^{k}D_{T}(f_{i},f_{j})^{(\beta,1-\beta)},$ (13)

$\displaystyle w(f_{1},f_{2},\ldots,f_{k})\geqslant\frac{1}{k-1}-\left[\frac{2}{(k-1)^{3}}\right]^{1/2}\sum\limits_{i<j}\left\{1-\frac{1}{4}\left[d_{k}(f_{i},f_{j})\right]^{2k}\right\}^{1/2},$ (14)

where

$\displaystyle D_{T}(f_{1},f_{2},\ldots,f_{k})^{(\alpha)}=\int\limits_{R^{n}}\prod\limits_{j=1}^{k}[f_{j}(x)]^{\alpha_{j}}dx$ is the affinity of Toussaint [19],

$\displaystyle(\alpha)=(\alpha_{1},\alpha_{2},\ldots,\alpha_{k}),\quad 0<\alpha_{i},\beta<1,\quad\sum\limits_{i=1}^{k}\alpha_{i}=1,$

$\displaystyle d_{k}(f_{i},f_{j})=\left\{\int_{R^{n}}\left[f_{i}^{1/k}(x)-f_{j}^{1/k}(x)\right]^{k}dx\right\}^{1/k}.$

Proof.

i) For each $i=1,2,\ldots,k$, we have

$\displaystyle\left(\sum\limits_{j=1}^{k}f_{j}\right)^{\alpha_{i}}\geqslant(f_{i})^{\alpha_{i}}.$

Therefore,

$\displaystyle\left(\sum\limits_{j=1}^{k}f_{j}\right)^{\alpha_{1}+\alpha_{2}+\ldots+\alpha_{k}}\geqslant\prod\limits_{j=1}^{k}(f_{j})^{\alpha_{j}}\Leftrightarrow\sum\limits_{j=1}^{k}f_{j}\geqslant\prod\limits_{j=1}^{k}(f_{j})^{\alpha_{j}}.$ (15)

In addition, since

$\displaystyle\left(\min\limits_{1\leqslant j\leqslant k}\{f_{j}\}\right)^{\alpha_{1}}\leqslant(f_{1})^{\alpha_{1}},\ldots,\left(\min\limits_{1\leqslant j\leqslant k}\{f_{j}\}\right)^{\alpha_{k}}\leqslant(f_{k})^{\alpha_{k}},$

we have

$\displaystyle\left(\min\limits_{1\leqslant j\leqslant k}\{f_{j}\}\right)^{\alpha_{1}+\cdots+\alpha_{k}}\leqslant\prod\limits_{j=1}^{k}(f_{j})^{\alpha_{j}},$

that is,

$\displaystyle\min\limits_{1\leqslant j\leqslant k}\{f_{j}\}\leqslant\prod\limits_{j=1}^{k}(f_{j})^{\alpha_{j}}.$ (16)

Combining Eqs (15) and (16), we obtain

$\displaystyle 0\leqslant\sum\limits_{j=1}^{k}f_{j}-\prod\limits_{j=1}^{k}(f_{j})^{\alpha_{j}}\leqslant\sum\limits_{j=1}^{k}f_{j}-\min\limits_{1\leqslant j\leqslant k}\{f_{j}\}.$

Because $\sum_{j=1}^{k}f_{j}-\min\limits_{1\leqslant j\leqslant k}\{f_{j}\}$ consists of $k-1$ terms, we have

$\displaystyle\sum\limits_{j=1}^{k}f_{j}-\min\limits_{1\leqslant j\leqslant k}\{f_{j}\}\leqslant(k-1)\max\limits_{1\leqslant j\leqslant k}\{f_{j}\}.$

Thus,

$\displaystyle 0\leqslant\sum\limits_{j=1}^{k}f_{j}-\prod\limits_{j=1}^{k}(f_{j})^{\alpha_{j}}\leqslant(k-1)\max\limits_{1\leqslant j\leqslant k}\{f_{j}\}.$

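Integrating both sides of the inequality over $R^{n}$ and using $\int_{R^{n}}\sum_{j=1}^{k}f_{j}(x)dx=k$, we obtain:

$\displaystyle k-D_{T}(f_{1},f_{2},\ldots,f_{k})^{(\alpha)}\leqslant(k-1)\int\limits_{R^{n}}f_{\max}(x)dx.$ (17)

Using $\int_{R^{n}}f_{\max}(x)dx=w(f_{1},f_{2},\ldots,f_{k})+1$ in Eq. (17) gives $w(f_{1},f_{2},\ldots,f_{k})\geqslant\frac{1}{k-1}\left[1-D_{T}(f_{1},f_{2},\ldots,f_{k})^{(\alpha)}\right]$, which implies Eq. (12) because $D_{T}\geqslant 0$.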

ii)

Because

$\displaystyle[\min\{f_{i},f_{j}\}]^{\beta}\leqslant(f_{i})^{\beta}\ \text{ and }\ [\min\{f_{i},f_{j}\}]^{1-\beta}\leqslant(f_{j})^{1-\beta},$

we have

$\displaystyle\min\{f_{i},f_{j}\}\leqslant(f_{i})^{\beta}(f_{j})^{1-\beta}.$

Integrating the above inequality, we obtain:

$\displaystyle\sum\limits_{i=1,i\neq j}^{k}\int\limits_{R_{j}^{n}}\min\{f_{i},f_{j}\}dx\leqslant\sum\limits_{i=1,i\neq j}^{k}\int\limits_{R_{j}^{n}}(f_{i})^{\beta}(f_{j})^{1-\beta}dx.$

Moreover, since $\min\{f_{i},f_{j}\}=f_{i}$ on $R_{j}^{n}$,

$\displaystyle\sum\limits_{i=1,i\neq j}^{k}\int\limits_{R_{j}^{n}}\min\{f_{i},f_{j}\}dx=\sum\limits_{i=1}^{k}\left[\int\limits_{R^{n}}f_{i}(x)dx-\int\limits_{R_{i}^{n}}f_{i}(x)dx\right]=k-\int\limits_{R^{n}}f_{\max}(x)dx.$

We get

$\displaystyle k-\int\limits_{R^{n}}f_{\max}(x)dx\leqslant\sum\limits_{i=1,i\neq j}^{k}\int\limits_{R^{n}}(f_{i})^{\beta}(f_{j})^{1-\beta}dx=\sum\limits_{i=1,i\neq j}^{k}D_{T}(f_{i},f_{j})^{(\beta,1-\beta)}.$

From this inequality, we derive Eq. (13).

iii)

According to Kraft [7], we have

$\displaystyle d_{1}(f_{i},f_{j})\leqslant 2\left[1-D_{M}^{2}(f_{i},f_{j})\right]^{1/2},$ (18)

where $D_{M}$ denotes the affinity of Matusita [10]. Matusita [10] has also proved:

$\displaystyle d_{1}(f_{i},f_{j})\geqslant\left[d_{k}(f_{i},f_{j})\right]^{k}.$ (19)

Combining Eqs (18) and (19), we obtain

$\displaystyle\left[d_{k}(f_{i},f_{j})\right]^{k}\leqslant 2\left[1-D_{M}^{2}(f_{i},f_{j})\right]^{1/2},$

that is,

$\displaystyle D_{M}(f_{i},f_{j})\leqslant\left\{1-\frac{1}{4}\left[d_{k}(f_{i},f_{j})\right]^{2k}\right\}^{1/2}.$ (20)

On the other hand, from the geometric mean (Maclaurin) inequality applied to $f_{1}(x),f_{2}(x),\ldots,f_{k}(x)$, we get

$\displaystyle\left[f_{1}(x)f_{2}(x)\cdots f_{k}(x)\right]^{1/k}\leqslant\left[\frac{2}{k(k-1)}\sum\limits_{i<j}f_{i}(x)f_{j}(x)\right]^{1/2}.$

Integrating over $R^{n}$,

$\displaystyle D_{M}(f_{1},f_{2},\ldots,f_{k})\leqslant\left[\frac{2}{k(k-1)}\right]^{1/2}\int_{R^{n}}\left[\sum\limits_{i<j}f_{i}(x)f_{j}(x)\right]^{1/2}dx.$ (21)

Moreover,

$\displaystyle\left[\sum\limits_{i<j}f_{i}(x)f_{j}(x)\right]^{1/2}\leqslant\sum\limits_{i<j}\sqrt{f_{i}(x)f_{j}(x)}.$ (22)

From Eqs (21) and (22), we obtain:

$\displaystyle D_{M}(f_{1},f_{2},\ldots,f_{k})\leqslant\left[\frac{2}{k(k-1)}\right]^{1/2}\sum\limits_{i<j}D_{M}(f_{i},f_{j}).$ (23)

Finally, combining Eqs (20) and (23), we have Eq. (14).

∎

From Eqs (12) and (13) with $\alpha_{1}=\alpha_{2}=\ldots=\alpha_{k}=1/k$, we obtain the relation between CWD and the affinity of Matusita. In particular, when $k=2$, we obtain the relationship between CWD and the Hellinger distance.
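As a numerical illustration of Eqs (12) and (13), the Python sketch below evaluates both lower bounds for three hypothetical univariate normal pdfs sampled on a grid. It takes $\alpha_{j}=1/k$ and $\beta=1/2$, and it reads the sum in Eq. (13) as running over all ordered pairs $i\neq j$, following the proof. The grid, the parameters and the variable names are assumptions.

```python
import math

def sample_normal(mu, sigma, xs):
    """Values of the N(mu, sigma^2) density at the grid points xs."""
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [c * math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in xs]

n, lo, hi = 8000, -12.0, 16.0
h = (hi - lo) / n
xs = [lo + (i + 0.5) * h for i in range(n)]

# Three illustrative pdfs (assumed parameters).
F = [sample_normal(mu, s, xs) for mu, s in [(0, 1), (1.5, 1), (3, 1.2)]]
k = len(F)

# Cluster width, Eq. (1), by the midpoint rule.
w_F = h * sum(max(vals) for vals in zip(*F)) - 1.0

# Affinity of Toussaint with alpha_j = 1/k (the affinity of Matusita).
alpha = 1.0 / k
D_T = h * sum(math.prod(v ** alpha for v in vals) for vals in zip(*F))
rhs_12 = k / (k - 1) * (1.0 - D_T) - 1.0

# Pairwise affinities with beta = 1/2, summed over ordered pairs i != j.
beta = 0.5
pair_aff = sum(
    h * sum((a ** beta) * (b ** (1.0 - beta)) for a, b in zip(F[i], F[j]))
    for i in range(k) for j in range(k) if i != j
)
rhs_13 = k - 1 - pair_aff

print(w_F, rhs_12, rhs_13)
```

For strongly overlapping pdfs both right-hand sides can be negative, in which case the bounds hold trivially.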

4. Cluster analysis of probability density functions based on the cluster width

4.1 The representative probability density function

Definition.

Let $F=\left\{{{f_{{1}}},{f_{{2}}},\ldots,{f_{{k}}}}\right\}$ be a set of pdfs. The representative pdf of $F$ is defined as follows:

$\displaystyle f_{F}=\frac{\sum\limits_{j=1}^{k}\left(\mu_{Fj}\right)^{m}f_{j}}{\sum\limits_{j=1}^{k}\left(\mu_{Fj}\right)^{m}},$ (24)

where $\mu_{Fj}\in[0,1]$ is the probability that $f_{j}$ is merged into cluster $F$. In non-fuzzy clustering, $\mu_{Fj}=1$ when $f_{j}$ belongs to the cluster $F$ and $\mu_{Fj}=0$ otherwise. The weighting exponent $m$ in Eq. (24) controls the degree of fuzziness. If $m=1$, fuzzy cluster analysis reduces to the non-fuzzy method. Despite several discussions in the literature, such as Chen and Hung [1] and Goh and Vidal [6], the choice of $m$ has not yet been completely resolved. According to these authors, the best integer $m$ lies between 2 and 5. Based on empirical application to many data sets, this article uses $m=2$ in all numerical examples.

It is also shown that ${f_{F}}\geqslant 0$ for all $x$ and $\int_{{R^{n}}}{{f_{F}}dx=1}$ . Therefore, the representative pdf of a cluster is also a pdf.
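Eq. (24) can be illustrated with pdfs sampled on a grid. In the sketch below, the function name `representative`, the grid and the membership values are hypothetical choices for this example; the code checks numerically that the representative pdf is non-negative and integrates to 1, and that with memberships in $\{0,1\}$ it reduces to the plain average of the member pdfs.

```python
import math

def sample_normal(mu, sigma, xs):
    """Values of the N(mu, sigma^2) density at the grid points xs."""
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [c * math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in xs]

def representative(F, mu_F, m=2):
    """Eq. (24): mixture of the pdfs in F with weights mu_Fj ** m."""
    weights = [u ** m for u in mu_F]
    total = sum(weights)
    return [sum(wt * v for wt, v in zip(weights, vals)) / total
            for vals in zip(*F)]

n, lo, hi = 4000, -10.0, 15.0
h = (hi - lo) / n
xs = [lo + (i + 0.5) * h for i in range(n)]
F = [sample_normal(mu, 1.0, xs) for mu in (0.0, 1.0, 5.0)]

# Fuzzy memberships (assumed values) and the resulting representative pdf.
f_rep = representative(F, [0.9, 0.8, 0.1], m=2)
mass = h * sum(f_rep)                         # should be close to 1

# Non-fuzzy case: memberships in {0, 1} give the average of the members.
f_crisp = representative(F, [1.0, 1.0, 0.0], m=2)
print(mass, min(f_rep))
```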

4.2 The proposed fuzzy clustering algorithm

Problem: There are $k$ populations $N^{(0)}=\{W_{1}^{(0)},W_{2}^{(0)},\ldots,W_{k}^{(0)}\}$ with the given pdfs $\{f_{1},f_{2},\ldots,f_{k}\}$. We need to partition them into $c$ clusters ($c$ is given) so that the probability of each pdf belonging to its own cluster is greater than the probability of it belonging to any other cluster.

Algorithm 2: The proposed Fuzzy Clustering Algorithm (FCA) is presented in pseudocode form below. At the end of the computation, we obtain $c$ clusters, with the probabilities given in the final partition matrix $[\mu_{ij}]_{c\times k}.$

Algorithm 2. The FCA algorithm
Input: $k$ pdfs $\{f_{1},f_{2},\ldots,f_{k}\}$, the integer values $m, c$, $\varepsilon>0$ (a very small positive number) and the initial partition matrix $U^{(0)}$ ($U^{(0)}$ is chosen randomly).
Output: The matrix $[\mu_{ij}]_{c\times k}$, where $\mu_{ij}$ is the probability of the $j$th pdf belonging to the $i$th cluster.
repeat
  Find the representative pdf $f_{v_{i}}$ of each cluster by Eq. (24);
  Compute the cluster width between $f_{v_{i}}$ and each given pdf by Eq. (3);
  Update the new partition matrix $U^{(\textit{new})}$:
  if $w(f_{j},f_{v_{i}})>0$, $i=1,2,\ldots,c$ then
    $\mu_{v_{i}j}^{(\textit{new})}=\frac{1}{\sum\limits_{l=1}^{c}\left(w(f_{v_{i}},f_{j})/w(f_{v_{l}},f_{j})\right)^{2/(m-1)}}$, $m\geqslant 2$;
  else
    $\mu_{v_{i}j}^{(\textit{new})}=0$;
  end if
  Compute the value $S=\left\|U^{(\textit{new})}-U^{(0)}\right\|=\max\limits_{i,j}\left|\mu_{v_{i}j}^{(\textit{new})}-\mu_{v_{i}j}^{(0)}\right|$;
  $U^{(0)}\leftarrow U^{(\textit{new})}$;
until $S<\varepsilon$;
$[\mu_{ij}]_{c\times k}=U^{(\textit{new})}$;

In the above algorithm, $\varepsilon$ is a very small number chosen arbitrarily. The smaller $\varepsilon$ is, the larger the number of iterations and the computation time are. In the numerical examples, we chose $\varepsilon=10^{-4}$. The value of $m$ is chosen as in Subsection 4.1.

In this algorithm, after each iteration, we have the specific probability of merging $f_{j}$ into cluster $c_{i}$. When the algorithm ends, we obtain the final partition matrix $[\mu_{ij}]_{c\times k}$, where $\mu_{ij}$ is the probability of the $j$th pdf belonging to the $i$th cluster. Therefore, if $\max\limits_{i}\{\mu_{ij}\}=\mu_{lj}$, $i=1,2,\ldots,c$, then the pdf $f_{j}$ is assigned to the $l$th cluster.
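The FCA iteration can be sketched in Python for pdfs sampled on a one-dimensional grid. This is an illustration under several assumptions rather than the authors' Matlab procedure: the membership update uses the usual fuzzy c-means ratio of the width to cluster $i$ over the width to each cluster $l$, widths are floored at a tiny positive value to avoid division by zero (instead of the zero-membership branch of the pseudocode), and the test pdfs, grid, seed and function names are hypothetical.

```python
import math
import random

def sample_normal(mu, sigma, xs):
    """Values of the N(mu, sigma^2) density at the grid points xs."""
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [c * math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in xs]

def width2(f, g, h):
    """Eq. (3): half the L1 distance between two sampled pdfs."""
    return 0.5 * h * sum(abs(a - b) for a, b in zip(f, g))

def fca(F, h, U0, m=2, eps=1e-4, max_iter=200):
    """Sketch of Algorithm 2; returns the c x k partition matrix."""
    c, k = len(U0), len(F)
    U = [row[:] for row in U0]
    for _ in range(max_iter):
        # Representative pdf of each cluster, Eq. (24).
        reps = []
        for i in range(c):
            wts = [U[i][j] ** m for j in range(k)]
            tot = sum(wts)
            reps.append([sum(wt * v for wt, v in zip(wts, vals)) / tot
                         for vals in zip(*F)])
        # Widths between representatives and pdfs; the floor is an
        # assumption of this sketch to avoid division by zero.
        W = [[max(width2(reps[i], F[j], h), 1e-12) for j in range(k)]
             for i in range(c)]
        U_new = [[1.0 / sum((W[i][j] / W[l][j]) ** (2.0 / (m - 1))
                            for l in range(c))
                  for j in range(k)] for i in range(c)]
        S = max(abs(U_new[i][j] - U[i][j]) for i in range(c) for j in range(k))
        U = U_new
        if S < eps:
            break
    return U

n, lo, hi = 1500, -8.0, 20.0
h = (hi - lo) / n
xs = [lo + (i + 0.5) * h for i in range(n)]
# Two well-separated groups of normal pdfs (assumed test data).
F = [sample_normal(mu, 1.0, xs) for mu in (0.0, 0.3, 0.6, 10.0, 10.3, 10.6)]

rng = random.Random(0)
U0 = [[rng.random() for _ in F] for _ in range(2)]    # random initial partition
for j in range(len(F)):                               # columns sum to 1
    s = U0[0][j] + U0[1][j]
    U0[0][j] /= s
    U0[1][j] /= s

U = fca(F, h, U0)
labels = [max(range(2), key=lambda i: U[i][j]) for j in range(len(F))]
print(labels)
```

Each pdf is assigned to the cluster with the largest membership in its column, as described above; which of the two groups receives label 0 depends on the random initial matrix.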

4.3 Determining the suitable number of clusters

Let $F=\{f_{1},f_{2},\ldots,f_{k}\}$ be the set of $k$ pdfs and $F_{v}^{(t)}=\{f_{v_{1}}^{(t)},f_{v_{2}}^{(t)},\ldots,f_{v_{k}}^{(t)}\}$ be the sequence of $k$ representative pdfs of clusters at iteration $t$. Based on CWD, we propose an algorithm to determine the Suitable Number of Clusters (SNC). The pseudocode of this algorithm is presented in Algorithm 3.

Algorithm 3: Determining the Suitable Number of Clusters (SNC)
Input: $k$ pdfs $F=\{f_{1},f_{2},\ldots,f_{k}\}$, $\varepsilon>0$ (a very small positive number).
Output: The number $c$ of clusters.
Initialize $t=0$ and determine the sequence of representative pdfs of clusters:
$F_{v}^{(0)}=\left\{f_{v_{1}}^{(0)},f_{v_{2}}^{(0)},\ldots,f_{v_{k}}^{(0)}\right\}=F=\left\{f_{1},f_{2},\ldots,f_{k}\right\}$;
if $w(f_{v_{i}},f_{v_{j}})\leqslant w_{s}=\frac{1}{\binom{k}{2}}\sum\limits_{i<j}w(f_{v_{i}},f_{v_{j}})$ then
  $K_{\lambda}(f_{v_{i}},f_{v_{j}})=\exp\left(-\frac{w(f_{v_{i}},f_{v_{j}})}{\lambda}\right)$;
else
  $K_{\lambda}(f_{v_{i}},f_{v_{j}})=0$;
end if
Update the sequence of representative pdfs of clusters by the formula:
$f_{v_{i}}^{(t+1)}=\frac{\sum\limits_{j=1}^{k}K_{\lambda}\left(f_{v_{i}}^{(t)},f_{v_{j}}^{(t)}\right)f_{v_{j}}^{(t)}}{\sum\limits_{j=1}^{k}K_{\lambda}\left(f_{v_{i}}^{(t)},f_{v_{j}}^{(t)}\right)}$;
repeat $t\leftarrow t+1$ and compute $f_{v_{i}}^{(t)},f_{v_{i}}^{(t+1)}$
until $\max\limits_{i}\left\{w\left(f_{v_{i}}^{(t)},f_{v_{i}}^{(t+1)}\right)\right\}<\varepsilon$;
$c=$ the number of distinct elements of $F_{v}^{(t)}$;

In this algorithm, the window size $\lambda$ determines the number of clusters in the data. When $\lambda\rightarrow 0$, each pdf is its own cluster, and when $\lambda\rightarrow\infty$, there is only one cluster. Chen and Hung [1] and Thao and Vovan [18] have discussed the value of $\lambda$, but there is no optimal method for all cases. Based on experiments with many cases, this article uses $\lambda=\frac{w_{s}}{10}$ in the numerical examples and applications.

At the end of the computation, the pdfs that belong to the same cluster converge to their representative pdf. The number of distinct representative pdfs is the suitable number of clusters. As a result, we can determine the number of clusters and the initial clusters for the first iteration of the FCA algorithm.
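The SNC iteration can be sketched in Python for pdfs sampled on a grid. The sketch makes several assumptions that the pseudocode leaves open: $w_{s}$ and $\lambda=w_{s}/10$ are recomputed from the current representatives at every iteration, two representatives are counted as the same cluster when their width is below a hypothetical tolerance `merge_tol`, and the test pdfs and grid are illustrative.

```python
import math

def sample_normal(mu, sigma, xs):
    """Values of the N(mu, sigma^2) density at the grid points xs."""
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [c * math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in xs]

def width2(f, g, h):
    """Eq. (3): half the L1 distance between two sampled pdfs."""
    return 0.5 * h * sum(abs(a - b) for a, b in zip(f, g))

def snc(F, h, eps=1e-4, max_iter=500, merge_tol=1e-2):
    """Sketch of Algorithm 3; returns (number of clusters, representatives)."""
    reps = [list(f) for f in F]
    k = len(reps)
    if k < 2:
        return k, reps
    for _ in range(max_iter):
        W = [[width2(reps[i], reps[j], h) for j in range(k)] for i in range(k)]
        w_s = sum(W[i][j] for i in range(k) for j in range(i + 1, k)) / (k * (k - 1) / 2)
        if w_s == 0.0:
            break                              # all representatives coincide
        lam = w_s / 10.0                       # window size used in the article
        K = [[math.exp(-W[i][j] / lam) if W[i][j] <= w_s else 0.0
              for j in range(k)] for i in range(k)]
        # Kernel-weighted update of every representative pdf.
        new = [[sum(K[i][j] * vals[j] for j in range(k)) / sum(K[i])
                for vals in zip(*reps)] for i in range(k)]
        shift = max(width2(reps[i], new[i], h) for i in range(k))
        reps = new
        if shift < eps:
            break
    # Count representatives that are distinct up to merge_tol (an assumption).
    distinct = []
    for r in reps:
        if all(width2(r, d, h) > merge_tol for d in distinct):
            distinct.append(r)
    return len(distinct), reps

n, lo, hi = 1500, -8.0, 20.0
h = (hi - lo) / n
xs = [lo + (i + 0.5) * h for i in range(n)]
# Two well-separated groups of normal pdfs (assumed test data).
F = [sample_normal(mu, 1.0, xs) for mu in (0.0, 0.3, 0.6, 10.0, 10.3, 10.6)]

c, reps = snc(F, h)
print(c)
```

The returned representatives can serve as starting clusters for the FCA sketch, in the spirit of the remark above.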

We have also written Matlab procedures to perform the FCA and SNC algorithms. These programs are applied effectively to the numerical examples in Section 5. In practice, however, data usually consist of discrete elements, so the pdfs have to be estimated before clustering. There are many methods to solve this problem, of which the kernel function method is the most popular. In this method, the choice of the smoothing parameter and of the type of kernel function affects the result. Although many authors have discussed this problem, an optimal choice has not yet been found [3, 12, 16, 17]. In the numerical examples, the smoothing parameter is chosen following Scott [16] and the kernel function is Gaussian.

Figure 3.

The graph of the two classes $f_{1}$ (left) and $f_{2}$ (right) (a) and the two representative pdfs (b).

5. Numerical examples

In this section, the article presents four numerical examples to illustrate the proposed algorithms and to compare them with the existing algorithms in [1, 6, 9, 11, 21]. The first example considers 100 uniform pdfs separated into two groups of 50 pdfs each. Example 2 consists of three classes of bivariate Student pdfs with three pdfs in each. We consider these two examples to illustrate and test the established procedures and algorithms. Examples 3 and 4 apply the algorithms to image recognition, a problem of interest to many researchers in data mining. The results indicate that the proposed algorithms perform better than the existing methods considered.

Example 1.

This example reviews the synthetic data studied in [1, 6, 11]. The data include two classes $f_{1}$ and $f_{2}$ with 100 uniform pdfs on the interval $\left[{0,1000}\right]$ (see Fig. 3a). The pdfs of these two classes are defined as follows:

$\displaystyle{f_{1,i}}=U\left({a_{i}},{b_{i}}\right),\ i=1,\ldots,50{\rm{\,with\,}}{a_{i}}=4\left(i-1\right)+{\lambda_{1}},\ {b_{i}}=195+5i+{\lambda_{2}},$

$\displaystyle{f_{2,i}}=U\left({c_{i}},{d_{i}}\right),\ i=1,\ldots,50{\rm{\,with\,}}{c_{i}}=805-5i-{\lambda_{3}},\ {d_{i}}=1004-4i-{\lambda_{4}},$

where $U\left({a_{i}},{b_{i}}\right)$ and $U\left({c_{i}},{d_{i}}\right)$ denote the uniform distributions on the intervals $\left({a_{i}},{b_{i}}\right)$ and $\left({c_{i}},{d_{i}}\right)$, respectively, and ${\lambda_{1}},\ldots,{\lambda_{4}}$ are drawn from $U\left(0,4\right)$.
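The construction of the two classes can be reproduced with a short Python sketch. It assumes that an independent draw of ${\lambda_{1}},\ldots,{\lambda_{4}}$ is made for every pdf, which the definition above leaves open; the grid resolution, the seed and the variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
i = np.arange(1, 51)
# Assumption: one independent U(0, 4) draw of each lambda for every pdf.
lam = rng.uniform(0.0, 4.0, size=(4, 50))

a = 4 * (i - 1) + lam[0]
b = 195 + 5 * i + lam[1]
c = 805 - 5 * i - lam[2]
d = 1004 - 4 * i - lam[3]

def uniform_pdf(x, lo, hi):
    """Density of U(lo, hi) evaluated on the grid x."""
    return np.where((x >= lo) & (x <= hi), 1.0 / (hi - lo), 0.0)

x = np.linspace(0.0, 1000.0, 2001)
class_1 = np.array([uniform_pdf(x, lo, hi) for lo, hi in zip(a, b)])
class_2 = np.array([uniform_pdf(x, lo, hi) for lo, hi in zip(c, d)])
```

Under these definitions every support has length at least 196 and the supports of the two classes do not intersect, which is why this data set is an easy test case.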

The SNC algorithm gives the two representative pdfs shown in Fig. 3b. Running the FCA algorithm, we obtain the partition matrix $(2\times 100)$, some columns of which are given below:

$\displaystyle\left[\begin{array}{ccccccc}0.768&0.768&0.795&\ldots&0.156&0.166&0.176\\ 0.232&0.219&0.205&\ldots&0.844&0.834&0.824\end{array}\right]$

From the above matrix, we have two clusters $\left\{f_{1},f_{2},\ldots,f_{50}\right\}$ and $\left\{f_{51},f_{52},\ldots,f_{100}\right\}$ with the same value of CSC (0.9776). The algorithm of Chen and Hung [1] gives the same result as the proposed algorithm. In this case, the error of both algorithms is 0%.

We next cluster two classes of pdfs $g_{1}$ and $g_{2}$, where ${g_{1}}={f_{1}}$ and ${g_{2}}=\lambda{f_{1}}+(1-\lambda){f_{2}},\lambda\in\left[{0,0.5}\right]$ (see Fig. 4a).
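To see why this setting becomes harder as $\lambda$ grows, note that $g_{1}-g_{2}=(1-\lambda)(f_{1}-f_{2})$, so any $L^{1}$-type separation between the two classes shrinks by the factor $1-\lambda$. The Python sketch below checks this numerically for half the $L^{1}$ distance; the two uniform pdfs are hypothetical stand-ins for one member of each class, not the pdfs used in the article.

```python
import numpy as np

x = np.linspace(0.0, 1000.0, 100001)
dx = x[1] - x[0]

def uniform_pdf(x, lo, hi):
    """Density of U(lo, hi) evaluated on the grid x."""
    return np.where((x >= lo) & (x <= hi), 1.0 / (hi - lo), 0.0)

f1 = uniform_pdf(x, 0.0, 200.0)      # illustrative member of the first class
f2 = uniform_pdf(x, 800.0, 1000.0)   # illustrative member of the second class

def half_l1(p, q):
    """Half of the L1 distance between two pdfs on the grid."""
    return 0.5 * float(np.sum(np.abs(p - q)) * dx)

base = half_l1(f1, f2)
# Ratio of the separation between g1 = f1 and g2 = lam*f1 + (1-lam)*f2 to the original one.
ratios = {lam: half_l1(f1, lam * f1 + (1 - lam) * f2) / base
          for lam in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)}
```

At $\lambda=0.5$ the separation is half of its original value, the most overlapping case considered in this example.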

Figure 4.

(a) The graph of two classes $g_{1}$ (left) and $g_{2}$ (right) with $\lambda=0.5$; (b) The final state of $g_{1}$ and $g_{2}$ after the SNC algorithm finishes.

Figure 5.

The graph shows the probabilities of the pdfs belonging to cluster 1 (left) and cluster 2 (right).

Using the SNC algorithm, the pdfs converge to two representative pdfs (see Fig. 4b). The FCA algorithm gives the partition matrix $(2\times 100)$, in which the probabilities in the first 50 columns of the first row are larger than those of the second row, and the opposite holds for the remaining 50 columns. The probabilities of assigning the elements to the two clusters are shown in Fig. 5.

From the partition matrix, we again have two clusters, $\left\{g_{1},g_{2},\ldots,g_{50}\right\}$ and $\left\{g_{51},g_{52},\ldots,g_{100}\right\}$, and the CSC of each cluster is 0.9776. The errors for different values of $\lambda$ for the proposed algorithm and the existing methods are shown in Table 1.

Table 1

The error (%) of algorithms for $g_{i}(x)$

Algorithms	$\lambda=0$	$\lambda=0.1$	$\lambda=0.2$	$\lambda=0.3$	$\lambda=0.4$	$\lambda=0.5$
k-Means	49.8	59.8	71.4	78.2	80.6	86.4
Goh and Vidal	0	0	0	0	0	5.0
Tai and Pham-Gia	9.2	9	9.2	8.8	11.2	13.4
Montanari and Calo	0	0	0	0	0	5.1
Chen and Hung	0	0	0	0	0	0
FCA (proposed)	0	0	0	0	0	0

Table 1 shows that the algorithm of Chen and Hung [1] and the proposed algorithm give the best results for all values of $\lambda$ (error 0%).
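The way the error percentages are computed is not spelled out here. One common convention, assumed in the Python sketch below, is the misclassification rate after matching cluster labels to the true classes in the most favourable way; the function name and the toy labels are hypothetical.

```python
from itertools import permutations

def clustering_error(true_labels, cluster_labels):
    """Smallest misclassification rate (%) over all relabelings of the clusters."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    n = len(true_labels)
    best = n
    # Try every one-to-one assignment of cluster labels to class labels.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        wrong = sum(mapping[c] != t for t, c in zip(true_labels, cluster_labels))
        best = min(best, wrong)
    return 100.0 * best / n

# Toy labels: six objects, one of which ends up in the wrong cluster.
true_labels = [0, 0, 0, 1, 1, 1]
cluster_labels = [1, 1, 1, 0, 0, 1]
error = clustering_error(true_labels, cluster_labels)
```

The exhaustive search over relabelings is only practical for a handful of clusters, which is the situation in all four examples.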

Example 2.

Consider nine pdfs of the bivariate Student distribution with 3 degrees of freedom:

$\displaystyle{f_{i}}\left({x,y}\right)=\frac{\Gamma\left(2.5\right)\left|\Sigma_{i}\right|^{-1/2}}{3\pi\Gamma(1.5)}\times\frac{1}{\left(1+\delta\left(x,\mu_{i},\Sigma_{i}\right)/3\right)^{2.5}},$

where $\delta\left(x,\mu_{i},\Sigma_{i}\right)=\left(x-\mu_{i}\right)^{T}\Sigma_{i}^{-1}\left(x-\mu_{i}\right)$ and the specific parameters ${\mu_{i}}$ and ${\Sigma_{i}}$, $i=1,2,\ldots,9$, are given in [1]. The data have three clusters, each consisting of three pdfs, whose graphs and contours are shown in Fig. 6a and b, respectively.
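The density above can be evaluated directly, as in the following Python sketch. Since the parameters from [1] are not reproduced here, the location and scale matrices used below are hypothetical demonstration values, and the function name is illustrative.

```python
import math
import numpy as np

def student_pdf_2d(point, mu, sigma, nu=3.0):
    """Bivariate Student density with nu degrees of freedom at a single point."""
    point = np.asarray(point, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    diff = point - mu
    delta = float(diff @ np.linalg.solve(sigma, diff))  # quadratic form (x-mu)^T Sigma^{-1} (x-mu)
    const = math.gamma((nu + 2.0) / 2.0) / (
        nu * math.pi * math.gamma(nu / 2.0) * math.sqrt(np.linalg.det(sigma))
    )
    return const * (1.0 + delta / nu) ** (-(nu + 2.0) / 2.0)

# Hypothetical parameters, not those of [1].
mu_demo = [0.0, 0.0]
peak_identity = student_pdf_2d(mu_demo, mu_demo, [[1.0, 0.0], [0.0, 1.0]])
peak_scaled = student_pdf_2d(mu_demo, mu_demo, [[4.0, 0.0], [0.0, 1.0]])
```

Because $\Gamma(2.5)/\Gamma(1.5)=1.5$, the value at the mode reduces to $1/\left(2\pi\left|\Sigma_{i}\right|^{1/2}\right)$, which gives a quick sanity check of any implementation.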

Figure 6.

The graph of the nine bivariate Student pdfs (a) and their contour (b).

When the SNC algorithm stops, we have three representative pdfs shown in Fig. 7a and their corresponding contours are presented in Fig. 7b.

Figure 7.

The representative pdfs of 3 clusters (a) and their contours (b).

The FCA algorithm gives the partition matrix as follows:

$\displaystyle\left[\begin{array}{ccccccccc}0.798&0.635&0.632&0.097&0.178&0.172&0.097&0.172&0.177\\ 0.101&0.187&0.182&0.807&0.647&0.655&0.095&0.172&0.176\\ 0.101&0.177&0.186&0.095&0.175&0.173&0.808&0.656&0.647\end{array}\right]$

This matrix gives three clusters $\left\{f_{1},f_{2},f_{3}\right\}$, $\left\{f_{4},f_{5},f_{6}\right\}$ and $\left\{f_{7},f_{8},f_{9}\right\}$. Their CSC values are 0.4057, 0.4167 and 0.4167, respectively. Once again, this is the same result as that of Chen and Hung [1]. With 3 clusters, the empirical errors of the proposed algorithm and the algorithms of [6, 11] are 0%, whereas those of the algorithms in [21, 9] are 9.38% and 64.4%, respectively.
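Reading the clusters off the partition matrix amounts to assigning each pdf to the row with the largest probability in its column. The Python sketch below does this for the matrix reported in this example; the maximum-probability rule is the natural reading of the text rather than a step stated explicitly, and the variable names are illustrative.

```python
import numpy as np

# Partition matrix reported in Example 2 (rows: clusters, columns: f1, ..., f9).
U = np.array([
    [0.798, 0.635, 0.632, 0.097, 0.178, 0.172, 0.097, 0.172, 0.177],
    [0.101, 0.187, 0.182, 0.807, 0.647, 0.655, 0.095, 0.172, 0.176],
    [0.101, 0.177, 0.186, 0.095, 0.175, 0.173, 0.808, 0.656, 0.647],
])

labels = U.argmax(axis=0)  # index of the most probable cluster for each pdf
clusters = [[f"f{j + 1}" for j in np.flatnonzero(labels == k)]
            for k in range(U.shape[0])]
column_sums = U.sum(axis=0)  # each column should sum to 1 up to rounding
```

The columns also show how confident each assignment is: the first pdf of every group has a probability near 0.8, while the other two are near 0.64.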

In data processing, images receive special attention because of their visual nature and their applications in many fields such as agriculture, medicine and the environment. Therefore, the remaining two examples cluster images. Typically, an image is characterized by three main features: color, texture and shape. Among them, color is widely used for image segmentation and image classification. An image is structured as a matrix of pixels, each of which contains color information. The G space is built on the gray level. The RGB space is built on three primary colors: red, green and blue. It is one of the most widely recognized color spaces because almost all visible colors can be generated by linear combinations of these three colors. In practice, RGB is widely used to produce colors in electronic devices such as computers, televisions and scanners. In this article, the color features extracted from the images are represented by pdfs in one-dimensional or multi-dimensional spaces by the Matlab procedure. These pdfs then become the input for the clustering algorithms.
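The feature extraction described above can be sketched in Python as follows. The gray level is computed with the common luma weights 0.299, 0.587 and 0.114, which is an assumption because the conversion used by the Matlab procedure is not stated; the random image, the histogram-based density and all names are illustrative only.

```python
import numpy as np

def pixel_features(image_rgb):
    """Return gray-level samples and RGB samples, both scaled to [0, 1]."""
    rgb = np.asarray(image_rgb, dtype=float) / 255.0
    rgb_samples = rgb.reshape(-1, 3)                  # one row per pixel
    # Assumed gray conversion: weighted sum of the three channels.
    gray_samples = rgb_samples @ np.array([0.299, 0.587, 0.114])
    return gray_samples, rgb_samples

# Hypothetical 32 x 48 image with 8-bit channels.
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(32, 48, 3), dtype=np.uint8)
gray, rgb = pixel_features(fake_image)

# A crude density of the gray levels on [0, 1]; a kernel estimate could be used instead.
density, edges = np.histogram(gray, bins=32, range=(0.0, 1.0), density=True)
area = float(np.sum(density * np.diff(edges)))
```

The gray samples lead to a one-dimensional pdf per image, while the RGB samples lead to a three-dimensional one.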

Example 3.

This example considers 27 images in 2 categories: 13 images of lotuses and 14 images of sunflowers. The images are shown in Fig. 8. Using the G scale, the estimated pdfs are shown in Fig. 9. In this figure, the $x$ axis is the pixel value of the images normalized to $[0,1]$.

Figure 8.

The images of lotus and sunflower.

Figure 9.

The estimated pdfs from images of lotuses and sunflowers.

The SNC algorithm gives 2 clusters, and the FCA algorithm gives the following partition matrix; the probabilities of assignment to each cluster are shown in Fig. 10:

$\displaystyle\left[\begin{array}{ccccccc}0.098&0.462&0.251&\ldots&0.421&0.812&0.834\\ 0.902&0.538&0.749&\ldots&0.579&0.188&0.166\end{array}\right]$

Figure 10.

The graph shows the probabilities of pdfs belonging to cluster lotuses (a) and cluster sunflowers (b).

The CSC values of the two established clusters are 0.861 and 0.878, respectively. The algorithm of Chen and Hung [1] gives only a single cluster containing all pdfs. The empirical errors of the proposed algorithm and the algorithms of [9, 21, 6, 11, 1] are 11%, 45%, 37%, 27% and 47%, respectively.

Using the three RGB variables and proceeding as for G, the SNC algorithm also gives two clusters. Their CSC values are 0.831 and 0.863, respectively. We obtain the following partition matrix from the proposed algorithm:

$\displaystyle\left[\begin{array}{ccccccc}0.483&0.502&0.477&\ldots&0.666&0.579&0.587\\ 0.517&0.498&0.523&\ldots&0.334&0.421&0.413\end{array}\right]$

In this case, the errors of the proposed algorithm and the algorithms of [9, 21, 6, 11, 1] are 18.5%, 65%, 47%, 38% and 47%, respectively.

Example 4.

This example continues to cluster images. The database was studied by Chen and Hung [1] and Goh and Vidal [6] and can be downloaded from http://www1.cs.columbia.edu/sofware/curet/. There are 218 material images separated into three groups, Human skin, Ribbed paper and Insulation, containing 57, 77 and 84 images, respectively. Some samples are presented in Fig. 11.

Figure 11.

Three original texture samples in the benchmark database: Human skin (a), Ribbed paper (b) and Insulation (c).

Figure 12.

The pdfs of 218 material images.

The pdfs are again estimated from the G and RGB scales of the image pixels. For the G scale, the pdfs are shown in Fig. 12.

For G, the SNC algorithm gives three clusters (see Fig. 13) with CSC values of 0.625, 0.713 and 0.625, whereas the algorithm of Chen and Hung [1] gives only one cluster (see Fig. 14). Running the FCA algorithm in one dimension and in three dimensions, we obtain partition matrices whose probabilities of images belonging to the three clusters are shown in Figs 15 and 16, respectively.

Table 2

The error (%) of the algorithms for material images

Algorithms	One dimension	Three dimensions
k-Means	30.28	55.50
Goh and Vidal	39.12	24.14
Tai and Pham-Gia	47.13	40.17
Montanari and Calo	53.18	26.17
Chen and Hung	21.47	5.05
FCA (proposed)	1.38	1.38

Figure 13.

The three representative pdfs in the final step of the proposed algorithm.

Figure 14.

The three representative pdfs in the final step of the algorithm of Chen and Hung.

Figure 15.

The graph of probabilities for partition matrix using G.

Figure 16.

The graph of probabilities for partition matrix using RGB.

The errors of the proposed algorithm and the other algorithms considered are given in Table 2. From Table 2, we see that the proposed method obtains good results in both the one-dimensional and three-dimensional cases. In addition, the error of the proposed method is the same in both cases, while those of the compared methods fluctuate considerably. Thus, the proposed method also works well for pdfs with large overlapping areas.

From the considered examples, we see that the results of Chen and Hung [1] and the proposed algorithm are the best. However, in its final step, the proposed algorithm additionally gives, through the established partition matrix, the probability that each element belongs to each cluster. Moreover, the quality of the established clusters can be assessed by the CSC parameter. We also see that when the pdfs of the groups are well separated (Examples 1 and 2), the results of Chen and Hung [1] are good. However, when the overlapping regions of the groups are large, this algorithm is at a disadvantage, as Examples 3 and 4 show. These examples also show the strong performance of the proposed algorithms in comparison with the existing algorithms.

6. Conclusion

This study has investigated the CWD by establishing its upper and lower bounds and its relations with other measures. Based on the CWD, a new concept for evaluating the quality of the established clusters is proposed. The determination of the CWD is also considered in both theory and real applications. Based on the CWD, the article has proposed two algorithms: fuzzy cluster analysis for pdfs and determination of the suitable number of clusters. These algorithms are implemented as Matlab procedures and applied to several synthetic, benchmark and real data sets. The numerical examples show the suitability and applicability of the studied problem. They also show that the proposed algorithms are more efficient than the existing ones. In the near future, we will apply these theories to other practical issues such as image processing and sound recognition. The convergence property of the proposed algorithms, however, is not studied in this article; it will be part of our further studies.

References

1. Chen and Hung, An automatic clustering algorithm for probability density functions, Journal of Statistical Computation and Simulation 85(15) (2015), 3047–3063.

2. Defays, An efficient algorithm for a complete link method, The Computer Journal 20(4) (1977), 364–366.

3. Duin, On the choice of smoothing parameters for Parzen estimators of probability density functions, IEEE Transactions on Computers 25 (1997), 1175–1179.

4. Fukunaga, Introduction to statistical pattern recognition (2nd Ed), Academic Press, New York, 1990.

5. Glick, Separation and probability of correct classification among two or more distributions, Annals of the Institute of Statistical Mathematics 25(1) (1973), 373–382.

6. Goh and Vidal, Unsupervised Riemannian clustering of probability density functions, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2008, pp. 377–392.

7. Kraft C.H., Some conditions for consistency and uniform consistency of statistical procedures, University of California Press, 1955.

8. Luo, Jiao and Shang, Learning simultaneous adaptive clustering and classification via moea, Pattern Recognition 60(2) (2016), 37–50.

9. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, 1, 1967, pp. 281–297.

10. Matusita, On the notion of affinity of several distributions and some of its applications, Annals of the Institute of Statistical Mathematics 19(1) (1967), 181–192.

11. Montanari and Calo D.G., Model-based clustering of probability density functions, Advances in Data Analysis and Classification 7(3) (2013), 301–319.

12. Parzen, On estimation of a probability density function and mode, The Annals of Mathematical Statistics 33(3) (1962), 1065–1076.

13. Pham-Gia and Nhat N.D., Statistical classification using the maximum function, Open Journal of Statistics 5(7) (2015), 665–680.

14. Pham-Gia, Turkkan and Bekker, Bayesian analysis in the L1-norm of the mixing proportion using discriminant analysis, Metrika 64(1) (2006), 1–22.

15. Pham-Gia, Turkkan and VoVan, The maximum function in statistical discrimination analysis, Communication in Statistics: Simulation and Computation 37 (2008), 320–336.

16. Scott, Multivariate density estimation: Theory, practice and visualisation, John Wiley and Sons, New York, 1992.

17. Silverman, Density estimation, Chapman & Hall, London, 2010.

18. Thao N.T. and Vovan, Fuzzy clustering of probability density functions, Journal of Applied Statistics 44 (2017), 583–601.

19. Toussaint, Some inequalities between distance measures for feature, I.E.E.E Trans. Comput 21 (1972), 409–410.

20. Vovan, L1-distance and classification problem by Bayesian method, Journal of Applied Statistics 44(3) (2017), 385–401.

21. Vovan and Pham-Gia, Clustering probability distributions, Journal of Applied Statistics 37(11) (2010), 1891–1910.

22. Webb, Statistical pattern recognition, John Wiley & Sons, 2002.