An efficient semi-supervised graph based clustering

Abstract

Clustering is one of the most important tools in data mining and knowledge discovery from data. In recent years, semi-supervised clustering, which integrates side information (seeds or constraints) into the clustering process, has become known as a good strategy to boost clustering results. In this article, a new semi-supervised graph based clustering (SSGC) algorithm is presented. Using a graph of the k-nearest neighbors and a measure of local density for the similarity between vertices, SSGC integrates the seeds into the process of building clusters and hence can improve the quality of clustering. Moreover, SSGC can deal with noise and with data of varying density, and uses only one parameter (i.e. the number of nearest neighbors). Experiments conducted on real data sets from UCI show that our method can produce good clustering results compared with related techniques such as semi-supervised density based clustering (SSDBSCAN). In addition, the computational cost of SSGC is much lower than that of SSDBSCAN.

Keywords

Semi-supervised clustering, seed, k-nearest neighbors graph

1. Introduction

Clustering is an important task in the process of knowledge discovery in data mining [1, 2, 3]. In the past ten years, the problem of clustering with side information (known as semi-supervised clustering) has become an active research direction in the machine learning community [4, 5, 6, 7, 8]. In general, semi-supervised clustering methods integrate a small set of side information (e.g. seeds or constraints) to boost the quality of clustering. There are two kinds of semi-supervised clustering: constraint based clustering and seed based clustering. In constraint based clustering, the most common forms of constraints are must-link (ML) and cannot-link (CL) constraints [4]. $ML(x,y)$ indicates that the two points $x$ and $y$ must be grouped in the same cluster, while $CL(x,y)$ means that $x$ and $y$ must belong to different clusters. In seed based clustering, a small set of seeds (labeled points) is provided to help the clustering process.

The motivation of our work is the following open question: how can we integrate seeds efficiently in graph based clustering? Several methods have already been proposed in the literature to integrate seeds into clustering, such as seed based K-Means (SSKM) [9], seed based Fuzzy C-Means (SSFCM) [10], seed based hierarchical clustering (HISSCLU) [11], and seed based density-based clustering (SSDBSCAN) [12]. SSKM and SSFCM improve clustering results, but they are limited to clusters with spherical shapes. HISSCLU and SSDBSCAN are able to discover clusters with different sizes and shapes. However, their computation time is high.

To overcome the aforementioned shortcomings, we introduce a new semi-supervised clustering algorithm called Semi-Supervised Graph based Clustering (SSGC). To the best of our knowledge, this is the first seed based graph clustering algorithm. The main advantage of our algorithm is that the clustering process does not depend on the density of the data set and can easily detect noise, thanks to the k-nearest neighbor graph and a measure of similarity among vertices based on the local density of the data.

Our algorithm uses only one parameter, the number of nearest neighbors $k$, and for each data set there exists a wide interval from which a suitable value of $k$ can be chosen. Moreover, the experiments show that, by using seeds, SSGC not only gives good results but also reduces the computation time compared to the reference algorithm SSDBSCAN.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 presents our new semi-supervised graph based clustering. Section 4 describes the experiments that have been conducted on benchmark data sets from UCI. Finally, Section 5 concludes the paper and discusses several lines of future research.

2. Related work

2.1 Seed-based DBSCAN

The seed-based DBSCAN extends the original DBSCAN algorithm with a small set of labeled data to enable the discovery of clusters with distinct densities.

According to [11], the notion of density is formalized by two parameters: MinPts specifies a minimum number of objects, and $\epsilon$ is the radius of a hypersphere in the space of the objects. However, because these parameters are set once for all clusters, DBSCAN can only detect clusters with the same density.

The objective of SSDBSCAN is to overcome this limit by using seeds to compute an adapted radius $\epsilon$ for each cluster. SSDBSCAN has only one parameter, MinPts, while the $\epsilon$ parameter is deduced from the set of provided seeds. Another difference is that, contrary to DBSCAN, SSDBSCAN can be viewed as a graph partitioning problem. In SSDBSCAN, the data set is represented as a weighted undirected graph where each vertex corresponds to a data object and each edge between two objects $p$ and $q$ has a weight determined by the rDist(p,q) measure described hereafter.

The rDist(p,q) measure indicates the smallest radius value $\epsilon$ for which $p$ and $q$ are core points and directly density connected with respect to MinPts. rDist() can be formalized as follows:

$\displaystyle\forall p,q\in D,\ \textit{rDist}(p,q)=\max(\textit{cDist}(p),\ \textit{cDist}(q),\ d(p,q))$ (1)

where $D$ denotes the data set, $d()$ is the metric used in the clustering and $\forall o\in D$ , cDist(o) is the minimal radius such that $o$ is a core-point and has MinPts nearest-neighbors.
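As an illustration of Eq. (1), the following minimal Python sketch computes cDist and rDist on a toy data set. It assumes the Euclidean metric for $d()$ and assumes that a point is not counted among its own MinPts nearest neighbors; the function names `core_distance` and `r_dist` are hypothetical and do not come from [12].

```python
import math

def core_distance(data, i, min_pts):
    """cDist: distance from point i to its min_pts-th nearest neighbor.

    Assumption: the point itself is not counted as one of its neighbors.
    """
    dists = sorted(math.dist(data[i], data[j])
                   for j in range(len(data)) if j != i)
    return dists[min_pts - 1]

def r_dist(data, i, j, min_pts):
    """rDist of Eq. (1): max of the two core distances and d(p, q)."""
    return max(core_distance(data, i, min_pts),
               core_distance(data, j, min_pts),
               math.dist(data[i], data[j]))
```

Because every call recomputes the core distances, this sketch is only meant for small examples; a practical implementation would probably cache cDist for each point.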

Then, given a set of seeds $D_{L}$, the SSDBSCAN algorithm proceeds as follows. Using the rDist() distance, it is possible to construct a density-based cluster $C$ that contains the first seed point $p$, by first adding $p$ to $C$ and then iteratively adding to $C$ the next closest point in terms of rDist(). The process continues until a point $q$ with a different label from $p$ is reached. At that time, the algorithm backtracks to the point $o$ that had the largest rDist() before $q$ was added. The current expansion stops and the cluster $C$ containing $p$ is formed from all points up to but excluding $o$. Conceptually, this is the same as constructing a minimum spanning tree (MST) for a complete graph where the set of vertices equals $D$ and the edge weights are given by rDist() [12]. The complexity of SSDBSCAN is $O(mn^{2}\log n)$, or $O(mn^{2})$ when using a Fibonacci heap, so it is not easy to apply to real applications.

2.2 Seed based K-Means

The seed based K-Means algorithm was proposed in [9]. This method uses a small set of labeled data, the seeds, to help the clustering of the unlabeled data. Two variants of semi-supervised K-Means clustering are introduced: Seed K-Means and Constraint K-Means. In both methods, the seeds are supposed to be representative of all the clusters. In Seed K-Means, the labeled data are used to compute an initial center for each cluster; a traditional K-Means is then applied to the data set without any further use of the labeled data. In Constraint K-Means, the labels are used as constraints, so that the labeled data cannot be removed from the cluster to which the user has assigned them. Seed K-Means is presented in Algorithm 2.2. Like the traditional K-Means, SSKM only works well with clusters of spherical shape.

Algorithm 2.2: Seed K-Means
Input: Data set $X=\{x_{i}\}_{i=1}^{N}$, number of clusters $K$, set $S=\{S_{l}\}_{l=1}^{K}$ of initial seeds
Output: Disjoint $K$ partitioning of $X=\cup_{l=1}^{K}X_{l}$ such that the K-Means objective function is optimized
1. Initialize: $\mu_{h}^{(0)}\leftarrow\frac{1}{|S_{h}|}\sum_{x\in S_{h}}x$, for $h=1,\ldots,K$; $t\leftarrow 0$
2. Repeat until convergence:
2a. Assign_cluster: assign each data point $x$ to the cluster $h^{*}$ (i.e. set $X_{h^{*}}^{(t+1)}$), for $h^{*}=\arg\min_{h}\|x-\mu_{h}^{(t)}\|^{2}$
2b. Estimate_means: $\mu_{h}^{(t+1)}\leftarrow\frac{1}{|X_{h}^{(t+1)}|}\sum_{x\in X_{h}^{(t+1)}}x$
2c. $t\leftarrow(t+1)$
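The Seed K-Means steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation of [9]: the function name `seed_kmeans`, the `max_iter` safeguard, the convergence test (assignments no longer change) and the handling of empty clusters (the previous center is kept) are all assumptions made for the example.

```python
def seed_kmeans(points, seeds, max_iter=100):
    """points: list of coordinate tuples; seeds: list of K lists of seed points.

    Returns (labels, centers), where labels[i] is the cluster index of points[i].
    """
    def mean(vectors):
        n = len(vectors)
        return tuple(sum(v[d] for v in vectors) / n
                     for d in range(len(vectors[0])))

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Initialize: each center is the mean of the seeds of its cluster.
    centers = [mean(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assign_cluster: nearest center in squared Euclidean distance.
        new_labels = [min(range(len(centers)),
                          key=lambda h: sq_dist(x, centers[h]))
                      for x in points]
        if new_labels == labels:  # assumed convergence criterion
            break
        labels = new_labels
        # Estimate_means: recompute each center from its current members.
        for h in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == h]
            if members:  # assumption: keep the old center if a cluster is empty
                centers[h] = mean(members)
    return labels, centers
```

After initialization the seed labels are not used again, which matches the description of Seed K-Means; the Constraint K-Means variant would additionally keep the seeds fixed in their clusters during the assignment step.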

2.3 Seed based Fuzzy C-Means

The seed based Fuzzy C-Means algorithm was proposed in [10]. Like Seed K-Means, this method uses a small set of seeds to improve clustering performance, here for the image segmentation problem. Moreover, this algorithm overcomes three drawbacks of the original Fuzzy C-Means clustering: (1) the number of clusters must be chosen during initialization, (2) physical labels must be assigned to the classes at termination, and (3) the least-squares objective function tends to equalize cluster populations. The seed based Fuzzy C-Means is presented in Algorithm 2.3.

Algorithm 2.3: Seed Fuzzy C-Means
Input: Data set $X=X^{d}\cup X^{u}$, $n_{d}=|X^{d}|$, $n_{u}=|X^{u}|$, and $n=n_{d}+n_{u}$; note that the number of clusters $c$ is known and fixed by the set of seeds
Output: Disjoint $c$ partitioning of $X=\cup_{l=1}^{c}X_{l}$
1. Choose parameters $w$, $T$, $\|.\|_{A}$, $m>1$, and $\epsilon>0$
2. Initialize $U_{0}=[U^{d}|U_{0}^{u}]$, with $U_{0}^{u}\in M_{fcm}$
3. Compute $v_{i,0}=\frac{\sum_{k=1}^{n_{d}}(u_{ik,0}^{d})^{m}x_{k}^{d}}{\sum_{k=1}^{n_{d}}(u_{ik,0}^{d})^{m}}$, $1\leqslant i\leqslant c$
4. For $t=1,2,\ldots,T$:
– Compute $u_{ik,t}^{u}=\left[\sum_{j=1}^{c}\left(\frac{\|x_{k}^{u}-v_{i,t-1}\|_{A}}{\|x_{k}^{u}-v_{j,t-1}\|_{A}}\right)^{\frac{2}{m-1}}\right]^{-1}$, $1\leqslant i\leqslant c$; $1\leqslant k\leqslant n_{u}$
– Compute $E_{t}=\|U_{t}^{u}-U_{t-1}^{u}\|=\sqrt{\sum_{i=1}^{c}\sum_{k=1}^{n_{u}}(u_{ik,t}^{u}-u_{ik,t-1}^{u})^{2}}$
– If $E_{t}\leqslant\epsilon$, stop; else compute $v_{i,t}=\frac{\sum_{k=1}^{n_{d}}w_{k}(u_{ik,t}^{d})^{m}x_{k}^{d}+\sum_{k=1}^{n_{u}}(u_{ik,t}^{u})^{m}x_{k}^{u}}{\sum_{k=1}^{n_{d}}w_{k}(u_{ik,t}^{d})^{m}+\sum_{k=1}^{n_{u}}(u_{ik,t}^{u})^{m}}$, $1\leqslant i\leqslant c$
– Next $t$

Note that, if $n_{d}=0$ (i.e. $n_{u}=n$ ) or $w=0$ , then SSFCM reduces to FCM. We also note that SSFCM can only detect clusters with spherical form.
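To make the membership update of the Seed Fuzzy C-Means loop more concrete, the sketch below computes the memberships of a single unlabeled point with respect to a list of cluster centers. It assumes the Euclidean norm in place of the general $\|.\|_{A}$ norm, and the function name `fcm_memberships` as well as the handling of a point that coincides with a center are illustrative choices, not taken from [10].

```python
import math

def fcm_memberships(x, centers, m=2.0):
    """Fuzzy memberships of point x to each center (Euclidean norm assumed)."""
    dists = [math.dist(x, v) for v in centers]
    if 0.0 in dists:
        # Assumption: a point lying exactly on one or more centers shares
        # its full membership equally among those centers.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    # u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1))
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists)
            for d_i in dists]
```

The memberships of a point always sum to 1, and a larger fuzzifier `m` spreads them more evenly across the clusters.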

2.4 HISSCLU method

In [11], HISSCLU, a hierarchical density-based clustering algorithm based on OPTICS, is proposed. HISSCLU has two stages. In the first stage, given a set of labeled objects, HISSCLU starts the OPTICS expansion simultaneously from all the labeled objects and generates as many reachability plots as there are labeled objects, each one representing a cluster; during the label expansion, it uses a method to change the distance between points that resembles distance learning [12]. The reachability plots are reordered and concatenated with each other to produce a single plot. In the second stage, a cut at level $\epsilon$ is made in the plot to extract the clusters.

It is important to note that HISSCLU is not able to extract the natural cluster structure from a data set if the plot generated in the first stage represents a distribution where the density varies widely between clusters, because it uses only one cut. As for DBSCAN, defining the value of a single cut corresponding to a single density level is difficult and often requires the user to go through a trial and error process, which makes the algorithm unsuitable for an automatic KDD process.

3. Proposed approach

This section presents the SSGC method, a semi-supervised graph based clustering algorithm. We first briefly present the concept of the traditional k-nearest neighbors graph and then our SSGC approach.

3.1 The k-nearest neighbors graph

In this research, we use the graph presented in [13, 14]. We define the $k$-NNG as a weighted undirected graph in which each vertex represents a data point and possesses at most $k$ edges to its $k$-nearest neighbors. An edge is created between a pair of points $x_{i}$ and $x_{j}$ if and only if $x_{i}$ and $x_{j}$ have each other in their $k$-nearest neighbors sets. The weight $\omega(x_{i},x_{j})$ of the edge (the similarity) between two points $x_{i}$ and $x_{j}$ is defined as the number of common nearest neighbors the two points share, as shown in Eq. (2):

$\displaystyle\omega(x_{i},x_{j})=\mid NN(x_{i})\cap NN(x_{j})\mid$ (2)

where NN(.) denotes the set of $k$-nearest neighbors of the specified point. The important property of this similarity measure is its built-in automatic scaling, which makes it well suited to data sets with distinct cluster densities.
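The construction of the $k$-NNG and the weight of Eq. (2) can be sketched as follows. This is a brute-force illustration that assumes the Euclidean distance and represents the graph as a dictionary of edges; the names `knn_sets` and `build_knn_graph` are hypothetical.

```python
import math

def knn_sets(points, k):
    """NN(x_i) for every point: the indices of its k nearest neighbors."""
    neighbours = []
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        neighbours.append(set(order[:k]))
    return neighbours

def build_knn_graph(points, k):
    """Return {(i, j): weight} with i < j for all mutual k-nearest neighbors.

    The weight is |NN(x_i) ∩ NN(x_j)|, as in Eq. (2).
    """
    nn = knn_sets(points, k)
    edges = {}
    for i in range(len(points)):
        for j in nn[i]:
            if i < j and i in nn[j]:  # both points list each other
                edges[(i, j)] = len(nn[i] & nn[j])
    return edges
```

The brute-force neighbor search costs $O(n^{2})$ distance computations; for low-dimensional data a spatial index could be substituted without changing the rest of the sketch.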

3.2 Semi-supervised graph based clustering

Based on the k-NN graph defined in the previous section, we develop a new semi-supervised graph based clustering algorithm, called SSGC for short. The SSGC algorithm consists of two steps:

• Step 1: Partition the k-NN graph. The aim of this step is to partition the graph into connected components subject to the following cut condition:

Cut condition: “each connected component has at most one kind of seed”.

To do this, a threshold $\theta$ is used to partition the k-NN graph into connected components. To maximize the number of points in the connected components, we choose the initial value of $\theta$ as small as possible: $\theta$ is initialized to $0$ and, after each iteration in which the cut condition is not satisfied, $\theta$ is incremented by 1.

In every connected component that contains at least one seed, the label of its seeds is propagated to all points of the component. This step produces the main clusters.
• Step 2: Detect noise and build the final clusters. The remaining unlabeled points are divided into two kinds: points that have edges connecting them to one or more clusters, and the other points, which can be seen as isolated points.

In the first case, each point is assigned to the main cluster to which it is connected by the edge with the largest weight. This process is similar to the expansion mechanism of the minimum spanning tree (MST) in the SSDBSCAN algorithm [12].

For the isolated points in the second case, two choices are possible, depending on the purpose of the user: either remove them as noise or label them. In the latter case, we can apply a k-nearest neighbors rule: an isolated point receives the majority label among its $k$-nearest neighbors.

SSGC is presented in Algorithm 3.2. Note that the threshold $\theta$ is determined automatically by the iterations. The SSGC algorithm thus has a single parameter $k$, the number of neighbors in the graph.

Algorithm 3.2: SSGC
Input: Data set $\mathcal{X}$, number of neighbors $k$, a set of seeds $\mathcal{S}$
Output: A partition of $\mathcal{X}$
1. Construct the k-NN graph of $\mathcal{X}$
2. $\theta=0$
3. Repeat
3a. Construct the connected components with threshold $\theta$
3b. $\theta=\theta+1$
4. Until the cut condition is satisfied
5. Propagate the labels to form the principal clusters
6. Construct the final clusters
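The two steps of SSGC can be sketched in plain Python as below. Several details are assumptions made only for this illustration, since the text does not fix them: edges are kept when their weight is at least $\theta$, the cut condition is checked before $\theta$ is incremented, unlabeled points are attached one at a time through the heaviest edge leading to an already labeled point, and isolated points are returned as `None` (treated as noise). The names `knn_graph`, `components` and `ssgc` are hypothetical.

```python
import math

def knn_graph(points, k):
    """Mutual k-NN graph {(i, j): number of shared nearest neighbors}."""
    nn = []
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        nn.append(set(order[:k]))
    return {(i, j): len(nn[i] & nn[j])
            for i in range(len(points)) for j in nn[i]
            if i < j and i in nn[j]}

def components(n, edges, theta):
    """Connected components using only edges with weight >= theta (assumed)."""
    adj = {v: [] for v in range(n)}
    for (i, j), w in edges.items():
        if w >= theta:
            adj[i].append(j)
            adj[j].append(i)
    comps, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:  # depth-first search
            v = stack.pop()
            members.append(v)
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        comps.append(members)
    return comps

def ssgc(points, k, seeds):
    """seeds: {point index: label}. Returns one label per point (None = isolated)."""
    n = len(points)
    edges = knn_graph(points, k)
    # Step 1: raise theta until no component mixes seeds of different labels.
    theta = 0
    while True:
        comps = components(n, edges, theta)
        if all(len({seeds[v] for v in c if v in seeds}) <= 1 for c in comps):
            break
        theta += 1
    labels = [None] * n
    for c in comps:
        found = {seeds[v] for v in c if v in seeds}
        if found:
            label = found.pop()
            for v in c:
                labels[v] = label
    # Step 2: attach unlabeled points through the heaviest edge to a labeled point.
    while True:
        best = None
        for (i, j), w in edges.items():
            for u, v in ((i, j), (j, i)):
                if labels[u] is None and labels[v] is not None:
                    if best is None or w > best[0]:
                        best = (w, u, labels[v])
        if best is None:
            break
        labels[best[1]] = best[2]
    return labels
```

The loop on $\theta$ always terminates, because once $\theta$ exceeds the largest edge weight every component is a single vertex and the cut condition holds trivially. Step 2 is written for readability rather than speed; a priority queue would avoid rescanning all edges at each assignment.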

The complexity of constructing the k-NN graph is $O(n^{2})$, or $O(n\log n)$ if an optimized structure such as an R-Tree is used with low-dimensional data. The complexity of constructing the connected components is $O(nk)$, where $k$ is the number of neighbors in the k-NN graph, using breadth-first search (BFS) or depth-first search (DFS). So the complexity of the SSGC algorithm is $O(n^{2})$, or $O(n\log n)$ if an R-Tree is used.

4. Experiment results

4.1 Data sets

We use 8 well-known real data sets from the UCI Machine Learning Repository [17], namely Iris, Soybean, Ecoli, Zoo, Protein, Thyroid, Haberman, and Yeast, to evaluate our algorithm. The details of the data sets are shown in Table 1. These data sets have been chosen because they facilitate the reproducibility of the experiments and some of them have already been used in semi-supervised clustering articles.

Table 1
Main characteristics of the real datasets

ID	Data	#Objects	#Attributes	#Clusters
1	Soybean	47	35	4
2	Zoo	101	16	7
3	Protein	115	20	6
4	Iris	150	4	3
5	Ecoli	336	8	8
6	Thyroid	215	5	3
7	Yeast	1484	8	10
8	Haberman	306	3	2

Figure 1.

Result comparison between 4 semi-supervised clustering algorithms.

4.2 Evaluation method

The data set used for the evaluation includes a correct answer or label for each data point. We use the labels in a post-processing step to evaluate the performance of our approach.

We use the Rand Index (RI) measure [15], as it is widely used in evaluation of clustering results.

Figure 2.

Comparison of 4 semi-supervised clustering algorithms.

The RI measure computes the agreement between the theoretical partition of each data set and the output partition of the evaluated algorithms. This measure is based on $\frac{n(n-1)}{2}$ pairwise comparisons between the $n$ points of a data set $X$. For each pair of points $x_{i}$ and $x_{j}$ in $X$, a partition assigns them either to the same cluster or to different clusters.

Let us consider two partitions $P_{1}$ and $P_{2}$, and let $a$ be the number of decisions where the point $x_{i}$ is in the same cluster as $x_{j}$ in both $P_{1}$ and $P_{2}$. Let $b$ be the number of decisions where the two points are placed in different clusters in both partitions. The total agreement can then be calculated as shown in Eq. (3).

$\displaystyle RI(P_{1},P_{2})=\frac{2(a+b)}{n(n-1)}$ (3)

RI takes values between 0 and 1; RI $=$ 1 when the result is the same as the ground truth. The larger the RI, the better the result.
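Eq. (3) translates directly into a short pairwise computation. The sketch below is a straightforward $O(n^{2})$ illustration that takes two label lists describing $P_{1}$ and $P_{2}$; the function name `rand_index` is chosen for the example.

```python
from itertools import combinations

def rand_index(labels_1, labels_2):
    """Rand Index of Eq. (3) between two partitions given as label lists."""
    n = len(labels_1)
    agreements = 0  # a + b: pairs treated the same way by both partitions
    for i, j in combinations(range(n), 2):
        same_in_1 = labels_1[i] == labels_1[j]
        same_in_2 = labels_2[i] == labels_2[j]
        if same_in_1 == same_in_2:
            agreements += 1
    return 2 * agreements / (n * (n - 1))
```

Only the grouping of the points matters, not the label names, so a clustering that recovers the ground truth under different cluster identifiers still obtains RI $=$ 1.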

4.3 Comparative results

To evaluate the effectiveness of SSGC, we compare it with SSDBSCAN and semi-supervised K-Means clustering (SSKM). Additionally, we show the results obtained by K-Means clustering as the baseline reference. We also compare the computation time of SSGC with that of SSDBSCAN, because both work with the same kind of clusters (different sizes and shapes).

The seeds for all the semi-supervised methods are randomly chosen in each run. The results are averaged over 20 runs.

4.3.1 The quality of clustering

Figures 1 and 2 show the results obtained by the four algorithms. It can be observed from these figures that our method generally performs better than, or at least comparably with, SSDBSCAN. This can be explained by the fact that the representation of data through a graph is very appropriate and natural.

With the Protein data set, SSKM obtains the best result. This can be explained by the fact that Protein has 6 clusters for only 115 objects in total, so both SSGC and SSDBSCAN make errors in the propagation process.

With the Yeast data set in particular, SSKM and K-Means give better results than SSGC and SSDBSCAN. As explained in [11], this data set is highly unbalanced (5 to 423 objects per cluster) and its reachability plot shows no cluster structure at all, only overlapping clusters, so it is difficult for SSGC and SSDBSCAN to deal with.

With the 3D Haberman data set, which includes 2 clusters, both K-Means and SSKM obtain a Rand Index of only 50 percent; we can easily see that the two clusters in this data set overlap heavily (see Fig. 3). However, with a small set of seeds, SSGC and SSDBSCAN improve the Rand Index by 10%, because of the divide and conquer strategy they use in the process of finding clusters.

Figure 3.

Haberman data set visualisation.

Figure 4.

Running time comparison between SSGC and SSDBSCAN.

Figure 5.

Observing parameter for $8$ data sets.

4.3.2 The speed of clustering convergence

One important aspect that we want to point out here is the computation time of SSGC and SSDBSCAN. We compare SSGC and SSDBSCAN because they address the same kind of clustering (i.e. they can find clusters of different sizes, shapes, and densities in noisy data).

Under the same conditions (computer, data set, number of seeds), Fig. 4 shows that the computation time of SSGC is very low in comparison with the running time of SSDBSCAN. In general, SSGC is about 20 times faster than SSDBSCAN.

This can be explained theoretically: the complexity of SSDBSCAN is $O(mn^{2}\log n)$ or $O(mn^{2})$, while the complexity of SSGC is $O(n^{2})$ or $O(n\log n)$ (if using a kd-Tree, for example).

4.3.3 Parameter setting

Our algorithm has only one parameter $k$, the number of nearest neighbors. As illustrated in Fig. 5, for all $8$ data sets there exists a wide interval in which a suitable value of $k$ can be found.

5. Conclusion

In this paper, a new semi-supervised graph based clustering algorithm is proposed. To the best of our knowledge, this is the first seed based graph clustering algorithm. By using the number of common nearest neighbors to determine the similarity among objects, the algorithm is effective at detecting clusters of arbitrary shape and different density. Moreover, the computation time is significantly lower than that of SSDBSCAN, which makes our method suitable for real KDD applications.

In future work, we will continue to develop new semi-supervised clustering algorithms and apply them to real KDD applications such as facial expression recognition, web usage mining, and so on. Finally, determining the number of nearest neighbors, following the work of [16], will also be an attractive task.

Acknowledgments

This research is funded by Vietnam National University, Hanoi (VNU) under project number QG.18.40.

References

1. Jain A.K., Data clustering: 50 years beyond K-means, Pattern Recognition Letters 31(8) (2009), 651–666.

2. Oviedo, Moral and Puris, A hierarchical clustering method: Applications to educational data, Intelligent Data Analysis 20(4) (2016), 933–951.

3. Vágner, The GridOPTICS clustering algorithm, Intelligent Data Analysis 20(5) (2016), 1061–1084.

4. Basu, Davidson and Wagstaff, Constrained Clustering: Advances in Algorithms, Theory, and Applications, Chapman & Hall/CRC, 2008.

5. Antoine, Labroche and Vu V.-V., Evidential seed-based semi-supervised clustering, in: Proceeding of the 7th International Conference on Soft Computing and Intelligent Systems and 15th International Symposium on Advanced Intelligent Systems, 2014, pp. 706–711.

6. Antoine, Quost, Masson M.-H. and Denoeux, CECM: Constrained evidential C-means algorithm, Computational Statistics & Data Analysis 56(4) (2012), 894–914.

7. V.-V., Labroche and Bouchon-Meunier, Improving constrained clustering with active query selection, Pattern Recognition 45(4) (2012), 1749–1758.

8. V.-V. and Labroche, Active seed selection for constrained clustering, Intelligent Data Analysis 21(3) (2017), 537–552.

9. Basu, Banerjee and Mooney R.J., Semi-supervised Clustering by Seeding, in: Proceeding of 19th International Conference on Machine Learning, 2002, pp. 281–304.

10. Bensaid A.M., Hall L.O., Bezdek J.C. and Clarke L.P., Partially Supervised clustering for image segmentation, Pattern Recognition 29(5) (1996), 859–871.

11. Böhm and Plant, HISSCLU: a hierarchical density-based method for semi-supervised clustering, in: Proceeding of the 11th International Conference on Extending Database Technology, 2008, pp. 440–451.

12. Lelis and Sander, Semi-supervised Density-Based Clustering, in: Proceeding of the IEEE International Conference on Data Mining, 2009, pp. 842–847.

13. Jarvis R.A. and Patrick E.A., Clustering using a similarity measure based on shared near neighbors, IEEE Transactions on Computer 11 (1973), 1025–1034.

14. Ertoez, Steinbach and Kumar, Finding clusters of different sizes, shapes, and densities in Noisy, high dimensional data, in: Proceedings of the SIAM International Conference on Data Mining, 2003, pp. 47–58.

15. Rand W.M., Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association 66(336) (1971), 846–850.

16. Zhang and Song, Predicting the number of nearest neighbors for the k-NN classification algorithm, Intelligent Data Analysis 18(3) (2014), 449–464.

17. Asuncion and Newman D.J., UCI machine learning repository – University of California, Irvine, School of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html, 2015.
