Spectral clustering with adaptive similarity measure in Kernel space

Abstract

The similarity measure for complex data may not precisely reflect the true data structure, which leads to suboptimal clustering performance for spectral clustering. In this paper, we propose a novel spectral clustering method which measures the similarity of data points based on the adaptive neighborhood in Kernel space. In Kernel space, by assigning the adaptive and optimal neighbors for each data point based on the local structure, the proposed method learns a sparse matrix as the similarity matrix for spectral clustering. The proposed method is able to explore the underlying similarity relationships between data points, and is robust to the complex data. To validate the efficacy of the proposed method, we perform experiments on both synthetic and real datasets in comparison with some existing spectral clustering methods. The experimental results demonstrate that the proposed method obtains quite promising clustering performance.

Keywords

Spectral clustering, Kernel space, similarity measure, adaptive neighbors, local structure

1. Introduction

Clustering is an important tool in machine learning, data mining, graph compression and many other fields. Spectral clustering has attracted a lot of attention due to its high performance on some challenging clustering tasks. Because it can mine intrinsic geometric structures, spectral clustering is effective for complex data [17, 3], and thus has superior performance compared to traditional clustering methods. Applications of spectral clustering can be found in many research fields, including image segmentation [21], video retrieval [12], circuit layout [1] and bioinformatics [27].

Spectral clustering makes use of the spectrum of a normalized similarity matrix derived from the data to reveal the cluster structure. The similarity matrix is constructed with some similarity measure, so how the similarity of data points is measured is critical to the performance of spectral clustering. To achieve a proper clustering result, spectral clustering assumes that two nearby data points in the high density region of the reduced space (formed by the eigenvectors) have high similarity and belong to the same cluster. However, this assumption does not always hold. Complex data are often heterogeneous, high dimensional, and without prior knowledge. Due to the curse of dimensionality and the complex data structure, two such data points may be far away from each other in the original space [23]. Therefore, measuring the similarity of data points in the original space may not precisely reflect the underlying data structure, which leads to inaccurate clustering results.

In this paper, instead of measuring the similarity of data points in the original space, we project the original data to a feature space with the Mercer Kernel and then measure the similarity. In the Kernel space, if the underlying structure can be precisely captured, the performance of spectral clustering on complex data can be improved. To achieve this goal, we propose a novel spectral clustering method that measures the similarity by learning the adaptive and optimal neighbors for each data point based on the local structure in Kernel space. In the Kernel space, the similarity of two data points is measured based on the probability that these two data points are neighbors. Probabilistic neighborhood has been reported to be effective in similarity and data feature learning [9, 8]. To the best of our knowledge, we are the first to learn the probabilistic neighborhood in the Kernel space for similarity measure. Kernel methods have been successfully used to overcome the limitations of some existing data analysis techniques, as in Kernel PCA [2] and Kernel K-means [18]. In the proposed method, we introduce kernel methods to measure the similarity for constructing the similarity matrix of spectral clustering, which increases the linear separability by mapping the data points into the projected feature space. We demonstrate the effectiveness of the proposed method by using both synthetic and real datasets. The experimental results demonstrate that the proposed method not only achieves good performance, but also outperforms the existing spectral clustering methods.

The rest of this paper is organized as follows. After reviewing the related work in Section 2, we give a brief overview of spectral clustering in Section 3. We detail the proposed method in Section 4. Experimental results are given in Section 5 and Section 6 concludes the paper.

2. Related work

A good similarity measure method to construct the similarity matrix can significantly improve the clustering results of spectral clustering. Recently, a lot of effort has been devoted to the problem of how to construct the similarity matrix for spectral clustering. The Gaussian kernel function has been widely used to measure the similarity for spectral clustering, which is defined as $s_{ij}=\exp(-{\|x_{i}-x_{j}\|^{2}}/2\sigma^{2})$ with the kernel parameter $\sigma$ . $\|x_{i}-x_{j}\|$ is the Euclidean distance between data points $x_{i}$ and $x_{j}$ . Although Gaussian kernel function is simple to calculate, the optimal value of the kernel parameter $\sigma$ is difficult to find.
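As a minimal sketch of the Gaussian kernel similarity described above, the following NumPy snippet builds the full matrix $S=(s_{ij})$ for a small point set. It assumes the exponent is read as $-\|x_{i}-x_{j}\|^{2}/(2\sigma^{2})$; the function name `gaussian_similarity` and the toy data are illustrative and not part of the original work.

```python
import numpy as np

def gaussian_similarity(X, sigma):
    """Return the n x n matrix with s_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    X holds one data point per row (assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Toy data: three points in the plane
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
S = gaussian_similarity(X, sigma=1.0)
```

Changing `sigma` rescales every entry of `S` at once, which is why a single global value is hard to choose when clusters have different densities.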

Many spectral clustering methods which aim at finding the optimal value of the kernel parameter $\sigma$ in the Gaussian kernel function have been proposed [19, 28, 29, 6]. Ng et al. [19] proposed selecting the parameter automatically by comparing the results based on certain criteria. Zelnik-Manor and Perona [28] proposed a self-tuning spectral clustering method, which locally scaled the parameter by studying the local statistics of the neighborhood of data points. Cao et al. [6] optimized the local scaling parameter by calculating the maximum flows between data points. Li et al. [15] improved the parameter for more accurate similarity measure based on a warping model.

Instead of focusing on the kernel parameter, a family of spectral clustering methods measure the similarity of data points based on the $k$ -Nearest Neighbor ( $k$ NN) graph [17]. In the $k$ NN graph, point $x_{i}$ is connected to point $x_{j}$ if $x_{i}$ is among the $k$ -nearest neighbors of $x_{j}$ or $x_{j}$ is among the $k$ -nearest neighbors of $x_{i}$ . According to the $k$ NN graph, the pairwise similarity is $s_{ij}$ if point $x_{i}$ is connected to point $x_{j}$ , otherwise $s_{ij}=0$ . For this kind of method, the parameter $k$ is easier to find. Moreover, the similarity matrix $S=(s_{ji})$ is sparse, which makes the computation of the eigenvectors more efficient. Spectral clustering methods based on the $k$ NN graph can be found in [16, 24, 14].
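The "or" rule for the $k$ NN graph can be sketched as follows. This is an illustrative NumPy snippet, not the implementation of any cited method: the helper `knn_sparsify`, the choice of a Gaussian weight for connected pairs, and the one-dimensional toy data are all assumptions.

```python
import numpy as np

def knn_sparsify(S, D, k):
    """Keep s_ij only if i is among the k nearest neighbors of j, or j among those of i."""
    D = np.array(D, dtype=float)               # work on a copy of the distance matrix
    n = D.shape[0]
    np.fill_diagonal(D, np.inf)                # a point is not its own neighbor
    nearest = np.argsort(D, axis=1)[:, :k]     # k nearest neighbors of every point
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), nearest.ravel()] = True
    mask = mask | mask.T                       # the "or" rule makes the graph symmetric
    return np.where(mask, S, 0.0)

# Four points on a line; distances are chosen so that there are no ties
x = np.array([0.0, 1.0, 3.0, 10.0])
D = np.abs(x[:, None] - x[None, :])
S = np.exp(-D ** 2 / 2.0)                      # dense similarity (assumed Gaussian, sigma = 1)
W = knn_sparsify(S, D, k=1)
```

With `k=1`, only the chain 0-1, 1-2, 2-3 survives, so `W` has six nonzero entries out of sixteen.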

An interesting alternative to the distance-based similarity measure is to use information regarding the shared nearest neighbors. In most cases, two data points belong to the same cluster not only because they are near in distance, but also because they have many shared nearest neighbors which connect them in the same cluster [11, 26]. Similarity measures based on shared nearest neighbors have been studied by many researchers [13, 25]. Zhang et al. [29] used the shared nearest neighbors to explore the local density between data points, which had the effect of amplifying intra-cluster similarity. Ye and Sakurai [25] measured the similarity based on the closeness of shared nearest neighbors to improve the robustness of spectral clustering. Besides the methods based on shared nearest neighbors, Beauchemin [4] proposed a similarity measure method for spectral clustering using a density estimator which relied on the K-means method with a subbagging procedure.

In most of the previous work, the similarity of data points is measured in the original space. In this paper, we measure the similarity of data points in the Kernel space, with the objective of exploring the underlying similarity relationships between data points and improving the performance of spectral clustering.

3. Overview of spectral clustering

Given a set of $n$ data points $X=\{x_{1},x_{2},\ldots,x_{n}\}$ in $\mathbb{R}^{d}$ , the objective of clustering is to divide the data points into $c$ clusters. A general spectral clustering algorithm consists of two basic stages: similarity matrix construction and clustering.

The similarity matrix is $S=(s_{ji})$ , where $s_{ji}$ is calculated by some kind of similarity measure methods. The normalized Laplacian matrix $L$ is then computed based on $S$ . Finally, the clustering step is performed based on the eigen-decomposition of the normalized Laplacian matrix $L$ .

We state the general spectral clustering algorithm as follows [17, 25].

(1) Construct a similarity matrix $S$ , where $S=(s_{ji})$ .
(2) Compute the normalized Laplacian matrix $L$ as $L=I-D^{-1/2}SD^{-1/2}$ , where $D$ is the $n\times n$ diagonal matrix with $d_{i}=\sum\nolimits_{j=1}^{n}s_{ij}$ on the diagonal.
(3) Compute the $c$ smallest eigenvectors of $L$ (i.e., the eigenvectors associated with the $c$ smallest eigenvalues), which are $v_{1}$ , $v_{2}$ , …, $v_{c}$ .
(4) Form the matrix $V=(v_{1},v_{2},\ldots,v_{c})_{n\times c}$ using the eigenvectors as its columns.
(5) Form the matrix $U=(u_{ij})_{n\times c}$ from $V$ by normalizing the rows to norm 1, i.e., $u_{ij}=v_{ij}/(\sum\nolimits_{p}v^{2}_{ip})^{1/2}$ .
(6) Let $u_{i}\in\mathbb{R}^{c}$ be the vector corresponding to the $i^{th}$ row of $U$ .
(7) Cluster the points $(u_{i})_{i=1,\ldots,n}$ with the K-means method.
(8) Assign each data point $x_{i}$ to a given cluster if $u_{i}$ is assigned to this cluster.
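The steps above can be sketched in NumPy as follows, taking a precomputed similarity matrix as input. This is a hedged illustration rather than a reference implementation: the function names are hypothetical, a dense eigensolver is used for brevity, and the K-means stand-in uses a deterministic farthest-first initialization (an assumption made here for reproducibility, whereas random initializations are the usual choice).

```python
import numpy as np

def spectral_embedding(S, c):
    """Steps 2-5: normalized Laplacian, c smallest eigenvectors, row normalization."""
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)              # assumes every point has positive degree
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)                # eigenvalues come back in ascending order
    V = vecs[:, :c]
    return V / np.linalg.norm(V, axis=1, keepdims=True)

def kmeans(U, c, n_iter=100):
    """Step 7: Lloyd iterations with farthest-first initial centers (illustrative stand-in)."""
    centers = [U[0]]
    for _ in range(1, c):
        d2 = np.min([((U - m) ** 2).sum(axis=1) for m in centers], axis=0)
        centers.append(U[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([U[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(c)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy similarity matrix: two tight groups of three points, weakly linked to each other
S = np.full((6, 6), 0.01)
S[:3, :3] = 1.0
S[3:, 3:] = 1.0
np.fill_diagonal(S, 0.0)
labels = kmeans(spectral_embedding(S, c=2), c=2)   # step 8: labels[i] is the cluster of x_i
```

On this toy matrix the two groups map to two well-separated directions in the embedded space, so the first three points receive one label and the last three the other.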

The similarity matrix construction in the first step is crucial to the clustering result, since the following steps are based on it. In this paper, we focus on the first step to construct the similarity matrix $S$ , which is similar to that considered in the related work. Some work has also sought to improve other steps of spectral clustering. For example, the Nyström method has been applied in the third step for the eigenvector extraction problem to reduce the computation cost [10]. Genetic algorithms have also been utilized in the seventh step to improve the results of spectral clustering [22].

4. The proposed method

In this section, we introduce the proposed method that performs Adaptive Similarity Measure in Kernel space for Spectral Clustering, which is referred to as the ASMK-SC method. We first introduce the Mercer Kernel. Then we learn the adaptive and optimal neighbors of each data point in the Kernel space. ASMK-SC measures the similarity of data points based on the obtained adaptive neighbors to learn a sparse matrix as the similarity matrix for spectral clustering.

4.1 Mercer Kernel

Given a set of $n$ data points $X=\{x_{1},x_{2},\ldots,x_{n}\}$ in $\mathbb{R}^{d}$ , a Mercer Kernel is calculated according to a function $K:X\times X\to\mathbb{R}$ and expressed as

$\displaystyle K(x_{i},x_{j})=\phi(x_{i})\cdot\phi(x_{j}),$ (1)

where $\phi:X\to\mathcal{F}$ performs a mapping from the original space $X$ to a high dimensional feature space $\mathcal{F}$ . The relevant aspect in applications is that the Euclidean distance in $\mathcal{F}$ can be calculated without knowing $\phi$ explicitly. In this way, the Mercer Kernel allows large non-linear feature spaces to be explored while avoiding the curse of dimensionality. The radial basis function (RBF) Kernel is one of the most popular Mercer Kernels, which is defined as

$\displaystyle K(x_{i},x_{j})=\exp(-{\|x_{i}-x_{j}\|^{2}}/2\sigma^{2}),$ (2)

where the parameter $\sigma$ controls the scale of the dot product of the two data points. For the sake of convenience, we use Eq. (2) to map the original space to its feature space. Note that, as shown in the related work, the right side of Eq. (2) has also been used as a similarity function. In our method, we use it to perform data mapping but not similarity calculation.

The Euclidean distance between data points $x_{i}$ and $x_{j}$ in the feature space of Kernel is defined as

$\displaystyle d^{K}_{ij}=\sqrt{\|\phi(x_{i})-\phi(x_{j})\|^{2}}.$ (3)

According to Mercer Kernel, $d^{K}_{ij}$ can be calculated directly from Kernel values as follows.

$\displaystyle d^{K}_{ij}=\sqrt{K(x_{i},x_{i})+K(x_{j},x_{j})-2K(x_{i},x_{j})}.$ (4)
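As a small illustration of Eq. (4) with the RBF Kernel of Eq. (2), the sketch below computes the feature-space distance from kernel values only, without ever forming $\phi$ . The function names and the sample points are hypothetical. Because $K(x,x)=1$ for the RBF Kernel, Eq. (4) reduces to $\sqrt{2-2K(x_{i},x_{j})}$ , so the Kernel-space distance is bounded by $\sqrt{2}$ .

```python
import math

def rbf_kernel(x, y, sigma):
    """Eq. (2): K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def kernel_distance(x, y, sigma):
    """Eq. (4): Euclidean distance in the feature space, computed from kernel values."""
    sq = rbf_kernel(x, x, sigma) + rbf_kernel(y, y, sigma) - 2.0 * rbf_kernel(x, y, sigma)
    return math.sqrt(max(sq, 0.0))  # guard against tiny negative values from round-off

d_near = kernel_distance([0.0, 0.0], [0.1, 0.0], sigma=1.0)
d_far = kernel_distance([0.0, 0.0], [10.0, 0.0], sigma=1.0)
```

Here `d_near` stays close to the original distance of 0.1, while `d_far` saturates near $\sqrt{2}$ , which shows how $\sigma$ sets the scale beyond which all points look equally far apart.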

4.2 Adaptive neighbors in Kernel space

4.2.1 Adaptive neighbor learning

The probabilistic neighborhood is learned based on the assumption that each data point can be connected to all the other data points with certain probabilities [9]. In this paper, we use the Euclidean distance to measure the distance for the calculation of probabilistic neighborhood. In the Kernel space, consider that $\phi(x_{i})$ is connected to $\phi(x_{j})$ as a neighbor with probability $p_{ij}$ , $0\leqslant p_{ij}\leqslant 1$ . For all the data points, the probabilities to be connected to $\phi(x_{i})$ satisfy $\sum_{j=1}^{n}p_{ij}=1$ .

The data points that have smaller distances to $\phi(x_{i})$ should be connected to $\phi(x_{i})$ with higher probabilities. That is, $p_{ij}$ should be large if $\|\phi(x_{i})-\phi(x_{j})\|^{2}$ is small. Thus, the probabilities $p_{ij}$ ( $j=1,\ldots,n$ ) of all data points to be connected to $\phi(x_{i})$ can be determined by solving

$\displaystyle\min_{0\leqslant p_{ij}\leqslant 1,\sum_{j=1}^{n}p_{ij}=1}\sum_{j=1}^{n}\|\phi(x_{i})-\phi(x_{j})\|^{2}p_{ij}.$ (5)

However, the optimization problem in Eq. (5) has a trivial solution: only the nearest neighbor of $\phi(x_{i})$ is connected to it, with probability 1, and no other data point can be a neighbor of $\phi(x_{i})$ . To avoid the trivial solution, Eq. (5) can be combined with the following problem.

$\displaystyle\min_{0\leqslant p_{ij}\leqslant 1,\sum_{j=1}^{n}p_{ij}=1}\sum_{j=1}^{n}p_{ij}^{2}.$ (6)

The optimal solution of Eq. (6) is to make all data points be connected to $\phi(x_{i})$ with the same probability $1/n$ .

By combining Eq. (6) with Eq. (5), the objective function to obtain the adaptive neighbors in the Kernel space is defined as follows.

$\displaystyle\min_{0\leqslant p_{ij}\leqslant 1,\sum_{j=1}^{n}p_{ij}=1}\sum_{j=1}^{n}(\|\phi(x_{i})-\phi(x_{j})\|^{2}p_{ij}+\lambda p_{ij}^{2}),$ (7)

where $\lambda$ is a regularization parameter controlling the trade-off between the trivial solution ( $\lambda=0$ ) and the uniform distribution ( $\lambda=\infty$ ).

4.2.2 Optimization

By adjusting the parameter $\lambda$ , we would like to keep only the $k$ nearest neighbors for each data point in the Kernel space for local structure preservation. Inspired by [9], we provide an effective method to find the optimal parameter $\lambda$ for the optimization in Eq. (7) .

As shown in Eq. (3), ${d^{K}_{ij}}^{2}=\|\phi(x_{i})-\phi(x_{j})\|^{2}$ . Let $g_{ij}=-\frac{{d^{K}_{ij}}^{2}}{2\lambda}$ and denote ${g_{i}}\in\mathbb{R}^{n\times 1}$ as a vector with the $j^{th}$ element as ${g_{ij}}$ . Denote $p_{i}\in\mathbb{R}^{n\times 1}$ as a vector with the $j^{th}$ element as $p_{ij}$ . Let $1_{n}\in\mathbb{R}^{n\times 1}$ denote a vector with all of its elements being 1. Equation (7) can be written in vector form as

$\displaystyle\min_{0\leqslant p_{ij}\leqslant 1,p_{i}^{T}1_{n}=1}\frac{1}{2}\|p_{i}-g_{i}\|^{2}.$ (8)
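The step from Eq. (7) to Eq. (8) is a completion of the square. As a worked check, assuming $\lambda>0$ and writing $d_{ij}^{2}$ as shorthand for ${d^{K}_{ij}}^{2}$ , for a fixed $i$ we have:

```latex
\begin{aligned}
\sum_{j=1}^{n}\left(d_{ij}^{2}\,p_{ij}+\lambda p_{ij}^{2}\right)
&=\lambda\sum_{j=1}^{n}\left(p_{ij}+\frac{d_{ij}^{2}}{2\lambda}\right)^{2}
  -\frac{1}{4\lambda}\sum_{j=1}^{n}d_{ij}^{4}\\
&=\lambda\,\|p_{i}-g_{i}\|^{2}-\frac{1}{4\lambda}\sum_{j=1}^{n}d_{ij}^{4},
\qquad g_{ij}=-\frac{d_{ij}^{2}}{2\lambda}.
\end{aligned}
```

The last term does not depend on $p_{i}$ , and the positive factor $\lambda$ only rescales the objective, so minimizing Eq. (7) over the same constraint set gives the same minimizer as minimizing $\frac{1}{2}\|p_{i}-g_{i}\|^{2}$ .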

The Lagrangian function of Eq. (8) is

$\displaystyle\Gamma=\frac{1}{2}\|p_{i}-g_{i}\|^{2}-\mu(p_{i}^{T}1_{n}-1)-\sigma_{i}^{T}p_{i},$ (9)

where $\mu$ and $\sigma_{i}$ are the Lagrangian multipliers. According to the KKT condition [5], the optimal solution $p_{ij}$ can be obtained as

$\displaystyle p_{ij}=(g_{ij}+\mu)_{+}.$ (10)

To keep only the $k$ nearest neighbors of $\phi(x_{i})$ , we sort $g_{ij}$ ( $j=1,\ldots,n$ ) in descending order. Without loss of generality, we assume that $g_{i1}$ , $g_{i2}$ , …, $g_{in}$ are ordered from large to small. To satisfy that $p_{i}$ has only $k$ nonzero values, we set $p_{ik}>0$ and $p_{i,k+1}=0$ . According to Eq. (10), we have

$\displaystyle\left\{\begin{array}[]{ll}g_{i,k}+\mu>0,\\ g_{i,k+1}+\mu\leqslant 0.\end{array}\right.$ (11)

Since $p_{i}^{T}1_{n}=1$ , we can further obtain

$\displaystyle\mu=\frac{1}{k}-\frac{1}{k}\sum_{j=1}^{k}g_{ij}.$ (12)

Since $g_{ij}=-\frac{{d^{K}_{ij}}^{2}}{2\lambda}$ , by replacing $\mu$ in Eq. (10) with Eq. (12), the optimal value of $p_{ij}$ can be obtained by

$\displaystyle p_{ij}=\left(-\frac{{d^{K}_{ij}}^{2}}{2\lambda}+\frac{1}{k}+\frac{1}{2k\lambda}\sum_{j=1}^{k}{d^{K}_{ij}}^{2}\right)_{+}.$ (13)

Then we show how to decide $\lambda$ in Eq. (13). According to Eqs (11) and (12), we have the following inequality for $\lambda$ .

$\displaystyle\frac{k}{2}{d^{K}_{ik}}^{2}-\frac{1}{2}\sum_{j=1}^{k}{d^{K}_{ij}}^{2}<\lambda\leqslant\frac{k}{2}{d^{K}_{i,k+1}}^{2}-\frac{1}{2}\sum_{j=1}^{k}{d^{K}_{ij}}^{2}.$ (14)
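The two bounds in Eq. (14) follow from substituting Eq. (12) and $g_{ij}=-{d^{K}_{ij}}^{2}/(2\lambda)$ into the two conditions of Eq. (11) and multiplying by $k\lambda>0$ . A short worked version, again writing $d_{ij}^{2}$ as shorthand for ${d^{K}_{ij}}^{2}$ (a notational assumption of this sketch):

```latex
\begin{aligned}
g_{ik}+\mu>0
&\iff -\frac{d_{ik}^{2}}{2\lambda}+\frac{1}{k}+\frac{1}{2k\lambda}\sum_{j=1}^{k}d_{ij}^{2}>0
 \iff \lambda>\frac{k}{2}d_{ik}^{2}-\frac{1}{2}\sum_{j=1}^{k}d_{ij}^{2},\\
g_{i,k+1}+\mu\leqslant 0
&\iff -\frac{d_{i,k+1}^{2}}{2\lambda}+\frac{1}{k}+\frac{1}{2k\lambda}\sum_{j=1}^{k}d_{ij}^{2}\leqslant 0
 \iff \lambda\leqslant\frac{k}{2}d_{i,k+1}^{2}-\frac{1}{2}\sum_{j=1}^{k}d_{ij}^{2}.
\end{aligned}
```

Any $\lambda$ in this interval yields exactly $k$ nonzero probabilities for $\phi(x_{i})$ .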

To obtain the optimal solution $p_{i}$ that has only $k$ nonzero values, $\lambda$ can be set as

$\displaystyle\lambda=\frac{k}{2}{d^{K}_{i,k+1}}^{2}-\frac{1}{2}\sum_{j=1}^{k}{d^{K}_{ij}}^{2}.$ (15)

Then, replacing $\lambda$ in Eq. (13) with the above value, we obtain the optimal value of $p_{ij}$ for the $k$ nearest neighbors ( $j\leqslant k$ ) as

$\displaystyle p_{ij}=\frac{{d^{K}_{i,k+1}}^{2}-{d^{K}_{i,j}}^{2}}{k{d^{K}_{i,k+1}}^{2}-\sum_{j=1}^{k}{d^{K}_{ij}}^{2}}.$ (16)

For $j>k$ , the $(\cdot)_{+}$ operator in Eq. (13) gives $p_{ij}=0$ .

Compared to the regularization parameter $\lambda$ , the number of nearest neighbors $k$ is much easier and more intuitive to tune. Parameter $\lambda$ can be better handled by searching $k$ .
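To make the closed form concrete, here is a minimal pure-Python sketch of Eq. (16) for a single data point. The function name is hypothetical, and two choices are assumptions of this sketch rather than statements from the derivation: the point itself is excluded from the candidate list, and a uniform fallback is used when the $k+1$ nearest candidates are equidistant (zero denominator).

```python
def adaptive_neighbor_probs(sq_dists, k):
    """Eq. (16) for one point.

    sq_dists[j] is the squared Kernel-space distance to candidate j.
    The point itself is assumed to be excluded, and len(sq_dists) > k is required.
    """
    order = sorted(range(len(sq_dists)), key=lambda j: sq_dists[j])
    d_k1 = sq_dists[order[k]]                        # (k+1)-th smallest squared distance
    denom = k * d_k1 - sum(sq_dists[j] for j in order[:k])
    probs = [0.0] * len(sq_dists)
    for j in order[:k]:
        if denom > 0.0:
            probs[j] = (d_k1 - sq_dists[j]) / denom  # Eq. (16)
        else:
            probs[j] = 1.0 / k                       # tie fallback (assumption of this sketch)
    return probs

p = adaptive_neighbor_probs([4.0, 1.0, 9.0, 2.0], k=2)
# p is [0.0, 0.6, 0.0, 0.4]: the two nearest candidates share all the probability mass
```

In this example the denominator is $2\cdot 4-(1+2)=5$ , so Eq. (15) gives $\lambda=2.5$ , and the nearer of the two neighbors receives the larger probability.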

4.3 Similarity measure based on adaptive neighbors in Kernel space

We measure the similarity of data points based on the probabilistic neighborhood calculated in the Kernel space. Let $P$ denote the matrix with $p_{ij}$ as its entries. According to Eq. (16), $P$ is not symmetric, i.e., $p_{ij}\neq p_{ji}$ . Since the similarity matrix $S$ for spectral clustering should be symmetric, we calculate $s_{ij}$ based on $p_{ij}$ and $p_{ji}$ as

$\displaystyle s_{ij}=\frac{p_{ij}+p_{ji}}{2}.$ (17)

Thus, the similarity matrix $S$ is calculated as

$\displaystyle S=\frac{P+P^{T}}{2}.$ (18)

Note that the number of nearest neighbors $k$ is a parameter of $s_{ij}$ , as shown in Eqs (16) and (17). The similarity matrix $S$ is sparse, since $k$ is usually small compared to $n$ , which makes the computation of the eigenvectors efficient.

Spectral clustering is then performed based on the similarity matrix $S$ . The proposed ASMK-SC method is summarized in Algorithm 4.3. The last step in Algorithm 4.3 is performed as steps 4 to 8 of the general spectral clustering algorithm introduced in Section 3.

Algorithm 4.3 The proposed ASMK-SC method

Input: Data matrix $X\in\mathbb{R}^{d\times n}$ ; parameters $\sigma$ , $k$ , $c$ .
Output: Cluster indexes of $x_{1}$ , $x_{2}$ , …, $x_{n}$ .

Construct the $k$ -nearest neighbor graph in the Kernel space by computing $d^{K}_{ij}$ in Eq. (4);
for $i=1$ to $n$ do
    for $j=1$ to $n$ do
        $p_{ij}=\frac{{d^{K}_{i,k+1}}^{2}-{d^{K}_{i,j}}^{2}}{k{d^{K}_{i,k+1}}^{2}-\sum_{j=1}^{k}{d^{K}_{ij}}^{2}}$ ;
    end for
end for
Construct the similarity matrix $S=(s_{ij})$ by computing $s_{ij}=\frac{p_{ij}+p_{ji}}{2}$ ;
Compute a diagonal matrix $D$ with $d_{i}=\sum\nolimits_{j=1}^{n}s_{ij}$ on the diagonal;
Compute the normalized Laplacian matrix $L$ as $L=I-D^{-1/2}SD^{-1/2}$ ;
Compute the $c$ smallest eigenvectors of $L$ ;
Cluster the data points into $c$ clusters based on the $c$ smallest eigenvectors of $L$ by using the K-means method.
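The similarity-matrix part of the algorithm can be sketched in NumPy as below. This is an illustrative reading, not the authors' code: the function name `asmk_similarity` is hypothetical, data points are stored as rows (the transpose of the $d\times n$ convention used above), self-neighbors are excluded, probabilities beyond the $k$ nearest neighbors are set to zero, and a uniform fallback handles ties. It requires $k<n-1$ .

```python
import numpy as np

def asmk_similarity(X, sigma, k):
    """Similarity matrix S built from adaptive neighbors in the RBF Kernel space."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))                        # Eq. (2)
    diag = np.diag(K)
    dK2 = np.maximum(diag[:, None] + diag[None, :] - 2.0 * K, 0.0)    # square of Eq. (4)
    np.fill_diagonal(dK2, np.inf)              # assumption: a point is not its own neighbor
    P = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(dK2[i])
        top = order[:k]                        # the k nearest neighbors of point i
        d_k1 = dK2[i, order[k]]                # squared distance to the (k+1)-th neighbor
        denom = k * d_k1 - dK2[i, top].sum()
        if denom > 0:
            P[i, top] = (d_k1 - dK2[i, top]) / denom                  # Eq. (16)
        else:
            P[i, top] = 1.0 / k                # tie fallback (assumption of this sketch)
    return (P + P.T) / 2.0                     # Eq. (18)

# Two well-separated blobs of five points each (synthetic toy data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(3.0, 0.1, (5, 2))])
S = asmk_similarity(X, sigma=1.0, k=3)
```

Each row of `P` sums to one, so `S` sums to $n$ , and in this toy example no similarity is assigned across the two blobs. The remaining steps (Laplacian, eigenvectors, K-means) follow the general procedure of Section 3.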

4.4 Complexity analysis

We show the computational cost of the key steps in the proposed ASMK-SC method.

We obtain the solution of Eq. (7) according to Eq. (16) without solving the optimization problem directly. Thus, the computational cost of measuring the probabilistic neighborhood of each data point in the Kernel space is modest. The complexity of similarity matrix construction of ASMK-SC is $O(n^{2}\log n+nk)$ , which includes measuring the probabilistic neighborhood in Eq. (16) and adjusting the symmetry in Eq. (17). This computational complexity is similar to that of the spectral clustering methods which measure the similarity based on the $k$ NN graph.

In general, the complexity of computing the eigenvectors from a dense matrix is $O(n^{3})$ . However, similar to the $k$ NN based spectral clustering methods, the Laplacian matrix of the proposed ASMK-SC method is sparse. The eigenproblem can be solved by applying sparse eigensolvers [7], such as the variants of Lanczos/Arnoldi factorization (e.g., ARPACK) which have a cost of $O(h^{3})+(O(nh)+O(nk)+O(h-c))\times$ (number of restarted Arnoldi), where $h>c$ is the Arnoldi length used to compute the first $c$ eigenvectors of the Laplacian matrix.

5. Experimental results

We conduct the experiments on both synthetic and real datasets. We compare the proposed ASMK-SC method with three other state-of-the-art similarity measure methods for spectral clustering. The first one measures the similarity based on locally scaled parameter, i.e., ST-SC [28]. The second one measures the similarity based on local density, i.e., DA-SC [29]. The third one measures the similarity based on the $k$ NN graph, i.e., $k$ NN-SC [17]. To show the benefit of Kernel in our method, we also show the results by measuring the similarity based on the probabilistic neighborhood in the original space for spectral clustering, which is referred to as the ASM-SC method. The clustering results of K-means are also presented.

We evaluate the performance of different methods based on two widely used clustering metrics: NMI (Normalized Mutual Information) and ACC (Accuracy). In the experiments, we present the best result of each spectral clustering method obtained after exploring their parameters. The algorithm is repeated 20 times with random initializations, since the results of K-means (in the last step of spectral clustering) depend on the initialization. We present the average result with the standard deviation (std) for each method. We also evaluate the performance of the proposed method by varying the parameters $k$ and $\sigma$ . We show the average results and the 95% confidence intervals when varying the parameter $k$ .
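For readers who want to reproduce the two metrics, the sketch below gives one common way to compute them in pure Python. Several choices are assumptions, since the text does not specify them: NMI is normalized by the geometric mean of the two entropies, and ACC is taken as the best accuracy over one-to-one relabelings of the predicted clusters, found here by brute force (practical only for a small number of clusters; the Hungarian algorithm is the usual choice for larger ones).

```python
import math
from collections import Counter
from itertools import permutations

def nmi(true, pred):
    """Mutual information normalized by sqrt(H(true) * H(pred)) (assumed normalization)."""
    n = len(true)
    ct, cp = Counter(true), Counter(pred)
    joint = Counter(zip(true, pred))
    mi = sum(nij / n * math.log(n * nij / (ct[t] * cp[p]))
             for (t, p), nij in joint.items())
    ht = -sum(v / n * math.log(v / n) for v in ct.values())
    hp = -sum(v / n * math.log(v / n) for v in cp.values())
    if ht == 0.0 or hp == 0.0:
        return 1.0 if ht == hp else 0.0      # convention for single-cluster labelings
    return mi / math.sqrt(ht * hp)

def acc(true, pred):
    """Best accuracy over one-to-one mappings from predicted to true labels."""
    t_labels = sorted(set(true))
    p_labels = sorted(set(pred))
    # Pad with placeholders when there are more predicted than true clusters
    targets = t_labels + [None] * max(0, len(p_labels) - len(t_labels))
    best = 0
    for perm in permutations(targets, len(p_labels)):
        mapping = dict(zip(p_labels, perm))
        best = max(best, sum(mapping[p] == t for t, p in zip(true, pred)))
    return best / len(true)

true = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]
score_acc = acc(true, pred)   # 5 of 6 points match under the best relabeling
score_nmi = nmi(true, pred)
```

Both scores are invariant to how the cluster indexes are named, which is why they are suitable for comparing clusterings against ground-truth classes.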

5.1 Clustering results on synthetic data

We evaluate different methods on six 2D synthetic datasets which are commonly used in spectral clustering studies to assess method effectiveness, as shown in Fig. 1. The first toy dataset is 2-Spiral, which contains two clusters of data randomly distributed in two spiral shapes. Similarly, 4-Corner and 4-Circle consist of four clusters. 8-Gaussian contains eight clusters of data drawn from eight Gaussian distributions. 4S+Noise contains data points randomly distributed in four squares and in the space among the four squares. Since the data points among the four squares make the four squares not well separated, these data points can be seen as noise with respect to the four squares. The noise points are treated as a single cluster. Thus, 4S+Noise has five clusters. 2S+Circle contains three clusters of data randomly distributed in two squares and one circle. The two squares are fairly close to the circle, which makes the clustering task challenging.

Tables 1 and 2 report the clustering results of different methods in terms of NMI and ACC, respectively. For these synthetic datasets, the spectral clustering methods have better performance than the K-means method, especially on 2-Spiral and 4-Circle. Most of the spectral clustering methods can find the correct clusters of 2-Spiral, 4-Corner and 4-Circle. The proposed ASMK-SC method outperforms the other methods on most of the synthetic datasets.

Table 1
Clustering results (NMI% $\pm$ std) of different methods on synthetic data

Dataset	K-means	DA-SC	ST-SC	$k$ NN-SC	ASM-SC	ASMK-SC
2-Spiral	3.55 $\pm$ 0.11	100 $\pm$ 0.00	100 $\pm$ 0.00	100 $\pm$ 0.00	100 $\pm$ 0.00	100 $\pm$ 0.00
4-Corner	63.76 $\pm$ 9.98	97.68 $\pm$ 2.69	100 $\pm$ 0.00	100 $\pm$ 0.00	100 $\pm$ 0.00	100 $\pm$ 0.00
4-Circle	13.29 $\pm$ 0.59	90.20 $\pm$ 3.45	96.75 $\pm$ 2.77	100 $\pm$ 0.00	100 $\pm$ 0.00	100 $\pm$ 0.00
8-Gaussian	91.97 $\pm$ 5.10	95.13 $\pm$ 4.35	96.43 $\pm$ 3.01	96.52 $\pm$ 2.72	96.69 $\pm$ 3.70	97.40 $\pm$ 2.27
4S+Noise	70.61 $\pm$ 4.61	81.01 $\pm$ 0.88	82.00 $\pm$ 2.58	81.66 $\pm$ 1.25	81.90 $\pm$ 1.56	82.66 $\pm$ 1.75
2S+Circle	60.14 $\pm$ 0.71	63.58 $\pm$ 2.22	67.90 $\pm$ 1.56	67.39 $\pm$ 1.84	67.27 $\pm$ 2.29	68.72 $\pm$ 2.18

Figure 1.

Six 2D synthetic datasets with true clusters denoted by different markers and colors

We further evaluate the performance of the proposed method by varying the parameter $k$ on the synthetic datasets. Since $k$ NN-SC and ASM-SC have the same parameter $k$ , we compare the proposed method with these two methods in the experiments. Figures 2 and 3 show the clustering results on the six synthetic datasets in terms of NMI and ACC, respectively. Although all three methods can find the correct clusters of the 4-Circle dataset, the range of $k$ over which the proposed method finds the correct clusters is wider than that of the other two methods. On the 2S+Circle dataset, $k$ NN-SC obtains the best result when $k=6$ ; however, it is very unstable as $k$ varies. The performance of the proposed ASMK-SC method is better and more stable than that of $k$ NN-SC and ASM-SC on most of the synthetic datasets.

Table 2

Clustering Results (ACC % $\pm$ std) of different methods on synthetic data

Dataset	K-means	DA-SC	ST-SC	$k$ NN-SC	ASM-SC	ASMK-SC
2-Spiral	61.03 $\pm$ 0.17	100 $\pm$ 0.00	100 $\pm$ 0.00	100 $\pm$ 0.00	100 $\pm$ 0.00	100 $\pm$ 0.00
4-Corner	64.58 $\pm$ 2.27	98.02 $\pm$ 0.52	100 $\pm$ 0.00	100 $\pm$ 0.00	100 $\pm$ 0.00	100 $\pm$ 0.00
4-Circle	37.62 $\pm$ 1.04	89.63 $\pm$ 4.33	95.47 $\pm$ 3.63	100 $\pm$ 0.00	100 $\pm$ 0.00	100 $\pm$ 0.00
8-Gaussian	89.75 $\pm$ 5.96	93.06 $\pm$ 5.06	95.88 $\pm$ 3.21	95.98 $\pm$ 3.69	95.85 $\pm$ 3.24	97.42 $\pm$ 2.51
4S+Noise	84.13 $\pm$ 4.26	89.72 $\pm$ 1.84	89.84 $\pm$ 2.97	88.26 $\pm$ 3.63	89.67 $\pm$ 2.30	90.77 $\pm$ 2.54
2S+Circle	85.10 $\pm$ 0.31	86.82 $\pm$ 1.37	88.90 $\pm$ 0.18	87.80 $\pm$ 1.49	88.66 $\pm$ 0.96	89.19 $\pm$ 0.68

Figure 2.

NMI by varying the number of nearest neighbors $k$ on synthetic data.

Table 3

Properties of real data

Dataset	# of samples	# of Features	# of Clusters
UMIST	575	400	20
JAFFE	213	676	10
AT&T	400	644	40
AR10P	130	2400	10
ORL	400	1024	40
COIL20	1440	1024	20
MNIST	2000	784	10
Isolet1	1560	617	26
Lung	203	3312	5
GLA-BRA-180	180	49151	4

Figure 3.

ACC by varying the number of nearest neighbors $k$ on synthetic data.

5.2 Clustering results on real data

Table 4
Clustering results (NMI % $\pm$ std) of different methods on real data

Dataset	K-means	DA-SC	ST-SC	$k$ NN-SC	ASM-SC	ASMK-SC
UMIST	63.05 $\pm$ 1.61	67.09 $\pm$ 1.81	64.31 $\pm$ 1.29	81.05 $\pm$ 1.62	84.07 $\pm$ 2.11	85.82 $\pm$ 1.85
JAFFE	77.71 $\pm$ 5.15	85.54 $\pm$ 4.57	88.03 $\pm$ 6.51	91.95 $\pm$ 3.02	93.40 $\pm$ 3.70	95.37 $\pm$ 2.73
AT&T	79.08 $\pm$ 1.10	85.05 $\pm$ 1.27	80.03 $\pm$ 1.98	88.02 $\pm$ 1.02	89.36 $\pm$ 1.13	90.63 $\pm$ 1.10
AR10P	19.82 $\pm$ 3.66	29.22 $\pm$ 2.31	28.17 $\pm$ 3.09	29.63 $\pm$ 2.13	28.40 $\pm$ 2.71	30.21 $\pm$ 1.94
ORL	71.44 $\pm$ 1.85	79.75 $\pm$ 1.49	79.41 $\pm$ 0.76	78.73 $\pm$ 1.04	82.36 $\pm$ 0.70	82.83 $\pm$ 0.93
COIL20	73.98 $\pm$ 2.83	77.04 $\pm$ 0.10	78.25 $\pm$ 1.52	86.26 $\pm$ 2.09	90.87 $\pm$ 1.51	94.23 $\pm$ 0.13
MNIST	51.52 $\pm$ 2.17	52.74 $\pm$ 1.46	52.91 $\pm$ 1.68	52.10 $\pm$ 1.56	63.18 $\pm$ 2.06	66.86 $\pm$ 1.89
Isolet1	73.64 $\pm$ 1.56	72.05 $\pm$ 1.53	72.37 $\pm$ 1.57	72.22 $\pm$ 1.25	77.12 $\pm$ 1.20	78.26 $\pm$ 1.22
Lung	49.58 $\pm$ 2.84	60.17 $\pm$ 4.55	54.34 $\pm$ 4.25	62.89 $\pm$ 7.40	63.39 $\pm$ 6.80	65.81 $\pm$ 5.49
GLA-BRA-180	25.53 $\pm$ 2.64	24.34 $\pm$ 1.73	26.32 $\pm$ 2.31	25.80 $\pm$ 0.34	28.94 $\pm$ 0.96	29.83 $\pm$ 0.52

Table 5

Clustering results (ACC % $\pm$ std) of different methods on real data

Dataset	K-means	DA-SC	ST-SC	$k$ NN-SC	ASM-SC	ASMK-SC
UMIST	41.81 $\pm$ 2.18	48.55 $\pm$ 2.53	42.34 $\pm$ 8.22	64.044 $\pm$ 5.97	68.37 $\pm$ 3.58	70.22 $\pm$ 3.72
JAFFE	71.43 $\pm$ 7.09	75.35 $\pm$ 6.93	75.14 $\pm$ 6.51	89.30 $\pm$ 5.02	91.22 $\pm$ 6.33	93.42 $\pm$ 5.12
AT&T	59.21 $\pm$ 2.98	70.15 $\pm$ 3.21	61.26 $\pm$ 2.93	75.13 $\pm$ 2.87	77.39 $\pm$ 2.46	79.10 $\pm$ 2.27
AR10P	23.54 $\pm$ 2.88	30.77 $\pm$ 2.79	29.22 $\pm$ 3.00	34.50 $\pm$ 2.33	33.21 $\pm$ 2.75	35.12 $\pm$ 2.43
ORL	50.09 $\pm$ 2.98	63.48 $\pm$ 2.77	57.68 $\pm$ 2.92	54.26 $\pm$ 2.39	66.88 $\pm$ 2.17	67.09 $\pm$ 2.17
COIL20	58.48 $\pm$ 5.87	67.24 $\pm$ 0.23	59.89 $\pm$ 4.23	77.73 $\pm$ 3.86	80.71 $\pm$ 4.15	89.60 $\pm$ 1.25
MNIST	56.93 $\pm$ 3.94	55.81 $\pm$ 2.04	57.68 $\pm$ 2.92	52.26 $\pm$ 4.08	62.48 $\pm$ 3.97	66.37 $\pm$ 3.65
Isolet1	56.68 $\pm$ 3.47	58.54 $\pm$ 3.22	56.50 $\pm$ 3.39	56.26 $\pm$ 2.60	61.03 $\pm$ 2.86	62.84 $\pm$ 2.17
Lung	64.95 $\pm$ 4.83	74.26 $\pm$ 6.78	70.23 $\pm$ 7.33	82.46 $\pm$ 6.43	83.17 $\pm$ 5.34	85.79 $\pm$ 5.23
GLA-BRA-180	56.39 $\pm$ 3.40	54.17 $\pm$ 2.67	59.50 $\pm$ 3.66	56.66 $\pm$ 0.96	62.52 $\pm$ 2.02	63.27 $\pm$ 1.88

Figure 4.

NMI by varying the number of nearest neighbors $k$ .

We use ten diverse public datasets to evaluate the performance of different methods. The datasets include five face image datasets, i.e., UMIST,¹ JAFFE,² AR10P,³ ORL⁴ and AT&T [20], one object image dataset, i.e., COIL20,⁵ one handwritten digit dataset, i.e., MNIST, one spoken letter recognition dataset, i.e., Isolet1, and two biological datasets, i.e., Lung and GLA-BRA-180. Their properties are summarized in Table 3.

¹ http://www.sheffield.ac.uk/eee/research/iel/research/face.
² http://www.cs.nyu.edu/roweis/data.html.
³ http://featureselection.asu.edu/old/datasets.php.
⁴ http://featureselection.asu.edu/datasets.php.
⁵ http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html.

The clustering results of different methods on the ten real datasets are shown in Tables 4 and 5 in terms of NMI and ACC, respectively. We can see from the two tables that the proposed ASMK-SC method outperforms other methods. The real datasets have high dimensions. Compared to the results on the synthetic datasets with two dimensions, ASMK-SC improves the clustering results more significantly on the real datasets. Compared to ASM-SC that is without Mercer Kernel, the proposed method has better performance. That is because Mercer Kernel can offer a more general way to represent the complex data, by which the clusters can be more accurately identified.

Figure 5.

ACC by varying the number of nearest neighbors $k$ .

The proposed ASMK-SC method has two parameters, i.e., $k$ and $\sigma$ . We evaluate the performance of ASMK-SC by varying the two parameters on six of the real datasets: JAFFE, AT&T, AR10P, COIL20, MNIST and Lung. In Figs 4 and 5, we compare the clustering results among ASMK-SC, ASM-SC and $k$ NN-SC for variations in the parameter $k$ . We present the range of $k$ that gives the best clustering results for all three clustering methods. From the clustering results, we can see that on most of the datasets ASMK-SC performs better and more stably than the other two methods. The results of ASMK-SC on the COIL20 dataset are very stable.

Figure 6.

NMI by varying $\sigma$ and the number of nearest neighbors $k$ .

Figure 7.

ACC by varying $\sigma$ and the number of nearest neighbors $k$ .

Finally, we study the performance variation of ASMK-SC with respect to the parameters $\sigma$ and $k$ . The experimental results are shown in Figs 6 and 7 in terms of NMI and ACC, respectively. The range of $\sigma$ to obtain the best result is found empirically. ASMK-SC can maintain good performance over a large range of $\sigma$ . The results on JAFFE, AT&T, COIL20 and MNIST are quite stable with respect to both parameters $\sigma$ and $k$ . The results on AR10P and Lung are stable with respect to the parameter $\sigma$ . We can see from these results that the proposed ASMK-SC method is not very sensitive to the parameter $\sigma$ .

Note that on some of the datasets, such as AR10P, the proposed method could not obtain very significant improvements or ideal clustering results. One reason may be that we use the Euclidean distance as the distance metric to measure the similarity, which may not be the best choice for these datasets. Another reason may be that the datasets contain some redundant and noisy features that affect the accuracy of the clustering results. We will address these problems in future work.

6. Conclusion

In this paper, we propose a novel spectral clustering method which measures the similarity of data points based on the adaptive neighborhood in the Kernel space. We introduce kernel methods and apply a probabilistic neighborhood to measure the similarity when constructing the similarity matrix for spectral clustering. This allows the method to explore the underlying similarity relationships between data points and improves the performance of spectral clustering. Experiments on various types of datasets demonstrate the advantages of the proposed method. In future work, to further improve the clustering results, we will try different distance metrics to measure the data similarity and perform feature selection for data preprocessing. We will also consider efficient methods to determine the parameter $\sigma$ automatically to obtain the best results.

References

1. Alpert C.J. and Kahng A.B., Multiway partitioning via geometric embeddings, orderings and dynamic programming, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 4(11) (1995), 1342–1358.

2. Smola, Scholkopf and Muller K.R., Kernel principal component analysis, In Proceedings of Artificial Neural Networks, 1997, pp. 583–588.

3. Bach F.R. and Jordan M.I., Learning spectral clustering, In Proceedings of Advances in Neural Information Processing Systems, 2004, pp. 305–312.

4. Beauchemin, A density-based similarity matrix construction for spectral clustering, Neurocomputing, 2015, pp. 835–844.

5. Boyd and Vandenberghe, Convex optimization, Cambridge University Press, 2004.

6. Cao, Chen, Zheng and Dai, A max-flow-based similarity measure for spectral clustering, ETRI Journal 35(2) (2013), 311–320.

7. Chen, Song, Bai, Lin and Chang E.Y., Parallel spectral clustering in distributed systems, IEEE Transactions on Pattern Analysis and Machine Intelligence 33(3) (2011), 568–586.

8. and Shen, Unsupervised feature selection with adaptive structure learning, In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 209–218.

9. Wang, Nie and Huang, Clustering and projected clustering with adaptive neighbors, In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 977–986.

10. Fowlkes, Belongie, Chung and Malik, Spectral grouping using the Nystrom method, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004), 214–225.

11. Guha, Rastogi and Kyuseok, ROCK: a robust clustering algorithm for categorical attributes, In Proceedings of 15th International Conference on Data Engineering, 1999, pp. 512–521.

12. Xie, Zeng and Maybank, Semantic-based surveillance video retrieval, IEEE Transactions on Image Processing 16(4) (2007), 1168–1181.

13. Jarvis R.A. and Patrick E.A., Clustering using a similarity measure based on shared near neighbors, IEEE Transactions on Computers C-22(11) (1973).

14. Shen, Dick and Zhang, Context-aware hypergraph construction for robust spectral clustering, IEEE Transactions on Knowledge and Data Engineering 26(10) (2013), 2588–2597.

15. Liu, Chen and Tang, Noise robust spectral clustering, In Proceedings of ICCV, 2007.

16. Lucińska and Wierzchoń S.T., Spectral clustering based on k-nearest neighbor graph, Computer Information Systems and Industrial Management 7564 (2012), 254–265.

17. Luxburg, A tutorial on spectral clustering, Statistics and Computing 17(4) (2007), 395–416.

18. Masulli, Filippone, Camastra and Rovetta, Robust similarity measure for spectral clustering based on shared neighbors, Pattern Recognition 1(41) (2008), 176–190.

19. Ng A.Y., Jordan M.I. and Weiss, On spectral clustering: Analysis and an algorithm, In Proceedings of Advances in Neural Information Processing Systems, 2002, pp. 849–856.

20. Samaria and Harter, Parameterisation of a stochastic model for human face identification, In Proceedings of IEEE Workshop on Applications of Computer Vision, 1994.

21. Shi and Malik, Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8) (2000), 888–905.

22. Wang, Chen and Guo, A genetic spectral clustering algorithm, Journal of Computational Information Systems 7 (2011), 3245–3252.

23. Yang, Chang, Nie and Zhou, A convex formulation for spectral shrunk clustering, In Proceedings of AAAI Conference on Artificial Intelligence, 2015, pp. 2532–2538.

24. Xiong, Johnson D.M. and Corso J.J., Spectral active clustering via purification of the k nearest neighbor graph, In Proceedings of European Conference on Data Mining, 2012.

25. and Sakurai, Spectral clustering using robust similarity measure based on closeness of shared nearest neighbors, In Proceedings of International Joint Conference on Neural Networks (IJCNN), 2015, pp. 1–8.

26. and Sakurai, Robust similarity measure for spectral clustering based on shared neighbors, ETRI Journal 38(3) (2016), 540–550.

27. You and Han, SC3: Triple spectral clustering based consensus clustering framework for class discovery from cancer gene expression profiles, IEEE/ACM Transactions on Computational Biology and Bioinformatics 9(6) (2012), 175–1765.

28. Zelnik-Manor and Perona, Self-tuning spectral clustering, In Proceedings of NIPS, 2005, pp. 1601–1608.

29. Zhang and Yu, Local density adaptive similarity measurement for spectral clustering, Pattern Recognition Letters 32(2) (2011), 352–358.

