LSEC: Large-scale spectral ensemble clustering

Abstract

A fundamental problem in machine learning is ensemble clustering, that is, combining multiple base clusterings to obtain an improved clustering result. However, most of the existing methods are unsuitable for large-scale ensemble clustering tasks owing to efficiency bottlenecks. In this paper, we propose a large-scale spectral ensemble clustering (LSEC) method to balance efficiency and effectiveness. In LSEC, a large-scale spectral clustering-based efficient ensemble generation framework is designed to generate various base clusterings with low computational complexity. Thereafter, all the base clusterings are combined using a bipartite graph partition-based consensus function to obtain improved consensus clustering results. The LSEC method achieves a lower computational complexity than most existing ensemble clustering methods. Experiments conducted on ten large-scale datasets demonstrate the efficiency and effectiveness of the LSEC method. The MATLAB code of the proposed method and experimental datasets are available at https://github.com/Li-Hongmin/MyPaperWithCode.

Keywords

Ensemble clustering, spectral clustering, landmark selection, approximate similarity computation, large-scale clustering

1. Introduction

Ensemble clustering, also known as consensus clustering, is a classic problem in machine learning that aims to combine multiple base clusterings into an improved consensus clustering [28, 22, 9, 17, 32, 26, 36, 14, 15, 25, 24, 12, 11, 42, 43]. Owing to its strong performance, ensemble clustering plays a pivotal role in many research areas, such as community detection [29] and bioinformatics [18, 33].

There are two critical steps in ensemble clustering: ensemble generation and the consensus function. Ensemble generation aims to generate multiple base clusterings on the same dataset. In the early stage, $k$ -means based ensemble generation methods [17, 31, 24] were widely used. Recently, spectral clustering based ensemble generation methods [16, 19] have received attention for their high performance. The consensus function, in turn, is used to integrate multiple base clusterings into a consensus clustering. According to the consensus function, ensemble clustering methods can be roughly divided into two categories: co-association matrix-based methods and graph partitioning-based methods.

Co-association matrix-based ensemble clustering [9, 17, 35, 34] is one of the most widely used ensemble clustering strategies. A typical example is the evidence accumulation clustering (EAC) method [9], which counts how frequently each pair of data points falls in the same cluster across the base clusterings. After treating the co-association matrix as a similarity matrix, a hierarchical agglomerative clustering algorithm is applied to obtain the consensus clustering. Iam-On et al. [17] extended the EAC method by constructing a co-association matrix based on common neighborhood information between clusters. Tao et al. [30] proposed a robust spectral ensemble clustering method to learn a robust representation for the co-association matrix by capturing noise and conducting spectral clustering to realize consensus clustering. Huang et al. [15] also enhanced the co-association matrix based on similarity mapping from the cluster level to the object level and achieved ensemble clustering via fast propagation of cluster-wise similarities. However, co-association matrix-based methods often incur high computational costs, which have become a bottleneck for large-scale clustering tasks. Therefore, most co-association matrix-based methods function efficiently on small-scale datasets but cannot complete large-scale clustering tasks within an acceptable time.

Graph partitioning-based ensemble clustering methods [28, 7, 13, 19] aim to transform the ensemble clustering problem into a graph partitioning problem to realize consensus clustering. Strehl and Ghosh [28] constructed a hypergraph representation by exploring base clusterings and proposed three graph partitioning-based ensemble clustering methods. Huang et al. [13] developed a sparse graph with a small number of probably reliable links from base clusterings and realized consensus clustering based on probability trajectory analysis. Li et al. [19] applied the spectral clustering method on base clusterings; they considered the graph Laplacian matrices of base clusterings as input, thereafter learnt a consensus representation by optimizing the graph Laplacians of consensus clustering and base clusterings simultaneously, and finally conducted spectral clustering to realize consensus clustering. Although graph partitioning-based methods have successfully improved clustering quality, they still have limitations regarding large-scale datasets.

Recently, a few studies have made progress in applying ensemble clustering to large-scale data. Wu et al. [36] proposed a $k$ -means-based consensus clustering (KCC) method that applies the $k$ -means method on a contingency matrix from the base clusterings to obtain the consensus clustering result efficiently. Liu et al. [24] transformed the spectral clustering of the co-association matrix into a weighted $k$ -means method and proved that the two approaches are equivalent, achieving high efficiency for ensemble spectral clustering. Huang et al. [16] emphasized the efficiency bottleneck of $k$ -means-based ensemble generation and applied a large-scale spectral clustering method to speedily produce the base clusterings. Although these studies have achieved success in their respective fields, large-scale ensemble clustering remains a significant challenge owing to its high computational complexity. Notably, the ensemble-generation step occupies a considerable portion of the run-time in large-scale ensemble clustering tasks, yet it has rarely been investigated in the literature.

In this paper, we propose a large-scale spectral ensemble clustering (LSEC) method to make ensemble clustering applicable to large-scale data. A spectral clustering-based ensemble generation method is designed to efficiently handle nonlinear datasets and provide high-quality base clusterings. The ensemble generation process is further accelerated by reusing $K$ -nearest neighbors among the base clusterings and using light- $k$ -means to obtain the clustering results. After ensemble generation, a bipartite graph between data points and clusters from base clusterings is constructed to efficiently realize consensus clustering through the bipartite graph partitioning method. Experimental results on ten large-scale datasets demonstrate that LSEC achieves high efficiency and high clustering quality compared with six state-of-the-art consensus clustering methods.

The main contributions of this study are as follows:

• An efficient spectral clustering-based ensemble generation method is designed to handle large-scale datasets and provide high-quality base clusterings via a divide-and-conquer-based large-scale spectral clustering method.
• Two accelerating approaches are proposed: 1) the computation of similarity among multiple base clusterings is accelerated by reusing the $K$ -nearest neighbors, and 2) the process of obtaining base clustering results is accelerated by the light- $k$ -means method.
• The proposed method efficiently generates base clusterings and conducts bipartite graph partitioning to realize consensus clustering. The computational and space complexities are dominated by $O\left(\frac{m}{q}N\alpha d\right)$ and $O(NK)$ , respectively; thus, the proposed method realizes lower computational complexity than most existing ensemble clustering methods.

2. Related work

In this section, we briefly discuss the existing work relevant to the current study.

2.1 Spectral clustering

Given a dataset $X$ of data points $x_{1},\ldots,x_{n}\in R^{d}$ , spectral clustering first calculates an $n\times n$ pairwise similarity matrix $S$ , where $s_{ij}$ indicates the similarity between $x_{i}$ and $x_{j}$ according to a certain similarity metric. Thereafter, eigen-decomposition is performed on the graph Laplacian of the similarity matrix to compute the top $k$ eigenvectors. A low-dimensional embedding is determined by embedding the data points according to the obtained $k$ eigenvectors. In this embedding, the final clustering results can be achieved by $k$ -means or other discretization methods.

Although spectral clustering has shown superior performance on complex data, it is often limited in its application to large-scale datasets because of its $O(N^{3})$ time complexity and $O(N^{2})$ space complexity, where $N=n$ denotes the number of data points. To address this problem, various improvements [8, 3, 39, 20] on large-scale spectral clustering have been developed. More recently, Li et al. [21] proposed a divide-and-conquer based large-scale spectral clustering (DnC-SC) that reduces the time complexity and space complexity to $O(N\alpha d)$ and $O(Nk)$ , respectively. In this study, the proposed ensemble framework also applies DnC-SC as the base spectral clustering algorithm.

2.2 Ensemble clustering

Ensemble clustering, which aims to integrate various base partitions into a consensus partition, consists of two steps: the ensemble generation and consensus function. In the ensemble-generation step, a clustering algorithm is usually employed to produce various base partitions with different parameters. In the consensus function step, all base partitions are integrated into a consensus partition using a certain objective function. Ensemble clustering methods often demonstrate advantages in improving the clustering accuracy and robustness. However, most ensemble clustering methods are not suitable for large-scale clustering tasks owing to efficiency bottlenecks.

EAC [9] is a classic ensemble framework that considers the co-occurrence of two objects in the same cluster of a base clustering as evidence that the two objects should belong to the same cluster. Thus, EAC accumulates the “evidence” by counting the frequency of co-occurrence to construct a co-association matrix and conducts the single-link method to obtain a final clustering result. However, the co-association matrix-based ensemble framework has a high time complexity and is not scalable. To improve the scalability of ensemble clustering, KCC [36] and SEC [24] transform the problem into a modified $k$ -means clustering problem. KCC [36] transforms ensemble clustering into a $k$ -means based consensus clustering with less cost than EAC. SEC [24] applies spectral clustering on the co-association matrix and transforms it into a weighted $k$ -means clustering problem. PTGP [13] introduces a microcluster representation to reduce the problem size and speed up the computation, and treats the set of microclusters as the primitive objects on which to conduct a probability trajectory-based graph partitioning. LWGP [14] is an ensemble clustering approach based on ensemble-driven cluster uncertainty estimation and a local weighting strategy. It reduces the problem size by applying the local weighting strategy and facilitates the computation through bipartite graph partitioning. The above-mentioned methods use $k$ -means based ensemble generation, whose time cost is not suitable for large-scale datasets. To accelerate ensemble generation, U-SENC [16] employs a large-scale spectral clustering algorithm to reduce the cost of the ensemble generation, which significantly accelerates the ensemble clustering process. The large-scale spectral clustering-based ensemble generation improves both the efficiency and accuracy, particularly on large-scale datasets.

However, such approaches remain unsatisfactory because they do not reduce the time cost by an order of magnitude. In contrast to [16], our proposed ensemble clustering approach applies DnC-SC as the base clustering algorithm in a redesigned ensemble generation framework to reduce the time cost. We propose reusing the intermediate variables of DnC-SC to obtain diverse graph partitioning problems, and subsequently compute approximate base clustering results based on light- $k$ -means [21]. Further, to obtain the consensus clustering results, we take advantage of bipartite graph partitioning to quickly compute the consensus clustering. Extensive experiments on a variety of datasets have shown that our approach exhibits significant advantages in clustering accuracy and efficiency when compared to the state-of-the-art approaches.
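To make the scalability issue of the co-association framework concrete, the following minimal Python sketch builds the co-association matrix that EAC accumulates. It is an illustration only: the function name `co_association` and the toy labels are assumptions, not part of any cited implementation.

```python
from itertools import combinations


def co_association(base_clusterings):
    """Return the n x n co-association matrix as nested lists.

    base_clusterings: list of m label lists, each of length n.
    Entry (i, j) is the fraction of base clusterings that place
    points i and j in the same cluster.
    """
    m = len(base_clusterings)
    n = len(base_clusterings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in base_clusterings:
        for i in range(n):
            counts[i][i] += 1  # a point always co-occurs with itself
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    return [[c / m for c in row] for row in counts]


# Toy ensemble: three base clusterings of four points (made-up labels).
toy = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
A = co_association(toy)
```

The double loop visits every pair of points for every base clustering, so this sketch needs on the order of $mn^{2}$ operations and $n^{2}$ memory, which is the bottleneck discussed above.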

3. Preliminaries

3.1 Ensemble clustering

Let ${X}$ be a dataset ${X}=\{x_{1},\ldots,x_{n}\}$ with $n$ data points. Ensemble generation is the first step, which applies a specific clustering algorithm to produce $m$ base clusterings. Let ${\Pi}=\{\pi_{1},\ldots,\pi_{m}\}$ be a set of base clusterings, where $\pi_{i}$ is the $i$ -th base clustering and $\pi_{i}=\{\pi_{i}(x_{1}),\pi_{i}(x_{2}),\cdots,\pi_{i}(x_{n})\}$ indicates the clustering labels for all data points. Many studies [9, 17, 36, 23, 30, 13] employed $k$ -means-based ensemble generation, while some studies [16, 19] emphasized that spectral clustering-based ensemble generation can significantly improve the clustering quality on nonlinear datasets. After ensemble generation, the consensus function is used to integrate all base clusterings into a consensus clustering.

3.2 Divide-and-conquer-based large-scale spectral clustering

DnC-SC has been proposed as an effective method for large-scale clustering tasks [21]. It first constructs an approximate similarity matrix via a divide-and-conquer-based landmark selection and an approximate $K$ -nearest landmark search. Thereafter, it transfers the original spectral clustering problem into a bipartite graph partition problem to determine the low-dimensional embedding by solving a smaller eigenproblem. Finally, $k$ -means is applied on the low-dimensional embedding to obtain the final clustering result.

Let ${R}=\{r_{1},r_{2},\cdots,r_{p}\}$ denote a set of landmarks, where $r_{i}\in\mathbb{R}^{d}$ has the same dimension as $x_{i}$ . The divide-and-conquer-based landmark selection is designed to generate a set of landmark points that best represent the original data $X$ . The objective Eq. (1) measures how appropriately $R$ represents $X$ by computing the residual sum of squares (RSS) between each $x_{j}$ and its nearest $r_{i}$ .

$\displaystyle\zeta=\sum_{i=1}^{p}\sum_{x_{j}\in S_{i}}\|x_{j}-r_{i}\|^{2},$ (1)

where $\zeta$ denotes the RSS and $S_{1},S_{2},\dots,S_{p}$ indicate the subsets that are nearest to $r_{1},r_{2},\cdots,r_{p}$ , respectively. For each subset $S_{i}$ , $r_{i}$ is the subset center. Objective Eq. (1) can be rewritten as follows:

$\displaystyle g(X,p)=\text{argmin}_{S_{1},\dots,S_{p}}\sum_{i=1}^{p}\sum_{x_{j}\in S_{i}}\|x_{j}-r_{i}\|^{2}.$ (2)
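As a small illustration of the objective in Eq. (1), the sketch below evaluates the RSS of a candidate landmark set by charging each point the squared distance to its nearest landmark. The helper names and the toy coordinates are assumptions made for this example.

```python
def nearest_landmark(x, landmarks):
    """Index of the landmark closest to x in squared Euclidean distance."""
    return min(range(len(landmarks)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, landmarks[i])))


def rss(points, landmarks):
    """Residual sum of squares of Eq. (1): every point contributes the
    squared distance to its nearest landmark."""
    total = 0.0
    for x in points:
        r = landmarks[nearest_landmark(x, landmarks)]
        total += sum((a - b) ** 2 for a, b in zip(x, r))
    return total


# Made-up 2-D data: two pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [(0.0, 1.0), (10.0, 1.0)]   # the two subset centers
poor = [(5.0, 1.0), (20.0, 1.0)]   # arbitrary positions
```

On this toy data the subset centers give a much smaller RSS than the arbitrary positions, which is the sense in which the minimizer of Eq. (2) "best represents" $X$ .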

The recursive Eqs (3) and (4) are used to divide the optimization problem into small sub-problems that are easier to solve.

$\displaystyle g(Q,h)=\bigcup_{i=1}^{m}g(A_{i},k_{i}),$ (3)

$\displaystyle\{A_{1},\dots,A_{m}\}=g(Q,m),$ (4)

$\displaystyle\text{s.t.}\left\{\begin{array}[]{l}h>m;\\ \sum_{i=1}^{m}k_{i}=h;\\ k_{i}\leqslant\alpha,\quad i=1,2,\dots,m.\\ \end{array}\right.$ (5)

The parameter $\alpha$ is used to determine the upper bound of $k_{i}$ that controls the landmark selection rate. The light- $k$ -means algorithm [21] is used to solve the larger dividing processes $g(\cdot)$ (with more than $10p$ samples): it randomly selects a part of the samples, determines the subsets on them by $k$ -means, and subsequently assigns the remaining data points to the nearest subsets. For the smaller dividing processes (with at most $10p$ samples), $k$ -means is directly used to determine the subsets.
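The light- $k$ -means idea described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation of [21]: the function names, the uniform sampling of representatives, the fixed iteration count, and the toy data are all choices made for this example.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(x, centers):
    return min(range(len(centers)), key=lambda c: sq_dist(x, centers[c]))


def kmeans(points, k, iters, rng):
    """Plain Lloyd iterations on a (small) list of points."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in points:
            groups[nearest(x, centers)].append(x)
        for c, group in enumerate(groups):
            if group:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(group) for col in zip(*group))
    return centers


def light_kmeans(points, k, n_rep, iters=10, seed=0):
    """Run k-means on n_rep random representatives only, then label every
    point by its nearest center in a single pass."""
    rng = random.Random(seed)
    reps = rng.sample(points, n_rep)
    centers = kmeans(reps, k, iters, rng)
    return [nearest(x, centers) for x in points], centers


# Two well-separated toy blobs (made-up data); 10 of the 40 points
# act as representatives.
points = ([(0.1 * i, 0.0) for i in range(20)]
          + [(100.0 + 0.1 * i, 0.0) for i in range(20)])
labels, centers = light_kmeans(points, k=2, n_rep=10)
```

The iterative part only touches the representatives, so its cost does not grow with the number of data points; the full dataset is visited once, in the final assignment pass.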

Similarities between each $x_{i}\in X$ and its $K$ -nearest landmarks are used to construct a sparse similarity matrix. The fact that the landmarks are the subset centers is used to estimate the $K$ -nearest landmarks. Let $S_{x_{i}}$ be the subset that contains $x_{i}$ , and denote by $r^{1}_{x_{i}}$ the landmark that is the center of $S_{x_{i}}$ . Because the landmarks are the subset centers, $r^{1}_{x_{i}}$ is treated as the nearest landmark of $x_{i}$ . In DnC-SC, a set of $K^{\prime}$ -nearest landmarks ( $K^{\prime}>K$ ) of $r^{1}_{x_{i}}$ is first obtained, denoted as ${N}_{K^{\prime}}(r_{x_{i}}^{1})$ ; then the $K$ -nearest landmarks of $x_{i}$ are searched from ${N}_{K^{\prime}}(r_{x_{i}}^{1})$ , denoted as ${N}_{K}(x_{i})$ . Finally, the sparse similarity matrix $B$ is constructed as follows [40, 41]:

$\displaystyle b_{ij}=\left\{\begin{array}[]{ll}\exp\left(\frac{-\|x_{i}-r_{j}\|^{2}}{2\sigma^{2}}\right),&\text{if }r_{j}\in N_{K}(x_{i}),\\ 0,&\text{otherwise},\\ \end{array}\right.$ (6)

where the Gaussian kernel is used to measure the similarity and $\sigma$ is the bandwidth parameter.
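The two-stage search and Eq. (6) can be sketched as follows. The sketch assumes that the subset assignment of every point is already available from landmark selection; the function name, the dictionary-based sparse rows, and the toy one-dimensional data are assumptions for illustration only.

```python
import math


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sparse_similarity(points, landmarks, assign, K, K_prime, sigma):
    """Build the rows of B in Eq. (6) as dicts {landmark index: similarity}.

    assign[i] is the index of the landmark whose subset contains points[i];
    that landmark is treated as the nearest landmark of points[i].
    """
    p = len(landmarks)
    # K'-nearest landmarks of every landmark (each list includes the
    # landmark itself, at distance zero).
    neighbours = [
        sorted(range(p), key=lambda j: sq_dist(landmarks[i], landmarks[j]))[:K_prime]
        for i in range(p)
    ]
    rows = []
    for x, a in zip(points, assign):
        # Search the K nearest landmarks of x only among the K' candidates.
        cand = sorted(neighbours[a], key=lambda j: sq_dist(x, landmarks[j]))[:K]
        rows.append({j: math.exp(-sq_dist(x, landmarks[j]) / (2 * sigma ** 2))
                     for j in cand})
    return rows


# Made-up 1-D example with four landmarks and two points.
landmarks = [(0.0,), (1.0,), (2.0,), (10.0,)]
points = [(0.1,), (1.9,)]
assign = [0, 2]
B = sparse_similarity(points, landmarks, assign, K=2, K_prime=3, sigma=1.0)
```

Each point is compared with only $K^{\prime}$ candidate landmarks rather than all $p$ , and each row of $B$ stores only $K$ non-zero entries.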

The similarity matrix $B$ reflects the relationship between the data $X$ and the landmarks $R$ , and its entries can be treated as the edge weights of the bipartite graph $G(X,R,B)$ . Therefore, the spectral clustering problem is converted into a bipartite graph partition problem. According to [21], the low-dimensional embedding is computed by solving an eigenproblem on the $R$ side and mapping the solution to the $X$ side as follows:

$\displaystyle L_{{R}}V=\lambda D_{{R}}V,$ (7)

$\displaystyle U=D_{X}^{-1}BV,$ (8)

where $L_{{R}}=D_{{R}}-B^{T}D_{X}^{-1}B$ , and $D_{X}\in\mathbb{R}^{n\times n}$ and $D_{{R}}\in\mathbb{R}^{p\times p}$ are the diagonal matrices whose entries are $d_{X}(i,i)=\sum_{j=1}^{p}B_{ij}$ and $d_{{R}}(j,j)=\sum_{i=1}^{n}B_{ij}$ , respectively. Equation (7) is a small eigenproblem of size $p\times p$ . $U$ comprises the $c$ bottom eigenvectors on the $X$ side. Finally, $k$ -means is conducted on $U$ to determine $c$ clusters as the final clustering result.
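For readers who want to see why a $p\times p$ problem suffices, the reduction can be sketched as follows. This is the standard block-elimination argument for bipartite graphs, written under the assumptions that the full graph has affinity matrix $W$ with off-diagonal blocks $B$ and $B^{T}$ and degree matrix $D=\text{diag}(D_{X},D_{R})$ , and that the full-graph eigenvalue, here called $\mu$ to avoid a clash with $\lambda$ in Eq. (7), satisfies $\mu\neq 1$ . Writing the full eigenvector as $f=[u;v]$ , the generalized eigenproblem $(D-W)f=\mu Df$ splits into two block rows:

$\displaystyle D_{X}u-Bv=\mu D_{X}u,\qquad D_{R}v-B^{T}u=\mu D_{R}v.$

The first row gives $u$ in terms of $v$ :

$\displaystyle u=\frac{1}{1-\mu}D_{X}^{-1}Bv.$

Substituting into the second row, which reads $B^{T}u=(1-\mu)D_{R}v$ , yields

$\displaystyle B^{T}D_{X}^{-1}Bv=(1-\mu)^{2}D_{R}v\quad\Longrightarrow\quad(D_{R}-B^{T}D_{X}^{-1}B)v=\mu(2-\mu)D_{R}v.$

This has the form of Eq. (7) with $\lambda=\mu(2-\mu)$ , and the expression for $u$ matches Eq. (8) up to the scalar factor $\frac{1}{1-\mu}$ for each eigenvector. Because $\mu(2-\mu)$ is increasing for $\mu<1$ , the smallest eigenvalues of the small problem correspond to the smallest eigenvalues of the full problem in that range.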

4. Proposed framework

To improve the scalability of ensemble clustering, we propose the LSEC method, which follows the large-scale spectral clustering-based formulation and aims to overcome the efficiency bottleneck of previous algorithms. The LSEC method consists of two steps: (1) Large-scale spectral clustering-based ensemble generation: we design a new framework that applies a state-of-the-art large-scale spectral clustering algorithm to produce base clusterings and further accelerates the process by reusing the $K$ -nearest landmarks and using light- $k$ -means to obtain base clustering results. (2) Bipartite graph partitioning-based consensus function: we construct a bipartite graph between data points and clusters from the base clusterings and obtain the consensus clustering result by bipartite graph partitioning. Figure 1 shows an overview of the proposed method.

4.1 Ensemble generation based on large-scale spectral clustering

The ensemble generation step aims to produce $m$ diverse base clusterings with high efficiency. To improve the scalability of ensemble generation, we adopt the divide-and-conquer-based large-scale spectral clustering [21] as the base clustering algorithm, which can handle nonlinear datasets more efficiently than a traditional clustering algorithm such as $k$ -means. For diversity and efficiency of base clusterings, we construct similarity matrices through multiple $K$ -nearest neighbor graph sparsification by reusing the $K$ -nearest landmarks. Moreover, bipartite graph partitioning is accelerated by applying light- $k$ -means to obtain the clustering results.

Figure 1.

Overview of proposed method. Given a dataset, $\frac{m}{q}$ sets of landmarks are first generated; subsequently, a set of $K$ -nearest neighbors is found for each $R^{(i)}$ , and $m$ sparse similarity matrices are constructed; finally, the base clusterings are obtained through a bipartite graph partitioning process. The proposed method accelerates similarity matrix construction by reusing $K$ -nearest neighbors and bipartite graph partitioning by applying light- $k$ -means.

4.1.1 Landmark selection

First, the $\frac{m}{q}$ sets of landmarks are independently generated by solving the optimization Eq. (2). We recursively apply Eqs (3) and (4) to determine an approximate local solution and consider the subset centers as landmarks. Let ${R}^{(i)}=\{r_{1}^{(i)},r_{2}^{(i)},\cdots,r_{p}^{(i)}\}$ be a set of landmarks. By repeating the divide-and-conquer-based landmark selection $\frac{m}{q}$ times, we have $\frac{m}{q}$ sets of landmarks as follows:

$\displaystyle\mathcal{R}=\{{R}^{(1)},{R}^{(2)},\cdots,{R}^{(\frac{m}{q})}\},$ (9)

where ${R}^{(i)}$ indicates the $i$ -th set of landmarks, and $\mathcal{R}$ is a set containing all ${R}^{(i)}$ , $i=1,2,\ldots,\frac{m}{q}$ . Generating each ${R}^{(i)}$ takes $O(N\alpha d)$ time, so the overall construction of $\mathcal{R}$ has $O\left(\frac{m}{q}N\alpha d\right)$ time complexity. As in Section 3.2, the parameter $\alpha$ is the upper bound of $k_{i}$ that controls the landmark selection rate.

4.1.2 Searching $K$ -nearest landmarks

To construct a sparse similarity matrix with $K$ -nearest neighbor sparsification, we need to search the $K$ -nearest landmarks for each data point $x_{i}$ . The literature [3, 16, 21] shows that different values of $K$ directly affect the base clustering results and ensemble results. Because diverse base clusterings reflect the data structure from multiple perspectives, an ensemble with more diverse base clusterings tends to yield an improved consensus clustering. Therefore, we construct multiple similarity matrices for each $R^{(i)}$ with different sparsification of $K$ -nearest landmarks to produce diverse base clusterings.

Let $K_{1}<K_{2}<\dots<K_{q}$ be a set of numbers. We search $K_{1},K_{2},\dots,K_{q}$ -nearest landmarks for each $x_{i}$ , denoted as ${N}_{K_{1}}(x_{i}),{N}_{K_{2}}(x_{i}),\dots,{N}_{K_{q}}(x_{i})$ . Based on the definition of the $K$ -nearest neighbors, we have

$\displaystyle{N}_{K_{1}}(x_{i})\subset{N}_{K_{2}}(x_{i})\subset\dots\subset{N}_{K_{q}}(x_{i}).$ (10)

That is, $N_{K_{j_{1}}}(x_{i})$ is a subset of $N_{K_{j_{2}}}(x_{i})$ if $K_{j_{1}}<K_{j_{2}}$ . Therefore, we only need to compute the $K_{q}$ -nearest landmarks and subsequently obtain the $K_{1},K_{2},\dots,K_{q-1}$ -nearest landmarks from them without recomputation. This process is essentially the reuse of the nearest landmarks. Reusing the nearest landmarks accelerates the process of spectral clustering-based ensemble generation. It directly reduces the computational time of two high-cost steps, landmark selection and the $K$ -nearest landmark search, by a factor of nearly $q$ . In addition to efficiency, it also enhances the diversity of the base clusterings by exploring multiple nearest neighbor graphs, which helps improve the effectiveness of the proposed method.
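The nesting in Eq. (10) means that one sorted neighbor list per data point is enough. A minimal sketch, with a hypothetical function name and made-up distances:

```python
def nested_neighbours(dists, Ks):
    """dists: distances from one data point to the p landmarks.
    Ks: increasing list K_1 < ... < K_q.
    The landmarks are ranked once up to K_q; every smaller
    neighbourhood is read off as a prefix of that ranking."""
    K_q = Ks[-1]
    order = sorted(range(len(dists)), key=dists.__getitem__)[:K_q]
    return {K: order[:K] for K in Ks}


# Made-up distances from one point to five landmarks.
dists = [0.9, 0.1, 0.5, 0.3, 0.7]
neigh = nested_neighbours(dists, Ks=[1, 2, 4])
```

Only the search for the largest neighborhood costs anything; the smaller ones are slices of the same list.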

4.1.3 Similarity matrix construction

The sparse similarity matrix between $X$ and each ${R}^{(i)}$ is constructed according to Eq. (6). Instead of constructing a single similarity matrix for each set of landmarks, we construct multiple similarity matrices for each set of landmarks with different sparsification of the $K$ -nearest landmarks. For each ${R}^{(i)}$ , we construct $q$ sparse similarity matrices with $K_{1},K_{2},\dots,K_{q}$ -nearest landmarks according to Eq. (6). In total, we obtain $m$ similarity matrices as follows:

$\displaystyle\mathcal{B}=\{{B}^{(1)}_{1},\dots,{B}^{(1)}_{q},{B}^{(2)}_{1},\dots,{B}^{(2)}_{q},\dots,{B}^{(\frac{m}{q})}_{1},\dots,{B}^{(\frac{m}{q})}_{q}\},$ (11)

where ${B}^{(i)}_{j}$ indicates a similarity matrix between $X$ and $R^{(i)}$ with sparsification of $K_{j}$ -nearest landmarks, $\mathcal{B}$ is a set containing all ${B}^{(i)}_{j}$ , and the total size of $\mathcal{B}$ is $m$ . The computational cost to obtain a sparse similarity matrix ${B}^{(i)}_{j}$ is $O(NK_{j}d)$ [21]. By reusing the nearest landmarks, we can generate $\mathcal{B}$ with only $O\left(\frac{m}{q}NK_{q}d\right)$ computational cost. Hereafter in this paper, we use $K$ instead of $K_{q}$ to represent the computational complexity, for convenience.

4.1.4 Bipartite graph partitioning

After obtaining $m$ similarity matrices, we treat each ${B}^{(i)}_{j}$ as the edge weights of a bipartite graph $G(X,R^{(i)},{B}^{(i)}_{j})$ and solve a bipartite graph partition problem by Eqs (7) and (8) to construct a $c^{(i)}_{j}$ -dimensional embedding denoted as ${U}^{(i)}_{j}$ . Note that $c^{(i)}_{j}$ is also the number of clusters. It costs $O(p^{3})$ time to solve each eigenproblem Eq. (7) and $O(NK(K+c^{(i)}_{j}))$ time to compute each $c^{(i)}_{j}$ -dimensional embedding. The cluster number $c^{(i)}_{j}$ is randomly selected as follows:

$\displaystyle c^{(i)}_{j}=\lfloor\tau(c_{\max}-c_{\min})\rfloor+c_{\min},$ (12)

where $\tau\in[0,1]$ is a random variable and $c_{\max}$ and $c_{\min}$ are the upper and lower bounds of the cluster number, respectively.
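Eq. (12) translates directly into code. In the sketch below, drawing $\tau$ uniformly is an assumption (the text only states that $\tau$ is a random variable), and the bounds 20 and 60 are hypothetical values chosen for illustration.

```python
import math
import random


def random_cluster_number(c_min, c_max, rng):
    """Eq. (12): c = floor(tau * (c_max - c_min)) + c_min."""
    tau = rng.random()  # uniform in [0, 1); uniformity is an assumption
    return math.floor(tau * (c_max - c_min)) + c_min


rng = random.Random(0)
sizes = [random_cluster_number(20, 60, rng) for _ in range(8)]
```

Every draw lies between $c_{\min}$ and $c_{\max}$ , so the base clusterings differ in their number of clusters as well as in their similarity graphs.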

The obtained $c^{(i)}_{j}$ eigenvectors are stacked to form a new matrix, upon which the light- $k$ -means [21] is applied to obtain the base clustering result. In light- $k$ -means, a set of $p^{\prime}$ samples is first randomly selected as representatives; then $c$ cluster centers are generated by applying $k$ -means clustering on the $p^{\prime}$ representatives; finally, labels are assigned to the remaining samples according to their nearest cluster centers. The computational complexity of light- $k$ -means is $O(p^{\prime}cdt+Ncd)$ , where $O(Ncd)$ is the dominant term, $d$ is the dimension, and $t$ is the number of $k$ -means iterations. Light- $k$ -means confines the cost of the $t$ iterations to the representatives and can be more efficient on platforms optimized for matrix operations. The use of light- $k$ -means significantly accelerates the process of obtaining base clusterings for large-scale datasets. Finally, $m$ base clusterings are generated, which are represented as

$\displaystyle\Pi=\{{\pi}^{(1)}_{1},\dots,{\pi}^{(1)}_{q},{\pi}^{(2)}_{1},\dots,{\pi}^{(2)}_{q},\dots,{\pi}^{(\frac{m}{q})}_{1},\dots,{\pi}^{(\frac{m}{q})}_{q}\},$ (13)

where $\pi^{(i)}_{j}$ denotes a base clustering with $c^{(i)}_{j}$ clusters. For convenience, we use $c$ instead of $c^{(i)}_{j}$ to represent the computational complexity hereafter in this paper. Because the embedding is $c$ -dimensional (that is, $d=c$ ), the computational complexity of using light- $k$ -means is $O(Nc^{2}+p^{\prime}c^{2}t)$ , where $O(Nc^{2})$ is the dominant term. Overall, the computational complexity of the bipartite graph partition is $O(m(N(K^{2}+c^{2}+Kc)+p^{3}))$ . The ensemble generation process of the proposed method is summarized in Algorithm 1.

Algorithm 1 Proposed ensemble generation process
Input: Dataset $X$ , number of base clusterings $m$ , a set of numbers of $K$ -nearest landmarks $K_{1},K_{2},\dots,K_{q}$
Output: Base clusterings $\pi_{1},\pi_{2},\dots,\pi_{m}$
Solve (1) by recursively applying (3) and (4) to obtain $\frac{m}{q}$ sets of landmarks $\mathcal{R}$ ;
for $i\leftarrow 1$ to $\frac{m}{q}$ do
    Search the $K_{q}$ -nearest landmarks of each data point according to [21];
    for $j\leftarrow 1$ to $q$ do
        Obtain the $K_{j}$ -nearest landmarks of each data point according to (10);
        Construct the similarity matrix between $X$ and $R^{(i)}$ with sparsification of $K_{j}$ -nearest landmarks by (6);
    end for
end for
Collect all similarity matrices $\mathcal{B}$ by (11);
for $i\leftarrow 1$ to $m$ do
    Determine a low-dimensional embedding $U$ by (7) and (8);
    Apply light- $k$ -means on the embedding $U$ to obtain base clustering $\pi_{i}$ .
end for

4.2 Consensus function based on bipartite graph partitioning

After ensemble generation, the base clusterings are combined according to a consensus function to obtain the consensus partition. We treat this problem as a bipartite graph partition problem and provide a solution similar to that in Section 3.2.

Table 1
The cluster indicator matrix

	$\omega_{1}$	$\omega_{2}$	$\cdots$	$\omega_{C}$	$\sum$
$x_{1}$	$\tilde{b}_{11}$	$\tilde{b}_{12}$	$\cdots$	$\tilde{b}_{1C}$	$m$
$x_{2}$	$\tilde{b}_{21}$	$\tilde{b}_{22}$	$\cdots$	$\tilde{b}_{2C}$	$m$
$\vdots$	$\vdots$	$\vdots$	$\ddots$	$\vdots$	$\vdots$
$x_{n}$	$\tilde{b}_{n1}$	$\tilde{b}_{n2}$	$\cdots$	$\tilde{b}_{nC}$	$m$
$\sum$	$\|\omega_{1}\|$	$\|\omega_{2}\|$	$\cdots$	$\|\omega_{C}\|$	$Nm$

To define the bipartite graph, we first collect all clusters from the base clusterings in Eq. (13), and we denote these clusters as in Eq. (14) for clarity.

$\displaystyle\Psi=\{{\Omega}^{(1)}_{1},\dots,{\Omega}^{(1)}_{q},{\Omega}^{(2)}_{1},\dots,{\Omega}^{(2)}_{q},\dots,{\Omega}^{(\frac{m}{q})}_{1},\dots,{\Omega}^{(\frac{m}{q})}_{q}\},$ (14)

where $\Omega^{(i)}_{j}$ indicates the set of clusters in $\pi^{(i)}_{j}$ . There are $c^{(i)}_{j}$ clusters in each $\Omega^{(i)}_{j}$ , which are denoted as:

$\displaystyle\Omega^{(i)}_{j}=\{\omega^{\prime}_{1},\omega^{\prime}_{2},\dots,\omega^{\prime}_{c^{(i)}_{j}}\},$ (15)

where $\omega^{\prime}_{t}$ is the $t$ -th cluster in $\Omega^{(i)}_{j}$ . Thus, the total number of clusters in $\Psi$ can be counted as $C=\sum_{i=1}^{\frac{m}{q}}\sum_{j=1}^{q}c_{j}^{(i)}$ . For convenience, we simplify the notation of Eq. (14) as follows:

$\displaystyle\Psi=\{\omega_{1},\omega_{2},\dots,\omega_{C}\},$ (16)

Having defined the clusters $\omega_{1},\dots,\omega_{C}$ , which we collectively denote by $\Omega$ in what follows, we design a bipartite graph between the data points and the clusters as follows:

$\displaystyle\tilde{G}=\{\mathcal{X},\Omega,\tilde{B}\},$ (17)

where $\tilde{B}$ is the cross-affinity matrix between $\mathcal{X}$ and $\Omega$ . $\tilde{B}$ can also be interpreted as the cluster indicator matrix of ${X}$ . Table 1 shows the cluster indicator matrix, where $\tilde{b}_{ij}=1$ indicates that $x_{i}\in\omega_{j}$ . $\tilde{G}$ is an unweighted bipartite graph in which an edge between nodes $x_{i}$ and $\omega_{j}$ indicates the cluster relationship $x_{i}\in\omega_{j}$ . The entries of $\tilde{B}$ are given by

$\displaystyle\tilde{b}_{ij}=\left\{\begin{array}[]{ll}1,&\text{if }x_{i}\in\omega_{j},\\ 0,&\text{otherwise.}\\ \end{array}\right.$ (18)

As Table 1 shows, the sum of each row of $\tilde{B}$ equals the number of base clusterings $m$ , because each $x_{i}$ belongs to exactly one cluster in each base clustering; that is, for any base clustering $\pi$ and any $i^{\prime}\neq j^{\prime}$ , if $\omega_{i^{\prime}}\in\pi$ and $\omega_{j^{\prime}}\in\pi$ , then $\omega_{i^{\prime}}\cap\omega_{j^{\prime}}=\emptyset$ . Although the number of samples $\|\omega_{j}\|$ in each cluster $\omega_{j}$ varies, the total number of non-zero entries is clearly $Nm$ (see Table 1).
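The structure of $\tilde{B}$ can be checked on a toy ensemble. The sketch below stores only the non-zero column indices of each row; the function name, the column ordering (base clustering by base clustering), and the toy labels are assumptions for illustration.

```python
def cluster_indicator(base_clusterings):
    """Return (rows, C): rows[i] lists the column indices j with
    b~_ij = 1 in Eq. (18), and C is the total number of clusters.
    Columns are numbered base clustering by base clustering."""
    n = len(base_clusterings[0])
    rows = [[] for _ in range(n)]
    offset = 0
    for labels in base_clusterings:
        # Map the labels of this base clustering to consecutive columns.
        ids = {lab: k for k, lab in enumerate(sorted(set(labels)))}
        for i, lab in enumerate(labels):
            rows[i].append(offset + ids[lab])
        offset += len(ids)
    return rows, offset


# Toy ensemble: m = 3 base clusterings of n = 4 points (made-up labels).
toy = [[0, 0, 1, 1], [0, 1, 2, 2], [1, 1, 1, 0]]
rows, C = cluster_indicator(toy)
```

Each row receives exactly one entry per base clustering, so every row sums to $m$ and the matrix holds $Nm$ non-zero entries regardless of how many clusters each base clustering has.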

For this modified bipartite graph $\tilde{G}$ , we consider a partition strategy similar to that introduced in Section 3.2. According to [21], the full similarity matrix of $\tilde{G}$ is as follows:

$\displaystyle\tilde{W}=\left[\begin{array}[]{ll}{0}&\tilde{B}\\ \tilde{B}^{T}&{0}\\ \end{array}\right].$ (19)

Table 2

Comparison of the computational complexity between LSEC and U-SPEC

Method	Ensemble generation			Consensus function
	Landmark selection	Similarity construction	Bipartite graph partitioning
U-SPEC	$O(mp^{2}dt)$	$O(mNp^{\frac{1}{2}}d)$	$O(m(N(K^{2}+c^{2}t+Kc)+p^{3}))$	$O(N(m^{2}+mk+c^{2}t)+C^{3})$
LSEC	$O\left(\frac{m}{q}N\alpha d\right)$	$O\left(\frac{m}{q}NKd\right)$	$O(m(N(K^{2}+c^{2}+Kc)+p^{3}))$	$O(N(m^{2}+mk+c^{2}t)+C^{3})$

Then we have a generalized eigenproblem of $\tilde{G}$

$\displaystyle\tilde{L}\tilde{f}=\lambda\tilde{D}\tilde{f},$ (20)

where $\tilde{L}=\tilde{D}-\tilde{W}$ and $\tilde{D}$ is a diagonal matrix with $\tilde{d}_{ii}=\sum_{j=1}^{n+C}\tilde{w}_{ij}$ . Following Eq. (7) and Eq. (8), we design a smaller eigenproblem to compute the eigenvectors $\tilde{U}$ on the $\mathcal{X}$ side as follows:

$\displaystyle{L}_{\Omega}\tilde{V}=\tilde{\lambda}\tilde{D}_{\Omega}\tilde{V},$ (21)

where $L_{\Omega}=\tilde{D}_{\Omega}-\tilde{B}^{\top}{\tilde{D}_{\mathcal{X}}}^{-1}\tilde{B}$ is the graph Laplacian, and $\tilde{D}_{\mathcal{X}}\in R^{n\times n}$ and $\tilde{D}_{\Omega}\in R^{C\times C}$ are the diagonal matrices whose entries are $\tilde{d}_{\mathcal{X}}(i,i)=\sum_{j=1}^{C}\tilde{B}_{ij}$ and $\tilde{d}_{\Omega}(j,j)=\sum_{i=1}^{n}\tilde{B}_{ij}$ , respectively. The size of $L_{\Omega}$ is $C\times C$ . Solving the eigenproblem Eq. (21) costs $O(C^{3})$ computational time. Substituting $\tilde{V}$ into the counterpart of Eq. (8), we can compute $\tilde{U}$ as follows:

$\displaystyle\tilde{U}=\tilde{D}_{\mathcal{X}}^{-1}\tilde{B}\tilde{V}.$ (22)

The $\tilde{c}$ bottom eigenvectors $\tilde{U}$ can be computed in $O(Nm(m+c))$ time. Finally, the consensus clustering result of LSEC is obtained by the $k$-means method in $O(Nc^{2}t)$ time. The proposed LSEC method is summarized in Algorithm 2.

Algorithm 2 Large-scale ensemble spectral clustering

Input: Dataset $X$ , number of base clusterings $m$ , a set of numbers of $K$-nearest landmarks $K_{1},K_{2},\dots,K_{q}$ , number of clusters $\tilde{c}$

Output: Consensus clustering $\tilde{\pi}$

(1) Produce $m$ base clusterings by large-scale ensemble generation;

(2) Construct the cluster indicator matrix $\tilde{B}$ from the $m$ base clusterings according to Eq. (18);

(3) Solve the eigenproblem Eq. (21) to compute $\tilde{V}$ ;

(4) Determine a low-dimensional embedding $\tilde{U}$ of $X$ by Eq. (22);

(5) Apply $k$-means to determine $\tilde{c}$ clusters on $\tilde{U}$ to realize consensus clustering.

Steps (2)–(5) constitute the large-scale consensus function.
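The consensus steps of Algorithm 2 can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `consensus_embedding` and `gram_schmidt` are hypothetical, matrices are stored as dense Python lists, every label in $0,\dots,c_j-1$ is assumed to be used (no empty clusters), and a plain orthogonal iteration stands in for whatever eigen-solver is used in practice. The sketch relies on the fact that every row of $\tilde{B}$ sums to $m$, so $\tilde{D}_{\mathcal{X}}=mI$, and on the standard change of variables $y=\tilde{D}_{\Omega}^{1/2}v$ , which turns Eq. (21) into an ordinary symmetric eigenproblem whose largest eigenvalues correspond to the smallest $\tilde{\lambda}$ .

```python
import math


def gram_schmidt(vectors):
    """Orthonormalize a list of vectors (classical Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = v[:]
        for b in basis:
            dot = sum(x * y for x, y in zip(w, b))
            w = [x - dot * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        basis.append([x / norm for x in w])
    return basis


def consensus_embedding(labelings, c, iters=50):
    """Embed N samples using the small C x C eigenproblem and the lifting step."""
    m, n = len(labelings), len(labelings[0])
    offsets, C = [], 0
    for labels in labelings:
        offsets.append(C)
        C += max(labels) + 1
    # Nonzero columns of each row of the indicator matrix B.
    rows = [[offsets[j] + labelings[j][i] for j in range(m)] for i in range(n)]

    # A = B^T D_X^{-1} B with D_X = m I; deg is the diagonal of D_Omega.
    deg = [0.0] * C
    A = [[0.0] * C for _ in range(C)]
    for cols in rows:                      # N samples
        for a in cols:
            deg[a] += 1.0
            for b in cols:                 # m^2 updates per sample: O(N m^2)
                A[a][b] += 1.0 / m
    # S = D_Omega^{-1/2} A D_Omega^{-1/2}; top eigenvectors of S give the
    # bottom eigenvectors of the generalized problem L v = lambda D_Omega v.
    S = [[A[a][b] / math.sqrt(deg[a] * deg[b]) for b in range(C)] for a in range(C)]

    # Orthogonal iteration from a fixed, generic start (an illustrative choice).
    Y = [[1.0 / (1 + a + k) for a in range(C)] for k in range(c)]
    for _ in range(iters):
        Y = [[sum(S[a][b] * y[b] for b in range(C)) for a in range(C)] for y in Y]
        Y = gram_schmidt(Y)

    V = [[Y[k][a] / math.sqrt(deg[a]) for k in range(c)] for a in range(C)]
    # Lifting step U = D_X^{-1} B V: average the V rows of a sample's m clusters.
    return [[sum(V[a][k] for a in cols) / m for k in range(c)] for cols in rows]


# Two base clusterings that agree on the partition {0, 1}, {2, 3}.
U = consensus_embedding([[0, 0, 1, 1], [1, 1, 0, 0]], c=2)
```

In this toy case the rows of `U` coincide for samples that are always clustered together and differ otherwise, so running $k$-means on the rows of `U` would recover the shared partition.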

5. Discussion

5.1 Computational complexity analysis

In this section, we summarize the computational costs of the proposed method. The ensemble generation of the LSEC algorithm has a computational cost of $O\left(\frac{m}{q}N\alpha d+\frac{m}{q}\textit{NKd}+m(N(K^{2}+c^{2}+Kc)+p^{3})\right)=O\left(\frac{m}{q}N\alpha d\right)$ . The consensus function of LSEC costs $O(N(m^{2}+mk+c^{2}t)+C^{3})=O(Nm^{2})$ time. Considering that $m,q,k,K<\alpha\ll p\ll N$ , the dominant term of the overall time complexity of LSEC is $O\left(N\frac{m}{q}\alpha d+Nm^{2}\right)$ . Meanwhile, the memory costs of the ensemble generation and the consensus function of the LSEC algorithm are $O(N\alpha)$ and $O(Nm)$ , respectively. Table 2 provides a comparison of the computational complexity of the LSEC algorithm with the state-of-the-art large-scale ensemble clustering method, U-SPEC.
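To get a feeling for the relative size of the two dominant terms, they can be evaluated for the settings reported in Section 6.2 ($m=20$, $q=4$, $\alpha=50$) on MNIST ($N=70{,}000$, $d=784$, Table 3). The snippet below is a rough illustration only: big-O terms hide constant factors, so the numbers indicate orders of magnitude, not running times, and the function name `dominant_terms` is hypothetical.

```python
def dominant_terms(n, m, q, alpha, d):
    """Evaluate the two dominant terms of the LSEC cost estimate.

    Returns (generation, consensus) = (N * (m/q) * alpha * d, N * m^2).
    Assumes q divides m, as in the reported experimental setting.
    """
    generation = n * (m // q) * alpha * d
    consensus = n * m * m
    return generation, consensus


# MNIST with the parameter values from the experimental settings.
gen, con = dominant_terms(n=70_000, m=20, q=4, alpha=50, d=784)
print(gen, con)  # 13720000000 28000000
```

For this high-dimensional dataset the ensemble generation term is several hundred times larger than the consensus term; for the two-dimensional synthetic datasets the two terms are much closer.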

5.2 Relations with other methods

As a large-scale spectral ensemble clustering method, the proposed method is closely related to the U-SENC method [16]. We compare the proposed method with U-SENC to discuss its improvements.

First, we compare their ensemble generation methods in terms of diversity. In U-SENC, base clusterings are directly generated by the large-scale spectral clustering method U-SPEC with different numbers of clusters. As a large-scale spectral clustering method, U-SPEC also uses the landmark selection technique. Thus, the diversity of the base clusterings of U-SENC comes from two sources: the different landmarks and the different numbers of clusters used in ensemble generation. However, the $K$-nearest neighbor graph is not used to further improve the diversity in U-SENC. In our proposed method, we vary the landmarks and the number of clusters and additionally use different numbers of $K$-nearest neighbors to construct the sparse similarity matrix, which improves the overall diversity of the base clusterings.

Second, we compare their ensemble generation methods in terms of efficiency. Because the different $K$-nearest neighbor graphs can share the same $K$-nearest neighbors between data points and landmarks, the computational complexity of similarity matrix construction is much lower than that of the U-SENC method. For large-scale datasets, another computational bottleneck is the final $k$-means step of large-scale spectral clustering. In our proposed method, we use light-$k$-means to accelerate the computation of the base clustering results, which significantly improves the efficiency on large-scale datasets.

Overall, LSEC redesigns the ensemble generation framework based on a more efficient clustering method (that is, DnC-SC) and accelerates the process by reusing the $K$-nearest neighbors among multiple base clusterings. Furthermore, a light-$k$-means method is used to speedily obtain the base clustering results. The computational complexity of the proposed method is lower than that of most existing large-scale ensemble clustering methods.

6. Experiments

In this section, we conduct experiments on five real and five synthetic datasets to evaluate the performance of the proposed LSEC method. The comparison experiments against six state-of-the-art ensemble clustering methods confirm the improved clustering quality and efficiency of the LSEC method. In addition, the parameters are analyzed. For each experiment, the tested method was run 20 times, and the average performance is reported. All experiments were conducted in MATLAB R2020a on a Mac Pro with a 3 GHz 8-core Intel Xeon E5 and 16 GB of RAM.

6.1 Datasets and evaluation measures

Table 3
Properties of the real and synthetic datasets

Dataset		#Object	#Dimension	#Class
Real	USPS	9,298	256	10
	PenDigits	10,992	16	10
	Letters	20,000	16	26
	MNIST	70,000	784	10
	Covertype	581,012	54	7
Synthetic	TB-1M	1,000,000	2	3
	SF-2M	2,000,000	2	4
	CC-5M	5,000,000	2	3
	CG-10M	10,000,000	2	11
	FL-20M	20,000,000	2	13

Figure 2.

Illustration of the five synthetic datasets. Note that only 0.1% of the samples of each dataset are plotted.

Our experiments were conducted on ten large-scale datasets, varying from 9,000 to 20 million data points. Specifically, the five real datasets are PenDigits [1] (https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits), USPS [5] (http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html), Letters [10] (https://archive.ics.uci.edu/ml/datasets/Letter+Recognition), MNIST [4], and Covertype [2] (https://archive.ics.uci.edu/ml/datasets/covertype). The five synthetic datasets are Two Bananas (TB-1M), Smiling Face-2M (SF-2M), Concentric Circles-5M (CC-5M), Circles and Gaussians-10M (CG-10M), and Flower-20M (FL-20M) [16] (https://www.researchgate.net/publication/330760669). Figure 2 shows the synthetic datasets. The properties of the datasets are presented in Table 3.

To evaluate the clustering results, we adopt two widely used evaluation metrics, that is, Normalized Mutual Information (NMI) [27] and Accuracy (ACC) [38]. Let $X=[x_{1},x_{2},\ldots,x_{n}]$ be the data matrix. For each data point $x_{i}$ , denote $\pi_{t}(x_{i})$ and $\pi_{c}(x_{i})$ as the cluster label of the ground truth and the obtained cluster label from the clustering methods, respectively. The ACC is defined as:

$\displaystyle\text{ACC}=\frac{\sum_{i=1}^{n}\delta(\pi_{t}(x_{i}),\text{map}(\pi_{c}(x_{i})))}{n},$ (23)

where $n$ is the number of data points and $\delta(\cdot,\cdot)$ is an indicator function that returns 1 if its two arguments are equal and 0 otherwise. $\text{map}(\pi_{c}(x_{i}))$ is the best mapping function, which maps each predicted cluster label to a true cluster label by permutation so that the number of matches is maximized [37].
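The ACC of Eq. (23) can be illustrated with a short sketch. It is written under simplifying assumptions: the best mapping is found by exhaustive search over permutations, which is only feasible for a handful of clusters (for datasets such as Letters with 26 classes, an assignment solver such as the Hungarian algorithm would be needed instead), and the function name `clustering_accuracy` is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_labels, pred_labels):
    """ACC of Eq. (23) with the best one-to-one mapping found by brute force."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    # Pad with dummy targets (None) when there are more predicted clusters
    # than true classes, so that every predicted cluster gets a target.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))   # predicted label -> true label
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best = max(best, hits)
    return best / len(true_labels)


# Predicted labels are a relabeling of the truth with one mistake (last sample).
acc = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 0])
```

Here the best mapping sends predicted labels 1, 0, 2 to true labels 0, 1, 2, which matches five of the six samples.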

The NMI is the normalization of mutual information by the joint entropy:

$\displaystyle\textit{NMI}=\frac{\sum_{\pi_{t}(x_{i})\in T,\pi_{c}(x_{i})\in C}p(\pi_{t}(x_{i}),\pi_{c}(x_{i}))\ln\frac{p(\pi_{t}(x_{i}),\pi_{c}(x_{i}))}{p(\pi_{t}(x_{i}))p(\pi_{c}(x_{i}))}}{-\sum_{\pi_{t}(x_{i})\in T,\pi_{c}(x_{i})\in C}p(\pi_{t}(x_{i}),\pi_{c}(x_{i}))\ln(p(\pi_{t}(x_{i}),\pi_{c}(x_{i})))}.$ (24)

A better clustering result will provide a larger value of NMI/ACC. Both NMI and ACC are in the range of [0, 1].
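As an illustration of Eq. (24), the following sketch estimates all probabilities by relative frequencies and normalizes the mutual information by the joint entropy. The function name `nmi_joint` is hypothetical, and returning 1.0 when the joint entropy is zero (both labelings consist of a single cluster) is a convention assumed here, not something stated in the text.

```python
from collections import Counter
from math import log


def nmi_joint(true_labels, pred_labels):
    """Mutual information normalized by the joint entropy, as in Eq. (24)."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))   # counts of label pairs
    count_true = Counter(true_labels)
    count_pred = Counter(pred_labels)
    mutual_info, joint_entropy = 0.0, 0.0
    for (t, c), count in joint.items():
        p_tc = count / n
        p_t, p_c = count_true[t] / n, count_pred[c] / n
        mutual_info += p_tc * log(p_tc / (p_t * p_c))
        joint_entropy -= p_tc * log(p_tc)
    # Assumed convention: two trivial one-cluster labelings agree perfectly.
    return mutual_info / joint_entropy if joint_entropy > 0 else 1.0


same = nmi_joint([0, 0, 1, 1], [1, 1, 0, 0])         # same partition, relabeled
independent = nmi_joint([0, 0, 1, 1], [0, 1, 0, 1])  # statistically independent
```

A relabeled copy of the ground truth scores 1, and two independent labelings score 0, which matches the stated range of [0, 1].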

6.2 Compared methods and experimental settings

In this experiment, we compare the proposed method with a baseline clustering method, that is, the divide-and-conquer based large-scale spectral clustering (DnC-SC) [21], as well as six state-of-the-art ensemble clustering methods. The ensemble clustering methods compared are as follows:

(1) EAC [9]: evidence accumulation clustering.
(2) KCC [36]: $k$-means based consensus clustering.
(3) PTGP [13]: probability trajectory based graph partitioning.
(4) SEC [23]: spectral ensemble clustering.
(5) LWGP [14]: locally weighted graph partitioning.
(6) U-SENC [16]: ultra-scalable ensemble clustering.

There are several common parameters among the methods mentioned above. We set these parameters as follows:

• We set the number of landmarks to $p=1000$ for LSEC and U-SPEC. A parameter analysis of $p$ was conducted in [21].
• Base clusterings are generated by $k$-means or large-scale spectral clustering, as suggested in the respective papers [9, 36, 13, 14, 16]. The number of clusters $c$ of each base clustering is randomly selected from [20, 60]. The number of base clusterings is set to $m=20$ for comparison. The parameter analysis of $m$ is discussed in Section 6.4.
• We set the number of nearest neighbors to $K=5$ for LSEC and U-SPEC.
• LSEC has the unique parameters $q$ and $K_{1},\dots,K_{q}$ . We set $q=4$ and conducted landmark selection $\frac{m}{q}=\frac{20}{4}=5$ times. We assign $1,\dots,5$ to $K_{1},\dots,K_{q}$ to search for the $K$-nearest landmarks by reusing each set of landmarks.
• The DnC-SC method has a unique parameter $\alpha$ that determines the upper bound of $k_{i}$ , which controls the landmark selection rate. In the experiments, $\alpha=50$ is used for all the datasets.
• The true number of classes of each dataset is used in all experiments.
• Other parameters in the baseline methods are set as suggested in the existing literature.

6.3 Comparison results

The experimental comparison results are presented in Tables 4–6. Note that DnC-SC is not an ensemble clustering algorithm; its clustering results are provided for reference only. As shown in Tables 4 and 5, the LSEC algorithm achieves the highest ACC and NMI scores on most datasets. In terms of the average score across the ten datasets, LSEC achieves the best average ACC (%) and NMI (%) scores of 77.01 and 77.21, respectively. The second-best ensemble clustering method (that is, U-SENC) achieves average ACC (%) and NMI (%) scores of 72.59 and 74.78, respectively. The EAC, KCC, PTGP, SEC, and LWGP methods use $k$-means based ensemble generation. The LSEC and U-SENC methods, which use spectral clustering based ensemble generation, achieve higher ACC and NMI than the other methods on most datasets. In terms of average rank, LSEC has an average rank of 1.90 in ACC and 1.40 in NMI, while the second-best method has an average rank of 2.30 in ACC and 1.90 in NMI.

Table 4
ACC (%) scores (over 20 runs) realized by the proposed method and the baseline ensemble clustering methods (The best score in each row is shown in bold)

Dataset	DnC-SC	EAC	KCC	PTGP	SEC	LWGP	U-SENC	LSEC
PenDigits	82.27 ${}_{\pm 1.33}$	77.67 ${}_{\pm 2.30}$	44.68 ${}_{\pm 5.10}$	80.89 ${}_{\pm 1.26}$	32.37 ${}_{\pm 3.88}$	73.66 ${}_{\pm 2.14}$	87.02 ${}_{\pm 1.65}$	88.26 ${}_{\pm 2.10}$
USPS	82.55 ${}_{\pm 1.96}$	66.76 ${}_{\pm 1.69}$	56.37 ${}_{\pm 3.41}$	67.14 ${}_{\pm 0.26}$	40.77 ${}_{\pm 5.82}$	65.82 ${}_{\pm 3.41}$	78.25 ${}_{\pm 2.39}$	80.97 ${}_{\pm 5.31}$
Letters	33.54 ${}_{\pm 1.21}$	29.79 ${}_{\pm 0.60}$	24.46 ${}_{\pm 1.24}$	26.66 ${}_{\pm 1.42}$	23.60 ${}_{\pm 1.28}$	27.88 ${}_{\pm 0.78}$	35.03 ${}_{\pm 1.28}$	36.33 ${}_{\pm 0.89}$
MNIST	74.24 ${}_{\pm 2.14}$	N/A	45.61 ${}_{\pm 4.96}$	66.96 ${}_{\pm 0.68}$	33.15 ${}_{\pm 2.07}$	56.27 ${}_{\pm 1.47}$	75.48 ${}_{\pm 3.01}$	80.19 ${}_{\pm 3.74}$
Covertype	23.48 ${}_{\pm 1.86}$	N/A	32.52 ${}_{\pm 0.41}$	23.45 ${}_{\pm 0.96}$	39.63 ${}_{\pm 6.15}$	30.64 ${}_{\pm 0.42}$	21.34 ${}_{\pm 1.06}$	23.42 ${}_{\pm 1.86}$
TB-1M	99.62 ${}_{\pm 0.02}$	N/A	67.76 ${}_{\pm 1.41}$	81.95 ${}_{\pm 0.00}$	67.94 ${}_{\pm 3.66}$	99.71 ${}_{\pm 0.45}$	99.25 ${}_{\pm 0.01}$	99.72 ${}_{\pm 2.31}$
SF-2M	9.43 ${}_{\pm 0.31}$	N/A	50.94 ${}_{\pm 4.15}$	60.25 ${}_{\pm 0.94}$	49.88 ${}_{\pm 5.68}$	80.04 ${}_{\pm 3.45}$	76.54 ${}_{\pm 2.88}$	85.17 ${}_{\pm 9.18}$
CC-5M	99.98 ${}_{\pm 0.00}$	N/A	72.25 ${}_{\pm 6.41}$	34.95 ${}_{\pm 0.00}$	41.57 ${}_{\pm 0.81}$	97.84 ${}_{\pm 3.71}$	99.99 ${}_{\pm 0.00}$	97.61 ${}_{\pm 1.67}$
CG-10M	66.83 ${}_{\pm 4.46}$	N/A	58.12 ${}_{\pm 5.41}$	60.89 ${}_{\pm 1.73}$	46.26 ${}_{\pm 5.72}$	71.95 ${}_{\pm 3.19}$	82.34 ${}_{\pm 5.59}$	97.57 ${}_{\pm 3.49}$
FL-20M	81.90 ${}_{\pm 5.61}$	N/A	48.21 ${}_{\pm 4.14}$	51.21 ${}_{\pm 1.41}$	41.70 ${}_{\pm 0.42}$	72.15 ${}_{\pm 2.45}$	78.16 ${}_{\pm 3.21}$	82.81 ${}_{\pm 3.21}$
Avg. score	–	N/A	50.09	55.44	41.69	67.60	72.59	77.01
Avg. rank	–	6.00	5.00	4.00	5.60	3.30	2.30	1.90

${}^{*}$ N/A indicates that the algorithm could not finish because of an out-of-memory error.

Table 5

NMI (%) scores (over 20 runs) realized by the proposed method and the baseline ensemble clustering methods (The best score in each row is shown in bold)

Dataset	DnC-SC	EAC	KCC	PTGP	SEC	LWGP	U-SENC	LSEC
PenDigits	82.01 ${}_{\pm 0.21}$	76.12 ${}_{\pm 0.00}$	53.52 ${}_{\pm 3.46}$	78.31 ${}_{\pm 0.34}$	46.44 ${}_{\pm 2.10}$	76.46 ${}_{\pm 1.43}$	83.24 ${}_{\pm 1.11}$	84.65 ${}_{\pm 1.81}$
USPS	82.86 ${}_{\pm 1.08}$	69.05 ${}_{\pm 0.00}$	58.27 ${}_{\pm 0.25}$	70.32 ${}_{\pm 1.02}$	49.68 ${}_{\pm 1.89}$	70.71 ${}_{\pm 1.46}$	82.09 ${}_{\pm 1.65}$	83.51 ${}_{\pm 1.34}$
Letters	45.37 ${}_{\pm 0.85}$	39.28 ${}_{\pm 0.00}$	34.61 ${}_{\pm 1.40}$	36.98 ${}_{\pm 0.99}$	32.30 ${}_{\pm 0.89}$	39.29 ${}_{\pm 0.46}$	46.40 ${}_{\pm 0.20}$	48.71 ${}_{\pm 0.63}$
MNIST	72.00 ${}_{\pm 0.51}$	N/A	46.43 ${}_{\pm 4.85}$	62.22 ${}_{\pm 1.12}$	38.84 ${}_{\pm 1.44}$	62.34 ${}_{\pm 0.62}$	75.11 ${}_{\pm 0.58}$	79.42 ${}_{\pm 1.45}$
Covertype	8.30 ${}_{\pm 0.30}$	N/A	6.38 ${}_{\pm 3.41}$	8.25 ${}_{\pm 0.43}$	9.23 ${}_{\pm 6.48}$	9.06 ${}_{\pm 0.41}$	9.34 ${}_{\pm 1.21}$	11.64 ${}_{\pm 1.76}$
TB-1M	96.42 ${}_{\pm 0.18}$	N/A	24.54 ${}_{\pm 2.45}$	31.89 ${}_{\pm 0.00}$	24.74 ${}_{\pm 4.45}$	97.16 ${}_{\pm 2.41}$	97.05 ${}_{\pm 0.04}$	97.19 ${}_{\pm 9.43}$
SF-2M	81.24 ${}_{\pm 0.32}$	N/A	38.06 ${}_{\pm 2.45}$	49.74 ${}_{\pm 0.18}$	33.65 ${}_{\pm 3.22}$	81.95 ${}_{\pm 4.15}$	77.57 ${}_{\pm 2.12}$	84.88 ${}_{\pm 6.55}$
CC-5M	99.78 ${}_{\pm 0.01}$	N/A	59.24 ${}_{\pm 0.41}$	0.13 ${}_{\pm 0.00}$	12.93 ${}_{\pm 1.80}$	98.15 ${}_{\pm 7.41}$	99.91 ${}_{\pm 0.00}$	97.57 ${}_{\pm 1.70}$
CG-10M	80.91 ${}_{\pm 3.59}$	N/A	63.56 ${}_{\pm 0.41}$	65.09 ${}_{\pm 0.92}$	55.77 ${}_{\pm 6.84}$	78.41 ${}_{\pm 2.93}$	86.28 ${}_{\pm 2.30}$	95.25 ${}_{\pm 1.32}$
FL-20M	87.67 ${}_{\pm 3.18}$	N/A	68.10 ${}_{\pm 2.41}$	71.32 ${}_{\pm 1.29}$	53.77 ${}_{\pm 2.52}$	78.51 ${}_{\pm 1.97}$	90.38 ${}_{\pm 2.45}$	91.32 ${}_{\pm 2.44}$
Avg. score	–	N/A	45.27	47.43	35.74	69.20	74.78	77.21
Avg. rank	–	6.30	5.40	4.30	5.80	3.00	1.90	1.40

Table 6

Time costs(s) of the proposed method and the baseline ensemble clustering methods

Dataset	DnC-SC	EAC	KCC	PTGP	SEC	LWGP	U-SENC	LSEC
PenDigits	0.64	18.78	9.19	6.06	3.00	4.00	18.31	3.40
USPS	1.25	25.79	23.56	41.32	15.08	15.93	28.82	5.81
Letters	0.90	115	48.48	89.88	10.76	11.15	20.86	3.71
MNIST	5.11	N/A	831.12	2297.05	730.33	731.64	103.35	21.36
Covertype	13.15	N/A	634.18	16271.2	714.86	730.33	143.41	40.16
TB-1M	5.06	N/A	984.15	849.62	693.67	709.60	265.80	62.50
SF-2M	13.77	N/A	2225.64	1475.08	1344.66	1566.8	623.26	131.67
CC-5M	25.37	N/A	8541.13	3040.33	3232.06	3006	1851.2	321.71
CG-10M	281.05	N/A	12351.2	7244.01	7607.84	6685.8	3561.4	769.51
FL-20M	837.38	N/A	17112.1	13343.3	14938.73	13091	11763.07	2396.85
Avg. score	–	N/A	4276.07	4465.78	2929.10	2655.23	1837.95	375.65
Avg. rank	–	6.80	5.20	5.00	3.30	3.60	3.00	1.10

The time costs of the different ensemble clustering methods are provided in Table 6. The proposed LSEC method achieves the lowest time cost on nine datasets and the second-lowest time cost on the remaining dataset. Except on the PenDigits dataset, LSEC is 2.4 (FL-20M) to 5.75 (CC-5M) times faster than the second-best method. The LSEC method shows significant advantages over the other ensemble clustering methods, particularly on large-scale datasets.

To further analyze the experimental results in Tables 4–6, we use the $t$-test [6] (with $p<0.05$ ) to evaluate the statistical significance of the differences between the proposed method and the baseline methods on each dataset. In a comparison, if the proposed method achieves a better (or worse) clustering performance than a baseline method and the difference is statistically significant according to the $t$-test with $p<0.05$ , then we say that the proposed method is significantly better (or significantly worse) than the baseline method one time. If the difference between the proposed method and a baseline method is not statistically significant in a comparison, then we say that the two methods are comparable one time. Table 10 reports the number of times that the proposed method is significantly better than, comparable to, or significantly worse than each baseline method with respect to ACC, NMI, and time. Specifically, as shown in Table 10, in terms of ACC, the proposed method exhibits statistically significant improvements over EAC, KCC, PTGP, and SEC in at least 9 comparisons and significantly outperforms the LWGP and U-SENC methods in at least 7 of the 10 comparisons. Similar advantages can be observed in terms of NMI: the proposed method significantly outperforms EAC, KCC, PTGP, and SEC in all 10 comparisons and exhibits statistically significant improvements over LWGP and U-SENC in at least 6 of the 10 comparisons.

6.4 Parameter analysis on ensemble size $m$

We conduct a parameter analysis experiment to evaluate the performance of the proposed method for different values of $m$ . The parameter $m$ denotes the number of base clusterings, which is a common parameter in all ensemble clustering methods. We selected four datasets (MNIST, Covertype, TB-1M, and SF-2M) as benchmark datasets for the following experiments. As shown in Table 7, LSEC achieves higher ACC and NMI than most other ensemble clustering methods, except for the ACC score on the Covertype dataset. Meanwhile, LSEC consistently requires a lower computational cost than all other ensemble clustering methods.

Table 7
Clustering performance (ACC (%), NMI (%), and time costs(s)) for different methods obtained by varying the number of base clusterings $m$

Table 8

Clustering performance (ACC (%), NMI (%), and time costs(s)) for LSEC with or without reusing the nearest landmarks

Table 9

Clustering performance (ACC (%), NMI (%), and time costs(s)) for LSEC using light- $k$ -means and $k$ -means to obtain base clusterings in the ensemble generation

Table 10

The number of times that the proposed method is (significantly better than/comparable to/significantly worse than) a baseline method by statistical test ( $t$-test with $p<0.05$ )

Metrics	EAC	KCC	PTGP	SEC	LWGP	U-SENC
ACC	(10/0/0)	(10/0/0)	(9/1/0)	(9/0/1)	(7/2/1)	(7/1/2)
NMI	(10/0/0)	(10/0/0)	(10/0/0)	(10/0/0)	(8/2/0)	(6/3/1)
Time	(10/0/0)	(10/0/0)	(10/0/0)	(9/0/1)	(10/0/0)	(10/0/0)

6.5 Influence of reusing $K$-nearest landmarks

In this section, we compare the performance of the proposed method with and without reusing the nearest landmarks, denoted as LSEC and LSEC-without-reuse, respectively. The experimental results are shown in Table 8. As mentioned, reusing the nearest landmarks improves the efficiency of searching for the $K$-nearest landmarks. In Table 8, LSEC and LSEC-without-reuse demonstrate comparable clustering performance, but the time cost of LSEC is lower. Because reusing the nearest landmarks does not influence the accuracy of the nearest landmarks, we attribute the difference in ACC and NMI between the two methods to the randomness of the algorithm. This result indicates that reusing the nearest landmarks provides significantly better efficiency while maintaining a similar clustering result.
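The reuse idea can be illustrated as follows: the nearest-landmark lists for all values $K_{1},\dots,K_{q}$ are prefixes of the list for the largest value, so a single search suffices. This is a minimal sketch, assuming an exact brute-force search with squared Euclidean distance in place of the approximate search used by the method; the function names and the toy data are hypothetical.

```python
def nearest_landmarks(point, landmarks, k):
    """Indices of the k landmarks closest to point (exact brute-force search)."""
    def sq_dist(r):
        return sum((a - b) ** 2 for a, b in zip(point, r))
    order = sorted(range(len(landmarks)), key=lambda j: sq_dist(landmarks[j]))
    return order[:k]


def shared_neighbor_lists(points, landmarks, ks):
    """Search once with max(ks), then reuse prefixes for every smaller K."""
    k_max = max(ks)
    full = [nearest_landmarks(x, landmarks, k_max) for x in points]  # one search
    return {k: [nbrs[:k] for nbrs in full] for k in ks}


# Toy data (hypothetical): three points, four landmarks in the plane.
points = [(0.0, 0.0), (1.0, 1.0), (3.0, 0.5)]
landmarks = [(0.1, 0.0), (1.2, 0.9), (2.9, 0.4), (0.0, 2.0)]
shared = shared_neighbor_lists(points, landmarks, ks=[1, 2, 3])
```

Because each prefix equals the result of a separate search with the smaller $K$ , reuse changes only the cost of the search, not the neighbors that are found.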

6.6 Influence of light- $k$ -means

In this section, we compare the performance of the proposed method when light-$k$-means or $k$-means is used to obtain the base clusterings in ensemble generation. The experimental results are shown in Table 9. Generally, the two variants show similar ACC and NMI. In particular, LSEC achieves slightly higher ACC and NMI on the MNIST and SF-2M datasets, which is plausible because the light-$k$-means method can provide improved diversity of base clusterings on these datasets. Overall, compared with the $k$-means method, using light-$k$-means in ensemble generation significantly improves the efficiency of LSEC and yields comparable clustering quality.

7. Conclusion

In this paper, we propose the LSEC method to balance the efficiency and effectiveness of ensemble clustering on large-scale datasets. We design an efficient ensemble generation framework that produces high-quality base clusterings by applying divide-and-conquer large-scale spectral clustering. In the ensemble generation of the proposed method, we accelerate the search for the $K$-nearest neighbors with a reuse strategy and obtain the base clusterings with the light-$k$-means method. After the ensemble generation step, we combine all base clusterings into a consensus clustering through a bipartite graph partitioning-based consensus function. The proposed method achieves a lower computational complexity than most existing ensemble clustering methods. Experiments conducted on ten large-scale datasets show that the proposed method outperforms other state-of-the-art large-scale ensemble clustering methods.

Acknowledgments

This study was supported by the New Energy and Industrial Technology Development Organization (NEDO) Grant (ID:18065620) and JST COI-NEXT.

References

1. Asuncion and Newman, UCI machine learning repository, 2007.
2. Blackard J.A. and Dean D.J., Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables, Computers and Electronics in Agriculture 24(3) (1999), 131–151.
3. Cai and Chen, Large scale spectral clustering via landmark-based sparse representation, IEEE Transactions on Cybernetics 45(8) (2014), 1669–1680.
4. Cai and Han, Speed up kernel discriminant analysis, The VLDB Journal 20(1) (2011), 21–33.
5. Cai, Han and Huang T.S., Graph regularized nonnegative matrix factorization for data representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 33(8) (2010), 1548–1560.
6. Demšar, Statistical comparisons of classifiers over multiple data sets, The Journal of Machine Learning Research 7 (2006), 1–30.
7. Fern X.Z. and Brodley C.E., Solving cluster ensemble problems by bipartite graph partitioning, in: Proceedings of the Twenty-First International Conference on Machine Learning, 2004, p. 36.
8. Fowlkes, Belongie, Chung and Malik, Spectral grouping using the Nyström method, IEEE Transactions on Pattern Analysis and Machine Intelligence 26(2) (2004), 214–225.
9. Fred A.L. and Jain A.K., Combining multiple clusterings using evidence accumulation, IEEE Transactions on Pattern Analysis and Machine Intelligence 27(6) (2005), 835–850.
10. Frey P.W. and Slate D.J., Letter recognition using Holland-style adaptive classifiers, Machine Learning 6(2) (1991), 161–182.
11. Huang, Lai and Wang C.-D., Ensemble clustering using factor graph, Pattern Recognition 50 (2016), 131–142.
12. Huang, Lai J.-H. and Wang C.-D., Combining multiple clusterings via crowd agreement estimation and multi-granularity link analysis, Neurocomputing 170 (2015), 240–250.
13. Huang, Lai J.-H. and Wang C.-D., Robust ensemble clustering using probability trajectories, IEEE Transactions on Knowledge and Data Engineering 28(5) (2015), 1312–1326.
14. Huang, Wang C.-D. and Lai J.-H., Locally weighted ensemble clustering, IEEE Transactions on Cybernetics 48(5) (2017), 1460–1473.
15. Huang, Wang C.-D., Peng, Lai and Kwoh C.-K., Enhanced ensemble clustering via fast propagation of cluster-wise similarities, IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2018.
16. Huang, Wang C.-D., J.-S., Lai J.-H. and Kwoh C.-K., Ultra-scalable spectral clustering and ensemble clustering, IEEE Transactions on Knowledge and Data Engineering 32(6) (2019), 1212–1226.
17. Iam-On, Boongeon, Garrett and Price, A link-based cluster ensemble approach for categorical data clustering, IEEE Transactions on Knowledge and Data Engineering 24(3) (2010), 413–425.
18. Kiselev V.Y., Kirschner, Schaub M.T., Andrews, Yiu, Chandra, Natarajan K.N., Reik, Barahona, Green A.R. et al., SC3: Consensus clustering of single-cell RNA-seq data, Nature Methods 14(5) (2017), 483.
19. Imakura and Sakurai, Ensemble learning for spectral clustering, in: 2020 IEEE International Conference on Data Mining (ICDM), IEEE, 2020, pp. 1094–1099.
20. Imakura and Sakurai, Hubness-based sampling method for Nyström spectral clustering, in: 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1–8.
21. Imakura and Sakurai, Divide-and-conquer based large-scale spectral clustering, arXiv preprint arXiv:2104.15042, 2021.
22. Ogihara and Ma, On combining multiple clusterings, in: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, 2004, pp. 294–303.
23. Liu, Tao and Fu, Spectral ensemble clustering, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 715–724.
24. Liu, Tao and Fu, Spectral ensemble clustering via weighted k-means: Theoretical and practical evidence, IEEE Transactions on Knowledge and Data Engineering 29(5) (2017), 1129–1143.
25. Liu, Zhao, Fang, Cheng and Liu Y.-Y., Entropy-based consensus clustering for patient stratification, Bioinformatics 33(17) (2017), 2691–2698.
26. Naldi M.C., Carvalho and Campello R.J., Cluster ensemble selection based on relative validity indexes, Data Mining and Knowledge Discovery 27(2) (2013), 259–289.
27. Slonim and Tishby, Agglomerative information bottleneck, in: Advances in Neural Information Processing Systems, 2000, pp. 617–623.
28. Strehl and Ghosh, Cluster ensembles – a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research 3(Dec) (2002), 583–617.
29. Tandon, Albeshri, Thayananthan, Alhalabi and Fortunato, Fast consensus clustering in complex networks, Physical Review E 99(4) (2019), 042301.
30. Tao, Liu and Fu, Robust spectral ensemble clustering, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, ACM, 2016, pp. 367–376.
31. Topchy, Jain A.K. and Punch, Combining multiple weak clusterings, in: Third IEEE International Conference on Data Mining, IEEE, 2003, pp. 331–338.
32. Vega-Pons and Ruiz-Shulcloper, A survey of clustering ensemble algorithms, International Journal of Pattern Recognition and Artificial Intelligence 25(03) (2011), 337–372.
33. Wang, Armasu S.M., Kalli K.R., Maurer M.J., Heinzen E.P., Keeney G.L., Cliby W.A., Oberg A.L., Kaufmann S.H. and Goode E.L., Pooled clustering of high-grade serous ovarian cancer gene expression leads to novel consensus subtypes associated with survival and surgical outcomes, Clinical Cancer Research 23(15) (2017), 4077–4085.
34. Wang and Li, Generalized cluster aggregation, in: Twenty-First International Joint Conference on Artificial Intelligence, 2009.
35. Wang, Yang and Zhou, Clustering aggregation by probability accumulation, Pattern Recognition 42(5) (2009), 668–675.
36. Liu, Xiong, Cao and Chen, K-means-based consensus clustering: A unified view, IEEE Transactions on Knowledge and Data Engineering 27(1) (2014), 155–169.
37. Liu and Gong, Document clustering based on non-negative matrix factorization, in: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 267–273.
38. Yan, Huang and Jordan M.I., Fast approximate spectral clustering, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2009, pp. 907–916.
39. Sakurai and Liu, Large scale spectral clustering using sparse representation based on hubness, in: 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), IEEE, 2018, pp. 1731–1737.
40. and Sakurai, Robust similarity measure for spectral clustering based on shared neighbors, ETRI Journal 38(3) (2016), 540–550.
41. and Sakurai, Spectral clustering with adaptive similarity measure in kernel space, Intelligent Data Analysis 22(4) (2018), 751–765.
42. Zheng and Ding, A framework for hierarchical ensemble clustering, ACM Transactions on Knowledge Discovery from Data (TKDD) 9(2) (2014), 1–23.
43. Zhong, Yue, Zhang and Lei, A clustering ensemble: Two-level-refined co-association matrix with path-based transformation, Pattern Recognition 48(8) (2015), 2699–2709.
44. Yang, Jin, Jain A.K. and Mahdavi, Robust ensemble clustering by matrix completion, in: 2012 IEEE 12th International Conference on Data Mining, IEEE, 2012, pp. 1176–1181.
45. Chen W.-Y., Song, Bai, Lin C.-J. and Chang E.Y., Parallel spectral clustering in distributed systems, IEEE Transactions on Pattern Analysis and Machine Intelligence 33(3) (2010), 568–586.
