Manifold spatial clustering via asymmetric convolutional denoising autoencoder

Abstract

Deep unsupervised learning extracts meaningful features from unlabeled images and simultaneously serves downstream tasks in computer vision. Deep clustering methods generally consist of feature learning and cluster assignment. To enhance the discriminative ability of the features and further improve clustering performance, a new deep clustering method named ACMEC (asymmetric convolutional denoising autoencoder with manifold spatial embedding clustering) is proposed. In this method, an asymmetric convolutional denoising autoencoder is employed to extract visual features from images, a manifold learning algorithm is used to obtain more distinctive features, and a Gaussian Mixture Model (GMM) is then used for cluster learning. The stability of the feature space is guaranteed by a separate training mechanism. In addition, reconstruction from noisy images enhances the robustness of the feature network. Experimental results on nine benchmark datasets demonstrate that the proposed ACMEC method performs well, for example reaching a clustering accuracy of 0.979 on the MNIST dataset and 0.668 on the Fashion-MNIST dataset. ACMEC is comparable to the N2D (not too deep clustering) algorithm, which reaches clustering accuracies of 0.979 and 0.672 respectively. Moreover, it is 16.1% higher than the DEC algorithm on the Fashion-MNIST dataset.

Keywords

Clustering analysis, feature learning, asymmetric convolutional denoising autoencoder, manifold embedding, Gaussian mixture models (GMM)

1 Introduction

Deep unsupervised learning extracts meaningful feature representations from unlabeled images and simultaneously serves downstream tasks [1]. Clustering is an unsupervised learning method that automatically divides samples into several clusters so that the similarity among members of the same cluster is as large as possible. To overcome the limitations of traditional clustering methods, which suffer from the curse of dimensionality, deep neural networks have become popular for automatically learning high-level visual features from high-dimensional images [2]. An autoencoder (AE) is a network with a symmetrical structure composed of an encoder and a decoder. AEs are effective for dimensionality reduction and feature learning. Usually, the network parameters and feature representations are optimized by minimizing the reconstruction loss [3, 4]. The feature-learning ability of a deep network is evaluated by its performance on downstream tasks.

Deep clustering methods generally consist of feature learning and cluster assignment. In early approaches, a fully connected autoencoder was trained first, the encoder network then extracted visual features from the input images, and these features were used as inputs to the clustering model. A joint embedded clustering algorithm, which uses a fully connected autoencoder and the k-means clustering algorithm, was proposed in [5, 6]. Later, several modified versions, including convolutional and denoising autoencoders, were proposed to further improve clustering performance [7–10]. Compared with traditional clustering algorithms, these methods perform much better on benchmark tasks.

To enhance the discriminative ability of the features, manifold mapping is employed to preserve the geometric structure and neighborhood properties of the samples. Among nonlinear manifold learning methods, Isomap [11] focuses on global structure and t-SNE [12] is a local method, while UMAP (uniform manifold approximation and projection) [13] is a local method that can also preserve global structure.

It is empirically found that image clustering models face three problems: i) the curse of dimensionality caused by high-dimensional images; ii) feature extraction and cluster learning affect each other; and iii) the limited representation ability of fully connected autoencoders. Based on these findings, we propose a new clustering method named ACMEC, which combines an asymmetric convolutional denoising autoencoder with manifold spatial embedding for feature learning and then uses a Gaussian Mixture Model (GMM) for cluster learning. Features extracted by the encoder are embedded into a manifold space that preserves local structure and is more conducive to cluster learning. In this way, ACMEC replaces a complex clustering network with a manifold learning method [13–15] and a straightforward clustering algorithm, reducing the depth of the deep clustering model while achieving superior performance through the extra manifold learning step. The main merits of the ACMEC method are as follows:

An asymmetric convolutional denoising autoencoder, a not-too-deep network structure, is presented for obtaining meaningful visual features;

A manifold embedding is employed to disentangle features for more separable clusters;

Local topological information of the points is used to calculate the similarity between two points in the feature embedding space;

Extensive experiments demonstrate that the proposed ACMEC method outperforms state-of-the-art clustering methods on benchmark datasets.

The purpose of this article is to provide a simple method, ACMEC, that replaces deep clustering networks with a manifold learning method followed by a shallow clustering algorithm. According to the experiments, this approach can compete with state-of-the-art deep clustering approaches on several datasets. The rest of the paper is organized as follows. Section 2 reviews related work. The proposed ACMEC method and network architecture are described in Section 3. Extensive experiments, including comparisons and ablation studies of ACMEC, are discussed in Section 4. Finally, Section 5 states the conclusion and future work.

2 Related work

The performance of a clustering approach depends heavily on the quality of feature learning. Therefore, it is necessary to design a network structure with high representation ability for various kinds of images. Existing geometric transformations include linear transformations such as Principal Component Analysis (PCA) [16, 17] and nonlinear transformations such as kernel methods and spectral methods [18–20]. However, how to describe features in a latent space is still a challenge. In this case, deep neural networks (DNN), including convolutional neural networks and generative adversarial networks, can be utilized to learn high-level representations [21, 22]. Existing deep clustering methods mainly combine feature learning with conventional clustering methods, minimizing a clustering loss to update the network parameters and improve clustering performance [23, 24].

In 2014, a deep embedded network (DEN) was proposed to extract effective representations for clustering [25]. DEN first uses an autoencoder to learn representations from raw data, and then applies a locality-preserving constraint to preserve the local structure of the original data. An unsupervised deep embedding method (DEC) [5] was proposed in 2016 to perform feature extraction and clustering simultaneously. It discards the decoder after training an autoencoder by minimizing the reconstruction loss, and the cluster assignment is iteratively optimized by minimizing the KL (Kullback-Leibler) divergence. Building on DEC, improved deep embedded clustering (IDEC) [26] and discriminatively boosted clustering (DBC) [27] were presented in 2017 and 2018, respectively. IDEC uses both the clustering loss and the reconstruction loss to disentangle the feature space, and its under-complete autoencoder term maintains the local structure. In DBC, a convolutional autoencoder replaces the fully connected autoencoder to exploit the representation ability of the network mapping.

Joint unsupervised learning of deep representations (JULE) [28] was proposed to jointly learn meaningful features and clusters, with a convolutional neural network for representation learning and hierarchical clustering for cluster assignment. JULE optimizes its objective function in a recurrent process. In 2017, deep embedded regularized clustering (DEPICT) [7] was proposed, a sophisticated method that combines several techniques. It stacks a softmax layer on top of a multi-layer convolutional autoencoder and minimizes a relative entropy loss with a regularization term for clustering [29]. In 2019, the ClusterGAN method [30] proposed a novel balanced self-paced learning scheme that gradually brings samples into training. In 2020, deeply embedded dimensionality reduction clustering (DERC) [9] formed a deep clustering algorithm by optimizing image embedding, dimensionality reduction, and cluster learning. In 2021, Not too Deep (N2D) [10] learned an autoencoder embedding and then searched for the underlying manifold, replacing the deeper network used for clustering. Its comparison of manifold learning methods found that combining UMAP (uniform manifold approximation and projection) with an autoencoder performs best. To overcome the offline limitation, contrastive clustering (CC) provided an online deep clustering method; CC is in fact intended to train a classification model in the unsupervised setting using an invariant information loss [31]. Figure 1 shows several “milestone” methods and their clustering accuracies on the MNIST dataset.

Fig. 1

Milestones of popular clustering methods.

Work on cluster learning from different perspectives has achieved promising performance. However, some issues remain; for example, clustering results are still not satisfactory. To further improve the feature extraction ability of the network and the clustering performance, a new clustering model is studied in this paper, which maps visual features into an appropriate embedding space that yields better clustering results.

3 The proposed clustering method

Consider a clustering task that groups N images X = {x₁, x₂, ..., x_N} without any annotations into K clusters, where each image is x_i ∈ R^dx with dx = H × W. Inspired by recent gains of deep clustering methods, an unsupervised learning method is proposed to further improve clustering performance. Our method is composed of three components: an asymmetric convolutional denoising autoencoder extracts visual features, a manifold method embeds them into a manifold space, and a Gaussian Mixture Model (GMM) performs the clustering. The proposed method is therefore abbreviated as ACMEC. The whole framework of ACMEC is displayed in Fig. 2.

Fig. 2

The framework of ACMEC method.

The upper part of Fig. 2 is an asymmetric convolutional denoising autoencoder network (ACDAE). To avoid trivial solutions when training the autoencoder, the inputs of ACDAE are images distorted by Gaussian noise, denoted as $\tilde{X} = \{\tilde{x}_{1}, \tilde{x}_{2}, \ldots, \tilde{x}_{N}\}$, and the outputs are the reconstructed images, denoted as $\hat{X} = \{\hat{x}_{1}, \hat{x}_{2}, \ldots, \hat{x}_{N}\}$. After training, the encoder network with its optimal parameter values serves as the feature extractor for the subsequent clustering task. The lower part of Fig. 2 is therefore the actual ACMEC framework, which is composed of the encoder network, the manifold embedding and a GMM clusterer. In the clustering process, the inputs of ACMEC are the original images X = {x₁, x₂, ..., x_N}. Our work separates into two stages:

Training ACDAE by minimizing the reconstruction loss to update network parameters;

Collecting the visual features and mapping them into a manifold embedding space, and then clustering assignment by a GMM.

3.1 ACDAE architecture

In order to enhance the ability of feature representation, an asymmetric convolutional denoising autoencoder (ACDAE, see Fig. 3) is employed for feature extraction. The encoder network is composed of four convolutional layers and three fully connected layers, while the decoder includes two fully connected layers and four deconvolutional layers. The detailed structure of ACDAE is described in Section 4.2.

Fig. 3

The detailed architecture of encoder network of ACDAE.

The encoder network is a nonlinear mapping $f_{θ_{1}} : \tilde{X} \to Z$, where θ₁ represents the parameters of the encoder. Each element of Z = {z₁, z₂, ..., z_N} is a latent feature vector z_i ∈ R^dz with dz << dx. Further experiments verify the influence of dz on clustering performance (see Section 4.4.2 for a detailed analysis). The decoder is also a nonlinear mapping, $g_{θ_{2}} : Z \to \hat{X}$. The parameters θ₁ and θ₂ are updated until convergence by minimizing the reconstruction loss between $\tilde{X}$ and $\hat{X}$. The objective loss of ACDAE is: $L_{ACDAE} = \frac{1}{N} \sum_{i = 1}^{N} \| \tilde{x}_{i} - \hat{x}_{i} \|^{2} + β \sum \| W \|_{2}^{2}$ (1) where the first term is the reconstruction loss, and the reconstructed image is obtained by applying $f_{θ_{1}}$ and $g_{θ_{2}}$ to the noisy input. The second term is an L₂-norm regularization to avoid overfitting, and W = (θ₁, θ₂) is the collection of parameters. The coefficient β is fixed at 0.01 in the later experiments.
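As a rough numerical illustration of Equation (1), the following NumPy sketch evaluates the loss for given arrays. The function name `acdae_loss`, the noise standard deviation and the all-zero stand-in for the decoder output are assumptions made only for illustration; they are not taken from the Keras implementation used in this paper.

```python
import numpy as np


def acdae_loss(x_noisy, x_recon, weights, beta=0.01):
    """Equation (1): mean squared reconstruction error plus an L2 weight penalty.

    x_noisy, x_recon: arrays of shape (N, ...) holding noisy inputs and reconstructions.
    weights: iterable of parameter arrays (encoder and decoder weights together).
    """
    n = x_noisy.shape[0]
    diff = (x_noisy - x_recon).reshape(n, -1)
    # First term: squared Euclidean norm per image, averaged over the N images.
    reconstruction = np.mean(np.sum(diff ** 2, axis=1))
    # Second term: beta times the sum of squared entries of all parameters.
    penalty = beta * sum(np.sum(w ** 2) for w in weights)
    return reconstruction + penalty


rng = np.random.default_rng(0)
x = rng.random((4, 28, 28))                        # four toy 28x28 images
x_noisy = x + rng.normal(0.0, 0.1, size=x.shape)   # noise level 0.1 is an assumption
x_recon = np.zeros_like(x_noisy)                   # stand-in for the decoder output
toy_weights = [np.ones((3, 3)), np.ones(5)]        # stand-in for network parameters
print(acdae_loss(x_noisy, x_recon, toy_weights))
```

In a real training loop the same quantity would be produced by the framework's loss and weight-decay mechanisms rather than computed by hand.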

The purpose of the autoencoder transformation is to extract high-level visual features of images. Although a quite simple network is used here for feature extraction, the resulting features are quite beneficial to the subsequent clustering task.

3.2 Manifold embedding

Recent works have shown that local structure information guides networks toward fine-grained features for sample discrimination and clustering. For example, the recently proposed manifold learning method UMAP (Uniform Manifold Approximation and Projection) [13] can better preserve global structure while retaining the benefits of local methods. It first uses a spectral layout to initialize the manifold embedding, and then optimizes the embedding space by minimizing a cross entropy.

In our work, in order to capture more local topological structure and further disentangle the visual features, a manifold mapping is employed to project the features Z into U = {u₁, u₂, ..., u_N}, where u_i = φ(z_i), u_i ∈ R^du, and du ⩽ dz. In the feature space, the point z_i together with its m nearest neighbors is denoted $N_{m} (z_{i}) = \{z_{i}, z_{j}\}_{j = 1}^{m}$. The similarity between point z_j and point z_i is calculated using Equation (2): $p_{i | j} = \exp (- \frac{d (z_{i}, z_{j}) - ρ_{i}}{σ_{i}})$ (2) where ρ_i = min{d(z_i, z_j) | d(z_i, z_j) > 0, 1 ⩽ j ⩽ m} is the shortest distance to the m nearest neighbors, and d(·, ·) denotes the Euclidean distance. The width factor σ_i is determined by solving $\sum_{j = 1}^{m} \exp (\frac{ρ_{i} - d (z_{i}, z_{j})}{σ_{i}}) = \log_{2} (m)$. Because p_{i|j} is not necessarily equal to p_{j|i}, a symmetric similarity of the point pair (z_i, z_j) is calculated as: $p_{ij} = p_{i | j} + p_{j | i} - p_{i | j} p_{j | i}$ (3) In the manifold space U, inspired by the UMAP algorithm [13], the similarity between feature points u_i and u_j is: $q_{ij} = (1 + a \| u_{i} - u_{j} \|^{2 b})^{- 1}$ (4) where a ≈ 1.93 and b ≈ 0.79. The cross entropy between the probability distributions p_ij and q_ij, given in Equation (5), is minimized to fine-tune q_ij and u_i until the predefined number of iterations is reached.

$L_{m} = \sum_{i = 1}^{N} \sum_{j \in N_{m} (z_{i})} [p_{ij} \log (\frac{p_{ij}}{q_{ij}}) + (1 - p_{ij}) \log (\frac{1 - p_{ij}}{1 - q_{ij}})]$ (5)
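The quantities in Equations (2) to (5) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, σ_i is found by a simple bisection, d − ρ_i is clamped at zero for duplicate points, probabilities are clipped before taking logarithms, and the loss is accumulated in its non-negative form (each bracketed term is a divergence between two Bernoulli distributions). It evaluates the loss for a fixed embedding and does not perform the optimization that a library such as UMAP carries out.

```python
import numpy as np


def directed_similarities(z, m):
    """Equation (2) evaluated on the m nearest neighbours of every point."""
    n = z.shape[0]
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    p = np.zeros((n, n))
    target = np.log2(m)
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:m + 1]      # index 0 is the point itself
        d = dist[i, nbrs]
        rho = d[d > 0].min()                     # shortest positive neighbour distance
        gap = np.maximum(d - rho, 0.0)           # clamp is an assumption for duplicates
        lo, hi = 1e-6, 1e3
        for _ in range(64):                      # bisection on the width factor sigma_i
            sigma = 0.5 * (lo + hi)
            if np.exp(-gap / sigma).sum() > target:
                hi = sigma
            else:
                lo = sigma
        p[i, nbrs] = np.exp(-gap / sigma)
    return p


def symmetrize(p):
    """Equation (3): probabilistic union of the two directed similarities."""
    return p + p.T - p * p.T


def embedding_similarities(u, a=1.93, b=0.79):
    """Equation (4): similarity in the manifold space."""
    sq = np.sum((u[:, None, :] - u[None, :, :]) ** 2, axis=-1)
    return 1.0 / (1.0 + a * sq ** b)


def manifold_cross_entropy(p, q, mask, eps=1e-12):
    """Equation (5) summed over the neighbour pairs selected by mask."""
    p = np.clip(p, eps, 1.0 - eps)
    q = np.clip(q, eps, 1.0 - eps)
    terms = p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))
    return terms[mask].sum()


rng = np.random.default_rng(0)
z = rng.normal(size=(50, 20))    # toy latent features Z
u = rng.normal(size=(50, 2))     # toy (unoptimized) embedding U
directed = directed_similarities(z, m=10)
loss = manifold_cross_entropy(symmetrize(directed), embedding_similarities(u), directed > 0)
print(loss)
```

A gradient-based optimizer would then move the points u_i to reduce this value; that step is omitted here.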

3.3 GMM clustering

The Gaussian Mixture Model (GMM) [32] is one of the most representative model-based clustering methods, in which the observations are described by K Gaussian distributions. The likelihood of a manifold feature u is therefore a mixture of K probability densities: $Ψ_{M} (u) = \sum_{k = 1}^{K} β_{k} \cdot ψ_{k} (u | μ_{k}, Σ_{k})$ (6) where the coefficient β_k > 0 is the mixing proportion of the k-th Gaussian component and $\sum_{k = 1}^{K} β_{k} = 1$. ψ_k(u | μ_k, Σ_k) is the k-th Gaussian density function evaluated at the data sample u. It is usually written as $\frac{1}{\sqrt{(2 π)^{du} | Σ_{k} |}} \exp (- \frac{1}{2} (u - μ_{k})^{T} Σ_{k}^{- 1} (u - μ_{k}))$ where μ_k is the cluster-center vector and Σ_k is the covariance matrix. The parameters β_k, μ_k and Σ_k are determined alternately by the expectation maximization (EM) algorithm.

After that, the GMM is used to predict the probability that u_i belongs to a cluster as follows: $ψ_{M} (k | u_{i}) = \frac{p (k) \cdot ψ_{k} (u_{i} | μ_{k}, Σ_{k})}{Ψ_{M} (u_{i})}$ (7)

Here the posterior probability ψ_M(k | u_i) is the probability that u_i belongs to the k-th cluster, and the prior probability p(k) is just β_k. Finally, each feature point u_i is assigned to the cluster with the maximum posterior probability.
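The assignment rule of Equations (6) and (7) can be illustrated with a small NumPy sketch. It assumes that the parameters β_k, μ_k and Σ_k have already been estimated (the EM fitting itself is not shown), and the function names and toy parameters are hypothetical.

```python
import numpy as np


def gaussian_density(u, mean, cov):
    """Multivariate Gaussian density psi_k(u | mu_k, Sigma_k) for each row of u."""
    d = mean.shape[0]
    diff = u - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    # Quadratic form (u - mu)^T Sigma^{-1} (u - mu), one value per sample.
    quad = np.einsum("nd,de,ne->n", diff, inv, diff)
    return np.exp(-0.5 * quad) / norm


def gmm_posterior(u, weights, means, covs):
    """Equation (7): posterior probability of each component for each sample."""
    joint = np.stack(
        [w * gaussian_density(u, mu, cov) for w, mu, cov in zip(weights, means, covs)],
        axis=1,
    )
    # The row sums are the mixture likelihood of Equation (6).
    return joint / joint.sum(axis=1, keepdims=True)


# Toy parameters for K = 2 components in a 2-D manifold space (illustrative only).
weights = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
covs = [np.eye(2), np.eye(2)]
points = np.array([[0.2, -0.1], [4.8, 5.3]])
posterior = gmm_posterior(points, weights, means, covs)
assignments = posterior.argmax(axis=1)   # maximum posterior rule
print(assignments)
```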

3.4 Pseudo-Code of ACMEC

Many deep clustering models iteratively and alternately update the network weights and the cluster assignment through a joint training mechanism. However, the clustering performance of joint training degraded in our experiments, especially on the MNIST dataset. A possible reason is that the spatial structure of the latent feature space is partly destroyed, which degrades the clustering. Therefore, the proposed ACMEC method trains ACDAE, the manifold mapping and the clustering separately. The pseudo-code of ACMEC is shown in Algorithm 1.

Algorithm 1 Asymmetric Convolutional Manifold Embedding Clustering (ACMEC)
Inputs: Images X = {x₁, x₂, ..., x_N}; the number of clusters K; the number of nearest neighbors m; the maximal iterations of fine-tuning.
Outputs: C = [C₁, ..., C_K], x_i ∈ C_j
1. Preprocess the inputs: resize the images and add Gaussian noise to obtain $\tilde{X} = \{\tilde{x}_{1}, \tilde{x}_{2}, \ldots, \tilde{x}_{N}\}$;
2. Train the ACDAE network end-to-end until convergence using Equation (1); preserve the parameters and the encoder network of ACDAE;
3. Implement ACMEC:
   Calculate the probabilities p_ij and q_ij using Equations (3) and (4), respectively;
   Minimize the cross entropy in Equation (5) to update q_ij and u_i;
   Train the GMM using the EM algorithm to update the parameters β_k, μ_k and Σ_k; allocate clusters using Equation (7).


4 Experiments

This section mainly verifies the clustering performances of the ACMEC algorithm and compares it with other clustering algorithms on public image datasets. Section 4.1 introduces the datasets and evaluation metrics. In Section 4.2, ACDAE network configurations and parameters are carefully discussed. The exhaustive clustering analyses and visualizations are displayed in Section 4.3.

In order to decrease the influence of random initialization on performance, each experiment is repeated 30 or 50 times and the highest clustering accuracy is reported. The experimental environment includes an Intel Core i5-6300HQ processor, an NVIDIA graphics card with 2.0 GB of video memory, and 8.0 GB of RAM. The ACMEC code is built on the open-source Keras library.

4.1 Datasets and metrics

Here we briefly review two handwritten digit datasets, three datasets of daily clothes and necessities, and four smaller face datasets. The detailed descriptions are displayed in Table 1. To unify the network structure, all images are resized to 28×28. In addition, color images are converted to gray-scale.

Table 1
The detailed information of datasets used in experiments

Dataset	#Samples	#Classes	#Dimensions
MNIST-Full	70000	10	28×28
MNIST-Test	10000	10	28×28
Fashion-MNIST	70000	10	28×28
COIL-20	1440	20	128×128
COIL-100	7200	100	128×128
BioID-Face	162	27	384×286
CAS-PEAL-R1	200	40	480×360
IMM	240	40	640×480
UMISTS	564	20	220×220

Three evaluation indicators widely used for evaluating clustering tasks, i.e. clustering accuracy (ACC) [33], normalized mutual information (NMI) [34] and adjusted Rand index (ARI) [35], are described here. The closer the value is to 1, the more accurate the clustering result is.

Clustering accuracy (ACC) is calculated as follows: $ACC = \max_{m} \frac{\sum_{i = 1}^{N} 1 \{l_{i} = m (c_{i})\}}{N}$ (8) where ACC is the proportion of correctly partitioned samples, l_i is the ground-truth label of sample i, c_i is its cluster assignment, 1{·} is the indicator function, and m ranges over the one-to-one mappings between clusters and labels. The best mapping is provided by the Hungarian algorithm.
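A minimal sketch of Equation (8) is given below. It assumes that labels and cluster indices are integers starting at 0, and it uses SciPy's `linear_sum_assignment` as the Hungarian-style solver; if SciPy is not installed, the sketch falls back to an exhaustive search over permutations, which yields the same optimum but is only practical for a small number of clusters. The function name `clustering_accuracy` is hypothetical.

```python
from itertools import permutations

import numpy as np

try:
    from scipy.optimize import linear_sum_assignment
except ImportError:          # fall back to exhaustive search when SciPy is missing
    linear_sum_assignment = None


def clustering_accuracy(labels, clusters):
    """Equation (8): best one-to-one match between cluster indices and labels."""
    labels = np.asarray(labels)
    clusters = np.asarray(clusters)
    k = int(max(labels.max(), clusters.max())) + 1
    # counts[c, l] = number of samples placed in cluster c whose true label is l.
    counts = np.zeros((k, k), dtype=np.int64)
    for c, l in zip(clusters, labels):
        counts[c, l] += 1
    if linear_sum_assignment is not None:
        rows, cols = linear_sum_assignment(-counts)   # negate to maximize matches
        matched = counts[rows, cols].sum()
    else:
        matched = max(
            sum(counts[c, perm[c]] for c in range(k)) for perm in permutations(range(k))
        )
    return matched / labels.size


true_labels = [0, 0, 1, 1, 2, 2]
cluster_ids = [1, 1, 0, 0, 2, 0]   # same grouping up to renaming, with one mistake
print(clustering_accuracy(true_labels, cluster_ids))
```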

Mutual information MI(Ω, C) = H(Ω) - H(Ω|C) describes the information gain about the clustering partition Ω given the clustering allocation C. MI grows with increasing similarity between Ω and C. NMI normalizes the mutual information into [0, 1]: $NMI (Ω, C) = \frac{2 MI (Ω, C)}{H (Ω) + H (C)}$ (9) ARI represents the similarity of two clustering distributions [35]: $ARI = \frac{RI - E (RI)}{\max (RI) - E (RI)}$ (10) where RI is the Rand index and E(RI) is its expected value.

ARI lies in the range [–1, 1]. A negative ARI indicates a poor clustering result, while a larger positive value indicates a higher similarity between the two clustering distributions.
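Equations (9) and (10) can be sketched with the Python standard library alone. The version below is an illustration, not the evaluation code used in the experiments: NMI is computed from empirical entropies, and ARI uses the usual pair-counting form, in which the index, its expectation and its maximum are expressed through counts of sample pairs. The function names are hypothetical.

```python
from collections import Counter
from math import comb, log


def entropy(labels):
    """Empirical entropy H of a labelling."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """Equation (9): 2 * MI / (H(a) + H(b))."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(
        c / n * log(c * n / (count_a[x] * count_b[y])) for (x, y), c in joint.items()
    )
    denom = entropy(a) + entropy(b)
    return 2.0 * mi / denom if denom > 0 else 1.0


def ari(a, b):
    """Equation (10) in pair-counting form."""
    n = len(a)
    index = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = 0.5 * (pairs_a + pairs_b)
    if maximum == expected:          # degenerate case, e.g. a single cluster
        return 1.0
    return (index - expected) / (maximum - expected)


same_grouping = ([0, 0, 1, 1], [1, 1, 0, 0])      # identical up to renaming
crossed_grouping = ([0, 0, 1, 1], [0, 1, 0, 1])   # every pair is split
print(nmi(*same_grouping), ari(*same_grouping))
print(nmi(*crossed_grouping), ari(*crossed_grouping))
```

The second example gives a negative ARI, which illustrates the lower part of the range mentioned above.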

4.2 Network configurations

The asymmetric convolutional denoising autoencoder (ACDAE) network is trained by minimizing the reconstruction error between the Gaussian-noised images and the reconstructed images, and the encoder network is then used to collect the features Z for the original images. Each convolutional layer is followed by batch normalization and a ReLU activation function. Batch normalization is used to reduce the risk of falling into a local solution. The ACDAE configuration is summarized in Table 2, and the structure is plotted in Fig. 3 in Section 3.1.

Table 2
The network configuration of the encoder in ACDAE

Layers	Configurations
Conv1	Channel 32, Filters 3×3, stride 3
Conv2	Channel 64, Filters 3×3, stride 2
Conv3	Channel 128, Filters 3×3, stride 1
Conv4	Channel 128, Filters 2×2, stride 1
Full 1	128 dimensions
Full 2	128 dimensions
Full 3	128 dimensions
bottleneck	2K

Note that the convolution kernel size in the first convolutional layer is equal to the stride, which extracts local feature information from non-overlapping image patches and reduces computational complexity at the same time. The size of the bottleneck is set to 2K, where K is the number of clusters in the clustering task. A higher-dimensional bottleneck (also called the feature vector) is greatly beneficial for the downstream clustering tasks. Section 4.4.2 presents the experimental results and a detailed analysis of the choice of 2K. Adam is used as the optimizer with a learning rate of 1e-4. In the manifold embedding, the number of nearest neighbors of a feature is set to the number of final clusters K.
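To make the layer configuration in Table 2 more concrete, the short sketch below traces the spatial size of a 28×28 input through the four convolutional layers. The padding mode is not stated in the text, so unpadded ("valid") convolutions are assumed here; with a different padding the sizes would change. The helper names are hypothetical.

```python
# (name, kernel size, stride, output channels) as listed in Table 2.
ENCODER_CONVS = [
    ("Conv1", 3, 3, 32),
    ("Conv2", 3, 2, 64),
    ("Conv3", 3, 1, 128),
    ("Conv4", 2, 1, 128),
]


def conv_output_size(size, kernel, stride):
    """Spatial size after a convolution without padding (an assumption)."""
    return (size - kernel) // stride + 1


def trace_encoder(size=28):
    """Return (layer, height, width, channels) after every convolutional layer."""
    shapes = []
    for name, kernel, stride, channels in ENCODER_CONVS:
        size = conv_output_size(size, kernel, stride)
        shapes.append((name, size, size, channels))
    return shapes


for name, height, width, channels in trace_encoder():
    print(f"{name}: {height}x{width}x{channels}")
```

Under this assumption the last convolutional output flattens to 128 values, the same width as the fully connected layers in Table 2; this is an observation about the sketch, not a statement about the authors' implementation.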

4.3 Experimental comparison with the competitive clustering methods

In this section, the proposed ACMEC method is compared with state-of-the-art approaches on public image datasets. The comparison results on three datasets are listed in Table 3. The performances of the competitors are taken from the literature [9, 10], and the mark (–) means that no result is available. The difference between ACDAE and ACMEC is that ACDAE clusters without manifold learning while ACMEC clusters with it.

Table 3
The comparisons of ACMEC with the competitive clustering methods on three datasets

Method	MNIST-full ACC	MNIST-full NMI	MNIST-test ACC	MNIST-test NMI	Fashion-MNIST ACC	Fashion-MNIST NMI
K-means [6]	0.534	0.500	0.501	0.547	0.474	0.512
SEC [19]	0.804	0.779	0.815	0.790	–	–
DEC [5]	0.844	0.816	0.859	0.827	0.518	0.546
IDEC [26]	0.880	0.867	–	–	0.529	0.557
DBC [27]	0.964	0.917	–	–	–	–
DEPICT [7]	0.965	0.917	0.963	0.915	0.392	0.392
JULE [28]	0.964	0.913	0.961	0.915	0.563	0.608
ClusterGAN [30]	0.964	0.921	–	–	–	–
DERC [9]	0.975	0.927	0.972	0.923	–	–
N2D [10]	0.979	0.942	–	–	0.672	0.684
ACDAE(ours)	0.954	0.916	0.954	0.901	0.643	0.622
ACMEC(ours)	0.979	0.943	0.975	0.926	0.679	0.692

According to Table 3, ACMEC achieves the highest or tied-highest ACC and NMI on the given datasets, which indicates that it is competitive with or better than the state-of-the-art approaches. ACMEC is comparable to the N2D (not too deep clustering) method, with clustering accuracies of 0.979 vs. 0.979 on MNIST-full and 0.679 vs. 0.672 on Fashion-MNIST. In addition, the ACMEC approach is easy to understand, since it is composed of quite simple operations and a not-too-deep network structure.

In order to further evaluate the clustering performances of the different learning algorithms, the Friedman test is employed. First, the ACC performances of six algorithms are ranked, and the results are listed in Table 4. The last row of Table 4 gives the average rank of each algorithm.

Table 4

Sorting ACC performances of six algorithms on three datasets

Datasets	K-means	DEC	DEPICT	JULE	ACDAE	ACMEC
MNIST	6	5	2	3	4	1
MNIST-test	6	5	2	3	4	1
Fashion-MNIST	5	4	6	3	2	1
Average rank	5.67	4.67	3.33	3	3.33	1

The statistic τ_F is computed using Equation (11): $τ_{F} = \frac{(N - 1) τ_{χ^{2}}}{N (k - 1) - τ_{χ^{2}}}$ (11) where the statistic $τ_{χ^{2}} = \frac{12 N}{k (k + 1)} (\sum_{i = 1}^{k} r_{i}^{2} - \frac{k (k + 1)^{2}}{4})$, r_i is the average rank of the i-th algorithm, N is the number of datasets and k is the number of algorithms.

The null hypothesis H₀ is that the performances of the six algorithms are equal. According to Table 4 and Equation (11), we obtain τ_F = 5.39, which is greater than the critical value 3.33 (significance level α = 0.05). Thus, the null hypothesis is rejected, i.e., there exists a statistically significant difference among the six algorithms.

Next, the post-hoc test Nemenyi is adopted to further distinguish the algorithms. The critical range of average rank value is: $CD = q_{α} \sqrt{\frac{k (k + 1)}{6 N}}$ (12)

In this paper, q_α = 2.85 and the critical range is CD = 4.35. According to the average ranks in Table 4, the difference of 4.67 between ACMEC and K-means exceeds the critical range 4.35, so the performances of the two algorithms are significantly different.
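The test procedure of Equations (11) and (12) can be retraced with a few lines of plain Python using the ranks of Table 4. This is a sketch with hypothetical function names; the τ_F it prints depends on how the average ranks are rounded and may differ from the value quoted above, so the comparison against the critical value is the point of interest.

```python
from math import sqrt

ALGORITHMS = ["K-means", "DEC", "DEPICT", "JULE", "ACDAE", "ACMEC"]
# Ranks from Table 4: one row per dataset, one column per algorithm.
RANKS = [
    [6, 5, 2, 3, 4, 1],   # MNIST
    [6, 5, 2, 3, 4, 1],   # MNIST-test
    [5, 4, 6, 3, 2, 1],   # Fashion-MNIST
]


def average_ranks(ranks):
    """Average rank of each algorithm over the datasets."""
    n = len(ranks)
    return [sum(column) / n for column in zip(*ranks)]


def friedman_statistics(avg_ranks, n_datasets):
    """Chi-square statistic and the F-type statistic of Equation (11)."""
    k = len(avg_ranks)
    chi2 = 12 * n_datasets / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    )
    tau_f = (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)
    return chi2, tau_f


def nemenyi_cd(q_alpha, k, n_datasets):
    """Critical range of Equation (12)."""
    return q_alpha * sqrt(k * (k + 1) / (6 * n_datasets))


avg = average_ranks(RANKS)
chi2, tau_f = friedman_statistics(avg, len(RANKS))
cd = nemenyi_cd(2.85, len(ALGORITHMS), len(RANKS))
print("tau_F exceeds the critical value 3.33:", tau_f > 3.33)
for name, rank in zip(ALGORITHMS[:-1], avg[:-1]):
    gap = rank - avg[-1]   # distance to ACMEC's average rank
    print(f"{name}: gap {gap:.2f}, significant: {gap > cd}")
```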

According to the results in Table 4, the Friedman test diagram of the six algorithms is plotted in Fig. 4, where the vertical axis lists the algorithms and the horizontal axis shows the average rank values. For each algorithm, a dot marks the average rank, and the line segment centered on the dot indicates the size of the critical range.

Fig. 4

Friedman test diagram of six algorithms.

In Fig. 4, less overlap between two horizontal line segments means a more significant difference between the algorithms. So, there is a statistically significant difference between our ACMEC and K-means. However, there is no significant difference between ACMEC and the joint algorithm DEC, even though our algorithm is 13.7% higher than DEC in terms of average clustering accuracy on the three datasets.

4.4 Ablation study

Several ablation experiments are extensively conducted to check the contribution of each component.

4.4.1 Manifold mapping for clustering

A group of experiments is carried out to verify the effect of manifold embedding on cluster learning. The clustering performances displayed in Fig. 5 are generated separately by the following four processes.

Pretrain denotes the k-means algorithm directly grouping the features Z without joint training;

Joint represents k-means partitioning the features Z’ that are obtained by jointly updating Z and the clustering until convergence;

Pre-UMAP refers to k-means grouping the features U that are mapped from the features Z using the UMAP manifold mapping algorithm;

Joint-UMAP means k-means partitioning the features U’ that are mapped from the features Z’ using the UMAP algorithm.

Fig. 5

The comparison of clustering performances with or without manifold embedding on the four datasets.

Observing the four subfigures in Fig. 5, we find that (i) Pre-UMAP provides the best clustering performance, by a large margin over the other three competitors, on the four datasets; (ii) Joint is slightly superior to Pretrain and almost identical to Joint-UMAP. This means that manifold embedding is beneficial to cluster learning and improves the corresponding performance.

To further explain the influence of manifold embedding on cluster learning, the four kinds of features Z, Z’, U, U’ are visualized in Fig. 6, where a total of 1000 features are generated from 100 images per cluster randomly chosen from the MNIST dataset.

Fig. 6

T-SNE visualization of features from MNIST dataset.

Fig. 6(a)–(c) show regular and clear cluster boundaries together with some incorrectly placed points, while the boundaries in Fig. 6(d) become irregular. In addition, the points within clusters become scattered in Fig. 6(b) and (d), although the clusters remain separated.

The comparisons and the feature visualization illustrate that manifold embedding affects the downstream clustering task. It is also noted that joint training can improve clustering accuracy, but the accuracy hardly improves further after adding the UMAP manifold, as in Fig. 6(c) and (d).

4.4.2 Feature dimensionality for clustering

The experiments consider the influence of the bottleneck dimensionality 2K (i.e. the size of feature Z) on cluster learning. The clustering accuracies provided by ACMEC as the size of feature Z increases are plotted in Fig. 7 for four datasets.

Fig. 7

The influence of bottleneck dimensionality on clustering accuracy.

Both curves in Fig. 7(a) and (b) have a local optimum at 20, which is exactly 2K (K = 10 is the number of clusters). The other two curves, in Fig. 7(c) and (d), increase slightly as the size of feature Z increases. According to these observations, the size of feature Z is finally set to 2K, twice the number of clusters.

4.4.3 ACMEC network for clustering

Here, a group of experiments is carried out to compare our ACMEC network with a fully connected autoencoder from the literature [5] (denoted FAE). The GMM clustering result reported for every dataset is the best of 30 runs. The ACC, NMI, and ARI are shown in Table 5.

Table 5
Comparison of FAE and ACMEC networks on eight datasets

Dataset-Network	ACC	NMI	ARI
MNIST-FAE	0.976	0.939	0.951
MNIST-ACMEC	0.979 (↑0.03)	0.941	0.953
Fashion-FAE	0.633	0.697	0.527
Fashion-ACMEC	0.679 (↑4%)	0.692	0.539
COIL-20-FAE	0.813	0.903	0.803
COIL-20-ACMEC	0.834 (↑2%)	0.916	0.819
COIL-100-FAE	0.752	0.897	0.708
COIL-100-ACMEC	0.775 (↑2%)	0.913	0.746
BIO-FAE	0.913	0.953	0.859
BIO-ACMEC	0.938 (↑2%)	0.962	0.886
CAS-FAE	0.865	0.937	0.793
CAS-ACMEC	0.900 (↑4%)	0.946	0.825
UMISTS-FAE	0.532	0.73	0.439
UMISTS-ACMEC	0.708 (↑17%)	0.82	0.639
IMM-FAE	0.55	0.764	0.379
IMM-ACMEC	0.600 (↑5%)	0.767	0.389

The clustering performances show that ACMEC learns better feature representations than FAE. For example, the accuracy on the UMISTS dataset improves by about 17 percentage points. These comparisons support the effectiveness of the asymmetric convolutional denoising autoencoder for feature representation.
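The accuracy gains can be checked with a few lines of arithmetic. The sketch below copies the ACC values from Table 5 and computes the absolute gain of ACMEC over FAE in percentage points; the variable and function names are illustrative. The annotations in Table 5 are whole-number approximations of these differences.

```python
# ACC values from Table 5, stored as (FAE, ACMEC) pairs.
acc = {
    "MNIST": (0.976, 0.979),
    "Fashion": (0.633, 0.679),
    "COIL-20": (0.813, 0.834),
    "COIL-100": (0.752, 0.775),
    "BIO": (0.913, 0.938),
    "CAS": (0.865, 0.900),
    "UMISTS": (0.532, 0.708),
    "IMM": (0.550, 0.600),
}


def acc_gain_points(fae: float, acmec: float) -> float:
    """Absolute ACC gain of ACMEC over FAE, in percentage points."""
    return round((acmec - fae) * 100, 1)


gains = {name: acc_gain_points(fae, acmec) for name, (fae, acmec) in acc.items()}
best_dataset = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
print("largest gain:", best_dataset)
```

The gain is positive on all eight datasets and largest on UMISTS, while on MNIST the two networks differ by only 0.3 points.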

4.4.4 Clustering performances of ACMEC

The clustering performances of the ACMEC method on six datasets are listed in Table 6.

Table 6
The clustering performances of ACMEC on six datasets

Datasets	ACC	NMI	ARI
COIL-20	0.834	0.916	0.819
COIL-100	0.775	0.913	0.746
BIOID	0.938	0.962	0.886
CAS	0.900	0.946	0.825
UMISTS	0.708	0.820	0.639
IMM	0.600	0.767	0.389

Finally, 10 images per class are randomly selected from the MNIST dataset to test the proposed ACMEC method, and the clustering results are visualized in Fig. 8.

Fig. 8

Clustering results of 100 images using ACMEC method.

For these 100 images, the ACC is 0.93. The incorrect assignments involve the handwritten digits 9, 1, 5 and 4. Some of them are genuinely difficult to distinguish, such as the misclustered digit 5 in the second-to-last row. Other errors, for example the digit 4 in the first row and the digit 9 in the last row of Fig. 8(b), could be corrected if the feature disentanglement ability were further improved.
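ACC is the fraction of samples whose cluster, under the best one-to-one mapping between clusters and classes, matches the true label. A minimal sketch follows. It finds the mapping by brute force over permutations, which is practical only for a handful of clusters; for the 10 MNIST classes a Hungarian solver such as `scipy.optimize.linear_sum_assignment` would normally be used instead. The function name and the toy labels are illustrative and do not come from Fig. 8.

```python
from itertools import permutations


def clustering_accuracy(labels_true, labels_pred):
    """Best-mapping clustering accuracy via exhaustive search (small K only)."""
    if len(labels_true) != len(labels_pred) or len(labels_true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    classes = list(dict.fromkeys(labels_true))
    clusters = list(dict.fromkeys(labels_pred))
    # Pad with a sentinel so every cluster can receive a distinct target
    # even when there are more clusters than classes.
    unused = object()
    targets = classes + [unused] * max(0, len(clusters) - len(classes))
    best_hits = 0
    for assignment in permutations(targets, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        hits = sum(mapping[p] == t for t, p in zip(labels_true, labels_pred))
        best_hits = max(best_hits, hits)
    return best_hits / len(labels_true)


# Toy example with three classes: 8 of 10 samples agree under the best mapping.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
found = [1, 1, 0, 0, 0, 0, 2, 2, 2, 1]
print(clustering_accuracy(truth, found))  # 0.8
```

With 100 test images, an ACC of 0.93 corresponds to 93 images falling in the cluster mapped to their true digit.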

5 Conclusion

In this paper, we proposed the ACMEC clustering method, which extracts discriminative manifold features to enhance clustering performance. ACMEC consists of three main stages: feature transformation, manifold embedding, and cluster assignment. Feature transformation extracts features from the input images, and an asymmetric convolutional denoising autoencoder serves as the feature transformation network. The manifold embedding of the features benefits visual analysis and improves clustering performance: the UMAP manifold algorithm improves the dispersion of the embedded features, and GMM clustering then achieves competitive clustering performance. Experimental results on nine challenging datasets demonstrated that ACMEC achieves significant improvement over the state-of-the-art methods.

However, in unsupervised learning the performance gap widens as the number of clusters increases. Future work will investigate how to improve the clustering performance of the ACMEC method for larger numbers of clusters.

Acknowledgments

This work was supported by the Hebei Province Introduction of Studying Abroad Talent Funded Project (No. C20200302), and partially supported by the National Natural Science Foundation of China (No. 62072024).

References

1. Peng Xi, Xiao Shijie, Feng Jiashi, et al., Deep Subspace Clustering with Sparsity Prior, IJCAI’16 Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (2016), 1925–1931.

2. Yann Lecun and Leon Bottou, Gradient-based learning applied to document recognition, Proceedings of the IEEE (1998), 2278–2324.

3. Geoffrey Hinton and Ruslan Salakhutdinov, Reducing the dimensionality of data with neural networks, Science (2006), 504–507.

4. Yang Bo, Fu Xiao, Sidiropoulos Nicholas, et al., Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering, ICML’17 Proceedings of the 34th International Conference on Machine Learning (2017), 3861–3870.

5. Xie Junyuan, Girshick Ross and Farhadi Ali, Unsupervised deep embedding for clustering analysis, International Conference on Machine Learning (2014), 478–487.

6. Haeusser Philip, Plapp Johannes, Golkov Vladimir, et al., Associative deep clustering: training a classification network with no labels, Proc of the German Conference on Pattern Recognition (2018), 18–22.

7. Ghasedi Dizaji Kamran, Herandi Amirhossein, Deng Cheng, et al., Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization, Proceedings of the IEEE International Conference on Computer Vision (2017), 5736–5745.

8. Vincent Pascal, Larochelle Hugo, Bengio Yoshua, et al., Extracting and composing robust features with denoising autoencoders, International Conference on Machine Learning (2008), 1096–1103.

9. Yan Yuanjie, Hao Hongyan, Xu Baile, et al., Image clustering via deep embedded dimensionality reduction and probability-based triplet loss, IEEE Transactions on Image Processing (99) (2020), 5652–5661.

10. Ryan McConville, Raúl Santos-Rodríguez, Piechocki, et al., Deep Clustering via Clustering the Local Manifold of an Autoencoded Embedding, 25th International Conference on Pattern Recognition (2021), 5145–5152.

11. Tenenbaum J.B., Silva V.D. and Langford J.C., A global geometric framework for nonlinear dimensionality reduction, Science (2000), 2319–2323.

12. van der Maaten and Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research (2008), 2579–2605.

13. Leland McInnes and John Healy, UMAP: uniform manifold approximation and projection for dimension reduction, The Journal of Open Source Software (2018), 861–879.

14. Tenenbaum Joshua, Silva Vin De and Langford John, A global geometric framework for nonlinear dimensionality reduction, Science (2000), 2319–2323.

15. van der Maaten Laurens and Hinton Geoffrey, Visualizing data using t-SNE, Journal of Machine Learning Research (2008), 2579–2605.

16. Roweis Sam and Saul Lawrence, Nonlinear dimensionality reduction by locally linear embedding, Science (2000), 2323–2326.

17. Abdi Hervé and Williams Lynny, Principal component analysis, Wiley Interdisciplinary Reviews Computational Statistics (2010), 433–459.

18. Hu Jie, Li Shen, Sun Gang, et al., Squeeze-and-excitation networks, IEEE Transactions on Pattern Analysis Machine Intelligence (2017), 2011–2023.

19. Andrew Ng, Jordan Michael, Weiss Yair, et al., On spectral clustering: analysis and an algorithm, Advances in Neural Information Processing Systems (2001), 849–856.

20. Wu Jianlong, Lin Zhouchen, Zha Hongbin, et al., Essential tensor learning for multi-view spectral clustering, IEEE Transactions on Image Processing (2018), 5910–5922.

21. Bengio Yoshua, Lamblin Pascal, Popovici Dan, et al., Greedy layer-wise training of deep networks, Advances in Neural Information Processing Systems (2006), 153–160.

22. Yao Guanhong and Deng Cai, Accelerating locality preserving nonnegative matrix factorization, Proceedings of the 21st ACM International Conference on Information and Knowledge Management (2012), 2271–2274.

23. Yang Chun, Zhang Xiaorong, Jiao Licheng, et al., Self-tuning semi-supervised spectral clustering, International Conference on Computational Intelligence and Security (2008), 1–5.

24. Xie Xingyu, Wu Jianlong, Liu Guangcan, et al., Differentiable linearized ADMM, International Conference on Machine Learning (2019), 6902–6911.

25. Huang Peihao, Huang Yan, Wang Wei, et al., Deep embedding network for clustering, International Conference on Pattern Recognition (2014), 1532–1537.

26. Guo Xifeng, Gao Long, Liu Xinwang, et al., Improved deep embedded clustering with local structure preservation, Twenty-Sixth International Joint Conference on Artificial Intelligence (2017), 1753–1759.

27. Li Fengfu, Hong Qiao and Zhang Bo, Discriminatively boosted image clustering with fully convolutional autoencoders, Pattern Recognition (2018), 161–173.

28. Yang Jianwei, Parikh Devi and Batra Dhruv, Joint unsupervised learning of deep representations and image clusters, IEEE Conference on Computer Vision & Pattern Recognition (2016), 5147–5156.

29. Masci Jonathan, Meier Ueli and Cireşan Dan, Stacked convolutional auto-encoders for hierarchical feature extraction, International Conference on Artificial Neural Networks, Springer (2011), 52–59.

30. Ghasedi Kamran, Wang Xiaoqian, Deng Cheng, et al., Balanced self-paced learning for generative adversarial clustering network, Conference on Computer Vision and Pattern Recognition (CVPR) (2019), 4391–4400.

31. Li Yunfan, Hu Peng, Liu Zitao, et al., Contrastive clustering, Association for the Advancement of Artificial Intelligence (AAAI) (2021), 3–10.

32. Zoran Zivkovic, et al., Improved adaptive gaussian mixture model for background subtraction, Proceedings of the 17th International Conference on Pattern Recognition (2004), 28–31.

33. Tao Li, Chris Ding, et al., The relationships among various nonnegative matrix factorization methods for clustering, ICDM (2006), 362–371.

34. Alexander Strehl and Joydeep Ghosh, Cluster ensembles a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research (JMLR) (2002), 583–617.

35. Steinley Douglas, Properties of the Hubert-Arabie adjusted rand index, Psychological Methods (2004), 386–396.