Abstract
Multi-view clustering aims to group similar samples into the same clusters and dissimilar samples into different clusters by integrating heterogeneous information from multi-view data. Non-negative matrix factorization (NMF) has been widely applied to multi-view clustering owing to its interpretability. However, most NMF-based algorithms only factorize multi-view data based on the shallow structure, neglecting complex hierarchical and heterogeneous information in multi-view data. In this paper, we propose a deep multiple non-negative matrix factorization (DMNMF) framework based on AutoEncoder for multi-view clustering. DMNMF consists of multiple Encoder Components and Decoder Components with deep structures. Each pair of Encoder Component and Decoder Component are used to hierarchically factorize the input data from a view for capturing the hierarchical information, and all Encoder and Decoder Components are integrated into an abstract level to learn a common low-dimensional representation for combining the heterogeneous information across multi-view data. Furthermore, graph regularizers are also introduced to preserve the local geometric information of each view. To optimize the proposed framework, an iterative updating scheme is developed. Besides, the corresponding algorithm called MVC-DMNMF is also proposed and implemented. Extensive experiments on six benchmark datasets have been conducted, and the experimental results demonstrate the superior performance of our proposed MVC-DMNMF for multi-view clustering compared to other baseline algorithms.
Keywords
Introduction
Multi-view data collected from different views or multiple data sources are prevalent in many real-life applications [1]. Different views generally contain different features, and they complement each other to construct more comprehensive information for samples. Therefore, it is more reasonable to integrate the features of multiple views together to enhance the performance of learning tasks (such as clustering and classification) rather than only depending on a single view. Multi-view learning has been explored in different fields, such as social multimedia [18], computer vision [4, 25], and bioinformatics informatics [20].
As one of the primary research directions of multi-view learning, multi-view clustering attracts more and more attention because it is capable of analyzing a large amount of unlabeled multi-view data. Multi-view clustering is devoted to grouping samples into the different clusters by exploring heterogeneous information in multi-view data. However, multi-view clustering is generally faced with the following two challenges. On the one hand, data of different views are often heterogeneous, such as the Fourier feature and texture feature describing an image. The heterogeneity requires multi-view clustering algorithms not only to learn consensus information but also to capture complementary information amongst different views; On the other hand, there may be a large amount of hierarchical information between multi-view data and the obtained low-dimensional representation. Taking face datasets as an example, face recognition depends not solely on the appearance difference of face but also on some other attributes such as facial posture and facial expression. The traditional one level clustering algorithms transform the complex hierarchical information into a single layer mapping so that they cannot obtain the desired clustering results.
In order to overcome the problem of data heterogeneity, various multi-view clustering algorithms [12, 14, 21, 24, 25, 30, 31] have also been developed in recent years. Amongst them, multi-view clustering algorithms based on NMF [13, 14, 25, 30, 31] have been proposed owing to its interpretability and representation ability. Given a non-negative data matrix
Recently, deep AutoEncoder [8] has received widespread attention due to its wide application, such as network embedding [7] and pattern recognition [6, 17]. By combining Encoder and Decoder Components, deep AutoEncoder can reduce the gap between the lower-level abstractions and higher-level abstractions of the input data, so that the learned abstract representation can be seen as a more compact representation of the input data. Considering the superior structure of deep AutoEncoder as well as the shortcomings of the existing deep MVC algorithms, it is reasonable to design a multi-view clustering framework based on Encoder and Decoder Components.
Based on the discussion above, in this paper, we propose a new deep multi-view clustering framework based on AutoEncoder [8], named deep multiple non-negative matrix factorization (DMNMF), as shown in Fig. 1. It factorizes multi-view data (
In addition, the local geometric information within each view is also an essential factor affecting clustering performance. If data samples are similar in the original data space, they are also similar in the new low-dimensional feature space [3]. To preserve the local geometric information of each view, graph regularizers [3] are introduced into our DMNMF framework. Specifically, we encode the geometric information of each view by constructing the corresponding nearest neighbor graph. If data samples are connected in the graph, they are similar in low-dimensional feature space. Furthermore, the symmetry between Encoder Components and Decoder Components naturally imposes the orthogonal constraint on the basis matrix, which further makes the obtained consensus representation matrix smooth.
Based on the proposed DMNMF framework, we propose and implement a multi-view clustering algorithm, and carry it on six real-world datasets to evaluate the clustering performance by comparing with other baseline algorithms.
The framework of our proposed DMNMF. 
The main contributions can be summarized in the following four aspects:
A DMNMF framework is proposed to cluster multi-view data. DMNMF framework can integrate heterogeneous information across multi-view data through the common low-dimensional representation and capture hierarchical information contained in multi-view data by using AutoEncoders with deep structures. Graph regularizers are introduced to the DMNMF framework. This enables the learned low-dimensional representation to preserve more comprehensive local geometric information. Meanwhile, the symmetry between Encoder Components and Decoder Components makes the learned representation much smoother. A novel multi-view clustering algorithm is proposed and implemented based on the proposed DMNMF framework. As far as we know, this is the first algorithm to introduce Encoder Components and Decoder Components with the deep structure for multi-view clustering. Extensive experiments have been carried out on six benchmark datasets. The results demonstrate that the proposed algorithm is superior to other multi-view clustering algorithms in terms of the accuracy (ACC) and the normalized mutual information (NMI) metrics.
The rest of the paper is organized as follows. Section 2 offers a brief overview of related work. The details of DMNMF are presented in Section 3. Section 4 provides extensive experiments and results, and in Section 5, conclusions are given.
Multi-view clustering
In recent years, various multi-view clustering algorithms have been investigated from different aspects and they are roughly classified into six categories. The first category of algorithm is graph-based clustering algorithms [24], which constructed the affinity graph of each view, fused them into a unified graph and used some graph techniques for clustering. The second category of algorithm is spectral clustering-based algorithms [12], which looked for the best graph adjacency matrix first and then used spectral clustering to obtain the cluster result. The third category of algorithm is subspace clustering-based algorithms [9], which projected multi-view data into a common coefficient matrix, and then use spectral clustering or k-means clustering. The fourth category of algorithm is NMF-based algorithms [14, 15, 23, 25, 29, 30, 31], which applies NMF to each view and then fused representation matrices learned from different views to be a common consensus matrix. The fifth category of algorithm is ensemble learning-based algorithms [21], which clustered each view into a Basic Partitions (BPs) and integrated multiple BPs into a consensus one. The sixth category of algorithm is canonical correlation analysis-based algorithms [5], which projected multi-view data into multiple low-dimensional representations and maximized the correlated information among them. The above six categories of algorithms are not strictly exclusive, and some algorithms may be compatible with multiple characteristics at the same time.
Amongst the above six categories algorithms, NMF-based multi-view algorithms have attracted more and more attention due to its powerful interpretability. Liu et al. [14] developed a joint multi-view clustering algorithm based on NMF (MultiNMF). MultiNMF pushed the learned information of each view towards a common consensus representation matrix rather than fixing it. To reveal the geometric structure hidden in multi-view data space, a graph regularized multi-view NMF algorithm [25] was developed by considering the local geometric information of data space of each view. Moreover, a multi-manifold regularized NMF (MMNMF) algorithm [31] was also proposed to cluster multi-view data, which incorporates consensus manifold and consensus coefficient matrix with multi-manifold regularization to respect the local geometric information of the multi-view data space. Wang et al. [23] incorporated the concept factorization to handle the data containing negative. However, these algorithms only preserved the local geometric information of data space, neglecting the local geometric information of feature space. To preserve the local geometric information of the data space and the feature space simultaneously, Luo et al. [15] developed a dual-regularized multi-view NMF algorithm that jointly factorized data matrices of different views into the basis matrices and a common representation matrix (DMvNMF).
It is worth noting that ensemble clustering applies the idea of ensemble learning in supervised learning to multi-view clustering to obtain reasonable clustering performance. Therefore, ensemble clustering combined multiple clustering results obtained independently from single-view data as the final results, ignoring the connection between multiple views. On the contrary, MVC algorithms gradually integrates the information in multi-view data during the process learning.
Multi-view clustering via deep matrix factorization
Most traditional multi-view clustering algorithms are intrinsically shallow frameworks and cannot analyze intricate natural data well. Self-Organizing Mapping (SOM) [10] is a neural network-based clustering algorithm. It is essentially a two-layer neural network that includes an input layer and an output layer (competition layer), so it cannot capture complex hierarchical information. Meanwhile, SOM is single-view clustering algorithm, failing to capture the connection between views. In recent years, deep learning has achieved great success in many fields due to its powerful feature learning capabilities. Driven by studies in the field of deep learning, many algorithms on deep matrix factorization have also been proposed [22, 28]. Ye et al. [28] proposed a deep NMF algorithm based on AutoEncoder, called DaNMF, for community detection. This algorithm integrates an Encoder Component and a Decoder Component with deep structures, and these two components are mutually constrained and guided in the factorization process. However, this algorithm is also only applicable to the single-view data field, ignoring heterogeneous information in multi-view data. In [30], Zhao et al. proposed a deep multi-view clustering, called DMF-MVC, and tried to capture different levels of attribute information through different layers. Xu et al. [26] developed a semi-supervised deep multi-view factorization algorithm called Deep Multi-View Concept Learning (DMCL), which hierarchically performs non-negative factorization of the data. DMCL not only takes into account part of the label information of the data but also explicitly models the consistent and complementary information among multi-view data. However, these two algorithms are only considered as Decoder Components and ignore the importance of Encoder Components, so that they cannot inherit the ability of representation learning of AutoEncoder. Besides, DMCL requires partial label information, which is not suitable for clustering without label information.
In this paper, compared with traditional MVC algorithms, we capture the hierarchical information of multi-view data in a deep factorization manner. Different from the existing deep MVC algorithms which are only seen as the Decoder Components, our proposed framework jointly learns a more compact representation by integrating Encoder and Decoder Components in a unified framework without known label information, where these two Components are mutually constrained and enhanced in a way of sharing weight to achieve a better optimal solution.
Deep multiple NMF for multi-view clustering
In this section, we first give some necessary notations, and then introduce DMNMF framework in detail, and discuss the corresponding iterative update optimization scheme.
Notations and problem statement
Let
Let
DMNMF framework
The proposed DMNMF framework is exhibited in Fig. 1. DMNMF framework is composed of multiple Encoder Components and Decoder Components with deep structure. Each pair of Encoder Component and Decoder Component are used to hierarchically factorize the input data from each view to capture the hierarchical information, and all Encoder and Decoder Components are integrated into an abstract level to learn a common low-dimensional representation, aiming to capture the heterogeneous information across multi-view data. In each pair of Decoder Components and Encoder Components, Decoder Component and Encoder Component are symmetrical and share the same basis matrix. At the abstract level, the common low-dimensional representation can be shared and updated simultaneously by Encoder and Decoder Components of different views.
Decoder Components
In DMNMF framework, each of affinity graphs
Where
The Eq. (1) takes a hierarchy of
By hierarchy factorizations, the samples from the same cluster are forced to be closer layer by layer in the low-dimensional representation space. To learn the basis matrices
Encoder Components in DMNMF framework try to project affinity graphs into the low-dimensional representation space with the help of the basics matrices
In order to preserve the local geometric information of each view, graph regularizers are introduced to our proposed framework, as follows:
where
Equation (5) integrates the local geometric information across different views to make the final common representation matrix more consensus. In the proposed DMNMF framework,
Based on the above discussion, by combining Encoder Components, Decoder Components, and the graph regularizers, the unified loss function of DMNMF is given by following Eq. (3.2.4):
Equation (3.2.4) is not convex in both
Updating the basis matrices
By keeping
where
To optimize Eq. (3.3.1), let
The partial derivatives of
where
Where
Based on Eq. (10), we can get the update rule for
The steps of optimizing
The steps of optimizing
Hereto, all the updating rules have been derived. Based on the above analysis, we propose a multi-view clustering algorithm based on DMNMF framework, which is summarized in Algorithm 1. After obtaining the final common representation
Algorithm 1 consists of three steps, i.e., constructing affinity graphs step, the pre-training step, and the fine-tuning step. For simplify, we set the input feature dimensionalities for all views to
Experiments and results
In this section, we implement the proposed MVC-DMNMF algorithm by using Matlab language. In order to demonstrate the effectiveness of the proposed algorithm, a series of experiments are conducted on six benchmark datasets, including the comparison of performance between our algorithm and baseline algorithms, the visualization of low-dimensional representation, the investigation of the parameter sensitivity and the convergence of our algorithm. All experiments are run on the platform of Ubuntu Linux 16.04 with Intel(R) Core(TM) i7-7820X CPU at 3.60GHz and 64GB memory size.
Datasets
Six benchmark datasets include 3Sources,1
wurlhttp://mlg.ucd.ie/datasets/3sources.html.
Statistics of the six datasets
3Sources. It is collected from three popular online news sources: BBC, Reuters, and Guardian. From February to April 2009, there were 948 news articles covering 416 different news stories. Of these reports, 169 were reported from three sources. Each story was manually annotated with one of the six topical labels: business, entertainment, health, politics, sport and technology. BBC dataset (BBC). It consists of 685 documents collected from the BBC news website. Each document was split into four segments and was manually annotated with one of five topical labels. BBCSport dataset (BBCSport). It consists of 544 documents collected from the BBCSport website. Each document was split into two segments and was manually annotated with one of the five topical labels. Newsgroup dataset (NGs). It is a subset of the 20 Newsgroup datasets. It consists of newsgroup documents. Each raw document was pre-processed with three different feature extraction methods (giving three views), and was annotated with one of five topical labels. Wikipedia. It consists of 693 articles collected from Wikipedia’s featured articles collection. We have split them into sections based on section headings, and assign each image to the section in which it was placed by the authors. Then this dataset was pruned to keep only sections that contained a single image and at least 70 words. UCI Handwritten Digit dataset (Digits). This handwritten digits (0–9) data is from the UCI repository. The dataset consists of 2000 examples, including two views, the first is the 76 Fourier coefficients, and the second view is the 240 pixel averages in 2
The clustering results are evaluated by comparing the obtained cluster labels with the original labels provided by the datasets. Two standard metrics, the accuracy (ACC) and the normalized mutual information metric (NMI) [27], are selected to measure the effectiveness of the proposed algorithm. ACC is used to compute the percentage of agreement between the true labels and the clustering labels, which is defined as:
where
The normalized mutual information is employed to measure the similarity of two clusters, which is defined as:
Where
For these two metrics (ACC and NMI), the larger value indicates better clustering performance.
We use the following four baseline algorithms:
DaNMF:6
The codes of all baseline algorithms are downloaded from the GitHub home page of authors. We adjust the parameters of all baseline algorithms by greedy search to obtain their best clustering performance.
Table 2 gives the layer size configuration and the values of the regularization parameter
The layer size configuration of MVC-DMNMF and the values of the regularization parameter
The layer size configuration of MVC-DMNMF and the values of the regularization parameter
ACCs (mean (standard deviation)) of algorithms on six datasets
NMIs (mean (standard deviation)) of algorithms on six datasets
From Tables 3 and 4, we derive the following observations:
ACCs and NMIs of MVCC and MVC-DMNMF are higher than those of DaNMF-BST on all datasets. ACCs and NMIs of GBS-KO are higher than those of DaNMF-BST except for BBC, BBCSport, and Wikipedia datasets. Among four multi-view clustering algorithms, only DMF-MVC has poor performance (its ACCs and NMIs are only higher than those of DaNMF-BST on Wikipedia dataset). The above experimental comparison shows that in most cases, the clustering algorithm that combines multi-view information can obtain excellent clustering performance than only using single-view information. ACCs and NMIs of MVC-DMNMF are higher than those of GBS-KO on all datasets except for NGs dataset. This shows that our algorithm is superior to GBS-KO in the subsequent processing steps of the affinity graph, which proves the effectiveness of our proposed algorithm to a certain extent. Compared with MF-based algorithms MVCC and DMF-MVC, MVC-DMNMF performs better or comparable in most datasets, which proves the importance of introducing the structure of Encoder Component and the necessity of considering matrix factorization of Encoder Component and Decoder Component simultaneously to obtain an effective low-dimensional representation.
Furthermore, in order to verify whether MVC-DMNMF can obtain lower encoder error and reconstruction error, we calculated the encoder error and reconstruction error of DMF-MVC and MVC-DMNMF according to
Error Comparison on six datasets
2D visualization of the representations learned in different layers of DMNMF on BBCSport dataset.
2D visualization of consensus representations on BBCSport dataset obtained from four compared algorithms: DaNMF, DMF-MVC, MVCC, and GBS-KO.
To show whether MVC-DMNMF can capture complex hierarchical information, we visualize the learned hidden representation (i.e., the feature matrix
Further, in order to verify that our proposed algorithm can capture more heterogeneous information, the corresponding visualization results on BBCSport dataset are shown in Fig. 3, where the common representations obtained by our proposed algorithm and the other four algorithms are visualized respectively: (a) DaNMF-BST, (b) DMF-MVC, (c) MVCC, (d) GBS-KO and (e) DMNMF. In (a), (b) and (d) of Fig. 3, there are many overlaps between sample points of different colors, and the sample points of the same color appear very scattered, which is difficult to identify which cluster the samples belong to. Although both (c) and (e) show a more compact representation, there are fewer overlaps between sample points of different colors in (e) and there is a clearer clustering outline between the different clusters. In addition, compared with (c) and (f) of Fig. 2 that only utilize single view information, (e) of Fig. 3 has a clearer outline and fewer sample points overlapped with the help of multi-view information. This indicates that MVC-DMNMF can capture more heterogeneous information and preserve the latent geometric structure embedded in multi-view data space.
In our proposed MVC-DMNMF, there are three parameters: the graph regularizers constraints parameter
Figure 4 shows the effect of the parameter
Clustering results evaluated by ACC and NMI with the regularized parameter 
Figure 5 shows the effect of the number of hidden layers on clustering performance, which is turned from 1 to 21. Note that since Encoder Components and Decoder Components are symmetrical, the number of hidden layers of the overall algorithm is increased by two layers when Decoder Components add a hidden layer. The number of the
Figure 6 shows how clustering performance changes with the number of hidden layer nodes increasing. This experiment sets up a three hidden layer structure of DMNMF. The number of the first layer nodes of Encoder Components and Decoder Components are
Overall, MVC-DMNMF is less affected by the number of hidden layers and the number of nodes in each hidden layer. This demonstrates the superiority of our proposed algorithm, which can achieve promising performance for multi-view clustering, though, without the very deep layer structures.
ACCs and NMIs of DMNMF on all datasets with the number of hidden layers.
ACCs and NMIs of DMNMF on all datasets with the number of hidden layer nodes.
Convergence curve of algorithms DMNMF.
Figure 5 shows the convergence curve of the proposed algorithm. In each sub-figure, the x-axis and the y-axis denote the iteration number and the corresponding objective function value, respectively. As can be observed from Fig. 7, the objective function value decreases sharply with 50 iterations and then becomes steadily afterward except Digits dataset. For Digits dataset, the convergence of the algorithm is relatively slower compared with other datasets because the number of samples is large. Nevertheless, after 300 iterations, the algorithm converges. This indicates that MVC-DMNMF converges efficiently.
Conclusions
Heterogeneity and hierarchicy are two challenges of multi-view clustering. This study has investigated how to consider both heterogeneity and hierarchicy simultaneously in the process of clustering. To this end, we propose the DMNMF framework composed of multiple Encoder Components and Decoder Components with deep structures, where the hierarchical information is captured by the deep structures, and heterogeneous information across multi-view data are combined by an abstract level. We also develop and implement MVC-DMNMF based on DMNMF framework, and carry out a series of experiments on six real-world datasets. The experimental results demonstrate that DMNMF is capable of uncovering the hierarchical information of input data in a deep factorization way and can capture heterogeneous information and preserve the latent geometric structure embedded in multi-view data space. Therefore, MVC-DMNMF can improve clustering performance than single-view, Decoder-only Component, and shallow structure clustering algorithms.
In this paper, the parameter of graph regularizers is set by users, which is a difficult task. So how to set the parameter of graph regularizers automatically is our next work.
Footnotes
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61762090, 61262069, 61472346, and 61662086), The Natural Science Foundation of Yunnan Province (2016FA026, 2015FB114), the Project of Innovative Research Team of Yunnan Province (2018HC019), and Program for Innovation Research Team (in Science and Technology) in University of Yunnan Province (IRTSTYN), the Education Department Foundation of Yunnan Province (2019J0005) and Yunnan University’s Research Innovation Fund for Graduate Students.
