Auto-weighted concept factorization for joint feature map and data representation learning

Abstract

Concept factorization (CF) is an effective matrix factorization model which has been widely used in many applications. In CF, the linear combination of data points serves as the dictionary, so that CF can be performed in both the original feature space and the reproducing kernel Hilbert space (RKHS). The conventional CF treats each dimension of the feature vector equally during the data reconstruction process, which might violate the common sense that different features have different discriminative abilities and therefore contribute differently in pattern recognition. In this paper, we introduce an auto-weighting variable into the conventional CF objective function to adaptively learn the corresponding contributions of different features and propose a new model termed Auto-Weighted Concept Factorization (AWCF). In AWCF, on one hand, the feature importance can be quantitatively measured by the auto-weighting variable, in which the features with better discriminative abilities are assigned larger weights; on the other hand, we can obtain a more efficient data representation to depict its semantic information. The detailed optimization procedure for the AWCF objective function is derived, and its complexity and convergence are also analyzed. Experiments are conducted on both synthetic and representative benchmark data sets, and the clustering results demonstrate the effectiveness of AWCF in comparison with some related models.

Keywords

Concept factorization; auto-weighting; feature map; data representation; clustering

1 Introduction

Matrix factorization represents a group of models to decompose a target matrix into some factor matrices among which non-negative matrix factorization (NMF) is one of the most famous models. Mathematically, NMF aims to approximate a data matrix by the product of two non-negative matrices; one is called the dictionary matrix and the other is the representation coefficient matrix [1]. In the past decades, a lot of efforts have been made on NMF; therefore, many improved model variants and new applications were developed such as the GNMF [2], NLCF [3], GDNMF [4], SILR-NMF [5], PFNMF [6] and FNMFG [7]. There are some review papers from which we can find the recent advances on NMF [8, 9].

As a popular low-rank feature extraction model, NMF can only transform the data into its coefficient representation linearly, which cannot characterize the possible nonlinear structure of data. To get rid of this limitation, Xu and Gong proposed the concept factorization (CF) model in which the dictionary matrix is set as the linear combination of samples [10]. Due to this improvement, CF can be performed in the RKHS and accordingly acquires the ability of nonlinear modeling. Since the CF model was proposed, researchers have been working on improving its performance or applying it to new applications. Below we give a short review of its recent advances. Models such as the locally consistent concept factorization [11], dual-graph regularized concept factorization [12] and graph regularized multilayer concept factorization [13] were proposed to enforce the learned coefficient representation matrix in CF to be consistent with the data manifold. Inspired by local coordinate coding, the connections between the dictionary matrix (i.e., the linear combination of samples) and the coefficient matrix were extensively explored by models such as the local coordinate concept factorization [14], its graph regularized version [15] and its adaptive weighting version [16]. By jointly optimizing the graph matrix to make full use of the data correlation between multiple views, Zhan et al. proposed a CF-based multiview clustering method for data integration [17]. Liu et al. proposed a semi-supervised matrix factorization model (termed constrained CF, CCF) to constrain the extracted data concepts to be consistent with the given labels [18]. To further improve the performance of CCF, Zhang et al. proposed a robust semi-supervised adaptive concept factorization (RS²ACF) framework [19]. On one hand, it is simultaneously stable to small entry-wise noise and robust to sparse errors; on the other hand, it can propagate the available label information to unlabeled data by learning an explicit label indicator. Sometimes, pairwise constraints can be treated as semi-supervised information as well and used to build semi-supervised CFs [20, 21]. Under the CF framework, Peng et al. proposed a joint structured graph learning and clustering model [22].

Though CF has gained increasing popularity in diverse fields, most of its existing variants directly commit to reconstructing the original data. This neglects the common case that real-world data may have only a few meaningful features but many redundant or even corrupted features. Therefore, it is not reasonable to treat all features as equally important. In other words, if some dimensions of the feature vector are noisy, it is meaningless to reconstruct them accurately. Accordingly, feature weighting has been a hotspot in many fields such as machine learning and data mining. For example, Peng et al. improved the data gravitation model by using discrimination and redundancy to measure the importance of features [23]. In [24], Phan et al. proposed to combine the genetic algorithm and the support vector machine for feature weighting and parameter optimization in classification tasks. To achieve the purpose of feature selection, a cluster-dependent feature-weighting mechanism was introduced to reflect the within-cluster degree of relevance of a given feature [25]. Besides these shallow models, deep learning models are also competent in feature weighting tasks. A contextual reweighting network (CRN), which can predict the importance of each region in the feature map, was proposed for the image geo-localization task [26]. In order to identify the critical frequency bands and channels in EEG-based emotion recognition, Zheng et al. employed the deep belief network to learn the weights of all feature dimensions [27]. In [28], Nguyen et al. introduced a new convolutional neural network equipped with feature weighting and boosting abilities for few-shot foreground object segmentation in images. Generally, the existing feature weighting measures such as [23, 25] are of limited adaptability, while the deep learning models are of insufficient interpretability.

In this paper, we propose to improve the performance of CF model from the perspective of investigating the contributions of each feature dimension in discriminative data representation learning. Specifically, when treating CF as a data reconstruction model, it is more appropriate to assign a small value to measure its importance if one feature dimension is contaminated or less important in differentiating samples from different classes. Then, this feature will play a smaller role in determining the cluster assignments of data points. Inspired by [29, 30], we introduce a feature auto-weighting variable into the CF objective function with the purpose of adaptively learning the importance of each dimension of the feature vector. In the newly formulated Auto-Weighted Concept Factorization (AWCF) model, the feature auto-weighting variable is jointly optimized with the original two variables (i.e., the linear coefficient matrix in constructing the dictionary matrix, and the data representation coefficient matrix) in CF, which explicitly provides us the insight to the feature importance map.

The main contributions of this paper can be summarized as follows.

We propose an improved CF model to learn data representation by explicitly considering and adaptively learning the different contributions of different feature dimensions. Specifically, this is achieved by incorporating a feature auto-weighting variable into the CF objective function, which satisfies the non-negativity and normalization constraints.

We design an efficient algorithm to optimize the objective function of AWCF under the alternating direction optimization framework. The newly introduced feature auto-weighting variable is jointly optimized with the original two CF variables. Moreover, the complexity and convergence of the optimization algorithm are provided.

We conduct extensive experiments on both synthetic and representative real-world data sets to evaluate the effectiveness of AWCF. The experimental results demonstrate that the learned data representation is more efficient in clustering by introducing the feature weighting technique. Besides, the feature map can be obtained.

The remainder of the paper is organized as follows. In section 2, we first provide a brief introduction to NMF and CF and then formulate the AWCF model objective and derive its optimization. Experiments are conducted to evaluate the performance of AWCF in data clustering in section 3. Section 4 concludes the whole paper.

Notations. In this paper, matrices are written as boldface uppercase letters and vectors as boldface lowercase letters. For a matrix M, the (i, j)-th element is m_ij. By default, we use m_i to represent the i-th column of M and m^j to represent its j-th row. The Frobenius norm of the matrix $M \in ℝ^{n \times m}$ is defined as $∥ M ∥_{F} = \sqrt{\sum_{i = 1}^{n} \sum_{j = 1}^{m} m_{ij}^{2}}$.

2 Auto-weighted concept factorization

In this section, we first revisit the NMF and CF models, which serve as the building blocks of our new development. Then, we present the formulation and optimization of the AWCF model, followed by its convergence and complexity analysis.

2.1 Review of NMF and CF

Given the data matrix $X = [x_{1}, x_{2}, \dots, x_{n}] \in ℝ^{d \times n}$, where n and d respectively represent the number of data points and the number of feature dimensions, NMF seeks non-negative linear combinations of basis vectors that approximate the original data matrix as $X \approx U H^{T}$. In this approximation, the non-negative basis matrix $U \in ℝ^{d \times c}$ carries the information of c cluster centers, and the non-negative matrix $H \in ℝ^{n \times c}$ holds the linear combination coefficients of the columns of the basis matrix U. These two matrices can be obtained by optimizing the following objective $O_{NMF} : \min_{U, H} ∥ X - U H^{T} ∥_{F}^{2}, \quad \text{s.t.} \; U \geq 0, H \geq 0 .$ (1) In practice, we usually use the coefficient matrix H as a new data representation in pattern recognition tasks.

A major limitation of NMF is that it can only be performed in the original feature space. If the data distribution is highly nonlinear, NMF may not be sufficient to reconstruct the data directly as $X \approx U H^{T}$. In this situation, the kernel trick can be employed to extend the factorization into the reproducing kernel Hilbert space (RKHS) to characterize the nonlinear structure of data. To this end, Xu and Gong proposed the concept factorization (CF) model [10]. By replacing the basis matrix in NMF with a linear combination of the data points, U = XW, the CF model has the following objective function $O_{CF} : \min_{W, V} ∥ X - X W V^{T} ∥_{F}^{2}, \quad \text{s.t.} \; W \geq 0, V \geq 0 .$ (2) The updating rules of $W \in ℝ^{n \times c}$ and $V \in ℝ^{n \times c}$ in problem (2) are derived in [10] and are listed below $w_{jk} \leftarrow \frac{(K V)_{jk}}{(K W V^{T} V)_{jk}} w_{jk},$ (3) $v_{jk} \leftarrow \frac{(K W)_{jk}}{(V W^{T} K W)_{jk}} v_{jk},$ (4) where the kernel matrix is defined as $K ≜ X^{T} X$ (the linear kernel) or by another kernel such as the nonlinear RBF kernel. Theoretically, any kernel matrix can be used here.
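The multiplicative rules (3) and (4) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the reference implementation of [10]: it reads the matrix in both denominators as the kernel matrix K, uses the linear kernel, and adds a small constant `eps` to the denominators for numerical safety. The toy sizes and the names `cf_step` and `cf_objective` are assumptions made for the example.

```python
import numpy as np

def cf_objective(X, W, V):
    """Squared Frobenius reconstruction error of objective (2)."""
    return np.linalg.norm(X - X @ W @ V.T) ** 2

def cf_step(K, W, V, eps=1e-12):
    """One pass of the multiplicative rules (3) and (4).

    eps guards against division by zero; it is an implementation
    assumption and not part of the rules themselves.
    """
    W = W * (K @ V) / (K @ W @ (V.T @ V) + eps)
    V = V * (K @ W) / (V @ (W.T @ K @ W) + eps)
    return W, V

rng = np.random.default_rng(0)
d, n, c = 5, 30, 3                 # toy sizes, chosen only for illustration
X = rng.random((d, n))             # non-negative toy data, one column per sample
K = X.T @ X                        # linear kernel
W, V = rng.random((n, c)), rng.random((n, c))

err_start = cf_objective(X, W, V)
for _ in range(100):
    W, V = cf_step(K, W, V)
err_end = cf_objective(X, W, V)
print(err_start, err_end)          # the error should drop over the iterations
```

Because the rules only multiply by non-negative ratios, W and V stay non-negative as long as K and the initial factors are non-negative, which is why the toy data are drawn from [0, 1).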

2.2 AWCF model formulation

When performing data reconstruction, the CF model neglects taking the importance of features into consideration. In other words, CF treats the importance of each dimension of the data feature vector equally. As shown by the left hand side of Figure 1, we use 1’s to denote such importance measurement. In a compact matrix form, the conventional CF essentially uses an identity matrix $I \in ℝ^{d \times d}$ as the feature weight such that the weighted data matrix is IX.

Fig. 1

Graphical show of the feature importance in both CF (left) and AWCF (right).

Let {f₁, f₂, f₃, ⋯ , f_d} denote the attribute values of the corresponding feature dimensions. In this paper, we propose to adaptively learn the importance of each dimension of the feature vector. Mathematically, we introduce an auto-weighting variable $θ \in ℝ^{d}$ in which each $θ_{i}$ (i = 1, …, d) measures the importance of $f_{i}$. Two constraints are imposed on θ, namely the non-negativity and normalization constraints, θ ≥ 0 and $1^{T} θ = \sum_{i = 1}^{d} θ_{i} = 1$. Accordingly, the original data matrix is modified from X to ΘX, where $Θ \in ℝ^{d \times d}$ is a diagonal matrix with θ_i as its i-th diagonal element. Based on the conventional CF model, we have an improved approximation, $Θ X \approx Θ X W V^{T}$. Following this idea, we formulate the objective function of AWCF as $\begin{matrix} \min_{Θ, W \geq 0, V \geq 0} ∥ Θ X - Θ X W V^{T} ∥_{F}^{2}, \\ \text{s.t.} \; Θ = diag (θ), 1^{T} θ = 1, θ \geq 0 . \end{matrix}$ (5) To explain intuitively how AWCF distinguishes the importance of each dimension of the feature vector and adaptively learns the weight vector, we decompose the objective (5) row-wise. Because Θ is diagonal, the i-th row of $Θ (X - X W V^{T})$ is θ_i times the i-th row of $X - X W V^{T}$, so that (5) can be written as $\begin{matrix} \min_{θ, W, V} \sum_{i = 1}^{d} θ_{i}^{2} ∥ (X - X W V^{T})^{i} ∥_{2}^{2}, \\ \text{s.t.} \; θ_{i} \geq 0, \sum_{i = 1}^{d} θ_{i} = 1, W \geq 0, V \geq 0 . \end{matrix}$ (6) Mathematically, a larger value of θ_i forces the approximation error between the i-th row of X and the i-th row of $X W V^{T}$ to be small. That is, a feature dimension is regarded as important if it can be reconstructed well across the data points in AWCF.
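The row-wise reading of objective (5) can be checked numerically. Because Θ is diagonal, the weighted error equals the sum over feature rows of θ_i² times the squared norm of the i-th residual row. The snippet below is a small sanity check on random toy matrices (the sizes are arbitrary assumptions), not part of the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, c = 4, 10, 2                       # arbitrary toy sizes
X = rng.random((d, n))
W, V = rng.random((n, c)), rng.random((n, c))

theta = rng.random(d)
theta /= theta.sum()                     # non-negative weights that sum to one
Theta = np.diag(theta)

residual = X - X @ W @ V.T               # row i is the error on feature i
full_objective = np.linalg.norm(Theta @ residual) ** 2
row_wise = np.sum(theta ** 2 * np.sum(residual ** 2, axis=1))
print(full_objective, row_wise)          # the two values coincide
```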

2.3 Optimization to AWCF objective

Since there are three variables ( Θ , W and V) in the AWCF objective function, the basic framework to solve (5) is the alternating direction optimization and therefore we should update one variable by fixing the others. Below we give the detailed derivations of the updating rule to each variable.

■ Update Θ with W and V fixed. In this case, the objective with respect to Θ is $\begin{matrix} \min_{Θ} ∥ Θ X - Θ X W V^{T} ∥_{F}^{2}, \\ \text{s.t.} \; Θ = diag (θ), 1^{T} θ = 1, θ \geq 0 . \end{matrix}$ (7) By denoting $Z = X (I - W V^{T}) (I - W V^{T})^{T} X^{T} \in ℝ^{d \times d}$, we can rewrite the objective function in (7) as $\begin{matrix} Tr ((Θ X - Θ X W V^{T})^{T} (Θ X - Θ X W V^{T})) \\ = Tr (Θ X (I - W V^{T}) (I - W V^{T})^{T} X^{T} Θ^{T}) \\ ≜ Tr (Θ Z Θ^{T}) . \end{matrix}$ (8) Inspired by [30], we define z_ii as the i-th diagonal element of matrix Z and set m_i = z_ii. Since Θ is diagonal, $Tr (Θ Z Θ^{T}) = \sum_{i = 1}^{d} θ_{i}^{2} z_{ii}$, and problem (7) becomes $O (θ) : \min_{θ} \sum_{i = 1}^{d} θ_{i}^{2} m_{i}, \quad \text{s.t.} \; θ_{i} \geq 0, \sum_{i = 1}^{d} θ_{i} = 1 .$ (9) By constructing a diagonal matrix M with m_i as its i-th diagonal element, the objective (9) becomes $O (θ) : \min_{θ} θ^{T} M θ, \quad \text{s.t.} \; 1^{T} θ = 1, θ \geq 0 .$ (10) We can discard the non-negativity constraint θ ≥ 0 at this point, because every m_i is the squared norm of a residual row and is therefore non-negative, so the solution derived below is automatically non-negative. The corresponding Lagrangian function is $L (θ, α) = θ^{T} M θ - α (1^{T} θ - 1),$ (11) where α is the Lagrangian multiplier. Taking the derivative of $L (θ, α)$ with respect to θ and setting it to zero, we have $\frac{\partial L (θ, α)}{\partial θ} = 2 M θ - α 1 = 0 .$ (12) From the above equation, we get $θ_{i} = \frac{α}{2 m_{i}} .$ (13) Substituting (13) into the normalization constraint $1^{T} θ = 1$, we obtain the solution to α as $α = \frac{2}{\sum_{j = 1}^{d} \frac{1}{m_{j}}} .$ (14) Then, the updating rule of θ is obtained as $θ_{i} = \frac{1}{m_{i} \sum_{j = 1}^{d} \frac{1}{m_{j}}} .$ (15)
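As a worked instance of the closed-form rule (15), the sketch below computes θ from the per-feature errors m_i. The helper names `theta_from_m` and `theta_from_residual` and the constant `eps` (which keeps 1/m_i finite when a row is reconstructed exactly) are assumptions introduced for this illustration.

```python
import numpy as np

def theta_from_m(m):
    """Rule (15): theta_i is proportional to 1 / m_i and sums to one."""
    inv = 1.0 / np.asarray(m, dtype=float)
    return inv / inv.sum()

def theta_from_residual(X, W, V, eps=1e-12):
    """Compute m_i = z_ii as the squared norm of the i-th residual row,
    then apply rule (15). eps is an assumed numerical safeguard."""
    residual = X - X @ W @ V.T
    m = np.sum(residual ** 2, axis=1) + eps
    return theta_from_m(m)

# Worked example with three features whose errors are 1, 2 and 4:
# 1/m = [1, 0.5, 0.25], which sums to 1.75, so theta = [4/7, 2/7, 1/7].
theta = theta_from_m([1.0, 2.0, 4.0])
print(theta)
```

The feature with the smallest reconstruction error receives the largest weight, which matches the interpretation given after objective (6).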

■ Update W and V with Θ fixed. Since W and V have similar derivations, we analyze them together. When Θ is fixed, problem (5) becomes $\min_{W \geq 0, V \geq 0} ∥ Θ X - Θ X W V^{T} ∥_{F}^{2} .$ (16) Denoting Q ≜ ΘX, problem (16) becomes $\min_{W \geq 0, V \geq 0} ∥ Q - Q W V^{T} ∥_{F}^{2},$ (17) which has the same form as the standard CF objective (2) with X replaced by Q. The updating rules of W and V are therefore the same as (3) and (4), with the kernel matrix computed as $K = Q^{T} Q = (Θ X)^{T} (Θ X)$. To avoid the non-uniqueness of the solution, we normalize V and W by $V \leftarrow V [diag (W^{T} K W)]^{1 / 2},$ (18) $W \leftarrow W [diag (W^{T} K W)]^{- 1 / 2} .$ (19) This rescaling leaves the product $W V^{T}$, and hence the objective value, unchanged.
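The effect of the normalization in (18) and (19) can be illustrated as follows. The sketch rescales the columns of W and V with the same diagonal factor, computed from W before it is rescaled, and the product of the two factors is preserved. The function name `normalize_factors`, the constant `eps` and the toy sizes are assumptions made for the example.

```python
import numpy as np

def normalize_factors(K, W, V, eps=1e-12):
    """Apply (18) and (19): V <- V D^{1/2}, W <- W D^{-1/2},
    where D = diag(W^T K W) is taken from W before rescaling."""
    scale = np.sqrt(np.diag(W.T @ K @ W) + eps)   # one factor per column
    return W / scale, V * scale

rng = np.random.default_rng(0)
X = rng.random((4, 12))                  # toy data, 4 features and 12 samples
K = X.T @ X                              # linear kernel
W, V = rng.random((12, 3)), rng.random((12, 3))

W_n, V_n = normalize_factors(K, W, V)
print(np.allclose(W @ V.T, W_n @ V_n.T))      # product W V^T is preserved
print(np.diag(W_n.T @ K @ W_n))               # each entry is (numerically) one
```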

Based on the above analysis, we summarize the complete optimization procedure to AWCF objective in Algorithm 1.

Algorithm 1 Auto-Weighted Concept Factorization

Input: data matrix $X \in ℝ^{d \times n}$ , the number of clusters c;

Output: $W \in ℝ^{n \times c}$ , $V \in ℝ^{n \times c}$ , $θ \in ℝ^{d}$ .

1: Initialization. Randomly initialize W and V as non-negative matrices, and $θ = [\frac{1}{d}, \frac{1}{d}, \dots, \frac{1}{d}] \in ℝ^{d}$ which treats each dimension of the feature vector equally;

2: while not converged do

3: Update variable θ by rule (15);

4: Update variable W by rule (3);

5: Update variable V by rule (4);

6: Normalize V and W based on (18) and (19);

7: end while
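Algorithm 1 can be sketched end to end in NumPy as below. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses the linear kernel K = (ΘX)ᵀ(ΘX), replaces the convergence test with a fixed number of iterations `n_iter`, adds a small `eps` to denominators, and runs on non-negative random toy data. The function name `awcf` and all parameter defaults are hypothetical.

```python
import numpy as np

def awcf(X, c, n_iter=50, seed=0, eps=1e-12):
    """Sketch of Algorithm 1 for a non-negative d x n data matrix X."""
    d, n = X.shape
    rng = np.random.default_rng(seed)
    W = rng.random((n, c))                    # step 1: random non-negative init
    V = rng.random((n, c))
    theta = np.full(d, 1.0 / d)               # step 1: equal feature weights
    for _ in range(n_iter):                   # assumption: fixed iteration budget
        # step 3: closed-form weight update, rule (15)
        residual = X - X @ W @ V.T
        inv_m = 1.0 / (np.sum(residual ** 2, axis=1) + eps)
        theta = inv_m / inv_m.sum()
        # kernel of the weighted data Q = Theta X
        Q = theta[:, None] * X
        K = Q.T @ Q
        # steps 4 and 5: multiplicative rules (3) and (4)
        W *= (K @ V) / (K @ W @ (V.T @ V) + eps)
        V *= (K @ W) / (V @ (W.T @ K @ W) + eps)
        # step 6: normalization (18) and (19)
        scale = np.sqrt(np.diag(W.T @ K @ W) + eps)
        V *= scale
        W /= scale
    return W, V, theta

rng = np.random.default_rng(1)
X = rng.random((6, 40))                       # toy data: 6 features, 40 samples
W, V, theta = awcf(X, c=3)
clusters = V.argmax(axis=1)                   # one common way to read off clusters
print(theta, theta.sum())
```

Reading cluster assignments from the largest entry in each row of V is one common convention for factorization-based clustering and is assumed here; running Kmeans on V is another option.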

2.4 Convergence and complexity analysis

According to the transformation from problem (16) to problem (17), the AWCF subproblem in W and V has the same form as the CF model, which means that the updates of W and V share the convergence property of CF. The convergence can be proved by following the Expectation-Maximization algorithm [31], and the detailed process is similar to that in [1, 11]. It is worth mentioning that the update of θ has an analytical solution in each iteration, which does not influence the convergence property of AWCF.

From Algorithm 1, we can observe the difference in optimization between AWCF and CF is the calculation of θ and the updating of K = (ΘX) ^T (ΘX) in each iteration. Assuming that AWCF converges in t iterations and c ⪡ d and d ⪡ n in common sense, it is easy to check that the complexity of calculating Θ is O (d), the updating of K is O (n ²d), the updating of W and V is O (n ²c). In general, the overall complexity of AWCF is O (t (d + n ²c + n ²d)).

3 Experiments

In this section, we first conduct experiments on a synthetic data set to show the rationality of the learned weights by AWCF and then evaluate the effectiveness of the feature auto-weighting strategy in AWCF by comparing it with other models on clustering some benchmark data sets.

3.1 Experiments on synthetic data

Data set. We constructed a synthetic data set consisting of three Gaussian distributed clusters and each cluster has 200 data points. Each of the three clusters approximately distributes a banding shape as shown in Figure 2. The x-axis is the first dimension and the y-axis is the second dimension. From clustering view, we can easily find that the dimension 1 is discriminative while the dimension 2 is noisy.
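A data set of this kind could be generated as sketched below. The cluster centers and standard deviations are hypothetical values chosen only to produce three band-shaped clusters that are separated along dimension 1 and overlap along dimension 2; the parameters actually used for Figure 2 are not given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_cluster = 200
centers_dim1 = [0.0, 5.0, 10.0]    # hypothetical centers along dimension 1
std_dim1, std_dim2 = 0.5, 5.0      # hypothetical: narrow in dim 1, wide in dim 2

# Stack the three clusters column-wise so that X is 2 x 600,
# matching the d x n layout used for the data matrix.
X = np.hstack([
    np.vstack([rng.normal(c1, std_dim1, n_per_cluster),
               rng.normal(0.0, std_dim2, n_per_cluster)])
    for c1 in centers_dim1
])
labels = np.repeat(np.arange(3), n_per_cluster)
print(X.shape, labels.shape)
```

Since the multiplicative updates assume non-negative inputs, such data would typically be shifted or rescaled into a non-negative range before factorization; that preprocessing step is also an assumption here.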

Fig. 2

The synthetic data set with three Gaussian distributed clusters.

Results and analysis. By performing our AWCF model on this synthetic data, we obtain the learned auto-weighting variable as θ =[0.6694,0.3306] whose bar plot is provided by Figure 3. From the perspective of clustering performance of these three clusters, we can view the contribution of the first dimension as 0.6694; similarly, the contribution of the second dimension is 0.3306. That is, AWCF considers that the first dimension is much more important than the second one. This is exactly consistent with our intuition to this data set.

From the perspective of feature selection, if we only use the first dimension of data, the clustering performance is definitely better than that with only the second dimension. Therefore, we project these data points to x-axis and y-axis and the results are respectively shown in Figures 4 and 5. From Figure 4, we can observe that it is much easier to differentiate these data points with such x-axis one-dimensional representation. However, the projected data points on the y-axis in Figure 5 have large overlapping areas and therefore it is hard to make differentiation. Such result demonstrates that AWCF can learn a meaningful feature map.

Fig. 3

The learned weight θ on the synthetic data set.

Fig. 4

The projected data onto x-axis.

Fig. 5

The projected data onto y-axis.

3.2 Experiments on benchmark data

Data sets. Four representative benchmark data sets are used in the following experiments including two face image data sets (Yale and ORL), one spoken letter data set (ISOLET) and one palm print data set (PalmData25). The statistics of these data sets including the sample size, the dimensionality and the number of clusters are summarized in Table 1.

Table 1
Description of the data sets for data clustering

Dataset	#Size	#Dimensionality	#Cluster
Yale	165	1024	15
ORL	400	1024	40
ISOLET	1560	617	26
PalmData25	2000	256	100

Experimental settings. In order to show the effectiveness of our proposed AWCF model, we compare it with four closely related clustering methods: Kmeans (KM), Non-negative Matrix Factorization (NMF), Normalized Cut (NCut) and Concept Factorization (CF). In our experiments, we evaluate the clustering performance with different numbers of clusters. For example, the evaluations were conducted with the cluster number ranging from 2 to 15 on the Yale data set. For each given cluster number k (2 ≤ k < c), we randomly selected k clusters and ran the experiment 20 times, and the final results were obtained by averaging over these 20 tests. For NMF, we set the number of columns of the dictionary matrix as the number of clusters. Since Kmeans has randomness in its initialization, in the following experiments we repeated it 60 times with different initializations and recorded the best result in terms of its objective function.

Three well accepted clustering measurements are adopted as metrics, i.e., Accuracy (ACC), Normalized Mutual Information (NMI) and Purity to evaluate the clustering performance. The exact definitions to them can be found in [32].
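For orientation, two of these metrics can be sketched with their commonly used definitions, which are assumed here since the exact formulas are deferred to [32]. Purity assigns each predicted cluster to its majority class, while accuracy requires a one-to-one mapping between clusters and classes. The brute-force search over permutations below is only practical for a small number of clusters (a Hungarian-type assignment solver would be used in practice), and all function names are hypothetical.

```python
from itertools import permutations
import numpy as np

def contingency(y_true, y_pred):
    """counts[i, j] = number of samples in predicted cluster i with true class j."""
    _, t = np.unique(np.asarray(y_true).ravel(), return_inverse=True)
    _, p = np.unique(np.asarray(y_pred).ravel(), return_inverse=True)
    counts = np.zeros((p.max() + 1, t.max() + 1), dtype=int)
    np.add.at(counts, (p, t), 1)
    return counts

def purity(y_true, y_pred):
    counts = contingency(y_true, y_pred)
    return counts.max(axis=1).sum() / counts.sum()

def accuracy(y_true, y_pred):
    counts = contingency(y_true, y_pred)
    k = max(counts.shape)
    square = np.zeros((k, k), dtype=int)          # pad to a square matrix
    square[:counts.shape[0], :counts.shape[1]] = counts
    best = max(sum(square[i, j] for i, j in enumerate(perm))
               for perm in permutations(range(k)))
    return best / counts.sum()

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
pur = purity(y_true, y_pred)      # both clusters are dominated by class 0: 4/6
acc = accuracy(y_true, y_pred)    # a one-to-one mapping matches only 3 of 6
print(pur, acc)
```

The example shows why purity can exceed accuracy: purity lets two clusters claim the same class, whereas accuracy does not.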

Clustering results and analysis. Based on the above experimental settings, the clustering results including the average performance and standard deviation over these 20 runs on these four data sets are respectively reported in Tables 2, 3, 4 and 5, where the best results are highlighted in boldface. Generally, we can come to the conclusion that our proposed AWCF model outperforms the others in most cases in terms of all the three clustering metrics.

Table 2

Clustering Performance (%) of compared algorithms on Yale

k	2	4	6	8	10	12	14	15	Avg.
Accuracy (%)
KM	68.4±16.4(0)	57.2±9.7(0)	43.4±7.5(1)	44.2±5.5(1)	43.1±5.1(1)	39.0±3.7(1)	39.0±3.7(1)	43.6	47.2
NMF	64.0±14.9(0)	58.3±9.0(0)	47.8±6.1(1)	49.4±5.4(0)	44.6±6.1(1)	42.4±5.2(1)	40.6±3.2(1)	44.2	48.9
NCut	80.2±11.4(0)	62.9±12.1(0)	49.6±7.3(1)	47.7±5.0(1)	47.3±5.7(0)	45.0±3.8(0)	45.6±3.2(0)	43.0	52.7
CF	67.7±15.7(0)	55.5±11.0(1)	47.1±7.0(1)	47.6±4.8(1)	42.7±4.9(1)	40.3±4.0(1)	39.9±3.8(1)	41.8	47.8
AWCF	74.5±21.2	60.6±9.7	54.1±7.8	50.9±5.5	49.7±5.2	45.8±4.3	44.9±4.8	44.8	53.2
Normalized Mutual Information (%)
KM	21.5±27.0(1)	40.2±14.7(0)	31.8±9.3(1)	38.7±5.6(1)	42.4±5.8(1)	40.2±4.0(1)	42.7±3.4(1)	47.2	38.1
NMF	14.7±25.0(1)	40.3±14.2(0)	37.0±7.3(1)	43.8±5.4(0)	43.8±6.0(1)	43.7±3.9(1)	44.5±2.5(1)	46.6	39.3
NCut	37.1±24.7(0)	42.7±15.2(0)	39.0±8.3(1)	43.3±5.1(1)	47.6±4.7(0)	48.2±3.2(0)	50.8±2.6(1)	49.7	44.8
CF	19.9±28.1(1)	35.5±12.5(1)	36.6±9.3(1)	41.7±5.2(1)	41.5±4.6(1)	41.5±3.7(1)	43.1±3.4(1)	44.5	38.0
AWCF	37.5±37.3	43.3±14.6	43.6±9.2	47.1±5.2	47.9±4.7	47.3±4.1	48.7±3.3	49.0	45.6
Purity (%)
KM	68.4±16.4(0)	58.6±10.1(0)	44.5±7.3(1)	45.2±4.9(1)	44.4±4.8(1)	40.1±3.5(1)	40.7±3.2(1)	44.8	48.3
NMF	64.0±14.9(0)	59.5±8.7(0)	49.0±5.5(1)	50.1±5.1(0)	45.8±5.8(1)	43.1±5.0(1)	42.0±3.3(1)	44.8	49.8
NCut	80.2±11.4(0)	63.6±11.4(0)	50.7±7.2(1)	48.3±4.4(1)	48.0±5.4(0)	45.9±3.5(0)	46.7±3.1(0)	44.2	53.5
CF	67.7±15.7(0)	56.2±10.4(1)	48.7±6.5(1)	48.4±4.5(1)	44.5±4.8(1)	41.9±3.8(1)	41.1±3.5(1)	42.4	48.9
AWCF	74.5±21.2	60.8±9.5	54.7±7.7	52.7±4.4	50.5±5.2	47.1±4.3	46.5±4.3	46.0	54.1

Table 3

Clustering Performance (%) of compared algorithms on ORL

k	4	8	12	16	20	25	30	40	Avg.
Accuracy(%)
KM	75.5±15.0(1)	62.8±7.3(1)	62.0±6.0(1)	57.3±5.0(1)	55.9±5.1(1)	54.5±6.2(1)	52.9±2.9(1)	52.5	59.2
NMF	75.2±13.0(1)	71.0±7.3(0)	65.9±3.6(1)	64.0±5.0(0)	59.1±2.3(1)	56.3±4.2(1)	55.1±2.3(1)	54.0	62.6
NCut	81.3±9.5(0)	69.1±9.5(0)	67.4±4.0(1)	63.4±5.2(1)	62.4±3.5(0)	57.8±3.4(0)	57.1±3.4(0)	51.5	63.7
CF	77.2±12.3(1)	67.2±8.0(1)	67.0±4.3(1)	63.7±6.1(1)	60.2±5.5(1)	56.1±4.3(1)	56.0±2.9(1)	51.5	62.4
AWCF	84.6±8.4	72.5±6.7	69.3±4.1	65.4±2.7	63.0±3.9	58.5±3.8	58.1±3.5	54.7	65.8
Normalized Mutual Information(%)
KM	68.9±16.0(1)	67.7±7.5(1)	70.7±5.3(1)	69.7±4.4(1)	69.0±3.9(1)	70.1±3.5(1)	69.6±2.1(1)	69.8	69.4
NMF	69.4±12.2(1)	73.1±5.6(1)	73.4±3.0(1)	74.6±4.7(0)	73.1±1.5(0)	70.9±3.3(1)	70.1±2.3(1)	70.7	71.9
NCut	75.0±8.9(0)	71.2±9.6(1)	74.8±3.3(0)	74.2±3.9(0)	73.9±2.8(0)	72.5±2.2(0)	72.8±2.1(1)	70.3	73.1
CF	71.1±13.2(1)	71.8±7.2(1)	74.4±3.8(0)	73.7±4.2(1)	71.5±4.2(0)	70.5±3.2(1)	70.9±2.0(1)	68.0	71.5
AWCF	77.5±12.4	75.0±5.8	74.9±3.2	75.0±3.1	72.0±2.5	73.6±2.3	74.5±2.6	72.2	74.3
Purity(%)
KM	77.5±12.3(1)	67.5±6.5(1)	66.4±5.7(1)	62.8±4.4(1)	60.8±4.8(1)	59.5±4.8(1)	57.8±2.5(1)	57.2	63.7
NMF	78.1±9.6(1)	74.4±5.9(0)	70.3±3.2(1)	68.8±4.5(1)	64.6±2.3(1)	62.0±3.4(1)	60.7±2.5(1)	58.7	67.2
NCut	82.7±7.6(0)	71.6±8.2(1)	70.7±3.6(1)	68.0±4.9(1)	66.5±3.4(1)	62.8±2.6(1)	62.0±3.0(1)	56.5	67.6
CF	78.9±10.1(1)	71.5±7.7(1)	70.6±3.9(1)	68.2±5.2(1)	63.9±4.4(1)	61.9±3.6(1)	61.3±2.5(1)	56.0	66.5
AWCF	84.6±8.4	75.3±6.2	72.2±3.2	69.0±3.0	67.6±2.8	63.8±2.8	63.9±3.2	59.5	69.5

Table 4

Clustering Performance (%) of compared algorithms on ISOLET

k	4	8	12	16	20	24	26	Avg.
Accuracy(%)
KM	80.3±12.6(0)	76.1±8.0(1)	68.4±5.9(0)	59.4±3.8(1)	59.6±3.0(1)	56.3±3.2(1)	52.4	64.7
NMF	82.6±12.9(0)	78.0±8.5(0)	68.4±7.0(0)	62.3±4.9(1)	59.3±4.3(1)	55.0±4.0(1)	53.0	65.5
NCut	75.7±16.4(0)	68.2±8.2(1)	62.7±7.2(1)	56.9±5.5(1)	55.8±4.4(1)	49.4±3.7(1)	47.5	59.5
CF	75.5±16.4(0)	76.6±10.0(1)	66.4±7.7(1)	61.3±5.3(1)	57.9±4.8(1)	51.5±3.6(1)	50.3	62.8
AWCF	77.1±12.1	79.4±6.9	69.3±4.4	65.0±4.2	62.0±3.4	57.4±2.8	58.9	67.0
Normalized Mutual Information(%)
KM	75.9±13.1(0)	78.6±6.5(0)	75.0±4.6(0)	72.2±3.1(1)	70.6±2.0(0)	67.4±1.3(1)	68.8	72.6
NMF	75.9±15.9(0)	77.9±6.7(1)	73.5±5.0(1)	72.9±3.4(0)	70.2±2.5(1)	67.7±1.9(1)	67.5	72.2
NCut	76.2±15.0(0)	76.9±7.7(1)	74.2±5.8(1)	72.6±4.7(0)	71.0±2.6(0)	68.9±2.0(0)	68.3	72.6
CF	71.3±18.2(0)	78.1±8.0(0)	73.2±5.9(1)	71.0±4.5(1)	69.3±3.1(1)	66.8±1.8(1)	63.7	70.5
AWCF	73.0±12.0	79.6±5.4	75.6±3.5	73.4±3.1	71.2±2.1	68.8±1.4	69.2	73.0
Purity(%)
KM	81.5±11.2(0)	79.0±6.9(1)	71.9±5.8(0)	64.1±3.5(1)	64.5±2.6(0)	60.4±2.2(0)	58.6	68.6
NMF	83.3±11.2(0)	79.2±6.9(1)	71.6±5.8(0)	66.1±3.5(1)	63.8±2.6(1)	59.5±2.2(1)	56.6	68.6
NCut	79.9±13.0(0)	73.7±7.2(1)	67.9±6.3(1)	62.3±5.5(1)	61.0±3.5(1)	54.9±2.9(1)	54.4	64.9
CF	78.1±14.1(0)	79.3±8.4(1)	70.0±6.2(1)	64.7±4.8(1)	61.8±3.9(1)	55.4±3.2(1)	52.5	66.0
AWCF	79.6±9.8	81.2±6.2	72.3±3.9	67.6±3.9	65.4±3.0	60.7±2.4	62.8	69.9

Table 5

Clustering Performance (%) of compared algorithms on PalmData25

k	15	30	45	60	75	90	100	Avg.
Accuracy(%)
KM	71.0±8.8(1)	71.7±4.6(1)	72.3±2.9(1)	68.6±3.5(1)	69.5±2.9(1)	68.6±2.3(0)	66.1	69.7
NMF	80.5±6.4(0)	72.0±5.2(1)	71.3±3.6(1)	68.4±2.5(1)	70.4±2.2(0)	68.8±2.4(0)	69.4	71.6
NCut	56.6±7.9(1)	56.4±3.4(1)	53.0±3.3(1)	48.9±3.0(1)	47.2±2.4(1)	47.3±2.6(1)	44.4	50.5
CF	80.6±5.7(0)	73.9±4.2(1)	70.5±3.1(1)	67.5±2.9(1)	65.7±2.6(1)	64.3±3.1(1)	65.6	69.7
AWCF	83.7±6.1	75.6±3.6	74.8±3.7	71.6±2.3	70.8±2.1	69.4±2.3	68.6	73.5
Normalized Mutual Information(%)
KM	84.8±5.0(1)	87.4±2.4(0)	85.4±1.6(1)	85.5±1.4(0)	85.5±1.3(0)	85.8±0.8(0)	85.4	85.7
NMF	87.9±3.9(1)	85.5±2.3(1)	86.3±1.8(0)	85.5±1.1(0)	87.1±0.8(1)	86.1±1.1(0)	85.7	86.3
NCut	71.8±6.2(1)	76.1±3.4(1)	74.9±3.1(1)	70.6±3.1(1)	70.2±2.7(1)	70.0±2.4(1)	67.9	71.6
CF	89.5±3.0(0)	87.2±2.2(0)	86.1±1.6(1)	85.0±1.7(0)	83.9±1.5(0)	83.6±1.6(1)	84.3	85.6
AWCF	90.7±3.1	87.9±2.2	87.1±2.1	85.8±1.3	84.6±1.1	85.9±1.2	85.7	86.8
Purity(%)
KM	77.2±6.9(1)	77.3±3.7(1)	77.4±2.9(1)	73.5±2.8(1)	73.4±2.7(0)	73.0±1.9(0)	72.8	74.9
NMF	83.5±5.1(1)	76.5±4.0(1)	75.6±3.2(1)	74.0±2.1(0)	74.6±1.7(0)	73.3±2.1(0)	73.6	75.9
NCut	65.0±5.5(1)	64.1±3.1(1)	60.8±2.8(1)	56.8±2.8(1)	55.7±2.5(1)	55.3±2.1(1)	53.4	58.7
CF	84.5±4.5(0)	78.6±3.2(1)	75.5±2.5(1)	73.1±2.5(1)	71.4±2.1(1)	70.1±2.3(1)	72.1	75.0
AWCF	86.3±4.4	79.9±2.8	78.8±3.0	74.5±1.8	73.8±1.5	73.6±1.9	73.3	77.2

Since the main innovation of AWCF is the introduction of the feature auto-weighting variable into CF, below we mainly provide the pairwise comparison between the clustering results obtained by these two models. Specifically, as shown in Table 2, AWCF obtains average improvements of 5.4%, 7.6% and 5.2% over CF in accuracy, NMI and purity, respectively, on the Yale data set. NCut takes the second place, with an average performance less than 1% below that of AWCF. According to the results in this table, we can roughly categorize the five clustering models into two groups. One includes Kmeans, NMF and CF, and the other includes NCut and AWCF. There exist significant performance differences between these two groups. Similarly, based on the results in Table 3, the average improvements achieved by AWCF over CF are 3.4%, 2.8% and 3.0% in the three metrics on the ORL data set. On the ISOLET data set, AWCF consistently outperforms its counterpart CF, and the average improvements in accuracy, NMI and purity are 4.2%, 2.5% and 3.9%, respectively. When the number of clusters k varies from 15 to 60 on the PalmData25 data set, AWCF has obvious superiority when compared with the other models. Though AWCF did not always achieve the best clustering performance for larger k, its average performance still ranks first.

To illustrate the statistical significance of the differences between AWCF and the other models, we performed the paired Student's t-test on their clustering results. Here the hypothesis is that the clustering performance (accuracy, normalized mutual information, or purity) obtained by AWCF is better than that obtained by the other (given) method. Each t-test was run on two sequences corresponding to the clustering results of the 20 tests by our method and the given method. In Tables 2 to 5, the statistical test results are reported by the binary values in brackets. For each entry in brackets, "1" means that the hypothesis is accepted at the 95% confidence level, and "0" means that it is not. For example, when the number of clusters is 14 on the Yale data set, the average accuracy of AWCF over the 20 tests is 44.9±4.8 while that of CF is 39.9±3.8. The appended (1) means that the hypothesis "AWCF is superior to CF" holds based on the statistical test. Similarly, in terms of accuracy, the hypothesis that AWCF is superior to Kmeans, NMF and CF is accepted when the number of clusters ranges from 6 to 12, with the single exception of NMF at k = 8. Although the hypothesis that AWCF is superior to NCut is not accepted in some cases, AWCF still obtains a better result than NCut in the Avg. column for all three metrics. Similar statistical test results can be found on the remaining data sets.
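The bracketed flags can be reproduced in spirit with a paired t statistic computed over the 20 runs. The sketch below assumes a one-sided test at the 5% level, for which the critical value of Student's t with 19 degrees of freedom is about 1.729; whether the original test was one-sided or two-sided is not stated in the text. The scores are hypothetical numbers made up for the illustration and are not taken from the tables.

```python
import numpy as np

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-run differences a - b (paired samples)."""
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    return diff.mean() / (diff.std(ddof=1) / np.sqrt(diff.size))

T_CRIT_19 = 1.729   # one-sided 5% critical value, Student's t, 19 dof

# Hypothetical accuracies over 20 paired runs (illustration only).
baseline = np.linspace(40.0, 50.0, 20)
gain = np.tile([1.0, 3.0], 10)            # method A beats the baseline by 1 or 3
method_a = baseline + gain

t_stat = paired_t_statistic(method_a, baseline)
flag = int(t_stat > T_CRIT_19)            # 1: "A is better" is accepted
print(t_stat, flag)
```

Pairing matters here because both methods are evaluated on the same 20 random cluster subsets, so the test is applied to per-run differences rather than to two independent samples.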

Based on the above analysis to the experimental results, we can conclude that it is beneficial for learning more efficient data representation by incorporating the feature auto-weighting strategy into the data reconstruction in concept factorization. In the proposed AWCF model, different discriminative abilities of different features can be well explored and exploited by the auto-weighting variable. As a result, the clustering performance is enhanced.

Feature weight map and ranking. To intuitively visualize the importance of each feature dimension of face images from the ORL and Yale data sets, we show the learned feature auto-weighting variable θ in two-dimensional form (termed feature map) in Figure 6, where brighter pixels correspond to larger values. From these two feature maps, we can clearly identify the contour of a face. Moreover, the values in the feature maps around the eye and mouth areas are larger, meaning that these edges and corners are more important in characterizing and recognizing face images. Assigning larger weights to pixels within these areas helps to boost the discriminative ability of the model, leading to better recognition performance.

Fig. 6

The visualization of feature maps on Yale (k = 15) and ORL (k = 40) data sets.
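Turning the learned θ into a feature map amounts to reshaping the weight vector back to the image grid and rescaling it for display. The sketch below assumes 32 × 32 images (consistent with the 1024 dimensions of Yale and ORL in Table 1) and a row-major pixel layout; the function name `theta_to_feature_map` and both of these choices are assumptions, and data stored column by column would need `order="F"`.

```python
import numpy as np


def theta_to_feature_map(theta, height=32, width=32, order="C"):
    """Reshape a weight vector to an image grid and rescale it to [0, 1].

    Brighter (larger) entries mark features with larger learned weights.
    """
    theta = np.asarray(theta, dtype=float)
    if theta.size != height * width:
        raise ValueError("theta length must equal height * width")
    fmap = theta.reshape((height, width), order=order)
    lo, hi = fmap.min(), fmap.max()
    if hi == lo:
        # Constant weights carry no contrast; return an all-zero map.
        return np.zeros_like(fmap)
    return (fmap - lo) / (hi - lo)


# Placeholder weights (not a learned theta), just to exercise the function.
demo_theta = np.arange(1024, dtype=float)
demo_map = theta_to_feature_map(demo_theta)
```

The resulting array can be passed to any grayscale image viewer to obtain a picture in the style of Figure 6.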

Figure 7 shows the elements of the learned θ on the four benchmark data sets in the original feature dimension order (the left column) and ranked in descending order (the right column). Clearly, different weights were assigned to different feature dimensions, meaning that they made different contributions to the clustering task and that our AWCF model can effectively distinguish among them. Taking the ISOLET data set as an example, the learned θ shows that the important features lie approximately between dimensions 300 and 400, which correspond to the discrete Fourier transform (DFT) coefficients following the sonorant interval (SON) and the DFT coefficients from the second and fifth frames of the SON [33]. This is good evidence that the proposed AWCF model has a desirable ability in feature weighting and selection.

Fig. 7

The value of θ on four benchmark data sets (left is original θ and right is sorted θ ).
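The sorted curves in the right column correspond to a descending sort of θ, and the indices of the largest entries give a simple feature ranking. The following is a minimal sketch of that step; `rank_features` is a hypothetical helper and the weights are made-up values, not a learned θ.

```python
import numpy as np


def rank_features(theta, top_k=10):
    """Return (indices of the top_k largest weights, theta sorted descending)."""
    theta = np.asarray(theta, dtype=float)
    # Negate so that an ascending stable sort yields descending weights,
    # with ties kept in their original feature order.
    order = np.argsort(-theta, kind="stable")
    return order[:top_k], theta[order]


# Made-up weights for illustration only.
demo_weights = [0.1, 0.5, 0.2, 0.2]
top_idx, sorted_weights = rank_features(demo_weights, top_k=2)
```

Keeping only the features indexed by `top_idx` is one straightforward way to use the learned weights for feature selection.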

Convergence of AWCF. To empirically check the convergence property of AWCF, we plot the objective values of CF and AWCF against the number of iterations on the four benchmark data sets in Figure 8, shown in the left and right columns respectively. The proposed optimization method monotonically decreases the objective function value of AWCF, and AWCF converges about as fast as CF, usually within a few iterations. It is worth mentioning that the AWCF objective value falls rapidly at the beginning of the iterations. This rapid change in the first few iterations is caused by the quick evolution of θ from its random initial state to a nearly optimal state under our optimization procedure.

Fig. 8

Convergence curves of CF and AWCF on four benchmark data sets.
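The two properties read off these curves, monotone decrease and convergence within a few iterations, can also be checked numerically on a recorded sequence of objective values. The sketch below is illustrative only: the function names, the tolerances and the example sequence are assumptions and are not taken from the experiments.

```python
def is_monotone_nonincreasing(objective_values, tol=1e-10):
    """True if no iteration increases the objective by more than tol."""
    return all(
        cur <= prev + tol
        for prev, cur in zip(objective_values, objective_values[1:])
    )


def iterations_to_converge(objective_values, rel_tol=1e-4):
    """First iteration whose relative change is at most rel_tol, else None."""
    for t in range(1, len(objective_values)):
        prev, cur = objective_values[t - 1], objective_values[t]
        if abs(prev - cur) <= rel_tol * max(abs(prev), 1e-12):
            return t
    return None


# Made-up objective values with a rapid initial fall, for illustration.
demo_objective = [100.0, 40.0, 39.0, 38.999]
demo_monotone = is_monotone_nonincreasing(demo_objective)
demo_converged_at = iterations_to_converge(demo_objective)
```

A relative tolerance is used so that the stopping test does not depend on the scale of the objective, which differs between CF and AWCF.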

4 Conclusion and future work

In this paper, we proposed an improved concept factorization model, termed Auto-Weighted Concept Factorization (AWCF), which simultaneously performs data representation and feature map learning. The learned weights intuitively and quantitatively indicate the importance of different features. Experiments on representative data sets demonstrated that effectively identifying the different discriminative abilities of different features can improve the clustering performance of the model. In the current work, we only considered incorporating the feature auto-weighting variable into the conventional concept factorization. In future work, we will consider more advanced CF variants, with which we expect to achieve more promising clustering performance.

Acknowledgment

This work was supported by the Natural Science Foundation of China (61971173, U1909202, U20B2074, 61972121), the Zhejiang Provincial Natural Science Foundation of China (LY21F030005), the National Social Science Foundation of China (19ZDA348), the Fundamental Research Funds for the Provincial Universities of Zhejiang (GK209907299001-008), the China Postdoctoral Science Foundation (2017M620470), the Acoustics Science and Technology Laboratory of Harbin Engineering University (SSKF2018001), the Key Laboratory of Advanced Perception and Intelligent Control of High-end Equipment of Ministry of Education, Anhui Polytechnic University (GDSC202015), the Jiangsu Provincial Key Laboratory for Computer Information Processing Technology, Soochow University (KJS1841) and the Zhejiang Xinmiao Talents Program (2019R407030).

References

1. Lee D.D. and Seung H.S., Algorithms for nonnegative matrix factorization, In Advances in Neural Information Processing Systems, pages 556–562, 2001.

2. Cai, He, Han and Huang T.S., Graph regularized non-negative matrix factorization for data representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 33(8) (2011), 1548–1560.

3. Chen, Zhang, Cai, Liu and He, Nonnegative local coordinate factorization for image representation, IEEE Transactions on Image Processing 22(3) (2013), 969–979.

4. Long, Lu, Peng and Li, Graph regularized discriminative non-negative matrix factorization for face recognition, Multimedia Tools and Applications 72(3) (2014), 2679–2699.

5. , Yuan, Zhu and Li, Structurally incoherent low-rank nonnegative matrix factorization for image classification, IEEE Transactions on Image Processing 27(11) (2018), 5248–5260.

6. Peng, Tang, Kong W., Qin and Nie, Parallel vector field regularized non-negative matrix factorization for image representation, In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2216–2220, 2018.

7. Peng, Long, Qin, Kong W., Nie and Cichocki, Flexible non-negative matrix factorization with adaptively learned graph regularization, In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3107–3111, 2019.

8. Liu J.-X., Wang, Gao Y.-L., Zheng C.-H., Xu and Yu, Regularized non-negative matrix factorization for identifying differentially expressed genes and clustering samples: a survey, Computational Biology and Bioinformatics 15(3) (2018), 974–987.

9. , Huang, Sidiropoulos N.D. and Ma W.-K., Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications, IEEE Signal Processing Magazine 36(2) (2019), 59–80.

10. and Gong, Document clustering by concept factorization, In Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 202–209, 2004.

11. Cai, He and Han, Locally consistent concept factorization for document clustering, IEEE Transactions on Knowledge and Data Engineering 23(6) (2011), 902–913.

12. and Jin, Dual-graph regularized concept factorization for clustering, Neurocomputing 138 (2014), 120–130.

13. , Shen, Shu, Ye and Zhao, Graph regularized multilayer concept factorization for data representation, Neurocomputing 238 (2017), 139–151.

14. Liu, Yang, Wu and Li, Local coordinate concept factorization for image representation, IEEE Transactions on Neural Networks and Learning Systems 25(6) (2014), 1071–1082.

15. and Jin, Graph-regularized local coordinate concept factorization for image representation, Neural Processing Letters 46(2) (2017), 427–449.

16. Zhang, Zhang, Li, Liu, Wang and Yan, Robust unsupervised flexible auto-weighted local-coordinate concept factorization for image clustering, In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2092–2096, 2019.

17. Zhan, Shi, Wang and Xie, Adaptive structure concept factorization for multiview clustering, Neural Computation 30(4) (2018), 1080–1103.

18. Liu, Yang, Wu and Cai, Constrained concept factorization for image representation, IEEE Transactions on Cybernetics 44(7) (2014), 1214–1224.

19. Zhang, Zhang, Liu, Tang, Yan and Wang, Joint label prediction based semi-supervised adaptive concept factorization for robust data representation, IEEE Transactions on Knowledge and Data Engineering 32(5) (2020), 952–970.

20. , Lu, Huang and Xie, Pairwise constrained concept factorization for data representation, Neural Networks 52 (2014), 1–17.

21. , Zhao X.-J., Zhang and Li F.-Z., Semisupervised concept factorization for document clustering, Information Sciences 331 (2016), 86–98.

22. Peng, Tang, Kong, Zhang, Nie and Cichocki, Joint structured graph learning and clustering based on concept factorization, In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3162–3166, 2019.

23. Peng, Zhang and Yang, A fast feature weighting algorithm of data gravitation classification, Information Sciences 375 (2017), 54–78.

24. Phan A.V., Nguyen M.L. and Bui L.T., Feature weighting and SVM parameters optimization based on genetic algorithms for classification problems, Applied Intelligence 46(2) (2017), 455–469.

25. Panday, de Amorim R.C. and Lane, Feature weighting as a tool for unsupervised feature selection, Information Processing Letters 129 (2018), 44–52.

26. Kim H.J., Dunn and Frahm J.-M., Learned contextual feature reweighting for image geo-localization, In IEEE Conference on Computer Vision and Pattern Recognition, pages 3251–3260, 2017.

27. Zheng W.-L. and Lu B.-L., Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks, IEEE Transactions on Autonomous Mental Development 7(3) (2015), 162–175.

28. Nguyen and Todorovic, Feature weighting and boosting for few-shot segmentation, In IEEE International Conference on Computer Vision, pages 622–631, 2019.

29. Chen, Yuan, Nie and Huang J.Z., Semi-supervised feature selection via rescaled linear regression, In International Joint Conference on Artificial Intelligence, pages 1525–1531, 2017.

30. Nie, Shi S.J. and Li, Semi-supervised learning with auto-weighting feature and adaptive graph, IEEE Transactions on Knowledge and Data Engineering 32(6) (2020), 1167–1178.

31. Dempster A.P., Laird N.M. and Rubin D.B., Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society: Series B (Methodological) 39(1) (1977), 1–22.

32. Huang, Nie and Huang, A new simplex sparse learning model to measure data similarity for clustering, In International Joint Conference on Artificial Intelligence, pages 3569–3575, 2015.

33. Fanty M.A. and Cole, Spoken letter recognition, In Advances in Neural Information Processing Systems, pages 220–226, 1991.

Table 1
Description of the data sets for data clustering

Dataset       #Size   #Dimensionality   #Cluster
Yale          165     1024              15
ORL           400     1024              40
ISOLET        1560    617               26
PalmData25    2000    256               100