HSIC regularized manifold learning

Abstract

At present, manifold learning is mainly applied to dimensionality reduction. From the viewpoint of dimensionality reduction, however, manifold learning algorithms preserve only local features. For example, Local Linear Embedding preserves local linearity, Local Tangent Space Alignment preserves local homeomorphism, and Laplacian Eigenmap preserves local similarity. The dimensionality reduction community is now pursuing algorithms that preserve both local and global features of the data. In this paper, a new dimensionality reduction algorithm, called Hilbert-Schmidt Independence Criterion Regularized Manifold Learning (HSIC-ML for short), is proposed, in which the HSIC between the high dimensional data and the dimension-reduced data is added as a regularization term to the objective functions of manifold learning. The HSIC regularization term makes HSIC-ML capable of preserving both local and global features during dimensionality reduction. HSIC is a criterion measuring the statistical dependence between two data sets and has been widely applied in machine learning in recent years. However, since HSIC was first proposed around 2005, it seems not to have been applied directly to dimensionality reduction, nor as a regularization term. The proposed HSIC-ML may be the first attempt in this respect. The experimental results presented in this paper show that manifold learning with HSIC regularization performs better than manifold learning without it.

Keywords

Dimensionality reduction, manifold learning, Hilbert-Schmidt Independence Criterion, regularization

1 Introduction

Dimensionality reduction is an important task in machine learning. With the advent of the big data era, the curse of dimensionality has become more and more serious, and dimensionality reduction algorithms have therefore attracted more and more attention. In the process of dimensionality reduction, information loss is inevitable. The main concern of a dimensionality reduction algorithm is therefore which features of the data it preserves. From this point of view, dimensionality reduction algorithms can be divided into three categories: local feature preserving, global feature preserving, and both local and global feature preserving.

In local feature preserving algorithms, the high dimensional data are first divided into a number of local regions, and each region is then dimensionally reduced without considering its interaction with other regions. The objective function of the whole algorithm is the sum of the objective functions of all regions. Therefore, local feature preserving algorithms aim to preserve the local features, not the global features, of the data. Most manifold learning algorithms are local feature preserving algorithms. For example, Local Linear Embedding (LLE for short) [1] preserves local linearity, Local Tangent Space Alignment (LTSA for short) [2] and Hessian LLE (HLLE for short) [3] preserve local homeomorphism, and Laplacian Eigenmap (LE for short) [4] preserves local similarity. Manifold learning algorithms can be further developed into ones capable of preserving local and global features at the same time. For example, in the objective functions of manifold learning algorithms, the dimension-reduced data can be replaced with a linear [5] or polynomial [6] transform of the high dimensional data, so that the transform, not the dimension-reduced data, becomes the target of optimization.
Besides dimensionality reduction, another application of manifold learning is manifold regularization [7], in which the objective functions of manifold learning are added as regularization terms to the objective functions of other machine learning algorithms.

At present, there are many comprehensive and thorough literature reviews on dimensionality reduction, such as [8–10]. The first two come from the Journal of Machine Learning Research, a theoretical machine learning journal, while the third comes from Statistical Science, a journal of statistics and probability.

As its name implies, the Hilbert-Schmidt Independence Criterion (HSIC for short) measures the degree of statistical dependence between two random variables or two data sets [11]. From the viewpoint of machine learning, HSIC is a kernel method. However, unlike other kernel methods (e.g., kernel PCA [12], kernel LDA [13], kernel SVM [14]), the theory of HSIC is somewhat complicated. HSIC involves not only Reproducing Kernel Hilbert Spaces (RKHS), but also Hilbert-Schmidt operators (HS operators) and the space of HS operators. HSIC is defined as the norm of an HS operator, and this HS operator is implicitly defined by a continuous linear functional on the HS operator space. Although the theory of HSIC is complicated, the formula for calculating the HSIC between two data sets is quite simple: it is the trace of the product of the two centered kernel matrices of the data sets. In recent years, HSIC has been applied to many fields of machine learning and the related literature is growing steadily. In this paper, a novel algorithm, HSIC regularized manifold learning (HSIC-ML), is proposed, in which the HSIC between the high dimensional data and the dimension-reduced data is added as a regularization term to the objective functions of manifold learning. The HSIC regularization makes HSIC-ML capable of preserving not only local features (local linearity, local homeomorphism, local similarity, etc.), but also global features (statistical correlation as a whole) during dimensionality reduction.

The remaining sections of this paper are arranged as follows. In Section 2, some works related to HSIC-ML are introduced. Since manifold learning is well known in machine learning, the introduction focuses mainly on HSIC. In Section 3, HSIC-ML is proposed. In Section 4, some experimental results are given, showing that manifold learning with HSIC regularization performs better than that without HSIC regularization. In Section 5, some conclusions are given.

2 Related work

The proposed HSIC-ML only adds an HSIC regularization term to the objective functions of manifold learning. Unlike other algorithms developed from manifold learning, it does not otherwise change these objective functions. Therefore, only HSIC-related works are reviewed here.

Since it was proposed around 2005 [11], HSIC has found wide applications in machine learning. HSIC involves RKHS [15] and therefore belongs to the category of kernel methods in machine learning. However, compared with other kernel methods, the theory of HSIC seems much more complicated. HSIC involves not only RKHS, but also HS operators between RKHS as well as continuous linear functionals over HS operator spaces. Interested readers can refer to [11] as well as the related technical reports written by the authors of [11].

Although the theory of HSIC is complicated, the empirical formula of HSIC is quite simple. In practice, HSIC is often used to measure the statistical correlation between two data sets.

Let X = [x₁ ⋯ x_N] ∈ R^D×N and Y = [y₁ ⋯ y_N] ∈ R^d×N be two data sets. The empirical HSIC between X and Y is as follows [11]: $HSIC (X, Y) \approx trace (K_{X} C_{N} K_{Y} C_{N})$ (1) where $C_{N} = I_{N} - \frac{1}{N} Γ_{N} Γ_{N}^{T}$ is the centering matrix, Γ_N ∈ R^N is the all-ones vector, and $K_{X} = [\begin{matrix} k_{X} (x_{1}, x_{1}) & \dots & k_{X} (x_{1}, x_{N}) \\ ⋮ & ⋱ & ⋮ \\ k_{X} (x_{N}, x_{1}) & \dots & k_{X} (x_{N}, x_{N}) \end{matrix}],$

$K_{Y} = [\begin{matrix} k_{Y} (y_{1}, y_{1}) & \dots & k_{Y} (y_{1}, y_{N}) \\ ⋮ & ⋱ & ⋮ \\ k_{Y} (y_{N}, y_{1}) & \dots & k_{Y} (y_{N}, y_{N}) \end{matrix}]$ (2)

k_X and k_Y are two kernel functions and can be chosen according to practical applications.
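Eq. (1) can be evaluated in a few lines. The following is a minimal sketch in Python with NumPy, assuming Gaussian kernels for both data sets; the bandwidth `sigma` and all function names are illustrative choices that do not come from [11]. As in Eq. (1), no normalizing constant is applied.

```python
import numpy as np

def gaussian_kernel_matrix(data, sigma=1.0):
    """Gaussian kernel matrix for data stored column-wise (shape: dim x N)."""
    sq_norms = np.sum(data ** 2, axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * data.T @ data
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def centering_matrix(n):
    """C_N = I_N - (1/N) * ones * ones^T."""
    return np.eye(n) - np.ones((n, n)) / n

def empirical_hsic(K_x, K_y):
    """trace(K_X C_N K_Y C_N) as in Eq. (1), without a normalizing factor."""
    C = centering_matrix(K_x.shape[0])
    return np.trace(K_x @ C @ K_y @ C)

# Illustrative data (hypothetical): Y_dep is a function of X, Y_ind is not.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))
Y_dep = X[:2, :]
Y_ind = rng.normal(size=(2, 50))
hsic_dep = empirical_hsic(gaussian_kernel_matrix(X), gaussian_kernel_matrix(Y_dep))
hsic_ind = empirical_hsic(gaussian_kernel_matrix(X), gaussian_kernel_matrix(Y_ind))
```

In this toy setting the dependent pair yields the larger value, which is the behavior one would expect from a dependence measure.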

In the empirical HSIC of Eq. (1), the two data sets must have the same size. However, this is not always the case in practice. In [16, 17], the so-called surrogate kernel is proposed to solve this problem. Let X = [x₁ ⋯ x_N] ∈ R^D×N and Y = [y₁ ⋯ y_M] ∈ R^d×M be two data sets, where N ≠ M. Then the following matrices are defined: $K_{XY} = [\begin{matrix} k_{X} (x_{1}, y_{1}) & \dots & k_{X} (x_{1}, y_{M}) \\ ⋮ & ⋱ & ⋮ \\ k_{X} (x_{N}, y_{1}) & \dots & k_{X} (x_{N}, y_{M}) \end{matrix}] \in R^{N \times M}$

$K_{YX} = [\begin{matrix} k_{Y} (y_{1}, x_{1}) & \dots & k_{Y} (y_{1}, x_{N}) \\ ⋮ & ⋱ & ⋮ \\ k_{Y} (y_{M}, x_{1}) & \dots & k_{Y} (y_{M}, x_{N}) \end{matrix}] \in R^{M \times N}$

$K_{X \leftarrow Y} = K_{XY} K_{Y}^{- 1} K_{YX} \in R^{N \times N},$

$K_{Y \leftarrow X} = K_{YX} K_{X}^{- 1} K_{XY} \in R^{M \times M}$ (3)

In this way, two different HSICs are produced: $HSIC (X, Y) = trace (K_{X} C_{N} K_{X \leftarrow Y} C_{N}),$

$HSIC (Y, X) = trace (K_{Y} C_{M} K_{Y \leftarrow X} C_{M})$ (4)

In supervised machine learning, the numbers of data in different classes often differ. In [16], surrogate kernels are used to calculate the HSIC between the data of different classes: $H = [\begin{matrix} HSIC (X^{1}, X^{1}) & \dots & HSIC (X^{1}, X^{C}) \\ ⋮ & ⋱ & ⋮ \\ HSIC (X^{C}, X^{1}) & \dots & HSIC (X^{C}, X^{C}) \end{matrix}] \in R^{C \times C}$ (5) where X^c represents the data in the c-th class, c = 1, ⋯ , C, and C is the number of classes. In [16], the objective function of the algorithm is constructed so as to make the diagonal of H dominant.

In recent years, HSIC has often been applied to supervised feature selection. Let $X = [\begin{matrix} x_{1} & \dots & x_{N} \end{matrix}] \in R^{D \times N}$ be the data and $Z = [\begin{matrix} z_{1} & \dots & z_{N} \end{matrix}] \in R^{C \times N}$ the labels of X, i.e., z_n is the label of x_n: if x_n belongs to the c-th class, then the c-th component of z_n is 1 and the other components of z_n are 0, where 1 ≤ c ≤ C and n = 1, ⋯ , N. The goal is to select the features of x_n that are most statistically correlated with the label.

For convenience, assume that the components of x ∈ R^D are its features. Supervised feature selection then selects the components of x that are most statistically correlated with the label z of x. In [18], a supervised feature selection algorithm based on HSIC and sparse learning is proposed. Let s ∈ R^D be the selection vector. The objective function is as follows:

$HSIC (X^{T} s, Z) + λ {∥ s ∥}_{1} \underset{choose s}{\to} min$ (6) where ∥ ∘ ∥₁ is the 1-norm. Letting $u = X^{T} s = [\begin{matrix} x_{1}^{T} s \\ ⋮ \\ x_{N}^{T} s \end{matrix}] = [\begin{matrix} u^{1} \\ ⋮ \\ u^{N} \end{matrix}] \in R^{N}$ , we have

$HSIC (X^{T} s, Z) = HSIC (u, Z)$

$= trace (K_{u} C_{N} K_{Z} C_{N})$ (7) where

$K_{u} = [\begin{matrix} k_{u} (u^{1}, u^{1}) & \dots & k_{u} (u^{1}, u^{N}) \\ ⋮ & ⋱ & ⋮ \\ k_{u} (u^{N}, u^{1}) & \dots & k_{u} (u^{N}, u^{N}) \end{matrix}]$ (8)

∥s∥₁ is called the regularization term of sparse learning. Adding ∥s∥₁ favors sparse solutions, i.e., solutions with few nonzero components [19]. The positions of the nonzero components of s are the positions of the selected features.

In [20], two methods of supervised feature selection based on HSIC are proposed: a forward method and a backward method, denoted by FOHSIC and BAHSIC respectively. FOHSIC arranges the features of the data in ascending order according to the HSIC between the features and the labels, while BAHSIC arranges them in descending order. Many developments and variants based on FOHSIC and BAHSIC have appeared in recent years, such as [21].

HSIC has also been applied to supervised dictionary learning [22, 23] and supervised PCA [24]. The problem of dictionary learning can be expressed as follows: ${∥ X - WY ∥}^{2} \underset{choose W, Y}{\to} min$ (9) where W ∈ R^D×d is called a dictionary and Y ∈ R^d×N the coefficients of the dictionary. For a fixed W, the minimum is clearly achieved if WY is chosen to be the projection of X onto the subspace span W, where span W denotes the space spanned by the column vectors of W. Furthermore, if the column vectors of W are orthonormal, then according to the projection theorem of functional analysis, the dictionary coefficients Y are the Fourier coefficients of X with respect to W, i.e., Y = W^TX. In this way, the problem of dictionary learning becomes the problem of subspace learning: ${∥ X - WY ∥}^{2} = {∥ X - W W^{T} X ∥}^{2} \underset{choose W}{\to} min$ (10) or equivalently, $trace (W^{T} X X^{T} W) \underset{choose W}{\to} max$ (11) This is actually PCA, and the dictionary W spans the PCA subspace.
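The equivalence between Eqs. (10) and (11) can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses NumPy, does not center X (Eq. (11) has no centering either), and the function names are hypothetical. For orthonormal W, the reconstruction error equals ∥X∥² minus the trace in Eq. (11), so maximizing the trace minimizes the error.

```python
import numpy as np

def pca_dictionary(X, d):
    """Orthonormal dictionary W (D x d) maximizing trace(W^T X X^T W), Eq. (11)."""
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)  # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :d]                 # eigenvectors of the d largest eigenvalues
    Y = W.T @ X                                 # coefficients Y = W^T X
    return W, Y

def reconstruction_error(X, W):
    """Squared Frobenius error ||X - W W^T X||^2 of Eq. (10)."""
    return np.linalg.norm(X - W @ (W.T @ X)) ** 2

# Illustrative data (hypothetical): D = 5, N = 40, d = 2.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 40))
W, Y = pca_dictionary(X, 2)
err = reconstruction_error(X, W)
captured = np.trace(W.T @ X @ X.T @ W)
```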

In supervised dictionary or subspace learning, the dictionary or subspace W is chosen by maximizing the HSIC between the coefficients Y and the labels Z. The kernel function for Y is set to the linear kernel: k_Y (y, y′) = y^Ty′. The kernel matrix of Y is $K_{Y} = [\begin{matrix} k_{Y} (y_{1}, y_{1}) & \dots & k_{Y} (y_{1}, y_{N}) \\ ⋮ & ⋱ & ⋮ \\ k_{Y} (y_{N}, y_{1}) & \dots & k_{Y} (y_{N}, y_{N}) \end{matrix}] =$

$[\begin{matrix} y_{1}^{T} y_{1} & \dots & y_{1}^{T} y_{N} \\ ⋮ & ⋱ & ⋮ \\ y_{N}^{T} y_{1} & \dots & y_{N}^{T} y_{N} \end{matrix}] = [\begin{matrix} x_{1}^{T} W W^{T} x_{1} & \dots & x_{1}^{T} W W^{T} x_{N} \\ ⋮ & ⋱ & ⋮ \\ x_{N}^{T} W W^{T} x_{1} & \dots & x_{N}^{T} W W^{T} x_{N} \end{matrix}]$

$= X^{T} W W^{T} X$ (12)

Finally, the problem of supervised dictionary or subspace learning based on HSIC can be expressed as: $HSIC (Y, Z) = tr (K_{Y} C_{N} K_{Z} C_{N})$

$= tr (X^{T} W W^{T} X C_{N} K_{Z} C_{N})$

$= tr (W^{T} X C_{N} K_{Z} C_{N} X^{T} W) \underset{choose W}{\to} max$ (13) The above problem is solved under the constraint W^TW = I_d to avoid trivial solutions.
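Under the constraint W^TW = I_d, Eq. (13) is an eigenvalue problem for the symmetric matrix X C_N K_Z C_N X^T. The following NumPy sketch illustrates this. It assumes a linear label kernel K_Z = Z^TZ on one-hot labels, which is a choice made here for illustration (the text leaves k_Z open), and the function name is hypothetical.

```python
import numpy as np

def supervised_subspace(X, Z, d):
    """W (D x d) maximizing trace(W^T X C K_Z C X^T W) subject to W^T W = I_d.

    X: D x N data, Z: C x N one-hot labels.
    Assumption: linear label kernel K_Z = Z^T Z.
    """
    n = X.shape[1]
    C = np.eye(n) - np.ones((n, n)) / n      # centering matrix C_N
    K_z = Z.T @ Z
    M = X @ C @ K_z @ C @ X.T
    M = (M + M.T) / 2.0                      # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(M)     # ascending eigenvalues
    return eigvecs[:, ::-1][:, :d]           # eigenvectors of the d largest eigenvalues

# Illustrative data (hypothetical): two classes separated along the first axis.
rng = np.random.default_rng(2)
labels = np.repeat([0, 1], 50)
X = rng.normal(size=(3, 100))
X[0, :] += np.where(labels == 0, -5.0, 5.0)
Z = np.eye(2)[:, labels]                     # one-hot labels, shape 2 x 100
W = supervised_subspace(X, Z, 1)
```

In this toy example the learned direction lies almost entirely along the first coordinate, which is the only direction that carries label information.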

3 HSIC regularized manifold learning (HSIC-ML)

Dimensionality reduction is an important application of manifold learning, and the proposed HSIC-ML is also a dimensionality reduction algorithm. The problem of dimensionality reduction can be expressed as follows: given high dimensional data $X = [\begin{matrix} x_{1} & \dots & x_{N} \end{matrix}] \in R^{D \times N}$ , find dimension-reduced data $Y = [\begin{matrix} y_{1} & \dots & y_{N} \end{matrix}] \in R^{d \times N}$ , where d < D. Here the column vectors of the matrices represent data points.

3.1 HSIC-based dimensionality reductions (HSIC-DR)

HSIC has been widely applied in many fields of machine learning since it was first proposed around 2005. However, HSIC seems not to have been directly applied to dimensionality reduction so far. In this paper, a new dimensionality reduction algorithm based on HSIC, called HSIC-DR, is first proposed. The objective function of HSIC-DR is as follows: $HSIC (X, Y) = trace (K_{Y} C_{N} K_{X} C_{N}) \underset{choose Y}{\to} \max$ (14) That is, the dimension-reduced data are chosen so as to make the HSIC between the high dimensional data and the dimension-reduced data as large as possible. HSIC-DR is therefore a global feature preserving algorithm: the global feature it preserves is the statistical correlation between X and Y.

In the above equation, the dimension-reduced data Y are hidden in the kernel matrix K_Y, which is inconvenient for optimization. To solve this problem, a positive definite linear kernel is chosen for the dimension-reduced data Y: k_Y : R^d × R^d → R such that for all y′, y″ ∈ R^d, $k_{Y} (y^{'}, y^{″}) = {y^{'}}^{T} y^{″} + κ δ (y^{'}, y^{″})$ (15) where κ > 0 and $δ (y^{'}, y^{″}) = \begin{cases} 1 & y^{'} = y^{″} \\ 0 & \text{otherwise} \end{cases}$ . The presence of δ ensures the positive definiteness of k_Y. Assuming the y_n are pairwise distinct, the kernel matrix K_Y becomes: $K_{Y} = [\begin{matrix} k_{Y} (y_{1}, y_{1}) & \dots & k_{Y} (y_{1}, y_{N}) \\ ⋮ & ⋱ & ⋮ \\ k_{Y} (y_{N}, y_{1}) & \dots & k_{Y} (y_{N}, y_{N}) \end{matrix}]$

$= [\begin{matrix} y_{1}^{T} y_{1} & \dots & y_{1}^{T} y_{N} \\ ⋮ & ⋱ & ⋮ \\ y_{N}^{T} y_{1} & \dots & y_{N}^{T} y_{N} \end{matrix}] + κ I_{N} = Y^{T} Y + κ I_{N}$ (16) Substituting K_Y into HSIC (X, Y), with the constant normalizing factor 1/N² that does not affect the optimization, gives

$\begin{matrix} HSIC (X, Y) & = & \frac{1}{N^{2}} trace (K_{Y} C_{N} K_{X} C_{N}) \\ & = & \frac{1}{N^{2}} trace (Y^{T} Y C_{N} K_{X} C_{N}) + \frac{κ}{N^{2}} trace (C_{N} K_{X} C_{N}) \\ & = & \frac{1}{N^{2}} trace (Y C_{N} K_{X} C_{N} Y^{T}) + \frac{κ}{N^{2}} trace (C_{N} K_{X} C_{N}) \end{matrix}$ (17) Since $\frac{κ}{N^{2}} trace (C_{N} K_{X} C_{N})$ does not depend on Y, the objective function of HSIC-DR can be expressed as: $trace (Y C_{N} K_{X} C_{N} Y^{T}) \underset{choose Y}{\to} max$

$s.t. \; Y Y^{T} = I_{d}$ (18) The constraint YY^T = I_d is added to avoid trivial solutions.
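By the standard Rayleigh quotient argument, Eq. (18) is maximized when the rows of Y are orthonormal eigenvectors of C_N K_X C_N belonging to its d largest eigenvalues. The NumPy sketch below illustrates this under that reading; the function name is hypothetical and the kernel matrix is assumed to be precomputed. The same eigenproblem appears in kernel PCA on the centered kernel matrix, up to a scaling of the eigenvectors.

```python
import numpy as np

def hsic_dr(K_x, d):
    """Rows of Y (d x N) maximize trace(Y C K_X C Y^T) subject to Y Y^T = I_d."""
    n = K_x.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n      # centering matrix C_N
    M = C @ K_x @ C
    M = (M + M.T) / 2.0                      # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(M)     # ascending eigenvalues
    return eigvecs[:, ::-1][:, :d].T         # top-d eigenvectors as rows

# Illustrative data (hypothetical) with a Gaussian kernel of unit bandwidth.
rng = np.random.default_rng(3)
X = rng.normal(size=(4, 30))
sq = np.sum(X ** 2, axis=0)
K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X.T @ X) / 2.0)
Y = hsic_dr(K, 2)
```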

3.2 HSIC regularized manifold learning (HSIC-ML)

Although HSIC can be used for dimensionality reduction by itself, it is preferable to combine it with other algorithms to form more effective dimensionality reduction algorithms. As described in Subsection 3.1, the proposed HSIC-DR is a global feature preserving algorithm, so it is natural to combine it with local feature preserving algorithms. At present, manifold learning algorithms are the most commonly used local feature preserving algorithms. Therefore, we propose a novel dimensionality reduction algorithm, called HSIC regularized manifold learning (HSIC-ML), which combines HSIC with manifold learning.

Most objective functions of manifold learning can be expressed in the following form: $trace (YL Y^{T}) \underset{choose Y}{\to} min$

$s.t. \; Y Y^{T} = I_{d}$ (19) where L is a positive semi-definite symmetric matrix that varies with the manifold learning algorithm. For example, L is derived from local linear approximation in LLE, from local similarity in LE, and from local homeomorphism in LTSA and HLLE.
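As one concrete instance of the matrix L in Eq. (19), the sketch below builds the unnormalized graph Laplacian of a k-nearest-neighbor graph, as used in LE. It is a simplified illustration: the 0/1 edge weights, the symmetrization rule, the default k and the function name are assumptions made here, and LE is often used with heat-kernel weights instead.

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """Unnormalized Laplacian L = D - W of a symmetrized k-NN graph.

    X: D x N data (columns are points). Edge weights are 0/1 (assumption).
    """
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    np.fill_diagonal(dist2, np.inf)              # a point is not its own neighbor
    idx = np.argsort(dist2, axis=1)[:, :k]       # indices of the k nearest neighbors
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                       # connect i and j if either selects the other
    return np.diag(W.sum(axis=1)) - W

# Illustrative data (hypothetical).
rng = np.random.default_rng(4)
X = rng.normal(size=(3, 25))
L = knn_graph_laplacian(X, k=4)
```

The resulting L is symmetric and positive semi-definite with zero row sums, as required of L in Eq. (19).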

In the proposed HSIC-ML, the HSIC between the high dimensional data X and the dimension-reduced data Y is added as a regularization term to the objective functions of manifold learning: $trace (YL Y^{T}) - λ trace (Y C_{N} K_{X} C_{N} Y^{T}) \underset{choose Y}{\to} min$

$s.t. \; Y Y^{T} = I_{d}$ (20) where λ > 0 is the regularization coefficient. In HSIC-ML, the dimension-reduced data Y are chosen so as to preserve the local features (local linear approximation, local similarity, local homeomorphism, etc.) of the high dimensional data X and, at the same time, to remain as statistically correlated with X as possible. HSIC-ML is thus an algorithm capable of preserving both local and global features.

3.3 The solution to HSIC-ML

The objective function of HSIC-ML can be solved using the Rayleigh quotient. In fact, $trace (YL Y^{T}) - λ trace (Y C_{N} K_{X} C_{N} Y^{T})$

$= trace (Y (L - λ C_{N} K_{X} C_{N}) Y^{T})$

$= trace (YB Y^{T}) = \sum_{i = 1}^{d} Y_{iRow} B Y_{iRow}^{T}$ (21) where B = L - λC_NK_XC_N is symmetric, $Y = [\begin{matrix} Y_{1 Row} \\ ⋮ \\ Y_{dRow} \end{matrix}]$ , and Y_iRow ∈ R^1×N is the i-th row vector of Y, i = 1, ⋯ , d.

Since YY^T = I_d, the row vectors of Y are orthonormal. The objective function can then be expressed as a sum of Rayleigh quotients:

$\begin{matrix} trace (YB Y^{T}) & = & \sum_{i = 1}^{d} Y_{iRow} B Y_{iRow}^{T} \\ & = & \sum_{i = 1}^{d} \frac{Y_{iRow} B Y_{iRow}^{T}}{Y_{iRow} Y_{iRow}^{T}} \\ & = & \sum_{i = 1}^{d} R_{B} (Y_{iRow}^{T}) \underset{choose Y}{\to} min \end{matrix}$ (22) where $R_{B} (Y_{iRow}^{T}) = \frac{Y_{iRow} B Y_{iRow}^{T}}{Y_{iRow} Y_{iRow}^{T}}$ is the Rayleigh quotient. According to the properties of the Rayleigh quotient, the transposed row vectors of Y should be chosen as the orthonormal eigenvectors of B corresponding to its d smallest eigenvalues.
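The solution described by Eqs. (21) and (22) amounts to a single symmetric eigendecomposition. The sketch below follows the text literally; it is an illustration in NumPy, not the authors' implementation, and the function name, the default λ and the toy inputs are assumptions. Note that the constant vector is always an eigenvector of B with eigenvalue 0 whenever L has zero row sums, and this sketch does not treat it specially.

```python
import numpy as np

def hsic_ml(L, K_x, d, lam=1.0):
    """Rows of Y are eigenvectors of B = L - lam * C K_X C with the d smallest eigenvalues."""
    n = L.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n          # centering matrix C_N
    B = L - lam * (C @ K_x @ C)
    B = (B + B.T) / 2.0                          # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(B)         # ascending eigenvalues
    return eigvecs[:, :d].T, eigvals[:d]

# Toy inputs (hypothetical): path-graph Laplacian and a Gaussian kernel matrix.
n = 20
L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1.0
rng = np.random.default_rng(5)
X = rng.normal(size=(3, n))
sq = np.sum(X ** 2, axis=0)
K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X.T @ X) / 2.0)
Y, vals = hsic_ml(L, K, d=2, lam=0.5)
```

The minimum of the objective equals the sum of the returned eigenvalues, which gives a quick sanity check on any implementation.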

4 Experimental results

In this section, some experimental results of HSIC-ML and ML algorithms are presented for comparison. ML includes LTSA [2], HLLE [3], LE [4] and LLE [1]. Since HLLE has been proven to be equivalent to LTSA [25, 26], the experimental results of HLLE and HSIC-HLLE are not listed separately; they are the same as those of LTSA and HSIC-LTSA.

In addition, in the proposed HSIC-ML, the kernel function k_X for the high dimensional data X is open and can be chosen according to the application at hand. This increases the flexibility of HSIC-ML. In the experiments presented in this section, k_X is chosen to be the Gaussian kernel function.

4.1 Experimental results on toy data

Figure (5) shows the experimental results on toy data. The toy data and the experimental results of ML (here LTSA is chosen for ML) are produced with MANI, a platform commonly used in manifold learning that can be downloaded for free from the internet. It can be seen from Figure (5) that the experimental results of HSIC-ML are reasonable and comparable with those of ML.

In Figure (5), Swiss Roll, with or without a hole, is a rolled strip in 3-dimensional space; the aim of manifold learning is to unroll the strip onto the 2-dimensional plane. From Figure (5), it can be seen that both ML and HSIC-ML can unroll Swiss Roll. HSIC-ML seems to restore the shape of the strip more faithfully.

Gaussian Hat, Punctured Sphere and Twin Peaks are objects in 3-dimensional space; the aim of manifold learning is to present the top views of these objects. It can be seen from Figure (5) that the results of ML and HSIC-ML are quite similar to each other.

ToroidalHelix is a twisted circle in 3-dimensional space; the aim of manifold learning is to restore the circle. Again, both ML and HSIC-ML accomplish the task.

The toy data listed in Figure (5) are often used in manifold learning. Since the geometric structures of these toy data are well known, a new algorithm of manifold learning is often first tested on these toy data to justify its reliability. It is clear that the proposed HSIC-ML passes the tests on toy data.

Figures (1) and (2) show the experimental results of ML and HSIC-ML on a dataset of faces that is often used in the manifold learning literature. The faces in the dataset change only in pose and expression. Therefore, although the faces are represented by high-dimensional vectors, 2-dimensional vectors may be enough to represent them. In Figures (1) and (2), the faces are reduced to the 2-dimensional plane with ML and HSIC-ML respectively. Some face images are also shown at the corresponding positions. It can be seen from Figures (1) and (2) that from top to bottom the facial expression changes from serious to happy, while from left to right the face pose changes from eastward to westward. The visual impression of HSIC-ML seems better than that of ML.

Fig.1

The Experimental Results of ML on Face Images.

Fig.2

The Experimental Results of HSIC-ML on Face Images.

4.2 Experimental results on real world data

The experimental results shown in Figures (1), (2) and (5) are qualitative, not quantitative, and are judged subjectively. In order to evaluate ML and HSIC-ML objectively, a classification example is presented, in which the high dimensional data are first dimension-reduced with ML and HSIC-ML respectively, and then classified with the 3-nearest-neighbor (3-NN) method.
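The classification step of this protocol can be sketched as follows. The code covers only the 3-NN vote on already dimension-reduced data; how the test samples are embedded is not specified here and is left outside the sketch. The function names, the Euclidean distance and the tie-breaking rule (the smallest label wins) are assumptions made for illustration.

```python
import numpy as np

def knn_predict(train, train_labels, test, k=3):
    """Majority vote among the k nearest training points (Euclidean distance).

    train: d x N_train, test: d x N_test (columns are samples);
    train_labels: non-negative integer array of length N_train.
    """
    d2 = (np.sum(test ** 2, axis=0)[:, None]
          + np.sum(train ** 2, axis=0)[None, :]
          - 2.0 * test.T @ train)
    nearest = np.argsort(d2, axis=1)[:, :k]      # k nearest training indices per test point
    votes = train_labels[nearest]
    # np.bincount(...).argmax() resolves ties in favor of the smallest label
    return np.array([np.bincount(v).argmax() for v in votes])

def accuracy(pred, truth):
    """Accuracy rate in percent."""
    return 100.0 * np.mean(pred == truth)

# Tiny illustrative example (hypothetical data): two well separated 1-D clusters.
train = np.array([[0.0, 0.1, 0.2, 10.0, 10.1, 10.2]])
train_labels = np.array([0, 0, 0, 1, 1, 1])
test = np.array([[0.05, 9.9]])
pred = knn_predict(train, train_labels, test, k=3)
```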

4.2.1 MNIST

MNIST is an image set of handwritten digits; the size (dimension) of each image is 32 × 32 = 1024. Some images of MNIST are shown in Figure (3). MNIST can be downloaded from the following website:

https://github.com/jindongwang/transferlearning/blob/master/doc/dataset.md#mnist+usps

Fig.3

Some Images of MNIST.

In the experiment, for each digit, 200 images are taken as training samples and 100 images as testing samples. The accuracy rates of classification are listed in Table 1 and are also plotted in Figure (6). From Table 1 or Figure (6), it can be seen that the accuracy rates of HSIC-LTSA and HSIC-LE are generally higher than those of LTSA and LE respectively, the main exception being reduced dimension 2, where LTSA outperforms HSIC-LTSA. However, the accuracy rates of HSIC-LLE and LLE are much the same. This is discussed in Subsection 4.3.2.

Table 1

The Accuracy Rates of Classification for MNIST, unit:%

the reduced dimension	LTSA	HSIC-LTSA	LE	HSIC-LE
2	51.00	43.10	59.10	65.10
5	63.50	75.40	72.70	81.10
10	77.10	80.50	79.10	82.80
20	81.20	81.70	82.80	86.10
30	81.90	85.30	83.80	86.60
40	83.50	86.40	86.90	87.20
50	83.40	86.80	85.80	87.00
60	83.50	87.10	85.90	86.10
70	83.50	87.40	85.80	85.80
80	84.20	88.00	86.60	88.50
90	84.50	87.30	87.00	87.60
100	84.80	87.20	87.40	88.80
150	85.20	86.30	87.60	88.00
200	88.00	88.50	87.00	87.20
without dimension reduction	90.1
the reduced dimension	LLE	HSIC-LLE
2	64.40	66.90
5	80.50	80.60
10	86.00	85.90
20	88.90	88.90
30	88.90	88.90
40	88.90	89.90
50	89.80	89.80
60	89.90	89.90
70	89.90	90.00
80	89.90	89.90
90	89.90	89.60
100	89.60	89.60
150	88.80	88.50
200	88.60	88.30
without dimension reduction	90.1

4.2.2 YaleB

YaleB is an image set containing the face images of 38 persons. For each person, 59 to 64 face images are taken under different illumination conditions, and the size (dimension) of each image is 32 × 32 = 1024. Some images of YaleB are shown in Figure (4). YaleB can be downloaded from the following website:

http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html

In the experiment, for each person, 50 face images are randomly taken as training samples and the remaining face images as testing samples. The experiment is repeated four times and the average of the four results is presented as the final result. The accuracy rates of classification are listed in Table 2 and are also plotted in Figure (7). From Table 2 or Figure (7), it can be seen that the accuracy rates of HSIC-LTSA and HSIC-LE are generally higher than those of LTSA and LE respectively, the exception being reduced dimension 150, where LE outperforms HSIC-LE. However, except at the lowest reduced dimensions, the accuracy rates of HSIC-LLE and LLE are much the same. This is discussed in Subsection 4.3.2.

Fig.4

Some Images of YaleB.

Fig.5

The Experimental Results of HSIC-ML and ML on Toy Data.

Table 2

The Accuracy Rates of Classification for YaleB, unit: %

the reduced dimension	LTSA	HSIC-LTSA	LE	HSIC-LE
2	4.96	5.84	32.05	35.99
5	6.61	28.70	36.82	42.32
10	11.72	32.15	39.59	43.48
20	29.52	42.75	39.74	46.21
30	32.05	44.21	41.49	46.55
40	34.63	44.89	43.77	45.77
50	35.65	45.48	44.60	46.60
60	35.94	46.79	44.89	47.13
70	36.72	47.37	45.23	46.55
80	37.11	47.62	44.70	47.91
90	38.08	47.62	46.25	48.20
100	39.49	47.76	47.76	48.49
150	40.71	46.79	51.31	48.15
200	42.22	49.12	48.98	49.08
without dimension reduction	53.84
the reduced dimension	LLE	HSIC-LLE
2	3.94	14.6
5	29.18	35.70
10	50.73	53.50
20	61.53	61.87
30	64.49	65.61
40	66.10	66.63
50	66.73	66.83
60	67.12	66.97
70	67.17	67.22
80	68.00	67.80
90	67.90	68.04
100	68.68	68.77
150	69.11	69.36
200	69.07	68.92
without dimension reduction	53.84

4.3 Discussion on the experimental results

It can be seen from the experimental results shown in Subsection 4.2 that HSIC-LTSA and HSIC-LE perform better than LTSA and LE. However, the performances of HSIC-LLE and LLE are much the same. Here we will try to account for these experimental results.

4.3.1 The improvement of HSIC to ML

Generally speaking, most manifold learning algorithms are so-called local preserving algorithms: the high dimensional data are first decomposed into a number of local areas, and the patterns of the local areas (local patterns) are calculated. The data in each local area are then dimensionally reduced so as to preserve the local patterns. For example, LTSA preserves local homeomorphism, LE preserves local similarity and LLE preserves local linearity. None of them considers which overall properties are preserved during dimensionality reduction. The addition of HSIC to ML makes up for this lack of overall consideration. From Tables 1 and 2 or Figures (6) and (7), it can be seen that the accuracy rates of HSIC-LTSA are much higher than those of LTSA (by more than 10 percentage points on the YaleB dataset at many reduced dimensions). HSIC-LE also performs better than LE in most cases.

Fig.6

The Accuracy Rates of Classification for MNIST.

Fig.7

The Accuracy Rates of Classification for YaleB.

4.3.2 The relation between LLE and HSIC-LLE

LLE is a local linearity-preserving algorithm for dimensionality reduction, in which each high dimensional data point is approximated by a linear combination of its nearest neighbors and this linear relation is preserved during dimensionality reduction. On the other hand, HSIC is essentially the covariance of two random variables. The normalized covariance is the correlation coefficient, which measures the degree of linear correlation. If one random variable is a linear function of another, the correlation coefficient of the two random variables reaches its maximum in absolute value. The proposed HSIC-LLE takes the maximization of the HSIC between the high dimensional data and the dimension-reduced data as a criterion for dimensionality reduction. Therefore, in the extreme case where the high dimensional data are a linear function of the dimension-reduced data, local linearity is preserved.

To be specific, let x_i represent a high dimensional data point, x_{i_1}, ⋯ , x_{i_K} the K nearest neighbors of x_i, and y_i, y_{i_1}, ⋯ , y_{i_K} the dimension-reduced data of x_i, x_{i_1}, ⋯ , x_{i_K}. In LLE, linear combination coefficients α₁, ⋯ , α_K with $\sum_{k = 1}^{K} α_{k} = 1$ are first calculated to meet $x_{i} = \sum_{k = 1}^{K} α_{k} x_{i_{k}}$ as closely as possible. Then, the dimension-reduced data y_i, y_{i_1}, ⋯ , y_{i_K} are calculated to meet $y_{i} = \sum_{k = 1}^{K} α_{k} y_{i_{k}}$ as closely as possible. If the HSIC between the high dimensional data and the dimension-reduced data reaches its maximum, i.e., the high dimensional data x can be linearly represented with the dimension-reduced data, x = ay + b, then $x_{i} = \sum_{k = 1}^{K} α_{k} x_{i_{k}} \Rightarrow a y_{i} + b = \sum_{k = 1}^{K} α_{k} (a y_{i_{k}} + b)$ $= a (\sum_{k = 1}^{K} α_{k} y_{i_{k}}) + (\sum_{k = 1}^{K} α_{k}) b = a (\sum_{k = 1}^{K} α_{k} y_{i_{k}}) + b$ $\Rightarrow y_{i} = \sum_{k = 1}^{K} α_{k} y_{i_{k}}$ (23) where the last equality in the chain uses the constraint $\sum_{k = 1}^{K} α_{k} = 1$ and the final implication requires a to be injective. In this extreme case, HSIC maximization already implies local linearity preservation, so the two criteria largely coincide. Therefore the addition of HSIC regularization to LLE has little effect on LLE. This may account for the phenomenon that the performances of LLE and HSIC-LLE are much the same.
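The argument of Eq. (23) can be checked numerically. The sketch below is only an illustration under stated assumptions: the data are synthetic, the map x = ay + b uses a random matrix a with full column rank, and the weights are computed with the usual sum-to-one constraint plus a small ridge term (`reg`, a hypothetical parameter added for numerical stability when K exceeds the intrinsic dimension).

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-9):
    """Sum-to-one weights alpha minimizing ||x - neighbors @ alpha||^2.

    x: (D,), neighbors: (D, K). A small ridge term keeps the local Gram matrix invertible.
    """
    diffs = neighbors - x[:, None]
    G = diffs.T @ diffs                                   # local Gram matrix (K x K)
    G = G + reg * np.trace(G) * np.eye(G.shape[0])
    alpha = np.linalg.solve(G, np.ones(G.shape[0]))
    return alpha / alpha.sum()                            # enforce sum(alpha) = 1

# Synthetic example (hypothetical): low dimensional points y and their affine images x = a y + b.
rng = np.random.default_rng(0)
d, D, K = 2, 5, 4
Y = rng.normal(size=(d, K + 1))          # column 0 is y_i, the others are its neighbors
a = rng.normal(size=(D, d))              # full column rank with probability one
b = rng.normal(size=D)
X = a @ Y + b[:, None]

alpha = lle_weights(X[:, 0], X[:, 1:])   # weights computed in the high dimensional space
x_rec = X[:, 1:] @ alpha                 # reconstruction of x_i
y_rec = Y[:, 1:] @ alpha                 # the same weights applied to the low dimensional points
```

Here `x_rec` reproduces x_i and, with the same weights, `y_rec` reproduces y_i up to numerical tolerance, as Eq. (23) predicts.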

4.3.3 LLE and manifold learning

Although LLE has been honored as the first manifold learning algorithm, it seems to have nothing to do with manifolds as defined in mathematics. In mathematics, manifolds are defined as topological spaces that are locally homeomorphic to Euclidean spaces. Homeomorphism means keeping the continuous dependencies between data unchanged, or keeping the geometric relationship between data unchanged, whereas LLE keeps the local linear relationship between data unchanged. Therefore, strictly speaking, LLE is not a manifold learning algorithm. LTSA, HLLE and LE preserve the local geometric relations during dimensionality reduction and are therefore manifold learning algorithms. The experimental results presented in Subsection 4.2 show that the addition of HSIC regularization improves the performance of manifold learning.

5 Conclusions (The Regularization Applications of HSIC)

Since HSIC was first proposed around 2005, it has found many applications in machine learning. However, it seems that HSIC has not been used for regularization so far. Regularization is a common practice in machine learning; a famous example is manifold regularization [7]. In this paper, the HSIC between the high dimensional data and the dimension-reduced data is added as a regularization term to the objective functions of manifold learning. This may be the first attempt to use HSIC for regularization.

In fact, there is a lot of room for using HSIC as a regularizer. For example, manifold learning is unsupervised. If the high dimensional data are labeled, the HSIC between the dimension-reduced data and the labels of the high dimensional data can be added as a regularization term to the objective functions of manifold learning: $trace (YL Y^{T}) - λ HSIC (Y, Z) \underset{choose Y}{\to} \min$ (24) where Z ∈ R^C×N contains the labels of the high dimensional data. As another example, manifold learning is often criticized for its inefficiency in handling out-of-sample data. A possible solution to this problem may be $trace (YL Y^{T}) - λ HSIC (W^{T} X, Y) \underset{choose Y, W}{\to} \min$ (25) where W ∈ R^D×d is the projection matrix, which can be used for the dimensionality reduction of out-of-sample data. In addition to manifold learning, other machine learning algorithms can also be improved by HSIC regularization, just as they are now often improved by manifold regularization.

References

1. Roweis S.T., Saul L.K., Nonlinear dimensionality reduction by locally linear embedding, Science 290(5500) (2000), 2323–2326.

2. Zhang, Zha, Principal manifolds and nonlinear dimensionality reduction via tangent space alignment, SIAM Journal on Scientific Computing 26(1) (2004), 313–338.

3. Donoho D.L., Grimes, Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data, Proceedings of the National Academy of Sciences 100(10) (2003), 5591–5596.

4. Belkin, Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, Advances in Neural Information Processing Systems 14(6) (2001), 585–591.

5. He, Niyogi, Locality preserving projections, Advances in Neural Information Processing Systems 16(1) (2003), 186–197.

6. Qiao, Zhang, Wang, Zhang, An explicit nonlinear mapping for manifold learning, IEEE Transactions on Cybernetics 43(1) (2013), 51–63.

7. Belkin, Niyogi, Sindhwani, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research 7(1) (2006), 2399–2434.

8. van der Maaten L.J.P., Postma E.O., van den Herik H.J., Dimensionality reduction: A comparative review, J Mach Learn Res 9(1) (2009), 66–71.

9. Cunningham J.P., Ghahramani, Linear dimensionality reduction: Survey, insights, and generalizations, Journal of Machine Learning Research 16(1) (2015), 2859–2900.

10. Blum M.G., Nunes M.A., Prangle, Sisson S.A., A comparative review of dimension reduction methods in approximate Bayesian computation, Statistical Science 28(2) (2013), 189–208.

11. Gretton, Bousquet, Smola, Schölkopf, Measuring statistical dependence with Hilbert-Schmidt norms, ALT, LNAI 3734 (2005), 63–78.

12. Schölkopf, Smola, Müller K.R., Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10(5) (1998), 1299–1319.

13. Mika, Rätsch, Weston, Schölkopf, Müller K.R., Fisher discriminant analysis with kernels, Neural Networks for Signal Processing IX (1999), 41–48.

14. Schölkopf, Christopher J.C., Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA 1999.

15. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathematical Society 68(3) (1950), 337–404.

16. Damodaran B.B., Courty, Lefèvre, Sparse Hilbert-Schmidt independence criterion and surrogate-kernel-based feature selection for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 55(4) (2017), 2385–2398.

17. Zhang, Zheng, Wang, Kwok, Yang, Marsic, Covariate Shift in Hilbert Space: A Solution via Surrogate Kernels, Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 388–395.

18. Gangeh M.J., Zarkoob, Ghodsi, Fast and scalable feature selection for gene expression data using Hilbert-Schmidt independence criterion, IEEE/ACM Transactions on Computational Biology and Bioinformatics 14(1) (2017), 167–181.

19. Dodge, Statistical data analysis based on the L1-Norm and related methods, Computational Statistics and Data Analysis 6(4) (2002), 280–282.

20. Song, Smola, Gretton, Bedo, Borgwardt, Feature selection via dependence maximization, Journal of Machine Learning Research 13(11) (2012), 1393–1434.

21. Camps-Valls, Mooij, Schölkopf, Remote sensing feature selection by kernel dependence measures, IEEE Geoscience and Remote Sensing Letters 7(3) (2010), 587–591.

22. Gangeh M.J., Ghodsi, Kamel M.S., Kernelized supervised dictionary learning, IEEE Transactions on Signal Processing 61(19) (2013), 4753–4767.

23. Gangeh M.J., Fewzee, Ghodsi, Kamel M.S., Karray, Multiview supervised dictionary learning in speech emotion recognition, IEEE/ACM Transactions on Audio, Speech and Language Processing 22(6) (2014), 1056–1068.

24. Barshan, Ghodsi, Azimifar, Jahromi M.Z., Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds, Pattern Recognition 44 (2011), 1357–1371.

25. Martinez A.M., Kak A.C., PCA versus LDA, IEEE Transactions on Pattern Analysis and Machine Intelligence 23(2) (2001), 228–233.

26. Zhang, Ma, Tan, On the equivalence of HLLE and LTSA, IEEE Transactions on Cybernetics 48(2) (2018), 748–753.