Abstract
At present, manifold learning is mainly applied to dimensionality reduction. However, from viewpoint of dimensionality reduction, manifold learning algorithms are only local feature preserving algorithms. For example, Local Linear Embedding is local linear preserving, Local Tangent Space Alignment is local homeomorphic preserving and Laplacian Eigenmap is local similarity preserving. The community of dimensionality reduction is now pursuing the algorithms which can preserve both local and global features of data during dimensionality reduction. In this paper, a new algorithm of dimensionality reduction, called Hilbert-Schmidt Independence Criterion Regularized Manifold Learning (HSIC-ML for short), is proposed, in which HSIC between the high dimensional data and the dimension-reduced data is added as a regularization term to the objective functions of manifold learning. The addition of HSIC regularization term makes HSIC-ML capable of preserving both local and global features during dimensionality reduction. HSIC is a criterion measuring the statistical dependence between two data sets and has been widely applied to machine learning in recent years. However, since HSIC was first proposed around 2005, there seems to have not been applied directly to dimensionality reduction, not applied as a regularization term either. The proposed HSIC-ML may be the first try in this respect. The experimental results presented in this paper show that the manifold learning with HSIC regularization performs better than that without HSIC regularization.
Keywords
Introduction
Dimensionality reduction is an important task for machine learning. As the advent of big data era, the curse of dimension has become more and more serious and therefore the algorithms of dimensionality reduction have attracted more and more attention. In the process of dimensionality reduction, information loss is inevitable. Therefore, the main concerns of algorithms of dimensionality reduction are what features of data can be preserved during dimensionality reduction. From this point of view, the algorithms of dimensionality reduction can be divided into three categories: local features preserving, global feature preserving and both local and global features preserving. In the local feature preserving algorithms, the high dimensional data are first divided into a number of local regions and each region is then dimensionally reduced without considering its interaction to other regions. The objective function of the whole algorithm is the sum of objective functions of all regions. Therefore, local feature preserving algorithms aim to preserve local features, not global features of data during dimensionality reduction. Most of manifold learning algorithms are local features preserving algorithms. For example, Local Linear Embedding (LLE for short) [1] is local linear preserving, Local Tangent Space Alignment (LTSA for short) [2] and Hessian LLE (HLLE for short) [3] are local homeomorphic preserving and Laplacian Eigenmap (LE for short) [4] is local similarity preserving. Manifold learning algorithms can be further developed into ones capable of preserving local and global features at the same time. For example, in the objective functions of manifold learning algorithms, the dimension-reduced data can be replaced with the linear [5] or polynomial [6] transform of the high dimensional data and it is the transform, not the dimension-reduced data, that is the target of optimization. Besides dimensionality reduction, another application of manifold learning is manifold regularization [7], in which the objective functions of manifold learning are added as regularization terms to the objective functions of other machine learning algorithms.
At present, there are many comprehensive and thorough literature reviews on dimensionality reduction such as [8–10]. The first two papers come from the theoretical magazines of machine learning, Journal of Machine Learning Research, while the third paper comes from the mathematical magazine of statistics and probability, Statistical Science.
Just as its name implies, Hilbert-Schmidt Independence Criterion (HSIC for short) measures the degree of statistical independence between two random variables or two datasets [11]. From the viewpoint of machine learning, HSIC is a kind of kernel methods. However, unlike other kernel methods (e.g., kernel PCA [12], kernel LDA [13], kernel SVM [14]), the theory of HSIC seems a little complicated. HSIC involves not only Reproducing Kernel Hilbert Space (RKHS), but also Hilbert-Schmidt operators (HS operators) as well as the space of HS operators. HSIC is defined as the norm of a HS operator and the HS operator is implicitly defined by a continuous linear functional on the HS operator space. Although the theory of HSIC is complicated, the calculation formula of HSIC between two datasets is quite simple. HSIC between two datasets is the trace of the product of two kernel matrices corresponding to the two datasets respectively. In recent years, HSIC has been applied to many fields of machine learning and the related literatures are increasing day by day. In this paper, a novel algorithm of HSIC regularized manifold learning (HSIC-ML) is proposed, in which HSIC between the high dimensional data and the dimension-reduced data is added as a regularization term to the objective functions of manifold learning. The addition of HSIC regularization makes HSIC-ML capable of preserving not only local features (local linearity, local homeomorphism, local similarity, etc.), but also global features (statistical correlation as a whole) during dimensionality reduction.
The remaining sections in this paper are arranged as follows: In Section 2, some works related to HSIC-ML are introduced. Since manifold learning is well known in machine learning, the introduction is mainly focused on HSIC. In Section 3, HSIC-ML is proposed. In Section 4, some experimental results are given, showing that manifold learning with HSIC regularization performs better than that without HSIC regularization. In section 5, some conclusions are given.
Related work
The proposed HSIC-ML is only to add a HSIC regularization term to the objective functions of manifold learning. Unlike other algorithms developed from manifold learning, the proposed HSIC-ML dose not make any change to the objective functions of manifold learning. Therefore, here we only review HSIC related works.
Since it was proposed around 2005 [11], HSIC has found wide applications in machine learning. HSIC involves RKHS [15] and therefore belongs to the category of kernel methods in machine learning. However, compared with other kernel methods, the theory of HSIC seems much more complicated. HSIC involves not only RKHS, but also HS operators between RKHS as well as continuous linear functionals over HS operator spaces. Interested readers can refer to [11] as well as the related technical reports written by the authors of [11].
Although the theory of HSIC is complicated, the empirical formula of HSIC is quite simple. In practice, HSIC is often used to measure the statistical correlation between two data sets.
Let X = [x1 ⋯ x
N
] ∈ RD×N and Y = [y1 ⋯ y
N
] ∈ Rd×N be two data sets, the empirical HSIC between X and Y is as follows [11]:
k X and k Y are two kernel functions and can be chosen according to practical applications.
In empiric HSIC shown in Eq.(1), the size of two data sets must be the same. However, this is not always true in practice. In [16, 17], the so-called surrogate kernel is then proposed to solve this problem. Let X = [x1 ⋯ x
N
] ∈ RD×N and Y = [y1 ⋯ y
N
] ∈ Rd×N and be two data sets, where N ≠ M, then the following matrices are defined:
In supervised machine learning, the numbers of data in classes are often different. [16] uses the surrogate kernels to calculate HSIC between the data of different classes:
In recent years, HSIC has often been applied to supervised feature selection. Let
For convenience, let’s assume the components of x ∈ R D are its features, then the supervised feature selection is to select some components of x which are most statistical correlative with the label z of x. [18] proposes an algorithm of supervised feature selection based on HSIC and sparse learning. Let s ∈ R D be the selection vector, the objective function is as follows:
∥s ∥ 1 is called the regularization term of sparse learning. The addition of ∥s ∥ 1 means finding the most sparse solution, i.e., the solution with the smallest number of nonzero components [19]. The positions of the nonzero components of are the positions of selected features.
[20] proposes two methods of supervised feature selection based on HSIC: forward and backward methods, denoted by FOHSIC and BAHSIC respectively. FOHSIC arranges the features of data in ascending order according to HSIC between the features and labels of data, while BAHSIC in descending order. There have been many developments and varieties based on FOHSIC and BAHSIC in recent years, such as [21] and etcetera.
HSIC has also been applied to supervised dictionary learning [22, 23] or supervised PCA [24]. The problem of dictionary learning can be expressed as follows:
In the supervised dictionary or subspace learning, the dictionary or subspace W is chosen based on the maximization of HSIC between the coefficients Y and the labels Z. The kernel function related to Y is set to the linear kernel function: k
Y
(y, y′) = y
T
y′. The kernel matrix of Y is
Dimensionality reduction is an important application of manifold learning. Therefore, the proposed HSIC-ML is also an algorithm of dimensionality reduction. The problem of dimensionality reduction can be expressed as follows: given a number of high dimensional data
HSIC-based dimensionality reductions (HSIC-DR)
HSIC has been widely applied to many fields of machine learning since it was first proposed around 2005. However, HSIC seems to have not been directly applied to dimensionality reduction so far. In this paper, a new algorithm for dimensionality reduction based on HSIC, called HSIC-DR, is first proposed. The objective function of HSIC-DR is as follows:
In the above equation, the dimension-reduced data Y are hidden in the kernel matrix K
Y
and inconvenient for optimization. In order to solve this problem, the positive definite linear kernel is chosen for the dimension-reduced data Y: k
Y
: R
d
× R
d
→ R such that for all y′, y″ ∈ R
d
,
Although HSIC can be used for dimensionality reduction by itself, it is preferable to combine HSIC with other algorithms to form more effective algorithms of dimensionality reduction. As described in Subsection 3.1, the proposed HSIC-DR is a global feature preserving algorithm. Therefore, it is naturally desired to combine the proposed HSIC-DR with local feature preserving algorithms. At present, manifold learning algorithms are the most commonly-used local feature preserving algorithms. Therefore, we propose a novel algorithm for dimensionality reduction, called HSIC regularized manifold learning (HSIC-ML), to combine HSIC with manifold learning.
Most of the objective functions of manifold learning can be expressed as following formml:
In the proposed HSIC-ML, HSIC between the high dimensional data X and the dimension-reduced data Y is added as a regularization term to the objective functions of manifold learning:
The objective function of HSIC-ML can be solved by using Rayleigh quotient. In fact,
Since YY T = I d , the row vectors of Y are orthonormal. The objective function can be then expressed as Rayleigh quotient
In this section, some experimental results of HSIC-ML and ML algorithms are presented for comparison. ML includes LTSA [2], HLLE [3], LE [4] and LLE [1]. Since HLEE has bee proven to be equivalent to LTSA [25, 26], the experimental results of HLLE and HSIC-HLLE are no longer listed separately. These experimental results of HLLE and HSIC-HLLE are the same as those of LTSA and HSIC-LTSA.
In addition, in the proposed HSIC-ML, the kernel function k1 for the high dimensional data X is open and can be chosen according to the applications at hand. This increases the flexibility of HSIC-ML. In the experiments presented in this section, the kernel function k1 is chosen to be the Gaussian kernel function.
Experimental results on toy data
Figure (5) shows the experimental results on toy data. The toy data and the experimental results of ML (here LTSA is chosen for ML) are produced on the platform of MANI. MANI is a platform commonly-used in manifold learning and can be downloaded free from internet. It can be seen from Figure (5) that the experimental results of HSIC-ML are reasonable and comparable with those of ML.
In Figure (5), Swiss Roll with or without hole is a rolled strip in the 3-dimensional space, the aim of manifold learning is to unroll the strip in the 2-dimensional plane. From Figure (5), it can be seen that both ML and HSIC-ML can unroll Swiss Roll. HSIC-ML seems to restore the shape of strip more faithfully.
Gaussian Hat, Punctured Sphere and Twin Peaks are objects in the 3-dimensional space, the aim of manifold learning is to present the top views of these objects. It can be seen from Figure (5) that the results of ML and HSIC-ML are quite similar to each other.
ToroidalHelix is a twisted circle in the 3-dimensional space, the aim of manifold learning is to restore the circle. Again both ML and HSIC-ML can finish the task.
The toy data listed in Figure (5) are often used in manifold learning. Since the geometric structures of these toy data are well known, a new algorithm of manifold learning is often first tested on these toy data to justify its reliability. It is clear that the proposed HSIC-ML passes the tests on toy data.
Figures (1) and (2) shows the experimental results of ML and HSIC-ML on the dataset of faces. This dataset is often used in many literatures of manifold learning. The face in the dataset only changes in gesture and expression. Therefore, although the faces are represented with high-dimensional vectors, it may be enough to represent these faces with 2 dimensional vectors. In Figures (1) and (2), the faces are dimension-reduced to 2 dimension plane with ML and HSIC-ML respectively. Some face images are also shown at the corresponding positions. It can be seen from Figures (1) and (2) that from up to bottom the face expression changes from serious to happy, while from left to right, the face gesture change from eastward to westward. The visual impression of HSIC-ML seems better than that of ML.

The Experimental Results of ML on Face Images.

The Experimental Results of HSIC-ML on Face Images.
The experimental results shown in Figures (1),(2) and (5) are qualitative, not quantitative, and are judged entirely by subjective feelings. In order to evaluate ML and HSIC-ML objectively, an example of classification is presented, where the high dimensional data are first dimension-reduced with ML and HSIC-ML respectively, and then classified with 3-NN method.
MNIST
MNIST is an image set of handwritten digits, the size (dimension) of each image is 32 × 32 = 1024. Some images of MNIST are shown in Figure (3). MNIST can be downloaded from the following website:

Some Images of MNIST.
https://github.com/jindongwang/transferlearning/ blob/master/doc/dataset.md#mnist+usps
In the experiment, for each digit, 200 images are taken as training samples and 100 images as testing samples. The accuracy rates of classification are listed in Table 1. The accuracy rates are also drawn in Figure (6). From Table1 or Figure (6) it can be seen that the accuracy rates of HSIC-LTSA and HISC-LE are higher than those of LTSA and LE respectively. However, the accuracy rates of HSIC-LLE and LLE are much that same. We will discuss this problem in a special subsection.
The Accuracy Rates of Classification for MNIST, unit:%
YaleB is an image set containing the face images of 38 persons, 59 64 face images are taken for each person under different conditions of illumination, the size (dimension) of each image is 32 × 32 = 1024. Some images of YaleB are shown in Figure (4). YaleB can be downloaded from the followingwebsite:
http://www.cad.zju.edu.cn/home/dengcai/Data/ FaceData.html
In the experiment, for each person, 50 face images are randomly taken as training samples and the remaining face images as testing samples. The experiment is repeated four times and the average of four experimental results is presented as the final experimental results. The accuracy rates of classification are listed in Table 2. The accuracy rates are also drawn in Figure (6). From Table 2 or Figure (6) it can be seen that the accuracy rates of HSIC-LTSA and HISC-LE are higher than those of LTSA and LE respectively. However, the accuracy rates of HSIC-LLE and LLE are much that same. We will discuss this problem in a special subsection.

Some Images of YaleB.

The Experimental Results of HSIC-ML and ML on Toy Data.
The Accuracy Rates of Classification for YaleB, unit: %
It can be seen from the experimental results shown in Subsection 4.2 that HSIC-LTSA and HSIC-LE perform better than LTSA and LE. However, the performances of HSIC-LLE and LLE are much the same. Here we will try to account for these experimental results.
The improvement of HSIC to ML
Generally speaking, most manifold learning algorithms belong to the so-called local preserving algorithms, in which the high dimensional data are first decomposed into a number of local areas and the patterns of local areas (local patterns) are calculated. The data in each local area are then dimensionally reduced according to the preservation of local patterns. For example, LTSA is local homeomorphism preserving, LE is local similarity preserving and LLE is local linearity preserving. None of them considers what overall properties will be preserving during dimensionality reduction. The addition of HSIC to ML makes up for the lack of overall consideration in ML. From Table 1 and 2 or Figure (6) and (6), it can be seen that the accuracy rates of HSIC-LTSA are much higher than those of LTSA (even more than 10% for YaleB dataset). HSIC-LE also performs better than LE.

The Accuracy Rates of Classification for MNIST.

The Accuracy Rates of Classification for YaleB.
LLE is a local linearity-preserving algorithm for dimensionality reduction, in which each high dimensional data is approximated by the linear combination of its nearest neighbors and this linear relation will be preserved during dimensionality reduction. On the other hand, HSIC is essentially the covariance of two random variables. The normalized covariance is the correlation coefficient, measuring the degree of linear correlation. If a random variable is the linear representation of another random variable, the correlation coefficient of these two random variables will reach its maximum. The proposed HSIC-LLE takes the maximization of HSIC between the high dimensional data and the dimension-reduced data as the criterion for dimensionality reduction. Therefore, as an extreme case, if the high dimensional data is the linear representation of the dimension-reduced data, then the local linearity will be preserved.
To be specific, let represent a high dimensional data, x
i
1
, ⋯ , x
i
K
the K nearest neighbors of x
i
and y
i
, y
i
1
, ⋯ , y
i
K
the dimension-reduced data of x
i
, x
i
1
, ⋯ , x
i
K
. In LLE, linear combination coefficients α1, ⋯ , α
K
are first calculated to meet
Although LLE has bee honored as the first algorithm of manifold learning, LLE seems to has nothing to do with the manifolds defined in mathematics. In mathematics, manifolds are defined as the topological spaces which are local homeomorphic to Euclidean spaces. Homeomorphism means keeping the continuous dependencies between data unchanged, or keeping the geometric relationship between data unchanged, while LLE keeps the local linear relationship between data unchanged. Therefore, strictly speaking, LLE is not a manifold learning algorithm. LTSA, HLLE and LE preserve the local geometric relation during dimensionality reduction and therefore are manifold learning algorithms. The experimental results presented Subsection 4.2 show that the addition of HSIC regularization will improve the performance of manifold learning.
Conclusions (The Regularization Applications of HSIC)
Since HSIC was first proposed around 2005, HSIC has found many applications in machine learning. However, it seems that there have been no regularization applications for HSIC so far. Regularization is a common practice in machine learning. A famous example in this respect is manifold regularization [7]. In this paper, HSIC between the high dimensional data and the dimension-reduced data is added as a regularization term to the objective functions of manifold learning. This may be the first try for regularization applications of HSIC.
In fact, there is a lot of room for regularization application of HSIC. For example, manifold learning is unsupervised learning. If the high dimensional data are labeled data, HSIC between the dimension-reduced data and the labels of high dimensional data can be added as a regularization term to the objective functions of manifold learning:
