Geodesic Kernel embedding Distribution Alignment for domain adaptation

Abstract

Domain adaptation aims to classify data in a new (target) domain accurately by using labeled images from an old (source) domain. It is a promising but challenging problem in computer vision. In this article, we propose a unified optimization model called Geodesic Kernel embedding Distribution Alignment (GKDA), which aims to reduce the difference between domains. GKDA avoids degenerated feature transformation by mapping features with a geodesic kernel, then reweights cross-domain instances in a principled way during dimensionality reduction, and finally constructs a new feature representation that is invariant to both the distribution difference and irrelevant instances. Experimental results show that GKDA has clear advantages in cross-domain image recognition.

Keywords

Domain adaptation, transfer learning, distribution alignment, geodesic kernel

1. Introduction

Humans continuously enhance their abilities by learning. In the past, machines mostly learned from scratch: they could neither reuse the knowledge they had already learned nor improve on it, which significantly limited machine learning. Traditional machine learning is based on statistical learning, which works well within its scope. However, because it rests on mathematical statistics, statistical learning requires that the data it learns from and the problems it is applied to share the same statistical characteristics. In general, therefore, it can only solve the same problem in the same field. As soon as the learning and application scenarios differ, the statistical characteristics change accordingly, which significantly degrades the performance of statistical learning. People, however, do not confine their learning to a single field. For instance, when we study physics, we use the mathematical foundation we built up earlier. Humans can thus transfer knowledge across different fields and problems, an ability that machine learning still lacks. The ability to transfer knowledge across different situations is called “transfer learning”.

With the development of big data, a key issue in machine learning is how to use labeled data from a related area to learn in a new area. Obviously, it is inefficient to discard the related annotated data entirely and rebuild the training data from scratch. To address this, a transfer learning approach called “domain adaptation” has been developed. It addresses the following problem: when the training data do not follow the same distribution as the test data, how can the training data be used to train a classifier that performs well on the test data?

Traditional machine learning algorithms require the training and test datasets to lie in the same space and to have the same feature distribution. Researchers then design a model and a criterion to classify or predict the test data. In many learning scenarios, however, the probability distribution of the training samples differs from that of the test samples. For instance, in image recognition the images fed to a classification or recognition system differ to some extent in shooting angle, pose, expression, illumination, image quality, and so on. Against this background, researchers proposed the concept of “transfer learning”, whose purpose is to use knowledge learned in an old domain to complete learning tasks in a new domain effectively. Domain adaptation (DA) was then proposed as a special case of transfer learning, which solves learning problems in the target domain by using training data from the source domain [1]. The central task of DA is therefore to reduce the difference between the sampled probability distributions of the two domains.

According to the literature reviews [1, 2], existing domain adaptation methods fall into two categories: i) feature matching, which exploits the geometric structure of the subspaces of the source and target domains [13, 15, 28], or narrows the gap between the marginal or conditional distributions of the two datasets [23, 17]; ii) instance reweighting [9, 27], which assigns weights to the source samples through a weighting technique so that the model trained on them classifies the target data better.

In this study, we propose a method based on Transfer Joint Matching (TJM) [24] and the Geodesic Flow Kernel (GFK) [15]. The method learns a common feature representation across domains through manifold feature embedding and thereby achieves domain adaptation. More specifically, it first maps the source and target datasets into the GFK kernel space, then uses the mapped source and target data for distribution alignment, and finally reweights the instances. We name it Geodesic Kernel embedding Distribution Alignment (GKDA); it is an unsupervised transfer learning algorithm. GKDA addresses three challenges: avoiding feature distortion, performing joint feature transformation, and avoiding negative transfer. We test our method on six datasets, which are introduced in Section 4, and compare it with several advanced traditional methods. With the same classifier (1NN), GKDA improves the average accuracy by at least 1.1% over the other methods, except D-GFK [30]. Section 2 reviews related work on DA. Section 3 describes GKDA in detail, including how GFK and TJM are combined. Section 4 discusses the experiments and results.

2. Related work

As noted above, feature matching and instance reweighting are the two major approaches to feature transfer. Feature matching aims to reduce the distribution difference through feature transformation or feature representation. Existing feature representation methods mainly follow three directions: I) extracting domain-invariant latent factors [15, 18, 30]; II) minimizing an appropriate distance measure [19, 20, 14, 29]; III) adjusting the weights of relevant features through sparse regularization [3, 17, 9]. In addition, [32] proposed the ‘EasyTL’ algorithm, which obtains the mapping through intra-domain alignment; it is a non-parametric algorithm with good results. Instance reweighting aims to reduce the distribution difference and negative transfer [11, 7, 4, 5, 31] by reweighting the source and target instances according to their relevance. Other papers [2, 23, 25, 30] estimate soft labels in the target domain by various methods and refine the pseudo labels iteratively, which improves domain adaptation and reduces the distribution distance between domains. However, these approaches are applied independently of each other, which is not sufficient for DA. In this study, we combine two methods [15, 24], covering geometric subspace learning, joint feature matching, and instance reweighting. In this process we use manifold feature maps and the infinite-dimensional RKHS to match features more effectively.

3. Geodesic Kernel embedding Distribution Alignment (GKDA)

3.1 Problem definition

Let $D_{s}=\{(x_{1},y_{1}),\ldots,(x_{n_{s}},y_{n_{s}})\}$ be a labeled source domain and $D_{t}=\{x_{n_{s}+1},\ldots,x_{n_{s}+n_{t}}\}$ an unlabeled target domain, where the feature spaces satisfy $X_{s}=X_{t}$ , the label spaces satisfy $Y_{s}=Y_{t}$ , the marginal probabilities satisfy $P_{s}(x_{s})\neq P_{t}(x_{t})$ , and the conditional probabilities satisfy $Q_{s}(y_{s}|x_{s})\neq Q_{t}(y_{t}|x_{t})$ . GKDA transforms the features of the two domains to reduce the distribution difference by i) mapping with the geodesic kernel $G$ , ii) aligning the feature distributions, and iii) reweighting the source samples.

3.2 Main idea (Geodesic Kernel embedding Distribution Alignment)

This study proposes to align the two domains through a feature transformation $T$ applied to the geodesic-kernel features $g(\cdot)$ :

$\displaystyle\min_{T\in\mathcal{H}}\left\|\mathbb{E}_{P(x_{s})}\left[T(g(x_{s}))\right]-\mathbb{E}_{P(x_{t})}\left[T(g(x_{t}))\right]\right\|+\lambda\left\|T\right\|_{2,1}$ (1)

3.2.1 Geodesic kernel learning

In practice, most real-world data are not distributed in a Euclidean space, so traditional Euclidean measurements are difficult to apply to non-linear real-world data. We therefore need a new hypothesis about the data distribution. A manifold is a locally Euclidean space, which includes curves and surfaces of various dimensions, such as spheres and curved planes: every local neighborhood of a manifold is homeomorphic to a Euclidean space. Like general dimensionality reduction, manifold learning maps the data from a high-dimensional space to a low-dimensional space. Unlike previous methods, manifold learning assumes that the data are sampled from an underlying manifold.

Figure 1.

The main idea of GFK.

Gong et al. [15] proposed an unsupervised DA algorithm based on manifold theory, called the Geodesic Flow Kernel (GFK). Its innovation is a new kernel function, “GFK”, which builds on the work of Gopalan et al. [8]. That earlier domain adaptation method is known as “Sampling Geodesic Flows” (SGF). SGF treats the source domain and the target domain as two points on a Grassmann manifold, constructs several intermediate points along the geodesic connecting them to obtain the corresponding subspaces, projects the two domains onto these subspaces, and uses the resulting features to train a classifier. The main idea of GFK (Fig. 1) is very similar, and GFK can be regarded as an extension of SGF. Its distinguishing feature is that it uses the geodesic to build an infinite-dimensional feature space that combines the information of the source domain, the target domain, and the virtual intermediate domains between them. The inner product in this space can be computed by an efficient closed-form kernel function, so the algorithm is finally expressed as a kernel. The first two steps of GFK are identical to those of SGF: the two domains are first embedded into two $d$-dimensional subspaces, which are regarded as two points on the Grassmann manifold, and a geodesic $\Phi(t)$ is then constructed between the two points. GFK then builds its kernel from the continuous geodesic $\Phi(t)$ . Here we give a brief introduction; details can be found in [15].

Let $X_{i}$ and $X_{j}$ be two original $D$-dimensional samples. GFK first maps them into a subspace through the mapping ${\bm{\Phi}}(t)$ , which gives two points ${\bm{\Phi}}(i)$ and ${\bm{\Phi}}(j)$ . ‘Walking’ along the geodesic from ${\bm{\Phi}}(i)$ to ${\bm{\Phi}}(j)$ , the mapped features can be represented through ${\bm{\Phi}}(t)^{T}$ . ${\bm{P}}_{s}\in{\mathbb{R}}^{D\times d}$ and ${\bm{P}}_{t}\in{\mathbb{R}}^{D\times d}$ are the bases of the source and target subspaces, and $d$ is the dimensionality of the subspaces. The orthogonal complement of ${\bm{P}}_{s}$ is denoted by ${\bm{R}}_{s}\in{\mathbb{R}}^{D\times(D-d)}$ .

From [15], their geodesic flow can be expressed as:

$\displaystyle{\bm{\Phi}}(t)=P_{s}U_{1}\Gamma(t)-R_{s}U_{2}\Sigma(t)=\left[P_{s}\ \ \ R_{s}\right]\left[\begin{array}{cc}U_{1}&0\\ 0&U_{2}\end{array}\right]\left[\begin{array}{c}\Gamma(t)\\ -\Sigma(t)\end{array}\right],$

where $U_{1}\in{\mathbb{R}}^{d\times d}$ and $U_{2}\in{\mathbb{R}}^{(D-d)\times d}$ are two orthonormal matrices that can be calculated by the following pair of singular value decompositions:

$\displaystyle P_{s}^{T}P_{t}=U_{1}\Gamma V^{T},\quad R_{s}^{T}P_{t}=-U_{2}\Sigma V^{T}.$

The kernel matrix can be computed in a closed-form

$\displaystyle G=\left[P_{s}U_{1}\ \ \ R_{s}U_{2}\right]\left[\begin{array}{cc}\Lambda_{1}&\Lambda_{2}\\ \Lambda_{2}&\Lambda_{3}\end{array}\right]\left[\begin{array}{c}U_{1}^{T}P_{s}^{T}\\ U_{2}^{T}R_{s}^{T}\end{array}\right],$ (2)

where $\Lambda_{1}$ , $\Lambda_{2}$ and $\Lambda_{3}$ are three diagonal matrices whose elements are determined by the principal angles $\theta_{i}$ between the two subspaces (the diagonal elements of $\Gamma$ and $\Sigma$ are $\cos\theta_{i}$ and $\sin\theta_{i}$ ):

$\displaystyle\lambda_{1i}=1+\frac{\sin(2\theta_{i})}{2\theta_{i}},\quad\lambda_{2i}=\frac{\cos(2\theta_{i})-1}{2\theta_{i}},\quad\lambda_{3i}=1-\frac{\sin(2\theta_{i})}{2\theta_{i}}.$

${\bm{G}}$ is a positive semidefinite matrix. Accordingly, the geodesic kernel induces a feature mapping $x\mapsto g(x)$ , and for the whole dataset we write $g(X)=[g(x_{1}),\ldots,g(x_{n})]$ .
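The closed form in Eq. (2) can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function name `gfk_kernel` is hypothetical, the subspace bases are assumed to be given as orthonormal matrices, and $U_{2}$ is recovered column by column from $R_{s}^{T}P_{t}V=-U_{2}\Sigma$ rather than by a generalized SVD routine. The $\lambda_{2i}$ convention used is $(\cos(2\theta_{i})-1)/(2\theta_{i})$ , as we understand the closed form in [15].

```python
import numpy as np

def gfk_kernel(Ps, Pt):
    """Sketch of the closed-form geodesic flow kernel G (D x D).

    Ps, Pt: (D, d) orthonormal bases of the source and target subspaces.
    """
    D, d = Ps.shape
    # Orthogonal complement R_s: the trailing left singular vectors of P_s.
    Rs = np.linalg.svd(Ps, full_matrices=True)[0][:, d:]
    # P_s^T P_t = U1 Gamma V^T, where Gamma = diag(cos(theta_i)).
    U1, cos_t, Vt = np.linalg.svd(Ps.T @ Pt)
    # R_s^T P_t V = -U2 Sigma, so column i of U2 is -B[:, i] / sin(theta_i).
    B = Rs.T @ Pt @ Vt.T
    sin_t = np.linalg.norm(B, axis=0)
    U2 = -B / np.where(sin_t > 1e-12, sin_t, 1.0)
    theta = np.arctan2(sin_t, cos_t)           # principal angles
    two_t = 2.0 * theta
    ratio = np.sinc(two_t / np.pi)             # sin(2*theta) / (2*theta), 1 at theta = 0
    lam1 = 1.0 + ratio
    lam3 = 1.0 - ratio
    small = two_t < 1e-12                      # avoid 0/0 when the subspaces coincide
    lam2 = np.where(small, 0.0,
                    (np.cos(two_t) - 1.0) / np.where(small, 1.0, two_t))
    left = np.hstack([Ps @ U1, Rs @ U2])       # [P_s U1, R_s U2], shape (D, 2d)
    core = np.block([[np.diag(lam1), np.diag(lam2)],
                     [np.diag(lam2), np.diag(lam3)]])
    return left @ core @ left.T

# Usage with two random 2-dimensional subspaces of R^6 (toy data, for illustration only).
rng = np.random.default_rng(0)
Ps, _ = np.linalg.qr(rng.standard_normal((6, 2)))
Pt, _ = np.linalg.qr(rng.standard_normal((6, 2)))
G = gfk_kernel(Ps, Pt)
```

With this scaling, the kernel reduces to $2P_{s}P_{s}^{T}$ when the two subspaces coincide, which is a convenient sanity check.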

Algorithm 1 summarizes the complete procedure.

Algorithm 1: GFK: Geodesic Flow Kernel
Input: Source data $\{X_{s},y_{s}\}$ , data $X=[X_{s},X_{t}]$ , geodesic kernel dimensionality $d$ .
Output: The GFK kernel ${\bm{G}}$ , an adaptive classifier $f$ (in the original paper $f$ is 1NN).
1 Calculate the GFK kernel ${\bm{G}}$ by Eq. (2), and get the new feature representation $Z=GX$ .
2 Return an adaptive classifier $f$ trained on $\{z_{i},y_{i}\}^{n_{s}}_{i=1}$ .

3.2.2 Kernel-PCA dimensionality reduction

As mentioned above, most data are non-linear, so traditional linear dimensionality reduction is not suitable. In this paper, we employ kernel PCA (KPCA), computed in a Reproducing Kernel Hilbert Space (RKHS). In mathematics, a Hilbert space is an extension of Euclidean space that may have infinitely many dimensions. Like Euclidean space, a Hilbert space is a complete inner product space. It can be thought of as an ‘infinite-dimensional Euclidean space’: limit operations never leave the space, and infinitely many dimensions are permitted. PCA projects a dataset from a high dimension to a low dimension so that a few features capture most of the information in the data. Its underlying principle is that the greater the variance of the data along a feature, the more information that feature contains. Let $X=[x_{1},\ldots,x_{n}]\in{\mathbb{R}}^{m\times n}$ with $n=n_{s}+n_{t}$ . The covariance matrix is $C=XHX^{T}$ , where $H=I-\frac{1}{n}\mathbf{1}$ is the centering matrix, $\mathbf{1}$ is the $n\times n$ matrix of ones, $C$ is an $m\times m$ matrix, and $m$ is the number of features. The goal of PCA is to find an orthogonal transformation $V$ that maximizes the projected variance:

$\displaystyle\max_{V^{T}V=I}\text{tr}(V^{T}XHX^{T}V)$ (3)

where $\text{tr}(\cdot)$ denotes the trace of a matrix. The problem is solved by eigenvalue decomposition: the optimal $V$ consists of the eigenvectors of $XHX^{T}$ corresponding to the $k$ largest eigenvalues. These eigenvectors project the dataset onto the selected components: $Z=[z_{1},\ldots,z_{n}]=V^{T}X$ .

KPCA: To work in the RKHS, consider a feature map $\varphi$ with kernel matrix $K=\varphi(X)^{T}\varphi(X)\in{\mathbb{R}}^{n\times n}$ . Representing the features as $Z=A^{T}K$ leads to kernel PCA:

$\displaystyle\max_{A^{T}A=I}\text{tr}(A^{T}KHK^{T}A)$ (4)

where $A\in{\mathbb{R}}^{n\times k}$ is the transformation matrix learned by KPCA, and the representation $Z=V^{T}X$ becomes $Z=A^{T}K$ .
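As a rough illustration of Eq. (4), the following NumPy sketch solves the KPCA problem by a symmetric eigen-decomposition. The function name `kernel_pca` and the choice of a linear kernel are assumptions made for the example; any positive semidefinite kernel matrix could be substituted.

```python
import numpy as np

def kernel_pca(X, k):
    """Sketch of Eq. (4): returns A (n x k) and the embedding Z = A^T K (k x n).

    X: (m, n) data matrix with samples as columns.
    """
    n = X.shape[1]
    K = X.T @ X                                # linear kernel (an assumption)
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix H = I - (1/n) 1
    S = K @ H @ K.T                            # symmetric positive semidefinite
    eigvals, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
    A = eigvecs[:, ::-1][:, :k]                # k leading eigenvectors, A^T A = I
    return A, A.T @ K

# Usage on toy data: 12 samples with 5 features, reduced to 3 dimensions.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 12))
A, Z = kernel_pca(X, 3)
```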

3.2.3 Maximum Mean Discrepancy (MMD)

So far we have only reduced the manifold-mapped data to a $k$-dimensional representation. Since the two distributions are assumed to differ, the main problem of domain adaptation is to minimize the distance between the two domains. In this study, we require the distributions $P(A^{T}X_{s})$ and $P(A^{T}X_{t})$ to be close. This idea was introduced by Transfer Component Analysis (TCA) [20], which uses the empirical Maximum Mean Discrepancy (MMD), a distance widely used in transfer learning algorithms [10, 19, 20].

To minimize the MMD between the two domains in the RKHS, the MMD distance is computed on the $k$-dimensional embeddings extracted by KPCA:

$\displaystyle\left\|\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}A^{T}k_{i}-\frac{1}{n_{t}}\sum_{j=n_{s}+1}^{n_{s}+n_{t}}A^{T}k_{j}\right\|_{\mathcal{H}}^{2}=\text{tr}(A^{T}KMK^{T}A)$ (5)

where $k_{i}$ is the $i$-th column of $K$ and ${\bm{M}}$ denotes the MMD matrix, which is calculated as

$\displaystyle M_{ij}=\left\{\begin{array}{ll}\frac{1}{n_{s}n_{s}},&x_{i},x_{j}\in D_{s}\\ \frac{1}{n_{t}n_{t}},&x_{i},x_{j}\in D_{t}\\ \frac{-1}{n_{s}n_{t}},&\text{otherwise}\end{array}\right.$ (6)

As in TCA, by maximizing Eq. (4) and minimizing Eq. (5), the first- and higher-order statistics of the feature distributions are matched under the new representation $Z=A^{T}K$ .
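The identity in Eq. (5) can be checked numerically. The sketch below builds the MMD matrix of Eq. (6) as an outer product and compares the trace form with the squared distance between the embedded domain means; the helper names `mmd_matrix` and `mmd_distance` and the random toy data are illustrative assumptions.

```python
import numpy as np

def mmd_matrix(ns, nt):
    """MMD matrix M of Eq. (6), built as the outer product e e^T."""
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    return np.outer(e, e)

def mmd_distance(A, K, ns):
    """Squared distance between the source and target means of Z = A^T K."""
    Z = A.T @ K
    gap = Z[:, :ns].mean(axis=1) - Z[:, ns:].mean(axis=1)
    return float(gap @ gap)

# Toy check that the trace form and the mean-difference form agree.
rng = np.random.default_rng(2)
ns, nt, k = 4, 3, 2
X = rng.standard_normal((5, ns + nt))
K = X.T @ X                                    # linear kernel (an assumption)
A = rng.standard_normal((ns + nt, k))
M = mmd_matrix(ns, nt)
trace_form = float(np.trace(A.T @ K @ M @ K.T @ A))
mean_form = mmd_distance(A, K, ns)
```

Writing $M=ee^{T}$ makes the equivalence explicit: $Ke$ is exactly the difference between the mean source column and the mean target column of $K$ .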

3.2.4 Instance reweighting

Nevertheless, feature matching alone is not sufficient when the two domains differ significantly, because some source samples are always irrelevant to the target domain. Feature matching should therefore be combined with an instance reweighting procedure to handle this difficult setting. Following TJM [24], we impose an $\ell_{2,1}$-norm structured sparsity regularizer for instance reweighting, which introduces row sparsity into the transformation matrix. Since each row of $A$ corresponds to an instance, row sparsity effectively reweights the instances. The instance reweighting regularizer is defined as:

$\displaystyle\left\|A_{s}\right\|_{2,1}+\left\|A_{t}\right\|_{F}^{2}$ (7)

where $A_{s}:=A_{1:n_{s},:}$ denotes the transformation matrix for the source instances and $A_{t}:=A_{n_{s}+1:n_{s}+n_{t},:}$ denotes that for the target instances.

By minimizing Eq. (7) while maximizing Eq. (4), the source instances that are relevant (irrelevant) to the target instances are adaptively given more (less) importance in the new representation. Instances are thus reweighted by their relevance during joint feature learning. In addition, such regularizers are commonly used in machine learning models to control model complexity and over-fitting.

3.2.5 Optimization problem

In this paper, we use the MMD distance to reduce the difference between the feature distributions of the domains, and the regularizer to reweight the source instances. Incorporating Eqs (5) and (7) into Eq. (4), we obtain the GKDA optimization problem:

$\displaystyle\min_{A^{T}KHK^{T}A=I}\text{tr}(A^{T}KMK^{T}A)+\lambda\left(\left\|A_{s}\right\|_{2,1}+\left\|A_{t}\right\|_{F}^{2}\right)$ (8)

where $\lambda$ denotes the regularization parameter; its effect is examined in Section 4. A key strength of GKDA is that it not only inherits from TJM the ability to perform feature matching and source instance reweighting jointly, but also addresses the problem of feature degradation in dimensionality reduction. GKDA is therefore expected to be more powerful than both TJM and GFK.

3.3 Learning algorithm

According to constrained optimization theory, let $\Phi=\text{diag}(\Phi_{1},\ldots,\Phi_{k})\in{\mathbb{R}}^{k\times k}$ denote the Lagrange multiplier. The Lagrange function for problem (8) is

$\displaystyle L=\text{tr}\left(A^{T}KMK^{T}A\right)+\lambda\left(\left\|A_{s}\right\|_{2,1}+\left\|A_{t}\right\|_{F}^{2}\right)+\text{tr}\left(\left(I-A^{T}KHK^{T}A\right)\Phi\right)$ (9)

Setting $\frac{\partial L}{\partial A}=0$ , we obtain the generalized eigen-decomposition

$\displaystyle\left(KMK^{T}+\lambda G_{s}\right)A=KHK^{T}A\Phi$ (10)

Since $\|A_{s}\|_{2,1}$ is non-smooth at zero, we use the sub-gradient $\frac{\partial(\|A_{s}\|_{2,1}+\|A_{t}\|^{2}_{F})}{\partial A}=2G_{s}A$ , where $G_{s}$ is a diagonal sub-gradient matrix whose $i$-th diagonal element is

$\displaystyle(G_{s})_{ii}=\left\{\begin{array}{ll}\frac{1}{2\|a^{i}\|},&x_{i}\in{\mathcal{D}}_{s},a^{i}\neq 0\\ 0,&x_{i}\in{\mathcal{D}}_{s},a^{i}=0\\ 1,&x_{i}\in{\mathcal{D}}_{t}\end{array}\right.$ (11)

where $a^{i}$ denotes the $i$-th row of $A$ . Because $G_{s}$ depends on $A$ , the problem is solved iteratively: with $G_{s}$ fixed, $A$ is obtained from the $k$ smallest eigenvectors of Eq. (10), and $G_{s}$ is then updated by Eq. (11).
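The regularizer of Eq. (7) and the sub-gradient entries of Eq. (11) can be made concrete with a small NumPy sketch. The function names and the three-row example matrix are hypothetical and serve only to show how a zero source row and a target row are treated.

```python
import numpy as np

def l21_norm(A):
    """l_{2,1} norm: the sum of the Euclidean norms of the rows of A."""
    return float(np.linalg.norm(A, axis=1).sum())

def subgradient_diag(A, ns):
    """Diagonal of G_s from Eq. (11); the first ns rows of A are source rows."""
    norms = np.linalg.norm(A[:ns], axis=1)
    src = np.zeros(ns)
    nonzero = norms > 0
    src[nonzero] = 1.0 / (2.0 * norms[nonzero])   # 1 / (2 ||a^i||) for non-zero source rows
    tgt = np.ones(A.shape[0] - ns)                # target rows get weight 1
    return np.concatenate([src, tgt])

# Two source rows (one of them zero) followed by one target row.
A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])
g = subgradient_diag(A, ns=2)                      # [0.1, 0.0, 1.0]
reg = l21_norm(A[:2]) + np.linalg.norm(A[2:], "fro") ** 2   # 5 + 1 = 6
```

Multiplying by $2G_{s}$ maps the first row to $a^{1}/\|a^{1}\|=(0.6,0.8)$ , which is the gradient of $\|a^{1}\|$ , and maps the target row to $2a^{3}$ , the gradient of $\|a^{3}\|^{2}$ .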

Algorithm 2 summarizes the complete procedure. Following Joint distribution alignment (JDA) [23], $M_{0}$ is the MMD matrix used for unsupervised learning and $M_{c}$ is the one used for supervised learning.

Algorithm 2: GKDA: Geodesic Kernel embedding Distribution Alignment
Input: Data $X=\left[X_{s},X_{t}\right]$ , geodesic kernel dimensionality $d$ , subspace dimensionality $k$ , regularization parameter $\lambda$ .
Output: Adaptive classifier $f$ (in this paper $f$ is 1NN).
1 Calculate the GFK kernel $G$ by Eq. (2), and get the new feature representation $Z=GX$ .
2 Construct the MMD matrix $M$ by Eq. (6), and select the PCA kernel type to construct the kernel $K$ .
Set $M\leftarrow M_{0}/M_{c}$ , $G_{s}\leftarrow I$ .
3 Repeat
Eigen-decompose the matrix in Eq. (10) and choose the eigenvectors with the $k$ smallest eigenvalues to construct the transformation matrix $A$ , and represent $K\leftarrow A^{T}Z$ .
Update $G_{s}$ based on Eq. (11).
4 Until convergence
5 Return an adaptive classifier $f$ trained on $\{A^{T}z_{i},y_{i}\}^{n_{s}}_{i=1}$ .
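The iterative part of Algorithm 2 can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes the generalized eigenproblem takes the form $(KMK^{T}+\lambda G_{s})A=KHK^{T}A\Phi$ , uses a linear kernel on the geodesic-mapped features $Z$ , runs a fixed number of iterations $T$ instead of testing convergence, and adds a small `eps` to keep the sub-gradient weights finite. The names `gkda_like_fit` and `mmd_matrix` are hypothetical, and the eigenvectors are left unit-norm rather than rescaled to satisfy $A^{T}KHK^{T}A=I$ .

```python
import numpy as np

def mmd_matrix(ns, nt):
    """MMD matrix of Eq. (6) as an outer product."""
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    return np.outer(e, e)

def gkda_like_fit(Z, ns, k=2, lam=1.0, T=10, eps=1e-6):
    """Sketch of steps 2-4 of Algorithm 2.

    Z: (m, n) geodesic-mapped features, source columns first (assumed precomputed).
    Returns A (n x k) and the embedding A^T K (k x n).
    """
    n = Z.shape[1]
    K = Z.T @ Z                                  # linear kernel (an assumption)
    M = mmd_matrix(ns, n - ns)
    H = np.eye(n) - np.ones((n, n)) / n
    g = np.ones(n)                               # diagonal of G_s, initialised to I
    b = K @ H @ K.T
    for _ in range(T):
        a = K @ M @ K.T + lam * np.diag(g)
        # a v = phi b v  <=>  a^{-1} b v = (1/phi) v, so the smallest phi
        # correspond to the largest eigenvalues of a^{-1} b.
        w, V = np.linalg.eig(np.linalg.solve(a, b))
        order = np.argsort(-w.real)[:k]
        A = V[:, order].real
        # Sub-gradient update for the source rows; target rows keep weight 1.
        g[:ns] = 1.0 / (2.0 * np.linalg.norm(A[:ns], axis=1) + eps)
        g[ns:] = 1.0
    return A, A.T @ K

# Usage on toy data: 6 source and 5 shifted target samples with 4 features.
rng = np.random.default_rng(0)
ns, nt = 6, 5
Z = np.hstack([rng.standard_normal((4, ns)), rng.standard_normal((4, nt)) + 1.0])
A, emb = gkda_like_fit(Z, ns, k=2, lam=1.0, T=10)
```

Inverting the left-hand matrix instead of calling a generalized eigen-solver keeps the sketch NumPy-only; it relies on $KMK^{T}+\lambda G_{s}$ being positive definite, which holds here because the diagonal of $G_{s}$ stays strictly positive.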

4. Experiments and evaluations

The performance of GKDA is assessed by extensive experiments on six public datasets that are widely used to evaluate DA methods.

4.1 Data preparation

We assess domain adaptation algorithms on six widely used datasets, which are described below.¹

¹ https://github.com/jindongwang/transferlearning/blob/master/data/dataset.md

MNIST and USPS are handwritten digit datasets. The digit images are size-normalized and centered, which saves considerable preprocessing and formatting time. The two datasets differ in size: USPS contains 7,291 training images and 2,007 test images of size 16 $\times$ 16, while MNIST contains 60,000 training images and 10,000 test images of size 28 $\times$ 28. In the assessment we therefore rescale all MNIST instances to 16 $\times$ 16. Two tasks are formulated: $U\to M$ and $M\to U$ , using 1,800 images randomly sampled from USPS and 2,000 images randomly sampled from MNIST.

Office-31 and Caltech-256 [6] are object recognition datasets. Office-31 contains three real-world object domains: Amazon (A), Webcam (W) and DSLR (D), with 4,652 images in 31 categories. Caltech-256 (C) consists of 30,607 images in 256 categories. These two datasets are used to evaluate almost every domain adaptation algorithm. In these experiments, we follow the protocol released by Gong et al. [6] and represent each image by SURF features quantized into an 800-bin histogram. This gives 12 cross-domain tasks: $D\to A$ , $W\to C$ , $\ldots$ , $C\to W$ .

ImageNet (I) and VOC2007 (V) [16] are image recognition datasets used in annual visual recognition competitions. We select five classes, ‘bird’, ‘car’, ‘chair’, ‘dog’ and ‘person’, from both datasets to construct two domains, which gives the tasks $I\to V$ and $V\to I$ .

4.2 Advanced comparison methods

The performance of GKDA is compared with that of the following baseline and advanced traditional methods for visual image classification.

1. 1NN, SVM, and PCA
2. Transfer Component Analysis (TCA) [20] $+$ 1NN
3. Geodesic Flow Kernel (GFK) [15] $+$ 1NN
4. Joint distribution alignment (JDA) [23] $+$ 1NN
5. Transfer Joint Matching (TJM) [24] $+$ 1NN
6. Adaptation Regularization (ARTL) [22] $+$ 1NN
7. Correlation Alignment (CORAL) [21] $+$ SVM
8. Scatter Component Analysis (SCA) [18] $+$ 1NN
9. Transfer Independently Together (TIT) [29] $+$ SVM
10. Discriminative Geodesic Flow Kernel (D-GFK) [30] $+$ 1NN
11. Optimal Transport (OT-GL) [31] $+$ 1NN

GKDA is most closely related to TJM; it differs from TJM by the additional Geodesic Flow Kernel learning. Because 1NN does not require cross-validation to tune parameters, we choose 1NN as the base classifier.

4.3 Implementation details

Following [6, 20], this study uses the same protocol to evaluate the model. First, we obtain the new representation of the source and target data. Next, we train a classifier (1NN) on the adapted source data. Finally, we use the trained classifier to classify the unlabeled target data. Because the source and target distributions differ, cross-validation cannot be used to select the optimal parameters, which is why we choose the 1NN classifier. Accordingly, we assess GKDA through an empirical search of the parameter space, select the parameter setting with the highest average accuracy over all datasets, and report the best results for GKDA. The results of the baseline methods are taken from [12]. For manifold kernel learning, we search the dimension $d\in$ [10, 20, $\ldots$ , 70]. For the subspace dimensionality, we search $k\in$ [10, 20, $\ldots$ , 100]. For the transfer learning approaches, the adaptation regularization parameter is searched over $\lambda\in\{0.01,0.1,1,10,100\}$ . Unless otherwise specified, we set $d=$ 60, $k=$ 20, $T=$ 10 (the number of iterations), and $\lambda=$ 1.

Classification accuracy on the test data, which is widely used in the literature [20, 6, 16], is the evaluation metric:

$\displaystyle\text{Acc}=\frac{\text{length}(Y_{\text{pred}}=Y_{\text{real}})}{\text{length}(Y_{\text{real}})}$
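The evaluation step can be sketched with a minimal 1NN classifier and the accuracy measure above. The function names and the toy points are illustrative assumptions; samples are stored as columns, as elsewhere in this section.

```python
import numpy as np

def one_nn_predict(Zs, ys, Zt):
    """Label each target column of Zt with the label of its nearest source column."""
    diff = Zt.T[:, None, :] - Zs.T[None, :, :]      # shape (nt, ns, k)
    dist2 = (diff ** 2).sum(axis=2)                 # squared Euclidean distances
    return ys[dist2.argmin(axis=1)]

def accuracy(y_pred, y_real):
    """Fraction of target samples whose predicted label equals the true label."""
    return float(np.mean(np.asarray(y_pred) == np.asarray(y_real)))

# Two source points at (0, 0) and (10, 10); three target points.
Zs = np.array([[0.0, 10.0],
               [0.0, 10.0]])
ys = np.array([0, 1])
Zt = np.array([[1.0, 9.0, 2.0],
               [1.0, 8.0, 0.0]])
y_pred = one_nn_predict(Zs, ys, Zt)                 # [0, 1, 0]
acc = accuracy(y_pred, np.array([0, 1, 1]))         # 2 of 3 correct
```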

Table 1
Accuracy (%) on Office+Caltech10 datasets using SURF features

Method	C $\to$ A	C $\to$ W	C $\to$ D	A $\to$ C	A $\to$ W	A $\to$ D	W $\to$ C	W $\to$ A	W $\to$ D	D $\to$ C	D $\to$ A	D $\to$ W	AVG
1NN	23.7	25.8	25.5	26	29.8	25.5	19.9	23	59.2	26.3	28.5	63.4	31.38333
SVM	53.1	41.7	47.8	41.7	31.9	44.6	28.8	27.6	78.3	26.4	26.2	52.5	41.71667
PCA $+$ 1NN	39.5	34.6	44.6	39.0	35.9	33.8	28.2	29.1	89.2	29.7	33.2	86.1	43.575
TCA $+$ 1NN	45.6	39.3	45.9	42.0	40.0	35.7	31.5	30.5	91.1	33.0	32.8	87.5	46.24167
GFK $+$ 1NN	46.0	37.0	40.8	40.7	37.0	40.1	24.8	27.6	85.4	29.3	28.7	80.3	43.14167
JDA $+$ 1NN	43.1	39.3	49	40.9	38.0	42.0	33.0	29.8	92.4	31.2	33.4	89.2	46.775
TJM $+$ 1NN	46.8	39.0	44.6	39.5	42.0	45.2	30.2	30.0	89.2	31.4	32.8	85.4	46.34167
CORAL $+$ SVM	52.1	46.4	45.9	45.1	44.4	39.5	33.7	36.0	86.6	33.8	37.7	84.7	48.825
SCA $+$ 1NN	45.6	40.0	47.1	39.7	34.9	39.5	31.1	30.0	87.3	30.7	31.6	84.4	45.15833
ARTL $+$ 1NN	44.1	31.5	39.5	36.1	33.6	36.9	29.7	38.3	87.9	30.5	34.9	88.5	44.29167
BDA $+$ 1NN	44.9	38.6	47.8	40.8	39.3	43.3	28.9	33.0	91.7	32.5	33.1	91.9	47.15
TIT $+$ SVM	59.7	51.5	48.4	47.5	45.4	47.1	34.9	40.2	87.9	36.7	42.1	84.8	52.2
D-GFK $+$ 1NN	54.3	46.4	49.6	45.6	41.6	46.5	35.0	38.6	90.4	32.0	38.1	84.4	50.2
OT-GL $+$ 1NN	44.2	38.9	44.5	34.6	37.0	38.9	36.0	39.6	84.0	32.4	37.2	81.1	47.70
GKDA $+$ 1NN	48.1	42.71	52.86	40.2	42.03	48.4	31.3	36.22	89.17	32.4	34	89.14	48.8775

Table 2

Accuracy (%) on USPS(U) $+$ MNIST(M) and ImageNet(I) $+$ VOC2007(V) datasets

Algorithm	U $\to$ M	M $\to$ U	I $\to$ V	V $\to$ I
1NN	44.1	65.9	50.8	38.2
SVM	62.2	68.2	52.4	42.7
PCA $+$ 1NN	45	66.2	58.4	65.1
TCA $+$ 1NN	51.2	56.3	63.7	64.9
GFK $+$ 1NN	46.5	61.2	59.5	73.8
JDA $+$ 1NN	59.7	67.3	63.4	70.2
TJM $+$ 1NN	52.3	63.3	63.7	73
CORAL $+$ SVM	30.5	49.2	59.6	70.3
SCA $+$ 1NN	48	65.1	–	–
ARTL $+$ 1NN	67.7	88.8	62.4	72.2
TIT $+$ 1NN	57.85	69.96	–	–
GKDA $+$ 1NN	57.3	70.1	64.2	73.5

Tables 1 and 2 report the classification (recognition) accuracy of GKDA and of the 14 baseline methods on the 16 cross-domain tasks. Fig. 2 presents the results more intuitively. It shows that GKDA performs better than 11 of the 14 baseline methods on most tasks. The average classification accuracy of GKDA on the 12 Office+Caltech10 tasks (Table 1) is 48.88%. GKDA achieves a 1.1% improvement over OT-GL, which is itself better than the other baseline methods except CORAL and TIT. Although CORAL and TIT are unsupervised transfer learning algorithms, they are trained with an SVM classifier, which requires manual selection of appropriate classifier parameters. In this study we use 1NN as the base classifier because it has no parameters to tune by cross-validation. In the results, SVM is about 10% more accurate than 1NN, which indicates that SVM is a more powerful classifier than 1NN. Therefore, although the reported result of TIT is better than that of our method, this does not prove that TIT itself is better than our method.

JDA is a supervised transfer learning method, which uses the source domain labels to predict pseudo labels for the target domain. In contrast, GKDA is an unsupervised transfer learning algorithm that works without the source domain labels. Note that the difficulty of adaptation varies significantly across the tasks: the average classification accuracy of the standard 1NN classifier is only 31.4% (Table 1), which is poor on a large number of the tasks.

The basic approach of GKDA is to map the source and target domains with the geodesic kernel of GFK and then align the distributions of the two domains; GKDA thus mainly relies on the methods of GFK and TJM. Nevertheless, Table 1 shows that the average accuracy of GKDA is 5.73% and 2.54% higher than that of GFK and TJM, respectively. So although GKDA builds on GFK and TJM, it improves on both. Its superiority over GFK and TJM is further illustrated as follows.

Figure 2.

Recognition accuracy (%) on Office $+$ Caltech10 datasets using SURF features cross-domain (12 datasets).

First, as shown in Fig. 2, GKDA outperforms TJM on almost every task and is only slightly behind the best method. This indicates that GKDA learns a more effective feature representation for handling the difference between domains.

Second, as shown in Fig. 2, TCA, JDA, ARTL, and TJM, which are based on distribution alignment, perform worse than GKDA, because some feature distortion occurs during dimensionality reduction. GFK, CORAL, and SCA, which are based on subspace learning, also perform worse than GKDA, because these methods do not align the distributions of the two domains. These results reveal the weaknesses of the above methods in coping with degenerated feature transformation. Although ARTL performs better than GKDA on the digit datasets, it performs worse on the object datasets. Since the object image datasets are more complex than the digit datasets, we consider GKDA preferable to ARTL on complex domains. Other studies [25, 26] have shown that manifold learning can reduce the feature degradation of the original space, because features in the manifold space are represented with a specific geometric structure. This explains why ARTL performs better only on the digit datasets, and it indicates that manifold structure learning is effective and robust for domain adaptation between visual domains that differ greatly.

D-GFK and GKDA share the first step of GFK. Specifically, D-GFK first maps the two domains into two n-dimensional subspaces and regards them as two points on the Grassmann manifold. In the second step, D-GFK predicts soft labels by “label propagation” and refines them over several iterations. Finally, D-GFK trains on the labeled domains. Although D-GFK has a 1.4% improvement over GKDA on average, Table 1 shows that GKDA is still better than D-GFK on 5 of the 12 tasks (C $\to$ D, A $\to$ W, A $\to$ D, D $\to$ C, and D $\to$ W). We therefore still consider GKDA a powerful algorithm.

Third, GKDA is clearly better than TJM, especially on the object datasets, which suggests that the more the domains differ, the larger the advantage of GKDA. GKDA is therefore better suited to domain adaptation between significantly different domains, and it alleviates the degenerated feature transformation problem of TJM.

4.4 Effectiveness analysis

TJM performs distribution alignment and instance reweighting in the source domain, which is similar to GKDA. GKDA adds manifold learning through geodesic kernel mapping: it obtains new source and target representations through the geodesic kernel, and then aligns the marginal distributions and reweights the instances. For this reason, we primarily analyze the difference between GKDA and TJM.

Figure 3.

Comparison of TJM and GKDA with different regularization parameters.

In Fig. 3, when the parameter $\lambda$ is searched over $\lambda \in \{0.05, 0.1, \ldots, 1\}$, the highest accuracy of GKDA (52.86%) is about 5 percentage points higher than the best accuracy of TJM (47.77%). TJM appears relatively insensitive to the regularization parameter: as the parameter ranges from 0.01 to 100, its accuracy varies only between 44.58% and 47.77%. The accuracy of GKDA ranges from 40.13% to 52.86%, which means that GKDA improves the accuracy but is also more sensitive to the regularization parameter. Moreover, for regularization parameters between 0.05 and 0.9, the accuracy of GKDA is higher than that of TJM. Whatever value the regularization parameter takes, TJM does not reach the best performance of GKDA, which supports the effectiveness of GKDA. However, GKDA also inherits the disadvantage of GFK and performs worse than TJM on the digit datasets, which indicates that on the digit datasets distribution alignment is more important than instance reweighting.

Figure 4.

Accuracy (%) w.r.t. #iterations.

Figure 5.

MMD distance w.r.t. #iterations.

As shown in Fig. 4, GKDA and TJM usually converge within $T = 10$ iterations on the object dataset. The accuracy of GKDA is lower than that of TJM in the first iteration, but as the iterations proceed it becomes significantly higher. As shown in Fig. 5, the MMD distance decreases as the iterations proceed; even when it increases at some iteration, it still decreases slowly over several iterations. Figs 4 and 5 together show that a smaller MMD distance does not always yield better results: the MMD distance significantly affects the transfer performance, but reducing it is not always beneficial. When geodesic kernel mapping is added, the result of the first iteration gets worse, which means that geodesic kernel mapping initially worsens the marginal distribution alignment. However, the MMD distance then decreases significantly faster in the subsequent iterations, which means that the two conditional probability distributions in the manifold subspace approach each other faster. Because the geodesic kernel originates from the GFK algorithm, GKDA performs better on the object dataset: adding geodesic kernel mapping to the iterative process makes TJM converge faster and significantly improves the performance on most datasets.
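The MMD distance tracked in Fig. 5 can be illustrated in its simplest, linear-kernel form, where it reduces to the squared distance between the two domain means. The sketch computes this quantity both directly and through the trace form with an MMD matrix $M$ that is common in distribution alignment methods. The kernel actually used in the experiments may differ, and all function names and the toy data are illustrative assumptions.

```python
import numpy as np


def mmd_matrix(n_s, n_t):
    """MMD matrix M = e e^T with e_i = 1/n_s for source rows, -1/n_t for target rows."""
    e = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    return np.outer(e, e)


def linear_mmd(Zs, Zt):
    """Squared distance between the source and target feature means."""
    diff = Zs.mean(axis=0) - Zt.mean(axis=0)
    return float(diff @ diff)


def linear_mmd_trace(Zs, Zt):
    """Same quantity written as tr(Z^T M Z), with Z stacking source then target rows."""
    Z = np.vstack([Zs, Zt])
    M = mmd_matrix(len(Zs), len(Zt))
    return float(np.trace(Z.T @ M @ Z))


# Toy embedded features (assumption): the target is the source shifted by (1, 2, 2).
rng = np.random.default_rng(0)
Zs = rng.normal(size=(50, 3))
Zt = Zs + np.array([1.0, 2.0, 2.0])

print(linear_mmd(Zs, Zt))        # squared norm of the shift, 1 + 4 + 4
print(linear_mmd_trace(Zs, Zt))  # agrees with the direct form up to rounding
```

Evaluating such a function on the embedded source and target features after every iteration would give a curve of the kind shown in Fig. 5.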

Figures 6 and 7 show the sensitivity of GKDA to its two dimensionality parameters (Fig. 6: $k = 10$; Fig. 7: $d = 60$). In Fig. 6, the model obtains better results when $d$ ranges from 60 to 80, so we usually use $d = 60$ to test the model. As shown in Fig. 7, the model obtains close to its best result when $k = 20$.

4.5 Convergence and time complexity

The convergence of GKDA is examined empirically in Fig. 4, which shows that the classification accuracy converges within only 10 iterations.

Table 3
Running time (s) of GKDA and some other methods

Task	#Sample $\times$ #Feature	GKDA	TJM	JDA	ARTL
C $\to$ D	1,280 $\times$ 800	10.11	10.30	11.80	7.53
M $\to$ U	3,800 $\times$ 256	104.58	103.07	109.07	61.48

Figure 6.

Accuracy (%) w.r.t. Geodesic kernel dimensionality (d).

Figure 7.

Accuracy (%) w.r.t. subspace dimensionality (k).

Time complexity is assessed on the task C $\to$ D, with 1,280 images and 800 features. The results are given in Table 3. GKDA iterates $T$ times, so its running time is similar to that of TJM and slightly lower than that of JDA.

5. Conclusion and future work

We proposed GKDA, a new transfer joint matching method for domain adaptation. The goal of GKDA is to reduce the distance between the feature distributions of different domains, and it combines the advantages of TJM and GFK. Its strength is that it addresses both distribution differences and unrelated instances. The overall experimental results show that GKDA is effective on various cross-domain problems and clearly outperforms existing adaptation approaches even when the domains differ remarkably. The final aim of transfer learning is to narrow the gap between two different domains. However, some samples in the source domain always cause substantial negative transfer, which severely hinders reducing the distance between domains. In future studies, we will try to modify or remove such source-domain samples through suitable rules, so as to avoid negative transfer and achieve better distribution alignment.

Footnotes

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

References

1. Pan S.J. and Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22(10) (2009), 1345–1359.
2. Wang et al., Everything about Transfer Learning and Domain Adaptation, http://transferlearning.xyz, 2018.
3. Andreas, Evgeniou and Pontil, Multi-task feature learning, Advances in Neural Information Processing Systems, 2007, pp. 41–48.
4. Bruzzone and Marconcini, Domain adaptation problems: A DASVM classification technique and a circular validation strategy, IEEE Transactions on Pattern Analysis and Machine Intelligence 32(5) (2009), 770–787.
5. Chen, Weinberger K.Q. and Blitzer J.C., Co-training for domain adaptation, In Advances in Neural Information Processing Systems (2011), 2456–2464.
6. Ben-David, Blitzer, Crammer and Pereira, Analysis of representations for domain adaptation, In Advances in Neural Information Processing Systems (2007), 137–144.
7. Chu W.S., De la Torre and Cohn J.F., Selective transfer machine for personalized facial action unit detection, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3515–3522.
8. Gopalan and Chellappa, Domain adaptation for object recognition: An unsupervised approach, In 2011 International Conference on Computer Vision, 2011, pp. 999–1006.
9. and Han, Joint feature selection and subspace learning, In Twenty-Second International Joint Conference on Artificial Intelligence, 2011, pp. 1294–1299.
10. Gretton, Borgwardt, Rasch, Schölkopf and Smola A.J., A kernel method for the two-sample-problem, In Advances in Neural Information Processing Systems, 2007, pp. 513–520.
11. Huang, Gretton, Borgwardt, Schölkopf and Smola A.J., Correcting sample selection bias by unlabeled data, In Advances in Neural Information Processing Systems, 2007, pp. 601–608.
12. Wang, Feng, Chen, Huang and Yu P.S., Visual domain adaptation with manifold embedded distribution alignment, In 2018 ACM Multimedia Conference on Multimedia Conference, 2018, pp. 402–410.
13. Fernando, Habrard, Sebban and Tuytelaars, Unsupervised visual domain adaptation using subspace alignment, In Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2960–2967.
14. Long, Ding, Wang, Sun, Guo and Yu P.S., Transfer sparse coding for robust image representation, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 407–414.
15. Gong, Shi, Sha and Grauman, Geodesic flow kernel for unsupervised domain adaptation, In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2066–2073.
16. Sun and Saenko, Deep CORAL: Correlation Alignment for Deep Domain Adaptation, Computer Vision – ECCV 2016 Workshops, 2016, pp. 443–450.
17. Zhang and Ogunbona, Joint geometrical and statistical alignment for visual domain adaptation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1859–1867.
18. Ghifary, Balduzzi, Kleijn W.B. and Zhang, Scatter component analysis: A unified framework for domain adaptation and domain generalization, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(7) (2016), 1414–1430.
19. Pan S.J., Kwok J.T. and Yang, Transfer Learning via Dimensionality Reduction, in AAAI 8 (2008), 677–682.
20. Pan S.J., Tsang I.W., Kwok J.T. and Yang, Domain adaptation via transfer component analysis, IEEE Transactions on Neural Networks 22(2) (2010), 199–210.
21. Sun, Feng and Saenko, Return of frustratingly easy domain adaptation, In Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 2058–2065.
22. Long, Wang, Ding et al., Adaptation regularization: A general framework for transfer learning, IEEE Transactions on Knowledge and Data Engineering 26(5) (2013), 1076–1089.
23. Long, Wang, Ding et al., Transfer feature learning with joint distribution adaptation, In Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2200–2207.
24. Long, Wang, Ding et al., Transfer joint matching for unsupervised domain adaptation, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1410–1417.
25. Belkin, Niyogi and Sindhwani, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research 7 (2006), 2399–2434.
26. Hamm and Lee D.D., Grassmann discriminant analysis: a unifying view on subspace-based learning, In Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 376–383.
27. et al., A unified framework for metric transfer learning, IEEE Transactions on Knowledge and Data Engineering 29(6) (2017), 1158–1171.
28. Chen et al., Joint Domain Alignment and Discriminative Feature Learning for Unsupervised Deep Domain Adaptation, AAAI Technical Track: Machine Learning, 2019, pp. 3296–3303.
29. et al., Transfer independently together: A generalized framework for domain adaptation, IEEE Transactions on Cybernetics 49(6) (2018), 2144–2155.
30. Wei et al., Learning Discriminative Geodesic Flow Kernel for Unsupervised Domain Adaptation, 2018 IEEE International Conference on Multimedia and Expo (ICME) IEEE, 2018, pp. 1–6.
31. Courty et al., Optimal transport for domain adaptation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(9) (2016), 1853–1865.
32. Jindong et al., Easy transfer learning by exploiting intra-domain structures, 2019 IEEE International Conference on Multimedia and Expo (ICME), 2019.


¹
https://github.com/jindongwang/transferlearning/blob/master/data/dataset.md.
