Biased transfer matching for less overlapping degree for unsupervised domain adaptation

Abstract

Domain adaptation is an important branch of transfer learning. Previous studies focus on minimizing an optimization objective but neglect the relative quality of features or instances. For example, a classic method treats instances equally to a large degree and selects those instances that minimize the value of the objective function. Such a method discards the instances that make the source and target data distributions differ, but it neglects the instances’ relative quality. To reduce interference between instances during domain adaptation, we put forward a novel method, ODA, which uses the overlapping degree to measure the relative quality of every feature or instance and implements feature or instance reweighting accordingly. We have also noticed that the values of many parameters influence the effect of the method, and previous studies lack a reasonable way to determine them. We use a genetic algorithm to find the balance between marginal distribution adaptation and conditional distribution adaptation and to find the best combination of multiple parameters. Our experiments verify that ODA outperforms the best comparison method by 3.26%. We also find that our method of searching for the optimal parameters yields more accurate results than the original method.

Keywords

transfer learning, data mining, domain adaptation, instance reweighting

1. Introduction

One reason traditional machine learning methods work well is that they rest on the hypothesis that the training and testing datasets follow the same distribution. However, in many real-world cases this hypothesis does not hold. Data from the same class but different domains may show different characteristics. Furthermore, because of the limitations of training conditions, we often need to train models in the source data domain and then use them in the target data domain.

We often confront problems in which labeled data is scarce in the target data domain, so that it is nearly impossible to learn an effective model without rich labels, or it costs too much to learn a model in the target domain. As a result, learning an accurate classifier for the target domain using labeled data from the source domain is a challenging problem. We use such a model to determine which category the data in the target data domain belongs to. However, in this way, the accuracy of the model is compromised.

Therefore, transfer learning is receiving more and more attention and has been used in many different fields, such as image classification [1, 2], tagging [3, 4], object recognition [5, 6, 7, 8], and feature learning [9, 10, 11]. There are many research methods in transfer learning. Among them, feature-based transfer learning is the most popular. Transfer component analysis (TCA) [12] is the classic distribution adaptation method in feature-based transfer learning and adapts the marginal distribution. In addition, STL [17] considers conditional distribution adaptation. The joint distribution adaptation method (JDA) [13] considers both marginal distribution adaptation and conditional distribution adaptation. Many methods have been proposed to improve JDA. For example, VDA [18] adds the calculation of the within-class distance.

Our ideas mainly come from two papers, which are discussed in the next two paragraphs:

TJM [14] proposes adding instance reweighting to marginal distribution adaptation. It computes the $L_{2,1}$ norm of the part of the transformation matrix that corresponds to the source data. In essence, this increases the loss of those source instances that moved a large distance in the previous iteration of the data transfer. But which instances will move a large distance? That depends on the marginal distribution adaptation. Previous research has not paid attention to the quality of individual instances, and the instance reweighting process is implemented only passively. In other words, it treats instances as the same in a way, although they are clearly not the same. We put forward a novel method, ODA, which measures instances’ relative quality through a new concept called the overlapping degree. We then use feature or instance reweighting to reduce that degree.

Wang et al. notice a deficiency of JDA: marginal distribution adaptation and conditional distribution adaptation are usually not equally important in some datasets. They propose BDA [15] to solve this problem, but BDA finds the trade-off parameter only by traversing from 0 to 1 in steps of 0.1. As the two distribution adaptations can be regarded as two optimization goals, our paper uses a multi-objective optimization algorithm to find a better solution.

In the next step, we further elaborate our theory about ODA.

In Fig. 1, the two ellipses represent the distributions of the instances of two different classes in feature space. We can clearly see that the instances in Class 1, such as A, O, H, and the instances in Class 2, such as $C$, $K_{1}$, $L_{1}$, are successful instances. The instances T, S, $B_{1}$, $J_{1}$ are evidently undesirable, abnormal instances because the instances around them belong to a different class. The instances Z, V, $A_{1}$, $O_{1}$, $W$, $N_{1}$, $C_{1}$ are worse than the instances in the former case but better than those in the latter case. Our target is to reduce the influence of the bad instances on the transformation matrix obtained in every iteration of the domain adaptation.

Figure 1.

Two different classes during the domain adaptation process.

We notice the classification error caused by mutual interference between classes in the source data domain. In domain adaptation, a kernel function is usually used to map the instances from the original space to a reproducing kernel Hilbert space (RKHS). When a kernel function is used, we propose a method that measures the interference degree and reweights instances to decrease it. When the kernel function is not used, we decrease the interference degree by feature reweighting. The previous work TJM [14] does not implement instance reweighting in the first iteration, from which it obtains the first transformation matrix. In the following iterations, according to the transformation matrix, it increases the loss of those instances that moved a large distance. This is why we say that TJM treats instances as the same to some extent. Furthermore, it does not consider the situation in which the kernel function is not used. As shown in Fig. 2, the instance reweighting process in previous domain adaptation methods is usually driven by an objective function. When we pay attention to the quality of the instances, the weight of the bad instances is decreased (the edge OB). In this way, the instance we choose changes from instance 2 to instance 1. Feature selection behaves similarly.

Figure 2.

Instances or features reweighting.

In this paper, we put forward the concept of an overlapping area to measure the interference degree between classes. Our target is to reduce the overlapping degree in every iteration of solving for the transformation matrix and projecting the data, so that the model trained on the projected source data performs better on the projected target data. Instance reweighting or feature reweighting is the means of achieving that target. When the kernel method is not used, we focus on feature reweighting. We believe that features with a higher overlapping degree of instances mislead the result more, so we give these bad features less consideration when computing the transformation matrix. In the projection process using the kernel method, we give less consideration to the instances that are more likely to be abnormal, that is, those surrounded by a greater number of different-class instances. At the same time, we discuss the effect of different parameters in the optimization problem. For the trade-off of a single parameter, for example the balance parameter $\mu$ in BDA, we should strike a balance between marginal distribution adaptation and conditional distribution adaptation. This problem can be regarded as a multi-objective optimization problem. Compared with BDA, which finds the parameter by iterative traversal, we can find a better solution using a genetic algorithm. For multi-parameter problems, our experiments show that the multi-parameter genetic algorithm can find a better, convergent solution than using each single parameter. We summarize our motivations and contributions in four aspects:

We propose the concept of overlapping, and use the values of a formula to measure the harm degree of each feature (in the primal method) or each instance (in the kernel method). The values are calculated by the overlapping degree which is caused by unclear boundaries between different classes of instances in the source data domain.

When we use the primal method, for each feature, the value reflects the possible harm degree of the feature to the target data’s classification. The more instances which have similar values in a feature but belong to different classes, the higher harm degree of the feature to the target data’s classification will be. When we use the kernel method, for each instance, the value reflects the anomalous degree of each instance in the source data domain, that is, whether the surrounding instances belong to the same classification as the instance. The more instances that do not belong, the higher the harm degree of the instance to the target data’s classification will be.

The overlapping of different classes in the source data domain will adversely influence the resulting transformation matrix and the model. We use the values obtained above to reweight the features or instances and to reduce influences of bad features or bad instances on transformation matrix and model.

We find a convergent trade-off solution between marginal distribution adaptation and conditional distribution adaptation with a genetic algorithm, which achieves better accuracy than BDA.

We show that when multiple parameters are used as variables, a genetic algorithm can find a better solution than using each single parameter as a variable because of the interaction of the parameters.

2. Related work

2.1 Data distribution adaptation

TCA [12] is the representative method in marginal distribution adaptation. It proposes using MMD [16] to calculate the discrepancy between the source and target domains. While preserving the data characteristics, TCA [12] finds a transformation matrix that reduces the MMD distance after the data transformation. TSL [21] uses Bregman divergence instead of MMD to measure the distance between distributions. STL is a conditional distribution adaptation method which argues that many studies ignore intra-class correlations, and it adaptively reduces the dimension of the space by using the intra-class correlation. STL has achieved convincing results in cross-domain behavior recognition tasks. JDA proposes a method to extract a shared subspace between the source and target domains by considering both of the above adaptations. BDA notices that marginal distribution adaptation and conditional distribution adaptation should not be considered equally important in many scenarios.

2.2 Feature selection and instance selection

TJM proposes adding a regularization term to select instances from the source data domain in the process of marginal distribution adaptation.

2.3 Multi-objective optimization

The genetic algorithm, NSGA-II [19], is a classic method in multi-objective optimization problems.

3. Problem specification

3.1 Problem definition

In unsupervised transfer learning, we have a batch of tagged source data $\{X_{s_{i}},Y_{s_{i}}\}_{i=1}^{n_{s}}$ and a batch of untagged target data $\{X_{t_{j}}\}_{j=1}^{n_{t}}$. We need to use the features $X_{s}$ and tags $Y_{s}$ of the source data to train a classifier, and then feed the target data $\{X_{t_{j}}\}_{j=1}^{n_{t}}$ into the classifier to get the labels of the target data.

In other words, we use the model trained on the source data to predict the category of the target data. The rationale of the method rests on the assumption that the feature space and label space of the source domain and target domain are the same: $X_{s}=X_{t}$, $Y_{s}=Y_{t}$. But the marginal distribution and conditional distribution in the two data domains are usually unequal: $P_{s}(x_{s})\neq P_{t}(x_{t})$, $P_{s}(y_{s}|x_{s})\neq P_{t}(y_{t}|x_{t})$. Previous studies have mentioned that we should bring the marginal distributions and conditional distributions of the two data domains closer. That is, after mapping by a function $\varPhi$, we make $P_{s}(\varPhi(x_{s}))\approx P_{t}(\varPhi(x_{t}))$ and $P_{s}(y_{s}|\varPhi(x_{s}))\approx P_{t}(y_{t}|\varPhi(x_{t}))$. Previous studies have also mentioned that we can approximate $P_{s}(y_{s}|\varPhi(x_{s}))$ with $P_{s}(\varPhi(x_{s})|y_{s})$ and approximate $P_{t}(y_{t}|\varPhi(x_{t}))$ with $P_{t}(\varPhi(x_{t})|y_{t})$.

3.2 Dimensionality reduction

Dimensionality reduction can reduce the influence of noise. We usually use Principal Component Analysis (PCA) for this. $X=[X_{s},X_{t}]\in R^{m*n}$ is the input data. $\textbf{H}=\textbf{I}-\frac{1}{n}\textbf{1}$ is the centering matrix, where $\textbf{I}\in R^{n*n}$ is the identity matrix, $n=n_{s}+n_{t}$, and $\textbf{1}$ is the $n*n$ matrix of all ones. $n_{s}$ represents the number of instances in the source data domain. $n_{t}$ represents the number of instances in the target data domain. The covariance matrix can be computed as $\textbf{XHX}^{\textbf{T}}$. PCA aims to find an orthogonal transformation matrix $\textbf{A}\in R^{m*d}$ that maximizes the variance of the embedded data. We try to maximize the following formula.

$\displaystyle\max\limits_{\textbf{A}^{\textbf{T}}\textbf{A}=\textbf{I}}\textit{tr}(\textbf{A}^{\textbf{T}}\textbf{XHX}^{\textbf{T}}\textbf{A})$ (1)

where tr(X) denotes the trace of a matrix $X$. The problem above can be converted into an eigendecomposition problem: $\textbf{XHX}^{\textbf{T}}\textbf{A}=\textbf{A}\bm{\theta}$, where $\bm{\theta}=\textit{diag}({\theta}^{1},\ldots,{\theta}^{d})\in R^{d*d}$ contains the d largest eigenvalues. Then we can get the embedded data: $\textbf{Z}=\textbf{A}^{\textbf{T}}\textbf{X}$. The embedded source data are used with their tags to train a model, which then tags the embedded target data.
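The PCA step of Eq. (1) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `pca_embed` and the random demo data are assumptions made here, not part of the original method description.

```python
import numpy as np

def pca_embed(X, d):
    """PCA as in Eq. (1). X has shape (m, n) with instances as columns.

    Returns A (m, d) with orthonormal columns and the embedding Z = A^T X (d, n).
    """
    m, n = X.shape
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H = I - (1/n) 1
    cov = X @ H @ X.T                        # X H X^T
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    A = eigvecs[:, ::-1][:, :d]              # eigenvectors of the d largest eigenvalues
    Z = A.T @ X
    return A, Z

# Hypothetical demo data: 5 features, 40 instances.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 40))
A_demo, Z_demo = pca_embed(X_demo, d=2)
```

Because `np.linalg.eigh` returns orthonormal eigenvectors, the constraint $\textbf{A}^{\textbf{T}}\textbf{A}=\textbf{I}$ holds by construction.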

3.3 Feature matching

3.3.1 The marginal distribution distance between two data domains.

Marginal distribution adaptation is to make $P_{s}(\varPhi(x_{s}))\approx P_{t}(\varPhi(x_{t}))$ . We use MMD to measure the discrepancy between the two data domains. MMD has been widely used in many existing transfer learning methods to measure the distance between two data domains. The distance can be calculated as:

$\displaystyle D_{1}(X_{s},X_{t})=\left|\left|\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}A^{T}x_{s_{i}}-\frac{1}{n_{t}}\sum_{j=1}^{n_{t}}A^{T}x_{t_{j}}\right|\right|^{2}$ (2)

$x_{s_{i}}$ represents the $i$-th instance in the source data domain. $x_{t_{j}}$ represents the $j$-th instance in the target data domain.

The minimization of the Eq. (2) is equivalent to the following formula:

$\displaystyle D_{1}(X_{s},X_{t})=\text{argmin}_{A}tr(A^{T}XM_{0}X^{T}A)$ (3)

where the MMD matrix $M_{0}$ can be calculated in the Eq. (4).

$\displaystyle(M_{0})_{ij}=\left\{\begin{array}{ll}\frac{1}{n_{s}^{2}}&x_{i},x_{j}\in X_{s}\\ \frac{1}{n_{t}^{2}}&x_{i},x_{j}\in X_{t}\\ -\frac{1}{n_{s}n_{t}}&\textit{others}\end{array}\right.$ (4)
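As a sanity check on the equivalence between Eq. (2) and Eq. (3), the following sketch builds $M_{0}$ as an outer product and compares both sides numerically. The helper name `mmd_matrix_marginal` and the random data are illustrative assumptions.

```python
import numpy as np

def mmd_matrix_marginal(n_s, n_t):
    """Build M_0 of Eq. (4) as e e^T, with e = [1/n_s, ..., 1/n_s, -1/n_t, ..., -1/n_t]."""
    e = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    return np.outer(e, e)

# Hypothetical data: 4 features, 6 source and 9 target instances (columns).
rng = np.random.default_rng(0)
Xs = rng.normal(size=(4, 6))
Xt = rng.normal(size=(4, 9)) + 1.0
X = np.hstack([Xs, Xt])
A = rng.normal(size=(4, 2))              # any projection matrix works for the check

M0 = mmd_matrix_marginal(6, 9)
# Eq. (2): squared distance between the projected domain means.
lhs = np.sum((A.T @ Xs.mean(axis=1) - A.T @ Xt.mean(axis=1)) ** 2)
# Eq. (3): trace form using the MMD matrix.
rhs = np.trace(A.T @ X @ M0 @ X.T @ A)
```

The two values agree up to floating-point error because $XM_{0}X^{T}=(Xe)(Xe)^{T}$ and $Xe$ is the difference of the two domain means.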

3.3.2 Distance of conditional distributions

Conditional distribution adaptation aims to make $P_{s}(y_{s}|\varPhi(x_{s}))\approx P_{t}(y_{t}|\varPhi(x_{t}))$. Because $P_{s}(y_{s}|\varPhi(x_{s}))$ and $P_{t}(y_{t}|\varPhi(x_{t}))$ can be approximated with $P_{s}(\varPhi(x_{s})|y_{s})$ and $P_{t}(\varPhi(x_{t})|y_{t})$, we use other distances to train a classifier in the source data domain and get pseudo labels for the target data, and then we iteratively refine them. The conditional distribution distance can be defined by the following formula:

$\displaystyle D_{2}(X_{s},X_{t})=\sum_{c=1}^{C_{n}}\left|\left|\frac{1}{n_{s}^{(c)}}\sum_{x_{s_{i}}\in X_{s}^{(c)}}A^{T}x_{s_{i}}-\frac{1}{n_{t}^{(c)}}\sum\limits_{x_{t_{j}}\in X_{t}^{(c)}}A^{T}x_{t_{j}}\right|\right|^{2}$ (5)

Because we assume that $Y_{s}=Y_{t}$, the source and target data have the same categories. $C_{n}$ represents the number of classes. $x_{s_{i}}\in X_{s}^{(c)}$ represents the $i$-th instance among the instances belonging to class c in the source data domain. $x_{t_{j}}\in X_{t}^{(c)}$ represents the $j$-th instance among the instances belonging to class c in the target data domain. $n_{s}^{(c)}$ represents the number of instances which belong to class c in the source data domain. $n_{t}^{(c)}$ represents the number of instances which belong to class c in the target data domain. The minimization of the above formula is equivalent to the following formula:

$\displaystyle D_{2}(X_{s},X_{t})=\textit{argmin}_{A}tr(A^{T}XM_{c}X^{T}A)$ (6)

where the MMD matrix, $M_{c}$ , can be calculated as the Eq. (7):

$\displaystyle(M_{c})_{ij}=\left\{\begin{array}{ll}\frac{1}{\left(n_{s}^{(c)}\right)^{2}}&x_{i},x_{j}\in X_{s}^{(c)}\\ \frac{1}{\left(n_{t}^{(c)}\right)^{2}}&x_{i},x_{j}\in X_{t}^{(c)}\\ -\frac{1}{n_{s}^{(c)}n_{t}^{(c)}}&\left\{\begin{array}{l}x_{i}\in X_{s}^{(c)},x_{j}\in X_{t}^{(c)}\\ x_{i}\in X_{t}^{(c)},x_{j}\in X_{s}^{(c)}\end{array}\right.\\ 0&\textit{others}\end{array}\right.$ (7)
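The class-wise matrix of Eq. (7) can be built the same way as $M_{0}$, restricted to the instances of one class. In the sketch below, the function name, the toy labels, and the handling of a class that is absent from the pseudo labels (its block is simply left at zero) are assumptions made for illustration.

```python
import numpy as np

def mmd_matrix_conditional(y_s, y_t_pseudo, c):
    """Build M_c of Eq. (7) for class c from source labels and target pseudo labels."""
    n_s, n_t = len(y_s), len(y_t_pseudo)
    e = np.zeros(n_s + n_t)
    src = np.flatnonzero(y_s == c)                 # source instances of class c
    tgt = n_s + np.flatnonzero(y_t_pseudo == c)    # target instances of class c
    if len(src) > 0:
        e[src] = 1.0 / len(src)
    if len(tgt) > 0:                               # assumption: skip an empty class
        e[tgt] = -1.0 / len(tgt)
    return np.outer(e, e)                          # entries outside class c stay 0

# Hypothetical labels: 5 source instances, 4 target pseudo labels.
y_s = np.array([0, 0, 1, 1, 1])
y_t_pseudo = np.array([0, 1, 1, 0])
M1 = mmd_matrix_conditional(y_s, y_t_pseudo, c=1)
```

Here class 1 has three source instances and two target instances, so the source block holds 1/9, the target block 1/4, and the cross blocks -1/6.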

3.3.3 Variance of the intra-class matrix

We control the intra-class variance so that instances of the same class congregate more closely, which improves the classification effect.

$\displaystyle D_{3}(X_{s})=\textit{argmin}_{A}\textit{tr}(A^{T}S_{w}A)$ (8)

$\displaystyle S_{w}=\sum\limits_{c=1}^{C_{n}}X_{s}^{(c)}H_{s}^{(c)}\left(X_{s}^{(c)}\right)^{T}$ (9)

where $X_{s}^{(c)}\in R^{m*{n_{s}^{(c)}}}$, $H_{s}^{(c)}=\textbf{I}_{s}^{(c)}-\frac{1}{n_{s}^{(c)}}\textbf{1}_{s}^{(c)}$ is the centering matrix, $\textbf{I}_{s}^{(c)}\in R^{n_{s}^{(c)}*n_{s}^{(c)}}$ is the identity matrix, and $\textbf{1}_{s}^{(c)}$ is the $n_{s}^{(c)}*n_{s}^{(c)}$ matrix of all ones. $S_{w}$ is the intra-class matrix.
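A minimal NumPy sketch of the intra-class matrix in Eq. (9) follows; the function name and the random demo data are illustrative assumptions.

```python
import numpy as np

def within_class_scatter(X_s, y_s):
    """S_w of Eq. (9): sum over classes of X_s^(c) H_s^(c) (X_s^(c))^T.

    X_s has shape (m, n_s) with instances as columns; y_s holds the class labels.
    """
    m = X_s.shape[0]
    S_w = np.zeros((m, m))
    for c in np.unique(y_s):
        Xc = X_s[:, y_s == c]                            # instances of class c
        n_c = Xc.shape[1]
        H_c = np.eye(n_c) - np.ones((n_c, n_c)) / n_c    # class-wise centering matrix
        S_w += Xc @ H_c @ Xc.T
    return S_w

# Hypothetical data: 3 features, 10 source instances in 3 classes.
rng = np.random.default_rng(0)
X_s = rng.normal(size=(3, 10))
y_s = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
S_w = within_class_scatter(X_s, y_s)
```

Since each centering matrix is symmetric and idempotent, every term equals the scatter of the class around its own mean, so `S_w` is symmetric and positive semi-definite.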

3.4 Overlap matrix and model solution

A target instance is misclassified when it is surrounded by wrong source instances in the new space. After the source data are projected, each class occupies a general distribution range within the new space. The boundaries of these ranges are usually unclear, yet clear boundaries are the key factor in deciding whether the target data’s classification is correct. Thus, the degree of overlap between the different classes in the source data is partly responsible for the classification errors. If there are more mixed instances in the source data, the degree of overlap is larger and the target data are more likely to be misclassified. The question is then how to define the degree of overlap and incorporate it into the process of solving for the transformation matrix A. For the primal method, we use the degree to implement feature reweighting. For the kernel method, we use the degree to implement instance reweighting.

$\displaystyle D_{4}(X_{s})=\textit{argmin}_{A}\textit{tr}(A^{T}CA)$ (10)

where $C$ is the overlap matrix which reflects overlap degree. The calculation process is shown in Sections 3.4.1 and 3.4.2.

3.4.1 Primal method

This method applies when the kernel function is not used to non-linearize the original data. For each feature, we ignore the other features and consider only this feature and the category of each instance. We calculate the frequency and degree of occurrence of abnormal source instances for each feature. In the later coordinate transformation, we pay more attention to those features with a low occurrence of abnormal instances (a low degree of overlap).

Furthermore, we calculate a loss value for each feature. The loss value reflects the overlapping degree of the feature in the source data domain.

Figure 3.

The calculation of features’ overlapping degree.

As shown in Fig. 3, $X_{s}\in R^{m*n_{s}}$, where m is the feature dimension and $n_{s}$ is the number of source instances. We sort the instances by the value of the $i$-th feature, and every tick on the scale represents a sorted instance. The red arrow points to the compared instance and moves along the entire scale. For each position of the red arrow, we obtain the left boundary ll and the right boundary lr (the instances pointed to by the blue arrows), where $kn=lr-ll+1$. Within the boundary, we count the number of instances that belong to the same class as the compared instance, not including the compared instance itself, as $\textit{nnum}_{j}$.

The boundary [ll,lr] can be defined as Eq. (11):

$\displaystyle[ll,lr]=\left\{\begin{array}{ll}{[}1,kn]&j\in\left[1,\frac{kn}{2}+1\right)\\ {[}n_{s}-kn+1,n_{s}]&j\in\left(n_{s}-\frac{kn}{2},n_{s}\right]\\ \left[\left\lfloor j-\frac{kn}{2}\right\rfloor,\left\lfloor j+\frac{kn}{2}-1\right\rfloor\right]&\textit{others}\end{array}\right.$ (11)

Then we can calculate the overlapping degree of the $i$-th feature with the following formula:

$\displaystyle\textit{crossloss}(i)=-\sum\limits_{j=1}^{n_{s}}\textit{log}_{2}((\textit{nnum}_{j}+\textit{eps})/(\textit{kn}-1+\textit{eps})),\quad\textit{crossloss}\in R^{m*1}$ (12)

After calculating every feature’s loss, we can get the overlap matrix:

$\displaystyle C=\textit{crossloss}\,\textit{crossloss}^{T}$ (13)
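To make Eqs (11)–(13) concrete, here is a hedged sketch of the per-feature overlap loss. The function names, the toy data, and the reading of the floor in Eq. (11) as applying to each window endpoint are assumptions of this sketch, as is the value chosen for eps.

```python
import math
import numpy as np

def window_bounds(j, n_s, kn):
    """1-based window [ll, lr] of width kn around sorted position j (Eq. 11)."""
    if j < kn / 2 + 1:
        return 1, kn
    if j > n_s - kn / 2:
        return n_s - kn + 1, n_s
    ll = math.floor(j - kn / 2)
    return ll, ll + kn - 1          # same as floor(j + kn/2 - 1)

def feature_crossloss(X_s, y_s, kn, eps=1e-12):
    """Overlap loss of every feature (Eq. 12). X_s: (m, n_s), y_s: (n_s,)."""
    m, n_s = X_s.shape
    loss = np.zeros(m)
    for i in range(m):
        order = np.argsort(X_s[i], kind="stable")   # sort instances by feature i
        labels = y_s[order]
        for j in range(1, n_s + 1):
            ll, lr = window_bounds(j, n_s, kn)
            window = labels[ll - 1:lr]
            # same-class neighbours in the window, excluding the instance itself
            nnum = np.count_nonzero(window == labels[j - 1]) - 1
            loss[i] -= np.log2((nnum + eps) / (kn - 1 + eps))
    return loss

# Toy data: feature 0 separates the two classes, feature 1 interleaves them.
X_s = np.array([[0, 1, 2, 3, 10, 11, 12, 13],
                [0, 2, 4, 6, 1, 3, 5, 7]], dtype=float)
y_s = np.array([0, 0, 0, 0, 1, 1, 1, 1])
loss = feature_crossloss(X_s, y_s, kn=3)
C = np.outer(loss, loss)            # overlap matrix of Eq. (13)
```

In this toy case the interleaved feature receives the larger loss, so it is the one penalized more strongly through `C`.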

With the introduction of the overlap matrix, we implement feature reweighting and down-weight those features that have a high loss. In Eq. (12), eps is a tiny positive number. Taking Eqs (3), (4) and (6)–(13) into consideration, we obtain the optimization problem in Eq. (14) together with its constraint. Our goal is to minimize it.

$\displaystyle\textit{min}\left(\textit{tr}\left({A^{T}}\left(X\left(M_{0}+\sum\limits_{c=1}^{C_{n}}M_{c}\right)X^{T}+S_{w}+C\right)A+\lambda||A||_{F}^{2}\right)\right)\quad s.t.\quad A^{T}\textit{XHX}^{T}A=I$ (14)

In Eq. (14), we add the Frobenius norm to prevent A from becoming too complicated. $\bm{\lambda}$ is the regularization parameter. We convert Eq. (14) into the Lagrange function in Eq. (15).

$\displaystyle L=\textit{tr}\left(A^{T}\left(X\left(M_{0}+\sum\limits_{c=1}^{C_{n}}M_{c}\right)X^{T}+S_{w}+C\right)A+\lambda||A||_{F}^{2}\right)+\textit{tr}((\textbf{I}-A^{T}\textit{XHX}^{T}A)\bm{\theta})$ (15)

We denote $\bm{\theta}=\textit{diag}(\theta^{1},\ldots,\theta^{d})\in R^{d*d}$ as the Lagrange multiplier. We then differentiate L with respect to A and set $\partial{L}/\partial{A}=0$, which converts the optimization problem into the generalized eigendecomposition problem in Eq. (16).

$\displaystyle\left(X\left(M_{0}+\sum\limits_{c=1}^{C_{n}}M_{c}\right)X^{T}+S_{w}+\lambda I+C\right)A=\textit{XHX}^{T}A\bm{\theta}$ (16)

Our goal is to find the matrix A that satisfies Eq. (16), formed by the eigenvectors of the d smallest eigenvalues. The procedure of ODA with the primal method is summarized in Algorithm 3.

Algorithm: ODA with primal method

Input: Instances’ features ${\textbf{X}_{\textbf{s}}}$ and labels $\textbf{Y}_{\textbf{s}}$ from the source data domain; data $\textbf{X}_{\textbf{t}}$ from the target data domain; regularization parameter $\bm{\lambda}$; dimension d after dimensionality reduction; number of iterations T.

Output: Transformation matrix $\bm{A}$, classifier f.

Construct $\bm{X}=[\bm{X_{s}},\bm{X_{t}}]$.
Calculate $\bm{M_{0}}$ by Eq. (4).
Calculate the intra-class variance $\bm{S_{w}}$ by Eq. (9).
Calculate the overlap matrix ${\bm{C}}$ by Eqs (11), (12) and (13).
for $t\in[1,T]$ do
  if t = 1 then
    Use Eq. (16) without the conditional distribution term and take the d smallest eigenvectors to get the transformation matrix $\bm{A}$.
  else
    Solve the eigendecomposition problem according to Eq. (16) and take the d smallest eigenvectors to construct $\bm{A}$.
  end if
  Train the classifier f on the transferred source data and labels: ${\{\bm{A^{T}}\bm{X_{s}},\bm{Y_{s}}\}}$.
  Update the pseudo labels $\bm{Y_{t}}$: $\bm{Y_{t}}=f(\bm{A^{T}}\bm{X_{t}})$.
  Update $\bm{M_{c}}$ by using Eq. (7).
end for
return Classifier f and transformation matrix $\bm{A}$.
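The eigen-solve step of the primal procedure (Eq. (16)) can be sketched with NumPy alone by reducing the generalized problem to a symmetric one through a Cholesky factor of the left-hand matrix. This reduction is a choice made for this sketch and is not stated in the text; the function name, the random data, and the placeholder overlap losses are likewise assumptions, and $S_{w}$ and the $M_{c}$ terms are left out for brevity (as in the first iteration without pseudo labels).

```python
import numpy as np

def solve_transform(L, R, d):
    """Solve L a = theta R a and return the d eigenvectors with the smallest theta.

    L must be symmetric positive definite (guaranteed here by the lambda*I term).
    Columns of A are scaled so that A^T R A = I.
    """
    G = np.linalg.cholesky(L)              # L = G G^T
    G_inv = np.linalg.inv(G)
    S = G_inv @ R @ G_inv.T                # symmetric; its eigenvalues equal 1/theta
    S = (S + S.T) / 2                      # remove round-off asymmetry
    w, B = np.linalg.eigh(S)               # ascending order
    w, B = w[::-1][:d], B[:, ::-1][:, :d]  # largest 1/theta = smallest theta
    A = (G_inv.T @ B) / np.sqrt(w)         # back-transform and normalize
    return A, 1.0 / w

# Hypothetical setup: 4 features, 8 source and 7 target instances.
rng = np.random.default_rng(0)
m, n_s, n_t, d, lam = 4, 8, 7, 2, 1.0
n = n_s + n_t
X = rng.normal(size=(m, n))
e = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
M0 = np.outer(e, e)                        # Eq. (4)
H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
crossloss = rng.uniform(size=m)            # placeholder values for Eq. (12)
C = np.outer(crossloss, crossloss)         # Eq. (13)

L = X @ M0 @ X.T + lam * np.eye(m) + C     # left-hand side of Eq. (16), S_w omitted
R = X @ H @ X.T                            # right-hand side matrix
A, theta = solve_transform(L, R, d)
```

A library routine for generalized symmetric eigenproblems could be used instead; the Cholesky route is shown only to keep the sketch dependent on NumPy alone.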

3.4.2 Kernel method

Kernelization:

We often encounter cases where the instances are not linearly separable. We use a kernel function to map the original data from X to $\psi(X)$. Then we construct the kernel matrix $K=\psi(X)^{T}\psi(X)$, where $K\in{R}^{(n_{s}+n_{t})*(n_{s}+n_{t})}$, using a linear or RBF kernel. We denote $K_{s}=\psi(X)^{T}\psi(X_{s})$, $K_{t}=\psi(X)^{T}\psi(X_{t})$. In this case, $A\in R^{(n_{s}+n_{t})*d}$, as derived in detail in TCA [12].

We are more concerned with the overlapping degree of the instances after dimensionality reduction. With the primal method, however, the data have a different number of features before and after dimensionality reduction, so it is hard to use the reduced data to calculate the C matrix for feature reweighting in the next iteration. With the kernel method, the way we add the overlapping degree to the optimization problem changes from feature reweighting to instance reweighting. Because the number of instances does not change, we can use the loss value of each instance after dimensionality reduction to implement instance reweighting directly.
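For the RBF case, the kernel matrices $K$, $K_{s}$ and $K_{t}$ can be formed as in the following sketch. The function name and the bandwidth parameter `gamma` are assumptions of this sketch; the text does not specify how the kernel width is chosen.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2) for the columns x_i of X (m, n)."""
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # squared pairwise distances
    return np.exp(-gamma * np.maximum(d2, 0.0))        # clip tiny negative round-off

# Hypothetical data: 3 features, 5 source and 4 target instances.
rng = np.random.default_rng(0)
n_s, n_t = 5, 4
X = rng.normal(size=(3, n_s + n_t))        # X = [X_s, X_t]
K = rbf_kernel(X, gamma=0.5)               # (n_s + n_t) x (n_s + n_t)
K_s = K[:, :n_s]                           # psi(X)^T psi(X_s)
K_t = K[:, n_s:]                           # psi(X)^T psi(X_t)
```

With an RBF kernel every mapped instance has unit norm, since `K[i, i]` equals 1.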

So we put forward a $K_{d}$ matrix. In the first iteration, $K_{d}=\psi(X_{s})^{T}\psi(X_{s})$ . In the later iteration, $K_{d}={Z_{s}}^{T}Z_{s}$ where $Z_{s}=A^{T}K_{s}$ . $Z_{s}$ is the source data following dimensionality reduction in the previous iteration. We rewrite $K_{d}$ as $K_{d}={M^{T}}M$ .

Furthermore, the effect of overlap matrix on the model is changed from reweighting the features to reweighting the instances. The values of the overlap matrix reflect the overlapping degree of every source instance. In the projection process, we give less consideration to the impact of the instance which is more likely to be an abnormal instance and around which there is greater amount of different-class instances.

Because the instances are normalized, the element of $K_{d}$ in row i and column j reflects how close $X_{i}$ and $X_{j}$ are in the space: the larger the inner product, the closer the two instances. For example, the first row of $K_{d}$ is ${M_{1}^{T}}{M_{1}}$, ${M_{1}^{T}}{M_{2}}$, ${M_{1}^{T}}{M_{3}}$, $\ldots$, ${M_{1}^{T}}{M_{n}}$, which measures the closeness between $M_{1}$ (the first instance in the source data) and the other instances. The second row of $K_{d}$ is ${M_{2}^{T}}{M_{1}}$, ${M_{2}^{T}}{M_{2}}$, ${M_{2}^{T}}{M_{3}}$, $\ldots$, ${M_{2}^{T}}{M_{n}}$, which measures the closeness between $M_{2}$ (the second instance in the source data) and the other instances. As shown in Fig. 4, for row i of the $K_{d}$ matrix, we sort the elements of the row in ascending order. We select the $k_{n}$ largest elements, excluding the element equal to 1 (the product of instance i with itself). That is, we set $[ll,lr]=[n_{s}-k_{n},n_{s}-1]$. For these elements, we find the instances that are multiplied by instance i and count how many of them belong to the same category as instance i; this count is $\textit{nnum}_{i}$.

Figure 4.

The calculation of instances’ overlapping degree.

Then we can calculate the overlapping degree of the $i$-th instance with the following formula:

$\displaystyle\textit{crossloss}(i)=\left\{\begin{array}{ll}-\textit{log}_{2}((\textit{nnum}_{i}+\textit{eps})/(kn+\textit{eps}))&i\leq n_{s}\\ 0&n_{s}+1\leq i\leq n_{s}+n_{t}\end{array}\right.,\quad\textit{crossloss}\in R^{(n_{s}+n_{t})*1}$ (17)

The instance weighting is slightly different from the feature weighting: we choose Eq. (18) as the C matrix, which differs slightly from the primal method.

$\displaystyle C=\textit{diag(crossloss)}$ (18)
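The instance-level loss of Eqs (17) and (18) can be sketched as follows. The function name and the toy data are assumptions; in addition, this sketch excludes an instance from its own neighbourhood by index rather than by its position in the sorted row, which gives the same neighbours when the self-product is the largest entry of the row.

```python
import numpy as np

def instance_crossloss(K_d, y_s, n_t, kn, eps=1e-12):
    """Overlap loss of each source instance (Eq. 17); target entries stay 0."""
    n_s = K_d.shape[0]
    loss = np.zeros(n_s + n_t)
    for i in range(n_s):
        sims = K_d[i].astype(float).copy()
        sims[i] = -np.inf                       # exclude the instance itself
        nearest = np.argsort(sims)[-kn:]        # the kn largest inner products
        nnum = np.count_nonzero(y_s[nearest] == y_s[i])
        loss[i] = -np.log2((nnum + eps) / (kn + eps))
    return loss

# Toy source data on the unit circle: two clusters, and instance 3 sits inside
# the class-0 cluster while carrying the class-1 label.
angles = np.array([0.0, 0.1, 0.2, 0.05, 1.4, 1.5, 1.6])
y_s = np.array([0, 0, 0, 1, 1, 1, 1])
M = np.vstack([np.cos(angles), np.sin(angles)])   # unit-norm columns
K_d = M.T @ M
loss = instance_crossloss(K_d, y_s, n_t=3, kn=2)
C = np.diag(loss)                                 # overlap matrix of Eq. (18)
```

The mislabeled-looking instance 3 receives by far the largest loss, so it is the one down-weighted most through the diagonal of `C`.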

We can get the optimization formula and the condition it is subject to as Eq. (19).

$\displaystyle\textit{min}\left(\textit{tr}\left(A^{T}\left(K\left(M_{0}+\sum\limits_{c=1}^{C_{n}}M_{c}\right)K^{T}+S_{w}+C\right)A+\lambda||A||_{F}^{2}\right)\right)\quad s.t.\quad A^{T}\textit{KHK}^{T}A=I$ (19)

where

$\displaystyle S_{w}=\sum\limits_{c=1}^{C_{n}}K_{s^{(c)}}H_{s}^{(c)}\left(K_{s^{(c)}}\right)^{T}$ (20)

where $K_{s^{(c)}}=\psi(X)^{T}\psi\left(X_{s^{(c)}}\right)$. Similar to the primal method, we convert Eq. (19) into a Lagrange function and denote $\bm{\theta}=\textit{diag}(\theta^{1},\ldots,\theta^{d})\in R^{d*d}$ as the Lagrange multiplier.

$\displaystyle L=\textit{tr}\left(A^{T}\left(K\left(M_{0}+\sum\limits_{c=1}^{C_{n}}M_{c}\right)K^{T}+S_{w}+C\right)A+\lambda||A||_{F}^{2}\right)+\textit{tr}((\textbf{I}-A^{T}\textit{KHK}^{T}A)\bm{\theta})$ (21)

We then differentiate L with respect to A and set $\partial{L}/\partial{A}=0$, which converts the optimization problem into the generalized eigendecomposition problem in Eq. (22).

$\displaystyle\left(K\left(M_{0}+\sum\limits_{c=1}^{C_{n}}M_{c}\right)K^{T}+S_{w}+\lambda I+C\right)A=\textit{KHK}^{T}A\bm{\theta}$ (22)

The procedure of ODA with kernel method is summarized in Algorithm 4.

Algorithm: ODA with kernel method

Input: Instances’ features $\textbf{X}_{s}$ and labels $\textbf{Y}_{s}$ from the source data domain; data $\textbf{X}_{t}$ from the target data domain; regularization parameter $\bm{\lambda}$; dimension d after dimensionality reduction; number of iterations T.

Output: Transformation matrix $\bm{A}$, classifier f.

Construct $\bm{X}=[\bm{X_{s}},\bm{X_{t}}]$, $\bm{K_{s}}=\psi(\bm{X})^{T}\psi(\bm{X_{s}})$, $\bm{K_{t}}=\psi(\bm{X})^{T}\psi(\bm{X_{t}})$.
Calculate $\bm{M_{0}}$ by Eq. (4).
Calculate the intra-class variance $S_{w}$ by Eq. (20).
for $t\in[1,T]$ do
  if t = 1 then
    $\bm{K_{d}}=\psi(\bm{X_{s}})^{T}\psi(\bm{X_{s}})$.
    Calculate the overlap matrix C by Eqs (17) and (18).
    Use Eq. (22) without the conditional distribution term and take the d smallest eigenvectors to get the transformation matrix $\bm{A}$.
  else
    Update $\bm{K_{d}}=\bm{{Z_{s}}^{T}Z_{s}}$.
    Update the overlap matrix $\bm{C}$ by Eqs (17) and (18).
    Solve the eigendecomposition problem according to Eq. (22) and take the d smallest eigenvectors to construct $\bm{A}$.
  end if
  Train the classifier f on the transferred source data and labels: ${\{\bm{A^{T}}\bm{K_{s}},\bm{Y_{s}}\}}$.
  Update $\bm{Z_{s}}=\bm{A^{T}K_{s}}$.
  Update the pseudo labels $\bm{Y_{t}}$: $\bm{Y_{t}}=f(\bm{A^{T}}\bm{K_{t}})$.
  Update $\bm{M_{c}}$ by using Eq. (7).
end for
return Classifier f and transformation matrix $\bm{A}$.

3.5 Use multi-objective optimization to find better parameters

We found that there are many parameters that need to be determined in the process of domain adaptation. We can regard these problems as multi-objective optimization problems.

3.5.1 Find a better trade-off parameter between the marginal distribution adaptation and the conditional distribution adaptation

$\displaystyle\textit{DISTANCE}(X_{s},X_{t})=(1-\mu)\textit{DISTANCE}(P(x_{s}),P(x_{t}))+\mu\textit{DISTANCE}(P(y_{s}|x_{s}),P(y_{t}|x_{t}))$ (23)

Eq. (23) is the optimization objective in BDA. We notice that previous studies set the parameters’ values manually. For example, BDA runs with the parameter $\mu\in\{0,0.1,\ldots,1.0\}$ and compares the performances to find the best $\mu$ value. In this way, it often fails to find the optimal value. We employ a genetic algorithm to solve the multi-objective optimization problem and find a better parameter for the balance between marginal distribution adaptation and conditional distribution adaptation.
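To illustrate why a continuous search can beat the fixed grid $\{0,0.1,\ldots,1.0\}$, the sketch below runs a very small genetic algorithm over $\mu\in[0,1]$. It is a deliberately simplified, single-objective stand-in and not the multi-objective algorithm used in the paper: the function names, the population settings, and above all the toy fitness function (which replaces an actual adaptation run scored by accuracy) are assumptions.

```python
import numpy as np

def genetic_search(fitness, pop_size=20, generations=40, sigma=0.05, seed=0):
    """Maximize fitness(mu) over mu in [0, 1] with elitism, blending and mutation."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(0.0, 1.0, size=pop_size)
    for _ in range(generations):
        scores = np.array([fitness(mu) for mu in pop])
        elite = pop[np.argsort(scores)[::-1][: pop_size // 2]]   # keep the best half
        n_children = pop_size - len(elite)
        parents_a = rng.choice(elite, size=n_children)
        parents_b = rng.choice(elite, size=n_children)
        w = rng.uniform(size=n_children)
        children = w * parents_a + (1.0 - w) * parents_b         # arithmetic crossover
        children += rng.normal(0.0, sigma, size=n_children)      # Gaussian mutation
        pop = np.concatenate([elite, np.clip(children, 0.0, 1.0)])
    scores = np.array([fitness(mu) for mu in pop])
    return float(pop[np.argmax(scores)])

def toy_fitness(mu):
    """Hypothetical stand-in for target accuracy; its optimum lies off the 0.1 grid."""
    return -(mu - 0.37) ** 2

best_mu = genetic_search(toy_fitness)
```

In practice `toy_fitness` would be replaced by a function that runs the adaptation with the given $\mu$ and returns the quantity being optimized; extending the individual from a scalar to a vector would cover the multi-parameter case discussed next.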

3.5.2 The interaction between parameters brings a better solution

We observed that many parameters are involved in solving the optimization problem, such as the parameter $\mu$ in BDA, the feature dimension d of the data after dimensionality reduction, and the regularization parameter $\lambda$. Because the parameters interact, the genetic algorithm can find a better solution when multiple parameters are used as variables than when each single parameter is used as a variable.

To verify this quickly, we selected the parameters $\mu$ and d for the experiment. The genetic algorithm's final accuracy converges, and a better solution is found.
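The joint search over $\mu$ and d can be sketched with a small real-coded genetic algorithm. This is an illustrative sketch only: `toy_accuracy` is a hypothetical stand-in for the target-domain accuracy (chosen so that the best $\mu$ depends on d), and the selection, crossover and mutation operators are simple assumptions, not those of the toolbox used in the experiments.

```python
import random


def toy_accuracy(mu, d):
    """Hypothetical fitness: the best mu shifts with d, so the parameters interact."""
    best_mu = 0.2 + 0.6 * (d / 200.0)
    return 1.0 - (mu - best_mu) ** 2 - ((d - 120) / 200.0) ** 2


def genetic_search(fitness, mu_bounds, d_bounds, pop_size=30, generations=40, seed=0):
    """Return the best (mu, d) found and the best fitness at each generation."""
    rng = random.Random(seed)

    def clip(value, bounds):
        return max(bounds[0], min(bounds[1], value))

    population = [(rng.uniform(*mu_bounds), rng.randint(*d_bounds))
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=lambda ind: fitness(*ind), reverse=True)
        history.append(fitness(*ranked[0]))
        parents = ranked[: pop_size // 2]   # truncation selection
        children = [ranked[0]]              # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Arithmetic crossover plus Gaussian mutation for mu.
            mu = clip((a[0] + b[0]) / 2 + rng.gauss(0, 0.05), mu_bounds)
            # Inherit d from one parent and perturb it by a few dimensions.
            d = clip(rng.choice([a[1], b[1]]) + rng.randint(-5, 5), d_bounds)
            children.append((mu, d))
        population = children
    best = max(population, key=lambda ind: fitness(*ind))
    return best, history


# Joint search versus searches with one parameter held fixed.
joint, joint_history = genetic_search(toy_accuracy, (0.0, 1.0), (1, 200))
only_mu, _ = genetic_search(toy_accuracy, (0.0, 1.0), (100, 100))
only_d, _ = genetic_search(toy_accuracy, (0.2, 0.2), (1, 200))
print(joint, only_mu, only_d)
```

Fixing a parameter is expressed by giving it identical lower and upper bounds, so the same routine covers the single-variable and the two-variable settings. Because of elitism, the best fitness recorded in `joint_history` never decreases from one generation to the next.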

4. Experiments

4.1 Datasets

We chose the following five datasets: Office $+$ Caltech, COIL20 and USPS $+$ MNIST. MNIST (M) and USPS (U) are handwritten digit recognition datasets that contain the digits 0–9. MNIST contains 60,000 training images and 10,000 test images. USPS contains 7,291 training images and 2,007 test images. COIL20 contains 1,440 images that belong to 20 categories. Caltech-256 contains 30,607 images that belong to 256 categories. The Office dataset contains three sub-datasets: Amazon, Webcam and DSLR. It contains 4,652 images that belong to 31 categories. The details of the subsets used in our experiments are shown in Table 1. These datasets have been used in previous methods [13, 14, 15], where they are clearly described and preprocessed. We followed these methods to construct 16 different tasks.

4.2 Comparison methods

We chose seven comparison methods:

• 1-Nearest Neighbor classifier (NN)
• Principal Component Analysis (PCA) $+$ NN
• Transfer Component Analysis (TCA) [12] $+$ NN
• Joint Distribution Adaptation (JDA) [13] $+$ NN
• Transfer Subspace Learning (TSL) [21] $+$ NN
• Transfer Joint Matching (TJM) [14] $+$ NN
• Balanced Distribution Adaptation (BDA) [15] $+$ NN

Among these methods, NN and PCA are traditional learning methods, and the others are transfer learning approaches. As suggested by Geodesic Flow Kernel (GFK) [20], NN is chosen as the base classifier since it does not require tuning parameters by cross-validation. BDA is used for experiment 2 (parameters optimization), and the other methods are used for experiment 1 (the ODA method).

Table 1
The details of the five datasets

Dataset	Type	#Sample	#Feature	#Class	Domain
USPS	Digit	1,800	256	10	U
MNIST	Digit	2,000	256	10	M
COIL20	Object	1,440	1,024	20	CO1, CO2
Office	Object	1,410	800	10	A, W, D
Caltech	Object	1,123	800	10	C

4.3 Implementation details

4.3.1 Experiment 1: The ODA method

We built the experimental environment following JDA, TJM, and BDA. PCA, TCA, JDA, TSL and TJM are dimensionality reduction procedures, so we feed the data after dimensionality reduction into NN to get a model. For the comparison study, we set $d=$ 100, $\lambda=$ 0.1 for the MNIST $+$ USPS and Office $+$ Caltech datasets and $d=$ 100, $\lambda=$ 0.01 for the COIL20 dataset. We use the linear kernel in the comparison methods, and we evaluate our method in its primal, linear kernel and RBF kernel forms. The number of iterations for the comparison methods and ODA is set to $T=$ 10. We adopt classification accuracy on the target domain as the evaluation metric, which is widely used in the literature [12, 13].
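The evaluation protocol described above (fit NN on the transformed source data, then measure accuracy on the target data) can be sketched as follows. This is a minimal sketch that assumes Euclidean distance and a single nearest neighbor. The function names and the toy data are illustrative and do not come from the compared implementations.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Label of the training instance closest to `query` (squared Euclidean distance)."""
    best_index = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[best_index]


def target_accuracy(source_x, source_y, target_x, target_y):
    """Percentage of target instances labelled correctly by 1-NN fitted on source data."""
    hits = sum(
        nearest_neighbor_predict(source_x, source_y, x) == y
        for x, y in zip(target_x, target_y)
    )
    return 100.0 * hits / len(target_x)


# Toy data (assumed): features are taken to be already dimension-reduced.
source_x, source_y = [[0.0, 0.0], [1.0, 1.0]], [0, 1]
target_x = [[0.1, 0.2], [0.9, 0.8], [0.6, 0.6], [0.2, 0.1]]
target_y = [0, 1, 0, 0]
print(target_accuracy(source_x, source_y, target_x, target_y))  # 75.0
```

The target labels are used only to score the predictions and never for training, which matches the unsupervised setting.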

4.3.2 Experiment 2: Parameters optimization

We treated two representative parameters as variables for the genetic algorithm and chose BDA as the baseline method. We set d and $\lambda$ to the configuration above in both BDA and our method. When $\mu$ is not treated as a variable, we set $\mu$ to 0.2. We used the Sheffield Genetic Algorithm Toolbox and set the number of generations to 100 for the Office $+$ Caltech and MNIST $+$ USPS datasets and to 50 for the COIL20 dataset. The population size is set to 100. We set the number of bits per variable to 20, the generation gap to 0.90, the crossover rate to 0.97 and the mutation rate to 0.001. The balance parameter $\mu\in$ [0,1] and $d\in$ [1, min(n,m)]. For each individual, we used the accuracy of the last iteration to decide whether the individual would be eliminated.
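The 20-bit encoding of each variable can be made concrete with a short decoding sketch. It assumes plain binary coding with linear scaling onto the stated ranges. The toolbox's actual coding options are not given in the text, so `decode_bits` and `decode_individual` are hypothetical helpers.

```python
def decode_bits(bits, low, high):
    """Map a bit string linearly onto the closed interval [low, high]."""
    fraction = int(bits, 2) / (2 ** len(bits) - 1)
    return low + fraction * (high - low)


def decode_individual(chromosome, max_dim, bits_per_variable=20):
    """Split a chromosome into (mu, d): mu in [0, 1], d an integer in [1, max_dim]."""
    mu_bits = chromosome[:bits_per_variable]
    d_bits = chromosome[bits_per_variable:2 * bits_per_variable]
    mu = decode_bits(mu_bits, 0.0, 1.0)
    d = int(round(decode_bits(d_bits, 1, max_dim)))  # dimensions must be whole numbers
    return mu, d


print(decode_individual("0" * 40, max_dim=100))  # (0.0, 1)
print(decode_individual("1" * 40, max_dim=100))  # (1.0, 100)
```

With 20 bits, $\mu$ is resolved in steps of about $10^{-6}$, far finer than the step of 0.1 used in the grid $\{0,0.1,\ldots,1.0\}$.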

4.4 Performance evaluation of ODA

Table 2
Accuracy (%) of ODA in three forms and of the comparison methods on 16 tasks

Task	1NN	PCA	TCA	JDA	TSL	TJM	ODA(primal)	ODA(linear)	ODA(rbf)
U $\rightarrow$ M	44.7	44.95	52.20	57.45	53.75	57.55	59.25	60.25	65.60
M $\rightarrow$ U	65.94	66.22	54.28	62.89	66.06	69.83	75.28	51.06	64.22
CO1 $\rightarrow$ CO2	83.61	84.72	88.61	97.22	88.06	96.81	91.11	97.50	97.92
CO2 $\rightarrow$ CO1	82.78	84.03	96.25	86.39	87.92	96.25	87.50	97.5	96.81
C $\rightarrow$ A	23.70	36.95	44.89	42.90	44.47	45.61	46.24	46.24	54.59
C $\rightarrow$ W	25.76	32.54	36.61	38.64	34.24	36.61	39.32	40	41.02
C $\rightarrow$ D	25.48	38.22	45.86	47.13	43.31	47.13	45.86	50.32	46.50
A $\rightarrow$ C	26	34.73	40.78	38.82	37.58	40.78	39.27	39.09	45.15
A $\rightarrow$ W	29.83	35.59	37.63	37.29	33.90	37.97	35.93	44.75	43.39
A $\rightarrow$ D	25.48	27.39	31.85	40.13	26.11	42.03	42.04	48.41	41.40
W $\rightarrow$ C	19.86	26.36	27.16	25.29	29.83	27.24	34.02	37.04	36.33
W $\rightarrow$ A	22.96	31	30.69	31.84	30.27	32.77	39.25	44.47	39.67
W $\rightarrow$ D	59.24	77.07	90.45	90.45	87.26	90.45	89.17	92.99	93.63
D $\rightarrow$ C	26.27	29.65	32.50	30.99	28.50	32.50	33.30	34.73	34.64
D $\rightarrow$ A	28.5	32.05	31.52	32.25	27.56	33.40	37.68	39.25	38.62
D $\rightarrow$ W	63.39	75.93	87.12	91.19	85.42	93.22	90.17	93.90	92.88
Average	40.84	47.34	51.78	53.18	50.27	55.01	55.38	57.34	58.27

Figure 5. Genetic algorithm.

We tested the primal, linear kernel and RBF kernel forms of ODA on the 16 tasks. The performance is shown in Table 2. With the RBF kernel, ODA reaches an average accuracy of 58.27%, which is 3.26 percentage points higher than the best comparison method, TJM (55.01%). TJM performs instance selection: it keeps only the instances that make the optimization function value (the discrepancy in the marginal and conditional distributions) smaller. In other words, it discards the instances that make the source domain and the target domain differ. In contrast, when the features or instances in the source domain are sufficient, we reweight them. The experiment shows that ODA is effective.

Table 3
Results of the parameters optimization experiment: accuracy (%)

Task	BDA	Variable $\mu$	Variable d	Variables $\mu$ and d
C $\rightarrow$ A	44.89	46.7	46.7	48
C $\rightarrow$ W	38.64	42	43	45.5
C $\rightarrow$ D	47.77	48.5	54	54
A $\rightarrow$ C	40.78	41	42.8	42.6
A $\rightarrow$ W	39.32	41.7	44.05	46.4
A $\rightarrow$ D	43.31	46.6	46	49.6
W $\rightarrow$ C	28.94	32.78	34.1	37.6
W $\rightarrow$ A	32.99	34.03	34.7	34.7
W $\rightarrow$ D	91.72	91.1	93.63	93.63
D $\rightarrow$ C	32.50	32.86	31.9	33.7
D $\rightarrow$ A	33.09	33.32	36.65	35.05
D $\rightarrow$ W	91.86	91.86	92.505	93.52
CO1 $\rightarrow$ CO2	97.22	96.1	99.58	99.58
CO2 $\rightarrow$ CO1	96.81	96.4	96.8	99.45
U $\rightarrow$ M	59.35	62.3	60	63.3
M $\rightarrow$ U	69.78	70.5	70	70.5
Average	55.56	56.73	57.90	59.20

4.5 Performance evaluation of the parameters optimization

In each subfigure of Fig. 5, the horizontal axis is the number of generations of the genetic algorithm and the vertical axis is the accuracy. The black lines represent the best results over BDA's iterations. The red lines represent the results obtained when the genetic algorithm adjusts the parameter $\mu$ to find the balance between marginal adaptation and conditional adaptation. We found better convergent results than BDA's on the 16 tasks. The blue lines represent the results obtained when the genetic algorithm adjusts the dimensionality after reduction (parameter d). The purple lines represent the results obtained when the parameters $\mu$ and d are adjusted at the same time. They show an interaction between the parameters and indicate that adjusting multiple parameters jointly with the genetic algorithm produces a better solution.

As shown in Table 3, treating $\mu$ as the variable gives an average accuracy of 56.73%, compared with 55.56% for BDA. In other words, when searching for the balance parameter $\mu$ between marginal distribution adaptation and conditional distribution adaptation, our method performs 1.17 percentage points better than BDA. When d is the only variable, the average accuracy is 57.90%. When both $\mu$ and d are variables, the average accuracy reaches 59.20%, which is 2.47 and 1.30 percentage points better than using $\mu$ alone and d alone, respectively.
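The gains quoted above are absolute differences between the average accuracies in Table 3. The short check below recomputes them from the reported averages. The dictionary keys are illustrative labels for the four columns.

```python
# Average accuracies (%) as reported in the last row of Table 3.
reported_average = {"BDA": 55.56, "mu": 56.73, "d": 57.90, "mu_and_d": 59.20}


def gain(better, baseline):
    """Difference in percentage points between two reported averages."""
    return round(reported_average[better] - reported_average[baseline], 2)


print(gain("mu", "BDA"))       # 1.17
print(gain("mu_and_d", "mu"))  # 2.47
print(gain("mu_and_d", "d"))   # 1.3
```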

5. Conclusions

Traditional machine learning also uses feature or instance selection. In transfer learning, however, the projection redistributes the datasets in the feature space, which makes it even more important to keep clear boundaries between classes. The essence of our paper is to make the boundaries between different classes of projected data clearer and the overlapping degree smaller. Simply put, our motivation is to reduce interference between instances of different classes. Both feature reweighting and instance reweighting are just means to achieve that.

It is important to note two facts. First, in our unsupervised learning scenario, we need to iterate repeatedly so that the pseudo labels gradually approach the true labels. The projection is recomputed at every iteration, and each projection varies with the pseudo labels and the loss values. Second, the prior knowledge we consider is essentially the information about relative positions between instances. Because the projection is recomputed continually, reducing interference between instances during the iterations is particularly important. This is the reason why our work has produced good results.

It is worth noting that we judge the quality of instances from the interactive information between instances, which previous work ignores. TJM introduces the $L_{2,1}$ norm into the domain adaptation approach by computing the instances' weights from the transformation matrix A. This increases the loss of instances that undergo a sharp shift. In this way, as the source data domain and target data domain are drawn closer, it is natural to give less weight to instances that are far from the center of symmetry. There is no doubt that the method is effective, but a few problems remain. Firstly, because the A matrix in the first iteration is obtained without instance reweighting, the bad instances are actually determined by the objective function. This is why we argue that, to some extent, the method assumes that the instances are equivalent, which makes it coarse. Secondly, is an instance closer to the center necessarily a good instance? Apparently not. If the instances near the center belong to many different classes, they are still bad samples. These are the problems that our work aims to solve, and they show the significance of our improvement.
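For reference, the $L_{2,1}$ norm of a matrix is the sum of the Euclidean norms of its rows, so penalizing it pushes whole rows toward zero. The sketch below only computes the norm and the per-row magnitudes. Reading each row as one instance is an assumption made for illustration, and the function names and the toy matrix are not taken from TJM.

```python
import math


def row_norms(matrix):
    """Euclidean norm of every row of a matrix given as a list of lists."""
    return [math.sqrt(sum(v * v for v in row)) for row in matrix]


def l21_norm(matrix):
    """L2,1 norm: the sum of the row-wise Euclidean norms."""
    return sum(row_norms(matrix))


# Toy matrix (assumed): the second row is zero, i.e. that row is switched off.
A = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]]
print(row_norms(A))  # [5.0, 0.0, 1.0]
print(l21_norm(A))   # 6.0
```

Under this reading, a row norm of zero acts like a hard decision to drop an instance, which contrasts with graded reweighting based on relative quality.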

To reduce the overlapping area, we chose a classical domain adaptation method to implement our ideas and experiments. The information about the instances' relative positions is used to reduce interference between classes when the source data and target data are projected into an intermediate space. We obtain good results and believe that this idea can also be applied to other classical domain adaptation methods with some changes, which can be tried in future work.

In addition, we show that a multi-objective optimization algorithm can find a better solution when it is used to optimize the multiple objectives and the parameters in domain adaptation. We use a genetic algorithm to verify this, and we obtain better results than BDA, which relies on a linear search. Fig. 5 shows that, as the number of iterations increases, the accuracy gradually becomes stable and converges. We believe that introducing other multi-objective optimization algorithms into domain adaptation problems can deliver good results too. This aspect deserves attention in future work.

In this paper, we propose the ODA method, which identifies the good features or instances in the source data domain and reweights the features or instances during domain adaptation in order to reduce the overlapping degree of the data. We found this to be effective in our experiments. We also propose using a genetic algorithm to find a better balance between marginal distribution adaptation and conditional distribution adaptation. Our experiments support the hypothesis that the combined action of multiple parameters in the process of domain adaptation can produce a better solution through the use of a genetic algorithm.

References

1. Wang, Nie, Huang and Ding, Dyadic transfer learning for cross-domain image classification, in Proceedings of ICCV, 2011.

2. Jie, Tommasi and Caputo, Multiclass transfer learning from unconstrained priors, in Proceedings of ICCV, 2011.

3. Roy S.D., Mei, Zeng and Li, Socialtransfer: Cross-domain transfer learning from social streams for media applications, in Proceedings of ACM MM, 2012.

4. Wang, Jiang, Huang and Tian, Multi-feature metric learning with knowledge transfer among semantics and social tagging, in Proceedings of CVPR, 2012.

5. Lampert C.H., Nickisch and Harmeling, Learning to detect unseen object classes by between-class attribute transfer, in Proceedings of CVPR, 2009.

6. Aytar and Zisserman, Tabula rasa: Model transfer for object category detection, in Proceedings of ICCV, 2011.

7. Gopalan and Chellappa, Domain adaptation for object recognition: An unsupervised approach, in Proceedings of ICCV, 2011.

8. Guillaumin and Ferrari, Large-scale knowledge transfer for object localization in imagenet, in Proceedings of CVPR, 2012.

9. Lampert C.H. and Kromer, Weakly-paired maximum covariance analysis for multimodal dimensionality reduction and transfer learning, in Proceedings of ECCV, 2010.

10. Jhuo I.-H., Liu, Lee D.-T. and Chang S.-F., Robust visual domain adaptation with low-rank reconstruction, in Proceedings of CVPR, 2012.

11. Qiu, Patel V.M., Turaga and Chellappa, Domain adaptive dictionary learning, in Proceedings of ECCV, 2012.

12. Pan S.J. et al., Domain adaptation via transfer component analysis, IEEE Transactions on Neural Networks 22.2 (2011), 199–210.

13. Long et al., Transfer Feature Learning with Joint Distribution Adaptation, IEEE International Conference on Computer Vision IEEE, 2014, pp. 2200–2207.

14. Long et al., Transfer Joint Matching for Unsupervised Domain Adaptation, Computer Vision and Pattern Recognition IEEE, 2014, pp. 1410–1417.

15. Wang et al., Balanced Distribution Adaptation for Transfer Learning, IEEE International Conference on Data Mining IEEE, 2017, pp. 1129–1134.

16. Schölkopf, Platt and Hofmann, A Kernel Method for the Two-Sample-Problem, abs/0805.2368.2007 (2008), pp. 513–520.

17. Wang et al., Stratified Transfer Learning for Cross-domain Activity Recognition, IEEE International Conference on Pervasive Computing and Communications IEEE Computer Society, 2018, pp. 1–10.

18. Tahmoresnezhad and Hashemi, Visual domain adaptation via transfer feature learning, Knowledge & Information Systems 50(2) (2016), 1–21.

19. Deb et al., A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Transactions on Evolutionary Computation 6.2 (2002), 182–197.

20. Gong, Shi, Sha and Grauman, Geodesic flow kernel for unsupervised domain adaptation, in Proceedings of CVPR, 2012.

21. Tao and Geng, Bregman divergence-based regularization for transfer subspace learning, IEEE TKDE, 2010.
