Abstract
Domain adaptation is an important branch of transfer learning. Previous studies have focused on minimizing an optimization objective but neglect the relative quality of features and instances. For example, a classic work treats different instances largely equally and selects the instances that minimize the value of the objective function. This approach discards the instances that make the source and target data distributions differ, and it ignores the relative quality of instances. To reduce interference between instances during domain adaptation, we propose a novel method, ODA, that uses the overlapping degree to measure the relative quality of every feature or instance and reweights features or instances accordingly. We also observe that the performance of the method depends on many parameter values, and that previous studies offer no principled way to set them. We use a genetic algorithm to find the balance between marginal distribution adaptation and conditional distribution adaptation and to find the best combination of multiple parameters. Our experiments show that ODA outperforms the best comparison method by 3.26%. Our parameter search also yields more accurate results than the original method.
Introduction
Traditional machine learning methods work well in part because they assume that the training and testing datasets follow the same distribution. However, in many real-world cases this assumption does not hold. Data from the same class but different domains may show different characteristics. Furthermore, because of limited training conditions, we often need to train models in a source data domain and then use them in a target data domain.
We often face problems in which labeled data is scarce in the target data domain, so that learning an effective model directly in the target domain is nearly impossible or too costly. A common remedy is to learn a classifier from labeled data in the source domain and use it to determine which category the target-domain data belongs to. However, learning an accurate classifier for the target domain in this way is challenging, and the accuracy of the model is compromised.
Therefore, transfer learning is receiving more and more attention, and it has been used in many different fields, such as image classification [1, 2], tagging [3, 4], object recognition [5, 6, 7, 8], and feature learning [9, 10, 11]. Among the many approaches to transfer learning, feature-based transfer learning is the most popular. Transfer component analysis (TCA) [12] is the classic distribution adaptation method in feature-based transfer learning and adapts the marginal distribution. STL [17] considers conditional distribution adaptation. The joint distribution adaptation (JDA) method [13] considers both marginal and conditional distribution adaptation. Many methods have been proposed to improve JDA. For example, VDA [18] adds the calculation of within-class distance.
Our ideas mainly come from two papers, which we discuss in the next two paragraphs:
TJM [14] proposes adding instances reweighting to the marginal distribution adaptation. They calculate the
Wang et al. note a deficiency of JDA: marginal distribution adaptation and conditional distribution adaptation are usually not equally important in some datasets. They propose BDA [15] to solve this problem, but BDA only finds the trade-off parameter by traversing from 0 to 1 with an interval of 0.1. Since the two distribution adaptations can be regarded as two optimization goals, our paper uses a multi-objective optimization algorithm to find a better solution.
Next, we elaborate on the idea behind ODA.
In Fig. 1, the two ellipses represent the instances’ distribution of two different classes in feature space. We can clearly note that the instances in Class 1, such as A,O,H, and the instances in Class 2, such as
Two different classes during the domain adaptation process.
We notice the classification error caused by the mutual interference between classes in the source data domain. In domain adaptation, we usually use a kernel function to map the instances from the original space to a reproducing kernel Hilbert space (RKHS). When the kernel function is used, we propose a method to measure the interference degree and reweight instances to decrease it. When the kernel function is not used, we decrease the interference degree by feature reweighting. The previous work TJM [14] does not reweight instances in the first iteration, from which it obtains the first transformation matrix. In the following iterations, according to the transformation matrix, it increases the loss of those instances that move a large distance. This is why we say that TJM treats instances as equal to some extent. Furthermore, TJM does not consider the case in which the kernel function is not used. As shown in Fig. 3, the instance reweighting process in previous domain adaptation methods is usually driven by an objective function. When we pay attention to the quality of instances, the weight of the bad instances is decreased (the edge OB). In this way, the instance we choose changes from instance 2 to instance 1. Feature selection behaves similarly.
Instances or features reweighting.
In this paper, we introduce the concept of overlapping area to measure the interference degree between classes. Our goal is to reduce the overlapping degree in every iteration of solving the transformation matrix and projecting the data, so that the model trained on the projected source data performs better on the projected target data. Instance reweighting or feature reweighting is the means of achieving that goal. When the kernel method is not used, we focus on feature reweighting. We believe that features with a higher overlapping degree of instances mislead the result more, so we give these bad features less weight when computing the transformation matrix. In the projection process with the kernel method, we give less weight to instances that are more likely to be abnormal, that is, instances surrounded by a greater number of different-class instances. We also discuss the effect of different parameters in the optimization problem. For a single trade-off parameter, for example the balance parameter u in BDA, we should strike a balance between marginal distribution adaptation and conditional distribution adaptation. This can be regarded as a multi-objective optimization problem. Compared with BDA, which traverses candidate values iteratively, a genetic algorithm can find a better solution. For multi-parameter problems, our experiments show that a multi-parameter genetic algorithm finds a better, convergent solution than optimizing each parameter separately. We summarize our motivations and contributions in four aspects:
1. We propose the concept of overlapping and use the values of a formula to measure the harm degree of each feature (in the primal method) or each instance (in the kernel method). The values are calculated from the overlapping degree, which is caused by unclear boundaries between different classes of instances in the source data domain. In the primal method, the value for each feature reflects how much the feature may harm the classification of the target data: the more instances that have similar values in a feature but belong to different classes, the higher the harm degree of that feature. In the kernel method, the value for each instance reflects how anomalous the instance is in the source data domain, that is, whether the surrounding instances belong to the same class as the instance: the more surrounding instances that do not, the higher the harm degree of that instance. The overlapping of different classes in the source data domain adversely influences the resulting transformation matrix and the model.
2. We use the values obtained above to reweight the features or instances and to reduce the influence of bad features or bad instances on the transformation matrix and the model.
3. We find a convergent trade-off solution between marginal distribution adaptation and conditional distribution adaptation with a genetic algorithm, which achieves better accuracy than BDA.
4. We show that when multiple parameters are used as variables, a genetic algorithm can find a better solution than when each single parameter is used as a variable, because of the interaction between the parameters.
Data distribution adaptation
TCA [12] is the representative method for marginal distribution adaptation. It uses MMD [16] to calculate the discrepancy between the source and target domains. While maintaining the data characteristics, TCA finds a transformation matrix that reduces the MMD distance after the data transformation. TSL [21] uses Bregman divergence instead of MMD to measure the distance between distributions. STL is a conditional distribution adaptation method which argues that many studies ignore intra-class correlations, and which adaptively reduces the dimension of the space by using the intra-class correlation. STL has achieved convincing results in cross-domain behavior recognition tasks. JDA extracts a shared subspace between the source and target domains by considering both of the above adaptations. BDA notes that marginal distribution adaptation and conditional distribution adaptation should not be treated as equally important in many scenarios.
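To make the MMD idea concrete, the following is a minimal sketch, assuming a linear kernel and instances stored as rows. Under these assumptions the empirical MMD reduces to the squared distance between the two domain means, which can be written as a trace with an MMD coefficient matrix. The function names `mmd_matrix` and `linear_mmd` are ours and do not come from TCA or the other cited methods.

```python
import numpy as np

def mmd_matrix(n_s, n_t):
    """Marginal MMD coefficient matrix M = e e^T for n_s source and n_t target instances."""
    # e holds +1/n_s for source instances and -1/n_t for target instances.
    e = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    return np.outer(e, e)

def linear_mmd(Xs, Xt):
    """Squared MMD with a linear kernel: tr(X^T M X) with X = [Xs; Xt] (rows are instances)."""
    X = np.vstack([Xs, Xt])
    M = mmd_matrix(len(Xs), len(Xt))
    return float(np.trace(X.T @ M @ X))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Xs = rng.normal(loc=0.0, size=(50, 4))   # toy source domain
    Xt = rng.normal(loc=0.5, size=(60, 4))   # toy target domain with shifted mean
    print("linear MMD:", linear_mmd(Xs, Xt))
```

The trace form is the one that distribution adaptation methods minimize after inserting a transformation matrix; here no transformation is applied, so the value is simply the squared distance between the domain means.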
Feature selection and instance selection
TJM proposes adding a regularization term to select instances from the source data domain in the process of marginal distribution adaptation.
Multi-objective optimization
The genetic algorithm NSGA-II [19] is a classic method for multi-objective optimization problems.
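Multi-objective genetic algorithms such as NSGA-II rank candidate solutions by Pareto dominance. The sketch below shows only that building block, extracting the first non-dominated front for a minimization problem; it is not a full NSGA-II implementation, and the function names `dominates` and `first_front` are ours.

```python
import numpy as np

def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one
    (all objectives are minimized)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return bool(np.all(a <= b) and np.any(a < b))

def first_front(points):
    """Indices of the points that are not dominated by any other point."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]

if __name__ == "__main__":
    # Each pair could stand for (marginal distance, conditional distance) of one candidate.
    candidates = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0), (5.0, 5.0)]
    print("non-dominated candidates:", first_front(candidates))
```

In the setting of this paper, the two objectives would be the marginal and conditional distribution distances, and the first front contains the trade-offs that cannot be improved in one objective without worsening the other.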
Problem specification
Problem definition
In the unsupervised transfer learning, we have a batch of tagged source data
In other words, we use the model trained on the source data to predict the category of the target data. This approach rests on the assumption that the feature space and label space of the source domain and target domain are the same:
Dimensionality reduction
Dimensionality reduction can reduce noise influence. We usually use Principal Component Analysis (PCA) for this.
where tr(X) denotes the trace of a matrix
The marginal distribution distance between two data domains.
Marginal distribution adaptation is to make
Minimizing Eq. (2) is equivalent to the following formula:
where the MMD matrix
Conditional distributions adaptation is to make
Because we’re assuming that
where the MMD matrix,
We also control the intra-class variance, which makes instances of the same class cluster more closely and improves classification.
where
An instance of the target data is misclassified when it is surrounded in the new space by source instances of the wrong class. After the source data is projected, each class occupies a general distribution range within the new space. The boundaries of these ranges are usually unclear, yet clear boundaries are the key factor in deciding whether the target data is classified correctly. Thus, the degree of overlap between the different classes in the source data is partly responsible for the classification errors. The more mixed instances there are in the source data, the larger the degree of overlap and the more likely the target data is to be misclassified. The question is therefore how to define the degree of overlap and incorporate it into the process of solving for the transformation matrix A. For the primal method, we use the degree to implement feature reweighting. For the kernel method, we use the degree to implement instance reweighting.
where
This method applies when no kernel function is used to non-linearize the original data. We treat each feature independently of the others and take only that feature and the category of each instance into consideration. For each feature, we calculate the frequency and degree of occurrence of abnormal instances in the source data. In the later coordinate transformation process, we give more weight to the features with a low occurrence of abnormal instances (a low degree of overlap).
Furthermore, we calculate a loss value for each feature. The loss value reflects the overlapping degree of the feature in the source data domain.
The calculation of features’ overlapping degree.
As shown in Fig. 3,
The boundary [ll,lr] can be defined as Eq. (11):
Then we can calculate the overlapping degree of the i-th feature by the following formula:
After calculating the loss of every feature, we obtain the overlap matrix:
With the overlap matrix, we implement feature reweighting and down-weight the features that have a high loss. In Eq. (12), eps is a tiny positive number. Taking Eqs (3), (4), and (6)–(13) into consideration, we obtain the optimization problem in Eq. (14) together with its constraints. Our goal is to minimize it.
In Eq. (14), we add a Frobenius norm term to keep A from becoming too complex.
we denote
Our goal is to find the matrix A that satisfies Eq. (16), formed from the d smallest eigenvectors. The procedure of ODA with the primal method is summarized in Algorithm 3.
ODA with primal method
Instances’ feature
Train the classifier f by the transferred source data and labels:
Kernelization:
We often encounter cases where the instances are not linearly separable. We use a kernel function to map the original data from X to
We are more concerned with the overlapping degree of instances after dimensionality reduction. In the primal method, however, the data has a different number of features before and after dimensionality reduction, so it is hard to use the reduced data to calculate the C matrix for feature reweighting in the next iteration. With the kernel method, the way we add the overlapping degree to the optimization problem changes from feature reweighting to instance reweighting. Because the number of instances does not change, we can use the loss value of each instance after dimensionality reduction to implement instance reweighting directly.
So we put forward a
Furthermore, the effect of the overlap matrix on the model changes from reweighting the features to reweighting the instances. The values of the overlap matrix reflect the overlapping degree of every source instance. In the projection process, we give less weight to an instance that is more likely to be abnormal, that is, an instance surrounded by a greater number of different-class instances.
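The idea of scoring an instance by the different-class instances around it can be illustrated with a simple stand-in. The sketch below is an assumption for illustration only and is not the formula used by ODA: it scores each source instance by the fraction of its k nearest source neighbors that carry a different label. The function name `knn_overlap_degree` and the parameter `k` are hypothetical.

```python
import numpy as np

def knn_overlap_degree(Z, y, k=3):
    """Hypothetical proxy for an instance's overlapping degree.

    Z: (n, d) array of projected source instances; y: (n,) array of labels.
    Returns, for each instance, the fraction of its k nearest neighbors
    (excluding itself) whose label differs from its own.
    """
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances between all instances.
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)          # an instance is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]    # indices of the k nearest neighbors
    return np.mean(y[nn] != y[:, None], axis=1)

if __name__ == "__main__":
    Z = np.array([[0.0], [0.1], [0.2], [5.05], [5.0], [5.1], [5.2]])
    y = np.array([0, 0, 0, 0, 1, 1, 1])   # the instance at 5.05 sits inside class 1
    print(knn_overlap_degree(Z, y, k=2))
```

An instance with a score near 1 lies mostly among instances of other classes and would receive a small weight; an instance with a score near 0 lies well inside its own class.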
Because of the normalization of the instances, when an element in
The calculation of instances’ overlapping degree.
Then we can calculate the overlapping degree of the i-th instance by the following formula:
Instance weighting is slightly different from feature weighting. We choose Eq. (18) as the C matrix, which differs slightly from the one used in the primal method.
We obtain the optimization problem and the constraints it is subject to as Eq. (19).
where
Where
Then we derive L with respect to A, and let
The procedure of ODA with kernel method is summarized in Algorithm 4.
ODA with kernel method
Instances’ feature
Calculate overlap matrix C by Eqs (17) and (18). Use Eq. (22) except conditional distribution and take the d smallest eigenvectors to get the transformation matrix
Train the classifier f by the transferred source data and labels:
Many parameters need to be determined in the process of domain adaptation. We can regard choosing their values as a multi-objective optimization problem.
Find a better trade-off parameter between the marginal distribution adaptation and the conditional distribution adaptation
The Eq. (23) is the optimization objective in BDA. We notice that previous studies have artificially defined the parameters’ values. For example, the BDA runs with parameter
We observed that there are many parameters involved in the process of solving the optimization problem. Such as the parameter
To verify this more quickly, we selected the parameters and d for the experiment. With the genetic algorithm, the final accuracy converges and a better solution is found.
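The contrast between traversing a trade-off parameter on a fixed grid and searching it with a genetic algorithm can be sketched as follows. This is a minimal single-objective, elitist genetic algorithm, not NSGA-II and not the configuration used in our experiments. The fitness function `toy_accuracy` is a hypothetical stand-in for the accuracy obtained with a given balance value; in practice it would run the domain adaptation method and evaluate the classifier. All names and default values are assumptions.

```python
import numpy as np

def toy_accuracy(mu):
    """Hypothetical stand-in for the accuracy obtained with balance value mu in [0, 1]."""
    return 1.0 - (mu - 0.37) ** 2

def grid_search(fitness, step=0.1):
    """Traverse [0, 1] with a fixed interval and return the best grid value."""
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    scores = [fitness(m) for m in grid]
    return float(grid[int(np.argmax(scores))])

def genetic_search(fitness, pop_size=20, generations=30, sigma=0.05, seed=0):
    """Simple elitist genetic algorithm over a single real-valued parameter."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(0.0, 1.0, size=pop_size)
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        # Selection: keep the better half of the population unchanged.
        elite = pop[np.argsort(scores)[::-1][: pop_size // 2]]
        # Crossover: each child is the average of two randomly chosen elite parents.
        parents = rng.choice(elite, size=(pop_size - len(elite), 2))
        children = parents.mean(axis=1)
        # Mutation: add Gaussian noise and clip back to the valid range.
        children = np.clip(children + rng.normal(0.0, sigma, size=len(children)), 0.0, 1.0)
        pop = np.concatenate([elite, children])
    scores = np.array([fitness(m) for m in pop])
    return float(pop[int(np.argmax(scores))])

if __name__ == "__main__":
    print("grid search   :", grid_search(toy_accuracy))
    print("genetic search:", genetic_search(toy_accuracy))
```

The grid can only return multiples of the step, whereas the genetic search can settle on values between grid points. Extending the individual from a scalar to a vector is how several parameters would be searched jointly.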
Experiments
Datasets
We chose the following five datasets: Office
Comparison methods
We chose six state-of-the-art comparison methods:
Among these methods, NN and PCA are traditional learning methods, and the others are transfer learning approaches. As suggested by Geodesic Flow Kernel (GFK) [20], NN is chosen as the base classifier since it does not require tuning cross-validation parameters. BDA is used for experiment 2 (Parameters Optimization), and the other methods are used for experiment 1 (The ODA Method).
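For reference, a nearest-neighbor base classifier of the kind described above can be sketched in a few lines. This is a minimal 1-NN with Euclidean distance, written under the assumption that instances are rows of the (already projected) data matrices; the function names `nn_predict` and `accuracy_percent` are ours.

```python
import numpy as np

def nn_predict(Xs, ys, Xt):
    """Label each target instance with the label of its nearest source instance."""
    Xs = np.asarray(Xs, dtype=float)
    Xt = np.asarray(Xt, dtype=float)
    # Squared Euclidean distance from every target instance to every source instance.
    d2 = np.sum((Xt[:, None, :] - Xs[None, :, :]) ** 2, axis=-1)
    return np.asarray(ys)[np.argmin(d2, axis=1)]

def accuracy_percent(y_true, y_pred):
    """Classification accuracy in percent."""
    return 100.0 * float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

if __name__ == "__main__":
    Xs = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]])
    ys = np.array([0, 0, 1, 1])
    Xt = np.array([[0.5, 0.2], [5.5, 5.8], [4.0, 4.0]])
    yt = np.array([0, 1, 1])
    pred = nn_predict(Xs, ys, Xt)
    print(pred, accuracy_percent(yt, pred))
```

Because 1-NN has no hyperparameters to tune, differences in accuracy between methods can be attributed to the learned representation rather than to classifier tuning.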
The details of the five datasets
Experiment 1: The ODA method
We built the experimental environment following JDA, TJM, and BDA. PCA, TCA, JDA, TSL, and TJM are dimensionality reduction procedures; we feed the data after dimensionality reduction into NN to obtain a model. For the comparison study, we set
Experiment 2: Parameters optimization
We regarded 2 representative parameters as variables to use the genetic algorithm and chose BDA as a baseline method. We set d and
Performance evaluation of ODA
Accuracy (%) of ODA in three methods and other methods on 16 tasks
Genetic algorithm.
We tested 16 tasks with the primal method, the linear kernel method, and the RBF kernel method. The performance is shown in Table 2. With the RBF kernel, we achieve an accuracy of 58.27%, which is 3.26% better than the best comparison method, TJM. TJM implements instance selection and discards only the instances whose removal makes the value of the optimization function (the discrepancy between marginal distribution adaptation and conditional distribution adaptation) smaller. In other words, it discards the instances that make the source data domain and the target domain different. In contrast, we ask whether the features or instances in the source domain are good enough, and we reweight them accordingly. The experiment shows that our method, ODA, is effective.
Result of the parameters optimization experiment
The horizontal axis of each subfigure in Fig. 5 is the number of generations of the genetic algorithm. The vertical axis is the accuracy. In the subfigures, the black lines represent the best results in BDA’s iteration, and the red lines represent the results obtained when we use the genetic algorithm to find the balance between marginal adaptation and conditional adaptation by adjusting the parameter
As shown in Table 3, average accuracy of seeing
Conclusions
Although traditional machine learning also uses feature or instance selection, the projection process in transfer learning redistributes the projected datasets in the feature space, which makes it even more important to keep clear boundaries between classes. The essence of our paper is to make the boundaries between different classes of projected data clearer and the overlapping degree smaller. Simply put, our motivation is to reduce interference between instances of different classes. Feature reweighting and instance reweighting are just the means of achieving that.
It is important to note two facts: 1. In our unsupervised learning scenario, we need to iterate repeatedly so that the pseudo tags gradually converge to the true tags. The projection therefore happens repeatedly, and each projection varies with the pseudo tags and the loss values. 2. The prior knowledge we essentially consider is information about the relative positions of instances. Because of the repeated projection, reducing interference between instances during the iterations is particularly important. This is why our work has produced good results.
It is worth noting that we judge advantages and disadvantages of instances from the interactive information between instances, which is ignored by predecessors. TJM introduces the
To address the problem of reducing the overlapping area, we chose a classical domain adaptation method to implement our ideas and experiments. The information about the relative positions of instances is used to reduce interference between classes when the source data and target data are projected into an intermediate space. We obtain good results and believe that, with some changes, this idea can also be applied to other classical domain adaptation methods, which can be tried in future work.
In addition, we show that a better solution can be found by using a multi-objective optimization algorithm to optimize the multiple objectives and the parameters in domain adaptation. We use a genetic algorithm to verify this, and we obtain better results than BDA, which relies on a linear search. Fig. 5 shows that, as the number of iterations increases, the accuracy gradually becomes stable and converges. We believe that introducing other multi-objective optimization algorithms into domain adaptation problems can also deliver good results, and we leave this to future work.
In this paper, we propose ODA, a method that identifies which features or instances in the source data domain are good and then reweights the features or instances during domain adaptation in order to reduce the overlapping degree of the data. We found this to be effective in our experiments. We also propose using a genetic algorithm to find a better balance between marginal distribution adaptation and conditional distribution adaptation. Our experiments support the hypothesis that a genetic algorithm, by exploiting the combined action of multiple parameters in the process of domain adaptation, can produce a better solution.
