Joint category-level and discriminative feature learning networks for unsupervised domain adaptation

Abstract

Unsupervised domain adaptation (UDA) aims to build a classifier for the unlabeled target domain by transferring knowledge from a well-labeled source domain. Recent deep domain adaptation methods cannot effectively integrate the discriminability and transferability of features, and they can only reduce, but not remove, the cross-domain discrepancy. To this end, this paper proposes a new domain adaptation method called Joint Category-Level and Discriminative Feature Learning Network (CDN). CDN not only achieves domain adaptation by minimizing the category-level distribution discrepancy between domains but also learns discriminative feature representations by simultaneously maximizing the inter-category distance and selecting transferable samples. Moreover, we develop a Transferability Weighting Module (TWM), which is based on a constructed classifier, to further strengthen the discriminability of sample features. The experimental results demonstrate that CDN can significantly decrease the cross-domain distribution inconsistency and further improve classification performance.

Keywords

Domain adaptation; deep learning; discriminative feature learning; transfer learning

1 Introduction

Pattern recognition methods have attracted much attention in many applications [1–13]. In particular, deep learning has achieved great success in computer vision [10, 11] and natural language processing [12, 13]. However, training a deep neural network requires a large number of labeled samples, which are not easy to obtain. In many real-world applications, it is also difficult to collect training data that have the same distribution as the test data. Therefore, to address the scarcity of labeled data on some target tasks, there is a strong motivation to leverage labeled samples from an auxiliary domain (i.e., the source domain) to improve the classifier’s performance in the unlabeled domain (i.e., the target domain) [14, 15]. Unfortunately, the training data and test data are then drawn from different distributions, which is known as domain shift [16–19]. For example, models trained with simulated images do not generalize well to realistic domains. In these cases, domain shift is a major obstacle to knowledge transfer between domains.

Unsupervised domain adaptation has made substantial progress [14, 20–24] in reducing the distribution discrepancy between well-labeled source data (domain) and related unlabeled target data (domain). Many methods attempt to align the distributions of the source and target domains by minimizing the maximum mean discrepancy (MMD) [25] to obtain domain-invariant features [26–29], or by reweighting the samples of the source domain to select transferable samples [30, 31].

Recent advances show that deep networks can learn abstract feature representations; e.g., deep adaptation networks [32, 33] match the distributions of different domains by learning hidden representations of all task-specific layers in a reproducing kernel Hilbert space. However, most of them [26, 33–36] only conduct global adaptation and do not consider local adaptation. As illustrated in Fig. 1, global adaptation reduces the marginal distribution discrepancy between domains, while local adaptation minimizes the conditional distribution discrepancy between domains. A possible consequence of such global adaptation is that some categories that were originally well aligned between the source and target are incorrectly mapped (negative transfer [14]), which leads to worse classification results in the target domain. On the other hand, even if the local distributions of the source data and the target data are well aligned, many samples still lie close to the classification boundary and are easily misclassified, which weakens the transferability from the source data to the target data.

Fig. 1

(Best viewed in color.) The figure shows some toy examples. (a) Global adaptation aligns marginal distribution by drawing centroids of all data between domains closer. (b) Local adaptation aligns conditional distribution by drawing closer the category centroids between domains. (c) The proposed method in this paper. We encourage the networks to pull the source or target data close to the category centroids of source data.

In summary, most domain adaptation methods [26, 33–36] are still constrained by two bottlenecks. First, global adaptation may fail to solve the domain shift. Second, reducing the distribution discrepancy cannot ensure discriminative feature learning, which is important for obtaining a more accurate model [37–39]. This risk cannot be tackled by existing local adaptation methods [40], so in this paper we propose joint Category-Level and Discriminative Feature Learning Networks (CDN) to adapt the distributions of the domains. Specifically, reducing the intra-category variance shrinks the intra-category distance between the source and target domains; hence it promotes domain adaptation and lessens the domain shift. We further propose to learn discriminative features by increasing the inter-category distance and selecting transferable samples. The main contributions of this paper are listed as follows.

First, this paper pulls the source or target data closer to the category centroids of the source data, which enhances the transferability of the source data to the target data. By reducing the intra-category variance, the category-level adaptation mitigates the domain shift.

We propose to maximize inter-category distance and select transferable samples simultaneously to obtain discriminative feature representation, which can lead to obtaining a more accurate and robust classifier.

Extensive experiments on image datasets demonstrate that the category-level and discriminative feature representation will further mitigate the negative transfer and benefit the classification, which would significantly enhance the performance of unsupervised domain adaptation.

The rest of the paper is organized as follows. Section 2 reviews related work on domain adaptation. The formal definitions of the problem and the framework are presented in Section 3. In Section 4, comprehensive experiments on benchmark datasets verify the effectiveness and efficiency of the proposed method. Finally, we conclude our work in Section 5.

2 Related works

According to the difference in feature representation, domain adaptation methods can be roughly divided into shallow learning methods and deep learning methods.

Shallow learning methods: The transfer component analysis method [26] looks for domain-invariant features by reducing the cross-domain marginal distribution discrepancy in a reproducing kernel Hilbert space, while the joint distribution adaptation method [28] also considers class-conditional distribution adaptation. The geodesic flow kernel method [41] regards two subspaces as two points on a Grassmann manifold and bridges the source and target subspaces by seeking a smooth geodesic path. The subspace alignment method seeks a domain adaptation solution by learning a projection function that aligns the source subspace with the target one [42]. The transfer joint matching method [43] reduces the domain difference by jointly matching the features and re-weighting the instances across domains. The joint geometrical and statistical alignment method [29] preserves the source discriminative information and aligns the distributions by two coupled projections. EasyTL learns both non-parametric transfer features and classifiers by exploiting intra-domain structures [44].

Deep learning methods: Recent advances show that deep networks can learn abstract feature representations. However, feature transferability drops significantly in higher layers as the domain discrepancy increases [34]. Hence, the adaptation of higher layers is the key to avoiding negative transfer. Deep domain confusion [34] introduces an adaptation layer and an additional domain confusion loss to learn a representation that is both semantically meaningful and domain invariant. Deep adaptation networks [32, 33] match the distributions of different domains by learning hidden representations of all task-specific layers in a reproducing kernel Hilbert space. Joint adaptation networks [40] align the joint distributions of multiple domain-specific layers across domains based on a joint maximum mean discrepancy criterion. Residual transfer networks [35] jointly learn adaptive classifiers and transferable features from labeled data in the source domain and unlabeled data in the target domain.

To illustrate the difference between the proposed method and previous works, Table 1 gives a basic overview of the properties of CDN and some of the most advanced methods. On the one hand, intra-category variance minimization and inter-category distance maximization decrease not only the model learning error but also the distribution discrepancy. On the other hand, reducing the weights of outlier source samples further promotes the reduction of the distribution discrepancy.

Table 1
Basic overview of method properties

Property\Method	GFK[41]	TCA[26]	DAN[33]	RTN[35]	DANN[36]	ADDA[48]	EasyTL[44]	JAN[40]	CDN
Distribution adaptation	×	✓	✓	✓	✓	✓	✓	✓	✓
Intra-category variance minimization	×	×	×	✓	×	×	✓	×	✓
Inter-category distance maximization	×	×	×	×	×	×	✓	×	✓
Instance reweighting	×	×	×	×	×	✓	×	×	✓

3 Proposed method

3.1 Notations and problem definition

Given n_S labeled source samples with visual features x_Si ∈ R^d and corresponding labels y_Si ∈ R^K, transfer learning aims to recognize n_T unlabeled visual samples x_Tj ∈ R^d, which share the same label space but follow a different distribution.

The frequently used notations are listed in Table 2, and the concepts about domain adaptation will be described in this part.

Table 2
Notations and corresponding descriptions


Notation	Description	Notation	Description
x_S and y_S	Source samples / Labels	${\hat{c}}_{k}$	Centroid of k^th category
x_T and y_T	Target samples / Labels	K	Number of categories
$D_{S}$ and $D_{T}$	Source / Target domain	${\hat{y}}_{T}$	Pseudo labels
$X$ and $Y$	Feature / Label space	w	Sample weight
λ and β	Penalty parameters	$P$	Probability distributions

Task: Transfer learning aims to improve the performance of models $C$ in the target domain through relevant labeled source data.

Domain: A domain $D$ contains feature space $X$ , label space $Y$ , and corresponding classifier $C$ . This paper considers the problem of homogeneous domain adaptation, that is, the feature space and category space of source data and target data are consistent, i.e., $X_{S} = X_{T}$ and $Y_{S} = Y_{T}$ , but the distributions are generally different.

Local Adaptation: Conditional distribution adaptation can effectively achieve local adaptation [45]. Unfortunately, it is nontrivial to match the conditional distributions since there is no labeled data in the target domain. Therefore, the pseudo target labels ${\hat{y}}_{T}$ [30, 45–48] are adopted to measure the conditional distribution discrepancy: $Distance (D_{S}, D_{T}) = {∥ P (x_{S} | y_{S}) - P (x_{T} | {\hat{y}}_{T}) ∥}_{2}^{2} .$ (1)

Maximum mean discrepancy is an effective distance measure for conditional distributions [25, 40]. With pseudo target labels ${\hat{y}}_{T}$ , the conditional distribution discrepancy can be calculated as: $Distance (D_{S}, D_{T}) = \sum_{k = 1}^{K} {∥ \frac{1}{n_{k}} \sum_{y_{Si} = k} x_{Si}^{k} - \frac{1}{m_{k}} \sum_{{\hat{y}}_{Tj} = k} x_{Tj}^{k} ∥}_{2}^{2} ,$ (2) where n_k and m_k denote the numbers of k^th class samples in the source domain and target domain, respectively. In other words, conditional distribution adaptation tries to draw the centroids of the same category closer. It has been verified in [45] that conditional distribution adaptation plays a crucial role in both adaptation and classification. Therefore, we adopt conditional distribution adaptation to achieve the goal of UDA.
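As an illustration of equation 2, the following minimal sketch computes the class-wise centroid distance in plain Python. The function names `class_centroids` and `conditional_discrepancy` and the toy data are hypothetical, and skipping a class that is absent from one of the domains is an assumption that the paper does not state.

```python
from collections import defaultdict


def class_centroids(features, labels):
    """Return {label: mean feature vector} for the given samples."""
    groups = defaultdict(list)
    for x, y in zip(features, labels):
        groups[y].append(x)
    return {
        k: [sum(col) / len(rows) for col in zip(*rows)]
        for k, rows in groups.items()
    }


def conditional_discrepancy(xs, ys, xt, yt_pseudo):
    """Sum over classes of the squared distance between class means (Eq. 2)."""
    cs = class_centroids(xs, ys)
    ct = class_centroids(xt, yt_pseudo)
    total = 0.0
    # Assumption: a class missing from either domain is skipped.
    for k in cs.keys() & ct.keys():
        total += sum((a - b) ** 2 for a, b in zip(cs[k], ct[k]))
    return total


# Toy data: two classes in a 2-D feature space.
xs = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
ys = [0, 0, 1]
xt = [[1.0, 1.0], [10.0, 12.0]]
yt_pseudo = [0, 1]  # pseudo labels predicted by a source classifier
print(conditional_discrepancy(xs, ys, xt, yt_pseudo))
```

In this toy case the class-0 centroids differ by one unit along the second coordinate and the class-1 centroids by two units, so the two squared distances add up to 5.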

3.2 Motivation

Existing deep domain adaptation methods may not effectively integrate the discriminability and transferability of features, since the transferable representations and the discriminative representations of the source and target domains mainly reside in different high-level feature layers (i.e., the last pooling layer and the fully connected layer, respectively). We believe that learning both category-level and discriminative feature representations is important since it can improve the robustness of networks. This paper explores how to effectively transfer a learning model from the source domain to the target domain via category-level feature representations in high-order statistics. To obtain a discriminative feature representation, it is important to increase the inter-category distance and select transferable samples simultaneously.

3.3 Domain adaptation via variance reduction

It is essential for domain adaptation to learn the category-level feature representation between the source and target domains. Intuitively, the minimization of intra-category variance contributes to category-level feature learning, which is beneficial to the adaptation of category centroids. Therefore, in order to avoid negative transfer during domain adaptation, the source domain should learn a robust category-level feature representation through intra-category variance reduction.

In unsupervised domain adaptation problems, however, the distributions of the source domain and target domain usually embody complex structures, which reflect the category boundaries [49]. With the label information of the source domain, the intra-category variance reduction of the source can be formulated as: $E_{var}^{S} = \frac{1}{2} \sum_{k = 1}^{K} \sum_{j = 1}^{n_{k}} {∥ x_{Sj}^{k} - c_{S}^{k} ∥}_{2}^{2} ,$ (3) where $c_{S}^{k}$ is the centroid of the k^th category, defined as: $c_{S}^{k} = \frac{1}{n_{k}} \sum_{j = 1}^{n_{k}} x_{Sj}^{k} ,$ (4) where n_k denotes the number of k^th class samples in the source domain. By minimizing equation 3, the distance between samples of the same category in the source domain is significantly reduced.

Unfortunately, the adaptation of the conditional distribution is nontrivial, since there is no labeled data in the target domain. Target pseudo labels have achieved significant results in the practice of conditional distribution adaptation [30, 45–48]. Since the target domain has no labels, we also use the pseudo labels from a source classifier trained on the well-labeled x_S. Therefore, with the source labels and pseudo target labels, the intra-category variance between $D_{S}$ and $D_{T}$ can be defined as: $E_{var}^{ST} = \frac{1}{2} \sum_{k = 1}^{K} \sum_{i = 1}^{m_{k}} {∥ x_{Ti}^{k} - c_{S}^{k} ∥}_{2}^{2} ,$ (5) where m_k denotes the number of k^th class samples in the target domain. We minimize the intra-category variance between $D_{S}$ and $D_{T}$ to reduce the conditional distribution discrepancy.

The above goal can be rewritten as: $E_{var} = E_{var}^{S} + E_{var}^{ST} .$ (6) The centroids of the source data and the target data are well aligned via optimizing equation 6, and the alignment of centroids achieves the adaptation of source and target domain. To improve the performance of the transfer task, the discriminability of the features is also indispensable. We will describe this in the next part in detail.
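The variance terms of equations 3 to 6 can be sketched as follows. This is a minimal plain-Python illustration, not the authors' implementation: the helper names and the toy data are hypothetical, and the sketch assumes that every target pseudo label also occurs among the source labels so that a source centroid exists for it.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def source_centroids(xs, ys):
    """Eq. (4): per-category mean of the labeled source features."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    return {
        k: [sum(col) / len(rows) for col in zip(*rows)]
        for k, rows in groups.items()
    }


def variance_to_centroids(features, labels, centroids):
    """Eqs. (3) and (5): half the summed squared distance of each sample
    to the source centroid of its (pseudo) label."""
    return 0.5 * sum(sq_dist(x, centroids[y]) for x, y in zip(features, labels))


def e_var(xs, ys, xt, yt_pseudo):
    """Eq. (6): source term plus source-to-target term."""
    c = source_centroids(xs, ys)
    return variance_to_centroids(xs, ys, c) + variance_to_centroids(xt, yt_pseudo, c)


xs = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
ys = [0, 0, 1]
xt = [[1.0, 1.0], [10.0, 12.0]]
yt_pseudo = [0, 1]
print(e_var(xs, ys, xt, yt_pseudo))
```

Note that both terms measure distances to the source centroids only, which mirrors the choice in equations 3 and 5 of using $c_{S}^{k}$ as the common anchor for both domains.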

3.4 Discriminative features learning

Increase Inter-Category Distance: To further improve the classification performance during variance reduction, it is essential to ensure discriminative feature learning. However, reducing the distribution discrepancy alone does not attain this goal. Intuitively, if the distances between different categories are as large as possible, the discriminability of the learned feature representation will be improved. Hence, we suggest making the distance between each category’s centroid and the samples with non-matching labels y_i as large as possible. Taking the source domain as an example, we calculate the distance between each source sample and the centroids of the other categories. The inter-category distance of the source domain, which should be increased, is defined as: $E_{D}^{S} = \sum_{k = 1}^{K} \sum_{y_{i} \neq k} {∥ x_{Si} - c_{S}^{k} ∥}_{2}^{2}$ (7) Similarly, the inter-category distance between the target samples and the source centroids, which should also be increased, is defined as: $E_{D}^{ST} = \sum_{k = 1}^{K} \sum_{{\hat{y}}_{i} \neq k} {∥ x_{Ti} - c_{S}^{k} ∥}_{2}^{2}$ (8)

Since only the source data are labeled, we propose making the distance between the source centroid of category k and the target samples with non-matching pseudo labels ${\hat{y}}_{i}$ as large as possible. Combining the above objectives gives $E_{D} = E_{D}^{S} + E_{D}^{ST} .$ (9) By maximizing $E_{D}$ , the centroids of different categories are pushed apart, which encourages discriminative feature learning.

Transferability Weighting Module: Source instances that are irrelevant to the target instances can make domain adaptation difficult. We believe it is important to explore strategies for identifying the most relevant source instances in order to learn discriminative features. We therefore select source domain instances that are closer to the target domain and adjust the source domain distribution by transferability weighting. Because domain classification can quantify the similarity between a source sample and a target sample, we pre-train a domain classifier G_d to distinguish the representations of the source domain from those of the target domain: $D_{d} = - \frac{1}{n_{S}} \sum_{i = 1}^{n_{S}} \log (1 - G_{d} (f (x_{Si}))) - \frac{1}{n_{T}} \sum_{j = 1}^{n_{T}} \log (G_{d} (f (x_{Tj})))$ (10) G_d is only used to select the instances with high transferability, and its gradient is not back-propagated to update f. The output of G_d gives the probability that a sample belongs to the source domain through a softmax activation. The smaller the probability value G_d (x_si), the closer the source instance is to the target instance. Hence, the domain classification probability from the domain classifier G_d can be used to define the weight of a source instance. Based on information theory, the entropy function H (p) = - ∑_j p_j · log(p_j) is an uncertainty measure that can properly quantify the sample transferability. Therefore, we utilize the entropy criterion to generate the transferability weighting value for each sample of the source domain as: $w (x_{si}) = 1 - H (G_{d} (x_{si}))$ (11) In this way, an instance with higher transferability is weighted by a larger value.
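The inter-category distance of equations 7 to 9 can be illustrated with the short sketch below. It is a hedged example in plain Python: the function names, the fixed `centroids` dictionary, and the toy data are assumptions made for illustration, and the sample weights of the transferability weighting module are deliberately left out.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def inter_category_distance(features, labels, centroids):
    """Eqs. (7) and (8): for every category k, add the squared distance
    from the source centroid c_k to each sample whose label is not k."""
    return sum(
        sq_dist(x, c)
        for k, c in centroids.items()
        for x, y in zip(features, labels)
        if y != k
    )


def e_d(xs, ys, xt, yt_pseudo, centroids):
    """Eq. (9): source term plus the term for pseudo-labeled target samples."""
    return (inter_category_distance(xs, ys, centroids)
            + inter_category_distance(xt, yt_pseudo, centroids))


# Hypothetical source centroids for two categories.
centroids = {0: [1.0, 0.0], 1: [10.0, 10.0]}
xs = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
ys = [0, 0, 1]
xt = [[1.0, 1.0], [10.0, 12.0]]
yt_pseudo = [0, 1]
print(e_d(xs, ys, xt, yt_pseudo, centroids))
```

Because this quantity is maximized rather than minimized, it would enter a training loss with a negative sign, which matches the role of β in the overall objective.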
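To make equations 10 and 11 concrete, here is a minimal sketch in plain Python. It assumes that G_d returns a single probability strictly between 0 and 1 for each sample, that the entropy is taken over the two-way distribution [g, 1 - g], and that the logarithm is natural; none of these details is fixed by the text, and the function names are hypothetical.

```python
import math


def domain_loss(src_probs, tgt_probs):
    """Eq. (10): binary cross-entropy of the domain classifier outputs.
    Inputs are G_d outputs, assumed to lie strictly inside (0, 1)."""
    src_term = -sum(math.log(1.0 - g) for g in src_probs) / len(src_probs)
    tgt_term = -sum(math.log(g) for g in tgt_probs) / len(tgt_probs)
    return src_term + tgt_term


def entropy(probs):
    """H(p) = -sum_j p_j * log(p_j); zero entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def transferability_weight(g):
    """Eq. (11): w = 1 - H(G_d(x)), with G_d(x) read as [g, 1 - g]."""
    return 1.0 - entropy([g, 1.0 - g])


# Weights for three hypothetical source samples.
for g in (0.5, 0.9, 0.99):
    print(g, transferability_weight(g))
```

With the natural logarithm the entropy of a two-way distribution is at most log 2, so the weight in this sketch stays between 1 - log 2 and 1; a different log base would change that range.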

In summary, this paper increases the inter-category distance and selects transferable samples simultaneously in order to obtain a discriminative feature representation.

3.5 Network structure

In the task of extracting features from data such as images and videos, deep learning can extract more transferable feature representations. The transferability of features has been proven effective in many deep transfer learning methods. However, the high-level features of the source domain learned by the last few layers are not transferable to the target domain due to the domain shift [50], which hinders the improvement of classification performance. Therefore, we extend the deep residual network ResNet [51] to implement CDN as shown in Fig. 2, which contains 50 convolution-pooling layers conv1-pool5 as the shared feature extractor f for the source and target domains, and the fully connected layer fc as the task classifier G_y. We design the domain adaptation layer at the fully connected layers fc and the pooling layers pl via joint category-level and discriminative feature representation learning. In addition, we design a transferability weighting module (TWM) based on the domain classifier G_d, which is used to calculate w (x_si). The empirical error of the CDN classifier G_y on the labeled source domain data $D_{S}$ is $min_{G_{y}, f} E_{task}^{S} = \frac{1}{n_{S}} \sum_{i = 1}^{n_{S}} L_{S} (G_{y} (f (x_{Si})), y_{Si})$ (12) where $L_{S}$ is the cross-entropy loss function.

Due to the cross-domain discrepancy, training by (12) alone leads to poor classification performance on the target domain task. Meanwhile, the transferability of features and classifiers decreases as the cross-domain discrepancy increases. Therefore, it is very important to explicitly reduce the distribution difference. The transferable representations of the source and target domains $P (x_{S})$ and $P (x_{T})$ mainly reside in the pooling layer (pl), while the discriminative representations $P (y_{S})$ and $P ({\hat{y}}_{T})$ mainly reside in the fully connected layer (fc). This separation may prevent a network from capturing the discriminability and transferability of feature representations simultaneously. Our network bridges the gap between these layers by variance reduction and discriminative feature learning. The domain adaptation layer completes the category-level feature alignment by minimizing $E_{var}$ (6). In addition, $E_{D}$ (9) and the transferability weighting module (TWM) make the network learn more discriminative features, which further improves classification performance during training. We have the following training objective for CDN:

$\begin{matrix} \min_{f, G_{y}} E_{task}^{S} (f, G_{y}) \\ + λ \left( E_{var, y_{j} = k}^{S} (w (x_{Sj}^{k}), x_{Sj}^{k}, c_{S}^{k}) + E_{var, {\hat{y}}_{i} = k}^{ST} (x_{Ti}^{k}, c_{S}^{k}) \right) \\ - β \left( E_{D, y_{j} \neq k}^{S} (w (x_{Sj}), x_{Sj}, c_{S}^{k}) + E_{D, {\hat{y}}_{i} \neq k}^{ST} (x_{Ti}, c_{S}^{k}) \right) \end{matrix}$ (13) where λ and β are trade-off parameters and $w (x_{Sj}^{k})$ denotes the weight of each source sample x_sj. It is worth noting that the centroid of each category ( ${\hat{c}}_{k}$ ) needs to be updated as the parameters of the deep network change. In the CDN optimization process, the centroid of each category is updated on the mini-batch in each training iteration, which reduces the computational cost and adjusts the centroid in time. At the same time, γ, the learning rate of the centroid, reduces the interference caused by a few mislabeled samples. The update of $c_{S}^{k}$ is defined as: $Δ {\hat{c}}_{S}^{k} = \frac{\sum w (x_{sj}) \cdot ({\hat{c}}_{S}^{k} - x_{sj})}{1 + \sum w (x_{sj})}$ (14) where w (x_sj) denotes the transferability of a source sample. The larger the value of w (x_sj), the higher the transferability of the source sample and the more it contributes to the centroid update, which helps stabilize the adaptation process and learn discriminative features.
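A small sketch of the weighted centroid update in equation 14 follows. The text does not spell out how γ is applied, so the step `c - gamma * delta` used here is an assumption borrowed from the usual center-loss style update; the function names and the toy mini-batch are likewise hypothetical, and the sums are assumed to run over the mini-batch samples of category k.

```python
def centroid_delta(centroid, samples, weights):
    """Eq. (14): weighted mean offset between the centroid and the
    mini-batch samples of its category, damped by the '1 +' term."""
    denom = 1.0 + sum(weights)
    return [
        sum(w * (c - x[d]) for x, w in zip(samples, weights)) / denom
        for d, c in enumerate(centroid)
    ]


def update_centroid(centroid, samples, weights, gamma):
    """Assumed update rule: move the centroid by -gamma * delta."""
    delta = centroid_delta(centroid, samples, weights)
    return [c - gamma * d for c, d in zip(centroid, delta)]


# One category, two mini-batch samples with equal transferability weights.
centroid = [0.0, 0.0]
samples = [[2.0, 0.0], [4.0, 0.0]]
weights = [1.0, 1.0]
print(update_centroid(centroid, samples, weights, gamma=0.5))
```

The `1 +` in the denominator keeps the update finite when a category has no samples, or only low-weight samples, in the current mini-batch, in which case the centroid barely moves.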

Fig. 2

The overall architecture of our proposed CDN, which is based on ResNet-50 [51]. The network consists of three modules: 1) the feature extractor f composed of convolutional layers; 2) the domain adaptation layer composed of pooling layers and fully connected layers, which completes the adaptation task by minimizing $E_{var}$ and maximizing $E_{D}$ ; 3) the transferability weighting module (TWM), which outputs the transferability weight w for the source domain based on a domain classifier G_d (the gradient of G_d is not back-propagated to update f).

4 Experiments

We conduct experiments to evaluate the CDN compared with state-of-the-art unsupervised domain adaptation methods.

Comparison Methods: We compare CDN with shallow domain adaptation methods and state-of-the-art deep domain adaptation methods: Transfer Component Analysis (TCA) [26], Geodesic Flow Kernel (GFK) [41], ResNet [51], Deep Domain Confusion (DDC) [34], Residual Transfer Network (RTN) [35], Reverse Gradient (DANN) [36], Adversarial Discriminative Domain Adaptation (ADDA) [48], Deep Adaptation Network (DAN) [32, 33], Easy Transfer Learning (EasyTL) [44], and Joint Adaptation Networks (JAN) [40]. We follow the standard experimental setting for all unsupervised domain adaptation tasks [33, 36]: fully labeled samples of the source domain and unlabeled samples of the target domain are used for domain adaptation training.

4.1 Datasets

The Office-31 dataset [52] is a standard open benchmark dataset for domain adaptation, comprising 4,652 images in 31 categories collected from three subsets: Amazon (A), which contains images downloaded from amazon.com, and Webcam (W) and DSLR (D), which contain images taken by a web camera and a digital single-lens reflex camera, respectively, under different settings in an office environment. Each subset is an independent domain, and their distributions are different. Hence, we evaluate all methods on six unsupervised domain adaptation tasks, A → W, D → W, W → D, A → D, D → A and W → A, following the common evaluation protocol in [32].

The ImageCLEF-DA dataset [53] is a standard benchmark dataset for the ImageCLEF 2014 domain adaptation task, which contains three subsets: Caltech-256 (C), ImageNet ILSVRC 2012 (I) and Pascal VOC 2012 (P). There are 600 images in each subset, with 12 categories and 50 images per category. Because the three subsets in ImageCLEF-DA are of equal size, it is a good complement to Office-31, where the subsets have different sizes. We evaluate all six unsupervised domain adaptation tasks: I → P, P → I, I → C, C → I, C → P, and P → C.

In detail, we follow standard evaluation protocols for unsupervised domain adaptation: all labeled source data and all unlabeled target data are used for training, and only the target data are used for testing (the labels of the target data are used only in the testing stage). Moreover, we show some samples of the datasets in Fig. 3.

Fig. 3

Some selected sample images of object datasets.

Evaluation Protocols: Our CDN and all comparative domain adaptation methods are built on ResNet (50-layer) [51], which serves as the base framework for learning deep transferable features. Specifically, the deep representations output by the pool5 layer of ResNet-50 are used as features for domain adaptation. We follow standard evaluation protocols for unsupervised domain adaptation [33, 52], and we report the classification accuracy and the standard error of each domain adaptation task over three random experiments.

We implement our CDN based on the PyTorch framework. The base architecture ResNet (50-layer) [51] is pre-trained on the ImageNet dataset [54]. We fine-tune the shared feature extractor f and train the classifier G_y by end-to-end back-propagation. We adopt mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 for all parameter updates. The learning rate annealing follows the strategy of DANN [36]: the learning rate is adjusted during SGD by $η_{p} = \frac{η_{0}}{(1 + α p)^{β}}$ , where p is the training progress, which changes linearly from 0 to 1; this promotes convergence and reduces the error on the source domain. Since the classifier is retrained, we set its learning rate to ten times that of the other layers. In the domain adaptation layer, the centroid of each source category is updated by standard mini-batch SGD. The hyperparameters λ and β control the penalties of $E_{var}$ and $E_{D}$ , respectively, and γ controls the learning rate of the centroid ${\hat{c}}_{k}$ in the source domain. We conducted three experiments on the Office-31 task A → W to study the sensitivity of the three parameters, which are shown in Fig. 5(c), (d) and (e).
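The annealing rule above can be sketched in a few lines of Python. The default values of `eta0`, `alpha` and `beta` below are illustrative assumptions, since the text does not list the values it uses, and the helper names are hypothetical. Note that the exponent β of the schedule is a different quantity from the trade-off parameter β of the training objective.

```python
def annealed_lr(p, eta0=0.01, alpha=10.0, beta=0.75):
    """Learning rate eta_p = eta0 / (1 + alpha * p) ** beta.

    p is the training progress in [0, 1]; the defaults are assumed
    values for illustration only.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("training progress p must lie in [0, 1]")
    return eta0 / (1.0 + alpha * p) ** beta


def layer_learning_rates(p):
    """Base rate for the fine-tuned layers and a tenfold rate for the
    retrained classifier, as described in the text."""
    base = annealed_lr(p)
    return {"feature_extractor": base, "classifier": 10.0 * base}


for step, total in ((0, 100), (50, 100), (100, 100)):
    print(step, layer_learning_rates(step / total))
```

In a PyTorch training loop, rates like these would typically be written into the optimizer's parameter groups at every iteration, with one group for the feature extractor and one for the classifier.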

4.2 Results and discussion

Results on Office-31 and ImageCLEF-DA: The classification performance of unsupervised domain adaptation is shown in Tables 3 and 4. For a fair comparison, we cite the published results from the original papers or directly from [40]. In most domain adaptation tasks, the proposed CDN model outperforms all comparison methods. We gain several interesting observations from the experimental results. First, compared to traditional shallow transfer learning methods and standard deep learning methods, deep transfer learning methods achieve better performance. Recent works such as DDC [34], DAN [32, 33], DANN [36], RTN [35] and JAN [40] have validated that the reduction of domain discrepancy plays an important role in the extraction of transferable features.

Second, in order to study the influence of our method on transferable feature learning, we conduct ablation experiments on the Office-31 dataset. The study evaluates two variants of CDN: CDN(w/o DR), in which neither Increase Inter-Category Distance nor the Transferability Weighting Module is used in the training process; and CDN(w/o R), which is trained without the Transferability Weighting Module only. The results in Table 5 support the motivation of this paper. (a) CDN(w/o DR) achieves better results than JAN on four of the six tasks (A → W, D → W, W → D and A → D), which validates that variance reduction plays an important role in domain adaptation. (b) CDN(w/o R) works better than CDN(w/o DR) but worse than the full CDN. The gap to the full CDN validates that the transferability weighting module makes the source domain features more transferable and forms dense clusters, which demonstrates that TWM stabilizes the adaptation process. Comparing CDN(w/o DR) with CDN(w/o R) shows that increasing the inter-category distance makes the centroids of different categories more separated, which encourages the network to learn more discriminative features and further improves the classification performance in the target domain. (c) The full CDN model, whose domain adaptation layer consists of category-level and discriminative feature learning, achieves the best results. This suggests that domain adaptation of samples of the same category reduces the distribution discrepancy, while samples of different categories become more separated, making the features more discriminative.

Finally, in the domain adaptation layer of CDN, the network can fully utilize the conditional distribution information of the source domain. Based on deep learning and the domain adaptation layer, the distributions of the source and target domains can be well adapted. Thus the CDN model achieves the best average accuracy among the compared methods on Office-31 and ImageCLEF-DA.

Table 3
Classification accuracy (%) on Office-31 dataset

Method	A → W	D → W	W → D	A → D	D → A	W → A	Avg
ResNet	68.4 ± 0.2	96.7 ± 0.1	99.3 ± 0.1	68.9 ± 0.2	62.5 ± 0.3	60.7 ± 0.3	76.1
GFK	72.8 ± 0.0	95.0 ± 0.0	98.2 ± 0.0	74.5 ± 0.0	63.4 ± 0.0	61.0 ± 0.0	77.5
TCA	72.7 ± 0.0	96.7 ± 0.0	99.6 ± 0.0	74.1 ± 0.0	61.7 ± 0.0	60.9 ± 0.0	77.6
DDC	75.6 ± 0.2	96.0 ± 0.2	98.2 ± 0.1	76.5 ± 0.3	62.2 ± 0.4	61.5 ± 0.5	78.3
RTN	84.5 ± 0.2	96.8 ± 0.1	99.4 ± 0.1	77.5 ± 0.3	66.2 ± 0.2	64.8 ± 0.3	81.6
EasyTL	81.8 ± 0.0	85.7 ± 0.0	67.6 ± 0.0	96.3 ± 0.0	93.8 ± 0.0	67.2 ± 0.0	82.1
DANN	82.0 ± 0.4	96.9 ± 0.2	99.1 ± 0.1	79.7 ± 0.4	68.2 ± 0.4	67.4 ± 0.5	82.2
ADDA	86.2 ± 0.5	96.2 ± 0.3	98.4 ± 0.3	77.8 ± 0.3	69.5 ± 0.4	68.9 ± 0.5	82.9
DAN	86.3 ± 0.3	97.2 ± 0.2	99.6 ± 0.1	82.1 ± 0.3	64.6 ± 0.4	65.2 ± 0.3	82.5
JAN	85.4 ± 0.3	97.4 ± 0.2	99.8 ± 0.2	84.7 ± 0.3	68.6 ± 0.3	70.0 ± 0.4	84.3
CDN	90.6 ± 0.3	99.0 ± 0.0	100.0 ± 0.0	89.7 ± 0.6	66.1 ± 0.6	66.4 ± 0.1	85.3

Table 4

Classification accuracy (%) on ImageCLEF-DA dataset

Method	I → P	P → I	I → C	C → I	C → P	P → C	Avg
ResNet	74.8 ± 0.3	83.9 ± 0.1	91.5 ± 0.3	78.0 ± 0.2	65.5 ± 0.3	91.2 ± 0.3	80.7
GFK	73.5 ± 0.0	74.8 ± 0.0	91.2 ± 0.0	84.8 ± 0.0	70.0 ± 0.0	82.3 ± 0.0	79.4
TCA	75.0 ± 0.3	78.3 ± 0.0	91.5 ± 0.0	83.7 ± 0.0	71.7 ± 0.0	84.0 ± 0.0	80.7
DAN	74.8 ± 0.3	83.9 ± 0.1	91.5 ± 0.3	78.0 ± 0.2	65.5 ± 0.3	91.2 ± 0.3	80.7
RTN	74.6 ± 0.3	85.8 ± 0.1	94.3 ± 0.1	85.9 ± 0.3	71.7 ± 0.3	91.2 ± 0.4	83.9
DANN	75.0 ± 0.6	86.0 ± 0.3	96.2 ± 0.4	87.0 ± 0.5	74.3 ± 0.5	91.5 ± 0.6	85.0
EasyTL	78.5 ± 0.0	83.9 ± 0.0	93.3 ± 0.0	85.5 ± 0.0	72.0 ± 0.0	95.0 ± 0.0	85.6
JAN	76.8 ± 0.4	88.0 ± 0.2	94.7 ± 0.2	89.5 ± 0.3	74.2 ± 0.3	91.7 ± 0.3	85.8
CDN	78.8 ± 0.4	92.0 ± 0.2	95.9 ± 0.1	91.0 ± 0.3	75.7 ± 0.3	95.2 ± 0.4	88.1

Table 5

Ablation experiments on the Office-31 dataset

Method	A → W	D → W	W → D	A → D	D → A	W → A	Avg
ResNet	68.4 ± 0.2	96.7 ± 0.1	99.3 ± 0.1	68.9 ± 0.2	62.5 ± 0.3	60.7 ± 0.3	76.1
CDN(w/o DR)	89.0 ± 0.4	98.8 ± 0.1	100.0 ± 0.0	87.3 ± 0.8	65.2 ± 0.1	64.6 ± 0.4	84.1
CDN(w/o R)	89.6 ± 0.3	98.9 ± 0.1	100.0 ± 0.0	89.0 ± 0.2	65.7 ± 0.5	66.1 ± 0.0	84.9
CDN	90.6 ± 0.3	99.0 ± 0.0	100.0 ± 0.0	89.7 ± 0.6	66.1 ± 0.6	66.4 ± 0.1	85.3
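
As a quick arithmetic check on Table 5, the following sketch recomputes the mean accuracy of each variant from the per-task values and the gain contributed by each added component. The names `TABLE5`, `mean_accuracy` and `incremental_gains` are illustrative and do not come from the paper. The recomputed means can differ from the reported averages in the last digit because the per-task entries are themselves rounded.

```python
# Per-task accuracies (%) copied from Table 5.
# Task order: A->W, D->W, W->D, A->D, D->A, W->A.
TABLE5 = {
    "ResNet":      [68.4, 96.7, 99.3, 68.9, 62.5, 60.7],
    "CDN(w/o DR)": [89.0, 98.8, 100.0, 87.3, 65.2, 64.6],
    "CDN(w/o R)":  [89.6, 98.9, 100.0, 89.0, 65.7, 66.1],
    "CDN":         [90.6, 99.0, 100.0, 89.7, 66.1, 66.4],
}

# Variants ordered from the plain backbone to the full model.
ORDER = ["ResNet", "CDN(w/o DR)", "CDN(w/o R)", "CDN"]


def mean_accuracy(accs):
    """Arithmetic mean of the per-task accuracies."""
    return sum(accs) / len(accs)


def incremental_gains(table, order):
    """Gain in mean accuracy of each variant over the previous one in `order`."""
    means = [mean_accuracy(table[name]) for name in order]
    return {order[i]: means[i] - means[i - 1] for i in range(1, len(order))}


GAINS = incremental_gains(TABLE5, ORDER)

if __name__ == "__main__":
    for name in ORDER:
        print(f"{name:12s} mean = {mean_accuracy(TABLE5[name]):.2f}")
    for name, gain in GAINS.items():
        print(f"{name:12s} gain over previous variant = {gain:+.2f}")
```

Under this reading, most of the improvement over ResNet comes from the category-level adaptation, and the inter-category distance term and the TWM each add a smaller positive gain.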

Conditional Distribution Discrepancy: Domain adaptation theory shows that reducing the cross-domain discrepancy is key to solving the domain shift problem [22, 55]. In the Office-31 dataset, the distributions of domains W and D are similar to each other, but both differ from A. As shown in Table 3, the tasks between the similar domains (W → D and D → W) reach higher accuracy than A → W and A → D, which is consistent with the theory. On the transfer tasks A → W and A → D, where the distribution discrepancy between domains is larger, the advantage of CDN is more pronounced. However, transfer becomes more difficult on the asymmetric tasks D → A and W → A, which adapt from a domain with a small number of samples to a domain with a large number of samples. Unlike Office-31, the three domains of ImageCLEF-DA are more symmetric. With these tasks, we examine whether the performance improves when the domains have the same subset size. The classification results based on ResNet-50 [51] are shown in Table 4. CDN outperforms the comparison methods on most domain adaptation tasks, which is also consistent with the theory.
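
To make the notion of a category-level discrepancy concrete, the following is a minimal sketch that measures the average squared distance between source and target class centroids. It is an assumed stand-in for illustration only and is not the discrepancy defined by Equation (2) of this paper. Because target labels are unavailable in unsupervised domain adaptation, the sketch assumes pseudo-labels for the target samples; the function names are hypothetical.

```python
from collections import defaultdict


def class_centroids(features, labels):
    """Mean feature vector per class; features are equal-length lists of floats."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def centroid_discrepancy(src_x, src_y, tgt_x, tgt_pseudo_y):
    """Average squared Euclidean distance between the source and target
    centroids of every class present in both domains (assumed metric)."""
    src_c = class_centroids(src_x, src_y)
    tgt_c = class_centroids(tgt_x, tgt_pseudo_y)
    shared = sorted(set(src_c) & set(tgt_c))
    if not shared:
        raise ValueError("no class is present in both domains")
    total = 0.0
    for c in shared:
        total += sum((a - b) ** 2 for a, b in zip(src_c[c], tgt_c[c]))
    return total / len(shared)


# Toy example: two classes in a 2-D feature space.
SRC_X = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
SRC_Y = [0, 0, 1]
TGT_X = [[1.0, 1.0], [10.0, 12.0]]
TGT_Y = [0, 1]  # pseudo-labels, assumed to come from the current classifier
TOY_DISCREPANCY = centroid_discrepancy(SRC_X, SRC_Y, TGT_X, TGT_Y)
```

A value that shrinks during training would indicate that same-category features from the two domains are moving closer together, which is the behaviour the category-level adaptation aims for.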

4.3 Analysis

Feature Visualization: We visualize the network activations of the feature extractors of ResNet, DAN [33], CDN(w/o DR), CDN(w/o R) and the full CDN on the adaptation task A → W using t-SNE [56] in Fig. 4. For the features of ResNet, the clusters of the source domain are not compact enough, and the samples of the target domain are not well aligned with the cluster centroids. DAN only aligns the marginal distributions, so this global adaptation strategy does not consider category-level distribution adaptation, as illustrated in Fig. 4(b). As a result, with global adaptation (i.e., DAN) the source and target domains are better aligned than with ResNet, but the categories are still not well discriminated. Figure 5(a) shows the conditional distribution discrepancy of the learned features during the training phase, which is calculated with Equation (2). The discrepancy becomes smaller over the optimization iterations, which indicates that our approach can effectively achieve domain adaptation. For the features of CDN(w/o DR), the source domain forms compact clusters, and the samples of the target domain are well aligned with the centroid of the same category in the source domain. However, some features of the source and target domains are still not well separated. For the features of CDN(w/o R), the clusters of the source and target domains are further apart from each other. For the features of the full CDN, the source and target domains form even denser clusters, and the shared categories across domains are closely aligned while different categories are well distinguished. This gives an intuitive verification of the effectiveness of CDN.

Fig. 4

The t-SNE visualization of feature representations learned by (a) ResNet (trained on source data), (b) DAN, (c) CDN(w/o DR), (d) CDN(w/o R) and (e) CDN. The ▵ markers are samples from the source domain Amazon (A), and the × markers are samples from the target domain Webcam (W). The same color represents the same category of data. In particular, global adaptation (DAN) does not guarantee category-level adaptation, as illustrated in Fig. 2(b).

Parameter Sensitivity: We check the sensitivity of CDN to the hyper-parameters λ, β and γ, the last of which controls the learning rate of the centralization of clusters in the adaptation layer. Figure 5 shows the parameter sensitivity of CDN on task A → W. When testing the CDN penalty λ, we fix γ = 0.25, β = 0.1 and vary λ ∈ {0.0005, 0.001, 0.003, 0.005, 0.01, 1}. When testing β, we fix λ = 0.003, γ = 0.25 and vary β ∈ {0.01, 0.1, 0.5, 1}. The results are shown in Fig. 5(c) and (d). Theoretically, larger values of λ and β make the shrinkage regularization more important in CDN. When λ → ∞ or β → ∞, the optimization problem is ill-defined. When λ → 0 or β → 0, distribution adaptation is not performed, and CDN cannot construct a robust representation. When testing γ, we fix λ = 0.003, β = 0.1 and vary γ ∈ {0.01, 0.05, 0.25, 0.5, 1}. The verification accuracies are illustrated in Fig. 5(e); the performance of our model remains mostly stable across a wide range of γ.
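
The one-at-a-time protocol described above can be sketched as follows. The grids and fixed values are taken from the text, with the third grid read as the grid for γ. The callback `evaluate` is hypothetical: it stands for a routine that trains CDN on A → W with the given configuration and returns the target accuracy, which is not implemented here.

```python
# Fixed values and candidate grids as stated in the text.
DEFAULTS = {"lam": 0.003, "beta": 0.1, "gamma": 0.25}
GRIDS = {
    "lam": [0.0005, 0.001, 0.003, 0.005, 0.01, 1],
    "beta": [0.01, 0.1, 0.5, 1],
    "gamma": [0.01, 0.05, 0.25, 0.5, 1],  # assumed to be the grid for gamma
}


def one_at_a_time_configs(defaults, grids):
    """Yield (varied_name, config) pairs: one hyper-parameter is swept
    while the others stay at their fixed values."""
    for name, values in grids.items():
        for value in values:
            config = dict(defaults)
            config[name] = value
            yield name, config


def run_sweep(evaluate, defaults=DEFAULTS, grids=GRIDS):
    """Call the hypothetical `evaluate(config)` for every configuration and
    group the (value, score) pairs by the hyper-parameter that was varied."""
    results = {name: [] for name in grids}
    for name, config in one_at_a_time_configs(defaults, grids):
        results[name].append((config[name], evaluate(config)))
    return results


CONFIGS = list(one_at_a_time_configs(DEFAULTS, GRIDS))
```

Sweeping one parameter at a time needs 6 + 4 + 5 = 15 training runs, compared with 6 × 4 × 5 = 120 for a full grid, at the cost of not observing interactions between the parameters.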

Fig. 5

(a) Conditional distribution discrepancy. It is worth noting that the reduction of variances decreases the discrepancy of the conditional distribution, (b) Convergence performance on the adaptation task of A → W by our CDN, (c)-(e) Sensitivity of λ, β and γ.

Convergence Performance: We also empirically check the convergence of CDN. Figure 5(b) shows the test error of CDN on the adaptation task A → W. The classification accuracy increases steadily, and the distribution distance decreases steadily, as the number of iterations grows.

5 Conclusion

In this paper, we have introduced a novel unsupervised domain adaptation approach that focuses on learning category-level and discriminative feature representations. We perform category-level distribution adaptation by reducing the variances between the source and target domains. Comprehensive experiments on image classification datasets verify the effectiveness and efficiency of the proposed approach over several state-of-the-art methods.

Most of the recent work has focused on cross-domain image classification tasks. In real-world applications, unsupervised domain adaptation is gaining more and more attention. We will explore the application of unsupervised domain adaptation in other areas, such as cross-domain image segmentation or object detection tasks.

Acknowledgment

This work is supported by the National Key R&D Program of China (2018YFC0309400), the National Natural Science Foundation of China (61871188), and Guangzhou city science and technology research projects (201902020008).

References

1. Fathollahi-Fard A.M., Hajiaghaei-Keshteli and Mirjalili, Multi-objective stochastic closed-loop supply chain network design with social considerations, Applied Soft Computing, 2018, 505–25.

2. Fathollahi-Fard A.M., Hajiaghaei-Keshteli and Mirjalili, A bi-objective green home health care routing problem, Journal of Cleaner Production, 2018, 423–43.

3. Hajiaghaei-Keshteli and Fathollahi-Fard A.M., A set of efficient heuristics and metaheuristics to solve a two-stage stochastic bi-level decision-making model for the distribution network problem, Computers & Industrial Engineering, 2018, 378–95.

4. Fu, Tian, Fathollahi-Fard A.M., Ahmadi and Zhang, Stochastic multi-objective modelling and optimization of an energy-conscious distributed permutation flow shop scheduling problem with the total tardiness constraint, Journal of Cleaner Production, 2019, 515–25.

5. Fard A.M. and Hajiaghaei-Keshteli, A bi-objective partial interdiction problem considering different defensive systems with capacity expansion of facilities under imminent attacks, Applied Soft Computing, 2018, 343–59.

6. Fathollahi-Fard A.M., Hajiaghaei-Keshteli and Mirjalili, A set of efficient heuristics for a home healthcare problem, Computing and Applications, 2019, 1–21.

7. Bahadori-Chinibelagh, Fathollahi-Fard A.M. and Hajiaghaei-Keshteli, Two Constructive Algorithms to Address a Multi-Depot Home Healthcare Routing Problem, IETE Journal of Research, 2019, 1–7.

8. Abdi, Abdi, Fathollahi-Fard A.M. and Hajiaghaei-Keshteli, A set of calibrated metaheuristics to address a closed-loop supply chain network design problem under uncertainty, International Journal of Systems Science: Operations & Logistics, 2019, 1–8.

9. Safaeian, Fathollahi-Fard A.M., Tian, Li and Ke, A multi-objective supplier selection and order allocation through incremental discount in a fuzzy environment, Journal of Intelligent & Fuzzy Systems, 2019, 1435–1455.

10. Elsayed, Shankar, Cheung, Papernot, Kurakin, Goodfellow and Sohl-Dickstein, Adversarial examples that fool both computer vision and time-limited humans, Advances in Neural Information Processing Systems, 2018, 3910–3920.

11. Voulodimos, Doulamis and Protopapadakis, Deep learning for computer vision: A brief review, Computational Intelligence and Neuroscience, 2018.

12. Young, Hazarika, Poria and Cambria, Recent trends in deep learning based natural language processing, IEEE Computational Intelligence Magazine 13(3) (2018), 55–75.

13. Becker, Kasper, Böckmann, Jöckel K.H. and Virchow, Natural language processing of German clinical colorectal cancer notes for guideline-based treatment evaluation, International Journal of Medical Informatics 127 (2019), 141–146.

14. Pan S.J. and Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 20(10) (2009), 1345–1359.

15. Duan, Xu and Tsang I.W.H., Domain adaptation from multiple sources: A domain-dependent regularization approach, IEEE Transactions on Neural Networks and Learning Systems 23(3) (2012), 504–518.

16. Torralba and Efros A.A., Unbiased look at dataset bias, CVPR 1(2) (2011), 7.

17. Dai, Yang, Xue G.R. and Yu, Boosting for transfer learning, Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, 193–200.

18. Pardoe and Stone, Boosting for regression transfer, Proceedings of the 27th International Conference on Machine Learning, Omnipress, 2010, 863–870.

19. Luo, Zheng, Guan, Yu and Yang, Taking A Closer Look at Domain Shift: Category-level Adversaries for Semantics Consistent Domain Adaptation, arXiv preprint arXiv:1809.09478, 2018.

20. Blitzer, Dredze and Pereira, Biographies, Bollywood, Boom-boxes and Blenders: Domain adaptation for sentiment classification, Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, 2007, 440–447.

21. Zhou J.T., Tsang I.W., Pan S.J. and Tan, Heterogeneous domain adaptation for multiple classes, Artificial Intelligence and Statistics, 2014, 1095–1103.

22. Ben-David, Blitzer, Crammer, Kulesza, Pereira and Vaughan J.W., A theory of learning from different domains, Machine Learning 79(1-2) (2010), 151–175.

23. Huang, Gretton, Borgwardt, Schölkopf and Smola A.J., Correcting sample selection bias by unlabeled data, Advances in Neural Information Processing Systems, 2007, 601–608.

24. Wang, Wang, Zhang and Gao, Cross-domain metric and multiple kernel learning based on information theory, Neural Computation 30(3) (2018), 820–855.

25. Gretton, Borgwardt K.M., Rasch M.J., Schölkopf and Smola, A kernel two-sample test, Journal of Machine Learning Research 13(Mar) (2012), 723–773.

26. Pan S.J., Tsang I.W., Kwok J.T. and Yang, Domain adaptation via transfer component analysis, IEEE Transactions on Neural Networks 22(2) (2010), 199–210.

27. Baktashmotlagh, Harandi M.T., Lovell B.C. and Salzmann, Unsupervised domain adaptation by domain invariant projection, Proceedings of the IEEE International Conference on Computer Vision, 2013, 769–776.

28. Long, Wang, Ding, Sun and Yu P.S., Transfer feature learning with joint distribution adaptation, Proceedings of the IEEE International Conference on Computer Vision, 2013, 2200–2207.

29. Zhang, Li and Ogunbona, Joint geometrical and statistical alignment for visual domain adaptation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, 1859–1867.

30. Sun, Chattopadhyay, Panchanathan and Ye, A two-stage weighting framework for multi-source domain adaptation, Advances in Neural Information Processing Systems, 2011, 505–513.

31. Gong, Grauman and Sha, Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation, February, International Conference on Machine Learning, 2013, 222–230.

32. Long, Cao, Wang and Jordan M.I., Transferable representation learning with deep adaptation networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

33. Long, Cao, Wang and Jordan M.I., Learning transferable features with deep adaptation networks, arXiv preprint arXiv:1502.02791, 2015.

34. Tzeng, Hoffman, Zhang, Saenko and Darrell, Deep domain confusion: Maximizing for domain invariance, arXiv preprint arXiv:1412.3474, 2014.

35. Long, Zhu, Wang and Jordan M.I., Unsupervised domain adaptation with residual transfer networks, Advances in Neural Information Processing Systems, 2016, 136–144.

36. Ganin and Lempitsky, Unsupervised domain adaptation by backpropagation, International Conference on Machine Learning (ICML), 2015, 1180–1189.

37. Dosovitskiy, Springenberg J.T., Riedmiller and Brox, Discriminative unsupervised feature learning with convolutional neural networks, Advances in Neural Information Processing Systems, 2014, 766–774.

38. Wen, Zhang, Li and Qiao, A discriminative feature learning approach for deep face recognition, European Conference on Computer Vision, Springer Cham, 2016, 499–515.

39. Wang, Wang, Zhou, Ji, Gong, Zhou and Liu, Cosface: Large margin cosine loss for deep face recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, 5265–5274.

40. Long, Zhu, Wang and Jordan M.I., Deep transfer learning with joint adaptation networks, Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR.org, 2017, 2208–2217.

41. Gong, Shi, Sha and Grauman, Geodesic flow kernel for unsupervised domain adaptation, 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, 2066–2073.

42. Fernando, Habrard, Sebban and Tuytelaars, Unsupervised visual domain adaptation using subspace alignment, Proceedings of the IEEE International Conference on Computer Vision, 2013, 2960–2967.

43. Long, Wang, Ding, Sun and Yu P.S., Transfer joint matching for unsupervised domain adaptation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, 1410–1417.

44. Wang, Chen, Yu, Huang and Yang, Easy Transfer Learning By Exploiting Intra-domain Structures, arXiv preprint arXiv:1904.01376, 2019.

45. Long, Wang, Ding, Pan S.J. and Philip S.Y., Adaptation regularization: A general framework for transfer learning, IEEE Transactions on Knowledge and Data Engineering 26(5) (2013), 1076–1089.

46. Wang, Chen, Hu, Peng and Philip S.Y., Stratified transfer learning for cross-domain activity recognition, IEEE International Conference on Pervasive Computing and Communications (PerCom), IEEE, 2018, 1–10.

47. Quanz, Huan and Mishra, Knowledge transfer with low-quality data: A feature extraction issue, IEEE Transactions on Knowledge and Data Engineering 24(10) (2012), 1789–1802.

48. Tzeng, Hoffman, Saenko and Darrell, Adversarial discriminative domain adaptation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, 7167–7176.

49. Pei, Cao, Long and Wang, Multi-adversarial domain adaptation, Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

50. Yosinski, Clune, Bengio and Lipson, How transferable are features in deep neural networks? Advances in Neural Information Processing Systems, 2014, 3320–3328.

51. He, Zhang, Ren and Sun, Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, 770–778.

52. Saenko, Kulis, Fritz and Darrell, Adapting visual category models to new domains, European Conference on Computer Vision, Springer, Berlin, Heidelberg, 2010, 213–226.

53. ImageCLEF-DA dataset, http://imageclef.org/2014/adaptation/, 2014.

54. Russakovsky, Deng, Su, Krause, Satheesh, Ma and Berg A.C., Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115(3) (2015), 211–252.

55. Mansour, Mohri and Rostamizadeh, Domain adaptation: Learning bounds and algorithms, arXiv preprint arXiv:0902.3430, 2009.

56. Maaten L.V.D. and Hinton, Visualizing data using t-sne, Journal of Machine Learning Research 9(Nov) (2008), 2579–2605.