Inter-class distribution alienation and inter-domain distribution alignment based on manifold embedding for domain adaptation

Abstract

Domain adaptation (DA) aims to train a robust predictor by transferring rich knowledge from a well-labeled source domain to annotate a newly coming target domain; however, the two domains are usually drawn from very different distributions. Most current methods either learn the common features by matching inter-domain feature distributions and training the classifier separately or align inter-domain label distributions to directly obtain an adaptive classifier based on the original features despite feature distortion. Moreover, intra-domain information may be greatly degraded during the DA process; i.e., the source data samples from different classes might grow closer. To this end, this paper proposes a novel DA approach, referred to as inter-class distribution alienation and inter-domain distribution alignment based on manifold embedding (IDAME). Specifically, IDAME commits to adapting the classifier on the Grassmann manifold by using structural risk minimization, where inter-domain feature distributions are aligned to mitigate feature distortion, and the target pseudo labels are exploited using the distances on the Grassmann manifold. During the classifier adaptation process, we simultaneously consider the inter-class distribution alienation, the inter-domain distribution alignment, and the manifold consistency. Extensive experiments validate that IDAME can outperform several comparative state-of-the-art methods on real-world cross-domain image datasets.

Keywords

Domain adaptation, structural risk minimization, maximum mean discrepancy, manifold embedding

1 Introduction

Traditional machine learning algorithms, which have achieved remarkable results, usually assume that the distributions of the training and testing datasets are similar or even identical. However, this assumption is often invalid in real-world applications; thus, we have to label large amounts of data instances that follow the same distribution as the testing dataset to guarantee the model’s performance, which is very labor-intensive and time-consuming [25, 34]. To deal with this challenge, domain adaptation (DA) has been proposed, which leverages the richly labeled source data instances (training dataset) to learn a robust classifier for a newly arriving and totally unlabeled target domain (testing dataset) drawn from a very different distribution [27]. The key to achieving DA is to reduce the distribution divergence between the source and target domains, so that knowledge can be shared across the domains and the labeling cost for the target domain is greatly reduced.

Consider visual recognition technology for autonomous vehicles. It is impractical to collect all possible visual images in the real world, as this would be expensive and potentially dangerous. An alternative and feasible strategy is to simulate real-world images as thoroughly as possible to save time and labor, and then to use these simulated images to train the model behind the vehicles' object recognition systems. However, there is still a very large gap between real-world and simulated images, so the performance of the recognition systems is poor. A classification model trained on the simulated images therefore needs to be adapted before it can make predictions for real-world images. In such cases, DA is an effective technique that narrows the gap between real-world and simulated images by aligning their distributions.

In the past several years, researchers have developed various DA approaches to reduce the cross-domain distribution difference. In summary, there are three main types: (a) re-weighting the source data instances according to some weighting techniques so that the effect of the “bad” instances is reduced while the contribution of the “good” instances is enhanced [8, 14, 17, 28, 29, 32]; (b) conducting feature adaptation by minimizing a predefined distribution distance that ably characterizes the cross-domain discrepancy [2, 18, 21, 24, 26]; and (c) directly designing an adaptive classifier by minimizing inter-domain label distribution difference [3, 10, 20, 30].

The approaches of re-weighting the source data instances are effective only when the cross-domain distribution difference is small, while the distributions of different domains differ greatly due to their feature or label distribution misalignment in many real-world scenarios [22]. Recently, many approaches [1, 5, 12, 15, 19, 30] have been devised to address this challenge by elaborately aligning the feature distributions, or directly designing an adaptive classifier by considerably reducing the label distribution discrepancy. However, there still exist several challenges that are not solved satisfactorily. First, feature adaptation can only reduce, rather than remove, feature distribution discrepancy. Second, directly designing an adaptive classifier based on the original features can match their label distributions but ignores their feature distortion [16]. Third, intra-domain information might be lost as the inter-domain distribution difference is reduced. For instance, the source data samples from different classes are probably closer than before, which might degrade the classifier behavior and influence the final DA performance.

To address these challenges, we propose a novel approach involving inter-class distribution alienation and inter-domain distribution alignment based on manifold embedding (IDAME) for domain adaptation. Specifically, IDAME commits to adapting the classifier on the Grassmann manifold by using structural risk minimization, which can not only reduce the inter-domain feature/label distribution divergence but also preserve the label distribution difference among source data instances that are from different classes. IDAME can be divided into two stages to achieve DA. In the first stage, the specific features can be learned on the Grassmann manifold to align feature distributions of different domains to avoid feature distortion in the original space, and the target pseudo labels are exploited using the distances on the Grassmann manifold. In the second stage, with the manifold aligned features, an adaptive classifier based on structural risk minimization is modeled, where the inter-domain label distribution alignment and the manifold consistency are regarded as the regularization terms of the classifier. This paper proposes inter-class label distribution discrepancy preservation in the source domain, and we consider this preservation as another regularization term to mitigate the degradation of classification performance since the source data samples from different classes are probably closer than before. In summary, the contributions of this paper are as follows.

We aim to adapt the classifier on the Grassmann manifold by using structural risk minimization, where inter-domain feature distributions are aligned to alleviate the original feature distortion and where the target pseudo labels are exploited using the distances on the Grassmann manifold.

During the classifier adaptation process, we reduce the inter-domain label distribution difference and preserve the intra-domain label distribution discrepancy to further achieve domain adaptation and prevent intra-domain information degradation.

Extensive experiments validate that IDAME can outperform several comparative state-of-the-art methods on real-world cross-domain image datasets.

2 Related works

This section discusses the current DA approaches on feature adaptation and classifier adaptation that are most related to our work.

In most feature adaptation methods, a latent shared feature space is extracted to statistically reduce the feature distribution difference between the source and target domains. Pan et al. [26] proposed transfer component analysis (TCA) to minimize the inter-domain feature distribution difference measured by the maximum mean discrepancy (MMD). Long et al. [18] presented joint distribution adaptation (JDA) to improve TCA by jointly minimizing the marginal and conditional feature distribution distances between the two domains using target pseudo labels predicted by some base classifiers. In addition, transfer joint matching (TJM) was developed to explore not only feature matching but also instance re-weighting [19]. Other methods exploited common features of the source and target domains by searching for their intermediate subspaces to geometrically reduce the feature distribution shift. Gopalan et al. [23] used intermediary projections of the two domains along the shortest geodesic path and therefore connected them on the Grassmann manifold. Additionally, Gong et al. [2] presented the geodesic flow kernel (GFK), based on the Grassmann manifold, to learn domain-invariant features by integrating the infinite number of subspaces lying on the geodesic flow. Zhang et al. [12] further explored joint geometrical and statistical alignment (JGSA) to match the feature distributions. Generally speaking, whether statistical or geometrical, these methods reduce the feature distribution divergence between the two domains, but they still cannot remove it. Unlike these methods, this paper not only conducts feature adaptation to mitigate feature distortion but also further adapts the classifier by effectively aligning the label distributions.

Some models were recently developed to devise an adaptive classifier directly based on the original features. Yang et al. [33] proposed the extreme learning machine based domain adaptation (EDA) method, which established manifold regularization to leverage an adaptive classifier. Long et al. [20] utilized adaptation regularization based transfer learning (ARTL) to learn an adaptive classifier based on structural risk minimization with inter-domain label distribution alignment. Cao et al. [30] developed distribution matching machine (DMM) to exploit a domain-transfer support vector machine by using distribution matching and data instance weighting regularization. To mitigate the influence of feature distortion in the original data space, Wang et al. [11] constructed manifold embedded distribution alignment (MEDA), which aligns the inter-domain feature distributions and then leverages an adaptive classifier in this feature space. However, the intra-domain information might be lost during the DA process, i.e., the source data samples from different classes probably become closer than before. To address this problem, we propose inter-class distribution alienation to preserve the source label distribution discrepancy during the classifier adaptation process.

The proposed IDAME approach is derived from ARTL and MEDA, but is distinctly different from them. First, IDAME obtains the initial target pseudo labels by using the distances on the Grassmann manifold so that the geometrical properties can be desirably passed down in the subsequent iteration process. In addition, IDAME preserves the source label distribution discrepancy in case the final classification performance is degraded during the process of inter-domain label distribution alignment.

3 Proposed approach

In this section, some notations and the problem statement are introduced. Then, we revisit several essential formulas related to our approach. Finally, the proposed IDAME approach is presented in detail.

3.1 Notations and the problem statement

For clarity, notations frequently used in this paper are shown in Table 1.

Table 1
Notations and their descriptions

Notation	Description	Notation	Description
$D_{s}$ / $D_{t}$	source/target domain	w/α	classifier parameters
n_s/n_t	#source/target samples	$X_{s}$ / $X_{t}$	source/target feature space
m/C	#original features/classes	$Y_{s}$ / $Y_{t}$	source/target label space
d/p	#subspace dimension/#nearest neighbors	X_s/X_t	source/target data matrix
T	#iterations	Y_s/Y_t	source/target label matrix
σ	shrinkage regularization parameter	K	kernel matrix
γ	manifold regularization parameter	M	MMD matrix
λ	MMD regularization parameter	G	GFK matrix
β	trade-off parameter	E	label indicator matrix
P/Q	marginal/conditional distribution	L	graph Laplacian matrix

Domain adaptation (DA): Given a labeled source domain $D_{s} = \{(x_{1}, y_{1}), \dots, (x_{n_{s}}, y_{n_{s}})\}$ and an unlabeled target domain $D_{t} = \{x_{n_{s} + 1}, \dots, x_{n_{s} + n_{t}}\}$, it is assumed that the source and target feature and label spaces are identical, i.e., $X_{s} = X_{t}$, $Y_{s} = Y_{t}$, and that the domains follow different marginal and conditional distributions, i.e., $P(x_{s}) \neq P(x_{t})$, $Q(y_{s} | x_{s}) \neq Q(y_{t} | x_{t})$. DA aims to train a robust classifier $f : x_{s} \mapsto y_{s}$ on $D_{s}$ that can be directly applied to $D_{t}$ with low expected error, where $x_{s} \in X_{s}$, $x_{t} \in X_{t}$, $y_{s} \in Y_{s}$, and $y_{t} \in Y_{t}$.

3.2 Related formulas revisited

For convenience, we first revisit the technical core of relevant methods as follows.

Geodesic flow kernel (GFK): Under the assumption that the source and target data can be embedded in a linear subspace with lower dimensions, the GFK is proposed accordingly [2]. Let $S_{1}, S_{2} \in ℝ^{m \times d}$ denote the two generative subspaces of the source and target domains, which can be obtained by principal component analysis. Considering $S_{1}$ and $S_{2}$ as two points on the Grassmann manifold $G^{m \times d}$, the geodesic flow is expressed as $Ψ : t \in [0, 1] \to Ψ(t) \in G^{m \times d}$ with $Ψ(0) = S_{1}$ and $Ψ(1) = S_{2}$. Using the geodesic flow, any $x_{i}, x_{j} \in ℝ^{m}$ can be projected to an infinite-dimensional space, i.e., $z_{i}^{\infty}, z_{j}^{\infty}$. For their inner product, we have $\langle z_{i}^{\infty}, z_{j}^{\infty} \rangle = \int_{0}^{1} (Ψ(t)^{T} x_{i})^{T} (Ψ(t)^{T} x_{j}) \, dt = x_{i}^{T} G x_{j},$ (1) where $G \in ℝ^{m \times m}$ is a positive semidefinite matrix as follows $G = \int_{0}^{1} Ψ(t) Ψ(t)^{T} \, dt .$ (2)

According to [2], the geodesic distance between any two data samples on the Grassmann manifold can be computed by Equation (1), which can be utilized to label data samples that comply with the distance similarity. In addition, utilizing the geodesic distance, manifold features can be obtained to achieve feature distribution alignment. Specifically, as mentioned in [11], the manifold features can be learned with $z = \sqrt{G} x$ . Here, $\sqrt{G}$ is the square root of G, which can be obtained by using the Denman-Beavers method [6].
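The manifold mapping $z = \sqrt{G} x$ can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the iteration budget, and the small ridge `eps` added to keep $G$ invertible (the Denman-Beavers iteration inverts its iterates, while $G$ is only guaranteed to be positive semidefinite) are all assumptions. Columns of `X` are samples, matching $\sqrt{G} X$.

```python
import numpy as np

def sqrtm_denman_beavers(A, num_iters=50, tol=1e-10):
    """Approximate the principal square root of a nonsingular matrix A."""
    Y = np.array(A, dtype=float)      # Y_k converges to sqrt(A)
    Z = np.eye(Y.shape[0])            # Z_k converges to inverse sqrt(A)
    for _ in range(num_iters):
        Y_next = 0.5 * (Y + np.linalg.inv(Z))
        Z_next = 0.5 * (Z + np.linalg.inv(Y))
        converged = np.linalg.norm(Y_next - Y, ord="fro") < tol
        Y, Z = Y_next, Z_next
        if converged:
            break
    return Y

def manifold_features(G, X, eps=1e-6):
    """Map the columns of X (m x n) to z = sqrt(G) x."""
    m = G.shape[0]
    G_reg = G + eps * np.eye(m)       # assumption: ridge so that G is invertible
    return sqrtm_denman_beavers(G_reg) @ X
```

The ridge slightly perturbs $G$; an eigendecomposition-based square root would avoid it, at the cost of departing from the method cited in the text.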

Structural risk function: Given a Hilbert space $H$ and a kernel function K corresponding to its mapping $φ : X \to H$, the structural risk function is formulated as follows $\sum_{i = 1}^{n_{s}} (y_{i} - f(x_{i}))^{2} + σ \| f \|_{K}^{2},$ (3)

where $f \in H_{K}$ and $H_{K}$ is a classifier set in the kernel space. In addition, $y_{i}$ is the label of $x_{i}$, $\| \cdot \|_{K}^{2}$ is the squared norm, and σ is the positive regularization parameter. As expected, minimizing Equation (3) can exploit a classifier with low expected error.
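Taken on its own, Equation (3) is the kernel ridge regression objective: writing $f(x) = \sum_{i} α_{i} K(x_{i}, x)$ over the labeled samples, one minimizer is $α = (K + σ I)^{-1} y$. The sketch below illustrates this special case only; the RBF kernel form, the `bandwidth` parameter, and the function names are assumptions made for the example. Rows of the input arrays are samples.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    """RBF kernel between the rows of A (n_a x m) and the rows of B (n_b x m)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * bandwidth ** 2))

def fit_kernel_ridge(X, y, sigma=0.1, bandwidth=1.0):
    """Minimize sum_i (y_i - f(x_i))^2 + sigma * ||f||_K^2 in closed form."""
    K = rbf_kernel(X, X, bandwidth)
    return np.linalg.solve(K + sigma * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, bandwidth=1.0):
    """Evaluate f(x) = sum_i alpha_i K(x_i, x) at the rows of X_new."""
    return rbf_kernel(X_new, X_train, bandwidth) @ alpha
```

A larger σ shrinks the coefficients and smooths the fit, while σ close to zero interpolates the training labels.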

Inter-domain distribution alignment: With the distribution distance metric of empirical maximum mean discrepancy (MMD), inter-domain distribution alignment involves minimizing the formula as follows $D_{f, K}(D_{s}, D_{t}) = \sum_{c = 0}^{C} \left\| \frac{1}{n_{s}^{(c)}} \sum_{x_{i} \in D_{s}^{(c)}} f(x_{i}) - \frac{1}{n_{t}^{(c)}} \sum_{x_{j} \in D_{t}^{(c)}} f(x_{j}) \right\|_{H}^{2},$ (4) where $f \in H_{K}$, $D_{s}^{(c)} = \{x_{i} : x_{i} \in D_{s} \land y_{i} = c\}$ and $D_{t}^{(c)} = \{x_{j} : x_{j} \in D_{t} \land \bar{y}_{j} = c\}$ for $c \neq 0$, $D_{s}^{(0)} = D_{s}$, $D_{t}^{(0)} = D_{t}$, $n_{s}^{(c)} = | D_{s}^{(c)} |$, $n_{t}^{(c)} = | D_{t}^{(c)} |$, and $\bar{y}_{j}$ is the target pseudo label.
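As a concrete reading of Equation (4), the distance can be computed directly from classifier outputs. The sketch below assumes the outputs are stored row-wise in NumPy arrays and that labels are integers in 1..C; skipping classes that are empty in either domain is an assumption of this example, not something the text specifies.

```python
import numpy as np

def label_mmd(F_s, y_s, F_t, y_t_pseudo, num_classes):
    """Empirical MMD of Equation (4) on classifier outputs.

    F_s, F_t: outputs f(x) with shapes (n_s, k) and (n_t, k).
    y_s, y_t_pseudo: integer labels / pseudo labels in 1..num_classes.
    """
    # c = 0: marginal term over all source and target samples
    total = float(np.sum((F_s.mean(axis=0) - F_t.mean(axis=0)) ** 2))
    # c = 1..C: class-conditional terms
    for c in range(1, num_classes + 1):
        s_mask, t_mask = (y_s == c), (y_t_pseudo == c)
        if s_mask.any() and t_mask.any():   # assumption: skip empty classes
            diff = F_s[s_mask].mean(axis=0) - F_t[t_mask].mean(axis=0)
            total += float(np.sum(diff ** 2))
    return total
```

If every target output is shifted by the same vector v and the pseudo labels match the source labels, each of the C + 1 terms contributes the squared norm of v.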

Laplacian regularization: Laplacian regularization can preserve the intrinsic geometric characteristics of the p-nearest points ($N_{p}(\cdot)$), and can be formulated as follows $M_{f, K}(D_{s}, D_{t}) = \sum_{i, j = 1}^{n_{s} + n_{t}} (f(x_{i}) - f(x_{j}))^{2} W_{ij},$ (5) where $W_{ij}$ is the $(i, j)$-th element of a similarity matrix W.
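The similarity matrix W and the normalized graph Laplacian used later (Equation (13) and L = I - D^-1/2WD^-1/2) might be built as sketched below. The text does not state which metric defines the p nearest neighbors; ranking neighbors by cosine similarity is an assumption here, as are the function name and the guards against division by zero. Rows of `Z` are samples, and the features are assumed non-negative so that all weights and degrees are non-negative.

```python
import numpy as np

def normalized_laplacian(Z, p=10):
    """Return (W, L) for samples in the rows of Z (n x m)."""
    n = Z.shape[0]
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    Zn = Z / np.maximum(norms, 1e-12)
    S = Zn @ Zn.T                          # pairwise cosine similarity
    np.fill_diagonal(S, -np.inf)           # never pick a sample as its own neighbor
    nn_idx = np.argsort(-S, axis=1)[:, :p]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), nn_idx.shape[1]), nn_idx.ravel()] = True
    mask = mask | mask.T                   # z_i in N_p(z_j) or z_j in N_p(z_i)
    np.fill_diagonal(S, 0.0)
    W = np.where(mask, S, 0.0)
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return W, L
```

With non-negative weights the resulting L is symmetric and positive semidefinite, so the regularizer $tr (α^{T} KLK α)$ is non-negative.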

3.3 Proposed approach

To address the shortcomings of both the feature and classifier adaptation DA methods, we propose a two-stage IDAME approach.

In the first stage, IDAME finds the manifold features on the Grassmann manifold by using the GFK approach to mitigate the influence of feature distortion based on the original features. Specifically, as mentioned before, the manifold features can be learned with $z = \sqrt{G} x$ , whose feature distribution difference is minimized. In addition, the target pseudo labels can be obtained using the distances on the Grassmann manifold, which can be applied to the process of the following adaptive classifier learning.
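One plausible way to obtain the initial pseudo labels is a 1-nearest-neighbor rule under the distance induced by the kernel in Equation (1), i.e., $(x_{t} - x_{s})^{T} G (x_{t} - x_{s})$. The text only states that distances on the Grassmann manifold are used, so the 1-NN rule, the function name, and the row-wise data layout below are assumptions of this sketch; G is assumed symmetric.

```python
import numpy as np

def initial_pseudo_labels(G, X_s, y_s, X_t):
    """Assign each target sample the label of its nearest source sample.

    X_s: (n_s, m), X_t: (n_t, m), rows are samples; G: (m, m) symmetric PSD.
    Squared distance: (x_t - x_s)^T G (x_t - x_s).
    """
    XsG = X_s @ G
    XtG = X_t @ G
    d_ss = np.einsum("ij,ij->i", XsG, X_s)      # x_s^T G x_s for every source sample
    d_tt = np.einsum("ij,ij->i", XtG, X_t)      # x_t^T G x_t for every target sample
    cross = XtG @ X_s.T                         # x_t^T G x_s for every pair
    dist2 = d_tt[:, None] + d_ss[None, :] - 2 * cross
    return y_s[np.argmin(dist2, axis=1)]
```

With G equal to the identity this reduces to ordinary Euclidean 1-NN, which makes the role of G as a learned metric easy to see.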

In the second stage, an adaptive classifier f can be modeled in the manifold feature space by minimizing the structural risk function (Equation (3)) with Laplacian regularization, which can preserve the geometrical property of the data manifold (Equation (5)). To further achieve domain adaptation and prevent intra-domain information degradation during the DA process, we reduce the inter-domain label distribution discrepancy by using Equation (4) and preserve the intra-domain label distribution difference by using inter-class distribution alienation (Equation (6)). Next, we describe the inter-class distribution alienation.

Inter-class distribution alienation: To preserve the intra-domain label distribution difference, we adopt the strategy of inter-class distribution alienation, which can be achieved by maximizing the inter-class label distribution distance in the source domain, and we can also utilize the empirical MMD to measure this distance. The inter-class label distribution distance can be calculated as $\bar{D}_{f, K}(D_{s}) = \sum_{c = 1}^{C} \left\| \frac{1}{n_{s}^{(c)}} \sum_{x_{i} \in D_{s}^{(c)}} f(x_{i}) - \frac{1}{n_{s}^{(\bar{c})}} \sum_{x_{j} \in D_{s}^{(\bar{c})}} f(x_{j}) \right\|_{H}^{2},$ (6) where $D_{s}^{(\bar{c})} = \{x_{i} : x_{i} \in D_{s} \land y_{i} \neq c\}$ and $n_{s}^{(\bar{c})} = | D_{s}^{(\bar{c})} |$.
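Equation (6) compares, for each class, the mean output of that class with the mean output of all other source samples. A direct computation could look like the sketch below; the function name and the row-wise layout of the outputs are assumptions, and classes that are empty or cover the whole source set are skipped for illustration.

```python
import numpy as np

def inter_class_distance(F_s, y_s, num_classes):
    """Inter-class label distribution distance of Equation (6).

    F_s: source outputs f(x), shape (n_s, k); y_s: integer labels in 1..num_classes.
    """
    total = 0.0
    for c in range(1, num_classes + 1):
        in_c = (y_s == c)
        if in_c.any() and (~in_c).any():    # assumption: skip degenerate classes
            diff = F_s[in_c].mean(axis=0) - F_s[~in_c].mean(axis=0)
            total += float(np.sum(diff ** 2))
    return total
```

Well-separated one-hot outputs give a large value, whereas outputs that collapse to a single point give zero, which is the degradation the alienation term is meant to counteract.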

Therefore, in the second stage, IDAME utilizes the manifold features to learn an adaptive classifier, which can be formulated as follows $f = \underset{f \in H_{K}}{\arg\min} \sum_{i = 1}^{n_{s}} (y_{i} - f(z_{i}))^{2} + σ \| f \|_{K}^{2} + λ \left( D_{f, K}(D_{s}, D_{t}) - β \bar{D}_{f, K}(D_{s}) \right) + γ M_{f, K}(D_{s}, D_{t}),$ (7) where $f \in H_{K}$, and γ, λ and β are positive regularization parameters. Using the representer theorem [4], we obtain $f(z) = \sum_{i = 1}^{n_{s} + n_{t}} α_{i} K(z_{i}, z) .$ (8)

Then we incorporate Equation (8) into Equation (3) using the manifold features to obtain $\sum_{i = 1}^{n_{s}} (y_{i} - f(z_{i}))^{2} + σ \| f \|_{K}^{2} = \| (Y - α^{T} K) E \|_{F}^{2} + σ \, tr(α^{T} K α),$ (9) where $Y = [y_{1}, \cdots, y_{n_{s} + n_{t}}]$ is the label matrix, the element of $K \in ℝ^{(n_{s} + n_{t}) \times (n_{s} + n_{t})}$ is $(K)_{ij} = K(z_{i}, z_{j})$, $α = [α_{1}, \cdots, α_{n_{s} + n_{t}}]$, $E = diag(E_{1}, E_{2}, \cdots, E_{n_{s} + n_{t}})$, and $E_{i} = 1$ if $1 ⩽ i ⩽ n_{s}$; otherwise, $E_{i} = 0$. The unknown labels in the target domain can be filtered out by E.

Using the manifold features, Equation (4) can be rewritten as follows $D_{f, K}(D_{s}, D_{t}) = \sum_{c = 0}^{C} tr(α^{T} K M_{c} K α) = tr(α^{T} K M K α),$ (10) where $(M_{c})_{ij} = \begin{cases} \frac{1}{n_{s}^{(c)} n_{s}^{(c)}}, & z_{i}, z_{j} \in D_{s}^{(c)} \\ \frac{1}{n_{t}^{(c)} n_{t}^{(c)}}, & z_{i}, z_{j} \in D_{t}^{(c)} \\ \frac{-1}{n_{s}^{(c)} n_{t}^{(c)}}, & z_{i} \in D_{s}^{(c)} \land z_{j} \in D_{t}^{(c)} \ \text{or} \ z_{i} \in D_{t}^{(c)} \land z_{j} \in D_{s}^{(c)} \\ 0, & \text{otherwise} \end{cases}$ (11) and $M = \sum_{c = 0}^{C} M_{c} \in ℝ^{(n_{s} + n_{t}) \times (n_{s} + n_{t})}$.

For Equation (5), we get $M_{f, K}(D_{s}, D_{t}) = tr(α^{T} K L K α),$ (12) where L is calculated as $L = I - D^{-1/2} W D^{-1/2}$ and $I \in ℝ^{(n_{s} + n_{t}) \times (n_{s} + n_{t})}$ is the identity matrix. W is defined as $W_{ij} = \begin{cases} \cos(z_{i}, z_{j}), & z_{i} \in N_{p}(z_{j}) \lor z_{j} \in N_{p}(z_{i}) \\ 0, & \text{otherwise} \end{cases}$ (13) and $D = diag(D_{1}, D_{2}, \cdots, D_{n_{s} + n_{t}})$ with $D_{i} = \sum_{j = 1}^{n_{s} + n_{t}} W_{ij}$.

Similarly, with respect to Equation (6), we have $\bar{D}_{f, K}(D_{s}) = \sum_{c = 1}^{C} tr(α^{T} K \bar{M}_{c} K α) = tr(α^{T} K \bar{M} K α),$ (14) where $(\bar{M}_{c})_{ij} = \begin{cases} \frac{1}{n_{s}^{(c)} n_{s}^{(c)}}, & z_{i}, z_{j} \in D_{s}^{(c)} \\ \frac{1}{n_{s}^{(\bar{c})} n_{s}^{(\bar{c})}}, & z_{i}, z_{j} \in D_{s}^{(\bar{c})} \\ \frac{-1}{n_{s}^{(c)} n_{s}^{(\bar{c})}}, & z_{i} \in D_{s}^{(c)} \land z_{j} \in D_{s}^{(\bar{c})} \ \text{or} \ z_{i} \in D_{s}^{(\bar{c})} \land z_{j} \in D_{s}^{(c)} \\ 0, & \text{otherwise} \end{cases}$ (15) and $\bar{M} = \sum_{c = 1}^{C} \bar{M}_{c} \in ℝ^{(n_{s} + n_{t}) \times (n_{s} + n_{t})}$.
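Each matrix in Equations (11) and (15) is a rank-one outer product of an indicator vector that holds $1/n$ on one group and $-1/n'$ on the other, which gives a compact way to assemble $M$ and $\bar{M}$. The sketch below relies on that observation; the helper names are hypothetical, source samples are assumed to occupy the first $n_s$ indices, labels are integer NumPy arrays in 1..C, and empty groups are skipped.

```python
import numpy as np

def mmd_block(idx_a, idx_b, n):
    """Outer product e e^T with e = 1/|a| on idx_a and -1/|b| on idx_b."""
    e = np.zeros(n)
    e[idx_a] = 1.0 / len(idx_a)
    e[idx_b] = -1.0 / len(idx_b)
    return np.outer(e, e)

def build_mmd_matrices(y_s, y_t_pseudo, num_classes):
    """Return (M, M_bar) following Equations (11) and (15)."""
    n_s, n_t = len(y_s), len(y_t_pseudo)
    n = n_s + n_t
    src = np.arange(n_s)                 # source indices come first
    tgt = np.arange(n_s, n)              # target indices follow
    M = mmd_block(src, tgt, n)           # c = 0: marginal term M_0
    M_bar = np.zeros((n, n))
    for c in range(1, num_classes + 1):
        s_c = src[y_s == c]
        t_c = tgt[y_t_pseudo == c]
        s_not_c = src[y_s != c]
        if len(s_c) and len(t_c):        # inter-domain term M_c
            M += mmd_block(s_c, t_c, n)
        if len(s_c) and len(s_not_c):    # inter-class term, source samples only
            M_bar += mmd_block(s_c, s_not_c, n)
    return M, M_bar
```

For stacked outputs $F = K α$, the traces $tr(F^{T} M F)$ and $tr(F^{T} \bar{M} F)$ reproduce the sums of squared mean differences in Equations (4) and (6).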

Finally, we substitute Equations (9), (10), (12) and (14) into Equation (7) to obtain $f = \underset{f \in H_{K}}{\arg\min} \| (Y - α^{T} K) E \|_{F}^{2} + σ \, tr(α^{T} K α) + tr\left( α^{T} K (λ (M - β \bar{M}) + γ L) K α \right) .$ (16)

Setting the derivative of the objective in Equation (16) with respect to α to zero, we can obtain the optimal parameters of the classifier by $α = \left( (E + λ (M - β \bar{M}) + γ L) K + σ I \right)^{-1} E Y^{T} .$ (17)

Note that ${(K)}_{ij} = K (z_{i}, z_{j}) = K (\sqrt{G} x_{i}, \sqrt{G} x_{j})$ since the manifold features are utilized here for feature distribution alignment.
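The closed-form update of Equation (17) amounts to one linear solve. The sketch below assumes that the kernel, MMD and Laplacian matrices have already been computed, that Y is stored as a C x n one-hot matrix with source samples first, and that the default parameter values follow Section 4.2; the function name is hypothetical.

```python
import numpy as np

def solve_alpha(K, Y, n_s, M, M_bar, L, lam=10.0, beta=0.1, gamma=0.1, sigma=0.1):
    """Solve Equation (17) for the coefficient matrix alpha.

    K: (n, n) kernel matrix on the manifold features.
    Y: (C, n) one-hot label matrix; target columns are masked out by E.
    Returns alpha with shape (n, C).
    """
    n = K.shape[0]
    E = np.diag((np.arange(n) < n_s).astype(float))   # 1 for source, 0 for target
    A = (E + lam * (M - beta * M_bar) + gamma * L) @ K + sigma * np.eye(n)
    return np.linalg.solve(A, E @ Y.T)
```

Class scores for all samples are then given by `K @ alpha`, following Equation (8), and the index of the largest score in each row (plus one, for labels in 1..C) yields the predicted label.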

Algorithm 1: Inter-class distribution alienation and inter-domain distribution alignment based on manifold embedding
Input: Data matrices X_s, X_t, Y_s; parameters d, p, λ, β, γ; iterations T.
Output: Adaptive classifier f.
1: Calculate the GFK matrix G by solving Equation (2), and predict the pseudo target labels $\bar{Y}_{t}$ using the distances on the Grassmann manifold.
2: Obtain the manifold features $[Z_{s}, Z_{t}] = [\sqrt{G} X_{s}, \sqrt{G} X_{t}]$.
3: Construct kernel matrix K based on [Z_s, Z_t], calculate graph affinity matrix W by Equation (13), and calculate MMD matrices $\{M_{c}\}_{c = 0}^{C}$ and $\{\bar{M}_{c}\}_{c = 1}^{C}$ by using Equations (11) and (15), respectively.
4: Compute MMD matrices $M = \sum_{c = 0}^{C} M_{c}$ and $\bar{M} = \sum_{c = 1}^{C} \bar{M}_{c}$ and the graph Laplacian matrix $L = I - D^{-1/2} W D^{-1/2}$.
5: While not converged or t ⩽ T do
6: Compute α by using Equation (17), and obtain the adaptive classifier f by using Equation (8).
7: Update the pseudo target labels by $\bar{Y}_{t} = f(Z_{t})$.
8: Update $\{M_{c}\}_{c = 1}^{C}$ and M.
9: t = t + 1.
10: End while
11: Return Classifier f.

By substituting α into Equation (8), we obtain the adaptive classifier f. In summary, IDAME is presented in Algorithm 1.

4 Experiments

This section conducts comprehensive experiments on real-world object and handwritten digit recognition datasets to compare the proposed IDAME approach with several state-of-the-art methods and evaluate the performance of our approach.

4.1 Dataset descriptions

We use the benchmark datasets Office 10 + Caltech 10 [2, 7] (Amazon, DSLR, Webcam, Caltech) and USPS+MNIST [18] (USPS, MNIST), which are widely utilized to demonstrate the effectiveness of domain adaptation (DA) algorithms. Descriptions of these datasets are presented in Table 2.

Table 2
Dataset descriptions

Dataset	#Samples	#Classes	#Features	Type
Amazon	958	10	800/4096	Object
DSLR	157	10	800/4096	Object
Webcam	295	10	800/4096	Object
Caltech	1123	10	800/4096	Object
USPS	1800	10	256	Digit
MNIST	2000	10	256	Digit

Office 10 + Caltech 10 contains object images from 4 domains: Amazon (A), Webcam (W), DSLR (D) and Caltech 10 (C). Accordingly, 12 DA tasks can be constructed with different directions, i.e., C⟶A, C⟶W, C⟶D, etc. For example, in C⟶W, C indicates the source domain, while W indicates the target domain. In addition, 10 categories are shared among the 4 domains, and we adopt two feature types, i.e., 800-dimensional SURF features [2] and 4096-dimensional Decaf6 deep features [7].

USPS+MNIST contains handwritten digit images from 2 different domains, i.e., USPS [9] and MNIST [31], both of which share 10 categories. Following [18, 19], we randomly sample 1800 examples from USPS and 2000 examples from MNIST and establish 2 DA tasks, i.e., USPS⟶MNIST (U⟶M) and MNIST⟶USPS (M⟶U).

4.2 Experimental setting

We compare our approach with 8 baseline methods: 1-nearest neighbor (1-NN) [13], TCA [26], GFK [2], JDA [18], ARTL [20], MEDA [11], JGSA [12], and TJM [19].

Since the target data instances are totally unlabeled, we estimate the optimal parameters by using the empirical searching strategy instead of cross validation. In the comparative study, we also utilize this strategy to evaluate the baseline methods, and their best results are reported accordingly. Specifically, the parameters in this paper are set as T = 10, p = 10, λ = 10, σ = 0.1, and d = 20. In addition, the manifold regularization parameter is set as γ = 0.1, 1, and 8 for the Office 10 + Caltech 10 (SURF features), Office 10 + Caltech 10 (Decaf6 features) and USPS+MNIST datasets, respectively. For the kernel function, the radial basis function (RBF) kernel is adopted in the structural risk minimization loss, and the trade-off parameter is set as β = 0.1.

We adopt classification accuracy, which is widely utilized in the literature [2, 26], as the evaluation metric to compare our approach with the aforementioned methods. Specifically, the classification accuracy is formulated as follows $Accuracy = \frac{| \{x : x \in D_{t} \land f(x) = y(x)\} |}{| \{x : x \in D_{t}\} |},$

where f (x) is the predicted label and y (x) is the ground truth label.
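In code, this metric is simply the fraction of target samples whose predicted label equals the ground truth label. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def accuracy(y_pred, y_true):
    """Fraction of target samples with f(x) equal to y(x)."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    return float(np.mean(y_pred == y_true))
```

Multiplying the result by 100 gives the percentages reported in the result tables.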

4.3 Experimental results

The results of IDAME and the comparative methods are shown in Tables 3, 4 and 5. As shown, the overall average classification accuracy of IDAME is higher than that of all baseline methods. The 1NN method trains the classifier on the source domain to directly predict the target labels based on their original features without the DA process. Consequently, 1NN performs worst on almost all DA tasks (M⟶U is an exception, where it exceeds TCA, GFK and TJM), thereby indicating that the DA technique is essential to the cross-domain problem. TCA and GFK align the marginal feature distributions statistically and geometrically, respectively; thus, they generally behave better than the non-DA method 1NN. Based on TCA, JDA and TJM consider the conditional feature distribution alignment and the source instance re-weighting strategy, respectively, which raises their accuracies on many tasks. JGSA aims to match the feature distributions both statistically and geometrically, thereby further boosting DA performance. However, all 5 DA methods pertain to feature adaptation strategies, and recent studies state that this kind of strategy can only reduce, rather than remove, the inter-domain distribution difference. ARTL instead learns an adaptive classifier by minimizing the label distribution discrepancy, but it ignores the feature distribution shift, which easily causes the feature distortion problem. As such, ARTL does not outperform JGSA on all DA tasks. Unlike these methods, the proposed IDAME aligns the inter-domain feature distribution while the adaptive classifier is exploited. Therefore, IDAME outperforms these 6 methods on most DA tasks (20/26 tasks).

Table 3
Accuracy (%) on Office 10 + Caltech 10 (SURF features)

Tasks	1NN	TCA	GFK	JDA	TJM	JGSA	ARTL	MEDA	IDAME
C⟶A	23.70	43.42	41.02	44.78	46.76	51.46	44.05	56.47	58.14
C⟶W	25.76	37.29	40.68	41.69	38.98	45.42	31.53	53.90	53.56
C⟶D	25.48	43.95	41.40	45.22	44.59	45.86	39.49	50.32	56.69
A⟶C	26.00	38.20	40.25	39.36	39.45	41.50	36.06	43.90	48.17
A⟶W	29.83	37.97	40.00	37.97	42.03	45.76	33.56	53.22	51.19
A⟶D	25.48	30.57	36.31	39.49	45.22	47.13	36.94	45.86	40.13
W⟶C	19.86	29.74	30.72	31.17	30.19	33.21	29.74	34.19	33.21
W⟶A	22.96	32.25	31.84	32.78	29.96	39.87	38.31	42.69	43.84
W⟶D	59.24	85.35	87.90	89.17	89.17	90.45	87.90	88.54	90.45
D⟶C	26.27	30.90	30.10	31.52	31.43	29.92	30.45	32.59	34.91
D⟶A	28.50	29.33	32.05	33.09	32.78	38.00	34.86	41.13	49.06
D⟶W	63.39	84.75	84.41	89.49	85.42	91.86	88.47	87.46	88.81
average	31.37	43.64	44.72	46.31	46.33	50.04	44.28	52.52	54.01

Table 4

Accuracy (%) on USPS+MNIST

Tasks	1NN	TCA	GFK	JDA	TJM	JGSA	ARTL	MEDA	IDAME
U⟶M	44.70	52.90	49.95	59.65	51.25	68.15	67.70	77.70	78.15
M⟶U	65.94	57.50	65.33	67.28	63.00	80.44	88.78	89.39	89.44
average	55.32	55.20	57.64	63.47	57.13	74.30	78.24	83.55	83.79

Table 5

Accuracy (%) on Office 10 + Caltech 10 (Decaf6 features)

Tasks	1NN	TCA	GFK	JDA	TJM	JGSA	ARTL	MEDA	IDAME
C⟶A	87.27	90.19	88.20	90.29	89.04	91.44	92.38	93.42	93.42
C⟶W	72.54	76.95	77.63	85.08	76.95	86.78	87.80	95.59	95.59
C⟶D	79.62	85.35	86.62	89.17	85.35	93.63	86.62	91.08	91.08
A⟶C	71.68	82.72	79.16	83.97	80.14	84.86	87.44	87.36	87.53
A⟶W	68.14	74.58	70.85	78.64	75.25	81.02	88.47	88.14	88.14
A⟶D	73.89	80.25	82.17	80.89	84.71	88.54	85.35	87.90	91.72
W⟶C	55.30	79.88	69.72	84.15	78.01	84.95	88.16	88.07	87.80
W⟶A	62.53	84.45	76.83	90.08	84.86	90.71	92.28	93.22	93.22
W⟶D	98.09	100.00	100.00	100.00	100.00	100.00	100.00	99.36	99.36
D⟶C	42.03	82.46	71.42	85.04	77.38	86.20	87.27	87.53	88.07
D⟶A	49.90	88.20	76.30	91.02	87.37	91.96	92.69	93.22	93.22
D⟶W	91.53	99.66	99.32	100.00	98.64	99.66	100.00	97.63	97.29
average	71.04	85.39	81.52	88.19	84.81	89.98	90.71	91.88	92.20

Compared with the best baseline method, MEDA, which also incorporates feature adaptation into adaptive classifier learning, IDAME obtains the initial target pseudo labels from the distances on the Grassmann manifold. Moreover, we propose the regularization of inter-class distribution alienation to preserve the label distribution discrepancy of the source domain so that the negative impact on classifier learning is largely mitigated. As a result, the accuracies of IDAME match or exceed those of MEDA on most DA tasks. In particular, the average accuracy of our approach on the Office 10 + Caltech 10 dataset with SURF features improves by about 1.5 percentage points over MEDA (54.01% versus 52.52%), which supports viewing IDAME as an effective extension of MEDA.

4.4 Ablation study

To verify the soundness of our approach, we first inspect the inter-class label distribution distance on different DA tasks. As shown in Fig. 1, the original inter-class label distribution distances without DA (yellow bars) are the largest among the three settings. As discussed before, the inter-class label distribution distances shrink during the DA process; accordingly, they decline sharply under MEDA (blue bars), which accords with our motivation. For this reason, this paper proposes to maximize the inter-class label distribution distance so as to respect the intra-domain label distribution discrepancy. Consequently, the decline in the inter-class label distribution distances is smaller under IDAME than under MEDA, which verifies the effectiveness of our approach.

Fig. 1

Inter-class label distribution distance of Ours, MEDA and Original.
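To make the quantity in Fig. 1 concrete, the sketch below computes one possible inter-class distance: the sum, over all pairs of classes, of the squared Euclidean distance between the class-wise mean classifier outputs. This corresponds to an empirical MMD with a linear kernel and is an assumption made for illustration; the exact definition used by IDAME is the one given in the method section. The function names `class_means` and `inter_class_distance` are hypothetical.

```python
from itertools import combinations


def class_means(outputs, labels):
    """Return {label: mean output vector} over the samples of each class."""
    sums, counts = {}, {}
    for vec, lab in zip(outputs, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def inter_class_distance(outputs, labels):
    """Sum of squared distances between class means (linear-kernel MMD, assumed)."""
    means = class_means(outputs, labels)
    total = 0.0
    for a, b in combinations(sorted(means), 2):
        total += sum((x - y) ** 2 for x, y in zip(means[a], means[b]))
    return total


# Well-separated outputs give a large distance ...
separated = inter_class_distance([[1, 0], [1, 0], [0, 1], [0, 1]], [0, 0, 1, 1])
# ... while outputs that have collapsed together give zero.
collapsed = inter_class_distance([[0.5, 0.5]] * 4, [0, 0, 1, 1])
print(separated, collapsed)
```

Under this simplified measure, a shrinking value signals that the classes are drifting closer, which is the effect the alienation term is designed to counteract.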

We also conduct an ablation study on the two components of IDAME: the strategy to initialize the target pseudo labels, which are computed by the distances on the Grassmann manifold (Our-I), and the regularization term of inter-class distribution alienation (Our-II). Due to space limitations, we run MEDA, Our-I and Our-II on several selected DA tasks only. As shown in Fig. 2, both Our-I and Our-II outperform MEDA. Specifically, Our-II outperforms MEDA by 0.8% on average, which verifies that the regularization term of inter-class distribution alienation helps to preserve the intra-domain label distribution discrepancy. Moreover, the average classification accuracy of Our-I is 1.7% higher than that of MEDA, which shows that the initial target pseudo labels obtained on the Grassmann manifold are effective for adaptive classifier learning.

Fig. 2

Classification accuracy of MEDA and two variants of Ours (i.e., Our-I and Our-II).
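The initialization evaluated as Our-I can be illustrated with a small sketch. It assumes that a positive semidefinite kernel matrix `G` on the original features is already available (for example, a geodesic flow kernel computed beforehand) and that each target sample simply takes the label of its nearest source sample under the distance induced by `G`. Both the nearest-neighbor rule and the function names are assumptions made for illustration, not the exact procedure of IDAME.

```python
def bilinear(x, G, z):
    """Compute x^T G z for plain Python lists."""
    return sum(x[i] * G[i][j] * z[j] for i in range(len(x)) for j in range(len(z)))


def kernel_distance_sq(x, z, G):
    """Squared distance induced by G: x^T G x + z^T G z - 2 x^T G z."""
    return bilinear(x, G, x) + bilinear(z, G, z) - 2.0 * bilinear(x, G, z)


def initial_pseudo_labels(source_x, source_y, target_x, G):
    """Label each target sample with the label of its nearest source sample."""
    labels = []
    for z in target_x:
        nearest = min(range(len(source_x)),
                      key=lambda i: kernel_distance_sq(source_x[i], z, G))
        labels.append(source_y[nearest])
    return labels


# Toy data; the identity matrix stands in for a precomputed kernel matrix,
# so the rule reduces to ordinary Euclidean 1-NN here.
G = [[1.0, 0.0], [0.0, 1.0]]
source_x, source_y = [[0.0, 0.0], [1.0, 1.0]], [0, 1]
target_x = [[0.1, 0.2], [0.9, 0.8]]
pseudo = initial_pseudo_labels(source_x, source_y, target_x, G)
print(pseudo)
```

Replacing the identity matrix with a kernel learned on the manifold changes which source sample counts as nearest, and this is where the manifold-based initialization would differ from a plain 1NN baseline.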

4.5 Parameter sensitivity

Because we have verified that IDAME performs well on all DA tasks when we set p = 10, λ = 10, σ = 0.1, and d = 20, we fix these values and discuss the parameter sensitivity of IDAME only with respect to γ, β and T. We report the results on the tasks D⟶A (SURF), U⟶M and D⟶C (Decaf6); the other DA tasks show similar trends and are omitted due to space limitations.

Here, γ is the manifold regularization parameter. The classification accuracy of IDAME for values of γ from 0 to 10 is shown in Fig. 3(a), which indicates that IDAME attains its best results over a wide range of γ. Moreover, on the object dataset, the accuracy of IDAME is still the highest at γ = 0, which indicates that manifold consistency contributes little on the object dataset in our IDAME.

Fig. 3

Parameter sensitivity of IDAME (dashed lines are the best baseline results): (a) Manifold regularization γ. (b) Inter-class distribution alienation regularization β.

In addition, β is the trade-off parameter that balances inter-domain distribution alignment against inter-class distribution alienation. Larger values of β give the inter-class distribution alienation regularization more weight. The accuracy of IDAME for values of β from 2^-10 to 0.3 is illustrated in Fig. 3(b), which indicates that IDAME is robust over a wide range of β (β ∈ [2^-10, 0.25]).
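A sensitivity study of this kind can be organized as a simple sweep. The sketch below builds a power-of-two grid from 2^-10 to 2^-2 = 0.25 and reports which values stay within a tolerance of the best accuracy. The grid spacing, the tolerance, and the function `toy_evaluate` are assumptions for illustration: `toy_evaluate` is a synthetic stand-in and does not reproduce IDAME or the curve in Fig. 3(b).

```python
def sweep(evaluate, grid):
    """Return {parameter value: accuracy} for every value in the grid."""
    return {value: evaluate(value) for value in grid}


def robust_values(results, tolerance):
    """Parameter values whose accuracy is within `tolerance` of the best one."""
    best = max(results.values())
    return [value for value, acc in results.items() if best - acc <= tolerance]


# Power-of-two grid: 2^-10, 2^-9, ..., 2^-2 (= 0.25). The spacing is assumed.
beta_grid = [2.0 ** k for k in range(-10, -1)]


def toy_evaluate(beta):
    # Synthetic placeholder only: a real study would train and test the
    # model with this value of beta and return its accuracy (%).
    return 80.0 - 40.0 * beta


results = sweep(toy_evaluate, beta_grid)
stable = robust_values(results, tolerance=1.0)
print(len(beta_grid), len(stable))
```

In an actual experiment, `toy_evaluate` would be replaced by a function that runs the full training and evaluation pipeline for one DA task.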

Finally, Fig. 4 plots the classification accuracy of our approach over iterations 1 to 20; the curves indicate that IDAME converges within only a few iterations (T < 10).

Fig. 4

Empirical convergence analysis.
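One simple way to read a convergence point off such a curve is to find the first iteration after which the accuracy no longer changes by more than a small tolerance. The sketch below implements this criterion; the tolerance and the example curve are hypothetical values chosen for illustration and are not the measurements behind Fig. 4.

```python
def converged_at(accuracies, tol=0.2):
    """Return the first 1-based iteration from which all successive
    accuracy changes stay within `tol`, or None for an empty curve."""
    for t in range(len(accuracies)):
        tail = accuracies[t:]
        if all(abs(b - a) <= tol for a, b in zip(tail, tail[1:])):
            return t + 1
    return None


# Hypothetical accuracy curve (%), one value per iteration.
curve = [70.0, 75.0, 77.5, 78.0, 78.1, 78.1, 78.15, 78.15]
print(converged_at(curve))
```

For this synthetic curve the criterion is met from the fourth iteration onward.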

5 Conclusion

In this paper, we propose a novel domain adaptation approach referred to as inter-class distribution alienation and inter-domain distribution alignment based on manifold embedding (IDAME). IDAME learns an adaptive classifier on the Grassmann manifold based on structural risk minimization, where the inter-domain feature distributions are matched and the initial target pseudo labels are computed by the distances on the Grassmann manifold. Additionally, not only are the inter-domain label distributions aligned, but the label distribution discrepancy of the source domain is also preserved. We conduct comprehensive experiments on different types of image recognition datasets, and the results demonstrate that IDAME outperforms several comparative state-of-the-art methods. However, the proposed approach still aligns the feature distributions and the label distributions in separate steps, which might limit their mutual promotion to some extent. Moreover, IDAME cannot deal with the negative transfer incurred by irrelevant source data instances. In future work, we will focus on devising a more general DA model to address these limitations.

Funding

Program for Outstanding Young Teachers in Higher Education Institutions of Anhui Province, China (No. gxyq2020103).

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Nos. 91546108, 71901001, 61806068), the Major Special Science and Technology Project of Anhui Province, China (No. 201903a05020020), the Natural Science Foundation of Anhui Province, China (Nos. 1708085MG169, 2008085QA16), and the Social Science Research Project of the Education Department of Anhui Province, China (No. JS2017AJRW0135).

References

1. Fernando, Habrard, Sebban and Tuytelaars, Unsupervised visual domain adaptation using subspace alignment, Proceedings of the IEEE International Conference on Computer Vision, 2013, 2960–2967.

2. Gong, Shi, Sha and Grauman, Geodesic flow kernel for unsupervised domain adaptation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, 2066–2073.

3. Quanz and Huan, Large margin transductive transfer learning, Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009, 1327–1336.

4. Schölkopf, Herbrich and Smola A.J., A generalized representer theorem, Proceedings of the International Conference on Computational Learning Theory, Springer, Berlin, Heidelberg, 2001, 416–426.

5. Lee C.Y., Batra, Baig M.H. and Ulbricht, Sliced Wasserstein discrepancy for unsupervised domain adaptation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, 10285–10295.

6. Denman E.D. and Beavers A.N. Jr., The matrix sign function and computations in systems, Applied Mathematics and Computation 2(1) (1976), 63–94.

7. Donahue, Jia, Vinyals, Hoffman, Zhang, Tzeng and Darrell, Decaf: A deep convolutional activation feature for generic visual recognition, Proceedings of the International Conference on Machine Learning, 2014, 647–655.

8. Huang, Gretton, Borgwardt, Schölkopf and Smola A.J., Correcting sample selection bias by unlabeled data, Advances in Neural Information Processing Systems, 2007, 601–608.

9. Hull J.J., A database for handwritten text recognition research, IEEE Transactions on Pattern Analysis and Machine Intelligence 16(5) (1994), 550–554.

10. Tao, Chung F.L. and Wang, On minimum distribution discrepancy support vector machine for domain adaptation, Pattern Recognition 45(11) (2012), 3962–3984.

11. Wang, Feng, Chen, Yu, Huang and Yu P.S., Visual domain adaptation with manifold embedded distribution alignment, Proceedings of the 26th ACM Multimedia Conference on Multimedia, 2018, 402–410.

12. Zhang, Li and Ogunbona, Joint geometrical and statistical alignment for visual domain adaptation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, 5150–5158.

13. Fukunaga and Narendra P.M., A branch and bound algorithm for computing k-nearest neighbors, IEEE Transactions on Computers 7 (1975), 750–753.

14. Bruzzone and Marconcini, Domain adaptation problems: A DASVM classification technique and a circular validation strategy, IEEE Transactions on Pattern Analysis and Machine Intelligence 32(5) (2009), 770–787.

15. Zhang, Wang, Huang G.B., Zuo, Yang and Zhang, Manifold criterion guided transfer learning via intermediate domain generation, IEEE Transactions on Neural Networks 30(12) (2019), 3759–3773.

16. Baktashmotlagh, Harandi M.T., Lovell B.C. and Salzmann, Unsupervised domain adaptation by domain invariant projection, Proceedings of the IEEE International Conference on Computer Vision, 2013, 769–776.

17. Chen, Weinberger K.Q. and Blitzer J.C., Co-training for domain adaptation, Advances in Neural Information Processing Systems, 2011, 2456–2464.

18. Long, Wang, Ding, Sun and Yu P.S., Transfer feature learning with joint distribution adaptation, Proceedings of the IEEE International Conference on Computer Vision, 2013, 2200–2207.

19. Long, Wang, Ding, Sun and Yu P.S., Transfer joint matching for unsupervised domain adaptation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, 1410–1417.

20. Long, Wang, Ding, Pan S.J. and Philip S.Y., Adaptation regularization: A general framework for transfer learning, IEEE Transactions on Knowledge and Data Engineering 26(5) (2014), 1076–1089.

21. Courty, Flamary, Tuia and Rakotomamonjy, Optimal transport for domain adaptation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(9) (2017), 1853–1865.

22. Zhang, Huang, Zhou, Chen, Shang and Yang, Joint category-level and discriminative feature learning networks for unsupervised domain adaptation, Journal of Intelligent & Fuzzy Systems 37(6) (2019), 8499–8510.

23. Gopalan, Li and Chellappa, Domain adaptation for object recognition: An unsupervised approach, Proceedings of the IEEE International Conference on Computer Vision, 2011, 999–1006.

24. Ievgen and Younès, Random subspaces NMF for unsupervised transfer learning, Proceedings of the IEEE International Joint Conference on Neural Networks, 2014, 3901–3908.

25. Pan S.J. and Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22(10) (2010), 1345–1359.

26. Pan S.J., Tsang I.W., Kwok J.T. and Yang, Domain adaptation via transfer component analysis, IEEE Transactions on Neural Networks 22(2) (2011), 199–210.

27. Patel V.M., Gopalan, Li and Chellappa, Visual domain adaptation: A survey of recent advances, IEEE Signal Processing Magazine 32(3) (2015), 53–69.

28. Dai, Yang, Xue G.R. and Yu, Boosting for transfer learning, Proceedings of the 24th International Conference on Machine Learning, 2007, 193–200.

29. Chu W.S., De la Torre and Cohn J.F., Selective transfer machine for personalized facial action unit detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, 3515–3522.

30. Cao, Long and Wang, Unsupervised domain adaptation with distribution matching machines, Proceedings of the 32nd AAAI International Conference on Artificial Intelligence, 2018, 2795–2802.

31. LeCun, Bottou, Bengio and Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11) (1998), 2278–2324.

32. , Pan S.J., Xiong, Wu, Luo, Min and Song, A unified framework for metric transfer learning, IEEE Transactions on Knowledge and Data Engineering 29(6) (2017), 1158–1171.

33. Yang, Xu, Yang and Meng, Kernel extreme learning machine based domain adaptation, Proceedings of the 2018 5th IEEE International Conference on Cloud Computing and Intelligence Systems, 2018, 593–597.

34. Zhao, Zhao and L L., Self labeling online sequential extreme learning machine and it’s application, Journal of Intelligent & Fuzzy Systems 37(4) (2019), 4485–4491.
