Research on transfer learning algorithm based on support vector machine

Abstract

Transfer learning is a new machine learning method that solves problems in different but related target domains by using the knowledge in existing data. Based on the classical SVM algorithm and transfer learning, this paper proposes a selective transfer learning support vector machine (STL-SVM) algorithm. First, STL-SVM uses the maximum mean discrepancy to compute a weight vector for the source domain samples relative to the target domain, and selects samples from the source domain according to these weights to avoid negative transfer. Then, the knowledge in the source domain is learned by the approximate extreme points support vector machine at minimal training data cost. Finally, the objective function is constructed from the obtained knowledge and the soft-margin SVM. The constraints of the objective function select the learned knowledge that is highly correlated with the target domain, which further limits negative transfer in principle. STL-SVM addresses the problem of negative transfer and has considerable advantages in training time compared with existing algorithms. Experimental results on artificial and real datasets show the effectiveness of the proposed algorithm.

Keywords

Machine learning; support vector machine; transfer learning; classification

1 Introduction

Currently, machine learning algorithms are widely used in text classification, image processing, computer security and artificial intelligence, and are achieving great results. However, the shortcomings of traditional machine learning methods restrict further development in these fields. In traditional machine learning methods, to make the classifier model more accurate, the training and test samples are generally required to satisfy the following two basic assumptions [1–4]: 1) the training and test data must be independent and identically distributed; 2) many training samples are necessary to learn a better classifier model. However, in practical applications, especially in emerging applications such as text mining, bioinformatics, distributed sensor networks and social network research, training and test data rarely satisfy the independent and identically distributed condition [2]. In addition, the data resources in some areas are often scarce, and the cost of collecting these data is high. Even when there are enough training samples in some fields, labeling these samples is time consuming and laborious. Therefore, insufficient training samples and training and test samples that are not independent and identically distributed are the main problems faced by traditional machine learning.

In recent years, the rise of transfer learning has provided an effective method for solving the above problems of traditional machine learning [1, 2]. The purpose of transfer learning is to use labeled or unlabeled data from some domains to assist the learning tasks in similar target domains, which contain only a small amount of labeled or unlabeled data, so that the learning task in the target domain becomes more accurate. Wikipedia describes transfer learning as follows: transfer learning is a new machine learning method that uses existing knowledge to solve problems in different but similar fields. It no longer follows the two basic assumptions of traditional machine learning. Instead, the existing knowledge is transferred to solve the problem of having only a small amount of labeled or unlabeled data in the target domain [3]. Figure 1 shows the difference between traditional machine learning and transfer learning. Fig. 1(a) shows that each learning task starts from scratch in traditional machine learning, whereas Fig. 1(b) shows that transfer learning can transfer the knowledge from previous learning tasks to the current target learning task.

Fig. 1

Differences between Traditional Machine Learning and Transfer Learning.

In transfer learning, the more the source domain and the target domain have in common, the easier the transfer. Otherwise, the transfer is more difficult, negative transfer may occur, and the learning results obtained in the target domain are poorer. For example, if we have learned to ride a bicycle (source domain), then it will be easier to learn to ride a motorcycle (target domain), but this provides no help in learning the piano (target domain); if a person is familiar with Chinese chess (source domain), then he or she can also transfer that knowledge to speed up learning international chess (target domain). Transfer learning has been widely studied by researchers and has been successfully applied in many real applications: distributed network intrusion detection [5], behavioral positioning [6], text classification [7, 22], image classification [34], human activity recognition [35] and leukemia diagnosis [36].

Transfer learning has been continuously studied by researchers, and the representative algorithms are as follows. Hong Jiaming et al. [8] used domain similarity knowledge and transfer learning theory to propose the transfer SVM (TrSVM). Gao et al. [9] proposed the local weighted embedded transfer learning algorithm (LWE). Brain et al. [10] proposed a feature space-based transductive transfer learning algorithm, the large-margin projected transductive SVM (LMPROJ). Long et al. [22] proposed a least squares transfer learning framework based on SVM, adaptation regularization based transfer learning (ARTL). Xie et al. [30] applied transfer learning to incremental learning and proposed the selective transfer incremental learning (STIL) algorithm. Lu et al. [31] proposed the selective transfer learning for collaborative filtering (STLCF) algorithm. Li et al. [32] proposed a new transfer learning extreme learning machine (TL-ELM) and a transfer learning domain adaptation kernel extreme learning machine (TL-DAKELM), both based on the extreme learning machine. Li et al. [33] proposed the rank-based reduce error transfer learning algorithm (RankRE-TL). In addition, with the rapid development of neural networks and deep learning, researchers have proposed some novel ideas [34–37] that combine transfer learning with neural networks for different application scenarios. This combination can effectively mitigate the difficulty deep learning has with sparse training data.

However, the existing transfer learning algorithms generally suffer from long training time and high complexity when the source domain contains a large-scale dataset, because they learn all the knowledge from the source domain samples and transfer it to the target domain, even though not all of the knowledge in the source domain needs to be transferred. In addition, these algorithms handle negative transfer unsatisfactorily, which results in poor classification results. These reasons restrict further improvement of the performance and classification effect of classifiers trained by transfer learning algorithms. In summary, the motivation of this paper is to propose an STL-SVM algorithm based on the support vector machine (SVM) to overcome the shortcomings of traditional machine learning algorithms, namely insufficient training samples and training and test samples that are not independent and identically distributed. STL-SVM uses a large number of labeled samples in the source domain to assist in building classification models for target domains that contain only a small number of labeled or unlabeled samples. Unlike the existing transfer learning algorithms, STL-SVM does not need all the samples in the source domain and can deal with negative transfer more effectively, thus improving the performance and classification effect of the transfer learning classifier.

The core idea of the STL-SVM algorithm is as follows. First, the improved maximum mean discrepancy (MMD) method is used to calculate a weight vector that expresses the importance of the samples in the source domain relative to the target domain. Then, the approximate extreme points support vector machine (AESVM) is used to select a representative dataset, together with its sample weights, from the large number of labeled data in the source domain; combining these weights with the weights calculated by the MMD preliminarily addresses the negative transfer problem. Finally, an objective function with transfer learning ability is constructed on the basis of the support vector machine. The constraints of the objective function restrict negative transfer again by using the decision value (knowledge) obtained from training in the source domain and the target domain. The resulting objective function of the learning model is a quadratic programming problem.

The remainder of this paper is organized as follows. Section 2 describes the principle of MMD, the soft margin model of the SVM and the approximate extreme point training SVM. Section 3 is the main content of the paper, focusing on the detailed construction process of the STL-SVM objective function and theoretical reasoning and proof of the objective function. In Section 4, the experimental results validate the effectiveness of STL-SVM. Section 5 summarizes the paper and discusses future works.

2 Related work

2.1 Maximum mean discrepancy

For the transfer learning algorithm, the distribution differences between the source domain and the target domain seriously affect the classification effect. Currently, most transfer learning algorithms often use all samples in the source domain for training. If the differences between the source domain and target domain are too large, the effect of the learning task in the target domain becomes worse, and negative transfer occurs. Currently, a common method for effectively estimating the distance between two distributions is the MMD [10]. Below, we outline the process of the MMD method to estimate the distance of two distributions.

A training dataset in the source domain is given as $D_{s} = \{(x_{1}^{s}, y_{1}^{s}), (x_{2}^{s}, y_{2}^{s}), \ldots, (x_{n}^{s}, y_{n}^{s})\}$ with $y_{i}^{s} \in \{-1, 1\}$, where $x_{i}^{s}$ is the i-th sample, $y_{i}^{s}$ is the class label of $x_{i}^{s}$, and n is the number of training samples. Similarly, a dataset in the target domain is given as $D_{t} = \{(x_{1}^{t}, y_{1}^{t}), (x_{2}^{t}, y_{2}^{t}), \ldots, (x_{m}^{t}, y_{m}^{t})\}$, where m is the number of samples. The square of the MMD between the source domain and the target domain is as follows: ${MMD}^{2} = \| \frac{1}{m} \sum_{x_{i} \in D_{t}} ϕ (x_{i}) - \frac{1}{n} \sum_{x_{j} \in D_{s}} ϕ (x_{j}) \|^{2}$ (1)

In Equation (1), ϕ (·) is a nonlinear mapping function. When the MMD method measures the difference between the source domain and the target domain, all samples of the source domain are used and contribute equally. To account for the importance of each source domain sample, let the weight of the j-th sample in the source domain be $w_{j}$ ($0 \leq w_{j} \leq 1$). Equation (1) is then improved to obtain a weight-based MMD method, WMMD: $\begin{matrix} {WMMD}^{2} = \| \frac{1}{m} \sum_{x_{i} \in D_{t}} ϕ (x_{i}) - \frac{1}{n} \sum_{x_{j} \in D_{s}} w_{j} ϕ (x_{j}) \|^{2} \\ = \frac{1}{m^{2}} \sum_{i = 1}^{m} \sum_{j = 1}^{m} ϕ (x_{i}^{t})^{T} ϕ (x_{j}^{t}) \\ + \frac{1}{n^{2}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} w_{i} w_{j} ϕ (x_{i}^{s})^{T} ϕ (x_{j}^{s}) \\ - \frac{2}{mn} \sum_{i = 1}^{m} \sum_{j = 1}^{n} w_{j} ϕ (x_{i}^{t})^{T} ϕ (x_{j}^{s}) \end{matrix}$ (2)

In Equation (2), the first term on the right-hand side is a constant that does not depend on the weights, so it can be ignored and only the last two terms are minimized. This gives Equation (3): $\begin{matrix} f (w) = \frac{1}{n^{2}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} w_{i} w_{j} ϕ (x_{i}^{s})^{T} ϕ (x_{j}^{s}) \\ - \frac{2}{mn} \sum_{i = 1}^{m} \sum_{j = 1}^{n} w_{j} ϕ (x_{i}^{t})^{T} ϕ (x_{j}^{s}) \end{matrix}$ (3)

Multiplying Equation (3) by the constant $n^{2} / 2$, which does not change the minimizer, gives the objective function in Equation (4): $\begin{matrix} \min_{w} \frac{1}{2} w^{T} K w - k^{T} w \\ \text{s.t.} \; 0 \leq w_{i} \leq 1, \; i = 1, 2, \ldots, n \end{matrix}$ (4)

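In Equation (4), K is the $n \times n$ kernel matrix of the source domain samples with entries $K_{ij} = K (x_{i}^{s}, x_{j}^{s}) = ϕ (x_{i}^{s})^{T} ϕ (x_{j}^{s})$, and k is the vector with entries $k_{i} = \frac{n}{m} \sum_{j = 1}^{m} ϕ (x_{i}^{s})^{T} ϕ (x_{j}^{t})$.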

The objective function is a standard quadratic programming problem, so it can be solved using the quadratic programming solver to obtain the weight of each sample in the source domain. Based on the values of w_i, the transfer of knowledge in the source domain can effectively avoid negative transfer.
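The box-constrained problem in Equation (4) can be illustrated with a minimal sketch. It assumes a Gaussian kernel with a hypothetical bandwidth `gamma`, and it replaces a dedicated quadratic programming solver with a simple projected-gradient loop, so it approximates the solution rather than reproducing the solver used in this paper. The function names and the synthetic data are illustrative only.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def wmmd_weights(Xs, Xt, gamma=0.5, n_iter=500):
    """Approximately solve Equation (4) by projected gradient descent (an assumption)."""
    n, m = len(Xs), len(Xt)
    K = rbf_kernel(Xs, Xs, gamma)                        # K_ij = k(x_i^s, x_j^s)
    k = (n / m) * rbf_kernel(Xs, Xt, gamma).sum(axis=1)  # k_i as defined for Equation (4)
    step = 1.0 / np.linalg.eigvalsh(K)[-1]               # 1 / largest eigenvalue of K
    w = np.full(n, 0.5)
    for _ in range(n_iter):
        # Gradient of 0.5 w^T K w - k^T w is K w - k; clip projects onto [0, 1]^n.
        w = np.clip(w - step * (K @ w - k), 0.0, 1.0)
    return w

rng = np.random.default_rng(0)
Xt = rng.normal(0.0, 1.0, size=(30, 2))                # synthetic target samples
Xs = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),    # source samples near the target
                rng.normal(6.0, 1.0, size=(20, 2))])   # source samples far from the target
w = wmmd_weights(Xs, Xt)
print(w[:20].mean(), w[20:].mean())
```

In this toy setting, the source samples drawn near the target distribution should receive larger weights on average than the distant ones, which is the behavior the weights are meant to capture.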

2.2 Support vector machine

The SVM, which was formally proposed by Vapnik in 1995 [14], is a classifier for binary classification problems whose basic model is a linear classifier defined in the feature space [11–13]. SVM is based on the theory of the Vapnik-Chervonenkis (VC) dimension and the structural risk minimization principle of statistical learning theory, and it seeks the best compromise between model complexity and learning ability based on limited sample information. The learning strategy of the SVM is to maximize the margin, which can be formalized as a convex quadratic programming problem. Therefore, the SVM algorithm is an optimization algorithm for solving convex quadratic programming.

In SVM, it is assumed that the training samples are linearly separable in the sample or feature space. However, linearly separable data are rare in practice. To alleviate this situation, the SVM is allowed to make mistakes on some samples, which introduces the "soft margin" for SVM. Typical algorithms derived from the SVM are as follows. In [23], a new deep convolutional neural network (DCNN) analysis method, the SVM histogram, is proposed; it uses the decision boundary of a linear SVM to check the spatial distribution of the feature representation extracted by the DCNN, and the experimental results show that the method has higher precision than the original DCNN. In [24] and [25], the researchers changed the solution from the original quadratic programming problem to a set of linear equations by modifying the equality constraints of Vapnik's original SVM formula in regression mode, which yields the least squares SVM method. In [26], a least squares SVM with parametric margin, Par-LSSVM, is proposed; Par-LSSVM can deal with "cross planes" datasets. In [27], a pattern recognition algorithm based on the nonlinear SVM is proposed. The basic form of a typical soft-margin SVM is described below. $\begin{matrix} \min_{w, b, ξ} \frac{1}{2} \| w \|^{2} + C \sum_{i = 1}^{n} ξ_{i} \\ \text{s.t.} \; y_{i} (w^{T} x_{i} + b) \geq 1 - ξ_{i}, \; i = 1, 2, \ldots, n \\ ξ_{i} \geq 0, \; i = 1, 2, \ldots, n \end{matrix}$ (5)

In Equation (5), $ξ_{i}$ ($i = 1, 2, \ldots, n$) is the slack variable of the i-th sample, which measures by how much the sample violates the margin; w is the normal vector of the classification hyperplane and b is its offset; C (C > 0) is a penalty coefficient, which is a constant generally determined by the specific application problem.
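To make the role of the slack variables concrete, the following minimal sketch evaluates the objective of Equation (5) for a fixed linear classifier. For given w and b, the smallest feasible slack is $ξ_{i} = \max (0, 1 - y_{i} (w^{T} x_{i} + b))$. The function name, the four toy samples and the value C = 1 are illustrative assumptions, and no optimization is performed.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """Return the Equation (5) objective and the per-sample slack for fixed (w, b)."""
    margins = y * (X @ w + b)
    xi = np.maximum(0.0, 1.0 - margins)   # smallest slack satisfying both constraints
    return 0.5 * w @ w + C * xi.sum(), xi

# Hypothetical toy data: the last sample lies on the wrong side of the hyperplane.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 0.0]), 0.0

obj, xi = soft_margin_objective(w, b, X, y)
print(obj, xi)
```

Samples outside the margin contribute zero slack, a sample inside the margin contributes a slack below 1, and a misclassified sample contributes a slack above 1.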

The soft-margin SVM handles linear inseparability through slack variables, but it requires the training and test samples to follow the same distribution. Moreover, for a target domain with a small number of training samples, the soft-margin SVM algorithm is not sufficient for obtaining an accurate classification model. In this situation, transferring the knowledge of the abundant training samples in the source domain to the target domain speeds up the learning tasks in the target domain and alleviates the accuracy degradation caused by a lack of training samples. In addition, special attention must be paid to negative transfer when knowledge is transferred. Once negative transfer occurs, a classifier obtained with the knowledge of a similar domain may be even worse than one trained without it.

2.3 Approximate extreme points support vector machine

Since the basic idea of the SVM is to find the hyperplane with the largest margin between two classes, from the geometric point of view, computing the hyperplane is equivalent to computing the nearest points between two convex hulls [18, 19]. For SVM, a large number of training samples is a prerequisite for better training results. However, a large number of training samples requires considerable manpower for labeling, and considerable time is consumed in the training phase, so the training efficiency of SVM is not satisfactory. To improve the efficiency of SVM training, a method that trains the SVM using only representative training samples, the approximate extreme points SVM (AESVM), is proposed in [20]. AESVM does not require all training samples to train the learning model, which greatly reduces the scale of the training set and therefore the cost of training classifiers. It is proven theoretically in [20] and [21] that, within a certain range, the effect of AESVM trained on the representative samples is very close to that of the SVM trained on all samples.

Given a training dataset $X = \{x_{1}, x_{2}, \ldots, x_{n}\}$ with corresponding class labels $Y = \{y_{1}, y_{2}, \ldots, y_{n}\}$, $y_{i} \in \{1, -1\}$, the optimization problem of AESVM is described by Equation (6): $\min_{w, b} F_{AESVM} (w, b) = \frac{1}{2} w^{T} w + \frac{C}{M} \sum_{i = 1}^{M} β_{i} \, l (w, b, ϕ (x_{i}))$ (6)

In Equation (6), M is the number of training samples in the representative set $X^{*}$, which is selected from dataset X. The parameters w, b and C have the same meaning as in Equation (5). $β = [β_{1}, β_{2}, \ldots, β_{M}]$ is the weight vector corresponding to the representative set, given in Equation (9); l is the hinge loss function, $l (w, b, ϕ (x_{i})) = \max \{0, 1 - y_{i} (w^{T} ϕ (x_{i}) + b)\}$, $x_{i} \in X^{*}$; and ϕ (·) is a nonlinear mapping function.

To obtain a representative dataset $X^{*}$, the dataset X is first partitioned into groups according to a certain separation strategy, $X = \{X_{1}, X_{2}, \ldots, X_{n / V}\}$, where n is the number of samples in X, V is the predefined maximum number of samples in each group, and $X_{q}$ ($q = 1, 2, \ldots, n / V$) denotes the q-th group. Samples within a group are highly similar, whereas samples from different groups have low similarity.

First, the initial representative dataset is calculated using the support vector data description (SVDD) algorithm [19]. Then, in descending order of the distance between each sample and the center of the SVDD kernel space, it is determined whether the sample $x_{i}$ ($x_{i} \in X_{q}$ and $x_{i} \notin X_{q}^{*}$) belongs to the representative dataset. The formal description is shown in Equation (7): $\begin{matrix} \max_{x_{i} \in X_{q}, x_{i} \notin X_{q}^{*}} f (ϕ (x_{i}), X_{q}^{*}) \leq ɛ \\ \text{s.t.} \; 0 \leq μ_{i, j} \leq 1, \; \sum_{j = 1}^{| X_{q}^{*} |} μ_{i, j} = 1 \end{matrix}$ (7)

In Equation (7), $f (ϕ (x_{i}), X_{q}^{*}) = \min_{μ_{i, t}} \| ϕ (x_{i}) - \sum_{t = 1}^{| X_{q}^{*} |} μ_{i, t} ϕ (x_{t}) \|^{2}$, $| X_{q}^{*} |$ is the number of samples in $X_{q}^{*}$, ɛ is a given small error coefficient, and $μ_{i, j}$ is analogous to the convex combination weights [17]. If $x_{i}$ satisfies Equation (7), we expand the representative dataset $X_{q}^{*} = X_{q}^{*} \cup \{x_{i}\}$. All samples can then be expressed as in Equation (8):

$ϕ (x_{i}) = \sum_{x_{j} \in X^{*}} γ_{i, j} ϕ (x_{j}) + τ_{i}$ (8)

In Equation (8), $γ_{i, j} = \begin{cases} μ_{i, j}, & x_{j} \in X_{q}^{*} \text{ and } x_{i} \in X_{q} \\ 0, & \text{otherwise} \end{cases}$, and $τ_{i}$ is the approximation error vector, with $\| τ_{i} \|^{2} \leq ɛ$. Finally, the weight vector corresponding to the representative dataset is given by Equation (9): $β_{j} = \sum_{i = 1}^{n} γ_{i, j}$ (9)
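Equation (9) is a column sum over the coefficient matrix, which the following small sketch illustrates. The matrix `gamma` is made up for illustration: it assumes five samples and two representatives, and it assumes that each representative approximates itself with weight 1, which is not stated explicitly above.

```python
import numpy as np

# gamma[i, j]: hypothetical weight of representative j in the approximation of sample i.
gamma = np.array([
    [1.0, 0.0],   # sample 0 is itself representative 0 (assumption)
    [0.0, 1.0],   # sample 1 is itself representative 1 (assumption)
    [0.7, 0.3],   # remaining samples are convex combinations of the representatives
    [0.2, 0.8],
    [0.5, 0.5],
])

# Each row sums to 1 because the mu weights in Equation (7) are convex combination weights.
row_sums = gamma.sum(axis=1)

# Equation (9): beta_j is the total weight that all samples place on representative j.
beta = gamma.sum(axis=0)
print(beta, beta.sum())
```

Under these assumptions the entries of β add up to the number of original samples, so β can be read as how many original samples each representative stands in for.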

3 Selective transfer support vector machine

Based on SVM, this paper constructs an STL-SVM model by using MMD, AESVM and transfer learning theory. The algorithm framework is shown in Fig. 2. As seen in Fig. 2, the STL-SVM uses the weight vector of the importance of the source domain samples relative to the target domain calculated by the MMD algorithm, the knowledge of the selected representative dataset acquired by the AESVM, and the target domain dataset to train the objective function of STL-SVM. Finally, we obtain the classifier.

Fig. 2

Framework of STL-SVM.

3.1 Construction of the objective function

To make the support vector have transfer-ability, the objective function of the STL-SVM is built based on Equation (5) and Equation (6). It is assumed that there are a large number of labeled samples in the source domain and a small number of labeled samples in the target domain. The probability distribution of the samples does not have to be the same. Without loss of generality, only the binary classification is considered in this paper.

The weights of the representative dataset obtained by the AESVM algorithm are corrected by the weight of each source domain sample calculated by the MMD method in Section 2.1. This process can also be called sample selection and is given by Equation (10): $β_{j} = β_{j} \cdot w_{κ (j)}$ (10)
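A small numerical sketch of Equation (10) follows. It assumes that κ(j) denotes the index, within the source domain, of the j-th representative sample; this reading is an assumption, and all numbers below are made up for illustration.

```python
import numpy as np

beta = np.array([2.4, 2.6, 1.0])          # AESVM weights of M = 3 representatives (hypothetical)
w = np.array([0.9, 0.1, 1.0, 0.0, 0.5])   # WMMD weights of n = 5 source samples (hypothetical)
kappa = np.array([0, 3, 4])               # assumed map: representative j -> source sample index

# Equation (10): scale each representative weight by the WMMD weight of the same sample.
beta_corrected = beta * w[kappa]
print(beta_corrected)
```

A representative whose WMMD weight is 0 ends up with a corrected weight of 0, so its slack term no longer contributes to the source-domain loss; this is the sense in which the step acts as sample selection.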

A source domain $D_{s}$ contains n samples, $D_{s} = \{(x_{1}^{s}, y_{1}^{s}), (x_{2}^{s}, y_{2}^{s}), \ldots, (x_{n}^{s}, y_{n}^{s})\}$, $X_{s} = \{x_{1}^{s}, x_{2}^{s}, \ldots, x_{n}^{s}\}$, $Y_{s} = \{y_{1}^{s}, y_{2}^{s}, \ldots, y_{n}^{s}\}$, $y_{i}^{s} \in \{1, -1\}$.

Similarly, a target domain $D_{t}$ contains m samples, $D_{t} = \{(x_{1}^{t}, y_{1}^{t}), (x_{2}^{t}, y_{2}^{t}), \ldots, (x_{m}^{t}, y_{m}^{t})\}$, $X_{t} = \{x_{1}^{t}, x_{2}^{t}, \ldots, x_{m}^{t}\}$. The objective function is shown in Equation (11).

$\begin{matrix} \min_{w_{t}, b_{t}} \frac{1}{2} \| w_{t} \|^{2} + C_{t} \sum_{i = M + 1}^{M + m} ξ_{i}^{t} + \frac{1}{2} \| w_{s} \|^{2} \\ + \frac{C_{s}}{M} \sum_{i = 1}^{M} β_{i} ξ_{i}^{s} + \frac{λ}{2} \| w_{t} - w_{s} \|^{2} \\ \text{s.t.} \; y_{i}^{s} (w_{s}^{T} ϕ (x_{i}^{s}) + b_{s}) \geq 1 - ξ_{i}^{s} \\ y_{i}^{t} (w_{t}^{T} ϕ (x_{i}^{t}) + b_{t} - {\tilde{w}}_{t}^{T} ϕ (x_{i}^{t}) - {\tilde{b}}_{t}) \geq 0 \\ ξ_{i}^{s} \geq 0, \; i = 1, 2, \ldots, M \end{matrix}$ (11)

In Equation (11), $w_{t}$ and $b_{t}$ are the parameters of the target domain, $w_{s}$ and $b_{s}$ are the parameters of the source domain, and these parameters contain the knowledge in the two domains. ${\tilde{w}}_{t}$ and ${\tilde{b}}_{t}$ represent the knowledge of the traditional SVM trained on the target domain dataset. As in Section 2.3, ϕ (·) is a nonlinear mapping function. $ξ_{i}^{t}$ ($ξ_{i}^{t} \geq 0$) and $ξ_{i}^{s}$ ($ξ_{i}^{s} \geq 0$) are slack variables in the target domain and source domain, respectively. n is the number of samples in the source domain, M is the number of samples in the representative dataset obtained by Equation (6), and m is the number of samples in the target domain. $β = [β_{1}, β_{2}, \ldots, β_{M}]$ is the weight vector of the samples in the representative dataset. $C_{t}$ ($C_{t} \geq 0$) and $C_{s}$ ($C_{s} \geq 0$) are the regularization coefficients (the degree of penalty on errors) in the target and the source domain, respectively. T denotes matrix transposition. $f (x_{i}) = {\tilde{w}}_{t}^{T} ϕ (x_{i}^{t}) + {\tilde{b}}_{t}$ is the decision function of the SVM classifier trained in the target domain.

The terms of the objective function in Equation (11) are described as follows:

$\frac{1}{2} \| w_{t} \|^{2} + C_{t} \sum_{i = M + 1}^{M + m} ξ_{i}^{t}$ and $\frac{1}{2} \| w_{s} \|^{2} + \frac{C_{s}}{M} \sum_{i = 1}^{M} β_{i} ξ_{i}^{s}$ represent the structural risk terms and empirical risk terms in the target and source domains, respectively.

$\frac{λ}{2} \| w_{t} - w_{s} \|^{2}$ represents the difference between the classifier learned in the source domain and the classifier learned in the target domain: the larger the value, the greater the difference. λ (λ ≥ 0) is the coordination coefficient, a manually specified constant.

The constraint condition $y_{i}^{s} (w_{s}^{T} ϕ (x_{i}^{s}) + b_{s}) \geq 1 - ξ_{i}^{s}$ ensures that the samples in the source domain are classified as correctly as possible.

The constraint condition $y_{i}^{t} (w_{t}^{T} ϕ (x_{i}^{t}) + b_{t} - {\tilde{w}}_{t}^{T} ϕ (x_{i}^{t}) - {\tilde{b}}_{t}) \geq 0$ ensures that the effect of the transfer is not worse than the SVM in the target domain, limiting the possibility of negative transfer.
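The second constraint compares two decision values on each target sample, which the following sketch checks numerically. The decision values are made up; `f_target` stands for the target-only SVM output ${\tilde{w}}_{t}^{T} ϕ (x_{i}^{t}) + {\tilde{b}}_{t}$ and `f_transfer` for $w_{t}^{T} ϕ (x_{i}^{t}) + b_{t}$. The function name and tolerance are illustrative assumptions.

```python
import numpy as np

def violates_transfer_constraint(y_t, f_transfer, f_target, tol=1e-12):
    """True where y_i * (f_transfer(x_i) - f_target(x_i)) < 0 for a target sample."""
    return y_t * (f_transfer - f_target) < -tol

# Hypothetical labels and decision values for four target samples.
y_t = np.array([1.0, 1.0, -1.0, -1.0])
f_target = np.array([0.4, -0.2, -0.3, 0.1])
f_transfer = np.array([0.9, 0.1, -0.8, 0.3])

mask = violates_transfer_constraint(y_t, f_transfer, f_target)
print(mask)
```

The constraint is satisfied when the transferred classifier moves each target sample at least as far toward its correct side as the target-only SVM does; in this toy example only the last sample violates it.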

Finally, the decision function of the STL-SVM algorithm is given by Equation (12): $f (x) = w_{t}^{T} ϕ (x) + b_{t}$ (12)

The STL-SVM algorithm differs from current transfer learning algorithms such as TrSVM [8], LMPROJ [10], LWE [9], ARTL [22] and STIL [30] in that it considers both negative transfer and the long training time caused by too many training samples in the source domain. The proposed algorithm therefore has considerable advantages on large-scale datasets.

3.2 Proof of theorem supporting the objective function

To solve the objective function of the STL-SVM algorithm, the dual problem of formula (11) must first be obtained, then the problem is proven to be a quadratic programming problem, and the quadratic programming problem obtained is finally further proved to be a convex quadratic programming problem. The quadratic programming problem can be solved and the result is a global optimal solution.

The original problem in Equation (11) can be translated into the following dual problem:

Theorem 3.1. The dual problem of the STL-SVM original optimization problem is Equation (13). $\begin{matrix} \min_{Γ} \frac{1}{2} Γ^{T} \tilde{K} Γ + {\tilde{e}}^{T} Γ \\ \text{s.t.} \; f^{T} Γ = 0 \end{matrix}$ (13)

In Equation (13), $Γ = [α, α^{*}]$, $α = [α_{1}, α_{2}, \ldots, α_{M}]$, $α^{*} = [α_{1}^{*}, α_{2}^{*}, \ldots, α_{m}^{*}]$, $0 \leq Γ \leq [\underbrace{C_{s} β_{1} / M, C_{s} β_{2} / M, \ldots, C_{s} β_{M} / M}_{M}, \underbrace{C_{t}, C_{t}, \ldots, C_{t}}_{m}]$, $β_{j} = \sum_{i = 1}^{n} γ_{i, j}$, $f^{T} = [\underbrace{y_{1}^{s}, y_{2}^{s}, \ldots, y_{M}^{s}}_{M}, \underbrace{y_{1}^{t}, y_{2}^{t}, \ldots, y_{m}^{t}}_{m}]$, and $\tilde{e} = [{\tilde{w}}_{t}^{T} ϕ (x_{1}^{t}) + {\tilde{b}}_{t}, {\tilde{w}}_{t}^{T} ϕ (x_{2}^{t}) + {\tilde{b}}_{t}, \ldots, {\tilde{w}}_{t}^{T} ϕ (x_{m}^{t}) + {\tilde{b}}_{t}, 1, \ldots, 1]$.

Proof. The dual problem of Equation (11) is shown in Equation (13).

The Lagrangian of Equation (11) is given in Equation (14): $\begin{matrix} L (w_{t}, w_{s}, b_{t}, b_{s}, ξ^{t}, ξ^{s}, α, α^{*}, γ, γ^{*}) = \\ \frac{1}{2} \| w_{t} \|^{2} + \frac{1}{2} \| w_{s} \|^{2} + C_{t} \sum_{i = M + 1}^{M + m} ξ_{i}^{t} \\ + \frac{C_{s}}{M} \sum_{i = 1}^{M} β_{i} ξ_{i}^{s} + \frac{λ}{2} \| w_{t} - w_{s} \|^{2} \\ - \sum_{i = M + 1}^{M + m} γ_{i}^{*} ξ_{i}^{t} - \sum_{i = 1}^{M} γ_{i} ξ_{i}^{s} \\ - \sum_{i = 1}^{M} α_{i} (y_{i}^{s} (w_{s}^{T} ϕ (x_{i}^{s}) + b_{s}) - 1 + ξ_{i}^{s}) \\ - \sum_{i = M + 1}^{M + m} α_{i}^{*} (y_{i}^{t} (w_{t}^{T} ϕ (x_{i}^{t}) + b_{t} - {\tilde{w}}_{t}^{T} ϕ (x_{i}^{t}) - {\tilde{b}}_{t})) \end{matrix}$ (14)

In Equation (14), $α = [α_{1}, α_{2}, \ldots, α_{M}]$, $α^{*} = [α_{1}^{*}, α_{2}^{*}, \ldots, α_{m}^{*}]$, $γ = [γ_{1}, γ_{2}, \ldots, γ_{M}]$ and $γ^{*} = [γ_{1}^{*}, γ_{2}^{*}, \ldots, γ_{m}^{*}]$ are Lagrange multipliers, $ξ^{s} = [ξ_{1}^{s}, ξ_{2}^{s}, \ldots, ξ_{M}^{s}]$ is the slack variable vector of the source domain, and $ξ^{t} = [ξ_{1}^{t}, ξ_{2}^{t}, \ldots, ξ_{m}^{t}]$ is the slack variable vector of the target domain. For convenience of proof, $y_{i}^{s}$ and $y_{i}^{t}$ are replaced by a single symbol $y_{i}$, and $x_{i}^{s}$ and $x_{i}^{t}$ by a single symbol $x_{i}$, with $y_{i} \in \{y_{1}^{s}, \ldots, y_{M}^{s}, y_{1}^{t}, \ldots, y_{m}^{t}\}$ and $x_{i} \in \{x_{1}^{s}, \ldots, x_{M}^{s}, x_{1}^{t}, \ldots, x_{m}^{t}\}$: when 1 ≤ i ≤ M, $y_{i} \in \{y_{1}^{s}, \ldots, y_{M}^{s}\}$ and $x_{i} \in \{x_{1}^{s}, \ldots, x_{M}^{s}\}$; when M + 1 ≤ i ≤ M + m, $y_{i} \in \{y_{1}^{t}, \ldots, y_{m}^{t}\}$ and $x_{i} \in \{x_{1}^{t}, \ldots, x_{m}^{t}\}$.

By the Karush–Kuhn–Tucker (KKT) conditions, the following equations are obtained: $\begin{matrix} \frac{\partial L}{\partial ξ_{i}^{s}} = 0 \Rightarrow \sum_{i = 1}^{M} γ_{i} + \sum_{i = 1}^{M} α_{i} = \frac{C_{s}}{M} \sum_{i = 1}^{M} β_{i} \\ \Rightarrow γ_{i} + α_{i} = \frac{C_{s}}{M} β_{i} \end{matrix}$ (15)

$\frac{\partial L}{\partial ξ_{i}^{t}} = 0 \Rightarrow γ_{i}^{*} = C_{t}$ (16)

$\frac{\partial L}{\partial b_{s}} = 0 \Rightarrow \sum_{i = 1}^{M} α_{i} y_{i} = 0$ (17)

$\frac{\partial L}{\partial b_{t}} = 0 \Rightarrow \sum_{i = M + 1}^{M + m} α_{i}^{*} y_{i} = 0$ (18)

$\frac{\partial L}{\partial w_{s}} = 0 \Rightarrow w_{s} + λ (w_{s} - w_{t}) - \sum_{i = 1}^{M} α_{i} y_{i} ϕ (x_{i}) = 0$ (19)

$\begin{matrix} \frac{\partial L}{\partial w_{t}} = 0 \Rightarrow w_{t} + λ (w_{t} - w_{s}) - \sum_{i = M + 1}^{M + m} α_{i}^{*} y_{i} ϕ (x_{i}) = 0 \\ w_{t} = \frac{(1 + λ) \sum_{i = M + 1}^{M + m} α_{i}^{*} y_{i} ϕ (x_{i}) + λ \sum_{i = 1}^{M} α_{i} y_{i} ϕ (x_{i})}{1 + 2 λ} \\ w_{s} = \frac{λ \sum_{i = M + 1}^{M + m} α_{i}^{*} y_{i} ϕ (x_{i}) + (1 + λ) \sum_{i = 1}^{M} α_{i} y_{i} ϕ (x_{i})}{1 + 2 λ} \end{matrix}$ (20)
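The closed forms for $w_{t}$ and $w_{s}$ in Equation (20) follow from solving Equations (19) and (20) as a two-by-two linear system. The sketch below checks this numerically. The random vectors `A` and `B` are stand-ins for $\sum_{i = 1}^{M} α_{i} y_{i} ϕ (x_{i})$ and $\sum_{i = M + 1}^{M + m} α_{i}^{*} y_{i} ϕ (x_{i})$, and the value of λ is arbitrary; all of these are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.7                   # arbitrary coordination coefficient
A = rng.normal(size=5)      # stand-in for the source-domain sum in Equation (19)
B = rng.normal(size=5)      # stand-in for the target-domain sum in Equation (20)

# Closed forms stated in Equation (20).
w_t = ((1 + lam) * B + lam * A) / (1 + 2 * lam)
w_s = (lam * B + (1 + lam) * A) / (1 + 2 * lam)

# Residuals of the two stationarity conditions; both should vanish.
res_s = w_s + lam * (w_s - w_t) - A   # Equation (19)
res_t = w_t + lam * (w_t - w_s) - B   # first line of Equation (20)
print(np.abs(res_s).max(), np.abs(res_t).max())
```

The denominator $1 + 2 λ$ arises as $(1 + λ)^{2} - λ^{2}$, the determinant of the linear system.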

Substituting Equations (15) to (20) into Equation (14) yields, after simplification, the dual form shown in Equation (21): $\begin{matrix} \min_{α, α^{*}} \frac{1 + λ}{2 (1 + 2 λ)} ( \sum_{i = M + 1}^{M + m} \sum_{j = M + 1}^{M + m} α_{i}^{*} α_{j}^{*} y_{i} y_{j} k (x_{i}, x_{j}) \\ + \sum_{i = 1}^{M} \sum_{j = 1}^{M} α_{i} α_{j} y_{i} y_{j} k (x_{i}, x_{j}) ) \\ + \frac{λ (1 + λ)}{2 (1 + 2 λ)} \sum_{i = 1}^{M} \sum_{j = M + 1}^{M + m} α_{i} α_{j}^{*} y_{i} y_{j} k (x_{i}, x_{j}) \\ + \frac{λ (1 + λ)}{2 (1 + 2 λ)} \sum_{i = 1}^{M} \sum_{j = M + 1}^{M + m} α_{i} α_{j}^{*} y_{i} y_{j} k (x_{i}, x_{j}) \\ + \sum_{i = 1}^{M} α_{i} + \sum_{i = M + 1}^{M + m} α_{i}^{*} ({\tilde{w}}_{t}^{T} ϕ (x_{i}^{t}) + {\tilde{b}}_{t}) \end{matrix}$ (21)

subject to $α_{i}^{*} \in [0, C_{t}]$, $α_{i} \in [0, \frac{β_{i}}{M} C_{s}]$, $\sum_{i = M + 1}^{M + m} α_{i}^{*} y_{i} = 0$ and $\sum_{i = 1}^{M} α_{i} y_{i} = 0$. The matrix $\tilde{K}$ is defined as $\tilde{K} = \frac{1 + λ}{2 (1 + 2 λ)} \begin{bmatrix} k_{11} & λ k_{12} \\ k_{21} & λ k_{22} \end{bmatrix}$, $k_{11} = k_{12} = k_{21} = k_{22} = y_{i} y_{j} k (x_{i}, x_{j})$ (22)

Substituting Equation (22) into Equation (21) obtains the Equation (13).

Theorem 3.2. The quadratic programming problem transformed from the dual problem shown in Equation (13) is a convex quadratic programming problem.

Proof. Equation (13) is a convex quadratic programming problem.

To prove that Equation (13) is a convex quadratic programming problem, we need to prove that $\tilde{K}$ is a positive semidefinite matrix. $\tilde{K}$ can be expressed as follows: $\tilde{K} = Q^{T} Q$ (23)

Here $Q = \sqrt{\frac{1 + λ}{2 (1 + 2 λ)}} (y_{1} x_{1}, \ldots, y_{M + m} x_{M + m})$. Because $\tilde{K}$ can be written as $Q^{T} Q$, we have $v^{T} \tilde{K} v = \| Q v \|^{2} \geq 0$ for every vector v, so $\tilde{K}$ is a positive semidefinite matrix. It follows that the quadratic programming problem represented by Equation (13) is a convex quadratic programming problem.
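The Gram-matrix step of this argument can be checked numerically. The sketch below builds Q from random data in a low-dimensional linear feature space and confirms that $Q^{T} Q$ has no negative eigenvalues. It illustrates only the factorization in Equation (23), under the assumption of a linear kernel and arbitrary values for λ and the data; it does not verify the block structure of Equation (22).

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 0.5                                  # arbitrary coordination coefficient
X = rng.normal(size=(3, 8))                # columns are 8 hypothetical samples in 3 dimensions
y = rng.choice([-1.0, 1.0], size=8)        # hypothetical labels

# Column i of Q is y_i * x_i, scaled by the constant in front of Equation (23).
Q = np.sqrt((1 + lam) / (2 * (1 + 2 * lam))) * (X * y)
K_tilde = Q.T @ Q

eigvals = np.linalg.eigvalsh(K_tilde)      # eigenvalues of a symmetric matrix, ascending
v = rng.normal(size=8)
quad_form = v @ K_tilde @ v                # equals ||Q v||^2
print(eigvals.min(), quad_form)
```

Up to floating-point error, the smallest eigenvalue is nonnegative and the quadratic form matches $\| Q v \|^{2}$.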

Theorem 3.3. The solution to the quadratic programming problem of Equation (13) is the global optimal solution.

Proof. The solution of Equation (13) is the global optimal solution.

Since the quadratic programming of Equation (13) is a convex quadratic programming problem, the KKT condition is also a sufficient condition, so the solution of the quadratic programming is the global optimal solution.

Theorem 3.4. $w_{t}^{*}$ and $b_{t}^{*}$ are the global optimal solutions of the original problem in Equation (11), and they are given by Equation (24) and Equation (25): $\begin{matrix} w_{t}^{*} = \frac{1 + λ}{1 + 2 λ} \sum_{i = M + 1}^{M + m} {\tilde{α}}_{i}^{*} y_{i} ϕ (x_{i}) \\ + \frac{λ}{1 + 2 λ} \sum_{i = 1}^{M} {\tilde{α}}_{i} y_{i} ϕ (x_{i}) \end{matrix}$ (24) $\begin{matrix} b_{t}^{*} = y_{i} - ( \frac{1 + λ}{1 + 2 λ} \sum_{i = M + 1}^{M + m} \sum_{j = M + 1}^{M + m} {\tilde{α}}_{i}^{*} y_{i} ϕ (x_{i})^{T} ϕ (x_{j}) \\ + \frac{λ}{1 + 2 λ} \sum_{i = 1}^{M} \sum_{j = 1}^{M} {\tilde{α}}_{i} y_{i} ϕ (x_{i})^{T} ϕ (x_{j}) ) \end{matrix}$ (25)

The optimal solutions given in Equation (24) and Equation (25) contain knowledge from both the target domain and the source domain. For example, in $w_{t}^{*}$ , the term $\frac{λ}{1 + 2 λ} \sum_{i = 1}^{M} {\tilde{α}}_{i} y_{i} ϕ (x_{i})$ is knowledge learned from the source domain, and the term $\frac{(1 + λ)}{1 + 2 λ} \sum_{i = M + 1}^{M + m} {\tilde{α}}_{i}^{*} y_{i} ϕ (x_{i})$ is knowledge learned from the target domain. By constructing the objective function in Equation (11), the STL-SVM algorithm transfers the knowledge in the source domain to the target domain while avoiding the negative transfer phenomenon, which benefits the target domain learning model.
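To make the two contributions concrete, the decision value $w_{t}^{T} ϕ (x) + b_{t}$ can be evaluated with the kernel trick by expanding $w_{t}^{*}$ as in Equation (24). The sketch below is illustrative only: the function names, the toy dual variables and the sample points are assumptions, and the offset b is simply passed in rather than computed from Equation (25).

```python
import math

def gaussian_kernel(x, z, two_sigma_sq):
    """Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2σ²))."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-dist_sq / two_sigma_sq)

def decision_value(x, source, target, lam, b, two_sigma_sq):
    """Evaluate f(x) = w_t^T ϕ(x) + b_t with w_t expanded as in Equation (24).

    `source` and `target` are lists of (alpha, y, x_i) triples holding the
    dual variables, labels and samples of each domain.
    """
    src = sum(a * y * gaussian_kernel(xi, x, two_sigma_sq) for a, y, xi in source)
    tgt = sum(a * y * gaussian_kernel(xi, x, two_sigma_sq) for a, y, xi in target)
    # Target-domain term weighted by (1+λ)/(1+2λ), source-domain term by λ/(1+2λ).
    return (1 + lam) / (1 + 2 * lam) * tgt + lam / (1 + 2 * lam) * src + b

# Toy values, chosen only to exercise the function (not from the paper).
source = [(0.5, 1, (1.0, 1.0)), (0.5, -1, (-1.0, -1.0))]
target = [(1.0, 1, (0.8, 1.2)), (1.0, -1, (-1.2, -0.8))]
f_pos = decision_value((1.0, 1.0), source, target, lam=1.0, b=0.0, two_sigma_sq=2.0)
f_neg = decision_value((-1.0, -1.0), source, target, lam=1.0, b=0.0, two_sigma_sq=2.0)
```

With these toy inputs, a point near the positive samples receives a positive decision value and a point near the negative samples receives a negative one.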

3.3 STL-SVM algorithm flow

Following Sections 3.1 and 3.2, the proposed STL-SVM algorithm proceeds as follows:

Steps of STL-SVM Algorithm
Input: n labeled samples in source domain $D_{s} = \{(x_{i}, y_{i})\}_{i = 1}^{n}$ ; m labeled samples in target domain $D_{t} = \{x_{j}\}_{j = n + 1}^{n + m}$ .
Output: Decision function $f (x) = w_{t}^{T} ϕ (x) + b_{t}$
Step 1. Compute Equation (6) to obtain the weight vector of the samples in the source domain; this vector represents the importance of the source domain samples relative to the target domain;
Step 2. Use the AESVM algorithm to select M representative samples from the source domain;
Step 3. Compute Equation (10);
Step 4. Solve Equation (13) with a QP solver to obtain Γ;
Step 5. Compute the normal vector w_t of the decision hyperplane according to Equation (24);
Step 6. Compute the hyperplane offset b_t according to Equation (25);
Step 7. Output the decision function $f (x) = w_{t}^{T} ϕ (x) + b_{t}$ .

The time complexity of constructing the STL-SVM classifier is obtained as follows: given a source domain containing n labeled samples, a target domain containing m labeled samples, and M representative samples selected by the AESVM algorithm from the source domain, the time complexity of constructing the classifier is O((M + n)³).

STL-SVM uses AESVM to reduce the size of the source domain dataset and thus the time complexity of learning knowledge from the source domain. At the same time, it uses the improved MMD method and decision value constraints to restrict negative transfer. Compared with previous transfer learning algorithms, STL-SVM exploits the knowledge in the source domain data using fewer samples, which improves training efficiency and better addresses the problem of negative transfer. Therefore, compared with existing machine learning algorithms with transfer ability, the STL-SVM algorithm has certain advantages in classification accuracy and algorithm performance.

4 Experimental results

To comprehensively evaluate the performance of the STL-SVM algorithm, experiments were performed on four datasets: the artificial two-dimensional two-moon dataset and the 20Newsgroups, Reuters and Email spam datasets. The experiments were organized from the perspectives of non-transfer, transfer, and improved transfer learning to evaluate the effectiveness of STL-SVM.

To verify the effectiveness of the proposed algorithm, STL-SVM is compared with the following algorithms: SVM [20], AESVM [16], TrSVM [8], ARSVM [22], ARRLS [22], STIL [30], STLCF [31], TL-DAKELM [32] and RankRE-TL [33], where SVM and AESVM do not have transfer ability. The classification accuracy, precision and recall commonly used for classification algorithms are chosen as the evaluation criteria. Accuracy is defined as: $Accuracy = \frac{| \{x_{t} \mid x_{t} \in D_{t} \wedge f (x_{t}) = y_{t}\} |}{| \{x_{t} \mid x_{t} \in D_{t}\} |}$

D_t represents the dataset in the target domain, y_t is the truth class label, and f (x_t) is the result of x_t predicted by the classifiers.

$Precision = \frac{TP}{TP + FP}$ , $Recall = \frac{TP}{TP + FN}$

TP is the number of positive class samples that are correctly classified as positive by the classifier; FP is the number of negative class samples that are incorrectly classified as positive; and FN is the number of positive class samples that are incorrectly classified as negative.
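The three criteria can be computed directly from the true and predicted labels. The following is a minimal sketch with made-up labels; the function name `classification_metrics` and the convention of returning 0 when a denominator is empty are assumptions, not part of the experimental code.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    tp = fp = fn = correct = 0
    for t, p in zip(y_true, y_pred):
        if t == p:
            correct += 1
        if p == positive and t == positive:
            tp += 1   # positive sample classified as positive
        elif p == positive and t != positive:
            fp += 1   # negative sample classified as positive
        elif p != positive and t == positive:
            fn += 1   # positive sample classified as negative
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up example: 4 positive and 6 negative target-domain samples.
y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, 1, 1, -1, -1, -1, -1]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
# Here TP = 3, FP = 2, FN = 1, so accuracy = 0.7, precision = 0.6, recall = 0.75.
```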

All experiments in this paper use a Gaussian kernel function. The kernel width parameter 2σ² is based on s, the square of the average 2-norm of the source domain samples, and its optimal value is searched in {s/64, s/32, s/16, s/8, s/4, s/2, s, 2s, 4s, 8s, 16s, 32s, 64s} using the grid search algorithm [17]. Similarly, the regularization parameters C_t and C_s and the balance parameter λ are searched in {10^-4, 10^-3, 10^-2, 10^-1, 10^0, 10^1, 10^2, 10^3, 10^4} with a grid search. For fairness, the classification accuracy obtained by 5-fold cross-validation is used for evaluation. All experiments were performed on a Windows 10 system with an Intel Core(TM) 3.6 GHz processor and 8 GB of memory. The SVM algorithm was implemented with the LibSVM software [20], and the other algorithms were implemented in the MATLAB R2009a environment. In the experiments, both the source and the target domains have labeled samples, and the labeled samples of the target domain were used only to evaluate the performance of the classification methods.
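The search space described above can be written down explicitly. The sketch below is only an illustration: the helper names are hypothetical, the sample points are made up, the parameter grid is taken as the nine powers of ten from 10^-4 to 10^4, and searching the full Cartesian product of the four parameters is an assumption about how the grid search is organized.

```python
import itertools
import math

def kernel_width_base(samples):
    """Return s, the square of the average 2-norm of the samples."""
    mean_norm = sum(math.sqrt(sum(v * v for v in x)) for x in samples) / len(samples)
    return mean_norm ** 2

def width_grid(s):
    """Candidate values of 2σ²: s/64, s/32, ..., s, ..., 32s, 64s."""
    return [s * 2.0 ** k for k in range(-6, 7)]

# Candidate values for C_t, C_s and λ: 10^-4, ..., 10^4.
PARAM_GRID = [10.0 ** k for k in range(-4, 5)]

# Made-up source-domain samples with 2-norms 5 and 10 (mean 7.5, s = 56.25).
samples = [(3.0, 4.0), (6.0, 8.0)]
s = kernel_width_base(samples)
widths = width_grid(s)

# Every (2σ², C_t, C_s, λ) combination a full grid search would evaluate.
combos = list(itertools.product(widths, PARAM_GRID, PARAM_GRID, PARAM_GRID))
```

Under these assumptions the grid holds 13 kernel widths and 9 values for each of the other three parameters, so a full search would score 13 × 9³ = 9477 settings, each by 5-fold cross-validation.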

4.1 Experiments on artificial two-moon datasets

We generated 200 unlabeled samples in the target domain with a mean value of 0 and a standard deviation [21]. The target domain dataset was rotated clockwise around its center five times, by 15° each time, to obtain five source domain datasets. In each of the five source domain datasets, 100 samples were randomly labeled as the positive class, and the remaining 100 samples were labeled as the negative class. Figure 3 (a)-(f) shows the target domain dataset and the five source domain datasets obtained after rotation. The larger the rotation angle, the greater the difference in distribution between the target domain and the source domain, and the more likely negative transfer is to occur. Because of limited space, we analyzed only the experimental results for rotations of 15°, 30° and 45°. Table 1 shows the average classification accuracy and standard deviation of SVM, AESVM, TrSVM, STL-SVM, STIL, STLCF, ARSVM, ARRLS, TL-DAKELM, and RankRE-TL. Table 2 compares the average training time of these algorithms. The average precision and standard deviation of these algorithms are given in Table 3, and the average recall and standard deviation are presented in Table 4.
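The construction of the source domains by rotation can be sketched as follows. This is an illustration under stated assumptions: the function name `rotate_clockwise` is hypothetical, the rotation center is taken to be the origin, and three placeholder points stand in for the 200 two-moon samples, whose generation is not reproduced here.

```python
import math

def rotate_clockwise(points, degrees, center=(0.0, 0.0)):
    """Rotate 2-D points clockwise by `degrees` around `center`."""
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        # A clockwise rotation by θ is the standard rotation by -θ.
        rotated.append((cx + dx * cos_t + dy * sin_t,
                        cy - dx * sin_t + dy * cos_t))
    return rotated

# Placeholder target-domain points (not the actual two-moon data).
target = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.5)]

# Five source domains, obtained by rotating the target in 15° steps.
sources = {angle: rotate_clockwise(target, angle) for angle in (15, 30, 45, 60, 75)}
```

Rotation preserves the distance of every point from the center, so the shape of the data is unchanged and only its orientation relative to the target domain differs.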

Fig. 3

Two-moon datasets.

Table 1

Comparison of average accuracy (%) with standard deviation on two-moon datasets

Datasets	AESVM	SVM	TrSVM	ARSVM	ARRLS	STL-SVM	STIL	STLCF	TL-DAKELM	RankRE-TL
15°	87.15 (0.02)	87.14 (0.01)	93.45 (0.00)	92.46 (0.09)	93.42 (0.06)	96.18 (0.08)	90.56 (0.12)	91.78 (0.11)	93.11 (0.10)	94.25 (0.09)
30°	87.16 (0.02)	87.15 (0.01)	85.15 (0.00)	90.21 (0.09)	91.88 (0.07)	92.65 (0.06)	84.45 (0.11)	85.01 (0.10)	91.25 (0.09)	91.58 (0.10)
45°	88.15 (0.02)	87.15 (0.02)	80.15 (0.00)	91.43 (0.08)	91.78 (0.06)	91.87 (0.06)	82.34 (0.10)	82.68 (0.09)	89.56 (0.10)	90.10 (0.08)

Table 2

Comparison of average running time (s) with standard deviation on two-moon datasets

Datasets	AESVM	SVM	TrSVM	ARSVM	ARRLS	STL-SVM	STIL	STLCF	TL-DAKELM	RankRE-TL
Time	6.25 × 10^-4	2.25 × 10^-4	6.55 × 10^-2	6.22 × 10^-2	6.35 × 10^-2	2.24 × 10^-3	3.66 × 10^-3	3.84 × 10^-3	6.41 × 10^-2	5.55 × 10^-2

Table 3

Comparison of average precision (%) with standard deviation on two-moon datasets

Datasets	AESVM	SVM	TrSVM	ARSVM	ARRLS	STL-SVM	STIL	STLCF	TL-DAKELM	RankRE-TL
15°	84.76 (0.02)	84.76 (0.01)	88.15 (0.01)	86.19 (0.08)	88.56 (0.06)	90.16 (0.07)	84.15 (0.09)	86.27 (0.08)	87.25 (0.10)	89.28 (0.08)
30°	84.88 (0.03)	84.19 (0.02)	80.22 (0.02)	80.56 (0.08)	83.43 (0.07)	88.55 (0.07)	79.75 (0.10)	78.98 (0.08)	83.36 (0.10)	84.39 (0.09)
45°	85.25 (0.01)	83.95 (0.04)	78.67 (0.01)	79.82 (0.09)	82.21 (0.06)	83.28 (0.08)	74.98 (0.10)	75.28 (0.09)	81.87 (0.11)	82.11 (0.08)

Table 4

Comparison of average recall (%) with standard deviation on two-moon datasets

Datasets	AESVM	SVM	TrSVM	ARSVM	ARRLS	STL-SVM	STIL	STLCF	TL-DAKELM	RankRE-TL
15°	80.15 (0.03)	81.26 (0.02)	82.11 (0.03)	82.16 (0.08)	83.23 (0.07)	84.18 (0.06)	80.25 (0.10)	79.95 (0.09)	80.85 (0.09)	83.11 (0.07)
30°	78.56 (0.02)	75.87 (0.01)	74.58 (0.03)	78.64 (0.09)	80.35 (0.06)	83.46 (0.08)	75.68 (0.10)	73.67 (0.08)	78.56 (0.10)	80.87 (0.08)
45°	79.27 (0.02)	74.11 (0.01)	69.65 (0.02)	70.62 (0.07)	80.35 (0.06)	80.25 (0.06)	69.97 (0.09)	70.57 (0.08)	76.98 (0.10)	78.11 (0.07)

From Tables 1 –4, the following conclusions can be drawn:

Because the SVM and AESVM algorithms have no transfer ability, they use only the samples in the target domain for training, so the differences between the source domain and the target domain do not affect their classification results. AESVM trains on a representative subset of the data; its classification performance is similar to that of SVM, but its training time decreases because the training sample size is reduced.

The transfer learning algorithms TrSVM, ARSVM, ARRLS, STL-SVM, STIL, STLCF, TL-DAKELM and RankRE-TL are trained using samples from both the source and target domains, so the trained classifiers classify better than the non-transfer SVM and AESVM algorithms. However, as the rotation angle increases, the difference between the data in the target domain and the source domain becomes larger and the classification performance of the transfer learning classifiers deteriorates, whereas the classification performance of SVM and AESVM is unaffected.

In terms of running time, SVM and AESVM require less training time than the transfer learning algorithms, but their classification accuracy is lower. The algorithms with selective functions, STL-SVM, STIL and STLCF, require fewer source training samples than TrSVM, ARSVM, ARRLS, TL-DAKELM and RankRE-TL, so their running time is reduced and the classifier training cost is relatively lower.

In Tables 3 and 4, STL-SVM has some advantages over the benchmark algorithms in both precision and recall.

4.2 Experiments on real datasets

In this paper, the 20Newsgroups, Reuters and Email spam datasets are used to evaluate the STL-SVM algorithm on real data.

The 20Newsgroups dataset [22] contains approximately 20,000 news documents, which are divided into 4 top categories, each containing 4 subcategories. See Table 5 for details. In the experiments, we randomly selected 2 of the 4 top categories, one as the positive class and the other as the negative class. Then, the samples were split according to the subcategories within the top categories to obtain the source domain and target domain datasets. The specific cross-domain task groups were comp vs rec (c vs r), comp vs sci (c vs s), comp vs talk (c vs t), rec vs sci (r vs s), rec vs talk (r vs t) and sci vs talk (s vs t). These task groups ensure that the source and target domains are related, because they come from the same top categories, and at the same time different, because they come from different subcategories.
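The construction of the six task groups can be sketched as follows. The subcategory names come from Table 5, but the function name `build_tasks` and, in particular, the rule that assigns the first two subcategories of each top category to the source domain and the remaining two to the target domain are assumptions made only for illustration; the paper does not state which subcategories go to which domain.

```python
from itertools import combinations

TOP_CATEGORIES = {
    "comp": ["comp.graphics", "comp.os.ms-windows.misc",
             "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware"],
    "rec": ["rec.autos", "rec.motorcycles",
            "rec.sport.baseball", "rec.sport.hockey"],
    "sci": ["sci.crypt", "sci.electronics", "sci.med", "sci.space"],
    "talk": ["talk.politics.guns", "talk.politics.mideast",
             "talk.politics.misc", "talk.religion.misc"],
}

def build_tasks(top_categories, n_source_subs=2):
    """Pair top categories and split their subcategories into two domains."""
    tasks = {}
    for pos, neg in combinations(top_categories, 2):
        pos_subs, neg_subs = top_categories[pos], top_categories[neg]
        tasks[f"{pos} vs {neg}"] = {
            # Assumed split: first subcategories -> source, the rest -> target.
            "source": {"positive": pos_subs[:n_source_subs],
                       "negative": neg_subs[:n_source_subs]},
            "target": {"positive": pos_subs[n_source_subs:],
                       "negative": neg_subs[n_source_subs:]},
        }
    return tasks

tasks = build_tasks(TOP_CATEGORIES)
task_names = list(tasks)
```

Pairing the four top categories gives exactly the six task groups listed above, and within each task the source and target domains share top categories but no subcategory.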

Table 5
The statistics information of 20Newsgroups and Reuters

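Datasets	Top categories	Sub-categories	Examples
20Newsgroups	comp	comp.graphics	970
20Newsgroups	comp	comp.os.ms-windows.misc	963
20Newsgroups	comp	comp.sys.ibm.pc.hardware	979
20Newsgroups	comp	comp.sys.mac.hardware	958
20Newsgroups	rec	rec.autos	987
20Newsgroups	rec	rec.motorcycles	993
20Newsgroups	rec	rec.sport.baseball	991
20Newsgroups	rec	rec.sport.hockey	997
20Newsgroups	sci	sci.crypt	989
20Newsgroups	sci	sci.electronics	984
20Newsgroups	sci	sci.med	987
20Newsgroups	sci	sci.space	985
20Newsgroups	talk	talk.politics.guns	909
20Newsgroups	talk	talk.politics.mideast	940
20Newsgroups	talk	talk.politics.misc	774
20Newsgroups	talk	talk.religion.misc	627
Reuters	orgs	many subcategories	1237
Reuters	people	many subcategories	1208
Reuters	place	many subcategories	1016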

The Reuters dataset [28] is similar to the 20Newsgroups dataset, and contains several top categories, each of which consists of a few subcategories. See Table 5 for details. The main top categories are orgs, people, and place. We used these categories to construct the learning task groups: orgs vs people (o vs pe), orgs vs place (o vs pl), and people vs place (pe vs pl).

The Email spam dataset was released by the ECML/PKDD 2006 Knowledge Discovery Challenge Contest [29], and includes three personal email datasets, U1, U2 and U3, and a public email dataset. Each personal email dataset contains 2,500 messages, half of which are normal and the other half are spam. The transfer learning tasks are U1 vs U2, U2 vs U3, and U3 vs U1.

Tables 6, 7, 8, and 9 show the average classification accuracy, average running time, average precision, and average recall of SVM, AESVM, TrSVM, ARSVM, ARRLS, STL-SVM, STIL, STLCF, TL-DAKELM, and RankRE-TL on the cross-domain tasks constructed from the real datasets. From these results, we can draw the following conclusions.

Table 6

Comparison of average accuracy (%) with standard deviation on real datasets

Datasets	AESVM	SVM	TrSVM	ARSVM	ARRLS	STL-SVM	STIL	STLCF	TL-DAKELM	RankRE-TL
c vs s	73.03 (1.87)	73.24 (1.88)	72.31 (1.58)	84.42 (1.22)	86.66 (1.27)	87.88 (1.12)	81.76 (1.68)	82.87 (1.58)	80.16 (1.49)	85.64 (1.18)
r vs t	69.15 (2.21)	69.98 (2.19)	80.11 (2.06)	96.02 (1.24)	96.75 (1.26)	97.06 (1.17)	79.28 (1.75)	80.27 (1.62)	78.97 (1.66)	92.27 (1.42)
r vs s	79.38 (2.84)	79.45 (2.83)	86.25 (1.58)	87.13 (1.12)	90.99 (1.15)	91.52 (1.06)	86.65 (1.26)	87.18 (1.61)	86.53 (1.72)	88.87 (1.37)
s vs t	75.87 (2.88)	76.25 (2.89)	85.87 (1.62)	88.97 (1.21)	91.12 (1.18)	92.57 (1.21)	82.71 (1.62)	83.27 (1.55)	84.72 (1.63)	92.13 (1.33)
c vs r	84.35 (2.56)	84.22 (2.55)	85.06 (1.86)	94.98 (1.23)	96.62 (1.32)	97.32 (1.18)	89.57 (1.64)	88.56 (1.51)	87.56 (1.64)	96.25 (1.21)
c vs t	92.01 (1.57)	91.51 (1.58)	94.55 (1.33)	97.43 (1.34)	97.98 (1.42)	98.25 (1.01)	93.56 (1.53)	92.69 (1.44)	91.75 (1.56)	96.21 (1.24)
o vs pe	69.7 (1.95)	70.01 (1.96)	74.15 (1.66)	86.99 (1.64)	87.16 (1.41)	87.35 (1.13)	78.64 (1.47)	79.56 (1.38)	80.07 (1.55)	80.25 (1.18)
o vs pl	70.31 (1.97)	70.12 (1.98)	72.18 (1.75)	77.78 (1.55)	76.98 (1.47)	77.89 (1.21)	75.66 (1.31)	74.89 (1.35)	75.69 (1.43)	76.65 (1.28)
pe vs pl	57.94 (2.77)	57.91 (2.75)	68.34 (1.85)	68.86 (1.42)	67.93 (1.51)	71.12 (1.15)	70.58 (1.58)	68.87 (1.65)	70.87 (1.67)	69.53 (2.15)
U1vsU2	96.26 (1.46)	96.21 (1.45)	95.98 (1.88)	93.89 (1.89)	94.87 (1.91)	98.11 (1.01)	94.82 (1.46)	95.83 (1.59)	93.88 (1.68)	96.26 (1.65)
U2vsU3	96.89 (1.34)	97.56 (1.35)	96.12 (1.27)	95.54 (1.09)	96.32 (1.15)	97.05 (0.98)	95.57 (1.28)	94.64 (1.45)	92.54 (1.32)	94.18 (1.16)
U3vsU1	89.17 (2.15)	89.78 (2.15)	89.03 (2.11)	90.35 (1.45)	89.32 (1.23)	94.55 (1.33)	91.29 (1.76)	92.16 (1.39)	90.65 (1.29)	88.65 (1.20)

Table 7

Comparison of average running time (s) with standard deviation on real datasets

Datasets	AESVM	SVM	TrSVM	ARSVM	ARRLS	STL-SVM	STIL	STLCF	TL-DAKELM	RankRE-TL
c vs s	0.32 (0.09)	1.18 (0.11)	7.99 (0.54)	7.67 (0.91)	7.88 (0.88)	0.55 (0.31)	1.55 (0.56)	2.48 (0.78)	5.66 (0.85)	6.45 (0.47)
r vs t	0.33 (0.07)	1.27 (0.12)	8.15 (0.49)	7.92 (0.96)	8.02 (0.91)	0.58 (0.32)	1.64 (0.52)	2.56 (0.75)	5.15 (0.83)	6.72 (0.55)
r vs s	0.35 (0.08)	1.36 (0.13)	8.47 (0.51)	9.01 (0.92)	9.11 (0.96)	0.66 (0.35)	1.52 (0.55)	2.28 (0.68)	4.52 (0.79)	5.99 (0.58)
s vs t	0.34 (0.06)	1.32 (0.11)	8.12 (0.48)	8.52 (0.96)	8.67 (0.92)	0.59 (0.29)	1.58 (0.58)	2.86 (0.72)	4.98 (0.86)	6.21 (0.64)
c vs r	0.33 (0.08)	1.25 (0.10)	8.26 (0.38)	8.82 (0.95)	9.10 (0.97)	0.58 (0.33)	1.43 (0.56)	2.18 (0.65)	4.35 (0.78)	5.37 (0.65)
c vs t	0.34 (0.07)	1.45 (0.12)	8.44 (0.56)	9.11 (0.94)	9.23 (0.95)	0.62 (0.28)	1.88 (0.61)	3.12 (0.72)	6.14 (0.92)	6.62 (0.61)
o vs pe	0.49 (0.06)	1.71 (0.14)	13.15 (0.60)	12.71 (0.93)	12.98 (0.99)	0.91 (0.24)	2.15 (0.65)	3.57 (0.81)	9.16 (1.13)	9.45 (0.65)
o vs pl	0.48 (0.07)	1.69 (1.15)	12.87 (0.61)	11.82 (0.89)	12.14 (0.92)	0.82 (0.23)	2.11 (0.63)	3.26 (0.79)	8.97 (1.25)	9.26 (0.68)
pe vs pl	0.47 (0.05)	1.65 (0.15)	12.66 (0.62)	11.12 (0.88)	11.23 (0.93)	0.77 (0.25)	1.98 (0.61)	2.88 (0.84)	8.56 (0.97)	8.88 (0.72)
U1vsU2	0.56 (0.05)	1.77 (0.11)	13.56 (0.66)	13.23 (1.05)	13.26 (1.11)	1.01 (0.36)	2.36 (0.72)	3.68 (0.85)	10.15 (1.05)	10.26 (0.65)
U2vsU3	0.51 (0.04)	1.74 (0.12)	13.44 (0.61)	12.45 (1.02)	12.56 (0.99)	0.98 (0.28)	2.26 (0.67)	3.21 (0.79)	9.78 (0.99)	9.13 (0.62)
U3vsU1	0.52 (0.04)	1.76 (0.13)	13.32 (0.63)	13.32 (1.13)	13.37 (1.16)	0.97 (0.35)	2.08 (0.75)	2.98 (0.87)	8.88 (1.27)	8.26 (0.74)

Table 8

Comparison of average precision (%) with standard deviation on real datasets

Datasets	AESVM	SVM	TrSVM	ARSVM	ARRLS	STL-SVM	STIL	STLCF	TL-DAKELM	RankRE-TL
c vs s	68.69 (2.07)	68.18 (2.16)	67.29 (2.15)	76.21 (1.77)	77.54 (1.65)	80.12 (1.18)	75.17 (2.15)	77.56 (2.75)	75.41 (2.14)	75.46 (2.32)
r vs t	64.26 (3.15)	64.21 (3.25)	75.46 (2.75)	71.32 (2.11)	72.34 (1.56)	78.23 (1.37)	74.72 (2.46)	75.32 (2.66)	73.29 (1.92)	77.61 (2.31)
r vs s	74.46 (2.18)	75.42 (2.69)	80.32 (1.97)	79.24 (1.98)	78.98 (1.87)	83.27 (1.23)	81.96 (2.96)	81.24 (2.16)	81.65 (2.52)	82.13 (2.53)
s vs t	69.56 (2.53)	71.75 (3.15)	78.53 (1.87)	75.28 (1.85)	76.22 (1.76)	79.31 (1.17)	77.67 (2.67)	76.85 (2.15)	77.27 (2.66)	78.23 (2.11)
c vs r	79.57 (2.03)	79.87 (2.87)	80.58 (2.15)	80.21 (2.76)	81.45 (2.54)	86.11 (1.12)	84.46 (2.24)	83.75 (2.72)	82.74 (2.72)	85.16 (1.75)
c vs t	86.12 (1.97)	85.65 (1.96)	87.82 (1.78)	87.65 (1.69)	87.01 (1.76)	88.75 (1.15)	87.95 (1.98)	87.16 (2.17)	85.36 (2.55)	87.24 (1.65)
o vs pe	64.65 (1.76)	66.13 (2.13)	69.86 (2.08)	73.85 (1.38)	74.32 (1.43)	76.56 (1.27)	72.46 (2.26)	74.45 (1.85)	74.36 (2.83)	74.94 (1.94)
o vs pl	65.73 (1.68)	65.24 (2.15)	67.25 (1.75)	69.76 (2.43)	70.11 (2.26)	71.26 (1.31)	69.53 (1.88)	68.26 (1.93)	70.56 (2.13)	69.56 (2.07)
pe vs pl	52.88 (2.64)	53.89 (2.85)	63.66 (2.35)	64.43 (2.19)	65.43 (1.97)	67.15 (1.39)	66.05 (2.63)	63.68 (2.09)	65.13 (2.25)	64.85 (3.11)
U1vsU2	90.12 (2.71)	91.35 (2.58)	90.05 (2.74)	87.87 (1.48)	88.87 (1.52)	92.21 (1.01)	89.68 (2.45)	90.38 (2.34)	88.72 (2.88)	91.72 (2.86)
U2vsU3	88.47 (2.38)	90.08 (2.64)	89.24 (1.88)	87.49 (1.85)	89.32 (1.78)	90.13 (1.18)	89.16 (2.06)	88.57 (2.57)	87.62 (1.96)	89.31 (2.51)
U3vsU1	83.47 (2.66)	84.25 (2.72)	85.18 (3.04)	83.98 (2.08)	84.28 (1.98)	89.15 (1.43)	86.83 (2.86)	87.81 (1.88)	85.28 (1.85)	84.86 (1.99)

Table 9

Comparison of average recall (%) with standard deviation on real datasets

Datasets	AESVM	SVM	TrSVM	ARSVM	ARRLS	STL-SVM	STIL	STLCF	TL-DAKELM	RankRE-TL
c vs s	63.16 (3.15)	63.64 (3.25)	64.52 (3.26)	71.15 (2.88)	72.32 (2.65)	75.23 (1.88)	71.54 (3.16)	72.25 (3.86)	70.13 (3.35)	71.26 (3.21)
r vs t	59.89 (3.72)	59.96 (3.63)	70.24 (3.65)	66.17 (3.21)	68.87 (2.98)	71.45 (2.36)	69.43 (3.72)	70.23 (3.59)	68.62 (2.99)	70.54 (2.65)
r vs s	69.27 (3.26)	70.74 (2.87)	74.73 (2.88)	74.98 (2.45)	75.11 (2.54)	76.28 (2.43)	74.12 (3.28)	74.98 (3.71)	73.14 (3.44)	72.91 (3.65)
s vs t	65.19 (3.18)	76.43 (4.03)	73.55 (2.78)	72.52 (2.73)	73.11 (2.54)	74.12 (2.25)	72.05 (3.67)	71.68 (3.36)	72.58 (3.76)	73.11 (3.73)
c vs r	74.76 (2.98)	74.88 (3.26)	79.26 (3.23)	79.72 (3.17)	80.34 (2.88)	81.42 (2.17)	79.45 (3.22)	79.87 (3.67)	77.75 (3.57)	79.32 (2.22)
c vs t	80.01 (2.15)	79.66 (2.39)	82.63 (2.65)	81.54 (2.52)	82.15 (2.45)	83.45 (2.23)	80.19 (2.15)	81.01 (3.54)	79.86 (3.85)	81.25 (2.48)
o vs pe	58.86 (2.55)	61.81 (3.27)	64.18 (3.12)	68.27 (2.37)	69.25 (2.26)	70.29 (2.32)	67.65 (3.55)	69.15 (2.98)	68.84 (3.88)	66.99 (2.89)
o vs pl	60.54 (1.97)	60.24 (3.28)	63.47 (2.82)	64.65 (2.43)	65.64 (2.31)	65.57 (2.43)	64.25 (2.68)	63.65 (2.59)	63.88 (3.71)	62.86 (3.16)
pe vs pl	48.92 (2.86)	49.03 (3.65)	59.36 (3.72)	59.88 (3.16)	60.23 (2.65)	61.33 (2.49)	60.13 (3.56)	58.95 (3.12)	59.81 (3.62)	57.78 (3.92)
U1vsU2	84.21 (3.55)	83.57 (3.25)	84.27 (3.27)	82.72 (2.85)	83.56 (2.58)	85.52 (2.16)	84.26 (3.81)	84.57 (3.63)	83.96 (3.58)	83.77 (3.58)
U2vsU3	82.68 (2.56)	83.05 (2.98)	84.64 (2.98)	81.35 (2.95)	83.22 (2.86)	86.28 (2.21)	84.51 (3.24)	83.85 (3.85)	82.86 (2.79)	83.83 (3.75)
U3vsU1	79.26 (2.71)	80.02 (3.67)	81.65 (4.12)	79.65 (3.06)	81.32 (2.55)	83.75 (2.55)	84.48 (3.75)	82.25 (2.97)	80.12 (2.98)	79.98 (2.69)

In terms of classification accuracy, Table 6 shows that because SVM and AESVM cannot use the knowledge in the source domain and train the classifier only on samples in the target domain, their classification performance was worse than that of TrSVM, ARSVM, ARRLS, STL-SVM, STIL, STLCF, TL-DAKELM and RankRE-TL. STL-SVM is better than most of the transfer learning algorithms and obtains competitive classification performance on all transfer learning tasks.

In terms of running time, Table 7 shows that the training time of STL-SVM was significantly shorter than that of TrSVM, ARSVM, ARRLS, STIL, STLCF, TL-DAKELM and RankRE-TL. The AESVM algorithm used only a small number of representative samples selected from the target domain, so it had the shortest training time. Although the experimental results on the real datasets showed that AESVM and SVM were also very efficient to train, their classification accuracy could not match that of the transfer learning algorithms.

As can be seen from Tables 8 and 9, the STL-SVM proposed in this paper effectively improved the precision and recall compared with SVM, AESVM, TrSVM, ARSVM, ARRLS, STIL, STLCF, TL-DAKELM and RankRE-TL.

Finally, to test the difference between the STL-SVM algorithm and the transfer learning algorithms with similar classification results in Table 10, the Wilcoxon signed rank test [38] was applied to these methods. Based on Table 6, the average classification accuracy of each algorithm on the real datasets 20Newsgroups, Reuters and Email spam was calculated. The results are shown in Table 10.

Table 10

Average classification accuracy (%) on real datasets

Datasets	AESVM	SVM	TrSVM	ARSVM	ARRLS	STL-SVM	STIL	STLCF	TL-DAKELM	RankRE-TL
20Newsgroups	78.97	73.07	84.03	91.49	93.35	93.76	85.59	85.81	84.95	86.79
Reuters	65.98	66.01	71.56	77.88	77.37	78.79	74.96	74.44	75.54	75.48
Email Spam	94.08	93.98	93.71	93.26	93.50	95.24	93.89	94.54	92.36	93.03

Table 10 shows the average classification accuracy of the algorithms on the real datasets. The results of the Wilcoxon test on the 20Newsgroups, Reuters and Email spam datasets are discussed below.

20Newsgroups: The classification accuracy of STL-SVM is only 0.41% higher than that of ARRLS. STL-SVM and ARRLS were therefore used to classify the six cross-domain tasks, with each task repeated five times; the values of W⁺ and W⁻ are +145 and −17, respectively. For a two-sided test with α = 0.05 and n = 30, the distribution table of the Wilcoxon signed rank test gives T_0.025 = 137. Because W⁺ > T_0.025, H₀ was accepted: there was no significant difference in the classification results between the two methods.

Reuters: The classification accuracy of STL-SVM is 0.91% higher than that of ARSVM. STL-SVM and ARSVM were used to classify the three cross-domain tasks, with each task repeated five times; the values of W⁺ and W⁻ are +112 and −28, respectively. For a two-sided test with α = 0.05 and n = 15, the distribution table of the Wilcoxon signed rank test gives T_0.025 = 25. Because W⁺ > T_0.025, H₀ was accepted: there was no significant difference in the classification results between the two methods.

Email spam: Compared with STLCF, the classification accuracy of STL-SVM increased by 0.7%. STL-SVM and STLCF were used to classify the three cross-domain tasks, with each task repeated five times; the values of W⁺ and W⁻ are +86 and −15, respectively. For a two-sided test with α = 0.05 and n = 15, the distribution table of the Wilcoxon signed rank test gives T_0.025 = 25. Because W⁺ > T_0.025, H₀ was accepted: there was no significant difference in the classification results between the two methods.
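For reference, the rank sums used in the Wilcoxon signed rank test can be computed from paired results as sketched below. The function name `signed_rank_sums` and the paired scores are made up for illustration and are not the experimental data; the sketch drops zero differences, assigns average ranks to ties, and returns both rank sums as magnitudes (the text above reports W⁻ with a negative sign). The comparison with the tabulated critical value is not included.

```python
def signed_rank_sums(a, b):
    """Return (W_plus, W_minus, n) for paired samples a and b."""
    # Zero differences carry no sign information and are dropped.
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the ranks i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, len(diffs)

# Made-up paired accuracies of two methods over five runs.
method_a = [90, 85, 78, 92, 88]
method_b = [88, 86, 75, 92, 84]
w_plus, w_minus, n = signed_rank_sums(method_a, method_b)
# Differences are 2, -1, 3, 0, 4: the zero is dropped, so n = 4,
# W_plus = 2 + 3 + 4 = 9 and W_minus = 1.
```

As a consistency check, the two rank sums always add up to n(n + 1)/2, where n is the number of non-zero differences.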

4.3 Parameter sensitivity analysis of STL-SVM

In this section, we performed a parameter sensitivity analysis to verify the best performance that the STL-SVM can achieve within a certain range of parameters. According to the objective function of STL-SVM in Equation (11), the parameters related to the algorithm include source domain structural risk regularization C_s, target domain regularization C_t and adjustable parameter λ. In this section, a sensitivity analysis of these three parameters was conducted to illustrate their impact on STL-SVM classification performance. For each parameter, we set the other two parameters as the best value obtained by cross validation, and then observed the effect of the parameters on the performance. The experimental results on 12 cross-domain classification tasks of real datasets are shown in Figs. 4 –6.

Fig. 4

Sensitivity of parameter C_s for STL-SVM.

The experimental results are analyzed in detail below.

Fig. 5

Sensitivity of parameter C_t for STL-SVM.

Fig. 6

Sensitivity of parameter λ for STL-SVM.

To find the optimal value of C_s, we first fix C_t = 10 and λ = 10, then search the values of C_s on the grid {10^-4, 10^-3, 10^-2, 10^-1, 10^0, 10^1, 10^2, 10^3, 10^4}, and record the experimental results for the different values of C_s on the real datasets, as shown in Fig. 4. Figure 4 shows that the classification performance of STL-SVM differs for different values of C_s, and that it is best on the 12 cross-domain tasks when C_s = 10². In the same way, we fix λ = 10 and C_s = 10² to find the optimal value of C_t. The experimental results for different values of C_t are shown in Fig. 5, from which we can conclude that STL-SVM achieves the best classification on most cross-domain classification tasks on the real datasets when C_t is set to the optimal value 1. This analysis shows that the average classification accuracy of STL-SVM on the 12 cross-domain classification tasks differs significantly for different values of C_s and C_t. STL-SVM is therefore sensitive to the regularization parameters C_s and C_t within a certain range, and the parameter values that give the best classification performance on the different cross-domain tasks are obtained through multiple experiments.

For the parameter λ, we fix C_s = 10² and C_t = 1 and obtain the experimental results in the same way as above, as shown in Fig. 6. From the results in Fig. 6, we can draw the following conclusions: when λ = 10, STL-SVM achieves the best classification performance on the 12 cross-domain classification tasks. If the value is too small, the differences between the source domain and the target domain are neglected, and the classification performance is unsatisfactory. If the value is too large, the difference in distribution between the source and target domains is made more prominent, which reduces the knowledge that can be transferred from the source domain, and the classification performance is also poor.

It can be seen from Figs. 4–6 and the above analysis that STL-SVM is sensitive to the regularization parameters C_s and C_t and the balance parameter λ within a certain range of values, so selecting the parameter values is crucial for the classification performance of STL-SVM.
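The one-parameter-at-a-time procedure used in this section can be sketched as follows. The function `one_at_a_time_sweep` is a hypothetical helper, and `toy_accuracy` is a made-up surrogate that merely peaks at C_s = 10², C_t = 1 and λ = 10; it stands in for training and scoring STL-SVM and does not reproduce the curves in Figs. 4–6.

```python
import math

def one_at_a_time_sweep(evaluate, best, grid):
    """Vary each parameter over `grid` while the others stay at `best`."""
    results = {}
    for name in best:
        curve = []
        for value in grid:
            params = dict(best)   # copy, so the fixed values are not modified
            params[name] = value
            curve.append((value, evaluate(**params)))
        results[name] = curve
    return results

GRID = [10.0 ** k for k in range(-4, 5)]  # 10^-4, ..., 10^4

def toy_accuracy(C_s, C_t, lam):
    """Made-up surrogate score; NOT the STL-SVM model."""
    return (100.0
            - (math.log10(C_s) - 2) ** 2
            - math.log10(C_t) ** 2
            - (math.log10(lam) - 1) ** 2)

best = {"C_s": 100.0, "C_t": 1.0, "lam": 10.0}
curves = one_at_a_time_sweep(toy_accuracy, best, GRID)
best_C_s = max(curves["C_s"], key=lambda pair: pair[1])[0]
```

Each curve has one point per grid value, which corresponds to one panel of a sensitivity plot; replacing the surrogate with a cross-validated classifier score would give the actual analysis.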

5 Conclusions and further works

Because of the shortcomings of traditional machine learning methods, this paper proposed the STL-SVM algorithm, based on transfer learning and SVM. STL-SVM uses AESVM to reduce the scale of the training samples in the source domain and speed up learning. Additionally, the MMD and the construction of the objective function are used to address the negative transfer problem that easily occurs in transfer learning. STL-SVM therefore achieves better classification results by effectively integrating the knowledge of the source and target domains during knowledge transfer. Experimental results on artificial and real datasets show the effectiveness of the proposed method. Further research on the STL-SVM algorithm will be conducted in the following aspects: 1) only binary classification has been explored, so STL-SVM will be extended to multi-class classification problems; 2) transferring knowledge from multiple source domains with STL-SVM needs to be studied.

Acknowledgments

This work was supported by the National Key Research and Development Plan of China (2016YFB0801004).

References

Pan

S.J.

and Yang

, A survey on Transfer Learning, IEEE Transactions on Knowledge & Data Engineering 22(10) (2010), 1345–1359.

Day

and Khoshgoftaar

T.M.

, A survey on heterogeneous transfer learning, Journal of Big Data 4(1) (2017), 29.

Weiss

, Khoshgofteaar

T.M.

and Wang

D.D.

, A survey of transfer learning, Journal of Big Data 3(1) (2016), 9.

Zhuang

F.Z.

, Luo

, He

and Shi

Z.Z.

, Survey on transfer learning research, Journal of Software 26(1) (2015), 26–39.

Gou

, Wang

, Jiao

, et al., Distributed Transfer Network Learning Based Intrusion Detection, IEEE International Symposium on Parallel & Distributed Processing with Application, 2009.

Chen

, Yan

, Xue

, et al., Transfer learning for behavioral targeting, [ACM Press the 19th international conference-Raleigh, North Carolina, USA (2010.04.26-2010.04.30)] Proceedings of the 19th international conference on World wide web - WWW ´’10, (2010), 1077.

Dai

, Xue

G.R.

, Yang

, et al., Transferring Naive Bayes Classifiers for Text Classification, Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence (2007), 22–26.

Jiaming

, Jian

, Yun

, et al., TrSVM: A Transfer Learning Algorithm Using Domain Similarity, Journal of Computer Research and Development 48(10) (2011), 1823–1830.

Gao

, Fan

, Jiang

, et al., Knowledge transfer via multiple local structure mapping, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Los Vegas, Nevada, USA, (2008), 283–291.

10.

Quanz

and Huan

, Large margin transductive transfer learning, Proceeding of 18^th ACM Conference on Information and Knowledge Management (2009), 1327–1336.

11.

Zhou

, Machine Learning, Tsinghua University Press, (2016), pp. 121–140.

12.

, Statistical Learning Method, Tsinghua University Press, (2012), pp. 95–130.

13.

Yang

and Tian

, Support Vector Machines: Theory, Algorithms and Extensions, Science Press, (2009), pp. 102–203.

14.

Vapnik

, The nature of statistical learning theory, New York: Springer-Verlag, (2013), pp. 593–609.

15.

Chau

A.L.

, Li

and Yu

, Convex and concave hulls for classification with support vector machine, Neurocomputing 122(1) (2013), 198–209.

16.

Dong

J.X.

, Krzyzak

and Suen

C.Y.

, A Fast SVM Training Algorithm, Springer Berlin Heidelberg, (2013), pp. 455–474.

17.

Nandan

, Khargonekar

P.P.

and Talathi

S.S.

, Fast SVM training using approximate extreme points, Journal of Machine Learning Research 15(1) (2013), 59–98.

18.

, Hong

, Bo

, et al., Online Support Vector Machine Based on Convex Hull Vertices, IEEE Transactions on Neural Networks & Learning Systems 24(4) (2013), 593–609.

19.

Gretton

, A Kernel Method for the Two-Sample-Problem, Advances in Neural Information Processing Systems 2368 (2006), 513–520.

20.

Chang

C.C.

and Lin

C.J.

, LIBSVM: A library for support vector machines 2(3) (2011), 1–27.

21. Haykin S.S., Neural networks and learning, China Machine Press, (2009), pp. 32–41.

22. Long, Wang, Ding, et al., Adaptation Regularization: A General Framework for Transfer Learning, IEEE Transactions on Knowledge and Data Engineering 26(5) (2014), 1076–1089.

23. Suzuki and Shouno, Support Vector Machine Histogram: New Analysis and Architecture Design Method of Deep Convolution Neural Network, Neural Processing Letters 2017(4) (2017), 1–16.

24. Suykens J.A.K., Lukas, Sparse Least Squares Support Vector Machine Classifiers, Neural Processing Letters 9(3) (2000), 293–300.

25. Suykens J.A.K., Lukas, Sparse Least Squares Support Vector Machine Classifiers, Neural Processing Letters 9(3) (1999), 293–300.

26. Yang, Zhou and Jiang, Least Squares support vector machine with parametric margin for binary classification, 30(5) (2016), 2897–2904.

27. Maheswazi R.V., Subburaj, Vigneshwaran, et al., Non Linear support vector machine based partial discharge patterns recognition using fractal features, Journal of Intelligent & Fuzzy Systems 27(5) (2014), 2649–2664.

28. Long, Wang, Ding, et al., Transfer Learning with Graph Co-Regularization, (2014).

29. Bickel, ECML-PKDD discovery challenge 2006 overview, in Proc. ECML/PKDD Discovery Challenge Workshop, (2006).

30. Xie, Sun, Lin, et al., A Selective Transfer Learning Method for Concept Drift Adaptation, 14th International Symposium on Neural Networks (ISNN) 2017(10262), 353–361.

31. , Zhong, Zhao, et al., Selective Transfer Learning for Cross Domain Recommendation, SIAM International Conference on Data Mining 2013, IEEE, (2013).

32. , Mao and Jiang, Extreme learning machine based transfer learning for data classification, Neurocomputing 174 (2016), 203–210.

33. and Dai, A novel knowledge-leverage-based transfer learning algorithm, Applied Intelligence 48(8) (2018), 2355–2372.

34. Yang F.W., Lin H.J., Yen S.H., et al., A study on the convolutional neural algorithm of image style transfer, International Journal of Pattern Recognition and Artificial Intelligence 5(33) (2019).

35. Wang, Zheng V.W., Chen, et al., Deep Transfer Learning for Cross-domain Activity Recognition, Proceedings of the 3rd International Conference on Crowd Science and Engineering (ICCSE'18), Singapore, (2018), 1–8.

36. Vogado Luis H. S., Veras, et al., Leukemia diagnosis in blood slides using transfer learning in CNNs and SVM for classification, Eng Appl of AI 72 (2018), 415–422.

37. Yang, When Deep Learning Meets Transfer Learning, ACM Conference on Information & Knowledge Management, (2017).

38. Wilcoxon, Individual Comparisons by Ranking Methods, Biometrics Bulletin 1(6) (1945), 80–83.