Abstract
Transfer learning is a new machine learning method that solves problems in different but related target domains by utilizing the knowledge in existing data. Based on the classical SVM algorithm and transfer learning, this paper proposes a selective transfer learning support vector machine (STL-SVM) algorithm. First, STL-SVM uses the maximum mean discrepancy to compute a weight vector for the source domain samples relative to the target domain, and selects samples from the source domain according to these weights to avoid negative transfer. Then, the knowledge in the source domain is learned by the approximate extreme points support vector machine at the minimum training data cost. Finally, the objective function is constructed from the obtained knowledge and the soft-margin SVM. The constraint conditions of the objective function select the learned knowledge that is highly correlated with the target domain, which further avoids negative transfer in principle. STL-SVM solves the problem of negative transfer and has considerable advantages in training time compared with existing algorithms. Experimental results on artificial and real datasets show the effectiveness of the proposed algorithm.
Introduction
Currently, machine learning algorithms are widely used in text classification, image processing, computer security and artificial intelligence, and are achieving great results. However, the shortcomings of traditional machine learning methods restrict further development in these fields. To make the classifier model more accurate, traditional machine learning methods generally require the training and test samples to satisfy two basic assumptions [1–4]: 1) the training and test data must be independent and identically distributed; 2) many training samples are necessary to learn a good classifier model. However, in practical applications, especially in emerging applications such as text mining, bioinformatics, distributed sensor networks and social network research, training and test data rarely satisfy the independent and identical distribution condition [2]. In addition, the data resources in some areas are often scarce, and the cost of collecting these data is high. Even when there are enough training samples in a field, labeling these samples is time consuming and laborious. Therefore, insufficient training samples and training and test samples that are not independent and identically distributed are the main problems faced by traditional machine learning.
In recent years, the rise of transfer learning has provided an effective method for solving the above problems of traditional machine learning [1, 2]. The purpose of transfer learning is to use the labeled or unlabeled data in some domains to assist the learning tasks in similar target domains, which contain only a small amount of labeled or unlabeled data, so that the learning task in the target domain becomes more accurate. Wikipedia describes transfer learning as follows: transfer learning is a new machine learning method that uses existing knowledge to solve problems in different but similar fields. It no longer follows the two basic assumptions of traditional machine learning. Instead, existing knowledge is transferred to solve problems in a target domain that has only a small amount of labeled or unlabeled data [3]. Figure 1 shows the difference between traditional machine learning and transfer learning. As Fig. 1(a) shows, each learning task starts from scratch in traditional machine learning, whereas in Fig. 1(b) transfer learning transfers the knowledge from previous learning tasks to the current target learning task.

Fig. 1. Differences between traditional machine learning and transfer learning.
In transfer learning, the more the source domain and the target domain have in common, the easier the transfer. Otherwise, the transfer is more difficult, negative transfer may occur, and poorer learning results are obtained in the target domain. For example, having learned to ride a bicycle (source domain) makes it easier to learn to ride a motorcycle (target domain) but provides no help in learning the piano (target domain); a person who is familiar with Chinese chess (source domain) can also transfer that knowledge to speed up learning international chess (target domain). Transfer learning has been widely studied and has been successfully applied in many real applications: distributed network intrusion detection [5], behavioral positioning [6], text classification [7, 22], image classification [34], human activity recognition [35] and leukemia diagnosis [36].
Representative transfer learning algorithms are as follows. Hong Jiaming et al. [8] used domain similarity knowledge and transfer learning theory to propose the transfer SVM (TrSVM). Gao et al. [9] proposed the local weighted embedded transfer learning algorithm (LWE). Brain et al. [10] proposed a feature space-based transductive transfer learning algorithm, the large-margin projected transductive SVM (LMPROJ). Long et al. [22] proposed a least squares transfer learning framework based on SVM, adaptation regularization based transfer learning (ARTL). Xie et al. [30] applied transfer learning to incremental learning and proposed the selective transfer incremental learning (STIL) algorithm. Lu et al. [31] proposed the selective transfer learning for collaborative filtering (STLCF) algorithm. Li et al. [32] proposed a new transfer learning extreme learning machine (TL-ELM) and a transfer learning domain adaptation kernel extreme learning machine (TL-DAKELM), both based on the extreme learning machine. Li et al. [33] proposed the rank-based reduce error transfer learning algorithm (RankRE-TL). In addition, with the rapid development of neural networks and deep learning, researchers have proposed some novel ideas [34–37] that combine transfer learning with neural networks for different application scenarios. Combining the two can effectively alleviate the difficulty that deep learning has with sparse training data.
However, existing transfer learning algorithms generally suffer from long training times and high complexity when the source domain contains a large-scale dataset, because they learn from all the source domain samples and transfer that knowledge to the target domain, even though not all of the source domain knowledge needs to be transferred. In addition, their handling of negative transfer is unsatisfactory, which results in poor classification results. These issues restrict further improvement of the performance and classification effect of classifiers trained by transfer learning algorithms. In summary, the motivation of this paper is to propose an STL-SVM algorithm based on the support vector machine (SVM) to overcome the shortcomings of traditional machine learning algorithms, namely insufficient training samples and training and test samples that are not independent and identically distributed. STL-SVM uses a large number of labeled samples in the source domain to assist in building classification models for target domains that contain only a small number of labeled or unlabeled samples. Unlike existing transfer learning algorithms, STL-SVM does not need all the samples in the source domain and can deal with negative transfer more effectively, thus improving the performance and classification effect of the transfer learning classifier.
The core idea of the STL-SVM algorithm is as follows. First, the improved maximum mean discrepancy (MMD) method is used to calculate a weight vector that expresses the importance of each source domain sample relative to the target domain. Then, the approximate extreme points support vector machine (AESVM) is used to select a representative dataset, together with the weights of its samples, from the large number of labeled data in the source domain, and the negative transfer problem is preliminarily addressed by combining these weights with those calculated by the MMD. Finally, an objective function with transfer learning ability is constructed by building on the support vector machine. The constraint conditions of the objective function restrict negative transfer again by using the decision values (knowledge) obtained from training in the source domain and the target domain. The resulting objective function of the learning model is a quadratic programming problem.
The remainder of this paper is organized as follows. Section 2 describes the principle of MMD, the soft margin model of the SVM and the approximate extreme points SVM. Section 3 presents the main content of the paper: the detailed construction of the STL-SVM objective function, together with the theoretical reasoning and proofs for it. Section 4 reports experimental results that validate the effectiveness of STL-SVM. Section 5 summarizes the paper and discusses future work.
Maximum mean discrepancy
For the transfer learning algorithm, the distribution differences between the source domain and the target domain seriously affect the classification effect. Currently, most transfer learning algorithms often use all samples in the source domain for training. If the differences between the source domain and target domain are too large, the effect of the learning task in the target domain becomes worse, and negative transfer occurs. Currently, a common method for effectively estimating the distance between two distributions is the MMD [10]. Below, we outline the process of the MMD method to estimate the distance of two distributions.
A training dataset in source domain is given
In Equation (1), $\phi(\cdot)$ is a nonlinear mapping function. When the MMD method measures the difference between the source domain and the target domain, all samples of the source domain need to be used. To fully consider the importance of each source domain sample, set the weight of the $i$-th sample in the source domain as $w_i$ ($0 \le w_i \le 1$). Equation (1) is then improved to obtain a weight-based MMD method, WMMD.
In Equation (2), the first term on the right of the equation is the constant, which can be ignored, and only the last two terms are minimized. Therefore, Equation (3) is as follows:
To simplify Equation (3), we can obtain the objective function in Equation (4):
In Equation (4), $K = K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ and
The objective function is a standard quadratic programming problem, so it can be solved with a quadratic programming solver to obtain the weight of each sample in the source domain. Selecting the source domain knowledge to transfer according to the values of $w_i$ can effectively avoid negative transfer.
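As a rough illustration of the weighting idea, the following Python sketch minimizes a weighted MMD objective by projected gradient descent. The exact form of Equation (4) is not reproduced in this text, so the objective used here, $\frac{1}{n^2} w^T K_{ss} w - \frac{2}{n} w^T \kappa$ with $\kappa_i = \frac{1}{m}\sum_j K(x_i^s, x_j^t)$ and the box constraint $0 \le w_i \le 1$, is an assumed standard form. The function names, the kernel width, the solver and the toy data are hypothetical choices, not the authors' implementation.

```python
import math

def gaussian_kernel(x, z, two_sigma_sq=1.0):
    """Gaussian kernel exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / two_sigma_sq)

def wmmd_weights(source, target, two_sigma_sq=1.0, iters=500):
    """Projected gradient descent on an ASSUMED weighted-MMD objective."""
    n, m = len(source), len(target)
    k_ss = [[gaussian_kernel(a, b, two_sigma_sq) for b in source] for a in source]
    # kappa[i]: average kernel similarity of source sample i to the target set.
    kappa = [sum(gaussian_kernel(a, t, two_sigma_sq) for t in target) / m
             for a in source]
    # Kernel entries are at most 1, so the gradient is (2/n)-Lipschitz
    # and a step size of n/2 is safe.
    step = n / 2.0
    w = [1.0] * n  # start from "use every source sample fully"
    for _ in range(iters):
        grad = [(2.0 / n ** 2) * sum(k_ss[i][j] * w[j] for j in range(n))
                - (2.0 / n) * kappa[i] for i in range(n)]
        # Gradient step, then projection onto the box 0 <= w_i <= 1.
        w = [min(1.0, max(0.0, w[i] - step * grad[i])) for i in range(n)]
    return w

# Toy data (made up): the last source sample lies far from the target domain.
target = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
source = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0)]
weights = wmmd_weights(source, target)
print(weights)
```

In this toy example the source sample far from the target receives a weight close to zero, which is the behaviour the selection step relies on. A dedicated quadratic programming solver would normally replace the simple projected gradient loop.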
The SVM, which was formally proposed by Vapnik in 1995 [14], is a kind of classifier for binary-classification problems, whose basic model is the linear classifier defined in the feature space [11–13]. SVM is based on the theory of the Vapnik-Chervonenkis (VC) dimension and the structural risk minimization principle of statistical learning theory, which seeks the best compromise between model complexity and learning ability based on limited sample information to obtain the best learning ability. The learning strategy of the SVM is to maximize the margin which can be formalized as a solution to the convex quadratic programming problem. Therefore, the SVM algorithm is an optimization algorithm for solving convex quadratic programming.
The SVM assumes that the training samples are linearly separable in the sample or feature space. However, linearly separable data are rare in practice. To alleviate this situation, the SVM is allowed to make mistakes on some samples, which introduces the ‘soft margin’ SVM. Typical algorithms derived from the SVM are as follows. In [23], a new analysis method for deep convolutional neural networks (DCNN), the SVM histogram, was proposed; it uses the decision boundary of a linear SVM to check the spatial distribution of the feature representation extracted by the DCNN, and the experimental results show that the method has higher precision than the original DCNN. In [24] and [25], the researchers proposed the least squares SVM, which replaces the original quadratic programming problem with a set of linear equations by modifying Vapnik’s original SVM formulation to use equality constraints in regression mode. In [26], a least squares SVM with parametric margin, Par-LSSVM, was proposed; Par-LSSVM can deal with “cross planes” datasets. In [27], a pattern recognition algorithm based on the nonlinear SVM was proposed. The basic form of a typical soft-margin SVM is described below.
In Equation (5), $\xi_i \in \{\xi_1, \xi_2, \ldots, \xi_n\}$ is the distance of each sample from the optimal hyperplane, also known as the “slack variable”; $w$ is the normal vector of the classification hyperplane; and $C$ ($C > 0$) is a constant penalty coefficient that is generally determined by the specific application problem.
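To make the roles of $w$, $\xi_i$ and $C$ concrete, the sketch below evaluates a soft-margin objective for a given hyperplane. Equation (5) is not reproduced in this text, so the commonly used primal form $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i$ subject to $y_i(w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ is assumed, with a linear feature map for simplicity. The function names and the toy samples are hypothetical.

```python
def decision_value(w, b, x):
    """Linear decision value w^T x + b (feature map assumed to be the identity)."""
    return sum(wk * xk for wk, xk in zip(w, x)) + b

def slack_variables(w, b, samples, labels):
    """Smallest feasible slack xi_i = max(0, 1 - y_i (w^T x_i + b)) per sample."""
    return [max(0.0, 1.0 - y * decision_value(w, b, x))
            for x, y in zip(samples, labels)]

def soft_margin_objective(w, b, samples, labels, c):
    """Assumed primal objective: 0.5 * ||w||^2 + C * sum of slacks."""
    xi = slack_variables(w, b, samples, labels)
    return 0.5 * sum(wk * wk for wk in w) + c * sum(xi)

# Toy data (made up): the second sample lies inside the margin and the
# fourth sample is on the wrong side of the hyperplane x1 = 0.
samples = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0), (0.2, 0.0)]
labels = [1, 1, -1, -1]
xi = slack_variables([1.0, 0.0], 0.0, samples, labels)
objective = soft_margin_objective([1.0, 0.0], 0.0, samples, labels, c=10.0)
print(xi, objective)
```

A larger $C$ makes each unit of slack more expensive, so the optimizer tolerates fewer margin violations at the cost of a narrower margin.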
The soft margin SVM, which solves the linear inseparability problem through slack variables, requires the training and test samples to follow the same distribution. However, for a target domain with a small number of training samples, the soft margin SVM is not sufficient for obtaining an accurate classification model. In this situation, transferring knowledge from the abundant training samples in the source domain into the target domain speeds up the learning tasks in the target domain and alleviates the accuracy degradation caused by a lack of training samples. In addition, special attention must be paid to negative transfer when knowledge is transferred. Once negative transfer occurs, a classifier that uses the knowledge of a similar domain may perform even worse than one that does not.
The basic idea of the SVM is to find the hyperplane with the largest margin between two classes, so from a geometric point of view, computing the hyperplane is equivalent to finding the nearest points between two convex hulls [18, 19]. For the SVM, a large number of training samples is a prerequisite for good training results. However, a large number of training samples requires considerable manpower for labeling, and considerable time is consumed in the training phase, so the training efficiency of the SVM is not satisfactory. To improve the efficiency of SVM training, the approximate extreme points SVM (AESVM), a method that trains the SVM using training samples near the hyperplane, was proposed in [20]. AESVM does not require all training samples to train the learning model, which greatly reduces the scale of the training set and therefore the cost of training classifiers. It was proved theoretically in [20] and [21] that, within a certain range, the performance of AESVM trained on the representative samples is very close to that of the SVM trained on all samples.
Given a training dataset $X = \{x_1, x_2, \ldots, x_n\}$ with the corresponding class labels $Y = \{y_1, y_2, \ldots, y_n\}$, $y_i \in \{1, -1\}$, the optimization problem of AESVM is described as Equation (6):
In Equation (6), $M$ is the number of training samples in the representative dataset $X^*$, which is selected from the dataset $X$. The parameters $w$, $b$, $i$ and $C$ have the same meanings as in Equation (5). $\beta = [\beta_1, \beta_2, \ldots, \beta_M]$ is the weight vector corresponding to the representative dataset in Equation (9), $l$ is the hinge loss function, $l(w, b, \phi(x_i)) = \max\{0, 1 - y_i(w^T \phi(x_i) + b)\}$, $x_i \in X^*$, and $\phi(\cdot)$ is a nonlinear mapping function.
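The following sketch illustrates why a weighted loss over a small representative set can stand in for the loss over all samples. Equation (6) is not reproduced in this text, so the form $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{M} \beta_i\, l(w, b, \phi(x_i))$ is an assumption (the scaling of $C$ in particular may differ in [20]), the feature map is taken as the identity, and all names and data are hypothetical. In this idealized toy case each representative exactly duplicates the samples it stands for, so the two objectives coincide; in general the match is only approximate.

```python
def hinge(w, b, x, y):
    """Hinge loss l = max{0, 1 - y (w^T x + b)} with an identity feature map."""
    return max(0.0, 1.0 - y * (sum(wk * xk for wk, xk in zip(w, x)) + b))

def weighted_objective(w, b, samples, labels, beta, c):
    """ASSUMED AESVM-style objective: 0.5*||w||^2 + C * sum_i beta_i * hinge_i."""
    reg = 0.5 * sum(wk * wk for wk in w)
    loss = sum(bt * hinge(w, b, x, y) for x, y, bt in zip(samples, labels, beta))
    return reg + c * loss

w, b, c = [1.0, 0.0], 0.0, 1.0

# Full dataset (made up): three copies of one point and two copies of another.
full_x = [(0.5, 0.0)] * 3 + [(-0.2, 0.0)] * 2
full_y = [1, 1, 1, -1, -1]
full = weighted_objective(w, b, full_x, full_y, [1.0] * 5, c)

# Representative set: one sample per group, weighted by the group size.
rep_x = [(0.5, 0.0), (-0.2, 0.0)]
rep_y = [1, -1]
reduced = weighted_objective(w, b, rep_x, rep_y, [3.0, 2.0], c)

print(full, reduced)
```

Training on the two weighted representatives touches two samples instead of five, which is where the reduction in training cost comes from.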
To obtain the representative dataset $X^*$, the dataset $X$ is first divided into groups according to a certain separation strategy, $X = \{X_1, X_2, \ldots, X_{n/V}\}$, where $n$ is the number of samples in $X$, $V$ is the predefined maximum number of samples in each group, and $X_q$ ($q = 1, 2, \ldots, n/V$) denotes the $q$-th group. The samples within each group are highly similar, whereas the similarity between samples in different groups is low.
First, calculate the initial representative dataset using the support vector data description (SVDD) algorithm [19]. Then, in descending order of the distance between the sample and the center of the SVDD kernel space, determine whether the sample
In Equation (7),
In Equation (8), γi,j =
Based on SVM, this paper constructs an STL-SVM model by using MMD, AESVM and transfer learning theory. The algorithm framework is shown in Fig. 2. As seen in Fig. 2, the STL-SVM uses the weight vector of the importance of the source domain samples relative to the target domain calculated by the MMD algorithm, the knowledge of the selected representative dataset acquired by the AESVM, and the target domain dataset to train the objective function of STL-SVM. Finally, we obtain the classifier.

Fig. 2. Framework of STL-SVM.
To give the support vector machine transfer ability, the objective function of STL-SVM is built on Equation (5) and Equation (6). It is assumed that there are a large number of labeled samples in the source domain and a small number of labeled samples in the target domain. The probability distributions of the samples in the two domains do not have to be the same. Without loss of generality, only binary classification is considered in this paper.
Combined with the weight of each sample in the source domain calculated by MMD in Section 2.1, the weights of samples are corrected according to the weights of the representative dataset obtained by the AESVM algorithm. This process can also be called sample selection and is given by Equation (10):
A source domain $D_s$ contains n samples,
Similarly, for a target domain that contains m samples,
In Equation (11), $w_t$ and $b_t$ are the parameters of the target domain, $w_s$ and $b_s$ are the parameters of the source domain, and these parameters contain the knowledge of the two domains.
For the objective function in Equation (11), give the following description:
Constraint condition The constraint condition
Finally, the decision function of the STL-SVM algorithm can be described as Equation (12):
The STL-SVM algorithm is compared with current transfer learning algorithms such as TrSVM [8], LMPROJ [10], LWE [9], ARTL [22] and STIL [30]. The difference is that STL-SVM considers both negative transfer and the long training time caused by too many training samples in the source domain, so it has considerable advantages on large-scale datasets.
To solve the objective function of the STL-SVM algorithm, the dual problem of formula (11) must first be obtained, then the problem is proven to be a quadratic programming problem, and the quadratic programming problem obtained is finally further proved to be a convex quadratic programming problem. The quadratic programming problem can be solved and the result is a global optimal solution.
The original problem in Equation (11) can be translated into the following dual problem:
In Equation (13):
The Lagrange function of Equation (14) is:
In Equation (14), $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_M]$,
By the Karush–Kuhn–Tucker (KKT) condition, the following equations are obtained:
Substituting Equation (15) to Equation (20) into Equation (11) yields the dual form shown in Equation (21) after simplification:
Substituting Equation (22) into Equation (21) yields Equation (13).
To prove that Equation (13) is a convex quadratic programming problem, we need to prove that
Since the quadratic programming of Equation (13) is a convex quadratic programming problem, the KKT condition is also a sufficient condition, so the solution of the quadratic programming is the global optimal solution.
The optimal solutions given in Equation (24) and Equation (25) contain both knowledge in the target domain and the source domain, for example, in
According to the description of 3.1 and 3.2, the proposed STL-SVM algorithm flow is as follows:
The complexity of constructing the STL-SVM classifier is calculated as follows: given a source domain containing n labeled samples and a target domain containing m labeled samples, with M representative samples selected from the source domain by the AESVM algorithm, the time complexity of constructing the classifier is $O((M + n)^3)$.
STL-SVM reduces the size of the dataset in the source domain by AESVM and the time complexity of learning tasks in the source domain to acquire knowledge. At the same time, it uses the improved MMD method and decision value constraints to restrict negative transfer. Compared with previous transfer learning algorithms, STL-SVM uses the knowledge of source domain data to the greatest extent at the least cost of samples, improves the training efficiency, and better solves the problem of negative transfer. Therefore, compared with existing machine learning algorithms with transfer ability, the STL-SVM algorithm has certain advantages in classification accuracy and algorithm performance.
Experimental results
To comprehensively evaluate the performance of the STL-SVM algorithm, experiments were performed on different datasets: the artificial two-dimensional two-moon dataset and the 20Newsgroups, Reuters and Email spam datasets. The experiments were organized from the perspectives of non-transfer, transfer and improved transfer to evaluate the effectiveness of the STL-SVM proposed in this paper.
To verify the effectiveness of the proposed algorithm, the following algorithms are compared: SVM [20], AESVM [16], TrSVM [8], ARSVM [22], ARRLS [22], STL-SVM, STIL [30], STLCF [31], TL-DAKELM [32] and RankRE-TL [33], where SVM and AESVM do not have transfer ability. To evaluate the algorithms, we use the classification accuracy, precision and recall, which are commonly used for classification algorithms, as the evaluation criteria. They are expressed as:
$D_t$ represents the dataset in the target domain, $y_t$ is the true class label, and $f(x_t)$ is the label predicted for $x_t$ by the classifier.
TP represents the number of positive class samples that are correctly classified as positive by the classifier; FP is the number of negative class samples that are incorrectly classified as positive; and FN is the number of positive class samples that are incorrectly classified as negative.
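The three criteria can be computed from the definitions above as in the short sketch below. The formulas are the standard ones (accuracy as the fraction of target samples with $f(x_t) = y_t$, precision as TP/(TP+FP), recall as TP/(TP+FN)); the function name, the handling of empty denominators and the toy labels are assumptions made for illustration.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Returning 0.0 when a denominator is empty is a convention assumed here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy labels (made up): TP = 2, FP = 1, FN = 1, four of six predictions correct.
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1]
accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, precision, recall)
```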
All experiments in this paper use a Gaussian kernel function. The kernel width parameter $2\sigma^2$ is based on $s$, the square of the average 2-norm of the source domain samples, and its optimal value is found with the grid search algorithm [17] over {s/64, s/32, s/16, s/8, s/4, s/2, s, 2s, 4s, 8s, 16s, 32s, 64s}. Similarly, the regularization parameters $C_t$ and $C_s$ and the balance parameter λ are searched over {$10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}, 10^{4}$} with a grid search. For fairness, we take the classification accuracy obtained by 5-fold cross-validation as the evaluation. All experiments were performed on an Intel Core(TM) 3.6 GHz machine with 8 GB of memory running Windows 10. The SVM algorithm was implemented with the LibSVM software [20], and the other algorithms were implemented in the MATLAB R2009a environment. In the experiments, both the source and the target domains have labeled samples, and the labeled samples of the target domain were used only to evaluate the performance of the classification method.
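The search space described above can be enumerated as in the following sketch. It assumes that $s$ is computed as the squared mean of the sample 2-norms and that all four parameters are searched jointly; the function names and the two-sample toy source domain are hypothetical, and the cross-validation loop that would score each combination is omitted.

```python
import math
from itertools import product

def kernel_width_base(source):
    """s: square of the average 2-norm of the source domain samples."""
    mean_norm = sum(math.sqrt(sum(v * v for v in x)) for x in source) / len(source)
    return mean_norm ** 2

def parameter_grid(source):
    """All (2*sigma^2, C_t, C_s, lambda) combinations of the stated grids."""
    s = kernel_width_base(source)
    widths = [s * 2.0 ** k for k in range(-6, 7)]   # s/64, ..., s, ..., 64s
    powers = [10.0 ** k for k in range(-4, 5)]      # 10^-4, ..., 10^4
    return list(product(widths, powers, powers, powers))

# Toy source domain (made up): both samples have 2-norm 5, so s = 25.
source = [(3.0, 4.0), (0.0, 5.0)]
grid = parameter_grid(source)
print(len(grid), grid[0])
```

With 13 kernel widths and 9 values for each of the other three parameters, a joint search covers 13 × 9 × 9 × 9 = 9477 combinations, which is one reason the training cost per combination matters.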
Experiments on artificial two-moon datasets
We generated 200 unlabeled samples in the target domain with a mean value of 0 and a standard deviation [21]. The target domain dataset was then rotated clockwise around its center five times, by 15° each time, to obtain five source domain datasets. In each of the five source domain datasets, 100 samples were randomly labeled as the positive class, and the remaining 100 samples were labeled as the negative class. Figure 3(a)–(f) shows the target domain dataset and the five source domain datasets obtained after rotation. The larger the rotation angle, the greater the difference between the distributions of the target domain and the source domain, and the more likely negative transfer is to occur. Because of limited space, we analyzed only the experimental results for rotations of 15°, 30° and 45°. Table 1 shows the average classification accuracy and standard deviation of SVM, AESVM, TrSVM, STL-SVM, STIL, STLCF, ARSVM, ARRLS, TL-DAKELM and RankRE-TL. Table 2 compares the average training time and standard deviation of these algorithms. Their average precision and standard deviation are given in Table 3, and their average recall and standard deviation are presented in Table 4.
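The rotation step can be sketched as follows. The sketch assumes that "the center" is the centroid of the target samples and that the five source domains correspond to cumulative rotations of 15°, 30°, 45°, 60° and 75°; the function names and the three-point toy dataset are hypothetical, and the generation of the two-moon samples themselves is not shown.

```python
import math

def rotate_clockwise(points, degrees, center=None):
    """Rotate 2-D points clockwise by `degrees` around `center` (default: centroid)."""
    if center is None:
        cx = sum(p[0] for p in points) / len(points)
        cy = sum(p[1] for p in points) / len(points)
    else:
        cx, cy = center
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        # A clockwise rotation by theta is a counter-clockwise rotation by -theta.
        rotated.append((cx + dx * cos_t + dy * sin_t,
                        cy - dx * sin_t + dy * cos_t))
    return rotated

def make_source_domains(target, step_degrees=15, count=5):
    """Source domains obtained by rotating the target by 15, 30, ..., 75 degrees."""
    return [rotate_clockwise(target, step_degrees * k) for k in range(1, count + 1)]

# Toy target domain (made up) standing in for the 200 two-moon samples.
target = [(1.0, 0.0), (-1.0, 0.0), (0.0, 2.0)]
source_domains = make_source_domains(target)
print(source_domains[0])
```

Because a rotation preserves distances to the center, the shape of each source domain is identical to that of the target; only its orientation, and hence the overlap with the target distribution, changes with the angle.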

Fig. 3. Two-moon datasets.
Table 1. Comparison of average accuracy (%) with standard deviation on two-moon datasets
Table 2. Comparison of average running time (s) with standard deviation on two-moon datasets
Table 3. Comparison of average precision (%) with standard deviation on two-moon datasets
Table 4. Comparison of average recall (%) with standard deviation on two-moon datasets
From Tables 1–4, the following conclusions can be drawn. Because the SVM and AESVM algorithms have no transfer ability, they use only the samples in the target domain for training, so the differences between the source domain and the target domain do not affect their classification results. AESVM trains on a representative dataset, so its training effect is similar to that of SVM, but its training time decreases because of the smaller training set. The transfer learning algorithms TrSVM, ARSVM, ARRLS, STL-SVM, STIL, STLCF, TL-DAKELM and RankRE-TL can be trained on samples from both the source and target domains, so the trained classifiers perform better than the non-transfer SVM and AESVM algorithms. However, as the rotation angle increases, the difference between the target domain and the source domain becomes larger and the classification performance of the transfer learning classifiers deteriorates, whereas SVM and AESVM are unaffected. In terms of running time, SVM and AESVM require less training time than the transfer learning algorithms, but their classification accuracy is lower. The selective algorithms STL-SVM, STIL and STLCF require fewer source training samples than TrSVM, ARSVM, ARRLS, TL-DAKELM and RankRE-TL, so their running time is reduced and the classifier training cost is relatively lower. Tables 3 and 4 show that STL-SVM has some advantages over the benchmark algorithms in both precision and recall.
In this paper, the 20Newsgroups, Reuters and Email spam datasets are used for the evaluation of the STL-SVM algorithm on the real dataset.
The 20Newsgroups [22] dataset contains approximately 20,000 news documents, which are divided into 4 top categories, each containing 4 subcategories. See Table 5 for details. In the experiments, we randomly selected 2 of the 4 top categories, one as the positive class and the other as the negative class. Then, the samples were split by subcategory within the top categories to obtain the datasets of the source domain and the target domain. The specific cross-domain task groups were comp vs rec (c vs r), comp vs sci (c vs s), comp vs talk (c vs t), rec vs sci (r vs s), rec vs talk (r vs t) and sci vs talk (s vs t). These task groups ensure that the source and target domains are related, because they come from the same top categories, and at the same time different, because they come from different subcategories.
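The six task groups listed above are exactly the unordered pairs of the four top categories, which the small sketch below enumerates. The function name is hypothetical, and the split of each pair into source and target domains by subcategory is not shown because it depends on the subcategory lists in Table 5.

```python
from itertools import combinations

TOP_CATEGORIES = ["comp", "rec", "sci", "talk"]

def cross_domain_tasks(categories):
    """One positive-vs-negative task per unordered pair of top categories."""
    return [f"{positive} vs {negative}"
            for positive, negative in combinations(categories, 2)]

tasks = cross_domain_tasks(TOP_CATEGORIES)
print(tasks)
```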
Table 5. Statistics of the 20Newsgroups and Reuters datasets
The Reuters dataset [28] is similar to the 20Newsgroups dataset and contains several top categories, each of which consists of a few subcategories. See Table 5 for details. The main top categories are orgs, people and place. We used these categories to construct the learning task groups orgs vs people (o vs pe), orgs vs place (o vs pl) and people vs place (pe vs pl).
The Email spam dataset was released by the ECML/PKDD 2006 Knowledge Discovery Challenge Contest [29] and includes three personal email datasets, U1, U2 and U3, and a public email dataset. Each personal email dataset contains 2,500 messages, half of which are normal and half of which are spam. The transfer learning tasks are U1 vs U2, U2 vs U3 and U3 vs U1.
Tables 6, 7, 8, and 9 show the average classification accuracy, average running time, average precision, and average recall of SVM, AESVM, TrSVM, ARSVM, ARRLS, STL-SVM, STIL, STLCF, TL-DAKELM, and RankRE-TL on the cross-domain dataset constructed by the real dataset. From these results, we can draw the following conclusions.
Table 6. Comparison of average accuracy (%) with standard deviation on real datasets
Table 7. Comparison of average running time (s) with standard deviation on real datasets
Table 8. Comparison of average precision (%) with standard deviation on real datasets
Table 9. Comparison of average recall (%) with standard deviation on real datasets
In terms of classification accuracy, Table 6 shows that because SVM and AESVM cannot use the knowledge in the source domain and train the classifier only on the samples in the target domain, their classification results are worse than those of TrSVM, ARSVM, ARRLS, STL-SVM, STIL, STLCF, TL-DAKELM and RankRE-TL. STL-SVM is better than most transfer learning algorithms and obtains competitive classification performance on all transfer learning tasks. In terms of running time, Table 7 shows that the training time of STL-SVM was significantly shorter than that of TrSVM, ARSVM, ARRLS, STIL, STLCF, TL-DAKELM and RankRE-TL; the AESVM algorithm used only a small number of representative samples selected from the target domain, so it had the shortest training time. Although the experimental results on the real datasets showed that AESVM and SVM were also very efficient to train, their classification accuracy could not match that of the transfer learning algorithms. As the precision and recall in Tables 8 and 9 show, the STL-SVM proposed in this paper effectively improved precision and recall compared with SVM, AESVM, TrSVM, ARSVM, ARRLS, STIL, STLCF, TL-DAKELM and RankRE-TL.
Finally, to test whether STL-SVM differs significantly from the transfer learning algorithms with similar classification results, the Wilcoxon signed rank test [38] was applied to these methods. Based on the contents of Table 6, the average classification accuracy of all algorithms on the real datasets 20Newsgroups, Reuters and Email spam was calculated. The results are shown in Table 10.
Table 10. Average classification accuracy (%) on real datasets
Table 10 lists the average classification accuracy of the algorithms on the real datasets. The results of the Wilcoxon test on the real datasets 20Newsgroups, Reuters and Email spam are discussed below.
20Newsgroups: The classification accuracy of STL-SVM is only 0.41% higher than ARRLS; therefore, when using STL-SVM and ARRLS to classify six cross-domain tasks, each task was repeated five times, the values of W+ and W- are +145, and - 17, respectively. For the bilateral test of α= 0.05, when n = 30, by querying the distribution table of the Wilcoxon signed rank test, T0.025 = 137. Because W+>T0.025, H0 was accepted: there was no significant difference in the classification results between the two methods.
Reuters: The classification accuracy of STL-SVM is 0.91% higher than that of ARSVM. When using STL-SVM, and ARSVM to classify three cross-domain tasks, each task was repeated five times. For STL-SVM and ARSVM, the values of W+ and W- are +112 and –28, respectively. For a bilateral test of α= 0.05, when n = 15, the distribution table of the Wilcoxon sign rank test is queried, T0.025 = 25. Because W+ >T0.025, H0 is accepted: there is no significant difference between the classification results of the two methods.
Email Spam: The classification accuracy of STL-SVM is 0.7% higher than that of STLCF. STL-SVM and STLCF were used to classify the three cross-domain tasks, with each task repeated five times; the values of W+ and W- were +86 and -15, respectively. For the two-sided test at α = 0.05 with n = 15, the distribution table of the Wilcoxon signed rank test gives T0.025 = 25. Because W+ > T0.025, H0 was accepted: there was no significant difference in the classification results between the two methods.
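The rank sums W+ and W- reported above can be computed from paired accuracies as in the following minimal Python sketch. The function names `signed_rank_sums` and `reject_h0` and the sample accuracies are hypothetical and are not taken from the experiments. The decision rule compares min(W+, W-) with the tabulated critical value, which is the common textbook convention and is assumed here. W- is returned as a magnitude.

```python
def signed_rank_sums(a, b):
    """Return (W+, W-) for paired samples a and b; W- is a magnitude."""
    # Paired differences; zero differences are dropped, as in the classical test.
    diffs = [x - y for x, y in zip(a, b) if x != y]
    # Order the indices by absolute difference so ranks can be assigned.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Tied absolute differences share the average of their ranks.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


def reject_h0(w_plus, w_minus, critical_value):
    """Textbook two-sided rule (assumed): reject H0 when min(W+, W-) <= critical value."""
    return min(w_plus, w_minus) <= critical_value


if __name__ == "__main__":
    # Hypothetical accuracies (%) of two methods on six paired runs.
    method_a = [91.2, 88.4, 90.1, 87.9, 92.3, 89.0]
    method_b = [90.8, 88.9, 89.5, 87.1, 91.0, 89.6]
    w_plus, w_minus = signed_rank_sums(method_a, method_b)
    print("W+ =", w_plus, "W- =", w_minus)
```

The critical value passed to `reject_h0` would be looked up in the Wilcoxon signed rank table for the given n and α, for example the values T0.025 = 137 (n = 30) and T0.025 = 25 (n = 15) quoted in the text.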
In this section, we performed a parameter sensitivity analysis to verify the best performance that STL-SVM can achieve within a certain range of parameters. According to the objective function of STL-SVM in Equation (11), the parameters of the algorithm are the source domain structural risk regularization parameter $C_s$, the target domain regularization parameter $C_t$ and the adjustable parameter λ. A sensitivity analysis of these three parameters was conducted to illustrate their impact on the classification performance of STL-SVM. For each parameter, we set the other two parameters to the best values obtained by cross validation and then observed the effect of the remaining parameter on the performance. The experimental results on the 12 cross-domain classification tasks of the real datasets are shown in Figs. 4–6.

Sensitivity of parameter $C_s$ for STL-SVM.
The experimental results are analyzed in detail below.

Sensitivity of parameter $C_t$ for STL-SVM.

Sensitivity of parameter λ for STL-SVM.
To find the optimal value of $C_s$, we first fix $C_t$ = 10 and λ = 10, then search the values of $C_s$ on the grid $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}, 10^{4}\}$ and record the experimental results for each value of $C_s$ on the real datasets, as shown in Fig. 4. Figure 4 shows that the classification performance of STL-SVM varies with $C_s$ and is best on the 12 cross-domain tasks when $C_s = 10^{2}$. In the same way, we fix λ = 10 and $C_s = 10^{2}$ to find the optimal value of $C_t$. The experimental results for different values of $C_t$ are shown in Fig. 5, from which we can conclude that STL-SVM achieves the best classification performance on most cross-domain classification tasks on the real datasets when $C_t$ is set to its optimal value of 1. The analysis above shows that the average classification accuracy of STL-SVM on the 12 cross-domain classification tasks differs considerably for different values of $C_s$ and $C_t$. STL-SVM is therefore sensitive to the regularization parameters $C_s$ and $C_t$ within a certain range, and the parameter values that give the best classification performance on the different cross-domain tasks were obtained through multiple experiments.
For the parameter λ, we fix $C_s = 10^{2}$ and $C_t$ = 1 and obtain the experimental results in the same way as above, as shown in Fig. 6. From the results in Fig. 6 we can draw the following conclusions. When the value of λ is 10, STL-SVM achieves the best classification performance on the 12 cross-domain classification tasks. If the value is too small, the difference between the source domain and the target domain is neglected, and the classification performance is not satisfactory. If the value is too large, the difference between the distributions of the source and target domains is emphasized, which reduces the knowledge that can be transferred from the source domain, and the classification performance is also poor.
It can be seen from Figs. 4–6 and the above analysis that STL-SVM is sensitive to the parameters $C_s$, $C_t$ and λ within a certain range of values, so selecting the parameter values is crucial for the classification performance of STL-SVM.
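The one-parameter-at-a-time search described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `sweep_parameter`, `GRID` and `toy_accuracy` are hypothetical names, and `toy_accuracy` is a made-up surrogate standing in for training STL-SVM and measuring its accuracy, so its output is not an experimental result.

```python
import math

# Candidate values 10^-4 ... 10^4, matching the grid given in the text.
GRID = [10.0 ** k for k in range(-4, 5)]


def sweep_parameter(evaluate, name, grid, fixed):
    """Vary one parameter over `grid` while the others stay at their `fixed` values."""
    results = {}
    for value in grid:
        params = dict(fixed)   # copy so the fixed settings are not modified
        params[name] = value
        results[value] = evaluate(**params)
    best = max(results, key=results.get)  # value with the highest score
    return best, results


def toy_accuracy(c_s, c_t, lam):
    """Hypothetical surrogate score; a real study would train and test STL-SVM here."""
    return -abs(math.log10(c_s) - 2) - abs(math.log10(c_t)) - abs(math.log10(lam) - 1)


if __name__ == "__main__":
    # Step 1: fix C_t and lambda, search C_s.
    best_cs, _ = sweep_parameter(toy_accuracy, "c_s", GRID, {"c_t": 10.0, "lam": 10.0})
    # Step 2: fix lambda and the chosen C_s, search C_t.
    best_ct, _ = sweep_parameter(toy_accuracy, "c_t", GRID, {"c_s": best_cs, "lam": 10.0})
    # Step 3: fix the chosen C_s and C_t, search lambda.
    best_lam, _ = sweep_parameter(toy_accuracy, "lam", GRID, {"c_s": best_cs, "c_t": best_ct})
    print(best_cs, best_ct, best_lam)
```

Searching one parameter at a time needs far fewer training runs than a full grid over all three parameters, but it can miss interactions between them, which is one reason the fixed values are taken from cross validation first.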
To address the shortcomings of traditional machine learning methods, this paper proposed the STL-SVM algorithm, which is based on transfer learning and SVM. STL-SVM uses AESVM to reduce the number of training samples in the source domain and thereby speed up the learning process. In addition, the MMD and the construction of the objective function are used to effectively solve the negative transfer problem that easily occurs in transfer learning. STL-SVM therefore achieves better classification results by effectively integrating the knowledge of the source and target domains when transferring knowledge. Experimental results on artificial and real datasets show the effectiveness of the proposed method. Further research on the STL-SVM algorithm will be conducted in the following directions: 1) only binary classification is explored in this paper, so STL-SVM will be extended to multi-class classification problems; 2) transferring knowledge from multiple source domains with STL-SVM needs to be studied.
Acknowledgments
This work was supported by the National Key Research and Development Plan of China (2016YFB0801004).
