A novel Pseudo-label based domain adaptation method on tabular data

Abstract

Tabular data is widely used in many fields, such as product marketing. Domain shift between the source and target domains of tabular data may occur when collection conditions, such as time, change. Existing methods for tabular data are mainly neural-network-based or tree-based, and both face challenges under domain shift. First, neural-network-based methods lack an effective mechanism for extracting features from tabular data, and their performance may not exceed that of tree-based models. Second, tree-based methods lack effective feature representations for modeling the associations between the source domain and the target domain. To improve the performance of tree-based methods under domain shift, a novel pseudo-label based domain adaptation method is proposed for the tree-based method Xgboost. The proposed method consists of a pseudo-label generation strategy and a pseudo-label selection strategy. The generation strategy controls the effect of pseudo-labels on Xgboost more flexibly by setting proper pseudo-label values. The selection strategy selects pseudo-labels with high confidence under a consistency condition based on the outputs of Xgboost. As a result, the quality of the pseudo-labels for the target-domain data improves, and so does the performance of Xgboost trained on data from both the source domain and the target domain. In the experiments, several UCI datasets and 5G terminal datasets are used to show that the proposed method effectively improves the performance of Xgboost.

Keywords

Domain adaptation; Pseudo-label; Tabular data; Xgboost

1 Introduction

Tabular data has been widely used in many fields such as finance, marketing and engineering. There are two main categories of methods for tasks on tabular data: tree-based methods and neural-network-based methods. Tree-based methods aim to construct decision rules with tree structures. GBDT [7], XGB [2], Lightgbm [14] and Catboost [23] are representative tree-based methods. Neural-network-based methods mainly focus on learning feature representations and extracting effective features of tabular data via different model structures. For example, in [3, 9], wide & deep structures are proposed to extract sufficient information by fusing linear models and deep neural networks. In [8, 28], more effective feature embeddings are modeled by transformer blocks to improve model performance. Until now, almost all methods for tabular data have focused on typical tasks. However, in scenarios such as precision marketing, there may be a distribution discrepancy between the source and target domains, which is called domain shift. Domain shift may degrade the target-domain performance of models trained in the source domain. Existing methods for tabular data lack effective mechanisms for scenarios where domain shift occurs. It is therefore necessary to study domain adaptation mechanisms on tabular data so that models learn adaptively and overcome the performance degradation induced by the shift between the source domain and the target domain.

Extant domain adaptation methods mainly focus on the fields of CV [19] and NLP [4, 10]. Few domain adaptation methods focus on tabular data or tree-based methods. Until now, domain adaptation methods can be divided into three categories: feature adaptation, instance adaptation and model adaptation. Feature adaptation aims to construct an invariant feature space between the source domain and the target domain. Invariant feature spaces are constructed based on distribution discrepancies measured by the criteria in [13, 27] or by the neural networks in [22]. Instance-based domain adaptation methods aim to reweight the data in the source domain, and the models are trained with the data in the target domain. For example, in [34], the cross-domain diagnostic problems in PHM are solved by instance-level weighted adversarial learning. Instance-based domain adaptation methods work under the assumption that the conditional distributions of the source and target domains are nearly identical, and that some data in the source domain are available in the target domain during reweighting, as in [20]. Model adaptation aims to modify the model parameters to obtain better performance on the target domain. For example, in [35], a constrained optimization of the feature normalization statistics in pre-trained source models is proposed. It is constructed from a small support set in the target domain to achieve better performance.

It is necessary to study domain adaptation strategies for tree-based models. On tabular data, tree-based models have an advantage over neural-network-based models in accuracy, stability and time complexity in some cases. On the other hand, most domain adaptation strategies in CV and NLP are not suitable for tree-based models on tabular data. Feature adaptation cannot keep the tabular structure of the data in the invariant feature space, which may degrade the performance of tree-based models. For instance adaptation, there are no proper reweighting strategies based on the structures of trees. For model adaptation, most methods depend on the feature space constructed by neural networks and cannot be applied directly to tree structures.

Pseudo-label is a promising way to make an effective domain adaptation strategy for tree-based models. This is because the generation of pseudo-labels can depend only on the outputs of models without the constraints of model structures. Until now, most neural-network-based pseudo-label strategies focus on the regularization of neural networks with some properties, such as consistency of different model structures [16] or the consistency of class center for the data in source and target domain [30]. The pseudo-label strategies mainly depend on entropy minimization for soft pseudo-labels or the pre-defined threshold for hard pseudo-labels. There are some drawbacks of extant pseudo-label strategies on tree-based methods:

The regularization of entropy minimization as soft pseudo-labels is through parameterized optimization which is not suitable for tree-based methods. This is because tree structures can not be constructed by end-to-end parametric optimization during training.

The effectiveness of entropy minimization or the threshold for hard pseudo-label is not guaranteed on tree-based methods which will be discussed in this paper and may lead to performance degradation.

In this paper, a novel pseudo-label based domain adaptation method is proposed for Xgboost on tabular data. We analyze the drawbacks of entropy minimization and generate more effective pseudo-labels only by the outputs of Xgboost. We also propose a simple but effective method to select pseudo-labels with high confidence scores under a consistency condition. Our main contributions are shown as follows:

A soft pseudo-label generation method is proposed to optimize the performance of Xgboost in target domain. Compared with entropy minimization, the proposed method can control the value of pseudo-label in a more flexible way according to the outputs of Xgboost.

A consistency condition for pseudo-label selection is proposed to measure the confidence scores of pseudo-labels. It consists of two criteria. The first criterion selects the samples that lie in the center region of each class. The second selects the samples that have stable outputs across Xgboost models with similar parameter values. The first is a standard criterion shared by pseudo-label selection strategies, whereas the second is novel and proposed in this paper.

2 Related work

Semi-supervised learning. Both labeled and unlabeled samples are used to improve performance. Among semi-supervised learning methods, various pseudo-label strategies are proposed for unlabeled samples so that the model can be trained in a supervised way. In [25], pseudo-labels on weakly augmented data and strongly augmented data are matched to train the model on images. In [6], low-confidence samples are used to predict "what it is not" by a classifier called TrueNegative. In [33], a novel framework is proposed that allows the creation of both pseudo and complementary labels to support positive and negative learning. In [37], a class-aware contrastive loss is used to learn discriminative feature representations, and label information is propagated to the unlabeled samples across the potential data manifold.

Self-training learning. The performance of the model in the target domain is improved by an iterative process of predicting and retraining. For example, in [18], a regularization with an energy-based model is used and the training on unlabeled target samples is constrained. In [29], a strategy called Class-Rebalancing Self-Training (CReST) is proposed to improve existing SSL methods on class-imbalanced data. In [32], an interactive form of self-training with mean teachers is proposed for semi-supervised object detection, and non-maximum suppression is used to fuse the detection results from different iterations. In [17], a generative-adversarial-network-based self-training framework with progressive augmentation (SPA) is proposed to obtain robust features of unlabeled data in the target domain according to the information of labeled data in the source domain.

Domain adaptation with neural networks. Neural-network-based domain adaptation methods aim to construct an invariant feature space with neural networks. In [21], a novel neural network called Transferrable Prototypical Networks (TPN) is proposed to reduce the distribution discrepancy between the source domain and the target domain by matching the data in the target domain with the closest prototype in the source domain. In [5], a frequency-weighted aggregation strategy is proposed to generate soft pseudo-labels for unlabeled target data. For adversarial learning, in [15], domain-invariant feature representations are constructed by generative adversarial networks. Adversarial learning can also be used to generate synthetic target data related to the source domain, such as style transfer in [11]. In [38], a reconstruction-based domain adaptation method is proposed to construct a map between the source domain and the target domain.

3 Proposed methods

In this section, the details of the proposed method are introduced. The proposed method consists of two parts: pseudo-label generation and pseudo-label selection. First, the drawbacks of hard pseudo-labels and of entropy minimization as soft pseudo-labels are analyzed for the tree-based method Xgboost. Then the proposed pseudo-label generation strategy is introduced to overcome these drawbacks. Finally, a pseudo-label selection strategy is proposed to select the pseudo-labels with high confidence scores in a novel uncertainty-aware framework. Before introducing our work, we introduce some basic notation and concepts.

Let $D_{L} := \{(x_{i}^{1}, y_{i}^{1})\}_{i = 1}^{N_{L}}$ be a labeled dataset in the source domain, where $N_{L}$ is the number of labeled samples and $y_{i}^{1} \in \{0, 1\}$ is the label of sample $x_{i}^{1}$. $y_{i}^{1} = 1$ represents the positive class and $y_{i}^{1} = 0$ represents the negative class. Let $D_{U} := \{x_{i}^{2}\}_{i = 1}^{N_{U}}$ be an unlabeled dataset in the target domain, where $N_{U}$ is the number of unlabeled samples, and let $\{y_{i}^{2}\}_{i = 1}^{N_{U}}$ be the set of real labels for $D_{U}$. The proposed method aims to construct an Xgboost model $f (x | θ) \in [0, 1]$ such that $f (x_{i}^{2} | θ) = y_{i}^{2}$, where θ is the parameter set of Xgboost, such as max-depth and number of trees.

Given a sample $x_{i}^{2}$, the pseudo-label generator $G (x_{i}^{2}) \in [0, 1]$ of the pseudo-label generation strategy outputs $y_{i}^{pseudo}$ as the pseudo-label of $x_{i}^{2}$. The selector $S (x_{i}^{2}) \in \{0, 1\}$ outputs 1 when the confidence of $x_{i}^{2}$ is high and 0 otherwise. Based on $G (x_{i}^{2})$ and $S (x_{i}^{2})$, the tree structure of Xgboost is constructed by minimizing the loss function $L (D_{L}, D_{U})$ with respect to θ, which is defined as follows: $L (D_{L}, D_{U}) := \sum_{(x_{i}^{1}, y_{i}^{1}) \in D_{L}} ∥ f (x_{i}^{1} | θ) - y_{i}^{1} ∥ + \sum_{x_{i}^{2} \in D_{U}} S (x_{i}^{2}) ∥ f (x_{i}^{2} | θ) - G (x_{i}^{2}) ∥ .$ In the next sections, the details of $G (x_{i}^{2})$ and $S (x_{i}^{2})$ are introduced.
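As a minimal sketch of how the two terms of the loss combine, the following Python function evaluates $L (D_{L}, D_{U})$ for given model outputs. It assumes scalar outputs with the absolute difference as the norm; the function name `combined_loss` and its argument names are illustrative and do not come from the original method description.

```python
def combined_loss(preds_src, labels_src, preds_tgt, pseudo_labels, selected):
    """Evaluate L(D_L, D_U) for scalar outputs, assuming |.| as the norm."""
    # Supervised term over the labeled source samples.
    source_term = sum(abs(p - y) for p, y in zip(preds_src, labels_src))
    # Pseudo-label term: only samples with S(x) = 1 contribute.
    target_term = sum(
        s * abs(p - g) for p, g, s in zip(preds_tgt, pseudo_labels, selected)
    )
    return source_term + target_term


loss = combined_loss(
    preds_src=[0.9, 0.2], labels_src=[1, 0],
    preds_tgt=[0.8, 0.5], pseudo_labels=[0.96, 0.5], selected=[1, 0],
)
```

In this toy call the second target sample is not selected, so it contributes nothing to the loss regardless of its pseudo-label.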

3.1 Pseudo-label generation with entropy minimization

In this section, a soft pseudo-label generation method is proposed to control the effect of generated pseudo-labels on the performance of Xgboost. Since the form of pseudo-label can be divided into hard pseudo-label and soft pseudo-label, we will analyze the drawbacks of hard pseudo-label and entropy minimization as soft pseudo-label before introducing the proposed method.

In [31, 36], the hard pseudo-label is defined as follows: $\hat{y} = \mathbb{1} [f (x_{i}^{2} | θ) > τ],$ where $\hat{y} \in \{0, 1\}$ and τ represents the threshold. Because $\hat{y}$ can only take the values 0 and 1, the distance between $\hat{y}$ and $f (x_{i}^{2} | θ)$ cannot be adjusted: it is large for some samples and small for others. During training, the tree structure in Xgboost depends on all of the labeled training data and the unlabeled testing data with pseudo-labels. When the distance between $\hat{y}$ and $f (x_{i}^{2} | θ)$ is large, $\hat{y}$ may have a large effect on the construction of the trees in Xgboost. Consequently, when the pseudo-label threshold τ is inexact, the performance of Xgboost degrades.
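The following short Python sketch illustrates the point about the fixed distance. The scores and the threshold value 0.5 are made-up illustrative numbers, and `hard_pseudo_label` is a hypothetical helper name.

```python
def hard_pseudo_label(score, tau):
    """Return 1 if the model output exceeds the threshold tau, else 0."""
    return 1 if score > tau else 0


scores = [0.55, 0.95, 0.45]
labels = [hard_pseudo_label(s, tau=0.5) for s in scores]
# Distance between the hard label and the original output for each sample.
gaps = [abs(y - s) for y, s in zip(labels, scores)]
```

The samples closest to the threshold receive the largest jump between output and label, so a slightly misplaced τ changes exactly those labels that influence the trees most.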

For entropy minimization as soft pseudo-label, the pseudo-label generator $G (x_{i}^{2})$ for each $x_{i}^{2}$ is constructed such that the entropy of $f (x_{i}^{2} | θ)$ defined in (1) is minimized with respect to the parameters θ: $\min \; - \left( f (x_{i}^{2} | θ) \cdot \log f (x_{i}^{2} | θ) + (1 - f (x_{i}^{2} | θ)) \cdot \log (1 - f (x_{i}^{2} | θ)) \right) .$ (1) Now, we analyze the effect of the entropy minimization defined in (1) on the outputs $f (x_{i}^{2} | θ)$. The gradient of the entropy in (1) with respect to $f (x_{i}^{2} | θ)$ is: $- \log \frac{f (x_{i}^{2} | θ)}{1 - f (x_{i}^{2} | θ)} .$ (2) From (2), the gradient is negative when $f (x_{i}^{2} | θ) > \frac{1}{2}$, so minimizing the entropy makes $f (x_{i}^{2} | θ)$ increase. When $f (x_{i}^{2} | θ) < \frac{1}{2}$, the gradient is positive, so minimizing the entropy makes $f (x_{i}^{2} | θ)$ decrease. Since the boundary between the positive class and the negative class is not necessarily equal to $\frac{1}{2}$, minimizing the entropy of $f (x_{i}^{2} | θ)$ by (1) may make $f (x_{i}^{2} | θ)$ increase or decrease inappropriately for some samples.
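The sign behaviour described above can be checked numerically. The sketch below compares the closed-form derivative of the binary entropy, $-\log \frac{f}{1 - f}$, with a central finite difference; the function names are illustrative.

```python
import math


def binary_entropy(f):
    """Entropy of a Bernoulli output f in (0, 1), natural logarithm."""
    return -(f * math.log(f) + (1 - f) * math.log(1 - f))


def entropy_gradient(f):
    """Closed-form derivative of the binary entropy with respect to f."""
    return -math.log(f / (1 - f))


def numeric_gradient(f, h=1e-6):
    """Central finite-difference approximation of the same derivative."""
    return (binary_entropy(f + h) - binary_entropy(f - h)) / (2 * h)


# Largest disagreement between the two derivatives over a few outputs.
max_error = max(
    abs(entropy_gradient(f) - numeric_gradient(f)) for f in (0.2, 0.5, 0.8)
)
```

A negative gradient at 0.8 means a descent step on the entropy raises the output, and a positive gradient at 0.2 means it lowers the output, with the turning point fixed at 1/2.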

Given these drawbacks of entropy minimization, the entropy of the model output should be minimized in a more flexible way by adjusting the range in which $f (x_{i}^{2} | θ)$ increases or decreases. The proposed pseudo-label generator $G (x_{i}^{2})$ therefore has the following form: $G (x_{i}^{2}) = \begin{cases} c \cdot f (x_{i}^{2} | θ), & f (x_{i}^{2} | θ) > a \\ f (x_{i}^{2} | θ), & b \leq f (x_{i}^{2} | θ) \leq a \\ d \cdot f (x_{i}^{2} | θ), & f (x_{i}^{2} | θ) < b \end{cases}$ (3) where the pre-defined hyper-parameters a, b, c and d satisfy the following condition: $0 < b < a < 1, c > 1, 0 < d < 1 .$ Compared with the entropy minimization defined in (1), the hyper-parameters a and b control the range in which $f (x_{i}^{2} | θ)$ increases or decreases, while c and d control the size of the change when $f (x_{i}^{2} | θ)$ falls in that range.
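A minimal Python sketch of the piecewise generator in (3) follows. The hyper-parameter values in the example are illustrative assumptions, not values reported in the paper, and `generate_pseudo_label` is a hypothetical name.

```python
def generate_pseudo_label(score, a, b, c, d):
    """Piecewise generator of (3) for a single model output `score`."""
    if not (0 < b < a < 1 and c > 1 and 0 < d < 1):
        raise ValueError("need 0 < b < a < 1, c > 1 and 0 < d < 1")
    if score > a:
        return c * score  # amplify outputs above the upper boundary a
    if score < b:
        return d * score  # shrink outputs below the lower boundary b
    return score          # leave the middle band [b, a] unchanged


# Illustrative hyper-parameters (assumed for this example only).
params = dict(a=0.7, b=0.3, c=1.1, d=0.5)
high = generate_pseudo_label(0.8, **params)
middle = generate_pseudo_label(0.5, **params)
low = generate_pseudo_label(0.2, **params)
```

Note that $c \cdot f$ can exceed 1 for outputs close to 1. The text does not say whether the result is clipped, so the sketch leaves the value as computed.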

Let $\{θ_{k}\}_{k = 1}^{m}$ be a set of hyper-parameter configurations for Xgboost, where m is the number of models and each $θ_{k}$ is a set of parameters such as max-depth, number of trees and gamma. An ensemble of the models with respect to the parameters $θ_{k}$ is proposed to improve the effectiveness of the pseudo-labels. The model $f (x_{i}^{2} | θ_{k})$ for each k is trained on the same training set, and the ensemble result $f_{1} (x_{i}^{2})$ with respect to $\{θ_{k}\}_{k = 1}^{m}$ is defined as follows: $f_{1} (x_{i}^{2}) = \frac{1}{m} \sum_{k = 1}^{m} f (x_{i}^{2} | θ_{k}) .$ In this case, $G (x_{i}^{2})$ has the following form: $G (x_{i}^{2}) = \begin{cases} c \cdot f_{1} (x_{i}^{2}), & f_{1} (x_{i}^{2}) > a \\ f_{1} (x_{i}^{2}), & b \leq f_{1} (x_{i}^{2}) \leq a \\ d \cdot f_{1} (x_{i}^{2}), & f_{1} (x_{i}^{2}) < b \end{cases}$ (4)
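The ensemble variant can be sketched as follows, assuming the outputs of the m models are already available as plain lists (one list per model). The function name, the example scores and the hyper-parameter values are illustrative assumptions.

```python
def pseudo_labels_from_ensemble(per_model_scores, a, b, c, d):
    """Average m model outputs per sample, then apply the rule of (4).

    per_model_scores[k][i] holds f(x_i | theta_k).
    """
    m = len(per_model_scores)
    # f_1(x_i): mean over the m models for each sample i.
    means = [sum(column) / m for column in zip(*per_model_scores)]
    labels = []
    for f1 in means:
        if f1 > a:
            labels.append(c * f1)
        elif f1 < b:
            labels.append(d * f1)
        else:
            labels.append(f1)
    return means, labels


scores = [
    [0.9, 0.4, 0.1],  # outputs of the model with theta_1
    [0.8, 0.5, 0.2],  # theta_2
    [0.7, 0.6, 0.3],  # theta_3
]
means, labels = pseudo_labels_from_ensemble(scores, a=0.7, b=0.3, c=1.1, d=0.5)
```

Averaging first means that a single model with an extreme output cannot push a sample across a or b on its own.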

3.2 Pseudo-label selection with sample stability

In this section, the pseudo-labels with high confidence are selected. Given a sample $x_{i}^{2}$, if $f (x_{i}^{2} | θ)$ is far from the class boundary, the prediction for $x_{i}^{2}$ should be more confident than the predictions for samples near the boundary.

There are two criteria to characterize the confidence of sample $x_{i}^{2}$:

If $f (x_{i}^{2} | θ)$ is close to the desired value, the prediction error of $f (x_{i}^{2} | θ)$ should be small.

If $∥ f (x_{i}^{2} | θ) - f (x_{i}^{2} | \hat{θ}) ∥$ is small when the parameters are perturbed from θ to $\hat{θ}$, then $f (x_{i}^{2} | θ)$ lies far away from the classification boundary and has high confidence.

Now, let us see the implementation details of these criteria.

For the first criterion, samples are selected such that the distance between $f (x_{i}^{2} | θ)$ and the real label y, i.e., $∥ f (x_{i}^{2} | θ) - y ∥$, is small. Hyper-parameters ɛ₁ and ɛ₂ with ɛ₁ < ɛ₂ are given such that $x_{i}^{2}$ is set as a negative sample if $f (x_{i}^{2} | θ) < ɛ_{1}$ and as a positive sample if $f (x_{i}^{2} | θ) > ɛ_{2}$. The values of ɛ₁ and ɛ₂ can be set according to the output distribution of $f (x_{i}^{2} | θ)$. In this case, the selector $S_{1} (x_{i}^{2})$ is defined as follows: $S_{1} (x_{i}^{2}) = \begin{cases} 1, & f (x_{i}^{2} | θ) > ɛ_{2} \text{ or } f (x_{i}^{2} | θ) < ɛ_{1} \\ 0, & ɛ_{1} \leq f (x_{i}^{2} | θ) \leq ɛ_{2} \end{cases}$

Let $\{θ_{k}\}_{k = 1}^{m}$ be a set of Xgboost parameter configurations, where m is the number of models and each $θ_{k}$ is a parameter set of Xgboost such as max-depth, number of trees and gamma. For the second criterion, the confidence of sample $x_{i}^{2}$ is measured by the sample variance $s (x_{i}^{2})$ of the model outputs, defined as follows: $s (x_{i}^{2}) = \frac{1}{m - 1} \sum_{k = 1}^{m} (f (x_{i}^{2} | θ_{k}) - \bar{f})^{2},$ where $\bar{f}$ is the average of $f (x_{i}^{2} | θ_{k})$ over k: $\bar{f} = \frac{1}{m} \sum_{k = 1}^{m} f (x_{i}^{2} | θ_{k}) .$ In this case, the selector $S_{2} (x_{i}^{2})$ is defined as follows: $S_{2} (x_{i}^{2}) = \begin{cases} 1, & s (x_{i}^{2}) \leq ɛ_{3} \\ 0, & s (x_{i}^{2}) > ɛ_{3} \end{cases}$ Finally, the selector $S (x_{i}^{2})$ is defined as follows: $S (x_{i}^{2}) = S_{1} (x_{i}^{2}) \cdot S_{2} (x_{i}^{2}) .$

Given a sample $x_{i}^{2}$ such that $S (x_{i}^{2}) = 1$ , the pseudo-label of $x_{i}^{2}$ has high confidence. Otherwise, $S (x_{i}^{2}) = 0$ and the pseudo-label of $x_{i}^{2}$ is not confident enough to train Xgboost.
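The two criteria can be combined as in the following Python sketch. It assumes, as Algorithm 1 does, that the first criterion is evaluated on the mean of the m model outputs; the function name, the scores and the threshold values are illustrative.

```python
from statistics import mean, variance


def select(per_model_scores, eps1, eps2, eps3):
    """Return S(x_i) = S_1(x_i) * S_2(x_i) for every target sample.

    per_model_scores[k][i] holds f(x_i | theta_k); at least two models
    are needed because the sample variance divides by m - 1.
    """
    flags = []
    for column in zip(*per_model_scores):
        f_bar = mean(column)
        # Criterion 1: the output lies clearly inside one class region.
        s1 = 1 if (f_bar > eps2 or f_bar < eps1) else 0
        # Criterion 2: the output is stable across parameter settings.
        s2 = 1 if variance(column) <= eps3 else 0
        flags.append(s1 * s2)
    return flags


scores = [
    [0.95, 0.50, 0.99],
    [0.93, 0.55, 0.60],
    [0.97, 0.45, 0.99],
]
flags = select(scores, eps1=0.1, eps2=0.8, eps3=0.01)
```

The third sample has a high mean output but is rejected because one model disagrees strongly, which is the case the stability criterion is meant to catch.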

3.3 Pseudo-label based domain adaptation method

In this section, the framework of the proposed method is presented. The method mainly consists of three parts: training Xgboost, pseudo-label generation and pseudo-label selection. The framework of the proposed method is shown in Fig. 1.

Fig. 1

The framework of the proposed method.

In the process of training Xgboost, the training and testing sets are formed, and the Xgboost model is trained on the training set. In the pseudo-label generation part, the proposed method predicts the testing set and generates pseudo-labels by (3) or (4). In the pseudo-label selection part, the samples with high confidence scores are selected and combined with the original training set to form a new training set. After a few epochs, the proposed method terminates and predicts the whole testing set as the final result. The details of the proposed method are shown in Algorithm 1.

Algorithm 1 Pseudo-label based domain adaptation

Inputs: training set $D_{L} = \{(x_{i}^{1}, y_{i}^{1})\}_{i = 1}^{N_{L}}$, testing set $D_{U} = \{x_{i}^{2}\}_{i = 1}^{N_{U}}$, maximal epoch $n_{max}$, number of models n.

Outputs: the outputs $O_{f}$ of Xgboost on testing set $D_{U}$.

Initialization: parameter sets $\{S_{j}\}_{j = 1}^{n}$ of the baseline Xgboost models, parameters S₀ of the proposed Xgboost model, ɛ₁, ɛ₂, ɛ₃, a, b, c, d, $D_{L}^{(0)} := D_{L}$, $D_{U}^{(0)} := D_{U}$, epoch := 0.

Step 1: Train Xgboost for parameters $\{S_{j}\}_{j = 1}^{n}$.

for j = 1 to n do

Train Xgboost model with parameters $S_{j}$ and dataset $D_{L}^{(epoch)}$, return outputs $O_{j} \in R^{N_{U}}$ on $D_{U}^{(epoch)}$

end for

Compute the average ${\bar{O}}_{mean}$ and variance ${\bar{O}}_{var}$ as follows: ${\bar{O}}_{mean} = \frac{1}{n} \sum_{j = 1}^{n} O_{j}, \quad {\bar{O}}_{var} = \frac{1}{n - 1} \sum_{j = 1}^{n} (O_{j} - {\bar{O}}_{mean})^{2} .$

Go to Step 2.

Step 2: Compute the pseudo-labels of $D_{U}^{(epoch)}$ according to ${\bar{O}}_{mean}$ by (3) or (4).

Go to Step 3.

Step 3: Set index ${\bar{O}}_{choice - 1}$ as follows: ${\bar{O}}_{choice - 1} = \begin{cases} 1, & {\bar{O}}_{mean} \leq ɛ_{1} \text{ or } {\bar{O}}_{mean} \geq ɛ_{2} \\ 0, & ɛ_{1} < {\bar{O}}_{mean} < ɛ_{2} \end{cases}$ Set index ${\bar{O}}_{choice - 2}$ as follows: ${\bar{O}}_{choice - 2} = \begin{cases} 1, & {\bar{O}}_{var} \leq ɛ_{3} \\ 0, & {\bar{O}}_{var} > ɛ_{3} \end{cases}$ Compute selection index ${\bar{O}}_{choice}$: ${\bar{O}}_{choice} = {\bar{O}}_{choice - 1} \cdot {\bar{O}}_{choice - 2} .$

Define set $Ω := ∅$ and ${N^{'}}_{U} := | D_{U}^{(epoch)} | .$

for j = 0 to ${N^{'}}_{U} - 1$ do

if ${\bar{O}}_{choice} [j] == 1$ then

$Ω := Ω \cup \{x_{j}^{2}\}$

end if

end for

Go to Step 4.

Step 4: Define training set $D_{L}^{(epoch + 1)}$ and testing set $D_{U}^{(epoch + 1)}$: $D_{L}^{(epoch + 1)} := D_{L}^{(epoch)} \cup Ω, \quad D_{U}^{(epoch + 1)} := D_{U}^{(epoch)} ∖ Ω .$ Go to Step 5.

Step 5: Determine whether to terminate.

if epoch < $n_{max}$ then

epoch := epoch + 1

Go to Step 1

else

Train Xgboost with parameters S₀ on training set $D_{L}^{(epoch)}$, return outputs $O_{f}$ on $D_{U}$ and terminate the algorithm

end if
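The loop of Algorithm 1 can be sketched in Python as below. To keep the example self-contained it does not call Xgboost: `train` is any callable that fits a model and returns a prediction function, and `toy_train` is a purely hypothetical one-dimensional stand-in. The early exit when no unlabeled samples remain, and all numeric values, are assumptions of this sketch.

```python
import math
from statistics import mean, variance


def adapt(train, X_src, y_src, X_tgt, base_params, final_params,
          eps1, eps2, eps3, a, b, c, d, n_max):
    """Sketch of Algorithm 1; train(params, X, y) returns predict(X)."""
    X_lab, y_lab = list(X_src), list(y_src)
    X_unl = list(X_tgt)
    for _ in range(n_max + 1):  # epochs 0, ..., n_max
        if not X_unl:           # assumption: stop once nothing is left
            break
        # Step 1: outputs of the n baseline models on the unlabeled pool.
        outputs = [train(p, X_lab, y_lab)(X_unl) for p in base_params]
        remaining = []
        for x, column in zip(X_unl, zip(*outputs)):
            o_mean, o_var = mean(column), variance(column)
            # Step 2: soft pseudo-label from the ensemble mean, as in (4).
            if o_mean > a:
                label = c * o_mean
            elif o_mean < b:
                label = d * o_mean
            else:
                label = o_mean
            # Step 3: keep only confident and stable samples.
            if (o_mean <= eps1 or o_mean >= eps2) and o_var <= eps3:
                X_lab.append(x)  # Step 4: move sample to the training set
                y_lab.append(label)
            else:
                remaining.append(x)
        X_unl = remaining
    # Final model with parameters S_0, evaluated on the whole testing set.
    return train(final_params, X_lab, y_lab)(list(X_tgt))


def toy_train(sharpness, X, y):
    """Hypothetical stand-in for fitting Xgboost on 1-D inputs."""
    pos = [x for x, t in zip(X, y) if t >= 0.5]
    neg = [x for x, t in zip(X, y) if t < 0.5]
    center = (mean(pos) + mean(neg)) / 2
    return lambda Xq: [1 / (1 + math.exp(-sharpness * (x - center)))
                       for x in Xq]


result = adapt(toy_train, [0.0, 1.0, 4.0, 5.0], [0, 0, 1, 1],
               [1.5, 3.0, 5.5], base_params=[1.0, 1.5, 2.0],
               final_params=2.0, eps1=0.1, eps2=0.9, eps3=0.01,
               a=0.7, b=0.3, c=1.1, d=0.5, n_max=2)
```

With a real model, `train` would wrap the Xgboost fit for one parameter set; passing it as an argument keeps the selection and retraining logic independent of the learner.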

4 Experimental results

In this section, the performance of the domain adaptation strategy is compared with neural-network-based methods, tree-based methods and some baseline variants (as an ablation study) on several UCI datasets and a 5G terminal purchase prediction dataset (i.e., the 5G terminal dataset). The environment is provided by the nine day artificial intelligence platform.

For both UCI datasets and 5G terminal datasets, the methods to be compared are Logistic, MLP, Catboost, Tabnet [1], Saint [26]. The details of these models are shown as follows:

Logistic regression: logistic regression is a linear model which is optimized by Adam, and the maximal epoch is set to 30.

MLP: MLP denotes the fully-connected neural network with two dense layers of dimension 10 and 10.

Catboost: Catboost is a tree-based method which is implemented by the Catboost package.

Tabnet: Tabnet is a neural-network-based method published at AAAI 2021. We use the implementation at https://github.com/dreamquark-ai/tabnet. All the parameters of Tabnet are set to their defaults.

Saint: Saint is a neural-network-based method whose official implementation is at https://github.com/somepago/saint. All parameters except the batch size are set to their defaults. Since the higgs and 5G terminal datasets are large, the batch size is set to 2560, and the experimental results are based on the outputs obtained within an acceptable time on these datasets.

The experimental results show that the proposed method can effectively improve the performance of Xgboost on tabular data.

4.1 Results on UCI datasets

In this section, the experimental results on UCI datasets are shown. The UCI datasets include adult, bank and higgs. The task for these datasets is binary classification. The Adult dataset is used to predict whether income exceeds $50000 based on census data. The Bank dataset is used to predict whether the client will subscribe to a term deposit. The Higgs dataset distinguishes between a signal process which produces Higgs bosons and a background process. The statistical descriptions are shown in Table 1.

Table 1
Statistical details of UCI datasets

Name	Adult	Bank	Higgs
Number of Instances	48842	45211	3498124
Number of Features	15	17	29
Majority Class Percentage(%)	76.07	88.30	52.97
Minority Class Percentage(%)	23.93	11.70	47.02

Since the bank and higgs datasets are not divided into training sets and testing sets, 80% of the samples are randomly chosen from the whole dataset as the training set and the rest form the testing set.

The proposed method is named as Xgbours. The baseline methods consist of xgb1, xgb2, xgb3 and Xgbpseudo. The details of these models are shown as follows:

xgb1,xgb2,xgb3: xgb1,xgb2 and xgb3 are general implementations of Xgboost with different key parameters.

Xgbpseudo: Xgbpseudo is the implementation of Xgboost model where pseudo-label generation is applied but pseudo-label selection is not applied during training.

Xgbours: Xgbours is the implementation of Xgboost model where both pseudo-label generation and pseudo-label selection are applied during training.

The common parameters of these models are shown in Table 2, where Num.E represents the number of trees and Wei.ps represents the weights of positive samples.

Table 2

Parameters of Xgboost on UCI datasets

Name	Model	Max-depth	Num.E	Wei.ps
Adult	xgb1	4	100	3
	xgb2	5
	xgb3	6
	Xgbpseudo	6
	Xgbours	6
	Catboost	8	1000	1
Bank	xgb1	8	100	5
	xgb2	9
	xgb3	10
	Xgbpseudo	6
	Xgbours	10
	Catboost	8	1000	1
Higgs	xgb1	8	500	6
	xgb2	9
	xgb3	10
	Xgbpseudo	6
	Xgbours	10
	Catboost	8	1000	1

For UCI datasets, the accuracy of binary classification is measured for each method. All models output the probabilities of the positive class and the negative class. Since the training set and testing set are randomly selected from the whole dataset, the results may differ between splits. All of them are shown in Table 3.

Table 3

The accuracy (%) of classification for UCI datasets

Name	Adult	Bank	Higgs
Logistic	76.3	88.8	61.6
MLP	80.5	89.0	69.9
Catboost	83.7	88.5	54.7
Tabnet	85.2	89.9	73.7
Saint	86.1	91.0	73.9
xgb1	86.7	89.6	73.1
xgb2	86.6	89.9	73.1
xgb3	86.7	89.8	73.1
Xgbpseudo	87.2	90.6	74.7
Xgbours	87.4	90.8	74.8

The experimental results in Table 3 show that, after domain adaptation, the proposed method achieves better performance than the Xgboost baseline methods. Both the pseudo-label generation method and the pseudo-label selection method improve the performance of Xgboost on all datasets. The effect of the pseudo-label generation method is more significant than that of the pseudo-label selection method. The proposed method Xgbours outperforms Tabnet and Saint on Adult and Higgs. On Bank, however, Xgbours (90.8) is slightly below Saint (91.0) in accuracy.

4.2 Results on 5G terminal datasets

In 5G terminal purchase prediction, the mobile company needs to identify, by binary classification, the people in the target crowd database who will purchase 5G terminals in the next month. The people with 5G terminal purchase behaviors are labeled as positive samples, and the people without 5G terminal purchase behaviors are labeled as negative samples.

The dataset is collected from a city for the precision marketing of 5G mobile terminals. It is monthly data in which the features of people's behaviors are collected in the current month, and the labels indicate whether purchase behaviors occur in the next month. When a purchase behavior occurs for a person, the label is set to 1, otherwise 0. The number of samples for each month ranges from 4.5 million to 5.5 million. The dimension of the features is 49, including numerical features such as age and traffic statistical indicators, and discrete features such as terminal brand.

In this paper, five labeled datasets are constructed from August to December in 2021. The data of the current month are used as the training set, and the data of the next month as the testing set. In the testing phase, the model outputs the probability of the positive class for each sample in the testing set, and the 200000 samples with the largest probabilities are selected. The proportion of positive samples among these 200000 samples is used as the performance measure. Define Num.P and Num.N to be the numbers of positive samples and negative samples respectively, and Ratio to be the proportion of positive samples in the datasets. The statistical details of the 5G terminal datasets are shown in Table 4.
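The evaluation measure described above, the share of true positives among the K highest-scored samples, can be sketched in Python as follows. The function name and the toy data are illustrative; in the experiments K would be 200000.

```python
def precision_at_top_k(scores, labels, k):
    """Proportion of positive labels among the k highest-scored samples."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Rank samples by predicted probability, largest first.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0],
                    reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


scores = [0.9, 0.1, 0.8, 0.4, 0.7]
labels = [1, 0, 0, 1, 1]
p_at_3 = precision_at_top_k(scores, labels, k=3)
```

Because only the ranking matters for this measure, rescaling the model outputs monotonically leaves the result unchanged.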

Table 4
Statistical details of 5G terminal datasets

Month	Num.P	Num.N	Ratio
08	74733	4500475	1.63%
09	104186	4781037	2.13%
10	118466	5033377	2.30%
11	102236	5019323	2.00%
12	133058	4993937	2.60%

The baseline methods are xgb1, xgb2, xgb3, xgb4 and xgb5. To estimate the performance upper bound for different months, xgb1 is also trained on the testing set and returns the precision of the 200000 highest-scored samples in the testing set. The key parameters of all Xgboost models are shown in Table 5, where Num.E represents the number of trees and Wei.ps represents the weight of positive samples.

Table 5

Parameters of Xgboost on 5G terminal datasets

Model	Max-depth	Num.E	Wei.ps
xgb1	6
xgb2	7
xgb3	8	100	10
xgb4	9
xgb5	10
Xgbpseudo	10	100	10
Xgbours	10
Catboost	3	100	1

The proposed method is named Xgbours and uses the Xgboost model with the largest value of max-depth. This is because the number of samples with pseudo-labels is large, so Xgboost with a smaller max-depth is prone to under-fitting. The detailed results are shown in Table 6, where 08-09 means that the data of August are used as the training set and the data of September as the testing set.

Table 6

The precision(%) of top 200000 people on 5G terminal dataset

Month	11-12	10-11	09-10	08-09	Average
upper-bound	10.070	8.821	10.327	10.000	9.805
Logistic	6.326	6.905	8.184	6.772	7.047
MLP	8.153	7.024	8.623	6.864	7.666
Catboost	8.480	7.288	8.677	6.761	7.802
Tabnet	5.269	6.387	6.514	5.029	5.800
Saint	9.100	7.600	8.700	7.300	8.175
xgb1	8.695	6.777	8.765	7.174	7.853
xgb2	8.555	6.897	8.771	6.889	7.778
xgb3	8.388	6.861	8.116	6.877	7.561
xgb4	8.204	6.887	8.530	6.822	7.611
xgb5	7.897	6.606	8.325	6.623	7.363
Xgbpseudo	8.791	7.131	9.215	7.368	8.126
Xgbours	8.866	7.268	9.201	7.470	8.201

The results in Table 6 show that the proposed method Xgbours achieves better performance than the Xgboost baseline methods on all datasets. The pseudo-label generation method improves the performance of the Xgboost baseline methods. The performance of the pseudo-label selection method depends on the choice of hyper-parameters such as ɛ₃. In the 5G terminal datasets, each ɛ₃ is defined as the 90% quantile of the distribution of the scores computed by the criteria. If ɛ₃ for 09-10 is instead set to the 99% quantile of this distribution, the precision of Xgbours becomes 9.240, which is higher than that of Xgbpseudo shown in Table 6.
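A quantile-based choice of ɛ₃ can be sketched as below. The sketch assumes that the scores in question are the per-sample variances of the model outputs and uses linear interpolation between order statistics, which is one common quantile definition; neither detail is specified in the text, and the variance values are made up.

```python
def quantile_threshold(values, q):
    """Empirical q-quantile by linear interpolation between order statistics."""
    if not values or not 0 <= q <= 1:
        raise ValueError("need a non-empty list and 0 <= q <= 1")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


# Made-up per-sample variances of the ensemble outputs.
variances = [0.001 * i for i in range(11)]
eps3 = quantile_threshold(variances, 0.9)
# Samples whose variance does not exceed the threshold pass criterion 2.
stable = [v for v in variances if v <= eps3]
```

Raising q from 0.9 to 0.99 admits more samples through the stability criterion, which matches the direction of the 09-10 adjustment described above.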

Compared with neural-network-based methods, Xgbours has a small advantage in average precision (8.201 versus 8.175 for Saint in Table 6). On the other hand, the running time of Xgbours is shorter than that of Tabnet and Saint. For Xgbours, less than 10 minutes is needed for each dataset. For Saint, more than 2 hours per epoch is needed on our platform.

5 Conclusions

In this paper, a novel domain adaptation strategy is proposed to improve the performance of the tree-based model Xgboost on tabular data. The strategy consists of pseudo-label generation and pseudo-label selection for the testing data, which together provide more labeled data to train Xgboost. Hence, Xgboost can learn more effective features of the test data. The experimental results show that the strategy improves the performance of Xgboost not only on UCI data, but also in the scenario of 5G terminal purchase prediction, where the features of users change dynamically between months. Since the accuracy of the pseudo-labels has a large effect on the performance of such methods, how to generate more effective pseudo-labels on tabular data is an important direction for future study.

References

1. Arık S.O., Pfister, Tabnet: Attentive interpretable tabular learning. In AAAI, (2021), pp. 6679–6687.

2. Chen, Guestrin, Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2016), pp. 785–794.

3. Cheng H.T., Koc, Harmsen, Shaked, Chandra, Aradhye, Anderson, Corrado, Chai, Ispir, Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, (2016), pp. 7–10.

4. Conneau and Lample, Cross-lingual language model pretraining, Advances in Neural Information Processing Systems 32, 2019.

5. Ding, Sheng, Liang, Zheng, He, Proxymix: Proxy-based mixup training with label refinery for source-free domain adaptation. arXiv preprint arXiv:2205.14566, 2022.

6. Duan, Zhao, Qi, Wang, Zhou, Shi, Gao, Mutexmatch: Semi-supervised learning with mutex-based consistency regularization. arXiv preprint arXiv:2203.14316, 2022.

7. Friedman J.H., Greedy function approximation: a gradient boosting machine, Annals of Statistics, (2001), pp. 1189–1232.

8. Gorishniy Y.V., Rubachev, Khrulkov, Babenko, Revisiting deep learning models for tabular data. ArXiv, abs/2106.11959, 2021.

9. Guo, Tang, Ye, Li, He, Deepfm: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247, 2017.

10. Gururangan, Marasović, Swayamdipta, Don’t stop pretraining: adapt language models to domains and tasks, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020) (2020), pp. 8342–8360.

11. Hou, Zheng, Visualizing adapted knowledge in domain transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2021), pp. 13824–13833.

12. Huang, Khetan, Cvitkovic, Karnin, Tabtransformer: Tabular data modeling using contextual embeddings, 2020.

13. Kang, Jiang, Yang, Hauptmann A.G., Contrastive adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 4893–4902.

14. Ke, Meng, Finley, Wang, Chen, Ma, Ye and Liu, Lightgbm: A highly efficient gradient boosting decision tree, Advances in Neural Information Processing Systems 30, 2017.

15. Kim, Jeong, Kim, Choi, Kim, Diversify and match: A domain adaptive representation learning paradigm for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 12456–12465.

16. Kuniaki Saito T.H., Ushiku, Asymmetric tri-training for unsupervised domain adaptation, 2017.

17. Li, Chen, Qi, Zhu, Haner and Cai, A GAN-based self-training framework for unsupervised domain adaptive person re-identification, Journal of Imaging 7(4) (2021), 62.

18. Liu, Hu, Liu, Lu, You, Kong, Energy-constrained self-training for unsupervised domain adaptation. In 2020 25th International Conference on Pattern Recognition (ICPR), (2021), pp. 7515–7520. IEEE.

19. Munro, Damen, Multi-modal domain adaptation for fine-grained action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), pp. 122–132.

20. Pan S.J. and Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22(10) (2009), 1345.

21. Pan, Yao, Li, Wang, Ngo C-W., Mei, Transferrable prototypical networks for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 2239–2247.

22. Pinheiro P.O., Unsupervised domain adaptation with similarity learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), pp. 8004–8013.

23. Prokhorenkova, Gusev, Vorobev, Dorogush A.V. and Gulin, Catboost: unbiased boosting with categorical features, Advances in Neural Information Processing Systems 31, 2018.

24. Rozantsev, Salzmann and Fua, Beyond sharing weights for deep domain adaptation, IEEE Transactions on Pattern Analysis and Machine Intelligence 41(4) (2018), 801–814.

25. Sohn, Berthelot, Carlini, Zhang, Raffel C.A., Cubuk E.D., Kurakin and Li C-L., Fixmatch: Simplifying semi-supervised learning with consistency and confidence, Advances in Neural Information Processing Systems 33 (2020), 596–608.

26. Somepalli, Goldblum, Schwarzschild, Bruss C.B., Goldstein, Saint: Improved neural networks for tabular data via row attention and contrastive pretraining. arXiv preprint arXiv:2106.01342, 2021.

27. Sun, Saenko, Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, (2016), pp. 443–450. Springer.

28. Wang, Sun, Transtab: Learning transferable tabular transformers across tables. In Advances in Neural Information Processing Systems, 2022.

29. Wei, Sohn, Mellina, Yuille, Yang, Crest: A class-rebalancing self-training framework for imbalanced semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2021), pp. 10857–10866.

30. Xie, Zheng, Chen, Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning, (2018), pp. 5423–5432. PMLR.

31. Yang, Zhuo, Qi, Shi, Gao, St++: Make self-training work better for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 4268–4277.

32. Yang, Wei, Wang, Hua X.-S., Zhang, Interactive self-training with mean teachers for semi-supervised object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2021), pp. 5941–5950.

33. Yao, Shen, Xu, Zhong, Xiao, Cls: Cross labeling supervision for semi-supervised learning. arXiv preprint arXiv:2202.08502, 2022.

34. Zhang, Li, Ma, Luo and Li, Open-set domain adaptation in machinery fault diagnostics using instance-level weighted adversarial learning, IEEE Transactions on Industrial Informatics PP(99) (2021), 1–1.

35. Zhang, Shen, Zhang, Foo C-S., Few-shot adaptation of pre-trained networks for domain shift. arXiv preprint arXiv:2205.15234, 2022.

36. Zhang, Davison B.D., Deep spherical manifold gaussian kernel for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2021), pp. 4443–4452.

37. Zhao, Zhou, Wang, Shi, Gao, Lassl: Label-guided self-training for semi-supervised learning, AAAI Conference on Artificial Intelligence, 2022.

38. Zhu J-Y., Park, Isola, Efros A.A., Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, (2017), pp. 2223–2232.


Table 1
Statistical details of UCI datasets

Name	Adult	Bank	Higgs
Number of Instances	48842	45211	3498124
Number of Features	15	17	29
Majority Class Percentage(%)	76.07	88.30	52.97
Minority Class Percentage(%)	23.93	11.70	47.02

Table 4
Statistical details of 5G terminal datasets

Month	Num.P	Num.N	Ratio
08	74733	4500475	1.63%
09	104186	4781037	2.13%
10	118466	5033377	2.30%
11	102236	5019323	2.00%
12	133058	4993937	2.60%