Safe co-training for semi-supervised regression

Abstract

Co-training is a popular semi-supervised learning method in which learners exchange pseudo-labels obtained from different views to reduce the accumulation of errors. A key issue is how to ensure the quality of these pseudo-labels, since those obtained during the co-training process may be inaccurate. In this paper, we propose a safe co-training (SaCo) algorithm for regression with two new characteristics. First, the safe labeling technique obtains pseudo-labels that are certified by both views to ensure their reliability. It differs from the popular technique of having the two views assign pseudo-labels to each other. Second, the label dynamic adjustment strategy updates the previous pseudo-labels to keep them up-to-date. These pseudo-labels are predicted using the augmented training data. Experiments are conducted on twelve datasets commonly used to test regression algorithms. Results show that SaCo is superior to other co-training style regression algorithms and state-of-the-art semi-supervised regression algorithms.

Keywords

Co-training regression safe learning semi-supervised learning

1. Introduction

Co-training [1] is a simple and effective disagreement-based semi-supervised learning (SSL) method. Two learners trained in different views independently select high-confidence unlabeled instances. Then these instances are assigned pseudo-labels and added to the training pool of the other learner to update the model parameters. Since the two views are compatible and independent [2], co-training can reduce the accumulation of errors in the learner by exchanging pseudo-labels. This method has been applied to many fields, such as image retrieval [3, 4], sentiment analysis [5, 6], and medical diagnosis [7, 8, 9].

Currently, co-training faces a number of key issues. First, the assumption of sufficient and redundant views [10] places strict requirements on the data. However, some studies have shown that this assumption is unnecessary [11, 12, 13]. The data-driven methods split the dataset into different subsets as views. The split methods include random splitting [14, 15], splitting according to set mechanisms [16] or conditions [17], splitting based on view sufficiency and independence [18, 19, 20], and splitting based on relevant domain knowledge [21, 22]. The learner-driven methods utilize the learners to achieve view difference. One approach is to use learners with different learning mechanisms to obtain more comprehensive data information [23, 24]. The other is to use learners of the same type but with different parameters [10, 25, 26]. Second, co-trainers trust the pseudo-labels assigned at each iteration, so these labels are not updated in the subsequent training process. However, the pseudo-labels assigned in training may not reach the expected confidence level [27]. This problem may be more serious in the early stage of training. The errors caused by inappropriate instances will then persist or even accumulate.

Figure 1.

The illustration of the SaCo algorithm at the $t$ -th iteration. 1) Update the regressor $g^{(1)}$ with the training data $\mathbf{X}_{l}\cup\mathbf{X}_{s}^{(1)}$ of $\textit{view}_{1}$ . 2) Select the most confident instance $x$ by two-stage selection and add it to the selected set $\mathbf{X}_{s}^{(2)}$ of $\textit{view}_{2}$ . 3) Use safe labeling technique to assign safe pseudo-labels $\tilde{g}(\mathbf{X}_{s}^{(2)})$ for $\mathbf{X}_{s}^{(2)}$ . The regressor $g^{(2)}$ trained in the updated training data of $\textit{view}_{2}$ performs the same operation.

This paper proposes a safe co-training algorithm (SaCo) for semi-supervised regression (SSR). It mainly reduces the impact of the second issue by guaranteeing and improving the quality of pseudo-labels. Figure 1 shows the framework of SaCo. The contributions of SaCo are as follows:

A two-stage selection approach that selects easy-to-learn instances. In the first stage, $k$ NN is used to select candidate instances from a randomly picked subset of unlabeled instances. In the second stage, the regressors of the current view are used to select the most reliable unlabeled instances from candidates. It is time-consuming for a neural network to directly compute the performance gains that all unlabeled instances can bring to it. Our two-stage instance selection reduces the time consumption by reducing the number of instances the neural network needs to evaluate.

A safe labeling technique that assigns a reliable pseudo-label to the selected unlabeled instance. For the unlabeled instance selected in the current round, a prediction that is more reliable than the baseline is obtained by integrating the predictions of different views. Unlike many algorithms that directly assign pseudo-labels without guaranteeing their quality, this technique can guarantee and improve the quality of pseudo-labels to reduce the risk in model learning.

A label dynamic adjustment strategy that ensures the quality of pseudo-labels will not degrade as training progresses. Different from the general co-training framework, our algorithm dynamically updates the pseudo-labels. At each iteration, the model not only assigns a safe pseudo-label to the currently selected unlabeled instance, but also updates the previous ones. This strategy keeps the pseudo-labels from becoming out of date.

Experiments are conducted on 12 datasets that are commonly used to test regression algorithms. We test the reliability of safe pseudo-labels and compare SaCo with other state-of-the-art semi-supervised regression algorithms. The results show that SaCo can significantly outperform its counterparts. Statistical tests show that SaCo is significantly different from the other algorithms. The source code of SaCo can be found at https://github.com/105730841/SaCo.git.

The rest of the paper is organized as follows. Section 2 briefly introduces related work on co-training. Section 3 presents the proposed safe co-training framework. Section 4 presents the experimental setting and results, as well as a brief discussion. Finally, conclusions and directions for future work are given in Section 5.

2. Related work

This section reviews the related work of SaCo, including co-training and safe learning.

2.1 Co-training

Co-training [1] has attracted widespread attention due to its wide range of applications, and many variants have been gradually extended. For classification tasks, a co-training style algorithm [28] uses different algorithms instead of data features to divide views. The Co-EM algorithm [29] extends the standard co-training framework by assigning pseudo-labels to all unlabeled data at each iteration. The co-labeling algorithm [30] models the learning problem on each view as a weakly labeled learning problem, and then uses a set of pseudo-label vectors generated from classifiers of other views to learn the best classifier. DCT [26] employs two neural networks as regressors for views. It employs adversarial examples to achieve view difference, thereby avoiding network collapse. The SPamCo algorithm [31] incorporates self-paced learning and co-regularization into the co-training framework to remove inappropriate unlabeled instances, and it is also suitable for multi-view cases.

For regression tasks, the kernel regression algorithm coRLSR [32] includes a semi-parametric variant that greatly reduces the runtime when handling numerous unlabeled data. COREG [10] is the most representative single-view regression algorithm. It uses two $k$NN regressors with different distance metrics and $k$ values, and selects the most reliable pseudo-label by calculating the change in mean squared error (MSE) on labeled instances caused by adding unlabeled instances. It has been enhanced with interactive genetic algorithms (IGAs) to improve model efficiency [33]. CoBCReg [34] employs $n$ different RBF neural network regressors to extend single-view co-regression.

2.2 Safe learning

Since the use of unlabeled instances in semi-supervised learning may degrade model performance [35, 36, 37, 38], it is necessary to develop a safe prediction framework. Safe learning is a semi-supervised learning strategy that aims to assign more reliable pseudo-labels to unlabeled instances. These pseudo-labels should often improve, or at least not reduce, the performance of the learner after participating in training.

For classification tasks, S4VMs [37] is a safe semi-supervised support vector machine algorithm. This method employs multiple low-density separators to approximate the decision boundary and maximize the SVMs performance of candidate separators. A general safe framework WILLSVM [39] is suitable for various performance testing methods. The ICLS classifier [40] minimizes the squared loss of a set of parameters implied by the labeled data under all possible labels of the unlabeled data. It can guarantee that the performance of the model is better than that of supervised ones. Balsubramani and Freund [41] proposed a method to learn a highly accurate prediction by limiting the allocation of real labels to a specific candidate set. LEAD [42] is a safe large-margin separation method for graph-based semi-supervised learning. Its strategy is that high-quality graphs should have a large margin of separation from predictions on unlabeled data. The UMVP method [38] integrates multiple semi-supervised learners and obtains the final prediction by maximizing the worst-case performance gain. When the performance metric is top-$k$ precision, $F_{1}$ score or AUC, UMVP can effectively solve the minimax convex relaxation of the maximin optimization.

In regression, coRLSR [32] trains the model under multiple views, and enforces the consistency of predictions from multiple views through co-regularization. SAFER [43] regards the problem of safe prediction as a geometric projection issue. It learns safe predictions that are no worse than those of the supervised model from multiple semi-supervised regressors. SAFEW [44] further extends the framework of SAFER to weakly supervised learning.

3. The proposed method

In this section, we present the details of the SaCo algorithm. These include the general optimization problem of the model, the selection strategy for confident instances, the safe labeling technique used in training, and the algorithm description. Table 1 lists the notations used throughout the paper.

Table 1
Notations

Notation	Meaning	Comments
$\mathbf{X}_{l}$	The labeled dataset
$\mathbf{X}_{u}$	The unlabeled dataset
$\mathbf{X}_{s}$	The selected dataset	Store all selected instances
$x$	The instance
$L$	The loss function
$N_{l}$	The number of labeled instances
$N_{u}$	The number of unlabeled instances
$g$	The regressor
$g(x)$	The prediction of $x$ by $g$
$g(x;\mathbf{X})$	The prediction of $x$ by $g$ trained using $\mathbf{X}$
$y$	The ground-truth label of $x$
$\tilde{g}(x)$	The safe pseudo-label of $x$	Obtained by safe labeling technique
$\theta$	The parameters of regressor
$\alpha^{(i)}$	The weight of the regressor on the $i$ -th view
$\Delta_{x}$	The confidence of $x$

3.1 SaCo model

Let $\mathbf{X}_{l}$ and $\mathbf{X}_{u}$ denote the labeled set and the unlabeled set, respectively, and let $N_{l}$ and $N_{u}$ be the numbers of labeled and unlabeled instances. According to [35, 10, 2], deliberately selected unlabeled data helps to improve the performance of the model. Specifically, the loss of the model on the training set after utilizing these selected unlabeled instances should be smaller than before. Therefore, the reduction in loss caused by unlabeled instances can be utilized to calculate their gains. The larger the reduction, the higher the gains of the selected unlabeled instances. The optimization problem of SaCo is therefore as follows:

$\displaystyle\underset{\mathbf{X}_{s}\subseteq\mathbf{X}_{u}}{\operatorname{argmax}}\sum_{x_{i}\in\mathbf{X}_{l}}\left(L(y_{i},g(x_{i};\mathbf{X}_{l}))-L(y_{i},g(x_{i};\mathbf{X}_{l}\cup\mathbf{X}_{s}))\right),$ (1)

where $\mathbf{X}_{s}$ is the set of selected unlabeled data, $y_{i}$ is the ground-truth label of $x_{i}$ , $L$ is the loss function using mean square error (MSE), $g(x;\mathbf{X})$ denotes the prediction value of $x$ with the regressor trained using $\mathbf{X}$ .

3.2 Two-stage selection

Let $\mathbf{X}_{u}^{\prime}$ denote a subset obtained by random sampling from $\mathbf{X}_{u}$ . In each iteration, unlabeled instance $x$ is selected to maximize the performance improvement to the current regressor [10]. The performance improvement that $x$ brings to the learner is represented by $\Delta_{x}$ , and it is calculated as follows:

$\displaystyle\Delta_{x}=\sum_{x_{i}\in\mathbf{X}_{l}}((y_{i}-g(x_{i};\mathbf{X}_{l}))^{2}-(y_{i}-g(x_{i};\mathbf{X}_{l}\cup\{x\}))^{2}).$ (2)
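As a minimal sketch of Eq. (2), the snippet below computes $\Delta_{x}$ for a toy $k$NN regressor on one-dimensional features. The function names `knn_predict` and `confidence`, the 1-D setting, and the way the pseudo-label of $x$ is passed in explicitly are assumptions made for illustration and are not taken from the SaCo source code.

```python
def knn_predict(train, query, k=3):
    """Mean label of the k training points closest to `query` (1-D features)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def confidence(labeled, x, pseudo_label, k=3):
    """Eq. (2): reduction in squared error on the labeled set after adding x.

    `labeled` is a list of (x_i, y_i) pairs; (x, pseudo_label) is the
    candidate instance together with the label it would be trained with
    (an assumption of this sketch).
    """
    augmented = labeled + [(x, pseudo_label)]
    delta = 0.0
    for xi, yi in labeled:
        before = knn_predict(labeled, xi, k)    # g(x_i; X_l)
        after = knn_predict(augmented, xi, k)   # g(x_i; X_l ∪ {x})
        delta += (yi - before) ** 2 - (yi - after) ** 2
    return delta


labeled = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
print(confidence(labeled, 1.5, 1.5, k=2))   # consistent pseudo-label: positive gain
print(confidence(labeled, 1.5, 10.0, k=2))  # implausible pseudo-label: negative gain
```

A positive $\Delta_{x}$ means the labeled instances are fitted better once $x$ is added, which is the property the selection step looks for.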

More specifically, the selection of high-confidence unlabeled instance in SaCo is divided into two stages:

Use the $k$NN regressor to select candidate instances from the sample pool $\mathbf{X}_{u}^{\prime}$. For an unlabeled instance $x$, its $k$-nearest labeled data set $\Omega_{x}$ is used instead of $\mathbf{X}_{l}$. Candidate instances should greatly improve the performance of the $k$NN regressor. At this stage, Eq. (2) can be rewritten as:

$\displaystyle\Delta_{x}=\sum_{x_{i}\in\Omega_{x}}((y_{i}-g(x_{i};\mathbf{X}_{l}))^{2}-(y_{i}-g(x_{i};\mathbf{X}_{l}\cup\{x\}))^{2}).$ (3)

Use the supervised and semi-supervised regressors of the current view to select the most confident instance. For the unlabeled instance $x$, we can calculate the sum of the performance improvements it brings to the two regressors. The unlabeled instance with the highest confidence should maximize $\Delta_{x}$. At this stage, Eq. (2) can be rewritten as:

$\displaystyle\Delta_{x}=\sum_{x_{i}\in\mathbf{X}_{l}}\left[((y_{i}-g(x_{i};\mathbf{X}_{l}))^{2}-(y_{i}-g(x_{i};\mathbf{X}_{l}\cup\{x\}))^{2})+((y_{i}-g_{0}(x_{i};\mathbf{X}_{l}))^{2}-(y_{i}-g_{0}(x_{i};\mathbf{X}_{l}\cup\{x\}))^{2})\right],$ (4)

where $g_{0}(x;\mathbf{X})$ denotes the prediction value of $x$ with the supervised regressor trained using $\mathbf{X}$ .
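The two stages can be sketched as follows. This is an illustrative outline under several assumptions: features are one-dimensional, every regressor is a toy $k$NN model supplied as a `fit` callable, and the candidate $x$ is temporarily labeled with the current regressor's own prediction. The names `knn_fit`, `gain` and `two_stage_select`, and the parameter `n_candidates`, are hypothetical.

```python
import random


def knn_fit(train, k):
    """Return a 1-D kNN regressor (a closure) trained on (x, y) pairs."""
    def predict(q):
        nearest = sorted(train, key=lambda p: abs(p[0] - q))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict


def gain(fit, labeled, eval_points, x, y_hat):
    """Squared-error reduction on eval_points when (x, y_hat) joins the training set."""
    before = fit(labeled)
    after = fit(labeled + [(x, y_hat)])
    return sum((y - before(xi)) ** 2 - (y - after(xi)) ** 2 for xi, y in eval_points)


def two_stage_select(labeled, unlabeled, pool_size, n_candidates,
                     fit_semi, fit_sup, k=3, rng=random):
    # Random pool X_u' drawn from the unlabeled set.
    pool = rng.sample(unlabeled, min(pool_size, len(unlabeled)))

    # Stage 1 (Eq. 3): score each pooled instance with a kNN regressor,
    # evaluating only on its k nearest labeled neighbours Omega_x.
    def fit_knn(train):
        return knn_fit(train, k)

    scored = []
    for x in pool:
        y_hat = fit_knn(labeled)(x)
        omega = sorted(labeled, key=lambda p: abs(p[0] - x))[:k]
        scored.append((gain(fit_knn, labeled, omega, x, y_hat), x))
    candidates = [x for _, x in sorted(scored, reverse=True)[:n_candidates]]

    # Stage 2 (Eq. 4): sum the gains for the semi-supervised regressor g
    # and the supervised regressor g_0 over the whole labeled set.
    def stage2(x):
        y_hat = fit_semi(labeled)(x)
        return (gain(fit_semi, labeled, labeled, x, y_hat)
                + gain(fit_sup, labeled, labeled, x, y_hat))

    return max(candidates, key=stage2)
```

Only the shortlisted candidates reach stage 2, so the more expensive regressors are refitted `n_candidates` times per round rather than once for every pooled instance.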

3.3 Safe labeling technique in co-training

Denote the baseline prediction and the $i$ -th view prediction of $x\in\mathbf{X}_{u}$ by $g_{0}(x)$ and $g^{(i)}(x)$ , respectively. $y$ is the ground-truth label of $x$ . Suppose we have obtained $g^{(1)}(x)$ , $g^{(2)}(x)$ , where $g^{(i)}(x)\in\mathbb{R}^{N_{u}}$ , $i=1,2$ . Our goal is to find a more reliable pseudo-label $\tilde{g}(x)$ for $x$ than $g_{0}(x)$ . By using the MSE as the loss function, we can easily have the objective function as:

$\displaystyle\max_{g(x)\in\mathbb{R}}((g_{0}(x)-y)^{2}-(g(x)-y)^{2}).$ (5)

However, $y$ is obviously unknown. Suppose the weight $\alpha^{(i)}$ of the $i$-th regressor comes from a candidate convex set $\mathcal{M}=\{\bm{\alpha}\mid\mathbf{1}^{\mathrm{T}}\bm{\alpha}=1;\bm{\alpha}\geqslant\mathbf{0}\}$, which reflects the relation of individual learners in ensemble learning [45]. The larger $\alpha^{(i)}$ is, the closer $g^{(i)}(x)$ is to $y$. In the co-training framework, we can use the information from different views. So the mean value of the supervised regression predictions on the two views can be employed as the baseline, i.e. $g_{0}(x)=\frac{1}{2}\sum_{j=1}^{2}g_{0}^{(j)}(x)$. Then, the objective function can be rewritten as:

$\displaystyle\max_{g(x)\in\mathbb{R}}\sum_{i=1}^{2}\alpha^{(i)}\left(\left(\frac{1}{2}\sum_{j=1}^{2}g_{0}^{(j)}(x)-g^{(i)}(x)\right)^{2}-(g(x)-g^{(i)}(x))^{2}\right).$ (6)

In the absence of more information, our goal is to optimize the worst-case performance gains [37]. As long as good performance is achieved in the worst case, the quality of pseudo-label can be guaranteed. So the objective function becomes:

$\displaystyle\max_{g(x)\in\mathbb{R}}\min_{\bm{\alpha}\in\mathcal{M}}\sum_{i=1}^{2}\alpha^{(i)}\left(\left(\frac{1}{2}\sum_{j=1}^{2}g_{0}^{(j)}(x)-g^{(i)}(x)\right)^{2}-(g(x)-g^{(i)}(x))^{2}\right).$ (7)

By taking the derivative of Eq. (7) w.r.t. $g(x)$ and setting it to zero, we can get:

$\displaystyle g(x)=\sum_{i=1}^{2}\alpha^{(i)}g^{(i)}(x).$ (8)
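The step from Eq. (7) to Eq. (8) can be written out as below, treating $\bm{\alpha}$ as fixed and using $\sum_{i}\alpha^{(i)}=1$. The shorthand $J$ and $\bar{g}(x)$ is introduced here only for this worked derivation; the last line is the standard weighted bias-variance identity, which shows what the objective reduces to once the stationary point is inserted.

```latex
% Inner objective of Eq. (7) for fixed weights, with g_0(x) = \frac{1}{2}\sum_j g_0^{(j)}(x)
J(g(x),\bm{\alpha}) = \sum_{i=1}^{2}\alpha^{(i)}
  \left( (g_{0}(x)-g^{(i)}(x))^{2} - (g(x)-g^{(i)}(x))^{2} \right)

% Stationarity in g(x); the second derivative is -2 < 0, so this is a maximum
\frac{\partial J}{\partial g(x)} = -2\sum_{i=1}^{2}\alpha^{(i)}\,(g(x)-g^{(i)}(x)) = 0
\;\Longrightarrow\;
g(x) = \sum_{i=1}^{2}\alpha^{(i)} g^{(i)}(x) =: \bar{g}(x)

% Weighted bias-variance identity (the cross term vanishes by definition of \bar{g})
\sum_{i=1}^{2}\alpha^{(i)}(g_{0}(x)-g^{(i)}(x))^{2}
  = (g_{0}(x)-\bar{g}(x))^{2} + \sum_{i=1}^{2}\alpha^{(i)}(\bar{g}(x)-g^{(i)}(x))^{2}
\;\Longrightarrow\;
J(\bar{g}(x),\bm{\alpha}) = (\bar{g}(x)-g_{0}(x))^{2} \geq 0
```

The resulting value is never negative, which is consistent with the worst-case reading of the objective: under the stated weighting assumption, the combined prediction is not expected to be worse than the baseline.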

Then substituting Eq. (8) into Eq. (7), the equivalent form only related to $\bm{\alpha}$ is:

$\displaystyle\min_{\bm{\alpha}\in\mathcal{M}}\left(\sum_{i=1}^{2}\alpha^{(i)}g^{(i)}(x)-\frac{1}{2}\sum_{j=1}^{2}g_{0}^{(j)}(x)\right)^{2},$ (9)

which is a simple convex quadratic program. After expanding its quadratic form, it becomes:

$\displaystyle\min_{\bm{\alpha}\in\mathcal{M}}(\bm{\alpha}^{\mathrm{T}}\mathbf{G}\bm{\alpha}-\mathbf{v}^{\mathrm{T}}\bm{\alpha}),$ (10)

where the constant term that does not depend on $\bm{\alpha}$ has been dropped, $\mathbf{G}\in\mathbb{R}^{2\times 2}$ is a kernel matrix of $g^{(i)}(x)$, i.e., $G_{ij}=g^{(i)}(x)\cdot g^{(j)}(x)$, $i,j\in\{1,2\}$, and $\mathbf{v}=[2g^{(1)}(x)\frac{1}{2}\sum_{j=1}^{2}g_{0}^{(j)}(x),2g^{(2)}(x)\frac{1}{2}\sum_{j=1}^{2}g_{0}^{(j)}(x)]$. After using an optimization solver to obtain the best weight $\bm{\alpha^{\ast}}$, a safe prediction $\tilde{g}(x)=\sum_{i=1}^{2}{\alpha^{\ast}}^{(i)}g^{(i)}(x)$ can be obtained through Eq. (8).
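With only two views, the simplex constraint leaves a single free weight, so Eq. (9) can be solved in closed form instead of calling a general solver. The sketch below relies on that observation; it is an illustrative shortcut rather than the solver used in the SaCo code, and the names `safe_label` and `qp_objective` are hypothetical.

```python
def safe_label(g1, g2, g0_1, g0_2):
    """Solve Eq. (9) for two views; return (safe prediction, weights).

    g1, g2     : semi-supervised predictions g^(1)(x), g^(2)(x)
    g0_1, g0_2 : baseline supervised predictions g_0^(1)(x), g_0^(2)(x)
    """
    g0 = 0.5 * (g0_1 + g0_2)                    # averaged baseline
    if g1 == g2:
        alpha1 = 0.5                            # objective is flat in alpha; any weight works
    else:
        # Unconstrained minimiser of (a*g1 + (1-a)*g2 - g0)^2, clipped to [0, 1].
        alpha1 = min(1.0, max(0.0, (g0 - g2) / (g1 - g2)))
    alpha = (alpha1, 1.0 - alpha1)
    return alpha[0] * g1 + alpha[1] * g2, alpha


def qp_objective(alpha, g, g0):
    """Eq. (10): alpha^T G alpha - v^T alpha with G_ij = g_i*g_j and v_i = 2*g_i*g0."""
    quad = sum(alpha[i] * g[i] * g[j] * alpha[j] for i in range(2) for j in range(2))
    lin = sum(2.0 * g[i] * g0 * alpha[i] for i in range(2))
    return quad - lin


print(safe_label(1.0, 3.0, 2.5, 2.5))   # baseline lies between the two views
print(safe_label(1.0, 3.0, 4.0, 5.0))   # baseline lies outside: clipped to the nearer view
```

In this two-view case the safe prediction is the baseline $g_{0}(x)$ projected onto the interval spanned by $g^{(1)}(x)$ and $g^{(2)}(x)$, which matches the geometric-projection view of safe prediction mentioned in Section 2.2.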

Algorithm 3.3: SaCo
Input: Labeled set $\mathbf{X}_{l}$, unlabeled set $\mathbf{X}_{u}$, pool size $P$ and max iteration $T$.
Output: Model parameter $\mathbf{\Theta}=(\theta^{(1)},\theta^{(2)})$.
1: $\mathbf{\Theta}=(\theta^{(1)},\theta^{(2)})$, $\mathbf{\Theta}_{0}=(\theta_{0}^{(1)},\theta_{0}^{(2)})$; //Initialize the semi-supervised regressors and the baseline supervised regressors
2: $\mathbf{X}_{s}^{(j)}=\emptyset$; //Selected dataset
3: $t=1$; //Current training round
4: while ($t<T$ $\&\&$ has available data) do
5:     for ($j$ $\leftarrow$ 1 to 2) do
6:         Randomly select $P$ unlabeled instances;
7:         Select candidate instances according to Eq. (3);
8:         Select the most confident instance $x$ from the candidates according to Eq. (4);
9:         $\mathbf{X}_{s}^{(j)}=\mathbf{X}_{s}^{(j)}\cup\{x\}$; //Update the selected dataset
10:        Update the safe pseudo-labels $\tilde{g}(\mathbf{X}_{s}^{(j)})$ according to Algorithm 3.4;
11:        $\mathbf{X}_{l}\cup\mathbf{X}_{s}^{(j)}$; //Update labeled set
12:        Update $\theta^{(j)}$ based on the new labeled set;
13:    end for
14: end while
15: return $\mathbf{\Theta}$;

3.4 Algorithm description

Algorithm 3.3 presents the details of SaCo.

The inputs of our model include the labeled dataset $\mathbf{X}_{l}=\{(x_{i},y_{i})\}_{i=1}^{N_{l}}$, the unlabeled dataset $\mathbf{X}_{u}=\{x_{i}\}_{i=1}^{N_{u}}$, the number $P$ of randomly selected unlabeled instances, and the number of iteration rounds $T$.

The first step is to initialize the parameters in our model. The semi-supervised regressors $\mathbf{\Theta}=(\theta^{(1)},\theta^{(2)})$ and the baseline supervised regressors $\mathbf{\Theta}_{0}=(\theta_{0}^{(1)},\theta_{0}^{(2)})$ are trained on the labeled dataset. In particular, the parameters of the baseline supervised regressors are not updated afterwards. The selected dataset $\mathbf{X}_{s}^{(j)}$ for the $j$-th view is initialized to an empty set.

The second step is to update $\mathbf{X}_{s}^{(j)}$ . Randomly select $P$ unlabeled instances to form the sample pool $\mathbf{X}^{\prime}_{u}$ . The regressors on $\textit{view}_{(3-j)}$ select the most confident instance from $\mathbf{X}^{\prime}_{u}$ according to the two-stage selection and add it to $\mathbf{X}_{s}^{(j)}$ .

The third step is to update the safe pseudo-labels $\tilde{g}(\mathbf{X}_{s}^{(j)})$ . For a selected instance $x$ , its safe prediction $\tilde{g}(x)$ can be learned by Algorithm 3.4.

The last step is to update the parameters of the $j$ -th regressor with the new labeled set $\mathbf{X}_{l}\cup\mathbf{X}_{s}^{(j)}$ .

The training will end when there is no unlabeled data available or the iteration round reaches the maximum. It is easy to see that the flow of SaCo is very similar to that of the standard co-training. It exchanges confident instances in different views and trains the learners in an iterative manner. The most significant improvement of our algorithm is the use of the safe labeling technique to improve the quality of the pseudo-labels, especially in the early stages of training.
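The overall flow, including the refresh of earlier pseudo-labels at every round, can be outlined as below. This is a simplified sketch rather than the released implementation: the two views are toy 1-D $k$NN regressors with different $k$ instead of neural networks, the two-stage selection is replaced by a random pick from the pool to keep the example short, and the safe label uses the two-view closed form of Eq. (9). The names `saco_sketch`, `knn_fit` and `safe_label` are hypothetical.

```python
import random


def knn_fit(train, k):
    """Return a 1-D kNN regressor (a closure) trained on (x, y) pairs."""
    def predict(q):
        nearest = sorted(train, key=lambda p: abs(p[0] - q))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict


def safe_label(g1, g2, g0):
    """Two-view solution of Eq. (9): project the baseline g0 onto [min(g1,g2), max(g1,g2)]."""
    lo, hi = min(g1, g2), max(g1, g2)
    return min(hi, max(lo, g0))


def saco_sketch(labeled, unlabeled, pool_size=5, max_iter=10, ks=(2, 3), seed=0):
    rng = random.Random(seed)
    remaining = list(unlabeled)
    baselines = [knn_fit(labeled, k) for k in ks]   # supervised regressors, never updated
    views = [knn_fit(labeled, k) for k in ks]       # semi-supervised regressors
    selected = [[], []]                             # X_s^(1), X_s^(2)
    pseudo = [{}, {}]                               # current safe pseudo-labels per view
    t = 1
    while t < max_iter and remaining:
        for j in range(2):
            if not remaining:
                break
            pool = rng.sample(remaining, min(pool_size, len(remaining)))
            x = rng.choice(pool)                    # stand-in for the two-stage selection
            remaining.remove(x)
            selected[j].append(x)
            # Label dynamic adjustment: relabel every instance selected so far,
            # not only the newest one, using the current regressors.
            for xs in selected[j]:
                g0 = 0.5 * (baselines[0](xs) + baselines[1](xs))
                pseudo[j][xs] = safe_label(views[0](xs), views[1](xs), g0)
            # Retrain the j-th regressor on X_l ∪ X_s^(j).
            augmented = labeled + [(xs, pseudo[j][xs]) for xs in selected[j]]
            views[j] = knn_fit(augmented, ks[j])
        t += 1
    return views, pseudo
```

Calling `saco_sketch(labeled, unlabeled, max_iter=4)` runs three rounds and returns the two updated regressors together with the pseudo-labels each view currently holds.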

Algorithm 3.4: Safe labeling
Input: Unlabeled data $x$, semi-supervised predictions $g^{(1)}(x)$, $g^{(2)}(x)$ and baseline supervised predictions $g_{0}^{(1)}(x)$, $g_{0}^{(2)}(x)$.
Output: Safe prediction $\tilde{g}(x)$.
1: Construct a linear kernel matrix $\mathbf{G}$ where $G_{ij}=g^{(i)}(x)\cdot g^{(j)}(x)$, $i,j\in\{1,2\}$;
2: Use the average of the two supervised predictions as a baseline $g_{0}(x)=\frac{1}{2}\sum_{j=1}^{2}g_{0}^{(j)}(x)$;
3: Derive a vector $\mathbf{v}=[2g^{(1)}(x)\cdot g_{0}(x),2g^{(2)}(x)\cdot g_{0}(x)]$;
4: Solve Eq. (10) to obtain the optimal weights $\bm{\alpha^{*}}=[\alpha^{*(1)},\alpha^{*(2)}]$;
5: return $\tilde{g}(x)=\sum_{j=1}^{2}\alpha^{*(j)}g^{(j)}(x)$;

4. Experiments

In this section, we analyze the effectiveness of the SaCo algorithm through experimental results. From these experiments, we answer the following questions:

Does the safe labeling technique improve the quality of pseudo-labels?

Is the SaCo algorithm more accurate than other popular semi-supervised regression algorithms?

Is the SaCo algorithm more robust and interpretable than other algorithms?

4.1 Experimental setting

Table 2 lists the 12 publicly available datasets used in our experiments. Seven datasets (Abalone, Electrical, Elevators, Parkinsons, Puma8NH, SeoulBikeData and Wine_quality) are from the UCI machine learning data repository, three (Bank8FM, Cpu_small and Kin8nm) are from the Delve repository, and the other two (Space_ga and Wind) are from the StatLib data repository. These datasets cover different domains, including physics (Electrical), biomedicine (Parkinsons) and business (Wine_quality). The number of instances ranges from 3107 (Space_ga) to 10000 (Electrical), and the number of features ranges from 6 (Space_ga) to 21 (Parkinsons).

Table 2
Datasets

Dataset	Size	Features	Source
Abalone	4177	8	UCI
Bank8FM	8192	9	Delve
Cpu_small	8192	12	Delve
Electrical	10000	13	UCI
Elevators	9517	7	UCI
Kin8nm	8192	8	Delve
Parkinsons	5875	21	UCI
Puma8NH	8192	8	UCI
SeoulBikeData	8760	14	UCI
Space_ga	3107	6	StatLib
Wind	6574	14	StatLib
Wine_quality	6497	11	UCI

For each dataset, 2000 instances are randomly selected as the training set, and the rest are used as the testing set. The instances in the training set are further split into labeled and unlabeled parts. The number of labeled instances is set to 50, 100, 200, and 400, while the rest of the training data is used as the unlabeled part.
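The split protocol can be illustrated with a short index-based sketch. The function name `split_dataset` and the use of a fixed seed are assumptions for illustration; the paper does not specify how the random draws are implemented.

```python
import random


def split_dataset(n_instances, n_labeled, n_train=2000, seed=0):
    """Return index lists (labeled, unlabeled, test) for one random trial."""
    rng = random.Random(seed)
    indices = list(range(n_instances))
    rng.shuffle(indices)
    train, test = indices[:n_train], indices[n_train:]
    # The first n_labeled training instances keep their labels; the rest are unlabeled.
    return train[:n_labeled], train[n_labeled:], test


# Abalone has 4177 instances (Table 2); use 50 labeled instances.
labeled_idx, unlabeled_idx, test_idx = split_dataset(4177, 50)
print(len(labeled_idx), len(unlabeled_idx), len(test_idx))  # 50 1950 2177
```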

We choose the root mean square error (RMSE) and the coefficient of determination ( $R^{2}$ ) to evaluate the performance of the algorithm. RMSE is a quadratic scoring rule that measures the mean magnitude of the error. It is the square root of the mean of the squared differences between the predicted values and the ground-truth. The RMSE score is given by:

$\displaystyle\textit{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i}-g(x_{i}))^{2}},$ (11)

where $y_{i}$ is the ground-truth of $x_{i}$ , $g(x_{i})$ is the prediction and $n$ is the number of instances. $R^{2}$ is the ratio of the variance explained by the independent variable to its total variance in the regression, which is used to measure the fitness of the model to the data. The $R^{2}$ score is given by:

$\displaystyle R^{2}=\frac{\sum_{i=1}^{n}(g(x_{i})-\bar{y})^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}=1-\frac{\sum_{i=1}^{n}(y_{i}-g(x_{i}))^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}},$ (12)

where $\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_{i}$ .
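Both metrics are straightforward to compute. The sketch below follows Eq. (11) and the right-hand form of Eq. (12); the function names are hypothetical. Note that the two forms in Eq. (12) coincide for a least-squares fit with an intercept but need not coincide for arbitrary regressors, so the residual-based form is used here.

```python
import math


def rmse(y_true, y_pred):
    """Eq. (11): square root of the mean squared difference."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Eq. (12), right-hand form: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
print(rmse(y_true, y_pred))      # 1.0
print(r2_score(y_true, y_pred))  # approximately 0.2
```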

Four state-of-the-art algorithms are included in the experiment for comparison. (1) A popular co-training style semi-supervised regression algorithm, COREG (TKDE 2005) [10]. In COREG, two Self-3NN methods are used as base regressors with Euclidean and Mahalanobis distance metrics. (2) A safe semi-supervised regression algorithm, SAFER (AAAI 2017) [43]. It takes a supervised prediction as a baseline and ensembles safe predictions from multiple semi-supervised regressors. In SAFER, the model parameters are set to the values recommended in the package. It consists of a Self-LS and two Self-3NNs using Euclidean and cosine distance metrics, respectively. (3) A multi-scheme semi-supervised regression approach, MSSRA (PRL 2019) [46]. It selects high-confidence instances by minimizing the difference in the predictions of multiple regressors. In MSSRA, the base regressors use the default settings of the WEKA implementation. Random Forest, SMOReg, and M5 are used as semi-supervised regressors for different views. (4) A graph-based semi-supervised regression algorithm, BHD (Applied Soft Computing 2021) [47]. In BHD, heat diffusion with a boundary condition is employed to guarantee a closed-form solution. In SaCo, two basic networks with 3 hidden layers ($32\times 64\times 32$ and $64\times 64\times 64$) are used as the base regressors. Figure 2 shows the architecture of the two networks.

Table 3

The average RMSE of the safe pseudo-labels and the original pseudo-labels over 20 tests on the twelve datasets. The better result in each pair is marked with $\bullet$

Datasets	$N_{l}=50$		$N_{l}=100$
	Pseudo-labels	Safe pseudo-labels	Pseudo-labels	Safe pseudo-labels
Abalone	0.0934	0.0925 $\bullet$	0.0906	0.0899 $\bullet$
Bank8FM	0.0910	0.0766 $\bullet$	0.0727	0.0619 $\bullet$
Cpu_small	0.0783	0.0777 $\bullet$	0.0528	0.0522 $\bullet$
Electrical	0.1288	0.1280 $\bullet$	0.1191	0.1182 $\bullet$
Elevators	0.0582	0.0580 $\bullet$	0.0575	0.0572 $\bullet$
Kin8nm	0.1541	0.1535 $\bullet$	0.1460	0.1448 $\bullet$
Parkinsons	0.0737	0.0732 $\bullet$	0.0701	0.0695 $\bullet$
Puma8NH	0.0861	0.0753 $\bullet$	0.0703	0.0608 $\bullet$
SeoulBike	0.1358	0.1356 $\bullet$	0.1273	0.1258 $\bullet$
Space_ga	0.0499	0.0497 $\bullet$	0.0490	0.0488 $\bullet$
Wind	0.0916	0.0911 $\bullet$	0.0879	0.0872 $\bullet$
Wine_quality	0.1320	0.1320 $\bullet$	0.1291	0.1289 $\bullet$
	$N_{l}=200$		$N_{l}=400$
	Pseudo-labels	Safe pseudo-labels	Pseudo-labels	Safe pseudo-labels
Abalone	0.0903	0.0897 $\bullet$	0.0794	0.0789 $\bullet$
Bank8FM	0.0738	0.0650 $\bullet$	0.0442	0.0435 $\bullet$
Cpu_small	0.0400	0.0393 $\bullet$	0.0357	0.0352 $\bullet$
Electrical	0.1142	0.1132 $\bullet$	0.0741	0.0719 $\bullet$
Elevators	0.0567	0.0565 $\bullet$	0.0542	0.0541 $\bullet$
Kin8nm	0.1410	0.1397 $\bullet$	0.1397	0.1386 $\bullet$
Parkinsons	0.0676	0.0670 $\bullet$	0.0587	0.0579 $\bullet$
Puma8NH	0.0626	0.0553 $\bullet$	0.0436	0.0430 $\bullet$
SeoulBike	0.1234	0.1217 $\bullet$	0.1083	0.1074 $\bullet$
Space_ga	0.0487	0.0485 $\bullet$	0.0456	0.0449 $\bullet$
Wind	0.0863	0.0855 $\bullet$	0.0778	0.0774 $\bullet$
Wine_quality	0.1277	0.1272 $\bullet$	0.1239	0.1235 $\bullet$

Figure 2.

Two fully connected networks are used as weak learners. To encourage view difference, we adopted the following settings: 1) $g^{(1)}$ and $g^{(2)}$ have different hidden layer settings, i.e., $32\times 64\times 32$ for $g^{(1)}$ and $64\times 64\times 64$ for $g^{(2)}$ ; 2) $g^{(1)}$ has a shortcut connection that $g^{(2)}$ does not have.

All comparison algorithms are derived from the source code provided by the authors and use the same optimal settings as in the reference. For COREG and MSSRA, the code runs on the Weka platform. For SAFER, the code runs on the Matlab platform. For BHD and SaCo, the code runs on the Python platform.

4.2 Experiments for safe labeling technique

We validate the effectiveness of the safe labeling technique by comparing the difference in the RMSE of the safe pseudo-labels and the original pseudo-labels. Table 3 lists the RMSEs of the two types of labels for different datasets. Each dataset was run 20 times with different labeled data size settings. From the results, it can be observed:

The safe pseudo-labels are more reliable than the original pseudo-labels. Under the four labeled data sizes of 50, 100, 200 and 400, the RMSE of the learner trained with the safe pseudo-labels was always smaller than that of the learner trained with the original pseudo-labels. For Kin8nm, the performance with the safe pseudo-labels at 200 labels was comparable to that with the original pseudo-labels at 400 labels (0.1397 in both cases). In particular, on Puma8NH the learner with safe pseudo-labels at 100 labels (0.0608) performed better than the learner with the original pseudo-labels at 200 labels (0.0626).

The quality of these two types of labels maintains a consistent trend of change. As the number of labeled data increased, the performance of both the original pseudo-labels and the safe pseudo-labels gradually improved. This is because the safe pseudo-labels are generated on the basis of the original pseudo-labels.
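To put the differences in Table 3 on a common scale, the relative RMSE reduction can be computed directly from the reported values. The short sketch below does this for three datasets at $N_{l}=50$; the numbers are copied from Table 3, while the helper name `relative_reduction` is hypothetical.

```python
# (original pseudo-label RMSE, safe pseudo-label RMSE) at N_l = 50, from Table 3
table3_nl50 = {
    "Abalone": (0.0934, 0.0925),
    "Bank8FM": (0.0910, 0.0766),
    "Puma8NH": (0.0861, 0.0753),
}


def relative_reduction(original, safe):
    """Fraction by which the safe pseudo-labels reduce the RMSE."""
    return (original - safe) / original


for name, (original, safe) in table3_nl50.items():
    print(f"{name}: {100 * relative_reduction(original, safe):.1f}%")
# Abalone: 1.0%
# Bank8FM: 15.8%
# Puma8NH: 12.5%
```

The reduction is therefore far from uniform: it is large on Bank8FM and Puma8NH and small on datasets such as Abalone.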

4.3 Comparison with other methods

To validate the performance of SaCo, we compared it with four other state-of-the-art SSR algorithms. Each dataset was run 20 times at different labeled data size settings. Figure 3 shows the RMSE results of different algorithms under four label sizes. It can be observed that SaCo has superior performance compared to other algorithms, as follows:

When the number of labeled data is 50, SaCo achieves the best performance on 9 of the 12 datasets. In Abalone, Cpu_small, Electrical, Kin8nm, Parkinsons, SeoulBikeData, and Wine_quality, SaCo achieved significant improvement compared to other algorithms. In Bank8FM, Puma8NH and Space_ga, SaCo achieved the second-best performance. For these datasets, SaCo does not perform as well as MSSRA, but the gap is not obvious. It suggests that the selection of confident instances by the consistency principle may be applicable to these datasets. In Elevators and Wind, SaCo had a marginal improvement over MSSRA. One of the reasons for this might be that SaCo has adopted the safe labeling technique to improve the quality of pseudo-labels.

As the label size increases, SaCo achieves more performance gains than other algorithms. SaCo achieved the best performance on at least 9 of the 12 datasets and the second-best on 2 datasets under all label size settings. In Cpu_small, Electrical, Parkinsons, SeoulBikeData and Wine_quality, the performance of SaCo at 100 labels approached or exceeded that of other algorithms at 200 labels. In Bank8FM and Puma8NH, SaCo and MSSRA had similar performance. Both of them significantly outperformed the other three algorithms. In Space_ga, the performance of COREG was ahead of the other algorithms, and SaCo outperformed it only at 400 labels. The reason for this may be that the $k$NN regressor performs better on this dataset.

Figure 3.

RMSE comparison of different methods on 12 datasets.

Figure 4.

$R^{2}$ comparison of different methods on 12 datasets.

Figure 4 shows the $R^{2}$ results of each algorithm with different label sizes. It can be observed that SaCo performed better than other algorithms on most datasets (8 out of 12) under all 4 label size settings. SaCo also achieved the second-best performance on the datasets where it did not achieve the best performance. In Electrical, Kin8nm, Puma8NH, SeoulBikeData and Wine_quality, the $R^{2}$ value of SaCo was significantly higher than that of other algorithms. This indicates that SaCo is more robust and has better interpretability than other algorithms. We noted that the MSSRA algorithm, which is closer to SaCo in prediction performance, does not perform as well in the $R^{2}$ test. As the label size increased, the $R^{2}$ value of MSSRA fluctuated and did not keep growing in Cpu_small and SeoulBikeData. In the Kin8nm and Parkinsons datasets, MSSRA did not perform as well as SaCo, SAFER and COREG. These results indicate that SaCo and MSSRA differ significantly in interpretability.

Since different algorithms share the same random trials, we further performed statistical tests. The pairwise $t$-test at a significance level of 0.05 was applied to determine whether there is a significant difference between SaCo and the other state-of-the-art algorithms. The $p$-values of the $t$-test under 50 and 100 labels are reported in Table 4. The $p$-values between SaCo and COREG are less than 0.05 on most datasets, the exceptions being Kin8nm and Space_ga at 50 labels. A similar situation occurred between SaCo and SAFER, with a $p$-value above 0.05 only on Kin8nm at 50 labels. The $p$-values between SaCo and BHD are less than 0.05 on all datasets. These results indicate that the predictions of SaCo differ significantly from those of COREG, SAFER and BHD. The performance of MSSRA was relatively close to that of SaCo, but there were still significant differences on 7 of the 12 datasets.
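Because the algorithms are evaluated on the same random trials, a paired form of the $t$-test is a natural reading of this procedure. The sketch below computes the paired $t$ statistic in plain Python, assuming that interpretation. The function name and the RMSE values are hypothetical and serve only to show the calculation; converting the statistic to a $p$-value requires the Student $t$ distribution, which the Python standard library does not provide.

```python
import math


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # unbiased sample variance
    return mean / math.sqrt(var / n), n - 1


# Hypothetical per-trial RMSE values for two algorithms on the same four trials.
rmse_other = [0.104, 0.098, 0.101, 0.107]
rmse_saco = [0.095, 0.093, 0.094, 0.097]
t, dof = paired_t_statistic(rmse_other, rmse_saco)
print(t, dof)
# With 20 trials (19 degrees of freedom), |t| > 2.093 corresponds to p < 0.05
# for a two-sided test, according to standard t tables.
```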

Table 4

The $p$-values of pairwise $t$-tests (5% significance level) between the comparison methods and SaCo on 50 and 100 labeled instances. Values below 0.05 are marked with $\bullet$.

	Datasets	COREG	SAFER	MSSRA	BHD
$N_{l}=50$	Abalone	1.43E-02 $\bullet$	1.12E-06 $\bullet$	4.61E-01	7.98E-19 $\bullet$
	Bank8FM	7.54E-25 $\bullet$	5.90E-27 $\bullet$	3.10E-03 $\bullet$	4.01E-25 $\bullet$
	Cpu_small	6.63E-07 $\bullet$	5.37E-05 $\bullet$	2.34E-02 $\bullet$	3.50E-09 $\bullet$
	Electrical	6.59E-14 $\bullet$	9.26E-12 $\bullet$	4.92E-03 $\bullet$	4.71E-19 $\bullet$
	Elevators	2.96E-09 $\bullet$	6.35E-10 $\bullet$	9.04E-01	4.06E-23 $\bullet$
	Kin8nm	3.87E-01	1.78E-01	1.78E-01	1.77E-10 $\bullet$
	Parkinsons	9.55E-18 $\bullet$	9.62E-16 $\bullet$	2.63E-03 $\bullet$	2.58E-32 $\bullet$
	Puma8NH	4.35E-25 $\bullet$	5.68E-27 $\bullet$	4.24E-02 $\bullet$	3.96E-27 $\bullet$
	SeoulBikeData	4.42E-06 $\bullet$	2.90E-04 $\bullet$	4.41E-02 $\bullet$	1.14E-14 $\bullet$
	Space_ga	5.53E-01	3.13E-04 $\bullet$	6.38E-01	2.23E-13 $\bullet$
	Wind	6.84E-15 $\bullet$	1.92E-13 $\bullet$	1.50E-01	5.78E-24 $\bullet$
	Wine_quality	2.96E-05 $\bullet$	4.84E-06 $\bullet$	1.94E-03 $\bullet$	1.54E-15 $\bullet$
$N_{l}=100$	Abalone	1.69E-02 $\bullet$	5.04E-09 $\bullet$	5.56E-03 $\bullet$	4.51E-23 $\bullet$
	Bank8FM	3.04E-26 $\bullet$	4.85E-29 $\bullet$	8.09E-01	3.87E-31 $\bullet$
	Cpu_small	2.51E-08 $\bullet$	3.87E-04 $\bullet$	2.25E-05 $\bullet$	1.79E-17 $\bullet$
	Electrical	1.58E-17 $\bullet$	8.38E-17 $\bullet$	1.90E-05 $\bullet$	6.56E-26 $\bullet$
	Elevators	9.92E-14 $\bullet$	5.54E-13 $\bullet$	5.68E-01	4.77E-28 $\bullet$
	Kin8nm	3.26E-04 $\bullet$	4.47E-07 $\bullet$	8.87E-07 $\bullet$	3.56E-17 $\bullet$
	Parkinsons	4.10E-14 $\bullet$	1.42E-12 $\bullet$	1.17E-03 $\bullet$	1.20E-33 $\bullet$
	Puma8NH	9.01E-28 $\bullet$	9.23E-31 $\bullet$	4.24E-02 $\bullet$	7.48E-39 $\bullet$
	SeoulBikeData	1.86E-08 $\bullet$	3.06E-08 $\bullet$	5.87E-02	2.34E-24 $\bullet$
	Space_ga	4.93E-06 $\bullet$	1.24E-04 $\bullet$	7.23E-01	6.43E-15 $\bullet$
	Wind	1.89E-14 $\bullet$	2.61E-14 $\bullet$	3.80E-03 $\bullet$	1.04E-29 $\bullet$
	Wine_quality	8.40E-07 $\bullet$	9.98E-07 $\bullet$	3.76E-05 $\bullet$	1.91E-22 $\bullet$

Figure 5 shows the results of the Bonferroni-Dunn post-hoc test for $\alpha=0.05$ on the 12 datasets. The results indicate that SaCo outperformed the other four algorithms. SaCo is significantly different from all the other algorithms when the number of labels is 50 and 400. There is no consistent evidence of a statistical difference between SaCo and MSSRA at 100 and 200 labels. However, SaCo is still significantly better than the other three algorithms.

Figure 5.

Performance comparison of the SaCo algorithm against the others by the Bonferroni-Dunn test with $CD=1.1171$.
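The quantities behind such a comparison can be sketched as follows. The sketch assumes the usual formulation of the Bonferroni-Dunn test, in which two methods differ significantly when their average ranks differ by more than $CD = q_{\alpha}\sqrt{k(k+1)/(6N)}$ for $k$ methods and $N$ result sets. The function names and the toy scores are hypothetical, and $q_{\alpha}$ has to be taken from a table of critical values (2.498 is the commonly tabulated two-tailed value for $k=5$ at $\alpha=0.05$).

```python
import math

def average_ranks(scores):
    """Average rank of each method over datasets (rank 1 = lowest error).

    scores[i][j] is the error of method j on dataset i; tied methods
    share the mean of the ranks they span.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1..j+1
            for idx in order[i:j + 1]:
                totals[idx] += shared
            i = j + 1
    return [t / len(scores) for t in totals]

def critical_difference(k, n, q_alpha):
    """Bonferroni-Dunn critical difference for k methods on n result sets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# Toy errors for three methods on two datasets (hypothetical values).
toy_scores = [[0.2, 0.3, 0.3],
              [0.1, 0.4, 0.2]]
print(average_ranks(toy_scores))
print(critical_difference(5, 25, 2.498))
```

Under these assumptions, $k=5$ and $N=25$ give a value of about 1.1171; which count was used for $N$ is not stated in this section, so $N$ is left as a parameter.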

4.4 Discussions

Now we can answer the questions posed at the beginning of this section.

The safe labeling technique can effectively improve the quality of pseudo-labels. Table 3 shows that the performance of the learner is enhanced with safe pseudo-labels on all 12 datasets. For every number of labeled instances, the safe pseudo-labels consistently have better quality than the original pseudo-labels. On some datasets, safe pseudo-labels obtained with few labeled instances are of similar or even higher quality than the original pseudo-labels obtained with many labeled instances. All this shows that the quality improvement of pseudo-labels brought by the safe labeling technique is comprehensive and stable.

SaCo is more accurate than state-of-the-art SSR algorithms, including COREG, SAFER, MSSRA and BHD. This is validated by Figure 3, Figure 5 and Table 4. Unfortunately, the performance of SaCo is weaker than that of other algorithms on some datasets when the number of labels is small. This may be due to the poor performance of our instance selection strategy on these datasets.

SaCo is more robust and has better interpretability than the other algorithms. Figure 4 shows that the fitting performance of the regression model obtained by SaCo is better than that of the other algorithms. Furthermore, the performance of SaCo improves steadily as the label size increases on all datasets. Although MSSRA does not have a significant performance gap with SaCo, it is unstable on some datasets such as Cpu_small and SeoulBikeData, which is another difference between these two algorithms.

The datasets used in the experiments do not contain missing values. Although they contain outliers and noise, we do not handle them explicitly. Instead, our instance selection strategy and safe labeling technique handle them implicitly. On the one hand, the two-stage selection strategy evaluates the performance gain that each unlabeled instance can bring to the current regressor in each iteration. Outliers tend to have low gains and are therefore unlikely to be selected as confident instances. On the other hand, the safe labeling technique can reduce noise by improving the quality of pseudo-labels.

5. Conclusion and further work

In this paper, we proposed the SaCo algorithm to improve the quality of pseudo-labels selected during co-training. A safe labeling technique was designed for the co-training framework to learn safe pseudo-labels. A label dynamic adjustment strategy was used to keep the safe pseudo-labels up-to-date. The results not only validated the effectiveness of safe pseudo-labels, but also demonstrated the advantages of SaCo over state-of-the-art SSR methods.

The following research topics deserve further investigation:

Make use of prior knowledge. In some domains, it is more beneficial to achieve the view difference by splitting the dataset with prior knowledge. However, SaCo, as a single-view style algorithm, ignores this. In future work, the use of relevant domain knowledge to split views will be considered.

More efficient evaluation measures to select reliable instances. SaCo evaluates the quality of selected instances by computing the performance gains on labeled data in two stages. This method helps to select unlabeled instances that are easy to learn, but its time cost can be reduced further. The evaluation measure should be fine-tuned or modified to improve the efficiency of selecting high-confidence instances. A new strategy based on minimizing the prediction gap between different views might be a good option.

Better safe pseudo-labels for selected instances. The performance of safe pseudo-labels is jointly determined by the original pseudo-labels and a linear convex set constructed from the predictions of the two views. Different convex sets can be constructed in other ways instead of linear combinations, possibly resulting in safe pseudo-labels with better quality.

Extension to the multi-view scenario. SaCo is implemented for the two-view case, but it could be extended to the multi-view scenario. The multi-view case could provide more choices for baseline labels, and the constructed candidate convex set would also be richer.

In summary, SaCo is a general algorithmic framework that can be enriched in the future.
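The role of the convex set mentioned in the third topic above can be illustrated with a small sketch. It assumes, in the style of SAFER, that the safe label vector is the Euclidean projection of a baseline label vector onto the set of convex combinations of the two views' predictions; this is an assumption made for illustration and may differ from the exact formulation used by SaCo. The function name and the toy values are hypothetical.

```python
def safe_pseudo_labels(baseline, view1, view2):
    """Project baseline labels onto the segment between two views' predictions.

    All arguments are equal-length lists of floats, one entry per selected
    instance. Returns (alpha, labels), where
    labels = alpha * view1 + (1 - alpha) * view2
    is the point of that segment closest to baseline.
    """
    diff = [a - b for a, b in zip(view1, view2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:  # both views agree, so the set is a single point
        return 1.0, list(view1)
    num = sum((f0 - b) * d for f0, b, d in zip(baseline, view2, diff))
    alpha = min(1.0, max(0.0, num / denom))  # clip to keep a convex weight
    return alpha, [alpha * a + (1 - alpha) * b for a, b in zip(view1, view2)]

# Toy example (hypothetical values): the baseline lies between the two views.
alpha, labels = safe_pseudo_labels([1.0, 2.0], [2.0, 2.0], [0.0, 2.0])
print(alpha, labels)
```

With more than two views, or with a convex set built differently from linear combinations, the closed-form weight above would be replaced by a small constrained optimization over the combination weights.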

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants No. 62136002 and No. 61876027, and in part by the Central Government Funds of Guiding Local Scientific and Technological Development under Grant No. 2021ZYD0003.

Appendix A. Two-stage instance selection

Figure 6.

A running example of the two-stage instance selection with the following settings: 1) the data are from the kin8nm dataset; 2) the value of the conditional attribute is rounded to two decimal places; and 3) 3NN is used for the first stage.

