Abstract
Nowadays, Semi-Supervised Learning lies at the core of the Machine Learning field trying to effectively exploit unlabeled data as much as possible, together with a small amount of labeled data aiming to improve the predictive performance. Depending on the nature of the output class, Semi-Supervised Classification and Semi-Supervised Regression constitute the basic components of Semi-Supervised Learning. Various studies deal with the implementation of Semi-Supervised Classification techniques in many real world problems over the last two decades in contrast with Semi-Supervised Regression, which is deemed to be a more general and slightly touched case. This survey aims to provide a detailed review of Semi-Supervised Regression methods and implemented algorithms in recent years. Our in-depth study reveals the relatively few studies that deal with this specific problem. Moreover, we seek to classify these methods by proposing a schema and categorizing all the related methods that have been developed in recent years according to specific criteria.
Introduction
Machine Learning (ML) has already been the focus of attention in recent years due to the important development of computer science and information technology. The imperative need of data extracting and analysis in several scientific fields, such as bioinformatics, medical diagnosis, biology, finance, statistics, neural networks, educational data mining and many others, have made the use of ML techniques even more necessary [1]. There are mainly three types of ML: Supervised, Unsupervised and Semi-Supervised Learning (SSL).
Supervised Learning aims to predict the labels for unknown data by training a function h. More specifically, given a training set of l labeled examples Ld = {(x1, y1), (x2, y2), …, (x l , y l )} with x i ∈ R n , i = 1,2, …,l, a function h aiming to predict the label y on unknown examples x is trained. It is quite clear that the main characteristic of this scenario is the availability of the Ld subset. Classification and Regression are the two principally ML problems depending on the type of h (for discrete and continuous function h respectively).
According to Unsupervised or Descriptive Learning, the training set consists only of unlabeled data Ud = {x1, x2, …, x u } with x i ∈ R n , i = 1,2, …,u. At the same time, there are no classes in advance, meaning that there exists no knowledge of the expected output of each input example. The goal is to find interesting similarities-regularities on Ud subset or extreme instances that vary greatly from the others and classify them without human intervention. The most commonly used unsupervised method is Clustering, while Novelty Detection and Dimensionality Reduction are also well-known unsupervised methods.
The plethora of unlabeled data and the simultaneous lack of labeled data in several application areas, has leaded to the growth of SSL, a combination of Supervised and Unsupervised Learning. Depending on the nature of the output class, Semi-Supervised or Transductive Classification (SSC) and Semi-Supervised Regression (SSR) constitute the main components of SSL.
Various studies deal with the implementation of SSC techniques in many real world problems over the last two decades in contrast to SSR, which is deemed to be a more general and slightly touched case. In this regard, the purpose of this paper is twofold. Initially, we seek to classify all previous SSR methods by proposing a fully integrated schema, which will be serve as the reference point for any future work. The conducted classification is based on the main characteristics of the examined algorithms analog to each distinct family that they belong to, highlighting also their properties. In addition, we present a detailed review of the studies, methodologies and implemented algorithms dealing with the SSR problem in recent years. As far as we are aware, no work has been done up to now summarizing as well as presenting and analyzing the basic SSR algorithms that have been developed.
The rest of this paper is organized as follows: Section 2 describes SSL and presents the problem of regression and especially the SSR setting. In Section 3 we classify the implemented SSR methods by proposing a comprehensive schema according to specific criteria. Next, a detailed review of the studies and the implemented algorithms that deal with the SSR task in recent years according to the proposed schema is demonstrated. More specifically, Section 4 refers to non-parametric SSR methods, while Section 5 refers to the parametric ones. A subsection about some hybrid approaches is also placed there. A last category, this of Semi-Supervised ordinal regression, is examined in Section 6. Finally, Section 7 concludes the review considering a few thoughts for future research.
Semi-supervised regression
In many real world applications, there is often a multitude of unlabeled data, while labeled data is scarce. Labeling unlabeled data is strenuous, timewasting, and it is too expensive as it requires specialists, real time experiments, special devices and systems operated by qualified staff. SSL or transductive inference methods have attracted much attention and have been applied in numerous fields by using a small set of Ld along with larger sets of Ud [3]. SSL aims to build efficient learning algorithms and improve the learning performance obtaining better results than acquired from labeled or unlabeled data alone by extensively analyzing the information of the Ud subset [4].
Semi-supervised learning
SSL can be divided into two somewhat different scenarios, inductive and transductive learning, depending on the type of the training function [3]. Given a training dataset, inductive SSL aims to predict the labels on unknown examples, while transductive SSL attempts to predict the labels on unlabeled examples taken from the training dataset. Transductive methods outweigh the inductive ones, since they can effectively make use of the information retrieved from all the examples [5].
Some notable SSL methods that have been effectively used in numerous scientific fields with impressive results are Self-training [6], Co-training [7], Tri-training [8] and Transductive SVM (Support Vector Machines) [9]. These methods are trying to take as much advantage of the Ud as possible, since the exploitation of unlabeled data is crucial for their effectiveness. Therefore, introduction of stages or filters that reduce the impact of misclassified instances from the early stages of the learning procedure is highly beneficial for the learning accuracy [10]. Depending on the type of the output variable, SSL may be subdivided into the following two branches:
Semi-Supervised Classification (SSC) refers to the case where the output variable is discrete. Most of the studies about SSL deal with classification problems, with a view to predicting a label from a finite set of class labels. Semi-Supervised Regression (SSR) refers to the case where the output variable is real-valued.
Several SSR studies are presented and analyzed in this review highlighting the importance of applying SSR methods in real life learning problems.
Regression analysis under the SSL framework
Most frequently, it is essential to examine in detail the relationship between different variables in a dataset, with the aim of predicting the value of one variable (output or dependent variable) given the values of the rest variables (input or independent variables). Regression analysis is a statistical method building a mathematical model that best matches the data for the prediction of the output variable [11]. In simple regression, there is only one independent variable x that affects the value of the dependent variable y, while in multiple regression there are more than one. So, the problem of regression can be defined as: Given a set of l labeled examples Ld = {(x1, y1), (x2, y2), …, (x l , y l )} with x i ∈ R n , y i ∈ R, i = 1,2, …,l one wants to predict the value of y for any new example x.
Now we can extend the above problem in the SSL framework: Given a set of l labeled examples Ld = {(x1, y1), (x2, y2), …, (x l , y l )} and a set of u unlabeled examples Ud = {xl+1, xl+2, …, xl+u} with x i ∈ R n , = 1, 2, …, l + u and y i ∈ R, one wants to predict the values of y for any new x. Under this scenario, the information from the provided unlabeled set is also used by the learning algorithm, in contrast to the supervised scenario, where only the labeled set contributes to the learning phase. On this basis, the SSR framework is illustrated in the following Fig. 1.

The SSR Framework.
Various studies deal with the implementation of SSC techniques in contrast to SSR, which is deemed to be a more general case [12]. The real valued type of the output variable in regression results to an inherent difficulty of applying classification algorithms to regression problems [13]. Nevertheless, in the last few years there have been signs of considerable studies concerning the development of SSR algorithms. In the next sections we are attempting to present and analyze the majority proportion of these studies, after categorizing them in an explanatory schema.
Despite the limited number of studies concerning the SSR task, as far as we are aware, there is no survey in the literature summarizing them as well as presenting and analyzing the main SSR algorithms that have been developed. Moreover, these methods have not been classified, even in a rough taxonomy, contrary to the respective classification task. In a recent survey, Triguero et al. [14] presented in detail self-labeled methods for SSC and proposed taxonomy based on their main characteristics. Nevertheless, there exist some studies where a slight reference to SSR algorithms can be found. Seok [4] referred to some existing and notable SSR techniques such as the COREG [13] and the SSKR (Semi-Supervised Kernel Regression) [15] algorithms. Recently, Kang et al. [16] named some representative SSR algorithms by referring briefly to Co-training, kernel and graph based regression methods.
A Categorization of the SSR methods
As has just been mentioned, we seek to classify all SSR methods that have been developed in a comprehensive schema according to the followingcriteria:
The relationship between the input variables. – Parametric methods are based on the assumption that there is a specific relationship expressed as a functional form by the independent variables consisting of a fixed number of parameters. A typical form of this relationship is linear, quadratic or periodic. The best estimation of the parameters leads to the optimal solution of the regression problem. The main weakness of parametric methods is that they are vulnerable in estimating the outputs of the out-of-sample data, known as extrapolations [17]. – Non-Parametric methods trying to find relationships directly from the data. In non-parametric regression techniques, the unknown estimator function is smooth [18] and the estimation procedure is called smoothing. Well known and powerful non-parametric regression estimators are kernel, kNN (k-Nearest Neighbors), regression splines and polynomial estimators [19]. The performance of such a regressor depends on various parameters, particularly the proper selection of kernel bandwidth. Non-parametric methods take precedence over parametric ones, since they are not based on explicit function forms but rely on the information resulting direct from the data. The number of views. – Multiple view learning refers to the case where each example x of the training set can be described by two or even more different distinct feature sets, commonly referred to as ‘views’. Even if such different views do not naturally exist, there is a variety of view construction methods that can be applied, such as random split, genetic algorithms and principal component analysis [20]. Several theories and methods have been developed concerning multi-view learning from the viewpoint of SSL approach. – Single view learning is the most commonly occurring situation, since in most real life problems each data example can be expressed in one single view. Moreover, most of the implemented algorithms are single view based or manipulate different sources of features as a compact one. The number of learners. – Multiple learners Several SSR methods make use of multiple regressors during the learning process with a view to increase regression estimates as well as reducing possible over fitting [13]. – Single learner It is considered to be one of the simplest cases and the one that was initially used with utmost effectiveness in the semi-supervised framework, as in the case of self-training.
In view of the above, and in accordance to the several studies which are extensively analyzed further down, SSR methods may be classified according to the following schema (Fig. 2):

A categorization schema of implemented SSR methods.
For facilitating the characterization of the contained SSR approaches, we provide also a short comment to each referred SSR study about the addition procedure with which, the integration of the unlabeled examples in the original labeled set is carried out each time. The three main approaches that are usually met in the literature [14-20] are:
Instance-incremental The basic idea of this method is the definition of certain criteria before the initialization of the process and the examination of each incoming example separately. If an appropriate condition is met, the example is considered as trustworthy enough and is inserted into the Ld. Although this approach is the most efficient judging by the time factor, the existing randomness during the examination phase, as it concerns the turn of the visited examples, may affect its performance negatively. Thus, appropriate modifications should be adopted for estimating the labeling confidence more effectively and reducing in parallel the time response of the whole scheme. Batch-incremental According to this method, when larger groups of examples have been constructed, let their size be b, they are provided to the corresponding stage for being assessed. Then, the selection may vary based on the specifications of each approach. The drawbacks of such mechanisms are the demand of defining parameter b, which may be tuned according to the specific application, the necessity of replacing or deleting learned models and the fact that the most recent examples could not be exploited until a full batch is formatted. New rules that describe the intermediate interconnections of this approach could be introduced, deteriorating even more the time complexity of these methods. Amending There exists the possibility of removing any example that has been judged as confident enough during a previous learning iteration. This may occur when the learning behavior of the model has been affected towards other directions that recant the previously extracted decisions. This leads to the most computational demanding approach, since all the current labeled examples are scrutinized before the end of each iteration, promising a more successful labeling stage.
Confidence measures and performance measures
Other important factors that characterize the operation and the performance of the SSR methods are the confidence and the performance measures, respectively. While the latter group is composed of several metrics that measure different aspects of algorithms’ performance, the former describes the mechanisms that are based each time on specific performance measures for extracting the most suitable examples for labeling.
Confidence measures
There exist two different strategies for certifying as much as possible the accuracy of the predicted value for the unlabeled examples: simple and combinatory confidence measures.
Simple In this strategy the labeling confidence of the unlabeled examples is examined, applying their predicted values on the current labeled set and measuring the error reduction, under an objective function based on a performance measure. Therefore, the simplicity achieved by selecting such confidence measures is compensatory enough, but the quality of the selected examples to be inserted into the current Ld for training the next learning models could be too low. Combinatory On the other hand, there are occasions that the structure of the algorithm that is used for solving regression problems favors the exploitation of multiple decisions before the final assignment of the most confident examples take place. Especially, when the basic idea of the learning strategy of an algorithm is based on committees, then the produced decisions of each regressor could either average or contribute to voting schemes for outputting the most suitable example or set of examples for getting labeled. Furthermore, techniques of splitting the labeled set for obtaining more than one training sets, either with replacement or without, could be a solution for accessing numerous predictions that may be combined under similar decision schemes. Of course, when the amount of the initial Ld is small, the diversity of the newly formatted views might not offer accurate results.
Although the former is commonly used in the majority of the reviewed works, the latter could be introduced without neither highly modifying the structure of each method nor spending valuable computational resources. However, only specific SSR methods make use of this kind of confidence measures.
Performance measures
Several statistical metrics have been used to evaluate the predictive efficiency of the SSR methods as presented below. Amongst them, Mean Square Error (MSE), Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are predominantly used to measure the average error of these methods.
RMSE Ld = RMSE of the regressor trained on the initial dataset
RMSE Ad = RMSE of the regressor trained on the enlarged dataset
– Relative Improvement Percentage on MSE:
– Relative Improvement Percentage on RMSE:
The overwhelming majority of the following studies deal with non-parametric SSR methods. Regression based on the Co-training approach, kernel regression and regression via Laplacian regularization seems to be the most well studied methodologies with stunning results compared to familiar regression methods. The flexibility of these methods, since they do not require a fixed number of model parameters, has resulted in the implementation of several SSR algorithms, which are discussed below.
Semi-supervised co-regression
Co-training is a SSL method implemented by Blum and Mitchell in 1998 and was a leading initiative to multi-view learning [7]. Co-training is based on three assumptions. The first one is that each example of the dataset can be divided into two distinct views that are not perfectly correlated, which means that each example can be described using two different types of information (multi-view assumption). The second one is that each view can effectively be exploited for classification (compatibility assumption), while the third assumption is that these views are conditionally independent given the class label (independence assumption). In this context, two classifiers are trained separately in each view using a small initial amount of labeled examples and the most confident predictions of each algorithm on Ud are used to enlarge the training dataset of the other. It can be easily understood that the efficiency of the Co-training method in real world applications depends largely on the fulfillment of the above assumptions [23] as well as the appropriate choice ofclassifiers.
A multitude number of studies deal with the implementation of the Co-Training algorithm for SSC. Although the assumptions about the existence of sufficient and redundant views can hardly be met in practice, several extensions or variants of the Co-Training algorithm, have been developed such as Tri-Training [8], De-Tri-Training [24], Democratic Co-Training ’[25], Co-Forest [26] and CoBC [27]. The following studies deal with the development and implementation of co-style SSR algorithms and are divided into two maincategories:
Multi-view Co-Regression Single-view Co-Regression
Multi-view co-regression
Brefeld et al. [28] proposed co-Regularised Least Squares (LS) Regression (coRLSR), a semi-supervised kernel regression method based on the co-training approach that encompasses two variants, the non-parametric and the semi-parametric one. Both variants employ the Co-training method as a regularized risk minimization problem in Hilbert spaces using a linear combination of Gaussian kernel functions in the form:
Wang et al. [29] proposed a semi-supervised co-training regression model for remote retrieving of variables regarding the water quality. The SS-SVR (Semi-Supervised Support Vector Regression) method employs two SVR regressors with different parameters using a Genetic Algorithm for improving the response of this step, since they deemed to be reliable and relevant for nonlinear relationship between variables. Using a small labeled training set, each regressor selects the most confident predictions from U d and enlarges the training dataset of the other. Estimation process of labeling confidence is very important for the majority of SSR methods. Finally, the whole proceeding is repeated for a default number of iterations and the final result is the average value of regressors’ outcomes.
Bao et al. [31] implemented the Co-training Partial LS (PLS) algorithm, which combines the co-training method with the PLS regression for soft sensor modeling. PLS is a combination of PCA (Principal Component Analysis) and multivariable regression and seems to be a very effective method in the case where there is a large set of independent variables [32]. The main asset of PLS technique is the achieved fast response against other approaches, such as kNN, kernel or hybrid methods. However, the prediction of the output variables is difficult and expensive, while input variables are easily measured. The original labeled dataset is equally divided into two subsets that represent the two different views required for the Co-training procedure and, at the same time, two diverse PLS regressors are trained separately in those subsets by adding the unlabeled example that makes each regressor more consistent, respecting the previous comment about the labeling confidence. The iterative procedure is terminated when all unlabeled examples are finally labeled, and the final output is the average value of the two regressors. The obtained results, evaluated by the RMSE, verified the selection of PLS technique for mining useful information through industrial process data, since the collinearity phenomena have been effectivelytackled.
In many real life applications the multi-view assumption of the Co-training algorithm can hardly be met. So, several modifications of the Co-training algorithm have been developed tackling that hurdle by promoting the employment of two or more diverse classifiers. A prime SSL example is the Tri-training algorithm utilizing three classifiers, while in case of SSR the COREG algorithm seems to be the basic single-view co-style methodology.
CO-training REGressors, is a co-training algorithm for SSR that was originally developed by Zhou and Li [13]. Contrary to co-training algorithm, COREG does not require two sufficient and redundant views or different learning algorithms. The algorithm employs two 3NN regressors using the Minkowsky distance with different metric orders, thereby setting a diversity framework. Each regressor predicts and assigns a value to each unlabeled example, but labels only the most confident unlabeled examples which are inserted into the labeled set of the other regressor before the end of each iteration. The labeling confidence is estimated by evaluating the MSE, a strategy that is highly followed by the majority of SSR methods. More specifically, the labeling confidence is computed by measuring the influence of the labeling stage over the initial training set. Thus, the MSE for each unlabeled example is evaluated on the full training set being enlarged with the specific example and its predicted label. The example with the biggest increase of Δ
x
u
function is regarded as the most harmonized with the available data. This function is defined as:
Hady et al. [35] generated a single view extension of the co-training algorithm for semi-supervised regression named CoBCReg (Co-training By Committee for Regression). This algorithm is an ensemble of n diverse RBF network regressors constructed by bagging and are trained using different training subsets, different bootstrap methods, different random initialization of RBF centers and different distance measures. Using a committee of such regressors, the number of unlabeled examples that are likely to be incorrectly labeled by a single regressor is expected to be reduced. Each regressor selects the most helpful formerly labeled examples, while the selection confidence is based on minimizing the regressor error on the validation set. In that way, the committee of predictors labels the unlabeled examples and the final prediction is the weighted average of the values of each one of the n regressors. Consequently, the two most important drawbacks of co-training methods, meaning the integration of noisy examples as trustworthy and the possible fact that the unlabeled examples that are exchanged among the regressors along with their labels may not offer any useful information, seem to be well tackled by such a committee. The produced results over 6 different datasets, both in their original version and under the scenario of additional Gaussian noise, assure the robust behavior of the proposed method according to the RMSE results, as well as, the relative improvement percentage on RMSE measures.
An alternative and recently proposed approach is the incorporation of interactive genetic algorithms (IGAs) inside the COREG algorithm for accelerating its efficiency [36]. Initially, two RBF regressors are trained on labeled data subsets estimating the output value of an example as the average value of regressors. Subsequently, co-training is applied using unlabeled examples to refine the regressors. Compared with the co-training algorithm, the number of iterations is smaller leading to a lowering computational cost.
Another approach called SAFER (SAFE semi-supervised Regression) tries to restrict the phenomenon of mislabeled predictions inside the automated learning process of SSR schemes [37]. The demanded inputs here are the predictions of a number of semi-supervised regressors over the unlabeled set and the corresponding prediction of a supervised regressor that has been trained exclusively over the labeled set. Then, the ambition is to achieve improvement against the default knowledge that is provided through the supervised regressor. Thus, a convex quadratic optimization problem is set, searching for the suitable weights of the former regressors so as to acquire an improved prediction (the executed transformation of this problem to an optimization task is described through a geometric projection framework). The included version of SAFER algorithm exploits three distinct semi-supervised regressors (one stemming from LS method and two from kNN family with different distance functions) and 1NN as the supervised one. The obtained results (MAE) prove the decent results of this method, even if the prediction of the supervised regressor is not so accurate.
It can be easily identified that the co-training framework has been successfully applied in various SSR problems with noteworthy results, even though the iterative procedure of co-style algorithms leads to computational problems as regards large datasets [38]. As concerns the mechanism that is responsible for augmenting the initial training set with unlabeled examples, instance-incremental strategy seems to be preferred, while batch-incremental is adopted only in the case of the coRLSR algorithm.
Kernel regression is one of the most common non-parametric techniques for regression analysis. A kernel is a positive definite symmetric function which satisfies the conditions of Mercer’s theorem corresponding to an inner product in feature space (Reproducing Kernel Hilbert Space - RKHS). The construction of a kernel function to depict the inner structure of the data is the key for the efficiency of the kernel method [39], since a kernel measures the distance based similarity of examples in input space. Several types of kernels functions have been used in kernel regression [40] such as:
Gaussian RBF kernel, one of the most broadly used kernels, which seems to have quite good smoothness properties. However, it does not reflect adequately the inner structure of the data nor the local structures and is defined as:
–Polynomial kernel encompasses more parameters resulting to an intrinsic difficulty of applying it to kernel methods, also supported by the frequent occurrence of overflowing or under flowing cases [41]. It is defined as
Sigmoid kernel originates from neural networks and was first reported by Vapnik [42]. It is a not positive semi-definite and very often applied kernel, and is defined as:
Linear kernel constitutes a special case of RBF kernel and is defined as:
– Reproducing kernel. Let G be a class of functions f(t) defined in a set U, forming a Hilbert space. The function k (t, t′) where t, t′∈U, is named a reproducing kernel of G if the following two conditions are satisfied [44].
Like many other methods, original kernel regression is based only on labeled examples. In recent years, considerable studies deal with the implementation of the classical kernel regression in the SSL framework. Based on the classical kernel regression estimator, we consider three principal types of SSKR:
SS Kernel Ridge Regression SS Support Vector Regression SS Output Kernel Regression
Kernel Ridge Regression (KRR) is based on the LS loss function to measure the difference between the output value y of the example x and the predictive value. Given a set of l independent labeled examples Ld = {(x1, y1), (x2, y2), …, (x
l
, y
l
)} with x
i
∈ R
d
, i = 1,2, …,l and y
i
is a scalar output variable, the problem of KRR can be defined as follows [45]:
Wang et al. [15] proposed a SSR Kernel algorithm termed SSKR, which is based on the kernel regression framework exploiting both Ld and Ud. Assuming that the predictions of the unlabeled examples are known, a Gaussian kernel regressor is constructed using an appropriate selected weighting factor. SSKR is the classical kernel regression when the weighting factor r is equal to zero and a graph based method when r equals to one. Moreover, the appropriate selection of the weighting factor improves the efficiency of SSKR compared with the above mentioned methods. The effectiveness of the method depends on the appropriate selection of the kernel bandwidth σ and the weighting factorλ neglecting the behavior of the effecting parameters. Experiments showed that SSKR outperforms KR and the graph based method with the appropriate selection of the weighting factor.
Cortes and Mohri [5] presented a two stage transductive regression algorithm based on KRR. According to this method, the position of each unlabeled point is originally used to estimate its label, either by local linear regression or by ridge regression or as the weighted average of the neighbor labels. More specifically, the local estimation of an unlabeled point is based only on its labeled neighbors. Otherwise, it is disregarded from the training procedure. Then, global optimization is used to ensure the accuracy of predictions based on kernel ridge regression and reduce erroneous predictions. The kernels used in the experiments were Gaussian kernels.
Angelini et al. [46] made use of KRR to develop a SSR and classification algorithm for transductive inference and dimensionality reduction by using a small Ld. A linear regularization predictor was used aiming to reduce the error resulting to the optimal target vector, which in fact is a generalization of the first kernel principal component (FPCA) to the SSL setting. The proposed method outperformed the classical KRR and the Transductive Linear Discriminating (TLD) [47].
Seok [48] proposed Semi-Supervised Local Constant Estimator (SSLCE), a SSR algorithm based on the Local Constant Estimator (LCE), a widely used kernel regression method which was originally proposed by Nadaraya [49] and Watson [50], both dated back to 1964. In fact, the estimator is the same as used in SSKR [15]. Experiments in several datasets showed that the predictive performance depends directly on the number of the unlabeled examples, as well as, on the weighting factor. Moreover, a larger number of unlabeled examples do not in itself automatically lead to more effective outcomes.
The evaluation of the pre-referred methods was based on MSE and RMSE measures, showing that SS-KRR outperforms supervised approaches based on kernel regression, as well as graph-based semi-supervised regression methods.
Support Vector Machines (SVM) is considered to be a powerful ML technique that was originally designed for Supervised Learning [16] and, more specifically, for classification problems. Given a training dataset Ld = {(x1, y1), (x2, y2), …, (x
n
, y
n
)} with x
i
∈ R
d
, i = 1,2, …,n and y
i
∈ {1, - 1}
n
, the problem of SVR can be defined as follows [52]:
For the semi-supervised setting, given a set of l labeled examples Ld = {(x1, y1), (x2, y2), …, (x
l
, y
l
)} and a set of u unlabeled examples Ud = {xl+1, xl+2, …, xl+u} with x
i
∈ R
n
, i = 1, 2 …, l + u and y
i
∈ R, the SS-SVR problem can be defined as follows [17]:
where c > 0 is a parameter penalizing the error term.
In recent years semi-supervised SVM has been applied in several regression problems, which are summarized below, showing the efficiency of the SVM method in SSR methods.
Zhu and Goldberg [53] combined manifold regularization, multi-view learning and SVR techniques for SSR based on the assumption that there is knowledge of certain order preferences between pairs of unlabeled points. The order knowledge of two input examples is transferred to their unknown output values, and then, the optimization problem is transformed to a linear problem using the RBFkernel.
Camps-Valls et al. [17] applied two semi-supervised SVR methods for estimating the Leaf Area Index from hyperspectral images and the spectral behavior of chlorophyll concentration in the subsurface waters. These kernel-based methods, namely Graph SVR and Hypergraph SVR, made use of unlabeled data by constructing either a graph Laplacian according to the weighted distance between examples or a hyper-graph Laplacian by performing k-means clustering, integrating the RBF kernel.
Similar to the previous studies, Demyanov et al. [54] applied the SVR methodology to model petro-physical properties in a fluvial reservoir by adopting the concept of geo-manifold. The algorithm was robust to noise, thus improving the reproduction of realistic geological structures.
Xu et al. [41] proposed a SSR technique based on LS SVR (LS-SVR). S2LS-SVR estimates the labels of each example in unlabeled data. The corresponding estimations are the weighted average of their neighborhood labels in the future space. The solution of a linear system similar to LS-SVR leads to the corresponding decision function, while the RBS kernel is adopted. Experiments on corn dataset showed the supremacy of the suggested technique compared to kernel Partial LS (PLS) and LS-SVM regression methods based on the average relative error (δ) and the correlation coefficient (R).
Some years later, Seok [4] proposed a semi-supervised SVR method named S3VR. A Lagrange function was constructed considering the estimates of the unlabeled training dataset, thus resolving the optimization problem. The radial basis kernel function was utilized in several real-world datasets, which described nonlinear regression cases, revealing the superiority of the S3VR method against the SVR approach. The main asset here is the strategy to apply modifications to the input space rather than in the high dimensional feature space, facilitating the process of handling effectively nonlinear problems.
Kang et al. [16] presented a SSL SVR method based on self-training. Initially, two Probabilistic Local Reconstruction (PLR) models, PLRlocal and PLRglobal, are employed to estimate the label distribution of Ud, thus forming the training set together with unlabeled data. PLRlocal and PLRglobal models capture the local topology of the unlabeled data and the global distribution of the underlying function respectively. Finally, SVR is applied, while Expected Margin based Pattern Selection (EMPS) is engaged to reduce the training complexity of the algorithm. Almost all the aforementioned algorithms measure the predictive performance using the MAE, MSE and RMSE metrics.
Recently, a kernel regression approach has been proposed using a kernel function (output kernel) to measure the similarity in the output space for link prediction. In particular, Brouard, Buc and Szafranski [55] proposed Penalized Output Kernel Regression (POKR) based on RKHS theory with vector-valued functions extended to the semi-supervised setting for link prediction. Under the assumption that an input kernel is given, a simple nonnegative operator-valued kernel is defined. Subsequently, the representer theorem proposed by Belkin et al. [56] is extended in the SSL in RKHS with vector-valued functions using a penalized LS cost functional. Before we summarize the SSKR methods, we have to refer also the study of Pozdnoukhov and Bengio [39] who deal with the implementation of non-parametric data-dependent kernels in the semi-supervised setting. The proposed kernel forms an appropriate match of SVR and Kernel Ridge Regression using the standard Gaussian RBF kernel, and these combinations are examined against their supervised variants over synthetic and real-world datasets. The results reveal that when some geometrical structure governs the dataset, then use of the proposed kernel isfavored.
SSR via graph regularization
Many studies incorporate graph based learning methods, especially for SSL problems. Graph regularization illustrates the intrinsic geometrical structure of data, building a nearest neighbor graph. Adding this graph via a regularization entity in the learning kernel upgrades the predictive performance of the algorithm. Various graph regularized SSL as well as SSR approaches have been implemented in recent years and are presented below.
SSR graph laplacian regularization
Graph-based Laplacian Regularization (GLR) is a widely used transductive classification technique [59] that has been successfully implemented to supervised learning. Over recent years, various studies deal with the implementation of GLR in the SSL framework. GLR is based on the manifold assumption which states that if two points are on the same manifold, then their corresponding values are similar. According to these methods, a nearest neighbor graph is created based on both labeled and unlabeled examples. However, in SSL there are possibly unlabeled points with no labeled neighboring points. The use of such points constitutes a crucial issue in the training procedure, while in several studies these points are ignored. Then, a graph Laplacian is used to measure and guarantee smooth properties of the targeted function inside the given data manifold.
Belkin et al. [56] presented and analyzed two families of SSL algorithms based on graph regularization showing that the exploitation of Ud may boost the predictive performance. Laplacian Regularized LS (LapRLS) and Laplacian Support Vector Machines (LapSVM), which are extensions to the SSL setting of RLS and SVM algorithms respectively, brought together spectral graph theory methods, manifold learning and the theory of regularization in RKHS for real-valued functions.
Several extensions of the Laplacian Regularized LS (LapRLSR) method for SSR have been proposed recently. Given a set of l labeled instances Ld = {(x1, y1), (x2, y2), …, (x
l
, y
l
)} and a set of u unlabeled examples Ud = {xl+1, xl+2, …, xl+u)} with x
i
∈ R
n
, y
i
∈ R and i = 1,2, …,l+u, the LapRLSR algorithm solves the following optimization problem:
An updated version of the LapRLSR method concerning the SSR framework regularized with a graph Laplacian prior to build an efficient regressor is called Temporal Laplacian Regularized LS Regression (TLapRLSR) algorithm [61]. In this setting, the graph Laplacian matrix is replaced by a temporal graph Laplacian constructed from sequences of partially labeled data.
In the same context, Xie and Newsam [62] developed two other extensions of LapRLSR algorithm in the SSL setting, called Vis LapRLRS and Vis-Geo LapRLRS, to create a map from geo-referenced images. A small number of manually labeled images were needed for building a regressor. Next, graph LapRLSR interacts with untagged images for improving the repressor’s accuracy under the assumption that outputs of untagged images would not fluctuate much from outputs of analogous labeled images.
Doquire and Verleysen [63] proposed a feature selection algorithm named SSLS (Semi-Supervised Laplacian Score), which combines both supervised and unsupervised Laplacian Score (LS, SLS) methods for regression problems. LS ranks features according to the similarity measures between the unlabeled examples, that is to say, by determining the nearest neighbor of each point, measured from the Euclidean distance with one other. Further, SLS ranks features according to the similarity measures of the outputs. As a result, SSLS is an extension of the SLS in the SSL context and its main advantage is the ability to maintain the local properties of dataset examples.
Zhao et al. [64] melded the LapRLS with SSL Discriminant Analysis methods (SDA), thereby creating a SSL dimensionality reduction method, called Learning from Local and Global Discriminative Information (LLGDI). Initially, both methods are implemented at the same time on labeled examples to train a classification function. In order to ensure more classification efficiency, a set of estimated labels is introduced, while a projection matrix is used to express the global discriminative direction per class. Adopting a number of local and global classification functions, so as to keep pace with local geometrical and discriminative information respectively, LLGDI aims to tackle both problems at the same time.
In a similar study, Zhao et al. [65] combined SDA and LapRLS under a regularized LS setting for face recognition, thus tackling the high dimensionality problem of data. LS-SDA and LS-SDA/F are the two slightly different versions of the regression algorithm, which is more effective when few labeled examples are used in the learning process.
Most recently, Sheng and Zhu [30] utilized a regularized regressor integrated with quadratic loss inside a LapRLS framework, so as to study the correlation of the convergence rate and the cardinality of unlabeled set. The choice of the specific loss function offers remarkably favorable properties for such kind of tasks, as the produced results prove.
In contrast to the previous methods, Kim et al. [66] proposed the use of Hessian regularization for SSR, and in particular the second-order Hessian energy. This method functions under shortage of labeled data as effectively as possible, making use of geodesic functions that vary linear along the data manifold.
SSR parallel field regularization
Lin et al. [67] examined the effectiveness of vector fields’ implementation in SSL. Parallel Field Regularization (PFR) is suggested to estimate the linearity of the targeted function and ensure second order smoothness in SSR applications. The aim that has to be met is the minimization of the empirical error measured on labeled instances by the objective function. Two constrains under which a function f and a vector field V are learned were reported: i) the distance between V and the gradient field (∇f) of the function f should be small enough, and ii) V should be as parallel as possible [68].
Spectral regression
Cai et al. [69] introduced the concept of Spectral Regression (SR), a graph regularization method for SSR by linking multivariable ordinary regression with spectral analysis. SR exploits labeled and unlabeled examples with the view of building a high-powered algorithm for multi-class problems. An adjacency graph is formed reflecting the label information as well as the intrinsic geometrical structure of input data and subsequently, the LS classifier is applied.
Following that study, they proposed [70] an extension of the Linear Discriminant Analysis (LDA) supervised method for dimensionality reduction incorporating the information provided by unlabeled data, named Semi-supervised Discriminant Analysis (SDA). A crucial point on the efficiency of SDA is the employment of labeled data together with unlabeled, assuming that neighboring instances are likely to have similar outputs. The embedding of RKHS in SDA preserves the local and discriminant structure of the data manifold, thus leading to an efficient method.
Goldberg et al. [71] proposed a cluster-then-label SSL algorithm specific for multi-manifold data, even when the manifold assumption is wrong. A subset of Ud is appropriately selected covering the whole dataset and, together with labeled examples form a graph. Size-constrained spectral clustering cuts the graph in a small number of clusters and, finally, supervised learning is applied, while each cluster must consist of enough labeled data.
Local linear SS regression
Local Linear SSR is a non-parametric regression method trying to resolve the locally constant estimation problem that often occurs in regression problems. The “smoothness” assumption, which states that “similar” examples have “similar outputs”, leads very often to the same output of neighborhood examples in regression problems, and thus a local linear estimator is incorporated.
A notable study stating the efficiency of Local Linear SSR has been published by Rwebangira and Lafferty [12], proposing an implementation of Local Linear Regression under the SSL setting named Local Linear Semi-supervised Regression (LLSR). LLSR employs manifold regularization, fitting a local linear model in each unlabeled example with the use of a local linear Laplacian function. Its performance on fitting over “smooth” functions is better than the corresponding behavior of other known supervised algorithms.
Semi-supervised gaussian process regression
According to [73], “A Gaussian Process (GP) is a collection of random variables, any finite number of which have (consistent) joint Gaussian distributions”. GPs are considered to be a quite efficient tool for regression. Each output y is expressed in the following form, noting an underlying function f(x) via a Gaussian noise model:
Gaussian Process Regression (GPR) aims to learn the function f(x), a strategy that has been combined with great efficiency with local-basedapproaches [74].
Zhang and Yeung [75] combined SSR and multi-task regression, thereby creating SSR Multi-Task (SSMTR) algorithm, which is an extension of the supervised multi-task regression (SMTR) in the SSL setting. More precisely, all included learning tasks are tackled with GP as the base regressor, under the assumption that there is a common prior among all the kernel parameters. Finally, unlabeled data are used, with an appropriate adjustment of the kernel function. Furthermore, SSMTR with pairwise information is adopted in the case that there exists such information, so as to improve the prediction accuracy. The normalized MSE shows the supremacy of these methods. This method uses an RBF kernel and is applied for predicting student performance in secondary schools.
To conclude, almost all the aforementioned non-parametric methods follow the batch-incremental approach during the learning process. It seems that the kernel theory and the regularization approach, favors the merging of both labeled and unlabeled examples to build accurate learning models, being processed under the appropriate optimization methods.
Despite to the several studies dealing with non-parametric methods for SSR, the parametric ones have rarely been applied, since it is difficult to build a model for associating the independent variables so as to estimate the output variable. The so far covered parametric methods are grouped into two main categories:
Semi-Supervised Linear Regression Semi-Supervised Hybrid Methods
Semi-supervised linear regression
Linear Regression (LR) is the most popular regression model, especially in the case of simple regression. Several LR models have been used for supervised regression problems, such as Ridge Regression (RR), PLS Regression (PLSR), Stepwise Multiple LR (SMLR) and Principal Component Regression (PCR).
Ng et al. [76] combined semi-supervised clustering and multivariable regression to analyze datasets containing both numerical and categorical variables using the SSRM (Semi-Supervised Regression Model) algorithm. Assuming that all the numerical variables constitute the Ud, while all the categorical variables constitute the Ld, a k-mode clustering algorithm is used to partition the dataset in several clusters. Subsequently, a multivariate linear or quadratic regression model is applied for each cluster. The purpose is to minimize the weighted sum of the LS errors as well as the dissimilarity measures among the categorical variables. The efficiency of the SSRM algorithm depends on the nature of the data, as shown from experiments carried out on several datasets, considering one of the numerical variables as the dependent one.
Semi-supervised hybrid methods
There are a number of significant SSR methods which are a combination of techniques and are analyzed below. All these methods adopt the instance-incremental strategy to augment the training set with unlabeled examples.
Verbeek and Vlassis [60] applied Gaussian Fields (GFs) based on a district graph for SSR and correspondence learning. Firstly, an appropriate selection of a small-sized Ld based on the minimum entropy principle takes place, while unlabeled data is used to learn a function based on the low-dimensional manifold structure of the data. Finally, GFs based on a district graph are implemented under the SSL concept.
Lafferty and Wasserman [58] presented a comprehensive statistical analysis of semi-supervised techniques for regression from the perspective of minimax theory. The manifold assumption, as well as the semi-supervised smoothness assumption, is studied thoroughly showing that, if the first one holds, the second one is needless. Moreover, without the manifold assumption, the semi-supervised smoothness assumption is too weak. So, improved versions of the assumptions are proposed making optimum use of both Ld and Ud and leading to more efficient estimations.
Guillaumin et al. [57] used the Multiple Kernel Learning (MKL) framework for image classification inspired by the co-training concept. The training set consists of a labeled set of images with specific tags and labels, and an unlabeled one without their labels, thus describing the data with different features. Initially, an MKL classifier is learned using the image contents and associated tags and, subsequently, it is used to predict the labels on the unlabeled dataset of images. Finally, both labeled and unlabeled images with their previously predicted outputs are used to train a visual SVM classifier for predicting the labels of untagged images. The prediction performance might be improved, if LS regression is performed on MKL marks to leverage the predictions on unlabeled data. Consequently, a cascade structured method is implemented using SSR during the first stage and the acquired predictions are provided to the next stage for tackling a classification problem.
Xiang et al. [51] proposed Local Spline Regression (LSR), a transductive algorithm that exploits splines developed in Sobolev space, consisting of polynomials and Green’s functions for SSC. These nonlinear splines are estimated through regularized LS regression mapping the adjacent example points and evaluating the regularized loss formulated in terms of class label vector. The sum of the regularized losses and the sum of the squared errors of all labeled data points build an objective function obtaining a generally optimal classification performance.
Tang et al. [34] proposed a SSL method for image super-resolution based on local SSR algorithms. For each example in the test set, a local labeled training set along with a local unlabeled training set were reproduced by finding the nearest neighbors in the training set and test set respectively, thus forming a new training set. Finally, local SSR was applied in each one of the generated training sets. Several experiments in nature images were conducted with great success.
Tang et al. [33] proposed a SSL transductive regression forest algorithm for hand pose prediction. Two sets of images were used as the training set into a 3-step learning process, a large synthetic labeled dataset and a small realistic dataset composed of labeled and unlabeled images. In the first two steps, viewpoint classification of all labeled examples is performed, followed by joint classification. Finally, regression forest is used to describe the distribution of realistic data points and measure the compactness of the decision tree nodes.
Semi-supervised ordinal regression
Ordinal regression or ranking learning refers to multi class classification problems under order constraints. In practical terms, ordinal regression deals with the classification of finite and discrete variables when there is an order among their values and the metric distances between the ranks are not defined, as it is demonstrated in [77].
Seah et al. [72] presented a Transductive Ordinal Regression (TOR) algorithm for multiple ordinal class transduction. Prediction of the ordinal class label of the unlabeled data is carried out at the same time as the learning of the decision functions of each class utilizing non-linear kernel functions. The classification error is measured via the MAE and MZE metrics.
GPs are frequently used for resolving the ordinal regression problem for supervised learning tasks. So, Srijith et al. [78] extended the supervised GP ordinal regression approach to the SSL setting by using the Expectation Propagation (EP) algorithm. Semi-supervised Gaussian Process Ordinal Regression (SSGPOR) algorithm is based on the similarity of the output distributions corresponding to Ld and Ud, while creates decision boundaries which pass through a low density region. The similarity can actually be achieved by minimizing the Kullback-Leibler divergence between the predictive distribution over the Ud outputs and an approximate multinomial distribution which is similar to the Ld distribution. MAE and MZE metrics were used to depict the efficiency of SSGPOR.
More recent, Uto et al. [79] proposed an SSR method for hyperspectral subspace learning, known as semi-supervised subspace projection under multiple order paths (SMOP) based on order constraints among target variables.
A very recent work employs KRR method over datasets that are distributed at different nodes over a connected network, which constitutes an alternative case of the centralized process that is the usual one [2]. After the introduction of an intuitive method for unwrapping generalized error, the conclusions drawn from the experimental stage converged to the point that use of unlabeled data enhances the predictive behavior of the whole learner, allowing larger number of local learners, even in cases that the targeted function does not belong to the corresponding RKHS.
Summary and conclusions
Over recent years, SSL has been the core of ML trying to exploit both labeled and unlabeled data. Readily available, unlabeled data are of optimal utility in the learning procedure improving the predictive performance of the implemented SSL algorithms. As regards SSR, it parts an important and undoubtedly very promising branch of SSL. Although it is almost untouched and understudied compared with SSC, a several significant studies have been being carried out in the last decade.
In this survey, we have tried to present the major proportion of these studies in order to present a clear and complete view of the SSR methodology. Moreover, we attempted to provide a categorization, mainly oriented towards highlighting the distinct families of SSR methods. The description of each family of methods was committed so as to be consistent with the timeline that each work was published, combined with comments about the basic theories. Most studies are related to non-parametric regression and especially Co-regression, kernel regression and SSR via Laplacian Regularization. Furthermore, COREG seems to be the first implemented SSR algorithm having a small number of variants, forming the basis for the creation of various SSR algorithms based on the co-training approach.
It is important to examine the effectiveness of the above mentioned SSR methods and algorithms to other application areas, since it is evident that each SSR method has been applied to a specific scientific field.
A shortage that has been arisen from this work is that no strategy makes use of the amending mechanism for assigning unlabeled examples with the predicted values during the learning phase. Thus, no refutations are permitted as it concerns the insertion of unlabeled examples to the previous formatted training set, avoiding induced delays, which act as a suspensory factor even for small-sized problems. However, in cases of real-world applications, where the prediction accuracy plays the cardinal role, convenient trade-off assumptions should be applied.
The use of alternative formulas of the rate, or even the insertion of new corresponding formulas that could contribute to the final decision under averaging or voting schemes, should be examined. Besides the chance of getting improved, no extra computational complexity is added. As in the case of MKL based method introduced in [57] where a SSR approach is exploited initially by a cascade scheme, other SSR methods could also be employed for facilitating the operation of the next stage, actuating more robust and accurate ML tools.
Finally, we hope that the references cited covered the major theoretical issues, guiding researchers to contribute over aspects that have yet to be explored. A more systematic effort is needed in the field of self-labeled methods to deal with high-dimensional datasets and much reduced labeled ratio. At the same time, various ensemble techniques could be introduced along with self-labeled approaches for achieving more accurate regression methods [21]. Furthermore, Deep Learning algorithms can incorporate SSL methods towards the goal of defining criteria for good data representationlearning [22].
Footnotes
Acknowledgments
Stamatis Karlos gratefully acknowledges the support of his work by both the Hellenic Foundation for Research and Innovation (HFRI) and the General Secretariat for Research and Technology (GSRT).
