Abstract
Applying semi-supervised learning to extreme learning machine (ELM), we propose a semi-supervised extreme learning machine classification framework (SSELM) with arbitrary norm (q-norm, q=0,1 and 2). However, the SSELM involves nonconvex and nonsmooth problem. In this work, two types of optimization methods are developed to solve the proposed SSELM. The first one is an exact solution approach that reformulates SSELM as mixed integer programming. The second is an approximation approach that approximates the SSELM framework by DC (difference of convex functions) programming. Several formulations for SSELM are presented with different norm. Furthermore, the proposed methods are applied in a practical medical dataset using near-infrared spectral technology. Experimental results in different spectral regions show that incorporating unlabeled samples in training improves the generalization compared with the supervised ELM when insufficient training information is available. Moreover, the proposed methods achieve equivalent performance in benchmark data sets compared to the supervised ELM algorithms and other semi-supervised methods. These results show the feasibility and effectiveness of the proposed algorithms.
Keywords
Introduction
Extreme learning machine(ELM)[1–6] is a single-hidden layer feedforward neural network (SLFN). Compared with traditional neural networks, the main merits are that its hidden layer parameters are randomly initialized, and then output weights can be obtained through least square estimation method. Therefore the ELM runs fast with global solution and is easy to implement. And ELM can provide a unified learning platform to different applications, such as regression,binary and multi-class classification problems. However, the traditional ELM is primarily used for supervised learning tasks that greatly limit their applicability.
Using both labeled and unlabeled data for learning is called semi-supervised learning [7–9], the main goal of which is to employ the large collection of unlabeled data together with a few labeled data to improve generalization. Generally, semi-supervised classification methods are derived based on two fundamental geometric assumptions: (1) the cluster assumption, and (2) the manifold assumption [8, 9]. Recently, semi-supervised learning methods have been applied to ELM to improve the generalization when there are relatively few labelled data [10–12]. Essentially, these methods are all based on manifold assumption for semi-supervised learning, the main idea of which is to introduce the manifold regularization terms in their objective functions.
In this work, applying the cluster assumption to ELM, an arbitrary norm (q-norm, q = 0,1 and 2) semi-supervised ELM framework (SSELM) is proposed to handle semi-supervised classification problems. Noticing that the zero-norm ELM is rarely seen in the literature. The proposed SSELM framework involves nonconvex and nonsmooth problem, which makes it difficult to find the optimal solution. We present two types of optimization methods to solve the proposed SSELM. The first one is an exact solution approach that reformulates the SSELM as mixed integer programs with global solution. The second is an approximation approach that approximates the SSELM by DC (difference of convex functions) programming [13–17], and resulting DC algorithm (DCA) converges.
Background
Semi-supervised classification
Semi-supervised support vector machine (S3VM) [7, 18] can be viewed as a standard support vector machine (SVM) [19, 20] with an additional regularization term on unlabeled data.
In particular, for binary classification problems, assume that the data consists of a training set and a testing set. The training set consists of N labeled samples {(x i , y i ) , x i ∈ R n , y i = ±1, i = 1, 2, … N} and the test set consists of P unlabeled samples {x j ∈ R n , j = N + 1, N + 2, … N + P}. The main idea of S3VM is to find an optimal separation hyperplane far away from both labeled and unlabeled samples to guarantee good generalization. This can be formulated as
In this section, using the notation of Section 2.1, consider N labeled samples (x i , y i ), x i ∈ R n and y i = ±1, (i = 1, 2, …, N). In ELM feature space,the training set is described as {(φ (x i ) , y i ) , φ (x i ) ∈ R L , y i = ±1, i = 1, 2, … N} while the test set is {φ (x j ) ∈ R L , j = N + 1, N + 2, … N + P}. With L hidden nodes, the outputs of the supervised ELM is given by
With hinge loss function l h (·), Huang et al proposed a regularized ELM framework has the form [2](called OPT-ELM)
Inspired by S3VM, we build a semi-supervised ELM classification (SSELM) framework with arbitrary norm. The main idea of SSELM is to find a hyperplane far away from both the labeled and unlabeled samples, and also to minimize the norm of output weights. In particular, for each unlabeled sample φ (x
j
) in ELM feature space, we introduce two slack variables r
j
and s
j
which stand for two possible misclassification errors respectively. Then φ (x
j
) belongs to the class with a lower misclassification error: min {r
j
, s
j
}. To this end, SSELM with q-norm regularization can be formulated as
(1) Different from the popular S3VM which is difficult to optimize in dealing with nonlinear problems because of the unknown implication mapping and the kernel parameters, kernel function of SSELM framework has the explicit form: K ELM (x i , x j ) = φ (x i ) T φ (x j ), and its network parameters are randomly generated without tuning.
(2) The bias ϱ is not required in SSELM since the separating hyperplane β T φ (x) =0 passes through the origin in ELM feature space, while S3VM needs threshold to determine the hyperplane. Therefore, SSELM is more convenient to apply.
(3) Two types of optimization methods are developed to solve the proposed SSELM. The first one is an exact solution approach that reformulates SSELM as mixed integer programming. The second is an approximation approach that approximates the SSELM framework by DC (difference of convex functions) programming.
In the following sections, we discuss the norm q = 0, 1 and q = 2 respectively.
SSELM framework (4) involves a nonconvex and nonsmooth problem owing to the last term in the objective function, which precludes the use of convex and smoothing methods.
Solving SSELM by mixed integer linear programming
It is common to take the norm q = 1, and then we have
This is a linear mixed integer program (called SSELM-LMIP) and solving SSELM-LMIP (6) obtains the global solution of SSELM (5). In general, this mixed integer programming algorithm is computationally very demanding.
The mixed integer programming algorithm for solving problem SSELM (5) is described as follows.
(1) Choose M > 0 is sufficiently large and suitable parameters ν, μ > 0.
(2) Construct the SSELM (5) and its mixed integer programming formulation SSELM-LMIP (6).
(3) Solve SSELM-LMIP (6) to obtain point (β, z, ξ, r, s, d).
Take the norm q = 2, and then we obtain
Solving SSELM by DC programming (SSELM-DC)
In this section, we solve SSELM (5) by DC programming with continuous objective function. To simplify the presentation, let Ω be the feasible region of (5), namely
With the definition (9), the SSELM (5) is simplified as:
The DC algorithm (DCA) for solving problem (11) is as follows.
(1) ∀ɛ > 0 is a sufficiently small, and set k=0. Choose an initial point x0 ∈ Ω1 and suitable parameters ν, μ > 0.
(2) Compute y k ∈ ∂h1 (x k ) via (35)-(36).
(3) Solve the following linear programming to obtain xk+1.
(4) If either∥xk+1 - x k ∥ < ɛ or g1 (xk+1) - h1 (xk+1) ≥ g1 (x k ) - h1 (x k ) - ɛ, stop and xk+1 is the computed solution. Otherwise, set k=k+1 and go to (2).
(1) Algorithm 2 generates the sequence {x k } such that g1 (x k ) - h1 (x k ) is monotonously decreasing.
(2) The sequence {x k } converges to x* in a finite number of iterations.
(3) If the optimal value of problem (11) is finite, then the limit point x* is a critical point of the objective function in problem (11).
(4) The point x* is almost always a local optimal minimizer of problem (11).
We discuss SSELM with 2-norm regularization, called 2-norm SSELM.
with DC decomposition:
Performing DCA for problem (17) amounts to computing the sequence {x k } and xk+1 is the solution to the linear program: min {g2 (x) - (y k ) T (x - x k ) , y k ∈ ∂h2 (x k )}, namely
The DCA for solving problem (17) is as follows.
(1) ∀ɛ > 0 is a sufficiently small, and set k=0. Choose an initial point x0 ∈ Ω and suitable parameters ν, μ > 0.
(2) Compute y k ∈ ∂h2 (x k ) via (13)-(14).
(3) Solve the following convex quadratic program (18) to obtain xk+1.
(4) If either∥xk+1 - x k ∥ < ɛ or g2 (xk+1) - h2 (xk+1) ≥ g2 (x k ) - h2 (x k ) - ɛ, stop and xk+1 is the computed solution. Otherwise, set k=k+1 and go to (2).
(1) Algorithm 3 generates the sequence {x k } such that g3 (x k ) - h3 (x k ) is monotonously decreasing.
(2) The sequence {x k } converges to x* in a finite number of iterations.
(3) If the optimal value of problem (17) is finite, then the limit point x* is a critical point of the objective function in problem (17).
(4) The point x* is almost always a local optimal minimizer of problem (17).
Consider the q = 0 in SSELM (4), the corresponding optimization problem is expressed as (called zero-SSELM)
This problem is a combinatorial optimization since the zero-norm involves discrete variable. We consider an appropriate continuous approximation to zero-norm [4]:
Where η is the function (see Fig.1) defined by

Approximations to zero-norm for the Gaussion function η (z)
Thus, the l0-norm ∥β ∥ 0 is approximated by:
With this approximation, we obtain the approximate problem of (19) as follows:
It is worth noting that function η can be expressed as
with
Obviously, both g and h are all convex, and the g is polyhedral convex. Thus η (β) is a DC function.
With the approximation (20), the proble m (23) takes the form:
Using the formulation (25), the above problem can be reformulated as:
with
and
Obviously, g and h are convex functions, and so are G1 and H1. Note that problem (27) is a polyhedral DC program since G1 is a polyhedral convex function and thus it has finite convergence.
Furthermore, it can be seen that the subdifferential of the H1 is given by
where ∇h (β, ξ, r, s) = (v, 0, 0, 0, 0) with
Furthermore, we introduce the variable z with |β| ≤ z. Let x = (β, z, ξ, r, s). According to the analysis in Section.2, performing DCA for problem (27) amounts to computing the sequence {x k = (β k , z k , ξ k , r k , s k ) , k = 1, 2, ⋯}, and xk+1 is the solution to the linear program: min {G1 (x) - ∂H1 (x k ) T (x - x k )}, namely:
where Ω2 = {(β, ξ, r, s) ∈ Ω1, |β| ≤ z}. Next we describe the DCA applied to (27).
1. ɛ > 0 is sufficiently small, and set k=0. Choose an initial point x0 ∈ Ω2.
2. Solve the linear program (32) to obtain xk+1.
3. If ∥xk+1 - x k ∥ < ɛ, or G1 (xk+1) - H1 (xk+1) ≥ G1 (x k ) - H1 (x k ) - ɛ, then stop and xk+1 is the computed solution. Otherwise, set k=k+1 and go to 2.
1. DCA generates two sequences {x k } and {y k } such that G1 (x k ) - H1 (x k ) decreases monotonically.
2. The sequence {x k } converges linearly.
3. The proof is simple and thus is omitted.
We test the performance of the proposed algorithms on various data sets. Numerical simulation experiments are composed of two parts in this investigation. In the first part, we test the performance of the proposed schemes in five benchmark datasets from UCI Machine Learning Repository. In the second part, the proposed methods are used directly to classify a breast cancer dataset using near-infrared (NIR) spectroscopy data [17].
Experimental design
We chose the optimization method based ELM (called OPT-ELM) [2], the traditional SVM with ELM kernel (called SVM-ELM) [21] and semi-supervised support vector machine (S3VM) methods [18] as the baseline methods. For fair comparison,these methods were performed under the same condition. Ten-fold cross-validation is used in this investigation. In each trial, each dataset is randomly divided into two parts: 10% data as labeled samples and the remaining 90% data as unlabeled samples, and then we implement algorithms to reclassify the unlabeled samples in use of the learning results.
For comprehensive evaluation, the performance of the model was assessed by the following criteria: accuracy (ACC) is the identification rate of all samples from two classes; Mathew correlation coefficient (MCC) is a comprehensive measure of the quality of classification model. The above values can be obtained from the decision function and are defined as [21]
where TP and TN denote true positives and true negatives; FN and FP denote false negatives and false positives, respectively. And a+ = TP/(TP + FN) , a- = TN/(TN + FP). The G-ACC is a comprehensive measure of the quality of classification models. The higher the values of these values are, the better the models are. The sigmoid function g (w, b, x) =1/1 + exp (- (w T x + b)) is used as activation function in hidden layer.
The number of hidden nodes is an important index for designing and training a ELM network. The number of hidden nodes L of ELM is adjusted from the set of values {50, 100, 200, 500, 1000} by ten-fold cross validation. And the optimum value of L is selected to maximize the accuracy.
The performance of the proposed SSELM depends also on the choices of penalty parameters ν and μ. In each dataset, these parameters were tuned from the sets of values {1, 10, 100, 1000} by ten-fold cross-validation. For each combination of these values, the accuracy was calculated and the optimum parameters are selected to maximize the accuracy in each dataset. The S3VM parameters are set to be the same as SSELM.
We compare with the corresponding S3VM
With q = 1, 2, it can be solved by different optimization algorithms.
1-norm S3VM is solved by mixed integer linear programming and DC programming,called S3VM-MILP and S3VM-DC respectively.
2-norm S3VM is solved by DC programming,called S3VM-DC2.
We carry out experiments on two databases.
We first compare with the supervised learning methods, the ELM (26), SVM-ELM and OPT-ELM with the ratio of labeled to unlabeled samples being 1:9 in five UCI datasets. The average results are reported in Table 1, which shows that incorporating unlabelled data improves generalization in Vote and Ionosphere; for other three datasets, the proposed algorithms achieve equivalent performances compared to the supervised ELM methods.
Comparisons of the proposed methods with other ELM methods
Comparisons of the proposed methods with other ELM methods
Then we compare with S3VM-DC, and the results are reported in Table 2. We find that the proposed algorithms achieve better results in most cases.
Comparisons of the proposed methods with the S3VM algorithm
To further test the proposed algorithms, we are interested in the effectiveness of the proposed algorithms when the percentage of the labelled samples varies. Thus we compare with the other semi-supervised ELM methods, SSL-ELM [10], SELM [11] and NRCM [11], with the different ratio of labeled to unlabeled samples in Ionosphere dataset, and results on ACC(%) are illustrated in Table 3.
Compared with other semi-supervised ELMs for Ionosphere dataset
Table 3 shows that the performance of the proposed SSELM-DC2 is competitive with other semi-supervised ELM methods which are based on the manifold assumption in all considered cases.
Classification of antineoplastics is an important issue. The proposed method is used in a practical application, breast cancer dataset consisting of tamoxifen and toremifene citrate tablets, China, in 2014. We chose 120 tablets including 60 tamoxifen tablets and 60 toremifene tablets which were used in experimental analysis. Near-infrared (NIR) spectra were acquired using an MPA spectrometer fitted in range of 4000-12000cm-1 with a resolution of 4cm-1. Each sample spectrum was the average of 32 scans. A final spectrum was taken as the mean spectrum of these spectra. To validate the performance of the proposed method, spectral range is divided into low frequency and high frequency spectral regions 4000-8000cm-1 and 8000-12000cm-1 respectively.
The tamoxifen and toremifene has a 1 : 1 ratio in the training set and test set. We compare with the OPT-ELM, SSELM-MILP and S3VM-MILP in terms of ACC on two spectral regions with the ratio of labeled to unlabeled samples being 1 : 9 in spectral datasets. The average results are reported in Table 4, which illustrates that the proposed methods either improve or show no significant difference in generalization compared to the traditional approaches.
Compared with other methods for NIR datasets
We implement the cluster assumption for semi-supervised learning in ELM and propose an arbitrary norm semi-supervised ELM classification framework (SSELM). The main contributions of this work are as follows:
(1) We construct a semi-supervised ELM framework with arbitrary norm regularization when insufficient training information is available.
(2) Different from the popular S3VM which is difficult to optimize in dealing with nonlinear problems because of the unknown implication mapping and the kernel parameters. And its network parameters are randomly generated without tuning.
(3) The bias ϱ is not required in SSELM since the separating hyperplane β T φ (x) =0 passes through the origion in SSELM feature space, while S3VM needs threshold to determine the hyperplane. Therefore, SSELM is more convenient to apply.
(4) Two types of optimization methods are developed to solve the nonconvex SSELM. The first one is an exact solution approach that reformulates SSELM framework as a mixed integer program framework with global solution. The second is a approximation approach that approximates the SSELM framework by DC programming respectively, and the resulting DCA converges linearly or finitely. Then several optimization algorithms are developed to solve semi-supervised ELM framework.
Compared with the traditional methods, experimental results show that incorporating unlabeled samples in training improves the generalization when insufficient training information is available. Moreover, the proposed SSELM outperforms the existing semi-supervised learning methods and the tradition ELM by obtaining better performance in different spectral regions. These show the feasibility and effectiveness of the proposed methods.
Appendix
DC programming
Generally speaking, a DC program takes the form
A function θ (x) is said to be polyhedral convex if
where a i ∈ R n , b i ∈ R, (i = 1, 2, ⋯ m). The χ Ω (x) is the indicator function of the non-empty convex set Ω, and is defined as: χ Ω = 0 if x ∈ Ω and +∞ otherwise. A DC program is called a polyhedral DC program when either g or h is a polyhedral convex function.
A point x* that satisfies the following generalized Kuhn-Tucker condition is called a critical point of (P dc )
The necessary local optimality condition for (P dc ) is
Calculate y k ∈ ∂h (x k )
Solve convex program (39) to obtain xk+1
Let k:=k+1
DCA is a descent algorithm without line search. The following properties are used in the next sections (for simplicity, we omit the dual part of these properties)
(1) If g (xk+1) - h (xk+1) = g (x k ) - h (x k ), then x k is a critical point for (P dc ). In this case, DCA terminates at k-th iteration.
(2) Let y* be a local solution to the dual of (P dc ) and x* ∈ ∂g* (y*). If h is differentiable at x*, then x* is a local solution to (P dc ).
(3) If the optimal value of problem (P dc ) is finite and the infinite sequence {x k } is bounded, then every limit point x* of the sequence {x k } is a critical point of (P dc ).
(4) DCA converges linearly for general DC programs. Especially, for polyhedral DC programs, the sequence {x k } contains finite elements, and in a finite number of iterations the algorithm converges to a critical point satisfying the necessary optimality condition.
DCA is an efficient and robust algorithm for solving nonconvex problems, especially in the large-scale setting, and has been successfully applied to many nonconvex optimizations.
Footnotes
Acknowledgements
This work is supported by National Nature Science Foundation of China (Nos.11471010, 11271367).
