Arbitrary norm semi-supervised extreme learning machine

Abstract

Applying semi-supervised learning to the extreme learning machine (ELM), we propose a semi-supervised ELM classification framework (SSELM) with arbitrary norm (q-norm, q = 0, 1 and 2). The SSELM, however, leads to a nonconvex and nonsmooth optimization problem. In this work, two types of optimization methods are developed to solve it. The first is an exact approach that reformulates SSELM as a mixed integer program. The second is an approximation approach that treats the SSELM framework by DC (difference of convex functions) programming. Several formulations of SSELM are presented for the different norms. Furthermore, the proposed methods are applied to a practical medical dataset acquired with near-infrared spectral technology. Experimental results in different spectral regions show that incorporating unlabeled samples in training improves generalization compared with the supervised ELM when insufficient training information is available. Moreover, on benchmark datasets the proposed methods achieve performance equivalent to that of supervised ELM algorithms and other semi-supervised methods. These results show the feasibility and effectiveness of the proposed algorithms.

Keywords

Extreme learning machine; semi-supervised classification; mixed integer programming; DC programming; arbitrary norm

1 Introduction

The extreme learning machine (ELM) [1–6] is a single-hidden-layer feedforward neural network (SLFN). Compared with traditional neural networks, its main merits are that the hidden-layer parameters are randomly initialized and the output weights are then obtained by least squares estimation. The ELM therefore trains fast, attains a global solution, and is easy to implement. It also provides a unified learning platform for different applications, such as regression and binary and multi-class classification. However, the traditional ELM is primarily used for supervised learning tasks, which greatly limits its applicability.

Using both labeled and unlabeled data for learning is called semi-supervised learning [7–9]. Its main goal is to employ a large collection of unlabeled data together with a few labeled data to improve generalization. Generally, semi-supervised classification methods are derived from two fundamental geometric assumptions: (1) the cluster assumption, and (2) the manifold assumption [8, 9]. Recently, semi-supervised learning methods have been applied to ELM to improve generalization when there are relatively few labeled data [10–12]. Essentially, these methods are all based on the manifold assumption: their main idea is to introduce manifold regularization terms into the objective function.

In this work, applying the cluster assumption to ELM, we propose an arbitrary norm (q-norm, q = 0, 1 and 2) semi-supervised ELM framework (SSELM) for semi-supervised classification problems. Note that the zero-norm ELM is rarely seen in the literature. The proposed SSELM framework leads to a nonconvex and nonsmooth problem, which makes it difficult to find the optimal solution. We present two types of optimization methods to solve it. The first is an exact approach that reformulates the SSELM as mixed integer programs with a global solution. The second is an approximation approach that treats the SSELM by DC (difference of convex functions) programming [13–17], for which the resulting DC algorithm (DCA) converges.

2 Background

2.1 Semi-supervised classification

Semi-supervised support vector machine (S³VM) [7, 18] can be viewed as a standard support vector machine (SVM) [19, 20] with an additional regularization term on unlabeled data.

In particular, for binary classification problems, assume that the data consists of a training set and a testing set. The training set consists of N labeled samples {(x_i, y_i) , x_i ∈ Rⁿ, y_i = ±1, i = 1, 2, … N} and the test set consists of P unlabeled samples {x_j ∈ Rⁿ, j = N + 1, N + 2, … N + P}. The main idea of S³VM is to find an optimal separation hyperplane far away from both labeled and unlabeled samples to guarantee good generalization. This can be formulated as

$\min_{ω, ϱ} \; \lVert ω \rVert_{q}^{q} + ν \sum_{i=1}^{N} l\big(y_{i} (ω^{T} x_{i} - ϱ)\big) + μ \sum_{j=N+1}^{N+P} l\big(| ω^{T} x_{j} - ϱ |\big)$ (1)

where l(·) is a loss function. The two parameters ν, μ > 0 reflect the confidence in the labels and in the cluster assumption, respectively. Because of the term on the unlabeled samples, S³VM (1) is usually nonconvex, so its global optimal solution is difficult to obtain.

2.2 Extreme learning machine (ELM)

In this section, using the notation of Section 2.1, consider N labeled samples (x_i, y_i), x_i ∈ Rⁿ and y_i = ±1 (i = 1, 2, …, N). In ELM feature space, the training set is described as {(φ(x_i), y_i), φ(x_i) ∈ R^L, y_i = ±1, i = 1, 2, …, N}, while the test set is {φ(x_j) ∈ R^L, j = N + 1, N + 2, …, N + P}. With L hidden nodes, the output of the supervised ELM is given by

$f(x) = \mathrm{sign}\Big(\sum_{k=1}^{L} β_{k} v(w_{k} \cdot x + b_{k})\Big) = \mathrm{sign}\big(β^{T} φ(x)\big)$ (2)

where v(·) is an activation function, and φ(x) = [v(w₁, b₁, x), …, v(w_L, b_L, x)]^T with v(w_k, b_k, x) = v(w_k · x + b_k) is a nonlinear feature mapping that denotes the hidden-layer output for input x. Generally n > L, so the ELM feature mapping φ maps a sample x from the input space Rⁿ into the lower-dimensional ELM feature space R^L.
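The feature mapping and decision rule (2) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the uniform sampling range for w_k and b_k, the sigmoid activation, and the names `make_elm_feature_map` and `elm_predict` are assumptions made here for the example.

```python
import math
import random

def make_elm_feature_map(n_inputs, n_hidden, seed=0):
    """Return phi(x): R^n -> R^L with randomly drawn, fixed hidden parameters."""
    rng = random.Random(seed)
    # Hidden weights w_k and biases b_k are sampled once and never trained.
    # The range [-1, 1] is an assumption of this sketch.
    W = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)] for _ in range(n_hidden)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]

    def phi(x):
        # Sigmoid activation v(w_k . x + b_k) for each hidden node k.
        out = []
        for w, bk in zip(W, b):
            a = sum(wi * xi for wi, xi in zip(w, x)) + bk
            out.append(1.0 / (1.0 + math.exp(-a)))
        return out

    return phi

def elm_predict(beta, phi, x):
    """Decision rule (2): sign(beta^T phi(x)); a zero score maps to +1 here."""
    score = sum(bk * pk for bk, pk in zip(beta, phi(x)))
    return 1 if score >= 0 else -1

phi = make_elm_feature_map(n_inputs=3, n_hidden=5, seed=0)
features = phi([0.2, -0.1, 0.4])
```

Only the output weights β are learned; the hidden parameters stay fixed after sampling, which is why the feature map is built once and reused.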

With the hinge loss function l_h(·), Huang et al. proposed a regularized ELM framework [2] (called OPT-ELM) of the form

$\min_{β} \; \lVert β \rVert_{2}^{2} + C \sum_{i=1}^{N} l_{h}\big(1 - y_{i} β^{T} φ(x_{i})\big)$ (3)

which minimizes the training error as well as the norm of the output weights, where C > 0 is a penalty factor. This is a convex problem with a global solution. However, when the dataset lacks labeled samples, the supervised OPT-ELM (3) is difficult to apply.

3 Semi-supervised extreme learning machine

Inspired by S³VM, we build a semi-supervised ELM classification (SSELM) framework with arbitrary norm. The main idea of SSELM is to find a hyperplane far away from both the labeled and unlabeled samples, and also to minimize the norm of the output weights. In particular, for each unlabeled sample φ(x_j) in ELM feature space, we introduce two slack variables r_j and s_j, which stand for the two possible misclassification errors. Then φ(x_j) belongs to the class with the lower misclassification error, min {r_j, s_j}. To this end, SSELM with q-norm regularization can be formulated as

$\begin{aligned} \min_{β, ξ, r, s} \quad & \lVert β \rVert_{q}^{q} + ν \sum_{i=1}^{N} ξ_{i} + μ \sum_{j=N+1}^{N+P} \min\{r_{j}, s_{j}\} \\ \text{s.t.} \quad & y_{i} β^{T} φ(x_{i}) + ξ_{i} \geq 1, \; ξ_{i} \geq 0, \\ & β^{T} φ(x_{j}) + r_{j} \geq 1, \; r_{j} \geq 0, \\ & - β^{T} φ(x_{j}) + s_{j} \geq 1, \; s_{j} \geq 0, \\ & i = 1, 2, \dots, N, \; j = N+1, \dots, N+P \end{aligned}$ (4)

where ν, μ > 0 are two penalty parameters. The first two terms of the objective function together with the first constraint correspond to a supervised OPT-ELM with q-norm regularization. The last term in the objective function together with the remaining constraints assigns each unlabeled sample φ(x_j) to the positive or the negative class, whichever generates the lower misclassification error min {r_j, s_j}.
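To make the objective in (4) concrete, the sketch below evaluates it for a fixed β, setting each slack variable to the smallest value its constraints allow: ξ_i = max(0, 1 - y_i β^Tφ(x_i)), r_j = max(0, 1 - β^Tφ(x_j)) and s_j = max(0, 1 + β^Tφ(x_j)). The function name and the toy data are hypothetical and serve only as an illustration.

```python
def sselm_objective(beta, labeled, unlabeled, nu, mu, q=1):
    """Value of (4) for fixed beta with every slack at its smallest feasible value.

    labeled:   list of (phi_i, y_i) pairs with y_i in {+1, -1}
    unlabeled: list of feature vectors phi_j
    """
    def score(p):
        return sum(b * v for b, v in zip(beta, p))

    if q == 0:
        reg = float(sum(1 for b in beta if b != 0))   # zero-norm: count nonzeros
    else:
        reg = sum(abs(b) ** q for b in beta)          # ||beta||_q^q for q = 1, 2

    xi = [max(0.0, 1.0 - y * score(p)) for p, y in labeled]
    r = [max(0.0, 1.0 - score(p)) for p in unlabeled]  # error if assigned to +1
    s = [max(0.0, 1.0 + score(p)) for p in unlabeled]  # error if assigned to -1
    unlabeled_term = sum(min(rj, sj) for rj, sj in zip(r, s))
    return reg + nu * sum(xi) + mu * unlabeled_term

# Hypothetical toy data in a 2-dimensional feature space.
beta = [1.0, -1.0]
labeled = [([2.0, 0.0], 1), ([0.0, 0.5], -1)]
unlabeled = [[0.5, 0.0], [0.0, 3.0]]
value = sselm_objective(beta, labeled, unlabeled, nu=10.0, mu=1.0, q=1)
```

In this toy case the second unlabeled point lies far from the hyperplane and contributes nothing, while the first lies inside the margin and is penalized by min {r_j, s_j} = 0.5.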

Remarks

(1) Unlike the popular S³VM, which is difficult to optimize for nonlinear problems because of its unknown implicit mapping and kernel parameters, the kernel function of the SSELM framework has the explicit form K_ELM(x_i, x_j) = φ(x_i)^Tφ(x_j), and its network parameters are randomly generated without tuning.

(2) The bias ϱ is not required in SSELM since the separating hyperplane β^Tφ(x) = 0 passes through the origin in ELM feature space, while S³VM needs a threshold to determine the hyperplane. Therefore, SSELM is more convenient to apply.

(3) Two types of optimization methods are developed to solve the proposed SSELM. The first one is an exact solution approach that reformulates SSELM as mixed integer programming. The second is an approximation approach that approximates the SSELM framework by DC (difference of convex functions) programming.

In the following sections, we discuss the norm q = 0, 1 and q = 2 respectively.

3.1 SSELM by mixed integer programming (SSELM-MIP)

SSELM framework (4) involves a nonconvex and nonsmooth problem owing to the last term in the objective function, which precludes the use of convex and smoothing methods.

3.1.1 Solving SSELM by mixed integer linear programming

It is common to take the norm q = 1, which gives

$\begin{aligned} \min_{β, ξ, r, s} \quad & \lVert β \rVert_{1} + ν \sum_{i=1}^{N} ξ_{i} + μ \sum_{j=N+1}^{N+P} \min\{r_{j}, s_{j}\} \\ \text{s.t.} \quad & y_{i} β^{T} φ(x_{i}) + ξ_{i} \geq 1, \; ξ_{i} \geq 0, \\ & β^{T} φ(x_{j}) + r_{j} \geq 1, \; r_{j} \geq 0, \\ & - β^{T} φ(x_{j}) + s_{j} \geq 1, \; s_{j} \geq 0, \\ & i = 1, 2, \dots, N, \; j = N+1, \dots, N+P \end{aligned}$ (5)

Further, we add the variable z ∈ R^L with components satisfying |β_k| ≤ z_k (k = 1, …, L). By introducing a binary variable d ∈ {0, 1}^P with one component d_j for each unlabeled sample, we obtain the following reformulation of SSELM (5) in epigraph form:

$\begin{aligned} \min_{z, β, ξ, r, s, d} \quad & \sum_{k=1}^{L} z_{k} + ν \sum_{i=1}^{N} ξ_{i} + μ \sum_{j=N+1}^{N+P} (r_{j} + s_{j}) \\ \text{s.t.} \quad & y_{i} β^{T} φ(x_{i}) + ξ_{i} \geq 1, \; ξ_{i} \geq 0, \; i = 1, 2, \dots, N, \\ & β^{T} φ(x_{j}) + r_{j} + M (1 - d_{j}) \geq 1, \; r_{j} \geq 0, \\ & - β^{T} φ(x_{j}) + s_{j} + M d_{j} \geq 1, \; s_{j} \geq 0, \\ & d_{j} \in \{0, 1\}, \; j = N+1, \dots, N+P, \\ & | β_{k} | \leq z_{k}, \; k = 1, 2, \dots, L \end{aligned}$ (6)

where M > 0 is a sufficiently large constant such that if d_j = 0 then r_j = 0 is feasible for any optimal β, which attempts to classify the unlabeled point x_j into the negative class. Likewise, if d_j = 1 then s_j = 0 is feasible, which attempts to classify the point x_j into the positive class.

This is a linear mixed integer program (called SSELM-LMIP) and solving SSELM-LMIP (6) obtains the global solution of SSELM (5). In general, this mixed integer programming algorithm is computationally very demanding.
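The role of the binary variable d_j can be checked numerically for a single unlabeled sample. The sketch below fixes the score f = β^Tφ(x_j), computes the smallest r_j + s_j that the big-M constraints of (6) allow for each value of d_j, and compares the better of the two choices with min {r_j, s_j} from (5). The function names and the value M = 1000 are assumptions of this illustration; the agreement relies on M ≥ 1 + |f|.

```python
def bigm_unlabeled_cost(f, d, M=1e3):
    """Smallest r_j + s_j permitted by the big-M constraints of (6).

    f is the score beta^T phi(x_j) and d is the binary variable d_j.
    """
    r = max(0.0, 1.0 - f - M * (1 - d))   # from  f + r + M(1 - d) >= 1, r >= 0
    s = max(0.0, 1.0 + f - M * d)         # from -f + s + M d     >= 1, s >= 0
    return r + s

def min_slack(f):
    """min{r_j, s_j} in (5) with both slacks at their smallest feasible value."""
    return min(max(0.0, 1.0 - f), max(0.0, 1.0 + f))

scores = [-3.0, -0.4, 0.0, 0.7, 5.0]
best_over_d = [min(bigm_unlabeled_cost(f, 0), bigm_unlabeled_cost(f, 1)) for f in scores]
```

Minimizing over d_j ∈ {0, 1} therefore reproduces the nonsmooth term min {r_j, s_j}, which is the mechanism behind the equivalence of (5) and (6).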

The mixed integer programming algorithm for solving problem SSELM (5) is described as follows.

Algorithm 1.

(1) Choose a sufficiently large M > 0 and suitable parameters ν, μ > 0.

(2) Construct the SSELM (5) and its mixed integer programming formulation SSELM-LMIP (6).

(3) Solve SSELM-LMIP (6) to obtain point (β, z, ξ, r, s, d).

Theorem 1. Suppose point (β, z, ξ, r, s, d) is an optimal solution of SSELM-LMIP (6), where M > 0 is a sufficiently large constant, then the corresponding (β, ξ, r, s) is an exact solution of SSELM (5).

Proof. From the above analysis, we know that SSELM (5) in variables (β, ξ, r, s) is equivalent to the SSELM-LMIP (6) in variables (β, z, ξ, r, s, d). Thus, if (β, z, ξ, r, s, d) is the exact solution of the SSELM-LMIP (6), then the corresponding (β, ξ, r, s) is the exact solution of SSELM (5).

3.1.2 Solving SSELM by mixed integer quadratic programming

Taking the norm q = 2, we obtain

$\begin{aligned} \min_{β, ξ, r, s} \quad & \lVert β \rVert_{2}^{2} + ν \sum_{i=1}^{N} ξ_{i} + μ \sum_{j=N+1}^{N+P} \min\{r_{j}, s_{j}\} \\ \text{s.t.} \quad & y_{i} β^{T} φ(x_{i}) + ξ_{i} \geq 1, \; ξ_{i} \geq 0, \\ & β^{T} φ(x_{j}) + r_{j} \geq 1, \; r_{j} \geq 0, \\ & - β^{T} φ(x_{j}) + s_{j} \geq 1, \; s_{j} \geq 0, \\ & i = 1, 2, \dots, N, \; j = N+1, \dots, N+P \end{aligned}$ (7)

Similarly, we obtain a mixed integer quadratic program:

$\begin{aligned} \min_{β, ξ, r, s, d} \quad & \lVert β \rVert_{2}^{2} + ν \sum_{i=1}^{N} ξ_{i} + μ \sum_{j=N+1}^{N+P} (r_{j} + s_{j}) \\ \text{s.t.} \quad & y_{i} β^{T} φ(x_{i}) + ξ_{i} \geq 1, \; ξ_{i} \geq 0, \\ & β^{T} φ(x_{j}) + r_{j} + M (1 - d_{j}) \geq 1, \; r_{j} \geq 0, \\ & - β^{T} φ(x_{j}) + s_{j} + M d_{j} \geq 1, \; s_{j} \geq 0, \\ & d_{j} \in \{0, 1\}, \; i = 1, 2, \dots, N, \; j = N+1, \dots, N+P \end{aligned}$ (8)

where M > 0 is a sufficiently large constant. The algorithm is similar to Algorithm 1 and is thus omitted.

3.2 SSELM by DC programming

3.2.1 Solving SSELM by DC programming (SSELM-DC)

In this section, we solve SSELM (5) by DC programming with a continuous objective function. To simplify the presentation, let Ω be the feasible region of (5), namely

$Ω = \left\{ (β, ξ, r, s) : \begin{array}{l} y_{i} β^{T} φ(x_{i}) + ξ_{i} \geq 1, \; ξ_{i} \geq 0, \\ β^{T} φ(x_{j}) + r_{j} \geq 1, \; r_{j} \geq 0, \\ - β^{T} φ(x_{j}) + s_{j} \geq 1, \; s_{j} \geq 0, \\ i = 1, 2, \dots, N, \; j = N+1, \dots, N+P \end{array} \right\}$ (9)

which is a polyhedral convex set in ELM feature space.

With the definition (9), the SSELM (5) is simplified as:

$\min_{β, ξ, r, s} \left\{ \lVert β \rVert_{1} + ν e^{T} ξ + μ e^{T} \min\{r, s\} : (β, ξ, r, s) \in Ω \right\}$ (10)

where e denotes a vector of ones of appropriate dimension and min {r, s} is taken componentwise. By adding the variable z ∈ R^L satisfying |β| ≤ z, problem (10) can be reformulated as

$\min_{β, z, ξ, r, s} \left\{ e^{T} z + ν e^{T} ξ + μ e^{T} \min\{r, s\} : (β, z, ξ, r, s) \in Ω_{1} \right\}$ (11)

where Ω₁ = {(β, z, ξ, r, s) : |β| ≤ z, (β, ξ, r, s) ∈ Ω}. Then let x = (β, z, ξ, r, s) and adopt the following DC decomposition:

$\begin{aligned} g_{1}(x) &= e^{T} z + ν e^{T} ξ + χ_{Ω_{1}}(x), \\ h_{1}(x) &= - μ e^{T} \min\{r, s\}. \end{aligned}$ (12)

Note that g₁ and h₁ are convex; in particular, h₁(x) = μ e^T max {-r, -s} is polyhedral convex. Thus problem (11) is a polyhedral DC program (SSELM-DC for short), and its DCA terminates after a finite number of iterations at a point satisfying a necessary optimality condition for problem (11). As usual, ∂h₁ can be computed explicitly with the help of known rules of convex analysis. Here we have

$\partial h_{1}(x) = \partial \Big(- μ \sum_{i=1}^{P} \min\{r_{i}, s_{i}\}\Big)$ (13)

$= - μ \sum_{i=1}^{P} \begin{cases} (0, 0, 0, I_{i}, 0), & r_{i} < s_{i} \\ (0, 0, 0, 0, I_{i}), & r_{i} > s_{i} \\ (0, 0, 0, (1 - λ) I_{i}, λ I_{i}), & r_{i} = s_{i} \end{cases}$ (14)

where r_i and s_i are the ith components of r and s, I_i ∈ R^P is the ith column of the identity matrix I, λ ∈ [0, 1], and the five blocks follow the ordering x = (β, z, ξ, r, s).
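The subgradient selection in (13)-(14) is simple to implement, since only the r and s blocks are nonzero. The sketch below returns one element of ∂h₁(x) restricted to those two blocks; the function name and the default choice λ = 0.5 at ties are assumptions made for illustration.

```python
def subgrad_h1(r, s, mu, lam=0.5):
    """One element of the subdifferential (13)-(14), restricted to the (r, s) blocks.

    Returns (y_r, y_s); the beta, z and xi blocks of the subgradient are zero.
    lam in [0, 1] selects the element used where r_i == s_i.
    """
    y_r, y_s = [], []
    for ri, si in zip(r, s):
        if ri < si:            # min{r_i, s_i} = r_i, so only the r block is active
            y_r.append(-mu)
            y_s.append(0.0)
        elif ri > si:          # min{r_i, s_i} = s_i, so only the s block is active
            y_r.append(0.0)
            y_s.append(-mu)
        else:                  # tie: any convex combination is a valid subgradient
            y_r.append(-mu * (1.0 - lam))
            y_s.append(-mu * lam)
    return y_r, y_s

y_r, y_s = subgrad_h1([0.2, 1.0, 0.5], [0.8, 0.3, 0.5], mu=2.0, lam=0.25)
```

Because each unlabeled sample contributes to exactly one of the two blocks away from ties, the linear term x^T y^k in the DCA subproblem acts as a tentative class assignment for that sample.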

The DC algorithm (DCA) for solving problem (11) is as follows.

Algorithm 2

(1) Choose a sufficiently small tolerance ɛ > 0 and set k = 0. Choose an initial point x⁰ ∈ Ω₁ and suitable parameters ν, μ > 0.

(2) Compute y^k ∈ ∂h₁(x^k) via (13)-(14).

(3) Solve the following linear program to obtain x^{k+1}:

$\min \left\{ e^{T} z + ν e^{T} ξ - x^{T} y^{k} : (β, z, ξ, r, s) \in Ω_{1} \right\}$ (15)

(4) If either ∥x^{k+1} - x^k∥ < ɛ or g₁(x^{k+1}) - h₁(x^{k+1}) ≥ g₁(x^k) - h₁(x^k) - ɛ, stop; x^{k+1} is the computed solution. Otherwise, set k = k + 1 and go to (2).

Theorem 2.

(1) Algorithm 2 generates the sequence {x^k} such that g₁(x^k) - h₁(x^k) is monotonically decreasing.

(2) The sequence {x^k} converges to x^* in a finite number of iterations.

(3) If the optimal value of problem (11) is finite, then the limit point x^* is a critical point of the objective function in problem (11).

(4) The point x^* is almost always a local optimal minimizer of problem (11).

Proof. The conclusions (1) and (3) are the direct consequences of general DC programming, while the conclusions (2) and (4) are also true since problem (11) is a polyhedral DC program. The proof is then complete.

3.2.2 Semi-supervised ELM with 2-norm regularization

We discuss SSELM with 2-norm regularization, called 2-norm SSELM.

$\begin{aligned} \min_{β, ξ, r, s} \quad & \lVert β \rVert_{2}^{2} + ν e^{T} ξ + μ e^{T} \min\{r, s\} \\ \text{s.t.} \quad & (β, ξ, r, s) \in Ω \end{aligned}$ (16)

which is a nonconvex and nonsmooth optimization problem with linear constraints. Let x = (β, ξ, r, s). The 2-norm SSELM (16) can be posed as a DC program (called SSELM-DC2)

$\min \{ g_{2}(x) - h_{2}(x) \}$ (17)

with the DC decomposition $g_{2}(x) = \lVert β \rVert_{2}^{2} + ν e^{T} ξ + χ_{Ω}(x)$ and h₂(x) = -μe^T min {r, s}. Clearly, h₂ is polyhedral convex, and thus problem (17) is a polyhedral DC program.

Performing DCA for problem (17) amounts to computing the sequence {x^k}, where x^{k+1} is the solution to the convex quadratic program min {g₂(x) - (y^k)^T (x - x^k)} with y^k ∈ ∂h₂(x^k), namely

$\min \left\{ \lVert β \rVert_{2}^{2} + ν e^{T} ξ - x^{T} y^{k} : (β, ξ, r, s) \in Ω \right\}$ (18)

The DCA for solving problem (17) is as follows.

Algorithm 3. (DCA for solving (17))

(1) Choose a sufficiently small tolerance ɛ > 0 and set k = 0. Choose an initial point x⁰ ∈ Ω and suitable parameters ν, μ > 0.

(2) Compute y^k ∈ ∂h₂(x^k) via (13)-(14).

(3) Solve the convex quadratic program (18) to obtain x^{k+1}.

(4) If either ∥x^{k+1} - x^k∥ < ɛ or g₂(x^{k+1}) - h₂(x^{k+1}) ≥ g₂(x^k) - h₂(x^k) - ɛ, stop; x^{k+1} is the computed solution. Otherwise, set k = k + 1 and go to (2).

Theorem 3.

(1) Algorithm 3 generates the sequence {x^k} such that g₂(x^k) - h₂(x^k) is monotonically decreasing.

(2) The sequence {x^k} converges to x^* in a finite number of iterations.

(3) If the optimal value of problem (17) is finite, then the limit point x^* is a critical point of the objective function in problem (17).

(4) The point x^* is almost always a local optimal minimizer of problem (17).

Proof. The conclusions (1) and (3) are the direct consequences of general DC programming, while the conclusions (2) and (4) are also true since problem (17) is a polyhedral DC program. The proof is then complete.

3.3 SSELM with the zero-norm regularization

Taking q = 0 in SSELM (4), the corresponding optimization problem (called zero-SSELM) is

$\begin{aligned} \min_{β, ξ, r, s} \quad & \lVert β \rVert_{0} + ν \sum_{i=1}^{N} ξ_{i} + μ \sum_{j=N+1}^{N+P} \min\{r_{j}, s_{j}\} \\ \text{s.t.} \quad & y_{i} β^{T} φ(x_{i}) + ξ_{i} \geq 1, \; ξ_{i} \geq 0, \\ & β^{T} φ(x_{j}) + r_{j} \geq 1, \; r_{j} \geq 0, \\ & - β^{T} φ(x_{j}) + s_{j} \geq 1, \; s_{j} \geq 0, \\ & i = 1, 2, \dots, N, \; j = N+1, \dots, N+P \end{aligned}$ (19)

This is a combinatorial optimization problem, since the zero-norm is a discrete-valued function. We consider a continuous approximation to the zero-norm [4]:

$\lVert β \rVert_{0} \approx \sum_{k=1}^{L} η(β_{k})$ (20)

where η is the function (see Fig. 1) defined by

$η(z) = 1 - \exp(- α | z |), \quad α > 0, \; \forall z \in R$ (21)

Fig. 1 Approximation to the zero-norm by the function η(z)

Thus, the zero-norm ∥β∥₀ is approximated by

$\lVert β \rVert_{0} \approx η(β) = e^{T} \big(e - \exp(- α | β |)\big), \quad α > 0$ (22)

where e is the vector of ones and exp and |·| act componentwise. With this approximation, we obtain the approximate problem of (19) as follows:

$\begin{aligned} \min \quad & \sum_{k=1}^{L} η(β_{k}) + ν e^{T} ξ + μ \sum_{j=N+1}^{N+P} \min\{r_{j}, s_{j}\} \\ \text{s.t.} \quad & (β, ξ, r, s) \in Ω \end{aligned}$ (23)

It is worth noting that the function η can be expressed as

$η(β) = g(β) - h(β), \quad α > 0$ (24)

with

$\begin{aligned} g(β) &= α e^{T} | β |, \\ h(β) &= α e^{T} | β | - e^{T} \big(e - \exp(- α | β |)\big) \end{aligned}$ (25)

Both g and h are convex, and g is polyhedral convex. Thus η(β) is a DC function.
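The decomposition η = g - h and the quality of the zero-norm approximation are easy to check numerically. The following sketch assumes the scalar forms of (21) and (25) with the natural exponential; the function names and the test vector are hypothetical.

```python
import math

def eta(z, alpha):
    """Scalar approximation (21): 1 - exp(-alpha * |z|)."""
    return 1.0 - math.exp(-alpha * abs(z))

def g_part(z, alpha):
    """Polyhedral convex component g in (25), scalar form."""
    return alpha * abs(z)

def h_part(z, alpha):
    """Smooth convex component h in (25), scalar form."""
    return alpha * abs(z) - 1.0 + math.exp(-alpha * abs(z))

def approx_zero_norm(beta, alpha):
    """Sum of eta over the components of beta, as in (20)."""
    return sum(eta(b, alpha) for b in beta)

beta = [0.0, 2.0, -0.5, 0.0]          # two nonzero components
loose = approx_zero_norm(beta, alpha=1.0)
tight = approx_zero_norm(beta, alpha=50.0)
```

Larger α makes the approximation tighter (here `tight` is close to 2, the true zero-norm), at the price of a steeper function near zero, so α is a trade-off parameter rather than something to push to infinity.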

With the approximation (20) and the decomposition (24), problem (23) takes the form:

$\begin{aligned} \min_{β, ξ, r, s} \quad & \sum_{k=1}^{L} \big(g(β_{k}) - h(β_{k})\big) + ν e^{T} ξ + μ \sum_{j=N+1}^{N+P} \min\{r_{j}, s_{j}\} \\ \text{s.t.} \quad & (β, ξ, r, s) \in Ω \end{aligned}$ (26)

Using the formulation (25), the above problem can be reformulated as:

$\min \left\{ G_{1}(β, ξ, r, s) - H_{1}(β, ξ, r, s) : (β, ξ, r, s) \in Ω \right\}$ (27)

with

$G_{1}(β, ξ, r, s) = \sum_{k=1}^{L} α | β_{k} | + ν e^{T} ξ + χ_{Ω}(β, ξ, r, s)$ (28)

and

$H_{1}(β, ξ, r, s) = e^{T} \big(α | β | - e + \exp(- α | β |)\big) - μ \sum_{j=N+1}^{N+P} \min\{r_{j}, s_{j}\}$ (29)

Since g and h are convex functions, so are G₁ and H₁. Note that problem (27) is a polyhedral DC program since G₁ is a polyhedral convex function, and thus its DCA has finite convergence.

Furthermore, the subdifferential of H₁ is given by

$\partial H_{1}(β, ξ, r, s) = \nabla h(β, ξ, r, s) + \partial \Big(- μ \sum_{j=N+1}^{N+P} \min\{r_{j}, s_{j}\}\Big)$ (30)

where ∇h(β, ξ, r, s) = (v, 0, 0, 0, 0) in the block ordering of (13)-(14), with

$v_{k} = \begin{cases} α \big(1 - \exp(- α β_{k})\big), & β_{k} \geq 0 \\ - α \big(1 - \exp(α β_{k})\big), & β_{k} < 0 \end{cases}$ (31)

while $\partial \big(- μ \sum_{j=N+1}^{N+P} \min\{r_{j}, s_{j}\}\big)$ is calculated by the formulas (13)-(14).
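As a sanity check on (31), the gradient formula can be compared against a central finite difference of the scalar function h(β) = α|β| - 1 + exp(-α|β|) taken from (25). The names below and the step size are choices made for this sketch only.

```python
import math

def h_scalar(b, alpha):
    """Scalar form of h in (25)."""
    return alpha * abs(b) - 1.0 + math.exp(-alpha * abs(b))

def v_component(b, alpha):
    """Derivative of h_scalar, following formula (31)."""
    if b >= 0:
        return alpha * (1.0 - math.exp(-alpha * b))
    return -alpha * (1.0 - math.exp(alpha * b))

def central_difference(b, alpha, step=1e-6):
    """Numerical derivative of h_scalar at b."""
    return (h_scalar(b + step, alpha) - h_scalar(b - step, alpha)) / (2.0 * step)

errors = [abs(v_component(b, 2.0) - central_difference(b, 2.0))
          for b in (-1.2, -0.3, 0.7, 2.0)]
```

The two branches of (31) meet at v = 0 when β_k = 0, so h is differentiable there even though |β_k| is not; this is why only the min {r_j, s_j} term needs a genuine subdifferential in (30).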

Furthermore, we introduce the variable z with |β| ≤ z. Let x = (β, z, ξ, r, s). According to the analysis in Section 3.2, performing DCA for problem (27) amounts to computing the sequence {x^k = (β^k, z^k, ξ^k, r^k, s^k), k = 1, 2, …}, where x^{k+1} is the solution to the linear program min {G₁(x) - (y^k)^T (x - x^k)} with y^k ∈ ∂H₁(x^k), namely:

$\begin{aligned} \min_{β, z, ξ, r, s} \quad & α e^{T} z + ν e^{T} ξ - (y^{k})^{T} (x - x^{k}) \\ \text{s.t.} \quad & (β, z, ξ, r, s) \in Ω_{2} \end{aligned}$ (32)

where Ω₂ = {(β, z, ξ, r, s) : (β, ξ, r, s) ∈ Ω, |β| ≤ z}. Next we describe the DCA applied to (27).

Algorithm 4. (Algorithm for solving (27))

1. Choose a sufficiently small tolerance ɛ > 0 and set k = 0. Choose an initial point x⁰ ∈ Ω₂.

2. Solve the linear program (32) to obtain x^{k+1}.

3. If ∥x^{k+1} - x^k∥ < ɛ or G₁(x^{k+1}) - H₁(x^{k+1}) ≥ G₁(x^k) - H₁(x^k) - ɛ, then stop; x^{k+1} is the computed solution. Otherwise, set k = k + 1 and go to 2.

Theorem 4.

1. DCA generates two sequences {x^k} and {y^k} such that G₁(x^k) - H₁(x^k) decreases monotonically.

2. The sequence {x^k} converges linearly.

Proof. The proof is simple and is thus omitted.

4 Experiments

We test the performance of the proposed algorithms on various data sets. Numerical simulation experiments are composed of two parts in this investigation. In the first part, we test the performance of the proposed schemes in five benchmark datasets from UCI Machine Learning Repository. In the second part, the proposed methods are used directly to classify a breast cancer dataset using near-infrared (NIR) spectroscopy data [17].

4.1 Experimental design

We chose the optimization-method-based ELM (called OPT-ELM) [2], the traditional SVM with ELM kernel (called SVM-ELM) [21] and semi-supervised support vector machine (S³VM) methods [18] as the baseline methods. For a fair comparison, these methods were run under the same conditions. Ten-fold cross-validation is used in this investigation. In each trial, each dataset is randomly divided into two parts, 10% of the data as labeled samples and the remaining 90% as unlabeled samples, and the algorithms are then applied to classify the unlabeled samples using the learned model.
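The random 10%/90% division used in each trial can be sketched as follows. This is only an illustration of the protocol: the function name, the seeding scheme and the absence of class stratification are assumptions, since the text does not specify them.

```python
import random

def split_labeled_unlabeled(n_samples, labeled_fraction=0.1, seed=0):
    """Randomly split sample indices into a labeled and an unlabeled part."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Keep at least one labeled sample even for very small datasets.
    n_labeled = max(1, round(labeled_fraction * n_samples))
    return sorted(indices[:n_labeled]), sorted(indices[n_labeled:])

# Example with the size of the Ionosphere dataset (351 samples).
labeled_idx, unlabeled_idx = split_labeled_unlabeled(351, labeled_fraction=0.1, seed=1)
```

Repeating the call with different seeds gives the independent trials over which results are averaged.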

Evaluation criteria for algorithms

For a comprehensive evaluation, model performance was assessed by the following criteria: accuracy (ACC), the identification rate over all samples from the two classes; G-ACC, the geometric mean of the two class-wise accuracies; and the Matthews correlation coefficient (MCC), a comprehensive measure of the quality of a classification model. These values are obtained from the decision function and are defined as [21]

$ACC = \frac{TP + TN}{TP + FN + TN + FP}$

$G\text{-}ACC = \sqrt{a^{+} \times a^{-}}$

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) (TP + FN) (TN + FP) (TN + FN)}}$

where TP and TN denote true positives and true negatives, FN and FP denote false negatives and false positives, a⁺ = TP/(TP + FN) and a⁻ = TN/(TN + FP). The G-ACC is likewise a comprehensive measure of the quality of classification models. The higher these values are, the better the model. The sigmoid function g(w, b, x) = 1/(1 + exp(-(w^Tx + b))) is used as the activation function in the hidden layer.

Parameter selection

The number of hidden nodes is an important index for designing and training an ELM network. The number of hidden nodes L of the ELM is selected from the set of values {50, 100, 200, 500, 1000} by ten-fold cross-validation, and the optimum value of L is the one that maximizes the accuracy.

The performance of the proposed SSELM depends also on the choices of penalty parameters ν and μ. In each dataset, these parameters were tuned from the sets of values {1, 10, 100, 1000} by ten-fold cross-validation. For each combination of these values, the accuracy was calculated and the optimum parameters are selected to maximize the accuracy in each dataset. The S³VM parameters are set to be the same as SSELM.
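The parameter selection described above amounts to a grid search. The sketch below searches L, ν and μ jointly, which is an assumption of this example (the text does not say whether L is tuned jointly with the penalties or separately); `evaluate` is a hypothetical callback that would return the mean cross-validated accuracy for one parameter combination.

```python
from itertools import product

def select_parameters(evaluate,
                      hidden_sizes=(50, 100, 200, 500, 1000),
                      penalties=(1, 10, 100, 1000)):
    """Return the (L, nu, mu) triple with the highest score.

    evaluate(L, nu, mu) is a hypothetical callback returning mean accuracy.
    Ties are resolved in favor of the first combination in grid order.
    """
    grid = product(hidden_sizes, penalties, penalties)
    return max(grid, key=lambda params: evaluate(*params))

def toy_score(L, nu, mu):
    # Stand-in for cross-validated accuracy, peaked at (200, 10, 100).
    return -abs(L - 200) - abs(nu - 10) - abs(mu - 100)

best = select_parameters(toy_score)
```

With 5 values of L and 4 values each of ν and μ, the joint grid has 80 combinations, each requiring a full cross-validation run.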

We compare with the corresponding S³VM

$\begin{aligned} \min \quad & \lVert ω \rVert_{q}^{q} + ν \sum_{i=1}^{N} ξ_{i} + μ \sum_{j=N+1}^{N+P} \min\{r_{j}, s_{j}\} \\ \text{s.t.} \quad & y_{i} (ω^{T} x_{i} - ϱ) + ξ_{i} \geq 1, \; ξ_{i} \geq 0, \\ & ω^{T} x_{j} - ϱ + r_{j} \geq 1, \; r_{j} \geq 0, \\ & - ω^{T} x_{j} + ϱ + s_{j} \geq 1, \; s_{j} \geq 0, \\ & i = 1, 2, \dots, N, \; j = N+1, \dots, N+P \end{aligned}$ (33)

With q = 1, 2, it can be solved by different optimization algorithms.

1-norm S³VM is solved by mixed integer linear programming and by DC programming, called S³VM-MILP and S³VM-DC respectively.

2-norm S³VM is solved by DC programming, called S³VM-DC2.

4.2 Experiments

We carry out experiments on two databases.

Experiments on UCI benchmark datasets

We first compare with the supervised learning methods, the ELM (26), SVM-ELM and OPT-ELM, with the ratio of labeled to unlabeled samples being 1:9 in five UCI datasets. The average results are reported in Table 1, which shows that incorporating unlabeled data improves generalization on Vote and Ionosphere; for the other three datasets, the proposed algorithms achieve performance equivalent to that of the supervised ELM methods.

Table 1
Comparisons of the proposed methods with other ELM methods

dataset data scale	classification methods	ACC (%)	MCC (%)	G-ACC (%)
Vote (432 × 13)	ELM	87.28	75.76	87.91
	OPT-ELM	88.65	77.73	88.87
	SVM-ELM	89.55	78.90	89.45
	SSELM-SDC	93.30	87.86	93.96
	SSELM-DC2	89.17	78.93	89.47
	Z-SSELM	91.64	83.10	91.55
Ionosphere (351 × 34)	ELM	69.81	32.12	69.94
	OPT-ELM	73.30	39.99	70.58
	SVM-ELM	74.83	43.36	72.26
	SSELM-SDC	76.82	45.39	76.15
	SSELM-DC2	77.68	48.01	75.34
	Z-SSELM	77.09	46.10	74.63
Sonar (208 × 60)	ELM	50.28	1.48	50.74
	OPT-ELM	51.11	2.61	51.30
	SVM-ELM	54.37	9.46	54.84
	SSELM-SDC	56.07	9.51	55.59
	SSELM-DC2	56.58	9.78	56.40
	Z-SSELM	56.04	11.61	56.32
WBC (699 × 9)	ELM	88.94	78.27	89.24
	OPT-ELM	87.31	71.87	86.05
	SVM-ELM	86.96	70.50	85.44
	SSELM-SDC	90.52	79.87	85.80
	SSELM-DC2	86.78	70.10	85.23
	Z-SSELM	93.30	85.45	92.75
Heart (270 × 13)	ELM	61.56	21.07	60.80
	OPT-ELM	60.37	18.03	59.50
	SVM-ELM	58.44	13.87	57.34
	SSELM-SDC	68.08	22.41	77.44
	SSELM-DC2	71.19	30.32	72.49
	Z-SSELM	71.07	41.99	70.99

Then we compare with S³VM-DC, and the results are reported in Table 2. We find that the proposed algorithms achieve better results in most cases.

Table 2

Comparisons of the proposed methods with the S³VM algorithm

dataset data scale	classification methods	ACC (%)	MCC (%)	G-ACC (%)
Vote (432 × 13)	S³VM-DC	91.64	83.10	91.55
	SSELM-DC	93.30	87.86	93.96
	SSELM-DC2	92.09	82.03	92.02
Ionosphere (351 × 34)	S³VM-DC	77.09	46.10	74.63
	SSELM-DC	76.82	45.39	76.15
	SSELM-DC2	78.47	48.06	83.83
Sonar (208 × 60)	S³VM-DC	54.11	9.67	54.88
	SSELM-DC	56.07	9.51	55.59
	SSELM-DC2	56.04	11.61	56.32
WBC (699 × 9)	S³VM-SDC2	92.83	84.11	92.04
	SSELM-DC	90.52	79.87	85.80
	SSELM-DC2	93.30	85.45	92.75
Heart (270 × 13)	S³VM-DC	71.07	41.99	70.99
	SSELM-DC	68.08	22.41	77.44
	SSELM-DC2	71.03	20.25	70.33

To further test the proposed algorithms, we examine their effectiveness as the proportion of labeled samples varies. We compare with other semi-supervised ELM methods, SSL-ELM [10], SELM [11] and NRCM [11], for different ratios of labeled to unlabeled samples on the Ionosphere dataset. The results in terms of ACC (%) are given in Table 3.

Table 3

Compared with other semi-supervised ELMs for Ionosphere dataset

Ratio	SSELM-DC2	SSL-ELM	SELM	NRCM
10	74.65	76.35	71.79	74.85
50	85.47	88.74	86.68	87.27
150	91.90	91.08	90.21	90.84

Table 3 shows that the performance of the proposed SSELM-DC2 is competitive with other semi-supervised ELM methods which are based on the manifold assumption in all considered cases.

Experiments on medical data

Classification of antineoplastics is an important issue. The proposed method is applied to a practical breast cancer drug dataset consisting of tamoxifen and toremifene citrate tablets (China, 2014). We chose 120 tablets, 60 tamoxifen and 60 toremifene, for the experimental analysis. Near-infrared (NIR) spectra were acquired using an MPA spectrometer over the range 4000-12000 cm^-1 with a resolution of 4 cm^-1. Each sample spectrum was the average of 32 scans. A final spectrum was taken as the mean spectrum of these spectra. To validate the performance of the proposed method, the spectral range is divided into a low-frequency and a high-frequency region, 4000-8000 cm^-1 and 8000-12000 cm^-1 respectively.

Tamoxifen and toremifene have a 1 : 1 ratio in the training set and the test set. We compare with the OPT-ELM, SSELM-MILP and S³VM-MILP in terms of ACC on the spectral regions, with the ratio of labeled to unlabeled samples being 1 : 9 in the spectral datasets. The average results are reported in Table 4, which illustrates that the proposed methods either improve generalization or show no significant difference compared with the traditional approaches.

Table 4

Compared with other methods for NIR datasets

Regions	4000-8000 (cm^-1)	8000-10000 (cm^-1)	10000-12000 (cm^-1)
OPT-ELM	99.67	87.79	75.25
S³VM-MILP	99.95	93.82	95.57
S³VM-DC2	99.97	94.91	98.06
SSELM-MILP	99.97	93.34	94.68
SSELM-DC2	99.96	93.12	95.98
Z-SSELM	99.98	95.35	95.93

5 Conclusion

We implement the cluster assumption for semi-supervised learning in ELM and propose an arbitrary norm semi-supervised ELM classification framework (SSELM). The main contributions of this work are as follows:

(1) We construct a semi-supervised ELM framework with arbitrary norm regularization when insufficient training information is available.

(2) Unlike the popular S³VM, which is difficult to optimize for nonlinear problems because of its unknown implicit mapping and kernel parameters, the SSELM framework has an explicit kernel function, and its network parameters are randomly generated without tuning.

(3) The bias ϱ is not required in SSELM since the separating hyperplane β^Tφ(x) = 0 passes through the origin in ELM feature space, while S³VM needs a threshold to determine the hyperplane. Therefore, SSELM is more convenient to apply.

(4) Two types of optimization methods are developed to solve the nonconvex SSELM. The first is an exact approach that reformulates the SSELM framework as a mixed integer program with a global solution. The second is an approximation approach that treats the SSELM framework by DC programming, and the resulting DCA converges linearly or finitely. Several optimization algorithms are thereby developed to solve the semi-supervised ELM framework.

Compared with the traditional methods, experimental results show that incorporating unlabeled samples in training improves generalization when insufficient training information is available. Moreover, the proposed SSELM outperforms the existing semi-supervised learning methods and the traditional ELM in different spectral regions. These results show the feasibility and effectiveness of the proposed methods.

Appendix

5.1 DC programming

Generally speaking, a DC program takes the form

$\inf \{ f(x) = g(x) - h(x) : x \in R^{n} \} \quad (P_{dc})$ (34)

where g and h are lower semicontinuous proper convex functions on Rⁿ. Such a function f is called a DC function, and g and h are the DC components of f.

A function θ (x) is said to be polyhedral convex if

$\theta(x) = \max \{ a_{i}^{T} x - b_{i} : i = 1, 2, \dots, m \} + \chi_{\Omega}(x), \quad \forall x \in R^{n}$ (35)

where a_i ∈ Rⁿ and b_i ∈ R (i = 1, 2, ⋯, m), and χ_Ω is the indicator function of the non-empty convex set Ω, defined as χ_Ω(x) = 0 if x ∈ Ω and +∞ otherwise. A DC program is called a polyhedral DC program when either g or h is a polyhedral convex function.

A point x^* that satisfies the following generalized Kuhn-Tucker condition is called a critical point of (P_dc)

$\partial h(x^{*}) \cap \partial g(x^{*}) \neq \emptyset$ (36)

where ∂h is the subdifferential of the convex function h. It follows that if h is polyhedral convex, then such a critical point of (P_dc) is almost always a local solution to (P_dc).

The necessary local optimality condition for (P_dc) is

$\emptyset \neq \partial h(x^{*}) \subset \partial g(x^{*})$ (37)

which is also sufficient for many important classes of DC programs, for example for polyhedral DC programs, or when f is locally convex at x^*. We use g^*(y) = sup {x^Ty - g(x) : x ∈ Rⁿ} to denote the conjugate function of g. The Fenchel-Rockafellar dual of (P_dc) is defined as [16, 17]

$\inf \{ h^{*}(y) - g^{*}(y) : y \in R^{n} \} \quad (D_{dc})$ (38)

The DC optimization algorithm (DCA) is an iterative algorithm based on local optimality conditions and duality. The idea of DCA is simple: at each iteration, one replaces the second component h in the primal DC problem (P_dc) by its affine minorization, h(x^k) + (x - x^k)^T y^k, to generate a convex program

$\min \{ g(x) - h(x^{k}) - (x - x^{k})^{T} y^{k} : x \in R^{n} \}, \quad y^{k} \in \partial h(x^{k})$ (39)

In practice, a simplified form of the DCA is used. Two sequences {x^k} and {y^k} satisfying y^k ∈ ∂h(x^k) are constructed, and x^{k+1} is a solution to the convex program (39). The simplified DCA scheme is described as follows.

Initialization: Choose an initial point x⁰ ∈ Rⁿ and set k = 0

Repeat

Calculate y^k ∈ ∂h (x^k)

Solve the convex program (39) to obtain x^{k+1}

Let k:=k+1

Until some stopping criterion is satisfied.
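As an illustration of the scheme above, the following minimal Python sketch applies the simplified DCA to a toy one-dimensional problem that is not taken from this paper: f(x) = x⁴ − 2x², with the assumed decomposition g(x) = x⁴ and h(x) = 2x². Here y^k = h'(x^k) = 4x^k, and the convex subproblem (39) has the closed-form solution x^{k+1} = (y^k/4)^{1/3}. The function name `dca_quartic`, the tolerance and the iteration cap are illustrative choices.

```python
import math


def dca_quartic(x0, tol=1e-12, max_iter=200):
    """Simplified DCA on the toy problem f(x) = g(x) - h(x),
    with g(x) = x**4 and h(x) = 2 * x**2 (an assumed example)."""

    def f(x):
        return x ** 4 - 2.0 * x ** 2

    x = float(x0)
    history = [f(x)]
    for _ in range(max_iter):
        # Step 1: y^k in the subdifferential of h at x^k (h is smooth, so y^k = h'(x^k)).
        y = 4.0 * x
        # Step 2: solve min_x g(x) - x * y, i.e. 4 x^3 = y, via a signed cube root.
        x_next = math.copysign(abs(y / 4.0) ** (1.0 / 3.0), y)
        history.append(f(x_next))
        # Stopping criterion: the objective no longer decreases noticeably.
        if history[-2] - history[-1] < tol:
            x = x_next
            break
        x = x_next
    return x, history


x_star, values = dca_quartic(0.5)
```

Started from x⁰ = 0.5, the iterates approach the local minimizer x = 1, and the recorded objective values are non-increasing, which is consistent with DCA being a descent method without line search.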

DCA is a descent algorithm without line search. The following properties are used in the next sections (for simplicity, we omit the dual part of these properties)

(1) If g(x^{k+1}) - h(x^{k+1}) = g(x^k) - h(x^k), then x^k is a critical point of (P_dc). In this case, DCA terminates at the k-th iteration.

(2) Let y^* be a local solution to the dual of (P_dc) and x^* ∈ ∂g^* (y^*). If h is differentiable at x^*, then x^* is a local solution to (P_dc).

(3) If the optimal value of problem (P_dc) is finite and the infinite sequence {x^k} is bounded, then every limit point x^* of the sequence {x^k} is a critical point of (P_dc).

(4) DCA converges linearly for general DC programs. In particular, for polyhedral DC programs, the sequence {x^k} contains finitely many elements, and the algorithm converges in a finite number of iterations to a critical point satisfying the necessary optimality condition.
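The finite termination in property (4) can be seen on a small polyhedral example. The sketch below is a toy illustration, not a problem from this paper: f(x) = ½(x − a)² − |x|, with the assumed components g(x) = ½(x − a)² and the polyhedral convex h(x) = |x| = max{x, −x}. A subgradient of h is the sign of x (taking +1 at x = 0 is an arbitrary choice), and the convex subproblem (39) has the closed-form solution x^{k+1} = a + y^k. The names `dca_polyhedral` and `objective` are hypothetical.

```python
def dca_polyhedral(x0, a=0.5, max_iter=50):
    """Simplified DCA for f(x) = 0.5 * (x - a)**2 - abs(x) (an assumed toy example).

    Returns the final iterate and the number of iterations performed.
    """
    x = float(x0)
    for k in range(1, max_iter + 1):
        # y^k in the subdifferential of h(x) = |x|; +1 at x = 0 is an arbitrary choice.
        y = 1.0 if x >= 0 else -1.0
        # The minimizer of 0.5 * (x - a)**2 - x * y is x = a + y.
        x_next = a + y
        if x_next == x:
            # The iterate repeats, so the objective is unchanged and x is a critical point.
            return x, k
        x = x_next
    return x, max_iter


def objective(x, a=0.5):
    """Value of f(x) = 0.5 * (x - a)**2 - |x|."""
    return 0.5 * (x - a) ** 2 - abs(x)


x_left, iters_left = dca_polyhedral(-2.0)
x_right, iters_right = dca_polyhedral(2.0)
```

With a = 0.5, both runs stop after two iterations: x⁰ = −2 ends at the critical point x = −0.5 with f = 0, while x⁰ = 2 ends at x = 1.5 with f = −1. Both points are local solutions, which illustrates that the critical point reached by DCA depends on the initial point and need not be a global solution.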

DCA is an efficient and robust algorithm for solving nonconvex problems, especially in the large-scale setting, and has been successfully applied to many nonconvex optimization problems.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 11471010, 11271367).

References

1. Huang, Zhu and Siew, Extreme learning machine: Theory and applications, Neurocomputing 70 (2006), 489–501.

2. Huang, Ding and Zhou, Optimization method based extreme learning machine for classification, Neurocomputing 74 (2010), 155–163.

3. Chorowski, Wang and Zurada J.M., Review and performance comparison of SVM- and ELM-based classifiers, Neurocomputing 128 (2014), 507–516.

4. Yang and Zhang, A sparse extreme learning machine framework by continuous optimization algorithms and its application in pattern recognition, Engineering Applications of Artificial Intelligence 53 (2016), 176–189.

5. Liu, Gao and Li, A comparative analysis of support vector machines and extreme learning machines, Neural Networks 33(9) (2012), 58–66.

6. Chang and Yang, Semisupervised Feature Analysis by Mining Correlations Among Multiple Tasks, IEEE Trans Neural Netw Learn Syst 28(10) (2017), 2294–2305.

7. Yang and Wang, A class of smooth semi-supervised SVM by difference of convex functions programming and algorithm, Knowledge-Based Systems 41 (2013), 1–7.

8. Chapelle, Sindhwani and Keerthi, Optimization Techniques for Semi-Supervised Support Vector Machines, Journal of Machine Learning Research 9 (2008), 203–233.

9. Ding, Zhu and Zhang, An overview on semi-supervised support vector machine, Neural Computing & Applications 28 (2015), 1–10.

10. Zhou, Liu B.Z., Xia S.X. and Liu, Semi-supervised extreme learning machine with manifold and pairwise constraints regularization, Neurocomputing 149 (2015), 180–186.

11. Liu J.F., Chen Y.Q. and Liu M.J., et al., SELM: Semi-supervised ELM with application in sparse calibrated location estimation, Neurocomputing 74(16) (2011), 2566–2572.

12. Huang, Song S.J., Jatinder N.D. and Wu, Semi-supervised and Unsupervised Extreme Learning Machines, IEEE Transactions on Cybernetics 44(2) (2014), 2405–2417.

13. Pham Dinh and Le Thi H.A., Convex analysis approach to DC programming: Theory, algorithms and applications, Acta Mathematica Vietnamica 22(1) (1997).

14. Yang, Ren, Wang and Dong, A robust regression framework with Laplace kernel-induced loss, Neural Computation 29(11) (2017), 3014–3039.

15. Le Thi H.A., Pham Dinh and Le H.M., et al., DC approximation approaches for sparse optimization, European Journal of Operational Research 244 (2014), 26–46.

16. Le Thi H.A., Nguyen Manh and Pham Dinh, A DC programming approach for finding Communities in networks, Neural Computation 26(12) (2014), 2827–2854.

17. Yang L.M. and Dong H.W., Support vector machine with truncated pinball loss and its application in pattern recognition, Chemometrics and Intelligent Laboratory Systems 177 (2018), 89–99.

18. Yang L.M. and Wang, A class of semi-supervised support vector machines by DC programming, Advances in Data Analysis and Classification 7(4) (2013), 417–433.

19. Vapnik V.N., The Nature of Statistical Learning Theory, IEEE Transactions on Neural Networks 10(5) (1995), 988–999.

20. Taylor J.S. and Sun, A review of optimization methodologies in support vector machines, Neurocomputing 74(17) (2011), 3609–3618.

21. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (2006), 861–874.

22. Liu, He and Shi, Extreme support vector machine classifier, Lecture Notes in Computer Science 5012 (2008), 222–233.

23. Wang Z.Q., Yu and Kang, et al., Breast tumor detection in digital mammography based on extreme learning machine, Neurocomputing 128(5) (2014), 175–184.