A smooth extreme learning machine framework

Abstract

Extreme learning machine (ELM) has demonstrated great potential in machine learning and data mining. Smoothing is an important technique in continuous optimization. In this work, we apply a smoothing technique to replace the hinge loss function by an accurate smooth approximation, which allows ELM to be solved directly as an unconstrained minimization problem. We term this reformulated problem smooth ELM (SELM). A Newton-Armijo algorithm is used to solve the proposed SELM, and the resulting algorithm converges globally and quadratically. The proposed SELM runs fast, has fewer decision variables, and deals with nonlinear problems more easily than the existing smooth support vector machine. Numerical experiments on two-class and multi-class datasets demonstrate that SELM is much faster than the existing ELM models. Compared with other popular support vector machine and ELM algorithms, the proposed SELM achieves better or similar generalization. These results demonstrate the effectiveness and speed of the algorithm.

Keywords

Extreme learning machine, neural networks, smoothing technique, Newton-Armijo algorithm

1 Introduction

Extreme learning machine (ELM) [1–5] is an important learning algorithm for single-hidden-layer feedforward neural networks (SLFNs) [6]. With its simple structure, low computational cost and good generalization, ELM has been used successfully in machine learning and data mining, especially in big data analysis [4, 8]. Moreover, ELM overcomes drawbacks of traditional neural networks such as local minima, imprecise learning rates and slow convergence. Its hidden nodes and input weights are randomly generated, and the output weights are expressed analytically if the activation functions in the hidden layer are infinitely differentiable. ELM has better generalization performance than other neural network algorithms such as gradient-based methods. Moreover, ELM can achieve generalization ability similar to that of the support vector machine (SVM) [4, 10] with much less training time. In addition, unlike SVM, ELM provides a unified learning platform for different applications, such as regression, binary classification and multi-class classification [3].

Smoothing techniques have been applied extensively to various optimization problems, such as continuous optimization and variational inequalities [11–13]. The main merit of smoothing methods is that they convert nonsmooth continuous optimization problems into smooth ones, so that higher-order derivative information can be used after smoothing. For example, the smooth support vector machine (SSVM) [14] has important mathematical properties such as strong convexity and infinite differentiability. Inspired by these investigations, in this work we slightly modify the traditional ELM model and apply a smoothing technique to the extreme learning machine for pattern classification. We begin with the binary classification case, which can be converted directly into an unconstrained optimization problem whose objective function is not twice differentiable. We then apply a smoothing technique so that the problem can be solved directly as a smooth unconstrained minimization problem. We term this reformulated problem smooth ELM (SELM); it has important mathematical properties such as strong convexity and infinite differentiability. Moreover, a fast Newton-Armijo method [15] is used to solve the SELM, and the resulting algorithm converges globally and quadratically to the unique solution of the SELM. Thanks to the SELM formulation, we only need to solve a sequence of systems of linear equations, which is fast, instead of solving a convex quadratic program.

We use a smooth approximation to the plus function (x)₊ = max {0, x}. Suppose x is a random variable with density function d (x) for which the expectation of ∣x∣ is finite, i.e. E_{d(x)} [∣x∣] < +∞. Set v (x, λ) = λd (λx) and $s (x, λ) = \int_{- \infty}^{x} v (t, λ) dt$ , where λ is a positive real number. Integrating once more gives a smooth approximation p (x, λ) of the plus function (x)₊: $p (x, λ) = \int_{- \infty}^{x} s (t, λ) dt$ . Moreover, the function p (x, λ) approximates (x)₊ with increasing accuracy as the parameter λ approaches infinity. This justifies the use of smoothing techniques in mathematical programming.

A typical example of such smooth functions is the neural networks smooth plus function [12]. Let $d (x) = \frac{∊^{- x}}{{(1 + ∊^{- x})}^{2}}$ . Integrating λd (λx) gives $s (x, λ) = \frac{1}{1 + ∊^{- λ x}}$ , where ∊ denotes the base of the natural logarithm. This is the sigmoid function used in neural networks, which explains the name. Integrating s (x, λ), we obtain $p (x, λ) = x + \frac{1}{λ} \ln (1 + ∊^{- λ x}) .$ (1)

Figure 1 illustrates the curves of the plus function (x) ₊ and the function p (x, λ) with different λ values. It can be seen intuitively that the approximation accuracy of the function p (x, λ) increases as λ does.

Fig.1

Curves of plus function and neural networks smooth plus function.
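As a quick numerical check of (1), the following Python sketch evaluates p (x, λ) and its gap to the plus function on a grid. The function names and the grid are illustrative choices, not part of the paper, and the overflow-safe rewriting of (1) is our own.

```python
import math

def plus(x):
    """Plus function (x)_+ = max{0, x}."""
    return max(0.0, x)

def smooth_plus(x, lam):
    """Neural networks smooth plus function p(x, lam) from Eq. (1).

    Eq. (1) reads x + ln(1 + exp(-lam * x)) / lam. It is evaluated here as
    max(x, 0) + ln(1 + exp(-lam * |x|)) / lam, which is algebraically the
    same but avoids overflow in exp() for large negative x.
    """
    return max(x, 0.0) + math.log1p(math.exp(-lam * abs(x))) / lam

# Largest gap p(x, lam) - (x)_+ on a grid over [-5, 5] for several lam values.
for lam in (1.0, 5.0, 10.0, 100.0):
    gap = max(smooth_plus(i / 10.0, lam) - plus(i / 10.0) for i in range(-50, 51))
    print(f"lambda = {lam:6.1f}   max gap = {gap:.6f}")
```

The gap p (x, λ) - (x)₊ equals ln (1 + ∊^{-λ∣x∣}) / λ, so it is largest at x = 0, where it takes the value ln 2 / λ.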

Newton methods [15, 16] are a class of algorithms that converge quickly. The Newton-Armijo algorithm for solving smooth programming problems is a damped Newton method. Unlike the original Newton method, it converges globally under certain conditions.

Consider an unconstrained optimization problem $\min_{x} f (x)$ where f (x) is twice differentiable. Let x_k denote the kth iterate, and let ∇f (x_k) and G_k denote the gradient and the Hessian matrix of f at x_k, respectively. Then the search direction is $- G_{k}^{- 1} \nabla f (x_{k})$ . The Armijo criterion is used to choose the step size. It rests on the fact that the Taylor expansion approximates a function well only in a small neighborhood, whereas too small a step size leads to slow convergence. The main idea of the Armijo criterion is to start from a large step size and reduce it proportionally until the objective function value decreases sufficiently. Set $x_{k + 1} = x_{k} - τ^{m_{k}} γ G_{k}^{- 1} \nabla f (x_{k})$ , where m_k is the smallest positive integer such that $f (x_{k + 1}) \leq f (x_{k}) - ρ τ^{m_{k}} γ {\nabla f (x_{k})}^{T} G_{k}^{- 1} \nabla f (x_{k})$ with parameters τ ∈ (0, 1), $ρ \in (0, \frac{1}{2})$ and γ > 0.
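The damped Newton iteration described above can be sketched for a scalar variable as follows. This is a minimal illustration under our own assumptions: the test function f (x) = sqrt(1 + x²) is not from the paper, the first trial step is γ itself (that is, the exponent of τ starts at 0), and a small floor on the step guards against rounding effects.

```python
import math

def newton_armijo_1d(f, grad, hess, x0, tau=0.5, rho=0.25, gamma=1.0,
                     tol=1e-8, max_iter=100):
    """Damped Newton iteration with Armijo backtracking for a scalar variable."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            break
        d = -g / hess(x)      # Newton direction -G^{-1} grad f
        step = gamma          # assumption: first trial step is gamma
        # Shrink the step by the factor tau until the sufficient-decrease test
        # f(x + step*d) <= f(x) + rho * step * g * d holds (note g * d < 0).
        while f(x + step * d) > f(x) + rho * step * g * d and step > 1e-12:
            step *= tau
        x += step * d
    return x

# Illustrative test function (not from the paper).
f = lambda x: math.sqrt(1.0 + x * x)
grad = lambda x: x / math.sqrt(1.0 + x * x)
hess = lambda x: (1.0 + x * x) ** -1.5

x_star = newton_armijo_1d(f, grad, hess, x0=3.0)
print(x_star)
```

For this test function the undamped Newton step maps x to -x³, so the pure Newton method diverges from any starting point with ∣x∣ > 1, whereas the backtracking version started at x = 3 reaches the minimizer x = 0.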

2 Background

2.1 ELM

We give here a brief description of ELM for binary classification problems; a more detailed description is available in [1, 5]. Consider a dataset containing N training samples (x_i, t_i), i = 1, 2, . . . , N, where x_i ∈ Rⁿ and t_i ∈ {1, - 1} denotes the category of the input sample x_i.

Suppose that the SLFN has L nodes in the hidden layer. Let w_i denote the weight vector connecting the input nodes with the ith hidden node, and b_i the bias term of the ith hidden node. A desired SLFN with L hidden nodes approximates these N samples with zero error, which means that the desired output for the jth pattern is $\sum_{i = 1}^{L} β_{i} g (w_{i}^{T} x_{j} + b_{i}) = t_{j}, j = 1, \dots, N$ (2) where β_i denotes the weight connecting the ith hidden node with the output node. Let $H = \begin{bmatrix} g (w_{1}^{T} x_{1} + b_{1}) & \cdots & g (w_{L}^{T} x_{1} + b_{L}) \\ \vdots & \ddots & \vdots \\ g (w_{1}^{T} x_{N} + b_{1}) & \cdots & g (w_{L}^{T} x_{N} + b_{L}) \end{bmatrix}_{N \times L},$

T = (t₁, t₂, ⋯ , t_N)^T and β = (β₁, β₂, ⋯ , β_L)^T. With this notation, the system (2) can be written compactly as Hβ = T.

Huang et al. point out that the input weights w_i and hidden layer biases b_i (i = 1, 2, . . . , L) in the SLFN need not be tuned during training and may be assigned randomly [1, 2]. Based on this scheme, Huang et al. propose a simple SLFN algorithm, called ELM, whose aim is to find a least-squares solution of the linear system (2). That is, determining an SLFN can be posed as finding the solution of the least-squares problem $\min_{β} \| H β - T \|_{2}^{2}$ (3) which is an unconstrained quadratic programming problem. When H^TH is positive definite, its optimal solution β is given by

$β = H^{†} T, \quad H^{†} = (H^{T} H)^{- 1} H^{T}$ (4) where H^† is the Moore-Penrose generalized inverse of the matrix H. After determining the output weights, we obtain the classification decision function: $u (x) = \mathrm{sgn} (\sum_{i = 1}^{L} β_{i} g (w_{i}^{T} x + b_{i})) .$
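The least-squares training in (3) and (4) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the toy data, the uniform sampling range [-1, 1] for w_i and b_i, and all function names are our own, and the normal equations H^T H β = H^T T are solved by a small Gaussian elimination routine instead of forming the inverse explicitly.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_output(x, W, b):
    """h(x): outputs of the L hidden nodes for one input vector x."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bi) for w, bi in zip(W, b)]

def solve(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [yi] for row, yi in zip(A, y)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= fac * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def train_elm(X, T, L, rng):
    """Random hidden layer, then beta from the normal equations of problem (3)."""
    n = len(X[0])
    W = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(L)]  # assumed range
    b = [rng.uniform(-1, 1) for _ in range(L)]                      # assumed range
    H = [hidden_output(x, W, b) for x in X]
    HtH = [[sum(h[i] * h[j] for h in H) for j in range(L)] for i in range(L)]
    HtT = [sum(h[i] * t for h, t in zip(H, T)) for i in range(L)]
    return W, b, solve(HtH, HtT), H

def predict(x, W, b, beta):
    """Decision function u(x) = sgn(h(x)^T beta)."""
    score = sum(bi * hi for bi, hi in zip(beta, hidden_output(x, W, b)))
    return 1 if score >= 0 else -1

# Toy two-class data (illustrative only): label is the sign of x1 + x2.
rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(40)]
T = [1 if x[0] + x[1] > 0 else -1 for x in X]
W, b, beta, H = train_elm(X, T, L=3, rng=rng)
accuracy = sum(predict(x, W, b, beta) == t for x, t in zip(X, T)) / len(X)
print("training accuracy:", accuracy)
```

Only the output weights β are fitted; the hidden layer stays at its random initial values, which is what makes the training a single linear solve.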

2.2 Optimization method based ELM

Recently, Huang et al. [2] proposed a new ELM framework based on optimization theory (called OPTELM), in which the hinge loss function is introduced into ELM. According to Bartlett’s theory [17], the smaller the norms of the weights are, the better the generalization performance of the network tends to be. Moreover,

SVM’s maximal separating margin property and the ELM’s minimal norm of output weights property are actually consistent;

just as SVM does, ELM also minimizes the training error as well as maximizing the separating margin.

Theoretically, all the training data in the ELM feature space are linearly separable by a hyperplane passing through the origin with probability one, but in practical applications the training data cannot be strictly separated in the ELM feature space. Thus Huang et al. propose an optimization theory-based ELM framework (called OPTELM), where the hinge loss function is applied in ELM training:

$\begin{matrix} \min_{(β, ξ)} & \frac{1}{2} \| β \|_{2}^{2} + C \sum_{i = 1}^{N} ξ_{i} \\ \text{s.t.} & t_{i} h (x_{i})^{T} β \geq 1 - ξ_{i}, \; ξ_{i} \geq 0, \; i = 1, 2, \dots, N \end{matrix}$ (5) where $h (x) = (g (w_{1}^{T} x + b_{1}), g (w_{2}^{T} x + b_{2}), \dots, g (w_{L}^{T} x + b_{L}))^{T}$ is the ELM nonlinear feature mapping, i.e. the output vector of the hidden layer with respect to the input x, ξ = (ξ₁, ξ₂, . . . , ξ_N)^T is the vector of slack variables, and C > 0 is a penalty parameter. This is a quadratic programming problem with a global solution. The separating hyperplane for OPTELM has the form h (x)^Tβ = 0, which passes through the origin in the ELM feature space.

3 Smooth ELM framework

In this section, we begin with the binary classification case that can be converted to an unconstrained optimization problem directly.

3.1 Smooth ELM for binary classification

We slightly modify the traditional ELM with hinge loss function [18]. We replace the l₁-norm of the slack variable ξ with its squared l₂-norm weighted by $\frac{C}{2}$ , which guarantees the strict convexity of the objective function. This leads to the following optimization problem: $\begin{matrix} \min_{(β, ξ)} & \frac{1}{2} \| β \|_{2}^{2} + \frac{C}{2} \sum_{i = 1}^{N} ξ_{i}^{2} \\ \text{s.t.} & t_{i} h (x_{i})^{T} β \geq 1 - ξ_{i}, \; ξ_{i} \geq 0, \; i = 1, 2, \dots, N \end{matrix}$ (6)

Let D be an N × N diagonal matrix with t_i (i = 1, 2, . . . , N) along its diagonal. Then problem (6) can be written in the following form:

$\begin{matrix} \min_{(β, ξ)} & \frac{1}{2} \| β \|_{2}^{2} + \frac{C}{2} \| ξ \|_{2}^{2} \\ \text{s.t.} & DH β + ξ \geq e, \; ξ \geq 0 \end{matrix}$ (7)

where e denotes a column vector of ones of appropriate dimension. This is also a quadratic programming problem with a global solution.

Applying the Karush-Kuhn-Tucker Conditions (KKT conditions) of problem (7), we can come to the following conclusion.

Theorem 1. If (β^*, ξ^*) is the optimal solution of problem (7), then the ξ^* can be denoted by $ξ^{*} = {(e - DH β^{*})}_{+} .$ (8)

Proof. Since the ith component of vector DHβ^* is t_ih (x_i) ^Tβ^*, the equation (8) is equivalent to $ξ_{i}^{*} = {(1 - t_{i} h (x_{i})^{T} β^{*})}_{+}$ , (i = 1, 2, . . . , N). And we will prove this identity in the following text.

The Lagrange function of problem (7) is

$L (β, ξ, α, μ) = \frac{1}{2} \| β \|_{2}^{2} + \frac{C}{2} \| ξ \|_{2}^{2} - α^{T} (DH β + ξ - e) - μ^{T} ξ$ (9)

where α ≥ 0 and μ ≥ 0 are the Lagrange multipliers. Let (β^*, ξ^*) be the optimal solution of (7). Then there exists (α^*, μ^*) such that the following KKT conditions hold:

$\nabla_{ξ} L (β^{*}, ξ^{*}, α^{*}, μ^{*}) = C ξ^{*} - α^{*} - μ^{*} = 0$ (10)

$\nabla_{β} L (β^{*}, ξ^{*}, α^{*}, μ^{*}) = β^{*} - H^{T} D^{T} α^{*} = 0$ (11)

$DH β^{*} + ξ^{*} - e \geq 0$ (12)

$ξ^{*} \geq 0, α^{*} \geq 0, μ^{*} \geq 0$ (13)

$α_{i}^{*} (t_{i} h (x_{i})^{T} β^{*} + ξ_{i}^{*} - 1) = 0, i = 1, 2, \dots, N$ (14)

$μ_{i}^{*} ξ_{i}^{*} = 0, i = 1, 2, \dots, N$ (15)

where $α^{*} = (α_{1}^{*}, α_{2}^{*}, \dots, α_{N}^{*})^{T}$ and $μ^{*} = (μ_{1}^{*}, μ_{2}^{*}, \dots, μ_{N}^{*})^{T}$ .

Consider two cases. On one hand, for i which satisfies 1 - t_ih (x_i) ^Tβ^* > 0, 1 ≤ i ≤ N, we can get $ξ_{i}^{*} \geq 1 - t_{i} h (x_{i})^{T} β^{*} > 0$ from (12). Thus from (15) we have $μ_{i}^{*} = 0$ . Substitute $μ_{i}^{*} = 0$ into (10), then $α_{i}^{*} = C ξ_{i}^{*} > 0$ . According to (14), the equation $t_{i} h (x_{i})^{T} β^{*} + ξ_{i}^{*} - 1 = 0$ holds. Therefore, we obtain $ξ_{i}^{*} = (1 - t_{i} h (x_{i})^{T} β^{*})_{+}$ .

On the other hand, for i which satisfies 1 - t_ih (x_i) ^Tβ^* ≤ 0, 1 ≤ i ≤ N, it can be proved that $ξ_{i}^{*} = 0$ . Suppose $ξ_{i}^{*} > 0$ , then $t_{i} h (x_{i})^{T} β^{*} + ξ_{i}^{*} - 1 = ξ_{i}^{*} - (1 - t_{i} h (x_{i})^{T} β^{*}) > 0$ . So from (14) we get $α_{i}^{*} = 0$ . Substituting it into (10) we obtain $μ_{i}^{*} = C ξ_{i}^{*}$ , which leads to $C {(ξ_{i}^{*})}^{2} = 0$ according to (15). Apparently $ξ_{i}^{*} = 0$ because C > 0. This contradicts the above assumption.

Hence we have $ξ_{i}^{*} = (1 - t_{i} h (x_{i})^{T} β^{*})_{+}$ , i = 1, 2, . . . , N. So ξ^* = (e - DHβ^*) ₊ is proved. □

Thus, replacing ξ by (e - DHβ) ₊ in (7), we can convert (7) into an unconstrained optimization problem: $min_{β} \frac{1}{2} {| | β | |}_{2}^{2} + \frac{C}{2} {| | {(e - DH β)}_{+} | |}_{2}^{2} .$ (16)

The objective function above is strictly convex, which guarantees that (16) has a unique solution. However, it is not twice differentiable, so Newton methods cannot be applied directly.

Using neural networks smooth plus function (1), we get a smooth approximation of formulation (16), called SELM $min_{β} r_{λ} (β) = \frac{1}{2} {| | β | |}_{2}^{2} + \frac{C}{2} {| | p (e - DH β, λ) | |}_{2}^{2} .$ (17)

The following theorem shows the relationship between the solutions of the problem (16) and its smooth approximation SELM (17). Moreover, it can be proved that the solution of problem (16) is obtained by solving its smooth approximation SELM (17) with λ approaching infinity.

Theorem 2. Let $A \in R^{m \times l}$ and $q \in R^{m}$ . Consider the following optimization problems: $\min_{x} \frac{1}{2} \| x \|_{2}^{2} + \frac{1}{2} \| (Ax - q)_{+} \|_{2}^{2}$ (18) and $\min_{x} \frac{1}{2} \| x \|_{2}^{2} + \frac{1}{2} \| p (Ax - q, λ) \|_{2}^{2}$ (19) with λ > 0. Then the following two conclusions hold:

There exists a unique solution $\bar{x}$ of problem (18) and a unique solution ${\bar{x}}_{λ}$ of problem (19).

For any λ, $| | {\bar{x}}_{λ} - \bar{x} | |$ satisfies the inequality ${| | {\bar{x}}_{λ} - \bar{x} | |}_{2}^{2} \leq \frac{m}{2} ({(\frac{ln 2}{λ})}^{2} + 2 η \frac{ln 2}{λ})$ (20) where η is defined by $η = | | A \bar{x} - q | |_{\infty}$ . Thus, ${\bar{x}}_{λ}$ converges to $\bar{x}$ as λ goes to infinity.

The proof of Theorem 2 is similar to that of Theorem 2.3 in [14].

The above theorem shows that ${\bar{x}}_{λ}$ approximates $\bar{x}$ when λ is sufficiently large. Since the objective function in problem (16) is not twice differentiable and is thus difficult to optimize, it is a reasonable choice to solve (17) instead.
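To get a feel for how the right-hand side of (20) shrinks as λ grows, it can be evaluated directly. The values m = 100 and η = 1 below are hypothetical and chosen only for illustration; the function name is ours.

```python
import math

def selm_gap_bound(m, eta, lam):
    """Right-hand side of inequality (20): (m/2) * ((ln 2 / lam)^2 + 2 * eta * ln 2 / lam)."""
    a = math.log(2.0) / lam
    return 0.5 * m * (a * a + 2.0 * eta * a)

# Hypothetical m and eta; the bound is on the squared distance between the two solutions.
for lam in (1.0, 10.0, 100.0, 1000.0):
    print(f"lambda = {lam:7.1f}   bound = {selm_gap_bound(m=100, eta=1.0, lam=lam):.6f}")
```

For large λ the term linear in ln 2 / λ dominates, so the bound on the squared distance decays roughly like 1 / λ.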

3.2 A Newton-Armijo algorithm for smooth ELM

As presented in Section 1, the Newton-Armijo algorithm converges quickly and globally. Since the objective function of problem (17) is twice differentiable, a Newton-Armijo algorithm for solving SELM (17) is proposed as follows.

Algorithm (Newton-Armijo Algorithm for SELM)

Step 1. Initialize C, λ and the tolerances ɛ₁ and ɛ₂. Let k = 0 and start with any $β_{λ}^{0} \in R^{L}$ .

Step 2. Determine the direction d^k ∈ R^L from the linear system

$\nabla^{2} r_{λ} (β_{λ}^{k}) d^{k} = - \nabla r_{λ} (β_{λ}^{k})$ (21) where the gradient is $\nabla r_{λ} (β_{λ}^{k}) = β_{λ}^{k} - C H^{T} D^{T} Z$ with Z = (z₁, . . . , z_N)^T and $z_{i} = \frac{θ_{i}^{k} + \frac{1}{λ} \ln (1 + ∊^{- λ θ_{i}^{k}})}{1 + ∊^{- λ θ_{i}^{k}}}, i = 1, \dots, N$ (22) and the Hessian is $\nabla^{2} r_{λ} (β_{λ}^{k}) = I + C H^{T} D^{T} BDH$ (23) Here I is the L × L identity matrix and B is an N × N diagonal matrix with Bd = (Bd₁, . . . , Bd_N)^T along its diagonal, where

${Bd}_{i} = \frac{1 + λ ∊^{- λ θ_{i}^{k}} [θ_{i}^{k} + \frac{1}{λ} \ln (1 + ∊^{- λ θ_{i}^{k}})]}{(1 + ∊^{- λ θ_{i}^{k}})^{2}}, i = 1, \dots, N$ (24) with $θ_{i}^{k} = 1 - t_{i} h (x_{i})^{T} β_{λ}^{k}$ , i = 1, . . . , N.

Step 3. Choose a step size ν^k ∈ R and set $β_{λ}^{k + 1} = β_{λ}^{k} + ν^{k} d^{k}$ , where ν^k is the largest value in $\{1, \frac{1}{2}, \frac{1}{4}, \dots\}$ that satisfies $r_{λ} (β_{λ}^{k}) - r_{λ} (β_{λ}^{k + 1}) \geq - ρ ν^{k} {\nabla r_{λ} (β_{λ}^{k})}^{T} d^{k}$ with $ρ \in (0, \frac{1}{2})$ .

Step 4. Stop if $\| β_{λ}^{k + 1} - β_{λ}^{k} \|_{2} < ɛ_{1}$ and $| r_{λ} (β_{λ}^{k + 1}) - r_{λ} (β_{λ}^{k}) | < ɛ_{2}$ . Otherwise let k = k + 1 and go to Step 2.

The above algorithm solves the optimization problem through a sequence of linear systems, and thus inherits the most significant feature of ELM, namely its speed. More precisely, we have the following theorem.

Theorem 3. Let $\{β_{λ}^{k}\}$ be a sequence generated by the Newton-Armijo Algorithm for SELM, and let ${\bar{β}}_{λ}$ be the unique solution of optimization problem (17).

The sequence $\{β_{λ}^{k}\}$ converges to ${\bar{β}}_{λ}$ for any initial point $β_{λ}^{0}$ .

For any $β_{λ}^{0}$ , there exists a positive integer K such that the step size ν^k of the Newton-Armijo Algorithm for SELM equals 1 for all k ≥ K, and thus $\{β_{λ}^{k}\}$ converges quadratically.

The proof of the above theorem is similar to that of Theorem 3.2 in [14].

Theorem 3 shows that Newton-Armijo Algorithm for SELM (17) converges globally. And it takes a pure Newton step after a finite number of iterations, which guarantees that it converges quadratically.
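The Newton-Armijo iteration for SELM can be sketched in plain Python as follows. This is a minimal illustration, not the authors' MATLAB code: the six-sample matrix H and labels t are made up, the start point β = 0 and the floor on the step size are our own choices, and the gradient is taken as β - C H^T D^T Z, which is what differentiating (17) gives. The diagonal entries of B are written through the sigmoid s, which is algebraically equal to (24).

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def smooth_plus(x, lam):
    """p(x, lam) from Eq. (1), in an overflow-safe form."""
    return max(x, 0.0) + math.log1p(math.exp(-lam * abs(x))) / lam

def solve(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [yi] for row, yi in zip(A, y)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= fac * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def objective(beta, H, t, C, lam):
    """r_lambda(beta) from Eq. (17)."""
    total = 0.5 * sum(b * b for b in beta)
    for h, ti in zip(H, t):
        theta = 1.0 - ti * sum(hj * bj for hj, bj in zip(h, beta))
        total += 0.5 * C * smooth_plus(theta, lam) ** 2
    return total

def grad_hess(beta, H, t, C, lam):
    """Gradient beta - C H^T D^T Z and Hessian I + C H^T D^T B D H."""
    L = len(beta)
    g = list(beta)
    G = [[1.0 if i == j else 0.0 for j in range(L)] for i in range(L)]
    for h, ti in zip(H, t):
        theta = 1.0 - ti * sum(hj * bj for hj, bj in zip(h, beta))
        p, s = smooth_plus(theta, lam), sigmoid(lam * theta)
        z = p * s                              # Eq. (22)
        bd = s * s + lam * p * s * (1.0 - s)   # Eq. (24) rewritten via s
        for i in range(L):
            g[i] -= C * ti * z * h[i]
            for j in range(L):
                G[i][j] += C * bd * h[i] * h[j]  # t_i^2 = 1, so D^T B D = B
    return g, G

def selm_newton_armijo(H, t, C=1.0, lam=10.0, rho=0.25, eps1=1e-3, eps2=1e-3, max_iter=100):
    beta = [0.0] * len(H[0])                   # assumed start point
    r_old = objective(beta, H, t, C, lam)
    for _ in range(max_iter):
        g, G = grad_hess(beta, H, t, C, lam)
        d = solve(G, [-gi for gi in g])        # Step 2: Newton direction
        slope = sum(gi * di for gi, di in zip(g, d))
        nu = 1.0                               # Step 3: halve until Armijo holds
        while True:
            new = [b + nu * di for b, di in zip(beta, d)]
            r_new = objective(new, H, t, C, lam)
            if r_old - r_new >= -rho * nu * slope or nu < 1e-10:
                break
            nu *= 0.5
        step = math.sqrt(sum((a - b) ** 2 for a, b in zip(new, beta)))
        done = step < eps1 and abs(r_old - r_new) < eps2   # Step 4
        beta, r_old = new, r_new
        if done:
            break
    return beta

# Made-up hidden-layer outputs for six samples and L = 2 hidden nodes.
H = [[0.9, 0.2], [0.8, 0.3], [0.7, 0.1], [0.2, 0.9], [0.3, 0.8], [0.1, 0.7]]
t = [1, 1, 1, -1, -1, -1]
beta = selm_newton_armijo(H, t, C=10.0)
print(beta)
```

Each iteration costs one L × L linear solve plus a few objective evaluations, which reflects why the running time depends on the number of hidden nodes L rather than on the size of a quadratic program.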

3.3 Smooth ELM for multiclass classification

Following the notations of Section 2.2, we solve multiclass smooth ELM problem using the popular one-against-all (OAA) method [16, 18].
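A one-against-all wrapper around any binary learner can be sketched as follows. The paper does not spell out the decision rule, so the arg-max over per-class scores used here is an assumption (it is the usual OAA convention), and `centroid_scorer` is a hypothetical stand-in for a binary SELM trainer that returns a scoring function.

```python
def train_one_against_all(X, labels, train_binary):
    """Train one binary scorer per class: that class is +1, all others are -1."""
    scorers = {}
    for c in sorted(set(labels)):
        t = [1 if y == c else -1 for y in labels]
        scorers[c] = train_binary(X, t)
    return scorers

def predict_one_against_all(x, scorers):
    """Assumed decision rule: pick the class whose scorer gives the largest value."""
    return max(scorers, key=lambda c: scorers[c](x))

def centroid_scorer(X, t):
    """Hypothetical stand-in binary learner: negative squared distance to the +1 centroid."""
    pos = [x for x, ti in zip(X, t) if ti == 1]
    centre = [sum(col) / len(pos) for col in zip(*pos)]
    return lambda x: -sum((a - b) ** 2 for a, b in zip(x, centre))

# Tiny illustrative three-class dataset.
X = [[0.0], [0.2], [5.0], [5.2], [10.0], [10.1]]
labels = ["a", "a", "b", "b", "c", "c"]
scorers = train_one_against_all(X, labels, centroid_scorer)
print(predict_one_against_all([4.9], scorers))
```

With SELM as the binary learner, `train_binary` would return the function x ↦ h (x)^Tβ for the β obtained on the relabelled problem.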

Remarks

Unlike the smooth SVM (SSVM), which is difficult to optimize for nonlinear problems because of its unknown implicit feature mapping and its kernel parameters, the kernel function of the proposed SELM (17) has the explicit form K_ELM (x_i, x_j) = h (x_i)^Th (x_j), and its network parameters are randomly generated without tuning.

The bias b is not required in SELM since the separating hyperplane β^Th (x) = 0 passes through the origin in the ELM feature space, while SSVM needs a threshold to determine the hyperplane. Thus the proposed SELM has fewer decision variables than the existing SSVM. Compared with OPTELM (5), the proposed SELM runs faster and is more convenient to apply.

The proposed SELM is posed as a strictly convex program with a unique solution. A Newton-Armijo algorithm is used to solve the SELM, and the resulting algorithm converges globally and quadratically.

Similar to the conventional ELM and OPTELM, the proposed SELM has universal approximation capability and it can achieve low approximation error on training set.

4 Experiments

The performance of the Newton-Armijo algorithm for solving the proposed smooth ELM is analyzed by numerical experiments on various datasets from the UCI Machine Learning Repository [19]. Both two-class and multi-class classification are simulated. All the simulations are carried out in the MATLAB 7.10 environment running on a Core i3 2.3 GHz CPU.

We perform ten-fold cross validation on all considered datasets. In other words, the dataset is split randomly into ten subsets, one of which is reserved as the test set while the remaining nine are used for training. This process is repeated ten times, and the average of the ten test results is used as the performance measure. We remove the labels of the test samples and employ these methods to reclassify the test set.
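The ten-fold protocol can be sketched as an index generator. The function name, the seed, and the round-robin assignment of shuffled indices to folds are illustrative choices of ours, not details taken from the paper.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # random split into k subsets
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]                       # one subset is held out for testing
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in k_fold_indices(25, k=10):
    print(len(train), len(test))
```

Every sample appears in exactly one test fold, so averaging the ten test scores uses each sample once for evaluation.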

We are interested in classification accuracy and the running time of the proposed smooth ELM framework. The following different criteria are used in this investigation:

Accuracy (ACC) is the identification rate of all samples from the two classes; the F₁-measure and the Matthews correlation coefficient (MCC) are two comprehensive measures of the quality of a classification model. These values can be obtained from the decision function and are defined as [20] $ACC = \frac{TP + TN}{TP + FN + TN + FP},$ (25) $F_{1} = \frac{2 \times TP}{2 \times TP + FP + FN}$ (26)

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) (TP + FN) (TN + FP) (TN + FN)}}$ (27) where TP and TN denote true positives and true negatives, and FN and FP denote false negatives and false positives, respectively. The higher these values are, the better the model is.

Time: the total training and test time.
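The three quality measures (25) to (27) can be computed from predicted and true labels as in the sketch below. The function name is ours, and returning 0 when a denominator vanishes is an assumed convention that the paper does not specify.

```python
import math

def classification_metrics(y_true, y_pred):
    """Return (ACC, F1, MCC) for labels in {1, -1}, following Eqs. (25)-(27)."""
    tp = sum(1 for a, b in zip(y_true, y_pred) if a == 1 and b == 1)
    tn = sum(1 for a, b in zip(y_true, y_pred) if a == -1 and b == -1)
    fp = sum(1 for a, b in zip(y_true, y_pred) if a == -1 and b == 1)
    fn = sum(1 for a, b in zip(y_true, y_pred) if a == 1 and b == -1)
    acc = (tp + tn) / (tp + fn + tn + fp)
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0          # assumed convention
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0   # assumed convention
    return acc, f1, mcc

# Illustrative labels: TP = 2, FN = 1, TN = 2, FP = 1.
acc, f1, mcc = classification_metrics([1, 1, 1, -1, -1, -1], [1, 1, -1, -1, -1, 1])
print(acc, f1, mcc)
```

For this example ACC = 4/6, F₁ = 4/6 and MCC = 3/9.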

Parameter selection. The performance of the proposed smooth ELM framework depends closely on the choice of its parameters [21]. In this work, all parameters, including the penalty parameter C and the number of hidden nodes, are chosen by grid search to maximize the classification accuracy (ACC) on each dataset. Once selected, the optimal parameters are used to produce the reported experimental results. The parameters are therefore optimized beforehand as detailed below.

The penalty parameter C of the proposed SELM (17) is adjusted from the set {0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 1000, 10000}. And we set λ = 10, ρ = 0.25 and ɛ₁ = ɛ₂ = 10^-3 in the experiments.

Theoretically, the number of hidden nodes L plays an important role in training ELM networks. With C = 10000, we provide a map (see Fig. 2) to illustrate the accuracy of the proposed SELM as L varies on seven datasets, where the x-axis denotes the values of L and the y-axis denotes the ACC.

Fig.2

Testing accuracy of smooth ELM on seven datasets as L increases.

We observe from Fig. 2 that the accuracy of SELM increases and then fluctuates or decreases slightly with L increasing. This may help to choose the number of nodes L of the hidden layer of SELM and OPTELM.

In this work, the optimal value L is tuned from the set {5, 10, 20, 50, 100, 200, 300, 400, 500} by ten-fold cross validation to maximize the accuracy of SELM and OPTELM respectively.
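The grid search over C and L can be sketched as follows. The two grids are the sets listed above; `score` stands for a user-supplied function returning the cross-validated ACC for a given (C, L), and the `toy_score` used here is a made-up placeholder so that the sketch runs on its own.

```python
import math
from itertools import product

C_GRID = [0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 1000, 10000]
L_GRID = [5, 10, 20, 50, 100, 200, 300, 400, 500]

def grid_search(score, c_grid=C_GRID, l_grid=L_GRID):
    """Return the (C, L) pair with the highest score, together with that score."""
    best = max(product(c_grid, l_grid), key=lambda cl: score(*cl))
    return best, score(*best)

# Placeholder score (not real accuracy): peaks at C = 10, L = 100.
def toy_score(C, L):
    return -abs(math.log10(C) - 1.0) - abs(L - 100) / 100.0

best, best_score = grid_search(toy_score)
print(best, best_score)
```

In the experiments the role of `toy_score` would be played by the ten-fold cross-validated ACC of SELM or OPTELM trained with the given C and L.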

We select sigmoid function $g (x) = \frac{1}{1 + ∊^{- x}}$ (28) as the activation function in hidden layer.

We choose the smooth SVM (SSVM) [14] and OPTELM [2] as the baseline methods.

4.1 Experiments on two-class datasets

In this section, we implement the proposed smooth ELM (SELM) (17) in binary classification problems. The numerical experiments are carried out on seven UCI datasets.

4.1.1 Comparisons of SELM with OPTELM

We first compare the proposed SELM (17) with OPTELM (5). The optimal parameter C and number of hidden nodes L for the two algorithms are reported in Table 1. With these parameters, the SELM and OPTELM models are each run for 50 trials, with ten-fold cross validation used in each trial. The average running times of the two algorithms are also presented in Table 1.

Table 1
Comparison between SELM and OPTELM with respect to Time (s)

Data	Smooth ELM			OPTELM
	C	L	Time	C	L	Time
Australian	10000	400	6.59	1000	300	2401.41
Cancer	20	400	4.26	50	50	303.34
Heart	1000	20	0.11	1000	50	57.15
Ionosphere	10000	500	6.52	0.5	400	2413.04
Pima	10000	100	1.38	1000	10	133.69
Sonar	10000	10	0.04	1000	10	2.95
WPBC	1000	100	2.03	20	200	1216.87

As Table 1 shows, the proposed algorithm has a clear advantage in running time. On every dataset, OPTELM takes more than 50 times as long as SELM. For Heart and WPBC, the ratio exceeds 500.
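The ratios behind this comparison can be recomputed from the Time columns of Table 1. The dictionary below simply copies those values; the variable names are ours.

```python
# (SELM time, OPTELM time) in seconds, copied from Table 1.
table1_time = {
    "Australian": (6.59, 2401.41),
    "Cancer": (4.26, 303.34),
    "Heart": (0.11, 57.15),
    "Ionosphere": (6.52, 2413.04),
    "Pima": (1.38, 133.69),
    "Sonar": (0.04, 2.95),
    "WPBC": (2.03, 1216.87),
}

# Ratio of OPTELM running time to SELM running time per dataset.
speedup = {name: optelm / selm for name, (selm, optelm) in table1_time.items()}
for name, ratio in speedup.items():
    print(f"{name:<11s} {ratio:7.1f}x")
```

Because the SELM times for Heart and Sonar are reported to only two decimals (0.11 s and 0.04 s), the ratios for these two datasets are sensitive to rounding.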

Then with the optimum parameters C and L shown in Table 1, we compare the proposed SELM with OPTELM in terms of generalization. The average results are reported in Table 2.

Table 2

Comparison between Smooth ELM and OPTELM in terms of ACC (%), MCC (%) and F₁ (%)

Data	Smooth ELM			OPTELM
	ACC	MCC	F₁	ACC	MCC	F₁
Australian (690 × 14)	82.68	64.96	82.08	78.89	56.91	76.95
Cancer (699 × 9)	96.26	91.58	95.83	96.55	92.58	96.32
Heart (270 × 13)	84.07	67.30	84.34	82.59	64.31	82.95
Ionosphere (350 × 34)	87.43	69.91	85.64	90.29	76.44	88.59
Pima (768 × 8)	76.86	46.91	75.96	76.34	44.64	75.31
Sonar (208 × 60)	66.18	31.60	68.21	58.45	20.76	68.74
WPBC (569 × 30)	95.42	89.90	94.81	95.42	89.43	94.46

The results in Table 2 show that SELM improves generalization compared with OPTELM in most cases. On only two datasets (Cancer and Ionosphere) is the ACC of the proposed SELM lower, while on the rest it is the same or higher. We conclude that the generalization of the proposed SELM is on the same level as that of OPTELM, and slightly better on most datasets.

In addition, the optimal parameters for different algorithms are distinct. To further evaluate the proposed SELM in terms of time, we also compare the running time when the parameters C and L are the same in SELM and OPTELM respectively. The average results on four different cases are shown in Table 3.

Table 3

Comparison between SELM and OPTELM with respect to Time (s)

Data	Parameter	SELM	OPTELM
Australian	C = 0.01, L = 10	0.85	85.24
	C = 1, L = 50	0.58	280.52
	C = 100, L = 100	0.95	506.83
	C = 1000, L = 400	6.01	3777.59
Cancer	C = 0.01, L = 10	0.18	174.23
	C = 1, L = 50	2.80	256.40
	C = 100, L = 100	1.04	560.84
	C = 1000, L = 400	4.57	3813.65
Heart	C = 0.01, L = 10	0.08	22.53
	C = 1, L = 50	0.22	37.89
	C = 100, L = 100	0.50	42.31
	C = 1000, L = 400	2.28	2036.34
Ionosphere	C = 0.01, L = 10	0.39	31.81
	C = 1, L = 50	0.46	85.19
	C = 100, L = 100	2.18	200.99
	C = 1000, L = 400	3.76	2266.62
Pima	C = 0.01, L = 10	0.24	145.01
	C = 1, L = 50	0.58	367.44
	C = 100, L = 100	1.12	678.41
	C = 1000, L = 400	5.58	4829.91
Sonar	C = 0.01, L = 10	0.03	2.78
	C = 1, L = 50	0.10	6.22
	C = 100, L = 100	0.19	11.96
	C = 1000, L = 400	1.13	443.77
WPBC	C = 0.01, L = 10	1.33	123.66
	C = 1, L = 50	0.70	237.47
	C = 100, L = 100	1.52	529.52
	C = 1000, L = 400	3.82	4165.01

For the four combinations of the parameters C and L, Table 3 shows that the running time of SELM is much shorter than that of OPTELM when both use the same C and L. The same holds for other combinations of parameters.

4.1.2 Comparisons of SELM with other methods

Then we compare the SELM with linear SSVM (LSSVM) [10] and nonlinear SSVM (SSVM-kernel) [14] in terms of ACC. The average results with optimal parameters are summarized in Table 4.

Table 4
Comparison among Smooth ELM, linear SSVM (LSSVM) and nonlinear SSVM in terms of ACC (%)

Data	Smooth ELM	LSSVM	SSVM-kernel
Australian	82.68	84.93	66.81
Cancer	96.26	95.42	65.52
Heart	84.07	84.07	67.04
Ionosphere	87.43	84.57	64.29
Pima	76.86	74.74	74.48
Sonar	66.18	72.12	53.37
WPBC	95.42	94.37	92.62

Table 4 illustrates that the proposed SELM achieves better performance than nonlinear SSVM in all considered data sets. Moreover, SELM yields accuracy comparable to linear SSVM.

To further demonstrate the effectiveness of the proposed algorithm, we compare it with other popular algorithms: the traditional ELM [1], the standard SVM [10] and the l₁-norm SVM [22]. The results on two datasets are shown in Table 5, where the results of the l₁-norm SVM and SVM are taken from [14].

Table 5

Comparison of ACC (%) of SELM with other popular algorithms

Data set	SELM	ELM	l₁-norm SVM	SVM
Ionosphere	87.43	85.14	86.10	85.75
Pima	76.86	66.41	74.47	77.07

Table 5 shows that the proposed SELM achieves the highest test accuracy of the four algorithms on Ionosphere and the second highest on Pima, where it is within 0.21 percentage points of SVM (76.86 versus 77.07). SELM clearly outperforms ELM on both datasets and is slightly superior to the l₁-norm SVM. These results indicate the effectiveness of the proposed algorithm.

4.2 Experiments on multi-class datasets

We also compare the proposed SELM with OPTELM on multi-class datasets. We use the popular one-against-all (OAA) method for multi-class problems. The information on the datasets and the optimal parameters are shown in Table 6, and the average results with optimal parameters are listed in Table 7.

Table 6
Optimal parameters of Smooth ELM and OPTELM on multi-class data sets

Data	Classes	Smooth ELM		OPTELM
		C	L	C	L
Wine (356 × 13)	3	10000	100	50	100
Glass (214 × 9)	6	10000	100	10000	100
Iris (150 × 4)	3	10000	100	50	20

Table 7

Comparison between Smooth ELM and OPTELM in terms of ACC (%), MCC (%) and Time on multi-class datasets

Data	Smooth ELM			OPTELM
	ACC	MCC	Time	ACC	MCC	Time
Wine	97.50	93.84	1.51	83.90	36.63	121.50
Glass	86.47	43.99	0.62	85.19	46.34	120.32
Iris	96.89	93.03	0.45	99.11	98.00	21.40

Table 7 shows that SELM achieves better generalization performance on two of the three data sets and it spends less time on all of the data sets. In terms of MCC, SELM is also comparable to OPTELM.

5 Conclusion

We slightly modify the traditional ELM model and apply a smoothing technique to the extreme learning machine. A smooth ELM framework (SELM) is proposed to handle pattern classification problems. It combines the idea of ELM with a smoothing approximation strategy. The main contributions of this investigation are:

Applying a smooth approximation of the plus function, we derive a smooth ELM framework (SELM), which is an unconstrained optimization problem whose objective function is strictly convex and infinitely differentiable.

A fast Newton-Armijo algorithm is used to solve the proposed SELM, and the resulting algorithm converges globally and quadratically. Learning is fast because one only needs to solve a sequence of systems of linear equations, instead of a quadratic program as in the traditional ELM and OPTELM methods.

Unlike the traditional smooth SVM (SSVM), which has difficulty with nonlinear problems because of its implicit feature mapping and kernel parameters, the proposed SELM has an explicit kernel function form and is thus convenient to use in nonlinear classification and regression.

Compared with the smooth SVM and the extreme SVM (ESVM) [23], SELM does not require a bias. Therefore, with fewer variables, the proposed SELM is more convenient to apply than SSVM and ESVM.

The performance of the proposed SELM is tested on two-class and multi-class datasets. Experiments show that the proposed SELM either improves on OPTELM or shows no significant difference in generalization, yet is much faster. Moreover, the proposed SELM achieves better performance than the corresponding nonlinear SSVM. Compared with the linear SSVM and other popular methods, the proposed SELM achieves comparable generalization.

The proposed smooth method for ELM can also be applied to regression problems. In future work, we plan to study other smoothing techniques for ELM and to combine them with trust region methods.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (11471010) and the Chinese Universities Scientific Fund.

References

1. Huang G.B., Siew C.K. and Zhu Q.Y., Extreme learning machine: Theory and applications, Neurocomputing 70 (2006), 489–501.

2. Huang G.B., Ding and Zhou, Optimization method based extreme learning machine for classification, Neurocomputing 74 (2010), 155–163.

3. Huang G.B., Zhou H.M., Ding X.J. and Zhang, Extreme learning machine for regression and multiclass classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 42(2) (2012), 513–529.

4. Al-Yaseen W.L., Othman Z.A. and Nazri M.Z.A., Multi-level hybrid support vector machine and extreme learning machine based on modified K-means for intrusion detection system, Expert Systems with Applications 67 (2017), 296–303.

5. Yang L.M. and Zhang S.Y., A sparse extreme learning machine framework by continuous optimization algorithms and its application in pattern recognition, Engineering Applications of Artificial Intelligence 53 (2016), 176–189.

6. Matias, Souza, Araújo and Antunes C.H., Learning of a single-hidden layer feedforward neural network using an optimized extreme learning machine, Neurocomputing 129 (2014), 428–436.

7. Zhai J.H., Xu H.Y. and Wang X.H., Dynamic ensemble extreme learning machine based on sample entropy, Soft Computing 16(9) (2012), 1493–1502.

8. …, Zhao, Wang, et al., Distributed extreme learning machine with kernels based on MapReduce, Neurocomputing 149(PA) (2015), 456–463.

9. Liu, Gao and Li, A comparative analysis of support vector machines and extreme learning machines, Neural Networks 33 (2012), 58–66.

10. Vapnik V.N., Statistical Learning Theory, Wiley, New York, USA, 1998.

11. Chen and Harker P.T., Smooth approximations to non-linear complementarity problems, SIAM J Optimization 7 (1997), 403–420.

12. Chen and Mangasarian O.L., A class of smoothing functions for nonlinear and mixed complementarity problems, Computational Optimization and Applications 5 (1996), 97–138.

13. Yang L.M. and Wang L.S.H., A class of smooth semi-supervised SVM by difference of convex functions programming and algorithm, Knowledge-Based Systems 41 (2013), 1–7.

14. Lee Y.J. and Mangasarian O.L., SSVM: A smooth support vector machine, Computational Optimization and Applications 20(1) (2001), 5–22.

15. Reddy I.S., Shevade and Murty M.N., A fast quasi-Newton method for semi-supervised SVM, Pattern Recognition 44 (2011), 2305–2313.

16. Balasundaram and Kapil D.G., 1-Norm extreme learning machine for regression and multiclass classification using Newton method, Neurocomputing 128 (2014), 4–14.

17. Bartlett P.L., The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network, IEEE Transactions on Information Theory 44(2) (1998), 525–536.

18. Lee and Lin, A study on L2-loss (squared hinge-loss) multiclass SVM, Neural Computation 259(5) (2013), 1302–1323.

19. Blake C.L. and Merz C.J., UCI Repository for Machine Learning Databases, Department of Information and Computer Sciences, University of California, Irvine, http://www.ics.uci.edu/mlearn/MLRepository.html, 1998.

20. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27(8) (2006), 861–874.

21. Yang and Zhao, A new optimizing parameter approach of LSSVM multiclass classification model, Neural Computing and Applications 21(5) (2012), 945–955.

22. Zhang and Zhou W.D., On the sparseness of 1-norm support vector machines, Neural Networks 23 (2010), 373–385.

23. Liu, He and Shi, Extreme support vector machine classifier, Lecture Notes in Computer Science (2008), 222–233.

Table 6
Optimal parameters of Smooth ELM and OPTELM on multi-class data sets

Data (size)        Classes   Smooth ELM         OPTELM
                             C        L         C        L
Wine (356 × 13)    3         10000    100       50       100
Glass (214 × 9)    6         10000    100       10000    100
Iris (150 × 4)     3         10000    100       50       20