Abstract
Extreme learning machine (ELM) has demonstrated great potential in machine learning and data mining. Smoothing strategy is an important technology for continuous optimizations. In this work, we apply a smoothing technique to replace the hinge loss function by an accurate smooth approximation. This will allow us to solve ELM as an unconstrained minimization problem directly. We term this reformulated problem as smooth ELM (SELM). A Newton-Armijo algorithm is used to solve the proposed SELM, and the resulting algorithm converges globally and quadratically. The proposed SELM with fast running speed has less decision variables and can better deal with nonlinear problems than the existing smooth support vector machine. Numerical experiments on various types of datasets including two-class datasets and multi-class datasets demonstrate that the speed of SELM is much faster than that of the existing ELM models. And compared with other popular algorithms of support vector machine and ELM, the proposed SELM achieves better or similar generalization. These demonstrate the effectiveness and fast speed of the algorithm.
Introduction
Extreme learning machine (ELM) [1–5] is an important learning algorithm for single-hidden-layer feedforward neural networks (SLFNs) [6]. With simple structure, low computational cost and good generalization, ELM has been used successfully in machine learning and data mining, especially in big data analysis [4, 8]. Moreover, ELM overcomes the drawbacks of traditional neural networks such as local minima, imprecise learning rates and slow convergence rates. Its hidden nodes and input weights are randomly generated and the output weights are expressed analytically if the activation functions in the hidden layer are infinfitely differentiable. The ELM has better generalization performance than that of other neural networks algorithms such as gradient-based methods. Moreover, ELM can obtain similar generalization ability to support vector machine (SVM) [4, 10] but with much less training time. In addition, different from SVM, ELM can provide a unified learning platform to different applications, such as regression,binary and multi-class classification problems [3].
Smoothing technology has been extensively applied to different optimizations, such as continuous optimization and variational inequalities [11–13]. The main merits of smoothing methods are that they can convert the continuous optimizations into smooth optimizations and the information of higher derivatives can be used after smoothing. For example, the smooth support vector machine (SSVM) [14] has important mathematical properties such as strong convexity and infinitely often differentiability. Inspired by these investigations, in this work we change the traditional ELM model slightly and apply a smoothing technique to the extreme learning machine for pattern classification. We begin with the binary classification case which can be converted to an unconstrained optimization problem directly in which the objective function is not twice differentiable. We apply a smoothing technique to ELM and to solve the problem as an unconstrained minimization problem directly. We term this reformulated problem as smooth ELM (SELM) which has important mathematical properties such as strong convexity and infinitely often differentiability. Moreover, a fast Newton-Armijo method [15] is used to solve the SELM and the resulting algorithm globally and quadratically converges to the unique solution of the SELM. Taking advantage of SELM formulation, we only need to solve a system of linear equations iteratively,with fast learning speed, instead of solving a convex quadratic programming.
We use a kind of smooth approximation to the plus function (x) + = max {0, x}. Suppose x is a stochastic variable whose density function is d (x), and the expectation of ∣x∣ satisfies E [∣ x ∣] d(x) < + ∞. Set v (x, λ) = λd (λx) and
A typical example of smooth functions is neural networks smooth plus function [12]. Let
Figure 1 illustrates the curves of the plus function (x) + and the function p (x, λ) with different λ values. It can be seen intuitively that the approximation accuracy of the function p (x, λ) increases as λ does.

Curves of plus function and neural networks smooth plus function.
Newton methods [15, 16] are a class of algorithms which converge quickly. Newton-Armijo algorithm for solving smooth programming is a kind of damped Newton methods. Different from the original Newton method, it can converge globally under certain conditions.
Consider a unconstrained optimization problem
ELM
We here give a brief description of ELM for binary classification problems; a more detailed description of ELM is available in literature [1, 5]. Consider a dataset containing N training samples (x i , t i ),i = 1, 2, . . . , N, where x i ∈ R n and t i ∈ {1, - 1} which denotes the category of the input sample x i .
Suppose that there are L nodes in the hidden layer for the SLFN. Set w
i
denote the weight vector connecting the input nodes with the ith hidden node, and b
i
the bias term of the ith hidden node. Then a desired SLFN with L hidden nodes is to approximate these N samples with zero error, which means that the desired output for the jth pattern is
T = (t1, t2, ⋯ , t N ) T and β = (β1, β2, ⋯ , β L ) T .
Huang et al. point out that the input weights w
i
and hidden layer biases b
i
(i = 1, 2, . . . , N) in the SLFN are not necessarily tuned during training and may be assigned randomly [1, 2]. Based on this scheme, Huang et al. propose a simple SLFN algorithm, called ELM, the aim of which is to find a least-squares solution of the linear system (2). That is to say, determining a SLFN can be posed as finding the solution of the least-squares problem
Recently, Huang et al. [2] have proposed a new ELM framework based on optimization theory (called OPTELM), in which the hinge loss function was introduced into ELM. According to Bartlett’s theory [17], the smaller the norms of weights are, the better generalization performance of the networks tends to have. Moreover SVM’s maximal separating margin property and the ELM’s minimal norm of output weights property are actually consistent; just as SVM does, ELM also minimizes the training error as well as maximizing the separating margin.
Theoretically, all the training data in the ELM feature space are linearly separable by a hyperplane passing through the origin with probability one, but in practical applications, the training data can not be strictly separated in the ELM feature space. Thus Huang et al. propose an optimization theory-based ELM framework (called OPTELM), where the hinge loss function is applied in ELM training:
In this section, we begin with the binary classification case that can be converted to an unconstrained optimization problem directly.
Smooth ELM for binary classification
We change slightly the traditional ELM with hinge loss-function [18]. First, we replace the l1-norm with the l2-norm of the slack variable ξ by weighting
Let D be a N × N diagonal matrix with t i (i = 1, 2, . . . , N) along its diagonal. Then problem (6) can be posed as the following form:
Where a column vector of ones of arbitrary dimension is denoted by e. This is also a quadratic programming with global solution.
Applying the Karush-Kuhn-Tucker Conditions (KKT conditions) of problem (7), we can come to the following conclusion.
The Lagrange function of problem (7) is
Consider two cases. On one hand, for i which satisfies 1 - t
i
h (x
i
)
T
β* > 0, 1 ≤ i ≤ N, we can get
On the other hand, for i which satisfies 1 - t
i
h (x
i
)
T
β* ≤ 0, 1 ≤ i ≤ N, it can be proved that
Hence we have
Thus, replacing ξ by (e - DHβ) + in (7), we can convert (7) into an unconstrained optimization problem:
The object function above is strictly convex, which guarantees that (16) has a unique solution. However, it is not twice differentiable so that Newton methods can’t be used directly.
Using neural networks smooth plus function (1), we get a smooth approximation of formulation (16), called SELM
The following theorem shows the relationship between the solutions of the problem (16) and its smooth approximation SELM (17). Moreover, it can be proved that the solution of problem (16) is obtained by solving its smooth approximation SELM (17) with λ approaching infinity.
There exists a unique solution For any λ,
The proof of theorem 2 is similar to theorem 2.3 in literature [14].
The above theorem shows that
As is presented in Section 1, Newton-Armijo algorithm has the merits that it converges quickly and globally. Since the object function of problem (17) is twice differentiable, a Newton-Armijo algorithm for solving SELM (17) is proposed as follows.
Initialize C, λ and deviations ɛ1 and ɛ2. Let k = 0 and start with any Determine direction d
k
∈ R
L
by the linear equations Choose a step size ν
k
∈ R such that Stop if
The above algorithm uses a series of linear equations to solve the optimization problem, which inherits the most significant feature of ELM that is quickness. More accurately, we have the following theorem.
The sequence For any
The proof of the above theorem is similar to the Theorem 3.2 in literature [14].
Theorem 3 shows that Newton-Armijo Algorithm for SELM (17) converges globally. And it takes a pure Newton step after a finite number of iterations, which guarantees that it converges quadratically.
Following the notations of Section 2.2, we solve multiclass smooth ELM problem using the popular one-against-all (OAA) method [16, 18].
Different from the smooth SVM (SSVM) which is difficult to optimize in dealing with nonlinear problems because of the unknown implication mapping and the kernel parameters, kernel function of proposed SELM (17) has the explicit form: K
ELM
(x
i
, x
j
) = h (x
i
)
T
h (x
j
), and its network parameters are randomly generated without tuning. The bias b is not required in SELM since the separating hyperplane β
T
h (x) =0 passes through the origin in ELM feature space, while SSVM needs threshold to determine the hyperplane. Thus the proposed SELM with fast running speed has less decision variables than the exist SSVM. Compared with the OPTELM (5), the proposed SELM is with fast running speed. The SELM is more convenient to apply. The proposed SELM is posed as a strict convex programming with unique solution. A Newton-Armijo algorithm is used to solve the SELM, and the resulting algorithm converges globally and quadratically. Similar to the conventional ELM and OPTELM, the proposed SELM has universal approximation capability and it can achieve low approximation error on training set.
Experiments
The performance of Newton-Armijo algorithm for solving the proposed smooth ELM is analyzed by numerical experiments on various data sets from the UCI Machine Learning Repository [19]. Two-class and multi-class classification are both simulated. All the simulations are carried out in MATLAB7.10 environment running in a Core i3, 2.3 GHZ CPU.
We perform ten-fold cross validation in all considered datasets. In other words, the dataset is split randomly into ten subsets, and one of those sets is reserved as a test set. This process is repeated ten times, and the average of ten test results is used as the performance measure. We remove the labels of the test samples and employ these methods to reclassify the test set.
We are interested in classification accuracy and the running time of the proposed smooth ELM framework. The following different criteria are used in this investigation: Accuracy (ACC) is the identification rate of all samples from two classes; F1-measure and Mathew correlation coefficient (MCC) are two comprehensive measures of the quality of classification model. The above values can be obtained from the decision function and are defined as [20]
Time: the total training and test time.
The penalty parameter C of the proposed SELM (17) is adjusted from the set {0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 1000, 10000}. And we set λ = 10, ρ = 0.25 and ɛ1 = ɛ2 = 10-3 in the experiments.
Theoretically, the number of hidden nodes L plays an important role in training ELM networks. With C = 10000, we provide a map (see Fig. 2) to illustrate the accuracy of the proposed SELM as L varies on seven datasets, where the x-axis denotes the values of L and the y-axis denotes the ACC.

Testing accuracy of smooth ELM on seven datasets as L increases.
We observe from Fig. 2 that the accuracy of SELM increases and then fluctuates or decreases slightly with L increasing. This may help to choose the number of nodes L of the hidden layer of SELM and OPTELM.
In this work, the optimal value L is tuned from the set {5, 10, 20, 50, 100, 200, 300, 400, 500} by ten-fold cross validation to maximize the accuracy of SELM and OPTELM respectively.
We select sigmoid function
We choose the smooth SVM (SSVM) [14] and OPTELM [2] as the baseline methods.
In this section, we implement the proposed smooth ELM (SELM) (17) in binary classification problems. The numerical experiments are carried out on seven UCI datasets.
Comparisons of SELM with OPTELM
We first compare the proposed SELM (17) with OPTELM (5). The optimum parameter C and the number of hidden nodes L for two algorithms are reported in Table 1. And with these chosen parameters, the SELM and OPTELM models are run 50 times trials and ten-fold cross validation is used in each trial. The average running time of these two algorithms are also presented in Table 1.
Comparison between SELM and OPTELM with respect to Time (s)
Comparison between SELM and OPTELM with respect to Time (s)
As observed from Table 1, in terms of time analysis, the advantage of the proposed algorithm is clear. The time spent on OPTELM is more than 50 times than that of SELM. For Heart and WPBC, the advantages increase up to more than 500 times.
Then with the optimum parameters C and L shown in Table 1, we compare the proposed SELM with OPTELM in terms of generalization. The average results are reported in Table 2.
Comparison between Smooth ELM and OPTELM in terms of ACC (%), MCC (%) and F1 (%)
The computational results from Table 2 show that the SELM improves generalization compared with the OPTELM in most cases. And there’re only two data sets whose ACC for the proposed SELM is lower while the rest are the same or higher. It is concluded that the generalization of the proposed SELM is on the same level as that of OPTELM, and more accurately, a little better than that of the latter.
In addition, the optimal parameters for different algorithms are distinct. To further evaluate the proposed SELM in terms of time, we also compare the running time when the parameters C and L are the same in SELM and OPTELM respectively. The average results on four different cases are shown in Table 3.
Comparison between SELM and OPTELM with respect to Time (s)
For four different combinations of parameter C and L, Table 3 shows that the running time of SELM is much shorter than that of OPTELM when C and L are the same in SELM and OPTELM. It is true to other combinations of parameters.
Then we compare the SELM with linear SSVM (LSSVM) [10] and nonlinear SSVM (SSVM-kernel) [14] in terms of ACC. The average results with optimal parameters are summarized in Table 4.
Comparison among Smooth ELM, linear SSVM (LSSVM) and nonlinear SSVM in terms of ACC (%)
Comparison among Smooth ELM, linear SSVM (LSSVM) and nonlinear SSVM in terms of ACC (%)
Table 4 illustrates that the proposed SELM achieves better performance than nonlinear SSVM in all considered data sets. Moreover, SELM yields accuracy comparable to linear SSVM.
To further demonstrate the effectiveness of the proposed algorithm, we compare the proposed algorithm with other popular algorithms: the traditional ELM [1], the standard SVM [10] and l1-norm SVM [22]. The results on two data sets are showed in Table 5, where the results of l1-norm SVM and SVM are from literature [14].
Comparison of ACC (%) of SELM with other popular algorithms
Table 5 shows that the proposed SELM has the better test accuracy in four algorithms. The performance of the SELM outperforms obviously the ELM on two datasets, and is slightly superior to the l1-norm SVM and SVM. These indicate the effectiveness of the proposed algorithm.
We also compare the proposed SELM with OPTELM on multi-class data sets. We use the popular one-against-all (OAA) method for multi-class problems. The information of datasets and the optimal parameters are shown in Table 6. And the average results with optimal parameters are listed inTable 7.
Optimal parameters of Smooth ELM and OPTELM on multi-class data sets
Optimal parameters of Smooth ELM and OPTELM on multi-class data sets
Comparison between Smooth ELM and OPTELM in terms of ACC (%), MCC (%) and Time on multi-class datasets
Table 7 shows that SELM achieves better generalization performance on two of the three data sets and it spends less time on all of the data sets. In terms of MCC, SELM is also comparable to OPTELM.
We change slightly the traditional ELM model and apply a smoothing technique in the extreme learning machine. A smooth ELM framework (SELM) is proposed to handle pattern classification problems. It combines the idea of ELM and smoothing approximation strategy. The main contributions of this investigation are: Applying a smooth approximation of the plus function, we derive a smooth ELM framework (SELM) which is an unconstrained optimization and its objective function is strictly convex and infinitely differentiable. A fast Newton-Armijo algorithm is used to solve the proposed SELM and the resulting algorithm converges globally and quadratically. And with fast learning speed, one only need to solve a system of linear equations iteratively, instead of quadratic programming such as the traditional ELM and OPTELM methods. Different from the traditional smooth SVM (SSVM) which is difficult to deal with nonlinear problems, with implication mapping and the kernel parameters, the proposed SELM has the explicit kernel function form, and thus is convenient to use in nonlinear classifications and regressions. Compared with the smooth SVM and extreme SVM (ESVM) [23], the bias is not required in SELM. Therefore,with fewer variables,the proposed SELM is more convenient to apply than that of SSVM and ESVM. The performance of the proposed SELM is tested on various types of datasets including two-class datasets and multi-class datasets. Experiments show that the proposed SELM either improves or shows no significant difference in generalization compared with OPTELM, yet has much faster speed than the latter. Moreover, the proposed SELM achieves better performance than the corresponding nonlinear SSVM. Compared with the linear SSVM and other popular methods, the proposed SELM achieves comparative generalization.
The proposed smooth method for ELM can be applied to regression problems. Moreover, we plan to continue future work on other smoothing techniques to ELM and combining trust region method.
Footnotes
Acknowledgments
This work is supported by National Nature Science Foundation of China (11471010) and Chinese Universities Scientific Fund.
