Twin extreme learning machine based on heteroskedastic Gaussian noise model and its application in short-term wind-speed forecasting

Abstract

The extreme learning machine (ELM) has received increasing attention because of its high efficiency and ease of implementation. However, existing ELM algorithms are generally sensitive to noise and lack robustness. We therefore combine the advantages of twin hyperplanes with the speed of ELM and incorporate the characteristics of heteroscedastic Gaussian noise. The resulting regressor is called the twin extreme learning machine based on heteroskedastic Gaussian noise (TELM-HGN). The augmented Lagrange multiplier method is used to solve the proposed model. Extensive experiments were conducted on different data sets, including real wind-speed data, the Boston housing price data set and a stock data set. The results show that the proposed algorithm not only inherits most of the merits of the original ELM but also achieves more stable and reliable generalization performance and more accurate predictions, which supports the correctness and effectiveness of the proposed model.

Keywords

Extreme learning machine; heteroscedastic Gaussian noise; least squares support vector regression; twin hyperplanes; wind-speed forecasting

1 Introduction

Extreme learning machines (ELM) [1], a training framework [2] for feedforward neural networks, have attracted wide interest [3–8] since their proposal and have been successfully applied in a variety of fields, including image classification [9–11], target recognition [12], fault diagnosis [13], feature selection [14, 15], and speech applications [16, 17]. Compared with typical gradient-based neural networks, ELM has several significant advantages. First, the hidden layer does not need to be tuned: the weights connecting the input layer to the hidden layer and the hidden-layer biases are generated randomly, so only the output weights need to be optimized. Second, learning is extremely fast, because the output weights are obtained directly by applying a generalized inverse to the hidden-layer output matrix, which eliminates iterative training.

However, ELM also has weaknesses. One is its sensitivity to noise and outliers. Because ELM uses the squared error as its empirical risk, its solution is optimal only when the error variables follow a zero-mean homoscedastic Gaussian distribution. In practical problems, measurement tools, experimental errors and other factors inevitably introduce noise and outliers into the sample data, so the error variables do not always follow such a distribution. Another weakness is a tendency to overfit, which leads to poor generalization, since standard ELM is based on empirical risk minimization. In addition, an excessive number of hidden-layer nodes can also degrade the generalization performance of the model. Consequently, many extended models have been proposed to address these weaknesses.

Regularization theory can be applied to deal with these issues. Regularization is essentially an implementation of the structural risk minimization strategy: a regularization (penalty) term that represents the complexity of the model is added to the empirical risk. According to statistical learning theory, while the empirical risk is minimized, the simpler the model is, the smaller the confidence risk is, which brings better generalization performance. The idea of regularization has therefore received wide attention and has been studied in depth. Huang [18] proposed a regularized ELM model that adds the 2-norm of the weights as a regularization term, which can significantly enhance the generalization performance of ELM. Chen et al. [19] put forward a robust regularized ELM model based on iterative reweighting, which employs 2-norm and 1-norm regularization terms to avoid overfitting and hence improve generalization. Deng et al. [20] proposed a weighted regularized ELM (WELM) algorithm based on the principle of structural risk minimization and the weighted least squares approach, which somewhat improved generalization performance without increasing the training time. However, because the error weights must be computed during training, it can be very time-consuming when the amount of data is large. In addition, researchers have proposed improved ELM models based on the Huber loss function [21], the 1-norm loss function [22] and the pinball loss function [23]. Because these losses still grow linearly with the training error, their robustness remains poor.

To solve this problem, this paper studies the noise characteristics in wind-speed prediction and derives the corresponding heteroscedastic optimal empirical risk loss function using the Bayesian principle and maximum a posteriori estimation. Meanwhile, regressors built on a pair of hyperplanes have drawn a great deal of attention for their excellent generalization performance and low computational complexity. Moreover, recent advances have indicated relationships between ELM and the support vector machine [24–26]. Wan et al. [27] proposed a new approach to data classification, termed TELM, which extends ELM to a classifier with two nonparallel separating hyperplanes. However, the hidden-layer parameters need to be calculated iteratively by solving an optimization problem, which leads to high computational complexity and a large number of hidden-layer parameters. A well-designed regressor should combine high computational efficiency with accurate predictions and good generalization performance. It is therefore worthwhile to integrate ELM with the twin structure to design a hybrid model.

The main contributions of this paper are as follows: (1) by investigating the properties of the noise in real wind-speed forecasting, we find that the wind-speed noise follows a zero-mean heteroscedastic Gaussian distribution; (2) we derive the heteroskedastic optimal empirical risk loss function using the Bayesian principle and maximum a posteriori estimation; (3) we establish two regression models, the twin extreme learning machine based on heteroskedastic Gaussian noise (TELM-HGN) and the twin extreme learning machine based on homoscedastic Gaussian noise (TELM-GN), which combine the idea of twin hyperplanes with the speed of ELM. Experimental results show that TELM-HGN retains the simple parameter setting and fast training of ELM while reducing its sensitivity to noise and outliers and improving its generalization performance, so it can be readily extended to large data sets.

The rest of this paper is organized as follows. Section 2 briefly introduces TSVR, TLSSVR and ELM. Section 3 explores the properties of the noise model in wind-speed forecasting and presents the TELM-HGN and TELM-GN models in detail. Section 4 reports experiments on different data sets, including real wind-speed data, the Boston housing price data set and a stock data set, to verify the proposed models. The last section summarizes the article.

2 Materials and methods

This section briefly describes TSVR, TLSSVR and ELM. Assume that the training data set of size N generated by an unknown regression function f (x) is $D = \{(x_{i}, y_{i})\}_{i = 1}^{N}$, where $x_{i} = [x_{i1}, x_{i2}, \dots, x_{in}]^{T} ∈ R^{n}$ is the input of the ith sample, $y_{i} = [y_{i1}, y_{i2}, \dots, y_{im}]^{T} ∈ R^{m}$ is the output of the ith sample, and Rⁿ is the n-dimensional Euclidean space. The superscript T denotes the transpose.

2.1 Twin support vector regression

To enhance the computational speed and generalization performance of standard SVR, Peng [28] extended the TWSVM [29] model to the TSVR model. TSVR determines the ɛ-insensitive lower and upper bounds of the regression function by generating a pair of nonparallel functions, one on each side of the training data points. Consequently, TSVR solves two smaller quadratic programming problems (QPPs) [30] instead of one large QPP, which substantially reduces the computational cost. TSVR [31] can be summarized as solving the following pair of QPPs:

$\begin{aligned} \min \; & g_{P_{TSVR}} = \frac{1}{2} (y - e ɛ_{1} - (K (X, X^{T}) ω_{1} + e b_{1}))^{T} (y - e ɛ_{1} - (K (X, X^{T}) ω_{1} + e b_{1})) + C_{1} e^{T} ξ \\ \text{s.t.} \; & y - (K (X, X^{T}) ω_{1} + e b_{1}) ⩾ e ɛ_{1} - ξ, \quad ξ ⩾ 0 \end{aligned}$ (1)

$\begin{aligned} \min \; & g_{P_{TSVR}} = \frac{1}{2} (y + e ɛ_{2} - (K (X, X^{T}) ω_{2} + e b_{2}))^{T} (y + e ɛ_{2} - (K (X, X^{T}) ω_{2} + e b_{2})) + C_{2} e^{T} η \\ \text{s.t.} \; & (K (X, X^{T}) ω_{2} + e b_{2}) - y ⩾ e ɛ_{2} - η, \quad η ⩾ 0 \end{aligned}$ (2)

where K (X, X^T) is the kernel matrix with elements $k_{ij} = k (x_{i}, x_{j})$, $x_{j}^{T}$ is the jth row of X, and e is a column vector of ones. ξ and η are the slack variables. Here, k (x_i, x_j) = φ (x_i) ^T · φ (x_j) is the kernel function, which gives the inner product of φ (x_i) and φ (x_j) in the high-dimensional feature space, where φ (·) is a (generally nonlinear) mapping from the input space to that feature space.

The dual optimization problems of TSVR are as follows: $\begin{aligned} \min \; & g_{D_{TSVR}} = \frac{1}{2} α^{T} H (H^{T} H)^{- 1} H^{T} α - g^{T} H (H^{T} H)^{- 1} H^{T} α + g^{T} α \\ \text{s.t.} \; & 0 ⩽ α ⩽ C_{1} e \end{aligned}$ (3)

$\begin{aligned} \min \; & g_{D_{TSVR}} = \frac{1}{2} β^{T} H (H^{T} H)^{- 1} H^{T} β + h^{T} H (H^{T} H)^{- 1} H^{T} β - h^{T} β \\ \text{s.t.} \; & 0 ⩽ β ⩽ C_{2} e \end{aligned}$ (4)

where H = [K (X, X^T), e], g = y - eɛ₁ and h = y + eɛ₂. The optimal weight vectors and biases are as follows:

$[ω_{1}^{T}, b_{1}]^{T} = (H^{T} H)^{- 1} H^{T} (g - α)$,

$[ω_{2}^{T}, b_{2}]^{T} = (H^{T} H)^{- 1} H^{T} (h + β)$.

Based on the TSVR, two nonlinear regression functions f₁ (x) and f₂ (x) are available, where $f_{1} (x) = ω_{1}^{T} K (X, x) + b_{1}$ , $f_{2} (x) = ω_{2}^{T} K (X, x) + b_{2}$ .

Thereby, the predictive regression function of the nonlinear TSVR can be expressed as: $\begin{matrix} f_{TSVR} (x) = \frac{f_{1} (x) + f_{2} (x)}{2} \\ = \frac{1}{2} (ω_{1} + ω_{2})^{T} K (X, x) + \frac{1}{2} (b_{1} + b_{2}) \end{matrix}$

An intuitive geometric interpretation of TSVR is shown in Fig. 1, where the lower-bound function f₁ (x) is labeled Down_f (X), the upper-bound function f₂ (x) is labeled Up_f (X), and the predicted regression function $f (x)$ is labeled f (X).

Fig. 1

Geometric interpretation of TSVR

2.2 Twin least squares support vector regression

To improve computational efficiency and obtain better generalization performance, Zhao et al. [32] proposed TLSSVR by combining the idea of twin hyperplanes with the speed of least squares support vector regression (LSSVR). The TLSSVR model replaces the inequality constraints of TSVR with equality constraints, thereby significantly reducing the computational burden. The primal problems of the TLSSVR model are: $\begin{aligned} \min \; & g_{P_{TLSSVR}} = \frac{1}{2} ω_{1}^{T} ω_{1} + \frac{C_{1}}{2 l} \sum_{i = 1}^{l} ν_{i} ξ_{i}^{2} \\ \text{s.t.} \; & y_{i} = ω_{1}^{T} φ (x_{i}) + b_{1} + ξ_{i} + ɛ_{1} \end{aligned}$ (5) $\begin{aligned} \min \; & g_{P_{TLSSVR}} = \frac{1}{2} ω_{2}^{T} ω_{2} + \frac{C_{2}}{2 l} \sum_{i = 1}^{l} ν_{i}^{*} ξ_{i}^{* 2} \\ \text{s.t.} \; & y_{i} = ω_{2}^{T} φ (x_{i}) + b_{2} + ξ_{i}^{*} - ɛ_{2} \end{aligned}$ (6)

where ω₁, ω₂ are the normal vectors of the hyperplanes, b₁, b₂ are the biases, C₁, C₂ are the regularization parameters, and $ν_{i} ⩾ 0, ν_{i}^{*} ⩾ 0$ are the weights applied to the slack errors $ξ_{i}, ξ_{i}^{*}$, respectively.

Solving problem (5) with the augmented Lagrange multiplier (ALM) approach gives the linear system: $\begin{bmatrix} 0 & e^{T} \\ e & \tilde{K} \end{bmatrix} \begin{bmatrix} b_{1} \\ α \end{bmatrix} = \begin{bmatrix} 0 \\ y - ɛ_{1} e \end{bmatrix}$ (7) where $\tilde{K}_{ij} = k (x_{i}, x_{j}) + δ_{ij} / (ν_{i} C_{1})$, with δ_ij = 1 if i = j and δ_ij = 0 if i ≠ j, for i, j = 1, 2, ⋯ , l. The weights are ν_i = ρ if α_i < 0 and ν_i = 1 if α_i ⩾ 0, with ρ > 1.

The lower-bound prediction function of the TLSSVR model is: $f_{1} (x) = \sum_{i = 1}^{l} α_{i} k (x_{i}, x) + b_{1} .$

Similarly, the upper bound prediction function of TLSSVR can be expressed as follows: $f_{2} (x) = \sum_{i = 1}^{l} α_{i}^{*} k (x_{i}, x) + b_{2} .$
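The linear system in Equation (7) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the Gaussian (RBF) kernel, the function names `rbf_kernel`, `tlssvr_lower_fit` and `tlssvr_lower_predict`, and the default of uniform weights ν_i = 1 (instead of the sign-dependent weights described above) are choices made here, not details taken from the TLSSVR reference.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2) (assumed kernel)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def tlssvr_lower_fit(X, y, C1=10.0, eps1=0.1, nu=None, gamma=1.0):
    """Solve the bordered linear system of Equation (7) for (b1, alpha)."""
    l = X.shape[0]
    nu = np.ones(l) if nu is None else np.asarray(nu, dtype=float)
    # K~_ij = k(x_i, x_j) + delta_ij / (nu_i * C1)
    K_tilde = rbf_kernel(X, X, gamma) + np.diag(1.0 / (nu * C1))
    A = np.zeros((l + 1, l + 1))
    A[0, 1:] = 1.0          # first row:    [0, e^T]
    A[1:, 0] = 1.0          # first column: [0, e]^T
    A[1:, 1:] = K_tilde
    rhs = np.concatenate(([0.0], y - eps1))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]  # b1, alpha

def tlssvr_lower_predict(X_train, alpha, b1, X_new, gamma=1.0):
    """Lower-bound function f1(x) = sum_i alpha_i k(x_i, x) + b1."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha + b1
```

The upper-bound function would be obtained the same way with the right-hand side y + ɛ₂ e and the parameters C₂ and ν*.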

2.3 Extreme learning machine

ELM was originally developed by Huang [33, 34] for single-hidden-layer feedforward neural networks [35]. Its main advantage is that the hidden-layer nodes do not need iterative adjustment, which was a significant advance in feedforward neural network research. Consider a data set $\{(x_{j}, y_{j})\}_{j = 1}^{N}$ of N training samples, and let g (w, x, b) denote the activation function of the ELM. The output function of an ELM [36] with L hidden nodes is: $f (x_{j}) = \sum_{i = 1}^{L} β_{i} g (w_{i} \cdot x_{j} + b_{i}), j = 1, 2, \dots, N$ (8) where β_i = [β_i1, β_i2, ⋯ , β_im] ^T is the output weight connecting the ith hidden neuron to the output neurons, w_i = [w_i1, w_i2, ⋯ , w_in] ^T is the input weight connecting the input neurons to the ith hidden neuron, b_i is the bias of the ith hidden neuron, w_i · x_j is the inner product of w_i and x_j, and f (x_j) is the actual output of the network. The hidden-layer output matrix of the ELM, whose jth row is h (x_j), is: $H = \begin{bmatrix} g (w_{1} \cdot x_{1} + b_{1}) & \dots & g (w_{L} \cdot x_{1} + b_{L}) \\ ⋮ & ⋱ & ⋮ \\ g (w_{1} \cdot x_{N} + b_{1}) & \dots & g (w_{L} \cdot x_{N} + b_{L}) \end{bmatrix}$

Equation (8) can be formulated in the following form: $H β = T$ (9)

T is the desired target matrix. Training the ELM amounts to finding a set of parameters $w_{i}^{*}, b_{i}^{*}, β_{i}^{*}$ such that the following Equation (10) holds: $∥ H (w_{i}^{*}, b_{i}^{*}) β_{i}^{*} - T ∥ = \min_{w, b, β} ∥ H (w_{i}, b_{i}) β_{i} - T ∥, \quad i = 1, \dots, L$ (10)

This objective is equivalent to minimizing the following loss function: $E = \sum_{j = 1}^{N} (\sum_{i = 1}^{L} β_{i} g (w_{i} \cdot x_{j} + b_{i}) - t_{j})^{2}$ (11)

According to the theory of generalized inverses, the minimum-norm least-squares solution of Equation (9) is: $β = H^{+} T$ (12)

where H⁺ is the Moore–Penrose generalized inverse of the hidden layer output matrix H.
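Equations (8) to (12) translate almost directly into code. The sketch below is a minimal illustration, assuming NumPy, a sigmoid activation, and input weights and biases drawn uniformly from [-1, 1]; the names `elm_fit` and `elm_predict` and the sampling range are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, L=20, seed=0):
    """Basic ELM: random hidden layer, output weights from the pseudoinverse."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n, L))  # random input weights, never trained
    b = rng.uniform(-1.0, 1.0, size=L)       # random hidden biases, never trained
    H = sigmoid(X @ W + b)                   # N x L hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T             # Equation (12): beta = H^+ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Network output f(x) = h(x) beta."""
    return sigmoid(X @ W + b) @ beta
```

Only `beta` is learned; `W` and `b` are fixed after sampling, which is why no iterative training loop appears.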

Equation (11) shows that ELM adopts the least-squares error as its empirical risk, so the ELM solution is optimal only when the error variable follows a zero-mean homoscedastic Gaussian distribution. In practical problems, however, data are collected in a complex and dynamic environment affected by various factors, which results in data with noise and outliers, and the error variable does not always follow a zero-mean homoscedastic Gaussian distribution. It is therefore of great practical significance to optimize according to the actual error distribution and to design a loss function that matches the actual problem.

3 Twin extreme learning machine based on Gaussian noise

As mentioned above, the conventional ELM only considers the case where the error variable follows a zero-mean homoscedastic Gaussian distribution and ignores the actual noise in the data. Existing ELM algorithms therefore generally suffer from noise sensitivity and poor robustness. To address this issue, the following subsections examine the uncertainty of wind-speed, give a method for calculating the wind-speed variance, and determine the characteristics of the noise in wind-speed forecasting. The corresponding heteroscedastic optimal empirical risk loss function is then derived using the Bayesian principle and maximum a posteriori estimation. Finally, the twin ELM based on heteroskedastic Gaussian noise and the twin ELM based on homoscedastic Gaussian noise are established by integrating the idea of twin regression with the advantages of ELM.

3.1 Uncertainty of wind

To investigate the behavior of the noise model in actual wind-speed forecasting, wind-speed data from Heilongjiang province were collected at a sampling interval of 5 seconds. After statistical analysis and processing, the average wind-speed and its variance over every 10 minutes were obtained. The forecasted wind-speed is a wind-speed in an average sense, whereas the real wind-speed consists of two parts, namely the hourly average wind-speed and instantaneous random fluctuations. Let the time series of the actual instantaneous wind-speed of the wind farm be $\{v (t)\}$ and the time series of the wind-speed on the hourly scale be $\{\bar{v} (t)\}$. The instantaneous random fluctuation of the wind-speed, i.e. the turbulence residual, can then be expressed as $e (t) = v (t) - \bar{v} (t)$, and the variance of the random fluctuation of the wind-speed can be expressed [37] as: $Var = \frac{1}{N - 1} \sum_{t = 1}^{N} e^{2} (t) = \frac{1}{N - 1} \sum_{t = 1}^{N} [v (t) - \bar{v} (t)]^{2}$ (13)
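The windowed statistics described above can be sketched as follows. This is an illustration rather than the processing code used for the study: the function name `window_stats` and the use of consecutive non-overlapping windows are assumptions, and within each window the window mean stands in for $\bar{v} (t)$. With a 5-second sampling interval, a 10-minute window holds 120 samples.

```python
from statistics import mean

def window_stats(v, window):
    """Return (mean, variance) for each consecutive non-overlapping window of v.

    The variance uses the 1 / (N - 1) normalisation of Equation (13),
    with N equal to the window length.
    """
    stats = []
    for start in range(0, len(v) - window + 1, window):
        chunk = v[start:start + window]
        v_bar = mean(chunk)                                    # window-average speed
        var = sum((x - v_bar) ** 2 for x in chunk) / (window - 1)
        stats.append((v_bar, var))
    return stats
```

For example, `window_stats(speeds, 120)` would give one (mean, variance) pair per 10 minutes of 5-second samples; any trailing samples that do not fill a whole window are ignored.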

Equation (13) implicitly treats the wind-speed variance as the same at every time t = 1, 2, 3, ⋯ , N. However, the variance of the wind-speed is observed to vary over time, as shown in Fig. 2:

Fig. 2

(a) represents the variation curve of the average wind-speed every 10 minutes; (b) indicates the variation graph of the wind-speed variance.

The two panels show that both the average wind-speed and the wind-speed variance vary with time and that their trends are somewhat similar. It is therefore reasonable to assume that there is a connection between the wind-speed variance and the wind-speed. To further investigate the association between the two, the wind-speed variance was plotted against the mean wind-speed, with the mean wind-speed on the x-axis and the wind-speed variance on the y-axis. Figure 3 shows the result.

Fig. 3

Modulation effect of wind amplitude on its variance.

From Fig. 3, it can be concluded that there is a linear correlation between the two: the wind-speed variance y varies with the mean wind-speed x according to y = 0.0896x + 0.1780. The variance of the wind-speed therefore differs from one time to another and changes with the mean wind-speed, i.e. the noise is heteroskedastic.
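A fit of this kind, and its use for assigning a variance to each sample, can be sketched in plain Python. The helper names `fit_line` and `variance_from_mean` are hypothetical, and ordinary least squares is assumed as the fitting method since the text does not state how the line was obtained; the default coefficients are the ones reported above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def variance_from_mean(mean_speed, slope=0.0896, intercept=0.1780):
    """Wind-speed variance predicted from the mean wind-speed by the fitted line."""
    return slope * mean_speed + intercept
```

Under this relation, `variance_from_mean` supplies a per-sample σ_i² from the mean wind-speed of the corresponding window.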

3.2 Empirical risk loss function of noise characteristics

Many contemporary regression algorithms, including conventional ELM, assume that the noise in the data follows a zero-mean homoscedastic Gaussian distribution. In contrast, the statistical analysis of the wind-speed data set in the preceding section reveals that the variance changes with the mean wind-speed, which indicates that the wind-speed noise follows a zero-mean heteroscedastic Gaussian distribution. In this section, the optimal empirical risk loss function for heteroscedastic Gaussian noise is therefore derived using the Bayesian principle and maximum a posteriori estimation.

Suppose the N training samples with heteroskedastic Gaussian noise are $\{(x_{i}, y_{i})\}_{i = 1}^{N}$, and the relationship between the measured value y_i and the predicted value f (x_i) is: $y_{i} = f (x_{i}) + ξ_{i}, \quad i = 1, 2, \dots, N$ (14) where the ξ_i (i = 1, 2, ⋯ , N) are independent random noise variables with zero mean and standard deviation σ_i (i = 1, 2, ⋯ , N). Using Bayes’ principle, the optimal empirical risk loss function in the maximum likelihood sense [38] is: $l (ξ_{i}) = - \log P (ξ_{i})$ (15) where P (ξ_i) is the probability density function (PDF) of the error variable ξ_i, and l (ξ_i) is the loss incurred at the sample point (x_i, y_i) by the difference between the predicted value f (x_i) and the measured value y_i. Assuming that the noise in Equation (14) is Gaussian with zero mean and heteroskedastic variance $σ_{i}^{2} (i = 1, 2, \dots, N)$, the PDF of ξ_i is $P (ξ_{i}) = \frac{1}{\sqrt{2 π} σ_{i}} \exp (- \frac{ξ_{i}^{2}}{2 σ_{i}^{2}})$. From Equation (15), after dropping the additive term $\log (\sqrt{2 π} σ_{i})$, which does not depend on ξ_i, the corresponding optimal loss function for heteroskedastic Gaussian noise is: $l (ξ_{i}) = \frac{1}{2 σ_{i}^{2}} ξ_{i}^{2} \quad (i = 1, \dots, N)$ (16)

If the noise in Equation (14) is Gaussian with zero mean and homoscedastic variance σ², the factor 1/(2σ²) is the same for every sample, and the empirical risk loss function reduces, up to this constant factor, to $l (ξ_{i}) = \frac{1}{2} ξ_{i}^{2} (i = 1, \dots, N)$ .
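The practical effect of Equation (16) is that a residual of a given size costs less when it occurs at a sample with large noise variance. The following sketch makes this concrete; the function names `hgn_loss` and `gn_loss` are hypothetical and the losses are simply summed over the samples.

```python
def hgn_loss(residuals, sigma2):
    """Sum over samples of Equation (16): xi_i^2 / (2 * sigma_i^2)."""
    return sum(r * r / (2.0 * s2) for r, s2 in zip(residuals, sigma2))

def gn_loss(residuals):
    """Homoscedastic counterpart: sum over samples of xi_i^2 / 2."""
    return sum(r * r / 2.0 for r in residuals)
```

With two residuals of 1.0 and variances 1.0 and 4.0, the second sample contributes 0.125 to `hgn_loss` instead of the 0.5 it contributes to `gn_loss`, so noisier samples pull less on the fit.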

3.3 Twin extreme learning machine based on heteroscedastic Gaussian noise model

Building on the above, and in order to inherit the speed of ELM while obtaining stable and reliable generalization performance, this section combines the idea of twin hyperplanes with the optimal empirical risk loss functions for heteroscedastic and homoscedastic Gaussian noise to propose two extensions of the original ELM: the twin extreme learning machine based on heteroscedastic Gaussian noise (TELM-HGN) and the twin extreme learning machine based on homoscedastic Gaussian noise (TELM-GN). The TELM-HGN model is established by solving the following optimization problems: $\begin{aligned} \min_{β_{1}} \; & P_{TELM - HGN} = \frac{1}{2} ∥ β_{1} ∥^{2} + \frac{C_{1}}{2} \sum_{i = 1}^{N} \frac{1}{σ_{i}^{2}} ξ_{i}^{2} \\ \text{s.t.} \; & y_{i} = h (x_{i}) β_{1} + ξ_{i} + ɛ_{1}, \quad i = 1, 2, \dots, N \end{aligned}$ (17) $\begin{aligned} \min_{β_{2}} \; & P_{TELM - HGN} = \frac{1}{2} ∥ β_{2} ∥^{2} + \frac{C_{2}}{2} \sum_{i = 1}^{N} \frac{1}{σ_{i}^{* 2}} ξ_{i}^{* 2} \\ \text{s.t.} \; & y_{i} = h (x_{i}) β_{2} + ξ_{i}^{*} - ɛ_{2}, \quad i = 1, 2, \dots, N \end{aligned}$ (18)

where $σ_{i}^{2}, σ_{i}^{* 2}$ are the heteroscedastic variances, $ξ_{i}, ξ_{i}^{*}$ are the error variables, ɛ₁ ⩾ 0, ɛ₂ ⩾ 0 are the threshold parameters, C₁, C₂ are the penalty parameters, and β₁, β₂ are the output weight vectors.

To solve the optimization problem (17), the Lagrangian function is constructed as follows: $L (β_{1}, ξ_{i}, α_{i}) = \frac{1}{2} ∥ β_{1} ∥^{2} + \frac{C_{1}}{2} \sum_{i = 1}^{N} \frac{1}{σ_{i}^{2}} ξ_{i}^{2} + \sum_{i = 1}^{N} α_{i} (y_{i} - h (x_{i}) β_{1} - ξ_{i} - ɛ_{1})$ (19)

According to convex optimization theory, the optimum is obtained by setting the partial derivatives of L (β₁, ξ_i, α_i) with respect to β₁, ξ_i and α_i to zero: $\frac{\partial L}{\partial β_{1}} = β_{1} - \sum_{i = 1}^{N} α_{i} h (x_{i})^{T} = 0 \Rightarrow β_{1} = H^{T} α$ (20) $\frac{\partial L}{\partial ξ_{i}} = \frac{C_{1}}{σ_{i}^{2}} ξ_{i} - α_{i} = 0 \Rightarrow ξ_{i} = \frac{1}{C_{1}} α_{i} σ_{i}^{2}$ (21)

$\frac{\partial L}{\partial α_{i}} = y_{i} - h (x_{i}) β_{1} - ξ_{i} - ɛ_{1} = 0 \Rightarrow y - H β_{1} - ξ - ɛ_{1} = 0$ (22)

The optimal solution in accordance with the KKT conditions can be derived as shown below: $β_{1} = \begin{cases} (H^{T} H + σ_{i}^{2} / C_{1})^{- 1} H^{T} (y - ɛ_{1}), & N ⩾ L \\ H^{T} (H H^{T} + σ_{i}^{2} / C_{1})^{- 1} (y - ɛ_{1}), & N < L \end{cases}$ (23)

Therefore, the lower-bound nonlinear regression function is: $f_{1} (x) = \sum_{i = 1}^{L} h_{i} (x) β_{1 i} = h (x) β_{1}$ (24)
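The stationarity conditions (20) to (22) can be checked numerically. The sketch below is one possible reading, not the reference implementation: it assumes that the term σ_i²/C₁ in Equation (23) denotes the diagonal matrix diag(σ₁², …, σ_N²)/C₁, solves the N × N form obtained by substituting (20) and (21) into (22), and uses the hypothetical name `telm_hgn_lower_beta`.

```python
import numpy as np

def telm_hgn_lower_beta(H, y, sigma2, C1, eps1):
    """Output weights beta_1 = H^T (H H^T + diag(sigma2) / C1)^{-1} (y - eps1)."""
    sigma2 = np.asarray(sigma2, dtype=float)
    A = H @ H.T + np.diag(sigma2) / C1      # N x N system matrix
    alpha = np.linalg.solve(A, y - eps1)    # Lagrange multipliers
    beta1 = H.T @ alpha                     # Equation (20)
    xi = sigma2 * alpha / C1                # Equation (21)
    return beta1, alpha, xi
```

Under the same reading, the matrix identity H^T (H H^T + Σ/C₁)^-1 = (H^T Σ^-1 H + I/C₁)^-1 H^T Σ^-1, with Σ = diag(σ_i²), gives an equivalent L × L system that is cheaper when N is much larger than L. The returned `xi` satisfies Equation (22) up to rounding, which is a convenient sanity check.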

Similarly, the Lagrangian function of problem (18) is: $L (β_{2}, ξ_{i}^{*}, α_{i}^{*}) = \frac{1}{2} ∥ β_{2} ∥^{2} + \frac{C_{2}}{2} \sum_{i = 1}^{N} \frac{1}{σ_{i}^{* 2}} ξ_{i}^{* 2} + \sum_{i = 1}^{N} α_{i}^{*} (y_{i} - h (x_{i}) β_{2} - ξ_{i}^{*} + ɛ_{2})$ (25)

Setting the partial derivatives of $L (β_{2}, ξ_{i}^{*}, α_{i}^{*})$ with respect to $β_{2}, ξ_{i}^{*}, α_{i}^{*}$ to zero, i.e. $\nabla_{β_{2}} L = 0, \nabla_{ξ_{i}^{*}} L = 0, \nabla_{α_{i}^{*}} L = 0$, yields β₂: $β_{2} = \begin{cases} (H^{T} H + σ_{i}^{* 2} / C_{2})^{- 1} H^{T} (y + ɛ_{2}), & N ⩾ L \\ H^{T} (H H^{T} + σ_{i}^{* 2} / C_{2})^{- 1} (y + ɛ_{2}), & N < L \end{cases}$ (26)

Then the upper-bound nonlinear regression function is: $f_{2} (x) = \sum_{i = 1}^{L} h_{i} (x) β_{2 i} = h (x) β_{2}$ (27)

The functions f₁ (x) and f₂ (x) determine the ɛ-insensitive lower-bound and upper-bound regressors, respectively. Hence, the TELM-HGN regressor is constructed as: $f_{TELM - HGN} (x) = \frac{f_{2} (x) + f_{1} (x)}{2} = \frac{h (x) β_{1} + h (x) β_{2}}{2}$ (28)

In particular, when the noise in Equation (14) is zero-mean homoscedastic Gaussian noise, TELM-HGN reduces to the twin extreme learning machine based on the homoscedastic Gaussian noise model (TELM-GN), which is also termed the twin extreme learning machine (TELM). The TELM-GN model can be expressed in the following form: $\begin{aligned} \min_{β_{1}} \; & P_{TELM - GN} = \frac{1}{2} ∥ β_{1} ∥^{2} + \frac{C_{1}}{2} \sum_{i = 1}^{N} ξ_{i}^{2} \\ \text{s.t.} \; & y_{i} = h (x_{i}) β_{1} + ξ_{i} + ɛ_{1}, \quad i = 1, 2, \dots, N \end{aligned}$ (29)

$\begin{aligned} \min_{β_{2}} \; & P_{TELM - GN} = \frac{1}{2} ∥ β_{2} ∥^{2} + \frac{C_{2}}{2} \sum_{i = 1}^{N} ξ_{i}^{* 2} \\ \text{s.t.} \; & y_{i} = h (x_{i}) β_{2} + ξ_{i}^{*} - ɛ_{2}, \quad i = 1, 2, \dots, N \end{aligned}$ (30)

To solve the optimization problem (29), the Lagrangian function is constructed as: $L (β_{1}, ξ_{i}, α_{i}) = \frac{1}{2} ∥ β_{1} ∥^{2} + \frac{C_{1}}{2} \sum_{i = 1}^{N} ξ_{i}^{2} + \sum_{i = 1}^{N} α_{i} (y_{i} - h (x_{i}) β_{1} - ξ_{i} - ɛ_{1})$ (31)

Setting the partial derivatives of the Lagrangian function $L (β_{1}, ξ_{i}, α_{i})$ to zero yields the following expressions: $\frac{\partial L}{\partial β_{1}} = β_{1} - \sum_{i = 1}^{N} α_{i} h (x_{i})^{T} = 0 \Rightarrow β_{1} = H^{T} α$ (32)

$\frac{\partial L}{\partial ξ_{i}} = C_{1} ξ_{i} - α_{i} = 0 \Rightarrow ξ_{i} = \frac{1}{C_{1}} α_{i}$ (33) $\frac{\partial L}{\partial α_{i}} = y_{i} - h (x_{i}) β_{1} - ξ_{i} - ɛ_{1} = 0 \Rightarrow y - H β_{1} - ξ - ɛ_{1} = 0$ (34)

According to the KKT conditions, the optimal solution can be derived as shown below: $β_{1} = \begin{cases} (H^{T} H + I / C_{1})^{- 1} H^{T} (y - ɛ_{1}), & N ⩾ L \\ H^{T} (H H^{T} + I / C_{1})^{- 1} (y - ɛ_{1}), & N < L \end{cases}$ (35)

Correspondingly, β₂ can be calculated as: $β_{2} = \begin{cases} (H^{T} H + I / C_{2})^{- 1} H^{T} (y + ɛ_{2}), & N ⩾ L \\ H^{T} (H H^{T} + I / C_{2})^{- 1} (y + ɛ_{2}), & N < L \end{cases}$ (36)

Once Equations (35) and (36) are solved, the two functions f₁ (x) and f₂ (x) are obtained, respectively, as: $f_{1} (x) = \sum_{i = 1}^{L} h_{i} (x) β_{1 i} = h (x) β_{1}$ (37)

$f_{2} (x) = \sum_{i = 1}^{L} h_{i} (x) β_{2 i} = h (x) β_{2}$ (38)

According to Equations (37) and (38), the decision function of TELM-GN is: $f_{TELM - GN} (x) = \frac{f_{2} (x) + f_{1} (x)}{2} = \frac{h (x) β_{1} + h (x) β_{2}}{2}$ (39)
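The two branches of Equations (35) and (36) give the same weights; they differ only in the size of the linear system, L × L for N ⩾ L and N × N for N < L. A minimal NumPy sketch, with the hypothetical names `telm_gn_beta` and `telm_gn_predict`, is:

```python
import numpy as np

def telm_gn_beta(H, target, C):
    """Output weights of Equations (35)/(36), choosing the smaller linear system.

    Pass target = y - eps1 for beta_1 and target = y + eps2 for beta_2.
    """
    N, L = H.shape
    if N >= L:
        # L x L system: (H^T H + I / C)^{-1} H^T target
        return np.linalg.solve(H.T @ H + np.eye(L) / C, H.T @ target)
    # N x N system: H^T (H H^T + I / C)^{-1} target
    return H.T @ np.linalg.solve(H @ H.T + np.eye(N) / C, target)

def telm_gn_predict(H_new, beta1, beta2):
    """Equation (39): average of the lower- and upper-bound functions."""
    return 0.5 * (H_new @ beta1 + H_new @ beta2)
```

Here `H_new` is the hidden-layer output for the query points, computed with the same random input weights and biases that produced `H`.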

3.4 Algorithm design of TELM-HGN

The algorithm design of TELM-HGN is as follows:

Algorithm 1 Lower-Bound TELM-HGN
1. Input: the training data set
T = {(x_i, y_i) | x_i ∈ Rⁿ, y_i ∈ Rᵐ} (i = 1, 2, ⋯ , N);
2. Select the threshold parameter ɛ₁, the penalty parameter C₁ and the number of hidden layer neurons L by grid search;
3. Randomly generate the input weights w_i and biases b_i, i = 1, 2, ⋯ , L;
4. Compute the hidden layer output matrix H;
5. Compute the output weight vector β₁ of the TELM-HGN algorithm by Equation (23);
6. Output the lower-bound function
$f_{1} (x) = \sum_{i = 1}^{L} h_{i} (x) β_{1} = h (x) β_{1}$

Algorithm 2 Upper-Bound TELM-HGN
1. Input: the training data set
T = {(x_i, y_i) | x_i ∈ Rⁿ, y_i ∈ Rᵐ} (i = 1, 2, ⋯ , N);
2. Select the threshold parameter ɛ₂, the penalty parameter C₂ and the number of hidden layer neurons L by grid search;
3. Randomly generate the input weights w_i and biases b_i, i = 1, 2, ⋯ , L;
4. Compute the hidden layer output matrix H;
5. Compute the output weight vector β₂ of the TELM-HGN algorithm by Equation (26);
6. Output the upper-bound function
$f_{2} (x) = \sum_{i = 1}^{L} h_{i} (x) β_{2} = h (x) β_{2}$

Once f₁ (x) and f₂ (x) are obtained, the regression function of TELM-HGN is: $f_{TELM - HGN} (x) = \frac{f_{2} (x) + f_{1} (x)}{2} = \frac{h (x) β_{1} + h (x) β_{2}}{2} .$
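Steps 3 to 6 of the two algorithms can be put together as in the sketch below. It is an illustration under several assumptions that are not stated in the text: a scalar output, a sigmoid hidden layer with weights drawn uniformly from [-1, 1], the same variances for both bounds (σ_i*² = σ_i²), shared C and ɛ, the N × N form of the solution with diag(σ_i²)/C, and the hypothetical names `telm_hgn_fit` and `telm_hgn_predict`. Parameter selection (step 2) is left out.

```python
import numpy as np

def hidden_layer(X, W, b):
    """Sigmoid hidden-layer output; row j is h(x_j)."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def telm_hgn_fit(X, y, sigma2, C=100.0, eps=0.05, L=30, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], L))  # step 3: random input weights
    b = rng.uniform(-1.0, 1.0, size=L)                # step 3: random biases
    H = hidden_layer(X, W, b)                         # step 4: N x L matrix H
    A = H @ H.T + np.diag(sigma2) / C                 # variance-weighted N x N system
    beta1 = H.T @ np.linalg.solve(A, y - eps)         # step 5, lower bound
    beta2 = H.T @ np.linalg.solve(A, y + eps)         # step 5, upper bound
    return W, b, beta1, beta2

def telm_hgn_predict(X, model):
    W, b, beta1, beta2 = model
    H = hidden_layer(X, W, b)
    f1, f2 = H @ beta1, H @ beta2                     # step 6: bound functions
    return f1, f2, 0.5 * (f1 + f2)                    # final regressor: their average
```

A call such as `telm_hgn_predict(X_test, telm_hgn_fit(X_train, y_train, sigma2))` returns the two bound functions and their average; `sigma2` holds one variance per training sample.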

4 Experiments and discussion

To validate the correctness and feasibility of the proposed regression algorithms, they are compared in this section with several existing algorithms, including ELM, regularized ELM (RELM) and a robust ELM model with a truncated 2-norm loss function (RTTELM), on both public data sets and a real wind-speed data set. All experiments were carried out on the same platform: a personal notebook with an Intel Core i5-8700 processor, 4 GB of memory and the Windows 7 operating system, in a Python 3.7 environment. The commonly used sigmoid function is chosen as the activation function.

In addition, parameter selection is one of the key issues affecting model evaluation; for example, the regularization coefficient and the number of hidden layer nodes have a large impact on the generalization performance of the model. Many algorithms [39, 40] exist for selecting the optimal parameters, including the particle swarm optimization algorithm [41], the grid search algorithm [42] and the gray wolf optimization algorithm [43]. In this paper, the widely used grid search method is adopted to select the parameters of the above models; it locates the best parameter combination by traversing the specified values in the parameter space. To reduce the computational burden of model selection for TELM-HGN and TELM-GN, C₁ = C₂ and ɛ₁ = ɛ₂ are set in our experiments. The regularization parameters of these algorithms are selected from the range [1, 1000]. To compare the learning performance of the different models, the following commonly used evaluation criteria are introduced before the experimental results are presented: mean absolute error (MAE), mean square error (MSE), sum of squared errors (SSE), total sum of squares (SST) and regression sum of squares (SSR). Table 1 gives the definition of each evaluation metric.
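A grid search of the kind described above can be sketched as follows. The example is illustrative only: it tunes a regularized ELM rather than the twin models, scores each parameter pair by the mean square error on a held-out validation set, and uses the hypothetical names `relm_fit_predict` and `grid_search`. The scoring rule and the particular grid values are assumptions, since the text specifies only the range [1, 1000].

```python
import itertools
import numpy as np

def relm_fit_predict(X_tr, y_tr, X_va, C, L, seed=0):
    """Train a regularized ELM with L hidden nodes and predict on X_va."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X_tr.shape[1], L))
    b = rng.uniform(-1.0, 1.0, size=L)
    act = lambda X: 1.0 / (1.0 + np.exp(-(X @ W + b)))  # sigmoid hidden layer
    H = act(X_tr)
    beta = np.linalg.solve(H.T @ H + np.eye(L) / C, H.T @ y_tr)
    return act(X_va) @ beta

def grid_search(X_tr, y_tr, X_va, y_va, C_grid, L_grid):
    """Try every (C, L) pair and keep the one with the lowest validation MSE."""
    best = None
    for C, L in itertools.product(C_grid, L_grid):
        pred = relm_fit_predict(X_tr, y_tr, X_va, C, L)
        mse = float(np.mean((pred - y_va) ** 2))
        if best is None or mse < best[0]:
            best = (mse, C, L)
    return best  # (validation MSE, C, L)
```

For instance, `grid_search(X_tr, y_tr, X_va, y_va, [1, 10, 100, 1000], [10, 20, 50])` evaluates twelve combinations; the cost grows with the product of the grid sizes, which is why tying C₁ = C₂ and ɛ₁ = ɛ₂ reduces the search effort.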

Table 1
Definitions of the evaluation metrics

Metrics	Calculation
MAE	$MAE = \frac{1}{m} \sum_{i = 1}^{m} \left| y_{i} - y_{i}^{*} \right|$
MSE	$MSE = \frac{1}{m} \sum_{i = 1}^{m} {(y_{i} - y_{i}^{*})}^{2}$
SSE	$SSE = \sum_{i = 1}^{m} {(y_{i} - y_{i}^{*})}^{2}$
SST	$SST = \sum_{i = 1}^{m} {(y_{i} - \bar{y})}^{2}$
SSR	$SSR = \sum_{i = 1}^{m} {(y_{i}^{*} - \bar{y})}^{2}$
SSE/SST	$SSE / SST = \frac{\sum_{i = 1}^{m} {(y_{i} - y_{i}^{*})}^{2}}{\sum_{i = 1}^{m} {(y_{i} - \bar{y})}^{2}}$
SSR/SST	$SSR / SST = \frac{\sum_{i = 1}^{m} {(y_{i}^{*} - \bar{y})}^{2}}{\sum_{i = 1}^{m} {(y_{i} - \bar{y})}^{2}}$

Here $\bar{y} = \frac{1}{m} \sum_{i = 1}^{m} y_{i}$ is the mean of the test samples, $y_{i}$ is the true value of the i-th sample point, $y_{i}^{*}$ is the corresponding prediction, and m is the number of test samples. In general, smaller MAE, MSE, SSE, and SSE/SST values indicate a better learning ability of the model. However, when the samples contain noise, a very small MSE may imply overfitting. A smaller SSE/SST is typically accompanied by a larger SSR/SST, but an excessively small SSE/SST is not necessarily best, since it may likewise indicate overfitting.
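The metrics in Table 1 can be computed directly from the true values and the predictions. The following is a minimal sketch, assuming NumPy; the function name `regression_metrics` and the four-point example are hypothetical and serve only to illustrate the definitions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute the Table 1 metrics for true values y_i and predictions y_i^*."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    m = y_true.size
    y_bar = y_true.mean()                   # mean of the test targets
    sse = np.sum((y_true - y_pred) ** 2)    # sum of squared errors
    sst = np.sum((y_true - y_bar) ** 2)     # total sum of squares (zero if targets are constant)
    ssr = np.sum((y_pred - y_bar) ** 2)     # regression sum of squares
    return {
        "MAE": np.sum(np.abs(y_true - y_pred)) / m,
        "MSE": sse / m,
        "SSE": sse,
        "SST": sst,
        "SSR": ssr,
        "SSE/SST": sse / sst,
        "SSR/SST": ssr / sst,
    }

# Hypothetical example: four test points, two of them predicted exactly.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that SSE equals m times MSE, so the sample count of a test set can be checked from any row of the result tables.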

4.1 UCI dataset

In this section, to further verify the effectiveness of the model, we apply it to two UCI data sets (http://archive.ics.uci.edu/ml/index.php): the Boston house price data set and a stock data set. The Boston house price data set is relatively small, with only 506 samples; each sample has 13 features that determine the house price, such as the per capita crime rate, the average number of rooms per house, and highway accessibility. The stock data set contains historical stock data; each sample has nine features, including the opening price, the highest price of the day, and the lowest price of the day. Each data set is split in half, with 50% of the samples used for training and the remaining 50% for testing. Fig. 4 shows the prediction results of the five models on the stock data, and Table 2 lists the corresponding prediction errors. Fig. 5 shows the prediction results of the five models on Boston house prices, and Table 3 lists the corresponding prediction errors.

Fig. 4

The prediction results of the five models on stocks.

Table 2

The prediction errors of five models on stocks

Model	MAE	MSE	SSE	SSE/SST	SSR/SST
ELM	0.030498	0.001501	0.713195	0.028672	0.970971
RELM	0.031527	0.001582	0.751680	0.030219	0.968906
RTTELM	0.029830	0.001424	0.676543	0.027199	0.971320
TELM-HGN	0.026903	0.001170	0.555983	0.022352	0.971112
TELM-GN	0.028307	0.001299	0.617161	0.024811	0.991986

Fig. 5

The prediction results of the five models on Boston house prices.

Table 3

The prediction errors of five models on Boston house prices

Model	MAE	MSE	SSE	SSE/SST	SSR/SST
ELM	0.076792	0.011171	2.792717	0.339149	0.938550
RELM	0.083006	0.013403	3.390865	0.347279	0.665540
RTTELM	0.069218	0.008735	2.183875	0.265211	0.893620
TELM-HGN	0.059357	0.006857	1.714345	0.208191	0.930940
TELM-GN	0.061543	0.007263	1.815912	0.220526	0.896413

As Tables 2 and 3 show, the twin-structured ELMs (TELM-HGN and TELM-GN) improve the learning performance and yield the smallest error metrics among all the algorithms, which is also the motivation for developing the algorithm in this paper. In particular, the proposed TELM-HGN attains the smallest MAE, MSE, SSE, and SSE/SST on both data sets while keeping SSR/SST above 0.93, which indicates that the statistical information in the data is well captured by the proposed model with fairly small regression errors. That is, the presented model not only produces more accurate predictions but also has good generalization performance.

4.2 Short-term wind-speed forecasting

In the above subsection, TELM-HGN demonstrated its advantages on public data sets. To further demonstrate the benefits of the model in practical applications, one year of wind-speed data from Heilongjiang Province was gathered, yielding 62,466 samples with four attributes: mean, variance, minimum, and maximum. In this experiment, 432 training samples and 432 test samples are selected for analysis. The wind-speed forecasting model is constructed as follows: the input vector is $\vec{x}_{i} = (x_{i - 11}, x_{i - 10}, x_{i - 9}, \dots, x_{i - 1}, x_{i}), i = 1, 2, \dots, 864$, and the output value is $x_{i + step}$, where $x_{i}$ is the wind-speed value at moment i and step is the forecast horizon. We set step = 1, 3 in this section; that is, the model is used to forecast the wind speed 10 minutes and 30 minutes after a given time i in Heilongjiang Province during one summer. The predictions of the five models for the wind speed after 10 minutes are shown in Fig. 6, and the prediction of the proposed model for the wind speed after 10 minutes is shown in Fig. 7. The predictions of the five models for the wind speed after 30 minutes are shown in Fig. 8, and the prediction of the proposed model for the wind speed after 30 minutes is shown in Fig. 9. Table 4 compares the results of the five models for the wind speed after 10 minutes, and Table 5 compares the results for the wind speed after 30 minutes.
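The construction of the input vectors and targets can be sketched as a sliding window over the series. This is a minimal sketch, assuming a univariate wind-speed series sampled every 10 minutes (so that step = 1 and step = 3 correspond to 10 and 30 minutes); the function name `make_lagged_samples` and the toy series are hypothetical.

```python
import numpy as np

def make_lagged_samples(series, n_lags=12, step=1):
    """Build inputs (x_{i-11}, ..., x_i) and targets x_{i+step} from a 1-D series."""
    series = np.asarray(series, dtype=float)
    n_samples = series.size - n_lags - step + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested lags and step")
    # Row k holds the n_lags consecutive values series[k], ..., series[k + n_lags - 1].
    X = np.stack([series[k:k + n_lags] for k in range(n_samples)])
    # The target lies `step` positions after the last value of each window.
    first_target = n_lags - 1 + step
    y = series[first_target:first_target + n_samples]
    return X, y

# Toy series 0, 1, ..., 19 so that the alignment is easy to read off.
toy_series = np.arange(20.0)
X10, y10 = make_lagged_samples(toy_series, n_lags=12, step=1)  # 10-minute horizon
X30, y30 = make_lagged_samples(toy_series, n_lags=12, step=3)  # 30-minute horizon
print(X10.shape, y10[0])
print(X30.shape, y30[0])
```

With a longer horizon, fewer samples can be formed from the same series, because the last usable window must end `step` positions before the end of the data.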

Fig. 6

Prediction of five models on wind-speed after 10 minutes.

Fig. 7

Prediction of wind-speed after 10 minutes by the proposed model.

Fig. 8

Prediction of five models on wind-speed after 30 minutes.

Fig. 9

Prediction of wind-speed after 30 minutes by the proposed model.

Table 4

Result comparisons of five models on wind-speed after 10 min

Model	MAE	MSE	SSE	SSE/SST	SSR/SST
ELM	0.481640	0.454902	196.517548	0.102448	0.967804
RELM	0.531206	0.438121	189.268403	0.098669	0.817762
RTTELM	0.459347	0.358951	155.066725	0.080839	0.871664
TELM-HGN	0.421954	0.319814	138.159563	0.072193	0.998312
TELM-GN	0.426268	0.325824	140.756013	0.073549	0.971882

Table 5

Result comparisons of five models on wind-speed after 30 min

Model	MAE	MSE	SSE	SSE/SST	SSR/SST
ELM	0.698045	0.886228	382.850303	0.199343	1.023118
RELM	0.700537	0.835965	361.136828	0.188037	0.829106
RTTELM	0.665480	0.784455	338.884573	0.176451	0.867329
TELM-HGN	0.653906	0.763562	329.858641	0.171681	0.904114
TELM-GN	0.669962	0.793175	342.651787	0.178339	0.901516

Tables 4 and 5 demonstrate that the proposed model has distinct advantages over the other compared models, particularly in the error statistics for the 10-minute forecast, where it achieves the smallest MAE, SSE, and SSE/SST. Furthermore, the proposed model always attains the lowest MSE. Although the twin structure improves the performance of TELM-GN to some extent, its prediction is not the best, because the noise in wind-speed forecasting is heteroskedastic, which the homoscedastic noise model of TELM-GN does not capture.

Also, as shown in Figs. 6 and 8, the regression curves obtained by these algorithms all deviate from the actual wind-speed curve to varying degrees, whereas the regression curve obtained by the proposed model is always the closest to it, indicating that the proposed model is more accurate than the other models. As a result, the proposed model can be deemed an effective approach for predicting actual wind speed.

5 Conclusions

ELM is efficient and fast, which has brought about a further breakthrough in research on feedforward neural networks. However, ELM has some drawbacks: it is sensitive to noise and outliers, it is susceptible to overfitting, which degrades generalization performance, and it has poor stability. From a variety of perspectives, it is not the most adequate model, and there is much room for improvement.

Our main work is summarized as follows: (1) by investigating the properties of noise models in real wind-speed forecasting, we find that the noise follows a zero-mean Gaussian distribution with heteroscedastic variance; (2) the optimal empirical risk loss function for heteroscedastic noise is derived using the Bayesian principle and the maximum a posteriori method; (3) the regression models of the twin extreme learning machine based on heteroscedastic Gaussian noise (TELM-HGN) and the twin extreme learning machine based on homoscedastic Gaussian noise (TELM-GN) are established; (4) the dual problems of TELM-HGN and TELM-GN are obtained from the Lagrange function according to the KKT conditions; (5) TELM-HGN is solved by the ALM method, which guarantees the effectiveness and stability of the algorithm. Experiments on the stock, Boston house price, and wind-speed data sets validate the effectiveness and accuracy of the proposed model in comparison with several other published algorithms. Besides, TELM-HGN not only maintains the advantages of ELM, namely simple parameter setting and rapid convergence, but also compensates for its sensitivity to noise and outliers and its poor generalization performance. Because of its fast throughput and strong generalization performance, it is particularly suited for large-scale data processing.

However, this work solely addresses heteroskedastic Gaussian noise in regression models. In more practical situations, the real distribution of noise is complex and changeable. Considering the limited ability of heteroscedastic Gaussian noise to approximate complex noise, we will investigate alternative mixed distributions for modeling the noise in real problems. In addition, the approach can be extended to classification learning; in future work, we will study classification problems with mixed noise characteristics.

Author Contributions

S.G. Zhang and D. Guo drafted the manuscript, conceived the algorithm, and designed the experiments; D. Guo implemented the experiments; T. Zhou analyzed the results. All authors read and revised the manuscript.

Funding

This work was supported by the Natural Science Foundation of Shandong Province (ZR2022MF242); Key research and development plan of Shandong Province (2019GGX101056).

Conflicts of interest

The authors declare no conflict of interest.

References

1. Huang G.B., Zhu Q.Y. and Siew C.K., Extreme learning machine: theory and applications, Neurocomputing 70(1-3) (2006), 489–501.
2. Huang G.B., Zhu Q.Y. and Siew C.K., Extreme learning machine: a new learning scheme of feedforward neural networks, In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks 2(04CH37541) (2004), 985–990.
3. Zou, Lu, Jiang and Xie, A fast and precise indoor localization algorithm based on an online sequential extreme learning machine, Sensors 15(1) (2015), 1804–1824.
4. Wang H.B., Wang and Hu Q.H., Self-adaptive robust nonlinear regression for unknown noise via mixture of Gaussians, Neurocomputing 235 (2017), 274–286.
5. Wang X.Y. and Han, Online sequential extreme learning machine with kernels for nonstationary time series prediction, Neurocomputing 145 (2014), 90–97.
6. Gao Z.K., Yu, Zhao, Hu and Yang, A hybrid method of cooling load forecasting for large commercial building based on extreme learning machine, Energy 238 (2022), 122073.
7. Sun and Huang, Predictions of carbon emission intensity based on factor analysis and an improved extreme learning machine from the perspective of carbon emission efficiency, Journal of Cleaner Production 338 (2022), 130414.
8. Miche, Eirola, et al., Regularized extreme learning machine for regression with missing data, Neurocomputing 102 (2013), 45–51.
9. Cao J.W., Zhang, Luo M.X., et al., Extreme learning machine and adaptive sparse representation for image classification, Neural Networks 81 (2016), 91–102.
10. Feng X.Y., Liang Y.C., Shi X.H., et al., Overfitting reduction of text classification based on AdaBELM, Entropy 19(7) (2017), 330–342.
11. Zhou Z.Y., Liu M.X., Deng W.X., Wang Y.M. and Zhu Z.F., Clothing image classification algorithm based on convolutional neural network and optimized regularized extreme learning machine, 92(23-24) (2022), 5106–5124.
12. Huang Z.Y., Yu and Gu, An efficient method for traffic sign recognition based on extreme learning machine, IEEE Transactions on Cybernetics 47(4) (2016), 920–933.
13. Mazinani, Ismail Z.B., Shamshirband, et al., Estimation of tsunami bore forces on a coastal bridge using an extreme learning machine, Entropy 18(5) (2016), 167–182.
14. Zhou Z.Y., Deng W.X., Zhu Z.F., Wang Y.M., Du J.Y. and Liu X.Q., Fabric defect detection based on feature fusion of a convolutional neural network and optimized extreme learning machine, Textile Research Journal 92(7-8) (2022), 1161–1192.
15. Sun, Xu J.T., Jiang C.M., et al., Extreme learning machine for multi-label classification, Entropy 18(6) (2016), 225–237.
16. Kashif, Wu Y.Z. and Michael, Consonant phoneme based extreme learning machine (ELM) recognition model for foreign accent identification, Proceedings of the 2019 World Symposium on Software Engineering (2019), 68–72.
17. Albu, Hagiescu, Vladutu and Puica M.A., Neural network approaches for children’s emotion recognition in intelligent learning applications, EDULEARN15 7th Annu Int Conf Educ New Learn Technol, Barcelona, Spain, 6th-8th, 2015.
18. Huang G.B., Zhou H.M., Ding X.J. and Zhang, Extreme learning machine for regression and multiclass classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42(2) (2011), 513–529.
19. Chen, Lv, Lu, et al., Robust regularized extreme learning machine for regression using iteratively reweighted least squares, Neurocomputing 230 (2017), 345–358.
20. Deng W.Y., Zheng Q.H. and Chen, Regularized extreme learning machine, In Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining, pp. 389–395, April, 2009.
21. Horata, Chiewchanwattana and Sunat, Robust extreme learning machine, Neurocomputing 102 (2013), 31–44.
22. Xing H.J. and Wang X.M., Training extreme learning machine via regularized correntropy criterion, Neural Computing and Applications 23(7-8) (2013), 1977–1986.
23. Ren L.R., Gao Y.L., Liu J.X., et al., L2,1-extreme learning machine: an efficient robust classifier for tumor classification, Computational Biology and Chemistry 89 (2020). DOI: 10.1016/j.compbiolchem.2020.107368
24. Liu Q.G., He and Shi, Extreme support vector machine classifier, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Heidelberg, 5012 (2008), 222–233.
25. Frenay and Verleysen, Using SVMs with randomised feature spaces: an extreme learning approach, Proceedings of the 18th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges, Belgium, 28-30 April 2010, pp. 315–320.
26. Huang G.B., Ding X.J. and Zhou H.M., Optimization method based extreme learning machine for classification, Neurocomputing 74(1-3), 155–163. DOI: 10.1016/j.neucom.2010.02.019
27. Wan Y.H., Song S.J., Huang, et al., Twin extreme learning machines for pattern classification, Neurocomputing 260(18) (2017), 235–244. DOI: 10.1016/j.neucom.2017.04.036
28. Peng X.J., TSVR: an efficient twin support vector machine for regression, Neural Networks: the Official Journal of the International Neural Network Society 23(3) (2010), 365–372. DOI: 10.1016/j.neunet.2009.07.002
29. Khemchandani Jayadeva and Suresh Chandra, Twin support vector machines for pattern classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 29(5) (2007), 905–910. DOI: 10.1109/TPAMI.2007.1068
30. Cortes and Vladimir, Support-vector networks, Machine Learning 20(3) (1995), 273–297.
31. Zhang S.G., Liu, Zhou and Sun, Twin least squares support vector regression of heteroscedastic Gaussian noise model, IEEE Access 8(19659793) (2020), 94076–94088. DOI: 10.1109/ACCESS.2020.2995615
32. Zhao Y.P., Zhao and Zhao, Twin least squares support vector regression, Neurocomputing 118 (2013), 225–236. DOI: 10.1016/j.neucom.2013.03.005
33. Huang G.B. and Chen, Convex incremental extreme learning machine, Neurocomputing 70(16-18) (2007), 3056–3062. DOI: 10.1016/j.neucom.2007.02.009
34. Huang G.B. and Chen, Enhanced random search based incremental extreme learning machine, Neurocomputing 71(16-18) (2008), 3460–3468. DOI: 10.1016/j.neucom.2007.10.008
35. Huang G.B., Chen and Siew C.K., Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Transactions on Neural Networks 17(4) (2006), 879–892. DOI: 10.1109/TNN.2006.875977
36. Liang N.Y., Huang G.B., Saratchandran and Sundararajan, A fast and accurate online sequential learning algorithm for feedforward networks, IEEE Transactions on Neural Networks 17(9173292) (2006), 1411–1423. DOI: 10.1109/TNN.2006.880583
37. Q.H., Zhang S.G., Yu and Xie Z.X., Short-term wind speed or power forecasting with heteroscedastic support vector regression, IEEE Transactions on Sustainable Energy 7(1) (2016), 241–249. DOI: 10.1109/TSTE.2015.2480245
38. Chu, Keerthi S.S. and Ong C.J., Bayesian support vector regression using a unified loss function, IEEE Transactions on Neural Networks 15(1) (2004), 29–44. DOI: 10.1109/TNN.2003.820830
39. J.Q., Shi W.M. and Yang D.H., Clothing image classification with a dragonfly algorithm optimised online sequential extreme learning machine, Fibres Textiles in Eastern Europe 29(3) (2021), 90–95.
40. J.Q., Shi W.M. and Yang D.H., Color difference classification of dyed fabrics via a kernel extreme learning machine based on an improved grasshopper optimization algorithm, Color Research and Application 46(2) (2021), 388–401.
41. Nieto P.J.G., García-Gonzalo, Fernández J.R.A., et al., A hybrid PSO optimized SVM-based model for predicting a successful growth cycle of the Spirulina platensis from raceway experiments data, Journal of Computational and Applied Mathematics 291(1) (2016), 293–303. DOI: 10.1016/j.cam.2015.01.009
42. Wang K.N. and Zhong, Robust non-convex least squares loss function for regression with outliers, Knowledge-Based Systems 71 (2014), 290–302. DOI: 10.1016/j.knosys.2014.08.003
43. Wong S.Y., Yap K.S. and Yap H.J., A constrained optimization based extreme learning machine for noisy data regression, Neurocomputing 171(1) (2016), 1431–1443. DOI: 10.1016/j.neucom.2015.07.065