A noise-resilient online learning algorithm with ramp loss for ordinal regression

Abstract

Ordinal regression has been widely used in applications, such as credit portfolio management, recommendation systems, and ecology, where the core task is to predict labels on ordinal scales. Due to its learning efficiency, online ordinal regression using passive aggressive (PA) algorithms has gained much attention for solving large-scale ranking problems. However, the PA method is sensitive to noise, especially in the scenario of streaming data, where the ranking of data samples may change dramatically. In this paper, we propose a noise-resilient online learning algorithm using the Ramp loss function, called PA-RAMP, to improve the performance of the PA method for noisy data streams. We also validate the order preservation of the thresholds of the proposed algorithm. Experiments on real-world data sets demonstrate that the proposed noise-resilient online ordinal regression algorithm is more robust and efficient than state-of-the-art online ordinal regression algorithms.

Keywords

Ordinal regression, online learning, PA-RAMP algorithm, ramp loss

1. Introduction

Ranking plays a central role in the learning task where the labels of data samples need to be ordered. For example, the levels of bond credits can be sorted as “B” $<$ “A” $<$ “AA” $<$ “AAA” [1], and the preferences of movies can be labeled as ( “do-not-bother” $<$ “only-if-you-must” $<$ “good” $<$ “very-good” $<$ “run-to-see”) [2].

As an important tool, ordinal regression, also called ordinal classification, has been successfully used for the ranking task in a wide variety of applications, e.g. collaborative filtering [3, 4], ecology [5], and detecting the severity of Alzheimer's disease [6], to name a few. A seminal work on ordinal regression can be traced back to the proportional odds model (POM) proposed by McCullagh [7], in which a general class of regression models for ordinal data was studied. In order to balance the model complexity and model fitness over the training dataset, Herbrich et al. [8] studied ordinal regression under the framework of structural risk minimization and proposed a new distribution-independent learning algorithm based on double-rank inter-loss functions. Later, the algorithms of sorting with constraint conditions in [9] provided a unified framework for classification and regression. Shashua and Levin [10] used support vector regression to order labels, in which a continuous value range is divided into $r$ continuums by selecting $r-1$ thresholds. However, selecting thresholds for the continuums is still a significant challenge. It is worth mentioning that Chu et al. [11] proposed support vector ordinal regression algorithms with explicit constraints (SVOR-EXP) and implicit constraints (SVOR-IMC). In the latter, the ordering of the thresholds is automatically satisfied at the optimal solution even though there is no explicit constraint on the thresholds. Replacing the explicit constraints with implicit ones significantly reduces the number of constraints and simplifies the algorithm. Nevertheless, solving the optimization problem still requires a huge amount of computation time and memory.

Compared with a classical offline algorithm, an online learning algorithm uses only a few data samples for model training at each step and, thus, needs much less storage and computation time. Although existing online algorithms have made huge progress in reducing the memory requirement and computational time for big data, training models over a dataset of poor quality, e.g. a noisy dataset, remains challenging.

In this work, we study a noise-resilient online learning algorithm for the online ordinal regression problem on noisy data streams. In particular, we propose the PA algorithm with Ramp loss (PA-RAMP) for ordinal regression in an online learning manner using interval labels. Our key contributions are as follows:

1) Propose a noise-resilient algorithm, i.e. PA-RAMP, and design a procedure to update the model parameters for online ordinal regression. The parameters are obtained by solving a sequence of convex problems using the Concave-Convex Procedure (CCCP).
2) Present the support class algorithm with Ramp loss (SCA-RAMP), which finds the support class set describing the thresholds that need to be updated by identifying the active constraints in the Karush-Kuhn-Tucker (KKT) system.
3) Theoretically prove that the proposed PA-RAMP algorithm maintains the order of the thresholds.
4) Perform experiments on various data sets and show the effectiveness of the proposed algorithm by comparing it with state-of-the-art methods.

The paper is organized as follows. In Section 2, we discuss the related work. Section 3 presents a generic framework of ordinal regression using interval labels and the update rules for PA-RAMP, and discusses the order preservation guarantee for the thresholds of the proposed algorithm. In Section 4, we conduct experiments and present the comparison results between PA-RAMP and state-of-the-art methods. We conclude this work with some remarks in Section 5.
2. Related works

In this section, we review the related work from two aspects: online ordinal regression and Ramp loss.

Machine learning approaches use loss functions to penalize learning errors in the training process, and thus the selection of the loss function is critical to the performance of learning models. Many works have studied the effect of the loss function on ordinal regression. Pedregosa [12] proposed a novel surrogate of the squared error loss; the ordinal regression problem can then be solved by minimizing a convex surrogate of the zero-one, absolute, or squared loss functions. Recently, online learning has gained a lot of attention due to the emergence of big data. Manwani [13] proposed an online learning algorithm, PRIL (perceptron ranking using interval labels), for ordinal regression. The algorithm converges in finite steps, and its regret bound was used as an index to evaluate the algorithm convergence. More recently, Sahoo et al. [14] proposed a set of online multiple kernel regression (OMKR) algorithms for nonlinear regression problems. The application of OMKR algorithms was discussed for prediction tasks in AR, ARMA and ARIMA time series. An exact passive-aggressive (PA) online algorithm for ordinal regression was proposed in [15, 16], where a series of convex subproblems is solved iteratively in each round.

While many works focus on improving algorithmic performance, the study of noise tolerance for ordinal regression algorithms is still in its early stages. Since the standard binary support vector machine (SVM) adopts the Hinge loss function, its decision hyperplane is determined by several support vectors. However, outliers tend to have the greatest marginal loss and become support vectors (SVs), which determine the hyperplane structure and therefore distort the decision hyperplane. As a result, traditional support vector machines are sensitive to outliers or noise. Because the Hinge loss function is sensitive to noise, different loss functions, such as Pinball loss and Ramp loss, have been studied to reduce the sensitivity of SVM-based models. Huang et al. [17] found that the SVM with Ramp loss (RSVM) has stronger robustness and sparsity than the standard SVM. Compared with Pinball loss, this algorithm retains the sparsity to a large extent. The Ramp loss function gives an upper bound on the loss of any outlier, and thus RSVM is not very sensitive to the presence of outliers. However, using the Ramp loss poses a computational challenge because the RSVM model is non-convex. To solve this problem, the concave-convex procedure (CCCP) is introduced to convert the non-convex optimization problem into a series of convex optimization problems. Many researchers have studied the Ramp loss in binary SVMs, such as the non-parallel Ramp loss SVM proposed by Liu et al. [18] and the least square SVM proposed by Liu et al. [19]. Tian et al. [20] proposed a single-class SVM with Ramp loss and Lu et al. [21] proposed a double-spherical SVM with the largest Ramp loss, etc.

3. Method

3.1 Learning to rank in online ordinal regression

Compared with the traditional batch learning framework, online learning (shown in Fig. 1) takes training samples in a streaming fashion, so that it scales easily and responds in real time. We first introduce the PA algorithm for online ordinal regression and its variants, and then propose a noise-resilient online learning algorithm in the next section.

Figure 1.

The framework of online ordinal regression. At round $t$ , the learner is given a question, $x^{t}\in\mathcal{X}$ , and is required to provide an answer to this question, $f(x^{t})$ . After predicting an answer, it receives the correct rank $y^{t}$ and updates its ranking rule by modifying $w$ , which makes the approach scalable and suitable for real-time use.

Consider a general setting for online ordinal regression [15]. Let $\mathcal{X}\subset\mathbb{R}^{d}$ be the instance space, $\mathcal{Y}=\{1,\cdots,K\}$ the label space with “ $<$ ” as the order relation, and $(x^{1},y^{1}),\cdots,(x^{T},y^{T})$ a sequence of instance-rank pairs, where $x^{t}\in\mathbb{R}^{d}$ and $y^{t}\in\mathcal{Y}$ is its corresponding rank, $t=1,\cdots,T$ . Unlike the general online ordinal regression, we consider interval labels for the sequence pairs. Every instance $x\in\mathcal{X}$ is assigned an interval label $[y_{l},y_{r}]\in\mathcal{Y}\times\mathcal{Y}$ , where $y_{l}$ , $y_{r}\in\mathcal{Y}$ . Let $S=\{(x^{1},y_{l}^{1},y_{r}^{1}),\cdots,(x^{T},y_{l}^{T},y_{r}^{T})\}$ be the training set. In particular, the setting reduces to the exact label scenario when $y_{l}=y_{r}$ . In this case, the upper interval endpoint is defined as

$\displaystyle b(x)=\min\limits_{i\in[K]}\{i:f(x)-\theta_{i}<0\},$ (1)

where the function $f:\mathcal{X}\rightarrow\mathbb{R}$ is defined as $f(x)=w\cdot x$ , the thresholds satisfy $\theta_{1}\leqslant\cdots\leqslant\theta_{K-1}$ with $\theta_{K}=\infty$ , and $[K]=\{1,\cdots,K\}$ . Similarly, we can define the lower interval endpoint. The Hinge loss function, as shown in Fig. 2, is used to capture the discrepancy between the actual label and the predicted label, where its argument $z$ ranges from $-\infty$ to $+\infty$ and is given by

$\displaystyle z=\left\{\begin{array}{lcl}f(x)-\theta_{i},&&i\in\{1,\cdots,y_{l}^{t}-1\}\\ \theta_{i}-f(x),&&i\in\{y_{r}^{t},\cdots,K-1\}.\\ \end{array}\right.$ (2)

Taking $i\in\{1,\cdots,y_{l}^{t}-1\}$ as an example, Fig. 2 shows the loss only on the interval $[-6,6]$ . Notice that when a sample is noisy, $f(x)$ can be far less than $\theta_{i}$ , that is, $f(x)-\theta_{i}$ tends to $-\infty$ . In this case, the value of the Hinge loss tends to $+\infty$ . The loss function of online ordinal regression is defined as follows [15].

$\displaystyle L_{\textit{IMC}}(f(x),\theta,y_{l},y_{r})=\sum\limits_{i=1}^{y_{l}-1}l_{i}+\sum\limits_{i=y_{r}}^{K-1}l_{i}=\sum_{i=1}^{y_{l}-1}[1-f(x)+\theta_{i}]_{+}+\sum_{i=y_{r}}^{K-1}[1+f(x)-\theta_{i}]_{+}.$ (3)

As the Hinge loss $[1-f(x)+\theta_{i}]_{+}$ is convex for each $i\in\{1,\cdots,y_{l}^{t}-1\}$ , the sum of these terms is convex. Similarly, the sum of $[1+f(x)-\theta_{i}]_{+}$ over $i\in\{y_{r}^{t},\cdots,K-1\}$ is convex, so $L_{\textit{IMC}}$ is convex. However, the Hinge loss function is primarily designed to learn from clean data: the loss of a badly misplaced sample grows without bound, so label noise can dominate the loss and degrade the overall performance of the model.
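The prediction rule of Eq. (1) and the interval Hinge loss of Eq. (3) can be illustrated with a minimal sketch. The function names `predict_rank` and `interval_hinge_loss` and the toy numbers are assumptions made for illustration only; thresholds are stored as a list of $K-1$ values, so threshold $\theta_{i}$ sits at position `i - 1`.

```python
def predict_rank(w, theta, x):
    """Smallest rank i with f(x) - theta_i < 0, as in Eq. (1); theta_K = +inf."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    for i, th in enumerate(theta, start=1):
        if score - th < 0:
            return i
    return len(theta) + 1  # no finite threshold exceeds the score: rank K


def interval_hinge_loss(w, theta, x, y_l, y_r):
    """Interval Hinge loss of Eq. (3) for the interval label [y_l, y_r]."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    # Thresholds 1..y_l-1 should lie at least one unit below the score.
    left = sum(max(0.0, 1.0 - score + th) for th in theta[: y_l - 1])
    # Thresholds y_r..K-1 should lie at least one unit above the score.
    right = sum(max(0.0, 1.0 + score - th) for th in theta[y_r - 1:])
    return left + right


w = [1.0, 0.0]
theta = [-1.0, 0.0, 1.0]  # K = 4 ranks

clean_rank = predict_rank(w, theta, [0.5, 2.0])
clean_loss = interval_hinge_loss(w, theta, [0.5, 2.0], 3, 3)
# A sample labelled rank 1 whose score is far above every threshold.
noisy_loss = interval_hinge_loss(w, theta, [10.0, 0.0], 1, 1)
```

In this toy setting the noisy sample incurs a loss that grows linearly with its score, which is the unbounded behaviour the Ramp loss is meant to cap.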

Figure 2.

Hinge loss (left) and Ramp loss function (right).

Based on the above analysis, we propose a noise-resilient online learning algorithm in the next section, after first introducing the PA algorithm for online ordinal regression and its variants.

3.2 Exact passive aggressive algorithms with ramp loss

3.2.1 PA algorithm and its variants

The PA algorithm was first introduced in [15] for ordinal regression in an online fashion. Let $x^{t}$ be the example observed at trial $t$ . Let $w^{t}\in\mathbb{R}^{d}$ and $\theta^{t}\in\mathbb{R}^{K-1}$ be the parameters of the online ordinal regression model at time $t$ . The PA algorithm employs an aggressive update strategy that modifies the weight vector as much as needed to satisfy the constraints imposed by the current example. For the current example, it finds $w^{t+1}$ and $\theta^{t+1}$ closest to $w^{t}$ and $\theta^{t}$ such that the loss $L_{\textit{IMC}}$ becomes 0. That is,

$\displaystyle(w^{t+1},\theta^{t+1})=\mathop{\textit{argmin}}\limits_{w,\theta}\frac{1}{2}\|w-w^{t}\|^{2}+\frac{1}{2}\|\theta-\theta^{t}\|^{2}$

$\displaystyle s.t.\left\{\begin{array}{lcl}w\cdot x^{t}-\theta_{i}\geqslant 1&&i=1,\cdots,y_{l}^{t}-1\\ w\cdot x^{t}-\theta_{i}\leqslant-1&&i=y_{r}^{t},\cdots,K-1.\\ \end{array}\right.\text{(PA)}$ (4)

For Eq. (4), the algorithm is passive when the Hinge loss is zero, that is, $w^{t+1}=w^{t}$ whenever $l_{i}^{t}=0$ for all $i$ . Otherwise, the algorithm aggressively forces $w^{t+1}$ to satisfy the constraint $l(w^{t+1};(x^{t},y^{t}))=0$ regardless of the step size required. This is the source of the name Passive-Aggressive, or PA for short. The update is a trade-off between the amount of progress made on each round and the amount of information retained from previous rounds.

In addition, the PA-I variant of the PA algorithm is as follows:

$\displaystyle(w^{t+1},\theta^{t+1},\xi^{t+1})=\mathop{\textit{argmin}}\limits_{w,\theta,\xi}\frac{1}{2}\|w-w^{t}\|^{2}+\frac{1}{2}\|\theta-\theta^{t}\|^{2}+C\left(\sum\limits_{i=1}^{y_{l}^{t}-1}\xi_{i}+\sum\limits_{i=y_{r}^{t}}^{K-1}\xi_{i}\right)$

$\displaystyle s.t.\left\{\begin{array}{ll}w\cdot x^{t}-\theta_{i}\geqslant 1-\xi_{i};\ \xi_{i}\geqslant 0&i=1,\cdots,y_{l}^{t}-1\\ w\cdot x^{t}-\theta_{i}\leqslant-1+\xi_{i};\ \xi_{i}\geqslant 0&i=y_{r}^{t},\cdots,K-1,\\ \end{array}\right.\text{(PA-I)}$ (5)

where $C$ is the aggressiveness parameter, and $\xi_{i}$ is a slack variable.
3.2.2 PA-RAMP algorithm

In this section, we present the PA-RAMP procedure, which minimizes the estimated risk in online ordinal regression by exploiting the difference-of-convex (DC) structure of the Ramp loss together with the concave-convex procedure (CCCP) [22].

(1) Ramp loss

To reduce the negative influence of outliers in PA, the Ramp loss function, i.e. the robust Hinge loss, has been proposed. It can be written as follows,

$\displaystyle l_{R}(z)=\left\{\begin{array}{lcl}1,&&z<-1\\ \frac{1}{2}(1-z),&&-1\leqslant z\leqslant 1\\ 0,&&z>1.\\ \end{array}\right.$ (6)

Figure 3.

Ramp loss is a DC function.

Notice that the Ramp loss is a noise-resilient function: it is 0 for $z>1$ and stays at a constant value as $z$ tends to $-\infty$ , as illustrated in Fig. 2, so a noisy sample contributes only a bounded loss. To offset the influence of noisy data, we apply the Ramp loss in place of the Hinge loss function, and the resulting problem can be solved by CCCP. The problem of minimizing the estimated risk can then be recast as a DC program. That is, $l_{R}(z)$ can be represented as the difference of two convex functions:

$\displaystyle l_{R}(z)=h_{1}(z)-h_{s}(z),$ (7)

where $h_{1}(z)=\max(0,\frac{1}{2}(1-z))$ , $h_{s}(z)=\max(0,\frac{1}{2}(s-z))$ , as illustrated in Fig. 3, where $s=-1$ .
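As a minimal sketch of the decomposition in Eq. (7), the Ramp loss can be evaluated as the difference of the two scaled Hinge functions $h_{1}$ and $h_{s}$ with $s=-1$. The helper names `h` and `ramp_loss` are illustrative assumptions, not part of the original formulation.

```python
def h(z, s):
    """Scaled Hinge function max(0, (s - z) / 2) used in Eq. (7)."""
    return max(0.0, 0.5 * (s - z))


def ramp_loss(z, s=-1.0):
    """Ramp loss as the difference of two convex functions, h_1(z) - h_s(z)."""
    return h(z, 1.0) - h(z, s)


# Evaluate on a few margins, including a far-away outlier at z = -100.
values = {z: ramp_loss(z) for z in (-100.0, -1.0, 0.0, 1.0, 5.0)}
```

For $s=-1$ the difference is constant for every $z<-1$, so an outlier with a very negative margin contributes the same bounded loss as a sample at $z=-1$.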

(2) PA-RAMP algorithm

In the setting of ordinal regression, instead of letting all the categories contribute errors to each threshold, we allow the Ramp loss function to update the thresholds. The ordinal inequalities on the thresholds are satisfied automatically at the optimal solution. Figure 4 illustrates the roles of the slack variables $\xi$ and $\xi^{*}$ . At threshold $\theta_{j}$ , the function value $w\cdot x^{t}$ of a sample that should lie below $\theta_{j}$ is required to be less than the lower margin $\theta_{j}-1$ . If it is not, then $\xi_{ki}^{*j}=w\cdot x^{t}-(\theta_{j}-1)$ is the error associated with the sample $x^{t}$ in trial $t$ . In the same way, the function value of a sample that should lie above $\theta_{j}$ is required to be greater than the upper margin $\theta_{j}+1$ . Here, the subscript $ki$ denotes the slack variable of the $i$ th sample in the $k$ th category, the superscript $j$ denotes that the slack variable is associated with the lower bound of $\theta_{j}$ , and, conversely, the superscript $*j$ denotes that the slack variable is associated with the upper bound of $\theta_{j}$ .

Figure 4.

An illustration of the slack variables in online ordinal regression.

We now derive the PA-RAMP algorithm for online ordinal regression. In each trial, the current parameters are used to predict the label(s), the actual label(s) are then observed, and the parameters are updated. Starting from Eq. (5), the slack variables are eliminated and the Ramp loss is introduced to obtain an unconstrained optimization problem, defined as follows.

$\displaystyle\begin{aligned}
(w^{t+1},\theta^{t+1})&=\mathop{\textit{argmin}}\limits_{w,\theta}\frac{1}{2}\|w-w^{t}\|^{2}+\frac{1}{2}\|\theta-\theta^{t}\|^{2}+C\sum\limits_{i=1}^{y_{l}^{t}-1}\textit{ramp}(\langle w,x^{t}\rangle-\theta_{i})+C\sum\limits_{i=y_{r}^{t}}^{K-1}\textit{ramp}(\theta_{i}-\langle w,x^{t}\rangle)\\
&=\mathop{\textit{argmin}}\limits_{w,\theta}\frac{1}{2}\|w-w^{t}\|^{2}+\frac{1}{2}\|\theta-\theta^{t}\|^{2}+C\sum\limits_{i=1}^{y_{l}^{t}-1}l_{1}(\langle w,x^{t}\rangle-\theta_{i})-C\sum\limits_{i=1}^{y_{l}^{t}-1}l_{s}(\langle w,x^{t}\rangle-\theta_{i})\\
&\qquad+C\sum\limits_{i=y_{r}^{t}}^{K-1}l_{1}(\theta_{i}-\langle w,x^{t}\rangle)-C\sum\limits_{i=y_{r}^{t}}^{K-1}l_{s}(\theta_{i}-\langle w,x^{t}\rangle)\\
&=\mathop{\textit{argmin}}\limits_{w,\theta}\underbrace{\frac{1}{2}\|w-w^{t}\|^{2}+\frac{1}{2}\|\theta-\theta^{t}\|^{2}+C\sum\limits_{i=1}^{y_{l}^{t}-1}l_{1}(\langle w,x^{t}\rangle-\theta_{i})+C\sum\limits_{i=y_{r}^{t}}^{K-1}l_{1}(\theta_{i}-\langle w,x^{t}\rangle)}_{\textit{convex}}\\
&\qquad\underbrace{-\left[C\sum\limits_{i=1}^{y_{l}^{t}-1}l_{s}(\langle w,x^{t}\rangle-\theta_{i})+C\sum\limits_{i=y_{r}^{t}}^{K-1}l_{s}(\theta_{i}-\langle w,x^{t}\rangle)\right]}_{\textit{concave}}\\
&\triangleq\mathop{\textit{argmin}}\limits_{w,\theta}F_{1}(w,\theta)-F_{2}(w,\theta).
\end{aligned}$ (8)

In the above equation, we take the upper interval endpoint as an example. Since $C\sum\limits_{i=y_{r}^{t}}^{K-1}\textit{ramp}(\theta_{i}-\langle w,x^{t}\rangle)$ is a DC function, we write it as the difference of two convex functions, i.e.

$\displaystyle C\sum\limits_{i=y_{r}^{t}}^{K-1}l_{1}(\theta_{i}-\langle w,x^{t}\rangle)-C\sum\limits_{i=y_{r}^{t}}^{K-1}l_{s}(\theta_{i}-\langle w,x^{t}\rangle).$

Hence the objective function in Eq. (8) can be written as the sum of a convex function and a concave function, $F_{1}(w,\theta)-F_{2}(w,\theta)$ , where

$\displaystyle F_{1}(w,\theta)=\frac{1}{2}\|w-w^{t}\|^{2}+\frac{1}{2}\|\theta-\theta^{t}\|^{2}+C\sum\limits_{i=1}^{y_{l}^{t}-1}l_{1}(\langle w,x^{t}\rangle-\theta_{i})+C\sum\limits_{i=y_{r}^{t}}^{K-1}l_{1}(\theta_{i}-\langle w,x^{t}\rangle),$ (9)

$\displaystyle F_{2}(w,\theta)=C\sum\limits_{i=1}^{y_{l}^{t}-1}l_{s}(\langle w,x^{t}\rangle-\theta_{i})+C\sum\limits_{i=y_{r}^{t}}^{K-1}l_{s}(\theta_{i}-\langle w,x^{t}\rangle).$ (10)

Note that $F_{1}(w,\theta)$ and $F_{2}(w,\theta)$ are both convex functions, so $-F_{2}(w,\theta)$ is concave.

Let $\beta=(w,\theta)$ , so

$\displaystyle\beta^{t+1}\triangleq\mathop{\textit{argmin}}\limits_{\beta}F_{1}(\beta)-F_{2}(\beta).$ (11)

Because of the non-convex term in the objective function of Eq. (11), we employ the CCCP algorithm to solve it. To obtain a solution of Eq. (11), we solve the following problem iteratively: the concave term $-F_{2}(\beta)$ is approximated by its linearization at the current iterate $\beta^{t}$ , which converts the non-convex optimization into a convex optimization problem. We have

$\displaystyle\beta^{t+1}\overset{\textit{CCCP}}{=}\mathop{\textit{argmin}}\limits_{\beta}F_{1}(\beta)-\langle\nabla F_{2}(\beta^{t}),\beta\rangle,$ (12)

where the term $\nabla F_{2}(\beta^{t})$ is the gradient of $F_{2}$ with respect to the optimization variable $\beta$ , evaluated at $\beta^{t}$ . Its expression is as follows:

$\displaystyle\nabla F_{2}(\beta^{t})=\nabla F_{2}(\beta)\Big|_{\beta=\beta^{t}}=\left[\begin{array}{c}\nabla_{w}F_{2}(\beta)\\ \nabla_{\theta}F_{2}(\beta)\\ \end{array}\right]_{\beta=\beta^{t}},$ (13)

whose entries are determined by the derivatives $l_{s}^{\prime}(\langle w^{t},x^{t}\rangle-\theta_{i}^{t})$ for $i=1,\cdots,y_{l}^{t}-1$ and $l_{s}^{\prime}(\theta_{i}^{t}-\langle w^{t},x^{t}\rangle)$ for $i=y_{r}^{t},\cdots,K-1$ .

The partial derivatives of $F_{2}$ with respect to $w$ and $\theta$ , evaluated at $\beta^{t}$ , are respectively

$\displaystyle\nabla_{w}F_{2}(\beta^{t})=C\sum\limits_{i=1}^{y_{l}^{t}-1}l_{s}^{\prime}(z)\Bigg|_{z=\langle w^{t},x^{t}\rangle-\theta_{i}^{t}}\cdot x^{t}+C\sum\limits_{i=y_{r}^{t}}^{K-1}l_{s}^{\prime}(z)\Bigg|_{z=\theta_{i}^{t}-\langle w^{t},x^{t}\rangle}\cdot(-x^{t}),$

$\displaystyle\nabla_{\theta_{i}}F_{2}(\beta^{t})=C\,l_{s}^{\prime}(z)\Bigg|_{z=\langle w^{t},x^{t}\rangle-\theta_{i}^{t}}\cdot(-1),\quad i\in[y_{l}^{t}-1],$ (14)

$\displaystyle\nabla_{\theta_{i}}F_{2}(\beta^{t})=C\,l_{s}^{\prime}(z)\Bigg|_{z=\theta_{i}^{t}-\langle w^{t},x^{t}\rangle}\cdot 1,\quad i=y_{r}^{t},\cdots,K-1,$

where $l_{s}(z)=\max\{0,s-z\}$ is the Hinge loss shifted to $s$ , which is 0 when $z>s$ . The derivative of $l_{s}(z)$ is the following piecewise constant function (taking the subgradient 0 at $z=s$ ),

$\displaystyle l_{s}^{\prime}(z)=\left\{\begin{array}{lcl}-1,&&z<s\\ 0,&&z\geqslant s.\\ \end{array}\right.$ (15)
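The gradient of $F_{2}$ can be checked numerically with a short sketch. It assumes $l_{s}(z)=\max\{0,s-z\}$ with $s=-1$, whose derivative is $-1$ for $z<s$ and $0$ otherwise, and it stores threshold $\theta_{i}$ at list position `i - 1`; the function names and the toy data are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ls(z, s=-1.0):
    """Shifted Hinge loss l_s(z) = max(0, s - z)."""
    return max(0.0, s - z)


def ls_prime(z, s=-1.0):
    """Derivative of l_s: -1 where the loss is active (z < s), else 0."""
    return -1.0 if z < s else 0.0


def F2(w, theta, x, y_l, y_r, C, s=-1.0):
    score = dot(w, x)
    left = sum(ls(score - th, s) for th in theta[: y_l - 1])
    right = sum(ls(th - score, s) for th in theta[y_r - 1:])
    return C * (left + right)


def grad_F2(w, theta, x, y_l, y_r, C, s=-1.0):
    """Gradient of F2 with respect to w and each threshold theta_i."""
    score = dot(w, x)
    grad_w = [0.0] * len(w)
    grad_theta = [0.0] * len(theta)
    for i in range(1, y_l):  # left thresholds i = 1..y_l-1
        d = ls_prime(score - theta[i - 1], s)
        grad_theta[i - 1] = -C * d
        grad_w = [g + C * d * xj for g, xj in zip(grad_w, x)]
    for i in range(y_r, len(theta) + 1):  # right thresholds i = y_r..K-1
        d = ls_prime(theta[i - 1] - score, s)
        grad_theta[i - 1] = C * d
        grad_w = [g - C * d * xj for g, xj in zip(grad_w, x)]
    return grad_w, grad_theta


w, theta, x = [1.0, 0.0], [-1.0, 0.0, 4.0], [-3.0, 1.0]
grad_w, grad_theta = grad_F2(w, theta, x, 3, 3, C=0.5)

# Finite-difference check of the first weight coordinate.
eps = 1e-6
fd_w0 = (F2([w[0] + eps, w[1]], theta, x, 3, 3, 0.5) - F2(w, theta, x, 3, 3, 0.5)) / eps
```

Only thresholds whose margin lies in the flat region of the Ramp loss ( $z<s$ ) contribute to the gradient, which is what shifts the centre of the convex subproblem for outlying samples.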

Combining Eqs (9)–(12), we only need to solve the following convex optimization problem at each iteration of CCCP, and the solution of the objective function Eq. (12) is obtained by iteratively updating the optimization variable $\beta$ according to the following rule,

$\displaystyle\begin{aligned}
\beta^{t+1}&=\mathop{\textit{argmin}}\limits_{\beta}F_{1}(\beta)-\langle\nabla F_{2}(\beta^{t}),\beta\rangle\\
&=\mathop{\textit{argmin}}\limits_{\beta}\frac{1}{2}\|\beta-\beta^{t}\|^{2}+C\sum\limits_{i=1}^{y_{l}^{t}-1}l_{1}(\langle w,x^{t}\rangle-\theta_{i})+C\sum\limits_{i=y_{r}^{t}}^{K-1}l_{1}(\theta_{i}-\langle w,x^{t}\rangle)-\langle\nabla F_{2}(\beta^{t}),\beta\rangle\\
&=\mathop{\textit{argmin}}\limits_{\beta}\frac{1}{2}\left[\|\beta\|^{2}-2\langle\beta^{t},\beta\rangle-2\langle\nabla F_{2}(\beta^{t}),\beta\rangle+\|\beta^{t}\|^{2}\right]+C\sum\limits_{i=1}^{y_{l}^{t}-1}l_{1}(\langle w,x^{t}\rangle-\theta_{i})+C\sum\limits_{i=y_{r}^{t}}^{K-1}l_{1}(\theta_{i}-\langle w,x^{t}\rangle)\\
&=\mathop{\textit{argmin}}\limits_{\beta}\frac{1}{2}\|\beta-\beta^{t}-\nabla F_{2}(\beta^{t})\|^{2}-\langle\nabla F_{2}(\beta^{t}),\beta^{t}\rangle-\frac{1}{2}\|\nabla F_{2}(\beta^{t})\|^{2}\\
&\qquad+C\sum\limits_{i=1}^{y_{l}^{t}-1}l_{1}(\langle w,x^{t}\rangle-\theta_{i})+C\sum\limits_{i=y_{r}^{t}}^{K-1}l_{1}(\theta_{i}-\langle w,x^{t}\rangle).
\end{aligned}$ (16)

Let $\widetilde{\beta^{t}}=\beta^{t}+\nabla F_{2}(\beta^{t})=\left[\begin{array}{c}w^{t}+\nabla_{w}F_{2}(\beta^{t})\\ \theta^{t}+\nabla_{\theta}F_{2}(\beta^{t})\\ \end{array}\right]=\left[\begin{array}{c}\widetilde{w^{t}}\\ \widetilde{\theta^{t}}\\ \end{array}\right]$ . The term $-\langle\nabla F_{2}(\beta^{t}),\beta^{t}\rangle-\frac{1}{2}\|\nabla F_{2}(\beta^{t})\|^{2}$ is constant because it is known in each trial $t$ . Then the objective function Eq. (12) is equivalent to the following constrained optimization problem.

$\displaystyle\beta^{t+1}=\mathop{\textit{argmin}}\limits_{\beta,\xi}\frac{1}{2}\|\beta-\widetilde{\beta^{t}}\|^{2}+C\left(\sum\limits_{i=1}^{y_{l}^{t}-1}\xi_{i}+\sum\limits_{i=y_{r}^{t}}^{K-1}\xi_{i}\right)-\langle\nabla F_{2}(\beta^{t}),\beta^{t}\rangle-\frac{1}{2}\|\nabla F_{2}(\beta^{t})\|^{2}$

$\displaystyle s.t.\left\{\begin{array}{lcl}w\cdot x^{t}-\theta_{i}\geqslant 1-\xi_{i};\ \xi_{i}\geqslant 0&&i=1,\cdots,y_{l}^{t}-1\\ w\cdot x^{t}-\theta_{i}\leqslant-1+\xi_{i};\ \xi_{i}\geqslant 0&&i=y_{r}^{t},\cdots,K-1,\\ \end{array}\right.$ (17)

where $C$ is the aggressiveness parameter and $\xi_{i}$ is a slack variable. Equation (17) is the basic structure of the PA-RAMP algorithm, which solves a convex problem in every trial $t$ .

The Lagrangian for the above problem is as follows.

$\displaystyle\begin{aligned}
L^{t}&=\frac{1}{2}\|\beta-\widetilde{\beta^{t}}\|^{2}+C\left(\sum\limits_{i=1}^{y_{l}^{t}-1}\xi_{i}+\sum\limits_{i=y_{r}^{t}}^{K-1}\xi_{i}\right)+\sum\limits_{i=1}^{y_{l}^{t}-1}\lambda_{i}^{t}(1-\xi_{i}+\theta_{i}-w\cdot x^{t})+\sum\limits_{i=y_{r}^{t}}^{K-1}\mu_{i}^{t}(1-\xi_{i}+w\cdot x^{t}-\theta_{i})\\
&\qquad-\sum\limits_{i=1}^{y_{l}^{t}-1}\alpha_{i}\xi_{i}-\sum\limits_{i=y_{r}^{t}}^{K-1}\beta_{i}\xi_{i}-\langle\nabla F_{2}(\beta^{t}),\beta^{t}\rangle-\frac{1}{2}\|\nabla F_{2}(\beta^{t})\|^{2},
\end{aligned}$ (18)

where $\lambda_{i}\geqslant 0,\alpha_{i}\geqslant 0,(i\in[y_{l}^{t}-1])$ and $\mu_{i}\geqslant 0,\beta_{i}\geqslant 0,(i=y_{r}^{t},\cdots,K-1)$ are Lagrange multipliers.

The KKT optimality conditions are

$\displaystyle w=\widetilde{w^{t}}+\left(\sum\limits_{i=1}^{y_{l}^{t}-1}\lambda_{i}^{t}-\sum\limits_{i=y_{r}^{t}}^{K-1}\mu_{i}^{t}\right)x^{t}=w^{t}+\nabla_{w}F_{2}(\beta^{t})+\left(\sum\limits_{i=1}^{y_{l}^{t}-1}\lambda_{i}^{t}-\sum\limits_{i=y_{r}^{t}}^{K-1}\mu_{i}^{t}\right)x^{t}\quad\text{①}$

$\displaystyle\left\{\begin{array}{l}\theta_{i}-\theta_{i}^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+\lambda_{i}^{t}=0,\ i=1,\cdots,y_{l}^{t}-1\\ \theta_{i}-\theta_{i}^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})-\mu_{i}^{t}=0,\ i=y_{r}^{t},\cdots,K-1\\ \end{array}\right.\quad\text{②}$

$\displaystyle\left\{\begin{array}{ll}C-\alpha_{i}^{t}-\lambda_{i}^{t}=0&i=1,\cdots,y_{l}^{t}-1\\ C-\beta_{i}^{t}-\mu_{i}^{t}=0&i=y_{r}^{t},\cdots,K-1\\ \end{array}\right.\quad\text{③}$

$\displaystyle\left\{\begin{array}{lcl}\lambda_{i}^{t}\geqslant 0,\alpha_{i}^{t}\geqslant 0&&i=1,\cdots,y_{l}^{t}-1\\ \mu_{i}^{t}\geqslant 0,\beta_{i}^{t}\geqslant 0&&i=y_{r}^{t},\cdots,K-1\\ \end{array}\right.$

$\displaystyle\left\{\begin{array}{lcl}1-\xi_{i}+\theta_{i}-w\cdot x^{t}\leqslant 0&&i=1,\cdots,y_{l}^{t}-1\\ 1-\xi_{i}+w\cdot x^{t}-\theta_{i}\leqslant 0&&i=y_{r}^{t},\cdots,K-1\\ \xi_{i}\geqslant 0&&i=1,\cdots,y_{l}^{t}-1\\ \xi_{i}\geqslant 0&&i=y_{r}^{t},\cdots,K-1\\ \end{array}\right.$

$\displaystyle\left\{\begin{array}{ll}\lambda_{i}^{t}(1-\xi_{i}+\theta_{i}-w\cdot x^{t})=0&i\in[y_{l}^{t}-1]\\ \mu_{i}^{t}(1-\xi_{i}+w\cdot x^{t}-\theta_{i})=0&i\in\{y_{r}^{t},\cdots,K-1\}\\ \alpha_{i}^{t}\xi_{i}=0&i\in[y_{l}^{t}-1]\\ \beta_{i}^{t}\xi_{i}=0&i\in\{y_{r}^{t},\cdots,K-1\}.\\ \end{array}\right.\quad\text{④}$

Let $S_{l}^{t}=\{1\leqslant i\leqslant y_{l}^{t}-1|\lambda_{i}^{t}>0\}$ be the left support set and $S_{r}^{t}=\{y_{r}^{t}\leqslant i\leqslant K-1|\mu_{i}^{t}>0\}$ be the right support set. Let $w=\widetilde{w^{t}}+a^{t}x^{t}=w^{t}+\nabla_{w}F_{2}(\beta^{t})+a^{t}x^{t}$ , where $a^{t}=\sum\limits_{i\in S_{l}^{t}}\lambda_{i}^{t}-\sum\limits_{i\in S_{r}^{t}}\mu_{i}^{t}$ . The derivation of the parameters is omitted here and is given in the appendix.

For $i\in S_{l}^{t}$ , from the KKT conditions ①–④, we have

$\displaystyle\lambda_{i}^{t}=\min\{l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{l}^{t},$

$\displaystyle\alpha_{i}^{t}=C-\lambda_{i}^{t}=C-\min\{l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{l}^{t},$

$\displaystyle\theta_{i}^{t+1}=\theta_{i}^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-\min\{l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{l}^{t},$ (19)

$\displaystyle\xi_{i}^{t+1}=1-w^{t}x^{t}+\theta_{i}^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-\min\{l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{l}^{t}.$

Similarly, for $i\in S_{r}^{t}$ , we get

$\displaystyle\mu_{i}^{t}=\min\{l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{r}^{t},$

$\displaystyle\beta_{i}^{t}=C-\mu_{i}^{t}=C-\min\{l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{r}^{t},$

$\displaystyle\theta_{i}^{t+1}=\theta_{i}^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})+\min\{l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{r}^{t},$ (20)

$\displaystyle\xi_{i}^{t+1}=1+w^{t}x^{t}-\theta_{i}^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})-\min\{l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{r}^{t}.$

In addition, the updated weight vector $w^{t+1}$ is given by

$\displaystyle w^{t+1}=w^{t}+\nabla_{w}F_{2}(\beta^{t})+a^{t}x^{t}.$ (21)

Note that the sets $S_{l}^{t}$ and $S_{r}^{t}$ are known at every trial $t$ . The pseudocode of the PA-RAMP algorithm is given in Algorithm 1.

Algorithm 1: PA-RAMP Algorithm
Input: Training set $S$
Initialize: $\boldsymbol{w}^{0}$ and $\boldsymbol{\theta}^{0}$
Output: Predictions $\hat{y}_{t}$ $(t=1,\ldots,T)$
for $t=1,\ldots,T$ do
    Receive instance $x^{t}$ from $S$
    Predict: $\hat{y}_{t}$
    Observe: $y_{l}^{t}$ , $y_{r}^{t}$
    $l_{i}^{t}=\max\{0,1+\theta_{i}^{t}-w^{t}x^{t}\},\ i=1,\cdots,y_{l}^{t}-1$
    $l_{i}^{t}=\max\{0,1+w^{t}x^{t}-\theta_{i}^{t}\},\ i=y_{r}^{t},\cdots,K-1$
    $(S_{l}^{t},S_{r}^{t})=\textit{SCA-RAMP}(y_{l}^{t},y_{r}^{t},l_{i}^{t},i\in[K-1])$
    Update $w^{t+1}=w^{t}+\nabla_{w}F_{2}(\beta^{t})+a^{t}x^{t}$ , where
        $a^{t}=\sum\limits_{i\in S_{l}^{t}}\lambda_{i}^{t}-\sum\limits_{i\in S_{r}^{t}}\mu_{i}^{t}$
        $\lambda_{i}^{t}=\min\{l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\},\ i\in S_{l}^{t}$
        $\mu_{i}^{t}=\min\{l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\},\ i\in S_{r}^{t}$
    Update $\theta_{i}^{t+1}=\theta_{i}^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-\min\{l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\},\ i\in S_{l}^{t}$
        $\theta_{i}^{t+1}=\theta_{i}^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})+\min\{l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\},\ i\in S_{r}^{t}$
end for

Determining Support Sets $S_{l}^{t}$ and $S_{r}^{t}$ : An iterative method proposed in Manwani [15] is used to find the two support sets. We first find the values of $\lambda_{i}^{t}$ and $\mu_{i}^{t}$ , and then compute $a^{t}$ . This process is repeated until all values converge. Finally, we include an index $i$ in $S_{l}^{t}$ or $S_{r}^{t}$ based on whether $\lambda_{i}>0$ or $\mu_{i}>0$ . The pseudocode of the support class algorithm with Ramp loss (SCA-RAMP) for PA-RAMP is given in Algorithm 2.

Algorithm 2: Support Class Algorithm (SCA-RAMP)
Input: $y_{l}^{t},y_{r}^{t},x^{t}$ and $l_{i}^{t},i\in[K-1]$
Initialize: $S_{l}^{t}=\{y_{l}^{t}-1\},S_{r}^{t}=\{y_{r}^{t}\},p=y_{l}^{t}-2,q=y_{r}^{t}+1$
Output: $S_{l}^{t}$ , $S_{r}^{t}$
while $\lambda_{1}^{t},\ldots,\lambda_{y_{l}^{t}-1}^{t},\mu_{y_{r}^{t}}^{t},\ldots,\mu_{K-1}^{t}$ do not converge do
    for $i=p,\cdots,1$ do
        if $\min\{l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\}>0$ then
            $S_{l}^{t}=S_{l}^{t}\cup\{i\}$
        else
            if $i\in S_{l}^{t}$ then
                $S_{l}^{t}=S_{l}^{t}-\{i\};\lambda_{i}^{t}=0$
            end if
        end if
    end for
    for $i=q,\cdots,K-1$ do
        if $\min\{l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\}>0$ then
            $S_{r}^{t}=S_{r}^{t}\cup\{i\}$
        else
            if $i\in S_{r}^{t}$ then
                $S_{r}^{t}=S_{r}^{t}-\{i\};\mu_{i}^{t}=0$
            end if
        end if
    end for
    Update:
        $a^{t}=\sum\limits_{i\in S_{l}^{t}}\lambda_{i}^{t}-\sum\limits_{i\in S_{r}^{t}}\mu_{i}^{t}$
        $\lambda_{i}^{t}=\min\{l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\},\ i\in S_{l}^{t}$
        $\mu_{i}^{t}=\min\{l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\},\ i\in S_{r}^{t}$
end while
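The coupled system for $\lambda_{i}^{t}$ , $\mu_{i}^{t}$ and $a^{t}$ can also be solved numerically in other ways. The sketch below is not the SCA-RAMP iteration itself: it swaps in a plain bisection on $a^{t}$, under the assumption that each multiplier equals its $\min\{\cdot,C\}$ expression clipped below at 0 (an index with a zero multiplier is outside the support set). The names `solve_multipliers`, `c_left` and `c_right` are hypothetical; `c_left[j]` stands for $l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})$ with $i=j+1$, and `c_right[j]` for $l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})$ with $i=y_{r}^{t}+j$.

```python
def clip(v, lo, hi):
    return max(lo, min(hi, v))


def solve_multipliers(c_left, c_right, x_norm_sq, C, iters=200):
    """Find a with a = sum(lambda) - sum(mu), where
    lambda_i = clip(c_left_i - a * ||x||^2, 0, C) and
    mu_i     = clip(c_right_i + a * ||x||^2, 0, C)."""

    def multipliers(a):
        lam = [clip(c - a * x_norm_sq, 0.0, C) for c in c_left]
        mu = [clip(c + a * x_norm_sq, 0.0, C) for c in c_right]
        return lam, mu

    def residual(a):
        lam, mu = multipliers(a)
        return sum(lam) - sum(mu) - a  # strictly decreasing in a

    # Every multiplier lies in [0, C], which brackets the root.
    lo, hi = -C * len(c_right), C * len(c_left)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    a = 0.5 * (lo + hi)
    lam, mu = multipliers(a)
    return a, lam, mu


# Toy problem: two left thresholds, one right threshold, ||x||^2 = 1.
a, lam, mu = solve_multipliers([1.5, -0.5], [0.5], x_norm_sq=1.0, C=10.0)
support_left = [j + 1 for j, v in enumerate(lam) if v > 1e-12]
```

Because the residual is strictly decreasing in $a^{t}$, the root is unique, so under the stated assumption the support sets read off from the positive multipliers do not depend on the order in which indices are visited.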

3.3 Theoretical analysis of PA-RAMP algorithm

Note that minimizing the Ramp loss objective $F(\beta^{t})$ can be cast as a concave-convex procedure (CCCP) problem. We show below that the CCCP iteration converges in the online setting.

Theorem.

Consider a function $F(\beta^{t})$ which is bounded below and has the form $F(\beta^{t})=F_{\textit{vex}}(\beta^{t})+F_{\textit{cave}}(\beta^{t})$ , where $F_{\textit{vex}}(\beta^{t})$ and $F_{\textit{cave}}(\beta^{t})$ are convex and concave functions of $\beta^{t}$ , respectively. Then the discrete iterative algorithm given by

$\displaystyle\nabla F_{\textit{vex}}(\beta_{m+1}^{t})=-\nabla F_{\textit{cave}}(\beta_{m}^{t})$

is guaranteed to monotonically decrease $F(\beta^{t})$ and hence to converge to a minimum or saddle point.

Proof.

The convexity and concavity of $F_{\textit{vex}}(\cdot)$ and $F_{\textit{cave}}(\cdot)$ mean that:

$\displaystyle F_{\textit{vex}}(\beta_{2}^{t})\geqslant F_{\textit{vex}}(\beta_{1}^{t})+(\beta_{2}^{t}-\beta_{1}^{t})\cdot\nabla F_{\textit{vex}}(\beta_{1}^{t})$

$\displaystyle F_{\textit{cave}}(\beta_{4}^{t})\leqslant F_{\textit{cave}}(\beta_{3}^{t})+(\beta_{4}^{t}-\beta_{3}^{t})\cdot\nabla F_{\textit{cave}}(\beta_{3}^{t})$ (22)

for all $\beta_{1}^{t}$ , $\beta_{2}^{t}$ , $\beta_{3}^{t}$ , $\beta_{4}^{t}$ . Setting $\beta_{1}^{t}=\beta_{m+1}^{t}$ , $\beta_{2}^{t}=\beta_{m}^{t}$ , $\beta_{3}^{t}=\beta_{m}^{t}$ , $\beta_{4}^{t}=\beta_{m+1}^{t}$ and using the update rule $\nabla F_{\textit{vex}}(\beta_{m+1}^{t})=-\nabla F_{\textit{cave}}(\beta_{m}^{t})$ , the first-order terms cancel when the two inequalities are added, and we find that:

$\displaystyle F_{\textit{vex}}(\beta_{m+1}^{t})+F_{\textit{cave}}(\beta_{m+1}^{t})\leqslant F_{\textit{vex}}(\beta_{m}^{t})+F_{\textit{cave}}(\beta_{m}^{t}).$ (23)

This completes the proof. ∎
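The monotone decrease established above can be observed on a one-dimensional toy problem. The sketch below is only an illustration: the objective $F(b)=b^{4}-2b^{2}$, its split into $F_{\textit{vex}}=b^{4}$ and $F_{\textit{cave}}=-2b^{2}$, and the function name `cccp_toy` are assumptions chosen for simplicity and are not the Ramp loss objective of PA-RAMP.

```python
def cccp_toy(beta0, steps=20):
    """CCCP on F(b) = b**4 - 2*b**2 with F_vex = b**4 and F_cave = -2*b**2.

    Each step solves grad F_vex(b_next) = -grad F_cave(b), i.e.
    4 * b_next**3 = 4 * b, so b_next is the real cube root of b.
    """
    def objective(b):
        return b ** 4 - 2 * b ** 2

    beta = beta0
    history = [objective(beta)]
    for _ in range(steps):
        # Real cube root, written so that it also works for negative beta.
        sign = 1.0 if beta >= 0 else -1.0
        beta = sign * abs(beta) ** (1.0 / 3.0)
        history.append(objective(beta))
    return beta, history
```

Starting from `beta0 = 0.5`, the iterates move towards the stationary point at 1 and the recorded objective values never increase, in line with the inequality in Eq. (23).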

Considering the property of PA algorithms, we show that the PA-RAMP algorithm inherently maintains the ordering of thresholds in each iteration.

Theorem.

(Order preservation of thresholds in PA-RAMP algorithm) Let $\theta_{1}^{t}\leqslant\cdots\leqslant\theta_{K-1}^{t}$ be the thresholds at trial $t$ . Let $\theta_{1}^{t+1},\cdots,\theta_{K-1}^{t+1}$ be the updated thresholds using PA-RAMP. Then $\theta_{1}^{t+1}\leqslant\cdots\leqslant\theta_{K-1}^{t+1}$ .

Proof.

We need to analyse different cases as follows.

1)
We know that $\theta_{k}^{t+1}=\theta_{k}^{t},k=y_{l}^{t},\cdots,y_{r}^{t}-1$ . Thus, the ordering of these thresholds is unchanged, i.e. $\theta_{y_{l}^{t}}^{t+1}\leqslant\cdots\leqslant\theta_{y_{r}^{t}-1}^{t+1}$ .
2)
$\forall\,k,k+1\in[y_{l}^{t}-1]\setminus S_{l}^{t}$ :

$\displaystyle\theta_{k+1}^{t+1}-\theta_{k}^{t+1}=\theta_{k+1}^{t}-\theta_{k}^{t}\geqslant 0.$
3)
$k\in[y_{l}^{t}-1]\setminus S_{l}^{t}$ and $k+1\in S_{l}^{t}$ :

Since $k\notin S_{l}^{t}$ , we have $\lambda_{k}^{t}\leqslant 0$ , which means

$\displaystyle l_{k}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{k}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2}\leqslant 0$

as $C>0$ . Also, $\lambda_{k+1}^{t}>0$ , that is

$\displaystyle\min\{C,l_{k+1}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{k+1}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2}\}>0.$

(a)
When $\lambda_{k+1}^{t}=l_{k+1}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{k+1}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2}$ , we see that

$\displaystyle\theta_{k+1}^{t+1}-\theta_{k}^{t+1}=\theta_{k+1}^{t}-l_{k+1}^{t}+\nabla_{w}F_{2}(\beta^{t})x^{t}+a^{t}\|x^{t}\|^{2}-\theta_{k}^{t}.$

From $l_{k+1}^{t}=\max(0,1+\theta_{k+1}^{t}-w^{t}x^{t})$ , we have

$\displaystyle\theta_{k+1}^{t+1}-\theta_{k}^{t+1}=-l_{k}^{t}+\nabla_{w}F_{2}(\beta^{t})x^{t}+a^{t}\|x^{t}\|^{2}=-\lambda_{k}^{t}+\nabla_{\theta_{k}}F_{2}(\beta^{t}),$

which is non-negative since $\lambda_{k}^{t}\leqslant 0$ and $\nabla_{\theta_{k}}F_{2}(\beta^{t})\geqslant 0$ , $k\in[y_{l}^{t}-1].$
(b)
When $\lambda_{k+1}^{t}=C$ , we have

$\displaystyle\theta_{k+1}^{t+1}-\theta_{k}^{t+1}=\theta_{k+1}^{t}-C-\theta_{k}^{t}\geqslant\theta_{k+1}^{t}-l_{k+1}^{t}+\nabla_{w}F_{2}(\beta^{t})x^{t}+a^{t}\cdot\|x^{t}\|^{2}-\nabla_{\theta_{k+1}}F_{2}(\beta^{t})-\theta_{k}^{t}=\nabla_{\theta_{k+1}}F_{2}(\beta^{t}).$

4)
$k,k+1\in S_{l}^{t}$ :

$\displaystyle\theta_{k+1}^{t+1}-\theta_{k}^{t+1}=\theta_{k+1}^{t}-\lambda_{k+1}^{t}-\theta_{k}^{t}+\lambda_{k}^{t}.$

There are four different cases.

(a)
$\lambda_{k+1}^{t}=\lambda_{k}^{t}=C$ . Thus,

$\displaystyle\theta_{k+1}^{t+1}-\theta_{k}^{t+1}=\theta_{k+1}^{t}-\theta_{k}^{t}\geqslant 0.$
(b)
$\lambda_{k}^{t}=C$ . Then $\lambda_{k+1}^{t}=C$ because $l_{k+1}^{t}\geqslant l_{k}^{t}$ , so this reduces to case (a).
(c)
$\lambda_{k}^{t}=l_{k}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{k}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2}$ , $\lambda_{k+1}^{t}=l_{k+1}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{k+1}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2}$ . So we have

$\displaystyle\theta_{k+1}^{t+1}-\theta_{k}^{t+1}=0.$
(d)
$\lambda_{k}^{t}=l_{k}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{k}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2}$ , $\lambda_{k+1}^{t}=C$ . Then,

$\displaystyle\theta_{k+1}^{t+1}-\theta_{k}^{t+1}=\theta_{k+1}^{t}-C-[\theta_{k}^{t}-l_{k}^{t}+\nabla_{w}F_{2}(\beta^{t})x^{t}-\nabla_{\theta_{k}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2}]=\theta_{k+1}^{t}-C-[-1+w^{t}x^{t}+\nabla_{w}F_{2}(\beta^{t})\|x^{t}\|]\geqslant\theta_{k+1}^{t}-[l_{k}^{t}-\nabla_{w}F_{2}(\beta^{t})x^{t}+\nabla_{\theta_{k}}F_{2}(\beta^{t})-a^{t}\cdot\|x^{t}\|^{2}]-[-1+w^{t}x^{t}\nabla_{\theta_{k+1}}F_{2}(\beta^{t})x^{t}+a^{t}\|x^{t}\|^{2}]=0.$

This completes the proof. ∎

Similar arguments can be given for the right support class $S_{r}^{t}$ , so we omit that part of the proof. Hence the PA-RAMP algorithm preserves the ordering of the thresholds at every update.
4. Experiments

In this section, we evaluate the performance of the proposed PA-RAMP algorithm by comparing it with other benchmarking methods on datasets with various ratios of noise.

The employed datasets can be obtained from the UCI machine learning repository [23] (available: https://archive.ics.uci.edu/ml/) and the LIBSVM website [24] (available: http://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/). All experiments are conducted in the MATLAB R2016a environment on a PC with a 2.5 GHz Intel Core i5 processor and 8 GB of RAM running the Windows 10 operating system. The source code of the proposed algorithm will be made public upon acceptance of the manuscript.

4.1 An illustration example

The datasets used are as follows. Each feature is normalized coordinate-wise to zero mean and unit variance. In practice, the interval labels of a dataset should be analyzed and designated by a domain expert.

Abalone: This dataset contains physical measurements of abalone found in Australia. It contains 4177 instances with 8 attributes. The aim is to predict the age of the abalone using the “Rings” attribute, which varies from 1 to 29. We divide the target variable 1–29 into 4 intervals as 1–7, 8–9, 10–12, 13–29 [15].

Parkinsons-updrs: This dataset is composed of a range of biomedical voice measurements from people with early-stage Parkinson’s disease. There are 5847 instances with 21 features. The target variable of this dataset is “total UPDRS” (Clinician’s total UPDRS score) for the instance, which varies from 7 to 54.992. We divide the target variable 7–54.992 into 5 classes: $\theta=$ (0, 17, 27, 37, 54.992) [15].

Real Estate Valuation: The historical market dataset of real estate value is collected from Sindian Dist, New Taipei City, Taiwan. This dataset has 414 instances with 7 attributes. The aim is to predict the target variable house price of unit area which ranges from 0 to 200. We create 4 intervals as 0–20, 21–40, 41–60, 61–200.

Winequality-red: This dataset is related to the red variant of the Portuguese “Vinho Verde” wine. It has 1599 instances with 11 attributes. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). The score of the output variable “quality” ranges over 3–8. We create 6 intervals as $\theta=$ (3, 4, 5, 6, 7, 8) [9].

Winequality-white: This dataset is related to the white variant of the Portuguese “Vinho Verde” wine, similar to the Winequality-red dataset. It has 4898 instances with 11 attributes. The score of “quality” is divided into 6 intervals as $\theta=$ (0, 4, 5, 6, 7, 9).
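The preprocessing described above (coordinate-wise standardization and binning of a numeric target into ordinal classes) can be sketched as follows. This is a hedged illustration, not the authors' MATLAB code: the helper names `standardize` and `to_ordinal`, the use of the population standard deviation, and the treatment of each interval's upper end as inclusive are assumptions.

```python
from bisect import bisect_left
from statistics import mean, pstdev

def standardize(column):
    """Scale one feature column to zero mean and unit variance."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in column]  # constant feature: nothing to scale
    return [(v - mu) / sigma for v in column]

def to_ordinal(value, upper_bounds):
    """Map a numeric target to a 1-based ordinal class.

    upper_bounds lists the inclusive upper end of each interval in
    increasing order, e.g. [7, 9, 12, 29] for the Abalone split
    1-7, 8-9, 10-12, 13-29.
    """
    return bisect_left(upper_bounds, value) + 1

ABALONE_BOUNDS = [7, 9, 12, 29]
```

With these bounds, a ring count of 8 maps to class 2 and a ring count of 13 maps to class 4.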

Table 1 shows the characteristics of the 5 datasets, including the number of instances, attributes, and classes, and also the number of instances per class. To show the statistical properties of different data sets, we drew a box plot as shown in Fig. 5. The blue dots represent outliers. As the figure shows, the data contain a small amount of noise, especially in Parkinsons-updrs and Wine datasets.

Table 1
Ordinal regression datasets

Dataset	#Inst	#Attr	#Classes	Class distribution
Abalone	4177	7	4	(839, 1257, 1388, 693)
Parkinsons-updrs	5847	21	5	(10, 798, 1885, 1443, 1749)
Real estate valuation	414	6	4	(36, 188, 171, 19)
Winequality-red	1599	11	6	(10, 53, 681, 638, 199, 18)
Winequality-white	4898	11	6	(20, 163, 1457, 2198, 880, 180)

Figure 5.

Box plots of five benchmark datasets.

To ascertain how noise affects the prediction, we test our algorithm on the five datasets obtained from UCI with artificial perturbation. Considering that noise comes from two main sources, $x$ and $y$ , we use the following mechanism to generate influential points. We choose $m$ % of the instances from the dataset at random. For the covariance variables $x$ , the error follows a mixture normal distribution $0.8N(0,1)+0.2N(0,10^{2})$ [25]. For the target variables, we generate the noisy label by randomly assigning one of the following intervals, i.e. $[y_{l}-1,y_{r}],[y_{l},y_{r}+1],[y_{l}-2,y_{r}-1],[y_{l}+1,y_{r}+2]$ , where $y_{l},y_{r}$ are the actual labels. Here, we consider $m=25$ and $50$ .
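The perturbation mechanism just described can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function name `inject_noise`, the flags `perturb_x` and `perturb_y`, the seed handling, and in particular the clipping of shifted labels to the range $[1, K]$ are hypothetical, since the text does not say how out-of-range labels are handled.

```python
import random

# Offsets applied to (y_l, y_r), matching the four noisy intervals in the text.
SHIFTS = [(-1, 0), (0, 1), (-2, -1), (1, 2)]

def inject_noise(X, labels, m_percent, num_classes,
                 perturb_x=True, perturb_y=True, seed=0):
    """Perturb m_percent % of the instances; return noisy copies and the chosen indices.

    X is a list of feature lists, labels a list of (y_l, y_r) interval labels.
    """
    rng = random.Random(seed)
    chosen = rng.sample(range(len(X)), round(len(X) * m_percent / 100))
    X_noisy = [list(row) for row in X]
    labels_noisy = list(labels)
    for idx in chosen:
        if perturb_x:
            # Additive error from the mixture 0.8 N(0, 1) + 0.2 N(0, 10^2).
            X_noisy[idx] = [v + rng.gauss(0.0, 1.0 if rng.random() < 0.8 else 10.0)
                            for v in X[idx]]
        if perturb_y:
            dl, dr = rng.choice(SHIFTS)
            y_l, y_r = labels[idx]
            # Clipping to [1, num_classes] is an assumption, not stated in the text.
            labels_noisy[idx] = (min(max(y_l + dl, 1), num_classes),
                                 min(max(y_r + dr, 1), num_classes))
    return X_noisy, labels_noisy, sorted(chosen)
```

Setting only one of the two flags reproduces the separate target-noise and covariance-noise settings studied in Sections 4.2 and 4.3.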

We compare the performance of the proposed PA-RAMP algorithm with other benchmark methods for online ordinal regression with interval labels, namely PA [15], PA-I [15] and PRIL [13], where PRIL is a PRank (perceptron rank [2]) based approach for online ordinal regression using interval labels. Moreover, the noise tolerance of PA-RAMP on the five datasets is investigated under various noise ratios. The noise sensitivity study for the target variables $y$ , comparing the four methods at different noise levels, is reported in Section 4.2, and the noise sensitivity study for the covariance variables $x$ is described in Section 4.3.

4.2 Noise sensitivity of target variables

Figure 6.

Accuracy comparison results of PA, PA-I, PRIL and PA-RAMP under different noise data on target variables.

The purpose of this experiment is to compare the overall accuracy of the four methods under various ratios of noise on the target variables. We employ the Accuracy of [9], defined by $1-\frac{1}{T}\sum_{t=1}^{T}[\hat{y^{t}}\neq y^{t}]$ , to show the noise-resilient effect of the four methods, where $y^{t}$ is the true label and $\hat{y^{t}}$ is the predicted label. The closer the value is to 1, the more accurate the prediction. In online ordinal regression learning, the model is given a sample $x^{t}$ and is required to predict $\hat{y^{t}}$ at round $t$ , starting from the initial values $w^{0}$ and $\theta^{0}$ . After predicting $\hat{y^{t}}$ , the model receives the correct rank $y^{t}$ and immediately updates its ranking rule by modifying $w$ and $\theta$ , so that learning takes place in real time. To show the noise-resilient effect of the four methods, we measure the Accuracy at every round $t$ using noisy target variables. In the current experiments, we run each algorithm on the datasets three times and average the instantaneous Accuracy across the three runs. Figure 6 shows the results of the four methods for the five datasets. The $x$ -axis denotes the number of examples seen and the $y$ -axis the corresponding estimate of the Accuracy on test examples. The parameters of the model should be analyzed and designated by a domain expert before the experiment. We select a suitable aggressiveness parameter $C$ for PA, PA-I and PA-RAMP, and $\eta=1$ for PRIL, based on estimation performance. We make the following observations.

For all the datasets, the Accuracy increases with the number of trials $T$ and declines as the fraction of noisy labels increases. Label noise on the target variables does degrade the prediction Accuracy.

In general, the Accuracy of PA-RAMP is higher, and PA-RAMP outperforms the other three methods on all datasets at the different noise levels. Also, the noise-resilient PA-RAMP and the noise-sensitive algorithms (PA, PA-I and PRIL) work equally well in terms of accuracy when no noise is added. Moreover, the Accuracy gap between PA-RAMP and the other approaches widens as the noise level increases, especially for PA-I.

On the Abalone dataset, the Accuracies of PA-RAMP, PA-I and PRIL are close when the dataset has 0% noise. However, the predictive Accuracies of PA, PA-I and PRIL are relatively low and unstable in the two noisy cases, which shows that these three algorithms are very sensitive to noise. PA-RAMP outperforms the others, while PA-I performs comparably to PRIL and consistently better than PA. On the Parkinsons-updrs and Real Estate Valuation datasets, the prediction Accuracy of PA, PA-I and PRIL degrades to different degrees under label noise, whereas PA-RAMP shows better noise tolerance.

On the Winequality-red and Winequality-white datasets, the Accuracies of PA-I and PA-RAMP are close when the datasets have 0% noise. However, PA-RAMP outperforms PA-I across all levels of noise. In particular, the Accuracy of PRIL gradually decreases as the noise level rises, while the prediction accuracy of PA-RAMP increases. PA consistently shows lower test Accuracy than the others.

It is clear from the graphs that PA-RAMP outperforms the basic PA, PA-I and PRIL algorithms when the noise level is high. Overall, the proposed noise-insensitive PA-RAMP algorithm performs significantly better than PA-I and PRIL, and consistently better than the PA algorithm, although the latter is more stable.

The primary difference between the PA algorithms (PA, PA-I, PA-RAMP) and PRIL is that PRIL uses a constant step size, whereas the PA algorithms choose an appropriate step size to find the new $w$ and $\theta$ in every trial. In PA-RAMP, the step size is determined by solving an optimization problem in which the Ramp loss acts as a noise-resilient function, so that the loss contributed by heavily noisy samples becomes 0. In contrast, PA and PA-I take a more aggressive step size to ensure that the loss on the current example becomes 0; this is especially true for PA. Thus, the proposed PA-RAMP performs better than the other methods and is noise-resilient.
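The Accuracy measure used throughout this subsection, evaluated after every round, can be sketched in a few lines. This is a minimal illustration: the function name `running_accuracy` is an assumption, and predictions are compared with a single true label exactly as in the formula $1-\frac{1}{T}\sum_{t=1}^{T}[\hat{y^{t}}\neq y^{t}]$ ; averaging over repeated runs is left out.

```python
def running_accuracy(y_pred, y_true):
    """Accuracy after each round t: 1 - (number of mistakes up to t) / t."""
    mistakes = 0
    curve = []
    for t, (p, y) in enumerate(zip(y_pred, y_true), start=1):
        mistakes += int(p != y)  # indicator of a wrong prediction at round t
        curve.append(1.0 - mistakes / t)
    return curve
```

The last entry of the returned list is the Accuracy over all $T$ rounds, and plotting the whole list against $t$ gives a curve of the kind shown in Fig. 6.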

4.3 Noise sensitivity of covariance variables

In this section, we conduct experiments to evaluate the overall Accuracy of the four online methods under noisy covariance variables, and we compare the proposed PA-RAMP algorithm with PA-I at various noise ratios using order-aware and statistical measures.

4.3.1 The accuracy comparison

Because of the artificial perturbation of the covariance variables, the statistical properties of the five datasets under 25% and 50% noise are shown in Figs 7 and 8. The x-axis denotes the dimensions of the examples and the y-axis the perturbation of the outliers. We analyze the Accuracy of the four methods under this noise. As the analysis of Accuracy under covariance-variable noise in Fig. 9 is similar to that for the target variables, we skip the details. Note that with 0% noise, the curves for covariance and target variables are identical.

Figure 7.

Five benchmark datasets under 25% noise data.

Figure 8.

Five benchmark datasets under 50% noise data.

Figure 9.

Accuracy comparison results of PA, PA-I, PA-RAMP and PRIL under different noise data on covariance variables.

Table 2

Sensitivity estimation of PA-I and the proposed PA-RAMP algorithm on five datasets under 0% noise data. For each dataset, results are reported at three stages: $t=\frac{1}{3}T$ , $t=\frac{2}{3}T$ and $t=T$ . For each method, the fourth and fifth columns show the MAE and RMSE over three runs, and the sixth column shows Spearman’s correlation coefficient. The number of samples discarded by each algorithm and the corresponding discard rate are shown in the seventh and eighth columns. Time denotes the running time of the program.

Dataset	Rounds	Method	MAE	RMSE	Spearman’s correlation coefficient	Discard sample	Discard rate	Time (s)
Abalone	T $=$ 1400	PA-I	$1.314\pm 0.002$	$1.595\pm 0.001$	$0.539\pm 0.003$	$0\pm 0$	0.00%	0.589
		PA-Ramp	$1.648\pm 0.000$	$1.909\pm 0.001$	$0.440\pm 0.001$	$36\pm 11$	2.57%	$1.817$
	T $=$ 2800	PA-I	$1.159\pm 0.002$	$1.471\pm 0.003$	$0.539\pm 0.000$	$0\pm 0$	0.00%	$1.075$
		PA-Ramp	$1.186\pm 0.004$	$1.469\pm 0.004$	$0.539\pm 0.000$	$85\pm 6$	3.03%	$3.224$
	T $=$ 4100	PA-I	$1.361\pm 0.003$	$1.632\pm 0.008$	$0.559\pm 0.003$	$0\pm 0$	0.00%	$1.471$
		PA-Ramp	$1.914\pm 0.005$	$2.165\pm 0.002$	$0.363\pm 0.002$	$116\pm 12$	2.82%	$5.435$
Parkinsons-updrs	T $=$ 2000	PA-I	$0.866\pm 0.000$	$1.070\pm 0.001$	$0.436\pm 0.001$	$0\pm 0$	0.00%	$0.449$
		PA-Ramp	$0.792\pm 0.001$	$1.129\pm 0.003$	$0.568\pm 0.000$	$154\pm 21$	7.70%	$2.796$
	T $=$ 4000	PA-I	$0.757\pm 0.003$	$1.044\pm 0.003$	$0.480\pm 0.001$	$0\pm 0$	0.00%	$0.928$
		PA-Ramp	$0.752\pm 0.000$	$1.031\pm 0.002$	$0.628\pm 0.000$	$326\pm 8$	8.15%	5.987
	T $=$ 5800	PA-I	$0.751\pm 0.009$	$1.040\pm 0.001$	$0.492\pm 0.002$	$0\pm 0$	0.00%	$1.286$
		PA-Ramp	$0.707\pm 0.019$	$0.987\pm 0.005$	$0.648\pm 0.016$	$498\pm 23$	8.59%	7.770
Real estate valuation	T $=$ 140	PA-I	$0.771\pm 0.014$	$1.076\pm 0.017$	$0.259\pm 0.009$	$0\pm 0$	0.00%	$1.895$
		PA-Ramp	$0.814\pm 0.012$	$1.049\pm 0.008$	$0.510\pm 0.004$	$3\pm 1$	2.14%	$0.305$
	T $=$ 280	PA-I	$0.700\pm 0.018$	$0.967\pm 0.003$	$0.431\pm 0.007$	$0\pm 0$	0.00%	$3.728$
		PA-Ramp	$0.646\pm 0.004$	$0.908\pm 0.012$	$0.533\pm 0.014$	$9\pm 3$	3.21%	$0.563$
	T $=$ 400	PA-I	$0.777\pm 0.004$	$1.033\pm 0.003$	$0.470\pm 0.009$	$0\pm 0$	0.00%	$5.954$
		PA-Ramp	$0.661\pm 0.017$	$0.919\pm 0.012$	$0.537\pm 0.018$	$12\pm 5$	3.00%	$0.857$
Winequality-red	T $=$ 500	PA-I	$1.294\pm 0.017$	$1.770\pm 0.003$	$0.158\pm 0.002$	$0\pm 0$	0.00%	$0.151$
		PA-Ramp	$2.088\pm 0.002$	$1.788\pm 0.003$	$0.150\pm 0.003$	$16\pm 2$	3.20%	$0.867$
	T $=$ 1000	PA-I	$1.073\pm 0.007$	$1.513\pm 0.009$	$0.004\pm 0.007$	$0\pm 0$	0.00%	$0.296$
		PA-Ramp	$1.904\pm 0.004$	$2.217\pm 0.002$	$0.144\pm 0.009$	$36\pm 5$	3.65%	$1.733$
	T $=$ 1500	PA-I	$0.983\pm 0.017$	$1.387\pm 0.005$	$0.018\pm 0.002$	$0\pm 0$	0.00%	$0.432$
		PA-Ramp	$1.828\pm 0.001$	$0.943\pm 0.002$	$0.164\pm 0.002$	$46\pm 6$	3.06%	$2.552$
Winequality-white	T $=$ 1600	PA-I	$1.171\pm 0.016$	$1.560\pm 0.002$	$0.121\pm 0.002$	$0\pm 0$	0.00%	$0.373$
		PA-Ramp	$1.121\pm 0.025$	$0.630\pm 0.027$	$0.182\pm 0.033$	$68\pm 7$	4.25%	$2.382$
	T $=$ 3200	PA-I	$1.065\pm 0.016$	$1.425\pm 0.002$	$0.117\pm 0.014$	$0\pm 0$	0.00%	$0.723$
		PA-Ramp	$1.758\pm 0.007$	$0.730\pm 0.012$	$0.285\pm 0.008$	$123\pm 13$	3.84%	$5.471$
	T $=$ 4800	PA-I	$1.032\pm 0.004$	$1.373\pm 0.028$	$0.122\pm 0.019$	$0\pm 0$	0.00%	$1.085$
		PA-Ramp	$1.412\pm 0.025$	$0.972\pm 0.069$	$0.342\pm 0.018$	$203\pm 21$	4.22%	$6.875$

Table 3

Sensitivity comparison of PA-I and the proposed PA-RAMP in online ordinal regression under 25% noise data. For each dataset, results are reported at three stages: $t=\frac{1}{3}T$ , $t=\frac{2}{3}T$ and $t=T$ . For each method, the fourth and fifth columns show the MAE and RMSE over three runs, and the sixth column shows Spearman’s correlation coefficient. The number of samples discarded by each algorithm and the corresponding discard rate are shown in the seventh and eighth columns. Time denotes the running time of the program.

Dataset	Rounds	Method	MAE	RMSE	Spearman’s correlation coefficient	Discard sample	Discard rate	Time (s)
Abalone	T $=$ 1400	PA-I-25%	$1.297\pm 0.014$	$1.640\pm 0.001$	$0.130\pm 0.009$	$0\pm 0$	0.00%	$0.035$
		PA-Ramp-25%	$1.072\pm 0.014$	$1.369\pm 0.003$	$0.128\pm 0.001$	$321\pm 17$	22.92%	$0.135$
	T $=$ 2800	PA-I-25%	$1.319\pm 0.002$	$1.660\pm 0.003$	$0.194\pm 0.003$	$0\pm 0$	0.00%	$0.069$
		PA-Ramp-25%	$1.171\pm 0.029$	$1.476\pm 0.004$	$0.193\pm 0.000$	$614\pm 13$	21.93%	$0.289$
	T $=$ 4100	PA-I-25%	$1.299\pm 0.011$	$1.632\pm 0.008$	$0.230\pm 0.086$	$0\pm 0$	0.00%	$0.105$
		PA-Ramp-25%	$1.113\pm 0.005$	$1.429\pm 0.003$	$0.228\pm 0.096$	$1232\pm 21$	30.0%	$0.434$
Parkinsons-updrs	T $=$ 2000	PA-I-25%	$1.028\pm 0.000$	$1.350\pm 0.018$	$0.199\pm 0.006$	$0\pm 0$	0.00%	$0.082$
		PA-Ramp-25%	$1.015\pm 0.001$	$1.057\pm 0.073$	$0.279\pm 0.007$	$413\pm 4$	20.65%	$0.251$
	T $=$ 4000	PA-I-25%	$1.261\pm 0.005$	$1.531\pm 0.021$	$0.226\pm 0.059$	$0\pm 0$	0.00%	$0.296$
		PA-Ramp-25%	$0.936\pm 0.006$	$1.235\pm 0.068$	$0.268\pm 0.005$	$729\pm 14$	18.22%	0.233
	T $=$ 5800	PA-I-25%	$1.249\pm 0.026$	$1.207\pm 0.008$	$0.235\pm 0.045$	$0\pm 0$	0.00%	$0.285$
		PA-Ramp-25%	$0.913\pm 0.023$	$1.208\pm 0.021$	$0.255\pm 0.031$	$1389\pm 34$	24.10%	0.640
Real estate valuation	T $=$ 140	PA-I-25%	$0.721\pm 0.029$	$1.017\pm 0.007$	$0.368\pm 0.023$	$0\pm 0$	0.00%	$0.421$
		PA-Ramp-25%	$0.843\pm 0.028$	$1.055\pm 0.097$	$0.340\pm 0.028$	$21\pm 0$	15.00%	$0.287$
	T $=$ 280	PA-I-25%	$0.629\pm 0.578$	$0.910\pm 0.025$	$0.471\pm 0.001$	$0\pm 0$	0.00%	$0.923$
		PA-Ramp-25%	$0.732\pm 0.014$	$0.965\pm 0.047$	$0.440\pm 0.024$	$66\pm 7$	23.58%	$0.579$
	T $=$ 400	PA-I-25%	$0.664\pm 0.006$	$0.907\pm 0.028$	$0.498\pm 0.055$	$0\pm 0$	0.00%	$1.263$
		PA-Ramp-25%	$0.731\pm 0.023$	$0.949\pm 0.066$	$0.485\pm 0.027$	$111\pm 9$	27.75%	$0.875$
Winequality-red	T $=$ 500	PA-I-25%	$1.962\pm 0.005$	$2.277\pm 0.004$	$0.010\pm 0.002$	$0\pm 0$	0.00%	$0.175$
		PA-Ramp-25%	$1.406\pm 0.002$	$1.857\pm 0.027$	$0.155\pm 0.038$	$111\pm 5$	22.20%	$1.728$
	T $=$ 1000	PA-I-25%	$1.869\pm 0.002$	$2.187\pm 0.025$	$0.005\pm 0.001$	$0\pm 0$	0.00%	$0.367$
		PA-Ramp-25%	$1.219\pm 0.012$	$1.631\pm 0.021$	$0.134\pm 0.073$	$192\pm 13$	19.21%	$2.397$
	T $=$ 1500	PA-I-25%	$1.856\pm 0.007$	$2.157\pm 0.036$	$0.017\pm 0.003$	$0\pm 0$	0.00%	$0.550$
		PA-Ramp-25%	$1.131\pm 0.069$	$1.514\pm 0.002$	$0.153\pm 0.086$	$209\pm 19$	22.73%	$3.574$
Winequality-white	T $=$ 1600	PA-I-25%	$2.038\pm 0.016$	$2.469\pm 0.008$	$0.097\pm 0.007$	$0\pm 0$	0.00%	$0.428$
		PA-Ramp-25%	$1.320\pm 0.005$	$1.709\pm 0.005$	$0.213\pm 0.016$	$329\pm 12$	13.06%	$2.108$
	T $=$ 3200	PA-I-25%	$1.741\pm 0.003$	$2.385\pm 0.007$	$0.099\pm 0.004$	$0\pm 0$	0.00%	$0.829$
		PA-Ramp-25%	$1.150\pm 0.056$	$1.519\pm 0.007$	$0.225\pm 0.009$	$625\pm 5$	19.53%	$4.799$
	T $=$ 4800	PA-I-25%	$1.918\pm 0.006$	$2.368\pm 0.013$	$0.101\pm 0.016$	$0\pm 0$	0.00%	$1.243$
		PA-Ramp-25%	$1.088\pm 0.005$	$1.437\pm 0.007$	$0.234\pm 0.006$	$1026\pm 23$	21.37%	$6.943$

Table 4

Sensitivity comparison of PA-I and the proposed PA-RAMP in online ordinal regression under 50% noise data. For each dataset, results are reported at three stages: $t=\frac{1}{3}T$ , $t=\frac{2}{3}T$ and $t=T$ . For each method, the fourth and fifth columns show the MAE and RMSE over three runs, and the sixth column shows Spearman’s correlation coefficient. The number of samples discarded by each algorithm and the corresponding discard rate are shown in the seventh and eighth columns. Time denotes the running time of the program.

Dataset	Rounds	Method	MAE	RMSE	Spearman’s correlation coefficient	Discard sample	Discard rate	Time (s)
Abalone	T $=$ 1400	PA-I-50%	$2.372\pm 0.014$	$2.054\pm 0.016$	$0.028\pm 0.015$	$0\pm 0$	0.00%	$0.296$
		PA-Ramp-50%	$1.786\pm 0.013$	$1.744\pm 0.064$	$0.060\pm 0.005$	$419\pm 34$	29.92%	$0.157$
	T $=$ 2800	PA-I-50%	$2.369\pm 0.006$	$2.095\pm 0.008$	$0.083\pm 0.008$	$0\pm 0$	0.00%	$0.558$
		PA-Ramp-50%	$1.390\pm 0.064$	$1.926\pm 0.002$	$0.060\pm 0.043$	$934\pm 25$	33.35%	$0.325$
	T $=$ 4100	PA-I-50%	$2.296\pm 0.073$	$1.921\pm 0.045$	$0.108\pm 0.016$	$0\pm 0$	0.00%	$0.805$
		PA-Ramp-50%	$1.396\pm 0.060$	$1.704\pm 0.072$	$0.172\pm 0.028$	$1767\pm 48$	43.07%	$0.458$
Parkinsons-updrs	T $=$ 2000	PA-I-50%	$1.704\pm 0.021$	$2.030\pm 0.258$	$0.018\pm 0.069$	$0\pm 0$	0.00%	0.094
		PA-Ramp-50%	$1.251\pm 0.005$	$1.498\pm 0.003$	$0.032\pm 0.023$	$945\pm 31$	47.25%	$0.397$
	T $=$ 4000	PA-I-50%	$1.603\pm 0.025$	$1.933\pm 0.038$	$0.043\pm 0.006$	$0\pm 0$	0.00%	0.128
		PA-Ramp-50%	$1.003\pm 0.006$	$1.232\pm 0.068$	$0.082\pm 0.013$	$1898\pm 56$	47.45%	0.801
	T $=$ 5800	PA-I-50%	$1.511\pm 0.005$	$1.849\pm 0.068$	$0.057\pm 0.008$	$0\pm 0$	0.00%	$0.273$
		PA-Ramp-50%	$1.010\pm 0.055$	$1.449\pm 0.089$	$0.097\pm 0.080$	$2799\pm 97$	48.25%	1.193
Real estate valuation	T $=$ 140	PA-I-50%	$1.029\pm 0.057$	$1.276\pm 0.015$	$0.006\pm 0.002$	$0\pm 0$	0.00%	$0.159$
		PA-Ramp-50%	$1.009\pm 0.057$	$1.180\pm 0.097$	$0.024\pm 0.008$	$54\pm 18$	38.57%	$0.183$
	T $=$ 280	PA-I-50%	$1.043\pm 0.085$	$1.320\pm 0.073$	$0.034\pm 0.008$	$0\pm 0$	0.00%	$0.324$
		PA-Ramp-50%	$1.667\pm 0.008$	$1.218\pm 0.089$	$0.440\pm 0.024$	$86\pm 31$	30.71%	$0.375$
	T $=$ 400	PA-I-50%	$1.160\pm 0.006$	$1.418\pm 0.048$	$0.003\pm 0.006$	$0\pm 0$	0.00%	0.475
		PA-Ramp-50%	$1.084\pm 0.037$	$2.131\pm 0.084$	$0.052\pm 0.009$	$165\pm 25$	41.25%	$0.549$
Winequality-red	T $=$ 500	PA-I-50%	$2.072\pm 0.005$	$2.368\pm 0.066$	$-0.055\pm 0.007$	$0\pm 0$	0.00%	$0.187$
		PA-Ramp-50%	$1.536\pm 0.002$	$1.940\pm 0.1.3$	$0.010\pm 0.038$	$202\pm 24$	40.40%	$0.811$
	T $=$ 1000	PA-I-50%	$1.981\pm 0.102$	$2.275\pm 0.271$	$-0.004\pm 0.066$	$0\pm 0$	0.00%	$0.379$
		PA-Ramp-50%	$1.388\pm 0.008$	$1.757\pm 0.009$	$0.036\pm 0.073$	$389\pm 11$	38.90%	$1.623$
	T $=$ 1500	PA-I-50%	$1.959\pm 0.007$	$2.234\pm 0.036$	$-0.006\pm 0.003$	$0\pm 0$	0.00%	0.545
		PA-Ramp-50%	$1.301\pm 0.027$	$2.649\pm 0.089$	$0.093\pm 0.022$	$645\pm 32$	43.00%	$2.041$
Winequality-white	T $=$ 1600	PA-I-50%	$2.031\pm 0.075$	$2.457\pm 0.092$	$-0.209\pm 0.029$	$0\pm 0$	0.00%	$0.332$
		PA-Ramp-50%	$1.327\pm 0.005$	$1.719\pm 0.005$	$0.114\pm 0.041$	$724\pm 74$	22.62%	$1.948$
	T $=$ 3200	PA-I-50%	$2.342\pm 0.025$	$2.380\pm 0.082$	$-0.218\pm 0.007$	$0\pm 0$	0.00%	$0.923$
		PA-Ramp-50%	$1.852\pm 0.018$	$1.723\pm 0.046$	$-0.105\pm 0.009$	$1326\pm 32$	41.43%	$3.993$
	T $=$ 4800	PA-I-50%	$2.903\pm 0.041$	$2.355\pm 0.005$	$-0.106\pm 0.011$	$0\pm 0$	0.00%	1.365
		PA-Ramp-50%	$2.189\pm 0.003$	$1.941\pm 0.039$	$0.025\pm 0.018$	$2056\pm 89$	42.83%	$5.816$

4.3.2 Noise statistical study

We investigate the proposed noise-resilient online ordinal regression algorithm when the covariance variables are noisy, taking the order of the labels into account. Specifically, we use statistical measures to assess how effectively the proposed PA-RAMP method handles noisy inputs. In this subsection, we report simulation studies of finite-sample performance at three stages ( $t=\frac{1}{3}T,t=\frac{2}{3}T,t=T$ ). We compare PA-I and PA-RAMP under different noise levels in terms of prediction MAE (Mean Absolute Error) [9], RMSE (Root Mean Square Error), Spearman’s correlation coefficient, number of discarded samples, discard rate, and running time. Here $\textit{MAE}=\frac{1}{T}\sum_{t=1}^{T}|\mathcal{O}(\hat{y_{t}})-\mathcal{O}(y_{t})|$ and $\textit{RMSE}=\sqrt{\frac{1}{T}\sum_{t=1}^{T}(\mathcal{O}(\hat{y_{t}})-\mathcal{O}(y_{t}))^{2}}$ , where $\mathcal{O}(\hat{y_{t}})$ is the predicted rank and $\mathcal{O}({y_{t}})$ is the true one. Both values range from 0 to $K-1$ (the maximum deviation in number of categories); the larger the value, the worse the prediction accuracy. However, the predicted values may deviate from the actual values while still maintaining the same order. For ordinal matching, a useful and common measure is Spearman’s rank correlation, which measures the agreement between the rankings of the two distributions. It captures structural similarity and ranges from $-1$ to 1; the higher the value, the closer the two rankings. A discarded sample is a sample that produces no update of the model, namely one for which $S_{l}^{t}$ and $S_{r}^{t}$ are empty in that iteration of the algorithm. This is equivalent to $\lambda_{i}^{t}=0,i\in S_{l}^{t}$ and $\mu_{i}^{t}=0,i\in S_{r}^{t}$ , and hence $a^{t}=0$ . We count the discarded samples by tracking the value of $a^{t}$ . The discard rate is the percentage of discarded samples among all samples.
For each dataset, we have three noise scenarios to discuss: 0%, 25% and 50%.

The results are summarized in Tables 2–4. As can be seen from the MAE and RMSE columns, the proposed PA-RAMP method outperforms the competing PA-I in most settings of the two noisy cases. In particular, at the high noise level (50%), PA-RAMP significantly outperforms the baseline algorithm PA-I. Spearman’s correlation coefficient indicates that the rank agreement achieved by PA-RAMP is generally higher in the three noise cases. Empirically, as we increase the ratio of noise, the difference between the two methods in each evaluation index becomes apparent. Due to the intrinsic flaw of the hinge loss, the PA-I method is highly sensitive to noise. The Ramp loss can effectively reduce the impact of noisy data, except for individual data points. Moreover, as the sample size grows, the number of samples discarded by PA-RAMP increases with the level of noise, whereas PA-I discards none. A large share of the samples (up to about 48% in Table 4) is discarded when the noise level is 50%. Not surprisingly, the percentage of noisy data significantly affects the discard rate and the results of the algorithms. Empirically, the discard rate is approximately equal to the noise ratio. Furthermore, estimation on small samples can make the model unstable, because the prediction accuracy depends on the proportion of noise points used in training the model. As for time, it depends on the size of the sample and on how many iterations are needed for $a^{t}$ to converge, so PA-RAMP requires a longer time for larger datasets. Each update of PA-I has a relatively simple expression, while each online update of PA-RAMP requires solving a more complex optimization problem with more computation, and is thus much slower under high noise. These results indicate that PA-RAMP is a better noise-resilient method than PA-I for noisy data when the covariance variables $x$ are contaminated, though the latter is faster.

4.3.3 Noise sensitive comparison

Figure 10.

Sensitivity analysis result between MAE and discard rate.

The purpose of this experiment is to compare the noise sensitivity of the three algorithms. In other words, as the noise ratio increases, we compare how much the prediction accuracy of each algorithm degrades. The slower the degradation, the weaker the algorithm’s sensitivity to noise, that is, the stronger its anti-noise stability. Based on the statistics in the previous section, we use MAE to represent prediction accuracy and the discard rate to represent the actual noise ratio. We observe how the MAE of the three algorithms (PA-I, PA-RAMP, and PRIL) degrades as the sample discard rate increases. The performance variation under the different noise settings is plotted in Fig. 10. The left axis of the graph shows the unit change of MAE, the right axis shows the unit change of the sample discard rate, i.e., the actual noise ratio, and the x-axis shows the theoretical noise ratio. We make the following observations:

The discard rate of each dataset is approximately equal to the actual noise ratio. As the noise ratio increases, the MAE of the three algorithms gradually increases and the prediction accuracy gradually decreases. However, the MAE of the proposed PA-RAMP tends to stabilize as the sample discard rate increases, and it grows the slowest. In particular, when the outlier ratio is greater than 50%, most of the noisy samples are discarded and the MAE increment of PA-RAMP is the smallest; that is, PA-RAMP has better anti-noise stability and significantly outperforms the two competing methods (PA-I, PRIL) at high noise levels. Not surprisingly, both PRIL and PA-I are sensitive to noise due to their loss functions: the original PRIL relies on the perceptron loss, and the MAE of PA-I is higher than that of PRIL because the Hinge loss function is more sensitive. Thus, by discarding samples at high noise levels, PA-RAMP clearly reduces the impact of noisy data and has the best anti-noise stability.

5. Conclusion and further work

In this paper, we proposed an online ordinal regression method, PA-RAMP, which employs the Ramp loss function to deal with noisy data. We presented an efficient algorithm based on the CCCP framework to solve PA-RAMP, whose iterations preserve the order of the thresholds. Our analysis shows that the proposed PA-RAMP is noise-resilient in the interval-label setting. Finally, we conducted experiments on various datasets to validate that PA-RAMP is a robust candidate for dealing with noisy data streams. While this paper focused on the online setting, the proposed method could also serve as a building block for large-scale batch algorithms. Future work will extend the ordinal regression model to a non-linear model by introducing the kernel trick [26].

Footnotes

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant Nos. 71961004, 71461005 and 71561008, the Thesis Cultivation Project of GUET Graduate Excellent Master under Grant No. 2019YJSPY01, and the Innovation Project of GUET Graduate Education under Grant No. 2019YCXS081.

Conflict of interest

The authors declare that they have no conflict of interest.

Appendix

The PA-RAMP algorithm is derived as follows.

In Section 2, we derived the framework of the PA-RAMP algorithm from the Ramp loss using CCCP. From Eq. (8), we have Eq. (16):

$\displaystyle\beta^{t+1}=\mathop{\mathrm{argmin}}\limits_{\beta,\xi}\ \frac{1}{2}\|\beta-\widetilde{\beta^{t}}\|^{2}+C\left(\sum\limits_{i=1}^{y_{l}^{t}-1}\xi_{i}+\sum\limits_{i=y_{r}^{t}}^{K-1}\xi_{i}\right)-\langle\nabla F_{2}(\beta^{t}),\beta^{t}\rangle-\frac{1}{2}\|\nabla F_{2}(\beta^{t})\|^{2}$

$\displaystyle \text{s.t.}\ \left\{\begin{array}{lcl}w\cdot x^{t}-\theta_{i}\geqslant 1-\xi_{i};\ \xi_{i}\geqslant 0&&i=1,\cdots,y_{l}^{t}-1\\ w\cdot x^{t}-\theta_{i}\leqslant-1+\xi_{i};\ \xi_{i}\geqslant 0&&i=y_{r}^{t},\cdots,K-1,\end{array}\right.$

where $C$ is the aggressiveness parameter.
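The two constant terms in Eq. (16) come from completing the square. As a hedged sketch, assume that the CCCP step in Eq. (8) minimizes the proximal term minus the linearized concave part, and that $\widetilde{\beta^{t}}=\beta^{t}+\nabla F_{2}(\beta^{t})$, which matches the expression for $\widetilde{w^{t}}$ used in the KKT conditions of this appendix. Then:

```latex
\[
\begin{aligned}
\frac{1}{2}\|\beta-\beta^{t}\|^{2}-\langle\nabla F_{2}(\beta^{t}),\beta\rangle
 &=\frac{1}{2}\|\beta-\beta^{t}-\nabla F_{2}(\beta^{t})\|^{2}
   -\langle\nabla F_{2}(\beta^{t}),\beta^{t}\rangle
   -\frac{1}{2}\|\nabla F_{2}(\beta^{t})\|^{2}\\
 &=\frac{1}{2}\|\beta-\widetilde{\beta^{t}}\|^{2}
   -\langle\nabla F_{2}(\beta^{t}),\beta^{t}\rangle
   -\frac{1}{2}\|\nabla F_{2}(\beta^{t})\|^{2}.
\end{aligned}
\]
```

The last two terms do not depend on $\beta$, so they do not affect the minimizer.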

The Lagrangian of the above problem is as follows.

$\displaystyle L^{t}=\frac{1}{2}\|\beta-\widetilde{\beta^{t}}\|^{2}+C\left(\sum\limits_{i=1}^{y_{l}^{t}-1}\xi_{i}+\sum\limits_{i=y_{r}^{t}}^{K-1}\xi_{i}\right)+\sum\limits_{i=1}^{y_{l}^{t}-1}\lambda_{i}^{t}(1-\xi_{i}+\theta_{i}-w\cdot x^{t})+\sum\limits_{i=y_{r}^{t}}^{K-1}\mu_{i}^{t}(1-\xi_{i}+w\cdot x^{t}-\theta_{i})-\sum\limits_{i=1}^{y_{l}^{t}-1}\alpha_{i}^{t}\xi_{i}-\sum\limits_{i=y_{r}^{t}}^{K-1}\beta_{i}^{t}\xi_{i}-\langle\nabla F_{2}(\beta^{t}),\beta^{t}\rangle-\frac{1}{2}\|\nabla F_{2}(\beta^{t})\|^{2},$

where $\lambda_{i}^{t}\geqslant 0,\alpha_{i}^{t}\geqslant 0,i\in[y_{l}^{t}-1]$ and $\mu_{i}^{t}\geqslant 0,\beta_{i}^{t}\geqslant 0,i=y_{r}^{t},\cdots,K-1$ are Lagrange multipliers. The variables are $w\in R^{d}$, $\theta(\theta_{1},\cdots,\theta_{K-1})$, $\xi(\xi_{1},\cdots,\xi_{K-1})$, $\lambda_{1}^{t},\cdots,\lambda_{y_{l}^{t}-1}^{t}$, $\mu_{y_{r}^{t}}^{t},\cdots,\mu_{K-1}^{t}$, $\alpha_{1}^{t},\cdots,\alpha_{y_{l}^{t}-1}^{t}$ and $\beta_{y_{r}^{t}}^{t},\cdots,\beta_{K-1}^{t}$.

The KKT optimality conditions are as follows.

$\displaystyle w=w^{t}+\nabla_{w}F_{2}(\beta^{t})+\left(\sum\limits_{i=1}^{y_{l}^{t}-1}\lambda_{i}^{t}-\sum\limits_{i=y_{r}^{t}}^{K-1}\mu_{i}^{t}\right)x^{t}=\widetilde{w^{t}}+\left(\sum\limits_{i=1}^{y_{l}^{t}-1}\lambda_{i}^{t}-\sum\limits_{i=y_{r}^{t}}^{K-1}\mu_{i}^{t}\right)x^{t}\qquad\text{①}$

$\displaystyle\left\{\begin{array}{l}\theta_{i}-\theta_{i}^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+\lambda_{i}^{t}=0,\ i=1,\cdots,y_{l}^{t}-1\\ \theta_{i}-\theta_{i}^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})-\mu_{i}^{t}=0,\ i=y_{r}^{t},\cdots,K-1\end{array}\right.\qquad\text{②}$

$\displaystyle\left\{\begin{array}{ll}C-\alpha_{i}^{t}-\lambda_{i}^{t}=0&i=1,\cdots,y_{l}^{t}-1\\ C-\beta_{i}^{t}-\mu_{i}^{t}=0&i=y_{r}^{t},\cdots,K-1\end{array}\right.\qquad\text{③}$

$\displaystyle\left\{\begin{array}{lcl}\lambda_{i}^{t}\geqslant 0,\alpha_{i}^{t}\geqslant 0&&i=1,\cdots,y_{l}^{t}-1\\ \mu_{i}^{t}\geqslant 0,\beta_{i}^{t}\geqslant 0&&i=y_{r}^{t},\cdots,K-1\end{array}\right.$

$\displaystyle\left\{\begin{array}{lcl}1-\xi_{i}+\theta_{i}-w\cdot x^{t}\leqslant 0&&i=1,\cdots,y_{l}^{t}-1\\ 1-\xi_{i}+w\cdot x^{t}-\theta_{i}\leqslant 0&&i=y_{r}^{t},\cdots,K-1\\ \xi_{i}\geqslant 0&&i=1,\cdots,y_{l}^{t}-1\\ \xi_{i}\geqslant 0&&i=y_{r}^{t},\cdots,K-1\end{array}\right.$

$\displaystyle\left\{\begin{array}{ll}\lambda_{i}^{t}(1-\xi_{i}+\theta_{i}-w\cdot x^{t})=0&i\in[y_{l}^{t}-1]\\ \mu_{i}^{t}(1-\xi_{i}+w\cdot x^{t}-\theta_{i})=0&i\in\{y_{r}^{t},\cdots,K-1\}\\ \alpha_{i}^{t}\xi_{i}=0&i\in[y_{l}^{t}-1]\\ \beta_{i}^{t}\xi_{i}=0&i\in\{y_{r}^{t},\cdots,K-1\}\end{array}\right.\qquad\text{④}$
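As a worked step, the first and third groups of conditions above are the stationarity conditions of the Lagrangian $L^{t}$ with respect to $w$ and $\xi_{i}$. The sketch below assumes that both slack sums enter $L^{t}$ with a plus sign, as in Eq. (16), and that $\widetilde{w^{t}}=w^{t}+\nabla_{w}F_{2}(\beta^{t})$ is the $w$-block of $\widetilde{\beta^{t}}$.

```latex
\[
\begin{aligned}
\frac{\partial L^{t}}{\partial w}
  &= w-\widetilde{w^{t}}
     -\sum_{i=1}^{y_{l}^{t}-1}\lambda_{i}^{t}x^{t}
     +\sum_{i=y_{r}^{t}}^{K-1}\mu_{i}^{t}x^{t}=0
  &&\Longrightarrow\quad
  w=\widetilde{w^{t}}+\Bigl(\sum_{i=1}^{y_{l}^{t}-1}\lambda_{i}^{t}
     -\sum_{i=y_{r}^{t}}^{K-1}\mu_{i}^{t}\Bigr)x^{t},\\
\frac{\partial L^{t}}{\partial \xi_{i}}
  &= C-\lambda_{i}^{t}-\alpha_{i}^{t}=0,
  &&\quad i=1,\cdots,y_{l}^{t}-1,\\
\frac{\partial L^{t}}{\partial \xi_{i}}
  &= C-\mu_{i}^{t}-\beta_{i}^{t}=0,
  &&\quad i=y_{r}^{t},\cdots,K-1.
\end{aligned}
\]
```

The conditions on $\theta_{i}$ follow in the same way by differentiating $L^{t}$ with respect to each threshold.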

The quantities $x^{t}$, $y_{l}^{t}$, $y_{r}^{t}$, $K$, $w^{t}$, $\theta^{t}$ and $\xi^{t}$ are now known.

The unknowns to be solved for are $w$ (i.e., $w^{t+1}$), $\theta(\theta_{1},\cdots,\theta_{K-1})$ (i.e., $\theta^{t+1}$), $\xi(\xi_{1},\cdots,\xi_{K-1})$ (i.e., $\xi^{t+1}$), $\lambda_{i}^{t}(\lambda_{1}^{t},\cdots,\lambda_{y_{l}^{t}-1}^{t})$, $\mu_{i}^{t}(\mu_{y_{r}^{t}}^{t},\cdots,\mu_{K-1}^{t})$, $\alpha_{i}^{t}(\alpha_{1}^{t},\cdots,\alpha_{y_{l}^{t}-1}^{t})$ and $\beta_{i}^{t}(\beta_{y_{r}^{t}}^{t},\cdots,\beta_{K-1}^{t})$.

Let $S_{l}^{t}=\{1\leqslant i\leqslant y_{l}^{t}-1\mid\lambda_{i}^{t}>0\}$ be the left support set and $S_{r}^{t}=\{y_{r}^{t}\leqslant i\leqslant K-1\mid\mu_{i}^{t}>0\}$ be the right support set. By the first KKT condition, $w=\widetilde{w^{t}}+a^{t}x^{t}$ with $\widetilde{w^{t}}=w^{t}+\nabla_{w}F_{2}(\beta^{t})$, where $a^{t}=\sum\limits_{i\in S_{l}^{t}}\lambda_{i}^{t}-\sum\limits_{i\in S_{r}^{t}}\mu_{i}^{t}$ and

$\displaystyle l_{i}^{t}=[1-f(x^{t})+\theta_{i}^{t}]_{+}=[1-w^{t}\cdot x^{t}+\theta_{i}^{t}]_{+},\ i\in S_{l}^{t},\qquad l_{i}^{t}=[1+f(x^{t})-\theta_{i}^{t}]_{+}=[1+w^{t}\cdot x^{t}-\theta_{i}^{t}]_{+},\ i\in S_{r}^{t}.$

1) $\lambda_{i}^{t},\alpha_{i}^{t},i\in[y_{l}^{t}-1]$ , $\mu_{i}^{t},\beta_{i}^{t},i\in\{y_{r}^{t},\cdots,K-1\}$ :

For $i\in S_{l}^{t}$, the conditions ①②③④ give four equations in four unknowns ($w,\theta_{i},\lambda_{i}^{t},\alpha_{i}^{t}$). From ② we get $\lambda_{i}^{t}=-\theta_{i}+\theta_{i}^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})$. Since $\lambda_{i}^{t}>0$ by the definition of $S_{l}^{t}$, condition ④ gives $1-\xi_{i}+\theta_{i}-w\cdot x^{t}=0$.

That is,

$\displaystyle\begin{aligned}\lambda_{i}^{t}&=-w\cdot x^{t}+1-\xi_{i}+\theta_{i}^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})\\ &\overset{\text{①}}{=}-(w^{t}+\nabla_{w}F_{2}(\beta^{t})+a^{t}x^{t})\cdot x^{t}+1-\xi_{i}+\theta_{i}^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})\\ &=1-\xi_{i}-w^{t}\cdot x^{t}-\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}-a^{t}\|x^{t}\|^{2}+\theta_{i}^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})\\ &\overset{\text{def. of }l_{i}^{t}}{=}l_{i}^{t}-\xi_{i}-\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}-a^{t}\|x^{t}\|^{2}+\nabla_{\theta_{i}}F_{2}(\beta^{t}),\quad i\in S_{l}^{t}.\end{aligned}$

On the other hand, from ③, we get $\lambda_{i}^{t}=C-\alpha_{i}^{t},i\in[y_{l}^{t}-1]$.

Combining this with $\alpha_{i}^{t}\geqslant 0$, $\xi_{i}\geqslant 0$ and $\alpha_{i}^{t}\xi_{i}=0$ gives

$\displaystyle\lambda_{i}^{t}=\min\{l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{l}^{t},$

$\displaystyle\alpha_{i}^{t}\overset{\text{③}}{=}C-\lambda_{i}^{t}=C-\min\{l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{l}^{t},$

where $\nabla_{w}F_{2}(\beta^{t})=C\sum\limits_{i=1}^{y_{l}^{t}-1}l_{s}^{\prime w}(\theta_{i}^{t}-\langle w^{t},x^{t}\rangle)$.
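The step from the two expressions for $\lambda_{i}^{t}$ to the $\min$ form can be spelled out by a short case analysis on the slack variable. Here $A_{i}$ is shorthand introduced only for this sketch; it is not notation from the derivation above.

```latex
\[
A_{i}:=l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}
      +\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},
\qquad
\lambda_{i}^{t}=A_{i}-\xi_{i},\qquad
\lambda_{i}^{t}=C-\alpha_{i}^{t}\leqslant C .
\]
\[
\begin{array}{lll}
\xi_{i}=0: & \lambda_{i}^{t}=A_{i}\leqslant C
           & \Rightarrow\ \lambda_{i}^{t}=\min\{A_{i},C\},\\
\xi_{i}>0: & \alpha_{i}^{t}=0,\ \lambda_{i}^{t}=C,\ A_{i}=C+\xi_{i}>C
           & \Rightarrow\ \lambda_{i}^{t}=\min\{A_{i},C\}.
\end{array}
\]
```

The same two cases, with $\beta_{i}^{t}$ in place of $\alpha_{i}^{t}$, give the $\min$ form for $\mu_{i}^{t}$ on the right support set.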

For $i\in S_{r}^{t}$, in the same way,

$\displaystyle\begin{aligned}\mu_{i}^{t}&=1-\xi_{i}-\theta_{i}^{t}+w^{t}\cdot x^{t}+\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2}\\ &\overset{\text{def. of }l_{i}^{t}}{=}l_{i}^{t}-\xi_{i}+\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},\quad i\in S_{r}^{t},\end{aligned}$

where $\xi_{i}$ and $a^{t}$ are unknown.

On the other hand, from ③, $\mu_{i}^{t}=C-\beta_{i}^{t},i\in\{y_{r}^{t},\cdots,K-1\}$.

Combining this with $\beta_{i}^{t}\geqslant 0$, $\xi_{i}\geqslant 0$ and $\beta_{i}^{t}\xi_{i}=0$ gives

$\displaystyle\mu_{i}^{t}=\min\{l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{r}^{t},$

$\displaystyle\beta_{i}^{t}\overset{\text{③}}{=}C-\mu_{i}^{t}=C-\min\{l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{r}^{t},$

where $\nabla_{w}F_{2}(\beta^{t})=C\sum\limits_{i=y_{r}^{t}}^{K-1}l_{s}^{\prime w}(\theta_{i}^{t}-\langle w^{t},x^{t}\rangle)$.

$a^{t}$: Substituting the values of $\lambda_{i}^{t}$ and $\mu_{i}^{t}$ into the expression for $a^{t}$, we get

$\displaystyle\begin{aligned}a^{t}&=\sum\limits_{i\in S_{l}^{t}}\lambda_{i}^{t}-\sum\limits_{i\in S_{r}^{t}}\mu_{i}^{t}\\ &=\sum\limits_{i\in S_{l}^{t}}\min\{l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\}-\sum\limits_{i\in S_{r}^{t}}\min\{l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\}\\ &=\sum\limits_{i\in S_{l}^{t}}\min\{1-w^{t}\cdot x^{t}+\theta_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\}-\sum\limits_{i\in S_{r}^{t}}\min\{1+w^{t}\cdot x^{t}-\theta_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\}.\end{aligned}$

Because $a^{t}$ appears on both sides, this equation has no explicit solution for $a^{t}$ (the same holds below). So,

$\displaystyle\lambda_{i}^{t}=\min\{l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\},$

$\displaystyle\alpha_{i}^{t}=C-\min\{l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{l}^{t},$

where $\lambda_{i}^{t}=0$ and, by the condition $C-\alpha_{i}^{t}-\lambda_{i}^{t}=0$, $\alpha_{i}^{t}=C$ for $i\in[y_{l}^{t}-1]\backslash S_{l}^{t}$.

$\displaystyle\mu_{i}^{t}=\min\{l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\},$

$\displaystyle\beta_{i}^{t}=C-\min\{l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{r}^{t},$

where $\mu_{i}^{t}=0$ and, by the condition $C-\beta_{i}^{t}-\mu_{i}^{t}=0$, $\beta_{i}^{t}=C$ for $i\in\{y_{r}^{t},\cdots,K-1\}\backslash S_{r}^{t}$.
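Since $a^{t}$ is only defined implicitly, one way to compute it numerically is a one-dimensional root search. The following is a minimal sketch under stated assumptions, not the iterative approach of the PA-RAMP algorithm itself: the function name `solve_a`, its arguments, and the choice of bisection are all hypothetical, and each multiplier is clipped to $[0,C]$ so that indices outside the support sets contribute zero. The inputs are the parts of the $\min$ arguments that do not depend on $a^{t}$.

```python
def clip(value, low, high):
    """Clamp value to the interval [low, high]."""
    return max(low, min(high, value))


def solve_a(left_terms, right_terms, x_norm_sq, C, tol=1e-10):
    """Solve a = sum_i lambda_i(a) - sum_j mu_j(a) for the scalar a by bisection.

    left_terms[i]  : l_i - grad_w.x + grad_theta_i  (a-independent part, left set)
    right_terms[j] : l_j + grad_w.x - grad_theta_j  (a-independent part, right set)
    Assumption of this sketch: each multiplier is clipped to [0, C].
    """
    def residual(a):
        lam = sum(clip(c - a * x_norm_sq, 0.0, C) for c in left_terms)
        mu = sum(clip(d + a * x_norm_sq, 0.0, C) for d in right_terms)
        return lam - mu - a  # strictly decreasing in a, so the root is unique

    # Each clipped term lies in [0, C], so residual(lo) > 0 and residual(hi) < 0.
    lo = -C * len(right_terms) - 1.0
    hi = C * len(left_terms) + 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# One violated left threshold with unit-norm x: a solves a = min(2 - a, C).
a_uncapped = solve_a([2.0], [], x_norm_sq=1.0, C=10.0)
a_capped = solve_a([2.0], [], x_norm_sq=1.0, C=0.5)
print(a_uncapped, a_capped)
```

In the first call the cap $C$ is inactive and the root is $a=1$; in the second the multiplier saturates at $C=0.5$. Bisection is used here only because the residual is monotone in $a$; any scalar root finder would serve.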

2) $\theta(\theta_{i}^{t+1})$ :

For $i\in S_{l}^{t}$, since $\lambda_{i}^{t}$ has no explicit solution and $\theta_{i}^{t}$ is known, from ② we have

$\displaystyle\theta_{i}^{t+1}=\theta_{i}^{t}-\lambda_{i}^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})=\theta_{i}^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-\min\{l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{l}^{t}.$

For $i\in S_{r}^{t}$, in the same way, from the expression for $\mu_{i}^{t}$ and the known $\theta_{i}^{t}$, we have

$\displaystyle\theta_{i}^{t+1}=\theta_{i}^{t}+\mu_{i}^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})=\theta_{i}^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})+\min\{l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{r}^{t}.$

3) $w(w^{t+1})$ :

From ①, we have $w^{t+1}=w^{t}+\nabla_{w}F_{2}(\beta^{t})+a^{t}x^{t}$, where $a^{t}$ has no explicit solution and

$\displaystyle\nabla_{w}F_{2}(\beta^{t})=C\sum\limits_{i=1}^{y_{l}^{t}-1}l_{s}^{\prime w}(\theta_{i}^{t}-\langle w^{t},x^{t}\rangle)+C\sum\limits_{i=y_{r}^{t}}^{K-1}l_{s}^{\prime w}(\theta_{i}^{t}-\langle w^{t},x^{t}\rangle).$

4) $\xi(\xi_{i}^{t+1})$ :

For $i\in S_{l}^{t}$, from ④ and the expression for $\theta_{i}^{t+1}$, we have

$\displaystyle\xi_{i}^{t+1}=1-w^{t}\cdot x^{t}+\theta_{i}^{t+1}=1-w^{t}\cdot x^{t}+\theta_{i}^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-\min\{l_{i}^{t}-\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}+\nabla_{\theta_{i}}F_{2}(\beta^{t})-a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{l}^{t}.$

For $i\in S_{r}^{t}$, in the same way, we have

$\displaystyle\xi_{i}^{t+1}=1+w^{t}\cdot x^{t}-\theta_{i}^{t+1}=1+w^{t}\cdot x^{t}-\theta_{i}^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})-\min\{l_{i}^{t}+\nabla_{w}F_{2}(\beta^{t})\cdot x^{t}-\nabla_{\theta_{i}}F_{2}(\beta^{t})+a^{t}\|x^{t}\|^{2},C\},\quad i\in S_{r}^{t},$

where $a^{t}$ has no explicit solution. An iterative approach for computing it is provided in the PA-RAMP algorithm.
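Once a value of $a^{t}$ is available, the closed-form expressions for $\lambda_{i}^{t}$, $\mu_{i}^{t}$, $\theta_{i}^{t+1}$ and $w^{t+1}$ can be evaluated directly. The following Python sketch shows one way to do so; the function `pa_ramp_step` and its argument names are hypothetical, the gradient of $F_{2}$ is passed in rather than computed, multipliers are clipped to $[0,C]$, and thresholds that belong to neither constrained index set are assumed to receive only the gradient shift.

```python
def pa_ramp_step(w, theta, x, y_l, y_r, a, grad_w, grad_theta, C):
    """One update of (w, theta) from the closed-form expressions, for a given a.

    theta holds theta_1..theta_{K-1} at indices 0..K-2; labels y_l <= y_r are 1-based.
    grad_w and grad_theta stand for the gradient of F_2 at the current iterate.
    """
    score = sum(wi * xi for wi, xi in zip(w, x))       # w^t . x^t
    shift = sum(gi * xi for gi, xi in zip(grad_w, x))  # grad_w F_2 . x^t
    norm_sq = sum(xi * xi for xi in x)                 # ||x^t||^2

    new_theta = []
    for k, th in enumerate(theta):
        i = k + 1                                      # 1-based threshold index
        if i <= y_l - 1:                               # left set: want w.x - theta_i >= 1
            loss = max(0.0, 1.0 - score + th)
            lam = min(C, max(0.0, loss - shift + grad_theta[k] - a * norm_sq))
            new_theta.append(th + grad_theta[k] - lam)
        elif i >= y_r:                                 # right set: want w.x - theta_i <= -1
            loss = max(0.0, 1.0 + score - th)
            mu = min(C, max(0.0, loss + shift - grad_theta[k] + a * norm_sq))
            new_theta.append(th + grad_theta[k] + mu)
        else:                                          # threshold inside the label interval
            new_theta.append(th + grad_theta[k])

    new_w = [wi + gi + a * xi for wi, gi, xi in zip(w, grad_w, x)]
    return new_w, new_theta


# Two ranks (K = 2), true label 2, zero initial model, zero F_2 gradient, a = 0.5.
new_w, new_theta = pa_ramp_step([0.0], [0.0], [1.0], y_l=2, y_r=2, a=0.5,
                                grad_w=[0.0], grad_theta=[0.0], C=10.0)
print(new_w, new_theta)
```

In this toy call the updated model satisfies the margin constraint with equality, since $w^{t+1}\cdot x^{t}-\theta_{1}^{t+1}=0.5+0.5=1$.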

References

1. Ahn and Kim K.J., Corporate credit rating using multiclass classification models with order information, Comput. Oper. Res 5(12) (2011), 1783–1788.

2. Crammer and Singer, Pranking with ranking, Advances in Neural Information Processing Systems 14 (2002), 641–647.

3. Harrington E.F., Online ranking/collaborative filtering using the perceptron algorithm, in: Proceedings of the Twentieth International Conference on Machine Learning, 2002, pp. 250–257.

4. Rennie J.D. and Srebro, Loss functions for preference levels: Regression with discrete ordered labels, in: Proceedings of The IJCAI Multidisciplinary Workshop on Advances in Preference Handling, 2005, pp. 180–186.

5. Guisan and Harrell E.F., Ordinal response regression models in ecology, Journal of Vegetation Science 11(5) (2000), 617–626.

6. Doyle O.M., Westman, Marquand A.F., Mecocci, Vellas, Tsolaki, Kloszewska, Soininen, Lovestone and Williams S.C., Predicting progression of Alzheimer’s disease using ordinal regression, PloS One 9(8) (2014), 1–10.

7. Mccullagh, Regression models for ordinal data, Journal of The Royal Statistical Society: Series B (Methodological) 42(2) (1980), 109–127.

8. Herbrich, Graepel and Obermayer, Large margin rank boundaries for ordinal regression, Advances in Large Margin Classifiers, MIT Press, Cambridge, 2000, 115–132.

9. Gutierrez P.A., Ortiz M.P., Monedero J.S., Navarro F.F. and Martinez C.H., Ordinal regression methods: Survey and experimental study, IEEE Transactions on Knowledge and Data Engineering 28(1) (2015), 127–146.

10. Shashua and Levin, Ranking with large margin principle: two approaches, in: Advances in Neural Information Processing Systems, 2003, pp. 961–968.

11. Chu and Keerthi S.S., New approaches to support vector ordinal regression, in: Proceedings of The 22nd International Conference on Machine Learning, ACM, 2005, pp. 145–152.

12. Pedregosa, Bach and Gramfort, On the consistency of ordinal regression methods, The Journal of Machine Learning Research 18(1) (2017), 1769–1803.

13. Manwani, PRIL: perceptron ranking using interval labeled data, in: ACM India Joint International Conference, ACM, 2019, pp. 1769–1803.

14. Sahoo, Hoi S.C. and Li, Large scale online multiple kernel regression with application to time-series prediction, ACM Transactions on Knowledge Discovery from Data (TKDD) 13(1) (2019), 9.

15. Manwani and Chandra, Exact passive-aggressive algorithms for learning to rank using interval label, in: IEEE Transactions on Neural Networks and Learning Systems, 2018, pp. 19–38.

16. Shah and Manwani, Online active learning of reject option classifiers, in: National Conference on Artificial Intelligence, 2020, pp. 134–176.

17. Huang and Shi, Support vector machine classifier with pinball loss, IEEE Transactions on Pattern Analysis and Machine Intelligence 36(5) (2014), 984–997.

18. Liu, Shi and Tian, Ramp loss nonparallel support vector machine for pattern classification, Knowledge-Based Systems 85 (2016), 224–233.

19. Liu, Shi, Tian and Huang, Ramp loss least squares support vector machine, Journal of Computational Science 14 (2016), 61–68.

20. Tian, Mirzabagheri, Bamakan S.M.H., Wang and Qu, Ramp loss one-class support vector machine: A robust and effective approach to anomaly detection problems, Neurocomputing 310 (2018), 223–235.

21. Wang and Zhou, All-in-one multicategory ramp loss maximum margin of twin spheres support vector machine, Applied Intelligence 49(6) (2019), 2301–2314.

22. Lipp and Boyd, Variations and extension of the convex-concave procedure, Optimization and Engineering 17(2) (2016), 263–287.

23. Blake and Merz C.J., UCI repository of machine learning databases, Department of Information and Computer Science, University of California, Irvine, CA 55.

24. Chang C.C. and Lin C.J., LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology, 2011.

25. Wang, Jiang, Huang and Zhang, Robust variable selection with exponential squared loss, Journal of the American Statistical Association 108(50) (2013), 632–643.

26. Liu, Pokharel P.P. and Jose, The kernel least-mean-square algorithm, IEEE Transactions on Signal Processing 56(2) (2008), 543–554.


2 Available: https://archive.ics.uci.edu/ml/.

Table 1 Ordinal regression datasets
