Active learning for ordinal classification on incomplete data

Abstract

Existing active learning algorithms typically assume that the data provided are complete. Nonetheless, data with missing values are common in real-world applications, and active learning on incomplete data is less studied. This paper studies the problem of active learning for ordinal classification on incomplete data. Although cutting-edge imputation methods can be used to impute the missing values before commencing active learning, inaccurately imputed instances are unavoidable and may degrade the ordinal classifier’s performance once labeled. Therefore, the crucial question in this work is how to reduce the negative impact of imprecisely filled instances on active learning. First, to avoid selecting filled instances with high imputation imprecision, we propose penalizing the query selection with a novel imputation uncertainty measure that combines a feature-level imputation uncertainty and a knowledge-level imputation uncertainty. Second, to mitigate the adverse influence of potentially labeled imprecisely imputed instances, we suggest using a diversity-based uncertainty sampling strategy to select query instances in specified candidate instance regions. Extensive experiments on nine public ordinal classification datasets with varying value missing rates show that the proposed approach outperforms several baseline methods.

Keywords

Active learning; Incomplete data; Ordinal classification; Imputation uncertainty

1. Introduction

Ordinal classification (OC), also known as ordinal regression, is a particular case of multi-class classification where the instances are labeled on an ordinal scale [1]. Since an ordered relation exists among the categories in many real situations, ordinal classification has a wide range of applications, such as disease severity estimation [2] in the medical field, bank failure prediction [3] in the financial area, and facial age estimation [4] in the computer vision domain. As a supervised learning task, OC usually relies on a sufficient amount of labeled data to train an ordinal prediction model. However, label acquisition for ordinal instances is usually expensive and time-consuming due to the dependence on domain expertise [1], which prevents the collection of a large number of labeled instances. In this circumstance, active learning (AL) [5], which interactively queries annotators for the labels of the most valuable unlabeled instances, can be a cost-effective way to obtain an accurate ordinal prediction model [6, 7].

It is worth noting that collected ordinal data are frequently incomplete [8]. For instance, the ordinal data collected in clinical trials will almost inevitably be incomplete [9]. In general, incomplete data arise from imperfect data collection procedures, such as device malfunction, sensor failure, and operator mistakes [10]. Motivated by this fact, this paper studies the problem of active ordinal classification in the presence of missing values. Although a large body of well-established active learning algorithms has emerged in the past few decades, the active learning community has paid little attention to active learning for incomplete data.

Figure 1.

Illustration of the imputation results of two different imputation methods on a three-class artificial ordinal classification dataset. Incompleteness is simulated by removing 30% of the values under the MCAR mechanism [13]. (a) shows the original ordinal data without missing values; (b) shows the imputation result of the incomplete-case $k$-nearest neighbor imputation (ICkNNI) [11] with $k=5$; (c) shows the imputation result of multiple imputation by chained equations (MICE) [12] based on random forest regression, obtained by averaging the values of five imputations.

Data completeness is a major assumption of most active learning algorithms. Listwise deletion, which discards the instances with missing values, is the simplest method for preprocessing incomplete data. However, it may result in a significant information loss. Therefore, a better way to run an active learning algorithm on incomplete ordinal data is to fill the missing values beforehand. We can utilise any state-of-the-art imputation method to fill the missing values. More sophisticated imputation methods generally yield better imputation quality. However, even with a well-developed imputation method, imprecise imputations will inevitably exist in the imputed data. Figure 1 illustrates the imputation results of two state-of-the-art imputation methods on a three-class artificial ordinal classification dataset with a missing rate of 30%. The results of the two imputation methods show that many imputed instances are considerably distorted. Because of the inherent uncertainty in missing value imputation, some values may be imputed inaccurately [14]. The resulting distorted instances, i.e., the imprecisely imputed instances, can be detrimental to an active learning procedure: the ordinal classifier’s performance will suffer once highly distorted instances are labeled. Therefore, in this study, two fundamental questions arise: how to prevent labeling the severely imprecisely imputed instances, and how to mitigate the negative impact on active learning of the imprecisely imputed instances that are nevertheless labeled.

To perform active learning, we fill the missing values with a multiple imputation method, MICE [12]. To avoid selecting filled instances with high imputation imprecision, we devise a novel imputation uncertainty measure to penalize the query selection function. In the seminal work of [14], a feature-level imputation uncertainty is suggested to penalize the base query selection function so that highly inaccurately imputed instances are not selected. In contrast, our method considers not only the feature-level imputation uncertainty but also the imputation uncertainty on the knowledge level. The feature-level imputation uncertainty aggregates the imputation variances in an instance and reflects the scatter of the multiple imputations. The knowledge-level imputation uncertainty reflects the prediction uncertainty of an ordinal prediction model on the multiple imputations of an incomplete instance.

The proposed imputation uncertainty measure penalizes the query selection function so that it is less likely to select highly distorted instances. Nevertheless, it cannot completely prevent distorted instances from being selected. The labeled distorted instances may inhibit the base learner’s hyperplanes from quickly converging to the proper positions, thus causing the active learner to label more low-value instances. To mitigate the negative impact of the potentially labeled distorted instances on active ordinal classification, we suggest conducting diversity-based uncertainty sampling in a specified candidate instance region based on the structure of a threshold-based ordinal classification model, i.e., the support vector machine with reduction framework (RED-SVM) [15]. We specify the candidate instance region based on the state of the decision hyperplanes in the prediction model before each query selection. Based on the value of a specially designed hyperplane state coefficient, we restrict the query instance to a specific region in which labeling an instance is likely to promote the convergence of the current hyperplane.

The contributions of this work are outlined as follows.

• We introduce a novel imputation uncertainty measure by simultaneously considering the feature-level imputation uncertainty and the knowledge-level imputation uncertainty. The proposed imputation uncertainty measure can prevent the query selection function from selecting the highly distorted instances.

• To mitigate the negative impact of the potentially labeled imprecisely imputed instances on active learning, we specify the candidate instance region based on the state of the current decision hyperplane. The ordinal predictive model is more likely to be promoted by labeling the instance from the specified region.

• We conduct experiments on several public ordinal classification datasets under different value missing rates, and the results demonstrate that the proposed method is superior to the competitors.

The remainder of this paper is structured as follows. Section 2 reviews the related work from aspects of active learning and incomplete data processing and briefly recalls the formulation of the threshold-based ordinal classification model with a reduction framework. Section 3 provides the technical details of the proposed method. The experiment setting and experimental results are reported in Section 4. Finally, the conclusion is drawn in Section 5.

2. Related work

This section gives a brief overview of active learning, incomplete data processing, and threshold ordinal classification with a reduction framework.

2.1 Active learning

Active learning aims to train a reliable predictive model while minimizing labeling costs. Thus, it is beneficial for many machine learning scenarios where label acquisition is expensive. The key issue of active learning is to develop a query selection strategy to determine which unlabeled instances would be the most useful (i.e., improve the prediction model most) if they are labeled and used as training instances. Informativeness and representativeness are two conventional policies to measure the usefulness of an unlabeled instance.

The query selection strategies based on an instance’s informativeness include uncertainty sampling [16], query by committee [17], and expected change [18]. Uncertainty sampling methods select the instances whose current predictions are maximally uncertain [16]. Query-by-committee trains a set of prediction models and selects the unlabeled instances on which the models disagree the most [17]. The methods based on expected change calculate the change in the model when an unlabeled instance is assigned each possible label and weight the change by the estimated label probability [18]. The query selection strategies based on an instance’s representativeness include clustering-based methods [19, 20] and experimental design-based methods [21, 22]. The clustering-based active learning methods explore the clustering or manifold structure of the data and select the instances that represent the intrinsic geometry of the data. The experimental design-based methods rely on a data reconstruction framework, ensuring that the selected data have high representative power [22].

While much progress has been made for active learning algorithms, little attention has been focused on ordinal classification. Li et al. [6] introduced an A-optimal experimental design method for ordinal classification based on an adjacent category logistic model. However, this method needs to calculate the inverse of a large matrix, limiting its usability in real situations. Ge et al. [7] proposed an uncertainty sampling method for ordinal classification. But, potential sampling redundancy cannot be avoided in this method. To the best of our knowledge, the two works mentioned above are the only two active learning studies for ordinal classification, and active learning for ordinal classification in the presence of missing values has not been investigated. Therefore, designing an effective active learning algorithm that targets ordinal classification with missing values is essential.

In previous active learning studies that encountered incomplete data, one typically imputes the data in the preprocessing stage and rarely considers the impact or the validity of imputed values [14]. Recently, Han and Kang [14] proposed an active learning paradigm in which the imputation uncertainty is considered in the query selection. However, this method only considers the feature-level imputation uncertainty and ignores the uncertainty of the predictive output derived from the multiple imputations. Therefore, this study introduces a novel imputation uncertainty by simultaneously considering the feature-level imputation uncertainty and the knowledge-level imputation uncertainty. Furthermore, we also seek to mitigate the negative impact caused by the underlying labeled imprecisely imputed instances.

2.2 Incomplete data processing

The presence of missing values makes it difficult to apply traditional active learning methods immediately to incomplete data. Removing the instances with missing values is a simple ad hoc solution. However, a significant information loss will occur if the missing rate is high. Filling the missing values with plausible values before active learning can be an appealing solution. A variety of imputation methods, including single and multiple imputation methods, have been proposed to address the missing values.

Some prevalent single imputation methods, such as the incomplete-case $k$-nearest neighbors imputation [11] and the conditional mean imputation [23], can be used to address the missing values in incomplete ordinal data. However, the uncertainty of the imputed values is often not explicitly modeled in most single imputation methods [24]. The main idea of multiple imputation is to use the distribution of the observed components to estimate a set of plausible values for each missing value [24, 25, 12]. By creating a set of imputations for each missing value, multiple imputation overcomes the limitation of single imputation by introducing statistical uncertainty into the imputation. Multiple imputation by chained equations (MICE) has emerged as one principled method of addressing missing data [12]. It imputes every variable conditional on all other variables, creating multiple completed datasets based on random components. Therefore, to facilitate the quantification of imputation uncertainty, we use the multiple imputation method MICE to fill the incomplete ordinal data in this study.

2.3 Threshold ordinal classification based on reduction framework

Ordinal classification is generally defined as the problem of classifying input instances on an ordinal scale. Various dedicated methods have been developed to solve the ordinal classification problem, such as ordinal binary decomposition-based methods [26] and threshold-based methods [15]. Prior studies have indicated that threshold models generally produce competitive performance [26]. Therefore, this study adopts a well-developed threshold-based ordinal classification model as the base learner of our active learner. This model is designed based on a reduction framework [15]. We recall its formulation here as a preliminary to the proposed method.

Let $\{\langle\textbf{x}_{i},y_{i}\rangle\}_{i=1}^{n}$ be the training set, where each instance $\textbf{x}_{i}\in\mathcal{X}\subseteq\mathbb{R}^{d}$ is a feature vector associated with an ordinal label $y_{i}\in\mathcal{Y}=\{\mathcal{C}_{1},\mathcal{C}_{2},\ldots,\mathcal{C}_{K}\}$. The class labels are totally ordered as $\mathcal{C}_{1}\prec\mathcal{C}_{2}\prec\ldots\prec\mathcal{C}_{K}$. Without loss of generality, the $K$ ordinal class labels are represented by the $K$ consecutive integers $\{1,2,\ldots,K\}$. To predict the label of an instance $\textbf{x}$, the threshold ordinal classification model learns $K-1$ ordered thresholds $\theta_{1}<\theta_{2}<\ldots<\theta_{K-1}$, and assumes that $\theta_{0}=-\infty$ and $\theta_{K}=+\infty$ [26]. The instance $\textbf{x}$ is predicted to belong to the $k$-th class if its prediction output $h(\textbf{x})=\textbf{w}^{T}\textbf{x}$ falls between $\theta_{k-1}$ and $\theta_{k}$, where $\textbf{w}\in\mathbb{R}^{d}$ and $\textbf{w}^{T}\textbf{x}$ denotes the inner product.

Ordinal classification based on a reduction framework reduces the ordinal-class problem to extended binary classification problems. Then, all the binary classification problems are solved jointly to obtain a single binary classifier [15]. For each original training instance $\langle\textbf{x}_{i},y_{i}\rangle$ , the reduction framework extends it into the following $K-1$ binary training instances:

$\displaystyle\begin{aligned}&\left\langle\textbf{x}_{i}^{k},y_{i}^{k}\right\rangle,\quad k\in\{1,\ldots,K-1\},\\&\textbf{x}_{i}^{k}=(\textbf{x}_{i},\textbf{e}_{k})\in\mathbb{R}^{d+K-1},\\&y_{i}^{k}=1-2I[y_{i}\leqslant k],\end{aligned}$ (1)

where $\textbf{e}_{k}\in\mathbb{R}^{K-1}$ is a vector whose $k$-th element is 1 and whose remaining elements are all 0, and $I\left[\cdot\right]$ is an indicator function that returns 1 if the inner condition holds and 0 otherwise. Each extended instance $\textbf{x}_{i}^{k}$ is associated with a binary class label $y_{i}^{k}\in\{-1,+1\}$.

Let the weight vector in the binary classification problem be

$\displaystyle\bar{\textbf{w}}=(\textbf{w},-{\bm{\theta}})\in\mathbb{R}^{d+K-1}.$ (2)

Thus, the prediction output of $\textbf{x}_{i}^{k}$ can be expressed as $g(\textbf{x}_{i}^{k})=(\textbf{w},-{\bm{\theta}})^{T}\textbf{x}_{i}^{k}=\textbf{w}^{T}\textbf{x}_{i}-\theta_{k}=h(\textbf{x}_{i})-\theta_{k}$. The threshold $\theta_{k}$ in the ordinal classification model is estimated based on feature extension. Finally, the label of instance $\textbf{x}_{i}$ is predicted as

$\displaystyle\hat{y}_{i}=1+\sum_{k=1}^{K-1}I[g(\textbf{x}_{i}^{k})>0],$ (3)

where $I[\cdot]$ is an indicator function with the same meaning as that in Eq. (1).
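The reduction in Eq. (1) and the prediction rule in Eq. (3) can be illustrated with a minimal sketch for the linear case. The function names `extend_instance` and `predict_label` are hypothetical and chosen only for this illustration; the weight vector and thresholds are assumed to be given rather than learned.

```python
def extend_instance(x, y, num_classes):
    """Build the K-1 extended binary instances of Eq. (1) for one labeled instance.

    x is a list of d feature values, y is an integer label in {1, ..., K}.
    """
    extended = []
    for k in range(1, num_classes):
        # e_k: vector of length K-1 with a 1 in position k (1-based).
        e_k = [1.0 if j == k else 0.0 for j in range(1, num_classes)]
        # y_i^k = 1 - 2 * I[y_i <= k]: -1 if y <= k, +1 otherwise.
        y_k = 1 - 2 * int(y <= k)
        extended.append((list(x) + e_k, y_k))
    return extended


def predict_label(x, w, thetas):
    """Apply Eq. (3): one plus the number of thresholds that h(x) = w^T x exceeds."""
    h = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return 1 + sum(1 for theta in thetas if h - theta > 0)
```

For a three-class problem, an instance with label 2 yields one positive extended instance (for $k=1$) and one negative extended instance (for $k=2$).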

The above reduction framework has been instantiated based on the binary SVM model [7]. We call this model RED-SVM. Based on the SVM formulation, the loss function of ordinal classification can be derived as the following primal problem

$\displaystyle\begin{aligned}\min_{\textbf{w},{\bm{\theta}},\xi_{i}^{k},b}\quad&\frac{1}{2}\left\lVert\textbf{w}\right\rVert^{2}+\frac{1}{2}\left\lVert{\bm{\theta}}\right\rVert^{2}+C\sum_{i=1}^{n}\sum_{k=1}^{K-1}\xi_{i}^{k}\\\text{s.t.}\quad&y_{i}^{k}(\textbf{w}^{T}\phi(\textbf{x}_{i})-\theta_{k}-b)\geqslant 1-\xi_{i}^{k},\\&\xi_{i}^{k}\geqslant 0,\quad\forall i\in\{1,\ldots,n\},\quad k\in\{1,\ldots,K-1\},\end{aligned}$ (4)

where $\phi$ denotes the nonlinear feature mapping induced by a kernel function, and $\xi_{i}^{k}$ is the slack variable that tolerates the error of $\textbf{x}_{i}$ at the $k$-th decision boundary.

The dual form of the minimization problem in Eq. (4) is then derived as

$\displaystyle\begin{aligned}\max_{\alpha}\quad&\sum_{i=1}^{n}\sum_{k=1}^{K-1}\alpha_{i}^{k}-\frac{1}{2}\sum_{i=1}^{n}\sum_{k_{1}=1}^{K-1}\sum_{j=1}^{n}\sum_{k_{2}=1}^{K-1}y_{i}^{k_{1}}y_{j}^{k_{2}}\alpha_{i}^{k_{1}}\alpha_{j}^{k_{2}}\mathcal{K}(\textbf{x}_{i}^{k_{1}},\textbf{x}_{j}^{k_{2}})\\\text{s.t.}\quad&0\leqslant\alpha_{i}^{k}\leqslant C,\quad\forall i\in\{1,\ldots,n\},\quad k\in\{1,\ldots,K-1\},\\&\sum_{i=1}^{n}\sum_{k=1}^{K-1}y_{i}^{k}\alpha_{i}^{k}=0,\end{aligned}$ (5)

where $\mathcal{K}(\textbf{x}_{i}^{k_{1}},\textbf{x}_{j}^{k_{2}})=\phi(\textbf{x}_{i})^{T}\phi(\textbf{x}_{j})+\textbf{e}_{k_{1}}^{T}\textbf{e}_{k_{2}}$ is the resultant kernel with respect to $\textbf{x}_{i}^{k_{1}}$ and $\textbf{x}_{j}^{k_{2}}$. Hence, the prediction output of $\textbf{x}_{i}^{k}$ is

$\displaystyle g(\textbf{x}_{i}^{k})=(\textbf{w},-{\bm{\theta}})^{T}\textbf{x}_{i}^{k}=\bar{\mathcal{W}}^{T}\phi(\textbf{x}_{i}^{k}),$ (6)

where $\bar{\mathcal{W}}$ is the weight vector derived as

$\displaystyle\bar{\mathcal{W}}=\sum_{i=1}^{n}\sum_{k=1}^{K-1}\alpha_{i}^{k}y_{i}^{k}\phi(\textbf{x}_{i}^{k}).$ (7)

The dual problem in Eq. (5) can be solved by standard SVM solvers. Treating the number of classes $K$ as a constant, the computational complexity of the RED-SVM model is $\mathcal{O}((K-1)^{2}n^{2})=\mathcal{O}(n^{2})$, where $n$ is the number of original training instances.
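The resultant kernel used in the dual problem can be sketched as follows. Since $\textbf{e}_{k_{1}}^{T}\textbf{e}_{k_{2}}$ equals 1 when $k_{1}=k_{2}$ and 0 otherwise, the extended kernel is the base kernel plus an indicator term. The RBF base kernel and its parameter `gamma` are assumptions made for this illustration; the function names are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Base kernel phi(x)^T phi(z); the RBF choice and gamma are illustrative assumptions."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def extended_kernel(x_i, k1, x_j, k2, base_kernel=rbf_kernel):
    """Kernel between extended instances x_i^{k1} and x_j^{k2}.

    e_{k1}^T e_{k2} is 1 when k1 == k2 and 0 otherwise.
    """
    return base_kernel(x_i, x_j) + (1.0 if k1 == k2 else 0.0)
```

A Gram matrix built from `extended_kernel` over all $(K-1)n$ extended instances could then be passed to any binary SVM solver that accepts a precomputed kernel.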

3. The proposed method

This section provides the technical details of the proposed active learning method. We first introduce a novel imputation uncertainty measure in Subsection 3.1. Then, in Subsection 3.2, we design a novel query selection strategy by combining the imputation uncertainty with a specially designed diversity-based uncertainty measure. In this query selection strategy, a hyperplane state measure is designed to specify the candidate instance region for each query instance so as to promote the prediction model. Subsection 3.3 summarizes the computational complexity of the proposed active learning method.

3.1 Imputation uncertainty

Before active learning, we first impute the missing values by the multiple imputation method MICE. For each incomplete instance $\textbf{x}_{i}$ , the imputation model generates $m$ imputed instances $\{\hat{\textbf{x}}_{i}^{(r)}\}_{r=1}^{m}$ . We can get the complete instance by averaging the multiple imputed instance vectors as follows:

$\displaystyle\hat{\textbf{x}}_{i}=\frac{1}{m}\sum\limits_{r=1}^{m}\hat{\textbf{x}}_{i}^{(r)}\,.$ (8)

Suppose there are $q$ missing values in instance $\textbf{x}_{i}$ , the unobserved attribute index set for $\textbf{x}_{i}$ is $u_{i}=\{u_{i}^{1},\ldots,u_{i}^{q}\}$ , and the multiple imputed values on the $j$ -th attribute are $\{\hat{\textbf{x}}_{ij}^{(r)}\}_{r=1}^{m}$ , $j=u_{i}^{1},\ldots,u_{i}^{q}$ . The variance among the imputed values can be used to quantify the imputation uncertainty [14]. Thus, the imputation uncertainty on the $j$ -th attribute of $\hat{\textbf{x}}_{i}$ can be computed as

$\displaystyle v_{ij}=\frac{1}{m-1}\sum\limits_{r=1}^{m}(\hat{\textbf{x}}_{ij}^{(r)}-\hat{\textbf{x}}_{ij})^{2},\quad j=u_{i}^{1},\ldots,u_{i}^{q},$ (9)

where $\hat{\textbf{x}}_{ij}$ is the $j$-th attribute value in $\hat{\textbf{x}}_{i}$. Therefore, the feature-level imputation uncertainty of an instance is defined as the summation of the imputation variances on all missing features as follows [14]

$\displaystyle\textit{FIU}(\hat{\textbf{x}}_{i})=\frac{1}{m-1}\sum\limits_{j\in u_{i}}\sum\limits_{r=1}^{m}(\hat{\textbf{x}}_{ij}^{(r)}-\hat{\textbf{x}}_{ij})^{2}=\frac{1}{m-1}\sum\limits_{r=1}^{m}(\hat{\textbf{x}}_{i}^{(r)}-\hat{\textbf{x}}_{i})^{T}(\hat{\textbf{x}}_{i}^{(r)}-\hat{\textbf{x}}_{i})\,.$ (10)

The second equality holds because the observed attributes take the same value in every imputation and therefore contribute zero variance.
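As a minimal sketch of Eqs. (8) and (10), the following code averages the $m$ imputations of one instance and sums the squared deviations over all attributes. The function names are hypothetical, and each imputation is assumed to be a complete feature vector in which the observed attributes are identical across imputations.

```python
def mean_imputation(imputations):
    """Eq. (8): average the m imputed vectors of one instance attribute-wise."""
    m = len(imputations)
    d = len(imputations[0])
    return [sum(imp[j] for imp in imputations) / m for j in range(d)]


def feature_level_uncertainty(imputations):
    """Eq. (10): summed sample variance of the imputations (requires m >= 2).

    Observed attributes are identical across imputations, so they add nothing.
    """
    m = len(imputations)
    x_bar = mean_imputation(imputations)
    total = 0.0
    for imp in imputations:
        total += sum((value - center) ** 2 for value, center in zip(imp, x_bar))
    return total / (m - 1)


# Three imputations of an instance whose first attribute is observed.
example = [[1.0, 2.0, 0.0], [1.0, 4.0, 2.0], [1.0, 6.0, 4.0]]
```

In `example`, the two missing attributes each have a sample variance of 4, so the feature-level imputation uncertainty is 8.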

To quantify the imputation uncertainty of an imputed instance more effectively and comprehensively, we suggest considering the feature-level imputation uncertainty and the knowledge-level imputation uncertainty simultaneously. We define the knowledge-level imputation uncertainty as

$\displaystyle\textit{KIU}(\hat{\textbf{x}}_{i})=\frac{1}{m-1}\sum\limits_{j=1}^{m}(\textbf{w}^{T}\hat{\textbf{x}}_{i}^{(j)}-\textbf{w}^{T}\hat{\textbf{x}}_{i})^{2}=\frac{1}{m-1}\sum\limits_{j=1}^{m}(\bar{\textbf{w}}^{T}\hat{\textbf{x}}_{i}^{(j)k}-\bar{\textbf{w}}^{T}\hat{\textbf{x}}_{i}^{k})^{2}=\frac{1}{m-1}\sum\limits_{j=1}^{m}(\bar{\mathcal{W}}^{T}\phi(\hat{\textbf{x}}_{i}^{(j)k})-\bar{\mathcal{W}}^{T}\phi(\hat{\textbf{x}}_{i}^{k}))^{2}\,,$ (11)

where $\hat{\textbf{x}}_{i}^{(j)k}$ and $\hat{\textbf{x}}_{i}^{k}$ are the $k$-th extended binary instances with respect to $\hat{\textbf{x}}_{i}^{(j)}$ and $\hat{\textbf{x}}_{i}$, respectively, and $\bar{\textbf{w}}$ and $\bar{\mathcal{W}}$ are the learned weight vectors in Eqs. (2) and (7) that correspond to the linear RED-SVM and the nonlinear RED-SVM, respectively. In Eq. (11), $k$ can be any value in $\{1,\ldots,K-1\}$, because the threshold $\theta_{k}$ cancels in the difference of the two outputs. The essence of the knowledge-level imputation uncertainty is the uncertainty of the predictive output of the multiple imputations in an ordinal prediction model. The learned weight vector $\bar{\textbf{w}}$ or $\bar{\mathcal{W}}$ can be viewed as the classification knowledge induced from the training set by the RED-SVM model. Thus, the knowledge-level imputation uncertainty can be roughly viewed as a dynamically weighted feature-level imputation uncertainty, which is updated each time the base learner is retrained.
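For the linear case, the first expression in Eq. (11) can be sketched as below. The function name is hypothetical, and the weight vector `w` is assumed to come from an already trained linear RED-SVM.

```python
def knowledge_level_uncertainty(imputations, w):
    """Eq. (11), linear case: sample variance of w^T x over the m imputations (m >= 2)."""
    m = len(imputations)
    d = len(w)
    # Model output for every imputation of the instance.
    outputs = [sum(w[j] * imp[j] for j in range(d)) for imp in imputations]
    # Output for the averaged instance of Eq. (8).
    x_bar = [sum(imp[j] for imp in imputations) / m for j in range(d)]
    mean_output = sum(w[j] * x_bar[j] for j in range(d))
    return sum((out - mean_output) ** 2 for out in outputs) / (m - 1)
```

If the model places zero weight on the imputed attributes, the measure is zero even when the imputations are widely scattered, which illustrates how the learned weights modulate the feature-level scatter.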

To prevent labeling highly inaccurately imputed instances, this paper proposes penalizing the base query selection function with the feature-level imputation uncertainty and the knowledge-level imputation uncertainty simultaneously. Suppose the base query selection function is $Q(\hat{\textbf{x}}_{i})$ for $\hat{\textbf{x}}_{i}\in\mathcal{U}$, where a larger value indicates a more informative instance. Then, the final query selection function can be expressed as $Q(\hat{\textbf{x}}_{i})-\lambda_{1}\textit{FIU}(\hat{\textbf{x}}_{i})-\lambda_{2}\textit{KIU}(\hat{\textbf{x}}_{i})$, where $\lambda_{1}$ and $\lambda_{2}$ are two parameters that control the contributions of the two imputation uncertainty measures. The base query selection function $Q(\hat{\textbf{x}}_{i})$ will be introduced in the following subsection.

3.2 Query selection based on specified candidate instance region

To select the most informative instances in the active learning process, we design a diversity-based uncertainty sampling query selection function on top of the RED-SVM model. According to the margin sampling strategy [5], the instances close to the decision hyperplane are the most informative. In contrast to the binary SVM model, the RED-SVM model has $K-1$ parallel decision hyperplanes, so how to select an informative instance with respect to these $K-1$ hyperplanes is an issue that needs to be addressed. Intuitively, it is important to focus on the classification uncertainty between adjacent classes in active learning for ordinal classification. Therefore, the idea of uncertainty sampling is particularly suitable for query selection in ordinal data. Ge et al. [7] suggested selecting the instance with the shortest distance to its nearest decision hyperplane in each iteration. However, this strategy may result in an unbalanced hyperplane-updating problem and sampling redundancy.

To avoid the unbalanced hyperplane-updating problem, we adopt a round-robin sampling style, in which every $K-1$ sequentially selected instances are chosen with respect to the $K-1$ different decision hyperplanes, one instance per hyperplane. To alleviate the potential sampling redundancy, we design a diversity-based uncertainty sampling query selection function.
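The round-robin style can be sketched as a simple mapping from the query iteration to a hyperplane index. The cycling order $1,2,\ldots,K-1$ and the function name are assumptions of this sketch; the text only requires that each group of $K-1$ consecutive queries covers all hyperplanes.

```python
def round_robin_hyperplane(iteration, num_classes):
    """Return the 1-based hyperplane index targeted at a 0-based query iteration.

    Cycling in the order 1, 2, ..., K-1 is an illustrative assumption.
    """
    return iteration % (num_classes - 1) + 1
```

With four classes, iterations 0 to 4 target hyperplanes 1, 2, 3, 1, 2.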

Let $\mathcal{L}=\{\hat{\textbf{x}}_{i}\}_{i=1}^{n}$ be the initial labeled instance set, where each instance $\hat{\textbf{x}}_{i}$ corresponds to a label $y_{i}\in\mathcal{Y}$. Let $\mathcal{U}=\{\hat{\textbf{x}}_{i}\}_{i=n+1}^{N}$ be the unlabeled instance pool, with $n\ll N$. For an unlabeled instance $\hat{\textbf{x}}_{i}\in\mathcal{U}$, the distance from $\hat{\textbf{x}}_{i}$ to the $k$-th decision hyperplane is computed as

$\displaystyle\begin{aligned}\mathcal{D}(\hat{\textbf{x}}_{i},h_{k})&=|\textbf{w}^{T}\hat{\textbf{x}}_{i}-\theta_{k}|\\&=|(\textbf{w},-{\bm{\theta}})^{T}\hat{\textbf{x}}_{i}^{k}|\\&=|\bar{\textbf{w}}^{T}\hat{\textbf{x}}_{i}^{k}|\\&=|\bar{\mathcal{W}}^{T}\phi(\hat{\textbf{x}}_{i}^{k})|,\quad k\in\{1,\ldots,K-1\},\end{aligned}$ (12)

where $\hat{\textbf{x}}_{i}^{k}$ is the $k$-th extended instance of $\hat{\textbf{x}}_{i}$, and $\bar{\mathcal{W}}$ is the weight vector in Eq. (7). Naturally, the smaller $\mathcal{D}(\hat{\textbf{x}}_{i},h_{k})$ is, the more informative $\hat{\textbf{x}}_{i}$ is with respect to the $k$-th hyperplane.
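A minimal sketch of the first line of Eq. (12) for the linear case is given below. The function name is hypothetical, and `w` and `thetas` are assumed to come from a trained model. Note that, as in Eq. (12), the value is the absolute model output and is not normalized by the norm of the weight vector.

```python
def hyperplane_distances(x, w, thetas):
    """Eq. (12), linear case: |w^T x - theta_k| for k = 1, ..., K-1."""
    h = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return [abs(h - theta) for theta in thetas]
```

The index of the smallest entry identifies the hyperplane to which the instance is closest.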

In order to take into account sampling diversity in the query selection, we incorporate the cosine angular diversity into the uncertainty sampling criterion. The cosine angular diversity of $\hat{\textbf{x}}_{i}\in\mathcal{U}$ with respect to the labeled instance set $\mathcal{L}$ can be defined as

$\displaystyle\textit{div}(\hat{\textbf{x}}_{i},\mathcal{L})=1-\max\limits_{\hat{\textbf{x}}_{j}\in\mathcal{L}}(|\textit{cos}(\hat{\textbf{x}}_{i},\hat{\textbf{x}}_{j})|)=1-\max\limits_{\hat{\textbf{x}}_{j}\in\mathcal{L}}\left(\frac{|\mathcal{K}(\hat{\textbf{x}}_{i},\hat{\textbf{x}}_{j})|}{\sqrt{|\mathcal{K}(\hat{\textbf{x}}_{i},\hat{\textbf{x}}_{i})||\mathcal{K}(\hat{\textbf{x}}_{j},\hat{\textbf{x}}_{j})|}}\right),$ (13)

where $\mathcal{K}(\hat{\textbf{x}}_{i},\hat{\textbf{x}}_{j})$ is the kernel function of instances $\hat{\textbf{x}}_{i}$ and $\hat{\textbf{x}}_{j}$ .
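The diversity measure in Eq. (13) can be sketched as follows. The linear kernel default and the function names are assumptions of this illustration, and the sketch presumes that no instance has a zero self-kernel value (for example, an all-zero vector under the linear kernel).

```python
import math


def linear_kernel(x, z):
    """Illustrative default kernel; any positive semi-definite kernel could be used."""
    return sum(a * b for a, b in zip(x, z))


def cosine_diversity(x, labeled, kernel=linear_kernel):
    """Eq. (13): one minus the largest absolute kernel cosine to any labeled instance."""
    k_xx = abs(kernel(x, x))
    largest_cosine = 0.0
    for z in labeled:
        cosine = abs(kernel(x, z)) / math.sqrt(k_xx * abs(kernel(z, z)))
        largest_cosine = max(largest_cosine, cosine)
    return 1.0 - largest_cosine
```

An unlabeled instance that is collinear with some labeled instance receives a diversity of 0, whereas one orthogonal to all labeled instances receives 1.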

Uncertainty sampling seeks the instance with a minimal value of $\mathcal{D}(\hat{\textbf{x}}_{i},h_{k})$, while the diversity measure seeks the instance with a maximal value of $\textit{div}(\hat{\textbf{x}}_{i},\mathcal{L})$. Therefore, we can design the query selection function by integrating the two measures as follows

$\displaystyle Q(\hat{\textbf{x}}_{i};k)=\frac{\textit{div}(\hat{\textbf{x}}_{i},\mathcal{L})}{1+\mathcal{D}(\hat{\textbf{x}}_{i},h_{k})},\quad\hat{\textbf{x}}_{i}\in\mathcal{U},\quad k=1,2,\ldots,K-1.$ (14)

The unlabeled instance that maximizes the value of $Q(\hat{\textbf{x}}_{i};k)$ can be regarded as the most informative instance concerning the $k$ -th decision hyperplane. To prevent the highly distorted instances from being selected in the active learning process, the query selection function is penalized by the proposed imputation uncertainty measure as follows

$\displaystyle\mathcal{Q}(\hat{\textbf{x}}_{i};k)=Q(\hat{\textbf{x}}_{i};k)-\lambda_{1}\textit{FIU}(\hat{\textbf{x}}_{i})-\lambda_{2}\textit{KIU}(\hat{\textbf{x}}_{i}),\quad\hat{\textbf{x}}_{i}\in\mathcal{U},\quad k=1,2,\ldots,K-1.$ (15)
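The interplay of Eqs. (14) and (15) can be illustrated with a small sketch in which the diversity, distance, and imputation uncertainty values of each candidate are assumed to be precomputed. The dictionary keys, the function names, and the numeric values are hypothetical and serve only as an example.

```python
def base_score(diversity, distance):
    """Eq. (14): high diversity and small hyperplane distance give a high score."""
    return diversity / (1.0 + distance)


def penalized_score(diversity, distance, fiu, kiu, lambda1, lambda2):
    """Eq. (15): subtract the weighted imputation uncertainties from the base score."""
    return base_score(diversity, distance) - lambda1 * fiu - lambda2 * kiu


def select_query(candidates, lambda1, lambda2):
    """Return the index of the candidate that maximizes the penalized score."""
    scores = [
        penalized_score(c["div"], c["dist"], c["fiu"], c["kiu"], lambda1, lambda2)
        for c in candidates
    ]
    return max(range(len(candidates)), key=scores.__getitem__)


# Candidate 0 lies on the hyperplane but is imputed with high uncertainty;
# candidate 1 is slightly farther away but has no imputation uncertainty.
candidates = [
    {"div": 0.8, "dist": 0.0, "fiu": 2.0, "kiu": 1.0},
    {"div": 0.9, "dist": 0.5, "fiu": 0.0, "kiu": 0.0},
]
```

Without the penalty ($\lambda_{1}=\lambda_{2}=0$) candidate 0 is selected, whereas with $\lambda_{1}=\lambda_{2}=0.1$ the selection switches to candidate 1.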

Figure 2.

Diagram of ordinal classification in an active learning setting on a three-class ordinal dataset. The symbols $\bigcirc$, $\bigtriangleup$ and $\Box$ represent the unlabeled instances belonging to classes $\mathcal{C}_{1}$, $\mathcal{C}_{2}$ and $\mathcal{C}_{3}$, respectively. The classes satisfy $\mathcal{C}_{1}\prec\mathcal{C}_{2}\prec\mathcal{C}_{3}$. The colored symbols denote the labeled instances, among which $\hat{\textbf{x}}_{2}$, $\hat{\textbf{x}}_{4}$ and $\hat{\textbf{x}}_{7}$ are the imprecisely imputed instances.

Although the imputation uncertainty measures discourage the query function from selecting highly imprecisely imputed instances, they cannot entirely prevent the algorithm from choosing such instances. The imprecisely imputed instances in the training set may prevent the learning model’s decision hyperplanes from rapidly converging to the correct positions. Figure 2 illustrates ordinal classification in an active learning setting on a three-class ordinal dataset in which imprecisely imputed instances have been labeled. We can observe that the two decision hyperplanes $h_{1}$ and $h_{2}$ are obstructed by the imprecisely imputed instances $\hat{\textbf{x}}_{2}$ and $\hat{\textbf{x}}_{7}$, respectively. Since there are few labeled instances in an active learning setting, this situation is likely to occur whenever labeled distorted instances exist. To resolve this dilemma, we can specify the candidate instance region for each query selection. Take hyperplane $h_{1}$ as an example: it currently cuts across the dense region of class $\mathcal{C}_{2}$ and is hindered by instance $\hat{\textbf{x}}_{2}$. In this case, the instances close to $h_{1}$ on its left side are more informative than those on its right side. Hence, to make $h_{1}$ converge to the correct position, the next informative unlabeled instance should be selected from a candidate region on the left side of $h_{1}$.

To realize this idea, we explore the state of the current decision hyperplane and determine the candidate instance region based on that state. We assume that the state of a hyperplane is determined by the labeled instances that are closest to it. Let $\mathcal{N}_{k}$ be the set of the $2+\lfloor\frac{|\mathcal{L}|}{K-1}\rfloor$ labeled instances that are closest to hyperplane $h_{k}$. We define the state of a hyperplane as

$\displaystyle\Delta_{h_{k}}=\frac{\sum\limits_{\hat{\textbf{x}}_{i}\in\mathcal{N}_{k},y_{i}=k}\exp(-\mathcal{D}(\hat{\textbf{x}}_{i},h_{k})^{2}/\sigma^{2})}{\sum\limits_{\hat{\textbf{x}}_{i}\in\mathcal{N}_{k},y_{i}=k+1}\exp(-\mathcal{D}(\hat{\textbf{x}}_{i},h_{k})^{2}/\sigma^{2})},$ (16)

where $\mathcal{D}(\hat{\textbf{x}}_{i},h_{k})$ is the distance from $\hat{\textbf{x}}_{i}$ to the $k$-th decision hyperplane, and the constant $\sigma$ is set to 0.1 in this paper. The numerator $\sum_{\hat{\textbf{x}}_{i}\in\mathcal{N}_{k},\,y_{i}=k}\exp(-\mathcal{D}(\hat{\textbf{x}}_{i},h_{k})^{2}/\sigma^{2})$ can be interpreted as the strength with which the instances of the $k$-th class dominate the local region of the $k$-th hyperplane; the denominator plays the same role for the $(k+1)$-th class. The value of $\Delta_{h_{k}}$ lies in $[0,+\infty)$. Let $\alpha>1$ be a control parameter. If $\Delta_{h_{k}}>\alpha$, the $k$-th hyperplane most likely cuts across a region that is significantly dominated by the instances of the $k$-th class. If $0\leqslant\Delta_{h_{k}}<1/\alpha$, the $k$-th hyperplane most likely cuts across a region that is significantly dominated by the instances of the $(k+1)$-th class. If $1/\alpha\leqslant\Delta_{h_{k}}\leqslant\alpha$, the $k$-th hyperplane is in a relatively balanced state. Based on these considerations, we specify the candidate instance region according to the value of $\Delta_{h_{k}}$ as follows

$\displaystyle\mathcal{U}_{h_{k}}=\left\{\begin{array}{ll}\{\hat{\textbf{x}}_{i}\in\mathcal{U}\mid\hat{y}_{i}=k\},&0\leqslant\Delta_{h_{k}}<1/\alpha\\ \{\hat{\textbf{x}}_{i}\in\mathcal{U}\mid\hat{y}_{i}=k\ \text{or}\ \hat{y}_{i}=k+1\},&1/\alpha\leqslant\Delta_{h_{k}}\leqslant\alpha\\ \{\hat{\textbf{x}}_{i}\in\mathcal{U}\mid\hat{y}_{i}=k+1\},&\Delta_{h_{k}}>\alpha\end{array}\right.,$ (17)

where $\hat{y}_{i}$ is the predicted label of $\hat{\textbf{x}}_{i}$. The parameter $\alpha$ is set to 1.5 in this paper.

Once the candidate region is determined, the labeled instances used to calculate the diversity measure must be restricted to the corresponding region. The labeled instance subset used to derive the diversity measure is therefore determined as follows

$\displaystyle\mathcal{L}_{h_{k}}=\left\{\begin{array}{ll}\{\hat{\textbf{x}}_{i}\in\mathcal{L}\mid\hat{y}_{i}=k\},&0\leqslant\Delta_{h_{k}}<1/\alpha\\ \{\hat{\textbf{x}}_{i}\in\mathcal{L}\mid\hat{y}_{i}=k\ \text{or}\ \hat{y}_{i}=k+1\},&1/\alpha\leqslant\Delta_{h_{k}}\leqslant\alpha\\ \{\hat{\textbf{x}}_{i}\in\mathcal{L}\mid\hat{y}_{i}=k+1\},&\Delta_{h_{k}}>\alpha\end{array}\right..$ (18)
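As a rough illustration of Eqs. (16)–(18), the following Python sketch computes the hyperplane state from precomputed distances and then decides which predicted labels make up the candidate region. The function names, the input format, and the handling of an empty denominator are assumptions made for this example; only $\sigma=0.1$, $\alpha=1.5$, and the neighborhood size are taken from the text.

```python
import math

SIGMA = 0.1   # sigma in Eq. (16), value stated in the paper
ALPHA = 1.5   # alpha in Eqs. (17)-(18), value stated in the paper


def hyperplane_state(distances, labels, k, num_classes, sigma=SIGMA):
    """Eq. (16) for hyperplane h_k.

    distances[i] is the distance from labeled instance i to h_k and
    labels[i] is its label (hypothetical, precomputed inputs).
    """
    # N_k: the 2 + floor(|L| / (K - 1)) labeled instances closest to h_k.
    size = 2 + len(distances) // (num_classes - 1)
    nearest = sorted(zip(distances, labels))[:size]

    def strength(cls):
        return sum(math.exp(-(d ** 2) / sigma ** 2) for d, y in nearest if y == cls)

    num, den = strength(k), strength(k + 1)
    if den == 0.0:
        # Assumption: the paper does not discuss this case. Here 0/0 is
        # treated as balanced and a positive numerator over zero as +inf.
        return 1.0 if num == 0.0 else math.inf
    return num / den


def region_labels(delta, k, alpha=ALPHA):
    """Predicted labels that define U_{h_k} and L_{h_k} in Eqs. (17)-(18)."""
    if delta < 1.0 / alpha:
        return {k}
    if delta > alpha:
        return {k + 1}
    return {k, k + 1}


# Toy example: class-2 instances sit closest to h_1, so class 2 dominates.
delta = hyperplane_state([0.05, 0.10, 0.20, 0.30], [2, 2, 1, 2], k=1, num_classes=3)
print(delta, region_labels(delta, k=1))
```

In this toy case the state falls below $1/\alpha$, so only instances predicted as class 1 remain candidates, which corresponds to querying on the left side of $h_{1}$ in the Figure 2 discussion.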

Thus, the final query selection function becomes

$\displaystyle\mathcal{Q}(\hat{\textbf{x}}_{i};k)=\frac{\textit{div}(\hat{\textbf{x}}_{i},\mathcal{L}_{h_{k}})}{1+\mathcal{D}(\hat{\textbf{x}}_{i},h_{k})}-\lambda_{1}\textit{FIU}(\hat{\textbf{x}}_{i})-\lambda_{2}\textit{KIU}(\hat{\textbf{x}}_{i}),\quad\hat{\textbf{x}}_{i}\in\mathcal{U}_{h_{k}},\ k=1,2,\ldots,K-1.$ (19)
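To make the trade-off in Eq. (19) concrete, the sketch below scores a few candidates from precomputed quantities and picks the maximizer. The diversity, FIU, and KIU values are defined by earlier equations and are treated here as given inputs; the candidate names and all numbers are invented for illustration, and $\lambda_{1}=\lambda_{2}=5$ is the setting reported in the experiments.

```python
def query_score(div, dist, fiu, kiu, lam1=5.0, lam2=5.0):
    """Eq. (19): diversity-weighted closeness minus two imputation penalties."""
    return div / (1.0 + dist) - lam1 * fiu - lam2 * kiu


def select_query(candidates, lam1=5.0, lam2=5.0):
    """Return the key of the candidate with the largest score.

    candidates maps a (hypothetical) instance id to a tuple
    (div, dist, fiu, kiu) computed elsewhere.
    """
    return max(candidates, key=lambda i: query_score(*candidates[i], lam1, lam2))


# Made-up candidates in one region U_{h_k}: "a" is the most diverse and close
# to the hyperplane but carries imputation uncertainty; "b" and "c" are
# complete instances (FIU = KIU = 0).
candidates = {
    "a": (0.9, 0.05, 0.10, 0.05),
    "b": (0.8, 0.10, 0.00, 0.00),
    "c": (0.3, 0.02, 0.00, 0.00),
}
print(select_query(candidates))              # with the penalties
print(select_query(candidates, 0.0, 0.0))    # without the penalties
```

With the penalties switched off the sketch prefers the uncertain instance "a"; with $\lambda_{1}=\lambda_{2}=5$ it falls back to the complete instance "b".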

Specifying the candidate instance region can improve both the performance of the base learner and the computational efficiency, since each candidate region is much smaller than the unlabeled pool. We name the proposed method DUSR-TIU. “DUSR” indicates that the method employs diversity-based uncertainty sampling and selects instances in specified candidate instance regions; “TIU” indicates that the query selection is penalized by two imputation uncertainty measures. The procedure of the proposed active learning method is summarized in Algorithm 2.

Algorithm 2: Active learning for ordinal classification on incomplete data
Input: Incomplete initial training set $\mathcal{L}$; incomplete unlabeled instance pool $\mathcal{U}$; the number of classes $K$; the query budget $B$; the base learner $\mathcal{M}$.
Output: The extended training set $\mathcal{L}$; the trained ordinal classification model $\mathcal{M}$.
Impute the instances with missing values in $\mathcal{L}$ and $\mathcal{U}$ by MICE and Eq. (8);
Initialize the base learner $\mathcal{M}$ by training it on $\mathcal{L}$;
$b\leftarrow B$;
while $b>0$ do
    $k\leftarrow 1$;
    while $k\leqslant K-1$ and $b>0$ do
        Train the base learner $\mathcal{M}$ on $\mathcal{L}$;
        Calculate the state $\Delta_{h_{k}}$ of hyperplane $h_{k}$;
        Determine the candidate region $\mathcal{U}_{h_{k}}$ and the labeled instance subset $\mathcal{L}_{h_{k}}$ based on Eqs (16–18);
        Calculate $\mathcal{Q}(\hat{\textbf{x}}_{i};k)$ for each $\hat{\textbf{x}}_{i}\in\mathcal{U}_{h_{k}}$ based on Eqs (10–14, 19);
        Acquire the label of instance $\hat{\textbf{x}}^{\ast}=\arg\max\limits_{\hat{\textbf{x}}_{i}\in\mathcal{U}_{h_{k}}}\mathcal{Q}(\hat{\textbf{x}}_{i};k)$ from the annotator;
        $b\leftarrow b-1$; $\mathcal{U}\leftarrow\mathcal{U}\setminus\{\hat{\textbf{x}}^{\ast}\}$; $\mathcal{L}\leftarrow\mathcal{L}\cup\{\hat{\textbf{x}}^{\ast}\}$;
        $k\leftarrow k+1$;
    end while
end while
return $\mathcal{L}$ and $\mathcal{M}$
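The control flow of Algorithm 2 (a budget counter wrapped around a round-robin sweep over the $K-1$ hyperplanes) can be sketched in Python as follows. The callables `train`, `select`, and `oracle` are hypothetical placeholders for the base learner, the region-restricted query selection of Eqs. (16)–(19), and the annotator. Skipping a hyperplane whose candidate region is empty, and stopping when a whole sweep yields no query, are assumptions of this sketch and not steps stated in the algorithm.

```python
def active_learning_loop(labeled, pool, num_classes, budget, train, select, oracle):
    """Skeleton of the query loop; imputation is assumed to be done already.

    labeled: list of (instance, label) pairs; pool: list of instances.
    train(labeled) -> model
    select(model, labeled, pool, k) -> chosen instance, or None if the
        candidate region of hyperplane k is empty (assumed behaviour)
    oracle(instance) -> label
    """
    labeled, pool = list(labeled), list(pool)
    model = train(labeled)              # initialise the base learner
    b = budget
    while b > 0:
        queried = False
        k = 1
        while k <= num_classes - 1 and b > 0:
            model = train(labeled)      # retrain before every query
            x = select(model, labeled, pool, k)
            if x is not None:
                labeled.append((x, oracle(x)))
                pool.remove(x)
                b -= 1
                queried = True
            k += 1
        if not queried:                 # guard against an endless sweep
            break
    # As in Algorithm 2, the model is not retrained after the final query.
    return labeled, model


# Toy run with stand-in callables: the "model" is just the training set size.
result, model = active_learning_loop(
    [(0, 0), (1, 1)], list(range(2, 12)), num_classes=3, budget=4,
    train=lambda L: len(L),
    select=lambda m, L, P, k: P[0] if P else None,
    oracle=lambda x: x % 3,
)
print(len(result), model)
```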

3.3 Complexity analyses

Suppose the size of the initial training set is $n$, the size of the unlabeled instance pool is $N$, the number of classes is $K$, the number of attributes is $d$, and the query budget is $B$. The imputation of missing values can be viewed as a preprocessing step, so we do not consider it in the complexity analysis.

Before each query selection, we need to perform the operations in the inner loop of Algorithm 2: training the base learner, estimating the state of the current hyperplane, determining the candidate instance region, and computing the query scores. The time complexity of the $B$ query selections is analyzed as follows. The RED-SVM model needs to be trained $(B+1)$ times with an increasing number of training instances. Training the RED-SVM model on $n$ training instances requires $\mathcal{O}((K-1)^{2}n^{2})=\mathcal{O}(n^{2})$ time. Therefore, the accumulated time complexity of the $(B+1)$ training runs is $\mathcal{O}(n^{2}(B+1)+nB(B+1)+\frac{1}{6}B(B+1)(2B+1))$. Since $n$ and $K$ are much smaller than $B$, the $(B+1)$ training runs cost computational time of order $\mathcal{O}(B^{3})$. Estimating the state of the current hyperplane $B$ times requires $\mathcal{O}(B^{3})$ time. Determining the candidate instance region $B$ times requires $\mathcal{O}(BN)$ time. Obtaining $\mathcal{Q}(\hat{\textbf{x}}_{i};k)$ requires calculating $Q(\hat{\textbf{x}}_{i};k)$, $\textit{FIU}(\hat{\textbf{x}}_{i})$, and $\textit{KIU}(\hat{\textbf{x}}_{i})$. Conservatively, we suppose that, in each iteration, the size of the candidate instance region $\mathcal{U}_{h_{k}}$ is $\frac{2}{K}|\mathcal{U}|$ and the size of the labeled instance subset $\mathcal{L}_{h_{k}}$ is $\frac{2}{K}|\mathcal{L}|$. Thus, the time complexity of computing $Q(\hat{\textbf{x}}_{i};k)$ over the $B$ iterations is $\mathcal{O}(B^{2}\frac{(N-B)}{K^{2}})$. Computing $\textit{FIU}(\hat{\textbf{x}}_{i})$ requires $\mathcal{O}(N)$ time, and computing $\textit{KIU}(\hat{\textbf{x}}_{i})$ requires $\mathcal{O}(B(N-B))$ time.
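The accumulated training cost quoted above is the closed form of $\sum_{i=0}^{B}(n+i)^{2}$, i.e. a quadratic cost for each of the $B+1$ training runs as the training set grows from $n$ to $n+B$ instances. The short check below compares the direct sum with the closed form; the function names and the test values are chosen only for this illustration.

```python
def accumulated_training_cost(n, B):
    """Direct sum of the quadratic training cost over the B + 1 training runs."""
    return sum((n + i) ** 2 for i in range(B + 1))


def closed_form(n, B):
    """n^2 (B+1) + n B (B+1) + B (B+1) (2B+1) / 6, as in the text."""
    # B (B+1) (2B+1) is always divisible by 6, so integer division is exact.
    return n * n * (B + 1) + n * B * (B + 1) + B * (B + 1) * (2 * B + 1) // 6


for n, B in [(10, 0), (10, 100), (20, 500)]:
    print(n, B, accumulated_training_cost(n, B), closed_form(n, B))
```

For $B\gg n$ the last term dominates, which is where the $\mathcal{O}(B^{3})$ order comes from.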

In summary, supposing $(N-B)/K^{2}\gg B$, performing $B$ query selections with the proposed method requires computational time of order $\mathcal{O}(B^{2}\frac{(N-B)}{K^{2}})$.

4. Experiments

4.1 Datasets

Nine public ordinal classification datasets are employed in the experiments. Table 1 summarizes the information of the used datasets. The Toy, Stock, Computer, Automobile, and Bank datasets are from reference [26]. The other four datasets are from the UCI dataset repository.¹

¹ https://archive.ics.uci.edu/ml/index.php.

The PowerPlant dataset was originally a regression dataset; it was discretized into a ten-class ordinal classification dataset using equal-frequency binning [26].

Table 1

Information of the used datasets

No.	Datasets	#Instances	#Features	#Class	Distribution
1	Toy	300	2	5	[35, 87, 79, 68, 31]
2	Stock	950	9	5	[190, 190, 190, 190, 190]
3	Computer	8192	8	5	[1639, 1639, 1638, 1638, 1638]
4	Automobile	205	71	6	[3, 22, 67, 54, 32, 27]
5	Obesity	2111	16	7	[272, 287, 290, 290, 351, 297, 324]
6	Optdigits	5620	64	10	[554, 571, 557, 572, 568, 558, 558, 566, 554, 562]
7	Bank	8192	8	10	[820, 820, 819, 819, 819, 819, 819, 819, 819, 819]
8	PowerPlant	9568	4	10	[956, 956, 957, 957, 957, 957, 957, 957, 957, 957]
9	Penbased	10992	16	10	[1143, 1143, 1144, 1055, 1144, 1055, 1056, 1142, 1055, 1055]

The datasets in Table 1 are all complete. In the experiments, each dataset is randomly split into an unlabeled instance pool (80% of the data) and a testing set (20% of the data), and this partition is repeated ten times independently. We simulate the incomplete case by removing some of the attribute values from the unlabeled instance pool under the missing completely at random (MCAR) mechanism [13]. We generate three incomplete versions of each unlabeled instance pool, with missing rates of 20%, 30%, and 40%, respectively. Here, the missing rate is defined as

$\displaystyle mr=\frac{N_{u}}{N_{t}}\times 100\%,$ (20)

where $N_{u}$ is the number of missing values, i.e., the unobserved attribute values, and $N_{t}=N\times d$ is the total number of values in the dataset, with $N$ the number of instances in the unlabeled instance pool and $d$ the dimension of the instances. The testing set contains only complete instances so that the performance of the proposed active learning method can be estimated reliably.
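A minimal sketch of this simulation step is given below. It masks a fixed number of cells chosen uniformly at random, so that the realized missing rate of Eq. (20) matches the target exactly; this particular way of drawing the MCAR mask, the function names, and the toy pool are assumptions of the example, since the section does not spell out the sampling procedure. Missing values are represented by `None`.

```python
import random


def mask_mcar(data, missing_rate, seed=0):
    """Return a copy of data with a fraction of cells set to None, chosen
    uniformly at random and independently of the values (MCAR)."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    cells = [(i, j) for i in range(n) for j in range(d)]
    masked = [list(row) for row in data]
    for i, j in rng.sample(cells, round(missing_rate * n * d)):
        masked[i][j] = None
    return masked


def missing_rate_percent(masked):
    """Eq. (20): mr = N_u / N_t * 100%."""
    total = sum(len(row) for row in masked)
    missing = sum(v is None for row in masked for v in row)
    return 100.0 * missing / total


def incomplete_percent(masked):
    """Percentage of instances with at least one missing value."""
    return 100.0 * sum(any(v is None for v in row) for row in masked) / len(masked)


# Toy pool with N = 50 instances and d = 4 attributes.
pool = [[float(i + j) for j in range(4)] for i in range(50)]
masked = mask_mcar(pool, 0.20)
print(missing_rate_percent(masked), incomplete_percent(masked))
```

The second quantity shows why the share of incomplete instances can be far larger than the missing rate itself once the dimension grows.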

Before performing active learning, we impute the incomplete unlabeled instance pool using random forest regression-based MICE [12] with ten round-robin iterations. We set the number of imputations to $m=5$, so each imputed value is the average of five imputations, as described in Subsection 3.1. The initial training set consists of the two instances with the smallest FIU values from each class in the unlabeled instance pool.

Table 2

Information of the unlabeled instance pool

Datasets	#Instances	mr $=$ 20%		mr $=$ 30%		mr $=$ 40%
		IP(%)	NP(%)	IP(%)	NP(%)	IP(%)	NP(%)
Toy	240	36.77	26.12	51.17	31.96	64.12	40.54
Stock	760	87.14	6.76	95.93	8.34	99.16	10.96
Computer	6553	93.13	23.00	98.67	24.86	99.77	27.60
Automobile	164	100.00	20.61	100.00	22.01	100.00	24.87
Obesity	1688	97.07	14.36	99.77	18.43	99.98	22.89
Optdigits	4496	100.00	1.58	100.00	2.39	100.00	3.53
Bank	6553	83.07	57.71	94.24	62.54	98.29	65.89
PowerPlant	7654	58.98	33.30	76.06	38.59	87.21	44.66
Penbased	8793	97.14	1.38	99.65	2.72	99.96	5.13

We counted the incomplete instances and the noisy instances in the unlabeled pool under the three missing rates and summarize the results in Table 2. In the table, “IP” indicates the percentage of incomplete instances, and “NP” denotes the percentage of noisy instances in the pool. We use the “All $k$-NN” noise filtering method [27], with $k=10$, to determine whether an instance in the pool is noisy. From Table 2, we can see that even at a missing rate of 20%, more than 80% of the instances are incomplete in most datasets. In particular, the higher the dimensionality, the higher the percentage of incomplete instances. Therefore, listwise deletion generally cannot be used to deal with incomplete datasets in active learning. We can also see that the higher the missing rate, the higher the proportion of noisy instances in the pool. Most of these noisy instances are imprecisely imputed instances. The number of noisy instances directly affects the classification accuracy of active learning: the more noisy instances are labeled, the greater the negative impact on model training. During active learning, we do not know which imputed instances are noisy, since all instances in the pool are unlabeled. The only information available for judging whether an imputed instance may be noisy is its imputation uncertainty. In this paper, we aim to reduce the impact of potentially noisy instances, i.e., highly inaccurately imputed instances, on active learning. On the one hand, the AL method should minimize the chance that noisy instances are selected. On the other hand, when potentially noisy instances are labeled, the AL method should reduce their negative impact on model training. Therefore, it is reasonable for our AL method to employ an imputation uncertainty measure and a region-specified query selection function to reduce the negative impact of the noisy instances in the pool.
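For readers unfamiliar with the filter, the sketch below follows the usual description of the All $k$-NN rule: an instance is flagged as noisy if, for any neighborhood size from 1 to $k$, the majority label of its nearest neighbors disagrees with its own label. It is a plain illustration under assumptions (Euclidean distance, ties broken by the first label encountered, ground-truth labels available for the analysis) and is not claimed to reproduce the exact procedure of [27]; the toy data are invented.

```python
from collections import Counter


def all_knn_noise_flags(points, labels, k=10):
    """Flag instance i as noisy if its label differs from the majority label
    of its m nearest neighbours for at least one m in 1..k."""
    n = len(points)
    flags = [False] * n
    for i in range(n):
        # Other instances sorted by squared Euclidean distance to instance i.
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
        )
        for m in range(1, min(k, n - 1) + 1):
            majority = Counter(labels[j] for j in order[:m]).most_common(1)[0][0]
            if majority != labels[i]:
                flags[i] = True
                break
    return flags


# Two clean clusters plus one point labelled 1 that sits next to the class-0 cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (3, 3)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
flags = all_knn_noise_flags(points, labels, k=3)
noise_percent = 100.0 * sum(flags) / len(flags)   # analogous to "NP" in Table 2
print(flags, noise_percent)
```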

4.2 Experimental configuration

To comparatively study the effectiveness of the proposed method DUSR-TIU, we compare it with the following six baseline methods.

• MCSVMA is the SVM-based multi-class active learning method, i.e., the method MC_SVMA ${}_{\textit{RCU}}$ in reference [28], which selects instances by considering the criteria of rejection, compatibility, and uncertainty.
• McPAL [29] is the multi-class probabilistic active learning method, which selects the instances with maximal probabilistic gain.
• LogitA [6] is the A-optimal experimental design method for ordinal classification based on an adjacent-category logistic model. This method tends to select representative and discriminative instances.
• ALOR [7] is the active ordinal classification method based on the RED-SVM model. It selects the instances with the smallest distance from the nearest decision boundary.
• DUSR is the same method as DUSR-TIU, except that it does not include any imputation uncertainty measure.
• DUSR-FIU differs from DUSR-TIU only in the imputation uncertainty measure. In DUSR-FIU, the query selection is penalized only by the feature-level imputation uncertainty described in reference [14].

Among the above methods, MCSVMA and McPAL are two state-of-the-art multi-class active learning methods. LogitA and ALOR are both recently proposed active learning algorithms for ordinal classification. DUSR and DUSR-FIU are included in the comparison as an ablation study. For a fair comparison, each of the compared methods uses the labeled instances to train a RED-SVM model in each iteration and evaluates its ordinal classification performance on the testing set. In the RED-SVM model, the kernel function is the RBF kernel with $\gamma=0.1$, and $C$ is fixed at 100. Suitable values of the control parameters $\lambda_{1}$ and $\lambda_{2}$ usually depend on the data, the missing rate, the pool size, and the query budget. To simplify the experiments, $\lambda_{1}$ and $\lambda_{2}$ in DUSR-TIU are both set to 5. The evaluation metrics are the classification accuracy (ACC) and the mean absolute error (MAE) [26]. We plot the learning curves of ACC and MAE to compare the different methods visually.
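The two metrics can be written down in a few lines. The sketch below uses the standard definitions (fraction of exact matches, and mean absolute difference between true and predicted ranks); the label vectors are made up to show that two predictions with the same ACC can differ in MAE, which is why both are reported for ordinal problems.

```python
def accuracy(y_true, y_pred):
    """ACC: fraction of instances whose predicted rank equals the true rank."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute difference between true and predicted ranks."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1, 2, 3, 4, 5]
near_miss = [1, 2, 3, 5, 5]   # one error, off by one rank
far_miss = [1, 2, 3, 4, 1]    # one error, off by four ranks

print(accuracy(y_true, near_miss), mean_absolute_error(y_true, near_miss))
print(accuracy(y_true, far_miss), mean_absolute_error(y_true, far_miss))
```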

Furthermore, for quantitative comparison, we employ the area under the learning curve (ALC), one of the most frequently used metrics for quantifying the overall performance of active learning [30]. We use the trapezoidal approximation rule [30] to calculate the area under the ACC learning curve (ALC-ACC) and the area under the MAE learning curve (ALC-MAE). Due to the length limitation of the article, the comparison results for ALC-ACC and ALC-MAE are placed in Appendix A. For all of the above metrics, we report the average results over the ten repetitions. In the experiments, we simulate the annotator who provides the labels of the selected instances. We assume that the annotator can always provide an instance’s ground-truth label based on its observed components. Instances with all attributes unobserved were removed from the pool.
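The trapezoidal rule mentioned above amounts to averaging each pair of consecutive points on the learning curve and summing. The sketch below returns the raw area with one unit per query step; whether and how the reported ALC values are normalized is not stated in this section, so the unit step and the example curve are assumptions of the illustration.

```python
def area_under_learning_curve(scores, step=1.0):
    """Trapezoidal approximation of the area under a learning curve.

    scores[t] is the metric value (ACC or MAE) after the t-th query;
    step is the horizontal spacing between consecutive evaluations.
    """
    return sum((a + b) / 2.0 * step for a, b in zip(scores, scores[1:]))


# Invented ACC curve evaluated after 0, 1, 2, and 3 queries.
acc_curve = [0.50, 0.60, 0.70, 0.75]
print(area_under_learning_curve(acc_curve))
```

A larger ALC-ACC and a smaller ALC-MAE both indicate a better learning curve overall.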

The experiments were run on a 64-bit Windows 10 machine with 128 GB of RAM and an Intel(R) Xeon(R) Silver 4214 CPU @ 2.20 GHz, using Python 3.6. The source code and the datasets used are available at https://github.com/DeniuHe/DUSR-TIU.

Figure 3. Learning curves of ACC for the compared methods on the nine datasets with a missing rate of 20%.

Figure 4. Learning curves of ACC for the compared methods on the nine datasets with a missing rate of 30%.

Figure 5. Learning curves of ACC for the compared methods on the nine datasets with a missing rate of 40%.

Figure 6. Learning curves of MAE for the compared methods on the nine datasets with a missing rate of 20%.

Figure 7. Learning curves of MAE for the compared methods on the nine datasets with a missing rate of 30%.

Figure 8. Learning curves of MAE for the compared methods on the nine datasets with a missing rate of 40%.

4.3 Experimental results

Figures 3–5 plot the learning curves of ACC for the seven compared methods on the nine datasets under the three missing rates (20%, 30%, and 40%), and Figs 6–8 show the corresponding learning curves of MAE. From the learning curves in the six figures, we can observe that, under the various missing rates, the proposed method outperforms the baseline methods on most of the datasets during the active learning process in terms of both classification accuracy and mean absolute error. Among the seven methods, DUSR-FIU generally performs second best, which indicates that it is essential to consider the imputation uncertainty in active learning on incomplete ordinal data. That DUSR-TIU performs better than DUSR-FIU illustrates that it is advantageous to consider feature-level and knowledge-level imputation uncertainty simultaneously. When the missing rates are 30% and 40%, the learning curves of some algorithms, such as LogitA, ALOR, and MCSVMA, fluctuate considerably, mainly because of the negative impact of the imprecisely imputed instances. In contrast, the learning curves of the proposed method generally show a steady increase in ACC and a steady decrease in MAE as querying proceeds. This illustrates that the proposed method has advantages in dealing with imputed incomplete ordinal data for active ordinal classification.

Figure 9. Results of the parameter study ($\lambda_{1}$ and $\lambda_{2}$) on the Stock dataset in terms of ALC-ACC.

The parameters $\lambda_{1}$ and $\lambda_{2}$ control the contributions of the two imputation uncertainty measures. To learn how to set the two parameters, we conduct an experiment on the Stock dataset in which both $\lambda_{1}$ and $\lambda_{2}$ take values in {0, 0.01, 0.1, 1, 10, 100, 1000}. The 3D bar charts of the ALC-ACC obtained by DUSR-TIU are depicted in Fig. 9. We can observe that it is preferable to set both $\lambda_{1}$ and $\lambda_{2}$ larger than 1. When the missing rate is not too high, or when the pool contains a large proportion of originally complete instances, the two parameters can be set relatively large, which drives the active learner to select more of the complete instances. When the missing rate is high, such as 20% or higher, the two parameters should not be set too large, since not all of the originally complete instances are informative. Moderate parameter values allow the active learner to select more valuable imputed instances and lead to better active learning performance.

5. Conclusion and future work

This paper proposes an active learning method for incomplete ordinal classification data. To prevent highly imprecisely imputed instances from being labeled, we propose penalizing the query selection with the feature-level imputation uncertainty and the knowledge-level imputation uncertainty simultaneously. To mitigate the negative impact of potentially imprecisely imputed instances in the training set, we suggest selecting instances in specified candidate instance regions based on the state of the hyperplanes of the base learner. Extensive experiments on several public ordinal datasets with different missing rates demonstrate the effectiveness of the proposed method.

In this paper, we only consider active instance selection for incomplete ordinal data, and the missing values are imputed automatically without considering the potential help of a domain expert in the loop. Active feature acquisition for incomplete data is also an important issue in the machine learning community. In practice, we can buy from the experts not only the labels of the instances but also the missing attribute values. Based on domain knowledge, we may know the price of obtaining a label and the price of acquiring each attribute value. In such a situation, we would like to design a cost-sensitive acquisition method with a trade-off mechanism between the missing attribute values and the labels, thus achieving a better ordinal classification model at the lowest cost. The imputation uncertainty proposed in this paper can also be used in this problem to quantify the utility of purchasing a missing attribute value. This merits investigation in future work.

Acknowledgments

This work is supported by Chongqing Key Laboratory of Computational Intelligence.

Appendix A: Additional experimental results

For quantitative comparison, Tables 3–5 report the ALC-ACC results of the seven methods on the nine datasets under the three missing rates, respectively, and Tables 6–8 summarize the ALC-MAE results. In the six tables, the best results are highlighted in boldface. The average ranks of the compared methods are also listed in the tables. The first five methods, i.e., MCSVMA, McPAL, LogitA, ALOR, and DUSR, do not consider the imputation uncertainty. The experimental results in the six tables show that, among these five, our DUSR performs better than the other four methods. Therefore, diversity-based uncertainty sampling with specified candidate instance regions is more effective for active instance selection on imputed incomplete ordinal data. Among the compared methods, DUSR-FIU and DUSR-TIU are the two that take imputation uncertainty into account. The results show that DUSR-FIU and DUSR-TIU generally outperform the other five methods, which do not consider the imputation uncertainty. This demonstrates that potentially imprecisely imputed instances can indeed degrade the performance of active learning. Therefore, it is crucial to involve

Table 3

Results of ALC-ACC for the compared methods on the nine datasets with a missing rate of 20%

MissingRate	Datasets	MCSVMA	McPAL	LogitA	ALOR	DUSR	DUSR-FIU	DUSR-TIU
20%	Toy	35.36 $\pm$ 3.53	33.26 $\pm$ 4.02	32.70 $\pm$ 4.77	35.32 $\pm$ 2.02	37.36 $\pm$ 3.59	38.60 $\pm$ 3.15	41.75 $\pm$ 2.36
	Stock	33.82 $\pm$ 1.87	34.53 $\pm$ 1.30	34.56 $\pm$ 1.53	34.23 $\pm$ 0.83	36.50 $\pm$ 1.73	36.27 $\pm$ 1.03	36.75 $\pm$ 1.67
	Computer	24.50 $\pm$ 1.76	24.92 $\pm$ 2.13	25.29 $\pm$ 2.05	26.12 $\pm$ 2.73	25.50 $\pm$ 1.87	25.66 $\pm$ 2.33	26.62 $\pm$ 1.79
	Automobile	20.94 $\pm$ 3.79	17.30 $\pm$ 2.22	20.21 $\pm$ 4.19	23.67 $\pm$ 2.70	24.07 $\pm$ 2.91	25.12 $\pm$ 3.01	26.00 $\pm$ 3.10
	Obesity	39.27 $\pm$ 1.03	33.59 $\pm$ 1.85	38.33 $\pm$ 1.96	37.41 $\pm$ 0.61	37.56 $\pm$ 1.64	40.48 $\pm$ 1.76	42.76 $\pm$ 1.85
	Optdigits	56.62 $\pm$ 5.92	47.79 $\pm$ 5.54	46.31 $\pm$ 9.92	41.28 $\pm$ 5.33	71.56 $\pm$ 2.76	73.38 $\pm$ 2.45	74.64 $\pm$ 2.44
	Bank	27.80 $\pm$ 1.74	35.19 $\pm$ 1.78	30.74 $\pm$ 2.00	38.05 $\pm$ 1.56	36.50 $\pm$ 0.83	37.96 $\pm$ 2.50	38.71 $\pm$ 1.95
	PowerPlant	40.98 $\pm$ 1.80	43.25 $\pm$ 1.64	41.05 $\pm$ 2.50	40.18 $\pm$ 1.83	42.49 $\pm$ 1.61	42.89 $\pm$ 2.16	43.60 $\pm$ 1.41
	Penbased	76.54 $\pm$ 1.92	81.50 $\pm$ 1.87	82.61 $\pm$ 1.93	75.16 $\pm$ 1.34	83.34 $\pm$ 2.26	84.60 $\pm$ 1.90	85.01 $\pm$ 1.31
	AvgRank	5.44	5.33	5.22	5.11	3.44	2.44	1.00

Table 4

Results of ALC-ACC for the compared methods on the nine datasets with a missing rate of 30%

MissingRate	Datasets	MCSVMA	McPAL	LogitA	ALOR	DUSR	DUSR-FIU	DUSR-TIU
30%	Toy	35.26 $\pm$ 2.13	30.48 $\pm$ 3.39	31.29 $\pm$ 2.55	34.90 $\pm$ 1.96	37.50 $\pm$ 2.30	38.65 $\pm$ 2.40	41.27 $\pm$ 1.02
	Stock	35.10 $\pm$ 1.47	34.60 $\pm$ 1.65	35.03 $\pm$ 1.28	33.06 $\pm$ 1.16	35.82 $\pm$ 1.80	36.19 $\pm$ 1.38	36.35 $\pm$ 1.80
	Computer	23.89 $\pm$ 2.49	24.12 $\pm$ 1.70	25.18 $\pm$ 2.26	24.01 $\pm$ 1.82	25.42 $\pm$ 1.86	25.53 $\pm$ 1.54	26.15 $\pm$ 2.20
	Automobile	20.16 $\pm$ 4.01	16.98 $\pm$ 2.77	22.86 $\pm$ 4.02	19.76 $\pm$ 1.99	23.22 $\pm$ 3.33	24.17 $\pm$ 4.71	24.96 $\pm$ 4.27
	Obesity	37.70 $\pm$ 2.46	32.20 $\pm$ 3.35	36.34 $\pm$ 2.74	37.02 $\pm$ 3.01	36.47 $\pm$ 2.28	40.08 $\pm$ 1.69	43.14 $\pm$ 1.74
	Optdigits	55.62 $\pm$ 5.91	45.55 $\pm$ 5.48	48.65 $\pm$ 9.48	43.75 $\pm$ 6.61	73.40 $\pm$ 1.33	73.72 $\pm$ 0.87	76.12 $\pm$ 1.71
	Bank	25.15 $\pm$ 2.33	33.27 $\pm$ 1.66	26.29 $\pm$ 2.45	33.94 $\pm$ 2.25	33.00 $\pm$ 2.67	35.22 $\pm$ 1.40	38.73 $\pm$ 2.25
	PowerPlant	40.69 $\pm$ 1.93	43.58 $\pm$ 2.41	40.81 $\pm$ 2.70	41.91 $\pm$ 2.29	43.01 $\pm$ 1.22	43.32 $\pm$ 1.17	44.07 $\pm$ 1.78
	Penbased	76.22 $\pm$ 2.43	81.80 $\pm$ 1.67	80.68 $\pm$ 2.94	77.35 $\pm$ 3.20	82.97 $\pm$ 2.19	84.84 $\pm$ 1.07	85.25 $\pm$ 0.99
	AvgRank	5.33	5.33	5.22	5.44	3.56	2.11	1.00

Table 5

Results of ALC-ACC for the compared methods on the nine datasets with a missing rate of 40%

MissingRate	Datasets	MCSVMA	McPAL	LogitA	ALOR	DUSR	DUSR-FIU	DUSR-TIU
40%	Toy	33.87 $\pm$ 2.87	30.61 $\pm$ 5.07	28.55 $\pm$ 3.92	29.19 $\pm$ 3.96	32.11 $\pm$ 3.14	31.78 $\pm$ 3.07	39.42 $\pm$ 2.82
	Stock	32.42 $\pm$ 1.70	33.25 $\pm$ 1.62	31.72 $\pm$ 1.71	33.25 $\pm$ 1.19	34.22 $\pm$ 1.63	34.54 $\pm$ 1.78	36.03 $\pm$ 1.49
	Computer	24.54 $\pm$ 1.38	25.65 $\pm$ 1.96	24.78 $\pm$ 1.50	25.39 $\pm$ 1.85	26.10 $\pm$ 1.31	26.57 $\pm$ 1.82	27.35 $\pm$ 1.32
	Automobile	19.39 $\pm$ 3.08	15.26 $\pm$ 3.58	20.44 $\pm$ 4.58	18.15 $\pm$ 4.30	22.50 $\pm$ 3.05	24.41 $\pm$ 3.13	24.26 $\pm$ 2.76
	Obesity	36.13 $\pm$ 2.00	32.60 $\pm$ 3.92	35.81 $\pm$ 3.44	39.05 $\pm$ 0.91	35.80 $\pm$ 2.85	39.75 $\pm$ 2.29	42.45 $\pm$ 2.01
	Optdigits	57.09 $\pm$ 4.31	47.45 $\pm$ 5.19	50.78 $\pm$ 8.00	51.04 $\pm$ 1.67	71.20 $\pm$ 1.56	74.54 $\pm$ 1.47	76.16 $\pm$ 2.22
	Bank	22.36 $\pm$ 2.36	31.62 $\pm$ 1.99	23.58 $\pm$ 2.35	34.72 $\pm$ 1.14	30.43 $\pm$ 2.94	31.78 $\pm$ 1.72	36.86 $\pm$ 1.67
	PowerPlant	36.84 $\pm$ 2.49	41.88 $\pm$ 1.64	39.55 $\pm$ 1.56	39.55 $\pm$ 1.06	40.86 $\pm$ 2.35	41.20 $\pm$ 2.26	42.82 $\pm$ 1.21
	Penbased	73.34 $\pm$ 1.88	79.40 $\pm$ 1.51	80.17 $\pm$ 2.84	74.88 $\pm$ 1.71	80.86 $\pm$ 1.80	82.94 $\pm$ 1.79	85.43 $\pm$ 1.01
	AvgRank	5.44	5.11	5.67	4.67	3.67	2.33	1.11

Table 6

Results of ALC-MAE for the compared methods on the nine datasets with a missing rate of 20%

Table 7

Results of ALC-MAE for the compared methods on the nine datasets with a missing rate of 30%

Table 8

Results of ALC-MAE for the compared methods on the nine datasets with a missing rate of 40%

Table 9

$P$ -values of the Wilcoxon signed-rank tests using DUSR-TIU as the control method

Metric	MissingRate	MCSVMA	McPAL	LogitA	ALOR	DUSR	DUSR-FIU
ALC-ACC	20%	3.91e-03	3.91e-03	3.91e-03	3.91e-03	3.91e-03	3.91e-03
	30%	3.91e-03	3.91e-03	3.91e-03	3.91e-03	3.91e-03	3.91e-03
	40%	3.91e-03	3.91e-03	3.91e-03	3.91e-03	3.91e-03	7.81e-03
ALC-MAE	20%	3.91e-03	7.81e-03	3.91e-03	3.91e-03	3.91e-03	3.91e-03
	30%	3.91e-03	7.81e-03	3.91e-03	3.91e-03	3.91e-03	3.91e-03
	40%	3.91e-03	3.91e-03	3.91e-03	3.91e-03	3.91e-03	1.95e-02

an imputation uncertainty measure in the query selection procedure in active learning on incomplete data. In addition, DUSR-FIU performs worse than DUSR-TIU. This further indicates that it is effective to consider the feature-level imputation uncertainty and the knowledge-level imputation uncertainty simultaneously in the query selection.

To examine whether each compared method performs equivalently to or significantly differently from DUSR-TIU, we conduct Wilcoxon signed-rank tests between DUSR-TIU and the six compared methods at a significance level of $\alpha=0.05$. The p-values for the metrics ALC-ACC and ALC-MAE under the three missing rates are reported in Table 9, and they are all less than $\alpha$. This demonstrates that the differences between DUSR-TIU and the six methods are statistically significant.
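As a worked illustration of the test, the sketch below computes an exact two-sided Wilcoxon signed-rank p-value by enumerating all sign assignments, which is feasible for nine paired datasets ($2^{9}=512$ cases). The implementation details (average ranks for ties, dropping zero differences, doubling the one-sided tail) are assumptions of this sketch and not necessarily the settings used by the authors. The input is the ALC-ACC column pair of DUSR-TIU and DUSR-FIU at a missing rate of 40% from Table 5.

```python
from itertools import product


def wilcoxon_exact_p(x, y):
    """Exact two-sided p-value of the Wilcoxon signed-rank test for paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    pos = 0
    while pos < n:                                    # average ranks over ties
        end = pos
        while end + 1 < n and abs(abs(diffs[order[end + 1]]) - abs(diffs[order[pos]])) < 1e-9:
            end += 1
        for idx in order[pos:end + 1]:
            ranks[idx] = (pos + end) / 2.0 + 1.0
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    # Under the null hypothesis every sign pattern is equally likely.
    count = sum(
        1 for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) <= w + 1e-9
    )
    return min(1.0, 2.0 * count / 2 ** n)


# ALC-ACC at a missing rate of 40% (Table 5).
dusr_tiu = [39.42, 36.03, 27.35, 24.26, 42.45, 76.16, 36.86, 42.82, 85.43]
dusr_fiu = [31.78, 34.54, 26.57, 24.41, 39.75, 74.54, 31.78, 41.20, 82.94]
p_value = wilcoxon_exact_p(dusr_tiu, dusr_fiu)
print(p_value)
```

Under these assumptions the sketch gives $4/512\approx 7.81\times 10^{-3}$ for this pair, in line with the corresponding entry of Table 9. When all nine differences share the same sign, the same computation gives $2/512\approx 3.91\times 10^{-3}$, the smallest p-value attainable with nine datasets.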

References

1. Tang, Pérez-Fernández and B.D. Baets, A comparative study of machine learning methods for ordinal classification with absolute and relative information, Knowledge-Based Systems 230 (2021), 107358.
2. G.K. Georgoulas, P.S. Karvelis, Gavrilis, C.D. Stylios and Nikolakopoulos, An ordinal classification approach for CTG categorization, In: 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Jeju Island, South Korea, July 11–15, 2017, IEEE, 2017, pp. 2642–2645.
3. Manthoulis, Doumpos, Zopounidis and Galariotis, An ordinal classification framework for bank failure prediction: methodology and empirical evidence for US banks, European Journal of Operational Research 282(2) (2020), 786–801.
4. Cao, Mirjalili and Raschka, Rank consistent ordinal regression for neural networks with application to age estimation, Pattern Recognition Letters 140 (2020), 325–331.
5. Tong and Koller, Support vector machine active learning with applications to text classification, Journal of Machine Learning Research 2 (2001), 45–66.
6. Chen, Wang and Y.I. Chang, Active learning in multiple-class classification problems via individualized binary models, Computational Statistics & Data Analysis 145 (2020), 106911.
7. Chen, Zhang, Hou and Yuan, Active learning for imbalanced ordinal regression, IEEE Access 8 (2020), 180608–180617.
8. Mathieson, Ordered classes and incomplete examples in classification, In: Mozer, M.I. Jordan and Petsche, editors, Advances in Neural Information Processing Systems 9, NIPS, Denver, CO, USA, December 2–5, 1996, MIT Press, 1996, pp. 550–556.
9. C.J. Verzilli and J.R. Carpenter, Assessing uncertainty about parameter estimates with incomplete repeated ordinal data, Statistical Modelling 2(3) (2002), 203–215.
10. Eirola, Lendasse, Vandewalle and Biernacki, Mixture of gaussians for distance estimation with missing data, Neurocomputing 131 (2014), 32–42.
11. J.V. Hulse and T.M. Khoshgoftaar, Incomplete-case nearest neighbor imputation in software measurement data, Information Sciences 259 (2014), 596–610.
12. M.J. Azur, E.A. Stuart and C.F.J. LPhilip, Multiple imputation by chained equations: what is it and how does it work? International Journal of Methods in Psychiatric Research 21(1) (2011), 40–49.
13. R.J.A. Little and D.B. Rubin, Statistical analysis with missing data, John Wiley & Sons, USA, 3rd edition, 2019.
14. Han and Kang, Active learning with missing values considering imputation uncertainty, Knowledge-Based Systems 224 (2021), 107079.
15. and H.T. Lin, Ordinal regression by extended binary classification, In: Schölkopf, J.C. Platt and Hofmann, editors, Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4–7, 2006, MIT Press, 2006, pp. 865–872.
16. Jing, Zhang and Zhang, Entropy-based active learning with support vector machines for content-based image retrieval, In: Proceedings of the 2004 IEEE International Conference on Multimedia and Expo, ICME 2004, 27–30 June 2004, Taipei, Taiwan, IEEE Computer Society, 2004, pp. 85–88.
17. Kee, del Castillo and Runger, Query-by-committee improvement with diversity and density in batch active learning, Information Sciences 454-455 (2018), 401–418.
18. Xue and Hauskrecht, Active learning of classification models with likert-scale feedback, In: N.V. Chawla and Wang, editors, Proceedings of the 2017 SIAM International Conference on Data Mining, Houston, Texas, USA, April 27–29, 2017, SIAM, 2017, pp. 28–35.
19. Wang, Min, Zhang and Wu, Active learning through density clustering, Expert Systems with Applications 85 (2017), 305–317.
20. Wang and Li, A two-stage clustering-based cold-start method for active learning, Intelligent Data Analysis 25(5) (2021), 1169–1185.
21. Nie, Wang, Huang and C.H.Q. Ding, Early active learning via robust representation and structured sparsity, In: Rossi, editor, IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, August 3–9, 2013, IJCAI/AAAI, 2013, pp. 1572–1578.
22. Wang, Huang, Liu and Huang, New balanced active learning model and optimization algorithm, In: Lang, editor, Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13–19, 2018, Stockholm, Sweden, ijcai.org, 2018, pp. 2826–2832.
23. L.A. Hunt and M.A. Jorgensen, Mixture model clustering for mixed data with missing information, Computational Statistics & Data Analysis 41(3-4) (2003), 429–440.
24. Sovilj, Eirola, Miche, Björk, Nian, Akusok and Lendasse, Extreme learning machine for missing data using multiple imputations, Neurocomputing 174 (2016), 220–231.
25. J.S. Murray, Multiple imputation: a review of practical and theoretical findings, Statistical Science 33(2) (2018), 142–159.
26. P.A. Gutiérrez, Pérez-Ortiz, Sánchez-Monedero, Fernández-Navarro and Hervás-Martínez, Ordinal regression methods: survey and experimental study, IEEE Transactions on Knowledge and Data Engineering 28(1) (2016), 127–146.
27. V.S. Sheng, Jiang and Li, Noise filtering to improve data and model quality for crowdsourcing, Knowledge-Based Systems 107 (2016), 96–103.
28. Guo and Wang, An active learning-based SVM multi-class classification model, Pattern Recognition 48(5) (2015), 1577–1597.
29. Kottke, Krempl, Lang, Teschner and Spiliopoulou, Multi-class probabilistic active learning, In: ECAI 2016 – 22nd European Conference on Artificial Intelligence, 29 August-2 September 2016, The Hague, The Netherlands – Including Prestigious Applications of Artificial Intelligence (PAIS 2016), volume 285, IOS Press, 2016, pp. 586–594.
30. O.G.R. Pupo, A.H. Altalhi and Ventura, Statistical comparisons of active learning strategies over multiple datasets, Knowledge-Based Systems 145 (2018), 274–288.
