Capped ℓ₁-norm regularized least squares classification with label noise

Abstract

Since label noise can hurt the performance of supervised learning (SL), how to train a good classifier in the presence of label noise is an emerging and meaningful topic in the machine learning field. Although many related methods have been proposed and have achieved promising performance, they have the following drawbacks: (1) they can lead to data waste and even performance degradation when the mislabeled instances are removed; and (2) the negative effect of the extremely mislabeled instances cannot be completely eliminated. To address these problems, we propose a novel method based on the capped ℓ₁ norm and a graph-based regularizer to deal with label noise. In the proposed algorithm, we utilize the capped ℓ₁ norm instead of the ℓ₁ norm. The capped ℓ₁ norm inherits the advantage of the ℓ₁ norm, which is robust to label noise to some extent. Moreover, the capped ℓ₁ norm can adaptively find extremely mislabeled instances and eliminate the corresponding negative influence. Additionally, the proposed algorithm makes full use of the mislabeled instances under the graph-based framework, which avoids wasting collected instance information. The solution of our algorithm is obtained through an iterative optimization approach. We report experimental results on several UCI datasets that include both binary and multi-class problems. The results verify the effectiveness of the proposed algorithm in comparison to existing state-of-the-art classification methods.

Keywords

Artificial intelligence, classification algorithm, graph-based learning, label noise

1 Introduction

Past decades have witnessed the success of supervised learning (SL) in the machine learning field [7, 47]. SL is an effective tool for learning a classifier, and has been applied in various practical tasks, such as image classification [5, 52], emotion classification [53], and speech recognition [46]. In general, SL attempts to train a desired classifier from a labeled dataset. Many methods have been proposed for SL, including the k-Nearest Neighbor (k-NN), naive Bayes [11], support vector machine (SVM) [9, 45], AdaBoost [12, 38], regularized least squares classification (RLSC) [39], and kernel minimum squared error (KMSE) [17].

Among these SL methods, RLSC is commonly used because it is easy to solve and achieves promising performance. Xue et al. [49] discovered a local discriminative structure by building intra-class and inter-class graphs, and used the structure to develop discriminatively regularized least-squares classification (DRLSC). In the reported image recognition experiments, DRLSC achieved better performance than the other graph-based learning methods. In contrast, Gan et al. [18] utilized the second-order statistics of the margin distribution to improve the generalization ability, and proposed generalized RLSC. The results showed that a large margin mean and a small margin variance were beneficial and crucial to the performance improvement.

In fact, RLSC and its variants are all SL methods. As is well known, the performance of SL depends heavily on the quality of labeled training instances. However, collecting the instance labels is time-consuming and expensive. Moreover, reliable labels are not easily obtained because labeling experts may be fatigued or inexperienced. In such cases, the instances can be mislabeled by the experts [34]. Additionally, there is an increasing interest in using crowdsourcing, such as the Amazon Mechanical Turk [2, 33]. However, such an approach may lead to instances being mislabeled by nonexperts. As one can imagine, such label noise can have a severe negative impact on the performance of the trained classifier [13]. Many related methods [6, 50] have been proposed and have achieved promising performance. Generally speaking, these methods fall into two categories [15, 36]: (1) identifying and removing or correcting the mislabeled instances; and (2) training a classifier that is robust to label noise.

The first category attempts to improve the quality of the labeled instances through a preprocessing step. In this step, the mislabeled instances are first identified through different strategies and then removed or corrected. Tu [43] designed a noisy label detection method by using density peak clustering [40]. The mislabeled instances were identified through local density and then removed. Bhadra and Hein [4] utilized a mutual consistency approach to identify and correct noisy labels, and introduced two effective strategies to solve the optimization problem. Meanwhile, Smith and Martinez [42] proposed PReprocessing Instances that Should be Misclassified (PRISM), which used different heuristic measures to remove the mislabeled instances. Experimental results showed that PRISM could improve both the quality of training instances and the classification performance. Zhang et al. [50] introduced adaptive voting noise correction (AVNC) to find and correct the mislabeled instances. Their experimental results verified its effectiveness, especially in the situation where each instance had more than one noisy label. Although these methods could improve performance in some scenarios, they might remove too many instances and thus waste instance information. In some cases, the mislabeled instances might even be erroneously corrected, which can lead to performance degradation.

The second category tries to modify the objective function to achieve robustness to label noise. Manwani [32] investigated the behavior of different loss functions under label noise. Experimental analyses showed that risk minimization under a 0-1 loss function could yield promising noise-tolerance performance. Ghosh [20] provided a deeper theoretical analysis than that in [32]. That work proved that loss functions satisfying some sufficient conditions, such as the 0-1 loss and the Sigmoid loss, could be robust under uniform and non-uniform label noise. Additionally, Ghosh [19] attempted to find an appropriate loss function for deep learning, and showed that the loss function based on the mean absolute value of error was inherently robust to label noise. Liu [31] developed an improved kernel minimum square error classification (IKMSE) by minimizing an ℓ₂₁-norm regularizer on the decision coefficient. IKMSE could reduce the negative influence of outliers. Similarly, Li [27] proposed an ℓ₂₁-norm-based extreme learning machine (LR21-ELM) to decrease the harmful effect of mislabeled instances. Gong [21] aimed at learning a transductive classifier by penalizing an ℓ₀ norm on the graph to filter out the mislabeled instances. For outliers with large residuals, Jiang [23] introduced the capped ℓ₁ norm to learn a robust dictionary [1, 28]. In order to effectively deal with the extremely mislabeled instances, Nie [37] presented the capped ℓ_p-norm SVM (CappedSVM) to improve learning performance. Nevertheless, in these methods the negative effect of the extremely mislabeled instances could not be completely eliminated, and the information carried by these instances was neglected rather than fully exploited.

In order to completely eliminate the negative effect and avoid the information waste of the extremely mislabeled instances, we propose a capped ℓ₁-norm RLSC method (i.e., CRLSC) based on a graph-based regularizer. In the proposed algorithm, we use the capped ℓ₁ norm to modify the objective function of RLSC. The capped ℓ₁ norm can adaptively find the extremely mislabeled instances and completely eliminate the corresponding negative influence. In order to avoid information waste and improve classification performance, we build a graph-based regularization term to exploit the mislabeled instances. Finally, we utilize an iterative optimization strategy to solve the optimization problem.

The organization of this paper is as follows. We give a review of related work in Section 2. Section 3 discusses the details of the proposed algorithm. The results from experiments conducted on several UCI datasets are reported in Section 4. Finally, we present a conclusion and directions for future work in Section 5.

2 Background knowledge

2.1 Regularized Least Squares Classification (RLSC)

Because RLSC [39] is easy to solve and can yield comparable performance to other SL methods (e.g., SVM, KMSE), it has become a useful technique in the machine learning field. Formally, suppose a c-class training dataset $X = [x_{i}]_{i=1}^{l}$ with the labels $Y = [y_{i}]_{i=1}^{l}$, where $x_{i} \in ℝ^{D}$ and $y_{i} \in \{1, 2, \dots, c\}$. The goal of RLSC is to train a classifier $f(x)$ by solving a convex optimization problem (OP). The OP for RLSC is as follows:

$\min_{f} J(f) = \sum_{i=1}^{l} \| f(x_{i}) - y_{i} \|_{2}^{2} + γ \| f \|_{K}^{2}$ (1)

where γ is a regularization parameter that controls the model complexity, and $\| \cdot \|_{K}$ is a norm defined in the reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel $K : X \times X \to ℝ$.

According to the representer theorem [3], the classifier $f(x)$ can be represented as

$f(x) = \sum_{i=1}^{l} α_{i} k(x_{i}, x)$ (2)

where $k(\cdot, \cdot)$ is a Mercer kernel [44].

We then substitute Eq. (2) into Eq. (1) and obtain the following OP:

$\min_{α} J(α) = \operatorname{Tr}((Kα - Y)^{T} (Kα - Y)) + γ \operatorname{Tr}(α^{T} K α)$ (3)

where $α = [α_{1}, \dots, α_{l}]^{T} \in ℝ^{l \times c}$ and the Gram matrix K is denoted as

$K = \begin{bmatrix} k(x_{1}, x_{1}) & \dots & k(x_{1}, x_{l}) \\ \vdots & \ddots & \vdots \\ k(x_{l}, x_{1}) & \dots & k(x_{l}, x_{l}) \end{bmatrix}$ (4)

The above OP can be easily solved by matrix analysis. By setting the derivative of $J(α)$ with respect to α to 0, one obtains the optimal coefficients:

$α^{*} = (K + γ I)^{-1} Y$ (5)

where I is the identity matrix of size l × l.
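The closed-form solution in Eq. (5) can be illustrated with a short NumPy sketch. It is only a minimal example under stated assumptions: the function names (`gaussian_gram`, `one_hot`, `rlsc_fit`, `rlsc_predict`) are hypothetical, the Gaussian kernel form $\exp(-\|x - z\|_{2}^{2} / (2σ^{2}))$ is one common choice, and the labels are encoded as an l × c indicator matrix, which the text above does not specify.

```python
import numpy as np

def gaussian_gram(X, Z, sigma):
    """Gram matrix with entries exp(-||x - z||^2 / (2 sigma^2)) (assumed kernel form)."""
    sq = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def one_hot(labels, c):
    """Encode labels in {1, ..., c} as an l x c indicator matrix (assumed encoding)."""
    labels = np.asarray(labels)
    Y = np.zeros((len(labels), c))
    Y[np.arange(len(labels)), labels - 1] = 1.0
    return Y

def rlsc_fit(K, Y, gamma):
    """Solve (K + gamma I) alpha = Y, i.e. Eq. (5), without forming the inverse."""
    return np.linalg.solve(K + gamma * np.eye(K.shape[0]), Y)

def rlsc_predict(K_test_train, alpha):
    """Pick the class with the largest output; returned classes are 1-based."""
    return np.argmax(K_test_train @ alpha, axis=1) + 1
```

Solving the linear system directly is usually preferable to computing the explicit inverse, although both correspond to the same Eq. (5).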

2.2 Multi-class Capped ℓ_p-norm SVM (CappedSVM)

The goal of CappedSVM [37] is to learn a linear model $w^{T} x + b$ that can be robust to label noise. In order to realize this goal, CappedSVM defines the objective function as follows:

$\min_{w, b, M \geq 0} \sum_{i=1}^{n} \min(\| w^{T} x_{i} + b - y_{i} - y_{i} \circ m_{i} \|_{2}^{p}, ɛ) + η \| w \|_{F}^{2}$ (6)

where w is the projection vector, and b is the bias coefficient. The ith column of M is $m_{i}$, which is a slack variable used to encode the loss of $x_{i}$. ɛ is a predefined parameter. $A \circ B$ denotes the element-wise product of vectors A and B. η is a regularization parameter.

CappedSVM utilizes a re-weighted method to solve the OP. The objective function of CappedSVM is then rewritten as below:

$\min_{w, b, M \geq 0} \sum_{i=1}^{n} d_{i} \| w^{T} x_{i} + b - y_{i} - y_{i} \circ m_{i} \|_{2}^{2} + η \| w \|_{F}^{2}$ (7)

where

$d_{i} = \begin{cases} \frac{p}{2} g_{i}(w, b, m_{i})^{\frac{p-2}{2}} & \text{if } g_{i}(w, b, m_{i})^{\frac{p}{2}} \leq ɛ \\ 0 & \text{otherwise} \end{cases}$

and $g_{i}(w, b, m_{i}) = \| w^{T} x_{i} + b - y_{i} - y_{i} \circ m_{i} \|_{2}^{2}$.

When M is fixed, CappedSVM can obtain the following solution:

$w = (X H X^{T} + η I)^{-1} X H Z^{T}$ (8)

where $H = D - \frac{1}{1^{T} D 1} D 1 1^{T} D$, and

$b = \frac{1}{1^{T} D 1} Z D 1 - \frac{1}{1^{T} D 1} w^{T} X D 1$ (9)

where $1 \in ℝ^{c \times 1}$ is a column vector with each entry equal to 1.

When w and b are fixed, the solution of M is given as

$m_{i} = (y_{i} \circ (w^{T} x_{i}) + y_{i} \circ b - 1)_{+}$ (10)

In the experimental section of [37], the value of ɛ is set by selecting the 10% instances with the largest loss in the first five iterations. For a fair comparison, p of CappedSVM in our experiments is fixed to 1.

3 Capped ℓ₁-norm RLSC (CRLSC)

In this section, we will discuss the formulation and solution of the proposed algorithm.

3.1 Motivation

In some situations, the user may collect mislabeled instances because of inexperience or fatigue. Some existing methods attempt to identify and remove the mislabeled instances. However, this will result in data waste. As shown in Fig. 1, if the mislabeled instances are identified and removed, classification performance can be degraded. Hence, our algorithm ignores the labels of the mislabeled instances, treats these instances as unlabeled ones, and exploits them through a graph-based regularizer.

Fig. 1

The motivation of the proposed algorithm. (a) A training dataset with noisy labels. (b) Results if the mislabeled instances are identified and removed. (c) Results if the mislabeled instances are exploited through a graph-based framework by ignoring the corresponding label information.

Additionally, the ℓ₂- and ℓ₁-norm loss functions are not robust enough to label noise. As shown in Fig. 2, both losses grow without bound as the residual of an incorrectly classified instance increases. In contrast, the capped ℓ₁-norm loss is bounded above by ɛ. This means that the negative effect of the extremely mislabeled instances with very large loss can be reduced.

Fig. 2

The illustrations of different loss functions.
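The difference between the three losses can be made concrete with a small sketch. The function names are hypothetical, and the losses are written for a scalar residual (or an array of residual norms) purely for illustration.

```python
import numpy as np

def l2_loss(residual):
    """Squared loss: grows quadratically with the residual."""
    return np.asarray(residual, dtype=float) ** 2

def l1_loss(residual):
    """Absolute loss: grows linearly with the residual."""
    return np.abs(np.asarray(residual, dtype=float))

def capped_l1_loss(residual, eps):
    """Capped absolute loss: identical to the l1 loss below eps, constant above it."""
    return np.minimum(np.abs(np.asarray(residual, dtype=float)), eps)
```

For a residual of 100 and a cap of 2, the squared loss is 10000 and the absolute loss is 100, while the capped loss stays at 2, so a single extremely mislabeled instance cannot dominate the objective.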

3.2 Formulation

According to the above analysis, we employ the capped ℓ₁ norm to substitute for the ℓ₂ norm in RLSC. Meanwhile, the instances that are likely to be mislabeled are explored through the graph-based regularization term.

Firstly, we build a p-nearest neighbor graph W as follows:

$W_{ij} = \begin{cases} \exp\left(- \frac{\| x_{i} - x_{j} \|_{2}^{2}}{2 τ^{2}}\right) & \text{if } x_{i} \in N_{p}(x_{j}) \text{ or } x_{j} \in N_{p}(x_{i}) \\ 0 & \text{otherwise} \end{cases}$ (11)

where $N_{p}(x_{i})$ is the subset composed of the p nearest neighbors of $x_{i}$, and τ is a parameter.

Then, the graph-based regularization term can be written as

$S = \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} W_{ij} \| f(x_{i}) - f(x_{j}) \|_{2}^{2} = f^{T} L f$ (12)

where $f = [f(x_{1}), \dots, f(x_{l})]^{T}$, and $L = D - W$, in which D is a diagonal matrix with $D_{ii} = \sum_{j=1}^{l} W_{ij}$.
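Eqs. (11) and (12) can be sketched in NumPy as follows. This is an illustrative construction only: the names `knn_graph` and `graph_laplacian` are hypothetical, a point is excluded from its own neighborhood (an assumption), and for multi-class outputs the quadratic form $f^{T} L f$ is read as a trace over the output columns.

```python
import numpy as np

def knn_graph(X, p, tau):
    """Symmetric p-nearest-neighbor graph with Gaussian weights, as in Eq. (11)."""
    l = X.shape[0]
    sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2.0 * X @ X.T
    sq = np.maximum(sq, 0.0)
    np.fill_diagonal(sq, np.inf)            # a point is not its own neighbor (assumption)
    nn = np.argsort(sq, axis=1)[:, :p]      # indices of the p nearest neighbors of each point
    mask = np.zeros((l, l), dtype=bool)
    mask[np.repeat(np.arange(l), p), nn.ravel()] = True
    mask |= mask.T                          # the "or" rule of Eq. (11)
    W = np.zeros((l, l))
    W[mask] = np.exp(-sq[mask] / (2.0 * tau ** 2))
    return W

def graph_laplacian(W):
    """L = D - W with D_ii equal to the i-th row sum of W."""
    return np.diag(W.sum(axis=1)) - W
```

Because W is symmetric, the pairwise sum on the left-hand side of Eq. (12) equals the quadratic form on the right-hand side, which is the property the regularizer relies on.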

Finally, we have the following objective function of our algorithm:

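$\min_{f} J_{c}(f) = \sum_{i=1}^{l} \min(\| f(x_{i}) - y_{i} \|_{2}, ɛ) + γ \| f \|_{K}^{2} + η f^{T} L f$ (13)

where ɛ is a threshold parameter, γ is the regularization parameter used in Eq. (1), and η weights the graph-based regularization term.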

3.3 Solution

Since the above OP in the proposed algorithm is non-convex, it is difficult to solve. In this paper, we introduce an effective iterative optimization strategy as shown in [37]. The OP in Eq. (13) can be transformed as follows:

$\min_{f} J_{c}(α, R) = \sum_{i=1}^{l} r_{i} \| f(x_{i}) - y_{i} \|_{2}^{2} + γ \| f \|_{K}^{2} + η f^{T} L f$ (14)

where $R = \operatorname{diag}([r_{1}, \dots, r_{l}])$ and

$r_{i} = \begin{cases} \frac{1}{2 \| f(x_{i}) - y_{i} \|_{2}} & \text{if } \| f(x_{i}) - y_{i} \|_{2} < ɛ \\ 0 & \text{otherwise} \end{cases}$ (15)
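The weight update of Eq. (15) amounts to a few lines of NumPy. In the sketch below, the function name `capped_l1_weights` and the small constant `tiny` are assumptions: `tiny` only guards against division by zero when a residual vanishes, a case that Eq. (15) does not address.

```python
import numpy as np

def capped_l1_weights(residual_norms, eps, tiny=1e-8):
    """Weights r_i of Eq. (15): 1 / (2 ||f(x_i) - y_i||) below the cap, 0 at or above it."""
    norms = np.asarray(residual_norms, dtype=float)
    r = np.zeros_like(norms)
    keep = norms < eps                                   # instances that are not capped
    r[keep] = 1.0 / (2.0 * np.maximum(norms[keep], tiny))
    return r
```

Instances whose residual norm reaches ɛ receive weight zero, so they drop out of the weighted fidelity term in Eq. (14) while still contributing through the graph-based term.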

Jiang et al. [23] have proved that the OP in Eq. (14) will converge to the optimal solution of the OP in Eq. (13). Hence, we have the following iterative process.

1) Update f (x) when r_i is fixed.

Based on the representer theorem, the OP in Eq. (14) can be rewritten as follows:

$\min_{α} J_{c}(α, R) = \operatorname{Tr}((Kα - Y)^{T} R (Kα - Y)) + γ \operatorname{Tr}(α^{T} K α) + η \operatorname{Tr}(α^{T} K L K α)$ (16)

By setting the derivative of $J_{c}(α, R)$ with respect to α to 0, we have

$α = (R K + γ I + η L K)^{-1} R Y$ (17)

2) Update r_i when f (x) is fixed.

r_i can be computed according to Eq. (15).
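For step 1), the passage from Eq. (16) to Eq. (17) can be spelled out as follows. The derivation assumes that K and L are symmetric and that R is diagonal, which holds for the Gram matrix, the graph Laplacian, and the weight matrix defined above.

```latex
\begin{aligned}
\frac{\partial J_{c}(\alpha, R)}{\partial \alpha}
  &= 2 K R (K \alpha - Y) + 2 \gamma K \alpha + 2 \eta K L K \alpha = 0 \\
\Longleftrightarrow \quad
  & K \left[ (R K + \gamma I + \eta L K)\, \alpha - R Y \right] = 0 \\
\Longleftarrow \quad
  & \alpha = (R K + \gamma I + \eta L K)^{-1} R Y
\end{aligned}
```

Any α that makes the bracketed term vanish is therefore a stationary point of Eq. (16), and it is the only one when K is nonsingular.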

By iteratively computing Eq. (17) and Eq. (15), we can achieve the optimal solution α^*. The flow chart is shown in Fig. 3. The details of the proposed algorithm are given in Algorithm 1. Because the sequence of objective function values obtained by our algorithm is non-negative and decreases monotonically, i.e., $J_{c} (α^{(t)}, R^{(t)}) \leq J_{c} (α^{(t - 1)}, R^{(t)}) \leq J_{c} (α^{(t - 1)}, R^{(t - 1)})$ , the iteration procedure will converge. In the next section, we give the plots of the iteration procedure, which indicate that CRLSC has fast convergence.

Fig. 3

Flow chart of the proposed algorithm.

Algorithm 1 CRLSC

Input: Dataset $X = [x_{i}]_{i=1}^{l}$, $Y = [y_{i}]_{i=1}^{l}$, the parameters γ, η, p, ɛ, and the maximum number of iterations M.

Output: Optimal value of α^*.

Compute the graph Laplacian L and the Gram matrix K;

Set $r_{i}^{(0)} = 1$ and t = 1;

repeat

Update α^(t) through Eq. (17);

Update $r_{i}^{(t)}$ through Eq. (15);

until $Δ J_{c}^{(t)} < ɛ$ or t > M.
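Algorithm 1 can be sketched in NumPy as below. This is a minimal illustration rather than the reference implementation: the names `crlsc_objective` and `crlsc_fit` are hypothetical, Y is assumed to be an l × c indicator matrix, ɛ is kept fixed instead of being re-estimated in every iteration, a separate tolerance `tol` (an assumption) is used for the stopping test, and `tiny` guards against division by zero.

```python
import numpy as np

def crlsc_objective(K, L, Y, alpha, gamma, eta, eps):
    """Capped objective of Eq. (13) evaluated at f = K alpha."""
    F = K @ alpha
    norms = np.linalg.norm(F - Y, axis=1)
    return (np.minimum(norms, eps).sum()
            + gamma * np.trace(alpha.T @ K @ alpha)
            + eta * np.trace(F.T @ L @ F))

def crlsc_fit(K, L, Y, gamma, eta, eps, max_iter=50, tol=1e-6, tiny=1e-8):
    """Alternate between Eq. (17) (update alpha) and Eq. (15) (update r)."""
    l = K.shape[0]
    I = np.eye(l)
    LK = L @ K
    r = np.ones(l)                       # r_i^(0) = 1
    prev = np.inf
    history = []
    for _ in range(max_iter):            # at most M = max_iter iterations
        # Eq. (17): (R K + gamma I + eta L K) alpha = R Y, with R = diag(r)
        A = r[:, None] * K + gamma * I + eta * LK
        alpha = np.linalg.solve(A, r[:, None] * Y)
        # Eq. (15): zero weight for instances whose residual norm reaches eps
        norms = np.linalg.norm(K @ alpha - Y, axis=1)
        r = np.where(norms < eps, 1.0 / (2.0 * np.maximum(norms, tiny)), 0.0)
        obj = crlsc_objective(K, L, Y, alpha, gamma, eta, eps)
        history.append(obj)
        if abs(prev - obj) < tol:        # stop when the objective stalls (assumed test)
            break
        prev = obj
    return alpha, r, history
```

The returned weights make the capping visible: an entry of `r` equal to zero marks an instance that the current classifier treats as extremely mislabeled.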

4 Experimental analysis

4.1 A toy experiment

We first conduct an experiment on a toy dataset (Fig. 4(a)). The dataset has 600 instances and is generated through a Gaussian distribution with a unit covariance matrix. The mean is (2, -2) for Class 1 and (-2, -2) for Class -1. Thirty instances are mislabeled and denoted as red rectangles in Fig. 4(b). The classification results obtained by RLSC and CRLSC are reported in Figs. 4(c) and (d), respectively. From these figures, one can see that the instances on the left side of the ideal boundary are easily misclassified by RLSC. However, CRLSC can achieve the desired results. Moreover, the weights of some extremely mislabeled instances are zero, as shown in Fig. 4(d); this verifies the effectiveness of the capped ℓ₁ norm.

Fig. 4

Performance comparison over the toy dataset.

4.2 Benchmark experiments

In this subsection, some experiments are carried out on 12 datasets selected from UCI [14]. The statistical information of the datasets is shown in Table 1. In the experiments, 70% of each dataset is used to constitute the training subset and the rest is used as the testing subset. Because the proposed algorithm is designed to deal with label noise, the training subset is contaminated, which means that the given labels are inconsistent with the true labels. The percentage of label noise is selected from 0% to 30% with a step size of 5%. The following state-of-the-art classification methods are used for performance comparisons.

1-NN;

RLSC [39];

KMSE [17];

IKMSE [31];

CappedSVM [37];
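The data protocol described above (a 70%/30% split followed by contamination of the training labels) can be sketched as follows. The noise model is an assumption, since the text does not specify it: here a fixed fraction of training labels is flipped uniformly at random to a different class. The function name `split_and_contaminate` and its arguments are hypothetical.

```python
import numpy as np

def split_and_contaminate(y, c, noise_ratio, train_fraction=0.7, seed=0):
    """Split indices into train/test and flip a fraction of the training labels.

    y holds labels in {1, ..., c}; flipped labels always change class.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    perm = rng.permutation(len(y))
    n_train = int(round(train_fraction * len(y)))
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    y_noisy = y[train_idx].copy()
    n_flip = int(round(noise_ratio * n_train))
    flip = rng.choice(n_train, size=n_flip, replace=False)
    # A nonzero offset modulo c guarantees that the new label differs from the old one.
    offsets = rng.integers(1, c, size=n_flip)
    y_noisy[flip] = (y_noisy[flip] - 1 + offsets) % c + 1
    return train_idx, test_idx, y_noisy
```

The noise levels used in the experiments would correspond to calling this function with `noise_ratio` in {0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30}.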

Table 1
Information regarding the datasets

Dataset	No. of samples	No. of features	No. of classes
Cmc	1473	9	3
Dermatology	336	33	6
Glass	214	9	6
Ionosphere	351	33	2
Iris	150	4	3
New_thyroid	215	5	3
Sonar	208	60	2
Soybean_small	47	35	4
Vehicle	846	18	4
Water	527	38	2
Waveform	5000	21	3
Wine	178	13	3

In the experiments, the parameter γ for RLSC, KMSE, IKMSE and our algorithm is set to 10⁻⁴. The parameter ɛ for our algorithm is set to 90% of the largest loss in each iteration. η in CappedSVM and our algorithm is set through 5-fold cross validation, and the candidate values are {10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1, 10, 10², 10³, 10⁴}. The Gaussian kernel width σ for computing the Gram matrix K is determined through 5-fold cross validation from the set {2⁻⁴, 2⁻³, 2⁻², 2⁻¹, 1, 2, 2², 2³, 2⁴} × δ, where δ is the average distance between the instances.
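The candidate grids above can be generated with a few lines of NumPy. The sketch makes two assumptions that the text leaves open: δ is taken as the mean Euclidean distance over all distinct pairs of instances, and "90% of the largest loss" is read as 0.9 times the largest residual norm in the current iteration. All names are hypothetical.

```python
import numpy as np

# Candidate values for eta: 10^-4, ..., 10^4.
ETA_GRID = [10.0 ** k for k in range(-4, 5)]

def kernel_width_candidates(X):
    """Candidate kernel widths 2^k * delta for k = -4, ..., 4."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    # delta: mean distance over distinct pairs (assumed definition)
    delta = dist[np.triu_indices(X.shape[0], k=1)].mean()
    return [2.0 ** k * delta for k in range(-4, 5)]

def eps_from_losses(residual_norms, fraction=0.9):
    """One reading of the heuristic: eps = fraction * largest residual norm."""
    return fraction * float(np.max(residual_norms))
```

Under this reading of the heuristic, the instance with the largest residual is always capped, and further instances are capped whenever their residuals come within 10% of it.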

The results are shown in Figs. 5 and 6. From these figures, we can reach the following conclusions:

Compared to the other methods, 1-NN obtains the worst performance overall on all datasets. This demonstrates that 1-NN is very sensitive to label noise and that it has the worst stability.

If the percentage of label noise is fixed to 0%, CRLSC obtains results comparable to, if not better than, those of the other methods. In particular, the proposed algorithm outperforms the other methods on the New_thyroid, Sonar, Vehicle, and Water datasets. The reason for this may be that there are other types of noise in these datasets, such as outliers. These results show that the proposed algorithm can deal with such other types of noise to some extent.

As the percentage of label noise increases, the performance of all methods decreases. This confirms that label noise can hurt the performance of a trained classifier.

The performance of CRLSC and CappedSVM decreases at the slowest pace as the percentage of label noise changes. Therefore, the capped norm is more robust to label noise, and the effectiveness of the proposed algorithm is verified.

CRLSC with η = 0 can perform better than CRLSC with the ℓ₁ norm and η = 0 in most cases, such as on Dermatology, Ionosphere, and New_thyroid datasets. This means that the capped ℓ₁ norm is more robust to label noise than the ℓ₁ norm.

CRLSC generally outperforms CRLSC with the ℓ₁ norm and η = 0 and CRLSC with η = 0. This finding further verifies the effectiveness of the strategy using the graph-based technique to exploit the mislabeled instances.

Fig. 5

Performance comparison of different methods over the first six datasets.

Fig. 6

Performance comparison of different methods over the latter six datasets.

Furthermore, we conduct a performance analysis with different parameter settings. Since ɛ is set through a heuristic strategy, we only vary the value of η in these experiments. The analysis plots are shown in Figs. 7 and 8. From these figures, we can see that the best performance is achieved under a small value of η when the noise ratio is 0%. This is because the labeled instances are mainly exploited through the first (fidelity) term in the objective function. Meanwhile, as the ratio increases, the proposed algorithm often obtains the best performance when the value of η is large; this is especially obvious on the Ionosphere, New_thyroid, Vehicle, and Water datasets when the ratio is 30%. The reason for this may be that the mislabeled instances have large loss and are exploited through the last (graph-based) term to guarantee the performance. Moreover, the performance of the proposed algorithm on the Waveform dataset is not sensitive to label noise (see Fig. 6(e)). Hence, the performance with different values of η is similar when the noise percentage changes (Fig. 8(e)).

Fig. 7

Performance with different values of η over the first six datasets.

Fig. 8

Performance with different values of η over the latter six datasets.

Finally, we present the convergence behavior on different datasets in Fig. 9. From these plots, we can see that our algorithm converges quickly, which demonstrates the effectiveness of the optimization method used in our algorithm.

Fig. 9

Illustrations of the iterative procedure.

5 Conclusion

In this paper, we propose a novel classification method to deal with label noise. The proposed algorithm utilizes the capped ℓ₁ norm to alleviate the negative influence of label noise. Moreover, the instances that satisfy ||f (x_i) - y_i||₂ > ɛ are likely to be the extremely mislabeled ones, and they are exploited by the graph-based approach to avoid wasting instance information. Our experimental results verify the feasibility and usefulness of the proposed algorithm. Further analysis reveals that our algorithm is robust to outliers to some extent. Therefore, our algorithm can extend the practicability of the trained classifier. However, this work mainly focuses on label noise. In future work, we will try to simultaneously deal with different types of noise, such as outliers and feature noise.

Acknowledgment

This work is supported by the Doctoral Scientific Research Foundation of Hubei University of Technology under grant No. BSQD2015026, the National Natural Science Foundation of China under grant No. 61601162, the Doctoral Scientific Research Foundation of Wuhan Institute of Technology under grant No. K201905, and the Research Project of Hubei Provincial Department of Education under grant No. Q20191510, B2019041.

References

1. Aharon, Elad and Bruckstein, K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation, IEEE Transactions on Signal Processing 54 (2006), 4311–4322.

2. Allahbakhsh, Benatallah, Ignjatovic, Motahari-Nezhad H.R., Bertino and Dustdar, Quality control in crowdsourcing systems: Issues and directions, IEEE Internet Computing 17(2) (2013), 76–81.

3. Belkin, Niyogi and Sindhwani, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research 7 (2006), 2399–2434.

4. Bhadra and Hein, Correction of noisy labels via mutual consistency check, Neurocomputing 160 (2015), 34–52.

5. Cao, He and Huang, Lift: A new framework of learning from testing data for face recognition, Neurocomputing 74(6) (2011), 916–929.

6. Brodley C.E. and Friedl M.A., Identifying and eliminating mislabeled training instances, In: Proceedings of the National Conference on Artificial Intelligence, The MIT Press, Oregon, USA, (1996), pp. 799–805.

7. Caruana and Niculescu-Mizil, An empirical comparison of supervised learning algorithms, In: Proceedings of the 23rd International Conference on Machine Learning, ACM, NY, USA, (2006), pp. 161–168.

8. Chen, Wang, Chen and Li, Capped ℓ₁-norm sparse representation method for graph clustering, IEEE Access 7 (2019), 54464–54471.

9. Chen W.-J., Shao Y.-H., Li C.-N. and Deng N.-Y., MLTSVM: A novel twin support vector machine to multi-label learning, Pattern Recognition 52 (2016), 61–74.

10. Cheng, Li, Han, Yao and Guo, Exploring hierarchical convolutional features for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 56(11) (2018), 6712–6722.

11. Chhogyal and Nayak, An empirical study of a simple naive bayes classifier based on ranking functions, In: Proceedings of Australasian Joint Conference on Artificial Intelligence, Springer International Publishing, Cham, (2016), pp. 324–331.

12. Collins, Schapire R.E. and Singer, Logistic regression, adaboost and bregman distances, Machine Learning 48(1) (2002), 253–285.

13. Foody G.M., The effect of mis-labeled training data on the accuracy of supervised image classification by svm, In: Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium, IEEE, Milan, Italy, (2015), pp. 4987–4990.

14. Frank and Asuncion, UCI Machine Learning repository, 2010. http://archive.ics.uci.edu/ml.

15. Frenay and Verleysen, Classification in the presence of label noise: A survey, IEEE Transactions on Neural Networks and Learning Systems 25(5) (2014), 845–869.

16. Gan, Huang, Luo, Xi and Gao, On using supervised clustering analysis to improve classification performance, Information Sciences 454-455 (2018), 216–228.

17. Gan, Sang and Chen, Semi-supervised kernel minimum squared error based on manifold structure, In: Proceedings of the 10th International Symposium on Neural Networks, Springer-Verlag, Dalian, China, (2013), pp. 265–272.

18. Gan, She, Ma, Wu and Meng, Generalization improvement for regularized least squares classification, Neural Computing and Applications 31(2) (2019), 1045–1051.

19. Ghosh, Kumar and Sastry P.S., Robust loss functions under label noise for deep neural networks, In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI Publications, San Francisco, California, USA, (2017), pp. 1919–1925.

20. Ghosh, Manwani and Sastry, Making risk minimization tolerant to label noise, Neurocomputing 160 (2015), 93–107.

21. Gong, Zhang, Yang and Tao, Learning with inadequate and incorrect supervision, In: Proceedings of the 2017 IEEE International Conference on Data Mining, IEEE, Los Alamitos, CA, USA, (2017), pp. 889–894.

22. Ipeirotis P.G., Provost and Wang, Quality management on amazon mechanical turk, In: Proceedings of the ACM SIGKDD Workshop on Human Computation, ACM, New York, NY, USA, (2010), pp. 64–67.

23. Jiang, Nie and Huang, Robust dictionary learning with capped ℓ₁-norm, In: Proceedings of International Joint Conference on Artificial Intelligence, AAAI Press, Buenos Aires, Argentina, (2015), pp. 3590–3596.

24. Kang, Duan, Xiang, Li and Benediktsson J.A., Detection and correction of mislabeled training samples for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 56(10) (2018), 5673–5686.

25. Kearns, Efficient noise-tolerant learning from statistical queries, Journal of the ACM 45(6) (1998), 983–1006.

26. LeCun, Bengio and Hinton, Deep learning, Nature 521(7553) (2015), 436–444.

27. Li, Wang, Lei and Song, ℓ₂₁-norm based loss function and regularization extreme learning machine, IEEE Access 7 (2019), 6575–6586.

28. Li, Shen, Zhang, Yuan and Yang, Recovering quantitative remote sensing products contaminated by thick clouds and shadows using multitemporal dictionary learning, IEEE Transactions on Geoscience and Remote Sensing 52 (2014), 7086–7098.

29. Liu and Liu, A novel locally linear knn method with applications to visual recognition, IEEE Transactions on Neural Networks and Learning Systems 28(9) (2016), 2010–2021.

30. Liu and Tao, Classification with noisy labels by importance reweighting, IEEE Transactions on Pattern Analysis and Machine Intelligence 38(3) (2016), 447–461.

31. Liu, Xue, Zhang, Pu and Wang, An improved kernel minimum square error classification algorithm based on ℓ₂₁-norm regularization, IEEE Access 5 (2017), 14133–14140.

32. Manwani and Sastry P.S., Noise tolerance under risk minimization, IEEE Transactions on Cybernetics 43(3) (2013), 1146–1151.

33. Mao, Capra, Harman and Jia, A survey of the use of crowdsourcing in software engineering, Journal of Systems and Software 126 (2017), 57–84.

34. Muhlenbach, Lallich and Zighed D.A., Identifying and handling mislabelled instances, Journal of Intelligent Information Systems 22(1) (2004), 89–109.

35. Natarajan, Dhillon I.S., Ravikumar P.K. and Tewari, Learning with noisy labels, In: Advances in Neural Information Processing Systems 26, Curran Associates, Inc., Lake Tahoe, Nevada, USA, (2013), pp. 1196–1204.

36. Nettleton D.F., Orriols-Puig and Fornells, A study of the effect of different types of noise on the precision of supervised learning techniques, Artificial Intelligence Review 33(4) (2010), 275–306.

37. Nie, Wang and Huang, Multiclass capped ℓ_p-norm svm for robust classifications, In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI Press, San Francisco, California, USA, (2017), pp. 2415–2421.

38. Rätsch, Onoda and Müller K.-R., Soft margins for adaboost, Machine Learning 42(3) (2001), 287–320.

39. Rifkin, Yeo and Poggio, Regularized least-squares classification, Nato Science Series Sub Series III Computer and Systems Sciences 190 (2003), 131–154.

40. Rodriguez and Laio, Clustering by fast search and find of density peaks, Science 344(6191) (2014), 1492–1496.

41. Scott, Blanchard and Handy, Classification with asymmetric label noise: Consistency and maximal denoising, In: Proceedings of the 26th Annual Conference on Learning Theory, PMLR, Princeton, NJ, USA, (2013), pp. 489–511.

42. Smith M.R. and Martinez, Improving classification accuracy by identifying and removing instances that should be misclassified, In: Proceedings of the 2011 International Joint Conference on Neural Networks, IEEE, San Jose, California, USA, (2011), pp. 2690–2697.

43. Tu, Zhang, Kang, Zhang and Li, Density peak-based noisy label detection for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 57(3) (2019), 1573–1584.

44. Vapnik V.N., Statistical Learning Theory, Wiley-Interscience, (1998).

45. Vapnik V.N. and Vapnik, Statistical learning theory, Wiley, (1998).

46. Varadarajan, Yu, Deng and Acero, Using collective information in semi-supervised learning for speech recognition, In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, Taipei, Taiwan, (2009), pp. 4633–4636.

47. Wang, Lu, Cai, Cham and Wang, Large-margin multi-modal deep learning for RGB-D object recognition, IEEE Transactions on Multimedia 17(11) (2015), 1887–1898.

48. Xiao, Xia, Yang, Huang and Wang, Learning from massive noisy labeled data for image classification, In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, MA, USA, (2015), pp. 2691–2699.

49. Xue, Chen and Yang, Discriminatively regularized least-squares classification, Pattern Recognition 42(1) (2009), 93–104.

50. Zhang, Sheng V.S., Wu, Fu and Wu, Improving label quality in crowdsourcing using noise correction, In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, ACM, New York, NY, USA, (2015), pp. 1931–1934.

51. Zhang and Zhang, Robust visual knowledge transfer via extreme learning machine-based domain adaptation, IEEE Transactions on Image Processing 25(10) (2016), 4959–4973.

52. Zhang, Zuo and Zhang, LSDT: Latent sparse domain transfer learning for visual adaptation, IEEE Transactions on Image Processing 25(3) (2016), 1177–1191.

53. Zhu, Liu, Li, Wan and Qin, Emotion classification with data augmentation using generative adversarial networks, In: Advances in Knowledge Discovery and Data Mining, Springer International Publishing, Cham, (2018), pp. 349–360.
