Abstract
Multiclass data classification, where the goal is to segment data into classes, is an important task in machine learning. However, the task is challenging due to reasons including the scarcity of labeled training data; in fact, most machine learning algorithms require a large amount of labeled examples to perform well. Moreover, the accuracy of a classifier can be dependent on the accuracy of the training labels which can be corrupted. In this paper, we present an efficient and unconditionally stable semi-supervised graph-based method for multiclass data classification which requires considerably less labeled training data to accurately classify a data set compared to current techniques, due to properties such as the embedding of data into a similarity graph. In particular, it performs very well and more accurately than current approaches in the common scenario of few labeled training elements. Morever, we show that the algorithm performs with good accuracy even with a large number of mislabeled examples and is also able to incorporate class size information. The proposed method uses a modified auction dynamics technique. Extensive experiments on benchmark datasets are performed and the results are compared to other methods.
Introduction
Data classification, where the goal is to divide unlabeled examples into several classes with the aid of labeled data, is an important task in machine learning. However, data classification is challenging for many reasons, including the scarcity of labeled training data. In fact, one of the key limitations of most current approaches for the classification task is their reliance on large labeled training sets for good performance. In particular, many modern machine learning techniques, such as neural networks and support vector machines (e.g. [1, 2, 3]), require large amounts of labeled data to perform well [4]. The state-of-the-art deep learning techniques, such as convolutional neural networks (e.g. [5, 6, 7]), have millions of free parameters and thus require even more labeled training data. However, obtaining labeled data is time-consuming and expensive, especially in domains where only experts can determine labels.
Moreover, data classification can be challenging since the labels of the training data can be corrupted, affecting the classification accuracy. This can occur in practice for many reasons. For example, a data-entry error can be made. Moreover, there might not be enough information to label each instance. In addition, subjectivity in labeling can arise in domains in which experts disagree. Thus, it is crucial to use a classifier which tolerates label noise well and to find a way to identify or correct the errors in labels.
In this paper, we present an efficient and unconditionally stable semi-supervised graph-based algorithm for data classification, derived using upper and lower bound auction dynamics. The method requires considerably less labeled training data to accurately classify a data set compared to current techniques; in particular, it performs very well and more accurately than current approaches in the common scenario of few labeled training elements. The approach can also incorporate class size information, which can increase the classification accuracy. Moreover, the method handles label noise in the training data well and is able to perform with good accuracy even with many errors in the labels of the training data. It is important to note that while noise in the training data can occur in the attribute (feature) information of the data as well as the training labels, in this paper, we focus on the noise in the training labels. The case of noise in the attribute information is the subject of another paper; however, in Section 4.5, we include two experiments where noise is introduced in the attribute (feature) information.
The ability of the proposed algorithm to perform well with small labeled training sets will be due to several properties: (i) The embedding of data into a weighted similarity graph, which provides information about the extent of similarity between elements of data (including unlabeled data) and also about the overall structure of the data. (ii) The in-depth construction of the weights in the graph; for example, in the case of text classification, we use a pairwise document similarity measure based on present term set. (iii) The usage of both the unlabeled data and scarce labeled data to construct the graph; this allows one to leverage important structural information of the unlabeled data. (iv) The incorporation of random fluctuations of variables for reaching even lower energies. (v) The ability to incorporate class size information, in the form of upper and lower bounds on the class sizes, which improves accuracy even further.
Main contributions
We present a multiclass data classification method, Algorithm 1, with several advantages:
The method is very efficient and unconditionally stable. The algorithm requires considerably less labeled training data to accurately classify a data set compared to current machine learning techniques; in particular, it performs very well and more accurately than current approaches in the common scenario of few labeled training elements. Moreover, the method handles mislabeled training data well and performs with good accuracy even in the case of a lot of mislabeled training instances. Our algorithm is able to incorporate class size information in form of upper and lower bounds on the class sizes. The latter property is favorable since this kind of information is often available. We elaborate on the stability of the proposed algorithm with respect to mislabeled training data. There are several properties of the method, discussed in Section 3.3, which allow the algorithm to perform well even with large amounts of mislabeled training data. We also perform extensive numerical experiments to show that the proposed algorithm can still perform with good accuracy even with many mislabeled training examples. Moreover, we present a technique, based on the market mechanism and detailed as Algorithm 2, to identify and correct mislabeled training elements. A detailed discussion of the upper and lower bound auction technique for class-size constrained multiclass data classification is presented. We perform extensive experiments on eight benchmark data sets:
The The The
The paper is organized as follows. In Section 2, we review background and previous work. In Section 3, we present the proposed method and theory on the auction dynamics technique for class-size constrained multiclass data classification, as well as elaborate on the stability of the proposed method with respect to mislabeled training data. We also introduce a procedure for identifying and correcting mislabeled training data. Section 4 presents extensive numerical experiments. We conclude in Section 5.
This work further extends our paper [8]. Specifically, we provide much more details of the theory of auction dynamics technique for class-size constrained multiclass data classification; in particular, we include proofs for several unjustified statements in [8]. Moreover, we perform extensive experiments over many more data sets, including images and text/webpage data, to validate the proposed method. In addition, we provide a thorough comparison of our algorithm to some of the best recent methods. We also elaborate on the stability of the proposed method with respect to mislabeled training data, and perform extensive experiments where different amounts of the training labels are corrupted. Lastly, we propose a technique to identify and correct the corrupted labels of the training data using a market mechanism.
Graphical framework
Our method involves the graphical framework, where we consider a weighted graph
The use of the graphical framework offers many advantages. First, it provides information about the extent of similarity between elements of data and the overall structure of the data. It also provides a way to handle nonlinearly separable classes and affords the flexibility to incorporate diverse types of data.
The weight function is computed using the attribute information for each instance in the data set. For example, for image data sets, the attributes contain information about the pixel intensity. For classifying text data, the weight function can be computed using a pairwise document similarity measure [9].
In this paper, we begin by constructing the weight function by connecting each data instance to its
The value of parameter
Previous work
Data classification methods
Since our method is graph-based, we first briefly review some recent graph-based multiclass data classification approaches; a broad survey of such algorithms is provided in [11]. In [12], the authors introduce several algorithms derived from variational models for semi-supervised clustering of high-dimensional data, posed as a minimization of a convex functional of the labeling function. The work [13] presents a transductive version of the
Many other data classification algorithms have been proposed. Among the most popular approaches are associated with deep learning. Some examples include convolutional neural networks (CNN), described in works such as [25], which explores network structure optimization, and [26], which details a weighted pooling approach for image recognition using deep convolutional neural networks. Another example is LeNet-5 [27], which is a CNN designed for handwritten and machine-printed character recognition. The work [28] explores the applicability of deep learning approaches to medical imaging.
Neural networks and support vector machines have also been widely used for classification purposes. Examples of neural network approaches include [29], which details shallow, two-layer networks that are trained to reconstruct linguistic contexts of words and use Word2Vec. The work [30] compares the accuracy of the two neural networks with respect to training data set size. Examples of support vector machines used for text classification include [31, 32].
Other approaches include successive subspace learning [33], scattering operators [34], semi-supervised classification trees [35], continual learning [36]. The work [37] details an algorithm that exploits only the titles and categories of Wikipedia articles for classification purposes.
The experiment section of this paper will compare the classification accuracy of the proposed approach with many of the aforementioned approaches using different kinds of data sets, such as images and text, and different amounts of labeled training data and training label noise.
Approaches for handling mislabeled training data
We now describe previous studies about handling mislabeled training data divided into: local learning-based methods, ensemble learning-based techniques and single learning-based algorithms [38].
Local learning-based methods are algorithms which consider an instance as mislabeled if the class of that instance differs from the classes of the neighbors of the node. In other words, local learning-based techniques make the assumption that instances that are close to each other tend to be assigned to the same class. Of course, defining the neighborhood and the distance function is a challenge for these kinds of methods. One example of a local learning-based technique is edited nearest neighbor (ENN) [39, 40], which eliminates all entries from the training data that are misclassified by the
Ensemble learning-based approaches use independent classifiers to tag mislabeled data. Specifically, they consider a training example as mislabeled if it is defined so by several independent classifiers. Two ensemble approaches are majority vote and consensus vote classifiers [46, 47]. In the case of a majority vote ensemble approach, a training example needs to be considered mislabeled by a majority of the classifiers for it to be eliminated from the training data. Consensus vote ensemble techniques differ from majority vote ones, because they require that the instance be considered mislabeled by all the classifiers for it to be eliminated. Another type of ensemble learning-based methods are data variation techniques which do not use multiple various learning classifiers and instead apply multiple training sets to the same algorithm. Some examples include cross-validated committees [48], bagging [49, 50] and boosting [51].
Single learning-based methods employ the use of a single classifier to detect mislabeled data. Two popular single learning-based algorithms are based on decision trees and neural networks [38]. The robust-C4.5 algorithm [52] detects mislabeled training data based on the conventional C4.5 method [53], which is one of the most popular decision tree algorithms. The method successively builds trees, prunes the trees and then removes training examples if they are misclassified by the pruned trees. The process stops when no more training instances are removed. Another single learning-based approach is based on a framework of multi-layer artificial neural networks. Automatic noise reduction [54] assigns each training example a class probability vector. The vector is updated based on its difference from the network output. After enough training, the mislabeled training element’s label will be changed to the correct one. The training data whose classes have been changed is then removed.
Some specific algorithms we will be comparing our proposed method to in terms of robustness towards noisy labels include neural networks with bootstrapping [55], stream-based active learning [56], an adversarial regularization scheme based on the Wasserstein distance [57], Credal C4.5 model [58], regularized root-quartic mixture of experts model [59], an anti-noise multi-task multi-view algorithm AN-MTMVCSF [60] and a method using a bandit framework [61].
Modified auction dynamics for multiclass classification
In this section, we provide details about our auction dynamics approach to multiclass data classification. This extends our previous work [8], and we include much more details and proofs about the theory for applying auction dynamics to class-size constrained multiclass data classification. In particular, we prove several unjustified statements in [8]. Moreover, extensive experiments on many more different types of data sets are performed, including data sets comprised of text and webpage data as well as images. In addition, a thorough comparison to some of the best recent methods is presented. We also elaborate on the stability of our method with respect to mislabeled training data, and perform extensive experiments where certain portions of the training labels are corrupted. Moreover, we propose and apply a technique to identify and correct the mislabeled labels of the training data using a market mechanism.
Notation
The classification problem of our paper is: given a data set
To incorporate class size information, we impose constraints in our procedure. In particular, let
Derivation of the algorithm
We start the derivation of our algorithm by considering the problem of minimizing the graph cut with class size constraints and labeled data constraints, with
which penalizes partitions of the data that place strongly connected points in different classes.
Equation (1) is difficult as it is NP-hard, so we consider a convex relaxation of the graph cut Eq. (1) called the graph heat content (GHC):
where
It turns out that if the weight matrix
.
If the matrix
Proof..
We need to show that, if
The positive semi-definite condition of
where
The left side of Eq. (3) can be expressed as:
The right side of Eq. (3) can be expressed as:
Therefore,
where the last inequality uses the fact that
Since Eq. (3.2) shows that
we have proven Eq. (3). ∎
Since the GHC in Eq. (3.2) is concave if the weight matrix
The scheme resembles the one introduced by Merriman, Bence and Osher in [62], which generates an approximation to mean curvature motion by alternating between two simple steps: convolution with a kernel and pointwise thresholding. Since the original paper, threshold dynamics and mean curvature motion have been studied extensively in [64, 65, 66, 67, 68, 69, 70, 71, 72]. A generalization to curvature motion of a multiphase network is described in [63, 73, 74, 75].
where
Note that the first part of Eq. (9) is a bounded and feasible linear programming problem, whose solution always includes a vertex of the feasible polytope. Therefore, an optimal
Therefore, the precise form of the linearization
.
Let
Proof..
First, we compute the derivative of Eq. (3.2) with respect to
Taking the derivative of Eq. (3.2) with respect to
since the weight matrix we use is symmetric. At partition
∎
It turns out that the optimization problem in Eq. (9) is equivalent to a modified assignment problem; in particular, iteration
where
and forming the partition of the data set at iteration
Note that, for all
To interpret the modified assignment Eq. (3.2) in a practical way, let the coefficients
Our approach to solving each iteration of Eq. (9), the first step of which is equivalent to the modified assignment Eq. (3.2), is to assign the memberships according to a market mechanism, motivated by [76]. Let each class
.
The dual of the modified assignment Eq. (3.2), which is
where
where
Proof..
To incorporate the class size constraints, we introduce new vectors
Rearranging the terms, one obtains
The dual problem is calculated as
Since
Here, the variable
Using the dual of the modified assignment Eq. (3.2) from Proposition 3.3:
we see that
Therefore,
The complementary slackness condition for Eqs (3.2) and (21) states that an assignment
It can be difficult compute a solution
where
A summary of what occurs during the auctions is:
Step 1 (Upper Bound Auction): In this auction, the elements of the data set compete for memberships to the class of their choice and place their bids. Each class accepts the people with the highest bids, but it cannot accept more than declared by the upper bound on its class size. Moreover, each class tailors its price and incentive according to demand during each iteration. Overall, the upper bounds on class sizes fit nicely into this perspective, but the lower bounds might not be fulfilled. They are instead taken care of during Step 2. Step 2 (Lower Bound Auction): In this auction, the result of Step 1 is inputted into a reverse auction, during which the classes which have not yet reached the necessary number of members (i.e. lower bound on class sizes) compete for the data elements and provide incentives to attract the customers.
Our method is Algorithm 1 and the upper and lower bound auctions are outlined in the Appendix. The algorithm incorporates
In Algorithm 1, iteration
Overall, the energy in Eq. (3.2) is decreased after each iteration of Algorithm 1; this fact is proved in the next proposition. Moreover, the class-size constrained graph heat content energy is a Lyapunov functional for our scheme; thus, we can guarantee unconditional gradient stability.
.
Every iteration of Algorithm 1, derived using the scheme Eq. (9), which successively minimizes linearizations of GHC in Eq. (3.2), will dissipate the energy of GHC.
Proof..
From Proposition 1, we know that GHC is concave whenever
where
According to our algorithm in Eq. (9),
By evaluating the objective function at
Due to Eq. (25), the following holds:
Using Eqs (28) and (29),
[h] InputInput Data set
Compute the energy in Eq. (3.2) of the current iteration,
This subsection discusses the stability of Algorithm 1 with respect to mislabeled training data. In fact, the procedure of Algorithm 1 is equipped with several properties which allow the algorithm to still perform well even in the presence of a large number of mislabeled training nodes.
Here, we assume that the mislabeled data only contains noise in the label and not in the attribute information. Then, the weight matrix
Moreover, we construct the weight matrix using a graph that connects each point only to its
We now note that, as mentioned in Section 3.2, our algorithm uses the labels from the training data only in the calculation of the variables
Suppose that, at iteration
due to properties
We now see that all but extremely few of the assignment problem coefficients
In this section, we present a modified version of Algorithm 1 which corrects mislabeled training data during its procedure. Specifically, one can consider a variation of Algorithm 1 where we relabel certain training instances each iteration before each modified assignment problem is solved. Our approach to relabeling involves using the equilibrium price
Our relabeling procedure is the following: at iteration
Energy Eq. (3.2) is decreased when training node
If
Overall, our procedure for identifying and correcting mislabeled training data is detailed as Algorithm 2. Moreover, the time complexity of the procedure is easy to compute. Let
Relabeling training data requires at most
Recomputing assignment problem coefficients
[h] Let the Input, Result and Initialization be the same as in Algorithm 1,
To validate our algorithm, we perform extensive experiments on eight benchmark datasets:
The first set of experiments consists of varying the number of labeled examples. The second set of experiments consists of corrupting the labels of different amounts of the training data as an input to the algorithm. To corrupt the labels, we replace the correct label of a training data element with a randomly chosen incorrect one. For example, a training element of class 3 might be assigned an incorrect label of class 1 as an input into the algorithm. The last set of experiments includes applying the proposed technique in point Eq. (2) to identify and correct mislabeled training data.
For each data set, we randomly select a certain percentage of it to use as labeled training data, and this set is used to label the unlabeled testing data. Regarding initialization, the labels of the unlabeled points are initialized by creating a Voronoi diagram with the labels of the training points as the seed points; every point is assigned the label of the training point in its Voronoi cell. Regarding parameters, we use
In our experiments, when the labeled training set is small, we incorporate random fluctuations of variables to help our algorithm reach even lower energies, thus increasing the classification accuracy even further. Random fluctuations can also help avoid degenerate auction coefficients which can impact the speed of the method. Although there are several ways to introduce the fluctuations into our procedure, we choose to perturb
All experiments were run using C code on 2.2 GHz Intel Core i7 computer. The
Comparison of proposed Algorithm 1 and other methods using different amounts of labeled examples.
Comparison of proposed Algorithm 1 and other methods using different amounts of corrupted training labels.
Accuracy of proposed Algorithm 1 using different amounts of labeled examples and corrupted training labels.
To validate our proposed Algorithm 1, we perform experiments with different numbers of labeled examples, and compare the results to those of some of the best recent algorithms. The results of the experiments are displayed in Fig. 1 for all the data sets, where our proposed Algorithm 1 is indicated by the red line. The numerical data is presented in Tables 4–11 for all the data sets.
The results show that the proposed method performs extremely well compared to other algorithms, especially in the case of few labeled elements, when the proposed method is often able to obtain much higher accuracies than other methods. This is especially important due to the scarcity of labeled training data. For example, our method is able to obtain around 97.37% on the MNIST data set with just 140 labeled training examples, which corresponds to only 0.2% of the overall data set. Moreover, our method is able to obtain around 98.22% on the Optdigits data set and 94.93% on the Pendigits data set with just 56 and 110 labeled training examples, respectively. In addition, text classification results show that with just 38 labeled training examples, the algorithm is able to obtain a 91.32% accuracy on the Reuters set.
Experiments with different amounts of corrupted training labels
To show that our proposed Algorithm 1 still performs with good accuracy even with high amounts of mislabeled examples, we perform experiments where we corrupt the labels of different amounts of training examples. In particular, we corrupt the labels of 5%, 10%, 15%, 20%, 30% and 40% of the training data. The quantitative results are displayed in Tables 4–11. A comparison to other methods is displayed in Fig. 2. Figure 3 shows the results of our method using different amounts of corrupted training labels and labeled training data.
Timing results for Algorithm 1 (not including weight matrix
construction)
Timing results for Algorithm 1 (not including weight matrix
Comparison of timing to other algorithms (weight matrix
Comparison of total timing to support vector machines (computed using LIBSVM)
MNIST results (accuracy) – Algorithm 1
OptDigits results (accuracy) – Algorithm 1
Fashion MNIST results (accuracy) – Algorithm 1
Reuters results (accuracy) – Algorithm 1
Pendigits results (accuracy) – Algorithm 1
WebKB results (accuracy) – Algorithm 1
Landsat results (accuracy) – Algorithm 1
20 Newsgroups results (accuracy) – Algorithm 1
Correcting mislabeled training data (using Algorithm 2) – MNIST data set
Correcting mislabeled training data (using Algorithm 2) – opt-digits data set
The results indicate that our algorithm obtains a high accuracy on the testing data even when a large part of the training data is mislabeled and the number of labeled training nodes is small. For example, from the MNIST data set results in Table 5, even when 40% of the training data is mislabeled and there are only 140 labeled images (out of 70000 total) as part of the training data, the method still obtains a 97.20% accuracy on the testing data! Moreover, from the Opt-Digits data set results in Table 6, even when 40% of the training data is mislabeled and there are only 281 labeled training images, the algorithm still obtains a 97.55% accuracy on the testing data. In addition, even when 40% of the training data is mislabeled, the proposed algorithm still obtains 92.74% and 81.82% accuracy on the testing data of the Pendigits and Landsat data sets, respectively, using only 275 and 161 labeled training examples, respectively. For the Reuters data set, using only 383 labeled training examples, even when 40% of the training data is mislabeled, the method still obtains 92.85% accuracy on the testing set.
To test the procedure in Section 3.4, detailed as Algorithm 2, for correcting mislabeled training data, we perform experiments using that technique. The results are shown in Tables 12 and 13; to be concise, we include results for two data sets. The tables show that the procedure is able to correct almost all of the labels of the mislabeled training data. For example, for MNIST, when the training data consists of only 700 labeled images and when 20% of the labels are incorrect, over 500 experiments, an average of 134.56 out of 140 incorrect training labels are corrected.
The chance of turning a correct label of the training data into an incorrect label is very small, as shown in Tables 12 and 13. For example, for the Opt-Digits data set, when the training data consists of 281 labeled images (out of 5620 total) and 20% of the labels are incorrect, on average, only around 2 out of 225 correctly labeled training points are classified incorrectly at the end.
Experiments with noise in attribute information
Even though this paper focuses on noise in the training labels, and the case of noise in the attribute (feature) information of the training data is the subject of another work, in this subsection, we include two experiments where noise is added to the attributes/features of the training data. In particular, we consider two of the data sets in Section 4.1: Pendigits and Landsat. For each experiment, we add Gaussian noise with mean 0 and standard deviation
The experiments indicate that the accuracy of the proposed algorithm reduces only by a little amount even with Gaussian noise with standard deviation
Timing
The timing results for Algorithm 1 are included in Tables 1–3. All experiments were performed on a 2.2 GHz Intel Core i7 computer. Table 1 includes the timing needed for everything but the construction of the weight matrix
We also include a comparison to the timing of convolutional neural networks and support vector machines in Tables 2 and 3.
Conclusion
We have detailed a semi-supervised graph-based method for multiclass data classification. The algorithm involves upper and lower bound auction dynamics. Among its advantages is the fact that the proposed method requires considerably less labeled training data to accurately classify a data set compared to current machine learning techniques. In particular, it performs very well and more accurately than current approaches in the common scenario of few labeled training elements. This is important due to the scarcity of labeled training data and the fact that obtaining labeled data is time-consuming and expensive. In addition, the method is very efficient and unconditionally stable. The algorithm also handles mislabeled examples well and performs with good accuracy even with a large number of mislabeled examples. Lastly, the proposed method is able to incorporate class size information.
Footnotes
Appendix
Upper and lower bound auctions
[H] InputInput
[H] InputInput
Set
