Abstract
Multiple Instance Learning (MIL) is a widely studied learning paradigm which arises from real applications. Existing MIL methods have achieved prominent performances under the premise of plenty annotation data. Nevertheless, sufficient labeled data is often unattainable due to the high labeling cost. For example, the task in web image identification is to find similar samples among a large size of unlabeled dataset through a small number of provided target pictures. This leads to a particular scenario of Multiple Instance Learning with insufficient Positive and superabundant Unlabeled data (PU-MIL), which is a hot research topic in MIL recently. In this paper, we propose a novel method called Multiple Instance Learning with Bi-level Embedding (MILBLE) to tackle PU-MIL problem. Unlike other PU-MIL method using only simple single-level mapping, the bi-level embedding strategy are designed to customize specific mapping for positive and unlabeled data. It ensures the characteristics of key instance are not erased. Moreover, the weighting measure adopted in positive data can extracts the uncontaminated information of true positive instances without interference from negative ones. Finally, we minimize the classification error loss of mapped examples based on class-prior probability to train the optimal classifier. Experimental results show that our method has better performance than other state-of-the-art methods.
Introduction
Multiple instance learning (MIL) is a learning paradigm that has been widely studied recently. In MIL scenario, several instances are joint to form a set called bag and share one label. The bag with no positive instance within it is defined as a negative bag. Otherwise, it is defined as a positive bag. MIL relaxes the requirements to label each training instance, and alleviates the problem of expensive data labeling costs caused by the current explosive growth of data. Since the MIL methods show excellent performance when dealing with problems of insufficient labeling, it is utilized to solve problems in many fields, such as text recognition [4], image classification [1], molecular activity prediction [9] and so on.
In literature, many methods have been proposed to solve the MIL problem. According to whether each bag is treated as a whole unit, the existing MIL methods can be grouped into two categories. The first category is instance-based methods [9, 22, 20, 28, 5], which disassemble the bags into instances and select the key instances to train the classifier. APR [9] is the first method proposed to solve the MIL problem, which is committed to drug activity prediction. Ins-KI-SVM [22] is also a representative instance-based method, which maximize the margin between selected key instances to train a classifier. The second category contains bag-based methods. These methods do not split the bag into instances and build the classification model of bags directly. In order to merge multiple instances into one sample to achieve the purpose of training a bag-based classifier, two strategies are commonly used in such methods. The first strategy [26, 14, 19, 8, 2] defines a function to measure the distance between bags, and trains a distance-based classifier on the basis of the function. Citation-kNN [26] is a method without classifier training process. The bags classify themselves by referring to and citing the labels of their neighbors. The second strategy [7, 6, 21, 13, 5, 27] is mapping and embedding, which extract particular kinds of information for each bag in the new latent feature space. The MIL problem is then translated into a classical supervised problem after bag mapping. MILDM [27] maps each bags to a single instance into the latent feature space via an optimal discriminative instance pool, which makes bags easier to be distinguished.
The existing methods have achieved satisfactory performance on MIL issues. Nevertheless, the premise of training often based on a large amount of complete and accurate labeled data. It is usually untoward to obtain enough labeled data in practice. In many cases, only a small amount of incomplete labeling data can be achieved. For example, in image classification, it is much easier to label the positive samples than negative ones in a large number of images, because it is simpler to judge that there exists an object in image than there not. When mining the specific target image or text on the web, only a small number of target samples (positive) are provided. In addition, the diverse unrelated (negative) data will also lead to difficulties in marking, such as labeling the pictures of oceans, rivers, lakes, etc. These practical problems lead to a hard learning scenario with little positive bags and lots of unlabeled bag, which we called it Positive and Unlabeled Multiple Instance Learning(PU-MIL) problem. Since the labeling data is unilateral and very inadequate, almost all of the existing MIL methods cannot be adapt to solve these problems well. Bao et al. have proposed a method called PU-SKC [3] to solve the MIL problem in PU scenario. This method maps bags into latent feature space based on set kernel [14], and train classifier by convex formulation of empirical risk [11]. However, in the mapping step, PU-SKC maps entire bags to single instances. Since the number of positive instances in positive bags is very small, such an entire bag mapping will cause the characteristic of positive samples be erased by many negative instances. This approach dose not attach importance to the key role of real positive instances, which leads to the pivotal identifiable information loss of data.
In this paper, we proposed a novel method to tackle PU-MIL problem called Multiple Instance Learning with Bi-Level Embedding (MILBLE). Different from the existing PU-MIL method with data set key information obfuscation problem, MILBLE pays attention to the information preservation and extraction of true positive instances in positive bags, which plays a vital role in the construction of classifier. Specifically, the bi-level embedding strategy adopted by MILBLE customizes specific different mappings for positive and unlabeled data to ensure that the characteristics of key instances are not submerged. Moreover, the weighting measure used in positive data mapping can identify true positive instances without interference from negative ones. The bi-level embedding strategy ensures that MILBLE can overcome the inherent label ambiguity of positive bags and extract the uncontaminated information from them. After embedding the mapped samples into the latent feature space, we minimize the classification error loss of mapped examples based on class-prior probability to train the optimal classifier. We also give a fast and effective optimization method to solve our model. Multiple sets of experiments shown that our method achieves better performance than other existing state-of-art methods. The contribution of this paper can be summarized as follows.
We have proposed a novel bi-level embedding method which can exert the effects of key instances in positive bag adequately and avoid essential information obfuscation. The weighting strategy adopted in positive bag embedding filters the interference caused by the negative instances, improves the purity of positive samples used for classifier training. Several sets of broad-coverage experiments on the public datasets verify the effectiveness of our proposed method.
The rest of this paper is organized as follows. First, we introduce related work in Section 2. In Section 3, we will detailed the background of PU-MIL problem, and elaborate the proposed MILBLE approach through two parts of the bi-level embedding and the model of embedded samples. The optimization of MILBLE is proposed in Section 4. In section 5, we display experimental results on eight datasets to verify the effectiveness of MILBLE under different class-prior probability, together with some results about the parameter determination, convergence analysis and running time. Finally, we conclude this paper in Section 6.
We would like to introduce some notations at first. Denote positive and unlabeled train set as
Positive and unlabeled learning
Positive and unlabeled (PU) learning is an important part of weakly supervised learning. The goal of PU learning is to train binary classifier with only positive and unlabeled data. So far, many methods have been proposed to solve the PU learning problem. The existing PU methods can be divided into two categories.
The first category of methods aim to tackle PU problem by two-steps strategy [23, 29, 24, 17, 16]. They first recognize the possible negative samples in unlabeled data, and then transform the PU problem into a traditional PN problem. Liu et al. [23] conducted experiments and evaluations on various types of possible combinations of methods of the two steps, and proposed a biased formulation of SVM in the second step. It used two parameters to weight positive and negative errors differently. Based on [23], Ke et al. [17] further identified the confidence positive examples in the first step for training to obtain a better classifier. Xiao et al. [29] proposed an algorithm named SPUL. This method not only identified the possible negative samples in the first step, but also associated the remaining part, which can not be explicitly identified as positive or negative (they call them ambiguous examples) with two similarity weights. These two weights indicate the similarity of an ambiguous example towards positive and negative class respectively. They have also been incorporated into an SVM-based learning phase to build a more accurate classifier in the second step. These basic approaches rely heavily on the discrimination of examples in unlabeled data. When the negative samples can not be classified correctly, the performance of classifier is usually not satisfactory.
Different from first category, the second category of PU learning methods [12, 10, 11, 18, 15] do not intend to distinguish the negative samples in unlabeled data, but directly weights and recombines cost-sensitive losses of positive and unlabeled data. du Plessis et al. [10] proposed a risk equivalent to the PN risk, in which the risk for negative samples are evaluated by unlabeled data, and use a non-convex ramp loss to obtain the PU loss. In order to avoid the difficulty of non-convex optimization caused by non-convex loss function in [10], du Plessis et al. have explored an alternative convex unbiased double hinge loss in [11]. By the double hinge loss, the empirical risk loss can be rewritten by a weighted ordinary convex loss for unlabeled data and a weighted composite convex loss for positive data. Gong et al. [15] proposed an algorithm called LLSVM. Unlike other cost-sensitive methods, an especial hat loss are adopted for unlabeled data, which can ensure the labels assigned to unlabeled data are confidence. Thus, the margin between positive and negative classes can be maximized. Since the performance of the second category methods does not depend on the ability of detecting negative samples in unlabeled data, these methods are still applicable in the situation of high negative samples proportion in unlabeled data. Thus the recent state-of-the-art methods usually adopt this category of strategy to solve the PU problem.
Positive and unlabeled multi-instance learning
As a learning paradigm that is widely used in many fields such as image recognition, text mining, and molecular activity detection, multi-instance learning often has PU problems caused by limited labeling capabilities. Nevertheless, there are relatively few studies on it so far. Bao et al. has proposed a representative method called PU-SKC [3] to tackle the PU-MIL problem. This method first maps both positive and unlabeled bags to single instances in the latent feature space by minimax statistic kernel
where
Here,
After all the bags are mapped, PU-MIL problem is transformed into a classical single instance PU learning problem. On this basis, Bao et al. adopted the empirical risk minimization proposed by du Plessis et al. [11] to get the optimal classifier.
PU-SKC shows considerable classification performance on several public datasets through [3]. However, in the case of low positive instances proportion in positive bag, the application of PU-SKC will be limited. Because when extract the shape information of bags by minimax statistic
PU-MIL is a learning paradigm in limited-labeling scenario, in which the data provided for binary classifier training consists of only scant positive bags and abundant unlabeled bags. The unlabeled data are composed of several positive and negative bags. Such problem appears in many domains, such as image and text classification, or web data mining. For example, when mining the specific target image or text on the web, only a small number of target samples (positive) are provided. In addition, the diverse unrelated (negative) data will also lead to difficulties in marking. In these PU-MIL problem, due to the special labeling rules of the multi-instance scenario and only existing one-side bag label, there is few label information. This class of methods can greatly reduce the requirements for manual labeling, and a decent classifier can be obtained in many tough environments. In this section, we will detail our proposed method MILBLE, which are used to tackle the PU-MIL problem, and elaborate the motivation of it. The formulation of our method will be illustrated in two parts, i.e., the bi-level embedding part and the model of embedded samples.
Bi-level embedding for positive and unlabeled bags
In PU learning problem, acquisition of labeled data is restricted. The restriction is mainly embodied in two aspects: the labeled data is unilateral, that is, only positive labels are available. Moreover, compared to unlabeled data, the number of labeled data is extremely small. Containing all the label information, positive samples play a vital role in the PU learning problem and directly determine the performance of classifier. However, in multiple instance learning scenario, the labeling rule of positive bags is particular: as long as there exists positive instance in bag, it will be labeled as positive bag. Only the few positive instances contained in positive bag really affect the PU classification, but not entire positive bag. The negative instances, which can be regarded as the unknown noise in positive bags, greatly weaken the key effect that the positive labels can provided in classifier training. Therefore, how to make positive instances avoid the interference from negative ones and work better in classification as the only labeled data is the focus of solving PU-MIL problem.
Existing PU-MIL method such as PU-SKC [3] maps all bags to single instances in the latent feature space by same mapping criteria on both positive and unlabeled data. In this way, the multi-instance problem is transformed into a single-instance problem. This approach will cause the characteristic of positive data to be blurred. Because in MIL scenario, positive instances contained in positive bag are only a few. In most cases, negative instances are the main component of positive bag. The mapping strategy used on bags, such as maximum, minimum or mean values of all the instances contained in bag, will make the difference between positive and negative data insignificant. Therefore, we do not map the positive bags to single instance, but adopt the following bi-level embedding strategy:
(1) For each instance
(2) For each unlabeled bag
where
The vectors mapped from positive instances and unlabeled bags have the same form in dimensions, and both of them describe the shape of bag. Moreover, since all the instances in positive bags are mapped separately, characteristics of positive data are not buried. In the model part of Section 3.2, a weighting strategy will be used on the mapped samples
In summary, we have detailed the framework of our proposed bi-level embedding. The advantages are listed as the following points.
The characteristics of positive samples, which are crucial to the performance of the classifier, are completely retained by the positive instance mapping. The weighting strategy adopted in positive embedding part removes the interference of negative instances in positive bags, and identifies the real positive instances, which trigger the positive label. Mappings on positive instances and unlabeled bags describe the common information of data and have the same feature dimensions, which provides a feasibility guarantee for the subsequent training of the classifier.
Based on the above advantages, when the proportion of positive instances in positive bag is small, our method still has excellent performance, which is lacking in other PU-MIL method.
After the first mapping step, the PU-MIL problem is translated into a single-instance one. Each instance in positive bags and each unlabeled bags are mapped into new feature space as
where the
The loss
Notice that the second term are not smooth, which makes it harder when optimize the object function by gradient descent method. To ensure that the objective function is differentiable everywhere, we use the square term
the square term can be upper bounded as following.
Denote the sum of all the instances in positive bags as
In this section, we will expound how to optimize Eq. (3.2) detailedly. The object function in Eq. (3.2) involves two optimization variables, i.e.,
Optimization of
When the weight vector
Denote the convex object function shown in Eq. (11) as
Based on Eq. (12), the gradients on positive train set and unlabeled train set are
After obtaining the gradient shown in Eqs (13) and (14), we get the update rule of the optimization target
Here,
When
Since the weights of each positive bag are independent, Eq. (4.2) can be transformed into separate optimization problem of
Here,
Then, we focus on how to estimate the parameter
where the
The KKT condition of the dual problem is
By the third and fourth conditions in Eq. (20), the
Let
Remove the maximum function, Eq. (22) can be rewritten as
[!t]
Note that
According to Eq. (4.2), the condition that
When
After the parameter
In order to verify the effectiveness of our proposed method, we compared it with other four representative methods on eight public datasets. The convergence behavior of optimization algorithm is also confirmed experimentally. In addition, the influence of parameters on the classifier performance, and the running time of five methods are also analyzed.
Experiment settings
To verify the effectiveness of our proposed method, eight public datasets are adopted to conduct multiple experiments. The eight datasets cover multiple application fields such as images, text, and biology. Details of adopted datasets such as the number of positive and negative bags, length of instance features and the average size of bag are shown in Table 1. In these eight datasets, the number of positive instances in the positive bags is very small in the last three ones, include rec.sport.hockey, sci.electronics and sci.med. Existing PU-MIL methods ignore the process of selecting positive instances, and the features of positive bags are easily erased by a large number of negative instances in the mapping process. Therefore, these methods may not perform well on these three datasets.
The main information of eight public dataset
The main information of eight public dataset
The average Accuracy and standard deviation results of five methods on eight data sets. The results tested the performance of five methods from four class-prior probability
It can be seen from Table 1 that the existing MIL public dataset are too small to provide sufficient samples for experiments in the PU scenario. Similar with [3], we randomly select bags from original dataset and duplicate them with the Gaussian noise of mean zero and variance 0.01. In this way, we increase the size of positive and unlabeled train set to 20 and 180, respectively. Moreover, additional 100 positive bags and 100 negative bags constitute the test set. Then, we take comparative experiments to verify the effectiveness of our proposed method on four different ratios of unlabeled bags composition. The ratio of positive bags in unlabeled bags are set as {0.2, 0.4, 0.6, 0.8}. The larger the ratio, the more positive bags in the unlabeled bags. The experiments under each ratio are repeated 20 times. We compute the mean and standard standard deviation values of each method under different classification metrics, include the accuracy, the area under curve and the F-measure, to evaluate the performance.
Classification performance
In this section, we investigate the performance of our method with four different class-prior probability
AUC (Area Under Curve) of five methods on eight data sets. The results tested performance of five methods from four class-prior probability
of each dataset of {0.2, 0.4, 0.6, 0.8}. The contents in brackets are standard deviations of AUC results. The highest score over five methods is marked in bold
AUC (Area Under Curve) of five methods on eight data sets. The results tested performance of five methods from four class-prior probability
F-measure of five methods on eight data sets. The results tested performance of five methods from four class-prior probability
As the results of accuracy, AUC and F-measure shown in Tables 2–4, our MILBLE obtains better performance gain than other representative methods in most of cases and simultaneously verify the effectiveness of our proposed method. Moreover, the MILBLE we proposed has excellent classification performance on the data set composed of various class-prior probability, while the classification performance of the other four comparison methods decreases as the class-prior probability
In addition, the performance improvements of MILBLE on the three data sets rec.sport.hockey, sci.electronics and sci.med are particularly prominent. This is because the proportions of positive instances in positive bag of these three datasets are lower than that of the other datasets, while the bi-level embedding strategy adopted in our MILBLE algorithm can retain and extract the characteristics of the true positive instances, which is different from other methods. Therefor, when the number of key positive instances is very small, MILBLE can still ensure that the effective information of positive bags will not be erased. This is why MILBLE still has excellent classification performance on the low positive instance proportion.
In the optimization section, we detailed how to minimize the objective function we proposed to obtain the optimal solution. We alternate iteratively on the coefficient term
The curves of iteration convergence times of the alternating iterative solution method. The data sets used to verify the effect of our optimal solution method are Corel_hd, Sival_ab and Tiger, respectively. The curves shown in the figure are the experimental results of three data sets under the class-prior probability of 0.2.
The curves of iteration convergence times of the alternating iterative solution method. The data sets used to verify the effect of our optimal solution method are Corel_hd, Sival_ab and Tiger, respectively. The curves shown in the figure are the experimental results of three data sets under the class-prior probability of 0.4.
It can be seen from Figs 1 and 2 that under the iterative step we set, the objective function gradually decreases and can converge within dozens of steps. Combined with the analysis of the classifier performance results, our optimization algorithm can efficiently find the optimal solution of the model.
In the formulation we proposed in Eq. (3.2), parameters
Classifier performance under different parameter combinations of 
Classifier performance under different parameter combinations of 
Figures 3 and 4 shows that these two parameters have a substantial impact on the performance of our proposed method. To be specific, from accuracy results on these parameters combinations, it can be seen that on these data sets, the better classifier performance mostly appears near the range of
In this section, we compare the training time of our method with the other four exists methods, include PU-SKC [3], C-kNN [26], aMILGDM [27] and KI-SVM [22] on eight data sets. The training time of each method is tested in the same computing environment of Intel Xeon CPU E3-1245 v3 at 3.40 GHz and 32 G memory. All the training time are the mean value of five experiments. The class-prior probability of each dataset used to test the running time is 0.6, and the size of them are all 200 (20 positive bags and 180 unlabeled bags). The CPU time is reported in Table 5.
Running time under the class-prior probability of 0.6 of eight datasets. The results are mean values with 5 experiments of five methods. The unit used to measure the running time in the table is seconds
Running time under the class-prior probability of 0.6 of eight datasets. The results are mean values with 5 experiments of five methods. The unit used to measure the running time in the table is seconds
It can be seen from results recorded in Table 5 that the time costs of PU-SKC is larger than that of MILBLE, and the costs of KI-SVM and our method are comparable. The running time of other two methods especially C-kNN are much slower than our proposed MILBLE. Nevertheless, the classification performance of these contrast methods are lower than our MILBLE. Among them, the reason of PU-SKC spends less time is that it maps each bag to a single instance in the latent feature space. The number of instances in the bag has no effect on the calculation of this algorithm. However, the classification performance of this algorithm is not good when the proportion of positive instances in positive bag is very small, which can also be seen from Table 2. In summary, our proposed MILBLE has the satisfactory classification performance with an acceptable running time.
In this paper, we focused on the multi-instance learning problem in PU scenario, which is crucial but rarely studied. We proposed a method based on bi-level embedding. This method adopt different embedding strategies for the positive and unlabeled data, which guarantees the purity of extracted information from positive bags under the premise of consistency of different two mapped samples. After mapping step, we trained the classifier by minimizing the weighted classification error loss on positive train set, and the class-prior probability based loss on unlabeled train set. We also expound how to optimize our proposed model in detail. The validity of parameters and the convergence of optimization algorithm is verified according to the experiment. Experimental results on the public datasets illustrated that our method has better performance than the compare methods in most cases. As a emerging field that has not been widely studied but has great practical value, PU-MIL still has a lot of room for research and exploration. Therefore, in the next work, we will continue to study PU-MIL problem from the perspective of other latent feature space mapping and cost-sensitive loss functions.
Footnotes
Acknowledgments
The authors want to thank the associate editor and reviewers for helpful comments and suggestions. This work was supported by the NSF of China under Grants No. 61922087, 61906201 and 62006238, NSF of Hunan Province under Grant No. 2020JJ5669, and the NSF for Distinguished Young Scholars of Hunan Province under Grant No. 2019JJ20020.
