Multi-instance positive and unlabeled learning with bi-level embedding

Abstract

Multiple Instance Learning (MIL) is a widely studied learning paradigm that arises from real applications. Existing MIL methods achieve strong performance when plenty of annotated data is available. However, sufficient labeled data is often unattainable because of the high labeling cost. For example, the task in web image identification is to find similar samples in a large unlabeled dataset given only a small number of target pictures. This leads to a particular scenario, Multiple Instance Learning with insufficient Positive and superabundant Unlabeled data (PU-MIL), which has recently attracted attention in MIL research. In this paper, we propose a novel method called Multiple Instance Learning with Bi-level Embedding (MILBLE) to tackle the PU-MIL problem. Unlike other PU-MIL methods, which use only a simple single-level mapping, the bi-level embedding strategy customizes specific mappings for positive and unlabeled data, which ensures that the characteristics of key instances are not erased. Moreover, the weighting measure adopted for positive data extracts the uncontaminated information of true positive instances without interference from negative ones. Finally, we minimize the classification error loss of the mapped examples based on the class-prior probability to train the classifier. Experimental results show that our method performs better than other state-of-the-art methods.

Keywords

Multi-instance learning, positive and unlabeled learning, bi-level embedding, classification

1. Introduction

Multiple instance learning (MIL) is a learning paradigm that has been widely studied recently. In the MIL scenario, several instances are grouped into a set called a bag and share one label. A bag with no positive instance is defined as a negative bag; otherwise, it is defined as a positive bag. MIL relaxes the requirement to label each training instance and alleviates the expensive labeling costs caused by the current explosive growth of data. Since MIL methods perform well when labels are insufficient, they are used to solve problems in many fields, such as text recognition [4], image classification [1], and molecular activity prediction [9].

In the literature, many methods have been proposed to solve the MIL problem. According to whether each bag is treated as a whole unit, the existing MIL methods can be grouped into two categories. The first category consists of instance-based methods [9, 22, 20, 28, 5], which disassemble the bags into instances and select the key instances to train the classifier. APR [9], the first method proposed to solve the MIL problem, targets drug activity prediction. Ins-KI-SVM [22] is also a representative instance-based method, which maximizes the margin between selected key instances to train a classifier. The second category contains bag-based methods. These methods do not split the bag into instances and build the classification model on bags directly. To merge multiple instances into one sample for training a bag-based classifier, two strategies are commonly used. The first strategy [26, 14, 19, 8, 2] defines a function to measure the distance between bags and trains a distance-based classifier on the basis of this function. Citation-kNN [26] is a method without a classifier training process: the bags are classified by referring to and citing the labels of their neighbors. The second strategy [7, 6, 21, 13, 5, 27] is mapping and embedding, which extracts particular kinds of information for each bag in a new latent feature space. The MIL problem is then translated into a classical supervised problem after bag mapping. MILDM [27] maps each bag to a single instance in the latent feature space via an optimal discriminative instance pool, which makes bags easier to distinguish.

The existing methods have achieved satisfactory performance on MIL problems. Nevertheless, their training is often premised on a large amount of complete and accurately labeled data, which is usually difficult to obtain in practice. In many cases, only a small amount of incompletely labeled data is available. For example, in image classification, it is much easier to label the positive samples than the negative ones in a large number of images, because it is simpler to judge that an object exists in an image than that it does not. When mining a specific target image or text on the web, only a small number of target (positive) samples are provided. In addition, the diversity of unrelated (negative) data also makes labeling difficult, such as labeling pictures of oceans, rivers, lakes, etc. These practical problems lead to a hard learning scenario with few positive bags and many unlabeled bags, which we call the Positive and Unlabeled Multiple Instance Learning (PU-MIL) problem. Since the labeled data is one-sided and very scarce, almost none of the existing MIL methods can be adapted to solve these problems well. Bao et al. have proposed a method called PU-SKC [3] to solve the MIL problem in the PU scenario. This method maps bags into a latent feature space based on a set kernel [14] and trains a classifier by a convex formulation of the empirical risk [11]. However, in the mapping step, PU-SKC maps entire bags to single instances. Since the number of positive instances in positive bags is very small, such an entire-bag mapping causes the characteristics of positive samples to be erased by the many negative instances. This approach does not attach importance to the key role of the real positive instances, which leads to the loss of pivotal discriminative information.

In this paper, we propose a novel method to tackle the PU-MIL problem, called Multiple Instance Learning with Bi-Level Embedding (MILBLE). Different from the existing PU-MIL method, which obfuscates the key information of the dataset, MILBLE pays attention to preserving and extracting the information of true positive instances in positive bags, which plays a vital role in the construction of the classifier. Specifically, the bi-level embedding strategy adopted by MILBLE customizes different mappings for positive and unlabeled data to ensure that the characteristics of key instances are not submerged. Moreover, the weighting measure used in the positive data mapping can identify true positive instances without interference from negative ones. The bi-level embedding strategy ensures that MILBLE can overcome the inherent label ambiguity of positive bags and extract the uncontaminated information from them. After embedding the mapped samples into the latent feature space, we minimize the classification error loss of the mapped examples based on the class-prior probability to train the classifier. We also give a fast and effective optimization method to solve our model. Multiple sets of experiments show that our method achieves better performance than other existing state-of-the-art methods. The contributions of this paper can be summarized as follows.

• We propose a novel bi-level embedding method that fully exploits the key instances in positive bags and avoids the obfuscation of essential information.
• The weighting strategy adopted in the positive bag embedding filters out the interference caused by negative instances and improves the purity of the positive samples used for classifier training.
• Several sets of broad-coverage experiments on public datasets verify the effectiveness of our proposed method.

The rest of this paper is organized as follows. First, we introduce related work in Section 2. In Section 3, we detail the background of the PU-MIL problem and elaborate the proposed MILBLE approach in two parts: the bi-level embedding and the model of embedded samples. The optimization of MILBLE is presented in Section 4. In Section 5, we report experimental results on eight datasets to verify the effectiveness of MILBLE under different class-prior probabilities, together with results on parameter determination, convergence analysis and running time. Finally, we conclude this paper in Section 6.

2. Related work

We first introduce some notation. Denote the positive and unlabeled training sets as ${D^{P}}=\{B_{1}^{P},B_{2}^{P},\ldots,B_{{N_{P}}}^{P}\}$ and ${D^{U}}=\{B_{1}^{U},B_{2}^{U},\ldots,B_{{N_{U}}}^{U}\}$, where $N_{P}$ and $N_{U}$ are the numbers of positive bags $B^{P}$ and unlabeled bags $B^{U}$ in the training set, respectively. Each bag is written as $B_{i}^{P}=\{\bm{x}^{P}_{i1},\bm{x}^{P}_{i2},\ldots,\bm{x}^{P}_{iP(i)}\}$ or $B_{j}^{U}=\{\bm{x}^{U}_{j1},\bm{x}^{U}_{j2},\ldots,\bm{x}^{U}_{jU(j)}\}$, with $\bm{x}\in{\mathbb{R}^{d}}$. $P(i)$ and $U(j)$ are the numbers of instances in the $i$-th positive bag $B_{i}^{P}$ and the $j$-th unlabeled bag $B_{j}^{U}$, and $\bm{x}^{P}_{ik}$ and $\bm{x}^{U}_{jl}$ are the $k$-th and $l$-th instances in $B_{i}^{P}$ and $B_{j}^{U}$, respectively.

2.1 Positive and unlabeled learning

Positive and unlabeled (PU) learning is an important part of weakly supervised learning. The goal of PU learning is to train a binary classifier with only positive and unlabeled data. So far, many methods have been proposed to solve the PU learning problem. The existing PU methods can be divided into two categories.

The first category of methods tackles the PU problem with a two-step strategy [23, 29, 24, 17, 16]. They first recognize the possible negative samples in the unlabeled data and then transform the PU problem into a traditional PN problem. Liu et al. [23] conducted experiments and evaluations on various combinations of methods for the two steps and proposed a biased formulation of SVM for the second step, which uses two parameters to weight positive and negative errors differently. Based on [23], Ke et al. [17] further identified confident positive examples in the first step for training to obtain a better classifier. Xiao et al. [29] proposed an algorithm named SPUL. This method not only identifies the possible negative samples in the first step, but also associates the remaining examples, which cannot be explicitly identified as positive or negative (they call them ambiguous examples), with two similarity weights. These two weights indicate the similarity of an ambiguous example to the positive and negative classes, respectively. They are also incorporated into an SVM-based learning phase to build a more accurate classifier in the second step. These approaches rely heavily on the discrimination of examples in the unlabeled data. When the negative samples cannot be identified correctly, the performance of the classifier is usually unsatisfactory.

Different from the first category, the second category of PU learning methods [12, 10, 11, 18, 15] does not try to distinguish the negative samples in the unlabeled data, but directly weights and recombines cost-sensitive losses on positive and unlabeled data. du Plessis et al. [10] proposed a risk equivalent to the PN risk, in which the risk for negative samples is evaluated on unlabeled data, and used a non-convex ramp loss to obtain the PU loss. To avoid the difficulty of non-convex optimization caused by the non-convex loss function in [10], du Plessis et al. explored an alternative convex unbiased double hinge loss in [11]. With the double hinge loss, the empirical risk can be rewritten as a weighted ordinary convex loss for unlabeled data plus a weighted composite convex loss for positive data. Gong et al. [15] proposed an algorithm called LLSVM. Unlike other cost-sensitive methods, it adopts a special hat loss for unlabeled data, which ensures that the labels assigned to unlabeled data are confident. Thus, the margin between the positive and negative classes can be maximized. Since the performance of the second category of methods does not depend on the ability to detect negative samples in unlabeled data, these methods remain applicable when the proportion of negative samples in the unlabeled data is high. Thus, recent state-of-the-art methods usually adopt this category of strategy to solve the PU problem.

2.2 Positive and unlabeled multi-instance learning

As a learning paradigm that is widely used in many fields such as image recognition, text mining, and molecular activity detection, multi-instance learning often faces PU problems caused by limited labeling capabilities. Nevertheless, there are relatively few studies on it so far. Bao et al. proposed a representative method called PU-SKC [3] to tackle the PU-MIL problem. This method first maps both positive and unlabeled bags to single instances in the latent feature space by the minimax statistic kernel $\tilde{k}_{\textit{minimax}}$, which is a kind of set kernel [14]. The kernel $\tilde{k}_{\textit{minimax}}$ between bags $B$ and $B^{\prime}$ is defined as

$\displaystyle\tilde{k}_{\textit{minimax}}\left(B,B^{\prime}\right):=\left(s_{\textit{minimax}}(B)^{\top}s_{\textit{minimax}}\left(B^{\prime}\right)+1\right)^{\rho},$ (1)

where $\rho$ is a positive integer. $s_{\textit{minimax}}$ is called a minimax statistic, which can be calculated by

$\displaystyle s_{\textit{minimax}}(B):=\left[\min_{\bm{x}\in B}x^{(1)},\ldots,\min_{\bm{x}\in B}x^{(d)},\max_{\bm{x}\in B}x^{(1)},\ldots,\max_{\bm{x}\in B}x^{(d)}\right]^{\top}.$ (2)

Here, $x^{(k)}$ is the $k$-th element of instance $\bm{x}$ in bag $B$. Through the above minimax statistic kernel $\tilde{k}_{\textit{minimax}}$, each bag $B$ in the training set $\{D^{P},D^{U}\}$ can be mapped to an instance $\phi(B)$ by the following rule,

$\displaystyle\phi(B)=\left[\tilde{k}_{\textit{minimax}}\left(B,B_{1}^{P}\right),\ldots,\tilde{k}_{\textit{minimax}}\left(B,B_{N_{P}}^{P}\right),\tilde{k}_{\textit{minimax}}\left(B,B_{1}^{U}\right),\ldots,\tilde{k}_{\textit{minimax}}\left(B,B_{N_{U}}^{U}\right)\right]^{\top}.$ (3)

After all the bags are mapped, PU-MIL problem is transformed into a classical single instance PU learning problem. On this basis, Bao et al. adopted the empirical risk minimization proposed by du Plessis et al. [11] to get the optimal classifier.
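The mapping of Eqs (1) to (3) can be illustrated with a minimal NumPy sketch. The function names and the toy bags below are assumptions made for illustration only; they are not taken from the PU-SKC implementation.

```python
import numpy as np

def minimax_statistic(bag):
    """Eq. (2): per-dimension minima followed by per-dimension maxima of a bag (n x d)."""
    bag = np.asarray(bag, dtype=float)
    return np.concatenate([bag.min(axis=0), bag.max(axis=0)])

def minimax_kernel(bag_a, bag_b, rho=1):
    """Eq. (1): polynomial kernel of degree rho on the minimax statistics."""
    return (minimax_statistic(bag_a) @ minimax_statistic(bag_b) + 1.0) ** rho

def embed_bag(bag, train_bags, rho=1):
    """Eq. (3): kernel values between `bag` and every training bag."""
    return np.array([minimax_kernel(bag, other, rho) for other in train_bags])

# Toy data (hypothetical): two positive bags and one unlabeled bag in R^2.
positive_bags = [np.array([[0.0, 1.0], [2.0, 3.0]]), np.array([[1.0, 1.0]])]
unlabeled_bags = [np.array([[-1.0, 0.0], [0.5, 0.5], [1.0, 2.0]])]
train_bags = positive_bags + unlabeled_bags

phi = embed_bag(positive_bags[0], train_bags, rho=2)
print(phi)  # one kernel value per training bag
```

Note that every bag, whatever its size, collapses to one vector of length $N_{P}+N_{U}$, so the contribution of any single instance is visible only through the per-dimension extremes.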

PU-SKC shows considerable classification performance on several public datasets [3]. However, when the proportion of positive instances in positive bags is low, the applicability of PU-SKC is limited. When the shape information of bags is extracted by the minimax statistic $s_{\textit{minimax}}$ in Eq. (2), the characteristics of positive instances are erased by the many negative instances, which makes the difference between positive and negative samples smaller. In response to this limitation in PU-MIL, we propose a bi-level embedding method called MILBLE and detail it in the next section.

3. Proposed method

PU-MIL is a learning paradigm for the limited-labeling scenario, in which the data provided for training a binary classifier consists of only a few positive bags and abundant unlabeled bags. The unlabeled data is composed of both positive and negative bags. Such problems appear in many domains, such as image and text classification or web data mining. For example, when mining a specific target image or text on the web, only a small number of target (positive) samples are provided. In addition, the diversity of unrelated (negative) data also makes labeling difficult. In these PU-MIL problems, there is little label information, owing to the special labeling rule of the multi-instance scenario and the availability of one-sided bag labels only. Methods for this setting can greatly reduce the requirements for manual labeling, and a decent classifier can be obtained in many tough environments. In this section, we detail our proposed method MILBLE, which is used to tackle the PU-MIL problem, and elaborate its motivation. The formulation of our method is presented in two parts, i.e., the bi-level embedding and the model of embedded samples.

3.1 Bi-level embedding for positive and unlabeled bags

In the PU learning problem, the acquisition of labeled data is restricted in two respects. First, the labeled data is one-sided, that is, only positive labels are available. Second, compared with the unlabeled data, the amount of labeled data is extremely small. Since they carry all the label information, positive samples play a vital role in the PU learning problem and directly determine the performance of the classifier. However, in the multiple instance learning scenario, the labeling rule of positive bags is particular: as long as a bag contains a positive instance, it is labeled as a positive bag. Only the few positive instances contained in a positive bag really affect the PU classification, not the entire positive bag. The negative instances, which can be regarded as unknown noise in positive bags, greatly weaken the key effect that the positive labels can provide in classifier training. Therefore, the focus of solving the PU-MIL problem is how to shield the positive instances, the only labeled data, from the interference of negative ones so that they work better in classification.

The existing PU-MIL method PU-SKC [3] maps all bags to single instances in the latent feature space using the same mapping criterion for both positive and unlabeled data. In this way, the multi-instance problem is transformed into a single-instance problem. This approach blurs the characteristics of positive data, because in the MIL scenario a positive bag contains only a few positive instances, and in most cases negative instances are its main component. A mapping applied to whole bags, such as the maximum, minimum or mean values of all the instances in the bag, makes the difference between positive and negative data insignificant. Therefore, we do not map the positive bags to single instances, but adopt the following bi-level embedding strategy:

(1) For each instance $\bm{x}$ in a positive bag, we define the map as

$\displaystyle\bm{\Phi}_{\textit{ins}}(\bm{x})=\left[x_{\max}^{(1)}-x^{(1)},\ldots,x_{\max}^{(d)}-x^{(d)},x^{(1)}-x_{\min}^{(1)},\ldots,x^{(d)}-x_{\min}^{(d)}\right].$ (4)

(2) For each unlabeled bag $B$, we define the map as

$\displaystyle\bm{\Phi}_{\textit{bag}}(B)=\left[x_{\min}^{(1)},\ldots,x_{\min}^{(d)},x_{\max}^{(1)},\ldots,x_{\max}^{(d)}\right],$ (5)

where $x^{(l)}$ is the $l$-th dimension of instance $\bm{x}$, and $x_{\max}^{(l)}$ and $x_{\min}^{(l)}$ are the maximum and minimum values of the $l$-th dimension over all instances in the bag (in Eq. (4), the bag that contains $\bm{x}$).

The vectors mapped from positive instances and unlabeled bags have the same dimension, and both of them describe the shape of the bag. Moreover, since all the instances in positive bags are mapped separately, the characteristics of positive data are not buried. In the model part of Section 3.2, a weighting strategy is applied to the mapped samples $\bm{\Phi}_{\textit{ins}}(\bm{x})$ to select the true positive instances among them. Unlike for positive data, the mapping rule $\bm{\Phi}_{\textit{bag}}(B)$ maps each entire unlabeled bag to a single instance, which avoids the data imbalance that too many unlabeled instances would cause in subsequent training.
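The two maps in Eqs (4) and (5) can be sketched in a few lines of NumPy. The function names and toy bags are hypothetical and serve only to show that both maps produce vectors of length $2d$, one per positive instance and one per unlabeled bag.

```python
import numpy as np

def embed_positive_instances(bag):
    """Eq. (4): map every instance x of a positive bag to [x_max - x, x - x_min]."""
    bag = np.asarray(bag, dtype=float)
    x_min, x_max = bag.min(axis=0), bag.max(axis=0)
    return np.hstack([x_max - bag, bag - x_min])  # shape (n_instances, 2d)

def embed_unlabeled_bag(bag):
    """Eq. (5): map a whole unlabeled bag to [x_min, x_max]."""
    bag = np.asarray(bag, dtype=float)
    return np.concatenate([bag.min(axis=0), bag.max(axis=0)])  # shape (2d,)

# Toy bags in R^2 (made up for illustration).
positive_bag = np.array([[0.0, 4.0], [2.0, 1.0], [1.0, 3.0]])
unlabeled_bag = np.array([[-1.0, 0.0], [3.0, 2.0]])

phi_ins = embed_positive_instances(positive_bag)  # one row per instance
phi_bag = embed_unlabeled_bag(unlabeled_bag)      # one vector per bag
print(phi_ins.shape, phi_bag.shape)
```

A positive bag with three instances thus contributes three mapped samples, whereas an unlabeled bag always contributes exactly one.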

In summary, we have detailed the framework of our proposed bi-level embedding. The advantages are listed as the following points.

• The characteristics of positive samples, which are crucial to the performance of the classifier, are completely retained by the positive instance mapping.

• The weighting strategy adopted in positive embedding part removes the interference of negative instances in positive bags, and identifies the real positive instances, which trigger the positive label.

• Mappings on positive instances and unlabeled bags describe the common information of data and have the same feature dimensions, which provides a feasibility guarantee for the subsequent training of the classifier.

Owing to these advantages, our method still performs well when the proportion of positive instances in positive bags is small, a property that other PU-MIL methods lack.

3.2 Weighting model for embedded samples

After the first mapping step, the PU-MIL problem is translated into a single-instance one. Each instance in the positive bags and each unlabeled bag are mapped into the new feature space as $\bm{\Phi}_{\textit{ins}}(\bm{x})$ and $\bm{\Phi}_{\textit{bag}}(B)$, respectively. The model we use to solve the resulting PU learning problem can be formulated as follows.

$\displaystyle\begin{aligned}\min_{\bm{\omega},b,\bm{a}}\quad&\frac{1}{2}\|\bm{\omega}\|^{2}+\frac{\alpha}{\sum_{m=1}^{N_{P}}P(m)}\sum_{m=1}^{N_{P}}\sum_{i=1}^{P(m)}a_{i}^{m}\,l_{1}\left(g\left(\Phi_{ins}\left(\bm{x}_{mi}^{P}\right)\right)\right)\\ &+\lambda\sum_{m=1}^{N_{P}}\left\|\bm{a}^{m}\right\|^{2}+\gamma\,l_{2}\left(\frac{1}{N_{U}}\sum_{j=1}^{N_{U}}g^{\prime}\left(\Phi_{bag}\left(B_{j}^{U}\right)\right),t\right),\\ \text{s.t.}\quad&a_{i}^{m}\geqslant 0,\ \sum_{i=1}^{P(m)}a_{i}^{m}=1,\end{aligned}$ (6)

where $\alpha$, $\lambda$ and $\gamma$ are non-negative parameters. $a^{m}_{i}$ is the weight on the $i$-th instance in the $m$-th positive bag, which is used to identify the true positive instances in bag $B_{m}^{P}$. The weights of the instances in each bag are non-negative and sum to 1. $g(\bm{x})=\bm{\omega}^{\top}\bm{x}+b$ is the classifier for mapped samples, and $g^{\prime}(\bm{x})=(2/\pi)\arctan(g(\bm{x}))$. The terms $\frac{1}{2}\|\bm{\omega}\|^{2}$ and $\lambda\sum_{m}\|\bm{a}^{m}\|^{2}$ are regularizers that help avoid overfitting. $l_{1}$ and $l_{2}$ are the losses on the positive and unlabeled training sets, respectively.

$\displaystyle\begin{aligned}l_{1}(z)&=\log_{2}(1+e^{-z}),\\ l_{2}(z,t)&=\max(z-t,0).\end{aligned}$ (7)

The loss $l_{1}$ used in the positive-data term penalizes a mapped positive instance when its classifier output is small. By optimizing this term, every $\Phi_{ins}\left(\bm{x}_{mi}^{P}\right)$ with non-zero weight is pushed to be classified as positive. The loss $l_{2}$ used in the unlabeled-data term penalizes the mean value of $g^{\prime}\left(\Phi_{bag}\left(B\right)\right)$ over the unlabeled training set when it exceeds the threshold $t$. $g^{\prime}(\bm{x})=(2/\pi)\arctan(g(\bm{x}))$ can be regarded as an approximation to the sign function, which is used to count the positive and negative samples among all unlabeled bags. If we optimized the proposed model without the unlabeled-data term, all samples would be classified as positive by the positive-biased classifier. This term therefore encodes the fact that a certain proportion of the unlabeled bags are negative and ensures that these negative bags are assigned to the correct side, thereby correcting the biased classification obtained from the positive-data term.

Notice that the unlabeled-data term is not smooth, which makes it harder to optimize the objective function by gradient descent. To ensure that the objective function is differentiable everywhere, we use the squared term $[\max(z-t,0)]^{2}$ instead of $l_{2}(z,t)$. In addition, since this term involves the whole unlabeled training set inside the loss, differentiating it with respect to $\bm{\omega}$ leads to a complicated expression. To facilitate optimization, we optimize an upper bound of this term instead. Let $\bm{\varphi}^{P}_{mi}$ and $\bm{\varphi}^{U}_{j}$ denote $\Phi_{ins}\left(\bm{x}_{mi}^{P}\right)$ and $\Phi_{bag}\left(B_{j}^{U}\right)$, respectively. According to Jensen’s inequality, we have

$\displaystyle\frac{1}{n}\sum_{i=1}^{n}f(x_{i})\geqslant f\left(\frac{1}{n}\sum_{i=1}^{n}x_{i}\right),$ (8)

which holds for any convex function $f$. Applying it first to $\max(\cdot,0)$ and then to the square, the squared term can be upper bounded as follows.

$\displaystyle\left[\max\left(\frac{1}{N_{U}}\sum_{j=1}^{N_{U}}g^{\prime}(\bm{\varphi}_{j}^{U})-t,0\right)\right]^{2}\leqslant\left[\frac{1}{N_{U}}\sum_{j=1}^{N_{U}}\max\left(g^{\prime}(\bm{\varphi}_{j}^{U})-t,0\right)\right]^{2}\leqslant\frac{1}{N_{U}}\sum_{j=1}^{N_{U}}\left[\max\left(g^{\prime}(\bm{\varphi}_{j}^{U})-t,0\right)\right]^{2}.$ (9)
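The chain of inequalities in Eq. (9) can be checked numerically. The sketch below draws random stand-ins for the values $g^{\prime}(\bm{\varphi}_{j}^{U})\in(-1,1)$; the helper name `bound_terms` and the chosen thresholds are assumptions for illustration.

```python
import numpy as np

def bound_terms(g_prime, t):
    """Return the three expressions of Eq. (9), from left to right."""
    g_prime = np.asarray(g_prime, dtype=float)
    hinge = np.maximum(g_prime - t, 0.0)
    left = max(g_prime.mean() - t, 0.0) ** 2  # squared hinge of the mean
    middle = hinge.mean() ** 2                # square of the mean hinge
    right = (hinge ** 2).mean()               # mean of the squared hinges
    return left, middle, right

rng = np.random.default_rng(0)
g_prime = rng.uniform(-1.0, 1.0, size=1000)   # stand-ins for g'(phi_j^U)

for t in (-0.5, 0.0, 0.5):
    left, middle, right = bound_terms(g_prime, t)
    print(t, left <= middle <= right)
```

The rightmost expression is a plain average over unlabeled bags, which is what makes the per-sample gradient computation possible.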

Denote the total number of instances in all positive bags as $n_{P}$. Substituting Eq. (7) and the upper bound of the squared $l_{2}$ loss into Eq. (6), and absorbing the bias $b$ into $\bm{\omega}$, we can rewrite the proposed model as follows,

$\displaystyle\begin{aligned}\min_{\bm{\omega},\bm{a}}\quad&\frac{\alpha}{n_{P}}\sum_{m=1}^{N_{P}}\sum_{i=1}^{P(m)}a_{i}^{m}\log\left(1+e^{-\bm{\omega}^{\top}\bm{\varphi}_{mi}^{P}}\right)+\lambda\sum_{m=1}^{N_{P}}\left\|\bm{a}^{m}\right\|^{2}\\ &+\frac{\gamma}{N_{U}}\sum_{j=1}^{N_{U}}\left[\max\left(\frac{2}{\pi}\arctan\left(\bm{\omega}^{\top}\bm{\varphi}_{j}^{U}\right)-t,0\right)\right]^{2}+\frac{1}{2}\left\|\bm{\omega}\right\|^{2}\\ \text{s.t.}\quad&a_{i}^{m}\geqslant 0,\ \sum_{i=1}^{P(m)}a_{i}^{m}=1.\end{aligned}$ (10)
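As a rough illustration, the objective of Eq. (10) can be evaluated as follows. This is a sketch under stated assumptions: the bias is absorbed by appending a constant 1 to each mapped sample, the weights of all positive bags are stored in one flat vector `a`, the natural logarithm is used as written in Eq. (10), and all function and variable names are hypothetical.

```python
import numpy as np

def add_bias_column(phi):
    """Append a constant 1 so that the bias b becomes the last entry of omega."""
    phi = np.atleast_2d(np.asarray(phi, dtype=float))
    return np.hstack([phi, np.ones((phi.shape[0], 1))])

def milble_objective(omega, a, phi_pos, phi_unl, alpha, lam, gamma, t):
    """Value of Eq. (10).

    phi_pos: (n_P, D) mapped positive instances, a: (n_P,) their weights,
    phi_unl: (N_U, D) mapped unlabeled bags.
    """
    n_pos = phi_pos.shape[0]
    pos_loss = np.logaddexp(0.0, -(phi_pos @ omega))          # log(1 + e^{-z})
    unl_score = (2.0 / np.pi) * np.arctan(phi_unl @ omega)    # g'
    unl_loss = np.maximum(unl_score - t, 0.0) ** 2            # squared hinge
    return (alpha / n_pos * np.sum(a * pos_loss)
            + lam * np.sum(a ** 2)
            + gamma * np.mean(unl_loss)
            + 0.5 * omega @ omega)

# Toy inputs (made up): two mapped positive instances, one mapped unlabeled bag.
phi_pos = add_bias_column([[1.0, 2.0], [3.0, 4.0]])
phi_unl = add_bias_column([[0.0, 1.0]])
a = np.array([0.5, 0.5])
omega = np.zeros(3)

value = milble_objective(omega, a, phi_pos, phi_unl,
                         alpha=2.0, lam=1.0, gamma=4.0, t=-0.5)
print(value)
```

Using `np.logaddexp` instead of `np.log(1 + np.exp(-z))` avoids overflow for strongly negative scores; this is an implementation choice, not something prescribed by the model.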

4. Optimization

In this section, we explain in detail how to optimize Eq. (10). The objective function in Eq. (10) involves two optimization variables, i.e., $\bm{\omega}$ and $\bm{a}$. It is hard to optimize both variables simultaneously, so we adopt an alternating iterative algorithm. Specifically, after initializing each variable, we first optimize the objective function with respect to $\bm{\omega}$ with $\bm{a}$ fixed. Then, we optimize it with respect to $\bm{a}$ with $\bm{\omega}$ fixed. We repeat these two procedures until convergence. In the following, we describe these two optimization procedures in detail and give the overall optimization algorithm.

4.1 Optimization of $\omega$

When the weight vector $\bm{a}$ is fixed, $a_{i}^{m}$ can be regarded as a constant parameter, and the objective function in Eq. (10) can be rewritten as

$\displaystyle\min_{\bm{\omega}}\ \frac{\alpha}{n_{P}}\sum_{m=1}^{N_{P}}\sum_{i=1}^{P(m)}a_{i}^{m}\log\left(1+e^{-\bm{\omega}^{\top}\bm{\varphi}_{mi}^{P}}\right)+\frac{1}{2}\left\|\bm{\omega}\right\|^{2}+\frac{\gamma}{N_{U}}\sum_{j=1}^{N_{U}}\left[\max\left(\frac{2}{\pi}\arctan\left(\bm{\omega}^{\top}\bm{\varphi}_{j}^{U}\right)-t,0\right)\right]^{2}.$ (11)

Denote the objective function shown in Eq. (11) as $J(\bm{\omega})$. We use gradient descent for optimization. The function on each mapped positive instance $\bm{\varphi}_{mi}^{P}$ and each mapped unlabeled bag $\bm{\varphi}_{j}^{U}$ can be formulated as

$\displaystyle J=\left\{\begin{array}{l}\frac{1}{2}\left\|\bm{\omega}\right\|^{2}+\frac{\alpha}{n_{P}}a_{i}^{m}\log\left(1+e^{-\bm{\omega}^{\top}\bm{\varphi}_{mi}^{P}}\right),\\ \frac{1}{2}\left\|\bm{\omega}\right\|^{2}+\frac{\gamma}{N_{U}}\left[\max\left(\frac{2}{\pi}\arctan\left(\bm{\omega}^{\top}\bm{\varphi}_{j}^{U}\right)-t,0\right)\right]^{2}.\end{array}\right.$ (12)

Based on Eq. (12), the gradients on the positive and unlabeled training sets follow from the chain rule:

$\displaystyle\nabla_{\bm{\omega}}\left(J\right)=\bm{\omega}-\frac{\alpha a_{i}^{m}}{n_{P}}\frac{e^{-\bm{\omega}^{\top}\bm{\varphi}_{mi}^{P}}}{1+e^{-\bm{\omega}^{\top}\bm{\varphi}_{mi}^{P}}}\bm{\varphi}_{mi}^{P},\quad\text{for positive data,}$ (13)

$\displaystyle\nabla_{\bm{\omega}}\left(J\right)=\bm{\omega}+\frac{4\gamma}{\pi N_{U}}\frac{\max\left[\frac{2}{\pi}\arctan\left(\bm{\omega}^{\top}\bm{\varphi}_{j}^{U}\right)-t,0\right]}{1+\left(\bm{\omega}^{\top}\bm{\varphi}_{j}^{U}\right)^{2}}\bm{\varphi}_{j}^{U},\quad\text{for unlabeled data.}$ (14)

After obtaining the gradients shown in Eqs (13) and (14), we get the update rule of the optimization target $\bm{\omega}$ as follows.

$\displaystyle\bm{\omega}:=\bm{\omega}-\tau\left[\sum_{i=1}^{n_{P}}\nabla_{\bm{\omega}}\left(J_{i}\right)+\sum_{j=1}^{N_{U}}\nabla_{\bm{\omega}}\left(J_{j}\right)\right].$ (15)

Here, $\tau$ is the step size, and its value ranges from $e^{-3}$ to $e^{-15}$ in our paper. When the results of two consecutive iterations stabilize, we stop the gradient descent algorithm and take the current $\bm{\omega}$ as the result.
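The $\bm{\omega}$ step can be sketched as below. The gradient is obtained by differentiating Eq. (11) directly with the chain rule, so the regularizer $\bm{\omega}$ is counted once; the step-size loop tries $\tau=e^{-3},\ldots,e^{-15}$ and keeps a candidate only if it lowers the objective. Function names, the flat weight vector `a` and the random toy data are assumptions for illustration.

```python
import numpy as np

def objective_omega(omega, a, phi_pos, phi_unl, alpha, gamma, t):
    """J(omega) of Eq. (11); the bias is assumed to be absorbed into omega."""
    pos = np.logaddexp(0.0, -(phi_pos @ omega))  # log(1 + e^{-z}), overflow-safe
    hinge = np.maximum((2.0 / np.pi) * np.arctan(phi_unl @ omega) - t, 0.0)
    return (alpha / len(phi_pos) * np.sum(a * pos)
            + 0.5 * omega @ omega
            + gamma * np.mean(hinge ** 2))

def gradient_omega(omega, a, phi_pos, phi_unl, alpha, gamma, t):
    """Gradient of Eq. (11) with respect to omega."""
    z_pos = phi_pos @ omega
    sig = np.exp(-np.logaddexp(0.0, z_pos))      # e^{-z} / (1 + e^{-z})
    grad_pos = -(alpha / len(phi_pos)) * (phi_pos.T @ (a * sig))
    z_unl = phi_unl @ omega
    hinge = np.maximum((2.0 / np.pi) * np.arctan(z_unl) - t, 0.0)
    grad_unl = (4.0 * gamma / (np.pi * len(phi_unl))) * (
        phi_unl.T @ (hinge / (1.0 + z_unl ** 2)))
    return omega + grad_pos + grad_unl

def update_omega(omega, a, phi_pos, phi_unl, alpha, gamma, t):
    """Try tau = e^-3, ..., e^-15 and keep every step that lowers J."""
    args = (a, phi_pos, phi_unl, alpha, gamma, t)
    grad = gradient_omega(omega, *args)
    for k in range(3, 16):
        candidate = omega - np.exp(-k) * grad
        if objective_omega(candidate, *args) < objective_omega(omega, *args):
            omega = candidate
    return omega

rng = np.random.default_rng(0)
phi_pos = rng.normal(size=(6, 4))   # toy mapped positive instances
phi_unl = rng.normal(size=(5, 4))   # toy mapped unlabeled bags
a = np.full(6, 1.0 / 6.0)           # toy instance weights
omega0 = rng.normal(size=4)
omega1 = update_omega(omega0, a, phi_pos, phi_unl, alpha=1.0, gamma=1.0, t=0.0)
print(objective_omega(omega0, a, phi_pos, phi_unl, 1.0, 1.0, 0.0),
      objective_omega(omega1, a, phi_pos, phi_unl, 1.0, 1.0, 0.0))
```

Because candidates are accepted only when the objective decreases, the update never increases $J(\bm{\omega})$, whatever the scale of the gradient.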

4.2 Optimization of $\bm{a}$

When $\bm{\omega}$ is fixed, the terms of Eq. (10) that depend only on $\bm{\omega}$ are constant and can be dropped. Let $v_{i}^{m}=\log(1+{e^{-\bm{\omega}^{\top}\bm{\varphi}_{mi}^{P}}})$. Denote $\bm{a}=[a_{1}^{1},a_{2}^{1},\ldots,a_{{P}(N_{P})}^{{N_{P}}}]$ and $\bm{v}=[v_{1}^{1},v_{2}^{1},\ldots,v_{{P}(N_{P})}^{{N_{P}}}]$. The objective function to be optimized is as follows.

$\displaystyle\begin{aligned}\min_{\bm{a}}\quad&\frac{\alpha}{n_{P}}\bm{a}^{\top}\bm{v}+\lambda\bm{a}^{\top}\bm{a}\\ \text{s.t.}\quad&\bm{a}\geqslant\bm{0},\ \sum_{i=1}^{P(m)}a_{i}^{m}=1,\ m=1,2,\ldots,N_{P}.\end{aligned}$ (16)

Since the weights of each positive bag are independent, Eq. (16) can be decomposed into $N_{P}$ separate optimization problems, one per bag,

$\displaystyle\begin{aligned}\min_{\bm{a}^{m}}\quad&\frac{\alpha}{n_{P}}\bm{a}^{m\top}\bm{v}^{m}+\lambda\bm{a}^{m\top}\bm{a}^{m}\\ \text{s.t.}\quad&\bm{a}^{m}\geqslant\bm{0},\ \bm{1}^{\top}\bm{a}^{m}=1.\end{aligned}$ (17)

Here, $\bm{v}^{m}$ is a constant vector. Optimizing only the first term in Eq. (17) results in a sparse solution. On the contrary, optimizing only the second term yields a smooth solution. Thus, the sparsity of $\bm{a}$ can be controlled by adjusting the parameter $\lambda$. Since the role of $\bm{a}$ in our proposed model is to select true positive instances and weight the selected key instances, we can preset the number of key instances and then calculate the parameter $\lambda$ from it. Besides, fixing $\lambda$ by presetting the $\ell_{0}$-norm of the vector $\bm{a}^{m}$ also reduces the number of hyperparameters.

Next, we focus on how to estimate the parameter $\lambda$. The Lagrangian function of Eq. (17) is

$\displaystyle L=\frac{\alpha}{n_{P}}\bm{a}^{m\top}\bm{v}^{m}+\lambda\bm{a}^{m\top}\bm{a}^{m}-\bm{d}^{\top}\bm{a}^{m}-e\left(\bm{1}^{\top}\bm{a}^{m}-1\right),$ (18)

where $\bm{d}$ and $e$ are Lagrangian multipliers. Setting the partial derivative of Eq. (18) with respect to $\bm{a}^{m}$ to zero gives

$\displaystyle\frac{\alpha}{n_{P}}\bm{v}^{m}+2\lambda\bm{a}^{m}-\bm{d}-e\bm{1}=0.$ (19)

The KKT conditions are

$\displaystyle\left\{\begin{array}{l}\bm{a}^{m},\bm{d},e\geqslant 0,\\ \bm{a}^{m\top}\bm{1}=1,\\ \bm{d}^{\top}\bm{a}^{m}=0,\\ \bm{a}^{m}=\frac{1}{2\lambda}\left(e\bm{1}+\bm{d}-\frac{\alpha}{n_{P}}\bm{v}^{m}\right).\end{array}\right.$ (20)

By the third and fourth conditions in Eq. (20), $\bm{a}^{m}$ can be written as

$\displaystyle\bm{a}^{m}=\max\left[\frac{1}{2\lambda}\left(e\bm{1}+\bm{d}-\frac{\alpha}{n_{P}}\bm{v}^{m}\right),0\right].$ (21)

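Let ${\left\|\bm{a}^{m}\right\|_{0}}=s$, i.e., exactly $s$ instances in the bag receive non-zero weight. Denote the $p$-th largest element of $\bm{a}^{m}$ and the $q$-th smallest element of $\bm{v}^{m}$ as $a_{(p)}$ and $v_{(q)}$, respectively. For brevity, we absorb the constant factor $\alpha/n_{P}$ into $\bm{v}^{m}$ and the factor $1/(2\lambda)$ into $e$. Since $d_{i}=0$ whenever $a_{i}^{m}>0$, Eq. (21) implies that $a_{(s)}$ and $a_{(s+1)}$ should satisfy

$\displaystyle\begin{aligned}a_{(s)}&=\max\left(e-\frac{1}{2\lambda}v_{(s)},0\right)>0,\\ a_{(s+1)}&=\max\left(e-\frac{1}{2\lambda}v_{(s+1)},0\right)=0.\end{aligned}$ (22)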

Removing the maximum function, Eq. (22) can be rewritten as

$\displaystyle\begin{aligned}e-\frac{1}{2\lambda}v_{(s)}&>0,\\ e-\frac{1}{2\lambda}v_{(s+1)}&\leqslant 0.\end{aligned}$ (23)

Algorithm 1: Optimization algorithm of MILBLE
Input: $\bm{\Phi}_{\textit{ins}}(\bm{x}^{P})$, $\bm{\Phi}_{\textit{bag}}(B^{U})$.
Output: $g(\bm{\Phi}_{\textit{bag}}(B))$.
Initialize $\bm{\omega}$, $\bm{a}$ and $\lambda$.
Calculate the objective function value $ov(\bm{\omega},\bm{a},\lambda)$ by Eq. (10).
while $ov$ is not stable do
    Fix $\bm{a}$, solve $\bm{\omega}$:
    Calculate the gradients on the positive and unlabeled training sets by Eqs (13) and (14).
    Compute the gradient of the objective function by $\textit{grad}=\left[\sum_{i=1}^{n_{P}}\nabla_{\bm{\omega}}\left(J_{i}\right)+\sum_{j=1}^{N_{U}}\nabla_{\bm{\omega}}\left(J_{j}\right)\right]$.
    for $t=3:15$ do
        $\tau=e^{-t}$.
        $\bm{\omega}_{\textit{update}}=\bm{\omega}-\tau*\textit{grad}$.
        if $ov(\bm{\omega})>ov(\bm{\omega}_{\textit{update}})$ then
            $\bm{\omega}=\bm{\omega}_{\textit{update}}$.
        end
    end
    Fix $\bm{\omega}$, solve $\bm{a}$:
    Calculate $\bm{v}=[v_{1}^{1},v_{2}^{1},\ldots,v_{{P}(N_{P})}^{{N_{P}}}]$ by $v_{i}^{m}=\log(1+{e^{-\bm{\omega}^{\top}\bm{\varphi}_{mi}^{P}}})$.
    Estimate $\lambda$ by Eq. (26).
    Optimize $\bm{a}$ by solving Eq. (17).
end

Note that $\bm{1}^{\top}\bm{a}^{m}=1$. Summing the $s$ nonzero entries of $\bm{a}^{m}$ given by Eq. (22) yields an equation in $e$:

$\displaystyle se-\sum\limits_{i=1}^{s}\frac{v_{(i)}}{2\lambda}=1,$
$\displaystyle e=\frac{1}{s}\left(1+\sum\limits_{i=1}^{s}\frac{v_{(i)}}{2\lambda}\right).$ (24)

According to Eq. (4.2), the condition that $\lambda$ should satisfy is

$\displaystyle\frac{sv_{(s)}}{2}-\frac{1}{2}\sum\limits_{i=1}^{s}v_{(i)}<\lambda\leqslant\frac{sv_{(s+1)}}{2}-\frac{1}{2}\sum\limits_{i=1}^{s}v_{(i)}.$ (25)
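
As a check on Eq. (25), a short derivation can be given, assuming that the sum in Eq. (24) runs over the $s$ smallest losses $v_{(1)},\ldots,v_{(s)}$ and that $\lambda>0$. Multiplying both inequalities in Eq. (23) by $2\lambda$ and substituting $e$ from Eq. (24) gives

$\displaystyle\frac{1}{s}\left(2\lambda+\sum\limits_{i=1}^{s}v_{(i)}\right)>v_{(s)},\qquad\frac{1}{s}\left(2\lambda+\sum\limits_{i=1}^{s}v_{(i)}\right)\leqslant v_{(s+1)}.$

Solving each inequality for $\lambda$ reproduces the lower and upper bounds in Eq. (25).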

When $\lambda$ satisfies the condition in Eq. (25), the optimal weight $\bm{a}^{m}$ satisfies $\left\|\bm{a}^{m}\right\|_{0}=s$. Since the upper bound in Eq. (25) is admissible, for convenience the value of $\lambda$ is set to

$\displaystyle\lambda=\frac{sv_{(s+1)}}{2}-\frac{1}{2}\sum\limits_{i=1}^{s}v_{(i)}.$ (26)
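
To make the role of $\lambda$ concrete, the following Python sketch evaluates Eqs. (26), (24) and (22) for a single positive bag. It is only an illustration under stated assumptions: the function name `sparse_weights` is hypothetical, the sums are taken over the $s$ smallest losses, the factor $\alpha/n_{P}$ is assumed to be already absorbed into the losses `v`, and the paper itself solves for $\bm{a}$ with a quadratic programming solver rather than this closed form.

```python
def sparse_weights(v, s):
    """Return (lam, e, a) for one bag, given instance losses v and sparsity s.

    Assumptions (not taken from the paper's code): the losses are already
    scaled, 0 < s < len(v), and the (s+1)-th smallest loss exceeds the mean
    of the s smallest ones so that lam is positive.
    """
    if not 0 < s < len(v):
        raise ValueError("s must satisfy 0 < s < len(v)")
    v_sorted = sorted(v)
    head_sum = sum(v_sorted[:s])                 # sum of the s smallest losses
    lam = 0.5 * (s * v_sorted[s] - head_sum)     # upper bound, Eq. (26)
    if lam <= 0:
        raise ValueError("lambda must be positive")
    e = (1.0 + head_sum / (2.0 * lam)) / s       # Eq. (24)
    a = [max(e - vi / (2.0 * lam), 0.0) for vi in v]  # Eq. (22)
    return lam, e, a


lam, e, a = sparse_weights([0.9, 0.1, 0.5, 0.3], s=2)
# Up to floating-point rounding, only the two smallest losses receive weight.
```

With this choice of $\lambda$, the weights sum to one and the instances with the smallest losses receive the largest weights, which is the selection behaviour the sparsity constraint is meant to produce.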

After the parameter $\lambda$ is determined, the optimization of $\bm{a}$ is a convex quadratic programming problem, which we solve with the interior point method. The main steps of the optimization of our proposed model are shown in Algorithm 4.2.

5. Experiment

To verify the effectiveness of our proposed method, we compare it with four other representative methods on eight public datasets. The convergence behavior of the optimization algorithm is also confirmed experimentally. In addition, we analyze the influence of the parameters on classifier performance and the running time of the five methods.

5.1 Experiment settings

We conduct experiments on eight public datasets, which cover multiple application fields such as images, text and biology. Details of the datasets, including the numbers of positive and negative bags, the feature dimension and the average bag size, are shown in Table 1. In the last three datasets, rec.sport.hockey, sci.electronics and sci.med, the number of positive instances in the positive bags is very small. Existing PU-MIL methods skip the selection of positive instances, so the features of positive bags are easily erased by the large number of negative instances during the mapping process. Therefore, these methods may not perform well on these three datasets.

Table 1
Main information on the eight public datasets

Name	# of P Bags	# of N Bags	Features	Avg # of Insts
Tiger	100	100	230	6.10
Mutagenesis_Atoms	125	63	10	8.61
Mutagenesis_Bonds	125	63	16	21.25
Corel_hd	100	100	9	3.85
Sival_ab	60	60	30	31.68
rec.sport.hockey	50	50	200	19.82
sci.electronics	47	53	200	31.92
sci.med	50	50	200	30.45

Table 2

Average accuracy and standard deviation (in brackets) of the five methods on the eight datasets. Each method is evaluated on each dataset under four class-prior probabilities $\theta_{P}\in\{0.2, 0.4, 0.6, 0.8\}$. The highest score is marked in bold

Dataset	$\theta_{P}$	Accuracy
		MILBLE	PU-SKC	C-kNN	aMILGDM	KI-SVM
Tiger	0.2	0.7523(0.022)	0.6730(0.039)	0.6410(0.026)	0.6140(0.030)	0.6057(0.066)
	0.4	0.7490(0.031)	0.6963(0.046)	0.5900(0.032)	0.5660(0.024)	0.5893(0.064)
	0.6	0.7543(0.028)	0.6828(0.050)	0.5675(0.025)	0.5520(0.021)	0.6765(0.069)
	0.8	0.7338(0.024)	0.5955(0.092)	0.5245(0.010)	0.5320(0.012)	0.5820(0.069)
Mutagenesis_Atoms	0.2	0.6232(0.019)	0.5533(0.046)	0.5335(0.022)	0.5778(0.036)	0.6788(0.085)
	0.4	0.6313(0.022)	0.5648(0.058)	0.5135(0.018)	0.5525(0.028)	0.6368(0.047)
	0.6	0.6170(0.023)	0.5655(0.054)	0.4960(0.011)	0.5263(0.032)	0.5533(0.085)
	0.8	0.6280(0.022)	0.5418(0.040)	0.4995(0.014)	0.5265(0.026)	0.5095(0.050)
Mutagenesis_Bonds	0.2	0.6087(0.037)	0.5565(0.047)	0.5120(0.028)	0.5828(0.039)	0.5035(0.014)
	0.4	0.6237(0.031)	0.5745(0.059)	0.5165(0.033)	0.5603(0.030)	0.5975(0.079)
	0.6	0.6045(0.037)	0.5443(0.043)	0.5065(0.017)	0.5430(0.026)	0.5203(0.061)
	0.8	0.6110(0.030)	0.4972(0.050)	0.4945(0.012)	0.5313(0.027)	0.5438(0.036)
Corel_hd	0.2	0.8420(0.022)	0.8093(0.037)	0.6800(0.032)	0.6523(0.033)	0.8068(0.038)
	0.4	0.8205(0.023)	0.8165(0.038)	0.6005(0.022)	0.5905(0.022)	0.7850(0.052)
	0.6	0.8303(0.027)	0.7493(0.055)	0.5760(0.014)	0.5548(0.021)	0.7715(0.055)
	0.8	0.8225(0.028)	0.7080(0.062)	0.5540(0.028)	0.5423(0.025)	0.7013(0.076)
Sival_ab	0.2	0.7323(0.024)	0.7550(0.035)	0.6560(0.034)	0.6323(0.030)	0.7423(0.070)
	0.4	0.7268(0.027)	0.7260(0.057)	0.6035(0.023)	0.5828(0.024)	0.6633(0.042)
	0.6	0.7237(0.027)	0.6813(0.057)	0.5725(0.027)	0.5625(0.028)	0.6755(0.064)
	0.8	0.6610(0.093)	0.6210(0.079)	0.5570(0.030)	0.5495(0.027)	0.6110(0.098)
rec.sport.hockey	0.2	0.6053(0.024)	0.5753(0.038)	0.4980(0.008)	0.5045(0.008)	0.5813(0.023)
	0.4	0.6103(0.034)	0.5183(0.057)	0.5310(0.031)	0.5040(0.008)	0.5525(0.025)
	0.6	0.6580(0.031)	0.5045(0.044)	0.4985(0.018)	0.5033(0.010)	0.5370(0.025)
	0.8	0.6618(0.025)	0.4512(0.033)	0.5015(0.008)	0.5015(0.007)	0.5775(0.042)
sci.electronics	0.2	0.6343(0.044)	0.5450(0.036)	0.5040(0.020)	0.5077(0.008)	0.6028(0.025)
	0.4	0.6510(0.043)	0.4865(0.045)	0.4930(0.018)	0.5037(0.008)	0.5855(0.032)
	0.6	0.7635(0.028)	0.4880(0.044)	0.4955(0.022)	0.5020(0.008)	0.5937(0.031)
	0.8	0.7763(0.029)	0.5278(0.039)	0.5045(0.019)	0.5025(0.005)	0.5742(0.062)
sci.med	0.2	0.6505(0.025)	0.5680(0.022)	0.5090(0.016)	0.5085(0.013)	0.5963(0.032)
	0.4	0.5940(0.027)	0.5245(0.038)	0.5060(0.021)	0.5030(0.009)	0.5798(0.024)
	0.6	0.6345(0.031)	0.5225(0.042)	0.5015(0.004)	0.5013(0.005)	0.5290(0.031)
	0.8	0.6730(0.033)	0.4998(0.033)	0.5020(0.004)	0.4975(0.012)	0.5633(0.030)

Table 1 shows that the existing public MIL datasets are too small to provide sufficient samples for experiments in the PU scenario. Similar to [3], we randomly select bags from the original dataset and duplicate them with Gaussian noise of mean zero and variance 0.01. In this way, we increase the sizes of the positive and unlabeled training sets to 20 and 180 bags, respectively. An additional 100 positive bags and 100 negative bags constitute the test set. We then conduct comparative experiments under four different compositions of the unlabeled bags: the ratio of positive bags among the unlabeled bags is set to {0.2, 0.4, 0.6, 0.8}. The larger the ratio, the more positive bags the unlabeled set contains. The experiments under each ratio are repeated 20 times. To evaluate performance, we compute the mean and standard deviation of each method under three classification metrics: accuracy, area under the curve (AUC) and F-measure.
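
The construction of the training sets can be sketched as follows. This is a minimal illustration, not the authors' script: the function names `augment_bag` and `build_pu_train_set` are hypothetical, bags are assumed to be lists of feature vectors, and sampling with replacement is an assumption, since the text only states that bags are randomly selected and duplicated with noise. A variance of 0.01 corresponds to a standard deviation of 0.1.

```python
import random


def augment_bag(bag, rng, variance=0.01):
    """Copy a bag (list of feature vectors) and add zero-mean Gaussian noise."""
    sigma = variance ** 0.5
    return [[x + rng.gauss(0.0, sigma) for x in instance] for instance in bag]


def build_pu_train_set(pos_bags, neg_bags, theta_p, n_pos=20, n_unl=180, seed=0):
    """Build labeled-positive and unlabeled bag sets with class prior theta_p.

    Assumption: source bags are drawn with replacement; the paper does not
    specify the sampling scheme.
    """
    rng = random.Random(seed)
    n_unl_pos = round(theta_p * n_unl)  # positive bags hidden in the unlabeled set
    labeled_pos = [augment_bag(rng.choice(pos_bags), rng) for _ in range(n_pos)]
    unlabeled = [augment_bag(rng.choice(pos_bags), rng) for _ in range(n_unl_pos)]
    unlabeled += [augment_bag(rng.choice(neg_bags), rng)
                  for _ in range(n_unl - n_unl_pos)]
    rng.shuffle(unlabeled)  # hide which unlabeled bags are positive
    return labeled_pos, unlabeled


# Toy usage with two-dimensional instances (made-up values).
toy_pos = [[[1.0, 1.0], [0.0, 0.2]], [[0.9, 1.1]]]
toy_neg = [[[0.0, 0.0], [0.1, -0.1]]]
labeled, unlabeled = build_pu_train_set(toy_pos, toy_neg, theta_p=0.2)
```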

5.2 Results

5.2.1 Classification performance

In this section, we investigate the performance of our method under four class-prior probabilities $\theta_{P}\in\{0.2, 0.4, 0.6, 0.8\}$. To verify the effectiveness of our proposed MILBLE, we compare its performance with four existing state-of-the-art methods on eight datasets. The comparison methods are KI-SVM [22], aMILGDM [27], C-kNN [26] and PU-SKC [3]. The first three are designed for fully labeled data with both positive and negative labels, while the last one is designed for the PU-MIL problem and trains the classifier via a concise statistical mapping and a PU empirical risk loss. The average accuracy and standard deviation over 20 independent trials for the different methods and class-prior probabilities are presented in Table 2. To make the experimental results more convincing, we also provide the AUC and F-measure results in Tables 3 and 4, respectively.
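
For reference, accuracy and F-measure and their summary over repeated trials can be computed as in the sketch below. The function names are hypothetical, labels are assumed to be 1 for positive and 0 for negative bags, and the use of the sample standard deviation is an assumption, because the paper does not say which estimator it reports. AUC is omitted to keep the example short.

```python
from statistics import mean, stdev


def accuracy(y_true, y_pred):
    """Fraction of bags whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def summarize(trial_scores):
    """Mean and sample standard deviation over independent trials."""
    return mean(trial_scores), stdev(trial_scores)


acc = accuracy([1, 1, 0, 0], [1, 0, 0, 0])
f1 = f_measure([1, 1, 0, 0], [1, 0, 0, 0])
avg, sd = summarize([0.7, 0.8, 0.9])  # made-up scores from three trials
```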

Table 3
AUC (area under the curve) of the five methods on the eight datasets. Each method is evaluated on each dataset under four class-prior probabilities $\theta_{P}\in\{0.2, 0.4, 0.6, 0.8\}$. Standard deviations are given in brackets. The highest score among the five methods is marked in bold

Dataset	$\theta_{P}$	Area under curve
		MILBLE	PU-SKC	C-kNN	aMILGDM	KI-SVM
Tiger	0.2	0.5472(0.040)	0.4077(0.075)	0.2906(0.047)	0.2314(0.059)	0.2404(0.145)
	0.4	0.5547(0.049)	0.4840(0.063)	0.1922(0.058)	0.1428(0.043)	0.2961(0.120)
	0.6	0.5466(0.046)	0.4595(0.069)	0.1516(0.052)	0.1124(0.038)	0.4002(0.137)
	0.8	0.5151(0.051)	0.2997(0.131)	0.0833(0.020)	0.0821(0.024)	0.2422(0.131)
Mutagenesis_Atoms	0.2	0.3704(0.029)	0.1856(0.012)	0.0916(0.051)	0.2015(0.059)	0.4534(0.101)
	0.4	0.3739(0.032)	0.2656(0.111)	0.0921(0.037)	0.1533(0.045)	0.3756(0.083)
	0.6	0.3304(0.029)	0.3021(0.065)	0.0626(0.019)	0.1222(0.045)	0.2152(0.190)
	0.8	0.3927(0.029)	0.2203(0.068)	0.0347(0.016)	0.1160(0.032)	0.0807(0.058)
Mutagenesis_Bonds	0.2	0.3128(0.082)	0.2548(0.075)	0.1307(0.050)	0.2009(0.071)	0.0487(0.058)
	0.4	0.3468(0.045)	0.3283(0.067)	0.1134(0.052)	0.1633(0.043)	0.2761(0.186)
	0.6	0.3072(0.083)	0.2909(0.048)	0.0536(0.026)	0.1254(0.040)	0.0698(0.132)
	0.8	0.3309(0.036)	0.1587(0.097)	0.0414(0.022)	0.1194(0.028)	0.1544(0.033)
Corel_hd	0.2	0.7088(0.037)	0.6413(0.069)	0.3632(0.064)	0.3091(0.065)	0.6246(0.073)
	0.4	0.6676(0.043)	0.6666(0.063)	0.2066(0.042)	0.1869(0.046)	0.6091(0.086)
	0.6	0.6851(0.045)	0.5564(0.087)	0.1619(0.029)	0.1200(0.042)	0.5910(0.087)
	0.8	0.6720(0.050)	0.4755(0.106)	0.1195(0.052)	0.1061(0.040)	0.4910(0.106)
Sival_ab	0.2	0.5346(0.035)	0.5577(0.060)	0.3196(0.065)	0.2690(0.058)	0.5430(0.107)
	0.4	0.5123(0.045)	0.5280(0.083)	0.2164(0.040)	0.1707(0.040)	0.3738(0.064)
	0.6	0.5208(0.041)	0.4573(0.077)	0.1640(0.059)	0.1307(0.054)	0.4360(0.100)
	0.8	0.3778(0.211)	0.3810(0.107)	0.1483(0.044)	0.1164(0.052)	0.3614(0.118)
rec.sport.hockey	0.2	0.3136(0.038)	0.2824(0.047)	0.0008(0.009)	0.0149(0.013)	0.2632(0.062)
	0.4	0.3228(0.047)	0.2673(0.060)	0.1895(0.067)	0.0139(0.014)	0.2039(0.073)
	0.6	0.4087(0.044)	0.2414(0.050)	0.1048(0.056)	0.0143(0.015)	0.1836(0.072)
	0.8	0.4365(0.033)	0.0932(0.043)	0.0156(0.021)	0.0094(0.010)	0.2340(0.099)
sci.electronics	0.2	0.3631(0.067)	0.1779(0.038)	0.0692(0.033)	0.0202(0.022)	0.2397(0.055)
	0.4	0.3914(0.067)	0.2357(0.044)	0.0481(0.032)	0.0129(0.015)	0.1907(0.076)
	0.6	0.5802(0.042)	0.2370(0.044)	0.0611(0.031)	0.0070(0.013)	0.2388(0.048)
	0.8	0.5912(0.051)	0.2754(0.042)	0.1235(0.085)	0.0065(0.009)	0.2062(0.152)
sci.med	0.2	0.3799(0.045)	0.2563(0.030)	0.0704(0.027)	0.0234(0.021)	0.2673(0.058)
	0.4	0.3017(0.041)	0.2718(0.039)	0.0949(0.059)	0.0194(0.022)	0.1645(0.051)
	0.6	0.3702(0.042)	0.2646(0.045)	0.0189(0.038)	0.0064(0.011)	0.1191(0.066)
	0.8	0.4524(0.045)	0.2407(0.037)	0.0079(0.010)	0.0153(0.018)	0.1734(0.064)

Table 4

F-measure of the five methods on the eight datasets. Each method is evaluated on each dataset under four class-prior probabilities $\theta_{P}\in\{0.2, 0.4, 0.6, 0.8\}$. Standard deviations are given in brackets. The highest score among the five methods is marked in bold

Dataset	$\theta_{P}$	F-measure
		MILBLE	PU-SKC	C-kNN	aMILGDM	KI-SVM
Tiger	0.2	0.7808(0.019)	0.5822(0.080)	0.4486(0.057)	0.3724(0.077)	0.3710(0.185)
	0.4	0.7658(0.029)	0.6868(0.053)	0.3191(0.081)	0.2476(0.065)	0.4680(0.169)
	0.6	0.7855(0.025)	0.7053(0.044)	0.2603(0.078)	0.2002(0.060)	0.5639(0.146)
	0.8	0.7624(0.026)	0.6769(0.061)	0.1536(0.035)	0.1511(0.041)	0.3950(0.180)
Mutagenesis_Atoms	0.2	0.5652(0.034)	0.3056(0.170)	0.1646(0.084)	0.3328(0.080)	0.6896(0.093)
	0.4	0.5635(0.035)	0.4553(0.173)	0.1677(0.064)	0.2643(0.066)	0.5620(0.094)
	0.6	0.5060(0.032)	0.6152(0.005)	0.1180(0.035)	0.2164(0.070)	0.5220(0.219)
	0.8	0.6359(0.025)	0.6373(0.030)	0.0668(0.030)	0.2076(0.052)	0.1467(0.100)
Mutagenesis_Bonds	0.2	0.5130(0.053)	0.4378(0.120)	0.2317(0.077)	0.3295(0.103)	0.6521(0.034)
	0.4	0.5252(0.052)	0.5492(0.061)	0.2018(0.087)	0.2795(0.063)	0.5294(0.213)
	0.6	0.5068(0.060)	0.5700(0.047)	0.1010(0.045)	0.2211(0.063)	0.6415(0.052)
	0.8	0.5124(0.042)	0.5974(0.065)	0.0791(0.040)	0.2133(0.045)	0.2693(0.050)
Corel_hd	0.2	0.8434(0.020)	0.7824(0.051)	0.5300(0.068)	0.4688(0.074)	0.8338(0.029)
	0.4	0.8299(0.020)	0.8122(0.043)	0.3407(0.056)	0.3126(0.064)	0.7630(0.064)
	0.6	0.8386(0.029)	0.7666(0.050)	0.2778(0.043)	0.2120(0.066)	0.7567(0.070)
	0.8	0.8249(0.033)	0.7461(0.047)	0.2101(0.083)	0.1897(0.066)	0.6921(0.083)
Sival_ab	0.2	0.7207(0.025)	0.7238(0.052)	0.4812(0.073)	0.4209(0.072)	0.7419(0.066)
	0.4	0.6882(0.040)	0.7228(0.060)	0.3544(0.055)	0.2898(0.059)	0.5442(0.069)
	0.6	0.7133(0.027)	0.7080(0.048)	0.2783(0.087)	0.2275(0.082)	0.6267(0.111)
	0.8	0.5235(0.280)	0.6528(0.067)	0.2564(0.067)	0.2050(0.083)	0.5738(0.112)
rec.sport.hockey	0.2	0.4869(0.044)	0.4532(0.055)	0.0155(0.015)	0.0290(0.025)	0.4244(0.090)
	0.4	0.4970(0.051)	0.4977(0.073)	0.3243(0.104)	0.0271(0.027)	0.4188(0.178)
	0.6	0.5941(0.043)	0.5517(0.040)	0.1922(0.099)	0.0279(0.029)	0.4039(0.195)
	0.8	0.6731(0.027)	0.5877(0.026)	0.0300(0.037)	0.0185(0.019)	0.3810(0.140)
sci.electronics	0.2	0.5409(0.071)	0.3055(0.056)	0.1291(0.058)	0.0387(0.042)	0.3852(0.073)
	0.4	0.5711(0.069)	0.4669(0.047)	0.0908(0.056)	0.0250(0.029)	0.3163(0.111)
	0.6	0.7507(0.032)	0.5044(0.043)	0.1146(0.056)	0.0136(0.024)	0.3870(0.064)
	0.8	0.7971(0.022)	0.5521(0.035)	0.2222(0.149)	0.0127(0.018)	0.6085(0.080)
sci.med	0.2	0.5587(0.050)	0.4182(0.039)	0.1310(0.048)	0.0449(0.039)	0.4274(0.077)
	0.4	0.4741(0.049)	0.5482(0.045)	0.1720(0.102)	0.0373(0.041)	0.2795(0.075)
	0.6	0.5532(0.042)	0.5584(0.040)	0.0353(0.069)	0.0126(0.022)	0.2684(0.173)
	0.8	0.6736(0.042)	0.5416(0.032)	0.0156(0.020)	0.0297(0.034)	0.2970(0.098)

As the accuracy, AUC and F-measure results in Tables 2–4 show, MILBLE outperforms the other representative methods in most cases, which verifies the effectiveness of our proposed method. Moreover, MILBLE maintains good classification performance across the different class-prior probabilities, while the performance of the other four comparison methods generally decreases as the class-prior probability $\theta_{P}$ increases. The reason is that traditional multi-instance methods can only treat unlabeled samples as negative samples. A larger class-prior probability means more noise in these “negative samples”, which degrades the classifier. The loss function adopted by our method on unlabeled samples takes the class-prior probability as a parameter during training, so it retains good classification performance even when the unlabeled samples contain many positive samples. This reflects the applicability of our proposed algorithm over a wide range of class-prior probabilities.
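
The effect of a prior-aware loss can be illustrated with the standard unbiased PU risk estimator from the PU learning literature, which is swapped in here for illustration; it is not the MILBLE objective of Eq. (3.2), and the function names and the logistic loss are assumptions. The estimator corrects the "unlabeled as negative" term by subtracting the contribution of the positives hidden in the unlabeled set, weighted by $\theta_{P}$.

```python
import math


def logistic_loss(margin):
    """Numerically stable log(1 + exp(-margin))."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_loss(margins):
    margins = list(margins)
    return sum(logistic_loss(m) for m in margins) / len(margins)


def pu_risk(scores_pos, scores_unl, theta_p):
    """Unbiased PU risk: theta*R_P^+ + R_U^- - theta*R_P^-.

    scores_* are classifier outputs g(x); theta_p is the class prior.
    """
    pos_as_pos = mean_loss(scores_pos)               # positives labeled +1
    pos_as_neg = mean_loss(-z for z in scores_pos)   # positives labeled -1
    unl_as_neg = mean_loss(-z for z in scores_unl)   # unlabeled labeled -1
    return theta_p * pos_as_pos + unl_as_neg - theta_p * pos_as_neg


# Toy check: the unlabeled set is half positive (scores 2, 3) and half
# negative (scores -2, -3), so theta_p = 0.5.
risk = pu_risk([2.0, 3.0], [2.0, 3.0, -2.0, -3.0], theta_p=0.5)
fully_labeled_risk = (logistic_loss(2.0) + logistic_loss(3.0)) / 2
naive_risk = 0.5 * mean_loss([2.0, 3.0]) + mean_loss([-2.0, -3.0, 2.0, 3.0])
```

In this toy case the prior-corrected risk coincides with the risk computed from the true labels, whereas treating all unlabeled bags as negative penalizes the correctly scored hidden positives.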

In addition, the performance improvements of MILBLE on the three datasets rec.sport.hockey, sci.electronics and sci.med are particularly prominent. This is because the proportion of positive instances in the positive bags of these three datasets is lower than in the other datasets, and, unlike the other methods, the bi-level embedding strategy adopted in MILBLE retains and extracts the characteristics of the true positive instances. Therefore, even when the number of key positive instances is very small, MILBLE ensures that the effective information of the positive bags is not erased. This is why MILBLE still achieves good classification performance when the proportion of positive instances is low.

5.2.2 Convergence analysis

In the optimization section, we detailed how to minimize the proposed objective function. We alternately update the coefficient term $\bm{\omega}$ of the linear classifier and the weight term $\bm{a}$ used to select true positive instances. The coefficient term $\bm{\omega}$ is solved by gradient descent. In this section, we illustrate the behavior of the optimization algorithm by observing the trend of the objective function value during optimization. The three datasets used for this experiment are Corel_hd, Sival_ab and Tiger. The objective values over the iterations are shown in Figs 1 and 2.

Figure 1.

Convergence curves of the alternating iterative solution method. The datasets used to verify the behavior of our optimization method are Corel_hd, Sival_ab and Tiger. The curves show the experimental results on the three datasets under a class-prior probability of 0.2.

Figure 2.

It can be seen from Figs 1 and 2 that, under the step sizes we set, the objective function decreases gradually and converges within dozens of steps. Together with the classification results, this indicates that our optimization algorithm efficiently finds a good solution of the model.
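
The alternating scheme and its stopping rule can be sketched on a toy problem. The code below is an assumption-laden illustration: the function name `alternate_until_stable`, the relative tolerance `tol` and the two-variable quadratic are hypothetical, and the exact block updates stand in for the gradient step on $\bm{\omega}$ and the quadratic program on $\bm{a}$.

```python
def alternate_until_stable(objective, update_first, update_second, x, y,
                           tol=1e-10, max_iter=100):
    """Alternate two block updates until the objective value stops changing."""
    history = [objective(x, y)]
    for _ in range(max_iter):
        x = update_first(x, y)    # e.g. fix a, update omega
        y = update_second(x, y)   # e.g. fix omega, update a
        history.append(objective(x, y))
        if abs(history[-2] - history[-1]) <= tol * max(1.0, abs(history[-2])):
            break
    return x, y, history


# Toy convex objective with a coupling term (assumption, for illustration).
def toy(x, y):
    return (x - 1.0) ** 2 + (y + 2.0) ** 2 + x * y


x_opt, y_opt, values = alternate_until_stable(
    toy,
    lambda x, y: 1.0 - y / 2.0,    # exact minimizer over x for fixed y
    lambda x, y: -2.0 - x / 2.0,   # exact minimizer over y for fixed x
    x=0.0, y=0.0,
)
```

Recording the objective value after every outer iteration, as `values` does here, is also what is needed to draw convergence curves of the kind described for Figs 1 and 2.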

5.2.3 Parameter analysis

In the formulation proposed in Eq. (3.2), the parameters $\alpha$, $\gamma$ and $\lambda$ trade off the different parts of the loss function. Among them, $\alpha$ and $\gamma$ are preset by cross validation, while, unlike the other two, $\lambda$ is determined by the required sparsity of the weight vector $\bm{a}$. In this section, we probe the influence of $\alpha$ and $\gamma$ on classifier performance. Experiments are conducted with different parameter combinations on three datasets under class-prior probabilities of $\{0.6,0.8\}$, and both parameters take values in $\{10^{-3},10^{-2},10^{-1},10^{0},10^{1},10^{2},10^{3}\}$. Figures 3 and 4 show the accuracy of our method under the various parameter combinations.
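
The parameter sweep described above amounts to an exhaustive search over a $7\times 7$ grid. A minimal sketch follows; `grid_search` and the callable `evaluate` (which would train and validate a classifier for one $(\alpha,\gamma)$ pair) are hypothetical names, and the toy scoring function is made up so that the example runs without data.

```python
import math
from itertools import product

# Candidate values 10^-3, ..., 10^3 for both alpha and gamma.
GRID = [10.0 ** k for k in range(-3, 4)]


def grid_search(evaluate, alphas=GRID, gammas=GRID):
    """Return (best_score, alpha, gamma) over all parameter combinations."""
    best = None
    for alpha, gamma in product(alphas, gammas):
        score = evaluate(alpha, gamma)  # e.g. validation accuracy
        if best is None or score > best[0]:
            best = (score, alpha, gamma)
    return best


# Toy stand-in for validation accuracy, peaked at alpha = 0.1, gamma = 10.
def toy_score(alpha, gamma):
    return -(math.log10(alpha) + 1.0) ** 2 - (math.log10(gamma) - 1.0) ** 2


best_score, best_alpha, best_gamma = grid_search(toy_score)
```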

Figure 3.

Classifier performance under different parameter combinations of $\alpha$ and $\gamma$ . The data sets used to conduct the experiment are Corel_hd, sci.electronics and Sival_ab. The results shown in the figure are obtained from experiments under the class-prior probability of 0.6.

Figure 4.

Figures 3 and 4 show that these two parameters have a substantial impact on the performance of our proposed method. Specifically, the accuracy results indicate that, on these datasets, the best classifier performance mostly appears near $\alpha=0.1$ and $\gamma=10$, which provides guidance for training the classifier. Therefore, when choosing parameters for other datasets, we give priority to values near this combination.

5.2.4 Running time

In this section, we compare the training time of our method with that of the four existing methods, PU-SKC [3], C-kNN [26], aMILGDM [27] and KI-SVM [22], on the eight datasets. The training time of each method is measured in the same computing environment: an Intel Xeon CPU E3-1245 v3 at 3.40 GHz with 32 GB of memory. Each reported training time is the mean of five experiments. The class-prior probability of each dataset used in this test is 0.6, and each training set contains 200 bags (20 positive bags and 180 unlabeled bags). The CPU time is reported in Table 5.
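
A measurement of this kind can be sketched with the Python standard library. The helper name `mean_cpu_time` and the callable `train_fn` are hypothetical; `time.process_time` reports CPU time of the current process, which is one reasonable reading of "CPU time" here, although the paper does not state which timer was used.

```python
import time


def mean_cpu_time(train_fn, repeats=5):
    """Average CPU seconds taken by train_fn over several runs."""
    durations = []
    for _ in range(repeats):
        start = time.process_time()
        train_fn()
        durations.append(time.process_time() - start)
    return sum(durations) / len(durations)


# Toy usage: a small computation stands in for training a classifier.
elapsed = mean_cpu_time(lambda: sum(i * i for i in range(10000)))
```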

Table 5
Running time (in seconds) of the five methods on the eight datasets under a class-prior probability of 0.6. Each result is the mean of five experiments

Dataset	MILBLE	PU-SKC	C-kNN	aMILGDM	KI-SVM
Tiger	4.4695	2.0785	1008.2608	77.7743	1.2870
Mutagenesis_Atoms	1.5600	0.5068	345.7675	10.9743	1.2520
Mutagenesis_Bonds	4.5823	2.2895	2098.3558	58.1728	4.3525
Corel_hd	1.3108	0.8223	92.3560	3.5060	1.0453
Sival_ab	5.5225	4.1650	4490.4898	131.9568	8.6348
rec.sport.hockey	14.1065	0.7178	3248.4523	96.8993	7.5503
sci.electronics	20.6665	0.9593	7576.7265	201.3223	14.0595
sci.med	21.8670	0.9515	7351.4545	211.0995	15.1475

Table 5 shows that the time cost of PU-SKC is lower than that of MILBLE, and that the costs of KI-SVM and our method are comparable. The other two methods, especially C-kNN, are much slower than MILBLE. Nevertheless, the classification performance of these comparison methods is lower than that of MILBLE. PU-SKC takes less time because it maps each bag to a single instance in the latent feature space, so the number of instances in a bag has no effect on its computation. However, its classification performance is poor when the proportion of positive instances in the positive bags is very small, which can also be seen from Table 2. In summary, MILBLE achieves satisfactory classification performance with an acceptable running time.

6. Conclusion

In this paper, we focused on the multi-instance learning problem in the PU scenario, which is crucial but rarely studied. We proposed a method based on bi-level embedding. This method adopts different embedding strategies for the positive and unlabeled data, which preserves the purity of the information extracted from positive bags while keeping the two kinds of mapped samples consistent. After the mapping step, we trained the classifier by minimizing the weighted classification error loss on the positive training set and the class-prior-based loss on the unlabeled training set. We also described in detail how to optimize the proposed model. The influence of the parameters and the convergence of the optimization algorithm were examined experimentally. Experimental results on public datasets show that our method performs better than the comparison methods in most cases. As an emerging field that has not been widely studied but has great practical value, PU-MIL still leaves much room for research and exploration. Therefore, in future work, we will continue to study the PU-MIL problem from the perspective of other latent feature space mappings and cost-sensitive loss functions.

Acknowledgments

The authors want to thank the associate editor and reviewers for helpful comments and suggestions. This work was supported by the NSF of China under Grants No. 61922087, 61906201 and 62006238, NSF of Hunan Province under Grant No. 2020JJ5669, and the NSF for Distinguished Young Scholars of Hunan Province under Grant No. 2019JJ20020.

References

1. Amores, Multiple instance classification: Review, taxonomy and comparative study, Artificial Intelligence 201 (2013), 81–105.
2. Andrews, Tsochantaridis and Hofmann, Support vector machines for multiple-instance learning, In Advances in Neural Information Processing Systems, MIT Press, 2002, pp. 561–568.
3. Bao, Sakai, Sato and Sugiyama, Convex formulation of multiple instance learning from positive and unlabeled bags, Neural Networks 105 (2018), 132–141.
4. Carbonneau, Cheplygina, Granger and Gagnon, Multiple instance learning: A survey of problem characteristics and applications, Pattern Recognition 77 (2018), 329–353.
5. Carbonneau, Granger, Raymond A.J. and Gagnon, Robust multiple-instance learning ensembles using random subspace instance selection, Pattern Recognition 58 (2016), 83–99.
6. Chen and Wang J.Z., MILES: multiple-instance learning via embedded instance selection, IEEE Transactions on Pattern Analysis and Machine Intelligence 28(12) (2006), 1931–1947.
7. Chen and Wang J.Z., Image categorization by learning and reasoning with regions, Journal of Machine Learning Research 5 (2004), 913–939.
8. Cheplygina, Tax D.M.J. and Loog, Multiple instance learning with bag dissimilarities, Pattern Recognition 48(1) (2015), 264–275.
9. Dietterich T.G., Lathrop R.H. and Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence 89(1-2) (1997), 31–71.
10. du Plessis M.C., Niu and Sugiyama, Analysis of learning from positive and unlabeled data, In Conference on Neural Information Processing Systems, 2014, pp. 703–711.
11. du Plessis M.C., Niu and Sugiyama, Convex formulation for learning from positive and unlabeled data, In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR Workshop and Conference Proceedings, JMLR.org, 2015, pp. 1386–1394.
12. Elkan and Noto, Learning classifiers from only positive and unlabeled data, In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ACM, 2008, pp. 213–220.
13. Robles-Kelly and Zhou, MILIS: multiple instance learning with instance selection, IEEE Transactions on Pattern Analysis and Machine Intelligence 33(5) (2011), 958–977.
14. Gärtner, Flach P.A., Kowalczyk and Smola A.J., Multi-instance kernels, In Proceedings of the Nineteenth International Conference, Morgan Kaufmann, 2002, pp. 179–186.
15. Gong, Liu, Yang and Tao, Large-margin label-calibrated support vector machines for positive and unlabeled learning, IEEE Transactions on Neural Networks and Learning Systems 30(11) (2019), 3471–3483.
16. Han, Zuo, Liu and Peng, Building text classifiers using positive, unlabeled and ‘outdated’ examples, Concurrency and Computation: Practice and Experience 28(13) (2016), 3691–3706.
17. Yang, Zhen, Tan and Jing, Building high-performance classifiers using positive and unlabeled examples for text classification, In 9th International Symposium on Neural Networks, volume 7368 of Lecture Notes in Computer Science, Springer, 2012, pp. 187–195.
18. Kiryo, Niu, du Plessis M.C. and Sugiyama, Positive-unlabeled learning with non-negative risk estimator, In Conference on Neural Information Processing Systems, 2017, pp. 1675–1685.
19. Leistner, Saffari and Bischof, Miforests: Multiple-instance learning with randomized trees, In Proceedings of 11th European Conference on Computer Vision, volume 6316 of Lecture Notes in Computer Science, Springer, 2010, pp. 29–42.
20. and Sminchisescu, Convex multiple-instance learning by estimating likelihood ratio, In Advances in Neural Information Processing Systems, Curran Associates, Inc., 2010, pp. 1360–1368.
21. and Yeung, MILD: multiple-instance learning via disambiguation, IEEE Transactions on Knowledge and Data Engineering 22(1) (2010), 76–89.
22. Kwok J.T., Tsang I.W. and Zhou, A convex method for locating regions of interest with multi-instance learning, In European Conference of Machine Learning, volume 5782 of Lecture Notes in Computer Science, Springer, 2009, pp. 15–30.
23. Liu, Dai, Lee W.S. and Yu P.S., Building text classifiers using positive and unlabeled examples, In Proceedings of the 3rd IEEE International Conference on Data Mining, IEEE Computer Society, 2003, pp. 179–188.
24. Liu, Lee W.S. and Li, Partially supervised classification of text documents, In Proceedings of the Nineteenth International Conference on Machine Learning, Morgan Kaufmann, 2002, pp. 387–394.
25. Sakai, du Plessis M.C., Niu and Sugiyama, Semi-supervised classification based on classification from positive and unlabeled data, In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 2998–3006.
26. Wang and Zucker, Solving the multiple-instance problem: A lazy learning approach, In Proceedings of the Seventeenth International Conference on Machine Learning, Morgan Kaufmann, 2000, pp. 1119–1126.
27. Pan, Zhu, Zhang and Wu, Multi-instance learning with discriminative bag mapping, IEEE Transactions on Knowledge and Data Engineering 30(6) (2018), 1065–1080.
28. Xiao, Liu, Hao and Cao, A similarity-based classification framework for multiple-instance learning, IEEE Transactions on Cybernetics 44(4) (2014), 500–515.
29. Xiao, Liu, Yin, Cao, Zhang and Hao, Similarity-based approach for positive and unlabeled learning, In 22nd International Joint Conference on Artificial Intelligence, IJCAI/AAAI, 2011, pp. 1577–1582.
