Grey-based multiple instance learning with multiple bag-representative

Abstract

Multiple instance learning is a variant of supervised learning that handles the classification of collections of instances, which are called bags. Each bag contains a number of instances whose features are extracted. In multiple instance learning, the standard assumption is that a positive bag contains at least one positive instance, whereas a negative bag comprises only negative instances. The complexity of multiple instance learning depends heavily on the number of instances in the training datasets. Since we are usually confronted with a large instance space, it is important to design efficient instance selection techniques that speed up the training process without compromising performance. Firstly, a multiple instance learning model of support vector machine based on grey relational analysis is proposed in this paper, with which the data size can be reduced and the importance of instances in a bag can be preliminarily judged. Secondly, this paper introduces an algorithm with a bag-representative selector that trains the support vector machine based on bag-level information. Finally, this paper shows how to generalize the algorithm for binary multiple instance learning to multi-class tasks. The experimental study evaluates and compares the performance of our method against 8 state-of-the-art multiple instance methods over 10 datasets, and demonstrates that the proposed approach is competitive with state-of-the-art multiple instance learning methods.

Keywords

Multiple instance learning; support vector machine; grey relational analysis; bag representative; multi-class learning

1. Introduction

The concept of multiple instance learning (MIL) was proposed in the mid-to-late 1990s in the study of drug activity prediction. MIL, which is a variant of supervised learning, has recently been gaining interest because of its applicability to many real-world problems, such as drug activity prediction [11,26], stock market prediction [25], data mining applications [28], image retrieval [37,40], natural scene classification [27], text categorization [1], and image categorization [8]. MIL provides a framework for the classification of collections of instances called bags rather than individual instances. In a typical binary MIL problem, the training data is presented in the form of bags and their associated binary labels. Introduced by Dietterich et al. in the context of drug activity prediction, the standard assumption in MIL is that a positive bag contains at least one positive instance, whereas a negative bag comprises only negative instances, although the size of bags may vary [11]. Because it does not require labels for individual instances, MIL is well suited to applications in which instance-level label information is unavailable [36]. In many cases, however, negative instances may dominate a positive bag. As a result, classifier performance degrades when traditional supervised classification methods are applied directly to MIL problems. Hence, specialized methods should be devised to exploit the structural information of MIL. Many methods have been proposed to solve MIL problems, and they can be separated into the following three categories:

Instance-based modeling, which finds the most positive and least negative instances from bags to derive MIL models;

Bag-based modeling, which directly builds classification models at the bag level;

Hybrid approaches, which use both instances and bags to confine the learning space to build classification models.

One of the major complexities associated with MIL is the ambiguity of the relationship between the bag label and the instances within the bag [38]. Since the genuine positive instance inside each positive bag is unknown, the main challenge of MIL is to leverage the bag labels and constraints to derive an accurate classification model. Many previous algorithms assigned the label of a bag directly to the instances it contains [11,25,28], that is, all instances in a positive bag are labeled positive, which simplifies the training process of the classifier. However, this unrealistic assumption also makes the predicted label of an unknown bag deviate from its real label to some extent. Selecting a method that is robust for MIL can be difficult when little is known about the nature of the data, especially considering the unknown distribution of the instances within bags [6,24].

Therefore, the challenging problem for MIL still is how to efficiently and effectively prune the unrelated instances and preserve the valid instances. In this paper, we propose the grey-based multiple instance learning with multiple bag-representative (MIMBR) method. The results from experiments indicate that the proposed algorithm is efficient in training and has the following characteristics:

Broad adaptability: It provides a learning framework for transforming multiple instance (MI) problems into supervised learning problems. In our experimental study on benchmark datasets, it shows highly competitive classification accuracy.

Low complexity: It uses grey relational analysis to integrate similar instances in each bag, greatly reducing the amount of data to be processed in the experiment. Although selecting at least one representative instance for each bag makes the computational complexity slightly higher than that of previous methods, the classification accuracy is effectively improved. This is mainly due to the use of a chosen subset of instances for classifier training rather than the use of all instances. In order to ensure convergence, the classifier is then optimized by intertwining instance selection with classifier learning in an alternating optimization framework.

Prediction capability: In some MI problems, the classification of instances is at least as important as the classification of bags. The proposed approach supports predicting instance labels. Moreover, it also has a key feature of being able to identify instances that have significant impact on modeling.

The remainder of the paper is organized as follows: Section 2 reviews the relevant MIL literature and examines the limitations of the existing approaches. In Section 3, the proposed algorithm MIMBR is described in detail and then extended to multi-class settings. The efficiency and effectiveness of the proposed approach are illustrated in Section 4. The conclusions of the paper and further work on this topic are discussed in Section 5.

2. MIL approaches

One of the earliest algorithms for learning from MI problems was developed by Dietterich et al. for drug activity prediction [11]. Their algorithms, the axis-parallel rectangle methods, search for appropriate axis-parallel rectangles by shrinking or expanding them along the attribute values, so that the rectangle contains the maximum number of positive bags and the minimum number of negative bags [3,4,23]. De Raedt then demonstrated the connection between MIL and inductive logic programming [10].

After the work of Dietterich et al., many researchers began to design practical MIL algorithms. In 1998, Maron and Lozano-Pérez proposed the diversity density (DD) algorithm, called MIDD [26]. It has had a great impact on later research, and a lot of work is directly based on this algorithm. For example, Zhang and Goldman put forward the EM-DD algorithm in 2002 by combining the DD algorithm with the EM algorithm [39].

In 2000, Wang and Zucker extended the k nearest neighbor (KNN) algorithm to handle MIL problems [35]. Instead of the usual Euclidean distance, they used the modified Hausdorff distance, which calculates the distance between different bags effectively. On this basis, Bayesian-KNN and Citation-KNN were proposed. However, these extended KNN algorithms need to keep the entire training set in order to calculate distances at test time, so although they require hardly any training time, their storage and test-time costs are very high. Some methods in the literature [30,34] converted multiple instance (MI) datasets into traditional instance-level datasets, such as Simple-MI [12], and Bunescu and Mooney mapped each bag to a maximum-minimum vector [5]. The downside of these methods is that they assume that the instances within a positive bag are all positive, which may not be the case.

Zhou et al. proposed miGraph, which uses graph kernels and constructs implicit graphs by deriving affinity matrices [41]. It assumes that, due to the nature of MI data, the instances are not independent and identically distributed, and that the relations among instances may convey important information.

Chen et al. proposed MILES, which maps each bag into a feature space defined by the instances in all the training bags via an instance similarity measure [7]. On this basis, Fu et al. proposed MILIS, a MIL algorithm based on adaptive instance selection that intertwines the steps of instance selection and classifier learning in an iterative manner [13].

Yang et al. proposed an instance-based support vector machine (SVM), which uses an asymmetric loss function [37]. A false negative instance in a positive bag may not cause an error on the bag label, but a false positive instance in a negative bag may generate a classification error. Andrews et al. formulated MIL as a mixed integer quadratic program in which the integer variables are selector variables that select a positive instance from each positive bag [2]. The main idea of this method, called MI-SVM, is to transform the MI data into single-instance data. Andrews et al. also proposed another mixed-integer instance-level approach, named mi-SVM, which tries to identify the instances within positive bags that are negative and utilizes them in the construction of the negative margin [1]. The main disadvantage of these approaches is that they create an imbalanced class problem that favors the negative class, resulting in a biased classifier.

Veronika et al. proposed to represent each bag by a vector of its dissimilarities to the other bags in the training set and to treat these dissimilarities as a feature representation, called MInD (Multiple Instance Dissimilarity) [33]. Melki et al. proposed a novel SVM multi-instance formulation and presented an algorithm with a bag-representative selector that trains the SVM based on bag-level information, named MIRSVM [28]. The main disadvantage of this approach is that it selects only a single instance from each bag as the representative instance, while each instance is treated as a potential target concept.

3. Grey-based multiple instance learning with multiple bag-representative

3.1. Notation

To describe MIMBR, we first give a formal description of the MIL problem. We are given a training set $B = \{B_{1}^{+}, \dots, B_{m^{+}}^{+}, B_{1}^{-}, \dots, B_{m^{-}}^{-}\}$ of $m = m^{+} + m^{-}$ bags, where $m^{+}$ ( $m^{-}$ ) is the number of positive (negative) bags. We denote the ith positive bag as $B_{i}^{+}$ and the jth instance in that bag as $x_{i j}^{+}$ . The bag $B_{i}^{+}$ consists of $n_{i}^{+}$ instances $x_{i j}^{+}$ , $j = 1, \dots, n_{i}^{+}$ . Similarly, $B_{i}^{-}$ , $x_{i j}^{-}$ , $n_{i}^{-}$ represent a negative bag, the jth instance in the bag, and the number of instances in the bag, respectively. Different bags can have different numbers of instances; hence, $n_{i}^{+}$ and $n_{i}^{-}$ may vary for different bags. When the bag label is ignored, a bag is referred to as $B_{i}$ with instances $x_{i j}$ . Each instance belongs to the feature space $\mathbb{R}^{d}$ . For the sake of convenience, when we line up all instances in all bags together, we re-index these instances as $x^{k}$ , $k = 1, \dots, n$ , where $n = \sum_{i = 1}^{m^{+}} n_{i}^{+} + \sum_{i = 1}^{m^{-}} n_{i}^{-}$ , so that the whole dataset forms a matrix in $\mathbb{R}^{n \times d}$ . A positive bag $B^{+}$ contains at least one positive instance, whereas all instances of a negative bag $B^{-}$ are negative. An intuitive comparison of single-instance learning and multi-instance learning is shown in Fig. 1 and Fig. 2.

Fig. 1. Single instance learning.

Fig. 2. Multiple instance learning.

Each bag $B_{i}$ is associated with a bag label $Y_{i} \in \{- 1, 1\}$ , $i = 1, 2, \dots, m$ , and each instance is associated with an instance label $y_{i, k} \in \{- 1, 1\}$ , where $- 1$ and 1 represent the negative and positive labels, respectively. The relation between the bag label and the instance labels follows the Noisy-OR model [26]: $Y_{i} = \begin{cases} 1, & \exists k : y_{i, k} = 1, \\ - 1, & \forall k : y_{i, k} = - 1 . \end{cases}$
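As a small illustration of the label relation above, the following Python sketch derives a bag label from a list of instance labels. The function name `bag_label` is ours and purely illustrative; in real MIL data the instance labels are usually not observed, so this only restates the assumption and is not part of the training procedure.

```python
def bag_label(instance_labels):
    """Return +1 if at least one instance label is +1, otherwise -1."""
    return 1 if any(y == 1 for y in instance_labels) else -1


# A bag with a single positive instance is positive;
# a bag whose instances are all negative is negative.
positive_bag = bag_label([-1, -1, 1])
negative_bag = bag_label([-1, -1, -1])
```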

3.2. Grey relational analysis

Grey system theory was proposed for uncertain systems with partially known and partially unknown information [21], and can extract valuable information from partial information. In this paper, we use grey relational analysis (GRA) to measure the similarity of any two instances in each bag, which involves the grey relational coefficient (GRC) and the grey relational grade (GRG). GRA offers several advantages over other measures, such as the Minkowski distance. First, the idea of GRA is clear, and it can greatly reduce the loss caused by information asymmetry. Second, GRA gives a normalized measuring function due to its normality, i.e., it measures the similarities or differences among instances in order to analyze the relational structure. Third, GRA gives whole relational orders due to its wholeness over the entire relational space [21].

Fig. 3. Summary of the steps performed by GRA.

As shown in Fig. 3, we assume that the reference sequence is $x_{0} = x_{0} (p)$ , $p = 1, 2, \dots, m$ , and that the comparison sequences are $x_{i} = x_{i} (p)$ , $p = 1, 2, \dots, m$ , $i = 1, 2, \dots, n$ . Since the factors in the system may be measured on different scales, comparing the raw data directly is inconvenient and can lead to incorrect conclusions. The next step is therefore to normalize the data, that is, to reduce it to a canonical form.

In this paper, we use min-max normalization, which is expressed as follows: $\begin{matrix} (1) & x_{i}^{'} (p) = \frac{\max_{\forall q} x_{q} (p) - x_{i} (p)}{\max_{\forall q} x_{q} (p) - \min_{\forall q} x_{q} (p)}, \end{matrix}$ where $x_{i} (p)$ is the pth feature value of instance $x_{i}$ , $p = 1, 2, \dots, m$ , and the maximum and minimum are taken over all instances $x_{q}$ being compared. According to Eq. (1), all input feature values are transformed into values between zero and one. Then, the GRC of two instances is defined as follows: $\begin{matrix} (2) & GRC (x_{0} (p), x_{i} (p)) = \frac{α + ρ β}{| x_{0} (p) - x_{i} (p) | + ρ β}, \end{matrix}$ where $\begin{array}{l} α = \min_{\forall i} \min_{\forall k} | x_{0} (k) - x_{i} (k) |, \\ β = \max_{\forall i} \max_{\forall k} | x_{0} (k) - x_{i} (k) |, \end{array}$ $i = 1, 2, \dots, n$ , $k, p = 1, 2, \dots, m$ , p stands for an attribute, $x_{0} (p)$ denotes the value of attribute p in instance $x_{0}$ , and ρ is the distinguishing coefficient, whose range is $[0, 1]$ . The smaller the value of ρ, the greater the distinguishing ability. No convincing method has so far been suggested for determining the optimal value of ρ. We use $ρ = 0.5$ based on the experimental results from [31]. The higher the value of $GRC (x_{0} (k), x_{i} (k))$ , the greater the similarity between $x_{0} (k)$ and $x_{i} (k)$ . The GRG is then expressed as Eq. (3): $\begin{matrix} (3) & GRG (x_{0}, x_{i}) = \frac{1}{m} \sum_{k = 1}^{m} GRC (x_{0} (k), x_{i} (k)), \end{matrix}$ where the mean over the m attributes is used to calculate the GRG, so that every feature plays the same role. The grey relational grades are then arranged in descending order.
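The computation in Eqs. (1)-(3) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names are ours, the normalization is taken over the list of instances passed in, constant attributes are mapped to 0 to avoid division by zero, and ρ is assumed to be strictly positive.

```python
def min_max_normalize(instances):
    """Eq. (1): rescale every attribute to [0, 1] (constant attributes map to 0)."""
    columns = list(zip(*instances))
    highs = [max(col) for col in columns]
    lows = [min(col) for col in columns]
    return [
        [(hi - v) / (hi - lo) if hi > lo else 0.0
         for v, hi, lo in zip(row, highs, lows)]
        for row in instances
    ]


def grey_relational_grades(reference, others, rho=0.5):
    """Eqs. (2)-(3): GRG between `reference` and each sequence in `others`."""
    diffs = [[abs(r - v) for r, v in zip(reference, row)] for row in others]
    alpha = min(min(row) for row in diffs)  # smallest absolute difference
    beta = max(max(row) for row in diffs)   # largest absolute difference
    if beta == 0.0:                         # every sequence equals the reference
        return [1.0 for _ in others]
    grades = []
    for row in diffs:
        coeffs = [(alpha + rho * beta) / (d + rho * beta) for d in row]
        grades.append(sum(coeffs) / len(coeffs))  # mean over the attributes
    return grades


# Toy bag: the second instance is close to the first, the third is far away.
bag = [[1.0, 2.0], [1.1, 2.1], [5.0, 9.0]]
normalized = min_max_normalize(bag)
grades = grey_relational_grades(normalized[0], normalized[1:])
```

In this toy bag the instance that lies close to the reference receives the higher grade, which is the ordering that the descending sort in the text relies on.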

As described above, GRA can be employed in MIL to measure the similarity between two instances in a bag. Instances whose similarity is greater than a certain threshold are then integrated; in other words, the “center” of these instances is taken as their representative instance.
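The text does not spell out how similar instances are grouped, so the following is only one possible reading, stated under explicit assumptions: instances are grouped greedily around the first unassigned instance, the “center” is taken to be the attribute-wise mean, and the threshold value 0.9 is arbitrary. The similarity is passed in as a function `grades_fn(reference, others)`; the `toy_grades` function below is a stand-in used only to keep the example self-contained, and a GRG implementation would be plugged in instead.

```python
def merge_similar_instances(bag, grades_fn, threshold=0.9):
    """Greedily replace groups of similar instances by their mean (assumed rule)."""
    remaining = [list(x) for x in bag]
    representatives = []
    while remaining:
        reference, others = remaining[0], remaining[1:]
        grades = grades_fn(reference, others) if others else []
        group = [reference] + [x for x, g in zip(others, grades) if g >= threshold]
        remaining = [x for x, g in zip(others, grades) if g < threshold]
        # The "center" of the group is taken here as the attribute-wise mean.
        representatives.append([sum(col) / len(group) for col in zip(*group)])
    return representatives


def toy_grades(reference, others):
    """Stand-in similarity in (0, 1]; not the GRG of Eq. (3)."""
    return [1.0 / (1.0 + sum(abs(r - v) for r, v in zip(reference, x)))
            for x in others]


bag = [[0.0, 0.0], [0.05, 0.0], [1.0, 1.0]]
reduced = merge_similar_instances(bag, toy_grades, threshold=0.9)
```

Here the first two instances are merged into one center and the third is kept as it is, so the bag shrinks from three instances to two.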

3.3. Support vector machine

Most classical learning techniques require knowledge of the data distribution to build accurate models, which is a serious restriction because, in most cases, the distribution is unknown [17]. SVM was first proposed by Cortes and Vapnik in 1995 and was quickly applied in various fields [9,21,22]. It belongs to the learning techniques that have been introduced under the structural risk minimization framework and Vapnik–Chervonenkis theory [15,32]. Besides linear classification, SVM can use the so-called kernel trick, which implicitly maps the input vectors into a high-dimensional feature space, to perform nonlinear classification effectively. The key to SVM is therefore the kernel function. Vector sets in low-dimensional spaces are often difficult to separate, and the solution is to map them to higher-dimensional spaces.

The difficulty with this approach is the increase in computational complexity, which the kernel function neatly avoids. In other words, as long as a proper kernel function is selected, the classification function of the high-dimensional space can be obtained. After the kernel function is determined, two further quantities, namely the slack (relaxation) variables and the penalty coefficient, are introduced to tolerate errors in the known data. In this paper, generalization and linear separability are enhanced by mapping the original input space to a higher dimensional dot-product space using a kernel function, as shown in Eq. (4): $\begin{matrix} (4) & K (x_{i}, x_{j}) = ⟨ ϕ (x_{i}), ϕ (x_{j}) ⟩, \end{matrix}$ where $ϕ (\cdot)$ represents a function mapping from the original feature space to a higher dimensional space. This kernel mapping is helpful when solving the dual SVM problem shown in Eq. (5): $\begin{matrix} (5) & \begin{array}{l} \max_{α} - \frac{1}{2} \sum_{i, j = 1}^{m} α_{i} α_{j} y_{i} y_{j} K (x_{i}, x_{j}) + \sum_{i = 1}^{m} α_{i} \\ \text{s.t.} \sum_{i = 1}^{m} α_{i} y_{i} = 0, \\ 0 ⩽ α_{i} ⩽ \frac{C}{m}, \forall i \in \{1, \dots, m\}, \end{array} \end{matrix}$ where $C \in \mathbb{R}$ is the penalty parameter that controls the trade-off between margin maximization and classification error minimization.

3.4. MIMBR

With the above ingredients, we can now describe the MIMBR framework. We first focus on the binary classification case, where the label value is $y_{i} \in \{- 1, 1\}$ for each bag $i = 1, 2, \dots, m$ , with $y_{i} = 1$ if the ith bag corresponds to the positive class and $y_{i} = - 1$ otherwise. With just a few minor modifications, we will later generalize the framework to deal with multi-class MIL problems.

Like MIRSVM, our method aims to find bag labels and the representative instances for each bag. Note that MIRSVM selects only a single instance from each bag as its representative, and each instance is treated as a potential target concept. However, a bag is labeled positive as soon as one of its instances is labeled positive, so a positive bag may contain more than one positive instance. In this paper, we take into account this point, which was ignored by MIRSVM. Moreover, to achieve performance comparable to MIRSVM with almost the same computational complexity, explicit instance pruning and selection is necessary. Therefore, an effective approach is needed that reduces complexity without relying on clustering or quantization. This is important because clustering and quantization do not take bag-level structure or discriminant information into account and may discard small clusters in the feature space. Hence, we adopt the data preprocessing method described in Section 3.2, which highlights the importance of instance pruning for efficient MIL. Next, the problems of feature learning and instance updating are addressed. Finally, we summarize the algorithm and discuss its properties and computational complexity.

In some MI problems, classification of instances is at least as important as the classification of bags. For example, an object detection algorithm needs not only to identify whether the image contains a certain object, but also to locate the object (or part of the object) from the image if it contains the object [34]. Under the MI formulation, this requires the classification of the bags as well as the instances in a bag that correspond to the object. The classifier $y = sign (\sum_{k \in I} w_{k}^{*} s (x^{k}, B_{i}) + b^{*})$ predicts the label for a bag, where $w^{*}$ and $b^{*}$ are the weight vector and bias term obtained by the classifier respectively. Next, we introduce a way to classify instances based on a bag classifier.

Fig. 4. Summary of the steps performed by MIMBR. (a) The training process of the algorithm: the representatives are first randomly initialized and then continuously updated according to the current hyper-plane. (b) The testing process of the algorithm, in which a label is assigned to an unknown bag.

MIMBR is based on the idea of selecting representative instances from both positive and negative bags, which are used to find an unbiased, optimal separating hyper-plane. It iteratively selects at least one representative from each bag and forms a new hyper-plane based on those representatives until they converge. According to the standard MI hypothesis, only one instance in a bag is required to be positive for the bag to adopt a positive label. Since the distribution of instances in positive bags is unknown, MIMBR gives priority to negative bags in the training process, because their distribution is known, that is, all of their instances have negative labels. Fig. 4 summarizes the steps performed by MIMBR.

The previous method of selecting representative instances in a bag was to index the maximum output value within each bag using the following rule: $s_{I} = \arg \max_{i \in I} (⟨ w, x_{i} ⟩ + b), \forall I \in \{1, 2, \dots, n\},$ where I ranges over the n bags, $i \in I$ ranges over the instances of bag I, $w \in \mathbb{R}^{d}$ is a d-dimensional weight vector, and $b \in \mathbb{R}$ is a bias term. In other words, the most positive instance is chosen from each positive bag and the least negative instance is chosen from each negative bag (the instances with the largest output value based on the current hyper-plane) [36]. Under the standard MI assumption, a positive bag may contain more than one positive instance, so a single instance selected by the above rule may not adequately represent its bag.

This paper addresses this defect. The idea is to divide the training bags into positive and negative parts. For positive bags, all output values are sorted in descending order and instances are then selected from largest to smallest; the termination condition is that at least one instance has been selected from each bag as a representative instance. Similarly, for negative bags, all output values are first sorted in ascending order and instances are then selected from smallest to largest until each bag contains at least one representative instance. The primal MIMBR optimization problem is presented in Eq. (6): $\begin{matrix} (6) & \begin{array}{l} \min_{w, b, ξ} \frac{1}{2} ‖ w ‖^{2} + C \sum_{i = 1}^{n} ξ_{i} \\ \text{s.t.} y_{i} (w \cdot x_{r_{i}} + b) ⩾ 1 - ξ_{i}, i = 1, 2, \dots, n, \\ ξ_{i} ⩾ 0, i = 1, 2, \dots, n, \end{array} \end{matrix}$ where $r_{i}$ is the set of representative indices of bag $B_{i}$ , $x_{r_{i}}$ is an instance representative of bag $B_{i}$ , C is the regularization parameter that controls the influence of the second term of the objective, and the slack variable $ξ_{i}$ allows for optimizing over bag errors.
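The selection rule described above can be sketched as follows. This is a minimal reading of the text, not the authors' code: the function name and the dictionary return format are our own choices, and `scores` is assumed to hold the outputs of the current hyper-plane for the instances of either all positive bags or all negative bags.

```python
def select_representatives(scores, bag_ids, descending=True):
    """Pick instances in score order until every bag owns at least one.

    scores[k]  : output of the current hyper-plane for instance k
    bag_ids[k] : identifier of the bag that contains instance k
    Returns a dict {bag_id: [selected instance indices]}.
    """
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=descending)
    uncovered = set(bag_ids)
    chosen = {bag: [] for bag in uncovered}
    for k in order:
        if not uncovered:          # every bag already has a representative
            break
        chosen[bag_ids[k]].append(k)
        uncovered.discard(bag_ids[k])
    return chosen


# Positive bags: sort in descending order. Bag "A" ends up with two
# representatives because both of its top instances outrank bag "B".
positive = select_representatives([2.0, 1.5, -0.3, 0.4], ["A", "A", "A", "B"])

# Negative bags: sort in ascending order.
negative = select_representatives([-2.0, -0.1, -1.0], ["C", "C", "D"],
                                  descending=False)
```

Because selection stops only when the last bag is covered, bags with several high-ranking instances naturally receive more than one representative, which is the difference from the single-representative rule discussed earlier.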

Note that the variables in the MIMBR formulation are similar to those of the classical SVM, except that each bag is now represented by one or more instances. This is one of the benefits of the MIMBR formulation: by using the bag-representative selector, traditional SVM solvers can be used on the new data representation. Solving the optimization problem given in Eq. (6) with a quadratic programming solver is computationally expensive due to the number of constraints, which scales with the number of bags n, as well as the calculation of the inner products between d-dimensional vectors. Therefore, in order to solve the optimization problem more efficiently, we take Eq. (6) as the primal problem and apply Lagrange duality, obtaining the optimal solution of the primal problem by solving the dual problem.

First, the Lagrange function of the primal problem Eq. (6) is constructed as: $\begin{array}{l} L (w, b, ξ, α, μ) \\ = \frac{1}{2} ‖ w ‖^{2} + C \sum_{i = 1}^{n} ξ_{i} - \sum_{i = 1}^{n} μ_{i} ξ_{i} \\ - \sum_{i = 1}^{n} α_{i} (y_{i} (w \cdot x_{r_{i}} + b) - 1 + ξ_{i}), \end{array}$ where α and μ are the non-negative Lagrange multipliers. Then, by substituting the optimality conditions (from $\nabla_{w, b, ξ} L (w, b, ξ, α, μ) = 0$ ), we obtain the dual MIMBR formulation: $\begin{array}{l} \max_{α} - \frac{1}{2} \sum_{i = 1}^{n} \sum_{j = 1}^{n} α_{i} α_{j} y_{i} y_{j} (x_{r_{i}} \cdot x_{r_{j}}) + \sum_{i = 1}^{n} α_{i} \\ \text{s.t.} \sum_{i = 1}^{n} α_{i} y_{i} = 0, \\ C - α_{i} - μ_{i} = 0, i = 1, 2, \dots, n, \\ α_{i} ⩾ 0, i = 1, 2, \dots, n, \\ μ_{i} ⩾ 0, i = 1, 2, \dots, n . \end{array}$
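For completeness, the optimality conditions used in this step can be written out explicitly. This is the standard soft-margin SVM derivation, stated here with the representative instances $x_{r_{i}}$ of Eq. (6):

```latex
\begin{aligned}
\frac{\partial L}{\partial w} = 0 \;&\Rightarrow\; w = \sum_{i = 1}^{n} α_{i} y_{i} x_{r_{i}}, \\
\frac{\partial L}{\partial b} = 0 \;&\Rightarrow\; \sum_{i = 1}^{n} α_{i} y_{i} = 0, \\
\frac{\partial L}{\partial ξ_{i}} = 0 \;&\Rightarrow\; C - α_{i} - μ_{i} = 0, \quad i = 1, 2, \dots, n .
\end{aligned}
```

Substituting the first condition into the Lagrange function eliminates w, the second removes the term in b, and the third, together with $μ_{i} ⩾ 0$ , is equivalent to the box constraint $0 ⩽ α_{i} ⩽ C$ .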

It can be noted that the dual problem above involves only the inner products between instances. The inner product in the objective function of the dual problem can be replaced by the kernel values $K (x_{r_{i}}, x_{r_{j}})$ , which extends the linear classifier to a nonlinear classifier. MIMBR adopts the Gaussian radial basis function given by Eq. (7): $\begin{matrix} (7) & K (x_{r_{i}}, x_{r_{j}}) = e^{- \frac{{‖ x_{r_{i}} - x_{r_{j}} ‖}^{2}}{2 σ^{2}}}, \end{matrix}$ where σ is the Gaussian shape parameter. If the optimal solution of the dual problem is α, then b can be obtained from any component satisfying $0 < α_{j} < C$ : $b = y_{j} - \sum_{i = 1}^{n} α_{i} y_{i} K (x_{r_{i}}, x_{r_{j}})$ .
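The kernel of Eq. (7), the bias formula and the resulting decision value can be illustrated with a short sketch. The function names are ours, and the toy problem is chosen so that the dual solution is known in closed form: for two representatives with opposite labels, maximizing the dual objective gives $α_{1} = α_{2} = 1 / (1 - K_{01})$ , assuming C is larger than this value.

```python
import math


def rbf_kernel(x, z, sigma=1.0):
    """Eq. (7): Gaussian RBF kernel between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def bias_from_support_vector(alphas, labels, reps, j, sigma=1.0):
    """b = y_j - sum_i alpha_i y_i K(x_{r_i}, x_{r_j}) for some 0 < alpha_j < C."""
    return labels[j] - sum(
        a * y * rbf_kernel(x, reps[j], sigma)
        for a, y, x in zip(alphas, labels, reps)
    )


def decision_value(x, alphas, labels, reps, b, sigma=1.0):
    """f(x) = sum_i alpha_i y_i K(x_{r_i}, x) + b."""
    return sum(a * y * rbf_kernel(r, x, sigma)
               for a, y, r in zip(alphas, labels, reps)) + b


# Two-representative toy problem with a closed-form dual solution.
reps = [[0.0, 0.0], [1.0, 1.0]]
labels = [-1, 1]
k01 = rbf_kernel(reps[0], reps[1])
alphas = [1.0 / (1.0 - k01)] * 2
b = bias_from_support_vector(alphas, labels, reps, j=1)
margins = [decision_value(x, alphas, labels, reps, b) for x in reps]
```

By symmetry the bias is zero here, and both representatives lie exactly on the margin, with decision values of -1 and +1.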

Algorithm 1. MIMBR

Algorithm 1 shows the procedure for training the MIMBR classifier and obtaining the optimal representative instances from each bag. During the training process, the representatives, S, are first initialized by randomly selecting an instance from each bag. Then the hyperplane is obtained by S, and the new optimal representative is found for the current hyperplane according to the rules given below: for positive bags, all output values are sorted in descending order and then selected from top to bottom until each bag can contain at least one instance. Similarly, for negative bags, all output values are first sorted in ascending order and then selected from bottom to top until each bag can contain at least one instance.

At each step, the previous values in S are stored in $S_{old}$ . The training procedure ends when the bag representatives stop changing from one iteration to the next ( $S = S_{old}$ ). During testing, each bag generates an output vector according to the hyper-plane found during training. Then, according to the standard MI (SMI) hypothesis, the bag label is assigned by the sign of the maximum value of the output vector. Furthermore, the main component of MIMBR’s time complexity is the construction of the representative instances from the bags, which can be achieved in $O (n \log n)$ time, where n is the number of bags.
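The alternating procedure described above (initialize S, fit a hyper-plane, re-select representatives, stop when $S = S_{old}$ ) can be sketched in plain Python. Several parts of this sketch are assumptions made to keep it short and self-contained: `centroid_trainer` is a simple stand-in that places a linear hyper-plane halfway between the class means and is not an SVM solver, the `initial`, `max_iter` and `seed` parameters are ours, and all function names are illustrative. In practice the trainer would solve Eq. (6) or its kernelized dual.

```python
import random


def output(w, b, x):
    """Linear output <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def centroid_trainer(X, y):
    """Stand-in for the SVM solver: hyper-plane halfway between class means."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == -1]
    mu_p = [sum(col) / len(pos) for col in zip(*pos)]
    mu_n = [sum(col) / len(neg) for col in zip(*neg)]
    w = [p - q for p, q in zip(mu_p, mu_n)]
    b = -sum(wi * (p + q) / 2.0 for wi, p, q in zip(w, mu_p, mu_n))
    return w, b


def cover_bags(scored, descending):
    """Take instances in score order until every bag has a representative."""
    uncovered = {i for _, i, _ in scored}
    chosen = set()
    for _, i, k in sorted(scored, key=lambda t: t[0], reverse=descending):
        if not uncovered:
            break
        chosen.add((i, k))
        uncovered.discard(i)
    return chosen


def train(bags, labels, trainer=centroid_trainer, initial=None,
          max_iter=50, seed=0):
    rng = random.Random(seed)
    # S is a set of (bag index, instance index) pairs; random start by default.
    S = set(initial) if initial else {
        (i, rng.randrange(len(bag))) for i, bag in enumerate(bags)}
    for _ in range(max_iter):
        S_old = set(S)
        pairs = sorted(S)
        w, b = trainer([bags[i][k] for i, k in pairs],
                       [labels[i] for i, _ in pairs])
        scored = [(output(w, b, x), i, k)
                  for i, bag in enumerate(bags) for k, x in enumerate(bag)]
        pos = [t for t in scored if labels[t[1]] == 1]
        neg = [t for t in scored if labels[t[1]] == -1]
        S = cover_bags(pos, True) | cover_bags(neg, False)
        if S == S_old:  # representatives stopped changing
            break
    return w, b, S


def predict_bag(w, b, bag):
    """Bag label = sign of the largest instance output."""
    return 1 if max(output(w, b, x) for x in bag) > 0 else -1


bags = [[[0.1, 0.0], [3.0, 3.1]], [[2.9, 3.0], [0.0, 0.2]],
        [[0.0, 0.1], [0.2, 0.0]], [[0.1, 0.1], [-0.1, 0.0]]]
labels = [1, 1, -1, -1]
w, b, S = train(bags, labels, initial={(0, 1), (1, 0), (2, 0), (3, 0)})
predictions = [predict_bag(w, b, bag) for bag in bags]
```

With the explicit starting point used here, the representatives stabilize after two iterations and all four training bags are labeled correctly. With an unlucky random start the same loop can settle on background instances of the positive bags, which is one reason the quality of the initial representatives matters in alternating schemes of this kind.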

Fig. 5. Classification decision process of DAG-SVMs for four classes of samples. The process starts from the root node at the top (which involves two classes) and, according to the classification result at each node, continues with the left or right node in the next layer until it reaches a leaf at the bottom; the category represented by that leaf is the category of the unknown sample.

One improvement of MIMBR is that, before the classifier is trained, the relationships between instances in the training datasets are processed by grey relational analysis. On the one hand, the data volume can be reduced; on the other hand, the importance of instances in a bag can be preliminarily judged. Furthermore, unlike MIRSVM, which restricts each bag to one representative instance, MIMBR selects one or more representative instances for each bag based on the hyper-plane output from each iteration. Besides, MIMBR is initialized by randomly selecting an instance from each bag. During initialization, no wrapper techniques are used, no noise is added, and no assumptions are made about the instances.

3.5. Extension to the multi-class setting

In this section, we show how to generalize the algorithm for binary MIL presented in the previous sections to multi-class MIL tasks. Here, we adopt a traditional approach that uses a classifier fusion framework to decompose the multi-class problem into multiple binary problems and combines the results of the binary classifiers to carry out multi-class classification.

For multi-class MIL problems, each class is a positive one relative to the other classes. Therefore, for each bag in any class, there exists at least one true positive instance, which carries the discriminant information for the class under consideration, and the rest of the instances may be considered as “background” ones. For some applications, there might be a number of bags that contain background instances alone, akin to the negative class in the binary case.

SVM was originally proposed for two-class classification problems, and how to extend it to multi-class classification is an important topic in SVM research. Directed acyclic graph SVMs (DAG-SVMs) are derived from the decision directed acyclic graph proposed by Platt, and were proposed to address the misclassification and rejection problems of “1-v-1” SVMs [29].

Table 1
MIL datasets: number of attributes, numbers of positive, negative and total bags, number of instances, and average number of instances per bag

Datasets	Attributes	Positive bags	Negative bags	Total bags	Instances	Average size
Musk1	166	47	45	92	476	5.17
Musk2	166	39	62	101	6598	65.33
Eastwest	24	10	10	20	213	10.65
Westeast	24	10	10	20	213	10.65
Elephant	230	100	100	200	1391	6.96
Fox	230	100	100	200	1320	6.60
Tiger	230	100	100	200	1220	6.10
Mutagenesis-atoms	10	125	63	188	1618	8.61
Mutagenesis-bonds	16	125	63	188	3995	21.25
Mutagenesis-chains	24	125	63	188	5349	28.45

In the training stage, the algorithm is the same as the “1-v-1” method: one classifier is constructed for every pair of classes, giving $l (l - 1) / 2$ classifiers. In the classification stage, however, this method forms a rooted binary directed acyclic graph with $l (l - 1) / 2$ nodes and l leaves, where l is the number of classes. Each node is a classifier and is connected to two nodes (or leaves) in the next layer. Fig. 5 is a schematic diagram of the classification decision process for four classes of samples using DAG-SVMs. It has been proved theoretically that the accumulated error of the DAG method has a fixed upper bound.
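The decision process through the graph can be sketched as a list of candidate classes from which each node eliminates one. This is a generic illustration of the DAG evaluation, under the assumption that a trained binary classifier is available for every pair of classes; the toy one-dimensional classifiers below are hypothetical placeholders for the pairwise MIMBR classifiers.

```python
from itertools import combinations


def dag_predict(x, classes, pairwise):
    """Walk the decision DAG: each node eliminates one candidate class.

    pairwise[(a, b)](x) must return either a or b, where a precedes b
    in `classes`.
    """
    candidates = list(classes)
    while len(candidates) > 1:
        first, last = candidates[0], candidates[-1]
        winner = pairwise[(first, last)](x)
        if winner == first:
            candidates.pop()       # `last` is eliminated
        else:
            candidates.pop(0)      # `first` is eliminated
    return candidates[0]


def make_toy_classifier(a, b):
    """Toy 1-D binary classifier: picks the class whose index is closer to x."""
    return lambda x: a if abs(x - a) <= abs(x - b) else b


classes = [0, 1, 2, 3]
# l(l - 1)/2 = 6 pairwise classifiers for l = 4 classes.
pairwise = {(a, b): make_toy_classifier(a, b)
            for a, b in combinations(classes, 2)}
predicted = dag_predict(2.2, classes, pairwise)
```

Although six classifiers are trained for four classes, only $l - 1 = 3$ of them are evaluated for any given sample, one per layer of the graph.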

4. Simulation analysis and discussions

4.1. Datasets and experimental setup

To validate the effectiveness of the proposed MIMBR, this section describes the experimental setup and compares MIMBR with 9 other state-of-the-art methods on 10 different benchmark datasets. First, the experimental setup is described and the state-of-the-art methods are listed. Then, the results for each metric are presented and analyzed. The main purpose of the experiment is to compare our contribution with other MI support vector machines, state-of-the-art multi-instance learners, and ensemble methods.

To investigate how the proposed algorithm behaves in various scenarios, Table 1 summarizes the 10 datasets used throughout the experiments, showing the number of attributes, bags, and instances. The datasets were obtained from the KEEL dataset repository [16].¹

¹ https://sci2s.ugr.es/keel/category.php?cat=mul

The first column of Table 1 gives the dataset name. The remaining columns give the number of attributes, the number of positive bags, the number of negative bags, the total number of bags, the number of instances, and the average bag size. The datasets vary widely in dimensionality and size: the number of attributes ranges from 10 to 230, and the number of instances ranges from 213 to 6598. The proposed approach is therefore tested on both high-dimensional and low-dimensional datasets.

Among the datasets, MUSK1 and MUSK2 are benchmark datasets for MIL. Both datasets are publicly available from the UCI Machine Learning Repository [14]. The datasets consist of descriptions of molecules. Specifically, a bag represents a molecule, and the instances in a bag represent low-energy shapes of the molecule. To represent the low-energy shapes as attribute-value pairs, each molecule is first fixed in a standard position and orientation, and 162 rays are then emitted approximately uniformly from the origin. The length of each ray from the origin to the molecular surface is treated as an attribute. Four further attributes representing fixed oxygen positions are added, so that each instance in a bag is described by 166 numerical attributes.

The task of the three datasets Elephant, Fox, and Tiger is to determine whether an image contains an elephant, a fox, or a tiger, respectively. In these three datasets, each image is treated as a bag, and each region of interest in the image is taken as an instance.

The experiments were conducted on a machine with a 3.40 GHz Intel i7-6700 CPU and 8 GB of memory. MIMBR was implemented in Python, while the reference algorithms are available in the Java implementation of WEKA, with the exception of miGraph, which was made available by Zhou et al. and tested in MATLAB. To evaluate the performance of the proposed method, we compare it with several representative MIL algorithms, such as miGraph, MIDD, MIOptimalBall [41], MISVM [5] and MIRSVM.

To evaluate the performance of the model objectively and to optimize the hyper-parameters, 10-fold cross-validation is used. This procedure ensures that the model is not optimistically biased towards the complete dataset and that all algorithms are evaluated on the same data in each fold. Hyper-parameter optimization within cross-validation consists of finding the best penalty parameter C and the best shape parameter σ of the Gaussian radial basis function kernel. Three parameters need to be specified for MIMBR: σ (the Gaussian shape parameter), C (the error-tolerance parameter) and $maxIter$ (the maximum number of iterations).

We fixed $maxIter = 50$ and $toler = 0.005$ so that the same maximum number of iterations is used for the positive class and the negative class. The parameters σ and C were selected by 10-fold cross-validation on the training set. We chose σ from the candidate set $\{0.1, 0.2, 0.5, 1, 2, 5\}$, and C from 0.01 to 1000 in multiplicative steps of ten. The same parameter ranges are also used for the state-of-the-art SVM methods, which keeps the experimental environment controlled and ensures a fair evaluation of the multiple instance SVM algorithms. The parameters of the other reference algorithms used throughout the experiment are those specified by their authors.
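The parameter search described here can be sketched as a plain grid search over bag-level folds. The sketch below is an illustration under stated assumptions, not the authors' code: the helper names `k_fold_indices` and `grid_search` and the `evaluate` callback are hypothetical, and the folds are built by simple interleaving of bag indices, without the shuffling or stratification that a real experiment would likely use.

```python
from itertools import product

# Candidate values quoted in the text.
SIGMAS = [0.1, 0.2, 0.5, 1, 2, 5]
CS = [0.01, 0.1, 1, 10, 100, 1000]   # 0.01 to 1000, factor of ten

grid = list(product(SIGMAS, CS))


def k_fold_indices(n_bags, k=10):
    """Yield (train, test) lists of bag indices for k roughly equal folds."""
    folds = [list(range(i, n_bags, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def grid_search(n_bags, evaluate, k=10):
    """Return the (sigma, C) pair with the best mean score over k folds.

    `evaluate(sigma, C, train, test)` is a hypothetical callback that trains
    a classifier on the training bags and returns a score on the test bags.
    """
    best_score, best_params = float("-inf"), None
    for sigma, c in grid:
        scores = [evaluate(sigma, c, tr, te)
                  for tr, te in k_fold_indices(n_bags, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_score, best_params = mean, (sigma, c)
    return best_params
```

With 6 candidate values for each parameter, the grid contains 36 combinations, each evaluated on 10 folds. Splitting is done over bags rather than instances, so that all instances of a bag stay in the same fold.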

4.2. Evaluation index

The classification performance was measured using five metrics: Accuracy (8), Precision (9), Recall (10), F0.5-measure (11) and the area under the ROC curve (AUC) (12). Because accuracy alone can be misleading when the classes are unbalanced, Precision and Recall are also reported. The F0.5-measure and AUC are used as complementary measures in order to evaluate the algorithms comprehensively. Precision and recall are not necessarily related to each other, as can be seen from (9) and (10); on large-scale datasets, however, the two indicators tend to constrain each other. The F-score, shown in (11), is therefore also used as a performance measure; it is based on the weighted harmonic mean of precision and recall. The AUC metric highlights the trade-off between the true positive rate, or recall, and the false positive rate, as shown in (12). The numbers of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) samples were first collected for each of the classifiers, and the metrics were then computed on the n bags of the test data using (8)–(12):

$\begin{array}{ll} (8) & \mathrm{Accuracy} = \dfrac{TP + TN}{TP + FP + TN + FN}, \\ (9) & \mathrm{Precision} = \dfrac{TP}{TP + FP}, \\ (10) & \mathrm{Recall} = \dfrac{TP}{TP + FN}, \\ (11) & F_{\beta}\text{-score} = \dfrac{(1 + \beta^{2}) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^{2} \cdot \mathrm{Precision} + \mathrm{Recall}}, \\ (12) & \mathrm{AUC} = \dfrac{1 + \frac{TP}{TP + FN} - \frac{FP}{FP + TN}}{2}. \end{array}$
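As a worked illustration, the five metrics can be computed from the four confusion-matrix counts as in the short Python sketch below. The function name `bag_metrics` and the example counts are hypothetical and are not taken from the experiments. The F-score uses the standard $F_\beta$ definition, $(1 + \beta^2) P R / (\beta^2 P + R)$, and returning 0.0 for a ratio with a zero denominator is an assumption of this sketch.

```python
def bag_metrics(tp, tn, fp, fn, beta=0.5):
    """Compute accuracy, precision, recall, F-beta and single-point AUC."""

    def ratio(num, den):
        # Assumption: undefined ratios (zero denominator) are reported as 0.0.
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f_beta = ratio((1 + beta ** 2) * precision * recall,
                   beta ** 2 * precision + recall)
    # AUC from one operating point: (1 + TPR - FPR) / 2.
    auc = (1 + recall - ratio(fp, fp + tn)) / 2
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
        "auc": auc,
    }


# Hypothetical counts for 100 test bags.
scores = bag_metrics(tp=40, tn=30, fp=10, fn=20)
```

With these counts, precision (0.8) is higher than recall (about 0.667), and the F0.5-score (about 0.769) lies closer to precision because β = 0.5 weights precision more heavily.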

4.3. Accuracy

Table 2 shows the accuracy of the 10 algorithms on the 10 multi-instance datasets, as well as their averages. The results show that the bag-based and ensemble learning methods perform better than the instance-based and wrapper methods. In particular, MIMBR achieves the strictly highest accuracy on 6 of the 10 datasets and ties with MIRSVM on Musk1, while MIRSVM achieves the best results on three datasets (Musk1, Mutagenesis-bonds and Mutagenesis-chains). Note that MIMBR performs better than MISVM on every dataset except Elephant, suggesting that selecting at least one representative instance for each bag by sorting can improve classifier performance. The instance-level classifiers and wrapper methods (such as MIBoost, MIWrapper, and SimpleMI) perform worst. This behavior emphasizes the importance of not assuming a positive bag distribution in advance. MIMBR obtains higher accuracy because, once at least one representative has been selected for each bag, the instances of the bags in the dataset are clustered and the difference between positive and negative bags becomes clear, which makes correct classification easier.

Table 2
Accuracy on benchmark tasks

Datasets	MIMBR	SimpleMI	MIOptimalBall	miGraph	MIRSVM	MIBoost	MIWrapper	MIDD	MISVM	MInD
Musk1	0.9022	0.5109	0.6848	0.8152	0.9022	0.5109	0.5109	0.8054	0.7609	0.8364
Musk2	0.8564	0.6238	0.7327	0.7426	0.8218	0.6139	0.6139	0.7228	0.7129	0.7506
Eastwest	0.8500	0.6500	0.7250	0.7500	0.7500	0.5000	0.5000	0.6125	0.5625	0.7340
Westeast	0.8500	0.6500	0.4000	0.7500	0.7500	0.5000	0.5000	0.4500	0.4125	0.7500
Elephant	0.7250	0.5450	0.5000	0.8300	0.8100	0.5000	0.5000	0.7700	0.8000	0.8055
Fox	0.6300	0.5050	0.5150	0.5880	0.6150	0.5000	0.5000	0.5800	0.4750	0.5950
Tiger	0.8122	0.5350	0.5850	0.7950	0.7750	0.5000	0.5000	0.7100	0.7550	0.8052
Mutagenesis-atoms	0.8564	0.6649	0.6447	0.7606	0.7819	0.6649	0.6649	0.7074	0.6649	0.7941
Mutagenesis-bonds	0.8144	0.6649	0.6809	0.7872	0.8152	0.6649	0.6649	0.7713	0.6649	0.7849
Mutagenesis-chains	0.8019	0.6649	0.6702	0.7926	0.8411	0.6649	0.6649	0.7764	0.6649	0.8007
Average	0.8099	0.6014	0.6138	0.7611	0.7863	0.5620	0.5620	0.6906	0.6475	0.7656

Table 3

Precision on benchmark tasks

Datasets	MIMBR	SimpleMI	MIOptimalBall	miGraph	MIRSVM	MIBoost	MIWrapper	MIDD	MISVM	MInD
Musk1	0.8800	0.6809	0.6596	0.7872	0.9362	1.0000	1.0000	0.8936	0.8108	0.8878
Musk2	1.0000	0.8974	0.5385	0.7692	0.7179	0.6154	0.5897	0.7576	0.7436	0.8066
Eastwest	0.7000	0.5000	0.8000	0.7000	0.7000	0.5000	0.5000	0.6000	0.5000	0.7250
Westeast	0.7000	0.5000	0.3000	0.7273	0.7000	0.5000	0.5000	0.5000	0.3800	0.6974
Elephant	0.9400	0.5000	0.5700	0.8700	0.8100	0.5000	0.5000	0.7900	0.7700	0.8540
Fox	0.8500	0.5000	0.3200	0.7200	0.6040	0.5000	0.5000	0.5800	0.4800	0.7545
Tiger	0.8000	0.5000	0.5000	0.7300	0.7365	0.5000	0.5000	0.6900	0.7800	0.7778
Mutagenesis-atoms	0.9844	1.0000	0.5440	0.7920	0.7840	1.0000	1.0000	0.6160	1.0000	0.7384
Mutagenesis-bonds	0.7656	1.0000	0.5360	0.8240	0.8468	1.0000	1.0000	0.7520	1.0000	0.7808
Mutagenesis-chains	1.0000	1.0000	0.5520	0.8160	0.8560	1.0000	1.0000	0.7040	1.0000	0.7152
Average	0.8620	0.7078	0.5320	0.7736	0.7691	0.7115	0.7090	0.6783	0.7464	0.7738

Table 4

Recall on benchmark tasks

Datasets	MIMBR	SimpleMI	MIOptimalBall	miGraph	MIRSVM	MIBoost	MIWrapper	MIDD	MISVM	MInD
Musk1	0.9362	0.5109	0.7045	0.8409	0.8800	0.0000	0.0000	0.8077	0.8444	0.8303
Musk2	0.7222	1.0000	0.7000	0.6383	0.8000	0.5000	0.5000	0.6122	0.6041	0.7252
Eastwest	1.0000	0.7142	0.6667	0.7778	0.7778	0.5000	0.5000	0.6667	0.4545	0.7655
Westeast	1.0000	0.7142	0.3750	0.7778	0.6923	0.5000	0.5000	0.4545	0.2500	0.7885
Elephant	0.6573	0.5000	0.7674	0.7768	0.7459	0.5000	0.5000	0.7596	0.8819	0.6334
Fox	0.5862	0.5051	0.5246	0.5556	0.5950	0.5000	0.5000	0.5800	0.4752	0.5957
Tiger	0.8163	0.5376	0.6024	0.8210	0.8432	0.5000	0.5000	0.7188	0.7429	0.7648
Mutagenesis-atoms	0.8344	0.5714	0.5313	0.8462	0.8574	0.0000	0.0000	0.6638	0.0000	0.8474
Mutagenesis-bonds	0.8673	0.5714	0.5234	0.8720	0.8607	0.0000	0.0000	0.6528	0.0000	0.8264
Mutagenesis-chains	0.7353	0.5714	0.5391	0.8947	0.8560	0.0000	0.0000	0.6567	0.0000	0.7059
Average	0.8155	0.6196	0.5934	0.7801	0.7908	0.3000	0.3000	0.6573	0.4253	0.7483

Table 5

AUC on benchmark tasks

Datasets	MIMBR	SimpleMI	MIOptimalBall	miGraph	MIRSVM	MIBoost	MIWrapper	MIDD	MISVM	MInD
Musk1	0.9324	0.5055	0.6856	0.8163	0.9043	0.5000	0.5000	0.8414	0.8124	0.9040
Musk2	0.8611	0.9697	0.7232	0.7358	0.8167	0.5000	0.5000	0.7196	0.7077	0.9544
Eastwest	0.8846	0.6648	0.7084	0.7525	0.7525	0.5000	0.5000	0.6515	0.4495	0.7213
Westeast	0.8846	0.6648	0.4792	0.7525	0.7099	0.5000	0.5000	0.4495	0.1875	0.7228
Elephant	0.7760	0.5206	0.6337	0.8499	0.7991	0.5000	0.5000	0.7704	0.8011	0.8269
Fox	0.6567	0.5050	0.5177	0.6182	0.6203	0.5000	0.5000	0.5000	0.4750	0.6155
Tiger	0.8101	0.6798	0.6187	0.8000	0.7977	0.5000	0.5000	0.5000	0.7557	0.7981
Mutagenesis-atoms	0.8902	0.7857	0.2907	0.7400	0.7358	0.5000	0.5000	0.4986	0.5000	0.5943
Mutagenesis-bonds	0.7572	0.7857	0.2784	0.7789	0.7788	0.5000	0.5000	0.4741	0.5000	0.6853
Mutagenesis-chains	0.8677	0.7857	0.3029	0.7933	0.7851	0.5000	0.5000	0.4858	0.5000	0.7061
Average	0.8320	0.6867	0.5239	0.7637	0.7700	0.5000	0.5000	0.5891	0.5689	0.7529

Table 6

F0.5-score on benchmark tasks

Datasets	MIMBR	SimpleMI	MIOptimalBall	miGraph	MIRSVM	MIBoost	MIWrapper	MIDD	MISVM	MInD
Musk1	0.8907	0.6384	0.6681	0.7974	0.9244	0.0000	0.0000	0.8750	0.8173	0.8133
Musk2	0.9286	0.9162	0.5645	0.7389	0.7329	0.6667	0.5693	0.7232	0.7108	0.8593
Eastwest	0.7447	0.5319	0.7692	0.7143	0.7143	0.5000	0.5000	0.6123	0.4902	0.7059
Westeast	0.7447	0.5319	0.3125	0.7369	0.6984	0.5000	0.5000	0.4902	0.2885	0.7266
Elephant	0.8655	0.5000	0.6009	0.8496	0.7963	0.5000	0.5000	0.7837	0.7793	0.6973
Fox	0.7798	0.5010	0.3470	0.6798	0.6022	0.5000	0.5000	0.5800	0.4790	0.6528
Tiger	0.8032	0.5071	0.5176	0.7434	0.7556	0.5000	0.5000	0.6956	0.7723	0.7649
Mutagenesis-atoms	0.9502	0.8696	0.5414	0.8023	0.7977	0.0000	0.0000	0.6250	0.0000	0.8958
Mutagenesis-bonds	0.7840	0.8696	0.5334	0.8332	0.8495	0.0000	0.0000	0.7298	0.0000	0.7332
Mutagenesis-chains	0.9328	0.8696	0.5494	0.8306	0.8760	0.0000	0.0000	0.6940	0.0000	0.8595
Average	0.8424	0.6735	0.5404	0.7726	0.7747	0.3167	0.3069	0.6809	0.4337	0.7709

4.4. Precision and recall

Precision and recall can conflict with each other, so the two measures must be evaluated together in order to observe their joint behavior, since both are used to measure relevance. The precision and recall of each algorithm are given in Table 3 and Table 4. The results of MIWrapper and SimpleMI vary greatly across datasets, which shows that they are unstable classifiers and makes them unsuitable for practical applications. It is also worth noting that on the mutagenesis datasets, where the number of positive bags is greater than the number of negative bags, MISVM, MIBoost and MIWrapper predict that all bags are negative. Furthermore, although miGraph and MInD achieve unbiased results on these datasets, MIMBR is superior to miGraph in average precision and recall, resulting in a better trade-off. MIMBR achieves the highest precision on the Fox, Tiger and Elephant datasets, while its results are less satisfactory on Mutagenesis-atoms and Mutagenesis-bonds.

4.5. AUC and F0.5-score

Table 5 shows the AUC results obtained by the algorithms. AUC objectively reflects the overall ability to predict both positive and negative bags and reduces the influence of class skew; the results again emphasize the better performance of bag-based methods. MIMBR achieves the best AUC score on 7 of the 10 datasets, while MIBoost and MIWrapper obtain the worst results: their AUC values are 0.5 on every dataset, indicating that they behave as random predictors. These results are consistent with the accuracy results, because the instance-based and wrapper approaches perform worse than the bag-based and ensemble learning approaches. The average AUC of the bag-level classifiers ranges from about 0.75 to 0.85, which is consistent with their precision and recall.

In MIL, more attention is paid to the classification of positive bags, that is, to how many positive bags are correctly labeled after classifier training. We therefore set $β = 0.5$ in the F-score, which weights precision more heavily than recall. Table 6 presents the F0.5-score results. As can be seen, although SimpleMI performs well on the mutagenesis datasets (and obtains the best score on Mutagenesis-bonds), its results on most of the other datasets are poor. It is also clear that MIDD, MIWrapper, MIBoost and MISVM do not have an obvious advantage over the other classifiers. From Table 6, we can also see that the predictions of MIMBR are rather stable at high positive ratios, and its performance remains satisfactory. This suggests that MIMBR can work well when as many representative instances as possible are selected.

Fig. 6.

Holm test results of MIMBR versus the other 9 methods.

Fig. 7.

Wilcoxon test results of MIMBR versus the other 9 methods.

4.6. Overall comparison

The presented classifiers vary significantly in their model assumptions and complexity. Figure 6 and Fig. 7 show the results of the Holm test and the Wilcoxon test, respectively. The MInD and MIRSVM approaches also have reasonably good (but typically worse than miGraph and MIMBR) and consistent performance. On the one hand, the Holm test indicates that MIMBR performs significantly better than SimpleMI, MIOptimalBall, miGraph, MIRSVM, MIDD and MInD. For MISVM, MIBoost and MIWrapper, the significance value is 0.06, which is above the 0.05 level needed to reject the null hypothesis; this is because these three methods predict that all bags are negative. On the other hand, the Wilcoxon test shows that the performance of MIMBR is significantly improved compared to the other algorithms, except for recall against MIRSVM; on the other measures, the results of MIMBR are better than those of MIRSVM. Both non-parametric tests therefore indicate that MIMBR is a competitive classifier.
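For readers unfamiliar with the Holm procedure, the following Python sketch shows the standard step-down rule applied to a set of pairwise comparisons against a control method. The function name `holm_reject`, the method labels, and the p-values are hypothetical and do not reproduce the values behind Fig. 6. The p-values are sorted in ascending order, the i-th smallest (counting from zero) is compared with α / (m − i), and testing stops at the first hypothesis that cannot be rejected.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure.

    `p_values` maps a comparison name to its unadjusted p-value.
    Returns a dict mapping each name to True if its null hypothesis
    is rejected at family-wise level `alpha`.
    """
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ordered)
    rejected = {}
    still_rejecting = True
    for i, (name, p) in enumerate(ordered):
        # The i-th smallest p-value is compared with alpha / (m - i).
        if still_rejecting and p <= alpha / (m - i):
            rejected[name] = True
        else:
            # Once one hypothesis is retained, all larger p-values are too.
            still_rejecting = False
            rejected[name] = False
    return rejected


# Hypothetical p-values for three comparisons against a control method.
decisions = holm_reject({"method_a": 0.001, "method_b": 0.02, "method_c": 0.06})
```

In this example the first two comparisons are rejected (0.001 ≤ 0.05/3 and 0.02 ≤ 0.05/2), while the third is retained because 0.06 exceeds 0.05.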

5. Conclusion

In this paper, we presented MIMBR, an efficient SVM-based MIL approach to classification. The method is general, and its key idea is the combination of instance selection and classifier learning. An iterative two-step optimization framework for updating the classifier is also developed, and this optimization strategy is guaranteed to converge. Initial instance selection is performed by random selection from each bag. Then, for the training set, positive and negative bags are processed separately according to the classification hyperplane obtained by SVM training, and at least one representative instance is selected from each bag for the next round of classifier training. This is more efficient and effective than traditional EM-based instance selection strategies. The proposed approach is also more amenable to improvement than the others. As mentioned in this article, the approach can additionally be applied to a wide range of classification settings.

Experimental results show that MIMBR has better learning performance than state-of-the-art multi-instance SVMs, traditional MI learners and ensemble methods. The experimental results are evaluated according to different performance indicators, which further validates the advantages of bag-level classifiers, such as miGraph and MIRSVM, in predicting the labels of unknown bags in MIL, while instance-level learners perform poorly or behave as strongly biased and unstable classifiers.

MIMBR could be further improved by investigating the following approaches. The first research direction is to experiment with different optimization solvers. The Iterative Single Data Algorithm [15,18] is a more recent and efficient approach for solving the L2-SVM problem, shown to be faster than the sequential minimal optimization algorithm and equal in terms of accuracy [19,20]. It iteratively updates the objective function by working on one data point at a time, using coordinate descent to find the optimal objective function value. Another area to explore is testing and observing MIMBR performance on highly imbalanced and high-dimensional MI datasets.

In some applications, the training data are given in a one-class setting. For example, the domain-based protein interaction inference problem typically has only positive training samples, namely, pairs of proteins that interact [19]. The interaction between two proteins is produced by the interaction of their domains. Consequently, if a pair of proteins is treated as a bag and the domain-domain pairs are treated as the instances in the bag, an MIL representation is obtained. MIL with one-class training data is therefore an interesting direction to explore.

Acknowledgements

This work was supported by National Natural Science Foundation of China (No. 61573266).

References

1. Andrews, Tsochantaridis and Hofmann, Support vector machines for multiple-instance learning, Neural Information Processing Systems 15 (2003), 561–568.
2. Andrews, Tsochantaridis and Hofmann, Support vector machines for multiple-instance learning, in: Proceedings of the 15th International Conference on Neural Information Processing Systems, 2004, pp. 577–584.
3. Auer, On learning from multi-instance examples: Empirical evaluation of a theoretical approach, in: Proceedings of the 14th International Conference of Machine Learning, 1997, pp. 1–29.
4. Blum and Kalai, A note on learning from multiple-instance examples, Machine Learning 30 (1998), 23–29. doi:10.1023/A:1007402410823.
5. Bunescu and Mooney, Multiple instance learning for sparse positive bags, in: Proceedings of the Annual International Conference on Machine Learning, 2007, pp. 105–112.
6. Cano, Zafra and Ventura, Speeding up multiple instance learning classification rules on GPUs, Knowledge and Information Systems 127 (2015), 127–145. doi:10.1007/s10115-014-0752-0.
7. Chen, Bi and Wang, MILES: Multiple-instance learning via embedded instance selection, IEEE Transactions Pattern Analysis and Machine Intelligence 28 (2006), 1931–1947. doi:10.1109/TPAMI.2006.248.
8. Chen and Wang, Image categorization by learning and reasoning with regions, Machine Learning Research 5 (2004), 913–939.
9. Chen, Li, Wei, Xu and Shi, Multiple-kernel SVM based multiple-task oriented data mining system for gene expression data analysis, Expert Systems With Applications 38 (2011), 12151–12159. doi:10.1016/j.eswa.2011.03.025.
10. De Raedt, Attribute-value learning versus inductive logic programming: The missing links, Artificial Intelligence 1446 (1998), 1–8.
11. T.G. Dietterich, R.H. Lathrop and Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence 89 (1997), 31–71. doi:10.1016/S0004-3702(96)00034-3.
12. Dong, A comparison of multi-instance learning algorithms, Master of Science thesis, Univ. Waikato, 2006, pp. 21–94.
13. Fu, Robles-Kelly and Zhou, MILIS: Multiple instance learning with instance selection, IEEE Transactions Pattern Analysis and Machine Intelligence 33 (2011), 958–977. doi:10.1109/TPAMI.2010.155.
14. C.C. Huang and H.M. Lee, A grey-based nearest neighbor approach for missing attribute value prediction, Applied Intelligence 20 (2004), 239–252. doi:10.1023/B:APIN.0000021416.41043.0f.
15. Huang, Kecman and Kopriva, Kernel Based Algorithms for Mining Huge Data Sets: Supervised, Semi-Supervised, and Unsupervised Learning, Springer, 2006.
16. Joliat, Alcalá-Fdez, Fernández, Luengo, Derrac, García, Sánchez and Herrera, KEEL Data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework, Multiple-Valued Log. Soft Comput. 17 (2011), 255–287.
17. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models, 2001.
18. Kecman, Iterative k data algorithm for solving both the least squares SVM and the system of linear equations, in: Proceedings of the IEEE SoutheastCon, 2015, pp. 1–6.
19. Kecman, Huang and Vogt, Iterative single data algorithm for training kernel machines from huge data sets: Theory and performance, Computational Intelligence 177 (2005), 255–274.
20. Kecman and Zigic, Algorithms for direct L2 support vector machines, in: Proceedings of the IEEE International Symposium on Innovations in Intelligent Systems and Applications, 2014, pp. 419–424.
21. Li, Chen, Wei, Xu and Kou, Feature selection via least squares support feature machine, International Journal of Information Technology and Decision Making 6 (2007), 671–686. doi:10.1142/S0219622007002733.
22. Liu, Li, Xu and Shi, A weighted $L_{q}$ adaptive least squares support vector machine classifiers – Robust and sparse approximation, Expert Systems With Applications 38 (2011), 2253–2259. doi:10.1016/j.eswa.2010.08.013.
23. P.M. Long and Tan, PAC learning axis-aligned rectangles with respect to product distribution from multiple-instance examples, Machine Learning 30 (1998), 7–21. doi:10.1023/A:1007450326753.
24. Luna, Cano, Sakalauskas and Ventura, Discovering useful patterns from multiple instance data, Information Science 357 (2016), 23–38. doi:10.1016/j.ins.2016.04.007.
25. Maron, Learning from ambiguity, AI Technical Report AITR-1639, MIT, 1998.
26. Maron and Lozano-Pérez, A framework for multiple-instance learning, Neural Information Processing Systems 10 (1998), 570–576.
27. Maron and A.L. Ratan, Multiple-instance learning for natural scene classification, in: Proceedings of the 15th International Conference on Machine Learning, 2008, pp. 341–349.
28. Melki, Cano and Ventura, MIRSVM: Multi-instance support vector machine with bag representatives, Pattern Recognition 79 (2018), 228–241. doi:10.1016/j.patcog.2018.02.007.
29. Platt, Cristianini and Taylor, Large margin DDAGs for multi-class classification, Neural Information Processing Systems 12 (2000), 547–553.
30. Ray and Craven, Supervised versus multiple instance learning: An empirical comparison, in: Proceedings of the International Conference on Machine Learning, 2005, pp. 697–704.
31. Ruffo, Learning single and multiple decision trees for security applications, PhD dissertation, Dept. Computer Science, Univ. Turin, Italy, 2000.
32. Schölkopf and Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2002.
33. Veronika, M.J.T. David and Marco, Multiple instance learning with bag dissimilarities, Pattern Recognition 48 (2015), 264–275. doi:10.1016/j.patcog.2014.07.022.
34. Viola, Platt and Zhang, Multiple instance boosting for object detection, Neural Information Processing Systems 18 (2005), 1417–1424.
35. Wang and J.D. Zucker, Solving multiple-instance problem: A lazy learning approach, in: Proceedings of the International Conference on Machine Learning, 2002.
36. Wu, X.Q. Zhu, C.Q. Zhang and Z.H. Cai, Multi-instance learning from positive and unlabeled bags, Knowledge Discovery and Data Mining 8443 (2014), 237–248. doi:10.1007/978-3-319-06608-0_20.
37. Yang, Dong and Hua, Region-based image annotation using asymmetrical support vector machine-based multiple-instance learning, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), 2006, pp. 2057–2063.
38. Yang and Lozano-Pérez, Image database retrieval with multiple-instance learning techniques, in: Proceedings IEEE International Conference Data Engineering, 2000, pp. 233–243.
39. Zhang and S.A. Goldman, EM-DD: An improved multiple-instance learning technique, Advances in Neural Information Processing Systems 14 (2002), 1073–1080.
40. Zhang, S.A. Goldman, Yu and Fritts, Content-based image retrieval using multiple-instance learning, in: Proceedings of the 19th International Conference on Machine Learning, 2002, pp. 682–689.
41. Zhou, Sun and Li, Multi-instance learning by treating instances as non-i.i.d. samples, in: Proceedings of the 26th International Conference on Machine Learning, 2009, pp. 1249–1256.
