On the generation of multi-label prototypes

Abstract

Data reduction techniques play a key role in instance-based classification to lower the amount of data to be processed. Prototype generation aims to obtain a reduced training set so that accurate results can be achieved with less effort. This translates into a significant reduction in both the spatial and temporal burden of the algorithms. This issue is particularly relevant in multi-label classification, which is a generalization of multiclass classification that allows objects to belong to several classes simultaneously. Although this field is quite active in terms of learning algorithms, there is a lack of data reduction methods. In this paper, we propose several prototype generation methods for multi-label datasets based on Granular Computing. The simulations show that these methods significantly reduce the number of examples to a set of prototypes without significantly affecting classifiers’ performance.

Keywords

Multi-label classification, prototype generation, granular computing

1. Introduction

Classification is one of the most popular Data Mining topics. Its aim is to learn from labeled patterns a model able to predict the decision class for future, never seen before, data samples [1]. The best way to solve a classification problem is usually to have as much information as possible. In practice, however, this is not always the case. Sometimes, the performance may decrease due to the abundance of information since many examples may be irrelevant to the resolution of the problem or may provide the same information [2, 3].

Some algorithms, such as those found in example-based learning, use the training set directly to estimate the class label, which causes scalability problems as the training set grows. In this case, the number of training objects affects the computational cost of the method [4, 5]. The nearest neighbor rule is an example of a method with a high computational cost when the number of examples is large [6]. The most popular algorithm in this category is kNN [7]. The computational complexity of kNN is O(NM), where $N$ and $M$ are the size of the dataset and the dimensionality of the embedding space, respectively.

However, it is possible to reduce or modify the datasets without affecting the learning process, thus improving the efficiency by reducing the computational cost. One approach is classification based on the Nearest Prototype (NP) [8, 9]. In this approach, the decision class of a new object is calculated by analyzing its proximity to a set of prototypes selected or generated from the initial set of objects. Several strategies exist to reduce the number of examples in the input data to a set of representative prototypes. Overall, data reduction methods (with respect to the objects in the training dataset) are divided into two categories: prototype selection [10, 11] and prototype generation [12, 2, 13, 14]. Prototype selection algorithms choose a set of representative objects according to a well-defined criterion, while prototype generation algorithms create synthetic objects from the initial set of objects. In the literature concerning single-label learning, several interesting papers [6, 5, 12] elaborating on this issue have been published.

As mentioned above, Multi-Label Classification (MLC) is a type of classification task in which each object is associated with a binary vector of outputs, instead of being associated with a single value [15, 16]. In the literature, the ML-kNN algorithm appears as a very popular alternative, which uses the kNN rule in multi-label prediction [17]. This method finds the k nearest neighbors of the test object in the dataset and then relies on the maximum a posteriori principle to determine its label set. The solution is based on the prior and posterior probabilities associated with each label within the k nearest neighbors. This method has the same drawback as kNN: as the dataset grows, so does the algorithm’s computational cost. This happens because we need to scan the whole training set every time we process an object.

One relevant work related to NP in the field of multi-label classification is the kNNc method described in [18]. This algorithm works in two stages, by combining prototype selection techniques with example-based classification. Firstly, the algorithm obtains a reduced set of objects by using prototype selection techniques as done in traditional pattern classification [19]. The goal of this stage is to determine the set of labels which are nearest to the ones in the object to be classified. Secondly, the method uses the full set of samples but limiting the prediction to the labels inferred in the previous step.

The method introduced in [18] is a prototype selection method. However, in this paper, we propose a different alternative as we develop several methods to generate prototypes in MLC datasets that are independent from the learning paradigm. Aiming at reaching such an independence, we rely on Granular Computing [20, 21, 22] and two different ways of granulating the information space. Being more precise, two classical granulation approaches are the condition granulation and the decision granulation, which granulate the universe according to conditional attributes and decision classes, respectively. Our methods use these granulation approaches to derive a set of representative objects that replaces the original training set. In this way, a significant reduction in the number of examples is achieved, thus enhancing the efficiency of the classification methods that rely on the instance-based learning principle.

In the case of the first two methods, the granulation of a universe is performed by using a similarity relation that produces similarity classes (or granules) of objects in the universe from conditional attributes. Using similarity relations enables our methods to be used in the presence of mixed data, i.e., when there are both numerical and nominal attributes. On the other hand, the third method granulates the universe of discourse by means of an equivalence relation, taking into account the different labels associated with the problem. Therefore, we obtain an equivalence class for each label, which is used to generate a prototype. In the last method, we use a fuzzy similarity relation instead of a binary relation, which allows for the development of more flexible models without the need for additional parameters. As a result, for each object in the universe, a fuzzy set can be derived from its similarity to other objects. Experimental results using different benchmark problems showed that the proposed methods achieve a remarkable reduction of the training set, while preserving or improving the efficacy of some multi-label classifiers.

The paper is organized as follows. Section 2 reviews the prototype generation concept, and Section 3 presents the theoretical background on Granular Computing. Section 4 introduces the prototype generation methods for MLC datasets, and Section 5 is dedicated to evaluating the performance of some state-of-the-art MLC algorithms on the set of generated prototypes. Finally, in Section 6 we provide some concluding remarks.

2. A brief introduction to prototype generation

Prototype generation techniques are devoted to creating a new set of labeled synthetic objects that replace the initial training set. Under the data reduction paradigm, this new set is expected to be smaller than the original one while having better decision boundaries. Let us consider an object $x_{i}$ of a dataset, which is defined by $M$ descriptive attributes and attached with a class attribute $d$ , that is, $x_{i}=[a_{i1},a_{i2},\dots,a_{iM},a_{id}]$ . Then, let us assume that $X_{\textit{train}}$ is a training dataset with $N$ samples. The purpose of prototype generation is to obtain a reduced set, $X_{\textit{red}}$ having $P$ objects generated from $X_{\textit{train}}$ , such that $P<N$ . The cardinality of $X_{\textit{red}}$ should be small enough to decrease the evaluation time taken by a classifier (kNN, for example) while maintaining the classification accuracy as much as possible. Therefore, data reduction approaches aim at summarizing the raw dataset without damaging its analytical properties.

The prototype generation approaches can be divided into several families depending on the main heuristic operation followed [23, 24]. The first approach that we can find in the literature, called PNN (Prototypes for Nearest Neighbor) [25], belongs to the family of methods that carry out a merging of prototypes of the same class in successive iterations, thus generating centroids. Other well-known methods are those based on a divide-and-conquer scheme. They decompose the N-dimensional space into two or more subspaces with the purpose of simplifying the problem at each step [26]. Other proposals that follow a similar approach include MixtGauss [13], RSP (Reduction by Space Partitioning) [27], and ICPL (Integrated Concept Prototype Learner) [28].

Another approach consists in adjusting the position of the prototypes that can be viewed as an optimization process. The main algorithm belonging to this family is LVQ (Learning Vector Quantization) [29]. LVQ can be understood as an artificial neural network in which a neuron corresponds with a prototype. Several approaches have been proposed to modify the LVQ method. For example, the third version of this algorithm –termed LVQ3– reported the best results. On the other hand, the HYB algorithm [30] combines support vector machines with LVQ3, and executes a search in order to find the most promising parameters of LVQ3. Also, the LVQPRU algorithm [31] extends LVQ by using a pruning step to remove noisy objects.

As a positioning adjustment of prototypes technique, a genetic algorithm called ENPC (Evolutionary Approach based on Nearest Prototype Classifier) was proposed for prototype generation in [32]. This algorithm executes different operators in order to find the most suitable position of the prototypes. PSO (Particle Swarm Optimization) was proposed for prototype generation in [33, 34], and they also belong to the positioning adjustment of prototypes category of methods. The main difference between them is the type of codification of the particles. Also, evolutionary algorithms have been used to tackle this problem [32, 35, 36]. In reference [24], a methodology is presented to learn iteratively the positioning of prototypes using real parameter optimization procedures. They propose an iterative prototype adjustment technique based on differential evolution.

3. Granular computing

Without losing generality, we can say that the issues of Granular Computing can be seen from two related angles, the construction of granules and the reasoning with granules. The former deals with the formation, representation, and interpretation of granules, while the latter deals with the exploitation of granules in problem solving [37, 38]. When building the granules, it is necessary to study the criteria for deciding whether or not two elements should be put together into the same granule, based on the available information.

Typically, elements in a granule are gathered together by indistinguishability, similarity, proximity or functionality. The granulation of an information system $\textit{IS}=(U,A)$ , with $U$ being a non-empty finite set of objects and $A$ being a non-empty finite set of attributes, consists in forming subsets of objects using an indiscernibility relation. It considers the features of the application domain to determine the inseparability of objects, and depending on how it is built, different types of information granules are obtained.

These granules could form a partition or a covering of the universe. In both cases, a set of granules is formed. When each object of $U$ belongs to a unique granule, the result is a partition; if an object may belong to more than one granule at the same time, it is a covering. A covering of the universe is a family of non-empty subsets whose union is equal to the universe; when the subsets are also pairwise disjoint, a partition is obtained.

3.1 Equivalence relation

An equivalence relation is a very common type of indiscernibility relation ( $R$ ), which is the foundation of the Rough Set Theory [39]. As such, $[x]_{R}$ defines an equivalence class of an element $x\in U$ under $R$ , where $[x]_{R}=\{y\in U:yRx\}$ , i.e. the equivalence class of an element includes all objects in the universe indiscernible from it. Each equivalence class can be seen as a granule consisting of indistinguishable elements. We say that two objects are equivalent if they have exactly the same value with respect to a set of attributes.
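
As an illustration of the definition above, the following minimal sketch groups the objects of a toy universe into equivalence classes. The function name `equivalence_classes` and the toy data are assumptions introduced here for illustration; they are not part of the original method description.

```python
from collections import defaultdict

def equivalence_classes(universe, attributes):
    """Group object indices that share identical values on `attributes`."""
    classes = defaultdict(list)
    for index, obj in enumerate(universe):
        # Two objects are equivalent when this key is the same for both.
        key = tuple(obj[a] for a in attributes)
        classes[key].append(index)
    return list(classes.values())

# Hypothetical universe: each object is a tuple of nominal attribute values.
U = [("red", "small"), ("red", "large"), ("red", "small"), ("blue", "large")]

granules = equivalence_classes(U, attributes=[0, 1])  # both attributes
coarser = equivalence_classes(U, attributes=[0])      # first attribute only
```

Every object falls into exactly one class, so the granules form a partition of the universe; using fewer attributes yields coarser granules.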

3.2 Hard similarity relation

For some domains, the use of equivalence relations could cause two objects that should be inseparable to be labeled as separable, which makes the relation excessively strict [40]. This problem can be alleviated to some extent by extending the concept of inseparability relation [41], replacing the equivalence relation with a weaker binary relation. Equation (1) shows such an indiscernibility relation, where $0\leqslant\delta(x,y)\leqslant 1$ represents a similarity function,

$\displaystyle R:\textit{xRy}\ \Longleftrightarrow\delta(x,y)\geqslant\xi.$ (1)

This similarity relation states that the objects $x$ and $y$ are inseparable as long as their similarity degree $\delta(x,y)$ is at least the similarity threshold $0\leqslant\xi\leqslant 1$ . This relation actually defines a similarity class $\overline{R}(x)=\{y\in U:yRx\}$ that replaces the equivalence class. The similarity function could be formulated in a variety of ways, for example, $\delta(x,y)=1-\varphi(x,y)$ with $\varphi(x,y)$ being the distance between objects $x$ and $y$ . In this approach, an object can simultaneously belong to different similarity classes, so the covering induced by the similarity relation $R$ over the universe $U$ is not necessarily a partition [42, 43].
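
The following sketch shows how a similarity class could be computed under Eq. (1). It assumes purely numerical attributes already normalized to $[0,1]$ and takes $\varphi$ to be a normalized Euclidean distance; the function names and the toy data are hypothetical.

```python
import math

def similarity(x, y):
    """delta(x, y) = 1 - phi(x, y), with phi a normalized Euclidean distance.

    Assumption: all attributes are numerical and lie in [0, 1].
    """
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
    return 1.0 - dist

def similarity_class(universe, j, xi):
    """Indices of all objects y such that delta(universe[j], y) >= xi."""
    return [i for i, y in enumerate(universe) if similarity(universe[j], y) >= xi]

# Hypothetical universe with two numerical attributes.
U = [(0.10, 0.20), (0.15, 0.25), (0.90, 0.80), (0.12, 0.22)]

class_of_0 = similarity_class(U, 0, xi=0.96)
class_of_1 = similarity_class(U, 1, xi=0.96)
```

In this toy example the last object appears in both similarity classes, which is why the family of similarity classes is a covering rather than a partition.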

3.3 Soft similarity relation

Using a user-specified similarity threshold parameter in Eq. (1) makes the information granule hard, i.e., an object either belongs to the information granule or not. An alternative that suppresses this parameter is to use fuzzy sets. As a matter of fact, fuzzy sets provide another way of relaxing the inseparability concept when processing the universe objects. In this way, the similarity relation is replaced with a fuzzy similarity relation.

This relation replaces a hard membership with a more flexible membership, i.e., sets to which all objects belong to some degree [44, 45]. More precisely, instead of determining the indiscernibility between objects, an approximate equality is established. The approximate equality between objects is modeled through a fuzzy relation that assigns to each pair of objects in $U$ a degree of similarity. In references [46, 47], a fuzzy relation is used as the indiscernibility relation, which quantifies the strength of the similarity relation between two objects in the $[0,1]$ interval. The fuzzy approach requires rewriting the binary relation $R$ as a fuzzy binary relation. The fuzzy relation is characterized by a membership function, which is defined by a similarity function $\delta(x,y)$ between $x$ and $y$ ; therefore, we can write $xRy=\delta(x,y)$ .

4. Generation of MLC prototypes

In this section, we propose four prototype construction methods for multi-label training sets. Let $\textit{mlDS}=(U,A\cup L)$ be a multi-label decision system, where the set $U$ is a non-empty finite set of objects, $A$ is a non-empty finite set of attributes that describe each observation, and $L=\{L_{1},L_{2},\ldots,L_{K}\}$ is a non-empty finite set of labels such that the label domain is $L_{i}=\{0,1\}$ .

4.1 Prototype generation from a universe granulation by condition

The first two algorithms, described in this subsection, share the same rationale. They are iterative methods that build similarity classes by using the similarity relation defined in Eq. (1). An object may belong to several similarity classes at the same time. However, once an object has been included in a similarity class, it is no longer used as the starting object of a new similarity class. Each similarity class constitutes a granule that is later used to build a prototype.

The generated prototypes are synthetic instances composed of both conditional and label attributes. To derive the information by condition (attribute values) and by decision (label values), an aggregation operation is used. In the case of conditional attributes, the average can be used as the aggregation operator if the attribute is numeric, or the mode if the attribute is nominal. The way in which the labels in the prototype are derived differs in the algorithms, as they use different concepts of decision class.

Therefore, we must define what is considered to be a decision class in the MLC context. This is done as follows:

•
Each combination $C_{i}$ of labels represents a decision value. For example, let $L=\{L_{1},L_{2},L_{3}\}$ denote the set of labels. A combination of labels could be “101”, indicating that the object is associated with the labels $L_{1}$ and $L_{3}$ but not with $L_{2}$ . Then “101” defines a decision class, so that all objects associated with exactly the labels $L_{1}$ and $L_{3}$ belong to that decision class.
•
Each label ( $L_{i}$ ) is considered a decision value, so that all the objects associated with that label belong to this decision class. Consequently, in the example above there are three decision classes $L_{1}$ , $L_{2}$ , and $L_{3}$ .

In the GP1mlTS algorithm, each combination of labels represents a decision value, while the GP2mlTS algorithm considers each label independently as a decision value. In this way, the first algorithm builds each decision class from the most common combination of labels in the granule, while the second algorithm does it by taking into account the labels independently. In this latter case, the prototype will have as decision values the most common labels in the granule.
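
The difference between the two notions of decision class can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and "most common label" is interpreted here as a label carried by strictly more than half of the objects in the granule, which is an assumption.

```python
from collections import Counter

def most_common_combination(label_vectors):
    """Decision value = the label combination occurring most often in the granule."""
    return Counter(map(tuple, label_vectors)).most_common(1)[0][0]

def per_label_majority(label_vectors):
    """Decision value = 1 for every label carried by more than half of the granule.

    Assumption: 'most of the objects' means a strict majority.
    """
    n = len(label_vectors)
    return tuple(int(2 * sum(column) > n) for column in zip(*label_vectors))

# Hypothetical granule with five objects and three labels (L1, L2, L3).
granule_labels = [(1, 0, 1), (1, 0, 1), (1, 1, 0), (0, 1, 0), (1, 1, 1)]

by_combination = most_common_combination(granule_labels)
by_label = per_label_majority(granule_labels)
```

For this granule the two criteria disagree: the most frequent combination is "101", whereas each of the three labels taken separately is carried by a majority of the objects, giving "111".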

Algorithm GP1mlTS

Initialize objects’ counter: $\textit{Used}[i]=0$ for $i=1,\ldots,N$

$\textit{PrototypeSet}=\emptyset$

While $\exists i$ : $\textit{Used}[i]=0$

    $j=i$

    Construct the similarity class $\overline{R}(O_{j})$ of the object $O_{j}$

    Construct a prototype $P=[P_{\textit{cond}},P_{\textit{dec}}]$ from all objects in $\overline{R}(O_{j})$

        $P_{\textit{cond}}$ is calculated from the set of values of the attributes ( $A$ ) of all objects in $\overline{R}(O_{j})$ by using an aggregation operator

        $P_{\textit{dec}}$ is the most common label combination ( $C$ ) among all existing label combinations of the objects in $\overline{R}(O_{j})$

    $\textit{PrototypeSet}=\textit{PrototypeSet}\cup P$

    $\textit{Used}[i]=1$ for every object $O_{i}$ in $\overline{R}(O_{j})$

Return PrototypeSet

Algorithm GP2mlTS

Initialize objects’ counter: $\textit{Used}[i]=0$ for $i=1,\ldots,N$

$\textit{PrototypeSet}=\emptyset$

While $\exists i$ : $\textit{Used}[i]=0$

    $j=i$

    Construct the similarity class $\overline{R}(O_{j})$ of the object $O_{j}$

    Construct a prototype $P=[P_{\textit{cond}},P_{\textit{dec}}]$ from all objects in $\overline{R}(O_{j})$

        $P_{\textit{cond}}$ is calculated from the set of values of the attributes ( $A$ ) of all objects in $\overline{R}(O_{j})$ by using an aggregation operator

        $P_{\textit{dec}}=\{L_{1},L_{2},\ldots,L_{K}\}$ , where $L_{k}=1$ if most of the objects in $\overline{R}(O_{j})$ are labeled with $L_{k}$ , otherwise $L_{k}=0$ ( $k=1,\ldots,K$ )

    $\textit{PrototypeSet}=\textit{PrototypeSet}\cup P$

    $\textit{Used}[i]=1$ for every object $O_{i}$ in $\overline{R}(O_{j})$

Return PrototypeSet
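
The two condition-based procedures can be sketched in a single function, as shown below. This is a minimal sketch under several assumptions that are not stated in the text: attributes are numerical and normalized to $[0,1]$ (so the mean is the aggregation operator), the similarity is one minus a normalized Euclidean distance, the unused object with the lowest index is picked at each iteration, and "most of the objects" means a strict majority. The function and parameter names are hypothetical.

```python
import math
from collections import Counter

def similarity(x, y):
    """Assumed similarity: 1 - normalized Euclidean distance on [0, 1] attributes."""
    return 1.0 - math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def generate_prototypes(X, Y, xi, by_combination=True):
    """by_combination=True mimics GP1mlTS; False mimics GP2mlTS."""
    used = [False] * len(X)
    prototypes = []
    for j in range(len(X)):
        if used[j]:
            continue  # objects already covered do not start a new class
        # The similarity class may contain objects that were used before.
        granule = [i for i in range(len(X)) if similarity(X[j], X[i]) >= xi]
        n = len(granule)
        # Condition part: attribute-wise mean (numerical attributes assumed).
        p_cond = tuple(sum(X[i][a] for i in granule) / n for a in range(len(X[j])))
        labels = [tuple(Y[i]) for i in granule]
        if by_combination:
            p_dec = Counter(labels).most_common(1)[0][0]
        else:
            p_dec = tuple(int(2 * sum(col) > n) for col in zip(*labels))
        prototypes.append((p_cond, p_dec))
        for i in granule:
            used[i] = True
    return prototypes

# Hypothetical toy data: four objects, two attributes, two labels.
X = [(0.10, 0.20), (0.15, 0.25), (0.90, 0.80), (0.12, 0.22)]
Y = [(1, 0), (1, 0), (0, 1), (1, 1)]

gp1 = generate_prototypes(X, Y, xi=0.9, by_combination=True)
gp2 = generate_prototypes(X, Y, xi=0.9, by_combination=False)
```

With nominal attributes, the mean in the condition part would be replaced by the mode, as described earlier in this subsection.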

4.2 Prototype generation from a universe granulation by decision

The third algorithm performs a decision-based granulation of the universe. Overall, this method builds the information granules by taking into consideration the labels instead of the condition attributes. The universe granulation produced by this algorithm is a covering since an object can belong to two or more granules. The algorithm builds a granule for each label, thus including all objects that are labeled with that decision label. The aggregation procedure to derive the prototype for each granule is done as described in the previous subsection. This whole procedure is formalized in the GP3mlTS algorithm.

Algorithm GP3mlTS

$\textit{PrototypeSet}=\emptyset$

For each $L_{i}\in L$ , $i=1,\ldots,K$

    Construct the equivalence class $[L_{i}]_{R}$ using an equivalence relation

    Construct a prototype $P=[P_{\textit{cond}},P_{\textit{dec}}]$ from all objects in $[L_{i}]_{R}$

        $P_{\textit{cond}}$ is calculated from the set of values of the attributes ( $A$ ) of all objects in $[L_{i}]_{R}$ by using an aggregation operator

        $P_{\textit{dec}}=\{L_{1},L_{2},\ldots,L_{K}\}$ , where $L_{k}=1$ if most of the objects in $[L_{i}]_{R}$ are labeled with $L_{k}$ , otherwise $L_{k}=0$ ( $k=1,\ldots,K$ )

    $\textit{PrototypeSet}=\textit{PrototypeSet}\cup P$

Return PrototypeSet
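
A possible reading of the decision-based procedure is sketched below. The sketch assumes numerical attributes aggregated by the mean, a strict-majority rule for the decision part, and that labels with no associated objects are skipped; these choices and the function name are assumptions made for illustration.

```python
def gp3_prototypes(X, Y):
    """One prototype per label, built from all objects that carry that label."""
    n_labels = len(Y[0])
    n_attributes = len(X[0])
    prototypes = []
    for k in range(n_labels):
        granule = [i for i in range(len(X)) if Y[i][k] == 1]
        if not granule:
            continue  # assumption: a label without objects yields no prototype
        n = len(granule)
        p_cond = tuple(sum(X[i][a] for i in granule) / n for a in range(n_attributes))
        p_dec = tuple(
            int(2 * sum(Y[i][l] for i in granule) > n) for l in range(n_labels)
        )
        prototypes.append((p_cond, p_dec))
    return prototypes

# Hypothetical toy data: the second object carries both labels,
# so it belongs to both granules (the granulation is a covering).
X = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
Y = [(1, 0), (1, 1), (0, 1)]

gp3 = gp3_prototypes(X, Y)
```

Because one granule is built per label, this procedure produces at most $K$ prototypes regardless of the number of objects.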

4.3 Generation of prototypes by using a fuzzy similarity relation

A key issue when constructing similarity classes is the proper estimation of the similarity threshold parameter, which defines whether two objects are similar or not. Small variations in the granularity degree may lead to quite different outcomes. Determining the exact granularity degree is not trivial, since higher values do not necessarily lead to optimal prediction rates while smaller values might result in a significant accuracy loss.

To suppress this parameter, we propose a fourth algorithm that uses a fuzzy similarity relation. This implies that all objects in $U$ are related to each other, but with different similarity degrees. The fuzzy binary relation $xRy=\delta(x,y)$ is used to define the fuzzy set $N_{1}(x)$ , as given below:

$\displaystyle N_{1}(x)=\left\{(y,\delta(x,y))\>\forall y\in U\right\}$ (2)

where $\delta(x,y)$ denotes the similarity degree of the object $y$ to $x$ .

In short, the set $N_{1}(x)$ represents the granule by condition of object $x$ , which is used to build the prototype. However, only the 5% of the objects with the highest similarity to $x$ are used to compute the prototype. It should be highlighted that, once an object has been used for the construction of a prototype, it is not taken into account to build a new set $N_{1}$ from it, since it was already used to build a prototype that represents it. The prototype construction is done as explained before. The GP4mlTS algorithm formalizes this procedure.

Algorithm GP4mlTS

Initialize objects’ counter: $\textit{Used}[i]=0$ for $i=1,\ldots,N$

$\textit{PrototypeSet}=\emptyset$

While $\exists i$ : $\textit{Used}[i]=0$

    $j=i$

    Construct the set $N_{1}(O_{j})$ of the object $O_{j}$

    $S(O_{j})\leftarrow$ the 5% of the objects in $N_{1}(O_{j})$ that are most similar to $O_{j}$

    Construct a prototype $P=[P_{\textit{cond}},P_{\textit{dec}}]$ from all objects in $S(O_{j})$

        $P_{\textit{cond}}$ is calculated from the set of values of the attributes ( $A$ ) of all objects in $S(O_{j})$ by using an aggregation operator

        $P_{\textit{dec}}=\{L_{1},L_{2},\ldots,L_{K}\}$ , where $L_{k}=1$ if most of the objects in $S(O_{j})$ are labeled with $L_{k}$ , otherwise $L_{k}=0$ ( $k=1,\ldots,K$ )

    $\textit{PrototypeSet}=\textit{PrototypeSet}\cup P$

    $\textit{Used}[i]=1$ for every object $O_{i}$ in $S(O_{j})$

Return PrototypeSet
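
The fuzzy procedure might be sketched as follows. As before, this is only an illustration under assumptions: numerical attributes in $[0,1]$, similarity defined as one minus a normalized Euclidean distance, the 5% share rounded up with at least one object selected, a strict-majority rule for the labels, and the starting object always marked as used. None of these details is fixed by the text, and the names are hypothetical.

```python
import math

def similarity(x, y):
    """Assumed similarity: 1 - normalized Euclidean distance on [0, 1] attributes."""
    return 1.0 - math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def gp4_prototypes(X, Y, fraction=0.05):
    n_objects = len(X)
    # Assumption: round the share up and always keep at least one object.
    n_selected = max(1, math.ceil(fraction * n_objects))
    used = [False] * n_objects
    prototypes = []
    for j in range(n_objects):
        if used[j]:
            continue
        # Fuzzy set N_1(O_j): every object paired with its similarity to O_j.
        fuzzy_set = [(i, similarity(X[j], X[i])) for i in range(n_objects)]
        fuzzy_set.sort(key=lambda pair: pair[1], reverse=True)
        selected = [i for i, _ in fuzzy_set[:n_selected]]
        n = len(selected)
        p_cond = tuple(sum(X[i][a] for i in selected) / n for a in range(len(X[j])))
        p_dec = tuple(
            int(2 * sum(Y[i][k] for i in selected) > n) for k in range(len(Y[j]))
        )
        prototypes.append((p_cond, p_dec))
        used[j] = True
        for i in selected:
            used[i] = True
    return prototypes

# Hypothetical toy data; a large fraction is used so that granules hold two objects.
X = [(0.10, 0.20), (0.12, 0.22), (0.90, 0.80), (0.88, 0.82)]
Y = [(1, 0), (1, 0), (0, 1), (0, 1)]

gp4 = gp4_prototypes(X, Y, fraction=0.5)
```

Note that no similarity threshold appears in the function: the granule size is controlled only by the fixed share of most similar objects.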

5. Empirical simulations

In this section, we explore the performance of our prototype generation methods when coupled with state-of-the-art multi-label classifiers.

5.1 Experimental setup

The state-of-the-art MLC classifiers used in this section are included in the MULAN library [48]. Specifically, we use the following algorithms:

•
The ML-kNN algorithm [17] is an adaptation of the $k$NN method to the multi-label scenario. It starts by building a limited model that operates on two pieces of information:

–
The a priori probability of each label, which is defined as the number of times each label appears in the multi-label dataset divided by the number of objects in the dataset. A smoothing factor is applied to avoid multiplying by zero.
–
The conditional probabilities for each label, which are computed as the proportion of k-nearest neighbors having that label with respect to all objects associated with the target label.

These probabilities are independently computed for each label, facing the task as a collection of individual binary problems. Therefore, the potential dependencies among labels are fully dismissed by this algorithm.

After completing this training process, the classifier is able to predict the labels for new objects. The classification goes as follows:

–
First, the k-nearest neighbors of the given sample are obtained. The Euclidean distance is used to measure the (dis)similarity between the reference object and the samples in the multilabel dataset.
–
Then, the presence of each label in the neighbors is used as the evidence to compute maximum a posteriori (MAP) probabilities from the conditional ones obtained before.
–
Finally, the label set of the new sample is generated from the MAP probabilities. The probability itself is provided as a confidence level for each label, thus making it possible to also generate a label ranking.

•
The foundation behind the BP-MLL algorithm [49] is to adapt a feed-forward neural network to deal with multi-label data, where the error backpropagation strategy is employed to minimize a global error function capturing the correlation among the labels. The key aspect in BP-MLL is the introduction of a new error function, which is computed taking into account the fact that each sample contains several labels. Specifically, this new function penalizes the predictions including labels which are not truly relevant for the processed object. In the BP-MLL method, the input layer has as many neurons as features describing the multi-label dataset. The number of units in the output layer is determined by the number of labels, while the number of neurons in the hidden layer is also influenced by the number of labels. This algorithm produces a label ranking as result while classifying new objects. A threshold parameter decides which labels will be deemed as relevant.
•
Random k-Labelsets (RAkEL) [50] is a method that generates random subsets of labels and trains a multiclass classifier for each subset. RAkEL involves two essential parameters, $c$ and $k$ . The former determines the number of classifiers to be trained and the latter the length of the labelsets to be generated. When $k=1$ and $c=|L|$ , RAkEL works as the Binary Relevance transformation [51]. In contrast, with $c=1$ and $k=|L|$ the model is equivalent to Label Powerset [52]. Intermediate values for both parameters are the interesting ones, since the algorithm then produces a ranking of labels computed by means of a voting mechanism.

We use the same default parameter settings provided by the MULAN library. It should be highlighted that these default parameter values are common across all datasets, thus no algorithm performs hyperparameter tuning.

To perform the numerical simulations, we take 12 multi-label datasets from the well-known RUMDR [53] repository. Table 1 summarizes the number of objects, attributes, and labels for each dataset. In the adopted datasets, the number of objects ranges from 1,675 to 10,491, the number of attributes from 294 to 1,836, and the number of labels from 6 to 400. A more detailed description of each case study is given below:

•
bibtex: This dataset was introduced in [54] and contains the metadata for bibliographic entries. The words in the title, authors names, journal name, and publication date were taken as input attributes. The data origin is Bibsonomy, a specialized social network where the users can share bookmarks and BibTeX entries assigning labels to them. The bag-of-words (BoW) model is used to represent the documents, so all features are binary and indicate whether a certain term is relevant to the document or not.
•
corel5k: this dataset involves thousands of images, which were categorized into several groups [55]. In addition, each picture is associated with a set of words describing its content (i.e., the labels). These pictures were segmented by the authors by using the normalized cuts method, thus generating a set of blobs associated with one or more words. The input features are the vectors resulting from the segmentation process.
•
enron: the Enron corpus is a large set of email messages, with more than half a million entries, from which a dataset for automatic folder assignment research was generated [56]. The enron dataset is a subset of that corpus. Each message has as input features a BoW model obtained from the email fields, such as the subject and the body of the message. The labels are the folders in which each message was stored.
•
scene: this dataset is also related to image labeling, specifically to scene classification. The set of pictures was taken from the Corel dataset, and some personal ones by the authors [52] were also included. The images are transformed to the CIE Luv color space, known for being perceptually uniform, and later segmented into 49 blocks.
•
stackexchange: the case study in [57] is a tag suggestion task for questions posted in specialized forums, specifically forums from the Stack Exchange network. Six datasets were generated from forums devoted to topics such as cooking, computer science, and chess. The title and body of each question were text-mined, thus producing a frequency BoW. The tags assigned by the users to their questions were used as the labels.

Table 1
Summary of the MLC datasets used in our study

Dataset	Domain	Objects	Attributes	Labels
bibtex (D1)	text	7,395	1,836	159
corel5k (D2)	images	5,000	499	374
enron (D3)	text	1,702	1,001	53
scene (D4)	images	2,407	294	6
stackex_chemistry (D5)	text	6,961	540	175
stackex_chess (D6)	text	1,675	585	227
stackex_cooking (D7)	text	10,491	577	400
stackex_cs (D8)	text	9,270	635	274
stackex_philosophy (D9)	text	3,971	842	233

In order to measure the performance obtained by the MLC classifiers, we use the Hamming Loss (HL) metric, which is a well-known performance measure in MLC scenarios [58]. This metric is defined as follows,

$\displaystyle\textit{HL}=\frac{1}{N}\frac{1}{K}\sum_{i=1}^{N}\left|Y_{i}\varDelta Z_{i}\right|$ (3)

where the $\varDelta$ operator returns the symmetric difference between $Y_{i}$ (the real label set of the $i$ th object) and $Z_{i}$ (the predicted one), with $N$ the number of objects, and $K$ the number of decision labels in the training set.
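
A direct transcription of Eq. (3) is sketched below, with label sets represented as binary vectors so that the size of the symmetric difference equals the number of positions in which the two vectors differ. The function name and the toy vectors are hypothetical.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of object-label pairs that are predicted incorrectly (Eq. (3))."""
    n_objects = len(Y_true)
    n_labels = len(Y_true[0])
    mismatches = sum(
        t != p
        for y_true, y_pred in zip(Y_true, Y_pred)
        for t, p in zip(y_true, y_pred)
    )
    return mismatches / (n_objects * n_labels)

# Hypothetical example: two objects, three labels, two wrong label assignments.
Y_true = [(1, 0, 1), (0, 1, 0)]
Y_pred = [(1, 1, 1), (0, 1, 1)]

hl = hamming_loss(Y_true, Y_pred)  # 2 mismatches over 6 object-label pairs
```

Lower values are better, and a value of zero means that every label of every object was predicted correctly.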

During the numerical simulations, we also studied the efficiency in terms of the reduction coefficient [59]. This measure, depicted in Eq. (4), quantifies how much the number of objects is reduced, i.e., the percentage of objects of the universe of discourse $U$ that are removed when $U$ is replaced with the set of prototypes $P$ ,

$\displaystyle\textit{Red}(\cdot)=\frac{\left|U\right|-\left|P\right|}{\left|U\right|}\cdot 100.$ (4)
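
As a small worked example of Eq. (4), the sketch below computes the reduction coefficient for a purely hypothetical case in which a universe of 2,407 objects is replaced with 6 prototypes. These numbers are chosen for illustration and are not reported results; the function name is also an assumption.

```python
def reduction_coefficient(n_objects, n_prototypes):
    """Percentage of objects removed when the universe is replaced by prototypes."""
    return 100.0 * (n_objects - n_prototypes) / n_objects

# Hypothetical case: 2,407 objects summarized by 6 prototypes.
red = reduction_coefficient(2407, 6)
```

A value of 0 means that no reduction took place, while values close to 100 mean that almost all objects were replaced.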

In this experiment, the similarity threshold $\xi$ used in Eq. (1) ranges from 0.75 to 0.95. Also, we have adopted the Heterogeneous Euclidean-Overlap Metric (HEOM) [60], which computes the normalized Euclidean distance between numerical attributes and an overlap metric for nominal attributes. Eq. (5) formalizes the HEOM distance function,

$\displaystyle\varphi_{\textit{HEOM}}(x,y)=\sqrt{\frac{\sum_{j=1}^{|A|}a_{j}\sigma_{j}(x,y)}{\sum_{j=1}^{|A|}a_{j}}}$ (5)

where $x(j)$ and $y(j)$ are the normalized values of the $j$ th attribute for heterogeneous objects $x$ and $y$ , respectively. Moreover,

$\displaystyle\sigma_{j}(x,y)=\begin{cases}0&\text{if }a_{j}\text{ is nominal }\wedge x(j)=y(j)\\ 1&\text{if }a_{j}\text{ is nominal }\wedge x(j)\neq y(j)\\ (x(j)-y(j))^{2}&\text{if }a_{j}\text{ is numerical}.\end{cases}$ (6)
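
Equations (5) and (6) can be sketched in code as follows. The sketch reads the factor $a_{j}$ in Eq. (5) as a per-attribute weight and defaults all weights to 1, which is an interpretation rather than something stated in the text; numerical values are assumed to be normalized to $[0,1]$ beforehand, and the function and parameter names are hypothetical.

```python
import math

def heom_distance(x, y, nominal, weights=None):
    """HEOM distance between two mixed objects.

    nominal: set of attribute indices that are nominal.
    weights: assumed per-attribute weights (all 1 when omitted).
    """
    if weights is None:
        weights = [1.0] * len(x)
    weighted_sum = 0.0
    for j, (xj, yj) in enumerate(zip(x, y)):
        if j in nominal:
            sigma = 0.0 if xj == yj else 1.0   # overlap metric
        else:
            sigma = (xj - yj) ** 2             # squared normalized difference
        weighted_sum += weights[j] * sigma
    return math.sqrt(weighted_sum / sum(weights))

# Hypothetical mixed objects: numerical, nominal, numerical.
x = (0.2, "a", 0.5)
y = (0.6, "b", 0.5)

distance = heom_distance(x, y, nominal={1})
similarity_degree = 1.0 - distance  # delta(x, y) = 1 - phi(x, y)
```

Because the distance lies in $[0,1]$, the derived similarity degree can be compared directly against the threshold $\xi$ of Eq. (1).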

It is worth mentioning that, for each dataset, we have estimated the HL value by using a 10-fold cross-validation scheme. For each fold, this procedure splits the whole dataset into two pieces, namely, the training set and the test set. Only the training set is used to generate the set of prototypes and derive the learning model. The test set is never modified, so it only serves to compute the HL associated with the current fold.
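
The evaluation protocol described above can be sketched as follows. This is only an outline under stated assumptions: `generate_prototypes` and `fit_predict` are hypothetical callables standing in for a prototype generation method and a multi-label classifier, and the shuffling seed and fold assignment are choices made for the example, not details reported in this paper.

```python
import random


def kfold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    if n < k:
        raise ValueError("need at least k objects")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_hl(X, Y, num_labels, generate_prototypes, fit_predict, k=10, seed=0):
    """Average HL over k folds; prototypes are built from the training part only."""
    scores = []
    for test_idx in kfold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        Y_train = [Y[i] for i in train_idx]
        # Only the training part is reduced to a set of prototypes.
        P_X, P_Y = generate_prototypes(X_train, Y_train)
        # The test part is left untouched and used for scoring only.
        X_test = [X[i] for i in test_idx]
        Y_test = [Y[i] for i in test_idx]
        Z_test = fit_predict(P_X, P_Y, X_test)
        errors = sum(len(set(y) ^ set(z)) for y, z in zip(Y_test, Z_test))
        scores.append(errors / (len(Y_test) * num_labels))
    return sum(scores) / len(scores)


# Placeholder components (hypothetical): no reduction, and an empty prediction.
def keep_all(X_train, Y_train):
    return X_train, Y_train


def predict_empty(P_X, P_Y, X_test):
    return [set() for _ in X_test]


X = list(range(20))
Y = [{0} for _ in X]
print(cross_validated_hl(X, Y, num_labels=2,
                         generate_prototypes=keep_all, fit_predict=predict_empty))
```

Keeping the prototype generation step inside the loop is the point of the sketch: the test objects of a fold never influence the prototypes used to classify them.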

5.2 Results and discussion


Figure 1 displays the reduction coefficient achieved once the proposed prototype generation methods are used on each dataset. The reader can note that our methods achieve a reduction rate higher than 20% in most problems. The GP1mlTS and GP2mlTS methods have a similar behavior. However, the GP3mlTS method reports reduction rates above 90% in most datasets.

Figure 1.

Dataset reduction (%) achieved by each algorithm.

Figure 2 displays the HL values achieved by the ML-kNN method with the original multi-label datasets, and the results obtained after using the set of prototypes generated by three of the methods proposed in this paper. The results show that the prototypes generated for each dataset lead to HL values similar to those obtained with the original dataset. Only in the case of the scene dataset is there a significant difference in the HL value, especially when the set of prototypes generated by the GP3mlTS method is used instead of the original dataset. This is due to the fact that this dataset has few labels (exactly 6), so only a few prototypes are generated.

Figure 2.

HL values achieved by the ML-kNN method.

Figure 3.

Performance (in terms of HL values) of the ML-kNN method from several prototype sets built with different similarity thresholds.

Figure 4.

Dataset reduction (%) obtained with GP1mlTS and GP2mlTS algorithms with different similarity thresholds.

The results shown above have been achieved with a fixed similarity threshold of 0.85. Figure 3 shows how the performance of the ML-kNN algorithm varies when the similarity threshold ranges between 0.75 and 0.95. This variation is also shown in Fig. 4, but in terms of the reduction coefficient obtained using GP1mlTS and GP2mlTS. Both figures illustrate how the similarity threshold influences the results of both algorithms, as expected. Overall, when the value of the similarity threshold is increased, better HL values are achieved; however, the reduction percentages are lower. A suitable trade-off between the HL value and the reduction rate across all datasets is obtained when using a similarity threshold equal to 0.85.

Figure 5.

Reduction rates produced by GP2mlTS and GP4mlTS algorithms.

Figure 6.

HL values obtained by GP2mlTS and GP4mlTS algorithms.

In addition, we compared GP2mlTS and GP4mlTS directly, since we wanted to verify the behavior of GP2mlTS when using fuzzy sets instead of similarity classes in the construction of the prototypes. Figure 5 illustrates the reduction achieved for each dataset by these two algorithms. Figure 6 shows the HL values obtained by ML-kNN when replacing the original training set with the granular one containing the generated prototypes.

The results suggest that the GP4mlTS algorithm is able to obtain significant reduction percentages, i.e. over 80% in most of the study cases. More importantly, the efficacy of the ML-kNN algorithm is preserved.

5.2.1 Performance analysis of other state-of-the-art multi-label algorithms using the generated prototypes

In this subsection, we analyze the performance of other multi-label classifiers on the generated prototypes. Figures 7 and 8 show the average HL values obtained by the BP-MLL and RAkEL algorithms, respectively, when GP1mlTS, GP2mlTS and GP4mlTS are used to derive the prototypes. We adopted these generation methods since they reported a better trade-off between reduction and efficacy in the numerical simulations conducted above.

Figure 7.

Average HL values achieved by the BP-MLL method.

Figure 8.

Average HL values achieved by the RAkEL method.

The simulation results show that the granular training sets help improve the performance of the BP-MLL algorithm in all cases. Likewise, the GP1mlTS, GP2mlTS and GP4mlTS algorithms provide a significant reduction in the dataset size (see Figs 1 and 5 in the previous subsection) while improving the HL values. In addition, the best trade-off between the prediction and the reduction rates is achieved with the GP4mlTS algorithm.

In the case of the RAkEL method, both the GP1mlTS and GP2mlTS methods produce HL values similar to the ones produced with the original dataset. This indicates that the efficacy is not compromised even when there is a significant reduction in the training set size. However, the RAkEL method seems to be quite sensitive to the GP4mlTS algorithm as the HL value increases. Therefore, using both algorithms together is not encouraged.

6. Concluding remarks

Despite the extensive work done in multi-label classification, as far as we know, the topic of prototype generation has not received much attention. In that regard, this paper proposed several methods based on Granular Computing for the generation of prototypes. These information constructs are built to represent the knowledge contained in the training set. The first two methods build the granules from similarity relations among the objects, while the third method uses an equivalence relation to create the granules. In addition, the fourth proposed method uses fuzzy similarity relations to remove the need to establish a similarity threshold when granulating the information space.

The prototypes generated by these methods replace the objects in the training set, which is an important component in supervised learning. The numerical simulations have shown that the proposed methods achieve a significant reduction when it comes to the dataset size, while also preserving the efficacy of the multi-label classifiers adopted for simulation. The GP3mlTS algorithm achieved the highest reduction, thus improving the efficiency of classifiers such as ML-kNN, BP-MLL and RAkEL. However, the overall accuracy was slightly affected. The GP1mlTS and GP2mlTS algorithms achieved a fair trade-off between efficiency and efficacy, although they are sensitive to the similarity threshold parameter. The GP4mlTS algorithm does not have this limitation and emerged as the most suitable alternative in our research.

References

1. Aggarwal C.C., Data classification: algorithms and applications, CRC press, 2014.
2. Kim S.-W. and Oommen B.J., A brief taxonomy and ranking of creative prototype reduction schemes, Pattern Analysis & Applications 6(3) (2003), 232–244.
3. Guan, Yuan, Lee Y.-K. and Lee, Nearest neighbor editing aided by unlabeled data, Information Sciences 179(13) (2009), 2273–2282.
4. García-Durán, Fernández and Borrajo, A prototype-based method for classification with time constraints: A case study on automated planning, Pattern Analysis and Applications 15(3) (2012), 261–277.
5. Hernández, Yumilka, Bello, Filiberto, Frías, Coello Blanco and Caballero, An approach for prototype generation based on similarity relations for problems of classification, Computación y Sistemas 19(1) (2015), 109–118.
6. Barandela, Cortés and Palacios, The nearest neighbor rule and the reduction of the training sample size, in: Proceedings 9th Symposium on Pattern Recognition and Image Analysis, Vol. 1, 2001, pp. 103–108.
7. Cover T.M., Hart P.E. et al., Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13(1) (1967), 21–27.
8. Bezdek J.C. and Kuncheva L.I., Nearest prototype classifier designs: An experimental study, International Journal of Intelligent Systems 16(12) (2001), 1445–1473.
9. García, Luengo and Herrera, Data preprocessing in data mining, Springer, 2015.
10. García, Cano J.R. and Herrera, A memetic algorithm for evolutionary prototype selection: A scaling up approach, Pattern Recognition 41(8) (2008), 2693–2709.
11. Pekalska, Duin R.P. and Paclík, Prototype selection for dissimilarity-based classifiers, Pattern Recognition 39(2) (2006), 189–208.
12. Triguero, Derrac, Garcia and Herrera, A taxonomy and experimental study on prototype generation for nearest neighbor classification, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42(1) (2012), 86–100.
13. Lozano, Sotoca J.M., Sánchez J.S., Pla, Pekalska and Duin R.P., Experimental study on prototype optimisation algorithms for prototype-based classification in vector spaces, Pattern Recognition 39(10) (2006), 1827–1838.
14. Fayed H.A., Hashem S.R. and Atiya A.F., Self-generating prototypes for pattern classification, Pattern Recognition 40(5) (2007), 1498–1509.
15. Tsoumakas, Katakis and Vlahavas, Mining multi-label data, in: Data mining and knowledge discovery handbook, Springer, 2009, pp. 667–685.
16. Zhang M.-L. and Zhou Z.-H., A review on multi-label learning algorithms, IEEE Transactions on Knowledge and Data Engineering 26(8) (2014), 1819–1837.
17. Zhang M.-L. and Zhou Z.-H., ML-KNN: A lazy learning approach to multi-label learning, Pattern Recognition 40(7) (2007), 2038–2048.
18. Calvo-Zaragoza, Valero-Mas J.J. and Rico-Juan J.R., Improving kNN multi-label classification in Prototype Selection scenarios using class proposals, Pattern Recognition 48(5) (2015), 1608–1622.
19. Nanni and Lumini, Prototype reduction techniques: A comparison among different approaches, Expert Systems with Applications 38(9) (2011), 11820–11828.
20. Bello, Falcón and Pedrycz, Granular computing: at the junction of rough sets and fuzzy sets, Vol. 224, Springer, 2007.
21. Bargiela and Pedrycz, Granular computing: an introduction, Vol. 717, Springer Science & Business Media, 2012.
22. Pedrycz, Granular computing: analysis and design of intelligent systems, CRC press, 2016.
23. Calvo-Zaragoza, Valero-Mas J.J. and Rico-Juan J.R., Prototype generation on structural data using dissimilarity space representation, Neural Computing and Applications 28(9) (2017), 2415–2424.
24. Triguero, García and Herrera, IPADE: Iterative prototype adjustment for nearest neighbor classification, IEEE Transactions on Neural Networks 21(12) (2010), 1984–1990.
25. Chang C.-L., Finding prototypes for nearest neighbor classifiers, IEEE Transactions on Computers 100(11) (1974), 1179–1184.
26. Chen and Jóźwik, A sample set condensation algorithm for the class sensitive artificial neural network, Pattern Recognition Letters 17(8) (1996), 819–823.
27. Sánchez J.S., High training set size reduction by space partitioning and prototype abstraction, Pattern Recognition 37(7) (2004), 1561–1564.
28. Lam, Keung C.-K. and Liu, Discovering useful concept prototypes for classification based on filtering and abstraction, IEEE Transactions on Pattern Analysis and Machine Intelligence 24(8) (2002), 1075–1090.
29. Kohonen, The self-organizing map, Proceedings of the IEEE 78(9) (1990), 1464–1480.
30. Kim S.-W. and Oommen B.J., Enhancing prototype reduction schemes with LVQ3-type algorithms, Pattern Recognition 36(5) (2003), 1083–1093.
31. Manry M.T. and Wilson D.R., Prototype classifier design with pruning, International Journal on Artificial Intelligence Tools 14(01n02) (2005), 261–280.
32. Fernández and Isasi, Evolutionary design of nearest prototype classifiers, Journal of Heuristics 10(4) (2004), 431–454.
33. Nanni and Lumini, Particle swarm optimization for prototype reduction, Neurocomputing 72(4–6) (2009), 1092–1097.
34. Cervantes, Galván I.M. and Isasi, AMPSO: a new particle swarm method for nearest neighborhood classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(5) (2009), 1082–1091.
35. Triguero, García and Herrera, A preliminary study on the use of differential evolution for adjusting the position of examples in nearest neighbor classification, in: IEEE Congress on Evolutionary Computation, IEEE, 2010, pp. 1–8.
36. Triguero, García and Herrera, Differential evolution for optimizing the positioning of prototypes in nearest neighbor classification, Pattern Recognition 44(4) (2011), 901–916.
37. Pedrycz and Homenda, Building the fundamentals of granular computing: a principle of justifiable granularity, Applied Soft Computing 13(10) (2013), 4209–4218.
38. Zadeh L.A., Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets and Systems 90(2) (1997), 111–127.
39. Pawlak and Skowron, Rough sets: some extensions, Information Sciences 177(1) (2007), 28–40.
40. Yao and Zhong, Granular computing using information tables, in: Data mining, rough sets and granular computing, Springer, 2002, pp. 102–124.
41. Slowinski and Vanderpooten, A generalized definition of rough approximations based on similarity, IEEE Transactions on Knowledge and Data Engineering 12(2) (2000), 331–336.
42. Qin, Gao and Pei, On covering rough sets, in: International Conference on Rough Sets and Knowledge Technology, Springer, 2007, pp. 34–41.
43. Yao, On generalizing rough set theory, in: International Workshop on Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft Computing, Springer, 2003, pp. 44–51.
44. Wu W.-Z., Mi J.-S. and Zhang W.-X., Generalized fuzzy rough sets, Information Sciences 151 (2003), 263–282.
45. Diker, Textures and fuzzy unit operations in rough set theory: an approach to fuzzy rough set models, Fuzzy Sets and Systems 336 (2018), 27–53.
46. Coello, Fernández, Filiberto and Bello, Impact of Weight Initialization on Multilayer Perceptron Using Fuzzy Similarity Quality Measure, in: Workshop on Engineering Applications, Springer, 2016, pp. 115–122.
47. Fernandez, Coello, Filiberto, Bello and Falcon, Learning similarity measures from data with fuzzy sets and particle swarms, in: 2014 11th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), IEEE, 2014, pp. 1–6.
48. Tsoumakas, Spyromitros-Xioufis, Vilcek and Vlahavas, Mulan: A java library for multi-label learning, Journal of Machine Learning Research 12(Jul) (2011), 2411–2414.
49. Zhang M.-L. and Zhou Z.-H., Multilabel neural networks with applications to functional genomics and text categorization, IEEE Transactions on Knowledge and Data Engineering 18(10) (2006), 1338–1351.
50. Tsoumakas and Vlahavas, Random k-labelsets: An ensemble method for multilabel classification, in: European conference on machine learning, Springer, 2007, pp. 406–417.
51. Godbole and Sarawagi, Discriminative methods for multi-labeled classification, in: Pacific-Asia conference on knowledge discovery and data mining, Springer, 2004, pp. 22–30.
52. Boutell M.R., Luo, Shen and Brown C.M., Learning multi-label scene classification, Pattern Recognition 37(9) (2004), 1757–1771.
53. Charte, Rivera, del Jesus M.J. and Herrera, R ultimate multilabel dataset repository, in: International Conference on Hybrid Artificial Intelligence Systems, Springer, 2016, pp. 487–499.
54. Katakis, Tsoumakas and Vlahavas, Multilabel text classification for automated tag suggestion, in: Proceedings of the ECML/PKDD, Vol. 18, 2008, p. 5.
55. Duygulu, Barnard, de Freitas J.F. and Forsyth D.A., Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary, in: European conference on computer vision, Springer, 2002, pp. 97–112.
56. Klimt and Yang, The enron corpus: A new dataset for email classification research, in: European Conference on Machine Learning, Springer, 2004, pp. 217–226.
57. Charte, Rivera A.J., del Jesus M.J. and Herrera, QUINTA: a question tagging assistant to improve the answering ratio in electronic forums, in: IEEE EUROCON 2015-International Conference on Computer as a Tool (EUROCON), IEEE, 2015, pp. 1–6.
58. Herrera, Charte, Rivera A.J. and Del Jesus M.J., Multilabel classification, in: Multilabel Classification, Springer, 2016, pp. 17–31.
59. Bermejo and Cabestany, A batch learning vector quantization algorithm for nearest neighbour classification, Neural Processing Letters 11(3) (2000), 173–184.
60. Wilson D.R. and Martinez T.R., Improved heterogeneous distance functions, Journal of Artificial Intelligence Research 6 (1997), 1–34.
61. Wang, Zhao and Hua X.-S., A transductive multi-label learning approach for video concept detection, Pattern Recognition 44(10–11) (2011), 2274–2286.
62. Zhang, Burer and Street W.N., Ensemble pruning via semi-definite programming, Journal of Machine Learning Research 7(Jul) (2006), 1315–1338.
63. Xiao, Wu Z.-C. and Chou K.-C., A multi-label classifier for predicting the subcellular localization of gram-negative bacterial proteins with both single and multiple sites, PloS One 6(6) (2011), e20592.
64. Tahir M.A., Kittler, Yan and Mikolajczyk, Kernel discriminant analysis using triangular kernel for semantic scene classification, in: 2009 Seventh International Workshop on Content-Based Multimedia Indexing, IEEE, 2009, pp. 1–6.
65. Tsoumakas, Papadopoulos, Qian, Vologiannidis, D’yakonov, Puurula, Read, Švec and Semenov, WISE 2014 challenge: Multi-label classification of print media articles to topics, in: International Conference on Web Information Systems Engineering, Springer, 2014, pp. 541–548.
66. Alazaidah and Ahmad F.K., Trending challenges in multi label classification, International Journal of Advanced Computer Science and Applications 7(10) (2016), 127–131.
67. Liu, Hussain, Tan and Dash, Discretization: an enabling technique, data mining and knowledge discovery, vol. 6(4), Springer, Netherland, 2002.
68. Pawlak, Rough sets, International Journal of Computer & Information Sciences 11(5) (1982), 341–356.
69. Pedrycz, Skowron and Kreinovich, Handbook of granular computing, John Wiley & Sons, 2008.
70. Zadeh L.A., Fuzzy sets and information granularity, Advances in Fuzzy Set Theory and Applications 11 (1979), 3–18.
71. Chen X.-j., Zhan Y.-z. and Chen X.-b., Complex video event detection via pairwise fusion of trajectory and multi-label hypergraphs, Multimedia Tools and Applications 75(22) (2016), 15079–15100.
72. Lin, Granular computing: From rough sets and neighborhood systems to information granulation and computing in words, in: European congress on intelligent techniques and soft computing, 1997, pp. 1602–1606.
73. Chen, Sun and Zhou, Granular Rough Theory: A representation semantics oriented theory of roughness, Applied Soft Computing 9(2) (2009), 786–805.
74. Derrac, Triguero, García and Herrera, Integrating instance selection, instance weighting, and feature weighting for nearest neighbor classifiers by coevolutionary algorithms, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42(5) (2012), 1383–1397.
75. Dubois and Prade, Twofold fuzzy sets and rough sets – Some issues in knowledge representation, Fuzzy Sets and Systems 23(1) (1987), 3–18.
76. Dubois and Prade, Rough fuzzy sets and fuzzy rough sets, International Journal of General System 17(2–3) (1990), 191–209.
77. Rojas, Neural Networks: A Systematic Introduction, Springer-Verlag, Berlin, Heidelberg, 1996. ISBN 3540605053.