Abstract
Data reduction techniques play a key role in instance-based classification to lower the amount of data to be processed. Prototype generation aims to obtain a reduced training set in order to obtain accurate results with less effort. This translates into a significant reduction in both algorithms’ spatial and temporal burden. This issue is particularly relevant in multi-label classification, which is a generalization of multiclass classification that allows objects to belong to several classes simultaneously. Although this field is quite active in terms of learning algorithms, there is a lack of data reduction methods. In this paper, we propose several prototype generation methods from multi-label datasets based on Granular Computing. The simulations show that these methods significantly reduce the number of examples to a set of prototypes without significantly affecting classifiers’ performance.
Introduction
Classification is one of the most popular Data Mining topics. Its aim is to learn from labeled patterns a model able to predict the decision class for future, never seen before, data samples [1]. The best way to solve a classification problem is usually to have as much information as possible. In practice, however, this is not always the case. Sometimes, the performance may decrease due to the abundance of information since many examples may be irrelevant to the resolution of the problem or may provide the same information [2, 3].
Some algorithms, such as the ones found in examples-based learning, use a training set to estimate the class label, which causes scalability problems when the size of the training set increases. In this case, the number of training objects affects the computational cost of the method [4, 5]. The nearest neighbor rule is an example of high computational cost method when the number of examples is large [6]. The most popular algorithm in this category is kNN [7]. The computational complexity of kNN is O(NM), where
However, it is possible to reduce or modify the datasets without affecting the learning process, thus improving the efficiency by reducing the computational cost. One approach to do that is the classification based on the Nearest Prototype (NP) [8, 9]. In this approach, the decision class of a new object is calculated by analyzing its proximity to a set of prototypes selected or generated from the initial set of objects. Several strategies exist to reduce the number of examples in the input data into a set of representative prototypes. Overall, we can say that the data reduction methods (with respect to the objects in the training dataset) are divided into two categories: prototypes selection[10, 11] and prototypes generation[12, 2, 13, 14]. Prototype selection algorithms choose a set of representative objects according to a well-defined criterion, while prototype generation algorithms create synthetic objects from the initial set of objects. In the literature concerning single-label learning, several interesting papers [6, 5, 12] elaborating on this issue have been proposed.
As mentioned above, Multi-Label Classification (MLC) is a type of classification task in which each object is associated with a binary vector of outputs, instead of being associated with a single value [15, 16]. In the literature, the ML-kNN algorithm appears as a very popular alternative, which uses the kNN rule in multi-label prediction [17]. This method finds the k nearest neighbors in the datasets by relying on the maximum a posteriori principle in order to determine the label set of the test object. The solution is based on the prior and posterior probabilities associated with each label within the k nearest neighbors. This method has the same drawback as kNN: as the dataset increases, so does the algorithm’s computational cost. This happens because we need to scan the whole training set every time we process an object.
One relevant work related to NP in the field of multi-label classification is the kNNc method described in [18]. This algorithm works in two stages, by combining prototype selection techniques with example-based classification. Firstly, the algorithm obtains a reduced set of objects by using prototype selection techniques as done in traditional pattern classification [19]. The goal of this stage is to determine the set of labels which are nearest to the ones in the object to be classified. Secondly, the method uses the full set of samples but limiting the prediction to the labels inferred in the previous step.
The method introduced in [18] is a prototype selection method. However, in this paper, we propose a different alternative as we develop several methods to generate prototypes in MLC datasets that are independent from the learning paradigm. Aiming at reaching such an independence, we rely on Granular Computing [20, 21, 22] and two different ways of granulating the information space. Being more precise, two classical granulation approaches are the condition granulation and the decision granulation, which granulate the universe according to conditional attributes and decision classes, respectively. Our methods use these granulation approaches to derive a set of representative objects that replaces the original training set. In this way, a significant reduction in the number of examples is achieved, thus enhancing the efficiency of the classification methods that rely on the instance-based learning principle.
In the case of the first two methods, the granulation of a universe is performed by using a similarity relation that produces similarity classes (or granules) of objects in the universe from conditional attributes. Using similarity relations enables our methods to be used in presence of mixed data, i.e., when there are both numerical and nominal attributes. On the other hand, the third method granulates the universe of discourse by means of an equivalence relation, and taking into account the different labels attached with the problem. Therefore, we obtain an equivalence class for each label, which is used the generate a prototype. In the last method, we use a fuzzy similarity relation instead of a binary relation, which allows for the development of more flexible models without the need for additional parameters. As a result, for each object in the universe, a fuzzy set can be derived from its similarity to other objects. Experimental results using different benchmark problems showed that the proposed methods achieve a remarkable reduction of the training set, while preserving or improving the efficacy of some multi-label classifiers.
The paper is organized as follows. Section 2 goes around the prototype generation concept, and Section 3 presents the theoretical background on Granular Computing. Section 4 introduces the prototype generation methods for MLC datasets, and Section 5 is dedicated to evaluating the performance of some of state-of-the-art MLC algorithms on the set of generated prototypes. Finally, in Section 6 we provide some concluding remarks.
A brief introduction to prototype generation
Prototype generation techniques are devoted to creating a new set of labeled synthetic objects that replace the initial training set. Under the data reduction paradigm, this new set is expected to be smaller than the original one while having better decision boundaries. Let us consider an object
The prototype generation approaches can be divided into several families depending on the main heuristic operation followed [23, 24]. The first approach that we can find in the literature, called PNN (Prototypes for Nearest Neighbor) [25], belongs to the family of methods that carry out a merging of prototypes of the same class in successive iterations, thus generating centroids. Other well-known methods are those based on a divide-and-conquer scheme. They decompose the N-dimensional space into two or more subspaces with the purpose of simplifying the problem at each step [26]. Other proposals that follow a similar approach include MixtGauss [13], RSP (Reduction by Space Partitioning) [27], and ICPL (Integrated Concept Prototype Learner) [28].
Another approach consists in adjusting the position of the prototypes that can be viewed as an optimization process. The main algorithm belonging to this family is LVQ (Learning Vector Quantization) [29]. LVQ can be understood as an artificial neural network in which a neuron corresponds with a prototype. Several approaches have been proposed to modify the LVQ method. For example, the third version of this algorithm –termed LVQ3– reported the best results. On the other hand, the HYB algorithm [30] combines support vector machines with LVQ3, and executes a search in order to find the most promising parameters of LVQ3. Also, the LVQPRU algorithm [31] extends LVQ by using a pruning step to remove noisy objects.
As a positioning adjustment of prototypes technique, a genetic algorithm called ENPC (Evolutionary Approach based on Nearest Prototype Classifier) was proposed for prototype generation in [32]. This algorithm executes different operators in order to find the most suitable position of the prototypes. PSO (Particle Swarm Optimization) was proposed for prototype generation in [33, 34], and they also belong to the positioning adjustment of prototypes category of methods. The main difference between them is the type of codification of the particles. Also, evolutionary algorithms have been used to tackle this problem [32, 35, 36]. In reference [24], a methodology is presented to learn iteratively the positioning of prototypes using real parameter optimization procedures. They propose an iterative prototype adjustment technique based on differential evolution.
Granular computing
Without losing generality, we can say that the issues of Granular Computing can be seen from two related angles, the construction of granules and the reasoning with granules. The former deals with the formation, representation, and interpretation of granules, while the latter deals with the exploitation of granules in problem solving [37, 38]. When building the granules, it is necessary to study the criteria for deciding whether or not two elements should be put together into the same granule, based on the available information.
Typically, elements in a granule are gathered together by indistinguishability, similarity, proximity or functionality. The granulation of an information system
These granules could be a partition or a covering of the universe. In both cases, a set of granules is formed. When each object of
Equivalence relation
An equivalence relation is a very common type of indiscernibility relation (
Hard similarity relation
For some domains, the use of equivalence relations could lead to that two inseparable objects being incorrectly labeled as separable, thus making the relationship to be excessively strict [40]. This problem can be alleviated in some extent by extending the concept of inseparability relation [41], and replacing the equivalence relation with a weaker binary relation. Equation (1) shows an indiscernibility relation, where
This similarity relation states that the objects
Using a user-specified similarity threshold parameter in Eq. (1) makes the information granule to be hard, i.e. the object either belongs to the information granule or not. An alternative to suppress this parameter is to use fuzzy sets. As a matter of fact, fuzzy sets provide another way of relaxing the inseparability concept when processing the universe objects. In this way, the similarity relation is replaced with a fuzzy similarity relation.
This relation replaces a hard membership with more flexible membership, i.e. sets to which all objects belong to some degree [44, 45]. Being more precise, instead of determining the indiscernibility between objects, an approximate equality is established. The approximate equality between objects is modeled through a fuzzy relation that assigns to each pair of objects in
Generation of MLC prototypes
In this section, we propose four prototype construction methods for multi-label training sets. Let
Prototype generation from a universe granulation by condition
The rationale behind the first two algorithms to be described in this subsection follows the same principle explained above. They are iterative methods that build similarity classes by using the similarity relation defined in Eq. (1). An object may belong to several similarity classes at the same time. However, when an object is included in a similarity class, it will not be taken into account when building a new similarity class from it. Each similarity class consists in a granule that will be used later on to build the prototype.
The generated prototypes are synthetic instances composed of both conditional and label attributes. To derive the information by condition (attribute values) and by decision (label values), an aggregation operation is used. In the case of conditional attributes, the average can be used as the aggregation operator if the attribute is numeric, or the mode if the attribute is nominal. The way in which the labels in the prototype are derived differs in the algorithms, as they use different concepts of decision class.
Therefore, we must define what is considered to be a decision class in the MLC context. This is done as follows:
Each combination Each label (
In the case of Algorithm 4.1, each combination of labels represents a decision value, while Algorithm 4.1 considers each label independently as a decision value. In this way, the first algorithm builds each decision class from the most common combination of labels in the granule, while the second algorithm does it by taking into account the labels independently. In this latter case, the prototype will have as decision values the most common labels in the granule.
[!ht] GP1mlTS[1] Initialize objects’ counter
Construct the similarity class
Construct a prototype
in
combinations of the objects in
Return PrototypeSet[!ht] GP2mlTS[1] Initialize objects’ counter
Construct the similarity class
Construct a prototype
in
are labeled with that label, otherwise
Return PrototypeSet
The third algorithm performs a decision-based granulation of the universe. Overall, this method builds the information granules by taking into consideration the labels instead of the condition attributes. The universe granulation produced by this algorithm is a covering since an object can belong to two or more granules. The algorithm builds a granule for each label, thus including all objects that are labeled with that decision label. The aggregation procedure to derive the prototype for each granule is done as described the previous subsection. This whole procedure is formalized in Algorithm 4.2.
[!ht] GP3mlTS[1]
Construct the equivalence class
Construct a prototype
in
are labeled with that label, otherwise
Return PrototypeSet
Generation of prototypes by using a fuzzy similarity relation
A key issue when constructing similarity classes concerns with the proper estimation of the similarity threshold parameter, which defines whether two objects are similar or not. Small variations on the granularity degree may lead to quite different outcomes. Determining the exact granularity degree is not trivial, since higher values do not necessarily lead to optimal prediction rates while smaller values might result in a significant accuracy lose.
To suppress this parameter, we propose a forth algorithm that uses a fuzzy similarity relation. This implies that all objects in
where
In short, the set
[!ht] GP4mlTS[1] Initialize objects’ counter
PrototypeSet
Construct the set
Construct a prototype
in
are labeled with that label, otherwise
Return PrototypeSet
In this section, we explore the performance of our prototype generation methods when coupled with state-of-the-art multi-label classifiers.
Experimental setup
The MLC state-of-the-art classifiers used in this section are included the MULAN library [48]. Specifically, we use the following algorithms:
The ML-kNN algorithm [17] is an adaptation of the
The a priori probability of each label, which is defined as the number of times each label appears in the multi-label dataset divided by the number of objects in the dataset. A smoothing factor is applied to avoid multiplying by zero. The conditional probabilities for each label, which is computed as the proportion of k-nearest neighbors having that label with respect to all objects associated with the target label. These probabilities are independently computed for each label, facing the task as a collection of individual binary problems. Therefore, the potential dependencies among labels are fully dismissed by this algorithm. After completing this training process, the classifier is able to predict the labels for new objects. The classification goes as follows:
First, the k-nearest neighbors of the given sample are obtained. The Euclidean distance is used to measure the (dis)similarity between the reference object and the samples in the multilabel dataset. Then, the presence of each label in the neighbors is used as the evidence to compute maximum a posteriori (MAP) probabilities from the conditional ones obtained before. Finally, the label set of the new sample is generated from the MAP probabilities. The probability itself is provided as a confidence level for each label, thus making possible to also generate a label ranking. The foundation behind the BP-MLL algorithm [49] is to adapt a feed-forward neural network to deal with multi-label data, where the error backpropagation strategy is employed to minimize a global error function capturing the correlation among the labels. The key aspect in BP-MLL is the introduction of a new error function, which is computed taking into account the fact that each sample contains several labels. Specifically, this new function penalizes the predictions including labels which are not truly relevant for the processed object. In the BP-MLL method, the input layer has as many neurons as features describing the multi-label dataset. The number of units in the output layer is determined by the number of labels, while the number of neurons in the hidden layer is also influenced by the number of labels. This algorithm produces a label ranking as result while classifying new objects. A threshold parameter decides which labels will be deemed as relevant.
Random k-Labelsets, RAkEL [50] is a method that generates random subsets of labels, thus training a multiclass classifier for each subset. RAkEL involves two essential parameters,
We use the same default parameter settings provided by the MULAN library. It should be highlighted that these default parameter values are common across all datasets, thus no algorithm performs hyperparameter tuning.
To perform the numerical simulations, we take 12 multi-label datasets from the well-known RUMDR [53] repository. Table 1 summarizes the number of objects, attributes, and labels for each dataset. In the adopted datasets, the number of objects ranges from 1,675 to 10,491, the number of attributes from 294 to 1,836, and the number of labels from 6 to 400. A more description of each of the studies case is given below:
bibtex: This dataset was introduced in [54] and contains the metadata for bibliographic entries. The words in the title, authors names, journal name, and publication date were taken as input attributes. The data origin is Bibsonomy, a specialized social network where the users can share bookmarks and BibTeX entries assigning labels to them. The bag-of-words (BoW) model is used to represent the documents, so all features are binary and indicate whether a certain term is relevant to the document or not.
corel5k: this dataset involves thousands of images, which were categorized into several groups [55]. In addition, each picture is associated with a set of words describing its content (i.e., the labels). These pictures were segmented by the authors by using the normalized cuts method, thus generating a set of blobs associated with one or more words. The input features are the vectors resulting from the segmentation process.
enron: the Enron corpus is a large set of email messages, with more than half a million entries, from which a dataset for automatic folder assignment research was generated [56]. The enron dataset is a subset of the previous dataset. Each one has as input features a BoW model obtained from the emails fields, such as the subject and the body of the message. The labels are the folders in which each message was stored.
scene: this dataset is also related with image labeling, specifically to scene classification. The set of pictures was taken from the Corel dataset and some personal ones by the authors [52] were also included. The images are transformed to the CIE Luv color space, known for being perceptually uniform, and latter segmented into 49 blocks
stackexchange: the case study in [57] is a tag suggestion task for questions posted in specialized forums, specifically forums from the Stack Exchange network. Six datasets were generated from forums devoted to topics such as cooking, computer science, and chess. The title and body of each question was text-mined, thus producing a frequency BoW. The tags assigned by the users to their questions were used as the labels.
Summary of the MCL datasets used in our study
In order to measure performance obtained by the MLC classifiers, we will use Hamming Loss (HL) metric, which is a well-known performance measure in MLC scenarios [58]. This metric is defined as follows,
where
During the numerical simulations, we also studied the efficiency in terms of the reduction coefficient [59]. This measure, depicted in Eq. (4), quantifies how much the number of objects is reduced, i.e. the proportion between the size of the set of prototypes
In this experiment, the similarity threshold
where
It is worth mentioning that, for each datasets, we have estimated the HL value by using a 10-fold cross validation scheme. For each fold, this procedure splits the whole training set into two data pieces, namely, the training set and the test set. Only the training set is used to generate the set of prototypes and derive the learning model. The test set is never modified so that it only serves to compute the HL associated with the current fold.
Figure 1 displays the reduction coefficient achieved once the proposed prototype generation methods are used on each dataset. The reader can note that our methods achieve a reduction rate higher than 20% in most problems. The GP1mlTS and GP2mlTS methods have a similar behavior. However, the GP3mlTS method reports reduction rates above 90% in most datasets.
Dataset reduction (%) achieved by each algorithm.
Figure 2 displays the HL values achieved by the ML-kNN method with the original multi-label datasets, and the results obtained after using the set of prototypes generated by three of the methods proposed in this paper. The results show that the prototypes generated for each dataset leads to HL values similar to those obtained with the original dataset. Only in the case of the scene dataset there is a significant difference in the HL value, especially when we use the set of prototypes generated by the GP3mlTS method with respect to the original dataset. This is due to the fact that this dataset has few labels (exactly 6 labels), thus only a few prototypes are generated.
HL values achieved by the ML-kNN method.
Performance (in term of HL values) of the ML-kNN method from several prototype sets built with different similarity thresholds.
Dataset reduction (%) obtained with GP1mlTS and GP2mlTS algorithms with different similarity thresholds.
The results shown above have been achieved with a fixed similarity threshold of 0.85. Figure 3 shows how the performance of the ML-kNN algorithm fluctuates when the similarity threshold value oscillates between 0.75 and 0.95. This variation is also shown in Fig. 4 but taking into account the reduction coefficient obtained using GP1mlTS and GP2mlTS. Both figures illustrate how the similarity threshold influences the results of both algorithms, as expected. Overall, when the value of the similarity threshold is increased, better values for HL are achieved; however, the reduction percent are lower. A suitable trade-off between the HL value and the reduction rate across all datasets is obtained when using a similarity threshold value equal to 0.85.
Reduction rates produced by GP2mlTS and GP4mlTS algorithms.
HL values obtained by GP2mlTS and GP4mlTS algorithms.
In addition, we compared GP2mlTS and GP4mlTS themselves as we wanted to verify the behavior of GP2mlTS when using fuzzy sets instead of similarity classes in the construction of the prototypes. Figure 5 illustrates the prototype reduction achieved for each dataset from these two algorithms. Figure 6 shows the HL values obtained by ML-kNN when replacing the original training set with the granular one containing the generated prototypes.
The results suggest that the GP4mlTS algorithm is able to obtain significant reduction percentages, i.e. over 80% in most of the study cases. More importantly, the efficacy of the ML-kNN algorithm is preserved.
In this subsection, we analyze the performance of other multi-label classifiers on the generated prototypes. Figures 7 and 8 show the average HL values obtained by the BP-MLL and RAkEL algorithms, respectively, when GP1mlTS, GP2mlTS and GP4mlTS are used to derive the prototypes. We adopted these generation method since they reported a better trade-off between reduction and efficacy in the numerical simulations conducted above.
Average HL values achieved by the BP-MLL method.
Average HL values achieved by the RAkEL method.
The simulation results show that the granular training sets help improve the performance of the BP-MLL algorithm in all cases. Likewise, the GP1mlTS, GP2mlTS and GP4mlTS algorithms provide a significant reduction in the dataset size (see Figs 1 and 5 in the previous subsection) while improving the HL values. In addition, the best trade-off between the prediction and the reduction rates is achieved with the GP4mlTS algorithm.
In the case of the RAkEL method, both the GP1mlTS and GP2mlTS methods produce HL values similar to the ones produced with the original dataset. This indicates that the efficacy is not compromised even when there is a significant reduction in the training set size. However, the RAkEL method seems to be quite sensitive to the GP4mlTS algorithm as the HL value increases. Therefore, using both algorithms together is not encouraged.
Despite the extensive work done in multi-label classification, as far as we know, the topic of prototype generation has not received much attention. In that regard, this paper proposed several methods based on Granular Computing for the generation of prototypes. These information constructs are built to represent the knowledge comprised into the training set. The first two methods build the granules from similarity relations among the objects, while the third method uses an equivalence relation to create the granules. In addition, the fourth proposed method uses fuzzy similarity relations to suppress the need to establish a similarity threshold when granulating the information space.
The prototypes generated by these methods replace the objects in the training set, which is an important component in supervised learning. The numerical simulations have shown that the proposed methods achieve a significant reduction when it comes to the dataset size, while also preserving the efficacy of the multi-label classifiers adopted for simulation. The GP3mlTS algorithm achieved the highest reduction, thus improving the efficiency of classifiers such as ML-kNN, BP-MLL and RAkEL. However, the overall accuracy was slightly affected. The GP1mlTS and GP2mlTS algorithms achieved a fair trade-off between efficiency and efficacy, although they are sensitive to the similarity threshold parameter. The GP4mlTS algorithm does not have this limitation and emerged as the most suitable alternative in our research.
