A lazy feature selection method for multi-label classification

Abstract

In many important application domains, such as text categorization, biomolecular analysis, scene or video classification and medical diagnosis, instances are naturally associated with more than one class label, giving rise to multi-label classification problems. This has led, in recent years, to a substantial amount of research in multi-label classification. More specifically, feature selection methods have been developed to allow the identification of relevant and informative features for multi-label classification.

This work presents a new feature selection method based on the lazy feature selection paradigm and specific for the multi-label context. Experimental results show that the proposed technique is competitive when compared to multi-label feature selection techniques currently used in the literature, and is clearly more scalable, in a scenario where there is an increasing amount of data.

Keywords

Multi-label classification, data mining, feature selection

1. Introduction

A large body of research in supervised learning deals with the analysis of single-label data, where each instance is associated with only one class label from a fixed set of possible class labels. However, in many data mining applications, an instance may be associated with more than one class label. This characterizes the multi-label (ML) classification task, which has become a common real-world task [32] and a relevant topic of research.

Classification strategies that handle multi-label data can be divided into two groups: transformation and adaptation strategies. Transformation strategies, such as Label Powerset and Binary Relevance transformations, convert the multi-label data into single-label data and then use traditional single-label classifiers. Adaptation strategies adapt or extend single-label classifiers to cope with multi-label data directly, such as the Multi-Label $K$ -Nearest Neighbors (ML-KNN) [32], the ML Naive Bayes classifier [31], ML Decision Tree [3], among others [27].

The performance of a classification method is closely related to the quality of the training data. Redundant and irrelevant features may not only decrease the model’s accuracy but also make the process of building the model or running the classification algorithm slower. Feature selection is traditionally a data preprocessing step which aims at identifying relevant features for a target data mining task – particularly in this paper, the multi-label classification task.

There is an extensive literature regarding feature selection for single-label classification, which has been summarized in surveys such as [5, 7]. Given the increasing popularity of multi-label classification and the challenge of selecting features in this context, there has been significant research specifically on feature selection for multi-label classification [15]. Most methods proposed for this task rely on the transformation of the multi-label data set into a single-label one. This can cause the loss of some information present in the original multi-label data set related to the dependence between labels, an issue previously studied in multi-label learning [23].

In this work, we propose and evaluate a new method for multi-label feature selection. This method has two main characteristics: (a) it is a lazy multi-label feature selection method and was designed based on the single-label lazy feature strategy proposed in [17]; and (b) it uses the information gain (IG) measure that was previously adapted for the multi-label setting in [14].

The remainder of this paper is organized as follows. In Section 2, we revisit the multi-label classification problem. In Section 3, we describe the multi-label feature selection process and past work in the area. In Section 4, we describe our proposed multi-label feature selection technique. In Section 5, we present the experiments that compare our proposal with methods currently used in the literature. Finally, in Section 6, we make our concluding remarks and point to directions for future research.

2. Multi-label classification

In the multi-label classification task, each data instance may be associated with multiple labels. Multi-label classification is suitable for many domains, such as text categorization, scene or video classification, medical diagnosis, bioinformatics and microbiology. In all these applications, the task is to assign to each unseen instance a label set whose size is unknown a priori [32].

Strategies proposed to deal with multi-label classification rely mainly on problem transformation, where the multi-label problem is transformed into one or more single-label problems; or on algorithm adaptation, where single-label learning algorithms are adapted to handle multi-label data directly.

The simplest way to apply a classification algorithm to multi-label data is to transform them into single-label data. Then, a traditional classification technique – such as $k$ -NN, decision tree or SVM – can be employed to perform the classification task. The advantage of using a transformation technique is allowing the usage of one or more single-label classification algorithms for the learning task which have been thoroughly studied and perfected over the last decades.

A simple transformation technique used to convert a multi-label data set into a single-label one consists of copying each multi-label instance $n$ times, where $n$ is the number of labels assigned to that instance. Each copied instance is then assigned one distinct single label from the original set.
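The copy transformation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `copy_transform` and the representation of an instance as a (feature tuple, label set) pair are assumptions made for the example, not part of any library.

```python
def copy_transform(data):
    """Replicate each multi-label instance once per label it carries.

    `data` is a list of (features, label_set) pairs (an assumed representation).
    Returns a list of (features, single_label) pairs.
    """
    single_label = []
    for features, labels in data:
        # sorted() is used only to make the output order deterministic
        for label in sorted(labels):
            single_label.append((features, label))
    return single_label


example = [((1, 4), {"A", "B"}), ((2, 1), {"A"})]
print(copy_transform(example))
```

An instance with $n$ labels therefore contributes $n$ rows to the single-label data set, each sharing the same feature values.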

A popular transformation is the Label Powerset (LP) technique, which creates one label for each different subset of labels that exists in the multi-label training data set. Thus, the new set of labels is a subset of the powerset of the original set of labels, restricted to the label combinations observed in the training data. After this transformation, a single-label classification learning algorithm can handle the transformed data set and produce a classifier. This classifier can then be used to assign to new instances one of these new labels, which can be mapped back to the corresponding subset of the original labels [28].
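The Label Powerset mapping can be illustrated with the following minimal Python sketch. The function name `label_powerset_transform` and the (features, label set) representation are assumptions of the example; the integer class identifiers are arbitrary.

```python
def label_powerset_transform(data):
    """Map each distinct label subset seen in the data to one new class id.

    Returns the transformed single-label data and the reverse mapping that
    converts a predicted class id back into the original label subset.
    """
    subset_to_class = {}
    transformed = []
    for features, labels in data:
        key = frozenset(labels)
        if key not in subset_to_class:
            subset_to_class[key] = len(subset_to_class)  # next unused id
        transformed.append((features, subset_to_class[key]))
    class_to_subset = {c: s for s, c in subset_to_class.items()}
    return transformed, class_to_subset


example = [((1, 4), {"A", "B"}), ((2, 1), {"A"}), ((3, 4), {"B", "A"})]
transformed, class_to_subset = label_powerset_transform(example)
print(transformed)
print(class_to_subset)
```

Only label subsets that actually occur receive a class, so a classifier trained on the transformed data cannot predict a label combination that was never observed.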

Binary Relevance (BR) is a well-known transformation technique that produces a binary classifier for each different label of the original data set [4, 26]. In its simplest implementation, each resulting classifier is capable of predicting if a label is relevant or not for a new instance. So, each classifier handles the data as single-label, since it gives a relevance feedback for just one specific label.
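A minimal Python sketch of the data sets that Binary Relevance builds is given below. The function name `binary_relevance_transform` and the data representation are assumptions made for illustration; training the per-label binary classifiers is left out.

```python
def binary_relevance_transform(data, all_labels):
    """Build one binary data set per original label.

    Each binary data set keeps every instance and replaces its label set by
    True (label relevant) or False (label not relevant).
    """
    return {
        label: [(features, label in labels) for features, labels in data]
        for label in all_labels
    }


example = [((1, 4), {"A", "B"}), ((2, 1), {"A"})]
for label, binary_data in binary_relevance_transform(example, ["A", "B"]).items():
    print(label, binary_data)
```

Because each binary data set is built independently, any dependence between labels is invisible to the individual classifiers.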

Regarding algorithm adaptation, most traditional classifiers employed in single-label problems have been adapted to the multi-label paradigm [27]. The C4.5 decision-tree induction algorithm was adapted in [3] by allowing multiple labels in the leaves of the tree. An adaptation of the SVM algorithm was proposed in [6]. A $k$ -NN adaptation was proposed in [32]. A multi-label adaptation of the Naive Bayes algorithm was proposed in [31]. MMAC (Multi-class, Multi-label Associative Classification) follows the paradigm of associative classification, which deals with the construction of multi-label classification rule sets using association rule mining [27].

3. Related work

Feature selection techniques are primarily employed to identify relevant and informative features [7]. Besides this, there are other important motivations: improvement of the classifier predictive accuracy, reduction and simplification of the data set, acceleration of the classification task and even the construction of classification models that can be understood, analyzed, and qualitatively evaluated by domain experts [12].

Feature selection techniques intended specifically for multi-label classification have been developed recently [9, 8, 15]. Analogous to multi-label classification, the simplest way to apply feature selection to a multi-label data set is to transform it into single-label data and apply a traditional feature selection technique.

In [1], the following common simple transformation techniques have been employed to allow the application of traditional feature selection for the multi-label text categorization problem: select-max, select-min, select-random, select-ignore and copy. These strategies are used to convert a multi-label data set into a single-label one.

In [25], several multi-label classification strategies were evaluated and compared for the task of automated detection of emotion in a music data set. The Label Powerset transformation was used to produce a single-label data set, and then a common feature selection measure (the ${\chi}^{2}$ statistic) was employed to select the best features. The authors verified that, for the evaluated data set and with ML-KNN as the classifier, classification with feature selection achieved a better Hamming Loss than classification without it.

The Label Powerset transformation is also used before the feature selection process in [22], in conjunction with the relief and information gain measures. With this feature selection, it was possible to reduce the size of the data sets without compromising the classification performance.

Some text classification work [29, 13, 33] has employed the Binary Relevance technique before applying single-label feature selection measures, like information gain and ${\chi}^{2}$ statistic. For each different label in the original data set, a binary single-label data set is created, and then feature selection is executed for each one. Binary Relevance transformation is also used for feature selection in [22], in conjunction with relief and information gain measures. This feature selection strategy is compared to LP transformation using the same measures, with the conclusion that both transformation methods achieved a similar predictive performance in the experiments with data sets from various domains.

There are also recently proposed multi-label feature selection techniques that do not require transformation of the data set – the feature selection is built as an adaptation of techniques suited for the single-label paradigm, or as a wrapper-based technique.

In [10], the well-known technique FCBF (Fast Correlation-Based Filter), introduced in a previous work [30], is extended to handle multi-label data. In that work, the technique consists of transforming the data set into a directed graph and applying the symmetrical uncertainty measure to evaluate the features of the data set. This feature selection is applied to the IBLR-ML [2] and ECC [20] classifiers, and data sets from multiple domains are evaluated. In [31], a wrapper technique is used to identify the best feature set. This wrapper feature selection implements a genetic algorithm as the search component. To evaluate this method, the Multi-label Naive Bayes classifier is proposed and used as the wrapper classifier. The classification coupled with the feature selection achieved a better result, even when compared with other classifiers.

Single-label feature selection techniques were recently adapted to the multi-label paradigm. The ReliefF measure was adapted in [23] and in [18]. The Mutual Information measure was adapted in [11]. Correlation-based feature selection, capable of handling subset of features, was adapted to the multi-label setting in [9].

In [14], a comprehensive evaluation of various multi-label feature selection techniques was performed, along with an adaptation of the Information Gain metric to handle multi-label data directly. Using data sets from various domains, including large data sets, the proposed algorithm was experimentally compared to well-known transformation-based feature selection techniques coupled with multi-label classifiers. The results showed that the proposed algorithm is competitive and more scalable than the other compared techniques.

4. The lazy multi-label feature selection proposal

In conventional feature selection strategies, features are usually selected in a preprocessing phase. The features that are not selected are discarded from the data set and no longer participate in the classification process.

In [17], a lazy feature selection strategy was proposed based on the hypothesis that postponing the selection of features to the moment at which an instance is submitted for classification can contribute to identifying the best features for the correct classification of that particular instance. For each different instance to be classified, it is possible to select a distinct and more appropriate subset of features to classify it. The technique was implemented for single-label data.

Table 1 presents a multi-label example that could benefit from a lazy strategy of feature selection. The data set, composed of two features – $X$ , $Y$ – and the multi-label class, is represented twice. The left occurrence is ordered by the values of $X$ and the right one is ordered by the values of $Y$ .

Table 1
Multi-label data set example

Data set sorted by $X$			Data set sorted by $Y$
X	Y	Labels	X	Y	Labels
1	1	A	1	1	A
1	2	B	2	1	A
1	3	B	3	1	B, C
1	4	A, B	4	1	B
2	1	A	1	2	B
2	2	A	2	2	A
2	3	A	3	2	B, C
2	4	A, B	4	2	A, C
3	1	B, C	1	3	B
3	2	B, C	2	3	A
3	3	B, C	3	3	B, C
3	4	A, B	4	3	A
4	1	B	1	4	A, B
4	2	A, C	2	4	A, B
4	3	A	3	4	A, B
4	4	A, B	4	4	A, B

It can be observed in the left occurrence that the values of $X$ are strongly correlated with at least one label value. The instances with $X=$ 1 do not have the label $C$ among their labels. When $X=$ 2, the label $A$ is always present, and the label $C$ is not. For the instances with $X=$ 3, the label $B$ is always present. The value $X=$ 4 is the only one that is not strongly correlated with any label. Nonetheless, $X$ is a useful feature, because most of its values are correlated with at least one label (or its absence).

On the other hand, as shown in the right occurrence, the values $Y=$ 1, $Y=$ 2 and $Y=$ 3 do not have a strong correlation with any label. $Y$ would be a strong candidate to be eliminated, since most of its values do not properly discriminate a label. However, there is a strong correlation between the value 4 of feature $Y$ and the three label values: $A$ , $B$ and the absence of $C$ . This correlation would be lost if this feature were discarded. The multi-label classification of an instance with value 4 in the $Y$ feature would clearly take advantage of the presence of this feature.

So, we present in this small multi-label example a motivation for postponing the selection of features to the moment at which an instance is submitted for classification. This way, the selection can take a more informed decision on which features to keep in the data set.

Lazy feature selection is a general strategy, since it can employ different evaluation measures to assess the quality of the features. In this work, we propose to instantiate the strategy analogously to the original work [17] – which deals with the single-label scenario – using an entropy-based criterion to rank features [29]. This entropy measure was extended to the multi-label setting in [3], where the C4.5 algorithm was adapted for handling multi-label data. This decision tree algorithm allows multiple labels at the leaves of the tree by using an adaptation of the entropy measure, described by Eq. (1), for computing the entropy of the label distribution in a multi-label data set $D$ with features $X_{1},X_{2},\ldots,X_{d}$ .

$\displaystyle\textit{Ent.ML}(D)=-\sum^{l}_{i=1}\left[p(\lambda_{i})\ast\log_{2}p(\lambda_{i})+q(\lambda_{i})\ast\log_{2}q(\lambda_{i})\right],$ (1)

where $p(\lambda_{i})$ is the probability that an arbitrary instance in the multi-label data set $D$ belongs to class label $\lambda_{i}$ , $q(\lambda_{i})=1-p(\lambda_{i})$ , and $l$ is the number of labels in the data set.
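As an illustrative check of Eq. (1), computed here for the toy data of Table 1 and not part of the reported experiments, note that labels $A$ and $B$ each occur in 10 of the 16 instances and label $C$ occurs in 4. Writing $h(p)=-p\log_{2}p-(1-p)\log_{2}(1-p)$ for the per-label term (a shorthand introduced only for this example), Eq. (1) gives

$\displaystyle\textit{Ent.ML}(D)=2\,h(10/16)+h(4/16)\approx 2\times 0.9544+0.8113\approx 2.7201.$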

The entropy of the label distribution in $D$ restricted to the values of feature $X_{j}$ , $1\leqslant j\leqslant d$ , where $d$ is the number of features in $D$ , is denoted by $\textit{Ent.ML}(D,X_{j})$ and defined by Eq. (2).

$\displaystyle\textit{Ent.ML}(D,X_{j})=\sum^{d_{j}}_{i=1}\left[\left(\frac{|D_{ji}|}{|D|}\right)\ast\textit{Ent.ML}(D_{ji})\right],$ (2)

where $D_{ji}$ , $1\leqslant i\leqslant d_{j}$ , is the partition of $D$ composed of all instances whose value of feature $X_{j}$ is equal to $x_{ji}$ , $d_{j}$ is the number of different values of $X_{j}$ and $x_{ji}$ is one of these values.
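Equations (1) and (2) can be checked numerically on the toy data of Table 1 with the Python sketch below. The function names (`binary_entropy`, `ent_ml`, `ent_ml_given_feature`) and the row layout are assumptions made for this example, and the convention $0\log_{2}0=0$ is assumed for labels that are always present or always absent in a partition.

```python
from collections import defaultdict
from math import log2

# Table 1, one row per instance: (X, Y, label set)
TABLE_1 = [
    (1, 1, {"A"}), (1, 2, {"B"}), (1, 3, {"B"}), (1, 4, {"A", "B"}),
    (2, 1, {"A"}), (2, 2, {"A"}), (2, 3, {"A"}), (2, 4, {"A", "B"}),
    (3, 1, {"B", "C"}), (3, 2, {"B", "C"}), (3, 3, {"B", "C"}), (3, 4, {"A", "B"}),
    (4, 1, {"B"}), (4, 2, {"A", "C"}), (4, 3, {"A"}), (4, 4, {"A", "B"}),
]
LABELS = ("A", "B", "C")


def binary_entropy(p):
    """-(p log2 p + q log2 q) with q = 1 - p, taking 0 * log2(0) as 0."""
    return -sum(v * log2(v) for v in (p, 1.0 - p) if v > 0.0)


def ent_ml(label_sets, labels=LABELS):
    """Eq. (1): sum over all labels of the per-label binary entropy."""
    n = len(label_sets)
    return sum(binary_entropy(sum(lab in s for s in label_sets) / n) for lab in labels)


def ent_ml_given_feature(rows, feature_index, labels=LABELS):
    """Eq. (2): size-weighted average of Ent.ML over the partitions of one feature."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[feature_index]].append(row[-1])
    return sum(len(p) / len(rows) * ent_ml(p, labels) for p in partitions.values())


entropy_d = ent_ml([row[-1] for row in TABLE_1])
print(round(entropy_d, 4))
print(round(entropy_d - ent_ml_given_feature(TABLE_1, 0), 4))  # eager gain of X
print(round(entropy_d - ent_ml_given_feature(TABLE_1, 1), 4))  # eager gain of Y
```

On this toy data the entropy restricted to $X$ is approximately 1.717 and the entropy restricted to $Y$ is approximately 2.156, so a ranking based only on these two equations places $X$ ahead of $Y$.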

These equations were used in the MLInfoGain technique [14], as an adaptation of the Information Gain Ranking [19] for the multi-label context. The adaptation was able to handle multi-label data directly, and it was implemented as a filter. The MLInfoGain technique is considered “eager”, as opposed to “lazy”, because it selects the features in a data preprocessing step.

In this work, our proposal is to extend this information gain adaptation to the lazy paradigm. Each individual feature value needs to be measured separately from the others, and for each label. The entropy of the label distribution in $D$ , restricted to the value $x_{ji}$ , $1\leqslant i\leqslant d_{j}$ , of feature $X_{j}$ , $1\leqslant j\leqslant d$ , and to the label $l_{k}$ , $1\leqslant k\leqslant q$ , where $q$ is the number of labels in $D$ (denoted $l$ in Eq. (1)), is denoted by $\textit{Ent.ML}(D,X_{j},x_{ji},l_{k})$ and defined by Eq. (3).

$\displaystyle\textit{Ent.ML}(D,X_{j},x_{ji},l_{k})=\textit{Ent.ML}(D_{\textit{jik}}),$ (3)

where $D_{\textit{jik}}$ , $1\leqslant i\leqslant d_{j}$ , is the partition of $D$ composed of all instances with label $l_{k}$ whose value of feature $X_{j}$ is equal to $x_{ji}$ . For each label $l_{k}$ this equation gives a different entropy value. The closer the entropy $\textit{Ent.ML}(D,X_{j},x_{ji},l_{k})$ is to zero, the greater the chance that the value $x_{ji}$ of feature $X_{j}$ is a good discriminator for label $l_{k}$ .

Equation (4) aggregates the result for all $q$ labels in $D$ using the $\min$ function, in order to identify feature values which best discriminate at least one label.

$\displaystyle\textit{LazyEnt.ML}(D,X_{j},x_{ji})=\min_{1\leqslant k\leqslant q}\textit{Ent.ML}(D,X_{j},x_{ji},l_{k}).$ (4)

To compute the lazy multi-label information gain (MLIG) of each feature $X_{j}$ , the discrimination ability of the specific value $x_{ji}$ of $X_{j}$ (i.e., $\textit{LazyEnt.ML}(D,X_{j},x_{ji})$ ) is compared with the overall discrimination ability of feature $X_{j}$ (i.e., $\textit{Ent.ML}(D,X_{j})$ ). If the former is better (smaller), it is used to compute the lazy multi-label information gain; otherwise the latter is used. This is given by Eq. (5).

$\displaystyle\textit{LazyML.IG}(D,X_{j},x_{ji})=\textit{Ent.ML}(D)-\min[\textit{Ent.ML}(D,X_{j}),\textit{LazyEnt.ML}(D,X_{j},x_{ji})].$ (5)

The choice of considering the minimum value from both the entropy of the specific value and the overall entropy of the feature was motivated in [17] by the fact that some instances to be classified may not have any relevant features considering their particular values. In this case, features with the best overall discrimination ability will be selected. In other words, if the values of the instance to be classified do not help the lazy feature selection, then select the best features in the data set regardless, considering the eager multi-label information gain measure.

The proposed lazy adaptation of the multi-label information gain works as follows: for an entry instance $E=(e_{1},e_{2},\ldots,e_{d})$ to be classified, the value $\textit{LazyML.IG}(D,X_{j},e_{j})$ for each feature $X_{j}$ and value $e_{j}$ is computed. The features are ranked according to these scores. The filter strategy implemented in this work selects a percentage $r$ of the best features. After the feature selection phase, the multi-label classification will only use the best features according to the percentage $r$ to classify instance $E$ . For the next instance, this process is repeated. This feature selection technique is categorized as a filter adaptation.

Any lazy feature selection technique should be coupled with a lazy multi-label classifier, because the ‘lazy module’ is called at classification time for every new instance. This restriction is justified because a lazy classifier does not need to construct a model in a previous step. For instance, a decision tree classifier would not benefit from the lazy feature selection in terms of scalability, because the tree model would need to be reconstructed for every new instance. On the other hand, the $k$ -NN classifier, being a lazy classifier, does not construct a model in a previous phase. So after receiving a new instance to be classified, it could call the ‘lazy feature selection module’, select the suitable features for that instance, and then compute the neighbor distances and proceed with the classification.

In Algorithm 1 we show the pseudo-code of the proposed multi-label lazy strategy called LazyMLInfoGain. This algorithm is activated by the multi-label classifier during the classification phase. The equations presented before are indicated in the pseudo-code accordingly.

Algorithm 1. Pseudo-code for the proposed multi-label lazy feature selection – LazyMLInfoGain
Input: Training multi-label dataset $D(X_{1},X_{2},\ldots,X_{d})$
Input: Entry multi-label instance $E(e_{1},e_{2},\ldots,e_{d})$
Input: Percentage of features to be selected $r$ %
Output: Set of selected features
procedure LazyMLInfoGain
    compute entropy- $D$ , given by Eq. (1)
    for each feature $X_{j}$ in $D$ do
        compute entropy- $X_{j}$ , given by Eq. (2)
    end for
    for each value $e_{j}$ in $E$ do
        compute lazy-entropy- $e_{j}$ , given by Eq. (3)
        if lazy-entropy- $e_{j}$ $\leqslant$ entropy- $X_{j}$ then
            quality- $X_{j}$ $:=$ entropy- $D$ $-$ lazy-entropy- $e_{j}$
        else
            quality- $X_{j}$ $:=$ entropy- $D$ $-$ entropy- $X_{j}$
        end if
    end for
    Sort features $X_{j}$ using quality- $X_{j}$
    Return the first $r$ % features
end procedure

In order to build the ranking of features, we have introduced the variable quality- $X_{j}$ , whose value depends on whether the entropy for the feature’s individual value is smaller than the overall feature entropy, in which case the individual value is favored in the overall selection. This decision is defined by Eq. (4) and the information gain formula is given by Eq. (5).
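The procedure can be sketched in Python as follows. This is a sketch under stated assumptions, not the Mulan implementation used in the experiments. In particular, it assumes one reading of Eqs. (3) and (4): for a value $e_{j}$, the per-label binary entropy is computed among the training instances whose feature $X_{j}$ equals $e_{j}$, and the minimum over labels is taken. The fallback to the eager score for a value never seen in training and the rounding of $r$ % to a feature count are also assumptions of the sketch, as are all function and variable names.

```python
from math import log2


def h(p):
    """Binary entropy of a label frequency, taking 0 * log2(0) as 0."""
    return -sum(v * log2(v) for v in (p, 1.0 - p) if v > 0.0)


def label_entropies(label_sets, labels):
    """Per-label binary entropies within a group of instances."""
    n = len(label_sets)
    return [h(sum(lab in s for s in label_sets) / n) for lab in labels]


def lazy_ml_info_gain(X, Y, labels, instance, r):
    """Return (indices of the top r% features, quality of every feature)."""
    n, d = len(X), len(instance)
    entropy_d = sum(label_entropies(Y, labels))                      # Eq. (1)
    quality = []
    for j in range(d):
        parts = {}
        for x, y in zip(X, Y):
            parts.setdefault(x[j], []).append(y)
        entropy_xj = sum(len(p) / n * sum(label_entropies(p, labels))
                         for p in parts.values())                   # Eq. (2)
        matching = parts.get(instance[j])
        # Assumed reading of Eqs. (3)-(4): best per-label entropy for value e_j.
        # Assumption: an unseen value falls back to the eager score.
        lazy_entropy = min(label_entropies(matching, labels)) if matching else entropy_xj
        quality.append(entropy_d - min(entropy_xj, lazy_entropy))    # Eq. (5)
    ranked = sorted(range(d), key=lambda j: quality[j], reverse=True)
    keep = max(1, round(d * r / 100))  # assumption: keep at least one feature
    return ranked[:keep], quality


# Table 1 as feature vectors (X, Y) and label sets
ROWS = [
    (1, 1, {"A"}), (1, 2, {"B"}), (1, 3, {"B"}), (1, 4, {"A", "B"}),
    (2, 1, {"A"}), (2, 2, {"A"}), (2, 3, {"A"}), (2, 4, {"A", "B"}),
    (3, 1, {"B", "C"}), (3, 2, {"B", "C"}), (3, 3, {"B", "C"}), (3, 4, {"A", "B"}),
    (4, 1, {"B"}), (4, 2, {"A", "C"}), (4, 3, {"A"}), (4, 4, {"A", "B"}),
]
FEATURES = [(x, y) for x, y, _ in ROWS]
LABEL_SETS = [s for _, _, s in ROWS]

print(lazy_ml_info_gain(FEATURES, LABEL_SETS, ("A", "B", "C"), (4, 4), 50))
print(lazy_ml_info_gain(FEATURES, LABEL_SETS, ("A", "B", "C"), (2, 1), 50))
```

Under these assumptions, the instance $(X=4, Y=4)$ keeps feature $Y$ (index 1), whereas the instance $(X=2, Y=1)$ keeps feature $X$ (index 0), which matches the motivation given for Table 1.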

The features not selected in a lazy fashion for a given instance need not be removed from the dataset; they are simply disregarded by the classifier when it runs. This is more efficient in terms of memory and running time than removing the features each time, as each new instance will almost certainly make use of a different subset of features.
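The idea of disregarding features instead of removing them can be illustrated with a distance function that reads only the selected feature indices. The sketch below assumes a Euclidean distance, which is an assumption of the example, and the names `masked_euclidean` and `k_nearest` are hypothetical.

```python
from math import sqrt


def masked_euclidean(a, b, selected):
    """Euclidean distance computed only over the feature indices in `selected`."""
    return sqrt(sum((a[j] - b[j]) ** 2 for j in selected))


def k_nearest(train_X, query, selected, k):
    """Indices of the k training instances closest to `query` on the selected features."""
    order = sorted(range(len(train_X)),
                   key=lambda i: masked_euclidean(train_X[i], query, selected))
    return order[:k]


train_X = [(0, 0), (0, 10), (5, 0)]
print(k_nearest(train_X, (0, 9), selected=[0, 1], k=1))  # both features used
print(k_nearest(train_X, (0, 9), selected=[0], k=1))     # second feature disregarded
```

The training data are never copied or modified; only the list of selected indices changes from one test instance to the next.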

The goal of this novel technique is to benefit the multi-label classification in terms of predictive accuracy, as achieved by the lazy single-label adaptation in [17], and in terms of scalability, as achieved by the direct information-gain adaptation in [14]. This lazy multi-label adaptation is compared experimentally with other feature selection techniques used in the literature and the results are reported in the next section.

5. Computational experiments

The goal of these experiments is to evaluate if the proposed multi-label lazy feature selection achieves a competitive result when compared with other feature selection techniques in the multi-label paradigm coupled with traditional multi-label classifiers.

The multi-label lazy feature selection (Lazy MLInfoGain) was implemented in Mulan [27]. Mulan is an open-source Java library for learning from multi-label data sets. The library includes a variety of state-of-the-art algorithms for performing multi-label classification, ranking and feature selection.

The Lazy MLInfoGain was coupled with two multi-label classifiers: BRKNN and ML-KNN. The experiments with these classifiers are detailed in the next subsections.

5.1 Experiments with the BRKNN classifier

The BR transformation coupled with the $k$ -NN classifier (or $BR+\textit{KNN}$ ) can be adapted by using a single search instead of transforming the multi-label data set using the BR approach. It finds the $k$ nearest neighbors while it makes independent predictions for each label [21]. While BR followed by $k$ -NN has a computational complexity of $L$ times the cost of computing the $k$ nearest instances, where $L$ is the number of labels in the data set, this adaptation runs much faster, and is more scalable than other classification algorithms based on transformation. This adaptation is used in this work and is referred to as the BRKNN classifier.
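The single-search idea can be sketched as follows. This is a simplified illustration and not the Mulan BRKNN code: the Euclidean distance, the majority threshold of 0.5 for declaring a label relevant, and the name `brknn_predict` are all assumptions of the example.

```python
from math import sqrt


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brknn_predict(train_X, train_Y, labels, query, k):
    """One neighbor search, then an independent vote for each label."""
    # The k nearest neighbors are computed once, not once per label.
    neighbors = sorted(range(len(train_X)),
                       key=lambda i: euclidean(train_X[i], query))[:k]
    # Confidence of a label = fraction of neighbors that carry it.
    confidences = {lab: sum(lab in train_Y[i] for i in neighbors) / k for lab in labels}
    # Assumption: a label is predicted when more than half of the neighbors carry it.
    predicted = {lab for lab, c in confidences.items() if c > 0.5}
    return predicted, confidences


train_X = [(0, 0), (0, 1), (1, 0), (5, 5)]
train_Y = [{"A"}, {"A", "B"}, {"A"}, {"C"}]
print(brknn_predict(train_X, train_Y, ("A", "B", "C"), (0, 0), k=3))
```

The per-label confidences can also serve as scores for ranking-based measures, while the thresholded set is used for the classification measures.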

The lazy feature selection was first evaluated when coupled with the BRKNN classifier. The feature selection was executed within the algorithm just before the actual classification. For each test instance, the classifier considered only the $r$ % of features selected in the lazy manner to compute its neighbor distances. This implies that distinct subsets of features were used for different instances. The experiments were executed with the default parameter settings in the Mulan tool, for the original BRKNN classifier.

Throughout this work, we selected the following measures to evaluate our algorithms: Hamming Loss, Subset 0/1 Loss, Example-Based Accuracy and Ranking Loss. These measures are commonly used in multi-label classification work, and the decision on this subset is supported by [16]. Example-Based Accuracy was inverted (1 $-$ measure) so that for all measures, the lower the values, the better.
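For reference, three of these measures can be written compactly on label sets, as in the Python sketch below. It follows the usual definitions from the multi-label literature rather than the Mulan source code; the function names and the convention that two empty label sets have accuracy 1 are assumptions of the example. Ranking Loss is omitted because it requires per-label scores rather than predicted sets.

```python
def hamming_loss(true_sets, pred_sets, n_labels):
    """Fraction of instance-label pairs that are predicted incorrectly."""
    errors = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))  # symmetric difference
    return errors / (len(true_sets) * n_labels)


def subset_zero_one_loss(true_sets, pred_sets):
    """Fraction of instances whose predicted label set is not exactly right."""
    return sum(t != p for t, p in zip(true_sets, pred_sets)) / len(true_sets)


def inverted_example_accuracy(true_sets, pred_sets):
    """1 minus the mean Jaccard similarity between true and predicted label sets."""
    def jaccard(t, p):
        union = t | p
        return len(t & p) / len(union) if union else 1.0  # assumed convention
    mean = sum(jaccard(t, p) for t, p in zip(true_sets, pred_sets)) / len(true_sets)
    return 1.0 - mean


true_sets = [{"A", "B"}, {"C"}]
pred_sets = [{"A"}, {"C"}]
print(hamming_loss(true_sets, pred_sets, n_labels=3))
print(subset_zero_one_loss(true_sets, pred_sets))
print(inverted_example_accuracy(true_sets, pred_sets))
```

All three functions return values between 0 and 1 for which lower is better, in line with the way the results are reported.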

Table 2 shows the overall result of each feature selection technique coupled with the BRKNN classifier. Each table section presents the result for a specific performance measure. The first column indicates the data set used, and the other columns indicate which feature selection technique was applied. “BR $+$ InfoGain”, “Copy $+$ InfoGain” and “LP $+$ InfoGain” stand for a transformation followed by the single-label information gain measure to rank and select features. “MLInfoGain” corresponds to the multi-label information gain technique proposed in [14]. “Lazy MLInfoGain” is the lazy adaptation proposed in this work.

“No Sel.” is the result without feature selection, which serves as the baseline. Each cell shows the achieved result in terms of the multi-label performance measure, varying between 0 and 1, and the lower the value, the better. In parentheses, we show the percentage of selected features that achieved the best value for each technique, and in case of ties we report the smaller percentage. Bold values show the results that achieved a score equal to or better than the baseline, and underlined values show the best result achieved in each row. At the end of the table we summarize the results.

Table 2
Best results achieved using each feature selection technique with BRKNN

Data set	BR $+$ InfoGain	Copy $+$ InfoGain	LP $+$ InfoGain	MLInfoGain	LazyMLInfoGain	No Sel.
Hamming loss
Bibtex	0.0128 (10%)	0.0132 (10%)	0.0137 (20%)	0.0132 (10%)	0.0128 (10%)	0.0143
Birds	0.0447 (30%)	0.0458 (90%)	0.0456 (80%)	0.0438 (10%)	0.0445 (30%)	0.0454
CAL500	0.1411 (80%)	0.1416 (40%)	0.1410 (30%)	0.1412 (40%)	0.1415 (60%)	0.1425
Corel5k	0.0094 (10%)	0.0094 (10%)	0.0094 (10%)	0.0094 (10%)	0.0094 (10%)	0.0094
Emotions	0.1917 (90%)	0.1910 (80%)	0.1951 (90%)	0.1890 (80%)	0.1912 (70%)	0.1934
Enron	0.0525 (10%)	0.0579 (10%)	0.0523 (10%)	0.0565 (70%)	0.0508 (30%)	0.0580
Flagsml	0.2510 (20%)	0.2570 (20%)	0.2540 (20%)	0.2474 (30%)	0.2521 (30%)	0.2749
Genbase	0.0038 (10%)	0.0038 (10%)	0.0038 (10%)	0.0038 (10%)	0.0038 (10%)	0.0038
Medical	0.0139 (10%)	0.0160 (10%)	0.0162 (10%)	0.0160 (10%)	0.0163 (10%)	0.0180
Scene	0.0958 (90%)	0.0932 (90%)	0.0947 (90%)	0.0928 (90%)	0.0918 (60%)	0.0920
Yeast	0.1924 (70%)	0.1971 (50%)	0.1945 (90%)	0.1942 (80%)	0.1949 (40%)	0.1952
Subset 0/1 loss
Bibtex	0.8817 (10%)	0.9120 (10%)	0.9516 (30%)	0.9118 (10%)	0.8772 (10%)	0.9754
Birds	0.4945 (50%)	0.5084 (70%)	0.5069 (70%)	0.4852 (20%)	0.4914 (30%)	0.5039
CAL500	1.0000 (10%)	1.0000 (10%)	1.0000 (10%)	1.0000 (10%)	1.0000 (10%)	1.0000
Corel5k	0.9992 (50%)	0.9994 (70%)	0.9992 (90%)	0.9994 (30%)	0.9976 (10%)	1.0000
Emotions	0.6985 (30%)	0.6883 (70%)	0.7035 (90%)	0.6732 (80%)	0.6968 (90%)	0.7085
Enron	0.8908 (10%)	0.8837 (40%)	0.8996 (40%)	0.8866 (40%)	0.8720 (30%)	0.9195
Flagsml	0.8084 (20%)	0.8450 (20%)	0.8087 (20%)	0.8034 (30%)	0.8192 (20%)	0.8547
Genbase	0.0785 (10%)	0.0785 (10%)	0.0785 (10%)	0.0785 (10%)	0.0785 (10%)	0.0785
Medical	0.4530 (10%)	0.5471 (10%)	0.5471 (10%)	0.5359 (10%)	0.5440 (10%)	0.5982
Scene	0.4130 (90%)	0.4088 (90%)	0.4088 (90%)	0.4005 (80%)	0.3930 (70%)	0.4038
Yeast	0.7985 (90%)	0.8014 (90%)	0.8056 (90%)	0.7964 (80%)	0.8014 (90%)	0.8018
Example-based accuracy (inverted)
Bibtex	0.7894 (10%)	0.8369 (10%)	0.8848 (30%)	0.8369 (10%)	0.7887 (10%)	0.9289
Birds	0.4443 (30%)	0.4560 (90%)	0.4535 (80%)	0.4282 (10%)	0.4349 (30%)	0.4482
CAL500	0.8094 (80%)	0.8107 (70%)	0.8099 (60%)	0.8106 (40%)	0.8120 (70%)	0.8144
Corel5k	0.9915 (80%)	0.9928 (70%)	0.9941 (80%)	0.9925 (70%)	0.9876 (20%)	0.9975
Emotions	0.4702 (70%)	0.4686 (80%)	0.4871 (50%)	0.4643 (80%)	0.4754 (60%)	0.4851
Enron	0.6530 (10%)	0.7314 (20%)	0.7000 (10%)	0.7162 (70%)	0.6202 (20%)	0.7973
Flagsml	0.3953 (20%)	0.3945 (20%)	0.3903 (20%)	0.3824 (30%)	0.3955 (20%)	0.4364
Genbase	0.0463 (10%)	0.0463 (10%)	0.0463 (10%)	0.0463 (10%)	0.0463 (10%)	0.0463
Medical	0.3815 (10%)	0.4799 (10%)	0.4828 (10%)	0.4718 (10%)	0.4867 (10%)	0.5437
Scene	0.3881 (90%)	0.3831 (90%)	0.3837 (90%)	0.3750 (80%)	0.3669 (70%)	0.3802
Yeast	0.4975 (90%)	0.5037 (90%)	0.5002 (90%)	0.4965 (80%)	0.5004 (90%)	0.4998
Ranking loss
Bibtex	0.1342 (10%)	0.1807 (10%)	0.2296 (30%)	0.1805 (10%)	0.1412 (10%)	0.2830
Birds	0.0861 (70%)	0.0889 (90%)	0.0878 (40%)	0.0872 (60%)	0.0868 (70%)	0.0864
CAL500	0.2301 (70%)	0.2301 (30%)	0.2295 (40%)	0.2310 (90%)	0.2285 (70%)	0.2310
Corel5k	0.1887 (10%)	0.1997 (10%)	0.2254 (10%)	0.1983 (10%)	0.2025 (10%)	0.3243
Emotions	0.1624 (70%)	0.1623 (80%)	0.1599 (90%)	0.1584 (60%)	0.1574 (50%)	0.1610
Enron	0.1165 (10%)	0.1096 (10%)	0.1260 (10%)	0.1087 (10%)	0.1039 (20%)	0.1655
Flagsml	0.1815 (50%)	0.1855 (20%)	0.1816 (50%)	0.1891 (40%)	0.1832 (50%)	0.1978
Genbase	0.0052 (10%)	0.0052 (10%)	0.0052 (10%)	0.0052 (10%)	0.0052 (10%)	0.0052
Medical	0.0350 (10%)	0.0438 (10%)	0.0445 (10%)	0.0437 (10%)	0.0431 (10%)	0.0475
Scene	0.0925 (90%)	0.0902 (90%)	0.0927 (90%)	0.0905 (90%)	0.0851 (60%)	0.0889
Yeast	0.1757 (90%)	0.1766 (90%)	0.1797 (90%)	0.1755 (80%)	0.1803 (60%)	0.1778
Best values (underlined)	17	6	7	18	21	6
$\leqslant$ baseline (bold)	39	33	31	41	41

With the BRKNN classifier, the proposed lazy multi-label information gain technique (Lazy MLInfoGain) achieved a competitive result, holding the best performance in 21 of the 44 experiments. The non-lazy MLInfoGain technique achieved the best result in 18 cases, and the BR $+$ InfoGain transformation technique in 17 cases. The configuration without feature selection achieved the best result in only 6 cases. In 41 cases, both the multi-label information gain technique and the proposed lazy adaptation yielded a value equal to or better than the baseline (without feature selection).

These preliminary results indicate an improvement over the non-lazy MLInfoGain technique proposed in [14]. To confirm this, in the next subsection a statistical analysis is used to evaluate these results.

5.2 Statistical evaluation

Statistical analysis was used to evaluate if the differences in performance of the multi-label feature selection techniques are statistically significant.

The five feature selection techniques were ranked according to their performance for each data set and for each percentage of selected features. The best performing technique was ranked first, the second best was ranked second, and so on. In case of ties, the ranks were averaged. From the average ranks of the techniques, the Friedman statistic was calculated, and then, at a significance level of 5%, the hypothesis that the techniques performed equally well in average ranking was rejected.
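The ranking step and the Friedman statistic can be sketched in pure Python as shown below. The three rows are the best Hamming loss values of Bibtex, Birds and Enron from Table 2 and are used only to illustrate the computation; they are not the per-percentage results on which the reported test was run. The statistic is the standard chi-square form without a tie correction, and the function names are hypothetical.

```python
def average_ranks(scores_per_case):
    """Average rank of each technique (lower score = better = rank 1); ties share ranks."""
    k = len(scores_per_case[0])
    totals = [0.0] * k
    for row in scores_per_case:
        order = sorted(range(k), key=lambda j: row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
            for t in order[i:j + 1]:
                totals[t] += shared
            i = j + 1
    return [t / len(scores_per_case) for t in totals]


def friedman_statistic(avg_ranks, n_cases):
    """Friedman chi-square statistic computed from the average ranks."""
    k = len(avg_ranks)
    return 12 * n_cases / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    )


# Columns: BR+InfoGain, Copy+InfoGain, LP+InfoGain, MLInfoGain, LazyMLInfoGain
rows = [
    [0.0128, 0.0132, 0.0137, 0.0132, 0.0128],  # Bibtex
    [0.0447, 0.0458, 0.0456, 0.0438, 0.0445],  # Birds
    [0.0525, 0.0579, 0.0523, 0.0565, 0.0508],  # Enron
]
ranks = average_ranks(rows)
print([round(r, 4) for r in ranks])
print(round(friedman_statistic(ranks, len(rows)), 4))
```

The resulting statistic is then compared with the critical value of the chosen reference distribution at the 5% level to decide whether the hypothesis of equal average ranks is rejected.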

Then a post-hoc Nemenyi test was used to compare the feature selection techniques to each other. The performance of two techniques is considered significantly different if their average ranks differ by more than a critical distance value. Figure 1 shows the results from the Nemenyi post-hoc test for the four different measures used in the experiments for the BRKNN classifier. Each diagram presents an enumerated axis with the average ranks of each technique. The best rankings are at the right-most side of the diagram. The average ranks of techniques that do not differ significantly (at the significance level of 0.05) are connected by a horizontal line.
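The critical distance of the Nemenyi test is commonly computed as $q_{\alpha}\sqrt{k(k+1)/(6N)}$, where $k$ is the number of techniques and $N$ the number of ranked cases. The sketch below uses the usual tabulated critical values for $\alpha=0.05$; the value $N=99$ is purely hypothetical (it would correspond to 11 data sets and 9 percentages), since the number of cases is not stated in the text, and the function name is an assumption.

```python
from math import sqrt

# Critical values q_alpha for the two-tailed Nemenyi test at alpha = 0.05,
# indexed by the number of compared techniques k (standard tabulated values).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}


def nemenyi_critical_distance(k, n_cases, q_alpha=None):
    """Minimum difference in average rank that counts as significant."""
    q = Q_ALPHA_005[k] if q_alpha is None else q_alpha
    return q * sqrt(k * (k + 1) / (6.0 * n_cases))


# Hypothetical setting: 5 techniques ranked over 99 cases (assumption, see above).
cd = nemenyi_critical_distance(k=5, n_cases=99)
print(round(cd, 4))
```

Two techniques whose average ranks differ by less than this distance would be joined by a horizontal line in a critical diagram.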

Figure 1
Critical diagram for the Nemenyi post-hoc test at $p \leqslant 0.05$.

The diagrams show that the proposed Lazy MLInfoGain outperforms the original (eager) MLInfoGain, with a significant difference for every measure except Example-based Accuracy. It also outperforms the Copy $+$ InfoGain and LP $+$ InfoGain techniques for all measures. The second best feature selection technique is BR $+$ InfoGain, which is not significantly worse than Lazy MLInfoGain and not significantly better than the original MLInfoGain.

5.3 ML-KNN lazy feature selection

We have also incorporated the lazy feature selection into ML-KNN, which is another classifier capable of handling multi-label data directly. The lazy attribute selection is executed within the algorithm just before the actual classification takes place. Analogously to the BRKNN implementation, for each test instance the classifier considers only the $r$% best features, selected in a lazy manner, when computing the distances to its neighbors. Again, this implies that distinct subsets of features are likely to be used for different instances. The experiments were executed with the default parameter settings of the original ML-KNN classifier in the Mulan tool.
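The per-instance selection step can be pictured with the sketch below. It is not the Mulan implementation: `score_fn` is a hypothetical placeholder for the lazy relevance measure (the Lazy MLInfoGain score itself is not reproduced here), the Euclidean distance is an assumption, and rounding the number of kept features upward is also an assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def top_percent(scores: Vector, r_percent: int) -> List[int]:
    """Indices of the r% highest-scoring features (rounded up, at least one)."""
    keep = max(1, (r_percent * len(scores) + 99) // 100)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


def lazy_neighbors(
    train_x: Sequence[Vector],
    test_x: Vector,
    score_fn: Callable[[Vector], Vector],  # hypothetical per-instance feature scorer
    r_percent: int,
    k: int,
) -> Tuple[List[int], List[int]]:
    """Return (indices of the k nearest training rows, selected feature indices)."""
    # Features are scored for this particular test instance, at classification time.
    selected = top_percent(score_fn(test_x), r_percent)

    def distance(row: Vector) -> float:
        # Distance restricted to the features selected for this instance.
        return math.sqrt(sum((row[j] - test_x[j]) ** 2 for j in selected))

    nearest = sorted(range(len(train_x)), key=lambda i: distance(train_x[i]))
    return nearest[:k], selected


# Toy usage: the scorer favours features 0 and 1 for this test instance.
train = [[0.0, 0.0, 5.0, 1.0], [1.0, 1.0, 0.0, 9.0], [0.2, 0.1, 9.0, 4.0]]
query = [0.0, 0.0, 0.0, 0.0]
neighbors, features = lazy_neighbors(train, query, lambda x: [0.9, 0.8, 0.1, 0.2], 50, 2)
print(neighbors, features)
```

Because the scorer receives the test instance, two instances can end up with different feature subsets and therefore different neighborhoods, which is the behavior described above.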

Table 3 shows the overall result of each feature selection technique coupled with the ML-KNN classifier. The results are reported similarly to the experiments with the BRKNN classifier in Table 2.

Table 3
Best results achieved using each feature selection technique with ML-KNN

Data set	BR $+$ InfoGain	Copy $+$ InfoGain	LP $+$ InfoGain	MLInfoGain	LazyMLInfoGain	No Sel.
Hamming Loss
Bibtex	0.0126 (20%)	0.0129 (20%)	0.0133 (30%)	0.0130 (10%)	0.0135 (70%)	0.0136
Birds	0.0479 (30%)	0.0477 (80%)	0.0472 (90%)	0.0463 (10%)	0.0472 (90%)	0.0473
CAL500	0.1381 (40%)	0.1380 (50%)	0.1381 (20%)	0.1380 (70%)	0.1379 (70%)	0.1388
Corel5k	0.0094 (10%)	0.0094 (10%)	0.0094 (10%)	0.0094 (10%)	0.0094 (70%)	0.0094
Emotions	0.1903 (60%)	0.1921 (80%)	0.1966 (70%)	0.1898 (90%)	0.1929 (70%)	0.1951
Enron	0.0502 (10%)	0.0531 (90%)	0.0502 (10%)	0.0520 (70%)	0.0517 (50%)	0.0524
Flagsml	0.2489 (40%)	0.2622 (90%)	0.2622 (90%)	0.2570 (90%)	0.2447 (80%)	0.2536
Genbase	0.0048 (10%)	0.0048 (10%)	0.0048 (10%)	0.0048 (10%)	0.0048 (10%)	0.0048
Medical	0.0126 (10%)	0.0149 (10%)	0.0148 (10%)	0.0147 (10%)	0.0150 (20%)	0.0151
Scene	0.0899 (90%)	0.0879 (90%)	0.0911 (90%)	0.0867 (90%)	0.0860 (80%)	0.0862
Yeast	0.1915 (90%)	0.1935 (60%)	0.1945 (90%)	0.1925 (80%)	0.1934 (60%)	0.1933
Subset 0/1 loss
Bibtex	0.8619 (20%)	0.8807 (10%)	0.9154 (30%)	0.8818 (10%)	0.8955 (10%)	0.9396
Birds	0.5085 (40%)	0.5240 (80%)	0.5210 (90%)	0.5100 (10%)	0.5116 (60%)	0.5085
CAL500	1.0000 (10%)	1.0000 (10%)	1.0000 (10%)	1.0000 (10%)	1.0000 (10%)	1.0000
Corel5k	0.9972 (90%)	0.9980 (90%)	0.9988 (90%)	0.9980 (90%)	0.9958 (30%)	0.9982
Emotions	0.6816 (80%)	0.6866 (70%)	0.7019 (60%)	0.6832 (70%)	0.7087 (60%)	0.7169
Enron	0.9013 (10%)	0.9424 (90%)	0.9125 (10%)	0.9172 (60%)	0.8996 (50%)	0.9260
Flagsml	0.8087 (10%)	0.8500 (40%)	0.8297 (60%)	0.8603 (80%)	0.8034 (70%)	0.8453
Genbase	0.0890 (10%)	0.0890 (10%)	0.0890 (10%)	0.0890 (10%)	0.0890 (10%)	0.0890
Medical	0.3967 (10%)	0.4816 (10%)	0.4776 (30%)	0.4633 (10%)	0.4745 (20%)	0.4940
Scene	0.3743 (90%)	0.3760 (90%)	0.3797 (80%)	0.3685 (70%)	0.3722 (60%)	0.3752
Yeast	0.8097 (90%)	0.8101 (60%)	0.8192 (70%)	0.8113 (80%)	0.8035 (50%)	0.8126
Example-based accuracy (inverted)
Bibtex	0.7538 (20%)	0.7840 (10%)	0.8307 (30%)	0.7849 (10%)	0.7918 (10%)	0.8640
Birds	0.4519 (80%)	0.4622 (80%)	0.4617 (90%)	0.4511 (10%)	0.4505 (40%)	0.4515
CAL500	0.8018 (40%)	0.8033 (50%)	0.8023 (20%)	0.8007 (40%)	0.7999 (70%)	0.8028
Corel5k	0.9849 (90%)	0.9829 (90%)	0.9897 (90%)	0.9834 (90%)	0.9666 (20%)	0.9853
Emotions	0.4427 (80%)	0.4507 (80%)	0.4601 (60%)	0.4423 (70%)	0.4578 (70%)	0.4674
Enron	0.6074 (10%)	0.6898 (90%)	0.6254 (10%)	0.6586 (60%)	0.5906 (40%)	0.6684
Flagsml	0.3679 (40%)	0.3982 (90%)	0.3917 (40%)	0.3863 (40%)	0.3733 (80%)	0.3896
Genbase	0.0584 (10%)	0.0584 (10%)	0.0584 (10%)	0.0584 (10%)	0.0584 (10%)	0.0584
Medical	0.3245 (10%)	0.4100 (10%)	0.4045 (30%)	0.3902 (10%)	0.3997 (20%)	0.4187
Scene	0.3280 (90%)	0.3322 (90%)	0.3356 (80%)	0.3256 (70%)	0.3270 (60%)	0.3330
Yeast	0.4804 (90%)	0.4875 (60%)	0.4889 (90%)	0.4848 (80%)	0.4843 (90%)	0.4838
Ranking loss
Bibtex	0.1351 (20%)	0.1577 (10%)	0.1845 (30%)	0.1563 (10%)	0.1432 (10%)	0.2083
Birds	0.0742 (10%)	0.0759 (90%)	0.0753 (40%)	0.0724 (80%)	0.0745 (70%)	0.0746
CAL500	0.1830 (30%)	0.1820 (40%)	0.1823 (40%)	0.1825 (80%)	0.1823 (70%)	0.1828
Corel5k	0.1325 (80%)	0.1340 (80%)	0.1346 (90%)	0.1338 (60%)	0.1281 (10%)	0.1340
Emotions	0.1624 (50%)	0.1591 (80%)	0.1601 (90%)	0.1546 (60%)	0.1587 (50%)	0.1633
Enron	0.0883 (10%)	0.0919 (70%)	0.0898 (10%)	0.0924 (90%)	0.0892 (30%)	0.0920
Flagsml	0.1844 (40%)	0.1906 (90%)	0.1906 (90%)	0.1952 (30%)	0.1909 (40%)	0.2012
Genbase	0.0062 (10%)	0.0062 (10%)	0.0062 (10%)	0.0062 (10%)	0.0062 (10%)	0.0062
Medical	0.0329 (10%)	0.0384 (30%)	0.0389 (30%)	0.0393 (90%)	0.0372 (10%)	0.0395
Scene	0.0799 (90%)	0.0791 (90%)	0.0792 (90%)	0.0787 (90%)	0.0762 (60%)	0.0774
Yeast	0.1636 (80%)	0.1644 (90%)	0.1660 (90%)	0.1649 (80%)	0.1663 (60%)	0.1652
Best values (underlined)	23	7	7	13	19	7
$\leqslant$ baseline score (bold)	39	28	27	37	40

With the ML-KNN classifier, the proposed lazy multi-label information gain technique (Lazy MLInfoGain) also achieved a competitive result, holding the best performance in 19 of the 44 experiments. The non-lazy MLInfoGain technique achieved the best result in 13 cases, and the BR $+$ InfoGain transformation technique in 23 cases. The execution without feature selection achieved the best result in only 7 cases. In 40 cases, the proposed Lazy MLInfoGain yielded a value equal to or better than the baseline (without feature selection).

The corresponding statistical evaluation yields a result similar to the one obtained in the experiments with the BRKNN classifier. These results indicate that Lazy MLInfoGain is also a competitive feature selection technique when coupled with other multi-label classifiers.
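The two summary rows of Table 3 can be reproduced with a simple tally. The sketch below uses only three Hamming Loss rows copied from Table 3, so its counts are partial. The counting rule (every technique tied at the minimum is credited with a best value) is an assumption, although it is consistent with the totals reported for the Copy $+$ InfoGain and No Sel. columns.

```python
TECHNIQUES = ["BR+InfoGain", "Copy+InfoGain", "LP+InfoGain",
              "MLInfoGain", "LazyMLInfoGain", "No Sel."]

# Three Hamming Loss rows from Table 3; the last value is the baseline.
ROWS = {
    "Flagsml": [0.2489, 0.2622, 0.2622, 0.2570, 0.2447, 0.2536],
    "Genbase": [0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048],
    "Medical": [0.0126, 0.0149, 0.0148, 0.0147, 0.0150, 0.0151],
}


def summarize(rows):
    """Count best values (ties included) and values no worse than the baseline."""
    best = dict.fromkeys(TECHNIQUES, 0)
    not_worse = dict.fromkeys(TECHNIQUES[:-1], 0)
    for values in rows.values():
        lowest, baseline = min(values), values[-1]
        for name, value in zip(TECHNIQUES, values):
            if value == lowest:          # lower is better for every measure
                best[name] += 1
            if name != "No Sel." and value <= baseline:
                not_worse[name] += 1
    return best, not_worse


best, not_worse = summarize(ROWS)
print(best)
print(not_worse)
```

On the Genbase row all six columns tie, so each of them is credited with a best value; this is why the counts in the "Best values" row add up to more than the 44 experiments.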

5.4 Experiments on large multi-label data sets for BRKNN

This section reports the experiments on larger multi-label data sets. Eleven independently compiled data sets from the Yahoo! directory [24] were chosen, each one with more than 5,000 instances and 30,000 features.

Table 4 shows the results of the experiment with larger data sets, executed in a similar fashion to the previous ones. The feature selection techniques compared are BR $+$ InfoGain, the MLInfoGain proposed in [14] and the Lazy MLInfoGain (Lazy MLIG) proposed in this work. The techniques selected 10% of the features from each data set. Each column shows the results for one Yahoo! data set: Arts, Business (Busn), Computers (Comp), Education (Educ), Entertainment (Entr), Health (Heal), Recreation (Recr), Reference (Refr), Science (Scie), Social (Socl) and Society (Soci). Rows “Hamming Loss”, “Subset Loss”, “Eb. Accuracy” and “Ranking Loss” show the results for the Hamming Loss, Subset 0/1 Loss, Example-based Accuracy (inverted) and Ranking Loss measures, respectively. Row “Time(s)” shows the total execution time of the experiment (feature selection time $+$ classification time), in seconds. The computer used in the experiments was an AMD FX 8210 8-Core 3.1 GHz with 8 GB of RAM and a 64-bit OS.

Table 4
Result of experiments on large data sets with BRKNN classifier, comparing BR $+$ InfoGain, MLInfoGain and Lazy MLInfoGain feature selection strategies

	Arts	Busn	Comp	Educ	Entr	Heal	Recr	Refr	Scie	Socl	Soci
BR $+$ InfoGain
Hamming Loss	0.0595	0.0267	0.036	0.0413	0.0578	0.043	0.0559	0.0317	0.0343	0.0254	0.0537
Subset Loss	0.8991	0.4464	0.6497	0.8771	0.7621	0.689	0.8262	0.6342	0.9054	0.6204	0.7762
Eb. Accuracy	0.877	0.3	0.59	0.8578	0.739	0.6141	0.8117	0.6002	0.894	0.5937	0.7207
Ranking Loss	0.1941	0.0745	0.1509	0.1658	0.1778	0.1292	0.199	0.2009	0.21	0.1277	0.1898
Time(s)	53692	93634	186670	142035	125560	110008	122812	133902	120105	334846	215605
MLInfoGain
Hamming Loss	0.0617	0.027	0.0368	0.0427	0.0578	0.0456	0.0584	0.0326	0.035	0.0276	0.0547
Subset Loss	0.928	0.4497	0.6439	0.9192	0.8113	0.7299	0.8757	0.6839	0.9456	0.6849	0.8075
Eb. Accuracy	0.9128	0.3026	0.5812	0.9035	0.7944	0.6295	0.8624	0.6546	0.9379	0.6579	0.762
Ranking Loss	0.2093	0.0767	0.1604	0.1854	0.1933	0.1342	0.2328	0.2107	0.2264	0.1341	0.1998
Time(s)	686	1015	1869	1487	1726	1174	1647	1344	1069	3080	2442
Lazy MLInfoGain
Hamming Loss	0.0615	0.0273	0.0362	0.0429	0.0566	0.0442	0.0586	0.0315	0.0346	0.0266	0.0549
Subset Loss	0.9213	0.4505	0.6515	0.9109	0.7993	0.6927	0.8828	0.6523	0.9316	0.65	0.7803
Eb. Accuracy	0.9071	0.3062	0.5887	0.8954	0.7811	0.6147	0.8699	0.6217	0.924	0.6239	0.7264
Ranking Loss	0.2027	0.0773	0.1518	0.1693	0.1708	0.1367	0.2128	0.1818	0.2064	0.1249	0.1967
Time(s)	1111	1590	2832	2188	2750	1738	2584	2141	1762	4795	3694

Even though the results for most measures slightly favor the BR $+$ InfoGain technique on these Yahoo! directory data sets, the compared techniques show no significant difference for the assessed multi-label measures. However, the execution time of the adaptations (both MLInfoGain and Lazy MLInfoGain) is substantially lower than that of the BR $+$ InfoGain strategy: in Table 4, MLInfoGain is roughly 70 to 110 times faster and Lazy MLInfoGain roughly 45 to 70 times faster.

For these Yahoo! directory data sets, Lazy MLInfoGain generally outperformed the original non-lazy MLInfoGain technique, especially for the Example-based Accuracy and Ranking Loss measures. The lazy technique was slower than the non-lazy MLInfoGain technique (roughly 1.5 to 1.7 times its execution time in Table 4) due to the overhead of postponing feature selection to classification time. This overhead is small when compared with the cost of transformation-based techniques, represented here by the BR $+$ InfoGain strategy, which take far more time.

The experiments with large data sets and the ML-KNN classifier led to results similar to the ones with BRKNN and reinforced the conclusion that the proposed technique is more scalable than the well-known transformation-based techniques used in the literature.
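The time comparison can be made concrete by dividing the “Time(s)” rows of Table 4. The sketch below copies those rows verbatim; the helper name `ratios` is only illustrative.

```python
DATASETS = ["Arts", "Busn", "Comp", "Educ", "Entr", "Heal",
            "Recr", "Refr", "Scie", "Socl", "Soci"]

# "Time(s)" rows copied from Table 4 (total execution time in seconds).
TIME_S = {
    "BR+InfoGain": [53692, 93634, 186670, 142035, 125560, 110008,
                    122812, 133902, 120105, 334846, 215605],
    "MLInfoGain": [686, 1015, 1869, 1487, 1726, 1174,
                   1647, 1344, 1069, 3080, 2442],
    "LazyMLInfoGain": [1111, 1590, 2832, 2188, 2750, 1738,
                       2584, 2141, 1762, 4795, 3694],
}


def ratios(slower: str, faster: str) -> dict:
    """Per-data-set ratio of the execution times of two techniques."""
    return {d: s / f for d, s, f in zip(DATASETS, TIME_S[slower], TIME_S[faster])}


speedup_lazy = ratios("BR+InfoGain", "LazyMLInfoGain")   # gain over the transformation
overhead_lazy = ratios("LazyMLInfoGain", "MLInfoGain")   # cost of selecting lazily

for d in DATASETS:
    print(f"{d}: {speedup_lazy[d]:.1f}x faster than BR, "
          f"{overhead_lazy[d]:.2f}x the MLInfoGain time")
```

Read this way, the overhead of the lazy strategy is a small constant factor, whereas the gap to the transformation-based strategy is well over an order of magnitude on every data set.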

6. Conclusion

In this work, a new method for multi-label feature selection was proposed, based on the lazy paradigm. We motivated this proposal by presenting a multi-label data set example that indicated how a lazy strategy could benefit multi-label feature selection by postponing the selection to classification time.

The proposed lazy strategy for the multi-label context was implemented as a new adaptation of a feature selection method based on the information gain measure. An experimental evaluation was conducted with various multi-label feature selection methods and data sets from different domains. Two multi-label classifiers were used to assess the lazy adaptation: BRKNN and ML-KNN.

Experimental results and a statistical analysis confirmed that Lazy MLInfoGain outperformed the non-lazy (eager) MLInfoGain proposed in [14]. Since the MLInfoGain method was already competitive with other techniques, the lazy adaptation can also be considered a competitive feature selection technique for multi-label classification. In terms of scalability, the experiments showed that the proposed technique runs faster than the transformation-based techniques. This reinforces the importance of employing feature selection method adaptations that handle larger data sets directly, without transformation.

References

1. Chen, Yan, Zhang, Chen and Yang, Document transformation for multi-label feature selection in text categorization, in: Proceedings of the 7th IEEE International Conference on Data Mining, 2007, pp. 451–456.

2. Cheng and Hüllermeier, Combining instance-based learning and logistic regression for multilabel classification, Machine Learning 76(2–3) (2009), 211–225.

3. Clare and King R.D., Knowledge discovery in multi-label phenotype data, in: Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, 2001, pp. 42–53.

4. Crammer, Singer, Hofmann, Poggio and Shawe-Taylor, A family of additive online algorithms for category ranking, Journal of Machine Learning Research 3 (2003), 1025–1058.

5. Dash and Liu, Feature selection for classification, Intelligent Data Analysis 1 (1997), 131–156.

6. Elisseeff and Weston, A kernel method for multi-labelled classification, in: Advances in Neural Information Processing Systems, Vol. 14, 2001, pp. 681–687.

7. Guyon, Gunn, Nikravesh and Zadeh, eds., Feature Extraction, Foundations and Applications, Springer, 2006.

8. Huang and Wu, Joint feature selection and classification for multilabel learning, IEEE Transactions on Cybernetics 48(3) (2017), 876–889.

9. Jungjit, Michaelis, Freitas A.A. and Cinatl, Two extensions to multi-label correlation-based feature selection: a case study in bioinformatics, in: IEEE International Conference on Systems, Man, and Cybernetics, IEEE, 2013, pp. 1519–1524.

10. Lastra, Luaces, Quevedo J.R. and Bahamonde, Graphical feature selection for multilabel classification tasks, in: Proceedings of the 10th International Conference on Advances in Intelligent Data Analysis, 2011, pp. 246–257.

11. Lee and Kim D.-W., Feature selection for multi-label classification using multivariate mutual information, Pattern Recognition Letters 34(3) (2013), 349–357.

12. Loza Mencía, Fürnkranz, Hüllermeier and Rapp, Learning Interpretable Rules for Multi-Label Classification, Springer International Publishing, Cham, 2018, pp. 81–113.

13. Olsson and Oard D.W., Combining feature selectors for text classification, in: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, ACM, 2006, pp. 798–799.

14. Pereira R.B., Plastino, Zadrozny and Merschmann L.H., Information gain feature selection for multi-label classification, Journal of Information and Data Management 6(1) (2015), 48.

15. Pereira R.B., Plastino, Zadrozny and Merschmann L.H., Categorizing feature selection methods for multi-label classification, Artificial Intelligence Review 49(1) (2016).

16. Pereira R.B., Plastino, Zadrozny and Merschmann L.H., Correlation analysis of performance measures for multi-label classification, Information Processing and Management 54(3) (2018), 359–369.

17. Pereira R.B., Plastino, Zadrozny, Merschmann L.H.d.C. and Freitas A.A., Lazy attribute selection – choosing attributes at classification time, Intelligent Data Analysis 15(5) (2011), 715–732.

18. Pupo O.G.R., Morell and Soto S.V., Relieff-ml: An extension of relieff algorithm to multi-label learning, in: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Springer, 2013, pp. 528–535.

19. Quinlan J.R., Induction of decision trees, Machine Learning 1 (1986), 81–106.

20. Read, Pfahringer, Holmes and Frank, Classifier chains for multi-label classification, Machine Learning 85(3) (2011), 333–359.

21. Sorower M.S., A literature survey on algorithms for multi-label learning, Technical report, Oregon State University, Corvallis, 2010.

22. Spolaôr, Cherman E.A., Monard M.C. and Lee H.D., A comparison of multi-label feature selection methods using the problem transformation approach, Electronic Notes in Theoretical Computer Science 292 (2013), 135–151.

23. Spolaôr, Cherman E.A., Monard M.C. and Lee H.D., Relieff for multi-label feature selection, in: Proceedings of the 2nd Brazilian Conference on Intelligent Systems, IEEE, 2013, pp. 6–11.

24. Tang, Rajan and Narayanan V.K., Large scale multi-label classification via metalabeler, in: Proceedings of the 18th International Conference on World Wide Web, ACM, 2009, pp. 211–220.

25. Trohidis, Tsoumakas, Kalliris and Vlahavas I.P., Multi-label classification of music into emotions, in: Bello J.P., Chew and Turnbull, eds, Proceedings of the 9th International Conference on Music Information Retrieval, 2008, pp. 325–330.

26. Tsoumakas, Dimou, Spyromitros, Mezaris, Kompatsiaris and Vlahavas, Correlation based pruning of stacked binary relevance models for Multi-Label learning, in: Proceedings of the 1st International Workshop on Learning from Multi-Label Data, 2009, pp. 101–116.

27. Tsoumakas, Katakis and Vlahavas, Mining multi-label data, in: Maimon and Rokach, eds, Data Mining and Knowledge Discovery Handbook, Springer US, 2010, pp. 667–685.

28. Tsoumakas and Vlahavas, Random k-labelsets: An ensemble method for multilabel classification, in: Proceedings of the 18th European Conference on Machine Learning, 2007, pp. 406–417.

29. Yang and Pedersen J.O., A comparative study on feature selection in text categorization, in: Proceedings of the 14th International Conference on Machine Learning, 1997, pp. 412–420.

30. and Liu, Efficient feature selection via analysis of relevance and redundancy, The Journal of Machine Learning Research 5 (2004), 1205–1224.

31. Zhang M.-L., Peña J.M. and Robles, Feature selection for multi-label naive bayes classification, Information Sciences 179(19) (2009), 3218–3229.

32. Zhang M.-L. and Zhou Z.-H., Ml-knn: a lazy learning approach to multi-label learning, Pattern Recognition 40(7) (2007), 2038–2048.

33. Zheng and Srihari, Feature selection for text categorization on imbalanced data, ACM SIGKDD Explorations Newsletter 6(1) (2004), 80–89.
