Dynamic mutual information-based feature selection for multi-label learning

Abstract

In classification problems, feature selection is used to identify important input features to reduce the dimensionality of the input space while improving or maintaining classification performance. Traditional feature selection algorithms are designed to handle single-label learning, but classification problems have recently emerged in the multi-label domain. In this study, we propose a novel feature selection algorithm for classifying multi-label data. The proposed method is based on dynamic mutual information, which handles redundancy among features by controlling the input space. We compare the proposed method with existing problem transformation and algorithm adaptation methods on real multi-label datasets using multi-label accuracy and Hamming loss. The results show that the proposed method demonstrates more stable and better performance for nearly all multi-label datasets.

Keywords

Feature selection, multi-label learning, dynamic mutual information, filter ranking, algorithm adaptation

1. Introduction

Recently, researchers have become interested in multi-label learning in addition to single-label classification. Multi-label classification problems arise in various domains, including scene annotation, emotion data, gene function prediction, text categorization, and healthcare data. In classification problems, the high dimensionality of input data can cause problems such as increased prediction error, high computational complexity, decreased interpretability, and data sparsity. Feature selection addresses these problems by selecting important features, which reduces the dimensionality of the input space while preserving classification performance.

Therefore, studies on feature selection for multi-label learning have been actively conducted. There are two main approaches: the problem transformation method and the algorithm adaptation method. The problem transformation method is algorithm-independent and transforms a multi-label data representation into a single-label representation [1, 2, 3, 4]. In contrast, the algorithm adaptation method extends a single-label algorithm to directly handle multi-label data [5, 6, 7, 8, 9]. In this paper, we primarily focus on the algorithm adaptation method because it preserves the label relations and avoids the information loss caused by reconstructing labels.

Feature selection methods in general can be categorized into the following three groups: filter methods, wrapper methods, and embedded methods [10]. A filter method does not incorporate learning but directly selects the best feature subset based on the intrinsic properties of the data. This approach is usually used for high-dimensional data, as it is relatively simple and fast [11]. A wrapper method selects a subset of features that provides the best classification performance with a specific classifier [12]. In the embedded method, the feature selection procedure is part of classifier learning itself [11]. Because both the wrapper and embedded methods use a classifier, their performance largely depends on the classifier used. In addition, classifier learning has a high computational cost during an exhaustive feature subset search. For these reasons, filter methods are a reasonable choice for high-dimensional data, as they do not depend on a classifier. When applying a filter method to a classification problem, information theory and rough set theory are commonly used for selecting significant features, as they can measure the properties of categorical features. In particular, for mixed-type data, a specific discretization technique is utilized in information theory [13, 14, 15], and Pawlak’s rough set model and neighborhood rough set model are employed in rough set theory [16, 17, 18]. Entropy and mutual information are used to handle linear or non-linear relations between features, as well as mixed-type data with a specific discretization technique. In addition, relevance and redundancy among features are calculated directly using information theory.

Existing multi-label feature selection algorithms based on the algorithm adaptation method estimate mutual information on the whole input space and therefore cannot accurately express the relevance between features and labels. To address this, Liu et al. [13] proposed dynamic mutual information (DMI). The whole input space is divided into recognized and unrecognized instances, which will be defined in Section 2.3. DMI estimates the relevance information on unrecognized instances, not the whole input space. With this measure, the relevance information among features can be estimated more accurately. However, existing studies using DMI mainly focus on feature selection in single-label classification, not in multi-label learning.

In this paper, we propose DMI-based feature selection for multi-label learning (DMIML). To the best of our knowledge, this is the first work to extend DMI to feature selection for multi-label learning. Our contributions are threefold. First, we extend the original DMI algorithm to multi-label learning by using the score function in DMI so that recognized instances are captured by the selected features together with a candidate feature. Second, rarely occurring combinations in the training set are ignored, as they lead to significant performance loss. Finally, we determine a score function that fits DMIML, as DMI already takes redundancy among features into account. Through these steps, the proposed method removes redundant information so that the information shared between the features and labels of multi-label datasets can be estimated more accurately. In the experiments, the proposed method also demonstrates more stable classification performance than the benchmark methods on most of the benchmark multi-label datasets. Moreover, it makes full use of the information in the features and labels: it reduces entropy faster, which can be seen clearly from the remaining entropy and aids interpretation.

The remainder of this paper is structured as follows. In Section 2, we briefly review related works, while in Section 3 we present the DMIML and DMIML-M algorithms. In Section 4, we present and discuss our experimental results. Finally, in Section 5, we present our conclusions and ideas for future work.

2. Related works

This section introduces fundamental information theory concepts and terminology, as well as methods related to our proposed feature selection method. We first introduce our notation for feature selection in multi-label learning. Assume that $O$ is the total dataset (or whole input space) with $p$ features ( $f_{1},\ldots,f_{p}$ ), $q$ labels ( $l_{1},\ldots,l_{q}$ ), and $n$ instances or observations ( $o_{1},\ldots,o_{n}$ ). $S=\{{f_{1},\ldots,f_{s}}\}$ refers to the selected feature set, where $s$ denotes the number of selected features.

2.1 Information theory

In information theory, entropy is widely used as a key measure of information in many fields, including feature selection, as it can quantify the uncertainty of random variables and effectively measure the amount of information shared among them. Here, the definition of entropy follows Shannon’s entropy [19].

Let $X$ be a discrete random variable. Then, its uncertainty can be measured by entropy $H(X)$ , which is defined as follows:

$\displaystyle H(X)=-\mathop{\sum}\limits_{x\in X}p(x)\log p(x),$ (1)

where $p(x)$ is the (marginal) probability mass function of $X$ . It should be noted that the entropy does not depend on the realized values, but only on the probability distribution of a random variable. Likewise, the entropy of two discrete random variables $X$ and $Y$ is defined as follows:

$\displaystyle H({X,Y})=-\mathop{\sum}\limits_{y\in Y}\mathop{\sum}\limits_{x\in X}p({x,y})\log p({x,y}),$ (2)

where $p({x,y})$ is the joint probability mass function of $X$ and $Y$ . In addition, the conditional entropy of $Y$ given $X$ is defined as follows:

$\displaystyle H({Y|X})=-\mathop{\sum}\limits_{y\in Y}\mathop{\sum}\limits_{x\in X}p({x,y})\log p({y|x}),$ (3)

where $p({y|x})$ is the conditional probability mass function of $Y$ given $X=x$ . Conditional entropy measures the uncertainty that remains in a variable when another variable is known. For example, $H({X|Y})=0$ signifies that $X$ depends entirely on $Y$ , while $H({X|Y})=H(X)$ signifies that knowing $Y$ provides no information about $X$ .

The mutual information $I({X;Y})$ quantifies the amount of information shared by two variables $X$ and $Y$ . It is defined as follows:

$\displaystyle I({X;Y})=\mathop{\sum}\limits_{y\in Y}\mathop{\sum}\limits_{x\in X}p({x,y})\log\frac{p({x,y})}{p(x)p(y)}.$ (4)

Equation (4) can be expressed by entropy as follows:

$\displaystyle I({X;Y})=H(X)-H({X{|}Y})=H(Y)-H({Y{|}X})=H(X)+H(Y)-H({X,Y}).$ (5)

A higher value of $I({X;Y})$ indicates that $X$ and $Y$ are closely related. Clearly, $I({X;Y})=0$ signifies that $X$ and $Y$ are independent. The relation between entropy and mutual information is illustrated in Fig. 1.

Figure 1.

Entropy and mutual information.
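The quantities in Eqs (1) to (5) can be estimated from samples with a few lines of code. The following is a minimal sketch, not part of the original method: it assumes plug-in (relative frequency) probability estimates and base-2 logarithms, and the helper names are hypothetical. It uses the chain rule $H({Y|X})=H({X,Y})-H(X)$ , which is equivalent to Eq. (3).

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (bits) of one or more equally long sample columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())


def conditional_entropy(y, x):
    """H(Y|X), computed with the chain rule H(X, Y) - H(X)."""
    return entropy(x, y) - entropy(x)


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X, Y), the last form in Eq. (5)."""
    return entropy(x) + entropy(y) - entropy(x, y)


x = [0, 0, 1, 1]
y = [0, 1, 0, 1]  # independent of x in this sample
z = [0, 0, 1, 1]  # exact copy of x

print(mutual_information(x, y))   # independent: no shared information
print(conditional_entropy(y, x))  # equals H(y), since x tells nothing about y
print(mutual_information(x, z))   # copy: shares all of H(x)
```

In this toy sample, the independent pair gives zero mutual information and the copied variable gives zero conditional entropy, matching the two extreme cases described above.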

For the case of three variables, multivariate mutual information is defined as follows:

$\displaystyle I({X;Y;Z})=-H({X,Y,Z})+H({X,Y})+H({Y,Z})+H({Z,X})-H(X)-H(Y)-H(Z).$ (6)

In addition, Eq. (6) can be represented in various forms:

$\displaystyle I({X;Y;Z})=I({X,Y;Z})-I({X;Z})-I({Y;Z})=I({Y;Z{|}X})-I({Y;Z}).$ (7)

The relation between entropy and multivariate mutual information for the case of three variables is illustrated in Fig. 2.

Figure 2.

Entropy and multivariate (3-variable) mutual information.

For any set of variables $S$ , the generalized form of multivariate mutual information is defined as follows:

$\displaystyle I(S)=-\mathop{\sum}\limits_{T\subseteq S}({-1})^{|S|-|T|}H(T),$ (8)

where $|S|$ denotes the cardinality of $S$ [20].

From the perspective of feature selection, the value of multivariate mutual information provides rich information about the interaction between variables. Suppose that $S$ is a set of already selected features, $f^{+}$ is a candidate feature to be selected, and $C$ is the target (class) variable. In feature selection, the effect of $f^{+}$ can usually be evaluated by

$\displaystyle I({f^{+};S;C})=I({S;C{|}f^{+}})-I({S;C}).$ (9)

The objective Eq. (9) is positive when variable $f^{+}$ and the variables in $S$ are complementary or provide synergistic information about $C$ . The function is negative when variable $f^{+}$ and the variables in $S$ provide redundant information about $C$ . Furthermore, the function should be close to zero when variable $f^{+}$ provides irrelevant information with respect to the dependency between $S$ and $C$ [21].
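The sign behavior described above can be checked numerically. The sketch below is an illustration under stated assumptions: it evaluates Eq. (8) with plug-in entropy estimates in bits, skips the empty subset (whose entropy is zero), and uses hypothetical helper names. An XOR target gives a positive value (synergy), while three identical variables give a negative value (redundancy).

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(columns):
    """Joint Shannon entropy (bits) of a sequence of sample columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())


def interaction_information(columns):
    """Eq. (8): I(S) = -sum over non-empty T in S of (-1)^(|S|-|T|) H(T)."""
    s = len(columns)
    total = 0.0
    for k in range(1, s + 1):
        for subset in combinations(columns, k):
            total -= (-1) ** (s - k) * entropy(subset)
    return total


x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
xor = [a ^ b for a, b in zip(x, y)]  # depends on x and y only jointly

synergy = interaction_information([x, y, xor])
redundancy = interaction_information([y, y, y])
print(synergy, redundancy)
```

For two variables the same function reduces to the ordinary mutual information of Eq. (5), which is a quick consistency check on the sign convention.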

2.2 Mutual information based multi-label feature selection (MIMF)

In this subsection, we introduce the mutual information based multi-label feature selection method called MIMF proposed by Lee and Kim [7], and discuss its limitations. MIMF introduces multivariate mutual information and proposes the $b$ -degree feature selection algorithm for multi-label learning. The mutual information between the selected feature set $S=\{{f_{1},\ldots,f_{s}}\}$ and label set $L=\{{l_{1},\ldots,l_{q}}\}$ can be expressed as follows:

$\displaystyle I({S;L})=H(S)+H(L)-H({S,L}).$ (10)

To calculate the multivariate mutual information, each component in Eq. (10) is calculated using Eqs (2) and (8) as follows:

$\displaystyle H(S)=-\mathop{\sum}\limits_{k=1}^{s}({-1})^{k}V_{k}({S}^{\prime}),$ (11)

$\displaystyle H(L)=-\mathop{\sum}\limits_{k=1}^{q}({-1})^{k}V_{k}({L}^{\prime}),$ (12)

where $V_{k}(T)=\mathop{\sum}\limits_{X\in T_{k}}I(X)$ is the $k$ -degree interaction information, $k$ is the cardinality or the degree of mutual information, $T_{k}=\{{t{|}t\subseteq T,|t|=k}\}$ , and ${T}^{\prime}=\{t{|}t\subseteq T\}$ is the power set of $T$ . Likewise, $H({S,L})$ can be obtained as follows:

$\displaystyle H({S,L})=-\mathop{\sum}\limits_{k=1}^{s+q}({-1})^{k}V_{k}({\{{S,L}\}^{\prime}}),$ (13)

where $V_{k}({\{{S,L}\}^{\prime}})=\mathop{\sum}\limits_{p=0}^{k}V_{p}({S^{\prime}_{k-p}\times L^{\prime}_{p}})$ , $S^{\prime}_{k-p}$ and $L^{\prime}_{p}$ have the same meaning as $T^{\prime}_{k}=\{e{|}e\subseteq{T}^{\prime},|e|=k\}$ and $\times$ denotes the Cartesian product of two sets. Finally, the mutual information in Eq. (10) can be rewritten as follows:

$\displaystyle I({S;L})=\mathop{\sum}\limits_{k=2}^{s+q}\mathop{\sum}\limits_{p=1}^{k-1}({-1})^{k}V_{p}({S^{\prime}_{k-p}\times L^{\prime}_{p}}).$ (14)

When candidate feature $f^{+}$ is added to set $S$ , the increased mutual information is calculated by

$\displaystyle Q({\{{S,f^{+}}\},L})=I({\{{S,f^{+}}\};L})-I({S;L}).$ (15)

According to Eqs (9) and (14), Eq. (15) can be expressed as follows:

$\displaystyle Q({\{{S,f^{+}}\},L})=\mathop{\sum}\limits_{k=2}^{s+q+1}\mathop{\sum}\limits_{p=1}^{k-1}({-1})^{k}V_{k}({\{{f^{+}\times S^{\prime}_{k-p-1}}\}\times L^{\prime}_{p}}).$ (16)

However, it is computationally expensive to calculate a higher degree of multivariate mutual information. Therefore, to relax the score function, Lee and Kim [7] have proposed a $b$ -degree score function for the feature selection algorithm as follows:

$\displaystyle\tilde{Q}_{b}(\{S,f^{+}\},L)=\mathop{\sum}\limits_{k=2}^{b}\mathop{\sum}\limits_{p=1}^{k-1}(-1)^{k}V_{k}(\{f^{+}\times S^{\prime}_{k-p-1}\}\times L^{\prime}_{p}).$ (17)

One of MIMF’s limitations is that mutual information is only estimated on the whole input space and cannot exactly express the relevance between features and labels, as there may be redundant information among features for labels. In this paper, we focus on resolving this problem by developing DMI for a multi-label classification problem. DMI estimates the relevance information on unrecognized instances, not the whole input space. The DMI for feature selection in a single-label classification problem is introduced in detail in Section 2.3.

2.3 Dynamic mutual information (DMI)

The whole input space $O$ can be divided into two groups: recognized instances $O_{r}\subseteq O$ and unrecognized instances $O_{u}\subseteq O$ , where $O_{r}\mathop{\cap}\nolimits O_{u}=\emptyset$ and $O_{r}\mathop{\cup}\nolimits O_{u}=O$ . The recognized instances are those that the selected features classify entirely with respect to the class label $C$ . Let $F$ be a candidate feature set. Then, $O_{r}$ is induced by a candidate feature $f^{+}\in F$ , which distinguishes observations in the data matrix with respect to the class set. For example, consider a dataset with eight instances $O=\{{o_{1},o_{2},o_{3},o_{4},o_{5},o_{6},o_{7},o_{8}}\}$ . Suppose that a given selected feature has the equivalence class set $\{{\{{o_{1},o_{2},o_{4},o_{5}}\},\{{o_{3},o_{6},o_{8}}\},\{{o_{7}}\}}\}$ according to its values, and that the class label $C$ induces the partition $\{{\{{c_{1},c_{2},c_{7}}\},\{{c_{3},c_{6},c_{8}}\},\{{c_{4},c_{5}}\}}\}$ . Then, the recognized instance group is $O_{r}=\{{o_{3},o_{6},o_{8}}\}$ because $\{{o_{3},o_{6},o_{8}}\}=\{{c_{3},c_{6},c_{8}}\}$ ; that is, this equivalence class has the same indices as one block of the class label $C$ . The unrecognized instance group is then $O_{u}=\{{\{{o_{1},o_{2},o_{4},o_{5}}\},\{{o_{7}}\}}\}$ . Going one step further, suppose that a new selected feature $f_{\text{new}}\in S$ has the equivalence class set $\{{\{{o_{1},o_{2}}\},\{{o_{4},o_{5}}\},\{{o_{3},o_{6},o_{8}}\},\{{o_{7}}\}}\}$ , in which $\{{o_{1},o_{2},o_{4},o_{5}}\}$ is subdivided into $\{{o_{1},o_{2}}\}$ and $\{{o_{4},o_{5}}\}$ by $S$ . At this step, the recognized instance group is $O_{r}=\{{\{{o_{4},o_{5}}\},\{{o_{3},o_{6},o_{8}}\}}\}$ because $\{{o_{4},o_{5}}\}=\{{c_{4},c_{5}}\}$ , and the unrecognized instance group is $O_{u}=\{{\{{o_{1},o_{2}}\},\{{o_{7}}\}}\}$ . In other words, the instances with values that can clearly determine the class label $C$ are defined as the recognized instance group.
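The eight-instance example can be reproduced with a short script. This is a sketch under one assumption taken from the example itself: an equivalence class counts as recognized only when its index set equals one block of the class partition. The function names and the concrete feature and class values are hypothetical encodings of the partitions given above.

```python
def partition(values):
    """Group 1-based instance ids by value; returns a set of frozensets."""
    groups = {}
    for i, v in enumerate(values, start=1):
        groups.setdefault(v, set()).add(i)
    return {frozenset(g) for g in groups.values()}


def recognized(feature_values, class_values):
    """Union of feature equivalence classes that coincide with a class block."""
    class_blocks = partition(class_values)
    hits = [b for b in partition(feature_values) if b in class_blocks]
    return set().union(*hits)


# Class partition {1,2,7}, {3,6,8}, {4,5} encoded as arbitrary class names.
C = ["a", "a", "b", "c", "c", "b", "a", "b"]
# First feature: equivalence classes {1,2,4,5}, {3,6,8}, {7}.
f_old = [1, 1, 2, 1, 1, 2, 3, 2]
# Refined feature: {1,2}, {4,5}, {3,6,8}, {7}.
f_new = [1, 1, 2, 4, 4, 2, 3, 2]

O = set(range(1, 9))
O_r_first = recognized(f_old, C)
O_r_second = recognized(f_new, C)
print(sorted(O_r_first), sorted(O - O_r_first))
print(sorted(O_r_second), sorted(O - O_r_second))
```

The first call recovers instances 3, 6, and 8 as recognized, and the second adds instances 4 and 5, leaving 1, 2, and 7 unrecognized, as in the text.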

In traditional mutual information based feature selection algorithms, mutual information estimated on the whole input space cannot exactly express the relevance between features and labels. For this reason, Liu et al. [13] proposed DMI to estimate relevance information on unrecognized instances, not the whole input space. The utilized evaluation function is as follows:

$\displaystyle J({f^{+}|C,S})=\alpha\times I({C;f^{+}{|}S})-g({C,S,f^{+}}),$ (18)

where ${\alpha}$ is a coefficient that regulates the relative significance of the conditional mutual information, $C$ is the class label, $S$ is the selected feature set, and $f^{+}$ is a candidate feature to be selected. The penalty function $g({C,S,f^{+}})$ is a deviation function of $f^{+}$ and $S$ under $C$ and is expressed as follows:

$\displaystyle g({C,S,f^{+}})=({\beta/|S|})I({S;f^{+}})$ (19a)

$\displaystyle g({C,S,f^{+}})=\left({\frac{\beta}{|S|}}\right)\left({\frac{I({C;S})}{H(S)}}\right)I({S;f^{+}}),$ (19b)

where $0.5\leqslant\beta\leqslant 1$ . Equation (19a) is a typical example proposed by Battiti [22], while Eq. (19b) is an improved version of Eq. (19a) proposed by Huang et al. [23] for considering the relevance between the selected features and class labels.

Related to this phase, the following proposition has been described by Liu et al. [13]: suppose that $S$ and $F$ are selected and candidate feature subsets, respectively. For recognized instances $O_{r}$ , any feature $f^{+}\in F$ is irrelevant or redundant to the class label $C$ . Thereafter, any mutual information based criterion can be utilized on $O_{u}$ as an evaluation function.

2.4 Existing feature selection methods for multi-label learning

Studies on feature selection for multi-label learning have been actively conducted. There are two main approaches. One is the problem transformation method, which transforms a multi-label data representation into a single-label representation [1, 2, 3, 4]. The other is the algorithm adaptation method, which extends a single-label algorithm to directly handle multi-label data [5, 6, 7, 8, 9].

Despite its simplicity and algorithm independence, the problem transformation method has a main drawback: it loses a great deal of information about label dependency. For this reason, in this paper, we primarily focus on the algorithm adaptation method, which preserves the label relations and avoids the information loss caused by reconstructing labels.

3. Proposed algorithms: DMIML and DMIML-M

In this section, we provide a detailed discussion of the proposed algorithm based on the foundation presented in Section 2. We also present the concept of DMI for information theory, which is the fundamental principle behind the proposed feature selection algorithm. In addition, we dynamically use score functions to select important features.

3.1 Motivation

In the DMI algorithm, mutual information is recalculated only on the unrecognized instances $O_{u}$ , not the whole input space $O$ . However, the DMI algorithm for feature selection has only been applied to single-label learning. In this study, we redefine the recognized instances in our proposed method to be suitable for a multi-label learning problem. In multi-label learning with mutual information, high-degree (e.g., $b$ -degree) interactions among features and labels must be calculated. DMI can mitigate this calculation problem by controlling the input space. That is, because DMI accounts for the redundancy factor, the mutual information can capture the relevance factor more accurately. To apply DMI to feature selection for multi-label learning, we extend it by combining the multiple labels and refining the definition of recognized instances.

3.2 Extension of DMI to multi-label learning

First, we redefine the recognized instances in our proposed method. In DMI, single-label (i.e., single class) values are compared with candidate feature values. To perform an operation similar to DMI, we create a possible label combination set $P$ from the multi-label values. The set $P$ serves only as DMI-like class values for generating recognized instances, while the original multi-labels are used for calculating mutual information. As an example, consider three labels on eight instances: $l_{1}=({0,1,1,0,1,0,1,0})^{T}$ , $l_{2}=({1,0,0,0,1,1,0,0})^{T}$ , and $l_{3}=({0,1,0,1,0,0,1,1})^{T}$ . Each distinct combination of label values is assigned one index, so the possible combination set is $P=({1,2,3,4,5,1,2,4})^{T}$ , as illustrated in Fig. 3.

Figure 3.

Example of obtaining possible combination set.
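The construction of the possible combination set $P$ from the three example labels can be sketched as follows. The function name is hypothetical, and numbering the combinations in order of first appearance is an assumption chosen because it reproduces the $P$ given in the example.

```python
def label_combinations(*labels):
    """Map each instance's tuple of label values to a combination index."""
    index = {}
    P = []
    for combo in zip(*labels):
        if combo not in index:
            index[combo] = len(index) + 1  # next unused index, starting at 1
        P.append(index[combo])
    return P


l1 = [0, 1, 1, 0, 1, 0, 1, 0]
l2 = [1, 0, 0, 0, 1, 1, 0, 0]
l3 = [0, 1, 0, 1, 0, 0, 1, 1]

P = label_combinations(l1, l2, l3)
print(P)
```

Instances 1 and 6 share the combination (0, 1, 0), instances 2 and 7 share (1, 0, 1), and instances 4 and 8 share (0, 0, 1), which is why only five distinct indices appear.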

The next step of the proposed method is to divide the data into a recognized instance group and an unrecognized instance group using the possible combination set $P$ . If too many labels are present in the dataset, the number of partitions of $P$ increases exponentially. To mitigate this problem, we redefine the recognized instances in DMI. In DMI, recognized instances are defined only by the selected feature $f^{+}$ . In the proposed method, however, we redefine the recognized instances to depend on $\{{S,f^{+}}\}$ rather than only on $f^{+}$ . Using information from several features rather than only one, the recognized instances $O_{r}$ can be determined more accurately. That is, the selected feature set $S$ together with $f^{+}$ can explain a small portion of indices in the possible combination set $P$ that $f^{+}$ alone cannot explain. For more accurate mutual information, when there are recognized instances $O_{r}({\neq\emptyset})$ at an iteration, we estimate the mutual information with candidate $f^{+}$ on the new $O_{u}$ , not the previous one. In addition, observations whose equivalence class under $\{{S,f^{+}}\}$ occurs $\tau$ times or fewer are simply removed from the training dataset to improve performance.

3.3 Improvement of score function

Similar to MIMF, we employ a 3-degree score function D3 as $\tilde{Q}_{3}(\{S,f^{+}\},L)$ and its variants D2F as $\tilde{Q}_{2}^{f}(\{S,f^{+}\},L)$ and D2L as $\tilde{Q}_{2}^{l}(\{S,f^{+}\},L)$ for multi-label feature selection, as follows:

$\displaystyle\text{D3}:\mathop{\sum}\limits_{l_{i}\in L}I({f^{+},l_{i}})-\mathop{\sum}\limits_{f_{i}\in S}\mathop{\sum}\limits_{l_{j}\in L}I({f^{+},f_{i},l_{j}})-\mathop{\sum}\limits_{l_{i}\in L}\mathop{\sum}\limits_{l_{j}\in L}I({f^{+},l_{i},l_{j}})\text{ on }O_{u},$ (20a)

$\displaystyle\text{D2F}:\mathop{\sum}\limits_{l_{i}\in L}I({f^{+},l_{i}})-\mathop{\sum}\limits_{f_{i}\in S}\mathop{\sum}\limits_{l_{j}\in L}I({f^{+},f_{i},l_{j}})\text{ on }O_{u},$ (20b)

$\displaystyle\text{D2L}:\mathop{\sum}\limits_{l_{i}\in L}I({f^{+},l_{i}})-\mathop{\sum}\limits_{l_{i}\in L}\mathop{\sum}\limits_{l_{j}\in L}I({f^{+},l_{i},l_{j}})\text{ on }O_{u},$ (20c)

where D3 is a 3-degree approximation score function, D2F is a score function that emphasizes feature dependency, and D2L is a score function that focuses on the relation between pairs of labels. Estimation is performed on $O_{u}$ without observations in the equivalence class that occur $\tau$ times or less by $\{{S,f^{+}}\}$ . Because DMI takes redundancy information among features into account, we only utilize the D3 and D2L score functions. The proposed method is called DMIML_D3 or DMIML_D2L depending on the use of D3 or D2L. The proposed method is summarized in Algorithm 1. It should be noted that the benchmark method MIMF also uses the score function D2F.

Algorithm 1: Proposed Method: DMIML
Input: nfeat;
% the number of features to be selected
Procedure:
1: Initialize $S=\emptyset$ , $O_{u}=O$ , $O_{r}=\emptyset$ ;
2: repeat (3 $\sim$ 8)
3: for each feature $f^{+}\in F$ do
4: calculate information D3 or D2L on candidate new $O_{u}$ ;
5: choose the feature $f^{+}$ maximizing D3 or D2L;
6: $S=S\mathop{\cup}\nolimits\{{f^{+}}\}$ ; $F=F\backslash\{{f^{+}}\}$ ;
7: obtain $O_{r}$ induced by $\{{S,f^{+}}\}$ ;
8: $O_{u}=O_{u}\backslash O_{r}$ ;
9: until $|S|=$ nfeat
Output: $S$ ;
% the selected feature subset
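The control flow of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: the score is a stand-in that keeps only the relevance term $\sum_{l_{i}\in L}I({f^{+},l_{i}})$ shared by D3 and D2L (a full D3 or D2L score can be passed in through the hypothetical `score` argument); each candidate is scored on the current $O_{u}$ , which is one reading of step 4; recognized instances are the equivalence classes under the selected features that coincide with a block of the combination set $P$ ; and the loop stops early if $O_{u}$ becomes empty. All function names are hypothetical.

```python
from collections import defaultdict
from math import log2


def entropy(columns, rows):
    """Joint Shannon entropy (bits) of `columns`, estimated on `rows`."""
    counts = defaultdict(int)
    for r in rows:
        counts[tuple(col[r] for col in columns)] += 1
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())


def mutual_info(a, b, rows):
    """I(a;b) = H(a) + H(b) - H(a, b), restricted to `rows`."""
    return entropy([a], rows) + entropy([b], rows) - entropy([a, b], rows)


def relevance_score(f, labels, rows):
    """Stand-in score (assumption): relevance term of D3/D2L only."""
    return sum(mutual_info(f, l, rows) for l in labels)


def blocks(columns, rows):
    """Equivalence classes of `rows` induced by the joint values of `columns`."""
    groups = defaultdict(set)
    for r in rows:
        groups[tuple(col[r] for col in columns)].add(r)
    return {frozenset(g) for g in groups.values()}


def recognized(feature_columns, P, rows):
    """Rows whose equivalence class under the features equals a block of P."""
    label_blocks = blocks([P], rows)
    hits = [b for b in blocks(feature_columns, rows) if b in label_blocks]
    return set().union(*hits)


def dmiml(features, labels, P, nfeat, score=relevance_score):
    all_rows = range(len(P))
    unrecognized = set(all_rows)  # O_u starts as the whole input space
    selected, candidates = [], list(range(len(features)))
    # Assumption: stop early when no unrecognized instances remain.
    while len(selected) < nfeat and candidates and unrecognized:
        rows = sorted(unrecognized)
        best = max(candidates, key=lambda j: score(features[j], labels, rows))
        selected.append(best)
        candidates.remove(best)
        O_r = recognized([features[j] for j in selected], P, all_rows)
        unrecognized -= O_r  # O_u = O_u \ O_r
    return selected, unrecognized


l1 = [0, 1, 1, 0, 1, 0, 1, 0]
l2 = [1, 0, 0, 0, 1, 1, 0, 0]
l3 = [0, 1, 0, 1, 0, 0, 1, 1]
P = [1, 2, 3, 4, 5, 1, 2, 4]
features = [l1, l2, [0] * 8]  # toy features: two informative, one constant
selected, O_u = dmiml(features, [l1, l2, l3], P, nfeat=2)
print(selected, sorted(O_u))
```

In this toy run the second feature is chosen first, no instance is recognized by it alone, and adding the first feature recognizes five of the eight instances, so later scores would be computed on the three remaining rows only.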

As mentioned in Section 1, DMI handles redundant information and can generate recognized instances $O_{r}$ . When $O_{r}\neq\emptyset$ , redundant information is reflected clearly; when $O_{r}=\emptyset$ , however, redundant information should be considered in the score function, namely D3. We therefore propose another method that switches the score function depending on whether recognized instances exist. That is, when $O_{r}\neq\emptyset$ , the D2L score function is used because redundancy information is already considered by DMI. Conversely, when $O_{r}=\emptyset$ , the D3 score function is used because the redundancy information is not explicitly considered by DMI. This version of the proposed method is called DMIML-M and is presented in Algorithm 2.

Algorithm 2: Proposed Method – Mixed D3 and D2L: DMIML-M
Input: nfeat;
% the number of features to be selected
Procedure:
1: Initialize $S=\emptyset$ , $O_{u}=O$ , $O_{r}=\emptyset$ ;
2: repeat (3 $\sim$ 8)
3: for each feature $f^{+}\in F$ do
4: if $O_{r}\neq\emptyset$ then calculate information D2L on candidate new $O_{u}$ ;
else calculate information D3 on $O_{u}$ ;
5: choose the feature $f^{+}$ maximizing the score function;
6: $S=S\mathop{\cup}\nolimits\{{f^{+}}\}$ ; $F=F\backslash\{{f^{+}}\}$ ;
7: obtain $O_{r}$ induced by $\{{S,f^{+}}\}$ ;
8: $O_{u}=O_{u}\backslash O_{r}$ ;
9: until $|S|=$ nfeat
Output: $S$ ;
% the selected feature subset

Figure 4.

Example of assigning recognized instances with $\{{S,f^{+}}\}$ .

An example of selecting one feature among the candidate features is illustrated in Fig. 4. If candidate feature $f^{+}$ and the selected feature set $S$ generate recognized instances $O_{r}$ , such as the two colored instances in Fig. 4, then DMIML estimates the mutual information on the new unrecognized instances $O_{u}=O_{u}\backslash O_{r}$ in the subsequent feature selection step. Multi-labels produce many possible combinations; as a result, many rarely occurring combinations are unique to the training set, which leads to significant performance loss. Therefore, observations whose equivalence class under $\{{S,f^{+}}\}$ occurs $\tau$ times or fewer are simply removed from the training dataset to improve performance.
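The removal of rarely occurring observations can be illustrated with a small filter. This is a sketch with a hypothetical function name; it assumes that "rare" means the equivalence class induced by the joint values of the given feature columns contains $\tau$ or fewer observations.

```python
from collections import Counter


def keep_frequent(feature_columns, tau=1):
    """Row indices whose equivalence class under the features has more than tau rows."""
    keys = list(zip(*feature_columns))
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] > tau]


f_a = [0, 0, 1, 1, 2]
f_b = [0, 0, 1, 1, 0]

kept = keep_frequent([f_a, f_b], tau=1)
print(kept)  # the last row has a unique value combination and is dropped
```

With $\tau=1$ , only combinations that occur exactly once are discarded; larger values of $\tau$ prune more aggressively.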

4. Experiments and discussion

4.1 Datasets

In the experiments, we used the following four benchmark multi-label datasets from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository (https://sci2s.ugr.es/keel/multilabel.php#sub10). These datasets incorporate multiple labels and have various numbers of features.

Enron: this dataset contains e-mail messages that focus on business-related topics and the California energy crisis; very personal messages, jokes, and so on are avoided. It contains 1,702 observations with 1,001 categorical features and has 53 labels.

Genbase: this dataset contains proteins belonging to one or more labels from 27 classes of important protein families. The Prosite documentation ID number was used to represent the label. Similarly, the Prosite access number was used to represent the motif pattern. It contains 662 observations with 1,185 categorical features and has 27 labels.

Medical: this dataset contains a brief free-text summary of patient symptom histories with 45 classes for categorizing patients. It contains 978 observations with 1,449 categorical features and has 45 labels.

Scene: this dataset contains characteristics of images and their classes. The classes are beach, sunset, fall foliage, field, mountain, and urban; one image can belong to one or more labels. This dataset contains 2,407 observations with 294 numerical features and has six labels.

These benchmark multi-label datasets have no missing values. Table 1 summarizes the datasets used in this study together with their domains.

Table 1
Summary of multi-label datasets

Dataset	Observations	Features	Labels	Domain
Enron	1,702	1,001	53	Text categorization
Genbase	662	1,185	27	Bioinformatics
Medical	978	1,449	45	Healthcare
Scene	2,407	294	6	Semantic scene analysis

4.2 Experimental setting

The proposed algorithms DMIML and DMIML-M were compared with the benchmark method MIMF to evaluate the effects of using DMI. In MIMF, the score functions D3 and D2F were employed, as they were known to demonstrate higher performance than others, including D2L. For both the benchmark methods and proposed algorithms, we discretized the Scene dataset using an equal-width interval scheme with three bins to apply the mutual information based feature selection methods [24]. In addition, we set $\tau=1$ , which signifies that the observations that occurred only once by $\{{S,f^{+}}\}$ were removed from the training dataset to improve performance.
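Equal-width discretization with three bins can be sketched as below. This is an illustration only: the function name is hypothetical, the bin edges are taken from the minimum and maximum of each feature (an assumption, since the exact edge handling is not specified above), and the maximum value is assigned to the last bin.

```python
def equal_width_bins(values, n_bins=3):
    """Assign each value to one of n_bins equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant feature: a single bin
    width = (hi - lo) / n_bins
    # Clamp so that the maximum value falls into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


feature = [0.0, 0.1, 0.5, 0.9, 1.0]
bins = equal_width_bins(feature)
print(bins)
```

After this step, each numerical feature takes at most three distinct values, so the entropy and mutual information estimates for categorical variables apply directly.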

In this study, ML $k$ NN [25] was used as a multi-label classifier to measure classification performance. We set the number of nearest neighbors $k$ in ML $k$ NN to $k=3$ , and set the number of features to be selected to 70. In addition, we used two conventional evaluation measures: multi-label accuracy and Hamming loss. Multi-label accuracy is the strictest metric, representing the percentage of observations whose labels are all classified correctly.

$\displaystyle\text{Multi-label accuracy}=\frac{1}{N}\mathop{\sum}\limits_{i=1}^{N}|L_{i}\mathop{\cap}\nolimits\hat{L}_{i}|/|L_{i}\cup\hat{L}_{i}|.$ (21)

Hamming loss calculates the fraction of incorrect labels to the total number of labels.

$\displaystyle\text{Hamming loss}=\frac{1}{N\times q}\mathop{\sum}\limits_{i=1}^{N}\mathop{\sum}\limits_{j=1}^{q}\text{xor}(l_{ij},\hat{l}_{ij}).$ (22)
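Both measures can be computed directly from binary label matrices. The sketch below implements Eq. (21) and Eq. (22) as written; the function names are hypothetical, and treating an instance with no true and no predicted labels as a perfect match is an assumption made to avoid a division by zero.

```python
def multilabel_accuracy(Y_true, Y_pred):
    """Eq. (21): mean over instances of |L & L_hat| / |L | L_hat|."""
    total = 0.0
    for true_row, pred_row in zip(Y_true, Y_pred):
        true_set = {j for j, v in enumerate(true_row) if v}
        pred_set = {j for j, v in enumerate(pred_row) if v}
        union = true_set | pred_set
        # Assumption: empty true and predicted label sets count as 1.
        total += len(true_set & pred_set) / len(union) if union else 1.0
    return total / len(Y_true)


def hamming_loss(Y_true, Y_pred):
    """Eq. (22): fraction of label positions where truth and prediction differ."""
    n, q = len(Y_true), len(Y_true[0])
    errors = sum(t != p
                 for true_row, pred_row in zip(Y_true, Y_pred)
                 for t, p in zip(true_row, pred_row))
    return errors / (n * q)


Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 0]]

print(multilabel_accuracy(Y_true, Y_pred))
print(hamming_loss(Y_true, Y_pred))
```

In this toy case one of six label positions is wrong, so the Hamming loss is 1/6, while the per-instance ratios in Eq. (21) are 1/2 and 1.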

We used 5-fold cross-validation for all datasets, dividing each dataset randomly into five equal subsets. Each subset was used in turn as the test dataset, with the other four subsets used for training. This procedure was repeated ten times, generating a total of 50 results for each classifier and dataset.
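The resampling protocol can be sketched as a generator of train and test index lists. This is a minimal illustration, not the original experimental code: the function name and the fixed random seed are assumptions, and fold sizes differ by at most one when the number of observations is not divisible by five.

```python
import random


def repeated_kfold(n, k=5, repeats=10, seed=0):
    """Yield (train, test) index lists for `repeats` rounds of k-fold CV."""
    rng = random.Random(seed)  # seed chosen only for reproducibility
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k nearly equal subsets
        for i in range(k):
            test = folds[i]
            train = [j for m, fold in enumerate(folds) if m != i for j in fold]
            yield train, test


splits = list(repeated_kfold(662))  # 662 observations, as in Genbase
print(len(splits))
```

Each of the ten repetitions reshuffles the data before splitting, so the 50 train and test pairs are not simple copies of one another.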

4.3 Experimental results and discussion

4.3.1 Comparison with existing algorithm adaptation methods

To verify the performance of our proposed methods, we compared DMIML and DMIML-M to MIMF, which was proposed as an algorithm adaptation method. In the MIMF study, D3, D2F, D2L, and D2 were used as score functions in experiments, and D3 and D2F were empirically suggested as the final score functions. In this study, we use D3 and D2L: D2F emphasizes feature dependency, whereas feature redundancy is already handled by DMI.

In this subsection, we present the results of our experiments and discuss the findings. Figures 5 and 6 illustrate the classification performance according to the multi-label accuracy and Hamming loss (vertical axis) of four multi-label datasets from 1 to 70 selected features (horizontal axis).

Table 2
Comparison of classification performance with benchmark method

Performance measure	Dataset	DMIML_D3	DMIML_D2L	DMIML-M	MIMF_D3	MIMF_D2F
Multi-label accuracy	Enron	0.0992 (0.0025)	0.0638 (0.0107)	0.0675 (0.0054)	0.0773 (0.0028)	0.0892 (0.0152)
Multi-label accuracy	Genbase	0.9044 (0.0026)	0.9569 (0.0039)	0.9389 (0.0023)	0.8733 (0.0048)	0.7131 (0.0110)
Multi-label accuracy	Medical	0.5747 (0.0060)	0.5544 (0.0102)	0.5374 (0.0073)	0.5500 (0.0053)	0.3366 (0.0089)
Multi-label accuracy	Scene	0.4759 (0.0086)	0.3822 (0.0090)	0.4365 (0.0104)	0.4354 (0.0104)	0.2745 (0.0170)
Hamming loss	Enron	0.0502 (0.0003)	0.0534 (0.0003)	0.0521 (0.0002)	0.0516 (0.0003)	0.0518 (0.0002)
Hamming loss	Genbase	0.0039 (0.0004)	0.0020 (0.0003)	0.0027 (0.0003)	0.0062 (0.0003)	0.0145 (0.0005)
Hamming loss	Medical	0.0135 (0.0002)	0.0153 (0.0003)	0.0146 (0.0002)	0.0141 (0.0002)	0.0191 (0.0001)
Hamming loss	Scene	0.1263 (0.0016)	0.1392 (0.0015)	0.1365 (0.0014)	0.1390 (0.0013)	0.1568 (0.0015)

Figure 5.

Multi-label accuracy with benchmark methods.

Figure 6.

Hamming loss with benchmark methods.

In Figures 5 and 6, the black dotted line represents the performance of the original multi-label data. The results are also summarized in Table 2, with standard deviations in parentheses; the best performance for each dataset is boldfaced.
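The "mean (standard deviation)" entries of the tables can be produced from the per-fold results as sketched below. Whether the sample or the population standard deviation was used is not stated, so the sample version is an assumption here, and the three scores are hypothetical values chosen only to show the formatting.

```python
from statistics import mean, stdev

def summarize(scores):
    """Format scores as 'mean (std)' with four decimals, as in the tables."""
    # stdev() is the sample standard deviation (an assumption, see above).
    return f"{mean(scores):.4f} ({stdev(scores):.4f})"

# Hypothetical per-fold scores, not taken from the experiments.
summary = summarize([0.47, 0.48, 0.49])
print(summary)
```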

Overall, the proposed methods DMIML and DMIML-M lead to superior experimental results for most of the datasets, in particular, Enron, Genbase, and Medical. For Scene, the performance does not reach the performance of the original multi-label data; however, our proposed methods demonstrate better results than those of MIMF. In particular, for Enron and Genbase, the performance sharply increases after certain features are selected; this is a phenomenon not observed in MIMF without DMI.

In comparing DMIML and DMIML-M, we expected DMIML-M to have a better performance than DMIML_D3 and DMIML_D2L. However, DMIML_D3 was determined to be the most stable algorithm. DMIML-M displayed moderate performance regardless of the performance ranks of DMIML_D3 and DMIML_D2L. Table 2 indicates that DMIML has superior performance to that of benchmark methods MIMF_D3 and MIMF_D2F; therefore, it can be recommended as the final proposed method. Because it is unknown whether DMIML_D3 or DMIML_D2L has superior performance before learning, DMIML-M can be used for stable results.

4.3.2 Comparison with existing problem transformation methods

For a more detailed experimental comparison, we evaluated two additional feature selection methods for multi-label learning. These are problem transformation methods rather than algorithm adaptation methods. To further validate our proposed methods DMIML_D3 and DMIML_D2L, we compared them with ELA $+$ Chi [1] and PPT $+$ Chi [26], which rely on entropy-based label assignment and pruned problem transformation, respectively. Figures 7 and 8 illustrate classification performance based on multi-label accuracy and Hamming loss (vertical axis) for four multi-label datasets from 1 to 70 selected features (horizontal axis). The results are summarized in Table 3, with standard deviations in parentheses; the best performance for each dataset is boldfaced.
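To make the problem transformation idea concrete, the following sketch shows a simplified variant of pruned problem transformation followed by feature scoring. It assumes that "Chi" denotes the Pearson chi-square statistic and that rare label sets are simply discarded; the method in [26] can also reintroduce pruned instances with frequent label subsets, which is omitted here. ELA is not shown. The names `ppt_transform`, `chi2_score`, `rank_features`, and `min_count`, as well as the toy data, are illustrative assumptions.

```python
from collections import Counter

def ppt_transform(X, Y, min_count=2):
    """Map each label set to one class; drop instances with rare label sets."""
    classes = [frozenset(y) for y in Y]
    freq = Counter(classes)
    keep = [i for i, c in enumerate(classes) if freq[c] >= min_count]
    return [X[i] for i in keep], [classes[i] for i in keep]

def chi2_score(feature, target):
    """Pearson chi-square statistic for two discrete variables."""
    n = len(feature)
    f_cnt, t_cnt = Counter(feature), Counter(target)
    joint = Counter(zip(feature, target))
    score = 0.0
    for f, nf in f_cnt.items():
        for t, nt in t_cnt.items():
            expected = nf * nt / n
            score += (joint[(f, t)] - expected) ** 2 / expected
    return score

def rank_features(X, Y, min_count=2):
    """Rank feature indices by chi-square score on the transformed problem."""
    Xp, classes = ppt_transform(X, Y, min_count)
    n_features = len(Xp[0])
    scores = [chi2_score([row[j] for row in Xp], classes)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

# Toy data: feature 0 separates the two frequent label sets, feature 1 does not.
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0]]
Y = [{0}, {0}, {1}, {1}, {0, 1}]
ranking = rank_features(X, Y)
print(ranking)
```

Because the transformation replaces label sets with single classes and discards rare combinations, part of the label information is lost, which is the drawback of problem transformation noted in this paper.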

Table 3
Comparison of classification performance with ELA $+$ Chi and PPT $+$ Chi

Performance measure	Dataset	DMIML_D3	DMIML_D2L	ELA $+$ Chi	PPT $+$ Chi
Multi-label accuracy	Enron	0.0992 (0.0025)	0.0638 (0.0107)	0.0099 (0.0037)	0.0114 (0.0013)
	Genbase	0.9044 (0.0026)	0.9569 (0.0039)	0.9486 (0.0036)	0.9548 (0.0034)
	Medical	0.5747 (0.0060)	0.5544 (0.0102)	0.6443 (0.0080)	0.6508 (0.0067)
	Scene	0.4759 (0.0086)	0.3822 (0.0090)	0.3093 (0.0105)	0.2960 (0.0131)
Hamming loss	Enron	0.0502 (0.0003)	0.0534 (0.0003)	0.0579 (0.0001)	0.0582 (0.0002)
	Genbase	0.0039 (0.0004)	0.0020 (0.0003)	0.0023 (0.0002)	0.0020 (0.0002)
	Medical	0.0135 (0.0002)	0.0153 (0.0003)	0.0120 (0.0003)	0.0120 (0.0002)
	Scene	0.1263 (0.0016)	0.1392 (0.0015)	0.1429 (0.0015)	0.1442 (0.0012)

Figure 7.

Multi-label accuracy with ELA $+$ Chi and PPT $+$ Chi.

Figure 8.

Hamming loss with ELA $+$ Chi and PPT $+$ Chi.

We found that our proposed method generally achieved better results than the problem transformation methods ELA $+$ Chi and PPT $+$ Chi; the performance of ELA $+$ Chi and PPT $+$ Chi was high only for the Medical dataset. We thus empirically demonstrated that the proposed method can be used stably, whereas the performance of ELA $+$ Chi and PPT $+$ Chi was too low for the Enron and Scene datasets.

Figures 5–8 and Tables 2 and 3 indicate that, because of its stability and high performance, DMIML is suitable for multi-label datasets. Across these comparisons, DMIML outperformed the other benchmark methods, including MIMF as an algorithm adaptation method and ELA $+$ Chi and PPT $+$ Chi as problem transformation methods. Unlike the other methods, DMIML demonstrated superior performance for the majority of the multi-label datasets, and it performed particularly well on the Enron and Scene datasets. Therefore, we can conclude that DMIML outperformed the other benchmark methods in our experiments.

4.4 Case study for interpretation

It was observed that the proposed method DMIML significantly outperforms the benchmark methods MIMF, PPT $+$ Chi, and ELA $+$ Chi. To explain why considering DMI improves classification performance, we calculated the remaining entropies as a function of the number of selected features on the four multi-label datasets. Table 4 lists each label and its remaining entropy given the selected feature set for the proposed method DMIML and the benchmark method MIMF on the Scene data. For simplicity, we report the results for the Scene data, which has the fewest labels. The others are given as Tables A.1–6 in Appendix A. The Scene data comprises six labels, Beach, Sunset, Fall foliage, Field, Mountain, and Urban, denoted by $l_{1},l_{2},\ldots,l_{6}$, respectively.

Table 4
Remaining entropy for interpretation of DMIML and MIMF (on scene data)

# of $S$	Beach		Sunset		Fall foliage		Field		Mountain		Urban
	$l_{1}$		$l_{2}$		$l_{3}$		$l_{4}$		$l_{5}$		$l_{6}$
	Remaining entropy
	$H({l_{1}{|}S})$		$H({l_{2}{|}S})$		$H({l_{3}{|}S})$		$H({l_{4}{|}S})$		$H({l_{5}{|}S})$		$H({l_{6}{|}S})$
	DMIML	MIMF	DMIML	MIMF	DMIML	MIMF	DMIML	MIMF	DMIML	MIMF	DMIML	MIMF
0	0.6744	0.6744	0.6129	0.6129	0.6460	0.6460	0.6798	0.6798	0.7628	0.7628	0.6780	0.6780
2	0.5990	0.5990	0.3960	0.3960	0.5130	0.5130	0.4129	0.4129	0.6976	0.6976	0.5825	0.5825
4	0.4640	0.4705	0.2993	0.2969	0.4059	0.4682	0.3758	0.3787	0.6436	0.6690	0.5163	0.5215
6	0.3582	0.4187	0.2382	0.2539	0.3188	0.4052	0.2529	0.3483	0.5819	0.6024	0.4460	0.4568
8	0.2374	0.3907	0.1455	0.2109	0.2223	0.3378	0.1874	0.2978	0.4445	0.5481	0.3155	0.4068
10	0.1202	0.3281	0.0582	0.1735	0.1083	0.2650	0.1060	0.2693	0.2560	0.4625	0.1790	0.3235
15	0.0367	0.1572	0.0186	0.0860	0.0341	0.1452	0.0370	0.1483	0.1053	0.2524	0.0525	0.1590
20	0.0073	0.0856	0.0045	0.0474	0.0146	0.0887	0.0136	0.0818	0.0268	0.1649	0.0098	0.1025
25	0.0000	0.0469	0.0000	0.0249	0.0017	0.0485	0.0036	0.0462	0.0076	0.0859	0.0017	0.0566
30	0.0000	0.0226	0.0000	0.0115	0.0008	0.0241	0.0025	0.0273	0.0028	0.0396	0.0000	0.0282
40	0.0000	0.0139	0.0000	0.0045	0.0000	0.0102	0.0008	0.0137	0.0017	0.0237	0.0000	0.0184
50	0.0000	0.0078	0.0000	0.0028	0.0000	0.0066	0.0008	0.0083	0.0008	0.0116	0.0000	0.0097
60	0.0000	0.0036	0.0000	0.0017	0.0000	0.0038	0.0008	0.0063	0.0008	0.0025	0.0000	0.0020
70	0.0000	0.0011	0.0000	0.0008	0.0000	0.0008	0.0008	0.0008	0.0008	0.0008	0.0000	0.0011

According to Fano's theory [27], the error bound of classification performance decreases as the remaining conditional entropy decreases [7]. The entropy of $l_{q}$ conditioned on $S$, $H({l_{q}{|}S})$, decreases with the degree of dependency between the selected feature set $S$ and each label $l_{q}$. Because $H({l_{q}{|}S}) = H({l_{q}}) - I({S;l_{q}})$, a larger mutual information between the selected features and the labels, which is what the $I({S;L})$ value measures, means a smaller remaining entropy; the proposed method DMIML therefore sequentially decreases the remaining entropy of each label by selecting a truly important feature at each iteration.
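The remaining entropy reported in Table 4 can be estimated as sketched below. The sketch assumes discrete (or discretized) features, a plug-in estimate from observed frequencies, and base-2 logarithms, which is consistent with the tabulated entropies of binary labels never exceeding 1. The function names and the toy data are illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Plug-in Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def remaining_entropy(X, label, selected):
    """H(label | S): entropy averaged over groups sharing the same values on S."""
    groups = defaultdict(list)
    for row, y in zip(X, label):
        groups[tuple(row[j] for j in selected)].append(y)
    n = len(label)
    return sum(len(g) / n * entropy(g) for g in groups.values())

# Toy data: the label equals feature 0 and is independent of feature 1.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
label = [0, 0, 1, 1]
h_empty = remaining_entropy(X, label, [])    # no feature selected: H(label)
h_f1 = remaining_entropy(X, label, [1])      # uninformative feature
h_f0 = remaining_entropy(X, label, [0])      # fully informative feature
print(h_empty, h_f1, h_f0)
```

In this toy case, selecting the informative feature removes all remaining entropy, whereas the uninformative one leaves it unchanged, which mirrors how the columns of Table 4 should be read.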

According to Table 4, the remaining entropy is significantly reduced from the initial entropy of each label, which indicates that the proposed method DMIML successfully reduces the entropy of each label. Moreover, with DMIML the remaining entropy reaches 0.0000 for Beach and Sunset when 25 features are selected, and for Beach, Sunset, Fall foliage, and Urban when 40 features are selected. In contrast, although the benchmark method MIMF also reduces the label entropies, its remaining entropy is still nonzero for every label at 70 selected features, which confirms that MIMF does not select features that utilize entropy as well as those selected by DMIML. That is, handling the input space through DMI works in practice for finding the features that describe the labels. For the other three benchmark multi-label datasets, the proposed method DMIML also utilized entropy better than MIMF.
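The comparison can be made concrete with the Beach ($l_{1}$) columns of Table 4. The sketch below uses only values copied from that table; the helper names `first_size_below` and `fraction_removed` and the threshold of 0.05 are illustrative assumptions.

```python
sizes = [0, 2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70]
# Remaining entropy H(l_1 | S) for Beach, copied from Table 4.
beach_dmiml = [0.6744, 0.5990, 0.4640, 0.3582, 0.2374, 0.1202, 0.0367,
               0.0073, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]
beach_mimf = [0.6744, 0.5990, 0.4705, 0.4187, 0.3907, 0.3281, 0.1572,
              0.0856, 0.0469, 0.0226, 0.0139, 0.0078, 0.0036, 0.0011]

def first_size_below(sizes, entropies, threshold):
    """Smallest tabulated |S| whose remaining entropy is <= threshold."""
    for s, h in zip(sizes, entropies):
        if h <= threshold:
            return s
    return None  # never reached within the tabulated range

def fraction_removed(entropies, index):
    """Share of the initial entropy removed at the given table row."""
    return 1 - entropies[index] / entropies[0]

row_10 = sizes.index(10)
print(first_size_below(sizes, beach_dmiml, 0.05),
      first_size_below(sizes, beach_mimf, 0.05))
print(round(fraction_removed(beach_dmiml, row_10), 4),
      round(fraction_removed(beach_mimf, row_10), 4))
```

With these values, DMIML falls below the 0.05 threshold at 15 selected features and MIMF at 25, and at 10 selected features DMIML has removed roughly 82% of the initial Beach entropy compared with roughly 51% for MIMF.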

Comparing how quickly the proposed method DMIML and the benchmark method MIMF find truly important features confirms that DMIML, which uses DMI, utilizes the entropy of labels more efficiently. That is, the proposed method DMIML can find important features more accurately than existing methods on multi-label data with mixed information.

5. Conclusions

In this paper, we proposed DMIML and DMIML-M, feature selection methods for multi-label learning based on DMI. The proposed approach makes two contributions to the field. First, it extends DMI to multi-label learning. This novel capability allows accurate information among the features and labels of multi-label datasets to be estimated by removing redundant information. In addition, observations that occur $\tau$ times by $\{{S,f^{+}}\}$ in the equivalence class are simply removed from the training dataset to improve performance.

Second, in terms of multi-label accuracy and Hamming loss, DMIML demonstrates more stable classification performance for most of the benchmark multi-label datasets than the benchmark methods MIMF (algorithm adaptation method) and ELA $+$ Chi and PPT $+$ Chi (problem transformation methods). Compared with MIMF, our proposed method demonstrates superior performance for nearly all multi-label datasets, which is explained by its faster reduction of the remaining entropy. The same holds for ELA $+$ Chi and PPT $+$ Chi, with the exception of the Medical dataset. However, ELA $+$ Chi and PPT $+$ Chi are not suitable for general use because their performance on the Enron and Scene datasets is too low. Consequently, the algorithm adaptation method is more effective because it makes full use of the information in the features and labels without requiring label reconstruction, which can cause information loss.

Although it demonstrates better performance, the proposed method has several limitations, which will be the focus of future work. First, although it can effectively estimate information among features in multi-label datasets, the proposed method is a filter ranking method that only orders features by significance, so users who need a specific subset of important features must do additional work. Hence, we plan to combine our proposed method with an optimal feature subset selection algorithm to obtain a globally optimal feature subset. Second, we intend to add more multi-label learners or classifiers and compare their performance. Finally, as one of the most important tasks, we aim to investigate why the proposed method shows higher performance than the benchmark methods on certain multi-label datasets.

Acknowledgments

This work was supported by a grant of the National Research Foundation of Korea (Project Number: 2023R1A2C1002911).

Appendix A

Table A.1

Remaining entropy for interpretation of proposed method DMIML (on enron data)

$S$	$l_{5}$	$l_{10}$	$l_{15}$	$l_{20}$	$l_{25}$	$l_{30}$	$l_{35}$	$l_{40}$	$l_{45}$	$l_{53}$
	Remaining entropy
	$H({l_{5}{|}S})$	$H({l_{10}{|}S})$	$H({l_{15}{|}S})$	$H({l_{20}{|}S})$	$H({l_{25}{|}S})$	$H({l_{30}{|}S})$	$H({l_{35}{|}S})$	$H({l_{40}{|}S})$	$H({l_{45}{|}S})$	$H({l_{53}{|}S})$
0	0.3410	0.1105	1.0000	0.0959	0.2658	0.3765	0.0996	0.6005	0.3893	0.0187
2	0.3365	0.1089	0.9898	0.0957	0.2543	0.3472	0.0972	0.5861	0.3612	0.0179
4	0.3257	0.1045	0.9389	0.0869	0.2332	0.2391	0.0906	0.5279	0.3478	0.0174
6	0.3110	0.1002	0.9068	0.0830	0.2185	0.1952	0.0775	0.4806	0.3327	0.0171
8	0.2883	0.0943	0.8207	0.0720	0.1877	0.1381	0.0663	0.3027	0.3057	0.0107
10	0.2721	0.0914	0.7701	0.0645	0.1709	0.1115	0.0582	0.2787	0.2836	0.0095
15	0.2123	0.0744	0.6294	0.0386	0.1164	0.0699	0.0344	0.2002	0.1497	0.0047
20	0.1512	0.0619	0.4533	0.0228	0.0793	0.0345	0.0236	0.1450	0.1112	0.0000
25	0.1031	0.0467	0.3153	0.0149	0.0603	0.0253	0.0139	0.1070	0.0846	0.0000
30	0.0844	0.0292	0.2427	0.0083	0.0513	0.0167	0.0094	0.0825	0.0746	0.0000
40	0.0543	0.0207	0.1366	0.0075	0.0349	0.0096	0.0069	0.0497	0.0473	0.0000
50	0.0424	0.0197	0.1035	0.0074	0.0316	0.0091	0.0056	0.0468	0.0448	0.0000
60	0.0376	0.0195	0.0930	0.0071	0.0298	0.0091	0.0055	0.0468	0.0390	0.0000
70	0.0352	0.0177	0.0862	0.0065	0.0261	0.0091	0.0042	0.0451	0.0318	0.0000

Table A.2

Remaining entropy for interpretation of benchmark method MIMF (on enron data)

$S$	$l_{5}$	$l_{10}$	$l_{15}$	$l_{20}$	$l_{25}$	$l_{30}$	$l_{35}$	$l_{40}$	$l_{45}$	$l_{53}$
	Remaining entropy
	$H({l_{5}{|}S})$	$H({l_{10}{|}S})$	$H({l_{15}{|}S})$	$H({l_{20}{|}S})$	$H({l_{25}{|}S})$	$H({l_{30}{|}S})$	$H({l_{35}{|}S})$	$H({l_{40}{|}S})$	$H({l_{45}{|}S})$	$H({l_{53}{|}S})$
0	0.3410	0.1105	1.0000	0.0959	0.2658	0.3765	0.0996	0.6005	0.3893	0.0187
2	0.3325	0.1072	0.8915	0.0948	0.2269	0.2211	0.0977	0.4950	0.2597	0.0177
4	0.3241	0.1051	0.8716	0.0904	0.2177	0.2036	0.0929	0.4805	0.2456	0.0160
6	0.3165	0.0995	0.8509	0.0852	0.1997	0.1682	0.0813	0.4528	0.2365	0.0137
8	0.2894	0.0946	0.8147	0.0794	0.1845	0.1123	0.0696	0.3911	0.2230	0.0131
10	0.2631	0.0886	0.7672	0.0679	0.1651	0.0835	0.0586	0.3513	0.2036	0.0092
15	0.2149	0.0736	0.6185	0.0446	0.1211	0.0446	0.0428	0.2763	0.1606	0.0057
20	0.1525	0.0593	0.4852	0.0301	0.0858	0.0287	0.0296	0.2144	0.1257	0.0023
25	0.1056	0.0417	0.3356	0.0152	0.0575	0.0228	0.0153	0.1326	0.0897	0.0000
30	0.0782	0.0348	0.2533	0.0097	0.0470	0.0169	0.0126	0.1082	0.0732	0.0000
40	0.0643	0.0262	0.1728	0.0060	0.0453	0.0149	0.0084	0.0860	0.0615	0.0000
50	0.0523	0.0207	0.1277	0.0057	0.0383	0.0091	0.0059	0.0792	0.0462	0.0000
60	0.0458	0.0189	0.1066	0.0056	0.0327	0.0091	0.0057	0.0671	0.0408	0.0000
70	0.0427	0.0171	0.0899	0.0042	0.0287	0.0091	0.0056	0.0549	0.0332	0.0000

Table A.3

Remaining entropy for interpretation of benchmark method MIMF (on medical data)

$S$	$l_{4}$	$l_{9}$	$l_{13}$	$l_{18}$	$l_{22}$	$l_{27}$	$l_{31}$	$l_{36}$	$l_{40}$	$l_{45}$
	Remaining entropy
	$H({l_{4}{|}S})$	$H({l_{9}{|}S})$	$H({l_{13}{|}S})$	$H({l_{18}{|}S})$	$H({l_{22}{|}S})$	$H({l_{27}{|}S})$	$H({l_{31}{|}S})$	$H({l_{36}{|}S})$	$H({l_{40}{|}S})$	$H({l_{45}{|}S})$
0	0.0212	0.0116	0.0539	0.0685	0.1265	0.0116	0.1144	0.1552	0.1144	0.2602
2	0.0194	0.0100	0.0421	0.0612	0.1159	0.0100	0.1054	0.1423	0.1007	0.2508
4	0.0188	0.0099	0.0415	0.0588	0.1110	0.0099	0.1012	0.1369	0.0713	0.2422
6	0.0180	0.0069	0.0405	0.0553	0.1065	0.0097	0.0972	0.1250	0.0680	0.2308
8	0.0164	0.0058	0.0319	0.0534	0.1012	0.0083	0.0903	0.1147	0.0642	0.2235
10	0.0161	0.0058	0.0276	0.0521	0.0948	0.0081	0.0828	0.1112	0.0585	0.2172
15	0.0154	0.0053	0.0221	0.0246	0.0806	0.0072	0.0668	0.0849	0.0521	0.1850
20	0.0120	0.0053	0.0145	0.0213	0.0655	0.0063	0.0581	0.0636	0.0396	0.1561
25	0.0113	0.0040	0.0099	0.0124	0.0525	0.0033	0.0441	0.0462	0.0288	0.1269
30	0.0112	0.0037	0.0056	0.0095	0.0413	0.0028	0.0394	0.0315	0.0152	0.1070
40	0.0081	0.0000	0.0049	0.0061	0.0357	0.0028	0.0334	0.0307	0.0115	0.0893
50	0.0040	0.0000	0.0041	0.0055	0.0311	0.0000	0.0319	0.0231	0.0097	0.0776
60	0.0040	0.0000	0.0041	0.0054	0.0263	0.0000	0.0272	0.0178	0.0077	0.0724
70	0.0037	0.0000	0.0020	0.0053	0.0259	0.0000	0.0266	0.0125	0.0077	0.0685

Table A.4

Remaining entropy for interpretation of benchmark method MIMF (on genbase data)

$S$	$l_{2}$	$l_{5}$	$l_{7}$	$l_{10}$	$l_{12}$	$l_{15}$	$l_{17}$	$l_{20}$	$l_{22}$	$l_{27}$
	Remaining entropy
	$H({l_{2}{|}S})$	$H({l_{5}{|}S})$	$H({l_{7}{|}S})$	$H({l_{10}{|}S})$	$H({l_{12}{|}S})$	$H({l_{15}{|}S})$	$H({l_{17}{|}S})$	$H({l_{20}{|}S})$	$H({l_{22}{|}S})$	$H({l_{27}{|}S})$
0	0.5142	0.8242	0.2728	0.4680	0.2595	0.1478	0.1722	0.0641	0.0418	0.0418
2	0.4354	0.0154	0.2421	0.0000	0.0986	0.1342	0.1556	0.0593	0.0389	0.0389
4	0.0409	0.0143	0.2265	0.0000	0.0421	0.1266	0.1377	0.0541	0.0358	0.0260
6	0.0000	0.0137	0.0585	0.0000	0.0421	0.0574	0.0689	0.0245	0.0338	0.0260
8	0.0000	0.0130	0.0544	0.0000	0.0416	0.0571	0.0689	0.0245	0.0318	0.0258
10	0.0000	0.0000	0.0535	0.0000	0.0416	0.0371	0.0641	0.0224	0.0314	0.0255
15	0.0000	0.0000	0.0520	0.0000	0.0386	0.0275	0.0615	0.0208	0.0306	0.0254
20	0.0000	0.0000	0.0516	0.0000	0.0386	0.0275	0.0606	0.0208	0.0304	0.0254
25	0.0000	0.0000	0.0499	0.0000	0.0294	0.0236	0.0596	0.0208	0.0296	0.0250
30	0.0000	0.0000	0.0491	0.0000	0.0294	0.0195	0.0574	0.0166	0.0292	0.0250
40	0.0000	0.0000	0.0483	0.0000	0.0294	0.0134	0.0564	0.0125	0.0288	0.0250
50	0.0000	0.0000	0.0480	0.0000	0.0294	0.0134	0.0555	0.0125	0.0287	0.0250
60	0.0000	0.0000	0.0476	0.0000	0.0294	0.0134	0.0555	0.0125	0.0000	0.0249
70	0.0000	0.0000	0.0469	0.0000	0.0288	0.0134	0.0532	0.0125	0.0000	0.0000

Table A.5

Remaining entropy for interpretation of proposed method DMIML (on medical data)

$S$	$l_{4}$	$l_{9}$	$l_{13}$	$l_{18}$	$l_{22}$	$l_{27}$	$l_{31}$	$l_{36}$	$l_{40}$	$l_{45}$
	Remaining entropy
	$H({l_{4}{|}S})$	$H({l_{9}{|}S})$	$H({l_{13}{|}S})$	$H({l_{18}{|}S})$	$H({l_{22}{|}S})$	$H({l_{27}{|}S})$	$H({l_{31}{|}S})$	$H({l_{36}{|}S})$	$H({l_{40}{|}S})$	$H({l_{45}{|}S})$
0	0.0212	0.0116	0.0539	0.0685	0.1265	0.0116	0.1144	0.1552	0.1144	0.2602
2	0.0194	0.0100	0.0421	0.0612	0.1159	0.0100	0.1054	0.1423	0.1007	0.2508
4	0.0183	0.0100	0.0373	0.0588	0.1069	0.0100	0.0976	0.1042	0.0964	0.2360
6	0.0127	0.0093	0.0362	0.0558	0.0991	0.0093	0.0907	0.0989	0.0680	0.2313
8	0.0123	0.0093	0.0358	0.0538	0.0940	0.0093	0.0862	0.0849	0.0668	0.2232
10	0.0119	0.0091	0.0357	0.0505	0.0849	0.0091	0.0783	0.0849	0.0646	0.0458
15	0.0082	0.0072	0.0185	0.0407	0.0622	0.0067	0.0576	0.0619	0.0441	0.0318
20	0.0077	0.0040	0.0049	0.0181	0.0438	0.0020	0.0445	0.0350	0.0176	0.0230
25	0.0070	0.0000	0.0041	0.0157	0.0356	0.0000	0.0354	0.0231	0.0146	0.0159
30	0.0070	0.0000	0.0041	0.0079	0.0235	0.0000	0.0286	0.0199	0.0110	0.0150
40	0.0041	0.0000	0.0020	0.0071	0.0134	0.0000	0.0172	0.0129	0.0110	0.0098
50	0.0020	0.0000	0.0020	0.0057	0.0056	0.0000	0.0028	0.0050	0.0090	0.0082
60	0.0020	0.0000	0.0020	0.0033	0.0049	0.0000	0.0020	0.0033	0.0069	0.0082
70	0.0020	0.0000	0.0000	0.0028	0.0041	0.0000	0.0020	0.0033	0.0020	0.0082

References

1. Chen, Yan, Zhang, Chen and Yang, Document transformation for multi-label feature selection in text categorization, in: ICDM, IEEE, 2007, pp. 451–456.
2. Read, Scalable multi-label classification, Doctoral dissertation, University of Waikato, 2010.
3. Doquire and Verleysen, Feature selection for multi-label classification problems, in: International Work-Conference on Artificial Neural Networks, Springer, Berlin, 2011, pp. 9–16.
4. Sechidis, Nikolaou and Brown, Information theoretic feature selection in multi-label data through composite likelihood, in: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, Berlin, 2014, pp. 143–152.
5. Doquire and Verleysen, Mutual information-based feature selection for multilabel classification, Neurocomputing 122 (2013), 148–155.
6. Liu, Duan, Zhou and Zhao, Multi-label feature selection via information gain, in: International Conference on Advanced Data Mining and Applications, Springer, Cham, 2014, pp. 345–355.
7. Lee and Kim D.W., Mutual information-based multi-label feature selection using interaction information, Expert Systems with Applications 42(4) (2015), 2013–2025.
8. Pereira R.B., Carvalho A.P.D., Zadrozny and Merschmann L.H.C., Information gain feature selection for multi-label classification, Journal of Information and Data Management 6(1) (2015), 48–58.
9. Pereira R.B., Plastino, Zadrozny and Merschmann L.H.C., A lazy feature selection method for multi-label classification, Intelligent Data Analysis 25(1) (2021), 21–34.
10. Saeys, Inza and Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23(19) (2007), 2507–2517.
11. Guyon and Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research 3 (2003), 1157–1182.
12. Blum A.L. and Langley, Selection of relevant features and examples in machine learning, Artificial Intelligence 97(1–2) (1997), 245–271.
13. Liu, Sun, Liu and Zhang, Feature selection with dynamic mutual information, Pattern Recognition 42(7) (2009), 1330–1339.
14. Chen, Zhang, Huang, Ran, Zhong and Lyu, Feature selection with redundancy-complementariness dispersion, Knowledge-Based Systems 89 (2015), 203–217.
15. Shishkin, Bezzubtseva, Drutsa, Shishkov, Gladkikh, Gusev and Serdyukov, Efficient high-order interaction-aware feature selection based on conditional mutual information, in: Advances in Neural Information Processing Systems, 2016, pp. 4637–4645.
16. Pedrycz and Lang, Selecting discrete and continuous features based on neighborhood decision error minimization, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 40(1) (2010), 137–150.
17. Zhao and Qin, Mixed feature selection in incomplete decision table, Knowledge-Based Systems 57 (2014), 181–190.
18. Kim K.J. and Jun C.-H., Rough set model based feature selection for mixed-type data with feature space decomposition, Expert Systems with Applications 103 (2018), 196–205.
19. Shannon C.E., A mathematical theory of communication, Bell System Technical Journal 27(3) (1948), 379–423.
20. McGill, Multivariate information transmission, Transactions of the IRE Professional Group on Information Theory 4(4) (1954), 93–111.
21. Timme, Alford, Flecker and Beggs J.M., Multivariate information measures: an experimentalist’s perspective, arXiv preprint arXiv:1111.6857, 2011.
22. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Transactions on Neural Networks 5(4) (1994), 537–550.
23. Huang, Cai and Xu, A hybrid genetic algorithm for feature selection wrapper based on mutual information, Pattern Recognition Letters 28(13) (2007), 1825–1844.
24. Dougherty, Kohavi and Sahami, Supervised and unsupervised discretization of continuous features, in: Machine Learning Proceedings, Morgan Kaufmann, 1995, pp. 194–202.
25. Zhang M.L. and Zhou Z.H., ML-KNN: A lazy learning approach to multi-label learning, Pattern Recognition 40(7) (2007), 2038–2048.
26. Read, A pruned problem transformation method for multi-label classification, in: Proc. 2008 New Zealand Computer Science Research Student Conference (NZCSRS 2008), 2008, pp. 143–150.
27. Fano R.M. and Hawkins, Transmission of information: A statistical theory of communications, American Journal of Physics 29 (1961), 793–794.
