Abstract
High-dimensional multi-label data is widespread in practical applications, which brings great challenges to the research field of pattern recognition and machine learning. Many feature selection algorithms have been proposed in recent years, among which the filtering feature selection algorithm is the most popular one because of its simplicity. Therefore, filtering feature selection has become a hot research topic, especially the multi-label feature selection algorithm based on mutual information. In the algorithm, the computation cost of high dimensional mutual information is expensive. How to approximate high order mutual information based on low order mutual information has become a major research direction. To our best knowledge, all existing feature selection algorithms that consider the label correlation will increase the computational cost greatly. Therefore, this paper proposes an approximation method of three-dimensional interaction information, which is applied to the calculation of correlation and redundancy. It can take the correlation of labels into account and don’t increase the computation cost significantly at the same time. Experiments analysis results show that the proposed method is effective.
Introduction
With the rapid development of technology, data information became more and more complex, for example multi label, high dimension, multi-modal data [1]. The actual requirements are no longer limited to single label data. On the other hand, multi-label data is becoming more and more widespread. Therefore, in recent years, multi-label learning has attracted extensive attention in the fields of data mining and machine learning, such as text classification [2], image semantic annotation [3], gene classification [4], multi-media [5], and so on.
Due to the increasing number of features in multi-label data, multi-label learning also faces the problem of dimension curse [6]. High-dimensional multi-label data usually contains a large number of redundant and irrelevant features, which affects the performance of the multi-label classifier and increases the training cost of the classifier.
As a dimensionality reduction technique, multi-label feature selection can effectively solve the above problems. It can reduce the training cost of classifier, simplify the model, and prevent overfitting [1]. Multi-label feature selection is usually divided into three types, i.e. filter type, embedding type and wrapper type. In general, the wrapper algorithm can select high-quality features. However, considering the diversity of feature combinations and the classifier training for each selection, the computation cost of wrapper feature selection algorithm is very expensive. Embedding feature selection algorithm combines feature selection with classifier together. Filter feature selection is independent of classifier and selects feature subset through a specific evaluation metric, such as Cosine correlation [7], Pearson correlation coefficient [7], information theory [8, 9, 10, 11, 12, 13], ReliefF [14], symmetrical uncertainty [15] and so on.
Pearson correlation coefficient can only measure the linear correlation between the variables. Information measurement, for example mutual information, can measure both the linear correlation and the nonlinear correlation between the variables. Based on mutual information, many efficient multi-label feature selection algorithms have been proposed.
The goal of multi-label feature selection based on mutual information is to select a feature subset to maximize the mutual information between the feature subset and the label set. The mutual information between feature subset and label set is called joint mutual information. It is difficult to calculate the accurate joint mutual information value. The computational cost increases rapidly with the increase of feature dimensions. It cannot meet the requirements of practical application [8]. Therefore, the trends of multi-label feature selection based on mutual information is to approximate the high-dimensional mutual information, such as the joint mutual information between features and labels and that between features [8, 13, 9, 11, 10, 12]. However, the problem of these algorithms is that most of them ignore the correlation between labels [8, 16, 9]. In some algorithms, the correlation between labels is calculated directly through calculating the interactive information or three-dimensional joint entropy [13, 12]. These methods are computationally intensive. In addition, the redundancy is usually measured by calculating the mutual information accumulation between features, and the redundancy between selected feature set is not considered [8, 10, 16, 11]. Only a few algorithms take the correlation between feature sets into consideration, which also involves the calculation of high-dimensional joint entropy or high-dimensional joint probability [9]. They are computationally intensive also.
For realizing efficient multi-label feature selection, a new correlation measurement based on the proportion hypothesis is proposed in the paper. The label correlation is considered. A new correlation redundancy measurement method considering the correlation between selected feature sets is proposed also.
The main contributions of the paper are as follows:
An approximate estimation method of 3D interaction information based on the assumption of information proportion is proposed. Based on the hypothesis of contributions 1), a correlation measure is proposed, which takes the correlation between labels into account and avoids the calculation of high-dimensional joint entropy or joint probability. Based on the hypothesis of contributions 1), a redundant measure is proposed, which takes the correlation between selected feature sets into account and avoids the calculation of high-dimensional joint entropy or joint probability. It is compared with state-of-the-art feature selection algorithms on 6 public data sets to prove the effectiveness of the proposed method.
The following is the arrangement of the contents of this paper. We cover the background information, the related formulas, and some definitions in Section 2. The related work is introduced in Section 3. In Section 4, the multi-label feature selection algorithm proposed in this paper is introduced. In Section 5, the experiment is designed and the experimental results are analyzed. In Section 6, some conclusions are summarized.
The information theories used in this paper are all based on Shannon entropy [17], which measures the amount of information of a variable. Given discrete random variables
The joint entropy measures the amount of information about two variables. The joint entropy is defined as follows:
where
Extending to multiple variables, the joint entropy is as follows.
where
The conditional entropy quantifies the amount of information needed to describe the outcome of a random variable given that the value of another random variable is known. The conditional entropy is defined as follows:
where
In a similar way
Mutual information measures the amount of information shared between two variables. It is defined as follows.
The relationship between mutual information and entropy is as follows.
In a similar way
It is easy get the following result.
The conditional mutual information measures the correlation between two random variables
The joint mutual information measures the correlation between two random variables (
It is easy get the following result.
The interaction information is used to measure the correlation among three random variables. And it is defined as follows:
It is easy to get the following result:
Entropy, conditional entropy, joint entropy, mutual information, joint mutual information and conditional mutual information are all none negative values, but interaction information can be positive or negative or zero. When the interaction information is negative, the random variables X and Y together can provide Z with information that they cannot provide alone, so it is called synergy. When the interaction information is positive, it means the common information between them, so it is redundant. When the interaction information is zero, it is considered that the introduction of one variable will not affect the relationship between the other two variables.
Multi label classification
Up to now, many multi-label learning algorithms have been proposed. These algorithms fall into three broad categories based on the use of label correlation [18], i.e. first-order strategy, second-order strategy, high-order strategy. First-order strategy does not consider the babel dependencies. The multi-label classification problem is usually transformed into several independent binary classification problems. For example, binary relevance (BR) approach [3], ML-kNN [19], and sparse weighted instance-based multilabel (SWIM) [20]. The advantages of first-order strategy are easy calculation and easy to model. However, it does not consider the correlation between labels, which may eventually lead to poor performance. Second-order strategy considers pair-to-pair label relationships, for example, RankSVM [4], Calibrated Multilabel Perceptrons (CMLPC) [21], Correlative Multi-label (CML) [22]. The generalization performance of second-order strategy is usually excellent. But some application scenarios may have high-order label correlation, which is contrary to the assumption that only second-order label correlation exists. In the high-order strategy, the model can usually express the label correlation better, but the complexity of the model will be higher, for example classifier chains (CCs) [21], ensembles of pruned set (EPS) [22].
Multi-label feature selection
Up to now, multi-label feature selection algorithms (filter, embedding, wrapper) have developed quickly. In terms of wrapper methods, Gharroudi et al. proposed Binary Relevance Random Forest (BRRF), which ignores the correlation between labels. In the method, multi-label data sets are first converted to multiple single-label data sets, and random forest classifier is adopted. Zhang et al. proposed Multi-Label Naive Bayes method (MLNB) [23], which applies the principal component analysis to determine correlated and redundant features and select the best features with the genetic algorithm.
In terms of embedding methods, Zhang et al. proposed LIFT [24], which is based on the assumption that each label has its own specific features. Ma et al. proposed SFUS [25] method, which combines sparse feature selection and label correlation to select the optimal feature subset. Jian et al. proposed Multi-label Informed Feature Selection (MIFS) [26] method, which selects the most discriminative features based on label sparsity and label correlation. Huang et al. proposed Joint Feature Selection and Classification for Multilabel Learning (JFSC) [27] method. This method utilizes pair label correlation to learn the specific correlation features and the shared features between labels. It is based on Fisher discriminant-based regularization the inner-class distance and maximize the intraclass distance for each label.
In terms of filter methods, Lee et al. proposed AMI [8], which is based on Shearer’s inequality scaling [28]. The feature selection criterion is defined as follows.
where
The disadvantage of the algorithm is that it does not consider the correlation of labels, and its advantages are computation low, simple and practical.
Lee et al. proposed a multi-label feature selection method called Pairwise Multi-label Utility (PMU) [12] based on three-dimensional interaction information. The maximization criterion is as follows.
This algorithm is highly explanatory but it need computing the three-dimensional interaction information repeatedly. The computation cost is very expensive.
Lin et al. proposed MDMR [9], which applied the principle of maximum correlation and minimum redundancy of single-label feature selection MRMR [29] to the multi label feature selection. The maximization criterion of feature selection is as follows:
The disadvantage of the algorithm is that it needs calculating the conditional mutual information, which requires a large amount of computation, and does not consider the correlation between labels. Lee et al. proposed SCLS [11], which uses variable coefficient contraction between redundancy and correlation, and the feature selection criterion is defined as follows.
where
Zhang et al. proposed LRFS [16], which is based on conditional mutual information, and the maximization criterion of feature selection is as follows.
The advantage of this algorithm is that it considers the correlation of labels, but the disadvantage is that it needs calculating the conditional mutual information, which requires a large amount of computation. Seo et al. proposed GICS [30], which is based on the general form of entropy, and the maximization criterion of feature selection is as follows.
where
Due to the
Zhang et al. proposed MCMFS [10], which is based on the method of label clustering. The correlation between the feature and the label sets is approximated, and the maximum mutual information between the candidate feature and the label replaces the correlation between the label sets, and the maximization criterion of feature selection is as follows.
The model complexity of the algorithm is low, and it is simple and efficient. However, some features may provide information for multiple labels, the algorithm possibly ignore these features.
Proportion-based interaction information estimation
To single label feature selection, Nojun Kwak and Chong-Ho Choi proposed MIFS-U (Mutual information feature selection under uniform information distribution) [31], which is based on the assumption of uniform information distribution. The approximate estimation formula of conditional mutual information is designed based on information Venn diagram, and the validity is proven by some experiments. Given random variables
Venn diagram illustrates the relation between 
The equation
According to the above formula, we can get
Similarly, we can get the symmetry estimation, i.e.
To avoid overestimating the redundancy, we take the minimum value, i.e.
The advantage of this estimation is that the calculation of 3D joint entropy is avoided and the computation cost is reduced. It is suitable to model more complicated problems.
In the incremental search, most of the existing algorithms are based on relevance minus redundancy.
where
The traditional correlation measurement usually does not consider the correlation between labels, for example, the correlation
For the sake of analysis advantage, we first consider the case that the labels set has only two labels. Given two labels
Venn diagram which is used to illustrate the relation between 
According to Section 4.1, we can get the interaction information
In the calculation of the interaction information, the label
Our goal is to measure the correlation between features and labels. The correlation between two labels and features is known as the regions
where
Next, we consider the redundancy between the selected features. The traditional measure of redundancy between features does not consider the influence of the selected features. For the sake of analysis, the analysis is based on the information Venn diagram in Fig. 3.
The traditional form of the redundancy measure is that
Similar to Eq. (30), we will approximate it as follows.
where
Venn diagram which is used to illustrate the relation between 
Therefore, the criterion of the feature selection under the framework of maximum correlation and minimum redundancy is
At the beginning of feature selection, the number of selected feature sets is relatively small and the number of elements in the label set remains unchanged. The measurement of the relevant part may only consider the measurement of the relevant part and ignore the measurement of the redundant part if the number of label elements is too large, which will render a large error [11]. Therefore, this paper simply gives the corresponding scale coefficient according to the number of feature sets and label sets, i.e.
where
The pseudo codes of the proposed algorithm can be described by Algorithm 1.
Since each filtering feature
Experiments
In this part, we compare the performance of seven algorithms. These algorithms are PMU [12], MDMR [9], LRFS [16], AMI [8], MCMFS [10], SCLS [11], MLACO [7].
Evaluation criteria
Four multi-label classification evaluation criteria are used to evaluate the performance of the proposed method, i.e. Average Precision (AP), Ranking Loss (RL), Coverage (CV), and One-error(OE) [18].
Let
AveragePrecision (AP): AP measures average value of a label order over a particular label. It is defined as follows.
Coverage (CV): CV denotes the difference between the predicted rank label and the real rank label. It is defined as follows.
Ranking Loss (RL): The RL measures that the classifier gives the proportion that the prediction probability of the corresponding ground truth corresponding to the positive case is less than that of the corresponding the negative case. The RL is defined as follows.
One Error (OE): This OE measures the proportion of the label with the highest predictive probability given by the classifier that is not the ground-truth label. It is defined as follows.
Macro F1: Macro F1 measures recall and precision of integrated quality. It is defined as follows.
Where
To evaluate the classification performance of the proposed method, we compare it with other algorithms on six datasets. These data sets are download from Mulan Library [32]. The characteristics of the datasets are shown in Table 1.
The datasets Arts, Recreation, Social, Health, Computer are from Yahoo text datasets, which are widely used to text categorization. The Emotions dataset is from music domain.
Description of data sets
Description of data sets
The multi-label classification algorithm MLKNN [19] is selected to be the multi-label classifier for evaluation. All continuous features are discretized into four bins according to equal-width strategy, as recommend in [12]. Additionally, the number of the nearest neighbors
All methods are implemented using python on an Intel Core-i5 CPU with 16 GB of RAM in the Microsoft Windows 10 operating system.
In each table, bold characters represent the optimal experimental results under the data set. Each number in the table is the average result of 50 groups of feature subsets under the classifier, where the parentheses represent the standard deviation. The last row “Average” records the average results of all data sets. In each table, bold characters represent the optimal experimental results under the data set.
The classification results in Tables 4–6 show that our proposed algorithm is superior to the others. For example, for Macro F1 evaluation criteria, the analysis results of three data sets are optimum, our proposed average value is 0.1933 and higher than PMU, MDMR, LRFS, AMI, MCMFS, SCLS, MLACO corresponding evaluation criteria of 0.1263, 0.1867, 0.1827, 0.1827, 0.1870, 0.1825, 0.1612. Because the proposed algorithm considers the correlation between labels and is suitable in the description of correlation and redundancy, it is generally superior to other algorithms in the global performance.
From their average results corresponding to the Macro F1 evaluation criteria, other ranking algorithms for PMU, MDMR, LRFS, SCLS, MCMFS, AMI, MLACO. SCLS algorithm with a scalable coefficient between correlation and redundancy has good performance when choosing a small number of features. MLACO may not be able to describe the actual correlation well because the correlation and redundancy measures are linear, shown as in Table 4.
In order to show the classification performance more clearly, Figs 4 and 5 gives a more detailed description of 2 datasets. The X-axis indicates the number of selected features, and the Y-axis indicates the classification result of the different evaluation criteria. Different colored lines in the figure represent different algorithms.
For datasets health and computer, shown as in Figs 4 and 5. As more and more features are selected, the advantages of our algorithm become more and more obvious. For example, in Fig. 5, evaluation criteria Coverage Error, when the number of selected features is less than or equal to 10, the advantage of our algorithm is not obvious compared with other algorithms. When the number of selected features is more than 10, our algorithm is obviously better than other algorithms.
This is because that SCLS, MCMFS, AMI are not sophisticated enough to measure redundancy and relevance. Therefore, when a few features are selected at the beginning, the error between the redundant items and the relevance items of algorithm SCLS, MCMFS, AMI are not significantly enlarged. But as the number of features selected increases, the error of redundant items and relevance items will be larger and larger, the corresponding performance will get worse and worse. MDMR, LRFS both have considered the correlation of the pairwise labels, but the estimation of the redundant term is not precise enough and the two redundant items of PMU have overlapping parts, so the accumulated error of these algorithms when selecting features will become larger and larger.
| Dataset | PMU | MDMR | LRFS | AMI | SCLS | MCMFS | MLACO | Proposed |
|---|---|---|---|---|---|---|---|---|
| Arts | 0.4436 0.0142 | 0.4889 0.0118 | 0.4862 0.0141 | 0.5029 0.0157 | 0.5013 0.0186 |
|
0.4744 0.0254 | 0.5046 0.0178 |
| Recreation | 0.4254 0.0124 | 0.4864 0.0217 | 0.4968 0.0211 | 0.5176 0.0265 | 0.4883 0.0226 | 0.5175 0.0288 | 0.4556 0.0499 |
|
| Emotions | 0.6892 0.0332 | 0.7286 0.0109 | 0.7266 0.0129 | 0.7019 0.0060 | 0.7227 0.0312 | 0.7124 0.0106 | 0.7296 0.0159 |
|
| Social | 0.6707 0.0062 | 0.6947 0.0127 | 0.6947 0.0102 |
|
0.6966 0.0124 | 0.7008 0.0117 | 0.6364 0.0182 | 0.7006 0.0155 |
| Health | 0.6278 0.0074 | 0.6941 0.0192 | 0.6883 0.0192 | 0.7062 0.0213 | 0.7006 0.0219 | 0.7074 0.0223 | 0.6815 0.0202 |
|
| Computer | 0.5844 0.0055 | 0.6028 0.0077 | 0.5954 0.0061 | 0.6150 0.0095 | 0.6019 0.0066 | 0.6092 0.0060 | 0.6084 0.0079 |
|
| Average | 0.5735 | 0.6159 | 0.6147 | 0.6251 | 0.6186 | 0.6255 | 0.5977 |
|
Experimental results of multi-label feature selection methods in terms of Ranking Loss (RL) (mean
Experimental results of multi-label feature selection methods in terms of Coverage (CV) (mean
| Dataset | PMU | MDMR | LRFS | AMI | SCLS | MCMFS | MLACO | Proposed |
|---|---|---|---|---|---|---|---|---|
| Arts | 0.6981 0.0297 | 0.6387 0.0277 | 0.6391 0.0328 | 0.6185 0.0314 | 0.6218 0.0352 |
|
0.6566 0.0435 | 0.6144 0.0360 |
| Recreation | 0.7214 0.0228 | 0.6377 0.0363 | 0.6242 0.0338 | 0.6014 0.0400 | 0.6318 0.0394 | 0.6001 0.0458 | 0.6840 0.0693 |
|
| Emotions | 0.4314 0.0572 | 0.4022 0.0158 | 0.4077 0.0193 | 0.4274 0.0117 | 0.3947 0.0383 | 0.4228 0.0145 | 0.3982 0.0290 |
|
| Social | 0.4123 0.0119 | 0.3793 0.0200 | 0.3810 0.0159 |
|
0.3764 0.0194 | 0.3649 0.0203 | 0.4891 0.0364 | 0.3735 0.0266 |
| Health | 0.4780 0.0120 | 0.3746 0.0273 | 0.3849 0.0257 | 0.3542 0.0295 | 0.3650 0.0306 | 0.3556 0.0322 | 0.3951 0.0251 |
|
| Computer | 0.4881 0.0068 | 0.4744 0.0059 | 0.4839 0.0067 | 0.4528 0.0157 | 0.4750 0.0057 | 0.4709 0.0076 | 0.4670 0.0101 |
|
| Average | 0.5382 | 0.4845 | 0.4868 | 0.4689 | 0.4775 | 0.4711 | 0.5150 |
|
Experimental results of multi-label feature selection methods in terms of Macro F1 (mean
The classification performance on Health data set: (a): Average Precision, (b): Ranking Loss, (c): Coverage Error, (d): One-Error, (e): Macro F1.
The classification performance on Computer data set: (a): Average Precision, (b): Ranking Loss, (c): Coverage Error, (d): One-Error, (e): Macro F1.
In this paper, a multi-label feature selection algorithm based on proportional second-order relevance and redundancy is proposed, which takes the correlation between labels into account, and adjusts the variable scaling coefficient of correlation and redundancy so as to describe the correlation between features and label sets better.
In order to prove the validity of the proposed algorithm in this paper, the seven other multi label based on correlation and redundancy measure feature selection algorithms (PMU, MDMR, LRFS, AMI, MCMFS SCLS, MLACO) are compared. In the 6 well-known multi-label datasets, five evaluation criteria (Average Precision, Ranking Loss, Coverage, One-error, Macro F1) were compared using the same multi-label feature selection classifier MLKNN. The final experimental results show that the proposed algorithm has obvious advantages.
In our future work, we are going to study higher-order correlation and redundancy, not just at second order. We are also talking about improving the way of search, rather than the traditional forward incremental search.
Footnotes
Acknowledgments
This paper was partially supported by National Defense Basic Research Program (JCKY2019413D001), Medical Engineering Cross Project of USST (10-21-302-413) and (10-202-302-424).
