Abstract
Jeffries-Matusita (JM) distance, a transformation of the Bhattacharyya distance, is a widely used measure of the separability between two class density functions. It therefore has good potential for evaluating the effectiveness of a feature in discriminating two classes. The capability of JM distance as a ranking based feature selection technique for binary classification problems has been verified in several research works as well as in our earlier work. Our simulation experiments with benchmark data sets showed that JM distance works equally well compared to other popular feature ranking methods based on mutual information, information gain or Relief. Extensions of the JM distance measure for feature ranking in multiclass problems have also been reported in the literature. However, all of them are rank based approaches that deliver a ranking of the features and do not automatically produce the final optimal feature subset. In this work, a novel heuristic approach for finding the optimum feature subset from JM distance based ranked feature lists for multiclass problems has been developed without explicitly using any specific search technique. The proposed approach integrates the extension of the JM measure for multiclass problems and the selection of the final optimal feature subset in a unified process. The performance of the proposed algorithm has been evaluated by simulation experiments with benchmark data sets in comparison with two previously developed multiclass JM distance measures (weighted average JM distance and another multiclass extension equivalent to the Bhattacharyya bound) and some other popular filter based feature ranking algorithms. It is found that the proposed algorithm performs better in terms of classification accuracy, F-measure and AUC, with a reduced set of features and reduced computational cost.
Introduction
With the rapid development of science and technology, the amount of data generated in every sphere of life has increased tremendously, which makes the classification or mining of data increasingly difficult. These data are often high dimensional, making their analysis more complicated and computationally costly. The data need to be preprocessed to get rid of redundant and irrelevant information. Feature selection is a key preprocessing step prior to classification or clustering for any pattern recognition or data mining problem [1, 2]. It is a process of dimensionality reduction in which discriminatory and relevant information is retained and redundant and irrelevant information is discarded, leading to better performance of the classification model in terms of classification accuracy as well as computational cost [3, 4]. Feature subset selection produces the best or the optimum feature subset of d features from an original set of n features, where d < n.
The optimum feature subset selection process comprises two main tasks: defining a feature evaluation criterion for evaluating an individual feature or a feature subset, and a selection process for finding the best feature subset from the full feature set according to that criterion. Based on the nature of the feature evaluation criterion, there are two main categories of feature selection algorithms, filter and wrapper. Filter algorithms [5, 6] use feature evaluation criteria based on intrinsic characteristics of the data set and are independent of the classifier model, while wrapper algorithms [7] use classifier performance as the feature evaluation criterion. Filter methods are generally faster than wrapper methods in producing the optimal feature subset, but the classification performance of the selected feature subset might be poor compared to the wrapper selected feature subset, which is tuned to the particular classifier. There are also embedded algorithms that combine the merits of both filter and wrapper approaches [8].
Based on the selection process of the optimal feature subset, two main approaches exist, rank based and search based. Rank based approaches evaluate each feature independently, rank the features according to their merit and then select an appropriate portion of the top ranked features to form the final feature subset. Though simple and computationally light, rank based approaches ignore the interaction between features and cannot guarantee the optimality of the selected subset. Moreover, some strategy needs to be adopted to fix the percentage of top ranking features to be selected. According to [9], the best two individual features do not necessarily form the best feature subset of two features. Only an exhaustive evaluation of all possible feature subsets can guarantee the optimality of the selected feature subset, but for high dimensional data its computational time explodes with increasing dimension. To solve this combinatorial optimization problem, many search algorithms have been developed for the selection of the optimum feature subset, mainly statistical or mathematical techniques and soft computing based techniques.
The feature evaluation measures for filter approaches are generally classified into four categories: distance based, dependency or relevance based, information theoretic and consistency based measures [10]. Distance based evaluation measures include class separability measures such as divergence or Kullback-Leibler distance, Bhattacharyya distance, Jeffries-Matusita distance and Mahalanobis distance. Dependency based measures consider the correlation or similarity of a feature to a class. Information theoretic measures determine the information gained by including a feature. Consistency based measures penalize inconsistent features, where inconsistency is defined as two instances having the same feature values but different class labels.
Some efficient popular filter evaluation measures used for feature ranking are Mutual Information (MI), Information Gain (IG), Gain Ratio (GR), Symmetrical uncertainty (SU), Chi-squared (CS), One-R, Relief, Jeffries-Matusita (JM) distance and Correlation [11].
As noted above, Jeffries-Matusita (JM) distance, a modification of Bhattacharyya distance, can be used to rank a feature by assessing the degree of separability between two classes using the feature. Unlike other class separability measures such as Bhattacharyya distance or Divergence, JM distance has an upper bound, which makes it a more realistic measure for feature evaluation [12]. It is also noted in a very early study [13] that JM distance provides a much more reliable criterion for feature selection compared to Divergence. An extended definition of JM distance for the multiclass problem, the weighted average J_ave, has been reported in [14]. In [15], the authors proposed J_min, the best JM distance between the least separable pair of classes, as the evaluation criterion for multiclass problems. Two other approaches for multiclass extension of JM distance, reported in [16] and [17] respectively, are found to be effective measures for feature selection. The authors of this work studied JM distance as a tool for feature selection in [18] and found that JM distance performs equally well compared to other popular filter based feature selection measures. In our other previous work [19], it was also found that the stability of the JM distance based feature selection algorithm is better compared to other popular filter based ranking algorithms for feature subset selection.
In this work, an efficient feature subset selection algorithm for multiclass problems based on JM distance is proposed. The JM based approaches proposed so far for feature selection in multiclass problems are all univariate feature ranking strategies in which the average JM distance over all class pairs is used for the final feature ranking, whereas we propose a heuristic approach to select the final optimal feature subset. The proposed approach consists of two steps. In the first step, similar to existing JM distance based feature selection approaches, features are ranked according to the JM distances for all class pairs, and multiple ranked feature lists are created, one for each class pair. The second step is the novel contribution of this work: a heuristic approach is developed to select the final optimum feature subset from the multiple ranked feature lists based on the average JM distance values of the top ranking features of each class pair. Unlike traditional approaches, the proposed algorithm does not use any explicit search mechanism to find the optimum feature subset. The proposed algorithm has been evaluated by simulation experiments with benchmark data sets, in comparison with other multiclass JM distance based feature selection methods as well as some other popular filter based ranking methods. The rest of the paper is organized as follows. The next section presents the background and related works. In section 3, our proposed feature selection approach using JM distance for multiclass problems is described. The following section describes the simulation experiments, results and analysis, followed by the final section of conclusion and future works.
Rank based filter approach for feature selection
Feature selection with the filter approach has been extensively used in machine learning for many years due to its simplicity, low computational complexity and independence from any particular classifier. Feature ranking based filter methods also perform very well in many real life problems of pattern recognition, machine learning and data mining [11, 20–22] compared to search based methods. Some of the well-known filter type evaluation measures for feature ranking are Mutual Information (MI), Information Gain (IG), Gain Ratio (GR), Symmetrical Uncertainty (SU), Chi-squared (CS), One-R and Relief-F [23]. In this work, JM distance, a modification of Bhattacharyya distance, is used as the evaluation measure for ranking based feature selection in multiclass problems.
Feature evaluation measures
Some of the filter type feature evaluation measures including JM distance are briefly described in this section.
Mutual information (MI)
Mutual Information (MI), one of the most widely used feature evaluation measures, is an information theoretic measure which expresses the mutual dependency between two random variables. Mutual information between a feature and the class can be used to assess the relevance of a feature for representing the class. The MI of two random variables x and y is defined in terms of their probability density functions p(x), p(y) and p(x, y):
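For continuous variables, the standard textbook form of this definition is shown below; the symbol I(x; y) is notation assumed here, and for discrete variables the integrals become sums.

```latex
I(x; y) = \int\!\!\int p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}\;dx\,dy
```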
There are several modifications of MI that have been proposed in the literature over the years. Normalized MI, which maps the value of MI between 0 and 1, is an important improvement of MI, popularly used for feature selection [24].
Information Gain (IG), also an information theoretic measure, is the amount of information gained about a random variable from observing another variable. It is measured as the reduction in entropy of the class C when the feature A is observed and is expressed by the following equation [25].
Gain Ratio (GR) of a feature A is defined as in [26].
Symmetrical Uncertainty (SU) is obtained by normalizing Mutual Information (MI) to the entropies of the two variables and is defined as in [27].
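As a minimal sketch of the three entropy based scores described above, assuming discrete feature values and the standard definitions IG = H(C) − H(C|A), GR = IG / H(A) and SU = 2·IG / (H(A) + H(C)); the function names are illustrative and not taken from the cited works.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(labels, feature):
    """H(C | A): entropy of the labels within each feature value, weighted by frequency."""
    n = len(labels)
    groups = {}
    for a, c in zip(feature, labels):
        groups.setdefault(a, []).append(c)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain(labels, feature):
    """Reduction in class entropy when the feature is observed."""
    return entropy(labels) - conditional_entropy(labels, feature)


def gain_ratio(labels, feature):
    """Information gain normalised by the entropy of the feature (assumed standard form)."""
    h_a = entropy(feature)
    return information_gain(labels, feature) / h_a if h_a > 0 else 0.0


def symmetrical_uncertainty(labels, feature):
    """Information gain normalised by the sum of both entropies (assumed standard form)."""
    denom = entropy(labels) + entropy(feature)
    return 2.0 * information_gain(labels, feature) / denom if denom > 0 else 0.0
```

A feature that splits the classes perfectly scores 1 on all three measures, while a feature that is independent of the class scores 0.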
The Chi-squared algorithm for feature selection, based on the χ2 statistic to test the independence of two variables (here the class and a feature), discretizes numeric attributes repeatedly until some inconsistencies are found in the data and achieves feature selection via discretization. It is measured for two adjacent intervals by comparing the expected and actual values as in the following equation [28].
C is the number of classes, O_ij is the number of instances (feature values) of the jth class in the ith interval and E_ij is the expected frequency of O_ij, where E_ij = R_i C_j / N, R_i is the number of instances in the ith interval, and C_j and N are the number of instances of the jth class and the total number of instances (counting both intervals), respectively. A higher value of CS expresses a higher dependency between the class and the feature, and the feature is selected.
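The statistic can be sketched as follows, assuming the usual form in which the squared differences (O_ij − E_ij)² / E_ij are summed over both intervals and all classes. Skipping cells with zero expected frequency is a simplification made here, not something stated in [28].

```python
def chi_squared(observed):
    """Chi-squared statistic for a table observed[i][j]:
    the count of class j in interval i (two adjacent intervals)."""
    row_totals = [sum(row) for row in observed]          # R_i
    col_totals = [sum(col) for col in zip(*observed)]    # C_j
    n = sum(row_totals)                                  # N
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n        # E_ij = R_i * C_j / N
            if e > 0:  # assumption: cells with zero expected count are skipped
                chi2 += (o - e) ** 2 / e
    return chi2
```

For two intervals that each contain only one class, `chi_squared([[10, 0], [0, 10]])` is large, whereas identical class distributions in both intervals give 0.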
One-R ranks the features using a rule based classification algorithm proposed by Holte [29]. For each feature, the algorithm generates simple rules and selects the rule which has the lowest error rate for the majority of the classes. The features are ordered according to the quality of their corresponding rules. One-R treats all numerically valued attributes as continuous and uses a straightforward method to divide the range of values into several disjoint intervals.
Relief-F
Relief-F is an extension of the original Relief algorithm that can be used for multiclass problems [30]. The original Relief algorithm is formulated iteratively from an instance based learning approach that evaluates a feature by assigning a weight to the feature. This feature weight represents the feature's relevance with respect to the target concept and is calculated by locating, for each instance, the nearest neighbor from the same class (nearest Hit) and the nearest neighbor from the opposite class (nearest Miss). Weight W_i is initialized to 0, and then in each step, it is updated according to the following rule [31].
Finally, an average of W_i is taken over the iterations. Relief-F is also more robust and has the ability to deal with incomplete and noisy data.
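The weight update of the original two-class Relief can be sketched as below. This is an illustration under stated assumptions rather than the exact rule of [31]: the unsquared, range-normalised feature difference is used, every instance is visited once instead of being sampled at random, and `relief_weights` is a hypothetical name.

```python
def relief_weights(X, y):
    """Two-class Relief sketch: reward features that differ on the nearest miss
    and penalise features that differ on the nearest hit."""
    n, d = len(X), len(X[0])
    # Feature ranges, used to scale differences into [0, 1].
    spans = [max(r[f] for r in X) - min(r[f] for r in X) or 1.0 for f in range(d)]

    def diff(f, a, b):
        return abs(a[f] - b[f]) / spans[f]

    def dist(a, b):
        return sum(diff(f, a, b) for f in range(d))

    w = [0.0] * d  # weights initialised to 0
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: dist(X[i], X[j]))      # nearest Hit
        miss = min(misses, key=lambda j: dist(X[i], X[j]))   # nearest Miss
        for f in range(d):
            # Dividing by n averages the update over all iterations.
            w[f] += (diff(f, X[i], X[miss]) - diff(f, X[i], X[hit])) / n
    return w
```

Each class needs at least two instances so that a nearest hit exists. A feature that separates the classes receives a positive weight, and a noisy feature tends towards a negative one.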
Bhattacharyya distance measures the similarity between two probability distributions. It is used to measure the separability or overlap of two classes for each feature in a binary classification problem and to rank the features so that the best performing ones can be selected. If the distributions of the two classes i and j are considered to be Gaussian, then the Bhattacharyya distance B_ij can be defined as [32]:
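The standard closed form for two Gaussian class densities is given below. The symbols μ_i, μ_j (class means) and Σ_i, Σ_j (class covariance matrices) are notation assumed here; for a single feature they reduce to the scalar means and variances.

```latex
B_{ij} = \frac{1}{8}\,(\mu_i - \mu_j)^{T}
         \left[\frac{\Sigma_i + \Sigma_j}{2}\right]^{-1}
         (\mu_i - \mu_j)
       + \frac{1}{2}\,\ln\!\left(
         \frac{\left|\dfrac{\Sigma_i + \Sigma_j}{2}\right|}
              {\sqrt{|\Sigma_i|\,|\Sigma_j|}}
         \right)
```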
Jeffries-Matusita (JM) distance is also a measure of statistical separability for two classes. For feature x, it is defined for two classes w_i and w_j as in [15].
JM distance is bounded by a range of values from 0 to 2. It is related to Bhattacharyya distance as
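In the standard formulation, which is consistent with the bounds of 0 and 2 stated above, the definition and its relation to the Bhattacharyya distance read as follows, where p(x | w_i) denotes the class-conditional density of feature x and JM_ij is notation assumed here:

```latex
JM_{ij} = \int_x \left( \sqrt{p(x \mid w_i)} - \sqrt{p(x \mid w_j)} \right)^{2} dx
        = 2\left(1 - e^{-B_{ij}}\right)
```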
The JM distance yields a general measure of separability between the two classes based on the average distance between their class density functions. A larger JM distance value indicates a clearer separation between two classes. It is a more reliable criterion for correct classification compared to divergence or Bhattacharyya distance.
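As a minimal sketch for a single feature, assuming Gaussian class-conditional densities and the standard relation JM = 2(1 − e^(−B)), the JM distance between two classes can be computed from the per-class mean and variance of the feature. The function names are illustrative.

```python
import math


def bhattacharyya_gaussian(mean_i, var_i, mean_j, var_j):
    """Bhattacharyya distance between two univariate Gaussian class densities."""
    avg_var = (var_i + var_j) / 2.0
    mean_term = (mean_i - mean_j) ** 2 / (8.0 * avg_var)
    var_term = 0.5 * math.log(avg_var / math.sqrt(var_i * var_j))
    return mean_term + var_term


def jm_distance(mean_i, var_i, mean_j, var_j):
    """JM distance of one feature for one class pair; lies between 0 and 2."""
    b = bhattacharyya_gaussian(mean_i, var_i, mean_j, var_j)
    return 2.0 * (1.0 - math.exp(-b))
```

Identical class distributions give a JM distance of 0, and the value approaches 2 as the class means move apart relative to the variances.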
The original JM distance measure, used for binary classes, has been extended for multiclass problems as the weighted average JM distance (JM_ave). For m classes, it is defined according to [14] as:
In [16], the authors proposed another multiclass extension of JM distance, JM_Bh, which is an equivalent of the Bhattacharyya bound to the Bayes error, and justified its efficiency over JM_ave in feature selection for multiclass problems by simulation experiments. For an m class problem, JM_Bh is defined as:
Some applications of the popular filter based feature ranking algorithms are described here in brief. In [20], the authors used feature ranking methods such as information gain, gain ratio, correlation, symmetrical uncertainty and chi-squared for recognizing handwritten digits. The effect of feature selection on classification accuracy is analyzed in [33], which uses six feature ranking based filter methods (Information Gain, Gain Ratio, Symmetrical Uncertainty, One-R, Chi-square and Relief-F) for the feature selection process and three classification models for a comparative study. In [34], the authors investigated cancer classification using feature selection with the filter methods Signal-to-noise statistic, CFS, Chi-square and Relief-F for Probabilistic Neural Networks. Generally, for high dimensional data such as gene expression data, the filter method is extensively used. In [35], the authors proposed an unsupervised feature selection algorithm with the Chi-squared feature ranking method for maximizing classifier performance. This algorithm achieved better prediction accuracy and also reduced the number of features compared to other methods. For handwriting recognition, Cilia et al. [11] used five univariate feature ranking methods (Chi-square, Relief, Gain ratio, Information gain and Symmetrical uncertainty) for ranking the features, while the feature subset was chosen by a greedy search approach, the Best First (BF) search strategy combined with the consistency criterion and the correlation based feature selection criterion. Chen et al. [36] used filter based ranking feature selection (FRFS) methods for security vulnerability prediction (SVP) and showed that FRFS can improve the performance of SVP compared to others. They also performed a diversity analysis on the vulnerable modules identified by the different FRFS methods. In [37], Ghazy et al. used different ranking and subset-based feature selection techniques for finding the optimum number of features and an appropriate classifier. They mainly used these feature selection techniques to verify the performance of an intrusion detection system (IDS). In [38], the authors addressed the task of feature ranking for multi-target regression (MTR). They studied two types of feature ranking scores for MTR, one ensemble based and the other an extension of the RReliefF method. Lee et al. [39] proposed an efficient multivariate feature ranking method for gene selection and for improving the accuracy of microarray data classification. In their work, they created a new feature ranking method using the Markov blanket (MB), which is embedded with relevance. They showed that the proposed feature ranking method possesses high classification accuracy as well as good efficiency.
JM distance based feature selection algorithms
In [40], the author proposes a novel spectral matching technique for hyperspectral image data that combines the JM distance and the Spectral Angle Mapper (SAM) algorithm. The proposed JM-SAM approach performs better than the individual JM distance measure and SAM algorithm, with the least average entropy in spectral matching. Dalponte et al. [41] used the Jeffries-Matusita (JM) distance combined with the sequential forward floating selection (SFFS) search strategy for fast and reliable feature selection. In this case, JM distance was also used for hyperspectral data. In [42], the authors presented an analysis of the linear attenuation coefficients, which were used as a useful feature of mono-spectral and multispectral images, using statistical pattern classification tools. In that paper, feature extraction was performed by JM distance and the Karhunen-Loeve transformation. Daamouche et al. [43] proposed a particle swarm optimization (PSO) based approach for very high resolution (VHR) image classification, in which JM distance, support vector machine (SVM) cross-validation (CV) accuracy and normal Bhattacharyya distance were used as the fitness functions. In [44], the authors developed a new technique for crop identification by combining the wavelet variance and the JM distance (CIWJ). The proposed CIWJ approach outperforms other approaches for efficient crop mapping, such as agricultural crop identification with high spatial resolution images and classifications for more general or specific land use. In [45], JM distance is applied as an evaluator of image segmentation for remote sensing images. There the authors proposed an unsupervised method for evaluating the performance of segmentation using the JM distance and the area-weighted variance (WV). The authors of [16] proposed an extension of the JM distance measure for multiclass feature selection problems. They formulated an equation for the JM distance measure and used optical remote-sensing data for the experiment. They also compared their results with the more familiar weighted average JM distance. Sen et al. [18] studied JM distance as an efficient tool for feature selection in binary classification problems in comparison with other feature ranking methods.
Proposed feature selection approach with JM distance for multiclass problems (JM_mc)
In this work, a novel optimum feature subset selection approach, JM_mc, for multiclass problems based on feature evaluation by the JM measure is proposed, which is described in detail in this section. Traditionally, optimum feature subset selection approaches require a measure for evaluating a feature or feature subset and a search strategy for finding the best feature subset among the possible feature subsets. The proposed approach is composed of two steps. In the first step (Algorithm 1), similar to existing JM distance based feature selection approaches, features are ranked according to the JM distances for all class pairs, and multiple ranked feature lists are created, one for each class pair. Other multiclass JM distance based approaches for feature selection use some kind of average JM measure, averaged over all class pairs, as described in the previous section by Equation 11 and Equation 12, for the final feature ranking. In our work, a novel heuristic approach for selecting an optimum feature subset from the multiple lists of ranked features obtained after the first step is proposed in the second step (Algorithm 2). Moreover, our approach does not include traditional search based methods for the final feature subset selection.
The heuristic proposed in our work is based on the following core concept. If the average JM values of the top ranking features for a class pair are high, the classes are considered to be well separated and only a few top ranking features are needed for good classification accuracy. The opposite holds for low average JM values of the top ranking features: the classes are not well separated, and more features need to be included in the final feature subset to provide better classification accuracy. The detailed algorithm is presented below.
Algorithm 1: For any multiclass data set, let the number of features be n and the number of classes be m, where m > 2.
The number of class pairs (NC) will be NC = m(m − 1)/2. The JM distances of all features for all class pairs are obtained in each of K trials and averaged over the trials as follows:

    JFC_k : JM distances for all class pairs of all features in the kth iteration
    JMmean : average JM distance for all class pairs of all features over K trials
    for p = 1 to n do
        for q = 1 to NC do
            sum ← 0
            for k = 1 to K do
                sum ← sum + JFC_k[p, q]
            end for
            JMmean[p, q] ← sum/K
        end for
    end for
    return JMmean

Algorithm 2: In Algorithm 2, for the qth class pair, features are ranked in descending order according to the values of the JMmean_pq distance of all the features (JMmean_pq, p = 1, ⋯, n). This is done for all class pairs, which gives the feature lists (FL_q, where q = 1, ⋯, NC) of ranked features corresponding to each class pair. Every feature list contains all n features. The proposed approach then selects the most important feature subset from the ordered feature lists FL_q, q = 1, ⋯, NC, for the multiclass problem. The underlying concept of the selection of features from the feature lists is presented below.
The JMmean_pq distance is a separability measure which determines the average class separability of a feature p for a class pair q. A larger JMmean_pq distance value of a feature p indicates that the feature can separate the class pair q very well, which implies an increase in classification accuracy. For multiclass problems, the JMmean_pq distance is calculated for each class pair q, and different class pairs show different values of the JMmean_pq distance for a single feature p. If, for a particular class pair q, the JMmean_pq distances of the top ranking features are close to 2 (the theoretical upper limit of the standard JM distance), then the features are strong enough to separate the classes easily. In this case, a smaller number of features can provide good classification accuracy. On the other hand, if the JMmean_pq distance values of the top ranking features are low, the features do not have good separability, and comparatively more features are required to provide moderate classification accuracy. Based on this concept, different percentages of features are selected from different class pairs to improve classification accuracy.
The final feature subset selection from the ranked feature lists of all class pairs is done according to three rules, each triggered by the range in which the top JMmean_pq distance value of the feature list FL_q falls: for the highest range the top α% of the ranked features are taken, for the intermediate range the top 2α%, and for the lowest range the top 3α%.
The selected features from all the class pairs are put in a list (selected feature list SFL). As a particular feature might be selected more than once, the frequency of occurrence of each feature p in the selected feature list SFL is counted; let it be |SFL_p|, p = 1, ⋯, n, and let the median of these frequencies be SFL_med. To find the final feature subset, two cases are considered, depending on the relative number of features and class pairs. If the total number of features n in the data set is less than or equal to the total number of class pairs (n ≤ NC), then the final feature subset consists of the features (from the selected feature list SFL) whose occurrence frequency is greater than or equal to SFL_med. If the total number of features is greater than the total number of class pairs (n > NC), then all the features in SFL are selected as the final feature subset.
The proposed heuristics selection process is presented clearly in Algorithm 2.

Class pair-feature table.
Algorithm 2
    n : len(featureList)
    NC : len(classPair)
    SFL : candidate features list
    SFL ← empty list
    for q = 1 to NC do
        FLmax[q] ← find the feature with the highest JMmean[p, q], p = 1, ⋯, n
        SortedfeatureList ← sort the features in descending order of JMmean[p, q], p = 1, ⋯, n
        Fs ← empty list
        if JMmean of FLmax[q] is in the highest range then
            Fs ← take α% of top features in SortedfeatureList
        else if JMmean of FLmax[q] is in the intermediate range then
            Fs ← take 2α% of top features in SortedfeatureList
        else
            Fs ← take 3α% of top features in SortedfeatureList
        end if
        for each Feature in Fs do
            SFL.insert(Feature)
        end for
    end for
    Freq ← dictionary for counting the frequency of features in SFL
    for each feature in SFL do
        Freq[feature] ← SFL.count(feature)
    end for
    if n ≤ NC then
        SFL_med ← median value of the frequencies in Freq
        selectedFeature ← all features with Freq[feature] ≥ SFL_med
    else
        selectedFeature ← all features in Freq
    end if
    return selectedFeature
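The two steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the thresholds `high` and `low` that separate the three ranges are hypothetical placeholders, the percentage is rounded up with at least one feature taken per class pair, and the function names are invented for this example.

```python
import math
from collections import Counter
from statistics import median


def mean_jm(jfc_trials):
    """Step 1 (averaging): element-wise mean of K matrices JFC_k[p][q]."""
    k = len(jfc_trials)
    n, nc = len(jfc_trials[0]), len(jfc_trials[0][0])
    return [[sum(t[p][q] for t in jfc_trials) / k for q in range(nc)]
            for p in range(n)]


def select_features(jm_mean, alpha=10.0, high=1.9, low=1.0):
    """Step 2 (heuristic selection). `high` and `low` are HYPOTHETICAL thresholds."""
    n, nc = len(jm_mean), len(jm_mean[0])
    sfl = []  # selected feature list, may contain repeats
    for q in range(nc):
        # Rank feature indices by descending mean JM distance for class pair q.
        ranked = sorted(range(n), key=lambda p: jm_mean[p][q], reverse=True)
        top = jm_mean[ranked[0]][q]
        if top >= high:
            pct = alpha
        elif top >= low:
            pct = 2 * alpha
        else:
            pct = 3 * alpha
        # Assumption: round up and always keep at least one feature.
        count = max(1, math.ceil(n * pct / 100.0))
        sfl.extend(ranked[:count])
    freq = Counter(sfl)
    if n <= nc:
        med = median(freq.values())
        return sorted(p for p, c in freq.items() if c >= med)
    return sorted(freq)
```

With 10 features and 3 class pairs (n > NC), every feature that enters SFL is kept; with fewer features than class pairs, only features selected at least as often as the median frequency survive.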
The implementation of the proposed algorithm has been done with benchmark data set for simulation experiments. The data set description is presented in the next subsection.
Dataset description
In this work, 37 datasets are used in the simulation experiments to validate the proposed approach. Among them, 25 datasets are collected from the UCI repository [46] and the rest are collected from OpenML [47]. Datasets which have missing values or are categorical in nature need to be preprocessed. Missing values of categorical type are replaced with the most frequent value, after which the whole dataset is converted into numeric type. Missing values of numeric type are replaced with the average value. Categorical datasets without missing values are directly converted to numeric type. Table 1 presents a summary of the datasets, which includes the number of features, the number of instances, the number of classes and a short description of each data set.
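The preprocessing described above can be sketched per column as follows. Missing entries are represented by `None`, and the integer coding of categories is an assumption, since the conversion scheme is not specified; both helper names are illustrative.

```python
from collections import Counter


def impute_column(values, categorical):
    """Fill missing entries (None) with the most frequent value for categorical
    columns, or with the mean for numeric columns."""
    present = [v for v in values if v is not None]
    if categorical:
        fill = Counter(present).most_common(1)[0][0]
    else:
        fill = sum(present) / len(present)
    return [fill if v is None else v for v in values]


def encode_categorical(values):
    """Map each category to an integer code (assumed encoding scheme)."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]
```

Imputation is applied first and the encoding second, matching the order given in the text.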
Summary of Datasets
For the simulation experiment, a 10-fold cross validation method is used. The training set samples are used for feature selection. The JM distance for each feature and for each class pair is calculated according to Equations (8) and (9). The proposed approach in Algorithm 2 is used to select the subset of features based on the average JM values for each class pair and each feature. The selected feature subset is evaluated by its performance in supervised classification using the Naive Bayes classifier. The same training samples are used for training the classifier, and the test samples are used for measuring the classification accuracy, F-measure and AUC of the classifier with the feature subset. The classification experiment is repeated 10 times, and the average classification accuracy, F-measure and AUC are taken as the performance measures of the selected feature subset.
In order to set the value of parameter α, a preliminary experiment is conducted with 10 datasets. The value of α is changed from 1 to 20, and in each case, the classification performance of the proposed approach on these datasets is observed. Results show that classification accuracy for most of the datasets is the highest when the α value is near 10. Therefore, the value of α is here fixed as 10.
For comparative evaluation of the proposed algorithm, feature subset selection has also been done with two other available multiclass extensions of JM distance, the weighted average JM distance (JM_ave) and the extension equivalent to the Bhattacharyya bound (JM_Bh), according to Equations (11) and (12) respectively, and the results have been compared. The performance of the proposed algorithm has also been compared with other filter based feature ranking algorithms such as IG, GR, SU, CS, One-R and Relief-F by simulation experiment. The classification accuracy, F-measure and AUC of the Naive Bayes classifier with the features selected by those algorithms are used for performance comparison, the number of features being the same as the number of features selected by the proposed multiclass JM distance based algorithm.
For all simulation experiments, Intel(R) Core(TM) i5-4590 CPU @3.30GHz Processor and 8GB RAM with a 64 bit operating system of Windows 8.1 Pro is used. R (version 3.5.3) is used with several key standard packages such as SpatialEco, FSelector, varSel, MLmetrics, caret, and e1071 for implementation of the algorithms.
Performance measures
Percentage of selected features
Since our proposed method selects a specific number of features from the full set of features, the percentage of selected features is important. The percentage of selected features is calculated by
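Using the notation of the Introduction, with d selected features out of n original features, this percentage takes the natural form:

```latex
\text{Percentage of selected features} = \frac{d}{n} \times 100
```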
The proposed feature selection approach with multiclass JM distance is evaluated by the classification accuracy as a performance measure which is defined as [48].
Where TP, true positive, TN, true negative, FP, false positive, and FN, false negative, represent the number of positive cases correctly detected, the number of negative cases correctly detected, the number of negative cases detected as positive and the number of positive cases detected as negative respectively.
Precision (also called positive predictive value) is a measure of the correctness of a positive prediction. For any classification task, the precision of a class (target value) is defined as the number of true positives divided by the total number of elements labelled as belonging to the positive class. It is calculated by using the following formula:
Recall is the measure of how many true positives are predicted out of all the positive class elements. It is sometimes also called sensitivity. The measure is calculated by the following formula:
F-measure combines precision and recall. It can be defined as the (weighted) harmonic mean of precision and recall by the following equation [49]:
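The four measures above can be sketched from the confusion counts using their standard definitions. The `beta` parameter for the weighted harmonic mean is an assumption added here, with `beta = 1` giving the usual balanced F-measure.

```python
def classification_metrics(tp, tn, fp, fn, beta=1.0):
    """Accuracy, precision, recall and F-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    # Weighted harmonic mean of precision and recall.
    f_measure = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

For a multiclass problem, these values would be computed per class (one class against the rest) and then averaged; the averaging scheme is not specified in the text.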
The area under a receiver operating characteristic (ROC) curve, or AUC, is a single scalar value that summarizes the general performance of a binary classifier [50]. AUC takes values in [0, 1]; for a classifier that does no worse than chance it lies in [0.5, 1], where the minimum value indicates that the performance of the classifier is random, and the maximum value indicates that the classifier is perfect with a zero error rate. The AUC is an important measure for evaluating the overall performance of a classifier because its calculation relies on the complete ROC curve, which involves all possible classification thresholds.
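For a binary problem, the AUC can be sketched through its rank interpretation: it equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half. The function below is an illustration with positive labels coded as 1 and negative labels as 0.

```python
def auc(scores, labels):
    """AUC as the probability that a positive instance outscores a negative one."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives 1, identical scores for all instances give 0.5, and a fully inverted ranking gives 0.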
Simulation results
Table 2 presents the feature subsets selected by our proposed approach (JM mc) for the 37 multiclass datasets. The percentage of selected features differs from dataset to dataset. For the ‘Segment’ dataset, the percentage of selected features is about 11.11%, the lowest among the 37 datasets. For the ‘Pendigits’ dataset, the percentage is the second lowest, at about 12.50%. Of the 37 datasets, 15 have a feature selection rate lower than 50%, 13 have a rate between 50% and 70%, and the remaining 9 have a rate above 70%. Table 2 also reports the feature selection time in seconds for the 37 datasets. For all these datasets the time is very short, ranging from about 0.050 to 0.110 seconds.
Feature Selection with Proposed Approach (JM mc)
Figure 2 shows the average classification accuracy (Avg) for the 37 multiclass datasets using the various JM distance measures. The figure shows that the classification accuracy of JM distance with our proposed approach (JM mc) is comparable to that of the other two multiclass JM distance measures, JM ave and JM Bh. For some datasets, our approach performs much better than these two measures. For about 22 of the 37 datasets, the classification accuracy of JM mc is higher than that of JM ave and JM Bh.

Classification accuracy using various JM measures for all datasets.
Figure 3 compares the classification accuracy, averaged over all datasets, of the nine ranking based filter measures. The figure shows that JM distance with our proposed approach (JM mc) produced the highest average classification accuracy, 73.02%, over all the data sets. The weighted average JM distance JM ave and the JM distance equivalent to the Bhattacharyya bound JM Bh have classification accuracies of 70.53% and 70.55% respectively. Among the other feature ranking methods, excluding our approach, Relief-F produced the highest value, 71.59%.

Classification performance over all datasets with different methods.
Table 3 gives a detailed comparison of the classification accuracy obtained with the feature subset selected by the proposed technique and with the other methods for the multiclass datasets. Here, the average classification accuracy (Avg) and standard deviation (SD) are calculated over ten iterations. The highest average value and the lowest SD value are shown in boldface in the table. The table shows that the proposed JM mc achieves the highest classification accuracy for 15 data sets, whereas JM ave and JM Bh have the highest accuracy for four and three datasets respectively. The other feature ranking measures IG, GR, SU and CS show the highest accuracy for three, five, four and three datasets respectively. One-R has the highest accuracy for four datasets, and Relief-F for five datasets. For four datasets (Iris, Pendigits, teaching-assistant and waveform-5000), all the methods give the same average classification accuracy and the same SD. Regarding SD, JM mc provides the lowest value for 12 datasets, more than any other approach. CS attains the lowest SD for only two data sets. The remaining measures, JM ave, JM Bh, IG, GR, SU, One-R and Relief-F, have the lowest SD for three, three, seven, five, ten, four and four datasets respectively. From these results, we can infer that our proposed approach JM mc produces stable output compared with the others.
Classification accuracy of proposed approach and other methods
Table 4 compares the F-measure of all nine methods for the multiclass datasets. The results show that JM mc provides the highest F-measure for 15 of the 37 datasets. The other two JM measures, JM ave and JM Bh, each provide the highest F-measure for only two datasets. The remaining measures, IG, GR, SU, CS, One-R and Relief-F, have the highest F-measure for five, eight, five, seven, nine and eight datasets respectively.
F-measure of proposed approach and other methods
Table 5 compares the nine methods in terms of AUC. Again, JM mc produces the highest AUC value for 15 of the 37 datasets. For seven datasets, both JM ave and JM Bh outperform the other approaches in terms of AUC. In addition, SU, CS and One-R each exhibit the highest AUC for eight datasets, whereas IG, GR and Relief-F show the highest AUC for five, 12 and six datasets respectively.
AUC of proposed approach and other methods
Table 6 compares the nine methods on average execution time (in seconds) for all the datasets. The execution times of the three JM distance measures are comparable and slightly higher than those of the other methods, except for Relief-F. In this table, the lowest execution time is highlighted in boldface. The CS method shows the lowest execution time for 17 datasets. JM mc has the lowest execution time for three datasets, and the other two JM distance measures, JM ave and JM Bh, have the lowest execution time for nine and six datasets respectively. The One-R method takes the lowest time for three data sets. Relief-F needs by far the most time: its execution time is approximately 212 times that of the other methods, or more.
Average execution time (seconds)
Table 7 summarizes the results over all the datasets for all the methods in terms of classification accuracy, F-measure, AUC and computational time. The table shows the computed average rank of each approach. For each dataset, the nine methods are ranked individually (from best to worst, 1 to 9) according to the value of the evaluation metric. If multiple methods show the same effectiveness, they are given the same rank. This ranking is performed for all datasets, and the average rank over all the data sets is then calculated for each method. JM mc achieves the best average rank among all the approaches in terms of classification accuracy, F-measure and AUC, as shown in boldface in the table. For execution time, the average rank over the data sets is not a suitable measure, as computational time depends on the size of the data set. The average computational time of JM mc appears to be the third lowest, behind the other two JM measures.
Overall comparison between the methods
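The average-rank procedure described for Table 7 can be sketched in Python as follows. The text states only that tied methods receive the same rank; the sketch assumes that tied methods share the best (lowest) rank of the tied group, which is one of several possible conventions. The function names and the accuracy values are hypothetical.

```python
def rank_methods(values, higher_is_better=True):
    """Rank methods on one dataset (1 = best). Tied methods share the
    same rank, taken here as the best rank of the tied group (assumption)."""
    ranks = []
    for v in values:
        if higher_is_better:
            strictly_better = sum(1 for other in values if other > v)
        else:
            strictly_better = sum(1 for other in values if other < v)
        ranks.append(strictly_better + 1)
    return ranks


def average_ranks(table, higher_is_better=True):
    """Average the per-dataset ranks of each method over all datasets.
    `table` maps a dataset name to a list of metric values, one per method."""
    n_methods = len(next(iter(table.values())))
    totals = [0.0] * n_methods
    for values in table.values():
        for i, r in enumerate(rank_methods(values, higher_is_better)):
            totals[i] += r
    return [t / len(table) for t in totals]


# Hypothetical accuracies of three methods on three datasets.
accuracy = {
    "dataset_1": [0.90, 0.85, 0.85],
    "dataset_2": [0.70, 0.75, 0.60],
    "dataset_3": [0.80, 0.80, 0.78],
}
print(average_ranks(accuracy))
```

For a metric where smaller is better, such as execution time, the same functions can be called with `higher_is_better=False`.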
Table 8 presents the results of pairwise t-tests between the proposed approach, JM mc, and each of the other approaches in terms of classification accuracy, F-measure, AUC and execution time over all the data sets. In these t-tests, a p-value below 0.05 indicates that JM mc is significantly better than the other approach; otherwise, there is no significant difference between the two. The results for classification accuracy and F-measure indicate that the proposed approach performs significantly better than the other approaches. Regarding AUC, the proposed JM mc performs significantly better than four methods: JM ave, JM Bh, One-R and Relief-F. Regarding execution time, JM mc is significantly better only than Relief-F; there is no significant difference between it and the remaining approaches.
Results of pairwise t-tests
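To make the test statistic concrete, the following Python sketch computes the t statistic for two methods compared on the same datasets. It assumes a paired (dependent-samples) t-test across datasets, which the text does not state explicitly; the function name `paired_t_statistic` and the accuracy values are hypothetical. The p-value would then be obtained from Student's t distribution with n - 1 degrees of freedom, a step that is omitted here.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for a paired t-test on two
    equal-length sequences measured on the same datasets."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical accuracies (%) of two methods on five datasets.
method_a = [80.0, 75.0, 90.0, 85.0, 70.0]
method_b = [78.0, 70.0, 88.0, 80.0, 69.0]
t, df = paired_t_statistic(method_a, method_b)
print(f"t = {t:.3f}, df = {df}")
```

With 37 datasets, as in the experiments reported here, such a test would have 36 degrees of freedom.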
Selecting an optimum feature subset from a large set of features in real world applications with noisy data is a difficult problem. One needs to balance classification accuracy against the number of features and the computational cost, while also achieving model independence and good stability. Ranking based filter approaches provide model independent feature selection at low computational cost compared with search based filter approaches for optimum feature subset selection; for the final subset, however, a wrapper method is typically used to select an appropriate number of top ranked features and obtain a practical solution. In this work, a two step feature subset selection approach for multiclass problems is proposed. In the first step, features are ranked with JM distance for individual class pairs; in the second step, a set of heuristic rules is used to find the final optimal feature subset from the ranked feature lists of all class pairs, without any explicit search technique or wrapper classifier. The proposed algorithm also represents a new way of using JM distance for feature selection in the multiclass scenario, which has been shown to be effective compared with previous extensions of JM distance to multiclass problems. The elegance of the proposed approach lies in the fact that it integrates the selection of the final feature subset from the ranked feature lists and the extension of the JM distance measure to multiclass problems in a unified process. This also eliminates the need for a wrapper classifier to determine, at a later stage, an appropriate feature subset from the top ranked features for a particular problem.
In our previous works, we examined the efficiency of JM distance as a feature selection measure for binary problems, compared with some other well known filter measures, in terms of classification accuracy, feature reduction and stability, using simulation experiments with benchmark data sets. Here we have evaluated our proposed approach for multiclass problems on 37 benchmark data sets with regard to classification accuracy, F-measure, AUC, execution time and percentage of selected features, in comparison with two different multiclass extensions of the JM distance measure and six popular feature evaluation measures for rank based feature selection. The average classification accuracy, F-measure and AUC (over all the data sets) of the proposed approach JM mc are the highest among all the methods. The computational costs of all the methods except Relief-F are comparable. Relief-F's computational cost is almost 234 times higher than that of our proposed approach JM mc. The analysis of the standard deviation of the results over 10 independent trials reveals that our proposed approach is quite stable. In our work, feature-feature interaction is not considered, in order to limit the computational time. The features are ranked according to their individual goodness, and all the methods selected for comparison work in the same way. We are now working on extending this work to specific application areas such as gene expression data, where the number of features is large and a computationally efficient algorithm is important. In our proposed approach, after some trial and error, we fixed the value of α at 10 for all the data sets. In the future, we would also like to investigate the role of α for individual data sets.
