Abstract
Feature selection aims at selecting a feature subset that has the most discriminative information and preserve most of characteristics from original features in HyperSpectral Image (HSI) classification. This paper proposes a two-stage feature selection method based on Mutual Information (MI) and Jeffries-Matusita (J-M) measure. In first stage, we select a feature subset with minimal redundancy maximal relevance criteria. In second stage, we select further a feature subset from which obtained in first stage by maximizing J-M distance. Multiple Kernel Learning (MKL) and Ensemble Learning (EL) are promising family of machine learning algorithms and have been applied extensively in HSI classification. Many MKL methods often formulate the problem as an optimization task. To avoid solving the complicated optimization problem, this paper presents an ensemble learning framework, SMKB (Stochastic Multiple Kernel Boosting), which applies Adaptive Boosting (AdaBoost) and stochastic approach to learning multiple kernel-based classifier for multi-class classification problem. We examine empirical performance of proposed approach on benchmark hyperspectral classification data set in comparison with various state-of-the-art algorithms. Experimental results demonstrate that the proposed method obtains better feature subsets and is more effective and efficient than classical methods.
Keywords
Introduction
HyperSpectral Image (HSI) is an optical imaging technique, which has the continuous coverage of solar reflective wavelengths and a high spectral resolution. A pixel of an acquired HSI can be seen as a spectral fingerprint that is unique to the materials in the respective spatial area. The ability to perform the high accurate identification of materials on the earth surface makes HSI an important tool for supporting various applications. The acquired HSI has been used extensively in land-cover classification, agriculture, and target detection. However, challenges are related to the high dimensionality of data and the limited availability of training samples. With an increase of dimensionality, theoretical and practical problems arise, and the curse of dimensionality poses critical challenge to the supervised classification of HSI. In recent years, HSI classification becomes a very active research area, which aims at assigning each pixel with one class for an object in a scene.
From the viewpoint of probability theory and information theory, Mutual Information (MI) is a good indicator of relevance between variables and measures how much knowing one of these variables reduces uncertainty about the other. By maximizing the joint MI between the input features and target output, MI-based feature selection algorithm [1] can select one optimal subset from the originaldataset.
Jeffries-Matusita (J-M) [2] is one of the spectral separability measures. The behavior of J-M usually can be regarded as much more like probability of correct classification. As a feature selection technique, J-M measure can utilize the band-wise spectral information to search for an optimal feature subset for dimensionality reduction. Y. Gao [3] proposed optimization via distance calculating. In addition, Chavez et al. [4] proposed the Optimum Index Factor (OIF) which can be calculated to obtain multivariate statistical information on a data set.
Kernel-based methods [5], such as support vector machine (SVM), dominate the field of discriminative data classification models in recent years, due to its insensitivity to the curse of dimensionality. SVM was investigated as a binary classifier at the beginning. Generally, SVM separates data by an optimal hyperplane for a training set mapped into a space. The extension of SVMs to multi-class problems usually combines several binary classifiers and has been successfully introduced into the field of remote sensing image classification. Multiple kernel learning (MKL) is a powerful field of machine learning. MKL combines multiple sub-kernels (e.g., Gaussian kernel, polynomial kernel, etc.)to seek better results compared to the single kernel learning strategy. Recent studies have shown that MKL [6, 7] can provide enhanced classification accuracy and balance between the spatial and spectral information and computational efficiency.
To address those limitations of conventional MKL methods, many researchers adopted boosting to solve MKL problems [8, 9]. In this paper, we proposed a novel hyperspectral image classification approach, named Stochastic Multiple Kernel Boosting (SMKB), by applying the boosting method and the stochastic approach to support the learning of multiple kernel-based classifier for multi-class classification problem. Our approach randomly selects a subset of sampled kernels to construct a classifier in each boosting iteration. SMKB can efficiently learn a classifier without resolving a complicated optimization task and obtain the final classifier constructed by a weighted vote of individual classifiers.
Compared with other ensemble classifiers which have been used upon hyperspectral data, three specific contributions of our approach can be summarized as follows: (i) We propose a two-stage feature selection approach based on MI and J-M measure. (ii) We present a novel framework, i.e., SMKB, for MKL, which can efficiently learn kernel based classifiers with multiple kernels. (iii) We conduct experiments on hyperspectral image data set for validating the performance of SMKB and evaluate various parameters of SMKB in order to achieve a tradeoff between accuracy and efficiency.
The remainder of this paper is organized as follows. Section 2 discusses related techniques. Section 3 briefly introduces preliminaries. Section 4 formulates the proposed framework of SMKB. Section 5 presents and analyzes the results of experimental evaluations. Finally, Section 6 concludes this work.
Related work
In Shannon’s information theory, mutual information is good at quantifying how much information is shared by two random variables. Recently, the mutual information-based feature selection algorithm [1] usually is adopted in the feature selection process for classification tasks. To avoid estimating the high-dimensional MI, many low-dimensional MI approximation methods were proposed, such as mutual information based on Parzen window [10], mutual information feature selection [11], minimal redundancy maximal relevance [12], normalized mutual information feature selection [13]. Wei et al. [14] developed a MI-based unsupervised feature transformation process for heterogeneous feature selection. Feng et al. [15] defined a maximum joint mutual information criterion for unsupervised feature selection.
Jeffries-Matusita distance provides a much reliable criterion of class separability. Swain and Bruzzone [16, 17] used J-M measure to quantify the spectral separability in target detection. Laliberte et al. [18] used J-M measure to avoid the limitation of extending the measure to small sample sizes. Ghiyamat et al. [19] employed J-M measure for assessing non-normally distributed classes. Padma et al. [20] combined the spectral angle mapper and the J-M measure by the tangent and sine functions for hyperspectral image analysis.
Kernel-based methods [21, 22], based on the principles of structural risk minimization [23], have been widely used to solve machine learning problems in recent decades. Kernel-based methods have good generalization performance, for supporting applications in HSI classification [24, 25]. In [24], an original classification problem based on active learning strategies is reformulated to discriminate between significant and nonsignificant samples. In [25], samples in the training set maximizes the changes in the posterior distribution are selected through a maximum-likelihood classifier. However, they mainly work in a fixed training set scenario, and pay little attention on the integration of kernels.
Several hundreds of spectral bands lead to theoretical and practical problems [26, 27]. Most classification algorithms encountered the “Hughes phenomenon” [28]. To deal with this phenomenon, a large number of works have been performed. Tarabalkaet et al. [29] applied SVM to obtain class conditional probabilities and constructed minimum spanning forests to enable hyperspectral image segmentation and classification. Hamet et al. [30] incorporated the bagging of training samples and adaptive random subspace feature selection to construct a classifier. Tuia et al. [31] introduced a semi-supervised SVM and used two kernels to train the classifier. Mura et al. [32] extended morphological attribute profiles based on independent component analysis (ICA) for the classification of HSI. Besides, researches proposed spatial feature extraction [33, 34], multiple feature learning [35], nonlocal joint collaborative representation [36], spatial kernel-based methods [37], subspace projection-based approaches [38], active learning strategies [39], and probabilistic modeling-based methods [40, 41] for solving learning problems. Despite the success of kernel methods, an inappropriate kernel may lead to impaired prediction performance, thus choosing an appropriate kernel function and appropriate feature space is crucial for achieving good performance. Composite kernel learning [42], multiple kernel learning [43], and ensemble learning [44, 45] have been shown to outperform traditional single kernel approaches for HSI classification. various methods were developed to solve MKL problems, such as SimpleMKL [6], sparse MKL [46] and SpicyMKL [47].
Recently, ensemble methods are employed for MKL. Ensemble learning is a very effective and extremely versatile method which considers the result of misclassified data in the training phase and collect several classifiers to classify test examples. Extensive algorithms use ensemble learning to solve MKL. Since support vector coefficients cannot be obtained, Sun et al. [48] used an approximated method to approximate support vectors. Crammer et al. [49] provided a kernel evaluation method with multiple choices for sub-classifier forms. Lin et al. [50] used a manifold structure as the criterion to find the best local classifier. However, they have to solve a complicated optimization task when learning classifier using boosting methods. Other works adopt boosting and ensemble technique, including BoostSVM [51], AdaBoost with SVM [52, 53]. The representative boosting algorithm is the AdaBoost algorithm [54] (which will be introduced in detail in Section 3).
Simulation results for hyperspectral remote sensing data show that SVM ensemble with bagging or boosting outperforms a single SVM in terms of classification accuracy [45]. But they can hardly deal with multiple kernels originating from multiple resources.
Preliminaries: MI, SVM, MKL, and AdaBoost
In this section, we introduce entropy and mutual information, kernel-based method, MKL, and ensemble method, which have been used extensively in HSI classification.
Entropy and mutual information
The MI of a random variable with itself is the entropy of this random variable. Entropy is referred to as self-information. If a feature or class label can be regarded as a variable, entropy generally represent uncertainty or information of a given feature in information theory. If {X = x
i
|i = 1, ⋯ , n} is a non-numerical feature or class label, the entropy of X is
Mutual information can be perceived as evaluating the information shared by the two given features to measure the dependence between them. If {X = x
i
|i = 1, ⋯ , n} and {Y = y
j
|j = 1, ⋯ , m} are non-numerical features or class label, the MI between them is
Let {X = x i |i = 1, ⋯ , n} be the entire feature set of a given dataset, and let S m be a selected subset of X which consists of m features. Given the target class label c and a feature x i , after sorting I (x i , c) in descent order, we can obtain S m by selecting the top m features.
According to minimal redundancy maximal relevance criteria [12], the mutual information between candidate features and target classes, the mean mutual information between candidate features and the selected feature in subset Sm-1 are calculated. The m-th feature from set {X - Sm-1} is selected according to the condition shown as follows:
Support vector machine is a kernel method that has been used mostly.
SVM is a discriminative classifier based on a single kernel. Given a sample of independent and identically distributed training instances
Rather than using a single kernel k, MKL has n base kernels k1, ⋯ , k
n
, with corresponding feature maps φ1, ⋯ , φ
n
. After explicitly model the weights (μ1, ⋯ , μ
n
)
T
, an MKL formulation was presented in [55]:
Ensemble learning [57] try to generate one learner by constructing a set of base learners from training data. The “base learners” are also referred as “weak learners”. Ensemble methods are able to boost weak learners and usually significantly better than a single learner. Schapire [58] proposed the first boosting algorithm. Boosting approaches rely on resampling techniques to obtain different training sets for each classifiers and iteratively add a new kernel until the performance is relatively stable. Freund and Schapire [54] then developed AdaBoost algorithm which manipulated training examples to generate multiple hypotheses.
The AdaBoost algorithm can be is summarized as following:
Given a sequence of m examples < (x1, y1) , ⋯ , (x
m
, y
m
) > with labels y
i
∈ Y = {1, ⋯ , k}, number of learning iterations T, and a probability distribution D
t
(i) , i = 1, …, m over training examples; In each iteration t, call calculate the error of h
t
:
t
= ∑i:h
t
(x
i
)≠y
i
D
t
(i). If ∊
t
> 1/2, set T = t-1 and abort loop set β
t
= ∊
t
/(1 - ∊
t
) update distribution D
t
:
In above procedure,
Stochastic MKL AdaBoost framework
This section presents the two-stage feature selection approach and SMKB algorithm.
Feature selection criterion
Jeffries-Matusita distance is one of the spectral separability measures commonly used in remote sensing applications. According to Swain et al. [2], J-M distance provides a much reliable criterion because as a function of class separability, it behaves much more like probability of correct classification. If {X = x
i
|i = 1, ⋯ , n} is a feature and {C = c
i
|i = 1, ⋯ , m} is a class label, the J-M distance is given as:
The J-M measure is an attractive method and overcomes the limitation of Transformed Divergence which calculates the divergence as a function of normalized distance between two classes. Hence, this study employs a two-stage procedure for feature selection. In first stage, we select highly informative and lowly redundant features by MI. In second stage, we maximize J-M distance to increase spectral separability in the feature subset which was obtained in first stage.
Stochastic method is widely used to solve complex optimization problems [59]. The goal of our Stochastic Multiple Kernel Boosting (SMKB) method is to learn a classifier using boosting techniques and multiple kernels. As shown in Algorithm 1, our approach maintains a probability distribution D
t
over training examples. At each boosting trial t, t = 1, ⋯, T, where T denotes the total number of boosting trials, we learn a kernel classifiers f
t
(x). The misclassification rate
In particular, we learn one classifier
Update the weight Dt+1 (i) as follows:
The final classifier is constructed by a weighted vote of individual classifiers as follows:
1: training data: S N = {(x1, y1) , ⋯ , (x N , y N )}
2: kernel function: κ
j
(·, ·):
3: ∀ i: initialize D1 (i) = 1/m
4: select a subset S mi of S N with f MI features according to Equation 3
5: select a subset S jm of S mi with f JM features according to Equation 7
6:
7: sample n examples using distribution D t
8:
9: train a weak classifier
10: calculate the training error over D t
11:
12:
13: sorting
14: combine k (= ρ * M) classifiers in front: f
t
(x) =
15:
16: compute the training error over D
t
∊
t
=
17: choose α
t
=
18: update Dt+1 (i) ←
19:
Different with AdaBoost at each boosting trial, we no longer simply discard the other M-1 classifiers. We sample k kernels with lowest misclassification rate, where k is specified by a randomly generated kernel sampling ratio ρ (0 < ρ ≤ 1) that determines the proportion of kernel to be sampled. Specifically, after sorting
This section evaluates the performance of our SMKB algorithm for hyperspectral image classification.
Data set description
The image dataset, in our experiments, is the widely used one acquired using the AVIRIS sensor over the Indian Pines region, Northwestern Indian, USA, in 1992. The Indian Pines data set comprises 220 bands and has a spatial size of 145×145 pixels in the wavelength range from 0.4 to 2.5 μm. Removing noisy bands, there are 200 bands remaining. The ground truth has 10249 labeled pixels. It consists of 16 land cover classes. Figure 1 shows a false color composite (bands 17, 27, and 50 for RGB) and the ground truth.

Classification maps of AVIRIS Indian Pine dataset: (a) False color composite (bands 17, 27, and 50 for RGB); (b) Ground truth.
To evaluate our SMKB algorithm, we compared it with several state-of-the-art competitive algorithms, including SVM-based single kernel (SVM for short), SimpleMKL [6], and AdaBoost with SVM (AdaBoostSVM for short). For SVM classifiers, we employ a polynomial and Gaussian radial basis function kernel. We have performed ten-fold cross-validation procedure using a single SVM for finding optimal SVM parameters σ ∈ {10-2, . . . , 102}, C ∈ {101, . . . , 104}. In all cases, the one-versus-one multiclass scheme implemented in LibSVM [60] was used. SimpleMKL is one of algorithms used to solve the MKL problem. To implement SimpleMKL algorithms, we adopt the SimpleMKL toolbox [6] and their default settings suggested by the toolbox. AdaBoostSVM is an algorithm applying AdaBoost to improve SVM learning accuracy [51]. For AdaBoostSVM, 10-fold cross validation is adopted to select the best kernel, other settings are the same as in SMKB algorithms. For our SMKB, we follow the typical approach used in traditional MKL and AdaBoost studies in literature. In particular, 16 base kernels are used initially in the ensemble, i.e., gaussian kernels using 13 different bandwidth parameters from {2-6, 2-5, ⋯, 26} and polynomial kernels of degree 1 to 3 on all features. In first stage, S mi consists of 150 features according to Equation 3 and S jm consists of 120 features according to Equation 7. All experiments are running on a Windows 7 with 2.9 GHz Intel CPU, 16 GB RAM and MATLAB8.2 environment.
Comparison results
We have analyzed the classification accuracy of our SMKB with the AVIRIS dataset of Indian Pines. Table 1 shows classification results obtained by different methods, when the number of training samples is 15% of 10249 reference samples. Our method has obtained the highest OA = 88.97%. Experiments presented in Table 1 have been repeated 10 times to calculate the mean of OA, AA, and Kappa to evaluate the performance of different methods. The highest scores for each class are highlighted in the boldface font.
Classification results (%) of the different methods
Classification results (%) of the different methods
This table clearly demonstrates that our method yields better results (although for 15% of training data, this difference is not significant). It can also be seen that our method presents higher performances, especially in classes with small number of training samples such as Alfalfa, Grass/pasture-mowed, Oats and Stone-steel towers. Classification maps obtained using these four methods are shown in Fig. 2.

Classification maps with respect to the AVIRIS Indian Pine dataset, where (a) for SVM, (b) for SimpleSVM, (c) for AdaboostSVM, and (d) for our SMKB.
Figure 3 shows the evolution of kappa parameter as a function of the percentage of training samples used for our four evaluated approaches. In all cases, experiments are performed with 5% – 50% of all labeled samples in each class. The points in this picture indicate the respective mean kappa (over 10 runs) of these methods.

Relationship between the selection of scales and kappa coefficients.
Figure 3 illustrates that our method is significantly superior to SVM and SimpleMKL, and close to AdaBoostSVM, for the Indian dataset. By referring to the results in Fig. 3, as the number of training samples m increases, the classification performance of each kernel generally increases. This is due to the fact that, the complexity of data construction increases with increasing m. However, the number of samples for more than 20% did not have much impact on the kappa coefficient. The advantage of our SMKB is the most obvious for 15%, 20% and 30% of training samples with 0.73%, 0.62%, and 0.58% in kappa against AdaBoostSVM. However, such an advantage becomes less obvious when training samples at 10% and 50%. More interesting, AdaBoostSVM and SVM have a closest kappa coefficient in 30% of training samples.
The kappa statistics indicates that our SimpleMKL obtains a lower performance in each experiment. This result is similar to what Tuia et al. reported in [7]. MKL is not expected to always improve standard SVM. This limitation can be a motivation that exploits semi-supervised classifiers with unlabeled pixels and spatial features beyond spectral information. There are two reasons to explain why Simple MKL does not reflect the advantage of multiscale kernel combination: (i) insufficient training samples for constructing kernel matrices will result in the SimpleMKL overfitting [7], and (ii) choosing an optimal candidate multiscale kernel automatically for SimpleMKL is really difficult.
In our experiments, we randomly select approximately 15% of training samples in each class. We notice that the kernel sampling decay factor ρ which being randomly generated in each iteration could affect the performance (both accuracy and efficiency) of our SMKB algorithm. To examine the impact of ρ, we conduct a set of experiments to evaluate this impact on both accuracy and efficiency performance in classification tasks.
Generally, the sampling ratio ρ determines the proportion of kernels sampled from the whole collection of kernels at each boosting trial. The results of accuracy and efficiency performance with respect to the kernel sampling ratio ρ by varying its value from 0.1 to 1.0 are shown in Fig. 4. From these experimental results, we found that our SMKB algorithm with a large kernel sampling ratio value usually produced a better classification accuracy performance. This is especially more evident when the kernel sampling ratio is small.

Evaluation of SMKB kappa coefficients with respect to the sampling ratio ρ.
Despite the improvement when increasing the kernel sampling ratio, we found that the classification accuracy tends to saturate when the value is large enough (e.g., larger than 0.6). This is due to the fact that, when the sampling ratio is too small, the base kernel classifiers may suffer from insufficient training examples at boosting trials. However, employing a too large sampling ratio may lead to sample too many training data examples for data set, which may be redundant for building base classifiers.
Besides the impact on the accuracy, the kernel sampling ratio parameter affects the efficiency of our SMKB algorithm as well. The computing time of SMKB algorithm with various values of kernel sampling ratio ρ is evaluated. Relationship between kernel sampling ratio ρ and the time cost is shown in Fig. 5. We can see from this figure that, increasing the kernel sampling ratio also leads to the increase of the time cost needed by our SMKB algorithm. In particular, the runtime is nonlinearly increasing as the kernel sampling ratio ρ increases from the value 0.15. In practice, a tradeoff should be considered between accuracy and efficiency by choosing an appropriate number of kernel sampling ratio ρ.

Evaluation of our SMKB learning time cost with respect to kernel sampling ratio ρ.
Experiments are conducted for examining the impact of various number of base kernels on kappa coefficients for our method, AdaBoostSVM, and SimpleMKL algorithms (due to using single kernel, SVM was absent). In our previous experiments, we have fixed the number of base kernels to 16. In this set of experiments, we examine the experimental result by varying the numbers of kernels from 8 to 40. Figure 6 shows the experimental result about the impact of the numbers of kernels on the classification performance.

Kappa statistics with various numbers of kernels.
As shown in Fig. 6, our method performs the best and obviously outperforms the SimpleMKL method. In terms of the curves of kappa coefficients, increasing the total number of kernels in general is able to boost the accuracy performance of all our SMKB algorithm consistently. Such observation is particularly more evident when the total number of kernels is relatively small (e.g., M < 16). The kappa coefficient usually starts to decrease when the numbers of kernels is large (e.g., M > 16).
The computational cost is a critical bottleneck of MKL methods in hyperspectral classifications. The computational time, as an important factor indicating the applicability of MKL, was tested to compare the computational efficiency of different methods. In order to illustrate the classification efficiency, the computational time of different methods are shown in Fig. Fig. 7. The corresponding settings in this experiment were the same as previous experiments. From this figure, it can be found that the computational time of our SMKB are close to AdaBoostSVM, and much less than SimpleMKL. In particular, the time cost of our SMKB algorithm is only half of that of SimpleMKL, and about 80% of that of AdaBoostSVM, respectively.

Time costs for various numbers of kernels.
For our SMKB, the recorded runtime includes the time consumed on training and testing set sampling, multiple kernel constructing, SVM training and testing. Figure 7 shows that our SMKB algorithm utilizing 28 kernels is still much more efficient than SimpleMKL algorithm with 16 kernels. The consumed time curve of SimpleMKL rises steeply along with the number of kernels, while that of SMKB rises slowly. In short, compared with other algorithms, SMKB can be still satisfied with respect to computational efficiency.
This paper proposed an effective stochastic multiple kernel adaboost framework, called Stochastic Multiple Kernel Boosting (SMKB), for high-dimensional hyperspectral remote sensing classification based on the notion of multiclass kernel ensemble. Compared to the AdaBoostSVM approach, due to randomly employing a subset of kernels to construct classifiers, SMKB is significantly more efficient while providing a better performance of classification accuracy.
The efficiency of SMKB is evaluated upon the Indian Pines data set. The performance of SMKB was evaluated based on several criteria: the number of training samples, the ensemble size, and the number of kernel sampling. Experimental results revealed that SMKB performs well in terms of accuracy, compared with traditional algorithms. These indicate that SMKB are a promising approach for generating classifier ensemble of hyperspectral remotesensing.
