A framework of multiple kernel ensemble learning for classification using two-stage feature selection method

Abstract

Feature selection aims to select a feature subset that carries the most discriminative information and preserves most of the characteristics of the original features in HyperSpectral Image (HSI) classification. This paper proposes a two-stage feature selection method based on Mutual Information (MI) and the Jeffries-Matusita (J-M) measure. In the first stage, we select a feature subset using the minimal redundancy maximal relevance criterion. In the second stage, we further select a feature subset from the one obtained in the first stage by maximizing the J-M distance. Multiple Kernel Learning (MKL) and Ensemble Learning (EL) are promising families of machine learning algorithms and have been applied extensively in HSI classification. Many MKL methods formulate the problem as an optimization task. To avoid solving a complicated optimization problem, this paper presents an ensemble learning framework, SMKB (Stochastic Multiple Kernel Boosting), which applies Adaptive Boosting (AdaBoost) and a stochastic approach to learn multiple kernel-based classifiers for multi-class classification problems. We examine the empirical performance of the proposed approach on a benchmark hyperspectral classification data set in comparison with various state-of-the-art algorithms. Experimental results demonstrate that the proposed method obtains better feature subsets and is more effective and efficient than classical methods.

Keywords

Feature selection; Hyperspectral classification; Jeffries-Matusita measure; Multiple Kernel Learning; Mutual Information

1 Introduction

Hyperspectral imaging is an optical imaging technique that provides continuous coverage of solar reflective wavelengths at a high spectral resolution. A pixel of an acquired HyperSpectral Image (HSI) can be seen as a spectral fingerprint that is unique to the materials in the respective spatial area. The ability to identify materials on the earth's surface with high accuracy makes HSI an important tool for supporting various applications. Acquired HSIs have been used extensively in land-cover classification, agriculture, and target detection. However, challenges arise from the high dimensionality of the data and the limited availability of training samples. As dimensionality increases, theoretical and practical problems arise, and the curse of dimensionality poses a critical challenge to the supervised classification of HSI. In recent years, HSI classification, which aims at assigning each pixel in a scene to one object class, has become a very active research area.

From the viewpoint of probability theory and information theory, Mutual Information (MI) is a good indicator of relevance between variables: it measures how much knowing one variable reduces uncertainty about the other. By maximizing the joint MI between the input features and the target output, an MI-based feature selection algorithm [1] can select an optimal subset from the original dataset.

The Jeffries-Matusita (J-M) distance [2] is one of the spectral separability measures. As a function of class separability, J-M behaves much like the probability of correct classification. As a feature selection technique, the J-M measure can use band-wise spectral information to search for an optimal feature subset for dimensionality reduction. Gao [3] proposed an optimization based on distance calculation. In addition, Chavez et al. [4] proposed the Optimum Index Factor (OIF), which can be calculated to obtain multivariate statistical information on a data set.

Kernel-based methods [5], such as the support vector machine (SVM), have dominated the field of discriminative data classification models in recent years, owing to their insensitivity to the curse of dimensionality. SVM was first investigated as a binary classifier. In general, SVM separates data by an optimal hyperplane for a training set mapped into a feature space. The extension of SVMs to multi-class problems usually combines several binary classifiers and has been successfully introduced into the field of remote sensing image classification. Multiple kernel learning (MKL) is a powerful branch of machine learning. MKL combines multiple sub-kernels (e.g., Gaussian kernel, polynomial kernel, etc.) to seek better results than the single kernel learning strategy. Recent studies have shown that MKL [6, 7] can provide enhanced classification accuracy and a balance between the use of spatial and spectral information and computational efficiency.

Conventional MKL methods typically formulate learning as a complicated optimization task. To address this limitation, many researchers have adopted boosting to solve MKL problems [8, 9]. In this paper, we propose a novel hyperspectral image classification approach, named Stochastic Multiple Kernel Boosting (SMKB), which applies the boosting method and a stochastic approach to learn multiple kernel-based classifiers for multi-class classification problems. Our approach randomly selects a subset of sampled kernels to construct a classifier in each boosting iteration. SMKB can efficiently learn a classifier without solving a complicated optimization task, and the final classifier is constructed by a weighted vote of the individual classifiers.

Compared with other ensemble classifiers which have been used upon hyperspectral data, three specific contributions of our approach can be summarized as follows: (i) We propose a two-stage feature selection approach based on MI and J-M measure. (ii) We present a novel framework, i.e., SMKB, for MKL, which can efficiently learn kernel based classifiers with multiple kernels. (iii) We conduct experiments on hyperspectral image data set for validating the performance of SMKB and evaluate various parameters of SMKB in order to achieve a tradeoff between accuracy and efficiency.

The remainder of this paper is organized as follows. Section 2 discusses related techniques. Section 3 briefly introduces preliminaries. Section 4 formulates the proposed framework of SMKB. Section 5 presents and analyzes the results of experimental evaluations. Finally, Section 6 concludes this work.

2 Related work

In Shannon’s information theory, mutual information quantifies how much information is shared by two random variables. Mutual information-based feature selection [1] is now commonly adopted in the feature selection process for classification tasks. To avoid estimating high-dimensional MI, many low-dimensional MI approximation methods have been proposed, such as mutual information based on the Parzen window [10], mutual information feature selection [11], minimal redundancy maximal relevance [12], and normalized mutual information feature selection [13]. Wei et al. [14] developed an MI-based unsupervised feature transformation process for heterogeneous feature selection. Feng et al. [15] defined a maximum joint mutual information criterion for unsupervised feature selection.

The Jeffries-Matusita distance provides a more reliable criterion of class separability. Swain and Bruzzone [16, 17] used the J-M measure to quantify spectral separability in target detection. Laliberte et al. [18] used the J-M measure to avoid the limitation of extending the measure to small sample sizes. Ghiyamat et al. [19] employed the J-M measure for assessing non-normally distributed classes. Padma et al. [20] combined the spectral angle mapper and the J-M measure through the tangent and sine functions for hyperspectral image analysis.

Kernel-based methods [21, 22], based on the principle of structural risk minimization [23], have been widely used to solve machine learning problems in recent decades. Kernel-based methods have good generalization performance and support applications in HSI classification [24, 25]. In [24], an original classification problem based on active learning strategies is reformulated to discriminate between significant and nonsignificant samples. In [25], the training samples that maximize the changes in the posterior distribution are selected through a maximum-likelihood classifier. However, these methods mainly work in a fixed training set scenario and pay little attention to the integration of kernels.

Several hundred spectral bands lead to theoretical and practical problems [26, 27]. Most classification algorithms encounter the “Hughes phenomenon” [28]. A large body of work has addressed this phenomenon. Tarabalka et al. [29] applied SVM to obtain class conditional probabilities and constructed minimum spanning forests to enable hyperspectral image segmentation and classification. Ham et al. [30] incorporated the bagging of training samples and adaptive random subspace feature selection to construct a classifier. Tuia et al. [31] introduced a semi-supervised SVM and used two kernels to train the classifier. Mura et al. [32] extended morphological attribute profiles based on independent component analysis (ICA) for the classification of HSI. In addition, researchers have proposed spatial feature extraction [33, 34], multiple feature learning [35], nonlocal joint collaborative representation [36], spatial kernel-based methods [37], subspace projection-based approaches [38], active learning strategies [39], and probabilistic modeling-based methods [40, 41] for solving learning problems. Despite the success of kernel methods, an inappropriate kernel may impair prediction performance, so choosing an appropriate kernel function and feature space is crucial for achieving good performance. Composite kernel learning [42], multiple kernel learning [43], and ensemble learning [44, 45] have been shown to outperform traditional single kernel approaches for HSI classification. Various methods have been developed to solve MKL problems, such as SimpleMKL [6], sparse MKL [46], and SpicyMKL [47].

Recently, ensemble methods have been employed for MKL. Ensemble learning is a very effective and versatile method that takes misclassified training data into account during the training phase and collects several classifiers to classify test examples. Many algorithms use ensemble learning to solve MKL. Since support vector coefficients cannot be obtained, Sun et al. [48] used an approximation method to estimate the support vectors. Crammer et al. [49] provided a kernel evaluation method with multiple choices for sub-classifier forms. Lin et al. [50] used a manifold structure as the criterion to find the best local classifier. However, these methods have to solve a complicated optimization task when learning classifiers with boosting. Other works adopt boosting and ensemble techniques, including BoostSVM [51] and AdaBoost with SVM [52, 53]. The representative boosting algorithm is the AdaBoost algorithm [54] (which is introduced in detail in Section 3).

Simulation results for hyperspectral remote sensing data show that an SVM ensemble with bagging or boosting outperforms a single SVM in terms of classification accuracy [45]. However, such ensembles can hardly deal with multiple kernels originating from multiple resources.

3 Preliminaries: MI, SVM, MKL, and AdaBoost

In this section, we introduce entropy and mutual information, kernel-based method, MKL, and ensemble method, which have been used extensively in HSI classification.

3.1 Entropy and mutual information

The MI of a random variable with itself is the entropy of this random variable, so entropy is also referred to as self-information. If a feature or class label is regarded as a random variable, its entropy represents the uncertainty, or information content, of that variable. If {X = x_i|i = 1, ⋯ , n} is a non-numerical feature or class label, the entropy of X is $H (X) = - \sum_{i = 1}^{n} p (x_{i}) \log p (x_{i}) ,$ (1) where p (x_i) denotes the probability of the value x_i.

Mutual information evaluates the information shared by two given features and thus measures the dependence between them. If {X = x_i|i = 1, ⋯ , n} and {Y = y_j|j = 1, ⋯ , m} are non-numerical features or class labels, the MI between them is $I (X; Y) = \sum_{i = 1}^{n} \sum_{j = 1}^{m} p (x_{i}, y_{j}) \log \frac{p (x_{i}, y_{j})}{p (x_{i}) p (y_{j})} ,$ (2) where p (x_i, y_j) denotes the joint probability of x_i and y_j, and p (x_i) and p (y_j) denote the corresponding marginal probabilities. I (X ; Y) can be used to measure the linear or nonlinear dependence between the two features. It is nonnegative. The higher the value of I (X ; Y), the stronger the dependence between the two features. When I (X ; Y) = 0, the two features are completely independent.
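As an illustration of Eqs. 1 and 2, the following minimal sketch estimates entropy and MI for discrete samples by plugging empirical frequencies into the definitions. The function names and the use of the natural logarithm are assumptions made for this example; the paper does not specify an estimator.

```python
import numpy as np

def entropy(x):
    """Plug-in estimate of H(X) for a discrete 1-D sample, in nats (Eq. 1)."""
    _, counts = np.unique(np.asarray(x).ravel(), return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) for two discrete 1-D samples (Eq. 2)."""
    _, xi = np.unique(np.asarray(x).ravel(), return_inverse=True)
    _, yi = np.unique(np.asarray(y).ravel(), return_inverse=True)
    # Empirical joint distribution p(x_i, y_j)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    # Product of the marginals p(x_i) p(y_j)
    outer = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    nz = joint > 0  # terms with p(x_i, y_j) = 0 contribute nothing
    return float(np.sum(joint[nz] * np.log(joint[nz] / outer[nz])))
```

For a variable paired with itself the estimate equals its entropy, and for two independent balanced binary samples it is zero, matching the properties stated above.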

Let {X = x_i|i = 1, ⋯ , n} be the entire feature set of a given dataset, and let S_m be a selected subset of X that consists of m features. Given the target class label c and a feature x_i, after sorting I (x_i, c) in descending order, we can obtain S_m by selecting the top m features.

According to the minimal redundancy maximal relevance criterion [12], we calculate the mutual information between each candidate feature and the target class, as well as the mean mutual information between each candidate feature and the features already selected in subset S_m-1. The m-th feature is selected from the set {X - S_m-1} according to the following condition: $\underset{x_{j} \in X - S_{m - 1}}{Max} [I (x_{j}; c) - \frac{1}{m - 1} \sum_{x_{i} \in S_{m - 1}} I (x_{j}; x_{i})] .$ (3)
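The greedy selection implied by Eq. 3 can be sketched as follows. This is only an illustration under assumptions: features are taken to be already discretized, the first feature is chosen by maximum relevance alone, and the helper names `mi` and `mrmr_select` are hypothetical.

```python
import numpy as np

def mi(x, y):
    """Plug-in mutual information between two discrete 1-D arrays (nats)."""
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / outer[nz])))

def mrmr_select(X, c, m):
    """Greedily pick m column indices of X following Eq. 3."""
    n_features = X.shape[1]
    relevance = np.array([mi(X[:, j], c) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]  # assumption: start with max relevance
    redundancy_sum = np.zeros(n_features)
    while len(selected) < m:
        last = selected[-1]
        # Accumulate I(x_j; x_i) for the most recently selected feature x_i
        redundancy_sum += np.array(
            [mi(X[:, j], X[:, last]) for j in range(n_features)]
        )
        # len(selected) plays the role of m - 1 in Eq. 3
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf  # never re-select a feature
        selected.append(int(np.argmax(score)))
    return selected
```

In this sketch a duplicated feature is penalized by its redundancy term, so a less redundant feature with the same relevance is preferred.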

3.2 SVM and MKL

The support vector machine is one of the most widely used kernel methods.

SVM is a discriminative classifier based on a single kernel. Given a sample of independent and identically distributed training instances ${\{(x_{i}, y_{i})\}}_{i = 1}^{N}$ , where $x_{i} \in ℝ^{D}$ and y_i ∈ {-1, +1} is the class label of x_i, SVM uses a mapping function Φ (·) into a feature space. The discriminant function is defined as follows: $f (x) = \langle w, Φ (x) \rangle + b$ (4) where w and b are obtained from $Min \; \frac{1}{2} ∥ w ∥_{2}^{2} + C \sum_{i = 1}^{N} ξ_{i}$ (5) $\begin{matrix} \mathrm{w.r.t.} \quad w \in ℝ^{S}, \; ξ \in ℝ_{+}^{N}, \; b \in ℝ \\ \mathrm{s.t.} \quad y_{i} (\langle w, Φ (x_{i}) \rangle + b) \geq 1 - ξ_{i} \quad \forall i \end{matrix}$ Here w is the vector of weight coefficients, S is the dimensionality of the feature space obtained by Φ (·), C is a predefined positive trade-off parameter between model simplicity and classification error, ξ is the vector of slack variables, and b is the bias term of the separating hyperplane.

Rather than using a single kernel k, MKL has n base kernels k₁, ⋯ , k_n, with corresponding feature maps φ₁, ⋯ , φ_n. After explicitly modeling the kernel weights (μ₁, ⋯ , μ_n) ^T, an MKL formulation was presented in [55]: $\underset{μ, w, b, ξ}{Min} \frac{1}{2} {(\sum_{k = 1}^{n} μ_{k} ∥ w_{k} ∥)}^{2} + C \sum_{i = 1}^{l} ξ_{i}$ (6) $\begin{matrix} \mathrm{s.t.} \quad y_{i} (\sum_{k = 1}^{n} μ_{k} w_{k}^{T} φ_{k} (x_{i}) + b) \geq 1 - ξ_{i} \\ ξ_{i} \geq 0, \; i = 1, \dots, l \\ \sum_{k = 1}^{n} μ_{k} = 1, \; μ_{k} \geq 0, \; k = 1, \dots, n \end{matrix}$ where l is the number of training instances.
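A common kernel-side reading of multiple kernel learning, assumed here for illustration rather than derived from Eq. 6, is that the base Gram matrices are merged into one matrix by a convex combination with the weights μ_k. The function name `combine_kernels` is hypothetical.

```python
import numpy as np

def combine_kernels(grams, mu):
    """Return sum_k mu_k * K_k for Gram matrices K_k and simplex weights mu."""
    mu = np.asarray(mu, dtype=float)
    # Enforce the constraints of Eq. 6: mu_k >= 0 and sum_k mu_k = 1
    if np.any(mu < 0) or not np.isclose(mu.sum(), 1.0):
        raise ValueError("kernel weights must be nonnegative and sum to one")
    # Contract the weight vector with the first axis of the (n, N, N) stack
    return np.tensordot(mu, np.asarray(grams, dtype=float), axes=1)
```

Because each base Gram matrix is positive semidefinite and the weights are nonnegative, the combination is again a valid kernel matrix.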

3.3 AdaBoost

Ensemble learning [57] tries to generate one learner by constructing a set of base learners from training data. The “base learners” are also referred to as “weak learners”. Ensemble methods are able to boost weak learners and usually perform significantly better than a single learner. Schapire [58] proposed the first boosting algorithm. Boosting approaches rely on resampling techniques to obtain different training sets for each classifier and iteratively add a new kernel until the performance is relatively stable. Freund and Schapire [54] then developed the AdaBoost algorithm, which manipulates training examples to generate multiple hypotheses.

The AdaBoost algorithm can be summarized as follows:

Given a sequence of m examples < (x₁, y₁) , ⋯ , (x_m, y_m) > with labels y_i ∈ Y = {1, ⋯ , k}, number of learning iterations T, and a probability distribution D_t (i) , i = 1, …, m over training examples;

In each iteration t, call WeakLearner with the distribution D_t and get back a hypothesis h_t : X → Y;

calculate the error of h_t: ∊_t = ∑_{i:h_t(x_i)≠y_i} D_t (i).

If ∊_t > 1/2, set T = t-1 and abort loop

set β_t = ∊_t/(1 - ∊_t)

update distribution D_t: $D_{t + 1} (i) = \frac{D_{t} (i)}{Z_{t}} \times \begin{cases} β_{t} & \text{if } h_{t} (x_{i}) = y_{i}, \\ 1 & \text{otherwise} \end{cases}$

In the above procedure, WeakLearner is a base learner, such as a decision tree, a neural network, an SVM, or another kind of machine learning algorithm. Z_t is a normalization factor that makes D_t+1 a distribution.
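The steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the decision-stump weak learner `fit_stump` is a hypothetical stand-in for WeakLearner, the clamp on ∊_t is an added numerical guard, and the final vote uses the standard AdaBoost.M1 rule (weight log(1/β_t) per hypothesis), which the summary above does not spell out.

```python
import numpy as np

def fit_stump(X, y, D):
    """Hypothetical weak learner: best weighted single-feature threshold rule."""
    classes = np.unique(y)
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            left = X[:, f] <= thr
            labels = []
            for mask in (left, ~left):
                # Weighted majority label on this side of the threshold
                w = [D[mask & (y == c)].sum() for c in classes]
                labels.append(classes[int(np.argmax(w))])
            pred = np.where(left, labels[0], labels[1])
            err = D[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, f, thr, labels[0], labels[1])
    _, f, thr, lo, hi = best
    return lambda Z: np.where(Z[:, f] <= thr, lo, hi)

def adaboost_m1(X, y, T, weak_learner=fit_stump):
    m = len(y)
    D = np.full(m, 1.0 / m)               # uniform initial distribution
    hyps, betas = [], []
    for _ in range(T):
        h = weak_learner(X, y, D)
        miss = h(X) != y
        eps = D[miss].sum()               # weighted error of h_t
        if eps > 0.5:                     # abort as in the procedure above
            break
        eps = max(eps, 1e-12)             # assumption: avoid beta_t = 0
        beta = eps / (1.0 - eps)
        D = D * np.where(miss, 1.0, beta) # shrink weights of correct examples
        D /= D.sum()                      # division by Z_t
        hyps.append(h)
        betas.append(beta)
    classes = np.unique(y)

    def predict(Z):
        votes = np.zeros((len(Z), len(classes)))
        for h, b in zip(hyps, betas):
            pred = h(Z)
            for k, c in enumerate(classes):
                votes[:, k] += np.log(1.0 / b) * (pred == c)
        return classes[np.argmax(votes, axis=1)]

    return predict
```

Any learner that accepts a weight vector D and returns a callable hypothesis could be passed as `weak_learner` in place of the stump.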

4 Stochastic MKL AdaBoost framework

This section presents the two-stage feature selection approach and SMKB algorithm.

4.1 Feature selection criterion

The Jeffries-Matusita distance is one of the spectral separability measures commonly used in remote sensing applications. According to Swain et al. [2], the J-M distance provides a more reliable criterion because, as a function of class separability, it behaves much more like the probability of correct classification. If {X = x_i|i = 1, ⋯ , n} is a feature and {C = c_i|i = 1, ⋯ , m} is a class label, the J-M distance is given as: $J (S_{1}, S_{2}) = \sqrt{\sum_{l = 1}^{L} (\sqrt{p (X | c_{i})} - \sqrt{p (X | c_{j})})^{2}}$ (7) where p (X|c_i) and p (X|c_j) are the probability densities of the spectral vectors S₁ and S₂ over the bands {1, ⋯ , L}, and c_i, c_j ∈ C.

The J-M measure is attractive because it overcomes the limitation of Transformed Divergence, which calculates the divergence as a function of the normalized distance between two classes. Hence, this study employs a two-stage procedure for feature selection. In the first stage, we select highly informative and minimally redundant features by MI. In the second stage, we maximize the J-M distance to increase spectral separability within the feature subset obtained in the first stage.
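A discrete reading of Eq. 7 and one possible way to use it in the second stage can be sketched as follows. The histogram-based density estimate, the averaging over class pairs, and the per-band ranking in `rank_bands_by_jm` are assumptions made for this example; the paper does not state how the J-M distance is maximized over the subset.

```python
import numpy as np

def jm_distance(p, q):
    """Eq. 7 for two discrete class-conditional distributions p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def rank_bands_by_jm(X, y, n_bins=16):
    """Hypothetical stage-two helper: order bands by mean pairwise J-M distance."""
    classes = np.unique(y)
    scores = np.zeros(X.shape[1])
    for b in range(X.shape[1]):
        # Shared bin edges so the class histograms are comparable
        edges = np.histogram_bin_edges(X[:, b], bins=n_bins)
        hists = [np.histogram(X[y == c, b], bins=edges)[0] + 1e-12
                 for c in classes]
        pairs = [jm_distance(hists[i], hists[j])
                 for i in range(len(classes))
                 for j in range(i + 1, len(classes))]
        scores[b] = np.mean(pairs)
    return np.argsort(-scores)  # most separable band first
```

With this definition the distance is 0 for identical distributions and reaches its maximum, the square root of 2, for distributions with disjoint support.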

4.2 SMKB Algorithm

Stochastic methods are widely used to solve complex optimization problems [59]. The goal of our Stochastic Multiple Kernel Boosting (SMKB) method is to learn a classifier using boosting techniques and multiple kernels. As shown in Algorithm 1, our approach maintains a probability distribution D_t over training examples. At each boosting trial t, t = 1, ⋯, T, where T denotes the total number of boosting trials, we learn a kernel classifier f_t (x). The misclassification rate $ε_{t}^{j}$ of the classifier learned with the j-th kernel, which is used to adjust the weights on training examples, is computed over D_t: $ε_{t}^{j} = ε (f_{t}^{j} (x)) = \sum_{i = 1}^{N} D_{t} (i) (f_{t}^{j} (x_{i}) \neq y_{i})$ (8)

In particular, we learn one classifier $f_{t}^{j} (x)$ with each kernel using SVM and take the one with the lowest misclassification rate: $f_{t} (x) = \arg \min_{f_{t}^{j} (x), \, j \in \{1, \dots, M\}} ε (f_{t}^{j} (x))$ (9)

Update the weight D_t+1 (i) as follows: $D_{t + 1} (i) = \frac{D_{t} (i)}{Z_{t}} \times \begin{cases} β_{t} & \text{if } f_{t} (x_{i}) = y_{i}, \\ 1 & \text{otherwise}, \end{cases}$ (10) where β_t = ∊_t/(1 - ∊_t) and Z_t is a normalization factor that makes D_t+1 a distribution.

The final classifier is constructed by a weighted vote of individual classifiers as follows: $f (x) = sign (\sum_{t = 1}^{T} α_{t} f_{t} (x))$ (11)

Algorithm 1 Stochastic MKBoost

Input:

1: training data: S_N = {(x₁, y₁) , ⋯ , (x_N, y_N)}

2: kernel function: κ_j(·, ·): $X$ × $X$ → $ℝ$ , j = 1, ⋯, M

3: ∀ i: initialize D₁ (i) = 1/N

Output: $f (x) = sign (\sum_{t = 1}^{T} α_{t} f_{t} (x))$

4: select a subset S_mi of S_N with f_MI features according to Equation 3

5: select a subset S_jm of S_mi with f_JM features according to Equation 7

6: for t = 1, ⋯, T do

7: sample n examples using distribution D_t

8: for j = 1, ⋯, M do

9: train a weak classifier $f_{t}^{j}$

10: calculate the training error over D_t

11: $ε_{t}^{j} = \sum_{i = 1}^{N} D_{t} (i) (f_{t}^{j} (x_{i}) \neq y_{i})$

12: $α_{t}^{j} = \frac{1}{2} \ln \frac{1 - ε_{t}^{j}}{ε_{t}^{j}}$

13: sort $ε_{t}^{j}$ in ascending order, generate ρ = rand ()

14: combine the first k (= ρ * M) classifiers: f_t (x) = $sign (\sum_{j = 1}^{k} α_{t}^{j} f_{t}^{j} (x))$

15: end for

16: compute the training error over D_t: $ε_{t} = \sum_{i = 1}^{N} D_{t} (i) (f_{t} (x_{i}) \neq y_{i})$

17: choose $α_{t} = \frac{1}{2} \ln \frac{1 - ε_{t}}{ε_{t}}$

18: update D_t+1 (i) ← $\frac{D_{t} (i)}{Z_{t}} \exp (- α_{t} y_{i} f_{t} (x_{i}))$ where $Z_{t} = \sum_{i} D_{t} (i) \exp (- α_{t} y_{i} f_{t} (x_{i}))$ is a normalization constant

19: end for

Unlike AdaBoost, at each boosting trial we no longer simply discard the other M-1 classifiers. We sample the k kernels with the lowest misclassification rate, where k is specified by a randomly generated kernel sampling ratio ρ (0 < ρ ≤ 1) that determines the proportion of kernels to be sampled. Specifically, after sorting $ε_{t}^{j}$ in ascending order, we combine the first k (= ρ * M) classifiers and build the classifier at the t-th boosting trial: $f_{t} (x) = sign (\sum_{j = 1}^{k} α_{t}^{j} f_{t}^{j} (x))$ (12) where the weight $α_{t}^{j}$ is computed from the misclassification rate $ε_{t}^{j}$ : $α_{t}^{j} = \frac{1}{2} \ln \frac{1 - ε_{t}^{j}}{ε_{t}^{j}}$ (13)
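One boosting trial of this scheme can be sketched as follows, given the predictions of the M per-kernel classifiers. Several points are assumptions for illustration only: labels are binary in {-1, +1} (so that the exponential update of Algorithm 1 is well defined), k is obtained by rounding ρM up, the error rates are clamped away from 0 and 1, and the function names are hypothetical.

```python
import math
import numpy as np

def _clip(eps):
    # Assumption: numerical guard so the logarithm in Eq. 13 stays finite
    return np.clip(eps, 1e-12, 1.0 - 1e-12)

def combine_sampled_kernels(preds, y, D, rho):
    """preds: (M, N) array of +/-1 predictions, one row per kernel classifier."""
    preds = np.asarray(preds, dtype=float)
    y = np.asarray(y)
    D = np.asarray(D, dtype=float)
    M = preds.shape[0]
    eps = _clip(np.array([D[p != y].sum() for p in preds]))  # Eq. 8
    alpha = 0.5 * np.log((1.0 - eps) / eps)                  # Eq. 13
    k = max(1, math.ceil(rho * M))        # assumption: round rho * M up
    keep = np.argsort(eps, kind="stable")[:k]  # k lowest-error classifiers
    f_t = np.sign(alpha[keep] @ preds[keep])   # Eq. 12 (0 on exact ties)
    return f_t, keep

def update_distribution(D, y, f_t):
    """Steps 16 to 18 of Algorithm 1 for +/-1 labels."""
    D = np.asarray(D, dtype=float)
    eps_t = _clip(D[f_t != y].sum())
    alpha_t = 0.5 * np.log((1.0 - eps_t) / eps_t)
    D_new = D * np.exp(-alpha_t * y * f_t)
    return D_new / D_new.sum(), alpha_t   # division by Z_t
```

In a full run, ρ would be redrawn at every trial, for example with `numpy.random.default_rng().uniform`, and the returned α_t values would weight the final vote of Eq. 11.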

5 Experimental evaluation

This section evaluates the performance of our SMKB algorithm for hyperspectral image classification.

5.1 Data set description

The image dataset used in our experiments is the widely used scene acquired by the AVIRIS sensor over the Indian Pines region in northwestern Indiana, USA, in 1992. The Indian Pines data set comprises 220 bands in the wavelength range from 0.4 to 2.5 μm and has a spatial size of 145×145 pixels. After removing noisy bands, 200 bands remain. The ground truth has 10249 labeled pixels in 16 land cover classes. Figure 1 shows a false color composite (bands 17, 27, and 50 for RGB) and the ground truth.

Fig.1

Classification maps of the AVIRIS Indian Pines dataset: (a) False color composite (bands 17, 27, and 50 for RGB); (b) Ground truth.

5.2 Experimental environment settings

To evaluate our SMKB algorithm, we compared it with several state-of-the-art competitive algorithms, including the SVM-based single kernel approach (SVM for short), SimpleMKL [6], and AdaBoost with SVM (AdaBoostSVM for short). For the SVM classifiers, we employ polynomial and Gaussian radial basis function kernels. We performed a ten-fold cross-validation procedure using a single SVM to find the optimal SVM parameters σ ∈ {10^-2, …, 10²} and C ∈ {10¹, …, 10⁴}. In all cases, the one-versus-one multiclass scheme implemented in LibSVM [60] was used. SimpleMKL is one of the algorithms used to solve the MKL problem. To implement SimpleMKL, we adopt the SimpleMKL toolbox [6] with the default settings suggested by the toolbox. AdaBoostSVM is an algorithm that applies AdaBoost to improve SVM learning accuracy [51]. For AdaBoostSVM, ten-fold cross-validation is adopted to select the best kernel; other settings are the same as in SMKB. For our SMKB, we follow the typical approach used in traditional MKL and AdaBoost studies in the literature. In particular, 16 base kernels are used initially in the ensemble, i.e., Gaussian kernels using 13 different bandwidth parameters from {2^-6, 2^-5, ⋯, 2⁶} and polynomial kernels of degree 1 to 3 on all features. In the first stage, S_mi consists of 150 features selected according to Equation 3; in the second stage, S_jm consists of 120 features selected according to Equation 7. All experiments were run on a Windows 7 machine with a 2.9 GHz Intel CPU and 16 GB RAM in a MATLAB 8.2 environment.
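The bank of 16 base kernels described above can be sketched as follows. The Gaussian parametrization exp(-‖x - z‖² / (2σ²)), the offset of 1 in the polynomial kernel, and the function names are assumptions for this example; the experiments themselves were run in MATLAB and the exact kernel definitions are not given in the text.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma):
    """Gaussian kernel matrix; the 2*sigma^2 scaling is an assumption."""
    sq = (X ** 2).sum(axis=1)[:, None] + (Z ** 2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def polynomial_kernel(X, Z, degree):
    """Polynomial kernel matrix; the offset of 1 is an assumption."""
    return (X @ Z.T + 1.0) ** degree

def build_kernel_bank():
    """13 Gaussian bandwidths 2^-6, ..., 2^6 plus polynomial degrees 1 to 3."""
    bank = [(f"rbf_2^{p}", lambda X, Z, s=2.0 ** p: gaussian_kernel(X, Z, s))
            for p in range(-6, 7)]
    bank += [(f"poly_{d}", lambda X, Z, d=d: polynomial_kernel(X, Z, d))
             for d in (1, 2, 3)]
    return bank
```

Each entry is a (name, function) pair, so a Gram matrix for kernel j on training data X is obtained with `build_kernel_bank()[j][1](X, X)`.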

5.3 Comparison results

We have analyzed the classification accuracy of our SMKB with the AVIRIS dataset of Indian Pines. Table 1 shows classification results obtained by different methods, when the number of training samples is 15% of 10249 reference samples. Our method has obtained the highest OA = 88.97%. Experiments presented in Table 1 have been repeated 10 times to calculate the mean of OA, AA, and Kappa to evaluate the performance of different methods. The highest scores for each class are highlighted in the boldface font.

Table 1
Classification results (%) of the different methods

Class No.	Land cover class	Samples	SVM	SimpleMKL	AdaBoostSVM	SMKB
1	Alfalfa	46	73.01 ± 1.1	68.38 ± 2.1	72.39 ± 1.1	83.16 ± 2.1
2	Corn-no till	1428	78.17 ± 2.2	78.41 ± 0.7	80.44 ± 0.8	86.41 ± 1.0
3	Corn-min till	830	73.59 ± 3.1	69.76 ± 1.1	77.32 ± 2.7	89.76 ± 0.8
4	Corn	237	65.05 ± 2.5	57.41 ± 2.1	75.65 ± 2.7	74.41 ± 2.1
5	Grass/Pasture	483	91.44 ± 3.2	90.02 ± 0.4	92.54 ± 3.6	94.02 ± 1.3
6	Grass/Trees	730	97.12 ± 2.6	94.04 ± 2.1	96.65 ± 0.1	97.23 ± 1.0
7	Grass/pasture-mowed	28	74.24 ± 2.7	71.47 ± 3.3	77.88 ± 3.1	86.47 ± 3.1
8	Hay-windrowed	478	96.74 ± 1.4	96.63 ± 0.1	98.35 ± 2.1	97.56 ± 1.4
9	Oats	20	52.74 ± 1.3	40.74 ± 1.1	59.12 ± 0.9	78.74 ± 2.9
10	Soybeans-no till	972	75.15 ± 2.7	76.03 ± 0.4	78.80 ± 0.4	87.03 ± 1.3
11	Soybeans-min till	2455	83.87 ± 2.1	81.57 ± 4.0	86.16 ± 2.1	94.57 ± 2.0
12	Soybeans-clean till	593	80.23 ± 3.8	72.56 ± 2.1	83.87 ± 1.3	83.27 ± 4.1
13	Wheat	205	95.97 ± 3.8	94.45 ± 3.7	96.81 ± 1.1	98.45 ± 0.7
14	Woods	1265	94.13 ± 0.9	93.71 ± 4.1	93.34±0.1	94.03 ± 1.8
15	Bldg-Grass-Tree drives	386	65.76 ± 3.1	64.83 ± 2.3	72.07 ± 1.1	71.83 ± 1.1
16	Stone-steel towers	93	94.71 ± 1.5	88.90 ± 3.1	93.38±3.2	94.20 ± 2.6
Kappa			0.8140	0.7837	0.8359	0.8568
OA			83.70	81.09	85.55	88.97
AA			82.36	80.45	82.72	87.12
Time			113s	564s	429s	307s

This table shows that our method yields better overall results (although with 15% of training data, the difference is not significant). It can also be seen that our method performs better especially in classes with a small number of training samples, such as Alfalfa, Grass/pasture-mowed, and Oats. Classification maps obtained using these four methods are shown in Fig. 2.
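For reference, the three summary measures reported in Table 1 follow from a confusion matrix by their standard definitions: overall accuracy (OA), average per-class accuracy (AA), and Cohen's kappa. The sketch below is illustrative only; the function name is hypothetical and it is not the evaluation code used in the experiments.

```python
import numpy as np

def accuracy_metrics(y_true, y_pred):
    """Return (OA, AA, kappa) computed from true and predicted labels."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    idx = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)))
    for t, p in zip(y_true, y_pred):
        cm[idx[t], idx[p]] += 1           # rows: true class, columns: predicted
    n = cm.sum()
    oa = np.trace(cm) / n                 # overall accuracy
    row = cm.sum(axis=1)
    present = row > 0                     # skip classes with no true samples
    aa = (np.diag(cm)[present] / row[present]).mean()  # average accuracy
    pe = (cm.sum(axis=0) * row).sum() / n ** 2         # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa
```

AA weights every class equally, which is why it is sensitive to small classes such as Oats, whereas OA is dominated by the large classes.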

Fig.2

Classification maps with respect to the AVIRIS Indian Pines dataset, where (a) for SVM, (b) for SimpleMKL, (c) for AdaBoostSVM, and (d) for our SMKB.

5.4 Evaluation of scales selection

Figure 3 shows the evolution of the kappa coefficient as a function of the percentage of training samples used for the four evaluated approaches. In all cases, experiments are performed with 5% to 50% of all labeled samples in each class. The points in this figure indicate the respective mean kappa (over 10 runs) of these methods.

Fig.3

Relationship between the selection of scales and kappa coefficients.

Figure 3 illustrates that our method is significantly superior to SVM and SimpleMKL, and close to AdaBoostSVM, on the Indian Pines dataset. As Fig. 3 shows, the classification performance of each method generally increases as the number of training samples m increases. This is because the complexity of data construction increases with m. However, increasing the proportion of training samples beyond 20% has little impact on the kappa coefficient. The advantage of our SMKB is most obvious at 15%, 20%, and 30% of training samples, with kappa gains of 0.73%, 0.62%, and 0.58%, respectively, over AdaBoostSVM. This advantage becomes less obvious at 10% and 50% of training samples. More interestingly, AdaBoostSVM and SVM have the closest kappa coefficients at 30% of training samples.

The kappa statistics indicate that SimpleMKL obtains the lowest performance in each experiment. This result is similar to what Tuia et al. reported in [7]: MKL is not expected to always improve on a standard SVM. This limitation can motivate the use of semi-supervised classifiers that exploit unlabeled pixels and spatial features beyond spectral information. There are two reasons why SimpleMKL does not reflect the advantage of multiscale kernel combination: (i) insufficient training samples for constructing kernel matrices cause SimpleMKL to overfit [7], and (ii) automatically choosing an optimal candidate multiscale kernel for SimpleMKL is difficult.

5.5 Evaluation of kernel sampling ratio ρ

In our experiments, we randomly select approximately 15% of training samples in each class. We notice that the kernel sampling ratio ρ, which is randomly generated in each iteration, could affect the performance (both accuracy and efficiency) of our SMKB algorithm. To examine the impact of ρ, we conduct a set of experiments that evaluate its effect on both accuracy and efficiency in classification tasks.

Generally, the sampling ratio ρ determines the proportion of kernels sampled from the whole collection of kernels at each boosting trial. The accuracy and efficiency obtained when varying the kernel sampling ratio ρ from 0.1 to 1.0 are shown in Fig. 4. From these experimental results, we found that our SMKB algorithm with a larger kernel sampling ratio usually produces better classification accuracy. The gain from increasing the ratio is especially evident when the ratio is small.

Fig.4

Evaluation of SMKB kappa coefficients with respect to the sampling ratio ρ.

Despite the improvement obtained by increasing the kernel sampling ratio, we found that the classification accuracy tends to saturate when the value is large enough (e.g., larger than 0.6). When the sampling ratio is too small, the base kernel classifiers may suffer from insufficient training examples at boosting trials. However, an overly large sampling ratio may sample too many training examples from the data set, which may be redundant for building base classifiers.

Besides its impact on accuracy, the kernel sampling ratio also affects the efficiency of our SMKB algorithm. We evaluated the computing time of the SMKB algorithm for various values of the kernel sampling ratio ρ. The relationship between ρ and the time cost is shown in Fig. 5. We can see from this figure that increasing the kernel sampling ratio also increases the time cost of our SMKB algorithm. In particular, the runtime increases nonlinearly as ρ increases from 0.15. In practice, a tradeoff between accuracy and efficiency should be made by choosing an appropriate kernel sampling ratio ρ.

Fig.5

Evaluation of our SMKB learning time cost with respect to kernel sampling ratio ρ.

5.6 Evaluation of the number of kernels

Experiments are conducted to examine the impact of the number of base kernels on the kappa coefficients of our method, AdaBoostSVM, and SimpleMKL (SVM is omitted because it uses a single kernel). In our previous experiments, we fixed the number of base kernels to 16. In this set of experiments, we vary the number of kernels from 8 to 40. Figure 6 shows the impact of the number of kernels on the classification performance.

Fig.6

Kappa statistics with various numbers of kernels.

As shown in Fig. 6, our method performs the best and clearly outperforms the SimpleMKL method. In terms of the curves of kappa coefficients, increasing the total number of kernels generally improves accuracy. This observation is particularly evident when the total number of kernels is relatively small (e.g., M < 16). The kappa coefficient usually starts to decrease when the number of kernels is large (e.g., M > 16).

Computational cost is a critical bottleneck of MKL methods in hyperspectral classification. Because computational time is an important indicator of the applicability of MKL, we measured it to compare the efficiency of the different methods; the results are shown in Fig. 7. The settings in this experiment were the same as in the previous experiments. The figure shows that the computational time of our SMKB is close to that of AdaBoostSVM and much less than that of SimpleMKL. In particular, the time cost of our SMKB algorithm is only about half of that of SimpleMKL and about 80% of that of AdaBoostSVM.

Fig.7

Time costs for various numbers of kernels.

For our SMKB, the recorded runtime includes the time spent on sampling the training and testing sets, constructing the multiple kernels, and SVM training and testing. Figure 7 shows that our SMKB algorithm with 28 kernels is still much more efficient than the SimpleMKL algorithm with 16 kernels. The time curve of SimpleMKL rises steeply with the number of kernels, while that of SMKB rises slowly. In short, compared with the other algorithms, SMKB remains satisfactory in terms of computational efficiency.
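A per-stage runtime of this kind can be recorded with a small timing helper, sketched below. This is only an illustration of how the listed stages might be accumulated into a total: the helper `timed`, the stage names, and the `placeholder_work` function are hypothetical stand-ins and do not reproduce the actual experimental code.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per stage.
stage_seconds = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the elapsed time of the enclosed block to stage_seconds[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start


def placeholder_work(n):
    """Stand-in for a real stage (hypothetical; replace with actual code)."""
    return sum(i * i for i in range(n))


for stage in ("sampling", "kernel_construction", "svm_training", "svm_testing"):
    with timed(stage):
        placeholder_work(10_000)

# The reported runtime is the sum over all recorded stages.
total_runtime = sum(stage_seconds.values())
print({name: round(sec, 6) for name, sec in stage_seconds.items()})
print("total:", round(total_runtime, 6))
```

Keeping the stages separate makes it possible to see which part of the pipeline dominates as the number of kernels grows, rather than observing only the total.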

6 Conclusion

This paper proposed an effective stochastic multiple kernel AdaBoost framework, called Stochastic Multiple Kernel Boosting (SMKB), for high-dimensional hyperspectral remote sensing classification, based on the notion of a multiclass kernel ensemble. Because it randomly employs a subset of kernels to construct classifiers, SMKB is more efficient than the AdaBoostSVM approach (about 80% of its time cost in our experiments) while providing better classification accuracy.

SMKB was evaluated on the Indian Pines data set with respect to several factors: the number of training samples, the ensemble size, and the kernel sampling ratio. Experimental results show that SMKB performs well in terms of accuracy compared with traditional algorithms. These results indicate that SMKB is a promising approach for generating classifier ensembles for hyperspectral remote sensing.
