Abstract
A hyperspectral image (HSI) is usually composed of hundreds of wavelength bands, which not only increase the size of the HSI rapidly but also impose various obstacles to classifying objects accurately. Moreover, traditional machine learning schemes utilize only the spectral features for HSI classification and therefore neglect the spatial features, which have a significant impact on classification performance. To address these issues, in this paper we employ principal component analysis (PCA), the baseline feature extraction method, and a carefully designed stacked autoencoder, a deep learning-based feature extraction approach, to reduce the high dimensionality of the HSI, and then propose a novel lightweight 3D-2D convolutional neural network (CNN) framework that concurrently exploits both spatial and spectral features from the dimensionality-reduced HSI for classification. In particular, PCA and the stacked autoencoder are applied to reduce the high dimensionality of the original HSI, and the proposed 3D-2D CNN then combines 3D and 2D convolution operations to extract subtle spatial and spectral features for efficient classification. We tune the proposed 3D-2D CNN architecture, perform extensive experiments on three benchmark HSI datasets, and compare our approach with state-of-the-art classical and deep learning methods. Experimental results show that we achieve overall accuracies of 99.73%, 99.90%, and 99.32% on the Indian Pines, Pavia University, and Kennedy Space Center datasets, respectively, which outperform the classical machine learning and independent 2D and 3D CNN-based state-of-the-art methods.
Keywords
Introduction
Hyperspectral imaging is one of the most pivotal fields in satellite remote sensing, as it allows different ground objects to be identified and their structures analyzed in great detail from far away. In particular, the hyperspectral image (HSI) is an emerging remote sensing data source. Owing to modern advances in hyperspectral sensor technology, hyperspectral cameras can now capture hundreds of high-resolution spectral bands over the same spatial area. These high-resolution spectral bands preserve the crucial aspects of the spectrum and make it possible to differentiate between the materials and objects on the ground surface. For instance, the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor captures 224 contiguous spectral bands with wavelengths from 400 to 2500 nanometers, which range from the visible to the infrared region of the electromagnetic spectrum [1–4]. Owing to this wealth of spectral bands, HSI is used extensively in vegetation analysis, precision agriculture, deforestation analysis, identification and analysis of soil and rock compounds, and surveillance [5–9]. However, there are some inherent challenges in effective HSI classification that need to be resolved to take full advantage of the imaging system. First, neighboring spectral bands are highly correlated and therefore carry a large amount of redundant information [10–13]. Second, not all spectral bands are equally important for every application, which implies that some bands are more informative than others for a specific application [14, 15]. Third, an insufficient number of training samples often creates an imbalance between the size of the training set and the number of spectral bands, which causes the Hughes phenomenon and reduces the classification accuracy [16]. Fourth, the large number of spectral bands leads to high dimensionality, which makes the computational cost much higher [17]. Last but most importantly, combining the spectral and spatial features for more accurate classification of the HSI is a critical task [18–21].
To tackle the correlated bands and high dimensionality, feature selection and feature extraction (and sometimes their combination) are most commonly considered [1, 22–27]. Principal component analysis (PCA) is the most widely used feature extraction technique in HSI; it extracts orthogonal features and lessens the high dimensionality. Datta [28, 29] found that PCA reduces the dimensionality of the HSI by 85%. Besides, there are several other versions of PCA, such as kernel PCA (KPCA) [30], iterative PCA (IPCA) [31], segmented PCA (SPCA) [32], folded PCA (FPCA) [33], and many more. Though these methods are memory-efficient, they are more time-consuming than the baseline PCA [34]. In addition, PCA also performs well in combination with deep learning (DL) methods [22, 23]. As such, we utilize PCA as one of the baseline feature extraction approaches in this paper. Along with that, we also carefully design a DL-based feature extraction method called the stacked autoencoder [35]. A stacked autoencoder learns a latent space representation of the original input and compresses high-dimensional inputs with minimal loss of information; these strengths motivate us to design a stacked autoencoder for HSI.
Most conventional machine learning approaches exploit only spectral features and therefore ignore the spatial features for HSI classification. Owing to the recent breakthrough of DL models in image classification, researchers are leaning towards using DL models to analyze HSIs with more precision [36–38]. Chen [39] pioneered the use of a deep belief network that incorporates both spectral and spatial features to provide more precise classification. After that, different types of DL architectures were used in HSI classification, such as the convolutional neural network (CNN) [40], the recurrent neural network (RNN) [41], the long short-term memory (LSTM) network [42], and a combination of CNN and RNN [43]. Among these, CNN has emerged as one of the most prominent architectures because of its remarkable success in image classification [44] and its use of shared local connections, which leads to fewer training parameters. Makantasis [23] used a hybrid approach that incorporates PCA and CNN for HSI classification. Though it produces a significant result, it uses a 2D CNN structure, which struggles to distinguish regions with similar texture. The authors of [45] used a deep CNN in combination with the random forest method, where the same deep CNN architecture was used separately for each of the feature subspaces extracted by the random forest. However, this combination of CNN and random forest results in an enormous number of training parameters for such a limited number of training samples and may lead to overfitting. Zhang [46] used a dual-channel convolution layer, with one channel consisting of a 1D CNN and the other of a 2D CNN, to provide good-quality hierarchical features and improve accuracy. However, this pair of 1D and 2D CNNs also increases the network size enormously, as there are dual channels in a single network, and the features extracted by the 1D CNN are not quite up to the mark because they only consider the spectral bands. Recently, 3D CNNs [47, 48] have been introduced in the HSI classification field because of their ability to extract deep spectral and spatial features. Nevertheless, because of their structure, 3D networks cannot extract spatial features as fine-tuned as those of 2D networks. A dual graph CNN (GCN) [49] has been proposed in which the first GCN extracts features from hyperspectral images and the second GCN uses label distribution learning that allows it to work with a limited number of training samples. However, a limited number of training samples sometimes leads to underfitting. Another variant of CNN, known as the multi-scale 2D CNN, was presented in [22]; it extracts multiple higher-quality spatial contexts, but the number of training parameters increases exponentially because multiple spatial contexts are taken into the model simultaneously. Training a model with such an enormous number of parameters on a limited number of training samples demands a large amount of time and often overfits the model [50], which decreases the generalization performance.
From the above discussion, it can be seen that although different kinds of CNNs have been used extensively for HSI classification, 2D and 3D CNNs have been used separately. The 2D CNN has achieved remarkable accuracy, but its convolutional kernels move over only two dimensions and not over the spectral dimension, leading to mediocre spectral features. On the other hand, although the convolutional kernels of the 3D CNN move across the spectral dimension, it does not provide inter-spectral learning as good as that of the 2D CNN. These observations motivate us to merge a 3D CNN and a 2D deep CNN to exploit the advantages of both in a single architecture. As such, in this paper, we propose a novel 3D-2D CNN architecture for HSI classification. The proposed hybrid 3D-2D CNN exploits both the spatial and spectral features of the HSI more precisely, operating on the features extracted by PCA and the stacked autoencoder. The number of trainable parameters in our proposed architecture is far smaller than in other state-of-the-art CNN approaches, which makes it less time-consuming and curbs its tendency to overfit. We consider three different benchmark HSI datasets for extensive performance evaluation of our proposed framework. We have achieved accuracies of 99.73%, 99.90%, and 99.32% on the Indian Pines, Pavia University, and Kennedy Space Center (KSC) datasets, respectively, which are superior to those of the investigated state-of-the-art independent 2D and 3D approaches. The main reason behind this result is that our novel hybrid 3D-2D CNN makes proper use of spatial and spectral features at the same time.
To summarize, the main contributions of this paper are as follows: (i) modelling and tuning a novel 3D-2D deep CNN architecture for spatial-spectral classification; (ii) performance investigation of the proposed 3D-2D CNN using the PCA features; (iii) performance analysis of the proposed 3D-2D CNN using the designed stacked autoencoder features; and (iv) extensive and detailed experiments on three benchmark datasets, comparing with classical machine learning and independent 2D and 3D CNN-based state-of-the-art approaches.
The rest of this paper is structured as follows. Section 2 delineates the dimensionality reduction and our novel 3D-2D deep CNN architecture for spectral-spatial classification. The experiments and results are described and analyzed in Section 3, while Section 4 concludes the paper, summarizes the observations, and points out some potential future research directions.
Proposed approach
Approach overview
Proper classification of HSIs is challenging due to the large number of correlated spectral bands. First, the HSI and the corresponding ground truth are collected. Then, the high dimensionality of the HSI is reduced by PCA and by the stacked autoencoder separately. After that, 3D spatial patches are created from the dimensionality-reduced HSI. Next, the patches are split into training and test samples. Finally, the training samples are used to train the proposed 3D-2D CNN architecture, which exploits both the spectral and spatial features in great detail, and the test samples are used to evaluate the trained model. Figure 1 depicts the workflow of our proposed approach.

Overview of the proposed approach for performing the HSI classification based on the spectral-spatial features.
Principal component analysis
PCA is an unsupervised feature extraction technique that is mainly utilized to reduce correlated spectral bands, which convey almost the same spectral information about a particular region or object [8, 51]. PCA reduces the original dataset in such a way that the structure of the original data is preserved [52]. PCA explores the statistical properties of the hyperspectral bands and reduces the dimensionality through eigenvalue decomposition of the covariance matrix formed from the spectral bands of the original HSI. First, the HSI hypercube is converted into a two-dimensional matrix in which each pixel is represented by its vector of band values; the covariance matrix of the bands is then eigen-decomposed and the pixels are projected onto the leading eigenvectors.
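The eigenvalue-decomposition procedure described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation used in the experiments: the function name `pca_reduce`, the pixel-per-row matrix layout, and the synthetic cube are all hypothetical.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Reduce an (H, W, B) hyperspectral cube to (H, W, n_components) with PCA."""
    h, w, b = cube.shape
    X = cube.reshape(-1, b).astype(np.float64)   # assumed layout: one row per pixel, one column per band
    X -= X.mean(axis=0)                          # centre each band
    cov = np.cov(X, rowvar=False)                # (B, B) covariance matrix of the bands
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues of a symmetric matrix
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]               # leading eigenvectors as columns
    explained = eigvals[order].sum() / eigvals.sum()
    reduced = (X @ components).reshape(h, w, n_components)
    return reduced, explained

# Synthetic 8 x 8 cube with 20 bands generated from only 3 latent sources.
rng = np.random.default_rng(0)
cube = rng.normal(size=(8, 8, 3)) @ rng.normal(size=(3, 20))
reduced, explained = pca_reduce(cube, 3)
```

Because the synthetic bands are linear mixtures of three sources, three components capture essentially all of the variance; real HSI bands are only approximately redundant, so the retained variance has to be checked per dataset.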
Although PCA is computationally inexpensive, it becomes inefficient at extracting proper features when the dataset is non-linearly distributed. Alternatively, an autoencoder is an unsupervised feature extraction technique that mainly uses three layers (input, hidden, and output) to reconstruct the original input data at the output layer via the reduced intermediate layer [53]. The main purpose of an autoencoder is to learn a latent space representation of the original data. The hidden layer of the autoencoder preserves the information about the input data in a much lower dimensionality. As the original data can be reconstructed from this lower-dimensional representation, the autoencoder provides an efficient and accurate way of reducing the dimensionality of high-dimensional data [54].
The stacked autoencoder is a variant of the autoencoder in which, instead of a single hidden layer, multiple hidden layers are used between the input and output layers. This leads to more precise and accurate feature extraction through the abstraction of the hidden layers. The stacked autoencoder is used in HSI analysis to reduce the correlation and noise in hyperspectral images and thus provide more useful features for HSI classification [55]. The stacked autoencoder is trained through the back-propagation algorithm to extract the required features. We describe the working principle of the stacked autoencoder for reducing the dimensionality of the HSI as follows. First, the hyperspectral cube is converted into a 2D data matrix in which each pixel is represented by its vector of band values, and this matrix is fed to the input layer of the stacked autoencoder.

Graphical representation of the stacked autoencoder designed in the experiment.
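The middle layer of the stacked autoencoder, also known as the bottleneck layer, is mainly responsible for reducing the dimensionality of the original data. Once the training process is completed and the error between the original signal and its reconstruction has been minimized, the activations of the bottleneck layer are taken as the dimensionality-reduced features of the HSI.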
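To make the role of the bottleneck layer concrete, the following is a small NumPy sketch of a stacked autoencoder trained by back-propagation on the reconstruction error. The layer widths (20-12-5-12-20), the tanh activations, the learning rate, and the synthetic data are illustrative assumptions and not the configuration designed in this paper.

```python
import numpy as np

def init_layers(sizes, rng):
    """One (weights, bias) pair per consecutive pair of layer sizes."""
    return [(rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(X, layers):
    """Activations of every layer; hidden layers use tanh, the output layer is linear."""
    acts = [X]
    for i, (W, b) in enumerate(layers):
        z = acts[-1] @ W + b
        acts.append(z if i == len(layers) - 1 else np.tanh(z))
    return acts

def train_step(X, layers, lr):
    """One full-batch gradient-descent step on the mean squared reconstruction error."""
    acts = forward(X, layers)
    err = acts[-1] - X
    loss = float(np.mean(err ** 2))              # loss before this update
    delta = 2.0 * err / err.size                 # gradient w.r.t. the linear output
    for i in reversed(range(len(layers))):
        W, b = layers[i]
        grad_W, grad_b = acts[i].T @ delta, delta.sum(axis=0)
        if i > 0:                                # propagate through the tanh that produced acts[i]
            delta = (delta @ W.T) * (1.0 - acts[i] ** 2)
        layers[i] = (W - lr * grad_W, b - lr * grad_b)
    return loss

def encode(X, layers):
    """Bottleneck activations, i.e. the reduced features."""
    return forward(X, layers)[len(layers) // 2]

# Synthetic "pixels x bands" matrix: 200 pixels, 20 correlated bands.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 20))
X = (X - X.mean(axis=0)) / X.std(axis=0)
layers = init_layers([20, 12, 5, 12, 20], rng)   # hypothetical widths; 5 is the bottleneck
losses = [train_step(X, layers, lr=0.01) for _ in range(500)]
features = encode(X, layers)                     # shape: (pixels, bottleneck nodes)
```

After training, only the encoder half is needed: `encode` maps each pixel spectrum to the bottleneck features that would then be reshaped back into an image and cut into patches.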
Among the deep neural network architectures, CNN has emerged as the most promising architecture in the field of image classification [44]. This is mainly because of its significant performance in the ImageNet challenge [56] and the fact that it is inspired by the human visual system. A CNN mainly has three parts. The first is the convolution layer, where the convolution operation is performed between the input image and different kernels to produce feature maps [57]. This is the main component that leads to deep, abstract, hierarchical features. The second is the pooling layer, which provides translation invariance and reduces the dimensionality of the feature maps. Finally, there is a fully connected layer that summarizes which features are present in the image [58]. In a 2D CNN, the kernel moves in only two directions over the input image, i.e., along its width and height, producing superior spatial features in the case of HSI. This also provides the opportunity for cross-channel learning in HSIs. The convolutional operation of 2D CNN can be denoted by the following equation.
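In a commonly used notation (the symbol names here are assumptions and may differ from the notation of the original equation), the activation at spatial position (x, y) of the j-th feature map in the i-th layer of a 2D CNN can be written as:

```latex
v_{i,j}^{x,y} = \phi\!\left( b_{i,j}
  + \sum_{\tau=1}^{d_{i-1}} \sum_{\sigma=-\delta}^{\delta} \sum_{\rho=-\gamma}^{\gamma}
    w_{i,j,\tau}^{\sigma,\rho}\; v_{i-1,\tau}^{x+\sigma,\,y+\rho} \right)
```

Here \phi is the activation function, b_{i,j} is the bias of the j-th feature map in layer i, d_{i-1} is the number of feature maps in layer i-1, w_{i,j,\tau} is the kernel connecting the \tau-th input map to the j-th output map, and the kernel spans 2\delta+1 positions along x and 2\gamma+1 positions along y. A 3D convolution follows the same pattern with one additional sum over an offset along the spectral axis.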
We propose a hybrid 3D-2D CNN to exploit the advantages of both the 3D CNN and the 2D CNN. To make the 3D hyperspectral cube compatible with the CNN input, we create several small 3D spatial patches of dimension S × S × N, where S represents the size of the spatial context and N represents the number of spectral bands. The extraction process is illustrated in Fig. 3. The ground truth for a specific patch is the label of its center pixel. Note that the spatial size S could be 3, 5, 7, 9, 11, 13, etc., and its appropriate value must be tuned. If the size is too small, we might not have enough spatial context; on the other hand, if the size is too big, spatial noise might be included when classifying the HSI. We perform an exhaustive search over the available values of the spatial context to tune it. After creating the 3D spatial patches, we pass the patches through four 3D convolutional layers to acquire finer spectral features along the depth dimension. The number of layers, the number of filters used in each layer, and the size of the filters are optimized through a hyperparameter search to extract higher-quality spatial and spectral features. With each successive 3D convolutional layer, we decrease the depth dimension of the filter in order to extract more abstract features. There are mainly two ways of combining 3D and 2D CNNs. The first is to perform the 3D convolutional operations and then the 2D convolutional operations. The second is to perform the 2D convolutional operations first, reshape the output feature map, and then perform the 3D convolutional operations. However, in the second way, we lose some information about the third dimension of the feature maps, which leads to poorer spectral and spatial features. As such, we use 3D convolutional operations first and then 2D convolutional operations.

Extraction of 3D spatial patches from dimensionality reduced HSI.
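The patch extraction described above can be sketched as follows. This is a minimal NumPy illustration; the function name `extract_patches`, the zero padding at the image borders, and the convention that label 0 marks unlabelled pixels are assumptions, not details stated in the text.

```python
import numpy as np

def extract_patches(cube, labels, s):
    """Return one s x s x N patch per labelled pixel, with the label of its centre pixel."""
    assert s % 2 == 1, "spatial size must be odd so that a centre pixel exists"
    m = s // 2
    # Assumption: borders are zero-padded so that edge pixels also get full patches.
    padded = np.pad(cube, ((m, m), (m, m), (0, 0)), mode="constant")
    rows, cols = np.nonzero(labels)              # assumption: label 0 means "no ground truth"
    patches = np.stack([padded[r:r + s, c:c + s, :] for r, c in zip(rows, cols)])
    targets = labels[rows, cols] - 1             # shift class ids so they start at 0
    return patches, targets

# Toy reduced HSI: 5 x 5 pixels, N = 4 components, two labelled pixels.
cube = np.arange(5 * 5 * 4, dtype=float).reshape(5, 5, 4)
labels = np.zeros((5, 5), dtype=int)
labels[0, 0], labels[2, 3] = 2, 1
patches, targets = extract_patches(cube, labels, s=3)
```

Each patch would then be given a trailing channel axis of size one before being passed to the first 3D convolutional layer.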
Afterwards, we reshape the 3D feature maps produced by the 3D convolutional layers and pass them through a 2D convolutional layer. This combination leads to more accurate discrimination and enhancement of the spatial features than using only 3D convolutional layers. The 2D convolutional layer also provides a way for cross-channel learning that helps tune the spatial features across the spectral bands. In each convolutional layer, the ReLU activation function is used to introduce non-linearity into the extracted features. The number of 2D convolutional layers, the number of kernels in each layer, and the filter size are determined through an extensive grid search based on previously known 2D networks. Then, we flatten the 2D feature maps produced by the 2D convolutional layer into a one-dimensional tensor containing the spectral and spatial features extracted by the 3D-2D hybrid architecture. Finally, a multilayer perceptron (MLP) network with two hidden layers is used to perform the HSI classification based on the spectral and spatial features extracted by the hybrid 3D-2D CNN. We also use ReLU activation functions in the hidden layers of the MLP. The number of nodes in the output layer depends on the number of classes in each hyperspectral image. We use the Softmax activation function in the output layer because each HSI in our experiments contains more than two ground classes; Softmax provides a multinomial probability distribution, which gives a more intuitive indication of the class to which a sample belongs. Figure 4 illustrates a detailed design of our proposed hybrid 3D-2D architecture, and Table 1 lists the number and sizes of the filters used in each convolutional layer.

Structure of the proposed hybrid 3D-2D CNN architecture, where S and N respectively denote the spatial context and number of spectral bands in the dimensionality reduced image.
Description of filters used in each convolutional layer
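To illustrate how an S × S × N patch flows through 3D convolutions, the reshape, and a 2D convolution, the following pure-Python sketch traces the tensor shape and counts the convolutional parameters. The kernel sizes and filter counts in the example are hypothetical placeholders chosen only so that the shapes fit; they are not the values listed in Table 1, and unpadded, stride-1 convolutions are assumed.

```python
def conv_out(size, kernel):
    """Output length of an unpadded, stride-1 convolution along one axis."""
    return size - kernel + 1

def conv_params(kernel_volume, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel_volume * in_channels + 1) * filters

def trace_hybrid(s, n, conv3d, conv2d):
    """conv3d: list of (filters, (kh, kw, kd)); conv2d: (filters, (kh, kw))."""
    h, w, d, c, params = s, s, n, 1, 0           # input patch: s x s x n with one channel
    for filters, (kh, kw, kd) in conv3d:
        params += conv_params(kh * kw * kd, c, filters)
        h, w, d, c = conv_out(h, kh), conv_out(w, kw), conv_out(d, kd), filters
    channels_2d = d * c                          # reshape: fold the depth axis into the channels
    filters, (kh, kw) = conv2d
    params += conv_params(kh * kw, channels_2d, filters)
    h, w = conv_out(h, kh), conv_out(w, kw)
    return (h, w, filters), h * w * filters, params

# Hypothetical layer configuration, for illustration only.
conv3d = [(8, (3, 3, 7)), (16, (3, 3, 5)), (32, (1, 1, 3))]
conv2d = (64, (3, 3))
shape_30, flat_30, params_30 = trace_hybrid(7, 30, conv3d, conv2d)
shape_100, flat_100, params_100 = trace_hybrid(7, 100, conv3d, conv2d)
```

Under these assumptions the flattened feature length is the same for both inputs, but the parameter count grows with N because the depth remaining after the 3D layers is folded into the input channels of the 2D layer. This is one reason why reducing the number of spectral components before the network keeps it lightweight.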
Experimental setting
We consider three benchmark HSI datasets, i.e., Indian Pines, Pavia University, and Kennedy Space Center (KSC), for experimental validation of our proposed spectral-spatial classification framework against state-of-the-art DL-based approaches. We first perform standalone experiments using our 3D-2D CNN architecture preceded by the PCA and stacked autoencoder-based feature reduction methods on all three datasets. We denote the pair of PCA and our proposed 3D-2D CNN by PCA-Hybrid 3D-2D CNN and the pair of stacked autoencoder and our 3D-2D CNN by SAE-Hybrid 3D-2D CNN. In this way, we first fine-tune the hyperparameters associated with our proposed framework and then conduct a comparative assessment of the 3D-2D CNN against the baseline independent 2D and 3D CNN-based approaches. We also include some classical machine learning approaches, such as the Support Vector Machine (SVM), in the comparative study. We consider average accuracy (AA), overall accuracy (OA), and Cohen's Kappa score as the evaluation metrics. All of these metrics are computed from the confusion matrix of the test dataset and can be used to evaluate the performance on any multiclass problem. AA is calculated by averaging the individual accuracies of the classes, and OA denotes the fraction of correctly predicted samples among all predicted samples. AA and OA act as complementary metrics, as together they provide a clear picture of which classes are more difficult to predict in a multiclass environment. Cohen's Kappa indicates the difference between the overall accuracy obtained by a classification model and the overall accuracy achieved by random guessing; in other words, it represents the gap between the actual estimation and the chance estimation. There is a positive correlation between the Kappa score and the efficacy of the model, i.e., a high Kappa score indicates a better classification model. In the case of multiclass classification, where receiver operating characteristic (ROC) curves become difficult to interpret, the Kappa score can be used as a good alternative. Now, OA and Kappa score can be determined as follows.
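As a worked illustration, the three metrics can be computed from a confusion matrix as sketched below, using the standard definitions OA = (sum of diagonal entries) / (total samples), AA = mean of the per-class accuracies, and Kappa = (p_o - p_e) / (1 - p_e), where p_o equals OA and p_e is the agreement expected by chance. The function name and the example matrix are hypothetical, and every class is assumed to have at least one test sample.

```python
def classification_scores(cm):
    """cm[i][j] = number of test samples of true class i predicted as class j."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    diag = [cm[i][i] for i in range(k)]
    row_tot = [sum(cm[i]) for i in range(k)]                         # samples per true class
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]    # samples per predicted class
    oa = sum(diag) / n                                               # observed agreement p_o
    aa = sum(d / r for d, r in zip(diag, row_tot)) / k               # mean per-class accuracy
    pe = sum(r * c for r, c in zip(row_tot, col_tot)) / (n * n)      # chance agreement p_e
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Hypothetical 3-class confusion matrix (not taken from the experiments).
cm = [[50, 0, 0],
      [5, 40, 5],
      [0, 10, 90]]
oa, aa, kappa = classification_scores(cm)
```

For this matrix OA and AA are both 0.9, while the Kappa score is lower because part of the agreement would be expected by chance given the class proportions.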
For the implementation, we use Google Colaboratory as the online computing platform, which provides GPU support for training DL models; it generally offers 13 GB of RAM and 11-16 GB of graphics memory. We build our proposed hybrid 3D-2D architecture using Keras, a Python framework for creating and deploying DL architectures that runs on top of TensorFlow, another framework developed by Google.
Indian Pines dataset
The Indian Pines dataset was acquired by the AVIRIS sensor over the Indian Pines test site in north-western Indiana [59]. This HSI consists of 224 contiguous spectral bands with wavelengths from 400 to 2500 nanometers and has a size of 145 × 145 pixels with a spatial resolution of 20 meters per pixel. The ground truth for this dataset mainly consists of agriculture, forests, and natural perennial vegetation. There are a total of 16 different classes. We work with 200 spectral bands after excluding the water absorption bands. A total of nine classes are considered for this experiment, omitting the classes with very few training and testing samples. Figure 5 depicts a sample band image and the ground truth reference. The numbers of training and testing samples for our experiment are provided in Table 2.

Indian Pines HSI dataset. (a) Sample color image. (b) Ground truth map.
Number of training and testing samples for Indian Pines HSI
The Pavia University dataset was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) over Pavia, northern Italy, during a flight campaign. This dataset comprises 102 spectral bands covering wavelengths from 430 to 960 nanometers. The size of the image is 610 × 610 pixels with a spatial resolution of 1.3 meters per pixel, but some of the samples contain no information. These samples are therefore discarded before further analysis, resulting in a dataset with a size of 610 × 340. There are a total of 9 different classes representing different kinds of objects and vegetation in the ground truth. From the hyperspectral signatures of the different classes, it is evident that some of the classes have similar spectral signatures. That is why it is essential to introduce spatial information when performing classification. The original Pavia University dataset was provided by Prof. Paolo Gamba of Pavia University. Figure 6 portrays the image and ground truth for this dataset. The numbers of training and testing samples used in our experiment are given in Table 3.

Pavia University Dataset. (a) Sample image. (b) Ground truth map.
Number of training and testing samples for Pavia University HSI
The KSC dataset was also collected by the AVIRIS sensor. This HSI was captured by NASA over the KSC area in Florida. The image was acquired in 224 contiguous spectral bands, each of which is 10 nanometers wide. After removing the water absorption bands and the bands with a low signal-to-noise ratio, the total number of spectral bands is 176. The size of the image is 512 × 614 pixels with a spatial resolution of 18 meters per pixel. There are a total of 13 different classes representing various kinds of land cover in the swamp and dry land of that area. Some of the classes have mixed hyperspectral signatures that are very hard to distinguish in low dimensions. For this reason, discerning between land covers in this image using only the spectral signature is very difficult. Figure 7 provides a representation of the image and the ground truth reference. The distribution of training and testing samples for our experiment is provided in Table 4.

KSC HSI dataset. (a) Sample image. (b) Ground truth map.
Number of training and testing samples for KSC dataset
Tuning our proposed hybrid 3D-2D architecture is one of the most crucial stages for accurate and precise classification. First, we need to fix the numbers of 3D and 2D convolutional layers in our deep architecture. For this, we start with one 3D convolutional layer and one 2D convolutional layer and then increase the numbers of both 3D and 2D convolutional layers. We provide the results in Fig. 8, from which it can be seen that we achieve the highest accuracy with three 3D convolutional layers and one 2D convolutional layer in the case of the Indian Pines dataset. We apply a similar tuning strategy to the other datasets and find that the same combination of three 3D convolutional layers and one 2D convolutional layer always yields the best classification result.

Graphical representation of accuracy vs number of 3D and 2D convolutional layers in the proposed hybrid 3D-2D CNN using S=7.
Usually, the number of filters gradually increases and the size of the filters decreases in deeper convolutional layers. This mainly happens because of the hierarchy of the features in the feature maps. In the first few layers, the number of features is comparatively low but their size is moderate. As the layers go deeper, the number of features rises and the size of the features declines. In this architecture, we do not use any pooling layer because the spatial patches are already small and there is no scale variation within a spatial patch. The ReLU activation function is used in each convolutional layer. We consider different optimizers to minimize the categorical cross-entropy loss for our multiclass classification problem. Generally, the Adam and stochastic gradient descent (SGD) optimizers are used in HSI classification. From Table 5, we can see that the Adam optimizer performs better with our proposed hybrid 3D-2D CNN on all the datasets used for the evaluation of the model.
Impact of different optimizers on PCA-Hybrid 3D-2D CNN
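For reference, the Softmax output and the categorical cross-entropy loss that the optimizers minimize can be sketched in a few lines of pure Python. This is a generic illustration of the standard definitions for a single sample, with hypothetical logits; it is not code from the experiments.

```python
import math

def softmax(logits):
    """Map raw output-layer scores to a probability distribution over the classes."""
    m = max(logits)                              # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probs, true_class):
    """Negative log-probability assigned to the true class of one sample."""
    return -math.log(probs[true_class])

probs = softmax([2.0, 1.0, 0.1])                 # hypothetical scores for a 3-class problem
loss_if_class_0 = categorical_cross_entropy(probs, 0)
loss_if_class_2 = categorical_cross_entropy(probs, 2)
```

The loss is small when the network assigns a high probability to the true class and grows as that probability approaches zero; during training it is averaged over the samples of a batch.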
The training dataset is split into two parts: training and validation. The validation data is used to monitor overfitting while training the model. As the number of samples is small, we use dropout and L2 regularization in each layer to prevent the model from overfitting. We take 128 samples in a single batch after experimenting with different batch sizes, as a batch size of 128 leads to higher accuracy than the other batch sizes. We use early stopping and a variable learning rate throughout the training process. Initially, the learning rate is set to 10^-4. After that, it is gradually decreased as the monitored loss stops improving. We monitor the loss function and stop training after a certain period once the loss becomes constant. In this way, we are able to save a noticeable amount of computation time. The number of epochs varies with the dataset, i.e., 15, 20, and 36 for the Indian Pines, Pavia University, and KSC datasets, respectively, using the PCA-Hybrid 3D-2D CNN module, and 25, 35, and 42 for the Indian Pines, Pavia University, and KSC datasets, respectively, using the SAE-Hybrid 3D-2D CNN module. From the training and validation loss curves portrayed in Fig. 9, it can be seen that the training loss and validation loss are almost the same, and therefore we can conclude that the model is neither overfitting nor underfitting. We can also see that the training and validation losses stop decaying after a certain point, which is when we stop the training process.

Training and validation loss curve showing the training progress of our approach on three different benchmark datasets.
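The interplay between the decaying learning rate and early stopping can be sketched as below. The function replays a recorded sequence of validation losses rather than training a network, and the decay factor, patience values, and improvement threshold are hypothetical choices; only the initial learning rate of 10^-4 comes from the text.

```python
def train_with_early_stopping(val_losses, lr=1e-4, factor=0.5,
                              lr_patience=2, stop_patience=4, min_delta=1e-4):
    """Replay per-epoch validation losses; return (epochs_run, final_learning_rate)."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:              # meaningful improvement: reset the counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait % lr_patience == 0:          # plateau: lower the learning rate
                lr *= factor
            if wait >= stop_patience:            # loss has stayed constant: stop training
                return epoch, lr
    return len(val_losses), lr

# Hypothetical loss curve that flattens after the fourth epoch.
curve = [0.9, 0.5, 0.3, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]
epochs_run, final_lr = train_with_early_stopping(curve)
```

In Keras, comparable behaviour is typically obtained with the `EarlyStopping` and `ReduceLROnPlateau` callbacks; whether these exact callbacks were used in the experiments is not stated.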
In this experimental part, we evaluate the impact of the two aforementioned feature extraction methods, i.e., PCA and stacked autoencoder, on our proposed 3D-2D CNN framework. First, we perform PCA on each HSI dataset and then feed the reduced HSI into the proposed hybrid 3D-2D CNN. The number of principal components (PCs) is determined from the cumulative variance graph depicted in Fig. 10. In the case of Indian Pines and Pavia University, it can be seen that 30 and 20 PCs, respectively, cover a total of 99% of the variance of each dataset. As such, we select 30 and 20 PCs for the Indian Pines and Pavia University datasets, respectively. However, it can be seen that we need at least 100 PCs to cover about 90% of the total variance for the KSC dataset. Therefore, we choose the first 100 PCs for the KSC dataset.

Graphical representation of the cumulative variance of PCs of the three benchmark datasets.
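Reading the number of PCs off a cumulative variance curve amounts to the following computation. The sketch is pure Python, and the eigenvalue spectrum in the example is made up for illustration; it does not correspond to any of the three datasets.

```python
def components_for_variance(eigenvalues, target):
    """Smallest number of leading PCs whose cumulative explained variance reaches target."""
    vals = sorted(eigenvalues, reverse=True)     # PCs are ordered by decreasing eigenvalue
    total = sum(vals)
    running = 0.0
    for k, v in enumerate(vals, start=1):
        running += v
        if running / total >= target:
            return k
    return len(vals)

# Hypothetical eigenvalues of a band covariance matrix.
spectrum = [60, 25, 10, 3, 1, 1]
k_90 = components_for_variance(spectrum, 0.90)
k_99 = components_for_variance(spectrum, 0.99)
```

A spectrum that decays quickly needs only a few PCs to reach a given threshold, whereas a slowly decaying spectrum, as observed for KSC, needs many more.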
Secondly, we implement a stacked autoencoder for each of the datasets to refine the relevant features and feed the features to the proposed hybrid 3D-2D CNN architecture. In stacked autoencoder, there are no hard and fast rules for the selection of nodes in the bottleneck layer, which is essentially the number of features extracted using the stacked autoencoder. We use a grid search strategy to find the appropriate number of nodes in the bottleneck layer. The search space is determined through previous heuristics and works on these datasets using autoencoders. From Fig. 11, it can be observed that 30, 20 and 19 nodes in the bottleneck layer yield a maximum classification accuracy for the Indian Pines, Pavia University and KSC dataset, respectively.

Graphical representation of accuracy vs number of nodes in the bottleneck layer of the proposed SAE-Hybrid 3D-2D CNN architecture.
From the above discussion, it is clear that the number of PCs for the KSC dataset (100) is far greater than the number of nodes used in the bottleneck layer of the stacked autoencoder (19). This happens mainly because of the heavy non-linearity that exists in the KSC dataset, which prevents PCA from reducing the number of spectral bands as effectively as the stacked autoencoder. Although the PCA approach outperforms the stacked autoencoder by a slight margin, the network built on the stacked autoencoder takes far less time to train, because the much smaller number of extracted features drastically reduces the number of trainable parameters. On the other hand, PCA performs comparatively better than the stacked autoencoder in the case of the Indian Pines and Pavia University datasets because the number of PCs is much smaller than the number of original spectral bands.
We use three different accuracy metrics (AA, OA, and Cohen's Kappa score) to evaluate our proposed 3D-2D CNN architecture on the three benchmark datasets with the two dimensionality reduction methods (PCA and stacked autoencoder). We provide the accuracy results in Tables 6, 7, and 8; from these tables, it can be observed that our proposed 3D-2D CNN architecture paired with PCA or the stacked autoencoder achieves remarkable classification accuracy on all three benchmark datasets. In particular, the proposed 3D-2D CNN framework paired with PCA obtains overall accuracies of 99.73%, 99.90%, and 99.32% for the Indian Pines, Pavia University, and KSC datasets, respectively. These are slightly higher than the accuracies achieved by pairing the stacked autoencoder with our proposed 3D-2D CNN architecture, which are 99.70%, 99.87%, and 97.93% for the three datasets, respectively. It can also be seen that the Cohen's Kappa score is high for our 3D-2D CNN architecture, which indicates the robust and unbiased performance of our proposed 3D-2D CNN model. Besides, we consider increasing the value of the spatial context S and observe that the accuracy degrades as S increases, mainly because a lot of noise is introduced into the feature space with larger S. Figure 12 portrays the relationship between average accuracy and S. From these experimental results, we can conclude that the best classification accuracy is achieved when S equals 7 and that the accuracy declines for other values. It can also be noted from Fig. 12 that the number of trainable parameters increases dramatically with S for all three benchmark datasets.
Classification accuracy of the proposed hybrid 3D-2D CNN architecture using PCA and stacked autoencoder on Indian Pines dataset
Classification accuracy of the proposed hybrid 3D-2D CNN architecture using PCA and stacked autoencoder on Pavia University dataset
Classification accuracy of the proposed hybrid 3D-2D CNN architecture using PCA and stacked autoencoder on KSC dataset

Graphical representation of the accuracy vs size of spatial context in our proposed PCA-Hybrid 3D-2D CNN.
We provide the number of trainable parameters of the standalone hybrid 3D-2D CNN, PCA-Hybrid 3D-2D CNN, SAE-Hybrid 3D-2D CNN and other state-of-the-art models such as PCA-Multi Scale CNN [22] and ResNet50 [60] in Table 9. Here, we can see that SAE-Hybrid 3D-2D CNN has the lowest number of trainable parameters, far fewer than the other state-of-the-art approaches. The greater the number of trainable parameters, the longer it takes to train the model and the more computational cost it incurs. From this analysis, we can say that our proposed approach takes less computation time and space than the other state-of-the-art approaches as it has fewer trainable parameters. For this reason, we employ dimensionality reduction techniques (PCA and stacked autoencoder) to reduce the huge number of trainable parameters of the standalone hybrid 3D-2D CNN. The numbers of trainable parameters are the same in both frameworks (PCA-Hybrid 3D-2D CNN and SAE-Hybrid 3D-2D CNN) for the Indian Pines and Pavia University datasets. However, the number of trainable parameters for the KSC dataset is smaller in SAE-Hybrid 3D-2D CNN as the number of extracted components is smaller. From this discussion, it can be concluded that the stacked autoencoder is preferable when the number of PCs extracted by PCA is much larger than the number of features extracted by the stacked autoencoder.
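To make the dependence of the parameter count on the number of extracted components and on S concrete, the following is a minimal sketch and not the configuration behind Table 9. Only the layout of three 3D convolutional layers followed by one 2D convolutional layer is taken from the proposed architecture; the kernel sizes, filter counts, dense layer width, the spatial padding that keeps the S x S size, and the example component counts are all hypothetical values chosen for illustration.

```python
def conv3d_params(kernel, c_in, c_out):
    """Weights plus biases of a 3D convolution with kernel (kh, kw, kd)."""
    kh, kw, kd = kernel
    return (kh * kw * kd * c_in + 1) * c_out

def conv2d_params(kernel, c_in, c_out):
    """Weights plus biases of a 2D convolution with kernel (kh, kw)."""
    kh, kw = kernel
    return (kh * kw * c_in + 1) * c_out

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return (n_in + 1) * n_out

def hybrid_param_count(s, n_components, n_classes,
                       filters3d=(8, 16, 32),   # hypothetical filter counts
                       depth_kernels=(7, 5, 3),  # hypothetical spectral kernel sizes
                       filters2d=64,             # hypothetical
                       dense_units=128):         # hypothetical
    """Count trainable parameters of an illustrative 3D-2D CNN.

    Assumptions: 3x3 spatial kernels with padding that preserves the
    S x S size, no padding along the spectral axis, and one hidden
    dense layer before the classification layer.
    """
    total = 0
    depth, c_in = n_components, 1
    for c_out, kd in zip(filters3d, depth_kernels):
        total += conv3d_params((3, 3, kd), c_in, c_out)
        depth -= kd - 1  # spectral axis shrinks without padding
        c_in = c_out
    if depth < 1:
        raise ValueError("too few components for the chosen spectral kernels")
    # The remaining spectral depth is folded into the channel axis
    # before the 2D convolution.
    total += conv2d_params((3, 3), depth * c_in, filters2d)
    total += dense_params(s * s * filters2d, dense_units)
    total += dense_params(dense_units, n_classes)
    return total

if __name__ == "__main__":
    for s in (5, 7, 9, 11):
        for n_components in (15, 30):
            print(s, n_components, hybrid_param_count(s, n_components, n_classes=16))
```

In this sketch, the flattened input to the dense layer grows with the square of S, and the input to the 2D convolution grows with the number of components that survive the 3D layers, so both a larger spatial context and a larger number of extracted components inflate the total.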
Comparison of the total number of trainable parameters among Hybrid 3D-2D CNN, PCA-Hybrid 3D-2D CNN and SAE-Hybrid 3D-2D CNN
After obtaining these promising results for our hybrid 3D-2D CNN framework on the three different HSI datasets, we compare our approach with the following state-of-the-art independent 2D and 3D CNN-based methods: PCA-multiscale-CNN (PCA-MS-CNN) [22], dual channel CNN [46], random forest with CNN [45], 3D CNN [47], PCA with 2D CNN [23] and, lastly, SVM with a non-linear RBF kernel. We provide the results in Figs. 13, 14 and 15, from which it can be observed that, on all three datasets, our proposed PCA-Hybrid 3D-2D CNN significantly outperforms the independent 2D and 3D CNN-based state-of-the-art approaches, and SAE-Hybrid 3D-2D CNN also moderately surpasses those methods. Our proposed hybrid 3D-2D CNN architecture achieves superior results in comparison to the investigated methods because (i) the 3D convolutional layers used in the architecture extract not only spatial features but also fine spectral features; (ii) the 3D convolutional layers also fine-tune the spectral features, making the classification more accurate; and (iii) the 2D convolutional layer facilitates spatial feature learning across the spectral bands and enhances the spatial features extracted earlier. Note that the feature reduction techniques, PCA and stacked autoencoder, reduce the high dimensionality of the HSI and contribute to the reduced number of parameters of the proposed hybrid 3D-2D CNN architecture. The number of training samples is very limited for hyperspectral images, which may lead to overfitting; we have therefore used dropout and regularization techniques to prevent it. Lastly, if the spatial context size, S, is too large, the neighborhood of a test sample may overlap with that of a training sample; we have therefore restricted S to avoid this overlap.
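The role of the spatial context can be illustrated with a minimal sketch of how an S x S neighborhood is commonly cut around each labeled pixel of the dimensionality-reduced image. This is a generic patch-extraction routine written for illustration, not the authors' implementation; the function names, the zero padding at the image border, and the convention that label 0 marks unlabeled pixels are assumptions.

```python
import numpy as np

def extract_patches(cube, labels, s):
    """Cut an s x s x B patch around every labeled pixel.

    cube:   (H, W, B) array, e.g. the HSI after PCA or the autoencoder.
    labels: (H, W) integer array; 0 is assumed to mean "unlabeled".
    s:      odd spatial context size.
    """
    if s % 2 == 0:
        raise ValueError("s must be odd so that the patch has a center pixel")
    margin = s // 2
    # Zero-pad the two spatial axes so that border pixels get full patches.
    padded = np.pad(cube, ((margin, margin), (margin, margin), (0, 0)),
                    mode="constant")
    rows, cols = np.nonzero(labels)
    patches = np.stack([padded[r:r + s, c:c + s, :]
                        for r, c in zip(rows, cols)])
    # Shift class labels to start at 0 for training.
    return patches, labels[rows, cols] - 1

def windows_overlap(p, q, s):
    """True if the s x s windows centered at pixels p and q share any pixel."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1])) < s
```

The second helper shows why a large S is risky: two windows of size S share pixels whenever their centers are fewer than S pixels apart along both axes, so a larger S makes it more likely that a test patch contains pixels already seen in a training patch.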
Note that our main goal in this work is to build a lightweight CNN with a much smaller number of parameters that achieves performance similar to that of heavyweight CNNs. As such, the results of our proposed lightweight CNN are superior to those of the existing methods, but only by a small margin. Therefore, we have not compared the visual classification results of our method with those of the existing methods, as the differences would be barely visible.

Comparison of the proposed hybrid 3D-2D CNN architecture with the independent 2D and 3D CNN-based methods on Indian Pines dataset.

Comparison of the proposed hybrid 3D-2D CNN architecture with the independent 2D and 3D CNN-based methods on Pavia University dataset.

Comparison of the proposed hybrid 3D-2D CNN architecture with the independent 2D and 3D CNN-based methods on KSC dataset.
In this paper, we have proposed a novel, well-tuned hybrid 3D-2D deep CNN architecture that has been evaluated on three benchmark datasets (Indian Pines, Pavia University and KSC) and has achieved a remarkable improvement in classification accuracy. We have used PCA and a stacked autoencoder to reduce the dimensionality of the HSI and provided an extensive comparison based on different accuracy measures (AA, OA and Kappa score) and the number of trainable parameters. We have also provided guidance on when to use PCA and when to use the stacked autoencoder for HSI classification with our designed 3D-2D CNN framework. In the proposed architecture, we have used three 3D convolutional layers followed by one 2D convolutional layer to extract high-quality spectral and spatial features and merge them in a way that improves the classification of different land covers and vegetation in the HSI. We have achieved 99.73%, 99.90%, and 99.32% accuracy for the Indian Pines, Pavia University, and KSC datasets, respectively, which outperforms the other state-of-the-art approaches in both classical machine learning (SVM with RBF kernel) and deep learning (independent 2D and 3D CNN-based approaches). Our proposals have improved the classification accuracy and reduced the high computational cost by using feature reduction techniques. However, choosing the size of the spatial context in the 3D-2D CNN adaptively remains a promising direction for future research. In the future, we will investigate our approaches in different fields, ranging from 3D medical imaging to satellite surveillance.
