Abstract
With the evolution of convolutional neural networks, the extraction of deep features for accurate classification of remote sensed (RS) images has gained considerable momentum. However, because scale varies widely in high-resolution remote sensed images, accurate classification remains a challenging task. Variation in angle likewise reduces the accuracy of deep features extracted with a convolutional neural network. In this paper, a Multiscale and Multiangle convolutional neural network (MSMA-CNN) is proposed that extracts discriminant, nonlinear and invariant deep features of RS images by employing several convolutional, pooling and fully connected layers. In MSMA-CNN, spectral features are considered along with spatial features for classification of remote sensing scenes, making the entire system robust. The RS images are scaled at different levels using Gaussian pyramid decomposition and rotated at different angles; features are then derived using maximally stable extremal regions (MSER) at the spectral and spatial levels, concatenated, and fed to the MSMA-CNN. A regularization parameter is added so that results on test images stay as close as possible to those on the training images. A hybrid MSMA-CNN structure is designed by altering various parameters of the CNN structure to optimize performance. To demonstrate the effectiveness of the proposed method, we compared results on six challenging high-resolution remote sensing datasets and achieved a classification accuracy of 92.25%, a significant improvement over other state-of-the-art scene classification methods in terms of classification accuracy and computational cost.
Keywords
Introduction
Example of object scale and angle variation [21].
Collecting data through remote sensing devices has become much easier in recent years, owing to new earth observation programs that collect high-resolution data for various applications. Constructing robust feature maps that represent the characteristics of high-resolution imagery has therefore become very important for driving these applications. Hyperspectral remote sensing provides a large amount of data for a given scene, and this high-resolution spectral information helps to differentiate between materials and increases classification accuracy. Even small spatial structures can be analyzed in detail because high-end sensors collect images with fine spatial resolution [1]. However, collecting fine-resolution data leads to increased dimensionality problems in the spectral domain, and techniques such as PCA and IDA are no longer fully efficient for processing such high-dimensional data [2]. Feature extraction (FE) plays a major role in high-dimensional data processing. Because the spectral signatures of different materials differ but can sometimes be close to each other, feature extraction for such images remains a challenging task [3]. Many recent studies have addressed the extraction of spatial and spectral information for feature extraction [4]. Detailed spatial information has become easily available thanks to sensors with good spatial resolution [5], and spatial-spectral FE improves overall classification efficiency [6]. Low-level and mid-level classification gives good results for ordinary images, but the results for RS images are weaker because of high dimensionality. To bridge this semantic gap, multiscale image analysis is used for classification. Deep learning now offers a new solution for computer vision recognition problems [2, 7].
Because RS images are diverse in nature, they have features at many layers, which can be extracted using deep feature extraction methods. FE mostly relies on single-layer techniques such as PCA and IDA [8], which are not deep and do not yield a good classification of the image. Many classifiers operate at a single layer, and some, such as decision trees or kernel SVMs, work on two layers [8]. Human intelligence works on different parameters and at different levels. To recognize objects, the human visual system uses sequential processing, which supports both learning new objects and recognizing existing ones [9]. Deep architectures enable deep learning of different objects in a way similar to the human neural system. Owing to the nature of RS images, spectral signatures may mix because of scattering, which makes classification of the desired objects much more difficult. Moreover, factors such as the sensor used, the range of the instantaneous field of view (IFOV) and atmospheric conditions complicate the process further. To address these problems, classifying RS images at a deeper level gives better classification results than existing techniques [10]. Features derived at a single level tend to vary with rotation. This problem can be solved easily for general images, but for RS images it requires multilevel analysis, which is not possible without a proper hierarchical architecture. Figure 1 depicts how scale and angle variation can make a large difference to classifying objects accurately. With only a few hundred training samples, training an entirely new CNN becomes much more difficult.
In this paper, we study the application of a convolutional neural network (CNN), one of the deep models, to feature extraction from RS images with varying scale and angle, and develop a Multiscaled-Multiangled CNN model for effective RS classification.
The major contributions of this paper are as follows.
MSMA-CNN, a new structure that handles not only scale variation but also angle variation, is proposed. In the proposed structure, the images are taken at different scales using Gaussian pyramid decomposition and are rotated at different angles to extract the most relevant features while keeping the network structure the same. The CNN parameters are shared by the fusion category, in which scale and angle are taken together, and are trained with multiscaled and multiangled images, thus reducing the number of parameters. So that the network correctly recognizes both the original images and their rescaled and rotated versions, these images are first fed to the network individually; the scale and angle variants are then fused, and the fused input is used to train the network weights.
Convolutional neural network.
In the proposed MSMA-CNN, the Gaussian pyramid decomposition method [22] is used to scale the images, as it efficiently extracts features at different scales through successive filtering and downsampling operations. These scaled images are then rotated at different angles, and the MSER algorithm [23] is used so that images displaced by a certain angle relative to the original image are classified effectively. The algorithm converges because the entire network is trained using a single structure. At the same time, the network has a powerful feature extraction ability because it is trained at different scales and at different angles. A concatenation layer is added before the features are fed to the CNN; it takes the images at different scales, the images at different angles and the fusion of both into consideration to form a robust network. The proposed MSMA-CNN was examined and compared with both conventional remote sensing image classification methods and methods based on deep learning. Six data sets, Indian_pines_corrected, JasperRidge2_F198, Jasperridge2_F224, SalinasA_corrected, Urban_F210 and Google SIRI-WHU, were used in the training and testing phases. The experimental results confirm that the proposed method obtains better classification accuracy than the other methods.

The remainder of the paper is organized as follows: Section 2 covers related studies, Section 3 the proposed approach, Section 4 the experimental results and Section 5 the conclusion.
Convolutional neural network
Unlike conventional neural networks, which build a single fully connected network, deep learning networks build their connections in a hierarchical manner. Deep neural networks generally extract deeper features through a hierarchical network in the first few layers, followed by a fully connected network whose output is passed to a softmax classifier for final classification.
Owing to this property, deep networks can represent complicated data with much more confidence. The real challenge lies in training such networks. The initial stages of deep learning are generally unsupervised, followed by final stages that are fine-tuned in a supervised manner. Many deep learning models, such as DBN [12, 13], SAE [14] and CNN [11], have been developed for various applications. Lately, CNNs have gained strength in the field of image processing and have proved much more competent than other deep learning models in classification [15, 16] and detection [17]. In this paper, we examine the application of a deep CNN to feature extraction from RS images.
Subpixel mapping, semantic segmentation and deep feature extraction are some of the important areas where CNNs are applied. A basic CNN consists of convolutional layers, pooling layers and fully connected layers, whose output is passed to a softmax layer for final class allocation. The structure of a CNN is shown in Fig. 2, where the input image is a tensor whose shape is given by the image parameters and is passed to the convolutional layer. A convolutional kernel, selected through hyperparameters, convolves the input and passes the result to the next convolutional layer. The convolved images are then given to the pooling layer, which reduces the dimensions of the data by combining multiple output nodes of one layer into a single node in the next layer. Multiple fully connected layers are applied after the pooling layer to flatten the network; in these layers, each node in one layer is connected to every node in the next layer. Finally, the softmax layer assigns the feature vector to its class.
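The layer sequence described above can be illustrated with a minimal NumPy forward pass for a 1-D CNN. This is only a sketch under stated assumptions, not the network used in the experiments: the ReLU activation, the random weights, the number of feature maps and the pooling width are all illustrative choices, and the names `conv1d`, `max_pool1d` and `forward` are hypothetical.

```python
import numpy as np

def conv1d(x, kernels, bias):
    """Valid 1-D convolution (cross-correlation form).
    x: (length,), kernels: (n_maps, k), bias: (n_maps,) -> (length - k + 1, n_maps)."""
    k = kernels.shape[1]
    n_out = x.shape[0] - k + 1
    idx = np.arange(n_out)[:, None] + np.arange(k)[None, :]
    return x[idx] @ kernels.T + bias

def relu(z):
    # Assumed activation; the text does not name one.
    return np.maximum(z, 0.0)

def max_pool1d(fmap, size):
    """Non-overlapping max pooling along the first axis (a ragged tail is dropped)."""
    n = (fmap.shape[0] // size) * size
    return fmap[:n].reshape(-1, size, fmap.shape[1]).max(axis=1)

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def forward(x, p):
    h = relu(conv1d(x, p["kernels"], p["conv_bias"]))  # convolutional layer
    h = max_pool1d(h, p["pool"])                       # pooling layer
    logits = h.reshape(-1) @ p["W"] + p["b"]           # fully connected layer
    return softmax(logits)                             # class probabilities

# Illustrative sizes only: a 100-value input vector, kernel width 11, 10 classes.
rng = np.random.default_rng(0)
n_in, k, n_maps, pool, n_classes = 100, 11, 4, 2, 10
n_flat = ((n_in - k + 1) // pool) * n_maps
params = {
    "kernels": rng.normal(scale=0.1, size=(n_maps, k)),
    "conv_bias": np.zeros(n_maps),
    "pool": pool,
    "W": rng.normal(scale=0.1, size=(n_flat, n_classes)),
    "b": np.zeros(n_classes),
}
probs = forward(rng.normal(size=n_in), params)
```

With untrained random weights the probabilities carry no meaning; the point is only the order of operations and how the tensor shapes shrink from layer to layer.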
The value of a neuron s at position
where the feature map in the previous layer (
where
Wavelet decomposition [24], Curvelet [25], Contourlet [26] and Gaussian pyramid decomposition [22] are among the transforms used for scaling an image. Of these, Gaussian pyramid decomposition is an efficient scaling method that extracts features from an image at different scales through progressive filtering and downsampling operations. It can also fuse multiexposure images, which matters for remote sensed images: they are high-dimensional and carry information at many scales that must be extracted for different applications. The original grayscale image is taken as the first layer of the Gaussian pyramid, L1, which is convolved with a Gaussian kernel and downsampled to give the pyramid layer L2. L2 is in turn convolved with a Gaussian kernel and downsampled to give the pyramid layer L3, and so on. The decomposition continues until the desired scale of the original image is reached (see Fig. 3).
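The filter-then-downsample recursion described above can be sketched in a few lines of NumPy. The 5-tap binomial kernel, the reflected borders and the downsampling factor of 2 are assumptions made for illustration; the text does not specify the Gaussian window used in the experiments, and the function names are hypothetical.

```python
import numpy as np

# Assumed 5-tap binomial approximation of a Gaussian window.
KERNEL_1D = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def blur(image):
    """Separable Gaussian-like blur with reflected borders; output has the input shape."""
    height, width = image.shape
    padded = np.pad(image, 2, mode="reflect")
    # Filter along the columns direction (horizontal pass).
    rows = sum(w * padded[:, i:i + width] for i, w in enumerate(KERNEL_1D))
    # Filter along the rows direction (vertical pass).
    return sum(w * rows[i:i + height, :] for i, w in enumerate(KERNEL_1D))

def gaussian_pyramid(image, levels):
    """Return [L1, L2, ...]: each layer is the blurred previous layer, downsampled by 2."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(blur(pyramid[-1])[::2, ::2])
    return pyramid

image = np.arange(64 * 64, dtype=float).reshape(64, 64)
layers = gaussian_pyramid(image, levels=3)
shapes = [layer.shape for layer in layers]
```

Here `layers[0]` plays the role of L1 (the original image), and each further entry halves the spatial size, so three levels of a 64 x 64 image give 64 x 64, 32 x 32 and 16 x 16 layers.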
Gaussian pyramid decomposition.
Maximally stable extremal regions (MSER).
Block diagram for classification of multiscaled RS images using 1-D convolutional neural network model.
Alongside Gaussian pyramid decomposition, which works on the scale of the dimensionally reduced hyperspectral images, maximally stable extremal regions (MSER) are used to derive features at different angles, as MSER offers invariance to intensity transformations, preservation of adjacency and low computational complexity [23]. In MSER, regions are defined by an extremal property of the intensity in the region and on its outer boundary. Such regions possess several useful properties: invariance to affine transformation of image intensities; covariance to adjacency-preserving transformations; stability, since only extremal regions whose support is virtually unchanged over a range of thresholds are selected; and, most importantly, multiscale detection, in which both very fine and very large structures are detected without smoothing [23] (see Fig. 4).
Figure 5 shows a diagrammatic representation of the classification of multiscaled SAR/hyperspectral images. Because remotely sensed images are diverse in nature, many aspects need to be considered for precise classification. Figure 5 shows a simplified version of the architecture used to construct the CNN. The same overall architecture used in [6] is developed here. A single pixel of a data cube with dimensions 1
For scaling, we use Gaussian pyramid decomposition, in which the original RS image is taken as RS1. We convolve RS1 with a Gaussian kernel and downsample it to get the RS2 image. The gray scale value in the RS1 layer is obtained by,
Where
The gaussian window is represented as follows,
Where
In MSER, the regions are defined solely by an extremal property of the intensity function in the region and on its outer boundary [23].
Pixels are first sorted by intensity and then placed in the image in either increasing or decreasing order. A structure is built that stores the area of each connected component as a function of intensity. When two components merge, the smaller one ceases to exist and all of its pixels are inserted into the larger one. Intensity levels that are local minima of the rate of change of the area function are selected as thresholds that produce maximally stable extremal regions. Each MSER is represented by the position of a local intensity minimum (or maximum) and a threshold.
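The stability idea behind this procedure can be illustrated with a deliberately simplified sketch. It follows only the single region that grows around one seed pixel as the threshold rises and measures how fast its area changes. A full MSER detector tracks all components at once with a union-find structure, which this sketch does not attempt; the function names, the `delta` step and the synthetic image are assumptions made for illustration.

```python
from collections import deque
import numpy as np

def component_area(mask, seed):
    """Area of the 4-connected component of `mask` that contains `seed` (0 if seed is off)."""
    if not mask[seed]:
        return 0
    height, width = mask.shape
    seen = np.zeros(mask.shape, dtype=bool)
    seen[seed] = True
    queue = deque([seed])
    area = 0
    while queue:
        r, c = queue.popleft()
        area += 1
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < height and 0 <= nc < width and mask[nr, nc] and not seen[nr, nc]:
                seen[nr, nc] = True
                queue.append((nr, nc))
    return area

def region_stability(image, seed, delta=2, max_level=255):
    """Area of the seed's dark region per threshold t, and its relative growth rate
    q(t) = (area[t + delta] - area[t - delta]) / area[t]; q is inf where undefined."""
    areas = np.array(
        [component_area(image <= t, seed) for t in range(max_level + 1)], dtype=float
    )
    q = np.full(max_level + 1, np.inf)
    for t in range(delta, max_level + 1 - delta):
        if areas[t] > 0:
            q[t] = (areas[t + delta] - areas[t - delta]) / areas[t]
    return areas, q

# Synthetic test image: a dark 4 x 4 square (value 50) on a bright background (value 200).
image = np.full((10, 10), 200, dtype=np.uint8)
image[3:7, 3:7] = 50
areas, q = region_stability(image, seed=(4, 4))
```

For thresholds well inside the gap between the square and the background, the area does not change and `q` is zero, which is what marks the square as a maximally stable region; `q` jumps just before the threshold reaches the background intensity, where the region is about to merge with its surroundings.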
For, Region Q is a contiguous subset of D, i.e. for each p, q
Further, to combine the features from these multiple scales and angles, all images except the first are resized using cubic interpolation until their spatial size matches that of the first image.
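A minimal sketch of this size normalization is given below. It assumes the common Keys cubic convolution kernel with a = -0.5 and pixel-centre alignment, neither of which is specified in the text, and it stacks the resized images along a new last axis as one possible way of combining them. The names `resize_cubic` and `fuse` are hypothetical.

```python
import numpy as np

def cubic_weight(x, a=-0.5):
    """Keys cubic convolution kernel (assumed variant, a = -0.5)."""
    x = np.abs(x)
    near = (a + 2) * x**3 - (a + 3) * x**2 + 1
    far = a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return np.where(x <= 1, near, np.where(x < 2, far, 0.0))

def resize_axis(arr, new_len, axis):
    """Resample `arr` to `new_len` samples along `axis` using four cubic taps per output."""
    arr = np.moveaxis(np.asarray(arr, dtype=float), axis, 0)
    old_len = arr.shape[0]
    # Output sample positions expressed in input coordinates (pixel centres aligned).
    pos = (np.arange(new_len) + 0.5) * old_len / new_len - 0.5
    base = np.floor(pos).astype(int)
    out = np.zeros((new_len,) + arr.shape[1:])
    for k in range(-1, 3):
        idx = np.clip(base + k, 0, old_len - 1)  # replicate border samples
        w = cubic_weight(pos - (base + k))
        out += w.reshape((-1,) + (1,) * (arr.ndim - 1)) * arr[idx]
    return np.moveaxis(out, 0, axis)

def resize_cubic(image, shape):
    """Separable cubic resize of a 2-D image to `shape` = (rows, cols)."""
    return resize_axis(resize_axis(image, shape[0], 0), shape[1], 1)

def fuse(images):
    """Resize every image except the first to the first image's size, then stack them."""
    target = images[0].shape
    resized = [np.asarray(images[0], dtype=float)]
    resized += [resize_cubic(im, target) for im in images[1:]]
    return np.stack(resized, axis=-1)

# Three stand-in pyramid layers of decreasing size.
fused = fuse([np.ones((8, 8)), 2 * np.ones((4, 4)), 3 * np.ones((2, 2))])
```

The four cubic weights sum to one for every output position, so constant images stay constant after resizing; flattening `fused` then yields a single concatenated feature vector.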
The details of the data sets
Sample image patches in the training data set. The data set includes 1000 samples.
Sample image patches in the testing data set. The data set includes 1000 samples.
The experiments are implemented in MATLAB 2018b on an x64-based PC with an Intel(R) Core(TM) i7-7400 CPU @ 3.00 GHz (3001 MHz, 4 cores), 8 GB RAM, a 4 GB NVIDIA Titan XP GPU and the Windows 10 Pro operating system. The database consists of scenes from Indian_pines_corrected, JasperRidge2_F198, Jasperridge2_F224, SalinasA_corrected, Urban_F210 and the Google dataset of SIRI-WHU. Original and multiscaled RS images are applied to the convolutional neural networks, and several tests are carried out with a 1-D CNN by varying its parameters. The results show that features derived at multiple scales yield a much better recognition rate than features of single-scale images, and that the CNN improves the overall classification results compared with other classification techniques. In other words, with convolutional neural networks the classification accuracy is higher for multiscaled RS images than for the original RS images. A total of 10 classes are considered for training and testing, with 1000 samples in total (see Table 1, Figs 6 and 7).
Results of training and testing of SAR/Hyperspectral images on CNN network
Phase-I deals with applying single-scale and multiscaled RS images to the CNN for different feature vector sizes. The following table shows the classification accuracy obtained for each feature vector size (see Table 2, Fig. 8).
Classification accuracy for different sizes of feature vectors using CNN
Classification accuracy for SAR/Hyperspectral images with different sizes of feature vectors using CNN.
The figure shows the different feature vector sizes considered and tested using the CNN. As the size of the feature vector increases, the recognition rate also increases, but so do the computational time and the amount of data for further computation. The graph shows that, for the size of 100
Classification accuracy for different size of kernels using CNN
Classification accuracy for SAR/Hyperspectral images with different size of kernels using CNN.
The figure shows the different kernel sizes considered and tested using the CNN. As the kernel size increases, the classification accuracy also increases, but beyond a certain size the accuracy starts to decrease. When the kernel grows beyond this limit, reducing the data dimensionality becomes difficult, so the kernel size cannot be increased indefinitely. The computational time and the amount of data for further computation also keep increasing. The graph shows that, for a kernel size of 11, the classification accuracy is 95.75% with a computational time of 11.1 s. For further calculations, we keep the kernel size at 11 and observe the changes in accuracy as other parameters vary (see Table 4, Fig. 10).
Classification accuracy for different number of pooling layers using CNN
Classification accuracy for SAR/Hyperspectral images with different number of pooling layers using CNN.
Classification accuracy for SAR/Hyperspectral images with different number of fully connected layers using CNN.
The figure shows the different numbers of pooling layers considered and tested using the CNN. As the number of pooling layers increases, the classification accuracy also increases, but beyond a certain number of layers the accuracy starts to decrease. With more pooling layers, the data are subsampled more often and the coarser subsamples no longer give a fine representation of the image, so the number of layers cannot be increased indefinitely. The computational time also keeps increasing. The graph shows that, for 2 pooling layers, the classification accuracy is 93.5% with a computational time of 5.9 s. For further calculations, we keep the number of pooling layers at 2 and observe the changes in classification accuracy as other parameters vary (see Table 5, Fig. 11).
Classification accuracy for SAR/Hyperspectral images with different number of epochs using CNN.
Classification accuracy for different number of fully connected layers using CNN
The figure shows the different numbers of fully connected layers considered and tested using the CNN. As the number of fully connected layers increases, the classification accuracy also increases, but beyond a certain number of layers the accuracy starts to decrease.
For further calculations, we keep the number of fully connected layers at 6 and observe the changes in classification accuracy as other parameters vary. Optimization is performed over all the parameters of the CNN to avoid underfitting and overfitting. With a fully connected network, the maximum amount of information is passed on to the next neuron to estimate a correct prediction value (see Table 6, Fig. 12).
Classification accuracy for different number of epochs using CNN
The figure shows the different numbers of epochs considered for training and testing using the CNN. As the number of epochs increases, the classification accuracy also increases, but beyond a certain number of epochs the accuracy starts to decrease. Past this limit, overfitting sets in: the network is overtrained on the training data, which reduces the classification accuracy. The number of epochs therefore cannot be increased indefinitely. The computational time and the amount of data for further computation also keep increasing. The graph shows that, for 11 epochs, the classification accuracy is 92.25% with a computational time of 7.3 s. Because remote sensed images contain a large number of different spectral signatures, conventional classification algorithms do not yield good classification results. When remote sensed images are classified using convolutional neural networks, the classification accuracy increases considerably compared with classification using traditional algorithms.
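The tuning procedure followed in this section fixes one parameter at a time and carries the best value forward to the next sweep. The sketch below shows that one-at-a-time search pattern. The scoring function `toy_accuracy` is a purely hypothetical stand-in for training and evaluating the CNN; it produces no measured results, and its peak values are simply set to the parameter values selected in the text so the example has a known answer.

```python
def tune_sequentially(grid, evaluate, start):
    """Sweep each parameter in turn, keeping the best value found before moving on.
    grid: {name: [candidate values]}, evaluate: config dict -> score, start: initial config."""
    best = dict(start)
    for name, candidates in grid.items():
        scores = {value: evaluate({**best, name: value}) for value in candidates}
        best[name] = max(scores, key=scores.get)
    return best

def toy_accuracy(cfg):
    # Hypothetical stand-in for "train the CNN and measure accuracy"; not experimental data.
    return (
        100.0
        - abs(cfg["kernel_size"] - 11)
        - 2 * abs(cfg["pooling_layers"] - 2)
        - abs(cfg["fc_layers"] - 6)
        - abs(cfg["epochs"] - 11)
    )

grid = {
    "kernel_size": [3, 5, 7, 9, 11, 13],
    "pooling_layers": [1, 2, 3, 4],
    "fc_layers": [2, 4, 6, 8],
    "epochs": [5, 8, 11, 14],
}
start = {"kernel_size": 3, "pooling_layers": 1, "fc_layers": 2, "epochs": 5}
best = tune_sequentially(grid, toy_accuracy, start)
```

A sequential sweep of this kind needs far fewer training runs than a full grid search, at the cost of possibly missing interactions between parameters.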
Confusion Matrix for 90
Figure 13 shows the classification accuracy for each individual class. Class 0, class 1, class 2, class 4, class 5, class 7 and class 9 have classification accuracy above 90%, whereas class 3, class 6 and class 9 have classification accuracy below 90%. The overall correct classification rate is 90.85% and the incorrect classification rate is 9.15%. Training takes 15 s per epoch on the GPU, compared with 35 s on the CPU.
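The per-class and overall figures quoted for each confusion matrix follow directly from its entries. A minimal sketch is given below, assuming that rows hold the true classes and columns the predicted classes; the 3-class matrix is a made-up example for illustration, not data from the experiments.

```python
import numpy as np

def per_class_accuracy(cm):
    """Fraction of each true class (row) that was predicted correctly."""
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)

def overall_accuracy(cm):
    """Fraction of all samples on the diagonal (correct classifications)."""
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
cm = np.array([
    [9, 1, 0],
    [2, 8, 0],
    [0, 1, 9],
])
class_acc = per_class_accuracy(cm)    # 0.9, 0.8, 0.9
correct = overall_accuracy(cm)        # 26 / 30
incorrect = 1.0 - correct
```

The correct and incorrect rates always sum to 100%, which matches the paired percentages reported for each figure.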
Confusion Matrix for 90
Figure 14 shows the classification accuracy for each individual class. Class 0, class 5, class 7, class 8 and class 9 have classification accuracy above 90%, whereas class 1, class 2, class 3, class 4 and class 6 have classification accuracy below 90%. The overall correct classification rate is 91.34% and the incorrect classification rate is 8.66%. Training takes 14 s per epoch on the GPU, compared with 36 s on the CPU.
Confusion Matrix for 90
Figure 15 shows the classification accuracy for each individual class. Class 1, class 2, class 5, class 6, class 7 and class 9 have classification accuracy above 90%, whereas class 0, class 3, class 4 and class 8 have classification accuracy below 90%. The overall correct classification rate is 91.16% and the incorrect classification rate is 8.84%. Training takes 14 s per epoch on the GPU, compared with 36 s on the CPU.
Confusion Matrix for 180
Figure 16 shows the classification accuracy for each individual class. Class 0, class 1, class 2, class 3, class 6 and class 9 have classification accuracy above 90%, whereas class 4, class 5, class 7 and class 8 have classification accuracy below 90%. The overall correct classification rate is 90.51% and the incorrect classification rate is 9.49%. Training takes 15 s per epoch on the GPU, compared with 35 s on the CPU.
Confusion Matrix for 180
Figure 17 shows the classification accuracy for each individual class. Class 1, class 2, class 4, class 5, class 7, class 8 and class 9 have classification accuracy above 90%, whereas class 0, class 3 and class 6 have classification accuracy below 90%. The overall correct classification rate is 90.9% and the incorrect classification rate is 9.1%. Training takes 19 s per epoch on the GPU, compared with 38 s on the CPU.
Confusion Matrix for 180
Figure 18 shows the classification accuracy for each individual class. Class 0, class 2, class 3, class 4, class 5, class 6, class 8 and class 9 have classification accuracy above 90%, whereas class 1 and class 7 have classification accuracy below 90%. The overall correct classification rate is 91.04% and the incorrect classification rate is 8.96%. Training takes 15 s per epoch on the GPU, compared with 35 s on the CPU.
Confusion Matrix for 270
Figure 19 shows the classification accuracy for each individual class. Class 2, class 3, class 4, class 5, class 7 and class 9 have classification accuracy above 90%, whereas class 0, class 1, class 6 and class 8 have classification accuracy below 90%. The overall correct classification rate is 91.12% and the incorrect classification rate is 8.88%. Training takes 16 s per epoch on the GPU, compared with 36 s on the CPU.
Confusion Matrix for 270
Figure 20 shows the classification accuracy for each individual class. Class 1, class 3, class 4, class 6 and class 9 have classification accuracy above 90%, whereas class 0, class 2, class 5, class 7 and class 8 have classification accuracy below 90%. The overall correct classification rate is 90.8% and the incorrect classification rate is 9.2%. Training takes 16 s per epoch on the GPU, compared with 36 s on the CPU.
Confusion Matrix for 270
Figure 21 shows the classification accuracy for each individual class. Class 0, class 2, class 3, class 4, class 6 and class 9 have classification accuracy above 90%, whereas class 1, class 5, class 7 and class 8 have classification accuracy below 90%. The overall correct classification rate is 90.84% and the incorrect classification rate is 9.16%. Training takes 15 s per epoch on the GPU, compared with 36 s on the CPU.
Comparison of different classification methods
Comparative graph for different classification methods.
Table 7 compares different classification methods with the proposed method. The table reports classification of the Google dataset of SIRI-WHU with 80% of the samples used for training, using traditional as well as deep learning classification methods. The average classification accuracy of our method is 92.25%, which is better than that of traditional classification methods such as BoVW, SPM-SIFT, LLC and LDA. Among deep learning methods, the improvement is larger over S-UFL and slight over LPCNN. Thus, adding scale and angle parameters to the remote sensed images improves the overall classification accuracy. Remote sensed images are scaled at different levels to derive the information that lies deep within them, which is then applied to the CNN. The experiments are conducted on five publicly available well-known datasets, and the 1-D CNN is tested with different parameters. The optimum parameters are selected for the final network, and the remaining data are tested on this network. The results show that classification accuracy increases considerably when CNNs are used for classification compared with other supervised and unsupervised techniques (see Fig. 22).
This paper proposes a multiscale and multiangle (MSMA) method to address the problem of object rotation in remote sensed images. The three main novelties described in the paper are the use of a multiangle technique along with multiscaling, the use of Gaussian decomposition for multiscaling, and the use of MSER on remote sensed images. For images at different scales, we use the Gaussian decomposition method, which extracts features at different levels through progressive filtering followed by downsampling. MSERs are invariant to affine transformation of intensity, and MSER is one of the fast detection algorithms. The concatenated features of Gaussian decomposition and MSER are finally applied to a CNN built with convolutional and pooling layers to obtain the final classification result. Experiments were performed on 10 different classes of remotely sensed hyperspectral images and show that deriving features at different scales and angles and applying them to well-designed CNN layers improves the overall efficiency considerably. The experimental results show that the proposed method outperforms the traditional classification methods and also some of the deep learning methods. Furthermore, MSMA is computationally efficient, as it uses a GPU for its implementation. In future work, we plan to test the algorithm on a larger number of datasets and to reduce its computational complexity.
Footnotes
Acknowledgments
This work is supported in part by the NVIDIA GPU grant program. We thank NVIDIA for granting us a Titan XP GPU to carry out our work in deep learning. We also thank the anonymous reviewers for their insightful comments.
