Abstract
Representing features at multiple scales is of great significance for hyperspectral image (HSI) classification. However, most existing methods improve the feature representation ability by extracting features with different resolutions. Moreover, existing attention methods have not taken full advantage of the HSI data, and the receptive field sizes of the artificial neurons in each layer are identical, whereas in neuroscience the receptive field sizes of visual cortical neurons adapt to the neural stimulation. Therefore, in this paper, we propose a Res2Net with spectral-spatial and channel attention (SSCAR2N) for hyperspectral image classification. To effectively extract multi-scale features of HSIs at a more granular level while keeping the amount of computation small and the parameter redundancy low, the Res2Net block is adopted. To further recalibrate the features from the spectral, spatial and channel dimensions simultaneously, we propose a visual threefold (spectral, spatial and channel) attention mechanism, in which a dynamic neuron selection mechanism is designed that allows each neuron to adaptively adjust the size of its receptive field based on the multiple scales of the input information. Comparison experiments on three benchmark hyperspectral image data sets demonstrate that the proposed SSCAR2N outperforms several state-of-the-art deep learning based HSI classification methods.
Introduction
Hyperspectral images (HSIs), acquired by special sensors, are composed of numerous narrow spectral bands that record the electromagnetic radiation emitted or reflected by the scene under observation. With their high spectral resolution, HSIs have clear advantages for fine land-cover recognition. HSI classification, which attempts to assign a specific class to each pixel, has crucial and broad applications in global environmental monitoring [1], national defense security [2], precision agriculture [3], and materials analysis [4]. However, the rich spectral information and highly correlated spatial information of HSI data pose serious challenges for the classification task. Over the last few decades, a vast number of machine learning methods have been proposed for HSI classification. The earliest of these extracted the spectral signatures of hyperspectral data for feature representation and used classical classifiers, such as the k-nearest neighbor approach [5], support vector machine (SVM) [6], random forest [7] and decision trees [8]. These methods are conceptually simple and easy to implement, but they ignore the spatial information of HSIs. As HSI data lie on a geographical manifold and neighboring pixels are spatially dependent, many methods have been proposed to incorporate spatial information into HSI classification, including Markov random fields (MRFs) [9], sparse representation [10], partitional clustering techniques [11], extended morphological profiles (EMPs) [12] and multiple kernel learning [13]. Nevertheless, these conventional machine learning methods, which can be regarded as processing with only one or two layers, all depend on hand-crafted descriptors that usually require expert knowledge.
In the recent decade, deep learning, a subfield of machine learning inspired by the structure and function of the biological brain, has shown more promising performance in HSI classification than conventional machine learning methods. Representative models include the stacked autoencoder [14], convolutional neural network (CNN) [15], deep belief network [16], and recurrent neural network (RNN) [17]. In deep learning based HSI classification methods, a hierarchical feature representation of HSIs is learned automatically by multiple non-linear layers, which yields better classification performance than hand-crafted descriptors [18, 19]. Early deep learning methods for HSI classification need to organize the spatial information into a vector before training and cannot represent the spatial information efficiently [15, 21]. To extract spatial features, Yue et al. first proposed a spectral-spatial deep learning architecture using a 2D CNN and logistic regression for HSI classification [22], which demonstrated that CNNs can effectively learn the local spatial structure features of HSI data. Although 2D spectral-spatial classification methods incorporate spatial information into the feature representation, their data transformations tend to deform the spatial structure, and they cannot learn the spectral and spatial features of HSIs jointly. Therefore, it is reasonable to extend deep neural network models to a 3D structure to learn higher-level 3D spectral-spatial features that conform to the structure of HSI data. In [23], Chen et al. employed several 3D convolutional and pooling layers to extract nonlinear, discriminant and invariant deep features of HSIs and verified the superiority of the 3D spectral-spatial classification method over 1D spectral and 2D spectral-spatial classification methods. To achieve efficient and accurate classification, Paoletti et al. proposed a novel 3D-CNN composed of 5 layers with a specific strategy for dealing with the border pixels in the image [24]. To address the overfitting problem caused by insufficient training samples, a 3D generative adversarial network was proposed in [25] for HSI classification, which contains a generative CNN to generate fake inputs and a discriminative CNN to classify the real and fake inputs. More recently, in [26], the spectral bands are first clustered into several groups that are fed into multiple 3D-CNNs respectively, and the outputs of the 3D-CNNs are then fused for spectral and spatial feature fusion to improve the precision of HSI classification.
The accuracy of HSI classification declines because of the degradation problem of deep CNN architectures. To address this problem, a considerable number of residual architecture based HSI classification methods have been proposed [27–29]. For example, Zhong et al. proposed a spectral-spatial residual network (SSRN), which uses two consecutive residual blocks to extract spectral and spatial features separately and employs residual connections to mitigate network degradation and improve HSI classification accuracy [29]. In [28], a depthwise separable residual neural network for HSI classification was proposed, which embeds the ResNet into a maximum a posteriori framework and leverages the conditional random field model to preserve class boundaries and edges. To reduce the large number of parameters and the feature redundancy, Paoletti et al. proposed a densely connected CNN for HSI classification, which enhances the generalization ability of the network and improves its feature extraction capability [30]. In order to preserve the original spatial-spectral information in HSIs, Yang et al. proposed a dual-channel network based on DenseNet to extract spectral and spatial features separately [31]. Zhang et al. proposed a multi-scale dense network (MSDN) for HSI classification, using dense networks in the horizontal dimension to extract deep features [32].
The aforementioned 3D-CNN based HSI classification methods all use single-scale convolution for spectral-spatial feature extraction, so the size of the receptive field is fixed. It should be noted that relatively large receptive fields readily produce global and semantic features, while relatively small receptive fields readily encode shape and geometric information. In order to obtain receptive fields of different sizes, researchers have developed deep methods with multiple layers to extract multi-scale fusion features, which have shown powerful feature representation ability for HSI classification [33]. In [34], a multi-scale 3D deep convolutional neural network was proposed, which could jointly learn both the 2D multi-scale spatial features and the 1D spectral features of HSI data. Subsequently, other forms of multi-scale convolutional neural networks were proposed, including the multi-scale densely 3D CNN [35] and the faster multi-scale capsule networks [36].
Nevertheless, the existing multi-scale feature extraction networks represent multi-scale features by adding parallel convolutions, which causes feature redundancy and increases the number of parameters. Recently, Zhang et al. built on the Res2Net [37], which improves the multi-scale representation ability at a more granular level rather than enhancing the layer-wise multi-scale representation strength of CNNs, and proposed to consecutively extract the spectral and spatial features of HSIs together with an effective hinge cross-entropy loss function for HSI classification [38]. However, the network only uses 3D CNNs for feature extraction, which leads to feature redundancy. In [39], an advanced capsule network named RS-CapsNet was proposed for HSI classification, which uses Res2Net modules and small convolutional kernels to mitigate the large number of parameters of CapsNet and adopts the squeeze-and-excitation (SE) block to extract channel features. However, RS-CapsNet adds a convolution layer with a small single-scale convolutional kernel after each Res2Net module, which reduces the receptive field, and the SE module only focuses on spectral features, which does not facilitate the extraction of discriminative features.
In addition, the attention mechanism, which can dynamically highlight salient features, has recently been introduced into the feature representation of HSIs to suppress clutter and characterize feature interactions. For instance, to suppress the influence of disturbing edge pixels, a spectral and spatial attention module was proposed and embedded in the feature extraction module [40]. To encode long-range spatial dependencies, Pande et al. proposed a hybrid attention composed of 1D and 2D CNNs, which serve as attention masks for enhancing the spectral and spatial characteristics of HSIs [41]. In [42], different from the existing attention strategies, a spectral-spatial connected attention mechanism was proposed to process the raw HSI, which explores the diversity of spectral bands and the spatial relationship between neighboring pixels. Hang et al. proposed a spectral attention subnetwork and a spatial attention subnetwork to enhance the learning capacity of CNNs for HSI classification [43]. Similar to [42], Zhu et al. proposed to directly process the raw HSI by spectral and spatial attention modules successively [44]. The spectral-spatial attention modules are also used for the subsequent feature refinement and training acceleration.
Although great progress has been achieved in HSI classification by deep learning based methods, some problems remain unsolved. Existing attention based HSI classification methods consider the optimization of either the channel dimension or the spatial-spectral dimension alone, and thus have not fully exploited the information of the spatial, spectral and channel dimensions together. In existing CNN based attention models, the receptive field size of the artificial neurons in each layer is the same. In neuroscience, however, the receptive field size of visual cortical neurons depends on the stimulation, which is rarely considered in the construction of convolutional networks [45].
To solve the aforementioned problems, in this paper we propose a Res2Net with spectral-spatial and channel attention (SSCAR2N) for HSI classification. To effectively extract the multi-scale features of hyperspectral images at a fine-grained level, Res2Net is employed, which replaces the traditional 3 × 3 filters with a set of smaller filter groups, each with several channels, and also guarantees a faster calculation speed. To further optimize the features obtained by the Res2Net from the spectral-spatial and channel dimensions simultaneously, we propose a threefold attention block, i.e., a spectral-spatial and channel attention block, in which a dynamic neuron selection mechanism is designed that allows each neuron to adaptively adjust the size of its receptive field based on multiple scales of the input information. Specifically, in the channel attention block, we design a multi-scale selective kernel convolution to adaptively weight the neurons and obtain a global, aggregated feature representation. In the spectral-spatial attention block, multi-scale 3D asymmetric convolutions are utilized to enhance the center backbone position weights and to solve the problem of unequal feature learning. Then, to alleviate the degradation and gradient vanishing problems and to enhance the generalization ability of the network, a residual connection is constructed that adds the optimized feature maps to the original feature map. To ensure the effectiveness of the features, the network repeats the above operations. To the best of our knowledge, our study is the first to explore the attention mechanism jointly from the spectral-spatial and channel dimensions for HSI classification.
The contributions of this paper can be summarized as follows:
1) Different from the existing architectures for HSI classification that involve Res2Net, the proposed network optimizes the feature maps of Res2Net with a novel spectral-spatial and channel attention (SSCA) block. A residual structure is built on the optimized features and the original features, which enhances the generalization ability of the network, and the feature extraction and optimization are repeated to ensure feature effectiveness.
2) To calibrate the features from the spectral, spatial and channel dimensions simultaneously and adaptively, a spectral-spatial and channel attention block is designed. Inspired by neuroscience, a dynamic neuron selection mechanism is developed that allows each neuron to adaptively adjust the size of its receptive field based on multiple scales of the input information.
3) In the channel attention block, we design a multi-scale selective kernel convolution to adaptively weight the neurons and obtain a global, aggregated feature representation. In the spectral-spatial attention block, multi-scale 3D asymmetric convolutions are utilized to enhance the center backbone position weights and to solve the problem of unequal feature learning.
The rest of this paper is laid out as follows. The innovative Res2Net with spectral-spatial and channel attention (SSCAR2N) approach is described in detail in Section 2. The experimental results are presented and discussed in Section 3. This study is finally summarized in Section 4.
The proposed method
In this section, the proposed SSCAR2N for HSI classification is described in detail, including the backbone structure of our network, the principle and implementation of the Res2Net block and threefold attention block.
The proposed architecture of SSCAR2N for HSI classification
In the proposed SSCAR2N, as shown in Fig. 1, the initial HSI data are first reduced in dimensionality by principal component analysis (PCA) along the spectral dimension, which makes the spectral feature extraction more efficient without affecting the spatial dimension. To extract the correlated spectral and spatial information, the input of the network is a 3D HSI cube consisting of a center pixel and its adjacent pixels in a square area selected from the HSI (using mirror fill for edge pixels).
Flowchart of the proposed SSCAR2N for HSI classification.
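The preprocessing described above (PCA along the spectral dimension, followed by a mirror-padded square neighborhood around each pixel) can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the function names `pca_reduce` and `extract_patch` are hypothetical, PCA is computed through an SVD of the centered pixel spectra, and NumPy's `reflect` padding mode is assumed as one plausible reading of "mirror fill". For an even patch size such as 12, the center pixel cannot sit exactly in the middle, so the sketch places it at index `size // 2`, which is also an assumption.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Project the spectral axis of an (H, W, B) cube onto its top principal components."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(np.float64)   # one row per pixel spectrum (copy)
    flat -= flat.mean(axis=0)                       # center every band
    # The right singular vectors of the centered data are the principal axes,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return (flat @ vt[:n_components].T).reshape(h, w, n_components)

def extract_patch(cube, row, col, size):
    """Return the size x size neighborhood of pixel (row, col), mirror-padding the borders."""
    before = size // 2            # pixels taken above / left of the center
    after = size - 1 - before     # pixels taken below / right of the center
    padded = np.pad(cube, ((before, after), (before, after), (0, 0)), mode="reflect")
    # In padded coordinates the center pixel sits at (row + before, col + before).
    return padded[row:row + size, col:col + size, :]
```

With the settings reported later in the paper (12 retained components and a 12 × 12 window), `extract_patch(pca_reduce(hsi, 12), r, c, 12)` would yield one 12 × 12 × 12 input cube per labeled pixel, where `hsi` stands for the raw image array.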
In order to effectively exploit the multi-scale features of HSI at a more granular level, the Res2Net module is adopted, which is in contrast to the existing methods that explore the multi-scale features by layer-wise operations. The architecture of the Res2Net block is shown in Fig. 2.

The architecture of the Res2Net block.
Suppose the input feature map of the Res2Net block has c channels, i.e.
In this block, the division and concatenation strategies allow the network to fit the residual and thus to process the features by convolution more effectively. Omitting the convolution on the first division reduces the number of parameters and can also be regarded as a form of feature reuse. The Res2Net block is the key part of our network: it provides strong multi-scale feature extraction capability with a low computational complexity similar to that of a single-scale block.
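The division and concatenation strategy can be illustrated with a short NumPy sketch. It follows the hierarchical connection pattern of the original Res2Net design [37] (the first group is passed through unchanged, and every later group is convolved after adding the output of the previous group) and should be read as an assumption about how the block is wired here, not as the authors' code. The name `res2net_split` and the `conv` callable, which stands in for the small per-group convolution, are hypothetical, and the 1 × 1 convolutions and the outer residual connection of a full block are omitted.

```python
import numpy as np

def res2net_split(x, scale, conv):
    """Hierarchical residual-like connections inside one Res2Net block.

    x     : feature map with the channels on the last axis
    scale : number of channel groups (must divide the channel count)
    conv  : callable(group_index, features) -> features of the same shape,
            a placeholder for the small convolution applied to each group
    """
    groups = np.split(x, scale, axis=-1)
    outputs = [groups[0]]          # first group: no convolution, reused as-is
    prev = None
    for i in range(1, scale):
        # From the third group on, the previous group's output is added first,
        # so later groups see progressively larger receptive fields.
        inp = groups[i] if prev is None else groups[i] + prev
        prev = conv(i, inp)
        outputs.append(prev)
    return np.concatenate(outputs, axis=-1)
```

Because each group's output feeds the next, the concatenated result mixes features that have passed through zero, one, two, or more small convolutions, which is where the multi-scale behavior within a single block comes from.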
To further recalibrate the features extracted by the Res2Net block, we mine the interdependence among the spatial, spectral, and channel dimensions of the feature map and optimize the spatial, spectral, and channel features simultaneously and adaptively, since hyperspectral images have abundant spectral and spatial information. To this end, a spectral-spatial and channel attention (SSCA) module is proposed according to the characteristics of HSI data. SSCA consists of two parallel attention subblocks, the spatial-spectral attention subblock and the channel attention subblock, each of which uses a multi-branch adaptive selection mechanism to adaptively weight the neurons and obtain a global, comprehensive selection weight representation. Fig. 3 depicts the structure of the spectral-spatial and channel attention mechanism block.

The structure of the 3D asymmetric convolutions attention mechanism block.
To improve the classification performance of our network, in the spatial-spectral attention subblock we leverage asymmetric convolution [48] to enhance the representational ability of a standard square kernel. Specifically, we construct three parallel layers with 3 × 1 × 1, 1 × 3 × 1 and 3 × 3 × 3 kernels, among which the 3 × 1 × 1 and 1 × 3 × 1 kernels are non-square and are referred to as the asymmetric convolutional layers. The asymmetric convolutional layers enhance the learned center backbone position weights through horizontal and vertical kernels, which are compatible with square kernels and incur no extra computation time. Then, to avoid the influence of the channel dimension on the spatial-spectral features, we use a convolution with a kernel size of 1 × 1 × 1, which also reduces redundancy. Finally, the outputs of the three parallel layers are summed, and a sigmoid operator is used for spatial-spectral-wise activation. The designed spatial-spectral attention subblock extracts features effectively and helps to enhance the spatial features of HSIs.
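A rough NumPy sketch of this subblock is given below to make the data flow concrete. Several details are assumptions rather than statements from the paper: each branch applies one shared kernel to every channel, the 1 × 1 × 1 convolution is modeled as a weighted sum over channels (`w_mix`), zero padding is used so that the output keeps the input size, and the sigmoid mask is multiplied with the input to obtain the recalibrated feature map. The names `conv3d_same`, `branch` and `spatial_spectral_attention` are hypothetical, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d_same(volume, kernel):
    """Zero-padded 'same' 3D cross-correlation of one channel (odd kernel sizes only)."""
    pads = [(k // 2, k // 2) for k in kernel.shape]
    windows = sliding_window_view(np.pad(volume, pads), kernel.shape)
    return np.einsum("ijkabc,abc->ijk", windows, kernel)

def branch(x, kernel, w_mix):
    """Apply one kernel to every channel, then mix the channels with a 1x1x1 convolution."""
    per_channel = np.stack(
        [conv3d_same(x[..., ch], kernel) for ch in range(x.shape[-1])], axis=-1
    )
    return per_channel @ w_mix       # (d, d, band, c) -> (d, d, band)

def spatial_spectral_attention(x, w_mix, k_311, k_131, k_333):
    """x: (d, d, band, c) feature map; returns (recalibrated features, attention mask)."""
    # Sum of the two asymmetric branches and the square branch.
    logits = sum(branch(x, k, w_mix) for k in (k_311, k_131, k_333))
    mask = 1.0 / (1.0 + np.exp(-logits))   # sigmoid at every spatial-spectral position
    return x * mask[..., None], mask       # the same mask is shared by all channels
```

The two asymmetric kernels overlap the 3 × 3 × 3 kernel only along its central row and column, so summing the three branches gives those positions extra weight, which is one way to read the "center backbone" enhancement described above.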
In the channel attention subblock, to extract multi-scale convolutional features, two parallel convolution operations with 3 × 3 × 3 and 5 × 5 × 5 kernels and activation operations are applied to the input feature map x0 of size d × d × band × c. Then we obtain u1 and u2,
Finally, the spatial-spectral attention enhanced feature map U and the channel attention enhanced feature map V are fused with each other by BN operation to obtain the final output x1 of the threefold attention block.
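For the channel attention subblock, the selection between the two multi-scale branches can be sketched as below. The sketch assumes a selective-kernel style fusion (sum the branches, pool globally, pass the result through two fully connected layers, and apply a softmax across branches for each channel); the text available here does not spell out these steps, so this is an illustration of the general technique rather than the authors' exact design. The function name `selective_kernel_fuse` and the weight matrices `w_reduce`, `w_expand1` and `w_expand2` are hypothetical.

```python
import numpy as np

def selective_kernel_fuse(u1, u2, w_reduce, w_expand1, w_expand2):
    """Fuse two branch outputs of shape (d, d, band, c) by per-channel soft selection.

    w_reduce             : (c, c // 2) weights of the channel-reducing FC layer
    w_expand1, w_expand2 : (c // 2, c) weights restoring c scores for each branch
    """
    s = (u1 + u2).mean(axis=(0, 1, 2))            # global average pooling -> (c,)
    z = np.maximum(s @ w_reduce, 0.0)             # FC + ReLU, compact descriptor
    logits = np.stack([z @ w_expand1, z @ w_expand2])    # (2, c)
    logits -= logits.max(axis=0, keepdims=True)          # for numerical stability
    weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    # For every channel the two weights sum to one, so each channel chooses
    # softly between the small-kernel and the large-kernel branch.
    return weights[0] * u1 + weights[1] * u2, weights
```

Under this reading, the "dynamic neuron selection" amounts to the softmax weights: a channel whose weight leans toward the 5 × 5 × 5 branch effectively uses a larger receptive field than one that leans toward the 3 × 3 × 3 branch.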
To comprehensively evaluate the performance of the proposed SSCAR2N for HSI classification, extensive quantitative and qualitative experiments are conducted on three standard HSI benchmark data sets, namely Indian Pines, Pavia University and Salinas Valley, in comparison with five state-of-the-art networks: 2DCNNs [15], 3DCNNs [49], ResNet [29], MSDN-SA [50], and Res2Net. Four metrics, i.e., class accuracy, overall accuracy (OA), average accuracy (AA), and kappa coefficient (κ), are used to evaluate our approach. OA is the ratio of the number of correctly predicted samples to the total number of samples. AA is the average, over all classes, of the ratio of the number of correctly predicted samples in a class to the total number of samples in that class. The kappa coefficient (κ) measures the agreement between the classification result and the ground truth while accounting for the agreement expected by chance. The networks are implemented in TensorFlow 2.1.0 with Python 3.7, and all experiments are executed on an i7-8700 CPU and an NVIDIA GTX1060 GPU.
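The four metrics defined above can be computed from a confusion matrix, as in the following NumPy sketch. The function name `classification_metrics` is hypothetical, labels are assumed to be integers from 0 to `n_classes - 1`, and every class is assumed to occur at least once in the ground truth so that the per-class accuracy is defined. The kappa coefficient uses the standard Cohen formulation, κ = (p_o − p_e) / (1 − p_e), where p_o is the OA and p_e is the agreement expected by chance.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Return (per-class accuracy, OA, AA, kappa) for integer class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    conf = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(conf, (y_true, y_pred), 1)           # rows: ground truth, columns: prediction
    total = conf.sum()
    per_class = np.diag(conf) / conf.sum(axis=1)   # correct samples / samples in each class
    oa = np.trace(conf) / total                    # correct samples / all samples
    aa = per_class.mean()                          # unweighted mean over classes
    # Chance agreement: product of the row and column marginals, summed over classes.
    p_e = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - p_e) / (1.0 - p_e)
    return per_class, oa, aa, kappa
```

Because AA weights every class equally, it is more sensitive than OA to errors on small classes, which matters for unbalanced data sets such as Indian Pines.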
Experimental data sets
False color composite, ground reference map, and number of available samples of the Indian Pines data set.
False color composite, ground reference map, and number of available samples of the Pavia University data set.
False color composite, ground reference map, and number of available samples of the Salinas data set.



The compared methods in our experiments are as follows:
2DCNNs [15]: A deep convolutional neural network employed to classify hyperspectral images directly in the spatial domain.
3DCNNs [49]: A 3D CNN method for HSI classification. CNNs are used to encode the spectral-spatial information of pixels, and an MLP is used to perform the classification task.
ResNet [29]: A residual spectral-spatial 3D deep learning network that successfully mitigates over-fitting.
MSDN-SA [50]: A densely connected framework for 3D CNNs with a spectral-wise attention mechanism.
Res2Net: Our proposed method without the spectral-spatial and channel attention mechanism block.
For all three data sets mentioned in Section 3.1, 50 samples per class are randomly selected for training; if a class contains fewer than 50 samples, half of its samples are selected for training instead. The remaining samples in each class are regarded as unlabeled examples during the training process and are used as the test set to evaluate the classification performance.
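The sampling rule above can be written down in a few lines. The sketch below is illustrative only: the function name `split_per_class` is hypothetical, `labels` is assumed to be a 1D integer array that contains only the labeled pixels (background already removed), and rounding down for classes with an odd number of samples is an assumption, since the text does not state the rounding.

```python
import numpy as np

def split_per_class(labels, n_train=50, seed=0):
    """Pick n_train random samples per class, or half of the class if it has fewer than n_train."""
    rng = np.random.default_rng(seed)
    train_parts = []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))   # shuffled indices of this class
        k = n_train if idx.size >= n_train else idx.size // 2
        train_parts.append(idx[:k])
    train_idx = np.sort(np.concatenate(train_parts))
    # Everything that was not drawn for training is kept for testing.
    test_idx = np.setdiff1d(np.arange(labels.size), train_idx)
    return train_idx, test_idx
```

For a small class such as Oats in Indian Pines, which has 20 labeled samples, this rule would give 10 training and 10 test samples. Changing `seed` between runs reproduces the run-to-run variation that repeated trials average over.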
The dimension of the 3D cube is fixed at 12 × 12 × 12 in all experiments, with the spectral bands reduced to 12 using PCA. The number of convolution kernels in the first layer is set to 32, and the ‘valid’ convolution mode is adopted. In the first Res2Net block, each convolution has a quarter of the number of kernels of the first convolution and uses the ‘same’ mode. In the first channel attention module, the number of channels in the convolution process is 32. We use two fully connected layers in the channel attention module to first reduce the number of channels to half and then restore the original size. In the spatial-spectral attention module, the number of feature maps remains unchanged, and the number of convolution kernels equals the number of input channels. The final fully connected layers consist of 512 and 256 neurons, sigmoid is used as the activation function, and the dropout rate is 0.5.
We utilize Adam as the network optimizer in this experiment. The learning rate is set to 1e-3 so that the network converges quickly. The number of training epochs, i.e., the number of passes over the training set, is 35, and the batch size is 128. Each iteration updates the parameters of the whole network. Because the training samples are selected at random, each run produces different results. We therefore repeat the experiment five times on each of the three data sets to ensure reliable results, and report the mean and standard deviation of the accuracy of each class for the six methods, as well as the overall accuracy (OA), average accuracy (AA), and kappa coefficient (κ) for each data set.
Experimental results and discussion
1) Indian Pines: Table 1 shows the quantitative comparison of classification accuracy on the Indian Pines data set. The sample distribution of this data set is unbalanced: some categories, such as Corn-notill, have 1428 samples, while Oats has only 20 samples. Moreover, its spatial scale is only 145 × 145, which is the smallest among the three data sets. These properties make it a challenging data set. Table 1 suggests that the standard 3D CNN performs significantly better than the 2D CNN. The superiority of the ResNet and MSDN-SA methods demonstrates that the spatial-spectral features of the residual 3D CNN are more powerful than those of the standard 3D CNN. The results of the proposed SSCAR2N in comparison with the ResNet, MSDN-SA and Res2Net methods also show that SSCA is effective. Among the six methods, the proposed SSCAR2N achieves the highest OA, AA, and κ, which are 94.13%, 94.54%, and 93.27% respectively. For a visual comparison, the classification maps generated by the proposed method and the other methods are shown in Fig. 7. From Fig. 7, we can see that the classification map produced by SSCAR2N is the closest to the ground truth map, and many of its regions are obviously less noisy than those of the CNN, 3D CNN, ResNet, MSDN-SA and Res2Net.
Classification accuracy comparison for Indian Pines data set

HSI classification maps of Indian Pines. (a) False-color image. (b) Ground truth map. (c) CNN (62.39%). (d) 3DCNN (88.04%). (e) ResNet (91.14%). (f) MSDN-SA (91.46%). (g) Res2Net (92.30%). (h) SSCAR2N (94.13%).
2) Pavia University: The classification accuracies on the Pavia University data set are reported in Table 2, from which we can see that the proposed SSCAR2N achieves the best OA, AA, and κ, which are 95.94%, 94.73% and 94.60% respectively. For individual classes, such as Bricks, Trees, and Asphalt, the results of SSCAR2N are far superior to those of the other methods. The OA, AA, and κ all improve compared with the Res2Net approach without the attention module, indicating the usefulness of the proposed threefold attention block. For a qualitative comparison, the classification maps of the different methods are illustrated in Fig. 8. According to Fig. 8, the classification map of the proposed method is the most similar to the ground truth map. In our classification map, for example, the yellow region in the middle of the Pavia University data set, which represents the class ‘Baresoil’, is pure yellow with no other colors, indicating that there are no misclassified pixels.
Classification accuracy comparison for Pavia University data set

Classification maps provided for the Pavia University data set by different methods. (a) A false color map. (b) The ground truth map. (c) CNN (82.36%). (d) 3DCNN (90.67%). (e) ResNet (91.93%). (f) MSDN-SA (93.86%). (g) Res2Net (93.01%). (h) SSCAR2N (95.94%).
3) Salinas Valley: Table 3 and Fig. 9 show the classification accuracies and the classification maps of all approaches for the Salinas Valley data set, respectively. As can be observed from Table 3, the proposed SSCAR2N approach produces the best OA, AA, and κ, which are 94.10%, 97.06% and 93.43% respectively. In this data set, ‘Grapes-untrained’ and ‘Vinyard-untrained’ are difficult to distinguish. Our method reaches classification accuracies of 84.97% and 82.27% on these two classes, which are better than those of the other methods. For the visual results, Fig. 9 shows that the orange and dark blue regions are easily confused, but the classification map of our method is closer to the ground truth map.
Classification results for the Salinas data set

Classification maps provided for the Salinas data set by different methods. (a) A false color map. (b) The ground truth map. (c) CNN (85.74%). (d) 3DCNN (89.30%). (e) ResNet (89.88%). (f) MSDN-SA (90.82%). (g) Res2Net (92.24%). (h) SSCAR2N (94.10%).
4) Parameters Analysis: To further analyze the effect of the parameters in our deep learning framework, more experiments are carried out. In the first set of experiments, different numbers of spectral bands are retained from the three data sets via PCA while the spatial size is fixed, in order to explore the influence of the PCA dimension. The corresponding overall classification accuracies (OA) on the three data sets are provided in Table 4. From the table, we can see that a larger spectral dimension contains more features, which is beneficial to network training. However, because a larger spectral dimension also retains more noise, the spectral dimension needs to be reduced appropriately.
OA (%) of the proposed method with different numbers of PCA bands
To see the effect of the spatial size of the input image patches, another set of experiments is conducted on the three data sets to compare the classification accuracy for different spatial sizes, while the spectral dimension is fixed to 12. The overall classification accuracies (OA) are provided in Table 5. From the table, we can observe that the 12 × 12 and 16 × 16 sizes perform better. This is because large image patches introduce excessive noise, while small image patches may not include enough information. Therefore, based on these experiments, the 12 × 12 spatial size is adopted in our experiments.
The OA (%) of the proposed method with different spatial sizes
We also conducted further experiments to explore how the final model is affected by the number of samples in the training set. We compared the five methods while varying the proportion of training samples in the total data set. For the Indian Pines data set, 4% to 19% of the samples are taken as the training set, and for the Pavia University and Salinas data sets, 1% to 6% of the samples are taken as the training set. The resulting OA values are shown in Figs. 10–12. The results show that, for all five methods, the OA improves as the proportion of training samples increases. Compared with the other state-of-the-art methods, our method achieves better OA results.

OA (%) of different training set sizes (a) Indian Pines data set.

OA (%) of different training set sizes (b) Pavia University data set.

OA (%) of different training set sizes (c) Salinas data set.
Hyperspectral images face the dilemma of a high spectral dimensionality versus a limited number of samples, so a discriminative feature representation is of great importance for HSI classification. However, traditional methods extract the multi-scale features of HSI data with layer-wise multi-scale architectures, which explore multi-scale features at different resolutions and consume a lot of computation. Moreover, existing attention methods have not considered optimizing the extracted features from the spectral, spatial and channel aspects simultaneously. Therefore, in this paper, a novel deep learning framework consisting of Res2Net and a threefold attention mechanism is proposed for HSI classification. The proposed architecture explores the multi-scale features of HSI data at a fine-grained level with the Res2Net, which is orthogonal to the traditional methods. To further enhance the feature representation, a threefold attention block is designed to select the visually discriminative and important features from the spectral-spatial and channel dimensions adaptively and simultaneously. Experiments on three benchmark hyperspectral data sets have shown the superiority of the proposed approach in comparison with state-of-the-art deep methods.
The advantage of this approach is that the designed threefold attention block can solve the problem of unequal feature learning and adaptively optimize the multi-scale features from the spectral-spatial and channel dimensions simultaneously, with the neurons being dynamically and adaptively weighted. The proposed backbone is general and can be extended to other deep learning based feature engineering applications. A potential limitation of our approach is the computational cost of the multiple 3D convolution layers adopted in the architecture. As future work, we will investigate new strategies to improve our threefold attention mechanism and the computational efficiency of the architecture.
Acknowledgments
The authors would like to thank the anonymous referees for their constructive comments which have helped improve the paper. The research is supported by the National Natural Science Foundation of China (Nos. 61860206004, 72071001, U20B2068), Natural Science Foundation of Anhui Province (Nos. 2008085MG226, 2008085QG334), Natural Science Foundation for the Higher Education Institutions of Anhui Province (No. KJ2021A0038).
