Abstract
BACKGROUND:
The accurate classification of pulmonary nodules has great application value in assisting doctors in diagnosing conditions and meeting clinical needs. However, the complexity and heterogeneity of pulmonary nodules make it difficult to extract valuable characteristics of pulmonary nodules, so it is still challenging to achieve high-accuracy classification of pulmonary nodules.
OBJECTIVE:
In this paper, we propose a local-global hybrid network (LGHNet) to jointly model local and global information to improve the classification ability of benign and malignant pulmonary nodules.
METHODS:
First, we introduce the multi-scale local (MSL) block, which splits the input tensor into multiple channel groups, utilizing dilated convolutions with different dilation rates and efficient channel attention to extract fine-grained local information at different scales. Secondly, we design the hybrid attention (HA) block to capture long-range dependencies in spatial and channel dimensions to enhance the representation of global features.
RESULTS:
Experiments are carried out on the publicly available LIDC-IDRI and LUNGx datasets. On the LIDC-IDRI dataset, the accuracy, sensitivity, precision, specificity, and area under the curve (AUC) are 94.42%, 94.25%, 93.05%, 92.87%, and 97.26%, respectively. On the LUNGx dataset, the AUC is 79.26%.
CONCLUSION:
The above classification results are superior to the state-of-the-art methods, indicating that the network has better classification performance and generalization ability.
Introduction
According to the "Global Cancer Statistics 2020", lung cancer accounts for 11.4% of new cancer cases worldwide, ranking second among all cancers. It causes 1.8 million deaths worldwide, accounting for 18% of all cancer deaths and ranking first among all cancers [1]. In China in particular, lung cancer incidence and mortality rank first among all malignant tumors [2]. Lung cancer has become the leading cause of cancer-related deaths, posing a significant threat to human health. Relevant studies show that early lung cancer screening and treatment are crucial for improving patient survival rates and disease prognosis [3]. Low-dose computed tomography (LDCT) has emerged as an essential tool for early lung cancer screening due to its non-invasive nature, rapid imaging capabilities, and minimal radiation exposure. Early lung cancer diagnosis is primarily based on the analysis of lung nodules in CT images [4]. A pulmonary nodule is an abnormal area of the lung measuring less than 3 cm that appears denser than the surrounding tissue. While most pulmonary nodules are benign (non-cancerous), those with characteristics such as increased solidity, larger size, and irregular borders may have the potential to develop into malignant nodules and lead to lung cancer. In clinical practice, radiologists need to interpret many lung CT images daily, relying on their personal experience to diagnose lung nodules. However, this process is not only highly subjective but also time-consuming, and it is prone to misdiagnosis or missed diagnosis. To improve radiologists’ diagnostic efficiency and accuracy, computer-aided diagnosis (CAD) systems have been developed for early lung cancer screening [5]. The classification of benign and malignant pulmonary nodules is the last step in diagnosing pulmonary nodules and one of the critical components of a CAD system.
Currently, two primary methods are employed for pulmonary nodule identification: handcraft-based methods and convolutional neural network (CNN)-based methods. The first method [6–10] involves extracting features such as size, shape, density, and texture from pulmonary nodules using various feature descriptors like the Gray-level Co-occurrence Matrix (GLCM) [11], Histogram of Oriented Gradient (HOG) [12], and Local Binary Pattern (LBP) [13]. Subsequently, it selects the most relevant features based on medical knowledge and experience and feeds these extracted nodule features into different machine learning algorithms [14–16] for classification. While handcraft-based methods have been widely employed in the past and have yielded favorable results in some cases, this approach heavily relies on individual domain knowledge and experience, making it susceptible to subjectivity. Additionally, handcrafted features may fail to capture certain complex structural or textural information, potentially limiting classifier performance.
In recent years, with the development of deep learning technology, CNNs have become the mainstream framework for classifying benign and malignant pulmonary nodules. Compared with handcraft-based methods, CNNs learn features and classification rules from original image data through end-to-end learning, which can automatically extract richer features from images. To enable the network to learn pulmonary nodule features at different scales, Shen et al. [17] and Xu et al. [18] cropped regions of interest in three different sizes from the original CT image and used them to train three networks that extract multi-scale and diverse pulmonary nodule features. Moreover, Sakshiwala et al. [19] proposed a new multi-scale (64 × 64, 32 × 32 and 16 × 16) CNN architecture and initialized the weights of the multi-scale architecture using transfer learning, achieving a classification accuracy of 93.88%. To fully utilize the spatial information of nodules, Xie et al. [20] further extracted CT image slices from nine fixed view angles of 3D pulmonary nodule CT images as input and combined multiple submodels based on knowledge collaboration to classify pulmonary nodules, obtaining richer nodule feature information. El-Regaily et al. [21] designed a multi-view CNN by extracting CT image slices from the axial, coronal, and sagittal planes of 3D pulmonary nodule CT images as input images. Wu et al. [22] proposed a multi-scale multi-view model based on ensemble attention for capturing more comprehensive discriminative representations of nodules. Although the above multi-scale or multi-view methods achieve good classification performance, they require building multiple parallel networks for training, increasing the network’s complexity. The attention mechanism can enable networks to emphasize regions of interest while filtering out irrelevant information, making the network’s inference process more interpretable. This mechanism has been widely applied in the field of pulmonary nodule classification.
Jiang et al. [23] constructed a 3D dual-path network that combines contextual attention and spatial attention mechanisms to improve the representation of deep features and the robustness of predictions. Meanwhile, a multi-level feature fusion network with ResNeXt as the backbone and embedded channel attention mechanism was also used to classify benign and malignant pulmonary nodules [24].
CNNs can capture features through the convolution operation with a local receptive field, but they cannot directly model global information. To address this limitation, Wu et al. [25] and Sun et al. [26] used Swin Transformer [27], which can extract global information for pulmonary nodule classification. Since input pulmonary CT images contain both local and global features, relying solely on CNNs or transformers may result in the loss of valuable information and a reduction in network performance [28]. Given the complex morphology and varying sizes of pulmonary nodules, it is not only necessary to model local information to describe detailed features such as texture and edges of pulmonary nodules but also to capture global features to describe the shape or structure of pulmonary nodules [29]. This approach aligns with the clinical practice of doctors in diagnosing benign and malignant pulmonary nodules. Therefore, accurately capturing the local and global features of pulmonary nodules is crucial for the benign and malignant diagnosis of pulmonary nodules. In this paper, we construct an MSL block for extracting local spatial features at different scales to better represent the detailed features of pulmonary nodules. Additionally, an HA block can capture key information about the image globally, achieving better adaptability in both spatial and channel dimensions. By integrating local and global information, the classification performance of the network for pulmonary nodules is improved.
Methods
Overall architecture
To effectively capture the local and global features of pulmonary nodules, we propose a pulmonary nodule classification network with an architecture designed using a standard "four-stage" pyramid paradigm, as illustrated in Fig. 1. In this paper, we utilize 32 × 32 CT images of pulmonary nodules as input for LGHNet. The input image is initially processed through a stem module, comprising a 3 × 3 convolutional layer and layer normalization. This process yields feature maps with a resolution of 32 × 32 and 16 channels. Subsequently, the output feature maps enter the first stage of the network, which includes two Local blocks for extracting low-level semantic information. The structure of the Local block is similar to that of the ConvNeXt block [30]: it uses a 3 × 3 depth-wise convolution, with the number of groups equal to the number of channels, and two 1 × 1 convolutions for channel mixing. Layer Normalization (LN) [31] and the Gaussian Error Linear Unit (GELU) [32] are also incorporated into the Local block. The network’s second, third, and fourth stages have the same configuration, consisting of an MSL block and an HA block. We also design a downsampling module composed of a max pooling layer and a 1 × 1 convolutional layer. The max pooling layer employs a 2 × 2 kernel with a stride of 2, which reduces the spatial dimension of the feature map while preserving the most important features. The 1 × 1 convolutional layer increases the number of channels. This combination effectively enhances the network’s representation ability. After the input images are encoded with semantic information across the four stages, 4 × 4 × 128 feature maps are generated. Finally, the category probability is obtained through a sequence of global average pooling, a fully connected layer, and Softmax activation.
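The resolution and channel counts stated above (32 × 32 × 16 after the stem, 4 × 4 × 128 after the fourth stage) can be traced with a short sketch. It assumes that one downsampling module sits before each of stages two to four and that its 1 × 1 convolution doubles the channel count; the text gives only the first and last shapes, so the intermediate values are an inference, and the function name `trace_shapes` is hypothetical.

```python
def trace_shapes(size=32, channels=16, num_stages=4):
    """Return (stage, height, width, channels) for the output of each stage."""
    shapes = []
    for stage in range(1, num_stages + 1):
        if stage > 1:
            # Downsampling module: 2 x 2 max pooling with stride 2 halves the
            # resolution. Assumption: the 1 x 1 convolution doubles the channels.
            size //= 2
            channels *= 2
        shapes.append((stage, size, size, channels))
    return shapes


for stage, height, width, channels in trace_shapes():
    print(f"stage {stage}: {height} x {width} x {channels}")
```

Under these assumptions, three halvings and three doublings connect the stem output to the 4 × 4 × 128 feature maps reported in the text.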

The overall architecture of LGHNet for lung nodule classification. The detailed structures of MSL block and HA block are described in Sections 2.2 and 2.3.
The MSL block encodes multi-scale local information, which enhances the network’s ability to perceive different scales and details and helps it process pulmonary nodules of different sizes and complexities. Its structure is shown in Fig. 2. The MSL block has three primary components: channel splitting, dilated convolution, and efficient channel attention (ECA).

The diagram of our proposed MSL block.

Schematic diagram of dilated convolution with different dilation rates based on 3 × 3 convolution. (a) A 3 × 3 convolution. (b) A 3 × 3 convolution with a dilation rate of 2. (c) A 3 × 3 convolution with a dilation rate of 3.
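The growth in coverage across panels (a) to (c) follows the standard relation for dilated convolution, k_eff = k + (k − 1)(d − 1), where k is the kernel size and d the dilation rate. The sketch below evaluates it for the three cases in the figure, together with the zero padding that keeps the output resolution unchanged at stride 1. The function names are illustrative and not taken from the paper.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Side length of the region covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size: int, dilation: int) -> int:
    """Zero padding per side that preserves resolution (stride 1, odd kernel)."""
    return dilation * (kernel_size - 1) // 2


for rate in (1, 2, 3):
    covered = effective_kernel_size(3, rate)
    print(f"dilation {rate}: covers {covered} x {covered}, padding {same_padding(3, rate)}")
```

A 3 × 3 kernel therefore covers 3 × 3, 5 × 5, and 7 × 7 regions at dilation rates 1, 2, and 3 while keeping nine weights per channel.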

The structure of ECA, taking a one-dimensional convolution with a kernel size of 3 (K = 3) as an example.
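As a rough illustration of the ECA structure in the caption above, the following sketch applies the usual three steps: global average pooling per channel, a one-dimensional convolution across neighbouring channels (K = 3), and a sigmoid gate that rescales each channel. It is a plain-Python sketch for clarity rather than the authors' implementation; in a trained network the kernel weights are learned, whereas the fixed kernel passed in here is a placeholder.

```python
import math


def eca(feature_map, kernel):
    """Rescale channels with efficient channel attention.

    feature_map: list of C channels, each a list of rows of floats.
    kernel: odd-length list of 1D convolution weights (placeholder values).
    Returns (rescaled feature map, per-channel attention weights).
    """
    # Global average pooling: one scalar descriptor per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # 1D convolution across neighbouring channels, zero padded at both ends.
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    mixed = [sum(w * padded[i + j] for j, w in enumerate(kernel)) for i in range(len(pooled))]
    # Sigmoid gate gives an attention weight in (0, 1) for every channel.
    weights = [1.0 / (1.0 + math.exp(-m)) for m in mixed]
    rescaled = [[[w * v for v in row] for row in ch] for ch, w in zip(feature_map, weights)]
    return rescaled, weights


# Four constant 2 x 2 channels with values 0, 1, 2, 3 and an identity-like kernel.
channels = [[[float(c)] * 2 for _ in range(2)] for c in range(4)]
rescaled, weights = eca(channels, kernel=[0.0, 1.0, 0.0])
print([round(w, 3) for w in weights])
```

Because the convolution runs only over the pooled channel vector, the block adds K weights rather than a full C × C projection.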
After the above operations, feature maps from different groups are concatenated along the channel dimension. Subsequently, Softmax and GELU activation functions, LN, and two 1 × 1 convolutions are applied to enhance the network’s non-linear transformation and feature representation ability.
Compared to CNN, the key component of the Transformer [35] is self-attention, which allows the network to automatically focus on information from different positions when processing input sequences, helping the network to model long-range dependencies and capture the global structure of the input sequence. We leverage the advantages of self-attention to construct an HA block for extracting global features, such as the shape and size of pulmonary nodules, with the aim of capturing obvious visual differences among pulmonary nodules. The structure of the HA block is shown in Fig. 5.

The diagram of our proposed HA block.
Self-attention typically calculates the similarity score between the current pixel and other pixels in the spatial dimension, overlooking global modeling in the channel dimension. To address this limitation, the proposed HA block can simultaneously consider the relationships between channels and between different locations in spatial dimensions on a global scale. The input image
After feature mapping through 1 × 1 convolution, O_s and O_c are element-wise multiplied to encode more comprehensive global semantic information.
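The idea of fusing spatial and channel self-attention by element-wise multiplication can be sketched as follows. This is a generic formulation under several assumptions, since the block's equations are not reproduced in this text: the features are flattened to N tokens of C channels, the same Q, K, and V are shared by both branches, scaled dot-product attention is used, and the 1 × 1 convolutions and multiple heads are omitted. All function names are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def scaled(m, factor):
    return [[v * factor for v in row] for row in m]


def hybrid_attention(q, k, v):
    """q, k, v: N x C matrices (N spatial positions, C channels)."""
    n, c = len(q), len(q[0])
    # Spatial branch: N x N similarity between positions, output O_s is N x C.
    attn_s = softmax_rows(scaled(matmul(q, transpose(k)), 1.0 / math.sqrt(c)))
    o_s = matmul(attn_s, v)
    # Channel branch: C x C similarity between channels, output O_c is N x C.
    attn_c = softmax_rows(scaled(matmul(transpose(q), k), 1.0 / math.sqrt(n)))
    o_c = matmul(v, transpose(attn_c))
    # Fusion by element-wise multiplication of the two branch outputs.
    return [[s * t for s, t in zip(row_s, row_c)] for row_s, row_c in zip(o_s, o_c)]


tokens = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [0.0, 2.0, 1.0], [1.0, 1.0, 1.0]]
fused = hybrid_attention(tokens, tokens, tokens)
print(len(fused), len(fused[0]))
```

Both branches return an N × C matrix, which is what allows the outputs to be combined position by position and channel by channel.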
LIDC-IDRI and LUNGx datasets
The LIDC-IDRI dataset [36] was created through a collaboration between the Lung Image Database Consortium (LIDC) and the Image Database Resource Initiative (IDRI). This dataset contains 1,018 CT scans from 1,018 patients, each of which has been annotated by 1 to 4 professional radiologists. The annotation information is stored in extensible markup language (XML) files and includes features such as the location, diameter, shape, density, and malignancy level of nodules. Following the method in [37], we only use nodules annotated by at least three radiologists and calculate the median of the malignancy levels annotated by all radiologists to categorize pulmonary nodules as benign or malignant. When the median malignancy level of a pulmonary nodule is less than 3, it is categorized as benign; when the median is greater than 3, the nodule is categorized as malignant. If the median malignancy level is equal to 3, the radiologists are not sure whether the nodule is benign or malignant, and the nodule is excluded. In the end, we obtained a total of 848 CT images of pulmonary nodules, comprising 442 benign pulmonary nodules and 406 malignant pulmonary nodules.
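The labelling rule described above can be written as a small function. The sketch assumes each nodule is represented by a list of integer malignancy ratings, one per radiologist; the function name and the use of `None` for excluded nodules are illustrative choices.

```python
from statistics import median


def label_nodule(ratings):
    """Map per-radiologist malignancy ratings to a label, or None if excluded."""
    if len(ratings) < 3:
        return None  # annotated by fewer than three radiologists
    score = median(ratings)
    if score < 3:
        return "benign"
    if score > 3:
        return "malignant"
    return None  # median equal to 3: indeterminate, nodule is excluded


print(label_nodule([1, 2, 2]))      # benign
print(label_nodule([3, 4, 5, 5]))   # malignant (median 4.5)
print(label_nodule([2, 3, 4]))      # None (median 3)
```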
The LUNGx [38] dataset was provided for the benign and malignant pulmonary nodule classification challenge at the SPIE Medical Imaging Conference in 2015. The dataset contains 10 CT scans for calibration and 60 CT scans for testing. The case names, coordinates of the approximate nodule centroid, and diagnosis results are stored in an associated Excel file. In this paper, since this dataset provides a small number of CT scans, the calibration and test data are mixed to form a larger test dataset containing a total of 83 pulmonary nodules (42 benign and 41 malignant).
Data processing
We used LIDC-IDRI as training and test data, while LUNGx was only used as external test data to evaluate the generalization ability of the network. Since the CT images come from different institutions and devices, they are resampled by linear interpolation to a uniform spacing of 1 mm in each dimension. Subsequently, we apply a window range of [-1000, 400] HU to filter out air and bone regions and normalize the data using Z-score standardization, as follows:
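In its standard form, Z-score standardization maps each value x to (x − μ) / σ, where μ and σ are the mean and standard deviation of the data. The sketch below combines it with the intensity window. It assumes that the window is applied by clipping and that μ and σ are computed over the values passed in; the text does not state whether the statistics are per patch or per dataset.

```python
from statistics import mean, pstdev


def window_and_standardize(hu_values, low=-1000.0, high=400.0):
    """Clip intensities to [low, high] and apply Z-score standardization."""
    clipped = [min(max(v, low), high) for v in hu_values]
    mu = mean(clipped)
    sigma = pstdev(clipped)
    if sigma == 0:
        return [0.0 for _ in clipped]  # constant input: avoid division by zero
    return [(v - mu) / sigma for v in clipped]


normalized = window_and_standardize([-2000, -1000, 0, 400, 3000])
print([round(v, 3) for v in normalized])
```

After this step, values below −1000 and above 400 collapse to the window limits, and the result has zero mean and unit variance.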
The following preprocessing steps are applied to the LIDC-IDRI dataset: (1) Crop a CT image volume of size 32 mm × 32 mm × 32 mm centered on the annotated position of the pulmonary nodule to remove a large amount of interference information. (2) Extract 2D slices from axial, coronal, and sagittal perspectives to obtain rich spatial information from CT images. (3) Apply data augmentation operations to each 2D slice to prevent overfitting of the network and improve its generalization ability. We use two augmentation methods, which are illustrated in Fig. 6. One is the conventional data augmentation strategy, which includes operations such as horizontal flipping, rotations (90°, 180°, 270°), and the addition of Gaussian blur; the other is the MedAugment [39] method, which uses a sampling strategy to select data augmentation operations from an augmentation space (pixel augmentation space, spatial augmentation space). The selected operations are then randomly ordered and executed sequentially. Through these two augmentation methods, the training dataset is increased tenfold. (4) Divide the dataset according to the ten-fold cross-validation method, where nine folds are used for training and the remaining fold is used for testing.
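Step (4) can be illustrated with a minimal index-based splitter. The sketch assumes the split is made over nodule indices with a fixed random seed; how the folds were actually drawn, and how augmented copies are assigned to folds, is not specified in the text.

```python
import random


def ten_fold_splits(num_samples, num_folds=10, seed=0):
    """Yield (train_indices, test_indices) for each cross-validation fold."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into folds of nearly equal size.
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for held_out in range(num_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


for fold, (train, test) in enumerate(ten_fold_splits(848), start=1):
    print(f"fold {fold}: {len(train)} training nodules, {len(test)} test nodules")
```

With the 848 nodules reported for LIDC-IDRI, each fold holds out 84 or 85 nodules, and every nodule is used for testing exactly once.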

Different data augmentation methods. (a) Conventional data augmentation method. (b) MedAugment method. The green rectangular box represents operations in the pixel augmentation space. The light orange rectangular box represents operations in the spatial augmentation space. ’Pass’ represents no augmentation in the augmentation space.
For the LUNGx dataset, we only need to crop CT image patches of size 32mm × 32mm based on the nodule coordinates provided in the Excel file.
We evaluate the network’s classification performance using five criteria: accuracy (Acc), sensitivity (Sen), precision (Pre), specificity (Spe), and AUC. The expressions for some of the metrics are as follows:
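For reference, the four count-based metrics have standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The sketch below uses those definitions and assumes the malignant class is the positive class; the example counts are made up for illustration and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Acc, Sen, Pre, and Spe from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # share of malignant nodules detected
        "precision": tp / (tp + fp),    # share of malignant predictions that are correct
        "specificity": tn / (tn + fp),  # share of benign nodules recognized
    }


# Illustrative counts only.
for name, value in classification_metrics(tp=40, tn=45, fp=5, fn=10).items():
    print(f"{name}: {value:.4f}")
```

AUC is not included because it depends on the ranking of predicted probabilities rather than on a single confusion matrix.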
We used Python 3.9 and the PyTorch 1.12 framework for programming. The network was trained and tested on a Windows 10 system with 32 GB of memory, an NVIDIA GeForce RTX 3090 Ti GPU, and an Intel Core i5-12600KF 3.70 GHz CPU. We chose cross-entropy as the loss function. Additionally, experiments indicated that the network converged after 50 training epochs, so we set the number of training epochs to 50. For other detailed parameter settings, please refer to Table 1.
Training parameter setting
Comparison with some state-of-the-art methods
Table 2 shows the comparison results of the proposed method with other state-of-the-art methods. To ensure fairness, all compared methods utilized the LIDC-IDRI dataset for both training and testing. We categorized pulmonary nodule classification methods into three groups: CNN-based classification methods, Transformer-based classification methods, and hybrid classification methods combining CNN and Transformer architectures. Liu et al. [29] designed a pulmonary nodule classification network that incorporated residual and transformer blocks. However, the designed network is too shallow and cannot effectively capture the hierarchical structure and high-level features of pulmonary nodules. Consequently, it has limitations in representing complex features, leading to lower classification performance compared to our method. Similarly, Al-Shabi et al. [40] used residual blocks and Non-Local [41] blocks to construct classification networks, but maintaining a consistent feature map resolution can lead to feature duplication and redundancy, thereby negatively affecting classification performance. Additionally, Cao et al. [42] employed PixelShuffle [43] to first reconstruct the 32 × 32 pulmonary nodule CT images into 64 × 64 images before feature extraction and classification. However, some fine details may be lost during the pixel shuffle process, and the larger input may increase computational cost and time. While this method achieves slightly higher precision than our proposed approach (by 0.73%), the other metrics in [42] are lower than those of our method. Wang et al. [44] proposed an improved Vision Transformer (ViT) [45] pulmonary nodule classification network, but due to the lack of CNN’s inherent inductive bias, its performance on classification tasks is reduced.
In contrast, the Local block and MSL block in our method can extract rotation and translation-invariant local features, compensating for the shortcomings of the Transformer structure in modeling local spatial structures. Jiang et al. [46] used neural architecture search to construct a pulmonary nodule classification network automatically. Although the specificity is 2.17% higher than that of the proposed method, the sensitivity is much lower than our method’s 94.25%, indicating that the network is more likely to identify malignant nodules as benign, which increases the risk of patients missing the best treatment opportunity in clinical practice. At the same time, our proposed method outperforms other methods across various metrics, reflecting that LGHNet has stronger feature extraction abilities and can better classify benign and malignant pulmonary nodules.
Comparison with state-of-the-art methods on the LIDC-IDRI dataset
We used the LUNGx dataset to evaluate the generalization capability of LGHNet. The best weights for each fold trained on the LIDC-IDRI dataset are selected and then tested on the LUNGx dataset. The results compared with state-of-the-art methods are shown in Table 3. Except for a specificity 1.7% lower than that of reference [20], the other evaluation metrics of LGHNet on the LUNGx dataset are better than those of references [20] and [52], reflecting that the proposed method has stronger generalization ability. In particular, LGHNet achieved a sensitivity of 90.30% on the LUNGx dataset, which is 3.08% higher than that of reference [20]. This indicates that LGHNet has a significant advantage in accurately identifying malignant nodules, thereby reducing the risk of missed diagnosis.
Comparison with state-of-the-art methods on the LUNGx dataset
To validate the superiority of LGHNet in the pulmonary nodule classification task, we compared it to six classical pre-trained models commonly used in natural image classification. These models include ResNet18, ResNet34, ResNet50, ResNet101, DenseNet121, and DenseNet161. Subsequently, we fine-tune these six models on the LIDC-IDRI dataset for training and testing. Since these six models require three-channel images as input, we replicate the single-channel pulmonary nodule CT image three times and concatenate them in the channel dimension, resulting in 32 × 32 × 3 size images. Furthermore, we freeze the convolutional layers of the models and only modify the output feature dimension of the final fully connected layer to 2 for the binary classification task of pulmonary nodules. The comparison results of the classification metrics of each model are shown in Fig. 7. Table 4 shows the comparison results of the number of parameters, floating-point operations (FLOPs), and inference time of different models.

Comparison of classification metrics of different models.
Comparison of parameter number, FLOPs and inference time of different models
From the results shown in Fig. 7 and Table 4, it is evident that, compared to the six models, the proposed method achieves higher classification performance with fewer parameters in the task of benign and malignant pulmonary nodule classification. Notably, it demonstrates significant advantages in terms of accuracy, sensitivity, and precision. At the same time, LGHNet requires only 0.02G FLOPs, far fewer than the other models, indicating that the proposed method is lightweight and requires fewer computing resources in actual deployment. Although the inference time of LGHNet is longer than that of ResNet18, LGHNet is more competitive than ResNet18 in terms of the number of parameters and FLOPs.
The classification results of different data augmentation methods are presented in Table 5. As can be seen from Table 5, compared with the traditional data augmentation method, the accuracy on the test dataset increased by 0.74% when using data augmented by the MedAugment [39] method for network training. The MedAugment method excludes operations that are not suitable for medical images, such as invert, equalize, and solarize operations that may destroy details and features in medical images. Furthermore, it uses a sampling strategy and hyperparameter mapping to make the augmented images closer to real pulmonary nodule CT images. This may be one of the reasons why the MedAugment method outperforms the conventional data augmentation method. In this paper, we adopt both data augmentation methods and merge the resulting data to create a larger training dataset, which enables the network to learn more diverse and rich pulmonary nodule features and to adapt and generalize better to complex pulmonary nodule CT images. As a result, the network’s classification performance has significantly improved.
Classification results with the different data augmentation methods
We examine the impact of each block of the network and of different network structures on classification performance. The results of the ablation experiment are shown in Table 6. We compare the classification performance of the following networks:
The baseline model N1 achieved classification results with an accuracy of 91.51%, sensitivity of 91.12%, precision of 90.06%, specificity of 90.51%, and an AUC of 95.95%. Due to the addition of HA block in N2, the global features of pulmonary nodules can be captured in both spatial and channel dimensions, which improves the recognition ability of the network. Compared with N1, all classification metrics of N2 have improved. Based on N2, N3 adds the MSL block, and the classification ability of the network has declined, especially the sensitivity, which is only 89.66%. In addition, N4 removes the HA block in the first stage, and its classification metrics are better than N2. The reason may be that the shallow network is more suitable for modeling local features to capture detailed information on pulmonary nodules. Finally, we replaced the Local block in the second, third, and fourth stages of N4 with the MSL block to form the overall structure of LGHNet. Compared with N4, the classification accuracy of LGHNet increased by 1.9%.
The results of ablation experiment
In LGHNet, we introduce the HA block that performs self-attention in the spatial dimension and channel dimension to capture the global contextual information of pulmonary nodules from multiple dimensions. In this section, we explore the effects of different self-attention choices and fusion methods on classification performance. There are mainly the following methods:
The comparison results of the above methods are shown in Fig. 8. Firstly, it can be observed that compared to the method using hybrid attention, the classification performance is worse when self-attention is applied separately in either the spatial or channel dimension, with classification accuracies of only 91.99% and 91.75%, respectively. M3 and M4 sequentially execute self-attention in the two dimensions. Although each metric is better than for M1 and M2, this approach cannot share the parameters of Q, K, and V, which, to some extent, increases the number of parameters of the network and consumes more memory resources. M5 fuses the results of self-attention in the two dimensions by element-wise addition. Compared with M3 and M4, the accuracy and sensitivity of M5 are both improved, reaching 93.4% and 93.2%, respectively. However, the precision is only 91.72%, indicating an imbalance between sensitivity and precision, where the network tends to predict most samples as malignant. M6 shows a significant improvement in various metrics. Fusion through element-wise multiplication helps capture more complex feature relationships, reduces conflicts between unrelated features, and takes into account both spatial relationships between pixels and feature correlations between channels, thus enhancing the network’s classification performance.

Comparison of different self-attention selection and fusion methods.
Multi-head self-attention (MSA) maps input features into multiple heads, each of which learns to focus on a different part of the input. However, setting too many heads may make it difficult for the network to concentrate on learning key features and reduce network performance. Therefore, it is necessary to select an appropriate number of heads for each HA block. In addition, according to the overall structure of LGHNet, we only consider the number of heads in the second, third, and fourth stages. The results are shown in Table 7. When the number of heads in all three stages is set to 2, the classification accuracy, sensitivity, precision, specificity, and AUC are 91.87%, 91.85%, 91.5%, 91.88%, and 96.10%, respectively. When the number of heads in each stage increases to 4, the network’s classification performance reaches its optimum.
Classification results with the different number of head settings
In this section, we discuss the impact of different numbers of branches in the MSL block on classification performance. The results are presented in Table 8. When the number of branches of the MSL block in the second, third, and fourth stages is set to 3, 3, and 4, respectively, LGHNet has the best classification performance. However, when the number of branches in the second stage is set to 3, and the number of branches in the third and fourth stages is set to 4, accuracy and precision are the lowest, which are 91.16% and 90.6%, respectively. The other two configurations also performed poorly in the classification of the LIDC-IDRI dataset.
Classification results with the different number of branch settings
To gain insights into the network’s decision-making process, we employ the Grad-CAM algorithm [53] to generate heat maps that reveal the regions of interest for the network. The more interested the network is in certain areas, the darker the color in the heat map. Fig. 9 shows the Grad-CAM visualization and classification results of benign and malignant pulmonary nodules.

Grad-CAM visualization and classification results of benign and malignant nodules.
As depicted in Fig. 9, the high heat values are predominantly concentrated at the center of pulmonary nodules and gradually extend towards the boundaries, which indicates that LGHNet can accurately locate pulmonary nodules in CT images and extract their local and global features to distinguish benign from malignant nodules. The diameters of the nodules in the figure range from 4 mm to 13 mm. Despite this variation, LGHNet correctly identifies the categories of the nodules, which, to a certain extent, shows that our proposed MSL block effectively addresses the multi-scale issue of pulmonary nodules and adapts to nodules of different sizes. At the same time, LGHNet retains a robust discriminative ability for some nodules with complex morphology, such as nodules ’C,’ ’E,’ ’F,’ as well as nodules ’B,’ ’G,’ and ’L,’ which are located around the trunk. This reflects the generalization and robustness of the proposed method for nodules with different shapes and locations, which can further improve the reliability of LGHNet in practical clinical applications.
The LGHNet we proposed effectively captures local and global information of lung nodule CT images through the cooperation of the MSL block and the HA block, improving the classification accuracy and robustness of the network, thereby strengthening its diagnostic capability for potential lung diseases. However, due to the limited availability of CT images of pulmonary nodules, overfitting issues may arise during network training. Therefore, in future work, we will explore methods related to image generation for producing high-quality images to enrich training datasets.
