Adversarial consistency-based semi-supervised pneumonia segmentation using dual multiscale feature selection and fusion mean teacher model and triple-attention dynamic convolution in chest CTs

Abstract

Recently, semi-supervised learning has demonstrated significant potential in medical image segmentation. However, most existing methods fail to establish connections among diverse sample data. Moreover, segmentation networks with fixed parameters can impede model training and increase the risk of overfitting. To address these challenges, this paper proposes an adversarial consistency-based semi-supervised segmentation method built on a dual multiscale mean teacher model. First, we design a discriminator network with adaptive feature selection and train it alternately with the segmentation network, which enhances the segmentation network's ability to transfer knowledge from the limited labeled data to the unlabeled data. The discriminator evaluates the quality of the segmentation results for both labeled and unlabeled data while guiding the network to learn consistent segmentation performance throughout training. Second, we design a triple-attention dynamic convolution (TADC) module, which allows the convolution kernel parameters to be adjusted flexibly according to the input data. This improves the feature representation capability of the network and helps reduce the risk of overfitting. Finally, we propose a novel feature selection and fusion module (FSFM) within the segmentation network, which dynamically selects and integrates important features to enhance the saliency of key information and improve overall performance. On the MosMedData dataset, the segmentation network outperforms the baseline model by 3.83%, 3.97%, and 3.14% in Dice, Jaccard, and NSD scores, respectively, for the segmentation of pneumonia lesions. Extensive experiments on the MosMedData and COVID-19-P20 datasets show that the proposed method outperforms state-of-the-art segmentation networks and demonstrates strong potential for segmenting pneumonia lesions.

Keywords

semi-supervised learning; dynamic convolution; adversarial consistency learning; medical image segmentation; mean teacher

Introduction

Lung diseases pose a grave threat to human life and health. They take many forms, including pulmonary nodules,¹⁻³ pneumonia,⁴ and pulmonary embolisms.⁵ Computer-aided diagnostic techniques have emerged as a powerful tool for addressing them: they can detect even subtle foci, thereby significantly improving diagnostic efficiency.⁶⁻⁸ Recently, medical image segmentation has made remarkable progress in the detection and analysis of lung diseases, and accurate, reliable segmentation is very important for the early diagnosis and treatment of pneumonia, a serious lung infection. In recent years, supervised deep learning methods⁹⁻¹¹ such as U-Net¹² and its variants have been applied widely in medicine, and these models have achieved good results in medical image processing by virtue of their encoder-decoder structure. However, the shortcomings of supervised segmentation methods are also obvious. They rely on large-scale labeled datasets for training, yet medical images are complex: in lung CT images, the infected regions vary irregularly in texture, size, and location, and the lesions have fuzzy boundaries and low contrast,¹³ which makes it expensive and difficult to construct high-quality labeled datasets through manual annotation. To solve this problem, semi-supervised learning methods have gradually been applied to medical image segmentation.
Semi-supervised learning methods combine a small amount of labeled data with a large amount of unlabeled data for training. This reduces the dependence on expert annotation and mitigates the data scarcity and overfitting problems faced by supervised methods, which has greatly promoted the development and application of pneumonia image segmentation. Semi-supervised learning has significant advantages over traditional supervised learning and has shown excellent performance across different medical imaging domains. In particular, semi-supervised segmentation methods based on Mean Teacher¹⁴ have been widely applied to medical image segmentation by performing consistency learning under perturbation. Much current research focuses on improving the reliability of the pseudo-labels generated during training, but insufficient mining of unlabeled-data features, inefficient feature learning, and weak generalization remain challenging. Exploiting the a priori relationship between unlabeled and labeled data, and applying the knowledge learned from the limited labeled data to the unlabeled data, can motivate the segmentation network to generate high-quality predictions for the unlabeled data. Yu et al.¹⁵ used MC-Dropout during training to generate predictions and compute the prediction uncertainty through multiple forward passes, so that the segmentation network fits a relatively reliable target region. To better distinguish reliable from unreliable predictions on unlabeled data, Xia et al.¹⁶ generated multiple differing views by spatially transforming the 3D data and then introduced MC-Dropout into the model architecture to estimate the uncertainty of each view.
Xiang et al.¹⁷ addressed cognitive uncertainty by attaching four classifiers with different loss functions to the decoder to generate diverse predictions; a region is considered high-confidence if it receives similar segmentation results across all four predictions. Lu et al.¹⁸ combined pseudo-labels with consistency regularization, improving pseudo-label reliability by computing the KL divergence between the pseudo-labels generated by the teacher model and the predictions of the student model and treating it as an uncertainty estimate that guides the consistency loss. The MC-Dropout method used by Yu et al.¹⁵ and Xia et al.¹⁶ requires multiple inference passes, which is computationally expensive. The uncertainty approaches proposed by Xiang et al.¹⁷ and Lu et al.¹⁸ improve the reliability of pseudo-labels to some extent but do not establish a link between unlabeled and labeled data. To improve the model's learning efficiency on the features of unlabeled data, this paper takes the perspective of adversarial consistency learning and proposes a discriminator network with adaptive feature selection that participates in segmentation training. Through its mutual game with the segmentation network, the discriminator judges the prediction quality of different sample data and enforces consistency during training, which improves both the prediction quality on unlabeled data and the generalization performance of the model.

In supervised image segmentation tasks, where the diversity of training samples is limited, the parameters of the segmentation network, such as weights and biases, are fixed: once the model is trained, they do not change with the input sample. In semi-supervised learning, however, applying fixed model parameters to different types of sample data adversely affects the segmentation network. On the one hand, fixed parameters limit the model's ability to adapt to different data distributions. On the other hand, the model may be unable to fully utilize the data for learning, resulting in poor-quality pseudo-labels and overfitting. Chen et al.¹⁹ proposed dynamic convolution to improve the representation capability of the network; it selects or weights multiple convolution kernels by analyzing the features of each input so as to extract features more efficiently. Jiang et al.²⁰ designed a new dynamic convolutional block and used it to replace the standard convolutional blocks in 3D U-Net, which greatly improves the ability of the segmentation network to extract information and to segment accurately and quickly. Wang et al.²¹ enhanced the extraction of image features by combining multi-dimensional dynamic convolution with four different attention mechanisms to capture the rich contextual information in the image and improve model performance. Although these methods dynamically aggregate weights from multiple convolutional kernels and bring some performance gains, they also significantly increase the number of parameters and do not model the dependencies between the spatial and channel dimensions of the image, which limits feature expressiveness and leads to insufficient feature fusion and transfer.
To address these problems, we propose a triple-attention dynamic convolution (TADC) module tailored to the characteristics of semi-supervised training samples. This module dynamically selects parameters according to the input sample and establishes interdependencies among different feature dimensions, thereby enhancing the model's overall representation of data features.

To solve the above problems, we design a discriminator network with adaptive feature selection on the basis of a dual multiscale mean teacher model. The discriminator incorporates a novel triple-attention dynamic convolution (TADC) module. Alternate training of the segmentation network and the discriminator network motivates the segmentation network to improve its ability to transfer the knowledge learned from the labeled data to the unlabeled data, thus generating reliable segmentation results. In addition, a feature selection and fusion module is designed to enhance the feature representation in the channel and spatial dimensions. The contributions of this paper are as follows:

Adaptive Feature Aggregation Discriminator Network: We propose a novel discriminator network based on adversarial training with adaptive feature selection. This network enables the segmentation network to transfer knowledge from labeled data to unlabeled data more effectively, encouraging highly consistent segmentation quality for both labeled and unlabeled data and thereby improving the model's generalization ability.

Triple-Attention Dynamic Convolution (TADC) Module: We develop the TADC module, which adaptively adjusts convolution kernel parameters based on the input sample. This mechanism allows model parameters to be adjusted flexibly for diverse samples, effectively mitigating the risk of overfitting.

3D Triple Attention Module: We introduce a 3D triple attention module at the front end of the dynamic convolution. It promotes feature expressiveness and fusion via cross-dimensional interactions between the channel and spatial dimensions, providing a more robust feature representation for subsequent processing.

Feature Selection and Fusion Module: We construct a feature selection and fusion module. This module extracts information from the feature channel layer, preserves the valuable part, and then combines it with the extracted spatial information. The fused features are then processed through a series of convolution operations to further strengthen their representation in the spatial dimension.

Related work

Semi-supervised learning has made significant advancements in the field of medical image segmentation compared to traditional supervised learning methods. The most common approaches in this domain are pseudo-labeling and consistency regularization. In practice, pseudo-labeling methods are widely employed in tasks such as image and semantic segmentation. These methods help alleviate the model's dependence on large amounts of labeled data by using the model's predictions on unlabeled data as pseudo-labels, which are then used in conjunction with labeled data for training.²² However, this approach faces several challenges, such as noisy pseudo-labels, which can degrade the model's performance. In addition, the choice of confidence threshold plays a crucial role in the selection of pseudo-labels: an inappropriate threshold can either discard useful pseudo-labels or introduce noise. Generating reliable pseudo-labels is therefore essential for improving model performance.²³ A growing body of research in recent years has also explored methods based on consistency regularization, which encourages prediction consistency by applying different perturbations to the same inputs or models. For example, the Mean Teacher model,¹⁴,²⁴ a widely used semi-supervised learning approach designed to enhance model performance with limited labeled data, introduces a self-ensembling technique. It combines a student model and a teacher model, where the weights of the teacher model are updated using the Exponential Moving Average (EMA) of the student model's weights. While simple and efficient, this approach inevitably faces some challenges. For instance, Lu et al.⁷ proposed addressing the reliability issue of pseudo-labels by incorporating uncertainty estimation to correct noisy pseudo-labels.
Specifically, they used the Kullback–Leibler (KL) divergence between the teacher and student predictions as an uncertainty estimate, guiding the consistency loss to mitigate the impact of noisy pseudo-labels during training. Xu et al.²⁵ proposed the Ambiguity-Consensus Mean-Teacher (AC-MT) model to extract useful information from unlabeled data. This model is grounded in the concepts of entropy, model uncertainty, and self-identification of label noise. It improves performance by applying consistency learning in regions that are ambiguous yet still valuable for the model. Adiga et al.²⁶ proposed efficient segmentation methods with low computational complexity to enhance the performance of segmentation networks. These methods learn anatomically aware representations from labeled data through pre-training and leverage global information from segmentation masks to address uncertainty in segmentation tasks. Wang et al.²⁷ introduced entropy minimization and virtual adversarial strategies based on the Mean Teacher (MT) model to enhance segmentation quality while smoothing decision boundaries. Xiao et al.²⁸ employed two teacher networks, one based on a CNN and one on a Transformer, to regularize and guide the student network. They also focused on global information and enabled the teacher networks to supervise each other alongside the student network. Li et al.²⁹ proposed a semi-supervised learning framework with two student networks, in which the teacher model parameters are alternately updated by the student networks. This approach reduces the similarity between the teacher and student model weights, thereby reducing error accumulation.
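The EMA teacher update described above for the Mean Teacher model can be illustrated with a minimal Python sketch. The function name `ema_update`, the dictionary of floats standing in for weight tensors, and the decay value `alpha` are assumptions made for illustration; they are not taken from the cited works.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return the teacher weights after one exponential-moving-average step.

    teacher, student: dicts mapping parameter names to floats (toy stand-ins
    for weight tensors). alpha is the EMA decay; 0.99 is an assumed default.
    """
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }


# Toy example: the teacher moves a small step toward the student.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, alpha=0.9)
```

With a decay close to 1, the teacher changes slowly and behaves like a temporal ensemble of recent student states, which is why its predictions can serve as comparatively stable targets for the consistency loss.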

Although consistency-based regularization has become the dominant method for semi-supervised learning in image segmentation, many segmentation approaches for unlabeled data struggle with slow progress in feature learning and are inefficient in this regard. Meanwhile, the use of Generative Adversarial Networks (GANs) is increasingly gaining traction in medical image segmentation tasks. Zhang et al.³⁰ proposed an enhanced Dense GAN network, which trains a deep learning model to improve performance by generating high-quality image augmentation data. Jiang et al.³¹ proposed a synthesis method that generates realistic COVID-19 CT images using conditional GANs. This approach effectively addresses the challenge of limited data availability. Xu et al.³² proposed GASNet, which incorporates generative adversarial training to enhance the segmentation performance of the network. This is achieved by using a binary classification discriminator to determine whether the healthy volume is generated and synthesized by the generator. Li et al.³³ proposed the Generative Adversarial Semi-Supervised Network (GASNet) to address the challenge of limited access to pixel-level labels in medical images. In this framework, the segmentation network acts as a generator, producing pseudo-labels that are evaluated for reliability by an uncertainty discriminator. The credibility of these pseudo-labels is further ensured through the use of a feature mapping loss. Wu et al.³⁴ proposed an adversarial training method called Auxiliary Adversarial Learning (AAL), which enhances the segmentation quality of unlabeled images. This is achieved by assigning distinct labels to the segmentation results of labeled and unlabeled images, thereby improving the overall segmentation performance. Zhang et al.³⁵ proposed a deep adversarial network that optimizes image segmentation quality by iteratively applying the adversarial training process, improving the segmentation of both unlabeled and labeled images.

A fixed-parameter segmentation network used as the backbone of a semi-supervised learning framework is prone to overfitting and requires significant memory overhead. Because sample data are diverse, performance varies significantly across data types, and using a fixed-parameter segmentation network to train on both labeled and unlabeled data introduces notable limitations. In recent years, a growing body of research has explored how to leverage flexible convolutional operations that adaptively adjust the convolutional kernel parameters based on the input. Yin et al.³⁶ proposed a segmentation algorithm with low computational complexity and high performance. By using a dynamic filter, the algorithm adjusts the filter parameters in real time based on the input sample, thereby enhancing its ability to capture and represent features. Hu et al.³⁷ proposed Dynamic Convolution-based Domain and Content Adaptive Convolution (DCAC) to address the limitations of static convolution. This method adapts to different image sources and features by flexibly adjusting the convolution operation. The model's convolution parameters are dynamically generated based on the domain code of each input image or its global feature conditions, enabling better adaptation to varying data. Su et al.³⁸ proposed a method that combines domain adaptation and dynamic convolution to address the diversity and complexity of clinical data. This approach enables the model to leverage domain-specific knowledge and dynamically adjust its internal mechanisms when handling a wide range of data, thereby enhancing overall performance.

Semi-supervised segmentation framework design

In this paper, we propose a novel pneumonia infection segmentation method for chest CTs that uses a dual multiscale mean teacher model³⁹ and adversarial consistency learning. The proposed segmentation framework consists of a student model and a teacher model that share the same encoder-decoder architecture. The semi-supervised framework for pneumonia infection segmentation with dual multiscale and adversarial consistency learning is shown in Figure 1(a). We also introduce a novel adversarial consistency learning-based discriminator with adaptive feature selection, whose structure is illustrated in Figure 1(b). This network incorporates an adaptive feature selection module, the proposed 3D TADC module, and average pooling. The discriminator takes the original image concatenated with the corresponding segmentation result as input and evaluates the quality of the segmentation outputs generated by the network. It assesses the segmentation results on both labeled and unlabeled data, encouraging the two to be as consistent in quality as possible. This approach effectively enhances the segmentation network's ability to transfer knowledge learned from labeled data to unlabeled data. Furthermore, the dynamic convolutional layers in the discriminator mitigate the risk of overfitting while enhancing the model's ability to extract and interpret features.

Figure 1.

Proposed segmentation algorithm. (a) Semi-supervised framework for segmentation of pneumonia infections with dual multiscale and adversarial consistency learning; (b) Structure of the discriminator network.

The segmentation network and the discriminator network are trained alternately. Notably, the discriminator network is not required during inference, which avoids unnecessary computational cost. In the supervised training phase, labeled data are fed into the student network, which generates segmentation results at four different scales from different decoder layers. The multiscale supervised loss is then calculated by comparing these four segmentation results with the ground truth. In the unsupervised training phase, unlabeled data are fed into both the student and teacher networks, and the consistency loss between their segmentation results at each scale is computed.

FSFMT-Net structure design

We design the Feature Selection and Fusion Mean Teacher Network (FSFMT-Net) as the backbone network, shown in Figure 2. It improves on the Dual Multi-scale Mean Teacher Network (DM²TNet).³⁹ The segmentation network is divided into an encoder and a decoder. At the encoder side, the original 3D image undergoes two down-sampling operations to obtain two auxiliary 3D images with different spatial dimensions. The three 3D images are then each passed through four convolution and max-pooling stages, yielding four groups of feature maps, each of which contains three features with different spatial resolutions. The three features of each group are aggregated by the attention module shown in Figure 3. At the decoder side, the aggregated features are up-sampled and merged with neighboring features, so that the input image and its multidimensional perceptual multi-scale representation are fused for better segmentation of lung infection; the decoder layers ultimately generate multiple segmentation results at different scales. Finally, we incorporate a feature selection and fusion module into the final output layer of the network. This module makes better use of the feature information in the input data, allowing the network to focus on significant features while suppressing less important ones. Fusing global channel information with global spatial information enhances the performance of the entire network.

Figure 2.

Structure of the teacher-student network model.

Figure 3.

Structure of attention modules in networks.

Discriminator networks with adaptive feature selection based on adversarial consistency learning

Although Mean Teacher-based semi-supervised learning methods have shown promise in medical image processing by calculating a consistency loss under data perturbation, they often overlook the a priori relationship between labeled and unlabeled data. This oversight can hinder the model's ability to leverage the information in unlabeled data, ultimately reducing both prediction accuracy and generalization ability. In this paper, we extend the discriminator proposed by Lei et al.⁴⁰ into a novel discriminator network with adaptive feature selection. By learning the a priori relationship between labeled and unlabeled data, this approach enables the model to transfer knowledge more effectively from the limited labeled data to the unlabeled data while reducing the reliance on large amounts of labeled data.

First, the original image is passed through the adaptive feature selection module. The output of this module is then concatenated with the prediction generated by the student network for that image. The concatenated data pass through dynamic convolution and global average pooling, and the discriminator outputs a quality judgment of the segmentation result: 0 (poor) or 1 (good). During training, the discriminator network encourages the student network to generate the highest-quality predictions possible for the unlabeled data, aiming for an output of 1 from the discriminator. The objective function for optimizing both the student network and the discriminator network is as follows:

$$\min_{G} \max_{D} \left( L_{G}(\theta) + L_{D}(\theta) \right) \tag{1}$$

where $\theta$ denotes the parameters to be optimized; $L_{G}(\theta)$ denotes the loss function of the segmentation network $G$; $L_{D}(\theta)$ denotes the loss function of the discriminator network $D$; $\max_{D}$ means finding the $D$ that maximizes $L_{G}(\theta) + L_{D}(\theta)$ for a given $G$; and $\min_{G}$ means finding the $G$ that minimizes this maximum among all possible $G$. The objective function $L_{G}(\theta)$ of the student network is as follows:

$$L_{G}(\theta) = \varsigma^{s}(x_{n}) + \varsigma^{c}(y_{m}) + L_{ad}(x_{u}, \hat{y}_{u}) \tag{2}$$

$$\varsigma^{s}(x_{n}) = \sum_{k=1}^{4} \Phi_{dice}(P_{k}, G_{k}) \tag{3}$$

$$\varsigma^{c}(y_{m}) = \sum_{k=1}^{4} \Phi_{MSE}(S_{k}, T_{k}) \tag{4}$$

where $\varsigma^{s}$ denotes the supervised loss over the four scales, $\Phi_{dice}$ is the Dice loss, and $P_{k}$ and $G_{k}$ denote the student network's prediction and the ground truth for the lung lesions at scale $k$, respectively. $\varsigma^{c}$ denotes the consistency loss over the four scales, and $\Phi_{MSE}$ denotes the mean square error (MSE) loss. $S_{k}$ and $T_{k}$ denote the predictions of the student and teacher models on unlabeled data at scale $k$, respectively. $L_{ad}$ denotes the binary cross-entropy loss function, $x_{u}$ denotes unlabeled image data, and $\hat{y}_{u}$ denotes the segmentation result on unlabeled data.
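The multiscale sums in Equations (3) and (4) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the soft Dice form with a smoothing constant `eps`, the flattened-list representation of each scale, and the names `dice_loss`, `mse_loss`, and `multiscale_loss` are all hypothetical.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between two flat lists of values in [0, 1].

    The smoothing constant eps is an assumption; the text does not
    specify the exact Dice formulation.
    """
    intersection = sum(p * g for p, g in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def mse_loss(student, teacher):
    """Mean squared error between two flat lists of equal length."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def multiscale_loss(loss_fn, outputs, references):
    """Sum a per-scale loss over all scales, as in Equations (3) and (4)."""
    return sum(loss_fn(o, r) for o, r in zip(outputs, references))


# Toy data: four scales, each a flattened map (hypothetical values).
student_labeled = [[1.0, 0.0, 1.0, 0.0]] * 4   # P_k
ground_truth = [[1.0, 0.0, 1.0, 0.0]] * 4      # G_k
student_unlabeled = [[0.5, 0.5]] * 4           # S_k
teacher_unlabeled = [[1.0, 0.0]] * 4           # T_k

supervised = multiscale_loss(dice_loss, student_labeled, ground_truth)
consistency = multiscale_loss(mse_loss, student_unlabeled, teacher_unlabeled)
```

In this toy case the labeled predictions match the ground truth, so the supervised term vanishes, while the disagreement between student and teacher on unlabeled data contributes 0.25 per scale to the consistency term.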

As the network is trained, the segmentation predictions improve progressively, generating reliable pseudo-labels for the unlabeled data. Subsequently, the discriminator network evaluates the quality of the segmentation results, aiming for the best possible judgment. The objective function $L_{D} (θ)$ of the discriminator network is as follows:

$$L_{D}(\theta) = L_{ad}(D_{1}(x_{i}, \hat{y}_{i}), 1) + L_{ad}(D_{1}(x_{u}, \hat{y}_{u}), 0) \tag{5}$$

where $x_{i}$ and $x_{u}$ denote the input labeled and unlabeled images, respectively, and $\hat{y}_{i}$ and $\hat{y}_{u}$ denote the segmentation results of the labeled and unlabeled images, respectively.
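To make the roles of the two targets in Equation (5) concrete, the following Python sketch evaluates the discriminator loss for scalar discriminator outputs. The function names and the clamping constant `eps` are assumptions for illustration. The function `student_adversarial_loss` is likewise an assumed reading of the adversarial term in Equation (2), based on the statement that the student aims for a discriminator output of 1 on unlabeled data.

```python
import math


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def discriminator_loss(d_labeled, d_unlabeled):
    """Equation (5): labeled pairs are pushed toward 1, unlabeled toward 0."""
    return bce(d_labeled, 1) + bce(d_unlabeled, 0)


def student_adversarial_loss(d_unlabeled):
    """Assumed form of the adversarial term: the student is rewarded when
    the discriminator scores its unlabeled predictions close to 1."""
    return bce(d_unlabeled, 1)


# A discriminator that cannot tell the two apart versus one that can.
undecided = discriminator_loss(0.5, 0.5)
confident = discriminator_loss(0.9, 0.1)
```

The two losses pull in opposite directions on the unlabeled score, which is what drives the alternating training: the discriminator lowers its loss by separating labeled from unlabeled results, and the student lowers its loss by making unlabeled results indistinguishable from labeled ones.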

To enhance the model's ability to represent important features, we embed an adaptive feature selection module⁴¹ at the front end of the discriminator network. The structure of the module is shown in Figure 4. This module adjusts the channel weights of the input feature maps by calculating inter-channel attentional weights. Specifically, the input feature maps are first processed by three convolutional layers to generate three feature maps $Q \in R^{C \times H \times W \times D}$, $K \in R^{C \times H \times W \times D}$, and $V \in R^{C \times H \times W \times D}$, where $C$, $H$, $W$, and $D$ denote channel, height, width, and depth, respectively. Then $Q^{T}$ and $K$ are multiplied, and a softmax layer is applied to encode the feature relationships. The resulting attention weights are multiplied with the feature map $V$ through matrix multiplication. This process adjusts the features of each voxel according to their importance, creating voxel-level attention-enhanced features, which are then reshaped into $R^{C \times H \times W \times D}$. Finally, the result is aggregated with the input feature map through a residual connection.

Figure 4.

Adaptive feature selection module structure.
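The attention computation in the adaptive feature selection module can be sketched on flattened feature maps as follows. Several points are assumptions rather than details given in the text: the spatial axes are flattened so that each map is a $C \times N$ array with $N = H \cdot W \cdot D$, the softmax is taken over each row of $Q^{T}K$, and the three convolutional projections are replaced by the identity in the example. The name `voxel_attention` is hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def voxel_attention(q, k, v, x):
    """Attention over flattened voxels with a residual connection.

    q, k, v, x: C x N nested lists (N = H*W*D flattened voxels).
    Returns the N x N attention map and the C x N output.
    """
    channels, n = len(q), len(q[0])
    # energy[i][j] is the (i, j) entry of Q^T K.
    energy = [
        [sum(q[c][i] * k[c][j] for c in range(channels)) for j in range(n)]
        for i in range(n)
    ]
    attn = [softmax(row) for row in energy]
    # Weighted sum of V over voxels, plus the input as a residual term.
    out = [
        [sum(attn[i][j] * v[c][j] for j in range(n)) + x[c][i] for i in range(n)]
        for c in range(channels)
    ]
    return attn, out


x = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0]]  # C = 2 channels, N = 3 voxels
attn, out = voxel_attention(x, x, x, x)  # identity projections (assumption)

# With an all-zero query every voxel attends uniformly to all others.
zeros = [[0.0] * 3 for _ in range(2)]
uniform_attn, uniform_out = voxel_attention(zeros, x, x, x)
```

The output keeps the $C \times N$ shape of the input, so it can be reshaped back to $C \times H \times W \times D$ as the text describes.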

3D triple-attention dynamic convolution

When different samples use the same model parameters in a parameter-invariant network, there is a risk of overfitting the segmentation network, leading to unreliable pseudo-labels. To address this issue, Lei et al.⁴⁰ proposed a dynamic convolution-based bidirectional attention component. Inspired by this, this paper designs a novel 3D TADC module from another perspective, as shown in Figure 5. Specifically, dynamic convolution allows the convolution kernel parameters to be adaptively adjusted based on different input data, enabling the model to handle diverse samples more effectively. This reduces the risk of overfitting while enhancing the model's ability to extract and understand features. Additionally, a 3D triple-attention module is integrated before the dynamic convolution in the network to address the issue of limited feature expressivity.

Figure 5.

3D Triple-attention dynamic convolutional module.

The TADC module is shown in Figure 5. Its input is a feature map $x_{in} \in R^{C \times H \times W}$, where $C$ denotes the number of channels and $H$ and $W$ denote the height and width, respectively. First, the feature map is processed by the 3D triple-attention module, as shown in Figure 6. Next, the dynamic convolution process consists of four steps. (1) Feature map transformation: a global average pooling operation is applied to the refined feature maps from the previous step to obtain the feature map $x_{2} \in R^{C \times 1 \times 1}$. (2) Dimensionality reduction and weight calculation: a $1 \times 1 \times 1$ convolution reduces the dimensionality of $x_{2}$, followed by a softmax operation that computes the weight coefficients $p \in R^{N \times 1 \times 1}$, where $N$ is a hyperparameter representing the number of convolution kernels. The softmax function normalizes the kernel weights into a probability distribution, ensuring that they sum to 1. (3) Dynamic kernel generation: the coefficients $p$ are multiplied with the $N$ convolution kernels one by one, and the weighted kernels are summed to obtain the dynamic convolution kernel. (4) Kernel weight definition: because the combination of kernels is adjusted dynamically, the final kernel can better adapt to the characteristics of the input data. The dynamically generated convolutional kernel weight $w$ is defined as:

$$w = \sum_{i=1}^{N} (p_{i} \cdot conv_{i}) \tag{6}$$

where $p_{i}$ is the $i$-th coefficient of $p$, each coefficient lies between 0 and 1, and all coefficients sum to 1; $conv_{i}$ denotes the weight of the $i$-th convolutional kernel.
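Equation (6) amounts to a softmax-weighted sum of candidate kernels. The short Python sketch below shows the aggregation step only; the pooling and $1 \times 1 \times 1$ convolution that produce the logits are omitted, kernels are represented as flat lists, and the names `softmax` and `dynamic_kernel` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_kernel(logits, kernels):
    """Aggregate N candidate kernels into one, as in Equation (6).

    logits: N input-dependent scores (assumed to come from the pooled,
    dimension-reduced features). kernels: N flat lists of equal length.
    """
    p = softmax(logits)
    size = len(kernels[0])
    w = [sum(p_i * kernel[j] for p_i, kernel in zip(p, kernels)) for j in range(size)]
    return p, w


# Two candidate kernels with equal scores are simply averaged.
p, w = dynamic_kernel([0.0, 0.0], [[1.0, 0.0, -1.0], [3.0, 2.0, 1.0]])
```

Because only the $N$ coefficients depend on the input, a different kernel is obtained for each sample while the candidate kernels themselves remain ordinary learned parameters.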

Figure 6.

Designed 3D triple-attention module structure.

The dependencies between the spatial and channel dimensions are not effectively captured since the channel and spatial dimensions are processed separately, with features computed independently in each dimension. Inspired by the 2D triadic attention module,⁴² we propose a novel 3D cross-dimensional triple attention mechanism to enhance the correlation between different locations within the same layer of feature maps in the spatial dimension. This is achieved by capturing the cross-dimensional interactions between the spatial and channel dimensions. The structure of the 3D triadic attention module is illustrated in Figure 6.

The module models spatial and channel attention on the input tensor through three branches. The branches capture cross-dimensional interactions between the channel dimension C and the spatial dimensions W and H, as well as interactions among the spatial dimensions along the image height and width. The attention-refined tensors produced by the three branches are then averaged. In addition, a Z-Pool is applied in each branch: it compresses the multi-channel input into two channels by concatenating the results of max pooling and average pooling. This reduces the computational cost while preserving a rich representation of the tensor. Z-Pool is expressed in Equation (7).

Z\text{-}Pool (x) = [MaxPool_{0d} (x), AvgPool_{0d} (x)]

(7)

For a given input feature x, $MaxPool_{0d}$ denotes max pooling along dimension 0 of x; similarly, $AvgPool_{0d}$ denotes average pooling along dimension 0.
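As a small illustration of Equation (7), the sketch below applies Z-Pool to a feature stored as a list of channels, where dimension 0 is the channel axis and each channel is flattened to a list of values. This storage layout and the function name `z_pool` are assumptions made for the example only.

```python
def z_pool(x):
    """Equation (7): reduce dimension 0 to two maps, [max, mean].

    x is a list of C channels, each a flat list of V values.
    """
    n_channels = len(x)
    n_values = len(x[0])
    max_map = [max(x[c][v] for c in range(n_channels))
               for v in range(n_values)]
    avg_map = [sum(x[c][v] for c in range(n_channels)) / n_channels
               for v in range(n_values)]
    return [max_map, avg_map]

# Three channels with two positions each are compressed to two channels.
features = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
pooled = z_pool(features)  # [[3.0, 4.0], [2.0, 2.0]]
```

Whatever the number of input channels, the output always has two, which is why the subsequent convolution stays cheap.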

In each branch, the pooled feature passes through a convolutional layer with kernel size $7 \times 7 \times 7$ and a sigmoid activation layer to generate that branch's attention weights. The refined tensors of the three branches are then averaged, and the output tensor is obtained by applying the Z-Pool, a convolutional layer, and the activation function to this average. The whole computation is shown in Equation (8).

Y = σ (ψ (Z\text{-}Pool (\frac{1}{3} (\overline{X_{1} σ (ψ_{1} ({\hat{X}}_{1}))} + \overline{X_{2} σ (ψ_{2} ({\hat{X}}_{2}))} + \overline{X_{3} σ (ψ_{3} ({\hat{X}}_{3}))}))))

(8)

where $X_{1}$ denotes the input X rotated 90 degrees counterclockwise along the H-axis in the first branch, giving a tensor of shape $W \times H \times D \times C$; $X_{2}$ denotes the input X rotated 90 degrees counterclockwise along the W-axis in the second branch, giving a tensor of shape $H \times D \times C \times W$; and $X_{3}$ denotes the input X rotated 90 degrees counterclockwise along the C-axis in the third branch, giving a tensor of shape $H \times W \times D \times C$. ${\hat{X}}_{1}$, ${\hat{X}}_{2}$, and ${\hat{X}}_{3}$ denote the results of the Z-Pool operation on the rotated tensor in each of the three branches; $ψ_{1}$, $ψ_{2}$, and $ψ_{3}$ are standard convolutional layers with a kernel size of $7 \times 7 \times 7$; and $σ$ denotes the sigmoid activation function that produces the attention weights. The overline indicates that the output features are rotated 90 degrees clockwise so that the output shape matches the input shape.

Feature selection and fusion module

To retain features that carry important information, reduce redundancy, and improve the expressiveness of the features, we draw on the work of Yang et al.⁴³ and Xing et al.⁴⁴ and design a feature selection and fusion module (FSFM), shown in Figure 7. The module fuses features by dynamically selecting important ones across both the channel and spatial dimensions. In the selection stage, the channel dynamic selection mechanism retains the most task-relevant feature channels based on the information in the feature map, thereby enhancing the prominence of important features. The spatial dynamic selection mechanism, in turn, does not treat the entire feature map equally; instead, it focuses on key regions. Finally, the features selected along the channel and spatial dimensions are integrated, and spatial convolution is applied to the fused features to enhance their representation in the spatial dimension.

Figure 7.

Designed feature selection and fusion module.

Specifically, the input feature maps $F_{1}^{l}$ and $F_{2}^{l}$ are first concatenated along the channel dimension so that their features are merged. To ensure that the fused features can be used effectively in subsequent processing, the number of channels must be reduced. This is achieved through cascaded average pooling, convolutional layers, and ReLU, which reduce the channel dimension while preserving important features and thereby make the fusion more effective. The resulting $w_{ch}$ is expressed in Equation (9).

w_{ch} = Conv_{1} (AVGPool ([F_{1}^{l}; F_{2}^{l}]))

(9)

where $F_{1}^{l}$ and $F_{2}^{l}$ represent the input feature maps, $AVGPool$ stands for average pooling, and $Conv_{1}$ indicates a convolution operation. $w_{ch}$ denotes the global channel information extracted from the two input feature maps.

Subsequently, the fused features are selected using the global channel information. This information is first used to assess the importance of each feature channel, guiding the processing and refinement of the features. A $1 \times 1 \times 1$ convolutional layer, guided by the global channel information, then decides which feature channels are retained. The resulting $F^{l}$ is expressed in Equation (10).

F^{l} = Conv_{1} (w_{ch} \otimes [F_{1}^{l}; F_{2}^{l}])

(10)

where $F^{l}$ denotes the features generated by selecting the important information, and $w_{ch}$, $F_{1}^{l}$, and $F_{2}^{l}$ have the same meanings as in Equation (9).

In the spatial information extraction stage, $1 \times 1 \times 1$ convolutional layers followed by Sigmoid activation are applied to the input feature maps $F_{1}^{l}$ and $F_{2}^{l}$ to extract global spatial information, as shown in Equation (11). This process captures the spatial dependencies between the local feature maps, so that important spatial information is emphasized in the feature maps. The resulting feature ${\hat{F}}^{l}$ is given by Equation (12).

w_{sp} = Sigmoid (Conv_{1} (F_{1}^{l}) \oplus Conv_{1} (F_{2}^{l}))

(11)

{\hat{F}}^{l} = w_{sp} \otimes F^{l}

(12)

where $w_{sp}$ denotes the global spatial information and ${\hat{F}}^{l}$ denotes the fused features. $Conv_{1}$, $F_{1}^{l}$, and $F_{2}^{l}$ have the same meanings as in Equation (9), and $F^{l}$ is the same as in Equation (10).
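The selection steps in Equations (9) to (12) can be traced with a small sketch. It rests on several assumptions: features are stored as lists of channels with the spatial positions flattened, $\otimes$ is read as element-wise multiplication with broadcasting, $\oplus$ as element-wise addition, every $Conv_{1}$ is a bias-free pointwise convolution with its own weights, and the ReLU mentioned in the text is omitted because Equation (9) does not show it. All function and parameter names are hypothetical.

```python
import math

def conv1x1(x, weight):
    """Pointwise convolution: x has C_in channels of V values,
    weight is a C_out x C_in matrix (no bias, by assumption)."""
    n_values = len(x[0])
    return [[sum(w_oi * x[i][v] for i, w_oi in enumerate(row))
             for v in range(n_values)]
            for row in weight]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fsfm_select(f1, f2, w_pool, w_fuse, w_s1, w_s2):
    cat = f1 + f2  # channel-wise concatenation [F1; F2]
    # Eq. (9): global average pooling per channel, then a convolution.
    pooled = [[sum(ch) / len(ch)] for ch in cat]
    w_ch = conv1x1(pooled, w_pool)
    # Eq. (10): rescale each channel by w_ch, then reduce the channels.
    scaled = [[w_ch[c][0] * val for val in ch] for c, ch in enumerate(cat)]
    f = conv1x1(scaled, w_fuse)
    # Eq. (11): one spatial weight per position from both inputs.
    s1 = conv1x1(f1, w_s1)[0]
    s2 = conv1x1(f2, w_s2)[0]
    w_sp = [sigmoid(a + b) for a, b in zip(s1, s2)]
    # Eq. (12): apply the spatial weights to every channel of F.
    f_hat = [[w_sp[v] * ch[v] for v in range(len(ch))] for ch in f]
    return w_ch, w_sp, f_hat

# Toy input: one channel per feature map, two spatial positions.
f1 = [[1.0, 3.0]]
f2 = [[2.0, 2.0]]
w_ch, w_sp, f_hat = fsfm_select(
    f1, f2,
    w_pool=[[1.0, 0.0], [0.0, 1.0]],  # identity, illustrative only
    w_fuse=[[0.5, 0.5]],              # 2 channels -> 1 channel
    w_s1=[[0.0]], w_s2=[[0.0]],       # zero logits give w_sp = 0.5
)
```

In the toy example the channel weights are both 2.0, the reduced feature is [3.0, 5.0], and the spatial gate of 0.5 halves it to [1.5, 2.5].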

Finally, the fused features are fed into two parallel convolutional blocks with kernels of size $3 \times 3 \times 3$ and $1 \times 1 \times 1$. The feature maps produced by the two blocks are then multiplied element by element. This multiplication acts as a gating mechanism that controls which information passes through and is retained, while irrelevant information is suppressed. Next, another convolutional block with kernels of size $1 \times 1 \times 1$ extracts higher-level features from the gated result, and a residual connection adds the input features back. The final output feature $F_{out}$ is expressed in Equation (13).

F_{out} = {\hat{F}}^{l} + C^{3 \times 3 \times 3} (C^{3 \times 3 \times 3} ({\hat{F}}^{l}) \cdot C^{1 \times 1 \times 1} ({\hat{F}}^{l}))

(13)

where $F_{out}$ denotes the final output feature map, $C^{3 \times 3 \times 3}$ denotes the convolution operation with a kernel of size $3 \times 3 \times 3$, $C^{1 \times 1 \times 1}$ denotes the convolution operation with a kernel of size $1 \times 1 \times 1$, and ${\hat{F}}^{l}$ has the same meaning as in Equation (12).

Results and analysis

Dataset

This study uses the MosMedData dataset,⁴⁵ provided by the Moscow City-owned Hospital in Russia. The dataset comprises five subsets (CT-0 to CT-4) with 1110 lung CT images of anonymous healthy individuals and COVID-19 patients: 42% male, 56% female, and 2% with missing gender annotation. Patient ages range from 18 to 97 years. CT-0 (254 samples) contains normal lung CT images without COVID-19 infection. CT-1 to CT-4 (856 samples) contain lung CT images with varying COVID-19 infection severity, which increases with the subset number; 50 of these samples were annotated by experts and converted into ground-truth binary masks (infected areas = 1, non-infected = 0). For the semi-supervised segmentation task, data from CT-1 to CT-4 were used. Of the 50 annotated samples, 32 were allocated to training, 8 to validation, and 10 to testing (via random stratified sampling). The remaining 806 unlabeled samples were incorporated into training for semi-supervised learning.

Evaluation metrics

In the experiments, multiple metrics are used to evaluate the segmentation results from different perspectives: Dice, Jaccard, Normalized Surface Distance (NSD), and Average Distance of Boundaries (ADB). Dice and Jaccard measure set similarity, with values ranging from 0 to 1. A value closer to 1 indicates higher similarity between the prediction and the ground truth, and thus better segmentation performance. The Dice and Jaccard coefficients are defined in Equations 14 and 15:

Dice = \frac{2 \cdot | P \cap G |}{| P | + | G |}

(14)

Jaccard = \frac{| P \cap G |}{| P \cup G |}

(15)

where P represents the segmentation result and G represents the manually annotated ground truth. $| P |$ and $| G |$ represent the number of elements in the segmentation result region and the ground truth region, respectively. $| P \cap G |$ represents the number of elements in the intersection of the segmentation result and the ground truth, and $| P \cup G |$ represents the number of elements in their union.
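Equations (14) and (15) translate directly into set operations. The sketch below represents each mask as a set of foreground voxel coordinates; this representation, and the convention of returning 1.0 when both masks are empty, are assumptions of the example.

```python
def dice(pred, truth):
    """Equation (14) on sets of foreground voxel coordinates."""
    denom = len(pred) + len(truth)
    if denom == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumption)
    return 2.0 * len(pred & truth) / denom

def jaccard(pred, truth):
    """Equation (15) on sets of foreground voxel coordinates."""
    union = len(pred | truth)
    if union == 0:
        return 1.0  # same convention as above (assumption)
    return len(pred & truth) / union

# Three predicted voxels, three annotated voxels, two in common.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
truth = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
d = dice(pred, truth)     # 4 / 6
j = jaccard(pred, truth)  # 2 / 4
```

The two scores are linked by Jaccard = Dice / (2 - Dice), so Jaccard is never larger than Dice for the same prediction.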

The NSD metric is used to measure the proximity between the segmentation result and the ground truth in terms of boundaries. The mathematical formula for NSD is shown in Equation 16:

NSD = \frac{| \partial G \cap B_{\partial P}^{(τ)} | + | \partial P \cap B_{\partial G}^{(τ)} |}{| \partial G | + | \partial P |}

(16)

where $\partial G$ represents the boundary of the ground truth and $\partial P$ represents the boundary of the segmentation result. $τ$ represents the tolerance. Following the settings in the experiments by Wang et al.,³⁹ $τ$ is set to 1 mm for lung segmentation and 3 mm for lesion segmentation. $B_{\partial P}^{(τ)}$ represents the region composed of all points whose distance to the boundary $\partial P$ of the segmentation result is less than or equal to $τ$. Similarly, $B_{\partial G}^{(τ)}$ represents the region composed of all points whose distance to the boundary $\partial G$ of the ground truth is less than or equal to $τ$.
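A brute-force sketch of Equation (16) is given below. It assumes that each boundary is available as a list of points in physical coordinates (mm) and compares all pairs of points; practical implementations typically use distance transforms instead. The function names are hypothetical.

```python
import math

def within_tolerance(points, reference, tau):
    """Count points that lie within distance tau of any reference point."""
    return sum(1 for p in points
               if any(math.dist(p, q) <= tau for q in reference))

def nsd(boundary_p, boundary_g, tau):
    """Equation (16) on two lists of boundary points (coordinates in mm)."""
    total = len(boundary_p) + len(boundary_g)
    if total == 0:
        return 1.0  # no boundary in either mask (assumed convention)
    hits = (within_tolerance(boundary_g, boundary_p, tau)
            + within_tolerance(boundary_p, boundary_g, tau))
    return hits / total

# With tau = 3 mm, one point of each boundary has a close counterpart.
boundary_p = [(0.0, 0.0, 0.0), (0.0, 0.0, 5.0)]
boundary_g = [(0.0, 0.0, 1.0), (0.0, 0.0, 10.0)]
score = nsd(boundary_p, boundary_g, tau=3.0)  # 2 / 4
```

Raising the tolerance can only keep or increase the score, which is why the tolerance must be reported together with the NSD value.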

The ADB metric evaluates the average distance between the boundary of the segmentation result and that of the ground truth. The lower its value, the better the two boundaries fit and the better the segmentation. ADB is defined in Equations 17–19:

ADB = \frac{1}{2} \left( \frac{\sum_{v_{i} \in Φ_{P}} h (v_{i}, Φ_{G})}{| G |} + \frac{\sum_{v_{j} \in Φ_{G}} h (v_{j}, Φ_{P})}{| P |} \right)

(17)

h (v_{i}, Φ_{G}) = \min_{v_{j} \in Φ_{G}} dist (v_{i}, v_{j})

(18)

h (v_{j}, Φ_{P}) = \min_{v_{i} \in Φ_{P}} dist (v_{j}, v_{i})

(19)

where $Φ_{P}$ and $Φ_{G}$ are the surfaces of the segmentation result and the ground truth, respectively, and $v_{i}$ and $v_{j}$ are vertices on $Φ_{P}$ and $Φ_{G}$, respectively. $h (v_{i}, Φ_{G})$ is the Euclidean distance between the vertex $v_{i}$ on the surface of the segmentation result and the nearest vertex $v_{j}$ on the surface $Φ_{G}$ of the ground truth. Similarly, $h (v_{j}, Φ_{P})$ is the Euclidean distance between the vertex $v_{j}$ on the surface of the ground truth and the nearest vertex $v_{i}$ on the surface $Φ_{P}$ of the segmentation result. $| P |$ and $| G |$ have the same meanings as in Equations 14 and 15.
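The directed distances in Equations (18) and (19) and their combination in Equation (17) can be sketched as follows. The surfaces are assumed to be lists of vertex coordinates. The two normalizers are exposed as arguments: by default the sketch divides each sum by the number of vertices it runs over, which is a common convention and an assumption here, while the values prescribed by Equation (17) can be passed in explicitly. The function and parameter names are hypothetical.

```python
import math

def nearest_distance(v, surface):
    """Equations (18)/(19): distance from v to the closest surface vertex."""
    return min(math.dist(v, u) for u in surface)

def adb(surface_p, surface_g, norm_p_to_g=None, norm_g_to_p=None):
    """Symmetric average boundary distance in the form of Equation (17).

    norm_p_to_g divides the sum over the prediction surface and
    norm_g_to_p divides the sum over the ground-truth surface. Both
    default to the respective vertex counts (an assumption).
    """
    if norm_p_to_g is None:
        norm_p_to_g = len(surface_p)
    if norm_g_to_p is None:
        norm_g_to_p = len(surface_g)
    p_to_g = sum(nearest_distance(v, surface_g) for v in surface_p)
    g_to_p = sum(nearest_distance(v, surface_p) for v in surface_g)
    return 0.5 * (p_to_g / norm_p_to_g + g_to_p / norm_g_to_p)

# Every vertex is exactly 1 unit from the other surface.
surface_p = [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)]
surface_g = [(0.0, 0.0, 1.0)]
value = adb(surface_p, surface_g)  # 0.5 * (2 / 2 + 1 / 1)
```

Identical surfaces give a value of 0, and the value grows as the two boundaries move apart.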

Experimental setting

The network model is implemented in the PyTorch deep learning framework. The experiments were conducted on a computer running Windows 10, equipped with a GeForce RTX 3090 graphics card (24 GB) and an Intel Core i9-11900K processor. The Adam optimizer is used for training with a learning rate of 0.0008. Because of the limited GPU memory, the batch size is set to 1, and training runs for 200 epochs. An early-stopping strategy is also employed during training to prevent overfitting.
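The early-stopping strategy is not specified further in the text. One common form, shown here purely as an assumption, stops training once the validation score has failed to improve for a fixed number of consecutive epochs. The `patience` value, the class name, and the validation scores below are all hypothetical.

```python
class EarlyStopping:
    """Stop when the validation score has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = None
        self.bad_epochs = 0

    def step(self, val_score):
        """Record one epoch; return True when training should stop."""
        if self.best is None or val_score > self.best:
            self.best = val_score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

MAX_EPOCHS = 200                     # epoch budget stated in the text
stopper = EarlyStopping(patience=3)  # patience is a hypothetical value
val_scores = [0.50, 0.55, 0.54, 0.53, 0.52, 0.60]  # made-up validation Dice
stopped_at = None
for epoch, score in enumerate(val_scores[:MAX_EPOCHS]):
    if stopper.step(score):
        stopped_at = epoch
        break
```

In this toy run the score peaks at epoch 1 and training halts at epoch 4, after three epochs without improvement.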

Ablation studies

In the ablation studies, we use MDA-CNN as the baseline model and conduct four sets of experiments to assess the impact of each designed module on segmentation performance. The experimental setups are as follows: (1) using MDA-CNN as the baseline network; (2) adding only the feature selection and fusion module (FSFM) to the baseline network; (3) adding a discriminator network consisting of adaptive feature selection and the TADC module to the baseline network; and (4) adding the FSFM to the setup in (3), which is the final model in this paper.

Table 1 shows that adding the FSFM to the baseline MDA-CNN improves all four segmentation metrics (Dice, Jaccard, NSD, and ADB), indicating better segmentation performance than the baseline without the FSFM. When a discriminator network with adaptive feature selection and TADC is added to the baseline and trained alternately with the segmentation network, Dice, Jaccard, and NSD rise by 2.83, 3.10, and 3.24 percentage points, respectively, and ADB falls by 0.89. Finally, the full method improves Dice, Jaccard, and NSD by 3.83, 3.97, and 3.14 percentage points over the baseline, although its ADB (4.92) is slightly higher than that of the baseline (4.80).

Table 1.

Ablation study with semi-supervised segmentation algorithm (mean ± standard deviation).

Method	Dice	Jaccard	NSD	ADB
BASE	60.58 ± 12.31	44.67 ± 13.85	85.01 ± 8.80	4.80 ± 4.67
BASE + FSFM	62.62 ± 12.53	46.89 ± 14.32	85.57 ± 12.56	4.42 ± 5.30
BASE + TADC	63.41 ± 12.56	47.77 ± 14.66	88.25 ± 6.64	3.91 ± 3.82
BASE + TADC + FSFM	64.41 ± 11.29	48.64 ± 13.69	88.15 ± 8.17	4.92 ± 4.12
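The improvements quoted for the ablation study are absolute differences between the mean values in Table 1, which the short check below reproduces. The dictionary copies the table means; the third and fourth entries correspond to the rows with the TADC-based discriminator. For ADB a negative difference is an improvement because lower values are better.

```python
# Mean values copied from Table 1 (standard deviations omitted).
table1 = {
    "BASE": {"Dice": 60.58, "Jaccard": 44.67, "NSD": 85.01, "ADB": 4.80},
    "BASE + FSFM": {"Dice": 62.62, "Jaccard": 46.89, "NSD": 85.57, "ADB": 4.42},
    "BASE + TADC": {"Dice": 63.41, "Jaccard": 47.77, "NSD": 88.25, "ADB": 3.91},
    "BASE + TADC + FSFM": {"Dice": 64.41, "Jaccard": 48.64, "NSD": 88.15, "ADB": 4.92},
}

def gains(method, baseline="BASE"):
    """Absolute difference of each metric relative to the baseline row."""
    return {metric: round(table1[method][metric] - table1[baseline][metric], 2)
            for metric in table1[baseline]}

full_model = gains("BASE + TADC + FSFM")
discriminator_only = gains("BASE + TADC")
```

For the full model this gives +3.83 (Dice), +3.97 (Jaccard), and +3.14 (NSD), matching the figures reported in the text.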

Comparative experiments

Comparison experiment of adding Gaussian noise at different locations

Gaussian noise is a common type of interference, so adding it to the inputs is an effective way to test the algorithm's ability to suppress random noise and to generalize in noisy environments. However, too much noise can interfere with model training and degrade segmentation performance. This is especially problematic when generating reliable pseudo-labels for unlabeled data in semi-supervised training, because excessive noise reduces the accuracy of the segmentation results. Segmentation performance was compared across four scenarios: (1) adding Gaussian noise only to the labeled image data input of the student model, (2) adding Gaussian noise only to the unlabeled image data input of the teacher model, (3) adding Gaussian noise to the unlabeled image data input of both the student and teacher models, and (4) not adding any Gaussian noise to the unlabeled image data of either the student or teacher models. The results are summarized in Table 2. Adding noise only to the student input gives the best Dice, Jaccard, and NSD (64.41, 48.64, and 88.15), whereas adding noise to both inputs gives the worst results on all three metrics.

Table 2.

Comparative experiments for different locations of noise (mean ± standard deviation).

Noise on student input	Noise on teacher input	Dice	Jaccard	NSD
×	×	62.71 ± 12.13	46.89 ± 13.80	86.34 ± 10.30
√	×	64.41 ± 11.29	48.64 ± 13.69	88.15 ± 8.17
×	√	63.73 ± 11.53	47.90 ± 13.34	84.88 ± 10.73
√	√	60.61 ± 15.22	45.19 ± 15.80	83.03 ± 17.70
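The four noise configurations compared in Table 2 can be expressed as two switches, one per model input. The sketch below is only illustrative: the noise standard deviation `sigma`, the flattened list representation of a volume, and the function names are assumptions, since the text does not state them.

```python
import random

def add_gaussian_noise(volume, sigma, rng):
    """Return a copy of the flattened volume with zero-mean Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in volume]

def make_inputs(volume, noise_student, noise_teacher, sigma=0.1, seed=0):
    """Build the student and teacher inputs for one noise configuration.

    sigma and seed are hypothetical values chosen for the example.
    """
    rng = random.Random(seed)
    student_in = (add_gaussian_noise(volume, sigma, rng)
                  if noise_student else list(volume))
    teacher_in = (add_gaussian_noise(volume, sigma, rng)
                  if noise_teacher else list(volume))
    return student_in, teacher_in

# Configuration from the second row of Table 2: noise on the student only.
volume = [0.0, 0.5, 1.0, 0.25]
student_in, teacher_in = make_inputs(volume, noise_student=True,
                                     noise_teacher=False)
```

With this configuration the teacher sees the clean volume while the student sees a perturbed copy, so the consistency constraint compares predictions on two different views of the same data.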

Table 3.

Comparison of the effect of replacing normal convolution with dynamic convolution.

	Dice	Jaccard	NSD	ADB
1-layer	62.41 ± 13.80	46.91 ± 15.54	87.06 ± 9.13	3.90 ± 4.53
2-layer	63.26 ± 11.52	47.41 ± 13.61	85.55 ± 8.97	6.96 ± 7.18
3-layer	63.54 ± 11.37	47.65 ± 13.17	84.99 ± 10.61	5.05 ± 5.79
4-layer	64.41 ± 11.29	48.64 ± 13.69	88.15 ± 8.17	4.92 ± 4.12

Comparative experiments with varying numbers of dynamic convolutional layers

In this experiment, we fix the hyperparameter K = 4 and vary the number of TADC layers in the discriminator to validate the replacement of standard convolution layers with the TADC module. Specifically, we replace one, two, three, and four convolutional layers in the discriminator network with TADC modules. The results for each configuration are presented in Table 3. Dice and Jaccard improve progressively as more ordinary convolution layers are replaced, and the four-layer configuration also attains the best NSD, although NSD and ADB do not change monotonically with the number of layers. A likely reason for the overall gain is that each additional TADC layer lets the network extract more complex and abstract features, enabling the discriminator to assess the quality of different samples more effectively. Two factors contribute. On the one hand, the triple-attention module models dependencies between the spatial and channel dimensions, enhancing the network's representational power and its ability to capture subtle differences in the data. On the other hand, dynamic convolution allows the convolution kernel parameters to be adjusted for each sample. As a result, stacking TADC modules progressively strengthens the discriminator: it enables the network to learn from simple, low-level features up to more complex, high-level features, thereby improving its discriminative power.

Comparative experiments with different temperature adjustment strategies

The discriminator's first four layers use TADC modules with a dynamic temperature ( $τ$ ) adjustment strategy. This allows the model to adaptively adjust attention during training, speeding up convergence to optimal performance. In experiments, $τ$ starts at 34, decreases by 3 per epoch, and stops at 1. A control group with fixed $τ = 34$ was used for comparison. The dynamic $τ$ adjustment strategy works in three stages: Early ( $τ = 34$ ): Learns global features to avoid local optima. Middle: Shifts from global to local attention, enhancing detail sensitivity. Late ( $τ = 1$ ): Focuses on key features to improve segmentation accuracy. Results show dynamic $τ$ adjustment strategy boosts feature extraction flexibility and precision, validating its effectiveness in optimizing segmentation performance. The effect comparison between the dynamic temperature adjustment strategy and the constant temperature strategy is shown in Figure 8. It can be seen from the figure that the dynamic temperature adjustment strategy outperforms the constant temperature strategy in the values of the four metrics, and the model performance under the dynamic temperature adjustment strategy reaches the optimal level.

Figure 8.

Comparison of the effects of different temperature adjustment strategies.
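The temperature schedule described above (start at 34, decrease by 3 per epoch, stop at 1) can be written in a few lines. How $τ$ enters the attention computation is not spelled out in the text; the sketch assumes the usual dynamic-convolution formulation in which the logits are divided by $τ$ before the softmax. The function names are hypothetical.

```python
import math

TAU_START, TAU_STEP, TAU_MIN = 34.0, 3.0, 1.0  # values stated in the text

def temperature(epoch):
    """Linear decay of tau with the epoch index, clipped at TAU_MIN."""
    return max(TAU_MIN, TAU_START - TAU_STEP * epoch)

def softmax_with_temperature(logits, tau):
    """Softmax over logits / tau (assumed role of the temperature)."""
    scaled = [v / tau for v in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.0, -1.0, 0.5]
early = softmax_with_temperature(logits, temperature(0))   # tau = 34
late = softmax_with_temperature(logits, temperature(11))   # tau = 1
```

At $τ = 34$ the kernel coefficients are nearly uniform, so all kernels receive gradient early in training; at $τ = 1$ the distribution sharpens and the largest logit dominates.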

Comparative experiments with different values of K

The hyperparameter K in dynamic convolution determines the scale of the weight parameters, and therefore the number of weights to be learned and the complexity of weight updates during training. A larger K increases the number of weight parameters, leading to higher computational and memory demands. Different values of K also yield dynamic convolution weights with different structures and numerical distributions, which affects the subsequent 3D convolution on the input feature maps: different dynamic weights attend to and combine the input features differently, thereby influencing overall performance. In this experiment, we therefore fix the first four layers of the discriminator as TADC modules and compare the segmentation performance for different values of K, with the results presented in Table 4. The model performs best overall when K is set to 4, which gives the highest Dice, Jaccard, and NSD; only K = 8 attains a slightly lower ADB (4.77 versus 4.92).

Table 4.

Comparison of the effect of hyperparameter K at different values.

K	Dice	Jaccard	NSD	ADB
2	59.27 ± 17.63	44.12 ± 16.28	80.29 ± 20.22	10.14 ± 11.22
4	64.41 ± 11.29	48.64 ± 13.69	88.15 ± 8.17	4.92 ± 4.12
6	59.22 ± 15.53	43.78 ± 15.73	81.93 ± 16.35	9.44 ± 11.35
8	63.35 ± 13.34	47.82 ± 15.06	86.97 ± 6.78	4.77 ± 4.76
10	58.22 ± 14.68	42.70 ± 15.96	80.08 ± 14.64	11.06 ± 13.14

Impact of the FSFM at different layers of the network

The feature selection and fusion module (FSFM) proposed in this paper focuses on extracting global channel and global spatial information, which it uses to retain features that carry important information. It also fuses the input information from different feature maps to enhance network performance. To examine how the module's placement influences overall network performance, we incorporate the FSFM into various layers of the segmentation network. The results are presented in Figure 9. The FSFM performs best when placed in shallow layers, and its effectiveness diminishes in deep layers. A possible explanation is that shallow feature maps contain more low-level features such as edges and textures, which are more diverse and which the FSFM can filter and fuse effectively. In deep layers, by contrast, the features have become highly abstract, so the FSFM struggles to extract useful information from them and may even introduce noise. The placement of the FSFM in the network therefore has a large impact on overall performance.

Figure 9.

Comparison of the impacts of FSFM at various positions within the network.

Validity verification of the proposed module

To validate the effectiveness and generalization of the proposed TADC module and feature selection and fusion module (FSFM), comparative experiments are conducted on both the 3D U-Net⁴⁶ and DM²TNet³⁹ networks. The overall performance of both networks improves after integrating the triple-attention dynamic convolution, with notable increases in the Dice similarity coefficient and Jaccard index over the original networks. The improved 3D U-Net shows a 1.01% increase in the Dice similarity coefficient and a 1.53% increase in the Jaccard index, as shown in Figure 10(a). Similarly, the enhanced DM²TNet shows a 0.96% improvement in the Dice similarity coefficient and a 1.37% increase in the Jaccard index, as shown in Figure 11(a). This indicates that the TADC module improves the quality and expressiveness of the processed features.

Figure 10.

Validation of module validity based on 3D UNet.

Figure 11.

Validation of module validity based on DM²TNet.

In addition, the 3D U-Net and DM²TNet integrated with the feature selection and fusion module (FSFM) show improvements of 1.96% and 2.04% in the Dice similarity coefficient, respectively, with slight improvements observed across all other metrics, as shown in Figure 10(b) and Figure 11(b).

These results support the effectiveness of the proposed TADC and feature selection and fusion module (FSFM).

Comparative experiments at different output scales

In this study, the DM²TNet employs a multiscale supervised loss in the supervised training phase and a multiscale consistency loss in the unsupervised training phase. The student model is optimized by computing the combined loss, which is the sum of both these losses, throughout the training process. In this paper, we compare the model's segmentation performance under two different scenarios: single-scale output and multi-scale output. The results are presented in Figure 12. The results show that the method with multi-scale consistency constraints improves the Dice and Jaccard metrics by 0.94% and 0.86%, respectively, compared to the single-scale output. Multi-scale approaches are better suited for scenarios involving complex or multi-focal data, as the network can extract both low-level and high-level features by processing data at different scales. This is particularly important for segmentation tasks that require decision-making across varying contexts and backgrounds. Therefore, this method allows the segmentation network to better adapt to targets of varying scales, thereby enhancing the accuracy of recognition and prediction.

Figure 12.

Comparative experiments at different scales.

Visualization of segmentation results

To analyze and evaluate the segmentation performance in more detail, we randomly selected segmentation results for infected lung tissue for visualization. Figure 13(a) shows the segmentation results for smaller lesions, and Figure 13(b) shows those for larger lesions. Comparing the model's predictions with the ground truth and the original image reveals how well the segmented regions align with the lesions in the original image, as well as where they deviate from the ground truth. In the examples shown, the model detects even some of the smaller lesion areas in the lung images. Comparison with the expert annotations shows that, although small lesions are hard to distinguish because of factors such as imaging quality, the model still recognizes them, yielding a high degree of similarity between the segmentation results and the ground truth. Moreover, even where infected tissue has fuzzy boundaries and only subtle grayscale differences from the adjacent normal tissue, the proposed algorithm still identifies the key characteristics of the lesions and segments the infected areas accurately. Overall, the proposed method performs well on small and ambiguous lesions, which suggests potential for clinical application.

Figure 13.

Visualization of segmentation result.

Discussion

Experimental comparison

This study compares the proposed method with other state-of-the-art models that have also been evaluated on the MosMedData dataset, as presented in Table 5. The proposed 3D TADC and adaptive feature selection discriminator network allow the segmentation model to capture and model the spatial information in the data more accurately and to interpret complex three-dimensional structures, which results in better performance than the other methods, including both 2D and 3D approaches. Specifically, our method achieves a Dice value of 64.41, compared with 57.31 for UA-MT, 60.19 for DM²T-Net, and 56.03 for U-Net++. It also achieves a Jaccard value of 48.64, the highest in the table. Our method likewise attains the highest NSD (88.15) and the lowest ADB (4.92) among the methods that report these metrics. In summary, the proposed algorithm outperforms the compared methods across all four evaluation metrics, indicating that the design of the network architecture, together with the training strategy, effectively enhances the model's overall performance.

Table 5.

Comparison of results with other models and algorithms (- denotes not mentioned).

Method	Data type	DICE	Jaccard	NSD	ADB
GAS net ³²	3D	54.20 ± 22.4	-	-	-
GAN + UNet + Contrastive ⁴⁷	2D + 3D	58.4 ± 19.0	-	-	-
Inf-Net ¹³	2D	51.8 ± 0.31	-	-	-
TA-SegNet ⁴⁸	2D	62.30 ± 0.18	-	-	-
U-Net++ ⁴⁹	2D	56.03 ± 22.46	42.05 ± 20.39	77.71 ± 22.71	5.51 ± 6.94
3D U-Net ⁴⁶	3D	54.42 ± 23.51	40.69 ± 20.70	75.31 ± 27.09	8.21 ± 14.43
nn-UNet ⁵⁰	3D	56.30 ± 23.55	42.62 ± 21.31	76.45 ± 27.16	9.26 ± 18.25
UA-MT ¹⁵	3D	57.31 ± 20.53	42.87 ± 19.05	78.55 ± 21.42	6.89 ± 13.90
DM²T-Net ³⁹	3D	60.19 ± 19.22	45.56 ± 18.44	80.95 ± 20.99	6.55 ± 13.79
Our method	3D	64.41 ± 11.29	48.64 ± 13.69	88.15 ± 8.17	4.92 ± 4.12

Figure 14 horizontally shows segmentation results of 3D U-Net,⁴⁶ VNet,⁵¹ MDA-CNN³⁹ combined with VIT (Vision Transformer⁵²), our method (Ours), and the ground truth. Specifically, U-Net's results contain discrete, irregular regions with blurry boundaries, missing fine structures and struggling to capture small lesions or complex edges. VNet's segments are more compact but still lack details (e.g., incomplete small lesions), outperforming U-Net yet requiring further improvement. MDA-CNN + VIT's outputs are closer to ground truth but exhibit local edge inaccuracies (e.g., in magnified boxes) and poor complex-background feature discrimination. Our method's contours closely align with ground truth, enabling better restoration of small lesions and fine boundaries, and more precise capture of local details while maintaining overall structures.

Figure 14.

Visual comparison of segmentation results yielded by distinct methods on the MosmedData dataset.

Discussion of generalization performance

To assess the model's generalization ability and reliability, we tested it on the COVID-19-P20 dataset⁵³ using the best weights obtained from training on the MosMedData dataset. The two datasets differ substantially in both the extent of the infected area and imaging characteristics, so this validation evaluates the model's ability to adapt to data with different characteristics. We also compare the generalization performance of the proposed algorithm with that of other network models, and the results are presented in Table 6. The proposed algorithm achieves the highest Dice coefficient (58.81). In addition, we visualized the segmentation results for a randomly selected sample from the COVID-19-P20 dataset, as shown in Figure 15. In the figure, we zoomed in on the infected tissue region in the original image and compared the model's prediction with the ground truth. The figure shows that the model still makes accurate predictions even for complex or irregularly shaped infected regions. These results indicate that the model generalizes well and remains robust across datasets.

Figure 15.

Visualization of segmentation results for verifying generalization performance.

Table 6.

Comparison of results with other models and algorithms.

MosMedData → COVID-19-P20
Method	Dice
U-Net++ ⁴⁹	23.61 ± 17.31
UA-MT ¹⁵	50.24 ± 17.70
DM²T-Net ³⁹	51.36 ± 17.80
Ours	58.81 ± 24.00
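The "mean ± deviation" entries in Table 6 can be read as per-case Dice scores aggregated over the test scans. The following is a minimal Python sketch of such an aggregation, not the authors' evaluation code. The function names `dice_score` and `summarize_dice`, the toy masks, the use of the population standard deviation, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def dice_score(pred, truth):
    """Dice coefficient between two flattened binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    if total == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


def summarize_dice(cases):
    """Return (mean, population SD) of per-case Dice, scaled to 0-100."""
    scores = [100.0 * dice_score(pred, truth) for pred, truth in cases]
    return mean(scores), pstdev(scores)


# Hypothetical toy cases: (prediction, ground truth) as flattened 0/1 masks.
toy_cases = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # perfect overlap
    ([1, 0, 0, 0], [1, 1, 0, 0]),  # partial overlap
    ([0, 0, 1, 1], [1, 1, 0, 0]),  # no overlap
]
avg, spread = summarize_dice(toy_cases)
print(f"Dice: {avg:.2f} ± {spread:.2f}")
```

A large deviation relative to the mean, as in Table 6, indicates that per-case performance varies widely across scans, so the mean alone does not describe behavior on individual cases.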

Conclusion

In this study, we designed a novel pneumonia infection segmentation method for chest CTs that combines a dual multiscale mean teacher model with adversarial consistency learning. The method includes a discriminator network with spatial attention; because the discriminator and the segmentation network are trained adversarially, the quality of the predictions on unlabeled data improves continuously during training. Specifically, the discriminator assesses the prediction quality for both labeled and unlabeled inputs together with the original image, and thereby continuously pushes the segmentation network to generate higher-quality segmentation results. An adaptive feature selection module embedded in the discriminator adjusts the channel weights of the feature map to improve the feature expression ability of the network. In addition, using fixed model parameters for different samples may increase the risk of overfitting, and applying the same parameters to both labeled and unlabeled data is not well suited to semi-supervised training. We therefore use dynamic convolution in the semi-supervised framework. The proposed TADC module adjusts the convolution kernel dynamically so that the network adapts better to the features of different data, and it also enhances the feature representation capability of the model. Finally, we designed a feature selection and fusion module (FSFM) that enables the segmentation network to focus on important regions, thereby enhancing the prominence of key features. On the MosMedData dataset, the proposed adversarial consistency-based semi-supervised method improved on the baseline model by 3.83%, 3.97%, and 3.14% in Dice, Jaccard, and NSD, respectively, for the segmentation of pneumonia lesions.
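To make the idea of input-dependent convolution kernels concrete, the following minimal Python sketch shows generic dynamic convolution in the sense of attention over convolution kernels¹⁹: several candidate kernels are mixed with softmax weights computed from the input itself. It is a one-dimensional toy illustration under stated assumptions, not the TADC module. The function names, the scalar attention parameters `attn_weights` and `attn_bias`, and the example values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_conv1d(signal, kernels, attn_weights, attn_bias):
    """Mix K candidate kernels with input-dependent weights, then convolve.

    signal: list of floats; kernels: K lists of equal odd length.
    The attention logit for kernel k is attn_weights[k] * mean(signal) + attn_bias[k].
    Returns the zero-padded, same-length output and the attention weights.
    """
    pooled = sum(signal) / len(signal)  # global average pooling
    attention = softmax([w * pooled + b for w, b in zip(attn_weights, attn_bias)])
    size = len(kernels[0])
    # Aggregate the candidate kernels into one kernel for this input.
    mixed = [
        sum(attention[k] * kernels[k][i] for k in range(len(kernels)))
        for i in range(size)
    ]
    half = size // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    # Cross-correlation, as used in deep learning frameworks.
    output = [
        sum(mixed[j] * padded[i + j] for j in range(size))
        for i in range(len(signal))
    ]
    return output, attention


# Hypothetical example: an identity-like kernel and a doubling kernel.
kernels = [[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]
out_pos, attn_pos = dynamic_conv1d([1.0, 2.0, 3.0], kernels, [1.0, -1.0], [0.0, 0.0])
out_neg, attn_neg = dynamic_conv1d([-1.0, -2.0, -3.0], kernels, [1.0, -1.0], [0.0, 0.0])
print(attn_pos, attn_neg)
```

In this sketch the two inputs receive different attention weights, so each is filtered by a different effective kernel even though the stored parameters are shared.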
In future work, we will explore multimodal data fusion techniques and data augmentation methods tailored to the complex morphological characteristics of pneumonia lesions.

Footnotes

ORCID iDs

Yu Gu

Lidong Yang

Xin Liu

Ying Zhao

Ethics approval

This study utilized a public database of chest CT scans, from which the personal information of patients was removed. Therefore, this article does not contain any studies involving human participants or animals.

Consent to participate

This work utilized a public medical image dataset. Therefore, human subjects were not involved in this study.

Consent to publish

Not applicable: our analysis and framework are built on publicly accessible benchmark datasets.

Authors' note

Yu Gu is also affiliated with the Information Engineering College, Hebei University of Architecture, Zhangjiakou, Hebei 075000, China.

Author contributions

The first draft of the manuscript was written by Jianning Zang and Yu Gu. The review and editing of the manuscript were completed by Yu Gu, Xiaoqi Lu and Lidong Yang. The methodology of the study was completed by Jianning Zang, Baohua Zhang and Dahua Yu. The software in the study was completed by Jing Wang, Ying Zhao and Siyuan Tang. The validation of the study was completed by Jianjun Li, Xin Liu and Qun He.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: the National Natural Science Foundation of China (grant numbers: 62001255, 62161040, 62262048, 62441212, and 61841204); the Central Government Guides Local Science and Technology Development Fund Project of China (grant numbers: 2025ZY0007, 2021ZY0004, and 2022ZY0024); Hebei Natural Science Foundation (grant number: F2025404013); Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region (grant numbers: NJYT23057, NJYT22074, and NJYT23106); the Fundamental Research Funds for Inner Mongolia University of Science & Technology (grant numbers: 042, 019, and 101); Inner Mongolia Autonomous Region Natural Science Foundation (grant numbers: 2024MS06008, 2024LHMS06006, 2019MS06003, 2022MS06017, 2022MS06009, and 2015MS0604); Inner Mongolia Health Technology Program Project (grant number: 202201395); Baotou Health Science and Technology Program (grant number: wsjkkj2022120); Inner Mongolia College Science and Technology Research Project (grant number: NJZY145); Chunhui Program of the Ministry of Education of the People's Republic of China (Research on CAD for Lung Cancers with Big Image Data based on Hessian Dot Filter and PSO–SVM) (grant number: [2019]1383); 2024 Inner Mongolia Archives Technology Project (grant number: 2024–47); Inner Mongolia Autonomous Region college students innovation and entrepreneurship training projects (grant numbers: 202310130004 and s202410130004).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

The data that support the findings of this study are openly available in the MosMedData dataset at https://doi.org/10.48550/arXiv.2005.06465.

References

1. Zhang, et al. Automatic lung nodule detection using multi-scale dot nodule-enhancement filter and weighted support vector machines in chest computed tomography. PLoS ONE 2019; 14: e0210551.

2. Zhang, Yang, Tian, et al. Classification of benign and malignant pulmonary nodule based on local-global hybrid network. J Xray Sci Technol 2024; Preprint: 1–18.

3. Yang, et al. Multi-level 3D DenseNets for false-positive reduction in lung nodule detection based on chest computed tomography. Curr Med Imaging Formerly Curr Med Imaging Rev 2020; 16: 1004–1021.

4. Zheng, Lure, et al. Clinical and radiological features of novel coronavirus pneumonia. J Xray Sci Technol 2020; 28: 391–404.

5. Wang, Zhou, et al. Limitations of an automated embolism segmentation method in clinical practice. J Xray Sci Technol 2018; 26: 667–680.

6. Chi, Liu, et al. A survey of computer-aided diagnosis of lung nodules from CT scans using deep learning. Comput Biol Med 2021; 137: 104806.

7. Naseer, Tamoor, Azhar. Computer-aided COVID-19 diagnosis and a comparison of deep learners using augmented CXRs. J Xray Sci Technol 2022; 30: 89–109.

8. Yang, et al. Automatic lung nodule detection using a 3D deep convolutional neural network combined with a multi-scale prediction strategy in chest CTs. Comput Biol Med 2018; 103: 220–231.

9. Liu, Yang, et al. SSANet—novel residual network for computer-aided diagnosis of pulmonary nodules in chest computed tomography. Int J Imaging Syst Technol 2024; 34: e23176.

10. Xie, Zhang, et al. Predict pneumonia with chest X-ray images based on convolutional deep neural learning networks. J Intell Fuzzy Syst 2020; 39: 2893–2907.

11. Shi, Yang, et al. Multi-disease X-ray image classification of the chest based on global and local fusion adaptive networks. Curr Med Imaging Formerly Curr Med Imaging Rev 2024; 20: e15734056291283.

12. Ronneberger, Fischer, Brox. U-net: Convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5–9, 2015, proceedings, part III 18, pp. 234–241. Springer International Publishing.

13. Fan, Zhou, et al. Inf-net: automatic COVID-19 lung infection segmentation from CT images. IEEE Trans Med Imaging 2020; 39: 2626–2637.

14. Tarvainen, Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv Neural Inf Process Syst 2017; 30: 1195–1204.

15. Wang, et al. Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation. In: International conference on medical image computing and computer-assisted intervention, MICCAI, 2019, pp. 605–613.

16. Xia, Liu, Yang, et al. 3D semi-supervised learning with uncertainty-aware multi-view co-training. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 3646–3655.

17. Xiang, Qiu, Yang. FUSSNet: fusing two sources of uncertainty for semi-supervised medical image segmentation. Paper presented at: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2022, pp. 481–491.

18. Yin, et al. Uncertainty-aware pseudo-label and consistency for semi-supervised medical image segmentation. Biomed Signal Process Control 2023; 79: 104203.

19. Chen, Dai, Liu, et al. Dynamic convolution: attention over convolution kernels. IEEE Access 2020; 6: 11030–11039.

20. Jiang, Liu, et al. Hippocampus segmentation method applying coordinate attention mechanism and dynamic convolution network. Appl Sci 2023; 13: 7921.

21. Wang, Liu, et al. MDC-RHT: multi-modal medical image fusion via multi-dimensional dynamic convolution and residual hybrid transformer. Sensors 2024; 24: 4056.

22. Zhang, Zhu, Hallinan, et al. BoostMIS: Boosting medical image semi-supervised learning with adaptive pseudo labeling and informative active annotation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 20666–20676.

23. Rizve, Duarte, Rawat, et al. In defense of pseudo-labeling: An uncertainty-aware pseudo label selection framework for semi-supervised learning. arXiv preprint arXiv:2101.06329, 2021.

24. Cui, Liu, et al. Semisupervised brain lesion segmentation with an adapted mean teacher model. In: International conference on information processing in medical imaging, 2019, pp. 554–565. Springer.

25. Wang, et al. Ambiguity-selective consistency regularization for mean-teacher semi-supervised medical image segmentation. Med Image Anal 2023; 88: 102880.

26. Adiga, Dolz, Lombaert. Anatomically-aware uncertainty for semi-supervised image segmentation. Med Image Anal 2024; 91: 103011.

27. Wang, Chen, et al. A regularization-driven Mean Teacher model based on semi-supervised learning for medical image segmentation. Phys Med Biol 2022; 67: 175010.

28. Xiao, Deng, et al. Efficient combination of CNN and transformer for dual-teacher uncertainty-guided semi-supervised medical image segmentation. Comput Methods Programs Biomed 2022; 226: 107099.

29. Wang, et al. DSST: a dual student model guided student–teacher framework for semi-supervised medical image segmentation. Biomed Signal Process Control 2024; 90: 105890.

30. Zhang, Chen, et al. Dense GAN and multi-layer attention based lesion segmentation method for COVID-19 CT images. Biomed Signal Process Control 2021; 69: 102901.

31. Jiang, Chen, Loew, et al. COVID-19 CT image synthesis with a conditional generative adversarial network. IEEE J Biomed Health Inform 2021; 25: 441–452.

32. Cao, Jin, et al. Gasnet: Weakly-supervised framework for covid-19 lesion segmentation. arXiv preprint arXiv:2010.09456, 2020.

33. Liu. Medical image segmentation with generative adversarial semi-supervised network. Phys Med Biol 2021; 66: 245008.

34. Chen, Wen, et al. Collaborative and adversarial learning of focused and dispersive representations for semi-supervised polyp segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 3489–3498.

35. Zhang, Yang, Chen, et al. Deep adversarial networks for biomedical image segmentation utilizing unannotated images. In: Medical image computing and computer assisted intervention–MICCAI 2017: 20th international conference, Quebec City, QC, Canada, September 11–13, 2017, Proceedings, Part III 20, pp. 408–416. Springer International Publishing.

36. Yin, Luo, Zhou, et al. LDCNet: lightweight dynamic convolution network for laparoscopic procedures image segmentation. Neural Netw 2024; 170: 441–452.

37. Liao, Zhang, et al. Domain and content adaptive convolution for domain generalization in medical image segmentation, 2021.

38. Huang, Han, et al. Domain base dynamic convolution and distance map guidance for anterior mediastinal lesion segmentation. Knowl Based Syst 2024; 296: 111881.

39. Wang, Zhu, et al. Dual multiscale mean teacher network for semi-supervised infection segmentation in chest CT volume for COVID-19. IEEE Trans Cybernet 2023; 53: 6363–6375.

40. Lei, Zhang, et al. Semi-supervised medical image segmentation using adversarial consistency learning and dynamic convolution network. IEEE Trans Med Imaging 2023; 42: 1265–1277.

41. Dong, Dai, et al. A novel multi-attention, multi-scale 3D deep network for coronary artery segmentation. Med Image Anal 2023; 85: 102745.

42. Misra, Nalamada, Arasanipalai, et al. Rotate to attend: Convolutional triplet attention module. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 3139–3148.

43. Yang, Qiu, Zhang, et al. D-net: Dynamic large kernel with dynamic feature fusion for volumetric medical image segmentation. arXiv preprint arXiv:2403.10674, 2024.

44. Xing, Yang, et al. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. In: International conference on medical image computing and computer-assisted intervention, 2024, October, pp. 578–588. Cham: Springer Nature Switzerland.

45. Morozov, Andreychenko, Pavlov, et al. Mosmeddata: Chest ct scans with covid-19 related findings dataset. arXiv preprint arXiv:2005.06465, 2020.

46. Çiçek, Abdulkadir, Lienkamp, et al. 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: Medical image computing and computer-assisted intervention–MICCAI 2016: 19th international conference, Athens, Greece, October 17–21, 2016, Proceedings, Part II 19, pp. 424–432.

47. Shabani, Homayounfar, Vardhanabhuti, et al. Self-supervised region-aware segmentation of COVID-19 CT images using 3D GAN and contrastive learning. Comput Biol Med 2022; 149: 106033.

48. Mahmud, Alam, Chowdhury, et al. CovTANet: a hybrid tri-level attention-based network for lesion segmentation, diagnosis, and severity prediction of COVID-19 chest CT scans. IEEE Trans Ind Inf 2021; 17: 6489–6498.

49. Zhou, Rahman Siddiquee, Tajbakhsh, et al. Unet++: A nested u-net architecture for medical image segmentation. Paper presented at: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, 2018.

50. Isensee, Petersen, Kohl, et al. nnu-net: Breaking the spell on successful medical image segmentation. arXiv preprint arXiv:1904.08128, 1(1-8):2, 2019.

51. Milletari, Navab, Ahmadi S-A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: 2016 fourth international conference on 3D vision (3DV), 2016, pp. 565–571. IEEE.

52. Dosovitskiy, Beyer, Kolesnikov, et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

53. Wang, et al. Towards efficient covid-19 ct annotation: A benchmark for lung and infection segmentation. arXiv preprint arXiv:2004.12537, 2020.