Abstract
Deepfake medical images pose significant risks to clinical diagnosis and treatment planning owing to their high realism. Although many deepfake detection methods exist, their high computational cost limits practical clinical deployment. This paper proposes a lightweight and efficient deepfake detection framework that combines self-supervised contrastive learning with an attention-enhanced convolutional neural network. The proposed method uses a modified MobileNetV2 architecture integrated with Efficient Channel Attention (ECA) modules to enhance feature representation with minimal computational overhead. We employ a two-stage training strategy: self-supervised pre-training on unlabeled CT scans to learn robust features, followed by supervised fine-tuning for the final classification task. The proposed approach achieves an accuracy of 98.39% when trained from scratch. When the encoder is initialized with ImageNet pre-trained weights before self-supervised pre-training, performance improves to 99.87% accuracy and 100% specificity on the test set. This result surpasses the evaluated baselines and demonstrates the benefit of combining general-purpose ImageNet weights with domain-specific self-supervised learning. The lightweight nature of the proposed ECA-enhanced MobileNetV2 makes it a practical solution for resource-constrained clinical environments.
Keywords
Introduction
The integration of artificial intelligence into healthcare has led to rapid transformation in medical diagnostics. However, this progress is shadowed by the rise of deepfake technology, which uses generative models, such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2020) and diffusion models (Cao et al., 2024), to create synthetic images of alarming realism. In medical imaging, this poses a severe threat to patient safety and diagnostic integrity by enabling convincing manipulations such as tumor injection or removal (Ghoneim et al., 2018). Addressing this vulnerability is a critical challenge in computational intelligence.
Although Convolutional Neural Networks (CNNs) have shown success in general deepfake detection, their application to medical imaging faces significant hurdles (Rossler et al., 2019). First, their high accuracy often depends on large labeled datasets, which are scarce and expensive in the medical domain. Second, general-purpose detectors often fail to capture subtle modality-specific artefacts in medical deepfakes. Finally, many state-of-the-art CNNs are too computationally intensive for practical deployment in clinical environments where point-of-care assessment is crucial (Solaiyappan & Wen, 2022). Consequently, the development of lightweight architectures for medical diagnostics is an active research area. For instance, such models have been used effectively to detect diseases such as COVID-19 from chest X-ray images (Karakanis & Leontidis, 2020). As highlighted in recent studies on model optimization, this focus on efficiency is critical for enabling AI applications in resource-constrained clinical environments (Al-Milaji & Yousif, 2024).
Recently, self-supervised learning (SSL) has proven to be an effective way to learn robust image representations from unlabeled data without intensive manual annotation (Kumar et al., 2022). Contrastive learning is a particularly effective form of SSL: the model is trained to maximize the agreement between different augmented views of the same image and to minimize the agreement between views of different images. SSL methods such as SimCLR, MoCo and BYOL have achieved state-of-the-art accuracy on a range of computer vision benchmarks. SSL has also proven effective in medical image analysis (Shurrab & Duwairi, 2022), but its application to medical deepfake detection requires more research, especially with lightweight, efficient architectures such as MobileNetV2 (Sandler et al., 2018). Most existing SSL approaches, including medical implementations, use computationally intensive backbones such as ResNets and ViTs, which are poorly suited to clinical environments with limited resources.
Therefore, a critical gap exists for lightweight and efficient deepfake detection methods suitable for real-world medical intelligent systems. This study addresses this gap by making three essential contributions. First, we present a modified MobileNetV2 architecture that integrates Efficient Channel Attention (ECA) (Wang et al., 2020) modules to improve the detection of subtle deepfake artefacts while maintaining computational efficiency. Second, we employ a two-stage training framework that utilizes self-supervised contrastive pre-training to learn robust features from unlabeled data before fine-tuning for classification. Finally, through extensive experiments, we demonstrate that this approach achieves highly competitive performance, reaching 99.87% accuracy by successfully combining general-purpose ImageNet weights with domain-specific self-supervised learning.
The remainder of this paper is organized as follows: Section II provides a review of the literature. Section III describes the materials and methodology used in this study, including the dataset details, pre-processing steps, proposed ECA-enhanced MobileNetV2 architecture, self-supervised contrastive pre-training stage and supervised fine-tuning process. Section IV presents the experimental details and the performance evaluation of the proposed framework, and compares it with baseline methods. Section V concludes the study by summarizing the significant findings, contributions, limitations and potential future research directions.
Literature Review
Related Works
The rise of GANs and diffusion models has sparked rapid growth in deepfake detection research. Figure 1 provides a historical overview of deepfake technology and its progression towards integration with medical imaging. Previous studies have employed automatic methods to identify visual anomalies (Nguyen et al., 2022). These include blending inconsistencies, unnatural facial features and noisy patterns. Some researchers have utilized frequency domain analysis (Frank et al., 2020) to detect the unique spectral signatures left by generative models. For deepfake video detection, researchers have also examined motion inconsistencies. These include unnatural blinking and unusual heart rate signals in facial videos (Yu et al., 2021).

Historical overview of deepfake technology and its integration with medical imaging.
Architectures such as XceptionNet have shown good performance on benchmark datasets like FaceForensics++ and Celeb-DF (Yasser et al., 2023). Researchers have also developed systems for detecting subtle manipulations using attention mechanisms and Vision Transformer models (Khormali & Yuan, 2024). These approaches require powerful computational resources and large amounts of labeled data for training. More importantly, general deepfake detectors may not work well for medical images because artefacts in manipulated faces differ from those in medical scans, which makes adaptation challenging. According to Mirsky et al. (2019), medical imaging deepfakes pose a growing threat to healthcare. Medical scans such as computed tomography (CT), magnetic resonance imaging (MRI) and X-ray imaging can be artificially altered by malicious individuals to add or remove disease indications. These deepfakes can lead to misdiagnoses, improper treatments and insurance fraud. Because the modifications are subtle, detecting these medical image deepfakes requires specialized knowledge, which underscores the need for automated detection systems.
The legal and regulatory landscape of deepfakes in medical imaging remains under development. Existing laws on data privacy, such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR), as well as medical malpractice laws, may not fully address the challenges posed by deepfake technology (van der Sloot & Wagensveld, 2022). New legal frameworks are required to address issues such as liability in cases of misdiagnosis caused by deepfake medical images. It is also important to establish standards for synthetic medical data and to address intellectual property rights related to AI-generated images. Patient data must also be used ethically in research, in accordance with established legal guidelines.
Research on deepfake medical image detection is growing, with studies focusing on specific imaging modalities. Table 1 summarizes the key state-of-the-art methods developed for deepfake detection. CT scan manipulation was explored by Mirsky et al. (2019) and Albahli and Nawaz (2024), while X-ray deepfake detection was studied by Karaköse et al. (2024). CNN-based models and specialized detection techniques have been used to identify artefacts in fake medical images generated by GAN-based methods. Because most approaches rely on supervised learning, they are often limited by the lack of large labeled medical datasets. In addition, many models are computationally intensive, making them unsuitable for clinical applications.
State-of-the-Art Methods for Deepfake Detection in Medical Imaging.
The deployment of deep learning models for clinical purposes requires an efficient architectural design. Lightweight CNNs like MobileNetV2 (Sandler et al., 2018), ShuffleNet (Zhang et al., 2018) and EfficientNet (Tan & Le, 2019) balance the accuracy and speed using techniques such as depthwise separable convolutions and inverted residuals. The MobileNetV2 architecture is compatible with devices with a minimal computational capacity. Attention mechanisms in feature learning have become widespread because they deliver improved results at manageable computational cost. Techniques such as squeeze-and-excitation (SE) blocks and CBAM help models to focus on important image regions (Lv & Su, 2023). Our framework uses Efficient Channel Attention (ECA), which captures important channel-wise details without reducing dimensions. This helps MobileNetV2 detect subtle deepfake artefacts more effectively while remaining lightweight and efficient.
Self-supervised learning (SSL) has emerged as a powerful paradigm for addressing the limitations of labeled data in deep learning (Krishnan et al., 2022). Contrastive methods such as SimCLR and MoCo learn strong feature representations from unlabeled data by optimizing the feature similarity across different augmented views (Le-Khac et al., 2020). SSL has shown promising results in general medical image-analysis tasks (Rani et al., 2024). However, despite these advancements, there remains a significant gap between approaches that rely heavily on large datasets and those designed for efficiency in clinical applications. Current deepfake detection methods often face challenges related to domain specificity, computational cost, and dependence on supervised learning with large amounts of labeled data. The potential of combining SSL with lightweight attention-enhanced architectures for medical deepfake detection remains largely underexplored. In this study, we aim to address this gap by proposing and evaluating a two-stage SSL-based framework built on an ECA-enhanced MobileNetV2. The objective is to achieve high detection accuracy while maintaining the efficiency needed for real-world clinical deployment.
The development of effective deepfake detection methods requires a thorough understanding of medical deepfake generation methods. The foundational technology behind the creation of many deepfakes is the Generative Adversarial Network (GAN). A GAN includes a generator that creates synthetic data and a discriminator that attempts to identify whether the data are real or synthetic. This adversarial training process encourages the generator to continuously improve and produce increasingly realistic data. Advanced architectures such as StyleGAN (Karras et al., 2021) excel at high-fidelity image synthesis, whereas CycleGAN (Zhu et al., 2017) facilitates image-to-image translation, which is useful for manipulation. Autoencoders (AEs) and Variational Autoencoders (VAEs) also contribute to encoding features for tasks such as face swapping before decoding the manipulated result. The recent emergence of diffusion models (Ho et al., 2020) has introduced a new challenge for detection because these models produce images that are almost indistinguishable from real images. These generative models show high potential for use in medical-based malicious activities. Figure 2 illustrates a typical conceptual pipeline for generating such deepfake medical images. The challenge of detection arises not only from the realism of the generated fakes but also from the diversity of generation methods and the frequent use of post-processing techniques such as compression and blending, which are designed to conceal manipulation artefacts.

Flow diagram showing the procedure for generating deepfake medical images.
This section explains the datasets and the methodology used in this study. It first describes the selected datasets and the image pre-processing pipeline designed to enhance the discriminability of subtle features. The proposed approach introduces a deep learning framework optimized for both computational efficiency and detection performance. The framework follows a two-stage learning strategy. It begins with contrastive self-supervised learning on unlabeled data to capture strong intrinsic feature representations. This is followed by targeted supervised fine-tuning to adapt the learned features specifically for deepfake detection.
Datasets
Two primary data sources were used in this study. We obtained authentic CT images from the publicly available LIDC-IDRI repository (Armato et al., 2011) and deepfake CT images from the CT-GAN benchmark dataset (Mirsky et al., 2019).
LIDC-IDRI Dataset
Authentic CT images were obtained from the Lung Image Database Consortium image collection (LIDC-IDRI). This widely used public resource is available through The Cancer Imaging Archive (TCIA). The dataset comprised 1018 diverse patient cases, including both diagnostic and screening scans. This makes it representative of real-world clinical variability. A key strength of this dataset was its comprehensive annotation framework. Each scan included detailed findings from four independent radiologists who meticulously outlined pulmonary nodules ≥ 3 mm. For this study, we used a subset of 2697 axial CT slices. These were derived from the initial 16 patient cases (LIDC-IDRI-0001 to −0016). This selection process ensured a balanced class distribution within the experimental dataset. It also helped mitigate the potential model biases associated with severe class imbalances during training. For this experiment, authentic images were labeled as real images. In our study, data from these patients were partitioned using patient-level splits to prevent data leakage between the training and test sets. Figure 3 shows representative images from the dataset.

Samples from real CT scans in the LIDC-IDRI dataset.
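The patient-level partitioning described above can be illustrated with a minimal sketch. The function name, the (patient_id, slice_name) pair format, the 20% test fraction, and the ten slices per patient are illustrative assumptions and are not taken from this study.

```python
import random
from collections import defaultdict


def patient_level_split(slices, test_fraction=0.2, seed=0):
    """Split (patient_id, slice_name) pairs so that no patient spans both sets."""
    by_patient = defaultdict(list)
    for patient_id, slice_name in slices:
        by_patient[patient_id].append(slice_name)

    # Shuffle patients, not slices, so all slices of a patient stay together.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [(p, s) for p, s in slices if p not in test_patients]
    test = [(p, s) for p, s in slices if p in test_patients]
    return train, test


# Hypothetical input: 16 patients with 10 slices each.
all_slices = [
    (f"LIDC-IDRI-{i:04d}", f"slice_{j:03d}.png")
    for i in range(1, 17)
    for j in range(10)
]
train_slices, test_slices = patient_level_split(all_slices)
```

Splitting on the patient identifier rather than on individual slices is what prevents adjacent, near-identical slices from the same scan from appearing on both sides of the split.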
Deepfake CT images were obtained from the CT-GAN dataset. This dataset is a recognized benchmark for evaluating the detection of GAN-based manipulations in lung CT scans. The original dataset comprised four categories: True Benign (TB), True Malignant (TM), False Benign (FB), and False Malignant (FM). For this study, we exclusively utilized manipulated images corresponding to the False Benign (FB – artificial tumor removal) and False Malignant (FM – artificial tumor injection) categories. The FB and FM categories were collectively labeled as the fake class in this study.
Figure 4 shows representative scans from the CT-GAN dataset. A critical characteristic of this benchmark is the pronounced scarcity of data in the fake class. Initially, it contained 113 unique deepfake samples, which results in a severe class imbalance. To address this issue and enhance the robustness of the model against minor variations, targeted data augmentation was applied (Mikolajczyk & Grochowski, 2018). The augmentation was applied exclusively to the 113 fake images. Each fake image underwent eight distinct transformations. These transformations combined rotations of ±5° and ±10° with brightness adjustments of ±10% and ±25%. This strategy expanded the fake class from 113 to 1017 samples (the 113 originals plus 113 × 8 augmented copies). This augmented pool of 1017 images was then randomly split at the image level to create training and test partitions for the deepfake class. The data augmentation strategy is illustrated in Figure 5.

Deepfake medical images from the CT-GAN dataset. Left: artificial tumour removal (FB); Right: artificial tumour injection (FM). The red rectangles are used to pinpoint the manipulated regions in each respective scan.

Visualization of rotation and brightness augmentations used to address class imbalance by expanding the “Fake” dataset.
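The expansion from 113 to 1017 fake images can be checked with a short sketch. The text does not state exactly how the eight transformations are composed, so the plan below assumes one plausible reading (four rotations and four brightness changes, each applied on its own); the constant and function names are hypothetical.

```python
# Assumed reading of the eight transformations: four rotations and four
# brightness changes, each applied separately to the original image.
ROTATIONS_DEG = [-10, -5, 5, 10]
BRIGHTNESS_FACTORS = [0.75, 0.90, 1.10, 1.25]  # -25%, -10%, +10%, +25%


def augmentation_plan():
    """List the (operation, parameter) pairs applied to each fake image."""
    plan = [("rotate", angle) for angle in ROTATIONS_DEG]
    plan += [("brightness", factor) for factor in BRIGHTNESS_FACTORS]
    return plan


def expanded_count(n_originals, plan):
    """Originals are kept, and each one contributes len(plan) augmented copies."""
    return n_originals * (1 + len(plan))


plan = augmentation_plan()
total_fake = expanded_count(113, plan)
```

Under this reading, the total equals the 1017 samples reported for the fake class.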
A rigorous pre-processing pipeline was established to ensure uniformity and enhance the features for differentiating real and deepfake CT images before being input into deep learning models. All images in the two datasets were initially in the DICOM format and converted to the PNG format to maintain data integrity. This lossless conversion is crucial for preserving the original image data and preventing the degradation of subtle features that are critical for deepfake detection. Images from the LIDC-IDRI and CT-GAN datasets were converted to the RGB format and resized to 224 × 224 pixels using bilinear interpolation. The distributions of the real and deepfake CT scan samples in the training and test sets are summarized in Table 2.
Distribution of Real and Deepfake CT Scan Samples in the Training and Test Sets.
To prepare the data for the model, two distinct data augmentation pipelines were defined based on the training stage. In the self-supervised pre-training (SSL) stage, a strong augmentation strategy was employed to generate diverse views. This strategy included a Random Resized Crop to 224 × 224 pixels with a scale of (0.8, 1.0), Random Horizontal Flip with a probability of 0.5, Random Rotation within [−15°, +15°], Color Jitter with a factor of 0.2 for both brightness and contrast, and Random Grayscale with a probability of 0.1. For the supervised fine-tuning and final evaluation stages, a separate pipeline was used, which included a Center Crop to 224 × 224 pixels, Random Horizontal Flip with a probability of 0.5, Random Rotation within [−15°, +15°], and Color Jitter with a factor of 0.3 for both brightness and contrast. Both augmentation pathways incorporated contrast-limited adaptive histogram equalization (CLAHE) to enhance local contrast across all processed images (Pizer et al., 1987). CLAHE was applied using a clip limit of 2.0 and an 8 × 8 tile grid, which improved the visibility of fine structures in the CT scans. Because CLAHE operates on localized image regions, it prevents excessive contrast amplification and preserves critical anatomical details. Finally, all images were converted to PyTorch tensors and normalized. This step standardizes the intensity distribution across the dataset, which stabilizes training and facilitates convergence.
This section describes the proposed deep-learning methodology for detecting deepfake medical images. Training deep-learning models using limited labeled data can be challenging. To overcome this problem, we adopted a two-stage training paradigm that combines self-supervised contrastive learning for feature extraction and supervised fine-tuning for classification. To ensure an unbiased evaluation and prevent data leakage, the test set was strictly held out and was not used in any capacity during the self-supervised pre-training or fine-tuning stages. The test set was exclusively used for the final performance assessment of the fully trained model.
In the first stage, self-supervised contrastive learning was applied to the initial training dataset of 2970 images (2157 real and 813 deepfake images) without using labels. The goal was to train the core encoder (ECA-enhanced MobileNetV2) to learn robust domain-relevant features. In the second stage, this dataset was further subdivided into a training subset (2524 images) and a validation set (446 images) for supervised fine-tuning. The model was then fine-tuned on the training subset, and the validation set was used to monitor the performance for early stopping. Finally, the untouched test set (744 images: 540 real and 204 deepfake) was used for the final performance evaluation. Figure 6 shows the complete workflow and the interaction between pre-training and fine-tuning. Details of the model architecture are described in the following subsections.

Flow diagram of the proposed self-supervised pre-training and fine-tuning framework.
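The partition sizes quoted above can be cross-checked against the dataset totals reported earlier (2697 real slices and 1017 fake images). The dictionary layout below is only an illustrative way of organizing the numbers.

```python
# Counts as reported in the text.
pretrain_pool = {"real": 2157, "fake": 813}   # used without labels in stage 1
test_set = {"real": 540, "fake": 204}         # held out until final evaluation
finetune_split = {"train": 2524, "val": 446}  # stage 2 subdivision of the pool

pool_total = sum(pretrain_pool.values())
test_total = sum(test_set.values())
real_total = pretrain_pool["real"] + test_set["real"]
fake_total = pretrain_pool["fake"] + test_set["fake"]

# Fractions implied by the counts above.
val_fraction = finetune_split["val"] / pool_total
test_fraction = test_total / (pool_total + test_total)
```

The counts are mutually consistent: the validation set is roughly 15% of the training pool, and the test set is roughly 20% of all images.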
MobileNetV2 (Sandler et al., 2018) forms the backbone of the proposed framework. We chose this model because of its balance between computational efficiency and strong feature-extraction capabilities. MobileNetV2 uses inverted residual blocks and linear bottlenecks with depthwise separable convolutions. This makes it well suited for real-time applications in resource-constrained clinical settings. Figure 7 shows the standard MobileNetV2 architecture. The standard MobileNetV2 lacks a dedicated attention mechanism, which limits its ability to selectively enhance the discriminative features necessary for distinguishing between real and deepfake medical images. Because deepfake artefacts are often subtle, incorporating an adaptive and computationally efficient attention mechanism can enhance detection performance.

The framework of the MobileNetV2.
To address this limitation, we enhance the MobileNetV2 backbone by incorporating Efficient Channel Attention (ECA). Figure 8 illustrates the structure of the Efficient Channel Attention (ECA) module. ECA improves the feature representation by learning channel-wise dependencies. Unlike conventional attention mechanisms, ECA avoids dimensionality reduction before recalibrating the features. Instead, it computes local channel interactions through an adaptive 1D convolution. The kernel size k for this convolution was determined dynamically. It is based on the number of feature channels, C. This ensures optimal feature refinement without increasing the computational burden. Compared to other attention mechanisms, ECA offers a lightweight yet highly effective alternative. Quantitatively, to justify our architectural choice, we first compared the theoretical efficiency of the ECA against other prominent attention mechanisms. As summarized in Table 3, ECA adds a negligible number of parameters and computational costs, making it the ideal choice for enhancing our lightweight backbone without compromising its efficiency. A full experimental validation of this choice including performance metrics is reported in Table 8 (Section 4.2.2).

Architecture of the Efficient Channel Attention (ECA) module.
Theoretical Overhead of Attention Mechanisms on MobileNetV2.
Although squeeze-and-excitation (SE) attention enhances feature selectivity through global channel recalibration, it introduces significant computational overhead. The Convolutional Block Attention Module (CBAM) is even more complex, as it applies both spatial and channel attention. In contrast, ECA focuses solely on channel-wise attention. It requires no spatial attention module, no dimensionality-reducing fully connected layers, and no multi-branch computation; a single global average pooling step is followed by a lightweight 1D convolution across channels. This significantly reduces the computational cost while maintaining strong feature selectivity. These characteristics make ECA particularly beneficial for deepfake detection in medical imaging. The mathematical formulation of the ECA is defined as follows:
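$$k = \psi(C) = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{\text{odd}}$$
where $C$ is the dimension of the input channel and $|t|_{\text{odd}}$ denotes the odd number nearest to $t$. In Wang et al. (2020), $\gamma$ and $b$ are set to 2 and 1, respectively.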
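The channel-dependent kernel size can be computed as in the sketch below. It assumes the defaults γ = 2 and b = 1 from Wang et al. (2020) and the truncate-then-round-up-to-odd rule used in their reference implementation; the values used in this study are not stated here, and the channel counts are illustrative only.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size for ECA (defaults assumed from Wang et al., 2020)."""
    t = int(abs((math.log2(channels) + b) / gamma))
    # Force an odd kernel so the convolution is centred on each channel.
    return t if t % 2 else t + 1


# Illustrative channel widths, not necessarily those of the proposed model.
kernel_sizes = {c: eca_kernel_size(c) for c in (16, 64, 96, 256, 512)}
```

Because k grows only logarithmically with C, each ECA module adds just k weights regardless of the width of the layer it is attached to.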
Our modified architecture integrates the ECA module within each inverted residual block of the MobileNetV2. The ECA was placed after the initial 1 × 1 pointwise convolution, which expands the feature channels. It is positioned before the 3 × 3 depth-wise convolution, which performs spatial filtering. This placement ensures that the ECA operates in a richer high-dimensional feature space. This allows the model to refine the feature representations before the spatial convolution.
Consequently, the proposed architecture enhances the ability to capture subtle deepfake artefacts. This improves the overall detection accuracy while maintaining the computational efficiency. Figure 9 shows the modified inverted residual block with an integrated ECA in MobileNetV2.

Modified Inverted Residual Block in MobileNetV2 with integrated Efficient Channel Attention (ECA).
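A minimal PyTorch sketch of the modified block is given below, with ECA inserted between the 1 × 1 expansion and the 3 × 3 depthwise convolution. The class names, the placement of ECA after the batch normalization and ReLU6 of the expansion step, the expansion ratio of 6, and the γ = 2, b = 1 defaults are assumptions made for illustration; this is not the authors' released implementation.

```python
import math

import torch
from torch import nn


class ECA(nn.Module):
    """Efficient Channel Attention: global average pooling + 1D conv over channels."""

    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1  # odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        y = x.mean(dim=(2, 3))             # (N, C) channel descriptors
        y = self.conv(y.unsqueeze(1))      # (N, 1, C) local cross-channel interaction
        w = torch.sigmoid(y).squeeze(1)    # (N, C) channel weights in (0, 1)
        return x * w[:, :, None, None]     # rescale each channel


class ECAInvertedResidual(nn.Module):
    """Inverted residual block: expand -> ECA -> depthwise -> linear projection."""

    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),   # 1x1 pointwise expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            ECA(hidden),                               # attention on expanded features
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),      # 3x3 depthwise convolution
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),  # 1x1 linear bottleneck
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out


x = torch.randn(2, 16, 8, 8)
same_shape = ECAInvertedResidual(16, 16, stride=1)(x)
downsampled = ECAInvertedResidual(16, 24, stride=2)(x)
eca_params = sum(p.numel() for p in ECA(96).parameters())
```

In this sketch, the ECA module attached to a 96-channel expanded feature map contributes only three learnable weights, which illustrates why the attention adds so little to the parameter count.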
To optimize the ECA-enhanced MobileNetV2 architecture for deepfake detection, we employed a two-stage training strategy. This approach separates representation learning from task-specific adaptation. It enhances feature extraction and generalization by leveraging self-supervised contrastive pre-training followed by supervised fine-tuning.
We specifically chose a contrastive learning framework (SimCLR) over generative SSL approaches, such as Masked Autoencoders (MAE), for a crucial reason. The task of deepfake detection is fundamentally discriminative, relying on identifying subtle high-frequency artefacts and inconsistencies that differentiate a real image from a manipulated one. The objective of contrastive learning is highly aligned with this goal, as it trains the model to be sensitive to fine-grained, instance-level differences. In contrast, reconstruction-driven methods, such as MAE, optimize pixel-level fidelity and may prioritize learning low-frequency anatomical structures, potentially overlooking the manipulation artefacts critical for detection. Furthermore, from a practical standpoint, MAE pre-training is typically the most effective with high-capacity Vision Transformer (ViT) backbones and requires extensive training epochs. SimCLR, however, converges effectively on lightweight CNNs, such as MobileNetV2, within a reasonable training time, aligning with our project's goal of developing an efficient solution.
The first stage involves self-supervised contrastive learning using SimCLR. SimCLR is a widely used framework that has demonstrated strong performance in learning visual representations without the need for annotated data. This step aims to learn robust feature representations from the training dataset, which consists of real and deepfake CT images, without using their labels. The encoder network, augmented with an MLP projection head, was trained on strongly augmented views and optimized with the Normalized Temperature-scaled Cross Entropy (NT-Xent) loss. The objective was to maximize the similarity between augmented views of the same image while minimizing the similarity between views of different images. The resulting pre-trained encoder weights were preserved for subsequent fine-tuning.
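The NT-Xent objective can be written compactly. The NumPy sketch below follows the standard SimCLR formulation for a batch of N image pairs; the temperature of 0.5, the embedding size, and the random inputs are placeholders rather than the settings used in this study.

```python
import numpy as np


def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss for two batches of projections, each of shape (N, D)."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)                # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit norm -> cosine similarity
    sim = (z @ z.T) / temperature                       # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                      # exclude self-similarity

    # The positive for row i is the other augmented view of the same image.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob = sim[np.arange(2 * n), positives] - log_denominator
    return float(-log_prob.mean())


rng = np.random.default_rng(0)
views_a = rng.normal(size=(8, 16))
views_b = views_a + 0.05 * rng.normal(size=(8, 16))      # slightly perturbed "second view"
aligned_loss = nt_xent_loss(views_a, views_b)
unrelated_loss = nt_xent_loss(views_a, rng.normal(size=(8, 16)))
```

The loss is lower when the two inputs are genuinely matching views than when they are unrelated, which is the behaviour the pre-training stage exploits.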
In the second stage, the pre-trained encoder was adapted for deepfake detection (“real” vs “deepfake”). The ECA-enhanced encoder was initialized using the saved weights, and the MLP projection head was replaced with a task-specific classification head. The model was then fine-tuned on the same training data, now provided with labels for the real and deepfake images. This two-stage framework combines self-supervised representation learning with supervised optimization. It enhances deepfake detection by improving feature generalization and task-specific adaptation.
Experimental Results
Implementation Details and Analysis
All experimental procedures were conducted using the Google Colab Pro cloud-computing environment. The primary computational task was executed on an NVIDIA Tesla T4 Graphics Processing Unit with 15 GB of dedicated memory. GPU-accelerated computations were supported by the NVIDIA CUDA Toolkit version 12.4 and the NVIDIA CUDA Deep Neural Network library (cuDNN) version 9.3.0. The deep learning framework was developed using Python version 3.11.12, with core functionalities implemented using the PyTorch library version 2.6.0.
The input image size for training and evaluating the proposed model was standardized to 224 × 224 pixels. In the first phase, self-supervised pre-training was performed for 100 epochs using the NT-Xent contrastive loss to learn feature representations. The second phase involved supervised fine-tuning for a maximum of 50 epochs using cross-entropy loss for binary deepfake classification. During supervised fine-tuning, we employed an early stopping mechanism with a patience of 10 epochs, which monitored the validation accuracy to prevent overfitting and to save the best-performing model. We did not use a learning rate scheduler for this training protocol. The hyper-parameter configurations for both stages are listed in Tables 4 and 5.
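The early stopping rule described above (patience of 10 epochs on validation accuracy, at most 50 epochs) can be sketched as follows. The class name and the validation curve are hypothetical and serve only to show when training would halt.

```python
class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_acc = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_acc):
        """Record one epoch; return True when training should stop."""
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.best_epoch = epoch   # a real loop would save a checkpoint here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation accuracies: improvement stops after epoch 2.
MAX_EPOCHS = 50
val_curve = [0.90, 0.95, 0.97] + [0.96] * 20

stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, val_acc in enumerate(val_curve[:MAX_EPOCHS]):
    if stopper.step(epoch, val_acc):
        stopped_at = epoch
        break
```

In this toy run, the best model is the one from epoch 2, and training halts ten non-improving epochs later.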
Hyperparameters and Corresponding Values Used in Self-Supervised Pre-Training.
Hyperparameters and Corresponding Values Used in Supervised Fine-Tuning.
To assess performance across thresholds, we computed the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). Model interpretability was analysed using gradient-based techniques. Grad-CAM heatmaps, generated with the pytorch-grad-cam library, helped to identify the image regions that influenced the deepfake detection decisions of the model.
This evaluation approach combines quantitative metrics with qualitative interpretability, providing a comprehensive assessment of the model's performance and practical applicability in medical deepfake detection.
The performance of the proposed deepfake detection approach was rigorously evaluated against several established pre-trained CNN models and a Vision Transformer (ViT). The comparative results, including key performance metrics, model complexity, and inference speed, are presented in Table 6. On the evaluated dataset, the proposed approach achieved highly competitive performance across all metrics. Specifically, it achieved an overall accuracy of 99.87% (95% CI: 99.27%–99.99%), a sensitivity of 99.51% (95% CI: 97.26%–99.99%), a specificity of 100.00% (95% CI: 99.32%–100.00%), and an F1-score for the “Fake” class of 99.75% (95% CI: 99.22%–100.00%). The narrow confidence intervals for these metrics underscored the stability and reliability of the performance of the model on the test set.
Performance and Efficiency Comparison of the Proposed Approach with Baseline Deep Learning Architectures.
Note: 95% confidence intervals (CI) are reported for the proposed approach. Clopper-Pearson exact intervals were used for Accuracy, Sensitivity, and Specificity. A bootstrap percentile interval (1000 resamples) was used for the F1-score.
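The Clopper-Pearson exact interval mentioned in the note can be computed without external libraries by inverting the binomial tail probabilities. The sketch below uses bisection; the function names are illustrative, and the example uses the specificity counts reported in the text (540 of 540 real images classified correctly).

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability mass, computed in log space for numerical stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def _solve(f, target, increasing, iters=80):
    """Bisection on (0, 1) for f(p) = target, with f monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(successes, n, alpha=0.05):
    """Two-sided exact confidence interval for a binomial proportion."""
    if successes == 0:
        lower = 0.0
    else:
        # P(X >= successes | p) increases with p.
        upper_tail = lambda p: sum(binom_pmf(k, n, p) for k in range(successes, n + 1))
        lower = _solve(upper_tail, alpha / 2, increasing=True)
    if successes == n:
        upper = 1.0
    else:
        # P(X <= successes | p) decreases with p.
        lower_tail = lambda p: sum(binom_pmf(k, n, p) for k in range(0, successes + 1))
        upper = _solve(lower_tail, alpha / 2, increasing=False)
    return lower, upper


spec_lo, spec_hi = clopper_pearson(540, 540)   # specificity: 540/540
acc_lo, acc_hi = clopper_pearson(743, 744)     # accuracy: one error in 744 images
```

For the specificity case, the lower bound has the closed form 0.025^(1/540) ≈ 0.9932, which matches the 99.32% lower limit reported for the proposed approach.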
In terms of classification accuracy, the proposed method showed a modest but consistent improvement over other high-performance models, including EfficientNetB0 (by 0.54 percentage points) and ViT-Base-16 (by 0.27 percentage points). Although these margins are small, a key differentiator of our model is its specificity. The proposed model achieved a specificity of 100% on the test set, correctly identifying all 540 real images without any false positives. This is a crucial advantage in clinical applications, where falsely flagging a real image as fake is highly undesirable. In comparison, ViT-Base-16, another strong performer, achieved a specificity of 99.63%. To compare the two models rigorously, we conducted McNemar's test on the paired predictions from our proposed model and ViT-Base-16. The test indicated no statistically significant difference in the classification errors (χ² = 0.5, p = 0.48). This result is important from an efficiency perspective. The proposed lightweight ECA-MobileNetV2, with only ∼2.23 million parameters, achieves a classification performance that is statistically indistinguishable from that of the vastly heavier Vision Transformer, which has ∼85.8 million parameters. The proposed approach therefore provides comparable top-tier accuracy with approximately 38 times fewer parameters.
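McNemar's test depends only on the discordant pairs, that is, the images that one model classifies correctly and the other does not. The discordant counts are not reported in the text; the values b = 0 and c = 2 below are hypothetical and were chosen because, with the continuity correction, they reproduce the reported χ² = 0.5 and p ≈ 0.48.

```python
import math


def mcnemar_test(b, c, continuity=True):
    """McNemar chi-square statistic and p-value (1 degree of freedom).

    b: cases where only model A is wrong; c: cases where only model B is wrong.
    """
    if b + c == 0:
        raise ValueError("No discordant pairs: the test is undefined.")
    diff = abs(b - c) - (1 if continuity else 0)
    chi2 = max(diff, 0) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical discordant counts (not taken from the paper).
chi2, p_value = mcnemar_test(b=0, c=2)
```

With so few discordant pairs the test has little power, so a non-significant result is consistent with the small difference in error counts between the two models.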
The confusion matrices presented in Figure 10 visually confirm the quantitative results. The matrix for the proposed approach shows the highest concentration of predictions along the main diagonal, indicating correct classification. Critically, the false-positive cell (actual real, predicted fake) contains zero, consistent with the 100% specificity. Only a single false negative (actual fake, predicted real) was recorded. In contrast, although other models such as EfficientNetB0, ResNet18, DenseNet121, and ViT-Base-16 also performed well, they exhibited slightly higher numbers of misclassifications. MobileNetV2 and MobileNetV3 Small exhibited comparatively more errors, particularly fake images misclassified as real.
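The reported metrics follow directly from the four cells of the confusion matrix. The short sketch below shows the arithmetic for the proposed model. The 540 true negatives, zero false positives, and single false negative come from the text; the count of 203 true positives is an assumption inferred from the reported 99.51% sensitivity, and the function name is illustrative.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity and positive-class F1 from confusion counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # recall on the "Fake" class
        "specificity": tn / (tn + fp),   # recall on the "Real" class
        "f1_fake": 2 * tp / (2 * tp + fp + fn),
    }


# tp=203 is inferred (assumption); fn=1, fp=0 and tn=540 are stated in the text.
metrics = binary_metrics(tp=203, fn=1, fp=0, tn=540)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
# accuracy: 99.87%
# sensitivity: 99.51%
# specificity: 100.00%
# f1_fake: 99.75%
```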

Comparison of confusion matrices between the proposed and evaluated baseline models.
Regarding model complexity and efficiency (Table 6), the proposed approach retains the low parameter count of its MobileNetV2 backbone (∼2.23 million parameters). This is significantly lighter than ResNet18 (11.18 M), DenseNet121 (6.96 M), and ViT-Base-16 (85.80 M). In terms of inference speed, the proposed model achieved an average time of 0.932 ms per image, which was slightly slower than the base MobileNetV2 (0.855 ms), likely because of the added ECA modules, but considerably faster than larger models such as ResNet18 (2.981 ms), DenseNet121 (1.799 ms), and ViT-Base-16 (10.951 ms). MobileNetV3 Small offers the fastest inference (0.255 ms) but at a significant cost to accuracy and other metrics. Therefore, the proposed approach strikes an excellent balance among high accuracy, high specificity, model compactness, and efficient inference.
Further insight into the model's decision-making confidence is provided in Figure 11, which plots the distribution of the predicted probabilities for the “Fake” class separated by the true class. The plot reveals a clear separation between the distributions of real (Class 0) and fake (Class 1) images. The confidence scores for true real images were tightly clustered near zero, whereas the scores for true fake images were predominantly clustered near one. The minimal overlap between these distributions underscores the high discriminative power and confidence of the model in its predictions.

Confidence Score distribution by True Class for the proposed approach.
Figure 12 shows the Receiver Operating Characteristic (ROC) curve of the proposed model. This curve illustrates the trade-off between the true positive rate (sensitivity) and false positive rate across various classification thresholds. The area under the ROC curve was approximately 0.9994. This score indicates a high level of discriminative ability between the real and deepfake images. The proposed approach excels not only in overall accuracy but also in specificity while maintaining computational efficiency, making it a compelling solution for practical deployment.
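The area under the ROC curve equals the probability that a randomly chosen fake image receives a higher score than a randomly chosen real image, so it can be computed from ranks without tracing the curve. The following is a minimal standard-library sketch of that rank-based (Mann-Whitney) formulation; the function name and the toy scores are illustrative and are not taken from the study.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    y_true: 0 (real) or 1 (fake) per sample; scores: predicted "Fake" probability.
    Tied scores receive their average rank, which counts each tie as half a win.
    """
    n = len(scores)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")

    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    rank_sum_pos = sum(ranks[i] for i in range(n) if y_true[i] == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example: one of the four positive/negative pairs is ordered incorrectly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```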

ROC curve for the proposed model.
Effect of ImageNet Pre-Initialization
In our first ablation study, we investigated the contribution of leveraging pre-trained ImageNet weights before the self-supervised learning (SSL) phase. The performance of the proposed approach when initialized with standard ImageNet weights prior to SSL was compared with that of the same approach initialized with random weights. The performance metrics on the final test set for both configurations are listed in Table 7.
Effect of ImageNet Pre-Initialization on Performance Metrics of the Proposed Model.
Performance and Efficiency Comparison of Attention Mechanisms in the Proposed Architecture.
Initializing the encoder with ImageNet weights before SSL pre-training yielded improved results across all evaluated metrics. The ImageNet-initialized model achieved an accuracy of 99.87% and a fake-class F1-score of 99.75%, compared with 98.39% and 96.98%, respectively, for the randomly initialized model. This represents an improvement of 1.48 percentage points in accuracy and 2.77 percentage points in the F1-score. A key finding concerned specificity: the ImageNet-initialized model achieved a specificity of 100.00% on the test set, making no false-positive errors, whereas the randomly initialized counterpart achieved a specificity of 99.81%.
The fine-tuning performance curves in Figures 13 and 14 qualitatively confirm the benefits of the ImageNet initialization. As shown in Figure 13, the training accuracy of the model with pre-trained weights started at a higher baseline (∼0.76 at epoch 1) than that of the randomly initialized model (∼0.70). More importantly, the ImageNet-initialized model exhibited significantly accelerated convergence, reaching high training accuracy (>0.99) within the first five epochs. The randomly initialized model displayed a much slower convergence trajectory, taking approximately 15–20 epochs to reach a similar high-accuracy plateau.

Training accuracy comparison during fine-tuning with and without ImageNet pre-initialization.

Training loss comparison during fine-tuning: With vs. Without ImageNet pre-initialization.
Figure 14 illustrates the training loss during fine-tuning for both the initialization strategies. The loss for the ImageNet-initialized model decreased sharply, whereas the randomly initialized model exhibited a much more gradual decrease. These results highlight the power of transfer learning to provide beneficial inductive bias. ImageNet pre-training endows the model with a robust set of generic low-level feature detectors for edges, textures, and shapes.
This provides a “warm start,” allowing the subsequent domain-specific self-supervised learning phase to focus on adapting these foundational features to the nuances of CT imaging and deepfake artefacts rather than learning them from scratch. Although pre-training exclusively on a large-scale medical dataset is a potential alternative, the scale and diversity of ImageNet are currently unparalleled. Our hybrid approach, combining general-purpose ImageNet pre-training with domain-specific SSL, represents a practical and highly effective strategy that leads to faster convergence and improved final classification performance.
To validate our choice of the Efficient Channel Attention (ECA) module, we conducted a second ablation study comparing our proposed architecture with two variants enhanced by other widely used attention mechanisms: the squeeze-and-excitation (SE) block and the Convolutional Block Attention Module (CBAM). Each variant was trained using the same two-stage protocol to ensure a fair and direct comparison.
The performance and efficiency trade-offs are listed in Table 8. The results show that the performance of our proposed ECA model is on par with that of the best-performing alternative, CBAM, with both achieving an identical 99.87% accuracy and 100.00% specificity. The SE-enhanced model achieved 99.73% accuracy and 99.81% specificity. Critically, our model achieved this top-tier performance at a negligible fraction of the computational cost. As shown in the table, the ECA module adds only ∼0.08 K parameters and ∼3.5 × 10⁻⁵ GFLOPs to the MobileNetV2 backbone. In stark contrast, the CBAM module, which achieved the same performance, was over 7000 times larger in terms of additional parameters and had a computational overhead that was more than 100 times greater.
This analysis confirms that the ECA module provides an optimal balance between classification performance and computational efficiency for this task. It successfully enhances the ability of the model to detect subtle deepfake artefacts without the significant computational burden of other attention mechanisms.
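To illustrate why channel attention of the ECA type is so cheap, the sketch below implements one ECA block in plain Python: global average pooling per channel, a single bias-free 1D convolution across neighbouring channels, and a sigmoid gate. The adaptive kernel-size rule with γ = 2 and b = 1 follows the original ECA-Net formulation. The channel counts, the SE reduction ratio of 16, and the function names are assumptions for illustration; the exact placement and configuration of the modules in the proposed network are not specified in this section.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size from ECA-Net: nearest odd value of (log2(C) + b) / gamma."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def eca_forward(feature_map, kernel):
    """Apply ECA to a feature map given as a list of C channels of flat spatial values.

    kernel: the k learnable weights of the 1D convolution (the block's only parameters).
    """
    pooled = [sum(ch) / len(ch) for ch in feature_map]  # global average pooling
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad          # zero padding keeps length C
    gated = []
    for c, channel in enumerate(feature_map):
        z = sum(w * padded[c + j] for j, w in enumerate(kernel))
        gate = 1.0 / (1.0 + math.exp(-z))                # sigmoid attention weight
        gated.append([gate * v for v in channel])
    return gated


def se_extra_params(channels, reduction=16):
    """Weights in a bias-free squeeze-and-excitation block: two C x C/r layers."""
    return 2 * channels * (channels // reduction)


# Hypothetical channel count (assumption), chosen only to show the scale difference.
C = 256
print("ECA parameters:", eca_kernel_size(C))   # one kernel of size k
print("SE parameters:", se_extra_params(C))    # grows quadratically with C
```

Because the ECA parameter count depends only on the kernel size, it grows logarithmically with the channel width, whereas the fully connected layers in SE grow quadratically.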
Comparison with State-of-the-Art Studies
Researchers have used numerous deep-learning techniques along with different datasets to develop solutions for identifying deepfake medical images. To position our work within this landscape, Table 9 provides a summary of recent representative studies, detailing their utilized datasets, applied methodologies and reported detection accuracies.
Comparison of the Proposed Approach with State-of-the-Art Medical Deepfake Detection Methods.
Alheeti et al. (2022) detected fake cancerous nodules on 3D CT lung scans, where fake nodules were added or real nodules were removed. Using a pre-trained Deep Neural Network, they achieved a detection accuracy of 93.19% and reported metrics related to false alarms and error rates. Another approach by Aruna and Narayan (2024) utilized LIDC-IDRI and CT-GAN datasets containing real and GAN-generated tampered CT scans, respectively. They used a standard method of processing images with Local Binary Patterns (LBP) to find important features and then applied the U-Net architecture and classified the results using a Support Vector Machine (SVM), achieving an accuracy of 93.9%.
Budhiraja et al. (2022) conducted thorough tests on LIDC-IDRI and CT-GAN datasets using well-known CNN architectures such as DenseNet, ResNet, and VGG, as well as Residual Connections (RC). The highest reported accuracy was 97.2% using an ensemble architecture combining DenseNet201 and RC, demonstrating the potential of combining multiple strong models.
Researchers have recently made significant progress in terms of accuracy. Karaköse et al. (2024) conducted a deepfake detection study utilizing the 3D CT-GAN dataset, where they evaluated the performance of the YOLOv5 and YOLOv8 models. Their approach achieved a high detection accuracy of 99.7% and demonstrated the potential of object detection frameworks for medical-image integrity verification. Pradeepan and Raj (2024) also employed the CT-GAN dataset but introduced a hybrid methodology that integrates EfficientNet-B0 with Discrete Wavelet Transform (DWT) features. This method achieved an accuracy of 99.6%. Both studies highlighted the efficacy of advanced deep learning architectures, in the latter case combined with handcrafted feature extraction, for reliable deepfake detection in medical imaging applications.
Several factors must be considered when comparing the results of this study with those of other studies. Not all studies used the same data or methods and, more importantly, their evaluation protocols often differ. These variations make direct comparison difficult. However, placing our results in a broader context still provides useful insight. As summarized in Table 9, our proposed method achieved a highly competitive accuracy of 99.87% on the LIDC-IDRI and CT-GAN datasets. This performance is on par with or marginally exceeds that of the recent top-performing methods. While the accuracy gain is modest, the primary advantage of our approach lies in its combination of high performance and computational efficiency. Furthermore, our model achieved a specificity of 100% on the test set, which is a critical metric for clinical reliability, as discussed in the previous performance analysis section.
In this study, we developed a deepfake detection method that focuses on both accuracy and efficiency. Our proposed model, which combines self-supervised learning with a lightweight ECA-MobileNetV2 architecture, achieved a highly competitive accuracy of 99.87% and a specificity of 100% on the test set. A key differentiator of our study is its emphasis on practical deployment. While many studies have reported high detection accuracy, few have focused on the inference time and computational cost that are critical for real-time clinical use.
We trained and evaluated the proposed model using the widely recognized LIDC-IDRI and CT-GAN datasets and validated the effectiveness of the proposed model through two crucial ablation studies. First, we demonstrated that our hybrid pre-training strategy, initializing with general-purpose ImageNet weights before domain-specific self-supervised learning, resulted in significantly faster convergence and stronger final performance than random initialization. Second, our ablation of attention mechanisms confirmed that the ECA module is highly effective, enabling the model to achieve top-tier accuracy on par with a much heavier CBAM-based model, but with a negligible fraction of the computational cost, thereby providing an optimal balance of performance and efficiency.
Furthermore, visual explanations from Grad-CAM (Figure 15) provide insight into the model's decision-making process and support its clinical relevance. For the deepfake CT scan (bottom row), activation of the model was more concentrated around the focal region of the lung containing the synthetically generated nodule (indicated by the red box). The intense and concentrated activation demonstrated that the model successfully identified the specific location and characteristics of the manipulation. In contrast, in the real CT scan (top row), activation is more diffuse and spread across key anatomical structures, such as the aorta and chest wall, reflecting a holistic assessment of structural consistency. These visualizations highlight that the model utilizes clinically meaningful features and anatomical coherence, rather than superficial pixel-level cues.

Grad-CAM visualizations of model attention. (a) Original CT scans: Real (top) and deepfake (bottom). (b) Corresponding Grad-CAM activation maps. The real scan shows a more diffuse, holistic activation across key anatomical structures. In contrast, in the deepfake example, the activation of the model was more concentrated around the synthetically generated nodule (indicated by the red box) with limited spill over to the surrounding tissue.
Despite these promising results, this study has several limitations that suggest clear directions for future research. First, the generalizability of the findings is constrained by the dataset and the splitting procedure. The model was evaluated using a single source of GAN-generated forgeries (CT-GAN). Additionally, the deepfake samples were augmented prior to partitioning to address class imbalance, which means that different augmented views of the same image could appear in more than one partition. To establish generalization, future work should validate the model against more diverse and modern forgery techniques, such as diffusion models, and across modalities, including MRI and X-ray. Second, although an independent test set was used, the stability of the results could be further assessed through k-fold cross-validation. Third, two practical challenges in real-world deployment remain unaddressed. The first is robustness against adversarial attacks, which are subtle perturbations designed to cause misclassifications. The second is the risk of model drift, where the performance may degrade as new imaging hardware or forgery techniques emerge; this requires continuous monitoring and retraining in a clinical environment. Finally, while Grad-CAM provides useful insights, it offers only coarse explanations of the model's reasoning. Future work could employ advanced interpretability methods, such as SHAP or LIME, to yield quantitative feature-level insights. Such explanations would be invaluable for building clinical trust and positioning the tool as a supportive aid for flagging suspicious regions rather than as a standalone diagnostic system.
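One way to avoid augmented views of the same image appearing in more than one partition is to split by source image first and only then assign each augmented view to the partition of its source. The sketch below illustrates this group-aware split in standard-library Python. The identifier scheme (`img3_aug1`), the function name, and the 80/20 ratio are hypothetical and do not describe the pipeline used in this study.

```python
import random


def group_split(sample_ids, group_of, test_fraction=0.2, seed=0):
    """Split samples so that all samples sharing a group land in the same partition.

    sample_ids: identifiers of all (augmented) samples.
    group_of: mapping from sample identifier to its source-image identifier.
    """
    groups = sorted({group_of[s] for s in sample_ids})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [s for s in sample_ids if group_of[s] not in test_groups]
    test = [s for s in sample_ids if group_of[s] in test_groups]
    return train, test


# Hypothetical data: 10 source images, each with 3 augmented views.
ids = [f"img{i}_aug{j}" for i in range(10) for j in range(3)]
source = {s: s.split("_")[0] for s in ids}
train_ids, test_ids = group_split(ids, source)

# No source image contributes views to both partitions.
assert {source[s] for s in train_ids}.isdisjoint({source[s] for s in test_ids})
```

The same grouping idea extends to k-fold cross-validation by assigning whole groups, rather than individual samples, to folds.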
Conclusion
This study introduced a deepfake detection framework that integrates self-supervised learning (SSL) with an attention-enhanced lightweight convolutional network (ECA-MobileNetV2) specifically tailored for robust medical imaging analysis. The framework uses contrastive learning to extract meaningful features directly from medical images, thereby enabling the model to detect subtle artefacts that may be missed by traditional methods. The Efficient Channel Attention (ECA) module helps the model to focus on the most informative features, improving its ability to distinguish between real and deepfake images. The MobileNetV2 backbone ensures that the model remains fast and suitable for systems with limited computational power. Grad-CAM helps explain how the model makes decisions by highlighting the areas that influence its predictions, which is important for building trust in a clinical environment. The model was tested on two well-known CT scan datasets, LIDC-IDRI and CT-GAN. On this benchmark, the results demonstrated highly competitive performance, with a detection accuracy of 99.87% and a specificity of 100%. These results suggest that the proposed framework is both accurate and reliable. The combination of self-supervised learning, attention-based feature selection, and lightweight design makes this approach a promising and effective solution for real-world medical deepfake detection.
Footnotes
Acknowledgements
Not applicable
Ethics Approval
Not applicable
Author Contributions
Pradeepan Pankiraj and Gladston Raj conceived and designed the study. Pradeepan Pankiraj developed the methodology and implemented the software. Pradeepan Pankiraj and Neethu Nath conducted the experiments and analyzed the data. Pradeepan Pankiraj wrote the original draft of the manuscript. Gladston Raj, Juby George, and Neethu Nath reviewed and edited the manuscript. Gladston Raj supervised the project. All the authors have read and approved the final version of the manuscript.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability
The data supporting the findings of this study are openly available. Authentic CT images (LIDC-IDRI) are available in The Cancer Imaging Archive (TCIA) at https://www.cancerimagingarchive.net/collection/lidc-idri/ (Armato et al., 2011). Deepfake CT images (CT-GAN) are available in the CT-GAN GitHub repository at https://github.com/ymirsky/CT-GAN (Mirsky et al., 2019).
