Abstract
Abuse of face swap techniques poses serious threats to the integrity and authenticity of digital visual media. More alarmingly, fake images or videos created by deep learning technologies, also known as Deepfakes, are more realistic and of higher quality and reveal few tampering traces, which has attracted great attention in digital multimedia forensics research. To address the threats posed by Deepfakes, previous work attempted to classify real and fake faces by discriminative visual features, an approach that is subject to objective conditions such as the angle or posture of a face. In contrast, some research devises deep neural networks to discriminate Deepfakes at the microscopic-level semantics of images, which achieves promising results. Nevertheless, such methods show limited success when encountering unseen Deepfakes created with methods different from those in the training sets. Therefore, we propose a novel Deepfake detection system, named
Introduction
Digital visual media, such as images and videos, are widely accepted as critical evidence in investigations, military and digital multimedia forensics [13]. Nevertheless, the integrity and authenticity of digital visual media can be undermined by various forgeries, the most harmful of which is face swapping [41]. Such forgery swaps the face in the target image or video with a donor’s face observed in other media, which creates the illusion of behaviors that the person never performed in reality. Adversaries have traditionally employed computer graphics-based methods to swap faces [40], which inevitably reveal some tampering traces. More recently, a new type of fake sample has emerged, i.e., the Deepfake [41], which is created using deep learning models such as the Autoencoder [4] and the Generative Adversarial Network (GAN) [16]. Compared with the former, Deepfakes are more realistic and of higher quality: they retain the target’s facial expressions and head poses and reveal few tampering traces, as illustrated in Fig. 1. With the risks of the applications for creating the Deepfakes, such as FakeAPP1
With growing concern about Deepfakes, Deepfake detection has become an emerging trend in digital multimedia forensics research, and various methods have been proposed. Among them, some research focuses on the macroscopic-level semantics of images, and attempts to identify discriminative visual features to detect Deepfakes, such as head poses [44], details in eyes and teeth [28], and the movements of facial muscles [3]. Nevertheless, affected by objective conditions such as the angle or posture of a face, the discriminative visual features are unlikely to be available in every image or video frame, which limits the applicability of such methods.
As an alternative to the above visual feature-based methods, some research exploits the difference in the microscopic-level semantics of images and focuses on devising effective deep networks to classify real and fake faces [2,11,30]. Deep neural networks are capable of perceiving differences that are difficult for human eyes to see, which leads to the promising performance of such methods. Nevertheless, these methods perform well only when the testing and training Deepfakes are created with the same methods (abbreviated as seen Deepfakes). When they encounter Deepfakes created with methods different from those in the training sets (abbreviated as unseen Deepfakes), the performance of deep neural networks degrades greatly [27]. That is, the poor generalization on unseen Deepfakes poses a challenge to existing detection methods.

Screenshots from the corresponding videos [24]. The generated Deepfake (c) presents the donor’s face in (b) while retaining the target’s facial expressions and head poses in (a).
In this paper, we focus on the deep neural network-based methods in view of their promising performance, and aim to address the existing challenge of unseen Deepfake detection. Therefore, we propose a novel Deepfake detection system, named
With the above expectations on
Based on the observation that seen and unseen Deepfakes share identical labels but exhibit different feature distributions for a neural network, we provide a promising perspective of formulating the generalization challenge as a task of cross-distribution data classification, i.e., classifying multifold Deepfakes with different feature distributions.
We devise a preliminary detection network of
We address the issue of cross-distribution Deepfake detection with the strategy of domain adaptation [14,15]. Specifically, we devise and augment a domain adaptation component to the preliminary detection network, which enables
The remainder of this paper is structured as follows. Section 2 presents a concise introduction to related methods of Deepfake creation and detection. Our proposed detection system is presented in Section 3 and evaluated in Section 4. Finally, we conclude this paper and envisage future work in Section 5.
In this section, we first introduce two typical methods to create Deepfakes, and then review their related detection methods.
Background
Currently, various implementations of Deepfakes are available, and the most typical ones are based on the Autoencoder and the GAN.
The Autoencoder-based method,

Workflow of the Autoencoder-based Deepfake creation method. Based on the latent identity features extracted from a donor face, a Deepfake is created by Decoder 2 trained with target faces.
The GAN-based method,
At present, the typical Autoencoder-based and GAN-based methods are extensively leveraged to create multiple Deepfake datasets, which are adopted as critical baselines for Deepfake detection. It is noteworthy that some previous fake faces created with computer graphics-based methods, such as
With growing concern about such threats, Deepfake detection has become an emerging topic in digital multimedia forensics research. In general, existing typical work can be divided into visual feature-based and deep neural network-based methods.
Visual feature-based methods
Focusing on the macroscopic-level semantics of images, i.e., the difference between real and fake faces visible to the naked eye, some research detects Deepfakes by the artefacts left when they are created.
Intuitively, by observing the artefacts left by splicing synthesized face regions into source images, Yang et al. train a support vector machine (SVM) model, which classifies real and fake faces by comparing their orientation and the position of 3D head poses [44]. Analogously, based on the face warping artefacts left by the matching step to create fake images, Li et al. propose to train a neural network to perceive such artefacts from the face regions and surrounding areas of Deepfakes [26]. Moreover, focusing on the artefacts inevitably left by the face tracking and editing step, including eye color, light reflections, and missing details of teeth or eyes, Matern et al. train a detection network based on these visual features [28].
The above researchers observe the artefacts in the critical steps to create Deepfakes, including face splicing, warping, tracking and editing. Overall, they provide a fundamental strategy of Deepfake detection. Beyond such methods, some research explores physiological characteristics that are difficult for Deepfakes to imitate, which are then applied as discriminative visual features to classify real and fake faces.
Specifically, Agarwal et al. collect the occurrence and intensity of 18 facial action units describing the movements of facial muscles and heads, such as mouth stretch, nose wrinkle and cheek raiser, as visual features for classification [3]. Moreover, Jung et al. observe that the pattern of blinking changes significantly with a person’s biological factors, physical conditions, etc. Therefore, they focus on the visual feature of eye aspect ratio, and classify real and fake videos based on eye blinking counts and periods [22].
Overall, the aforementioned visual feature-based methods, whether based on artefacts or biological characteristics, have provided a feasible perspective for Deepfake detection. Nevertheless, subject to objective conditions such as the angle or posture of faces, these discriminative visual features are unlikely to be available in all images or video frames. In practice, fake faces in various conditions are likely to be encountered, which limits the applicability of such methods.
Deep neural network-based methods
In comparison with the visual feature-based methods, the deep neural network-based ones have overcome the above limitations. Focusing on the microscopic-level semantics of images, i.e., the pixel-level difference between real and fake faces, various types of deep neural networks are devised to classify Deepfakes. Such methods have avoided depending on any specific visual traces, and thus will not be affected by the objective conditions of Deepfakes.
Typically, Güera et al. propose a temporal-aware pipeline, in which a convolutional neural network (CNN) is utilized to extract features, and a recurrent neural network (RNN) is devised to identify Deepfakes [17]. Analogously, focusing on the microscopic properties of images, Afchar et al. propose a fundamental CNN-based detection network, named
The above methods are verified to perform well on specific datasets, which demonstrates that deep neural networks with fundamental architectures are sufficient for the detection task. Beyond these fundamental networks, some research devises more diversified and customized deep neural networks to detect Deepfakes.
Specifically, Zhou et al. propose a two-stream detection network, in which a GoogLeNet model [39] and an SVM model trained with steganalysis features are applied jointly for Deepfake detection [18]. Moreover, considering both spatial and motion information, Wang et al. propose a novel detection system based on 3DCNN models, in which the I3D [7] and 3D ResNet [19] approaches are utilized. In particular, to improve the feature maps of classification models, Dang et al. propose to apply the attention mechanism to highlight the informative regions, and devise a customized CNN for Deepfake detection [11].
Overall, the aforementioned deep neural network-based methods, whether their architecture is simple or complex, can perceive the microscopic differences that are difficult for human eyes to see, and thereby achieve promising results. Moreover, the deep neural network-based methods avoid the limitations of the visual feature-based ones that depend on specific visual traces, and we thus mainly focus on such methods in this paper. Unfortunately, when unseen Deepfakes are encountered, the performance of these carefully devised deep neural networks degrades significantly [27], so the poor generalization on unseen Deepfakes remains an urgent issue to be addressed.
To address the above challenge, two typical solutions are currently proposed. Specifically, Nguyen et al. devise a deep neural network that utilizes multi-task learning [46] to simultaneously classify and segment fake faces, and semi-supervised learning is applied to improve its generalization [29]. Moreover, Cozzolino et al. explore the strategy of transfer learning [31] to learn a forensic embedding using an autoencoder network. The learned embedding acts as an anomaly detector, which enables an unseen fake image to be correctly classified if it is mapped adequately far away from the cluster of real ones [10].
In general, the above two typical methods have improved the generalization capability of deep neural networks on unseen Deepfakes. However, they focus on the computer graphics-based Deepfakes, rather than the emerging deep learning-based ones, and are thus still insufficient to deal with the ever-changing Deepfakes.
Compared with the above related work, the work in this paper has the following characteristics:
Rather than the visual feature-based methods, this work belongs to the category of deep neural network-based methods for Deepfake detection. Compared with the typical deep neural network-based methods, we mainly focus on the generalization issue on unseen Deepfakes, which remains a challenge to existing work. Compared with the work that shares our goal, we aim to propose a comprehensive system that applies to both deep learning-based and computer graphics-based Deepfakes, so as to overcome the limited applicative range of that work.
Summary
In response to the harm of Deepfakes, effective detection methods are essential for protecting the integrity and authenticity of digital visual media [8]. On the one hand, detection methods that focus on identifying discriminative visual features show limited success. On the other hand, methods have also been proposed that devise deep neural networks for Deepfake detection at the microscopic-level semantics of images. Although the latter have achieved promising results, the poor generalization on unseen Deepfakes remains a challenge. A few attempts have been made to address this challenge, but they are still insufficient in terms of their applicative ranges.
Hence, in this paper, we will devise effective neural networks to classify Deepfakes, and investigate how to improve their generalization on unseen Deepfakes.
Cross-distribution Deepfake detection system
In this section, we elaborate on the proposed detection system,
Preliminary detection network
As mentioned in Section 2, Deepfakes are becoming increasingly realistic and indistinguishable to human eyes, yet they can still be differentiated by deep neural networks. Therefore, we analyze and discriminate the differences between real and fake faces at the pixel level instead of the higher semantic level, and detect Deepfakes by learning a binary-classification deep neural network.

The architecture of
In the preprocessing phase, we first utilize a
Subsequently, setting aside the unseen Deepfakes for the moment, a preliminary detection network is constructed to detect Deepfakes, with the architecture shown in the yellow and blue parts of Fig. 3.
Feature extractor
In this component, a feature extractor maps rich visual representations and optimizes underlying features for other components.
Specifically, a basic convolutional neural network (CNN) [25,43] is adopted, which consists of a series of successive convolutional and pooling layers. Moreover, it utilizes batch normalization (BN) [20] for regularization and Rectified Linear Units (ReLUs) as the activation function.
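To make the layer sequence concrete, the following is a minimal single-channel sketch of one convolution, normalization, ReLU and pooling stage. It is an illustration under stated assumptions, not the network used in this paper: the kernel values, the 6x6 input, the 2x2 pooling window and all function names are hypothetical, and the normalization here standardizes a single feature map rather than a mini-batch and has no learned scale or shift.

```python
import math

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def normalize(fmap, eps=1e-5):
    """Simplified stand-in for batch normalization: zero mean, unit variance per map."""
    values = [v for row in fmap for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [[(v - mean) / math.sqrt(var + eps) for v in row] for row in fmap]

def relu(fmap):
    """Rectified linear unit applied element-wise."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h, out_w = len(fmap) // size, len(fmap[0]) // size
    return [[max(fmap[i * size + a][j * size + b]
                 for a in range(size) for b in range(size))
             for j in range(out_w)]
            for i in range(out_h)]

def conv_block(image, kernel):
    """One stage: convolution -> normalization -> ReLU -> max pooling."""
    return max_pool2d(relu(normalize(conv2d(image, kernel))))

# Hypothetical 6x6 input and a 3x3 vertical-edge kernel (illustrative values only).
image = [[float((i * 5 + j) % 7) for j in range(6)] for i in range(6)]
kernel = [[1.0, 0.0, -1.0] for _ in range(3)]
features = conv_block(image, kernel)  # 6x6 -> 4x4 after conv -> 2x2 after pooling
```

A real feature extractor stacks several such stages with many channels and learned kernels; the sketch only shows how the spatial size shrinks and how the operations are ordered.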
Formally, with the feature extractor
Fake classifier
Based on the differences in feature distribution between real and fake faces, a fake classifier is constructed to predict the authenticity of input images. Specifically, the fake classifier is composed of a dense network, which utilizes Dropout [38] for regularization and a sigmoid activation function.
Formally, the feature vector f is fed into the fake classifier
During the training phase, the parameters of the fake classifier (
We define the classification loss
At this point, we have constructed a preliminary neural network to detect Deepfakes, and its performance is evaluated in Section 4.2.
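As an illustration of the fake classifier described above, the sketch below passes a feature vector through dropout, a single dense unit and a sigmoid, and scores the prediction with binary cross-entropy. Several points are assumptions rather than details taken from this paper: the use of binary cross-entropy as the classification loss, the convention that label 1 means fake, the single-unit dense layer, and all names and numeric values.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dropout(features, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(features)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]

def fake_probability(features, weights, bias):
    """Single dense unit followed by a sigmoid: estimated probability of 'fake'."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def binary_cross_entropy(prob, label, eps=1e-12):
    """Loss for one sample; label is 1 for fake and 0 for real (assumed convention)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

# Hypothetical feature vector and parameters, for illustration only.
rng = random.Random(0)
f = [0.5, -1.2, 0.3, 0.8]
weights, bias = [0.4, -0.7, 0.1, 0.9], -0.2
p_train = fake_probability(dropout(f, rate=0.5, rng=rng), weights, bias)
p_test = fake_probability(dropout(f, rate=0.5, rng=rng, training=False), weights, bias)
loss = binary_cross_entropy(p_test, label=1)
```

Dropout is active only during training; at test time the features pass through unchanged, which is why the two calls above can give different probabilities for the same input.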
Domain adaptation component
When unseen Deepfakes are encountered, the seen and unseen ones share identical labels (real and fake) while exhibiting different feature distributions. Based on this observation, we formulate the challenge of unseen Deepfake detection as a problem of cross-distribution data classification, and adopt a strategy of domain adaptation to address it. The strategy is inspired by, and extends, the domain adversarial training network [14,15], a fundamental architecture for cross-distribution data classification in machine learning.
Specifically, we divide those two types of Deepfakes into two sets based on the concept of domain, i.e., the feature space and distributions of samples [31].
Source domain: labelled samples created with the same methods as the training sets.
Target domain: unlabelled samples created with different methods from the training sets.
Our ultimate goal is enabling
To measure the dissimilarity of these two types of feature distributions, a domain adaptation component is augmented to the preliminary detection network, as illustrated in the pink part of Fig. 3.
Domain classifier
Based on the features mapped by the feature extractor, a domain classifier is constructed to predict the domain of an input image. Specifically, the domain classifier adopts the same network architecture as the fake classifier. The feature vector f is input into the domain classifier
During the training phase, the parameters of the domain classifier (
Analogously, we define the classification loss
Subsequently, those three components are considered in a detection network, and the overall optimization objective is then as:
Ultimately, the optimization involves the updates of these parameters, where the parameters of the fake classifier (
At this point, the feature extractor has learned its parameters as training the fake classifier and domain classifier, as well as extracted the discriminative and domain-invariant features to make
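The adversarial interplay between the three components can be sketched as follows. This is a hedged illustration of the generic domain-adversarial scheme of [14,15], in which the combined objective has the form of the classification loss minus a weighted domain loss; it does not reproduce this paper's own objective or symbols. The trade-off weight `lam`, the learning rate, the gradient values and every function name are hypothetical.

```python
def combined_objective(fake_loss, domain_loss, lam):
    """Generic domain-adversarial objective: classification loss minus weighted domain loss."""
    return fake_loss - lam * domain_loss

def feature_extractor_gradient(grad_fake, grad_domain, lam):
    """Gradient reaching the feature extractor.

    The fake-classification gradient passes through unchanged, while the
    domain-classification gradient is negated and scaled (gradient reversal),
    so the extractor is pushed towards features the domain classifier cannot separate.
    """
    return [gf - lam * gd for gf, gd in zip(grad_fake, grad_domain)]

def sgd_step(params, grads, lr):
    """Plain gradient-descent update."""
    return [p - lr * g for p, g in zip(params, grads)]

# Illustrative numbers only.
theta_f = [0.2, -0.1]        # feature-extractor parameters
grad_fake = [0.5, 0.1]       # d(fake loss)/d(theta_f)
grad_domain = [0.3, -0.4]    # d(domain loss)/d(theta_f)
lam, lr = 0.5, 0.1

theta_f_new = sgd_step(theta_f, feature_extractor_gradient(grad_fake, grad_domain, lam), lr)
objective = combined_objective(fake_loss=0.69, domain_loss=0.40, lam=lam)
```

In this scheme the two classifiers are each updated to reduce their own loss, and only the feature extractor receives the reversed domain gradient, which is what drives the mapped features of the two domains together.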
Performance analysis
For a further analysis of
Further, the performance of the fake classifier on the target domain,
To correlate the performance of
Subsequently, considering the defined
During the training phase,
In summary, by training a preliminary detection network with real and fake samples, the discriminative features are learned, which enables
Experiments
In this section, we first introduce the experimental settings, and then evaluate the capacity of
Experimental settings
Datasets
We evaluate this work with four typical datasets, whose samples are created with both deep learning-based and computer graphics-based methods.
To overcome the challenge of GAN-based Deepfake detection, we adopt the
In consideration of its extensive usage for evaluation, we adopt the
To further evaluate
The metrics of accuracy, EER (equal error rate) and AUC (area under the curve) are used as the performance indicators.
Accuracy: The proportion of correctly classified real and fake samples among the total number of samples.
EER: The error rate at the discrimination threshold where the false negative rate equals the false positive rate. These two rates refer to the proportions of misclassified real or fake samples among the total real or fake samples, respectively.
AUC: The area under the receiver operating characteristic (ROC) curve. The ROC curve is created by plotting the true positive rate against the false positive rate at various threshold settings, where the former rate refers to the proportion of correctly classified real samples among the total real samples.
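The three indicators above can be computed from per-sample scores as in the following minimal sketch. It makes a few assumptions that are not taken from this paper: label 1 denotes the positive class (which class counts as positive is a convention), the default decision threshold for accuracy is 0.5, and the EER is approximated by the ROC point where the false positive and false negative rates are closest. All function names are illustrative.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def roc_points(labels, scores):
    """(FPR, TPR) pairs from sweeping the threshold over every distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(labels, scores):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = roc_points(labels, scores)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

def eer(labels, scores):
    """Approximate equal error rate: ROC point where FPR and FNR are closest."""
    fpr, tpr = min(roc_points(labels, scores),
                   key=lambda p: abs(p[0] - (1.0 - p[1])))
    return (fpr + (1.0 - tpr)) / 2.0

# Toy example: four samples, one negative scored above a positive.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(accuracy(labels, scores), auc(labels, scores), eer(labels, scores))
```

With only a handful of samples the ROC curve is a coarse staircase, so the EER found this way is an approximation; on realistic test sets the FPR and FNR curves cross much more closely.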
To comprehensively evaluate this work, the results are evaluated at both image and video levels.
Image level: The testing images are extracted by splitting an input video at regular intervals.
Video level: The testing images are extracted by splitting a testing video into continuous frames, and the testing video is recorded as fake if over 50 percent of its frames are classified as fake.
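The two evaluation levels can be sketched as below: frames are sampled at a regular interval for image-level testing, and a video is labelled fake when more than half of its frames are classified as fake. The sampling interval, the per-frame threshold of 0.5 and the function names are assumptions made for illustration; only the 50 percent voting rule comes from the text.

```python
def sample_frame_indices(num_frames, interval):
    """Indices of frames taken at a regular interval (image-level testing)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return list(range(0, num_frames, interval))

def classify_video(frame_scores, frame_threshold=0.5, vote_ratio=0.5):
    """Return True (fake) when more than `vote_ratio` of the frames are classified fake."""
    if not frame_scores:
        raise ValueError("a video needs at least one frame score")
    fake_frames = sum(1 for s in frame_scores if s >= frame_threshold)
    return fake_frames / len(frame_scores) > vote_ratio

# Hypothetical per-frame fake probabilities for one video.
frame_scores = [0.91, 0.77, 0.35, 0.64, 0.12]
is_fake = classify_video(frame_scores)      # 3 of 5 frames are fake
indices = sample_frame_indices(300, 30)     # one frame out of every 30
```

Because the rule requires strictly more than half, a video with exactly 50 percent fake frames is recorded as real in this sketch.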
Data and platforms
Based on the above datasets, we prepare the videos and their corresponding images for model training and testing, and their statistics are shown in Table 1. In the table, the numbers on the left and right represent the quantities of videos and images respectively, and only the number of videos is reported in the last two columns for testing at the video level.
Statistics of training and testing data
Specifically, for the training sets, we adopt about 90% of real videos and 70% of fake ones in the
In addition, all the implementation and evaluation phases in this paper are carried out on a server with an i7-7820X 3.60 GHz CPU, 64 GiB of memory and three GeForce RTX 2080 Ti graphics cards.
Among all the typical Deepfake detection methods, we apply the multi-task learning-based method [29] and
Specifically, we choose the above two benchmark methods for four reasons. Firstly, they achieve promising results in terms of accuracy, e.g., accuracies of 94.47% and 90.00% on seen and unseen Deepfakes, respectively, are achieved by
Overall, the multi-task learning-based method and
Evaluation of the performance on discrimination
To verify the capacity of
In the testing phase, we first preprocess the testing videos from the
Evaluation of the performance on discrimination
In the table, for the samples from the
However, observed from the table, as encountering the samples from the
To evaluate the generalization of
In the testing phase, we still utilize the same videos and images collected in Section 4.2, i.e., the testing images are preprocessed and selected from the
To compare the results in Table 2 and Table 3, the performance of
Evaluation of the performance on generalization
In general, the experimental results are basically consistent with expectations. Specifically, faced with unseen Deepfakes,
In addition, we further analyze the results by visualizing the distribution of the mapped features. Specifically, we input 1000 fake images from each of the

Visualization of the mapped feature distributions. The activations of the top feature extractor layer before and after domain adaptation are visualized in (a) and (b) respectively, where the distributions of the two types of Deepfakes are represented by purple (source domain) and yellow (target domain) points.
To further evaluate the effectiveness of
To compare with the multi-task learning-based method, we apply exactly the same training and testing data provided by the authors with the same configurations as in [29], and compare
To compare with
Comparison with benchmark methods
Subsequently, we present the classification results in Table 4. To compare with the benchmark methods, each classification result of
In general, compared with these two state-of-the-art methods,
In addition, similar to these two benchmark methods, a small amount of target-domain fake samples are still required to be incorporated into training the domain adaptation component. In reality, such parameter updates are light and fine-tuned, which makes
With the continuous emergence of new methods to create Deepfakes, various detection methods have been proposed in response, which makes Deepfake detection an active topic in digital multimedia forensics research. Although various deep neural networks have achieved promising results in Deepfake detection, their performance degrades greatly when they encounter Deepfakes created with methods different from those in the training sets. This poor generalization on unseen Deepfakes remains a challenge for Deepfake detection methods.
In this paper, to address the above challenge of unseen Deepfake detection, we propose
We approach this challenge from a promising perspective, by formulating it as a task of cross-distribution data classification, i.e., to classify multifold Deepfakes with different feature distributions.
We devise a preliminary detection network of
We address the issue of cross-distribution Deepfake detection with the strategy of domain adaptation, which enables
Our experimental results confirm that
Although promising results are achieved by
Acknowledgments
This work is supported by the National Natural Science Foundation of China (No. 61572469) and the Youth Innovation Promotion Association CAS (No. 2021155).
We would like to thank the anonymous reviewers whose suggestions have helped us to improve the quality of this paper.
