Abstract
Deep learning has been used in computer vision to accomplish many tasks that were previously considered too complex or resource-intensive to be feasible. One remarkable application is the creation of deepfakes. Deepfake images change or manipulate a person’s face to give a different expression or identity by using generative models. Deepfakes applied to videos can change the facial expressions in a manner to associate a different speech with a person than the one originally given. Deepfake videos pose a serious threat to legal, political, and social systems as they can destroy the integrity of a person. Research solutions are being designed for the detection of such deepfake content to preserve privacy and combat fake news. This study details the existing deepfake video creation techniques and provides an overview of the deepfake datasets that are publicly available. More importantly, we provide an overview of the deepfake detection methods, along with a discussion on the issues, challenges, and future research directions. The study aims to present an all-inclusive overview of deepfakes by providing insights into the deepfake creation techniques and the latest detection methods, facilitating the development of a robust and effective deepfake detection solution.
Introduction
Digital media has numerous benefits, including ease of access and vast propagation of information. Any media created on a user’s device can be shared with thousands of people in a matter of minutes. This has changed our lives for the better in most ways. There is, however, a growing concern about the authenticity of the content being shared. In recent years, intelligent systems have been used to create fake content that is so realistic that an unsuspecting person cannot detect it [1–3]. Modification of images has been practiced for many years, but applying the same techniques to videos was previously too cumbersome and resource-intensive. The application of machine learning models to this task has made it possible to easily create convincing fake video content. Even though most of this content is created for recreational purposes, some of the modifications are made with malicious intent and have very adverse effects on the intended audience. To understand the dynamics of video forgery, we begin by explaining the structure of a video.
Video structure
A video is a continuous sequence of images, where each image is known as a frame. These frames, when produced (or displayed) in quick succession, provide the illusion of movement. The order of this sequence constitutes the temporal dimension of a video; by that definition, each video has three dimensions: the two spatial ones, x and y, and a temporal one, as can be seen in Fig. 1 [4]. The audio information related to the captured scene is also encoded with the visual file. Common encoding standards are Motion JPEG [5] and H.264 [6]. Encoding essentially involves creating short segments of video, known as a ‘Group of Pictures (GoP)’, that consist of intra-coded or I-frames, which are the reference frames with the least compression, predicted or P-frames, and bidirectionally predicted or B-frames.

Video as a sequence of frames and a group of I, P, and B frames as a GoP.
While creating a tampered or forged video, changes can be made in two ways: (i) changing the sequence of the frames, causing a disarray in the temporal dimension; and (ii) changing the contents of the frames, disturbing the information content in the spatial dimension [4, 7]. The latest forgeries create a video synthetically, using deep learning [8]. In video synthesis, either the complete frame [1] or a portion of a frame [9–11] is generated by a deep network and arranged in a meaningful sequence to create a complete video.
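The two tampering routes described above can be illustrated with a toy example. This is a minimal sketch, assuming a grayscale video stored as a NumPy array indexed as (t, y, x); the function names and array sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

# A toy grayscale video: 6 frames of 4x4 pixels, indexed as (t, y, x).
video = np.arange(6 * 4 * 4, dtype=np.float32).reshape(6, 4, 4)

def tamper_temporal(frames, drop_index):
    """Remove one frame, disturbing the temporal order of the sequence."""
    return np.delete(frames, drop_index, axis=0)

def tamper_spatial(frames, t, y0, y1, x0, x1, value=0.0):
    """Overwrite a rectangular region of one frame; the frame order is untouched."""
    forged = frames.copy()
    forged[t, y0:y1, x0:x1] = value
    return forged

temporal_forgery = tamper_temporal(video, drop_index=2)
spatial_forgery = tamper_spatial(video, t=3, y0=1, y1=3, x0=1, x1=3)
print(temporal_forgery.shape, spatial_forgery.shape)
```

Temporal tampering changes the number or order of frames, whereas spatial tampering keeps the sequence intact and alters pixel content inside selected frames.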
Synthesized videos are also often known as ‘motion transfer’ videos [12]. Synthetic video generation has many noble uses, such as fixing old and damaged videos [13, 14] and providing a voice to those who have passed on; but it has also given rise to deepfakes. A video or image of a person that is generated or manipulated using deep learning is commonly known as a deepfake. The term
Deepfake videos manipulate a person’s face to alter the expression and speech. This means that if we have a person’s identity information, we can manipulate it to create any propaganda video. Deepfakes are a threat to the privacy and security of people [15]. Deepfakes can be created in numerous ways; notable among them is the use of generative adversarial networks (GANs) for video alteration. Using GANs, we can re-enact facial expressions [9] and apply face-swapping [10]. These approaches allow the source user’s expressions to be aligned to the target user’s face [9, 17]. Another way of creating a deepfake video is re-dubbing, by attaching a different audio track to the target user [11, 18]. These sophisticated techniques make it very difficult for human viewers to distinguish between a real video and a synthesized video [17].
Even though deepfakes have many harmless uses in the entertainment industry, they are often used for malicious purposes. This research study is dedicated to understanding the creation and detection of deepfake videos. The importance associated with deepfakes arises from the impact they have on the community. Throughout history, videos have been considered ‘believable evidence’. Forging them can make it very easy to manipulate innocent people. When social media is used as a tool to spread such forged media with the intention of causing misinformation, the problem becomes much more serious. Recent research has been dedicated to finding ways to detect deepfake media using deep learning solutions. Work is continuously being done to improve both the deepfake creation techniques and the detection schemes. We present numerous detection techniques in this study, highlighting their strengths and shortcomings, and providing insights into creating better solutions. Many recent studies have been dedicated to understanding how deepfakes are made and what approaches have been adopted to detect them [19–28]. These studies present an overview of the domain and an understanding of the dynamics behind deepfakes. Other works, such as [29–31], have discussed the social and moral impact of deepfakes. A thorough analysis of deepfake creation mechanisms, with some insight into the detection schemes, is given in [23], whose authors present a list of architectures that are used for deepfake creation. For a deeper understanding of deepfake creation, [23] should be studied. A more detailed discussion of deepfake detection techniques is found in [19, 25], which we elaborate on in our work, giving greater insight into the deepfake detection methods by discussing the various categorizations of the detection schemes. Table 1 provides an overview of these studies for a better understanding of their scope.
Surveys and studies on Deepfakes
We first discuss the deepfake creation methods in Section 2, along with the available datasets created using those methods. In Section 3 we highlight the challenges to detecting deepfakes and later present an analysis of the numerous deepfake detection schemes in Section 4. Conclusion and future directions are presented in Section 5.
Traditional modes of video editing and tampering were once resource-intensive work. The techniques applied to image editing were extended to videos with a fair amount of success [33–37]. In the beginning, video modification work proceeded in a limited number of directions: inpainting [34–36, 39], object forgeries [37, 41], and frame prediction and interpolation [33, 43]. The more sophisticated techniques attempted to transfer motion from one video to another. Motion transfer has been done with traditional methods, but the more interesting approaches make use of deep learning models [8, 17]. Deep models have also been used for image synthesis [44–49] and for video synthesis [3, 50]. Recently, many deep architectures such as autoencoders and Generative Adversarial Networks (GANs) have been used to create forged videos.

Deepfake generation processes can be categorized by the amount of face replacement they perform. Top Row: (a) Original frame (b) Selected region comprising the complete facial area. Bottom Row: (c) Original frame (d) Selected region comprising the mouth region only.
This process of generating a portion of a frame and embedding back into the original video can be implemented using graphics-based algorithms or learning-based algorithms. Each of these is discussed in detail ahead for various categories of deepfakes.
The second issue is that of generating realistic content in real time. It is easier to produce offline content; online or real-time content, however, requires processing speeds that are not achievable if the content produced is of high quality [3, 53].
Deepfakes can be broadly classified as either
Facial identity manipulation, commonly known as

Replacement Deepfakes: Original input image of the source actor, from which we extract identity features, combined with the input image of the target actor, and final replacement result. Images taken from FaceForensics++ dataset [58].
Deepfake generation techniques that perform identity manipulation or replacement
Graphical approaches involve the creation of 2D and 3D models of human faces. Taylor et al. [56] use a computer graphics (CG) animated model. First the authors use audio to generate a phonetic sequence. Animation parameters for the final image are generated from this phonetic sequence. Karras et al. [18] use a similar approach where they produce 3D vertex coordinates of a face mesh corresponding to the sample. Garrido et al. [57] use a video to video approach where they transfer the mouth shape from the dubber to the target user.
The FaceSwap algorithm, as used in FaceForensics++ by Rossler et al. [58], is a graphics-based approach that transfers the face region from a source person’s video to a target person’s video. The method detects sparse facial landmarks to extract the face region. With these landmarks, a 3D template model is fitted using blendshapes. Gauss-Newton optimization is used for error minimization. Lastly, color correction is performed to blend the image back into the original.
Learning based approaches for replacement Deepfakes
Learning-based approaches make use of various generative networks for creating face-swapped deepfakes. The first deepfake application, ‘FakeApp’, uses a paired autoencoder-decoder structure [54]. Later, the focus shifted to Generative Adversarial Networks, as in [59]. The work in [59] adds adversarial loss and perceptual loss to an encoder-decoder network to produce higher-quality deepfakes.
Korshunov and Marcel [60] used a public GAN-based face-swapping algorithm [59]. The algorithm is an adaptation of CycleGAN [61] with the weights of FaceNet [62]. For face detection and alignment, their work uses Multi-Task Cascaded Convolutional Networks. Once the faces have been swapped, they use a Kalman filter to smooth out the jitter that swapping may have introduced on the faces. They also released a dataset named DeepfakeTIMIT.
Rossler et al. [58] created a deepfake dataset that includes a face-swap strategy based on a GitHub implementation [55]. The architecture consists of two autoencoders with a shared encoder, trained to reconstruct the images of the source and target faces. To create a fake image, the trained encoder and the decoder of the source face are applied to the target face. To blend the output of the autoencoder back into the original image, Poisson image editing [63] is used. Based on the same implementation, Google [64] has also released a deepfake dataset named DeepFakeDetection. The dataset is available with the FaceForensics++ dataset. Li et al. [65] also used the same algorithm to generate Celeb-DF, improving the output quality by addressing the low resolution of the images and performing color augmentation.
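The shared-encoder design can be sketched in a few lines. This is an illustrative forward pass only, assuming flattened face crops and single linear layers with random, untrained weights; the sizes, weight names, and functions are hypothetical and do not reproduce the implementation in [55].

```python
import numpy as np

rng = np.random.default_rng(0)
FACE_DIM, LATENT_DIM = 64 * 64, 128  # illustrative sizes, not taken from [55]

# One encoder shared by both identities, and one decoder per identity.
W_enc = rng.normal(scale=0.01, size=(LATENT_DIM, FACE_DIM))
W_dec_source = rng.normal(scale=0.01, size=(FACE_DIM, LATENT_DIM))
W_dec_target = rng.normal(scale=0.01, size=(FACE_DIM, LATENT_DIM))

def encode(face):
    """Map a flattened face crop to the shared latent space."""
    return np.tanh(W_enc @ face)

def decode(latent, W_dec):
    """Reconstruct a face from a latent code with an identity-specific decoder."""
    return W_dec @ latent

def reconstruction_loss(face, W_dec):
    """Mean squared error that training would minimise for each identity."""
    return float(np.mean((decode(encode(face), W_dec) - face) ** 2))

def swap_face(target_face):
    """Encode a target frame, then decode it with the SOURCE decoder."""
    return decode(encode(target_face), W_dec_source)
```

Because the encoder is shared, the latent code is meant to capture pose and expression common to both people, while each decoder supplies one identity. After training, decoding a target frame with the source decoder yields the swapped face, which is then blended back into the frame.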

Reenactment Deepfakes: Original input image of the source actor, from which we extract expression features only, combined with the input image of the target actor, and final reenactment result. Images taken from FaceForensics++ dataset [58].
The UADFV dataset [66] was one of the first public datasets, comprising 98 videos: 49 real and 49 forged. The real videos are taken from YouTube, while the forged videos are created using the FakeApp mobile application for deepfake generation.
Deepfake-TIMIT [67], introduced by Korshunov and Marcel, consists of 620 videos of 32 subjects, generated from the Vid-TIMIT dataset by applying faceswap-GAN [59]. Each subject has 20 deepfake videos, of which 10 are generated by the model with a 64x64 output size and the others by the model with a 128x128 output size. The generative network is adapted from CycleGAN [61] using the weights of FaceNet [62]. For face detection and alignment, the authors use a Multi-Task Cascaded Convolutional Network. A Kalman filter is used as an added post-processing step to smooth the bounding box positions across frames and reduce the jitter on the swapped faces.
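The Kalman smoothing step can be illustrated on a single bounding-box coordinate. This is a minimal sketch, assuming a constant-position model with hand-picked noise variances; the function name and parameter values are hypothetical, and the filter used in faceswap-GAN [59] may be configured differently.

```python
def kalman_filter_1d(measurements, process_var=0.01, measurement_var=1.0):
    """Filter a 1-D sequence (e.g. a box x-coordinate) with a constant-position model."""
    estimate = float(measurements[0])  # initialise with the first observation
    error_var = 1.0
    smoothed = [estimate]
    for z in measurements[1:]:
        error_var += process_var                         # predict: uncertainty grows
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)                # update: move towards measurement
        error_var *= (1.0 - gain)
        smoothed.append(estimate)
    return smoothed

# A box coordinate that jitters by four pixels from frame to frame.
jittery_x = [100, 104, 100, 104, 100, 104, 100, 104]
smooth_x = kalman_filter_1d(jittery_x)
```

A larger `measurement_var` relative to `process_var` trusts the detector less and gives a steadier box, at the cost of lagging behind genuine head motion.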
FaceForensics++ [58] is one of the largest and most popular forensics datasets. It is an extension of the FaceForensics dataset originally released in 2018 by the same team. The dataset consists of 1000 original videos, sourced from YouTube, that have been manipulated with four generation techniques: Face2Face, Deepfakes, FaceSwap, and NeuralTextures. The identity-swap or replacement videos were made using both the graphics-based approach (FaceSwap) and the learning-based approach (DeepFake). For the graphics approach, the authors used the publicly available implementation ‘FaceSwap’ [54], which involves face alignment, Gauss-Newton optimization, and a blending function for video generation. For the learning-based approach, they used the DeepFake FaceSwap algorithm [55], which comprises two autoencoder networks with a shared encoder, where one autoencoder is trained to reconstruct the face of the target person while the other is trained to reconstruct the face of the source person. A face detector is used to crop and align the face images, and the autoencoder is used to reconstruct the face of the target entity. The resultant image is blended back into the original (target) face using Poisson image editing [63].
All videos in this dataset contain a frontal face that is easily trackable. There are no occlusions either, making it easy for manipulation algorithms to create realistic forgeries. The dataset also contains videos compressed at different levels, resulting in three different qualities: (i) Raw (original quality), (ii) HQ (quantization parameter of 23), and (iii) LQ (quantization parameter of 40).
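The compression levels listed above can be approximated with a standard H.264 encoder. The following sketch only builds the command lines and does not run them; it assumes an `ffmpeg` build with `libx264` and that the quantization value is passed through the `-crf` option, neither of which is stated in the source. File names are placeholders.

```python
def h264_compress_command(src, dst, quantization):
    """Build an ffmpeg command that re-encodes `src` with H.264 at a given quality."""
    return ["ffmpeg", "-i", src, "-c:v", "libx264", "-crf", str(quantization), dst]

# Quality levels as named in the text; the raw videos are left untouched.
QUALITY_LEVELS = {"HQ": 23, "LQ": 40}

commands = {
    name: h264_compress_command("input.mp4", f"output_{name}.mp4", value)
    for name, value in QUALITY_LEVELS.items()
}
for name, cmd in commands.items():
    print(name, " ".join(cmd))
```

A higher value means coarser quantization, so the LQ videos lose more of the fine detail on which many spatial detectors rely.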
Google [64] also released a deepfake dataset (DeepFakeDetection) of 3000 videos, which has been available with FaceForensics++ since October 2019. The company recorded various actors and then generated deepfakes using the freely available face-swapping implementation ‘Deepfake FaceSwap Github’ [55]. The dataset contains 363 real videos from 28 paid actors recorded in 16 different scenes. Just like FaceForensics++, DeepFakeDetection is provided at three different quality levels: RAW, HQ, and LQ.
One of the latest additions to deepfake datasets is the Celeb-Deepfake dataset [65], presented by Li et al. in 2019. The dataset is generated from YouTube videos with an average length of 13 seconds at 30 fps. The dataset is designed to contain videos of better quality that do not exhibit many visual artifacts. For this purpose, it uses color augmentation, randomly changing brightness, contrast, color distortion, and sharpness in the training data to correct the color inconsistencies introduced in synthesized videos. Celeb-DF [65] has 408 real videos and 795 fake videos, created using a customized version of the Deepfake faceswap algorithm [55].
Facebook has also released a deepfake dataset in collaboration with institutions and companies such as MIT, Microsoft, and Amazon, together with the Deepfake Detection Challenge (DFDC) [69] on Kaggle. A preview dataset was initially released with 1131 real videos from 66 paid actors and 4119 fake videos. The complete dataset is over 470 GB in size. The algorithm used to create these deepfakes has not been disclosed.
Table 3 provides a list of the deepfake video datasets created by using the face replacement strategy.
Deepfake datasets created with identity manipulation or replacement strategy
The second major category of deepfakes is
In this method, we transfer the facial expressions of one person onto another without changing the target person’s identity. To achieve this, we extract the key points of a selected region of a face (usually the mouth region), then track and replace their movements. Prominent works in this domain include Face2Face by Thies et al. [16], “Synthesizing Obama” by Suwajanakorn et al. [9], and “You said that?!” by Jamaludin et al. [70]. Notable techniques in the domain of expression modification are discussed in the following subsections, and they are listed in Table 4.
Deepfake generation techniques that perform expression manipulation or Re-enactment
*Employed by Rossler et al. [17] in 2018 for deepfake generation.
The Face2Face technique, as presented by Thies et al. [16], transfers the expressions of a source video to a target video while maintaining the identity of the target person. Keyframes are selected manually from the input video stream. These frames are used to generate a dense reconstruction of the face (using 76 blendshape coefficients), which can be used to re-synthesize the face under different lighting conditions and with different expressions. The expression, rigid pose, and lighting parameters are computed for every frame and later composed into a video.
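The role of the blendshape coefficients can be shown with a linear face model, in which a mesh is a neutral identity shape plus a weighted sum of expression offsets. This is a minimal sketch with random placeholder data; the vertex count, basis, and function names are hypothetical and do not reproduce the model fitted in [16].

```python
import numpy as np

rng = np.random.default_rng(0)
N_VERTICES, N_BLENDSHAPES = 500, 76  # 76 coefficients as in the text; vertex count is arbitrary

# Expression basis shared by all faces: one 3D offset field per blendshape.
blendshapes = rng.normal(scale=0.05, size=(N_BLENDSHAPES, N_VERTICES, 3))

def synthesize(neutral_face, coeffs):
    """Mesh = identity-specific neutral shape + weighted sum of expression offsets."""
    return neutral_face + np.tensordot(coeffs, blendshapes, axes=1)

source_neutral = rng.normal(size=(N_VERTICES, 3))
target_neutral = rng.normal(size=(N_VERTICES, 3))
source_coeffs = rng.uniform(0.0, 1.0, size=N_BLENDSHAPES)  # expression tracked in the source

# Reenactment: keep the target identity, drive it with the source expression.
reenacted = synthesize(target_neutral, source_coeffs)
```

Identity and expression are separate terms of the sum, so swapping only the coefficients changes what the face is doing without changing whose face it is.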
Learning based approaches for Reenactment deepfakes
Thies et al. [71] developed an approach to perform style and motion transfer using neural textures. Rossler et al. [58] used neural textures in FaceForensics++ so that facial textures are ‘learned’ rather than calculated. The approach uses the original video data to learn a neural texture of the target person, together with a rendering network for the final rendition. The network is trained using a ‘photometric reconstruction loss’ along with an ‘adversarial loss’. Instead of regenerating the complete face, only the mouth region was regenerated [58] and blended into the original face.
Some mobile applications, such as FaceApp [72], also allow changing the expression of a person, for example from happy to sad, or smiling less or more. A prominent architecture in such applications is StarGAN, as presented by Choi et al. [73].
Audio-to-Video based approaches for Reenactment deepfakes
One of the noteworthy techniques for audio-to-video reenactment deepfakes is by Jamaludin et al. [70], who created a real-time model that is not limited to the seen training data but works just as effectively on unseen data. The authors developed an encoder-decoder convolutional neural network that uses joint embeddings of the face and audio to generate synthesized talking-face video frames. The model is cross-domain self-supervised, as the videos are unlabeled; the audio information is used to create labels for the videos. The ‘redubbed’ videos are visually blended into the original video using a multi-stream convolutional neural network (CNN) model. As a final step, the source image’s facial landmarks are spatially registered with those of the target image. The results are promising, as the model works on unseen samples just as effectively as on seen samples, and thus provides a generality that previous techniques lacked.
Among other works, a very successful approach is that of Suwajanakorn et al. [9], who synthesized video of Barack Obama, the former US president. A recurrent neural network (RNN) is employed that learns a mapping from raw audio input to mouth shapes. These are then composited into the original video to create a realistic effect. The main limitation is that their model needs to be re-trained for each individual and requires a significant amount of data to do so.
Some other prominent works include those by Vougioukas et al. [74], Zhou et al. [75], and Song et al. [76]. Vougioukas et al. [74] generated videos using temporal GANs. These videos have lip movements in sync with the audio and natural facial expressions such as blinks and eyebrow movements. The temporal GAN uses three discriminators focused on achieving detailed frames, audiovisual synchronization, and realistic expressions. Zhou et al. [75] use an encoder-decoder network for learning a disentangled audio-visual representation. A joint audio-visual representation is learned through audiovisual speech discrimination by associating several supervisions. They disentangle the person-identity and speech information through adversarial learning for better talking-face generation. The authors show that unifying the audiovisual streams helps in generating an arbitrary-identity talking face from either video or audio speech as input in an end-to-end framework. Song et al. [76] use a conditional recurrent neural network, in which they incorporate image and audio in the same recurrent unit to create the temporal dependency needed for smooth transitions of lip and facial movements throughout the video. Their work uses a ‘multi-task adversarial training scheme’, which uses a pair of spatial-temporal discriminators for image realism and video realism, and a lip-reading discriminator to boost the accuracy of lip synchronization.
A brief reference to these techniques is presented in Table 4.
Datasets for Reenactment deepfakes
In terms of face reenactment deepfakes, the only dataset available so far is the FaceForensics++ dataset [58]. The audio-to-video deepfake studies have not produced any public datasets.
Of the four types of deepfakes available in FaceForensics++, ‘Face2Face’ and ‘NeuralTextures’ are reenactment or expression-swapping deepfakes. Face2Face [16] is a graphics-based approach that transfers the expression of the source person to the target person using keyframe selection. The first frames of each video were used to obtain a temporary face identity that was tracked over the rest of the video. To embed the expressions onto the original video, 76 blendshape coefficients were used.
The other approach, NeuralTextures [71], is a learning-based approach developed using a conditional GAN. This approach uses data to learn a neural texture of the target person and renders it using a rendering network. The implementation uses a patch-based GAN loss for optimizing results.
The FaceForensics++ dataset contains 1000 original videos and 1000 deepfakes created with each of the manipulation methods. These videos are available in three different qualities, based on their compression levels: (i) RAW, (ii) HQ with a quantization parameter of 23, and (iii) LQ with a quantization parameter of 40. A benchmark was released for the same dataset in July 2019 and has been used by many subsequent researchers for the classification of videos as forged or real. A description of these datasets can be seen in Table 5.
Deepfake datasets created with expression manipulation or Reenactment strategy
There are numerous challenges involved in the detection of deepfake videos, since the technologies used to generate deepfakes are constantly evolving and improving. Some of the major challenges are:
Evolving technology: The latest GAN technologies, such as StyleGAN and ProGAN, can easily generate high-quality content that is harder to detect using end-to-end machine learning classifiers.
Generalization: A classifier trained on media from one dataset, generated by one type of GAN technology, generally fails to produce similar results on media generated by another GAN technology, since each procedure produces its own unique patterns.
Real-time processing: Many recently developed tools can create deepfakes in real time [70, 77–79]. Even though these tools are not yet mature and do not produce high-quality results, they pose a unique challenge in identifying such media and preventing its spread.
Various detection techniques have achieved different levels of success against each of these challenges. Even the state-of-the-art techniques have not been able to fully address all of the issues in a way that has a definitive impact on research in this field. We discuss the detection techniques in the next section.
Deepfake detection methods
Deepfake detection has gained heightened interest from the research community, as the threat posed by deepfakes is profound. The detection methods follow two general strategies: the first focuses on the exploitation of image features and thus works in the spatial domain only, while the second makes use of features from both the spatial and temporal domains to define meaningful detectors for deepfakes.
Spatial domain deepfake detectors
In spatial domain analysis, some studies have used specialized features for the detection of visual artifacts in videos (or sets of frames), while others harness the power of deep networks for the same task. We discuss these in detail in the coming sections and provide a summary in Tables 6 and 7. In these tables, ‘Type’ signifies the kind of deepfake that was used in the study and can take the values ‘Re.P’ for replacement deepfakes, ‘Re.E’ for reenactment deepfakes, and ‘RR’ for both replacement and reenactment deepfakes.
Spatial Deepfake detection using End-to-End solutions (sorted on type and detection algorithm)
Spatial Deepfake detection using selected features (sorted on type and detection algorithm)
Among the earliest works is that of Zhou et al. [80], who used a two-stream network for face manipulation detection. One stream is a face classification stream based on the GoogLeNet convolutional network that differentiates between a real face and a fake face, while the other extracts steganalysis features with a triplet loss and an SVM classifier. Another work that uses an SVM with spatial features is by Wang et al. [81].
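For readers unfamiliar with the triplet loss mentioned above, a minimal sketch follows. It assumes plain Euclidean distances on small feature vectors and a margin of 1.0; the distance, margin, and inputs used in [80] may differ.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# The positive shares the anchor's class; the negative comes from the other class.
well_separated = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
too_close = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])
```

Minimizing this loss pulls features of the same class together and pushes features of different classes apart, which makes a simple classifier such as an SVM more effective on top of them.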
Matern et al. [82] exploited visual inconsistencies in generated images, such as those in global consistency, illumination estimation, and geometry estimation. They tested their approach on images only, with limitations on the size and type of images considered. For classification, they used a logistic regression model and a Multilayer Perceptron (MLP). Using a private dataset, they report an AUC of 85% using the MLP system. Similar work is presented in [83].
Chan et al. [51], Li et al. [65], Dolhansky et al. [69], Ding et al. [84], Do et al. [85], Li et al. [86], Tu et al. [87], de Lima et al. [88], and Bonettini et al. [89] have also used a convolutional neural network for both feature extraction and classification. Du et al. [90] and Fernando et al. [91] likewise use no specialized features, but for classification [90] uses an autoencoder GAN network while [91] uses a memory network.
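Several results in this section are reported as AUC, the area under the ROC curve. As a reminder of what that number measures, the sketch below computes it as the probability that a randomly chosen fake sample receives a higher score than a randomly chosen real one; the labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (fake, real) pairs ranked correctly; ties count as half."""
    fake = [s for label, s in zip(labels, scores) if label == 1]
    real = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if f > r else 0.5 if f == r else 0.0
        for f in fake
        for r in real
    )
    return wins / (len(fake) * len(real))

labels = [1, 1, 0, 0]  # 1 = fake, 0 = real
perfect = roc_auc(labels, [0.9, 0.8, 0.3, 0.1])
imperfect = roc_auc(labels, [0.9, 0.4, 0.6, 0.1])
print(perfect, imperfect)
```

An AUC of 0.5 corresponds to random guessing, so a reported 85% means that about 85 out of 100 fake/real pairs are ordered correctly by the detector's score.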
Afchar et al. [92] extract ‘mesoscopic’ properties of images and design a shallow network consisting of 4 convolutional layers. In a modification of the same network, they used an Inception module. Their approach was initially tested on a private dataset, achieving an accuracy of over 98%, and was later added to the FaceForensics++ benchmark, where the pre-trained model was tested on unseen data with promising results.
Rossler et al. [58], in FaceForensics++, used a set of techniques for synthetic video detection to set a benchmark on a large dataset generated using four synthesis techniques (FaceSwap, DeepFakes, Face2Face, and NeuralTextures). The original dataset is collected from YouTube videos and is processed to select the face region. The five different implementations they used are: (i) extracted steganalysis features fed to a CNN for classification, (ii) MesoInception-4 as presented in [92], (iii) a shallow CNN designed to suppress the high-level content of an image, (iv) XceptionNet pre-trained on the ImageNet dataset, and (v) a CNN with a global pooling layer that computes four statistics: mean, variance, maximum, and minimum.
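The global pooling layer in item (v) can be sketched as follows. This is a minimal illustration, assuming feature maps in (channels, height, width) layout; the function name and the layout are assumptions, and the surrounding CNN is omitted.

```python
import numpy as np

def global_stat_pool(feature_maps):
    """Reduce (C, H, W) feature maps to a (4*C,) vector of per-channel statistics."""
    flat = feature_maps.reshape(feature_maps.shape[0], -1)
    return np.concatenate([
        flat.mean(axis=1),  # mean of each channel
        flat.var(axis=1),   # variance of each channel
        flat.max(axis=1),   # maximum of each channel
        flat.min(axis=1),   # minimum of each channel
    ])

features = np.arange(8, dtype=np.float64).reshape(2, 2, 2)  # two 2x2 channels
pooled = global_stat_pool(features)
print(pooled)
```

Because the output length depends only on the number of channels, such a layer accepts inputs of any spatial size and summarizes the distribution of activations rather than their location.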
In [93] the image features extracted are passed to a capsule network for classification, whereas in [81] the authors use a VGGnet for extracting features and SVM for classification.
Nguyen et al. [94] proposed a multitask learning system based on CNNs, used to both detect deepfake images and videos and to locate the manipulated regions. The autoencoder network uses a Y-shaped decoder enabling the autoencoder to share valuable information between classification, segmentation and reconstruction tasks. The technique was tested on the FaceForensics dataset’s Face2Face method and the given results are not very generalizable.
Stehouwer et al. [95] used attention maps with a CNN to detect the modified regions of an image, along with a classification of the video as real or forged. The attention mechanism introduced is easy to implement and can be inserted into any existing network. The technique is tested on the Deepfake dataset released by Google and achieves an AUC of 99.43%. A summary of these is given in Table 6.
Selected features-based solutions
Zhang et al. [96], in 2017, made use of facial landmarks, exploiting their inconsistencies and using an SVM and an MLP as classifiers to discriminate between pristine (camera-generated) videos and synthesized videos. The technique shows some effectiveness but is limited to images only. Bao et al. [97] extract a person’s facial features as a representation of identity and use an encoder-decoder GAN network for classification. They used a custom dataset and report an accuracy of over 90%.
A similar approach is used by Agarwal et al. [98], who extract facial action units and employ a Support Vector Machine for classification. Other works that use facial features include research by Montserrat et al. [99]. Korshunov and Marcel [67] use a set of video frames and extract image quality metrics from them. A usable subset of these features is then selected using PCA and LDA and classified using an SVM.
Photo Response Non-Uniformity (PRNU) analysis has been widely used for detecting image and video forgery. Koopman et al. [100] explored the use of the same technique for the identification of deepfake videos. The authors make the case that once a region of an image is swapped, the local PRNU in that region of interest changes. In the case of deepfake videos, we should therefore see a difference in PRNU in the facial area throughout the video. The video is broken down into frames, which are cropped to extract the facial region. The cropped regions are sequentially separated into 8 groups, and an average PRNU is calculated for each one of these. Analysis results show that there is a significant difference between the mean normalized cross-correlation scores of deepfakes and original videos. The dataset used for the experiments is rather small, consisting of only 10 authentic videos and 16 forged videos, but the results suggest that there may be potential in this statistical approach for deepfake detection. Taking advantage of intrinsic image properties, Akhtar and Dasgupta [101] extract local binary patterns (LBP) from video frames. These LBP features are passed on to an SVM classifier for final classification. Durall et al. [102] similarly use DFT features with an SVM classifier. Bonomi et al. [103] have used a similar approach with local derivative patterns on three orthogonal planes.
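The PRNU pipeline described above (crop faces, split them into 8 sequential groups, average a noise pattern per group, compare patterns by normalized cross-correlation) can be sketched as follows. This is not the implementation of [100]: it assumes a 3x3 mean filter as a stand-in denoiser, compares all pairs of group patterns, and runs on synthetic data. All function names are hypothetical.

```python
import numpy as np

def noise_residual(frame):
    """Approximate sensor noise as the frame minus a 3x3 mean-filtered copy."""
    padded = np.pad(frame, 1, mode="edge")
    h, w = frame.shape
    blurred = sum(padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)) / 9.0
    return frame - blurred

def group_patterns(face_crops, n_groups=8):
    """Split crops into sequential groups and average the residuals in each group."""
    groups = np.array_split(np.asarray(face_crops, dtype=np.float64), n_groups)
    return [np.mean([noise_residual(f) for f in g], axis=0) for g in groups]

def normalized_cross_correlation(a, b):
    """Correlation coefficient between two patterns, in [-1, 1]."""
    a, b = a - a.mean(), b - b.mean()
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def mean_ncc(face_crops, n_groups=8):
    """Mean NCC over all pairs of group patterns; needs at least n_groups crops."""
    patterns = group_patterns(face_crops, n_groups)
    scores = [
        normalized_cross_correlation(patterns[i], patterns[j])
        for i in range(len(patterns))
        for j in range(i + 1, len(patterns))
    ]
    return float(np.mean(scores))

# Synthetic illustration: a fixed "sensor" pattern versus unrelated noise per frame.
rng = np.random.default_rng(0)
sensor = rng.normal(size=(16, 16))
consistent_crops = [sensor + 0.1 * rng.normal(size=(16, 16)) for _ in range(32)]
inconsistent_crops = [rng.normal(size=(16, 16)) for _ in range(32)]
```

On this synthetic data, the crops that share one noise pattern correlate strongly across groups, while the crops without a shared pattern do not, which mirrors the difference in mean correlation scores reported for original and deepfake videos.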
Du et al. [104] used attention maps, detected using an autoencoder, to highlight the areas of a facial image that may have been tampered with. Tested on the FaceForensics dataset, their technique reports an accuracy of over 60% on FaceSwap and over 95% on Face2Face. Li et al. [105] have also used attention maps to highlight the tampered face region, which is classified by a fully connected neural network. Similar work is presented by Li et al. in [106].
The works in [20, 108] and [109] experiment with facial landmarks. Other than [109], all of these works perform CNN-based classification. Kim et al. [109] use a 1D CNN, since they compare the discrepancies between the edges and the background of a tampered face with those of a non-tampered face.
Mittal et al. [111] take an interesting approach, using emotion features extracted from visual data to make predictions with an encoder-decoder CNN. They tested their work on the DeepfakeTIMIT dataset and reported an accuracy of over 89%.
In [112], Li and Lyu exploit a limitation of generative networks, namely that they can only generate images of a limited resolution. To match the resolution of the original or target video, these image segments are often warped. Such manipulations leave behind detectable artifacts in the video. The authors used four different CNN implementations, VGG16, ResNet50, ResNet101, and ResNet152, to detect such artifacts in and around the face region. Results were generated by testing on the UADFV and DeepfakeTIMIT datasets.
Yang et al. [66] have argued that blending or synthesizing faces into original images leaves behind inconsistencies in images. In their study, they extract 68 facial landmarks from the full face area, normalize them and classify the frames based on these as real or fake using an SVM classifier. Tested on UADFV, their approach achieved an AUC of 89%. A summary of these is given in Table 7.
Discussion
A video is essentially composed of individual frames that are simply images. These images can be thoroughly tested for inconsistencies, since generative models leave behind some ‘artificial fingerprints’. These inconsistencies give some idea of the authenticity of the image and, by extension, the authenticity of the video. The techniques that work only in the spatial domain have focused on exploiting inconsistencies such as colors, shapes, and object deformities. Most researchers have used deep networks for both feature extraction and classification [92, 113], while others have worked with hand-crafted features for the same task [66, 112]. Since deepfake creation is in its rudimentary stages, even techniques that directly employ a deep network for end-to-end classification perform very well, provided the dataset was developed using some open-source technology. This is also because all experiments have been carried out on raw outputs. With post-processing, as is done in [9], most of the inconsistencies that these methods capitalize on will no longer be available, calling into question the effectiveness of these techniques. We also cannot rule out the fact that more and more data is now available, particularly since mobile consumers of such deepfake apps are willingly giving up their data. Algorithms are now being trained on more realistic data, giving better performance. Even though many algorithms claim accuracies close to 99%, they have not been tested on in-the-wild data, and their performance can be expected to vary under such a change.
Temporal domain deepfake detectors
In this section we discuss the studies that have explored beyond the spatial domain and have considered temporal information as well. Temporal information can be incorporated by analyzing a set of frames instead of focusing on a single frame or image only. We discuss these in detail in the following sections and provide a summary in Tables 8 and 9.
Temporal Deepfake detection using End-to-End solutions (sorted on type and detection algorithm)
Temporal Deepfake detection using selected features (sorted on type and detection algorithm)
In Tables 8 and 9, ‘Type’ signifies the kind of deepfake that was used in the study and can take the values ‘Re.P’ for replacement deepfakes, ‘Re.E’ for reenactment deepfakes, and ‘RR’ for both replacement and reenactment deepfakes.
Güera and Delp [114] use an InceptionV3 pre-trained on ImageNet for extracting features from frames and an LSTM model consisting of one hidden layer with 2048 memory blocks for temporal sequence analysis. Attached at the end are two fully connected layers that provide the probability of a frame being real or fake. Evaluation was done on a proprietary database with an accuracy of 97%. Tariq et al. [115] have also used a combination of CNN and LSTM in their work.
Singh et al. [116] used a similar approach, combining a CNN with LSTMs. The authors used a time-distributed approach in order to extract features from multiple frames at the same time. On the DFDC dataset, the videos were found to contain 300 frames on average, of which 30 frames per video were used for the experiments.
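The frame budget described above (30 frames out of roughly 300 per video) can be illustrated with a short Python sketch. The source does not state how the 30 frames were chosen, so evenly spaced sampling is an assumption here, and the function name is hypothetical.

```python
def sample_frame_indices(total_frames, n_samples=30):
    """Return up to n_samples evenly spaced frame indices (assumed strategy)."""
    if total_frames <= 0 or n_samples <= 0:
        raise ValueError("total_frames and n_samples must be positive")
    if n_samples >= total_frames:
        # Short video: keep every frame rather than repeating indices.
        return list(range(total_frames))
    step = total_frames / n_samples
    return [int(i * step) for i in range(n_samples)]
```

For a 300-frame video this selects every tenth frame, which keeps the temporal order intact while reducing the sequence fed to the recurrent layers by a factor of ten.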
3D CNNs have often been used to extract information from the temporal dimension of a video. Nguyen et al. [117] and Wang and Dantcheva [118] have used this approach. Wang and Dantcheva [118] tested three 3D CNNs (3D ResNet, 3D ResNeXt, and I3D) on the FaceForensics dataset and reported a true classification rate of over 80% for each FaceForensics category, with a best result of 95.13% for FaceForensics DeepFake. Dogonadze et al. [119] extended the approach by using a BiLSTM with a 3D CNN.
Sabir et al. [120] have also used temporal features to capture temporal discrepancies in videos. They used a simple model in which features are extracted using a CNN and fed to a recurrent neural network, similar to [114], for classification. Tested on FaceForensics++, they reported an AUC of 96%. However, high-quality videos were not considered in the analysis.
A summary of these is given in Table 8.
Selected feature-based solutions
Agarwal and Farid [98] created a deepfake detection technique targeted at finding deepfakes of a ‘person of interest’, such as a prominent political figure. Their approach works by extracting numerous facial actions, such as mouth stretch, jaw drop, and lip stretcher, using the OpenFace2 toolkit. 18 of these facial actions and 4 features related to head movement were considered. In total, a feature vector of dimension 190 is formed for a 10-second video clip. To measure the linearity between features, the authors used Pearson correlation, and the resulting correlations are used with an SVM for classification. The authors used their own dataset, in which original videos of the person of interest facing the camera were collected from YouTube, while deepfakes were made using Faceswap-GAN [59]. The best performance was an AUC of over 96%.
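The correlation step described above can be sketched in Python as follows. This is an illustrative reading rather than the implementation of [98]: each facial-action or head-movement measurement is assumed to be available as a per-frame time series over one clip, the feature vector is assumed to consist of the Pearson correlation of every pair of series (so n series give n(n-1)/2 values), and the function names are hypothetical. The SVM stage is omitted.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation; use 0
    return cov / math.sqrt(var_x * var_y)


def correlation_features(signals):
    """Pairwise correlations between all series extracted from one clip."""
    return [
        pearson(signals[i], signals[j])
        for i, j in combinations(range(len(signals)), 2)
    ]
```

The appeal of such a representation is that it describes how a specific person's facial movements co-vary, which a face-swapping model has no particular reason to preserve.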
In the domain of physiological signals, we see the work of Hernandez-Ortega et al. [121]. The authors used remote photoplethysmography to extract features related to a person’s heartbeat. They claim that deepfake methods are not effective in replicating the heartbeat of a person, even though they can generate facial features quite realistically. They use attention networks for classification. Similar work is presented in [122], [123] and [124].
Some works have used multi-modal solutions involving both audio and video features. Agarwal et al. [125] highlight the discrepancies between the spoken audio and the mouth shape generated for it. They collect a set of visemes and phonemes and, using a CNN, flag all videos with a high number of mismatches as fake. Mittal et al. [111] create a multi-stream network in which one stream works on the audio modality and the other on the video modality, each finding an emotion label for its input. If these emotion labels show a high number of mismatches, the video is classified as fake. The authors used a modified triplet loss function for training. Similar work is presented in [126], where the audio-visual dissonance is measured.
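The mismatch-based decision described above, whether between phonemes and visemes or between audio and video emotion labels, can be illustrated with a small Python sketch. It covers only the final comparison step and assumes that both modalities have already been converted to one label per segment. The threshold value and the function names are hypothetical and do not come from the cited works.

```python
def mismatch_rate(audio_labels, video_labels):
    """Fraction of aligned segments whose two modality labels disagree."""
    if len(audio_labels) != len(video_labels):
        raise ValueError("label sequences must be aligned and equally long")
    if not audio_labels:
        raise ValueError("at least one segment is required")
    mismatches = sum(1 for a, v in zip(audio_labels, video_labels) if a != v)
    return mismatches / len(audio_labels)


def flag_as_fake(audio_labels, video_labels, threshold=0.5):
    """Flag a video when the mismatch rate exceeds an assumed threshold."""
    return mismatch_rate(audio_labels, video_labels) > threshold
```

In practice the threshold would need to be tuned on labelled data, since genuine videos also contain some disagreement between modalities.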
Tursman et al. [127], Korshunov et al. [128], Demir and Ciftci [129], and Agarwal et al. [130] have extracted facial landmarks as definitive features for deepfake detection. Tursman et al. [127] use 3D facial landmarks and a memory model for classification, while Korshunov et al. [128] also make use of pose information with facial landmarks and classify using a deep network with LSTM. Demir and Ciftci [129] use gaze features along with the facial landmarks, with results comparable to the other approaches. Agarwal et al. [130] use head position, gaze, and facial landmarks, and their results appear to be better than those of the other works using facial landmarks.
Guo et al. [131] have used an interesting approach where they enhance manipulation traces that are left in a video and use those as a feature set. For classification of these, they made use of an adaptive convolution network.
Wu et al. [132] have used the residuals left in an image due to tampering and have created a network that is optimized to detect them. Instead of hand-crafting the noise-capturing filters, they used a learning-based approach and trained a CNN for this purpose. Features learned this way are later passed to an RNN for final classification. Similar work is presented in [133], which uses an attention-based convolutional neural network.
Li et al. [68] also focused on the use of physiological signals such as eye blinking. The authors observed that people in deepfake videos tend to blink far less often than people in real videos, mainly because the networks used for creating deepfakes are trained mostly on images of people with their eyes open. In [68] a video is broken into individual frames, and the face region is isolated in each frame. Given the face region, 6 landmarks are used to extract the eye region. These eye features are then processed using a long-term recurrent convolutional network (LRCN) [134] for dynamic state prediction. The LRCN is composed of a CNN for feature extraction and an LSTM for sequence learning, followed by a fully connected layer for predicting the probability of the eye-open or eye-closed state, as seen in Fig. 5. The LSTM helps to capture the strong temporal dependencies that exist in eye blinking. Tests were conducted on a dataset available with FaceForensics++, based on face-swapping techniques, with promising results. Cozzolino et al. [135] similarly use head tracking and pulse information for classification.

a) Video segmentation and extraction of eye landmarks; b) feature extraction from individual frames using CNN; c) passing the extracted features to LSTM for sequence analysis; d) fully-connected layers for state prediction; e) binary output representing open or closed eye [68].
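Once a per-frame eye state is available, the blinking cue discussed above reduces to counting closed-eye episodes and converting them to a rate. The Python sketch below shows one way to do this. It is not taken from [68]: the per-frame closed-eye probabilities are assumed to come from a model such as the LRCN, and the probability threshold, the minimum episode length, and the function names are hypothetical.

```python
def count_blinks(closed_probs, threshold=0.5, min_closed_frames=2):
    """Count runs of consecutive frames whose closed-eye probability is high."""
    blinks, run = 0, 0
    for p in closed_probs:
        if p > threshold:
            run += 1
        else:
            if run >= min_closed_frames:
                blinks += 1
            run = 0
    if run >= min_closed_frames:  # video ends while the eye is still closed
        blinks += 1
    return blinks


def blinks_per_minute(closed_probs, fps, threshold=0.5, min_closed_frames=2):
    """Blink rate of a clip, given its frame rate in frames per second."""
    if not closed_probs or fps <= 0:
        raise ValueError("need at least one frame and a positive frame rate")
    blinks = count_blinks(closed_probs, threshold, min_closed_frames)
    return blinks * 60.0 * fps / len(closed_probs)
```

A clip whose rate falls far below that of genuine footage of comparable length could then be treated as suspicious. The minimum episode length guards against single-frame prediction noise being counted as a blink.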
Masi et al. [136] use a two-stream network, where one stream passes the original image while the other performs a Laplacian of Gaussian operation on the same image to suppress the face-specific features and enhance the frequencies. They show promising performance on the FaceForensics dataset.
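For readers unfamiliar with the Laplacian of Gaussian (LoG) operation mentioned above, the following Python sketch builds a discrete LoG kernel from the standard formula LoG(x, y) = -(1 / (pi * sigma^4)) * (1 - r^2 / (2 * sigma^2)) * exp(-r^2 / (2 * sigma^2)), with r^2 = x^2 + y^2, and applies it to a grayscale image stored as a list of rows. It only illustrates the filter itself and is not the network of [136]. The kernel size, the value of sigma, the zero-mean correction, and the function names are assumptions.

```python
import math


def log_kernel(size=5, sigma=1.0):
    """Discrete Laplacian-of-Gaussian kernel, shifted to sum to zero."""
    if size % 2 == 0 or size < 3:
        raise ValueError("size must be an odd integer >= 3")
    half = size // 2
    two_sigma_sq = 2.0 * sigma * sigma
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r_sq = x * x + y * y
            value = -(1.0 - r_sq / two_sigma_sq) * math.exp(-r_sq / two_sigma_sq)
            row.append(value / (math.pi * sigma ** 4))
        kernel.append(row)
    # Truncation leaves a small offset; remove it so flat regions map to zero.
    mean = sum(map(sum, kernel)) / (size * size)
    return [[v - mean for v in row] for row in kernel]


def filter_valid(image, kernel):
    """Slide the kernel over the image, keeping only fully covered positions."""
    k_h, k_w = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - k_h + 1):
        row = []
        for j in range(len(image[0]) - k_w + 1):
            row.append(sum(
                kernel[a][b] * image[i + a][j + b]
                for a in range(k_h) for b in range(k_w)
            ))
        out.append(row)
    return out
```

Because the kernel sums to zero, smooth image content is suppressed and only rapid intensity changes survive, which is the property the second stream relies on.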
Amerini et al. [137] explored the temporal relationship between frames by extracting optical flow fields to exploit inter-frame discrepancies. Optical flow is calculated between two consecutive frames, in this case to find the relatedness of the target person with the surroundings. Optical flow vectors are of interest for detecting deepfakes since deepfakes exhibit unusual movement of the lips, eyes, and in particular the regions that are blended into the original image. The authors reported an accuracy of 81%. In [138] the authors used energy maps as an indicator of mismatches in the temporal domain. Trinh et al. [139] magnify these discrepancies and use a modified autoencoder network to identify fake videos.
A summary of these is given in Table 9.
There is a tremendous amount of information that we can extract simply by looking at an image. However, when it comes to video analysis, we cannot rely solely on the information extracted from individual frames, particularly in cases where we see the effect of some post-processing in the form of realignments, readjustments, or re-compression. Even though the work in the temporal domain is slightly limited compared to that in the spatial domain, researchers have come up with numerous solutions that do not rely solely on image inconsistencies. These are, however, far from being considered a final solution, both because of their limited datasets and because of too few experiments. It is interesting to note here that generative networks synthesize video one frame at a time and need to be further tuned to maintain consistency in the temporal dimension. This shortcoming of generative networks makes it possible to find features in the temporal domain that can be decisive about the authenticity of a video. A significant area under research is the use of physiological markers such as rPPG [122–124]. Eye blinking rates, gaze features, head positions, and head tracking are considered features that are invariant to the GAN generation technology and thus offer a better prospect of creating a more lasting solution. This idea of finding ‘human factors’ is extended to multiple modalities such as audio and video in [111, 126], producing results on par with other state-of-the-art techniques. In terms of temporal features, these handcrafted features are an important consideration for all future solutions.
Conclusion and future directions
Deepfakes are a threat that is far from over. With applications such as Zao [140], the ability to create forged videos is no longer limited to people with certain skills. It is now easy for ordinary people to create realistic deepfakes that can cause an array of problems, such as spreading fake news or false propaganda, creating discord among people, political and emotional manipulation, and destroying the credibility of people and businesses [141–143]. Intelligence services may use this strategy to influence decisions taken by important individuals, such as lawmakers, leading to national and international security threats [144].
There is considerable ongoing research to find a solution to this global challenge. It is evident that deepfake detection methods are continuously struggling to catch up with deepfake generation methods. The detection methods developed so far are in their rudimentary stages, since most of the work on detection has been done in controlled environments using fragmented datasets. Training and testing are done on datasets with the same level of image compression. A majority of the current detection techniques are designed to exploit the weaknesses of the generative networks used to create the deepfakes. Such insights are not always available in adversarial environments, resulting in a massive decline in the performance of detection techniques [145, 146]. The challenge is to create a solution that is robust to such scenarios. To improve the performance of detection methods, it is essential to create a growing and regularly updated benchmark dataset of deepfakes that will facilitate validation of the existing detection methods.
A more recent trend in research has been the use of physiological or biological features for deepfake detection. FakeCatcher [122] is among the preliminary works in this direction where the authors have investigated the use of biological signals to detect deepfakes. Similarly, in [111] the authors have explored affective cues present in the audio and video modality corresponding to different emotions. Any mismatch between such signals is used for the identification of deepfakes. The effectiveness of such techniques is yet to be proven, though the results do seem promising.
Given the incredible pace at which news is created and propagated nowadays, the issue of deepfakes takes on its worst form. It is pertinent to integrate deepfake detection methods into distribution channels such as social media to improve their effectiveness in tackling the momentous impact of deepfakes. Apart from effective detection strategies, work also needs to be done on creating preventive schemes. User devices that are capable of capturing video should be equipped with watermarking ability. This will help in creating immutable metadata for videos. The use of blockchain technology [147] as a preventive measure may also offer a solution, but its effectiveness is yet to be seen.
All these aspects, combined with the constant improvements in existing GANs and the development of new ones, will lead not only to the generation of more realistic content but also to more advanced techniques for deepfake detection.
