Abstract
Forged portraits of people are widely used to create deceitful propaganda about individuals or events on social media, and even to fabricate evidence in court proceedings. Hence, it is very important to verify the authenticity of images, and image forgery detection is now a significant research area. This work proposes an ensemble learning technique that combines the predictions of different Convolutional Neural Networks (CNNs) to detect forged portrait photographs. The proposed method uses seven pretrained CNN architectures: AlexNet, VGG-16, GoogLeNet, ResNet-18, ResNet-101, Inception-v3, and Inception-ResNet-v2. As an initial step, we fine-tune the seven pretrained networks for portrait forgery detection with illuminant maps of images as input, and then use a majority voting ensemble scheme to combine the predictions of the fine-tuned networks. Ensemble methods have been found to improve the generalization capability of classification models. Experimental analysis is conducted using two publicly available portrait splicing datasets (DSO-1 and DSI-1). The results show that the proposed method outperforms state-of-the-art methods based on traditional machine learning techniques as well as methods based on single CNN classification models.
Keywords
Introduction
Millions of portraits are being uploaded to the Internet through various social media platforms like Facebook, Twitter, Instagram, etc. Any person can purposefully and seamlessly manipulate the images with the help of image editing software or mobile applications that are freely available. The manipulated images can misrepresent the information content of the original images and their widespread propagation may result in unwanted social and political unrest. So it is very important to ensure the authenticity of the images uploaded.
Copy-move [12] and splicing [23] are the two general categories of image forgery operations. A copy-move forged image is generated by copying a portion of an image and pasting it elsewhere on the same image. A spliced image is made by copying portions of one or more images and pasting them on another image. In this work, we focus on portrait splicing detection. Portrait splicing forgery is done by copying images of one or more persons from one or more photos and pasting them on other photos that already contain images of one or more persons. This type of manipulation cannot be easily detected by the naked eye. Figure 1 shows an example of image manipulation by splicing, in which the left image is an authentic portrait. This image shows the former US President Barack Obama presenting the prestigious Presidential Medal of Freedom to the former US Vice President Joe Biden in January 2017. The image on the right is a forged portrait showing Obama with Harvey Weinstein [11]. Weinstein is a Hollywood film producer who was convicted of sex offenses [14]. Here, the forged portrait was created by replacing Biden’s head with Weinstein’s. Manipulated images of this type are widely used for spreading fake news to defame celebrities or for personal gain. These incidents stress the need for efficient portrait forgery detection techniques.

(a) Authentic portrait and (b) Spliced portrait [11].
Image splicing detection is a two-class classification problem in which an image is predicted as forged or authentic. Image preprocessing, handcrafted feature extraction, and training of an appropriate classifier are the major steps in traditional machine learning based forgery detection methods. However, these handcrafted methods require a lot of effort to select and extract suitable features for better classification performance.
Recently, image classification using deep learning techniques with Convolutional Neural Networks (CNNs) [21] has shown remarkable results. CNNs are trained using a huge number of images, and each layer learns powerful features directly from the images [2]. In the case of portrait splicing detection, training a CNN from scratch is difficult due to the scarcity of labeled image samples. In such situations, one can use a transfer learning approach, in which pretrained CNNs are fine-tuned for the specific application [3, 40]. In this work, seven different pretrained CNNs are fine-tuned using illuminant maps of images to classify images as spliced or original. A majority voting ensemble scheme is used to combine the predictions of the fine-tuned networks. The prediction of an ensemble is generally more accurate than that of any single classifier [9]. Neural networks with different architectures learn differently from the same dataset; hence, the ensemble learning technique tends to cancel out the overfitting of the individual trained networks and helps to enhance the generalization performance of classification [13]. AlexNet [21], VGG-16 [32], GoogLeNet [34], ResNet-18, ResNet-101 [15], Inception-v3 [35], and Inception-ResNet-v2 [33] are the seven pretrained CNNs used in the proposed method.
DSO-1 and DSI-1 [5] are two benchmark portrait forgery datasets used in this work for training and experimental analysis. The results show that the proposed method outperforms the state-of-the-art methods using traditional machine learning techniques as well as methods using single CNN classification models.
The rest of this paper is organized as follows. Section 2 presents the related works. The proposed portrait splicing detection method is described in Section 3. The experimental setup and the analysis of results are explained in Section 4 and Section 5, respectively. Conclusions are drawn in Section 6.
Several approaches have been suggested in recent years for image forgery detection. This section discusses detection techniques based on handcrafted feature engineering as well as on deep learning.
An image forgery detection technique combining four texture descriptors, namely Local Binary Pattern (LBP), Local Phase Quantization (LPQ), Binarized Statistical Image Features (BSIF), and Binary Gabor Pattern (BGP), is proposed by Vidyadharan and Thampi [38]. As a preliminary step, they applied the Steerable Pyramid Transform (SPT) to decompose the images into various sub-bands. The texture features are extracted from the image sub-bands, and these features are concatenated to form a multi-texture descriptor. The ReliefF technique is used for feature selection, and the selected features are then used to train a Random Forest classifier.
Isaac and Wilscy [17] proposed a method that uses the Gabor Wavelet Transform (GWT) and LPQ for image forgery detection. As an initial step, the images are converted into the YCbCr color space, and GWT is applied to the chrominance components at various scales and orientations. LPQ features are generated from the different Gabor image sub-bands, and the dimensionality of the features is reduced using the Non-negative Matrix Factorization (NMF) technique. Then, a Support Vector Machine (SVM) classifier is trained using these features to classify images as forged or original.
Vega et al. [37] proposed an image forgery detection method that utilizes LBP, the Discrete Cosine Transform (DCT), and the Discrete Wavelet Transform (DWT). At first, the images are converted into the YCbCr color space, and DWT is applied to the chrominance channels (Cb and Cr) of the images. In the next step, the Low-Low (LL) approximation coefficients, which contain the low-frequency information, are considered, and a Quadrature Mirror Filter is applied to each LL sub-band. Each filtered LL sub-band is divided into overlapping blocks, and LBP is applied to each block. Then, DCT is applied to each LBP histogram, and the standard deviations of the DCT coefficients are used to train an SVM classifier.
Kanwal et al. [19] proposed a block-based procedure for image splicing detection. In this approach, the image is transformed into the YCbCr color space. Otsu based Enhanced Local Ternary Pattern (OELTP) is used to extract features from overlapping blocks of the chrominance components. The extracted features are used to train an SVM classifier.
Rota et al. [29] designed and trained a CNN from scratch for classifying images as forged or original. They trained the CNN using image patch samples, and patch-based features are extracted from the trained network. The features are combined and then used to train an SVM classifier. Even though the trained model achieved good detection accuracy on the particular dataset used, its generalization capability is questionable since the model was not tested on other datasets.
Zhou et al. [42] designed and trained a rich model Convolutional Neural Network (rCNN) for image forgery detection. At first, the images are divided into blocks and labeled as forged or authentic. These image blocks are used for training the rCNN, and the trained network functions as a block descriptor. The features of each block are joined to obtain a feature map of the whole image, and then a 2D max pooling operation is applied to the feature map. Lastly, the feature map is converted into a vector to train an SVM classifier.
Patch-based methods increase the cost in terms of computation and time. Even though these trained models achieved good detection accuracy on their particular datasets, their generalization capability when tested on a different dataset is questionable.
Xiao et al. [39] proposed a Coarse-to-Refined convolutional neural Network (C2RNet), which cascades a Coarse CNN (C-CNN) and a Refined CNN (R-CNN) to extract the image properties that discriminate the authentic and forged regions.
El-Latif et al. [1] proposed a deep learning approach combined with the wavelet transform for image splicing detection. They designed and trained a CNN with RGB images as input, and deep image features are extracted from the trained CNN. The deep features are then combined with features generated by applying DWT to the images, and the combined features are used to train an SVM classifier.
Revi and Wilscy [26] proposed an image forgery detection method by utilizing a pretrained CNN as a feature extractor. The Rotation Invariant - Local Binary Pattern (RI-LBP) map of the chrominance image is given as input to the CNN. The dimensionality of the extracted features is reduced by NMF technique, and these features are used to train a quadratic SVM.
Methods [17, 38] exploited various handcrafted feature engineering techniques for forgery detection, and methods [1, 42] used deep learning approaches for detecting forgeries. These methods are evaluated on general image splicing datasets such as CUISDE [16], CASIA v1.0, and CASIA v2.0 [10]. They are designed for detecting general image forgeries and not for detecting portrait splicing. Only a few research works address portrait splicing detection; hence, the remainder of this section reviews the works relevant to portrait splicing detection.
Carvalho et al. [5] proposed a traditional machine learning technique for detecting portrait forgeries. They extracted texture and edge features from the illuminant maps of the images. The extracted features are used to train an SVM classifier.
Carvalho et al. [4] proposed an enhanced method for portrait splicing detection, in which they extracted color, shape, and texture features from illuminant maps. The extracted features are fed to an SVM classifier.
Cristin et al. [6] proposed a technique for detecting portrait forgeries. They extracted features from illuminant maps of the images by using Gabor filter, wavelet transform, and LBP. The features are concatenated to form a feature vector and given to a fruitfly optimized support vector neural network classifier.
These methods [4–6] employed handcrafted feature engineering techniques and are evaluated on two benchmark portrait splicing forgery datasets, DSO-1 and DSI-1. Selecting handcrafted features is a time-consuming and complex process. In this work, we exploit the power of pretrained CNNs through a transfer learning approach for portrait splicing detection, adopting the fine-tuning method of transfer learning. Seven different pretrained CNNs are fine-tuned separately on a portrait splicing dataset, and then a majority voting ensemble technique is used to increase classification performance. The transfer learning approach saves the time and effort required by handcrafted feature engineering techniques. The details of the proposed method are presented in the next section.
Proposed method
An outline of the proposed portrait splicing forgery detection method is given in Fig. 2. The illuminant maps of the portraits are given as input to the CNNs, since the spliced region will have a different illuminant color compared to the rest of the portrait. The predictions of the fine-tuned CNNs are combined using the majority voting ensemble technique to obtain the final prediction, spliced or original.

An outline of proposed portrait splicing detection method.
The detailed explanation of the proposed method is provided in the following subsections.
The Inverse Intensity Chromaticity (IIC) illuminant estimation method is a physics-based technique built on the dichromatic reflection model [36]. In this work, the modified IIC method proposed by Riess and Angelopoulou is used [27]. In the modified method, the illuminant map of an image is generated by subdividing the image into regions of similar color, and each region is further divided into small patches. The illuminant color is then estimated from each patch, and the dominant (most voted) illuminant color is selected as the illuminant color of the region. When an image is captured with a camera, the illumination present in the environment is recorded in the pixels, and it is expected to be consistent throughout the image. As a result of the portrait splicing operation, the scene illumination in forged regions will differ from that in the rest of the image. Therefore, to improve detection accuracy, the proposed work uses IIC illuminant maps instead of RGB images as input to the pretrained CNNs.
Pretrained convolutional neural networks (CNNs)
A CNN is a deep neural network specifically designed for image classification. A basic CNN architecture contains convolutional layers, activation layers, pooling layers, fully connected layers and a softmax layer. Rectified Linear Units (ReLUs) are used as the activation function, and the size of the feature map (the output of a convolutional layer) is reduced by the pooling operation. The network is trained using the backpropagation algorithm [21]. One of the important breakthroughs in deep learning is the outstanding performance of CNNs in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). For this challenge, the networks are trained using millions of images from the ImageNet dataset for a 1000-class classification problem [21]. The pretrained CNNs have learned rich feature representations: the initial layers of the network learn basic features, and the deeper layers learn richer, more discriminative features [24]. CNNs must be trained using millions of labeled images to achieve accurate and reliable classification performance, but in most real-world scenarios the number of labeled images available is much smaller. Training a CNN from scratch in such situations is challenging, and fine-tuning pretrained CNNs with a transfer learning approach helps to overcome this issue [8, 30]. In the proposed work, seven CNNs with different architectures, pretrained on the ImageNet dataset, are considered. They are AlexNet, VGG-16, GoogLeNet, ResNet-18, ResNet-101, Inception-v3 and Inception-ResNet-v2. The pretrained networks are briefly described in the following subsections.
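The voting step described above can be illustrated with a minimal sketch. It assumes that each patch yields a two-component chromaticity estimate and that similar estimates are grouped by simple quantization; the function name, the bin width and the sample values are hypothetical, and this is not necessarily how the implementation of [27] performs the vote.

```python
from collections import defaultdict

def dominant_illuminant(patch_estimates, bin_width=0.05):
    """Pick the most voted illuminant chromaticity among per-patch estimates.

    patch_estimates: list of (r, g) chromaticity pairs, one per patch.
    bin_width: hypothetical quantization step used to group similar estimates.
    """
    bins = defaultdict(list)
    for r, g in patch_estimates:
        # Estimates that fall into the same quantization cell vote together.
        key = (int(r // bin_width), int(g // bin_width))
        bins[key].append((r, g))
    # The cell holding the most estimates wins; ties go to the first cell seen.
    winners = max(bins.values(), key=len)
    n = len(winners)
    # The region colour is taken as the mean of the winning estimates.
    return (sum(r for r, _ in winners) / n, sum(g for _, g in winners) / n)

# Hypothetical patch estimates for one region: three agree, one is an outlier.
region_color = dominant_illuminant(
    [(0.41, 0.33), (0.42, 0.34), (0.43, 0.33), (0.71, 0.21)]
)
```

In this toy region the outlying patch is outvoted, so the region colour is the mean of the three agreeing estimates.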
AlexNet
AlexNet is a series network with five convolutional layers, pooling layers, and three fully connected layers, with ReLU as the activation function. This network won ILSVRC in 2012 and achieved a top-5 test accuracy of 84.6% [21].
VGG-16
VGG-16 is a series CNN architecture designed by the Visual Geometry Group (VGG) at the University of Oxford. It contains thirteen convolutional layers with a fixed filter size (3×3) and three fully connected layers, and it performed well in ILSVRC 2014 [32].
GoogLeNet
GoogLeNet is a 22-layer Directed Acyclic Graph (DAG) network designed by Google, which introduced inception modules. Each inception module performs four operations in parallel, as shown in Fig. 3. This CNN model has far fewer trainable parameters than the other pretrained networks considered here [34].

Inception module [34].
The Residual Network (ResNet) architecture contains residual blocks, as shown in Fig. 4. A residual block introduces a skip connection that passes the input of the block to its output without modification, where it is added to the output of the intervening layers. This network won first place in ILSVRC 2015. ResNet-18 and ResNet-101 are 18 and 101 layers deep, respectively [15].

Residual block [15].
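The skip connection can be sketched in a few lines. The example below is a toy illustration on plain lists of numbers, not an actual ResNet layer; `residual_block`, `zero_transform` and `halve` are hypothetical names, and the stand-in transforms replace the convolutional layers of a real block.

```python
def residual_block(x, transform):
    """Return transform(x) + x element-wise; the '+ x' term is the skip connection."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the input size for an identity skip")
    return [f + xi for f, xi in zip(fx, x)]

def zero_transform(v):
    # Stand-in for layers that have learned nothing: the block passes x through.
    return [0.0 for _ in v]

def halve(v):
    # Stand-in for learned layers: the block output is the input plus a correction.
    return [0.5 * a for a in v]

identity_out = residual_block([1.0, -2.0, 3.0], zero_transform)
corrected_out = residual_block([2.0, 4.0], halve)
```

When the transform contributes nothing, the block reduces to the identity mapping, which is the property that makes very deep networks such as ResNet-101 easier to train.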
The Inception-v3 model was designed by modifying the inception modules of the GoogLeNet architecture, as shown in Fig. 5 [35]. GoogLeNet itself won first place in the ILSVRC 2014 classification task [34].

Updated inception module [35].
Inception-ResNet-v2 is a modified version of the Inception-v3 model that introduces residual blocks, as shown in Fig. 6. Among the pretrained networks discussed here, it achieved the highest test accuracy on the ILSVRC dataset [33].

Inception - residual block [33].
Details of the pretrained CNNs used in the proposed work are summarized in Table 1.
Details of the pretrained CNNs used in the proposed work
All seven pretrained networks were trained on the ImageNet dataset for a 1000-class classification problem. In the proposed method, these pretrained CNNs are fine-tuned for portrait splicing detection, which is a two-class classification problem. The fine-tuning process consists of the following steps: (1) the network architecture is reorganized by removing the existing Fully Connected (FC) layer and replacing it with a new FC layer for the two-class classification problem, with cross-entropy as the cost function; (2) the initial layers are frozen, and the new FC layers are trained on the DSO-1 dataset to learn features specific to portrait splicing.
Random weight initialization is applied to the new FC layers, and the same set of training parameters is used for fine-tuning all seven pretrained networks (more details on training options are provided in Section 4.3). Lastly, we use a majority voting ensemble technique to combine the predictions of all the fine-tuned networks.
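To make the cost function of the new two-class FC layer concrete, the following sketch computes the softmax probabilities and the cross-entropy loss for a single image. It is a minimal illustration in plain Python rather than the MATLAB implementation used in this work; the class ordering (index 0 for spliced, index 1 for original) and the example logits are assumptions.

```python
import math

def softmax(logits):
    """Convert the raw outputs of the FC layer into class probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Cross-entropy loss for one sample whose true class is target_index."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical FC outputs for one image; index 0 = spliced, index 1 = original.
logits = [2.0, -1.0]
loss_if_spliced = cross_entropy(logits, 0)   # small: the network favours "spliced"
loss_if_original = cross_entropy(logits, 1)  # large: the prediction would be wrong
```

With untrained, equal logits the loss equals ln 2, and it decreases as the new FC layer learns to assign a higher score to the correct class.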
Majority voting ensemble
Combining the predictions of several classifiers generally gives better performance than the prediction of a single classifier [9]. Even though the same dataset is used for training, different pretrained network architectures learn features in different ways. As a result, they overfit differently to the same dataset, and the ensemble technique has been found to reduce the effect of overfitting [13]. Recently, deep learning based ensemble techniques have shown outstanding performance in image recognition problems [18, 41]. In this work, the class labels predicted by the CNN classifiers are combined by the majority voting (plurality voting) ensemble technique. Each classifier gives a prediction label for a test image, and the final prediction label is the one that receives the majority of votes from all the classifiers [22, 31]. To avoid a tie in voting, an odd number of CNN classifiers is usually considered; here we use seven classifiers.
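The voting rule can be sketched as follows. This is a minimal illustration in Python, not the MATLAB code used in this work; the function names and the example labels are hypothetical.

```python
from collections import Counter

def majority_vote(labels):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(labels).most_common(1)[0][0]

def ensemble_predict(predictions_per_model):
    """Combine per-model label lists (one list per model, one label per image)."""
    # zip(*...) regroups the labels so that each tuple holds all votes for one image.
    return [majority_vote(votes) for votes in zip(*predictions_per_model)]

# Hypothetical labels from seven classifiers for three test images.
predictions_per_model = [
    ["spliced", "original", "spliced"],
    ["spliced", "original", "original"],
    ["original", "original", "spliced"],
    ["spliced", "spliced", "spliced"],
    ["spliced", "original", "original"],
    ["original", "original", "spliced"],
    ["spliced", "spliced", "original"],
]
final_labels = ensemble_predict(predictions_per_model)
```

With seven voters and two classes, the closest possible outcome is four votes to three, as for the third image above, so a tie cannot occur.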
Experimental setup
The proposed method is implemented on a GPU-based system (NVIDIA GTX 1060, 6 GB) using the MATLAB R2020a platform. The following subsections discuss the datasets, performance evaluation metrics and training options.
Datasets used
DSO-1 and DSI-1 [5] are the datasets used for the experimental analysis of the proposed method. These two datasets are publicly available benchmark portrait forgery datasets that contain high-quality, genuine-looking spliced portrait images in PNG format. The DSO-1 dataset contains a total of 200 images with a resolution of 2048×1536, and it is a balanced dataset with 100 images in each class. DSI-1 contains a total of 50 images of different sizes, with 25 images in each of the spliced and original classes. In the DSI-1 dataset, the authentic pictures were collected from the Flickr website and the spliced images were downloaded from various websites such as Worth1000 and Instagram. The IIC illuminant maps of the images are created using the method provided by Riess and Angelopoulou [27].
Performance evaluation metrics
In the proposed method, a spliced image is considered positive and an original image is considered negative. Accuracy, Precision, Recall and F-Measure are the performance evaluation metrics used for assessing the proposed method. They are calculated based on True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). TP: number of spliced images that are correctly predicted as spliced. TN: number of original images that are correctly predicted as original. FP: number of original images that are incorrectly predicted as spliced. FN: number of spliced images that are incorrectly predicted as original.
Accuracy is the measure of the overall performance of a classifier and is calculated as in Equation (1). The Precision and Recall of the classification model are found using Equations (2) and (3), respectively. The harmonic mean of Precision and Recall is called the F-Measure and is calculated as given in Equation (4).
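In terms of the counts defined above, these four metrics take their standard forms. They are written here as a sketch that follows the equation numbering referred to in the text; the exact typeset form used in the original is an assumption.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1}\\
\text{Precision} &= \frac{TP}{TP + FP} \tag{2}\\
\text{Recall}    &= \frac{TP}{TP + FN} \tag{3}\\
\text{F-Measure} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}
\end{align}
```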
To attain good performance from a CNN, the hyperparameters must be properly selected. The learning rate (α) is set to a small value of 0.0003 to avoid large changes in the weight values and to preserve the useful features already learned by the pretrained networks [3]. The number of epochs and the mini-batch size are set to 5 and 8, respectively. Since only a few data samples are available in the portrait splicing datasets used, 60% of the data is used for training and 40% for validation [7]. The data is shuffled before every epoch. The Adam [20] optimization algorithm is used for minimizing the cost function. The shortage of data samples may lead to model overfitting; hence, to improve the generalization ability of the network, we employ two methods: 1) data augmentation and 2) L2 weight regularization [21]. Data augmentation is performed by applying a random vertical reflection to the training images. L2 weight regularization keeps the weight values small and thereby helps to improve the generalization capability of the network. The L2 weight regularization factor is set to 0.0001. We train the FC layers of all seven pretrained CNNs using the same training options. Then, to obtain the final prediction, a majority voting ensemble scheme is used to combine the predictions of all the fine-tuned CNN models.
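The effect of the small learning rate can be seen from a single Adam update on one weight, sketched below. The learning rate 0.0003 and the L2 factor 0.0001 are the values stated above; the decay rates β1 = 0.9 and β2 = 0.999 and ε = 1e-8 are the commonly used Adam defaults and are assumptions here, as are the example weight and gradient. The L2 term is modelled by adding the factor times the weight to the gradient, which is also an assumption about how the regularizer enters the update.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.0003, l2=0.0001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of a single weight with an L2 penalty folded into the gradient."""
    g = grad + l2 * w                        # gradient of loss plus L2 penalty
    m = beta1 * m + (1.0 - beta1) * g        # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * g * g    # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)           # bias correction for step t
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Hypothetical weight and gradient; first update step (t = 1).
w_small, _, _ = adam_step(1.0, 0.5, 0.0, 0.0, 1)             # lr = 0.0003
w_large, _, _ = adam_step(1.0, 0.5, 0.0, 0.0, 1, lr=0.003)   # ten times larger
```

On the first step the bias-corrected ratio is close to one, so the weight moves by roughly the learning rate itself: about 0.0003 with the setting used here and about 0.003 with a rate ten times larger.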
Experimental results and analysis
We conducted the following experiments to evaluate the performance of the proposed method: (1) comparison of the performance of the majority voting ensemble method with that of the individual fine-tuned CNN models; (2) a study of the effect of the learning rate; (3) evaluation of the generalization capability of the proposed ensemble method and the individual fine-tuned CNN models using the DSI-1 dataset; and (4) comparison of the proposed method with state-of-the-art methods.
Performance comparison of majority voting ensemble method with that of the individual fine-tuned CNN models
The illuminant maps of the DSO-1 dataset are used for training and validation. The illuminant maps are given as input for fine-tuning the seven pretrained CNNs with the training options discussed in Section 4.3. After the networks are fine-tuned separately, their predictions are combined using the majority voting technique. Table 2 compares the performance of the individual pretrained CNN models with that of the majority voting ensemble method in terms of Accuracy, Recall, Precision and F-Measure. The proposed majority voting ensemble method attains the highest performance, with an Accuracy of 98.75%, a Recall of 1.0, a Precision of 0.976 and an F-Measure of 0.988, compared to all seven fine-tuned CNN models. This indicates the advantage of the ensemble method over the individual CNN models. It can also be noted from the table that, among the seven fine-tuned CNN models, ResNet-18, ResNet-101 and Inception-ResNet-v2 achieved the highest validation accuracy of 97.5%, which may be attributed to the presence of residual blocks in these network architectures.
Performance comparison of the individual pretrained CNNs with majority voting ensemble
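The ensemble figures reported above can be checked against the metric definitions with a short calculation. The confusion counts below are not reported in the text; they are inferred under the assumption that the 40% validation split of the 200 DSO-1 images contains 40 spliced and 40 original images, in which case a Recall of 1.0 and an Accuracy of 98.75% correspond to a single false positive.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, Precision, Recall and F-Measure from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# 40% of the 200 DSO-1 images are held out for validation.
n_validation = 200 * 40 // 100

# Inferred counts (assumption): balanced split, one original image flagged as spliced.
tp, fn = 40, 0
fp, tn = 1, 39

accuracy, precision, recall, f_measure = classification_metrics(tp, tn, fp, fn)
```

Under these assumed counts the calculation gives an accuracy of 79/80 = 98.75%, a precision of 40/41 ≈ 0.976, a recall of 1.0 and an F-Measure of about 0.988, matching the reported values.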
Experiments are conducted to investigate the effect of the learning rate (α) on accuracy and training speed. In this study, two different learning rates, α1 = 0.0003 (the original learning rate used in the proposed method) and α2 = 0.003 (the learning rate increased by a factor of 10), are considered while the other training parameters are kept as explained in Section 4.3. The results are tabulated in Table 3. The training time depends on the network parameters. Once the network is trained, it takes only a fraction of a second to test a single image. The results show that a larger learning rate gives a shorter training time. For example, for the ResNet-101 architecture, the training time is 67 seconds when α1 = 0.0003 and 63 seconds when α2 = 0.003. Our experiments also show that a larger learning rate may speed up training, but it adversely affects accuracy. For example, for the ResNet-101 architecture, the accuracy is 97.5% when α1 = 0.0003 and 93.75% when α2 = 0.003.
Effect of learning rate on accuracy and training time
To evaluate the generalization ability of the proposed method, the ensemble method and the individual fine-tuned models are tested using the DSI-1 dataset. The results are given in Table 4, and it can be seen that the proposed majority voting ensemble method achieves the highest test accuracy, 86%, compared to the individual fine-tuned pretrained models. This can be attributed to the ensemble reducing the effect of overfitting in the individual models.
Generalization capability evaluation using DSI-1 dataset
The detection accuracies of three state-of-the-art methods [4–6] published in the literature are used for comparison with the proposed method. All these state-of-the-art techniques are based on traditional machine learning approaches with images from DSO-1 as training and validation data, and a brief discussion of these techniques is given in Section 2. The detection accuracies are taken directly from the respective publications. Table 5 gives the comparison results, and it shows that the proposed ensemble method outperforms these state-of-the-art methods with the highest accuracy of 98.75%. A comparison between Table 2 and Table 5 also shows that the deep learning models with residual blocks (ResNet-18, ResNet-101 and Inception-ResNet-v2) achieve better accuracy than the traditional machine learning models in portrait splicing detection.
Comparison of the proposed method with state-of-the-art methods
In this work, seven different pretrained CNNs are fine-tuned on a portrait splicing dataset with illuminant maps of images as input, and then a majority voting ensemble scheme is used to combine the predictions of the fine-tuned networks for classifying images as spliced or original. The proposed majority voting ensemble method classifies images with 98.75% validation accuracy on the DSO-1 dataset. The DSI-1 dataset is used as the test dataset, and the proposed ensemble method achieves a test accuracy of 86% on it, which shows the generalization capability of the proposed classification method. The results show that the proposed method outperforms the state-of-the-art methods based on traditional machine learning techniques as well as the individual fine-tuned pretrained CNN models.
