Scrutinize artificial intelligence algorithms for Pakistani and Indian parody tweets detection

Abstract

False information is increasingly used to spread disinformation, distorting people’s awareness and decision-making by altering their views or knowledge. The proliferation of social media and online forums has aided the propagation of disinformation, allowing it to blend in readily with true information. Parody news and rumors are among the most common types of misleading and unverified information, and they should be caught as early as possible to limit their harmful consequences. As a result, interest in effective detection approaches has surged in recent years. For this study, a customized dataset of real and parody tweets from Pakistan and India was built. The study proposes a two-step strategy for detecting parody tweets. In the first step, the unstructured data is converted into a structured dataset. In the second step, multiple supervised artificial intelligence algorithms are applied. The classification models were assessed experimentally on the customized dataset and compared using standard evaluation metrics. The best model achieved an accuracy of 92%.

Keywords

Social media, parody tweets, binary classification, machine learning, deep learning, word embedding

1 Introduction

Owing to quick access to the latest news, trends, and major events around the world, the use of social media sites such as Twitter and Facebook has exploded in recent years [1]. As the news medium changed, newspapers, tabloids, and journals gave way to online news platforms, blogs, social media feeds, and other digital media formats [2]. Consumers now have information about any topic from any corner of the world at their fingertips. Users of social media platforms can share their feelings and opinions and discuss any issue they wish, such as democracy, education, healthcare, and finance. However, not all the news on social media platforms is true and genuine. The majority of fake and parody news is spread during emergency situations [3, 4].

Fig. 1

Examples of tweets from parody accounts.

People, companies, and media outlets utilise Twitter more than any other social media network to inform their subscribers about current events and news. Twitter has seen a huge increase in subscribers in recent years, owing to the fact that most politicians, athletes, media outlets, and businesses have official Twitter accounts [5]. Subscribers can obtain the most recent news in near real time and easily communicate their negative and positive reactions to that news.

Almost all prominent and influential people in Pakistan and India have official Twitter accounts that they use to keep their followers up to date on the latest news. However, many parody and fake accounts have been created in the names of notable people and companies. These accounts are used to criticize a person or company, to paint a negative picture of them, and to promote unfavorable and biased views of a specific entity; one of the motives behind them is financial reward [6, 7]. Many parody accounts with large followings operate on Twitter in Pakistan and India. Tweets from these accounts are used to criticize political opponents and other entities in a humorous manner. Because of the history of animosity between Pakistan and India, these accounts are also used to tweet against the other country in order to build a particular narrative for their audience. Table 1 lists some real and parody Twitter usernames. For example, Maryam Nawaz is the vice president of Pakistan’s PML-N. She routinely tweets about party ideology, condemns government decisions, and criticizes her political opponents. Her genuine verified Twitter username is @MaryamNSharif, whereas @MaryamNShref is a popular parody account associated with her. Because the Twitter handles of parody and real accounts are so similar, many people, including politicians and media outlets, retweet parody messages from their real accounts.

Parody tweets are frequently misinterpreted as genuine, even though Twitter allows parody accounts only if they are clearly labelled as such and the image is not intended to mislead. Figure 1 shows examples of tweets from parody accounts that appear to be real because their writing style is very similar to that of real accounts. Such tweets have frequently duped a large number of users and harmed the narrative of a real person or organization.

Table 1

Example of real accounts and parody accounts

Real twitter handle	Parody twitter handle
@NawazSharifMNS	@NawazShrifMNS
@MaryamNSharif	@MaryamNShref
@ARYNEWSOFFICIAL	@ARYParody
@timesofindia	@LimesOfIndia
@RahulGandhi	@RoflGandhi_
@anjanaomkashyap	@AnjanaOmModi5
@SrBachchan	@Sirbachpan

1.1 Our contribution

Recently, many researchers have worked on detecting fake news using various machine learning and deep learning techniques. The majority of the datasets used to train the models are publicly available. However, few researchers have addressed the detection of parody tweets for specific countries or regions.

Our contribution in this domain is the creation of a previously unavailable region-based tweet dataset. We trained various machine learning and deep learning models on this dataset and evaluated their performance in several scenarios, such as training on the Pakistani dataset and testing on the Indian dataset. This work focuses on advanced NLP techniques known as Transformers and compares their performance with that of traditional machine learning models such as logistic regression. The proposed solution was assessed using various performance metrics.

2 Review of literature

Parody dates back to ancient Greece, where, according to Aristotle, famous poems were turned from the sublime into the laughable by slightly changing their wording. Parody is studied as a distinct subject in linguistics [8]. Typically, verbal parody entails a highly situated, purposeful, and conventional speech act [8] that includes both a harsh judgment and a sort of pretense or echoic remark [9], in which an entity is copied or imitated with the goal of critiquing it in a comic manner. As a result, parody has an intrinsic quality of imitative creation for amusement purposes [10]. The parodist purposefully re-presents the object of the parody and proudly displays it [8]. Parody takes different forms and serves different purposes [11]. The study in [11] explains the different forms of parody and the aims behind spreading it. Some forms of parody amount to lies and fake news. Parody can be used to make fun of someone or something in a humorous fashion, but it can also be used to distribute fake news and cause social unrest. People must be educated to understand why parody is employed and how it affects society in both positive and negative ways.

Because social media is so easy to use, it has grown in popularity over the past decade. This ease of access has also led to misuse of social media: it has become quite simple to disseminate fake information and create parodic content about entities. Parody is now widely regarded as a vital and integral aspect of social media, particularly on Twitter [12]. Users can create parody accounts on Twitter, subject to some limitations. The parody issue has recently piqued the interest of many scholars who want to learn more about this area of social media. Previous studies on parody in social media focused on evaluating how these accounts contribute to topical discussions [13] and on the relationship between identity, deception, and legitimacy [12].

According to public relations studies, parody accounts have an impact on organizations during crises [14] and can represent a threat to their credibility [7]. The author of [7] studied how parody accounts come into play in crisis situations and what impact they have on people and on society overall. That research also examines how these accounts behave, how they reinforce negative impressions and sabotage an organization’s efforts and initiatives, and offers suggestions to organizations on how to tackle parody accounts effectively.

Many researchers have studied how to find parody and fake accounts on Twitter. One such study [15] used an activity-based approach to identify fake accounts. The researchers examined 62 million publicly available Twitter user profiles and devised a system for detecting automatically generated fake profiles. A highly reliable subset of fake user accounts was found using a pattern-matching algorithm on screen names combined with an analysis of tweet update times. The behaviour of the fake users was then exposed by comparing the profile creation times and URLs of these accounts with a ground-truth dataset. There are also several other ways to find parody and fake profiles automatically on Twitter. For example, [16] identifies fake profiles using public information such as the number of accounts followed and the number of followers. In [17], the authors explored the use of text mining to build an algorithm that automatically determines a user’s identity on Twitter. It validated the owners of social media profiles in order to reduce the impact of fake accounts on public perception. The method was based on the write-print, a biometric for writing style. Imitation detection [18] seeks to distinguish between an original text and a text written by someone attempting to emulate the original author’s style in order to impersonate them.

In the context of recognizing disinformation, satire has been briefly examined as one of multiple prediction goals in NLP [19]. The authors proposed a method to detect satire in articles using text mining and feature extraction. Their dataset came from different news web portals, and a 71% F1-score was achieved by applying different classifiers to real news articles to detect satire. To identify linguistic elements of parody, [20] compares the language of true news with that of sarcasm, forgeries, and propaganda, and shows how stylistic features might aid in determining a text’s truthfulness. Satire detection remains a challenging task because satire exhibits a wide range of textual features and often uses serious-sounding language, which makes it difficult to detect. Researchers have also studied satire detection in other languages. One study on satire detection in Spanish [21] proposed a method for detecting satire in Twitter text. The dataset was categorised into two classes (satire and non-satire). The purpose of that research was to focus on the style of the tweets rather than on the content of the post. Further studies [22, 23] addressed satire detection in news articles, tweets, and customer reviews. As satire and parody are categorized as a type of disinformation with "no intent to inflict damage but has the ability to fool," the study of parody is pertinent to this topic [24].

The authors of [25] surveyed research on detecting sarcasm in text automatically, and the sentiment analysis community reacted positively to this study. Their paper compiles previous work in the field of automatic sarcasm detection. The authors mention three research achievements: semi-supervised trend extraction to identify underlying sentiment, hashtag-based monitoring, and the inclusion of context beyond the target text. They also discuss datasets, techniques, trends, and open problems in sarcasm detection. The study in [26] proposed two approaches for detecting sarcasm in tweets. The first was a parsing-based lexicon generation algorithm (PBLGA), and the second detected sarcasm from the frequency of interjection words in the tweet text. The first approach achieved 89% precision and the second 85% precision. In [27], the goal was to tackle the difficult task of detecting sarcasm on Twitter by using behavioural factors unique to people who express sarcasm.

Research has also been undertaken on the detection of misleading and false content on social media. The authors of [28] conducted a study on identifying irony and satire in news articles using data mining and ensemble feature selection. A series of classification models was applied to the news dataset, including logistic regression (LR), support vector machine (SVM), logistic model tree (LMT), and C4.5. The experiment achieved a precision of 95.8%.

Another study [21] aimed to identify whether a tweet is satirical or non-satirical. The researchers collected satirical and non-satirical tweets in Spanish and applied different machine learning approaches to recognize the satirical ones. They modelled each tweet’s text using a collection of linguistically motivated features aimed at capturing the text’s style rather than its content.

Automatic irony detection in tweets was addressed in [29], where the researchers proposed a novel model that investigates the use of subjective features drawn from a diverse set of English lexical resources expressing various aspects of affect. According to classification experiments conducted across a variety of corpora, sentiment information helps to discriminate between ironic and non-ironic tweets.

Twitter is the main social media platform on which political parties run their election campaigns, so many parody accounts are created to damage an opposing party’s narrative by posting fake tweets, which can cause real political damage. M. S. Looijenga [30] explored the impact of fake tweets during the Dutch election of 2012. Eight supervised machine learning models were applied to the tweet dataset: Decision Trees (DT), Bernoulli Naive Bayes (B-NB), Linear Support Vector Machine (LSVM), Gaussian Naive Bayes (G-NB), Multinomial Naive Bayes (M-NB), ExtraTrees (ET), Stochastic Gradient Descent (SGD), and Random Forests (RF). A Bag-of-Words representation model was used for data tokenization, normalization, and vectorisation.

Kaliyar et al. [31] presented a method to identify fake news using a deep neural network approach. The dataset consisted of fake and real news propagated during the 2016 U.S. presidential election. The authors proposed a model called "FakeBERT", which combines the pre-trained NLP model BERT (Bidirectional Encoder Representations from Transformers) with a deep learning approach. Applying the suggested model to the dataset gave very promising results, with an accuracy of 98.90%.

Parody accounts have been used to spread fake news in crisis situations. Many researchers have studied how parody accounts influence society and organizations in a crisis [7, 32]. Researchers have used different machine learning and deep learning techniques to detect parody news. Ajao et al. [33] proposed a mechanism to automatically identify fake news originating from Twitter posts. Hybrid CNN and RNN models were used in that paper, and LSTM was used in the evaluation of the models.

Fig. 2

Proposed methodology for classification of parody tweets.

The introduction of pre-trained models has fundamentally transformed practical applications of natural language processing. It has not only democratised the development of machine learning applications by allowing non-specialists to create them, but has also helped specialists achieve better outcomes without having to train a model from scratch. Pre-trained models have also proven to be a valuable resource for newcomers looking to learn from an established framework that can then be fine-tuned to build new applications. Pre-trained models are simple to use and do not require a lot of labelled data, making them useful for a wide range of business challenges, including prediction, transfer learning, and feature extraction. Many pre-trained models trained on text such as Wikipedia and books are publicly available. Researchers have used these pre-trained models for binary as well as multi-label classification, and the use of these transformers has improved model performance [31, 35].

Two past studies are closely related to this research on distinguishing parody tweets from real ones. In [35], the authors study the identification of parody on Twitter, exclusively for political themes tweeted by US and UK politicians. Another study [34] was recently conducted on a Pakistani tweet dataset to detect parody tweets using various machine learning and deep learning techniques.

3 Experimentation

In the following section, we present the proposed framework, followed by an explanation of the dataset creation, the data preprocessing, and the algorithms applied in this research work. Figure 2 illustrates the proposed methodology for predicting parody tweets.

We define social media parody recognition as a binary classification problem conducted at the post level. Let a Twitter post be denoted by T, defined as a sequence of tokens. The goal is to classify each tweet as real or parody.
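The task definition above can be written compactly as follows. This is only one possible notation: the symbols $t_i$, $n$, $f_\theta$, and $y$ are introduced here for illustration and are not part of the original formulation, while the label convention (1 for real, 0 for parody) follows the labelling described in Section 3.1.2.

```latex
T = (t_1, t_2, \dots, t_n), \qquad
f_\theta : T \mapsto y, \qquad
y \in \{0, 1\},
\quad \text{where } y = 1 \text{ denotes a real tweet and } y = 0 \text{ a parody tweet.}
```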

We first built a customized dataset for this study by sampling a segment of users from Pakistan and India who come from a variety of backgrounds, including politicians, athletes, media outlets, and well-known corporations. We picked Twitter as our social media platform of choice since most politicians, sportsmen, and businesses have Twitter accounts. They keep their followers up to date with the latest news, government pronouncements, and their reactions to recent events. Followers can easily share their opinions on any well-known entity’s statements and critique its policies. We selected Pakistan and India from the South Asian region for this study because the political climates in both nations are similar and the sports played in each region are also largely the same. Twitter permits parody accounts with some restrictions: the bio and the Twitter handle must clearly include specific terms such as fake, not-real, or parody. Finally, we classify tweets as real or parody based on the type of Twitter account from which they were posted. To simplify the task, we used only the tweet text and the type of tweet (genuine or parody).

3.1 Dataset generation

This subsection explains how real and parody accounts were collected and the procedure used to develop the dataset of Pakistani and Indian tweets.

3.1.1 Gathering parody and real accounts

Using the Twitter API [36], users can retrieve public information from Twitter. To locate parody accounts, we query the Twitter API with keywords such as #fake, fake-account, nonofficial, leaks, false, not-real, and parody account, with an additional parameter that restricts the search to a certain region, in our case Pakistan and India.

Accounts that tweet in languages other than English and accounts that were blacklisted were removed from the list of users. If several parody profiles were found for the same entity, we retained only one.

After following all of the above steps, we were able to identify parody accounts of prominent people, sportsmen, and media houses from both India and Pakistan. Following that, we collected all of the real accounts of the entities associated with the parody accounts.
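The account-selection rules described in this subsection can be sketched in a few lines of Python. The sketch works on profile records that have already been retrieved and makes no Twitter API calls. The record fields (`handle`, `bio`, `lang`, `blacklisted`, `entity`), the example profiles, and the "first match wins" rule for keeping one parody profile per entity are all assumptions made for illustration; the study does not specify which profile was retained.

```python
# Keyword stems taken from the search terms listed above.
PARODY_KEYWORDS = ("fake", "nonofficial", "leaks", "false", "not-real", "parody")


def looks_like_parody(profile):
    """True if the handle or bio contains one of the search keywords."""
    haystack = f"{profile['handle']} {profile['bio']}".lower()
    return any(keyword in haystack for keyword in PARODY_KEYWORDS)


def select_parody_accounts(profiles):
    """Keep English, non-blacklisted parody profiles, one per parodied entity."""
    selected = {}
    for profile in profiles:
        if profile["lang"] != "en" or profile["blacklisted"]:
            continue
        if not looks_like_parody(profile):
            continue
        # Assumption: when several parody profiles exist, the first one is kept.
        selected.setdefault(profile["entity"], profile)
    return list(selected.values())


# Hypothetical profile records standing in for API results.
profiles = [
    {"handle": "@ExamplePM_parody", "bio": "Parody account", "lang": "en",
     "blacklisted": False, "entity": "Example PM"},
    {"handle": "@ExamplePM_fake", "bio": "not-real", "lang": "en",
     "blacklisted": False, "entity": "Example PM"},
    {"handle": "@ExampleNews", "bio": "Official news", "lang": "en",
     "blacklisted": False, "entity": "Example News"},
    {"handle": "@ExampleStar_parody", "bio": "parody", "lang": "ur",
     "blacklisted": False, "entity": "Example Star"},
]

parody_accounts = select_parody_accounts(profiles)
print([p["handle"] for p in parody_accounts])
```

Only the first "Example PM" profile survives: the second is a duplicate for the same entity, the news account carries no parody keyword, and the last profile is not in English.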

3.1.2 Gathering parody and real tweets

After gathering all parody and real user accounts, we use the publicly available Twitter API to collect the tweets posted by these users. According to Twitter’s API rules, retrieval is limited to 3200 tweets per account. We collected up to 1000 tweets from each account for this study. In total, we collected 34620 tweets, comprising 19185 Indian tweets and 15435 Pakistani tweets. We then labelled each tweet as real or parody, assigning 1 to real tweets and 0 to parody tweets, based on the type of Twitter account from which it originated.

3.2 Data division

The dataset was split into an 80% training part and a 20% testing part. Different machine learning and deep learning algorithms were then applied to classify parody tweets. The data divisions applied to the dataset to automatically detect parody tweets were as follows:

3.2.1 Overall tweet dataset

The combined dataset contains 34620 real and parody tweets from both India and Pakistan (see Table 2).

3.2.2 Pakistani tweet dataset

In the second experiment, we took the same dataset as in study [34]. We divided the Pakistani tweet dataset, which holds 15435 tweets, into training and testing parts, as shown in Table 2.

3.2.3 Indian tweets dataset

In the third experiment, we took the Indian tweet dataset, which contains 19185 records comprising both real and parody tweets, as shown in Table 2.

Table 2
Different dataset splits

Dataset	Class	Train	Test	Total
Overall Tweets	Real	14005	3399	17404
	Parody	13808	3408	17216
	Total	27813	6807	34620
Pakistani Tweets	Real	6392	1601	7993
	Parody	5980	1462	7442
	Total	12372	3063	15435
Indian Tweets	Real	7558	1853	9411
	Parody	7764	2010	9774
	Total	15322	3863	19185

3.2.4 Region-based split

Finally, we split the dataset according to region. We first used the Pakistani tweet dataset for training and the Indian tweet dataset for testing, and then used the Indian tweet dataset for training and the Pakistani tweet dataset for testing, as shown in Table 3.
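The two kinds of data division used in this section, a random 80/20 split and a region-based split, can be sketched as follows. This is an illustrative sketch, not the procedure used to produce the counts in the tables: the record layout (`text`, `region`, `label`), the region codes, the fixed random seed, and the toy records are assumptions.

```python
import random

REAL, PARODY = 1, 0  # label convention from Section 3.1.2


def random_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split them into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def region_split(records, train_region, test_region):
    """Train on all tweets from one region and test on all tweets from the other."""
    train = [r for r in records if r["region"] == train_region]
    test = [r for r in records if r["region"] == test_region]
    return train, test


# Toy records standing in for the collected tweets (hypothetical content).
records = [
    {"text": f"tweet {i}",
     "region": "PK" if i % 2 == 0 else "IN",
     "label": REAL if i % 3 == 0 else PARODY}
    for i in range(10)
]

train, test = random_split(records)
pk_train, in_test = region_split(records, "PK", "IN")
print(len(train), len(test))
print(len(pk_train), len(in_test))
```

In the region-based setting no tweet from the test region is seen during training, which is why the full Pakistani and Indian subsets appear as train and test sets in that split.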

3.3 Data preprocessing

Raw text cannot be supplied directly to the models for classification: models accept numeric inputs (integers and floating-point values), whereas the data we collected from tweets is in string format. Therefore, preprocessing techniques must be applied to the dataset before machine learning and deep learning models can accept it.

The following steps were performed while preprocessing the tweet text.

3.3.1 Replace contraction

As the first step, we replace contractions with their expanded forms (for example, "don't" becomes "do not").

Table 3
Region-Based Data-Split

Training region	Split	Real	Parody	Total
Pakistan	Train	7993	7442	15435
	Test (India)	9411	9774	19185
India	Train	9411	9774	19185
	Test (Pakistan)	7993	7442	15435

3.3.2 Convert text

As the second step, we convert all text to lower case for consistency.

3.3.3 Remove URLs & HTML tags

As the third step, all the unnecessary URLs and the HTML tags from the tweet text were removed.

3.3.4 Remove stopwords

As the fourth step, all stopwords were removed from the text tokens using the stopword list from the Python library NLTK.

3.3.5 Text tokenization

For tokenization of the cleaned dataset, the Differential Language Analysis ToolKit (DLATK) was employed.
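The preprocessing steps of this section can be sketched with the Python standard library alone. The sketch is a simplified stand-in, not the pipeline used in the study: the contraction map and stopword set are tiny illustrative samples (the study uses the NLTK stopword list), and a simple regular-expression tokenizer replaces DLATK. Because stopwords are removed from tokens, tokenization is applied before stopword removal in the code.

```python
import html
import re

# Small illustrative samples; a real run would use much fuller lists.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is", "i'm": "i am"}
STOPWORDS = {"a", "an", "the", "is", "it", "to", "of", "and", "do", "not", "this", "i", "am"}

CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
# Words (optionally with an apostrophe suffix), plus @mentions and #hashtags.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?|[@#]\w+")


def expand_contractions(text):
    """Step 1: replace known contractions with their expanded forms."""
    text = text.replace("\u2019", "'")  # normalise curly apostrophes
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def preprocess(text):
    """Apply the preprocessing steps and return the remaining tokens."""
    text = expand_contractions(text)             # step 1: contractions
    text = text.lower()                          # step 2: lower case
    text = URL_RE.sub(" ", text)                 # step 3: remove URLs
    text = TAG_RE.sub(" ", html.unescape(text))  # step 3: remove HTML tags
    tokens = TOKEN_RE.findall(text)              # tokenization (regex stand-in)
    return [t for t in tokens if t not in STOPWORDS]  # step 4: stopwords


tokens = preprocess("It's a <b>BIG</b> day, don't miss it! https://t.co/abc #Parody")
print(tokens)
```

Keeping hashtags and mentions as single tokens is a design choice of this sketch; whether they are informative for parody detection is left open here.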

3.4 Classification models

We ran a number of experiments on the dataset using a variety of machine learning and deep learning algorithms, including simple logistic regression, recurrent neural networks, and transformers (pre-trained NLP models), to predict parody in the tweet dataset. The following algorithms were used in this study:

3.4.1 Logistic regression model

The first model applied to the dataset was a logistic regression (LR) model that extracted features from the text data using the Bag-of-Words (BoW) approach [37].

In the second variant, we use Part-of-Speech (POS) tagging to extend LR with Bag-of-Words [38]. We first tagged all of the text with POS tags and then used BoW to extract features automatically from the text, with each word attached to its Part-of-Speech tag.
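The two feature sets described above can be illustrated with a small standard-library sketch that counts word n-grams, first on plain tokens and then on tokens joined to their POS tags. The n-gram ranges (1 to 2 and 1 to 3) are taken from the hyperparameter table; the `word_TAG` token format, the function names, and the hand-tagged example are assumptions, since the exact feature construction is not specified. In practice a POS tagger would supply the tags and the counts would be fed to a logistic regression classifier.

```python
from collections import Counter


def ngram_counts(tokens, n_min=1, n_max=2):
    """Count all contiguous n-grams with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def attach_pos(tagged_tokens):
    """Turn (word, tag) pairs into single 'word_TAG' tokens."""
    return [f"{word}_{tag}" for word, tag in tagged_tokens]


# Plain Bag-of-Words features with unigrams and bigrams.
tokens = ["pm", "resigns", "today"]
bow = ngram_counts(tokens, 1, 2)

# POS-extended features with 1- to 3-grams; tags are hand-assigned for illustration.
tagged = [("pm", "NN"), ("resigns", "VBZ"), ("today", "NN")]
pos_bow = ngram_counts(attach_pos(tagged), 1, 3)

print(sorted(bow))
print(sorted(pos_bow))
```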

3.4.2 Bi-directional long short term memory (Bi-LSTM)

The recurrent neural network (RNN) used in this study is the Bi-LSTM, an extension of the standard LSTM [39]. In a Bi-LSTM, the input flows in both directions, so that both past and future context are preserved. GloVe 200-dimensional word vectors [40], pre-trained on tweets, were used for the embedding layer.
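The bidirectional flow can be summarized in the standard Bi-LSTM formulation below. The notation is introduced here for illustration: $x_t$ stands for the GloVe embedding of the $t$-th token, the arrows mark the forward and backward passes, and $[\cdot\,;\cdot]$ denotes concatenation. How the per-token states $h_t$ are pooled before the output layer is not specified in this section.

```latex
\overrightarrow{h}_t = \mathrm{LSTM}_{\mathrm{fwd}}\left(x_t, \overrightarrow{h}_{t-1}\right), \qquad
\overleftarrow{h}_t = \mathrm{LSTM}_{\mathrm{bwd}}\left(x_t, \overleftarrow{h}_{t+1}\right), \qquad
h_t = \left[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\right]
```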

3.5 Pre-trained models

The advanced techniques used for predicting real and parody tweets are pre-trained NLP Transformer models. In natural language processing, the Transformer is an architecture that aims to solve sequence-to-sequence tasks while also handling long-range dependencies. It does not use sequence-aligned RNNs or convolution to compute representations of its input and output, relying instead entirely on self-attention [41].
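For reference, the scaled dot-product attention at the core of the Transformer [41] is usually written as follows, where $Q$, $K$, and $V$ are the query, key, and value matrices computed from the input tokens and $d_k$ is the dimension of the keys.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```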

Pre-trained models have also proven to be a valuable resource for newcomers looking to learn from an established framework that can then be fine-tuned to build new applications. They are simple to use and do not require a lot of labelled data, making them useful for a wide range of business challenges, including prediction, transfer learning, and feature extraction.

3.5.1 BERT

The first pre-trained model used in this research is BERT (Bidirectional Encoder Representations from Transformers) [42], which is pre-trained on unlabelled English text comprising a corpus of over 800M words and around 2500M words of English Wikipedia [42]. To learn bidirectional embeddings for input tokens, the model employs several multi-head attention layers. It is trained with masked language modelling, which involves masking a portion of the input tokens in a sequence and predicting each masked word from its context. BERT uses word pieces, which pass through an embedding layer and are summed with positional and segment embeddings. To fine-tune the BERT-base model for predicting parody tweets, we added an output dense layer for binary classification and fed it the representation of the ’classification’ token.
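One common way to write such a classification head is given below. It is a sketch under assumptions rather than the exact head used in the study: it assumes a single sigmoid output unit trained with binary cross-entropy, whereas an equivalent two-unit softmax head is also widely used. Here $h_{\mathrm{[CLS]}}$ is the final hidden state of the classification token, $w$ and $b$ are the parameters of the added dense layer, and $y \in \{0, 1\}$ is the tweet label.

```latex
\hat{y} = \sigma\left(w^{\top} h_{\mathrm{[CLS]}} + b\right), \qquad
\mathcal{L} = -\left[\, y \log \hat{y} + (1 - y) \log\left(1 - \hat{y}\right) \right]
```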

3.5.2 RoBERTa

RoBERTa [43] is also used in this research work. RoBERTa is an extension of BERT that is pre-trained on more data, and it has shown promising performance compared with BERT.

3.5.3 XLNet

The third pre-trained model used is XLNet [44], which is based on the Transformer network. XLNet is an auto-regressive pre-trained model. Structurally, XLNet is similar to BERT but differs from it in the training process.

3.6 Hyperparameters

We tune the parameters of all models on the development set for each data split. The resulting configurations are shown in Table 4.

Table 4
All models parameters and configurations

Models	Parameters & Configurations
Logistic Regression with BOW	n-grams with n = (1, 2)
	L2 regularization
	Python sklearn library
Logistic Regression with POS	n-grams with n = (1, 3)
	L2 regularization
	Python sklearn library
BiLSTM	200-dimensional GloVe embeddings
	sequence length set to 50
	10 epochs with batch size 64
	binary cross-entropy loss
	Adam optimizer
BERT and RoBERTa	base model used
	1 epoch with learning rate l = 5e-5
	batch size 64
	Adam optimizer
	Python Hugging Face library
XLNet	base model used
	1 epoch with learning rate l = 4e-5
	batch size 64
	Adam optimizer
	Python Hugging Face library
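The fixed sequence length of 50 listed for the BiLSTM means that every tweet must be mapped to exactly 50 token ids before it reaches the embedding layer. The sketch below shows one way to do this; the vocabulary construction, the reserved ids for padding and unknown tokens, and the choice of right-padding are assumptions, not details reported in the table.

```python
PAD_ID, UNK_ID = 0, 1  # reserved ids (an assumption of this sketch)
MAX_LEN = 50           # sequence length from the hyperparameter table


def build_vocab(tokenized_tweets):
    """Assign an integer id to every token seen in the training tweets."""
    vocab = {}
    for tokens in tokenized_tweets:
        for token in tokens:
            vocab.setdefault(token, len(vocab) + 2)  # ids 0 and 1 are reserved
    return vocab


def encode(tokens, vocab, max_len=MAX_LEN):
    """Map tokens to ids, truncate long tweets, and right-pad short ones."""
    ids = [vocab.get(token, UNK_ID) for token in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


vocab = build_vocab([["pm", "resigns", "today"], ["pm", "denies", "report"]])
short = encode(["pm", "resigns", "again"], vocab)
long = encode(["pm"] * 60, vocab)
print(short[:5], len(short), len(long))
```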

3.7 Performance metrics

Multiple performance metrics were employed to assess the performance of the models on the test dataset. The following performance metrics are used in this research:

Confusion Matrix

Accuracy

Precision

Recall

F1-Measure

3.7.1 Confusion matrix

The confusion matrix is a table that summarizes the effectiveness of a classification algorithm on the test set using four parameters:

True Positive

False Positive

True Negative

False Negative

In this research, the task is to predict whether a tweet is real or parody, with real as the positive class. The four parameters of the confusion matrix therefore have the following meanings:

True Positive (TP): Tweet predicted as real that is labeled as a real tweet

True Negative (TN): Tweet predicted as parody that is labeled as a parody tweet

False Positive (FP): Tweet predicted as real that is labeled as a parody tweet

False Negative (FN): Tweet predicted as parody that is labeled as a real tweet

3.7.2 Accuracy

Accuracy is one of the most widely used performance indicators for classification. It is the ratio of correctly predicted instances, whether positive or negative, to the total number of predictions. Equation (1) is used to calculate the accuracy. $Accuracy (ACC) = \frac{TP + TN}{TP + TN + FP + FN}$ (1) In general, high accuracy indicates good performance of the classification model.

3.7.3 Precision

Precision measures how reliable the positive predictions are. It is computed as the number of correctly predicted positive instances divided by the total number of instances predicted as positive [45]. Equation (2) gives the formula for precision. $Precision (P) = \frac{TP}{TP + FP}$ (2)

3.7.4 Recall

Recall is a ratio that measures how many correct positive predictions were produced out of all actual positive instances [45]. Unlike precision, which only considers the correct positive predictions out of all positive predictions, recall accounts for the positive instances that were missed. The formula for recall is shown in Equation (3). $Recall (R) = \frac{TP}{TP + FN}$ (3)

3.7.5 F1-measure

The F1-Score is a method of combining precision and recall into a single metric that encompasses both [45]. Equation (4) gives the F1-Score in terms of the precision and recall values, and Equation (5) gives the equivalent formula in terms of the predicted counts. $F1\text{-}Score (F1) = 2 \times \frac{Recall \times Precision}{Recall + Precision}$ (4) $F1\text{-}Score (F1) = \frac{2 \times TP}{2 \times TP + FP + FN}$ (5)
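The metrics of this section can be checked on a small example with the following sketch. It follows the standard convention with real tweets (label 1) as the positive class, so a false positive is a parody tweet predicted as real and a false negative is a real tweet predicted as parody. The function names and the toy labels are illustrative and are not taken from the experiments.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN with `positive` (real tweets) as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred != positive and truth != positive:
            tn += 1
        elif pred == positive:
            fp += 1  # predicted real, labelled parody
        else:
            fn += 1  # predicted parody, labelled real
    return tp, tn, fp, fn


def scores(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 as in Equations (1) to (5)."""
    total = tp + tn + fp + fn
    f1_denominator = 2 * tp + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "f1": 2 * tp / f1_denominator if f1_denominator else 0.0,
    }


# Toy labels: 1 = real, 0 = parody.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
result = scores(*counts)
print(counts, result)
```

On this toy example both forms of the F1-Score, Equation (4) and Equation (5), give the same value, which is a quick sanity check that the two formulas are equivalent.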

Table 5

Classification model performance on the different datasets. The highest results are in bold.

Dataset	Models	Accuracy %	Precision %	Recall %	F1-Score %
Overall Tweets	Logistic Regression with BOW	84.00	84.00	84.00	84.00
	Logistic Regression with POS	83.00	83.00	83.00	83.00
	BiLSTM	82.40	83.00	82.40	82.00
	BERT	86.70	87.65	85.40	86.51
	RoBERTa	86.27	86.54	85.87	86.20
	XLNet	85.11	84.68	85.70	85.18
Pakistani Tweets	Logistic Regression with BOW	90.00	90.00	90.00	90.00
	Logistic Regression with POS	89.00	89.00	89.00	89.00
	BiLSTM	86.50	87.00	86.00	86.00
	BERT	91.50	92.00	92.00	91.65
	RoBERTa	92.00	92.00	91.50	92.00
	XLNet	90.30	91.00	91.00	91.00
Indian Tweets	Logistic Regression with BOW	86.00	86.00	86.00	86.00
	Logistic Regression with POS	86.00	86.00	86.00	86.00
	BiLSTM	83.70	84.00	82.00	83.70
	BERT	85.94	86.02	84.00	84.86
	RoBERTa	84.07	84.07	83.25	82.79
	XLNet	85.01	85.00	85.47	84.13

4 Results

This part of the paper presents the results obtained by applying the various classification models to the dataset splits described in Section 3.2. We evaluate our methods using different evaluation metrics, namely accuracy, recall, F1-score, and precision [46].

4.1 Overall dataset

The results for the complete dataset are presented in Table 5. The pre-trained models outperform both the RNN and the basic logistic regression models. Among all the models, BERT performed best, with 86.70% accuracy.

4.2 Pakistani tweet dataset

This work extends an earlier study [34], and the results obtained on the Pakistani tweet dataset are identical to those reported there. The results are shown in Table 5. The RoBERTa model attained 92% accuracy on the Pakistani tweet dataset.

4.3 Indian tweets dataset

Table 5 shows the results obtained on the Indian tweets dataset. In this case, the logistic regression models perform well compared to the other models, achieving 86% accuracy.

Table 6

Accuracy and precision on the region-based split (train region ⇒ test region)

	PAK ⇒ IND		IND ⇒ PAK	
Model	Acc %	P %	Acc %	P %
LR-BOW	72.00	72.00	66.00	68.50
LR-PoS	72.00	72.00	68.00	69.00
Bi-LSTM	70.10	71.00	67.30	68.00
BERT	72.00	72.00	72.80	73.20
RoBERTa	75.00	77.00	73.40	75.00
XLNet	74.30	75.00	71.20	71.00

4.4 Region-based split

Table 6 shows the accuracy and precision obtained on the region-based split. Two scenarios were evaluated.

First scenario (PAK ⇒ IND): train on Pakistani tweets and test on Indian tweets.

Second scenario (IND ⇒ PAK): train on Indian tweets and test on Pakistani tweets.

The results in both scenarios suggest that the pre-trained models perform at least as well as the logistic regression models and the BiLSTM network. RoBERTa performs best among the pre-trained models, achieving 75% accuracy in the first scenario and 73.40% in the second.
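The two scenarios can be sketched as a small evaluation harness. This is only an illustrative sketch under stated assumptions: the record layout, the helper names `region_split` and `cross_region_accuracy`, and the toy tweets are hypothetical, and `majority_baseline` is a stand-in for the actual classifiers (logistic regression, BiLSTM, BERT, RoBERTa, XLNet), which would be plugged in through the `train_fn` argument.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical record layout: (text, region, label), label 1 = parody, 0 = real.
Tweet = Tuple[str, str, int]
Example = Tuple[str, int]
Predictor = Callable[[str], int]


def region_split(
    tweets: List[Tweet], train_region: str, test_region: str
) -> Tuple[List[Example], List[Example]]:
    """Use every tweet from one region for training and the other for testing."""
    train = [(text, y) for text, region, y in tweets if region == train_region]
    test = [(text, y) for text, region, y in tweets if region == test_region]
    return train, test


def majority_baseline(train: List[Example]) -> Predictor:
    """Stand-in trainer: always predicts the most frequent training label."""
    most_common_label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda text: most_common_label


def cross_region_accuracy(
    tweets: List[Tweet], train_fn: Callable[[List[Example]], Predictor]
) -> Dict[str, float]:
    """Evaluate both directions: PAK => IND and IND => PAK."""
    results = {}
    for train_region, test_region in [("PAK", "IND"), ("IND", "PAK")]:
        train, test = region_split(tweets, train_region, test_region)
        predict = train_fn(train)
        correct = sum(1 for text, y in test if predict(text) == y)
        results[f"{train_region}=>{test_region}"] = correct / len(test)
    return results


# Toy data for illustration only (not from the study's dataset).
toy_tweets: List[Tweet] = [
    ("parody tweet a", "PAK", 1),
    ("parody tweet b", "PAK", 1),
    ("real tweet c", "PAK", 0),
    ("real tweet d", "IND", 0),
    ("real tweet e", "IND", 0),
    ("parody tweet f", "IND", 1),
]
scores = cross_region_accuracy(toy_tweets, majority_baseline)
print(scores)
```

Keeping the split separate from the trainer means the same harness can be reused for each model, so every model is scored on exactly the same train and test partitions.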

5 Outcome and way-forward

In this paper, we proposed a method to detect parody in a Twitter dataset drawn from the Pakistan and India regions. This research builds on the findings of a previous study undertaken by the authors [34]. For this study, we created a dataset of tweets from both countries, which was previously unavailable. The dataset contains 34620 tweets, both real and parody. We ran multiple algorithms on different subsets of the dataset. On the overall test dataset, we obtained promising results, with up to 86.70% accuracy. On Pakistani and Indian tweets, accuracy was 92% and 86%, respectively.

In future, we plan to extend this research by enlarging the tweet dataset and optimizing model parameters to attain better accuracy on unseen data.

Footnotes

India had 24.45 million Twitter users as of October 2021 (https://www.statista.com/statistics/242606/number-of-active-twitter-users-in-selected-countries/). Pakistan had 46.00 million social media users in January 2021.

The sklearn library was used for feature extraction.

References

1. Gottfried, Shearer, News use across social media platforms, 2016, 2019.

2. Ahlers, News consumption and the new electronic media, Harvard International Journal of Press/Politics 11(1) (2006), 29–52.

3. Ittefaq, Hussain S.A. and Fatima, Covid-19 and social-politics of medical misinformation on social media in Pakistan, Media Asia 47(1-2) (2020), 75–80.

4. Wahutu J.S., Fake news and journalistic “rules of the game”, African Journalism Studies 40(4) (2019), 13–26.

5. Alam, Lucas, Tweeting government: A case of Australian government use of Twitter, in 2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing. IEEE, 2011, pp. 995–1001.

6. Highfield, News via Voldemort: Parody accounts in topical discussions on Twitter, New Media & Society 18(9) (2016), 2028–2045.

7. Wan, Koh, Ong and Pang, Parody social media accounts: Influence and impact on organizations during crisis, Public Relations Review 41(3) (2015), 381–385.

8. Rossen-Knill D.F. and Henry, The pragmatics of verbal parody, Journal of Pragmatics 27(6) (1997), 719–752.

9. Dynel, Isn’t it ironic? Defining the scope of humorous irony, Humor 27(4) (2014), 619–639.

10. Franke, A note on parody in Chinese traditional literature, Oriens Extremus 18(2) (1971), 237–251.

11. Sinclair, Parody: fake news, regeneration and education, Postdigital Science and Education 2(1) (2020), 61–77.

12. Vis, Twitter as a reporting tool for breaking news: Journalists tweeting the UK riots, Digital Journalism 1(1) (2013), 27–47.

13. Johnson, Twitter and the body parodic: Global acts of recreation and recreation, Ph.D. dissertation, Massachusetts Institute of Technology, 2017.

14. Kim and Kim, The crisis of public health and infodemic: Analyzing belief structure of fake news about COVID-19 pandemic, Sustainability 12(23) (2020), 9904.

15. Gurajala, White J.S., Hudson, Matthews J.N., Fake Twitter accounts: Profile characteristics obtained using an activity-based pattern detection approach, in Proceedings of the 2015 International Conference on Social Media & Society, ser. SMSociety ’15. New York, NY, USA: Association for Computing Machinery, 2015. [Online]. Available: https://doi.org/10.1145/2789187.2789206.

16. Benevenuto, Magno, Rodrigues and Almeida, Detecting spammers on Twitter, in Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS) 6(2010) (2010), 12.

17. Kreuz R.J. and Roberts R.M., On satire and parody: The importance of being ironic, Metaphor and Symbol 8(2) (1993), 97–109.

18. Dinu L.P., Niculae, Sulea O.-M., Pastiche detection based on stopword rankings. Exposing impersonators of a Romanian writer, in Proceedings of the Workshop on Computational Approaches to Deception Detection, 2012, pp. 72–77.

19. de Morais J.I., Abonizio H.Q., Tavares G.M., da Fonseca A.A., Barbon Jr, Deciding among fake, satirical, objective and legitimate news: A multi-label classification system, in Proceedings of the XV Brazilian Symposium on Information Systems, 2019, pp. 1–8.

20. Rashkin, Choi, Jang J.Y., Volkova, Choi, Truth of varying shades: Analyzing language in fake news and political fact-checking, in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2931–2937.

21. Barbieri, Ronzano and Saggion, Is this tweet satirical? A computational approach for satire detection in Spanish, Procesamiento del Lenguaje Natural (55) (2015), 135–142.

22. Burfoot, Baldwin, Automatic satire detection: Are you having a laugh? in Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, 2009, pp. 161–164.

23. Reganti A.N., Maheshwari, Kumar, Das, Bajpai, Modeling satire in English text for automatic detection, in 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW). IEEE, 2016, pp. 970–977.

24. Wardle, Derakhshan, et al., Thinking about ‘information disorder’: formats of misinformation, disinformation, and malinformation, Ireton, Cherilyn; Posetti, Julie. Journalism, ‘fake news’ & disinformation. Paris: Unesco, pp. 43–54, (2018).

25. Joshi, Bhattacharyya and Carman M.J., Automatic sarcasm detection: A survey, ACM Computing Surveys (CSUR) 50(5) (2017), 1–22.

26. Bharti S.K., Babu K.S., Jena S.K., Parsing-based sarcasm sentiment recognition in Twitter data, in 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), IEEE, 2015, pp. 1373–1380.

27. Rajadesingan, Zafarani, Liu, Sarcasm detection on Twitter: A behavioral modeling approach, in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, ser. WSDM ’15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 97–106. [Online]. Available: https://doi.org/10.1145/2684822.2685316.

28. Ravi and Ravi, A novel automatic satire and irony detection using ensembled feature selection and data mining, Knowledge-Based Systems 120 (2017), 15–33.

29. Farias D.I.H., Patti and Rosso, Irony detection in Twitter: The role of affective content, ACM Transactions on Internet Technology (TOIT) 16(3) (2016), 1–24.

30. Looijenga M.S., The detection of fake messages using machine learning, B.S. thesis, University of Twente, 2018.

31. Kaliyar R.K., Goswami, Narang, FakeBERT: Fake news detection in social media with a BERT-based deep learning approach, Multimedia Tools and Applications, pp. 1–24, (2021).

32. Montesi, Understanding fake news during the COVID-19 health crisis from the perspective of information behaviour: The case of Spain, Journal of Librarianship and Information Science, p. 0961000620949653, (2020).

33. Ajao, Bhowmik, Zargari, Fake news identification on Twitter with hybrid CNN and RNN models, in Proceedings of the 9th International Conference on Social Media and Society, 2018, pp. 226–230.

34. Talha M.A., Zafar, Investigating parody from social media accounts, in 2021 6th South-East Europe Design Automation, Computer Engineering, Computer Networks and Social Media Conference (SEEDA-CECNSM), IEEE, 2021, pp. 1–6.

35. Maronikolakis, Villegas D.S., Preotiuc-Pietro, Aletras, Analyzing political parody in social media, arXiv preprint arXiv:2004.13878, 2020.

36. Makice, Twitter API: Up and running: Learn how to build applications with the Twitter API. O’Reilly Media, Inc., 2009.

37. Zhang, Jin and Zhou Z.-H., Understanding bag-of-words model: a statistical framework, International Journal of Machine Learning and Cybernetics 1(1-4) (2010), 43–52.

38. Voutilainen, The Oxford Handbook of Computational Linguistics, pp. 219–232, (2003).

39. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780.

40. Pennington, Socher, Manning C.D., GloVe: Global vectors for word representation, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

41. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser, Polosukhin, Attention is all you need, 2017.

42. Devlin, Chang M.-W., Lee, Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, 2018.

43. Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer, Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692, 2019.

44. Yang, Dai, Yang, Carbonell, Salakhutdinov, Le Q.V., XLNet: Generalized autoregressive pretraining for language understanding, 2020.

45. Precision and recall, https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/, 2020.

46. Novaković J.D., Veljović, Ilié S.S., Papić Ž. and Milica, Evaluation of classification models in machine learning, Theory and Applications of Mathematics & Computer Science 7(1) (2017), 39–46.
