Abstract
The paper presents a new corpus for fake news detection in the Urdu language along with baseline classification and its evaluation. With the escalating use of the Internet worldwide and the substantially increasing impact of the availability of ambiguous information, the challenge of quickly identifying fake news in digital media in various languages becomes more acute. We provide a manually assembled and verified dataset containing 900 news articles, 500 annotated as real and 400 as fake, allowing the investigation of automated fake news detection approaches in Urdu. The news articles in the truthful subset come from legitimate news sources, and their validity has been manually verified. For the fake subset, the known difficulty of finding fake news was solved by hiring professional journalists who are native Urdu speakers and who were instructed to intentionally write deceptive news articles. The dataset covers five topics: (i) Business, (ii) Health, (iii) Showbiz, (iv) Sports, and (v) Technology. To establish our Urdu dataset as a benchmark, we performed baseline classification. We crafted a variety of text representation feature sets including word n-grams, character n-grams, function word n-grams, and their combinations. After applying a variety of feature weighting schemes, we ran a series of classifiers on the train-test split. The results show sizable performance gains by the AdaBoost classifier with 0.87 F1Fake and 0.90 F1Real. We report the results against different metrics for convenient comparison in future research. The dataset is publicly available for research purposes.
Introduction
Even though the Urdu language has more than 100 million speakers across the world, it is a resource-poor language in the Natural Language Processing (NLP) domain, both in terms of the availability of NLP tools and the scarcity of labeled datasets [1]. In this work, we focus on assembling a credible Urdu corpus for automatic fake news detection.
In digital media, the epidemic of fake news grows substantially when a change in public opinion is sought during an important event. Hence, we need natural language processing algorithms to design systems that can determine, with or without human curation, whether a source is trustworthy or politically inclined. For example, in January 2019 Google showed an incorrect value of the Pakistani rupee against the US dollar (the exchange price of the dollar) in Pakistan; the following day, the stock market in Pakistan crashed because people started to sell their shares due to the dramatic decline.
Fake news poses significant challenges to our society. The wide availability of information has increased the difficulty of automatically assessing the trustworthiness of data. For this reason, it is necessary to build systems for controlling the amount of factually incorrect and misleading data on the Web. This need can be met by designing computational models for detecting fake news. In turn, this requires a sufficient amount of labeled data for applying supervised machine learning approaches. As fake news dissemination can be cross-lingual, it is best to have datasets available in a wide variety of languages.
Therefore, we present a labeled dataset for fake news detection in the Urdu language. It contains 500 real news articles from legitimate news sources and 400 fake news articles on the corresponding topics: (i) Business, (ii) Health, (iii) Showbiz, (iv) Sports, and (v) Technology.
Additionally, we provide baseline classification methods for fake news detection on this dataset. There are three categories of fake news detection methods [3]: knowledge-based (attempting fact verification), context-based (analyzing how news disseminates in social networks), and style-based (analyzing writing style). The problem with implementing the first two approaches for the Urdu language is the unavailability of the NLP tools required for crafting their intermediate features. The style-based approach, however, relies in its basic form only on analyzing n-gram sequences.
The main contributions of this work are: (i) the first corpus for the Urdu language for research on automatic fake news detection, containing real news extracted from various legitimate news agencies and fake news written by professional native Urdu-speaking journalists; this corpus is a unique resource for in-depth study of style-based fake news detection models; (ii) the corpus development methodology; (iii) a description of the challenges faced in assembling the fake news part; (iv) statistical metrics for the corpus vocabulary; (v) recommendations for the most effective feature combination; (vi) a comparison of supervised learning classifiers and their performance in fake news detection based on linguistic and stylometric features; (vii) baseline classification results evaluated against a number of metrics, with the best results of 0.86 F1Fake score for fake news detection, 0.89 F1Real score for real news detection, and 0.95 ROC-AUC.
The rest of the paper is organized as follows. Section 2 overviews the state-of-the-art work on fake news detection and corpora for other languages. Section 3 describes the methodology we followed for building the corpus along with the annotation guidelines and the corpus statistics. Section 4 describes the classification approach for automatically detecting fake news. In Section 5 we analyze the experimental results. Finally, Section 6 presents general conclusions and points to possible directions for future work.
Related work
In this section, we review the literature on the automatic analysis of fake news, which has been a subject of particularly acute attention. Fake news has been present since the invention of the printing press back in 1439. However, opinions diverge on how to define the term “fake news.”
In recent times, there have been two main directions of research on automatically classifying fake news: on a conceptual and on an operational level. On a conceptual level, fake news has been further divided into three categories [4]: hoaxes, i.e., posting fictitious information on social networks while alluding to certain news broadcast in its genuine form by reputable news websites; satire, i.e., news that imitates real news content with the addition of untrue and sarcastic content; and serious fabrications, i.e., misleading news about a celebrity or an event that did not take place.
On an operational level, researchers [5] suggested different approaches, such as an inference task in a Markov random field (MRF) [6], fact-checking, and source-checking. Moreover, fake news detection and deception detection have been treated in several studies as a data mining task [3] to classify news pieces, posts, and online reviews in publicly available corpora [7, 8].
Fake news pieces contain dogmatic and seditious language to urge users to click on the link to read the full article (known as “clickbait”) [9]. Thus, in the fake news detection task, linguistic features have been used to capture the different writing styles in news content and sensational headlines [7]. Additionally, social networks, in particular Twitter posts associated with natural disasters, have been used for the development of fake news detection models [10].
Linguistic-based features are derived from language to examine its various aspects at different levels, such as characters, words, sentences, and documents as a whole. There are two primary types of features: common and domain-specific linguistic features. Common linguistic features comprise two kinds of features: (i) lexical features and (ii) syntactic features. Lexical features include character-level and word-level features, such as the total number of words, the average number of characters per word, the frequency of words present in the dataset, the unique word count, the frequency of function words and phrases, parts-of-speech (POS) tags, etc. Syntactic features include sentence-level features such as syntactic dependencies/constituents, clauses, and punctuation. Ivanov and Tutubalina [11] use syntactic clause features for user review analysis.
Domain-specific linguistic features, which are specific to the news domain, include quoted words, external links, the number of images, etc. [2]. Furthermore, to uncover deceptive cues in writing style that flag fake news, features such as author lying-detection features and other new types of features can be created [12].
The purpose of extracting features is to represent the content of news items mathematically. Model-oriented fake news research opens the door to developing more robust models for fake news detection. Recent studies [2, 13] have suggested different approaches that focus on extracting several kinds of features and integrating them into supervised classification models such as logistic regression (LR), k-nearest neighbours (kNN), random forest (RF), and support vector machines (SVM), and then choosing the classifier that outperforms the other machine learning algorithms.
A recent study suggested a multi-task ordinal regression framework that models the problems of trustworthiness and political ideology detection of entire news content jointly instead of analyzing each news article individually. Furthermore, this study also revealed that joint models obtained significantly better results than models that tackle the problems separately [14].
Additionally, fake news detection has been investigated as a stance detection problem rather than true/false classification [15, 16]. In particular, this approach was adopted in the Fake News Challenge project (FNC1), which reduces the problem to checking the relationship between the title and the body of the news: a) the title and the body are clearly related, b) there is no association between the title and the body, and c) there is a partial relationship.
The winning team (best performing system) achieved an accuracy score of 82.02 using machine learning and deep learning approaches [17]. For machine learning, 2-gram and 3-gram features with the TF-IDF weighting scheme were used with gradient-boosted decision trees. For deep learning, word-level vectors based on word2vec embeddings from Google News [18] were fed into a one-dimensional deep convolutional neural network (CNN) applied to the title and the body text.
News articles can be collected from different online sources, such as news agency homepages, search engines, and social networks. However, manually checking the authenticity of a news article requires annotators with domain expertise who conduct a careful analysis of claims. For fake news detection, datasets are available in English and Spanish. The English datasets include BuzzFeedNews [2], BS Detector, Liar [19], CREDBANK [20], and FakeNewsNet [21]. Likewise, the Fake News Corpus Spanish [13] is annotated for fake news detection in Spanish. A corpus of social network news feed paraphrases exists for the Russian language, yet it is not annotated for content authenticity [22]. However, to the best of our knowledge, no such resource is available for the Urdu language despite the tremendous advancement in research work on this language.
The EMILLE (Enabling Minority Language Engineering) project was the first initiative to assemble a 67-million-word corpus of South Asian languages [23]. The Urdu corpus collected within this project contained approximately 0.5M spoken Urdu words transcribed from transmissions of BBC Asian Network and BBC Radio. Subsequently, researchers started to build resources for this resource-poor language, such as the Urdu corpus for word sense disambiguation [24], the Urdu POS-tagged corpus [25], and the initiative of a phonetically rich Urdu corpus for speech recognition [26]. However, these resources do not have annotation suitable for fake news detection.
The data: the first corpus of fake news in Urdu
In this section, we provide an overview of the data acquisition process as well as the corpus statistics. We assembled real news by crawling thousands of news articles from numerous reliable sources for the time frame from January 2018 to December 2018.
This corpus contains news from five domains: (i) Business, (ii) Health, (iii) Showbiz (entertainment), (iv) Sports, and (v) Technology. This selection of topics is in line with a similar dataset for the English language [7], except for the education domain, for which news articles were difficult to obtain.
Previous studies [2, 27] have focused almost exclusively on providing a detailed analysis of the procedures by which the two types of news (real and fake) are collected. They also discussed serious issues associated with fake news corpora. Moreover, some news corpora contain news articles that combine real and fake information. As far as we know, no previous research has laid out rigorous criteria for the definition and categorization of fake news. Although researchers have examined different types of news in creating a corpus, some questions regarding the exact procedure by which they annotated the news pieces remain to be addressed. With this in mind, we introduced an alternative approach to data collection to address this limitation in fake news annotation and applied it to the Urdu language.
This “Bend The Truth” corpus is a unique, reasonably accurate, and reliable resource of its kind in the Urdu language for this particular task. Urdu is the national language of Pakistan. The corpus has binary annotation. Apart from its language, the uniqueness of this corpus is that we engaged professional journalists to write fake news stories corresponding to the original real news, just as happens in real life. The news agencies crawled for real news are listed in Table 1. The “Bend The Truth” Urdu corpus is publicly available for academic research.
Legitimate websites
The Newspaper library for Python was used as a web scraper to extract the content of news articles from newspaper web pages. This library offers advanced features for dealing with web pages of newspapers and magazines to extract news articles. This capability was essential not only for obtaining the relevant text of Urdu news articles by stripping extraneous HTML tags but also for eliminating Urdu text that did not belong to the body of the news text (e.g., the name of the author or the location). Although the HTML structure of each news source (website) is different, this scraper did an exceptionally good job of dealing with noisy text, images, and advertisements. To evaluate the performance of our method, we need a balanced corpus, which is why we need real news in addition to fake news.
Real news collection
The real news articles were collected from different mainstream news websites. The data was collected and annotated manually. A news piece was labeled as real if it fell into one of the following categories:
(i) It was published by a reliable newspaper and a prominent news agency.
(ii) The same news was found in different newspapers, which provided evidence of the authenticity of the news, such as an image, the date, or the place of the event.
(iii) The source of the news is mentioned and that source is reliable. In this case, we verified the news source and cross-referenced the information among several sources.
(iv) There is a correlation between the title and the contents of the news article. To verify this correlation, we had to read all the news articles.
The length of the news pieces in this collection is heterogeneous because each news agency has a different style of news articles. Using this methodology, we collected 100 news articles in each of the five domains, for a total of 500 real news articles.
Professional crowdsourcing of fake news
The collection of fake news for the corresponding real news was a challenging task because it demanded a tremendous amount of work to evaluate fake news. Firstly, there are no websites that offer news validation services for the Urdu language. Consequently, the web scraping approach was out of consideration, as it would require manual analysis of hundreds of thousands of news articles for authenticity. Therefore, generating fake versions of the corresponding real news was the alternative we chose. For writing fake news, we relied on professional journalists from various news agencies in Pakistan: Express news, Dawn news, etc. Using the services of professional journalists ensured the quality of the fake news articles and realistically imitated the process by which fake news is created in real life.
As our dataset covers news articles in five domains (business, health, showbiz, sports, and technology), the news articles differ from a linguistic point of view. Thus, we assigned each article to journalists who were experts in the corresponding domain.
We provided the journalists with very open-ended instructions to avoid unintentionally introducing any clearly defined patterns that would make the produced news pieces easily distinguishable from the real news. Journalists were asked to keep the same length of the news as the original. For this task, we largely relied on the journalists’ expertise.
Problems in collecting real and fake news
During the real news collection, we encountered several problems, e.g., typing mistakes or word misuse (see the concrete examples below). To avoid such errors, it was necessary to re-read the whole news corpus and remove such faults.
(i) The word “…”, which means “with”, is sometimes spelled as “…” by some Indian newspapers.
(ii) Urdu has compound words (i.e., consisting of several tokens), e.g., “…” (“with safety”), “…” (“climate”), “…” (“forefathers”), and “…” (“modernization”). Such compound words are split into two or three tokens by standard tokenizers, although each of them is actually a single word, so they need to be tokenized very carefully. However, in our experiments, we did not perform any additional tokenization step for compound words and used the default splitting.
(iii) Some Indian newspapers misreport artists’ names. The newspaper mentioned “…” instead of “…” (transliterated in English as “Katrina”).
(iv) Some Indian newspapers report grammatical gender (masculine and feminine) differently than Pakistani newspapers. An example is “…” (in English, “occasion” or “event”): according to the Indian newspaper, “event” is masculine, whereas Urdu newspapers report “event” as feminine.
(v) Some newspapers use Roman numerals such as “…”.
(vi) Some newspapers had typing mistakes such as “…” instead of “…” (“October”).
(vii) Some Hindi sports newspapers write “Matches” as “…” and “…” instead of “…”. Additionally, “Test Series” is written as “…” instead of “…”.
(viii) In health news, there were some mistakes that completely change the meaning of the sentence. For example, one health news article stated “…”, which means “stupid”, instead of “…”, which means “ankle”.
The journalists were asked to read a full news article before writing a fake version of it, which required substantial effort and time to write each fake news.
Data pre-processing and data cleaning
We enhanced the quality of the text data after extraction with the scraper by performing additional data cleaning on the plain text of the news articles. We took the following steps:
(i) All auxiliary character sequences and tokens in the Latin alphabet, e.g., special characters such as descriptions of images in the news and references to images and videos, were discarded manually. However, we did not eliminate punctuation marks from Western Latin character sets.
(ii) Tokenization (splitting sentences into words/tokens) is performed on the white space character. Sentences with fewer than two tokens are not included.
(iii) Paragraphs are split into sentences on Urdu sentence end markers, e.g., the question mark (?) and the full stop (۔).
(iv) Numerals in the Eastern Arabic-Indic system were converted to Western Arabic numerals to normalize the entire data.
(v) Noise in the form of white space tokens, bullets, and smiley icons (emojis) is removed.
(vi) We use the standard UTF-8 encoding. Invalid UTF-8 characters were discarded.
(vii) The title of the news is also included in the corpus as a part of an article.
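The sentence splitting, tokenization, and numeral normalization steps described above can be sketched as follows. This is a minimal illustration rather than the authors' actual pipeline: the function names are hypothetical, and the exact set of sentence end markers (Urdu full stop U+06D4, Arabic question mark U+061F, and the ASCII question mark) is an assumption.

```python
import re

# Map Eastern Arabic-Indic digits (U+0660..U+0669) and the extended
# Arabic-Indic digits used in Urdu (U+06F0..U+06F9) to Western digits 0-9.
DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}
DIGIT_MAP.update({0x06F0 + i: str(i) for i in range(10)})

# Assumed sentence end markers: Urdu full stop, Arabic question mark, "?".
SENTENCE_END = re.compile("[\u06D4\u061F?]")


def normalize_digits(text):
    """Replace Eastern Arabic-Indic numerals with Western Arabic numerals."""
    return text.translate(DIGIT_MAP)


def split_sentences(text):
    """Split text into sentences, each returned as a list of tokens.

    Tokens are obtained by splitting on white space; sentences with
    fewer than two tokens are dropped, as described in the text.
    """
    sentences = []
    for chunk in SENTENCE_END.split(normalize_digits(text)):
        tokens = chunk.split()
        if len(tokens) >= 2:
            sentences.append(tokens)
    return sentences
```

Calling `str.split()` without arguments also removes empty white space tokens, which covers part of the noise removal step.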
Corpus statistics
Table 2 presents the distribution of the news articles collected from the five domains.
Urdu Corpus for Fake News distribution by topics
Further, we computed descriptive statistics of the corpus. All stop words and lemmas were taken into account. All tokens were lower-cased. We calculated the vocabulary size (i.e., the number of unique tokens) for each topic domain. Table 3 shows the vocabulary size of the training and testing subsets.
Vocabulary size of distributed corpus
The vocabulary overlap between real and fake news articles is shown in Table 4. The vocabulary overlap is 47.38% in the train set and 45.14% in the test set. It is calculated as the number of vocabulary items (words) present in both news classes (real and fake) divided by the size of the entire dictionary.
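The overlap measure can be sketched as below. We assume here that "the entire dictionary" means the union of the real and fake vocabularies and that tokens are obtained by white space splitting; the function names are illustrative only.

```python
def vocabulary(docs):
    """Return the set of unique white-space-separated tokens in docs."""
    return {token for doc in docs for token in doc.split()}


def vocabulary_overlap(real_docs, fake_docs):
    """Share of the joint vocabulary that occurs in both classes.

    Assumption: the denominator is the union of both vocabularies.
    """
    real_vocab = vocabulary(real_docs)
    fake_vocab = vocabulary(fake_docs)
    full_vocab = real_vocab | fake_vocab
    if not full_vocab:
        return 0.0
    return len(real_vocab & fake_vocab) / len(full_vocab)
```

For example, if the real articles use the words {a, b, c, d} and the fake ones use {c, d, e, f}, two of the six distinct words are shared and the overlap is 1/3.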
Vocabulary overlap within each category in the complete corpus
In this section, we describe a series of experiments on automatic fake news detection framed as a binary classification problem (real or fake). We explore various combinations of feature sets, look at different feature value weighting schemes (scalers), and try out a number of classifiers. This is done to find the best performing baseline classifier for the assembled dataset.
Dataset split
To prepare the data for the experiments, the corpus was split into train and test sets with a 70% and 30% ratio, respectively. In particular, all five domains were distributed proportionally, so that 70% of the news articles of each domain belong to the train set and the remaining 30% belong to the test set. Table 5 describes the corpus distribution for the training and testing sets.
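A proportional split of this kind can be sketched in plain Python as follows. The tuple layout `(domain, label, text)`, the function name, and the fixed random seed are assumptions made for illustration; the paper does not state how the articles within a domain were assigned to the two subsets.

```python
import random
from collections import defaultdict


def split_by_domain(articles, train_ratio=0.7, seed=0):
    """Split articles so each domain contributes train_ratio to train.

    articles: list of (domain, label, text) tuples (assumed layout).
    Returns (train, test) lists of the same tuples.
    """
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for article in articles:
        by_domain[article[0]].append(article)

    train, test = [], []
    for domain in sorted(by_domain):
        items = list(by_domain[domain])
        rng.shuffle(items)
        cut = round(len(items) * train_ratio)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test
```

With 100 real articles per domain, this yields 70 training and 30 test articles for each of the five domains.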
Domain Distribution in Train and Test subsets
Several sets of n-gram-based features, namely character n-grams, word n-grams, and function word n-grams (see below), with n varying from 1 to 6, were used to build the fake news detection models.
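The three feature types can be illustrated with the following sketch. It makes two assumptions that are not taken from the paper: function word n-grams are built over the sequence of function words that remains after all other tokens are removed, and the small `FUNCTION_WORDS` set is a hypothetical stand-in for the actual function word list.

```python
from collections import Counter

# Hypothetical mini-list of Urdu function words, for illustration only.
FUNCTION_WORDS = {"کا", "کی", "کے", "میں", "اور", "سے", "پر", "کو", "ہے"}


def char_ngrams(text, n):
    """Count character n-grams (white space included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(tokens, n):
    """Count word n-grams over a token list."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def function_word_ngrams(tokens, n, function_words=FUNCTION_WORDS):
    """Count n-grams over the function words only (assumed definition)."""
    kept = [token for token in tokens if token in function_words]
    return word_ngrams(kept, n)


def extract_features(text, c=0, w=0, f=0, function_words=FUNCTION_WORDS):
    """Combine feature types; a size of 0 switches a type off.

    extract_features(text, c=2, w=1) corresponds to a label such as
    2c-1w-0f: character 2-grams plus word 1-grams.
    """
    tokens = text.split()
    features = Counter()
    if c:
        features.update({("c", g): k for g, k in char_ngrams(text, c).items()})
    if w:
        features.update({("w", g): k for g, k in word_ngrams(tokens, w).items()})
    if f:
        grams = function_word_ngrams(tokens, f, function_words)
        features.update({("f", g): k for g, k in grams.items()})
    return features
```

Prefixing each key with its type keeps, for instance, a character 2-gram and a word 1-gram with the same surface form as separate features.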
Feature combinations
Combinations of different n-gram sizes are important for recognizing fake news. The minimum, average, and maximum numbers of features used in our experiments are 18, 4,079, and 41,125, respectively.
Weighting schemes
A number of approaches are available for calculating the values of n-gram features and scaling them across features. We refer to them as “weighting schemes”. We also used frequency distributions, as they have been used to understand the lexical structure of a text. We investigated the following weighting schemes: binary values, raw frequency, relative frequency, normalized frequency, log-entropy weighting, and TF-IDF.
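Some of these weighting schemes can be sketched as follows, taking raw n-gram counts per document as input. The formulas are common textbook variants (TF-IDF with idf = log(N / df), and log-entropy with local weight log(1 + tf)); the paper does not specify which variants were used, so they should be read as assumptions. Normalized frequency is omitted because its definition is not given.

```python
import math
from collections import Counter, defaultdict


def binary(counts):
    """1.0 for every n-gram present in the document."""
    return {term: 1.0 for term in counts}


def relative_frequency(counts):
    """Raw count divided by the total count in the document."""
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()} if total else {}


def tf_idf(docs):
    """docs: list of Counter objects. Uses idf = log(N / df)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in doc)
    return [{term: c * math.log(n_docs / df[term]) for term, c in doc.items()}
            for doc in docs]


def log_entropy(docs):
    """Log-entropy weighting; requires at least two documents."""
    n_docs = len(docs)
    global_freq = Counter()
    for doc in docs:
        global_freq.update(doc)
    entropy = defaultdict(float)  # sum over documents of p * log(p)
    for doc in docs:
        for term, c in doc.items():
            p = c / global_freq[term]
            entropy[term] += p * math.log(p)
    global_weight = {term: 1.0 + entropy[term] / math.log(n_docs)
                     for term in global_freq}
    return [{term: math.log(1 + c) * global_weight[term]
             for term, c in doc.items()} for doc in docs]
```

Under these definitions, an n-gram that occurs in every document gets a TF-IDF weight of zero, while an n-gram concentrated in a single document keeps its full log-entropy local weight.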
Classifiers
We considered a number of machine learning classifiers to find the best performing classifier for the fake news identification task on our corpus. These classifiers include Multinomial Naive Bayes (MNB), Bernoulli Naive Bayes (BNB), Support Vector Machines (SVM), Logistic Regression (LR), Random Forests (RF), Decision Tree (DT), and AdaBoost (AB). These classifiers have been used in various NLP tasks and obtained state-of-the-art performance in tasks such as opinion mining [32], authorship attribution [33], and sentiment analysis [34]. We used the Scikit-learn [35] implementations of the above-mentioned classifiers with their default parameters.
Experiments
In this subsection, we describe our approach to generating the experiments. We run an experiment for each feature set, each weighting scheme, and each classifier. The total number of experiments is 2,880, using different representations of text features.
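Such an experiment grid can be enumerated as in the sketch below. The three feature set labels are only an illustrative subset that follows the labeling scheme used in the result analysis (e.g., 2c-1w-0f for character 2-grams, word 1-grams, and no function word n-grams); the full list of feature sets is not reproduced here, so this sketch does not recover the total of 2,880 experiments.

```python
from itertools import product

# Illustrative subset of feature set labels (not the full list).
FEATURE_SETS = ["2c-0w-0f", "2c-1w-0f", "2c-0w-1f"]
WEIGHTINGS = ["binary", "raw", "relative", "normalized", "log-entropy", "tf-idf"]
CLASSIFIERS = ["MNB", "BNB", "SVM", "LR", "RF", "DT", "AB"]


def parse_feature_set(label):
    """Turn a label such as '2c-1w-0f' into {'c': 2, 'w': 1, 'f': 0}."""
    return {part[-1]: int(part[:-1]) for part in label.split("-")}


def experiment_grid(feature_sets, weightings, classifiers):
    """Yield one configuration per (feature set, weighting, classifier)."""
    for fs, ws, clf in product(feature_sets, weightings, classifiers):
        yield {"features": parse_feature_set(fs),
               "weighting": ws,
               "classifier": clf}
```

Each yielded configuration would then be trained and evaluated independently, which keeps the results directly comparable across feature sets, weighting schemes, and classifiers.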
Metrics and evaluation
For evaluations, we used the following performance metrics: balanced accuracy, F1Real score, F1Fake score, and ROC-AUC. Balanced accuracy is used as we want to label both fake and real news correctly.
A trivial classification baseline is established by assigning all the news articles in the test subset to one of the classes. Since the dataset contains more real news articles, which also reflects the real-world case, we assigned the label “real” to all instances in the test subset. This trivial all-truthful assignment provides a baseline accuracy score of 0.55 as the reference value. We perform 10-fold cross-validation for each experiment on the train subset and run each experiment once on the test subset. No parameter fine-tuning is performed.
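The metrics and the trivial baseline can be illustrated with a short sketch. The class sizes of 150 real and 120 fake test articles are an assumption derived from applying the 30% test ratio to the 500 real and 400 fake articles; the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label is correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of instances of `label` that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant) if relevant else 0.0


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    labels = sorted(set(y_true))
    return sum(recall(y_true, y_pred, lab) for lab in labels) / len(labels)


def f1(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Trivial baseline: predict "real" for every test article.
y_true = ["real"] * 150 + ["fake"] * 120   # assumed test subset sizes
y_pred = ["real"] * len(y_true)
baseline_accuracy = accuracy(y_true, y_pred)            # 150 / 270
baseline_balanced = balanced_accuracy(y_true, y_pred)   # 0.5
```

Under this assumption the all-real baseline has an accuracy of 150/270 (about 0.556), a balanced accuracy of 0.5, and an F1Fake score of 0, which shows why plain accuracy alone would overstate the quality of a classifier that ignores the fake class.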
Result analysis
This section presents the analysis of the experimental results and provides recommendations for the set of features, weighting schemes, and classifiers based on the best performing combinations of those.
We present the results as top 10 best performing experimental combinations by F1Real score in Table 6, top 10 best performing experimental combinations by F1Fake score in Table 7, and the experiments that are best performing by both F1Fake score and F1Real score, i.e., those experimental combinations that achieve highest performance both in detecting fake news as fake and real as real, in Table 8.
Top 10 classification results by F1Real score
Top 10 classification results by F1Fake score
Top classification results by both F1Real and F1Fake scores
As can be seen in the tables, all best performing experiments achieved performance well above the trivial single-class baseline of 0.55, which indicates that the task of fake news detection can be effectively addressed using n-gram features. The maximum F1Fake score was achieved by the AdaBoost classifier (often described as boosted decision trees), which outperformed the other classifiers with the particular combination of character 2-grams and word 1-grams (i.e., 2c-1w-0f), reaching an F1Fake score of 0.87 on the test set.
Figure 1 illustrates ROC-curves for the experimental combinations from Table 8.

ROC curves for best performing experimental combinations.
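The area under a ROC curve can be computed directly from classifier scores through its rank interpretation: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is a minimal illustration; treating "fake" as the positive class is an assumption.

```python
def roc_auc(y_true, scores, positive="fake"):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise form is quadratic in the number of instances, which is acceptable for a test subset of a few hundred articles.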
Additionally, both Tables 6 and 7 contain feature sets that have only one type of n-gram (e.g., 2c-0w-0f, which includes only character bi-grams), two types (e.g., 2c-1w-0f and 2c-0w-1f), and all three n-gram types jointly. In addition, while character n-grams can lead to top results individually, there is no feature combination in these tables from which they are absent. Hence, we conclude that character n-grams are the most descriptive features for fake news detection in Urdu. Further, as the tables show, this text representation can be enhanced by word n-grams, function word n-grams, or both. We also noticed during the experiments that the classifiers without function words did not give better results than the experiments using stop-words. Therefore, stop-words played an important role in our experiments.

Weighting scheme performance in terms of F1-score distribution. Y-axes show experiment counts.
From these distributions as well as from the best performing experimental combinations presented in Tables 6 and 7, we conclude that binary and normalized frequency weighting schemes are most useful in the majority of cases. However, as it can be seen from those tables, other weighting schemes are involved in highly performing runs for some feature combinations.

Classifiers Performance.
Bayes-based classifiers showed inferior results to AdaBoost. We have not performed a detailed analysis of the feature distribution, which is left for future work and will be addressed in an upcoming paper. To our understanding, both of these classifiers are suitable for discrete data such as counts, and we ran experiments on both binary features and counts to provide the best settings for each of them. However, they still showed inferior results compared to AdaBoost in most of the experiments; explaining this is beyond the scope of this paper.
To gain further insights into the classes that are associated with fake and real content, we evaluate which classes show significant differences between the two groups of news.
Reflecting on the experiments leads to the following conclusions:
(i) Combinations of different n-gram types obtain better results than a single n-gram type.
(ii) Combinations of different feature sets with n-gram sizes 1 to 3 achieved significantly better results than other feature sets. The n-gram sizes 1, 2, and 3 achieve significantly better results than 4, 5, and 6. However, in other languages, higher-order n-grams performed well in detecting writing style [13].
(iii) Feature sets with n-gram sizes 4 to 6 decreased the performance of all the classifiers, whether used in combination or separately. This might be due to the limited dataset size.
(iv) The AdaBoost algorithm is 87% accurate at detecting whether a news article is fake or not when combining character bi-grams and function word uni-grams.
We conclude that the automatic detection of fake news is a promising area of research. In this work, we presented a new resource for resource-poor languages in the form of a dataset in the Urdu language and built a model that can predict the likelihood that a given news article is fake. To our knowledge, this is the first corpus in the Urdu language for fake news detection, extracted from the Internet and manually annotated as real or fake news.
This is an essential contribution to the development of Urdu corpora. Importantly, we provide statistics of the complete corpus, shedding light on vocabulary size, vocabulary overlap, and significant findings of the analysis. The results were obtained by applying machine learning classifiers to lexical features, namely BOW and n-grams (with n varying from 1 to 6), and their combinations. Overall, the promising results demonstrate the broad implications of the present research. On this basis, the main conclusion that can be drawn is that the new Urdu corpus is a particularly fruitful and reliable resource for the further development of fake news detection models.
In the future, we intend to explore whether the system can be adapted to other languages (it was trained exclusively on Urdu) and whether it can be trained to detect region-specific biases. We will also investigate new features to flag fake news.
