Abstract
We present a new resource for analyzing and detecting deceptive information, which is present on a large number of news websites. Specifically, we compiled a corpus of news in the Spanish language extracted from several websites. The corpus is annotated with two labels (real and fake) for automatic fake news detection. Furthermore, the corpus also provides the category of each news item, and we present a detailed analysis of the vocabulary overlap among categories. Finally, we present a style-based fake news detection method. The obtained results show that the introduced corpus is an interesting resource for future research in this area.
Introduction
The dissemination of information on social networks can be defined as a process in which news, events, and opinions are published, received and re-sent by users. The information disseminated in social networks follows a route, from one user to another and from one site to another, which makes the information traceable. However, verifying it can be difficult, because the volume of existing information currently exceeds the human capacity to process and understand it.
Fake news provides information that aims to manipulate people for different purposes [1]. In social networks, misinformation spreads within seconds among thousands of people, so it is necessary to develop tools that help control the amount of false information on the web. Similar tasks are the detection of popularity in social networks [2] and the detection of subjectivity of messages in these media [3].
A fake news detection system aims to help users detect and filter out potentially deceptive news. The prediction of intentionally misleading news is based on the analysis of truthful and fraudulent previously reviewed news, i.e., annotated corpora. In [4] the authors divide the approaches to fake news detection into three categories: knowledge-based (by relating to known facts), context-based (by analyzing news spread in social media), and style-based (by analyzing writing style). In all cases, the approaches require annotated corpora, which implies an additional challenge given that labeled fake news corpora are scarce and the few available resources are in the English language.
The main contributions of this research work can be summarized as follows: 1) the development of the first Spanish corpus consisting of fake and real news extracted from news websites, a new resource for investigating and analyzing different aspects of style-based fake news detection; 2) the annotation procedure for classifying real and fake news; 3) statistics of the corpus, including the vocabulary overlap across the different news topics and classes (real vs. fake); 4) experiments on automatic fake news detection using supervised learning on linguistically motivated features.
The rest of the paper is organized as follows. Section 2 provides a review of related work. Section 3 presents the methodology of the corpus construction, the annotation steps, the statistics of the corpus, and a comparison with other fake news resources. Section 4 describes a baseline approach for fake news detection and the obtained results. Finally, Section 5 draws the conclusions and points to possible directions of future work.
Related work
Although the publication and dissemination of fake news is not a new issue, nowadays its propagation is amplified by social media platforms. Spreading fake news on social media is easy and fast, so identifying it is a process with a certain degree of complexity. Therefore, the detection of fake news has attracted a lot of attention in recent years.
Shu et al. [1] present an overview of research on the detection of fake news in social networks, focusing on psychology, social theories, and algorithms. Two aspects of the problem are reviewed: characterization and detection. Their fake news detection approach is divided into two stages: (i) feature extraction and (ii) model construction. The feature extraction stage aims to represent news content and related information in a formal mathematical structure, and the model construction stage builds a machine learning model to differentiate between fake and real news.
Other works tackle the problem by analyzing the source of origin. For example, Nazer et al. [5] introduced a methodology for detecting fake news related to natural disasters on social networks, specifically Twitter, by analyzing the characteristics of the language of the tweets as well as their variations throughout the period of time in which the disaster takes place. This work analyzes the problem of news streaming, because at the beginning of disaster situations the bombardment of information is such that it is difficult to keep track of all publications to determine their veracity. It also considers the dissemination of unwanted content such as spam, rumors and generic opinions, and the use of bots, which has become a daily matter. Bots are capable of spreading large amounts of information in a short period of time.
Three characteristics of fake news are generally used: the text, the responses of the users to such news, and the users who spread the news. In [6], the authors introduced a model that integrates these characteristics for more accurate prediction. The model uses the text and user responses to train a Recurrent Neural Network that captures the temporal pattern of user activity on a given article. Then, the behavior of the users who spread the news is learned and combined with the other characteristics in order to decide whether the news is real or fake.
The Fake News Challenge project (FNC-1) proposes to break the problem down into stages. A first step would be to understand what several news organizations are writing about the topic, through a Stance Detection task. This task seeks to estimate the stance of the text in relation to the title of the news: the text can agree with, disagree with, discuss, or be unrelated to the title. The goal of this stage is to develop tools to organize the news so that people can later analyze them and quickly identify fake news [7, 8].
Another interesting dataset [9] considers six class labels: pants-fire, false, barely true, half-true, mostly-true, and true. The data for this corpus are taken from news manually labeled by Politifact. The author evaluated four classification methods: regularized logistic regression (LR), a support vector machine classifier (SVM), a bi-directional long short-term memory network model (Bi-LSTM), and a convolutional neural network model (CNN) (to integrate text and metadata). The hybrid CNN model outperformed all other models, resulting in a precision of 0.270 on the test set.
As we can see, a considerable number of annotated corpora for fake news detection can be found in English. However, to the best of our knowledge, there is still no such resource for Spanish, and a machine-learning-based approach requires annotated corpora [10]. Considerable progress has been made on resources for the Spanish language in general: for example, the Spanish corpus manager [11] consists of a large database of general-purpose corpora in Spanish. Other initiatives are the Sociolinguistic Corpus of WhatsApp [12] and the collection of Spanish language proverbs [13]. Although there are already tools that focus on the detection of fake news in the Spanish language, most of them carry out the verification process manually, that is, through exhaustive investigations by the work team and users. This implies that the identification of deceptive content is subjective and delayed. An example of this type of platform is the VerificadoMX site, which focuses on publications related to the political sphere.
The Spanish fake news corpus
The Spanish Fake News Corpus contains a collection of news compiled from several resources on the Web: established newspaper websites, media company websites, special websites dedicated to validating fake news, and websites designated by different journalists as sites that regularly publish fake news.
The news items were collected from January to July of 2018 and all of them were written in Spanish. The resource is freely available at https://github.com/jpposadas/FakeNewsCorpusSpanish.
In contrast to other works [4, 9], where a more detailed classification of the news is used, the presented corpus was tagged considering only two classes (true or fake). Although in some cases the news is not completely true or fake, there is no convention on how to classify such news, and some of the categories proposed in previous works are not clear enough.
The annotation process was performed manually, and the following aspects were considered: 1) a news item was tagged as true if there was evidence that it had been published on reliable sites, i.e., established newspaper websites or the websites of renowned journalists; 2) a news item was tagged as fake if news from reliable sites or from websites specialized in detecting deceptive content (for example, VerificadoMX) contradicted it, or if no evidence about it was found other than its source; 3) the correlation between news items was kept by collecting the true-fake news pair of an event; 4) we tried to find the source of the news.
Table 1 presents a list of the websites considered reliable. Animal Político and Aristegui Noticias are news websites managed by prestigious journalists who frequently appear in media such as newspapers, TV or radio. Proceso is a magazine focused on political and social themes that offers news on its site, and MVS Noticias corresponds to a company that broadcasts news on TV and radio. The rest of the sites mentioned in Table 1 correspond to the digital versions of established newspapers.
Reliable sites
According to websites specialized in unmasking fake news, certain websites that systematically publish fake news have been detected. These deceptive websites contain news that is not entirely true (completely fake, a mixture of true and fake, satire, humor, among others). Deceptive websites usually combine true news with fake news to confuse users. The site VerificadoMX mentions some of these websites.
Websites that offer a news validation service have appeared due to the rapid propagation of fake news. These websites detect and unmask some of the fake news that circulates on the Web or social media, in order to prevent misinformation among users. On most validation websites, the validation is performed manually by a journalist. Some validation websites were used in the compilation of the proposed corpus; their names and descriptions are below.
The fake news was collected from the validation sites and the deceiving sites and tagged manually. In the case of the validation sites, the following steps were performed to gather a real-fake news pair: 1) Select a news item from the validation site. 2) Look for the link to the source. If the link is missing, discard the news item and go back to Step 1. 3) Verify the link to the source. If the link is broken, discard the news item and go back to Step 1; otherwise, download and save the news item as fake. 4) Identify the keywords of the news by answering the questions What?, Who?, How?, When? and Where?. 5) Use the Google search service to look for the real news counterpart of the fake news by typing its headline or keywords. 6) From the results of the search performed in the previous step, select the news item that originates from a reliable site (see Table 1) and best matches the headline and keywords. 7) Download and save the news item selected in the previous step as true news.
Note that in
For deceiving sites, the procedure to extract fake news differs from the previous one because these sites combine true news with fake news, so the fake news must be identified first. The steps to gather a real-fake news pair from deceiving websites are as follows: 1) Select a piece of news from a deceiving website. 2) Identify the keywords of the news by answering the questions What?, Who?, How?, When? and Where?. 3) Use the Google, Yahoo and DuckDuckGo search services to look for news by typing its headline or keywords. 4) From the results of the previous step, compare the keywords of the links in the first three pages of results with the news selected in Step 1. 5) Identify the keywords of the fake news recently found. 6) Use the Google search service to look for the real news counterpart by typing its headline or keywords. 7) From the results of the search performed in the previous step, try to find the news item that originates from a reliable site (see Table 1) and best matches the headline and keywords. If it exists, that news item is considered true.
The main drawback we faced for the compilation of the corpus was the search for the real-fake news pair of an event. In some cases, it was difficult to track real news on the Web that complemented the false news.
Corpus normalization
To prevent some elements of the news (for example, numbers, emails and author names, among others) from acting as markers for the classes, we normalized the corpus.
The normalization process eliminates elements that are common in the structure of news and can be used as markers. These elements are the name of the author or the name of the editor, dates, and any footer or header that references the source website.
We use the standard UTF-8 encoding and keep the punctuation marks. Special characters were eliminated, and references to photos or videos were not included in the corpus.
The normalization process also looks for the following elements and masks them with a common identifier: 1) numbers that represent quantities, schedules or prices were masked using the NUMBER tag; 2) email addresses of authors or editors were masked using the EMAIL tag; 3) URLs of references were masked using the URL tag; 4) telephone numbers were masked using the PHONE tag; 5) dollar and euro symbols were masked using the DOL and EUR tags, respectively.
The title of the news is detected and saved as a separate element. The URL of the source is also stored as an element separate from the text of the news.
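The masking rules above can be illustrated with a minimal sketch. The regular expressions, their order, the `MASKS` table and the `normalize` function are assumptions made for illustration; the exact rules used to build the corpus are not given in the text.

```python
import re

# Hypothetical masking rules (not the authors' exact patterns).
# Order matters: URLs, e-mails and phone numbers are masked before bare
# numbers so that the digits they contain are not replaced first.
MASKS = [
    (re.compile(r"https?://\S+"), "URL"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "EMAIL"),
    (re.compile(r"\b\d{2,3}[ -]\d{3,4}[ -]\d{4}\b"), "PHONE"),
    (re.compile(r"\$"), " DOL "),
    (re.compile(r"€"), " EUR "),
    (re.compile(r"\d+(?:[.,:]\d+)*"), "NUMBER"),
]


def normalize(text):
    """Replace each matched element with its tag and collapse whitespace."""
    for pattern, tag in MASKS:
        text = pattern.sub(tag, text)
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "Escriba a juan.perez@ejemplo.mx o visite https://ejemplo.mx/nota1 "
    "antes de las 18:30. Cuesta $1,500 o €20; llame al 55-1234-5678."
)
normalized = normalize(sample)
```

Punctuation is left untouched, in line with the description above; removing author names, dates, headers and footers would need source-specific rules that are not sketched here.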
Corpus statistics
The corpus covers news from 9 different topics: Science, Sport, Economy, Education, Entertainment, Politics, Health, Security, and Society. Table 2 shows the distribution of the collected news along the different categories.
Spanish Fake News Corpus topic distribution
To identify the vocabulary overlap between true and fake news, we considered the lemmas and discarded the stop words. The overlap was calculated by dividing the size of the intersection of the true and fake vocabularies by the size of their union (the joint vocabulary). The general vocabulary overlap between real and fake news is 27.68%. Table 3 presents the vocabulary sizes for true and fake news along with the vocabulary overlap within each category.
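As an illustration of this measure (intersection over union, i.e., the Jaccard index), the following sketch computes the overlap of two toy vocabularies. The function name and the example lemmas are hypothetical, and lemmatization and stop-word removal are assumed to have been done beforehand.

```python
def vocabulary_overlap(vocab_a, vocab_b):
    """Size of the shared vocabulary divided by the size of the joint vocabulary."""
    a, b = set(vocab_a), set(vocab_b)
    union = a | b
    if not union:
        return 0.0  # two empty vocabularies: define the overlap as zero
    return len(a & b) / len(union)


# Toy lemma sets, not taken from the corpus.
true_vocab = {"gobierno", "anunciar", "vacuna", "nuevo"}
fake_vocab = {"gobierno", "ocultar", "vacuna", "secreto"}

overlap = vocabulary_overlap(true_vocab, fake_vocab)  # 2 shared lemmas out of 6
```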
Vocabulary overlap within each category
The general vocabulary overlap between categories is presented in Table 4.
Vocabulary overlap within each category
For some pairs of categories, the vocabulary overlap was expected because the categories have topics in common. Cases like the pair of Politics and Entertainment, where it was not expected to have a high overlap, can be explained as a temporary phenomenon caused by the desire of entertainment personalities to participate in political positions for electoral processes in some countries. Other cases can be explained by the methodology of mixing the content of news from different categories used by sites dedicated to spreading fake news.
The corpus was split into train and test sets, using 70% of the corpus for training and the rest for testing. We performed a stratified split of the corpus, i.e., every category keeps the 70%-30% ratio. This distribution is described in Table 5.
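A split that keeps the 70%-30% ratio inside every category can be sketched as follows. This is an illustrative standard-library version, not the authors' script; the function name, the fixed seed and the toy items are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(items, key, train_ratio=0.7, seed=0):
    """Split items so that each group defined by `key` keeps the train ratio."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    train, test = [], []
    for label in sorted(groups):
        members = groups[label][:]
        rng.shuffle(members)  # random assignment within the group
        cut = round(len(members) * train_ratio)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Toy items of the form (category, document id); the counts are hypothetical.
items = [("Politics", i) for i in range(20)] + [("Health", i) for i in range(10)]
train, test = stratified_split(items, key=lambda item: item[0])
```

Passing a key such as `lambda item: (category, label)` would additionally preserve the class balance inside each category, if that is desired.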
Corpus distribution
Fake news detection
We followed a machine learning approach for automatically identifying fake news. We evaluated three feature representations. The first is the standard bag-of-words model, a simple baseline for evaluating whether there is a specific selection of words that can help us identify fake news. The other two representations are character n-grams and POS tag n-grams, both of which have proved to be helpful for representing the writing style of authors [14–16]. We evaluated the performance of word and character n-grams when including and excluding stop words. A standard preprocessing of the corpus was performed by eliminating the stop words and punctuation marks using the spaCy tool. The performance of each of the feature sets was evaluated separately and in combination.
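To make the two style-oriented representations concrete, the sketch below counts character n-grams of a text and n-grams over a sequence of POS tags. The helper names are hypothetical, and the POS tags are assumed to have been produced beforehand by a tagger such as spaCy.

```python
from collections import Counter


def ngrams(sequence, n):
    """All contiguous subsequences of length n, as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def char_ngram_counts(text, n_values=(3, 4, 5)):
    """Count character n-grams of every size in n_values."""
    counts = Counter()
    for n in n_values:
        counts.update("".join(gram) for gram in ngrams(text, n))
    return counts


def pos_ngram_counts(pos_tags, n_values=(3, 4, 5)):
    """Count n-grams over a list of POS tags (one tag per token)."""
    counts = Counter()
    for n in n_values:
        counts.update(" ".join(gram) for gram in ngrams(pos_tags, n))
    return counts


char_counts = char_ngram_counts("noticia", n_values=(3,))
pos_counts = pos_ngram_counts(["DET", "NOUN", "VERB", "DET", "NOUN"], n_values=(3,))
```

Each counter can then be turned into a feature vector over a fixed n-gram vocabulary.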
We trained a classifier to generate a model that can distinguish between real and fake news. We experimented with four machine learning classifiers: support vector machine (SVM) with a linear kernel, logistic regression (LR), random forest (RF), and boosting (BO). All of them are widely used in many natural language problems and have achieved state-of-the-art results in tasks such as authorship attribution [17], sentiment analysis [18], opinion mining [19], plagiarism detection [20] and author profiling [21], among others.
We conducted experiments to identify real and fake news in the corpus regardless of the category of the news (binary scenario). We trained the classifiers using their scikit-learn [22] implementations on the previously mentioned feature sets: bag of words (BOW), POS tags, and n-gram features (with n varying from 3 to 5).
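A minimal sketch of such an experiment with scikit-learn is given below. It is not the authors' configuration: the use of raw counts, gradient boosting as the boosting variant, all hyperparameters, and the toy Spanish sentences are assumptions chosen only to make the example self-contained.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_models(seed=0):
    """One character n-gram pipeline per classifier (settings are assumptions)."""
    classifiers = {
        "SVM": LinearSVC(random_state=seed),
        "LR": LogisticRegression(max_iter=1000),
        "RF": RandomForestClassifier(n_estimators=50, random_state=seed),
        "BO": GradientBoostingClassifier(n_estimators=20, random_state=seed),
    }
    return {
        name: make_pipeline(
            # Character n-grams of sizes 3 to 5, stop words kept.
            CountVectorizer(analyzer="char", ngram_range=(3, 5), lowercase=False),
            clf,
        )
        for name, clf in classifiers.items()
    }


def evaluate(models, train_texts, train_labels, test_texts, test_labels):
    """Fit every model on the train split and return its test accuracy."""
    scores = {}
    for name, model in models.items():
        model.fit(train_texts, train_labels)
        scores[name] = accuracy_score(test_labels, model.predict(test_texts))
    return scores


# Toy data for illustration only; these sentences are not from the corpus.
train_texts = [
    "El gobierno anunció un nuevo programa de vacunación.",
    "La universidad publicó los resultados del estudio.",
    "El banco central mantuvo la tasa de interés.",
    "¡Increíble! Descubren que el agua cura todas las enfermedades.",
    "¡Urgente! Cancelan las elecciones por orden secreta.",
    "¡No lo vas a creer! Famoso actor será presidente mañana.",
]
train_labels = ["true", "true", "true", "fake", "fake", "fake"]
test_texts = [
    "El ministerio presentó el informe anual.",
    "¡Alerta! Prohíben el uso de teléfonos en todo el país.",
]
test_labels = ["true", "fake"]

scores = evaluate(build_models(), train_texts, train_labels, test_texts, test_labels)
```

The accuracies obtained on such a tiny toy set carry no meaning; the sketch only shows how the feature extraction and the four classifiers fit together.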
As a baseline for this corpus, we used the strategy of assigning all the news in the test corpus to a single class, which is a reasonable reference because the train corpus is balanced between the real and fake classes. We chose the real class since more news in the test corpus belong to that class.
Table 6 presents the accuracy obtained on the test set when we trained the classifiers on individual feature sets such as BOW and POS. We also evaluated the performance of the classifiers when combining those feature sets.
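The accuracy of such a single-class baseline is simply the share of the assigned class in the test set, as the following sketch shows. The function name and the label counts are hypothetical and do not reproduce the corpus figures.

```python
from collections import Counter


def single_class_baseline_accuracy(test_labels, assigned_class=None):
    """Accuracy obtained when every test item is assigned to one class.

    If no class is given, the most frequent class in the test labels is used.
    """
    if assigned_class is None:
        assigned_class = Counter(test_labels).most_common(1)[0][0]
    hits = sum(1 for label in test_labels if label == assigned_class)
    return hits / len(test_labels)


# Hypothetical test labels: 6 real ("true") and 4 fake items.
toy_labels = ["true"] * 6 + ["fake"] * 4
baseline = single_class_baseline_accuracy(toy_labels, assigned_class="true")
```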
Results on the test set in terms of accuracy (%)
Table 7 presents the evaluation of n-grams as a feature representation for detecting fake news. It shows the results of the different classifiers when trained on character and POS n-grams with sizes from 3 to 5.
Results of fake news detection on the test set in terms of accuracy (%) when classifiers are trained on n-gram feature sets
The best results on the test set were obtained with character 4-grams without removing the stop words with the Boosting algorithm. Note that the exclusion of stop words decreased the performance of the classifiers.
Table 8 presents the performance of the classifiers on the test set when they are trained on combinations of n-gram features. We first evaluated the combination of n-gram sizes and then the combination of n-gram types. The SVM algorithm achieved the best results when combining all types and sizes of n-grams.
Results of fake news detection on the test set in terms of accuracy (%) when classifiers are trained on combinations of n-gram sizes and types
Analyzing the results of the experiments, we can see that the models based on character n-grams achieved better results in general. Combining the n-gram representations of sizes 3 to 5 achieved lower results than using the individual feature sets.
The use of a machine learning approach improved on the accuracy of the proposed baseline for the corpus. Traditional techniques often used to solve natural language problems obtain good results on the proposed corpus.
Conclusions
The detection of fake news is an emerging research area that is gaining a lot of attention. The development of new resources such as annotated corpora can help to increase the performance of automatic methods aimed at detecting this kind of news. In this work, we presented the Spanish Fake News corpus, which to the best of our knowledge is the first Spanish corpus consisting of news extracted from the internet and labeled as real or fake.
We presented the development procedure, which can be used to further increase the size of the corpus. We showed the statistics of the complete corpus, describing the vocabulary size in the different news topics and the vocabulary overlap between real and fake news. We aimed at increasing the vocabulary overlap, thus ensuring that the classification algorithm is truly identifying fake news and not only thematic areas.
Concerning the fake news detection methodology, we trained well-known classification algorithms on the lexical features BOW, POS tags, n-grams (with n varying from 3 to 5), and n-gram combinations. The classification results show that it is possible to achieve very high accuracy, and that the corpus is a valuable resource for building fake news detection models.
