Abstract
We identify deceptive text by using different kinds of features: a continuous semantic space model based on latent Dirichlet allocation (LDA) topics, one-hot representation (OHR), syntactic information from syntactic n-grams (SN), and lexicon-based features using the Linguistic Inquiry and Word Count (LIWC) dictionary. Several combinations of these features were tested to assess the best source(s) for deceptive text identification. By selecting the appropriate features, we were able to obtain benchmark-level performance using a Naïve Bayes classifier. We tested on three available corpora: a corpus of 800 hotel reviews, a corpus of 600 opinions about controversial topics, and a corpus of 236 book reviews. We found that combining LDA features with OHR yielded the best results, obtaining accuracy above 80% in all tested datasets. In addition, unlike other reference works, this combination of features does not require language-specific resources (e.g., SN, LIWC). Finally, we present an analysis of which features lead to either deceptive or truthful texts, finding that certain words can play different roles (sometimes even opposing ones) depending on the task being evaluated.
Introduction
Customers’ online reviews of products, hotels, restaurants, and other services are becoming increasingly relevant to buyers’ decisions, as buyers want to know whether a purchase will meet their expectations. However, many of these reviews may have been written not by actual customers but by people trying to boost their own sales or to reduce those of their competitors.
Several studies have been conducted with the goal of detecting whether an opinion is truthful or deceptive. Using Natural Language Processing (NLP) techniques, text is preprocessed and turned into feature vectors that are classified by machine learning methods. Most works use text-processing tools that require language-specific information, e.g., syntactic n-grams (SN), or dictionaries such as the Linguistic Inquiry and Word Count (LIWC) dictionary. Our intuition is that merely distributional aspects of the text can identify deception, and that good performance can be achieved even without considering syntactic features. The aim of this work is to determine to what extent this is possible; thus, we propose using a continuous semantic space model obtained through a topic modeling algorithm called Latent Dirichlet Allocation (LDA). Furthermore, we compare its performance when merging its features with features obtained from other NLP sources (namely, a one-hot representation, syntactic n-grams, and a dictionary of psychological aspects of words). We present an analysis of our method applied to three different datasets for deception detection, with the purpose of giving a broad perspective on the efficiency of each method used.
This paper is organized as follows. Following this Introduction, Section 2 reviews approaches used in other works that represent the current standard for detecting deceptive text. In Section 3 we describe our proposed method, starting with a detailed description of each source of features (Section 3.1), followed by a description of the deception detection datasets (Section 3.2). Details about the experimental setup and a discussion of our results are provided in Section 4. Finally, we draw our conclusions in Section 5.
Related work
Several researchers have studied deception detection by using different sources of features. Some of these methods are based on bag-of-words (BoW) tools and others add syntactic information as features. In some cases, certain general deception cues are sought [3] such as the use of unique words, self-references, modifiers, among others.
Overall, linguistic style approaches (such as SN) analyze relations between words; in contrast, BoW approaches (LDA, OHR, LIWC, n-grams) disregard grammar and even word order but keep counting the number of instances of each word.
To detect deceptive texts, a commonly applied technique is to use n-grams. This method can extract features from a text based on different elements, for instance words, syllables, phonemes, or letters.
In the work of Donato et al. [11], word n-grams were compared with letter n-grams. The latter were shown to yield better performance on the OpSpam dataset. Although n-grams achieve acceptable results by themselves, they are generally complemented with other NLP techniques, because mixtures of features have been shown to improve results.
Another BoW approach consists of using the Linguistic Inquiry and Word Count (LIWC) dictionary, which includes a word classification and count tool. Newman et al. [16], for example, by analyzing LIWC’s word categories, found that liars use fewer self-references and more negative emotion words. This work laid the foundation for the LIWC tool to be widely used by other researchers [22, 24].
Hauch et al. [10] presented a meta-analysis of several research works on deceptive text identification. This meta-analysis focused on specific linguistic categories, for instance those contained in LIWC. Its findings suggest that liars use certain linguistic categories at a different rate than truth-tellers.
Mihalcea and Strapparava [14] used LIWC to discover dominant classes of deceptive texts. Essentially, the authors attempted to separate deceptive and truthful texts on controversial topics such as abortion, death penalty, and a best friend. In a similar study conducted by Pérez-Rosas and Mihalcea [20], the researchers attempted to find deceptive text on the same topics, this time taking samples from different language varieties: Mexican Spanish and English from the United States and India.
Following the work of Mihalcea and Strapparava, a study was also conducted to detect deceptive text in Spanish: Almela et al. [1] collected a new dataset with topics covering homosexual adoption, opinions on bullfighting, and feelings about a best friend. One hundred false and one hundred true documents were collected for each topic with an average of 80 words per document. Distinct LIWC dimensions were used to achieve a more accurate classification by means of a support vector machine (SVM).
Deception detection has been applied in various particular situations. In Williams et al. [25], lies told by children were compared with lies told by adults. The research was conducted with the aim of detecting deception in courts where children testify. The authors chose 48 children and 28 adults to generate a dataset; half of the children and adults told lies and half of them told the truth. The LIWC tool was then used to generate the samples for classification. Results showed significant differences between true and false texts, mainly in linguistic variables such as singular self-references (e.g., I, my, me), plural self-references (e.g., we, our, us), and positive and negative emotions. In addition, the linguistic variables were found in distinct ratios depending on whether the lie was told by a child or by an adult.
Studies developed through the use of BoW tools have been successful; however, in an effort to improve results, the context of the sentences has been taken into account, as when analyzing the syntactic relations of the words with the use of dependency trees [4, 26]. In general, the use of syntactic relations has not shown an outstanding performance in the task of classifying deceptive text. In contrast, complementing this method with a BoW approach can improve results.
In Mihalcea and Pérez-Rosas [21], features were collected using different approaches (e.g., Part of Speech (PoS), Context Free Grammars (CFG), unigrams, LIWC, and combinations of these). The authors predicted, with an accuracy between 60% and 70%, whether a person of feminine or masculine gender had written a deceptive text. In the results shown, the use of PoS and CFG did not show a significant improvement in the accuracy with regard to unigrams and LIWC. This would suggest that BoW approaches have a similar performance to linguistic style approaches.
Sometimes information beyond the words of a text or their syntactic structure is available, for example information coming from the source from which the text was extracted. In the work of Fornaciari and Poesio [5], false and true opinions were collected from the Amazon web site (www.amazon.com).
Another example of using additional information can be seen in Li et al. [12]. The authors introduced TopicSpam, a variation of the LDA algorithm. This algorithm uses hotel specific information in the OpSpam dataset (consisting of opinions about hotels) for creating topics. In addition, the authors generated background topics, deceptive topics, and truthful topics. Finally, given the specific topics, the topic-modeling approach used word frequency as features. TopicSpam achieved 95% accuracy. This kind of additional information is not always available, so we did not focus on this sort of features.
Combining different methods is common practice in the task of generating more accurate models used in deception detection. This is done for the purpose of maximizing the advantages of each approach used and thus, to achieve a more efficient classification.
In this section we present our method for deception detection. First, in Section 3.1 we detail the various sources of features we use. Next, in Section 3.2 we describe the different datasets we use for evaluation, and finally in Section 3.3 we give details on our feature vector construction.
Sources of text features
We focus on three sources of features (LIWC, SN, OHR), which we compare and combine with the proposed method (LDA).
Most works presented as the current standard (see Section 2) used unigrams as the basis for adding new features, aiming to obtain better performance. Instead, we chose one-hot representation (OHR) (Section 3.1.1), since it showed similar performance while being a simpler representation.
Given that deception involves cognitive processes, another feature source we selected was the LIWC dictionary (Section 3.1.3). This tool has been used to detect deception because it was originally created for psychological studies. In addition, LIWC consists of a set of words with a behavioral link.
Nevertheless, the main drawback of LIWC is the set of pre-established words. We therefore opted for using a continuous semantic space model (LDA) (Section 3.1.4). This method automatically creates a set of topics; each topic consists of a set of words with a semantic link. Unlike LIWC, LDA’s set of words can change depending on a dataset, suggesting that the generated categories are specific to the document collection, and thus features might be more informative.
Finally, so far we have only considered lexical information about words, including lexical semantics; text style information was not included. Therefore, we also explored an approach based on text style (SN) (Section 3.1.2). In contrast to the BoW methods, SN can take word context into account.
One-hot representation (OHR)
Word vectors are prominently chosen in many works due to the fact that they generate relevant features that help to classify texts. For this reason, we decided to analyze the performance of features based on a word matrix, more specifically, using one-hot representation.
To obtain a one-hot representation, a list of all words
Syntactic n-grams (SN)
Syntactic n-grams (SN) [23] are a relatively new feature that emerged in response to certain disadvantages of conventional n-grams. The main disadvantage of conventional n-grams is that they do not properly capture long-distance relationships, so that they appear to be generated in an overly random manner. SN, on the other hand, take advantage of the syntactic structure of a sentence; their main drawback is that a syntactic parser is needed to obtain them.
To obtain SN, a syntactic tree is constructed for collecting syntactic relations represented by the edges of the tree. These edges are those that link words with an appropriate label.
We used the Stanford parser [13] to generate the syntactic relations of the text. The collapsed representation was not used. We show an example of the Stanford representation of syntactic relations in step 1 of Fig. 2. In step 2, useful relations are selected for generating bigrams; it can be seen that prep_to (give-3, kid-7) is related to det (kid-7, the-6) because kid-7 links both of them. In this way, we obtain the bigram prep_to-det.
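As an illustration of the bigram step, the following is a minimal sketch, assuming that two relations form a bigram whenever the dependent of the first is the head of the second (this is our reading of the single example given above, not a full specification of the method). The function name `syntactic_bigrams` and the triple format are hypothetical.

```python
from collections import defaultdict

def syntactic_bigrams(relations):
    """Build label bigrams from dependency relations.

    `relations` is a list of (label, head, dependent) triples, with
    position-tagged tokens such as "kid-7" (hypothetical input format).
    """
    # Index relation labels by the token acting as head.
    labels_by_head = defaultdict(list)
    for label, head, _dependent in relations:
        labels_by_head[head].append(label)

    bigrams = []
    for label, _head, dependent in relations:
        # The dependent of this relation is the head of the next one.
        for child_label in labels_by_head.get(dependent, []):
            bigrams.append(f"{label}-{child_label}")
    return bigrams

relations = [
    ("prep_to", "give-3", "kid-7"),
    ("det", "kid-7", "the-6"),
]
print(syntactic_bigrams(relations))  # ['prep_to-det']
```

Indexing the relations by head keeps the pairing linear in the number of relations, which matters little for a single sentence but avoids a quadratic scan over long documents.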
Example showing how vectors of features are constructed by using one-hot representation.
The process for generating continuous syntactic bigrams given a current text.
Documents can be represented by their syntactic relations through the use of a feature vector. This is done in the same way as shown in the previous section (see Fig. 1), but this time replacing words with syntactic n-grams.
Linguistic Inquiry and Word Count (LIWC) is a tool based on a dictionary of words [19]. LIWC’s dictionary contains groups of words labeled by humans and it was originally used in works related to psychological analysis. Applications of this resource have been growing recently; for instance, LIWC has been used in computational linguistics as a source of features for authorship identification [15] and deception detection [14], among other applications.
Example of some groups of words with their corresponding labels contained in LIWC. A sample of only 7 words per label is shown in this example.
In this work we used version 2007 of LIWC. This version consists of 4,000 words grouped in 64 categories. Figure 3 shows an example of some groups of words that we can find in LIWC.
Latent Dirichlet Allocation (LDA) is a probabilistic generative model [2] for collections of discrete data such as text collections.
LDA represents documents as a mix of different topics; each topic consists of a set of words that keep some link between them. Words, in turn, are chosen based on a probability. The process of selecting topics and words is repeated to generate a document or a set of documents. As a result, each generated document deals with different topics.
In a simplified manner, the LDA generative process consists of the following steps:
1. The choice of a mixture of topics for the document, according to a Dirichlet distribution over a fixed set of K topics.
2. The generation of each word in the document by:
   (a) choosing a topic;
   (b) using the chosen topic to generate the word.
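The generative process just described can be sketched as follows. This is a toy illustration only: the vocabulary, the per-topic word probabilities, and the Dirichlet parameters are made-up values, and the function names are hypothetical. It shows the generative direction of the model, not the inference procedure used to fit topics to a corpus.

```python
import random

def sample_dirichlet(alpha, rng):
    # A Dirichlet draw can be obtained by normalising independent Gamma draws.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocabulary, alpha, n_words, rng):
    """Generate one document from a fixed set of K topics."""
    # Choose the document's mixture over the K topics.
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(topic_word_probs)))
    words = []
    for _ in range(n_words):
        # Choose a topic for this word position.
        k = rng.choices(topic_ids, weights=theta)[0]
        # Use the chosen topic to generate the word.
        word = rng.choices(vocabulary, weights=topic_word_probs[k])[0]
        words.append(word)
    return theta, words

# Toy values (assumptions for illustration only).
vocabulary = ["penalty", "justice", "crime", "life"]
topic_word_probs = [
    [0.5, 0.3, 0.1, 0.1],  # topic 0
    [0.1, 0.1, 0.3, 0.5],  # topic 1
]
theta, words = generate_document(
    topic_word_probs, vocabulary, alpha=[0.5, 0.5], n_words=8,
    rng=random.Random(0),
)
print(theta, words)
```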
Using a generative model, LDA analyzes the set of documents to find the most likely set of topics possibly dealt with in a document.
Example of generated topics by using LDA in texts about death penalty.
Example of document processed by LDA.
We can regard LDA as a source of word features similar to LIWC; but unlike LIWC, LDA generates the groups of words (topics) automatically. Additionally, groups of words from LDA are not labeled, and their contents are different depending on the corpus where LDA is trained.
See Fig. 4 for an example of words that are obtained using LDA.
Figure 5 shows the processing of a document by LDA (before binarization). Each topic (set of words) shown has a particular probability of existing within the current document. In this way, a high probability of a topic (
To evaluate our method, three different datasets were used: the DeRev (DEception in REViews) corpus [5], the OpSpam (Opinion SPAM) corpus [17, 18], and a corpus of opinions about three controversial topics [20]. The first corpus was collected using unsanctioned deception, while the latter two were collected using sanctioned deception [6]. Details about these two kinds of deception are given in the following sections.
DeRev dataset
The DeRev dataset is a corpus composed of deceptive and truthful opinions obtained from an Amazon web page. This corpus consists of opinions about books. It is a gold standard that contains 236 texts of which 118 are truthful and 118 are deceptive. Both deceptive and truthful texts were obtained from Amazon.
To achieve a high degree of confidence on a correct collection of deceptive texts, DeRev’s authors considered two starting points; the first one, by Sandra Parker, was published on Money Talks News;2
Finally, to obtain the 118 truthful texts, DeRev’s authors took into account certain aspects to ensure a high probability of the selection being correct. The texts should not have any cue of deception. For this reason, the selection focused mainly on aspects like whether opinions were written by users who had used their real names, and whether opinions were written by users who bought the book through Amazon, among others.
In this dataset, the deceptive and truthful texts were not obtained in a deliberate manner, i.e., participants were not asked to write lies; instead, texts were obtained after the participant had lied. For this reason, it is said that the way of building this text is by unsanctioned deception.
Deceptive text sample
Circle of Lies, by Douglas Alan is a fast paced, gritty crime thriller that introduces us to John Delaney, ex-cop turned lawyer who finds himself defending his best friend against embezzlement and murder charges when no one else will believe him. Alan, a retired trial lawyer, gives us a rather unique look at the “men behind the curtains” of the American justice system with this fast paced, exciting read. Highly recommended!
Truthful text sample
This book is well written and very funny, it is a good “guys”; kind of book, and it is an easy read, but don’t let that fool you. The story is good and it is in the painful details. I felt like I was working right there with them. However, it is crude and not for the casual reader. I try to read both classic and non-classic writers of all eras, and I am always looking for something weird, funny, and cool to mix it up and this was it.
The OpSpam dataset is a corpus composed of fake and truthful opinions. These are opinions about different hotels.
OpSpam’s authors used Amazon Mechanical Turk5
On the other hand, truthful opinions were collected from a TripAdvisor web page.
OpSpam’s authors collected a dataset composed of 800 texts in total. To form this corpus, participants were asked to write lies in order to obtain the deceptive texts. Accordingly, the corpus was formed by sanctioned deception.
Deceptive text sample
I had a wonderful time at the James Hotel while on business in Chicago. The rooms are modern, tasteful and well-kept, while the staff was responsive and efficient. This was the perfect place to unwind after working all day, but also provided an ideal atmosphere to do some more work.
Truthful text sample
Despite what other are saying, this was one, if not the best Hotel stay in Chicago I have had. I travel to the Big City about three times a year for pleasure and The James rates up with the best. I was upgraded and the staff made the stay worth my special weekend visit. I don’t believe you will be disappointed.
This is a corpus composed of 600 opinions on three controversial topics: abortion (200), death penalty (200), and a best friend (200). The corpus consists of 100 deceptive texts and 100 truthful texts for each topic (giving the total of 600 texts). The compilation of texts was conducted through AMT and the task was restricted to turkers who lived in the United States.
To obtain truthful texts, the authors requested from the participants their real opinion on each of the topics; next, participants were asked to lie about their real opinion, whereby deceptive texts were obtained. The method used to collect this corpus was sanctioned deception.
This corpus also contains texts in English as spoken in India and in Mexican Spanish; however, those texts were not used in this work.
Deceptive text sample
Abortion is despicable because we’re giving women the choice to kill life. Where is the unborn child’s voice? Do we not care about those who can’t defend themselves? Abortion is systemic murder that needs to be stopped.
Truthful text sample
Abortion is a choice. It should be made by both parties if possible, but the woman should have final say if there is disagreement. I think that with counseling, and all the other options available, it is not morally wrong for a woman to choose to have an abortion. A fetus is not viable outside the womb before 26 weeks, and then only barely, and with very expensive care.
Analysis of corpora
To provide a deeper view of the corpora, Table 1 shows a count of tokens and types for the corpora described above, where tokens are the total number of words contained in the documents and types are the number of distinct words found in the documents. Stop words were kept in all experiments.
| Dataset | Number of texts | Tokens | Types | Avg. tokens per doc |
|---|---|---|---|---|
| OpSpam | 800 | 96,793 | 6,469 | 121 |
| DeRev | 236 | 29,990 | 5,162 | 127 |
| Abortion | 200 | 15,958 | 1,997 | 80 |
| Best friend | 200 | 11,717 | 1,718 | 59 |
| Death penalty | 200 | 15,615 | 2,034 | 78 |
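The token and type counts reported in Table 1 can be illustrated with a short sketch. It assumes lowercasing and whitespace tokenization, which may differ from the tokenizer actually used to build the table; the function name and the two sample sentences are hypothetical.

```python
def count_tokens_and_types(documents):
    """Return (tokens, types, average tokens per document)."""
    # Assumption: lowercase the text and split on whitespace.
    tokens = [word for doc in documents for word in doc.lower().split()]
    n_tokens = len(tokens)        # total number of words
    n_types = len(set(tokens))    # number of distinct words
    return n_tokens, n_types, n_tokens / len(documents)

docs = ["the hotel was great", "the staff was great too"]
print(count_tokens_and_types(docs))  # (9, 6, 4.5)
```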
Corpora’s average of shared types
Table 2 shows the average of types with regard to each corpus. This means, the average of unrepeated words that the documents of the corpus have in common. Results are shown both on all documents (Deceptive
We conducted several experiments using different sources (described in Section 3.1) and combinations of these to find the best combination of features for deception detection (according to the datasets analyzed in this work). For all experiments, classification was carried out by means of the Multinomial Naïve Bayes algorithm implemented under WEKA [9] with five-fold cross validation.
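To make the evaluation protocol concrete, the following is a minimal sketch of how five-fold cross-validation partitions a dataset. It is a plain, unstratified split written for illustration and is not meant to reproduce WEKA's internal procedure; the function name and the seed are assumptions.

```python
import random

def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# With 800 documents, each fold tests on 160 and trains on 640.
splits = list(k_fold_splits(800, k=5))
print([(len(train), len(test)) for train, test in splits])
```

Every document is used for testing exactly once, and the reported accuracy is then the average over the five folds.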
In an attempt to generate a model that better represents deception detection, we implemented an attribute selection with the purpose of removing repetitive and irrelevant features.
The attribute selection [7] helps to obtain a model for identifying the deceptive text with greater efficiency. It also helps to reduce the dimensions of vectors of features. For example, vectors composed of about 4,000 features were reduced to about 60 features.
In the process of attribute selection, the WEKA tool was used with the correlation-based evaluator proposed by Hall [8], together with a search criterion based on hill climbing with backtracking. This combination showed a significant increase in precision. The specific parameters for attribute selection in WEKA were as follows: the evaluator was CfsSubsetEval (numThreads 1, poolSize 1) and the search strategy was BestFirst (direction: Forward, searchTermination 5). We also tried other search strategies such as ExhaustiveSearch, but checking all possibilities on a large set of attributes is demanding in terms of time. In addition, we tried genetic search, which, although faster, did not provide good results even when increasing the number of generations and the population size and adjusting the crossover and mutation probabilities. In the end, BestFirst was the most efficient for the document representation used in this work. A more detailed study of all attribute selection approaches is left as future work.
Furthermore, we found that binarizing the vectors of features yielded a more accurate model for deceptive texts detection: Table 3 shows results of deception detection with LDA
| Dataset | Accuracy (binary, with AtSe) | Accuracy (not binary, with AtSe) | Accuracy (binary, without AtSe) |
|---|---|---|---|
| Abort | 87.5% | 75.2% | 72.2% |
| BestFriend | 87.0% | 82.0% | 76.8% |
| DeathPenalty | 80.0% | 69.3% | 61.5% |
| DeRev | 94.9% | 74.9% | 88.8% |
| OpSpam | 90.9% | 88.5% | 87.2% |
Accuracy for different numbers of LDA topics
Below we show details for converting all features into binary values.
LIWC generated vectors of 64 features. Each vector was obtained as follows: given a document and the 64 categories, if a word of a given category was found in the document, the corresponding feature took the value one; otherwise it took the value zero. Feature vectors for SN were generated as follows: first, we formed a list of the distinct SN found in all documents of the dataset under analysis. Next, given a document and the list of SN, if a given SN was found in the document, the feature value was set to one; otherwise it was set to zero. As mentioned above, LDA produces feature vectors with real values (probabilities of belonging to topics), so we converted these values into binary values. To that end, a threshold was calculated by dividing the sum of all membership probabilities by the number of topics. Each probability equal to or greater than the threshold was turned into one; otherwise it was turned into zero.
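The two binarization rules described above can be sketched in a few lines. The function names and the sample values are hypothetical; the threshold rule (sum of the topic probabilities divided by the number of topics) follows the description in the text.

```python
def presence_vector(items_in_document, feature_list):
    """1 if the feature (e.g., an SN or a LIWC category) occurs, else 0."""
    present = set(items_in_document)
    return [1 if feature in present else 0 for feature in feature_list]

def binarize_topics(topic_probs):
    """Binarize LDA topic probabilities against their mean value."""
    # Threshold: sum of all probabilities divided by the number of topics.
    threshold = sum(topic_probs) / len(topic_probs)
    return [1 if p >= threshold else 0 for p in topic_probs]

sn_features = ["det-amod", "prep_to-det", "nsubj-det"]
print(presence_vector(["prep_to-det", "nsubj-det"], sn_features))  # [0, 1, 1]
print(binarize_topics([0.5, 0.25, 0.125, 0.125]))  # [1, 1, 0, 0]
```

Because the threshold is the mean of the probabilities, at least one topic is always set to one for each document.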
We tried using the TF-IDF (Term Frequency
LDA requires specifying the number of topics to be generated; any change in this parameter may change the classification accuracy. For this reason, it is necessary to find an appropriate value.
To find the number of topics that allows an optimal classification, we performed several experiments. Results from those experiments are shown in Table 4; in this table, the number of topics is compared to the obtained accuracy. In addition, it can be seen that by increasing the number of topics it is possible to reach an optimal point from which increasing the number of topics does not imply a decrease in accuracy (i.e. 500 topics).
Although it is possible to slightly increase accuracy for some datasets by adjusting the number of topics, we were interested in finding a general method with features that yield a good performance with different types of deception. In this way, we sought a set of common parameters that work for all datasets.
Henceforth, in all experiments 500 topics were used, that is, each document processed by LDA generates a vector of 500 features and each of those is represented by a probability of belonging to each topic (see Section 3.1.4, Fig. 5).
Results and discussion
In this section we present detailed results on the correct identification of deception. The tables for each classified corpus contain the following values: accuracy, precision (P), recall (R), and F-measure (F). Whilst accuracy is a measure used in many studies on deception detection and provides a point of comparison with other results, we also report precision, recall, and F-measure; this allows for a deeper analysis of the outputs. Precision is the percentage of selected texts that are correct, while recall is the percentage of correct texts that are selected. Finally, F-measure is the combined measure that assesses the P/R trade-off.
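For reference, these measures can be computed as in the sketch below. It assumes the balanced F-measure (the harmonic mean of precision and recall), which is the usual choice but is not stated explicitly in the text; the labels "D" and "T" and the sample predictions are made up for illustration.

```python
def accuracy(gold, predicted):
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)

def precision_recall_f(gold, predicted, positive):
    """Precision, recall, and balanced F-measure for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    # Precision: share of selected texts that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of correct texts that are selected.
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure

gold = ["D", "D", "D", "T", "T", "T"]
predicted = ["D", "D", "T", "D", "T", "T"]
print(accuracy(gold, predicted))
print(precision_recall_f(gold, predicted, positive="D"))
```

In this toy case two of the three texts labeled deceptive are correct and two of the three deceptive texts are found, so precision, recall, and F-measure all equal 2/3.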
We obtained these values for the following methods: Latent Dirichlet Allocation (LDA), one-hot representation (OHR), syntactic n-grams (SN), and Linguistic Inquiry and Word Count (LIWC); as well as for combinations of these.
The aim of combining distinct NLP techniques is to find the combination of features that best favors accuracy in deceptive and truthful text classification. Words contain a lot of information by themselves; this is confirmed by the great number of works (for examples, see our related work survey) that used word features as a basis for adding new features, aiming to obtain better performance. In a similar fashion, we used LDA to complement OHR features. The results in this section show that LDA complements OHR features better than the other methods used (i.e., LIWC, SN) in most cases. In the classification of the OpSpam dataset, shown in Table 5, the combination of LDA and OHR increases accuracy with regard to the same techniques evaluated separately. This is because the features generated using LDA favorably complement those generated by OHR.
Classification of OpSpam dataset by using different methods for obtaining features
Number of relevant features obtained by using LDA and features of one-hot representation (OHR)
Classification of DeRev dataset by using different methods for obtaining features
Classification of dataset about abortion by using different methods for obtaining features
There are cases in which there are many features that represent truthful text while only few features represent deceptive text, or vice versa. LDA may help to minimize this problem by increasing the number of relevant features that lead towards a more accurate text classification. To study this effect, we list the number of attributes, after attribute selection, that are strongly linked with deceptive or with truthful text respectively. For example, in 60 relevant features, there are a specific number of features that are strongly correlated with deceptive class, and other features with truthful class (see first two columns of OHR in Table 6). Moreover, the number of features can be increased by combining feature generation methods; in Table 6 we show the number of relevant features obtained for each class by using only OHR in comparison with the number of features obtained by using the LDA and OHR combination. It can be seen that in all cases an increase of deceptive text features is present.
Classification of dataset about best friend by using different methods for obtaining features
Vectors of features generated by combining features of LIWC and OHR showed similar results, although with a lower performance compared with the combination LDA
Classification of dataset about death penalty by using different methods for obtaining features
Another tested approach was the use of SN. As with LIWC, prior language-dependent information is necessary, because syntactic trees are obtained based on the probability of occurrence in a previously labeled corpus. Syntactic bigrams showed acceptable accuracy, but combining their features with others may worsen the results. For instance, in the datasets shown in Tables 7 and 9, combining SN and OHR features decreased the classification accuracy. Thus, this specific combination of features was disadvantageous.
| Deceptive text | Truthful text |
|---|---|
| my, I, luxury, relax, spa, visit, vacation, anyone, amaze, chicago | location, floor, block, bathroom, street, large, small, 2, Priceline, upgrade |
Top ten relevant features for the abortion dataset. Topics are shown by a sample of their words in italics
Learning curve on different datasets.
It is important to note that with the proposed method (LDA
Based on the analysis of the obtained results, we found that the best combination of features was LDA plus vectors from the one-hot representation (LDA
Figure 6 shows the learning process for each dataset. For each percentage of randomly sampled data, from 10% to 100%, one part was used as the training set (80%) and the remainder (20%) was used as the test set. For example, if 60% of the data were used, say 600 out of 1000 records, 480 would be used for training and 120 for testing. The graph suggests that for some corpora, using more data could yield a small increase in accuracy, while for others there is almost no change between 90% and 100% of the data.
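The sampling scheme behind the learning curve can be sketched as follows. The function name, the use of integer percentages, and the fixed seed are assumptions made for illustration; only the 80%/20% split of each random sample comes from the text.

```python
import random

def learning_curve_split(n_items, percent, train_percent=80, seed=0):
    """Randomly sample `percent`% of the data, then split it 80/20."""
    sample_size = n_items * percent // 100
    sample = random.Random(seed).sample(range(n_items), sample_size)
    n_train = sample_size * train_percent // 100
    return sample[:n_train], sample[n_train:]

# 60% of 1000 records: 600 sampled, 480 for training and 120 for testing.
train, test = learning_curve_split(1000, 60)
print(len(train), len(test))  # 480 120
```

Repeating the call for each percentage from 10 to 100 and training a classifier on every split would give one point of the curve per percentage.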
Detecting deception in a text is a task that has been tackled using different techniques. The task is not easy because the features that distinguish a deceptive text may vary between datasets: some words that represent deceptive text in one dataset may represent truthful text in another. This means that a word is not strictly representative of one class or the other in every case. For example, in Table 12 the personal pronoun “I” and the possessive adjective “my” are representative features of truthful text in the dataset about abortion, whereas, as can be seen in Table 11, those words are representative features of deceptive text in the OpSpam dataset.
The fact that certain words play contradictory roles in different datasets does not indicate that machine learning methods fail at this kind of task, but rather that the representative function of those words may vary depending on the specific dataset. That is, they will not necessarily appear as relevant features in a more general dataset, such as one combining OpSpam and the abortion dataset.
The use of LDA proved to be an alternative considerably more effective than LIWC. This is because LDA generates topics based on all words contained in a set of texts, so that no words are left outside the probabilistic process that generates the groups of words. Conversely, LIWC contains a group of pre-established words that might not be included in the documents being processed. Therefore, for the datasets analyzed in this work, generating a model to represent deceptive text is more successful when LDA is used.
The combination of LDA and OHR features yielded an accuracy greater than or equal to 80% in all three analyzed datasets. This suggests that the models generated with this combination of features are more precise at detecting deceptive texts, given that they were tested on different cases of deception. In addition, accuracy was more stable across the corpora. Furthermore, LDA
To show how LDA and OHR complement each other, consider the dataset on abortion. The most relevant words and topics for deceptive and truthful text on this topic are shown in Table 12. Features of deceptive texts consist of topics and words, while features of truthful text consist mostly of words only. We noted that whenever a class lacked relevant features to differentiate it from the other class, topics filled this gap and improved classification accuracy.
Acknowledgments
We thank Instituto Politécnico Nacional (SIP, COFAA and BEIFI), CONACYT, and Red TTL for their support.
