Machine learning-based authorship attribution using token n-grams and other time tested features

Abstract

Authorship Attribution is a process to determine and/or identify the author of a given text document. The relevance of this research area comes to the fore when two or more writers claim to be the prospective authors of an unidentified or anonymous text document or are unwilling to accept any authorship. This research work aims to utilize various Machine Learning techniques in order to solve the problem of author identification. In the proposed approach, a number of textual features such as Token n-grams, Stylometric features, bag-of-words and TF-IDF have been extracted. Experimentation has been performed on three datasets viz. Spooky Author Identification dataset, Reuter_50_50 dataset and Manual dataset with 3 different train-test split ratios viz. 80-20, 70-30 and 66.67-33.33. Models have been built and tested with supervised learning algorithms such as Naive Bayes, Support Vector Machine, K-Nearest Neighbor, Decision Tree and Random Forest. The proposed system yields promising results. For the Spooky dataset, the best accuracy score obtained is 84.14% with bag-of-words using Naïve Bayes classifier. The best accuracy score of 86.2% is computed for the Reuter_50_50 dataset with 2100 most frequent words when the classifier used is Support Vector Machine. For the Manual dataset, the best score of 96.67% is obtained using the Naïve Bayes Classification Model with both 5-fold and 10-fold cross validation when both syntactic features and 600 most frequent unigrams are used in combination.

Keywords

Author identification, token n-grams, stylometric features, bag-of-words, TF-IDF score, text classification, WEKA tool

1. Introduction

Over the last two decades, the Internet has evolved from a network of interconnected computers for sharing data into an integral part of our lives. But the overuse of the Internet has its downside as well. The rapid and voluminous generation of online texts poses a serious threat to users of the cyber world. Issues such as spamming, phishing, the spread of offensive language, distribution of illicit and pirated materials, cyberbullying, etc. have mushroomed. These activities are commonly grouped under the notion of cybercrime [1], that is, illegal activities carried out online through global electronic networks [2].

The generation of textual content, especially anonymous data, is increasing exponentially. Researchers are exploring varied methodologies and approaches for dealing with the task of Authorship Attribution, i.e., predicting the true author of a given unknown text [3]. The aim is to reveal the identity of the writer through an automated system so that tasks such as preventing theft of articles and giving proper citation and due credit to the real author can be handled more efficiently [4].

The primary objective of Authorship Attribution (AA) is to define an appropriate characterization of documents that captures the writing style of authors [5] and to identify authors or writers from emails, books, tweets, blogs, posts, comments, research papers and other textual documents. Traditional stylometric features and document fingerprinting features [38], an unsupervised learning approach [39] and a combination of textual distance, supervised and unsupervised techniques [40] have all shown promising results. AA is sometimes mistaken for Author Profiling (AP) or Author Characterization. AA predicts the name of the true author(s) based on proper training and on comparison results that reveal writing style similarities and differences within a certain category [6], whereas AP is concerned with determining characteristics of an author such as age, gender, geographical location, personality traits, etc. Thus, AP does not usually attempt to identify the name of a specific author [7].

This work focuses on identifying the real author of a text using token n-grams and other time-tested features. Three datasets, with three different train-test split ratios, viz. 80-20, 70-30 and 66.67-33.33, have been used in this research. Along with individual feature types such as lexical, syntactic and content-specific features, experiments have also been performed on ensemble feature types. The number of most frequent token n-grams retained, where n $=$ 1, 2, 3, 4, has been chosen separately for each dataset. A total of five supervised learning based classification models have been built and tested: Naive Bayes, Support Vector Machine, K-Nearest Neighbor, Decision Tree and Random Forest. For the self-made “Manual” dataset, both 5-fold and 10-fold cross-validation have been employed to assess the results.

The rest of the paper is organized as follows: Section 2 discusses the related works on Authorship Attribution by different researchers. In Section 3, the proposed methodology is described using the system architecture and a flowchart. Section 4 describes the implementation strategy through a discussion of the feature sets, datasets and classification techniques used in this study. Section 5 presents the results obtained and analyzes the effectiveness of each proposed approach. Section 6 concludes the paper, and finally the next line of action is discussed in Section 7.

2. Related work

A lot of research has been carried out in the domain of Authorship Attribution over the last few decades. With the ever-increasing amount of text being generated online, especially documents characterized by anonymity, the AA task becomes critical and worth exploring.

Many researchers have focused on showcasing the different properties of texts, namely the content of the text and the writing style of the author. In 2003, Koppel et al. [8] used lexical features, part-of-speech (POS) tags and idiosyncratic features in their experiment and implemented the AA model using linear Support Vector Machine (SVM) and Decision Tree classifiers. The best accuracy score of 72% was achieved with the Decision Tree algorithm.

Hu and Liu in the year 2004 have achieved an accuracy value of 84% through the use of WordNet [9]. In 2006, Zheng et al. [2] have applied supervised learning techniques viz. Decision Tree, SVM and Back Propagation Neural Network (BPNN) on both English and Chinese language datasets, subsequently yielding accuracy values of 97.69% with English dataset and 88.33% with Chinese dataset.

In 2007, Dominique Labbe [10] utilized intertextual distance as a feature, reaching an accuracy of 62% with the sliding window concept. In the same year, Company and Wanner [11] achieved a better accuracy score of 82.72% by employing stylometric features along with bag-of-words and TF-IDF. Bozkurt et al. [12] performed a similar experiment using stylometric features, vocabulary diversity, bag-of-words and frequency of function words, and reached 95% accuracy. They applied several techniques, namely the Histogram method, K-Nearest Neighbor, Parzen Windows, Bayes Classifier, K-means clustering, SVM and combinations of classifiers.

Ramnial et al. [7] extracted stylometric features and achieved an accuracy of 98% with K-Nearest Neighbor and SVM techniques. In another well-known study, Efstathios Stamatatos used function words and character n-grams and achieved more than 80% accuracy with SVM [13].

Prasad et al. [14] used Decision Tree, Neural Network, KNN and Naïve Bayes in their experiments and obtained an accuracy score of 87.5%. In the same year, Leo Wanner [15] performed several similar experiments and, with an SVM classification model, achieved a higher accuracy of 91.41%. In [16], Wanner used various datasets and achieved more than 80% accuracy on all datasets.

In the year 2020, Rocha et al. [17] achieved 92% accuracy on the AVASUS database by utilizing a handful of stylometric features and applying several classification techniques, including SVM, Logistic Regression, Naïve Bayes, KNN and Gaussian Process classification.

Table 1
Comparative study of related works on authorship attribution

Paper ID	Author(s) (publication year)	Contribution	Limitation	Corpus	Feature types	Techniques	Reported result
[8]	Koppel et al. (2003)	The main motive is to identify authors through stylistic idiosyncratic features.	Small number of training documents is used. But idiosyncratic features suggest that for greater improvements large training corpora can be used.	480 emails written by 11 different authors during a period of about one year	Lexical, Syntactic (POS), Idiosyncratic features	Linear SVM, Decision Tree	72%
[9]	Hu and Liu (2004)	Text mining, sentiment analysis and text summarization are discussed based on data mining and natural language processing methods.	Authors have not monitored customer reviews & pronoun resolution and have not determined the strength of opinions, and have neither investigated opinions expressed.	Online customers’ Reviews	Feature-Based Summarization (FBS)	WordNet	84%
[2]	Zheng et al. (2006)	This is a framework of authorship identification on online messages which helps to assist in tracing identities in cyberspace.	There is no way to find the optimal set of features for online messages and to reduce number of features.	Online messages in English and Chinese languages	Lexical (character & word-based), Syntactic (function words & POS), Structural (paragraph & greeting) and Content based features (content words)	Decision Tree, SVM, BPNN	97.69% (English) & 88.33% (Chinese)
[10]	Dominique Labbe (2007)	Intertextual distance is used as a feature to identify authorship attribution to know the distance between two texts.	There is also no prejudging of ‘sure’ or ‘doubtful’ authorship. The method entails no limitation on the number of texts used.	153 different pairs of excerpts by the same authors and 1173 pairs which group excerpts by different authors	Intertextual Distance	Sliding Window	62%
[11]	Company and Wanner (2007)	The main goal is to show how to use less features and obtain better efficiency in author gender identification.	The researchers have not explored the identification of the age, native tongue and education level of the authors.	NY Times Opinion Blog corpus and informal blog post dataset	Character based features, Word based features, Sentence based features, Dictionary based features, Syntactic features	BOW, TF-IDF	82.72%
[12]	Bozkurt et al. (2007)	The main aim is to determine the writer of a document.	Authors have not examined whether the classification errors occur in the same documents among different classifiers and also have not compared the classification errors of the authors using different classifiers.	All writings of Milliyet columnists from 2001 to 2005	Stylometric features, Vocabulary Diversity, BOW, Frequency of Function Words	SVM, Histogram Method, KNN, Parzen Windows, Bayes Classifier, K-means Clustering, Combination of Classifiers	95%

[7]	Ramnial et al. (2016)	This paper aims at studying the use of stylometric features present in a document in order to verify its authorship to identify potential cases of plagiarism in formal writings.	This approach is only suitable when large amount of texts is available and hence these techniques would not be suitable to classify shorter texts originating from student essays, emails and social media posts.	Ten PhD thesis, split into different segments of 1000, 5000 and 10000 words, (Total 520 documents)	Lexical (character & word-based features), Syntactic (function words & POS), Structural (paragraphs & greetings) and Content based features (content words)	KNN, SVM	98%
[13]	Efstathios Stamatatos (2016)	Style-based text categorization is discussed and it is also examined whether such universal stylometric features are effective under different documents or not.	There is a problem of dimensionality of the representation. Changes in topic or genre as well as the number of candidate authors considerably affect the appropriate choice of the number of features in the attribution models.	Texts from The Guardian daily UK newspaper. There are 8 top-level tags.	Function Words and Character n-grams	SVM	More than 80%
[14]	Prasad et al. (2017)	This proposed approach is used to predict the authorship of anonymous documents by extracting different stylometric features from authors’ works.	This project is not used to classify wide range of authors.	Text documents written by five popular Victorian authors	Some basic stylometric features	Decision Trees, NN, KNN, Naive Bayes	87.5%
[15]	Leo Wanner (2017)	The main motive is to derive demographic author information such as gender or age in case of author profiling and in author identification the goal is to predict the author of a text selected from a pool of potential candidates.	Small domain of features is used.	18 different authors, 3 novels per author and 2014’s PAN author verification task dataset	Character based, Word based, Sentence based, Dictionary based, Syntactic (POS), Dependency features, Tree features and Discourse features	SVM	91.41%

[16]	Leo Wanner (2017)	This work is based on author identification and author profiling tasks. Author profiling aims to identify demographic traits of the authors, while author identification aims to identify the authors themselves by searching for distinctive linguistic patterns that distinguish them.	The experimentation is not done in noisier environment. Literary stylistic variation is not explored. Discourse features are not expanded. Semantic parsing is not done.	Several types of datasets are taken like Blog Corpus, Email data, Facebook data, Movie Review, Student Essay, News Corpora, Scientific Articles	Character based features, Word based features, Sentence based features, Dictionary based features, Morpho-Syntactic Dependency features, Discourse features, Function words, POS, Token n-gram, BOW	SVM, Random Forest, Enriched KNN, Density based K-means cluster	More than 80% in most of the datasets
[17]	Rocha et al. (2020)	The problem of authorship recognition is being exposed in order to make it a tool for use in the distance education platform of the Ministry of Health.	Very small set of features are used and so the domain of work is very less.	AVASUS Database	Only 9 stylometric features such as Average length of tokens, Average short tokens, Numeric Digits Average etc.	SVM, Logistic Regression, Naïve Bayes, KNN, Gaussian Process Classification, Bernoulli Naive Bayes	92%
[18]	Tamboli and Prasad (2020)	The proposed methodology for author identification is based on the change in writing style of various authors and this change is mitigated by a new feature normalization technique.	If the dimension of the feature vector is increased to a certain limit, then a negative impact is seen on accuracy.	Dataset from New York Times, The Indian Express, and the correspondence of well-known authors	Character n-grams, Word n-grams, POS n-grams	SVM	94.83%
[4]	Noura Khalid Alhuqail (2021)	This work is focused on Predicting authors of articles to preserve intellectual property rights and for preventing theft.	The dataset is new, and there is no previous work on it. The BERT model is relatively new, so there are few papers on it and no paper about Author Identification by applying BERT.	100 articles each of 20 authors and 500 articles each of 10 authors	bag-of-words and Latent Sentiment Analysis	SVM, Random Forest, BERT, Logistic Regression	94.9%

Bernoulli Naïve Bayes was also among the classifiers used by Rocha et al. [17]. Tamboli and Prasad [18] used SVM and achieved an accuracy of 94.83% with character n-grams, word n-grams and part-of-speech n-grams.

Noura Khalid Alhuqail in the year 2021 [4] has used SVM, Random Forest, Bidirectional Encoder Representations from Transformers (BERT) and Logistic Regression classifiers for model generation. Latent Sentiment Analysis and bag-of-words are used in the experiment and the highest accuracy achieved is 94.9%.

Table 1 presents a detailed comparative study of the related works on AA as discussed in this section, especially focusing on parameters such as contribution, limitations, corpus used, features extracted, techniques followed and results reported in terms of accuracy scores.

3. Proposed methodology

The proposed work focuses on solving the task of Authorship Attribution. Three datasets, with 3 different train-test split ratios viz. 80-20, 70-30 and 66.67-33.33, have been used in this research. The datasets are as follows:

1. Spooky Author Identification Dataset [25] comprising 19579 horror sentences of 3 spooky authors.
2. Reuter_50_50 Dataset [26] comprising 5000 news articles of 50 authors.
3. Manual Dataset comprising 30 short stories of 3 authors.

Initially, the collected data is in raw form. No data cleaning is performed before extracting the selected stylometric features [2, 7, 11], because stylometric features are style markers that capture details of an author’s writing style, which cleaning could distort. Lexical, syntactic and content-specific features are mainly used in this work [2, 7]. In addition to these individual feature types, combinations of them are formed, and the resulting ensemble feature sets are tested on each dataset. Next, data cleaning is carried out before extracting features such as n-grams, BOW and TF-IDF. During this data pre-processing stage, tokenization, removal of punctuation and stopwords, and lemmatization are performed. After cleaning the data, token n-grams, namely unigrams, bigrams, trigrams and tetragrams, are extracted from the corpus [16, 18, 35]. In the process, the 100, 300, 600, 900, … most frequent words are extracted from each dataset [16], and the iteration stops when the accuracy score starts to decrease. Next, bag-of-words (BOW) [4, 12, 16] and TF-IDF values are calculated on each dataset. Several popular text classification models have been built and tested: Naive Bayes, Support Vector Machine, K-Nearest Neighbor, Decision Tree and Random Forest. The WEKA data mining tool has been used to implement the classifiers. The architecture of the proposed system is presented in Fig. 1, and its working principle is explained with the help of a flowchart in Fig. 2.
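The cleaning step applied before n-gram, BOW and TF-IDF extraction can be sketched as follows. This is a minimal illustration in Python, not the WEKA-based implementation used in this work. The stopword list is a tiny hypothetical sample, and `lemmatize` is a placeholder that only strips a plural "s", where a real system would rely on a dictionary-based lemmatizer.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "was"}

def lemmatize(token):
    """Placeholder lemmatizer (assumption): strips a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(text):
    """Tokenize, drop punctuation and stopwords, then lemmatize."""
    # Lowercase and keep alphabetic runs only, which discards punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatize(t) for t in tokens if t not in STOPWORDS]

cleaned = preprocess("The ravens, it is said, watched the towers of the city.")
print(cleaned)  # ['raven', 'said', 'watched', 'tower', 'city']
```

Stylometric features, by contrast, would be computed on the raw text before any such cleaning.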

Figure 1.
Architecture of the proposed system.

Text in raw form is fed into the feature extractor module, which transforms input text or tokens into multidimensional vectors, one per instance, with each feature value as a dimension. These vectors and the ground truth label of each instance are given as input to a Machine Learning algorithm, which extracts patterns from the training material and makes predictions on unseen instances. For prediction, the algorithms work on the feature vectors of the test set, which is obtained by splitting the dataset into training and test sets using three different split ratios.
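The three split ratios can be illustrated with a short sketch. It assumes a simple shuffled split with a fixed seed; the function name `train_test_split`, the seed and the toy instances are hypothetical and do not reflect how WEKA performs the split internally.

```python
import random

def train_test_split(instances, train_fraction, seed=0):
    """Shuffle a copy of the instances and cut it at the given fraction."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy (feature_vector, author_label) pairs; the values are made up.
data = [([i, i % 3], "author_%d" % (i % 3)) for i in range(30)]

# The three ratios used in this work: 80-20, 70-30 and 66.67-33.33.
for fraction in (0.80, 0.70, 2 / 3):
    train, test = train_test_split(data, fraction)
    print(f"{fraction:.4f}: {len(train)} training / {len(test)} test instances")
```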

Figure 2.
Flowchart depicting the Workflow of the proposed system.

4. Implementation

This section discusses the feature sets, datasets and classification techniques used in this work. The predictions from any Machine Learning algorithm can only be as good as the dataset. Thus, an adequate number of documents is collected, such that each document contains a reasonable number of words. This is a crucial requirement for performing the AA task, as the final results depend heavily on the corpus.

4.1 Feature sets

In this work, four types of features viz. stylometric features, token n-grams, BOW and TF-IDF have been extracted.

Stylometric Features

In order to obtain best outcomes of the AA task and detect potential suspects of plagiarism, it is important to comprehend effectively the various stylometric features used so that the results do not suffer from any bias. Stylometric features are categorized as Lexical, Syntactic, Structural and Content-specific features.

Lexical features can be further divided into character-based and word-based features. In this research, 64 character-based lexical features and 34 word-based lexical features have been extracted.

Syntactic features include function words, punctuation, and parts of speech (POS) that can help in detecting an author’s writing style at the sentence level. The discriminating power of syntactic features is derived from a writer’s unique habit of organizing sentences. POS tags have not been used in this work. In total, 158 syntactic features have been used that incorporate a large set of 150 function words, which was selected based on the previous study [7].

Structural features such as paragraphs and greetings have not been extracted in this approach as the datasets have no paragraph type texts and almost each instance is in the form of a sentence(s).

Content-specific features are important discriminating features. These features are application dependent or semantic in nature as the selection of such features is dependent on the type of dataset used.
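A few of the lexical and syntactic style markers described above can be sketched in Python as follows. The feature names and the seven-word `FUNCTION_WORDS` list are illustrative assumptions; they are only a small stand-in for the 64 character-based, 34 word-based and 158 syntactic features actually used. The features are computed on raw, uncleaned text.

```python
import string

# Small illustrative subset (assumption); the study uses 150 function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "but", "upon"]

def stylometric_features(text):
    """Return a few character-, word- and syntax-level style markers."""
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]
    n_chars = max(len(text), 1)   # guards against division by zero
    n_words = max(len(words), 1)
    features = {
        # character-based lexical features
        "char_count": len(text),
        "upper_ratio": sum(c.isupper() for c in text) / n_chars,
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        # word-based lexical features
        "word_count": len(words),
        "avg_word_length": sum(map(len, words)) / n_words,
        "vocabulary_richness": len(set(words)) / n_words,
        # syntactic features: punctuation counts
        "comma_count": text.count(","),
        "semicolon_count": text.count(";"),
    }
    # syntactic features: function-word frequencies
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = words.count(fw)
    return features

feats = stylometric_features("The cat, the dog and the bird.")
print(feats["word_count"], feats["fw_the"], feats["comma_count"])  # 7 3 1
```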

Token n-grams

An n-gram is a contiguous sequence of n items from a given text. An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram”; size 4 is a “tetragram”; size 5 is a “pentagram” and so on. A k-skip-n-gram is a length-n subsequence in which the components occur at a distance of at most k from each other. Syntactic n-grams are n-grams defined by paths in syntactic dependency or constituent trees rather than by the linear structure of the text. They are intended to reflect syntactic structure more faithfully than linear n-grams and have many of the same applications, especially as features in a Vector Space Model. For certain tasks, such as Authorship Attribution, syntactic n-grams give better results than standard n-grams. In this work, the 100, 300, 600, 900, … most frequent token n-grams (n $=$ 1 to 4) are extracted. The upper limit depends on the accuracy score: the iteration stops when accuracy ceases to increase or starts decreasing [16].
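The extraction of the most frequent token n-grams and the stopping rule can be sketched as follows. This is an illustrative Python sketch under stated assumptions: `evaluate` is a hypothetical callback that returns the accuracy obtained with the k most frequent n-grams, and the accuracy values in the usage example are made up.

```python
from collections import Counter

def token_ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def most_frequent_ngrams(documents, n, k):
    """Return the k most frequent token n-grams over tokenized documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(token_ngrams(tokens, n))
    return [gram for gram, _ in counts.most_common(k)]

def search_feature_count(evaluate, sizes):
    """Try increasing feature counts and stop once accuracy no longer rises.

    `evaluate` is a hypothetical callback mapping a feature count k to an
    accuracy score; `sizes` is the increasing sequence of counts to try.
    """
    best_k, best_acc = None, float("-inf")
    for k in sizes:
        acc = evaluate(k)
        if acc <= best_acc:  # accuracy ceased to increase or decreased
            break
        best_k, best_acc = k, acc
    return best_k, best_acc

docs = [["the", "old", "house", "on", "the", "hill"], ["the", "old", "man"]]
top_bigram = most_frequent_ngrams(docs, 2, 1)
print(top_bigram)  # [('the', 'old')]

# Made-up accuracies, used only to exercise the stopping rule.
scores = {100: 0.60, 300: 0.68, 600: 0.71, 900: 0.70}
best = search_feature_count(scores.get, sizes=(100, 300, 600, 900))
print(best)  # (600, 0.71)
```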

bag-of-words

The bag-of-words (BOW) model is the simplest numerical representation of text. BOW is used in Natural Language Processing (NLP) and Information Retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model is commonly used in document classification methods, where the frequency of occurrence of each word is used as a feature for training a classifier [4, 12, 16]. The model has some drawbacks. In its binary form, each word is recorded only as present (1) or absent (0), so every word that occurs receives the same weight regardless of how informative it is.
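A minimal Python sketch of the BOW representation is given below, covering both the count-based form and the binary (1/0) form. The helper names and the toy documents are illustrative assumptions.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct word to a column index (sorted for reproducibility)."""
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    return {w: i for i, w in enumerate(vocab)}

def bow_vector(tokens, vocab, binary=False):
    """Count-based BOW vector; with binary=True only presence (1/0) is kept."""
    vec = [0] * len(vocab)
    for word, count in Counter(tokens).items():
        if word in vocab:  # words outside the vocabulary are ignored
            vec[vocab[word]] = 1 if binary else count
    return vec

docs = [["dark", "night", "dark", "sea"], ["calm", "sea"]]
vocab = build_vocabulary(docs)  # {'calm': 0, 'dark': 1, 'night': 2, 'sea': 3}
print(bow_vector(docs[0], vocab))               # [0, 2, 1, 1]
print(bow_vector(docs[0], vocab, binary=True))  # [0, 1, 1, 1]
```

Word order is lost in both forms: any permutation of the same tokens yields the same vector.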

Term Frequency-Inverse Document Frequency [32]

Term Frequency-Inverse Document Frequency (TF-IDF) is based on the bag-of-words (BOW) model, which contains insights about the less relevant and more relevant words in a document. The importance of a word in the text is of great significance in IR. Term Frequency (TF) is a measure of the frequency of a word (w) in a document (d). TF is defined as the ratio of a word’s occurrence in a document to the total number of words in a document. The denominator term in the formula (see Eq. (1)) is used to normalize since all the corpus documents are of different lengths.

$\displaystyle\textit{TF}(w,d)=\frac{\textit{Occurrences of w in document d}}{\textit{Total number of words in document d}}$ (1)

Inverse Document Frequency (IDF) is a measure of the importance of a word. TF does not consider the importance of a word: some words occur very frequently but are of little significance. IDF assigns a weight to each word based on its frequency in the corpus D and is calculated using the following formula (see Eq. (2)):

$\displaystyle\textit{IDF}(w,D)=\log_{2}\frac{\textit{Total number of documents (N) in corpus D}}{\textit{Number of documents containing w}}$ (2)
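Equations (1) and (2) can be checked on a toy corpus with the following Python sketch. Multiplying TF by IDF to obtain the final weight is the usual convention and is stated here as an assumption, since the product is not written out above. The corpus is made up.

```python
import math

def tf(word, doc):
    """Eq. (1): occurrences of the word in doc divided by the doc length."""
    return doc.count(word) / len(doc)

def idf(word, corpus):
    """Eq. (2): log2 of (number of documents / documents containing word)."""
    containing = sum(1 for doc in corpus if word in doc)
    # Assumes the word occurs in at least one document of the corpus.
    return math.log2(len(corpus) / containing)

def tf_idf(word, doc, corpus):
    """Assumed combination: the product of the two quantities."""
    return tf(word, doc) * idf(word, corpus)

corpus = [
    ["dark", "night", "dark", "sea"],
    ["calm", "sea"],
    ["dark", "forest"],
    ["calm", "night"],
]
# "dark" occurs twice among 4 words (TF = 0.5) and appears in 2 of the
# 4 documents (IDF = log2(4/2) = 1), so its weight in the first document is 0.5.
weight = tf_idf("dark", corpus[0], corpus)
print(weight)  # 0.5
```

A word that appears in every document gets IDF $=$ 0, which is how very frequent but uninformative words are suppressed.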

4.2 Datasets

Three datasets, with 3 different train-test split ratios viz. 80-20, 70-30 and 66.67-33.33, have been used in this research. The datasets are as follows:

Spooky Author Identification Dataset [25]

This dataset, hereby referred to as Spooky dataset, contains text from works of fiction written by three horror authors of the public domain, namely Edgar Allan Poe (EAP), HP Lovecraft (HPL) and Mary Wollstonecraft Shelley (MWS). The data has been prepared by chunking larger texts into sentences using CoreNLP’s MaxEnt sentence tokenizer.

Reuter_50_50 Dataset [26]

This dataset is a subset of RCV1 and has already been used in author identification experiments. The top 50 authors, ranked by the total size of their articles, are selected from among the authors of texts labeled with at least one subtopic of the class CCAT (corporate/industrial). Restricting the texts to a single class is an attempt to minimize the topic factor, so that texts are distinguished by author rather than by subject. The corpus consists of 5000 texts, with 100 texts per author.

Manual Dataset

This dataset comprises short stories by three authors, namely Ernest Hemingway, Mark Twain and O. Henry. Ten literary documents by each author have been collected manually, giving a dataset of 30 documents. More details about the documents in this dataset are provided in the previous work [19].

Table 2
Performance comparison of different classifiers for best performing stylometric features on Spooky dataset

Feature type	Feature count	Technique	80-20 split accuracy (%)	70-30 split accuracy (%)	66.67-33.33 split accuracy (%)
Combined features	460	Naïve Bayes	45.7354	46.6292	46.7628
		SVM	66.0112	66.2581	65.9006
		k-NN	45.9653	45.5056	45.8315
		Decision Tree	51.0981	52.7579	51.3745
		Random Forest	60.6231	60.1805	60.6429

4.3 Classification techniques

In this work, a total of five supervised learning algorithms have been employed for solving the text classification problem. The classification techniques are as follows:

Naïve Bayes

This technique performs classification by making an assumption of conditional independence over the training dataset. In this work, multinomial Naïve Bayes classifier has been employed. Learning such classifiers can be greatly simplified by assuming that features are independent of each other given the class as seen from the following formula (see Eq. (3)):

$\displaystyle P({X{|}C})=\mathop{\prod}\limits_{i=1}^{n}P({X_{i}{|}C})$ (3)

where $X=(X_{1},X_{2},\ldots X_{n})$ is a feature vector and $C$ is a class [27].
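The independence assumption of Eq. (3) leads to a very compact classifier. The following Python sketch of a multinomial Naïve Bayes model works in log space to avoid numerical underflow. Laplace smoothing with `alpha=1.0` and the skipping of unseen words are assumptions made for this illustration, and the three toy documents are made up; only the author abbreviations come from the Spooky dataset.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model[c] = {
            "log_prior": math.log(class_docs[c] / len(docs)),
            "log_lik": {w: math.log((word_counts[c][w] + alpha) / total)
                        for w in vocab},
        }
    return model

def predict_nb(model, doc):
    """Pick the class maximizing log P(C) + sum_i log P(X_i | C), cf. Eq. (3)."""
    def score(c):
        params = model[c]
        # Words unseen during training are skipped (simplifying assumption).
        return params["log_prior"] + sum(
            params["log_lik"][w] for w in doc if w in params["log_lik"])
    return max(model, key=score)

docs = [["raven", "night", "gloom"],
        ["cosmic", "horror", "night"],
        ["creature", "life", "gloom"]]
labels = ["EAP", "HPL", "MWS"]
model = train_multinomial_nb(docs, labels)
predicted = predict_nb(model, ["raven", "gloom"])
print(predicted)  # EAP
```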

Support Vector Machine

This algorithm finds a hyperplane in an N-dimensional space that distinctly separates the data points of different classes, where N is the number of features. SVM does not only construct this separating hyperplane. It also considers two parallel hyperplanes that pass through the nearest positive and negative points on either side of it. The distance between these two hyperplanes is called the margin. The aim is to maximize the margin in order to increase accuracy [28].
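For a linearly separable two-class problem, the margin can be written compactly. The notation below is the usual textbook normalization and is introduced here as an assumption, since it does not appear in the description above: the separating hyperplane is $w\cdot x+b=0$ and the two hyperplanes through the nearest points of each class are $w\cdot x+b=\pm 1$. The distance between these two hyperplanes is then

$\displaystyle\textit{margin}=\frac{2}{\|w\|}$

so maximizing the margin is equivalent to minimizing $\|w\|$ subject to $y_{i}(w\cdot x_{i}+b)\geq 1$ for every training instance $(x_{i},y_{i})$ with $y_{i}\in\{-1,+1\}$.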

k-Nearest Neighbors

In k-NN classification, an object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its $k$ nearest neighbors ( $k$ is a positive integer, typically small). The output is a class membership. If $k=$ 1, then the object is simply assigned to the class of that single nearest neighbor [29].
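The plurality vote can be sketched in a few lines of Python. Euclidean distance is assumed here as the distance measure, and the toy vectors and labels are made up.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Assign the class most common among the k nearest training vectors."""
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels))
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

X = [[0, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y = ["EAP", "EAP", "HPL", "HPL", "HPL"]
near_origin = knn_predict(X, y, [1, 0], k=3)
print(near_origin)  # EAP
```

With `k=1` the function returns the label of the single nearest neighbor.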

Decision Tree

A decision tree is a predictive model in machine learning. In decision analysis, a decision tree is used to represent decisions and decision making visually and explicitly. Tree models in which the target variable takes a discrete set of values are called classification trees. In these trees, leaves represent class labels and branches represent conjunctions of features that lead to those class labels [30].

Random Forest

Random Forest is based on the concept of ensemble learning. This classifier builds a number of decision trees on various subsets of the given dataset and combines their outputs to improve predictive accuracy. Instead of relying on one decision tree, the random forest takes the prediction from each tree and predicts the final output based on the majority vote [31].
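The two ensemble mechanics mentioned here, drawing a subset of the data for each tree and combining the trees by majority vote, can be sketched as follows. This is not a full Random Forest: sampling with replacement (bootstrap) is an assumed way of forming the subsets, and the three `trees` are hypothetical stand-ins for trained decision trees.

```python
import random
from collections import Counter

def bootstrap_sample(instances, rng):
    """Draw len(instances) items with replacement (one subset per tree)."""
    return [rng.choice(instances) for _ in instances]

def forest_predict(trees, x):
    """Return the class receiving the most votes across all trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained decision trees: each maps a feature
# vector to an author label using a single threshold test.
trees = [
    lambda x: "EAP" if x[0] > 2 else "HPL",
    lambda x: "EAP" if x[1] > 1 else "MWS",
    lambda x: "HPL" if x[0] < 1 else "EAP",
]
winner = forest_predict(trees, [3, 0])
print(winner)  # EAP (two of the three trees vote EAP)
```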

5. Results and analysis

In this section, the results for the three datasets have been tabulated for easy comparison of the performance of the five classification techniques used. The confusion matrix is also plotted for better understanding. All the experiments are conducted with the help of the WEKA data mining tool. Waikato Environment for Knowledge Analysis (WEKA) contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to these functions. WEKA supports several standard data mining tasks, more specifically, data pre-processing, clustering, classification, regression, visualization and feature selection [20].
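The accuracy scores and confusion matrices reported in this section are derived from pairs of true and predicted labels. A minimal Python sketch of both computations is given below; the label sequences are made up and do not correspond to any reported result.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    pairs = Counter(zip(true_labels, predicted_labels))
    return [[pairs[(t, p)] for p in classes] for t in classes]

def accuracy(true_labels, predicted_labels):
    """Percentage of instances whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

classes = ["EAP", "HPL", "MWS"]
y_true = ["EAP", "EAP", "HPL", "HPL", "MWS", "MWS"]
y_pred = ["EAP", "HPL", "HPL", "HPL", "MWS", "EAP"]

cm = confusion_matrix(y_true, y_pred, classes)
acc = accuracy(y_true, y_pred)
print(cm)             # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(round(acc, 2))  # 66.67
```

The diagonal of the matrix holds the correctly classified instances, so accuracy equals the diagonal sum divided by the total number of instances.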

Table 3
Performance comparison of different classifiers for best performing token n-grams on Spooky dataset

Value of n	Most frequent words	Technique	80-20 split accuracy (%)	70-30 split accuracy (%)	66.67-33.33 split accuracy (%)
1	1200	Naïve Bayes	70.8121	70.9738	71.0981
		SVM	71.9612	71.0249	71.4586
		k-NN	46.7824	45.9993	46.0418
		Decision Tree	62.7426	62.1893	61.7696
		Random Forest	66.2921	66.4964	66.186

Table 4

Performance comparison of different classifiers for BOW on Spooky dataset

Feature type	Technique	80-20 split accuracy (%)	70-30 split accuracy (%)	66.67-33.33 split accuracy (%)
BOW	Naïve Bayes	84.14	83.90	83.74
	SVM	77.24	75.62	75.46
	k-NN	46.88	45.83	45.61
	Decision Tree	50.87	51.11	50.84
	Random Forest	63.79	63.38	62.91

Table 5

Performance comparison of different classifiers for TF-IDF on Spooky dataset

Feature type	Technique	80-20 split accuracy (%)	70-30 split accuracy (%)	66.67-33.33 split accuracy (%)
TF-IDF	Naïve Bayes	82.71	82.07	81.38
	SVM	83.71	82.47	81.89
	k-NN	74.74	73.36	72.97
	Decision Tree	50.70	50.43	50.87
	Random Forest	63.41	63.60	62.69

Table 6

Performance comparison of different classifiers for best performing stylometric features on Reuter_50_50 dataset

Feature type	Feature count	Technique	80-20 split accuracy (%)	70-30 split accuracy (%)	66.67-33.33 split accuracy (%)
Combined features	377	Naïve Bayes	51.4	52.3333	51.8824
		SVM	67.8	65.8667	65.8235
		k-NN	48.7	46.2	45
		Decision Tree	27	24.6	27
		Random Forest	48.6	48.1333	47.4118

Table 7

Performance comparison of different classifiers for best performing token n-grams on Reuter_50_50 dataset

Value of n	Most frequent words	Technique	80-20 split accuracy (%)	70-30 split accuracy (%)	66.67-33.33 split accuracy (%)
1	2100	Naïve Bayes	73.8	74.8667	74.1176
		SVM	86.2	83.2667	82.1765
		k-NN	52.3	48.7333	47.5882
		Decision Tree	58.2	57	56.5882
		Random Forest	71.3	69.8667	68.8824

Table 8

Performance comparison of different classifiers for BOW on Reuter_50_50 dataset

Feature type	Technique	80-20 split accuracy (%)	70-30 split accuracy (%)	66.67-33.33 split accuracy (%)
BOW	Naïve Bayes	78.40	76.47	76.36
	SVM	80.50	79.73	79.39
	k-NN	65.20	61.60	61.2121
	Decision Tree	47.30	47.07	46.48
	Random Forest	80.30	77.67	77.70

Table 9

Performance comparison of different classifiers for TF-IDF on Reuter_50_50 dataset

Feature type	Technique	80-20 split accuracy (%)	70-30 split accuracy (%)	66.67-33.33 split accuracy (%)
TF-IDF	Naïve Bayes	73	71.73	71.64
	SVM	80.10	78	78.12
	k-NN	75.50	73.33	73.2121
	Decision Tree	47.30	46.80	46.79
	Random Forest	79.20	79.20	78.12

Table 10

Performance comparison of different classifiers for best performing stylometric features on manual dataset

Feature type	Feature count	Technique	5-fold cross-validation accuracy (%)	10-fold cross-validation accuracy (%)
Syntactic (F3)	158	Naïve Bayes	90	90
		SVM	73.3333	76.6667
		k-NN	66.6667	66.6667
		Decision Tree	70	73.3333
		Random Forest	76.6667	76.6667

Table 11

Performance comparison of different classifiers for best performing token n-grams on Manual dataset

Value of n	Most frequent words	Technique	5-fold cross-validation accuracy (%)	10-fold cross-validation accuracy (%)
1	600	Naïve Bayes	90	93.3333
		SVM	76.6667	83.3333
		k-NN	60	60
		Decision Tree	60	53.3333
		Random Forest	76.6667	76.6667

Table 12

Performance comparison of different classifiers for BOW on Manual dataset

Feature type	Technique	5-fold cross-validation accuracy (%)	10-fold cross-validation accuracy (%)
BOW	Naïve Bayes	86.67	86.67
	SVM	73.33	73.33
	k-NN	63.33	66.67
	Decision Tree	66.67	73.33
	Random Forest	70	80

Table 13

Performance comparison of different classifiers for TF-IDF on Manual dataset

Feature type	Technique	5-fold cross-validation accuracy (%)	10-fold cross-validation accuracy (%)
TF-IDF	Naïve Bayes	63.33	66.67
	SVM	76.66	73.33
	k-NN	76.67	80
	Decision Tree	73.33	73.33
	Random Forest	86.67	80

Table 14

Comparison of classifier performance using combined features on Manual dataset

Feature type	Feature count	Technique	5-fold cross-validation accuracy (%)	10-fold cross-validation accuracy (%)
Syntactic $+$ Unigrams (600)	758	Naïve Bayes	96.67	96.67
		SVM	73.33	73.33
		k-NN	80	76.67
		Decision Tree	73.33	63.33
		Random Forest	80	73.33
Syntactic $+$ BOW	9028	Naïve Bayes	90	90
		SVM	73.33	70
		k-NN	83.33	83.33
		Decision Tree	70	70
		Random Forest	80	66.67
Syntactic $+$ TFIDF	9028	Naïve Bayes	63.33	56.37
		SVM	80	76.67
		k-NN	76.67	73.33
		Decision Tree	76.67	66.67
		Random Forest	83.33	86.67
Unigrams (600) $+$ BOW	9470	Naïve Bayes	93.33	96.67
		SVM	63.33	60
		k-NN	73.33	76.67
		Decision Tree	60	56.67
		Random Forest	80	73.33
Unigrams (600) $+$ TF-IDF	9470	Naïve Bayes	70	70
		SVM	66.67	56.67
		k-NN	70	70
		Decision Tree	73.33	73.33
		Random Forest	86.67	76.67
BOW $+$ TF-IDF	17740	Naïve Bayes	70	76.67
		SVM	63.33	63.33
		k-NN	66.67	66.67
		Decision Tree	60	76.67
		Random Forest	80	90
Syntactic $+$ Unigrams (600) $+$ BOW	9628	Naïve Bayes	96.67	96.67
		SVM	70	66.67
		k-NN	73.33	76.67
		Decision Tree	73.33	63.33
		Random Forest	83.33	70
Syntactic $+$ Unigrams (600) $+$ TF-IDF	9628	Naïve Bayes	76.67	73.33
		SVM	73.33	73.33
		k-NN	80	76.67
		Decision Tree	70	76.67
		Random Forest	83.33	83.33
Unigrams (600) $+$ BOW $+$ TF-IDF	18340	Naïve Bayes	83.33	80
		SVM	66.67	63.33
		k-NN	73.33	76.67
		Decision Tree	80	80
		Random Forest	70	86.67
Syntactic $+$ Unigrams (600) $+$ BOW $+$ TF-IDF	18498	Naïve Bayes	83.33	80
		SVM	70	66.67
		k-NN	73.33	76.67
		Decision Tree	73.33	76.67
		Random Forest	80	80

Table 15

Comparison of best accuracy scores of the proposed work with few related works

Dataset	Author(s)	Features	Technique	Reported accuracy (%)
Spooky Dataset	Jang et al. [21]	4-grams	Word2Vec	79.24
	Westin [22]	1-grams	Support Vector Machine	82.14
	Proposed work	BOW	Naïve Bayes	84.14
Reuter_50_50 Dataset	Houvardas and Stamatatos [33]	3-grams $+$ 4-grams $+$ 5-grams	Support Vector Machine	74.04
	Posadas-Durán et al. [34]	Doc2vec words $+$ 2-grams $+$ 3-grams	Logistic Regression	75.24
	Proposed work	Unigrams (2100)	Support Vector Machine	86.20
Manual Dataset	Prasad et al. [14] ${}^{*}$	Stylometric	Neural Network	87.50
	Gupta et al. [19]	Stylometric	Neural Network	93.33
	Proposed work	Syntactic $+$ Unigrams (600)	Naïve Bayes	96.67

${}^{*}$ Comparable Manual Dataset comprises 75 training tuples and 15 test tuples of 5 authors of the Victorian era. The authors are Jane Austen, Charles Dickens, William Thackeray, Emily Brontë and Charlotte Brontë.

The stylometric features comprise four groups: F1 (character-based lexical features), F2 (word-based lexical features), F3 (syntactic features) and F4 (content-specific features). For token n-gram features (n $=$ 1, 2, 3, 4), the number of most frequent n-grams retained depends on the dataset. Five classification techniques have been adopted: Naïve Bayes, Support Vector Machine (SVM), k-Nearest Neighbors (k-NN) with k $=$ 1, Decision Tree and Random Forest. The datasets have been split into train and test sets in three ratios, viz. 80-20, 70-30 and 66.67-33.33. For the self-made “Manual” dataset, both 5-fold and 10-fold cross-validation have been employed so that the results are assessed reliably. The performance of the different classifiers when all the stylometric features are combined (F1 $+$ F2 $+$ F3 $+$ F4) on the Spooky dataset is shown in Table 2.
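The three split ratios can be illustrated with a minimal sketch. It is not the tooling used in this work: the helper `split_train_test`, the fixed seed and the corpus of 300 document ids are all assumptions made for illustration, and the split is a plain shuffle rather than a stratified one.

```python
import random

def split_train_test(items, train_fraction, seed=0):
    """Shuffle a copy of `items` and cut it at `train_fraction` (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical corpus of 300 document ids; the real datasets differ in size.
documents = list(range(300))
split_sizes = {}
for label, fraction in [("80-20", 0.80), ("70-30", 0.70), ("66.67-33.33", 2 / 3)]:
    train, test = split_train_test(documents, fraction)
    split_sizes[label] = (len(train), len(test))
    print(label, len(train), len(test))
```

A stratified split, which keeps the proportion of each author equal in both parts, would be a natural refinement when the number of documents per author is small.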

Next, token n-grams (n $=$ 1 to 4) are extracted from the Spooky dataset. The number of most frequent n-grams retained depends on the value of n and on the dataset. In this work, experimentation is performed up to the 100 most frequent tetra-grams because the model does not yield better results as the number of features is increased further. Table 3 shows the accuracy scores of the best performing n-grams when tested on different classification models.
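The extraction of the most frequent token n-grams can be sketched as follows. This is an illustrative sketch only: the whitespace tokenizer, the lower-casing, the function names and the two-sentence corpus are assumptions and are not taken from the experimental setup.

```python
from collections import Counter

def token_ngrams(text, n):
    """Return the word-level n-grams of `text` (assumed whitespace tokenization)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def most_frequent_ngrams(texts, n, top_k):
    """Select the `top_k` most frequent n-grams over the whole corpus."""
    counts = Counter()
    for text in texts:
        counts.update(token_ngrams(text, n))
    return [gram for gram, _ in counts.most_common(top_k)]

def to_feature_vector(text, vocabulary, n):
    """Count how often each selected n-gram occurs in one document."""
    counts = Counter(token_ngrams(text, n))
    return [counts[gram] for gram in vocabulary]

# Hypothetical two-document corpus, for illustration only.
corpus = ["the raven sat on the bust", "the bust of pallas sat above the door"]
vocab = most_frequent_ngrams(corpus, n=1, top_k=3)
vector = to_feature_vector(corpus[0], vocab, n=1)
print(vocab, vector)
```

With `top_k` set to 1200, 2100 or 600 and `n=1`, the same functions would yield unigram feature sets of the sizes reported for the three datasets.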

Tables 4 and 5 show the accuracy results when different classifiers are employed for bag-of-words (BOW) and TF-IDF features on the Spooky dataset.
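For readers unfamiliar with the two representations, the sketch below builds a BOW count matrix and derives TF-IDF weights from it. The weighting used here, raw term count multiplied by log(N / document frequency), is one common variant and is an assumption; the exact formula used in the experiments is not stated in this section. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) with one row per document."""
    tokenised = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for doc in tokenised for tok in doc})
    rows = [Counter(doc) for doc in tokenised]
    return vocabulary, [[row[term] for term in vocabulary] for row in rows]

def tf_idf(matrix):
    """Weight each count by log(N / df); assumes every column occurs in some document."""
    n_docs, n_terms = len(matrix), len(matrix[0])
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / doc_freq[j]) for j in range(n_terms)]
            for row in matrix]

# Hypothetical documents, for illustration only.
docs = ["the cat sat", "the dog sat down"]
vocabulary, counts = bag_of_words(docs)
weights = tf_idf(counts)
print(vocabulary)
print(counts)
print(weights)
```

Under this variant, a word that appears in every document (here "the" and "sat") receives weight zero, whereas BOW keeps its raw count.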

In a similar way, the steps are repeated for the other two datasets, viz. the Reuter_50_50 dataset and the Manual dataset, for the different feature types, and the results for the five classifiers are presented in tabular form. As seen from Table 6, the best stylometric result on the Reuter_50_50 dataset is obtained for the combined set of features (F1 $+$ F2 $+$ F3 $+$ F4). Table 7 shows the results for n-gram features, Table 8 for BOW features and Table 9 for TF-IDF features on the same dataset. In the self-made “Manual” dataset, 64 F1, 34 F2, 158 F3 and 192 F4 features have been extracted. As this dataset is very small, k-fold cross-validation is used with k $=$ 5 and k $=$ 10, and Tables 10–13 show the corresponding results.
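The k-fold procedure can be sketched as below. This is a simplified illustration rather than the evaluation code used here: folds are contiguous and unshuffled, the dataset size of 20 is a placeholder, and `evaluate_fold` is a hypothetical callback that trains on the training indices and returns the number of correctly classified test documents.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_validated_accuracy(n_items, k, evaluate_fold):
    """Pool the correct predictions of all folds into one accuracy (in %)."""
    correct = sum(evaluate_fold(train, test)
                  for train, test in k_fold_indices(n_items, k))
    return 100.0 * correct / n_items

# Placeholder evaluator: pretends every document except document 0 is classified correctly.
def dummy_evaluate(train, test):
    return len(test) - (1 if 0 in test else 0)

for k in (5, 10):
    print(k, cross_validated_accuracy(20, k, dummy_evaluate))
```

Because every document is used for testing exactly once, the pooled accuracy uses the whole dataset, which is why cross-validation suits a very small corpus better than a single train-test split.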

Analysis

On the Spooky dataset, the best result for stylometric features is 66.2581%, obtained with a 70:30 train-test split when Support Vector Machine is applied to the full set of 460 stylometric features. The corresponding precision, recall and F-score values are 0.669, 0.663 and 0.658 respectively. For token n-gram features, the best accuracy is 71.9612%, obtained with an 80:20 split when SVM is applied to the unigram feature set with the 1200 most frequent words; the precision, recall and F-score values are 0.73, 0.72 and 0.719 respectively. With BOW and TF-IDF features, the best accuracy scores are 84.14% (Naïve Bayes with 80:20 split) and 83.71% (SVM with 80:20 split) respectively.

On the Reuter_50_50 dataset, the best result for stylometric features is 64.4706%, obtained with a 66.67:33.33 split when SVM is applied to the full feature set (a total of 248 features). The precision, recall and F-score values are 0.656, 0.645 and 0.645 respectively. For token n-gram features, an accuracy of 86.2% is obtained with an 80:20 split when SVM is applied to the unigram feature set with the 2100 most frequent words; the precision, recall and F-score values are 0.875, 0.862 and 0.863 respectively. With BOW and TF-IDF features, the best accuracy scores are 80.50% (SVM with 80:20 split) and 80.10% (SVM with 80:20 split) respectively.

On the Manual dataset, the best result for stylometric features is 90%, obtained in both 5-fold and 10-fold cross-validation when Naïve Bayes is applied to the syntactic features alone (a total of 158 features). The precision, recall and F-score values are 0.923, 0.9 and 0.898 respectively. For token n-gram features, the best result is 93.3333%, obtained in 10-fold cross-validation when Naïve Bayes is applied to the unigram feature set with the 600 most frequent words; the precision, recall and F-score values are 0.939, 0.933 and 0.931 respectively. With BOW and TF-IDF features, the best accuracy scores are 86.67% (Naïve Bayes with both 5-fold and 10-fold cross-validation) and 86.67% (Random Forest with 5-fold cross-validation) respectively.
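The precision, recall and F-score values quoted above are single numbers for multi-class problems, so some averaging over authors is involved. The sketch below assumes class-weighted averaging (each author's score weighted by its number of documents); this is an assumption, although it is consistent with the reported recall values coinciding with the accuracies, since weighted recall always equals accuracy. The confusion matrix is hypothetical.

```python
def weighted_scores(confusion):
    """confusion[i][j] = number of class-i documents predicted as class j.

    Returns support-weighted (precision, recall, f_score).
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    precision = recall = f_score = 0.0
    for c in range(n_classes):
        tp = confusion[c][c]
        support = sum(confusion[c])                               # true members of class c
        predicted = sum(confusion[r][c] for r in range(n_classes))  # predicted as class c
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f_score += weight * f
    return precision, recall, f_score

# Hypothetical three-author confusion matrix, for illustration only.
confusion = [[9, 1, 0],
             [0, 10, 0],
             [1, 1, 8]]
precision, recall, f_score = weighted_scores(confusion)
accuracy = sum(confusion[i][i] for i in range(3)) / 30
print(round(precision, 3), round(recall, 3), round(f_score, 3), accuracy)
```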

Table 14 shows that combining the best performing features improves the performance of the AA model on the Manual dataset. When the syntactic features and the 600 most frequent unigrams are used together, the overall best accuracy of 96.67% is obtained in both 5-fold and 10-fold cross-validation with the Naïve Bayes classifier. Table 15 summarizes the results of the proposed approach and compares them with related works of other researchers.
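The feature counts in Table 14 are consistent with plain concatenation of the individual feature vectors: 158 syntactic features plus 600 unigrams give 758 columns. The sketch below illustrates that reading. The helper `combine_features` and the zero-filled placeholder vectors are assumptions, and the BOW and TF-IDF block size of 8870 is inferred from the table (9028 − 158) rather than stated in the text.

```python
from itertools import chain

def combine_features(*feature_blocks):
    """Concatenate per-document feature vectors from several feature sets."""
    n_docs = len(feature_blocks[0])
    if any(len(block) != n_docs for block in feature_blocks):
        raise ValueError("every feature set must cover the same documents")
    return [list(chain.from_iterable(block[d] for block in feature_blocks))
            for d in range(n_docs)]

# Placeholder vectors for two documents; real values would come from feature extraction.
syntactic = [[0.0] * 158 for _ in range(2)]
unigrams = [[0.0] * 600 for _ in range(2)]
combined = combine_features(syntactic, unigrams)
print(len(combined[0]))

# Column counts implied by Table 14 under the concatenation reading.
BOW_SIZE = 9028 - 158   # inferred, not stated in the text
print(158 + 600 + BOW_SIZE, 158 + 600 + 2 * BOW_SIZE)
```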

It can be inferred from Table 15 that for the Spooky dataset, the proposed approach using BOW features and the Multinomial Naïve Bayes algorithm produces a better accuracy score (84.14%) than the Word2Vec model [21] and the SVM model [22]. For the Reuter_50_50 dataset, the use of 2100 unigrams with the SVM classifier achieves a better accuracy score than the previous works [33, 34]. For the Manual dataset, there is an improvement of 3.34 percentage points over the earlier work published by a subset of this team [19]. These findings indicate that when syntactic features are used together with 600 unigrams, even the Naïve Bayes algorithm yields better results than Neural Networks.

6. Conclusion

In this work on Authorship Attribution, four feature sets, namely stylometric features, token n-grams, bag-of-words (BOW) and TF-IDF features, are used, and experimentation has been performed using five classifiers: Naïve Bayes, Support Vector Machine, k-Nearest Neighbors, Decision Tree and Random Forest. The models have been tested on three datasets, viz. the Spooky Author Identification dataset, the Reuter_50_50 dataset and the Manual dataset. The prime focus of this research work is to identify the author and determine the true authorship of a given test document. Using supervised learning approaches, the proposed system predicts the name of the author(s). For the Spooky dataset, the best accuracy score obtained is 84.14% with bag-of-words (BOW) features using the Naïve Bayes classifier. For the Reuter_50_50 dataset, the best accuracy score of 86.2% is obtained with the 2100 most frequent words using the Support Vector Machine classifier. For the Manual dataset, the best score of 96.67% is obtained with the combined feature set of syntactic features and the 600 most frequent unigrams using the Naïve Bayes classification model with both 5-fold and 10-fold cross-validation.

7. Future work

Several extensions are planned to enhance the outcome of the proposed AA system. More datasets of varied size and genre need to be employed. Apart from English, datasets in other languages such as Thai [36] and Spanish [37] need to be tested as well. In this work, the classification models have not been fine-tuned by changing parameter values, so the models remain to be optimized; this task has been initiated. Only supervised algorithms have been used, so a performance analysis of ML models employing unsupervised algorithms is pending. On the Reuter_50_50 and Manual datasets, F4 features (semantic or content-specific features) have not been extracted as the dictionary of literary and news words could not be arranged. Only token n-gram features (for n $=$ 1 to 4) have been used here, so k-skip n-gram features can be taken up. Furthermore, idiosyncratic features [23], which have the potential to reflect the innate characteristics of an author, remain to be incorporated alongside the existing stylometric features [2, 24] to build an optimal Authorship Attribution system. Tasks such as Author Profiling, which predicts an author’s age, gender etc., and Author Categorization, which predicts the group to which an author or user belongs based on lifestyle, income, religious preferences, political interest etc., can be explored. A few researchers have also investigated the efficacy of stylometric feature categories using an ablation study [36], paving the way for model ablation. Deep Learning models such as RNN, LSTM and TextCNN need to be built so that better results can be obtained with large datasets, as it is observed that prediction accuracy improves with more features.

References

1. Thomas and Loader B.D., Cybercrime: Law enforcement, security and surveillance in the information age, Routledge, 2000.

2. Zheng, Chen and Huang, A framework for authorship identification of online messages: Writing-style features and classification techniques, Journal of the American Society for Information Science and Technology 57(3) (2006), 378–393.

3. Juola, Authorship Attribution for Electronic Documents, in: Olivier M.S. and Shenoi, eds., Advances in Digital Forensics II, Digital Forensics, IFIP Advances in Information and Communication, Vol. 222, 2006, pp. 119–130.

4. Alhuqail N.K., Author identification based on NLP, European Journal of Computer Science and Information Technology 9(1) (2021), 1–26.

5. Argamon, Whitelaw, Chase, Hota S.R., Garg and Levitan, Stylistic text classification using functional lexical features, Journal of the American Society for Information Science and Technology 58(6) (2007), 802–822.

6. Benjamin, Chung, Abbasi, Chuang, Larson C.A. and Chen, Evaluating text visualization for authorship analysis, Security Informatics 3(1) (2014), 1–13.

7. Ramnial, Panchoo and Pudaruth, Authorship attribution using stylometry and machine learning techniques, Intelligent Systems Technologies and Applications, 2016, 113–125.

8. Koppel and Schler, Exploiting stylistic idiosyncrasies for authorship attribution, in: Proceedings of IJCAI’03 Workshop on Computational Approaches to Style Analysis and Synthesis, Vol. 69, 2003, pp. 72–80.

9. Hu and Liu, Mining and summarizing customer reviews, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 168–177.

10. Labbé, Experiments on authorship attribution by intertextual distance in English, Journal of Quantitative Linguistics 14(1) (2007), 33–80.

11. Company J.S. and Wanner, How to use less features and reach better performance in author gender identification, in: The 9th edition of the Language Resources and Evaluation Conference (LREC), 2007, pp. 26–31.

12. Bozkurt I.N., Baghoglu and Uyar, Authorship attribution, in: 22nd IEEE International Symposium on Computer and Information Sciences, 2007, pp. 1–5.

13. Stamatatos, Universality of stylistic traits in texts, Creativity and universality in language, 2016, 143–155.

14. Prasad, Kallimani J.S. and Jain, Prediction of authorship using various classification algorithms, in: IEEE International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2017, pp. 1671–1676.

15. Wanner, On the relevance of syntactic and discourse features for author profiling and identification, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Short Papers, Vol. 2, 2017, pp. 681–687.

16. Wanner, Feature engineering for author profiling and identification: on the relevance of syntax and discourse, Doctoral dissertation, Universitat Pompeu Fabra, 2017.

17. Rocha M.A.D., Nóbrega G.Â.S.D., de Medeiros Valentim R.A. and Alves L.P.C., A text as unique as fingerprint: AVASUS text analysis and authorship recognition, in: Proceedings of the 10th Euro-American Conference on Telematics and Information Systems, 2020, pp. 1–8.

18. Tamboli M.S. and Prasad, Author identification with feature transformation method, Digital Scholarship in the Humanities 35(3) (2020), 642–651.

19. Gupta, Patra T.K. and Chaudhuri, Role of Machine Learning in Authorship Attribution with Select Stylometric Features, in: Abraham, Gandhi, Hanne, Hong T.P., Nogueira Rios and Ding, eds., Intelligent Systems Design and Applications (ISDA 2021), Lecture Notes in Networks and Systems, Vol. 418, 2022, pp. 920–932.

20. Waikato University Webpage, Weka 3: Machine Learning Software in Java, Retrieved from: https://www.cs.waikato.ac.nz/ml/weka/index.html, last accessed on 06/02/2022.

21. Jang, Kim and Lam, Kaggle competitions: Author identification & statoil/C-CORE iceberg classifier challenge, Dept. School Inform., Comput., Eng. Indiana Univ., Bloomington, IN, USA, Tech. Rep., 2017.

22. Westin, Authorship classification using the Vector Space Model and kernel methods, Retrieved from: https://www.diva-portal.org/smash/get/diva2:1439191/FULLTEXT01.pdf, 2020.

23. Gupta, Identifying Authors Through Idiosyncratic Usage and Stylistic Inconsistencies, in: Proceedings of the Two Day AICTE Sponsored Online International Conference on Data Science, Machine Learning and It’s Application (ICDML), 2020, pp. 225–231.

24. Neal, Sundararajan, Fatima, Yan, Xiang and Woodard, Surveying stylometry techniques and applications, ACM Computing Surveys (CSUR) 50(6) (2017), 1–36.

25. kaggle.com Webpage, Spooky Author Identification, Retrieved from: https://www.kaggle.com/c/spooky-author-identification/data, last accessed on 06/02/2022.

26. UCI Machine Learning Repository Webpage, Reuter_50_50 Data Set, Retrieved from: https://archive.ics.uci.edu/ml/datasets/Reuter_50_50, last accessed on 06/02/2022.

27. Rish, An empirical study of the naive Bayes classifier, in: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Vol. 3, 22, 2001, pp. 41–46.

28. Noble W.S., What is a support vector machine, Nature Biotechnology 24(12) (2006), 1565–1567.

29. Zhang M.L. and Zhou Z.H., ML-KNN: A lazy learning approach to multi-label learning, Pattern Recognition 40(7) (2007), 2038–2048.

30. Safavian S.R. and Landgrebe, A survey of decision tree classifier methodology, IEEE Transactions on Systems, Man, and Cybernetics 21(3) (1991), 660–674.

31. Pal, Random forest classifier for remote sensing classification, International Journal of Remote Sensing 26(1) (2005), 217–222.

32. Han, Kamber and Pei, Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann Publishers, 2012.

33. Houvardas and Stamatatos, N-Gram Feature Selection for Authorship Identification, in: Euzenat and Domingue, eds., Artificial Intelligence: Methodology, Systems, and Applications (AIMSA), Lecture Notes in Computer Science, Vol. 4183, 2006, pp. 77–86.

34. Posadas-Durán J.P., Gómez-Adorno, Sidorov, Batyrshin, Pinto and Chanona-Hernández, Application of the distributed document representation in the authorship attribution task for small corpora, Soft Computing 21(3) (2017), 627–639.

35. Shang, Liu, Song and Cheng, The Role of Traditional Features in Authorship Attribution, in: IEEE 10th International Conference on Electronics Information and Emergency Communication (ICEIEC), 2020, pp. 244–247.

36. Sarwar, Porthaveepong, Rutherford, Rakthanmanon and Nutanong, StyloThai: A scalable framework for stylometric authorship identification of Thai documents, ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 19(3) (2020), 1–15.

37. Guzmán-Cabrera, Authorship attribution of Spanish poems using n-grams and the web as corpus, Journal of Intelligent & Fuzzy Systems 39(2) (2020), 2391–2396.

38.

Yadav

Rathore

S.S.

and Chouhan