Abstract
Document novelty detection is a concept learning problem wherein the system gains its knowledge only from the positive documents under a concept and with that limited knowledge it attempts to detect the negative cases. This work focuses on learning author style as a concept from the given set of documents, particularly emails. Since author attribution for shorter texts such as emails is more complex compared to larger documents, the techniques originally used for the large documents prove inefficient for short texts. To address this shortcoming of existing algorithms in detecting aberration in author style, we have proposed a graph-model based technique for feature set extraction from short documents. Given the extracted feature set, we have also developed two probability based text representation schemes that could best represent a text document to an underlying one-class SVM classifier. The proposed models have been compared and evaluated on the public Enron email dataset. Applying graph based feature set extraction technique in combination with the inclusive compound probability based text representation has proved to be very efficient. The generality of the proposed method allows the approach to be applicable to all kind of text documents including emails.
Keywords
Introduction
Equipped only with the knowledge of known data, a system can achieve the ability to differentiate between known and unknown instances through novelty detection mechanisms. This becomes extremely important in the real life scenarios such as fault detection or intrusion detection systems, where it is nearly impossible to know in advance about all possible kinds of data the system would encounter. Novelty detection has proven to be a very challenging task due to the difficulty in defining a decision boundary for the known class, pertaining to absence of negative instances. Classifiers that are designed to detect whether an input is a known data or unknown are called one-class classifiers. Our goal is to build a one-class classifier that could perform novelty detection in short documents for authorship attribution.
The idea behind novelty detection in documents is to learn an inherent concept from a set of related documents. The concept could be any knowledge that relates all the documents under the known class such as document category, topic of interest, sentiment of the article, author style, etc. In this paper, we deal with learning the writing style of an author from a given set of documents written by him/her. The field of research related to author style classification is generally termed as author attribution [1]. So the focus of our research work is the intersection of novelty detection and author attribution, wherein we try to detect if an anonymous document is written by the particular author or not, by gaining the required knowledge only from the author’s archived documents.
A lot of research has been performed for document author attribution in the past, focusing on larger documents such as books or lengthy articles. We have observed that these mechanisms when applied on documents of smaller size do not perform efficiently. Also most of the works have approached author attribution as a multi-class classification problem, wherein the system gains the required knowledge on the style of a given set of authors from a sample set of their documents and tries to classify an anonymous document to any of the authors. However in case of novelty detection, we need a single-class classification mechanism which could gain knowledge of a particular author’s style only from the set of sample documents written by that author, without any counter example.
We have based this research work on emails, as a representation of small sized documents. However, the proposed classification model can be applied to any text content. We have attempted to build a knowledgeable system that could learn the author’s style from the old emails of a person and could identify if a given new email text follows the author’s writing style or not. Several challenges had to be addressed in identifying authorship for an email. First of all, emails are generally concise in nature, with some tens to hundreds of words. Moreover, unlike books or other large texts, the vocabulary of an email acutely changes based on the context or the addressee. To overcome these issues, our work focused on identifying the ‘writer invariant’ features, which are used by the author subconsciously, so that the usage of these features is independent of the context or addressee and it cannot be manipulated. The significant contributions of this paper are as follows:
A graph-model based algorithm has been proposed for feature set selection. A comprehensive graph is constructed to represent the old emails and the most significant features are selected from this graph. This process results mostly in the writer-invariant features. Two text representation schemes have been proposed to generate feature vectors for documents, and are subsequently used by the machine learning algorithm, such as One-Class Support Vector Machine. The first is Probability Model (from here on, referred as PM), wherein the value of each feature is the probability with which the feature appears in that email. The second one is Inclusive Compound Probability Model (from here on, referred as ICPM), where the value of each feature is the probability with which the feature appears in a sentence of the email. The proposed models have been implemented and tested against the public Enron dataset. Since Enron corpus is the only substantial collection of real emails that is public, by testing our prototypes with this corpus, we could measure the performance of our approach in a real-life scenario. We have compared the proposed models with other author attribution and novelty detection techniques. ICPM has proved to be the most efficient. We have evaluated the performance of ICPM against all possible values for its parameters and attempted to arrive at optimal values.
The rest of this paper has been organized as follows. Section 2 describes the previous works related to novelty detection, stylometry, author attribution and email categorization. Section 3 explains the proposed model in detail. Section 4 shows the performance of the proposed model compared to existing techniques and then presents an evaluation of the proposed model. Finally, in Sections 5, we suggest possible applications for the proposed classification model and conclude by stating the way forward.
Novelty detection in documents for author attribution is a research problem connected with many existing research fields such as author attribution, stylometry, email classification and novelty detection. Work on the field of authorship attribution has been carried on for hundreds of years. During 19th century, several non-traditional methodologies have been proposed to prove the authorship of a given text. These techniques used to be statistical in nature and were referred to as stylometry [2].
In 1851, Augustus de Morgan has suggested that the length of words could be used as a differentiator for various author styles in Bible [2]. Medelhall published his investigation results in 1887 to prove that word lengths are not significant after working on lengths of several hundred thousand words from the works of Bacon, Marlowe, and Shakespeare. However, he did find that the works of Shakespeare and Marlowe had similarities but they were consistently different from Bacon [1].
In 1932, George Zipf worked on stylometry and proposed a universal law of language, “Statistical rank of a word varies inversely to its frequency” [3, 4]. Yule in 1944, define “Yules characteristic K”, as a measure of word frequencies, the basic assumption being that the occurrence of a word can be modeled by a Poisson distribution [3]. This marker was also proved wrong by subsequent researches in this area. In 1963, Mosteller and Wallace resolved the conflict with the authorship of Federalist Papers, using statistical methods [3, 4].
It was then that the significance of function words came to focus. Function words are words such as ‘to’, ‘if’, ‘by’, ‘upon’ and so on, that have little lexical meaning, but instead serve to express grammatical relationships with other words in the sentence. The pattern of usage of such words shows the attitude of the author or the author’s particular language style. Subsequently, a lot of research has been carried out on this area. The basic strategy is to train a model with several documents of the authors in consideration, then the given sample text is checked against the model to identify its author [1, 2].
Feature selection has played an important part in the performance of the algorithms. The selected features were statistics on usage of sentences, words, and special characters. Roughly they can be classified as i) character features such as character
Text representation is another aspect which plays a major role in classification performance. This deals with how the classifier understands the samples. The features selected by the Feature set selection procedure are converted to real numbers by some kind of transformation so that the machine learning algorithm could understand them. Many representations such as binary, frequency, TF-IDF, hadamard representation have been used in several previous works on document classification [8, 9, 10, 11].
Classifier is the core component of all classification algorithms. A classifier is trained in the beginning of the algorithms with sample documents/text of the concerned authors. After the training phase, the classifier gains enough knowledge on the authors’ styles, so that it can identify the author of a given text. Different classifiers have been used such as Naïve Bayes [6], Support Vector Machine [11, 12], Neural networks [7] for author attribution in the past.
Books, legal documents or large text have been the subject of research in most of these works. Very less work has been done to identify authorship of small text such as emails, since there are several challenges in email author attribution as outlined in the previous section. Categorization of emails to custom folders was the focus of several works, such as using set theory [13], graph-based approach [14] or language models [15] for classification. Spam mails identification is another class of work done on emails [13]. De Vel et al. used attributes such as style markers as well as structured features [16]. It also used statistical count of vocabulary, functional words and number of short words for forensic investigations. Similar features were used to identify gender of the author, with the assumption that women’s language makes more frequent use of emotionally intensive adverbs and adjectives and that their language is more punctuated. Whereas, men’s language was assumed to have characteristics such as strong assertions, aggressive, self-promotion, rhetorical questions, authoritative orientation, challenges and humor [17]. Goodman et al. used a set of stylistic features to determine authorship. Statistics collected on these features such as number of words, number of sentences, number of special characters, etc. were used for further KNN or SVM classification [18, 19].
Most of the above mentioned works deal with multi-class classification. But novelty detection is all about one-class classification. Novelty detection focuses on how to find out any abnormality of a negative sample under a predefined condition. Here again, classifier forms the core component. Variants of several multi-class classifiers, such as Naïve Bayes [20], Support Vector Machine [11, 21], Neural networks [10, 21, 22] have been developed to deal with single class classification.
To the best of our knowledge, there has not been any novelty detection work over short text documents such as emails. Also, author attribution has not been or tested over a public email dataset. The main motivation of this work is to address these two areas.
The graph model based classifier
Architecture overview
As mentioned in the previous sections, novelty detection is a subset of classification problem, wherein the number of target classes is just one. As with any text classification algorithm, a 3-step procedure is followed: i) Feature set selection, ii) Training phase and iii) Testing phase. The proposed classification model analyses the lexical features of the email texts to learn the hidden pattern. This needs the raw email text to be pre-processed and converted into lexical tokens before being fed to any of these steps.
Pre-processing of text is a fundamental operation in text classification. This operation cleans unwanted characters or contents from the text and tokenizes the useful content of the text for further processing. As lexical features are the only features of interest, all special characters are stripped off in the pre-processing stage. Also, the text is converted to lower case, thereby removing the distinction between upper-cased and lower-cased words. All the email texts, both in training phase and in testing phase are preprocessed, tokenized and then forwarded to rest of the processing.
Feature-set selection is a critical phase in classification models. Efficiency of a model greatly depends on the features selected to identify the pattern. Here, all emails under the training set are converted to a comprehensive graph and the significant feature-set is identified. The detailed process will be explained in Section 3.2.
Classifier is the core component used in both the training and testing phases. In the context of prevention/detection of email impersonation, we need a classifier that could only be trained with positive samples. There are various approaches proposed to this problem in previous works, such as neural networks, one-class SVM and so on. We have opted to use one-class SVM classifier due to its performance advantage. Based on the technique used for text representation, we have defined two classification models – Probability Model (PM) and Inclusive Compound Probability Model (ICPM). Under training phase the one-class SVM classifier is trained with the feature vectors corresponding to the feature-set of each email in the training set. The classifier is fed with the feature vector of a new unclassified email in the testing phase to detect its legitimacy. Section 3.3 will provide details on the text representation schemes and Section 3.4 explains the underlying classifier algorithms.
Depending on the application requirement, the model parameters could be tuned to configure either a specific model or a sensitive model. A specific model could achieve high Specificity/True Negative Rate (TNR) and a sensitive model could provide high Sensitivity/True Positive Rate (TPR). TPR defines how effectively the model identifies the positive (author’s own) emails. TNR represents the model’s performance in identifying the illegitimate emails. Currently, a model with both high specificity and high sensitivity is not possible, as there is a trade-off between the two. More on this trade-off will be covered in the Results section.
Feature set selection
Feature-set selection is the process of selecting the set of features that best define the model, and in this case, these are the features that best define the author’s writing style. In the lexical view, words and
The process begins by converting the pre-processed documents into tokens of words. This conversion is done for all the documents under the given training set and merged to a comprehensive representation.
All of these tokens form feature set candidates and passed through an algorithm which could identify the set of features that could uniquely identify the author. Other than the word tokens, there are other derived parameters which could be considered as features. Stylometry techniques consider punctuation marks and statistics such as number of sentences, number of paragraphs, etc as features [18]. Combination of n adjacent words could be considered as features [23]. In this work we have come up with a technique to generate features from a graph model, where the nodes and links of the graph form the feature set candidates. The following paragraphs explain various existing techniques that could be employed to select the feature set and then the proposed method.
Frequency method.
One of the simplest ways for feature set selection is the frequency method. All the words are considered to be candidate features in this method. Feature set extraction involves filtering out the least frequently used words from the list of candidate features. This results in a set of words which are used more often by the author. More often a word is used by the author, greater is its probability to be part of the feature set. The frequency threshold used to divide the most frequently and least frequently used features is usually evaluated experimentally based on the performance of the underlying classifier.
TF-IDF method.
Information retrieval algorithms treat the most frequent words as stop words as they do not carry any significance. They state to give more importance to the rare words as they have more relation to the topic/category under consideration. So TF-IDF scheme was introduced, where TF is the term frequency and IDF is inverse document frequency [8]. IDF of a term
TF-IDF is the combination of both these values to arrive at the overall weight of each term and is calculated by multiplying TF of a term to its corresponding IDF. In TFIDF method [8], terms with less TF-IDF value are discarded and the threshold used to eliminate less weight terms is determined experimentally.
Graph-model based approach.
In this work, we propose a graph-model based feature set selection technique, wherein the tokens are adapted to a graphical representation, with words represented as nodes and adjacent nodes are connected with links. Rest of this subsection is dedicated to explain the existing graph representation methods [24] and the proposed n-simple distance
Under a general graph representation scheme, vertices and edges are used to represent a text document in the form of a graph. Each vertex in the graph represents a unique word in the document. If the word a is followed by a word b in the given section s of a document, it is represented as a directed edge from the node a to b in the graph. A graph
Standard and Simple representations: These are the simple forms, where each term is represented as a vertex and edge is created between two adjacent words. Irrespective of the number of times a word appears in the document, only one node is created to represent the word.
Frequency representation: In this representation, each node and edge is labeled with their corresponding frequency. For nodes it indicates how many times the associated word has appeared in the document. For edges it indicates how many times the connected terms have appeared adjacent to each other in the specified order. The benefit of this scheme is that it helps in identifying the significance of a node or edge on the author’s style.
In our work, we propose a new graph representation scheme by combining both n-simple distance and frequency representations in order to tap the advantages of both the schemes and hence effectively represent emails in the form of graphs. We call this scheme as n-simple distance
Conversion of email text to a graph.
Figure 1 shows how this scheme is used to convert an email text to a fully-connected graph structure. Frequency of the nodes is shown within the node’s circle and that of the edges are shown next to the edge. The node “thanks” has a frequency 2, as it has appeared twice in the email. And all the remaining nodes have a frequency 1 as they appear just once. Since this is a fully connected structure, each node is connected to all the nodes representing successive words in the sentence.
In a similar fashion, all emails under the training set are converted into graphs. These graphs are merged to form a consolidated graph, wherein the frequency of a node/link represents the frequency of the same in all the mails together. The consolidated graph provides a rough view of the overall words usage. For example, the number of nodes and links used to represent a particular user (refer Section 4.1 for details) with around 600 emails was roughly 2378 nodes and 50892 links. If the nodes and links are sorted based on their frequency, then we could safely say that the topmost nodes/links significantly dictate the author’s style. Hence, the top n% of nodes and top m% of links have been selected as the feature set, to be used for the classifier.
Top 25 nodes from a user model.
The challenge with this strategy is arriving at a precise value for the parameters m and n. It has been found experimentally that the ideal value of n for nodes is in the range 1–5% and the ideal value of m is of the range 0.5–2.5% (refer Section 4.4). Even with lesser feature set, classification is possible, with a drop in performance. More on the effect of feature size on performance is explained in detail in the Experimental Results section. To appreciate the significance of this strategy, consider the top 25 nodes from a particular user’s model depicted in Fig. 2. As it can be seen, the top 25 nodes of the model mostly consists of function words (‘the’, ‘is’, ‘or’, ‘of’, ‘on’, ‘and’, ‘to’, etc.) and certain generally used key words (‘lynn’) and these words are mostly not related to any specific topic or context. They correspond to the author’s style about which even the author may not be aware and they do not change with underlying topic or context.
Text documents need to be converted to an appropriate format which could be recognized by the classifier component. For any classifier using machine learning, probabilistic or heuristic algorithms, the recognizable format is that of a vector of real numbers. The text of a document is transformed to a feature vector of real numbers corresponding to the feature set, using some text representation techniques. Below are listed some of the schemes used for text representation and in the later part we explain the proposed schemes in detail.
Binary representation.
This is the simplest technique used for text representation. Here, the presence of a feature in a document is denoted with a 1 and its absence by 0. The feature vector is thus a vector of 0’s and 1’s. The frequency of occurrence of a feature is not captured in this representation. Thus a value of 1 corresponding to a feature in the feature vector means that the feature is in the document.
Frequency representation.
Converse to the previous scheme, this scheme gives importance to the number of times a feature occurs in a document. Each feature is given a value that is equal to the frequency of occurrence of the feature in the document. The feature vector is thus composed of the frequency of each feature in the document. Here, a value
TF-IDF representation.
Since most used words get more weight in frequency representation, function words end up getting more significance in that scheme. However, in most of the information retrieval algorithms, such words are treated as stop words and are simply discarded. So, in order to give more weight to rare words than such stop words and also to make use of the document frequency in determining weight, TF-IDF representation was introduced for text representation. IDF when combined with TF compensates the frequency of stop words and boosts the weight of rare words.
Hadamard representation.
This scheme was used extensively in [10, 11], where feature vectors are populated with the Hadamard product of the frequency of a feature in the document to that of the feature in the whole training set. If e is the frequency vector of all features in a given document and E is the frequency vector for the complete training set, then Hadamard product is the component-wise product of these two vectors computed using the formula in Eq. (2).
Probability representations.
In this paper, we propose to make use of the probability of occurrence of the features to represent them in feature vector. This resulted in two types of models, which will be explained in the following paragraphs. Both the schemes are explained here in context with the graph model.
In this model, the value of a feature in a given email is the probability of occurrence of the feature in the email. This model treats the whole email as a bag of features. For a feature node, it is the ratio of frequency of the corresponding word to total number of words in the email. The same computation applies for feature links as well. Mathematically, the probability of a feature node (Fnode) and feature link (Flink) are computed using Eqs (3) and (4), where n is the total number of nodes and l is the total number of links in the given email.
This model assigns value to a feature based on the probability of occurrence of the feature in a sentence in the email. Here, the attempt is to capture the pattern of feature usage in a sentence. The occurrence of a feature in two difference sentences of an email is mutually inclusive. Hence this is a case of inclusive compound probability (ICP). If
Classifier learns the pattern of the input feature vectors during training phase. Then using the gained knowledge, the classifier finds out the class to which a test sample belongs. Most of the classifiers were originally designed for multi-class classifications. With certain extended logic, these algorithms have been made to work with one-class classifications.
Prototype algorithm.
This is the Rocchio’s method of classification [8]. The basic idea in this algorithm is to represent each document as a vector
The maximum distance
Nearest neighbor algorithm.
This is another algorithm for single-class classification achieved by altering the basic nearest neighbor algorithm for multi-class. The maximum distance
Naïve Bayes method.
This algorithm uses the probabilistic details of features in the document to determine its class. The probability of a document d to belong to the class
Autoencoder.
This is a one-class classifier based on neural networks [21, 22]. The network is designed with three layers. The input and output layers have same number of nodes as the number of features in the feature set. The hidden layer has less number of nodes compared to the other layer, thereby forming a bottleneck forcing the network to learn the identity function to reconstruct output from input. If the network is able to reconstruct the test sample successfully from the input test sample vector, then the sample is classified to belong to the trained class.
One class SVM.
In this work, we have used one-class SVM following Scholkopf methodology [25]. This again is an enhancement on top of the two-class SVM. In this methodology, after transforming the features via kernel, they treat the origin as the only member of the second class. Then using some relaxation parameters, they separate the image of one class from the origin. Gradually the SVM learns the boundary of the given class.
The main idea of this algorithm is to return a function
subject to
where,
If
In this section, we compare the proposed models with existing mechanisms for feature set selection, text representation and classifier algorithm, using a public Email dataset. Then the proposed models are evaluated against all controlling parameters to understand their effects on performance and then finally we try to optimize the model.
Dataset and evaluation parameters
Before proceeding on to the description and evaluation of the proposed models, we will provide a brief overview of the dataset used and the parameters that will be used throughout this paper to evaluate the performance of the models.
Data collection
Emails are the documents of interest chosen for our case study. The Enron public email dataset has been used to experimentally evaluate the techniques suggested in this paper. The Enron corpus consists of a large set of email messages which was made public during the legal investigation concerning the Enron Corporation. It consists of the email records by 158 users, mostly senior management of Enron, organized into folders. This corpus is the only publicly available mass collection of real emails that can be used freely for study.
We have used the email messages from the “sent items” folders of the users for this study. The emails ranged over both personal and business communications. The messages were first filtered to discard emails where there was no contribution from the author, such as forward emails and also discarded contents of the message which was not directly contributed by the author. This filtration step reduced the size of the dataset significantly and resulted in 80 users with at least tens of email messages. Out of these, 50 users folders which had a minimum 200 messages were selected for this experimental analysis. Every experiment under this section was conducted over all these 50 users dataset. However for the sake of clarity, except for the comprehensive performance result (Fig. 5), results for only 10 selected users dataset have been illustrated.
Evaluation measures
To evaluate the result, we use standard measures for performance validation in classification problem. A True Positive (TP) is the state when a positive sample from the trained author’s set gets reported as positive. False Negative state (FN) is when a positive sample of the author gets classified as a negative sample. While testing the classifier with a sample message from any user other than the trained author, if it is classified as positive, then it results in False Positive state (FP). If the email gets classified as negative, then it falls to True Negative (TN) state. Based on these 4 states, we calculate the following parameters that well define the model performance.
Sensitivity/True Positive Rate (TPR): Sensitivity defined as
Specificity/True Negative Rate (TNR): Specificity defined as
Accuracy: It is the proportion of correctly classified cases among the total samples tested. Higher the accuracy, better the model in identifying the true cases, be it identifying author’s original emails as original or identifying the illegitimate case correctly. But note that accuracy could be misleading when either of TP or TN is dominating.
F
ROC curve: This Receiver Operating Characteristic (ROC) curve is drawn by plotting True Positive Rate (TPR) to False Positive Rate (FPR), where FPR is calculated as (1-TNR). FPR forms the
Experiment setup
Implementation
The experiments were performed using a standalone java program. A neo4j graph database [26] was used to store the comprehensive graph for each user. The java library libsvm [27] was used to implement one-class SVM. The experiment was performed with values of
Cross validation
All the experiments related to comparison and evaluation has been performed using 3-fold cross validation. Each user’s email messages have been divided into 3 parts. When the first and second parts were used to train the classifier, the third part was used as the positive test set. Test emails of the same size as that of the positive test set was randomly chosen from the remaining 49 users for negative test set.
Comparison with other techniques
Feature set selection
We compared the graph-model based feature set selection mechanism with other techniques such as frequency method and TF-IDF method. For implementing Frequency method, a comprehensive frequency-count vector was maintained, which was populated with the frequency of all the terms in the sample space of training set. In order to discard the least frequent terms, the vector was sorted to filter out the terms whose occurrence is less than the first quartile of the frequency distribution. Two schemes of text representation – Frequency representation and ICPM representation were used to analyze this feature set selection technique. Words were considered as terms under frequency representation. Whereas, nodes and links of the comprehensive graph were used as terms in ICPM representation.
For TF-IDF method, frequency of each term (
For Graph-model based method, a neo4j database was used in the backend of a java program to store all the vertices and links. Top 2.5% of vertices and 1% of links have been selected as feature set. Both frequency and ICPM representations were analyzed in this case as well.
All three feature-set selection methods with two types of text representation, totally 6 combinations, were tested along with the one-class SVM classifier corresponding to the model parameter
Comparison of feature set selection methods
Comparison of feature set selection methods
Comparison of feature set selection methods in terms of 
Four text representation schemes are compared with both the proposed schemes PM and ICPM. Since it is already proved that Graph-model based feature selection outperforms all other feature selection techniques, we have based this Text Representation experiment on this optimal feature set selection technique. Also the classifier has been chosen to be the one-class SVM at the
For Binary representation, feature vector of each document was populated with 1 if the corresponding feature was present in the document, else it was marked as 0. In Frequency representation, feature vector was populated with the number of occurrences of each feature in the document. TFIDF representation was implemented by maintaining the feature vector with product of TF and IDF values of each feature, where IDF was calculated using Eq. (1). For Hadamard representation, a frequency vector
PM and ICPM representations were implemented by creating feature vectors as per Eqs (3), (4) and (5). Performance comparison of all these text representation schemes is listed in Table 2. Sensitivity remains in the range of 0.5-0.6 for all of the representations, with ICPM topping the list. With respect to accuracy, hadamard representation closely follows ICPM. However, specificity of ICPM is almost perfect, and hadamard representation could achieve 0.87, which is much better than other schemes.
Comparison of text representation methods
Comparison of text representation methods
Comparison of text representation schemes in terms of 
The proposed classification method of graph model based feature selection in combination with ICPM text representation applied on a one class SVM classifier was compared with other classification models in this experiment.
The prototype algorithm was implemented by constructing the tf-idf vector
Comparison of classifier models
The nearest neighbor algorithm was implemented by calculating the distance from each document in the sample set to every other document in that sample space. The distance
Performance of Inclusive Compound Probability Model: Heat-maps corresponding to 
Effect of nu on performance.
Effect of feature size on specificity.
For Naïve Bayes algorithm, probability of a document
AutoEncoder was implemented with a 3 layer feed forward network. To improve the efficiency, feature set was selected from the top 1% of nodes and 0.5% of links, by discarding features with more than 90% of zeros – in other words the features that had presence in less than 10% of documents were discarded. For all epochs till 200, discarding the initial few epochs for stabilization, average square error at each epoch was calculated and recorded for all the documents in the sample space, to come up with the acceptable level of error (threshold), by discarding the highest 25 percentile error cases from training set.
For testing the proposed Graph FS
Effect of feature size on sensitivity.
Optimization of user models.
Table 3 compares the performance of all the above mentioned classification algorithms. The controlling parameter
Under this section, we would evaluate the performance of the proposed classification model and analyze the effect of the controlling parameters – nu(
Figure 5 gives an overview of the performance of the model through heatmaps. Each map corresponds to a particular nu (
Effect of
The parameter
When
Effect of feature size.
The number of features selected to uniquely represent a user style is called the feature size. In graph model, we define feature size as the percentage of top nodes (
Figure 7 shows the effect of feature size on specificity. With increase in feature size, the specificity increases. However at a feature size 1% of nodes and 0.5% of links, the model stabilizes and the specificity almost remains constant after that. Figure 8 shows the effect of feature size on sensitivity. It can be seen that for lower feature sizes below 0.25% of nodes and 0.1% of links, the sensitivity oscillates a lot and is not stable. Here again, it could be observed that a feature size more than 1% of nodes and 0.5% of links gives an almost stable sensitivity.
Optimization
As it was seen from the previous results, there is a trade-off between specificity and sensitivity. In order to get the ideal specificity, the sensitivity needs to be reduced to the range of 0.5–0.6. Though this ratio holds good for all the user models in general, we will be able to reach even better performance for each user model, if they are individually evaluated. Figure 9 shows the evaluation of four user models as an example case study. It could be noted that for User1, with a feature size of 1% of nodes and 0.5% of links at a
So, it can be concluded that based on the flexibility to handle larger feature size, performance could be improved. For the best user models, with very less feature size of 1:0.5, we could gain the best performance at lower
Discussion and conclusion
In the past decade, with the advances in technology, electronic communication has become the main mode of correspondence, which has resulted in extensive usage of short texts such as email, tweets, etc, for conveying messages. In fact, emails are used not only for conveying messages but also for other applications such as, official data transfer (electronic documents), highly secured data transfer (password/PIN numbers), private data transfer (personal information) and even to transfer highly sensitive information (company secrets). When a user sends an email from her mailbox, the sender guarantees the authorship of the corresponding email. The receiver can consider this email as significant as a signed letter from the sender. Even in forensic investigations or legal procedures, an email is granted a lot of significance to stand as a valid proof.
With the novelty detection mechanism proposed by this research work, an additional security interface to detect and prevent email impersonation could be provided by the email servers which would be difficult to break, as manipulating someone’s style of writing is quite tedious. A similar kind of intelligence model could also be used in forensic investigations, where the objective is to find whether a particular suspect has written the given malicious email or not.
In conclusion, we have tried to address the shortcomings of novelty detection mechanisms for short texts through a graph-model based feature extraction technique and a novel text representation scheme. The graph-model based feature extraction technique was compared with other feature-set selection techniques and proved to be much efficient than the existing mechanisms for short texts. Also the ICPM text representation technique is proved to perform better than other methods.
Currently the classification model has been tested only on emails. This could very well be extended for even smaller text messages such as tweets or SMSs. The classification model in this work has focused only on the lexical features of text. Analysis of character and syntactic features could provide further insight into the writing style. Stylometric analysis based on the statistics such as average number of paragraphs, average number of sentences, etc, could also help further improve the performance in novelty detection. Some of the future directions for this work would be to combine this model with character/syntactic features analysis and stylometric analysis as well to achieve even better performance.
Footnotes
Acknowledgments
This research was supported by the Basic Science Research Program through the National Research Foundation (NRF) of Korea funded by the Ministry of Science, ICT, and Future Planning (MSIP) (grant number 2014R1A1A3051169 & 2016R1D1A1B03934129) and was partially supported by the Ajou University research fund.
