Abstract
In this paper we present an approach for novelty detection in text data. The approach can also be considered as semi-supervised anomaly detection because it operates with the training dataset containing labelled instances for the known classes only. During the training phase the classification model is learned. It is assumed that at least two known classes exist in the available training dataset. In the testing phase instances are classified as normal or anomalous based on the classifier confidence. In other words, if the classifier cannot assign any of the known class labels to the given instance with sufficiently high confidence (probability), the instance will be declared as novelty (anomaly). We propose two procedures to objectively measure the classifier confidence. Experimental results show that the proposed approach is comparable to methods known in the literature.
Introduction
The problem that we consider in this paper is as follows. Given a general document understanding system, we design and implement a module to identify previously unobserved documents. Later, such documents must be incorporated into the system knowledge. In this sense, we can talk about novelty detection rather than outlier or anomaly detection.
Document understanding systems can be defined as software solutions which automatize immediate processing of administrative documents with minimal human intervention [2]. These systems receive documents as input. Documents can be very different in terms of their structure and include invoices, forms, contracts, requests, letters etc. The task is to find and extract relevant information from these documents. For example, from the electricity invoice the system can extract and save in database information like date, total amount and customer name. Later, these data can be used for faster document searching or as input to the decision support systems.
Information extraction from documents is a very challenging task. This problem has occupied the attention of many researchers and practitioners for decades. The general approach in information extraction is to create and maintain a separate extractor model for every known document class. This means that document understanding systems usually contain a classification module and, for each class, an information extraction module.
The basic use case in a document understanding system is as follows [1,6,8,16]. The system starts with initial knowledge. This means that available training dataset is used to learn classifier and for every class to learn extractor. Then, documents are arriving from the stream. The system takes the current document and runs the classification module. After that, the document is sent to the appropriate information extraction module responsible for extraction of target information from documents belonging to the recognized class. In this scenario it is very important that document class is correctly recognized because it determines which information extraction module is called. Also, the system must be able to recognize the appearance of the previously unobserved classes. Otherwise, novelty will be wrongly assigned to one of the existing classes and its extractor will be called resulting in wrong extraction of desired fields.
Consequently, the classification module must be able to recognize a novelty and indicate such a situation to the user. The user should then define the new class and its target fields, and run specific procedures to learn the extractor. In this study, however, we concentrate only on the novelty detection task.
We present classification based approach for novelty detection. Bearing in mind that our system starts with initial knowledge generated with training dataset containing instances of only known classes, the proposed method can be referred to as semi-supervised and classification based approach for outlier detection with multi-class classifier.
We compare different approaches in performing novelty detection and different machine learning models that can be used as a basis. Implementation is written in Python, with scikit-learn (scikit-learn.org [14]) and NLTK (nltk.org [3]) being the main libraries used for this purpose. Apart from them, we use the accompanying modules frequently present in machine learning and natural language processing tasks (numpy [12], matplotlib, re etc.). Experimental results indicate that the proposed method is comparable to the best solutions reported in the literature.
The paper is organized as follows. The next section discusses problems and algorithms connected with this study. Motivation, novelty and contribution of the proposed methods are discussed in the third section. Details about the proposed methods are presented in the fourth section. After that we explain the experimental protocol and discuss the results. Finally, we give conclusions and possible extensions.
Main contributions
The motivation for this research is to extend general document understanding systems with the capability to differentiate novel classes from the known classes on which the classifier module is trained. It is very important that the document class is correctly recognized because it determines which information extraction module is called.
We must extend the classification module and make it able to recognize the appearance of previously unobserved classes. Otherwise, a novelty can be wrongly assigned to one of the existing classes and its extractor will be called resulting in wrong extraction of desired fields.
The solution that we propose in this paper is based on the idea of using the classification module not just for discriminating among the classes available in the training dataset but also for recognizing novelties. This approach can be categorized as a multi-class classification-based method for novelty detection. We propose two variations.
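Let the prediction for the new document d arriving from the stream be the probability distribution $P(d) = (p_1, \dots, p_k)$ over the k known classes.
With $p_i = P(c_i \mid d)$ we denote the estimated probability that d belongs to the class $c_i$.
Let $p_{(1)} = \max_i p_i$ be the largest probability in $P(d)$. In max-confidence novelty detection, d is declared a novelty if $p_{(1)} < \epsilon$.
Let $p_{(2)}$ be the second largest probability in $P(d)$. In confidence-distance novelty detection, d is declared a novelty if $p_{(1)} - p_{(2)} < \epsilon$.
Threshold value ϵ is a parameter of the algorithm. We propose heuristics to estimate optimal ϵ value for both max-confidence novelty detection and confidence-distance novelty detection.
In the case that there are only two classes in the training set, max-confidence detection with threshold $\epsilon$ is equivalent to confidence-distance detection with threshold $2\epsilon - 1$.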
Related work
Anomaly or outlier detection is the problem of identifying patterns in data that are very different with respect to expected behaviour [14]. Such data points are usually referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities or contaminants in different application domains [4].
Tasks related to but distinct from the anomaly detection are noise removal [17] and noise accommodation [15]. Noise is unwanted data points that usually must be eliminated before any data analysis. Noise removal is to eliminate noise from dataset for analysis [17]. Noise accommodation refers to implementing techniques and models that are robust to noise [15].
Anomaly detection has been researched intensively within a wide variety of research areas and application domains. The problem originates from the 19th century, when procedures for detecting outliers or anomalies were proposed in the statistics community [4]. Since then, a variety of techniques have been developed, designed either for general purposes or for specific application domains.
Main challenges in anomaly detection are recognized in the literature. For example, it is very difficult to specify the boundary between normal and not normal instances, especially in domains where it is possible that instances from both classes evolve [4]. In other words, current definition of normal or not normal behaviour can significantly change. In addition, different domains generally imply different definitions of an anomaly, so it is not possible to directly apply technique from one domain to another. Finally, a major issue is availability of representative and labelled dataset for training and testing of models [4].
Applicability of anomaly detection methods is usually determined by the nature of the given dataset. Dataset is a collection of entities also referred to as objects, records, points, vectors, patterns, events, samples, observations [4]. Each entity is described with a set of attributes also referred to as variables, features, dimensions [4]. When only one attribute is assigned to an entity, such dataset is called univariate. When multiple attributes are assigned to an entity, such dataset is called multivariate.
Anomalies can be classified as point, contextual (conditional) and collective [4]. A point anomaly is an individual entity that can be considered non-conforming with respect to the whole dataset. A contextual anomaly is an entity that can be considered anomalous in a specific context but not otherwise. In this case, every entity is described with contextual and behavioural attributes. The contextual attributes determine the context (for instance, location, time etc.). The behavioural attributes describe the non-contextual characteristics of an entity (for example, the temperature at any location). They are used to determine whether an entity is an anomaly or not. An entity can be anomalous in a specific context, while the same data instance could be considered normal in a different context. For example, a temperature in June could be an anomaly even though the same temperature could appear in January and be considered normal. A collective anomaly is a collection of entities whose occurrence together is anomalous, although a single entity from the collection may not be an anomaly by itself.
Anomaly detection methods can work in supervised, semi-supervised and unsupervised mode [14]. The supervised mode assumes the availability of a training dataset with instances from both known and unknown classes. In this mode, a predictive model capable of distinguishing between known and unknown classes is usually built. The semi-supervised mode assumes that the training dataset contains only instances of the known class. The general solution in this mode is to learn a model corresponding to the known class, and use that model to recognize anomalies in the test data. If an instance cannot be assigned to the known class, it is declared an anomaly. Techniques that work in unsupervised mode do not require a training dataset, but assume that anomalies are far less frequent than instances belonging to the known classes.
Anomaly detection techniques can be categorized as classification based, clustering based, nearest neighbour based, statistical, information theoretic and spectral [4]. These techniques cover a number of domains, including cyber-intrusion detection, fraud detection, medical anomaly detection, industrial damage detection, image processing, textual anomaly detection, sensor networks [4].
In the text data domain, anomaly detection is primarily devoted to novelty detection: novel topics, events or news stories. One of the first formulations of this problem in this domain is the First Story Detection – FSD problem [4]. The FSD task consists of detecting the first story related to an event that is not covered by documents (originally news articles) from the stream. Incremental clustering methods are the usual methods for solving the FSD task [14]. The idea is to form clusters from articles related to known events. When a new article arrives, it is compared to the existing clusters to determine whether it belongs to them (a redundant article describing a known event) or represents a new event (novelty). The main challenges in the text data domain regarding anomaly detection are related to high dimensionality (the curse of dimensionality) and sparsity.
In [18] the authors propose a filtering system to determine relevant and redundant documents from the stream. The system can be classified as event level novelty detection, because relevancy and redundancy are defined with regard to events described in the document. Relevant documents from the stream contain information that is relevant to the end user. If information is relevant but already known from relevant documents in the stream, such documents are redundant. Otherwise, such documents are novel. Relevancy and redundancy/novelty are measured separately. Relevant documents are similar to previous relevant documents in the stream in terms of covering the same topic. Novel documents are dissimilar to previous relevant documents in the stream in terms of containing new information. Because these two targets are contradictory, the authors claim that it is necessary to model them explicitly and separately. They propose a two-phase filtering system. In the first step relevance filtering is done. Only documents that are relevant are sent to the redundancy/novelty detection module. Redundancy/novelty filtering distinguishes between documents containing new information (novelty) and documents that are relevant but contain previously known information (redundancy). The authors propose several procedures to measure redundancy based on document-to-document similarity, along with corresponding thresholds which can be learned and dynamically adapted.
A similar approach is presented in [7]. The authors propose an entropy-based procedure to produce a novelty score for incoming documents. Documents with a novelty score above some threshold (for instance average + standard deviation) are novel. Otherwise, they are non-novel.
In [10] the authors propose a method and introduce a dataset for document level novelty detection. Of great significance is the effort to create a universal dataset, namely TAP-DLND (document level novelty detection), to provide a benchmark for objective performance estimation of different algorithms from the literature. The authors define document level novelty detection as a binary classification problem. Already processed documents constitute the class of known documents. An incoming document is represented with a vector of features and sent to a classification model to determine whether it is novel or redundant (already seen). The model is built with the Random Forest algorithm. Different procedures for creating the document vector are tested: paragraph to vector, word to vector, n-grams, named entities and keyword match, new word count, divergence. The authors claim that the proposed method with the new word count procedure for feature generation outperforms the other algorithms.
A similar method, but with Convolutional Neural Networks as the classification model, is used in [9]. In addition, for document representation the authors propose a procedure called Relative Document Vector. The main idea is to recognize sentences from the incoming document and represent each of them with the closest (based on cosine distance) sentence from the already seen documents.
In [13] authors propose the method TONMF – text outliers using non-negative matrix factorization. It is the first approach based on non-negative matrix factorization (NMF) method to solve anomaly detection in text data. Documents are represented as bag of words matrix in which columns correspond to documents and rows correspond to terms (words). The main idea is to write term-document matrix as follows:
In the previous equation
Finally, the authors present results from various experiments comparing TONMF against many baseline algorithms on many datasets. The conclusion is that their approach achieves better performance than traditional methods for outlier detection in text data. Because of that, we compare our results to the TONMF method and report a higher AUC score on the same task.
The one-class classification approach
The first method we examine is based on training a classifier to directly distinguish between novelties and regular documents. We can consider all regular documents as belonging to one class, and everything outside that class as a novelty. The OneClassSVM model in sklearn performs exactly this. It uses a support vector machine at its core, and trains it to recognize regions of the feature space that are densely populated with training data as corresponding to regular documents. On the other hand, new instances that are mapped into the sparse regions of the feature space, which contain few or no training examples (that is, almost the entire space), are labelled as novelties.
One thing that we are particularly interested in is how well the OneClassSVM detector performs with varying values of its ν parameter. The ν parameter is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. It takes values from the interval $(0, 1]$.
In our case, we find this parameter important because it allows us to control what percentage of training examples will be treated as initial outliers. This can be particularly useful when working with ‘polluted’ training sets, but also for our purposes, since we can have documents that are significantly different from most of the documents in their respective classes.
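The effect of ν described above can be illustrated with a minimal sketch. It assumes synthetic Gaussian vectors as a hypothetical stand-in for document features, and the helper name `training_outlier_fraction` is ours; the values of ν are chosen only for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical stand-in for document feature vectors: one dense cluster.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 5))


def training_outlier_fraction(nu):
    """Fit a one-class SVM and return the share of training points it rejects."""
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X_train)
    # predict() returns +1 for inliers and -1 for outliers (novelties).
    return float(np.mean(model.predict(X_train) == -1))


# Larger nu lets the model treat a larger share of the training set as outliers.
fractions = {nu: training_outlier_fraction(nu) for nu in (0.05, 0.2, 0.5)}
```

In this sketch the rejected share of the training set grows with ν, which is the property that makes the parameter useful for polluted training sets.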
Novelty detection based on classifier confidence
Now we turn to a different approach. Instead of doing the previously explained, direct kind of classification, we can try to use a classifier that outputs a probability distribution over the set of all training class labels, and infer whether a certain document is unusual from this probability distribution.
So, let us assume we have a classifier that, given an input feature vector
A number of different machine learning models can be used to achieve this distribution, including both linear and non-linear, nearest neighbours and purely probabilistic models, which work this way by default. Some models, however, are not suitable for this approach, most notably support vector machines, that do not inherently provide probabilistic classification.
Here we discuss two procedures for novelty detection based on distribution P. Both of them can be depicted with the simplified scheme (pipeline) shown in Fig. 1.

Simplified scheme for methods based on classifier confidence.
Both procedures depend on a parameter ϵ. As we will see shortly, it can be a bit tricky to predict which values of ϵ give satisfying results. We examine how proposed procedures perform when using three different underlying classifiers: logistic regression as an example of a linear model, a naïve Bayes model as a probabilistic classifier, and a neural network as an example of a non-linear model. One can also use other models that provide probability estimates, like nearest neighbours, decision trees and random forests.
Let
Although
Here we present a heuristic to estimate optimal ϵ. As noted earlier, the classifier prediction for a document d contains probabilities that d belongs to each class.
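Generally, the final predicted class label is the one for which the estimated probability is maximal. In other words, the classifier will predict the label $\hat{c}(d) = \arg\max_{c_i} P(c_i \mid d)$.
The classifier confidence threshold estimate is calculated as the average of the probability estimates of the predicted class over the training set. Formally: $\hat{\epsilon} = \frac{1}{|D_{train}|} \sum_{d \in D_{train}} \max_{c_i} P(c_i \mid d)$, where $D_{train}$ denotes the training set.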
This estimation generally gives good but not optimal performance. Machine learning practitioners can therefore use the estimate as an initial guess, before fine tuning the model hyperparameters.
However, we can only check the goodness of this estimation empirically, and we will see that an optimal value can vary substantially for different underlying classifiers, with different parameters and for different datasets. We refer to this procedure as max-confidence novelty detection.
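A minimal sketch of max-confidence novelty detection with the averaged-confidence heuristic might look as follows. It assumes logistic regression as the underlying classifier and uses hypothetical two-dimensional toy data instead of tf-idf vectors; the function names `fit_max_confidence` and `flag_novelties` are ours and not part of any library.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_max_confidence(X_train, y_train):
    """Train a probabilistic classifier and estimate the confidence threshold."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # Heuristic: average probability of the predicted class over the training set.
    epsilon = float(clf.predict_proba(X_train).max(axis=1).mean())
    return clf, epsilon


def flag_novelties(clf, X, epsilon):
    """Return 1 where no known class reaches confidence epsilon, else 0."""
    return (clf.predict_proba(X).max(axis=1) < epsilon).astype(int)


# Hypothetical toy data: two well separated known classes in 2-D.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal([-3.0, 0.0], 1.0, size=(100, 2)),
    rng.normal([3.0, 0.0], 1.0, size=(100, 2)),
])
y_train = np.array([0] * 100 + [1] * 100)

clf, epsilon = fit_max_confidence(X_train, y_train)
X_new = np.array([
    [0.0, 0.0],  # between the two classes: low classifier confidence
    [6.0, 0.0],  # deep inside the region of class 1: high confidence
])
flags = flag_novelties(clf, X_new, epsilon)
```

The estimated threshold is only a starting point here as well; in practice it would be tuned further, as discussed above.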
Examining the distance between two largest probabilities in distribution
Let
We refer to this procedure as confidence-distance novelty detection, even though it uses a similar idea to the previous one, to emphasize the fact that it examines the absolute difference between the two highest probability estimates.
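The confidence-distance idea can be sketched directly on a matrix of class probabilities, independently of the classifier that produced them. The function names and the example probabilities below are hypothetical, and we assume that a document is flagged when the gap between its two largest probabilities falls below ϵ.

```python
import numpy as np


def confidence_distance(proba):
    """Difference between the two largest probabilities in each row."""
    top_two = np.sort(proba, axis=1)[:, -2:]  # columns: second largest, largest
    return top_two[:, 1] - top_two[:, 0]


def flag_by_confidence_distance(proba, epsilon):
    """Return 1 where the top two classes are closer than epsilon, else 0."""
    return (confidence_distance(proba) < epsilon).astype(int)


# Hypothetical predictions for two documents over three known classes.
proba = np.array([
    [0.90, 0.05, 0.05],  # one clearly dominant class
    [0.40, 0.35, 0.25],  # two classes compete
])
flags = flag_by_confidence_distance(proba, epsilon=0.5)
```

With these inputs the first document has a gap of 0.85 and is kept as regular, while the second has a gap of 0.05 and is flagged as a novelty.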
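Now, in the case that there are only two classes in the training set, we can see that max-confidence detection with threshold $\epsilon$ is equivalent to confidence-distance detection with threshold $2\epsilon - 1$.
Let $p_{(1)} \ge p_{(2)}$ be the two probabilities in the distribution P for a document d.
Since P is a probability distribution, we have $p_{(1)} + p_{(2)} = 1$. Now it is easy to see that $p_{(1)} - p_{(2)} = 2p_{(1)} - 1$, so $p_{(1)} < \epsilon$ holds if and only if $p_{(1)} - p_{(2)} < 2\epsilon - 1$.
Here we present a heuristic to estimate optimal ϵ for confidence-distance. As before, the classifier prediction for a document d contains probabilities that d belongs to each class.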
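The two-class relationship between the two procedures can be checked numerically. The sketch below assumes that a max-confidence threshold ϵ corresponds to a confidence-distance threshold of 2ϵ − 1, which follows from the two probabilities summing to one; the random probabilities and the value of ϵ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
p1 = rng.uniform(0.5, 1.0, size=1000)  # larger of the two class probabilities
p2 = 1.0 - p1                          # the two probabilities sum to one

epsilon = 0.8
max_conf_flags = p1 < epsilon                    # max-confidence rule
conf_dist_flags = (p1 - p2) < (2 * epsilon - 1)  # confidence-distance rule

# Both rules should flag exactly the same documents.
agree = bool(np.array_equal(max_conf_flags, conf_dist_flags))
```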
The dataset used to evaluate the implemented models is Reuters-21578 ApteMod corpus, loaded from the nltk API. It contains 90 different categories of documents, divided into 7769 training documents and 3019 testing documents. Some of the documents are multi-labelled, i.e. belonging to more than one category, but we filter those out. Filtering leaves us with 6577 single labelled documents in the training set, divided into 58 categories, and 2583 documents in the test set, divided into 59 categories. All documents are written in English.
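The filtering of multi-labelled documents can be sketched as follows. The function assumes a corpus object exposing `fileids()`, `categories(fileid)` and `raw(fileid)`, as `nltk.corpus.reuters` does, with file ids prefixed by `training/` or `test/`; the `_ToyCorpus` class is a hypothetical stand-in so that the example runs without downloading the corpus.

```python
def single_label_split(corpus):
    """Split a Reuters-style corpus into single-labelled train and test sets."""
    train, test = [], []
    for fid in corpus.fileids():
        labels = corpus.categories(fid)
        if len(labels) != 1:
            continue  # drop multi-labelled documents
        target = train if fid.startswith("training/") else test
        target.append((corpus.raw(fid), labels[0]))
    return train, test


class _ToyCorpus:
    """Hypothetical stand-in that mimics the three corpus methods used above."""

    _docs = {
        "training/1": (["earn"], "profit rose"),
        "training/2": (["earn", "acq"], "merger and profit"),
        "test/3": (["crude"], "oil output"),
    }

    def fileids(self):
        return list(self._docs)

    def categories(self, fid):
        return self._docs[fid][0]

    def raw(self, fid):
        return self._docs[fid][1]


train, test = single_label_split(_ToyCorpus())
```

With NLTK installed and the corpus downloaded, the same function could presumably be called as `single_label_split(nltk.corpus.reuters)`.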
We use
Top categories of documents in the Reuters dataset
From this dataset, smaller datasets will be constructed. At first it is assumed that no anomalous instances are present in the training set; however, in the final sections of the paper an extension of the proposed methods that works with polluted datasets is given. In the context of novelty detection, after training a model on a certain set of documents, we want that model to be able to distinguish between documents that are familiar, in the sense that they are very similar to something the model has already seen, and the ones that represent something new, or differ significantly in some aspect(s) from everything seen up to that point.
Our approach in defining novelties is similar to the way authors usually define outliers in a textual dataset, in a sense that we do not take semantics of documents into account. The main difference is that our confidence based methods represent semi-supervised learning paradigm, as opposed to an unsupervised setting where one does not know the categories to which documents belong. So, instead of finding examples that are unusual compared to most of the documents in a set, we fit a model to an initial set of documents and expect it to find examples that are substantially different from those in that initial set, i.e. the ones that do not fit into any of the initial categories.
Let’s assume we have chosen k classes
For testing, unless stated otherwise, the entire
Before getting started with text vectorization and applying machine learning methods, we ought to pre-process the available data. This includes converting all characters to lower-case, removing punctuation symbols, numbers, stop-words (words that appear frequently in most of the documents and carry no particularly useful information) and possibly other elements of textual data that are irrelevant to the task of recognizing new, unusual text documents. Also, we perform word lemmatization, that is, converting words into their base form (plural nouns to singular, verbs in various forms to infinitive etc.). To perform these tasks, we combine Wordnet lemmatizer, NLTK utilities and regular expressions. An example of these transformations is illustrated in Fig. 2.
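The pre-processing steps listed above can be sketched with the standard `re` module. The stop-word list below is a deliberately tiny, hypothetical one, and the lemmatizer is passed in as a callable that defaults to the identity, so the sketch runs without any NLTK data; the function name `preprocess` is ours.

```python
import re

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is longer.
STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "in", "to", "and", "is", "was", "were"}
)


def preprocess(text, lemmatize=lambda word: word, stop_words=STOP_WORDS):
    """Lower-case, strip non-letters, drop stop-words and lemmatize tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation, digits, symbols
    tokens = [t for t in text.split() if t not in stop_words]
    return [lemmatize(t) for t in tokens]


tokens = preprocess("The prices of 3 stocks were rising, in 1987!")
```

With NLTK and its WordNet data available, one could pass `nltk.stem.WordNetLemmatizer().lemmatize` as the `lemmatize` argument to obtain base forms instead of the raw tokens.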

Preprocessing textual data. Past tense is converted to infinitive, nouns in plural form are converted into singular etc.
Next, there are many ways to represent text documents in vector form, from the simplest ones, such as classic bag of words representation, to more complex approaches, such as tf-idf weighting, using word or character n-grams, shingles and so on. We experimented with plain bag of words models and a tf-idf weighting model, and found (not surprisingly) the latter to be slightly superior.
We use scikit-learn’s TfidfVectorizer tuned to consider the top N terms (i.e. words, tokens) based on their frequency over the entire training set. On Reuters data,
For each document d, tf-idf scores are computed over the top N terms:
It is important to note that out-of-vocabulary words (i.e. the ones that were not included in top N terms over the training set) are not encoded in any particular way. In other words, they are simply ignored – the vectorizer takes into account only the words present in initial vocabulary and computes their tf-idf scores. From a statistical viewpoint, this is actually justified for the task of novelty detection when we think about the nature of tf-idf features. If a document contains many out-of-vocabulary words and the rest are mostly common words likely to appear in an average document, its tf-idf vector will have a very small norm, since unknown words are not included, and the common ones are penalized for being common and carrying no specific information. On the other hand, documents from known classes are expected to have relatively larger vector norms, since they contain words specific to those classes, and those words contribute large tf-idf scores. A document with a small norm is then likely to be classified as novel, as its feature vector belongs to an unexplored region of feature space.
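The treatment of out-of-vocabulary words can be illustrated with a small sketch. It assumes a hypothetical four-document training set and sets `norm=None` so that raw tf-idf magnitudes are visible (scikit-learn's default is L2 normalisation, under which every non-zero vector has unit norm); the value of `max_features` is chosen only to fit the toy data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training documents from two known classes.
train_docs = [
    "oil price crude barrel",
    "crude oil export barrel",
    "profit dividend share quarter",
    "share profit quarter loss",
]

# Keep only the 6 most frequent terms; norm=None is an assumption made here
# so that vector norms reflect the raw tf-idf scores.
vectorizer = TfidfVectorizer(max_features=6, norm=None)
vectorizer.fit(train_docs)

new_docs = [
    "crude oil barrel",     # all words are in the learned vocabulary
    "football match goal",  # all words are out of vocabulary
]
norms = np.linalg.norm(vectorizer.transform(new_docs).toarray(), axis=1)
```

The second document is mapped to the zero vector because none of its words were seen during fitting, while the first one receives a vector with a clearly positive norm.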
It is also necessary to have a way of evaluating the performance of a novelty detector. This task is actually quite similar to pure classification, so we are going to use the same performance metrics. First of all, since we assign label 0 to normal documents, and 1 to novelties, the confusion matrix will have the form shown in Table 2.
Confusion matrix for the task of novelty detection
Now we can use the metrics that are computed from entries in this matrix: accuracy, precision, recall, F1-score and so on.
As always, we would like our model to have a high accuracy, i.e. the ratio of correctly classified instances to the total number of instances in the set. As is usually the case in machine learning, there will be a trade-off between precision and recall. In the setting of the problem we are trying to solve, recall can be considered slightly more important than precision. In other words, we would rather have a model that, among all true novelties, detects as many of them as possible, at the expense of raising a higher number of false alarms, than the other way around.
At the end, we will also consider ROC-curves (Receiver Operating Characteristic) as a way to measure the performance of a novelty detector. The idea is to graph the true positive rate against the false positive rate as the detection threshold varies.
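The evaluation described above can be sketched with scikit-learn's metric functions. The labels and confidence values below are hypothetical, and we assume that a document is flagged as a novelty when its classifier confidence falls below ϵ; since a ROC curve needs a score that grows with novelty, the sketch uses one minus the confidence.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical ground truth (1 = novelty, 0 = normal) and classifier confidence.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
confidence = np.array([0.95, 0.90, 0.55, 0.85, 0.40, 0.60, 0.45, 0.92])
epsilon = 0.7

# Flag a document as a novelty when its confidence is below the threshold.
y_pred = (confidence < epsilon).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Low confidence should mean a high novelty score, hence 1 - confidence.
auc = roc_auc_score(y_true, 1.0 - confidence)
```

On this toy input the detector finds all three novelties at the cost of one false alarm, so recall is 1.0 and precision is 0.75.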
At the beginning, the performance of OneClassSVM model is analysed. Obviously, in a classification task, ν parameter can be tweaked to control the bias-variance trade-off. That can also be accomplished by tuning the γ parameter when using a Gaussian kernel
Performance measures of discussed methods on Reuters data
It should be noted that recall increases when ν is increased and reaches the extreme value of 1 when ν is sufficiently large. This is because larger ν means larger regularization penalty, and in the extreme case the model is so underfitted that it simply classifies everything as novelty. However, one cannot just aim for such an increase as it damages other performance metrics, and scores in Table 3 show a balanced case. The results are not exactly bad, but not impressive either.
Also, experimenting with different values of gamma did not seem to lead to anything substantially better than this. To conclude, we notice from Table 3 that even a small increase in the number of known categories at the beginning (in the training set) caused performance of this kind of detector to decline.
We now turn our attention to particular machine learning models that we can use as classifiers that will provide probability estimates for max-confidence methods.
All the classifiers that are about to be discussed achieve between 97% and 98% classification accuracy on 2-class and 5-class subsets of Reuters data (i.e. using the same k classes for training and testing set). Also, the weighted averages of precision, recall and F1-score fall into the same range.
Figure 3 illustrates the behaviour of accuracy, precision and recall of max-confidence models for varying threshold values, on different datasets and for various underlying classifiers. In all the graphs, the ϵ parameter takes values from a subinterval of the largest theoretically allowed interval (e.g.
First off, logistic regression is tested for detecting novelties through a max-confidence scheme. Fig. 3 illustrates its performance.
There is an obvious, major improvement when compared to the one-class approach using ν-SVMs. One can obtain both accuracy and recall at around 90%, with precision just below 70% just by searching for a good value of confidence threshold in a simple linear model.
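Searching for a good confidence threshold can be sketched as a simple sweep over candidate values of ϵ. The function name `sweep_thresholds`, the confidence values and the candidate thresholds are hypothetical; the sketch assumes novelties are the documents whose confidence falls below ϵ.

```python
import numpy as np


def sweep_thresholds(confidence, y_true, thresholds):
    """Return (epsilon, accuracy, precision, recall) for every threshold."""
    rows = []
    for eps in thresholds:
        y_pred = confidence < eps  # True means "flagged as novelty"
        tp = np.sum(y_pred & (y_true == 1))
        fp = np.sum(y_pred & (y_true == 0))
        fn = np.sum(~y_pred & (y_true == 1))
        tn = np.sum(~y_pred & (y_true == 0))
        accuracy = (tp + tn) / len(y_true)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        rows.append((float(eps), float(accuracy), float(precision), float(recall)))
    return rows


# Hypothetical ground truth (1 = novelty) and classifier confidence values.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
confidence = np.array([0.95, 0.90, 0.55, 0.85, 0.40, 0.60, 0.45, 0.92])
rows = sweep_thresholds(confidence, y_true, thresholds=[0.5, 0.7, 0.99])
```

Raising ϵ can only flag more documents, so recall never decreases along the sweep, while precision and accuracy eventually drop once regular documents start being flagged.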
The next classifier we examined is multinomial naïve Bayes (also in Fig. 3). It performs worse than logistic regression (particularly in terms of recall) on the 2-class task and behaves almost identically on 5 classes. Since this model generally works well for text classification, it is definitely worth experimenting with. Intuitively, we would expect it to achieve good results when there is a larger number of categories in the training set, and Fig. 3 confirms this, since it works better when faced with 5 initial classes than with 2.
When it comes to using a neural network as a backbone of the detector, things expectedly get a bit more complicated. With logistic regression and naïve Bayes we did not have to manage trade-off, but now we have a robust, non-linear classifier, capable of learning incredibly complex decision functions. Unfortunately, it is also very prone to over-fitting, so regularization plays an important role here. We basically have a situation similar to the one with ν-SVM approach. Here we try to tune the

Performance of max-confidence models. Rows correspond to different evaluation metrics, and columns correspond to different datasets. Abbreviations: LR – Logistic Regression, NB – Naïve Bayes, NN – Neural Network.
At the end, we tested the same three machine learning models as a basis of confidence-distance methods. All the results are shown in Table 3 for easy comparison with OneClassSVM. Notice that for the 2-class case, confidence-distance results are not shown because they would be the same as max-confidence results, due to Theorem 1. The results for the 5-class case also turned out to be very similar to max-confidence.
To test the robustness of the proposed methods further, a different experiment is performed. We took the

Accuracy of max-confidence models in a leave-one-out setting. Each group of bars has a label that indicates which class was dropped from the training set in that particular case. Abbreviations: LR – Logistic Regression, NB – Naïve Bayes, NN – Neural Network.
We notice that the naïve Bayes classifier achieves roughly the same accuracy in every setting. However, for the logistic regression and neural network classifiers, it is difficult to achieve the same performance in every sub-setting with the same threshold, since the earn category contains more documents than all the other 4 categories combined, so the setting where it is dropped from the training set needs a different threshold for better accuracy. On the other hand, dropping any of the three categories that are similar in terms of the number of documents (crude, trade, money-fx) does not affect the performance of max-confidence detectors.
We have also experimented with graphing ROC curves and calculating areas under them for the proposed methods. Table 3 shows the AUC scores for max-confidence and confidence-distance methods, as well as for OneClassSVM.
The best score we achieved when training on

ROC curves for the proposed methods. Abbreviations: LR – Logistic Regression, NB – Naïve Bayes, NN – Neural Network.
Judging by the experimental results shown in Fig. 5, we can see that max-confidence methods backed with a logistic regression model or a neural network outperform the other two methods in terms of area under the curve score on Reuters data.
In this section we present the results of an experiment where the method proposed in the paper is compared to the TONMF algorithm [10].
TONMF is an approach that has significant advantages over traditional methods for text outlier detection. It achieves an AUC score of 0.9340 in detecting outliers in a dataset consisting of all the documents belonging to the earn and acq categories, with the true outliers being 100 documents from the interest category (an unsupervised setting).
The experimental protocol that we used for comparison is the same as authors explain in [7], only adjusted into a semi-supervised setting. Briefly, 4436 documents from categories earn and acq comprised the training set for the classifiers. Novelties were 81 documents from category interest, which, mixed with the remaining 1779 documents from earn and acq, comprised the testing set.
Table 4 shows that the registered AUC score for logistic regression max-confidence scheme is 97.82%, which is significantly higher than 93.4% that TONMF achieves on the same task.
AUC scores for different methods when using earn and acq as known classes and interest as a novel class
In this experiment, we took data from the BBC Dataset [11] and segmented it to test the proposed methods once more. This dataset contains BBC News articles, divided into five categories: business, entertainment, politics, sport and tech. We randomly chose articles from the business and sport categories, 350 from each, constructing a training set of 700 documents. The remaining articles from these two categories (about 150 from each) were added to the test set, representing non-novelties for the testing phase. All articles from entertainment, politics and tech (approximately 400 each) were added to the test set as novelties. Figure 6 shows the ROC curves on this dataset, which was intentionally built to be very different from the sets we constructed from Reuters data. Namely, the testing set is larger than the training set, and it contains many more novelties than non-novelties. There is an obvious decline in the performance of OneClassSVM, while the max-confidence methods still achieve strong results. This shows that they are robust to datasets of this particular type, where the number of novelties to be detected is several times larger than the number of regular documents.

ROC curves for proposed methods on the dataset constructed from BBC News data. Abbreviations: LR – Logistic Regression, NB – Naïve Bayes, NN – Neural Network.
To make the terminology clear, “polluted” is used to refer to cases where the training set may contain documents from certain categories that should be considered novel during testing. An extension of max-confidence novelty detection that can work with such setups is discussed.
The training set needs to be restructured in the following way. All known classes are treated separately, as before. In addition, all the classes that should be considered novel at testing time are merged into a single class, which we refer to as other. After this is done, a classifier can be fit to that training set.
Upon completion of the training phase, the extended procedure for detecting novelties is applied. If the largest probability in the output distribution of the classifier corresponds to the other class, the current document is immediately classified as novel. Otherwise, the usual threshold test is performed, i.e. the largest probability is compared to ϵ, and the document is declared novel if that probability is lower than ϵ.
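The extended procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the helper names, the class names and the probability vectors are hypothetical, and the classifier is assumed to return one probability per class in the order given by `class_names`.

```python
OTHER = "other"  # hypothetical name for the merged class of novel documents


def relabel_polluted(labels, known_classes):
    """Keep known class labels; merge every remaining label into OTHER."""
    return [y if y in known_classes else OTHER for y in labels]


def is_novel(probs, class_names, epsilon):
    """Extended max-confidence test on a single probability vector."""
    best = max(range(len(probs)), key=probs.__getitem__)
    if class_names[best] == OTHER:
        return True  # the classifier prefers the merged "other" class
    return probs[best] < epsilon  # usual threshold test


# Restructure a small, hypothetical polluted training set.
train_labels = relabel_polluted(["earn", "acq", "interest", "trade"], {"earn", "acq"})

# Apply the test to three hypothetical classifier outputs.
class_names = ["acq", "earn", OTHER]
decisions = [
    is_novel([0.05, 0.90, 0.05], class_names, epsilon=0.7),  # confident known class
    is_novel([0.20, 0.10, 0.70], class_names, epsilon=0.7),  # "other" wins
    is_novel([0.45, 0.40, 0.15], class_names, epsilon=0.7),  # below threshold
]
```

Only the first document is accepted as known; the second is rejected because the other class wins, and the third because the largest probability falls below ϵ.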
We carried out experiments with such a detector on training datasets formed from
The addition of “pollution” to the training set may also be viewed as a kind of regularizer for the novelty detection task. Including, during training, documents that should be treated as unusual should help the classifier avoid overfitting, as it no longer focuses solely on known classes. It is beneficial to gather as much data for the other class as possible, and it does not even have to be labelled. This is reminiscent of traditional semi-supervised methods where labelled examples are supplemented with unlabelled ones to decrease test error.
It is possible to extend confidence-distance in a similar way and achieve roughly the same results.
Discussion
With the procedures introduced in this paper it is possible to have novelty detection encapsulated in a classifier module and its knowledge. We conducted a series of experiments simulating the same protocols on the same datasets explained in the literature. Methods’ accuracy, precision, recall and F1-score were measured, as well as Receiver Operating Characteristic (ROC) AUC scores. Both proposed methods easily achieve 90% and higher accuracy and recall at the same time on various datasets. Also, the highest AUC scores we obtained ranged from 95–96% to 97–98%, depending on the dataset, while the best results obtained in the literature are below 95%.
Confidence-distance novelty detection performs similarly to max-confidence detection, but we included it in this study as a separate procedure because it appears to be useful in the following use case. We implemented it as a part of a specific document understanding system where an information extraction module helps a classifier learn classes incrementally from a live document stream. The idea is as follows. When max-confidence detector does not recognize that document d belongs to an unknown class, classifier will start confidence-distance detection and return two labels corresponding to two highest probability estimations if
The result is of the form
In this research we tested several classification models: logistic regression, Naive Bayes and neural networks. All of them achieve very similar performance regarding novelty detection. We can conclude that, for the novelty detection task, it is not critical which model is used to perform classification as long as it achieves high accuracy at distinguishing between the known classes in the training dataset. With a properly tuned ϵ parameter such a model can be successfully used for separating novelties from the known classes and obtaining the high performance measures mentioned earlier.
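As an illustration of what tuning ϵ could look like, the sketch below picks the threshold from a small grid by maximizing accuracy on held-out data. This selection procedure is an assumption made for the example and is not described in the paper; the function names, the candidate grid and the validation values are hypothetical. The rule being tuned is the max-confidence test, where a document is declared novel if its largest class probability is lower than ϵ.

```python
def accuracy_at(epsilon, max_probs, novel_flags):
    """Accuracy of the rule 'novel if the largest probability is below epsilon'."""
    correct = sum((p < epsilon) == y for p, y in zip(max_probs, novel_flags))
    return correct / len(max_probs)


def tune_epsilon(max_probs, novel_flags, candidates):
    """Return the candidate threshold with the highest validation accuracy."""
    return max(candidates, key=lambda e: accuracy_at(e, max_probs, novel_flags))


# Hypothetical validation data: largest class probability per document,
# and whether the document is in fact a novelty.
max_probs = [0.97, 0.91, 0.88, 0.62, 0.55, 0.71]
novel_flags = [False, False, False, True, True, True]
candidates = [0.5, 0.6, 0.7, 0.8, 0.9]

best_epsilon = tune_epsilon(max_probs, novel_flags, candidates)
```

Because the procedure only needs the largest probability per document, it applies unchanged to any classifier that outputs class probabilities.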
Conclusion
In this paper we discuss the problem of outlier and novelty detection in text data. The motivation for this study is to design and implement an approach for identifying documents from previously unknown classes in the document stream processed by a general document understanding system.
We propose semi-supervised, classification-based methods for outlier detection with a multi-class classifier. The multi-class classifier is trained on a dataset containing only samples from regular (known) classes (at least two classes). Classifier confidence threshold
The classifier prediction for a new document d contains the probabilities that d belongs to each of the known classes
We propose two procedures to determine if a new document is a novelty or a known sample. Max-confidence novelty detection declares a new document as a novelty if
Our approach outperforms the one-class SVM-based novelty detection method on several important performance metrics. Also, the performance achieved on Reuters data is better than that of the TONMF method under the same experimental protocol.
