An approach for outlier and novelty detection for text data based on classifier confidence

Abstract

In this paper we present an approach for novelty detection in text data. The approach can also be considered semi-supervised anomaly detection because it operates with a training dataset containing labelled instances of the known classes only. During the training phase a classification model is learned. It is assumed that at least two known classes exist in the available training dataset. In the testing phase instances are classified as normal or anomalous based on the classifier confidence. In other words, if the classifier cannot assign any of the known class labels to the given instance with sufficiently high confidence (probability), the instance is declared a novelty (anomaly). We propose two procedures to objectively measure the classifier confidence. Experimental results show that the proposed approach is comparable to methods known in the literature.

Keywords

Classification, novelty detection, outlier detection, classifier confidence, information retrieval

1. Introduction

The problem that we consider in this paper is as follows. Given a general document understanding system, we design and implement a module to identify previously unobserved documents. Later, such documents must be incorporated into the system knowledge. In this sense, we talk about novelty detection rather than outlier or anomaly detection.

Document understanding systems can be defined as software solutions which automate immediate processing of administrative documents with minimal human intervention [2]. These systems receive documents as input. Documents can be very different in terms of their structure and include invoices, forms, contracts, requests, letters etc. The task is to find and extract relevant information from these documents. For example, from an electricity invoice the system can extract and save to a database information such as the date, total amount and customer name. Later, these data can be used for faster document searching or as input to decision support systems.

Information extraction from documents is a very challenging task. This problem has occupied the attention of many researchers and practitioners for decades. The general approach in information extraction is to separately create and maintain an extractor model for every known document class. This means that document understanding systems usually contain a classification module and an information extraction module for each class.

The basic use case in a document understanding system is as follows [1,6,8,16]. The system starts with initial knowledge. This means that the available training dataset is used to learn a classifier and, for every class, an extractor. Then, documents arrive from the stream. The system takes the current document and runs the classification module. After that, the document is sent to the information extraction module responsible for extracting target information from documents belonging to the recognized class. In this scenario it is very important that the document class is correctly recognized, because it determines which information extraction module is called. Also, the system must be able to recognize the appearance of previously unobserved classes. Otherwise, a novelty will be wrongly assigned to one of the existing classes and that class's extractor will be called, resulting in wrong extraction of the desired fields.

Consequently, the classification module must be able to recognize a novelty and indicate such a situation to the user. The user should then define the new class and its target fields, and run specific procedures to learn the extractor. In this study, however, we concentrate only on the novelty detection task.

We present a classification-based approach for novelty detection. Bearing in mind that our system starts with initial knowledge generated from a training dataset containing instances of only known classes, the proposed method can be referred to as a semi-supervised, classification-based approach for outlier detection with a multi-class classifier.

We compare different approaches in performing novelty detection and different machine learning models that can be used as a basis. Implementation is written in Python, with scikit-learn (scikit-learn.org [14]) and NLTK (nltk.org [3]) being the main libraries used for this purpose. Apart from them, we use the accompanying modules frequently present in machine learning and natural language processing tasks (numpy [12], matplotlib, re etc.). Experimental results indicate that the proposed method is comparable to the best solutions reported in the literature.

The paper is organized as follows. The next section discusses the motivation, novelty and contribution of the proposed methods. Problems and algorithms connected with this study are discussed in the third section. Details about the examined methods are given in the fourth and fifth sections. After that we describe the dataset and implementation, explain the experimental protocol and discuss results. Finally, we give a conclusion and possible extensions.

2. Main contributions

The motivation for this research is to extend general document understanding systems with a capability to differentiate novel classes from the known classes on which the classifier module is learned. It is very important that the document class is correctly recognized, because it determines which information extraction module is called.

We must extend the classification module and make it able to recognize the appearance of previously unobserved classes. Otherwise, a novelty can be wrongly assigned to one of the existing classes and its extractor will be called resulting in wrong extraction of desired fields.

The solution that we propose in this paper is based on the idea of using the classification module not just for discriminating among classes available in the training dataset but also for recognizing novelties. This approach can be categorized as a multi-class classification-based method for novelty detection. We propose two variations.

Let the prediction for the new document d arriving from the stream be $\{c_{1} : p_{1}, c_{2} : p_{2}, \dots, c_{k} : p_{k}\}$ . It contains estimates for all classes available in the training dataset. This representation means that the classifier is predicting class label $c_{i}$ with probability $p_{i}$ . Of course, $p_{1} + \dots + p_{k} = 1$ .

With $p_{i_{\max}}$ we denote the highest probability, more precisely $p_{i_{\max}} = \max_{i = 1, 2, \dots, k} p_{i}$ . With $Δ p$ we denote the difference between the two highest probabilities, more precisely $Δ p = p_{i_{\max}} - p_{i_{\max}''}$ , where $p_{i_{\max}''} = \max_{i \neq i_{\max}} p_{i}$ . In other words, $p^{(k)} = p_{i_{\max}}$ is the largest, and $p^{(k - 1)} = p_{i_{\max}''}$ is the second largest probability in the output distribution (i.e. the kth and (k-1)th order statistics).

Let $ϵ \in (\frac{1}{k}, 1)$ be a fixed number. The first approach will determine that a new document d is a novelty if and only if $p_{i_{\max}} < ϵ$ holds. We refer to this procedure as max-confidence novelty detection.

Let $ϵ \in (0, 1)$ be a fixed number. The second approach will determine that a new document d is a novelty if and only if $Δ p < ϵ$ . We refer to this procedure as confidence-distance novelty detection.

Threshold value ϵ is a parameter of the algorithm. We propose heuristics to estimate optimal ϵ value for both max-confidence novelty detection and confidence-distance novelty detection.

In the case that there are only two classes in the training set, max-confidence detection with threshold $ϵ \in (\frac{1}{2}, 1]$ is equivalent to confidence-distance detection with threshold $2 ϵ - 1 \in (0, 1]$ (Section 5.2).

3. Related work

Anomaly or outlier detection is the problem of identifying patterns in data that are very different with respect to expected behaviour [14]. Such data points are usually referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities or contaminants in different application domains [4].

Tasks related to but distinct from anomaly detection are noise removal [17] and noise accommodation [15]. Noise consists of unwanted data points that usually must be eliminated before any data analysis. Noise removal eliminates noise from the dataset prior to analysis [17]. Noise accommodation refers to implementing techniques and models that are robust to noise [15].

Anomaly detection has been researched intensively within a wide variety of research areas and application domains. The problem dates back to the 19th century, when procedures for detecting outliers or anomalies were proposed in the statistics community [4]. Since then, a variety of techniques have been developed, either for general purpose or for specific application domains.

The main challenges in anomaly detection are recognized in the literature. For example, it is very difficult to specify the boundary between normal and abnormal instances, especially in domains where instances from both classes may evolve [4]. In other words, the current definition of normal or abnormal behaviour can change significantly. In addition, different domains generally imply different definitions of an anomaly, so it is not possible to directly apply a technique from one domain to another. Finally, a major issue is the availability of a representative and labelled dataset for training and testing of models [4].

The applicability of anomaly detection methods is usually determined by the nature of the given dataset. A dataset is a collection of entities, also referred to as objects, records, points, vectors, patterns, events, samples or observations [4]. Each entity is described with a set of attributes, also referred to as variables, features or dimensions [4]. When only one attribute is assigned to an entity, the dataset is called univariate. When multiple attributes are assigned to an entity, the dataset is called multivariate.

Anomalies can be classified as point, contextual (conditional) and collective [4]. A point anomaly is an individual entity that can be considered non-conforming with respect to the whole dataset. A contextual anomaly is an entity that can be considered anomalous in a specific context but not otherwise. In this case, every entity is described with contextual and behavioural attributes. The contextual attributes determine the context (for instance, location, time etc.). The behavioural attributes describe the non-contextual characteristics of an entity (for example, the temperature at any location). They are used to determine whether an entity is an anomaly or not. An entity can be anomalous in a specific context, while the same data instance could be considered normal in a different context. For example, a temperature in June could be an anomaly even though the same temperature could appear in January and be considered normal. A collective anomaly is a collection of entities whose occurrence together is anomalous, although a single entity from the collection may not be an anomaly by itself.

Anomaly detection methods can work in supervised, semi-supervised and unsupervised mode [14]. The supervised mode assumes the availability of a training dataset with instances from both known and unknown classes. In this mode, a predictive model capable of distinguishing between known and unknown classes is usually built. The semi-supervised mode assumes that the training dataset contains only instances of the known class. The general solution in this mode is to learn a model corresponding to the known class, and to use that model to recognize anomalies in the test data. If an instance cannot be assigned to the known class, it is declared an anomaly. Techniques that work in unsupervised mode do not require a training dataset, but assume that anomalies are far less frequent than instances belonging to the known classes.

Anomaly detection techniques can be categorized as classification based, clustering based, nearest neighbour based, statistical, information theoretic and spectral [4]. These techniques cover a number of domains, including cyber-intrusion detection, fraud detection, medical anomaly detection, industrial damage detection, image processing, textual anomaly detection, sensor networks [4].

In the text data domain, anomaly detection is primarily devoted to novelty detection: novel topics, events or news stories. One of the first formulations of this problem in this domain is the First Story Detection (FSD) problem [4]. The FSD task consists of detecting the first story related to an event that is not covered by earlier documents (originally news articles) from the stream. Incremental clustering methods are the usual way to solve the FSD task [14]. The idea is to form clusters from articles related to known events. When a new article arrives, it is compared to the existing clusters to determine whether it belongs to one of them (a redundant article describing a known event) or represents a new event (a novelty). The main challenges in the text data domain regarding anomaly detection are related to high dimensionality (the curse of dimensionality) and sparsity.

In [18] authors propose a filtering system to determine relevant and redundant documents from the stream. The system can be classified as event-level novelty detection, because relevancy and redundancy are defined with regard to the events described in the document. Relevant documents from the stream contain information that is relevant to the end user. If the information is relevant but already known from previous relevant documents in the stream, such documents are redundant. Otherwise, such documents are novel. Relevancy and redundancy/novelty are measured separately. Relevant documents are similar to previous relevant documents in the stream in terms of covering the same topic. Novel documents are dissimilar to previous relevant documents in the stream in terms of containing new information. Because these two targets are contradictory, authors claim that it is necessary to model them explicitly and separately. They propose a two-phase filtering system. In the first step relevance filtering is done. Only documents that are relevant are sent to the redundancy/novelty detection module. Redundancy/novelty filtering distinguishes between documents containing new information (novelty) and documents that are relevant but contain previously known information (redundancy). Authors propose several procedures to measure redundancy based on document-to-document similarity, along with corresponding thresholds which can be learned and dynamically adapted.

A similar approach is presented in [7]. Authors propose an entropy-based procedure to produce a novelty score for incoming documents. Documents with a novelty score above some threshold (for instance average + standard deviation) are novel. Otherwise, they are non-novel.

In [10] authors propose a method and introduce a dataset for document-level novelty detection. Of great significance is the effort to create a universal dataset, namely TAP-DLND (document level novelty detection), to provide a benchmark for objective performance estimation of different algorithms from the literature. Authors define document-level novelty detection as a binary classification problem. Already processed documents constitute the class of known documents. An incoming document is represented with a vector of features and sent to a classification model to determine whether it is novel or redundant (already seen). The model is built with the Random Forest algorithm. Different procedures for creating the document vector are tested: paragraph to vector, word to vector, n-grams, named entities and keyword match, new word count, divergence. Authors claim that the proposed method with the new word count procedure for feature generation outperforms other algorithms.

A method similar to the previous one, but with Convolutional Neural Networks as the classification model, is used in [9]. In addition, for document representation authors propose a procedure called Relative Document Vector. The main idea is to recognize sentences from the incoming document and represent each of them with the closest (based on cosine distance) sentence from the already seen documents.

In [13] authors propose the method TONMF (text outliers using non-negative matrix factorization). It is the first approach based on the non-negative matrix factorization (NMF) method to solve anomaly detection in text data. Documents are represented as a bag-of-words matrix in which columns correspond to documents and rows correspond to terms (words). The main idea is to write the term-document matrix $A$ as follows: $A = L_{0} + Z_{0}. \qquad (1)$

In the previous equation $L_{0}$ is a low-rank matrix that represents documents from the corpus that can be expressed as a linear combination of the basis bag-of-words vectors. Consequently, $L_{0}$ is defined as $L_{0} = W_{0} H_{0}$ . The matrix $H_{0}$ corresponds to the (non-negative) coefficients and the matrix $W_{0}$ contains the basis document vectors. Authors propose a specific optimization method based on the Block Coordinate Descent (BCD) framework to determine the matrices $W$, $H$ and $Z$. Documents that can be represented as a linear combination of basis vectors, i.e. documents represented with $L_{0}$ , are considered regular. Otherwise, the document is an outlier. In that case, the corresponding column in the matrix $Z$ has a significant number of non-zero values. In the case of regular documents, the corresponding column in the matrix $Z$ contains entries that are very close to zero. Authors propose to estimate the outlier degree of a document x with the $l_{2}$ norm of the corresponding column $z_{x}$ of the matrix $Z$. With this approach, the outlier scores of regular documents and of outliers are expected to be highly separable. Consequently, it is often possible to define a clear threshold that separates the outliers from the other documents.
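The residual-based scoring idea can be illustrated with a rough sketch. This is not the BCD optimization of [13]: as an assumption made only for this example, a plain NMF from scikit-learn (`sklearn.decomposition.NMF`) is fitted, the residual $Z = A - WH$ is computed afterwards, and each document is scored by the $l_{2}$ norm of its column. The toy matrix, the rank and the function name `residual_outlier_scores` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import NMF


def residual_outlier_scores(A, rank, seed=0):
    """Score each document (column of A) by the l2 norm of its NMF residual."""
    # scikit-learn expects samples in rows, so we factorize the transpose of A.
    model = NMF(n_components=rank, init="nndsvda", max_iter=1000, random_state=seed)
    coeffs = model.fit_transform(A.T)    # (documents, rank), plays the role of H^T
    basis = model.components_            # (rank, terms), plays the role of W^T
    Z = A - (coeffs @ basis).T           # residual, terms-by-documents
    return np.linalg.norm(Z, axis=0)     # one score per document


# Toy term-document matrix: 6 terms (rows), 9 documents (columns).
b1 = np.array([3.0, 2.0, 0.0, 0.0, 0.0, 0.0])
b2 = np.array([0.0, 0.0, 2.0, 3.0, 0.0, 0.0])
regular = [s * b for b in (b1, b2) for s in (1.0, 2.0, 3.0, 4.0)]
odd = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0])  # uses terms no other document uses
A = np.column_stack(regular + [odd])

scores = residual_outlier_scores(A, rank=2)
most_unusual = int(np.argmax(scores))  # index of the highest-scoring document
```

In this toy setting the last column, which cannot be built from the two basis vectors, is expected to receive the largest score.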

Finally, authors present results from various experiments comparing the method TONMF against many baseline algorithms on many datasets. The conclusion is that their approach achieves better performances than traditional methods for outlier detection in text data. Because of that, we compare our results to the TONMF method and go well beyond its capabilities.

4. The one-class classification approach

The first method we examine is based on training a classifier to directly distinguish between novelties and regular documents. We can consider all regular documents as belonging to one class, and everything outside that class as a novelty. The OneClassSVM model in sklearn performs exactly this. It utilizes a support vector machine at its core, and trains it to recognize regions of the feature space that are densely populated with training data as corresponding to regular documents. On the other hand, new instances that are mapped into the sparse regions of the feature space, which contain few or no training examples (that is, almost the entire space), are labelled as novelties.

One thing that we are particularly interested in is how well the OneClassSVM detector performs with varying values of its ν parameter. The ν parameter is an upper bound on the fraction of training errors and also a lower bound on the fraction of support vectors that are used. It takes values from $(0, 1]$ , with 1 representing the extreme case of having all input examples as support vectors. Also, it is related to the regularization constant C, but has the advantage of being bounded, unlike C, which can take arbitrarily large positive values. More on ν-SVMs can be found in [5].

In our case, we find this parameter important because it allows us to control what percentage of training examples will be treated as initial outliers. This can be particularly useful when working with ‘polluted’ training sets, but also for our purposes, since we can have documents that are significantly different from most of the documents in their respective classes.
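To make the role of ν concrete, here is a minimal sketch that fits `OneClassSVM` for several values of `nu` and records the fraction of training examples flagged as outliers. The random Gaussian vectors are only a stand-in for document feature vectors, and the kernel and `gamma` settings are assumptions for this example, not the configuration used in the experiments.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))  # synthetic stand-in for document vectors

flagged = {}
for nu in (0.05, 0.2, 0.5):
    detector = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X_train)
    # predict returns +1 for inliers and -1 for outliers
    flagged[nu] = float(np.mean(detector.predict(X_train) == -1))

for nu, fraction in flagged.items():
    print(f"nu={nu:.2f}: {fraction:.1%} of training examples treated as outliers")
```

Larger values of `nu` should lead to a larger share of the training set being treated as initial outliers, which is the behaviour described above.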

5. Novelty detection based on classifier confidence

Now we turn to a different approach. Instead of doing the previously explained, direct kind of classification, we can try to use a classifier that outputs a probability distribution over the set of all training class labels, and infer whether a certain document is unusual from this probability distribution.

So, let us assume we have a classifier that, given an input feature vector $x = (x_{1}, x_{2}, \dots, x_{n})$ , computes $P = \{c_{1} : p_{1}, c_{2} : p_{2}, \dots, c_{k} : p_{k}\}$ , where $c_{1}, c_{2}, \dots, c_{k}$ are class labels in the training set, and each $p_{i}$ is interpreted as a probability estimate, or how ‘confident’ the classifier is that vector x represents a document in class $c_{i}$ .

A number of different machine learning models can be used to achieve this distribution, including both linear and non-linear, nearest neighbours and purely probabilistic models, which work this way by default. Some models, however, are not suitable for this approach, most notably support vector machines, that do not inherently provide probabilistic classification.

Here we discuss two procedures for novelty detection based on the distribution P. Both of them can be depicted with the simplified scheme (pipeline) shown in Fig. 1.

Fig. 1.

Simplified scheme for methods based on classifier confidence.

Both procedures depend on a parameter ϵ. As we will see shortly, it can be a bit tricky to predict which values of ϵ give satisfying results. We examine how proposed procedures perform when using three different underlying classifiers: logistic regression as an example of a linear model, a naïve Bayes model as a probabilistic classifier, and a neural network as an example of a non-linear model. One can also use other models that provide probability estimates, like nearest neighbours, decision trees and random forests.
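As a minimal sketch of how such a distribution P can be obtained in scikit-learn, the snippet below fits the three kinds of classifiers on a tiny hand-made corpus and calls `predict_proba`. The toy documents, the TF-IDF vectorization and all hyperparameters are assumptions chosen for illustration, not the settings used in the experiments.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

# Hypothetical toy corpus with three known classes.
docs = [
    "quarterly profit rose and dividend declared",
    "net profit and earnings per share increased",
    "company agreed to acquire rival in merger",
    "merger approved and acquisition of shares completed",
    "crude oil prices rose as opec cut output",
    "oil output and crude barrels exported",
]
labels = ["earn", "earn", "acq", "acq", "crude", "crude"]

X = TfidfVectorizer().fit_transform(docs)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": MultinomialNB(),
    "neural network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
}

distributions = {}
for name, model in models.items():
    model.fit(X, labels)
    # Each row is the distribution P over model.classes_ for one document.
    distributions[name] = model.predict_proba(X)
```

Each row of a returned array holds the probabilities $p_{1}, \dots, p_{k}$ in the order given by the fitted model's `classes_` attribute.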

5.1. Examining the largest probability in distribution

Let $ϵ \in (\frac{1}{k}, 1)$ be a fixed number. For a document d, we first find the largest probability in the output distribution: $p_{\max} = \max_{i = 1, 2, \dots, k} p_{i}$ . Then, if the inequality $p_{\max} < ϵ$ holds, document d is estimated to be a novelty. Intuitively, this should work, as we expect a well-trained classifier to be uncertain about documents it has never seen, while giving high maximum probabilities for documents that belong to one of the learned classes.

Although $1 / k$ is a theoretical lower bound for the parameter ϵ, since $p_{max}$ will always be greater than or equal to this number, it seems reasonable that we should only consider larger values, closer to 1.
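The decision rule above can be sketched in a few lines of NumPy. The function name `max_confidence_novelty` and the example distributions are hypothetical; the input is assumed to be an array with one probability distribution per row, such as the output of a classifier's `predict_proba`.

```python
import numpy as np


def max_confidence_novelty(proba, eps):
    """Flag a document as a novelty when its largest class probability is below eps."""
    proba = np.asarray(proba, dtype=float)
    return proba.max(axis=1) < eps


# Three hypothetical output distributions over k = 3 classes.
P = np.array([
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # uncertain prediction
    [0.34, 0.33, 0.33],  # close to the 1/k lower bound
])
flags = max_confidence_novelty(P, eps=0.6)
print(flags)  # [False  True  True]
```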

Here we present a heuristic to estimate optimal ϵ. As noted earlier, the classifier prediction for a document d contains probabilities that d belongs to each class.

Generally, the final predicted class label is the one for which estimated probability is maximal. In other words, classifier will predict label $c_{j}$ for which $p_{j} = \max_{i = 1, 2, \dots, k} p_{i}$ .

The classifier confidence threshold estimate is calculated as the average of the probability estimates of the predicted class over the training set. Formally: $\hat{ϵ} = \frac{\sum_{i = 1}^{M} p_{\max}^{i}}{M} \qquad (2)$ where M is the size of the training dataset and $p_{\max}^{i}$ is the maximal probability estimate for the ith sample in the training dataset.

This estimation generally gives good but not optimal performance. Machine learning practitioners can therefore use the estimate as an initial guess, before fine tuning the model hyperparameters.
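A minimal sketch of the estimate in Eq. (2), assuming the training-set distributions are available as an M-by-k array; the function name and the example values are hypothetical.

```python
import numpy as np


def estimate_max_confidence_threshold(train_proba):
    """Average, over the training set, of the probability of the predicted class."""
    train_proba = np.asarray(train_proba, dtype=float)
    return float(train_proba.max(axis=1).mean())


# Hypothetical distributions for M = 3 training documents and k = 2 classes.
train_proba = [[0.9, 0.1], [0.3, 0.7], [0.8, 0.2]]
eps_hat = estimate_max_confidence_threshold(train_proba)  # mean of 0.9, 0.7, 0.8
```

The resulting value can then serve as the initial guess for ϵ before any further tuning.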

However, we can only check the goodness of this estimation empirically, and we will see that an optimal value can vary substantially for different underlying classifiers, with different parameters and for different datasets. We refer to this procedure as max-confidence novelty detection.

5.2. Examining the distance between two largest probabilities in distribution

Let $ϵ \in (0, 1)$ be a fixed number. For a document d, let $p^{(k)}$ be the largest, and $p^{(k - 1)}$ the second largest probability in the output distribution (i.e. kth and (k-1)th order statistic). Then, if the distance between these two numbers is less than ϵ, i.e. if the inequality $p^{(k)} - p^{(k - 1)} < ϵ$ holds, document d is estimated to be a novelty. This sticks to the idea of measuring how confident a model is in classifying a given document. Similarly to the previous considerations, we do not expect really small values of ϵ to give us good results.

We refer to this procedure as confidence-distance novelty detection, even though it utilizes an idea similar to the previous one, to emphasize the fact that it examines the absolute difference between the two highest probability estimates.
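The confidence-distance rule can be sketched in the same way. The function name `confidence_distance_novelty` and the example distributions are hypothetical, and at least two classes per row are assumed.

```python
import numpy as np


def confidence_distance_novelty(proba, eps):
    """Flag a document as a novelty when the gap between its two largest
    class probabilities is below eps."""
    proba = np.asarray(proba, dtype=float)
    top_two = np.sort(proba, axis=1)[:, -2:]   # columns: second largest, largest
    delta = top_two[:, 1] - top_two[:, 0]
    return delta < eps


P = np.array([
    [0.90, 0.05, 0.05],  # gap 0.85
    [0.40, 0.35, 0.25],  # gap about 0.05
    [0.34, 0.33, 0.33],  # gap about 0.01
])
flags = confidence_distance_novelty(P, eps=0.1)
print(flags)  # [False  True  True]
```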

Now, in the case that there are only two classes in the training set, we can see that max-confidence detection with threshold $ϵ \in (\frac{1}{2}, 1]$ is equivalent to confidence-distance detection with threshold $2 ϵ - 1 \in (0, 1]$ . More precisely, we have:

Theorem 1.
Let $ϵ \in (\frac{1}{2}, 1]$ be a fixed number. Then, for every distribution $P = \{c_{1} : p_{1}, c_{2} : p_{2}\}$ the following holds: $\max \{p_{1}, p_{2}\} < ϵ \iff | p_{1} - p_{2} | < 2 ϵ - 1$
Proof.
Since P is a probability distribution, we have $p_{1} + p_{2} = 1$ . For the sake of notational simplicity, let $p_{1} = p$ . Then we have $p_{2} = 1 - p$ . Consider the functions $M (p) = \max \{p, 1 - p\}$ and $D (p) = | p - (1 - p) | = | 2 p - 1 |$ on the closed interval $[0, 1]$ . Simplifying gives us: $M (p) = \begin{cases} 1 - p, & 0 \leqslant p < \frac{1}{2} \\ \frac{1}{2}, & p = \frac{1}{2} \\ p, & \frac{1}{2} < p \leqslant 1 \end{cases} \qquad (3)$ and $D (p) = \begin{cases} 1 - 2 p, & 0 \leqslant p < \frac{1}{2} \\ 0, & p = \frac{1}{2} \\ 2 p - 1, & \frac{1}{2} < p \leqslant 1 \end{cases} \qquad (4)$

Now it is easy to see that $M (p) < ϵ \Leftrightarrow 1 - ϵ < p < ϵ$ and $D (p) < ϵ^{'} \Leftrightarrow \frac{1 - ϵ^{'}}{2} < p < \frac{1 + ϵ^{'}}{2}$ . Equating the bounds of these two intervals gives us $ϵ^{'} = 2 ϵ - 1$ . From the preceding, it follows that $M (p) < ϵ \Leftrightarrow D (p) < 2 ϵ - 1$ , which proves the claim. □
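The equivalence can also be checked numerically. The sketch below is an illustration rather than part of the proof: it uses exact rational arithmetic to avoid floating-point effects at the interval boundaries, and the grid resolution is an arbitrary choice.

```python
from fractions import Fraction


def max_confidence_flag(p, eps):
    """Max-confidence rule for a two-class distribution (p, 1 - p)."""
    return max(p, 1 - p) < eps


def confidence_distance_flag(p, eps):
    """Confidence-distance rule with the transformed threshold 2 * eps - 1."""
    return abs(p - (1 - p)) < 2 * eps - 1


grid = [Fraction(i, 200) for i in range(201)]            # p in [0, 1]
thresholds = [Fraction(j, 100) for j in range(51, 101)]  # eps in (1/2, 1]

mismatches = [
    (p, eps)
    for eps in thresholds
    for p in grid
    if max_confidence_flag(p, eps) != confidence_distance_flag(p, eps)
]
print(len(mismatches))  # 0
```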

Here we present a heuristic to estimate the optimal ϵ for confidence-distance detection. As before, the classifier prediction for a document d contains the probabilities that d belongs to each class, $P = \{c_{1} : p_{1}, c_{2} : p_{2}, \dots, c_{k} : p_{k}\}$ . The threshold $\hat{ϵ}$ is estimated as the average difference between the two highest probability estimates over the training set. Formally: $\hat{ϵ} = \frac{\sum_{i = 1}^{M} Δ p^{i}}{M} \qquad (5)$ where M is the size of the training dataset and $Δ p^{i}$ is the absolute difference between the two highest probability estimates for the ith sample in the training set.
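A minimal sketch of this estimate, again assuming an M-by-k array of training-set distributions; the function name and the example values are hypothetical.

```python
import numpy as np


def estimate_confidence_distance_threshold(train_proba):
    """Average gap between the two largest probabilities over the training set."""
    train_proba = np.asarray(train_proba, dtype=float)
    top_two = np.sort(train_proba, axis=1)[:, -2:]  # columns: second largest, largest
    return float((top_two[:, 1] - top_two[:, 0]).mean())


# Hypothetical distributions for M = 3 training documents and k = 2 classes.
train_proba = [[0.9, 0.1], [0.3, 0.7], [0.8, 0.2]]
delta_hat = estimate_confidence_distance_threshold(train_proba)  # mean of 0.8, 0.4, 0.6
```

For two classes each gap equals $2 p_{\max}^{i} - 1$ , so this estimate is exactly twice the average maximal probability minus one, in line with the threshold correspondence of Theorem 1.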

6. Dataset

The dataset used to evaluate the implemented models is the Reuters-21578 ApteMod corpus, loaded through the NLTK API. It contains 90 different categories of documents, divided into 7769 training documents and 3019 testing documents. Some of the documents are multi-labelled, i.e. belong to more than one category, and we filter those out. Filtering leaves us with 6577 single-labelled documents in the training set, divided into 58 categories, and 2583 documents in the test set, divided into 59 categories. All documents are written in English.
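The filtering step can be sketched as follows. This is an illustrative assumption about how the split might be done, not the authors' code: the helper `single_label_split` is hypothetical and relies on the NLTK Reuters reader exposing `fileids()`, `categories(fileid)` and `raw(fileid)`, with file identifiers prefixed by `training/` or `test/`. A small stand-in corpus is used so the snippet runs without downloading data.

```python
def single_label_split(corpus):
    """Split a Reuters-style corpus into single-labelled (text, label) lists."""
    train, test = [], []
    for fileid in corpus.fileids():
        categories = corpus.categories(fileid)
        if len(categories) != 1:
            continue  # drop multi-labelled documents
        target = train if fileid.startswith("training/") else test
        target.append((corpus.raw(fileid), categories[0]))
    return train, test


class ToyCorpus:
    """Stand-in exposing the three reader methods used above."""

    _docs = {
        "training/1": ("profits rose", ["earn"]),
        "training/2": ("wheat and corn exports", ["grain", "corn"]),
        "test/3": ("merger agreed", ["acq"]),
    }

    def fileids(self):
        return list(self._docs)

    def categories(self, fileid):
        return self._docs[fileid][1]

    def raw(self, fileid):
        return self._docs[fileid][0]


d_train, d_test = single_label_split(ToyCorpus())

# With NLTK installed and the corpus downloaded via nltk.download("reuters"):
#   from nltk.corpus import reuters
#   d_train, d_test = single_label_split(reuters)
```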

We use $D_{train}$ to denote the former set, and $D_{test}$ for the latter. The elements of these sets are raw texts, i.e. strings of data. Similarly, we use $C_{train}$ and $C_{test}$ to denote the respective sets of class labels. Most of the classes appear in both of these sets, but there are certain ones specific to only one of them. For an arbitrary document d, we will denote its class by $c (d)$ . Detailed statistics for the Reuters dataset are shown in Table 1. For each category, the number of documents along with the corresponding percentage with respect to the entire set is given, as well as per-category vocabulary size (i.e. the number of different words in all the documents in a single class). Although the entire table would have had almost 60 rows, we trimmed it to show the top 20 rows only (i.e. the ones that contribute at least 0.5% to the total number, approximately) for an easier overview of the most important categories of documents.

Table 1
Top categories of documents in the Reuters dataset

$D_{train}$				$D_{test}$
Category	Number of documents	Percentage (%)	Vocabulary size	Category	Number of documents	Percentage (%)	Vocabulary size
earn	2840	43.18	9528	earn	1083	41.93	4316
acq	1596	24.27	10035	acq	696	26.95	6348
crude	253	3.85	4281	crude	121	4.68	2767
trade	250	3.8	4355	money-fx	87	3.37	1941
money-fx	222	3.38	3492	interest	81	3.14	1653
interest	191	2.9	2455	trade	76	2.94	2475
money-supply	123	1.87	1061	ship	36	1.39	1322
ship	108	1.64	2546	money-supply	28	1.08	608
sugar	97	1.47	1981	sugar	25	0.97	1091
coffee	90	1.37	2315	coffee	22	0.85	976
gold	70	1.06	1615	gold	20	0.77	617
gnp	59	0.9	1728	alum	19	0.74	658
cpi	54	0.82	920	cpi	17	0.66	439
cocoa	46	0.7	1471	cocoa	15	0.58	652
grain	41	0.62	1721	gnp	15	0.58	1092
jobs	37	0.56	801	copper	13	0.5	761
reserves	37	0.56	684	iron-steel	12	0.46	538
ipi	34	0.52	773	jobs	12	0.46	246
alum	31	0.47	1078	nat-gas	12	0.46	636

From this dataset, smaller datasets will be constructed. At first it is assumed that no anomalous instances are present in the training set. In the final sections of the paper, however, an extension of the proposed methods that works with polluted datasets is given. In the context of novelty detection, after training a model on a certain set of documents, we want that model to be able to distinguish between documents that are familiar, in the sense that they are very similar to something the model has already seen, and the ones that represent something new, or differ significantly in some aspect(s) from everything seen up to that point.

Our approach in defining novelties is similar to the way authors usually define outliers in a textual dataset, in a sense that we do not take semantics of documents into account. The main difference is that our confidence based methods represent semi-supervised learning paradigm, as opposed to an unsupervised setting where one does not know the categories to which documents belong. So, instead of finding examples that are unusual compared to most of the documents in a set, we fit a model to an initial set of documents and expect it to find examples that are substantially different from those in that initial set, i.e. the ones that do not fit into any of the initial categories.

Let us assume we have chosen k classes $c_{1}, c_{2}, \dots, c_{k} \in C_{train}$ . We construct a specific training set in the following way: $D_{train}^{(k)} = \{d \in D_{train} \mid c (d) \in \{c_{1}, c_{2}, \dots, c_{k}\}\}$ . Now, the most natural way (perhaps even the only one) to define novelties precisely is as follows: $D_{novelty} = \{d \in D_{test} \mid c (d) \notin \{c_{1}, c_{2}, \dots, c_{k}\}\}$ . The subset of familiar documents is then simply $D_{test} ∖ D_{novelty}$ . Throughout this paper, we closely examine the cases of $k = 2$ and $k = 5$ , while choosing the most frequent categories for the training set. More precisely, we use the following two training sets:

$D_{train}^{(2)} = {d \in D_{train} | c (d) \in {^{'} {earn}^{'},^{'} {acq}^{'}}}$ , where earn and acq are the two most frequent categories in $D_{train}$ .

${D_{train}^{(5)} = {d \in D_{train} | c (d) \in {^{'} {earn}^{'},^{'} {acq}^{'},^{'} {crude}^{'},}^{'} {trade}^{'},^{'} money - f x^{'}}}$ , i.e. using the top 5 categories of documents. Documents belonging to one of these should be acknowledged as regular (already seen), while, for example, those belonging to classes ship and gold should be recognized as something new.
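The set definitions above can be illustrated with a minimal Python sketch. The helper names `select_known` and `novelty_labels` and the three-document toy corpus are assumptions made purely for illustration; they are not taken from the implementation used in the experiments.

```python
from typing import List, Set, Tuple

# Each document is a (text, category) pair. The tiny corpus below is made up
# for illustration and is not Reuters data.
Doc = Tuple[str, str]


def select_known(docs: List[Doc], known: Set[str]) -> List[Doc]:
    """Build D_train^(k): keep only documents whose category is a known class."""
    return [d for d in docs if d[1] in known]


def novelty_labels(docs: List[Doc], known: Set[str]) -> List[int]:
    """Return 1 for documents in D_novelty (category not known), 0 otherwise."""
    return [0 if category in known else 1 for _, category in docs]


d_train = [("profit rose", "earn"), ("firm bought rival", "acq"), ("tanker docked", "ship")]
d_test = [("dividend declared", "earn"), ("gold price up", "gold"), ("merger agreed", "acq")]

known_2 = {"earn", "acq"}
d_train_2 = select_known(d_train, known_2)   # toy counterpart of D_train^(2)
y_test = novelty_labels(d_test, known_2)     # ground truth used at testing time
print(len(d_train_2), y_test)
```

Note that the category labels of the test documents are used only to build the ground truth for evaluation; the detector itself never sees them.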

For testing, unless stated otherwise, the entire $D_{test}$ set is used. Choosing the most frequent categories of documents for training, paired with the highly non-uniform class distribution of Reuters data, gives us a setting where regular documents make up a large fraction of all the documents at testing time (roughly 70–80%, as Table 1 indicates). Real document understanding systems are likely to operate in this kind of setting. However, an additional experiment with a dataset that has many more novelties than non-novelties in the testing subset will be discussed near the end of the paper, and the results will show that the proposed methods still work just as well in such a setting.

7. Implementation details

Implementation repository: github.com/NikolaPizurica/novelty-detection-for-text

Before getting started with text vectorization and applying machine learning methods, we ought to pre-process the available data. This includes converting all characters to lower-case, removing punctuation symbols, numbers, stop-words (words that appear frequently in most of the documents and carry no particularly useful information) and possibly other elements of textual data that are irrelevant to the task of recognizing new, unusual text documents. Also, we perform word lemmatization, that is, converting words into their base form (plural nouns to singular, verbs in various forms to infinitive etc.). To perform these tasks, we combine Wordnet lemmatizer, NLTK utilities and regular expressions. An example of these transformations is illustrated in Fig. 2.

Fig. 2.

Preprocessing textual data. Past tense is converted to infinitive, nouns in plural form are converted into singular etc.
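The preprocessing steps described above could be sketched as follows. This is a minimal illustration using only regular expressions: the short stop-word set and the `toy_lemmatize` function are hypothetical stand-ins for the NLTK stop-word list and the WordNet lemmatizer, which are not reproduced here.

```python
import re
from typing import Callable, List

# A deliberately tiny stop-word set, for illustration only; a real stop-word
# list (e.g. the one shipped with NLTK) is much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was", "for"}


def preprocess(text: str, lemmatize: Callable[[str], str] = lambda w: w) -> List[str]:
    """Lower-case, strip punctuation and digits, drop stop-words, lemmatize."""
    text = text.lower()
    # Keep ASCII letters and whitespace only (removes punctuation and numbers).
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [lemmatize(t) for t in tokens if t not in STOP_WORDS]


def toy_lemmatize(word: str) -> str:
    """Hypothetical stand-in for a real lemmatizer: strips a trailing plural 's'."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


tokens = preprocess("The company reported 3 profits in 1987!", toy_lemmatize)
print(tokens)
```

The lemmatizer is passed in as a function so that a proper one can be plugged in without changing the rest of the pipeline.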

Next, there are many ways to represent text documents in vector form, from the simplest ones, such as classic bag of words representation, to more complex approaches, such as tf-idf weighting, using word or character n-grams, shingles and so on. We experimented with plain bag of words models and a tf-idf weighting model, and found (not surprisingly) the latter to be slightly superior.

We use scikit-learn’s TfidfVectorizer tuned to consider the top N terms (i.e. words, tokens) based on their frequency over the entire training set. On Reuters data, $N = 500$ is used. This number seems to make the vector representation expressive enough, while still being relatively small and avoiding the problems of excessive dimensionality. Of course, this number is a hyperparameter of the machine learning pipeline that needs to be tuned for a specific task. A more complex dataset could certainly require more tokens in order to capture all the relevant features. On the datasets constructed from Reuters data, we could not improve performance metrics by using longer tf-idf vectors, so we settled for 500 as an adequate number of features.

For each document d, tf-idf scores are computed over the top N terms: $tfidf(t_{i}) = tf(t_{i}, d) \cdot idf(t_{i}), \; i = 1, 2, \dots, N$. Here, $tf(t_{i}, d)$ is simply the number of occurrences of the term $t_{i}$ in the document d, and the inverse document frequency is computed using the formula $idf(t_{i}) = \log \frac{1 + |D_{train}^{(k)}|}{1 + df(t_{i})} + 1$, where $df(t_{i})$ denotes the number of documents in the training set that contain the term $t_{i}$. This way we do not have to worry about any zero divisions or ill-defined logarithms. Finally, the N-dimensional vector of these scores is normalized with respect to the Euclidean norm. Doing this for each document in the set gives us a data matrix $X_{train}$ of shape $|D_{train}^{(k)}| \times N$, whose rows represent individual documents and whose columns correspond to terms. After we have fitted the vectorizer on a training set, we can use it to get a data matrix for a test set as well.
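To make the weighting concrete, the formulas above can be reproduced from scratch in a few lines. This is a sketch under stated assumptions, not the vectorizer used in the experiments: the natural logarithm is assumed, the vocabulary is given explicitly instead of being selected by frequency, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def fit_idf(train_docs: List[List[str]], vocabulary: List[str]) -> Dict[str, float]:
    """Smoothed idf: log((1 + |D|) / (1 + df(t))) + 1 for each vocabulary term."""
    n_docs = len(train_docs)
    # df(t): number of training documents that contain term t at least once.
    df = Counter(t for doc in train_docs for t in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}


def tfidf_vector(doc: List[str], vocabulary: List[str], idf: Dict[str, float]) -> List[float]:
    """tf * idf over the vocabulary, followed by Euclidean normalization."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocabulary]  # out-of-vocabulary words are ignored
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm > 0 else raw


train_docs = [["profit", "rose"], ["profit", "fell"], ["merger", "agreed"]]
vocabulary = ["profit", "merger"]

idf = fit_idf(train_docs, vocabulary)
x = tfidf_vector(["profit", "profit", "merger", "unknown"], vocabulary, idf)
print(idf, x)
```

A document consisting only of out-of-vocabulary words is mapped to the zero vector, which is left unnormalized in this sketch.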

It is important to note that out-of-vocabulary words (i.e. the ones that were not included in top N terms over the training set) are not encoded in any particular way. In other words, they are simply ignored – the vectorizer takes into account only the words present in initial vocabulary and computes their tf-idf scores. From a statistical viewpoint, this is actually justified for the task of novelty detection when we think about the nature of tf-idf features. If a document contains many out-of-vocabulary words and the rest are mostly common words likely to appear in an average document, its tf-idf vector will have a very small norm, since unknown words are not included, and the common ones are penalized for being common and carrying no specific information. On the other hand, documents from known classes are expected to have relatively larger vector norms, since they contain words specific to those classes, and those words contribute large tf-idf scores. A document with a small norm is then likely to be classified as novel, as its feature vector belongs to an unexplored region of feature space.

It is also necessary to have a way of evaluating the performance of a novelty detector. This task is actually quite similar to pure classification, so we are going to use the same performance metrics. First of all, since we assign label 0 to normal documents, and 1 to novelties, the confusion matrix will have the form shown in Table 2.

Table 2

Confusion matrix for the task of novelty detection

	Predicted non-novelties	Predicted novelties
True non-novelties	True negatives	False positives
True novelties	False negatives	True positives

Now we can use the metrics that are computed from entries in this matrix: accuracy, precision, recall, F1-score and so on.

As always, we would like our model to have a high accuracy, i.e. a high ratio of correctly classified instances to the total number of instances in the set. As is usually the case in machine learning, there will be a trade-off between precision and recall. In the setting of the problem we are trying to solve, recall can be considered slightly more important than precision. In other words, we would rather have a model that detects as many of the true novelties as possible, at the expense of raising a higher number of false alarms, than the other way around.
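The metrics derived from the confusion matrix in Table 2 can be computed as in the following sketch. The function name and the toy label vectors are illustrative assumptions; label 1 denotes a novelty and label 0 a regular document, as above.

```python
from typing import Dict, List


def detection_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 with novelties as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # undefined if nothing is flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]
scores = detection_metrics(y_true, y_pred)
print(scores)
```

Returning 0.0 when a denominator vanishes is a convention chosen for this sketch; the metric is strictly undefined in that case.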

At the end, we will also consider ROC-curves (Receiver Operating Characteristic) as a way to measure the performance of a novelty detector. The idea is to graph $(FPR, TPR)$ pairs, where $FPR = \frac{FP}{TN + FP}$ (false positive rate), and $TPR = Recall$ (true positive rate) for various values of the threshold parameter used to perform detection (epsilon in the case of max-confidence and confidence-distance). The resulting curve will always contain $(0, 0)$ and $(1, 1)$ as its endpoints, corresponding to the trivial cases of classifying everything as a non-novelty and everything as a novelty, respectively. The ideal detection corresponds to the point $(0, 1)$ , i.e. having detected all the true novelties and raising no false alarms. So, the closer the ROC curve gets to the point $(0, 1)$ , the better the performance is, or, phrased differently, we want the area under the curve (AUC score) to be as large as possible.
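The construction of the ROC curve can be sketched for a max-confidence detector as follows. The sketch assumes that a document is predicted to be a novelty when its largest class probability is strictly below the threshold, and that both classes occur in the test labels; the function names and the six toy probabilities are hypothetical.

```python
from typing import List, Tuple


def roc_points(p_max: List[float], y_true: List[int]) -> List[Tuple[float, float]]:
    """(FPR, TPR) pairs obtained by sweeping the threshold eps over all cut points.

    A document is predicted novel when p_max < eps. Assumes y_true contains
    at least one novelty (1) and at least one regular document (0).
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    # eps = 0 flags nothing, eps = inf flags everything: the two trivial endpoints.
    thresholds = [0.0] + sorted(set(p_max)) + [float("inf")]
    points = []
    for eps in thresholds:
        tp = sum(1 for p, y in zip(p_max, y_true) if p < eps and y == 1)
        fp = sum(1 for p, y in zip(p_max, y_true) if p < eps and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


p_max = [0.99, 0.65, 0.60, 0.55, 0.90, 0.70]   # largest class probability per document
y_true = [0, 0, 1, 1, 0, 1]                    # 1 = true novelty
points = roc_points(p_max, y_true)
area = auc(points)
print(points, area)
```

Because the thresholds are visited in increasing order, both coordinates are non-decreasing and the points can be integrated without sorting.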

8. Experimental results

We begin by analysing the performance of the OneClassSVM model. As in a classification task, the ν parameter can be tweaked to control the bias-variance trade-off. That can also be accomplished by tuning the γ parameter when using a Gaussian kernel $K(x^{(i)}, x^{(j)}) = \exp(-γ ‖x^{(i)} - x^{(j)}‖^{2})$. We found that Gaussian kernels yield the best results in the domain of detecting novelties in text documents, and experimented with different configurations of both γ and ν parameters. For a data matrix X of shape $m \times n$, representing a vectorized training set of documents, we found that the OneClassSVM option gamma='scale' works well compared to the other values we tried. It computes γ as the reciprocal of the number of features n multiplied by the variance of the flattened data matrix, i.e. $γ = 1 / (n \cdot Var(X))$. Fixing this γ and tuning for the optimal value of ν (e.g. using a grid search or randomized search) gives the results shown in Table 3.
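As a small worked example of the kernel settings, the following sketch computes the scaled γ and a Gaussian kernel value in plain Python. It assumes the definition documented for scikit-learn, namely γ = 1 / (n_features · Var(X)) with the population variance taken over all entries of X, and the squared Euclidean distance inside the exponent; the function names and the toy matrix are illustrative.

```python
import math
from typing import List


def gamma_scale(X: List[List[float]]) -> float:
    """gamma = 1 / (n_features * Var(X)), variance taken over all entries of X."""
    values = [v for row in X for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)   # population variance
    return 1.0 / (len(X[0]) * var)


def rbf_kernel(x: List[float], y: List[float], gamma: float) -> float:
    """Gaussian kernel exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]   # toy 3 x 2 data matrix
gamma = gamma_scale(X)
k01 = rbf_kernel(X[0], X[1], gamma)
print(gamma, k01)
```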

Table 3
Performance measures of discussed methods on Reuters data

Setup	Method	Accuracy	Precision	Recall	F1-score	ROC-AUC score
Trained on $D_{train}^{(2)}$ and tested on $D_{test}$	OneClassSVM	0.8506	0.7465	0.7873	0.7663	0.9031
	LR max.conf.	0.8951	0.8032	0.8781	0.8390	0.9521
	NB max.conf.	0.8378	0.7113	0.8060	0.7557	0.9159
	NN max.conf.	0.8858	0.7757	0.8905	0.8292	0.9519
Trained on $D_{train}^{(5)}$ and tested on $D_{test}$	OneClassSVM	0.7789	0.4695	0.7538	0.5786	0.8483
	LR max.conf.	0.8873	0.6606	0.9058	0.7640	0.9486
	NB max.conf.	0.8893	0.6653	0.9058	0.7671	0.9508
	NN max.conf.	0.8955	0.6866	0.8846	0.7731	0.9562
	LR conf.dist.	0.8827	0.6505	0.9019	0.7558	0.9349
	NB conf.dist.	0.8773	0.6385	0.9011	0.7470	0.9425
	NN conf.dist.	0.8808	0.6468	0.8981	0.7520	0.9491

It should be noted that recall increases when ν is increased and reaches the extreme value of 1 when ν is sufficiently large. This is because larger ν means larger regularization penalty, and in the extreme case the model is so underfitted that it simply classifies everything as novelty. However, one cannot just aim for such an increase as it damages other performance metrics, and scores in Table 3 show a balanced case. The results are not exactly bad, but not impressive either.

Also, experimenting with different values of gamma did not seem to lead to anything substantially better than this. To conclude, we notice from Table 3 that even a small increase in the number of known categories at the beginning (in the training set) caused performance of this kind of detector to decline.

We now turn our attention to particular machine learning models that we can use as classifiers that will provide probability estimates for max-confidence methods.

All the classifiers that are about to be discussed achieve between 97% and 98% classification accuracy on 2-class and 5-class subsets of Reuters data (i.e. using the same k classes for training and testing set). Also, the weighted averages of precision, recall and F1-score fall into the same range.

Figure 3 illustrates the behaviour of accuracy, precision and recall of max-confidence models for varying threshold values, on different datasets and for various underlying classifiers. In all the graphs, the ϵ parameter takes values from a subinterval of the largest theoretically allowed interval (e.g. $(0.3, 1] \subseteq (0.2, 1]$ in the 5-class case). This is chosen so that all the metrics are well defined for every ϵ value, as some of them might involve division by zero when no novelties are predicted, which is often the case for very small values of threshold parameter.

First off, logistic regression is tested for detecting novelties through a max-confidence scheme. Fig. 3 illustrates its performance.

There is an obvious, major improvement when compared to the one-class approach using ν-SVMs. One can obtain both accuracy and recall of around 90%, with precision just below 70%, simply by searching for a good value of the confidence threshold in a simple linear model.

The next classifier we examined is multinomial naïve Bayes (also in Fig. 3). It performs worse than logistic regression (particularly in terms of recall) on the 2-class task and behaves almost identically on 5 classes. Since this model generally works well for text classification, it is definitely worth experimenting with. Intuitively, we would expect it to achieve good results when there is a larger number of categories in the training set, and Fig. 3 supports this: it works better when faced with 5 initial classes than with 2.

When it comes to using a neural network as the backbone of the detector, things expectedly get a bit more complicated. With logistic regression and naïve Bayes we did not have to manage a bias-variance trade-off, but now we have a robust, non-linear classifier, capable of learning very complex decision functions. Unfortunately, it is also very prone to over-fitting, so regularization plays an important role here. We basically have a situation similar to the one with the ν-SVM approach. Here we try to tune the $L^{2}$ regularization constant α to an optimal value, while exploring relationships between performance metrics and the parameter ϵ. During the experimental phase, we found that a neural network with 2 hidden layers, 100 neurons each, and a ReLU activation function works well for classification of text documents. Scikit-learn uses $α = 0.0001$ for regularization by default. We can get the detector to work with that value, but it turns out that we need really high values of ϵ. The optimal choice for the case with 2 initial classes seems to be around $ϵ = 0.9999999$, and approximately $ϵ = 0.99999$ for 5 initial classes. The reason for this is that decision function complexity is not penalized enough, so the network fits the training data so well that it is very confident in its classification predictions. After tweaking the regularization constant, $α = 1.0$ appeared to be a reasonable choice, so we illustrate the performance metrics for this value in Fig. 3. Surprisingly, the network does not perform better than logistic regression.

Fig. 3.

Performance of max-confidence models. Rows correspond to different evaluation metrics, and columns correspond to different datasets. Abbreviations: LR – Logistic Regression, NB – Naïve Bayes, NN – Neural Network.

At the end, we tested the same three machine learning models as a basis of confidence-distance methods. All the results are shown in Table 3 for easy comparison with OneClassSVM. Notice that for the 2-class case, confidence-distance results are not shown because they would be the same as max-confidence results, due to Theorem 1. The results for the 5-class case also turned out to be very similar to max-confidence.
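For two known classes, the coincidence of the two procedures can be checked directly. A brief sketch, assuming the classifier outputs a probability distribution over exactly the two known classes, so that $p_{i_{\max}} + p_{i_{\max}^{″}} = 1$: the confidence distance is $Δ p = p_{i_{\max}} - p_{i_{\max}^{″}} = 2 p_{i_{\max}} - 1$, hence $Δ p < ϵ_{conf Δ}$ holds exactly when $p_{i_{\max}} < (1 + ϵ_{conf Δ}) / 2$. Every confidence-distance threshold therefore corresponds to a max-confidence threshold $ϵ_{conf} = (1 + ϵ_{conf Δ}) / 2$, and sweeping either threshold yields the same family of detectors.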

8.1. Class-level leave-one-out cross validation

To test the robustness of the proposed methods further, a different experiment is performed. We took the $D_{train}^{(5)}$ set and analogously constructed $D_{test}^{(5)} = \{d \in D_{test} \mid c(d) \in \{\text{earn}, \text{acq}, \text{crude}, \text{trade}, \text{money-fx}\}\}$, i.e. we took the same 5 categories from $D_{test}$. Afterwards, a class-level leave-one-out cross validation was carried out. To be precise, 5 new training sets were formed from $D_{train}^{(5)}$ by excluding one class at a time. In each case, $D_{test}^{(5)}$ was used for testing. Figure 4 shows the results of max-confidence models using the threshold value of $ϵ = 0.9$. We use this particular value to illustrate certain points; better results than the ones shown here can be obtained in every specific setting by tweaking the threshold.

Fig. 4.

Accuracy of max-confidence models in a leave-one-out setting. Each group of bars has a label that indicates which class was dropped from the training set in that particular case. Abbreviations: LR – Logistic Regression, NB – Naïve Bayes, NN – Neural Network.

We notice that the naïve Bayes classifier achieves roughly the same accuracy in every setting. However, for the logistic regression and neural network classifiers, it is difficult to achieve the same performance in every sub-setting with the same threshold, since the earn category contains more documents than all the other 4 categories combined, so the setting where it is dropped from the training set needs a different threshold for better accuracy. On the other hand, dropping any of the three categories that are similar in terms of the number of documents (crude, trade, money-fx) does not affect the performance of max-confidence detectors.
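The class-level leave-one-out protocol can be sketched as follows. The helper names and the toy documents are assumptions for illustration; in each fold the held-out class plays the role of the novel class at testing time.

```python
from typing import Iterator, List, Tuple

Doc = Tuple[str, str]  # (text, category)

CLASSES = ["earn", "acq", "crude", "trade", "money-fx"]


def class_level_leave_one_out(train_docs: List[Doc],
                              classes: List[str]) -> Iterator[Tuple[str, List[Doc]]]:
    """Yield (held_out_class, reduced_training_set), excluding one class at a time."""
    for held_out in classes:
        yield held_out, [d for d in train_docs if d[1] != held_out]


def fold_labels(test_docs: List[Doc], held_out: str) -> List[int]:
    """Ground truth for one fold: documents of the held-out class are novelties."""
    return [1 if category == held_out else 0 for _, category in test_docs]


# Toy stand-ins for D_train^(5) and D_test^(5): one document per class.
train_5 = [(f"train text about {c}", c) for c in CLASSES]
test_5 = [(f"test text about {c}", c) for c in CLASSES]

folds = list(class_level_leave_one_out(train_5, CLASSES))
for held_out, reduced in folds:
    print(held_out, len(reduced), fold_labels(test_5, held_out))
```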

9. ROC curves and AUC scores

We have also experimented with graphing ROC curves and calculating areas under them for the proposed methods. Table 3 shows the AUC scores for max-confidence and confidence-distance methods, as well as for OneClassSVM.

The best score we achieved when training on $D_{train}^{(5)}$ and testing on $D_{test}$ is about 0.956 with max-confidence neural network model, using 2 hidden layers, 100 neurons each and the regularization parameter $α = 1$ , same as the one we described before. We also notice that all max-confidence models performed slightly better than confidence-distance methods. As we proved before, the two approaches are equivalent when training on two classes.

Fig. 5.

ROC curves for the proposed methods. Abbreviations: LR – Logistic Regression, NB – Naïve Bayes, NN – Neural Network.

Judging by the experimental results shown in Fig. 5, we can see that max-confidence methods backed with a logistic regression model or a neural network outperform the other two methods in terms of area under the curve score on Reuters data.

9.1. Comparison to TONMF method

In this section we present the results of an experiment where the method proposed in this paper is compared to the TONMF algorithm [10].

TONMF is an approach that has significant advantages over traditional methods for text outlier detection. It achieves an AUC score of 0.9340 in detecting outliers in a dataset consisting of all the documents belonging to the earn and acq categories, with the true outliers being 100 documents from the interest category (an unsupervised setting).

The experimental protocol that we used for comparison is the same as the one the authors describe in [7], only adjusted to a semi-supervised setting. Briefly, 4436 documents from the categories earn and acq comprised the training set for the classifiers. Novelties were 81 documents from the category interest, which, mixed with the remaining 1779 documents from earn and acq, comprised the testing set.

Table 4 shows that the registered AUC score for logistic regression max-confidence scheme is 97.82%, which is significantly higher than 93.4% that TONMF achieves on the same task.

Table 4
AUC scores for different methods when using earn and acq as known classes and interest as a novel class

Detector	AUC score
Max-confidence logistic regression	0.9782
Max-confidence naïve Bayes	0.9662
Max-confidence neural network	0.9700

9.2. Additional experiment

In this experiment, we took data from the BBC Dataset [11] and segmented it to test the proposed methods once more. This dataset contains BBC News articles, divided into five categories: business, entertainment, politics, sport and tech. We randomly chose articles from business and sport categories, 350 from each, constructing a training set of 700 documents. The remaining articles from these two categories (about 150 from each) were added to the test set, representing non-novelties for the testing phase. All articles from entertainment, politics and tech (approximately 400 each) were added to the test set as novelties. Figure 6 shows the ROC curves on this dataset, which was intentionally built to be very different from the sets we constructed from Reuters data. Namely, the testing set is larger than the training set, and it contains many more novelties than non-novelties. There is an obvious decline in the performance of OneClassSVM, while the max-confidence methods still achieved great results. This shows that they are robust to datasets of this particular type, where there is a large number of novelties to be detected, several times bigger than the set of regular documents.

Fig. 6.

ROC curves for proposed methods on the dataset constructed from BBC News data. Abbreviations: LR – Logistic Regression, NB – Naïve Bayes, NN – Neural Network.

10. Extending the proposed methods to work with polluted datasets

To make the terminology clear, “polluted” is used to refer to cases where the training set may contain documents from certain categories that should be considered novel during testing. An extension of max-confidence novelty detection that can work with such setups is discussed.

The training set needs to be re-structured in the following way. All known classes are treated separately, as before. In addition to that, all the classes that should be considered novel at testing time are packed into a single class, that we refer to as other. After this is done, a classifier can be fit to that training set.

Upon the completion of the training phase, the extended procedure for detecting novelties is applied. If the largest probability in the output distribution of the classifier corresponds to the other class, the current document is immediately classified as novel. Otherwise, the usual threshold test is performed, i.e. the largest probability is compared to ϵ, and the document is declared novel if that probability is lower than ϵ.
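The extended decision rule can be written down in a few lines. This is a sketch under the assumption that the classifier output is available as a mapping from class labels to probabilities; the function name and the example distributions are hypothetical.

```python
from typing import Dict


def is_novel_with_other(probabilities: Dict[str, float], eps: float,
                        other_label: str = "other") -> bool:
    """Extended max-confidence test for classifiers trained with an 'other' class."""
    best = max(probabilities, key=probabilities.get)
    if best == other_label:
        return True                      # most likely class is 'other': novel at once
    return probabilities[best] < eps     # otherwise the usual threshold test


flagged_by_other = is_novel_with_other({"earn": 0.2, "acq": 0.1, "other": 0.7}, eps=0.9)
flagged_by_eps = is_novel_with_other({"earn": 0.6, "acq": 0.3, "other": 0.1}, eps=0.9)
accepted = is_novel_with_other({"earn": 0.95, "acq": 0.03, "other": 0.02}, eps=0.9)
print(flagged_by_other, flagged_by_eps, accepted)
```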

We carried out experiments with such a detector on training datasets formed from $D_{train}^{(2)}$ and $D_{train}^{(5)}$ by including a randomly selected fraction of the documents remaining in unused categories. For testing, we still used the same $D_{test}$ dataset. All performance scores increased. The larger the fraction of added documents, the better the performance. In the extreme case of packing everything in $D_{train}$ that was initially unused into the other category, we got approximately the same scores as for pure classification (i.e. 97–98% for accuracy, precision and recall). This is intuitively expected, since we are basically reducing the task of novelty detection to that of document classification.

The addition of “pollution” to the training set may also be viewed as a kind of regularizer for the novelty detection task. Including documents that should be treated as unusual during training should help the classifier avoid overfitting, as it no longer focuses solely on known classes. It is beneficial to gather as much data for the other class as possible, and it does not even have to be labelled. This is reminiscent of traditional semi-supervised methods where labelled examples are supplemented with unlabelled ones to decrease test error.

It is possible to extend confidence-distance in a similar way and achieve roughly the same results.

11. Discussion

With the procedures introduced in this paper it is possible to have novelty detection encapsulated in a classifier module and its knowledge. We conducted a series of experiments simulating the same protocols on the same datasets explained in the literature. Methods’ accuracy, precision, recall and F1-score were measured, as well as Receiver Operating Characteristic (ROC) AUC scores. Both proposed methods easily achieve 90% and higher accuracy and recall at the same time on various datasets. Also, the highest AUC scores we obtained ranged from 95–96% to 97–98%, depending on the dataset, while the best results obtained in the literature are below 95%.

Confidence-distance novelty detection performs similarly to max-confidence detection, but we included it in this study as a separate procedure because it appears to be useful in the following use case. We implemented it as a part of a specific document understanding system where an information extraction module helps a classifier learn classes incrementally from a live document stream. The idea is as follows. When the max-confidence detector does not recognize that document d belongs to an unknown class, the classifier will start confidence-distance detection and return the two labels corresponding to the two highest probability estimates if $Δ p < ϵ$ holds.

The result is of the form $[c_{i_{\max}}, c_{i_{\max}^{″}}]$. When the extractor receives two options for the class label, it will run both extraction models, namely $M_{c_{i_{\max}}}$ and $M_{c_{i_{\max}^{″}}}$. The extraction process is evaluated by a confidence measure and, depending on the registered confidence, a different feedback message will be sent to the classifier. If $conf(M_{c_{i_{\max}}}) > conf(M_{c_{i_{\max}^{″}}})$, the feedback message will be of the form $(d, c_{i_{\max}})$. Otherwise, the feedback message will be of the form $(d, c_{i_{\max}^{″}})$. This way the classifier will use the knowledge of the extractor to determine the correct class label for the document d and include it in the training dataset for incremental learning.
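The feedback step described above might look as follows in code. This is only a schematic sketch: the `extraction_confidence` callable is a hypothetical stand-in for running the extraction model of a given class and measuring its confidence, and none of the names below come from the actual system.

```python
from typing import Callable, Dict, Tuple


def top_two_labels(probabilities: Dict[str, float]) -> Tuple[str, str]:
    """Labels with the highest and second highest probability estimates."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return ranked[0], ranked[1]


def feedback_message(doc_id: str, probabilities: Dict[str, float],
                     extraction_confidence: Callable[[str], float]) -> Tuple[str, str]:
    """Return (d, label), choosing the candidate whose extraction model is more confident."""
    first, second = top_two_labels(probabilities)
    if extraction_confidence(first) > extraction_confidence(second):
        return (doc_id, first)
    return (doc_id, second)


# Hypothetical extraction confidences for the two candidate classes.
toy_conf = {"invoice": 0.35, "contract": 0.80}
message = feedback_message("d42", {"invoice": 0.48, "contract": 0.45, "letter": 0.07},
                           lambda label: toy_conf[label])
print(message)
```

In this toy case the classifier slightly prefers invoice, but the extractor is more confident with the contract model, so the feedback carries the contract label.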

In this research we tested several classification models: logistic regression, naïve Bayes and neural networks. All of them achieve very similar performance regarding novelty detection. We can conclude that, for the novelty detection task, it is not critical which model is used to perform classification, as long as it achieves high accuracy at distinguishing between the known classes in the training dataset. With a properly tuned ϵ parameter, such a model can be successfully used for separating novelties from the known classes and obtaining the high performance measures mentioned earlier.

12. Conclusion

In this paper we discuss the problem of outlier and novelty detection in text data. The motivation for this study is to design and implement an approach that distinguishes documents of previously unknown classes in the document stream processed by a general document understanding system.

We propose semi-supervised, classification based methods for outlier detection with a multi-class classifier. The multi-class classifier is created on the training dataset containing only samples from regular (known) classes (at least two classes). Classifier confidence threshold ${cls}_{conf} = ϵ_{conf}$ and classifier confidence delta threshold ${cls}_{conf Δ} = ϵ_{conf Δ}$ are hyperparameters of their respective models (max-confidence and confidence-distance) and can be heuristically initialised or tuned using the standard hyperparameter optimization procedures.

The classifier prediction for a new document d contains the probabilities that d belongs to each of the known classes, $P = \{c_{1} : p_{1}, c_{2} : p_{2}, \dots, c_{k} : p_{k}\}$. The class label $c_{i}$ is assigned to the new document with probability $p_{i}$. Let $p_{i_{\max}} = \max_{i} p_{i}$ be the highest probability in P and $p_{i_{\max}^{″}} = \max_{i \neq i_{\max}} p_{i}$ the second highest probability in P.

We propose two procedures to determine if a new document is a novelty or a known sample. Max-confidence novelty detection declares a new document as a novelty if $p_{i_{\max}} < ϵ_{conf}$ holds. Confidence-distance novelty detection declares a new document as a novelty if $p_{i_{\max}} - p_{i_{\max}^{″}} < ϵ_{conf Δ}$ .
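Both decision rules translate directly into code. The sketch below assumes the classifier output is given as a list of class probabilities with at least two entries; the function names and example values are illustrative only.

```python
from typing import List


def max_confidence_is_novel(probs: List[float], eps_conf: float) -> bool:
    """Novelty if the highest class probability is below the confidence threshold."""
    return max(probs) < eps_conf


def confidence_distance_is_novel(probs: List[float], eps_delta: float) -> bool:
    """Novelty if the gap between the two highest probabilities is below the threshold."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second < eps_delta


uncertain = [0.5, 0.3, 0.2]      # no class clearly dominates
confident = [0.97, 0.02, 0.01]   # one class dominates

print(max_confidence_is_novel(uncertain, 0.9), confidence_distance_is_novel(uncertain, 0.3))
print(max_confidence_is_novel(confident, 0.9), confidence_distance_is_novel(confident, 0.3))
```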

Our approach manages to go well beyond the capabilities of One-class, SVM-based novelty detection method in terms of various important performance metrics. Also, the achieved performance on Reuters data is better than the TONMF method in the same experimental protocol.

References

1. Aggarwal and Zhai, Mining Text Data, Chapter 4, Kluwer Academic Publishers, 2012, pp. 77–222. doi:10.1007/978-1-4614-3223-4_4.

2. Akram, M.U.D. Dar and Quyoum, Document image processing – a review, International Journal of Computer Applications 10(5) (2010), 35–40. doi:10.5120/1475-1991.

3. Bird, Klein and Loper, Natural Language Processing with Python, O’Reilly Media, 2009.

4. Chandola, Banerjee and Kumar, Anomaly detection: A survey, ACM Computing Surveys (2009), 1–72. doi:10.1145/1541880.1541882.

5. Chen, Lin and Schölkopf, A tutorial on ν-support vector machines, in: Applied Stochastic Models in Business and Industry, 2005, pp. 111–136.

6. M.G. Constantino, Atkinson and Bollegala, CLIEL: context-based information extraction from commercial law documents, in: Proceedings of the 16th edition of the International Conference on Artificial Intelligence and Law, 2017, pp. 79–87. doi:10.1145/3086512.3086520.

7. Dasgupta and Dey, Automatic scoring for innovativeness of textual ideas, in: Workshops at the Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 507–511.

8. Esser, Schuster, Muthmann, Berger and Schill, Automatic indexing of scanned documents – a layout-based approach, in: Proceedings of DRR 2012 – Document Recognition and Retrieval XIX, 2012. doi:10.1117/12.908542.

9. Ghosal, Edithal, Ekbal, Bhattacharyya, Tsatsaronis, S.S. Sameer and Chivukula, Novelty goes deep. A deep neural solution to document level novelty detection, in: Proceedings of 27th International Conference on Computational Linguistics (COLING 2018), 2018.

10. Ghosal, Salam, Tiwari, Ekbal and Bhattacharyya, TAP-DLND 1.0: A Corpus for Document Level Novelty Detection, 2018, arXiv:1802.06950.

11. Greene and Cunningham, Practical solutions to the problem of diagonal dominance in kernel document clustering, in: Proc. ICML, 2006.

12. C.R. Harris, K.J. Millman and S.J. van der Walt, Array programming with NumPy, Nature 585 (2020), 357–362. doi:10.1038/s41586-020-2649-2.

13. Kannan, Woo, C.C. Aggarwal and Park, Outlier detection for text data: An extended version, 2017, arXiv:1701.01325.

14. Pedregosa et al., Scikit-learn: Machine learning in Python, JMLR 12 (2011), 2825–2830.

15. P.J. Rousseeuw and A.M. Leroy, Robust Regression and Outlier Detection, John Wiley & Sons, Inc., New York, NY, USA, 1987.

16. Schuster, Muthmann, Esser and Berger, Intellix – end-user trained information extraction for document archiving, in: The 12th International Conference on Document Analysis and Recognition (ICDAR), 2013, pp. 101–105. doi:10.1109/ICDAR.2013.28.

17. Teng, Chen and Lu, Adaptive real-time anomaly detection using inductively generated sequential patterns, in: Proceedings of IEEE Computer Society Symposium on Research in Security and Privacy, IEEE Computer Society Press, 1990, pp. 278–284. doi:10.1109/RISP.1990.63857.

18. Zhang, Callan and Minka, Novelty and redundancy detection in adaptive filtering, in: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002, pp. 81–88. doi:10.1145/564376.564393.

$D_{train}$				$D_{test}$

Category	Number of documents	Percentage	Vocabulary size	Category	Number of documents	Percentage	Vocabulary size
earn	2840	43.18	9528	earn	1083	41.93	4316
acq	1596	24.27	10035	acq	696	26.95	6348
crude	253	3.85	4281	crude	121	4.68	2767
trade	250	3.8	4355	money-fx	87	3.37	1941
money-fx	222	3.38	3492	interest	81	3.14	1653
interest	191	2.9	2455	trade	76	2.94	2475
money-supply	123	1.87	1061	ship	36	1.39	1322
ship	108	1.64	2546	money-supply	28	1.08	608
sugar	97	1.47	1981	sugar	25	0.97	1091
coffee	90	1.37	2315	coffee	22	0.85	976
gold	70	1.06	1615	gold	20	0.77	617
gnp	59	0.9	1728	alum	19	0.74	658
cpi	54	0.82	920	cpi	17	0.66	439
cocoa	46	0.7	1471	cocoa	15	0.58	652
grain	41	0.62	1721	gnp	15	0.58	1092
jobs	37	0.56	801	copper	13	0.5	761
reserves	37	0.56	684	iron-steel	12	0.46	538
ipi	34	0.52	773	jobs	12	0.46	246
alum	31	0.47	1078	nat-gas	12	0.46	636