Estimating a one-class naive Bayes text classifier

Abstract

A growing number of information extraction projects need to classify large amounts of text data. The common way to classify text is to build a supervised classifier trained on human-labeled positive and negative examples. In many cases, however, it is easy to label positive examples but hard to label negative ones. In this paper, we address the problem of building a one-class classifier when only the positive examples are labeled. Previous work on building one-class classifiers mostly uses positive examples and unlabeled data. We show that a configurable one-class classifier such as one-class naive Bayes can be optimized by examining the clustering quality of the classification on target data. We propose to use existing and new quality scores for determining the clustering quality of the classification. Experimental analysis with real-world data shows that our approach generally achieves high classification accuracy, and in some cases improves the accuracy by more than 10% compared to state-of-the-art baselines.

Keywords

Machine learning, one-class classifier, naive Bayes

1. Introduction

Information systems based on text data commonly need a text classification component. We have seen such information systems built around microblog text data that predict earthquake movements [21] and influenza [6], locate crime incidents [27], analyze public sentiments in elections [24], and detect rumors [15]. In these information systems, classification is required since only a small portion of the text data is relevant. For instance, when building a system for detecting crime and disaster events, the authors find that only 0.05% of all collected tweets are related to the application [11].

A common way to classify text is to build a supervised classifier that is trained on some labeled examples, which usually include positive and negative examples [23]. However, in particular applications, it is often easier for human annotators to label positive examples than to label negative ones. According to psychology theories, people tend to see what they want to see, based on their experience, motivation, and emotional state [1]. In an event monitoring project, for example, a researcher may already have some experience with the event-related text data and be interested in looking for more. However, she may not be interested in identifying unrelated data, which is tedious and unexciting for her. As a result, determining the positive class can be faster, as she knows what to look for in the data, while determining the negative class can be difficult and slow. This is a reality already recognized by a number of studies [14, 3]. The question is whether we can build a supervised classifier using only positive examples.

A number of studies propose to build one-class classifiers, in other words, classifiers trained only with examples from the positive class [9]. These works can be divided into two groups: one uses only positive examples [17, 25], and the other uses positive examples and unlabeled data as a form of semi-supervised learning [12, 7]. An example of the latter is the one-class naive Bayes classifier, which in previous work has been built from positive and unlabeled data [3]. Naive Bayes is a supervised classification technique that has proven effective in text classification. At the core of this technique is the feature independence assumption, with which the probability of a text belonging to a class can be expressed as the product of its individual feature probabilities given the class. As already shown in previous work, a one-class naive Bayes classifier can be built with unlabeled data and a user-supplied parameter [3]. This user-supplied parameter, called the background positive probability, however, depends on the target data to be classified and cannot be estimated easily.

The background positive probability is an indication of the ratio of positive instances in the target data, and is a missing piece in existing one class classifiers. In order to estimate this parameter, we need to look at classified results involving the target data. Assuming that positive instances are more similar to each other than to negative instances, our method is based on clustering quality, which measures how well two classes are separated. Our hypothesis is as follows: given a number of parameter values of a one-class classifier, we measure the clustering quality of classification, and the parameter value that provides the best clustering quality should also provide the best classification quality. We use existing measures of clustering quality [20], as well as a new measure called largest separation of nearest instance (LSN). To summarize, we make the following contributions with this paper:

• We review the technique for building one-class naive Bayes classifiers and extend it by considering the clustering quality of the classification on target data. We identify that the configurable parameter, i.e., the background positive probability, can be optimized according to the clustering quality. Importantly, this optimization eliminates the need to set the parameter manually, as required by previous works, and does not require extra information or labeling.
• We conduct an extensive evaluation of the effectiveness of our approach, comparing it to a wide range of baseline methods. Tested on publicly available datasets, the classifier we build generally achieves high classification accuracy, and in some cases improves the accuracy by more than 10% compared to state-of-the-art baselines.

The remainder of this paper is organized as follows: in Section 2, we discuss related works on one-class classification. Section 3 reviews one-class naive Bayes classifier and presents our method for parameter estimation. In Section 4, we present our experimental evaluation and discuss the results. In Section 5, we provide some insights we gain through this study regarding the background positive probability. Finally, Section 6 concludes this paper.
2. Related works

Early works usually treat the one-class classification problem as an anomaly detection problem. The essential technique is to measure the distance between the document to be classified and the positive examples, and to use a threshold for determining results. For instance, Pazzani and Billsus propose to use the average of the positive examples, called the Prototype, as the basis for measuring distance [19], and then apply a threshold that provides the class judgment. A similar approach based on Nearest Neighbor (NN) is introduced by Manevitz and Yousef [16], in which the distance is measured between the test document and the nearest document in the positive examples. An interesting technique for positive-only classification, called One-class SVM (OSVM), is now the benchmark method for one-class classification [22]. Proposed by Schölkopf et al. [22], OSVM is a modified version of the Support Vector Machine (SVM) that builds a decision boundary based on positive examples only. Generally, because these techniques use only the positive examples, their performance is worse than that of techniques that incorporate both positive and unlabeled data [13].

There is a series of studies called PU-learning, which aims to build classifiers with positive and unlabeled data. The techniques introduced in these studies are quite similar to each other, as they use a two-step approach. First, using some heuristics, they discover reliable negative data from the unlabeled data. Then, using the positive data and reliable negative data, they build a traditional classifier, usually an SVM [13]. Liu et al. propose a technique called Spy-EM [14]. This method first selects spies, a portion of the positive data, to add to the unlabeled data. Then it uses Expectation Maximization (EM) to select a probable classifier, which can be used to select a probability threshold that leads to reliable negatives. This technique is based on the assumption that the spies will behave similarly to the unknown positives in the unlabeled data, letting the classifier learn the characteristics of the positive data. Li et al. [12] propose another technique called Roc-SVM. The main difference from the previous technique is that it selects reliable negatives from unlabeled data using a Rocchio classifier, which tends to provide lower false-negative rates. The assumption common to this series of techniques in the context of document classification is that the proportion of positives in the unlabeled data is small and that the negatives fall into diverse topics, which is generally true for topic-based document classification. However, it is unclear whether these techniques can achieve the same effectiveness in other types of classification applications.

There is a series of existing studies on building one-class naive Bayes classifiers with positive and unlabeled data. Early techniques such as the one proposed by Wang and Stolfo use a threshold that sets a boundary between positive and negative predictions [25]. Denis et al. first show that when using a naive Bayes classifier in document classification, the word probability for the negative class can be derived from positive examples and word counts in the unlabeled data [5]. They call this technique Positive Naive Bayes (PNB). However, the authors do not provide a solution for estimating the necessary component, the background positive probability. Calvo et al. show that for feature vectors with categorical values, the negative feature probability can be transformed using Laplace correction to remove below-zero probabilities [3]. They go on to show that PNB can be extended to more complex networks. However, the positive instance probability remains unknown. He et al. extend the PNB model by incorporating uncertainty, which they call Uncertain Positive Naive Bayes (UPNB), basically using probability cardinalities to replace feature counts [8]. They show that the negative feature probability is a parameter that needs to be adjusted, and propose a method to find its optimal value by using part of the positive examples as validation. Recently, an attempt has been made by Bekker and Davis to establish the prior positive and negative probability from discrete training data [2]. The technique they use, which is based on label frequency estimation, can however only estimate the upper and lower bounds of the required probability. Based on existing studies, it is clear that the negative feature probability, which is required to build a one-class naive Bayes classifier, cannot be effectively established using only positive training examples and unlabeled data.

3. Optimized one-class naive Bayes classifier

We will first discuss the one-class naive Bayes classifier in the context of text classification, particularly the classification of short messages such as tweets. Because our focus is on building the classifier rather than designing effective features, we will use, without loss of generality, the most basic feature representation for text messages, namely, the bag-of-words (BOW) binary representation. Specifically, given a vocabulary $V$ , where $|V|=m$ , each text message is represented as a feature vector $\{x_{1},\ldots,x_{m}\}$ , where $x_{i}\in\{0,1\}$ indicates whether the message contains the $i$ -th word in the vocabulary. This representation is suitable for short messages and tweets, as in most cases a word will occur no more than once in a message.

3.1 Naive Bayes classifier

The goal of a text classifier is to find the probability of a document belonging to a class given data $p(c=l|x_{1},\ldots,x_{m})$ , where $\{x_{1},\ldots,x_{m}\}$ is the transformed feature vector of the document, and $l\in\{1,\ldots,k\}$ are the different classes.

A naive Bayes classifier is built on the “naive” conditional independence assumption, such that the features $x_{i}$ are independent of each other given the class. Therefore we can transform the probability of a predicted class into a product of conditional probabilities of each feature $x_{i}$ given $c=l$ .

$\displaystyle p(c=l|x_{1},\ldots,x_{m})\propto p(c=l,x_{1},\ldots,x_{m})\propto p(c=l)\prod_{i=1}^{m}p(x_{i}|c=l)$ (1)

If we can obtain each $p(x_{i}|c=l)$ , we can calculate the probability of the document belonging to a class using the above equation. In the case of the bag-of-words representation of text data, $p(x_{i}|c=l)$ can be obtained from the occurrence frequency in the training data:

$\displaystyle p(x_{i}|c=l)=\frac{p(x_{i},c=l)}{p(c=l)}=\frac{f(x_{i},c=l)}{f(c=l)}$

where $f(x_{i},c=l)$ is the number of times word $x_{i}$ appears in documents labeled as $l$ , and $f(c=l)$ is the number of documents labeled as $l$ in all training documents.

To make a prediction of the class, the probability is calculated for all classes and the one with the highest probability is chosen as the predicted class:

$\displaystyle\hat{y}=\operatorname*{arg\,max}_{l\in\{1,\ldots,k\}}p(c=l)\prod_{i=1}^{m}p(x_{i}|c=l)$ (2)

We can see that if we only have one class (e.g., 1) in the training data, $p(c=1)$ will be 1, and all other $p(c=l)$ will be zero, and the class will always be predicted as 1. It is thus clear that a naive Bayes classifier does not work with training examples from one class only.
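The following is a minimal Python sketch of Eqs. (1) and (2) for binary BOW vectors, intended only as an illustration. It makes two assumptions that are not stated in the text: a feature that is absent ( $x_{i}=0$ ) contributes $1-p(x_{i}=1|c=l)$ to the product, and add-one smoothing is applied so that logarithms stay finite. The function names `train_nb` and `predict_nb` are hypothetical.

```python
import math

def train_nb(docs, labels, num_classes, smoothing=1.0):
    """Estimate p(c=l) and p(x_i=1|c=l) from binary BOW vectors."""
    m = len(docs[0])
    class_counts = [0] * num_classes
    word_counts = [[0] * m for _ in range(num_classes)]
    for x, label in zip(docs, labels):
        class_counts[label] += 1
        for i, xi in enumerate(x):
            word_counts[label][i] += xi
    priors = [count / len(docs) for count in class_counts]
    # Add-one smoothing is an assumption of this sketch (it avoids log(0)).
    cond = [
        [(word_counts[l][i] + smoothing) / (class_counts[l] + 2 * smoothing)
         for i in range(m)]
        for l in range(num_classes)
    ]
    return priors, cond

def predict_nb(x, priors, cond):
    """Eq. (2) in log space: argmax_l log p(c=l) + sum_i log p(x_i|c=l)."""
    best_label, best_score = None, -math.inf
    for l, prior in enumerate(priors):
        if prior == 0.0:
            continue  # a class with zero prior can never win the argmax
        score = math.log(prior)
        for xi, theta in zip(x, cond[l]):
            score += math.log(theta if xi else 1.0 - theta)
        if score > best_score:
            best_label, best_score = l, score
    return best_label
```

When every training label is 1, the estimated prior of class 0 is zero and `predict_nb` returns 1 for any input, which is the degenerate behavior described above.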

In the next subsection, we will show that with two pieces of auxiliary information we can devise a naive Bayes classifier with only one class of training examples.

3.2 One-class naive Bayes classifier

In one-class classification, the problem becomes finding the probability of positive and negative classes for a document, $p(c=1|\mathbf{x})$ and $p(c=0|\mathbf{x})$ , with only positive training examples. As we discussed earlier, with only positive training examples, we cannot estimate $p(c=0|\mathbf{x})$ using the standard prediction equation, i.e., Eq. (2), because $p(c=0)$ will be 0 and $p(x_{i}|c=0)$ will be undefinable.

Nowadays it is easy to get a large number of unlabeled data from online platforms such as microblog services. Assume we have a set of unlabeled data. From such data we can estimate the overall probability of a feature $p(x_{i})$ . In the case of BOW representation, $p(x_{i})=\frac{f(x_{i})}{N}$ , where $N$ is the total number of documents.

On the other hand, we can express $p(x_{i})$ as the marginal probability:

$\displaystyle p(x_{i})=p(x_{i}|c=1)p(c=1)+p(x_{i}|c=0)p(c=0)$

Re-organizing this equation will give

$\displaystyle p(x_{i}|c=0)=\frac{p(x_{i})-p(x_{i}|c=1)p(c=1)}{1-p(c=1)}$ (3)

This negative feature probability $p(x_{i}|c=0)$ is one of the two components required for predicting $p(c=0|\mathbf{x})$ using the standard prediction equation. The other component is the background negative probability, $p(c=0)$ . Assume we can estimate the background positive probability, $p(c=1)$ . Consequently, $p(c=0)=1-p(c=1)$ .
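As a small worked illustration of Eq. (3), the sketch below computes the negative feature probability for one feature. The clipping to a small interval is an assumption of this sketch: Eq. (3) can produce values below zero when $p(x_{i}|c=1)p(c=1)$ exceeds $p(x_{i})$ , and the text does not say how such cases are handled. The function name and the constant `eps` are hypothetical.

```python
def negative_feature_prob(p_x, p_x_pos, p_pos, eps=1e-6):
    """Eq. (3): p(x_i|c=0) from p(x_i), p(x_i|c=1), and p(c=1)."""
    if not 0.0 <= p_pos < 1.0:
        raise ValueError("p(c=1) must lie in [0, 1) for Eq. (3) to be defined")
    raw = (p_x - p_x_pos * p_pos) / (1.0 - p_pos)
    # Clipping into [eps, 1 - eps] is an assumption of this sketch.
    return min(max(raw, eps), 1.0 - eps)
```

For example, with $p(x_{i})=0.02$ , $p(x_{i}|c=1)=0.3$ , and $p(c=1)=0.05$ , the result is $(0.02-0.015)/0.95\approx 0.0053$ .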

Previous studies show that the background positive probability $p(c=1)$ cannot be obtained from positive training data and unlabeled data alone [3, 8]. Some studies suggest leaving it as a parameter to be defined by the user [3]. We argue that $p(c=1)$ is unclear because this information depends on the target data to be classified. Essentially, this parameter represents the proportion of potential positive instances in the target data. Accurately estimating this parameter will thus involve studying the target data. In the next subsection, we show that by looking at the clustering quality of the classification, we can find an optimal value for $p(c=1)$ .

3.3 Estimating optimal $p(c=1)$ from the clustering quality of the classification

The existing one-class classification techniques mostly use only the positive and unlabeled data to build the model. In this paper, we aim to optimize the parameter $p(c=1)$ by looking at the clustering quality of classification on the target data. Here we make the common assumption that, with text data, documents within the positive class are similar to each other, and are more different from documents in the negative class. Therefore it is plausible to apply clustering quality measures based on the separation of positive and negative class.

The process for this optimization includes three steps, namely, setting the parameter, classification, and obtaining the clustering quality score. First we set a number of candidate $p(c=1)$ values, e.g., 100 values in [0, 1] with a step of 0.01. Then with each value, we build the classifier. We note that the positive feature probability $p(x_{i}|c=1)$ need not be recalculated from the training data; we only need to update $p(x_{i}|c=0)$ using Eq. (3) for each $p(c=1)$ . Next we apply the classifier to the target data, and after obtaining the positive and negative predictions, which can be considered as two clusters, we calculate a clustering quality score. Finally, the $p(c=1)$ value that leads to the highest quality score is chosen as the optimal value.

Many statistical measures of clustering quality can be considered. A popular choice is the average silhouette width (ASW). Introduced in [20], this measure can be adapted to measure the clustering quality of one class while ignoring other classes, which suits our scenario in which only the positive class is considered. Applying ASW to our one-class scenario, a silhouette is calculated for each document in the positive class:

$\displaystyle s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}}$

where $a(i)$ is the average dissimilarity of document $i$ with all other documents within positive class, and $b(i)$ is the average dissimilarity of document $i$ with all documents in the negative class. It is clear that for $s(i)$ to have a higher value, $a(i)$ needs to be much less than $b(i)$ , indicating small dissimilarity within the positive class and large dissimilarity to the negative class. Therefore a higher $s(i)$ means better clustering. The final quality score is calculated as the average $s(i)$ of all documents in the positive class.
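The one-class ASW described above can be sketched in a few lines of Python. The dissimilarity function is not specified in the text, so the Hamming distance between binary vectors is used here purely as an assumption; any other dissimilarity can be passed in through the hypothetical `dist` argument. Returning `None` when the score is undefined is also a choice of this sketch.

```python
def hamming(u, v):
    """Number of positions where two binary vectors differ (assumed dissimilarity)."""
    return sum(a != b for a, b in zip(u, v))

def positive_asw(pos, neg, dist=hamming):
    """Average silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i)) over the positive class."""
    if len(pos) < 2 or not neg:
        return None  # a(i) or b(i) is undefined
    scores = []
    for i, x in enumerate(pos):
        # a(i): average dissimilarity to the other predicted positives
        a = sum(dist(x, y) for j, y in enumerate(pos) if j != i) / (len(pos) - 1)
        # b(i): average dissimilarity to the predicted negatives
        b = sum(dist(x, y) for y in neg) / len(neg)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```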

We also propose a new measure called largest separation of nearest instance (LSN), which is calculated as the distance between the nearest pair of positive and negative instances. The rationale is that, because text data generally have high dimensionality, it is not the overall difference between the positive and negative clusters but the difference between the two most similar instances, one from each class, that indicates cluster separation. Algorithmically, we first calculate the distances between all pairs of positive and negative instances and pick the smallest value $d$ ; a larger $d$ indicates better separation. Empirically, we found that this measure tends to provide a better classifier configuration than ASW.
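A minimal sketch of the LSN computation for one classification result follows. The distance function is not specified in the text; the Hamming distance is assumed here for binary vectors, and returning `None` for an empty class is a choice of this sketch.

```python
def hamming(u, v):
    """Number of positions where two binary vectors differ (assumed distance)."""
    return sum(a != b for a, b in zip(u, v))

def lsn_score(pos, neg, dist=hamming):
    """Distance between the nearest pair of predicted positive and negative instances."""
    if not pos or not neg:
        return None  # undefined when one predicted class is empty
    return min(dist(p, q) for p in pos for q in neg)
```

The score grows as the two closest instances of the opposite classes move further apart, so candidate configurations are compared by preferring larger values.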

3.4 Overall algorithm

We present the algorithms for training and estimating the optimal one-class naive Bayes classifier in this section. As discussed in the previous section, a naive Bayes classifier can be described in four components $p(\mathbf{x}|c=1)$ , $p(\mathbf{x}|c=0)$ , $p(c=1)$ , and $p(c=0)$ . For one-class naive Bayes classifier, since $p(c=0)=1-p(c=1)$ , we do not need to specify $p(c=0)$ . Also for efficiency reasons to be explained, we would like to store $p(\mathbf{x})$ . Thus a one-class naive Bayes classifier can be described by $C=<p(\mathbf{x}|c=1),p(\mathbf{x}|c=0),p(\mathbf{x}),p(c=1)>$ . In Algorithm 1, we show how a classifier can be trained using positive examples and unlabeled data. Particularly, $p(\mathbf{x})$ is obtained through unlabeled data (line 1), which is a more time-consuming part of the algorithm, since unlabeled data are usually much larger than labeled data in size.

Algorithm 1 Training One-class Naive Bayes Classifier For Binary Data
INPUT positive examples $X_{p}$ , unlabeled data $X_{u}$ , configuration $p(c=1)$
OUTPUT classifier $C=<p(\mathbf{x}|c=1),p(\mathbf{x}|c=0),p(\mathbf{x}),p(c=1)>$
1: $p(\mathbf{x})\leftarrow\frac{\sum_{i}f(X_{u})^{i}}{|X_{u}|}$
2: $\mathbf{s}\leftarrow\sum_{i}f(X_{p})^{i}$
3: $p(\mathbf{x}|c=1)\leftarrow\frac{\mathbf{s}}{|X_{p}|}$
4: $p(\mathbf{x}|c=0)\leftarrow\frac{p(\mathbf{x})-p(\mathbf{x}|c=1)p(c=1)}{1-p(c=1)}$
5: return $C=<p(\mathbf{x}|c=1),p(\mathbf{x}|c=0),p(\mathbf{x}),p(c=1)>$

We have discovered that it is unnecessary to retrain a classifier each time when a new $p(c=1)$ is supplied. First, the new $p(c=1)$ will not change $p(\mathbf{x}|c=1)$ and $p(\mathbf{x})$ . Second, using the stored $p(\mathbf{x})$ , we can calculate the new $p(\mathbf{x}|c=0)$ in a single step. Algorithm 2 shows how to efficiently update the classifier when given a new $p(c=1)$ .

Algorithm 2 Updating $p(c=1)$
INPUT model $C=<p(\mathbf{x}|c=1),p(\mathbf{x}|c=0),p(\mathbf{x}),p(c=1)>$ , new value $p(c=1)$
OUTPUT updated model $C^{\prime}$
1: $p(c=1)^{\prime}\leftarrow p(c=1)$
2: $p(\mathbf{x}|c=0)^{\prime}\leftarrow\frac{p(\mathbf{x})-p(\mathbf{x}|c=1)p(c=1)^{\prime}}{1-p(c=1)^{\prime}}$
3: return $C^{\prime}=<p(\mathbf{x}|c=1),p(\mathbf{x}|c=0)^{\prime},p(\mathbf{x}),p(c=1)^{\prime}>$
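The training and update procedures can be sketched together in Python as follows, with the negative feature probability computed from Eq. (3). This is an illustration under stated assumptions rather than the authors' implementation: the class name `OneClassNB`, the function names, and the clipping constant `eps` (which keeps Eq. (3) inside the open unit interval) are all hypothetical, and the caller is expected to pass $p(c=1)<1$ .

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass(frozen=True)
class OneClassNB:
    p_x_pos: List[float]  # p(x|c=1)
    p_x_neg: List[float]  # p(x|c=0)
    p_x: List[float]      # p(x), stored so that updates avoid rescanning unlabeled data
    p_pos: float          # p(c=1)

def _feature_means(rows):
    """Fraction of documents containing each word (rows are binary vectors)."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def _negative_probs(p_x, p_x_pos, p_pos, eps):
    # Eq. (3), clipped into [eps, 1 - eps]; the clipping is an assumption.
    return [min(max((px - pp * p_pos) / (1.0 - p_pos), eps), 1.0 - eps)
            for px, pp in zip(p_x, p_x_pos)]

def train(x_pos, x_unl, p_pos, eps=1e-6):
    """Build the classifier from positive examples and unlabeled data."""
    p_x = _feature_means(x_unl)
    p_x_pos = _feature_means(x_pos)
    return OneClassNB(p_x_pos, _negative_probs(p_x, p_x_pos, p_pos, eps), p_x, p_pos)

def update(model, new_p_pos, eps=1e-6):
    """Recompute only p(x|c=0) for a new p(c=1); the other components are reused."""
    p_x_neg = _negative_probs(model.p_x, model.p_x_pos, new_p_pos, eps)
    return replace(model, p_x_neg=p_x_neg, p_pos=new_p_pos)
```

Because `update` touches only the stored vectors, its cost depends on the vocabulary size and not on the size of the unlabeled data.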

Finally, Algorithm 3 shows how to estimate an optimal $p(c=1)$ using the LSN measure. It involves updating the model with each candidate $p(c=1)$ value (lines 3–4) and measuring the classification quality as the target data are classified by the updated model (lines 5–7). The optimal $p(c=1)$ is chosen as the value that provides the largest LSN score, i.e., the largest nearest-pair distance between the predicted positive and negative classes (line 8). We skip showing the algorithm for ASW-based optimization, as replacing the LSN measure with ASW is straightforward.

Algorithm 3 Estimating Optimal $p(c=1)$ from Classification Results Using LSN
INPUT model $C$ , target data $X_{t}$
OUTPUT optimal background positive probability $p(c=1)_{o}$
1: $p(c=1)_{o}\leftarrow 0$
2: $d_{\max}\leftarrow-\infty$
3: for $p(c=1)\leftarrow$ 100 values between [0, 1], with a stepping of 0.01
4:     update $C$ with $p(c=1)$ using Algorithm 2
5:     $X_{P},X_{N}\leftarrow$ positive and negative instances predicted by applying $C$ to $X_{t}$
6:     $D\leftarrow$ pairwise distances between $X_{P}$ and $X_{N}$
7:     $d\leftarrow\min(D)$
8:     if $d_{\max}\leqslant d$ then $d_{\max}\leftarrow d$ , $p(c=1)_{o}\leftarrow p(c=1)$
9: end for
10: return $p(c=1)_{o}$
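The search over candidate $p(c=1)$ values can be sketched in Python as below, keeping the candidate whose classification yields the largest nearest-pair distance. Several details are assumptions of this sketch and not taken from the text: the endpoints 0 and 1 are skipped because Eq. (3) and the log prior are undefined there, probabilities are clipped by a hypothetical constant `EPS`, absent features contribute $1-p(x_{i}=1|c)$ , the Hamming distance is used, and candidates that leave one predicted class empty are ignored.

```python
import math

EPS = 1e-6  # clipping constant, an assumption of this sketch

def clip(p):
    return min(max(p, EPS), 1.0 - EPS)

def log_likelihood(x, theta):
    """Sum of log p(x_i|c) over binary features."""
    return sum(math.log(clip(t) if xi else 1.0 - clip(t)) for xi, t in zip(x, theta))

def classify(x, p_x_pos, p_x_neg, p_pos):
    pos = math.log(p_pos) + log_likelihood(x, p_x_pos)
    neg = math.log(1.0 - p_pos) + log_likelihood(x, p_x_neg)
    return 1 if pos > neg else 0

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def estimate_p_pos(p_x, p_x_pos, targets, candidates=None):
    """Return (best p(c=1), best LSN score) over the candidate grid."""
    if candidates is None:
        candidates = [k / 100 for k in range(1, 100)]  # 0.01 ... 0.99
    best_p, best_d = None, -math.inf
    for p_pos in candidates:
        # Only p(x|c=0) changes with p(c=1), following Eq. (3).
        p_x_neg = [clip((px - pp * p_pos) / (1.0 - p_pos))
                   for px, pp in zip(p_x, p_x_pos)]
        labels = [classify(x, p_x_pos, p_x_neg, p_pos) for x in targets]
        pos = [x for x, y in zip(targets, labels) if y == 1]
        neg = [x for x, y in zip(targets, labels) if y == 0]
        if not pos or not neg:
            continue  # LSN is undefined for an empty predicted class
        d = min(hamming(a, b) for a in pos for b in neg)
        if d >= best_d:  # ties go to the later candidate
            best_p, best_d = p_pos, d
    return best_p, best_d
```

Recomputing all pairwise distances for every candidate is quadratic in the size of the target data, so a practical implementation would likely cache distances between target documents.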

4. Evaluation

To test the effectiveness of our approach, we conduct experiments on three real-world text datasets previously used in other studies, and compare against several baseline approaches, including state-of-the-art PU learning methods. In this section, we discuss the datasets, experimental setup, and baseline approaches before presenting the results.

4.1 Experimental setup and evaluation metrics

We conduct experiments on three real-world text datasets used in different applications. The first dataset, called the shooting dataset, is a collection of tweets containing the keyword shooting. The collection contains labeled and unlabeled tweets. The labeled tweets have been divided into four domains, namely, crime, imaging, game, and metaphor. In a previous work, the task has been classifying crime-related tweets from these tweets [28]. In this work, we also consider the crime-related tweets as our target. The positive examples are thus taken from the crime-related tweets.

The second dataset is called the crisis dataset and is a publicly available dataset (http://crisislex.org/) introduced by Olteanu et al. [18]. It contains sets of tweets related to 26 natural disasters and other crisis events, including labeled and unlabeled tweets. There are two types of labels, (1) based on whether the tweet is related and informative, and (2) based on the source of the tweet, respectively. We use only related tweets. Combining 26 events we obtain a number of labeled and unlabeled tweets. The labeled tweets contain five categories, namely, eyewitness, business, government, media, and NGO. It has been shown previously that the eyewitness tweets are more desirable information in a crisis [26]. In this paper, our task is to classify eyewitness tweets among other tweets. The positive examples are taken from eyewitness tweets.

The third dataset is called the sentiment dataset, and is also publicly available (https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences). This dataset has been used by Kotzias et al. to study group and individual deep features [10]. It contains a collection of sentences taken from product review websites such as imdb.com, amazon.com, and yelp.com. A positive or negative sentiment is associated with each sentence, based on the review score provided by the reviewer. In this paper, our task is to classify positive sentences, which are our positive class. We use some of the sentences as labeled data, and the rest are treated as unlabeled.

The number of tweets in the labeled and unlabeled data for three datasets is shown in Table 1. For all three datasets, we use a portion of positive examples in the labeled tweets as the training data for our one-class classifiers. Please note that there are no missing values in the datasets.

Table 1

Dataset statistics

Dataset	Training positive	Testing positive	Testing negative	Testing total	Unlabeled
Shooting dataset	110	224	749	973	284,343
Crisis dataset	174	354	3118	3472	94,388
Sentiment dataset	135	276	389	665	2,200

For all three datasets, the tweets are transformed into vectors using BOW-binary representation. Words that appear more than 10 times in the labeled dataset are selected to be included in the BOW vocabulary. For the shooting dataset, the vocabulary size is 159. For the crisis dataset, the vocabulary size is 583. For the sentiment dataset, the vocabulary size is 227.
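The vocabulary selection and BOW-binary transformation described above can be sketched as follows. Lowercasing and whitespace tokenization are assumptions of this sketch, since the text does not describe the tokenizer; the function names are hypothetical.

```python
from collections import Counter

def build_vocabulary(docs, min_count=10):
    """Keep words that appear more than `min_count` times in the labeled documents."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    return sorted(word for word, count in counts.items() if count > min_count)

def to_binary_bow(doc, vocab):
    """Map a document to a 0/1 vector: 1 if the i-th vocabulary word occurs in it."""
    tokens = set(doc.lower().split())
    return [1 if word in tokens else 0 for word in vocab]
```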

We measured the precision, recall, and $f1$ score for the positive prediction results. The precision is calculated as the percentage of true-positives in all predicted positives. The classification recall is calculated as the percentage of true-positives in all labeled positives in the test data. As an indicator of general accuracy, the $f1$ score is calculated as $\frac{2\times\textit{precision}\times\textit{recall}}{\textit{precision}+\textit{recall}}$ .
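For concreteness, the three measures for the positive class can be computed as in the short Python sketch below. Returning 0 when a denominator is zero is a convention chosen for this sketch and is not specified in the text.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and f1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```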

4.2 Baseline methods

In the experiments, we test the proposed one-class naive Bayes (ONB) classifier with two different clustering quality measures, denoted ONB-ASW and ONB-LSN. They are compared with a number of baseline one-class classifiers, which are listed below with short explanations.

4.2.1 Prototype

Introduced in [19], this technique first finds a prototype for the positive examples, which is the average of vector values, then compares test data with the prototype by cosine similarity, and uses a threshold for judgment.

4.2.2 Nearest Neighbor (NN)

Similar to Prototype, this technique compares test data with positive examples by cosine similarity and uses a threshold for judgment [16]. The difference is that, instead of the Prototype, the distance between target data and the nearest instance in the positives is used.
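The Prototype and NN baselines can be illustrated with the Python sketch below. The threshold is a free parameter whose value is not given in the text, and treating the similarity to an all-zero vector as 0 is an assumption of this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity; defined as 0 when either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def prototype_predict(x, positives, threshold):
    """Compare x with the average of the positive examples."""
    prototype = [sum(col) / len(positives) for col in zip(*positives)]
    return 1 if cosine(x, prototype) >= threshold else 0

def nn_predict(x, positives, threshold):
    """Compare x with its most similar positive example."""
    return 1 if max(cosine(x, p) for p in positives) >= threshold else 0
```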

4.2.3 One-class SVM (OSVM)

Proposed by [22], OSVM is a modified version of the Support Vector Machine that supports one-class classification. Similar to a standard SVM, this model maps data points into another space through a kernel transformation, $k(\mathbf{x},\mathbf{y})=\langle\Phi(\mathbf{x}),\Phi(\mathbf{y})\rangle$ . For example, when using a Gaussian kernel, the kernel is $k(\mathbf{x},\mathbf{y})=\exp(-\|\mathbf{x}-\mathbf{y}\|^{2}/c)$ . Such a transformation captures non-linear relationships among the data. However, instead of using two classes to generate the decision boundary, the OSVM algorithm generates a decision boundary that has maximum margins for the positive examples. For a new data point $\mathbf{x}$ , the prediction is determined by which side of the hyperplane it falls on in feature space. For our evaluations, we use LIBSVM, which contains an implementation of OSVM [4].
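As a small illustration of the Gaussian kernel only (not of the OSVM training procedure or of LIBSVM), the kernel value for two vectors can be computed as follows. The width parameter `c` must be chosen by the user; no value is given in the text.

```python
import math

def gaussian_kernel(x, y, c):
    """k(x, y) = exp(-||x - y||^2 / c) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / c)
```

Identical vectors give a kernel value of 1, and the value decays towards 0 as the vectors move apart.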

4.2.4 Spy

The Spy technique introduced in [14] is a reliable negative (RN) based technique. It chooses reliable negatives from the unlabeled documents based on probabilities calculated using a naive Bayes approach. It uses an Expectation-Maximization (EM) algorithm combined with naive Bayes to find the probabilities, then selects a number of reliable negative documents. More specifically, a parameter $l$ is set such that $l$ % of the documents will be chosen as RNs. The default value for $l$ is 15. In the second step of the method, the RNs are combined with the existing positive examples to build a final classifier. Although in [14] EM is used again in the second step as the classifier builder, we instead use SVM in the second step because empirical studies show that Spy-SVM generally produces better classification results than Spy-EM.

4.2.5 Roc

This is another RN-based technique, introduced in [12]. It first considers the unlabeled data as negative, then uses a Rocchio classifier to find the prototypes of the positive and negative classes by the following formula:

$\displaystyle c_{j}=\alpha\frac{1}{|C_{j}|}\sum_{d\in C_{j}}\frac{d}{\|d\|}-\beta\frac{1}{|D-C_{j}|}\sum_{d\in D-C_{j}}\frac{d}{\|d\|}$

where $C_{j}$ is the positive training data, and $D$ is the collection of all documents. $\alpha$ and $\beta$ are parameters that adjust the relative impact of relevant and irrelevant training examples. The prototypes are used with cosine similarity to classify unlabeled data as positive or negative. Theoretically, this classification would have high precision (but low recall). In the second step, RNs and positive examples are combined to train a traditional SVM classifier. We use the implementation of Spy and Roc made available by the authors (https://www.cs.uic.edu/~liub/LPU/LPU-download.html).
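The Rocchio prototype formula can be sketched in Python as below. This is only an illustration of the formula and not the authors' released implementation; the values of $\alpha$ and $\beta$ are left to the caller because the text does not state them, and leaving all-zero documents unnormalized is an assumption of this sketch.

```python
import math

def normalize(d):
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in d))
    return [v / norm for v in d] if norm else list(d)

def mean_normalized(docs):
    """Average of the unit-length document vectors."""
    total = [0.0] * len(docs[0])
    for d in docs:
        for i, v in enumerate(normalize(d)):
            total[i] += v
    return [t / len(docs) for t in total]

def rocchio_prototype(in_class, out_class, alpha, beta):
    """c_j = alpha * mean(d/||d|| for d in C_j) - beta * mean(d/||d|| for d in D - C_j)."""
    inside = mean_normalized(in_class)
    outside = mean_normalized(out_class)
    return [alpha * a - beta * b for a, b in zip(inside, outside)]
```

Calling the function twice with the roles of the two document sets swapped yields the positive and the negative prototype.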

4.2.6 aUPNB

Introduced in [3], this technique uses an alternative definition of negative feature probability by applying Laplace correction and using an increment to avoid zero value in the summation component of negative feature probability. More specifically, the negative feature probability is defined as:

$\displaystyle P(x|0)=\frac{1+\max(\alpha,R(x))\frac{1}{Z}}{|V|+|D|(1-p(c=1))}$ (4)

A significant drawback of this approach is that it involves an additional parameter $\alpha$ , although empirical results show that with the right value, this addition can help improve accuracy. In the experiment, we use the default value 0.1 for $\alpha$ as suggested in [3]. The paper also suggests a method to find $p(c=1)$ by separating the positive examples into training and validation data, and capturing the best $p(c=1)$ with regard to a score function. In the paper, the score function $\textit{recall}\times\textit{recall}/P(f(x)=1)$ is directly equal to recall when $P(f(x)=1)$ is tested on the validation set that contains only positive examples. In our implementation, we use all positive examples as the basis for calculating $P(f(x)=1)$ to produce more meaningful results. We use the suggested training-validation separation ratio of 5:2.

4.3 Results and discussion

The evaluations of different one-class classifiers on three datasets are shown in Table 2, in which we also list the average precision and f1 score results. The best result in each measurement is highlighted in bold font. The results of the proposed methods (ONB-ASW and ONB-LSN) are shown in the last two columns.

Table 2
Classification accuracy of one-class classifiers

	Prototype	NN	OSVM	Spy	Roc	aUPNB	ONB-ASW	ONB-LSN
Shooting dataset
Prec	0.316	0.318	0.213	0.574	0.822	0.445	0.252	0.532
Recall	0.688	0.696	0.358	0.463	0.236	0.449	0.768	0.558
f1	0.433	0.436	0.268	0.512	0.366	0.447	0.379	0.545
Crisis dataset
Prec	0.067	0.057	0.098	0.25	0.237	0.154	0.107	0.222
Recall	0.331	0.282	0.739	0.120	0.080	0.130	0.497	0.387
f1	0.111	0.095	0.173	0.162	0.119	0.141	0.176	0.282
Sentiment dataset
Prec	0.453	0.502	0.442	0.485	0.680	0.439	0.461	0.472
Recall	0.547	0.605	0.654	0.688	0.120	0.656	0.757	0.772
f1	0.496	0.548	0.527	0.569	0.219	0.526	0.573	0.586
Average
Prec	0.278	0.292	0.251	0.436	0.579	0.346	0.273	0.408
f1	0.346	0.359	0.322	0.414	0.234	0.371	0.376	0.471

From the results we can see that the proposed ONB-LSN method achieves the highest overall accuracy, having the highest f1 score for all three datasets. But we also notice that there is different strength for different classifiers, some of which are better at getting high precisions, while some are better at getting high recalls. For example, the Roc classifier is shown to have better precision, having the highest precision for the shooting and sentiment dataset, and the second highest for the crisis dataset. However, the Roc classifier has very poor recalls, having the lowest recalls for all datasets, which leads to low f1 scores.

The methods that tend to achieve higher recall than precision include Prototype, NN, OSVM, and ONB-ASW. OSVM, for example, achieves the highest recall of 0.739 on the crisis dataset, but its very low precision prevents it from reaching a high f1 score. This indicates that these methods are more lenient when identifying positive instances, which admits more false positives. Indeed, a trivial classifier that accepts all data as positive attains the highest possible recall of 1, but does not attain a high f1 score because of its poor precision. The proposed ONB-LSN, however, achieves reasonable precision while maintaining high recall, which gives it the highest f1 score among all classifiers. The aUPNB method has a precision-recall balance similar to that of ONB-LSN, possibly because of their similar implementations. However, ONB-LSN achieves better precision and recall than aUPNB on all datasets.

We also notice that the Spy method stands out among the classifiers. While it achieves neither the best precision nor the best recall, it obtains relatively high precision on all datasets while keeping above-average recall. Its performance also varies relatively little across datasets, possibly because of the robustness provided by the EM algorithm. As a result, it achieves the second highest average f1 score among the classifiers, lower only than that of the proposed ONB-LSN. Lastly, we notice that the Prototype and NN methods are surprisingly accurate considering their simplicity. While worse than more sophisticated methods such as Spy and the proposed ONB-LSN, they perform better than OSVM according to the average f1 scores.

5. Interpretation of $p(c=1)$

In the previous section, we showed empirically that the one-class naive Bayes classifier estimated using the proposed LSN measure can achieve better classification results than state-of-the-art PU learning methods. In this section, we offer further insight into such a classifier. Specifically, we discuss the interpretation of the background positive probability $p(c=1)$, the focus of this paper. Essentially, this value indicates the prevalence of positive examples in the target data. In theory, a one-class naive Bayes classifier achieves its best performance when this value correctly reflects the proportion of positive instances in the target data. But is this true in practice? We investigate this question by running the classifier with different $p(c=1)$ configurations on the three datasets from the previous section and measuring the precision, recall, and f1 score directly. The results are shown in Fig. 1.

Figure 1.

Precision, recall, and f1 score achieved by one-class naive Bayes classifier with different $p(c=1)$ for three datasets.

More specifically, the highest f1 scores for the shooting, crisis, and sentiment datasets are 0.554, 0.304, and 0.587, obtained when $p(c=1)$ is 0.18, 0.10, and 0.73, respectively. In contrast, the true proportions of the positive class in the three datasets are 0.23, 0.10, and 0.41, respectively. Except for the crisis dataset, the numbers do not match exactly. There are several possible reasons. One is noise in the dataset, which causes a wrong model to be trained, or wrong predictions to be made when the model is applied. In this case, when some positive instances are wrongly labeled as negative, the optimal $p(c=1)$ will be higher than the true proportion of positive instances, and vice versa.

Nevertheless, we see a positive correlation between the optimal $p(c=1)$ and the true proportion of the positive class in the data. Another way to look at $p(c=1)$ is as a parameter that controls the false positive rate. In Fig. 1, all three datasets show a similar trend: as $p(c=1)$ increases, the recall generally decreases, and the precision increases before reaching a threshold. This indicates that both predicted positives and false positives become fewer at a higher $p(c=1)$. In other words, the classifier becomes stricter in assigning the positive class as $p(c=1)$ increases. In practice, we can use this property to tighten or relax the classifier as needed by changing $p(c=1)$. Note, however, that the accuracy of a one-class naive Bayes classifier has an upper bound (and a lower bound) that cannot be surpassed by changing $p(c=1)$. This is a limitation of the naive Bayes classifier itself. It is therefore no surprise that our ONB-LSN method achieves only limited accuracy, as discussed in the previous section, even though it captures a near-optimal $p(c=1)$.
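The kind of sweep behind Fig. 1 can be sketched as follows, assuming ground-truth labels are available for the target data. Everything named here is hypothetical: `sweep_prior` and `precision_recall_f1` are illustrative helpers, and `toy_classify` is a stand-in that simply thresholds made-up scores at the candidate value so that it becomes stricter as the value rises. It is not the one-class naive Bayes classifier of this paper, which would be passed in as the `classify` argument instead.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int],
                        y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall, and f1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def sweep_prior(classify: Callable[[float], List[int]],
                y_true: Sequence[int],
                priors: Sequence[float]) -> List[Dict[str, float]]:
    """Evaluate classify(prior) for each candidate value of p(c=1)."""
    rows = []
    for prior in priors:
        precision, recall, f1 = precision_recall_f1(y_true, classify(prior))
        rows.append({"prior": prior, "precision": precision,
                     "recall": recall, "f1": f1})
    return rows


# Made-up scores and labels, for illustration only.
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]


def toy_classify(prior: float) -> List[int]:
    # Stand-in classifier: accepts fewer instances as the value increases.
    return [1 if s >= prior else 0 for s in toy_scores]


rows = sweep_prior(toy_classify, toy_labels, [0.2, 0.5, 0.85])
best = max(rows, key=lambda r: r["f1"])
for row in rows:
    print(row)
```

On the toy data, recall falls and precision rises as the candidate value increases, and the best f1 is found at one point of the grid. Such a sweep needs true labels, so it serves only for analysis and not for selecting $p(c=1)$ in the one-class setting.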

6. Conclusion

In this paper, we propose a method for building a one-class naive Bayes classifier given positive and unlabeled data. After reviewing existing one-class naive Bayes approaches, we identify the background positive probability $p(c=1)$ as the key component in building such a classifier. However, it has been shown that this information cannot be obtained from positive and unlabeled data. In contrast to existing works, we propose to find the optimal value for $p(c=1)$ based on the clustering quality of the classification results. In particular, we use ASW as a quality score, together with a new measure named LSN. Experimental results with real-world text data show that our approach can perform better than state-of-the-art one-class classification and PU learning methods, and in some cases improves the classification accuracy by more than 10%.

However, our work has limitations. Although we propose a method for finding the optimal parameter, the performance of the classifier has an upper bound (and a lower bound). In other words, even if we reach the exact optimal value of the configurable parameter, we can still obtain only limited classification accuracy. This is a limitation of the naive Bayes classifier itself, and future research on one-class naive Bayes classifiers will need to address it. Another interesting future direction is to add online processing capability to the current method.

References

1. Alan and Gary, Perception, attribution, and judgment of others, Organizational Behaviour: Understanding and Managing Life at Work 7 (2011).

2. Bekker and Davis, Estimating the class prior in positive and unlabeled data through decision tree induction, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

3. Calvo, Larrañaga and Lozano J.A., Learning bayesian classifiers from positive and unlabeled examples, Pattern Recognition Letters 28(16) (2007), 2375–2384.

4. Chang C.-C. and Lin C.-J., LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (2011), 27:1–27:27.

5. Denis, Laurent, Gilleron and Tommasi, Text classification and co-training from positive and unlabeled examples, in: Proceedings of the ICML 2003 Workshop: The Continuum from Labeled to Unlabeled Data, 2003, pp. 80–87.

6. Dredze, Paul M.J., Bergsma and Tran, Carmen: A Twitter geolocation system with applications to public health, in: AAAI Workshop on Expanding the Boundaries of Health Informatics Using AI, 2013, pp. 20–24.

7. Elkan and Noto, Learning classifiers from only positive and unlabeled data, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2008, pp. 213–220.

8. Zhang and Wang, Naive bayes classifier for positive unlabeled learning with uncertainty, in: Proceedings of the 2010 SIAM International Conference on Data Mining, SIAM, 2010, pp. 361–372.

9. Khan S.S. and Madden M.G., One-class classification: Taxonomy of study and review of techniques, The Knowledge Engineering Review 29(3) (2014), 345–374.

10. Kotzias, Denil, De Freitas and Smyth, From group to individual labels using deep features, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015, pp. 597–606.

11. Lei K.H., Khadiwala and Chang K.-C., TEDAS: A Twitter-based event detection and analysis system, in: Proceedings of 28th International Conference on Data Engineering, 2012, pp. 1273–1276.

12. and Liu, Learning to classify texts using positive and unlabeled data, in: Proceedings of the 18th International Joint Conference on Artificial Intelligence, Vol. 3, 2003, pp. 587–592.

13. Liu, Dai, Lee W.S. and Yu P.S., Building text classifiers using positive and unlabeled examples, in: Proceedings of the Third IEEE International Conference on Data Mining, IEEE, 2003, pp. 179–186.

14. Liu, Lee W.S., P.S. and Li, Partially supervised classification of text documents, in: Proceedings of the Nineteenth International Conference on Machine Learning, 2002, pp. 387–394.

15. Maddock, Starbird, Al-Hassani, Sandoval D.E., Orand and Mason R.M., Characterizing online rumoring behavior using multi-dimensional signatures, in: Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, 2015, pp. 228–241.

16. Manevitz and Yousef, One-class document classification via neural networks, Neurocomputing 70(7) (2007), 1466–1481.

17. Manevitz L.M. and Yousef, One-class svms for document classification, Journal of Machine Learning Research 2(Dec) (2001), 139–154.

18. Olteanu, Castillo, Diaz and Vieweg, CrisisLex: A lexicon for collecting and filtering microblogged communications in crises, in: Proceedings of the 8th International AAAI Conference on Weblogs and Social Media, 2014, pp. 376–385.

19. Pazzani and Billsus, Learning and revising user profiles: The identification of interesting web sites, Machine Learning 27(3) (1997), 313–331.

20. Rousseeuw P.J., Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987), 53–65.

21. Sakaki, Okazaki and Matsuo, Earthquake shakes Twitter users: Real-time event detection by social sensors, in: Proceedings of the 19th International World Wide Web Conference, 2010, pp. 851–860.

22. Schölkopf, Platt J.C., Shawe-Taylor, Smola A.J. and Williamson R.C., Estimating the support of a high-dimensional distribution, Neural Computation 13(7) (2001), 1443–1471.

23. Sriram, Fuhry, Demir, Ferhatosmanoglu and Demirbas, Short text classification in Twitter to improve information filtering, in: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2010, pp. 841–842.

24. Tumasjan, Sprenger T.O., Sandner P.G. and Welpe I.M., Predicting elections with Twitter: What 140 characters reveal about political sentiment, in: Proceedings of the Fourth International Conference on Weblogs and Social Media, 2010, pp. 178–185.

25. Wang and Stolfo S.J., One-class training for masquerade detection, in: Workshop on Data Mining for Computer Security, 2003, pp. 10–19.

26. Zhang, Szabo and Sheng Q.Z., Improved object and event monitoring on twitter through lexical analysis and user profiling, in: Proceedings of the 17th International Conference on Web Information System Engineering, 2016, pp. 19–34.

27. Zhang, Szabo, Sheng Q.Z. and Fang X.S., Snaf: Observation filtering and location inference for event monitoring on twitter, World Wide Web 21(2) (2018), 311–343.

28. Zhang, Szabo, Sheng Q.Z., Zhang W.E. and Qin, Identifying domains and concepts in short texts via partial taxonomy and unlabeled data, in: Proceedings of the 29th International Conference on Advanced Information Systems Engineering, 2017, pp. 127–143.


1 http://crisislex.org/.


3 https://www.cs.uic.edu/~liub/LPU/LPU-download.html.
