Abstract
Nowadays more and more information extraction projects need to classify large amounts of text data. The common way to classify text is to build a supervised classifier trained on human-labeled positive and negative examples. In many cases, however, it is easy to label positive examples, but hard to label negative examples. In this paper, we address the problem of building a one-class classifier when only the positive examples are labeled. Previous works on building one-class classifier mostly use positive examples and unlabeled data. In this paper, we show that a configurable one-class classifier such as one-class naive Bayes can be optimized by examining the clustering quality of the classification on target data. We propose to use existing and new quality scores for determining clustering quality of the classification. Experimental analysis with real-world data show that our approach generally achieves high classification accuracy, and in some cases improves the accuracy by more than 10% compared to state-of-art baselines.
Introduction
Information systems based on text data commonly need to have a text classification component. We have seen such information systems built around microblog text data, that predict earthquake movements [21] and influenza [6], locate crime incidents [27], analyze public sentiments in elections [24], and detect rumors [15]. In these information systems, classification is required since only a small portion of the text data are desirable. For instance, when building a system for detecting crime and disaster events, the authors find that only 0.05% tweets in all collected tweets are related to the application [11].
A common way to classify text is to build a supervised classifier that is trained on some labeled examples, which usually include positive and negative examples [23]. However, in particular applications, it is often easier for human annotators to label positive examples than to label negative ones. According to psychology theories, people tend to see what they want to see, based on their experience, motivation, and emotional state [1]. In an event monitoring project, for example, a researcher may already have some experiences with the event-related text data, and is interested in looking for more. However, she may not be interested in identifying unrelated data, which is tedious and unexciting for her. As a result, determining positive class can faster as she knows what to look for in the data, while determining negative class can be difficult and slow. This is a reality already recognized by a number of researches [14, 3]. The question is whether we can build a supervised classifier using only positive examples.
A number of studies propose to build one-class classifiers, in other words, classifiers trained only with examples from the positive class [9]. These works can be divided into two groups, one uses only positive examples [17, 25], and the other uses positive examples and unlabeled data as a form of semi-supervised learning [12, 7]. An example of the latter is One-class naive Bayes classifier, which in previous works has been built based on positive and unlabeled data [3]. Naive Bayes is a supervised classification technique that has proven effectiveness in text classification. At the core of this technique is the independent feature assumption, with which the probability of the text belonging to a class can be expressed as the joint probability of its single feature probabilities given the class. As already shown in previous work, a one-class naive Bayes can be built with unlabeled data and a user-supplied parameter [3]. This user-supplied parameter, called background positive probability, however, is depending on the target data to be classified, and cannot be estimated easily.
The background positive probability is an indication of the ratio of positive instances in the target data, and is a missing piece in existing one class classifiers. In order to estimate this parameter, we need to look at classified results involving the target data. Assuming that positive instances are more similar to each other than to negative instances, our method is based on clustering quality, which measures how well two classes are separated. Our hypothesis is as follows: given a number of parameter values of a one-class classifier, we measure the clustering quality of classification, and the parameter value that provides the best clustering quality should also provide the best classification quality. We use existing measures of clustering quality [20], as well as a new measure called largest separation of nearest instance (LSN). To summarize, we make the following contributions with this paper:
We review the technique for building one-class naive Bayes classifiers and make an extension by considering clustering quality of the classification on target data. We identify that the configurable parameter, i.e. the background positive probability can be optimized according to the clustering quality. Importantly, this optimization eliminates the need of manually setting the parameter as required by previous works, and does not require extra information or labeling. We conduct an extensive evaluation for the effectiveness of our approach, comparing it to a wide range of baseline methods. Tested on publicly available datasets, we find that the classifier we build generally achieves high classification accuracy, and in some cases improves the accuracy by more than 10% compared to state-of-art baselines.
The remainder of this paper is organized as follows: in Section 2, we discuss related works on one-class classification. Section 3 reviews one-class naive Bayes classifier and presents our method for parameter estimation. In Section 4, we present our experimental evaluation and discuss the results. In Section 5, we provide some insights we gain through this study regarding the background positive probability. Finally, Section 6 concludes this paper.
Early works usually treat one-class classification problem as an anomaly detection problem. The essential technique is to measure the distance between the document to be classified and the positive examples, and use a threshold for determining results. For instance, Pazzani and Billsus propose to use the average of the positive examples, which is called Prototype, as the basis for measuring distance [19], and then apply a threshold that provides class judgment. A similar approach based on Nearest Neighbor (NN) is introduced by Manevitz and Yousef [16], for which the distance is measured between the test document and the nearest document in the positive examples. An interesting technique for positive-only classification, called One-class SVM (OSVM), is now the benchmark method for one-class classification [22]. Proposed by Schölkopf et al. [22], OSVM is a modified version of Support Vector Machine (SVM) that builds a decision boundary based on positive example only. Generally, because these techniques use only the positive examples, their performance is worse than techniques that incorporate both the positives and unlabeled data [13].
There is a series of studies called PU-learning, which aims to build classifiers with positive and unlabeled data. The techniques introduced in these studies are quite similar to each other, as they use a two-step approach. First, using some heuristics they discover reliable negative data from the unlabeled data. Then using the positive data and reliable negative data they build a traditional classifier, usually SVM [13]. Liu et al. propose a technique call Spy-EM [14]. This method first selects spy, a portion of positive data, to add to the unlabeled data. Then it uses Expectation Maximization (EM) to select a probable classifier, which can be used to select a probability threshold that leads to reliable negatives. This technique is based on the assumption that spies will behave similarly to the unknown positives in the unlabeled data to let the classifier learn the positive data characteristics. Li et al. [12] propose another technique called Roc-SVM, and the main difference between this and the previous technique is that it selects reliable negatives from unlabeled data using Rocchio classifier, which tends to provide lower false-negative rates. The assumption common to this series of techniques in the context of document classification is that the proportion of positives in the unlabeled data is small, and the negatives fall into diverse topics, which is generally true for topic-based document classifications. However, it is unclear if these techniques can achieve same effectiveness in other type of classification applications.
There is a series of existing studies on building one-class Naive Bayes classifier with positive and unlabeled data. Early techniques such as the one proposed by Wang and Stolfo use threshold that sets a boundary between positive and negative predictions [25]. Denis et al. first show that when using naive Bayes classifier in document classification, the word probability for the negative class can be derived from positive examples and word counts in the unlabeled data [5]. They call this technique Positive Naive Bayes (PNB). However, the authors do not provide the solution for estimating the necessary component of the background positive probability. Calvo et al. show that for feature vectors with categorical values, the negative feature probability can be transformed using Laplace correction to remove below-zero probability [3]. They go on to show that the PNB can be extended to more complex networks. However, the positive instance probability remains unknown. He et al. extend the PNB model by incorporating uncertainty, which they call Uncertain Positive Naive Bayes (UPNB), basically using probability cardinalities to replace feature counts [8]. They show that the negative feature probability is a parameter that needs to be adjusted, and propose a method to find its optimal value by using part of the positive examples as validation. Recently, an attempt has been made by Bekker and Davis to establish the prior positive and negative probability from discrete training data [2]. The technique they use based on the label frequency estimation, however, can only estimate the upper and lower bonds of the required probability. Based on existing studies, it is clear that the negative feature probability, which is required to build one-class naive Bayes classifier, cannot be effectively established using only positive training examples and unlabeled data.
Optimized one-class naive Bayes classifier
We will first discuss one-class naive Bayes classifier in the context of text classification, particularly, the classification of short messages such as tweets. Because our focus is on building the classifier rather than designing effective features, we will use, without loss of generality, the most basic feature repre- sentation for text messages, namely, bag-of-word (BOW) binary representation. Specifically, given a vocabulary
Naive Bayes classifier
The goal of a text classifier is to find the probability of a document belonging to a class given data
A naive Bayes classifier is built on the “naive” conditional independence assumption, such that each
If we can obtain each
where
To make a prediction of the class, the probability is calculated for all classes and the one with the highest probability is chosen as the predicted class:
We can see that if we only have one class (e.g., 1) in the training data,
In the next subsection, we will show that with two pieces of auxiliary information we can devise a naive Bayes classifier with only one class of training examples.
In one-class classification, the problem becomes finding the probability of positive and negative classes for a document,
Nowadays it is easy to get a large number of unlabeled data from online platforms such as microblog services. Assume we have a set of unlabeled data. From such data we can estimate the overall probability of a feature
On the other hand, we can express
Re-organizing this equation will give
This negative feature probability
Previous studies show that the background positive probability
The existing one-class classification techniques mostly use only the positive and unlabeled data to build the model. In this paper, we aim to optimize the parameter
The process for this optimization includes three steps, namely, setting the parameter, classification, and obtaining cluster quality score. First we set a number of
There are many statistical measures for the clustering quality can be considered. A popular choice is called average silhouette width (ASW). Introduced in [20], this measure can be adapted to measuring the clustering quality of one class while ignoring other classes, which suits our scenario in which only the positive class is considered. Applying ASW to our one-class scenario, a silhouette is calculated for each document in the positive class:
where
We also propose a new measure called largest separation of nearest instance (LSN), which is calculated as the distance between the nearest pair of positive and negative instances. The rationale behind this measure is that, because text data generally have high dimensionalities, given positive and negative clusters, it is not the overall difference, but the difference of two most similar instances from each class that shows cluster separation. Algorithmically, we achieve this by first calculate distances between all pairs of positive and negative classes, and pick the smallest value
Overall algorithm
We present the algorithms for training and estimating the optimal one-class naive Bayes classifier in this section. As discussed in the previous section, a naive Bayes classifier can be described in four components
We have discovered that it is unnecessary to retrain a classifier each time when a new
Finally, Algorithm 3 shows how to estimate an optimal
Evaluation
To test the effectiveness of our approach, we conduct experiments on three real-world text datasets previously used in other studies, and compare against several baseline approaches, including the state-of-art PU learning methods. In this section, we discuss the used datasets, experiment setup, and baseline approaches, before presenting the results.
Experimental setup and evaluation matrix
We conduct experiments on three real-world text datasets used in different applications. The first dataset, called the shooting dataset, is a collection of tweets containing the keyword shooting. The collection contains labeled and unlabeled tweets. The labeled tweets have been divided into four domains, namely, crime, imaging, game, and metaphor. In a previous work, the task has been classifying crime-related tweets from these tweets [28]. In this work, we also consider the crime-related tweets as our target. The positive examples are thus taken from the crime-related tweets.
The second dataset is called the crisis dataset and is a publicly available dataset1
The third dataset is called the sentiment dataset, and is also publicly available.2
The number of tweets in the labeled and unlabeled data for three datasets is shown in Table 1. For all three datasets, we use a portion of positive examples in the labeled tweets as the training data for our one-class classifiers. Please note that there are no missing values in the datasets.
Dataset statistics
For all three datasets, the tweets are transformed into vectors using BOW-binary representation. Words that appear more than 10 times in the labeled dataset are selected to be included in the BOW vocabulary. For the shooting dataset, the vocabulary size is 159. For the crisis dataset, the vocabulary size is 583. For the sentiment dataset, the vocabulary size is 227.
We measured the precision, recall, and
In the experiments, we test proposed one-class naive Bayes (ONB) classifier with two different clustering quality measures, ONB-ASW and ONB-LSN. They are compared with a number of baseline one-class classifiers, which are listed below with short explanations.
Prototype
Introduced in [19], this technique first finds a prototype for the positive examples, which is the average of vector values, then compares test data with the prototype by cosine similarity, and uses a threshold for judgment.
Nearest Neighbor (NN)
Similar to Prototype, this technique compares test data with positive examples by cosine similarity and uses a threshold for judgment [16]. The difference is that, instead of the Prototype, the distance between target data and the nearest instance in the positives is used.
One-class SVM (OSVM)
Proposed by [22], OSVM is a modified version of Support Vector Machine that supports one-class classification. Similar to a standard SVM, this model maps data points into another space through kernel transformation,
Spy
The Spy technique introduced in [14] is a reliable negative (RN) based technique. This technique chooses reliable negatives from unlabeled documents based on the probability calculated using naive Bayes approaches. It uses an Expectation-Maximization (EM) algorithm combined with naive Bayes to find the probabilities, then selects a number of unreliable documents. More specifically, a parameter
Roc
This is another a RN based technique introduced in [12]. It first considers unlabeled data as negative, then uses Rocchio classifier to find the prototype of positive and negative classes by the following formula:
where
Introduced in [3], this technique uses an alternative definition of negative feature probability by applying Laplace correction and using an increment to avoid zero value in the summation component of negative feature probability. More specifically, the negative feature probability is defined as:
A significant drawback of this approach is that it involves an additional parameter
The evaluations of different one-class classifiers on three datasets are shown in Table 2, in which we also list the average precision and f1 score results. The best result in each measurement is highlighted in bold font. The results of the proposed methods (ONB-ASW and ONB-LSN) are shown in the last two columns.
Classification accuracy of one-class classifiers
Classification accuracy of one-class classifiers
From the results we can see that the proposed ONB-LSN method achieves the highest overall accuracy, having the highest f1 score for all three datasets. But we also notice that there is different strength for different classifiers, some of which are better at getting high precisions, while some are better at getting high recalls. For example, the Roc classifier is shown to have better precision, having the highest precision for the shooting and sentiment dataset, and the second highest for the crisis dataset. However, the Roc classifier has very poor recalls, having the lowest recalls for all datasets, which leads to low f1 scores.
The methods that tend to get higher recall than precision include Prototype, NN, OSVM, and ONB-ASW. OSVM for example, achieves the highest recall of 0.739 for the crisis dataset. However, it has a very low precision that prevents it from reaching a high f1 score. It is indicated that these methods are more lenient when identifying positive instances, which causes more false positives to be included. Indeed we can make a simple classifier that accepts all data as positive and we will have the highest recall of 1, but we will not have a high f1 score due to poor precision. The proposed ONB-LSN however, achieves reasonable precision while maintaining a high recall, which allows it to have the highest f1 score among all classifiers. The aUPNB method has a similar precision and recall balance like ONB-LSN, possibly due to their similar implementation. However, ONB-LSN achieves better precisions and recalls for all datasets than aUPNB.
We also notice that the SPY method is an interesting one among classifiers. While it achieves neither the best precision nor the best recall, it gets a relatively high precision for all datasets while keeping an over-average recall. It also has relatively low performance impact when given different datasets. This is possibly due to the robustness provided by the EM algorithm. At the end it achieves the second highest average f1 score among classifiers, only lower than the proposed ONB-LSN. Lastly we notice that the Prototype and NN methods have surprising accuracies, considering their simplicity. While worse than more sophisticated methods such as Spy and the proposed ONB-LSN, they manage to perform better than OSVM, according to the average f1 scores.
In the previous section, we have shown empirically that the one-class naive Bayes classifier estimated using the proposed LSN measure can achieve better classification results than state-of-art PU learning methods. In this section, we offer some more insights into such a classifier. Specifically, we would like to discuss the interpretation of the background positive probability
Precision, recall, and f1 score achieved by one-class naive Bayes classifier with different 
More specifically, the highest f1 score for the shooting, crisis, and sentiment datasets are 0.554, 0.304, and 0.587, when
Nevertheless, we see positive correlations between
In this paper we propose a method to build a one-class naive Bayes classifier given positive and unlabeled data. After reviewing existing one-class naive Bayes approaches, we identify that the background positive probability
However, our work has limitations. Although we propose a method for finding the optimal parameter, the performance of the classifier already has an upper-bound (and a lower-bound). In other words, even if we reach the exact optimal value for the configurable parameter, we still can only get limited classification accuracy. This is the limit of the naive Bayes classifier itself. Future researches for one-class naive Bayes classifier will need to deal with this limitation. Another interesting future direction could be considering online processing capability of current method.
