Abstract
In order to improve the classification speed without sacrificing the email classification accuracy seriously, a novel online spam classification method is proposed. Firstly, the conceptions of term frequency based interest sets are introduced, and emails are classified by combining term frequency based interest sets and Naïve Bayes classifier. Secondly, based on the active learning theory, a novel boundary density based email classification certainty evaluating method is proposed to select and recommend emails to users for labeling by combining the user interests. Finally, the emails which are labeled and classified with the greatest possibilities are used for retraining based on the incremental learning theory. In the experiments, Support Vector Machine (SVM), Naïve Bayesian (NB) and K-Nearest Neighbors (KNN) classifiers are used on two corpuses: Trec2007 and Enron-spam. Comparing with six typical active learning based incremental learning methods, the proposed method greatly reduces the consuming time of email classification while guaranteeing the accuracy. Moreover, the proposed method brings very small sample labeling burden to the users, proving its high value on online application.
Keywords
Introduction
Online spam classification is normally defined as an incremental supervised binary streaming text classification task [1]. In this process, emails are classified by a classifier in their chronological sequence, and the classifier will update itself incrementally according to the feedback of the user. Online spam classification requires statistical text classification methods which are really space-time-efficient [2]. There are numerous sophisticated algorithms that have been applied to text categorization, for example, Naïve Bayes classifier (NB) [3], Support Vector Machine (SVM) [4], K-Nearest Neighbor (KNN) [5], Decision tree [6],Rocchio [7], etc. As one of these successful methods, Naïve Bayes is popular in text classification due to its computational efficiency and relatively good predictive performance [8].
Incremental learning and active learning theories have been widely used in online spam classification [9]. The goal of incremental learning is to select a subset of training samples while preserving the performance as using all the training samples [10]. Batch SVM is a typical incremental learning method which is proposed by Syed [11]. This method simply merged the incremental samples and support vectors (SV) of the training set together to form a new training set. The sample training speed is fast, but the spam recognition accuracy is not very high. Wu et al. improved Batch SVM, and combined Karush-Kuhn-Tucker (KKT) conditions to add new samples into the training set and remove redundant samples [12]. Samples are trained and classified fast, but the users do not participate in the process of sample classification, thus, the classification results are often inconsistent with the actual judgments of the users. In the increamental learning process, active learning is used to select the most useful sample for labeling and add the labeled sample for retraining. Active learning mainly contains the following types: membership query synthesis, stream based active learning and pool based active learning. Comparing with stream based active learning and pool based active learning, it is quite awkward to label arbitrary samples in membership query synthesis [13]. Moreover, pool based active learning appears to be much more common among online spam classification applications than stream based active learning, for the latter is inefficient in dealing with online emails as it scans through the data sequentially and makes query decisions individually. In pool based active learning, uncertainty sampling [8] is perhaps the simplest and most commonly used query framework. Typical uncertainty sampling methods include least confident (LC) [13], margin sampling (MS) [13], entropy sampling (ES) [13], centroid sampling (CS) [14], etc. Amayri compared the traditional SVM based online spam classification methods and the improved online spam classification methods which combined the active learning method, and deduced that the latter can significantly improve the spam recognition performance [15]. Sculley combined SVM and explored several reasonable approaches to determine when to request labels for new samples, and proved that online active learning is the most appropriate form of active learning for spam filtering [16]. Cheng presented an improved incremental learning and active learning algorithm for SVMs. The k-means clustering algorithm was used to collect the initial training samples, and a weight factor was assigned to each sample according to its distance to the current separating hyperplane [10]. For reducing the sample labeling burden of the active learning process, Ienco et al. combined the density and prediction uncertainty of sample stream and proposed a new active learning method, which allowed focusing labelling efforts in the instance space where more samples are concentrated [17]. Moreover, Georgala introduced a spam filtering method by combining the incremental clustering based active learning method. Experimental results showed that the method performed very well, verifying the efficiency of the proposed method on spam filtering [18].
Although the methods, which combine the incremental learning methods and the active learning methods, are widely used in the field of spam classification, but there are still some problems [19, 20]: (1) many methods use SVM for email training and classification. However, SVM is very time consuming in dealing with online emails; (2) in traditional uncertainty sampling methods (such as MS, ES and et al.), there exist some problems when different classifiers are used. For example, MS ignores the effect of the inner samples in different categories when SVM is used; the calculation of posterior probabilities in MS and ES are based on the independence assumption, which is always supposed to be over-simplistic, when NB is used; MS ignores the sample densities of different categories when KNN is used; (3) in the sample labeling process of the traditional active methods, the user’s interests on the contents of the recommended emails are not considered, deducing that all samples must be labeled, causing large burden to the experts [21].
For solving the above problems, a quick online spam classification method is proposed in this paper. The originalities of the proposed method are given as follows: (1) in order to highlight the spam classification speed, the conceptions of term frequency based positive (negative) interest sets are introduced. Moreover, NB classifier is combined for email classification; (2) for correcting the shortcomings of the traditional uncertainty sampling methods, a novel boundary density based sample classification certainty evaluating method is proposed; (3) for reducing the labeling burden of sample labeling process, a novel user interest based sample labeling model is proposed. The remainder of this paper is organized as follows. Section 2 reviews the algorithms of Naïve Bayesian and active learning. Section 3 descripts the proposed method. Section 4 shows experimental results and analysis. Section 5 concludes the whole paper.
Algorithms of Naïve Bayesian and active learning
Naïve Bayesian
The Naïve Bayesian algorithm derives from Bayesian Decision Theory, and has become the simplest and mostwidely used classifier in text classfication filed. Given the categories C = {c
i
} (i = 0, 1, …, N - 1), and a sample with vector X = {x
j
}(j = 0, 1, …, M), the probability that X belongs to category c
i
is calculated by formula (1) [13]:
Although the independence assumption is always supposed to be over-simplistic, studies have proven that NB is significately effective in dealing with discrete problems in text classification filed [22]. Moreover, NB appeared to be the best in terms of computational complexity among flexible Bayes, linear SVM and LogitBoost [8]. Hence, NB is used for sample training and classification in this paper.
Method description
Figure 1 gives the flowchart of the proposed method, and more details are given as follows.
Feature selection and sample training
The attachments, headers, stop-words and labels of each email in the training set A i and the incremental set S i are removed. Moreover, Porter Stemming algorithm is used for Lemmatization [24]. In order to keep the executing speed and the accuracy of feature selection methods, the feature selection based on comprehensive measurement both in inter-category and intra-category (CMFS) [25] is used to select features from A i . The selected features are sorted by the CMFS values in descending order, and the first n f features are selected and denoted as S f . For the simplicity, linear computational complexity, and high accuracy of NB classifier, it is used for sample training and classification. McCallum and Nigam indicated that multinomial model can generate higher accuracy than multivariate Bernoulli model, thus multinomial model is used in NB classifier [26].
Sample classification
Given the training set A
i
, the incremental set S
i
(i = 1, 2, …, n
s
, n
s
is the number of the incremental sample sets), and the sample s
ij
of S
i
, some definitions are given as follows: content keyword: the terms, which both occur in S
f
and s
ij
are called the content keywords of s
ij
, and denoted as {ct
ijk
} (0 ≤ k < m, , m ≥ 0), where exists: if 0 ≤ p ≠ q ≤ m - 1, then ct
ijp
≠ ct
ijq
. term frequency based positive interest (TFPI): If s
ij
is identified as a legitimate email (ham) by a user, then each content keyword ct
ijk
of s
ij
is called the term frequency based positive interest of the user. term frequency based negative interest (TFNI): If s
ij
is identified as an unsolicited email (spam) by a user, then each content keyword ct
ijk
of s
ij
is called the term frequency based negative interest of the user.
With respect to a user, all the TFPIs are denoted as I P , and all the TFNIs are denoted as I N . For each content keyword in I P or I N , an attribute, which is called positive occurring frequency (POF) or negative occurring frequency (NOF), is introduced to denote the number of the times that the content keyword repeatedly occurs in the email which has been classified as ham or spam.
Where, m denotes the number of the content keywords in s ij , NOF (ct ijk ) denotes the number of the times that the content keyword ct ijk repeatedly occurs in I N ; POF (ct ijk ) denotes the number of the times that ct ijk repeatedly occurs in I P ; CMFS(ct ijk ) denotes the CMFS value of term ct ijk obtained by CMFS feature selection method.
Active learning process
The traditional classifiers make great efforts to reduce the classification errors [27, 28]. Active learning, which is also called inquiry learning or optimal experimental design, is an important technology of machine learning [13]. Active learning recommends the samples with the richest information to users for labeling, and realizes the retraining of the samples which are classified uncertainly. The processes of active learning can be executed in two steps: evaluation of classification certainty and sample labeling.
Evaluation of classification certainty
Given the unlabeled sample x, the spam category c
s
and the ham category c
h
, MS and ES calculate the uncertainty of x according to the following formulas [13]:
Further, MS and ES can be represented as the following modes when different classifiers are used: MS (ES) of SVM: P (c
s
|x) and P (c
h
|x) are represented by calculating the distances between x and the separating hyper-plane of SVM; MS (ES) of KNN: P (c
s
|x) and P (c
h
|x) are represented by calculating the proportions of the samples labeled spam and ham in x’s K nearest neighbors; MS (ES) of NB: P (c
s
|x) and P (c
h
|x) are calculated by the formula (2) of Section 2.1.
Obviously, formula (9) considers not only the relationships between x and its nearest samples Q si (Q hi ), but also the relationships between Q si (Q hi ) and Q si ’s (Q hi ’s) nearest samples. From Equation (9) we deduce that: (1) if , x belongs to spam category with greater possibility; (2) the smaller the p value is, the more uncertainly sample x is classified in Section 3.2.
According to the above processes, the p values of the samples in S i are calculated. On this basis, all the samples in S i are ranked in ascending order according to their p values. Then, the top |S i |α samples are denoted as Sr_i and recommended to users for labeling.
Obviously, the sample labeling burden is reduced for the following reasons: (1) the samples which the user is not sure whether they should be classified as “spam” or “ham” are not labeled; (2) the samples of which the content the user is not interested in are also not labeled.
Update the term frequency based interest sets and the sample training set
From the above processes, we notice that the term frequency based interest sets and the sample training set are updated by using the samples which are labeled by the user and the samples which are classified with the greatest certainties. In order to reduce the size growth speeds of the term frequency based interest sets and the sample training set, the proposed sample labeling model makes sure that two kinds of samples: the samples of which the category can not be determined by the user certainly and the samples of which the contents the user is not insterested in, are not considered for retraining.
Consuming time analysis of sample classification
Suppose the sample number of current training set is n
a
, the dimension of the feature space is n
f
, the sample number of each incremental set is n
i
, the number of content keywords in sample s
ij
is m. Moreover, the time cost on addition, subtraction, multiplication and division are denoted as t1, t2, t3 and t4, respectively. According to [8], the consuming time TC-NB of linear classifier NB can be estimated by the following formula:
From the sample classification process of the proposed method we know that the time costs of the steps in Section 3.2 can be calculated according to the following cases:
Case 1: the sample is classified in step 1. 2(m-1) addition operations. 2 subtraction operations. 2m multiplications.
Overall, the time cost of this step is:
Case 2: the sample is classified in step 2. 2(m-1) addition operations. 4 subtraction operations. 2m multiplications.
Overall, the time cost is:
Case 3: The sample is classified in step 3.
NB classifier is executed based on step 2, and the time cost is:
For two real numbers, because the consuming time of comparison, addition and subtraction are almost negligible when comparing with the time-consuming of multiplication, there exists t1 << t3 and t2 << t3. Based on this, formulas (12) and (13) approach to the following formulas:
Suppose there are w (w > 0) samples, which are classified in step 1 or step 2, then the average time cost of sample classification is:
Suppose TC-I1 = ζTC-NB, thus the following formula can be deduced:
In General, there exists ζ ≤ 0.05. On this basis, it can be deduced that TC-IA < TC-NB when the condition that w > 0.05n i satisfies. Actually, as more and more content keywords are added to the I P and I N , many samples are classified in step 1-2, deducing that the advantage of the proposed method on reducing the time cost of sample classification becomes increasingly apparent.
The experiments were conducted on an Intel Core(TM)-i5 Processor with a CPU clock rate of 3.10 GHz and 4 GB main memory. The vector space model [29] of selected features is built on the platform of visual studio 2008, using C++ standard template library (STL). As are shown in Table 1, two benchmark corpuses: TREC2007 (TR07) [30] and Enron-spam (ES) [31] are used as the experimental corpuses.
In order to evaluate the effectiveness of different spam filtering methods, the 10 folds cross validation is applied to the sample classification process. Moreover, the F1 measurement [31] is used to compare the proposed method with six typical methods, which are given in Table 2. The definition of F1 measurement can be formulated as follows:
In order to simulate the online spam classification process, each sample of the increments set is labeled before incremental learning. Moreover, the parameters of the proposed method are set as follows: the vector dimension of the feature space is set n f = 600; the number of the test sample is initialized n t = 5; the sample number of the test set is initialized n ts = 100.
Here, the sample number of the initial training set A0 is denoted as |A0|, the sample number of the incremental sample set S
i
is denoted as |S
i
| and the number of the incremental set is denoted as n
s
. Suppose |A0| = 100, 200, 300, 400, 500, |S
i
| = 100, 200, 300, 400, 500(0 < i < n
s
) and ns = 100. On this basis, we define the average F1 value F
a
and the average classification consuming time T
ca
as follows:
From Section 3.2 we know that the classification consuming time is irrelevant to the number of content keywords in term frequency based interest sets, and relevant to the value of Δ. In order to obtain the optimal Δ, statistical experiments are carried out when Δ (denoted as delt in Fig. 3) ranges from 0 to 100, and the corresponding F a and T ca values of TR07 and ES are given in Fig. 3(a) and (b), respectively. From Fig. 3 we know, when Δ ranges from 0 to 30 and 0 to 40, the F a values increase rapidly; as Δ increases further, the F a values remain stable. Hence, for reducing the consuming time of sample classification as much as possible, we set Δ= 30 and Δ= 40 when TR07 and ES are used, respectively.
Suppose |A0| = 200; n s is set to 100, 200, 300, 400 and 500, respectively; |S i | is set to 50 and 100, respectively. Figure 4 gives the average classification consuming time T ca values of ES and TR07 when different methods are used, respectively. Obviously, because the computational complexity of KNN classification process is relevant to the sample number of the training set, the T ca values of MS+KNN and ES+KNN increase significantly as n s ranges from 100 to 500. Meanwhile, we notice that MS+SVM and ES+SVM perform better than those KNN combined methods for the computational complexities of the former ones are relevant to the number of support vectors rather than the number of all samples in the training set. It is obvious that the T ca values of MS+NB and ES+NB are much lower than those SVM combined methods for the computational complexities of the former ones are only relevant to the vector dimension of the feature space. Further, the performance of the proposed method is obviously better than those of MS+NB and ES+NB when n s values are greater than 300, that’s mainly because the former method doesn’t classify the emails by using NB classifier directly, effectively reducing the average classification consuming time by combining the term frequency based user interest sets. Moreover, the lowest T ca values are both obtained by the proposed method when n s equals to 500 in Fig. 4(a) and (b), with the highest improvements of both about 0.05 ms over the corresponding T ca values of MS(ES)+NB methods.
Accuracy comparison of different methods
Suppose |A0|, |S i | and n s are all set to 100, 200, 300, 400 and 500, respectively. Figure 5 gives the average F a values (denoted as F aa ) of the proposed method when TR07 and ES are used, respectively. Obviously, comparing with other methods, the proposed method performs better than MS+KNN and ES+KNN for KNN classifier only utilizes the nearest neighbors of the unlabeled samples in the active learning and classification processes. In Fig. 5(a), the F aa values of the proposed method are lower than those of ES+SVM when n s ranges from 200 to 400, that’s because SVM may be good at dealing with the TR07 corpus for its good generalization. However, the proposed method still obtains the highest F aa value 0.972 when n s equals to 500. In Fig. 5(b), the proposed method performs obviously better than other methods when n s ranges from 200 to 400, illustrating the high accuracy of the proposed method on spam classification.
The high accuracy of the proposed method can be attributed to the following reasons: (1) the improved active learning method measures the sample classification certainty by calculating the boundary densities of different categories, which are ignored in MS and ES; (2) the improved active learning method is not based on the probability theory, thus the feature independence assumption, which is supposed to be over-simplistic in MS and ES is not combined; (3) the impacts of the samples in which the user is not interested, on the classification results of the samples in which the user is interested, is reduced by combining the user interest based sample labeling model.
Comparisons of sample labeling burden
Because the active learning process brings extra labeling burden to the users, the number of the samples which are recommended to the users should not be too large. To test the validity of the proposed method, different methods which combine MS, ES and LS are used for comparing, respectively. For ease of calculation, |A0| is set to 300, |S i | is set to 300, n s is set to 500, n t is set to 10 and n ts is set to 200. With respect to corpuses TR07 and ES, the maximum F1 values, which are denoted as F M , are shown in Table 3. From this table we know, the highest F M value 0.983 (denoted in bold) is obtained by the proposed method on TR07 corpus, and the minimum F M values of all methods are 0.961 and 0.964 when ES and TR07 corpuses are used, respectively. On this basis, the total numbers of the samples, which are recommended to the user for labeling until the corresponding F1 values are no less than 0.96, are calculated and denoted as n F in Table 3. Obviously, the n F values of the proposed method are generally much lower than those of other methods, with the lowest n F value 563 on ES corpus. Those may be the reasons for the lower sample labeling burden of the proposed method: (1) the active learning method of the proposed method makes the samples recommended to the users more representative; (2) the recommended samples in which the user is not interested are not labeled by combining the user interest based sample labeling model; (3) the samples which are wrongly classified with great possibilities, are also added into the training set for retraining, thus reducing the sample labeling burden while ensuring the accuracy of the proposed method.
Conclusions
The aim of the proposed method is to improve the speed of online spam classification and reduce the sample labeling burden without sacrificing the spam classificatiom accuracy. The originalities of the proposed method are given as follows: (1) the conceptions of term frequency based positive and negitive interest sets are proposed and combined with the Naïve Bayes classifer to rdeuce the consuming time of email classfication; (2) the sample boundary density is considered and a boundary density base evaluating function is proposed to measuring the classification certainties of the unlabeled samples in active learning process; (3) the user interested based sample labeling model is proposed to reduce the impacts of the samples in which the user is not interested, on the classification results of the samples in which the user is interested. To evaluate the performance of the proposed method, six commonly used spam classification methods are used for comparing. The classification processes are completed by Support Vector Machine, Naïve Bayesian and K-Nearest Neighbor classifiers on TREC2007 and Enron-spam corpuses. Experimental results show that, the email training and classification speeds of the proposed method are much more faster than those of other methods. Meanwhile, on the premise of small labeling burden, the proposed method performs generally better than other methods on the two corpuses by using F1 measurement. In the immediate future, we will continue the study on classifying emails with images and links which are important characteristic of spam emails.
Footnotes
Acknowledgments
Project supported by the National Nature Science Foundation of China (No. 60973040, No. 60903098), the National Nature Science Foundation for Young Scientists of China (No. 61300148), the Key Scientificand Technology Projects of Jilin Province. (No. 20130206051GX).
