Quick online spam classification method based on active and incremental learning

Abstract

To improve classification speed without seriously sacrificing email classification accuracy, a novel online spam classification method is proposed. Firstly, the concepts of term frequency based interest sets are introduced, and emails are classified by combining term frequency based interest sets with a Naïve Bayes classifier. Secondly, based on active learning theory, a novel boundary density based method for evaluating email classification certainty is proposed; combined with the user’s interests, it selects and recommends emails to users for labeling. Finally, the emails which are labeled, together with those classified with the greatest certainty, are used for retraining based on incremental learning theory. In the experiments, Support Vector Machine (SVM), Naïve Bayesian (NB) and K-Nearest Neighbors (KNN) classifiers are used on two corpora: TREC2007 and Enron-spam. Compared with six typical active learning based incremental learning methods, the proposed method greatly reduces the time consumed by email classification while maintaining accuracy. Moreover, the proposed method places only a very small sample labeling burden on users, which shows its value for online applications.

Keywords

Email classification; term frequency based interest sets; Support Vector Machine; Naïve Bayesian; K-Nearest Neighbors; active learning; incremental learning

1 Introduction

Online spam classification is normally defined as an incremental supervised binary streaming text classification task [1]. In this process, emails are classified by a classifier in chronological order, and the classifier updates itself incrementally according to the feedback of the user. Online spam classification requires statistical text classification methods that are efficient in both space and time [2]. Numerous sophisticated algorithms have been applied to text categorization, for example, Naïve Bayes classifier (NB) [3], Support Vector Machine (SVM) [4], K-Nearest Neighbor (KNN) [5], Decision tree [6], Rocchio [7], etc. Among these successful methods, Naïve Bayes is popular in text classification due to its computational efficiency and relatively good predictive performance [8].

Incremental learning and active learning theories have been widely used in online spam classification [9]. The goal of incremental learning is to select a subset of training samples while preserving the performance obtained with all the training samples [10]. Batch SVM, proposed by Syed [11], is a typical incremental learning method. It simply merges the incremental samples and the support vectors (SV) of the training set to form a new training set. Training is fast, but the spam recognition accuracy is not very high. Wu et al. improved Batch SVM by using the Karush-Kuhn-Tucker (KKT) conditions to add new samples into the training set and remove redundant samples [12]. Samples are trained and classified quickly, but the users do not participate in the classification process; thus, the classification results are often inconsistent with the actual judgments of the users. In the incremental learning process, active learning is used to select the most useful samples for labeling and to add the labeled samples for retraining. Active learning mainly comprises the following types: membership query synthesis, stream based active learning and pool based active learning. Compared with stream based and pool based active learning, membership query synthesis is awkward because it requires labeling arbitrary samples [13]. Moreover, pool based active learning appears to be much more common in online spam classification applications than stream based active learning, because the latter is inefficient in dealing with online emails: it scans through the data sequentially and makes query decisions individually. In pool based active learning, uncertainty sampling [8] is perhaps the simplest and most commonly used query framework. Typical uncertainty sampling methods include least confident (LC) [13], margin sampling (MS) [13], entropy sampling (ES) [13], centroid sampling (CS) [14], etc.
Amayri compared the traditional SVM based online spam classification methods and the improved online spam classification methods which combined the active learning method, and deduced that the latter can significantly improve the spam recognition performance [15]. Sculley combined SVM and explored several reasonable approaches to determine when to request labels for new samples, and proved that online active learning is the most appropriate form of active learning for spam filtering [16]. Cheng presented an improved incremental learning and active learning algorithm for SVMs. The k-means clustering algorithm was used to collect the initial training samples, and a weight factor was assigned to each sample according to its distance to the current separating hyperplane [10]. For reducing the sample labeling burden of the active learning process, Ienco et al. combined the density and prediction uncertainty of sample stream and proposed a new active learning method, which allowed focusing labelling efforts in the instance space where more samples are concentrated [17]. Moreover, Georgala introduced a spam filtering method by combining the incremental clustering based active learning method. Experimental results showed that the method performed very well, verifying the efficiency of the proposed method on spam filtering [18].

Although methods which combine incremental learning and active learning are widely used in the field of spam classification, some problems remain [19, 20]: (1) many methods use SVM for email training and classification. However, SVM is very time consuming in dealing with online emails; (2) traditional uncertainty sampling methods (such as MS and ES) have problems when different classifiers are used. For example, MS ignores the effect of the inner samples in different categories when SVM is used; when NB is used, the calculation of posterior probabilities in MS and ES relies on the independence assumption, which is generally regarded as over-simplistic; MS ignores the sample densities of different categories when KNN is used; (3) in the sample labeling process of traditional active learning methods, the user’s interest in the contents of the recommended emails is not considered, so all samples must be labeled, which places a large burden on the experts [21].

To solve the above problems, a quick online spam classification method is proposed in this paper. The contributions of the proposed method are as follows: (1) to increase spam classification speed, the concepts of term frequency based positive (negative) interest sets are introduced and combined with the NB classifier for email classification; (2) to correct the shortcomings of traditional uncertainty sampling methods, a novel boundary density based method for evaluating sample classification certainty is proposed; (3) to reduce the burden of the sample labeling process, a novel user interest based sample labeling model is proposed. The remainder of this paper is organized as follows. Section 2 reviews the algorithms of Naïve Bayesian and active learning. Section 3 describes the proposed method. Section 4 shows experimental results and analysis. Section 5 concludes the whole paper.

2 Algorithms of Naïve Bayesian and active learning

2.1 Naïve Bayesian

The Naïve Bayesian algorithm derives from Bayesian Decision Theory, and has become the simplest and most widely used classifier in the text classification field. Given the categories C = {c_i} (i = 0, 1, …, N - 1), and a sample with vector X = {x_j} (j = 0, 1, …, M), the probability that X belongs to category c_i is calculated by formula (1) [13]: $P (c_{i} | X) = \frac{P (c_{i}) P (X | c_{i})}{\sum_{c_{k} \in C} P (c_{k}) P (X | c_{k})}$ (1) where P (c_i) denotes the probability of the documents in category c_i, and P (X|c_i) is the probability that document X occurs in c_i. X belongs to category c_i if c_i maximizes formula (1). In practice, it is very difficult to estimate the value of P (X|c_i). With respect to sample X, the attributes {x_j} (j = 0, 1, …, M) are therefore assumed to be conditionally independent given a category c_i. Hence, formula (1) can be rewritten as formula (2): $P (c_{i} | X) = \frac{P (c_{i}) \prod_{k = 0}^{M} P (x_{k} | c_{i})}{\sum_{c_{l} \in C} P (c_{l}) \prod_{k = 0}^{M} P (x_{k} | c_{l})}$ (2)

Although the independence assumption is generally regarded as over-simplistic, studies have shown that NB is significantly effective in dealing with discrete problems in the text classification field [22]. Moreover, NB appeared to be the best in terms of computational complexity among flexible Bayes, linear SVM and LogitBoost [8]. Hence, NB is used for sample training and classification in this paper.
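
As an illustration of formula (2), the following minimal Python sketch computes the posterior for each category in log space. The function name `nb_posterior`, the dictionary layout and the toy probabilities are assumptions made for this example only; the conditional probabilities are assumed to be already estimated and smoothed, and this is not the implementation used in the experiments.

```python
import math

def nb_posterior(priors, cond_probs, features):
    """Posterior P(c | X) under the independence assumption of formula (2).

    priors:     {category: P(c)}
    cond_probs: {category: {feature: P(x | c)}}, assumed already smoothed (> 0)
    features:   the attributes x_k observed in sample X
    """
    log_joint = {}
    for c, prior in priors.items():
        # log P(c) + sum_k log P(x_k | c): the numerator of formula (2)
        log_joint[c] = math.log(prior) + sum(
            math.log(cond_probs[c][x]) for x in features
        )
    # Normalise with the log-sum-exp trick so long products do not underflow.
    top = max(log_joint.values())
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    return {c: math.exp(v - log_norm) for c, v in log_joint.items()}

# Toy values (hypothetical): one feature, equal priors.
priors = {"spam": 0.5, "ham": 0.5}
cond_probs = {"spam": {"free": 0.8}, "ham": {"free": 0.2}}
posterior = nb_posterior(priors, cond_probs, ["free"])
```

Working in log space is a common design choice for NB because the product over many features quickly becomes too small for floating point.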

2.2 Active learning

Labeled instances are very difficult, time-consuming, or expensive to obtain in many supervised learning tasks [13]. According to [23], the pool based active learning process is the most widely used approach, and it can be modeled as a quintuple (G, Q, S, T, U), where G is a supervised classifier, which is trained with the training set T; Q is a query function used to select the most informative unlabeled samples from a pool U of unlabeled samples on the basis of the current classification results; S is a supervisor who can assign the true class label to any unlabeled sample of U. The general process of pool based active learning can be described as follows:

Pool based active learning procedure
1: Train classifier G with the initial training set T
2: Classify the unlabeled samples of the pool U
Repeat
3: Query a set of samples (with query function Q) from the pool U
4: A label is assigned to the queried samples by the supervisor S
5: Add the new labeled samples to the training set T
6: Retrain the classifier
Until a stopping criterion is satisfied
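
The procedure above can be sketched in Python as follows. This is a generic skeleton under stated assumptions: `fit`, `query` and `supervisor` are hypothetical callables standing in for G, Q and S, and the stopping criterion is assumed to be a fixed number of rounds, which the source does not specify.

```python
def pool_based_active_learning(fit, query, supervisor, training_set, pool, max_rounds):
    """Generic pool based active learning loop over the quintuple (G, Q, S, T, U).

    fit(T)        -> trained classifier G
    query(G, U)   -> list of samples selected from the pool U (query function Q)
    supervisor(x) -> true label of sample x (supervisor S)
    """
    training_set = list(training_set)      # T as (sample, label) pairs
    pool = list(pool)                      # U
    model = fit(training_set)              # step 1
    for _ in range(max_rounds):            # assumed stopping criterion: round budget
        if not pool:
            break
        queried = query(model, pool)       # steps 2-3: classify the pool, then query
        if not queried:
            break
        for x in queried:
            training_set.append((x, supervisor(x)))  # steps 4-5
            pool.remove(x)
        model = fit(training_set)          # step 6
    return model, training_set, pool

# Toy usage: samples are numbers, the "classifier" is the mean of the training samples,
# and the query picks the pool element closest to that value.
toy_fit = lambda T: sum(x for x, _ in T) / len(T)
toy_query = lambda g, U: [min(U, key=lambda x: abs(x - g))]
toy_supervisor = lambda x: int(x >= 5)
model, T, U = pool_based_active_learning(
    toy_fit, toy_query, toy_supervisor, [(0, 0), (10, 1)], [1, 4, 6, 9], max_rounds=2
)
```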

3 Method description

Figure 1 gives the flowchart of the proposed method, and more details are given as follows.

3.1 Feature selection and sample training

The attachments, headers, stop-words and labels of each email in the training set A_i and the incremental set S_i are removed. Moreover, the Porter stemming algorithm is used for stemming [24]. To balance the execution speed and the accuracy of feature selection, the feature selection method based on comprehensive measurement both in inter-category and intra-category (CMFS) [25] is used to select features from A_i. The selected features are sorted by their CMFS values in descending order, and the first n_f features are selected and denoted as S_f. Because of its simplicity, linear computational complexity and high accuracy, the NB classifier is used for sample training and classification. McCallum and Nigam indicated that the multinomial model can generate higher accuracy than the multivariate Bernoulli model; thus the multinomial model is used in the NB classifier [26].

3.2 Sample classification

Given the training set A_i, the incremental set S_i (i = 1, 2, …, n_s, n_s is the number of the incremental sample sets), and the sample s_ij of S_i, some definitions are given as follows:

content keyword: the terms which occur in both S_f and s_ij are called the content keywords of s_ij, and are denoted as {ct_ijk} (0 ≤ k < m, m ≥ 0), where the keywords are distinct: if 0 ≤ p ≠ q ≤ m - 1, then ct_ijp ≠ ct_ijq.

term frequency based positive interest (TFPI): If s_ij is identified as a legitimate email (ham) by a user, then each content keyword ct_ijk of s_ij is called the term frequency based positive interest of the user.

term frequency based negative interest (TFNI): If s_ij is identified as an unsolicited email (spam) by a user, then each content keyword ct_ijk of s_ij is called the term frequency based negative interest of the user.

With respect to a user, all the TFPIs are denoted as I_P, and all the TFNIs are denoted as I_N. For each content keyword in I_P or I_N, an attribute, which is called positive occurring frequency (POF) or negative occurring frequency (NOF), is introduced to denote the number of the times that the content keyword repeatedly occurs in the email which has been classified as ham or spam.

Given the current term frequency based positive interest set I_P, the current term frequency negative interest set I_N and a threshold Δ > 0, the process of obtaining sample s_ij’s category C (s_ij) can be described as follows:

Input: s_ij, I_P, I_N and Δ.
Output: s_ij’s category C (s_ij)
1: If $\sum_{k = 0}^{m - 1} NOF ({ct}_{ijk}) \times CMFS ({ct}_{ijk}) - \sum_{k = 0}^{m - 1} POF ({ct}_{ijk}) \times CMFS ({ct}_{ijk}) \geq Δ$
Then C (s_ij) = spam
2: Else If $\sum_{k = 0}^{m - 1} POF ({ct}_{ijk}) \times CMFS ({ct}_{ijk}) - \sum_{k = 0}^{m - 1} NOF ({ct}_{ijk}) \times CMFS ({ct}_{ijk}) \geq Δ$
Then C (s_ij) = ham
3: Else
Obtain the category of s_ij by using NB classifier.

Where, m denotes the number of the content keywords in s_ij, NOF (ct_ijk) denotes the number of the times that the content keyword ct_ijk repeatedly occurs in I_N; POF (ct_ijk) denotes the number of the times that ct_ijk repeatedly occurs in I_P; CMFS(ct_ijk) denotes the CMFS value of term ct_ijk obtained by CMFS feature selection method.
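
The three-step decision above can be sketched in Python as follows. The function name, the dictionary representation of POF, NOF and CMFS, and the `nb_classify` fallback callable are assumptions made for illustration; keywords absent from I_P or I_N are assumed to have frequency 0.

```python
def classify_with_interest_sets(keywords, pof, nof, cmfs, delta, nb_classify):
    """Classify one sample from its distinct content keywords.

    pof / nof:   {keyword: POF or NOF value}; a missing keyword counts as 0
    cmfs:        {keyword: CMFS value}; every content keyword is in S_f, so it has one
    delta:       the threshold (Delta > 0)
    nb_classify: hypothetical fallback callable implementing the NB classifier
    """
    negative = sum(nof.get(t, 0) * cmfs[t] for t in keywords)
    positive = sum(pof.get(t, 0) * cmfs[t] for t in keywords)
    if negative - positive >= delta:       # step 1
        return "spam"
    if positive - negative >= delta:       # step 2
        return "ham"
    return nb_classify(keywords)           # step 3: fall back to NB

# Toy values (hypothetical).
cmfs = {"free": 2.0, "meeting": 1.0}
nof = {"free": 10}
pof = {"meeting": 3}
fallback = lambda kws: "decided-by-nb"
r1 = classify_with_interest_sets(["free"], pof, nof, cmfs, 5, fallback)
r2 = classify_with_interest_sets(["meeting"], pof, nof, cmfs, 2, fallback)
r3 = classify_with_interest_sets(["meeting"], pof, nof, cmfs, 5, fallback)
```

Only samples whose weighted evidence does not reach the threshold in either direction reach the NB classifier, which is where the speed-up analysed in Section 3.5 comes from.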

3.3 Active learning process

The traditional classifiers make great efforts to reduce the classification errors [27, 28]. Active learning, which is also called inquiry learning or optimal experimental design, is an important technology of machine learning [13]. Active learning recommends the samples with the richest information to users for labeling, and realizes the retraining of the samples which are classified uncertainly. The processes of active learning can be executed in two steps: evaluation of classification certainty and sample labeling.

3.3.1 Evaluation of classification certainty

Given the unlabeled sample x, the spam category c_s and the ham category c_h, MS and ES calculate the uncertainty of x according to the following formulas [13]: $MS (x) = argmin (P (c_{s} | x) - P (c_{h} | x))$ (3) $ES (x) = argmin - (\sum_{s} P (c_{s} | x) \log P (c_{s} | x) + \sum_{h} P (c_{h} | x) \log P (c_{h} | x))$ (4)

Further, MS and ES can be represented as the following modes when different classifiers are used:

MS (ES) of SVM: P (c_s|x) and P (c_h|x) are represented by calculating the distances between x and the separating hyper-plane of SVM;

MS (ES) of KNN: P (c_s|x) and P (c_h|x) are represented by calculating the proportions of the samples labeled spam and ham in x’s K nearest neighbors;

MS (ES) of NB: P (c_s|x) and P (c_h|x) are calculated by the formula (2) of Section 2.1.
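
For the binary spam/ham case, the two uncertainty measures can be sketched as follows. The sketch uses the definitions that are standard in the active learning literature (absolute margin and Shannon entropy), which is an assumption and may differ in sign convention from formulas (3) and (4); under these definitions the most uncertain sample has the smallest margin and the largest entropy.

```python
import math

def margin(p_spam):
    """Absolute difference between the two posteriors; small means uncertain."""
    p_ham = 1.0 - p_spam
    return abs(p_spam - p_ham)

def entropy(p_spam):
    """Shannon entropy of the binary posterior; large means uncertain."""
    return -sum(p * math.log(p) for p in (p_spam, 1.0 - p_spam) if p > 0)

def most_uncertain_by_margin(posteriors):
    """posteriors: {sample_id: P(c_s | x)}; returns the id with the smallest margin."""
    return min(posteriors, key=lambda x: margin(posteriors[x]))

picked = most_uncertain_by_margin({"a": 0.95, "b": 0.55, "c": 0.10})
```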

Obviously, instances with small margins are more ambiguous, so knowing their true labels helps the model discriminate more effectively between the categories. However, there are some problems when MS and ES are used. For example, when SVM is used, MS and ES only utilize the samples which are near the hyper-plane, and thus ignore the effect of the inner samples in different categories; when Naïve Bayes is used, MS and ES cannot obtain accurate posterior probabilities, because the feature independence assumption is generally regarded as over-simplistic; when KNN is used, MS and ES only consider the numbers of x’s nearest neighbors in different categories, ignoring the boundary densities of different categories. Based on this, a new boundary density based classification certainty evaluating method is proposed. Given the spam set A_si, the ham set A_hi and the sample x of the incremental set S_i, which are shown in Fig. 2, the process of the proposed classification certainty evaluating method is given as follows:

Input: A_si, A_hi and S_i.
Output: classification certainties of the samples in S_i.
1: for each x in S_i
2: obtain x’s nearest K (we set K = 5 in this paper) samples in A_si (A_hi), and denote them as Q_si and Q_hi, respectively.
3: calculate the distance d_si (d_hi) between x and Q_si (Q_hi): $d_{si} (x) = \frac{1}{K} \sum_{x_{j} \in Q_{si}} dis (x, x_{j})$ (5) $d_{hi} (x) = \frac{1}{K} \sum_{x_{j} \in Q_{hi}} dis (x, x_{j})$ (6)
Where, dis (x, x_j) denotes the distance between the sample x and sample x_j.
4: calculate the boundary density of A_si (A_hi): $\bar{d_{si}}$ ($\bar{d_{hi}}$), which are the average distances between the samples in Q_si (Q_hi) and their nearest K samples in A_si (A_hi): $\bar{d_{si}} = \frac{1}{| Q_{si} |} \sum_{x^{'} \in Q_{si}} \frac{1}{K} \sum_{x_{j} \in Neighbour (x^{'}, K, A_{si})} dis (x^{'}, x_{j})$ (7) $\bar{d_{hi}} = \frac{1}{| Q_{hi} |} \sum_{x^{'} \in Q_{hi}} \frac{1}{K} \sum_{x_{j} \in Neighbour (x^{'}, K, A_{hi})} dis (x^{'}, x_{j})$ (8)
Neighbour (x′, K, A_si) and Neighbour (x′, K, A_hi) denote the nearest K samples of sample x′ in A_si and A_hi, respectively.
5: calculate the classification certainty of x: $p (x) = | \frac{d_{si} (x)}{\bar{d_{si}}} - \frac{d_{hi} (x)}{\bar{d_{hi}}} |$ (9)
6: end for

Obviously, formula (9) considers not only the relationships between x and its nearest samples Q_si (Q_hi), but also the relationships between Q_si (Q_hi) and Q_si’s (Q_hi’s) nearest samples. From Equation (9) we deduce that: (1) if $\frac{d_{si} (x)}{\bar{d_{si}}} < 1$ , x belongs to spam category with greater possibility; (2) the smaller the p value is, the more uncertainly sample x is classified in Section 3.2.

According to the above processes, the p values of the samples in S_i are calculated. On this basis, all the samples in S_i are ranked in ascending order according to their p values. Then, the top |S_i|α samples are denoted as S_{r_i} and recommended to users for labeling.
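
A minimal Python sketch of formulas (5) to (9) and of the recommendation step is given below. Several points are assumptions rather than statements from the method description: samples are numeric tuples, `dis` is the Euclidean distance, a sample is excluded from its own neighbour list when the boundary density is computed, and every category holds more than K distinct samples so that no average distance is zero.

```python
import math

def dis(a, b):
    return math.dist(a, b)  # assumed metric: Euclidean distance

def k_nearest(x, samples, k, exclude_self=False):
    candidates = [s for s in samples if not (exclude_self and s is x)]
    return sorted(candidates, key=lambda s: dis(x, s))[:k]

def mean_distance(x, neighbours):
    return sum(dis(x, s) for s in neighbours) / len(neighbours)

def normalised_distance(x, category, k):
    """d(x) divided by the boundary density of one category (A_si or A_hi)."""
    q = k_nearest(x, category, k)                        # Q_si or Q_hi
    d_x = mean_distance(x, q)                            # formula (5) or (6)
    d_bar = sum(                                         # formula (7) or (8)
        mean_distance(s, k_nearest(s, category, k, exclude_self=True)) for s in q
    ) / len(q)
    return d_x / d_bar

def certainty(x, spam, ham, k=5):
    """Classification certainty p(x), formula (9)."""
    return abs(normalised_distance(x, spam, k) - normalised_distance(x, ham, k))

def recommend(samples, spam, ham, alpha, k=5):
    """Rank by p in ascending order and return the top |S_i| * alpha samples."""
    ranked = sorted(samples, key=lambda x: certainty(x, spam, ham, k))
    return ranked[: int(len(samples) * alpha)]

# Toy data (hypothetical): two well separated clusters in the plane.
spam = [(0, 0), (0, 1), (1, 0), (1, 1)]
ham = [(10, 10), (10, 11), (11, 10), (11, 11)]
near_spam, midway = (0.5, 0.5), (5.5, 5.5)
chosen = recommend([near_spam, midway], spam, ham, alpha=0.5, k=2)
```

In this toy setting the sample lying midway between the clusters has a p value close to 0 and is the one recommended for labeling, whereas the sample inside the spam cluster has a large p value.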

3.3.2 Sample labeling

In the sample labeling processes of traditional active learning methods, experts label the categories of the samples which are recommended to them. In spam filtering, the category of an email is determined by the email receiver (user); thus the categories of the recommended samples must be labeled manually by the email users. In traditional sample labeling methods, all samples must be labeled, which places a large burden on users [21]. To solve this problem, we propose a novel user interest based sample labeling model. Given the sample set S_{r_i}, which was obtained in Section 3.3.1, the details of the model are described as follows:

Input: S_{r_i}
Output: the labeled sample set S_{r_i}
for each sample x in S_{r_i}
1: if x is deemed as a spam email certainly and the user
is interested in the content of x
then x is labeled “spam”.
2: else if x is deemed as a ham email certainly and the user
is interested in the content of x
then x is labeled “ham”.
3: else
remove x from S_{r_i}.
continue.
end for

Obviously, the sample labeling burden is reduced for the following reasons: (1) samples for which the user is not sure whether the category is “spam” or “ham” are not labeled; (2) samples whose content the user is not interested in are also not labeled.
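
The labeling model can be sketched as a simple filter. The `ask_user` callback is hypothetical: it stands in for the interaction with the email user and is assumed to return a verdict (“spam”, “ham”, or None when the user is not certain) together with a flag saying whether the user is interested in the content.

```python
def label_recommended(samples, ask_user):
    """Return {sample: label} for the recommended samples that the user labels.

    ask_user(x) -> (verdict, interested), where verdict is "spam", "ham" or None.
    Samples with an uncertain verdict or without user interest are dropped.
    """
    labeled = {}
    for x in samples:
        verdict, interested = ask_user(x)
        if interested and verdict in ("spam", "ham"):
            labeled[x] = verdict
        # otherwise x is removed from S_{r_i} and never used for retraining
    return labeled

# Toy answers (hypothetical).
answers = {
    "a": ("spam", True),
    "b": ("ham", False),   # user not interested
    "c": (None, True),     # user not certain
    "d": ("ham", True),
}
labeled = label_recommended(["a", "b", "c", "d"], lambda x: answers[x])
```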

3.4 Update the term frequency based interest sets and the sample training set

Based on the process of Section 3.3, the term frequency based interest sets (denoted as I_P and I_N), and the sample clusters of A_i are updated according to the following process:

Input: I_P, I_N, A_i, temp set S_{up_i}.
Output: the updated I_P, I_N and A_i.
1: Rank all the samples of S_i in descending order according to the p values which are obtained according to formula (9).
2: Select the top |S_i|α (α = 0.05 in this paper) samples and denote them as S_{top_i}.
3: For each sample x in S_{top_i}
If ($\frac{d_{si} (x)}{\bar{d_{si}}} > \frac{d_{hi} (x)}{\bar{d_{hi}}}$ && C (x) = ham) || ($\frac{d_{si} (x)}{\bar{d_{si}}} < \frac{d_{hi} (x)}{\bar{d_{hi}}}$ && C (x) = spam)
Then, x is considered to be correctly classified.
If ($\frac{d_{si} (x)}{\bar{d_{si}}} < \frac{d_{hi} (x)}{\bar{d_{hi}}}$ && C (x) = ham) || ($\frac{d_{si} (x)}{\bar{d_{si}}} > \frac{d_{hi} (x)}{\bar{d_{hi}}}$ && C (x) = spam)
Then, x is considered to be wrongly classified, and x’s category C (x) is reversed.
End for
4: Set S_{up_i} = S_{r_i} ∪ S_{top_i}.
5: The positive interest set I_P, negative interest set I_N and the training set are updated according to the following processes.
For each sample x in S_{up_i}
5.1: Update I_P and I_N
5.1.1: If C (x) = ham
5.1.2: For each content keyword ct_xk in x
5.1.3: If ct_xk ∉ I_P
5.1.4: Add ct_xk into I_P, POF (ct_xk) = 1;
5.1.5: Else
5.1.6: POF (ct_xk)++;
5.1.7: Else
5.1.8: For each content keyword ct_xk in x
5.1.9: If ct_xk ∉ I_N
5.1.10: Add ct_xk into I_N, NOF (ct_xk) = 1;
5.1.11: Else
5.1.12: NOF (ct_xk)++.
5.2: Add the sample x to its category and obtain the updated training set A_{i+1}.
End for

From the above processes, we notice that the term frequency based interest sets and the sample training set are updated by using the samples which are labeled by the user and the samples which are classified with the greatest certainties. To slow the growth of the term frequency based interest sets and the sample training set, the proposed sample labeling model ensures that two kinds of samples are not considered for retraining: samples whose category cannot be determined with certainty by the user, and samples whose contents the user is not interested in.
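
Steps 3 and 5.1 of the update process can be sketched in Python as follows. The function names and the dictionary representation of I_P and I_N (keyword mapped to its POF or NOF value) are assumptions for illustration; leaving the category unchanged when the two normalised distances are equal is also an assumption, since that case is not covered by the conditions in step 3.

```python
def reconcile_category(category, norm_spam, norm_ham):
    """Step 3: keep or reverse C(x) according to the normalised distances.

    norm_spam = d_si(x) / mean d_si, norm_ham = d_hi(x) / mean d_hi.
    """
    if norm_spam < norm_ham:
        return "spam"
    if norm_spam > norm_ham:
        return "ham"
    return category  # tie: assumed to leave the category unchanged

def update_interest_sets(samples, pof, nof):
    """Step 5.1: samples is an iterable of (content_keywords, category) pairs."""
    for keywords, category in samples:
        target = pof if category == "ham" else nof
        for t in keywords:
            # A new keyword enters the set with frequency 1, otherwise it is incremented.
            target[t] = target.get(t, 0) + 1
    return pof, nof

# Toy values (hypothetical).
pof, nof = {}, {"free": 1}
update_interest_sets([(["free", "win"], "spam"), (["meeting"], "ham")], pof, nof)
fixed = reconcile_category("ham", 0.5, 2.0)
```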

3.5 Consuming time analysis of sample classification

Suppose the sample number of current training set is n_a, the dimension of the feature space is n_f, the sample number of each incremental set is n_i, the number of content keywords in sample s_ij is m. Moreover, the time cost on addition, subtraction, multiplication and division are denoted as t₁, t₂, t₃ and t₄, respectively. According to [8], the consuming time T_C-NB of linear classifier NB can be estimated by the following formula: $T_{C - NB} = 2 n_{f} t_{3} + t_{4} + 2 t_{1}$ (10)

From the sample classification process of the proposed method we know that the time costs of the steps in Section 3.2 can be calculated according to the following cases:

Case 1: the sample is classified in step 1.

2(m-1) addition operations.

2 subtraction operations.

2m multiplications.

Overall, the time cost of this step is: $T_{C - I 1} = 2 (m - 1) t_{1} + 2 t_{2} + 2 m t_{3}$ (11)

Case 2: the sample is classified in step 2.

2(m-1) addition operations.

4 subtraction operations.

2m multiplications.

Overall, the time cost is: $T_{C - I 2} = 2 (m - 1) t_{1} + 4 t_{2} + 2 m t_{3}$ (12)

Case 3: The sample is classified in step 3.

NB classifier is executed based on step 2, and the time cost is:

$\begin{matrix} T_{C - I 3} & = & T_{C - I 2} + T_{C - NB} \\  & = & 2 m t_{1} + 4 t_{2} + 2 m t_{3} + 2 n_{f} t_{3} + t_{4} \\  & = & 2 m t_{1} + 4 t_{2} + (2 m + 2 n_{f}) t_{3} + t_{4} \end{matrix}$ (13)

For two real numbers, the time consumed by comparison, addition and subtraction is almost negligible compared with that of multiplication, so t₁ << t₃ and t₂ << t₃. On this basis, formulas (11), (12) and (13) can be approximated as follows: $T_{C - I 1} \approx T_{C - I 2} \approx 2 m t_{3}$ (14) $T_{C - I 3} \approx (2 m + 2 n_{f}) t_{3} + t_{4}$ (15)

Suppose there are w (w > 0) samples, which are classified in step 1 or step 2, then the average time cost of sample classification is:

$\begin{matrix} T_{C - IA} & = & \frac{T_{C - I 1} w + T_{C - I 3} (n_{i} - w)}{n_{i}} \\  & = & \frac{2 m t_{3} w + ((2 m + 2 n_{f}) t_{3} + t_{4}) (n_{i} - w)}{n_{i}} \\  & = & 2 m t_{3} + (2 n_{f} t_{3} + t_{4}) \frac{n_{i} - w}{n_{i}} \\  & = & T_{C - I 1} + T_{C - NB} \frac{n_{i} - w}{n_{i}} \end{matrix}$ (16)

Suppose T_C-I1 = ζT_C-NB, thus the following formula can be deduced:

$\begin{matrix} T_{C - NB} - T_{C - IA} & = & T_{C - NB} - ζ T_{C - NB} - \frac{n_{i} - w}{n_{i}} T_{C - NB} \\  & = & T_{C - NB} (\frac{w}{n_{i}} - ζ) \end{matrix}$ (17)

In general, ζ ≤ 0.05. On this basis, it can be deduced that T_C-IA < T_C-NB whenever w > 0.05n_i. In practice, as more and more content keywords are added to I_P and I_N, many samples are classified in steps 1 and 2, so the advantage of the proposed method in reducing the time cost of sample classification becomes increasingly apparent.
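
The approximate cost model of formulas (14) to (17) can be checked numerically with the short sketch below. The concrete values (m = 15 content keywords, unit multiplication and division costs, n_i = 100, w = 50) are illustrative assumptions only; n_f = 600 follows the experimental setting, and the addition cost t₁ is neglected in the NB cost as in formula (16).

```python
def nb_cost(n_f, t3, t4):
    """Approximate T_C-NB with additions neglected: 2 * n_f * t3 + t4."""
    return 2 * n_f * t3 + t4

def average_cost(m, n_f, n_i, w, t3, t4):
    """T_C-IA from formula (16): T_C-I1 + T_C-NB * (n_i - w) / n_i."""
    return 2 * m * t3 + nb_cost(n_f, t3, t4) * (n_i - w) / n_i

# Illustrative values (assumptions, except n_f = 600).
m, n_f, n_i, w, t3, t4 = 15, 600, 100, 50, 1.0, 1.0
t_nb = nb_cost(n_f, t3, t4)
t_ia = average_cost(m, n_f, n_i, w, t3, t4)
zeta = (2 * m * t3) / t_nb                 # T_C-I1 = zeta * T_C-NB
saving = t_nb * (w / n_i - zeta)           # right-hand side of formula (17)
```

With these values ζ is about 0.025, so the interest-set shortcut pays off as soon as more than roughly 2.5% of the samples are decided in steps 1 and 2; when w = 0 the method is slightly slower than plain NB because the interest-set sums are still computed.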

4 Experimental results and analysis

The experiments were conducted on an Intel Core(TM) i5 processor with a CPU clock rate of 3.10 GHz and 4 GB of main memory. The vector space model [29] of the selected features is built in Visual Studio 2008, using the C++ standard template library (STL). As shown in Table 1, two benchmark corpora, TREC2007 (TR07) [30] and Enron-spam (ES) [31], are used as the experimental corpora.

To evaluate the effectiveness of different spam filtering methods, 10-fold cross-validation is applied to the sample classification process. Moreover, the F₁ measurement [31] is used to compare the proposed method with six typical methods, which are given in Table 2. The F₁ measurement is defined as follows: $F_{1} = \frac{2 \times r \times p}{r + p}$ (18) where p and r denote the spam classification precision and recall, respectively. Based on Wang’s method [32], we calculate p and r by combining the user’s interests. The definitions of p and r are given as follows: $p = \frac{n_{ss}}{n_{ss} + n_{hs}}, r = \frac{n_{ss}}{n_{ss} + n_{sh}}$ (19) where n_ss is the number of spam emails which the user is interested in and which are correctly classified as spam; n_hs is the number of ham emails which the user is interested in and which are misclassified as spam; n_sh is the number of spam emails which the user is interested in and which are misclassified as ham.
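
Formulas (18) and (19) translate directly into a few lines of Python. The function name and the example counts are hypothetical; the sketch assumes that at least one spam email of interest is correctly classified, so that no denominator is zero.

```python
def precision_recall_f1(n_ss, n_hs, n_sh):
    """Interest-aware precision, recall and F1 from formulas (18) and (19).

    n_ss: spam of interest correctly classified as spam
    n_hs: ham of interest misclassified as spam
    n_sh: spam of interest misclassified as ham
    """
    p = n_ss / (n_ss + n_hs)
    r = n_ss / (n_ss + n_sh)
    f1 = 2 * r * p / (r + p)
    return p, r, f1

# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(n_ss=90, n_hs=10, n_sh=10)
```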

To simulate the online spam classification process, each sample of the incremental set is labeled before incremental learning. Moreover, the parameters of the proposed method are set as follows: the vector dimension of the feature space is set to n_f = 600; the number of test sets is initialized to n_t = 5; the sample number of each test set is initialized to n_ts = 100.

4.1 Selection of the threshold Δ

Here, the sample number of the initial training set A₀ is denoted as |A₀|, the sample number of the incremental sample set S_i is denoted as |S_i| and the number of incremental sets is denoted as n_s. Suppose |A₀| = 100, 200, 300, 400, 500, |S_i| = 100, 200, 300, 400, 500 (0 < i < n_s) and n_s = 100. On this basis, we define the average F₁ value F_a and the average classification consuming time T_ca as follows: $F_{a} = \frac{1}{25 n_{t}} \sum_{| A_{0} |} \sum_{| S_{i} |} \sum_{k = 0}^{n_{t} - 1} F_{1} (| A_{0} |, | S_{i} |, T_{k})$ (20) $T_{ca} = \frac{1}{n_{t} \times n_{ts}} \sum_{i = 0}^{n_{t} - 1} \sum_{j = 0}^{n_{ts} - 1} T_{Cla} (t_{ij})$ (21) where F₁ (|A₀|, |S_i|, T_k) denotes the F₁ value obtained by the proposed method with respect to test set T_k (0 ≤ k < n_t) when the sample number of the initial training set is |A₀| and the sample number of the incremental set is |S_i|. T_Cla (t_ij) denotes the classification consuming time of the jth sample in test set T_i (0 ≤ i < n_t).

From Section 3.2 we know that the classification consuming time is independent of the number of content keywords in the term frequency based interest sets, and depends on the value of Δ. To obtain the optimal Δ, statistical experiments are carried out with Δ (denoted as delt in Fig. 3) ranging from 0 to 100, and the corresponding F_a and T_ca values of TR07 and ES are given in Fig. 3(a) and (b), respectively. Fig. 3 shows that the F_a values increase rapidly when Δ ranges from 0 to 30 on TR07 and from 0 to 40 on ES; as Δ increases further, the F_a values remain stable. Hence, to reduce the consuming time of sample classification as much as possible, we set Δ = 30 and Δ = 40 when TR07 and ES are used, respectively.

4.2 Consuming time comparison

Suppose |A₀| = 200; n_s is set to 100, 200, 300, 400 and 500, respectively; |S_i| is set to 50 and 100, respectively. Figure 4 gives the average classification consuming time T_ca on ES and TR07 when different methods are used. Because the computational complexity of the KNN classification process depends on the sample number of the training set, the T_ca values of MS+KNN and ES+KNN increase significantly as n_s ranges from 100 to 500. Meanwhile, MS+SVM and ES+SVM perform better than the KNN based methods, because their computational complexities depend on the number of support vectors rather than the number of all samples in the training set. The T_ca values of MS+NB and ES+NB are much lower than those of the SVM based methods, because their computational complexities depend only on the vector dimension of the feature space. Further, the proposed method clearly outperforms MS+NB and ES+NB when n_s is greater than 300, mainly because it does not classify every email with the NB classifier directly; combining the term frequency based user interest sets effectively reduces the average classification consuming time. Moreover, the lowest T_ca values in both Fig. 4(a) and (b) are obtained by the proposed method when n_s equals 500, with the largest improvements both about 0.05 ms over the corresponding T_ca values of the MS(ES)+NB methods.

4.3 Accuracy comparison of different methods

Suppose |A₀|, |S_i| and n_s are all set to 100, 200, 300, 400 and 500, respectively. Figure 5 gives the average F_a values (denoted as F_aa) of the proposed method when TR07 and ES are used, respectively. The proposed method performs better than MS+KNN and ES+KNN, because the KNN classifier only utilizes the nearest neighbors of the unlabeled samples in the active learning and classification processes. In Fig. 5(a), the F_aa values of the proposed method are lower than those of ES+SVM when n_s ranges from 200 to 400, possibly because the good generalization of SVM suits the TR07 corpus. However, the proposed method still obtains the highest F_aa value, 0.972, when n_s equals 500. In Fig. 5(b), the proposed method performs clearly better than the other methods when n_s ranges from 200 to 400, illustrating the high accuracy of the proposed method on spam classification.

The high accuracy of the proposed method can be attributed to the following reasons: (1) the improved active learning method measures the sample classification certainty by calculating the boundary densities of different categories, which are ignored in MS and ES; (2) the improved active learning method is not based on probability theory, so it does not rely on the feature independence assumption, which is regarded as over-simplistic in MS and ES; (3) the impact of the samples in which the user is not interested on the classification results of the samples in which the user is interested is reduced by the user interest based sample labeling model.

4.4 Comparisons of sample labeling burden

Because the active learning process brings extra labeling burden to the users, the number of samples which are recommended to the users should not be too large. To test the validity of the proposed method, different methods which combine MS, ES and LS are used for comparison. For ease of calculation, |A₀| is set to 300, |S_i| is set to 300, n_s is set to 500, n_t is set to 10 and n_ts is set to 200. With respect to corpora TR07 and ES, the maximum F₁ values, which are denoted as F_M, are shown in Table 3. The table shows that the highest F_M value, 0.983 (denoted in bold), is obtained by the proposed method on the TR07 corpus, and that the minimum F_M values of all methods are 0.961 and 0.964 when the ES and TR07 corpora are used, respectively. On this basis, the total numbers of samples which are recommended to the user for labeling until the corresponding F₁ values are no less than 0.96 are calculated and denoted as n_F in Table 3. The n_F values of the proposed method are generally much lower than those of the other methods, with the lowest n_F value, 563, on the ES corpus. The lower sample labeling burden of the proposed method may be explained as follows: (1) the active learning method of the proposed method makes the samples recommended to the users more representative; (2) the recommended samples in which the user is not interested are not labeled, owing to the user interest based sample labeling model; (3) the samples which are wrongly classified with great possibilities are also added into the training set for retraining, thus reducing the sample labeling burden while ensuring the accuracy of the proposed method.

5 Conclusions

The aim of the proposed method is to improve the speed of online spam classification and reduce the sample labeling burden without sacrificing the spam classification accuracy. The original contributions of the proposed method are as follows: (1) the concepts of term frequency based positive and negative interest sets are proposed and combined with the Naïve Bayes classifier to reduce the time consumed by email classification; (2) the sample boundary density is considered, and a boundary density based evaluating function is proposed to measure the classification certainties of the unlabeled samples in the active learning process; (3) the user interest based sample labeling model is proposed to reduce the impact of the samples in which the user is not interested on the classification results of the samples in which the user is interested. To evaluate the performance of the proposed method, it is compared with six commonly used spam classification methods. The classification processes are carried out by Support Vector Machine, Naïve Bayesian and K-Nearest Neighbor classifiers on the TREC2007 and Enron-spam corpuses. Experimental results show that the email training and classification speeds of the proposed method are much faster than those of the other methods. Meanwhile, with a small labeling burden, the proposed method generally performs better than the other methods on the two corpuses in terms of the F₁ measure. In the immediate future, we will continue the study by classifying emails with images and links, which are important characteristics of spam emails.

Acknowledgments

Project supported by the National Natural Science Foundation of China (No. 60973040, No. 60903098), the National Natural Science Foundation for Young Scientists of China (No. 61300148), and the Key Scientific and Technology Projects of Jilin Province (No. 20130206051GX).
