Abstract
Email is one of the most popular means of communication. Nevertheless, it is also a potential tool to deceive users and flood them with unwanted publicity, which reduces productivity. To alleviate this problem, a common solution has been to build machine learning models based on the content of emails to automatically separate them into spam and ham. In this work, a study of a set of machine learning models and content-based features for the problem of cross-dataset email classification is presented. This problem consists of training and testing the models using different datasets, considering that the datasets were collected under different, independent setups. The purpose is to simulate future variable or unpredictable conditions in the email content distributions, as could happen in a real setting, where models are trained using emails from a certain period of time, group of users or accounts, but tested with emails from other users or accounts. Experiments were conducted with the models and features using different datasets and two setups, same-dataset and cross-dataset, to show the complexity of the latter. The performance was evaluated using the Area Under the ROC Curve, a common metric in email classification. The results show that cross-dataset classification is considerably harder than same-dataset classification, and that discriminative classifiers with word or word2vec features tend to perform best on average.
Introduction
Electronic mail (shortened as email) is nowadays one of the most popular and efficient ways to communicate among persons and organizations, mainly due to its easy access, low cost, and speed. As an example, since 2014 approximately 200 million email accounts have been created every year, according to a survey published in a popular German online portal for statistics. Nevertheless, despite its advantages and popularity, this service faces several problems; the main one is that not all the emails that a user receives are important or reliable. To tackle this problem, email classification has been widely studied throughout the years to help automatically filter out emails that the users do not want to read or that are untruthful.
The most common task in email classification consists of separating emails into two classes, spam and ham. The first class refers to emails that users do not expect or want to receive and that have no importance for them. The content of these emails is generally commercial publicity, but it can also be more dangerous when it tries to steal users’ financial identities (phishing email). The second class refers to legitimate emails that users want to receive, some of them of high importance because they could contain urgent or sensitive information. The present work deals with this type of email classification.
One of the most popular approaches among the existing techniques for email classification [7, 17] is to build machine learning models based on the text content of emails [8]. Following this line, over the years researchers have built models using different features such as words [19, 34], n-grams [23], character n-grams [19], stylistics [20] and deep learning features [18]. They have also implemented a variety of methods for feature selection [15, 32] and feature extraction [15]. Finally, several machine learning models have been proposed for the task [31, 33], such as Support Vector Machines (SVM) [9], Naïve Bayes (NB) [22, 27], K-Nearest Neighbors (KNN) [11], Logistic Regression (LR) [5], PCA reconstruction [14], Decision Trees (DT) [29], Neural Networks [6] and classification ensembles [28].
Despite the great efforts made by academic and industrial researchers, the filtering of unwanted email is still an open problem because spammers and phishers are continuously changing their techniques to deceive automated filters, which results in an evolution of the content of the emails. This situation leads to the so-called dataset shift problem [26]. This problem comes from the fact that models are trained using data collected during a certain period of time, and from certain groups of users or accounts. The population of users or accounts from which emails are collected could be selected based on practical or experimental decisions, such as geographic locations, topics of interest and/or email servers. When such collected data is used to build machine learning based email filters, the models are biased towards the specific content distribution of the data, and when applied in a real setting, they have trouble correctly detecting emails. Very few works consider the dataset shift problem in email classification, such as [4], where the authors propose a model based on Markov processes that could be updated with new emails; and [15], in which the authors used a set of features to train models with emails from the past, and tested the models with emails from the future. Nevertheless, the vast majority of works in the literature use the same dataset to train and test models (even if they use a diversity of datasets), reaching very high values in performance measures, but without reflecting the performance in more realistic environments.
In this work, a study of a variety of machine learning models and content-based features for the problem of email classification in a cross-dataset scenario was conducted. In this scenario, the models were trained using emails from one dataset and tested using emails from another dataset, considering that the datasets were collected under different, independent setups. The purpose is to simulate future variable or unpredictable conditions in the email content distributions, as could happen in a real setting. Experiments were carried out with five machine learning models encompassing four approaches: discriminative (SVM and LR), probabilistic (NB), instance-based (KNN) and decision trees (Random Forest, RF). Additionally, four feature sets were used, divided into two approaches: superficial features (words, links and emoticons/emojis) and deep learning features (word2vec). Finally, five email datasets that are commonly used in the literature were employed: TREC 2007, GenSpam, SpamAssassin, Enron, and Ling Spam. The main contribution is to gain understanding of the effect of different machine learning approaches and feature sets on the dataset shift problem in email classification. The final results present interesting insights.
There are three research questions for the present work, all concerning cross-dataset email classification: Which machine learning model or approach performs the best? Which feature set is better to discriminate between spam and ham? Is there an association between the amount of shared content among datasets and the classification performance?
The remainder of this paper is organized as follows: Section 2 presents a review of previous works on email classification. Section 3 describes the materials and methods used in this paper. Section 4 contains the experimental results. Finally, Section 5 concludes this paper and presents some insights for future work.
Related works
Since email was developed, and thanks to its huge and growing popularity through the years, there have been many works on it and its problems. Specifically on email classification, there are some works that summarize methods and techniques used in the literature [2, 33]. For example, Blanzieri and Bryl [3] conducted a survey of learning-based techniques of email classification using nine different datasets to review the most common methods and features (non-content, language based, bag of words and image-based) used in spam filtering. Further studies have been carried out by comparing not only models and datasets, but also parameters, architectures and even simulation tools and environments [8]. In [30], the authors reviewed popular email mining tasks, techniques and tools and identified five major categories, namely spam detection, email categorization, contact analysis, email network property and email visualization.
Regarding feature sets, several types have been explored, including words [19, 34], n-grams [23], character n-grams [19], stylistics [20] and deep learning features [18]. Additionally, feature selection (FS) and feature extraction (FE) have also been used. In [32], the authors implement a meta-heuristic model based on harmony search to select a set of features considering document frequency and discriminative power. For FE, extended versions of several techniques and models have been developed; for example, based on PCA (principal component analysis) there are PCAII [15] and PCADR (PCA document reconstruction) [14], whilst based on LDA (linear discriminant analysis) there are BDA (biased discriminant analysis) and ANMM (average neighborhood margin maximization) [13].
Although using solely content-based features has been more explored by researchers, other interesting options have been used in combination with content-based features for improving the classification, such as using online and offline features [12], behavioural email traffic features [21], a combination of content, image and salting features [1], and web-based approaches [16].
Finally, a diversity of machine learning models has been employed for the task [31, 33], such as SVM [22, 27–29], KNN [11], LR [5], PCA reconstruction [14], Decision Trees [29], Neural Networks [6], compression models [4] and classification ensembles [28].
In most of the works mentioned above, the experimentation has been done either by splitting a dataset into training and test parts, or by applying cross-validation over it. As a result, the same distribution of content is used for training and testing, which normally does not occur in a real application, where the models have to be tested in more complex settings.
Materials and methods
Datasets description
For the experiments in the present paper, five popular datasets that are common in the literature were used: SpamAssassin (SA), Enron (EN), TREC 2007 (TR), GenSpam (GS) and Ling Spam (LS). These datasets are free and publicly available.
The SA dataset is originally split into five files: easy_ham, easy_ham_2, hard_ham, spam and spam_2. The first two files contain ham emails that are easy to differentiate from spam and have no spammish signatures. The third file contains ham emails that are closer to typical spam. Finally, the fourth and fifth files contain spam emails received from non-spam-trap sources. All the emails were merged into two files, for ham and spam respectively. The GS dataset is originally split into six directories, train_GEN, train_SPAM, adapt_GEN, adapt_SPAM, test_GEN, and test_SPAM, which contain ham (GEN) and spam for each phase: training, validation, and testing, respectively. When conducting the same-dataset experiments (see Subsection 3.3), the original split of the data was used; when doing cross-dataset experiments, all the emails from the directories were merged into two files, for ham and spam respectively. The EN dataset is originally divided into six folders, each one containing spam and ham emails. All the emails were joined into two files, for ham and spam respectively. The LS dataset originally contains different versions of the data; in this work the bare version was used, which has not been processed or encoded. This version is split into ten parts, with ham and spam in each one. When conducting the same-dataset experiments, the original split of the data was used; when doing cross-dataset experiments, the emails from the ten parts were merged into two files, for ham and spam respectively. Finally, the full version of the TR dataset was used, where each email is labeled either as ham or spam.
Table 1 shows the distribution of emails for each class. In most of the datasets, there is a predominance of spam email (since it is easier to collect), except in SA, where the proportions are reversed, and EN, where the dataset is almost balanced.
Datasets content breakdown
Table 2 contains some statistics for the three superficial features used in the experiments: words, links and emoticons/emojis (see Subsection 3.2). Columns two to four indicate the vocabulary size for each feature (the number of unique words, links and emoticons/emojis) per dataset. The fifth column represents the proportion of unique words per email (the size of the word vocabulary divided by the total number of emails). The larger this number, the more diverse the content of the dataset. Columns six to eight are the averages of each feature per email in the dataset. In this table, it can be observed that TR has the largest word and link vocabularies, which means that its content is more diverse than that of the rest of the datasets. This is complemented by the fact that its emails are the longest on average, and they contain more emoticons than emails in other datasets. The GS dataset, despite being the second largest one, has the shortest emails on average, producing a moderate word vocabulary. Emails in this dataset do not contain links and have the fewest emoticons. The contrary occurs with SA: despite being a small dataset, its word, link and emoticon vocabularies are larger. Emails in this dataset contain the largest average number of links, as well as the second largest number of emoticons. The EN dataset is the third largest one, with a moderate word vocabulary, and some links and emoticons on average per email. Finally, LS is the smallest dataset, but its emails are the second longest, producing moderate word and link vocabularies. Also, its emails contain some links and some emoticons on average. In general, emails in the datasets contain few emoticons and very few links, with small vocabularies for these features, especially for emoticons/emojis, which is expected since the available set of symbols for this feature is limited. On the other hand, the number of words varies widely, resulting in word vocabularies of different sizes.
Vocabulary sizes and statistics for superficial features
Before using the datasets for building and testing machine learning models, several processing techniques were applied to transform them and extract the features. First, the text from the body and subject line was taken, excluding tags such as re, fwd and fw and, for datasets in XML format, the tags marking the beginning and end of an element. Secondly, regular expressions were used to extract the three superficial features: words, links and emoticons/emojis. Words represent the majority of the content in an email since they are used to express the purpose of the message. Links are sometimes present in spam email to redirect the user to pages with publicity or malicious content. Emoticons are sometimes used to capture the attention of the user with a friendly style. After extracting the superficial features, very short words (length <3), very long words (length >35), and the English stopwords from the Python NLTK library were removed. Finally, for each superficial feature, each email was transformed into a document vector using the term frequency-inverse document frequency (tf-idf) method. Tf-idf is defined by Eq. (1), where tf(t, d) is the frequency of feature t in document d and idf(t) is given by Eq. (2). In turn, df(d, t) is the number of emails that contain feature t, whilst n_d is the total number of emails in the training set.
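The extraction step described above can be illustrated with a minimal Python sketch. The regular expressions and the short stopword set below are assumptions made only for illustration: the paper does not list its actual patterns, and it uses the full NLTK English stopword list. The removal of subject-line tags and XML tags is not shown.

```python
import re

# Hypothetical patterns for illustration; the paper does not list its regular expressions.
LINK_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMOTICON_RE = re.compile(r"[:;=8][\-o*']?[)\](\[dDpP/\\|]")
WORD_RE = re.compile(r"[A-Za-z]+")

# Tiny stand-in for the NLTK English stopword list mentioned in the text.
STOPWORDS = {"the", "and", "for", "you", "this", "that", "with"}


def extract_features(text):
    """Return the words, links and emoticons found in an email text."""
    links = LINK_RE.findall(text)
    # Remove links first so that "://" is not mistaken for an emoticon
    # and URL fragments are not counted as words.
    no_links = LINK_RE.sub(" ", text)
    emoticons = EMOTICON_RE.findall(no_links)
    words = [w.lower() for w in WORD_RE.findall(no_links)]
    # Drop very short (<3) and very long (>35) words, and stopwords.
    words = [w for w in words if 3 <= len(w) <= 35 and w not in STOPWORDS]
    return {"words": words, "links": links, "emoticons": emoticons}


features = extract_features("Win a FREE prize now :-) visit http://example.com/win today")
print(features)
```

Each of the three lists can then be vectorized independently, which matches the paper's use of one feature set at a time.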
Additionally, a deep learning representation was used, employing the Word2Vec (W2V) [24] technique to express each word as a vector. In this case, the filtered words of each email were passed one by one through the Google pre-trained model, obtaining as output a word embedding of 300 dimensions per word. Afterwards, the average of all the word embeddings from an email was computed as its final document vector.
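As a minimal sketch of the averaging step, the following Python function computes a document vector from a lookup table of word embeddings. The toy three-dimensional embeddings are invented for illustration (the pre-trained model produces 300 dimensions), and skipping words that are missing from the embedding table is an assumption, since the paper does not state how such words were handled.

```python
def average_embedding(words, embeddings, dim):
    """Average the embeddings of the words of one email.

    embeddings: dict mapping a word to a list of `dim` floats.
    Words without an embedding are skipped (an assumption of this sketch).
    """
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        # No known word: fall back to the zero vector.
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy 3-dimensional embeddings, invented for this example.
toy_embeddings = {"free": [1.0, 0.0, 2.0], "prize": [3.0, 2.0, 0.0]}
print(average_embedding(["free", "prize", "unknownword"], toy_embeddings, 3))
```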
In both cases, superficial and deep learning features, each email document vector was normalized to unit length using the Euclidean norm.
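The tf-idf weighting followed by unit-length normalization can be sketched as below. The exact form of the idf term is an assumption: the sketch uses the smoothed variant idf(t) = ln((1 + n_d) / (1 + df(t))) + 1, which is the default in Scikit-learn (the library the paper reports using), but the paper's Eq. (2) may differ.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Compute idf values from the training emails only (each email is a list of features)."""
    n_d = len(train_docs)
    df = Counter()
    for doc in train_docs:
        df.update(set(doc))  # count each feature once per email
    # Assumed smoothed form: idf(t) = ln((1 + n_d) / (1 + df(t))) + 1
    return {t: math.log((1 + n_d) / (1 + df_t)) + 1 for t, df_t in df.items()}


def tfidf_vector(doc, idf):
    """Return a sparse tf-idf vector (dict) normalized to unit Euclidean length."""
    tf = Counter(t for t in doc if t in idf)  # features unseen in training are ignored
    weights = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return {}
    return {t: w / norm for t, w in weights.items()}


train = [["free", "prize"], ["meeting", "free"]]
idf = fit_idf(train)
print(tfidf_vector(["free", "prize", "prize", "unseen"], idf))
```

Features that occur in many training emails, such as "free" in this toy example, receive a lower idf than rarer ones, so they contribute less to the normalized vector.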
Using the extracted superficial and deep learning features, different models based on five popular machine learning methods were built, trying to explore different approaches. The methods include two discriminative classifiers (SVM and LR), one probabilistic (NB), one instance-based (KNN), and one based on decision trees (RF).
Two types of experiments for building and testing the models were conducted: one where the same dataset was used for training and testing, and another where one dataset was used for building a model and the rest of the datasets were used as test sets.
In the same-dataset setup, a 10-fold cross-validation was performed with the TR, SA, EN and LS datasets. For the first three datasets, the cross-validation folds were randomly stratified, whilst for LS the original data split was used. With GS, the original data split into training, validation and test was utilized. During the 10-fold cross-validation, for every iteration, an independent vocabulary was extracted from the training part (composed of nine folds), the idf of each feature in it was computed, and this vocabulary and these idfs were then used to vectorize both the training part and the remaining test fold. The vectorization was done either using tf-idf for the superficial features, or word embeddings with W2V, using the vocabulary extracted from the training part.
In the cross-dataset setup, as mentioned before, first, all the emails from each dataset were merged in two files, containing ham and spam. Secondly, the vocabulary from each training dataset was extracted. Finally, such vocabulary was used to transform all the test datasets, either using tf-idf for the superficial features, or word embeddings with W2V.
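The cross-dataset protocol can be outlined with a short Python sketch. The `fit` and `score` callables are hypothetical placeholders: in a real experiment, `fit` would extract the vocabulary from the training dataset and train a classifier, and `score` would vectorize the test dataset with that same vocabulary and return the AUC.

```python
def cross_dataset_scores(datasets, fit, score):
    """Train on each dataset and test on every other dataset.

    datasets: dict mapping a dataset name to its data.
    fit(train_data) -> model              (hypothetical callable)
    score(model, test_data) -> float      (hypothetical callable)
    Returns a dict keyed by (train_name, test_name).
    """
    results = {}
    for train_name, train_data in datasets.items():
        # Everything learned here (vocabulary, idf, classifier) uses training data only.
        model = fit(train_data)
        for test_name, test_data in datasets.items():
            if test_name != train_name:
                results[(train_name, test_name)] = score(model, test_data)
    return results


# Stub callables, only to show the shape of the output.
toy_datasets = {"A": [1, 2], "B": [3], "C": [4, 5, 6]}
print(cross_dataset_scores(toy_datasets, fit=len, score=lambda m, d: m + len(d)))
```

With five datasets this produces 20 train/test pairs per classifier and feature, corresponding to the off-diagonal cells of a 5×5 block.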
Additionally, the SVM, LR, KNN and RF classifiers have hyper-parameters that affect their behaviour. The performance of the individual models was optimized by conducting a 3-fold cross-validation on each training part for each candidate value of the hyper-parameter. The training part could be the one composed of nine folds, in the same-dataset setup, or a whole dataset, in the cross-dataset setup. In the case of the GS dataset, the original split into training and validation was used to conduct the optimization. Table 3 shows the hyper-parameter (HP) optimized for each method and the values considered for it. In all cases, the value that maximized the AUC metric was selected.
Hyper-parameter to optimize in each method
The email classification problem naturally deals with unbalanced distributions of data, where one class is more predominant than the other (as shown in Table 1). In that case, metrics such as accuracy are not recommended to evaluate the classification performance of a model, because they tend to be biased towards the dominating class. In our case, the performance of the different models was evaluated using the area under the ROC (Receiver Operating Characteristic) curve, or AUC, which is a popular metric for email classification [4]. ROC is a probability curve that plots the true positive rate against the false positive rate at various thresholds. AUC assesses the degree of separability, by measuring the probability that a model will rank a randomly chosen positive instance higher than a randomly chosen negative one [10]. AUC takes values between 0 and 1; a random classifier would produce the diagonal of the ROC curve, obtaining an AUC of 0.5, which represents the baseline performance.
The processing and classification methods were implemented in Python, using the libraries NLTK, numpy and Scikit-learn. The experiments were performed on a Linux workstation with a 2.1 GHz Xeon Silver processor and 128 GB of RAM. The datasets and code used for this work are publicly available (see Footnotes).
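The probabilistic interpretation of AUC given above can be made concrete with a small Python sketch that compares every positive example with every negative one. It is a direct, quadratic-time illustration of the definition, not the implementation used in the experiments; counting ties as one half is the usual convention and is assumed here.

```python
def auc(labels, scores):
    """AUC as the probability that a positive is ranked above a negative.

    labels: 1 for the positive class (e.g. spam), 0 for the negative class.
    scores: classifier scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


print(auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))  # 3 of the 4 positive/negative pairs are ordered correctly
```

Because only the ranking of the scores matters, the value does not depend on the class proportions, which is why the metric suits unbalanced datasets.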
Tables 4 to 7 present the results of the experiments using the described features: words, links, emoticons/emojis, and W2V. The tables are split into blocks of 5×5, and each block contains the AUC values of a specific classification model for the corresponding feature. Each row in a block indicates the dataset that was used for training, and the columns indicate the datasets used for testing. The AUC values in italics in the diagonal of a block correspond to the same-dataset results. The values in bold indicate the best values in a row. The last column in a table shows the average and standard deviation of all the values in a row, corresponding to a single training dataset over all the classification models. All these averages are computed without considering the values of the diagonal, meaning they represent the cross-dataset performance. The penultimate row in a table indicates the average and standard deviation of the values in a block’s diagonal, corresponding to the performance of a single classifier in the same-dataset setup (SD AVG). Similarly, the last row in a table shows the average and standard deviation of all the non-diagonal values in a block, corresponding to the performance of the classifier in the cross-dataset setup (CD AVG). The values in the lower right corner aggregate all the averages for the datasets, indicating the general performance for a specific feature.
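The two summary rows can be illustrated with a short Python sketch that separates the diagonal (same-dataset) from the off-diagonal (cross-dataset) cells of one block. The 2×2 block below contains made-up values, not results from the paper, and the use of the population standard deviation is an assumption, since the paper does not state which variant was reported.

```python
from statistics import mean, pstdev


def summarize_block(block):
    """Summarize a square block of AUC values (rows: training set, columns: test set)."""
    n = len(block)
    diagonal = [block[i][i] for i in range(n)]
    off_diagonal = [block[i][j] for i in range(n) for j in range(n) if i != j]
    return {
        # Same-dataset average and standard deviation (diagonal cells).
        "SD AVG": (mean(diagonal), pstdev(diagonal)),
        # Cross-dataset average and standard deviation (all other cells).
        "CD AVG": (mean(off_diagonal), pstdev(off_diagonal)),
    }


# Made-up values for illustration only.
toy_block = [[1.0, 0.5],
             [0.7, 0.9]]
print(summarize_block(toy_block))
```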
Results for all machine learning models using words
Results for all machine learning models using links
Results for all machine learning models using emoticons/emojis
Results for all machine learning models using W2V
Table 4 shows that, for the same-dataset experiments, most of the classifiers perform well when using words. There are a few exceptions, such as when classifying the LS dataset with NB, or the GS dataset with KNN and NB. On the other hand, in the same-dataset experiments, the datasets that produce the highest AUC values are TR and EN with all the classifiers, meaning they are easier to classify with word features. In this setup, the discriminative classifiers SVM and LR present the best average performance over all the datasets. For the cross-dataset experiments, when using words, training models with some datasets yields better results than others when testing with the remaining datasets. Specifically, using the GS dataset for training produces the best average performance. Furthermore, training with GS or SA and testing with LS produces high AUC values with some classifiers. LS is the smallest dataset, thus its word features must be well represented inside the distributions of GS and SA. The second best performance is obtained using either the SA or the EN datasets for training, and the lowest values are obtained when training either with the TR or LS datasets. Despite the large vocabulary present in TR, its word features are very specific to that dataset, whilst the small vocabulary of LS does not represent well the distributions of other datasets. Regarding the classification model, SVM, LR, and RF present similar average performances.
In Table 5, it is observed that in general, when using links, the models perform poorly. The ‘-’ symbol indicates that the GS dataset does not contain links. In the same-dataset setup, the datasets that produce better results are TR and SA. The SA dataset contains the most links per email, whereas emails in the TR dataset only have a few, which would mean that those few links include valuable information. As with words, when using links, the discriminative classifiers SVM and LR present the best average performance over all the datasets. The general average of using links in the same-dataset setup is much lower than when using words. In the cross-dataset results, in most cases the performance is similar to that of a random classifier. The exceptions are when training with the TR or EN datasets and testing with the SA dataset, which indicates that these datasets share some link information. Regarding the classification model, SVM and LR present the best average performances.
Table 6 shows that in general, when using emoticons/emojis, the models perform poorly. In the same-dataset setup, the datasets that produce better results are TR, SA, and EN. The first two are the ones with more emoticons/emojis in their emails, indicating a higher chance of such features being discriminative between classes. For the same-dataset setup, RF presents the best average performance among all the classifiers. The general average AUC of using emoticons/emojis is much lower than when using words but similar to that of links. In the cross-dataset experiments, the performance varies depending on the datasets used for training and testing. Some of the best results are obtained when training with EN and testing with LS, with all the classifiers. When the datasets are switched, the performance drops. This would mean that the emoticons/emojis from LS are well represented in the distribution of the EN dataset, but not the other way around. On average, using TR as the training dataset produces the best AUC value. Since emails in this dataset contain the most emoticons, there is a higher chance that the distributions of emoticons/emojis from other datasets are captured inside the distribution of TR. The lowest performance is obtained when using LS for training, since it is the smallest dataset and is likely to have a poor distribution of emoticons/emojis between its classes. The best classification models in the cross-dataset setup are SVM, LR, and KNN, whose performances are slightly higher than that of a random classifier.
In Table 7, the results of the experiments with W2V features are presented. The results are good and similar to the ones obtained with words, but with some differences. The most notable is that the NB classifier performs poorly in almost all cases. In the same-dataset setup, the LS dataset obtains the highest AUC values with all the classifiers, meaning it is easy to classify with W2V features. In this setup, LR produces the best average performance among all the classifiers. In the cross-dataset setup, training with the datasets TR, GS, SA and EN, and testing with the LS dataset produces the best results. In some cases, such as training with GS, this combination yields values only 4% below those of the same-dataset experiment with LS. Values with other combinations of datasets vary widely. The best average values are obtained when using the GS dataset for training, and the lowest when using the TR dataset. As with words, the W2V features from TR are very specific to that dataset. Regarding the classification model, SVM, LR and RF present the best average performances.
Table 8 shows the aggregated average performance per test dataset along with the different features in the cross-dataset setup. In this table, it is observed that LS is the easiest dataset to classify when training with any other dataset. The hardest datasets to classify are TR and EN. LS is the smallest dataset, and most of its features are well captured by the feature distributions of other datasets. The contrary occurs with TR, which is the largest dataset and it is more likely that part of its content is not present in other datasets.
Average performance per testing dataset per feature
Tables 9 and 10 show aggregated averages of the cross-dataset performance for each model and for each feature, respectively. Here, it can be observed that in general, discriminative classifiers SVM and LR perform the best on average, but RF performs closely. Regarding features, words produce the best results on average, but W2V features are just behind.
Summary of classification results per model
Summary of classification results per feature
As a complement to Table 9, Table 11 presents the average training and test times per model and per feature. The slowest model on average is RF, especially with words. The fastest one is NB, with any feature. Thus, even if RF has a general performance similar to that of SVM and LR, the latter models are on average faster to train and test. Regarding features, words are the slowest for training and testing, whilst links and emoticons/emojis are the fastest. W2V features also have fast training and test times, since in that case there are only 300 features to represent a document. These features could be preferred over words because of their similar performance and faster times.
Averages of training and testing times in minutes
To test whether there is an association between the amount of shared content and the classification performance, first, the degree of similarity between the feature distributions of two datasets was calculated using the Jaccard index. This index is defined by Eq. (3), where A is the feature vocabulary of one dataset and B the feature vocabulary of another. The Jaccard index takes values from 0 to 1, being 0 if the datasets do not share any feature, and 1 if they share all the features. Table 12 contains the Jaccard indices for each pair of datasets using words and emoticons/emojis. It is important to mention that links were not considered, since for that feature most of the classification results are similar, no matter the datasets used for training and testing.
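For reference, the standard definition of the Jaccard index is the size of the intersection of the two vocabularies divided by the size of their union, J(A, B) = |A ∩ B| / |A ∪ B|. A minimal Python sketch with invented toy vocabularies follows; returning 0 for two empty vocabularies is a convention chosen for this sketch.

```python
def jaccard_index(vocab_a, vocab_b):
    """Jaccard index between two feature vocabularies: |A ∩ B| / |A ∪ B|."""
    a, b = set(vocab_a), set(vocab_b)
    union = a | b
    if not union:
        # Two empty vocabularies share nothing; 0.0 is a convention of this sketch.
        return 0.0
    return len(a & b) / len(union)


# Toy vocabularies, invented for this example.
vocab_one = {"free", "prize", "meeting"}
vocab_two = {"free", "meeting", "agenda", "budget"}
print(jaccard_index(vocab_one, vocab_two))  # 2 shared features out of 5 distinct ones
```

The index is symmetric, so each pair of datasets needs to be computed only once per feature.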
Jaccard indices for words and emoticons/emojis
Afterwards, the Pearson correlation coefficient was calculated between the Jaccard indices of a training dataset with its test datasets and the classification results obtained with that training dataset and those test datasets. The correlation was computed per feature. For that, the same Jaccard indices were repeated for each classifier, and the results of the different classifiers were concatenated into a single array. In Table 13, the correlations for each dataset used for training are shown. From that table, it is observed that there is indeed, in general, a correlation between the amount of content that a training dataset shares with the test datasets and the classification performance. That could mean that the more features two datasets share, the higher the expected classification results in a cross-dataset experiment. The correlation is stronger with emoticons/emojis for the datasets TR and SA, which are the ones that contain more emoticons/emojis in their emails. The negative correlation seen for the GS dataset could be due to the very few emoticons/emojis this dataset contains. On the other hand, the low correlation observed with words is due to GS producing better results when classifying a dataset with a lower Jaccard index (SA) than another with a higher index (EN), although it should be considered that SA is a much smaller dataset and is easier to classify.
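The procedure above can be sketched in Python as follows, for one training dataset and one feature. The Jaccard indices and AUC values in the usage example are invented for illustration and are not taken from the paper's tables; the function names are also hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_for_training_set(jaccard_to_tests, auc_per_classifier):
    """Repeat the Jaccard indices once per classifier and correlate with all AUCs.

    jaccard_to_tests: Jaccard index of the training set with each test set.
    auc_per_classifier: one list of AUCs per classifier, in the same test-set order.
    """
    repeated_jaccard, all_aucs = [], []
    for classifier_aucs in auc_per_classifier:
        repeated_jaccard.extend(jaccard_to_tests)
        all_aucs.extend(classifier_aucs)
    return pearson(repeated_jaccard, all_aucs)


# Invented values for illustration only: three test sets, two classifiers.
print(correlation_for_training_set([0.1, 0.2, 0.3], [[0.5, 0.6, 0.7], [0.6, 0.7, 0.8]]))
```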
Correlation between Jaccard indices and classification results when using a specific dataset for training
In this work, an analysis of different machine learning models and content-based features for email classification in a cross-dataset setting has been presented. Emails were classified into two classes, spam and ham. Models and features were tested with a set of five different datasets, conducting both same-dataset and cross-dataset experiments in order to analyze the complexity of the latter. In the cross-dataset scenario, models were trained using emails from one dataset and tested using emails from another dataset, considering that the datasets were collected under different, independent setups. The purpose is to simulate future variable or unpredictable conditions in the email content distributions, as could happen in a real setting. From the results obtained in the cross-dataset classification experiments, the following can be concluded:
Cross-dataset experiments produce lower results than the same-dataset experiments, with differences of more than 20%, indicating that the former problem is more complex to solve.
Discriminative classifiers SVM and LR tend to perform better on average than other machine learning approaches, no matter the features used for building the models. Random Forest shows a good performance in different experiments, but it tends to require more time for training and testing.
Words and W2V as features tend to produce the highest values in classification, averaging over all the classification methods. Nevertheless, W2V uses a shorter representation for emails, and has lower training and testing times.
Links are moderately present in emails in the datasets, and they do not represent a good feature to discriminate among classes. Most of the models using this feature reach a performance comparable to a random classifier.
Emoticons/emojis carry some discriminative information to separate the email classes. Nevertheless, given the very low presence of this feature in emails of the datasets, models using them have limited performance.
Small datasets, such as LS and SA, are relatively easy to classify when training a model with a bigger dataset, since the feature distribution of the bigger dataset should capture the content of the small dataset.
With words and emoticons/emojis as features, there is a general correlation between the amount of shared content between two datasets and the expected classification performance when using such datasets for training and testing.
Directions for further research include the use of other features, such as projections over discriminative spaces (e.g. Linear Discriminant Analysis or Biased Discriminant Analysis), and other types of deep learning features (such as Doc2Vec or Thought Vectors); and the use of feature selection methods, to choose features with more discriminative power between classes.
Footnotes
Code is available at https://bit.ly/2loKQsQ, and datasets at [
].
