Abstract
Spam email problem is a major shortcoming of email technology for computer security. In this research, a granular classifier model is proposed to discover hyper-boxes in the geometry of information granules for spam detection in three steps. In the first step, the k-means clustering algorithm is applied to find the seed_points to build the granular structure of the spam and non-spam patterns. Moreover, the key part of the spam and non-spam classifiers’ structure is captured by applying the interval analysis through the high homogeneity of the patterns. In the second step, PSO algorithm is hybridized with the k-means to optimize the formalized information granules’ performance. The size of the hyperboxes is expanded away from the center of the granules by PSO. There are some patterns that do not placed in any of the created clusters and known as noise points. In the third step, the membership function in fuzzy sets is applied to solve the noise points’ problem by allocating the noise points through the membership grades. The proposed model is evaluated based on the accuracy, misclassification and coverage criteria. Experimental results reveal that the performance of our proposed model is increased through applying Particle Swarm Optimization and fuzzy set.
Keywords
Introduction
Email is a powerful communication channel in a human life to improve group communications. This technology has great impact on human life, national development and business in a positive growth [1]. Due to popularity, low cost and fast delivery characteristics of email, this media is prone to misuse. Spammers abuse email’s characteristics to send unsolicited messages(spam) to fill users’ accounts and waste network bandwidth, memory and server capacity in ISPs [2]. Spam emails are commonly used to spread threats, including malicious software (e.g., internet worms, Trojan horses, computer viruses and spyware) [3] and phishing (i.e., attacks that seek to acquire sensitive information from end-users) [4]. Spam threats sent by spammers are including advertisements for a variety of services and products, such as electronics, pharmaceuticals, loans, jewelry, software, gambling, pornography, weight loss, and stocks [5]. Symantec’s “Message Labs Intelligence” report [6] accounted almost 92% of all Internet email traffic as spam. Spam emails are not only annoying for the email users, but additionally costs billions of dollars in productivity losses through a major computer security problem [7]. Nucleus research report on April 2007 shows that American companies alone lost more than $70 billion a year in 162 workers’ productivity caused by spam. Hence, spam detection considered to be a vital challenge for individuals and organizations [8].
Spam email problem is a major shortcoming of this technology for computer security. Hence, theoretical and applied research on spam filtering becomes fundamental to eliminate spam. In this regard, classification of email is an important and growing task that concluded from the spam email’s menace [1, 9]. Several spam filtering techniques have been introduced by using machine learning approaches including SVM [10], GA [11], AIS [1, 13], case based technique [14] and ANN [15] to combat spam messages. In this research, a granular classifier model is proposed to discover hyperboxes for spam detection in three steps. In the first step, the k-means clustering algorithm [16] is applied to build the centers of clusters that called seed_points [17] of the hyperboxes in spam and non-spam classes. Then, the key part of the spam and non-spam classifiers’ structure is captured by applying the interval analysis through the high homogeneity of the patterns. The classifiers are created according to the sets of the areas within the geometry of the feature space. In the second step, Particle Swarm Optimization (PSO) algorithm is hybridized with the k-means to enhance the performance of hyperboxes to cover more patterns in related classes within the seed_points optimization. The sizes of the hyperboxes are expanded away from the center of the granules by PSO algorithm. After the first and second steps, some points (emails) remain as noise points that not covered by the created and optimized hyperboxes and located outside the optimized hyperboxes. In the third step, the membership function in fuzzy sets is applied to solve the noise points’ problem by allocating the noise points through the membership grades. The proposed model is evaluated based on the accuracy, misclassification and coverage criteria [11, 18 35].
This paper is structured as follows. The related works for spam detection is presented in Section 2. The proposed model to discover hyperboxes in spam and non_spam dataset, and noise points in the geometry of the information granules is described in Section 3. In Section 4, we evaluate the proposed model for spam detection based on the coverage, misclassification and accuracy criteria. Finally, the conclusion of this paper is presented in Section 5.
Related works
In this section, a review of recent studies for spam detection is presented. Idris and Selamat [1] proposed model was hybridized the NSA [19] with PSO. The proposed model was inspired by AIS [20] model with the adaptive nature of NSA to generate random detectors. Then, PSO was applied to enhance the random generation phase of NSA. They applied their method on Spam-base dataset from UCI machine learning repository [21] to compare their NSA-PSO with NSA model. The experimental result showed that the accuracy of NSA–PSO is better that the NSA. NSA-PSO could be useful for complex problems through the NSA capability. Santos et al. [22] proposed a semantic-aware model by focusing on synonyms [23] among several linguistic phenomena in natural languages [24] for spam detection. They applied their model on the information of body and subject of the email messages to detect the internal semantics of spam messages. The proposed model removes words that add noise to the model and do not prepare any semantic information [25] by the enhanced Topic-based Vector Space Model (eTVSM). The proposed model applied on the Ling Spam dataset. The experimental results represent high accuracy and low true negative compare to the both academic approaches and real-word solutions. This model can deal with synonyms in the messages to improve current machine learning methods.
Laorden et al. [7] considered the spam detection problem through the anomaly detection systems inspiration. This approach determines the email is spam or non-spam by comparing word frequency features with a dataset. The dataset usually assumed as legitimate emails. If the significant deviation is found from what is considered typical, that email is recognized as a spam. They also applied the partitional clustering algorithm to improve the scalability of this approach by reducing the number of comparisons for new samples. They used the LingSpam Corpus1, SpamAssassin2 public corpus and TREC 2007 Public Corpus3 to validate their proposed method. False positive and true negative results for LingSpam datasets are almost 92% and 6%, respectively. For Spam Assassin datasets, 90% and 7% for Weighted Accuracy (WA) and true negative are obtained, respectively. The worst results were obtained for the TREC dataset with only a 74.82% WA and an improvable 9.19% true negative. In contrast to statistical approaches, this method uses the existing emails in the inbox folder and does not need updated data. Thus, it led to reducing the efforts to reduce the number of required labeledmessages.
Zhang et al. [8] proposed a hybrid method of the wrapper-based feature selection method and C4.5 decision tree algorithm to reduce the false positive error in email classification. Moreover, different weights were given to the false positive and false negative errors by cost matrix. In addition, out-of sample error was reduced with k-fold cross validation. Finally, they used binary PSO with mutation operator (MBPSO) for the search strategy of the wrapper. They applied their proposed method on 6000 emails. The results showed that the MBPSO is superior to Iterated Local Search (ILS) [8], Particle Swarm Optimization (PSO) [26], Restarted Simulated Annealing (RSA) [27], Binary Particle Swarm Optimization (BPSO) [28], Ant Colony Optimization (ACO) [29] and Genetic Algorithm (GA) [30] in terms of classification performance. In addition, the MBPSO performance is better than the sequential forward selection (SFS) and sequential backward selection (SBS) methods in terms of feature selection. Their proposed method demonstrated that the effectiveness of wrappers is more than the filters with regard to classification performance indexes. In addition, the false positive was reduced without compromising the sensitivity and accuracy values. DeBarr and Wechsler [31] proposed two alternative forms of random projections as Random Project and Random Boost for spam detection that can be used with a small number of patterns for training. Random Project employed a random projection matrix to produce linear combinations of input features. Moreover, Random Boost used random feature selection through the combination of Logit Boost and Random Forest to enhance the performance of the Logit Boost algorithm. Their proposed method applied on the TREC and CEAS spam benchmark sets. Experimental results showed that the Random Boost method improves the performance compare to the Logit Boost algorithm, and yields similar classification accuracy compared to the Random Forest method but using only one-fourth the run time complexity of the Random Forest algorithm. Moreover, the Random Boost algorithm reduced the training time compare to the Logit Boost.
In the current study, granular classifier is considered for spam detection. K-means clustering algorithm [16] is applied to achieve information granules for the clusters and noise points discovery. Moreover, the k-means were hybridized by PSO to expand the size of the hyperboxes in order to optimize the formalized information granules’ performance. Finally, fuzzy sets were used to consider the noise point problem. In this situation, the noise point belongingness is solved by applying the fuzzy membership grades to cover the universe in classification. In this study, the proposed approach was applied on Spam-base dataset from UCI machine learning repository [21] to evaluate and compare it with the other researchers’ model.
Proposed model
In this section, the proposed model is introduced for spam detection by creating the hyperboxes in the geometry of information granules. Each step of the proposed model is outlined as follows: Find the seed_points to build the granular structure of the spam and non-spam patterns by applying the k-means clustering algorithm in the training phases; Enhance the performance of the found seed_points by hybridizing the k-means and PSO; Apply the fuzzy sets to solve the noise points problem by calculating the membership grades of them.
There are some relevant definitions in [17] to construct the granular structure. Figures 1 and 2 present the proposed granular classifier model for spam detection. Each step of the proposed model is presented in details in the following subsections.
First step: The design of the spamand non_spam classifiers
In this section, the procedure of designing the spam and non_spam classifiers by k-means clustering algorithm is presented. Based on the density-based cluster and noise points definitions [17], the seed_points of the unsupervised hyperboxes are built to discover the information granules for creating the clusters and noise points in spam and non_spam dataset. In this step, interval analysis is applied to build the sets as the key part of the classifier’s structure on the high homogeneity of the patterns.
The significant point in the granular structure is that the density of patterns in the discovered clusters is more than the outside of that clusters based on the predefined parameters as ‘MinPoints’ and ‘EPS’ [17]. The process of designing the classifiers to establish the decision boundaries and the geometry of the feature space is illustrated on the following paragraph. The training phase begins by splitting the spam and non_spam datasets to generate granular classifiers. One of the capabilities of the granular computing is that each granule plays a role of classifier [17]. Therefore, the computing time is more than the traditional classifiers, specifically for loading large data. In addition, the program consumes a lot of memory and Java Heap Space Error is occurred during the running of the program. In order to solve this problem, the training datasets were split to 8 sections to discover the hyperboxes.
Figure 3 represents the algorithm of hyperbox generation. The hyperboxes is found according to the EPS-neighborhood [17] of each pattern in the training patterns. Therefore, all density-reachable [17] patterns from that pattern are returned to create a new hyperbox ‘H i ’ with respect to ‘EPS’ and ‘MinPoints’. In addition, the new cluster ‘C’ is created for the spam or non-spam class label for the new hyperbox ‘H i ’ in each class label. Otherwise, the new hyperbox is added to the existing cluster. Each hyperbox belongs to a single class. In the next step, the aforementioned process is also applied on all unvisited points in ‘H i ’ to create new hyperbox ‘Hi+1’ and add it to cluster ‘C’.
This process is applied iteratively to collect all directly density-reachable [17] points until there is not any point remains for adding to the current cluster ‘C’. Hyperboxes finding procedure is performed on all of the spam and non_spam patterns in training phase separately to discover clusters and noise points. Formula 1 is used to as a distance formula in k-means clustering algorithm [16].
In the first step, the hyperboxes of spam and non_spam are created. In this step, the PSO algorithm is applied to enhance the performance of seed_points for classification. In fact, the PSO expands the dimension of hyperbox to cover more points. In this situation, the size of the hyperbox does not change through the PSO algorithm with invariant value of ‘EPS’ value. The details of expanding the hyperboxes are described in the following subsections.
A considerable degree of patterns is covered with the structure of granular classifier in the first and second phases. On the other hand, there are some patterns that do not placed in any of the created clusters and known as noise points [17]. The capability of membership function of a fuzzy set was used to improve the geometry of the classifier by identifying the belongingness of noise points. The discovered noise points could be located close or far from the created clusters. The membership degree of noise point, as shown ‘noise’, is presented here by ‘u
i
’. The way of computing the class membership is explained on the following: Calculate the distance of ‘noise’ and seed_points of each cluster by formula 1; Calculate the average distance of achieved distances in step (i) by formula 8, which ‘c’ and ‘n’ indicate the total number of created clusters and seed_points, respectively; Calculate the ‘ui’ of ‘noise’ in each cluster ‘Ci’ among the created clusters ‘Cj’ through the Lagrange multipliers technique byformula 9.
In this section, the results of our proposed method are presented and analyzed on testing datasets. The Spam-base dataset [21] is used to evaluate and compare the performance of the proposed approach for spam detection. It contains 4601 instances of spam and non_spam email messages which 39.4% being spam. Each instance is characterized by 57 attributes and labeled with the ‘0’ and ‘1’: in which ‘0’ and ‘1’ consider the email is non-spam or spam, respectively. In the following, the performance evaluation criteria that are used in this study are presented. The proposed model is applied based on 10 runs for new training and testing datasets in each run. The proposed model is implemented using Eclipse Java JDK 8 on a DELL workstation with 3.30 GHz CPU, 3.41 GB memory, where the operating system is WINDOWS 7.
Evaluation criteria
Performance of the proposed approach is considered based coverage, misclassification error and accuracy. Table 1 represents the evaluation criteria and their interpretations, which are used in this study: where ‘NoCp’ denotes the number of emails covered by proposed model, ‘NoMe’ denotes the number of misclassified emails, ‘NoE’ denotes total number of emails, ‘TP’ denotes the number of non-spam emails that recognized correctly, ‘TN’ denotes the number of spam that recognized correctly, ‘FN’ denotes the number of spam that recognized wrongly, and ‘FP’ denotes the number of non-spam that recognized wrongly.
Experimental results
In this part, the result of the proposed model is presented and analyzed to determine its robustness in spam filtering. This model is applied based on 10 runs each is based on new training-testing datasets. The compared models used the same number of runs, where their average results are presented in various ‘MinPts’ and ‘EPS’ values. As it was mentioned, the proposed granular computing classifier consumes a lot of memory and Java Heap space Error occurs for the Spam-base datasets. Therefore, the training datasets in each run was split to 8 sections to discover the hyperboxes for training in order to solve this problem. For instance, the aforementioned error is occurred to discover hyperboxes where “MinPts” and “EPS” values are “2” and “0.7”, respectively.
Tables 2 and 3 represent the achieved results in various ‘MinPts’ and ‘EPS’ values based on aforementioned evaluation criteria. In Table 3, the achieved best results are presented in bold. Table 3 shows the best coverage, misclassification error, accuracy in second phase and accuracy in third phase are ‘94.42’, ‘0.004’, ‘92.55’ and ‘94.62’, respectively; in which ‘MinPts’ and ‘EPS’ values are ‘3’ and ‘0.7’. In addition, Fig. 4 shows the comparative analysis of accuracy results in the first, second and third steps with various “MinPts” and “EPS” values. It shows that the accuracy results are increased by applying the PSO algorithm in most of the cases.
In addition, the accuracy results are increased after applying the fuzzy sets to solve the issue related to the noise point. Table 4 shows the comparative analysis of the proposed model with the other studies in spam detection. This table demonstrates that the achieved result of the current model is superior to the other studies.
Conclusion
Email technology has profound impacts on human life, growth of business and national development in a positive path. Spam email problem is a major shortcoming of this technology for computer security. In this study, we consider granular classifier model to create the hyperboxes in the geometry of information granules for spam detection in three steps. In the first step, the k-means clustering algorithm is applied to build the centers of clusters that called seed_points of the hyperboxes in spam and non-spam classes. In this step, interval analysis is applied to build the sets as the key part of the classifier’s structure on the high homogeneity of the patterns. In the second step, PSO algorithm is hybridized with the k-means to enhance the performance of hyperboxes. In fact, the PSO expands the dimension of hyperbox to cover more points in related classes within the seed_points optimization. In the third step, the noise point belongingness is solved by applying the fuzzy membership grades to cover the universe in classification. We evaluate the proposed model for spam detection based on the coverage, misclassification and accuracy criteria. The achieved results show the performance of the proposed model in the first step is increased by applying PSO and fuzzy set in the second and third steps, respectively. The experiments revealed superiority of the proposed model compare to the other studies for spam detection.
Footnotes
Acknowledgments
Universiti Teknologi Malaysia (UTM) and Ministry of Higher Education (MOHE) Malaysia, under research grant FRGS 4F550 and GUP 02G31, are hereby acknowledged for some of the facilities utilized during the course of this research work. Also, this work and the contribution were also supported by project of Excellence “Smart Solutions in Ubiquitous Computing Environments” FIM, University of Hradec Kralove, Czech Republic (under ID: UHK-FIM-Excellence-2016-2203).
