Abstract
Electronic communication is now an essential medium for official communication, so classifying e-mail as spam or ham has become increasingly important. Spam detection commonly relies on text-based or collaborative methods. However, choosing the right classifier is difficult, and handling poison attacks and impersonation attacks is equally important. The proposed model is a spam filtering technique that combines social network and e-mail factors with e-mail data analysis for spam classification. Incoming e-mails are subjected to header parsing to determine the trust and reputation of senders with respect to receivers, and keyword parsing is applied to find the topic of interest using LDA with Gibbs sampling. Optical Character Recognition (OCR) is applied to detect image spam e-mails. The degree and strength of the connection between users in the social network are considered along with the e-mail data factors for better message classification. Logistic regression combines all the independent input features into a single result. Experimental results and comparisons with existing models show the strong performance of the proposed classifier.
Introduction
Electronic communication is the most prominent and important means of communication. E-mail and social media are established communication channels and important tools for business, research, public and private use. Spam detection is one of the major problems for e-mail users and e-mail service providers. The attack resilience of a filter [24] determines its spam detection accuracy, and a user-friendly spam filter is more comfortable for users. Highly accurate filters have low false positive and false negative rates [21]. The spam rate of global e-mail traffic, according to the Kaspersky Lab report of 2017, is shown in Fig. 1; the figure shows the global growth of spam e-mails in the second and third quarters of 2017. Several spam filtering methods, such as the VoIP method [3], fail to identify poison and impersonation attacks, which are powerful threats to e-mail. In a poison attack, legitimate words are included in a spam e-mail so that it is classified as ham. In an impersonation attack, a spammer sends unwanted e-mails using the identity of genuine users by hacking their IDs or PCs. E-mail spam filtering methods fall into two main types: content-related methods and user-id-related methods. Content-related filtering relies on keyword-based parsing. User-id-related filtering checks senders against the blacklist and the whitelist. Previous research on e-mail spam filtering concentrated on identifying unsolicited e-mails through keyword-based searching and through the patterns that spam messages use. The identity of the sender is also checked against the blacklist and whitelist in the recipient's e-mail account.
Hybrid methods that use features from social media have also been considered for filtering unwanted e-mails, but that approach concentrates entirely on social media features and performs neither text data analysis nor e-mail feature extraction. Another limitation is that it does not address spam identification for e-mails with images. Hence, a method has been devised that filters unsolicited e-mails effectively, with or without images, and that aims for high accuracy with low false positive and false negative rates. To obtain an effective filtering mechanism, the proposed method uses features from the e-mail and social network datasets and compares three different types of classifiers to find the best filter. E-mails with images are also considered for spam detection.

Spam rate in global e-mail traffic.
The main contributions of our work are as follows:
Our new method for filtering unsolicited e-mail considers social network factors and e-mail factors in addition to the existing content-based filtering (CBF) technique for effective spam classification. A Neural Network (NN) based spam classifier is also developed for comparison with the proposed method. LDA with Gibbs sampling is applied to find the users' topics of interest through keyword parsing. The trust and reputation of senders with respect to receivers are calculated by parsing e-mail headers. Optical Character Recognition (OCR) is applied to recognize characters in images in the incoming e-mails. Logistic regression combines the e-mail and social network factors to provide accurate classification.
The remainder of the paper is organized as follows. Section 2 presents the literature survey related to the proposed work. Section 3 describes the proposed methodology. Section 4 presents the results and their analysis. Section 5 concludes the paper.
E-mail spam filtering methods can be classified into two categories: e-mail content related and e-mail user-id related filtering methods [15]. Content-related filtering parses the e-mail datasets for keywords and patterns. This method is not immune to impersonation and poison attacks, so a strong and effective spam identification and filtering method is needed to overcome them. Factors from social media are also considered for e-mail classification by Haiying Shen and Ze. However, they do not address e-mail content analysis or e-mails with images [10]. Chao Yang et al. use features related to adjacent nodes in the Twitter network to identify Twitter spammers [1]. Spammers in social networks can be identified by considering both the evasion of social media account related features and the accuracy of evasion tactics. Xianghan Zheng et al. proposed detecting malicious users in social networks by considering both account-related and character-related features of users [22] and giving these features as input to an SVM-based algorithm for spammer classification. The time complexity of this scheme is very high.
Chi-Yao Tseng et al. proposed cooperative spam identification with e-mail condensation, in which redundancy comparison and incremental alterations are performed [2]. E-mails are condensed using their HTML content. Clotilde Lopes proposed a symbiotic spam filtering method, which improves spam identification by adding several filters from users and integrates features from both cooperative filters and data-based filters [5]. Its main aim is to increase collaboration among distinct entities that share an interest in customized filtering. Optimizing the scores of anti-spam filters with evolutionary algorithms is discussed in [12]. Poison attacks on e-mail are studied in [6]. Current research directions and open issues in e-mail classification are given in [9]. Discovering spam patterns in e-mail samples is discussed in [7]. The negative selection algorithm for spam identification is described in [13, 14]. Yehonatan et al. discussed the detection of malicious attachments [23].
Several studies have addressed e-mail foldering. Irena Koprinska et al. studied various supervised and semi-supervised learning techniques for classifying e-mails [11]. One of the most significant functions of an e-mail management system is to prioritize e-mails and send only the most important messages to mobile devices. Mostafa Dehghani et al. introduced an attentive learning approach for categorizing the e-mails in the inbox, where the aim is to study the type of categorization [17] based on users' behavior while e-mails are being arranged into categories. Feng Lizhou et al. proposed spam classification based on an active learning approach [8]. Joseph S. Kong et al. proposed a combined spam filtering method with e-mail networks. Rohan M. Amin et al. proposed a mechanism to detect malicious e-mails [18]. This scheme focuses on persistent threats, and its recipient-based features perform better than other commonly available techniques. Targeted Malicious E-mail (TME) targets single users or small groups in small volumes. Salehi Saber et al. proposed a fuzzy-based model for spam detection in e-mails [19]. Sang Min Lee et al. proposed cost-sensitive spam detection using parameter optimization and feature selection with the random forest method [20]. Detection of genuine users on the basis of trust was proposed in [4].
Although a variety of methods are available for identifying and filtering spam e-mails, the proposed method is an initial and innovative approach that considers information from social media, e-mail data analysis and e-mails with images for better classification. Existing algorithms analyze only the incoming text and determine whether the arriving messages are wanted or not. However, most e-mails nowadays contain images, so the proposed model uses Optical Character Recognition (OCR) to read text from images and filter such e-mails effectively. Spam and ham messages are identified by extracting important features from the social media and e-mail datasets and giving them to different classifiers to compare their performance. The main goal of the proposed method is to classify e-mails as unsolicited or legitimate effectively. The independent inputs are integrated through logistic regression. The system is constructed from two filters, a Bayesian filter and an SVM (Support Vector Machine), to further improve filtering effectiveness by ensuring that the resulting filter is attack-resilient, personalized and produces minimal false positives and false negatives. The proposed model works well even when one or more input features are absent, because logistic regression combines the available input features with appropriate weight values. For comparison with the proposed model, an NN classifier is developed in which the input features are given directly to the input layer of the network.
The acronyms used in the proposed work are listed in Table 1.
List of acronyms
The proposed methodology identifies and filters spam messages in the e-mail datasets by including features from social media and from the e-mail datasets. The geodesic distance and the solidity between users are taken from social media, and factors such as trust, reputation and interest are taken from the e-mail datasets. Bayesian and SVM classifiers are used for spam detection. OCR is applied to find the text in images present in the e-mails. Logistic regression is used to combine the independent inputs. The experimental results of the proposed method are analyzed and compared with the NN and with traditional content-based filtering.
E-mail spam filtering
The architectural diagram in Fig. 2 shows the entire system for identifying and filtering spam messages in the e-mail datasets. The proposed filtering method takes information from both social media and the e-mail datasets. The geodesic distance between nodes is calculated to identify the degree of connection between them. The number of connections between nodes is calculated to identify the solidity between users. The trust and reputation between users and the interest of users are calculated from e-mail data analysis. OCR is applied to identify the text in images present in incoming e-mails. The probability of an e-mail being ham is determined using the Bayesian and SVM classifiers. Logistic regression combines all the independent input features and the classifiers' probability values into a single probability value, which is finally compared with a threshold value to classify the e-mail as legitimate or not. The notations used in the proposed work are shown in Table 2.

Workflow diagram.
Notation table
The features calculated from the social media include the following:
The geodesic distance between the sender and the receiver is calculated to obtain the degree of connection between users. That is,
The degree of connection should satisfy the following property to reflect the social relationship accurately.
Property 1: Reduction in Intimacy (RiN)
Intimacy between nodes decreases exponentially as their distance increases. Intimacy can be identified from the degree of connection: the higher the degree, the lower the intimacy of the relationship. RiN is shown in Fig. 3.

Reduction in intimacy.
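As a rough illustration of the degree of connection and of Property 1, the following Python sketch computes the geodesic distance with a breadth-first search and maps it to an intimacy score. The adjacency-map layout, the function names, the six-degree cap and the exp(-decay * degree) form are assumptions made for illustration; they are not taken from the proposed model's own equations.

```python
import math
from collections import deque


def degree_of_connection(graph, source, target, max_degree=6):
    """Return the geodesic distance (number of hops) from source to target.

    `graph` is a hypothetical adjacency map {user: [connected users]}.
    Returns None when target is not reachable within max_degree hops.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_degree:
            continue  # do not expand beyond the assumed six-degree cap
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1  # BFS guarantees this is the shortest path
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def intimacy(degree, decay=1.0):
    """Assumed exponential decay of intimacy with the degree of connection."""
    if degree is None:
        return 0.0  # unconnected users are treated as having no intimacy
    return math.exp(-decay * degree)


# Toy graph: a -> b -> d -> e and a -> c -> d.
example_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
close = intimacy(degree_of_connection(example_graph, "a", "b"))
distant = intimacy(degree_of_connection(example_graph, "a", "e"))
```

In this toy graph the direct neighbour receives a higher intimacy score than the user three hops away, which is the behaviour Property 1 asks for.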
The number of links between two nodes is used to identify the solidity S of the relationship. The solidity between nodes A and B increases as the number of connections between them increases.
Where n is the path count between user A and user B, and P is a path between users A and B. The trust between users, the interest of users and the reputation (opinion) between users are measured from the e-mail datasets as follows. The trust value of the sender with respect to the receiver is determined by the blacklist and whitelist of the receiver's e-mail account, the number of solicited e-mails the recipient has received from that sender, and the replies the receiver has sent to that sender. The trust T of user B with respect to user A, when user B sends an e-mail to user A or user B gets a reply from user A, is calculated as follows:
Where L_{b,a} is the number of ham e-mails received by user A from user B, R_{a,b} is the number of reply e-mails sent by A to B, and z satisfies 0 < z < 1 if the received e-mail is spam and z = 0 if it is legitimate. x, y and z are the learning parameters.
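The following Python sketch shows one way the quantities defined above could be combined into a trust value. The functional form, a weighted sum of ham and reply counts scaled by (1 - z), is an assumption chosen for illustration, as are the function name and the default parameter values; only the symbols x, y, z, L_{b,a} and R_{a,b} come from the text.

```python
def trust(ham_count, reply_count, spam_received=False, x=1.0, y=1.0, z=0.5):
    """Hypothetical trust of sender B with respect to receiver A.

    ham_count     -- L_{b,a}: ham e-mails A received from B
    reply_count   -- R_{a,b}: replies A sent to B
    spam_received -- True when the latest e-mail from B was spam
    x, y, z       -- learning parameters; z is forced to 0 for legitimate mail
    The (1 - z) discount is an assumed form, not the paper's equation.
    """
    if not 0.0 < z < 1.0:
        raise ValueError("z must lie strictly between 0 and 1")
    penalty = z if spam_received else 0.0
    return (x * ham_count + y * reply_count) * (1.0 - penalty)
```

With the defaults, a sender with three ham e-mails and two replies has a trust of 5.0, which drops to 2.5 once a spam e-mail is received from that sender.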
The reputation R is calculated from the number of e-mails forwarded by the users. The reputation of user B with respect to user A, when user A forwards an e-mail sent by user B, is:
Here k is the learning factor learned during the training phase, and n(F) is the number of people to whom the e-mail is forwarded. The calculation of trust and reputation from replies and the number of forwarded e-mails is shown in Fig. 4.

Trust and Reputation calculation on the basis of reply and number of forwarded email messages.
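As an illustration of how n(F) might be obtained by header parsing, the sketch below uses Python's standard `email` package to count the distinct recipients of a forwarded message. Treating a subject that starts with "Fwd:" as the forwarding marker, counting both To and Cc addresses, and the product k * n(F) are assumptions for illustration; the function names are hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses


def forward_recipient_count(raw_message):
    """Return n(F): distinct recipients of a forwarded e-mail, else 0."""
    msg = message_from_string(raw_message)
    subject = (msg.get("Subject") or "").strip().lower()
    if not subject.startswith("fwd:"):
        return 0  # not a forwarded message under the assumed convention
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    # getaddresses yields (display name, address) pairs for every recipient.
    return len({addr.lower() for _, addr in getaddresses(headers) if addr})


def reputation(n_forwarded, k=1.0):
    """Assumed form: learning factor k times the forward count n(F)."""
    return k * n_forwarded
```

A message forwarded to three people therefore contributes three times as much reputation as one forwarded to a single person, scaled by the learned factor k.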
LDA algorithm is used to calculate the interest I for a particular topic.
Interest of A for a Topic t:
Where L_{t,a} is the count of legitimate e-mails of topic t received by user A, S_{t,a} is the count of spam e-mails of topic t received by user A, and T_a is the total e-mail count in topic t received by user A. E-mail contents are analyzed to find the interest of the users, and LDA clusters the keywords into topics. Text in the images of incoming e-mails is also recognized. Logistic regression predicts the probability of an e-mail being solicited or not from the independent input features of the datasets. The linear model for logistic regression is
To construct a probability model P(L|E) for an incoming e-mail being spam or legitimate, the logistic function is applied to the linear model, as in Eq. (8):
Where the logistic function,
Where W = a, b, c, d, e, f are the weights computed during training and X denotes the input parameters: S is the strength, D is the degree, T is the trust, I is the interest and R is the reputation. L denotes legitimate and E denotes the e-mail. The logistic regression probability is calculated from the input values and compared with a threshold value to classify the e-mail. The logistic regression model is applied to both the Bayesian filter and the SVM, and the two are then compared to see which filter performs better in terms of efficiency, accuracy and attack handling. The weighted combination is evaluated against a threshold (ω) that is determined during training and is then used to decide whether the incoming e-mail is ham or not. All the factors obtain their weight values during training.
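The combination step can be sketched in Python as follows. The feature keys, the inclusion of the classifier probability as a sixth input "P", the bias term and the default threshold are illustrative assumptions; in practice the weights and the threshold ω would come from training.

```python
import math


def logistic(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ham_probability(features, weights, bias=0.0):
    """P(L|E): probability that the e-mail is legitimate.

    `features` and `weights` are dicts keyed by hypothetical names such as
    "S" (strength), "D" (degree), "T" (trust), "I" (interest),
    "R" (reputation) and "P" (Bayesian or SVM probability).
    A missing feature contributes nothing to the linear model.
    """
    z = bias + sum(w * features.get(name, 0.0) for name, w in weights.items())
    return logistic(z)


def classify(features, weights, threshold=0.5, bias=0.0):
    """Label the e-mail by comparing P(L|E) with the threshold."""
    p = ham_probability(features, weights, bias)
    return "ham" if p >= threshold else "spam"
```

Because absent features default to zero, the sketch still produces a probability when one or more inputs are unavailable; it simply relies on the remaining weighted features.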
Algorithm 1 calculates the degree and solidity factors from the social network and the trust, interest and reputation factors from the e-mail datasets. Natural language processing methods are needed to eliminate stop words and to stem the text of the messages. Each e-mail is parsed into keywords, and the Bayesian and SVM filters use them to identify the e-mail as ham or spam. OCR is applied to extract the text from images in incoming e-mails, and this text is also given to the filters for classification. In Algorithm 1, P(BF) is the probability from the Bayesian filter, P(SVM) is the probability from the SVM filter, Sr is the sender, Rr is the receiver and BL is the blacklist. The trust value of the sender with respect to the receiver is calculated from the blacklist and whitelist of the receiver's e-mail account, the number of solicited e-mails the receiver has received from the sender, and the number of e-mails from the sender that the receiver has replied to. LDA is used to identify the topic of each e-mail, from which the interest of the users is found. Reputation is calculated from the number of e-mails that recipients forward to other users. Finally, the likelihood values of the filters and the factors from the social media and e-mail datasets are obtained for each incoming e-mail.

Computation of factors from E-Mail and Social Networks
The spam classification shown in Algorithm 2 considers the Bayesian probability value, the SVM probability value, degree, solidity, trust, interest and reputation. Logistic regression combines these independent input features for each test e-mail into a probability value, which is compared with the threshold value to decide whether the incoming e-mail is ham or not. All the factors obtain their weight values during training. Adaptive trust means that the trust value can be altered after the e-mail classification. When the intimacy (closeness) value of a user falls below the threshold value, the user is added to the blacklist of the corresponding receiver's e-mail account. The trust value between users is dynamic: every time an e-mail is classified as ham or unsolicited, the trust value of its sender in the e-mail network is updated.

Spam Classification
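The adaptive trust and blacklist update that follows each classification might look like the Python sketch below. The multiplicative penalty (1 - z), the fixed reward for ham, the use of the trust value itself as the closeness measure, and all names and default values are assumptions for illustration rather than the steps of Algorithm 2.

```python
def apply_classification(trust, blacklist, sender, receiver, label,
                         z=0.5, reward=1.0, min_trust=1.0):
    """Update trust after an e-mail is labelled and blacklist if needed.

    trust     -- dict {(sender, receiver): trust value}, updated in place
    blacklist -- dict {receiver: set of blocked senders}, updated in place
    label     -- "ham" or "spam", the outcome of the classifier
    Returns the new trust value of sender with respect to receiver.
    """
    key = (sender, receiver)
    current = trust.get(key, 0.0)
    if label == "spam":
        current *= (1.0 - z)   # assumed penalty for sending spam
    else:
        current += reward      # assumed reward for a legitimate e-mail
    trust[key] = current
    # A sender whose value falls below the threshold is blacklisted.
    if label == "spam" and current < min_trust:
        blacklist.setdefault(receiver, set()).add(sender)
    return current
```

Under these assumptions a previously trusted sender survives an occasional misclassified message, while repeated spam drives the value below the threshold and moves the sender to the receiver's blacklist.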
The factors taken from the social media and e-mail datasets are multiplied by weight values obtained during the training phase. The outputs of the SVM and Bayesian classifiers are likewise multiplied by learning parameters before classification. Logistic regression is used to learn these values during training. Comparing the probability value with the threshold value determines whether an e-mail is spam or not. The best threshold value is the one that maximizes accuracy. The false positive rate and the false negative rate are also determined. A false negative occurs when a spam e-mail is mistakenly classified as ham. The higher the accuracy, the better the classification. High accuracy is possible only if the number of correct classifications is high, which in turn depends on the number of false positives and false negatives.
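A minimal Python sketch of this evaluation is given below, treating spam as the positive class so that a spam e-mail classified as ham counts as a false negative. The function names, the candidate grid of thresholds from 0.01 to 0.99 and the choice to return the first threshold with maximal accuracy are assumptions for illustration.

```python
def evaluate(ham_probs, is_spam, threshold):
    """Accuracy, false positive rate and false negative rate at a threshold.

    ham_probs -- P(L|E) for each e-mail
    is_spam   -- True labels (True = spam), same order as ham_probs
    An e-mail is predicted spam when its ham probability is below threshold.
    """
    tp = fp = tn = fn = 0
    for p, spam in zip(ham_probs, is_spam):
        predicted_spam = p < threshold
        if predicted_spam and spam:
            tp += 1
        elif predicted_spam and not spam:
            fp += 1   # ham wrongly flagged as spam
        elif spam:
            fn += 1   # spam wrongly accepted as ham
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "fnr": fn / (fn + tp) if fn + tp else 0.0,
    }


def best_threshold(ham_probs, is_spam, candidates=None):
    """Pick the candidate threshold with the highest training accuracy."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return max(candidates,
               key=lambda t: evaluate(ham_probs, is_spam, t)["accuracy"])
```

The threshold chosen on the training e-mails would then be reused unchanged when the test e-mails are classified.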
Spam classification through NN
The features extracted from the datasets are given as input to the NN. The design of the NN takes both the high variance (overfitting) and the high bias (underfitting) problems into account. Neurons are added to or deleted from the hidden layers of the NN [16] to address these problems. The framework of the NN is shown in Fig. 5.

Neural network framework.
The criterion for the addition of neurons can be expressed as:
Where Er(t) and Er(t + τ) are the training errors at epochs t and t + τ, respectively. Here, ɛ and τ are two user-specified parameters. In the merging operation, the less important hidden neurons are deleted, which does not affect the proposed classification model.
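One plausible reading of this criterion is that a hidden neuron is added when the training error has improved by no more than ɛ over the last τ epochs. The Python sketch below encodes that reading; the inequality Er(t) - Er(t + τ) <= ɛ, the function name and the list-of-errors representation are assumptions, not a statement of the exact rule in [16].

```python
def should_add_neuron(errors, t, tau, epsilon):
    """Check the assumed growth criterion Er(t) - Er(t + tau) <= epsilon.

    errors  -- training error recorded at each epoch (index = epoch)
    t       -- epoch at which the window starts
    tau     -- window length in epochs (user-specified)
    epsilon -- minimum improvement expected over the window (user-specified)
    Returns False when epoch t + tau has not been reached yet.
    """
    if t < 0 or tau <= 0 or t + tau >= len(errors):
        return False
    improvement = errors[t] - errors[t + tau]
    return improvement <= epsilon
```

When the check succeeds, training has stalled with the current hidden layer, which under this reading is the signal to grow the network before continuing.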
The datasets were obtained from two sources. The e-mail dataset was collected from the Csmining group. Each e-mail has header information that conveys the 'from' address, 'to' address, subject, mailing lists, etc. The social media dataset was collected from the Higgs Twitter Datasets. The Twitter dataset contains edge-list information from which the follower and followee information is obtained. Pre-processing consists of two basic steps: stemming and stop-word removal. Stemming means reducing a word to its root. WordNet groups English words into sets of synonyms called synsets, provides short definitions, and records the semantic relationships between the synsets. Stop-word removal is also carried out using WordNet word lists. Both datasets were combined to generate a merged dataset for e-mail spam classification. Table 3 shows the merged dataset in a three-column format, where the first column gives the e-mail number, the second the sender and the third the receiver. More than one value in the receiver column indicates that the sender sent the e-mail to more than one person.
Sample datasets
The datasets were merged in such a way that, if an e-mail is not ham, the sender and the receiver are not within the first two degrees of connection of each other. The interest of a user in a topic was found using JGibb LDA. It emphasizes inferring the hidden (latent) topic structure of unknown data, because parameter inference requires less computation time than parameter estimation. OCR is applied to find the text in images present in incoming e-mails, and that text is also given to the proposed classifier for better spam classification.
The trust and reputation factors are calculated from the e-mail network. Trust indicates how much the receiver trusts the sender. To implement trust, a hash set of all reply e-mails is maintained. Adaptive trust management is implemented by reducing the trust value of a user whenever that trusted sender sends a spam e-mail. The trust value is computed as follows.
Where LEC is the Legitimate E-mail Count and NR is the Number of Replies. The learning parameters are set to 1 for both the legitimate e-mail count and the number of reply e-mails. To implement adaptive trust, whenever a trusted sender sends out a spam e-mail, the trust matrix is updated as follows.
Reputation is computed from the count of forwarded e-mails. The headers of each e-mail are analyzed to find subjects beginning with 'Fwd:', and all such e-mails are stored in a hash set. When user A forwards an e-mail to user B and user B forwards it to users C, D and E, the aim is to find the reputation of the original sender with respect to the receiver, that is, the reputation of user A in the eyes of user B. Reading an e-mail reveals information only about its sender and receivers. The degree of connection is determined by finding the first six degrees of connection and writing them to separate files. For each follower in the Higgs Twitter dataset, the followee information is collected and written to a list in the form <follower> <followee list>. The values are written into a hash map with the follower as the key and the followees as the values. The degree of connection is inversely proportional to the closeness value: the higher the degree of connection, the smaller the closeness value. If a follower and a followee are connected at more than one degree, the lowest degree at which they are connected is taken as the degree of connection. The solidity of the connection refers to the number of paths by which a followee is linked to the follower. Solidity is implemented by finding the strength at all degrees of connection.
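The follower-to-followee hash map described above lends itself to a simple path count. The Python sketch below counts the simple paths from a follower to a followee at each degree up to six and sums them as the solidity. The exhaustive depth-first search, the restriction to simple paths and the function names are assumptions for illustration, and the search is practical only on small graphs because the number of paths can grow exponentially.

```python
def count_paths_by_degree(graph, source, target, max_degree=6):
    """Count simple paths from source to target, grouped by path length.

    `graph` is a hash map {follower: [followees]}.
    Returns {degree: number of paths of that length}.
    """
    counts = {}
    if source == target:
        return counts

    def walk(node, depth, visited):
        if depth == max_degree:
            return  # stop at the assumed six-degree limit
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                counts[depth + 1] = counts.get(depth + 1, 0) + 1
            elif neighbour not in visited:
                walk(neighbour, depth + 1, visited | {neighbour})

    walk(source, 0, {source})
    return counts


def solidity(graph, source, target, max_degree=6):
    """Total number of connecting paths over all degrees."""
    return sum(count_paths_by_degree(graph, source, target, max_degree).values())


def degree(graph, source, target, max_degree=6):
    """Lowest degree at which the two users are connected, or None."""
    counts = count_paths_by_degree(graph, source, target, max_degree)
    return min(counts) if counts else None
```

For a follower who follows a followee directly and also reaches the same followee through two intermediaries, this gives a degree of 1 and a solidity of 3.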
When a follower and a followee are connected at several degrees, and possibly in several ways at the same degree, all of these connections are added up to find the number of different ways in which the two are connected. The proposed model is implemented and analyzed on the basis of the results obtained. Accuracy, sensitivity, specificity and error rate are used to evaluate performance. The proposed system with the Bayesian filter and with the SVM filter is compared separately against the traditional Bayesian and SVM filters. Logistic regression is used to combine all the independent input features to produce accurate results. An NN is also developed for comparison with the proposed methods. The best classifier is selected after analyzing the results.
Figure 6 compares the FPR and FNR of the existing system and the proposed system with SVM. Figure 6(a) shows that the false positive rate falls by 2.5%, and Figure 6(b) shows that the false negative rate falls by 6.9%, for the proposed model with SVM. The proposed model combines all the input features through logistic regression, which selects suitable weight values for the input features during training.

Parameters calculated for test e-mails filtering rate while using SVM.
Figure 7 compares the FPR and FNR of the existing system and the proposed system with the Bayesian filter. The data were collected and trained in the same way as for SVM. The accuracy is expected to be higher for the Bayesian filter combined with the proposed features than for the Bayesian classifier alone. Fig. 7(a) shows that the false positive rate falls by 3.5%, and Fig. 7(b) shows that the false negative rate falls by 6%, for the proposed model with the Bayesian filter (NB). Sensitivity and specificity comparisons are shown in Figs. 8 and 9, respectively.

Parameters calculated for test e-mails filtering rate while using Bayesian filter.

Sensitivity vs. No. of e-mails.

Specificity vs. No. of e-mails.
Figure 10 compares the spam filtering accuracy of the proposed models with that of the existing methods. The accuracy of the Bayesian filter with the proposed factors is higher than that of the other classifiers with additional features. The error rates of the existing and proposed methods are shown in Figure 11. Table 4 displays the performance evaluation of the spam filtering rate of the existing models and the proposed method after training. The table shows that the Bayesian filter with the proposed features is a better classifier than the existing Bayesian filter, the SVM filter, or the SVM combined with the factors from the e-mail and social networks. As the number of training e-mails increases, the system learns from more examples how to classify an e-mail as spam or ham, and it thereby becomes accurate enough to classify the test e-mails. During the training phase, the system learns the threshold value at which the training e-mails achieve maximum accuracy. This threshold is compared against the probability value that logistic regression produces for each test e-mail. The results show that the proposed model with the Bayesian classifier achieves an accuracy of 97% in classifying e-mails as spam or ham.

Number of e-mails vs. accuracy while using Bayesian filter, Bayesian filter with e-mail and social network factors, SVM filter and SVM filter with e-mail and social network factors.

Error rate vs. existing and proposed classifiers.
Performance evaluation of spam filters by existing and proposed filters
This paper presented an effective spam e-mail identification and filtering mechanism that considers factors from social media and from e-mail datasets to improve the performance of the filter. The independent inputs are integrated with the help of logistic regression. OCR is applied to find the text in images present in incoming e-mails for effective spam classification. SVM and Bayesian classifiers with the e-mail and social network factors, traditional SVM and Bayesian classifiers, and an NN with e-mail and social network factors were developed and their performance was compared. The experimental results show that the proposed model with the Bayesian classifier achieves an accuracy of 97% in classifying e-mails as spam or ham.
In the future, in addition to the existing features, the interests of the users will be collected, and the dynamically changing behavior of those interests will also be taken into consideration for better e-mail classification.
Acknowledgment
This work is financially supported by the Visvesvaraya Ph.D. scheme under Electronics and IT.
