Abstract
Detecting alert e-mails received daily by millions of subscribers from online news providers is a relatively new area of research which falls within the e-mail filtering field of research. Alert e-mails may address government, political issues, breaking news, and criminal attacks. This article proposes a hybrid approach based on both the Graham statistical filter and rule-based filters to detect and filter Arabic alert e-mails. The approach is basically language-independent. To test the performance of the proposed approach, several experiments have been conducted using a set of 1500 Arabic messages related to criminal activities collected manually from some news websites such as Al-Jazeera Net and BBC Arabic news. The results showed that the proposed approach has achieved a competitive performance in terms of accuracy, precision, and F-measure, where about 87% of the messages tested have been correctly detected and filtered by the proposed filter.
Keywords
1. Introduction
With the rapid development of the internet, e-mail has become the most popular and efficient communication medium for users owing to its convenience, immediacy, and low cost. The e-mail service is used for facilitating contact between individuals and enhancing the productivity of organizations.
Commonly, e-mail systems allow end users to manually construct keyword-based rules and filters to automatically direct e-mails into specific folders in order to facilitate efficient retrieval. This filtering includes the filtering of undesired e-mails (called spam) by either deleting them or directing them into a specific folder (usually known as the junk mail folder). As the popularity of e-mail has increased, e-mail management has become an important and growing problem for individuals and organizations. As a result, there has been a growing need to build systems that can learn automatically and then filter and move e-mails into specific folders properly. In practice, e-mail filtering is the process of automatic organization of incoming e-mails into folders, where each folder represents a category to which an e-mail may belong.
Several research projects in the area of e-mail filtering have focused on classifying e-mails into either spam or non-spam messages. However, the immense number of e-mails interchanged daily, and the needs for personalization and organizational data analysis, have opened new and critical research areas in the field of e-mail filtering. An example is the filtering of threatening e-mails related to terrorism activities. A threatening e-mail is an e-mail that carries information about possible future criminal activities that may have serious consequences (including, but not limited to, the loss of lives). The first effort in this area was made by Appavu and Rajaram [1] and Appavu et al. [2], who used machine-learning concepts, such as rule-based filtering, for detecting and filtering threatening e-mails.
Users of e-mail systems may receive alert e-mails that are either sent by security agencies or through online news providers where subscribers may request e-mail alerts on breaking news. For instance, several countries that have suffered from terrorist attacks have issued instant e-mail warnings that are sent by alert systems about any expected future terrorist attacks. The MI5 UK security agency, for instance, has launched a system that sends alert e-mails to UK citizens about expected terrorist attacks. The central Europe terrorism news website offers a free terrorism alert news service and sends such news by e-mail. E-mail alerts are received daily by millions of subscribers around the world from online news providers; one-quarter of such e-mails have addressed government and politics news, followed by crime and international news [3].
Detecting alert e-mails related to terrorist attacks embedded in an e-mail message is very important for users and analysts who are interested in such kinds of data. In practice, there are two types of e-mails that have information related to terrorist attacks: the first type is alert e-mails that warn users about expected attacks in the future; the second informs users about recent terrorist attacks. It has been noticed that alert e-mails are usually expressed using the future tense. The immense number of e-mails received makes it difficult for users to navigate through their inbox in order to find alert e-mails. This has motivated researchers to develop new automated techniques for detecting such alert e-mails and store them in a separate folder labeled the ‘alert folder’ instead.
Previous studies in the area of e-mail filtering did not take into consideration e-mails that contain threats and alerts regardless of the sender of such e-mails. This type of e-mail should be treated with high priority that guarantees accurate analysis and filtering. In this case, appropriate mechanisms should be used to notify users about such e-mails.
Appavu et al. [2] have proposed a rule-based technique based on classification and association for the detection of threatening e-mails via analysing the text embedded in English e-mail text. The basic idea of rule-based classifiers is to use a set of rules that is generated in the training phase of a rule-based classification algorithm (e.g. ID3, OneR). Two basic techniques were used for feature extraction from text, information gain (IG) and term frequency variance (TFV). The proposed method involved verb tense as an important factor and has shown that associative-classifiers provided competitive detection rates compared to common classification methods.
As for Arabic, this problem has been tackled by Al-Radaideh et al. in [4], where the performance of two e-mail filtering techniques to filter Arabic alert e-mails was evaluated. The techniques used were rule-based filtering using the ID3 classifier [5], and the statistical-based filter proposed by Paul Graham [6]. According to the authors, their work was the first to measure the performance of the Graham filter for filtering other types of e-mails such as those that carry security threats and it was also the only one to deal with the Arabic language.
This has motivated the current authors to propose a new approach to enhance these filters. This article proposes a hybrid approach that utilizes the performance of both the Graham statistical filter and rule-based filters to detect and filter Arabic alert e-mails. In the proposed approach, useful features were extracted from e-mails by involving both the categorical proportional difference (CPD) and term frequency variance as feature weighting methods for the rule-based filter. To test the performance of the proposed approach, several experiments have been conducted using a set of Arabic e-mails related to criminal activities.
2. Related work
In recent years, several studies have successfully addressed the issue of spam filtering with the help of machine-learning methods such as the Naive Bayes [7], Support Vector Machine [8], and rule-based filters. A comparative study was performed by Youn and McLeod in [8] to measure the performance of different e-mail filters. The study was performed using four filtering techniques: Decision trees, Naive Bayesian, Support Vector Machine (SVM), and Artificial Neural Networks (ANN).
The main problem with rule-based filters is the difficulty of keeping the rules up to date with the abilities of spammers to make different variations of the same spam message. This problem has forced several researchers to employ statistical approaches in the e-mail filtering process. A Bayesian filter, based on a boosting approach, was proposed by Koprinska et al. [9], and another Bayesian statistical filter was proposed by Gajewski in [10]. Additionally, Sheu [11] has proposed a mathematical modelling process in order to prove the importance of using message content in filtering.
Several other studies have proposed some approaches for spam e-mail filtering. Chen et al. [12] have evaluated and compared three novel Naive Bayesian filters with other common filters for e-mail filtering. Similarly, Nazir et al. [13] have used a two-pass statistical approach for automatic personalized spam filtering, whereas Hershkop and Stolfo [14] have used the decision forest approach to filtering spam e-mail. Chiemeke and Longe [15] have used a decision tree (DT)-based model to find a set of association rules that could identify spam e-mails. Liu et al. [16] have proposed a hybrid combined approach for e-mail filtering, which utilized the performance of the N-Gram filter, Naive Bayesian filter, content-based Naive Bayesian, term frequency–inverse document frequency (TF-IDF) filter, and limited N-Gram-based filter.
Sakkis et al. [7] have proposed a stacking filter model for spam filtering, whereas Blanzieri and Melgani [17] and Blanzieri and Bayl [18] have combined SVM and k-Nearest Neighbor (k-NN) classifiers to provide a hybrid model for spam e-mail filtering. Clustering was also incorporated in the area of spam detection, where it was used by Kyriakopoulou and Kalamboukis [19] as a complementary step for classification using support vector machine filters.
As for alert e-mail filtering, Appavu and Rajaram [1] and Appavu et al. [2] have used machine-learning data-mining concepts, and have proposed two techniques that were based on classification and association for the detection of threatening e-mails related to terrorism attacks by analysing the embedded text in an English e-mail message. To measure the performance of association-rule-mining (a priori algorithm) in alert e-mail detection, Appavu and Rajaram [1] have used a mixture of e-mails consisting of informative e-mail messages related to terrorism attacks that have occurred, and alert e-mail messages related to terrorist attacks expected to happen in the future. The proposed method involved using the verb tense as an important factor and showed that the associative-classifier used provided competitive detection rates compared with common classification methods. In terms of classifier accuracy, Appavu et al. [2] have shown that the DT classifier outperforms other classifiers, such as support vector machine and Naive Bayesian filters.
The combination of more than one machine-learning technique was also used in the area of e-mail filtering. The combinations were motivated by the deficiencies that have appeared in several single algorithm-based approaches. Kyriakopoulou and Kalamboukis [19] referred to the time for building a filter as a common deficiency in the support vector machines filters, while Koprinska et al. [9] referred to the feature selection techniques problem for simple Bayesian filters as a deficiency.
In order to reduce the false positives rates, Liu et al. [16] have used a hybrid combined approach for e-mail filtering. Different filters were used individually, such as N-Gram filter, Naive Bayesian filter, content-based Naive Bayesian (Paul Graham) TF-IDF filter, and limited N-Gram filter. Simple averaging equal weights, in addition to a weighted majority, were selected as a combination scoring technique. A combination of weak and strong individual filters was used in combination scenarios. The datasets used have consisted of e-mails that were collected from five users at Columbia University. In summary, about 4% of false positives were reduced by different combination scenarios.
A stacking filter model for spam filtering was used by Sakkis et al. [7], where two-level filters were used. The first one was a ground level filter, which is a Bayesian classifier. The second filter, which is the controlling filter, was used by the authors as the president filter trained on the same corpus as the ground filter. The final decision on the appropriate category was based on the president filter’s decision and the ground level one. Computational efficiency was the major problem with this combination.
Clustering was also incorporated in the area of spam detection. Clustering was used by Kyriakopoulou and Kalamboukis [19] as a complementary step for classification using support vector machine filters. According to Kyriakopoulou and Kalamboukis, SVM required a long time for large datasets, in the training phase especially. Clustering was used as a pre-processing sub-sampling operation; it was also used to add k-meta features from the k-clusters; those meta-features were extracted using a TF-IDF weighting schema, and finally SVM was used in the filtering step. The experiments have shown that this combined approach could achieve better detection rates in addition to better performance; however, the effects of using stemming and stop word removal revealed a biased approach in the filter performance.
Several studies have proposed an enhancement to statistical Naive Bayesian filters. Chen et al. [12] have evaluated and compared three novel Naive Bayesian filters with other common filters in e-mail filtering. The first novel Bayesian filter was the Aggregating One Dependency Estimator (AODE), which built a collection of models and then combined the predictions of such models. The second one was the Hidden Naive Bayesian filter (HNB), which creates a hidden parent for each word. The third is the Locally Weighted Learning with Naive Bayesian (LWNB), which learned a local model in the same way as linear regression is used in locally weighted linear regression. The experiments and evaluations were performed on two data sets, the PU1, which consists of 1099 messages, while the second dataset was PU2, which has 721 messages. They argued that the Naive Bayesian filter, especially the ADOE, has achieved the highest performance in the two datasets.
A decision forest was used by Hershkop and Stolfo [14] for e-mail spam filtering. The reason behind this decision is their ability to deal with the large dimensionality of feature vectors through combining trees and selecting a different set of features for each ensemble member. Three datasets were used in the evaluation phase: Ling Spam, PU1, and U5. The performance of the decision forest was better than highly accurate filters, including boosted trees and neural networks.
A two-phase spam filtering system was also discussed by Chiemeke and Longe [15], who have focused only on e-mail headers instead of scanning the content of the e-mail. A decision tree was used to find the association rules of spam e-mails. Finding association rules depends mainly on the e-mail header, such as the existence of abnormal terms in an e-mail subject. Two thousand Chinese e-mails were selected as training data for the first phase, and then some 600 unknown Chinese e-mails were chosen for the experiment of the second phase. The rate of accuracy achieved was 96.5%, the precision achieved was 96.67%, and the rate of recall was 96.35%. The main deficiency using this approach is its dependence on e-mail headers as the basic source for filtering.
Blanzieri and Melgani [17] evaluated a combination of k-NN and SVM filters. The algorithm used the traditional k-NN algorithm for selecting the best k-nearest samples, which was followed by training the SVM filter on the best k chosen, the model was then used for filtering the message of interest. The major problem of this algorithm was that it gave no procedure to find a good value for the parameter k.
Similarly, Yang et al. [20] used a two-layered Naive Bayes Bagging spam detection model that was based on embedded decision trees. An e-mail header was removed from an incoming datagram. If the e-mail could be filtered according to the e-mail header by rules, there would be no need to decode the e-mail body; otherwise a score threshold based on the Naive Bayesian probabilities was used in building the decision tree. Finally, the threshold would be used to label the e-mails of interest. Empirical results have shown that this modified method was able to improve the performance of Naive Bayes for spam detection significantly.
Koprinska et al. [21] conducted a comparison between four filters using a co-training algorithm, which builds more than one filter through dividing the full-feature vector into sub-feature vectors. The filter confidence was enhanced by allowing it to learn from others using a confidence threshold for each classified instance. Random forest is an example of such a kind of filter, based on decision trees. According to the findings of Koprinska et al., the filter outperformed, in terms of filtering performance, the well-established algorithms such as DT, SVM, and NB. These findings are very similar to those by Zhang et al. in [22], who compared random forests and support vector machines, where the random forest showed better performance in terms of accuracy and true positive rates.
3. The proposed approach
The proposed algorithm, which is an ensemble filter, uses both a decision tree-based algorithm and the Paul Graham algorithm together to determine whether the final category assigned to a message should be an alert e-mail or an informative one. The proposed method is a hybrid Paul Graham and DT algorithm (HPGDT).
The proposed approach is divided into four phases, as illustrated in Figure 1 and described below:
E-mail text pre-processing, which involves two main steps: removal of Arabic stop words and special characters, and normalizing the letters by replacing some character shapes with a unified shape.
Feature selection, which comprises two measures: CPD, which was proposed by Simeon and Hildeman [23], and TFV, which was used by Appavu and Rajaram [1] and Appavu et al. [2].
Filter construction, which involves building both the rule-based filter using the ID3 algorithm, and the statistical filter using the Graham algorithm. The results generated by the two classifiers are combined to generate the HPGDT hybrid filter.
Evaluation of the filter, which involves evaluating the performance of the generated filter in the detection of alerts about threatening e-mails.

The major filtering phases.
3.1. Phase 1: e-mail text pre-processing
This phase involves three main steps:
Tokenization: Before an e-mail text message can be processed, it is first split into units called tokens (words); this process is called tokenization. Tokenizing the message into tokens actually takes place using a blank space or any other some other nonalphabetic character used as a delimiter.
Stop word removal: The next step of e-mail text pre-processing involves the removal of some Arabic stop words; the removal of words that have little or no meaning and often appear frequently in text documents. Stop words typically have a syntactic rather than a semantic function in text documents, and usually do not have a relation to the text subject, as in the words ‘or’ (أو), ‘whose’ (لمن), ‘on’ (على), ‘where’ (حيث), ‘in’ (في), ‘from ‘(من), ‘beyond’ (غير), and ‘all’ (كل). Removing stop words generally leads to shorter documents, which can be processed efficiently. This also enhances the efficiency of the term indexing process [24].
Text normalization: This step involves replacing some variants of a character’s shapes such as (آ,إ,أ), (ه, ة) by unified shapes. The normalization step is necessary for the Arabic language because the Arabic characters may have up to four different shapes: one is used at the beginning of a word, one at the end, another in the middle and a fourth shape when the character stands alone. This step also involves removing digits and some special characters such as ‘1’, ‘2’, ‘@’, ‘[’, ‘(’, ‘?’, ‘/’.
3.2. Phase 2: feature selection
The next step involves finding relevant keywords (tokens) that could help in identifying alert e-mails. As for the Arabic language, it has been noticed that alert e-mails are usually expressed using the future tense; therefore the most common verb prefixes used in the Arabic language to write terrorist related alerts are the tokens ‘will’ (ﺴ، سوف), ‘may’ (قد), and ‘with’ (ب). On the other hand, informative e-mails are commonly written using simple present or simple past action verbs such as the words ‘happened’ (وقع), ‘shocked’ (هز), ‘hit’ (ضرب), ‘kill’ (قتل), ‘targeted’ (استهدف), and ‘burst’ (انفجر). Common examples of nouns that could also indicate that an incoming e-mail is an informative e-mail are the words ‘explosion’ (تفجير), ‘assassination’ (اغتيال), ‘killed’ (مقتل), ‘death’ (مصرع), ‘target’ (استهداف), and ‘hijack’ (اختطاف). There are also several nouns that are used to indicate that an e-mail may be an alert or an informative e-mail. Some action verbs and gerunds such as the words ‘attack’ (هجوم), ‘bomb’ (مفخخة), ‘car’ (سيارة), and ‘assault’ (اعتداء) could also appear in an Arabic alert e-mail message.
For the purposes of feature selection, two methods were used:
categorical proportional difference [23], which measures the degree to which a word w contributes to differentiating a particular category from other categories. The CPD value for a word w in a category Ci is calculated using Equation (1), where an e-mail E belongs to one of two categories (C1, C2):
where A is the number of e-mails where the word w and the category C1 occur together and I is the number of e-mails where the word w and category C2 occur together. The total number of e-mails in which the word w occurs is N = A + I. The possible values for CPD are within the interval (−1, 1]. As the value of CPD increases, it contributes to differentiating a particular category from other categories. When the value of CPD is 1, it means that the word occurs in documents of one category and does not occur in the other. The maximum ratio associated with a category Ci for a word w is the final value of CPD for that word; this is given in Equation (2):
term frequency variance, which is a category-dependent feature selection technique. The TFV was used by Appavu et al. [2] for alert e-mail detection and showed a competitive performance when compared with information gain. TFV is an indication of the term importance in each category, where higher values indicate that the term t has predominantly occurred in one category more than other categories, which means that t is more important and could be useful for filtering purposes.
To compute the TFV value, the term frequency tf in both categories is initially found followed by finding the mean value of the term occurrence in both categories (mean_ tf). The variation between the term frequency in each category and the mean of term frequency in all categories is evaluated and squared. Finally, the sum of this variation represents the value of TFV for that term. Equation (3) is used to find the TFV for each term. In the evaluation phase, the values for both TFV and CPD need to be normalized first since the range of values for TFV is greater than those of the CPD.
The set of documents and the selected features are represented using the term document matrix format [25], where the intersection of a row and a column takes the value of either 1 or 0 to indicate the existence (1) or absence (0) of a particular term in a selected document. An example of such a matrix is found in the evaluation section of this article (Section 4.2, Table 1).
Sample of term-document matrix.
3.3. Phase 3: proposed filter construction
The third phase of the process is filter construction. This phase involves building both the rule-based filter using the DT-based ID3 algorithm, and the statistical filter using the Paul Graham algorithm. The ID3 filter is a well-known DT-based algorithm that uses the information gain measure to select the best feature and then creates a branch for each known value of the test feature; a recursive partitioning is then applied for each partition.
The Graham statistical filter is a modified version of a Naive Bayesian classifier originally proposed for spam e-mail filtering. The Graham filter algorithm, which is adopted from Graham [6], starts using two pre-processed training corpuses: one for alert e-mails and the other for informative e-mails. To categorize the incoming unlabeled e-mail, the algorithm builds a vector of e-mail tokens, and then for each token the frequency of that token in both training corpuses is computed.
The Graham algorithm will then take the sum of the two frequencies and check whether it is greater than a pre-defined threshold called min_count_for_inclusion, originally found in the Graham algorithm. This step is used to explore the important terms in an e-mail body. The algorithm proceeds by finding two factors for each category: the Alert factor and the Informative factor. The alert factor is the ratio between the frequencies of the token’s occurrence in the alert e-mail corpus to the total number of e-mails in the alert training corpus. The informative factor is the ratio between the frequencies of the token’s occurrence in the informative e-mail corpus to the total number of e-mails in the informative training corpus. The probability of an e-mail being an alert given that the token (wi) occurs in the message text is computed by taking the ratio between the alert factor and the sum of the alert factor and the informative factor together for that token.
The calculation of the alert and informative factors is performed for all important tokens in an e-mail body, followed by calculating an alert probability score for that message, which is a combination of the probabilities of the most 15 important tokens in the message to be filtered. The final decision for an e-mail category is determined by the alert probability score of that e-mail. A threshold could be chosen to determine whether to filter the e-mail as either a threat alert or an informative e-mail.
The proposed HPGDT algorithm consists of two phases for labeling an incoming e-mail. The major steps in the proposed filter are listed in Figure 2. The first phase starts by scanning a set of rules generated using the DT classifier where the rule accuracy is already computed for each rule. Two threshold values are used in the two filtering phases: the rule accuracy threshold (RAT), which is the rule-based filter threshold which indicates rule accuracy, and the D threshold, which is a predefined threshold used to decide whether to use the class label associated with the rule-based classifier or just using the probability score calculated using the Graham filter.

Steps in the HPGDT algorithm.
The Ts value is initially chosen to have either one of two values: RAT = D or RAT > D. Once all rules are scanned, the triggered rule is checked to decide whether its quality measure meets the predefined threshold, RAT: if the rule accuracy is larger than RAT, then the rule category is assigned to the unlabeled e-mail; otherwise the e-mail will be checked using the second phase, in which the Graham algorithm is involved.
In the second phase, an alert probability score is calculated by applying the steps mentioned earlier in the Graham algorithm. The probability score is then checked to decide whether its value is greater than the predefined statistical classifier threshold (D). If the value of the probability score is greater than D, then the e-mail will be directly classified as an alert e-mail; otherwise, the class label generated in the first phase is used to correctly label the unlabeled incoming e-mail. One of the special cases for the proposed filter is related to those instances where no rule could be triggered. The traditional technique in the rule-based system is to use a default rule for such instances. In the HPGDT algorithm, the class generated by the Graham filter is used in this case.
4. Experiments and evaluation
To evaluate the proposed filter, several metrics were used, including accuracy, true positives, false positives, and false negative rates. These metrics were also used to find the recall and precision rates for each filter. The F-measure was also used to find the best combination between precision and recall.
The true positive (TP) rate is measured by the number of actual alert e-mails that are classified as alert e-mails. The true negative (TN) rate is measured by the number of actual informative e-mails that are classified as informative e-mails. The TP and TN rates are calculated using Equations (4) and (5), respectively:
The false negative (FN) rate, given in Equation (6), is also measured as the number of actual alert e-mails that are incorrectly classified as informative e-mails. The false positive (FP) rate, given in Equation (7), is measured as the number of informative e-mails that were classified as alert e-mails:
The accuracy is the ratio of the correctly filtered e-mails to the total number of e-mails used in the evaluation phase. The accuracy is evaluated using Equation (8):
The recall (R) is an indication of the percentage of alert e-mails that has been filtered as alerts from the overall number of alert e-mails in the whole testing collection of e-mails. The precision measure (P) is used also as a quality measure to evaluate the filters of interest. The F-measure (F) is the harmonic mean of precision and recall, which aims to find the best combination of precision and recall. R, P, and F are calculated using Equations (9)–(11), respectively:
4.1. Data collection
Collecting Arabic alert e-mails was a major challenging issue in this research because there was no standard Arabic alert e-mail corpus that could be used for alert e-mail filtering. Alert e-mail detection is a relatively new research direction for e-mail filtering. Appavu et al. [2] also mentioned that the data collection step was a major challenge in their work and caused some bias and noisy results. In their research, the two datasets used for experiments were constructed in a brain-storming session.
In this research, in order to build the Arabic alert e-mail corpus, the e-mails have been collected manually from some news websites. These websites, such as Al-Jazeera Net (www.aljazeera.net) and BBC Arabic news (http://news.bbc.co.uk/hi/arabic/news), are a suitable medium for terrorist groups to spread alerts. The corpus consists of 1500 e-mails: 825 of them were alerts, representing about 55% of the total number of e-mails collected, and 675 were informative, representing about 45% of the total number of e-mails. Two examples of alert messages are as follows:
4.2. Data preparation
The e-mail corpus was processed in order to generate the term document matrix. A sample of the term document matrix is shown in Table 1, where every cell gives an indication about the existence (1) or absence (0) of a term in an e-mail text. Simple terms are found in the header of the matrix along with their English translation.
For experimental purposes, the corpus dataset was divided into two parts using the (80/20) rule, where the 80% of the collected e-mails were used in the training phase, representing about 1200 e-mails, while the other 20% of the collected e-mails, representing about 300 e-mails, were used in the evaluation (testing) phase. In each part, about 55% of the e-mails were alerts and about 45% were informative. Both datasets were preprocessed.
To build the DT rule-based filter, the ID3 implementation in the KNIME information-mining tool [26] was used. The rule generator settings used no average split points, no binary nominal split was chosen, and the quality measure used was the gain ratio. With each rule, its accuracy value was used to indicate the rule strength. The set of generated rules extracted using the DT learner was then transformed into a set of If–Then rules.
Table 2 shows the set of generated rules and the accuracy of each rule within the training samples. Each rule antecedent indicates a set of terms; those are the features that decide whether the e-mail is either an alert or an informative message. If the rule antecedent is triggered, the rule is then applied to filter the e-mail of interest. The rule consequent, which has the class label (‘yes’), indicates that the e-mail carries alert terms; such e-mails are then transferred directly into the alert e-mails folder. On the other hand, if the e-mail body text does not contain such terms and most of its text is in the past tense, it will be directly filtered as an informative e-mail and hence transferred into the informative e-mails folder.
Sample rules generated using ID3.
Table 3 gives a sample of token probabilities that were calculated using the Graham filter for all alert and informative messages in the corpus. Table 4 gives an example of the seven most important probabilities extracted from a message text. The final alert probability of the message, using these seven probabilities, was nearly equal to 0.9995. In comparison with the alert threshold, chosen to be 0.75, the final decision will be to filter the e-mail as an alert e-mail.
Graham tokens alert probabilities.
AP, alert probability.
Calculation of the most important tokens.
To choose the best combination of the DT and Graham methods, the number of tokens used in the probability calculation was evaluated on a set of 300 e-mails that was used previously in the DT filter evaluation. The best (5, 10, 15, 20, 25, and 30) token probabilities were used. The results are summarized in Table 3, where the features above 20 gave better results in terms of accuracy.
The final experiment conducted using the e-mail client application was to measure its performance in terms of accuracy, recall, precision, and F-measure. To address the effects on the performance of DT classifier, which shows some weaknesses in the previous experiment, different DT rule accuracy thresholds were used in the range [0.6–0.95]. A set of 300 e-mails were used in the HPGDT evaluation, 160 of them being alerts and the rest being informative messages.
The result of the evaluation of the HPGDT is given in Table 5. As shown, when the RAT increased from 0.7 to 0.9, the accuracy value of HPGDT increased from 83% to ~87%.
HPGDT algorithm performance.
RAT = Rule accuracy threshold, A = alert e-mail, I = informative e-mail, A% = accuracy, P% = precision, R% = recall, F% = F-measure.
The best F-measure value was achieved also when the RAT was 0.9 and 0.95, respectively. The highest recall value achieved, which was 91%, was generated when the RAT threshold was 0.9. From the results of Table 5, the values of the RAT in the range 0.85–0.95 achieved better results than the values in the range 0.6–0.8.These results indicated that the HPGDT algorithm achieved the best results when the DT rule accuracy threshold was set to 0.90, which produced an accuracy rate >87%. From the 11 rules produced by the DT algorithm, only five of them may be applied in the HPGDT filtering procedure. The applied rules are only those that have ≥90% accuracy. Of the five rules, only four have 100% accuracy; hence, ~54% of the rules were excluded when the RAT was set to 0.9.
5. Comparing the three algorithms
A comparison between the DT, Graham, and HPGDT algorithms is the last step in the experimental phase. The comparison between the three algorithms depends on the best results that have been achieved in the previous three experiments. The parameters used for each algorithm are shown in Table 6. Those parameters are the feature size of the DT filter, the interesting token count for the Graham filter, and in addition, the rule accuracy thresholds for the hybrid filter HPGDT.
Algorithm comparison criteria.
Table 7 shows the comparison results between the performances of the three evaluated algorithms. It is clear from the results that the DT filter shows some weaknesses in terms of accuracy, recall, and F-measures. The best accuracy achieved by the DT filter is 80%, for the Graham filter the best accuracy achieved is 85%, while the HPGDT algorithm has achieved the best accuracy rate of >87%.
DT and Graham algorithm performance.
A% = Accuracy, P% = precision, R% = recall, F% = F-measure.
In terms of precision and recall, the Graham and HPGDT algorithms have achieved better results than DT algorithm. The HPGDT has achieved the best results in terms of recall and F-measure values: the F-measure value of HPGDT algorithm is 88%, while the Graham algorithm has achieved about 85% in terms of F-measure rate.
6. Conclusion
An alert e-mail is an e-mail that carries information about an expected terrorist attack(s). In this research, a hybrid technique for Arabic language alert e-mail detection is proposed and evaluated. The proposed hybrid filter (HPGDT) depends on combining the features of the Graham statistical filter and the decision tree-based filter.
Two Arabic e-mail corpora were collected and pre-processed; the two corpuses consisted of 300 e-mails, of which ~55% were alert e-mails and the rest were informative. Three experiments were performed with the purpose of evaluating the performance of DT, Graham, and the hybrid HPGDT filters. Results show that the HPGDT achieved better performance in terms of accuracy, precision, and F-measure. The F-measure value of the HPGDT algorithm was 88%, while the Graham algorithm has achieved ~85% in terms of F-measure rate. The accuracy achieved by HPGDT was 87%, while it was 85% for the Graham filter. The DT filter achieved an accuracy of 80%, which was the lowest level of accuracy of the three algorithms evaluated.
The proposed HPGDT algorithm represents a new effective approach in the e-mail filtering area. It also shows competitive detection rates. The proposed approach is language-independent and it is worth evaluating for other languages.
