Abstract
Phishing websites trick honest users into believing that they interact with a legitimate website and capture sensitive information, such as user names, passwords, credit card numbers, and other personal information. Machine learning is a promising technique to distinguish between phishing and legitimate websites. However, machine learning approaches are susceptible to adversarial learning attacks where a phishing sample can bypass classifiers. Our experiments on publicly available datasets reveal that the phishing detection mechanisms are vulnerable to adversarial learning attacks. We investigate the robustness of machine learning-based phishing detection in the face of adversarial learning attacks.
We propose a practical approach to simulate such attacks by generating adversarial samples through direct feature manipulation. To enhance the sample’s success probability, we describe a clustering approach that guides an attacker to select the best possible phishing samples that can bypass the classifier by appearing as legitimate samples. We define the notion of vulnerability level for each dataset, which measures the number of features that can be manipulated and the cost of such manipulation. Further, we cluster phishing samples and show that some clusters of samples are more likely to exhibit higher vulnerability levels than others. This helps an adversary identify the best candidate phishing samples from which to generate adversarial samples at a lower cost. Our findings can be used to refine the dataset and develop better learning models to compensate for the weak samples in the training dataset.
Introduction
Motivation
Phishing, as defined in [27], is an attempt to obtain sensitive information such as usernames, passwords, and credit card details by masquerading as a trustworthy entity in an electronic communication. The first recorded mention of the term is found in AOHell, a 1995 hacking tool targeting America Online (AOL) users. The technique was elaborated in a presentation by Felix et al. as early as 1987 [9]. Phishing attacks have shown remarkable resilience against a multitude of defensive efforts, and attackers continue to generate sophisticated phishing websites that closely mimic legitimate websites. There were 328,000 unique attacks reported in 2007, and this number almost quadrupled by 2017 [2].
When viewed as a social-engineering attack, phishing cannot be solved solely by educating the end-users, and hence, automatic detection techniques are essential. Several defenses were proposed against phishing attacks, such as URL blacklisting, keyword-based filtering, IP address filtering, and machine learning-based techniques. Solutions like URL-blacklisting are no longer effective as attackers can evade such techniques through simple URL manipulation or by hosting websites on popular free hosting services on the Internet. Machine learning-based techniques appear to be a promising direction.
Problem statement
Phishing websites mimic legitimate websites to defraud honest Internet users and steal personal information. Machine learning-based techniques are effective in detecting patterns that separate different types of websites, i.e. phishing and legitimate. However, phishing and legitimate websites must first be represented as a set of features for use in machine learning algorithms. A feature is a measurable property or characteristic of a website. Researchers define a set of features and measure the feature values for each given website. Features can be defined at different levels, e.g. the contextual characteristics of the websites or the URLs of the websites. A labeled phishing dataset comprises a set of instances of phishing and legitimate websites, where each instance is represented by its feature values and has a label that indicates whether it is phishing or legitimate. A classification algorithm trains on one part of the labeled data to make predictions about the labels of the remaining parts, which have not been used for training.
Most works emphasize feature definition and aim to improve the statistical learning models to discriminate between phishing and legitimate websites. The state-of-the-art solutions for phishing detection [11,12,14,16,19] use engineered features based on observations made by the research experts in this domain on publicly available datasets. One crucial assumption in using machine learning approaches is that the training data collection process is independent of the attackers’ actions [6]. However, in adversarial contexts, e.g. phishing or spam filtering, this is far from the reality, as attackers either generate noisy data samples or generate new attack samples by manipulating features of existing phishing instances. Furthermore, manipulating features results in a dangerous scenario wherein an attacker can bypass the generated classifier without much effort. A carefully crafted phishing data sample that appears to a machine learning classifier as a legitimate sample is called an adversarial sample. The immediate impact of adversarial samples is to degrade the accuracy of a machine learning classifier. A key problem for the attacker is choosing the features that need to be manipulated and the associated cost of such manipulation. Ideally, the attacker would like to bypass the classifier with the lowest cost of manipulating the data sample features. In this work, we study the effect of adversarial sampling on phishing detection algorithms in depth, starting with some simple feature manipulation strategies, and show that even trivial feature manipulation has a surprising impact on classification accuracy.
Proposed approach and key contributions
We gathered four separate, publicly available phishing datasets developed by other researchers and applied adversarial sampling techniques to evaluate the robustness of the trained model against artificially generated samples. Although we do not show any solution to address this current threat, we demonstrate the vulnerability of the existing approaches and explore the datasets’ robustness against the engineered features and the learning models. Our key contributions are as follows:
We survey a full range of phishing detection techniques, focusing on machine learning-based approaches, and model the threat against them in terms of the attackers’ access, knowledge, and goal, which the attackers utilize to attack any given trained classifier model. We show the weakness of some well-known machine learning approaches and emphasize how a phisher can generate new phishing website instances, i.e., adversarial samples, to evade the machine learning classifier in each of these approaches. We define the vulnerability level of phishing instances, which quantifies the attacker’s effort and allows that effort to be optimized when generating adversarial samples. Finally, we describe a clustering approach that directs the attacker toward generating better adversarial samples with a higher likelihood of bypassing the classifier. We show that the clustering approach identifies data samples with higher vulnerability levels. We built an experimental setup, conducted a wide range of experiments, and analyzed how vulnerable the datasets and learning models are by testing them against the generated samples.
The rest of this paper is organized as follows. In Section 2, we describe a wide range of defense mechanisms against phishing attacks in the literature. We also describe the various adversarial attacks against machine learning classifiers in non-phishing domains. In Section 3, we model the threat from three points of view: the attackers’ goal, knowledge, and influence. In Section 4, we simulate an adversarial sampling attack, assess the vulnerability level, and quantify the cost of the attack. In Section 5, we describe the clustering-based approach for generating samples with a higher chance of bypassing the classifier at a lower cost. In Section 6, we explain the results of our experiments, which evaluate the robustness of the classifiers and datasets against these attacks. In Section 7, we conclude the paper and discuss some future work.
Related work
Machine learning for phishing detection
Researchers engineered novel sets of features from diverse perspectives based on public datasets of phishing and legitimate websites in prior machine learning approaches for phishing detection. Machine learning algorithms are well suited to assimilate common attack patterns such as hidden fields, keywords, and page layouts across multiple phishing data instances and train learning models that are resilient to small variations in future unknown phishing data instances.
According to a Symantec report [22], the number of URL obfuscation based phishing attacks was up by 182.6% in 2017. Some URL obfuscation techniques used by attackers are misspelling the targeted domain name, using the targeted domain name in other parts of the URL such as the sub-domain, and adding sensitive keywords like ‘login’, ‘secure’, or ‘https’. Researchers defined machine learning features based on the URLs to capture these techniques and trained learning models. For example, Jiang et al. [11] merged information from DNS and the URL to develop a Deep Neural Network (DNN) with the help of Natural Language Processing (NLP) to detect phishing attacks. While other approaches need to specify features explicitly, this method extracts features automatically. However, this approach relies on information from third-party services, such as search or DNS queries, to enrich the feature set and make it more reliable. This also endangers users’ privacy, because third-party inquiries made to fetch feature values reveal the end-users’ browsing history.
Sahinguz et al. [18] addressed this issue and proposed a real-time detection mechanism based on Natural Language Processing (NLP) of URLs on a large dataset of features derived from URL obfuscation without requiring third-party inquiry, and achieved an accuracy of more than 95%. URL-based phishing detection approaches are promising but have two limitations: (i) attackers have full control over URLs and can create any URL, and (ii) the pages’ contents are not considered. The website’s content, rather than the URL or domain name, is the most critical factor in luring end-users. Therefore, any solution that ignores the websites’ content would not be useful in the real world.
Niakanlahiji et al. [14] introduced PhishMon, a scalable feature-rich framework with a series of new and existing features derived from HTTP responses, SSL certificates, HTML documents, and JavaScript files and reported accuracy of 95%.
Shirazi et al. [19] observed two concerns with existing machine learning approaches: a large number of training features and bias in the type of datasets used. Their study focused on the features derived from the domain name usage in phishing and legitimate websites, not the URL, and reported an accuracy of 97–98% on the chosen datasets.
Li et al. [12] proposed an approach that extracts features from both the URL and the webpage content and runs multiple machine learning techniques, including GBDT, XGBoost, and LightGBM, in several layers, referred to as a stacking approach. The experiments were conducted on three datasets, of which two are large ones with 50K instances, and the accuracy is more than 97% in all cases. Although this approach does not use third-party services, it is otherwise similar to recent machine learning approaches such as [19].
While these approaches have demonstrated excellent results for detecting phishing websites, they also suffer from severe disadvantages due to adversarial sampling, as we show in the following discussion.
Learning in adversarial context
Defense mechanisms proposed in the literature widely employ machine learning techniques to counter phishing attacks. However, adversarial sampling attacks can threaten current defense mechanisms. An adversarial sampling attack is when an adversary generates a phishing data sample based on existing phishing samples to avoid detection by the classifier. In general, such a sample is called an adversarial sample. While there is some general analysis of the vulnerabilities of classification algorithms and the corresponding attacks [10], to the best of our knowledge, there is no other study on adversarial sampling in the context of phishing attacks. Thus far, researchers have studied and formulated these threats in a general manner or in other application contexts such as image recognition. In the following, we briefly explore these efforts.
Dalvi et al. [6] studied the problem of adversarial learning as a game between two active agents: the data miner and the adversary. The goal of each agent is to minimize its own cost and maximize the cost to the other agent. In this approach, the classifier adapts to the environment and its settings either manually or automatically. The authors assumed that both sides, the data miner and the adversary, have perfect knowledge about the problem. However, this assumption does not hold in many situations; in Section 3, we model our adversary and elaborate on why the adversary cannot have perfect knowledge.
Xiao et al. [26] explored the vulnerabilities of feature selection algorithms under adversarial sampling attacks. They extended a previous framework [3] to investigate the robustness of three well-known feature selection algorithms.
There are a few approaches that create more secure machine learning models. Designing a secure learning algorithm is one way to build a classifier that is more robust against these attacks. Demontis et al. [7] investigated a defense method that can improve the security of linear classifiers by learning more evenly distributed feature weights. They presented a secure SVM called Sec-SVM to defend against evasion attacks with feature manipulation. Wang et al. [25] analyzed the robustness of the k-nearest neighbor algorithm in the context of adversarial examples. They introduced a modified version of the k-nearest neighbor classifier where k is equal to 1 and theoretically guaranteed its robustness on a large dataset.
Shirazi et al. [20] used an adversarial autoencoder to generate synthesized phishing samples and tested these samples against models trained with real-world data. It is shown that a portion of generated samples was able to evade existing detection models. Some synthesized samples have been used for training and showed the new learning models are more robust against adversarial attacks and hold higher accuracy. In other words, real-world phishing site data augmented with synthesized data used for training the model provides more robust classifiers which are more effective for phishing detection.
Finally, there are some tools for benchmarking and standardizing the performance of machine learning classifiers against adversarial attacks in the literature. Cleverhans [15] is an open-source library that provides an implementation of adversarial sample construction techniques and adversarial training for image datasets. Given the lack of benchmarking tools for the phishing problem, we tested our approach with our attack strategies and implementation.
Threat model
In this section, we model the adversarial sampling attack against machine learning-based phishing detection approaches. We start with the attacker’s goal, knowledge, and influence in general machine learning solutions, and then we explain them in the context of our phishing problem. We model the adversarial sample generation for existing phishing instances based on the attackers’ abilities and then evaluate the cost that the adversary has to pay for the successful execution of this attack. Finally, we define the vulnerability level for the dataset.
Attacker’s goal
Biggio et al. [4] explored three different goals for attackers in a reactive arms race, namely security violation, attack specificity, and error specificity. An attacker’s goal in a security violation is to compromise one of the well-known security properties: confidentiality, availability, or integrity. The attacker may violate the availability of the system through a denial-of-service attack. In this case, if the system cannot accomplish the desired task due to the attacker’s behavior, the availability of the service is affected. To violate confidentiality, the attacker needs to obtain users’ sensitive and private information with approaches such as reverse engineering.
In a phishing context, the adversary attacks the integrity of the system. Integrity is violated when the attack does not disrupt the regular system behavior but degrades the accuracy of the classifier, e.g. by making the classifier label maliciously crafted phishing instances as legitimate. Attack specificity depends on whether the attacker wants a specific set of samples (such as phishing) or any given sample to be incorrectly classified. Error specificity relates to a specific type of error in the system that degrades the classifier’s scores.
Attacker’s knowledge
An attacker may have different levels of knowledge about the machine learning model. An attacker might have detailed knowledge of the model, i.e., white-box or perfect knowledge; minimal knowledge, called zero knowledge [4,26]; or limited knowledge, known as gray-box. If the adversary knows everything about the learning model, including the classifier parameters and the training dataset, the attacker has perfect knowledge. In the case of zero knowledge, the adversary can probe the model by sending instances and observing the results. The adversary infers information about the model by choosing appropriate data samples. In the case of limited knowledge, it is assumed that the adversary knows the features, their representation, and the learning algorithm. However, the adversary does not know the training set or the algorithm’s parameters.
From the dataset point of view, the attacker may have partial or full access to the training dataset. The attacker may also have partial or full knowledge about the feature representation or feature selection algorithm and its criteria. In the worst-case scenario, an attacker may know about the subset of selected features.
Attacker influence
Two major types of attacker influence have been defined in the literature, namely poisoning and evasion attacks. In a poisoning attack, the adversary generates and injects adversarial instances in the training phase. Adversarial instances are the ones with manipulated labels. For example, email providers use spam detection services and give the users the ability to override the email’s label e.g. re-labeling a spam email as non-spam to deal with false-positive detection cases. The system benefits from the user’s labeling to improve accuracy by updating the training set. However, in a poisoning attack, an attacker, with an authorized email account in the system, can re-label the correctly detected spam emails as non-spam to poison the training set which results in a poor learning model that is easy to bypass even with slightly manipulated phishing instances.
In an evasion attack, the attacker does not have access to the training set and deliberately manipulates features at the testing phase so that the classifier does not label the samples correctly. Similar to the previous example on the spam detection system, a phisher may send an email with intentionally misspelled words to evade the classifier.
Our assumption
In this subsection, we define the threat model that we assumed in this work.
In the next section, we describe our sample generation approach and outline our method for measuring the effectiveness of the samples in lowering the classifier’s accuracy.
Adversarial sampling for phishing
We simulate the attacker’s approach to generate new adversarial samples based on the existing phishing samples. The adversary generates new samples by manipulating phishing samples’ features and then checks whether the generated samples evade the classifier. A phishing sample evades the classifier if it is labeled as a legitimate sample. All such generated samples that bypass the machine learning classifier are adversarial samples. The motivation for using features from existing phishing samples is to ensure that the generated samples possess some key phishing characteristics. We assume that the attacker has full control over the URL and phishing page content except for the domain name part. The attacker has limited knowledge about the classifier and features, as discussed in Section 3.2.
Defining the dataset
We use similar notation to that used in [26]. The dataset has been generated by a procedure
Each instance
Selecting features for manipulation
To specify a subset of features, we introduce the notation
To illustrate,
The first step towards generating samples is to select one or more features for manipulation. The generative algorithm can be represented in terms of function
Table of notation
Generating samples
After defining the features that will be manipulated, we must assign new values to them. We assume that each value will be replaced only by values that appeared in existing phishing instances. The intuition is that if a value has previously been assigned to that feature in a phishing instance, then that feature value is more likely to be found in another phishing instance. In Algorithm 1, in lines 4 to 5, we used the Cartesian product to generate all possible combinations of values for the selected features, taking the values from phishing instances.
For the features that have been selected for manipulation, the corresponding bit in π will be 1. In this case, the
If the newly generated sample is equal to the given input, we discard it because it does not contain any changes; otherwise, we include it in the set of
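The generation step described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the function name `generate_candidates`, the tuple encoding of samples, and the list-of-bits encoding of the mask π are all hypothetical choices made for the example.

```python
from itertools import product


def generate_candidates(sample, phishing_samples, mask):
    """Build candidate samples from `sample` by manipulating the masked features.

    mask[i] == 1 marks feature i as selected for manipulation (the bit vector pi).
    Replacement values are drawn only from values already observed in
    `phishing_samples`, as the text assumes.
    """
    selected = [i for i, bit in enumerate(mask) if bit == 1]
    # Distinct values observed for each selected feature in phishing instances.
    observed = [sorted({p[i] for p in phishing_samples}) for i in selected]

    candidates = []
    # The Cartesian product enumerates every combination of observed values.
    for combo in product(*observed):
        new_sample = list(sample)
        for i, value in zip(selected, combo):
            new_sample[i] = value
        new_sample = tuple(new_sample)
        # Discard a candidate that is identical to the input sample.
        if new_sample != tuple(sample):
            candidates.append(new_sample)
    return candidates


# Toy data (hypothetical): three phishing instances with three features each.
phishing = [(1, 0, 5), (0, 1, 5), (1, 1, 7)]
candidates = generate_candidates(phishing[0], phishing, mask=[0, 1, 1])
```

Each candidate would then be passed to the trained classifier; only candidates labeled as legitimate count as adversarial samples. Note that the number of candidates grows multiplicatively with the number of selected features.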
Adversary cost
Attackers have to handle two challenges for generating adversarial samples. From a machine learning point of view, the dataset includes feature vectors. Still, the attacker has to change the phishing website to generate the desired vector similar to adversarial samples. For example, if a feature is URL length, the adversary can generate a new URL with the desired length based on adversarial samples. This is not a trivial process, and it has a considerable cost for the attacker. Whereas adversarial samples may have a higher chance of evading the classifier, they may not be visually or functionally similar to the targeted websites. This increases the chance of being detected by the end-user. Thus, the adversary wants to minimize two parameters: the number of manipulated features and the assigned feature values. We consider this as a cost function for the adversary.
In the previous section, we discussed how the attacker controls the number of manipulated features, but this is not the only parameter. If the manipulated feature values differ greatly from the original values, the sample’s chance of evading the classifier increases. We study this hypothesis in Section 6. Such a difference also changes the website’s visual appearance or behavioral functionality relative to the targeted website, thereby increasing the chance of the phishing website being detected by the end-user.
In this work, we used the Euclidean distance between the original phishing sample and newly generated sample to estimate the cost; a higher distance indicates a higher cost. Consider
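As a small illustration of this cost estimate, the sketch below computes the Euclidean distance between an original feature vector and a generated one. The function name `manipulation_cost` and the example vectors are hypothetical, and the sketch assumes all features are numeric and on comparable scales.

```python
import math


def manipulation_cost(original, generated):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, generated)))


# Two features differ (by 1 and by 2), so the cost is sqrt(1 + 4).
cost = manipulation_cost((1, 0, 5), (1, 1, 7))
```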
Vulnerability level
A phishing instance is vulnerable at the level l with the cost d if there is at least one adversarial instance generated from this phishing instance that can bypass the machine learning classifier with l manipulated features and a distance d from the original instance. In other words, we call this instance vulnerable if manipulating l features of the original phishing instance with a distance of d allows it to bypass the classifier. The attacker’s goal here is optimizing the l and d, a multi-objective optimization problem for the attacker. For example, suppose we have a phishing instance, and we are able to generate an adversarial sample by manipulating 3 features with Euclidean distance of 2.7. In that case, we say that the original phishing sample is vulnerable at the level of 3, with a cost of 2.7.
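Because the attacker optimizes both l and d, one way to summarize an instance is to keep only the (l, d) pairs that are not dominated by any other successful pair. The sketch below is an assumption about how this could be done, not the procedure used in our experiments; the name `pareto_front` and the input format, a list of (number of manipulated features, distance) pairs for samples that bypassed the classifier, are hypothetical.

```python
def pareto_front(evasions):
    """Return the non-dominated (l, d) pairs among successful evasions.

    A pair is dominated if another pair is no worse in both objectives
    and strictly better in at least one.
    """
    front = []
    for l, d in evasions:
        dominated = any(
            (l2 <= l and d2 <= d) and (l2 < l or d2 < d)
            for l2, d2 in evasions
        )
        if not dominated:
            front.append((l, d))
    return sorted(set(front))


# (3, 3.1) is dominated by (3, 2.7); (4, 1.9) trades one more feature for less distance.
front = pareto_front([(3, 2.7), (4, 1.9), (3, 3.1)])
```

An instance with an empty front produced no adversarial sample and is therefore not vulnerable under this definition.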
Directed adversarial sampling
In the approach described so far, the adversary needs to adopt a trial-and-error approach with a given phishing sample, i.e., the attacker is not sure whether a given phishing sample can be used to generate an adversarial sample that can bypass the classifier. This process is further constrained if the attacker attempts to minimize the cost of generating such adversarial samples. As a result, the attacker’s effort increases significantly, as the attacker needs to experiment with each sample and try various feature manipulation combinations to generate an adversarial sample. To address these problems, we describe a clustering-based pre-processing approach that directs the attacker towards selecting the best possible phishing samples that are likely to bypass the classifier. At the same time, this approach helps the defender identify the features that are more likely to be useful to adversaries and refine the existing machine learning model.
Outline of clustering approach
In general, the clustering of data samples using standard approaches like the k-means algorithm [13] generates groups of samples that share a significant number of common features or have similarities in a few dimensions. This feature of clustering algorithms is the key intuition for our improved adversarial sample generation technique.
Concisely stated, our approach first clusters the phishing samples using a standard clustering algorithm such as k-means and initializes a per-cluster counter to zero. Next, we select one random sample from each of the clusters to generate adversarial samples using Algorithm 1. If the generated sample is adversarial, we increment the per-cluster counter of the respective cluster. Next, we repeat the experiment with a few more samples by progressively selecting more samples from successful clusters, i.e., the cluster with higher per-cluster counter values, after the initial testing period.
Because clustering places similar data samples in the same cluster, we surmise that a cluster that has contributed to adversarial samples is more likely to contribute to many other adversarial samples. Our experimental results show that this is indeed the case and demonstrate that the clustering approach significantly improves the success rate of generating adversarial samples.
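The per-cluster counter procedure outlined above can be sketched as follows. The sketch assumes that cluster membership has already been computed (e.g. by k-means) and is passed in as a dictionary, and that `try_attack` is a hypothetical callback that returns True when an adversarial sample can be generated from the given phishing sample. Selecting the cluster with the highest counter after the initial period is one simple reading of "progressively selecting more samples from successful clusters"; other selection rules are possible.

```python
import random


def directed_sampling(clusters, try_attack, warmup_rounds=1, extra_rounds=10, seed=0):
    """Count adversarial successes per cluster, favoring successful clusters.

    clusters: dict mapping cluster id -> list of phishing samples.
    try_attack: callback(sample) -> bool (hypothetical attack oracle).
    """
    rng = random.Random(seed)
    counters = {cid: 0 for cid in clusters}  # per-cluster counters start at zero

    # Initial testing period: one random sample from every cluster per round.
    for _ in range(warmup_rounds):
        for cid, members in clusters.items():
            if try_attack(rng.choice(members)):
                counters[cid] += 1

    # Afterwards, draw from the cluster with the highest counter so far.
    for _ in range(extra_rounds):
        best = max(counters, key=counters.get)
        if try_attack(rng.choice(clusters[best])):
            counters[best] += 1
    return counters


# Toy example (hypothetical): only samples in cluster 1 can be turned adversarial.
clusters = {0: [(0, 0), (1, 0)], 1: [(5, 5), (6, 4)]}
counters = directed_sampling(clusters, try_attack=lambda s: s[0] > 1)
```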
Correlating cluster membership and adversarial sampling
From our experimental results, we make a few important observations and state them here. We clustered adversarial samples using the existing clusters of the data. If an adversarial sample belongs to a different cluster than the cluster to which the original phishing sample belonged, we denote such an adversarial sample as transferred sample. This definition captures a key notion that an adversarial sample is likely to belong to a different cluster due to the feature manipulation. When viewed from a different perspective, this indicates that a generated sample is likely to be an adversarial sample if the generated sample’s cluster membership is different from the original phishing sample from which it was generated. We demonstrate this characteristic using our experimental results in Section 6.
Using this notion of transferred samples, we define the correlation between adversarial samples and cluster membership transfer. For this purpose, we calculate the probability of an adversarial sample being transferred to a new cluster. Formula (3) articulates the probability of success in generating an adversarial sample when the generated sample is transferred to a new cluster. In this formula,
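One plausible empirical estimate of this probability is the fraction of transferred samples that turn out to be adversarial. The sketch below is an assumption and is not claimed to match Formula (3) term by term; the function name and the record format, a list of (transferred, adversarial) boolean pairs with one pair per generated sample, are hypothetical.

```python
def success_given_transfer(records):
    """Estimate P(adversarial | transferred) from per-sample observations.

    records: list of (transferred, adversarial) boolean pairs.
    Returns None when no generated sample changed cluster.
    """
    outcomes = [adversarial for transferred, adversarial in records if transferred]
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


# Three samples changed cluster; two of them bypassed the classifier.
records = [(True, True), (True, True), (True, False), (False, False), (False, True)]
p_success = success_given_transfer(records)
```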
Experiments and results
This section shows the effectiveness of our proposed adversarial sampling attack, which degrades the accuracy and efficacy of existing learning models. First, we discuss the datasets utilized; then, we elaborate on the three experiments we conducted and their results.
Used datasets
We obtained four publicly available phishing datasets on the Internet, and the details of these datasets are given below.
Table 2 summarizes the number of instances, features, and the portion of legitimate vs. phishing instances in each dataset. We have datasets with a large number of instances, DS-2 and DS-4, with 11000 and 10000 instances, respectively. We also have a small dataset, DS-3, with 1250 instances. With respect to the number of features, DS-1 has just seven features, whereas DS-4 has 48 features. In addition, each dataset’s features are selected from different points of view, such as URL-based features in DS-2, DS-3, and DS-4, domain-related features in DS-1, and HTML-based features in DS-2 and DS-4. These variations allow us to test our hypothesis in a stronger and more general setting. They also show that adversarial sampling is a severe problem in which different types of features may be manipulated to evade the classifier.
Number of instances, features, and portion of legitimate and phishing websites in each dataset
Data definitions
Definition of performance metrics
In Table 3, we describe the notation used in our experimental results. For evaluating the robustness of the classifier against the adversarial samples, we used standard machine learning metrics as shown in Table 4. We calculated: True Positive Rate (TPR), Positive Predictive Value (PPV), F1, and Accuracy (ACC) to evaluate the performance of our proposed approach.
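For reference, the four metrics can be computed from confusion-matrix counts using their standard definitions, as in the sketch below. It assumes that phishing is treated as the positive class and that all denominators are nonzero; the function name and the example counts are hypothetical and are not taken from our results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (positive class assumed to be phishing)."""
    tpr = tp / (tp + fn)                   # True Positive Rate (recall)
    ppv = tp / (tp + fp)                   # Positive Predictive Value (precision)
    f1 = 2 * ppv * tpr / (ppv + tpr)       # harmonic mean of PPV and TPR
    acc = (tp + tn) / (tp + fp + tn + fn)  # overall accuracy
    return {"TPR": tpr, "PPV": ppv, "F1": f1, "ACC": acc}


# Illustrative counts only.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
```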
Phishing detection accuracy without adversarial sampling
In the first experiment, we tested each dataset’s performance against a wide range of standard classifiers. We labeled phishing websites in all datasets as
For DS-1, RF and GB both generate the highest ACCs and the TPRs for both classifiers are comparable. Also, DS-1 has the best average of TPR among all datasets. RF gives the best TPR (94.25%) and ACC (95.76%) on DS-2. Interestingly, the DT does not generate a good TPR for DS-2 (86.77%). The DS-3 dataset experiments did not yield a high TPR or the ACC. Both GB and SVM with Gaussian kernel have the TPRs close to 87%, which is not that good. The best ACC, for this dataset, is from GB, with 83%. The experiment on DS-4 gave excellent results. Both GB and RF gave a TPR over 97% and accuracy of 97%, which are very high. This dataset has the best average of ACC among different classifiers meaning this dataset performs very well with different types of classifiers. With six different classifiers, the experiments on both DS-1 and DS-4 show an average ACC of more than 94%, which is significantly high.
Evaluation of model against different classifiers with two metrics
The classifier that holds best F1 on each dataset has been selected. TPR and ACC are also reported for comparison
We used a single metric of F1 to compare all classifiers and datasets together. Table 6 shows the best F1 score for each dataset with the classifier that has produced that result. It is evident from this table that both GB and RF generate the best results among all of the experiments, so we selected these two classifiers for the remaining experiments.
We reserved 200 random phishing instances in each dataset and then trained the model without them. The generated adversarial samples need to be similar to the phishing examples; otherwise, they cannot be assumed to be phishing instances. We used values previously seen in the phishing instances to assign new values to the features and generate new instances. This strategy guarantees that each newly assigned value is valid and has already been seen in other phishing instances in the dataset. We discussed this process earlier in Section 4. We randomly selected up to four different features and replaced each selected feature’s value with every possible feature value. If an adversarial sample is generated, we consider the original phishing instance to be vulnerable. A given phishing instance can generate several adversarial samples with varying costs, as defined in Section 4.4. We call the phishing samples with the lowest cost of generating adversarial samples optimized samples.
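The hold-out step at the start of this protocol can be sketched as follows. The sketch assumes the dataset is a list of (features, label) pairs with label 1 denoting phishing; this encoding, the function name `reserve_phishing`, and the toy dataset are hypothetical.

```python
import random


def reserve_phishing(dataset, n_reserved=200, seed=0):
    """Hold out `n_reserved` random phishing instances; train on the rest.

    dataset: list of (features, label) pairs, label 1 = phishing (assumed encoding).
    """
    rng = random.Random(seed)
    phishing_idx = [i for i, (_, label) in enumerate(dataset) if label == 1]
    reserved_idx = set(rng.sample(phishing_idx, n_reserved))
    reserved = [dataset[i] for i in sorted(reserved_idx)]
    training = [row for i, row in enumerate(dataset) if i not in reserved_idx]
    return training, reserved


# Toy dataset: 1000 instances, half of them labeled as phishing.
dataset = [((i,), i % 2) for i in range(1000)]
training, reserved = reserve_phishing(dataset)
```

The reserved instances are never seen during training, so any adversarial sample derived from them is evaluated against a model that has no direct knowledge of the original instance.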
Robustness of the learning model

Robustness of datasets against adversarial samples.
This experiment studies the robustness of datasets and learning models against generated adversarial samples. We selected one classifier that performs best for each dataset based on the F1 score from Table 6. For the datasets DS-1 and DS-3, we selected GB, and for DS-2 and DS-4, we chose RF.
In this experiment, we counted the number of reserved phishing instances that are vulnerable, that is, instances from which at least one adversarial sample (the one with the lowest cost) can be derived. With a small perturbation, these instances can bypass the classifier and lure users into releasing their critical information. Under our hypothesis, these vulnerable instances pose a threat to the learning model. We repeated each experiment ten times and reported the average of the results.
Figure 1 shows the results of our experiment. The x-axis shows the number of manipulated features; zero manipulated features means that the test was run on the original phishing instances that the classifier detected correctly. The trend reveals that the number of evading samples increases proportionally with the number of perturbed features. We increased the number of perturbed features up to four at a time and observed that, with four features, almost all manipulated phishing instances bypass the classifier model.
For example, Fig. 1 shows that less than
This experiment shows how vulnerable machine learning models for phishing detection are: a small perturbation of the features can bypass the classifier and degrade its accuracy significantly.
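As a rough illustration of how a curve like the one in Fig. 1 can be computed, the sketch below turns per-instance results into a TPR for each number of manipulated features. The input format is an assumption made for the example: each reserved instance is represented by the smallest number of features whose manipulation lets it evade the classifier, with `0` for an instance that is already misclassified and `None` for an instance that never evades within the budget.

```python
def tpr_by_budget(min_features_to_evade, max_features=4):
    """Return the TPR on the reserved instances for budgets 0..max_features.

    An instance is still detected under budget k if it cannot evade at all
    (None) or needs more than k manipulated features to evade.
    """
    total = len(min_features_to_evade)
    curve = []
    for k in range(max_features + 1):
        detected = sum(
            1 for m in min_features_to_evade if m is None or m > k
        )
        curve.append(detected / total)
    return curve
```

For example, `tpr_by_budget([1, 2, None, 0, 4])` returns `[0.8, 0.6, 0.4, 0.4, 0.2]`: the TPR can only stay flat or fall as the attacker is allowed to change more features.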
In this experiment, we studied the cost that an adversary has to pay to bypass a classifier. From an adversary perspective, it is not inexpensive to manipulate an instance with new feature values to create an adversarial sample. In Section 4.4, we assessed the cost and in Section 4.5, we defined the term vulnerability level for one instance. Once again, we reserved 200 random phishing instances from each dataset and chose the classifier for each dataset based on Table 6. For datasets DS-1 and DS-3, we chose GB while we chose RF for both DS-2 and DS-4 datasets. Averaging the vulnerability level for each of the 200 selected instances and repeating the experiment ten times, we assessed the whole dataset’s vulnerability level.
Figure 2 presents the results of this experiment for all datasets for two parameters: the number of manipulated features and the average cost of adversarial instances. It is evident that, by increasing the number of manipulated features, the cost also increases steadily. For example, for the dataset DS-1, the average cost, for adversarial samples, with one manipulated feature is 0.95, and with four manipulated features, the cost is 3.93.
Furthermore, the average cost is higher for some datasets than for others. For example, in DS-4 the adversary incurs a higher cost than in the other datasets, particularly when the number of manipulated features increases to three and four. This shows that DS-4 is more robust against these attacks and has a lower vulnerability level.
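The averaging behind this cost analysis can be sketched as follows. The actual cost function is the one defined in Section 4.4; here the Euclidean distance between the original and the adversarial feature vectors is used purely as a stand-in assumption, and the function names and input layout are hypothetical.

```python
import math
from collections import defaultdict


def cost(original, adversarial):
    # ASSUMPTION: Euclidean distance stands in for the cost of Section 4.4.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, adversarial)))


def average_min_cost(samples_by_instance):
    """Average the lowest adversarial cost per number of changed features.

    samples_by_instance maps each original phishing instance (a tuple) to a
    list of (n_changed, adversarial_instance) pairs generated from it.
    """
    per_k = defaultdict(list)
    for original, found in samples_by_instance.items():
        best = {}
        for k, adv in found:
            c = cost(original, adv)
            best[k] = min(c, best.get(k, math.inf))
        # Keep only the cheapest (optimized) sample for each budget k.
        for k, c in best.items():
            per_k[k].append(c)
    return {k: sum(v) / len(v) for k, v in sorted(per_k.items())}
```

Repeating this over ten random draws of 200 reserved instances and averaging the resulting dictionaries would give one cost curve per dataset.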

The manipulation cost for adversarial samples based on number of manipulated features.
We discuss using the clustering approach described in Section 5 and present the results. In this experiment, we calculated the probability of transferred samples and adversarial samples as we discussed in Section 5. For each dataset, we calculated the probability of generating adversarial samples and also the probability of such a sample being transferred to a new cluster from the original cluster.

Ratio of bypassing and transferring adversarial samples in tested datasets.
Figure 3 shows the adversarial sampling probability and transferred samples for each dataset. On average, more than 60% of all adversarial samples in DS-1, DS-2, and DS-4 datasets were able to bypass the classifier. For the DS-3 dataset, the bypassing rate is around 30%.
Another measure in Fig. 3 is the transferring rate, that is, the fraction of new adversarial samples that are assigned to a cluster other than the original one. In all datasets, the average transferring rate is at least 75%. This reveals that the majority of adversarial samples belong to a different cluster than their original sample. This is the first significant finding related to the clustering approach.
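The two rates in Fig. 3 can be computed along the following lines. This is a sketch under assumptions: the clustering of Section 5 is replaced by a nearest-centroid assignment over already fitted centroids, and `predict` is a hypothetical callable that returns 1 for phishing and 0 for legitimate.

```python
def nearest_cluster(x, centroids):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])),
    )


def bypass_and_transfer_rates(pairs, centroids, predict):
    """Return (bypass rate, transfer rate) over (original, manipulated) pairs."""
    bypassed = transferred = 0
    for original, manipulated in pairs:
        if predict(manipulated) == 0:
            bypassed += 1
        if nearest_cluster(original, centroids) != nearest_cluster(
            manipulated, centroids
        ):
            transferred += 1
    n = len(pairs)
    return bypassed / n, transferred / n
```

Grouping the pairs by the cluster of the original instance before calling the function gives the per-cluster view.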

Distribution of bypassed and transferring samples for each cluster in all of tested datasets: (a) DS-1; (b) DS-2; (c) DS-3; and, (d) DS-4.
This experiment investigated the adversarial sampling and transferring probabilities for each cluster. Figure 4 depicts how these probabilities vary among clusters. Figure 4(a) for DS-1 shows that the probabilities of bypassing the classifier and of transferring to another cluster are nearly uniform, with no significant difference among clusters. Figure 4(b) for DS-2 shows some variation among clusters. Clusters 3 and 8 have the highest chance of generating adversarial samples. Cluster 5 has a significantly low chance of transferring an adversarial sample.
The chance of generating an adversarial sample does not vary among clusters, as shown in Fig. 4(c) for DS-3. The same pattern can be seen for transferring as well. There is a big gap between the probability of generating adversarial samples and the probability of transferring in this dataset, something that is not seen in the other datasets.
Figure 4(d) shows results for DS-4. Cluster 8 has the lowest chance of generating adversarial samples, and clusters 3 and 4 have the highest one. Cluster 4 also has the highest chance of transferring to other clusters.
This experiment used conditional probability to show how adversarial samples and transferred samples are correlated. For this purpose, we calculated the probability that an adversarial sample is transferred to another cluster, which shows how likely an adversarial sample is to end up in a new cluster. Figure 5 depicts these results.
In this Figure,

Conditional probability of adversarial samples.
This knowledge for an adversary is compatible with the threat model defined in Section 3. In our proposed model, an attacker has access to the predict function and phishing website.
Figure 5 shows that in DS-1, DS-2, and DS-4, the probability of generating an adversarial sample, given that the manipulated sample is transferred from its original cluster to a new one, is at least 65%. This hints that attackers should target features whose manipulation transfers the instance to another cluster. We also calculated the conditional probabilities for the cases in which these two events do not occur. Based on the results, there is no significant correlation between these two probabilities.
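A minimal sketch of these conditional probabilities, estimated from counts, is shown below. The input format and the key names are assumptions for the example: each manipulated sample is recorded as a pair of booleans, whether it bypassed the classifier and whether it moved to a different cluster.

```python
def conditional_rates(events):
    """Estimate conditional probabilities from (bypassed, transferred) pairs."""

    def ratio(num, den):
        # An empty conditioning set leaves the probability undefined.
        return num / den if den else float("nan")

    both = sum(1 for b, t in events if b and t)
    n_transferred = sum(1 for _, t in events if t)
    n_bypassed = sum(1 for b, _ in events if b)
    bypass_only = sum(1 for b, t in events if b and not t)
    n_not_transferred = len(events) - n_transferred
    return {
        "bypass_given_transfer": ratio(both, n_transferred),
        "transfer_given_bypass": ratio(both, n_bypassed),
        "bypass_given_no_transfer": ratio(bypass_only, n_not_transferred),
    }
```

Comparing `bypass_given_transfer` with `bypass_given_no_transfer` indicates how much a cluster change is associated with evasion in a given dataset.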
As discussed earlier in Section 4.4, generating adversarial samples is not an inexpensive process, and an adversary would like to optimize this effort. This section defines the probability of generating adversarial samples and identifying clusters with a lower cost to optimize an adversary’s efforts. To achieve this goal, we considered the following parameters.
Probability of samples belonging to cluster i is denoted as
In this formula,
In this case,
With these parameters, we are in a position to calculate the probability of generating adversarial samples based on instances chosen from a specific cluster while focusing on the membership transfer of such adversarial samples to another chosen cluster.
A cluster membership transfer that has a high probability but also a high cost is not desirable and receives a lower total score than a transfer with high probability and low cost. From an adversary's perspective, the desired transfer is the one with the highest probability and the lowest cost.
We visualize these probability and cost metrics in Fig. 6 to show the best transfer that can be made. The X axis shows the initial cluster of phishing samples, and Y axis shows the cluster of generated adversarial samples. Darker colors show lower probability and higher cost. Lighter colors show higher probability and lower cost. In essence, the heat map shows what transfer has the highest probability of adversarial samples with the lowest cost.
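The matrix behind such a heat map can be sketched as follows. The exact scoring formula is not reproduced here; the sketch assumes, for illustration only, that a transfer's score is the probability of obtaining an adversarial sample divided by the average cost of those samples, so that higher probability and lower cost both raise the score. The record layout and function names are hypothetical.

```python
from collections import defaultdict


def transfer_scores(records):
    """Score each (source cluster, destination cluster) transfer.

    records: iterable of (src, dst, bypassed, cost), one per manipulated
    sample, where cost is the manipulation cost of that sample.
    """
    count = defaultdict(int)
    hits = defaultdict(int)
    cost_sum = defaultdict(float)
    for src, dst, bypassed, c in records:
        count[(src, dst)] += 1
        if bypassed:
            hits[(src, dst)] += 1
            cost_sum[(src, dst)] += c
    scores = {}
    for key, n in count.items():
        if hits[key] and cost_sum[key] > 0:
            probability = hits[key] / n
            average_cost = cost_sum[key] / hits[key]
            # ASSUMPTION: probability / cost is a stand-in scoring rule.
            scores[key] = probability / average_cost
    return scores


def best_transfer(scores):
    """Return the (src, dst) pair with the highest score."""
    return max(scores, key=scores.get)
```

A defender can read the same matrix in the other direction: source clusters with many high-scoring transfers mark the regions of the training data that most need corrective action.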

The relation between an original cluster of an instance with adversarial samples: (a) describes the DS-1; (b) describes the DS-2; (c) describes the DS-3; and, (d) describes the DS-4.
For example, Fig. 6(a) for DS-1 shows that if the generated sample’s membership is transferred from cluster 1 to cluster 2, then there is a higher probability of this sample being an adversarial sample. In other words, this is a better choice for the adversary. Furthermore, Fig. 6(b), for DS-2, shows that cluster number 5 is a vulnerable cluster and generates adversarial samples whose membership is transferred to a different cluster with low cost. A similar pattern is seen in Fig. 6(d) for DS-4 in cluster 1. In other words, if manipulating a sample in cluster 1 transfers its membership to a different cluster, then there is a higher likelihood that this sample is adversarial. Similarly, Fig. 6(c), for DS-3, shows that cluster 3 on average is the most vulnerable cluster.
In this experiment, we used the previous discussion to form a probabilistic model used by both the adversary and the defender. The adversary can find the best suitable transformation among different cluster samples that generate a higher number of adversarial samples. A defender can find the specific vulnerability of the learning model and the clusters contributing to a higher number of adversarial samples, thereby enabling a specific corrective action.
In this section, we compare our approach with previous research in this field. Table 7 compares nine different approaches from the literature. We summarize each approach's advantages and disadvantages and show the dataset size and best accuracy of each. We studied a wide range of previous efforts, focusing on machine learning techniques. Some techniques focus solely on the URL itself [18,23], while others look at both the URL and the content of the page [5,19]. Another difference between approaches is the use of third-party services, which poses privacy risks. The previous studies were conducted on datasets of widely varying sizes. While some datasets have fewer than 5 thousand records [5,19], others contain millions of instances [11,23]. For approaches that analyze only the URL without the webpage content, creating massive datasets is easier. Most of the approaches achieved an accuracy of over 95%, and both [16,24] achieved an accuracy of 99%. Tian et al. [23] found new phishing samples that were not detected by common phishing detection mechanisms even after one month. We also added the results of this study to Table 7. We trained the classifier on the four public datasets and achieved very high accuracy. When we added the manipulated features in the testing phase, the accuracy degraded significantly and eventually dropped to zero. These experiments show that our proposed attack is sufficient to evade existing classifiers for phishing detection.
Comparisons of different approaches in the literature including our proposed approach
In this work, we explained the limitations of machine learning techniques when adversarial samples are considered. We introduced the notion of vulnerability level for data instances and datasets based on the adversarial attacks and quantified it. We achieved high accuracy in the absence of this attack using seven different well-studied classifiers in the literature: more than 95% for all classifiers except one that had 82%. However, when we evaluated the best-performing classifier against the adversarial samples, the classifier's performance degraded significantly. With only one perturbed feature, the TPR falls from 82–97% to 45–79%, and when the number of perturbed features increases to four, the TPR falls to 0%, meaning that all of the phishing instances were able to bypass the classifier. Subsequently, we continued our experiments by factoring in the adversary cost. We showed that both the number of manipulated features and the total manipulation cost, which can be derived from the difference between the original phishing sample and the adversarial sample, are essential. From an attacker's point of view, changing the minimum number of features is desirable, but the adversarial sample must also have the minimum cost. This shows the weakness of well-known defense mechanisms against phishing attacks.
To increase the success rate of adversarial sampling, we devised a clustering approach that directs the adversary towards the best possible phishing samples for manipulation. We showed that our clustering approach allows an adversary to pick adversarial samples from a specific cluster and achieve a high rate of success, close to 75%. Adversarial samples transferred from the original cluster to a new cluster have a higher chance of bypassing the model. Our clustering approach allows an attacker to identify better samples and allows analysts to identify better defenses.
It guides the adversary towards more efficient feature manipulations for evading the classifiers. Our future work is to develop learning models that are robust in the face of such organized adversarial sampling strategies. Specifically, our adversarial sampling approach indicates the features that are more likely to be manipulated. Defenders can focus on these features to make it infeasible to generate adversarial samples. The complex correlation between the features and the nature of phishing attacks is a topic for future exploration.
Acknowledgments
This work is supported in part by funds from NSF Awards CNS 1650573, CNS 1822118 and funding from CableLabs, Furuno Electric Company, SecureNok, Statnett, Cyber Risk Research, and AFRL. Research findings and opinions expressed are solely those of the authors and in no way reflect the opinions of the NSF or any other organizations.
