Abstract
Malicious web domains represent a serious threat to online users’ privacy and security, causing monetary loss, theft of private information, and malware attacks, among others. In recent years, machine learning methods have been widely used as prediction models to identify malicious web domains. In this study, we propose a Fuzzy-Weighted Least Squares Support Vector Machine (FW-LS-SVM) model for malicious web domain identification. In our proposed model, a fuzzy-weighted operation is applied to each data sample considering the fact that different samples may have different importance. This fuzzy-weighted operation is also able to alleviate the influence of noise data and improve the model’s robustness by assigning weights to error constraints. For comparison purposes, three commonly used single machine learning classifiers and three widely used ensemble models are included in our experiments, in order to assess the performance of our proposed FW-LS-SVM and its ensemble version. Hyperlink indicators and uniform resource locator-based features are used to train the prediction models. Experimental results show that our proposed approach is highly effective in identifying malicious web domains, outperforming the well-established single and ensemble models being compared.
Introduction
With the widespread use of information and communication systems, online attacks through phishing and malicious web domains have become a serious threat to Internet users’ privacy and security [2, 34]. Malicious attacks typically target online users’ private and confidential information, allowing such information to be used illegally. Accessing malicious websites can lead to malware attacks, where computer viruses and malicious software are secretly installed to infect users’ computers and ‘steal’ important documents. As a result, these attacks cause not only monetary loss but also theft of private information. Accurate identification of malicious web domains is thus very important for preventing such attacks. Conventional methods for identifying malicious web domains rely on user-verified blacklists [35]. Although blacklists can, to some extent, identify malicious web domains, keeping them up to date is a huge challenge [19].
To overcome the challenge, machine learning methods such as the Artificial Neural Network (ANN), Support Vector Machine (SVM), Adaptive Boost (AdaBoost), Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) have been used. Zhu et al. [3] applied the ANN to detect phishing domains, attempting to alleviate the threats posed by phishing attacks. An SVM-based model was employed to detect phishing websites by Huang et al. [7]. Their results confirmed that the SVM-based model is able to identify suspicious Uniform Resource Locators (URLs) belonging to a phishing domain. Ramanathan and Wechsler [27] developed an AdaBoost-based classifier for phishing website detection, achieving F-measure (F1) scores as high as 99%. Liew et al. [23] proposed an effective security alert mechanism based on the RF for real-time phishing tweet detection. An effective intrusion detection system using XGBoost was proposed by Dhaliwal et al. [22].
In this paper, we propose a Fuzzy-Weighted Least Squares SVM (FW-LS-SVM) model for malicious web domain identification. The Least Squares SVM (LS-SVM) [11] is used in our proposed model, as it simplifies the computation of the SVM by utilising equality constraints. We introduce a fuzzy-weighted operation to improve the robustness of the LS-SVM against noise and outliers, through the assignment of weights to the error constraints. This fuzzy-weighted operation also reflects the varying degrees of importance that different data samples have for the model. To assess the performance of our proposed model, we compare it with well-established single and ensemble machine learning models in our computational experiments.
The remainder of this paper is organised as follows. In Section 2, we describe each of the machine learning models included in our experiments, as well as the proposed FW-LS-SVM. After that, we discuss the experimental setup and report on the results obtained in Section 3. Finally, we draw conclusions in Section 4, with future research directions highlighted.
Methods
In this section, we first introduce the single classifiers used for comparison purposes in our experiments, which include the ANN, SVM, and LS-SVM. Then, ensemble models including the AdaBoost, RF and XGBoost are described. After that, we present the details of our proposed FW-LS-SVM, explaining how the fuzzy-weighted operation is designed to improve the LS-SVM.
Single models
ANN
ANNs have been widely used for phishing domain prediction (e.g., see [3, 21]). An ANN is typically composed of three types of layers [24], namely an input layer, a hidden layer, and an output layer. The number of neurons in the input layer is the same as the number of input features, while the output layer produces the output of the model, usually with only one neuron for binary classification. Neurons in the hidden layer lie between the input and output layers and are connected to both. These interconnected neurons pass signals to one another. The ANN is trained by tuning the weights of the connections between neurons.
SVM
Based on the structural risk minimisation principle, the SVM has shown promising performance in many practical applications [10, 28], including phishing detection [7]. The main idea is to find an optimal hyper-plane that can separate two different classes by maximising the margin between the hyper-plane and support vectors. By obtaining the largest margin, the SVM can have high generalisation ability, which is very important to accurately predict unseen samples in real-world applications [5].
LS-SVM
Although variants of SVMs have shown promising classification and prediction performance, their major drawback is the higher computational burden resulting from the use of inequality constraints in the convex optimisation problem [12]. Unlike SVMs, the LS-SVM utilises equality constraints to simplify the computation by transforming the quadratic programming problem into a system of linear equations [17]. Such a transformation is useful in real-world applications where computational efficiency is critical.
Ensemble models
AdaBoost
AdaBoost is an ensemble model that builds a number of weak learners by manipulating the sample weights. It adjusts the weight of an observation based on the previous classification result [1]: if a sample was classified incorrectly, its weight is increased; otherwise, its weight is decreased. Adaptive boosting is a sequential technique, in which the weak learners are built one after another. Specifically, the first model is trained on the entire data set with equal sample weights, and each subsequent learner is trained on the re-weighted samples, which means it pays more attention to those samples misclassified by previous classifiers. The final prediction is then carried out based on a weighted combination of all the weak learners according to their accuracies.
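The re-weighting step described above can be sketched as follows. This is a minimal illustration of one round of the classic discrete AdaBoost update for labels in {-1, +1}; the function name `adaboost_reweight` and the toy labels are hypothetical and are not taken from the experiments in this paper.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One round of discrete AdaBoost re-weighting (labels in {-1, +1})."""
    # Weighted error rate of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / sum(weights)
    err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
    # Vote of this learner: larger when its weighted error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, the rest scaled down.
    new_w = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

# Toy example: four equally weighted samples, the second one misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
```

After normalisation, the misclassified sample carries more weight than each correctly classified one, so the next weak learner is pushed to focus on it.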
RF
The RF is an ensemble model of tree predictors [18], built on bootstrap samples. It has shown promising performance in addressing both classification and regression problems [20]. The RF model first resamples several training sets from the given data, with each set having the same number of samples. Decision tree classifiers are then trained on those resampled training sets. Random selection of features increases model diversity, which helps to alleviate over-fitting when the classifiers are aggregated for the final prediction.
XGBoost
Similar to the RF, XGBoost is also able to mitigate the over-fitting problem [26]. It applies a more regularised model formalisation using gradient boosting [16] to improve its performance. In addition, XGBoost has very useful properties for real-world applications, including column block for parallel learning, cache-aware access and blocks for out-of-core computation [25].
FW-LS-SVM
Fuzzy-weighted operation
The fuzzy set theory is capable of handling problems with vagueness and uncertainty [30]. Considering the uncertainties in malicious web domain identification, a fuzzy-weighted operation is introduced to improve the robustness of the LS-SVM against noise and outliers by assigning weights to the error constraints. Such an operation can also be used to indicate the importance of each sample to the learning of the decision surface [32]. The ith sample’s fuzzy weight can be expressed as:
We improve the LS-SVM by using the fuzzy-weighted operation to assign weights to error constraints for malicious web domain identification. In this case, the objective function for the proposed method can be expressed as follows:
To solve Equation (3), the Lagrangian function [29] is applied as:
The KKT conditions [14] are introduced to solve Equation (4), in which the matrix solution can be expressed as:
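For reference, a weighted LS-SVM of this kind is commonly written in the following standard form. This is a sketch under assumptions and not necessarily the authors' exact equations: the symbols $C$ (penalty term), $s_i$ (fuzzy weight of the ith sample), $e_i$ (error variable), $\varphi$ (feature map), $K$ (kernel function) and $\alpha_i$ (Lagrange multipliers) are notation chosen for this sketch, and the equations are deliberately left unnumbered.

```latex
% Weighted LS-SVM objective with equality constraints
\min_{w,\,b,\,e}\; J(w, e) = \frac{1}{2} w^{\top} w + \frac{C}{2} \sum_{i=1}^{N} s_i e_i^{2}
\quad \text{s.t.} \quad y_i \left( w^{\top} \varphi(x_i) + b \right) = 1 - e_i, \quad i = 1, \dots, N

% Lagrangian
L(w, b, e; \alpha) = J(w, e) - \sum_{i=1}^{N} \alpha_i \left[ y_i \left( w^{\top} \varphi(x_i) + b \right) - 1 + e_i \right]

% KKT conditions
\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{N} \alpha_i y_i \varphi(x_i), \qquad
\frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad
\frac{\partial L}{\partial e_i} = 0 \Rightarrow \alpha_i = C s_i e_i

% Linear system after eliminating w and e
\begin{bmatrix} 0 & y^{\top} \\ y & \Omega + \operatorname{diag}\!\left( \tfrac{1}{C s_1}, \dots, \tfrac{1}{C s_N} \right) \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix},
\qquad \Omega_{ij} = y_i y_j K(x_i, x_j)

% Decision function
f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b \right)
```

In this form, a small weight $s_i$ enlarges the ith diagonal term $1/(C s_i)$, so the model tolerates a larger error on that sample. This is how down-weighting a suspected noisy sample reduces its influence on the decision surface.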
From Eq. (5), parameters
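A minimal sketch of how such a weighted linear system could be assembled and solved is given below. It assumes the standard weighted LS-SVM form with a Gaussian kernel, takes the per-sample weights `s` as given (the fuzzy-weight computation is not reproduced), and uses a small Gaussian-elimination helper in place of a linear algebra library. All function names, the toy data and the default values of `C` and `sigma` are hypothetical.

```python
import math

def rbf(a, b, sigma):
    """Gaussian kernel between two feature vectors."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def fit_weighted_lssvm(X, y, s, C=10.0, sigma=1.0):
    """Return (b, alpha) for a weighted LS-SVM; s[i] is the weight of sample i."""
    n = len(X)
    A = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        A[0][i + 1] = A[i + 1][0] = float(y[i])
        for j in range(n):
            A[i + 1][j + 1] = y[i] * y[j] * rbf(X[i], X[j], sigma)
        A[i + 1][i + 1] += 1.0 / (C * s[i])  # the weight scales the error penalty
    z = solve(A, [0.0] + [1.0] * n)
    return z[0], z[1:]

def predict(x, X, y, b, alpha, sigma=1.0):
    """Sign of the kernel expansion; returns a label in {-1, +1}."""
    score = sum(a * yi * rbf(x, xi, sigma) for a, yi, xi in zip(alpha, y, X)) + b
    return 1 if score >= 0 else -1

# Toy data: two small clusters; the last sample is treated as less reliable.
X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
y = [-1, -1, 1, 1]
s = [1.0, 1.0, 1.0, 0.2]
b, alpha = fit_weighted_lssvm(X, y, s)
```

Setting every entry of `s` to 1 recovers the unweighted LS-SVM, which makes the sketch convenient for comparing the two variants on the same data.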
To further improve the performance of the FW-LS-SVM, we apply the ensemble technique by integrating multiple FW-LS-SVM models for final prediction. This ensemble version of the FW-LS-SVM is referred to as EFW-LS-SVM. Here, we resample K datasets with replacement from the given training set. We set a value η ∈ [0.6, 1], which indicates the percentage of samples we would like to obtain from the given training set in each resampling process. By doing so, different estimators can have a different number of training samples and different parameter settings. The final result is based on the weighted average of all estimators’ results.
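The resampling and aggregation steps can be sketched as follows. Two points are assumptions of this sketch rather than details given in the text: η is drawn uniformly from [0.6, 1] for each estimator, and the estimator weights used in the weighted average are supplied by the caller. The function names are hypothetical.

```python
import random

def resample_indices(n_samples, k, eta_range=(0.6, 1.0), seed=0):
    """Draw k index sets with replacement; each holds a fraction eta of n_samples."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(k):
        eta = rng.uniform(*eta_range)            # assumed: one eta per estimator
        size = max(1, int(round(eta * n_samples)))
        subsets.append([rng.randrange(n_samples) for _ in range(size)])
    return subsets

def weighted_vote(predictions, estimator_weights):
    """Combine {-1, +1} predictions from several estimators by weighted average."""
    score = sum(w * p for w, p in zip(estimator_weights, predictions))
    return 1 if score >= 0 else -1

subsets = resample_indices(n_samples=100, k=5)
label = weighted_vote([1, -1, 1], [0.5, 0.3, 0.2])
```

Each index set would be used to train one FW-LS-SVM estimator, and `weighted_vote` would then be applied to the estimators' outputs for every test sample.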
A flowchart is given in Fig. 1 to depict the process of malicious web domain identification using the prediction models considered. As we can see from the figure, 10-fold cross validation [15] is applied to validate the prediction performance by taking randomness into account. We first randomly shuffle the given dataset. Then, the 10-fold cross validation technique is applied to evaluate the model performance. In this case, the dataset is divided into 10 groups (each group has the same number of samples), where one group is selected as the testing set and the remaining groups are used as the training set. Each time, the training set is used to train the prediction model, whereas the testing set is used to evaluate its performance. Finally, the 10 results are averaged as the result for each 10-fold cross validation.
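The shuffle-and-split procedure described above can be sketched in a few lines. The function names and the `evaluate` callback, which stands in for training a model on the training indices and scoring it on the testing indices, are hypothetical. The folds are of equal size only when the number of samples is divisible by the number of folds.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)            # random shuffle before splitting
    folds = [idx[i::k] for i in range(k)]       # k groups of (nearly) equal size
    for i in range(k):
        test = folds[i]
        train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        yield train, test

def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(2000, k=10))
```

Repeating `cross_validate` with different seeds and averaging the outcomes corresponds to repeating the whole 10-fold procedure several times.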
Experiments and results
In this section, we report on the performance of our proposed FW-LS-SVM by comparing it to three single classifiers and three ensemble models using two datasets. The first dataset was a benchmark dataset taken from the UCI collection. The second dataset was constructed by us, based on 1,000 real-world malicious web domains from PhishTank and 1,000 benign web domains from Alexa. We repeated each of our experiments 100 times, and the final results reported are based on averages from the 100 runs.
Evaluation metrics
Four evaluation metrics, including the accuracy, precision, recall, and F1-score, were used to measure the performance of the prediction models. These performance measures can be calculated as follows:
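The four measures follow their standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN): accuracy = (TP + TN) / (TP + TN + FP + FN), precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 × precision × recall / (precision + recall). A minimal sketch is shown below. Treating the malicious class as the positive class (label 1) is an assumption of this sketch, and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1-score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is empty (no positive predictions/labels).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 1 = malicious, 0 = benign (assumed encoding).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this toy example there are three true positives, three true negatives, one false positive and one false negative, so all four measures equal 0.75.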
Grid search [6] was applied to set the parameters for all the models. For the ANN, we optimised the number of neurons in its hidden layer and the maximum number of iterations; for the SVM, LS-SVM, and FW-LS-SVM, we optimised the penalty term and variance in the Gaussian kernel; for the RF and XGBoost, we optimised the number of trees and the maximum depth of trees; for AdaBoost, we optimised the number of trees; for the EFW-LS-SVM, we optimised the number of estimators. As for the fuzzy weights, the number of fuzzy regions and standard deviations for the Gaussian membership function were automatically calculated using a subtractive clustering method [31, 33].
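An exhaustive grid search over two hyper-parameters can be sketched as follows. The parameter names `C` and `sigma`, the candidate values and the toy scoring function are hypothetical; in practice the score would be a cross-validated measure such as the F1-score.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy score with a single peak at C = 10 and sigma = 1 (stand-in for a CV score).
def toy_score(C, sigma):
    return -((C - 10) ** 2) - (sigma - 1) ** 2

grid = {"C": [0.1, 1, 10, 100], "sigma": [0.1, 1, 10]}
best_params, best_score = grid_search(grid, toy_score)
```

The cost grows with the product of the numbers of candidate values, which is why the grid is usually kept coarse for kernel models.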
Experimental results on single classifiers
Experimental results obtained by single classifiers, including the ANN, SVM, LS-SVM and our proposed FW-LS-SVM, are shown in Tables 1 and 2 for the UCI benchmark dataset and our self-collected real-world dataset, respectively. As we can see from the tables, all the prediction models have good performance for malicious web domain identification, especially on the UCI dataset. Our proposed FW-LS-SVM has the best results, outperforming the other single classifiers being compared in terms of accuracy, precision, recall, and F1-score.
Comparing the results based on the benchmark dataset in Table 1 and the results using the real-world dataset in Table 2, we see that results for the real-world dataset are not as good as the benchmark ones. This is to be expected, as the real-world dataset is likely to contain more noisy data and uncertainties associated with real-world scenarios.
Experimental results for single models (ANN, SVM, LS-SVM, and FW-LS-SVM), based on the UCI dataset (best results are highlighted in bold)
Experimental results for single models (ANN, SVM, LS-SVM, and FW-LS-SVM), based on the self-collected dataset (best results are highlighted in bold)
Although the results presented so far show that our proposed model exhibits better performance, they do not establish whether the differences between the results are statistically significant. To ascertain the significance of our results, we first show the box plot for each of the single models based on the F1 scores obtained. We also carried out further statistical analysis using the t-test [13]. Each of the statistical tests was conducted based on the 100 repeated runs for each experiment.
Figs. 2 and 3 show the box plots for the single models based on their F1 scores reported in Tables 1 and 2, respectively. From the figures, we can see that the SVM and SVM-based models have performed consistently better than the ANN. The medians of SVM-based models are clearly higher than that of the ANN, while our proposed FW-LS-SVM has the highest median. Statistical test results for the single models on the two datasets are shown in Tables 3 and 4. These test results are also based on the F1-score values reported in Tables 1 and 2. In Tables 3 and 4, p-values that are smaller than the significance level (0.05) are highlighted in bold. From the tables, we see that all of the pairwise comparisons show significant differences in terms of F1-score, which means our proposed FW-LS-SVM has significantly outperformed the other single models being compared for both the UCI and self-collected datasets. It is worth noting that there are even results with zero values; this is because the SVM-based models, unlike the ANN model, are not stochastic models (as can be seen from the box plots in Figs. 2 and 3).
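As an illustration of how two sets of repeated-run F1 scores can be compared, the sketch below computes a t statistic and its degrees of freedom. The text does not state which variant of the t-test was used; Welch's unequal-variance form is assumed here, and the function name and scores are hypothetical. Obtaining a p-value additionally requires the cumulative distribution of Student's t, which is available in statistical libraries and is not reimplemented here.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va, vb = variance(a), variance(b)           # unbiased sample variances
    se2 = va / len(a) + vb / len(b)             # squared standard error
    if se2 == 0:
        # Two deterministic models give constant scores; the statistic is undefined.
        raise ValueError("both samples have zero variance")
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / len(a)) ** 2 / (len(a) - 1)
                     + (vb / len(b)) ** 2 / (len(b) - 1))
    return t, df

# Hypothetical F1 scores from four runs of two models.
f1_model_a = [0.95, 0.96, 0.97, 0.96]
f1_model_b = [0.90, 0.91, 0.92, 0.91]
t_stat, dof = welch_t(f1_model_a, f1_model_b)
```

The zero-variance guard matters for deterministic classifiers, whose scores can be identical across repeated runs on the same folds.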

A flowchart depicting the process of malicious web domain identification using the prediction models.

Box plots for the single models (ANN, SVM, LS-SVM, and FW-LS-SVM), based on the F1-score values reported in Table 1.

Box plots for the single models (ANN, SVM, LS-SVM, and FW-LS-SVM), based on the F1-score values reported in Table 2.
Statistical test results for the single models (ANN, SVM, LS-SVM, and FW-LS-SVM), based on the F1-score values reported in Table 1 (p-values less than 0.05 are highlighted in bold)
Statistical test results for the single models (ANN, SVM, LS-SVM, and FW-LS-SVM), based on the F1-score values reported in Table 2 (p-values less than 0.05 are highlighted in bold)
The above results confirmed that our proposed FW-LS-SVM is able to improve the performance of the LS-SVM and has promising potential for malicious web domain identification.
Experimental results on ensemble models
Experimental results obtained by ensemble models, including the AdaBoost, RF, XGBoost and EFW-LS-SVM, are shown in Tables 5 and 6 for the UCI benchmark dataset and our self-collected real-world dataset, respectively. As can be seen in the tables, results based on the UCI dataset are better than results based on the real-world dataset. Among the prediction models, our EFW-LS-SVM has the best performance in terms of accuracy, precision, recall and F1-score. This can be attributed to the use of fuzzy weights, which, to some extent, is able to alleviate the influence of noise data.
Experimental results for ensemble models (AdaBoost, RF, XGBoost, and EFW-LS-SVM), based on the UCI dataset (best results are highlighted in bold)
Experimental results for ensemble models (AdaBoost, RF, XGBoost, and EFW-LS-SVM), based on the self-collected dataset (best results are highlighted in bold)
In addition, comparing Tables 1 and 5 for the UCI dataset and Tables 2 and 6 for the self-collected dataset, the ensemble version of our FW-LS-SVM clearly improves on the performance of its base model. This is evidence that the ensemble approach is able to achieve better performance by integrating a number of base learners. From Tables 2 and 6, we observe that, although our FW-LS-SVM outperforms the single models being compared, its performance is worse than that of the RF, XGBoost and EFW-LS-SVM ensemble models. This is further evidence that the ensemble strategy plays an important role in improving model performance for malicious web domain identification.
To ascertain the significance of our results, box plots for the ensemble models based on their F1 scores reported in Tables 5 and 6 are presented in Figs. 4 and 5, respectively. From the figures, we see that there is an asterisk (*) for the AdaBoost and RF results; it means there are outliers that are far away from other data values and can strongly affect the results. XGBoost and our proposed EFW-LS-SVM, on the other hand, have consistent results. The median of our EFW-LS-SVM is the highest, indicating that it is the best-performing model overall.
Statistical test results for the ensemble models based on the t-test can be found in Tables 7 and 8. In these tables, we see that all the p-values are less than 0.05, implying that all the pairwise comparisons between classifiers show significant differences. We can therefore conclude that our proposed EFW-LS-SVM is able to significantly outperform the other ensemble models being compared.

Box plots for the ensemble models (AdaBoost, RF, XGBoost, and EFW-LS-SVM), based on the F1-score values reported in Table 5.

Box plots for the ensemble models (AdaBoost, RF, XGBoost, and EFW-LS-SVM), based on the F1-score values reported in Table 6.
Statistical test results for the ensemble models (AdaBoost, RF, XGBoost, and EFW-LS-SVM), based on the F1-score values reported in Table 5 (p-values less than 0.05 are highlighted in bold)
Statistical test results for the ensemble models (AdaBoost, RF, XGBoost, and EFW-LS-SVM), based on the F1-score values reported in Table 6 (p-values less than 0.05 are highlighted in bold)
Overall, the results confirmed that the fuzzy-weighted operation is effective in improving the LS-SVM, and the ensemble strategy can further enhance our proposed FW-LS-SVM. Although all the prediction models have good performance for malicious web domain identification, our EFW-LS-SVM has obtained the best results.
Conclusions
Machine learning models have demonstrated their capability to identify malicious web domains with high accuracy. In our proposed FW-LS-SVM, we added a fuzzy-weighted operation to the error constraints, considering the fact that different samples may have different importance to the learning of the decision surface. The experimental results showed that our fuzzy-weighted operation is effective in improving the performance of the LS-SVM and exhibits promising performance for malicious web domain identification. In addition, the ensemble strategy can further enhance its performance.
In the current study, both datasets used for our experiments are balanced datasets, where the number of benign web domains is equal to the number of malicious ones. In real-world applications, however, there are far more benign web domains than malicious ones. In this case, it is much more difficult to identify malicious web domains. Thus, for future work, we will investigate how to improve our prediction models for malicious web domain identification with class imbalance. We will also explore the use of resampling techniques [9] to deal with this issue.
