Semi-supervised learning approach for malicious URL detection via adversarial learning

Abstract

A Uniform Resource Locator (URL) specifies the location and access method of a resource on the Internet. Malicious URLs have become one of the main means of network attack, and detecting them in a timely and accurate manner is an active research topic. Recently proposed deep learning-based detection models achieve high accuracy in simulations, but several problems are exposed when they are used in real applications. These models need a balanced labeled dataset for training, while collecting large numbers of up-to-date labeled URL samples is difficult because new URLs are generated rapidly in real application environments. In addition, in most randomly collected datasets, the numbers of benign and malicious URL samples are extremely unbalanced, as malicious URL samples are often rare. This paper proposes a semi-supervised malicious URL detection method based on a generative adversarial network (GAN) to address these two problems. By using unlabeled URLs for model training in a semi-supervised way, the requirement for large numbers of labeled samples is relaxed, and the imbalance problem is relieved with the synthetic malicious URLs generated by adversarial learning. Experimental results show that the proposed method outperforms the classic SVM-based and LSTM-based methods. Specifically, the proposed method obtains high accuracy with insufficient labeled samples and an unbalanced dataset; for example, it achieves 87.8%/91.9% detection accuracy when the number of labeled samples is reduced to 20%/40% of that of conventional methods.

Keywords

Malicious URL detection, network security, deep learning, semi-supervised learning

1 Introduction

The diversified development of network resources has brought great convenience to people’s daily life. However, the increase in the number and types of web sites has also led to many network security problems, such as viruses and malicious URLs. This paper focuses on the detection of malicious URLs. A malicious URL is a link that lures users into visiting a malicious webpage, whose code can lead to malware installation, personal information leaks, Internet fraud, etc. [1]. Phishing URLs are one typical kind of malicious URL. According to the Phishing Activity Trends Report [2] published by the Anti-Phishing Working Group (APWG), the total numbers of phishing sites detected by APWG in the first and second quarters of 2019 were 180,768 and 182,465, respectively. In recent years, the number of phishing URLs detected has continued to grow and has caused substantial economic loss. According to a Gartner survey, phishing attacks cost 3.5 million users $3.2 billion every year. Therefore, finding an effective malicious URL detection method to maintain network security is of great importance.

In general, there are two main problems in the malicious URL detection task. First, the numbers of malicious URLs and benign URLs are unbalanced. Malicious URLs are more difficult to collect than benign URLs, and their percentage is usually very small [3]. The imbalance between positive and negative samples reduces the effectiveness of the detection model [4, 5], yet many existing methods ignore it. Second, obtaining sufficient labeled URL training samples is difficult. Although many URL samples exist on the Internet, the label information is not always available, and labeling samples manually is time consuming and expensive. Moreover, in the actual application environment, URLs are generated very quickly, and obtaining a sufficient number of up-to-date labeled URL samples is a great challenge. Therefore, training a detection model with insufficient labeled samples while avoiding a decline in accuracy is of great importance.

In this paper, a semi-supervised learning approach for malicious URL detection based on a generative adversarial network (GAN) [6] is proposed to solve the two problems above. Semi-supervised learning makes full use of unlabeled samples, which addresses the shortage of labeled samples, and the GAN model generates synthetic samples, which addresses the class imbalance. In summary, the main contributions of this paper are as follows:

1) The proposed method reduces the impact of the imbalance between malicious and benign URLs: while the unbalanced dataset is used to train the discriminator, the generator continuously generates synthetic samples and thus increases the number of scarce malicious samples.

2) The proposed method only needs a few labeled URL samples to train the classification model but still achieves higher accuracy than that of existing methods.

3) The proposed approach can automatically extract URL features without the need to extract features artificially from data sets for training models, which reduces the workload compared with traditional methods and the unacceptable effect of improper feature selection.

The remainder of this paper is organized as follows. Section 2 gives a brief review on existing malicious URL detection methods. Section 3 is devoted to the detailed description of the proposed methods. The experimental results and analysis are provided in Section 4. Section 5 concludes the paper.

2 Related work

This section describes the current common methods for detecting malicious URLs, and summarizes their advantages and disadvantages in actual detection.

2.1 Blacklist-based method

The blacklist-based method is the most traditional and direct way to detect malicious URLs [7]. To determine whether a URL is malicious, the URL is first searched for in the blacklist, and it is identified as malicious and blocked if it is already listed. The blacklist recognition method is very simple and has a high accuracy rate for URLs that are in the blacklist database. Although blacklist technology is one of the most commonly used methods in many interception systems such as Google Safe Browsing, it still has many disadvantages. The blacklist-based method cannot intercept malicious URLs outside the database, so it requires adding all known malicious URLs to the database. However, new malicious URLs are produced every day, and the blacklist database cannot be exhaustive. Consequently, relying only on the blacklist-based method will fail to intercept a large number of newly generated malicious URLs. Moreover, as the blacklist database grows, the time to query the database also increases [8].

2.2 Content-based method

Content-based methods use webpage contents to detect whether a URL is normal or malicious. This kind of method first extracts features from the web content, including the HTML document, JavaScript code and images contained within the webpage. The extracted features are then used as the basis for malicious URL detection [9, 10]. The content-based method has two issues. The first is that it is time consuming, because the source code and content of the entire webpage must be analyzed. The second is its low effectiveness, because a phishing webpage may have the same contents as the original webpage. Moreover, the extracted features may be inaccurate. All of these limitations reduce the reliability [11] of the content-based malicious URL detection method.

2.3 Machine-learning-based method

In recent years, machine learning has become an active research topic. Different kinds of machine learning-based approaches have been used to detect malicious URLs and have achieved good results. Huang et al. [12] proposed a phishing URL detection method based on the support vector machine (SVM) and achieved good detection results on the PhishTank dataset. However, the training speed of the SVM method is reduced by high-dimensional features. In [13], the authors proposed a method for titanium alloy classification based on the combined use of the Wiener polynomial and SVM to reduce the training time. The random forest (RF) [14] classifier is used to detect malicious URLs and is combined with other classifiers to construct a strong classifier for better classification performance [15]. A new non-iterative neural-like structure based on the Geometric Transformations Model [16] has been designed, which is capable of high-speed training and of solving large-dimensional tasks. Izonin et al. [17] used non-iterative approaches based on a Successive Geometric Transformations Model (SGTM) to solve the multiple regression task. Bahnsen et al. [18] proposed an active URL detection system based on LSTM, which uses URLs directly as the input of the machine learning model to detect malicious URLs. Compared with methods that rely on expert experience to extract features, the LSTM method can achieve good classification accuracy without manual feature extraction.

To address the problem of insufficient labeled samples in malicious URL detection, many scholars have pointed out that unlabeled samples can also be used; related works include Positive and Unlabeled (PU) learning [19, 20] and semi-supervised learning [21–23]. Current semi-supervised learning methods for malicious URL detection often train the classifier with labeled samples first; the unlabeled samples are then tagged by the trained classifier and reused as labeled samples to train the classifier [24]. Because these methods do not use the unlabeled URLs directly, and the labels assigned to them are unreliable, they still cannot solve the imbalance issue. To achieve better detection accuracy, many improved versions of semi-supervised learning have been proposed, such as CoForest [25], S3VM [26] and the self-trained rotation forest algorithm [27]. A comprehensive review of semi-supervised learning methods can be found in [28].

3 Preliminaries and the proposed approach

In this section, the pre-processing of the URL data is first introduced. Then the structure and principle of GAN model are discussed. Finally, the semi-supervised learning theory is introduced into the field of malicious URL detection, and the semi-supervised learning approach is proposed for malicious URL detection based on GAN.

3.1 Data preprocessing

Different URL samples have different characteristics. Before being input into the deep learning model, the URL samples and their labels need to be preprocessed uniformly to obtain a unified representation. URL samples are first segmented into characters because Chinese and several abbreviations often appear in the URLs, which makes character-level features [29, 30] more applicable. Using one-hot coding, a benign (positive) URL is labeled as [1, 0], and a malicious (negative) URL is labeled as [0, 1].

As shown in Figure 1, a URL is mainly composed of protocol, hostname, port, path and other parts. First, before the URL sample is split into characters, the unnecessary port and the inconsequential protocol part are discarded. Second, a vocabulary that contains the common characters in all URLs is built, which contains the digits 0-9, the uppercase and lowercase letters “A-Z” and “a-z”, and special characters such as “/”, “.”, “_”, “?”, “@” and “%”. In this paper, the built vocabulary has 94 characters. Each character in the URL is encoded as a 94-dimensional vector by one-hot encoding, and all the URLs are clipped to 200 characters for easy processing. The excess part is discarded for URLs longer than 200 characters, and NULL is used to pad URLs shorter than 200 characters. Through the above steps, each URL is converted into a matrix with a dimension of 200 × 94. The conversion of URL samples into one-hot encoding is shown in Figure 2.
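The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not list the 94 characters, so the vocabulary here is assumed to be the printable ASCII characters without whitespace (which happens to contain exactly 94 symbols), NULL padding is assumed to be an all-zero row, and the function names are hypothetical.

```python
import string

# Assumption: the 94-character vocabulary is the printable ASCII set without
# whitespace (10 digits + 52 letters + 32 punctuation marks).
VOCAB = string.digits + string.ascii_letters + string.punctuation
CHAR_INDEX = {ch: i for i, ch in enumerate(VOCAB)}
MAX_LEN = 200


def strip_protocol_and_port(url):
    """Drop the protocol prefix and any ':port' suffix of the host part."""
    rest = url.split("://", 1)[-1]
    host, sep, tail = rest.partition("/")
    host = host.split(":", 1)[0]
    return host + sep + tail


def encode_url(url):
    """Return a MAX_LEN x len(VOCAB) one-hot matrix as nested lists."""
    text = strip_protocol_and_port(url)[:MAX_LEN]  # clip long URLs
    matrix = []
    for ch in text:
        row = [0] * len(VOCAB)
        if ch in CHAR_INDEX:  # characters outside the vocabulary stay all-zero
            row[CHAR_INDEX[ch]] = 1
        matrix.append(row)
    # Assumption: NULL padding is represented by all-zero rows.
    matrix.extend([0] * len(VOCAB) for _ in range(MAX_LEN - len(text)))
    return matrix


encoded = encode_url("http://www.tianya.cn:8080/a")
```

Every URL, whatever its original length, ends up as a 200 × 94 matrix, which is what allows the samples to be batched and fed to the LSTM layers one character at a time.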

Fig. 1

Structure of the URL.

Fig. 2

Convert the URL sample to a one-hot encoding.

3.2 Preliminaries on GAN

In [6], Goodfellow et al. proposed a deep learning model called GAN, which is now widely used in computer vision, natural language processing and audio processing. At present, GAN is mainly used to generate images [31, 32], but it also has great application potential in generating other kinds of data [33]. The GAN theory was inspired by the adversarial training game [34] between two network parties. The GAN model consists of two parts, namely, generator G and discriminator D. The generator is used to synthesize a fake sample from a random noise vector z, and the discriminator is used to distinguish whether the input is a real sample or a fake one. During the joint training of the generator and the discriminator, the generator tries to generate samples that are realistic enough to deceive the discriminator, and the discriminator tries not to be deceived and to distinguish the synthesized samples from the real samples. Finally, the two models reach a Nash equilibrium: the generator generates synthesized samples that are very similar to the original ones, and the discriminator can hardly distinguish them.

The GAN model is trained under the min-max rule according to the following loss function:

$\min_{G} \max_{D} V(G, D) = E_{x \sim P_{data}(x)} [\log D(x)] + E_{z \sim p_{noise}(z)} [\log (1 - D(G(z)))]$ (1)

where D(x) represents the probability that x is a real sample. Discriminator D is trained to maximize both $\log D(x)$ and $\log (1 - D(G(z)))$, so that real and synthetic inputs are assigned the correct labels. Generator G is optimized in the opposite direction: it is trained to minimize $\log (1 - D(G(z)))$. In this adversarial game, the GAN approaches a Nash equilibrium through the alternating training of G and D.
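As a small numerical illustration of Equation (1), the value function can be estimated by averaging over a batch of discriminator outputs. The sketch below uses plain Python and hypothetical output values chosen only for illustration; it is not the training code of the proposed model.

```python
import math


def gan_value(d_real, d_fake):
    """Batch estimate of V(G, D) in Eq. (1).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on synthetic samples, each in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical outputs: a discriminator that separates real from fake well ...
confident = gan_value(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
# ... and one that cannot tell them apart and outputs 0.5 everywhere.
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The discriminator pushes this value up (here `confident` is larger than `fooled`), while the generator pushes it down; when the discriminator outputs 0.5 for every input the value equals -log 4.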

3.3 The proposed approach

This paper introduces GAN into the field of malicious URL detection and proposes a semi-supervised malicious URL detection approach based on the GAN framework, which makes full use of two characteristics of the GAN model. First, GAN is an unsupervised learning model, which can make full use of unlabeled data to train the generator and the discriminator through adversarial training. Second, during the training of the GAN model, the generator can continuously synthesize samples and increase the number of rare samples. These two characteristics make GAN suitable for malicious URL detection, since this task suffers from insufficient labeled samples and an unbalanced dataset, both of which the GAN framework can address. Figure 3 shows that the proposed GAN-based malicious URL detection model includes generator G and discriminator D.

Fig. 3

Structure of the detection model and the whole process.

Design of discriminator D: As mentioned above, there are two main problems in malicious URL detection: insufficient labeled samples and an unbalanced dataset. To handle these issues, the original discriminator D is modified to suit the semi-supervised learning mode. In the traditional GAN model, the discriminator is used to distinguish whether the input is real or fake. In this paper, the output y of the discriminator is no longer 0 (fake) or 1 (real); instead, the output is a (K + 1)-dimensional vector specifying the category of the input sample (one of the K classes or the synthetic class).

Fig. 4

The network structure of the discriminator.

The network structure of the discriminator is shown in Figure 4. It includes three LSTM layers, because recurrent neural networks (RNNs) are suitable for processing sequence data. An original URL sample is first converted to a one-hot encoding and then input to the network. The character vectors are fed into the LSTM layer one by one, and the sequential features between them are extracted by the LSTM layer. Each LSTM layer is followed by a dropout layer to avoid over-fitting. The output of the network’s final dropout layer is fed into a fully connected layer for classification.

The input of the discriminator network includes three kinds of data: (1) labeled URL samples x, (2) unlabeled URL samples $\tilde{x}$, and (3) synthesized samples G(z) generated by the GAN generator. To make full use of all these samples to train discriminator D, the loss function $L_{D}$ of discriminator D contains two parts: the supervised cross-entropy loss $L_{D\_sup}$ and the unsupervised loss $L_{D\_unsup}$:

$L_{D} = L_{D\_unsup} + L_{D\_sup}$ (2)

The unsupervised loss itself consists of two parts, $L_{real\_unsup}$ and $L_{fake\_unsup}$. The former is the loss when training with unlabeled real samples, and the latter is the loss when training with unlabeled synthesized samples. They are defined as follows:

$L_{real\_unsup} = - E_{\tilde{x} \sim P_{data}(x)} [\log (1 - p_{model}(y_{K+1} \mid \tilde{x}))]$ (3)

$L_{fake\_unsup} = - E_{\tilde{x} \sim G(z)} [\log p_{model}(y_{K+1} \mid \tilde{x})]$ (4)

where $p_{model}(y_{K+1} \mid \tilde{x})$ (see footnote 1) is the probability that $\tilde{x}$ is synthesized. During this unsupervised training, discriminator D only needs to determine whether the input samples are synthetic and does not need to recognize the specific category. Let D(x) denote the probability that x is a real sample; then $D(x) = 1 - p_{model}(y_{K+1} \mid x)$, and the two unsupervised losses can be simplified as follows:

$L_{real\_unsup} = - E_{\tilde{x} \sim P_{data}(x)} [\log D(\tilde{x})]$ (5)

$L_{fake\_unsup} = - E_{\tilde{x} \sim G(z)} [\log (1 - D(G(z)))]$ (6)

For the supervised learning loss part, the labeled samples from the real dataset are used to train the discriminator, which needs to distinguish the specific categories of the real samples. The loss function $L_{D\_sup}$ is as follows:

$L_{D\_sup} = - E_{x, y \sim P_{data}(x, y)} \log p_{model}(y_{k} \mid x)$ (7)

where $p_{model}(y_{k} \mid x)$ (see footnote 2) is the probability that the input x belongs to the kth category. When the sample label is one-hot encoded (i.e., $y_{i} = 1$ for i = k and $y_{i} = 0$ for i ≠ k), the loss function is the cross-entropy function

$L_{D\_sup} = - E_{x, y \sim P_{data}(x, y)} \sum_{i=1}^{n} y_{i} \log D(x_{i}, y_{i})$ (8)

where $D(x_{i}, y_{i})$ denotes the probability that the discriminator labels the input sample $x_{i}$ correctly.
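The discriminator loss of Equations (2), (3), (4) and (7) can be sketched on a small batch as follows. This is a minimal pure-Python illustration, assuming that the discriminator emits K + 1 logits per sample with the synthetic class in the last position and that all class probabilities come from one softmax over these K + 1 logits; the function names and example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator_loss(labeled, unlabeled_logits, fake_logits):
    """Batch estimate of L_D = L_D_sup + L_real_unsup + L_fake_unsup.

    labeled: list of (logits, class_index) pairs, class_index in 0..K-1.
    unlabeled_logits, fake_logits: lists of (K+1)-dimensional logit vectors;
    the last entry of each vector is the synthetic class y_{K+1}.
    """
    # Supervised cross-entropy on labeled real samples, Eq. (7).
    sup = -sum(math.log(softmax(l)[k]) for l, k in labeled) / len(labeled)
    # Unlabeled real samples should NOT fall in the synthetic class, Eq. (3).
    real = -sum(math.log(1.0 - softmax(l)[-1])
                for l in unlabeled_logits) / len(unlabeled_logits)
    # Generated samples SHOULD fall in the synthetic class, Eq. (4).
    fake = -sum(math.log(softmax(l)[-1]) for l in fake_logits) / len(fake_logits)
    return sup + real + fake


# K = 2 (benign, malicious) plus the synthetic class; logits are made up.
good = discriminator_loss([([5.0, 0.0, 0.0], 0)], [[3.0, 0.0, -3.0]], [[-3.0, -3.0, 3.0]])
bad = discriminator_loss([([0.0, 5.0, 0.0], 0)], [[-3.0, -3.0, 3.0]], [[3.0, 0.0, -3.0]])
```

A discriminator whose logits agree with the labels and with the real/synthetic origin of each sample (`good`) receives a smaller loss than one that gets them wrong (`bad`), which is the behavior the three terms are meant to encourage.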

Design of generator G: During the training of generator G, random noise vectors are taken as input, and synthetic URL samples are continuously generated to train the discriminator model D. The number of rare malicious URL samples is expanded, which reduces the effect of insufficient malicious samples on the training.

The input of the generator model is a noise vector z, which obeys a Gaussian distribution $P_{z}$, and the output is a synthetic sample that matches the distribution of the real samples. The network contains three fully connected layers as hidden layers, each followed by a dropout layer. During the semi-supervised training of the discriminator model, half of the training samples are synthesized by the generator. The generator’s loss function $L_{G}$ consists of two parts. The first part, $L_{fake\_unsup}$, comes from the training of GAN as shown in Equation (6), and the other part, $L_{feature\_matching}$, comes from feature matching.

In the training of feature matching, we follow the idea of the Bad GAN theory [35] that the distribution of the synthetic samples should not be matched completely to that of the real samples. A perfect generator can reproduce the distribution of the real data exactly, but the generalization ability of the discriminator is then limited during semi-supervised learning [36]. In contrast, although a bad generator may not synthesize a distribution that exactly matches the distribution of the real samples, it can generate samples that are close to the decision boundary and thus force the separating plane to have a large margin, which yields better generalization ability. How a perfect generator and a bad generator influence the decision boundaries of the classification model is shown in Figure 5. The blue and red dots represent positive and negative samples, respectively, and the green dots represent the synthetic samples; specifically, the green dots along the decision boundary of the perfect generator are synthesized by the bad generator.

Fig. 5

Principle visualization of bad generator: The bad generator can generate the complement samples in feature space, and these complement samples help the discriminator obtain the correct decision boundaries.

Let f(x) denote the output of each feature layer of the discriminator model before the last layer. To integrate the idea of Bad GAN, a parameter w is used in the feature matching loss function $L_{feature\_matching}$:

$L_{feature\_matching} = w \| E_{\tilde{x} \sim P_{data}(x)} f(\tilde{x}) - E_{z \sim P_{z}} f(G(z)) \|_{2}^{2}$ (9)

where $\| \cdot \|_{2}^{2}$ denotes the squared L₂ norm and the weight parameter w controls the matching degree. Consequently, the total loss function of the generator, $L_{G}$, is defined as follows:

$\begin{matrix} L_{G} & = L_{fake\_unsup} + L_{feature\_matching} \\ & = E_{\tilde{x} \sim G(z)} [\log (1 - D(G(z)))] + w \| E_{\tilde{x} \sim P_{data}(x)} f(\tilde{x}) - E_{z \sim P_{z}} f(G(z)) \|_{2}^{2} \end{matrix}$ (10)

Here the adversarial term is written without the leading minus sign of Equation (6), since G is trained to minimize $\log (1 - D(G(z)))$ whereas D is trained to maximize it. In the training of the whole GAN model, a classification model is trained by minimizing the loss function of the discriminator, $L_{D}$, and the loss function of the generator, $L_{G}$. The entire objective can be expressed as follows:

$\begin{matrix} \min_{G} \max_{D} V(G, D) & = E_{\tilde{x} \sim P_{data}(x)} [\log D(\tilde{x})] \\ & + E_{x, y \sim P_{data}(x, y)} \sum_{i=1}^{n} y_{i} \log D(x_{i}, y_{i}) \\ & + E_{\tilde{x} \sim G(z)} [\log (1 - D(G(z)))] \\ & + w \| E_{\tilde{x} \sim P_{data}(x)} f(\tilde{x}) - E_{z \sim P_{z}} f(G(z)) \|_{2}^{2} \end{matrix}$ (11)
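The feature matching term of Equation (9) compares batch means of discriminator features. The following pure-Python sketch shows the computation on hypothetical two-dimensional feature vectors; in practice f would be the output of an intermediate discriminator layer, and the value of w used here is only an example.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def feature_matching_loss(real_features, fake_features, w):
    """w * || mean f(x_real) - mean f(G(z)) ||_2^2, as in Eq. (9)."""
    mu_real = mean_vector(real_features)
    mu_fake = mean_vector(fake_features)
    return w * sum((a - b) ** 2 for a, b in zip(mu_real, mu_fake))


# Hypothetical features f(x) for two real and two generated samples.
real = [[1.0, 2.0], [3.0, 4.0]]   # batch mean is [2.0, 3.0]
fake = [[0.0, 0.0], [2.0, 2.0]]   # batch mean is [1.0, 1.0]
loss = feature_matching_loss(real, fake, w=0.5)
```

With these values the difference of the means is [1, 2], its squared L₂ norm is 5, and the weighted loss is 2.5. A smaller w lets the generated feature statistics stay further away from those of the real data, which is how the weight controls the matching degree.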

4 Experimental design and result analysis

4.1 Experiments design

A URL dataset containing malicious and benign URLs was constructed to train the model and verify the detection accuracy of the approach. The benign URLs were obtained from Alexa (41,227 samples), and the malicious URLs were obtained from PhishTank (20,613 samples). The number of malicious URLs was kept smaller than that of benign URLs to simulate the real environment, where the classes of URL samples are unbalanced: malicious URLs account for only 1/3 of the dataset. In addition, only a small part of the URL samples was tagged with positive or negative labels, while the rest were left unlabeled. Table 1 shows five benign URL samples (positive) and five malicious URL samples (negative). Some URL samples synthesized by the GAN’s generator are listed in Table 2.

Table 1
URL

URL Samples	label
1. http://www.tianya.cn	positive
2. http://taiwan-itinerary.blogspot.com	positive
3. http://socialmediamonitoring.herokuapp.com	positive
4. https://acreorGANic.com	positive
5. http://generalelectricrefrigeratorrepairs.com	positive
6. https://aswpaynet-jp.com/18/08/2020/	negative
7. https://icloud.com.apple-find.com/	negative
8. https://www.supersimplesurvey.com/survey/22762/m1	negative
9. http://mx7.internetsuperproducts.com/t/r?3or-1740x-0-zo21	negative
10. http://amazon.co.jp.safeaccountaccountaccountaccountdjsaoijfiadjfuohuidshiofjoisdaj.xyz/?customer-service-center	negative

Table 2

Synthetic URL

Synthetic URL Samples
1. http://loliili.acon.aamuw6wels/iua/=qetu
2. http://laowdlo.con?foo1iohe/wow/s/ina-=aetk
3. http://evaeia.tslbadnmm/c/eio/alio1ee-p2p
4. http://tnree.seee.ccoo.arg/nv/aa2/ibrsn/
5. http://xieair/om4a=6am/m.g/==jsohd6ii.eo4cia/5f0a.yorth2pm8s

Experiments were carried out on the above dataset: 80% of the dataset was used as the training set and the remaining 20% as the test set. The main parameters of the training process are as follows. The batch size for training was set to 50, and the model was optimized with the Adam optimizer at a learning rate of 0.00005.
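The data split described above can be sketched as follows. The sketch assumes that the labeled fraction is taken from the training portion, which the text does not state explicitly; the function name, the random seed and the dictionary keys are illustrative only.

```python
import random

# Hyperparameters reported in the text; the key names are illustrative.
TRAIN_CONFIG = {"batch_size": 50, "learning_rate": 5e-5, "optimizer": "adam"}


def split_dataset(samples, train_frac=0.8, labeled_frac=0.2, seed=0):
    """Shuffle, split into train/test, and keep labels for part of the training set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    train, test = shuffled[:n_train], shuffled[n_train:]
    # Assumption: the labeled fraction refers to the training portion.
    n_labeled = round(len(train) * labeled_frac)
    return train[:n_labeled], train[n_labeled:], test


# Sample indices stand in for the 41,227 benign and 20,613 malicious URLs.
labeled, unlabeled, test = split_dataset(range(41227 + 20613))
```

Under these assumptions the 61,840 samples yield 49,472 training and 12,368 test samples, and a 20% labeled fraction leaves 9,894 labeled and 39,578 unlabeled training samples.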

4.2 Results analysis

With only 20%, 30% and 40% of the data in each group randomly labeled, the experimental results were compared with those of the existing detection method based on SVM [12] and the detection method based on LSTM [14], as shown in Table 3. The structure of the LSTM method is the same as that of the discriminator, which is composed of three LSTM layers. The numbers of neurons in the three layers are set to 128, 256 and 256, and the dropout rate is set to 0.5. The SVM uses the RBF (radial basis function) kernel. From Table 3, we can see that the detection accuracy of the traditional malicious URL detection methods based on SVM and LSTM is lower than ours when the number of labeled samples is relatively small. When only 20% of the data are labeled, the proposed method still achieves a detection accuracy of 87.8%, outperforming SVM (80.5%) and LSTM (84.3%), which suggests that our method is more practical in real application environments.

Table 3
Detection Accuracy (%) of Common URL Detection Methods

Percentage of labeled samples	SVM	LSTM	Our approach
20%	80.5	84.3	87.8
30%	81.6	84.7	89.1
40%	83.7	85.1	91.9

To further validate the effectiveness of the proposed method under the condition of insufficient labeled samples, three more evaluation metrics are used: the precision rate, the recall rate, and the F₁ score. The results for labeled fractions from 5% to 40% are shown in Table 4, from which we can see that all metrics decrease as the number of labeled samples is reduced. The performance drops fastest when the percentage of labeled samples falls from 20% to 10% (accuracy falls from 87.8% to 78.4%), but the precision rate still reaches 87.0% when only 10% of the samples are labeled. When the labeled samples account for 30% and 40% of the dataset, the proposed method achieves good performance on all four metrics.

Table 4

The Performance with Different Percentages of Labeled Samples

Labeled samples	Accuracy	Precision	Recall	F1
5%	76.7	85.6	77.6	81.4
10%	78.4	87.0	79.2	82.9
20%	87.8	89.4	86.6	88.0
30%	89.1	92.1	86.4	89.2
40%	91.9	92.0	90.9	91.4
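For reference, the four metrics in Table 4 follow the standard definitions based on confusion-matrix counts. The sketch below uses hypothetical counts (the paper does not report the confusion matrices) and also recomputes one F₁ value of Table 4 from its precision and recall.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
acc, prec, rec, f1 = classification_metrics(tp=8, fp=2, tn=9, fn=1)

# Consistency check against the 20% row of Table 4 (precision 89.4, recall 86.6).
f1_20 = round(f1_from(89.4, 86.6), 1)
```

Recomputing F₁ for the 20% row gives 88.0, which matches the value listed in Table 4.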

Moreover, to check the statistical reliability of the results, 5-fold cross-validation was performed. The results are compared with the semi-supervised methods SETRED, self-training using C4.5 [37], CoForest [25] and S3VM [26]. These algorithms used the parameters presented in [38]. The experimental results are shown in Table 5, from which we can see that, compared with the four other semi-supervised learning methods, the proposed approach achieves the highest classification accuracy in all three settings. Compared with the self-training method, the proposed method shows a clear improvement in accuracy when the number of labeled URL samples is relatively small. This is because our approach uses unlabeled samples directly for training through the discriminator loss function, which avoids using too many mislabeled samples. Moreover, our approach uses LSTM to build the discriminator, which can extract the semantic information of the URL dataset more fully and avoids the influence of manual URL feature selection. A Friedman test was carried out on the experimental results in Table 5, and the p-value was 0.007, indicating statistically significant differences. In addition, as shown in Table 4 and Table 5, the accuracy of the proposed method obtained by one random test and by 5-fold cross-validation is very close (for example, 87.8% versus 87.6% with 20% labeled samples), which also supports the stability of the proposed method.

Table 5

Detection Accuracy of Semi-Supervised Learning Method (5-fold CV)

Percentage of labeled samples	Self-training (C4.5)	SETRED	CoForest	S3VM	Our approach
20%	85.7	84.2	85.5	83.2	87.6
30%	87.2	85.9	88.6	87.1	89.3
40%	90.4	89.8	90.3	90.7	92.3

During the training, since the proposed model has to train both the LSTM discriminator and the generator, the overall training is time-consuming. The proposed model gradually converges at 10,000 iterations, while under the same setting, the LSTM algorithm only needs 3000 iterations to converge. Considering the performance improvement by the GAN model, it is worth sacrificing the training time.

5 Conclusion and future work

This paper introduces GAN into the field of malicious URL detection and proposes a semi-supervised malicious URL detection approach to address two problems: the insufficient number of labeled samples and the imbalance between negative and positive samples in URL datasets. By constructing a supervised loss function and an unsupervised loss function, the detection approach can make full use of the labeled samples, the unlabeled samples, and the synthetic samples to train the classifier together. In this manner, it alleviates both the class imbalance and the shortage of labeled samples. In addition, the approach extracts features automatically and does not require features designed manually with expert knowledge, resulting in a better generalization ability.

The proposed method also has some limitations. One is that it may be vulnerable to adversarial examples, since many unlabeled samples are used to train the network. The other is the long training time. In the future, we will study how to solve these two problems. In addition, we will investigate how to deal with cases in which even fewer labeled samples are available or the imbalance is more serious.

6 Compliance with ethical standards

This work is supported by the Key Areas Research and Development Program of Guangdong Province (grant#2019B010139002), the project of Guangzhou Science and Technology (grant#202007010004), and the project of Guangzhou Science and Technology (grant#202007040005). The authors declare that they have no conflict of interest.

Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.

Footnotes

1. For simplicity, y_{K+1} is short for y = y_{K+1}, indicating that the input is predicted as the synthesized one.

2. y_k is short for y = y_k.


Table 3 Detection accuracy (%) of common URL detection methods
Percentage of labeled samples	SVM	LSTM	Our approach
20%	80.5	84.3	87.8
30%	81.6	84.7	89.1
40%	83.7	85.1	91.9