Evaluating deep learning approaches to characterize and classify malicious URLs

Abstract

A malicious uniform resource locator (URL), also termed a malicious website, is a foundational mechanism for many criminal activities on the internet, such as phishing, spamming, identity theft, financial fraud and malware distribution. It is considered a common and serious threat to cybersecurity. Researchers have proposed blacklisting mechanisms and many machine learning based solutions to detect and classify malicious URLs. Blacklisting is ineffective at finding both variations of known malicious URLs and newly generated URLs. Additionally, it requires human input and is therefore time consuming in real-time scenarios. Machine learning based solutions rely on a feature engineering phase to extract hand-crafted features, including linguistic, lexical, contextual or semantic features, statistical information of the URL string, n-grams, bag-of-words, link structures, content composition, DNS information, network traffic, etc. As a result, feature engineering in machine learning based solutions has to evolve with new malicious URLs. In recent times, deep learning has attracted considerable attention due to its significant results in various artificial intelligence (AI) tasks in image processing, speech processing, natural language processing and many other fields. Deep learning models can extract features automatically from raw input text. To transfer this capability to the task of malicious URL detection, we evaluate various deep learning architectures, specifically recurrent neural network (RNN), identity-recurrent neural network (I-RNN), long short-term memory (LSTM), convolution neural network (CNN), and convolutional neural network-long short-term memory (CNN-LSTM), by modeling real known benign and malicious URLs at the character level.
The optimal parameters for each deep learning architecture are found by running experiments with various configurations of network parameters and network structures. All experiments are run for up to 1000 epochs with a learning rate in the range [0.01-0.5]. In our experiments, deep learning mechanisms outperformed the hand-crafted feature mechanisms. Specifically, LSTM and the hybrid network of CNN and LSTM achieved the highest accuracies, 0.9996 and 0.9995 respectively. This may be because deep learning mechanisms can learn hierarchical feature representations and long-range dependencies in sequences of arbitrary length.

Keywords

Malicious uniform resource locator (URL) or malicious website; deep learning mechanisms: Recurrent Neural Network (RNN), Identity-Recurrent Neural Network (I-RNN), Long Short-Term Memory (LSTM), Convolution Neural Network (CNN), Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM)

1 Introduction

Over the years, technological advances in communication have radically transformed working environments ranging from education to trading. These advances provide instant access to various sources of information and have made the web a mainstream platform in fields such as e-commerce and social networking. Today, a venture needs a web presence, supported by search engine optimization (SEO), to start or run successfully. These factors have driven the deployment of new applications, which are increasing rapidly and will continue to evolve and to have an enormous impact on all fields. Unfortunately, the rapid growth in technologies has come coupled with security holes. As a result, a malicious author has an opportunity to attack an end-user to gain illegitimate benefits. The most commonly used technique for conducting diverse attacks is the rogue website. A rogue website allows an attacker to display unsolicited content or launch attacks in the form of spam, phishing, SQL injection, denial of service (DoS), distributed denial of service (DDoS), man-in-the-middle, malware and others, and causes financial fraud and the theft of private or sensitive information by deceiving end-users. Countering these diverse web-based attacks in a timely manner helps to limit significant damage, and this has become a major research topic in security intelligence. The websites on the internet are diverse and very large in number. As a result, identifying a website as either benign or malicious is a challenging task. Additionally, an attacker can disguise attacks and replicate them so that they are available in more than one place at any point in time.
Usually, an attacker implements malicious code and makes it available to users by spreading compromised uniform resource locators (URLs) on the web [1]. The authors of [2] inspected 90 websites, selected at random from the China Education and Research Network (CERNET), and reported that 30% of them were likely malicious. They also reported that 39% of the malicious code was JavaScript. In 2012, Kaspersky Lab reported that browser-based attacks had increased sharply from 946,393,693 to 1,595,587,670 [3]. Of these 1,595,587,670 attacks, 87.36% were carried out using URL-based mechanisms.

A uniform resource locator (URL) is a subset of the uniform resource identifier (URI), used to identify the location of a resource on a computer network and to retrieve it. It is mainly used to direct a user to a specific web page on a website. A URL has two parts. The first part defines the protocol, for example http or https, and the second part defines the location of the resource through a domain name or internet protocol (IP) address. The two parts are separated by a colon followed by two forward slashes (an example is shown in Fig. 1).

Fig.1

Syntax of uniform resource locator (URL).

In Fig. 1, the first part, “https”, denotes the protocol, “www.amrita.edu” is the host name, “amrita.edu” is the primary domain name, “edu” is the top-level domain name, and “center/computational-engineering-and-networking” defines the path to a particular resource, specifically a web page on the domain. Most of the time, users cannot tell by themselves whether a URL is benign or malicious. Thus an unsuspecting user visits websites through URLs presented in emails, web search results and elsewhere. Once a URL is compromised, an attacker can launch an attack. Such compromised URLs are typically termed malicious URLs. As a security mechanism, determining the nature of a particular URL helps to mitigate the aforementioned attacks.
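The decomposition described for Fig. 1 can be sketched with Python's standard `urllib.parse` module. This is only an illustration: the full URL below is reassembled from the parts named in the text, `url_parts` is a hypothetical helper, and taking the last two host labels as the primary domain is a simplifying assumption that fails for multi-label suffixes such as "co.uk".

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Split a URL into the parts named in the text (naive domain handling)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    return {
        "protocol": parts.scheme,
        "host_name": host,
        # Assumption: the primary domain is the last two labels of the host.
        "primary_domain": ".".join(labels[-2:]),
        "top_level_domain": labels[-1],
        "path": parts.path.lstrip("/"),
    }


example = url_parts(
    "https://www.amrita.edu/center/computational-engineering-and-networking"
)
for name, value in example.items():
    print(f"{name}: {value}")
```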

To counter these attacks, blacklisting is the technique most commonly used by antivirus companies in web filtering applications, appliances, search engines and browser toolbars. Blacklisting techniques maintain a database of malicious URLs, and the database has to be updated manually whenever a new URL is identified as malicious. Even though a blacklist can identify known malicious URLs accurately, its applicability in real-time settings has become limited because it is completely ineffective against unknown malicious URLs. Unknown malicious URLs can be blocked only after they have been identified as malicious using techniques such as honeypots, web crawlers and manual reporting through human feedback. These techniques use heuristic search in website analysis, tag the URLs as malicious and store them in the database. Blacklisting has been employed in services such as PhishTank 1, DNS-BH 2 and jwSpamSpy 3 and in commercial malicious URL detection systems such as Google Safe Browsing 4, McAfee SiteAdvisor 5, Web of Trust (WOT) 6, Websense ThreatSeeker Network 7, Cisco IronPort Web Reputation 8 and Trend Micro Web Reputation Query Online System 9. In today’s World Wide Web (WWW), maintaining an exhaustive list of malicious URLs is difficult because new URLs are generated every day. Moreover, to evade detection an attacker uses obfuscation and modifies a malicious URL to look like a benign one. The authors of [4] discussed four obfuscation mechanisms: obfuscating the host with an IP address, obfuscating the host with another domain, obfuscating with large host names, and using unknown or misspelled domains. All four mechanisms aim to hide malicious behavior by disguising malicious URLs. In addition to obfuscation, malicious authors have used URL shortening to make URLs more robust [5, 6].
Usually, an attacker embeds malicious code inside JavaScript and attempts to launch an attack when a user visits a malicious URL. In most current scenarios, an attacker obfuscates the malicious code to evade detection by signature-based techniques. To evade blacklisting, an attacker uses approaches such as fast-flux and the generation of new URLs. Fast-flux assigns multiple IP addresses to a domain name and changes them constantly. To generate new URLs randomly, an attacker mostly uses algorithmic techniques.

Machine learning based detection and classification of malicious URLs is another solution followed by many security research communities [7–13]. Machine learning methods use a repository of URLs as training data, extract a set of statistical features, and then learn a discriminative function to distinguish between benign and malicious URLs. Thus, in contrast to blacklisting, a machine learning model is able to generalize to unknown malicious URLs. The machine learning (ML) methods employed fall mostly into 3 types: (1) supervised ML, where each URL has a label, either benign or malicious; (2) unsupervised ML, where the URL repository has no class labels; and (3) semi-supervised ML, where only a limited set of URLs has class labels. To learn a good feature representation, researchers have used various sets of features in their feature engineering phase. These include linguistic, lexical, and contextual or semantic features, statistical information of the URL string, n-grams, bag-of-words, link structures, content composition, DNS information and host-based features such as geo-location and WHOIS information. As a next step, the extracted features have to be transformed into numerical form before they can be fed to a machine learning model. Naive Bayes (NB), Decision Tree (DT), Logistic Regression (LR), AdaBoost (AB), Random Forest (RF) and Support Vector Machine (SVM) are the most commonly used classical supervised machine learning models for classifying URLs as either malicious or benign. The authors of [25] compared the effectiveness of an artificial neural network (ANN) approach with static machine learning classifiers such as SVM, DT, NB and KNN for malicious web page detection using static feature sets from URLs and page contents. The ANN approach performed well, reporting the highest accuracy, 95.08, in comparison to the other static classifiers.
Additionally, the importance of each feature for identifying attacks, and thereby reducing the false positive rate, was discussed in detail. The detection rate of supervised machine learning models for malicious URLs depends directly on the feature representation. Because of the enormous amount of training data, researchers have adopted scalable learning mechanisms such as online learning [14]. Extracting various features and evaluating the effectiveness of various supervised machine learning models has thus been an active area of research in past years. The feature engineering phase in machine learning models remains a resource-intensive task. In recent years, deep learning mechanisms have shown their strength in various artificial intelligence tasks in image processing, natural language processing, speech recognition and many other fields [16]. They can learn abstract feature representations by themselves by passing raw data through several hidden layers. Each hidden layer maps the data to a higher-dimensional space to learn its characteristics effectively and to generalize to new sets of URLs. Following this, in this paper we evaluate various deep learning mechanisms to understand their effectiveness in characterizing, detecting and classifying malicious URLs. To do so, we crawled large sets of benign and malicious URLs and fed them as input to deep learning algorithms. Deep learning based mechanisms are complex, and their inner workings remain a black box. Thus an adversary may not be able to reverse engineer them easily. To defeat a deep learning based detector, an adversary may require the same set of training samples.

The remaining sections of this paper are organized as follows. Section 2 discusses the text encoding mechanism and the deep learning algorithms mathematically. Section 3 provides the details of the malicious URL corpus, hyperparameter tuning and the deep learning architecture for malicious URL detection. Section 4 presents the evaluation results. Section 5 discusses future work, and the conclusion is given in Section 6.

2 Background

2.1 Text representation

Text representation is typically termed text encoding. It has 2 steps. The first step involves preprocessing and tokenizing sentences into words and words into characters. During preprocessing, all uppercase characters are converted to lowercase. The second step is vocabulary creation using the training data. The vocabulary size acts as a trade-off between the training vectors of each class and the number of parameters to learn for the given task. Initially, the input texts are mapped to a vector sequence representation (a list of character indexes) by assigning a unique id to each vocabulary entry or character. Each unique id corresponds to a vector whose length is the size of the vocabulary. These character ids are transformed into feature vectors using a lookup table operation, which can be formulated mathematically as follows.

A lookup table layer $LUT_{C}$ represents each character $c \in V$ as a $d_{cvd}$-dimensional feature vector, $LUT_{C}(c) = \langle C \rangle_{c}^{1}$ (1) where $C \in ℝ^{d_{cvd} \times |V|}$ denotes the weight matrix of learnable parameters, $\langle C \rangle_{c}^{1} \in ℝ^{d_{cvd}}$ is the column of $C$ associated with character $c$, and $d_{cvd}$ denotes the character-vector size, which is chosen through hyperparameter tuning. Applying this operation to each character of a sequence of $s$ characters gives $LUT_{C}([c]_{1}^{s}) = (\langle C \rangle_{[c]_{1}}^{1} \; \langle C \rangle_{[c]_{2}}^{1} \; \langle C \rangle_{[c]_{3}}^{1} \dots \langle C \rangle_{[c]_{s}}^{1})$ (2) where $[c]_{1}^{s}$ denotes the sequence of $s$ characters chosen from the vocabulary $V$. Next, the variable-length id sequences are converted to fixed-length sequences and passed to a character embedding layer, which maps the discrete character ids to vectors of continuous numbers. The character embedding captures the semantic meaning of a given URL sequence by mapping it into a high-dimensional geometric space, called the character embedding space. If the embedding has properly learnt the semantics of the URLs, then similar characters appear close to each other, in the same cluster, in this space. The resulting continuous vectors are fed to one of the following layers: (1) RNN (2) LSTM (3) GRU (4) I-RNN (5) CNN (6) CNN-LSTM.
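The lookup table operation of Eqs. (1) and (2) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `build_vocab`, `make_lookup_table` and `lookup` are hypothetical helpers, id 0 is reserved for unknown characters and padding, the table is stored row-wise (the transpose of the matrix C in Eq. (1)), and the random initial values stand in for parameters that would be learned during training.

```python
import random


def build_vocab(urls):
    """Assign a unique id (starting at 1) to every lowercase character seen."""
    chars = sorted({ch for url in urls for ch in url.lower()})
    return {ch: i + 1 for i, ch in enumerate(chars)}  # id 0 is reserved


def make_lookup_table(num_ids, dim, seed=0):
    """One dim-sized vector per id; random values stand in for learned ones."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(num_ids)]


def lookup(table, ids):
    """Eq. (2): replace each character id by its feature vector."""
    return [table[i] for i in ids]


vocab = build_vocab(["http://a.b", "HTTP://c"])   # hypothetical sample URLs
table = make_lookup_table(len(vocab) + 1, dim=4)  # +1 for the reserved id 0
ids = [vocab[ch] for ch in "http://a.b"]
vectors = lookup(table, ids)
print(len(vectors), "vectors of size", len(vectors[0]))
```

Repeated characters (for example the two "t" characters above) map to the same vector, which is what allows the embedding to be shared across positions.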

2.2 Recurrent Neural Network (RNN)

The recurrent neural network (RNN), introduced in 1990 [15], is an extension of the feed forward network (FFN). It takes an input sequence x_T of arbitrary length and uses a transition function tf to map the input sequence to its internal hidden state vector hi_T recursively. At each time step t, the hidden state vector hi_t is computed by the transition function from the current input x_t and the previous hidden state vector hi_t-1: $hi_{t} = \begin{cases} 0, & t = 0 \\ tf(hi_{t - 1}, x_{t}), & \text{otherwise} \end{cases}$ (3) RNNs have performed substantially well in long-standing artificial intelligence tasks [16].
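The recursion in Eq. (3) can be sketched as below. The transition function is assumed here to be a hyperbolic tangent applied to an affine map of the previous state and the current input, which is a common choice; the weight names `W_h`, `W_x` and `b` and the toy values are illustrative and do not come from the experiments.

```python
import math


def rnn_step(h_prev, x, W_h, W_x, b):
    """Assumed transition tf(h_prev, x) = tanh(W_h h_prev + W_x x + b)."""
    return [
        math.tanh(
            sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
            + sum(W_x[i][k] * x[k] for k in range(len(x)))
            + b[i]
        )
        for i in range(len(b))
    ]


def rnn_forward(xs, W_h, W_x, b):
    """Eq. (3): start from the zero state and apply tf once per time step."""
    h = [0.0] * len(b)
    for x in xs:
        h = rnn_step(h, x, W_h, W_x, b)
    return h


# Toy weights: 2 hidden units, 1 input feature.
W_h = [[0.0, 0.0], [0.0, 0.0]]
W_x = [[0.5], [-0.5]]
b = [0.0, 0.0]
h_last = rnn_forward([[1.0]], W_h, W_x, b)
print(h_last)
```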

The state-to-state computation in the transition function tf is the composition of an affine transformation of x_t and hi_t-1 with an element-wise non-linear activation function. This form of transition function leads to vanishing and exploding gradient issues during training. To mitigate this, [17] introduced long short-term memory (LSTM), which contains a special unit called a memory block. A memory block is a complex processing unit. Each memory block contains one or more memory cells and a set of input and output gates. A memory cell stores information, releases it when necessary, and additionally contains a constant error carousel (CEC) component. The CEC has a fixed value of 1 and is used when the memory cell does not receive any value from the outside signal. The states of a memory cell are controlled over time-steps by the pair of adaptive gates. Building on the research on LSTM, [18] introduced the gated recurrent unit (GRU). The GRU has fewer units than the LSTM and is computationally more efficient. In addition, [19] proposed the identity recurrent neural network (I-RNN), which initializes the recurrent weight matrix with the identity matrix; its performance was close to that of LSTM on 4 important tasks: two toy problems, language modeling and speech recognition.

2.3 Convolution Neural Network (CNN)

A convolution neural network (CNN) takes input in 2D form for images and in 1D form for time series and text [16]. The CNN is a sequence of a convolution 1D layer, a pooling 1D layer and a fully connected layer, with ReLU as the non-linear activation function. Let U = {uc₁, uc₂, ⋯ , uc_l} be a uniform resource locator (URL), in which uc denotes a character and l is the length of the URL, let V be the vocabulary of URL characters, and let d be the dimensionality of the character embedding. The character-level representation is encoded by an embedding matrix $V^{U} \in ℝ^{d \times l}$. A convolution 1D operation applies a filter or kernel $H \in ℝ^{d \times w}$ to a window of $w$ URL characters to construct a new feature map fm. Specifically, using the window of characters $V^{U}[*, j : j + w]$, a new feature is obtained through $fm^{U}[j] = AF(\sum (V^{U}[*, j : j + w] ⊙ H) + b)$ (4) where $b \in ℝ$ is the bias term, AF is an activation function, usually tanh or ReLU, and ⊙ is the element-wise multiplication between two matrices. To generate the feature map, the convolution filter is applied to all possible windows of characters in the URL, giving $fm^{U} = [fm_{1}, fm_{2}, \dots, fm_{l - w + 1}]$ where $fm^{U} \in ℝ^{l - w + 1}$.

To obtain the most significant features, we apply a pooling 1D operation to the feature map $fm^{U}$. For instance, if the input is downscaled by 2, then each pair of adjacent features in the feature map is reduced as $o_{j} = max(fm_{2 \times j - 1}, fm_{2 \times j})$ (5) where $o = [o_{1}, o_{2}, \dots, o_{\frac{l - w + 1}{2}}], o \in ℝ^{\frac{l - w + 1}{2}}$.
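The convolution and pooling steps of Eqs. (4) and (5) can be illustrated with a small pure-Python sketch. The names `conv1d`, `max_pool1d` and `relu`, the window width `w`, and the toy embedding matrix are assumptions made for this example; a single filter is used and pooling is taken over non-overlapping pairs.

```python
def relu(z):
    return max(0.0, z)


def conv1d(emb, H, b, af=relu):
    """Slide a d x w filter H over a d x l embedding matrix (Eq. (4))."""
    d, l, w = len(emb), len(emb[0]), len(H[0])
    feature_map = []
    for j in range(l - w + 1):
        # Element-wise product of the window with the filter, then sum.
        s = sum(emb[i][j + k] * H[i][k] for i in range(d) for k in range(w))
        feature_map.append(af(s + b))
    return feature_map


def max_pool1d(feature_map, size=2):
    """Keep the maximum of each non-overlapping group of `size` features (Eq. (5))."""
    return [
        max(feature_map[j:j + size])
        for j in range(0, len(feature_map) - size + 1, size)
    ]


# Toy example: d = 2 embedding dimensions, l = 5 characters, window w = 2.
emb = [[1, 2, 3, 4, 5],
       [0, 1, 0, 1, 0]]
H = [[1, 0],
     [0, 1]]
fm = conv1d(emb, H, b=0)
pooled = max_pool1d(fm, size=2)
print(fm, pooled)
```

With l = 5 and w = 2 the feature map has l - w + 1 = 4 entries, and pooling by 2 halves it to 2 entries, matching the dimensions stated for Eqs. (4) and (5).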

2.4 Hybrid Network (CNN-LSTM)

The hybrid network combines convolutional and recurrent neural networks. To capture the sequential patterns across time-steps of the features produced by the max-pooling operation in the CNN, we feed them to an LSTM as defined below, $fm = CNN (x_{t})$ (6) where CNN is composed of convolution1D and maxpooling1D layers and x_t denotes the input feature vector with a class label. The newly constructed feature map vector fm is passed to the LSTM to learn the long-range temporal dependencies.

3 Methodology

This section describes the URL data sets, the deep learning frameworks and the experimental analysis, which covers hyperparameter optimization and the binary classification settings. All models are trained using backpropagation through time (BPTT) with the ADAM update rule, using the GPU-enabled TensorFlow [20] software framework in conjunction with Keras 10 on a single Nvidia GK110BGL Tesla K40.

3.1 Description of data set

Phishing and malware URLs are two types of malicious URLs. We crawled legitimate URLs from Alexa 11 and the DMOZ directory 12 and malicious URLs from MalwareURL 13, MalwareDomains 14 and MalwareDomainList 15. We crawled a second data set consisting of legitimate URLs from Alexa and the DMOZ directory and phishing URLs from Phishtank 16 and OpenPhish 17. This data set is entirely distinct from the first one; the difference is that it contains a larger proportion of legitimate URLs. The detailed statistics are displayed in Table 1.

Table 1
Description of data set

Data	Benign	Malicious	Total
Data set 1
Training	42800	39000	81800
Testing	20000	15000	35000
Data set 2
Training	12000	6000	18000
Testing	5000	2000	7000

The crawled URLs are preprocessed and randomly split into training and testing sets. During preprocessing, upper-case characters are converted to lower-case, because distinguishing between upper- and lower-case characters might lead to a regularization issue [21]. The detailed statistics of the data set are reported in Table 1. The proposed approach is derived from character-level text classification [27]. Figure 2b shows the unigram alphanumeric distribution of legitimate and malicious URLs. Both benign and malicious URLs show an irregular rise and fall in the unigram probability distribution. The probability of each letter and digit is lower in malicious URLs than in benign URLs, while the probability of special characters is higher in malicious URLs than in benign URLs.
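A unigram character distribution of the kind plotted in Fig. 2b could be computed along the following lines. This is a sketch only: `unigram_distribution` and `special_char_mass` are hypothetical helpers, the sample strings are made up, and "special character" is assumed to mean any character that is neither a letter nor a digit.

```python
from collections import Counter


def unigram_distribution(urls):
    """Relative frequency of each lowercase character over a set of URLs."""
    counts = Counter(ch for url in urls for ch in url.lower())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def special_char_mass(dist):
    """Total probability assigned to non-alphanumeric characters."""
    return sum(p for ch, p in dist.items() if not ch.isalnum())


dist = unigram_distribution(["ab", "A-"])  # made-up strings, not real URLs
print(dist, special_char_mass(dist))
```

Computing the two quantities separately for the benign and the malicious sets would allow the comparison described above.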

Fig.2

(a) Proposed Deep learning architecture (b) Probability distribution of alphanumeric character.

3.2 Hyper parameter tuning

The recurrent layers, particularly LSTM, have a set of parameters, and classification performance depends directly on choosing them well. These parameters are the number of hidden layers, batch size, hidden layer units, dropout, optimizer and learning rate. We focus only on character-level inputs, so we did not run any parameter tuning experiments related to vocabulary size. The vocabulary size is set to 71 (the number of unique characters). The dropout, learning rate and batch size are set to 0.1, 0.01 and 64 respectively in both the recurrent structures and the CNN. In most cases, recurrent structures with one hidden layer of 128 units attained state-of-the-art performance. In addition to character-level inputs, character bigrams of URLs were used as input to the recurrent structures. The results with bigram inputs are considerably lower than those with character-level inputs.

In the experiments with CNN, we ran two trials for each number of filters: 4, 8, 16, 32, 64 and 128. The configuration with 128 filters achieved the best performance, although the accuracy with 32 and 64 filters is comparable to that with 128.

3.3 Deep learning architecture

The deep learning architecture for classifying a URL as either benign or malicious is displayed in Fig. 2a. It has 3 notional sections: (1) character encoding of URLs, (2) feature representation through deep layers and (3) classification.

3.3.1 Character encoding of URLs

In the initial step, the raw URLs are preprocessed by converting all characters to lower case. Then a unique id is assigned to each character, and each unique id corresponds to a vector whose length is the size of the vocabulary V. Here, the vocabulary size is |V| = 71. Unknown characters in URLs are rare, so they are assigned the default key 0. To find the URL with the largest number of characters, we formed 2 dictionaries: (1) one that maps character ids to characters and (2) one that maps characters to character ids. The largest URL length is 123. To make all URL sequences the same length, URLs shorter than 123 characters are padded with zeros. As a result, we get a matrix of size 81800 × 123 for training and 35000 × 123 for validation. These matrices are passed to an embedding layer in batches of 64, that is, as 64 × 123 matrices batch by batch. The embedding layer constructs a matrix of size 71 × 256. Each row is a character-embedding vector, and each character id is replaced by its character vector of size 256. During backpropagation, the embedding layer is optimized jointly with the other layers so that similar characters end up close together. This character clustering enables the other layers to find semantic and contextual similarity patterns in URLs more easily. To visualize the high-dimensional data of the embedding layer, we used a two-dimensional linear projection through PCA in t-SNE [22]. It transforms the 256-dimensional vectors into 2 dimensions, and these 2D vectors are plotted in Fig. 3. Similar characters fall in the same group, i.e. letters, numbers and special characters occur in different clusters. Obtaining special characters as a separate cluster plays a significant role in distinguishing benign URLs from malicious ones. For the most part, the embedding layer has learned to separate special characters from letters and numbers. Thus the embedding layer has learnt the semantic and contextual similarity of URLs.
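The encoding and padding steps described above can be sketched as follows. The helper `encode_url` and the tiny two-character vocabulary are hypothetical. Appending the zeros at the end of the sequence and truncating longer inputs are assumptions, since the text only states that shorter URLs are padded with zeros up to the maximum length of 123.

```python
MAX_LEN = 123  # largest URL length reported in the text


def encode_url(url, char_to_id, max_len=MAX_LEN):
    """Map characters to ids (unknown -> 0) and pad with zeros to max_len."""
    ids = [char_to_id.get(ch, 0) for ch in url.lower()]
    ids = ids[:max_len]  # assumption: truncate anything longer than max_len
    return ids + [0] * (max_len - len(ids))


# Hypothetical two-character vocabulary and the inverse dictionary.
char_to_id = {"a": 1, "b": 2}
id_to_char = {i: ch for ch, i in char_to_id.items()}

row = encode_url("aB?", char_to_id)
print(len(row), row[:5])
```

Stacking one such row per URL gives a matrix with 123 columns, which corresponds to the 81800 × 123 training matrix mentioned above.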

Fig.3

Embedded character vectors learned by the LSTM binary classification model, represented using a 2-dimensional linear projection (PCA).

3.3.2 Features representation through deep layers

We adopted recurrent layers such as RNN, LSTM, GRU and I-RNN, as well as CNN and the hybrid CNN-LSTM architecture, for feature representation. To evaluate the performance of each deep layer, various experiments were conducted. Most of the architectures employed are not very deep, so we did not rely on batch normalization [26]. All experiments used a learning rate of 0.1 and a batch size of 64. A batch size of 64 means that the model parameters are updated via backpropagation once per 64 training samples.

Recurrent layers. The recurrent layer is one of RNN, LSTM, GRU and I-RNN. Based on domain knowledge, we set the number of hidden units to 256. An RNN unit uses the hyperbolic tangent, with range [-1, 1], as its input and output activation function. An LSTM memory cell uses the hyperbolic tangent, with range [-1, 1], as its input and output activation function and the logistic sigmoid, with range [0, 1], for the gates and other neurons. The recurrent layer captures the dependencies in the matrix of shape (71 × 256) received from the embedding layer and passes its last output, of size 256, to a dropout layer with rate 0.1.

Convolution layers. Because a sentence is one-dimensional, we apply convolution1D, in which a filter slides in only one direction. The convolution stage typically includes two steps: (1) convolution1D and (2) pooling1D. The convolution1D layer has 128 filters of length 5, which means that each filter is applied to 5 characters at a time. Each character is represented by a vector of 128 elements. To characterize a sequence of 5 characters, a scalar product is taken between the filter and the character vectors. Finally, the convolution1D layer passes its output of shape 5 × 128 to the pooling1D layer. The pooling1D layer uses the max operation with a stride of length 2, which reduces the feature map to half its length. The resulting matrix shape after the pooling1D operation is 248 × 128. The matrix is then flattened. The resulting vector has 31744 elements, of which the first 128 elements belong to the first character, the second 128 elements to the second character, and so on. Next, we pass this vector to a dense layer with 256 units, followed by a dropout layer with rate 0.1. This helps to learn latent features, including the position of each individual character. In the hybrid architecture, the pooling1D output matrix is instead passed to LSTM layers. The LSTM layers learn the temporal dependencies of the hierarchical features by passing the sequences through memory blocks with a set of gates and non-linear activation functions. Finally, the LSTM compresses the output to a vector of size 256, which is fed to the dense layer.
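The size of the flattened vector follows from simple shape arithmetic, sketched below. The helpers are hypothetical and assume a convolution without padding and a pooling window equal to its stride; the text does not state the padding mode, so these are assumptions. Under them, the reported 248 × 128 pooled matrix flattens to 31744 elements, and an input of 123 characters with a filter length of 5 would give 119 positions after convolution and 59 after pooling, so the reported 248 positions would correspond to a different input length or padding mode, which the text does not specify.

```python
def conv1d_out_len(length, kernel, stride=1):
    """Output length of a 1D convolution without padding (assumption)."""
    return (length - kernel) // stride + 1


def pool1d_out_len(length, pool, stride=None):
    """Output length of 1D pooling; the stride defaults to the pool size."""
    stride = stride or pool
    return (length - pool) // stride + 1


def flatten_len(steps, filters):
    """Number of elements after flattening a steps x filters matrix."""
    return steps * filters


print(flatten_len(248, 128))                        # size of the flattened vector
print(conv1d_out_len(123, 5))                       # positions after a length-5 filter
print(pool1d_out_len(conv1d_out_len(123, 5), 2))    # positions after pooling by 2
```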

Regularization. A dropout layer with rate 0.1 serves as the regularization mechanism. It is interleaved with the deep layers RNN, LSTM, GRU and I-RNN, whereas in CNN the dense layer is interleaved with a dropout layer with rate 0.2 and additionally followed by an activation layer with the ReLU non-linear activation function. Dropout is a widely adopted mechanism that randomly removes neurons, along with their connections, while training a deep learning architecture.

3.4 Classification

The dropout layer is followed by a dense layer with 1 unit, which distinguishes a URL as malicious or benign. The dense layer is followed by an activation layer with the sigmoid non-linear activation function, and the loss function is binary cross-entropy. The dense layer is typically called a fully connected layer. It combines the features received from the recurrent layers, CNN or CNN-LSTM into a single unit, emphasizing the most important ones. It thus completes the hierarchical feature representation for the final classification stage. The binary cross-entropy loss is calculated using the formula below,

$loss(pr, ep) = - \frac{1}{N} \sum_{j = 1}^{N} [ep_{j} \log pr_{j} + (1 - ep_{j}) \log (1 - pr_{j})]$ (7)

Here ep is the vector of expected class labels, pr is the vector of predicted probabilities for all URLs in the testing data set, and N is the number of URLs. To minimize the loss, we used the ADAM optimizer via backpropagation.
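The loss in Eq. (7) can be computed directly, as in the short sketch below. The function name and the clipping constant `eps`, which guards against taking the logarithm of zero, are additions for this example and are not part of the formula in the text.

```python
import math


def binary_cross_entropy(pr, ep, eps=1e-12):
    """Eq. (7): mean negative log-likelihood of the expected labels."""
    total = 0.0
    for p, e in zip(pr, ep):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += e * math.log(p) + (1 - e) * math.log(1 - p)
    return -total / len(pr)


uncertain = binary_cross_entropy([0.5, 0.5], [1, 0])
confident = binary_cross_entropy([0.9, 0.1], [1, 0])
print(uncertain, confident)
```

Predictions of 0.5 for every URL give a loss of log 2, while confident and correct predictions drive the loss toward zero.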

4 Evaluation results

Epoch-wise testing performance was evaluated using the trained LSTM model and is displayed in Fig. 4b. The LSTM model performed well until 80 epochs, after which the accuracy suddenly fell. The accuracy peaked again at epoch 350 and fluctuated thereafter. The CNN-LSTM model performed well until 170 epochs and thereafter showed the same fluctuations as the LSTM model. This happened due to overfitting. The bigram model maintained nearly constant accuracy, with only slight fluctuations, up to 1000 epochs. The performance of RNN, I-RNN, GRU and hand-crafted features with logistic regression is not good before 100 epochs. After 100 epochs, all of these models showed a sudden rise and maintained the same accuracy up to 1000 epochs. Based on these observations, we decided that 500 epochs are sufficient to classify a URL as malicious or benign. As a baseline comparison, we applied RF, DT, Maximum Entropy Modeling (MT), AB and NB to the hand-crafted features. The performance of all models in terms of accuracy, precision, recall and F1-score is reported in Table 2. LSTM and CNN-LSTM performed well in distinguishing URLs as either benign or malicious in comparison to RNN, I-RNN and the other mechanisms. Moreover, the deep learning approaches outperformed the classical machine learning algorithms. The results of the RNN architecture are comparable to those of both the LSTM and CNN-LSTM architectures. To study classifier performance in terms of true-positive rate (TPR) and false-positive rate (FPR), the receiver operating characteristic (ROC) curves of both the deep learning and the classical machine learning classifiers are displayed in Fig. 4a. LSTM and CNN-LSTM both showed good performance (AUC of 1.000), with consistent TPR and FPR.

Fig. 4

(a) ROC curve, (b) Performance of deep learning models.
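
To make the relation between TPR, FPR and AUC concrete, the sketch below derives ROC points by sweeping the decision threshold over the predicted scores and computes the AUC as the fraction of (malicious, benign) pairs that are ranked correctly. It is an illustration only and not the code used to produce Fig. 4a; the function names and the sample scores are assumptions.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over all distinct scores."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = sum(1 for t in y_true if t == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one sample of each class")
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= cutoff)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= cutoff)
        points.append((fp / n_neg, tp / n_pos))
    return points


def roc_auc(y_true, scores):
    """AUC as the probability that a malicious URL scores above a benign one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = malicious) and classifier scores.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.95, 0.80, 0.40, 0.10, 0.60, 0.20, 0.05, 0.70]
curve = roc_points(labels, scores)
auc = roc_auc(labels, scores)
```

An AUC of 1.000 corresponds to the case in which every malicious URL receives a higher score than every benign URL, so that some threshold yields a TPR of 1 with an FPR of 0.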

Table 2

Summary of test results on Data set 1 for binary classification using deep learning mechanisms and classical machine learning classifiers with hand-crafted features

Algorithm	Accuracy	Precision	Recall	F-score
LSTM	0.9996	0.9988	0.9982	0.9985
RNN	0.9820	0.9002	0.9568	0.9277
I-RNN	0.9656	0.7854	0.9357	0.8540
GRU	0.9974	0.9830	0.9974	0.9901
CNN	0.9440	0.6638	0.8679	0.7522
CNN-LSTM	0.9995	0.9978	0.9989	0.9984
bigram	0.8964	0.6566	0.5854	0.6190
Hand-crafted features
RF	0.9303	0.5116	0.9023	0.6523
DT	0.811	0.701	0.879	0.817
MT	0.753	0.782	0.766	0.774
AB	0.845	0.897	0.872	0.884
NB	0.747	0.945	0.667	0.782

The inner workings of deep learning networks have remained a black box for both novices and advanced users. Deep architectures such as RNN, LSTM, GRU, and I-RNN are inevitably very complex. Their deep layers store a lot of information, which can be extracted by unwrapping them. Mostly, the character representation learned in the embedding layer is transformed through the deep layers to capture the semantic similarity between characters. The non-linear activation in each deep layer helps to learn the best features. Using the learned feature representations, the last layer of a deep architecture should maximally separate benign and malicious URLs. To examine this in our experiments, we randomly selected 500 samples, composed of 250 benign and 250 malicious URLs, and passed them to the LSTM architecture. The outputs of the last layer, i.e. before the sigmoid activation function, are passed to t-SNE, which transforms the high-dimensional vectors into two-dimensional vectors, as shown in Fig. 5. Fig. 5 shows a clear separation between benign and malicious URLs. This suggests that the LSTM model has learnt a good feature representation for accurately detecting and classifying malicious URLs.

Fig. 5

100 samples from each class (benign and malicious), with the corresponding activation values of the last hidden layer neurons, are represented using a two-dimensional linear projection (PCA) with t-SNE. Note that the samples are clustered based on the similarity of their activation values before the sigmoid layer.

Three different experimental designs were carried out using machine learning and deep learning algorithms:

1. Experiments with Data set 1, as discussed above

2. Experiments with Data set 2

3. Experiments with the merged data set of Data set 1 and Data set 2

Two trials of experiments were carried out on Data set 2. The performance of both the machine learning and the deep learning classifiers is lower than on Data set 1. This is because the test data of Data set 2 contains more legitimate URLs and is completely unseen. The detailed performance statistics of both the machine learning and the deep learning algorithms on Data set 2 are reported in Table 3. Moreover, the performance of the machine learning and deep learning classifiers on the merged data set of Data set 1 and Data set 2 is reported in Table 4. Overall, the detection rate of malicious URLs by both the classical machine learning and the deep learning algorithms is acceptable.

Table 3

Summary of test results on Data set 2 for binary classification using deep learning mechanisms and classical machine learning classifiers with hand-crafted features

Algorithm	Accuracy	Precision	Recall	F-score
LSTM	0.991	0.988	0.999	0.993
RNN	0.971	0.961	1.000	0.980
I-RNN	0.967	0.956	1.000	0.977
GRU	0.985	0.979	1.000	0.989
CNN	0.948	0.932	1.000	0.965
CNN-LSTM	0.982	0.976	1.000	0.988
bigram	0.902	0.888	0.986	0.935
Hand-crafted features
RF	0.897	0.897	0.967	0.931
DT	0.874	0.854	0.994	0.919
MT	0.881	0.875	0.974	0.921
AB	0.882	0.904	0.933	0.919
NB	0.872	0.852	0.994	0.917

Table 4

Summary of test results on the merged data set of Data set 1 and Data set 2 for binary classification using deep learning mechanisms and classical machine learning classifiers with hand-crafted features

Algorithm	Accuracy	Precision	Recall	F-score
LSTM	0.985	0.979	1.000	0.989
RNN	0.962	0.950	0.999	0.974
I-RNN	0.954	0.941	0.999	0.969
GRU	0.982	0.976	1.000	0.988
CNN	0.952	0.938	1.000	0.968
CNN-LSTM	0.986	0.980	1.000	0.990
bigram	0.904	0.904	0.968	0.935
Hand-crafted features
RF	0.896	0.884	0.983	0.931
DT	0.894	0.891	0.970	0.929
MT	0.876	0.856	0.993	0.920
AB	0.885	0.872	0.984	0.925
NB	0.877	0.861	0.988	0.920

5 Future work and discussions

Malicious URL detection plays an important role in most Cybersecurity applications. Machine learning based solutions are prevalent in malicious URL detection. Advances in machine learning research have given rise to deep learning, which can be viewed simply as a family of more complex machine learning models. Deep networks are able to extract the necessary abstract feature representations directly from raw input data. To exploit this benefit, in this paper we apply deep learning to the task of malicious URL detection. In most of the experimental settings, deep networks performed well in comparison to the hand-crafted feature mechanism. The reported results in detecting malicious URLs are acceptable, but we fall short in showing the inner mechanics of deep networks. This can be considered as one future direction. It can be done by transforming the non-linear state into a linearized form and then calculating and analyzing the eigenvalues and eigenvectors over time-steps [23]. We used various deep networks with simple architectures and, because of the computational cost, fall short in showing the effectiveness of more complex networks. Such complex architectures can be trained using advanced hardware in a distributed environment.

The characteristics of malicious threats evolve continuously, and URLs also change over time. We claim that deep learning mechanisms are the most suitable for dealing with the drifting of URLs. In real-time scenarios, obtaining adequate labeled training data is often difficult. One of the largest openly available labeled URL training data sets is of size 2.4 million [24]. This calls for a larger study that moves from supervised learning to semi-supervised and unsupervised learning in deep learning mechanisms. This can be considered as another significant future direction.

6 Conclusion

This paper has evaluated the effectiveness of various deep learning mechanisms for detecting and analyzing malicious URLs. Benign and malicious URLs are modeled at the character level, and features are extracted automatically. This avoids manual hand-crafted feature engineering and thereby makes the approach robust in handling the drifting of URLs and in adversarial machine learning settings. The embedding layer, followed by the other deep network layers, extracts features implicitly, and these features are fed to the deeper layers of the network for classification in a supervised setting. The RNN family and its hybrid network with CNN performed well compared with hand-crafted feature based machine learning mechanisms. Indeed, we claim that modeling URLs at the character level and learning character sequences using various deep learning algorithms is more effective for characterizing, detecting and classifying malicious URLs than hand-crafted feature based machine learning mechanisms. This work can be considered as a baseline system for further detailed analysis of the performance of deep learning mechanisms, including mathematical exploration. Moreover, adopting deep learning based malicious URL detection in real time would require extended data sets to avoid the situation in which an adversary can easily reverse engineer the deep learning mechanisms.

Based on the obtained results, the malicious URL detection system can be used as a first line of defense, followed by web page content analysis. The developed system can respond faster than web page content analysis.

Acknowledgments

This research was supported in part by Paramount Computer Systems. We are also grateful to NVIDIA India for the GPU hardware support provided through a research grant, and to the Computational Engineering and Networking (CEN) department for encouraging the research.

References

1. Hong, The state of phishing attacks, Communications of the ACM 55(1) (2012), 74–81.

2. Liang, Huang, Liu, Wang, Dong and Liang, Malicious web pages detection based on abnormal visibility recognition, in International Conference on E-Business and Information System Security (EBISS’09), IEEE (2009).

3. Maslennikov and Namestnikov, Kaspersky security bulletin. Statistics 2012: The overall statistics for 2012. Kaspersky Lab. [Online]. Available: https://securelist.com/kaspersky-security-bulletin-2012

4. Garera, Provos, Chew and Rubin A.D., A framework for detection and measurement of phishing attacks, in Proceedings of the 2007 ACM Workshop on Recurring Malcode, ACM (2007), pp. 1–8.

5. Chhabra, Aggarwal, Benevenuto and Kumaraguru, Shocial: The phishing landscape through short URLs, in Proceedings of the 8th Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, ACM (2011), pp. 92–101.

6. Alshboul, Nepali and Wang, Detecting malicious short URLs on Twitter (2015).

7. Garera, Provos, Chew and Rubin A.D., A framework for detection and measurement of phishing attacks, in Proceedings of the 2007 ACM Workshop on Recurring Malcode, ACM (2007), pp. 1–8.

8. McGrath D.K. and Gupta, Behind phishing: An examination of phisher modi operandi, LEET 8 (2008), p. 4.

9. Ma, Saul L.K., Savage and Voelker G.M., Beyond blacklists: Learning to detect malicious web sites from suspicious URLs, in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM (2009).

10. Ma, Saul L.K., Savage and Voelker G.M., Learning to detect malicious URLs, ACM Transactions on Intelligent Systems and Technology (TIST) 2(3) (2011), 30.

11. Choi, Zhu B.B. and Lee, Detecting malicious web links and identifying their attack types, WebApps 11 (2011), 11.

12. Patil D.R. and Patil, Survey on malicious web pages detection techniques, International Journal of U-and E-service, Science and Technology 8(5), 195–206.

13. Kuyama, Kakizaki and Sasaki, Method for detecting a malicious domain by using whois and dns features, in The Third International Conference on Digital Security and Forensics (DigitalSec2016) (2016), p. 74.

14. Hoi S.C., Wang and Zhao, Libol: A library for online learning algorithms, The Journal of Machine Learning Research 15(1) (2014), 495–499.

15. Elman J.L., Finding structure in time, Cognitive Science 14(2) (1990), 179–211.

16. LeCun, Bengio and Hinton, Deep learning, Nature 521 (2015), 436–444.

17. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780.

18. Cho, Van Merriënboer, Gulcehre, Bahdanau, Bougares, Schwenk and Bengio, Learning phrase representations using RNN Encoder-Decoder for statistical machine translation, arXiv preprint arXiv:1406.1078 (2014). http://arxiv.org/abs/1406.1078

19. Le Q.V., Jaitly and Hinton G.E., A simple way to initialize recurrent networks of rectified linear units, arXiv preprint arXiv:1504.00941 (2015).

20. Abadi, Barham, Chen, Davis, Dean and Kudlur, TensorFlow: A system for large-scale machine learning, in Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Savannah, Georgia, USA (2016).

21. Zhang, Zhao and LeCun, Character-level convolutional networks for text classification, in Advances in Neural Information Processing Systems (2015), pp. 649–657.

22. Van Der Merwe, Caceres, Chu Y.H. and Sreenan, Mmdump: A tool for monitoring internet multimedia traffic, ACM SIGCOMM Computer Communication Review 30(5) (2000), 48–59.

23. Moazzezi, Change-based population coding, PhD thesis, UCL (University College London) (2011).

24. Rieck, Krueger and Dewald, Cujo: Efficient detection and prevention of drive-by-download attacks, in Proceedings of the 26th Annual Computer Security Applications Conference, ACM (2010), pp. 31–39.

25. Sirageldin, Baharudin B.B. and Jung L.T., Malicious web page detection: A machine learning approach, in Advances in Computer Science and its Applications, Springer Berlin Heidelberg (2014), 217–224.

26. Ioffe and Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167 (2015).

27. Zhang and LeCun, Text understanding from scratch, arXiv preprint arXiv:1502.01710 (2015).

Table 1

Description of data set

Data	Benign	Malicious	Total
Data set 1
Training	42800	39000	81800
Testing	20000	15000	35000
Data set 2
Training	12000	6000	18000
Testing	5000	2000	7000