Evaluating deep learning approaches to characterize and classify the DGAs at scale

Abstract

In recent years, domain generation algorithms (DGAs) have become the foundational mechanism for many malware families, mainly because a DGA can generate an immense number of pseudo-random domain names with which to reach a command and control (C2) infrastructure. This paper focuses on detecting and classifying pseudo-random domain names with deep learning approaches, without relying on feature engineering or on any linguistic, contextual, semantic or statistical information. Deep learning is a more complex form of traditional machine learning that has received renewed interest after solving long-standing tasks in artificial intelligence (AI) in fields such as natural language processing, image recognition and speech processing. Deep models have an immense capability to extract optimal feature representations directly from raw input text. To transfer the performance gains seen in those areas to characterizing, detecting and classifying DGA-generated domain names by malware family, this paper trains deep learning models at the character and bigram level on one million known benign domain names from Alexa and OpenDNS together with a corpus of malicious domain names generated in real time from 17 DGA malware families, and evaluates the trained model on the OSINT data set in real time. Specifically, to understand the effectiveness of the various deep learning mechanisms, we used recurrent neural network (RNN), identity-recurrent neural network (I-RNN), long short-term memory (LSTM), convolution neural network (CNN) and convolutional neural network-long short-term memory (CNN-LSTM) architectures. To find an optimal architecture, experiments are run with various configurations of network parameters and network structures. All experiments are run for up to 1000 epochs with a learning rate in the range [0.01-0.5]. Overall, deep learning approaches, particularly the family of recurrent neural networks and a hybrid network (where the first layer is a CNN and the subsequent layer is an LSTM), showed significant performance, with highest detection rates of 0.9945 and 0.9879 respectively. The main reason is that deep learning approaches have inherent mechanisms for hierarchical feature extraction and for capturing long-range dependencies in sequence inputs.

Keywords

Domain generation algorithms (DGAs), deep learning mechanisms, recurrent neural network (RNN), identity-recurrent neural network (IRNN), long short-term memory (LSTM), convolution neural network (CNN), convolutional neural network-long short-term memory (CNN-LSTM)

1 Introduction

In recent years, the internet has become an indispensable platform on which everyone carries out daily activities through applications such as e-commerce and social media. Anyone who wants to run a successful venture has to maintain an online presence. These factors have driven the internet and its applications to progress steadily, and they will continue to evolve. To maintain the integrity of users' personal identity and information, various security infrastructures have been incorporated into the internet. At the same time, attackers use various techniques to inject malicious activity into internet infrastructure, with the aim of stealing users' personal data or mounting denial of service (DoS), distributed denial of service (DDoS) and other attacks.

In earlier days, malware was embedded with a fixed IP address or domain name through which it reached the command and control (C2C) server to store acquired information or to carry out malicious activities. Because the communication point of the C2C infrastructure was hardcoded in the malware, blocking it with a blacklist was enough to impede the malware. To avoid this, and to avoid having to develop a new malware version and build a new C2C infrastructure each time, attackers rely on domain generation algorithms (DGAs). The idea of using a DGA to communicate with a C2C server was reported by [1]. DGAs are a set of algorithmic mechanisms used by various malware families to periodically generate an immense number of pseudo-random domain names. The random domain names are generated from a seed, which is composed of numbers, letters, date/time and other information. The seed makes it possible to compute the rendezvous points between a botmaster and a bot, and each newly computed point remains a shared secret. Most of the randomly generated domain names are non-existent; the malware periodically tries a small subset of them to find and connect to the C2C infrastructure in order to acquire information or carry out other malicious tasks. Conventional methods such as blacklists are therefore ineffective at finding DGA-generated domain names. Moreover, when one domain has been blocked successfully, the attacker can register another domain name from the DGA-generated list. These factors lead attackers away from the difficult approach of hard coding domain names in malware binaries.
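As a rough illustration of how a shared seed can drive domain generation, the sketch below derives a small, repeatable list of pseudo-random names from a seed string and a date. It is a toy construction assumed purely for illustration and is not the algorithm of any malware family discussed in this paper; the function name `toy_dga`, the SHA-256 hashing step and the `.com` suffix are all hypothetical choices.

```python
import hashlib
from datetime import date


def toy_dga(seed, day, count=5, length=12):
    """Toy seed-based generator (hypothetical, for illustration only)."""
    domains = []
    for i in range(count):
        # Anyone holding the same seed and date can recompute this string.
        material = f"{seed}-{day.isoformat()}-{i}".encode("ascii")
        digest = hashlib.sha256(material).hexdigest()
        # Map each hex digit (0..15) to a lowercase letter to form the label.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:length])
        domains.append(label + ".com")
    return domains


if __name__ == "__main__":
    print(toy_dga("example-seed", date(2018, 1, 1)))
```

Because the output depends only on the seed and the date, two parties that share the seed obtain the same candidate list without exchanging it, which is the property that makes static blacklists hard to keep current.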

Botnets are considered a major threat to cyber security [2]. A botnet is a network of compromised, remotely controlled machines under the control of an adversary, typically called the botmaster. Through the botnet, malware authors can control every machine and perform a myriad of malicious activities at once instead of injecting malicious activity into each machine separately. Modern botnets use a DGA to randomly generate domain names and try to connect to each of them until at least one connection succeeds. Once connected, the IP (internet protocol) address of a compromised machine is used as a C2C server, which relays the commands received from malware authors to each machine. In addition, a botnet constantly changes its domain name or IP address to evade detection; this is typically called “domain fluxing” or “fast fluxing” respectively [3]. Recent botnets use both DGA and domain fluxing mechanisms to carry out nefarious activities on the internet. In earlier days, botnets used these mechanisms only for backup communication, but some newer botnets use a DGA as their primary communication mechanism. A DGA places the botmaster and security researchers in an asymmetric position: the botmaster needs only a single working domain name to control all of its bots, whereas a security researcher has to know all of the domains to blacklist them successfully. Today, botnets serve as a primary means for attackers to conduct malicious activity such as sending spam and mounting denial of service (DoS), distributed denial of service (DDoS) and other attacks. According to [4], the botnet infection rate is growing quickly, at an average of 8% every week, and the report details the top ten most aggressive botnets. Recently, [5] reported that some botnet families had more than a million bots. A detailed threat prediction report in [6] describes how botnets will progress and behave over time. There is thus a need for reliable mechanisms to counter botnets.

Several mechanisms exist to block the DGA C2C traffic of domain fluxing. One simple solution is to analyze the correlated network-traffic activity of all clients. Another approach is to reverse engineer the malware samples and the corresponding DGA to understand the hidden patterns of the algorithm [7]. Using the seed value, defenders register a list of the generated domains and configure their own server to pose as the C2C infrastructure, which is called sinkholing. This configuration helps to hijack the botnet; once it is sinkholed, the malware author has to reinstall bots with an updated seed value. The most commonly used approach is blacklisting [8]. A blacklist is a static list of DGA-generated domain names that a network administrator uses to block connections to a C2C server. Both sinkholing and blacklisting are time-consuming and resource-intensive. Moreover, both rely on the seed value of a campaign: once the seed value is unknown, they are ineffective.

Another important approach is to build a DGA classifier based on machine learning. A DGA classifier resides in the network and alerts the network administrator when it finds a DGA-generated domain name in DNS requests. DGA classification is a significant component of a domain reputation system (DRS). A DRS assigns a trustworthiness score of 1 to malware domains and 0 to benign domains. Additionally, it absorbs information from passive DNS (pDNS) [9, 10]. DGA classification falls into two categories. (1) Retrospective: a large set of domains is clustered and statistical properties, for example the Kullback-Leibler divergence, are computed on each cluster for classification. To build a strong feature set, these methods also use contextual information such as HTTP headers and NXDomains [11, 27]. Most published DGA classifiers are retrospective, and retrospective approaches are reactive rather than real-time. (2) Real-time detection and prevention: classification is done on each domain name individually, without using contextual properties [12]. This is often considered more difficult than the retrospective approach, and lower performance is possible. However, the authors of [12] showed that the performance of retrospective approaches is itself considerably low.

DGA analysis has been a significant area of work for security researchers, who have proposed many solutions based on traditional machine learning. Although these achieve significant performance in some cases, they are not considered effective and cannot be adopted in real-time systems, mainly because they depend entirely on feature engineering (entropy, string length, alphanumeric characters, vowel-to-consonant ratio, etc.). Such solutions may not work for characterizing and detecting a new domain name from an existing DGA or from a newly created DGA: once a new malware domain appears, the corresponding feature set must be calculated and the model retrained on the new features, which is time-consuming. Moreover, an adversary can easily evade a solution based on hand-crafted features once the feature set is known. In [9], hidden Markov model (HMM) based DGA classification without hand-crafted features is introduced, but the approach relies largely on retrospective detection and consequently performed very poorly. This paper evaluates the effectiveness of large-scale deep learning approaches, namely recurrent neural network (RNN), identity-recurrent neural network (IRNN), long short-term memory (LSTM), convolution neural network (CNN) and convolutional neural network-long short-term memory (CNN-LSTM) architectures, for DGA classification. These deep networks are composed of complex units whose inner mechanics remain a black box. An adversary may therefore be unable to reverse engineer the classifier without the same training samples.
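To make the feature-engineering baseline concrete, the following sketch computes a few hand-crafted lexical features of the kind listed in this paragraph (entropy, string length, digit count, vowel-to-consonant ratio). It is a minimal illustration under our own assumptions, not the feature set of any cited work; `lexical_features` and its dictionary keys are hypothetical names.

```python
import math
from collections import Counter


def lexical_features(label):
    """Hand-crafted features of a domain label (illustrative only)."""
    label = label.lower()
    n = len(label)
    counts = Counter(label)
    # Shannon entropy of the character distribution, in bits.
    entropy = 0.0
    for c in counts.values():
        p = c / n
        entropy -= p * math.log2(p)
    vowels = sum(1 for ch in label if ch in "aeiou")
    consonants = sum(1 for ch in label if ch.isalpha() and ch not in "aeiou")
    return {
        "length": n,
        "entropy": entropy,
        "digit_count": sum(1 for ch in label if ch.isdigit()),
        # Defined as 0.0 when there are no consonants (an assumption).
        "vowel_consonant_ratio": vowels / consonants if consonants else 0.0,
    }
```

Every such feature has to be chosen, recomputed and retrained by hand whenever a new DGA appears, which is the maintenance cost that the character-level deep models evaluated in this paper try to avoid.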

The rest of this paper is structured as follows. Section 2 discusses related work on DGA analysis. Section 3 provides the mathematical foundation of the deep learning architectures. Section 4 gives details of the DGA data set, hyper parameter tuning and the architecture for DGA analysis and classification. Section 5 provides evaluation results, and Section 6 presents future work and the conclusion.

2 Related work

Applying machine learning to DGA analysis and classification has remained a vivid area of research for the past 10 years. The main reason is that machine learning can identify a newly created domain name, or a new DGA itself, with an acceptable false positive rate. This section systematically reviews related work on detecting DGAs.

The authors of [8] examined the efficiency of blacklists, including 15 public malware blacklists and 4 private malware blacklists from anti-virus vendors. They identified the unregistered domains in the listings using DNS; for parked and sinkholed domains, however, they followed a feature-based approach. Vendor-provided blacklists performed better than public malware blacklists at listing both DGA and non-DGA malware. Overall, they claimed that blacklists are useful as a first line of protection against malware, and that they can be made more effective by supplementing them with an additional mechanism.

The authors of [25] used IP addresses, ‘whois’ information, phishing information and lexical properties of URLs as features, and reported that malicious domain names are shorter than benign domain names and use fewer vowels with unique characters. [26] used language-based mechanisms in which a score is assigned to each domain to identify DGAs. The score is estimated from a dictionary, which also helps to examine the sequences in the domain names. In [26, 27], an n-gram mechanism is proposed: the distribution of alphanumeric characters in 1-grams and 2-grams is used to detect domain fluxing. The method assumes that the distributions of alphanumeric characters in human-generated and DGA-generated names are entirely different. Two sets are used for training, one human-generated and the other DGA-generated. For each set, the 1-gram and 2-gram distributions are calculated, and the unknown domains in each batch of test data are grouped by second-level domain and by IP address. The authors also showed the efficacy of their mechanism using distance metrics such as the Kullback-Leibler (KL) distance, the Jaccard index and the edit distance. [11] proposed Pleiades, which uses the same clustering mechanism to classify domains, assuming that the queries of DGA-bot-infected machines receive Non-Existent Domain (NXDomain) responses. The statistical features of Bobax, Torpig and Conficker are used in training; in testing, the unknown domain names are clustered based on entropy, frequency of individual characters and length. Statistical features are then computed for each cluster and compared with the training data to assign the cluster to a specific DGA family. They found 12 DGAs over 15 months; surprisingly, half of them were unknown and the other half were variants of known DGAs. During classification, if a DGA is classified as known, the domain is considered to be damaged by a bot. The DNS requests for damaged domains are analyzed on each host, and a score is calculated against a fixed threshold value. The resulting approach extends both retrospective and real-time mechanisms: it uses the rated scores of its clients to label them as malicious or benign.
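The n-gram comparison described for [26, 27] can be sketched as follows: build 1-gram or 2-gram character distributions for two groups of names and compare them with the KL divergence and the Jaccard index. This is a simplified illustration under our own assumptions, not the cited authors' implementation; the function names and the smoothing constant `eps` are hypothetical.

```python
import math
from collections import Counter


def ngram_distribution(labels, n):
    """Relative frequency of character n-grams over a list of domain labels."""
    counts = Counter(
        label[i:i + n] for label in labels for i in range(len(label) - n + 1)
    )
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q); eps stands in for n-grams missing from q (an assumption)."""
    return sum(pv * math.log(pv / q.get(gram, eps)) for gram, pv in p.items())


def jaccard_index(p, q):
    """Overlap between the sets of n-grams observed in each group."""
    a, b = set(p), set(q)
    return len(a & b) / len(a | b) if (a | b) else 0.0
```

A group of unknown domains whose distribution is far from the human-generated reference and close to the DGA-generated one would then be flagged, which requires a batch of domains rather than a single name and is why such methods are retrospective.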

The aforementioned DGA detection and classification methods were studied by [12], who reported two issues. First, the methods are retrospective and consequently cannot be adopted for real-time DGA detection, because they are time-consuming and perform poorly; the system showed a detection rate of 83% and is based entirely on estimating scores for clients. Second, it uses NXDomain responses as a baseline for classification and consequently does not support multi-class classification.

[28] proposed a real-time DGA classifier based on linguistic features, obtained from the significant-characters ratio and the n-gram normality score. For both features, the mean and covariance are estimated on the Alexa top one million data set. The Mahalanobis distance is used to measure how far an unknown domain lies from this distribution: if the distance is too large, the domain is classified as DGA, otherwise it is considered benign. They also used the aforementioned clustering mechanisms to classify the discovered DGAs.
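The distance test described for [28] can be sketched for two features as below: fit a mean and covariance on benign feature vectors, then flag a domain whose Mahalanobis distance exceeds a threshold. This is only a sketch under stated assumptions; the function names, the closed-form 2x2 inverse and the threshold of 3.0 are ours and are not taken from [28].

```python
import math


def fit_gaussian_2d(points):
    """Mean and sample covariance of 2-D feature vectors from a benign set."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_2d(x, mean, cov):
    """sqrt((x - mean)^T cov^{-1} (x - mean)) using the 2x2 inverse formula."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


def is_dga(x, mean, cov, threshold=3.0):
    """Flag a domain whose features lie far from the benign distribution.

    The threshold value is an arbitrary placeholder, not a published setting.
    """
    return mahalanobis_2d(x, mean, cov) > threshold
```

Unlike the clustering methods, this decision needs only the features of a single domain, which is what allows it to run in real time.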

3 Background

This section provides an intuitive understanding of the character-level representation of domain names, followed by the mathematical foundation of the deep learning architectures.

3.1 Domain names encoding in character level

Representation of a domain name is typically called domain name encoding. It consists of two steps. In the first step, the raw domain names are preprocessed and tokenized into characters: the top-level domain is removed and all characters are converted to lower case. The second step begins with vocabulary creation, using only the training data. The vocabulary size depends on a balance between the number of training vectors in each class and the number of parameters to learn for the given task. To limit the size of the vocabulary, only characters that meet a minimum frequency are selected. After vocabulary creation, each character is assigned a unique id, and each unique id corresponds to a vector whose length is the size of the vocabulary D. Unknown characters in a domain name are trivial, so they are assigned the default key 0. The most frequently occurring characters are indexed in ascending order. The unique ids are transformed into feature vectors using a lookup table operation, which can be formulated as follows. A lookup table layer $LUT_{C}$ represents each character $c \in D$ as a $d_{cvd}$-dimensional feature vector, $LUT_{C}(c) = \langle C \rangle_{c}^{1}$ (1) where $C \in \mathbb{R}^{d_{cvd} \times |D|}$ is the weight matrix of learnable parameters, $\langle C \rangle_{c}^{1} \in \mathbb{R}^{d_{cvd}}$ is the column of $C$ for character $c$, and $d_{cvd}$ is the character-vector size, chosen through hyper parameter tuning. Applying this operation to each character of a sequence gives $LUT_{C}([c]_{1}^{s}) = (\langle C \rangle_{[c]_{1}}^{1} \; \langle C \rangle_{[c]_{2}}^{1} \; \langle C \rangle_{[c]_{3}}^{1} \; \dots \; \langle C \rangle_{[c]_{s}}^{1})$ (2) where $[c]_{1}^{s}$ denotes a sequence of $s$ characters drawn from the vocabulary D. As a next step, we set a fixed maximum length for the variable-length sequences of vocabulary ids and feed them to the character embedding layer. The character embedding layer converts characters to their character embeddings by performing the simple mathematical operation shown below.

input-shape × weights-of-character-embedding = (nb-characters, character-embedding-dimension), where input-shape = (nb-characters, vocab-size), nb-characters denotes the number of top characters, vocab-size denotes the number of unique characters, and each character is represented in one-hot encoding format; weights-of-character-embedding = (vocab-size, character-embedding-dimension), where character-embedding-dimension denotes the size of the character embedding vector. The j-th row of the embedding weight matrix is the vector for the character with id j. The dimension of the character embedding is one of the hyper parameters of deep learning algorithms. This operation maps each discrete character to a vector of continuous numbers. The character embedding captures the semantic meaning of the given domain name sequence by mapping it into a high-dimensional geometric space, called the embedding space. If the embedding has properly learnt the semantics of the domain name as real-valued vectors, similar characters appear close to each other, in the same cluster, in this space. The resulting embedding output is passed to the subsequent layers. In our case, we considered (1) RNN, (2) LSTM, (3) GRU, (4) I-RNN, (5) CNN and (6) CNN-LSTM.
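The encoding steps of this subsection can be sketched in plain Python as follows. The sketch lower-cases a name, drops the text after the last dot, maps characters to ids with 0 reserved for unknown characters, pads to a fixed length, and shows that an embedding lookup is the same as multiplying a one-hot matrix by the weight matrix. All function names are hypothetical, and padding on the right and treating only the last label as the top-level domain are assumptions.

```python
from collections import Counter


def preprocess(domain):
    """Lower-case the name and drop the top-level domain (after the last dot)."""
    domain = domain.lower()
    return domain.rsplit(".", 1)[0] if "." in domain else domain


def build_vocab(train_domains, min_freq=1):
    """Assign ids 1, 2, ... to characters by decreasing frequency; 0 = unknown."""
    counts = Counter(ch for d in train_domains for ch in preprocess(d))
    kept = [ch for ch, c in counts.most_common() if c >= min_freq]
    return {ch: idx for idx, ch in enumerate(kept, start=1)}


def encode(domain, vocab, max_len):
    """Map characters to ids, truncate, and right-pad with zeros (assumed)."""
    ids = [vocab.get(ch, 0) for ch in preprocess(domain)][:max_len]
    return ids + [0] * (max_len - len(ids))


def embed(ids, weights):
    """Lookup table: pick row j of the weight matrix for character id j."""
    return [weights[i] for i in ids]


def embed_via_one_hot(ids, weights):
    """Same result as embed(), written as one-hot(ids) x weights."""
    vocab_size, dim = len(weights), len(weights[0])
    out = []
    for i in ids:
        one_hot = [1.0 if j == i else 0.0 for j in range(vocab_size)]
        out.append(
            [sum(one_hot[j] * weights[j][k] for j in range(vocab_size))
             for k in range(dim)]
        )
    return out
```

In practice a deep learning framework performs the row lookup directly, since the one-hot product gives the same vectors at a much higher cost.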

3.2 Recurrent Neural Network (RNN)

The recurrent neural network (RNN) is an extension of feed forward networks (FFN), introduced in 1990 [19]. An RNN uses a transition function $tf$ to compute its internal hidden state vector $hd_{t}$ recursively over the given input sequence. The hidden state vector $hd_{t}$ is computed by applying the transition function to the current input $x_{t}$ and the previous hidden state vector $hd_{t-1}$: $hd_{t} = \begin{cases} 0 & t = 0 \\ tf(hd_{t-1}, x_{t}) & \text{otherwise} \end{cases}$ (3)

The transition function $tf$ is the composition of an affine transformation of $x_{t}$ and $hd_{t-1}$ with an element-wise non-linear activation function. This form of network suffers from the vanishing and exploding gradient issue: while training, a gradient vector can grow or decay exponentially over time-steps [20]. To solve this, [20] introduced long short-term memory (LSTM), which adds a special component to the hidden recurrent layer called a memory block. A memory block is a complex processing unit with one or more memory cells and a pair of gating units that control the memory cell across time-steps. Additionally, a memory cell has a constant error carousel (CEC). The CEC has a built-in fixed value of 1 and is triggered when a memory block does not receive any signal from outside. Following further research on LSTM, [21] introduced the gated recurrent unit (GRU). The GRU has fewer units than the LSTM and is computationally more efficient. Separately, [22] proposed an RNN built with ReLU and initialized with the identity matrix, termed IRNN. The identity initialization keeps the error derivatives of the hidden units constant as they are back-propagated through time, as long as no extra error derivatives are added, and this property helps to learn long-range temporal dependencies. The efficacy of the IRNN was close to that of LSTM on 4 important tasks: two toy problems, language modeling and speech recognition.
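The recurrence in Eq. (3) can be sketched in plain Python as an affine transformation of the current input and previous hidden state followed by an element-wise activation. Passing `math.tanh` gives a conventional RNN step, while an identity recurrent matrix with ReLU gives the IRNN setting at initialization. The function names and the list-based matrices are assumptions made for illustration, and no training is shown.

```python
import math


def matvec(m, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def relu(z):
    return max(0.0, z)


def identity(n):
    """n x n identity matrix, used as the IRNN recurrent initialization."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def rnn_forward(xs, w_x, w_h, b, activation):
    """Apply hd_t = activation(w_x x_t + w_h hd_{t-1} + b) with hd_0 = 0."""
    h = [0.0] * len(b)
    for x in xs:
        pre = [p + q + r for p, q, r in zip(matvec(w_x, x), matvec(w_h, h), b)]
        h = [activation(z) for z in pre]
    return h


if __name__ == "__main__":
    xs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    # IRNN-style step: identity recurrent weights, zero bias, ReLU.
    print(rnn_forward(xs, identity(2), identity(2), [0.0, 0.0], relu))
    # Conventional step with tanh; the recurrent weights here are arbitrary.
    print(rnn_forward(xs, identity(2), [[0.5, 0.0], [0.0, 0.5]], [0.0, 0.0], math.tanh))
```

With identity recurrent weights and non-negative inputs, the hidden state simply accumulates the inputs, which illustrates how the IRNN carries information unchanged across time-steps.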

3.3 Convolution Neural Network (CNN)

The convolution neural network (CNN) is a well-established method in the field of image processing. Concretely, a CNN takes 2D input for images and 1D input for time-series and text [23]. The CNN used here is composed of a convolution 1D layer, a pooling 1D layer, a fully connected layer and the non-linear activation function ReLU.

Let D1 = {c₁, c₂, ⋯ , c_l} be a domain name, where each c is a character and l is the length of the domain name; let D be the vocabulary of domain name characters and d the dimensionality of the character embedding. The character-level representation is encoded by an embedding matrix $D^{D1} \in \mathbb{R}^{d \times l}$. A convolution 1D operation applies a filter (kernel) $H \in \mathbb{R}^{d \times w}$ to a window of $w$ domain name characters to construct a new feature map $fm$. Specifically, from the window of characters $D^{D1}[*, j : j + w - 1]$, one feature is obtained through $fm^{D1}[j] = AF(\sum (D^{D1}[*, j : j + w - 1] \odot H) + b)$ (4) where $b \in \mathbb{R}$ is the bias term, AF is an activation function, usually tanh or ReLU, and $\odot$ is the element-wise multiplication of two matrices. Applying the filter to every possible window of characters in the domain name gives the feature map $fm^{D1} = [fm_{1}, fm_{2}, \cdots, fm_{l-w+1}]$, where $fm^{D1} \in \mathbb{R}^{l - w + 1}$. To obtain the most significant features, we next apply a pooling 1D operation to this feature map. Pooling 1D is simply a non-linear down-sampling. For instance, if the input is downscaled by 2, each pair of adjacent features in the feature map is reduced to $y_{j} = \max(fm_{2j-1}, fm_{2j})$ (5) where $y = [y_{1}, y_{2}, \dots, y_{\frac{l - w + 1}{2}}]$, $y \in \mathbb{R}^{\frac{l - w + 1}{2}}$. The fully connected layer has connections to all activations in the previous layer and combines them in a single neuron. The feature map produced by the CNN, i.e. fm = CNN(x_t), can be further fed to an LSTM to capture time-series patterns across time-steps.
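The convolution and pooling operations of this subsection can be sketched for a single filter in plain Python. The sequence is stored as a list of character vectors, the filter as a list of equally long vectors covering one window, and the pooling uses non-overlapping windows of size 2. The function names, the choice of ReLU as the default activation and the dropping of a trailing partial pooling window are assumptions made for illustration.

```python
def conv1d(seq, kernel, bias=0.0, activation=lambda z: max(0.0, z)):
    """Valid 1-D convolution: one feature per window of len(kernel) characters.

    seq    -- list of l character vectors, each of dimension d
    kernel -- list of w weight vectors, each of dimension d
    """
    w = len(kernel)
    features = []
    for j in range(len(seq) - w + 1):
        window = seq[j:j + w]
        # Element-wise product of the window and the filter, summed up.
        total = sum(
            x * h
            for vec, hvec in zip(window, kernel)
            for x, h in zip(vec, hvec)
        )
        features.append(activation(total + bias))
    return features


def max_pool1d(features, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(features[i:i + size])
        for i in range(0, len(features) - size + 1, size)
    ]


if __name__ == "__main__":
    seq = [[1.0], [2.0], [3.0], [4.0], [5.0]]   # l = 5, d = 1
    kernel = [[1.0], [1.0]]                      # w = 2
    fm = conv1d(seq, kernel)
    print(fm)               # l - w + 1 = 4 features
    print(max_pool1d(fm))   # half as many features after pooling
```

A full convolution layer repeats this with many filters, producing one such feature map per filter.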

4 Methodology and experiments

This section provides the details of the DGA data set, the hyper parameter tuning, the binary and multi-class classification settings and the proposed deep learning architecture. Recurrent neural network approaches such as RNN, LSTM, GRU and IRNN are parameterized functions, so experiments are run over various configurations of network parameters and structures to find their optimal values. All models are trained using backpropagation through time (BPTT) with the Adam update rule on GPU-enabled TensorFlow [13] with Keras 1, on a single NVidia GK110BGL Tesla K40.

4.1 Description of Alexa, OpenDNS, DGA malware family and OSINT dataset

Many recent malware families rely largely on domain generation algorithms (DGAs) to build effective rallying mechanisms. This is primarily because a DGA lets a malware family pseudo-randomly generate from thousands to millions of domain names, a subset of which is used to connect to a C2C server periodically to acquire information or launch other malicious attacks. This process is illustrated in Fig. 3(b): a bot generates two domain names, abc.com and def.com, and sends a DNS request for each; the DNS server replies with NXDomain (not registered) for abc.com and with an IP address for def.com. The bot then uses the IP address of the registered domain name to establish a connection to the C2C server. In this paper, we formed a list of legitimate domain names by assembling Alexa [14] and OpenDNS [24], and malicious domain names are generated using publicly accessible algorithms [15] and OSINT DGA feeds [16]. We randomly split the data set into two parts: (1) training with 86008 domain names and (2) testing with 39004 domain names. The detailed statistics of benign and DGA-generated domain names are displayed in Table 1.

Table 1
Detailed statistics of benign and DGA generated data sets

Domain name category	Number of domain names (train)	Number of domain names (test)
Alexa	7000	3000
OpenDNS	3000	2000
banjori	13666	6833
corebot	80	40
cryptolocker	6334	2667
dircrypt	1138	569
kraken	2192	1096
locky	766	383
pykspa	4336	2168
qakbot	4396	2198
ramdo	5000	2000
ramnit	2956	1478
simda	3244	1122
zeus	6578	2789
tinba	3000	1000
rovnix	3000	1000
conficker	6200	2600
pushdo	10122	5061
goz	3000	1000
Total	86008	39004

The proposed approach is derived from character-level text classification. Fig. 1(a) shows the unigram alphanumeric distribution of legitimate and DGA-generated domain names. It suggests that legitimate domain names follow a distinct pattern in comparison to DGA-generated ones, whose characters have either a high or a low probability. In both legitimate and DGA-generated domain names the character ‘E’ has the maximum frequency, after which the unigram probability distribution rises and falls irregularly. Moreover, the probability of digits in DGA-generated domain names is much lower than in legitimate domain names. Fig. 1(b) shows the unigram alphanumeric distribution of each class of DGA-generated domain names.

Fig.1

(a),(b) Probability distribution of alphanumeric characters of benign and DGA-generated domain names, (c) Accuracy of deep learning models over epochs in the range [0–1000].

4.2 Hyper parameter tuning

A recurrent structure, specifically an LSTM, is a parameterized function. Good performance in distinguishing a domain name as either non-DGA or DGA therefore relies implicitly on optimal parameters. The parameters that need to be tuned for an LSTM are the number of hidden layers, batch_size, character-vector or vocabulary size, number of hidden units, dropout, optimizer and learning rate. Because the experiments use character-level inputs of domain names, we did not search for an optimal vocabulary size; it is set to 39 (the number of unique characters). The dropout is set to 0.1 in recurrent structures and 0.2 in the CNN. The learning rate and batch_size are set to 0.1 and 32 respectively in both the CNN and the recurrent structures. Most recurrent structures reached state-of-the-art performance with one hidden layer of 128 units. In our second experimental setting, we apply recurrent structures to bigram-level inputs instead of character-level inputs. The results with bigram-level inputs are considerably poorer than those with character-level inputs.

For the CNN, the number of filters is chosen by running experiments with 4, 16, 32 and 64 filters. Of these, 64 filters gave the best performance, although the accuracy with 32 filters is comparable.

4.3 Deep learning architecture

Fig. 2(a) gives an intuitive overview of the deep learning mechanisms for classifying a domain name as either benign or DGA-generated and assigning it to the corresponding malware family. The architecture is divided into 3 notional sections: (1) character encoding, (2) feature extraction via deep layers and (3) classification.

Fig.2

(a) An intuitive overview of the proposed deep learning architecture, (b) ROC curve, (c) Embedded character vectors learned by the LSTM model, represented using a 2-dimensional linear projection (PCA) with t-SNE.

4.3.1 Character Encoding

Initially, preprocessing is applied to the raw domain names. Upper-case characters are converted to lower case, primarily because distinguishing upper and lower case characters might lead to a regularization issue. Additionally, the top-level domain name is discarded. In the initial step of vocabulary creation, each character is assigned a unique id, and each unique id corresponds to a vector whose length is the size of the vocabulary, D. Here, the vocabulary size is D = 39. Unknown characters in a domain name are trivial, so they are assigned the default key 0. We formed 2 dictionaries, (1) mapping character ids to characters and (2) mapping characters to character ids, and determined the domain name with the largest number of characters. The largest domain name length was 37. To make all domain name sequences the same length, domain names shorter than 37 are padded with zeros. As a result, we get a matrix of size 56008 × 37 for training and 30000 × 37 for validation. These matrices are passed to an embedding layer with a batch_size of 32, that is, as 32 × 37 matrices batch by batch. The embedding layer constructs a matrix of size 39 × 128, in which each row is a character-embedding vector; each character id is replaced by its character vector of size 128. The embedding layer is optimized jointly with the other layers in the deep network during backpropagation. As a result, similar characters come close together. This character clustering helps the other layers to identify the semantic and contextual similarity structures in domain names. To visualize the learned 128-dimensional vector representation, we used a two-dimensional linear projection through PCA with t-SNE [17], as shown in Fig. 2(c). Fig. 2(c) shows that similar characters are clustered together, i.e. letters, numbers, underscore, hyphen and period appear in separate clusters. The appearance of underscore, hyphen and period in a separate cluster is highly important: it reflects that the embedding representation has captured the semantic and contextual similarity of domain names.

4.3.2 Features extraction via deep layers

We adopt various deep layers for feature extraction: the recurrent layers RNN, LSTM, GRU and IRNN, the CNN, and the hybrid CNN-LSTM architecture. For each deep layer, several experiments are run to evaluate its performance. We did not rely on batch normalization between deep layers because the architectures are not very deep. For all deep learning architectures, we set the learning rate to 0.1 and batch_size to 32. Thus, the deep learning model updates its parameters via backpropagation after every 32 training samples.

4.3.3 Recurrent layers

For recurrent structures, we adopted the following layers: RNN, LSTM, GRU and IRNN. Based on past experiments in hyper-parameter tuning, the number of units (memory cells in LSTM and GRU) is set to 128. An RNN unit uses the hyperbolic tangent, with range [-1, 1], as its input and output activation function. An LSTM memory cell uses the hyperbolic tangent, with range [-1, 1], as its input and output activation function, and the sigmoid, with range [0, 1], for gates and other neurons. The recurrent layer captures the dependencies in the matrix of shape (39 × 128) received from the embedding layer and passes its last output, a vector of size 128, to a dropout layer with rate 0.1.
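To make the phrase "passes its last output" concrete, the following is a minimal sketch of a plain tanh recurrent layer that reads a sequence of character vectors and returns only the final hidden state. It is not the LSTM, GRU or IRNN cell used in the experiments; the toy sizes (8-dimensional inputs, 4 units) and the random weights are assumptions chosen to keep the example small, whereas the paper uses 128 units.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_last_output(inputs: List[Vector], w_in: Matrix, w_rec: Matrix, bias: Vector) -> Vector:
    """Apply h_t = tanh(W_in x_t + W_rec h_{t-1} + b) over the sequence; return the last h_t."""
    h = [0.0] * len(bias)
    for x in inputs:
        a = matvec(w_in, x)
        r = matvec(w_rec, h)
        h = [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, bias)]
    return h


def random_matrix(rows: int, cols: int, rng: random.Random, scale: float = 0.1) -> Matrix:
    """Small random weights, for illustration only."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumed): 37 time steps, 8-dimensional character vectors, 4 units.
rng = random.Random(0)
SEQ_LEN, EMB_DIM, UNITS = 37, 8, 4
sequence = random_matrix(SEQ_LEN, EMB_DIM, rng, scale=1.0)
last = rnn_last_output(
    sequence,
    random_matrix(UNITS, EMB_DIM, rng),
    random_matrix(UNITS, UNITS, rng),
    [0.0] * UNITS,
)
```

Because the hidden state is produced by tanh, every element of `last` lies in [-1, 1], and its length equals the number of units.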

4.3.4 Convolution layers

A domain name is a one-dimensional sequence, hence we adopt convolution1D, in which the filter moves in only one direction. The convolution1D mechanism has two steps: (1) convolution1D and (2) pooling1D. In the first step, a convolution with 64 filters of length 3 (that is, each filter is applied to 3 characters at a time) is applied. Each character has a vector of 128 elements. To characterize sequences of 3 characters, a scalar product is computed between the character vectors and the filters of shape 3 × 64. The convolution1D layer then passes its output of shape 37 × 64 to the pooling1D layer (we chose maxpooling1D). The maxpooling1D layer has a stride of 2, which halves the length of the convolution feature map. The pooling1D output has shape 18 × 64 and is flattened to a vector. The flattened vector has 1152 elements, in which the first 64 elements correspond to the first character, the second 64 elements to the second character, and so on. This vector is passed to a dense layer that compresses it to 128 elements. The dense layer is followed by dropout with rate 0.1; as a result, the deep model captures the latent features. Additionally, the weight matrix of the fully connected last layer has a direct connection to the positions of individual characters, so it learns the positions of characters in a domain name. In the hybrid network, the pooling1D output of shape 18 × 64 is fed to an LSTM layer, which compresses the output to 70 elements and passes it to a dense layer.
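The shapes quoted above can be checked with a short calculation. This sketch assumes "same" padding for the convolution (which is what an output length of 37 for a filter of length 3 implies) and a pooling window of 2 (the text only states the stride); the function names are hypothetical.

```python
def conv1d_output_length(length: int, kernel: int, stride: int = 1, padding: str = "same") -> int:
    """Output length of a 1-D convolution along the sequence axis."""
    if padding == "same":
        return -(-length // stride)  # ceil(length / stride)
    if padding == "valid":
        return (length - kernel) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")


def maxpool1d_output_length(length: int, pool: int = 2, stride: int = 2) -> int:
    """Output length of 1-D max pooling without padding."""
    return (length - pool) // stride + 1


FILTERS = 64
conv_len = conv1d_output_length(37, kernel=3)    # sequence length after convolution
pool_len = maxpool1d_output_length(conv_len)     # sequence length after max pooling
flat_size = pool_len * FILTERS                   # size of the flattened vector
```

Under these assumptions the sequence length stays at 37 after the convolution, drops to 18 after pooling, and the flattened vector has 18 × 64 = 1152 elements, in agreement with the text. With "valid" padding the convolution output would have length 35 instead, which would not match the reported shapes.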

4.3.5 Regularization

To avoid overfitting, a dropout layer with rate 0.1 is added as regularization after the deep layers RNN, LSTM, GRU and IRNN. In the CNN, the dense layer is interleaved with a dropout layer with rate 0.2 and an activation layer with the ReLU non-linear activation function. Dropout is a mechanism that randomly discards neurons, along with their connections, while training a deep learning model.
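As an illustration of the mechanism, the sketch below applies dropout to a list of activations. It uses the "inverted" variant, in which the surviving activations are rescaled by 1 / (1 - rate) during training; that rescaling is a common convention and an assumption here, since the text does not say which variant was used.

```python
import random
from typing import List, Optional


def dropout(values: List[float], rate: float,
            rng: Optional[random.Random] = None, training: bool = True) -> List[float]:
    """Zero each activation with probability `rate`; rescale the survivors (inverted dropout)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)  # dropout is inactive outside training
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

With `rate=0.1`, roughly one activation in ten is discarded on each training step, and the layer passes its input through unchanged at test time.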

4.4 Classification

After feature extraction, a dense layer is added after the dropout layer to classify a domain name as either benign or DGA generated in the binary classification configuration, and to assign a DGA generated domain name to its corresponding malware family in the multi-class classification configuration. In the binary classification configuration, the dense layer has 1 unit and is followed by a sigmoid activation layer, with binary cross-entropy as the loss function. In the multi-class classification configuration, the dense layer has 18 units and is followed by a softmax activation layer, with categorical cross-entropy as the loss function. The dense layer is also termed a fully connected layer. It aggregates the features received from the previous layer (the dropout layer in the recurrent structures and the CNN, and the LSTM layer in the hybrid network) and condenses them by retaining the most important ones. As a result, it forms a hierarchical feature representation for the final classification phase. The binary cross-entropy loss is estimated using the following formula,

$loss(p, e) = -\frac{1}{N} \sum_{j=1}^{N} [e_{j} \log p_{j} + (1 - e_{j}) \log (1 - p_{j})]$ (6)

Here e is the vector of expected class labels and p is the vector of predicted probabilities for all N domain names. To minimize the loss, we used the Adam optimizer with backpropagation. The categorical cross-entropy loss is estimated using the following formula,

$loss(p, q) = -\sum_{x} p(x) \log (q(x))$ (7)

Here p is the true probability distribution and q is the predicted probability distribution from the softmax layer.
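Equations (6) and (7), together with the sigmoid and softmax activations that produce their inputs, can be written out as a minimal sketch. The clipping constant `EPS`, which guards against taking the logarithm of zero, is an assumption and not part of the formulas.

```python
import math
from typing import List, Sequence

EPS = 1e-12  # assumed clipping constant to avoid log(0)


def sigmoid(z: float) -> float:
    """Output activation for the binary configuration."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores: Sequence[float]) -> List[float]:
    """Output activation for the multi-class configuration."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(p: Sequence[float], e: Sequence[float]) -> float:
    """Eq. (6): mean over N samples of the negative log-likelihood."""
    total = 0.0
    for pj, ej in zip(p, e):
        pj = min(max(pj, EPS), 1.0 - EPS)
        total += ej * math.log(pj) + (1.0 - ej) * math.log(1.0 - pj)
    return -total / len(p)


def categorical_cross_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    """Eq. (7): p is the true distribution, q the predicted one."""
    return -sum(px * math.log(max(qx, EPS)) for px, qx in zip(p, q))
```

For a one-hot true distribution, Eq. (7) reduces to the negative logarithm of the probability that the softmax layer assigns to the correct class.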

5 Evaluation results

The performance of the trained models is evaluated epoch-wise on the testing samples, as displayed in Fig. 1(c). LSTM and CNN-LSTM showed good performance up to 500 epochs, after which their performance started to decrease due to overfitting. The performance of IRNN started to decrease after 300 epochs. CNN and GRU started to overfit after 100 epochs, and RNN started to overfit once it reached 50 epochs. From this, we can say that 500 epochs are sufficient to capture the dependencies of domain names at the character level. As a baseline comparison, we apply a logistic regression model to character-level bigrams of the domain name, and other classical machine learning classifiers, namely random forest (RF), decision tree (DT), maximum entropy modeling (MT), AdaBoost (AB) and naive Bayes (NB), to the hand-crafted features. Table 2 reports the results for classifying a domain name as either benign or DGA generated in terms of accuracy, precision, recall and F1-score. Table 3 reports the results of the deep learning approaches at the character level, logistic regression (LR) on character bigrams of the domain name, and a random forest classifier on hand-crafted features for classifying a domain name as benign or as belonging to a DGA family. According to Table 2 and Table 3, LSTM performed well in both the binary and multi-class classification settings in comparison to RNN, IRNN and the other adopted mechanisms. Moreover, the results obtained with GRU are comparable to those of LSTM. For the binary classification setting, the performance of the employed mechanisms is displayed as receiver operating characteristic (ROC) curves in Fig. 2(b). LSTM and CNN-LSTM both showed good performance (AUC of 1.000), with consistent TPR and FPR.
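The bigram representation used by the logistic regression baseline can be sketched as follows. The text does not describe how the bigrams are built, so overlapping character pairs counted per domain name are an assumption, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def char_bigrams(name: str) -> List[str]:
    """All overlapping pairs of adjacent characters in a domain name."""
    return [name[i:i + 2] for i in range(len(name) - 1)]


def bigram_counts(name: str) -> Counter:
    """Bag-of-bigrams feature: how often each character pair occurs."""
    return Counter(char_bigrams(name))
```

Each domain name is then described by its bigram counts, and a vector of such counts over a fixed bigram vocabulary is the kind of input a logistic regression classifier would take.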

Table 2
Summary of test results for binary classification

Algorithm	Accuracy	Precision	Recall	F-score	Loss
LSTM	0.9997	0.99841	0.9997	0.99	0.00
RNN	0.9826	0.8924	0.9696	0.9294	0.05
I-RNN	0.9235	0.4834	0.8580	0.6184	0.18
GRU	0.9983	0.9912	0.9958	0.9935	0.01
CNN	0.9981	0.9232	0.9836	0.9524	0.03
CNN-LSTM	0.9990	0.9946	0.9980	0.9963	0.00
bigram-LR	0.958	0.966	0.987	0.976	0.12
Hand-crafted features
RF	0.953	0.958	0.990	0.973	–
DT	0.926	0.859	0.992	0.921	–
MT	0.747	0.945	0.667	0.782	–
AB	0.951	0.901	0.996	0.946	–
NB	0.817	0.990	0.781	0.873	–
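The metrics reported in Table 2 are related in the standard way, which the following sketch makes explicit. The confusion-matrix helper functions are generic definitions, not the authors' evaluation code; the precision and recall values in the example are taken from the RNN and GRU rows of Table 2.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are true positives."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that are detected (the TPR)."""
    return tp / (tp + fn) if tp + fn else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Precision and recall copied from Table 2.
rnn_f1 = f1_score(0.8924, 0.9696)
gru_f1 = f1_score(0.9912, 0.9958)
```

Rounded to four decimals, `rnn_f1` and `gru_f1` agree with the F-scores listed for RNN (0.9294) and GRU (0.9935) in Table 2.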

Table 3

Summary of test results for multi-class classification

Classes	LSTM		RNN		GRU		CNN		CNN-LSTM		bigram-LR		Features-RF
	TPR	FPR	TPR	FPR	TPR	FPR	TPR	FPR	TPR	FPR	TPR	FPR	TPR	FPR
Alexa & Open DNS	0.997	0.001	0.815	0.009	0.973	0.003	0.758	0.01	0.978	0.001	0.859	0.013	0.864	0.014
banjori	1.0	0.0	0.999	0.0003	1.0	0.0	1.0	0.002	0.999	0.0	1.0	0.0	1.0	0.0
corebot	1.0	0.0	0.95	0.0	0.975	0.0	0.775	0.0001	0.975	0.0	1.0	0.0	1.0	0.0
cryptolocker	0.994	0.001	0.825	0.01	0.974	0.002	0.740	0.029	0.987	0.001	0.825	0.024	0.826	0.023
dircrypt	0.979	0.0002	0.531	0.006	0.954	0.001	0.22	0.001	0.967	0.0003	0.422	0.003	0.43	0.004
kraken	0.994	0.0003	0.845	0.009	0.978	0.0004	0.349	0.008	0.984	0.001	0.781	0.011	0.786	0.012
locky	0.971	0.0001	0.219	0.001	0.963	0.0005	0.0	0.0	0.982	0.0004	0.243	0.003	0.243	0.003
pykspa	0.988	0.0002	0.845	0.007	0.979	0.001	0.762	0.017	0.982	0.0008	0.810	0.010	0.807	0.01
qakbot	0.990	0.001	0.822	0.019	0.975	0.003	0.645	0.019	0.977	0.001	0.694	0.009	0.695	0.009
ramdo	0.999	0.0	0.999	0.0002	1.0	0.0	0.996	0.001	1.0	0.0	1.0	0.0	1.0	0.0
ramnit	0.995	0.0001	0.942	0.005	0.972	0.001	0.941	0.011	0.984	0.0002	0.907	0.005	0.904	0.005
simda	0.988	0.001	0.174	0.006	0.912	0.003	0.004	0.0002	0.972	0.001	0.538	0.01	0.585	0.012
zeus	0.995	0.988	0.967	0.01	0.985	0.001	0.964	0.042	0.995	0.001	0.922	0.011	0.92	0.011
tinba	0.997	0.0001	0.92	0.002	0.985	0.0002	0.867	0.01	0.984	0.0	1.0	0.0	1.0	0.0
rovnix	0.981	0.0006	0.61	0.014	0.935	0.004	0.473	0.014	0.958	0.0022	0.57	0.014	0.557	0.013
conflicker	0.999	0.0002	0.967	0.006	0.994	0.0004	0.971	0.007	0.997	0.0006	0.978	0.004	0.98	0.004
pushdo	0.992	0.0005	0.925	0.027	0.974	0.002	0.938	0.031	0.991	0.0011	0.923	0.015	0.91	0.014
goz	1.0	0.0	0.974	0.002	0.997	0.0002	0.825	0.003	0.999	0.0004	1.0	0.0	1.0	0.0
Loss	0.02		0.35		0.06		0.53		0.04		0.37		0.37
Accuracy	0.9945		0.8736		0.980		0.8133		0.9879		0.8753		0.8751

The inner mechanics of deep learning architectures remain a black box for both novice and advanced users in a real environment. The deep layers hold a large amount of information, which can be exposed by unwrapping them. Deep networks such as RNN, LSTM, GRU and IRNN are considerably complex, so unwrapping them yields a lot of information. Generally, the transformed character representation from the embedding layer is passed through several deep layers to capture semantic similarity. The non-linear activation in each deep layer helps to classify the domain name as benign or as DGA generated with its corresponding malware family. Based on the learned feature representation, the last layer of a deep architecture should maximally separate the benign class and the DGA malware families. To examine this in our experimental setting, we randomly selected 25 samples from the benign and DGA generated malware families. These samples are fed to the trained LSTM model. We replaced the last layer, i.e. softmax, with t-SNE [17]. t-SNE transforms the high-dimensional vectors into a two-dimensional representation, and the resulting 2-D vectors are shown in Fig. 3(a). A few testing samples of locky, benign, dircrypt and qakbot are misclassified. This shows that the features learned by the LSTM model have similar characteristics among these classes. The samples of the other malware families are clustered together correctly.

Fig. 3

(a) Visualization of activation values in t-SNE, (b) Rallying mechanism of DGA based botnet.

6 Conclusion and Future work

This paper evaluates the effectiveness of various deep learning approaches for detecting domain names generated by domain generation algorithms (DGAs) and classifying them to a specific malware family. Domain names are distinguished as either malicious or benign by training at the character level, with the necessary features extracted automatically. This avoids manual hand-crafted feature engineering and thereby makes the approach robust to drift in domain names and to adversarial machine learning settings. The deep network, with an embedding as its first layer, extracts features implicitly; these are passed to the other deep layers and then to a dense layer for classification in a supervised learning setting. The family of recurrent neural networks (RNN) and its hybrid network (formed using CNN) performed well in comparison to the methods based on hand-crafted features and bigrams in both the binary and multi-class classification settings. For multi-class classification, we specifically modified the LSTM network used in binary classification, with the aim of learning background details of DGA generated malware, including its origin and main purpose. Overall, deep learning approaches are considered more effective than classical machine learning methods, because the classical methods rely on a feature set, and a machine learning model based on that feature set becomes ineffective under adversarial machine learning. This work serves as a baseline system for understanding the effectiveness of various deep learning approaches for the analysis and classification of DGA generated domain names.

In recent days, malware families have relied heavily on DGAs because of the random nature in which domain names are generated from a given seed. As a result, DGAs help them to hide from detection methods. Security researchers use various techniques to detect and classify DGAs, such as blacklisting, machine learning based solutions, passive domain name system (DNS) techniques, and analysis of the traffic of malicious domains and DNS. This paper focuses on machine learning based solutions for DGA analysis; to this end, we use a complex form of machine learning typically called deep learning. The effectiveness of various mechanisms is studied for DGA detection and for classifying a domain name to its corresponding DGA family. However, this study used only 17 malware families. Studying the effectiveness of deep learning mechanisms with a larger number of malware families is therefore a significant direction for future work. Additionally, this paper has not shown the inner mechanics of deep learning networks, which may be important for real-time deployment. Analyzing the inner mechanics of deep learning algorithms [18] with respect to the detection rate of malicious domain names, and providing appropriate mathematical representation and visualization, is another direction for future work.

In this study, we examined the capability of various deep networks using simple architectures and very little parameter tuning for DGA analysis. Following an appropriate hyper-parameter tuning approach may further enhance the reported results. When we attempted to train more complex networks, they incurred high computational costs for larger domain names on our current hardware. A larger study of the effectiveness of complex architectures for DGA analysis is therefore required. This can be done by using advanced hardware and a distributed environment for training complex networks.

Acknowledgments

This research was supported in part by Paramount Computer Systems. We are also grateful to NVIDIA India for GPU hardware support under a research grant.

References

1. Geffner, End-to-end analysis of a domain generating algorithm malware family, Black Hat USA, 2013.

2. Feily and Shahrestani, A survey of botnet and botnet detection, in Emerging Security Information, Systems and Technologies, 2009.

3. Knysz, Hu and Shin K.G., Good guys vs. bot guise: Mimicry attacks against fast-flux detection systems, in INFOCOM, 2011 Proceedings IEEE, IEEE, 2011, pp. 1844–1852.

4. Damballa Inc, Top 10 botnet threats, http://www.damballa.com, 2010.

5. Rossow, Andriesse, Werner, Stone-Gross, Plohmann, Dietrich C.J. and Bos, SoK: P2PWNED – Modeling and Evaluating the Resilience of Peer-to-Peer Botnets, in Proceedings of the 34th IEEE Symposium on Security and Privacy (S&P), San Francisco, CA, 2013.

6. McAfee Labs 2014 Threats Predictions, https://blogs.mcafee.com/, 2014.

7. Stone-Gross, Cova, Gilbert, Kemmerer, Kruegel and Vigna, Analysis of a botnet takeover, Security Privacy, IEEE 9 (2011), 64–72.

8. Kuhrer, Rossow and Holz, Paint it black: Evaluating the effectiveness of malware blacklists, in Research in Attacks, Intrusions and Defenses, Springer, 2014, pp. 1–21.

9. Antonakakis, Perdisci, Dagon, Lee and Feamster, Building a dynamic reputation system for DNS, in USENIX Security Symposium, 2010, pp. 273–290.

10. Bilge, Sen, Balzarotti, Kirda and Kruegel, Exposure: A passive DNS analysis service to detect and report malicious domains, ACM Transactions on Information and System Security (TISSEC) 16(4) (2014), 14.

11. Antonakakis, Perdisci, Nadji, Vasiloglou, Abu-Nimeh, Lee and Dagon, From throw-away traffic to bots: Detecting the rise of DGA-based malware, in 21st USENIX Security Symposium (USENIX Security 12), 2012, pp. 491–506.

12. Krishnan, Taylor, Monrose and McHugh, Crossing the threshold: Detecting network malfeasance via sequential hypothesis testing, in 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), IEEE, 2013, pp. 1–12.

13. Abadi, Barham, Chen, Davis, Dean and Kudlur, TensorFlow: A system for large-scale machine learning, in Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Savannah, Georgia, USA, 2016.

14. Does Alexa have a list of its top-ranked websites? https://support.alexa.com/. Accessed: 2017-04-02.

15. https://github.com/baderj/domain-generation-algorithms. Accessed: 2017-03-02.

16. Bambenek consulting – master feeds, http://osint.bambenekconsulting.com/feeds/. Accessed: 2016-04-06.

17. Van Der Merwe, Caceres, Chu Y.H. and Sreenan, Mmdump: A tool for monitoring Internet multimedia traffic, ACM SIGCOMM Computer Communication Review 30(5) (2000), 48–59.

18. Moazzezi, Change-based population coding, PhD thesis, UCL (University College London), 2011.

19. Elman J.L., Finding structure in time, Cognitive Science 14(2) (1990), 179–211.

20. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780.

21. Cho, Van Merriënboer, Gulcehre, Bahdanau, Bougares, Schwenk and Bengio, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, arXiv preprint arXiv:1406.1078, 2014. http://arxiv.org/abs/1406.1078.

22. Le Q.V., Jaitly and Hinton G.E., A simple way to initialize recurrent networks of rectified linear units, arXiv preprint arXiv:1504.00941, 2015.

23. Kim, Convolutional neural networks for sentence classification, arXiv preprint arXiv:1408.5882, 2014.

24. OpenDNS Domain list, https://umbrella.cisco.com/blog. Accessed: 2017-03-02.

25. Mcgrath D.K. and Gupta, Behind Phishing: An Examination of Phisher Modi Operandi, in LEET, 2008.

26. Yadav, Reddy A.K.K., Reddy A.L. and Ranjan, Detecting algorithmically generated malicious domain names, in Proceedings of the 10th annual Conference on Internet Measurement, New York, 2010.

27. Yadav, Reddy A.K.K., Reddy A.L. and Ranjan, Detecting algorithmically generated domain-flux attacks with DNS traffic analysis, IEEE/ACM Transactions on Networking 20(5), 1663–1677.

28. Schiavoni, Maggi, Cavallaro and Zanero, Phoenix: DGA-based botnet tracking and intelligence, in Detection of Intrusions and Malware, and Vulnerability Assessment, Springer, 2014, pp. 192–211.

Domain name category	Number of domain names
	Train	Test
Alexa	7000	3000
OpenDNS	3000	2000
banjori	13666	6833
corebot	80	40
cryptolocker	6334	2667
dircrypt	1138	569
kraken	2192	1096
locky	766	383
pykspa	4336	2168
qakbot	4396	2198
ramdo	5000	2000
ramnit	2956	1478
simda	3244	1122
zeus	6578	2789
tinba	3000	1000
rovnix	3000	1000
conflicker	6200	2600
pushdo	10122	5061
goz	3000	1000
Total	86008	39004