Abstract
In recent years, domain generation algorithms (DGAs) have become a foundational mechanism for many malware families, mainly because a DGA can generate an immense number of pseudo-random domain names to associate with a command and control (C2) infrastructure. This paper focuses on detecting and classifying pseudo-random domain names by adopting deep learning approaches, without relying on feature engineering or on any linguistic, contextual, semantic or statistical information. Deep learning is a more complex form of traditional machine learning that has received renewed interest by solving long-standing tasks in artificial intelligence (AI), in fields such as natural language processing, image recognition, speech processing and many others. Deep learning models have an immense capability to extract optimal feature representations directly from raw input text. To leverage this capability and to transfer the performance gains achieved in the aforementioned areas to characterizing, detecting and classifying DGA generated domain names by malware family, this paper trains deep learning models at the character and bigram level on one million known benign domain names from Alexa and OpenDNS and on a corpus of malicious domain names generated in real time from 17 DGA malware families. The trained models are evaluated on the OSINT data set in real time. Specifically, to understand the effectiveness of various deep learning mechanisms, we used recurrent neural network (RNN), identity-recurrent neural network (I-RNN), long short-term memory (LSTM), convolutional neural network (CNN), and convolutional neural network-long short-term memory (CNN-LSTM) architectures. Additionally, to find an optimal architecture, experiments are run with various configurations of network parameters and network structures. All experiments are run for up to 1000 epochs with a learning rate in the range [0.01-0.5].
Overall, deep learning approaches, particularly the family of recurrent neural networks and a hybrid network (where the first layer is a CNN and the subsequent layer is an LSTM), showed significant performance, with highest detection rates of 0.9945 and 0.9879 respectively. The main reason is that deep learning approaches have inherent mechanisms for hierarchical feature extraction and for capturing long-range dependencies in sequence inputs.
Keywords
Introduction
In recent years, the internet has become an indispensable platform for everyone to carry out daily activities through various applications such as e-commerce, social media and other apps. Nearly everyone has to maintain an online presence to run a successful venture. These factors have driven the internet and its applications to progress steadily, and they will continue to evolve in the future. To maintain the integrity of users' personal identity and information, various security infrastructures have been incorporated into the internet. At the same time, attackers use various techniques to inject malicious activities into the internet infrastructure with the aim of performing attacks such as stealing users' personal data, denial of service (DoS), distributed denial of service (DDoS) and other attacks.
In earlier days, malware was embedded with a fixed IP address or domain name to reach the command and control (C2C) server in order to store acquired information or carry out malicious activities. The communication point of the C2C infrastructure was hardcoded in the malware, so once that address was blocked using blacklisting, the malware could no longer operate. To avoid this, and to avoid developing a new version of the malware and creating a new C2C infrastructure each time, attackers rely on domain generation algorithms (DGAs). The idea of using a DGA to communicate with a C2C server was reported by [1]. DGAs are a set of algorithmic mechanisms that various malware families use to periodically generate an immense number of pseudo-random domain names. The random domain names are generated from a seed. A seed is composed of numbers, letters, date/time and other information. The seed determines the rendezvous points between a botmaster and a bot, and each newly computed point remains a shared secret. Most of the randomly generated domain names are non-existent; the bot tries a small subset of them periodically to find and connect to the C2C infrastructure in order to acquire information or carry out other malicious tasks. Thus conventional methods such as blacklists are found to be ineffective against DGA generated domain names. Moreover, when a domain has been blocked successfully, the attacker can register another domain name from the DGA generated list. These factors lead attackers away from the more fragile method of hard coding domain names in malware binaries.
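The seed-based rendezvous idea described above can be illustrated with a minimal sketch. This is a toy scheme, not the algorithm of any real malware family: the function name `toy_dga`, the SHA-256 construction, the seed string and the `.com` suffix are all assumptions made for illustration.

```python
import hashlib
from datetime import date
from typing import List


def toy_dga(seed: str, day: date, count: int = 5, length: int = 12) -> List[str]:
    """Generate `count` pseudo-random domain names from a seed and a date.

    Illustrative only: real DGA families use their own schemes.
    """
    domains = []
    for i in range(count):
        # Anyone who knows the seed and the date can recompute this digest.
        material = f"{seed}-{day.isoformat()}-{i}".encode("ascii")
        digest = hashlib.sha256(material).hexdigest()
        # Map each hex digit (0-15) to a letter a-p to form the label.
        label = "".join(chr(ord("a") + int(h, 16)) for h in digest[:length])
        domains.append(label + ".com")
    return domains


# Bot and botmaster share the seed, so they derive the same candidate list.
bot_view = toy_dga("shared-secret", date(2024, 1, 1))
botmaster_view = toy_dga("shared-secret", date(2024, 1, 1))
# A different date yields a different list, which defeats static blacklists.
other_day = toy_dga("shared-secret", date(2024, 1, 2))
```

In this sketch the botmaster would only need to register one of the names in `botmaster_view`, whereas a defender would have to block all of them, every day.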
A botnet is considered a major threat to cyber security [2]. A botnet is a network of compromised, remotely controlled machines under the control of an adversary, typically called the botmaster. Through the botmaster and the network, malware authors can control every machine in a botnet and perform a myriad of malicious activities on it, instead of injecting malicious activities into each machine separately. A modern botnet uses a DGA to randomly generate domain names and tries to connect to each of them until at least one connection succeeds. Once connected, the IP (internet protocol) address of a compromised machine is used as a C2C server. A C2C server relays the commands received from malware authors to each machine for running malicious activities. In addition, a botnet constantly changes its domain name or IP address to evade detection. These processes are typically called "domain fluxing" and "fast fluxing" respectively [3]. Recent botnets use both DGA and domain fluxing mechanisms to carry out nefarious activities on the internet. In earlier days, botnets used DGAs only for backup communication; some newer botnets use a DGA as their primary communication mechanism too. The DGA places the botmaster and security researchers in an asymmetric situation: the botmaster requires access to only a single domain name to control all its bots, whereas a security researcher has to know all the domains to blacklist them successfully. In recent days, botnets have served as a primary means for attackers to conduct various malicious attacks such as sending spam, denial of service (DoS), distributed denial of service (DDoS) and other attacks. The authors of [4] reported that the botnet infection rate is expanding quickly, with an average growth of 8% every week, and gave detailed information on the top ten aggressive botnets. Recently, [5] reported that some botnet families had more than a million bots.
A detailed threat prediction report published in [6] includes information on how botnets will progress and behave over time. Thus there is a need to counter botnets using a reliable mechanism.
Several mechanisms exist to block the DGA C2C traffic of domain fluxing. One simple solution is to analyze the correlated network traffic of all clients. Another approach is to reverse engineer the malware samples and their corresponding DGA to understand the hidden patterns of the algorithm [7]. Using the recovered seed value, a list of domains is registered and a server is configured to appear as the C2C infrastructure; this is called sinkholing. This configuration helps to hijack botnets. Once sinkholed, the malware author has to reinstall the bots with an updated seed value. The other most commonly used approach is blacklisting [8]. A blacklist is a static list of DGA generated domain names that a network administrator uses to block connections to a C2C server. Both sinkholing and blacklisting are time consuming and resource intensive. Moreover, they rely on the seed value of a campaign. Once the seed value is unknown, these approaches are ineffective.
Another important approach is to create a DGA classifier based on machine learning. A DGA classifier resides in the network and issues an alert to the network administrator when it finds a DGA generated domain name in DNS requests. DGA classification is a significant component of a domain reputation system (DRS). A DRS assigns a trust rating score of 1 to malware and 0 to benign domains. Additionally, it draws on information from passive DNS (pDNS) [9, 10]. DGA classification is categorized into 2 types. (1) Retrospective: a large set of domains is clustered and statistical properties, for example the Kullback-Leibler divergence, are computed on each cluster for classification. Additionally, to build a strong feature set, these methods use statistical properties and contextual information such as HTTP headers and NXDomains [11, 27]. Most of the published solutions for DGA classification are based on the retrospective approach. Retrospective approaches are reactive rather than real-time. (2) Real-time detection and prevention: classification is done on each domain name without using contextual properties [12]. This is often considered more difficult than the retrospective approach, and it may show lower performance. However, the authors of [12] showed that even the performance of retrospective approaches is considerably low.
DGA analysis has been a significant area of research for security researchers, who have proposed many solutions using traditional machine learning approaches. Though these solutions achieve significant performance in some cases, they are not considered effective and cannot be adopted in real-time systems, mainly because they are entirely based on feature engineering (entropy, string length, alphanumeric characters, vowel to consonant ratio etc.). Machine learning solutions based on feature engineering may not work for characterizing and detecting a new domain name from an existing DGA or from a newly created DGA. Once a new malware domain appears, the corresponding feature set has to be calculated and training has to be repeated on the newly obtained features, which is time consuming. Moreover, an adversary can easily evade machine learning solutions based on hand-crafted features once the adopted feature set is known. In [9], hidden Markov model (HMM) based DGA classification without hand-crafted feature sets is introduced. The approach relies largely on retrospective detection and consequently performed very poorly. This paper evaluates the effectiveness of large-scale deep learning approaches, namely recurrent neural network (RNN), identity-recurrent neural network (IRNN), long short-term memory (LSTM), convolutional neural network (CNN), and convolutional neural network-long short-term memory (CNN-LSTM) architectures, for DGA classification. These deep networks are composed of complex units whose inner mechanics remain a black box. Thus an adversary may not be able to reverse engineer the classifier without knowing the same training samples.
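To make the contrast with feature engineering concrete, the following minimal sketch computes a few of the hand-crafted features named above (string length, character entropy, vowel to consonant ratio, digit ratio). The function name `handcrafted_features` and the exact feature definitions are assumptions for illustration; published systems define their features in their own ways.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def handcrafted_features(label):
    """Compute simple lexical features for a non-empty domain label."""
    if not label:
        raise ValueError("label must be non-empty")
    n = len(label)
    counts = Counter(label)
    # Shannon entropy (in bits) of the character distribution.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    vowels = sum(1 for ch in label if ch in VOWELS)
    consonants = sum(1 for ch in label if ch.isalpha() and ch not in VOWELS)
    digits = sum(1 for ch in label if ch.isdigit())
    return {
        "length": n,
        "entropy": entropy,
        "vowel_consonant_ratio": vowels / consonants if consonants else 0.0,
        "digit_ratio": digits / n,
    }


benign = handcrafted_features("google")
random_like = handcrafted_features("xkqzjvbwphtr")
```

A classifier built on such features has to be recomputed and retrained when the feature set changes, and an adversary who knows the features can craft names that mimic benign values, which is the weakness the character-level deep learning approach aims to avoid.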
The remaining sections of this paper are structured as follows. Section 2 discusses the related work on DGA analysis. Section 3 provides the mathematical foundation of the deep learning architectures. Section 4 provides the necessary details of the DGA data set, hyper parameter tuning, and the architecture for DGA analysis and classification. Section 5 provides evaluation results, and Section 6 discusses future work and the conclusion.
Related work
Applying machine learning to DGA analysis and classification has remained a vivid area of research for the past 10 years. The main reason is that machine learning is able to identify a newly created domain name, or a new DGA itself, with an acceptable false positive rate. The primary objective of this section is to systematically review the related work on detecting DGAs.
The authors of [8] discussed the efficiency of blacklists, including 15 public malware blacklists and 4 private malware blacklists from anti-virus vendors. They identified the unregistered domains in the listings using DNS. For parked and sinkholed domains, however, they followed a feature based approach. Vendor provided blacklists performed well in blacklisting both DGA and non-DGA malware in comparison to public malware blacklists. Overall, they claimed that blacklists are useful and can serve as a first line of protection against malware. Their effectiveness can be improved by supplementing them with an additional mechanism.
The authors of [25] used IP address, 'whois' information, phishing information and lexical entries of URLs as features, and reported that malicious domain names are shorter than benign domain names and use fewer vowels with unique characters. [26] used language based mechanisms in which a score is assigned to each domain to identify the DGA. The score is estimated based on a dictionary, which additionally helps to examine the sequences in the domain names. The authors of [26, 27] proposed an n-gram mechanism; specifically, they used the distribution of alphanumeric characters in 1-grams and 2-grams to detect domain fluxing. The proposed method assumes that the distributions of alphanumeric characters in human generated and DGA generated domain names are entirely different. They used 2 sets for training, one human generated and the other DGA generated. For each set, 1-grams and 2-grams are calculated, and the unknown domains in each batch of test data are grouped by the same second level domain and the same IP address. They also showed the efficacy of their mechanism using various distance metrics such as Kullback-Leibler (KL) distance, Jaccard index and edit distance. The authors of [11] proposed Pleiades, which uses the same clustering mechanism to classify domains, assuming that the responses to DNS queries from DGA-bot infected machines will be Non-Existent Domain (NX-Domain). The statistical features of Bobax, Torpig, and Conficker are used in training; in testing, the unknown domain names are clustered based on entropy, frequency of individual characters and length. Next, they computed the statistical features for each cluster and compared them with the training data to classify the cluster into a specific DGA family. Additionally, they found 12 DGAs over 15 months. Surprisingly, half of the DGAs were unknown and the other half were variants of known DGAs. During classification, if a DGA is classified as known then the domain is considered to be compromised by a bot.
The DNS requests for compromised domains are analyzed on each host, and a score is calculated based on a fixed threshold value. They arrived at an approach that extends both retrospective and real-time mechanisms. The approach uses the rated scores of its clients to label them as malicious or benign.
The aforementioned DGA detection and classification methods were studied by [12], who reported two issues. First, the discussed methods are retrospective and consequently cannot be adopted for real-time DGA detection, because they are time consuming and perform poorly. The system also has limitations: it showed a detection rate of 83% and is entirely based on estimating scores for clients, so it cannot be used in real time. Second, it uses NXDomain as a baseline for classification and consequently does not support multi-class classification.
The authors of [28] proposed a real-time DGA classifier using linguistic features. The linguistic features are obtained from the significant characters ratio and the n-gram normality score. For both features, the mean and covariance are estimated using the Alexa top one million data set. The Mahalanobis distance is used to measure the distance of unknown domains from this distribution. If the distance is too large, the domain is classified as DGA generated; otherwise it is considered benign. Additionally, they used the aforementioned clustering mechanisms to classify the discovered DGA.
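The distance-based decision described above can be sketched for two features as follows. The feature values, the threshold of 3.0 and the function names are made-up assumptions for illustration only; they are not taken from [28], and a real system would estimate the mean and covariance from a large benign corpus.

```python
import math


def mean_and_cov(points):
    """Sample mean and 2x2 sample covariance of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis(point, mean, cov):
    """Mahalanobis distance of a 2-D point, using the closed-form 2x2 inverse."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    (a, b), (_, d) = cov
    det = a * d - b * b
    # (dx, dy) * inverse(cov) * (dx, dy)^T, with inverse = [[d, -b], [-b, a]] / det
    q = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# Hypothetical (feature_1, feature_2) values for benign names; toy numbers.
benign_points = [(0.60, 0.80), (0.65, 0.85), (0.55, 0.78), (0.62, 0.90), (0.58, 0.75)]
mean, cov = mean_and_cov(benign_points)

THRESHOLD = 3.0  # assumed cut-off, chosen only for this example
near = mahalanobis((0.61, 0.82), mean, cov)
far = mahalanobis((0.10, 0.20), mean, cov)
is_dga = far > THRESHOLD
```

Unlike the Euclidean distance, the Mahalanobis distance scales each direction by the spread of the benign data, so a point is flagged only when it is unusual relative to how the benign features actually vary.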
Background
This section provides an intuitive understanding of the character level representation of domain names, followed by the mathematical foundation of the deep learning architectures.
Domain names encoding in character level
The representation of a domain name is typically called domain name encoding. Domain name encoding consists of 2 steps. In the first step, the raw domain names are preprocessed and tokenized into characters. In preprocessing, the top-level domain is removed and all characters are converted to lower case. The second step is vocabulary creation, which uses only the training data. The vocabulary size must balance the number of training vectors of each class against the number of parameters to learn for the given task. To limit the size of the vocabulary, only the characters that meet a minimum frequency are selected. After this initial step of vocabulary creation, each character is assigned a unique id, and each unique id corresponds to a vector whose length is the size of the vocabulary $D$. Unknown characters in a domain name are treated as trivial and are assigned the default key 0. The unique ids of the characters are transformed into feature vectors using a lookup table operation. Moreover, the most frequently occurring characters are indexed first, in ascending order. This feature vector transformation can be formulated mathematically as follows: a lookup table layer $LUT$ represents each character $c \in D$ as a $d$-dimensional feature vector $wv_d$,

input-shape × weights-of-character-embedding = (nb-characters, character-embedding-dim)

where input-shape = (nb-characters, vocab-size), nb-characters denotes the number of top characters, vocab-size denotes the number of unique characters, and each character is represented in one-hot encoding format. weights-of-character-embedding = (vocab-size, character-embedding-dimension), where character-embedding-dimension denotes the size of the character embedding vector. The $j$-th row of the embedding weight matrix corresponds to the integer id $j$. The dimension of the character embedding can be considered one of the hyper parameters of deep learning algorithms. This operation maps each discrete character to a vector of continuous numbers. The character embedding captures the semantic meaning of the given domain name sequence by mapping it into a high dimensional geometric space, called the embedding space. If the embedding has properly learnt the semantics of the domain names by encoding them as real valued vectors, then similar characters appear close to each other, in the same cluster, in this high dimensional geometric space. The resulting embedding output is passed to the subsequent layers. In our case, we considered (1) RNN (2) LSTM (3) GRU (4) I-RNN (5) CNN (6) CNN-LSTM.
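The encoding steps above (lower-casing, dropping the top-level domain, frequency-ordered ids with 0 reserved for unknown characters, and the lookup table) can be sketched in plain Python. All function names, the tiny training list and the embedding dimension of 4 are assumptions for illustration; in the actual experiments the embedding weights are learned by the network rather than left at their random initial values.

```python
import random
from collections import Counter


def preprocess(domain):
    """Lower-case the name and drop the last label (simplified TLD removal)."""
    name = domain.lower()
    return name.rsplit(".", 1)[0] if "." in name else name


def build_vocab(train_domains, min_freq=1):
    """Map characters to ids; frequent characters get the smallest ids."""
    counts = Counter(ch for d in train_domains for ch in preprocess(d))
    kept = [ch for ch, n in counts.most_common() if n >= min_freq]
    # Id 0 is reserved for unknown characters.
    return {ch: i + 1 for i, ch in enumerate(kept)}


def encode(domain, vocab):
    """Turn a domain name into a list of character ids (0 = unknown)."""
    return [vocab.get(ch, 0) for ch in preprocess(domain)]


def init_embedding(vocab_size, dim, seed=0):
    """Random (vocab_size x dim) lookup table; row j is the vector for id j."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def lookup(ids, table):
    """Lookup-table operation: replace every id with its embedding row."""
    return [table[i] for i in ids]


train = ["Google.com", "example.org", "xkqzjv.net"]
vocab = build_vocab(train)
ids = encode("Goo-gle.com", vocab)   # '-' never occurs in `train`
table = init_embedding(len(vocab) + 1, dim=4)
vectors = lookup(ids, table)         # shape: (len(ids), 4)
```

Indexing the table by id is equivalent to multiplying the one-hot input matrix by the embedding weight matrix, which is why the output has shape (nb-characters, character-embedding-dim).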
Recurrent neural network (RNN) is an extension of feed forward networks (FFN), introduced in 1990 [19]. An RNN uses a transition function $tf$ to compute its internal hidden state vector $hd_T$ recursively for the given input sequence. The hidden state vectors $hd_t$ are computed using a transition function of the current input $x_t$ and the past hidden state vector $hd_{t-1}$.
The transition function $tf$ is the composition of an affine transformation of $x_t$ and $hd_{t-1}$ with an element-wise non-linear activation function. This form of network suffers from the vanishing and exploding gradient problem: during training, a gradient vector can grow or decay exponentially over time-steps [20]. To solve this, [20] introduced long short-term memory (LSTM), which adds a special component called a memory block to the hidden recurrent layer. A memory block is a complex processing unit with one or more memory cells and an additional pair of gating units that control the memory cell across time-steps. Additionally, a memory cell has a constant error carousel (CEC). The CEC has a built-in fixed value of 1 and is triggered when the memory block does not receive any value from outside signals. Building on the research on LSTM, [21] introduced the gated recurrent unit (GRU). A GRU has fewer units than an LSTM and is computationally more efficient. On the other hand, [22] proposed an RNN built with ReLU and initialized with the identity matrix, termed IRNN. The identity initialization in IRNN keeps the error derivatives of the hidden units constant as they are back propagated through time, as long as no extra error derivatives are added, and this property helps to learn long-range temporal dependencies. The performance of IRNN was close to that of LSTM on 4 important tasks: two toy problems, language modeling and speech recognition.
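As a minimal sketch of the recurrence described above, the following code assumes the common Elman-style form $hd_t = \sigma(W x_t + U\,hd_{t-1} + b)$, where the weight names `W`, `U`, `b` and the toy values are assumptions, not notation from the text. Switching the activation to ReLU and setting `U` to the identity with zero bias gives an IRNN-style initialization in the sense of [22].

```python
import math


def relu(v):
    return v if v > 0.0 else 0.0


def rnn_step(x_t, h_prev, W, U, b, activation=math.tanh):
    """One recurrent update: affine map of (x_t, h_prev), then a non-linearity."""
    h_t = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W[i][j] * x_t[j] for j in range(len(x_t)))
        pre += sum(U[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(activation(pre))
    return h_t


def run_rnn(xs, W, U, b, activation=math.tanh):
    """Apply the step recursively from a zero state and return the last state."""
    h = [0.0] * len(b)
    for x_t in xs:
        h = rnn_step(x_t, h, W, U, b, activation)
    return h


# Toy 2-unit network and a 3-step input sequence (illustrative values).
W = [[0.5, -0.3], [0.1, 0.8]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Plain RNN with tanh.
h_tanh = run_rnn(xs, W, U=[[0.2, 0.0], [0.0, 0.2]], b=[0.0, 0.1])

# IRNN-style setup: identity recurrent matrix, zero bias, ReLU activation.
h_irnn = run_rnn(xs, W, U=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0], activation=relu)
```

With the identity recurrent matrix and ReLU, a positive hidden state is copied forward unchanged when no new input arrives, which is the intuition behind the claim that IRNN preserves error derivatives over time.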
Convolutional neural network (CNN) is a well established method in the field of image processing. Concretely, a CNN takes 2D input for images and 1D input for time series and text [23]. The CNN is composed of a convolution 1D layer, a pooling 1D layer, a fully connected layer and the non-linear activation function ReLU.
Let $D_1 = \{c_1, c_2, \cdots, c_l\}$ be the domain name, in which $c$ denotes characters and $l$ is the length of the domain name, let $D$ be the vocabulary of domain name characters and let $d$ be the dimensionality of the character embedding. The character level representation is encoded by an embedding matrix
This section provides the necessary details of the DGA data set, the hyper parameter tuning, the binary and multi-class classification settings and the proposed deep learning architecture. Because recurrent neural network approaches such as RNN, LSTM, GRU and IRNN are parameterized functions, experiments are run with various configurations of network parameters and structures to find their optimal values. All models are trained using backpropagation through time (BPTT) with the Adam update rule on GPU enabled TensorFlow [13] accompanied with Keras 1 on a single NVIDIA GK110BGL Tesla K40.
Description of Alexa, OpenDNS, DGA malware family and OSINT dataset
Many recent malware families rely largely on domain generation algorithms (DGAs) to build effective rallying mechanisms. This is primarily because a DGA lets a malware family generate anywhere from thousands to millions of pseudo-random domain names, a subset of which is used to connect to a C2C server periodically to acquire information or launch other malicious attacks. This process is illustrated in Fig. 3(b): a bot generates two domain names, abc.com and def.com, and sends a DNS request for each; the DNS server replies with NXDomain (not registered) for abc.com and with an IP address for def.com. The bot then uses the IP address of the registered domain name to establish a connection to the C2C server. In this paper, we formed a list of legitimate domain names by assembling Alexa [14] and OpenDNS [24]; malicious domain names are generated using publicly accessible algorithms [15] and OSINT DGA feeds [16]. We randomly split the data set into 2 parts: (1) training with 86008 domain names and (2) testing with 39004 domain names. The detailed statistics of benign and DGA generated domain names are displayed in Table 1.
Detailed statistics of benign and DGA generated data sets
The proposed approach of this research is derived from character-level text classification. Fig. 1(a) shows the unigram alphanumeric distribution of legitimate and DGA generated domain names. It indicates that legitimate domain names follow a distinct pattern in comparison to DGA generated ones. For each character, DGA generated domain names show either a markedly high or a markedly low probability. Both legitimate and DGA generated domain names have their maximum frequency at the character ‘E’, followed by an irregular rise and fall in the unigram probability distribution. Moreover, the probability of digits in DGA generated domain names is much lower than in legitimate domain names. Fig. 1(b) represents the unigram alphanumeric distribution of each class of DGA generated domain names.

Fig. 1. (a), (b) Probability distribution of alphanumeric characters of benign and DGA generated domain names, (c) Accuracy of deep learning models over epochs in the range [0–1000].
A recurrent structure, specifically LSTM, is a parameterized function. As a result, good performance in distinguishing a domain name as either non-DGA or DGA depends implicitly on optimal parameters. The parameters that need to be tuned for LSTM are the number of hidden layers, batch_size, character-vector or vocabulary size, hidden units size, dropout, optimizer and learning rate. Because the experiments use character level inputs of domain names, the vocabulary size is not tuned; it is set to 39 (the number of unique characters). Additionally, the dropout is set to 0.1 in recurrent structures and 0.2 in CNN. The learning rate and batch_size are set to 0.1 and 32 respectively in both the CNN and the recurrent structures. Most of the recurrent structures reached their best performance with one hidden layer of 128 units. Moreover, in our second experimental setting, we apply recurrent structures to bigram level inputs instead of character level inputs. The results of recurrent structures on bigram level inputs are considerably poorer than on character level inputs.
For CNN, the optimal number of filters is chosen by running experiments with 4, 16, 32 and 64 filters. Among them, 64 filters gave the best performance. Moreover, the accuracy of the experiments with 32 filters is comparable to that of the experiments with 64 filters.
Deep learning architecture
Fig. 2(a) presents an intuitive overview of the deep learning mechanisms for classifying a domain name as either benign or DGA generated and assigning it to its corresponding malware family. The architecture is divided into 3 notional sections: (1) character encoding (2) feature extraction via deep layers (3) classification.

Fig. 2. (a) An intuitive overview of the proposed deep learning architecture, (b) ROC curve, (c) Embedded character vectors learned by the LSTM model, represented using a 2-dimensional linear projection (PCA) with t-SNE.
Initially, preprocessing is applied to the raw domain names. Preprocessing includes converting upper case characters to lower case, primarily because distinguishing upper and lower case characters might lead to a regularization issue. Additionally, the top-level domain name is discarded. As the initial step of vocabulary creation, each character is assigned a unique id, and each unique id corresponds to a vector whose length is the size of the vocabulary, $D$. Here, the vocabulary size is D = 39. Unknown characters in a domain name are treated as trivial and are assigned the default key 0. We formed 2 dictionaries, (1) one that maps character ids to characters and (2) one that maps characters to character ids, and determined the domain name with the largest number of characters. The largest domain name length was 37. To make all domain name sequences the same length, domain names shorter than 37 characters are padded with zeros. As a result, we get a matrix of size 56008 × 37 for training and 30000 × 37 for validation. These matrices are passed to an embedding layer batch by batch with a batch_size of 32, that is, as 32 × 37 matrices. The embedding layer constructs a matrix of size 39 × 128. Each row is a character-embedding vector, and each character id is replaced with its character vector of size 128. The embedding layer is optimized jointly with the other layers of the deep network during backpropagation. As a result, similar characters come close together. This type of character clustering helps the other layers to easily identify the semantic and contextual similarity structures in domain names. To visualize the learned 128-dimensional vector representation, we used a two-dimensional linear projection through PCA with t-SNE [17], as shown in Fig. 2(c). Fig. 2(c) shows that similar characters are clustered together, i.e. alphabetic characters, numbers, and the underscore, hyphen and period appear in separate clusters.
The appearance of the underscore, hyphen and period in a separate cluster is highly important. It reflects that the embedding representation has captured the semantic and contextual similarity of domain names.
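The padding and batching described above can be sketched as follows. The helper names and the example id sequences are assumptions for illustration. Padding on the right is also an assumption, since the text only states that shorter names are padded with zeros; Keras's `pad_sequences`, for example, pads at the front by default.

```python
MAX_LEN = 37      # longest domain name reported in the text
BATCH_SIZE = 32   # batch size reported in the text


def pad_sequence(ids, max_len=MAX_LEN, pad_id=0):
    """Right-pad with zeros (and truncate as a safeguard) to exactly max_len."""
    return (ids + [pad_id] * max_len)[:max_len]


def batches(rows, batch_size=BATCH_SIZE):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Hypothetical encoded domain names of different lengths.
encoded = [[7, 15, 15, 7, 12, 5], [5, 24, 1, 13, 16, 12, 5], list(range(1, 41))]
padded = [pad_sequence(ids) for ids in encoded]

# Number of 32-row batches needed for the 56008 training rows.
n_train_batches = -(-56008 // BATCH_SIZE)   # ceiling division

# Batching a small stand-in matrix of 70 rows.
batch_sizes = [len(b) for b in batches([[0] * MAX_LEN] * 70)]
```

With 56008 training rows and a batch size of 32, the last batch of each epoch holds only 8 rows, since 56008 = 1750 × 32 + 8.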
Features extraction via deep layers
We adopt various deep layers for feature extraction: the recurrent layers RNN, LSTM, GRU and IRNN, the CNN, and the hybrid architecture CNN-LSTM. For each deep layer, various experiments are run to evaluate its performance. We did not use batch normalization between deep layers because the architectures are not very deep. For all deep learning architectures, we set the learning rate to 0.1 and batch_size to 32. Thus, the deep learning model updates its parameters via backpropagation after each batch of 32 data samples.
Recurrent layers
For recurrent structures, we adopted the following layers: RNN, LSTM, GRU and IRNN. Based on our experience from past hyper-parameter tuning experiments, the number of units (memory cells in LSTM and GRU) is set to 128. An RNN unit uses the hyperbolic tangent as its input and output activation function, with values in the range [-1, 1]. An LSTM memory cell uses the hyperbolic tangent as its input and output activation function, with values in the range [-1, 1], and the sigmoid for the gates and other neurons, with values in the range [0, 1]. The recurrent layer captures the dependencies in the sequence of character vectors received from the embedding layer (a 37 × 128 matrix per domain name, looked up from the 39 × 128 embedding matrix) and passes its last output of size 128 to a dropout layer with rate 0.1.
Convolution layers
A domain name is a one-dimensional sequence, hence we adopt convolution1D, in which the filter moves in only one direction. The convolution1D mechanism has two steps: (1) convolution1D (2) pooling1D. In the first step, a convolution operation with 64 filters of length 3 (meaning each filter is applied to 3 characters at a time) is applied. Each character has a vector of 128 elements. To characterize the sequences of 3 characters, a scalar product is computed between each of the 64 filters and the character vectors in each window of 3 characters. The convolution1D layer then passes its output of shape 37 × 64 to the pooling1D layer (we chose maxpooling1D). The maxpooling1D layer has a stride of 2, which halves the length of the convolution feature map. The pooling1D output has shape 18 × 64 and is flattened to a vector. The flattened vector includes 1152 elements, in which the first 64 elements correspond to the first position, the second 64 elements to the second position, and so on. This vector is passed to a dense layer that compresses it into 128 elements. The dense layer is followed by dropout with rate 0.1; as a result, the deep model captures the latent features. Additionally, the weight matrix of the last fully connected layer has a direct connection to the position of individual characters. Thus it learns the position of characters in a domain name. In the hybrid network, the pooling1D output of shape 18 × 64 is fed to an LSTM layer, which compresses the output to 70 elements and passes it to the dense layer.
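The shapes quoted above can be checked with a short calculation. The helper names are hypothetical, and the parameter counts assume a standard 1D convolution and dense layer with bias terms, which the text does not state explicitly. A convolution output length of 37 for an input of length 37 implies "same" (zero) padding; without padding, a filter of length 3 would give 35 positions.

```python
def conv1d_out_len(seq_len, kernel, padding="same", stride=1):
    """Output length of a 1D convolution for 'same' or 'valid' padding."""
    if padding == "same":
        return -(-seq_len // stride)            # ceil(seq_len / stride)
    return (seq_len - kernel) // stride + 1      # 'valid': no padding


def pool1d_out_len(seq_len, pool=2, stride=2):
    """Output length of 1D max pooling without padding."""
    return (seq_len - pool) // stride + 1


seq_len, emb_dim, n_filters, kernel = 37, 128, 64, 3

conv_len = conv1d_out_len(seq_len, kernel, "same")    # 37 positions x 64 filters
valid_len = conv1d_out_len(seq_len, kernel, "valid")  # 35 if no padding were used
pool_len = pool1d_out_len(conv_len)                   # 18 positions x 64 filters
flat_len = pool_len * n_filters                       # 1152 elements

# Trainable parameters, assuming bias terms (an assumption, see lead-in).
conv_params = kernel * emb_dim * n_filters + n_filters   # filters + biases
dense_params = flat_len * 128 + 128                      # 1152 -> 128 dense layer
```

The last of the 37 convolution positions is dropped by the pooling step, because a window of 2 with stride 2 covers only the first 36 positions.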
Regularization
To avoid overfitting, a dropout layer with rate 0.1 is added as regularization after the deep layers RNN, LSTM, GRU and IRNN. In the CNN, the dense layer is interleaved with a dropout layer with rate 0.2 and an activation layer with the ReLU non-linear activation function. Dropout is a mechanism that randomly discards neurons, along with their connections, while training a deep learning model.
Classification
After feature extraction, we add a dense layer after the dropout layer in order to classify a domain name as either benign or DGA generated in the binary classification configuration and, additionally, to categorize a DGA generated domain name into its corresponding malware family in the multi-class classification configuration. In the binary classification configuration, this stage is composed of a dense layer with 1 unit followed by a sigmoid activation layer, with binary cross-entropy as the loss function. In the multi-class classification configuration, it is composed of a dense layer with 18 units followed by a softmax activation layer, with categorical cross-entropy as the loss function. The dense layer is also termed a fully connected layer. It aggregates the features received from the previous layer (the dropout layer in the recurrent structures, and the CNN and LSTM layers in the hybrid network) into a single unit by retaining the most important ones. As a result, it forms a hierarchical feature representation for the final classification phase. The loss function for binary cross-entropy is estimated using the following formula,

$$loss(p, e) = -\frac{1}{N} \sum_{i=1}^{N} \left[ e_i \log p_i + (1 - e_i) \log (1 - p_i) \right]$$

Here e is the vector of expected class labels, p is the vector of predicted probabilities for all domain names, and N is the number of domain names. To minimize the loss we used the Adam optimizer via backpropagation. The loss function for categorical cross-entropy is estimated using the following formula,

$$loss(p, q) = -\sum_{x} p(x) \log q(x)$$

Here p is the true probability distribution and q is the predicted probability distribution from the softmax layer.
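The two output activations and loss functions described above can be sketched in plain Python as follows. This is a minimal illustration using the standard textbook definitions, not the authors' implementation; the clipping constant `eps` and all function names are assumptions introduced to keep the example numerically safe and self-contained.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function for the 1-unit binary output."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def softmax(logits):
    """Softmax over the output units (18 in the multi-class configuration)."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def binary_cross_entropy(expected, predicted, eps=1e-12):
    """Mean binary cross-entropy over all domain names."""
    total = 0.0
    for e, p in zip(expected, predicted):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += e * math.log(p) + (1 - e) * math.log(1.0 - p)
    return -total / len(expected)


def categorical_cross_entropy(true_dist, pred_dist, eps=1e-12):
    """Cross-entropy between a true distribution and a softmax output."""
    return -sum(t * math.log(max(q, eps))
                for t, q in zip(true_dist, pred_dist))


# Binary case: labels 1 = DGA generated, 0 = benign (label coding assumed).
probs = [sigmoid(2.0), sigmoid(-1.5)]
print("binary loss:", binary_cross_entropy([1, 0], probs))

# Multi-class case with three illustrative classes instead of 18.
q = softmax([2.0, 0.5, -1.0])
print("categorical loss:", categorical_cross_entropy([1, 0, 0], q))
```

Both losses approach zero as the predicted probability of the correct class approaches 1, which is what the optimizer exploits during training.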
The performance of the trained model is evaluated epoch-wise on the testing samples, as displayed in Fig. 1(c). LSTM and CNN-LSTM showed good performance up to 500 epochs, after which their performance started to decrease due to overfitting. The performance of IRNN started to decrease after 300 epochs. CNN and GRU started to overfit after 100 epochs, and RNN started to overfit once it reached 50 epochs. From this observation, we can say that 500 epochs are sufficient to capture the character-level dependencies of a domain name. As a baseline comparison, we apply a logistic regression model to the character-level bigram representation of the domain name, and other classical machine learning classifiers such as Random Forest (RF), Decision Tree (DT), Maximum Entropy Modeling (MT), AdaBoost (AB), and Naive Bayes (NB) to the hand-crafted features. Table 2 reports their results for classifying a domain name as either benign or DGA generated in terms of accuracy, precision, recall, and F1-score. Table 3 reports the results of the deep learning approaches at the character level, logistic regression (LR) on character bigrams of the domain name, and the random forest classifier on hand-crafted features for classifying a domain name as either benign or DGA generated and for identifying its family. As Table 2 and Table 3 show, LSTM performed well in both the binary and multi-class classification settings in comparison to RNN, IRNN, and the other adopted mechanisms. Moreover, the results obtained with GRU are comparable to those of LSTM. For the binary classification setting, the performance of the various mechanisms is displayed as receiver operating characteristic (ROC) curves in Fig. 2(b). LSTM and CNN-LSTM both showed good performance (AUC of 1.000), with consistent TPR and FPR.
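For reference, the character bigram representation used by the logistic regression baseline and the four reported metrics can be sketched as below. This is a minimal illustration based on the standard definitions; the function names, the example domain, and the label coding (1 for DGA generated, 0 for benign) are assumptions and are not taken from the paper.

```python
def char_bigrams(domain):
    """Split a domain name into overlapping character bigrams."""
    return [domain[i:i + 2] for i in range(len(domain) - 1)]


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1-score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


print(char_bigrams("example"))
print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
```

Recall on the DGA class corresponds to the true positive rate (TPR) plotted on the ROC curve.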
Summary of test results for binary classification
Summary of test results for binary classification
Summary of test results for multi-class classification
The inner mechanics of deep learning architectures have remained a black box for both novice and advanced users in real environments. Deep networks such as RNN, LSTM, GRU and IRNN are very complex, and their deep layers hold a large amount of information that can be exposed by unwrapping them. Generally, the transformed character representation from the embedding layer is passed through several deep layers to capture the semantic similarity among characters. The non-linear activation in each deep layer helps to classify a domain name as benign or DGA generated, together with its corresponding malware family. Based on the learned feature representation, the last layer in a deep architecture should maximally separate the benign class and the DGA malware families. To examine this in our experimental setting, we randomly selected 25 samples from the benign and DGA generated malware families. Those samples are fed to the LSTM architecture of the trained model. We replaced the last layer, i.e. softmax, with t-SNE [17]. t-SNE transforms the high-dimensional vectors into a two-dimensional representation, and the resulting 2-D vectors are shown in Fig. 3(a). A few testing samples of locky, benign, dircrypt and qakbot were misclassified. This shows that the features learned by the LSTM model have similar characteristics among these classes. The other malware families clustered together correctly.

(a) Visualization of activation values in t-SNE, (b) Rallying mechanism of DGA based botnet.
This paper evaluates the effectiveness of various deep learning approaches for detecting domain names generated by domain generation algorithms (DGAs) and classifying them to a specific malware family. Domain names are distinguished as either malicious or benign by training at the character level, with the necessary features extracted automatically. This avoids manual hand-crafted feature engineering and thereby makes the approach robust to drift in domain names and to adversarial machine learning settings. The deep network, with embedding as its first layer, implicitly extracts features that are passed to the subsequent deep layers, followed by a dense layer for classification in a supervised learning setting. The family of recurrent neural networks (RNN) and its hybrid network (formed using CNN) performed significantly better than the methods based on hand-crafted features and bigrams in both binary and multi-class classification settings. For multi-class classification, we specifically modified the LSTM network used in binary classification, with the aim of learning the background details of DGA generated malware, including its origin and main purpose. Overall, deep learning approaches are considered more effective than classical machine learning methods because the classical methods rely on a feature set, and a machine learning model based on that feature set becomes futile under adversarial machine learning. This work serves as a baseline system for understanding the effectiveness of various deep learning approaches for the analysis and classification of DGA generated domain names.
In recent years, malware families have largely relied on DGAs because of the random nature in which they generate domain names from a given seed. As a result, DGAs help malware hide from detection methods. Security researchers use various techniques to detect and classify DGAs, such as blacklisting, machine learning based solutions, passive domain name system (DNS) techniques, and analysis of the traffic of malicious domains and DNS. This paper focuses on machine learning based solutions for DGA analysis; to achieve this, we use a complex form of machine learning typically called deep learning. The effectiveness of various mechanisms is studied for DGA detection and for classifying a domain name to its corresponding DGA family. However, this study used only 17 malware families. Studying the effectiveness of deep learning mechanisms with a larger number of malware families is therefore a significant direction for future work. Additionally, this paper has not shown the inner mechanics of deep learning networks, which is important for real-time deployment. Analyzing the inner mechanics of deep learning algorithms [18] with respect to the detection rate of malicious domain names, and providing an appropriate mathematical representation and visualization, is another direction for future work.
In this study, we examined the capability of various deep networks using simple architectures and tuned only a few parameters for DGA analysis. Following a systematic hyper-parameter tuning approach may further improve the reported results. When we attempted to train more complex networks, they incurred high computational costs for longer domain names on our current hardware. A larger study is therefore required on the effectiveness of complex architectures for DGA analysis. This can be done by using more advanced hardware and a distributed environment for training complex networks.
