Detecting malicious domain names using deep learning approaches at scale

Abstract

Computer security threats evolve constantly and target networks and the internet around the clock. New threats, and the sophisticated methods that attackers use, can bypass existing detection and prevention mechanisms. A new approach is required that can handle and analyze massive volumes of logs from diverse sources such as network packets, Domain Name System (DNS) logs, proxy logs and system/service logs. Such an approach is typically termed big data. It can help address security issues such as fraud detection, malicious activity and other advanced persistent threats. Apache Spark is a distributed, big data based cluster computing platform that can store and process security data to give real-time protection. In this paper, we collect only DNS logs from client machines in a local area network (LAN) and store them on a server. To classify a domain name as either benign or malicious, we propose a deep learning based approach. For comparison, we evaluate the effectiveness of deep learning approaches such as the recurrent neural network (RNN) and long short-term memory (LSTM) alongside traditional machine learning classifiers. The deep learning based approaches perform well in comparison to the classical machine learning classifiers, primarily because deep learning algorithms can obtain the right features implicitly. Moreover, LSTM obtains the highest malicious detection rate in all experiments in comparison to the other deep learning approaches.

Keywords

Big data approach; log files; Apache Spark; Domain Name Service (DNS); machine learning and deep learning: Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM)

1 Introduction

Over the years, the internet has expanded to monumental proportions, with an increasing number of hosts and widespread availability of high-speed connections. This growth has also resulted in a rise in the number of attacks on hosts. The Domain Name System (DNS) is one of the vital elements of the internet, and its security is a key requirement for the internet to work well. Because of its importance, DNS has long been a target for attackers. A common attack floods the DNS server with packets, consuming large amounts of bandwidth and processing capacity with the aim of making the server unavailable to users. This type of attack is typically called denial of service (DoS), and it is only a small part of the attack landscape. Such attacks and security breaches push organizations to act against the growing problems of advanced persistent threats, fraud and insider attacks. Conventional security monitoring approaches lack the sophistication and visibility required to detect and defend against such attacks, and they concentrate on solving a single aspect of the security problem. Skilled cyber attackers are agile and carry out covert reconnaissance of an organization's network over a period of time, sometimes stealing data that belongs to the organization. Conventional approaches are not enough to secure organizations against such threats, so new policies and methods must be adopted. Security research now focuses on approaches that collect all available security information, such as log files, emails, event records, network packet flow data, DNS logs and software configuration files. As the size of the network increases, the collected data becomes massive and turns into a big data problem. Analyzing this massive structured and unstructured security data over a period of time reveals useful hidden patterns that can be used to characterize security threats.

Security intelligence with Big Data provides a good security analytics platform for advanced threat detection, combining deep security information with analytical insights on a massive scale. Organizations are currently looking for better insight into security threats; Apache Spark 1 with the Hadoop 2 stack collects and analyzes massive structured and unstructured log data to cope with advanced security threats. The integration of security intelligence with Big Data helps to find advanced malicious activity and security threats. Apache Spark can examine a wide variety of information, including hypertext transfer protocol (HTTP) traffic data, network logs, network events, honeypot data, network flows, IP drift data and DNS replies collected over years of activity. These data help to find malicious activity within an organization. In this work, we deal only with DNS log data to detect malicious domain names.

Nowadays network communication generally relies on DNS, which makes it one of the most important services of the internet. For this reason, monitoring DNS to discover malicious domain names is an effective way to proactively discover and prevent a significant part of malicious communications. DNS is a globally distributed, scalable, hierarchical and dynamic database that translates hostnames to internet protocol (IP) addresses and vice versa. DNS logs are a rich source of information for security log analysis. Each query in the DNS log data contains the domain name, target domain, source IP and so on. These data are used for understanding DNS traffic, usually for detecting faults and malicious events, and can also be used to monitor ongoing security threats. A log analysis application provides useful information to network administrators to defend against existing and new types of security threats. It allows users to visualize the attacks that have happened, to identify unexpected network activities [8] and to redefine the attack or malicious activity [9]. There is a considerable body of work on DNS log visualization. One example is the visual metaphor flying-term developed by [10], which gives the network administrator a visual system for understanding unusual DNS queries such as misconfigurations and security events. The system in [11] helps the network administrator to find malicious domains by analyzing massive monitoring data. In this paper, we capture DNS logs within an organization's LAN, preprocess them using Apache Spark on top of Hadoop and store them in Apache Cassandra 3 . Apache Spark splits the massive DNS log data into chunks and distributes them to the slave nodes. Once the processing is done, the data are aggregated and sent back to the master node. The preprocessed data on the master node contain domain names, which are passed to deep learning models to classify each domain name as either malicious or benign.
Deep learning builds on an old concept of artificial intelligence, the neural network, and has achieved significant results in fields such as natural language processing, image processing and speech recognition [35]. Deep learning mechanisms extract features themselves by taking the raw data set as input. They are mainly categorized into two types: (1) the convolutional neural network (CNN) and (2) the recurrent neural network (RNN). This paper leverages the RNN and its variant, the LSTM network, to identify malicious domain names.

The contributions of this paper are as follows:

Develop a robust, scalable distributed framework capable of analyzing very large volumes of DNS logs at the local area network (LAN) level in an organization and correlating them to detect attack patterns. The recognized patterns can be used to stop further harm by malware.

Detect the malicious domain name by using the deep learning algorithms.

Efficacy of traditional machine learning and deep learning is discussed in the context of detecting the malicious domain name.

The rest of the paper is organized as follows. Section 2 discusses selected related work. Section 3 provides the necessary background on DNS, the domain generation algorithm (DGA) and Apache Spark. Section 4 discusses DNS log collection, the preprocessing approach and the deep learning architecture. Section 5 provides the detailed evaluation results. The conclusion and future work directions are given in Section 6.

2 Related work

Kuhrer [1] discussed the efficiency of blacklists, covering 15 public malware blacklists and 4 private malware blacklists from anti-virus vendors. They identified the unregistered domains in the listings using DNS, whereas for parked and sinkholed domains they followed a feature based approach. Vendor-provided blacklists performed better than public malware blacklists at listing both domain generation algorithm (DGA) malware and non-DGA malware. Overall, they claimed that blacklists are useful as an initial layer of protection from malware, which can be strengthened by supplementing them with an additional mechanism. McGrath [2] used IP addresses, 'whois' information, phishing information and lexical properties of URLs as features, and reported that malicious domain names were shorter than benign domain names and used fewer vowels and unique characters. Sandeep [3] used language based mechanisms in which a score was assigned to each domain to identify DGA-generated names. The score was estimated from a dictionary, which additionally helped to examine the character sequences in the domain names. Sandeep and Antonakakis [3, 4] proposed an n-gram mechanism; specifically, they used the distribution of alphanumeric characters in 1-grams and 2-grams to detect domain fluxes. The method assumed that the distributions of alphanumeric characters in human-generated and DGA-generated names are entirely different. They used two sets for training, one human generated and the other DGA generated. For each set, the 1-gram and 2-gram distributions were calculated, and the unknown domains in each batch of test data were grouped by shared second-level domain and shared IP address. They also showed the efficacy of their mechanism using distance metrics such as the Kullback-Leibler distance, the Jaccard index and the edit distance. Yadav [5] proposed Pleiades, which used the same clustering mechanism to classify domains, assuming that queries from DGA-bot infected machines result in Non-Existent Domain (NXDomain) responses.
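
The n-gram comparison described above can be illustrated with a minimal sketch. It is not the implementation of [3, 4]: the function names, the smoothing constant `eps` and the example labels are assumptions made only for illustration. It builds character n-gram distributions for a human-generated set and a DGA-generated set and compares a batch of unknown names against both using the Kullback-Leibler distance and the Jaccard index.

```python
import math
from collections import Counter


def ngram_distribution(domains, n):
    """Relative frequency of character n-grams over a list of domain labels."""
    counts = Counter()
    for name in domains:
        name = name.lower()
        counts.update(name[i:i + n] for i in range(len(name) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q); eps stands in for n-grams missing from q (a sketch-level assumption)."""
    return sum(pv * math.log(pv / q.get(gram, eps)) for gram, pv in p.items())


def jaccard_index(p, q):
    """Overlap between the sets of n-grams observed in each distribution."""
    a, b = set(p), set(q)
    return len(a & b) / len(a | b) if a | b else 1.0


# Made-up example labels, not data from the paper.
human = ["google", "wikipedia", "amazon", "youtube"]
dga = ["xkqjzv", "qwzxpl", "zzkqpv", "vxjqkz"]
test_batch = ["qzxkvj", "pxqzkw"]

p_test = ngram_distribution(test_batch, 1)
d_human = kl_divergence(p_test, ngram_distribution(human, 1))
d_dga = kl_divergence(p_test, ngram_distribution(dga, 1))
print("closer to DGA set" if d_dga < d_human else "closer to human set")
```

A batch is attributed to whichever reference distribution it is closer to; in practice the reference sets would be far larger than the four labels used here.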

The aforementioned DGA detection and classification methods were studied by [6], who reported two issues. First, the methods are entirely retrospective and consequently cannot be adopted for real-time DGA detection; retrospective methods are time consuming and perform poorly. The reported detection rate of 83% was based entirely on estimating scores for clients, and as a result the approach cannot be used in real time. Second, the methods used NXDomain responses as the baseline for classification and consequently did not help to assign malicious domain names to their DGA family. Schiavoni [7] proposed a real-time DGA classifier using linguistic features. The linguistic features were obtained from the significant characters ratio and the n-gram normality score. For both features, the mean and covariance were estimated using the Alexa top one million dataset. The Mahalanobis distance was used to measure the distance of unknown domains from this reference. If the distance was too large, the domain was classified as DGA generated; otherwise it was considered benign. Additionally, they used the same aforementioned clustering mechanisms to classify the discovered DGAs. Instead of following various feature engineering mechanisms, in this paper we pass the raw domain names to a deep learning architecture, which implicitly obtains optimal feature representations and passes them to a feed forward neural network (FFN) for classification.
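
The Mahalanobis-distance test attributed to [7] above can be sketched for two features. This is an illustration under stated assumptions rather than the method of [7]: the feature values, the threshold of 3.0 and the function names are hypothetical, and the 2x2 covariance matrix is inverted in closed form to stay within the standard library.

```python
import math


def mean_and_covariance(points):
    """Sample mean and 2x2 covariance of a list of (x, y) feature pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis(point, mean, cov):
    """sqrt((p - m)^T C^{-1} (p - m)) using the closed-form 2x2 inverse."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# Hypothetical (characters ratio, n-gram normality score) pairs for benign names.
benign = [(0.80, 0.60), (0.75, 0.55), (0.90, 0.70), (0.85, 0.62), (0.70, 0.50)]
mean, cov = mean_and_covariance(benign)

THRESHOLD = 3.0  # hypothetical cut-off, not a value from the paper
for features in [(0.82, 0.60), (0.10, 0.05)]:
    distance = mahalanobis(features, mean, cov)
    print(features, "DGA" if distance > THRESHOLD else "benign")
```

Unlike the Euclidean distance, this measure accounts for the correlation between the two features, so a point is only flagged when it is far from the benign cloud relative to its spread.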

3 Background

3.1 Domain Generation Algorithm (DGA)

A domain generation algorithm (DGA) is the primary rendezvous mechanism of many recent malware families because it can periodically generate pseudo-random domain names that the malware uses to connect to a command and control (C2C) server. The pseudo-random domain names are generated from a seed, which is a combination of numbers, letters, date/time and other information. This provides a rendezvous point between a botmaster and a bot. A botmaster controls a set of compromised hosts; each host is typically called a bot and the set a botnet. Out of several thousand DGA-generated domain names, the botmaster registers only a handful. As a result, many queries receive non-existent (NX) domain responses. This asymmetry makes defense very expensive for the defender and very economical for the attacker: the attacker has to buy just a handful of domain names, while the defender has to procure, block or sinkhole millions of domains after reverse engineering the malware to discover the algorithm. Moreover, this process can be easily circumvented by malware authors.
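
To make the seed-based generation concrete, the following toy sketch derives a list of pseudo-random names from a seed string and a date. It does not reproduce the algorithm of any real malware family; the function `toy_dga`, its parameters and the seed value are hypothetical and chosen only to show why both sides can compute the same candidate list independently.

```python
import hashlib
from datetime import date


def toy_dga(seed, day, count=5, length=12, tld=".com"):
    """Derive `count` pseudo-random domain names from a seed and a date."""
    domains = []
    for i in range(count):
        material = f"{seed}-{day.isoformat()}-{i}".encode("utf-8")
        digest = hashlib.sha256(material).hexdigest()
        # Map each hex digit onto a letter a..p so the label is alphabetic.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:length])
        domains.append(label + tld)
    return domains


# The same seed and date always yield the same candidates.
for name in toy_dga("example-seed", date(2018, 1, 1)):
    print(name)
```

Because the output depends only on the seed and the date, a defender who recovers both can enumerate the candidates in advance, which is the reverse engineering effort described above.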

Blacklisting is another commonly used mechanism to combat botnets; it blocks the communication point between a botmaster and C2C [1]. Maintaining a blacklist is laborious, and it needs to be updated manually on a regular basis. In a significant number of cases, entries for a malware family are added only weeks or months after it begins to propagate. By that time most of the systems may already have been infected with bots controlled by the botmaster.

3.2 Apache spark in security

Distributed computing platforms are used extensively in big data applications [12], for example MapReduce [13], introduced by Google. These platforms are also widely used in security analysis such as intrusion detection [14], botnet detection [15], and spam detection and classification [16, 17]. The MapReduce framework runs batch processing jobs on data stored in HDFS (Hadoop distributed file system) and is less efficient for iterative machine learning algorithms. Batch processing is also not the solution for all types of data. In our case, for example, some data are captured in real time and some are offline, and the system ultimately has to perform streaming computation on the aggregated data in real time. For this, Apache Spark together with the existing Hadoop stack is better suited than Hadoop MapReduce. Using Spark Streaming, Apache Spark performs streaming computation on captured data (DNS logs in our case) on the existing Hadoop stack, so processing happens in real time on the aggregated DNS logs. Apache Spark is a fast, distributed cluster computing platform for large scale data processing, developed at UC Berkeley in the AMPLab [18], as shown in Fig. 1. Spark Core is the distributed execution engine, and the Java, Scala, Python and R APIs offer a platform for application development. Apache Spark contains the following built-in libraries:

Spark SQL: It allows the user to perform SQL-like queries on Spark datasets, for example through the JDBC API [19].

Spark Streaming: It is used for real-time data streaming. It is based on the D-Stream (discretized stream) [20], a sequence of RDDs used to process real-time streaming data.

MLlib: It is a scalable machine learning library for Spark that provides classification, clustering, collaborative filtering, dimensionality reduction and other machine learning algorithms [21].

GraphX: It is a graph API for performing graph-parallel computation. It has various collections of graph algorithms and graph tasks to simplify graph operations [22].

Fig.1

Apache spark architecture.

Spark is built on 2 key concepts: (1) the RDD (resilient distributed dataset) and (2) the DAG (directed acyclic graph) execution engine. The RDD is the core concept in Spark. It is a distributed memory abstraction that allows programmers to perform in-memory computations on a cluster in a fault-tolerant manner [23]. An RDD can be created in two ways: from a parallelized collection, which resembles a Scala collection, or from a dataset in HDFS. RDDs support two types of operations. Transformations create a new dataset from the given input, for example map or filter. Actions trigger the actual data processing job and return a value after executing it, for example reduce or count. The DAG is a programming model for distributed systems. It supports acyclic data flow, which helps to avoid the rigid multi-stage execution model of MapReduce.
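
The distinction between transformations and actions can be illustrated with a plain-Python analogy. This sketch deliberately does not use the Spark API: lazy generator expressions stand in for transformations and `functools.reduce` stands in for an action, and the log lines are invented for illustration.

```python
from functools import reduce

# Invented "source IP, queried domain" records; the last one is malformed.
log_lines = [
    "10.0.0.5 example.com",
    "10.0.0.7 test.org",
    "10.0.0.5 example.com",
    "malformed",
]

# "Transformations": each step only describes a new dataset; nothing runs yet.
fields = (line.split() for line in log_lines)      # analogous to map
valid = (f for f in fields if len(f) == 2)         # analogous to filter
pairs = ((domain, 1) for _, domain in valid)       # analogous to map


def merge(counts, pair):
    """Fold one (domain, 1) pair into the running per-domain totals."""
    domain, n = pair
    counts[domain] = counts.get(domain, 0) + n
    return counts


# "Action": consuming the pipeline triggers the actual computation.
domain_counts = reduce(merge, pairs, {})
print(domain_counts)
```

In Spark the same pipeline would be distributed across the nodes of the cluster, but the ordering of work is comparable: the chain of transformations is only evaluated once an action asks for a result.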

Apache Spark aims to enhance the performance of data processing without replacing the existing Hadoop stack, as shown in Fig. 2a. It takes advantage of HDFS for reading and writing large data sets, together with NoSQL storage systems such as HBase, Apache Cassandra and MongoDB as well as the relational database MySQL. The processed data can be used to derive and design new rules and policies for better protection and prediction.

Fig.2

(a) Apache Spark architecture over Hadoop Stack, (b) ROC curve.

4 Experimental setup

4.1 Port mirroring

Conventional techniques such as static and binary analysis of malware are often insufficient to keep up with the growth of malware because of the time taken to acquire and process individual binaries in order to produce signatures. By the time an anti-malware signature is available on the market, a large amount of harm may already have been done. Timely analysis of DNS logs is one possible way to detect malicious activity earlier and faster. Since DNS was not designed with security in mind, we explore how the large volume of DNS event data can be used to create cyber threat situational awareness. There are different ways to capture network traffic: (1) using a hub, (2) port mirroring, (3) bridge mode, (4) ARP spoofing and (5) remote packet capture. In our experimental setup, we used port mirroring.

Port mirroring enables the user to duplicate traffic and mirror it to any desired port, as shown in Fig. 3. It allows the administrator to observe switch performance by placing a protocol analyzer on the port that receives the mirrored data [24]. Port mirroring goes by different names depending on the manufacturer; for example, Cisco calls it SPAN (switched port analyzer). We configured port mirroring by assigning a source port from which all packets are duplicated and a destination port to which those packets are sent. When packets are received over a long period, some packets are occasionally transferred to other ports. This problem is solved by placing a protocol analyzer on the receiving port and monitoring it closely. The protocol analyzer inspects each segment of the mirrored data one at a time; this type of protocol analyzer is called a packet sniffer. Port mirroring is used to deploy intrusion detection systems (IDS) and other analysis tools. It sends network traffic to network analysis tools that perform tasks such as intrusion detection, network anomaly detection, and monitoring and forecasting of network trends.

Fig.3

Port mirroring setup: duplicates traffic between different switch ports.

4.2 Network Interface Card (NIC) in promiscuous mode

In our experimental LAN, the network traffic of all connected hosts is not accessible to us in the default mode, because a network adapter only accepts packets addressed to it. On an Ethernet LAN this is solved by promiscuous mode, in which the adapter accepts packets even when they are not addressed to it and thus receives all packets flowing inside the network. We also experimented with a hub-based network, switching the network adapter to promiscuous mode to obtain all network traffic within the LAN. This has the advantage that there is no need to put network analyzers on different ports: the hub transmits received packets to all other ports, so the network traffic can be monitored by connecting to any port.

In our experimental setup we have a switch-based LAN. A switch transmits each packet to only one port. It maintains a record of the media access control (MAC) addresses of all connected hosts and their associated ports, which lets it identify the port of each connected host. When the switch receives a packet, it looks up the MAC address in this record and chooses the exact port to which to forward the packet. As a result, an adapter receives only the packets addressed to it. This minimizes the network load without reducing bandwidth. To begin monitoring after activating port mirroring, we connected to the mirroring port on the switch and used promiscuous mode, as shown in Fig. 4. A LAN cable runs from the mirroring-enabled port on the switch to a computer whose NIC is in promiscuous mode, which enables the NIC to collect all packets from the network.

Fig.4

Architecture of DNS packet capturing system.

4.3 Tcpdump

Tcpdump 4 is a network analyzer tool used to dump traffic on the network and display packet information. It can read packets from a NIC or from a file containing packet data, and write packets to standard output or to a separate file. In our experimental setup most of the systems run the Linux operating system. To capture packets, the libpcap 5 library is used with Tcpdump, which offers many options for capturing and analyzing packets. DNS communicates through the well-known port number 53, which is allocated to DNS. Tcpdump captures all network traffic on port 53 and saves it to a pcap file:

tcpdump -i eth0 -w exaa.pcap port 53

This command listens on the specified Ethernet interface eth0, captures all packets to or from port 53, and writes the captured packets to a file called exaa.pcap.

4.4 Global header

In this experiment only the global header information is extracted, which contains the source and destination IP addresses, query type and so on. The extracted information is appended to a text file, whose size grows over time. This single file is treated as the big data input for our Apache Spark engine; a sample log is displayed in Fig. 5.

Fig.5

A part of DNS log data.

4.5 Data processing

DNS logs are stored in text format. To avoid memory problems, the DNS data collected each day (around 10 GB) are compressed into a ZIP file. We observed that these log volumes are not static. The ZIP files are transferred to one of the slave nodes for temporary access and then to an external HDD for permanent storage. The extracted DNS log contains a great deal of information and looks complex, so we apply preprocessing to extract the Time, Date, IP (internet protocol) and Domain fields. To keep the required information for future use, the preprocessed DNS logs are stored separately in an Apache Cassandra database. Apache Cassandra [25] is an open source, distributed database that exhibits high performance in data access. It is a decentralized database that partitions data across all the nodes in the cluster. Apache Spark has a driver that is used to connect Apache Cassandra with Spark. This architecture is designed to handle the large amount of network data inside an organization. With our approach these data can be processed in a fault-tolerant manner using the distributed cluster computing platform Apache Spark on top of the existing Hadoop stack.
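
The field extraction described above can be sketched as follows. The exact log layout is not given in the text, so the line format assumed here (date, time, source IP, queried domain, separated by whitespace) is hypothetical, as are the names `LINE_RE` and `parse_dns_log_line`.

```python
import re

# Assumed layout: "YYYY-MM-DD HH:MM:SS <source IP> <domain> ..."
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<domain>[A-Za-z0-9.-]+)"
)


def parse_dns_log_line(line):
    """Return the Date, Time, IP and Domain fields, or None if the line is malformed."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Normalize the domain: drop a trailing root dot and lower-case it.
    record["domain"] = record["domain"].rstrip(".").lower()
    return record


sample = "2017-03-14 10:15:32 192.168.1.23 WWW.Example.COM. A"
print(parse_dns_log_line(sample))
print(parse_dns_log_line("not a dns record"))
```

Returning `None` for malformed lines lets a caller filter them out before the records are written to storage.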

4.6 Architecture

The preprocessed text file is stored in HDFS (Hadoop distributed file system) and given as input to the Apache Spark cluster engine, which analyzes the packets and calculates statistics, as shown in Fig. 6a. These logs were collected during the organization's working hours. Fig. 6b shows the real-time Apache Spark cluster used in our experimental setup. The Apache Spark cluster offers the following advantages. (1) Apache Spark distributes jobs across the nodes in the cluster. This helps to find the best hyperparameters of the deep learning methods, which increases the speed of training and decreases the error rate. (2) The trained deep learning model is deployed at scale in real time to analyze large amounts of data and provide alerts in a timely manner.

Fig.6

(a) Top queried websites, (b) Apache Spark cluster setup.

Initially, we had a Hadoop Yet Another Resource Negotiator (YARN) architecture, and Apache Spark was set up over YARN. This makes it easy to integrate Spark into the existing Hadoop stack and enables reading and writing a large file system with the help of HDFS. The Hadoop YARN cluster has a resource manager that controls the resource utilization of the cluster. In our Hadoop YARN cluster each node has 32 GB of RAM. If a job requests 5 executors with 1 GB each, all of them can easily run on a single YARN node without the help of other nodes in the cluster. Hence, job submission to the slave nodes depends on the number of executors; if the input data are massive, more processes and executors are required.

The Apache Spark cluster has been set up to efficiently distribute, execute and harvest tasks. Each system has the same specification (32 GB RAM, 2 TB HDD, Intel Core i7-4790K processor, MSI Z87-GD65 motherboard) and is connected over a 1 Gbps Ethernet network. We define the basic unit of our Apache Spark cluster as a node, which is a physical or virtual machine. The developed framework has 3 kinds of nodes: master node, slave node and data storage node. Master node (C_m): It controls all the nodes in the framework. The user communicates with the system through the master node interface. It retrieves DNS log data from the data storage node and performs log preprocessing and log segmentation. It distributes workloads to the slave nodes and finally aggregates the output from all of them. Slave nodes (C_s): These nodes retrieve preprocessed logs from the master node. There may be a large number of them to analyze logs in parallel. The master-slave framework keeps computation fast and scales with the size of the logs. Data storage node (C_ds): This node stores the DNS logs and also acts as a slave node. It keeps track of log data on a daily basis and aggregates the data on a daily, weekly and monthly basis.

At the end of every month the log data are transferred to the local disk by the network administrator. This keeps computation fast, because whenever the data size becomes large the master node tries to distribute the log data to the storage node as well. Currently, the logs are not streamed. In the near future, the data storage node will also be responsible for reading the streamed DNS logs of connected hosts. Each domain name in the collected data set is labeled as benign or malicious/DGA-generated for the classification task.

Two different data sets are used: Data set 1 is collected from publicly available DGA algorithms [26] and real-time malicious domain names from OSINT feeds [27], and Data set 2 is collected from the real-time environment. The benign domain names are taken from Alexa [33] and OpenDNS [34]. The 2 data sets are completely disjoint. The main reason for evaluating on 2 different data sets is that the off-line Data set 1 contains the same number of benign and malicious domain names, whereas in Data set 2 the number of benign domain names is larger than the number of malicious domain names because the names are collected from real-time DNS traffic. The detailed statistics of Data set 1 and Data set 2 are reported in Table 1.

Table 1

Description of data set

Data	Benign	Malicious	Total
Data set 1
Training	97506	77506	175012
Testing	15000	110012	125012
Data set 2
Training	49000	19005	68005
Testing	20000	5000	25000

The combination of deep learning with Apache Spark has the potential for a large effect in real-time applications. The system is developed and deployed with the aim of detecting malicious domains automatically in real time. After detecting a malicious domain, the system monitors its activities at regular intervals of ten seconds. Fig. 7 presents the architecture of the implemented system, which consists of four modules: (1) data collection, (2) preprocessing, (3) classification and (4) continuous monitoring. The representation of domain names is typically called domain name encoding, and it is composed of 2 steps. The first step is preprocessing followed by tokenization. During preprocessing, the top-level domain is removed and all characters are converted to lower case. A predefined key, generally 0, is assigned to unknown characters. A tokenizer chops the domain name into segments, usually characters, called tokens. The second step is dictionary creation using the training set. The frequently occurring characters are indexed in ascending order in a dictionary D of size 39. Each character of a domain name is assigned its index from dictionary D and passed to the embedding layer, which computes $\text{input-data-shape} \times \text{weights-of-character-embedding} = (\text{nb-words}, \text{character-embedding-dimension})$, where input-data-shape = (nb-words, dict-size), nb-words denotes the number of top characters, dict-size denotes the number of unique characters, and each character is represented in one-hot encoding format. weights-of-character-embedding = (dict-size, character-embedding-dimension), where character-embedding-dimension denotes the size of the character embedding vector and can be considered a hyperparameter of the deep learning algorithm. This operation maps each discrete character to a vector of continuous numbers.
The embedding layer is trained jointly with the other layers of the deep network to reduce the error during backpropagation. It learns the semantic and contextual similarity structure of domain names and places similar characters in the same cluster. We pass the resulting vectors to further deep layers such as RNN [28] or LSTM [29]. Both layers learn the temporal dependencies in the given sequence of embedding vectors and pass their output to a feed forward network (FFN) for classification. The RNN has difficulty learning long-term temporal dependencies because of the vanishing and exploding gradient problem [30], which is why LSTM performs better than RNN, as shown in Table 2. In the FFN we used a dense (fully connected) layer with sigmoid as the non-linear activation function. This classifies the given sample as either benign or malicious.
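
The two encoding steps and the embedding lookup can be sketched in plain Python. Several details are assumptions rather than statements from the text: characters are indexed by decreasing frequency starting at 1, sequences are left-padded with 0 to a fixed length, only the text after the last dot is treated as the top-level domain, and all function names are hypothetical.

```python
from collections import Counter

UNKNOWN = 0  # key reserved for characters not seen in training (also used as padding here)


def preprocess(domain):
    """Lower-case the name and drop the top-level domain (text after the last dot)."""
    domain = domain.lower().strip(".")
    return domain.rsplit(".", 1)[0] if "." in domain else domain


def build_dictionary(training_domains):
    """Index characters by decreasing frequency, starting at 1."""
    counts = Counter(ch for d in training_domains for ch in preprocess(d))
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}


def encode(domain, dictionary, max_len):
    """Map characters to indices, then left-pad with zeros to a fixed length."""
    ids = [dictionary.get(ch, UNKNOWN) for ch in preprocess(domain)][:max_len]
    return [0] * (max_len - len(ids)) + ids


def embed(ids, weights):
    """Row lookup; equivalent to multiplying one-hot rows by the weight matrix."""
    return [weights[i] for i in ids]


dictionary = build_dictionary(["aab.com"])
encoded = encode("ABz.net", dictionary, max_len=5)
print(dictionary, encoded)
```

The `embed` function shows why the one-hot product in the formula above never has to be materialized: multiplying a one-hot row by the weight matrix simply selects one row of that matrix.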

Fig.7

Architecture for detecting malicious domain names (inner units and their connection are not shown for deep layers).

Table 2

Summary of test results of Data set 1 for classifying domain name as either benign or malicious

Algorithm	Accuracy	Precision	Recall	F-score
LSTM	0.999	0.998	0.999	0.999
RNN	0.978	0.860	0.950	0.903
Hand-crafted Features
Random Forest (RF)	0.947	0.658	0.900	0.740
Decision Tree (DT)	0.933	0.512	0.882	0.648
Naive Bayes (NB)	0.94	0.564	0.898	0.693
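
For reference, the metrics in Table 2 are related as in the following sketch, which assumes that the F-score column is the usual F1 score (the harmonic mean of precision and recall). The function names and the confusion-matrix counts are hypothetical; the precision and recall values passed to `f_score` are those of the RNN row.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the RNN row in Table 2.
print(round(f_score(0.860, 0.950), 3))  # 0.903

# Hypothetical counts, only to show how precision and recall are obtained.
print(precision_recall(tp=90, fp=10, fn=30))
```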

5 Evaluation results

All deep learning models are trained using backpropagation through time (BPTT) [31] with the ADAM optimizer on GPU-enabled TensorFlow [32] in conjunction with the Keras framework 6 . During training of the RNN/LSTM network, we pass a database of both benign and malicious domain names to the deep learning architecture. The database has 39 unique characters; this number is the dictionary size (dict-size). Hyperparameter tuning is a crucial step in deep learning because the algorithms are parameterized, and the best parameters are found by experimenting with different values during training. Two trials of experiments are run for learning rates in the range [0.01–0.5] with a moderately sized RNN and LSTM network. A moderately sized RNN/LSTM network contains an embedding layer with embedding size 64, followed by an LSTM with 64 memory blocks (or an RNN with 64 units) and an FFN. The learning rate 0.05 performed well in comparison to the other learning rates, so 0.05 is used for the rest of the experiments. Two trials of experiments are then run for embedding sizes 64, 128 and 256, followed by an RNN/LSTM layer with 64, 128 or 256 units/memory blocks. An embedding size of 128 followed by an RNN/LSTM layer with 128 units/memory blocks performed well, so we use these settings for the LSTM/RNN network. We pass a matrix of shape 175012*39 to the embedding layer with batch size 64. The embedding layer learns the semantic and contextual meaning of the character sequence of a domain name by mapping it into a high dimensional geometric space, typically called the embedding space; in particular, each character is mapped to a 128 dimensional vector. If the embedding has properly learnt the semantics of the character sequences by encoding them as real valued vectors, then similar characters appear close to each other in the same cluster in this space. The embedding layer output of shape 39*128 is passed to the RNN/LSTM layer, whose 128 units/memory blocks learn the temporal dependencies and pass them to the FFN. The FFN uses a sigmoid non-linear activation function with the binary cross-entropy loss function to identify the domain name as either benign or malicious. The binary cross-entropy loss function is shown in Equation 1,

$loss(pr, ep) = - \frac{1}{N} \sum_{j = 1}^{N} \left[ ep_{j} \log pr_{j} + (1 - ep_{j}) \log (1 - pr_{j}) \right]$ (1)

where ep is a vector of target class labels and pr is a vector of predicted class labels. To minimize the loss we used the ADAM optimizer via BPTT.
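Equation 1 can be evaluated directly on a small example. The following sketch is illustrative only: the function name `binary_cross_entropy`, the clipping constant `eps` (which guards against taking the logarithm of zero) and the convention that malicious domains carry label 1 are assumptions that do not come from the paper.

```python
import math


def binary_cross_entropy(pr, ep, eps=1e-12):
    """Mean binary cross-entropy between predictions pr and targets ep (Equation 1)."""
    if len(pr) != len(ep) or not pr:
        raise ValueError("pr and ep must be non-empty and of equal length")
    total = 0.0
    for p, e in zip(pr, ep):
        # Clip the prediction so that log(0) is never evaluated.
        p = min(max(p, eps), 1.0 - eps)
        total += e * math.log(p) + (1 - e) * math.log(1.0 - p)
    return -total / len(pr)


# Two domains, assuming label 1 = malicious and label 0 = benign.
# Predictions that agree with the labels give a small loss.
confident = binary_cross_entropy([0.9, 0.1], [1, 0])
# The same labels with reversed predictions give a much larger loss.
mistaken = binary_cross_entropy([0.1, 0.9], [1, 0])
```

The loss grows sharply as the predicted probability moves away from the target label, which is what drives the optimizer toward confident, correct predictions.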

The deep learning and machine learning algorithms are applied to detect the malicious domain names using 5 different experimental designs.

Experiments with Data set 1

Experiments with Data set 2

Experiments with Data set 1 for training and Data set 2 for testing

Experiments with Data set 2 for training and Data set 1 for testing

Experiments with merged data sets of Data set 1 and Data set 2

The performance of RNN/LSTM and the classical machine learning algorithms based on hand-crafted features on the test data of Data set 1 and Data set 2 is displayed in Tables 2 and 3 respectively. To give an intuitive understanding of the performance of the deep networks and the classical machine learning classifiers, the receiver operating characteristic (ROC) curve is displayed in Fig. 2b for Data set 1 and Fig. 8 for Data set 2. Both the deep learning and machine learning classifiers perform worse on Data set 2 than on Data set 1. This is because the test data of Data set 2 follows a different probability distribution and was collected at different times in a real-time environment. However, the efficacy of the deep learning algorithms is acceptable. Their detailed performance is reported in Table 3. To assess the effectiveness of the machine learning and deep learning algorithms on completely unseen data, the classifiers trained on Data set 1 are evaluated on Data set 2 and vice versa. The classifiers trained on Data set 2 perform well in comparison to those trained on Data set 1. One significant reason is overfitting. More importantly, Data set 2 contains unique domain names that cover most of the patterns of benign and malicious domain names. The detailed evaluation results are reported in Tables 4 and 5. Finally, experiments are done to evaluate the performance of the deep learning and machine learning algorithms on the merged data set of Data set 1 and Data set 2. The detailed results are shown in Table 6.

Table 3

Summary of test results of Data set 2 for classifying domain name as either benign or malicious

Algorithm	Accuracy	Precision	Recall	F-score
LSTM	0.976	0.892	0.989	0.938
RNN	0.965	0.838	0.986	0.906
Hand-crafted Features
Random Forest (RF)	0.946	0.762	0.963	0.851
Decision Tree (DT)	0.941	0.742	0.954	0.835
Naive Bayes (NB)	0.924	0.664	0.936	0.777
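The F-score column can be checked against the precision and recall columns, since the F-score is the harmonic mean of the two. The sketch below uses the standard metric definitions; the function names are illustrative, and the confusion-matrix counts in the last line are made up for demonstration and are not results from the paper.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp, fp, tn, fn):
    """Accuracy, precision, recall and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f_score(precision, recall)


# Precision and recall taken from the LSTM and RNN rows of Table 3.
lstm_f = f_score(0.892, 0.989)  # rounds to 0.938, matching the table
rnn_f = f_score(0.838, 0.986)   # rounds to 0.906, matching the table

# Hypothetical counts, for illustration only.
example = metrics_from_counts(tp=90, fp=10, tn=880, fn=20)
```

A high accuracy combined with a much lower precision, as in several of the classical classifier rows, indicates that false positives are hidden by a large number of correctly classified benign domains.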

Fig. 8

ROC curve.

Table 4

Summary of evaluation results – train and test is done on the Data set 1 and Data set 2 respectively

Algorithm	Accuracy	Precision	Recall	F-score
LSTM	0.968	0.965	0.996	0.981
RNN	0.958	0.809	0.979	0.886
Hand-crafted Features
Random Forest (RF)	0.943	0.940	0.992	0.965
Decision Tree (DT)	0.939	0.931	0.998	0.963
Naive Bayes (NB)	0.892	0.483	0.951	0.641

Table 5

Summary of evaluation results – train and test is done on the Data set 2 and Data set 1 respectively

Algorithm	Accuracy	Precision	Recall	F-score
LSTM	0.984	0.926	0.994	0.959
RNN	0.976	0.899	0.981	0.938
Hand-crafted Features
Random Forest (RF)	0.945	0.757	0.957	0.846
Decision Tree (DT)	0.917	0.596	0.983	0.742
Naive Bayes (NB)	0.909	0.564	0.967	0.712

Table 6

Summary of test results of merged data set of Data set 1 and Data set 2 for classifying domain name as either benign or malicious

Algorithm	Accuracy	Precision	Recall	F-score
LSTM	0.974	0.878	0.993	0.932
RNN	0.965	0.862	0.961	0.909
Hand-crafted Features
Random Forest (RF)	0.941	0.718	0.98	0.828
Decision Tree (DT)	0.933	0.512	0.882	0.648
Naive Bayes (NB)	0.94	0.564	0.898	0.693

6 Conclusion

This work investigates the event data generated by a core internet protocol, the Domain Name System (DNS), for the purpose of cyber threat situational awareness. DNS is one of the most important services on the internet. The main challenge is that this is a big data problem: collecting 100 system logs in our lab generates about 10 GB of data every day. This creates a semantic gap between the information stored in the DNS log files and the security analysts who must analyze it to detect security threats. In this paper, we design and develop a new scalable architecture using Apache Spark. Apache Spark has been gaining importance alongside the existing Hadoop stack, and it is well suited to large-scale DNS log analysis for security monitoring. The system collects DNS logs and analyzes them in a distributed and fault-tolerant manner. Deep learning is used to detect and alert on the presence of malicious domain names. Additionally, the efficacy of classical machine learning is evaluated for comparison with the deep learning algorithms. To the best of our knowledge, we have demonstrated a big data security analytics platform that collects DNS log data in a LAN environment and analyzes it using a scalable architecture, i.e. Apache Spark. In future work, we will collect other logs and apply the proposed deep learning model to the collected data to improve the detection rate of malicious activities.

Acknowledgments

This research was supported in part by Paramount Computer Systems. We are also grateful to NVIDIA India for the GPU hardware support through a research grant. We are grateful to the Computational Engineering and Networking (CEN) department for encouraging the research.

References

1. Kuhrer, Rossow and Holz, Paint it black: Evaluating the effectiveness of malware blacklists, in Research in Attacks, Intrusions and Defenses, pp. 1–21, Springer, 2014.

2. Mcgrath D.K. and Gupta, Behind Phishing: An Examination of Phisher Modi Operandi, in LEET, 2008.

3. Sandeep, Reddy R.A.K.K and Ranjan, Detecting algorithmically generated malicious domain names, in Proceedings of the 10th annual Conference on Internet Measurement, New York, 2010.

4. Antonakakis, Perdisci, Nadji, Vasiloglou, Abu-Nimeh, Lee and Dagon, From throw-away traffic to bots: detecting the rise of DGA-based malware, in 21st USENIX Security Symposium (USENIX Security 12), pp. 491–506, 2012.

5. Yadav, Reddy A.K.K., Reddy A.N. and Ranjan, Detecting algorithmically generated domain-flux attacks with DNS traffic analysis, IEEE/ACM Transactions on Networking 20(5), 1663–1677.

6. Krishnan, Taylor, Monrose and McHugh, Crossing the threshold: Detecting network malfeasance via sequential hypothesis testing, in 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 1–12, IEEE, 2013.

7. Schiavoni, Maggi, Cavallaro and Zanero, Phoenix: DGA-based botnet tracking and intelligence, in Detection of intrusions and malware, and vulnerability assessment, pp. 192–211, Springer, 2014.

8. Zhang, Banick, Yao and Ramakrishnan, User intention-based traffic dependence analysis for anomaly detection, in Security and Privacy Workshops (SPW), 2012 IEEE Symposium on Security and Privacy, IEEE, 2012, pp. 104–112.

9. King S.T., Mao Z.M., Lucchetti D.G. and Chen P.M., Enriching intrusion alerts through multi-host causality, in Proceedings of the 2005 Network and Distributed System Security Symposium (NDSS), 2005.

10. Ren, Kristoff and Gooch, Visualizing DNS Traffic, VizSEC 2006, 23–30.

11. Kim, Choi and Lee, BotXrayer: Exposing Botnets by Visualizing DNS Traffic, KSII the first International Conference on Internet (ICONI) 2009, December 2009.

12. Dean and Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM 51(1) (2008), 107–113.

13. Shim, MapReduce algorithms for big data analysis, Proceedings of the VLDB Endowment 5(12) (2012), 2016–2017.

14. Yang S.F., Chen W.Y. and Wang Y.T., ICAS: An inter-VM IDS log cloud analysis system, in Cloud Computing and Intelligence Systems (CCIS), 2011 IEEE International Conference on, IEEE, 2011, pp. 285–289.

15. Francois, Wang, Bronzi, State and Engel, Bot-Cloud: Detecting botnets using MapReduce, in Information Forensics and Security (WIFS), 2011 IEEE International Workshop on, IEEE, 2011, pp. 1–6.

16. Caruana, Li and Qi, A MapReduce based parallel SVM for large scale spam filtering, in Fuzzy Systems and Knowledge Discovery (FSKD), 2011 Eighth International Conference on, vol. 4, IEEE, 2011, pp. 2659–2662.

17. Indyk, Kajdanowicz, Kazienko and Plamowski, Web spam detection using MapReduce approach to collective classification, in International Joint Conference CISIS’12-ICEUTE’12-SOCO’12 Special Sessions, Springer, 2013, pp. 197–206.

18. Zaharia, Chowdhury, Franklin M.J., Shenker and Stoica, Spark: Cluster Computing with Working Sets, HotCloud 2010, June 2010.

19. Armbrust, Xin, Lian, Huai, Liu, Bradley, Meng, Kaftan, Franklin, Ghodsi and Zaharia, Spark SQL: Relational Data Processing in Spark, SIGMOD, June 2015.

20. Zaharia, Das, Li, Hunter, Shenker and Stoica, Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013, November 2013.

21. Meng, Bradley, Yavuz, Sparks, Venkataraman, Liu, Freeman, Tsai, Amde, Owen, Xin, Franklin, Zadeh, Zaharia and Talwalkar, MLlib: Machine learning in Apache Spark, The Journal of Machine Learning Research 17(1), 1235–1241.

22. Xin, Gonzalez, Franklin and Stoica, GraphX: A resilient distributed graph system on Spark, in Proceedings of the Graph Data Management Experiences and Systems (GRADES) Workshop, June 2013.

23. Zaharia, Chowdhury, Das, Dave, Ma, McCauley, Franklin M.J., Shenker and Stoica, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012, April 2012.

24. Zhang and Moore, Traffic Trace Artifacts due to Monitoring Via Port Mirroring, End-to-End Monitoring Techniques and Services, 2007. E2EMON ’07. Workshop on, Munich, 2007, pp. 1–8.

25. Lakshman and Malik, Cassandra: structured storage system on a p2p network, in Proceedings of the 28th ACM symposium on Principles of distributed computing, ser. PODC ’09. New York, NY, USA: ACM, 2009, p. 5.

26. https://github.com/baderj/domain-generation-algorithms, Accessed: 2017-04-05.

27. Bambenek consulting – master feeds, http://osint.bambenekconsulting.com/, Accessed: 2016-04-05.

28. Elman J.L., Finding structure in time, Cognitive Science 14(2) (1990), 179–211.

29. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780.

30. Bengio, Simard and Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks 5(2) (1994), 157–166.

31. Werbos P.J., Backpropagation through time: what it does and how to do it, Proceedings of the IEEE 78(10) (1990), 1550–1560.

32. Abadi, Barham, Chen, Davis, Dean and Kudlur, TensorFlow: A System for Large-Scale Machine Learning, OSDI, Vol. 16, 2016.

33. Does Alexa have a list of its top-ranked websites? https://support.alexa.com/, Accessed: 2017-04-02.

34. https://umbrella.cisco.com/, Accessed: 2017-03-02.

35. LeCun, Bengio and Hinton, Deep learning, Nature 521(7553) (2015), 436–444.