Abstract
Anomalous traffic refers to the unusually large number of hits that an unpopular domain receives within a short period of the day. Regardless of whether such anomalies are malicious, it is important to analyze them because they can have a dramatic impact on a customer or an end user. Identifying these traffic anomalies is a challenge, as it requires mining and identifying patterns in huge volumes of data. In this paper, we present a statistical and dynamic reputation based approach to identify unpopular domains receiving huge volumes of traffic within a short period of time. Our aim is to develop and deploy a lightweight framework in a monitored network that analyzes DNS traffic and provides early warning alerts about domains receiving unusual hits, reducing the collateral damage faced by an end user or customer. We employ statistical analysis, supervised learning and ensemble based dynamic reputation of domains, IP addresses and name servers to distinguish benign from abnormal domains with very low false positives.
Introduction
With the advent of technology, the world has seen a surge in internet usage for a variety of day-to-day tasks. For more than a decade the internet has played a major role in the transfer of information at various levels. As end users, we access the internet by querying fully qualified domain names (FQDNs), which are mapped to their corresponding IP addresses by the Domain Name System (DNS) protocol [1]. However, there has been an immense increase in attacks on DNS through multiple malicious activities. These activities have heightened the need for network security and for protecting customers inside a network.
Network traffic anomaly detection plays a crucial role in network security, and detecting anomalies early is of great importance for maintaining the integrity of the network. A network traffic anomaly occurs when the traffic deviates from its normal behavior [2–4]. Thus a domain receiving unusual hits (high or low) compared to its usual activity can be labeled as exhibiting anomalous behavior. A domain may experience a sudden surge in its traffic under the following conditions: (1) it is being attacked by a malicious server; (2) it hosts malicious content and infected clients start querying it; (3) it is a benign domain launching a product or hosting an event.
Regardless of whether this sudden surge in traffic is malicious, it is important to analyze it because it might have a dramatic impact on a customer or an end user. To identify whether a domain receiving anomalous queries is malicious or a victim, we need to check the reputation of that domain. Traditional static blacklists and reputation systems can be easily evaded using DNS agility, which creates a need for a dynamic reputation of domains [5, 27].
In this research we use supervised machine learning and ensemble based dynamic reputation to identify suspicious anomalous domains. We see the following as our contributions: (1) we develop a lightweight yet scalable framework capable of analyzing the DNS traffic of a large network; (2) we identify domains in the network receiving anomalous surge traffic; (3) we distinguish and raise alerts for suspicious anomalous domains based on supervised classification, statistical analysis and ensemble based dynamic reputation.
This paper is organized as follows. The existing methods and related work are described in Section 2. Section 3 provides the proposed system architecture and workflow. The implementation details of our work are explained in Section 4. The performance evaluation and experimental results are provided in Section 5. Finally, conclusions are drawn in Section 6.
Literature survey
Network traffic anomaly detection has been a vibrant and ongoing research area over the past few decades. One of the pioneering works in measuring internet traffic was performed by Ramon et al. in [11]. A lot of work has been done on network traffic anomaly detection in Netflow. In [12], Paul et al. worked on identifying the statistical and invariant properties of anomalies in IP traffic flow and raising alerts for them. Another work on profiling traffic flow was performed in [13]. Paxson et al. presented an FFT based methodology for analyzing approximate self-similar sample paths in network traffic [14]. Network based Intrusion Detection Systems (NIDS), for example Snort [15], have been used to identify ICMP, TCP and UDP flooding attacks. Wavelet theory and signal processing have also been used for finding network anomalies; some of the noteworthy work includes [16–22]. However, these methodologies mainly aim at anomaly detection in traffic at the packet level and not at the DNS level.
Network traffic anomalies have also been detected by checking signature patterns [23]. The efficiency of signature based models depends heavily on the static blacklists and signatures present in their databases. In addition, the existing signatures need to be updated regularly, and a slight modification of the traffic can result in evasion. Antonakakis et al. developed a dynamic reputation system for DNS to overcome the drawbacks of traditional signature based systems [5]. Work on anomaly detection in DNS includes [24–26]. In our approach we developed and deployed a lightweight model for identifying suspicious domains receiving a sudden surge in traffic by employing a combination of static and dynamic reputation based methodologies.
System overview
In this paper, we develop a scalable framework that analyzes the DNS traffic in a network and detects and raises alerts for suspicious domains receiving anomalous hits. The architectural details of the developed framework are provided in Fig. 1.
The data for the system is collected by deploying passive sensors across the four geographically distributed Amrita University campuses (Kochi, Kollam, Coimbatore and Bangalore), comprising more than 30,000 unique users. Each passive sensor collects DNS queries and responses from a deployed DNS server (any DNS server can be used), so in this work we receive data from four DNS servers. The sensor captures the network traffic from the DNS server and passes it to an application that keeps only the traffic originating from the DNS server (the DNS response traffic). The application then dissects each DNS response packet, converts it into a human-readable format and forwards it to the DNS Log Collector. The passive sensor can be installed on the DNS server itself (on Linux/Unix platforms), or the traffic can be mirrored to a server dedicated to passive DNS sensors. Our sensors obtain the data by port mirroring the traffic from the DNS servers in the deployed network.
The logs from the four campus sensors are received by a distributed log parser. The parser finds the geolocation and ASN details of each IP address (client IP, DNS server IP and the A records in the resource records) in the DNS responses. It uses the MaxMind GeoIP database to find the geolocation (city, country, latitude, longitude) of an IP address and the MaxMind ASN database to find its ASN details (AS number, AS name). A Query Router Service controls the communication between the different modules. The Front End Message Router controls all communication to and from the front-end UI and also interfaces with the respective back-end modules.
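The sensor's implementation is not described in detail, so the following is only a minimal sketch of how the response-only filtering step could work on raw DNS payloads. It relies on the standard 12-byte DNS header, in which the top bit of the flags field (QR) is 1 for responses. The function names `parse_dns_header` and `keep_responses` are hypothetical and not taken from the deployed system.

```python
import struct

# Standard DNS header: id, flags, qdcount, ancount, nscount, arcount (6 x 16-bit, network order).
DNS_HEADER = struct.Struct("!HHHHHH")


def parse_dns_header(payload: bytes) -> dict:
    """Decode the fixed DNS header from a raw DNS payload (hypothetical helper)."""
    if len(payload) < DNS_HEADER.size:
        raise ValueError("payload shorter than the 12-byte DNS header")
    ident, flags, qdcount, ancount, _nscount, _arcount = DNS_HEADER.unpack_from(payload)
    return {
        "id": ident,
        "is_response": bool(flags & 0x8000),  # QR bit: 0 = query, 1 = response
        "rcode": flags & 0x000F,              # response code, 0 = no error
        "questions": qdcount,
        "answers": ancount,
    }


def keep_responses(payloads):
    """Keep only DNS response payloads, mirroring the sensor's response-only filter."""
    return [p for p in payloads if parse_dns_header(p)["is_response"]]


# Two header-only example payloads: a query and its matching response.
query = DNS_HEADER.pack(0x1A2B, 0x0100, 1, 0, 0, 0)
response = DNS_HEADER.pack(0x1A2B, 0x8180, 1, 1, 0, 0)
kept = keep_responses([query, response])
```

A real sensor would go on to decode the question and resource record sections before forwarding the record to the log collector; that part is omitted here.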
Implementation details
The framework performs batch analysis of the DNS traffic in the monitored network. For a given time frame Tf, the hits received by all domains are aggregated and analyzed, and four features are extracted for each domain.

The hit-to-registration ratio of a domain Di is

\[ HR_{ratio}(D_i) = \frac{h_\#}{d_\#} \]

where HRratio(Di) is the hit-to-registration ratio for domain Di, d# is the number of days from the date of registration of domain Di until today, and h# is the number of hits received by the domain in the time frame Tf.

Considering a time frame of 15 minutes, a traffic threshold Tt is obtained for each time frame. If any unpopular domain Di receives more than Tt hits in any time frame, we flag it as 1, else 0.

The Shannon entropy is

\[ H(X) = -\sum_{i} p(x_i) \log p(x_i) \]

where p(xi) is the probability of the ith outcome of X [19].
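As an illustration of these per-domain features, the sketch below aggregates hits into 15-minute frames and computes a hit-to-registration ratio, a threshold flag and a Shannon entropy. Several details are assumptions rather than the authors' choices: the entropy is taken over the characters of the domain name with a base-2 logarithm, the domain age is clamped to at least one day, the threshold value Tt is passed in by the caller, and all function names are hypothetical.

```python
import math
from collections import Counter
from datetime import date

FRAME_SECONDS = 15 * 60  # 15-minute time frame Tf


def hits_per_frame(events):
    """Count hits per (frame index, domain) from (epoch_seconds, domain) pairs."""
    counts = Counter()
    for epoch_seconds, domain in events:
        counts[(epoch_seconds // FRAME_SECONDS, domain)] += 1
    return counts


def hr_ratio(hits: int, registered: date, today: date) -> float:
    """Hits in the frame divided by days since registration.

    Assumption: the age is clamped to one day to avoid dividing by zero.
    """
    days = max((today - registered).days, 1)
    return hits / days


def threshold_flag(hits: int, tt: float) -> int:
    """Return 1 if the hits in a frame exceed the traffic threshold Tt, else 0."""
    return 1 if hits > tt else 0


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits of the character distribution of `text` (assumed choice of X)."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical events: four hits on a.example across two frames, one hit on b.example.
events = [(0, "a.example"), (10, "a.example"), (899, "a.example"),
          (900, "a.example"), (5, "b.example")]
counts = hits_per_frame(events)
ratio = hr_ratio(300, date(2016, 4, 1), date(2016, 4, 5))
flag = threshold_flag(300, 100)
entropy = shannon_entropy("aabb")
```

Together with the TTL of the domain name, values like these would form the feature vector passed to the classifier.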
Ensemble based dynamic reputation is then computed for the domains classified as suspicious by the J48 classifier. Given a domain name or an IP address, we assign it a reputation score Ri. The score is based on a three-level check that spans a Passive DNS Intelligence source [8, 9, 28], a Malware Knowledge Base and a Whois database. Figure 2 represents the dynamic reputation service employed in this work.
For a domain, a query is sent to retrieve the following information: registrant name, address, email and phone; registrar name, address, email and phone; administrator name, address, email and phone; and the domain registration date, expiration date, authoritative name server, etc.
In this research we investigate the application of statistical analysis, supervised machine learning and ensemble based reputation for detecting domains receiving anomalous surge traffic in our network. For this experiment, a DNS data set of 40,000 records with 4 features was prepared, covering a time frame of 6 months from October 2015 to April 2016. The features used are the hit-to-registration ratio, the traffic threshold, the Shannon entropy and the time-to-live of a domain name. The data set contains two classes of data: benign and malicious. The records are evenly distributed between the two classes and are classified using the J48 decision tree classifier with 10-fold cross validation.
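The text does not state how the three checks are combined into Ri, so the following is only a sketch under an assumed rule: each source returns a score between 0 and 1 (1 meaning most suspicious) and the ensemble score is their weighted average. The lookup tables, weights and the name `ensemble_reputation` are all hypothetical stand-ins for the real Passive DNS, malware knowledge base and Whois services.

```python
from typing import Callable, Dict

# Each checker maps a domain name or IP address to a score in [0, 1].
Checker = Callable[[str], float]


def ensemble_reputation(indicator: str,
                        checkers: Dict[str, Checker],
                        weights: Dict[str, float]) -> float:
    """Weighted average of the per-source scores (assumed combination rule)."""
    total_weight = sum(weights[name] for name in checkers)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[name] * check(indicator) for name, check in checkers.items())
    return weighted / total_weight


# Hypothetical stand-ins for the three data sources; unknown indicators score 0.0.
PDNS_SCORES = {"suspicious.example": 0.75}
MALWARE_KB_SCORES = {"suspicious.example": 1.0}
WHOIS_SCORES = {"suspicious.example": 0.5}

checkers = {
    "passive_dns": lambda d: PDNS_SCORES.get(d, 0.0),
    "malware_kb": lambda d: MALWARE_KB_SCORES.get(d, 0.0),
    "whois": lambda d: WHOIS_SCORES.get(d, 0.0),
}
weights = {"passive_dns": 1.0, "malware_kb": 1.0, "whois": 1.0}

score = ensemble_reputation("suspicious.example", checkers, weights)
unknown = ensemble_reputation("benign.example", checkers, weights)
```

Keeping each source behind a common callable makes it straightforward to add or reweight sources without changing the scoring function.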
As we implemented each feature, we compared the effectiveness of the classifier with its previous iteration using the following metrics: F1-score, accuracy, false-negative rate (FNR) and false-positive rate (FPR). The F1-score is defined as:

\[ F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \]

The experimental results are shown in Tables 1, 2 and 3 below.
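For reference, all four metrics can be derived from the confusion-matrix counts of a binary classifier. The sketch below uses the standard definitions, treating the malicious class as positive; the counts in the example are made up for illustration and are not the paper's results, and the function name is hypothetical.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute F1-score, accuracy, FPR and FNR from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    total = tp + fp + tn + fn
    return {
        "f1": f1,
        "accuracy": (tp + tn) / total if total else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,  # benign domains flagged as malicious
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,  # malicious domains that were missed
    }


# Illustrative counts only (not taken from the experiments).
metrics = classification_metrics(tp=90, fp=10, tn=90, fn=10)
```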
Figure 3 shows the hits received by an anomalous domain ‘bjyxjy.com.cn’ on 5th April, 2016. The attack started at 16:24:10 (4:24 PM) and was active until 23:03:00 (11:03 PM). The hits received by ‘google.com’ during the same time frame are shown in Fig. 4.
The first quartile, third quartile and interquartile range of the hits received by ‘bjyxjy.com.cn’ were calculated along with other descriptive statistics, which are provided in Table 4.
The system was deployed in a Tier 1 ISP from 2nd April, 2016, where it detected and raised alerts for many anomalous activities in the network.
In this paper we propose a lightweight framework for detecting domains that receive anomalous traffic over a short period, with the help of supervised machine learning algorithms and an ensemble based reputation system. The system is deployed in a Tier 1 Internet service provider network, where it has been shown to analyze several thousand DNS logs per second. The proposed system is highly scalable, able to handle the data load of a Tier 1 ISP, and capable of detecting anomalous events occurring in a network with a low false positive rate of 0.064.
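Descriptive statistics of this kind can be computed with Python's standard library, as in the minimal sketch below. The per-frame hit counts are placeholder values rather than the measurements for ‘bjyxjy.com.cn’, the choice of the "inclusive" quantile method is an assumption, and `quartile_summary` is a hypothetical helper name.

```python
import statistics


def quartile_summary(hits):
    """Quartiles, interquartile range and a few other descriptive statistics."""
    q1, median, q3 = statistics.quantiles(hits, n=4, method="inclusive")
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,  # interquartile range
        "mean": statistics.fmean(hits),
        "min": min(hits),
        "max": max(hits),
    }


# Placeholder hit counts per time frame (not the paper's data).
summary = quartile_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
```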
