Identifying Internet background radiation traffic based on traffic source distribution

Abstract

Identifying Internet Background Radiation (IBR) traffic is important for detecting malicious behavior. Traditional methods depend on demanding conditions, such as full bi-directional traffic or unassigned IP address space, so this paper presents a novel IBR traffic identification method. We first explored the traffic source distribution of each destination IP in a traffic dataset and found that the traffic sources of active IPs are relatively certain, whereas those of inactive IPs are relatively uncertain. Based on this observation, we present a method to identify IBR traffic. It uses a proposed metric to evaluate the certainty of the traffic sources of a destination IP in order to identify inactive IPs, and then detects IBR traffic using heuristics derived from malicious traffic behavior patterns. We carried out several experiments on real traffic datasets, and the results show that the method obtains 99% precision and a 0.1% omission rate in detecting IPv4 IBR traffic. The detected IBR traffic includes traffic sent to assigned IPs as well as unassigned IPs, which makes it more valuable and practical for detecting malicious traffic in real networks.

Keywords

Internet background radiation traffic; malicious behavior; information entropy; network management

1. Introduction

Monitoring and analyzing Internet traffic is an important foundation for guarding against cyber threats and ensuring Quality of Service (QoS) in large networks [21]. Internet Background Radiation (IBR) traffic [19] generally refers to the “unsolicited” traffic that is passively received by hosts, such as network scanning traffic, backscatter traffic and misconfiguration traffic [23]. Scanning traffic is usually sent by an infected host that tries to find vulnerable hosts. Backscatter traffic is generally generated by denial-of-service attacks. Misconfiguration traffic results from erroneous settings or from software/hardware faults. As a special kind of traffic, IBR traffic has been applied in the network security field [17], e.g. for analyzing intrusion characteristics or designing intrusion detection systems.

Three categories of methods have been presented to collect IBR traffic. (1) Darknets use a dedicated acquisition machine to collect the traffic sent to preset unassigned IPs [18]. (2) A honeypot [14] is a special kind of host system that lures intruders into attacking it and collects the generated traffic. (3) One-way detection methods filter out one-directional flows from a full bi-directional flow dataset by checking whether each flow has an inverse flow. However, these methods rely on demanding conditions, such as unassigned address space for darknets or full bi-directional flows for one-way detection, and they are only able to collect a limited amount of IBR traffic. To relax these restrictions, this paper explores traffic data characteristics and presents a novel method to identify IBR traffic. We consider the IP addresses inside the campus network as destination IPs and those outside as source IPs. The main work of this paper includes the following three aspects:

We explore the traffic source distribution of each destination IP in the campus traffic data and find that (1) some IBR traffic is sent to inactive IPs and (2) the traffic sources of inactive IPs are mostly close to uniformly distributed. In addition, a significant property of active hosts is that their traffic sources are relatively certain. These observations led us to propose a novel method named IBR_TSD (IBR traffic identification based on Traffic Source Distribution), which is based on information entropy theory.

We carry out a series of experiments on IPv4 traffic datasets to validate the feasibility and effectiveness of IBR_TSD. One traffic trace was collected by port mirroring and another comes from a NetFlow [4] dataset. Experimental results show that the proposed technique obtains about 99% precision and a 0.1% omission rate in identifying IBR traffic. We also summarize the source/destination port numbers of the collected IBR traffic and find that well-known port scanning traffic can be identified by IBR_TSD.

On the other hand, the experimental results on IPv6 datasets show that IBR_TSD only obtains an 83.5% detection rate and 21.9% precision in identifying IBR traffic. We further find that the traffic source distribution of inactive IPs in the IPv6 datasets does not satisfy the assumption of IBR_TSD. This suggests that IBR_TSD is more suitable for detecting IPv4 IBR traffic generated by malicious behavior.

Our approach can identify IBR traffic from traffic data without packet payloads. Moreover, the identified IBR traffic includes the packets sent to unassigned IPs as well as those sent to unknown active IPs, which are more valuable to attackers. In contrast, most traditional methods are only able to collect the traffic sent to preset unassigned IPs. Compared with our previous study [22], the new contributions include: (1) building heuristics based on the host communication patterns of malicious traffic, which are used to improve the IBR traffic identification method; (2) analyzing the traffic source distribution of campus network traffic, which indicates that the traffic sources of active IPs differ from those of inactive IPs; (3) presenting a method to build the ground truth for inactive IPs and IBR traffic, and conducting experiments to discuss the parameter of IBR_TSD and to evaluate its precision and omission rate in detecting IBR traffic; (4) statistically summarizing the source and destination port numbers of the identified IBR traffic; (5) evaluating the detection rate and precision of IBR_TSD in filtering inactive IPs and IBR traffic on pure IPv6 datasets, and discussing the characteristics of IPv6 IBR traffic and the suitable usage context of IBR_TSD.

This paper is organized as follows. Section 1 gives an overview of the background and contributions of this paper. Section 2 summarizes related work on IBR traffic identification methods, the applications of IBR traffic, and attack behavior detection methods. Section 3 first analyzes the traffic sources of active and inactive IPs, and then presents the basic theory and the details of our method. Section 4 introduces the experimental datasets and reports a series of experiments that evaluate the performance of our approach in identifying IBR traffic. Section 5 concludes this paper.

2. Related work

A variety of monitoring systems have been presented to collect IBR traffic, such as network telescopes [18], Internet sinks [26], black holes [1], darknets [2] and greynets [10,16]. However, these methods either rely closely on preset unassigned IPs, as telescopes do, or require collecting full bi-directional traffic, as greynets do. A one-way flow is a flow that does not have a reverse flow. In fact, all IBR traffic flows are one-way flows. Glatz et al. [9] presented a scheme to detect one-way traffic in live networks and to classify it into several useful classes, namely service unreachable, malicious scanning, benign P2P, backscatter, suspected benign, bogon and others. However, all bi-directional traffic flows have to be collected in order to filter out one-way flows.

Pang et al. [19] systematically analyzed IBR traffic characteristics from different perspectives. Their analysis of destination port numbers reveals that the vast majority of IBR traffic targets vulnerable services. They grouped the components of background radiation by protocol, application and, often, specific exploit, and also studied the temporal patterns and correlated behavior of IBR traffic. Wustrow et al. [23] investigated five years of IBR traffic in terms of time and space. They found that the traffic from the exploit ports is decreasing and that the trend is towards increasing SYN and decreasing SYN-ACK traffic. On examining the traffic across address blocks, they noted that significant differences exist and that these differences are clustered in a handful of network blocks. Zseby et al. [27] investigated entropy-based metrics for IP darkspace traffic classification. Iglesias et al. [12] proposed a method based on entropy measures of traffic-flow features to distinguish different attack types observed in IBR traffic. In [11], a clustering method was applied to the analysis of IP darkspace sources, and it was found that about 75% of the darkspace IP sources contribute to a set of very stable clusters, 4% to less stable clusters and 21% to outliers. Zakir et al. [8] found that large horizontal scanning is common and that attackers use new scanning tools and technology to reduce the burden of finding vulnerabilities.

In essence, IBR traffic is a kind of harmful traffic. Besides its applications in security, it can reflect other valuable information [3,6]. Dainotti et al. [6] analyzed the impact of political and geographical events on the Internet with the aid of IBR traffic. A sudden decrease of IBR traffic in an area indicates that network services may be blocked or interrupted. It also implies that a significant political event or geographical disaster has happened in this area. They mainly monitored the IBR traffic on a backbone network and used special configurations to collect it.

Most IBR traffic is generated by malicious attacks. Some papers detect attacks from the perspective of communication behavior. Karagiannis et al. [13] summarized host communication patterns at the transport layer and presented a method to identify attacks. Steven et al. [20] proposed a method named ASMG (Attack Segmentation and Model Generation) to aggregate similar attack behaviors and provide situational awareness by grouping relevant traffic. Xu et al. [24,25] built traffic behavior patterns based on information entropy theory. They defined four dimensions $\{srcIP, srcPort, dstIP, dstPort\}$. For each dimension, the relative uncertainty (RU) values of the other three dimensions are computed, and attack behavior is defined according to the RU values. This method can effectively distinguish attack traffic, but it only considers productive traffic and ignores IBR traffic. Their experimental results show that the traffic sources of active hosts are imbalanced, i.e. most traffic received by a host is sent by a small number of source hosts. This suggests that the traffic sources of active hosts are relatively certain. Our work is inspired by this observation and is in effect a reverse extension of [24].

3. IBR_TSD method

3.1. Network traffic analysis

This section mainly analyzes the traffic source distributions of destination IPs. In the study of network traffic, the object of analysis is usually the flow. A flow consists of the packets that share the same five-tuple $\{srcIP, srcPort, dstIP, dstPort, transport protocol\}$. NetFlow and sFlow are the most popular flow types [15]. NetFlow is the de facto industry standard used for accounting, traffic engineering and anomaly detection [7], since it can be collected at high speed with low disk demand. Internet Service Providers generally use inflow traffic for accounting, and inflow data is more comprehensive and more easily acquired. Hence, inbound NetFlow data are used for traffic analysis in this paper.
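As a minimal illustration of the flow definition above, the following Python sketch groups packets that share the same five-tuple into flows and accumulates their packet and byte counts. The `Packet` type, its field names and the example addresses are hypothetical; real NetFlow exporters additionally apply timeouts and sampling, which are not modeled here.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical packet summary; only header fields, no payload."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str
    size: int  # packet size in bytes


def aggregate_flows(packets):
    """Group packets by five-tuple and return {five_tuple: (n_packets, n_bytes)}."""
    flows = defaultdict(lambda: [0, 0])
    for p in packets:
        key = (p.src_ip, p.src_port, p.dst_ip, p.dst_port, p.proto)
        flows[key][0] += 1       # one more packet in this flow
        flows[key][1] += p.size  # accumulate bytes
    return {key: tuple(counts) for key, counts in flows.items()}


packets = [
    Packet("198.51.100.7", 40000, "192.0.2.10", 80, "TCP", 60),
    Packet("198.51.100.7", 40000, "192.0.2.10", 80, "TCP", 1500),
    Packet("198.51.100.9", 6000, "192.0.2.10", 1433, "TCP", 40),
]
flows = aggregate_flows(packets)
```

In this example the first two packets share a five-tuple and therefore form a single flow, while the third packet forms a flow of its own.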

NetFlow data were collected in our campus network in 2012 [22]. The distribution of the source IPs of the traffic received by a destination IP is shown in Fig. 1, where the x-axis represents the index of the source IP and the y-axis represents the bytes sent by each source IP. When the destination IP is inactive (Fig. 1(a)), the bytes sent by the individual source IPs do not differ significantly, and the distribution is close to uniform. When the destination IP is active (Fig. 1(b)), even though there are more than 6000 source IPs, the majority of bytes (more than 10⁷) are sent by a small number of source IPs (about 14.8 IPs on average).

Fig. 1.

Traffic source distributions of inactive IPs and active IPs on traffic data collected in campus network. (a) Distribution of the traffic received by inactive IPs. (b) Distribution of the traffic received by active IPs.

The results in Fig. 1 show that the traffic sources of inactive hosts are relatively uncertain. Since inactive hosts do not solicit any traffic, the traffic sent to them must be IBR traffic. Inspired by this, this paper proposes a novel approach to detect IBR traffic based on traffic source distribution.

3.2. Basic theory

Information entropy is generally used to evaluate the uncertainty of a random variable (denoted as X). Assume that X takes $N_{x}$ discrete values, that $x_{i}$ is the ith value of X, and that $p (x_{i})$ denotes the probability of $x_{i}$. If X is observed m times, then $p (x_{i}) = n_{i} / m$, where $n_{i}$ is the number of times that the value $x_{i}$ is observed. The entropy of X is then calculated as in Eq. (1),

$H (X) = - \sum_{x_{i} \in X} p (x_{i}) \log p (x_{i}) .$ (1)

The value range of $H (X)$ is $[0, H_{\max} (X)]$, where $H_{\max} (X) = \log \min \{N_{x}, m\}$ with the same logarithm base as in Eq. (1) (e.g. base 2). $H (X)$ is a function related to $N_{x}$ and m. In order to compare the uncertainty of different variables, this paper applies the relative uncertainty (RU) [24]. It is defined as in Eq. (2), where the denominator is used for normalization. The value range of RU is $[0, 1]$,

$RU (X) = \frac{H (X)}{H_{\max} (X)} = \frac{- \sum_{x_{i} \in X} p (x_{i}) \log p (x_{i})}{\log \min \{N_{x}, m\}} .$ (2)

According to Eq. (2), $RU (X) = 0$ if and only if all m observations take the same value, which means that X is totally certain. When $N_{x} < m$, $RU (X) = 1$ if and only if $p (x_{i}) = 1 / N_{x}$ (i.e. $n_{i} = m / N_{x}$), which means that the values of X are uniformly distributed and X has the highest degree of uncertainty. When $N_{x} > m$, $RU (X) = 1$ if and only if $p (x_{i}) = 1 / m$ for every observed value (i.e. $n_{i} = 1$), which means that all observed values of X are different. The larger the RU value is, the more uncertain the variable is.
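The two definitions can be made concrete with a short Python sketch. It takes the observation counts $n_{i}$ as input and, as an assumption of this sketch, uses the number of distinct observed values for $N_{x}$ unless another value is passed; it also returns 0 when $H_{\max}$ would be zero, a convention that is not stated in the text. The function name `relative_uncertainty` is illustrative.

```python
import math


def relative_uncertainty(counts, n_x=None):
    """RU(X) = H(X) / H_max(X), with H_max(X) = log2(min(N_x, m)).

    counts: observation counts n_i of each value x_i.
    n_x:    number of values X can take; defaults to the number of
            distinct observed values (an assumption of this sketch).
    """
    counts = [c for c in counts if c > 0]
    m = sum(counts)                      # total number of observations
    if n_x is None:
        n_x = len(counts)
    limit = min(n_x, m)
    if limit <= 1:
        return 0.0                       # a single value is fully certain (assumed convention)
    entropy = -sum((c / m) * math.log2(c / m) for c in counts)
    return entropy / math.log2(limit)


uniform = relative_uncertainty([5, 5, 5, 5])   # every value equally likely
skewed = relative_uncertainty([97, 1, 1, 1])   # one dominant value
single = relative_uncertainty([10])            # only one value observed
```

The uniform case reaches the maximum RU of 1, the skewed case yields a small RU, and the single-value case is treated as fully certain.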

3.3. Proposed approach

The main idea of our approach is to identify the inactive IPs and then to distinguish the IBR traffic sent by the source IPs of these inactive IPs. It mainly includes three steps, as shown in Fig. 2: (1) identifying inactive IPs, (2) marking potentially malicious source IPs, and (3) identifying IBR traffic. The relationship among normal IPs, inactive IPs, malicious IPs, normal traffic and IBR traffic is also shown in Fig. 2. All source IPs contacting an inactive IP are labeled as malicious IPs. All traffic that is sent by a malicious IP and matches some attack behavior pattern is marked as IBR traffic. The details of our approach are introduced as follows.

Fig. 2.

The main steps of identifying IBR traffic and relationship among different IPs.

A metric to evaluate the certainty of traffic sources

According to Fig. 1, the traffic sources of inactive IPs are mostly relatively uncertain. In order to identify inactive IPs, this section proposes a metric called USIP (uncertainty of source IP) to evaluate the uncertainty of the traffic sources associated with a destination IP. The notation used is as follows.

Let $SN = \{{SN}_{i}, i = 1, \dots, n\}$ denote the subnetworks that access the Internet, where $n > 1$ and ${SN}_{i}$ stands for the globally routable IP address space of subnetwork i.

Let $inactiveIP ({SN}_{i}, T)$ denote the inactive IPs in ${SN}_{i}$ space during time interval T.

Let $SourceIP (destIP, T)$ denote the source IPs that send traffic data to a destIP during time interval T.

$S$: the set of source IPs of the traffic received by destIP, i.e. $S = SourceIP (destIP, T)$

$s_{i}$: the ith source IP in S

$F_{i}$: the number of flows sent by $s_{i}$

$P_{i}$: the number of packets sent by $s_{i}$

$B_{i}$: the number of bytes sent by $s_{i}$

$T_{flow}$: the total number of flows sent by all IPs in S

$T_{pkt}$: the total number of packets sent by all IPs in S

$T_{byt}$: the total number of bytes sent by all IPs in S

An IP in $inactiveIP ({SN}_{i}, T)$ is generally defined as an IP that does not access the Internet, such as a subnet IP, broadcast IP, gateway IP or unassigned IP.

USIP is defined in Eq. (3), based on the RU introduced in Section 3.2. It is the ratio between the entropy of S and the maximum entropy of S. $p (s_{i})$ is the probability of a source IP $s_{i}$. It can be evaluated at three granularities: the number of flows (#flows), the number of packets (#packets) and the number of bytes (#bytes). Let $TrafficSource (SourceIP (destIP, T))$ denote the #packets $\{\# {pkt}_{1}, \# {pkt}_{2}, \dots, \# {pkt}_{k}\}$, the #bytes $\{\# {byt}_{1}, \# {byt}_{2}, \dots, \# {byt}_{k}\}$ or the #flows $\{\# {flow}_{1}, \# {flow}_{2}, \dots, \# {flow}_{k}\}$ generated by each source IP of destIP during time interval T. Assume that there are k source IPs for a destIP in $SourceIP (destIP, T)$, denoted as ${SIP}_{1}, \dots, {SIP}_{k}$, and that the #flows/#packets/#bytes sent by them are denoted as $V_{1}, V_{2}, \dots, V_{k}$, respectively. Then the distribution of TrafficSource of a destIP is defined as in Eq. (4). $p (s_{i})$ is defined in Eq. (5), and its value depends on the granularity. Similarly, m is the total number of times that S is observed; its value also depends on the granularity and is defined in Eq. (6).

$USIP (S) = H (S) / H_{\max} (S), \quad H (S) = - \sum_{s_{i} \in S} p (s_{i}) \log p (s_{i}), \quad H_{\max} (S) = \log \min \{N_{s}, m\},$ (3)

$\left\{ \frac{V_{1}}{\sum_{j = 1}^{k} V_{j}}, \frac{V_{2}}{\sum_{j = 1}^{k} V_{j}}, \dots, \frac{V_{k}}{\sum_{j = 1}^{k} V_{j}} \right\},$ (4)

$p (s_{i}) = \begin{cases} F_{i} / T_{flow}, & \text{granularity is \#flows}, \\ P_{i} / T_{pkt}, & \text{granularity is \#packets}, \\ B_{i} / T_{byt}, & \text{granularity is \#bytes}, \end{cases}$ (5)

$N_{s} = | S |, \quad m = \begin{cases} T_{flow}, & \text{granularity is \#flows}, \\ T_{pkt}, & \text{granularity is \#packets}, \\ T_{byt}, & \text{granularity is \#bytes}. \end{cases}$ (6)

The value range of $USIP (S)$ is $[0, 1]$. The larger the value of $USIP (S)$ is, the more uncertain S is. Because different flows generally carry different numbers of bytes, #flows cannot reflect flow size information. For example, it cannot distinguish mice flows from elephant flows, whereas malicious flows are typically mice flows with a small number of packets/bytes. The other two granularities (#packets and #bytes) are therefore more suitable for evaluating the uncertainty of TrafficSource.
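To show how the granularity affects the metric, the following Python sketch computes USIP for each destination IP from a list of flow records. The record layout `(src_ip, dst_ip, packets, bytes)`, the function names and the example addresses are assumptions made for illustration; returning 0 for a destination with a single source is also an assumed convention, since $H_{\max}$ is zero in that case.

```python
import math
from collections import defaultdict


def usip(volumes):
    """USIP(S) = H(S) / log2(min(N_s, m)) for per-source volumes V_1..V_k."""
    volumes = [v for v in volumes if v > 0]
    total = sum(volumes)                       # m in Eq. (6)
    limit = min(len(volumes), total)           # min{N_s, m}
    if limit <= 1:
        return 0.0                             # assumed convention for a single source
    entropy = -sum((v / total) * math.log2(v / total) for v in volumes)
    return entropy / math.log2(limit)


def usip_per_destination(flows, granularity="bytes"):
    """flows: iterable of (src_ip, dst_ip, packets, bytes), one tuple per flow."""
    per_dst = defaultdict(lambda: defaultdict(int))
    for src, dst, pkts, byts in flows:
        value = {"flows": 1, "packets": pkts, "bytes": byts}[granularity]
        per_dst[dst][src] += value
    return {dst: usip(srcs.values()) for dst, srcs in per_dst.items()}


flows = [
    # Destination that only receives small probes from four sources.
    ("203.0.113.1", "192.0.2.9", 1, 40),
    ("203.0.113.2", "192.0.2.9", 1, 40),
    ("203.0.113.3", "192.0.2.9", 1, 40),
    ("203.0.113.4", "192.0.2.9", 1, 40),
    # Destination with one dominant source and three small probes.
    ("198.51.100.80", "192.0.2.5", 700, 1_000_000),
    ("203.0.113.1", "192.0.2.5", 1, 40),
    ("203.0.113.2", "192.0.2.5", 1, 40),
    ("203.0.113.3", "192.0.2.5", 1, 40),
]
by_bytes = usip_per_destination(flows, "bytes")
by_flows = usip_per_destination(flows, "flows")
```

With #bytes the second destination obtains a USIP close to 0, whereas with #flows every source contributes exactly one flow and both destinations look uniform, which illustrates why #flows is the least informative granularity.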

The way for checking uniform distribution

Given the metric USIP, the next problem is how to judge whether the traffic source distribution is close to uniform. Our solution is inspired by the algorithm in [24] for extracting the significant subset from a set. For the set S of an active IP, the USIP value is small and the TrafficSources are relatively certain, so a significant subset of S (i.e. the source IPs sending a dominant number of packets/bytes to the destIP) exists. To extract the significant subset from S, significance is defined as follows. A subset D of S contains the most significant values of S if D is the smallest subset of S that satisfies the following conditions: (1) the probability of any value in D is larger than that of the remaining values (denoted by R, with $R = S - D$), and (2) the probability distribution of R is close to uniform, i.e., $USIP (R) > θ$. Intuitively, D contains the most significant values in S, while the values in R are nearly indistinguishable from each other.

When this method is used to extract the significant subset D from S for each destIP in the inflow data, the significant subsets of some destIPs are empty. In other words, the probability distribution on S itself is close to uniform. Through further examination, we found that such destIPs are inactive. Consequently, our solution, named threshold-cutting, simply checks whether $USIP (S) ⩾ θ$. If so, S is considered close to uniform.
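The relation between the significant subset and threshold-cutting can be sketched in Python as follows. This is a simplified greedy variant written for illustration, not necessarily the algorithm of [24]: sources are ranked by volume and removed one by one until the remainder R satisfies $USIP (R) ⩾ θ$, with the probabilities renormalized on R (an assumption of this sketch). The helper names and example volumes are hypothetical.

```python
import math


def usip(volumes):
    """USIP of a list of per-source volumes; 0.0 for fewer than two sources (assumed)."""
    volumes = [v for v in volumes if v > 0]
    total = sum(volumes)
    limit = min(len(volumes), total)
    if limit <= 1:
        return 0.0
    entropy = -sum((v / total) * math.log2(v / total) for v in volumes)
    return entropy / math.log2(limit)


def significant_subset(volumes, theta=0.9):
    """Return (D, R): the smallest top-ranked prefix D whose remainder R is near-uniform.

    volumes: {source_ip: bytes (or packets) sent to the destination}.
    """
    ranked = sorted(volumes, key=volumes.get, reverse=True)
    for cut in range(len(ranked)):
        rest = ranked[cut:]
        if usip([volumes[s] for s in rest]) >= theta:
            return ranked[:cut], rest
    return ranked, []   # no near-uniform remainder exists


def is_inactive(volumes, theta=0.9):
    """Threshold-cutting: an empty significant subset is equivalent to USIP(S) >= theta."""
    return usip(list(volumes.values())) >= theta


active = {"a": 1_000_000, "b": 40, "c": 40, "d": 44}
probed = {"a": 40, "b": 40, "c": 44, "d": 40}
d_active, r_active = significant_subset(active)
d_probed, r_probed = significant_subset(probed)
```

For the first destination the dominant source forms the significant subset, while for the second destination the subset is empty, so threshold-cutting would mark it as inactive.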

A novel method to identify IBR traffic

Based on the metric USIP and the threshold-cutting solution, two naive assumptions are presented to detect IBR traffic.

Assumption 1.

If the distribution of TrafficSources of a destIP is uniform, then the destIP is inactive.

Assumption 2.

If an IP is inactive, all its source IPs are potentially malicious.

IBR traffic includes the traffic received by inactive IPs and the traffic that is sent by malicious IPs and matches attack behavior. The pseudo-code of our approach, named IBR_TSD, is shown in Fig. 3. First, the inflow data is grouped by destination IP (denoted by $\{F_{i}, i = 1, 2, \dots\}$), so that all traffic in each $F_{i}$ has the same destination IP. Second, each $F_{i}$ is grouped by source IP (denoted by $\{F_{i}^{(j)}, j = 1, 2, \dots\}$), so that all traffic in each $F_{i}^{(j)}$ has the same source IP and destination IP. The algorithm then calculates the number of bytes in every $F_{i}^{(j)}$, computes the USIP value of TrafficSource on $F_{i}$ by #bytes, and checks whether the USIP value reaches the threshold θ. If it does, all traffic of $F_{i}$ is labeled as IBR traffic and the source IPs in $F_{i}$ are labeled as malicious IPs; otherwise, all traffic of $F_{i}$ is added to set G. Finally, each flow of G is examined: if its source IP is malicious and the flow satisfies one of the following heuristics, the flow is labeled as IBR traffic.

Fig. 3.

Pseudo-code of identifying algorithm.

Fig. 4.

Three attack traffic behavior patterns summarized by [13].

In our algorithm, the heuristics aim to determine whether a traffic flow matches an attack behavior pattern. Based on the exploration results in [13], three attack behavior patterns are shown in Fig. 4, and a heuristic is built for each pattern. Each of the following expressions has the form $⟨ # SIP, # SPort, # DIP, # DPort ⟩$, where ${SIP}_{1}$, ${SPort}_{1}$, ${DIP}_{1}$ and ${DPort}_{1}$ respectively represent one source IP, one source port, one destination IP and one destination port, and $*$ stands for multiple values. Pattern A is attack behavior. Pattern B is similar and belongs to the fixed-port scanning mode. Pattern C disguises the traffic as a well-known service, such as a web service; hence we additionally need to check the number of packets in Heuristic 3.

Heuristic 1.

A source IP using multiple source port numbers sends packets to a destination port number of multiple destination IPs. It is expressed as $⟨ {SIP}_{1}, *, *, {DPort}_{1} ⟩$ .

Heuristic 2.

A source IP using a source port number sends packets to a destination port of multiple destination IPs. It is expressed as $⟨ {SIP}_{1}, {SPort}_{1}, *, {DPort}_{1} ⟩$ .

Heuristic 3.

A source IP using a source port number sends packets to multiple destination port numbers of multiple destination IPs, and the number of packets is smaller than 4. It is expressed as $⟨ {SIP}_{1}, {SPort}_{1}, *, * ⟩ \land T_{pkt} < 4$.
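The grouping procedure of IBR_TSD and the three heuristics can be sketched together in Python. This is an illustrative reading of the description, not the pseudo-code of Fig. 3: the `Flow` record, the function names and the example addresses are hypothetical, "multiple" is interpreted as "more than one distinct value", the fan-out counts are computed over all flows of a source, and the packet limit of Heuristic 3 is applied per flow. All of these are assumptions.

```python
import math
from collections import defaultdict, namedtuple

# Hypothetical flow record; real NetFlow records carry more fields.
Flow = namedtuple("Flow", "sip sport dip dport packets bytes")


def usip(volumes):
    """USIP of per-source volumes; 0.0 for fewer than two sources (assumed convention)."""
    volumes = [v for v in volumes if v > 0]
    total = sum(volumes)
    limit = min(len(volumes), total)
    if limit <= 1:
        return 0.0
    entropy = -sum(v / total * math.log2(v / total) for v in volumes)
    return entropy / math.log2(limit)


def ibr_tsd(flows, theta=0.9):
    """Return (inactive IPs, malicious IPs, indices of flows labeled as IBR)."""
    # Step 1: group by destination IP and sum the bytes sent by each source IP.
    bytes_by_dst = defaultdict(lambda: defaultdict(int))
    for f in flows:
        bytes_by_dst[f.dip][f.sip] += f.bytes
    inactive = {d for d, srcs in bytes_by_dst.items() if usip(srcs.values()) >= theta}

    # Step 2: every source contacting an inactive IP is potentially malicious,
    # and all traffic received by inactive IPs is IBR traffic.
    malicious = {f.sip for f in flows if f.dip in inactive}
    ibr = {i for i, f in enumerate(flows) if f.dip in inactive}

    # Fan-out statistics for the heuristics (over all flows: an assumption).
    dips_h1, sports_h1 = defaultdict(set), defaultdict(set)   # key: (sip, dport)
    dips_h2 = defaultdict(set)                                # key: (sip, sport, dport)
    dips_h3, dports_h3 = defaultdict(set), defaultdict(set)   # key: (sip, sport)
    for f in flows:
        dips_h1[f.sip, f.dport].add(f.dip)
        sports_h1[f.sip, f.dport].add(f.sport)
        dips_h2[f.sip, f.sport, f.dport].add(f.dip)
        dips_h3[f.sip, f.sport].add(f.dip)
        dports_h3[f.sip, f.sport].add(f.dport)

    # Step 3: flows in set G from malicious sources that match a heuristic.
    for i, f in enumerate(flows):
        if f.dip in inactive or f.sip not in malicious:
            continue
        h1 = len(dips_h1[f.sip, f.dport]) > 1 and len(sports_h1[f.sip, f.dport]) > 1
        h2 = len(dips_h2[f.sip, f.sport, f.dport]) > 1
        h3 = (len(dips_h3[f.sip, f.sport]) > 1
              and len(dports_h3[f.sip, f.sport]) > 1
              and f.packets < 4)
        if h1 or h2 or h3:
            ibr.add(i)
    return inactive, malicious, ibr


flows = [
    Flow("203.0.113.1", 6000, "192.0.2.9", 1433, 1, 40),
    Flow("203.0.113.2", 6000, "192.0.2.9", 1433, 1, 40),
    Flow("203.0.113.3", 6000, "192.0.2.9", 3389, 1, 40),
    Flow("203.0.113.4", 6000, "192.0.2.9", 22, 1, 40),
    Flow("203.0.113.1", 6000, "192.0.2.5", 1433, 1, 40),        # same scan, active target
    Flow("198.51.100.80", 443, "192.0.2.5", 51000, 700, 1_000_000),
]
inactive, malicious, ibr = ibr_tsd(flows)
```

In this toy trace the fifth flow reaches an active IP but comes from a source that also probed an inactive IP with the same source and destination ports, so it matches Heuristic 2 and is labeled as IBR traffic, while the large flow from the unrelated source is not.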

4. Experimental results and analysis

4.1. The method of building ground truth

This section mainly evaluates the performance of our approach, i.e. the accuracy of identifying inactive IPs and IBR traffic. To do so, the ground truth of inactive IPs and IBR traffic must first be built. An inactive IP is routable but does not access the Internet, for example because the IP is not assigned to a user, or because the user's host is turned off or does not access the Internet. A strict way of building the ground truth for inactive IPs is to inspect the packet payload contents. However, this requires access to packet payloads, which is complex and raises privacy concerns. Instead, a conservative method is presented to build the ground truth of inactive IPs and IBR traffic. The steps are as follows.

(1) When the transport protocol is TCP, for a flow received by a destIP, if there is no inverse packet, the flow is labeled as IBR traffic. This is the situation in Fig. 5(a). If there are inverse packets, we check the flags of the in-packets and out-packets. If the SYN and ACK flags of an in-packet are set and the RST flag of the corresponding out-packet is set, we label the in-flow as IBR traffic. This is the situation in Fig. 5(b), and such flows may be generated by backscatter. If the SYN flag of an in-packet is set and the SYN and ACK flags of the corresponding out-packet are set, but the RST flag of the following in-packet is set (or there is no following in-packet), the in-flow is labeled as IBR traffic. This is the situation in Fig. 5(c), which corresponds to an abnormal TCP connection. If the TCP connection is established by a three-way handshake but the destIP then receives an RST packet, the in-flow is also labeled as IBR traffic. This is the situation in Fig. 5(d), and such packets may be generated by a scanning attack.

Fig. 5.

The situations when building the ground truth on TCP packets.

(2) When the transport protocol is UDP or ICMP, for a flow received by a destIP, if there is no inverse packet, the in-flow is labeled as IBR traffic. This is the situation in Fig. 6.

Fig. 6.

The situations when building the ground truth on UDP/ICMP packets.
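The labeling rules for TCP and UDP/ICMP can be sketched in Python as follows. The representation of a conversation as a time-ordered list of `(direction, flags)` pairs, the function names and the exact matching of each situation are assumptions made for illustration; in particular, situation (d) is interpreted here as an RST arriving in any in-packet after the handshake has completed.

```python
def tcp_ibr_case(packets):
    """Return 'a', 'b', 'c' or 'd' for the matching situation of Fig. 5, else None.

    packets: time-ordered list of (direction, flags) seen at the destIP, where
    direction is "in" or "out" and flags is a set such as {"SYN", "ACK"}.
    """
    if not any(direction == "out" for direction, _ in packets):
        return "a"                                    # no inverse packet at all
    if len(packets) < 2 or packets[0][0] != "in" or packets[1][0] != "out":
        return None
    first_in, first_out = packets[0][1], packets[1][1]
    if {"SYN", "ACK"} <= first_in and "RST" in first_out:
        return "b"                                    # unsolicited SYN-ACK answered by RST
    if first_in == {"SYN"} and {"SYN", "ACK"} <= first_out:
        later_in = [flags for direction, flags in packets[2:] if direction == "in"]
        if not later_in or "RST" in later_in[0]:
            return "c"                                # handshake never completed
        if "ACK" in later_in[0] and any("RST" in flags for flags in later_in[1:]):
            return "d"                                # handshake completed, then RST
    return None


def udp_icmp_is_ibr(packets):
    """UDP/ICMP in-flow is labeled IBR when no inverse packet exists."""
    return not any(direction == "out" for direction, _ in packets)


case_a = tcp_ibr_case([("in", {"SYN"})])
case_b = tcp_ibr_case([("in", {"SYN", "ACK"}), ("out", {"RST"})])
case_c = tcp_ibr_case([("in", {"SYN"}), ("out", {"SYN", "ACK"}), ("in", {"RST"})])
case_d = tcp_ibr_case([("in", {"SYN"}), ("out", {"SYN", "ACK"}),
                       ("in", {"ACK"}), ("in", {"RST"})])
normal = tcp_ibr_case([("in", {"SYN"}), ("out", {"SYN", "ACK"}), ("in", {"ACK"}),
                       ("in", {"PSH", "ACK"}), ("out", {"ACK"}), ("in", {"FIN", "ACK"})])
```

A conversation that completes the handshake, exchanges data and closes with FIN matches none of the situations and is therefore not labeled as IBR traffic.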

Using the above procedure, the ground truth was built for a traffic trace collected from a lab building of our campus (called LAB) and used for the performance evaluation of our approach. The trace was collected between 09:58:14 and 10:33:14 on 2/4/2013. A summary of the LAB traffic trace is shown in Table 1.

Table 1

Traffic summary of LAB traffic data

Duration (s)	Number of user IPs	Flow size of Intranet (MB)	Flow size of Internet (MB)
2100	1922	2791	16,730

To further illustrate the ground truth of inactive IPs and IBR traffic, the received traffic (in-traffic) and sent traffic (out-traffic) of each destIP are computed on the LAB traffic data and shown in Fig. 7(a). The x-axis and y-axis represent in-traffic and out-traffic respectively, and both are log-scaled; $log 0$ is set as 1.11 in the figure. There are three main categories of IPs in Fig. 7(a): (1) IPs that only have in-traffic (IPs on the y-axis), (2) IPs that only have out-traffic (IPs on the x-axis), and (3) IPs that have both in-traffic and out-traffic (IPs between the two axes). The first category includes 606 IPs, which are inactive IPs. The 26,209 packets received by inactive IPs are labeled as IBR traffic. The source IPs of these inactive IPs are labeled as malicious, which yields 155 malicious source IPs. The malicious source IPs send 69,396 packets in total. Excluding the packets received by inactive IPs, 43,187 packets remain, of which 43,059 are labeled as IBR traffic. Figure 7(b) displays the USIP value and in-traffic volume of each IP from the first and third categories in Fig. 7(a). The x-axis represents the bytes received by each IP and the y-axis represents the USIP value of its traffic sources. The ‘+’ represents inactive IPs, and the ‘×’ represents other IPs. Figure 7(b) shows that some of the other IPs overlap with inactive IPs.

Fig. 7.

The in-traffic and out-traffic volume distribution, and RU values of traffic sources on the datasets with ground truth. (a) In-traffic and Out-traffic. (b) RU and In-traffic.

4.2. Performance evaluation on traffic data with ground truth

Precision, detection rate and omission rate are used as performance evaluation metrics. The precision for inactive IPs/malicious IPs/IBR traffic is the ratio of the number of correctly identified inactive IPs/malicious IPs/IBR traffic to the number of identified inactive IPs/malicious IPs/IBR traffic. The detection rate (or recall) for inactive IPs/malicious IPs/IBR traffic is the ratio of the number of correctly identified inactive IPs/malicious IPs/IBR traffic to the number of inactive IPs/malicious IPs/IBR traffic in the ground truth. The omission rate for inactive IPs/malicious IPs/IBR traffic is the ratio of the number of inactive IPs/malicious IPs/IBR traffic that are not identified to the number of inactive IPs/malicious IPs/IBR traffic in the ground truth. Hence $Omission rate = 1 - Detection rate$.
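The three metrics can be computed from the set of identified items and the ground-truth set, as in the following Python sketch. The function name `evaluate`, the returned dictionary keys and the toy sets are illustrative assumptions; returning 0 for an empty set is a convention of this sketch.

```python
def evaluate(identified, truth):
    """Precision, detection rate and omission rate of an identified set against ground truth."""
    identified, truth = set(identified), set(truth)
    correct = len(identified & truth)                 # correctly identified items
    precision = correct / len(identified) if identified else 0.0
    detection_rate = correct / len(truth) if truth else 0.0
    return {
        "precision": precision,
        "detection_rate": detection_rate,
        "omission_rate": 1.0 - detection_rate,
    }


# Toy example: five true items, four identified, three of them correct.
metrics = evaluate(identified={1, 2, 3, 9}, truth={1, 2, 3, 4, 5})
```

Here three of the four identified items are correct and two of the five true items are missed, giving a precision of 0.75, a detection rate of 0.6 and an omission rate of 0.4.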

Our approach has one parameter, the threshold θ. The threshold is used to check whether an IP is inactive and directly influences the IBR traffic results. Figure 7(b) shows that the majority of inactive IPs are at the top left. If the threshold is small, all inactive IPs can be detected, but many other IPs are incorrectly identified as inactive IPs and all of their source IPs are incorrectly labeled as malicious IPs. As a consequence, many flows are incorrectly marked as IBR traffic. In order to discuss the influence of the threshold, the results for different thresholds between 0.85 and 0.99 are tested and compared.

Experimental results with different values of the threshold θ are shown in Fig. 8. In Fig. 8(a), the x-axis represents different values of the threshold θ and the y-axis represents the precision of the inactive IPs, malicious IPs and IBR traffic detected by our approach. Figure 8(a) shows that the precision for malicious IPs and IBR traffic is very low when the threshold is small. Even though the detection rate of inactive IPs is high, many malicious IPs and much IBR traffic are incorrectly identified. As the threshold increases, the precision for inactive IPs, malicious IPs and IBR traffic also increases, and it becomes stable once the threshold reaches 0.9. Figure 8(a) shows that the precision of detecting IBR traffic is about 99% when the threshold equals 0.9.

Fig. 8.

The performance of our approach on identifying inactive IPs and IBR traffic on the datasets with ground truth. (a) Precision. (b) Omission rate.

In Fig. 8(b), the x-axis represents different values of the threshold θ and the y-axis represents the omission rates of inactive IPs, malicious IPs and IBR traffic using our approach. As the threshold increases, the omission rates also increase, which means that more inactive IPs, malicious IPs and IBR traffic are missed. When the threshold equals 0.9, the omission rate of inactive IPs is high (40%) and the omission rate of malicious IPs is about 25%, but that of IBR traffic flows is only about 0.1%. This means that most IBR traffic is received by inactive IPs whose traffic sources are highly uncertain, and the majority of IBR traffic can be correctly detected when the threshold is 0.9. However, the threshold cannot be too high. For example, when the threshold is 0.95, the precision increases further but the omission rate increases significantly.

According to the discussion of Fig. 8, our approach obtains 99% precision and a 0.1% omission rate when the threshold is 0.9. Our approach aims to correctly detect IBR traffic in existing NetFlow data, and therefore mainly focuses on ensuring high precision for IBR traffic. It is neither necessary nor realistic to filter out all IBR traffic, so a small omission rate is acceptable. According to the above discussion and experimental results, a value between 0.9 and 0.94 is suitable for the threshold θ. In the following experiments of this paper, the threshold is set to 0.9.

4.3. Experimental results on IPv4 NetFlow data

Our previous work [22] discussed and analyzed in detail the experimental results of detecting IBR traffic on a NetFlow dataset (Student_A), which are not repeated here. Based on the IBR traffic detected on Student_A in [22], this section presents further experiments that summarize the port number information of IBR traffic, because the features of IBR (or attack) traffic are helpful for guarding against network threats and improving the configuration of security management systems. Table 2 lists the source and destination ports of every pattern on each day, where “∗” indicates that there are multiple port numbers without a regular pattern. Patterns A, B and C respectively correspond to the three attack behavior patterns shown in Fig. 4. Table 2 shows that several of the port numbers are used by well-known port scanning attacks, such as 1433, 3306, 3389 and 4899.

Table 2
Traffic statistics of IBR traffic detected on the NetFlow data using our approach

Date	Pattern	#Flows	Source port	Destination port
Mon	Pattern A	11,349	∗	22,1433,3389,6666,8909
	Pattern B	15,131	0,6000,24920	0,1433,3306,3389,4899,6666,8086,8088,8090,8909,9415
	Pattern C	72	80	∗
Tue	Pattern A	8984	∗	22,80,1433,3389,8909
	Pattern B	11,827	6000,17503	1433,3306,3389,4899,5060,6666,9415
	Pattern C	755	53,80	∗
Wed	Pattern A	8102	∗	23,1433,3389,8080,8909,18600
	Pattern B	7778	6000,23054	81,90,1111,1433,3306,3389,4899,6666,8909,9415
	Pattern C	0	NULL	NULL
Thu	Pattern A	8530	∗	1433,6666,8888,8909
	Pattern B	16,370	6000,15108	1433,3306,3389,8090,9415
	Pattern C	8	80	∗
Fri	Pattern A	8663	∗	23,25,1433,3306,3389,4899,8081,8909,18600
	Pattern B	9249	6000,12200	1433,3306,3389,8080,9415
	Pattern C	2466	80	∗
Sat	Pattern A	9347	∗	80,1433,3389,6666,8909
	Pattern B	7572	0,6000	0,808,1433,1521,2967,3306,3389,4899,8090,8909,9415,65500
	Pattern C	1465	53,80	∗
Sun	Pattern A	10,118	∗	22,1433,3306,3389,8080
	Pattern B	6144	6000,27160	1433,3306,3389,6666,8080,8909,9415
	Pattern C	210	53,80	∗
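
A per-day, per-pattern port summary of the kind shown in Table 2 could be produced along the following lines. This is only a sketch: the `Flow` record, the `summarize_ports` function and the `max_listed` cut-off that collapses a long port list to "*" are hypothetical names and choices, and the flow records are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Flow(NamedTuple):
    """Minimal flow record (hypothetical layout, not a NetFlow schema)."""
    day: str
    pattern: str
    src_port: int
    dst_port: int


def summarize_ports(flows: Iterable[Flow],
                    max_listed: int = 12) -> Dict[Tuple[str, str], Tuple[int, str, str]]:
    """Group flows by (day, pattern) and list the distinct ports seen.

    max_listed is an assumed cut-off: when more distinct ports than this
    appear, the column is collapsed to "*" (multiple ports, no regular pattern).
    """
    groups = defaultdict(lambda: {"flows": 0, "src": set(), "dst": set()})
    for flow in flows:
        group = groups[(flow.day, flow.pattern)]
        group["flows"] += 1
        group["src"].add(flow.src_port)
        group["dst"].add(flow.dst_port)

    def render(ports) -> str:
        if len(ports) > max_listed:
            return "*"
        return ",".join(str(port) for port in sorted(ports))

    return {
        key: (group["flows"], render(group["src"]), render(group["dst"]))
        for key, group in groups.items()
    }


# Made-up flow records, for illustration only.
toy_flows = [
    Flow("Mon", "Pattern B", 6000, 1433),
    Flow("Mon", "Pattern B", 6000, 3306),
    Flow("Mon", "Pattern B", 6000, 3389),
    Flow("Mon", "Pattern C", 80, 40001),
    Flow("Mon", "Pattern C", 80, 40002),
    Flow("Mon", "Pattern C", 80, 40003),
    Flow("Mon", "Pattern C", 80, 40004),
]
toy_summary = summarize_ports(toy_flows, max_listed=3)

for (day, pattern), (count, src, dst) in sorted(toy_summary.items()):
    print(day, pattern, count, src, dst, sep="\t")
```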

4.4. Performance evaluation on IPv6 traffic datasets

In IPv6 networks, it is highly likely that inexperience, differences in system configuration, or even software or hardware bugs will result in errors that lead to Internet background radiation. By observing IPv6 IBR traffic, we can be alerted to the emergence of malicious activity, such as scanning and worms, carried out via the new protocol features.

In this section, IPv6 IBR traffic is analyzed using our approach. A 30-minute trace of pure IPv6 traffic was collected by port mirroring in our campus network during the interval 16:11–16:41 on 2/21/2014. The basic information of the IPv6 traffic dataset is shown in Table 3. The ground truth of IPv6 IBR traffic was built by the method presented in Section 4.1.

Table 3
Information of IPv6 traffic data

Duration (s)	Number of IPs	Flow size of in-traffic (MB)	Flow size of out-traffic (MB)
1800	3815	66,167	38,178

When our approach is used to filter out IBR traffic with the USIP threshold set to 0.9, the detection rate of inactive IPs is about 70% but the precision is only 34%; the detection rate of IBR traffic is 83.5% and its precision is 21.9%. This means that most of the IPs identified as inactive are actually active, and most of the traffic identified as IBR traffic is actually normal. It indicates that the traffic sources of many active IPs are also close to uniformly distributed, which differs from active IPs in the IPv4 network environment. To further analyze the USIP values of active and inactive IPs on the IPv6 traffic dataset, the USIP and traffic size (#bytes) of inactive IPs and active IPs on the traffic dataset with ground truth are shown in Fig. 9(a) and (b), respectively. The USIP values of some inactive IPs are smaller than 0.9 in Fig. 9(a), and the USIP values of some active IPs are large in Fig. 9(b). This differs from the result on IPv4 traffic shown in Fig. 7, in which the USIP values of most inactive IPs are larger than 0.9 and those of active IPs are smaller than 0.9.
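
To make this failure mode concrete, the sketch below scores destination IPs by the normalized Shannon entropy of their source-IP distribution. This is an assumed stand-in for USIP, chosen only because it also measures how close the traffic sources are to uniform; it is not the definition of the metric, and the function name and toy traffic are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def normalized_source_entropy(source_ips: Iterable[str]) -> float:
    """Shannon entropy of the source-IP distribution, scaled to [0, 1].

    1.0 means every source contributed equally (highly uncertain sources);
    values near 0 mean a few sources dominate. Assumed stand-in for USIP.
    """
    counts = Counter(source_ips)
    total = sum(counts.values())
    if len(counts) < 2:
        return 0.0  # zero or one source: no uncertainty to measure
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))


THRESHOLD = 0.9  # threshold value used in the experiments

# Made-up traffic, one source-IP entry per received flow.
scanned_inactive = [f"s{i}" for i in range(8)]                # 8 sources, 1 flow each
skewed_active = ["server"] * 90 + ["a"] * 5 + ["b"] * 5       # one dominant peer
uniform_active = ["p1", "p2", "p3", "p4"] * 10                # 4 peers, equal shares

for name, sources in [("scanned_inactive", scanned_inactive),
                      ("skewed_active", skewed_active),
                      ("uniform_active", uniform_active)]:
    score = normalized_source_entropy(sources)
    print(name, round(score, 3), "flagged" if score > THRESHOLD else "not flagged")
```

In this toy setting the active IP whose peers contribute equal shares is flagged together with the scanned inactive IP, which mirrors the low precision observed on the IPv6 dataset when the sources of active IPs are close to uniform.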

Fig. 9. Traffic volume and the USIP of traffic sources of inactive IPs and active IPs on the IPv6 traffic data. (a) RU and In-traffic of InactiveIP. (b) RU and In-traffic of ActiveIP.

The above experimental results show that our method is not suitable for filtering out IPv6 IBR traffic. The reason is that it assumes that the source IPs sending traffic to inactive IPs are potentially malicious, whereas most IPv6 IBR traffic is not generated by malicious IPs [5] and therefore does not satisfy this assumption. Existing work [5] on IPv6 IBR traffic finds no evidence that malicious traffic due to worms, scanning, or backscatter is prevalent in IPv6 data, and concludes that most IPv6 IBR traffic is due to misconfiguration. Our analysis of IPv6 IBR traffic is consistent with this work. Nevertheless, the experiments on IPv6 traffic further suggest that our method is suited to detecting IBR traffic generated by abnormal behavior, which helps to guard network security.

5. Conclusions

This paper investigates the identification of IBR traffic. From the experimental analysis of campus network traffic, we found that there is always some traffic sent to an IP, even if the IP is unassigned or is a non-user IP (i.e., a subnet IP, broadcast IP or gateway IP). This phenomenon indicates that IBR traffic is ubiquitous. Moreover, the traffic sources of active IPs are relatively certain, whereas those of inactive IPs are relatively uncertain. Inspired by this, we identify an IP as inactive when its traffic sources are relatively uncertain. A novel approach named IBR_TSD is presented to detect IBR traffic from existing in-direction traffic data. Firstly, it detects inactive IPs and malicious source IPs based on the presented USIP metric and a threshold-cutting method. Secondly, it labels the traffic received by inactive IPs as IBR traffic and identifies the IBR traffic sent by malicious source IPs using three heuristics for detecting attack behavior.

To validate the feasibility and evaluate the performance of our approach, we collected IPv4 and IPv6 traffic traces and built the ground truth in a conservative way. On the one hand, experiments on the IPv4 traffic traces show that (1) the USIP threshold in IBR_TSD works best when set in the range 0.9–0.94; (2) IBR_TSD obtains about 99% precision and a 0.1% omission rate in detecting IBR traffic; (3) the summary of source and destination port numbers of the detected IBR traffic can be used by intrusion detection systems, for example to detect port scanning. On the other hand, experiments on the IPv6 traffic traces show that (1) IBR_TSD obtains about an 83.5% detection rate and 21.9% precision in detecting IBR traffic; (2) the USIP values of some inactive IPs are smaller than 0.9, which suggests that IBR_TSD is not suitable for detecting IPv6 IBR traffic; a possible reason is that IPv6 IBR traffic is mainly generated by misconfiguration rather than malicious behavior. Nevertheless, IBR_TSD is able to effectively detect IPv4 IBR traffic generated by malicious behavior. Overall, the advantages of IBR_TSD include: (1) it directly detects IBR traffic from existing in-direction traffic data without any extra information such as online user lists or the full payload of bi-directional traffic; (2) the detected IBR traffic is more valuable and practical since it contains the traffic sent to active IPs in addition to inactive IPs. In contrast, existing methods (such as darknets, black holes and honeypots) must first be configured with a set of unassigned IPs, and the IBR traffic they collect contains only the traffic sent to those unassigned IPs.

The experimental data used in this paper cover half an hour or one day. The method may not be suitable for real-time IBR traffic detection. For example, according to our further analysis, inactive IPs are more easily distinguished from active IPs on daily traffic than on half-hour traffic. A possible reason is that USIP values calculated on traffic collected over a short period cannot accurately detect inactive IPs.

Acknowledgements

This study is supported by the National Natural Science Fund, China (Grant No. 61300198), the Guangdong Province Natural Science Foundation (No. S2013040016582), the Guangdong Higher School Scientific Innovation Project (No. 2013KJCX0177), the Fundamental Research Funds for the Central Universities (SCUT 2014ZB0029) and the China Postdoctoral Science Foundation (No. 2014M552199).

References

[1] Bailey, Cooke, Jahanian, Nazario and Watson, The Internet motion sensor: A distributed blackhole monitoring system, in: Proceedings of the 12th Network and Distributed System Security Symposium, February 2005, 2005.

[2] Bailey, Cooke, Jahanian, Provos, Rosaen and Watson, Data reduction for the scalable automated analysis of distributed darknet traffic, in: Proceedings of the USENIX/ACM Internet Measurement Conference, October 2005, 2005.

[3] Benson, Dainotti, K.C. Claffy and Aben, Gaining insight into AS-level outages through analysis of Internet background radiation, in: Proceedings of the 5th IEEE International Traffic Monitoring and Analysis Workshop, April 2013, 2013, pp. 447–452.

[4] Cisco Systems, Cisco IOS NetFlow site, available at: http://www.cisco.com/c/en/us/products/ios-nx-os-software/ios-netflow/index.html.

[5] Czyz, Lady, S.G. Miller, Kallitsis and Karir, Understanding IPv6 Internet background radiation, in: Proceedings of the 13th ACM SIGCOMM Conference on Internet Measurement (IMC’13), October 2013, 2013, pp. 105–118.

[6] Dainotti, Ammann, Aben and K.C. Claffy, Extracting benefit from harm: Using malware pollution to analyze the impact of political and geophysical events on the Internet, ACM SIGCOMM Computer Communication Review 42(1) (2012), 31–39.

[7] Drašar, Vizváry and Vykopal, Similarity as a central approach to flow-based anomaly detection, International Journal of Network Management 24(4) (2014), 318–336.

[8] Durumeric, Bailey and J.A. Halderman, An Internet-wide view of Internet-wide scanning, in: Proceedings of the 23rd USENIX Security Symposium, August 2014, 2014, pp. 65–78.

[9] Glatz and Dimitropoulos, Classifying Internet one-way traffic, in: Proceedings of the ACM Conference on Internet Measurement Conference, November 2012, 2012.

[10] Harrop and Armitage, Defining and evaluating greynets (sparse darknets), in: Proceedings of the 30th IEEE Conference on Local Computer Networks (LCN’05), November 2005, 2005.

[11] Iglesias and Zseby, Modelling IP darkspace traffic by means of clustering techniques, in: Proceedings of IEEE Conference on Communications and Network Security (CNS), October 2014, 2014, pp. 166–174.

[12] Iglesias and Zseby, Entropy-based characterization of Internet background radiation, Entropy 17(1) (2015), 74–101.

[13] Karagiannis, Papagiannaki and Faloutsos, BLINC: Multilevel traffic classification in the dark, ACM SIGCOMM Computer Communication Review 35(4) (2005), 229–240.

[14] Kreibich and Crowcroft, Honeycomb: Creating intrusion detection signatures using honeypots, ACM SIGCOMM Computer Communication Review 34(1) (2004), 51–56.

[15] Li, Springer, Bebis and M.H. Gunes, A survey of network flow applications, Journal of Network and Computer Applications 36(2) (2013), 567–581.

[16] Miao, Ding and Zhu, Extracting Internet background radiation from raw traffic using greynet, in: Proceedings of the 18th IEEE International Conference on Networks (ICON), IEEE, 2012, pp. 370–375.

[17] Moore, Shannon, D.J. Brown, G.M. Voelker and Savage, Inferring Internet denial of service activity, ACM Transactions on Computer Systems 24(2) (2006), 115–139.

[18] Moore, Shannon, G.M. Voelker and Savage, Network telescopes, Technical report, Cooperative Association for Internet Data Analysis (CAIDA), July 2004.

[19] Pang, Yegneswaran, Barford, Paxson and Peterson, Characteristics of Internet background radiation, in: Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, October 2004, 2004.

[20] Strapp and S.J. Yang, Segmenting large-scale cyber attacks for online behavior model generation, in: Proceedings of the 7th International Conference, SBP 2014, April 2014, 2014, pp. 169–177.

[21] A.M. Sukhov, D.I. Sidelnikov, A.P. Platonov and A.A. Galtsev, Active flows in diagnostic of troubleshooting on backbone links, Journal of High Speed Networks 18(1) (2011), 69–81.

[22] Wang, Zhang and Liu, A novel method for filtering Internet background radiation traffic, in: The 4th International Conference on Emerging Intelligent Data and Web Technologies (EIDWT), September 2013, IEEE, 2013, pp. 371–376.

[23] Wustrow, Karir, Bailey, Jahanian and Huston, Internet background radiation revisited, in: Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, November 2010, 2010.

[24] Xu, Z.-L. Zhang and Bhattacharyya, Profiling Internet backbone traffic: Behavior models and applications, ACM SIGCOMM Computer Communication Review 35(4) (2005), 169–180.

[25] Xu, Z.-L. Zhang and Bhattacharyya, Internet traffic behavior profiling for network security monitoring, IEEE/ACM Transactions on Networking 16(6) (2008), 1241–1252.

[26] Yegneswaran, Barford and Plonka, On the design and use of Internet sinks for network abuse monitoring, in: Proceedings of the Symposium on Recent Advances in Intrusion Detection, September 2004, 2004, pp. 146–165.

[27] Zseby, Brownlee, King and Claffy, Nightlights: Entropy-based metrics for classifying darkspace traffic patterns, in: Proceedings of the 15th International Conference, PAM 2014, March 2014, 2014, pp. 275–277.
