Abstract
Identifying IBR (Internet Background Radiation) traffic is important for detecting malicious behavior. Traditional methods depend on demanding conditions, such as full bi-directional traffic or unassigned IP address space, so this paper presents a novel IBR traffic identification method. We first explored the traffic source distribution of each destination IP in a traffic dataset and found that the traffic sources of active IPs are relatively certain, whereas those of inactive IPs are relatively uncertain. Based on these findings, we then present a method to identify IBR traffic. It uses a proposed metric to evaluate the certainty of the traffic sources of a destination IP and thereby identify inactive IPs. It then detects IBR traffic using heuristics built from malicious traffic behavior patterns. We carried out several experiments on real traffic datasets, and the results show that the method obtains 99% precision and a 0.1% omission rate in detecting IPv4 IBR traffic. The detected IBR traffic includes traffic sent to assigned IPs as well as unassigned IPs, which makes it more valuable and practical for detecting malicious traffic in real networks.
Introduction
Monitoring and analyzing Internet traffic is an important foundation for guarding against cyber threats and ensuring Quality of Service (QoS) in large networks [21]. Internet Background Radiation (IBR) traffic [19] generally refers to the “unsolicited” traffic that hosts passively receive, such as network scanning traffic, backscatter traffic and misconfiguration traffic [23]. Scanning traffic is usually sent by an infected host that tries to find vulnerable hosts. Backscatter traffic is generally generated by denial-of-service attacks. Misconfiguration traffic results from erroneous settings or software/hardware faults. As a special kind of traffic, IBR traffic has been applied in the network security field [17], e.g. for analyzing intrusion characteristics or designing intrusion detection systems.
Three categories of methods have been presented to collect IBR traffic. (1) Darknets use a dedicated acquisition machine to collect the traffic sent to pre-set unassigned IPs [18]. (2) A honeypot [14] is a special kind of host system that lures intruders into attacking it and collects the generated traffic. (3) The one-way detection method filters one-directional flows out of a full bi-directional flow dataset by checking whether each flow has an inverse flow. However, these methods rely on demanding conditions, such as unassigned address space for darknets or full bi-directional flows for the one-way detection method, and they can only collect a limited portion of IBR traffic. To relax these restrictions, this paper explores traffic data characteristics and presents a novel method to identify IBR traffic. We treat the IP addresses inside the campus network as destination IPs and those outside as source IPs. The main work of this paper includes the following three aspects:
We explore the traffic source distribution of each destination IP in the campus traffic data and find that (1) some IBR traffic is sent to inactive IPs and (2) the traffic source distribution of inactive IPs is mostly close to uniform. In addition, a significant property of active hosts is that their traffic sources are relatively certain. These findings lead us to propose a novel method named IBR_TSD (IBR traffic identification based on Traffic Source Distribution), which is based on information entropy theory.
We carry out a series of experiments on IPv4 traffic datasets to validate the feasibility and effectiveness of IBR_TSD. One traffic trace was collected by port mirroring and another comes from a NetFlow [4] dataset. Experimental results show that the proposed technique obtains about 99% precision and a 0.1% omission rate in identifying IBR traffic. We also summarize the source/destination port numbers of the collected IBR traffic and find that well-known port scanning traffic can be identified by IBR_TSD.
On the other hand, the experimental results on IPv6 datasets show that IBR_TSD obtains only an 83.5% detection rate and 21.9% precision in identifying IBR traffic. We further find that the traffic source distribution of inactive IPs in the IPv6 datasets does not satisfy the assumption of IBR_TSD. This suggests that IBR_TSD is more suitable for detecting IPv4 IBR traffic generated by malicious behavior.
Our approach can identify IBR traffic from traffic data without packet payloads. Moreover, the identified IBR traffic includes the packets sent both to unassigned IPs and to unknown active IPs, which are more valuable to attackers. In contrast, most traditional methods are only able to collect the traffic sent to pre-set unassigned IPs. Compared with our previous study [22], the new contributions include: (1) building heuristics based on host communication patterns of malicious traffic, which are used to improve the IBR traffic identification method; (2) analyzing the traffic source distribution of campus network traffic, which indicates that the traffic sources of active IPs differ from those of inactive IPs; (3) presenting a method to build the ground truth for inactive IPs and IBR traffic, and conducting experiments to discuss the parameter of IBR_TSD and evaluate the precision and omission rate in detecting IBR traffic; (4) statistically summarizing the source and destination port numbers of identified IBR traffic; (5) evaluating the detection rate and precision of IBR_TSD in filtering inactive IPs and IBR traffic on pure IPv6 datasets, and discussing the characteristics of IPv6 IBR traffic and the suitable usage context of IBR_TSD.
This paper is organized as follows. Section 1 gives an overview of the background and contributions of this paper. Section 2 summarizes related work on IBR traffic identification methods, the applications of IBR traffic, and attack behavior detection methods. Section 3 first analyzes the traffic sources of active and inactive IPs, and then presents the basic theory and the details of our method. Section 4 introduces the experimental datasets and conducts a series of experiments to evaluate the performance of our approach in identifying IBR traffic. Section 5 concludes this paper.
Related work
A variety of monitoring systems have been presented to collect IBR traffic, such as the network telescope [18], Internet sink [26], black hole [1], darknet [2] and greynet [10,16]. However, these methods either rely heavily on pre-set unassigned IPs, as the telescope does, or require collecting full bi-directional traffic, as the greynet does. A one-way flow is defined as a flow that has no reverse flow. In fact, all IBR traffic flows are one-way flows. Glatz et al. [9] presented a scheme to detect one-way traffic in a live network and classify it into several useful classes, namely service unreachable, malicious scanning, benign P2P, backscatter, suspected benign, bogon and others. However, all bi-directional traffic flows have to be collected in order to filter out one-way flows.
Pang et al. [19] systematically analyzed IBR traffic characteristics from different perspectives. Their analysis of destination port numbers reveals that the vast majority of IBR traffic targets vulnerable services. They grouped the components of background radiation by protocol, application and, often, specific exploit, and also studied the temporal patterns and correlated behavior of IBR traffic. Wustrow et al. [23] investigated five years of IBR traffic in terms of time and space. They found that the traffic from the exploit ports has decreased and that the trend is towards increasing SYN and decreasing SYN-ACK traffic. On examining the traffic across address blocks, they noted that significant differences exist and that these differences are clustered in a handful of network blocks. Zseby et al. [27] investigated entropy-based metrics for IP darkspace traffic classification. Iglesias et al. [12] proposed a method based on entropy measures of traffic-flow features to distinguish different attack types observed in IBR traffic. In [11], a clustering method was applied to the analysis of IP darkspace sources, and the authors found that about 75% of the darkspace IP sources contribute to a set of very stable clusters, 4% to less stable clusters and 21% to outliers. Zakir et al. [8] found that large-scale horizontal scanning is common and that attackers use new scanning tools and techniques to reduce the burden of finding vulnerabilities.
In essence, IBR traffic is a kind of harmful traffic. Besides its applications in security, it can reveal other valuable information [3,6]. Dainotti et al. [6] analyzed the impact of political and geographical events on the Internet with the aid of IBR traffic. A sudden decrease of IBR traffic in an area indicates that network services may be blocked or interrupted. It may also imply that a significant political event or geographical disaster has happened in this area. They mainly monitored the IBR traffic on a backbone network and used special configurations to collect IBR traffic.
Most IBR traffic is generated by malicious attacks. Some papers detect attacks from the perspective of communication behavior. Karagiannis et al. [13] summarized the host communication patterns at the transport layer and presented a method to identify attacks. Steven et al. [20] proposed a method named ASMG (Attack Segmentation and Model Generation) to aggregate similar attack behaviors and provide situational awareness by grouping relevant traffic. Xu et al. [24,25] built traffic behavior patterns based on information entropy theory. They defined four dimensions
IBR_TSD method
Network traffic analysis
This section mainly analyzes the traffic source distributions of destination IPs. In the study of network traffic, the object of analysis is usually the flow. A flow is composed of the packets that share the same five-tuple
NetFlow data were collected in our campus network in 2012 [22]. The distribution of the source IPs of the traffic received by a destination IP is shown in Fig. 1, where the x-axis represents the index of the source IP and the y-axis represents the bytes sent by each source IP. When the destination IP is inactive (Fig. 1(a)), the bytes sent by each source IP do not differ significantly, and the distribution is close to uniform. When the destination IP is active (Fig. 1(b)), even though there are more than 6000 source IPs, the majority of bytes (more than 10^7) are sent by a small number of source IPs (about 14.8 IPs on average).

Fig. 1. Traffic source distributions of inactive IPs and active IPs on traffic data collected in the campus network. (a) Distribution of the traffic received by inactive IPs. (b) Distribution of the traffic received by active IPs.
The results in Fig. 1 show that the traffic sources of inactive hosts are relatively uncertain, and the traffic sent to them must be IBR traffic. Inspired by this, this paper proposes a novel approach to detect IBR traffic based on traffic source distribution.
Information entropy is generally used to evaluate the uncertainty of a random variable (denoted as X). Assume that
The value range of
According to (2), when
The main idea of our approach is to identify the inactive IPs and then distinguish the IBR traffic sent by the source IPs that contact them. It includes three main steps, as shown in Fig. 2: (1) identifying inactive IPs, (2) marking potentially malicious source IPs, and (3) identifying IBR traffic. Fig. 2 also shows the relationship among normal IPs, inactive IPs, malicious IPs, normal traffic and IBR traffic. All source IPs contacting an inactive IP are labeled as malicious IPs. All traffic that is sent by a malicious IP and matches some attack behavior pattern is marked as IBR traffic. The details of our approach are introduced as follows.

Fig. 2. The main steps of identifying IBR traffic and the relationship among different IPs.
A metric to evaluate the certainty of traffic sources
According to Fig. 1, the traffic sources of inactive IPs are mostly uncertain. To identify inactive IPs, this section proposes a metric called USIP (uncertainty of source IP) that evaluates the uncertainty of the traffic sources associated with a destination IP. The notation used is as follows.
Let
Let
Let
the set of source IPs of the traffic received by destIP, i.e.
the ith source IP in S
the number of flows sent by
the number of packets sent by
the number of bytes sent by
the total number of flows sent by all IPs in S
the total number of packets sent by all IPs in S
the total number of bytes sent by all IPs in S
USIP is defined as (3) which is based on the RU shown in Section 3.2. It is the ratio between the entropy of S and the maximum entropy of S.
The value range of
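As an illustration of the ratio described above (entropy of S divided by the maximum entropy of S), the following Python sketch computes an entropy-based uncertainty for the sources of one destination IP. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `usip`, the per-source weight (which could be bytes, packets, or flows), and the value 0.0 returned when there are fewer than two sources are all assumptions made for this example.

```python
import math


def usip(weights):
    """Entropy of the source distribution divided by its maximum.

    `weights` maps each source IP to the amount of traffic it sent to one
    destination IP (bytes, packets, or flows; the choice is an assumption).
    """
    values = [w for w in weights.values() if w > 0]
    if len(values) < 2:
        # Assumption: with fewer than two sources the maximum entropy is 0,
        # so the ratio is undefined; this sketch returns 0.0 in that case.
        return 0.0
    total = sum(values)
    # Shannon entropy of the empirical source distribution.
    entropy = -sum((w / total) * math.log2(w / total) for w in values)
    # Maximum entropy is reached when all sources send equal amounts.
    return entropy / math.log2(len(values))


uniform = usip({"s1": 100, "s2": 100, "s3": 100, "s4": 100})
skewed = usip({"s1": 997, "s2": 1, "s3": 1, "s4": 1})
```

In this sketch a uniform source distribution gives a value of 1 and a distribution dominated by one source gives a value near 0. The logarithm base cancels in the ratio, so base 2 is used only for convenience.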
The way for checking uniform distribution
With the metric USIP, the next problem is how to judge whether the traffic source distribution is close to uniform. Our solution is inspired by the algorithm in [24] for extracting the significant subset from a set. For the set S of an active IP, the USIP value is small and the TrafficSources are relatively certain, so a significant subset of S exists (i.e. the source IPs sending a dominant number of packets/bytes to the destIP). To extract the significant subset from S, significance is defined as follows. A subset D of S contains the most significant values of S if D is the smallest subset of S that satisfies that (1) the probability of any value in D is larger than that of the remaining values (denoted by R and
When this method is used to extract the significant subset D from S for each destIP in the inflow data, the significant subsets of some destIPs are empty. In other words, the probability distribution on S is close to uniform. Through further examination, we found that such destIPs are inactive. Consequently, our solution, named threshold-cutting, is devised to check whether the
A novel method to identify IBR traffic
Based on the metric USIP and the threshold-cutting solution, two naive assumptions are presented to detect IBR traffic.
If the distribution of TrafficSources of a destIP is uniform, then the destIP is inactive.
If an IP is inactive, all its source IPs are potentially malicious.
IBR traffic includes the traffic received by inactive IPs and the traffic that is sent by malicious IPs and matches attack behavior. The pseudo-code of our approach, named IBR_TSD, is shown in Fig. 3. First, the inflow data is grouped according to the destination IP (denoted by

Fig. 3. Pseudo-code of the identification algorithm.
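The overall procedure described in this section can be sketched in Python as follows. This is an illustrative sketch rather than the pseudo-code of the paper: the flow record layout (dicts with `src`, `dst` and `bytes`), the byte-weighted distribution, the comparison `USIP >= theta`, and the placeholder predicate `matches_attack_pattern` are all assumptions introduced for the example.

```python
import math
from collections import defaultdict


def usip(weights):
    """Entropy of the source distribution divided by log2(#sources)."""
    values = [w for w in weights.values() if w > 0]
    if len(values) < 2:
        return 0.0  # assumption: too few sources to judge uniformity
    total = sum(values)
    entropy = -sum((w / total) * math.log2(w / total) for w in values)
    return entropy / math.log2(len(values))


def identify_ibr(flows, theta=0.9, matches_attack_pattern=lambda flow: True):
    """Return (inactive destination IPs, malicious source IPs, IBR flows)."""
    # Step 1: group the inflow data by destination IP and sum bytes per source.
    per_dst = defaultdict(lambda: defaultdict(float))
    for flow in flows:
        per_dst[flow["dst"]][flow["src"]] += flow["bytes"]
    # Step 2: a destination whose sources look uniform is treated as inactive.
    inactive = {dst for dst, w in per_dst.items() if usip(w) >= theta}
    # Step 3: every source that contacted an inactive IP is potentially malicious.
    malicious = {flow["src"] for flow in flows if flow["dst"] in inactive}
    # Step 4: IBR traffic is the traffic received by inactive IPs plus traffic
    # from malicious sources that matches an attack behavior pattern.
    ibr = [
        flow for flow in flows
        if flow["dst"] in inactive
        or (flow["src"] in malicious and matches_attack_pattern(flow))
    ]
    return inactive, malicious, ibr


flows = [{"src": s, "dst": "10.0.0.1", "bytes": 100} for s in ("s1", "s2", "s3", "s4")]
flows += [
    {"src": "big", "dst": "10.0.0.2", "bytes": 10000},
    {"src": "s1", "dst": "10.0.0.2", "bytes": 10},
]
inactive, malicious, ibr = identify_ibr(flows)
```

In this toy input, 10.0.0.1 receives equal traffic from four sources and is marked inactive, while 10.0.0.2 is dominated by one source and stays active. The flow from `s1` to the active host is still reported because `s1` also contacted the inactive IP and the default predicate accepts every flow.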

Fig. 4. Three attack traffic behavior patterns summarized by [13].
In our algorithm, the heuristics aim to determine whether a traffic flow matches the attack behavior pattern. Based on the exploration results in [13], three attack behavior patterns are shown in Fig. 4. A heuristic is built for each pattern. The meaning of each following expression is
A source IP using multiple source port numbers sends packets to a destination port number of multiple destination IPs. It is expressed as
A source IP using a source port number sends packets to a destination port of multiple destination IPs. It is expressed as
A source IP using a source port number sends packets to multiple destination port numbers of multiple destination IPs, and the number of packets is smaller than 4. It is expressed as
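The three patterns listed above can be checked with a short Python sketch. It is only a hedged illustration: the `Flow` record, the function name `attack_pattern`, the reading of "multiple" as "at least two", evaluating the patterns over all flows of one source IP, and applying the packet limit to every flow are assumptions, since the original expressions are not reproduced here.

```python
from collections import namedtuple

# Hypothetical flow record used only for this example.
Flow = namedtuple("Flow", "src_ip src_port dst_ip dst_port packets")


def attack_pattern(flows_of_source, multiple=2):
    """Return 1, 2 or 3 for the matched pattern, or None if none matches.

    `flows_of_source` holds the flows sent by a single source IP.
    """
    src_ports = {f.src_port for f in flows_of_source}
    dst_ports = {f.dst_port for f in flows_of_source}
    dst_ips = {f.dst_ip for f in flows_of_source}
    # All three patterns require multiple destination IPs.
    if len(dst_ips) < multiple:
        return None
    if len(dst_ports) == 1:
        # Pattern 1: multiple source ports, one destination port.
        # Pattern 2: one source port, one destination port.
        return 1 if len(src_ports) >= multiple else 2
    # Pattern 3: one source port, multiple destination ports, fewer than
    # 4 packets (assumed here to apply to each flow).
    if len(src_ports) == 1 and all(f.packets < 4 for f in flows_of_source):
        return 3
    return None


p1 = attack_pattern([Flow("a", 4000 + i, ip, 445, 1) for i, ip in enumerate("xyz")])
p2 = attack_pattern([Flow("a", 6000, ip, 22, 1) for ip in "xyz"])
p3 = attack_pattern([Flow("a", 6000, ip, port, 1) for ip, port in zip("xyz", (80, 443, 8080))])
none_heavy = attack_pattern([Flow("a", 6000, ip, port, 10) for ip, port in zip("xyz", (80, 443, 8080))])
none_single = attack_pattern([Flow("a", 6000, "x", 80, 1)])
```

Grouping by source IP keeps the sketch short; a per-flow decision, as the heuristics are described in the text, could be derived by labeling each flow with the pattern matched by its source.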
The method of building ground truth
This section mainly evaluates the performance of our approach, i.e. the accuracy of identifying inactive IPs and IBR traffic. The ground truth of inactive IPs and IBR traffic must first be built. An inactive IP is routable but does not access the Internet, such as an IP that is not assigned to a user, or one whose host is turned off or does not access the Internet. A strict way of building the ground truth for inactive IPs is to inspect packet payload contents. However, this requires access to packet payloads, which is complex and may raise privacy concerns. Instead, a conservative method is presented to build the ground truth of inactive IPs and IBR traffic. The steps are as follows.
(1) When the transport protocol is TCP, for a flow received by a destIP, if there is no inverse packet, the flow is labeled as IBR traffic. This is the situation in Fig. 5(a). If there are inverse packets, we check the flags of the in-packets and out-packets. If the SYN and ACK flags of an in-packet are set and the RST flag of the corresponding out-packet is set, we label the in-flow as IBR traffic. This is the situation in Fig. 5(b), and such flows may be generated by backscatter. If the SYN flag of an in-packet is set and the SYN and ACK flags of the corresponding out-packet are set, but the RST flag of the following in-packet is set (or there is no following in-packet), the in-flow is labeled as IBR traffic. This is the situation in Fig. 5(c), which is an abnormal TCP connection. If the TCP connection is established by a three-way handshake but the destIP then receives an RST packet, the in-flow is also labeled as IBR traffic. This is the situation in Fig. 5(d), and such packets may be generated by a scanning attack.

Fig. 5. The situations when building the ground truth on TCP packets.
(2) When the transport protocol is UDP or ICMP, for a flow received by a destIP, if there is no inverse packet, the in-flow is labeled as IBR traffic. This is the situation in Fig. 6.

Fig. 6. The situations when building the ground truth on UDP/ICMP packets.
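The labeling rules in steps (1) and (2) can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the authors' tooling: a conversation is modeled as a time-ordered list of `(direction, flags)` pairs seen from the destIP, the function names are hypothetical, "corresponding" and "following" packets are taken to be the first ones in each direction, and case (d) is read literally as any RST received after the handshake.

```python
def is_ibr_tcp(packets):
    """Label one TCP conversation; `packets` is a list of (direction, flags).

    direction is "in" (received by destIP) or "out" (sent by destIP);
    flags is a set such as {"SYN", "ACK"}.
    """
    ins = [flags for direction, flags in packets if direction == "in"]
    outs = [flags for direction, flags in packets if direction == "out"]
    if not ins:
        return False  # nothing was received, so there is no in-flow to label
    if not outs:
        return True  # case (a): no inverse packet at all
    if {"SYN", "ACK"} <= ins[0] and "RST" in outs[0]:
        return True  # case (b): unsolicited SYN-ACK answered by RST
    if "SYN" in ins[0] and "ACK" not in ins[0] and {"SYN", "ACK"} <= outs[0]:
        if len(ins) == 1 or "RST" in ins[1]:
            return True  # case (c): handshake never completed
        if "ACK" in ins[1] and any("RST" in flags for flags in ins[2:]):
            return True  # case (d): handshake completed, then RST received
    return False


def is_ibr_udp_icmp(packets):
    """Label one UDP/ICMP conversation: IBR if there is no inverse packet."""
    return bool(packets) and all(direction == "in" for direction, _ in packets)


case_a = is_ibr_tcp([("in", {"SYN"})])
case_b = is_ibr_tcp([("in", {"SYN", "ACK"}), ("out", {"RST"})])
case_c = is_ibr_tcp([("in", {"SYN"}), ("out", {"SYN", "ACK"}), ("in", {"RST"})])
case_d = is_ibr_tcp([("in", {"SYN"}), ("out", {"SYN", "ACK"}), ("in", {"ACK"}), ("in", {"RST"})])
normal = is_ibr_tcp([("in", {"SYN"}), ("out", {"SYN", "ACK"}), ("in", {"ACK"}),
                     ("in", {"ACK", "PSH"}), ("out", {"ACK"}), ("in", {"FIN", "ACK"})])
```

A complete handshake followed by ordinary data and a FIN is left unlabeled, which is what makes the procedure conservative in this sketch.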
Traffic summary of LAB traffic data
To further illustrate the ground truth of inactive IPs and IBR traffic, the distributions of received traffic (in-traffic) and sent traffic (out-traffic) of each destIP are computed on the LAB traffic data and shown in Fig. 7(a). The x-axis and y-axis represent in-traffic and out-traffic respectively, and both are log-scaled. And

Fig. 7. The in-traffic and out-traffic volume distribution, and RU values of traffic sources on the datasets with ground truth. (a) In-traffic and Out-traffic. (b) RU and In-traffic.
Precision, detection rate and omission rate are used as performance evaluation metrics. The precision for inactive IPs/malicious IPs/IBR traffic is the ratio of the number of correctly identified inactive IPs/malicious IPs/IBR traffic to the number of identified inactive IPs/malicious IPs/IBR traffic. The detection rate (or recall) for inactive IPs/malicious IPs/IBR traffic is the ratio of the number of correctly identified inactive IPs/malicious IPs/IBR traffic to the number of inactive IPs/malicious IPs/IBR traffic. The omission rate for inactive IPs/malicious IPs/IBR traffic is the ratio of the number of inactive IPs/malicious IPs/IBR traffic that are not identified to the number of inactive IPs/malicious IPs/IBR traffic. And
There is one parameter in our approach, the threshold θ. The threshold is used to check whether an IP is inactive and directly influences the IBR traffic results. Figure 7(b) shows that the majority of inactive IPs are at the top left. If the threshold is small, all inactive IPs can be detected, but many other IPs are incorrectly identified as inactive IPs and all of their source IPs are incorrectly labeled as malicious IPs. Many flows are then incorrectly marked as IBR traffic. To study the influence of the threshold, results for thresholds between 0.85 and 0.99 are tested and compared.
Experimental results with different values of threshold θ are shown in Fig. 8. In Fig. 8(a), the x-axis represents the value of threshold θ and the y-axis represents the precision of the inactive IPs, malicious IPs and IBR traffic detected by our approach. Figure 8(a) shows that the precision for malicious IPs and IBR traffic is very low when the threshold is small: although the detection rate of inactive IPs is high, many malicious IPs and much IBR traffic are incorrectly identified. As the threshold increases, the precision for inactive IPs, malicious IPs and IBR traffic also increases, and it becomes stable once the threshold reaches 0.9. Figure 8(a) shows that the precision in detecting IBR traffic is about 99% when the threshold equals 0.9.
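The three metrics defined above translate directly into set operations. The following Python sketch is a minimal illustration; the function name `evaluate` and the convention of returning 0.0 for an empty denominator are assumptions made for this example.

```python
def evaluate(identified, truth):
    """Return (precision, detection_rate, omission_rate) for two collections.

    `identified` holds the items reported by the method and `truth` holds the
    ground-truth items (inactive IPs, malicious IPs, or IBR flow identifiers).
    """
    identified, truth = set(identified), set(truth)
    correct = identified & truth
    missed = truth - identified
    # Assumption: an empty denominator yields 0.0 instead of raising an error.
    precision = len(correct) / len(identified) if identified else 0.0
    detection_rate = len(correct) / len(truth) if truth else 0.0
    omission_rate = len(missed) / len(truth) if truth else 0.0
    return precision, detection_rate, omission_rate


precision, detection_rate, omission_rate = evaluate(
    identified={1, 2, 3, 4},
    truth={1, 2, 3, 5, 6},
)
```

For a non-empty ground truth, the detection rate and the omission rate sum to 1, so reporting either one alongside precision carries the same information.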

Fig. 8. The performance of our approach in identifying inactive IPs and IBR traffic on the datasets with ground truth. (a) Precision. (b) Omission rate.
In Fig. 8(b), the x-axis represents the value of threshold θ and the y-axis represents the omission rates of inactive IPs, malicious IPs and IBR traffic using our approach. As the threshold increases, the omission rates also increase, which means that more inactive IPs, malicious IPs and IBR traffic are missed. When the threshold equals 0.9, the omission rate of inactive IPs is high (40%) and the omission rate of malicious IPs is about 25%, but that of IBR traffic flows is only about 0.1%. This means that most IBR traffic is received by inactive IPs whose traffic sources are highly uncertain, and the majority of IBR traffic can be correctly detected when the threshold is 0.9. However, the threshold cannot be too high. For example, when the threshold is 0.95, the precision increases further but the omission rate increases significantly.
According to the discussion of Fig. 8, our approach obtains 99% precision and a 0.1% omission rate when the threshold is 0.9. Our approach aims to correctly detect IBR traffic from existing NetFlow data, so it mainly focuses on ensuring high precision for IBR traffic. Filtering out all IBR traffic is neither necessary nor realistic, so a small omission rate is acceptable. According to the above discussion and experimental results, a value between 0.9 and 0.94 is suitable for the threshold θ. In the following experiments, the threshold is set to 0.9.
Traffic statistics of IBR traffic detected on the NetFlow data using our approach
In IPv6 networks, it is highly likely that inexperience, system configuration differences, or even software or hardware bugs will result in errors that lead to Internet background radiation. By observing IPv6 IBR traffic, we can be alerted to the emergence of malicious activity, such as scanning and worms, via the new protocol features.
Information of IPv6 traffic data
When our approach is used to filter out IBR traffic with the USIP threshold set to 0.9, the detection rate of inactive IPs is about 70% but the precision is only 34%; the detection rate of IBR traffic is 83.5% and its precision is 21.9%. This means that most IPs identified as inactive are in fact active, and most traffic identified as IBR traffic is in fact normal. It indicates that the traffic source distributions of many active IPs are also close to uniform, which differs from the active IPs in the IPv4 network environment. To further analyze the USIP values of active and inactive IPs in the IPv6 traffic dataset, the USIP and traffic size (#bytes) of inactive IPs and active IPs on the traffic datasets with ground truth are shown in Fig. 9(a) and (b), respectively. The USIP values of some inactive IPs are smaller than 0.9 in Fig. 9(a), and the USIP values of some active IPs are large in Fig. 9(b). This differs from the result on IPv4 traffic shown in Fig. 7, in which the USIP values of most inactive IPs are larger than 0.9 and those of active IPs are smaller than 0.9.

Fig. 9. Traffic volume and the USIP of traffic sources of inactive IPs and active IPs on the IPv6 traffic data. (a) RU and In-traffic of InactiveIP. (b) RU and In-traffic of ActiveIP.
The above experimental results show that our method is not suitable for filtering out IPv6 IBR traffic. This is because it assumes that the source IPs that send traffic to inactive IPs are potentially malicious, whereas most IPv6 IBR traffic is not generated by malicious IPs [5], so the assumption is not satisfied. Existing work [5] on IPv6 IBR traffic concludes that there is no evidence that prevalent malicious traffic in IPv6 data is due to worms, scanning, or backscatter, and finds that most IPv6 IBR traffic is due to misconfiguration. Our analysis of IPv6 IBR traffic is consistent with this work. Nevertheless, the experiments on IPv6 traffic further suggest that our method is suitable for detecting the IBR traffic generated by abnormal behavior, which helps to guard network security.
This paper mainly investigates the IBR traffic identification method. According to the experimental analysis of campus network traffic, we found that there is always some traffic sent to an IP, even if this IP is unassigned or is a non-user IP (i.e. a subnet IP, broadcast IP or gateway IP). This phenomenon indicates that IBR traffic is ubiquitous. Moreover, the traffic sources of active IPs are relatively certain and those of inactive IPs are uncertain. Inspired by this, we assume that an IP is inactive when its traffic sources are relatively uncertain. A novel approach named IBR_TSD is presented to detect IBR traffic from existing in-direction traffic data. First, it detects inactive IPs and malicious source IPs based on the presented USIP metric and the threshold-cutting method. Second, it labels the traffic received by inactive IPs as IBR traffic and identifies the IBR traffic sent by malicious source IPs by using three heuristics for detecting attack behavior.
To validate the feasibility and evaluate the performance of our approach, we collected IPv4 and IPv6 traffic traces and built the ground truth in a conservative way. On the one hand, experiments on IPv4 traffic traces show that (1) the parameter related to USIP in IBR_TSD works best in the range of 0.9 to 0.94; (2) IBR_TSD obtains about 99% precision and a 0.1% omission rate in detecting IBR traffic; (3) the summary of source and destination port numbers of detected IBR traffic can be used by intrusion detection systems, for example to detect port scanning. On the other hand, experiments on IPv6 traffic traces show that (1) IBR_TSD obtains about an 83.5% detection rate and 21.9% precision in detecting IBR traffic; (2) the USIP values of some inactive IPs are smaller than 0.9, which suggests that IBR_TSD is not suitable for detecting IPv6 IBR traffic; the possible reason is that IPv6 IBR traffic is mainly generated by misconfiguration rather than malicious behavior. Nevertheless, IBR_TSD is able to effectively detect IPv4 IBR traffic that has been generated by malicious behavior. Overall, the advantages of IBR_TSD include: (1) it directly detects IBR traffic from existing in-direction traffic data without any extra information such as online user lists or the full payload of bi-directional traffic; (2) the detected IBR traffic is more valuable and practical since it contains the traffic sent to other active IPs besides inactive IPs. In contrast, existing methods (such as darknets, black holes and honeypots) first have to configure a series of unassigned IPs, and the IBR traffic they collect only contains the traffic sent to those unassigned IPs.
The experimental data used in this paper cover half an hour or a day, and the method may not be suitable for real-time IBR traffic detection. For example, according to our further analysis, inactive IPs are more easily distinguished from active IPs on daily traffic than on half-hour traffic. The possible reason is that the USIP values calculated on traffic collected over a short period cannot be used to accurately detect inactive IPs.
Acknowledgements
This study is supported by the National Natural Science Fund, China (Grant No. 61300198), the Guangdong Province Natural Science Foundation (No. S2013040016582), the Guangdong Higher School Scientific Innovation Project (No. 2013KJCX0177), the Fundamental Research Funds for the Central Universities (SCUT 2014ZB0029) and the China Postdoctoral Science Foundation (No. 2014M552199).
