Study of long short-term memory in flow-based network intrusion detection system

Abstract

The adoption of network flows in Network-based Intrusion Detection Systems (NIDS) has steadily risen in popularity. Typically, an NIDS detects network intrusions by inspecting the contents of every packet. The flow-based approach, however, uses only features derived from aggregated packet headers. In this paper, all publicly accessible and labeled NIDS data sets are explored. Following the advances in deep learning techniques, the performance of Long Short-Term Memory (LSTM) is also presented and compared with that of various machine learning classifiers. Among the reviewed data sets, the flow-based CIDDS-001 data set is used to train and evaluate the models.

Keywords

Intrusion detection system, NIDS, NetFlow, deep learning, LSTM

1 Introduction

The detection of malicious traffic and network intrusions is typically automated by deploying an Intrusion Detection System (IDS). An IDS complements a firewall: it analyzes traffic for possible intrusions, which can originate from either outside or inside an organization. One of the first major divisions of IDS depends on the data being analyzed. A Network-based IDS (NIDS) is one such variant; it monitors all traversing network traffic for intrusions or anomalies.

Traditional NIDS rely on Deep Packet Inspection (DPI), extracting the desired information from payloads. Despite yielding beneficial insights on semantic attacks [1], DPI can be difficult to implement. Issues such as encrypted payloads and computational costs can render an IDS ineffective in high-speed networks. One prominent alternative is to detect network intrusions using network flows. A flow-based IDS avoids the aforementioned weaknesses by relying only on information from flow records. Yet, it has not been explored as much as the traditional packet-based approach.

Considering that the performance of most NIDS is evaluated on outdated data sets, this paper presents the trends and an analysis of various labeled NIDS data sets that can serve as replacements. Furthermore, owing to the effectiveness and prominence of deep learning, Long Short-Term Memory (LSTM) is explored. The performance of LSTM is compared with that of different machine learning algorithms. Binary and multiclass classification tasks are also evaluated on the unidirectional flow-based CIDDS-001 data set.

2 Flow-based intrusion detection system

As defined in [2], a network flow is a set of IP packets passing through an observation point. IP packet headers are aggregated into a particular flow based on common properties such as IP addresses, port numbers, and protocol. Flows can be aggregated in either unidirectional or bidirectional format. A unidirectional implementation creates two different flows from a single conversation, one per direction. A bidirectional implementation instead results in a single flow, stemming from the host that initiated the conversation.
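As an illustration, the aggregation described above can be sketched in a few lines of Python. This is a minimal sketch and not the behavior of any particular flow exporter: the packet dictionaries, their field names, and the `aggregate` helper are hypothetical, and flow expiry by timeout is ignored.

```python
def aggregate(packets, bidirectional=False):
    """Aggregate packet headers into flow records keyed on common properties."""
    flows = {}
    for pkt in packets:
        src = (pkt["src_ip"], pkt["src_port"])
        dst = (pkt["dst_ip"], pkt["dst_port"])
        key = (src, dst, pkt["proto"])
        reverse = (dst, src, pkt["proto"])
        if bidirectional and key not in flows and reverse in flows:
            # Reply packet: fold it into the flow opened by the initiating host.
            key = reverse
        record = flows.setdefault(key, {"packets": 0, "bytes": 0})
        record["packets"] += 1
        record["bytes"] += pkt["size"]
    return flows


# A toy conversation: two packets from the client, one reply from the server.
packets = [
    {"src_ip": "10.0.0.1", "src_port": 40000, "dst_ip": "10.0.0.2",
     "dst_port": 80, "proto": "TCP", "size": 60},
    {"src_ip": "10.0.0.2", "src_port": 80, "dst_ip": "10.0.0.1",
     "dst_port": 40000, "proto": "TCP", "size": 1500},
    {"src_ip": "10.0.0.1", "src_port": 40000, "dst_ip": "10.0.0.2",
     "dst_port": 80, "proto": "TCP", "size": 52},
]
uni = aggregate(packets)                      # one flow per direction
bi = aggregate(packets, bidirectional=True)   # one flow per conversation
```

With the three packets above, the unidirectional mode yields two flow records, whereas the bidirectional mode yields a single record keyed on the host that sent the first packet.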

Figure 1 presents the typical architecture for the generation of flow-based traffic. Flow accounting relies on the cooperation of two main components: the flow exporter and the flow collector [1]. Flow records are first generated by the metering process from the packets passing the observation points. The metering process consists of a set of functions for the proper creation of flow records. Thereafter, flow records are passed to the exporting process either when they have expired or when the end of a connection has been detected (in the case of a TCP flow).

Fig.1

General flow architecture.

The flow exporter is responsible for sending flow records to one or more flow collectors, where they are stored for later use. Various protocols exist for defining the format of flow records and how they are transported [3], but Cisco NetFlow remains the dominant one. Most flow-based data sets reviewed in this paper are also recorded in NetFlow format.

As traffic is now summarized into aggregated views of packet headers, the use of network flows has both benefits and drawbacks [4]. Compared to traditional DPI, a flow-based IDS is both computationally and financially efficient [5]. The flow-based approach is better equipped to handle high-speed links, as it does not require the analysis, or even the decryption, of payloads. However, the complete absence of payload during inspection is a double-edged sword. The main drawback of a flow-based IDS is its difficulty in detecting network attacks whose destructive power resides in the payload [1].

3 Progression of labeled NIDS data set

DARPA’98 [6] and DARPA’99 [7] are among the earliest known data sets designed for the assessment of IDS. Extending DARPA’98, DARPA’99 includes new attack types and Windows NT victim machines. Otherwise, both were created using the same approach, and the traffic is simulated in an offline military environment. The attacks executed in the simulation are further classified into four classes: Denial of Service, User to Root, Remote to Local, and Probes [8].

Following this, Stolfo et al. [9] created the KDDCup’99 data set, which was reconstructed from the DARPA’98 tcpdump data. The authors processed the data into connection records with three main types of features: basic features of a single TCP connection, content features suggested by domain knowledge, and traffic features derived from the past two seconds. KDDCup’99 has remained the most used data set since its creation. However, several problems exist in the aforementioned data sets, as noted in a few studies [10–12].

Owing to the popularity of KDDCup’99, a few studies have attempted to refine the data set. One of the first attempts was conducted by Perona et al. [13]. The gureKDDCup data set was generated with the same methodology in order to replicate KDDCup’99 as accurately as possible while adding features. It carries all the features of KDDCup’99 together with extra payload information, IP addresses, and port numbers.

The NSL-KDD data set proposed by Tavallaee et al. [12] is another attempt to solve some of the inherent weaknesses of KDDCup’99. The authors removed all redundant records in both the training and testing sets to eliminate bias toward frequent records. Furthermore, the number of records selected from each prediction-difficulty group is inversely proportional to the share of that group in the original KDDCup’99 data set. This makes the evaluation of different machine learning methods more efficient and precise. Nevertheless, NSL-KDD still carries a few of the problems discussed by McHugh [10].

From a different angle, multiple honeypots were deployed at Kyoto University to generate the Kyoto 2006+ data set [14]. As only traffic directed to the honeypots is captured, most of the traffic is inherently suspicious. The only normal activities captured are those simulated to produce DNS and mail traffic. The data set contains 14 non-content features from KDDCup’99. An additional 10 features, such as indications of attacks, IDS triggers, and malware, are included to assist in investigating the presence of attacks.

The first labeled flow-based IDS data set was generated by Sperotto et al. [15] in 2009. As with the previous data set, the traffic was captured from a honeypot, here at the University of Twente, resulting in a limited view of the network traffic. Even though traffic monitoring was limited to one honeypot, the labeling process still proved to be a complex task. Services such as SSH, an Apache web server, and FTP were run on the honeypot, and their log files were collected for the labeling task.

The second flow-based data set, CTU-13, was created by Garcia et al. [16]. The authors focus on capturing botnet traffic, along with normal and background traffic. Normal labels are assigned to known and controlled computers in the network, while background labels are assigned to other, unknown traffic. The data set consists of 13 scenarios, each with different malware and actions. Originally, unidirectional flows were used in [16], but they were later replaced with bidirectional flows to include much more detailed labels. The data set was also extended to include pcap files of all traffic [17], albeit truncated due to privacy concerns.

Shiravi et al. [18] established a set of guidelines and followed a systematic approach in generating the UNB ISCX 2012 data set. Two profiles were devised to aid in simulating the desired events: α-profiles describe a set of attack scenarios, and β-profiles contain distribution models for background traffic. The data set covers a total of four multistage attack scenarios, and each scenario adheres to the following attack lifecycle: reconnaissance, identifying vulnerabilities, acquiring access, maintaining access, and covering tracks. Although various protocols were simulated for the background traffic, the data set does not contain HTTPS traces.

Aiming to address the absence of modern low-footprint attacks in the KDDCup’99 data set, Moustafa et al. [19] proposed the UNSW-NB15 data set. The data set consists of synthetic normal and abnormal traffic, generated with the aid of the IXIA PerfectStorm tool. The generated attacks fall into nine different categories. Two simulations were carried out, with varied frequencies of generated attacks. Each simulation was captured until it reached a size of 50 GB.

After a gap of a few years, labeled flow-based data sets emerged again. The UGR’16 data set by Fernández et al. [20] is intended to aid in the evaluation of cyclostationary-based NIDS. Flows were captured at a tier-3 ISP located in Spain over a period of four months. Both synthetic and real attacks are included in the data set. The background traffic, which contains real attacks, was labeled through manual analysis and by employing three state-of-the-art anomaly detectors. Several sources were also consulted to assist in labeling blacklisted public IP addresses.

In 2017, the CIDDS-001 flow-based data set was published by Ring et al. [21]. The data set contains two sources of traffic: internal OpenStack traffic, and traffic of an external server that is exposed to the Internet. Within the controlled OpenStack network, four different subnets are designed to emulate the environment of a small business. Each subnet has its own prescribed behaviors and follows the probability distribution of working hours. At the external server, traffic that is not generated by the OpenStack clients is labeled unknown for HTTP or HTTPS requests, and suspicious for the remaining traffic.

At the time of writing, the most recent data set is CICIDS2017 by Sharafaldin et al. [32]. The data set covers all eleven criteria of the evaluation framework and uses the β-profile proposed in previous work [33]. Six different attack profiles were created to include common and up-to-date attacks. In addition, CICFlowMeter [34] is used to extract 83 statistical features for the data set.

The majority of studies on network intrusion detection methods evaluate their performance on KDDCup’99 and NSL-KDD [12]. However, both are obsolete and poorly reflect modern real-world traffic [35–37], since they are essentially based on the 20-year-old DARPA’98 data set. A comparison of the aforementioned data sets is shown in Table 1.

Table 1
Comparison of labeled data sets for NIDS

Data set	Data format	Real traffic	f1	f2	f3	f4	Duration of capture	Point of capture	Reference
DARPA’98	R	No	–	–	–	–	9 weeks	network	[22]
DARPA’99	R	No	–	–	–	–	5 weeks	network	[22]
KDDCup’99	C	No	9	13	19	0	9 weeks	network	[23]
gureKDDCup	C, P	No	15	13	19	0	7 weeks (5 days / week)	network	[24]
NSL-KDD	C	No	9	13	19	0	–	network	[25]
Kyoto 2006+	C	Yes	11	0	9	3	9 years	honeypots	[26]
Labeled Flows	NF_U	Yes	12	0	0	0	6 days	honeypot	[27]
CTU-13	NF_U, NF_B, R	Yes	14	0	0	0	15–4011 minutes / scenario	network	[28]
UNB ISCX 2012	X, R	No	14	3	0	1	1 week	network	[25]
UNSW-NB15	csv, raw	No	37	3	7	0	16, 15 hours	network	[29]
UGR’16	NF_B, NF_D	Yes	12	0	0	0	4 months	network	[30]
CIDDS-001	NF_U	Yes	12	0	0	0	4 weeks	network	[31]
CICIDS2017	F_B, R	No	83	0	0	2	5 days	network	[25]

Data format: R - tcpdump / pcap; NF_U - unidirectional NetFlow; P - payload; NF_B - bidirectional NetFlow; C - csv; NF_D - NetFlow dump; X - xml; F_B - bidirectional flow. f1: features derived from headers. f2: features derived from content or application payload. f3: features derived from previous connections. f4: features that do not belong to any of the mentioned categories.

4 Deep learning approaches for NIDS

Machine learning techniques have been used to tackle a wide range of problems. They are able to learn and extract patterns without explicit programming. However, the performance of traditional machine learning techniques depends greatly on the representation of the data [38]. Often, domain expertise is required to craft a fitting representation from raw data.

Deep learning can address this problem by discovering suitable representations itself. It has been shown to outperform traditional machine learning techniques in many applications [39]. The ongoing wave of interest in deep network research can be traced back to 2006, when Hinton et al. presented an effective strategy for pre-training a Deep Belief Network (DBN) [40]. The surge of interest was further driven by advances in Graphics Processing Units (GPUs). The efficiency of GPUs in training deep networks has made experimentation much more convenient.

4.1 Deep belief network

A Restricted Boltzmann Machine (RBM) is a probabilistic generative model capable of learning the probability distribution of its inputs for reconstruction. An RBM consists of two layers: a visible layer and a hidden layer. Stacking RBMs, where the hidden layer of one RBM becomes the visible layer of the next, forms a Deep Belief Network (DBN) that can arrive at higher-level representations [41]. One of the earliest implementations of a DBN in NIDS was proposed by Salama et al. [42]. In that work, a DBN of two RBM layers is used to reduce the 41 features of NSL-KDD to five output features. The DBN is trained using backpropagation and is tested in two configurations: as a classifier by itself, or as a dimensionality reduction step before applying a Support Vector Machine (SVM) as the classifier. The DBN-SVM combination outperforms both the standalone SVM and the standalone DBN. In recent years, DBN-based models have received much more attention than other approaches. Both Gao et al. [43] and Alrawashdeh et al. [44] demonstrated a performance increase from adding more hidden layers to the DBN. With a total of four hidden layers, they achieved high performance on the KDDCup’99 data set.

4.2 Recurrent neural network

The Recurrent Neural Network (RNN) approach to network intrusion detection has also received increasing attention. Unlike a feedforward neural network, an RNN is able to perceive and memorize previous sequences through a feedback loop connected to the previous time step. A three-layer reduced-size RNN is proposed by Sheikhan [45], in which the nodes are partially connected between layers. This structure allows both faster training and an improved classification rate on the KDDCup’99 data set. Yin et al. [46] also propose the use of an RNN to detect network intrusions. Unlike in the previous work, the nodes in the hidden layers are “fully connected”. The authors studied the performance on both binary and multiclass classification, and the experimental results reported in [46] show superior performance. In another study, Kim et al. [47] adopt the Hessian-free optimization algorithm to address the difficulty of handling complex long-term dependencies in RNNs. Hessian-free optimization allows faster convergence without computing the Hessian matrix.

Deep learning approaches to NIDS have received growing attention in recent years. Most of the existing deep learning methods for NIDS are centered on the KDDCup’99 and NSL-KDD data sets. In the case of flow-based NIDS, deep learning has received little study.

5 Long Short-Term Memory

Long Short-Term Memory (LSTM) is a variant of RNN proposed by Hochreiter et al. [48] to solve the long-standing problems of exploding and vanishing gradients in RNNs. Many variants of the LSTM architecture have been proposed. The modern vanilla LSTM incorporates modifications including the forget gate [49] and peephole connections [50]. Other notable variants include the less complex Gated Recurrent Unit (GRU) architecture. The results obtained in [51] show the dominance of the vanilla LSTM over its other variants across various data sets. Figure 2 illustrates the unfolded version of a 2-layer LSTM network.

Fig.2

Stacked long short-term memory network.

The formulas below denote the forward pass of a non-peephole vanilla implementation:

$f_{t} = \sigma (W_{f} \cdot [h_{t - 1}, x_{t}] + b_{f})$ (1)

$i_{t} = \sigma (W_{i} \cdot [h_{t - 1}, x_{t}] + b_{i})$ (2)

$o_{t} = \sigma (W_{o} \cdot [h_{t - 1}, x_{t}] + b_{o})$ (3)

$c_{t} = c_{t - 1} \odot f_{t} + i_{t} \odot \tanh (W_{c} \cdot [h_{t - 1}, x_{t}] + b_{c})$ (4)

$h_{t} = \tanh (c_{t}) \odot o_{t}$ (5)

Each LSTM cell is regulated by three main gates: the input gate i_t, the forget gate f_t, and the output gate o_t, at a given time step t. The gate activations use the element-wise logistic sigmoid function, denoted by σ. W and b are the weights and biases of the respective gates or of the cell state (4). x_t is the input, and h_{t-1} is the hidden state from the previous time step. The updated cell state, which serves as a memory, is represented by c_t, and h_t is the resulting hidden state (5). Finally, ⊙ denotes the element-wise multiplication of two vectors.

The forget gate (1) outputs values between 0 and 1 to control whether the previous cell state is kept or forgotten. This is essential in processing very long and complex sequences, as it allows an LSTM cell to learn when to reset its memory contents.
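To make Equations (1)-(5) concrete, a single forward step of a non-peephole cell can be written out as follows. This is an illustrative sketch in plain Python, not the TensorFlow implementation used in the experiments; the `params` dictionary and the helper names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, the gate activation denoted by sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W . v + b for a weight matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a non-peephole LSTM cell, following Eqs. (1)-(5)."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["W_f"], z, params["b_f"])]    # (1)
    i_t = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]    # (2)
    o_t = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]    # (3)
    g_t = [math.tanh(a) for a in affine(params["W_c"], z, params["b_c"])]  # candidate state
    c_t = [c * f + i * g for c, f, i, g in zip(c_prev, f_t, i_t, g_t)]     # (4)
    h_t = [math.tanh(c) * o for c, o in zip(c_t, o_t)]                     # (5)
    return h_t, c_t


# Toy cell with one hidden unit and one input; all weights and biases are zero,
# so every gate outputs 0.5 and the candidate state is 0.
zeros = {name: [[0.0, 0.0]] for name in ("W_f", "W_i", "W_o", "W_c")}
zeros.update({name: [0.0] for name in ("b_f", "b_i", "b_o", "b_c")})
h_t, c_t = lstm_step(x_t=[1.0], h_prev=[0.0], c_prev=[2.0], params=zeros)
```

In this toy case the forget gate halves the previous cell state, which shows how a gate value between 0 and 1 scales how much memory is retained.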

6 CIDDS-001 OpenStack data set

The CIDDS-001 data set is a flow-based data set divided into two sections according to where the traffic was captured. In this paper, the traffic recorded in the OpenStack environment is used to evaluate the performance of the models. The first week of the data set is used to train the models, while the second week is used to evaluate them. The attributes of the data set are the common properties found in a NetFlow record, as can be seen in Table 2. Due to privacy concerns, the authors anonymized the first three bytes of the public addresses and replaced the addresses of both the DNS and the external server.

Table 2
NetFlow features of CIDDS-001 dataset

Attributes	Type	Description
Date first seen	Continuous	Start time of the flow
Duration	Continuous	Duration of the flow
Proto	Symbolic	Protocol type
Src IP Addr	Symbolic	Source IP Address
Src Pt	Symbolic	TCP/UDP Source Port
Dst IP Addr	Symbolic	Destination IP Address
Dst Pt	Symbolic	TCP/UDP Destination Port
Packets	Continuous	Packet count of the flow
Bytes	Continuous	Byte count of the flow
Flows	Continuous	Number of aggregated flows
Flags	Symbolic	Bitwise OR of TCP Flags
Tos	Symbolic	Type of Service in decimal

The data set was captured at the OpenStack router over a period of four weeks. In the first two weeks, four different attacks were deployed intermittently alongside benign activities. The attacks are automated, and the characteristics of each attack follow the behavior presented in Table 3. Since the flows are captured in unidirectional NetFlow format, a victim class is supplied as well, in contrast to the conventional binary normal and anomaly labels. The victim label is given to a flow if its Dst IP Addr is the source of an attack. Table 4 shows the number and type of flows captured in the OpenStack environment.

Table 3

Attacks executed in OpenStack environment

Attack Types	Description
Ping Scan	Scanning is performed on a randomly selected subnet to discover online hosts.
Port Scan	SYN scan to acquire running services on specified host.
Brute Force	SSH brute force attack on a randomly selected IP obtained from last scanning session.
DoS	10,000 connections are initiated via HTTP service for every Denial of Service attack.

Table 4

Classes of CIDDS-001 OpenStack dataset

OpenStack Traffic	Class	Attack Type	No. of instances
Week 1	Normal	–	7010897
	Attacker	Ping Scan	2382
		Port Scan	116774
		Brute Force	1131
		DoS	625943
	Victim	Ping Scan	977
		Port Scan	66737
		Brute Force	495
		DoS	626184
Week 2	Normal	–	8515329
	Attacker	Ping Scan	1752
		Port Scan	51403
		Brute Force	2946
		DoS	854274
	Victim	Ping Scan	979
		Port Scan	31004
		Brute Force	420
		DoS	852626
Week 3	Normal	–	6349783
Week 4	Normal	–	6175897

7 Experimentation setup

7.1 Preprocessing stage

In this stage, all features are preprocessed before they are used to train the models. The start time of the flow is first converted into two dimensions, which represent a sine and a cosine wave with a period of one week.
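One possible realization of this cyclical encoding is sketched below. The choice of Monday 00:00 as the phase origin and the function name `encode_start_time` are assumptions made for illustration; the phase actually used in the experiments is not specified here.

```python
import math
from datetime import datetime

SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def encode_start_time(ts):
    """Map a timestamp to (sin, cos) of its position within a one-week period."""
    # Seconds elapsed since Monday 00:00 of the same week (assumed origin).
    seconds = ((ts.weekday() * 24 + ts.hour) * 60 + ts.minute) * 60 + ts.second
    seconds += ts.microsecond / 1e6
    angle = 2.0 * math.pi * seconds / SECONDS_PER_WEEK
    return math.sin(angle), math.cos(angle)


monday_midnight = encode_start_time(datetime(2017, 3, 13, 0, 0, 0))
thursday_noon = encode_start_time(datetime(2017, 3, 16, 12, 0, 0))
```

Using both components keeps the end of one week adjacent to the start of the next, which a single linear timestamp would not.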

Protocol, flags, and type of service attributes are then transformed into binary vector representations. Binary encoding is used for the protocol field, while one-hot encoding is applied to the latter two. The resulting encodings have five dimensions for the protocol attribute and four each for flags and type of service.

The flows attribute is also removed from the data set, as the value remains identical for all instances throughout the weeks.

In the case of IP addresses, binary encoding results in poor performance, and one-hot encoding is not applied because of the large number of distinct values. Instead, IP addresses are bucketized into seven categories based on the identity and locality of the hosts. IP addresses originating from the OpenStack network are mapped to their respective subnets: “server”, “management”, “office”, or “developer”. As for hosts residing outside the network, the DNS server, the external server, and public addresses are each given their own category.
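The bucketization can be sketched as a simple lookup. The subnet ranges and the `DNS` and `EXT_SERVER` tokens below are assumptions about how the published files represent these hosts, and `bucketize_ip` is a hypothetical helper; the mapping used in the experiments may differ in detail.

```python
import ipaddress

# Assumed address ranges of the four OpenStack subnets (illustrative only).
SUBNETS = {
    "server": ipaddress.ip_network("192.168.100.0/24"),
    "management": ipaddress.ip_network("192.168.200.0/24"),
    "office": ipaddress.ip_network("192.168.210.0/24"),
    "developer": ipaddress.ip_network("192.168.220.0/24"),
}


def bucketize_ip(addr):
    """Map an address field to one of seven categories."""
    if addr == "DNS":           # assumed placeholder for the DNS server
        return "dns"
    if addr == "EXT_SERVER":    # assumed placeholder for the external server
        return "external_server"
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Anonymized public addresses are not valid IP literals.
        return "public"
    for name, network in SUBNETS.items():
        if ip in network:
            return name
    return "public"
```

Each category can then be one-hot encoded, giving seven dimensions per address field instead of one dimension per distinct address.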

In this experiment, port numbers are treated as numerical values, and the resulting performance is compared with that of binary-encoded ports.
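For reference, binary encoding of a port number can be sketched as below. A width of 16 bits is assumed because TCP and UDP ports range from 0 to 65535; the width is not stated above, and `binary_encode_port` is a hypothetical helper.

```python
def binary_encode_port(port, width=16):
    """Return the port number as a list of bits, most significant bit first."""
    if not 0 <= port < 2 ** width:
        raise ValueError(f"port {port} does not fit in {width} bits")
    return [(port >> shift) & 1 for shift in range(width - 1, -1, -1)]


http_bits = binary_encode_port(80)
```

Compared with a single numeric value, this representation avoids implying that nearby port numbers denote similar services, at the cost of 16 input dimensions per port.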

The features of the data set are then normalized into the range [0, 1] using the Min-Max normalization technique (6). The parameters of the normalization are calculated on the first week and are applied to both the training and testing sets. x_i denotes the value of the ith feature and x_i′ its normalized value, whereas min(x_i) and max(x_i) represent the minimum and maximum values of the ith feature in the training set, respectively.

$x_{i}^{\prime} = \frac{x_{i} - \min (x_{i})}{\max (x_{i}) - \min (x_{i})}$ (6)
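The following sketch shows Equation (6) with the parameters fitted on the training set only. The function names are hypothetical, and mapping constant features to 0 is an assumption added to avoid division by zero.

```python
def fit_min_max(train_rows):
    """Compute per-feature minima and maxima on the training set only."""
    columns = list(zip(*train_rows))
    return [min(col) for col in columns], [max(col) for col in columns]


def apply_min_max(rows, mins, maxs):
    """Apply Eq. (6) using previously fitted training-set parameters."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0   # constant feature: assumed 0
         for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


train = [[0.0, 10.0], [4.0, 30.0]]
test = [[2.0, 40.0]]
mins, maxs = fit_min_max(train)
train_scaled = apply_min_max(train, mins, maxs)
test_scaled = apply_min_max(test, mins, maxs)
```

Note that a test value beyond the training maximum, such as the second feature in `test`, is mapped outside [0, 1]; whether such values are clipped is left open here.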

7.2 Performance metrics

The goal of an IDS is to minimize False Positives (FPs) and False Negatives (FNs). Sensitivity is prioritized if the protection of an asset is of utmost importance; specificity is prioritized when efficiency is the concern [37]. The cost of misclassification in this domain is relatively high [52], as every false positive requires a security analyst's time to investigate the reported incident. Meanwhile, an attack that evades detection by an IDS can result in serious damage.

Some of the more common metrics used in this experiment are shown below. The first metric is sensitivity, also known as True Positive Rate (TPR) or recall. Sensitivity measures the proportion of instances of a positive class (anomalous events) that are classified as such:

$Sensitivity, TPR = \frac{TP}{TP + FN}$ (7)

The next metric is precision, also known as Positive Predictive Value (PPV). The precision of a particular class is the proportion of instances predicted as that class that truly belong to it:

$Precision, PPV = \frac{TP}{TP + FP}$ (8)

TP denotes the number of correct predictions on positive class (anomalous traffic).

TN denotes the number of correct predictions on negative class (benign traffic).

FP denotes the number of instances predicted as anomalous that are not actually anomalous.

FN denotes the number of anomalous instances that are not detected.

The instances in the data set are skewed toward the normal class, as in most NIDS data sets. To compare performance in multiclass classification, macro-averaged values of the aforementioned metrics (7, 8) are used. Macro averaging gives every class the same weight regardless of the number of instances in each class. The formulas are:

${TPR}_{macro} = \frac{1}{| C |} \sum_{\lambda = 1}^{| C |} {TPR}_{\lambda}$ (9)

${PPV}_{macro} = \frac{1}{| C |} \sum_{\lambda = 1}^{| C |} {PPV}_{\lambda}$ (10)

where |C| denotes the total number of classes and λ indexes the individual classes.
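Equations (7)-(10) can be computed directly from a confusion matrix, as in the sketch below. The layout (rows for actual classes, columns for predicted classes) and the convention of scoring an empty denominator as 0 are assumptions of this example; `macro_metrics` is a hypothetical helper.

```python
def macro_metrics(confusion):
    """Return (macro TPR, macro PPV) for a square confusion matrix.

    confusion[a][p] is the number of instances of actual class a
    that were predicted as class p.
    """
    n = len(confusion)
    tprs, ppvs = [], []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # missed instances of class k
        fp = sum(confusion[a][k] for a in range(n)) - tp  # others predicted as class k
        tprs.append(tp / (tp + fn) if tp + fn else 0.0)   # Eq. (7) per class
        ppvs.append(tp / (tp + fp) if tp + fp else 0.0)   # Eq. (8) per class
    return sum(tprs) / n, sum(ppvs) / n                   # Eqs. (9) and (10)


# Toy imbalanced example: 100 instances of class 0 and 10 of class 1.
tpr_macro, ppv_macro = macro_metrics([[90, 10], [5, 5]])
```

In this toy example the minority class contributes half of each macro average although it holds under a tenth of the instances, which is why weak detection of a rare class lowers the macro scores noticeably.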

8 Results and discussions

A total of five notable machine learning algorithms, namely C4.5 (better known as J48 in Weka), Random Forest, Hoeffding Tree, Naïve Bayes, and Support Vector Machine, are employed in this experiment. The classifiers are trained and tested in the same manner using the machine learning software Weka.

As for the deep learning method, the vanilla variant of LSTM (without peephole connections) is used, and its performance is compared with that of the other classifiers. The LSTM experiments are implemented with the open source framework TensorFlow 1.4.

All the aforementioned models are evaluated on four different classification tasks:

2-class: normal and anomaly.

5-class: normal traffic, with four different types of attacks.

3-class: normal, attacker and victim.

9-class: normal traffic, four different types of attacks including attacker and victim label.
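The four label sets listed above can be derived from the class and attack type of each flow, as in the following sketch. The string values (`normal`, `attacker`, `victim`, and the attack type names) and the `task_labels` helper are assumptions chosen for illustration.

```python
def task_labels(flow_class, attack_type):
    """Return the label of one flow under each of the four classification tasks."""
    if flow_class == "normal":
        return {task: "normal" for task in ("2-class", "5-class", "3-class", "9-class")}
    return {
        "2-class": "anomaly",                        # normal vs. anomaly
        "5-class": attack_type,                      # normal + four attack types
        "3-class": flow_class,                       # normal, attacker, victim
        "9-class": f"{flow_class}:{attack_type}",    # normal + role x attack type
    }


labels = task_labels("victim", "dos")
```

The 9-class task therefore combines the two roles with the four attack types, giving eight anomalous labels in addition to the normal class.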

The LSTM model is trained with different combinations of hyperparameters, as shown in Table 5, and the results are recorded at the 50th epoch. The performance of the models does not deteriorate substantially within the specified range of hyperparameters. To aid the convergence of the model, both Xavier initialization and the Adam optimizer are used. Gradient clipping is also applied to control exploding gradients.
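Gradient clipping can take several forms, and neither the variant nor the threshold is specified here. As one common possibility, clipping by global norm is sketched below in plain Python; the function name and the threshold value are illustrative assumptions.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so that their joint L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= max_norm:
        return [list(grad) for grad in grads]
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in grads]


# Two gradient vectors with a joint norm of 5, clipped to a norm of 1.
clipped = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```

Because every gradient is scaled by the same factor, the direction of the update is preserved while its magnitude is bounded.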

Table 5
LSTM Models of varied Hyperparameters

Models	λ ₁	λ ₂	λ ₃	λ ₄	λ ₅	λ ₆
L1	32	128	256	3	0.001	0.4
L2	64	256	128	2	0.001	0.4
L3	128	64	64	2	0.01	0.4
L4	128	32	128	1	0.01	0.4
L5	32	128	48	2	0.1	0.4
L6	16	256	112	1	0.01	0.4
L7	30	256	96	2	0.01	0.4
L8	256	64	64	2	0.01	0.4
L9	256	128	64	2	0.01	0.4
L10	160	48	48	2	0.01	0.4

λ₁: Sequence length; λ₂: Mini-batch size; λ₃: Hidden units; λ₄: Hidden layers; λ₅: Learning rate; λ₆: Dropout rate.

Table 6 shows the results obtained using the hyperparameters listed in Table 5. Across the four classification tasks, model L8, which offers a good trade-off between FPs and FNs, is selected for comparison with the other machine learning classifiers.

Table 6

Performance results of LSTM with varied Hyperparameters

	2-class		5-class		3-class		9-class
	Sensitivity	Precision	Sensitivity	Precision	Sensitivity	Precision	Sensitivity	Precision
L1	99.7807	99.9678	61.8345	90.0010	99.8573	99.9454	56.9759	80.7440
L2	99.7914	99.9624	62.8745	72.8244	99.8596	99.9607	57.7905	87.8686
L3	99.7833	99.9654	67.2389	87.1128	99.8547	99.9675	62.5531	93.5662
L4	99.7962	99.9588	66.9806	88.5335	99.8613	99.9594	63.1245	86.1189
L5	99.7704	99.9687	74.2559	89.1770	99.8517	99.9753	64.7312	60.7247
L6	99.7920	99.9634	63.6066	89.3173	99.8592	99.9698	61.3158	82.3857
L7	99.7920	99.9680	64.8599	92.1000	99.8586	99.9720	59.6177	83.2806
L8	99.7896	99.9595	68.8949	93.3100	99.8608	99.9518	66.3028	93.4608
L9	99.7859	99.9589	63.2345	93.2186	99.8557	99.9564	63.0385	88.0466
L10	99.7876	99.9582	66.2227	87.3372	99.8598	99.9612	58.9577	80.9101

The results of the vanilla LSTM and the other models are summarized in Tables 7 and 8, where the models are denoted by their abbreviations. Table 7 reports the results when port numbers are treated as numeric values, while Table 8 presents the change in performance when the ports are binary-encoded, relative to the results in Table 7. No significant differences are observed between the two port representations. In some cases, binary encoding performs only slightly better than leaving the port numbers untreated.

Table 7

Performance comparison on numeric-valued ports

Models	2-class		5-class		3-class		9-class
	Sensitivity	Precision	Sensitivity	Precision	Sensitivity	Precision	Sensitivity	Precision
J48	68.6611	94.1095	71.5143	85.3080	58.1772	96.0378	67.3480	69.0403
RF	99.8742	99.9469	66.9978	65.8135	99.8379	99.9459	64.7606	62.7756
HT	96.9872	99.2978	47.6400	71.1249	83.9242	98.3021	46.8859	55.8859
NB	97.6467	91.1563	75.4425	44.6302	96.5947	81.0740	73.4948	37.8003
SVM	99.8840	99.9583	59.2222	58.6743	99.8505	99.9740	54.8570	65.3480
LSTM L8	99.7896	99.9595	68.8949	93.3100	99.8608	99.9518	66.3028	93.4608

Table 8

Performance gain on binary encoded ports

Models	2-class		5-class		3-class		9-class
	Sensitivity	Precision	Sensitivity	Precision	Sensitivity	Precision	Sensitivity	Precision
J48	+0.0029	+1.9241	–4.5573	–18.6922	+7.7929	+0.6890	–7.7365	–6.7371
RF	+0.0024	+0.0049	+1.0610	+0.6316	–0.0033	+0.0076	–0.0335	–0.0091
HT	–0.0468	–1.3884	+1.3441	–12.2209	–14.0721	–1.7981	–6.5821	+2.0964
NB	–2.5982	–3.8008	+0.3362	+5.3960	–2.1000	–4.1978	+0.4000	+6.1323
SVM	–0.0001	–0.0003	+0.0516	+20.0472	+0.0001	–0.0055	+3.2482	+7.5333
LSTM L8	+0.0667	–0.0127	–2.4401	+1.1623	–0.2240	+0.0063	–0.0103	+4.3981

From Table 7, it can be observed that the vanilla LSTM is able to achieve consistent results while achieving optimal sensitivity and precision trade-off in most classification tasks.

Naïve Bayes algorithm is able to obtain the highest sensitivity in 5-class and 9-class classification tasks. However, it suffers greatly from false positives by having benign traffic predicted as attacks, resulting in lowest precision compared to other models. Whereas in the case of J48 and Hoeffding Tree, the opposite held true. They have comparatively low performances in sensitivity, which are caused by anomalous traffic not being detected (FNs).

In the case of SVM, high performances are achieved in binary and 3-class classification tasks. But the outcome in 5-class and 9-class diminished, as it fails to detect both ping scan and brute force activities.

Another observation is the low scores in both sensitivity and precision. These are caused by the FPs and FNs of each class, since all classes are evaluated with the same weight, without taking the number of instances into consideration (9-10). The effect is compounded by the failure of all classifiers to detect brute force attacks.
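The effect of equal class weighting can be illustrated with a minimal sketch. It assumes the unweighted (macro) average described above, with rows of the confusion matrix as actual classes and columns as predicted classes. The 3-class matrix is hypothetical and is not taken from the experiments; the function names are illustrative.

```python
def per_class_scores(cm):
    """Return (sensitivity, precision) per class from a square confusion matrix.

    cm[i][j] is the number of instances of actual class i predicted as class j.
    """
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # row k, off-diagonal
        fp = sum(cm[r][k] for r in range(n)) - tp   # column k, off-diagonal
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        scores.append((sensitivity, precision))
    return scores


def macro_average(cm):
    """Average the per-class scores with equal weight for every class."""
    scores = per_class_scores(cm)
    n = len(scores)
    return (sum(s for s, _ in scores) / n, sum(p for _, p in scores) / n)


def accuracy(cm):
    """Fraction of all instances that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


# Hypothetical matrix: a large benign class, a well-detected attack class,
# and a small attack class that is never detected.
cm = [
    [980, 10, 10],
    [5, 95, 0],
    [8, 2, 0],
]

macro_sens, macro_prec = macro_average(cm)
print(f"accuracy={accuracy(cm):.4f}")
print(f"macro sensitivity={macro_sens:.4f}, macro precision={macro_prec:.4f}")
```

In this toy case the accuracy stays high because the majority class dominates the instance count, while the macro-averaged scores drop sharply because the undetected minority class contributes zero with the same weight as the other classes.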

All things considered, the exceedingly high accuracies of the classifiers are in part due to the use of flow information in detecting probes and DoS attacks [36]. The confusion matrices of LSTM model L8 are shown in Fig. 3.

Fig. 3

Confusion matrices of LSTM model L8 for all classification tasks: 2-class (top left), 5-class (top right), 3-class (bottom left), and 9-class (bottom right), taken at the 50th epoch. Each row represents the instances of an actual class, and each column represents the instances of a predicted class.

9 Conclusion

In this paper, all labeled network IDS data sets have been reviewed and compared against the widely popular yet outdated DARPA’98 data set. The adoption of network flow in NIDS, in contrast to the conventional DPI approach, is also examined. Furthermore, various machine learning algorithms, including the deep learning model LSTM, are evaluated on a flow-based data set. CIDDS-001 OpenStack, one of the latest unidirectional NetFlow data sets, is prepared for the evaluation. The models are then compared and studied in both binary and multiclass classification tasks. The experimental results show the capability of the vanilla LSTM in handling flow-based traffic.

Acknowledgments

This research work was supported by the Fundamental Research Grant Scheme (FRGS) under the Ministry of Education and Multimedia University, Malaysia (Project ID: MMUE/160029).

References

1. Sperotto, Schaffrath, Sadre, Morariu, Pras and Stiller, An overview of IP flow-based intrusion detection, IEEE Commun Surv Tutorials 12(3) (2010), 343–356.

2. Quittek, Zseby, Claise and Zander, Requirements for IP Flow Information Export (IPFIX), 2004.

3. Li, Springer, Bebis and Hadi, A survey of network flow applications, J Netw Comput Appl 36 (2013), 567–581.

4. Umer M.F., Sher and Bi, Flow-based intrusion detection: Techniques and challenges, Comput Secur 70 (2017), 238–254.

5. Golling, Hofstede and Koch, Towards multi-layered intrusion detection in high-speed networks, in International Conference on Cyber Conflict, CYCON, 2014, pp. 191–206.

6. Lippmann R.P. et al., Evaluating intrusion detection systems: The DARPA off-line intrusion detection evaluation, in Proceedings - DARPA Information Survivability Conference and Exposition, DISCEX 2000, vol. 2, 2000, pp. 12–26.

7. Lippmann, Haines J.W., Fried D.J., Korba and Das, DARPA off-line intrusion detection evaluation, Comput Networks 34(4) (2000), 579–595.

8. Kendall, A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems, 1999.

9. Stolfo S.J., Fan, Lee, Prodromidis and Chan P.K., Cost-based modeling for fraud and intrusion detection: Results from the JAM project, in Proceedings - DARPA Information Survivability Conference and Exposition, DISCEX 2000, vol. 2, 2000, pp. 130–144.

10. McHugh, Testing intrusion detection systems: A critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory, ACM Trans Inf Syst Secur 3(4) (2000), 262–294.

11. Mahoney M.V. and Chan P.K., An analysis of the DARPA/Lincoln laboratory evaluation data for network anomaly detection, Int Symp Recent Adv Intrusion Detect 2820 (2003), 220–237.

12. Tavallaee, Bagheri, Lu and Ghorbani A.A., A detailed analysis of the KDD CUP 99 data set, in IEEE Symposium on Computational Intelligence for Security and Defense Applications, CISDA 2009, 2009, pp. 1–6.

13. Perona, Gurrutxaga, Arbelaitz, Martín J.I., Muguerza and Pérez J.M., Service-independent payload analysis to improve intrusion detection in network traffic.

14. Song, Takakura and Okabe, Cooperation of intelligent honeypots to detect unknown malicious codes, in Proceedings - WOMBAT Workshop on Information Security Threats Data Collection and Sharing, WISTDCS 2008, 2008, pp. 31–39.

15. Sperotto, Sadre, Van Vliet and Pras, A labeled data set for flow-based intrusion detection, Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics), vol. 5843 LNCS, 2009, pp. 39–50.

16. Garcia, Grill, Stiborek and Zunino, An empirical comparison of botnet detection methods, Comput Secur 45 (2014), 100–123.

17. S. Garcia, CTU-13-Extended. [Online]. Available: https://www.stratosphereips.org/blog/2015/7/17/new-dataset-ctu-13-extended-now-includes-pcap-files-of-normal-traffic. [Accessed: 14-Feb-2018].

18. Shiravi, Shiravi, Tavallaee and Ghorbani A.A., Toward developing a systematic approach to generate benchmark datasets for intrusion detection, Comput Secur 31 (2012), 357–374.

19. Moustafa and Slay, UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set), 2015 Mil Commun Inf Syst Conf, 2015, pp. 1–6.

20. Fernández G.M., Camacho, Magán-Carrión, García-Teodoro and Theron, UGR’16: A new dataset for the evaluation of cyclostationarity-based network IDSs, Comput Secur (2017).

21. Ring, Wunderlich, Grüdl, Landes and Hotho, Flow-based benchmark data sets for intrusion detection, in European Conference on Information Warfare and Security, ECCWS, 2017, pp. 361–369.

22. DARPA Intrusion Detection Data Sets, MIT Lincoln Laboratory. [Online]. Available: https://www.ll.mit.edu/ideval/data/. [Accessed: 20-Nov-2017].

23. Hettich and Bay S.D., KDD Cup Data, Irvine, CA: University of California, Department of Information and Computer Science, 1999. [Online]. Available: https://kdd.ics.uci.edu/ [Accessed: 16-Nov-2017].

24. Perona, Arbelaitz, Gurrutxaga, Martín J.I., Muguerza and Pérez J.M., Generation of the database gurekddcup, 2016.

25. University of New Brunswick, Canadian Institute for Cybersecurity (CIC) Datasets. [Online]. Available: http://www.unb.ca/cic/datasets/index.html. [Accessed: 20-Nov-2017].

26. Song, Takakura, Okabe, Eto, Inoue and Nakao, Statistical analysis of honeypot data and building of Kyoto 2006+ dataset for NIDS evaluation, in Proceedings of the First Workshop on Building Analysis Datasets and Gathering Experience Returns for Security - BADGERS ’11, 2011, pp. 29–36.

27. A. Sperotto, R. Sadre, F. Van Vliet and A. Pras, Labeled Dataset for Intrusion Detection. [Online]. Available: https://www.simpleweb.org/wiki/index.php/Labeled_Dataset_for_Intrusion_Detection. [Accessed: 16-Nov-2017].

28. S. García, M. Grill, J. Stiborek, A. Zunino, CTU-13 Dataset. [Online]. Available: https://www.stratosphereips.org/datasets-ctu13/. [Accessed: 14-Feb-2018].

29. N. Moustafa and J. Slay, UNSW-NB15 data set. [Online]. Available: https://www.unsw.adfa.edu.au/australian-centre-for-cyber-security/cybersecurity/ADFA-NB15-Datasets/ [Accessed: 16-Nov-2017].

30. Fernández G.M., Camacho J., Magán-Carrión R., García-Teodoro P., Theron R., UGR’16 Dataset. [Online]. Available: https://nesg.ugr.es/nesg-ugr16/ [Accessed: 14-Feb-2018].

31. Ring, Wunderlich, Grüdl, Landes and Hotho, CIDDS - Coburg Intrusion Detection Data Sets: Hochschule Coburg. [Online]. Available: https://www.hs-coburg.de/forschung-kooperation/forschungsprojekte-oeffentlich/ingenieurwissenschaften/cidds-coburg-intrusion-detection-data-sets.html. [Accessed: 20-Nov-2017].

32. Sharafaldin, Lashkari A.H. and Ghorbani A.A., Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization, in Proceedings of the 4th International Conference on Information Systems Security and Privacy - Volume 1: ICISSP, 2018, pp. 108–116.

33. Sharafaldin, Gharib, Lashkari A.H. and Ghorbani A.A., Towards a reliable intrusion detection benchmark dataset, Softw Netw 2017(1) (2017), 177–200.

34. Habibi Lashkari, Draper Gil, Mamun M.S.I. and Ghorbani A.A., Characterization of Tor Traffic using Time based Features, in Proceedings of the 3rd International Conference on Information Systems Security and Privacy, 2017, pp. 253–262.

35. Ahmed, Naser Mahmood and Hu, A survey of network anomaly detection techniques, Journal of Network and Computer Applications 60 (2016), 19–31.

36. Catania C.A. and García, Automatic network intrusion detection: Current techniques and open issues, Comput Electr Eng 38(5) (2012), 1062–1072.

37. Bhuyan M.H., Bhattacharyya D.K. and Kalita J.K., Network anomaly detection: Methods, Systems and Tools, IEEE Commun Surv Tutorials (2014).

38. Goodfellow, Bengio and Courville, Deep learning, MIT Press, Cambridge, 2016.

39. Lecun, Bengio and Hinton, Deep learning, Nature 521(7553) (2015), 436–444.

40. Hinton G.E., Osindero and Teh Y.-W., A fast learning algorithm for deep belief nets, Neural Comput 18(7) (2006), 1527–1554.

41. Fischer and Igel, An Introduction to Restricted Boltzmann Machines, Springer, Berlin, Heidelberg, 2012, pp. 14–36.

42. Salama M.A., Eid H.F., Ramadan R.A. and Darwish, Hybrid Intelligent Intrusion Detection Scheme, Springer, Berlin, Heidelberg, 2011, pp. 293–303.

43. Gao, Gao and Wang, An Intrusion Detection Model Based on Deep Belief Networks, 2014 Second Int Conf Adv Cloud Big Data, 2014, pp. 247–252.

44. Alrawashdeh and Purdy, Toward an online anomaly intrusion detection system based on deep learning, Proc - 2016 15th IEEE Int Conf Mach Learn Appl ICMLA 2016, 2017, pp. 195–200.

45. Sheikhan, Jadidi and Farrokhi, Intrusion detection using reduced-size RNN based on feature grouping, Neural Comput Appl 21(6) (2012), 1185–1190.

46. Chuan-long, Yue-fei, Jin-Long and Xin-zheng, A deep learning approach for intrusion detection using recurrent neural networks, IEEE Access 5 (2017), 1–1.

47. Kim and Kim, Applying recurrent neural network to intrusion detection with hessian free optimization, in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9503, 2016, pp. 357–369.

48. Hochreiter and Schmidhuber, Long short-term memory, Neural Comput 9(8) (1997), 1735–1780.

49. Gers F.A., Schmidhuber and Cummins, Learning to forget: Continual prediction with LSTM, Neural Comput 12(10) (1999), 2451–2471.

50. Gers F.A. and Schmidhuber, Recurrent nets that time and count, in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks IJCNN 2000 Neural Computing: New Challenges and Perspectives for the New Millennium, vol. 3, 2000, pp. 189–194.

51. Greff, Srivastava R.K., Koutnik, Steunebrink B.R. and Schmidhuber, LSTM: A search space odyssey, IEEE Trans Neural Networks Learn Syst 28(10) (2017), 2222–2232.

52. Sommer and Paxson, Outside the Closed World: On Using Machine Learning for Network Intrusion Detection, in Security and Privacy (SP), 2010 IEEE Symposium on, 2010, pp. 305–316.