Abstract
This paper is devoted to the problem of class imbalance in machine learning, focusing on the detection of rare intrusion classes in computer networks. Class imbalance occurs when examples of one class heavily outnumber examples of the other classes. We are particularly interested in classifiers, as both pattern recognition and anomaly detection can be treated as classification problems. Since the major part of the network traffic of any organization is benign and malignant traffic is rare, researchers have to deal with a class imbalance problem. Substantial research has been undertaken to identify methods or data features that allow these attacks to be identified accurately. The usual tactic for dealing with class imbalance, however, is to label all malignant traffic as one class and then solve a binary classification problem. In this paper, we choose not to group or drop rare classes but instead investigate what can be done to achieve good multi-class classification efficiency. Rare class records were up-sampled using the SMOTE method (Chawla et al., 2002) to preset ratio targets. Experiments with three network traffic datasets, namely CIC-IDS2017, CSE-CIC-IDS2018 (Sharafaldin et al., 2018) and LITNET-2020 (Damasevicius et al., 2020), were performed with the aim of achieving reliable recognition of the rare malignant classes available in these datasets.
Popular machine learning algorithms were chosen to compare their readiness to support rare class detection. The related hyperparameters were tuned within a wide range of values, different feature selection methods were used, and tests were executed with and without over-sampling to evaluate multi-class classification performance on rare classes.
Machine learning algorithms ranking based on Precision, Balanced Accuracy Score,
Keywords
Introduction
Detection of intrusions into networks, information systems or workstations, as well as detection of malware and unauthorized activities of individuals, has emerged as a global challenge. Part of the cyber defence challenge is addressed by optimizing intrusion detection systems (IDS). There are three methods of intrusion detection (Koch, 2011): known pattern recognition (signature-based), anomaly-based detection, and a hybrid of the previous two. Anomaly-based detection is currently implemented mainly as a support for zero-day network perimeter defence of big infrastructures and network operators, while signature-based intrusion prevention remains the main mode of defence for most businesses and households. Pattern recognition and anomaly detection can both be seen as classification problems, that is, problems in which the variable to be predicted is categorical. In network traffic, the benign data is most often represented by a large number of examples, while malignant traffic appears extremely rarely or is an absolute rarity. This is known as the class imbalance problem and is a known obstacle to the induction of good classifiers by Machine Learning (ML) algorithms (Batista et al., 2004).
He and Ma (2013) define imbalanced learning as the learning process for data representation and information extraction with severe data distribution skews to develop effective decision boundaries to support the decision-making process. They also introduced informal conventions for the classification of imbalanced datasets. A dataset where the most common class is less than twice as common as the rarest class would be marginally imbalanced. A dataset with an imbalance ratio of about 10 : 1 would be modestly imbalanced, and a dataset with imbalance ratios above 1000 : 1 would be extremely imbalanced. This sort of imbalance is found in medical record databases regarding rare diseases, or in the production of electronic equipment, where non-faulty examples heavily outnumber faulty examples. Cases where negative to positive ratios are close to or higher than 1 000 000 : 1 are called absolute rarity imbalance. This sort of imbalance is found in cyber security, where all but a few network traffic flows are benign. However, standard ML algorithms are still capable of inducing good classifiers for extremely imbalanced training sets. This shows that class imbalance is not the only problem responsible for the decrease in performance of learning algorithms. Batista et al. (2004) have demonstrated that poor class separation is often caused in part by an overlap of classes due to a lack of feature separation. Another reason could be a lack of attributes specific to a certain decision boundary. It is known that in cases where the negative class has an internal structure (a multimodal class), an overlap between the negative and positive classes can be observed on a few of the clusters within the negative class.
This study reports results of the empirical research executed with selected supervised machine learning classification algorithms in an attempt to compare their efficiency for intrusion detection and get improved results compared to other published studies. The study consists of the following sections: Section 2, introduction of the data sources, Section 3, a review of machine learning methods and model benchmark metrics used in this study, Section 4, an overview of the experiment and pre-processing steps, Section 5, results and conclusions.
Contribution
The research question raised in this study is which supervised machine learning method consistently provides the best multi-class classification results with large and highly imbalanced network datasets. To answer this question we chose the CIC-IDS2017, CSE-CIC-IDS2018 (Sharafaldin et al., 2018) and LITNET-2020 (Damasevicius et al., 2020) datasets, as they are recent realistic software-generated network traffic datasets and meet the required criteria (Gharib et al., 2016) for a good network intrusion dataset. Based on rankings of performance metrics and bias-variance decomposition, the answer is that the tree ensembles Adaboost, Random Forest Trees and Gradient Boosting Classifier performed best on the network intrusion datasets used in this research.
The novelty of this research lies in the proposed methodology (see Section 4) and its application to the recent and not yet studied in depth LITNET-2020 dataset. A review of the compliance of the LITNET-2020 dataset with the criteria raised by Gharib et al. (2016) is first introduced in Section 2.2. A variant of random under-sampling (skewed ratio under-sampling, proposed by the authors and discussed in Section 3.1) is used to reduce the imbalance of classes in a nonlinear fashion. SMOTE up-sampling for numeric data and SMOTE-NC for categorical data (see Section 3.2) are executed to increase the representation of rare classes. Further in this research, a comparison of the multi-class classification performance on the CIC-IDS2017 and CSE-CIC-IDS2018 datasets with that on the LITNET-2020 dataset is discussed in Section 5. Macro-averaged multi-class performance metrics are implemented in this research. Balanced accuracy (Formula (2)) and geometric mean of recall (Formula (4)) are implemented for the LITNET-2020 dataset for the first time (see results in Tables 16 and 17). Multi-criteria scoring is cross-validated by testing on data previously unseen by the models (see Section 4). For decision tree ensemble methods, instead of the weak CART base classifiers, the parameters tree depth and alpha were grid-searched and validated using the method of maximum cost path analysis (Breiman et al., 1984), see Section 3.8. An additional ML model, Gradient Boosting Classifier, utilizing an ensemble of Classification and Regression Trees (CART), was introduced for benchmarking in this research via the XGBoost library (Chen and Guestrin, 2016) with GPU support (see Section 3.5.6). In our methodology, due to the highly imbalanced nature of the data used, cost-sensitive method implementations were chosen. These choices lead to better results (see Table 20) compared to other reviewed studies.
Furthermore, selection of models with better generalization capabilities in this research is achieved through decomposition of classification error into bias and variance (see results in Table 18).
Datasets Used
The following section presents a review of datasets considered for this research together with arguments for the choice made.
Datasets Considered for Analysis
There are many datasets that have been used by the researchers to evaluate the performance of their proposed intrusion detection and intrusion prevention approaches. Far from being complete, the list includes: DARPA 1998 (Lippmann et al., 1999) and 1999 traces by Lincoln Laboratory, USA, KDD’99 (Hettich and Bay, 1999), CAIDA (The Cooperative Association for Internet Data Analysis, 2010) datasets by University of California, USA, the Internet Traffic Archive and LBNL traces by Lawrence Berkeley National Laboratory, USA (Lawrence Berkeley National Laboratory, 2010), DEFCON by The Shmoo Group (2011), ISCX IDS 2012 (Shiravi et al., 2012), CIDDS-001 (Coburg Intrusion Detection Data Set) (Ring et al., 2017) and others. However, it has been widely acknowledged that machine learning research in an intrusion detection area needs to include new attack types and therefore researchers should consider more recent data sources.
In this research, three recent network datasets, suggested by their authors for intrusion detection research and compliant with the criteria described in Section 2.2, are explored. The datasets chosen are CIC-IDS2017 and CSE-CIC-IDS2018 (Sharafaldin et al., 2018) by the University of New Brunswick, Canada, and LITNET-2020 (Damasevicius et al., 2020). These datasets are of significant volume, contain anonymized real academic network traffic and are suited for multiple purposes of machine learning. LITNET-2020 is a new dataset that is given particular attention in this research, with a discussion of its compliance with the dataset suitability criteria devised by Gharib et al. (2016).
Requirements for Cybersecurity Datasets
Criteria for building such datasets are discussed by Małowidzki et al. (2015), Buczak and Guven (2016), Maciá-Fernández et al. (2018), Ring et al. (2019), Damasevicius et al. (2020), and others.
Małowidzki et al. (2015) define the following features of a good dataset: it must contain recent data, be realistic, contain all typical attacks met in the wild, be labelled, be correct regarding operating cycles in enterprises (working hours), and should be flow-based. Ring et al. (2019) contend that a good dataset should be comparable with real traffic and therefore have more normal than malicious traffic, since most of the traffic within a company is normal and only a small part is malicious. A detailed framework and analysis of criteria for such datasets is proposed by the Canadian Institute for Cybersecurity (CIC) at the University of New Brunswick. Gharib et al. (2016) have proposed eleven dataset selection criteria. These criteria are presented in Table 1. Following the publication of the criteria, CIC created a list of new datasets addressing issues of compliance with these criteria. The creation of CSE-CIC-IDS2018 followed with improvements, such as a decreased number of duplicates and uncertainties. Thakkar and Lohiya (2020) in Sections 4.1 and 4.2, Tables 4 and 5, and Karatas et al. (2020) in Sections III.C (CIC-IDS2017) and III.D (CSE-CIC-IDS2018) provide discussion and support for these claims.
Dataset compliance criteria by Gharib et al. (2016).
The LITNET-2020 dataset was selected for the current study as complying with most of the above-mentioned requirements, with some reservations regarding the interaction completeness, heterogeneity and feature set completeness criteria.
These eleven criteria as applied to LITNET-2020 are discussed below.
Complete network configuration: In order to investigate the real course of attacks, it is necessary to test the real network configuration. All of the network flows in this dataset are received or generated at the network of Lithuanian academic institutions, LITNET.
Complete traffic: The dataset accumulates full packet flows from the source to the destination, which can be a workstation computer, router or another specialized service device.
Labelled dataset: The dataset is labelled into a single benign and 12 malignant classes. The benign class is not separately labelled into sub-classes, however, it could be done because the number of benign records is exceeding 36 million records and is close to
Complete interaction: The correct interpretation of the data requires data from the entire network interoperability process. The LITNET-2020 dataset, however, is a pure network traffic dataset with no correlated host memory or host log information.
Record completeness: The LITNET-2020 dataset is compliant with this requirement.
Various protocols: Records of 13 types of protocols for normal and 3 types of protocols for malignant traffic are available in the LITNET-2020 dataset.
Diversity and novelty of attacks: The dataset includes attack flows that were detected between 2019-03-06 (first flow) and 2020-01-31 (last flow).
Anonymity: It is important that the simulated set contain data for which privacy is not important. The LITNET-2020 dataset contains no personally identifiable data.
Heterogeneity: Data from different sources, such as network streams, operating system logs, network equipment logs or memory images, must be available. LITNET-2020 is not compliant with this requirement.
Feature Set/Attribute Linkage: It is important for the research that data from different types of sources for the same event be linked, for example, device memory view, network traffic, and device logs. LITNET-2020 is not compliant with this requirement as it contains no linked host sources.
Metadata and documentation: Information about attributes, how the traffic was generated or collected, network configuration, attackers and victims, machine operating system versions and attack scenarios is required to do the research. LITNET-2020 is documented in Damasevicius et al. (2020).
In datasets selected for the research, the benign class takes from
Dataset content split.
The following Table 3 presents the split of malignant classes and is a summary of dataset imbalance shares in accordance with the taxonomy described by He and Ma (2013):
Dataset imbalance.
1Share of records in imbalance category.
The following Table 4 represents a summary of extremely imbalanced (>1000 : 1) classes in the three selected datasets.
Extremely rare classes in the datasets.
1DDOS attack.
Various imbalance measures are discussed by Ortigosa-Hernández et al. (2017) in a study dedicated to such measures. Karatas et al. (2020), in Section III.E, review the imbalance ratios that are most practical to use for several IDS datasets, including CIC-IDS2017 and CSE-CIC-IDS2018.
Referring to Ortigosa-Hernández et al. (2017) and Karatas et al. (2020), the following Formula (1) can be used for the calculation of the imbalance ratio:
For example, historical NSL-KDD has an imbalance ratio of 648, CIC-IDS2017 has an imbalance ratio of 112 287 and CSE-CIC-IDS2018 has a slightly better imbalance ratio of 53 887. LITNET-2020 has an imbalance ratio of 70 769.
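As a rough illustration of how figures of this kind arise, the sketch below computes an imbalance ratio as the record count of the largest class divided by the record count of the smallest class. This max-to-min form is our assumption about the ratio referred to above; the function name and the toy labels are hypothetical and are not taken from any of the datasets.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (assumed definition)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are needed")
    return max(counts.values()) / min(counts.values())


# Hypothetical toy labels, not taken from any of the datasets.
toy_labels = ["benign"] * 9000 + ["dos"] * 900 + ["infiltration"] * 9
print(imbalance_ratio(toy_labels))  # 1000.0
```

Under this definition, a single very small class is enough to push the ratio of a whole dataset into the extremely imbalanced range, regardless of how the remaining classes are distributed.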
While imbalance ratios are an important part of the discussion, absolute rarity is another concept introduced by He and Ma (2013) for the case when there are not enough records to learn the class. If there is not enough information within the feature space, the decision boundary cannot be determined. There are no such classes in the LITNET-2020 dataset, and the data was sufficient for learning for all the machine learning algorithms used in our experiment. However, the Infiltration, Heartbleed and Web Attack-SQL Injection classes in the CIC-IDS2017 dataset exhibit the behaviour of such absolute rarity, and learning the decision boundaries for these classes is complicated and unspecific. In the CSE-CIC-IDS2018 dataset, even though Infiltration class records are abundant, a high overlap with the benign class is observed.
The CIC-IDS-2017 dataset (Sharafaldin et al., 2018) is made available by the Canadian Institute for Cybersecurity at the University of New Brunswick and introduces labelled data of 14 types of attacks, including DDoS, Brute Force, XSS, SQL Injection, Infiltration, and Botnet. The traffic was emulated in a test environment during the period from July 3 to July 7, 2017. Network traffic features and related aggregates were extracted and generated using the CICFlowMeter tool and made available in the form of 8 CSV files. CICFlowMeter is an open source tool provided by CIC at UNB that generates bidirectional flows from pcap files and extracts features from these flows; it was made available to the research community by Draper-Gil et al. (2016) and further described by Lashkari et al. (2017). The dataset contains a total of 2 830 743 records with flow data and synthetic features, and is labelled.
The following Table 5 is a summary of class representation of this dataset.
Class representation in CIC-IDS2017 dataset.
Dataset features, all measures of duration or related aggregates, further used for this research belong to these categories:
Fiat (Forward Inter Arrival Time, mean, min, max, std): aggregates of the time between two packets sent in the forward direction;
Biat (Backward Inter Arrival Time, mean, min, max, std): aggregates of the time between two packets sent in the backward direction;
Flowiat (Flow Inter Arrival Time, mean, min, max, std): aggregates of the time between two packets sent in either direction;
Active (mean, min, max, std): aggregates on the amount of time a flow was active before going idle;
Idle (mean, min, max, std): aggregates on the amount of time a flow was idle before becoming active;
Flow Bytes/s: Flow bytes sent per second;
Flow Packets/s: Flow packets sent per second;
Duration: The duration of a flow.
The CSE-CIC-IDS2018 dataset (Sharafaldin et al., 2018) is made available by the Canadian Institute for Cybersecurity at the University of New Brunswick. The data was emulated in the CIC test environment, comprising 50 attacking machines, 420 victim PCs and 30 victim servers, during the period from February 14 to March 2, 2018. The dataset contains records from 14 distinct attacks, is labelled and is presented together with anonymised PCAP files. 80 network traffic features were extracted and calculated using the CICFlowMeter tool. Ten CSV files are made available for machine learning, containing 16 232 943 records. The representation of classes in CSE-CIC-IDS2018 ranges from approximately 1 : 20 to 1 : 100 000.
The following Table 6 presents a summary of class representation of this dataset.
Class representation of CSE-CIC-IDS2018 dataset.
1Variants of DoS attacks.
Same dataset features as described in Section 2.5 are used further in this research for selection of features.
LITNET-2020 is a new annotated network dataset for network intrusion detection, obtained from the traffic of the real-life Lithuanian academic network LITNET by researchers from Kaunas University of Technology (KTU). The environment of data collection, a comparison of the dataset with other recently published network intrusion datasets and a description of the attacks represented in the LITNET-2020 dataset are introduced by Damasevicius et al. (2020). The dataset contains benign traffic of the academic network and 12 attack types generated at the KTU-managed LITNET network from March 6, 2019 to January 31, 2020. Network traffic was captured using the open source nfcapd binary format, anonymised and processed into the CSV format, containing 39 603 674 time-stamped records. Nfsen, MeSequel, and Python script tools were used for extra feature generation and pre-processing, with data fields in the CSV format named after the fields generated by Nfdump. The 49 attributes that are specific to the NetFlow v9 protocol as defined in RFC 3954 (Claise, 2004) form the basis of the dataset, further expanded with additional fields of time and TCP flags (in symbolic format), which can be used to identify attacks. An additional 19 attack-specific attributes are added. The representation of classes in LITNET-2020 is imbalanced in a range from approximately 1 : 30 to 1 : 100 000.
The following Table 7 presents a summary of class representation of this dataset.
Class representation of LITNET-2020 dataset.
1Record counts before removing timestamp and related record duplicates.
Several types of methods were used in this research to improve the performance of ML methods. The methods employed can be grouped into pre-processing methods (see Sections 3.1–3.3) and machine learning methods (see Section 3.5). Data record under-sampling methods are discussed in detail in Section 3.1, and record over-sampling in Section 3.2. Feature selection, scaling, frequency transformation and the other pre-processing activities undertaken are discussed in Section 3.3. Machine learning methods capable of cost-sensitive learning (see Section 3.5) were chosen for performance comparison in this paper.
For all models, their hyper-parameters were searched using the GridSearch method, and later multiple performance measures (see Section 3.6) were used to evaluate and compare ML algorithms.
Under-Sampling Methods
The benign class in our datasets constitutes up to
Cleaning under-sampling approaches do not target a specific ratio, but rather clean the feature space based on some empirical criteria (Lemaitre et al., 2016). According to Lemaitre et al. (2016), these criteria are derived from the nearest neighbour rule, namely: (i) condensed nearest neighbours (Hart, 1968), (ii) edited nearest neighbours (Wilson, 1972), (iii) one-sided selection (Kubat and Matwin, 1997), (iv) neighbourhood cleaning rule (Laurikkala, 2001), and (v) Tomek links (Tomek, 1976).
Cleaning under-sampling methods such as Edited Nearest Neighbours, TomekLinks, Condensed Nearest Neighbours were tested, however, due to the size of sub-sampled data and the large computational overhead they require, these methods were not further explored. The fixed random under-sampling was implemented in two steps as follows: Major class records were first randomly under-sampled to a target number of records, such as to provide sufficient learning for all models. Target numbers were obtained after analysis of learning curves. Sufficient learning is defined here as the objective to have learning and testing curves to converge within a margin less than Numbers of benign and other highly imbalanced classes were further transformed with a random under-sampling function from Imbalanced-learn library (Lemaitre et al., 2016) using the number of records per class targets, calculated with the following empirically chosen skewed ratio function
In this paper, to balance minority classes, we investigate random and SMOTE (Synthetic Minority Over-sampling Technique) (Chawla et al., 2002) over-sampling methods. Random over-sampling is a base method that aims to balance class distribution through the random replication of minority class examples. Unfortunately, this can increase the likelihood of classifier overfitting (Batista et al., 2004). Therefore, we removed all duplicates in training data.
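To contrast the two strategies named above: random over-sampling replicates existing records, whereas SMOTE creates a new point on the line segment between a minority example and one of its nearest minority neighbours. The following is a minimal pure-Python sketch of that interpolation step only, assuming numeric features and Euclidean distance. The helper names, the choice of k and the toy points are hypothetical; this is not the Imbalanced-learn implementation.

```python
import random


def k_nearest(x, candidates, k):
    """The k candidates closest to x by squared Euclidean distance (x itself excluded)."""
    others = [c for c in candidates if c is not x]
    return sorted(others, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, c)))[:k]


def smote_point(x, neighbour, rng):
    """Synthetic point x + gap * (neighbour - x) with a random gap in [0, 1)."""
    gap = rng.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)]


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority points by interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)                           # a minority example
        neighbour = rng.choice(k_nearest(x, minority, k))  # one of its k neighbours
        synthetic.append(smote_point(x, neighbour, rng))
    return synthetic


# Hypothetical minority class records with two numeric features.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=5)
print(len(new_points))  # 5
```

Because every synthetic point is a convex combination of two existing minority records, none of the generated records is an exact duplicate in general, which is the property that distinguishes this approach from random replication.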
A more advanced method, capable of increasing minority class size without duplication, is SMOTE. SMOTE forms new minority class examples by linearly interpolating between minority class examples that are close to each other. Thus, the risk of overfitting is mitigated, as the decision boundaries of the classifier for the minority class are moved further away from the minority class space. SMOTE works in feature space, not in data space; therefore, before over-sampling is executed, the first step is to select the numeric features to over-sample, as it is not necessary to over-sample in all dimensions. SMOTE over-sampling is achieved by following these steps: a) take the k nearest neighbours from the minority class for some minority class vector in the feature space, b) randomly choose a vector from those k neighbours, c) take the difference between the vector and its neighbour, multiply the difference by a random number between 0 and 1, and add the result to the original vector to obtain a synthetic point, d) repeat the previous steps until the target number of synthetic points is reached. After this, the new records can be added to the current data (see Chawla et al., 2002, for the complete algorithm). The SMOTE method can be combined with some under-sampling methods to remove examples of all classes that tend to be misclassified. For example, in SMOTE with the Edited Nearest Neighbours (ENN) algorithm (Batista et al., 2004), after SMOTE is used to over-sample the records in the defined minority classes, ENN is used to remove samples from both classes such that any sample that is misclassified by its given number of nearest neighbours is removed from the training set. Batista et al. (2004) have demonstrated the best results on imbalanced datasets with minority classes containing under 100 records. However, due to the complexity of the edited neighbours procedures (Witten et al., 2005) being
As our datasets have not only continuous but also nominal features, we used a modification of SMOTE – Synthetic Minority Over-sampling Technique-Nominal Continuous (SMOTE-NC), from imbalanced-learn library (Lemaître et al., 2017) in the research. We used a recommended number of neighbours equal to
Feature Selection Methods
Based on the ideas of research and practical implementation recommendations made by Sharafaldin et al. (2018) and Shetye (2019), a selection of features was tested with 3 classes of methods: (a) filtering – correlation and related heat map analysis (b) univariate – recursive feature elimination and (c) iterative – regularization methods. In this research, features were selected with SelectKBest from Scikit-learn library (Pedregosa et al., 2011). The SelectKBest method takes as a parameter a score function, such as
If the Anova F-value function is used, a test result is considered statistically significant if it is unlikely to have occurred by chance, assuming the truth of the null hypothesis. If
Embedded methods penalize a feature based on a coefficient threshold. On each iteration of the model training process those features which contribute the most to the training for a particular iteration are selected.
Further on in this paper, two methods, filtering and SelectKBest from Scikit-learn, were used to select features.
When performing feature selection, SelectKBest focuses on the largest classes. A possible improvement would therefore be to do feature selection in a pipeline, first selecting the most important features for the rarest class and then adding the features needed for every other class.
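To make the univariate scoring behind this kind of selection concrete, the sketch below ranks features by a one-way ANOVA F-value, the statistic on which the ANOVA score function is based, and keeps the k best. It is a pure-Python stand-in written for illustration, not the Scikit-learn implementation; the helper names and the toy columns are hypothetical.

```python
def anova_f(values, labels):
    """One-way ANOVA F-value of one numeric feature against class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-class sum of squares: each class is weighted by its record count.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    # Within-class sum of squares around each class mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_k_best(columns, labels, k):
    """Names of the k features with the highest F-value."""
    scores = {name: anova_f(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: "duration" separates the classes, "noise" does not.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "duration": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],
    "noise": [3.0, 1.0, 2.0, 2.0, 3.0, 1.0],
}
print(select_k_best(columns, labels, k=1))  # ['duration']
```

The between-class term weights each class by its number of records, which gives one way to see why a score of this kind is dominated by the largest classes.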
Generating additional synthetic features was not attempted in this research, as all chosen datasets contain a significant number of such.
Cost-Sensitive Learning Methods
Cost-sensitive learning is a subfield of machine learning that takes the costs of prediction errors (and potentially other costs) into account when training a machine learning model (Brownlee, 2020).
If not configured, machine learning algorithms assume that all misclassification errors made by a model are equal. In case of an intrusion detection problem, missing a positive or minority class case is worse than incorrectly classifying an example from the negative or majority class.
The simplest and most popular approach to implementing cost-sensitive learning is to penalize the model more for training errors made on examples from the minority class by adjusting weights. The decision tree algorithm can be modified to weight model error by class weight when selecting splits. A heuristic rule, also confirmed by intuition from decision trees (Brownlee, 2020), is to invert the ratio of the class distribution in the training dataset.
In this research, weights adjustment for decision trees was implemented using Scikit-learn library model parameters class_weight, setting it to ‘balanced’, which does the above mentioned inversion of class weights. Prior statistics were used for Quadratic discriminant analysis model.
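As an illustration of this inversion, the sketch below computes class weights as n_samples / (n_classes * class_count), which is the formula documented by Scikit-learn for the 'balanced' setting. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency: n / (K * n_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n_c) for c, n_c in counts.items()}


# Hypothetical toy labels with a 9 : 1 imbalance.
weights = balanced_class_weights(["benign"] * 90 + ["attack"] * 10)
print(weights["attack"])  # 5.0
```

With these weights, each class contributes the same total weight to the training loss (here 90 × 100/180 = 10 × 5 = 50), so errors on the rare class are no longer outweighed by the sheer number of benign records.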
Choice of Machine Learning Methods
For a performance comparison of machine learning methods on network intrusion detection data with imbalanced classes, we selected the most popular machine learning algorithms from surveys and review papers, related to intrusion detection (Buczak and Guven, 2016; Sharafaldin et al., 2018; Damasevicius et al., 2020).
Adaptive Boosting (Adaboost)
The AdaBoost ensemble method was proposed by Yoav Freund and Robert Schapire for generating a strong classifier from a set of weak classifiers (Freund and Schapire, 1997). The AdaBoost algorithm works by weighting instances in the dataset by how easy or difficult they are to classify, and correspondingly prioritizes them in the construction of subsequent models. A default base classifier was used with Adaboost by authors of the CIC-IDS-2017 dataset (Sharafaldin et al., 2018) obtaining the result on Precision and
Classification and Regression Tree (CART)
The Classification and Regression Tree method was proposed by Breiman et al. (1984), and used to construct tree structured rules from training data. Tree split points are chosen on a basis of cost function minimization.
The authors of the CIC-IDS-2017 dataset (Sharafaldin et al., 2018) obtained weighted averages of Precision, Recall and
In this research, CART, as implemented in the Scikit-learn library, was also used to obtain a base classifier and tree parameters for Adaboost, Gradient Boosting Classifier and Random Forest Classifier. Tree depth and alpha were obtained using the method of maximum cost path analysis (Breiman et al., 1984), implemented in the Scikit-learn library function cost_complexity_pruning_path and discussed in Section 3.8.
k-Nearest Neighbours (KNN)
The k-Nearest Neighbours method was proposed by Dudani (1976), as a method which makes use of a neighbour weighting function for the purpose of assigning a class to an unclassified sample. KNN was used by authors of the CIC-IDS-2017 dataset (Sharafaldin et al., 2018) with obtained results for weighted averages of Precision, Recall and
Quadratic Discriminant Analysis (QDA)
Quadratic discriminant analysis descends from discriminant analysis introduced by Fisher (1954). Bayesian estimation for QDA was first proposed by Geisser (1964). Quadratic discriminant analysis (QDA) models the likelihood of each class as a Gaussian distribution, then uses the posterior distributions to estimate the class for a given test point (Friedman, 2001). The method is sensitive to the knowledge of priors. QDA was used by authors of the CIC-IDS-2017 dataset (Sharafaldin et al., 2018) with obtained result for Precision, Recall and
Random Forest Trees (RFT)
The Random Forest Trees (RFT) classifier was proposed by Breiman (2001) as a combination of tree predictors minimizing the overall generalization error of the participating trees as the number of trees in the forest becomes larger. Random forests are an alternative to Adaboost by Freund and Schapire (1997) and are more robust with respect to noise. Random Forests is an extension of bagged decision trees where only a random subset of features is considered for each split.
The algorithm was used by the authors of the CIC-IDS-2017 dataset (Sharafaldin et al., 2018), and also by Kurniabudi et al. (2020). Sharafaldin et al. (2018) obtained results for the weighted averages of Precision, Recall and
Gradient Boosting Classifier (GBC)
In order to extend the scope of the research, Gradient Boosting Classifier (GBC), as proposed by Friedman (2001) and Friedman (2002), was added as a natural member of classifier ensemble methods. GBC is a stochastic gradient boosting algorithm, where decision trees are fitted on the negative gradient of the chosen loss function. The idea of gradient boosting is to fit the base learner not to re-weighted observations, as in AdaBoost, but to the negative gradient vector of the loss function evaluated at the previous iteration. The XGBoost library (Chen and Guestrin, 2016), an implementation of GBC with GPU support, was used in this research. Results obtained with GBC by other authors are not publicly known.
Multiple Layer Perceptron
Multiple Layer Perceptron (MLP) was proposed by Rosenblatt (1962) as an extension of the linear perceptron model (Rosenblatt, 1957). It is a supervised-learning artificial neural network, trained with back-propagation, that can have multiple layers and a chosen, not necessarily linear, activation function.
MLP was used in the study of Sharafaldin et al. (2018) with obtained results for weighted averages of Precision, Recall and
Performance Measures
Standard performance metrics for classifiers are presented in Section 3.6.1, and Bias and Variance decomposition metric (see Section 3.7) was used to evaluate ML algorithm tendencies to overfit or underfit.
Confusion Matrix Based Metrics
Accuracy, Precision in equation (5), Recall in equation (3) and
Further on in this research, the Balanced accuracy score and
Balanced accuracy score BAS in formula (2) is further defined as the average of recall values for K classes:

$$\mathrm{BAS} = \frac{1}{K}\sum_{i=1}^{K} \mathrm{Recall}_i \qquad (2)$$
Geometric mean
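Precision for class i is defined as follows:

$$\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i} \qquad (5)$$

where $TP_i$ and $FP_i$ are the numbers of true positive and false positive predictions for class i.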
Whereas
In this research, we have used macro-weighted (i.e. unweighted mean)
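As an illustration of the confusion-matrix-based metrics discussed in this section, the following minimal sketch computes per-class recall and precision, the Balanced Accuracy Score as the mean of per-class recalls, the geometric mean of recalls, and the macro (unweighted mean) precision. The function name `macro_metrics` and the toy confusion matrix are hypothetical and are not taken from the experiments; the sketch assumes that rows hold true classes and columns hold predicted classes.

```python
import numpy as np

def macro_metrics(cm):
    """Macro metrics from a K x K confusion matrix (rows: true, columns: predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)               # correctly classified records per class
    support = cm.sum(axis=1)       # TP + FN per class
    predicted = cm.sum(axis=0)     # TP + FP per class
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "bas": recall.mean(),                                    # mean of per-class recalls
        "g_mean": float(np.prod(recall) ** (1.0 / len(recall))),  # geometric mean of recalls
        "macro_precision": precision.mean(),                     # unweighted mean over classes
    }

# Hypothetical 3-class example: one dominant class and one very rare class.
cm = [[90, 10, 0],
      [5, 15, 0],
      [1, 1, 2]]
metrics = macro_metrics(cm)
print(metrics)
```

In this toy example the plain accuracy is noticeably higher than the balanced accuracy, because the dominant class contributes most of the correctly classified records.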
The decomposition of the loss into bias and variance helps to improve understanding of generalization capacities of compared learning algorithms, such as overfitting and underfitting. Various methods of decomposition are reviewed in Domingos (2000). It has been demonstrated that high variance correlates to overfitting, and high bias correlates to underfitting. In practical terms, when comparing the performance of learning algorithms, models with lower bias and variance over the same test data would be preferred. It is worth noting that models with a higher degree of parameter freedom tend to demonstrate lower bias and higher variance, and models with a low degree of freedom demonstrate high bias and lower variance.
The loss function of a learning algorithm can be decomposed into three terms: a variance, a bias, and a noise term; the noise term is ignored below for simplicity (Raschka, 2018). The loss function depends on the machine learning algorithm. For decision trees (CART), training proceeds through a greedy search, with each step based on information gain. For the random forest classifier, the loss function is the Gini impurity. Cross-entropy is the default loss function for multi-class classification problems with MLP.
The prediction bias is calculated as the difference between the expected prediction accuracy of a model and the true prediction accuracy (equation (7)). In formal notation the
The variance (equation (8)) is a measure of the variability of the model’s predictions if the learning process is repeated multiple times with random fluctuations in the training set.
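The decomposition can be estimated empirically by retraining a model on bootstrap resamples of the training set. The sketch below is an illustration under stated assumptions, not the exact procedure of this study: it uses a squared-error decomposition on numeric class labels, synthetic data from `make_classification`, and a hypothetical helper `bias_variance_mse`. For squared error, the average loss equals the sum of the squared bias and the variance when the noise term is ignored.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def bias_variance_mse(model, X_train, y_train, X_test, y_test, n_rounds=20, seed=0):
    """Bootstrap estimate of (average loss, squared bias, variance) for squared error."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    preds = np.empty((n_rounds, X_test.shape[0]), dtype=float)
    for r in range(n_rounds):
        idx = rng.integers(0, n, size=n)  # bootstrap resample of the training set
        preds[r] = model.fit(X_train[idx], y_train[idx]).predict(X_test)
    mean_pred = preds.mean(axis=0)                  # expected prediction per test record
    bias_sq = np.mean((mean_pred - y_test) ** 2)    # squared bias, averaged over test records
    variance = np.mean(preds.var(axis=0))           # spread of predictions across rounds
    avg_loss = np.mean((preds - y_test) ** 2)       # average squared error
    return avg_loss, bias_sq, variance

# Synthetic data stands in for the network traffic datasets (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
avg_loss, bias_sq, variance = bias_variance_mse(
    DecisionTreeClassifier(max_depth=4, random_state=0), X_tr, y_tr, X_te, y_te
)
print(avg_loss, bias_sq, variance)
```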
Finding the values where training and testing learning curves converge allows for creation of better generalizing decision trees, decrease of overfitting and underfitting. The Tree depth (implemented in Scikit-learn library through parameter
Many variables in the CIC-IDS2017 and CSE-CIC-IDS2018 datasets appear to be correlated with each other, which increases bias when using Quadratic Discriminant Analysis. A statistical measure known as VIF (Variance Inflation Factor) was proposed by Lin et al. (2011) to support the elimination of cross-correlated features; in this research it is used as implemented in the statsmodels library (Seabold and Perktold, 2010).
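To make the VIF-based elimination concrete, the sketch below computes the VIF of feature j as 1 / (1 − R²), where R² comes from regressing feature j on all other features, and repeatedly drops the feature with the highest VIF until all values fall below a threshold. This is a plain NumPy illustration with hypothetical helper names (`vif`, `drop_collinear`) and synthetic data; the iterative drop-the-worst strategy is an assumption, and the study itself relied on the statsmodels implementation.

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2) = SS_tot / SS_res."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # regression with an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    if ss_res <= 1e-12 * ss_tot:                     # (near-)perfect collinearity or constant column
        return np.inf
    return ss_tot / ss_res

def drop_collinear(X, names, threshold=40.0):
    """Drop the highest-VIF feature until every remaining VIF is <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vifs = [vif(X, j) for j in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names

# Synthetic example: feature "c" is almost exactly the sum of "a" and "b".
rng = np.random.default_rng(0)
a, b = rng.normal(size=1000), rng.normal(size=1000)
c = a + b + 0.01 * rng.normal(size=1000)
X_kept, kept = drop_collinear(np.column_stack([a, b, c]), ["a", "b", "c"], threshold=40.0)
print(kept)
```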
Other Methods
The number of estimators was obtained using Scikit-learn’s GridSearch (LaValle et al., 2004) method. See Sections 4.4–4.5 and Table 15 for implementation details in this research.
Experiment Design
Our experiment contained pre-processing, described in detail in Section 4.1 for the CIC-IDS2017 dataset, Section 4.2 for the CSE-CIC-IDS2018 dataset and Section 4.3 for the LITNET-2020 dataset. The datasets were cleaned and normalized. The numeric (continuous, time-related) features of all datasets were transformed with QuantileTransformer from the Scikit-learn library (Pedregosa et al., 2011), using the default of 1 000 quantiles, in order to map the original values to a more uniform distribution.
Datasets were further under-sampled with random fixed ratio under-sampling and the proposed skewed fixed ratio under-sampling, so that after splitting, the testing and training sets would each contain approximately 600 000 records or more, which is sufficient for learning by all algorithms. This number was estimated by performing learning curve analysis.
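A minimal sketch of the quantile transformation step is given below. It uses synthetic, heavy-tailed data in place of the flow features; fitting the transformer on the training part only and re-using it on the test part is an assumption of this sketch rather than a detail stated in the text.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic heavy-tailed values stand in for continuous flow features (assumption).
rng = np.random.default_rng(0)
X_train = rng.lognormal(mean=0.0, sigma=2.0, size=(5000, 2))
X_test = rng.lognormal(mean=0.0, sigma=2.0, size=(1000, 2))

# n_quantiles=1000 and a uniform output distribution are the Scikit-learn defaults.
qt = QuantileTransformer(n_quantiles=1000, output_distribution="uniform", random_state=0)
X_train_q = qt.fit_transform(X_train)  # learn the quantiles on the training data
X_test_q = qt.transform(X_test)        # apply the same mapping to unseen data

print(X_train_q.min(), X_train_q.max(), X_train_q.mean())
```

After the transformation, values lie in the range from 0 to 1 and are approximately uniformly distributed, which removes the influence of extreme outliers on scale-sensitive models.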
Later on, the training subsets were over-sampled using SMOTE for CIC-IDS2017 and CIC-IDS2018 datasets and SMOTE-NC for LITNET-2020. Features were selected using KBest (see Section 3.3) and VIF procedures (see Section 3.9). Training and hyper-parameter search was performed using cross validation with
The final prediction results were obtained on the testing data, i.e. data not seen by the trained models. In order to obtain a reliable result, predictions were run 30 times, with a different random seed on each run.
Further on in the experiment, the best features were selected using the SelectKBest procedure from the Scikit-learn library (Pedregosa et al., 2011), followed by Variance Inflation Factor analysis (Lin et al., 2011) with a target threshold value, to eliminate variables with high collinearity.
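The feature selection step can be sketched as follows. The example assumes k = 40, as reported for the individual datasets, and the ANOVA F-value (`f_classif`) as the scoring function, as mentioned in the discussion of related work; the synthetic data from `make_classification` is a stand-in for the real datasets.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic multi-class data with 60 candidate features (assumption).
X, y = make_classification(
    n_samples=500, n_features=60, n_informative=10, n_classes=3, random_state=0
)

# Keep the 40 features with the highest ANOVA F-value with respect to the class label.
selector = SelectKBest(score_func=f_classif, k=40)
X_best = selector.fit_transform(X, y)
kept_idx = selector.get_support(indices=True)  # column indices of the selected features

print(X_best.shape, kept_idx[:5])
```

The selected columns would then be passed to the collinearity check based on the Variance Inflation Factor.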
Parameters for classification models were searched using GridSearch from the Scikit-learn library.
CIC-IDS2017 Pre-Processing Steps
The following procedures were implemented to condition the dataset for better learning of under-represented attack classes: a) removal of unused features and the resulting duplicate records; b) random under-sampling of benign class records to no more than the number of records that, according to the analysis of learning curves, provides sufficient learning for the worst performing model; and c) over-sampling of the extremely rare records (see Table 4) in the training sub-sample using SMOTE, up to the minimum number of examples of the classes with high imbalance.
Duplicate rows were removed (leaving the first one), see Table 8.
Removal of duplicates in IDS2017 dataset.
¹ Record counts after removing duplicate records.
The following 8 features ‘Bwd PSH Flags’, ‘Bwd URG Flags’, ‘Fwd Avg Bytes/Bulk’, ‘Fwd Avg Packets/Bulk’, ‘Fwd Avg Bulk Rate’, ‘Bwd Avg Bytes/Bulk’, ‘Bwd Avg Packets/Bulk’, ‘Bwd Avg Bulk Rate’, containing no information
After dropping the duplicates, the 2 522 362 remaining records were investigated for missing values and infinities.
As a result, 1 358 records containing missing values were removed together with the duplicates. The remaining 353 rows with missing values were split between the ‘Benign’ (350) and ‘DoS Hulk’ (3) classes, and their missing values were replaced with −1.
Further, 1 211 records with infinities in the two features ‘Flow Bytes/s’ and ‘Flow Packets/s’ were found, and the infinities were replaced by the maximum values per class, see Table 9.
Replacing infinities in IDS2017 dataset.
This processing step is made under an assumption that such a replacement for lost values would be possible to implement after learning the values during the initial training of a real life intrusion detection system.
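The replacement of infinities by per-class maxima can be sketched with pandas as shown below. The five-row frame is a hypothetical toy example; only the column names ‘Flow Bytes/s’ and ‘Flow Packets/s’ and the class names come from the text, and the label column name `Label` is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records with infinities in the two rate features.
df = pd.DataFrame({
    "Label": ["Benign", "Benign", "Benign", "DoS Hulk", "DoS Hulk"],
    "Flow Bytes/s": [10.0, np.inf, 30.0, 500.0, np.inf],
    "Flow Packets/s": [1.0, 2.0, np.inf, np.inf, 40.0],
})
cols = ["Flow Bytes/s", "Flow Packets/s"]

# Mark infinities as missing so that they do not take part in the maximum.
finite = df[cols].replace([np.inf, -np.inf], np.nan)
# Largest finite value of each feature within each class, aligned to the rows.
class_max = finite.groupby(df["Label"]).transform("max")
# Fill every former infinity with the maximum of its own class.
df[cols] = finite.fillna(class_max)

print(df)
```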
Further, the numbers of records of the Benign class and the second largest class, DoS Hulk, were reduced with skewed fixed ratio under-sampling. The remaining data was split into test and train sub-samples. The training sub-set was then over-sampled with SMOTE (hence the training record counts of 4 999 and 2 999 in Table 10). This procedure keeps all extremely imbalanced class records (Table 4) intact and adds new records for the training, resulting in the record counts for the training and testing samples presented in Table 10.
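The core idea of SMOTE, creating synthetic records by interpolating between a rare-class record and one of its nearest rare-class neighbours, can be illustrated with the simplified sketch below. It is not the Imbalanced-learn implementation used in this study: the helper `smote_like`, the 20 synthetic rare records and the choice of k = 5 neighbours are assumptions, while the target count of 2 999 training records follows the text.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic records by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)                 # cannot use more neighbours than available
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # skip the record itself
    base = rng.integers(0, len(X_min), size=n_new)              # random seed records
    pick = neigh[base, rng.integers(0, k, size=n_new)]          # one random neighbour each
    gap = rng.random((n_new, 1))                                # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])

# Hypothetical rare class with 20 training records and 4 numeric features.
rng = np.random.default_rng(1)
X_rare = rng.normal(size=(20, 4))
target = 2999
X_syn = smote_like(X_rare, n_new=target - len(X_rare))
X_rare_train = np.vstack([X_rare, X_syn])      # original records are kept intact
print(X_rare_train.shape)
```

Because over-sampling is applied to the training sub-set only, the testing sub-set contains exclusively original records.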
After this, the values of numeric columns were scaled to a range of
Further in this research, the 40 best features were selected using the SelectKBest procedure from the Scikit-learn library (Pedregosa et al., 2011), followed by Variance Inflation Factor analysis with a target threshold value equal to 40, to eliminate variables with high collinearity.
Resulting IDS2017 dataset training and/or validation sample representation.
The same pre-processing procedure from Section 4.1 was applied to the CSE-CIC-IDS2018 dataset.
The timestamp column and related record duplicates were removed, as no time series dependent machine learning methods were chosen in this research.
Afterwards, 8 features ‘Bwd URG Flags’, ‘Bwd Pkts/b Avg’, ‘Bwd PSH Flags’, ‘Bwd Blk Rate Avg’, ‘Fwd Byts/b Avg’, ‘Fwd Pkts/b Avg’, ‘Fwd Blk Rate Avg’, ‘Bwd Byts/b Avg’ containing no information (eq.
The following sampling procedures were executed in order to achieve a better balance between major classes and extremely rare classes:
The top two classes (‘Benign’ and ‘DDoS attacks-LOIC-HTTP’) were under-sampled to no more than the number of records that, according to the analysis of learning curves, provides sufficient learning for the worst performing model.
The remaining data was split into test and train sub-samples.
The training sub-set was then over-sampled with SMOTE (hence the value of 2 999). This procedure keeps all extremely imbalanced class records (Table 4) intact and adds new records for the training, resulting in the record counts for the training and testing samples presented in Table 11.
Resulting IDS2018 dataset training and validation sample representation.
It should be noted that 7 373 records with infinities in two features ‘Flow Bytes/s’ and ‘Flow Packets/s’ were found and replaced by maximums of values per class, see Table 12.
Replacing infinities in IDS2018 dataset.
Presence of such values could indicate that related flows were not terminated on recording.
After the data cleaning, the dataset was normalized with QuantileTransformer. The 40 best features from SelectKBest were passed through the Variance Inflation Factor procedure with a threshold of 40, which was selected to eliminate collinearity of features.
Due to the choice of supervised machine learning models and the problem definition in this study, the LITNET-2020 dataset timestamp feature was not used. Features related to the source and destination addresses, such as the source and destination issuing authorities, strongly help to identify not only the attacker but also the attack class; they were therefore eliminated in order to support generalization of training.
After removing timestamp and address related features, related duplicate records were also removed, see Table 13.
Removal of timestamp related duplicates in LITNET-2020 dataset.
¹ Record counts after removing timestamp and related record duplicates.
The resulting dataset is even more imbalanced. The target number of records of the Benign and the Code Red type was set after learning curves that indicate the number of records required by the worst performing model for sufficient learning. Sufficient learning is defined here as the objective of getting the learning and testing curves to converge within a margin of less than
As a final step, the Synthetic Minority Over-sampling Technique for Nominal and Continuous features (SMOTE-NC), introduced by Chawla et al. (2002) for datasets with categorical features, was applied, see Table 14.
LITNET-2020 dataset sample representation.
After the data cleaning, the dataset was normalized with QuantileTransformer. The 40 best features from SelectKBest were obtained and further checked for feature collinearity. Collinear features were reduced using the Variance Inflation Factor procedure (see Section 3.9) with a threshold value of 40.
All code for the models was implemented in the Python 3.7 environment on Anaconda 3 using the Scikit-learn⁷ and Imbalanced-learn⁸ libraries, except for the Gradient Boosting Classifier, which was implemented using the XGBoost library (Chen and Guestrin, 2016), utilizing GPU.
Model parameters were searched with the GridSearch method. Tree depth and alpha were further validated using the method of maximum cost path analysis (Breiman et al., 1984), implemented in Scikit-learn by the cost_complexity_pruning_path function (see Section 3.8).
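The validation of the pruning parameter alpha and the resulting tree depth can be sketched as follows. The example is an illustration under assumptions: it uses synthetic data, selects alpha by Balanced Accuracy Score on a 50% hold-out split, and clips tiny negative alpha values that can arise from floating-point error; the exact selection procedure of this study is described in Section 3.8.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data stands in for the pre-processed datasets (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# Effective alphas at which subtrees of the fully grown tree are pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

scores = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    bas = balanced_accuracy_score(y_te, tree.predict(X_te))
    scores.append((bas, alpha, tree.get_depth()))

# Pick the alpha with the highest balanced accuracy on the held-out data.
best_score, best_alpha, best_depth = max(scores)
print(best_score, best_alpha, best_depth)
```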
Parameter Values Selection
The following parameter ranges were selected for the grid search:
ADA: n_estimators: (range(10, 256, 5)), learning_rate: [0.001, 0.005, 0.01, 0.5, 1], and base estimator – CART.
CART: criterion: (‘entropy’, ‘gini’), max_depth: range(4, 32), min_samples_leaf: range(6, 10, 1), max_features: [0.5, 0.6, 0.8, 1.0, ‘auto’].
GBC: max_depth: range(4, 32, 1), n_estimators: range(100, 256, 5), other parameters used from CART.
KNN: n_neighbors: range(3, 16, 1), algorithm: [‘ball_tree’, ‘auto’], leaf_size: range(15, 35, 5).
MLP: hidden_layer_sizes: tuple (32 ... 256, 32 ... 256) (
QDA: reg_param: np.geomspace(1e-19, 1e-1, 50, endpoint=True). The value of the tol parameter only affects the threshold at which warnings of variable collinearity are raised.
RFC: n_estimators: range(100, 350, 5), other parameters in the same ranges as CART.
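A grid search over such ranges, scored by balanced accuracy, can be sketched as below for the CART model. To keep the example fast, the grid is a reduced subset of the ranges listed above; the synthetic imbalanced data and the three stratified folds are assumptions, and the ‘auto’ option of max_features is omitted because recent Scikit-learn releases no longer accept it for decision trees.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced three-class data (assumption).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=0,
)

# Reduced subset of the CART ranges listed above.
param_grid = {
    "criterion": ["entropy", "gini"],
    "max_depth": [4, 8, 16],
    "min_samples_leaf": [6, 8],
    "max_features": [0.5, 0.8, 1.0],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="balanced_accuracy",  # optimize the mean of per-class recalls
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Scoring by balanced accuracy rather than plain accuracy keeps the search from favouring parameter values that only improve predictions of the majority class.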
The parameters used in this study are presented in Table 15.
Model parameters used.
¹ Default Scikit-Learn values; ² Priors calculated equal to class shares.
Results of the Conducted Experiments
Tables 16, 17 and 18 present the ML method rankings obtained with the Standard Ranking approach (Adomavicius and Kwon, 2011), in which equal items receive the same ranking number and a gap is then left in the ranking numbers; a higher ranking number means a worse result.
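The ranking rule can be illustrated with `scipy.stats.rankdata` using the `"min"` method, which assigns tied items the same rank and leaves a gap afterwards. The scores in the sketch are hypothetical placeholders and are not values from Tables 16 to 18.

```python
from scipy.stats import rankdata

# Hypothetical balanced accuracy scores (higher is better); not results of this study.
bas = {"ADA": 0.95, "RFC": 0.93, "KNN": 0.93, "QDA": 0.80}
names = list(bas)

# Negate the scores so that the best model receives rank 1; ties share the lowest rank.
ranks = rankdata([-bas[n] for n in names], method="min").astype(int)
ranking = dict(zip(names, ranks.tolist()))

print(ranking)  # {'ADA': 1, 'RFC': 2, 'KNN': 2, 'QDA': 4}
```

When several metrics are combined, the per-metric ranks obtained in this way can be summed and the sums ranked again with the same rule.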
In Table 16, the results of scoring by Balanced Accuracy are in favour of trees or their ensembles, Adaboost being the strongest, closely followed by Random Forest Classifier and K-Nearest Neighbours.
Comparison of Model performance on 3 datasets using Balanced Accuracy Score (BAS) and Error Rate (ErR).
¹ Adaboost ensemble is made of CART estimators with the grid-searched hyper-parameters described in Table 15.
The results of this research support the notion that the Balanced Accuracy metric (see Table 16) should be used for measuring accuracy in the case of highly and extremely imbalanced datasets. The Error Rate for all models is below 0.1, while Balanced Accuracy reveals some insufficient learning. In this research, the accuracy on extremely rare (malicious) classes is dominated by the majority (benign) class, which represents over 80% of the whole data (see Tables 2 and 3); the Error Rate is therefore overly optimistic and under-represents the prediction error of the extremely rare classes (see Table 4) that are important to this research.
The ranking results in Table 17 were obtained based on the minimum of the sum of rankings for Precision and
Model rankings by Precision (Pr) and G-mean
The rankings of bias and variance decomposition in Table 18 are obtained on a basis of the minimum of the sum of bias and variance (equal to the model mean squared error, when not accounted for the noise component). The bias and variance are calculated according to formulas (7) and (8). To calculate bias, we have to estimate β and
Model rankings using model bias and variance (Var) decomposition.
¹ Ranking is performed on the sum of model loss variance and bias squared; ² Bias squared value.
The QDA values in Table 18, which are much higher than the errors of the other algorithms on the same data, are a characteristic property of models with a low number of hyper-parameters, as noted in Brownlee (2020). The values obtained in this experiment could be local optima, but the authors were not able to find other parameter values that would result in a lower difference of values for this model between datasets. However, the bias and variance of this model were noticed to be sensitive to changes in the list of features selected before the parameter search process. The list of features chosen for model training is individual for each dataset.
Comparison of results of research in different implementations for CIC-IDS2017 and CSE-CIC-IDS2018 datasets is presented in Table 19. Performance metrics are not directly comparable to our research (further in Table 19 – this research), as validation results in our experiment were obtained using multiple class optimization and 50% of dataset as a hold-out data, versus standard k-fold cross-validation, known to be prone to knowledge leak. In our methodology, cost sensitive model implementations provided classification for multiple class measures. However, for comparison, traditional measures suitable only for balanced datasets are presented with other reviewed studies (see Table 19). It is important to note that optimization in this experiment was done on Balanced Accuracy Score, therefore, other measures are sub-optimal.
Related research results analysis.
¹ See explanatory notes related to cited work in Section 5.2.
In Sharafaldin et al. (2018) authors had an objective to introduce the CIC-IDS-2017 dataset, and default parameter model results of machine learning are presented for purely benchmark purposes of future research. Feature selection was performed using the random forest regression feature selection algorithm. The results of Precision, Recall and
In Sharafaldin et al. (2019), the authors improve the RFT results by proposing super-feature creation in place of the random feature regression algorithm for feature selection used in their previous research (Sharafaldin et al., 2018). In our research, feature selection was performed with the fast SelectKBest procedure using the ANOVA F-value as the scoring function; this algorithm was chosen after testing three classes of feature selection methods.
In the strategy of Yulianto et al. (2019), SMOTE is utilized with CIC-IDS-2017. However, only the benign and DDoS class data of the CIC-IDS-2017 dataset are taken, which reduces the task to a binary classification problem and therefore produces results that are incomparable to ours. Features in their research are also selected differently: first using Principal Component Analysis (PCA), then Ensemble Feature Selection (EFS), using the EFS package in R Studio and the ensemble methods gbm, glm, lasso, ridge and treebag from the fscaret library. AdaBoost classification with default weak decision tree classifiers was used during training. Meanwhile, in our research a choice was made to strengthen the base classifier via pruning. The results of Precision, Recall and
Kanimozhi and Jacob (2019a, 2019b) classified the CSE-CIC-IDS2018 data set using ADA, RF, kNN, SVM, NB and ANN (Artificial neural network) machine learning methods. For an ANN authors used MLP with two layers, lbfgs solver, grid searched alpha parameter (for L2 regularization) and Hidden layer sizes. In their research, authors used 0–1 classification. Either “Benign” or “Malicious” labels were used for training, making the results directly incomparable with our multi-class approach. Results of the accuracy, precision, recall,
In the study Karatas et al. (2020) classified the CSE-CIC-IDS2018 dataset using KNN, RFT, GBC, ADA, DT (Decision tree), and LDA (Linear discriminant analysis with singular value decomposition solver) algorithms. Parameters that were selected for all the implemented algorithms are described in Karatas et al. (2020) Table 8. Number of classes was determined to be six (one for non-attack type, and 5 for attack types), making the results directly incomparable with our multi-class approach. Cross-validation with 80%/20% split of training and test data was used. Results of the accuracy, precision, recall and
In their study Kilincer et al. (2021) classified the CSE-CIC-IDS2018 dataset using KNN, DT, and SVM algorithms. Options of Matlab for KNN with KNN Fine algorithm, DT with Fine tree and SVM Quadratic algorithm gave the best results in this research. Results on a limited amount of records (up to 1584 records per class, see Kilincer et al. (2021) Table 3) were used in this research for CSE-CIC-IDS2018 dataset classes. Authors focus on UNSW-NB15 dataset with no discussion on pre-processing for CSE-CIC-IDS2018, parameter search or tree pruning or overfitting. Results of the accuracy, precision, recall,
In Dutta et al. (2020) authors used SMOTE and ENN to balance the LITNET-2020 dataset. Classes are reduced to two, normal and malignant, therefore, results are directly incomparable with ours. The approach also differs in that authors reduce dimensionality with Deep sparse autoencoder (Zhang et al., 2018), selecting 15 features. Then authors stack LSTM with adam optimizer and DNN with four layers, back-propagation and stochastic gradient descent as the optimizer and early stopping on Keras with TF back-end and Scikit-learn. 5-fold validation was used in that research. Results of the precision, recall, false positive rate, and MCC were obtained. The results of Precision, Recall and
Regarding the limitations of the approach taken in this research, it is important to note that new categories of malicious traffic in reality are introduced daily. Therefore, models tuned using this method will not detect zero day threats.
Another known limitation is that in the case of absolute rarity, or when data has not been obtained and labelled sufficiently, models will predict with a high Error Rate. A possible known solution to this problem is anomaly detection for the unseen data.
Moreover, the CIC-IDS2017 and CSE-CIC-IDS2018 datasets lack some categorical flag data, which is possible to obtain, as has been demonstrated in the LITNET-2020 case.
LITNET-2020 lacks the temporal features introduced in the CIC-IDS datasets; this, however, can be resolved by running CICFlowMeter on the original PCAP files.
Temporal average approach of flags does not help some classes like Infiltration, however, flag features could be added to CIC-IDS datasets in the future.
While SMOTE was helpful for some rare classes, the method did not help much where sub-classes overlap due to lack of host data or feature latency.
Some features can be extracted and supplemented, which might be used in future research, however, extraction requires high degree of previous network traffic logging, whereas authors are aware that organizations lack resources to collect data on such a level of detail.
Observations on Multi-Class Predictions
Details of the comparison of each class and dataset before and after SMOTE up-sampling are not presented here due to the substantial number of tables. However, it is important to note that some rare classes in these datasets learn very well even with a small number of records, which is confirmed by testing on dedicated unseen data. Some classes learn significantly better after adding synthetic data, which is further supported by tests of model performance and classification reports executed before (prefixed with n as
MLP model results for Precision, and G-mean on LITNET-2020 dataset before and after SMOTE.
¹ Selected example rare classes.
As demonstrated in Table 20, random data under-sampling and SMOTE over-sampling techniques are supportive in ensuring that extremely under-represented classes (see Table 4) can learn with non-zero precision and
In this paper, we have studied three highly imbalanced network intrusion datasets and proposed methodology steps (see Section 4), helping to achieve high classification results of rare classes which were validated through model error decomposition and 50% data hold-out strategy. This methodology was checked using a novel, differently structured dataset LITNET-2020, and comparison of the results to those obtained on the established benchmark datasets CIC-IDS2017 and CSE-CIC-IDS2018.
A review of the compliance of the LITNET-2020 dataset with the criteria raised by Gharib et al. (2016) is first introduced in Section 2.2. A variant of random under-sampling (skewed ratio under-sampling, proposed by the authors and discussed in Section 3.1) is used to reduce the imbalance of classes in a nonlinear fashion, and SMOTE-NC up-sampling (see Section 3.2) is executed to increase the representation of under-represented classes. A comparison of the multi-class classification performance on the CIC-IDS2017 and CSE-CIC-IDS2018 datasets with that on the recent LITNET-2020 dataset is discussed in Section 5. As LITNET-2020 is constructed differently from the CIC-IDS datasets, a conclusion can be made that the proposed method is resistant to dataset change. The use of performance metrics better suited for multi-class classification, balanced accuracy (Formula (2)) and geometric mean of recall (Formula (4)), on the LITNET-2020 dataset is another introduced novelty (see results in Tables 16 and 17), not discussed by other authors using these datasets. Multi-criteria scoring is cross-validated by testing on data previously unseen by the models (see Section 4). An additional ML model, the Gradient Boosting Classifier, which utilizes an ensemble of classification and regression trees, was introduced as a benchmark in this research via the XGBoost library (Chen and Guestrin, 2016) with GPU support (see Section 3.5.6). In our methodology, cost-sensitive model implementations have been used and have provided some better results (see Table 19) compared to other reviewed studies. Furthermore, models with better generalization capabilities were selected through decomposition of the classification error into bias and variance (see results in Table 18).
Instead of using weak CART base classifiers (see Section 3.8), their parameters were grid-searched, and the tree depth and alpha parameters were validated using the method of maximum cost path analysis (Breiman et al., 1984). Other models were tuned using GridSearch, with the Balanced Accuracy Score as the optimization goal.
Machine learning algorithm rankings based on Precision, Balanced Accuracy Score,
Footnotes
See
More information at
More information at
More information at
File format as abbreviated from Packet CAPture, traffic capture file format in use by networking tools.
For a definition of features used in Nfdump 1.6 see
