Abstract
Anomaly detection in Intrusion Detection System (IDS) data refers to the process of identifying and flagging unusual or abnormal behavior within a network or system. In the context of IoT, anomaly detection helps in identifying any abnormal or unexpected behavior in the data generated by connected devices. Existing methods often struggle with accurately detecting anomalies amidst massive data volumes and diverse attack patterns. This paper proposes a novel approach, KDE-KL Anomaly Detection with Random Forest Integration (KRF-AD), which combines Kernel Density Estimation (KDE) and Kullback-Leibler (KL) divergence with Random Forest (RF) for effective anomaly detection. Additionally, Random Forest (RF) integration enables classification of data points as anomalies or normal based on features and anomaly scores. The combination of statistical divergence measurement and density estimation enhances the detection accuracy and robustness, contributing to more effective network security. Experimental results demonstrate that KRF-AD achieves 96% accuracy and outperforms other machine learning models in detecting anomalies, offering significant potential for enhancing network security.
Keywords
Introduction
The Internet of Things (IoT) enables the connection and automation of various physical objects, such as appliances, vehicles, wearable devices, and industrial equipment, to improve efficiency, convenience, and productivity [1]. IoT sensor data can be collected in an Intrusion Detection System (IDS) environment using methods such as network taps or mirroring, inline deployment, log collection, and agent-based monitoring, and then analyzed [2]. This data includes network traffic logs, event logs, alerts, and any other relevant information that the IDS captures during its monitoring activities [3].
IDS data provides insights into network activities, anomalies, security events, and potential threats. It can be used for analyzing attack patterns [4], investigating security incidents, and improving security policies and defenses [5, 6]. The logged IDS data is then processed to extract relevant information such as message topics, payload, Quality of Service (QoS) levels, and timestamps. The processed data is used for various purposes, such as intrusion detection, anomaly detection, protocol analysis, or evaluating the performance and security of the system [7].
In an IoT environment [8], anomaly detection is a crucial aspect of ensuring the security and integrity of connected devices and systems [9, 10]. IoT devices generate massive amounts of data, and detecting anomalies helps in identifying potential threats [11] or abnormal behavior that may indicate security breaches [12]. The methods used to detect anomalies in IoT data include statistical thresholds, time-series analysis, supervised learning [13], unsupervised learning, clustering [14, 15], and deep learning [16]. Fig. 1 shows the importance of anomaly detection using machine learning models.
Kernel Density Estimation (KDE), one of the statistical methods for finding anomalies, estimates the probability density function (PDF) of the sensor data. Data points whose PDF values indicate a low likelihood are regarded as anomalies [17]. Another method is Kullback-Leibler (KL) divergence, which measures how far an observed distribution deviates from the reference or normal distribution [18].
Fig. 1. Importance of anomaly detection using ML.
A higher KL divergence indicates greater dissimilarity or abnormality. The proposed methodology uses these techniques to increase the effectiveness and accuracy of anomaly detection in IoT data. The contributions of the proposed work are summarized as follows:
Initially, data are collected and pre-processed using normalization, tokenization, word elimination and stemming to prepare the data for effective analysis and anomaly detection. From the preprocessed data, features are extracted using the Principal Component Analysis (PCA) technique, which reduces the complexity and noise in the data to improve the performance of the machine learning model. The proposed model utilizes KDE to model the complex distribution of normal instances and KL divergence to measure dissimilarity, providing a statistical measure for detecting anomalies. The model integrates Random Forest (RF), an ensemble learning method, to classify data points as anomalies or normal based on features and anomaly scores. The experimental findings show that the proposed model, KDE-KL Anomaly Detection with RF Integration (KRF-AD), achieves superior performance compared to the other ML algorithms.
The remaining sections are arranged as follows: Section 2 discusses research papers related to anomaly detection techniques. Section 3 describes the methodology used in the experiment and gives brief notes on the datasets used for the study. Section 4 presents the results and discussion of the experiment, and Section 5 contains the paper’s conclusion and potential improvements.
This article focuses on the extended KDE technique, which dynamically modifies density estimates based on incoming streaming data. It considers both recent data and historical data to maintain an accurate representation of the underlying density distribution. This approach allows for real-time anomaly detection in streaming data. The author introduces an adaptive bandwidth estimation technique that overcomes the limitations of fixed bandwidth in traditional KDE and adapts to the data characteristics, ensuring better sensitivity to anomalies. The author did not carry out additional experiments, particularly on multi-dimensional data, or examine how the extended KDE might be used as an alternative approach to anomaly identification [19].
The paper discusses the effectiveness of machine learning algorithms in detecting anomalies caused by gamma radiation in complex SoC environments. Using a Radial Basis Function kernel, a One-Class-SVM has an average recall score of 0.95, according to the anomaly detection results. A sanity check and a voltage drop to zero allow for the detection of any anomalies before the boards become completely inoperable. The author focused on the radiation environment data and created a one-class SVM using the Radial Basis Function Kernel to find abnormalities. Shielding the monitoring board and sensors from the radiation environment is not a possibility for real-time use, though. Therefore, a different strategy must be created [20].
The main aim of the research is to review and categorize a variety of anomaly detection methods created especially for high-dimensional big data settings. It divides the anomaly detection approaches into categories based on the fundamental theories behind them, including statistical, distance-based, density-based, subspace-based, and machine learning-based approaches. The author also discusses future directions, such as developing hybrid approaches that combine multiple techniques, addressing scalability challenges for big data, and exploring unsupervised and semi-supervised methods for anomaly detection. The author mainly emphasized the high dimensionality of the data and how handling it improved the model’s performance, but did not concentrate on the error detection rate to reduce the anomalies in the data [21].
The author explains the effectiveness of statistical approaches in identifying DDoS attacks, which are notorious for overwhelming target systems with a flood of traffic from multiple sources. The work highlights the strengths and limitations of statistical techniques in DDoS detection, emphasizing their ability to handle diverse traffic patterns and the challenge of dealing with large-scale attacks. To improve the performance of the model, the author focused mostly on statistical approaches and machine learning techniques, but did not address the high dimensionality of the data [22].
The paper discusses the existing research related to anomaly detection, analysis, and prediction techniques in the IoT domain. The author covers different techniques for analyzing and interpreting detected anomalies, as well as predicting future anomalies, enabling proactive actions in response to potential issues. The author concentrated on various cutting-edge techniques to improve the model’s performance, but gave a lower priority to balanced, imbalanced, and high-dimensional data [23].
The literature survey shows that anomaly detection can be carried out with statistical approaches, machine learning techniques, or other methods, and that these approaches differ in their ability to handle diverse traffic patterns and in how well they cope with large-scale attacks.
Proposed method
In this section, a novel Kernel Density Estimation-Kullback-Leibler Anomaly Detection with Random Forest Integration (KRF-AD) is proposed for effective anomaly detection in different IDS data. Initially, data are collected and pre-processed using normalization, tokenization, word elimination and stemming to prepare the data for effective analysis and anomaly detection. From the preprocessed data, features are extracted using the Principal Component Analysis (PCA) technique, which reduces the complexity and noise in the data to improve the performance of the machine learning model. The proposed model utilizes KDE to model the complex distribution of normal instances and KL divergence to measure dissimilarity, providing a statistical measure for detecting anomalies. The model integrates Random Forest (RF), an ensemble learning method, to classify data points as anomalies or normal based on features and anomaly scores. Figure 2 shows the proposed model of anomaly detection.
Fig. 2. Proposed work for anomaly detection using ML model.
The public datasets available on the Kaggle website are first explored and understood. The features present in each dataset, their descriptions, and the target variable are then examined, along with the distribution of classes (legitimate, DoS, brute force, malformed, slowite, flood, etc.) and the insights offered by each attribute.
Data preprocessing
Preprocessing is the collection of actions and methods used to clean up raw data and format it so that it is appropriate for the task at hand before further analysis. Preprocessing techniques such as stemming, tokenization, word elimination, and normalization are applied.
Normalization
Normalization involves several tasks carried out together: punctuation is eliminated, all text is converted to a single case (uppercase or lowercase), and numbers are translated to words. Every text therefore undergoes more uniform pre-processing.
Tokenization
Tokenization divides a text into meaningful pieces while maintaining its meaning. In this stage, long paragraphs, also known as text chunks, are split into tokens, which are sentences. These sentences can in turn be broken down into their component words.
Word elimination
During this stage, terms that occur repeatedly but carry little meaning are removed from the text. Many stop words are used, such as “are,” “of,” “the,” and “at,” and these are taken out of the text.
Stemming
Stemming reduces words that appear in many tenses to their most fundamental forms, which eliminates unnecessary computations. Words with similar meanings are grouped together by stemming, regardless of their variations in derivation or inflection.
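The four preprocessing steps above can be sketched as a small pipeline. This is a minimal illustration using only the Python standard library, not the implementation used in the study: the stop-word list contains only the four examples named in the text, numbers are spelled out digit by digit, and `stem` is a toy suffix stripper standing in for a real stemmer. The function names are hypothetical.

```python
import re

# Illustrative stop-word list: only the four examples named in the text.
STOP_WORDS = {"are", "of", "the", "at"}

# Used to spell out numbers one digit at a time (a simplification).
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def normalize(text):
    """Lowercase the text, spell out digits, and replace punctuation with spaces."""
    text = text.lower()
    text = re.sub(r"[0-9]", lambda m: " " + DIGIT_WORDS[int(m.group())] + " ", text)
    return re.sub(r"[^a-z\s]", " ", text)


def tokenize(text):
    """Split normalized text into word tokens on whitespace."""
    return text.split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Toy suffix stripper (assumption); a real stemmer would be used in practice."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run normalization, tokenization, stop-word removal, and stemming in order."""
    tokens = remove_stop_words(tokenize(normalize(text)))
    return [stem(t) for t in tokens]
```

For example, `preprocess("The sensors are sending 2 packets at the gateway!")` lowercases the sentence, spells out the digit, drops the stop words, and strips the plural and "-ing" suffixes.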
Feature extraction
Feature extraction is one of the most crucial steps in getting the data ready for machine learning models. Selecting or transforming the most relevant information from the raw data is necessary to improve the model’s performance. The Principal Component Analysis method is used to extract features from the preprocessed data.
Principal Component Analysis (PCA)
PCA is a general method for reducing feature dimensionality. Conventional PCA reduces the dimensions linearly, hence it is ineffective when the feature space is complex. Standard PCA is therefore generalized to nonlinear dimension reduction to improve the feature reduction process. After the features have been normalized, PCA is a helpful feature dimension reduction technique. It identifies the eigenvectors of the covariance matrix with the largest eigenvalues, thereby lowering the dimensionality of large datasets. The algebraic definition of PCA is as follows. For a data matrix A, calculate the mean of A and determine the covariance of A as in Eqs (1) and (2):
Compute the eigenvalues of the covariance matrix.
The cumulative contribution of the principal components should be greater than 83%, so the first L eigenvalues that reach this share are chosen, giving a more compact measurement subspace,
Where
Whereas,
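The PCA steps described in this subsection (mean, covariance, eigen-decomposition, and keeping the first L components that reach the 83% cumulative share) can be sketched with NumPy. This is an illustrative sketch of standard linear PCA under stated assumptions: the function name `pca_reduce` and the parameter `variance_target` are hypothetical, the covariance uses the sample form that divides by n - 1, and the nonlinear generalization mentioned above is not covered.

```python
import numpy as np


def pca_reduce(A, variance_target=0.83):
    """Project the rows of A onto the first L principal components.

    L is the smallest number of components whose cumulative share of the
    total variance reaches variance_target.
    """
    A = np.asarray(A, dtype=float)
    mean = A.mean(axis=0)                     # feature-wise mean of A
    centered = A - mean
    cov = np.cov(centered, rowvar=False)      # sample covariance (divides by n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    share = np.cumsum(eigvals) / eigvals.sum()
    L = int(np.searchsorted(share, variance_target) + 1)
    return centered @ eigvecs[:, :L], L
```

With a higher `variance_target`, more components are retained; the 0.83 default mirrors the 83% figure quoted in the text.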
The extracted features are passed to the anomaly detection stage. Kernel Density Estimation (KDE) is a statistical technique that estimates the probability density function (PDF), which gives the density values of the features. The density at a given point is estimated by adding the contributions from neighbouring data points, weighted according to their kernel function values. The density estimate at a particular point represents the likelihood of observing a data point in that vicinity. For the given data point
where,
Scott’s rule is another bandwidth selection method, based on the sample size and dimensionality of the data. It is used as a rule of thumb for unimodal distributions. The probability density values of all the features are then calculated using this bandwidth value. The formula for the bandwidth selection is shown in Eq. (7),
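As a rough illustration of the density estimate and bandwidth selection described above, the following sketch evaluates a one-dimensional Gaussian KDE with a Scott-style bandwidth. The Gaussian kernel, the bandwidth form h = sigma * n^(-1/(d+4)) with d = 1, and the function names are assumptions made for this example; the exact kernel and constant used in the study may differ.

```python
import numpy as np


def scott_bandwidth(x):
    """Scott-style rule of thumb for 1-D data: h = sigma * n ** (-1 / (d + 4)), d = 1."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) * len(x) ** (-1.0 / 5.0)


def gaussian_kde_pdf(points, samples, h):
    """Estimate the density at each point as the average of Gaussian kernels
    centred on the samples, each scaled by the bandwidth h."""
    points = np.asarray(points, dtype=float)[:, None]
    samples = np.asarray(samples, dtype=float)[None, :]
    u = (points - samples) / h
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (samples.shape[1] * h)
```

A point far from all the samples receives a density close to zero, which is what marks it as a candidate anomaly. A larger bandwidth smooths the estimate and makes this contrast less sharp.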
Kullback-Leibler (KL) divergence quantifies the dissimilarity or divergence between two probability distributions of the data. It is calculated by taking the sum (or integral) of the product of the probability of each event or outcome in the true distribution P and the logarithm of the ratio of probabilities between P and the “estimated” distribution Q. The mathematical formula is shown in Eq. (8):

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)} \tag{8}$$
Here,
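The discrete form of the KL divergence described above can be sketched in a few lines of standard-library Python. The function name and the `eps` guard against division by zero are assumptions made for this example, and the result is expressed in nats because the natural logarithm is used.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Discrete D_KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats.

    eps guards against division by zero when Q assigns zero mass to an
    outcome; terms with p_i == 0 contribute nothing by convention.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)
```

The divergence is zero when the two distributions are identical and grows as they move apart. It is not symmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.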
The threshold value is set based on the divergence of the data points from the expected distribution. Here, the value is set by taking the mean of the anomaly score
Here, the
After extracting the features, the data is divided into training (70%) and testing (30%) sets. The ML model is trained using the training data, and its performance on unseen data is assessed using the testing set. Assessment metrics such as precision, recall, accuracy and F1-score are used to evaluate anomaly detection.
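The four assessment metrics named above follow directly from the counts of true and false positives and negatives. The sketch below uses their standard definitions; the function name is hypothetical and the anomaly class is assumed to be labelled 1.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1-score for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

On heavily imbalanced data, accuracy alone can be misleading, which is why precision, recall, and F1-score are reported alongside it.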
Model evaluation
The proposed model (KRF-AD) combines KDE and KL divergence and then integrates them with the Random Forest (RF) to detect anomalies in the dataset. The parameters initialized for the RF model are given in Table 1.
Table 1. Hyperparameters of the random forest
The model integrates KDE and KL divergence to obtain a smooth density estimate and evaluate divergence from the true distribution. The RF classifier is an ensemble learning technique that enhances accuracy and robustness by constructing numerous decision trees and combining their predictions. RF is then employed to classify data points as anomalies or normal based on the features and anomaly scores.
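One way to read the integration described above is that the anomaly score is appended to the feature matrix as an extra column before the Random Forest is trained. The sketch below illustrates that reading with scikit-learn. It is an assumption about the wiring, not the authors' implementation: the function name is hypothetical, and the hyperparameters are illustrative placeholders, not the values listed in Table 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def fit_rf_with_scores(features, anomaly_scores, labels, seed=0):
    """Train a Random Forest on the features plus one anomaly-score column.

    Returns the fitted model and its accuracy on a held-out 30% test split.
    """
    X = np.column_stack([features, anomaly_scores])
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.30, random_state=seed, stratify=labels
    )
    # Placeholder hyperparameters (assumption), not those of the study.
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```

Stratifying the split keeps the normal-to-attack ratio the same in both subsets, which matters for the imbalanced datasets.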
The proposed model is evaluated using three public datasets: MQTTset, NF-BoT-IoT, and NF-UNSW-NB15-v2. All of them are taken from the Kaggle website. The proposed model’s effectiveness is evaluated using several evaluation metrics: F1-score, accuracy, precision and recall.
The first dataset is balanced, with normal and attack data split equally. In the second dataset, normal data makes up roughly 2% of the records and attack data roughly 98%. The third dataset is also unbalanced, but in the opposite direction, with about 96% normal data and 4% attack data. The three datasets therefore have different class distributions.
Dataset description
The first dataset contains 330,926 rows × 34 columns with mixed data types (numeric, string, float, etc.). It has 165,463 normal records and the same number of attack records, so it is equally distributed. The attack records are split across labels as follows: DoS – 130,223; brute force – 14,501; malformed – 10,924; slowite – 9,202; flood – 613.
The second dataset contains 600,100 rows × 14 columns, with 13,859 normal records and 586,241 attack records, so the data is not distributed equally. The attack labels are Reconnaissance – 470,655; DDoS – 56,844; DoS – 56,833; Theft – 1,909.
The third dataset has 2,390,275 rows × 45 columns, with 2,295,222 normal records and 95,053 attack records. It is an unbalanced dataset whose attack labels are Exploits – 31,551; Worms – 164; Fuzzers – 22,310; Shellcode – 1,427; Generic – 16,560; DoS – 5,794; Reconnaissance – 12,779; Analysis – 2,299; Backdoor – 2,169.
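The class balance of the three datasets can be checked directly from the record counts quoted above. The short sketch below does this arithmetic. The dictionary layout and function name are illustrative, and the counts are the ones given in the dataset descriptions.

```python
# Normal and attack record counts as quoted in the dataset descriptions.
DATASETS = {
    "MQTTset": {"normal": 165463, "attack": 165463},
    "NF-BoT-IoT": {"normal": 13859, "attack": 586241},
    "NF-UNSW-NB15-v2": {"normal": 2295222, "attack": 95053},
}


def normal_share(counts):
    """Fraction of records labelled normal."""
    total = counts["normal"] + counts["attack"]
    return counts["normal"] / total


for name, counts in DATASETS.items():
    total = counts["normal"] + counts["attack"]
    print(f"{name}: {total} rows, {normal_share(counts):.1%} normal")
```

The two NF datasets are imbalanced in opposite directions, so a model that simply predicts the majority class would behave very differently on each.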
When comparing the required size of training sets between the proposed KRF-AD approach and existing methods, the training set sizes vary depending on the datasets utilized. We systematically partitioned the datasets into training and testing subsets, adhering to a 70% training and 30% testing split. Specifically, for the first dataset, consisting of 330,926 instances, 231,648 instances were allocated for training and 99,278 for testing. Similarly, for the second dataset, comprising 600,100 instances, 420,070 instances were designated for training and 180,030 for testing. Lastly, for the third dataset, encompassing 2,390,275 instances, 1,673,192 instances were earmarked for training, with 717,083 instances reserved for testing. Table 2 shows the different metric comparisons on ML models using various datasets.
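The training and testing sizes quoted above follow from a 70/30 split with the training size rounded down. The sketch below reproduces that arithmetic with integer operations, which avoids floating-point rounding surprises. The function name and the rounding-down convention are assumptions chosen because they match the quoted figures.

```python
def split_sizes(n_rows, train_tenths=7):
    """Return (train, test) sizes for a split of train_tenths/10, rounding train down."""
    train = n_rows * train_tenths // 10
    return train, n_rows - train


for n in (330926, 600100, 2390275):
    train, test = split_sizes(n)
    print(f"{n} rows: {train} train, {test} test")
```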
Table 2. Various metrics comparison of ML models on different datasets
The raw data undergoes preprocessing to optimize it for machine learning models. After preparation, the data is tested across various scenarios by adjusting threshold values (Tv) and bandwidth (h). In the first scenario, the threshold value is fixed at 0.5, and the bandwidth is approximately 0.75 based on Scott’s rule. This adjustment leads to a 1% increase in accuracy and a 78% detection rate compared to the existing model. In the second scenario, the threshold value is determined by the median of anomaly scores, maintaining the same bandwidth. However, this results in a decrease in both detection rate (2%) and accuracy (9%) compared to the existing Random Forest (RF) model.
In the final scenario, the threshold value is set by taking the mean of the anomaly scores and the bandwidth value is doubled, which improves the performance of the model by 2% over the existing model. Figure 3 depicts the comparison line graph of the different evaluation metrics with various ML algorithms using different datasets, and Fig. 4 shows the average accuracy of the ML models with different datasets.
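The three threshold scenarios (a fixed value of 0.5, the median of the anomaly scores, and the mean of the anomaly scores) can be compared with a small helper. This is a sketch under one assumption that the text implies but does not state: a point is flagged as an anomaly when its score is strictly greater than the threshold. The function and parameter names are hypothetical.

```python
import statistics


def flag_anomalies(scores, rule="mean", fixed=0.5):
    """Return (threshold, flags), where flags[i] is True if scores[i] > threshold.

    rule selects the threshold: "fixed" uses the given constant, "median" and
    "mean" derive it from the scores themselves.
    """
    if rule == "fixed":
        threshold = fixed
    elif rule == "median":
        threshold = statistics.median(scores)
    elif rule == "mean":
        threshold = statistics.fmean(scores)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return threshold, [s > threshold for s in scores]
```

When a few scores are very large, the mean sits above the median, so the mean rule flags fewer points. This is one plausible reason the choice of rule changes the detection rate.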
Fig. 3. Performance metrics of the various ML models using different datasets.
Thus, when the model’s performance is assessed using various metrics, the final model gives an accuracy of about 96% and a detection rate of approximately 80%, which is better than the current models.
The comparative analysis of the machine learning models with different datasets is shown in Fig. 5, which indicates that the proposed model KRF-AD outperforms the other existing machine learning models. Table 3 displays the accuracy comparison of different ML models using different datasets.
Table 3. Accuracy comparison of ML models with different datasets
Fig. 4. Average accuracy of the ML models with different datasets.
Fig. 5. Comparative analysis of the machine learning models with different datasets.
The computational complexity (
Conclusion and future enhancement
In this paper, a novel Kernel Density Estimation-Kullback-Leibler Anomaly Detection with Random Forest Integration (KRF-AD) has been proposed for effective anomaly detection in different IDS data. By leveraging Kernel Density Estimation (KDE) to estimate the data distribution, Kullback-Leibler (KL) divergence to measure dissimilarity, and Random Forest (RF) for classification, the model achieves a balance between precision, accuracy, recall, and F1-score in detecting anomalies. The integration of these techniques provides a comprehensive and effective solution for anomaly detection tasks, particularly in scenarios with complex data patterns and imbalanced class distributions. A comprehensive analysis using various public datasets, including MQTTset, NF-BoT-IoT, and NF-UNSW-NB15-v2, demonstrates the better performance of KRF-AD over other machine learning models such as Random Forest, Naive Bayes, Logistic Regression, and SVM. The proposed method attained an accuracy of 96% and improved detection rates compared to existing models, highlighting its effectiveness in enhancing network security. Scalability issues could arise when processing and analyzing massive amounts of data, potentially affecting the model’s performance and computational efficiency. Therefore, future research could focus on investigating the scalability of the KRF-AD model and optimizing it to handle big data efficiently without compromising its accuracy and effectiveness in anomaly detection.
Ethical approval
My research guide reviewed and ethically approved this manuscript for publishing in this Journal.
Author contributions statement
Conceptualization, G. Aarthi; methodology, G. Aarthi, S. Sharon Priya; software, G. Aarthi; validation, S. Sharon Priya, W. Aisha Banu; formal analysis, G. Aarthi; investigation, G. Aarthi, S. Sharon Priya; data curation, G. Aarthi; writing-original draft preparation, G. Aarthi; writing-review and editing, G. Aarthi, S. Sharon Priya; visualization, G. Aarthi; supervision, S. Sharon Priya, W. Aisha Banu.
Competing interests
The authors declare that there is no conflict of interest.
Research funding
Not applicable.
Availability of data and material
The datasets used for the research are open datasets. Below are the links,
Human and animal rights
This article does not contain any studies with human or animal subjects performed by any of the authors.
Informed consent
I certify that I have explained the nature and purpose of this study to the above-named individual, and I have discussed the potential benefits of this study participation. The questions the individual had about this study have been answered, and we will always be available to address future questions.
Acknowledgments
I express my gratitude to my respected supervisor and head of the department for their guidance.
