Anomaly detection algorithm for big data based on isolation forest algorithm

Abstract

With the advent of the big data era, anomaly detection becomes increasingly crucial for ensuring the security and reliability of systems. This paper investigates large-scale anomaly detection based on the Isolation Forest algorithm, enhancing the algorithm’s performance in the context of big data by introducing the method of adaptive feature selection. The proposed approach is a fusion of the Isolation Forest and adaptive feature selection, dynamically adjusting feature weights to adapt more flexibly to the contributions of different features. Experimental results on large-scale datasets demonstrate that adaptive feature selection significantly improves the anomaly detection performance of the Isolation Forest algorithm. This method provides a new perspective for enhancing anomaly detection techniques and addressing the challenges posed by large-scale, high-dimensional data. Its practical implications are crucial for real-world applications.

Keywords

big data, anomaly detection, isolation forest, adaptive feature selection, performance optimization

Introduction

With the rise of the big data era, the security and reliability of large-scale systems have become critically important. The rapid growth of massive data has brought unprecedented opportunities to various industries, while also presenting significant challenges. As a key technical approach, anomaly detection is widely used to ensure the normal operation of systems and the security of data. Its goal is to identify anomalies that significantly deviate from normal patterns within large-scale data, thereby preventing potential risks and failures. However, traditional anomaly detection methods often exhibit high computational complexity, poor adaptability, and difficulties in handling data imbalance when faced with large-scale, high-dimensional data.

Traditional anomaly detection methods are typically based on statistical,¹ clustering,² or machine learning techniques.³ For example, statistical-based methods (such as Z-score and Grubbs test) rely on assumptions about data distribution, making them unsuitable for complex high-dimensional data⁴; clustering-based methods (such as K-means and DBSCAN) incur significant computational overhead when processing large-scale data and are sensitive to parameter settings⁵; while supervised learning-based methods (such as support vector machines and neural networks) require large amounts of labeled data, making them less effective in addressing data imbalance issues.⁶ These limitations restrict the applicability of traditional methods in big data environments.

To address the challenges of anomaly detection in large-scale, high-dimensional data, the Isolation Forest algorithm has garnered significant attention due to its efficient computational performance and superior capability in handling anomalous data. By randomly selecting features and split points to construct isolation trees, Isolation Forest can quickly identify anomalies, making it particularly suitable for high-dimensional data and large-scale datasets. However, despite its outstanding performance in anomaly detection, Isolation Forest still faces several challenges in practical applications.⁷ For instance, high-dimensional data may contain many redundant or irrelevant features, which can affect the algorithm’s detection accuracy. Additionally, data imbalance issues may reduce the algorithm’s ability to detect minority-class anomalies.

To further enhance the performance of the Isolation Forest algorithm in big data environments, this paper focuses on introducing an adaptive feature selection method. By dynamically adjusting the weights of features, the algorithm can more flexibly adapt to the contribution of different features to anomaly detection, thereby addressing the limitations of the existing Isolation Forest in the context of big data. This improvement not only enhances the detection accuracy of the algorithm but also strengthens its adaptability to high-dimensional and imbalanced data, providing new perspectives and solutions for the further optimization of anomaly detection methods.

This paper is organized as follows. Section 2 reviews related work on big data anomaly detection. Section 3 presents the Isolation Forest algorithm model. Section 4 presents the adaptive feature selection algorithm. Section 5 describes the experimental setup and analyzes the results, and Section 6 concludes the paper.

Related work

In the fields of anomaly detection and feature selection, researchers have proposed various methods and techniques that have been widely applied in different scenarios. Below, we review related research from four aspects: anomaly detection methods, feature selection techniques, applications of Isolation Forest, and advancements in adaptive feature selection.

Anomaly detection methods

Anomaly detection is a crucial research direction in data mining and machine learning, aiming to identify data points that significantly deviate from normal patterns. Traditional anomaly detection methods can be broadly categorized into the following types.

(1) Statistical-based methods: Such as Z-score, Grubbs test, and Gaussian distribution-based methods.⁸ These methods rely on assumptions about data distribution and are suitable for low-dimensional data but perform poorly on high-dimensional data.

(2) Machine learning-based methods: Such as Support Vector Machines (SVMs), K-Nearest Neighbors (KNNs), and clustering algorithms (e.g., K-means and DBSCAN).⁹ These methods identify anomalies by constructing models or calculating distances between data points. However, they face challenges such as high computational complexity and parameter sensitivity when applied to large-scale data.

(3) Deep learning-based methods: Such as Autoencoders and Generative Adversarial Networks (GANs).^10,11 These methods extract data features through nonlinear mappings and can handle complex high-dimensional data but require large amounts of labeled data and computational resources.

Although these methods have achieved certain success in different scenarios, they still face challenges such as low computational efficiency and insufficient model robustness when dealing with large-scale, high-dimensional data.

Feature selection methods

Feature selection is a critical step in improving the performance of anomaly detection. Its goal is to select the most representative features from the original feature set to reduce computational complexity and enhance model performance. Traditional feature selection methods mainly include the following three categories.

(1) Filter methods: Such as Chi-square test, mutual information, and information gain.^12–14 These methods evaluate the relevance between features and target variables for feature selection, offering high computational efficiency but ignoring interactions between features.

(2) Wrapper methods: Such as Recursive Feature Elimination (RFE) and genetic algorithm-based feature selection.^15,16 These methods iteratively train models to evaluate the performance of feature subsets, achieving high accuracy but at the cost of significant computational overhead.

(3) Embedded methods: Such as Lasso regression and decision tree-based feature selection.^17,18 These methods integrate feature selection into the model training process, balancing efficiency and performance.

However, these methods often face trade-offs between efficiency and accuracy when dealing with large-scale, high-dimensional data.

Applications of isolation forest in anomaly detection

Isolation Forest is a tree-structure-based anomaly detection algorithm that has gained widespread attention due to its computational efficiency and suitability for high-dimensional data. Its core idea is to construct isolation trees by randomly selecting features and split points, so that anomalies are isolated on shorter paths and can therefore be detected effectively.⁷ The advantages of Isolation Forest include the following:

(1) High computational efficiency: Its time complexity is linear in the number of samples, $O(n)$, for a fixed number of trees and a fixed subsample size, making it suitable for large-scale data.

(2) No assumptions about data distribution: It can handle complex high-dimensional data.

(3) Sensitivity to anomalies: It can quickly identify data points that significantly deviate from normal patterns.

In recent years, Isolation Forest has been widely applied in fields such as network security, financial fraud detection, and industrial fault diagnosis.^19–22 However, Isolation Forest may be affected by redundant features when processing high-dimensional data, leading to reduced detection performance.

Advancements in adaptive feature selection

Adaptive feature selection methods are a class of emerging techniques proposed in recent years to address feature selection challenges.^23–25 These methods dynamically adjust feature weights or selection strategies, allowing for more flexible adaptation to changes in feature contributions to models. Examples include the following:

(1) Weight adjustment-based methods: These methods dynamically adjust feature weights by evaluating their importance, thereby improving model robustness.

(2) Reinforcement learning-based methods: These methods optimize the feature selection process using reinforcement learning algorithms, enabling efficient feature selection in complex data environments.

Adaptive feature selection methods offer significant advantages in improving model robustness and accuracy, particularly in scenarios involving large-scale, high-dimensional data.

In summary, the current research trend is to introduce more intelligent and adaptive methods into anomaly detection to address the challenges posed by large-scale, high-dimensional data. This paper combines Isolation Forest with adaptive feature selection techniques to propose an improved anomaly detection method. By dynamically adjusting feature weights, the method aims to enhance the algorithm’s performance on high-dimensional and imbalanced data, providing new research insights for further optimization of anomaly detection methods.

Isolation forest algorithm model

Isolation Forest is a tree-structure-based anomaly detection algorithm, whose core idea is to isolate anomalies by randomly partitioning the feature space. Since anomalies typically exhibit significantly different feature distributions compared to normal points, they tend to have shorter path lengths within the tree structure. Specifically, the Isolation Forest constructs multiple isolation trees by randomly selecting features and split points. Due to their sparse distribution and distinct characteristics, anomalies can be isolated in fewer partitioning steps, resulting in shorter path lengths within the tree structure. By calculating the average path length of a sample across all trees, the Isolation Forest effectively measures its anomaly score: the shorter the path length, the higher the degree of anomaly. This mechanism, based on random partitioning and path length, enables the Isolation Forest to achieve high efficiency and robustness when handling high-dimensional data and large-scale datasets.

Mathematical model of isolation forest

Assume there is a dataset $D = \{X_{1}, X_{2}, \dots, X_{n}\}$ containing $n$ samples, where each sample $X_{i} = (X_{i1}, X_{i2}, \dots, X_{id})$ is a $d$-dimensional feature vector. The construction process of the Isolation Forest is as follows:

Subsampling

A subset $S$ of size $ψ$ is randomly selected from the dataset $D$, where $ψ \ll n$. The subsampling process can be expressed as:

$$S = \mathrm{Subsample}(D, ψ) \tag{1}$$

where $\mathrm{Subsample}$ denotes the random sampling function.

Tree Building

A binary tree is constructed recursively. At each node, a feature dimension $q$ and a split threshold $p$ are randomly selected to divide the dataset into two subsets. This splitting is performed recursively until the tree reaches a predefined maximum depth $l_{\max}$ or the number of samples in the node falls below a certain threshold $ψ_{\min}$. The tree-building process is shown in equation (2):

$$T = \mathrm{BuildTree}(S, l_{\max}, ψ_{\min}) \tag{2}$$

where $T$ is the constructed tree structure.
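The subsampling and tree-building steps of equations (1) and (2) can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the dictionary-based node layout and the helper names `subsample`, `build_tree`, `leaf_sizes`, and `depth_of` are assumptions made for this example, and the split threshold is assumed to be drawn uniformly between the minimum and maximum of the chosen feature.

```python
import random

def subsample(data, psi, rng):
    """Equation (1): draw psi rows from data without replacement."""
    return rng.sample(data, min(psi, len(data)))

def build_tree(rows, depth, l_max, psi_min, rng):
    """Equation (2): recursively grow one isolation tree.

    Leaves are {"size": n}; internal nodes hold the split feature q,
    the threshold p, and the two child subtrees (assumed layout).
    """
    # Stop at the depth limit or when the node is too small to split.
    if depth >= l_max or len(rows) < max(psi_min, 2):
        return {"size": len(rows)}
    q = rng.randrange(len(rows[0]))          # random feature dimension
    lo = min(r[q] for r in rows)
    hi = max(r[q] for r in rows)
    if lo == hi:                             # feature is constant in this node
        return {"size": len(rows)}
    p = rng.uniform(lo, hi)                  # random split threshold
    left = [r for r in rows if r[q] < p]
    right = [r for r in rows if r[q] >= p]
    return {
        "q": q,
        "p": p,
        "left": build_tree(left, depth + 1, l_max, psi_min, rng),
        "right": build_tree(right, depth + 1, l_max, psi_min, rng),
    }

def leaf_sizes(node):
    """Collect the sample counts stored in all leaves."""
    if "size" in node:
        return [node["size"]]
    return leaf_sizes(node["left"]) + leaf_sizes(node["right"])

def depth_of(node):
    """Length of the longest root-to-leaf path, in edges."""
    if "size" in node:
        return 0
    return 1 + max(depth_of(node["left"]), depth_of(node["right"]))

rng = random.Random(0)
D = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(1000)]
S = subsample(D, 256, rng)
T = build_tree(S, 0, l_max=10, psi_min=2, rng=rng)
print(len(S), sum(leaf_sizes(T)), depth_of(T))
```

Every subsampled row ends up in exactly one leaf, so the leaf sizes add up to the subsample size, and no path is longer than the depth limit.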

Path length calculation

For each sample $X_{i}$, its path length $h(X_{i})$ in the constructed tree is calculated. The path length is the number of edges traversed from the root node to the leaf node. The path length calculation is shown in equation (3):

$$h(X_{i}) = \mathrm{PathLength}(X_{i}, T) \tag{3}$$

where $\mathrm{PathLength}$ denotes the function for calculating the path length.
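As an illustration of equation (3), the sketch below counts the edges from the root to the leaf that a sample reaches. The small hand-built tree and its dictionary layout (`q` for the split feature, `p` for the threshold, `size` for a leaf) are hypothetical and serve only to make the example self-contained.

```python
def path_length(x, node, depth=0):
    """Equation (3): number of edges from the root to the leaf reached by x."""
    if "size" in node:                       # leaf reached
        return depth
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, depth + 1)

# Hypothetical isolation tree over two features.
tree = {
    "q": 0, "p": 5.0,
    "left": {
        "q": 1, "p": 0.0,
        "left": {"size": 3},
        "right": {"size": 4},
    },
    "right": {"size": 1},                    # a single point isolated after one split
}

print(path_length([9.0, 0.0], tree))         # isolated early: 1 edge
print(path_length([1.0, -2.0], tree))        # deeper in the tree: 2 edges
```

The point with a large first coordinate is separated after a single split, which is the behavior the algorithm relies on to flag anomalies.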

Anomaly scoring

The anomaly score for each sample is computed from its average path length over all trees. Samples with shorter average path lengths are considered more likely to be anomalies. The anomaly score $s(X_{i})$ is defined as:

$$s(X_{i}) = 2^{-\frac{E[h(X_{i})]}{c(ψ)}} \tag{4}$$

where $E[h(X_{i})]$ is the average path length of sample $X_{i}$ across multiple trees, and $c(ψ)$ is the normalization factor for the path length, i.e., the average path length of an unsuccessful search in a binary search tree built on $ψ$ samples, defined as:

$$c(ψ) = 2 H(ψ - 1) - \frac{2 (ψ - 1)}{ψ} \tag{5}$$

where $H(ψ - 1)$ is the harmonic number, which can be approximated as:

$$H(ψ - 1) \approx \ln(ψ - 1) + γ \tag{6}$$

where $γ$ is the Euler-Mascheroni constant, approximately equal to 0.5772.
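Equations (4)–(6) can be checked numerically with a few lines of Python. The sketch uses the subsample size ψ in the denominator of the normalization factor, as in the original Isolation Forest formulation, and assumes ψ ≥ 2; the function names `c` and `anomaly_score` are chosen for this example.

```python
import math

EULER_GAMMA = 0.5772156649                   # Euler-Mascheroni constant

def c(psi):
    """Equations (5)-(6): normalization factor for subsample size psi."""
    if psi < 2:
        raise ValueError("psi must be at least 2")
    harmonic = math.log(psi - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (psi - 1) / psi

def anomaly_score(mean_path_length, psi):
    """Equation (4): score in (0, 1]; shorter paths give higher scores."""
    return 2.0 ** (-mean_path_length / c(psi))

psi = 256
print(round(c(psi), 4))                      # about 10.2448 for psi = 256
print(anomaly_score(c(psi), psi))            # a path of average length scores 0.5
print(anomaly_score(4.0, psi) > anomaly_score(12.0, psi))
```

A sample whose average path length equals $c(ψ)$ receives a score of 0.5, scores approach 1 as the path length approaches 0, and longer paths push the score toward 0.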

Algorithm flow of isolation forest

The algorithm flow of the Isolation Forest is shown in Figure 1.

Figure 1.

Algorithm flow of isolation forest.

The Isolation Forest algorithm operates as follows: First, the dataset $D$ is provided as input, and key parameters are set, including the number of trees $t$, subsample size $ψ$, maximum tree depth $l_{\max}$, and minimum sample size $ψ_{\min}$. Next, $t$ isolation trees are constructed. Each tree randomly selects a subsample $S_{j}$ from $D$ and recursively splits it. At each split, a random feature $x_{k}$ and a split value $v$ are chosen to divide the data into left and right subsets. This process continues until the maximum depth is reached or the number of samples in a node is less than $ψ_{\min}$. Then, for each sample $X_{i}$, the path length $h(X_{i}, T_{j})$ in each tree is determined, and the average path length $E[h(X_{i})]$ is computed. Finally, the anomaly score $s(X_{i})$ is calculated based on $E[h(X_{i})]$, where higher scores indicate a higher likelihood of the sample being an anomaly.

Adaptive feature selection algorithm

The core idea of the adaptive feature selection algorithm is to dynamically adjust the weights of features, enabling the model to better adapt to the varying contributions of different features to anomaly detection. This algorithm integrates the Isolation Forest model with adaptive feature selection methods, iteratively optimizing feature weights to improve the accuracy and robustness of anomaly detection.

The adaptive feature selection algorithm dynamically adjusts feature weights to optimize the anomaly detection performance of the Isolation Forest model. The algorithm begins by initializing the weights of each feature, typically setting them to equal values or based on prior knowledge. Subsequently, the algorithm constructs an Isolation Forest model using the current weights and calculates the anomaly scores of the samples. Based on the model’s output, the contribution of each feature is computed, usually measured by the number of splits or information gain of the feature in the tree structure. The feature weights are then dynamically updated according to their contributions, increasing the weights of high-contribution features and decreasing the weights of low-contribution features. Finally, the above steps are repeated until the feature weights converge or a predetermined number of iterations is reached, thereby achieving continuous optimization of anomaly detection performance.

Mathematical model of the adaptive feature selection algorithm

Assume there is a dataset $D = \{X_{1}, X_{2}, \dots, X_{n}\}$ containing $n$ samples, where each sample $X_{i} = (X_{i1}, X_{i2}, \dots, X_{id})$ is a $d$-dimensional feature vector.

Initialize feature weights

$$W = (w_{1}, w_{2}, \dots, w_{d}), \quad w_{j} = \frac{1}{d}, \quad j = 1, 2, \dots, d \tag{7}$$

where $W$ is the feature weight vector, and $w_{j}$ represents the weight of the $j$-th feature.

Build isolation forest model

$$\mathrm{IsolationForestModel} = \mathrm{BuildForest}(X, W) \tag{8}$$

where the function $\mathrm{BuildForest}$ denotes the construction of the Isolation Forest model using the current feature weights.

Calculate feature contribution

$$C_{j} = \frac{\sum_{t=1}^{T} \mathrm{SplitCount}(t, j)}{T}, \quad j = 1, 2, \dots, d \tag{9}$$

where $C_{j}$ represents the contribution of the $j$-th feature, $\mathrm{SplitCount}(t, j)$ denotes the number of splits of the $j$-th feature in the $t$-th tree, and $T$ is the total number of trees.

Update feature weights

$$w_{j} = \frac{C_{j}}{\sum_{k=1}^{d} C_{k}}, \quad j = 1, 2, \dots, d \tag{10}$$

where $w_{j}$ is the updated feature weight.
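A small worked example of equations (9) and (10) may help. The per-tree split counts below are made up for illustration, and the fallback to uniform weights when all contributions are zero is an assumption not stated in the text.

```python
def feature_contributions(split_counts):
    """Equation (9): split_counts[t][j] is the number of splits on feature j in tree t."""
    T = len(split_counts)
    d = len(split_counts[0])
    return [sum(tree[j] for tree in split_counts) / T for j in range(d)]

def update_weights(contributions):
    """Equation (10): normalize contributions so the weights sum to 1."""
    total = sum(contributions)
    if total == 0:                           # assumed fallback: uniform weights
        return [1.0 / len(contributions)] * len(contributions)
    return [c_j / total for c_j in contributions]

# Hypothetical split counts for T = 3 trees and d = 3 features.
split_counts = [
    [4, 1, 0],
    [2, 3, 0],
    [3, 2, 0],
]
C = feature_contributions(split_counts)      # [3.0, 2.0, 0.0]
W = update_weights(C)                        # [0.6, 0.4, 0.0]
print(C, W)
```

In this example the third feature is never used for a split, so equation (10) assigns it a weight of zero.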

Iterative optimization

Repeat steps 2–4 until convergence or the maximum number of iterations is reached.

Pseudocode of the adaptive feature selection algorithm

The pseudocode for the adaptive feature selection algorithm is described as follows:
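One possible rendering of steps 1–5 (equations (7)–(10)) in Python is sketched below; it is not the authors' listing. Several points are assumptions made for this sketch: the weights are applied by drawing the split feature with probability proportional to $w_{j}$, convergence is declared when no weight changes by more than `eps`, and the trees are not stored because only the split counts feed the weight update. The names `count_splits` and `adaptive_weights` are hypothetical.

```python
import random

def count_splits(rows, depth, l_max, W, counts, rng):
    """Grow one isolation tree implicitly and tally splits per feature.

    The split feature is drawn with probability proportional to W
    (an assumed way of applying the feature weights).
    """
    if depth >= l_max or len(rows) < 2:
        return
    q = rng.choices(range(len(W)), weights=W)[0]
    lo = min(r[q] for r in rows)
    hi = max(r[q] for r in rows)
    if lo == hi:                             # feature cannot separate these rows
        return
    p = rng.uniform(lo, hi)
    counts[q] += 1
    count_splits([r for r in rows if r[q] < p], depth + 1, l_max, W, counts, rng)
    count_splits([r for r in rows if r[q] >= p], depth + 1, l_max, W, counts, rng)

def adaptive_weights(data, T=100, psi=256, l_max=10, max_iter=10, eps=0.01, seed=0):
    rng = random.Random(seed)
    d = len(data[0])
    W = [1.0 / d] * d                        # step 1, equation (7)
    for _ in range(max_iter):                # step 5: iterate
        counts = [0] * d
        for _ in range(T):                   # step 2, equation (8)
            S = rng.sample(data, min(psi, len(data)))
            count_splits(S, 0, l_max, W, counts, rng)
        C = [n / T for n in counts]          # step 3, equation (9)
        total = sum(C)
        if total == 0:                       # no feature could be split
            break
        W_new = [c_j / total for c_j in C]   # step 4, equation (10)
        converged = max(abs(a - b) for a, b in zip(W_new, W)) < eps
        W = W_new
        if converged:
            break
    return W

# Toy data: two informative features and one constant feature.
data_rng = random.Random(1)
data = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1), 1.0] for _ in range(500)]
W = adaptive_weights(data, T=20, psi=64)
print(W)
```

On this toy data the constant third feature can never produce a split, so its weight drops to zero after the first iteration and the remaining weight is shared by the two informative features.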

The overall time complexity of the adaptive feature selection algorithm is:

$$O(\mathrm{max\_iter} \cdot (T \cdot ψ \log ψ + T \cdot ψ + d)) \tag{11}$$

where $\mathrm{max\_iter}$ is the maximum number of iterations, $T$ is the number of trees, $ψ$ is the subsample size, and $d$ is the feature dimension. The overall space complexity of the adaptive feature selection algorithm is

$$O(T \cdot ψ + d)$$

Optimization of the adaptive feature selection algorithm

Although the adaptive feature selection algorithm demonstrates excellent performance in enhancing the anomaly detection capabilities of the Isolation Forest model, it still faces challenges in computational efficiency and robustness when dealing with high-dimensional data and large-scale datasets. To further optimize the algorithm’s performance, this paper proposes the following two optimization directions: optimization of feature contribution calculation and optimization of the weight update strategy.

Optimization of feature contribution calculation

In the current algorithm, the feature contribution is measured by counting the number of splits of each feature in the tree structure. Although this method is simple and intuitive, calculating the split counts for all features in high-dimensional data can lead to high computational complexity. To address this, this paper proposes the following optimization methods.

Introduction of information gain and gini index

The feature contribution can be measured using information gain or the Gini index, which can more accurately reflect the contribution of features to anomaly detection. Specifically, the contribution $C_{j}$ of feature $j$ can be expressed as:

$$C_{j} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{InfoGain}(t, j) \tag{12}$$

Approximation and sampling techniques

To reduce computational costs, random sampling techniques can be employed to calculate contributions for only a subset of trees or features. For example, randomly select $T'$ trees ($T' \ll T$) and $d'$ features ($d' \ll d$) to compute the approximate contribution:

$$C_{j} \approx \frac{1}{T'} \sum_{t=1}^{T'} \mathrm{SplitCount}(t, j) \tag{13}$$

This approach reduces the cost of the contribution computation roughly in proportion to $T'/T$ while keeping the estimate of $C_{j}$ close to its full-forest value, provided $T'$ is not too small.
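The tree-sampling part of equation (13) can be illustrated as follows. The split counts are randomly generated stand-ins rather than measured values, the function names are chosen for this example, and only the sampling of trees is shown; restricting the computation to $d'$ features would follow the same pattern.

```python
import random

def exact_contributions(split_counts):
    """Equation (9): average split count per feature over all trees."""
    T = len(split_counts)
    d = len(split_counts[0])
    return [sum(tree[j] for tree in split_counts) / T for j in range(d)]

def approx_contributions(split_counts, t_prime, rng):
    """Equation (13): average over a random subset of t_prime trees."""
    sampled = rng.sample(split_counts, t_prime)
    return exact_contributions(sampled)

rng = random.Random(0)
# Hypothetical per-tree split counts for T = 100 trees and d = 3 features.
split_counts = [
    [rng.randint(3, 7), rng.randint(1, 4), rng.randint(0, 2)]
    for _ in range(100)
]
exact = exact_contributions(split_counts)
approx = approx_contributions(split_counts, 20, rng)
print(exact)
print(approx)
```

With $T' = T$ the estimate coincides with equation (9); smaller values of $T'$ trade accuracy for speed.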

Optimization of weight update strategy

The current algorithm updates feature weights by normalizing feature contributions. Although this method is simple, it may not fully capture the complex relationships between features and is susceptible to noise in the data. To address this, this paper proposes the following optimization strategies.

Introduction of regularization terms

We introduce an L2 regularization term during the weight update process to prevent overfitting of the weights. The updated feature weights $W$ can be calculated using the following formula:

$$W = \arg\min_{W} \| C - W \|_{2}^{2} + λ \| W \|_{2}^{2} \tag{14}$$

where $C$ is the feature contribution vector, and $λ$ is the regularization coefficient. By solving this optimization problem, more stable weight update results can be obtained.
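As a short worked step, assuming the objective in equation (14) is minimized without further constraints on $W$, setting its gradient to zero gives a closed form:

$$\nabla_{W}\left(\| C - W \|_{2}^{2} + λ \| W \|_{2}^{2}\right) = -2 (C - W) + 2 λ W = 0 \;\Longrightarrow\; W = \frac{C}{1 + λ}$$

With the value $λ = 0.1$ used in the experiments this gives $W = C / 1.1 \approx 0.909\, C$, so every weight is shrunk by the same factor. Any normalization of the weights to unit sum would have to be applied afterwards and is not part of equation (14).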

Dynamic adjustment of update step size

To accelerate the convergence of weight updates, we introduce a dynamically adjusted step size $η$. Specifically, the step size $η$ is adjusted based on the rate of change in feature contributions:

$$η = η_{0} \cdot \exp\left(-\frac{\| C^{(k)} - C^{(k-1)} \|_{2}}{\| C^{(k-1)} \|_{2}}\right) \tag{15}$$

where $η_{0}$ is the initial step size, and $C^{(k)}$ and $C^{(k-1)}$ represent the feature contribution vectors at the $k$-th and $(k-1)$-th iterations, respectively. Under this rule, $η$ is small while the contributions still change sharply between iterations, which damps oscillations in the weights, and it approaches $η_{0}$ as the contributions stabilize.
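The behavior of equation (15) can be checked with a short computation. The function names are chosen for this example, and returning $η_{0}$ when the previous contribution vector is all zeros is an assumption, since the formula is undefined in that case.

```python
import math

def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def step_size(eta0, c_curr, c_prev):
    """Equation (15): step size from the relative change in contributions."""
    denom = l2_norm(c_prev)
    if denom == 0:                           # assumed fallback for an all-zero C^(k-1)
        return eta0
    diff = [a - b for a, b in zip(c_curr, c_prev)]
    return eta0 * math.exp(-l2_norm(diff) / denom)

eta0 = 0.01
print(step_size(eta0, [3.0, 4.0], [3.0, 4.0]))   # no change: eta equals eta0
print(step_size(eta0, [0.0, 0.0], [3.0, 4.0]))   # relative change 1: eta0 / e
```

When the contributions do not change, the exponent is zero and $η = η_{0}$; a relative change of 1 reduces the step to $η_{0}/e$.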

Consideration of feature correlations

We also introduce a feature correlation matrix $R$ during the weight update process, where $R_{ij}$ represents the correlation between feature $i$ and feature $j$. The updated feature weights $W$ can be calculated using the following formula:

$$W = \frac{R C}{\sum_{j=1}^{d} (R C)_{j}} \tag{16}$$

This optimization method can better capture the interactions between features, improving the accuracy of weight updates.
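Equation (16) amounts to a matrix-vector product followed by normalization, which the sketch below spells out. The matrix `R` and the vector `C` are made-up values, and the example assumes non-negative entries in `R` (for instance absolute correlations) so that the resulting weights stay non-negative; the text does not state how negative correlations are handled.

```python
def correlation_weighted_update(R, C):
    """Equation (16): W = (R C) / sum_j (R C)_j."""
    d = len(C)
    RC = [sum(R[i][j] * C[j] for j in range(d)) for i in range(d)]
    total = sum(RC)
    if total == 0:
        raise ValueError("R C sums to zero; weights are undefined")
    return [v / total for v in RC]

# Hypothetical correlation matrix: features 0 and 1 are correlated.
R = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
C = [3.0, 2.0, 1.0]
W = correlation_weighted_update(R, C)        # R C = [4.0, 3.5, 1.0]
print(W)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(correlation_weighted_update(identity, C))
```

With the identity matrix the update reduces to the plain normalization of equation (10); with correlated features, part of each feature's contribution is credited to its correlated partner.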

Experimental design and data analysis

To comprehensively evaluate the performance of the adaptive feature selection algorithm and its optimization methods, this paper designs five sets of performance experiments and conducts comparative analyses on multiple public datasets. The following sections elaborate on the experimental setup, comparison methods, experimental results, and data analysis.

The experimental environment includes hardware configurations with an Intel Core i7-10750H @ 2.60 GHz CPU, 16 GB DDR4 RAM, and an NVIDIA GeForce GTX 1660 Ti GPU. The software environment consists of the Ubuntu 20.04 LTS operating system, Python 3.8 programming language, and major libraries such as Scikit-learn, NumPy, Pandas, and Matplotlib. Experiments are conducted on multiple public datasets, including KDD Cup 1999,²⁶ Credit Card Fraud Detection,²⁷ MNIST,²⁸ and NSL-KDD,²⁹ to comprehensively evaluate the algorithm’s performance.

The experimental parameter settings include the number of trees (n_estimators) in the Isolation Forest model set to 100, subsample size (max_samples) set to 256, maximum tree depth (max_depth) set to 10, and minimum sample split size (min_samples_split) set to 2. For the adaptive feature selection algorithm, the maximum number of iterations (max_iter) is set to 10, and the convergence threshold (epsilon) is set to 0.01. In the optimization methods, the regularization coefficient (lambda) is set to 0.1, and the initial step size (eta_0) is set to 0.01, ensuring consistent and comparable performance evaluation across different datasets.

The comparison methods include the traditional Isolation Forest, the original adaptive feature selection algorithm, the optimized adaptive feature selection algorithm, Local Outlier Factor (LOF),³⁰ and One-Class Support Vector Machine (One-Class SVM).³¹ The traditional Isolation Forest does not use feature weight adjustment, the original adaptive algorithm calculates feature contributions based on split counts, the optimized algorithm introduces information gain and dynamic step size for weight updates, LOF detects anomalies based on density, and One-Class SVM separates anomalies by constructing a hyperplane. These methods are used together to comprehensively evaluate the performance of the proposed algorithm.

The experiments evaluate the performance of each method in terms of accuracy, recall, F1 score, and runtime.
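For reference, the sketch below computes these metrics from binary labels. It assumes the usual confusion-matrix definitions with label 1 denoting an anomaly, since the exact definitions are not spelled out in the text; the function name `detection_metrics` and the toy labels are illustrative only.

```python
def detection_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with label 1 = anomaly (assumed convention)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: 3 true anomalies among 10 samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = detection_metrics(y_true, y_pred)
print(metrics)
```

In this toy case two of the three anomalies are found and one normal sample is flagged, giving an accuracy of 0.8 and precision, recall, and F1 of 2/3 each.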

Experimental design for model accuracy and generalization ability

To evaluate the accuracy and generalization ability of the adaptive feature selection algorithm and its optimization methods, this paper designs the following experiments. The experiments test the performance of the models on multiple datasets and analyze their performance under different data distributions to verify the robustness and generalization ability of the models.

First, the Isolation Forest, the original adaptive feature selection algorithm, and the optimized adaptive feature selection algorithm are trained using the KDD Cup 1999 dataset, and their performance on the training set is recorded. Then, the trained models are applied to the NSL-KDD, Credit Card Fraud Detection, and MNIST datasets to test their generalization ability under different data distributions. Additionally, 5% random noise is introduced into the NSL-KDD and Credit Card Fraud Detection datasets to test the robustness of the models under noisy data. Finally, the accuracy, recall, and F1 score of each model on the training set, test set, and noisy data are compared to evaluate the accuracy and generalization ability of the models.

The experimental results are shown in Table 1.

Table 1.

Comparison of model accuracy and generalization ability of different algorithms.

Dataset	Method	Accuracy	Recall	F1
KDD cup 1999 (Training set)	Traditional isolation forest	0.92	0.85	0.88
	Original adaptive algorithm	0.94	0.88	0.91
	Optimized adaptive algorithm	0.95	0.89	0.92
	LOF	0.89	0.82	0.85
	One-class SVM	0.9	0.83	0.86
NSL-KDD (Test set)	Traditional isolation forest	0.9	0.83	0.86
	Original adaptive algorithm	0.92	0.85	0.88
	Optimized adaptive algorithm	0.93	0.86	0.89
	LOF	0.87	0.8	0.83
	One-class SVM	0.88	0.81	0.84
Credit card fraud detection (Test set)	Traditional isolation forest	0.93	0.87	0.9
	Original adaptive algorithm	0.94	0.88	0.91
	Optimized adaptive algorithm	0.95	0.89	0.92
	LOF	0.91	0.84	0.87
	One-class SVM	0.92	0.85	0.88
MNIST (Test set)	Traditional isolation forest	0.87	0.8	0.83
	Original adaptive algorithm	0.89	0.82	0.85
	Optimized adaptive algorithm	0.9	0.83	0.86
	LOF	0.85	0.78	0.81
	One-class SVM	0.86	0.79
NSL-KDD (Noisy data)	Traditional isolation forest	0.88	0.81	0.84
	Original adaptive algorithm	0.9	0.83	0.86
	Optimized adaptive algorithm	0.91	0.84	0.87
	LOF	0.86	0.79	0.82
	One-class SVM	0.87	0.8	0.83
Credit card fraud detection (Noisy data)	Traditional isolation forest	0.91	0.85	0.88
	Original adaptive algorithm	0.92	0.86	0.89
	Optimized adaptive algorithm	0.93	0.87	0.9
	LOF	0.89	0.82	0.85
	One-class SVM	0.9	0.83	0.86

From Table 1, it can be observed that the optimized adaptive feature selection algorithm outperforms both the traditional Isolation Forest and the original adaptive algorithm in terms of model accuracy and generalization ability. On the training set, the F1 score of the optimized algorithm reaches 0.92, which is 0.01 higher than that of the original algorithm and 0.04 higher than that of the traditional Isolation Forest. In cross-dataset testing, the F1 score of the optimized algorithm on the NSL-KDD dataset is 0.89, an improvement of 0.01 over the original algorithm and 0.03 over the traditional Isolation Forest. In noisy data testing, the F1 score of the optimized algorithm is 0.87 on NSL-KDD and 0.90 on Credit Card Fraud Detection, in both cases 0.02 below its noise-free value and still the highest among the compared methods.

The optimized adaptive feature selection algorithm surpasses the Isolation Forest, the original adaptive algorithm, LOF, and One-Class SVM in both model accuracy and generalization ability. This algorithm not only achieves higher accuracy under known data distributions but also adapts well to different data distributions and noisy environments, exhibiting strong generalization ability and robustness. This provides a more reliable solution for anomaly detection in large-scale, high-dimensional data.

Comparison of model algorithm training time and inference time

This experiment evaluates the training time and inference time of the Isolation Forest, the original adaptive feature selection algorithm, the optimized adaptive feature selection algorithm, Local Outlier Factor (LOF), and One-Class Support Vector Machine (One-Class SVM) on large-scale datasets. The experiment involves training and testing these algorithms on the KDD Cup 1999, Credit Card Fraud Detection, and MNIST datasets, recording their training and inference times, and conducting a comparative analysis of the computational efficiency of each algorithm.

The experimental results are shown in Table 2.

Table 2.

Comparison of model training and inference time.

Dataset	Method	Training time (s)	Inference time (s)
KDD cup 1999	Traditional isolation forest	45.3	2.1
	Original adaptive algorithm	48.7	2.3
	Optimized adaptive algorithm	35.2	1.8
	LOF	62.1	3.5
	One-class SVM	78.5	4.2
Credit card fraud detection	Traditional isolation forest	12.4	0.8
	Original adaptive algorithm	13.8	0.9
	Optimized adaptive algorithm	10.4	0.7
	LOF	18.2	1.2
	One-class SVM	25.6	1.5
MNIST	Traditional isolation forest	34.7	1.5
	Original adaptive algorithm	36.2	1.6
	Optimized adaptive algorithm	28.5	1.3
	LOF	50.3	2.2
	One-class SVM	68.9	3

From Table 2, it can be observed that the optimized adaptive feature selection algorithm outperforms the Isolation Forest, the original adaptive algorithm, LOF, and One-Class SVM in both training time and inference time. On the KDD Cup 1999 dataset, the training time of the optimized algorithm is 35.2 seconds, which is 27.7% less than that of the original algorithm and 22.3% less than that of the Isolation Forest. On the Credit Card Fraud Detection dataset, the inference time of the optimized algorithm is 0.7 seconds, representing a 22.2% reduction compared to the original algorithm and a 12.5% reduction compared to the Isolation Forest. This indicates that the optimized algorithm achieves higher computational efficiency during both the training and inference phases, significantly reducing time consumption and providing better real-time performance. It offers a more efficient solution for anomaly detection in large-scale datasets.
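The percentage reductions quoted above follow directly from the values in Table 2 and can be reproduced with a one-line formula; the helper name `relative_reduction` is chosen for this example.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Training time on KDD Cup 1999 (seconds, from Table 2).
print(round(relative_reduction(48.7, 35.2), 1))   # vs. original adaptive algorithm: 27.7
print(round(relative_reduction(45.3, 35.2), 1))   # vs. traditional Isolation Forest: 22.3

# Inference time on Credit Card Fraud Detection (seconds, from Table 2).
print(round(relative_reduction(0.9, 0.7), 1))     # vs. original adaptive algorithm: 22.2
print(round(relative_reduction(0.8, 0.7), 1))     # vs. traditional Isolation Forest: 12.5
```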

Comparison of model noise resistance capability

To evaluate the stability and accuracy of different algorithms in noisy environments, this paper designs a noise resistance capability experiment. The experiment compares the performance of Isolation Forest, the original adaptive feature selection algorithm, the optimized adaptive feature selection algorithm, Local Outlier Factor (LOF), and One-Class Support Vector Machine (One-Class SVM) in noisy data environments by introducing varying levels of noise into the data.

The experiment begins by preprocessing the UCI KDD Cup 99 dataset and adding 10% and 20% random noise, respectively. Next, the Isolation Forest, the original adaptive feature selection algorithm, the optimized adaptive feature selection algorithm, LOF, and One-Class SVM are trained on datasets with no noise, 10% noise, and 20% noise. Then, the accuracy, precision, recall, and F1 score of each algorithm under different noise levels are recorded. Finally, by comparing the performance differences of the algorithms under different noise levels, their noise resistance capabilities and stability are analyzed.

From the experimental results in Table 3, it can be observed that the optimized adaptive feature selection algorithm demonstrates strong noise resistance across different noise levels, with its accuracy, recall, and F1 score exceeding those of the Isolation Forest, the original adaptive algorithm, LOF, and One-Class SVM. This algorithm exhibits high stability and accuracy in noisy data environments, making it suitable for complex real-world application scenarios. In contrast, LOF and One-Class SVM are more sensitive to noisy data: between 10% and 20% noise their accuracy falls by 0.03, compared with 0.02 for the three Isolation Forest variants, so they are better suited for environments with minimal noise.

Table 3.

Comparison of noise resistance performance of different algorithm models.

Noise level	Method	Accuracy	Recall	F1
10% noise	Traditional isolation forest	0.92	0.90	0.88
10% noise	Original adaptive algorithm	0.94	0.92	0.91
10% noise	Optimized adaptive algorithm	0.95	0.93	0.92
10% noise	LOF	0.89	0.87	0.85
10% noise	One-class SVM	0.90	0.88	0.86
20% noise	Traditional isolation forest	0.90	0.88	0.86
20% noise	Original adaptive algorithm	0.92	0.90	0.88
20% noise	Optimized adaptive algorithm	0.93	0.91	0.89
20% noise	LOF	0.86	0.84	0.82
20% noise	One-class SVM	0.87	0.85	0.83
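One way to read Table 3 is to look at how much each score falls when the noise level rises from 10% to 20%. The short sketch below does this arithmetic on the values transcribed from the table; the helper name `degradation` is illustrative only and is not part of the paper.

```python
# (accuracy, recall, F1) per method, transcribed from Table 3.
table3 = {
    "Traditional isolation forest": {"10%": (0.92, 0.90, 0.88), "20%": (0.90, 0.88, 0.86)},
    "Original adaptive algorithm": {"10%": (0.94, 0.92, 0.91), "20%": (0.92, 0.90, 0.88)},
    "Optimized adaptive algorithm": {"10%": (0.95, 0.93, 0.92), "20%": (0.93, 0.91, 0.89)},
    "LOF": {"10%": (0.89, 0.87, 0.85), "20%": (0.86, 0.84, 0.82)},
    "One-class SVM": {"10%": (0.90, 0.88, 0.86), "20%": (0.87, 0.85, 0.83)},
}


def degradation(scores):
    """Drop in each metric when moving from 10% to 20% noise."""
    return tuple(round(a - b, 2) for a, b in zip(scores["10%"], scores["20%"]))


drops = {method: degradation(scores) for method, scores in table3.items()}
for method, (d_acc, d_rec, d_f1) in drops.items():
    print(f"{method}: accuracy -{d_acc:.2f}, recall -{d_rec:.2f}, F1 -{d_f1:.2f}")
```

On these figures, accuracy and recall fall by 0.02 for the three isolation-forest-based methods and by 0.03 for LOF and One-Class SVM, which is consistent with the observation that the latter two are more sensitive to noise.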

Comparison of model parameter sensitivity

To evaluate the sensitivity of different algorithms to hyperparameters, this paper designs a parameter sensitivity experiment. The experiment analyzes the impact of key hyperparameters on algorithm performance by adjusting those of the Isolation Forest, the original adaptive feature selection algorithm, the optimized adaptive feature selection algorithm, Local Outlier Factor, and One-Class Support Vector Machine.
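A parameter sensitivity experiment of this kind can be organized as a simple sweep: train once per candidate value, then record accuracy, training time, and inference time. The sketch below shows one possible harness under stated assumptions; `sweep`, `best_by_accuracy`, `train_fn`, and `score_fn` are hypothetical names, and the toy callbacks at the end merely stand in for a real detector and evaluation set.

```python
import time


def sweep(train_fn, score_fn, values):
    """Train once per hyperparameter value and record accuracy and timings.

    train_fn(value) -> model   (hypothetical callback: fits a detector)
    score_fn(model) -> float   (hypothetical callback: accuracy on held-out data)
    """
    results = []
    for value in values:
        start = time.perf_counter()
        model = train_fn(value)
        train_time = time.perf_counter() - start

        start = time.perf_counter()
        accuracy = score_fn(model)
        inference_time = time.perf_counter() - start

        results.append({
            "value": value,
            "accuracy": accuracy,
            "train_time_s": train_time,
            "inference_time_s": inference_time,
        })
    return results


def best_by_accuracy(results):
    """Return the sweep entry with the highest accuracy."""
    return max(results, key=lambda r: r["accuracy"])


# Toy stand-ins: the "model" is just the parameter value, and accuracy
# grows with it. Replace both callbacks with a real detector and test set.
results = sweep(lambda n_trees: n_trees, lambda model: 1 - 1 / model, [10, 50, 100])
best = best_by_accuracy(results)
```

Recording the timings alongside accuracy keeps the trade-off visible, since a setting that raises accuracy may also raise training or inference cost.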

From Table 4, it can be observed that the optimized adaptive feature selection algorithm is sensitive to the regularization coefficient. Proper adjustment of the regularization coefficient can significantly improve accuracy and reduce training time. The Isolation Forest is sensitive to the number of trees, and increasing the number of trees helps improve accuracy, although it slightly increases both training and inference times. LOF and One-Class SVM exhibit significant sensitivity to parameter changes, particularly the number of neighbors and the kernel function type. Adjusting these parameters directly affects the performance of the algorithms.

Table 4.

Comparison of model parameter sensitivity of different algorithm models.

Method	Adjusted parameter	Optimal accuracy (%)	Optimal training time (s)	Optimal inference time (s)
Traditional isolation forest	Number of trees	97.8	3.2	0.01
Original adaptive algorithm	Maximum iterations	96.5	4.0	0.02
Optimized adaptive algorithm	Regularization coefficient	98.0	3.5	0.01
LOF	Number of neighbors	92.1	1.8	0.02
One-class SVM	Kernel function type	94.6	5.6	0.05

Additionally, the original adaptive algorithm is sensitive to the maximum number of iterations. Increasing the number of iterations can improve accuracy but significantly increases training time. Overall, each algorithm responds differently to parameter tuning. Choosing an appropriate parameter configuration can effectively improve both model accuracy and efficiency, and this matters most on large-scale data.

Conclusion

This paper proposes an adaptive feature selection method based on the Isolation Forest algorithm to enhance anomaly detection performance in big data environments. The traditional Isolation Forest algorithm faces challenges such as feature redundancy and computational inefficiency when handling high-dimensional and large-scale data. In contrast, the adaptive feature selection method improves the model’s sensitivity to anomalies by dynamically adjusting feature weights. Experimental results demonstrate that the proposed method significantly outperforms the traditional Isolation Forest algorithm on multiple large datasets, particularly on high-dimensional data, where it shows strong adaptability and robustness.

Despite the achievements of this study, some limitations remain. First, the current method primarily focuses on optimizing feature selection. Future work could explore integrating other anomaly detection algorithms to develop a more accurate and efficient comprehensive anomaly detection framework. Second, the performance of the proposed method on real-time data streams has not been thoroughly investigated. Therefore, future research could explore algorithm improvements from the perspectives of online learning and incremental learning to adapt to dynamic data changes. Finally, the experiments in this paper are limited to specific datasets. Future work could extend the evaluation to broader application scenarios to validate the algorithm’s generalizability and scalability.

Footnotes

ORCID iD

Min He

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
