Outlier Detection Methods for One-Dimensional Features Based on the Isolation Forest and 2-Means Algorithms

Abstract

In our study, we propose seven methods of detecting outliers in a set of one-dimensional observations. Instead of considering only the one-dimensional input data, we use their two-dimensional vector representations, where each representation consists of an original observation and a score obtained by applying the Isolation Forest (IF) or Extended Isolation Forest (EIF) algorithm. For the corresponding pairs of values, we first apply the $2$-means clustering method in order to separate them into two clusters, and subsequently we employ different machine learning concepts, such as the One-Class SVM, methods based on conformal scores and conformal prediction sets, or testing procedures using conformal $p$-values. For comparison, we also examine classic clustering approaches based on the EM algorithm. Our one-dimensional empirical data are generated from mixtures of two distributions. In our numerical experiments, we consider various examples of distribution mixtures. These mixtures display different difficulty levels with respect to outlier identification, ranging from mixtures of two Gaussian distributions to mixtures of two heavy-tailed distributions. The proposed methods are also applied to four real-world data sets. All of our computations have been carried out using the R environment.

Keywords

outlier; isolation forest; extended isolation forest; k-means algorithm; EM algorithm; machine learning

1. Introduction

Outlier (or anomaly) detection is a vital and long-studied problem in statistics, data mining and machine learning. Its importance comes from the fact that the presence of outliers can have a very significant impact on the results of scientific analyses in these fields, and also from the many practical applications of outlier detection algorithms, ranging from fraud detection in financial transactions and intrusion detection in computer networks, through fault tracking in industrial and commercial systems and quality control in manufacturing, to the identification of rare but critical events in medical analysis. The main goal of research on anomaly detection is to distinguish between inliers, i.e., observations consistent with the cumulative distribution function (cdf) of the input data, and outliers, i.e., data points that significantly deviate from that pattern. From a probabilistic point of view, this problem is often formalized as deciding whether the given observations originate from a reference cdf $F = F_{0}$ (in which case we treat them as inliers) or from some alternative cdf $F \neq F_{0}$ (in which case they are considered to be outliers). Such a formalized definition of inliers and outliers has been proposed by Bates et al. (2023).

A range of methods has been introduced to address the problem of outlier detection. Many of them rely on distributional assumptions and parametric modeling, but some approaches develop non-parametric concepts. For example, the method of outlier detection introduced in Bates et al. (2023) is a non-parametric approach. It is based on testing individual observations using conformal inference: conformal p-values are constructed and outliers are then identified by multiple-testing procedures. Continuing our review of articles on anomaly detection, we wish to mention the papers by Li et al. (2022) and Li et al. (2020), where the so-called ECOD (Empirical-Cumulative-distribution-based Outlier Detection) and COPOD (Copula-Based Outlier Detection) methods are proposed, respectively. The first of these methods (ECOD) assumes that outliers are often rare events that come from the tails of the underlying distribution. It is a parameter-free approach: in its first step, the empirical cumulative distribution function (ecdf) of each dimension of the (usually multivariate) input data is computed, and then, based on the obtained ecdfs, tail probabilities are determined per dimension for each observation. Finally, by aggregating the estimated tail probabilities across all dimensions, outlier scores for the input observations are obtained. In turn, the COPOD approach is also a parameter-free design, which uses copulas for modeling multidimensional distributions. It is likewise a multi-step concept: at first, an empirical copula is created; then the constructed empirical copula is used to predict tail probabilities for each point of the given data, which enables calculating the corresponding anomaly measures and separating the available instances into inliers and outliers.

In addition to the methods presented above, some other contemporary outlier detection procedures are particularly worth noting. They include, e.g., the Isolation Forest (IF) approach, which is an unsupervised algorithm where multiple isolation trees are built. Each node in such a tree randomly selects a feature and a random split value, and splits are continued until every data point is isolated (placed in its own leaf). In this setting, outliers are the observations that are easier to isolate than potential inliers, i.e., anomalies are separated relatively quickly from the rest of the data, whereas normal points (inliers) need more splits to be isolated. In order to determine the isolation measures (isolation scores) of observations, the isolation path of each point has to be calculated, where by the isolation path we mean the number of splits (tree depth) required to isolate a point. Consequently, outliers should have short paths (as they are easy to isolate) and inliers should have long paths (as they require many splits to be separated from the rest of the data). The final anomaly score is based on the average path length across many trees: a short average path indicates a potential anomaly, and a long one indicates that the observation is likely typical (normal). For further details regarding the Isolation Forest concept, we refer to Liu et al. (2008).
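The isolation-path idea can be illustrated in the one-dimensional case with a minimal sketch. The code below is neither the implementation of Liu et al. (2008) nor the routine used in this paper: it omits subsampling and the normalization of path lengths into a score, and the names `isolation_depth` and `mean_depth`, the depth cap and the toy data are assumptions introduced for illustration only.

```python
import random

def isolation_depth(x, data, rng, depth=0, max_depth=50):
    """Count the random threshold splits needed to isolate x within data.

    x is assumed to be an element of data (illustrative helper).
    """
    lo, hi = min(data), max(data)
    # Stop when x is alone, when all remaining values coincide,
    # or when the (hypothetical) depth cap is reached.
    if len(data) <= 1 or lo == hi or depth >= max_depth:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains x.
    side = [v for v in data if (v < split) == (x < split)]
    return isolation_depth(x, side, rng, depth + 1, max_depth)

def mean_depth(x, data, n_trees=100, seed=0):
    """Average isolation depth of x over n_trees random trees."""
    rng = random.Random(seed)
    total = sum(isolation_depth(x, data, rng) for _ in range(n_trees))
    return total / n_trees

# Toy sample: 41 clustered points in [-2, 2] plus one far point.
data = [0.1 * i for i in range(-20, 21)] + [15.0]
print(mean_depth(15.0, data), mean_depth(0.0, data))
```

In this toy sample the far point 15.0 is usually isolated by the first split, whereas a point in the middle of the cluster needs several splits, so its average depth is larger.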

Apart from the standard Isolation Forest (IF) algorithm, we also consider its generalized version called the Extended Isolation Forest (EIF). The EIF algorithm was introduced in order to fix a fundamental bias that arises in the standard IF setting. This bias results from the fact that all splits in standard IF are axis-parallel, so the data set is always sliced vertically or horizontally (or along coordinate axes in higher dimensions). Such a constraint systematically produces biases as the trees grow, which leads to inconsistent anomaly scores, distorts the score maps, complicates the interpretation and evaluation of anomalous events and reduces the robustness of anomaly scoring. In order to avoid these drawbacks, the EIF method was proposed by Hariri et al. (2021). The EIF algorithm stays close to the original philosophy of the IF approach and is a natural extension of it. Its core idea is to allow the splitting hyperplanes to take arbitrary orientations, so that each tree “cuts” the data space in random directions. As a result, we obtain a method that preserves the spirit of random isolation but removes the axis-alignment bias of IF, and therefore produces more consistent and robust anomaly scores, adapts naturally to the shape of the data and works better for complex data structures.

Both IF and EIF are now standard baselines, serving as the foundation for subsequent methods, such as DeepIForest (DeepIF) and Generalized Isolation Forest (GIF). Deep variants of the IF and EIF algorithms have been proposed to strengthen representation learning prior to isolation. They integrate deep neural networks with isolation-based detectors, typically relying on EIF-style splits in the learned feature space. For DeepIF and GIF, we refer to the papers by Xu et al. (2022) and Zhou and Paffenroth (2017).

EIF and its later developments have become widely applied in multivariate settings due to their improved flexibility, although in the univariate case the hyperplane generalization reduces to standard threshold splits and the two formulations coincide. Another direction in contemporary research on extensions of the Isolation Forest explores hybrid models combining IF with clustering techniques. Several clustering–Isolation-Forest hybrid models have already been proposed and should be noted here. For example, Karczmarek et al. (2020) introduced the k-means-based IF with multi-fork tree structures, achieving improved separability in heterogeneous data sets. In turn, the Cluster-Based Improved Isolation Forest (CIIF) framework by Shao et al. (2022) used k-means clustering as a pre-selection step prior to isolation, whereas Ayoub et al. (2023) showed that fuzzy C-Means clustering can outperform classical k-means when combined with isolation-based scoring. The hybrid approaches aim to improve anomaly detection in heterogeneous or multi-modal distributions by incorporating structural information prior to or during the isolation process. A complementary line of work concerns the use of conformal p-values for outlier detection. Its theoretical framework has been developed by Bates et al. (2023), who established rigorous constructions of marginal conformal p-values and showed how False Discovery Rate (FDR) control via the Benjamini–Hochberg procedure should be performed in this context. This approach specifies the finite-sample assumptions and the formal validity conditions for computing conformal anomaly scores. Conformal methods have recently been applied to several anomaly-detection scenarios, providing settings that make it possible to quantify uncertainty and to control error rates. Recent research confirms that hybrid algorithms combining isolation-based scores with clustering or conformal inference have become an important and active research area, and that integrating IF or EIF with clustering or conformal inference is a natural direction for further work in modern anomaly detection. Our contribution is complementary: the methods analyzed in our study clarify how such hybrid techniques behave specifically in the one-dimensional setting, which has received comparatively little systematic attention, under a simulation environment in which existing clustering–IF or clustering–EIF hybrids have not been extensively evaluated.

In addition, it is worth noting that the already mentioned paper by Liu et al. (2008) has inspired researchers to develop a family of hybrid methods in which isolation-based scores are further processed by machine learning tools, such as clustering or classification approaches, which has resulted in a significant improvement of detection accuracy. One such tool is the One-Class Support Vector Machine (OC-SVM) method. It is an unsupervised (or semi-supervised) machine learning concept. Its core idea is that the OC-SVM algorithm tries to fit (learn) a boundary (hyperplane) around the typical (normal) observations (instances) in the feature space, which is achieved by specifying a kernel function and setting the parameter $\nu$ (this parameter is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors, which makes it possible to fine-tune the model’s sensitivity to outliers). Although the OC-SVM approach is sensitive to the choice of kernel function, the diversity of possible kernel options enables modeling nonlinear boundaries. In the testing stage of this method, instances falling outside the learned boundary are treated as potential outliers. The framework of OC-SVM was introduced in Schölkopf et al. (2000) and later, in an extended version, in Schölkopf et al. (2001). For a survey of general clustering–IF hybrid algorithms, we refer to Zhao et al. (2019).

A large number of anomaly detection methods rely on conformal inference (sometimes known as conformal prediction) and related statistical learning methods. The work by Vovk et al. (2005) is fundamental for understanding this concept, as it presents the theoretical background developed jointly by its authors. It contains the mathematical and algorithmic theory of conformal predictors, including definitions of nonconformity scores and calibration sets, and explanations of how to produce prediction sets. The cited book also discusses hybrid ways of combining conformal predictors with classical machine-learning methods (such as IF, SVM, k-nearest neighbours (kNN), etc.). The already mentioned paper by Bates et al. (2023) also uses conformal analysis for the task of outlier detection. More precisely, Bates et al. (2023) combine conformal inference (for constructing the conformal p-values) with multiple hypothesis testing (for controlling the False Discovery Rate (FDR)). In other words, it is a hybrid design in the sense that conformal p-values are defined and the Benjamini-Hochberg (BH) procedure (see Benjamini & Hochberg, 1995) is applied to them in order to control the FDR when performing multiple hypothesis tests. It is worth noting that hybrid approaches integrating conformal inference with deep learning, isolation forests or kernel methods have attracted growing interest in recent studies, and not only with regard to outlier identification procedures.

In parallel, present-day research on anomaly detection has increasingly focused on combining classical machine learning with deep architectures and graph-based methods. Deep anomaly detection approaches use autoencoders, variational inference or adversarial training to learn compact representations of normality and identify deviations (see Chalapathy & Chawla, 2019). In turn, graph-based methods exploit relational information to capture anomalies in structured data, such as social networks (see Ding et al., 2019). For surveys of deep-learning anomaly detection methods, we refer to the mentioned paper by Chalapathy and Chawla (2019) and to the work by Li et al. (2023).

With regard to earlier work on partitioning observations in heterogeneous data sets, and in the context of our present research objective, the papers of Belisle (1992) and Vecchi and Kirkpatrick (1983) are also worth mentioning. In the work of Vecchi and Kirkpatrick (1983), an algorithm suitable for separating data or fitting mixture models is introduced, while Belisle (1992) extends the ideas proposed in the previous article. Together, these papers justify simulated annealing as a stochastic optimization method for globally consistent data set splitting and mixture distribution estimation, as they estimate Gaussian mixture models by the EM algorithm and their non-parametric extensions, which provide a flexible way to capture heterogeneity in the data. Further enhancements involving robust estimation and clustering approaches have offered several alternative perspectives on anomaly identification and have prompted the development of more versatile algorithms.

We also use three simple robust approaches in order to either compare analytically or discuss theoretically whether they fail or succeed against the proposed outlier detection methods. Namely, we consider: MAD (Median Absolute Deviation), which has a high breakdown point, IQR (Interquartile Range) with the $1.5 \times IQR$ threshold, and the recently proposed ISOD method. The Median Absolute Deviation (MAD) is a robust statistic that measures the variability (spread) of data. The standard deviation also indicates how a data set is spread, but MAD is much less sensitive than the standard deviation to extremely high or extremely low values (outliers) and to non-normality. Thus, if the data are normal, the standard deviation is usually the best choice for measuring spread, but if the data are not normal, MAD can be applied instead. Formally, for the given data $x = (x_{1}, x_{2}, \dots, x_{n})$, the MAD measure is defined as follows:

\mathrm{MAD} (x) = \operatorname{median}_{1 \leq i \leq n} \left( | x_{i} - \operatorname{median} (x) | \right) .

The above formula is a variation of the mean absolute deviation definition. It is less affected by outliers because outliers have a smaller effect on the median than they do on the mean. The MAD measure is widely used in outlier detection, especially when the data may not come from a normal distribution. It is robust to extreme values, easy to interpret, and works well for heavy-tailed or contaminated distributions. For more details regarding MAD, we refer to Huber and Ronchetti (2009), Leys et al. (2013) and Rousseeuw and Croux (1993).
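The MAD definition can be computed directly from its formula. The following is a minimal sketch using only the Python standard library; the function name `mad` and the toy sample are illustrative assumptions, and no scaling constant or outlier threshold is applied.

```python
from statistics import median, stdev

def mad(x):
    """Median absolute deviation: median of |x_i - median(x)|."""
    m = median(x)
    return median(abs(v - m) for v in x)

# Toy sample with one extreme value.
sample = [1, 2, 3, 4, 100]
print(mad(sample))    # spread measured around the median
print(stdev(sample))  # standard deviation, inflated by the value 100
```

On this sample the single extreme value strongly inflates the standard deviation, while the MAD stays at the scale of the remaining points.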

In our research, we also examine the results obtained by applying the classic and simple outlier detection method based on the Interquartile Range (IQR) with the $1.5 \times IQR$ threshold. The IQR is defined as the difference between the 75th percentile (the third quartile $Q_{3}$) and the 25th percentile (the first quartile $Q_{1}$), and describes the spread of the middle $50\%$ of observations in the data set. It measures statistical dispersion in the data and, similarly to the MAD measure, is less sensitive to outliers than the standard deviation. In turn, the rule based on the $1.5 \times IQR$ threshold is a popular way to identify outliers: a given observation $x$ is treated as an outlier if it lies outside the interval $[Q_{1} - 1.5 \cdot IQR, Q_{3} + 1.5 \cdot IQR]$. This approach is robust to extreme values and non-normal data. It provides a simple rule for detecting outliers and is commonly used together with boxplots in exploratory data analysis. This rule is commonly attributed to Tukey (1977).
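The $1.5 \times IQR$ rule can be sketched as follows. Quartiles can be defined in several ways; the helper `quantile7` below uses linear interpolation between order statistics (the default convention of R's `quantile`), which is an assumption here, as are the function names and the toy data.

```python
def quantile7(values, prob):
    """Sample quantile with linear interpolation between order statistics."""
    xs = sorted(values)
    h = (len(xs) - 1) * prob
    lo = int(h)                      # floor, since h >= 0
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

def iqr_outliers(values, k=1.5):
    """Return the observations outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = quantile7(values, 0.25)
    q3 = quantile7(values, 0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
print(iqr_outliers(x))
```

For this sample $Q_{1} = 3.25$ and $Q_{3} = 7.75$, so the interval is $[-3.5, 14.5]$ and only the value 50 is flagged.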

The recently introduced Interpretable Single-dimension Outlier Detection (ISOD) method (see Huang et al., 2024) is also worth mentioning. This is an unsupervised algorithm that treats each feature independently and is therefore especially useful for interpretable one-dimensional outlier detection. The general idea of ISOD is that, at first, the empirical cumulative distribution function of the data is obtained for each dimension (feature), and then the quantiles and skewness coefficients are computed for these dimensions; these statistics characterize how “typical” or “atypical” a given data point is in the considered dimension. Subsequently, based on the computed quantiles and skewness coefficients, a vector of outlier scores, comprising the corresponding scores for each dimension, is calculated for each observation; higher scores indicate a stronger deviation from the “ordinary” (“normal”) observations. In its main concept, the ISOD method is similar to the earlier proposed Empirical-Cumulative-distribution-based Outlier Detection (ECOD) algorithm (see Li et al., 2022). The difference between them is that ISOD is a highly interpretable, single-dimension outlier detection approach, recommended when we want to analyze each variable (dimension) separately, whereas ECOD is a multivariate outlier detection scheme, suitable when we work with high-dimensional data and need a scalable, parameter-free method that aggregates signals across many features.

We have not included the ISOD method in our numerical experiments and comparisons because of a limitation that can be explained directly, without the need for simulations: if ISOD is a method based on cutting off a fixed proportion of extreme score values, for example $5\%$, then this approach will clearly fail when the true proportion of outliers differs from that cutoff, for instance when it equals $10\%$. With a $5\%$ cutoff and $10\%$ true outliers, at most half of the outliers can be flagged, so at least $50\%$ of them will remain undetected.
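The arithmetic behind this argument can be checked with a short sketch. It applies to any rule that flags a fixed share of observations, not to ISOD specifically; the function name `min_missed_fraction` is a hypothetical helper introduced for illustration.

```python
def min_missed_fraction(cutoff, outlier_rate):
    """Lower bound on the share of true outliers left unflagged.

    A rule that flags a fixed share `cutoff` of all points can detect
    at most cutoff * n of the outlier_rate * n true outliers.
    """
    if outlier_rate <= 0:
        raise ValueError("outlier_rate must be positive")
    return max(0.0, 1.0 - cutoff / outlier_rate)

# 5% cutoff, 10% true outliers: at least half are missed.
print(min_missed_fraction(0.05, 0.10))
```

The bound is attained only when every flagged point is a true outlier; any false positive among the flagged points increases the share of missed outliers further.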

The k-nearest neighbors (k-NN) and Local Outlier Factor (LOF) algorithms are also related to clustering and anomaly detection tasks. K-NN (see, e.g., Cover & Hart, 1967) is a commonly known method that works by looking at the $k$ closest data points (neighbors) to a new observation, based on a selected distance measure, and making a decision from them. It is easy to understand and implement, requires no explicit training phase, and adapts naturally to complex decision boundaries, but it can be computationally expensive and is sensitive to the choice of $k$ and of the distance metric. In turn, the LOF approach (see, e.g., Breunig et al., 2000) is a density-based unsupervised machine learning algorithm that identifies outliers by comparing the density of data points in their local neighborhoods. It measures how isolated a given data point is with respect to its neighborhood: a point is considered an outlier if it has a significantly lower local density than its neighbors. Unlike global methods, LOF is well suited for data sets with regions of varying density, where global distance-based approaches may fail.

To make our reference list relatively complete, we would also like to mention four more papers. The first of them, the paper by Ostrovsky et al. (2012), studies the theoretical foundations of Lloyd-type heuristics for the k-means clustering problem and introduces a separation (clusterability) condition under which variants of Lloyd’s algorithm (such as the k-means method) converge quickly to near-optimal solutions and work very well in practice. The second one, an article by Pollard (1981), establishes strong consistency results for the k-means clustering procedure; namely, it proves that, under suitable regularity and uniqueness assumptions, the empirical k-means cluster centers converge almost surely to the population-optimal centers as the sample size grows. These results provide a rigorous statistical justification for k-means as a clustering method. In turn, the paper by Dasgupta (1999) provides an algorithm for learning Gaussian mixtures that uses random projection techniques and imposes mild separation conditions. This algorithm, while remaining computationally efficient, comes with strong theoretical guarantees and is effective even in high dimensions and for mixtures with arbitrary covariance structures. Another important theoretical contribution regarding the relationship between clustering and mixture models is provided by Chaudhuri et al. (2009), who analyzed the ability of the k-means algorithm to learn mixtures of Gaussian distributions under suitable separation conditions. Their results demonstrate that k-means can successfully recover the structure of Gaussian mixtures and provide additional theoretical motivation for combining clustering-based approaches with anomaly detection methods.

In our work, we present seven new outlier detection methods based on a two-dimensional representation of the sample, composed of the original observation values and the scores computed by using the Isolation Forest (IF) or Extended Isolation Forest (EIF) algorithms with 100 trees. We name these methods M.I-M.III, M.III+BH (M.IIIb) and M.V-M.VII. These seven approaches are later compared with four classic anomaly detection procedures, which are denoted as M.IV, M.VIII, M.IX and M.X in our paper. All of our numerical experiments have been performed using the R environment (see R Core Team, 2021). In particular, its libraries distr, isotree, mixtools and e1071 have been used. Detailed information on the introduced methods and the motivation for their use are given in Section 2.

The remainder of this paper is organized as follows. In Section 2, we describe the proposed methods of outlier detection and motivate their use. In Section 3, we present our numerical experiments; in particular, we describe the six scenarios for which our computations are carried out and report the results of our simulations for the selected scenarios and proposed methods, which allows us to evaluate and discuss the introduced concepts in Section 4. In Section 5, we test the introduced methods on the benchmark real-world data sets Wine, Thyroid_Disease_Dataset, Heart_Disease_Dataset and Credit_Card_Fraud_Detection, from the UCI Machine Learning Repository and the Kaggle data science community platform, while Section 6 summarizes our research. Because our simulation study encompasses quite a large number of cases, all of the tables and figures concerning the obtained empirical results, except for one case, are collected in the Supplement.¹

2. Outlier Detection Methods and Motivations

In this section, we describe outlier detection methods, which we aim to apply, study and validate in our work.

2.1. Basic Setup

In our work, we propose seven methods of outlier detection for one-dimensional features. Our approach relies on the construction of a two-dimensional representation of each observation, consisting of the raw data point and its anomaly score derived from the Isolation Forest (IF) or Extended Isolation Forest (EIF) algorithm, and on the subsequent application of machine learning techniques and statistical procedures to the obtained representation. More specifically, we introduce the following procedures: the concept based on combining the IF or EIF scores with the 2-means algorithm, the framework consisting in building conformal prediction intervals, the design using conformal p-values (with and without multiple testing correction), the setting based on the OC-SVM approach, and the model of logistic regression for label refinement. We compare these procedures with classical mixture-based methods estimated via the EM algorithm, both in Gaussian and non-parametric mixture settings, as well as with simple methods based on: (a) the MAD measure, (b) the Interquartile Range (IQR) with the $1.5 \times IQR$ threshold. The effectiveness of the introduced methods is evaluated through extensive simulation experiments under a variety of distributional scenarios, including Gaussian mixtures, heavy-tailed distributions, and exponential contamination.

In our study, we adopt the definition of inliers and outliers from Bates et al. (2023). Namely, assume that we have a sample of one-dimensional observations, which are the realizations of a sequence of independent random variables with a distribution being the mixture of two one-dimensional probability distributions. Suppose that $F_{0}$ denotes a cumulative distribution function (cdf) of the majority class, regarded as the class of inliers, and $F_{1}$ stands for a cdf of the minority class, referred to as the class of outliers. We assume in our investigations that the observations are drawn from the following mixture of probability cdfs:

(1 - p) F_{0} + p F_{1},

where

0 < p < 1 / 2.

This is a special case of the generally accepted definition mentioned earlier (see Bates et al., 2023), where an observation $x$ either comes from a certain cdf $F_{0}$, the cdf of the class of inliers, or does not come from $F_{0}$, in which case it belongs to the class of outliers.
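Sampling from the mixture $(1 - p) F_{0} + p F_{1}$ can be sketched as follows. This is an illustrative Python sketch rather than the R code used in our experiments; the function name `sample_mixture`, the choice of $N(0, 1)$ for the inliers and $N(6, 1)$ for the outliers, and $p = 0.1$ are assumptions made only for the example.

```python
import random

def sample_mixture(n, p, draw_inlier, draw_outlier, seed=0):
    """Draw n observations from (1 - p) * F0 + p * F1.

    Returns the observations and the true outlier indicators.
    """
    rng = random.Random(seed)
    xs, is_outlier = [], []
    for _ in range(n):
        out = rng.random() < p  # outlier component with probability p
        xs.append(draw_outlier(rng) if out else draw_inlier(rng))
        is_outlier.append(out)
    return xs, is_outlier

# Hypothetical components: F0 = N(0, 1), F1 = N(6, 1), p = 0.1.
xs, is_outlier = sample_mixture(
    1000, 0.1,
    draw_inlier=lambda r: r.gauss(0.0, 1.0),
    draw_outlier=lambda r: r.gauss(6.0, 1.0),
)
print(sum(is_outlier) / len(xs))  # empirical share of outliers, close to p
```

The true indicators are returned only so that a detection method can be evaluated afterwards; the methods themselves see only the observations.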

Originally, in conformal inference we consider an outlier detection problem where the input data set $D = \{x_{i}\}_{i = 1}^{2 n}$, drawn from an unknown cdf $F_{X}$, is observed. The principal objective is to test which of the observations from an added (new) test set of points $D^{test} = \{x_{2 n + i}\}_{i = 1}^{n_{test}}$ are outliers, in the sense that they do not come from the cdf $F_{X}$ (otherwise, the observations from $F_{X}$ are treated as inliers). This is achieved by training the selected score function $\hat{s}$ on a subset $D^{train} = \{x_{1}, x_{2}, \dots, x_{n}\}$ of the original data set $D$ and by further evaluating the scores on the calibration data $D^{cal} = \{x_{n + 1}, x_{n + 2}, \dots, x_{2 n}\}$. Such an approach is introduced by Bates et al. (2023), but in our study we only have one set of observations, without separate calibration and test samples, which is a more realistic situation in data analysis.

2.2. Detailed Description of the Proposed Methods

Below, we present our outlier detection settings in more detail.

2.2.1. Method M.I

The first method, M.I, is based on using the 2-means algorithm for clustering one-dimensional observations, which are first transformed into two-dimensional representation vectors. The first coordinates of these vectors are the original one-dimensional input data, whereas the second coordinates are the scores obtained by applying the Isolation Forest (IF) or Extended Isolation Forest (EIF) algorithm. To the corresponding pairs of values, we apply the $2$-means clustering method in order to separate them into two groups. Then, we treat the smaller cluster as the set of outliers.
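The clustering step of M.I can be illustrated with a minimal sketch of Lloyd's algorithm for $k = 2$ applied to pairs (observation, score). It is not the R routine used in our computations: the deterministic initialization, the absence of any rescaling of the coordinates, the function names and the hand-made scores in the example are all assumptions introduced for illustration.

```python
def two_means(points, n_iter=100):
    """Lloyd's algorithm with k = 2 on 2-D points; returns labels 0/1."""
    # Deterministic start (an assumption): the lexicographically
    # smallest and largest points serve as initial centers.
    centers = [min(points), max(points)]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min((0, 1), key=lambda k: (p[0] - centers[k][0]) ** 2
                                      + (p[1] - centers[k][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return labels

def smaller_cluster_flags(points):
    """Flag the points of the smaller cluster as potential outliers."""
    labels = two_means(points)
    minority = 0 if labels.count(0) < labels.count(1) else 1
    return [lab == minority for lab in labels]

# Pairs (observation, score); the scores are hypothetical stand-ins.
pairs = [(0.0, 0.40), (0.2, 0.41), (-0.1, 0.39), (0.1, 0.40),
         (-0.3, 0.42), (6.0, 0.70), (6.5, 0.72)]
print(smaller_cluster_flags(pairs))
```

In this toy example the two points near 6 form the smaller cluster and are flagged. Because the two coordinates may live on different scales, the result of such a clustering can depend on whether and how they are rescaled.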

2.2.2. Method M.II

M.II is based on the idea of the naive conformal method. Its first step is the same as in the M.I method: by using the $2$-means clustering, we initially divide our data into inliers and outliers. Then, assuming that the inlier set (the majority class) is sufficiently well identified, we construct conformal prediction intervals (conformal sets) for the observations belonging to the minority class as if they were inliers. If the observed score of an observation from the minority class falls outside the corresponding conformal set, the observation is considered to be an outlier. The construction proceeds as follows. First, a 4th-degree polynomial regression of the score on the observation is fitted on the training set, i.e., on the majority cluster obtained after clustering (the class of potential inliers). Then, predictions are made for the scores corresponding to the smaller (minority) cluster of potential outliers. Subsequently, the 90%-quantile of the absolute differences between the predicted and observed scores in the inlier cluster is computed, and the 90%-conformal prediction interval, defined as the predicted score for a point of the smaller cluster $\pm$ this 90%-quantile, is obtained. Finally, the points from the smaller cluster (initially identified as outliers) whose scores fall outside this interval are classified as outliers.

More formally, let $\{x_{1}, \dots, x_{n}\} = D_{in} \cup D_{out}$, i.e., our observations are partitioned into the sets $D_{in}$ and $D_{out}$, where $D_{in}$ denotes the set of inliers and $D_{out}$ stands for the set of outliers. At first, using the $2$-means clustering algorithm as in the M.I method, we estimate the set of inliers ${\hat{D}}_{in}$ and the set of outliers ${\hat{D}}_{out}$. Then, applying the IF or EIF approach, we calculate the scores $s (x_{1}), \dots, s (x_{l})$ for observations $x_{1}, \dots, x_{l} \in {\hat{D}}_{in}$. Subsequently, we predict the scores $\tilde{s} (x_{1}), \dots, \tilde{s} (x_{l})$ for $x_{1}, \dots, x_{l} \in {\hat{D}}_{in}$ by fitting the 4th-degree polynomial regression model as follows:

\tilde{s} (x_{t}) = b_{0} + b_{1} x_{t} + b_{2} x_{t}^{2} + b_{3} x_{t}^{3} + b_{4} x_{t}^{4}, t = 1, \dots, l .

We also predict the scores $\hat{s} (x_{l + 1}), \dots, \hat{s} (x_{n})$ for $x_{l + 1}, \dots, x_{n} \in {\hat{D}}_{out}$ in the following fashion:

\hat{s} (x_{t}) = b_{0} + b_{1} x_{t} + b_{2} x_{t}^{2} + b_{3} x_{t}^{3} + b_{4} x_{t}^{4}, t = l + 1, \dots, n,

where $b_{0}, \dots, b_{4}$ are the estimated coefficients obtained by fitting the 4th-degree polynomial to the observations from ${\hat{D}}_{in}$.

The naive conformal set for $\hat{s} (x_{t})$ is defined by:

$$C_{\hat{s}(x_{t})} = (\hat{s}(x_{t}) - q_{0.9}, \; \hat{s}(x_{t}) + q_{0.9}), \quad t = l+1, \dots, n,$$

where

$$q_{0.9} = \mathrm{quantile}_{0.9} \{ |s(x_{w}) - \tilde{s}(x_{w})| : w \in \{1, \dots, l\} \}.$$
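
The interval construction of M.II can be illustrated with a minimal Python sketch (the computations in this paper were carried out in R, so the code below is only an illustration). It assumes that the observed scores and the polynomial predictions are already available for both clusters; the function names and the choice of a linearly interpolated empirical quantile (the default definition used by R) are assumptions made here for illustration, not details taken from the paper.

```python
import math


def quantile_type7(values, prob):
    """Empirical quantile with linear interpolation (assumed quantile definition)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * prob
    lower = math.floor(pos)
    upper = math.ceil(pos)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def naive_conformal_outliers(s_in, s_tilde_in, s_out, s_hat_out, level=0.9):
    """Return (q, flags) for the minority cluster.

    s_in, s_tilde_in : observed and predicted scores in the inlier cluster
    s_out, s_hat_out : observed and predicted scores in the minority cluster
    flags[t] is True when the observed score lies outside
    (s_hat - q, s_hat + q), i.e. the point stays classified as an outlier.
    """
    residuals = [abs(s - p) for s, p in zip(s_in, s_tilde_in)]
    q = quantile_type7(residuals, level)
    flags = [not (p - q < s < p + q) for s, p in zip(s_out, s_hat_out)]
    return q, flags


# Hypothetical scores, used only to exercise the functions.
q, flags = naive_conformal_outliers(
    s_in=[0.40, 0.42, 0.45, 0.41, 0.43],
    s_tilde_in=[0.41, 0.41, 0.44, 0.43, 0.43],
    s_out=[0.70, 0.45],
    s_hat_out=[0.46, 0.44],
)
```

In this toy input, the first minority-cluster point has an observed score far from its prediction and remains flagged, while the second one falls inside its interval and is treated as an inlier.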

2.2.3 Method M.III

M.III is based on the Bates-like conformal p-value method, as it applies the idea of conformal p-values introduced in Bates et al. (2023) and in the previous version of that paper (available from https://arxiv.org/abs/2104.08279). As in the methods M.I-M.II, we identify two clusters in the first step. Then, the majority cluster is treated as a calibration sample, and each observation in the minority cluster is tested in order to determine whether it comes from the same distribution as the calibration set (in which case it is identified as an inlier) or not (in which case it is treated as an outlier). No multiple-testing correction is applied, and the significance level is set to $\alpha = 0.1$. For a test point $x$, the marginal conformal p-value $\hat{u}_{marg}(x)$ is defined by the following formula (see (3) in Bates et al., 2023):

$$\hat{u}_{marg}(x) = \frac{1 + |\{i \in D^{cal} : \hat{s}(x_{i}) \leq \hat{s}(x)\}|}{n + 1},$$

where:

$D^{cal}$ – calibration data set of size $n$ ,

$\hat{s} (\cdot)$ – score function (conformity/anomaly score) trained on the training data $D_{train}$ ,

$|\{i \in D^{cal} : \hat{s}(x_{i}) \leq \hat{s}(x)\}|$ – the number of calibration points whose scores $\hat{s}(\cdot)$ are not larger than the score of $x$,

$n + 1$ – normalization factor (assuming that $n = |D^{cal}|$), ensuring that p-values lie in the set $\{1/(n+1), \dots, 1\}$,

$1 +$ in the numerator – guarantees that even the most extreme test point reaches a strictly positive p-value of at least $1 / (n + 1)$ .
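
The formula above translates directly into a few lines of code. The following Python sketch is only an illustration with hypothetical inputs; the function name is an assumption. Note that, as written, the formula treats small scores as evidence against the inlier hypothesis, so a score that grows with the degree of outlyingness would presumably have to be negated before being passed to such a function.

```python
def marginal_conformal_pvalue(cal_scores, test_score):
    """Marginal conformal p-value of a test point.

    cal_scores : scores of the calibration points (majority cluster)
    test_score : score of the tested point (from the minority cluster)
    """
    n = len(cal_scores)
    not_larger = sum(1 for s in cal_scores if s <= test_score)
    return (1 + not_larger) / (n + 1)


# Hypothetical calibration scores 1/20, 2/20, ..., 19/20.
calibration = [i / 20 for i in range(1, 20)]
p_extreme = marginal_conformal_pvalue(calibration, -1.0)  # below all scores
p_typical = marginal_conformal_pvalue(calibration, 0.5)   # in the middle
is_outlier = p_extreme <= 0.1  # decision at significance level alpha = 0.1
```

Since the smallest attainable p-value equals $1/(n+1)$, a rejection at $\alpha = 0.1$ is possible only when the calibration set contains at least 9 points.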

2.2.4 Method M.III+BH (or M.IIIb)

In this method, we proceed as in M.III; the only difference is that in M.III+BH we apply the Benjamini-Hochberg (BH) correction for multiple testing at level $\alpha = 0.1$ (see Benjamini & Hochberg, 1995).
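
For reference, the standard BH step-up rule can be sketched in Python as follows. This is a generic illustration of the procedure from Benjamini & Hochberg (1995) applied to hypothetical p-values, not the code used in our study; the function name is an assumption.

```python
def benjamini_hochberg(pvalues, alpha=0.1):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Step-up rule: find the largest rank k such that the k-th smallest
    p-value is <= alpha * k / m, and reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            largest_rank = rank
    rejected = [False] * m
    for idx in order[:largest_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical conformal p-values of five minority-cluster points.
decisions = benjamini_hochberg([0.01, 0.04, 0.03, 0.5, 0.09], alpha=0.1)
```

In this toy input, the point with p-value 0.09 would be declared an outlier without any correction at $\alpha = 0.1$, but it is not rejected by the BH rule.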

2.2.5 Method M.V

The One-Class Support Vector Machine (OC-SVM) algorithm with a radial kernel is applied to the two-dimensional representation of the considered one-dimensional data (recall that this representation consists of the original observations and their corresponding scores).

2.2.6 Method M.VI

Method M.VI is similar to M.V. The only difference is that - while using M.VI - after an application of the 2-means clustering, the OC-SVM is fitted to the obtained two-dimensional representation of observations from the minority class. Thus, the OC-SVM is used to extract outliers from the group of instances initially classified as potential outliers by the 2-means algorithm.

2.2.7 Method M.VII

M.VII is a label-cleaning approach that applies the logistic regression model. It is a two-step outlier detection method that can be treated as a classification problem with uncertain labels. First, the clustering procedure assigns uncertain labels. Second, a logistic regression model is fitted to these uncertain labels; consequently, the probability of being an outlier is estimated for each observation, and the instances with estimated probabilities exceeding $0.5$ are marked as outliers.

2.2.8 Classic Methods for Comparison Study (M.IV, M.VIII, M.IX, M.X)

The proposed methods described above are compared not only with each other but, more importantly, also with the approach M.IV, which consists in fitting a two-component Gaussian mixture model by means of the EM (Expectation-Maximization) algorithm, with the procedure M.VIII, where the EM algorithm is used to fit a two-component non-parametric mixture model, as well as with two simple methods based on the MAD (see M.IX) and IQR measures (see M.X). For the corresponding comparison study, we applied the mixtools library (in order to use the 'normalmixEM' function for Gaussian mixture models and the 'npEM' function for non-parametric mixtures; see Benaglia et al., 2009) and the e1071 library. Both libraries are packages for the free software environment R (R Core Team, 2021).
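
The two simple baselines can be illustrated by the following Python sketch. The exact thresholds used for M.IX and M.X are not stated in this section, so the cutoff of 3 scaled MADs (with the usual consistency constant 1.4826) and the 1.5·IQR fences below are common textbook choices assumed here purely for illustration; the function names are likewise hypothetical.

```python
import statistics


def mad_outliers(x, cutoff=3.0):
    """Flag points farther than `cutoff` scaled MADs from the median (assumed rule)."""
    med = statistics.median(x)
    mad = 1.4826 * statistics.median([abs(v - med) for v in x])
    return [abs(v - med) > cutoff * mad for v in x]


def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    return [v < q1 - k * iqr or v > q3 + k * iqr for v in x]


# Hypothetical sample: six values near 10 and one value near 2.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 2.0]
flags_mad = mad_outliers(sample)
flags_iqr = iqr_outliers(sample)
```

Both rules flag only the last observation in this toy sample. The sketch does not handle the degenerate case in which the MAD or the IQR equals zero.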

2.3. Motivations

The proposed methods are based on a two-dimensional anomaly map. An observation that is an outlier in one dimension will typically also be an outlier according to the IF and EIF scores, but not always, since EIF is a randomized method and often produces asymmetric decision boundaries. Our approach aims to combine geometric information (the value of given observation) with model-based information (the corresponding anomaly score). In this way, we integrate signals from two complementary sources. The proposed methods rely on a natural application of the 2-means clustering to improve the IF and EIF-based, conformal prediction–based, and one-class SVM–based anomaly detection methods. Among these, only conformal prediction has a theoretical justification; however, not in the context considered here, since it is applied after clustering. Depending on whether the groups of outliers and inliers are sufficiently well separable, the proposed clustering-based methods may perform well and, as the sample size increases, may exhibit an improved ability to recover the two classes. Due to the lack of a strict theoretical characterization of IF and EIF and their theoretical properties, deriving rigorous theoretical results is difficult, if not impossible. Therefore, our evaluation is primarily based on computer simulations. Intuitively, if the anomaly scores produced by IF or EIF, conformal prediction, or OC-SVM provide better separation between outliers and inliers, improved performance can be expected, which is also confirmed by our simulation results. Overlap between outlier and inlier populations in the feature space is reflected in overlap between these groups in the two-dimensional representation defined by the observation and its anomaly score. The proposed methods can improve detection performance when the scores exhibit some separation ability between outliers and inliers, even if perfect separability is not achieved. 
A limitation of the conducted research is that it lacks a rigorous theoretical analysis, which is due to the inherent nature of the considered methods.

2.4. Application of the Wilcoxon Test and Heatmaps for Paired Comparisons

In order to compare the $11$ considered methods, we conducted the Wilcoxon signed-rank test with the Holm correction. In this setting, the non-parametric Wilcoxon signed-rank test is applied to each pair of methods, and the obtained p-values are then adjusted by the Holm method in order to control for multiple comparisons. The statistical significance of the pairwise differences is thus assessed on the basis of the corrected p-values instead of the raw ones. This procedure is used when multiple Wilcoxon signed-rank tests need to be performed and when, by controlling the family-wise error rate, we want to avoid spurious inferences regarding significance while maintaining greater statistical power. Simultaneously, we created the corresponding heatmaps that show pairwise statistical comparisons between the considered methods. These heatmaps are included in Supplement². The idea of Wilcoxon's test originates from the paper by Wilcoxon (1945).
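
The Holm adjustment itself is a simple step-down computation, which can be sketched in Python as follows. The raw p-values below are hypothetical and the function name is an assumption; the Wilcoxon signed-rank tests that would produce the raw p-values are not reproduced here.

```python
def holm_adjust(pvalues):
    """Return Holm-adjusted p-values in the original order.

    The i-th smallest raw p-value (i = 0, 1, ...) is multiplied by (m - i);
    the results are capped at 1 and made non-decreasing along the ordering.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        value = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, value)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from three pairwise comparisons.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

A pairwise difference is then declared significant when its adjusted p-value does not exceed the chosen significance level.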

3. Simulation Study

3.1. Mixture Scenarios

We apply the presented outlier detection methods to the following six mixture distributions, labeled as A1–A3 and B1–B3 (here, $N(\cdot, \cdot)$ denotes the normal distribution, $Exp(\cdot)$ stands for the exponential distribution, and $t(\cdot)$ denotes Student's t-distribution):

A1: inliers from $N(10, 1)$, outliers from $N(2, 1)$;

A2: inliers from $N(0, 1)$, outliers from $N(5, 1)$;

A3: inliers from $N(0, 1)$, outliers from $N(1, 1)$;

B1: inliers from $N(10, 1)$, outliers from $Exp(1/5)$;

B2: inliers from $N(0, 1)$, outliers from $t(3)$;

B3: inliers from $Exp(1)$, outliers from $t(3)$.

A1–A2 are easy-to-separate settings with regard to the outlier detection problem, whereas the A3 and B1–B3 scenarios are considerably more complicated: they encompass moderately difficult to difficult-to-separate cases, since the distributions of inliers and outliers overlap significantly there.

3.2. Simulation Design and Empirical Results

Each scenario is considered in two variants - with outlier proportions $p = 0.05$ or $p = 0.1$ . The simulated sample sizes are $n = 100$ or $n = 1000$ . Every simulation is replicated $1000$ times.
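
To make the design concrete, one replication for scenario A1 could be generated along the lines of the following Python sketch (the actual simulations were run in R). The sketch assumes that each observation is independently an outlier with probability $p$, so that the number of outliers varies between replications; the function name and the seed are hypothetical.

```python
import random


def simulate_mixture_a1(n, p, rng):
    """Draw n labeled observations from the A1 mixture.

    Each observation is an outlier (label 1) with probability p and is then
    drawn from N(2, 1); otherwise it is an inlier (label 0) from N(10, 1).
    """
    sample = []
    for _ in range(n):
        if rng.random() < p:
            sample.append((rng.gauss(2.0, 1.0), 1))
        else:
            sample.append((rng.gauss(10.0, 1.0), 0))
    return sample


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
replication = simulate_mixture_a1(n=100, p=0.05, rng=rng)
number_of_outliers = sum(label for _, label in replication)
```

Repeating such a draw $1000$ times and applying each method to every replication yields the averages and standard deviations of the kind reported below.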

In Table 1 and Figure 1, contained in the current section, as well as in Tables 1–48 and Figures 1–24, placed in Supplement, we present, for each method and all established scenarios, the values of the following quantities, averaged over all replications (the corresponding standard deviations are given in parentheses):

TP (True Positives) – the number of correctly detected outliers;

FP (False Positives) – the number of normal observations incorrectly identified as outliers;

FN (False Negatives) – the number of outliers that were not detected;

TN (True Negatives) – the number of correctly classified inliers,

and the following classification quality measures are computed:

Accuracy – overall classification accuracy (percentage of correct classifications);

Precision – proportion of predicted outliers that are true outliers;

Recall (Sensitivity) – proportion of all outliers that are correctly detected;

F1 (F1-score) – harmonic mean of Precision and Recall, representing a balanced trade-off.

For the corresponding definitions of the classification measures, we refer to Hastie et al. (2009).
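
The four quality measures listed above follow directly from the confusion-matrix counts of a single replication, as the following Python sketch shows. The counts in the example are hypothetical, and returning 0 when a denominator vanishes is a convention assumed here, not one stated in the paper.

```python
def classification_measures(tp, fp, fn, tn):
    """Compute Accuracy, Precision, Recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical single replication with n = 100 observations.
measures = classification_measures(tp=5, fp=3, fn=0, tn=92)
```

Because the measures are nonlinear in the counts, averaging them over replications generally does not give the same values as computing them from the averaged counts.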

Figure 1.

Adjusted p-values for Wilcoxon-Holm for A1, $n = 100$ , $p = 0.05$ .

Table 1.

Average Values of TP/FP/FN/TN and Classification Quality Measures for the A1 Scenario ( $n = 100$ , 5% of Outliers, 1000 Replications).

	Confusion Matrix				Classification Measures
Method	TP	FP	FN	TN	Accuracy	Precision	Recall	F1
M.I	5.06 (2.13)	3.11 (5.32)	0.00 (0.00)	91.83 (4.65)	0.97 (0.05)	0.73 (0.29)	1.00 (0.00)	0.81 (0.25)
M.II	4.97 (2.18)	12.81 (4.86)	0.14 (0.88)	82.08 (4.43)	0.87 (0.05)	0.29 (0.12)	0.98 (0.13)	0.44 (0.15)
M.III	5.05 (2.11)	10.67 (4.90)	0.03 (0.29)	84.26 (4.22)	0.89 (0.05)	0.34 (0.14)	1.00 (0.03)	0.49 (0.16)
M.III+BH	2.25 (1.24)	16.50 (5.97)	0.00 (0.00)	81.25 (6.62)	0.84 (0.06)	0.12 (0.06)	1.00 (0.00)	0.21 (0.09)
M.IV	5.13 (2.10)	1.60 (12.18)	0.00 (0.00)	93.27 (11.83)	0.98 (0.12)	0.97 (0.14)	1.00 (0.00)	0.98 (0.13)
M.V	2.69 (0.98)	7.10 (1.41)	2.39 (1.92)	87.82 (2.16)	0.91 (0.02)	0.28 (0.10)	0.60 (0.25)	0.36 (0.11)
M.VI	4.93 (2.15)	13.59 (6.68)	0.14 (0.75)	81.34 (6.13)	0.86 (0.07)	0.29 (0.13)	0.98 (0.11)	0.43 (0.16)
M.VII	5.06 (2.13)	3.11 (5.32)	0.00 (0.00)	91.83 (4.65)	0.97 (0.05)	0.73 (0.29)	1.00 (0.00)	0.81 (0.25)
M.VIII	5.07 (2.14)	1.98 (13.79)	0.00 (0.00)	92.96 (13.51)	0.98 (0.14)	0.98 (0.14)	1.00 (0.00)	0.98 (0.14)
M.IX (MAD)	5.07 (2.14)	3.34 (2.35)	0.00 (0.00)	91.59 (2.73)	0.97 (0.02)	0.62 (0.21)	1.00 (0.00)	0.74 (0.17)
M.X (IQR)	5.07 (2.14)	0.59 (0.85)	0.00 (0.00)	94.34 (2.21)	0.99 (0.01)	0.90 (0.14)	1.00 (0.00)	0.94 (0.09)

All of our simulations and analyses have been conducted in the R environment (see R Core Team, 2021); the assumed number of trees equals $100$. For the methods in which either the IF or the EIF algorithm is used, the computations have been carried out for both algorithms, but the results are reported only for the EIF variant, since the results practically do not differ between the two approaches. Because our simulation study encompasses quite a large number of cases, all of the tables and figures with the obtained empirical results are collected in Supplement, except for one case: to illustrate how our results are presented, Table 1 and Figure 1 show them for scenario A1 with $n = 100$ and $p = 0.05$.

4. Analysis of Empirical Results

Before presenting the conclusions, we clarify that in the heatmaps included in the supplementary material, the symbols B.I–B.X correspond to the considered methods M.I–M.X.

4.1. Comparison of the IF- and EIF-Based Methods

When comparing the methods based on IF and EIF, only minor differences were observed between the M.I and M.II approaches for a sample size of $n = 100$. For the larger sample size $n = 1000$, no differences were discovered. For that reason, the tables and figures in Supplement that depict our empirical results concern the case when the EIF algorithm is used in the methods requiring its application.

4.2. Conclusions for Scenario A1 (Very Good Class Separation)

For $p = 0.05$ and $n = 100$ , all of the methods except for M.III+BH and M.V correctly detected the true outliers ( $TP \approx 5$ ). The largest numbers of false positives were produced by the M.III+BH, M.VI and M.II methods. A substantial number of false negatives was observed for the M.V method, while the lowest numbers of true negatives were recorded for the M.III+BH and M.VI approaches.

The highest Accuracy was achieved by the M.X method; however, this method exhibited lower Precision than M.VIII and M.IV. In turn, the M.IV and M.VIII methods achieved the best F1-scores.

Increasing the sample size to $n = 1000$ confirmed the previously observed relationships. The M.I method dominated most of the competing approaches. Although its Accuracy was slightly lower than that of the M.IX and M.X methods, it achieved higher Precision and F1-scores than M.IX. In addition, M.IV and M.VIII performed best overall and were the most stable.

For $p = 0.1$ and $n = 100$ , the weakest performances were observed for the M.II and M.V methods, while the best results were obtained by the M.IV, M.VII and M.X approaches. The M.I method also achieved very good values of Accuracy, Precision, Recall and F1-score. For $n = 1000$ , the best performances were obtained by the M.IV and M.X methods, with slightly weaker results for M.VII and M.I.

4.3. Conclusions for Scenario A2 (Good Class Separation)

For $p = 0.05$ and $n = 100$ , the best methods in terms of Accuracy, Precision, Recall and F1-score were M.IV and M.V. Slightly weaker results were obtained by M.IX and M.I, while the weakest performance was observed for M.III+BH.

For $p = 0.05$ and $n = 1000$ , the M.IV and M.X methods dominated. Method M.IX exhibited low Precision, whereas the M.I method achieved relatively good results.

For $p = 0.1$ and $n = 100$ , the M.IV, M.IX, M.VII and M.X methods dominated, while the M.I method performed slightly worse. For the larger sample size $n = 1000$ , a nearly identical tendency was observed.

4.4. Conclusions for Scenario A3 (Gaussian Distributions With Strong Overlap)

In scenario A3, where Gaussian distributions strongly overlap, all methods performed worse than in scenarios A1 and A2.

For $p = 0.05$ and $n = 100$ , the highest Accuracy and Precision were achieved by the M.X and M.IX methods. The highest Recall values were obtained by M.VIII and M.IV, while the best F1-scores were achieved by M.X and M.IX.

For $p = 0.05$ and $n = 1000$ , the M.X method reached the highest Accuracy and Precision, but it exhibited low Recall and F1-score. The highest Recall values were observed for the M.VIII and M.IV methods, whereas the best F1-scores were achieved by M.I and M.III+BH.

For $p = 0.1$ and $n = 100$ , the M.X method again achieved the highest Accuracy and Precision, but it suffered from very low Recall. The highest Recall values were obtained by M.VIII and M.IV, while the M.I method achieved the highest F1-score.

For $p = 0.1$ and $n = 1000$ , the highest Accuracy and Precision were obtained by the M.X and M.IX approaches. However, these methods again exhibited low Recall and F1-score. The highest values of Recall were achieved by M.IV, while the best F1-scores were obtained for M.I and M.VII.

4.5. General Comment on the A1-A3 Scenarios

The simulation results for the Gaussian models considered in the A1–A3 scenarios indicate that, as expected, method M.IV (EM for Gaussian mixtures) is the best-performing method overall.

4.6. Conclusions for Scenario B1 (Exponential Outliers, Normal Inliers)

For $p = 0.05$ and $n = 100$ , the highest Accuracy values were attained by the M.X and M.IV methods, the highest Precision values by M.VIII and M.IV, and the highest Recall values by M.III and M.VI. The best F1-scores were achieved by the M.IV and M.X approaches.

For $p = 0.05$ and $n = 1000$ , the highest values of Accuracy were achieved by M.X, M.IV and M.VIII, the highest values of Precision were observed for M.VIII and M.VI, while the highest Recall values were achieved by M.III and M.VI. The best F1-scores were recorded for M.IV and M.X.

For $p = 0.1$ and $n = 100$ , the highest Accuracy values were achieved by M.X, M.IV and M.VIII, while the highest Precision values were obtained by M.VIII, M.X and M.VI. In addition, the highest values of Recall were observed for M.III+BH and M.III, whereas the best F1-scores were again recorded for M.IV and M.X.

For $p = 0.1$ and $n = 1000$ , the highest Accuracy and Precision were achieved by the M.X and M.IV methods and the highest Recall values were observed for M.III and M.VI, while the best F1-scores were achieved by M.IV and M.X.

4.7. Conclusions for Scenario B2 (Student’s $t$ Outliers, Normal Inliers)

For $p = 0.05$ and $n = 100$ , the highest Accuracy, Precision and F1-score were achieved by the M.X and M.IX methods, while the highest values of Recall were obtained by M.VIII and M.IV.

For $p = 0.05$ and $n = 1000$ , the highest Accuracy values were achieved by M.X and M.IX, the highest Precision and Recall values by M.VIII and M.IV, and the best F1-scores by M.IX, M.I, M.III+BH and M.VII. In this case, Precision and F1-scores were generally low for all of the methods.

For $p = 0.1$ and $n = 100$ , the highest Accuracy and Precision were achieved by M.X and M.IX, while the highest Recall values were again obtained by M.VIII and M.IV. In turn, the best F1-scores were achieved by M.X and M.IX.

For $p = 0.1$ and $n = 1000$ , the highest values of Accuracy were achieved by M.X and M.IX, the highest Recall values were observed for M.III and M.VI, and the best F1-scores were achieved by the M.III+BH, M.I, M.III and M.VII methods.

4.8. Conclusions for Scenario B3 (Student’s $t$ Outliers, Exponential Inliers)

For $p = 0.05$ and $n = 100$ , the highest Accuracy and Precision were achieved by the M.X and M.V methods. The highest Recall values were obtained by M.VIII and M.VI, while the best F1-scores were achieved by the M.V, M.III+BH, M.I and M.VII approaches.

For $p = 0.05$ and $n = 1000$ , the highest Accuracy values were achieved by M.X and M.V, the highest Precision by M.VIII and M.V, and the highest Recall by M.III and M.VI. The best F1-scores were attained for M.V, M.I, and M.VII.

For $p = 0.1$ and $n = 100$ , the highest Accuracy values were achieved by M.X and M.V, while the highest Precision, Recall, and F1-scores were obtained by M.X and M.IX.

For $p = 0.1$ and $n = 1000$ , the highest Accuracy values were achieved by M.X and M.IX. The highest Precision values were recorded for M.VIII and M.IV, while the highest Recall values were achieved by M.III and M.VI. The best F1-scores were obtained by M.III+BH, M.I, M.III, and M.VII.

4.9. Conclusions Regarding the Impact of Contamination Rate and Separation’s Degree

Based on the conclusions above, clear differences can be observed in the behavior of individual outlier detection methods depending on both the proportion of outliers and the degree of separation between the inlier and outlier distributions. They can be summarized as follows.

Case $p = 0.05$ .

For a low level of data contamination (5% outliers), the best performance for easily separable scenarios (A1, A2) is achieved by the M.IV (GMM) and M.X (IQR) methods. These approaches are characterized by nearly perfect Accuracy, very high Precision, and full or almost full recovery of outliers ( $Recall \approx 1$ ), while generating a minimal number of false positives.

The M.I–M.VII methods, except for M.V, exhibit very high sensitivity but at the cost of increased numbers of false positives, resulting in moderate Precision values and suboptimal F1-scores.

Classical robust methods, namely MAD (M.IX) and IQR (M.X), perform very well in simple scenarios. However, in more difficult cases they lose the ability to detect a portion of the outliers. In the most challenging scenarios (A3, B1–B3), all methods experience a substantial decrease in performance, particularly in terms of Precision and F1-score.

Case $p = 0.1$ .

Increasing the proportion of outliers to 10% leads to a clear deterioration in the performance of most methods due to a substantial increase in false positives. The mixture-based method M.IV remains relatively the most stable approach, although a decrease in Recall is observed for difficult scenarios, especially when $n = 100$ .

The MAD and IQR methods maintain high Accuracy, but they exhibit a substantial decrease in Recall as $p$ increases, making them conservative approaches that limit false positives at the expense of false negatives. In the most difficult scenarios (B2–B3), clustering-based methods achieve relatively high Recall.

Conclusions from the heatmaps’ analysis.

As the sample size increases, a larger number of statistically significant differences emerge between the methods in terms of Accuracy, Precision, Recall, and F1-score, particularly in scenarios A2, A3, B1, B2, and B3. The fewest significant differences between methods are observed for scenario A1.

ROC-based comparison of M.IV and M.VIII

Since the M.IV and M.VIII methods were often among the best-performing approaches, we additionally evaluated their classification performance. Namely, for these methods, the AUC and AUCPR values were computed based on 1000 Monte Carlo replications, with the posterior probabilities used as scores. Such scores are not available for the other methods considered in our paper. The evaluation was conducted for the simulation models A1–A3 and B1–B3 with sample sizes $n = 100$ and $n = 1000$ . The obtained results are presented in Table 2. As expected, the performances of the considered methods deteriorate as the deviations of the normal distributions increase. In turn, for the more challenging scenarios A3 and B2–B3, both methods exhibit poor performances, with AUC values lying only slightly above 0.5 and AUCPR values only marginally exceeding (or close to) 0.05 or 0.09 (these values are marked in bold). In these cases, clustering-based methods tend to outperform methods based on the EM algorithm. For the remaining scenarios, the AUC and AUCPR values are generally close to 1.

Table 2.
Average AUC and AUCPR for $p = 0.05$ (Standard Deviations in Parentheses).

Method	Metric	$n$	A1	A2	A3	B1	B2	B3
M.IV	AUC	100	0.99 (0.07)	1.00 (0.02)	0.71 (0.14)	0.92 (0.14)	0.63 (0.10)	0.66 (0.13)
M.IV	AUCPR	100	0.96 (0.20)	0.96 (0.18)	0.11 (0.12)	0.82 (0.27)	0.14 (0.14)	0.09 (0.08)
M.VIII	AUC	100	1.00 (0.00)	1.00 (0.00)	0.77 (0.12)	0.85 (0.16)	0.63 (0.12)	0.74 (0.16)
M.VIII	AUCPR	100	0.98 (0.14)	0.89 (0.31)	0.15 (0.18)	0.77 (0.25)	0.10 (0.10)	0.06 (0.09)
M.IV	AUC	1000	1.00 (0.00)	1.00 (0.00)	0.71 (0.09)	0.96 (0.03)	0.55 (0.04)	0.64 (0.04)
M.IV	AUCPR	1000	1.00 (0.00)	1.00 (0.00)	0.09 (0.07)	0.90 (0.04)	0.11 (0.05)	0.07 (0.02)
M.VIII	AUC	1000	1.00 (0.00)	1.00 (0.00)	0.76 (0.03)	0.87 (0.05)	0.53 (0.03)	0.74 (0.05)
M.VIII	AUCPR	1000	1.00 (0.00)	1.00 (0.00)	0.13 (0.08)	0.82 (0.05)	0.09 (0.03)	0.05 (0.08)
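
For a single replication, an AUC value of the kind averaged in Table 2 can be obtained from the scores and the true labels via the rank (Mann-Whitney) formulation, as in the following Python sketch. The scores and labels below are hypothetical, the function name is an assumption, and the quadratic-time double loop is chosen for readability only; AUCPR is not covered by this sketch.

```python
def auc_from_scores(scores, labels):
    """AUC as the probability that a random outlier gets a higher score
    than a random inlier (ties count as one half).

    scores : e.g. posterior probabilities of the outlier component
    labels : 1 for true outliers, 0 for true inliers
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical posterior probabilities and true labels.
auc = auc_from_scores([0.9, 0.8, 0.3, 0.2, 0.1], [1, 0, 1, 0, 0])
```

An AUC of 0.5 corresponds to scores that carry no information about the labels, which is the reference point for the weakest entries in Table 2.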

5. Real Data Analysis

The purpose of this section is to evaluate the presented methods using four selected real-world data sets, available from the UCI Machine Learning Repository or from the Kaggle data science community platform. All of them constitute classical benchmark data sets for classification and statistical learning tasks.

The first data set, Wine, is based on the results of a chemical analysis of wines produced in the same region of Italy but originating from three different grape cultivars, denoted as classes 1–3. This data set contains 178 observations, each characterized by 13 continuous features (explanatory variables, also referred to as predictors or regressors), including alcohol content, malic acid, magnesium, total phenols, flavonoids, and color intensity, among others. The target variable, representing the grape variety, is moderately imbalanced, with 59, 71, and 48 observations for classes 1–3, respectively.

The second data set used in our numerical experiments is the Thyroid_Disease_Dataset, which contains 3,771 observations and 26 variables related to thyroid diagnostics. It is a cleaned version of an original medical data set, in which various clinical, demographic, and laboratory features are used to determine thyroid status. Among the 26 variables, the data set includes 25 predictors and one binary target variable indicating either normal thyroid function or thyroid disorder (hypothyroid or hyperthyroid). All features are numeric, taking either integer or floating-point values.

We also applied the Heart_Disease_Dataset in our study. This data set consists of 1,025 observations and 14 variables related to a group of disorders of the heart and blood vessels, referred to as cardiovascular disease. It comprises 13 predictors and one binary target variable. The primary goal of this data set is to predict the presence or absence of heart disease based on clinical and demographic features, where the target variable indicates either the absence or presence of cardiovascular disease.

Finally, we examined the Credit_Card_Fraud_Detection data set, which contains 284,807 observations and 31 variables, including 30 predictors and one binary target variable. For the sake of reducing computational cost, we randomly chose 2,000 instances. This data set is widely used in data analysis and machine learning research, with the main objective of predicting whether a transaction is fraudulent or legitimate. All predictor variables are continuous numeric features, while the target variable is binary, with labels corresponding to legitimate (non-fraudulent) and fraudulent transactions. The data set is highly imbalanced, as fraudulent cases account for approximately 0.17% of all transactions.

We restricted our simulations to analyses performed on reduced versions of the original data sets containing only the most statistically significant continuous explanatory variables (features). For that purpose, each data set was transformed to include only these variables together with the response (target) variable.

The final sets of explanatory variables were selected by jointly considering univariate statistical significance tests and mutual information criteria with respect to the response variable. For each feature, ranks obtained from hypothesis testing and mutual information were aggregated, and the most significant explanatory variables of the continuous type with the best combined ranks were retained for further analysis.
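
A rank-aggregation step of this kind can be sketched in Python as follows. The aggregation rule (sum of the two ranks), the tie handling, the function name, and all numerical values are assumptions made for illustration; the paper states only that the ranks from hypothesis testing and mutual information were aggregated.

```python
def combined_rank_selection(test_pvalues, mutual_info, k):
    """Select the k features with the best (smallest) combined rank.

    test_pvalues : dict feature -> p-value of a univariate test (smaller is better)
    mutual_info  : dict feature -> mutual information with the target (larger is better)
    """
    names = list(test_pvalues)
    by_pvalue = sorted(names, key=lambda f: test_pvalues[f])
    by_mi = sorted(names, key=lambda f: -mutual_info[f])
    combined = {f: (by_pvalue.index(f) + 1) + (by_mi.index(f) + 1) for f in names}
    return sorted(names, key=lambda f: combined[f])[:k]


# Hypothetical p-values and mutual-information values for three features.
selected = combined_rank_selection(
    test_pvalues={"feature_a": 0.001, "feature_b": 0.2, "feature_c": 0.01},
    mutual_info={"feature_a": 0.5, "feature_b": 0.1, "feature_c": 0.6},
    k=2,
)
```

In this toy input, the first and third features obtain the same combined rank and are both retained, while the second feature is dropped.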

Using these combined feature selection principles, the most significant explanatory variables of continuous type were selected for each data set. For the Wine data set, these variables are: flavanoids, OD280/OD315 of diluted wines, hue, color intensity, and total phenols. For the Heart_Disease_Dataset, only one feature was selected, namely maximum heart rate achieved (thalach). The Thyroid_Disease_Dataset is limited to the following features: thyroid-stimulating hormone (TSH), free thyroxine index (FTI), and total thyroxine (TT4). Finally, for the Credit_Card_Fraud_Detection_Dataset, the final feature set consists of the following five most significant explanatory variables: V4, V14, V11, V16, and V12.

It is worth mentioning that, apart from the data considered above, quite a large number of other anomaly detection benchmark data sets are available on the Internet. The work by Han et al. (2022) presents $30$ anomaly-detection algorithms (unsupervised, semi-supervised or supervised) evaluated across $57$ data sets. The paper provides extensive experimental comparisons, statistical analyses, and insights into when different anomaly-detection methods perform well or fail on the corresponding data sets. In particular, it examines how anomaly detection algorithms perform under different levels of supervision, how they behave with different types of anomalies, and how robust they are to noisy or corrupted data. In their conclusions, the authors claim that the algorithm performance is strongly tied to the anomaly type and that no unsupervised method is statistically superior across all data sets. In this context, we also wish to mention the paper by Sánchez Vinces et al. (2025), where 11 clustering-based outlier detection algorithms are evaluated and compared with three classic non-clustering baselines (k-NN Outlier, LOF and IF). The authors used $46$ real and synthetic data sets and concluded that clustering-based approaches, like k-means, should be included as baseline reference methods in further benchmarking studies, as they often have a competitive quality at a relatively low run time and offer several other gains.

In Table 3, we report the numbers and percentages of points identified as outliers by the considered methods M.I-M.X when each method is applied separately to each of the continuous explanatory variables selected from the Wine data set. For the other three data sets, tables with the corresponding results are given in Supplement.

Table 3.
Comparison of Outlier Detection Methods for Individual Features. Values Show the Number of Detected Outliers With Percentages in Parentheses.

Method	Flavanoids	OD280/OD315	Hue	Color Intensity	Total Phenols
M.I (KMeans+IF)	12 (6.74)	43 (24.16)	49 (27.53)	42 (23.60)	26 (14.61)
M.II (Naive Conformal)	7 (3.93)	27 (15.17)	37 (20.79)	41 (23.03)	26 (14.61)
M.III (Bates-like)	12 (6.74)	43 (24.16)	49 (27.53)	42 (23.60)	26 (14.61)
M.IIIb (Bates+BH)	12 (6.74)	43 (24.16)	49 (27.53)	42 (23.60)	26 (14.61)
M.IV (NormalmixEM)	36 (20.22)	54 (30.34)	18 (10.11)	57 (32.02)	69 (38.76)
M.V (OCSVM all)	12 (6.74)	14 (7.87)	24 (13.48)	11 (6.18)	26 (14.61)
M.VI (OCSVM inliers)	24 (13.48)	54 (30.34)	57 (32.02)	58 (32.58)	46 (25.84)
M.VII (LogReg)	12 (6.74)	43 (24.16)	49 (27.53)	42 (23.60)	26 (14.61)
M.VIII (NpEM)	0 (0.00)	0 (0.00)	0 (0.00)	0 (0.00)	0 (0.00)
M.IX (MAD)	1 (0.56)	0 (0.00)	1 (0.56)	12 (6.74)	1 (0.56)
M.X (IQR)	0 (0.00)	0 (0.00)	1 (0.56)	4 (2.25)	0 (0.00)

The results in Table 3 show that the methods M.I (KMeans+IF), M.III (Bates-like), M.IIIb (Bates+BH), and M.VII (LogReg) detect the same numbers of outliers for every feature. In turn, the method M.II (Naive Conformal) produces more conservative results, whereas the methods M.IV (NormalmixEM, i.e., the EM algorithm for a two-component Gaussian mixture model) and M.VI (OCSVM inliers) often detect the highest numbers of outliers. In addition, the corresponding table for the Heart_Disease_Dataset in our supplementary material indicates that similar numbers of outliers are detected by the methods M.I, M.IV, and M.VII. For the Thyroid_Disease_Dataset, the largest numbers of detected outliers are obtained for either the M.VII or the M.IX method. Moreover, the table for the Credit_Card_Fraud_Detection_Dataset shows that similar numbers of outliers are detected by the methods M.I, M.III, M.IIIb, and M.VII.
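
As a quick consistency check, the percentages in Table 3 can be recomputed from the counts. The sketch below assumes a sample size of $n = 178$ for the Wine data set; this value is not stated in this section and is inferred here from the reported pairs (for instance, a single flagged point corresponds to 0.56%). The variable names are illustrative, and Python is used only for illustration.

```python
# Assumption: n = 178 observations in the Wine data set (inferred from Table 3).
n = 178

# Counts reported in Table 3 for method M.I (KMeans+IF), by feature.
counts_m1 = {
    "Flavanoids": 12,
    "OD280/OD315": 43,
    "Hue": 49,
    "Color Intensity": 42,
    "Total Phenols": 26,
}

# Percentage of flagged points, rounded to two decimals as in the table.
percentages_m1 = {
    feature: round(100 * count / n, 2) for feature, count in counts_m1.items()
}

for feature, pct in percentages_m1.items():
    print(f"{feature}: {counts_m1[feature]} ({pct:.2f})")
```

Under this assumption, the recomputed values agree with the M.I row of Table 3.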

6. Summary and Discussion

The simulation study reveals substantial differences in the behavior of the considered outlier detection methods, which strongly depend on both the proportion of outliers and the degree of separation between the underlying distributions.

For the Gaussian scenarios A1–A3, the mixture-based method M.IV, based on the EM algorithm for Gaussian mixture models, consistently exhibits the strongest overall performance. In scenarios characterized by good or very good class separation (A1 and A2), this method achieves nearly perfect Accuracy, high Precision and excellent Recall. Moreover, its performance remains stable as the sample size increases, confirming its robustness in well-structured settings.

As the overlap between the inlier and outlier distributions increases (scenario A3), the performance of all methods deteriorates significantly. This is particularly evident in the Precision and F1-score values, which confirm the basic difficulty of distinguishing outliers from inliers under strong distributional overlap. Classical robust methods such as MAD (M.IX) and IQR (M.X) maintain relatively high Accuracy in these scenarios; however, this is achieved at the cost of very low Recall, indicating a conservative behavior that favors limiting false positives over detecting outliers.
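
To make the conservative behavior of such rules concrete, the following minimal sketch applies an IQR rule and a MAD rule to a small hypothetical sample. The cutoffs (1.5 times the IQR, and 3 scaled MADs with the consistency constant 1.4826) are common defaults assumed here and are not necessarily the settings used for M.IX and M.X in this study. The function names and the data are illustrative, and Python is used only for illustration.

```python
import statistics

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (assumed default k = 1.5)."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in x]

def mad_outliers(x, cutoff=3.0, scale=1.4826):
    """Flag points whose scaled absolute deviation from the median exceeds cutoff."""
    med = statistics.median(x)
    mad = scale * statistics.median([abs(v - med) for v in x])
    if mad == 0:  # degenerate sample: no spread, nothing is flagged
        return [False] * len(x)
    return [abs(v - med) / mad > cutoff for v in x]

# Hypothetical sample: the last two values play the role of outliers,
# one of them (2.4) lying close to the bulk of the data.
x = [-1.2, -0.8, -0.5, -0.3, -0.1, 0.0, 0.2, 0.4, 0.6, 0.9, 1.1, 2.4, 6.0]

iqr_flags = iqr_outliers(x)
mad_flags = mad_outliers(x)
print("IQR flags:", [v for v, f in zip(x, iqr_flags) if f])
print("MAD flags:", [v for v, f in zip(x, mad_flags) if f])
```

In this toy sample both rules flag only the extreme value 6.0 and miss the milder value 2.4, so no inlier is misclassified but only half of the designated outliers are found.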

The proportion of outliers has a significant impact on the performance of the methods. For a low contamination level ( $p = 0.05$ ), several approaches perform exceptionally well in simple scenarios, with M.IV and M.X achieving near-optimal classification results. Many alternative methods demonstrate high sensitivity, but often at the expense of increased false positive rates, leading to only moderate Precision and F1-scores.
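
The interplay between these metrics at a low contamination level can be illustrated with a short sketch. It computes Accuracy, Precision, Recall and F1 with outliers as the positive class for two hypothetical detectors on a sample of 100 points with 5 true outliers ( $p = 0.05$ ). The detectors and their decisions are invented for illustration and do not correspond to any method or result in this study.

```python
def classification_metrics(truth, flagged):
    """Accuracy, Precision, Recall and F1, treating outliers as the positive class."""
    pairs = list(zip(truth, flagged))
    tp = sum(1 for t, f in pairs if t and f)
    fp = sum(1 for t, f in pairs if not t and f)
    fn = sum(1 for t, f in pairs if t and not f)
    tn = sum(1 for t, f in pairs if not t and not f)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical sample: 5 true outliers followed by 95 inliers.
truth = [True] * 5 + [False] * 95

# Conservative detector: flags one true outlier and no inliers.
conservative = [True] + [False] * 99
# Sensitive detector: flags all 5 outliers plus 10 inliers.
sensitive = [True] * 15 + [False] * 85

m_conservative = classification_metrics(truth, conservative)
m_sensitive = classification_metrics(truth, sensitive)
print(m_conservative)
print(m_sensitive)
```

In this example the conservative detector attains Accuracy 0.96 with Recall 0.2, whereas the sensitive detector attains Recall 1.0 with Precision 1/3 and F1 equal to 0.5, which mirrors the pattern described above.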

When the proportion of outliers increases to $p = 0.1$ , a clear deterioration in performance is observed across most methods. The number of false positives rises substantially, resulting in a significant drop in Precision. Although the mixture-based approach M.IV remains relatively robust, its Recall decreases in more challenging scenarios, particularly for smaller sample sizes. Similar behavior is observed for MAD and IQR, which continue to produce high Accuracy but exhibit a marked loss in Recall.

In the most challenging scenarios involving heavy-tailed outlier distributions (B2 and B3), mixture-based and clustering-based methods show a relative advantage in terms of Recall, suggesting an improved ability to identify extreme observations. Nevertheless, this improvement is frequently accompanied by reduced Precision, highlighting the fundamental trade-off between sensitivity and specificity.

Finally, an analysis of the obtained heatmaps indicates that increasing the sample size leads to a greater number of statistically significant differences between methods, especially in scenarios A2, A3, B1, B2, and B3. In contrast, scenario A1 exhibits only minor differences, reflecting its relative simplicity.

The presented results show that no single outlier detection method universally prevails over the other methods across the considered scenarios and that the choice of an appropriate approach should be guided by prior knowledge of the data structure, including the degree of distributional overlap and the proportion of outliers.

For data sets with well-separated clusters and approximately Gaussian distributions, mixture-based methods such as M.IV are recommended due to their high Accuracy, Precision, and Recall. In such settings, classical robust methods like IQR may also perform well, offering a simple and computationally efficient alternative.

In contrast, for more complex or heavy-tailed distributions, practitioners should be aware of the trade-offs associated with conservative methods such as MAD and IQR, which tend to achieve high Accuracy at the cost of low Recall. In applications where identifying as many outliers as possible is crucial, clustering-based or mixture-based approaches may be preferable, even if this entails a higher false positive rate. The seven proposed methods, based on clustering, perform relatively well across all of the considered scenarios, and in the difficult scenarios B1-B3 some of them, namely M.I, M.III and M.III+BH, outperform the other methods in terms of certain classification metrics.

The results of our study show some possible directions for future research.

Instead of using two-dimensional vector representations, a natural extension of the proposed methods is to analyze pairs consisting of a vector of multidimensional observations and a score obtained by the Extended Isolation Forest algorithm, and to apply clustering techniques to these pairs in order to detect outliers in multidimensional data.
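
A minimal sketch of this idea is given below: each multidimensional observation is augmented with a score, and plain 2-means (Lloyd iterations) is run on the augmented vectors. Several elements are assumptions made only for illustration: the score is a simple stand-in (distance from the coordinate-wise median) and not an Extended Isolation Forest score, the initialization is a deterministic choice, the smaller cluster is treated as the outlier cluster, and all names and data are hypothetical. Python is used only for illustration.

```python
import math
import statistics

def placeholder_score(points):
    """Stand-in anomaly score (NOT an Extended Isolation Forest score):
    Euclidean distance from the coordinate-wise median."""
    center = [statistics.median(col) for col in zip(*points)]
    return [math.dist(p, center) for p in points]

def two_means(vectors, max_iter=100):
    """Lloyd iterations for k = 2 with a deterministic start."""
    # Assumption: start from the vectors with the smallest and largest score,
    # where the score is stored as the last coordinate.
    centers = [min(vectors, key=lambda v: v[-1]), max(vectors, key=lambda v: v[-1])]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            0 if math.dist(v, centers[0]) <= math.dist(v, centers[1]) else 1
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical two-dimensional sample; the last two points lie far from the bulk.
points = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.2), (0.0, 0.4),
          (-0.1, -0.3), (0.3, 0.0), (5.0, 5.2), (4.6, 5.5)]

scores = placeholder_score(points)
augmented = [list(p) + [s] for p, s in zip(points, scores)]
labels = two_means(augmented)

# Assumption: the smaller of the two clusters is treated as the outlier cluster.
outlier_label = min((0, 1), key=labels.count)
flagged = [lab == outlier_label for lab in labels]
print(flagged)
```

In this toy example the two distant points form the smaller cluster and are flagged. With a real isolation-based score, the scaling of the score relative to the observation coordinates would likely need separate consideration.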

Another reasonable idea is to investigate the proposed hybrid outlier detection methods in higher-dimensional settings, where interactions between anomaly scores, clustering and conformal procedures may lead to new and interesting challenges.

Furthermore, an important direction could be the development of theoretical foundations for isolation-based and hybrid methods. In addition, deeper integration of isolation-based techniques with modern machine learning approaches could further enhance anomaly detection performance.

A limitation of the clustering-based methods considered in our study is their inability to produce continuous anomaly scores. As a result, they provide only hard decisions, which reduces flexibility for practitioners who may need to adjust sensitivity thresholds depending on the application context.

Footnotes

ORCID iDs

Konrad Furmańczyk

Marcin Dudziński

Funding

The authors received no financial support for the research, authorship and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Ayoub, Khalid, Karczmarek (2023). FUZZY C-MEANS based extended isolation forest for anomaly detection. In J. Kacprzyk, M. Ezziyyani, & V. E. Balas (Eds.), International conference on advanced intelligent systems for sustainable development. AI2SD 2022. Lecture Notes in Networks and Systems (Vol. 637). Springer, Cham. https://doi.org/10.1007/978-3-031-26384-2_35

Bates, Candès, Lei, Romano, Sesia (2023). Testing for outliers with conformal p-values. The Annals of Statistics, 51(1), 149–178. https://doi.org/10.1214/22-AOS2244

Belisle, C. J. P. (1992). Convergence theorems for a class of simulated annealing algorithms on $R^{d}$. Journal of Applied Probability, 29, 885–895. https://doi.org/10.2307/3214721

Benaglia, Chauveau, Hunter, D. R., Young, D. S. (2009). mixtools: An R package for analyzing mixture models. Journal of Statistical Software, 32(6), 1–29. https://doi.org/10.18637/jss.v032.i06

Benjamini, Hochberg (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57(1), 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x

Breunig, M. M., Kriegel, H.-P., Ng, R. T., Sander (2000). LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on management of data (pp. 93–104). https://doi.org/10.1145/342009.335388

Chalapathy, Chawla (2019). Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407.

Chaudhuri, Dasgupta, Vattani (2009). Learning mixtures of Gaussians using the k-means algorithm. https://arxiv.org/abs/0912.0086

Cover, T. M., Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27. https://doi.org/10.1109/TIT.1967.1053964

Dasgupta (1999). Learning mixtures of Gaussians. In Proceedings of the 40th Annual IEEE symposium on foundations of computer science (FOCS) (pp. 634–644).

Ding, Bhanushali, Liu (2019). Deep anomaly detection on attributed networks. In Proceedings of SIAM international conference on data mining (SDM) (pp. 594–602).

Han, Huang, Jiang, Zhao (2022). ADBench: Anomaly detection benchmark. arXiv preprint arXiv:2206.09426.

Hariri, Carrasco Kind, Brunner, R. J. (2021). Extended isolation forest. IEEE Transactions on Knowledge and Data Engineering, 33(4), 1479–1491. https://doi.org/10.1109/TKDE.2019.2947676

Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.

Huang, Liu, Guo, Chen (2024). Interpretable Single-dimension Outlier Detection (ISOD): An Unsupervised Outlier Detection Method Based on Quantiles and Skewness Coefficients. Applied Sciences, 14(1). https://doi.org/10.3390/app14010136

Huber, P. J., Ronchetti, E. M. (2009). Robust statistics (2nd ed.). Wiley.

Karczmarek, Di Noia, Rosati (2020). K-means-based isolation forest. Knowledge-Based Systems, 196, 105801. https://doi.org/10.1016/j.knosys.2020.105801

Leys, Ley, Klein, Bernard, Licata (2013). Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4), 764–766. https://doi.org/10.1016/j.jesp.2013.03.013

Zhao, Botta, Ionescu (2020). COPOD: Copula-Based Outlier Detection. In Proceedings of the IEEE international conference on data mining (ICDM) (pp. 1118–1123). https://doi.org/10.1109/ICDM50108.2020.00135

Zhao, Botta, Ionescu, Chen, G. H. (2022). ECOD: Unsupervised outlier detection using empirical cumulative distribution functions. IEEE Transactions on Knowledge and Data Engineering (TKDE), 35(12), 12181–12193. https://doi.org/10.1109/TKDE.2022.3159580

Zhu, van Leeuwen (2023). A survey on explainable anomaly detection. https://arxiv.org/abs/2210.06959

Liu, F. T., Ting, K. M., Zhou, Z.-H. (2008). Isolation Forest. In Proceedings of the IEEE international conference on data mining (ICDM) (pp. 413–422).

Ostrovsky, Rabani, Schulman, L. J., Swamy (2012). The effectiveness of Lloyd-type methods for the k-means problem. Journal of the ACM, 59(6), 28. https://doi.org/10.1145/2395116.2395117

Pollard (1981). Strong consistency of k-means clustering. The Annals of Statistics, 9(1), 135–140. https://doi.org/10.1214/aos/1176345339

R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

Rousseeuw, P. J., Croux (1993). Alternatives to the median absolute deviation. Journal of the American Statistical Association, 88(424), 1273–1283. https://doi.org/10.1080/01621459.1993.10476408

Sánchez Vinces, Capobianco, Bardram, J. E. (2025). A comparative evaluation of clustering-based outlier detection. Data Mining and Knowledge Discovery, 39(2), Article 13. https://doi.org/10.1007/s10618-024-01086-z

Schölkopf, Platt, J. C., Shawe-Taylor, Smola, A. J., Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471. https://doi.org/10.1162/089976601750264965

Schölkopf, Williamson, R. C., Platt, J. C., Shawe-Taylor, Smola, A. J. (2000). Support Vector Method for Novelty Detection. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems (NIPS 12) (pp. 582–588). MIT Press.

Shao, Chen (2022). Cluster-based improved isolation forest. Entropy, 24(5), 611. https://doi.org/10.3390/e24050611

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

Vecchi, M. P., Kirkpatrick (1983). Optimization by simulated annealing. Science (New York, N.Y.), 220(4598), 671–680. https://doi.org/10.1126/science.220.4598.671

Vovk, Gammerman, Shafer (2005). Algorithmic learning in a random world. Springer.

Wilcoxon (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83. https://doi.org/10.2307/3001968

Wang, Cheng, Liu (2022). Deep Isolation Forest for Anomaly Detection. IEEE Transactions on Knowledge and Data Engineering, 34(10), 4784–4797.

Zhao, Nasrullah (2019). PyOD: A Python Toolbox for Scalable Outlier Detection. Journal of Machine Learning Research, 20(96), 1–7.

Zhou, Paffenroth, R. C. (2017). Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (KDD) (pp. 665–674). Halifax, NS, Canada.
