An approach for unsupervised contextual anomaly detection and characterization

Abstract

Outlier detection has been widely explored and applied to different real-world problems. However, outlier characterization that consists in finding and understanding the outlying aspects of the anomalous observations is still challenging. In this paper, we present a new approach to simultaneously detect subspace outliers and characterize them. We introduce the Dimension-wise Local Outlier Factor (DLOF) function to quantify the degree of outlierness of the data points in each feature dimension. The obtained DLOFs are used in an outlier ensemble so as to detect and rank the anomalous points. Subsequently, the same DLOFs are analyzed in order to characterize the detected outliers with their relevant subspace and their same-type anomalies. Experiments on various datasets show the efficacy of our method. Indeed, we demonstrate through an experimental evaluation that the proposed approach is competitive compared to the existing solutions in terms of both detection and characterization accuracy.

Keywords

Contextual anomaly detection, outlier characterization, outlying aspect mining, Local Outlier Factor, outlier ensembles

1. Introduction

Outlier detection is the process of discovering objects with unpredicted and rare patterns in the data. The unexpected objects are also called anomalies, exceptions, novelties, etc. Thus, the terms outlier detection and anomaly detection will be used interchangeably throughout this article. Anomaly detection has several application domains such as fraud detection in banking, intrusion detection in cyber security, and fault detection in control systems. Several methods for anomaly detection have been proposed; they can be categorized into statistical, classification-based, clustering-based, nearest neighbors-based, etc. [27]. Lately, however, the research community has highlighted the need for interpreting and finding more information about the detected anomalies. Anomaly characterization consists mainly in discovering, for a query outlier, the subspace(s) or subset(s) of features (a.k.a. attributes, dimensions) in which it shows a significant degree of outlierness. This task can be of great utility in different practical applications. For instance, in the field of intrusion detection, the attacks can be of different types and unseen anomalies occur daily. Therefore, new unsupervised methods are needed not only to detect them but also to characterize them. Indeed, reporting to the analyst more information regarding the detected malicious activities is of major importance in order to distinguish between attacks and anomalies, to understand the root causes and to decide on better defense strategies. Similarly, in the field of the Internet of Things, fault detection is a critical task. Thus, it is of tremendous importance to know the reason behind the faulty behavior of a sensor, whether it is a hardware malfunction, a battery drain problem, etc.

In previous works, numerous approaches have been proposed for anomaly detection in relevant subspaces, where the subspaces are identified either before or during the detection process. These outliers are also called contextual or conditional outliers [28, 16, 17, 30, 29]. However, when contextual outliers are detected, their contextual information is not always reported to the analyst [13, 9, 10, 31, 5] or the characterization performance is not well evaluated [16, 17, 30]. On the other hand, some approaches have been proposed to address only the specific outlier characterization problem. Nonetheless, the existing characterization approaches lack scalability w.r.t. data dimension [23, 4]. In addition, the outliers are explained only with the subspaces. Furthermore, since the detection and the characterization processes are decoupled, the user may need extra time and effort to carry out the work.

In this article, we propose a new approach to simultaneously detect and characterize outliers. The proposed approach addresses several issues related to the outlier detection in relevant subspaces and in high dimensional data. The challenges tackled are detailed in the following.

1.1 Curse of dimensionality

Finding outliers in high dimensional data is a difficult task due to the curse of dimensionality [2]. Indeed, most detection methods rely on distance as a similarity measure. However, as data dimensionality increases, the relative contrast between near and far neighbors diminishes. This reduces the usability of the distance measure, especially for the nearest neighbors-based techniques. Therefore, in our case, the issue related to the high dimensional data is addressed by adopting the vertical partitioning of the data. More precisely, when the data dimension exceeds a certain number of attributes, the dataset is divided into several vertical partitions, and later the outlier detection and characterization is carried out on each of these partitions.

1.2 Subspace outliers and feature relevance

In high dimensional data, as explained in [2], it is highly probable that the data can be separated into different clusters following different distributions, where each cluster is generated by a specific mechanism. Furthermore, these mechanisms are characterized by some relevant features only. Therefore, finding the nearest neighbor is meaningful if the search is restricted to data objects from the same cluster (distribution) as the considered point. Consequently, in our case, the local outlierness degree is computed by considering two different neighborhoods. The first one includes a large number of nearest neighbors in the full feature space (points from the same cluster), which are assumed to share the same generating mechanism as the considered point. The second neighborhood considers a smaller set of nearest neighbors, found in the previously defined neighborhood, for the local density comparison in each feature dimension, and thus for computing the local outlierness degree and finding the relevant subspace. The local outlierness degree is computed with the introduced function DLOF (Dimension-wise Local Outlier Factor), which is a modified LOF (Local Outlier Factor). This function was first explored in [19] and is improved in this work.

1.3 Scoring function

Numerous approaches have been proposed to assess the relevance of the data attributes and construct relevant subspaces either for detection or characterization, and most of them rely on the local neighborhood of the considered data point. However, there are issues related to bias and contrast of the scoring function [2]. Indeed, scores obtained in subspaces of different dimensionalities are not comparable, and choosing a threshold boundary to distinguish an outlier can be a challenging task. With our scoring function DLOF, since the outlierness is computed at each feature level, the score comparability problem does not arise. In addition, as with LOF, a baseline value slightly larger than 1 allows differentiating an outlier from the remaining data points; the threshold is thus less difficult to set.

1.4 Anomaly detection and characterization

Contextual anomaly detection is a difficult task and obtaining additional information about the detected anomalies can be of tremendous importance for analysts. However, only a few works have proposed approaches to simultaneously detect and characterize the contextual anomalies. Furthermore, when characterization is adopted, it is not evaluated appropriately.

Consequently, in the present work, a solution is proposed for contextual anomaly detection and characterization. Both the detection and the characterization processes rely on the same subspace relevance analysis function DLOF. The detection process is modeled as an outlier ensemble, while the characterization is achieved by quantifying the attributes relevance. In addition to the relevant subspaces, the anomalies are also characterized by finding the same-type anomalies. Moreover, both the detection and the characterization have been appropriately evaluated and compared to existing solutions.

The main contributions of our work can be summarized as follows:

• A function for feature relevance analysis and quantification is introduced. The function is a modified LOF that addresses issues related to high dimensional data and subspace outliers.
• A completely unsupervised solution for outlier detection and characterization is presented.
• The detected outliers are characterized with their relevant subspaces and their same-type anomalies.
• The proposed approach has been applied on both real-world and synthetic data and evaluated in terms of detection and characterization efficacy.

The rest of this paper is organized as follows. Section 2 discusses the related work. In Section 3, we present the preliminaries related to the LOF algorithm and to the outlier ensembles. Subsequently, Section 4 presents the DLOF function and the data vertical partitioning strategy. Section 5 details our general approach for outlier detection and characterization. In Section 6, we present and discuss the evaluation results of the proposed solution in terms of both detection and characterization efficacy. Finally, Section 7 concludes the article.

2. Related work

The majority of the works have focused on improving the accuracy and the efficiency of the detection techniques, while only a few works have contributed to bringing more information about the anomalies. Hence, a challenging task in the outlier mining field is the interpretability and the characterization of the detected anomalies.

2.1 Contextual outlier detection

In order to develop and apply anomaly detection techniques, one should consider several constraints [27] such as the nature of the input data, the availability of data labels, and the result of the algorithm. An important consideration, which has been taken into account lately, is based on the properties of the anomalies to detect. Indeed, it is assumed that the anomalies can be different depending on the generating mechanism, and generally a distinction is achievable by analyzing the impact of the data attributes. In most of the works which follow this assumption, the outliers discovered under certain conditions, possibly in specific attribute subsets, are called conditional outliers [28] or contextual outliers [12, 16, 17, 30, 29].

As introduced in [27], the contextual anomaly detection methods can be categorized into two types. The first type comprises methods in which contextual anomaly detection is reduced to point anomaly detection after identification of the context; hence, the contextual information is provided beforehand. The second type of method uses training data to learn a model which predicts the expected behavior with respect to a given context. An anomaly is detected if the predicted behavior is significantly different from the observed behavior. Herein, the contextual information is the learned model.

However, the first category can be extended. Indeed, the techniques described in [27] rely on contextual information that is determined manually. As an example, in [28], the authors presented their method, namely Conditional Anomaly Detection (CAD), and they assume that, in addition to a baseline dataset, the user divides beforehand the data attributes into environmental attributes and indicator attributes. Herein, the definition of the contextual attributes is made by human experts; it therefore relies essentially on the experts’ knowledge and experience, which is a drawback since the contextual information is predefined manually and not automatically. Alternatively, the contextual information can be either generated randomly [1, 18] or discovered automatically [13, 9, 10, 31, 5, 12, 8, 16, 17, 30, 29].

In [1], the authors have proposed an LOF-based outlier ensemble, which combines the results obtained by computing the LOF in several randomly generated feature subspaces. Although the approach does not discover a relevant subspace for each outlier, it can be considered as a first step towards automatic outlier mining in feature subsets in order to improve the detection accuracy. The same idea has been developed in [18], where, in addition to the various subspaces, different parameterizations have been adopted. An unsupervised outlier detection algorithm based on generative adversarial active learning has been presented in [32] in order to tackle the issue of high-dimensional data. The model consists of a generator and a discriminator. Guided by the discriminator, which defines a decision boundary separating normal and anomalous data, the generator produces informative potential outliers that occur inside or close to the real data. In order to provide sufficient information to the discriminator, the model has been expanded from one to multiple generators, namely, MO-GAAL. These generators have different objectives and thus generate different types of outliers.

The authors in [13] proposed a method, namely Subspace Outlier Degree (SOD), that detects and scores an outlier with reference to the axis-parallel subspace spanned by its neighbors. The relevant attributes of the set of neighbors of a point are the attributes where the points in the neighborhood exhibit a low variance. OUTRES, introduced in [9], is an adaptive outlier ranking where comparable degrees of deviation are measured for objects in multiple arbitrary subspaces. The deviation is measured for each object in a set of relevant subspaces, where a subspace is retained as relevant if the Kolmogorov-Smirnov test shows that it is non-uniformly distributed. In [10], the authors proposed HiCS, a subspace search method that selects high contrast subspaces for density-based outlier ranking. The proposed contrast measure is based on correlation analysis: the difference between the marginal and conditional pdf of a subspace is used as a criterion for high contrast. The outlier score for the ranking relies on LOF. In addition, the subspace outlier mining is a decoupled process, divided into “subspace search” and “outlier ranking”. In [31], the Local Outlier Mining Algorithm (LOMA) is proposed; the approach relies on the PSO heuristic and a proposed sparse coefficient threshold to search for the sparse subspaces from which the local outliers can be effectively extracted. The GLOSS algorithm was introduced in [5]; the approach consists in combining existing local subspace anomaly detection techniques with the concept of global neighborhoods. The authors explain the necessity to consider global neighborhoods in order to assess the outlierness of a query point with respect to data objects having the same mixture component.

In [13, 9, 10, 5, 31], although the terminology of contextual or conditional anomaly has not been used, the proposed methods rely on the notion of relevant subspace. As explained in [7], certain outliers are hidden in data subspaces and can be detected only by projecting the data into lower dimensions. Therefore, those characterizing subspaces can be considered as the context in which the anomaly occurred. However, in these works the relevant subspaces in which the outliers are detected are not retained for further exploration and are not reported to the user.

In [16, 17, 29, 30], the notion of contextual anomalies has been adopted, and the contextual information is retained as an additional result of the proposed approaches so as to help the user achieve a better interpretation of the detected anomalies. In both [16] and [17], the relevance of a subspace relies on the local density. The authors in [17] assume that an outlier is an object which, in the relevant subspace, does not satisfy the distribution of the local dataset. The local sparseness of attribute dimensions is used to quantify the relevant subspaces, and the probability density of local datasets is used to compute the degree of outlierness of the object. This work aimed at improving detection efficiency and scalability. The authors in [30] also aimed at improving the LOMA [31] through parallelization and outlier interpretation.

Above-mentioned methods have been proposed for numeric data and the relevant subspace is the context considered. However, methods for contextual anomaly detection in categorical data or that require data transformation have been proposed in [12, 8], respectively. And, in [29], the context is identified as a 2-coloring of a random walk graph.

The common property of all these works is that they detect outliers under a certain context. However, reporting the contextual information is a side product of the proposed methods, and little effort has been put into evaluating these methods in terms of outlier characterization and explainability [16, 17, 30]. As such, in this work, both the detection and the characterization of the contextual anomalies are achieved through the DLOF function. In addition, the characterization performance of the solution has been appropriately evaluated and compared to existing solutions.

The works that have been proposed for numeric data and consider the relevant subspace as the contextual information are summarized in Table 1.

Table 1
Summary of works on relevant subspace-based contextual outlier detection in numeric data

Reference	Context discovery	Reported context	Subspace relevance analysis
[1]	Random	–	None
[18]	Random	–	None
[28]	Manual	–	Experts knowledge
[5]	Automatic	No	Outlierness probability
[13]	Automatic	No	Variance
[9]	Automatic	No	Distribution
[10]	Automatic	No	Correlation
[31]	Automatic	No	Sparsity
[16]	Automatic	Yes	Sparsity
[17]	Automatic	Yes	Sparsity
[30]	Automatic	Yes	Sparsity

2.2 Outlier characterization

As previously explained, in some proposed solutions, even though the outliers’ relevant subspaces are found and explored so as to enhance the anomaly detection accuracy, these relevant subspaces are not retained to be presented to the user. Moreover, the problem of outlier interpretation is gaining more and more interest; indeed, understanding the reason behind the outlierness of a data object is very important. Outlier characterization is also referred to as outlying aspects mining [11], outlying properties detection [20], or outlier explanation [21].

An early work on outlier characterization is introduced in [11]. The authors presented a solution that considers categorical attributes. A measure using the relative frequency has been used to quantify the relevance of attribute combinations for a considered anomaly. The score is obtained by considering the entire dataset or a subset of it. A supervised feature selection-based technique is proposed in [4]. The method constructs for each outlier a binary classifier to distinguish it from the inliers. The outlier’s class is a set of samples generated from a Gaussian distribution, while the inlier class is formed with other samples of the remaining data. Herein, classification accuracy is used as a feature scoring function, i.e. the features that yield a high accuracy are retained as a characterizing subspace. OAMiner was introduced in [20]; as a scoring function, the authors used the rank of the outlier in a given subspace based on the kernel density estimate. The subspaces which yield the best ranking of the outlier are retained as the relevant subspaces. The approach relies on a systematic search, and a depth-first strategy is adopted to browse the search space and find the relevant subspaces. An anomaly characterization approach, namely the OARank framework, is presented in [23]. It works in two stages: the first one is a mutual information-based feature ranking, and the second stage consists in exploring the features obtained in the previous stage through a score-and-search strategy. The second stage is optional; when it is considered, the framework is called OARank search. Lately, an ensemble of scoring functions has been presented in [24]. The authors have analyzed these functions in terms of different desiderata, in particular, the dimension unbiasedness. Subsequently, the beam search algorithm has been used so as to browse the subspaces for the considered outliers with the described scoring functions.

A general framework named SHAP (SHapley Additive exPlanations) is presented in [25, 26]; it assigns to each feature an importance value and aims to help the analyst interpret the predictions of any machine learning model. The authors in [21] introduced the Explainer algorithm, which uses an ensemble of specifically trained decision trees to explain the anomalies detected by any other algorithm. The explanation consists of the different relevant features but also of a set of rules.

Similarly to detection, the outliers are characterized with their relevant subspaces, and the relevance analysis is a key point. However, the research on this problem is limited. The solution presented in [11] considers only categorical attributes. In [20, 24], a systematic search strategy is adopted in order to find the relevant subspaces; these approaches are therefore computationally expensive. Furthermore, in [4], a feature selection strategy is adopted and the approach in [23] is hybrid. The latter approaches are better than the former ones in terms of computation; however, in high dimensional data, their characterization performance is low. In our case, the proposed scoring function relies on feature ranking by analyzing the features individually; therefore, the systematic search is not needed and the process is more efficient. It is also effective for high dimensional data due to the data partitioning. Furthermore, in addition to the relevant features, the same-type anomalies are reported as an additional result that can help the analyst to explain the results.

3. Preliminaries

In this section we present the Local Outlier Factor (LOF) algorithm, followed by the introduction of the concept of outlier ensembles.

3.1 Local Outlier Factor

LOF [22] is a density-based outlier detection approach. It assigns to each data point an outlierness degree. The local outlier factor is computed as follows.

Firstly, for each point $p$, $\textit{kDistance}(p)$ is computed; it represents the distance of $p$ to its $k$-th nearest neighbor. Subsequently, all points within $\textit{kDistance}(p)$ form $N_{k}(p)$, the set of $k$ nearest neighbors of $p$. It is essential to mention that the cardinality of $N_{k}(p)$ may be greater than $k$, since several points may lie at exactly the distance $\textit{kDistance}(p)$ from $p$.

Secondly, to reduce the statistical fluctuations of the distances between a point and its neighbors, the reachability distance of point ${p}$ with respect to its neighbor ${o}$ , denoted $\textit{reachDist}_{k}(p,o)$ , is computed as follows:

$\displaystyle\textit{reachDist}_{k}(p,o)=\max\{\textit{kDistance}(o),d(p,o)\}$ (1)

Later, the local reachability density of point $p$ is obtained as the inverse of the average reachability distances:

$\displaystyle\textit{lrd}_{k}(p)={1}\left/{\left[\frac{\sum_{o\in N_{k}(p)}{\textit{reachDist}_{k}(p,o)}}{|N_{k}(p)|}\right]}\right.$ (2)

Finally, the local outlier factor of a point $p$ is the average of the ratios between the local reachability densities of the $k$-nearest neighbors of $p$ and the local reachability density of $p$ itself:

$\displaystyle\textit{LOF}_{k}(p)=\frac{\sum_{o\in N_{k}(p)}{\frac{\textit{lrd}_{k}(o)}{\textit{lrd}_{k}(p)}}}{|N_{k}(p)|}$ (3)

The LOF represents the degree of outlierness of a data point. A data point is considered an outlier if its local outlier factor is greater than a baseline value close to 1; otherwise, it is considered a normal point. The advantage of the LOF lies in the fact that it detects outliers lying in neighborhoods with different properties [22].
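
As an illustration of Eqs. (1)-(3), a minimal pure-Python sketch of the LOF computation is given below. It is not the reference implementation of [22]: the function name `lof_scores`, the brute-force neighbor search and the choice of the Euclidean distance are assumptions made for readability. The sketch also assumes that the data contain no group of more than $k$ identical points, since the average reachability distance would otherwise be zero.

```python
import math


def lof_scores(data, k):
    """Return LOF_k(p) for every point p of `data` (a list of equal-length tuples)."""
    n = len(data)
    dist = [[math.dist(a, b) for b in data] for a in data]

    k_dist, neigh = [], []
    for p in range(n):
        ranked = sorted((dist[p][q], q) for q in range(n) if q != p)
        kd = ranked[k - 1][0]                  # kDistance(p)
        k_dist.append(kd)
        # N_k(p): every point within kDistance(p); ties may give more than k points
        neigh.append([q for d, q in ranked if d <= kd])

    lrd = []
    for p in range(n):
        # Eq. (1): reachDist_k(p, o) = max{kDistance(o), d(p, o)}
        reach = [max(k_dist[o], dist[p][o]) for o in neigh[p]]
        # Eq. (2): inverse of the average reachability distance
        lrd.append(len(reach) / sum(reach))

    # Eq. (3): average ratio lrd(o) / lrd(p) over the neighbors o of p
    return [sum(lrd[o] for o in neigh[p]) / (len(neigh[p]) * lrd[p])
            for p in range(n)]


# Hypothetical toy data: a 4x4 grid of inliers and one isolated point.
grid = [(x, y) for x in range(4) for y in range(4)]
scores = lof_scores(grid + [(10, 10)], k=3)
```

On this toy data the isolated point receives a score well above 1, while the grid points stay close to the baseline.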

3.2 Outlier ensembles

Ensemble analysis consists in combining results obtained with various algorithms or, as in our case, with executions of one base algorithm with distinct parameters and/or distinct data subspaces. This approach has been adapted to the field of outlier detection, where it is known as outlier ensembles. An outlier ensemble can be unsupervised, and its main purposes are to improve the anomaly detection efficacy by merging results from various models and to tackle the curse of dimensionality when feature subspaces are used. Indeed, previous studies have shown that a better performance can be achieved by combining results from various models compared to using the results of one base model, and this holds true for anomaly detection [1, 18]. However, designing an unsupervised anomaly detection ensemble necessitates considering some aspects such as the accuracy of the algorithms, their diversity, and the combination function used to obtain the consensus of the results. A more detailed explanation of the concept can be found in [6, 3].
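
To make the role of the combination function concrete, the following sketch merges the scores produced by several ensemble components into one consensus score per point. Averaging and taking the maximum are two commonly used combination functions; this passage does not state which one is adopted, so both options, as well as the names `combine_scores` and `rank_points`, are illustrative assumptions.

```python
def combine_scores(component_scores, how="avg"):
    """Merge per-component outlier scores into one consensus score per point.

    `component_scores` holds one list of scores per ensemble component,
    all of the same length (one score per data point).
    """
    if how not in ("avg", "max"):
        raise ValueError("how must be 'avg' or 'max'")
    per_point = list(zip(*component_scores))   # regroup the scores by data point
    if how == "avg":
        return [sum(s) / len(s) for s in per_point]
    return [max(s) for s in per_point]


def rank_points(scores):
    """Return point indices sorted from most to least outlying."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical scores of three points from two ensemble components.
components = [[1.0, 3.0, 1.0], [2.0, 1.0, 1.0]]
avg_scores = combine_scores(components, "avg")
max_scores = combine_scores(components, "max")
ranking = rank_points(avg_scores)
```

Averaging dampens the effect of a single component, whereas the maximum keeps a point highly ranked as soon as one component flags it; which behavior is preferable depends on how diverse the components are.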

4. Data partitioning and scoring function

In this section, we will introduce our function for attribute relevance analysis, in addition to explaining how the vertical data partitioning is explored in our solution.

4.1 Vertical data partitioning

LOF has different advantages; however, it relies on the distance measure, and because of the curse of dimensionality, the performance of the algorithm on high-dimensional data is not satisfactory. Thus, in the case of the DLOF, the algorithm will not perform well in discovering the $C$ nearest neighbors (defined in Subsection 4.2). We tackle the problem of computing the local outlierness in high dimensional data by transforming it into a computation in distinct data partitions. As can be seen from Fig. 1, the $n\times d$ original dataset DS is divided into several $n\times s$ vertical data partitions $\{\textbf{DP1},\textbf{DP2},\ldots,\textbf{DPx}\}$, with $x=d/s$. Thus, DLOF is sequentially applied to data partitions obtained by dividing the feature space into multiple subsets of size $s$, instead of applying it to the full feature space. Data partitioning is applied only if the dataset dimensionality is higher than a certain dimension size $s$.

Figure 1.

Vertical data partitioning.
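
The partitioning described in this subsection can be sketched as follows. The helper names `vertical_partitions` and `project` are hypothetical, and the handling of the remainder (a last, smaller partition when $s$ does not divide $d$) follows the rule stated in the DLOF computation algorithm of Section 5.1.

```python
def vertical_partitions(d, s):
    """Split the feature indices 0..d-1 into consecutive subsets of size s.

    No partitioning is performed when d does not exceed s; otherwise the
    last subset holds the d modulo s remaining features, if any.
    """
    if d <= s:
        return [list(range(d))]
    return [list(range(start, min(start + s, d))) for start in range(0, d, s)]


def project(data, features):
    """Return the vertical partition of `data` restricted to `features`."""
    return [[row[j] for j in features] for row in data]


# Hypothetical example: 10 features split into partitions of at most 4.
subsets = vertical_partitions(10, 4)
rows = [list(range(10)), list(range(10, 20))]
last_partition = project(rows, subsets[-1])
```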

4.2 DLOF: Dimension-wise Local Outlier Factor

We adapt the LOF algorithm and use it as a feature scoring function for both detection and characterization. Therefore, we make the two following modifications to the algorithm.

4.2.1 Two neighborhood-related parameters: Cp and K

In the LOF algorithm, the outlierness degree of each data point is computed with the parameter $k$ ranging within two values, and the final LOF is obtained as the maximum. For our scoring function, we consider two neighborhoods obtained with two different parameters $Cp$ and $K$.

In order to tackle the challenge explained in Subsection 1.2, i.e. the fact that the nearest neighbor search is meaningful only if it is restricted to objects of the same cluster as the considered point [2], we define the parameter $Cp$ as a percentage of the total number of data points; this percentage is used to determine the number of nearest neighbors $C$ of point $p$ in the full feature space. Herein, it should be mentioned that the full feature space refers to the features of the data partition if the dimensionality of the dataset exceeds $s$, and to the features of the dataset otherwise. A large enough value should be used for the parameter $C$ in order to capture the data objects belonging to the same cluster, which are assumed to be generated by the same mechanism as the query point $p$. The use of a percentage makes the number $C$ proportional to the dataset size.

We define a second parameter $K$ that will be used to compute the Kdistance of each data point $p$ in a distinct feature dimension $f$. The same parameter $K$ determines the cardinality of the set of point $p$’s nearest neighbors to consider for the local density comparison. Parameter $C$ should be set larger than $K$, and both the Kdistance and the K-nearest neighbors of point $p$ are searched in the C-nearest neighbors. The function of $K$ is similar to that of the parameter $k$ used in the LOF algorithm; the only difference is that the local densities are found w.r.t. the neighborhood determined by $C$ and computed for each feature dimension.

4.2.2 Dimension-wise LOF (DLOF)

The DLOF of a point $p$ is a modified LOF which is computed for each feature dimension. It is used for quantifying the feature dimension importance in order to detect and characterize the outliers.

Let $p\in\textbf{DS}$ be an ordinary data point, with DS being a d-dimensional dataset that contains $n$ objects, and let $\textbf{F}=\{f_{1},f_{2},\ldots,f_{d}\}$ be the feature set. The definition of the DLOF of $p$ at the level of $f_{i}$ is presented in the following.

Definition 1. $N_{C}(p)$ , the C-nearest neighbors in the full feature space of point $p$ in DS, $N_{C}(p)$ should be large so as to capture the cluster to which point $p$ belongs.

Definition 2. $\textit{Kdistance}_{fi}(p)$, the distance of $p$ to its K-th nearest neighbor in $N_{C}(p)$ and at the feature dimension $f_{i}$. Formally, $\textit{Kdistance}_{fi}(p)$ is defined as the distance at the feature dimension $f_{i}$, $d_{fi}(p,o)$, between $p$ and an object $o\in N_{C}(p)$ such that:

(1) for at least $K$ objects $o^{\prime}\in N_{C}(p)\setminus\{p\}$ it holds that $d_{fi}(p,o^{\prime})\leqslant d_{fi}(p,o)$, and
(2) for at most $K-1$ objects $o^{\prime}\in N_{C}(p)\setminus\{p\}$ it holds that $d_{fi}(p,o^{\prime})<d_{fi}(p,o)$.

The data point $p$ will have different values for $\textit{Kdistance}_{fi}(p)$ depending on the feature dimension $f_{i}$.

Definition 3. $\textit{reachDist}_{K,fi}(p,o)$, the reachability distance of point $p$ with respect to point $o$ in feature dimension $f_{i}$, is obtained as follows:

$\displaystyle\textit{reachDist}_{K,fi}(p,o)=\max\{\textit{Kdistance}_{fi}(o),d_{fi}(p,o)\}$ (4)

Definition 4. $N_{K}(p)$, the set of K-nearest neighbors of point $p$. This set should be smaller than $N_{C}(p)$; it is used for the local density comparison, i.e. the local density of point $p$ in feature dimension $f_{i}$ will be compared to the local density in $f_{i}$ of its nearest neighbors $N_{K}(p)$.

Definition 5. $\textit{lrd}_{fi}(p)$, the local reachability density of point $p$ in feature dimension $f_{i}$. It is defined as the inverse of the average reachability distance of $p$ with respect to its K-nearest neighbors. The density is obtained with the following equation:

$\displaystyle\textit{lrd}_{fi}(p)={1}\left/{\left[\frac{\sum_{o\in N_{K}(p)}{\textit{reachDist}_{K,fi}(p,o)}}{|N_{K}(p)|}\right]}\right.$ (5)

Definition 6. $\textit{DLOF}_{fi}(p)$, the dimension-wise local outlier factor of point $p$ in the feature dimension $f_{i}$. It is defined as the average of the ratios between the local reachability densities of $p$’s K-nearest neighbors and the local reachability density of $p$. It is computed as follows:

$\displaystyle\textit{DLOF}_{fi}(p)=\frac{\sum_{o\in N_{K}(p)}{\frac{\textit{lrd}_{fi}(o)}{\textit{lrd}_{fi}(p)}}}{|N_{K}(p)|}$ (6)

where $\textit{lrd}_{fi}(o)$ is the local reachability density of point $o$, defined and found in a similar way to that of point $p$. The DLOF captures the outlierness degree of point $p$ in the feature dimension $f_{i}$; thus, it quantifies the relevance of this feature for the distinction of point $p$ as an anomalous observation. Similarly to the original LOF algorithm, a DLOF value below the baseline of approximately 1 indicates that the point is an inlier in the considered feature dimension; otherwise, it is an outlier. This characteristic addresses the problem pointed out in Subsection 1.3; indeed, the baseline value of approximately 1 allows distinguishing the relevant features, and for the outlier detection and characterization, the problem of comparability of the DLOFs obtained from the different subspaces is not encountered.
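
Definitions 1 to 6 can be put together in a short pure-Python sketch, given here as one possible reading rather than as the authors' implementation. Several choices are assumptions: the function name `dlof`, the brute-force Euclidean search for $N_{C}(p)$, rounding $C$ up from the fraction `cp`, taking $N_{K}(p)$ as the points of $N_{C}(p)$ within $\textit{Kdistance}_{fi}(p)$ in dimension $f_{i}$, and the small constant `eps` that guards against a zero average reachability distance when a feature holds repeated values.

```python
import math


def dlof(data, f, cp, k, eps=1e-12):
    """Return DLOF_f(p) for every point p of `data` in feature dimension f.

    `cp` is the fraction of the n points used as the full-space neighborhood
    size C (assumption: rounded up); `k` is the parameter K with K < C.
    """
    n = len(data)
    c = math.ceil(cp * n)
    if not k < c <= n - 1:
        raise ValueError("need K < C <= n - 1")

    n_k, k_dist = [], []
    for p in range(n):
        # Definition 1: N_C(p), the C nearest neighbors in the full feature space
        full = sorted((math.dist(data[p], data[q]), q) for q in range(n) if q != p)
        n_c = [q for _, q in full[:c]]
        # Definition 2: Kdistance in dimension f, searched inside N_C(p)
        ranked = sorted((abs(data[p][f] - data[q][f]), q) for q in n_c)
        kd = ranked[k - 1][0]
        k_dist.append(kd)
        # Definition 4 (assumed reading): neighbors within Kdistance in dimension f
        n_k.append([q for d, q in ranked if d <= kd])

    def lrd(p):
        # Definition 3 and Eq. (5): inverse of the average reachability distance
        total = sum(max(k_dist[o], abs(data[p][f] - data[o][f])) for o in n_k[p])
        return len(n_k[p]) / (total + eps)

    dens = [lrd(p) for p in range(n)]
    # Eq. (6): average ratio lrd(o) / lrd(p) over the neighbors o of p
    return [sum(dens[o] for o in n_k[p]) / (len(n_k[p]) * dens[p])
            for p in range(n)]


# Hypothetical toy data: 20 inliers and one point that deviates in feature 1 only.
toy = [[0.1 * i, 0.01 * ((7 * i) % 20)] for i in range(20)] + [[0.95, 5.0]]
scores_f0 = dlof(toy, f=0, cp=0.5, k=3)
scores_f1 = dlof(toy, f=1, cp=0.5, k=3)
```

In this toy example the last point obtains its largest score in feature 1, the dimension in which it deviates, which is the behavior the DLOF relies on to single out relevant features.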

Figure 2.
General functioning of the proposed approach.

5. Outlier detection and characterization

In this section, we detail how the presented approach works. Its general functioning is illustrated in Fig. 2. After the data partitioning, if necessary, the first step consists in computing the Dimension-wise LOFs of all data points. Subsequently, these DLOFs are used to find the outliers and to characterize them. The details of these different steps are presented in the following.

5.1 DLOF computation

In order to tackle the problem of high-dimensional data, we implement the data partitioning that has been discussed in Subsection 4.1 by following the instructions in Algorithm 5.1. The DS dataset is divided into vertical partitions by dividing the feature set F into subsets FS, where each subset FS is of size $s$ . Then, the DLOF at each feature dimension $f$ , i.e., at each feature dimension of each vertical partition DP, is computed for all points belonging to DS. The resulting DLOFs are saved in the matrix LOD, where each entry $i$ of LOD is a $1\times d$ vector of the DLOFs of the $i$ th point. Algorithm 5.1 presents the case where the vertical data partitioning is necessary; however, if the data dimensionality $d$ is less than $s$ , then the data partitioning is not performed, and the argument DP of the DLOF function is set to DS.

Algorithm: DLOFs Computation
Input:
  DS: $n\times d$ tabular dataset
  F: Feature set of size $d$
  $s$ : Size of feature subset
  $C p, K$ : DLOF parameters
Output:
  LOD: $n\times d$ table of DLOFs of $p\in DS$ in $f\in F$ (Initially $\textbf{LOD}=\emptyset$ )
Begin
  $\textit{subNbr}\leftarrow d/s$ ; number of data partitions
  Divide F into subNbr subsets FS of size $s$ ; if ( $d\mbox{ modulo }s$ ) is not equal to 0, then the last subset will be of size $d\mbox{ modulo }s$
  For $i\leftarrow 1,\textit{subNbr}$
    Generate the data partition DP of DS in the feature subset $\textbf{FS}_{i}$
    For $j\leftarrow 1,\textit{sizeof}(\textbf{FS}_{i})$
      $\textbf{LOD}_{1..n,j}$ = DLOF(DP, $j$ , $C p, K$ ) in Algorithm 5.1
      $\textbf{LOD}=$ { $\textbf{LOD},\textbf{LOD}_{1..n,j}$ }
    EndFor
  EndFor
End
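The vertical partitioning step can be sketched in a few lines of Python. This is only an illustrative sketch under the assumption that features are identified by their column indices; the helper name partition_features is hypothetical.

```python
def partition_features(d, s):
    """Split feature indices 0..d-1 into consecutive subsets of size s.

    If d is not a multiple of s, the last subset holds the d modulo s
    remaining features; if d <= s, a single subset (no partitioning) results.
    """
    features = list(range(d))
    return [features[start:start + s] for start in range(0, d, s)]


# Example: 25 features with s = 10 give subsets of sizes 10, 10 and 5.
subsets = partition_features(25, 10)
sizes = [len(fs) for fs in subsets]
```

Each returned subset would then define one data partition DP on which the DLOF of every contained feature dimension is computed.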

Algorithm 5.1 presents the details of the DLOF computation. As can be seen, the C-nearest neighbors and the K-nearest neighbors are first computed for each data point $p\in\textbf{DP}$ ; then, the DLOF is computed based on Eqs (4)–(6).

Algorithm: Dimension-wise Local Outlier Factor (DLOF)
Input:
  DP: $n\times m$ tabular dataset
  $f$ : Feature index
  $C p$ : Percentage of nearest neighbors in the full feature set
  $K$ : Number of nearest neighbors for local density comparison
Output:
  LODF: $n\times 1$ table of DLOFs of points in DP in feature dimension $f$
Begin
  $C=Cp\times n$
  For $i\leftarrow 1,n$
    Find the C-NNs of point $p_{i}$ , $N_{C}(p_{i})$
    Find the K-NNs of point $p_{i}$ , $N_{K}(p_{i})$
    Assign to point $p_{i}$ its Kdistance, $\textit{Kdistance}_{f}(p_{i})$ (Definition 2)
  EndFor
  For $i\leftarrow 1,n$
    Compute the reachability distance of point $p_{i}$ based on Eq. (4)
    Compute the reachability density of point $p_{i}$ based on Eq. (5)
    Compute the DLOF of point $p_{i}$ based on Eq. (6)
    $\textbf{LODF}(i)=$ DLOF
  EndFor
End

5.2 Outlier detection

Once the DLOFs are computed for all data points $p\in\textbf{DS}$ , we obtain the matrix LOD where $\textit{LOD}_{ij}(i=1..n,j=1..d)$ denotes the deviation of the $i$ th point in the $j$ th attribute. We use these results in an outlier ensemble-based detection strategy. For each data point $p$ , the decision on its outlierness is made with a combination of its deviations in the different attributes. In the literature, different combination functions have been explored, for instance, combination functions based on the maximum, the average [3], the voting [18], etc. In our case, the DLOF is obtained from subspaces of the same size, i.e., one-dimensional subspaces. In addition, the DLOF has an appropriate baseline value of around 1; therefore, the ensemble results are not biased towards any subspace. Thus, and without loss of generality, we use the average function to combine the results. Although good results can also be obtained with the voting function, its final results are binary, so that, unlike with the average function, ranking the outliers is not possible. The final degree of outlierness $\textit{LOD}_{\textit{final}}$ of the $i$ th data point with the average function is defined as follows:

$\displaystyle\textit{LOD}_{\textit{final}}(i)=\frac{\sum_{j=1}^{d}\textit{LOD}_{ij}}{d}$ (7)

The results in $\textit{LOD}_{\textit{final}}$ allow the outliers to be ranked, i.e., the higher the $\textit{LOD}_{\textit{final}}(i)$ , the higher the probability that point $i$ is an outlier. When binary results are needed, the final outlierness degree $\textit{LOD}_{\textit{final}}(i)$ of each data point is compared to a predefined threshold $t$ . The point is considered anomalous if $\textit{LOD}_{\textit{final}}(i)$ is larger than $t$ ; otherwise, it is considered a normal object, as shown in the following equation.

$\displaystyle\textit{Label}(i)=\left\{\begin{array}{ll}1&\text{if }\textit{LOD}_{\textit{final}}(i)>t\\ 0&\text{otherwise}\end{array}\right.$ (8)
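Eqs (7) and (8) translate directly into code. The sketch below assumes that the LOD matrix is available as a list of rows (one row of $d$ DLOFs per point); the function names and the toy matrix are illustrative only.

```python
def final_scores(lod):
    """Eq. (7): average the d dimension-wise DLOFs of each point."""
    return [sum(row) / len(row) for row in lod]


def label_outliers(scores, t):
    """Eq. (8): label 1 (outlier) if the final score exceeds t, else 0."""
    return [1 if score > t else 0 for score in scores]


# Toy LOD matrix: the second point deviates strongly in its second attribute.
lod = [
    [1.0, 1.0, 1.0],
    [1.0, 4.0, 1.0],
]
scores = final_scores(lod)
labels = label_outliers(scores, t=1.5)
```

Sorting the points by decreasing final score gives the outlier ranking, while the thresholded labels give the binary decision.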

5.3 Outlier characterization

In order to characterize the detected anomalies, the relevant subspace of each anomaly is discovered. In addition, the detected anomalies are clustered in order to find the same-type anomalies in terms of characterizing attributes.

Algorithm: Outlier Characterization
Input:
  OLOD: $m\times d$ matrix of DLOFs of m outliers in each feature dimension
  $t h$ : threshold
  MinNbr: minimum number of features
  MaxNbr: maximum number of features
Output:
  characterizingFeatures: $m\times 1$ table of lists of relevant features
Begin
  For $i\leftarrow 1,m$
    $\textbf{characterizingFeatures}(i)=\textit{IndiceOf}(\textbf{OLOD}(i)>th)$
    $\textbf{Idx}=\textit{descendingOrder}(\textbf{OLOD}(i,1..j))$
    If (sizeof( $\textbf{characterizingFeatures}(i))<\textit{MinNbr}$ ) then
      $\textbf{characterizingFeatures}(i)=\textbf{Idx}(i,j=1..\textit{MinNbr})$
    ElseIf (sizeof( $\textbf{characterizingFeatures}(i))>\textit{MaxNbr}$ ) then
      $\textbf{characterizingFeatures}(i)=\textbf{characterizingFeatures}(i,j=1..\textit{MaxNbr})$
    EndIf
  EndFor
End

The same DLOFs that have been used for detection are used to find the relevant features of each anomaly. From Algorithm 5.3, we can see that, for each anomaly, the feature dimensions whose DLOF is greater than a predefined threshold are considered as the characterizing features. Additionally, the user can bound the number of characterizing features by specifying a minimum and a maximum number of features to retain.
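The characterization rule can be sketched for a single outlier as follows. The function name and default arguments are hypothetical. One point is an assumption of this sketch rather than something stated in the algorithm listing: when more than MaxNbr features exceed the threshold, the features with the highest DLOFs are the ones kept.

```python
def characterize(olod_row, th, min_nbr=None, max_nbr=None):
    """Return the indices of the characterizing features of one outlier.

    olod_row: the d dimension-wise DLOFs of the outlier.
    th: DLOF threshold above which a feature is considered relevant.
    """
    # Feature indices sorted by decreasing DLOF.
    ranked = sorted(range(len(olod_row)), key=lambda j: olod_row[j], reverse=True)
    selected = [j for j in ranked if olod_row[j] > th]
    if min_nbr is not None and len(selected) < min_nbr:
        # Too few features pass the threshold: fall back to the top MinNbr.
        selected = ranked[:min_nbr]
    elif max_nbr is not None and len(selected) > max_nbr:
        # Assumption of this sketch: keep the MaxNbr highest-scoring features.
        selected = selected[:max_nbr]
    return selected


features = characterize([0.9, 3.5, 1.1, 5.0], th=2.8)
```

With the toy row above and a threshold of 2.8, only the fourth and second features are returned, in decreasing order of DLOF.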

In addition to outlier characterization through reporting relevant subspaces, we also propose to report the anomalies that might be of the same type. Two outliers can be considered to be of the same type if they have the same relevant subspace. Finding the outliers of the same type can be achieved by comparing the obtained characterizing features or simply by clustering their DLOFs; in the latter case, the outliers in the same cluster are considered to be of the same type. For this purpose, any clustering algorithm can be used.
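The first option, comparing the characterizing features directly, can be sketched as a simple grouping by feature set. The function name and the outlier identifiers below are hypothetical; the clustering-based alternative is not covered by this sketch.

```python
from collections import defaultdict


def group_same_type(characterizing_features):
    """Group outlier ids that share exactly the same relevant subspace.

    characterizing_features maps an outlier id to its list of relevant
    feature indices; the order of the features within a list is ignored.
    """
    groups = defaultdict(list)
    for outlier_id, feats in characterizing_features.items():
        groups[frozenset(feats)].append(outlier_id)
    return dict(groups)


groups = group_same_type({10: [3, 1], 11: [1, 3], 12: [2]})
```

Here outliers 10 and 11 share the subspace {1, 3} and are reported as being of the same type, while outlier 12 forms its own group.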

6. Experimental evaluation

In order to evaluate the proposed approach in terms of detection and characterization, a set of experiments on synthetic and real-world data has been conducted. In this section, the details of the experiments and the comparison results are presented.

6.1 Datasets and evaluation metrics

6.1.1 Datasets

The synthetic datasets introduced in [10] have been used for evaluating the detection and characterization performance of our solution. The datasets have 10, 20, 30, 40, 50, 75 or 100 dimensions. For each dimensionality, three different datasets are available, each having 1000 points including 19 to 136 anomalous points. In these datasets, 2–5 dimensional subspaces are randomly selected out of the full data space in order to generate high-density clusters. Subsequently, some objects are picked in the subspaces, modified and considered as subspace outliers. As such, detecting and characterizing these anomalies are difficult tasks.

In order to evaluate the performance of our approach on real-world data, we retained datasets with different dimensionalities and sizes that have been explored in several works. The data do not include missing values or noise; as such, the pre-processing step was limited to feature scaling by using the z-score standardization. The datasets are summarized in Table 2 and explained in the following.

• Wine¹: Results of a chemical analysis of wines grown in the same region but derived from three different cultivars. The dataset has 13 features that describe each of the three types of wine. Classes 2 and 3 are used as inliers and class 1 is downsampled to 10 instances to be used as outliers.
• Glass¹: From a study of the different types of glass that was motivated by criminological investigation. This dataset contains 9 attributes regarding several glass classes. Points of all classes are marked as inliers except class 6, which is a clear minority class.
• Arrhythmia¹: Related to a study which aims to distinguish between the presence and absence of cardiac arrhythmia. The dataset is described with 274 attributes and it presents 16 classes. The smallest classes, i.e., 3, 4, 5, 7, 8, 9, 14, 15, are combined to form the outliers and the rest of the classes are combined to form the inliers.
• Vowels¹: A classification dataset of 12 features used to classify the speakers who uttered two Japanese vowels. For outlier detection, the class of speaker 1 is downsampled to 50 outliers. The inliers are the points contained in classes 6, 7 and 8, while the other classes are discarded.
• Musk¹: The dataset of 166 features contains several musk and non-musk classes. The non-musk classes j146, j147, and 252 are combined to form the inliers, while the musk classes 213 and 211 are added as outliers without downsampling. Other classes are discarded.
• Annthyroid¹: Related to hypothyroid patients, it has 21 features, of which the 6 real-valued ones were retained. There are three classes: normal (not hypothyroid), hyperfunction and subnormal functioning. The hyperfunction and subnormal classes are used as outliers and the normal class as inliers.
• Mammography¹: The dataset has 11,183 samples, including 260 calcifications, described with 6 features. The minority class of calcifications is considered as outliers and the non-calcification class as inliers.
• Optdigits¹: Related to optical recognition of handwritten digits, the dataset has 64 features. The instances of digit 0 are downsampled to 150 and marked as outliers, while the instances of digits 1–9 are considered as inliers.
• Ionosphere²: This dataset differentiates good and bad radars. It has 32 numeric attributes. The data points in class b are reduced to 5 and are considered as outliers, and those in class g as inliers.
• Page Blocks²: The dataset contains 10 features and is related to information about different types of blocks in document pages. The blocks can be text, pictures or graphics. If the block content is text, it is labeled as inlier; otherwise, it is labeled as outlier.
• WBC³: The Wisconsin Breast Cancer (Diagnostic) dataset of dimensionality 30 records the measurements for breast cancer cases. There are two classes: the malignant class is considered as outliers, while points in the benign class are considered as inliers. This dataset has been explored in [30], and in our case, it is used for comparison in terms of characterization.

¹ http://odds.cs.stonybrook.edu/.
² https://elki-project.github.io/datasets/outlier.
³ https://archive.ics.uci.edu/ml/datasets/.

Table 2
Summary of the real-world datasets

Dataset	Points	Attributes	Outliers (%)
Wine	129	13	7.7
Glass	214	9	4.2
Arrhythmia	452	274	14.6
Vowels	1456	12	3.4
Musk	3062	166	3.2
Annthyroid	7200	6	7.4
Mammography	11183	6	2.3
Optdigits	5216	64	2.9
Ionosphere	205	32	2.4
PageBlocks	4982	10	2
WBC	569	30	37.3

6.1.2 Evaluation metrics

In order to illustrate the performance comparison in terms of detection efficacy, the area under the curve (AUC) of the Receiver Operating Characteristics (ROC) curves is used. The ROC curve depicts the true positive rate (TPR) as a function of the false positive rate (FPR) that are defined as follows:

$\displaystyle\textit{TPR}=\frac{TP}{FN+TP}$ (9)

$\displaystyle\textit{FPR}=\frac{FP}{FP+TN}$ (10)

False positives (FP) are data objects annotated as anomalies but they are normal observations in reality, while true positives (TP) are anomalous observations which are labeled as anomalous. Normal observations annotated as normal represent true negatives (TN), whereas the observations that are anomalies but annotated as normal represent false negatives (FN).
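For concreteness, the rates of Eqs (9) and (10) and the resulting AUC can be computed as in the following sketch. The AUC is obtained here through its standard rank interpretation, i.e., the probability that a randomly chosen outlier receives a higher score than a randomly chosen inlier, with ties counted as one half. The function names are illustrative, and both classes are assumed to be present.

```python
def tpr_fpr(labels, predictions):
    """Eqs (9)-(10): true and false positive rates of binary predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp / (fn + tp), fp / (fp + tn)


def roc_auc(labels, scores):
    """Area under the ROC curve from outlier scores (rank formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rates = tpr_fpr([1, 0, 0, 1], [1, 0, 1, 0])
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the second toy example, three of the four outlier-inlier pairs are ordered correctly, which gives an AUC of 0.75.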

The comparison of the characterization performance on synthetic data has been achieved by using the precision and the Jaccard index measures, which are defined as follows:

$\displaystyle\textit{Precision}(T,P)\triangleq\frac{|T\cap P|}{|P|}$ (11)

$\displaystyle\textit{Jaccard}(T,P)\triangleq\frac{|T\cap P|}{|T\cup P|}$ (12)

$P$ and $T$ denote the discovered characterizing features and the real characterizing features, respectively. The Jaccard index measures the similarity between the set of real characterizing features and the set of discovered features; it is obtained by dividing the size of the intersection by the size of the union of the two sets. The precision represents the fraction of correctly discovered features among all discovered features. These metrics are measured for each outlier.
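Both measures of Eqs (11) and (12) reduce to set operations, as the short sketch below shows. The function names are illustrative, and returning 0 for an empty set of discovered features is an assumption of this sketch.

```python
def precision(true_feats, found_feats):
    """Eq. (11): fraction of discovered features that are truly relevant."""
    t, p = set(true_feats), set(found_feats)
    return len(t & p) / len(p) if p else 0.0


def jaccard(true_feats, found_feats):
    """Eq. (12): size of the intersection over size of the union."""
    t, p = set(true_feats), set(found_feats)
    return len(t & p) / len(t | p) if (t | p) else 0.0


# Real subspace {1, 2, 3}; discovered subspace {2, 3, 4, 5}.
prec = precision([1, 2, 3], [2, 3, 4, 5])
jac = jaccard([1, 2, 3], [2, 3, 4, 5])
```

Two of the four discovered features are correct (precision 0.5), and the two sets share two of five distinct features (Jaccard index 0.4).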

Additionally, in order to compare the performance of the detection algorithms on multiple datasets, we assess whether the differences between these algorithms are statistically significant. To this end, we use the Friedman test with the Nemenyi post-hoc test [15]. The Friedman test is a non-parametric equivalent of the repeated-measures ANOVA. It ranks the detection algorithms for each dataset separately, and in case of ties, the average ranks are assigned. Let $r_{i}^{j}$ be the rank of the $j$ -th of $k$ detection algorithms on the $i$ -th of $N$ datasets. The Friedman test compares the average ranks of the algorithms, $R_{j}=\frac{1}{N}\sum_{i}{r_{i}^{j}}$ . Under the null hypothesis, all the algorithms are equivalent and so their average ranks $R_{j}$ should be equal. The Friedman statistic $X_{F}^{2}$ , given by Eq. (13), is then approximately chi-squared distributed with $k-1$ degrees of freedom.

$\displaystyle X_{F}^{2}=\frac{12N}{k(k+1)}\left[\sum_{j}{R_{j}^{2}}-\frac{k(k+1)^{2}}{4}\right]$ (13)

The Nemenyi test is a post-hoc test used when the null-hypothesis is rejected with the Friedman test. It is used to compare the detection algorithms to each other. The performance of two algorithms is significantly different if the corresponding average ranks differ by at least the critical difference defined in Eq. (14), where critical values $q_{\alpha}$ are based on the Studentized range statistic divided by $\sqrt{2}$ [15].

$\displaystyle CD=q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}$ (14)

For both the Friedman and the Nemenyi tests, the $p$ -values are computed and used for the significance assessment. If a $p$ -value is less than a predefined significance level $\alpha$ , commonly set to 0.05, the null hypothesis is rejected; in our case, this means that the difference between the detection results of the compared algorithms is considered statistically significant.
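The quantities of Eqs (13) and (14) can be computed as in the following sketch, which assumes that higher scores (e.g., AUC values) are better, so that rank 1 goes to the best algorithm on each dataset. The function names are hypothetical, and the critical value $q_{\alpha}$ is passed in by the caller rather than looked up.

```python
import math


def average_ranks(scores_per_dataset):
    """Average rank R_j of each of k algorithms over N datasets.

    scores_per_dataset: N rows of k scores (higher is better). Rank 1 is the
    best algorithm on a dataset; tied algorithms share the average rank.
    """
    k = len(scores_per_dataset[0])
    totals = [0.0] * k
    for row in scores_per_dataset:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared_rank = (pos + end) / 2 + 1  # positions are 0-based
            for j in order[pos:end + 1]:
                totals[j] += shared_rank
            pos = end + 1
    return [total / len(scores_per_dataset) for total in totals]


def friedman_statistic(avg_ranks, n_datasets):
    """Eq. (13) evaluated on the average ranks R_j."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


def critical_difference(q_alpha, k, n_datasets):
    """Eq. (14): minimum average-rank gap for a significant difference."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))


ranks = average_ranks([[0.9, 0.8, 0.7], [0.95, 0.85, 0.6]])
stat = friedman_statistic(ranks, n_datasets=2)
```

Two algorithms would then be declared significantly different when their average ranks differ by at least the value returned by critical_difference.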

6.2 Impact of the parameters

In order to understand the effect of the different parameters on the proposed approach, we conduct a set of experiments on the synthetic datasets, using the first synthetic dataset of each dimensionality (i.e., $d=$ 10, $d=$ 20, etc.). Results of the impact of each parameter on the AUC metric are presented in Fig. 3.

The effect of the parameter $s$ , presented in the first plot from the top, is relatively clear: the larger the value of $s$ , the lower the AUC. Indeed, a large value of $s$ indicates a high dimensionality; as such, the distances between the data points become less discriminative and finding meaningful nearest neighbors becomes more difficult.

The parameter $K$ determines the number of nearest neighbors used for comparing the local density of the considered data point. The suitable value might therefore change from one dataset to another and from one point to another; thus, the choice of the value of $K$ is not easy. However, similarly to the original LOF algorithm, a value between 10 and 50 yields acceptable results. In the case of the synthetic data, as can be seen from the middle plot, good results were obtained with $K$ values between 15 and 35.

The impact of the parameter $C p$ depends on how representative the obtained $C$ nearest neighbors are of the natural cluster to which the considered point belongs. From the bottom plot of Fig. 3, we can see that for small values of $C p$ , the AUC is low because the sets of C-nearest neighbors in which the DLOF is computed are not large enough, and thus not representative enough. This parameter is proportional to the dataset size, i.e., the larger the size of the data, the higher the number of C-nearest neighbors. Thus, a value from 0.2 to 0.3 is adequate to capture a set of neighbors that is large enough, provided that the size of this set is larger than $K$ .

Figure 3.

Impact of the parameters on the detection performance in different d-dimensional datasets.

Regarding the characterization, the key parameter is the threshold $t h$ . It should be set to a value which differentiates the outliers from the inliers so as to assess the relevance of the considered feature. A value slightly larger than 1 and up to 3 can achieve this distinction. The two other parameters are optional; they can be set in order to decide on the minimum and maximum number of characterizing features to return.

6.3 Detection performance

Our proposed solution is compared to six algorithms, namely, the well-established LOF [22], HiCS [10], SOD [13], GLOSS [5]⁴, MO-GAAL [32]⁵ and PFSOE [18]. The Environment for deveLoping KDD-applications (ELKI) platform⁶ was used to perform the experiments with the LOF, HiCS and SOD algorithms. The different algorithms have been applied with the following parameters. For our outlier ensemble, the DLOF parameters $C p$ and $K$ were set to 0.3 and 35, respectively, while the vertical data partition size $s$ was equal to 10. LOF was applied with $k=35$ . For SOD, we used alpha $=1.1$ , the k for searching the nearest neighbors was set to 100 and the shared nearest neighbors was equal to 35. This parameterization was the same for both the synthetic and the real-world data. However, for HiCS, the same results for the synthetic data presented in [10] were retained for comparison, while for real-world data, the parameter alpha was equal to 0.1, the number of Monte-Carlo iterations was set to 50, the candidate cutoff was set to 100 and the LOF parameter $k$ was equal to 35. The MO-GAAL algorithm was used with the number of generators $k$ equal to 10, the stop epochs was set to 25 while the other parameters' default values were used. For GLOSS, the default value of 20 was used for the parameter $k$ . The PFSOE was applied with a number of iterations equal to 49, a threshold set to 1 and a majority bound equal to 0.9.

⁴ https://github.com/Basvanstein/Gloss.
⁵ https://github.com/leibinghe/GAAL-based-outlier-detection.
⁶ https://elki-project.github.io/.

1) Synthetic data

Results of the comparison of the proposed approach with the other algorithms on synthetic data are presented in Fig. 4. The figure depicts the average AUC of each algorithm over the three datasets of each dimensionality. As can be seen, our DLOF-based outlier ensemble outperforms the other approaches in almost all dimensionalities. The AUC obtained with LOF, PFSOE and SOD is low, even though SOD is designed to detect outliers in subspaces; LOF does not explore the subspaces, while PFSOE explores only randomly generated subspaces. The MO-GAAL performance is also low, which may be due to the fact that the generated artificial outliers are not sufficiently diverse. The GLOSS algorithm outperforms our solution on both the 10- and 20-dimensional data; however, its performance diminishes on the higher-dimensional data. The HiCS results are slightly better than those of the remaining algorithms; however, they are not sufficiently high. We can explain these results by the fact that our solution estimates the outlierness degree within the cluster to which the considered point belongs, i.e., its local density is compared to the local density of points that have the same generating mechanism. Furthermore, the final score of each data point is obtained by averaging the scores from all one-dimensional subspaces and not from subspaces with different dimensionalities. Indeed, if a subspace with a certain dimensionality is retained, as is the case in the other algorithms, but it is incomplete or too large, then the degree of outlierness will be erroneous.

Table 3

AUC comparison on real-world data

Dataset	DLOF ensemble	LOF	HiCS	SOD	Mo–GAAL	Gloss	OE
Wine	93.61	92.18	63.19	78.82	95.04	42.26	81.97
Glass	83.14	86.07	78.43	70.14	64.04	76.85	74.88
Arrhythmia	80.81	76.85	66.17	77.56	39.79	67.39	59.24
Vowels	93.78	93.38	88.73	89.98	75.14	89.18	76.85
Musk	99.35	58.53	98.12	98.62	64.84	33.1	44.86
Annthyroid	78.6	73.03	75.89	80.75	38.5	75.95	71.67
Mammography	83.83	73.22	76.72	81.42	31.29	74.04	66.22
Optdigits	59.89	48.44	48.36	56.1	59.63	56.74	49.43
Ionosphere	96.7	96.2	90.30	87.80	21.4	87.99	73.5
PageBlocks	93.26	92.61	91.75	93.03	35.1	88.18	78.44

Table 4

$P$ -values of the pairwise comparison of the detection algorithms using the Nemenyi test

Algorithm	DLOF ensemble	LOF	HiCS	SOD	Mo–GAAL	Gloss	OE	Avg
DLOF ensemble	1	0.255	0.031	0.438	0.001	0.016	0.001	0.249
LOF	–	1	0.90	0.90	0.309	0.90	0.309	0.653
HiCS	–	–	1	0.90	0.808	0.90	0.808	0.764
SOD	–	–	–	1	0.165	0.808	0.165	0.625
Mo–GAAL	–	–	–	–	1	0.90	0.90	0.583
Gloss	–	–	–	–	–	1	0.90	0.775
OE	–	–	–	–	–	–	1	0.583

Figure 4.

AUC comparison on synthetic data.

2) Real-world data

The real-world datasets presented previously have been used for the comparison of the proposed approach with the other algorithms. The obtained AUC values are presented in Table 3. As can be seen, competitive results have been obtained with our approach, including on high-dimensional datasets such as Arrhythmia, Musk and Ionosphere. High-dimensional datasets allow one to validate the necessity of mining for outliers in data subspaces. For example, the AUC obtained with LOF for the Musk dataset is low, while good results have been obtained with the subspace-based algorithms, including ours. Some of the datasets contain several classes and are labeled for classification purposes, and the anomalies are obtained by subsampling one or more classes, e.g., in Wine, Glass and Optdigits. Therefore, these anomalies can be semantically meaningless; they can also form small clusters and become difficult to distinguish. As such, some of the resulting AUC values are not high. Satisfactory results are also obtained with our approach on large datasets (e.g., Annthyroid and Mammography). In order to assess the statistical significance, the Friedman and Nemenyi tests were performed. The statistic obtained with the Friedman test is approximately equal to 29.23, while the $p$ -value is equal to $5.51\times 10^{-5}$ . Since the $p$ -value is less than 0.05, the Nemenyi test was carried out; the $p$ -values obtained from the pairwise comparison of the detection algorithms are presented in Table 4. Out of the 6 $p$ -values obtained from the comparison of the DLOF Ensemble to the remaining algorithms, 4 are below 0.05 and thus statistically significant. In addition, the DLOF Ensemble has the lowest average $p$ -value. As such, we can conclude that the differences between our solution and most of the compared algorithms are statistically significant.

Figure 5.

Characterization comparison on synthetic data.

6.4 Characterization performance

In order to evaluate the characterization performance of our approach, in this subsection we present the experiments that have been conducted on real-world and synthetic data. In both cases, the threshold $t h$ of the characterization function was set to 2.8. The minimum number of features to return is 1 for the synthetic data and for the real-world data, while the maximum number was set to 3 only for the real-world data.

Table 5
Same-type outliers finding through DLOFs clustering

Outlier type	Dim 10	Neuron	Dim 20	Neuron	Dim 30	Neuron
1	184,208,478,511	1	44,158,289,666,879	5	267,728,929, 995	4
2	221,246,578,655,724	2	71,384,526,937	4	69,460,568,753,869	8
			336	1
4	755,766,782	3	56,330,707	1	33,207,434,941	3
	185	2	48,748	4
5	316	1	–	–	283	3
8	173,324,705,825,976	4	105,791,796,960	4	636,745,956	6
			865	1	177,482	5
16	–	–	87,452,706,874,943	3	70,146,266,847	7
					151	4
32	–	–	–	–	332,398,811	5
					239,878	4
64	–	–	–	–	10,530,784	6
64	–	–	–	–	214,425	7
128	–	–	–	–	57,136,306,828,949	1
256	–	–	–	–	130,480,528	5
256					137	6
256					383	7

Table 6

Characterizing features of the anomalies in the WBC dataset

Anomaly ID	Our characterization	Characterization in [30]
865423	Standard-Smoothness, Standard-Area, Standard-Texture	Mean-Radius, Worst-Smoothness
911296202	Standard-Area, Standard-Perimeter, Standard-Radius	Standard-Smoothness, Mean-Radius, Standard-Concavity
873592	Mean-Area	Worst-Perimeter, Worst-Concave Points
8810703	Standard-Area, Standard-Perimeter, Standard-Symmetry	Worst-Perimeter, Mean-Concavity

Figure 6.

The obtained U-matrices from the outlier clustering with the Self Organizing Maps.

Figure 7.

Visual comparison of the characterization results of the anomalies in the bcwd dataset. Subplots (a), (c), (e) and (g) have been obtained with our approach; Subplots (b), (d), (f) and (h) have been obtained in [30].

Figure 8.

Heat maps of the DLOFs obtained for the first synthetic datasets of dimension 10, 20, 30, 40, 50 and 75.

1) Synthetic data

In order to assess and compare the performance of the approach, and because the ground truth concerning the relevant subspaces of the anomalies is available in these datasets, the precision and the Jaccard index metrics have been used. Our results are compared to those obtained with SHAP [26]⁷ and Explainer [21]⁸. They are also compared to those of the OARank+search and the svmFS approaches introduced in [23, 4], respectively. SHAP has been applied with a number of estimators equal to 100 and a maximum depth of 6. The Explainer algorithm has been applied with a training set size equal to 35 and a number of trees equal to 50. The characterization function was applied only to the real outliers, i.e., not to the outliers that have been detected with our method. The average precision and Jaccard index over all the outliers obtained with the different approaches in each dataset are shown in Fig. 5. As can be seen, our approach outperforms almost all the approaches in terms of both precision and Jaccard index. This can be due to the fact that with the OARank+search and the svmFS approaches, the characterization is transformed into a feature selection problem; as such, the discovered characterizations describe the whole class and not only the considered outlier. Further, with the Explainer and the SHAP algorithms, too many features were returned for each outlier. Moreover, these methods rely on a learning phase, and as the number of outliers in the datasets is relatively small, their performance could be affected.

⁷ https://github.com/slundberg/shap.
⁸ https://github.com/koppmart/Explainer.

In addition to outlier characterization through reporting relevant subspaces, we also present the anomalies that might be of the same type. In the synthetic datasets, the outliers are labeled depending on the feature subspace in which they have been generated, i.e., two outliers belonging to the same subspace have the same label. We refer to these labels as types. In order to obtain the same-type outliers, we cluster the DLOFs with Self Organizing Maps (SOM). Two outliers belonging to the same neuron are considered as being of the same type; indeed, the neuron's centroid reflects the relevant attributes of those outliers. Table 5 and Fig. 6 present the results of clustering the DLOFs of the outliers in the first datasets with dimensions 10, 20 and 30. As can be noticed, in each dataset all or some of the same-type anomalies could be grouped in the same neurons. For example, the outliers of type 2 in the 30-dimensional dataset, {69,460,568,753,869}, have the same relevant subspace and they have all been assigned to the same neuron 8. The advantage of using the SOM lies in the fact that the neurons of the obtained map have a neighborhood relationship which helps in preserving the topological properties of the input data; thus, misclassified outliers are generally assigned to neighboring neurons. Furthermore, the resulting map can be updated and reused in the future in order to detect the same anomalies. Indeed, the discovered properties of the outliers will help the analyst, on the one hand, to understand the reason behind the abnormal behavior of the data observation, and on the other hand, to reuse these characterizations to build better detection strategies.

2) Real-world data

In order to compare the characterization performance of our method on real-world data, we use the Wisconsin Breast Cancer Diagnostic (WBC) dataset, which has been used in [30]. The sets of characterizing features obtained with our approach, alongside those obtained in [30], are presented in Table 6. In order to compare the quality of these subspaces, we have conducted a visual comparison. Figure 7 shows the plots of the dataset points in the different discovered subspaces for each anomaly. These plots allow one to see in which data projection the considered anomaly separates well from the remaining data. In the same figure, Subplots (a), (c), (e) and (g) present the subspaces obtained with our approach for the anomalies 865423, 911296202, 873592 and 8810703, respectively, while Subplots (b), (d), (f) and (h) present the subspaces obtained in [30]. As can be seen from these results, in the data projections obtained with the discovered subspaces of our approach, there is a clear separation between the considered anomalies and the rest of the data.

6.5 Visualization based detection and characterization

The detection with our approach can also be achieved through visualization tools. Figure 8 presents the heat maps generated for the DLOFs obtained from the first synthetic dataset of each of the dimensions 10, 20, 30, 40, 50 and 75; as can be seen, the different outliers are visible in red. The rows indicate the index of the anomalies in the dataset, while the column position helps in identifying the relevant features for each anomaly. This can be especially useful for a small dataset with a small number of anomalies.

7. Conclusion and discussion

Both contextual outlier detection and outlier characterization are challenging tasks of great importance. Solutions have been proposed in the literature for contextual outlier detection or for outlier characterization; however, few works have tackled both issues at the same time. Thus, in this article, we have presented our feature relevance scoring function DLOF. This function has been used to simultaneously detect and characterize contextual outliers. With our solution, several challenges have been addressed. Indeed, the appropriate neighborhood has been retained to compute the outlierness degree, and the vertical data partitioning has been adopted in order to obtain a function that is scalable w.r.t. data dimensionality. Furthermore, the DLOF is computed for each feature dimension individually, which avoids exploring the different feature combinations and the problem of subspace comparability. An ensemble approach with an averaging combination function has been adopted for outlier detection; thus, both binary labels and outlier ranking are made possible. In addition, although the DLOF is a modified LOF, it retains the advantage of an easily chosen threshold. Moreover, the presented solution has been evaluated on synthetic and real-world data in terms of both detection and characterization efficacy. From the experiments, we could see the good performance of our approach compared to existing solutions. Nonetheless, further improvements are possible. For instance, in the detection part, other combination strategies can be explored in order to obtain the final results. Furthermore, for some applications, the severity of an anomaly can be quantified with the obtained DLOF, which enables richer interpretations. The discovered characteristics can also be reused in order to reinforce detection methods, by targeting the same types of outliers with their characterizations.

Acknowledgments

This work was funded by the National Natural Science Foundation of China, Grant number 61773206.

References

1. Lazarevic and Kumar, Feature bagging for outlier detection, in: Proceeding of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining – KDD, ACM Press, 2005.

2. Zimek, Schubert and Kriegel H.-P., A survey on unsupervised outlier detection in high-dimensional numerical data, Statistical Analysis and Data Mining 5(5) (2012), 363–387.

3. Zimek, Campello R.J.G.B. and Sander, Ensembles for unsupervised outlier detection, ACM SIGKDD Explorations Newsletter 15(1) (2014), 11–22.

4. Micenkova, Ng R.T., Dang X.-H. and Assent, Explaining Outliers by Subspace Separability, in: 2013 IEEE 13th International Conference on Data Mining, IEEE, 2013.

5. Van Stein, Van Leeuwen and Bäck, Local subspace-based outlier detection using global neighbourhoods, in: 2016 IEEE International Conference on Big Data (Big Data), IEEE, 2016, pp. 1136–1142.

6. Aggarwal C.C., Outlier ensembles: Position paper, ACM SIGKDD Explorations Newsletter 14(2) (2013), 49–58.

7. Aggarwal C.C. and Yu P.S., Outlier detection for high dimensional data, ACM SIGMOD Record 30(2) (2001), 37–46.

8. Hemalatha C.S., Vaidehi and Lakshmi, Minimal infrequent pattern based approach for mining outliers in data streams, Expert Systems with Applications 42(4) (2015), 1998–2012.

9. Muller, Schiffer and Seidl, Statistical selection of relevant subspace projections for outlier ranking, in: 2011 IEEE 27th International Conference on Data Engineering, IEEE, 2011.

10. Keller, Muller and Bohm, HiCS: High Contrast Subspaces for Density-Based Outlier Ranking, in: 2012 IEEE 28th International Conference on Data Engineering, IEEE, 2012.

11. Angiulli, Fassetti and Palopoli, Detecting outlying properties of exceptional objects, ACM Transactions on Database Systems 34(1) (2009), 1–62.

12. Tang, Pei, Bailey and Dong, Mining multidimensional contextual outliers from categorical relational data, Intelligent Data Analysis 19(5) (2015), 1171–1192.

13. Kriegel H.P., Kroger, Schubert and Zimek, Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data, in: Advances in Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, 2009, pp. 831–838.

14. Kriegel H.P., Kroger, Schubert and Zimek, Interpreting and unifying outlier scores, in: Proceedings of the 2011 SIAM International Conference on Data Mining, SIAM, 2011, pp. 13–24.

15. Demsar, Statistical comparisons of classifiers over multiple data sets, The Journal of Machine Learning Research 7 (2006), 1–30.

16. Zhang, Xun and Qin, A relevant subspace based contextual outlier mining algorithm, Knowledge-Based Systems 99 (2016), 1–9.

17. Zhang, Xun, Zhang and Qin, Scalable Mining of Contextual Outliers Using Relevant Subspace, IEEE Transactions on Systems, Man, and Cybernetics: Systems 50(3) (2020), 988–1002.

18. Boukela, Zhang and Zhou S.B.J., An outlier ensemble for unsupervised anomaly detection in honeypots data, Intelligent Data Analysis 24(4) (2020), 743–758.

19. Boukela, Zhang, Yacoub, Bouzefrane, Ahmadi S.B.B. and Jelodar, A modified LOF-based approach for outlier characterization in IoT, Annals of Telecommunications, 2020, 1–9.

20. Duan, Tang, Pei, Bailey, Campbell and Tang, Mining outlying aspects on numeric data, Data Mining and Knowledge Discovery 29(5) (2015), 1116–1151.

21. Kopp, Pevny and Holena, Anomaly explanation with random forests, Expert Systems with Applications 149 (2020), 113187.

22. Breunig M.M., Kriegel H.P., Ng R.T. and Sander, LOF, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data – SIGMOD, ACM Press, 2000.

23. Vinh N.X., Chan, Bailey, Leckie, Ramamohanarao and Pei, Scalable Outlying-Inlying Aspects Discovery via Feature Ranking, in: Advances in Knowledge Discovery and Data Mining, Springer International Publishing, 2015, pp. 422–434.

24. Vinh N.X., Chan, Romano, Bailey, Leckie, Ramamohanarao and Pei, Discovering outlying aspects in large datasets, Data Mining and Knowledge Discovery 30(6) (2016), 1520–1555.

25. Lundberg S.M. and Lee S.-I., A unified approach to interpreting model predictions, in: Advances in Neural Information Processing Systems, 2017, pp. 4765–4774.

26. Lundberg S.M., Erion G.G. and Lee S.-I., Consistent individualized feature attribution for tree ensembles, arXiv preprint arXiv:1802.03888, 2018.

27. Chandola, Banerjee and Kumar, Anomaly detection, ACM Computing Surveys 41(3) (2009), 1–58.

28. Song, Jermaine and Ranka, Conditional anomaly detection, IEEE Transactions on Knowledge and Data Engineering 19(5) (2007), 631–645.

29. Wang and Davidson, Discovering Contexts and Contextual Outliers Using Random Walks in Graphs, in: 2009 Ninth IEEE International Conference on Data Mining, IEEE, 2009.

30. Zhao, Zhang, Qin, Cai and Ma, Parallel mining of contextual outlier using sparse subspace, Expert Systems with Applications 126 (2019), 158–170.

31. Zhao, Zhang and Qin, LOMA: A local outlier mining algorithm based on attribute relevance analysis, Expert Systems with Applications 84 (2017), 272–280.

32. Liu, Zhou, Jiang, Sun, Wang and He, Generative adversarial active learning for unsupervised outlier detection, IEEE Transactions on Knowledge and Data Engineering, 2019.
