Abstract
Outlier detection has been widely explored and applied to different real-world problems. However, outlier characterization that consists in finding and understanding the outlying aspects of the anomalous observations is still challenging. In this paper, we present a new approach to simultaneously detect subspace outliers and characterize them. We introduce the Dimension-wise Local Outlier Factor (DLOF) function to quantify the degree of outlierness of the data points in each feature dimension. The obtained DLOFs are used in an outlier ensemble so as to detect and rank the anomalous points. Subsequently, the same DLOFs are analyzed in order to characterize the detected outliers with their relevant subspace and their same-type anomalies. Experiments on various datasets show the efficacy of our method. Indeed, we demonstrate through an experimental evaluation that the proposed approach is competitive compared to the existing solutions in terms of both detection and characterization accuracy.
Keywords
Introduction
Outlier detection is the process of discovering objects with unpredicted and rare patterns in the data. The unexpected objects are also called anomalies, exceptions, novelties, etc. Thus, the terms outlier detection and anomaly detection will be used interchangeably throughout this article. Anomaly detection has several application domains such as fraud detection in banking, intrusion detection in cyber security, fault detection in control systems, etc. Several methods for anomaly detection have been proposed, they can be categorized into statistical, classification-based, clustering-based, nearest neighbors-based, etc. [27]. However, lately, the research community has highlighted the need for interpreting and finding more information about the detected anomalies. Anomaly characterization consists mainly in discovering subspace(s) or subset(s) of features (a.k.a attributes, dimensions) for a query outlier where it shows a significant degree of outlierness. This task can be of great utility in different practical applications. For instance, in the field of intrusion detection, the attacks can be of different types and unseen anomalies occur daily. Therefore, new unsupervised methods are needed to detect but also to characterize them. Indeed, reporting to the analyst more information regarding the detected malicious activities is of major importance in order to distinguish between attacks and anomalies, to understand the root causes and to decide on better defense strategies. Similarly, in the field of the Internet of Things, fault detection is a critical task. Thus, it is of tremendous importance to know the reason behind the faulty behavior of a sensor, if it is a malfunction of the hardware, a battery drain problem, … etc.
In previous works, numerous approaches have been proposed for anomaly detection in relevant subspaces, and the subspaces are identified either before or during the detection process. These outliers are also called contextual or conditional outliers [28, 16, 17, 30, 29]. However, when contextual outliers are detected, their contextual information are not always reported to the analyst [13, 9, 10, 31, 5] or the characterization performance is not well evaluated [16, 17, 30]. On the other hand, in other works, some approaches have been proposed to address only the specific outlier characterization problem. Nonetheless, the existing characterization approaches lack scalability w.r.t. data dimension [23, 4]. In addition, the outliers are explained only with the subspaces. Furthermore, since the detection and the characterization processes are decoupled, this may result in extra time and effort needed by the user to carry out the work.
In this article, we propose a new approach to simultaneously detect and characterize outliers. The proposed approach addresses several issues related to the outlier detection in relevant subspaces and in high dimensional data. The challenges tackled are detailed in the following.
Curse of dimensionality
Finding outliers in high dimensional data is a difficult task due to the curse of dimensionality [2]. Indeed, most detection methods rely on distance as a similarity measure. However, as data dimensionality increases, the relative contrast between near and far neighbors diminishes. This reduces the usability of the distance measure, especially for the nearest neighbors-based techniques. Therefore, in our case, the issue related to the high dimensional data is addressed by adopting the vertical partitioning of the data. More precisely, when the data dimension exceeds a certain number of attributes, the dataset is divided into several vertical partitions, and later the outlier detection and characterization is carried out on each of these partitions.
Subspace outliers and feature relevance
In high dimensional data, as explained in [2], it is highly probable that the data can be separated into different clusters following different distributions; where each cluster is generated by a specific mechanism. Furthermore, these mechanisms are characterized by some relevant features only. Therefore, finding the nearest neighbor is meaningful if the search is restricted to data objects from the same cluster (distribution) as the considered point. Consequently, in our case, the local outlierness degree is computed by considering two different neighborhoods. The first one includes a large number of nearest neighbors (points from the same cluster) in the full feature space which is assumed to share the same generating mechanism as the considered point. The second neighborhood considers a smaller set of nearest neighbors, found in the previously defined neighborhood, for the local density comparison in each feature dimension; and thus for computing the local outlierness degree and finding the relevant subspace. The local outlierness degree is computed with the introduced function DLOF (Dimension wise Local Outlier Factor) which is a modified LOF (Local Outlier Factor). This function has been first explored in [19] and improved in this work.
Scoring function
Numerous approaches have been proposed to assess the relevance of the data attributes and construct relevant subspaces either for detection or characterization, and most of them rely on the local neighborhood of the considered data point. However, there are issues related to bias and contrast of the scoring function [2]. Indeed, scores obtained in subspaces of different dimensionalities are not comparable and choosing a threshold boundary to distinguish an outlier can be a challenging task. However, with our scoring function DLOF, since the outlierness is computed at each feature level, the scores comparability is not encountered. In addition, with LOF, a baseline value slightly larger than 1 allows differentiating an outlier from the remaining data points, the threshold is thus less difficult to set.
Anomaly detection and characterization
Contextual anomaly detection is a difficult task and obtaining additional information about the detected anomalies can be of tremendous importance for analysts. However, few works only have proposed approaches to simultaneously detect and characterize the contextual anomalies. Furthermore, when characterization is adopted, it is not evaluated appropriately.
Consequently, in the present work, a solution is proposed for contextual anomaly detection and characterization. Both the detection and the characterization processes rely on the same subspace relevance analysis function DLOF. The detection process is modeled as an outlier ensemble, while the characterization is achieved by quantifying the attributes relevance. In addition to the relevant subspaces, the anomalies are also characterized by finding the same-type anomalies. Moreover, both the detection and the characterization have been appropriately evaluated and compared to existing solutions.
The main contributions of our work can be summarized as follows:
A function for feature relevance analysis and quantification is introduced. The function is a modified LOF that addresses issues related to high dimensional data and subspace outliers. A completely unsupervised solution for outlier detection and characterization is presented. The detected outliers are characterized with their relevant subspaces and their same-type anomalies. The proposed approach has been applied on both real-world and synthetic data and evaluated in terms of detection and characterization efficacy.
The rest of this paper is organized as follows. Section 2 discusses the related work. In Section 3, we present the preliminaries related to the LOF algorithm and to the outlier ensembles. Subsequently, Section 4 presents the DLOF function and the data vertical partitioning strategy. Section 5 details our general approach for outlier detection and characterization. In Section 6, we present and discuss the evaluation results of the proposed solution in terms of both detection and characterization efficacy. Finally, Section 7 concludes the article.
The majority of the works have focused on improving the accuracy and the efficiency of the detection techniques while few works only have contributed in bringing more information about the anomalies. Hence, a challenging task in the outlier mining field is the interpretability and the characterization of the detected anomalies.
Contextual outlier detection
In order to develop and apply anomaly detection techniques, one should consider several constraints [27] such as the nature of the input data, the availability of data labels, the result of the algorithm. An important consideration, that has been taken into account lately, is based on the properties of the anomalies to detect. Indeed, it is assumed that the anomalies can be different depending on the generating mechanism, and generally a distinction is achievable by analyzing the impact of the data attributes. In most of the works which follow this assumption, the outliers discovered under certain condition, eventually in specific attribute subsets, are called conditional outliers [28] or contextual outliers [12, 16, 17, 30, 29].
As introduced in [27], the contextual anomaly detection methods can be categorized into two types. Methods in which contextual anomaly detection is reduced to point anomaly detection after identification of the context. Hence the contextual information is provided beforehand. The second type of method uses training data to learn a model which predicts the expected behavior with respect to a given context. An anomaly is detected if the predicted behavior is significantly different from the observed behavior. Herein, the contextual information is the learned model.
However, the first category can be extended. Indeed, the techniques described in [27] rely on contextual information that are determined manually. As an example, in [28], the authors presented their method, namely Conditional Anomaly Detection (CAD), and they assume that in addition to a baseline dataset, the user divides beforehand the data attributes into environmental attributes and indicator attributes. Herein, the definition of the contextual attributes is made by human experts, therefore, it relies essentially on the experts’ knowledge and experience and that is a drawback since the contextual information are predefined manually and not automatically. However, the contextual information can be either generated randomly [1, 18] or discovered automatically [13, 9, 10, 31, 5, 12, 8, 16, 17, 30, 29]
In [1], the authors have proposed an LOF-based outlier ensemble, the ensemble combines the results obtained by computing the LOF in several randomly generated feature subspaces. Although the approach does not discover a relevant subspace for each outlier, nonetheless, it can be considered as a first step towards an automatic outlier mining in feature subsets in order to improve the detection accuracy. The same idea has been developed in [18], where in addition to the various subspaces, different parameterization has been adopted. An unsupervised outlier detection algorithm based on generative adversarial active learning has been presented in [32] in order to tackle the issue of high-dimensional data. The model consists of a generator and a discriminator. The generator generates informative potential outliers that occur inside or close to the real data based on the guide of the discriminator which defines a decision boundary which separates normal and anomalous data. In order to provide sufficient information to the discriminator, the model has been expanded from one to multiple generators, namely, MO-GAAL. These generators have different objectives, thus generate different types of outliers.
The authors in [13] proposed a method, namely Subspace Outlier Degree (SOD), that detects and scores an outlier with reference to the axis parallel subspace spanned by its neighbors. The relevant attributes of the set of neighbors of a point are the attributes where the points in the neighborhood exhibit a low variance. The OUTRES is introduced in [9] which is an adaptive outlier ranking where comparable degrees of deviation are measured for objects in multiple arbitrary subspaces. The deviation is measured for each object in a set of relevant subspaces, where a subspace is retained as relevant if it is non-uniformly distributed by using the Kolmogorov-Smirnov test. In [10], the authors proposed HiCS, a subspace search method that selects high contrast subspaces for density-based outlier ranking. The proposed contrast measure is based on correlation analysis, the difference between marginal and conditional pdf of a subspace is used as a criterion for high contrast. And, the outlier score for the ranking relies on LOF. In addition, the subspace outlier mining is a decoupled process, divided into “subspace search” and “outlier ranking”. In [31], the Local Outlier Mining Algorithm (LOMA) is proposed, herein, the approach relies on the use of the PSO heuristic and a proposed sparse coefficient threshold to search for the sparce subspaces from which the local outliers can be effectively extracted. The GLOSS algorithm was introduced in [5], the approach consists in combining existing local subspace anomaly detection techniques with the concept of global neighborhoods. The authors explain the necessity to consider global neighborhoods in order to asses the outlierness of a query point with respect to data objects having the same mixture component.
In [13, 9, 10, 5, 31], although, the terminology of contextual or conditional anomaly has not been used, the methods that have been proposed in these works rely on the notion of relevant subspace. And as explained in [7], certain outliers are hidden in data subsapces and can be detected only by projecting the data into lower dimensions. Therefore, those characterizing subspaces can be considered as the context in which the anomaly occured. However, in these works the relevant subsapces in which the outliers are detected are not retained for further exploration and are not reported to the user.
In [16, 17, 29, 30], the notion of contextual anomalies has been adopted, and the contextual information is retained as an additional result of the proposed approaches so as to help the user to achieve a better interpretation of the detected anomalies. In both [16, 17], the relevance of a subspace relies on the local density. The authors in [17] assume that an outlier is an object which, in the relevant subspace, doesn’t satisfy the distribution of the local dataset. The local sparseness of attribute dimensions is used to quantify the relevant subspaces, and the probability density of local datasets is used to compute the degree of outlierness of the object. This work aimed at improving detection efficiency and scalability. The authors in [30] also aimed at improving the LOMA [31] through parallelization and outlier interpretation.
Above-mentioned methods have been proposed for numeric data and the relevant subspace is the context considered. However, methods for contextual anomaly detection in categorical data or that require data transformation have been proposed in [12, 8], respectively. And, in [29], the context is identified as a 2-coloring of a random walk graph.
The common property of all these works is that they detect outliers under a certain context. However, reporting the contextual information is a side product of the proposed methods and a small effort has been put in for the evaluation of these methods in terms of outlier characterization and explainability [16, 17, 30]. As such, in this work, through the DLOF function both the detection and the characterization of the contextual anomalies are achieved. In addition, the charcaterization performance of the solution has been appropriately evaluated and compared to existing solutions.
The works that have been proposed for numeric data and consider the relevant subspace as the contextual information are summarized in Table 1.
Summary of works on relevant subspace-based contextual outlier detection in numeric data
Summary of works on relevant subspace-based contextual outlier detection in numeric data
As previously explained, in some proposed solutions, even though the outliers’ relevant subspaces are found and explored so as to enhance the anomaly detection accuracy, these relevant subspaces are not retained to be presented to the user. Moreover, the problem of outlier interpretation is gaining more and more interest, indeed, understanding the reason behind the outlierness of a data object is very important. It is also referred to outlier characterization as outlying aspects mining [11], outlying properties detection [20], or outlier explanation [21].
An early work on outlier characterization is introduced in [11]. The authors presented a solution that considers categorical attributes. A measure using the relative frequency has been used to quantify the relevance of attribute combinations for a considered anomaly. The score is obtained by considering the entire dataset or a subset of it. A supervised feature selection-based technique is proposed in [4]. The method constructs for each outlier a binary classifier to distinguish it from the inliers. The outlier’s class is a set of samples generated as a Gaussian distribution, while the inliers are formed with other samples of the remaining data. Herein, classification accuracy is used as a feature scoring function, i.e. the features that allow to obtain a high accuracy are retained as a characterizing subspace. The OAMiner was introduced in [20], as a scoring function, the authors used the rank of the outlier in a given subspace based on the kernel density estimate. The subspaces which allow obtaining the best ranking of the outlier are retained as the relevant susbspaces. The approach relies on a systematic search and an in depth-first strategy is adopted to browse the search space and find the relevant subspaces. An anomaly characterization approach is presented in [23], namely, the OARank framework. It works in two stages, the first one is a mutual information-based feature ranking, and the second stage consists in exploring the features obtained in the previous stage through a score-and-search strategy. The second stage is optional, when considered the framework is called OARank search. Lately, an ensemble of scoring functions has been presented in [24]. The authors have analyzed these functions in terms of different desiderata, in particular, the dimension unbiasedness. Subsequently, the beam search algorithm has been used so as to browse the subspaces for the considered outliers with the described scoring functions. A general framework named SHAP (SHapley Additive exPlanations) is presented in [25, 26], it assigns for each feature an importance value and aims to help the analyst to interpret the predictions of any machine learning model. The authors in [21] introduced the Explainer algorithm which uses an ensemble of specifically trained decision trees to explain the anomalies detected by any other algorithm. The Explanation consists in the different relevant features but also in a set of rules.
Similarly to detection, the outliers are characterized with their relevant subspaces and the relevance analysis is key point. However, the research on this problem is limited. The solution presented in [11] considers only categorical attributes. In [20, 24], in order to find the relevant subspaces, a systematic search strategy is adopted, therefore, they are computationally expensive. Furthermore, in [4], a feature selection strategy is adopted and the approach in [23] is hybrid. The latter approaches are better than the former ones in terms of computation, however, in high dimensional data, their characterization performance is low. In our case, the proposed scoring function relies on the feature ranking by analyzing the features individually, therefore the systematic search is not needed and the process is more efficient. It is also effective for high dimensional data due to the data partitioning. Furthermore, in addition to the relevant features, the same-type anomalies are reported as an additional result that can help the analyst to explain the results.
Preliminaries
In this section we present the Local Outlier Factor (LOF) algorithm, followed by the introduction of the concept of outlier ensembles.
Local Outlier Factor
LOF [22] is a density-based outlier detection approach. It assigns to each data point an outlierness degree. The local outlier factor is computed as follows.
Firstly, for each point
Secondly, to reduce the statistical fluctuations of the distances between a point and its neighbors, the reachability distance of point
Later, the local reachability density of point
Finally, the local outlier factor of a point
The LOF represents the degree of outlierness of a data point. A data point is considered as an outlier if its local outlier factor is greater than a baseline value close to 1, it is considered as a normal point, otherwise. The advantage of the LOF consists in the fact that it detects outliers lying in neighborhoods with different properties [22].
Ensembles analysis consists in combining results obtained with various algorithms or, like in our case, executions of one base algorithm with distinct parameters and/or distinct data subspaces. This approach has been adapted to the field of outlier detection, and it is known as outlier ensembles. The outlier ensemble can be unsupervised and the main purpose from it is the improvement of the anomaly detection efficacy by merging results from various models and to tackle the curse of dimensionality when feature subspaces are used. Indeed, previous studies have shown that a better performance can be achieved by combining results from various models compared to using the results of one base model, and it holds true for anomaly detection [1, 18]. However, designing an unsupervised anomaly detection ensemble necessitates considering some aspects such as the accuracy of the algorithms, their diversity and, the combination function to use to obtain the consensus of the results. A more detailed explanation of the concept can be found in [6, 3].
Data partitioning and scoring function
In this section, we will introduce our function for attribute relevance analysis, in addition to explaining how the vertical data partitioning is explored in our solution.
Vertical data partitioning
LOF has different advantages; however, it relies on the distance measure, and because of the curse of dimensionality, the performance of the algorithm on high-dimensional data is not satisfactory. Thus, in the case of the DLOF, the algorithm will not perform well in discovering the
Vertical data partitioning.
We adapt the LOF algorithm and use it as a feature scoring function for both detection and characterization. Therefore, we make the two following modifications to the algorithm.
Two neighborhood-related parameters: Cp and K
In the LOF algorithm, the outlierness degree of each data point is computed with the parameter
In order to tackle the challenge explained in Subsection 1.2, i.e. the fact that the nearest neighbor search is meaningful if it is restricted to objects of the same cluster as the considered point [2]. We define the parameter
We define a second parameter
Dimension-wise LOF (DLOF)
The DLOF of a point
Let
for at least for at most
The data point
where
General functioning of the proposed approach.
In this section, we introduce the details on how the presented approach works, its general functioning is illustrated in Fig. 2. After the data partitioning, if necessary, the first step consists in computing the Dimesion-wise LOFs of all data points. Subsequetly, these DLOFs are used to find the outliers and to characterize them. The details of these different steps are presented in the following.
DLOF computation
In order to tackle the problem of the high dimensional data, we implement the data partitioning that has been discussed in subsection 4.1 by following the instructions in Algorithm 5.1. The
[h!] : DLOFs Computation[1]
Algorithm 5.1 presents the details of the DLOF computation, as can be seen, the C-nearest neighbors and the K-nearest neighbors are first computed for each data point
[h] : Dimension-wise Local Outlier Factor (DLOF)[1]
Find the C-NNs of point
Outlier detection
Once the DLOFs are computed for all data points
The results in
In order to characterize the detected anomalies, the relevant subspace of each data point is discovered, in addition, the detected anomalies are clustered in order to find the same-type anomalies in terms of characterizing attributes.
[h] : Outlier Characterization[1]
The same DLOFs that have been used for detection will be used in order to find the relevant features for each anomaly. From Algorithm 5.3, we can see that for each anomaly, the feature dimensions which allow obtaining a DLOF greater than a predefined threshold are considered as the characterizing features. Additionally, the user can choose the number of characterizing features by introducing a minimum number and a maximum number of features to retain.
In addition to outlier characterization through reporting relevant subspaces, we also propose to report the anomalies that might be of the same type. Two outliers can be considered of the same type if they have the same relevant subspace. Finding the outliers of the same type can be achieved by comparing the obtained characterizing features or simply by clustering their DLOFs. Indeed, the outliers in the same clusters are considered of the same type. To this purpose, any clustering algorithm can be used.
Experimental evaluation
In order to evaluate the proposed approach in terms of detection and characterization, a set of experiments on synthetic and real-world data has been conducted. In this section, the details of the experiments and the comparison results are presented.
Datasets and evaluation metrics
Datasets
The synthetic datasets introduced in [10] have been used for evaluating the detection and characterization performance of our solution. The datasets have 10, 20, 30, 40, 50, 75 or 100 dimensions. For each dimensionality, three different datsets are available, each having 1000 points including 19 to 136 anomalous points. In these datasets, 2–5 dimensional subspaces are randomly selected out of the full data space in order to generate high density clusters. Subsequently, some objects are picked in the subspaces, they are modified and considered as subspace outliers. As such, detecting and characterizing these anomalies are difficult tasks.
In order to evaluate the performance of our approach on real-world data, datasets with different dimensionalities and sizes, and that have been explored in several works, were retained to compare the proposed solution. The data do not include missing values or noise, as such the pre-processing step was limited to feature scaling by using the z-score standardization. The datasets are summarized in Table 2 and explained in the following.
Wine1
Glass
Arrhythmia
Vowels
Musk
Annthyroid
Mammography
Optdigits
Ionosphere2
Page Blocks
WBC3
Summary of the real-world datasets
In order to illustrate the performance comparison in terms of detection efficacy, the area under the curve (AUC) of the Receiver Operating Characteristics (ROC) curves is used. The ROC curve depicts the true positive rate (TPR) as a function of the false positive rate (FPR) that are defined as follows:
False positives (FP) are data objects annotated as anomalies but they are normal observations in reality, while true positives (TP) are anomalous observations which are labeled as anomalous. Normal observations annotated as normal represent true negatives (TN), whereas the observations that are anomalies but annotated as normal represent false negatives (FN).
The comparison of the characterization performance on synthetic data has been achieved by using the precision and the Jaccard index measures that are defined as in the following:
Additionally, in order to compare the detection algorithms performance on multiple datasets, we assess whether the differences between these algorithms are statistically significant, as such, we use the Friedman test with the posthoc analysis test Nemenyi [15]. The Friedman test is a non-parametric equivalent of the repeated-measures ANOVA. It ranks the detection algorithms for each data set separately, and in case of ties, the average ranks are assigned. The Friedman statistic is given by Eq. (13). Let
The Nemenyi test is a post-hoc test used when the null-hypothesis is rejected with the Friedman test. It is used to compare the detection algorithms to each other. The performance of two algorithms is significantly different if the corresponding average ranks differ by at least the critical difference defined in Eq. (14), where critical values
For both the Friedman and the Nemenyi tests, the
In order to understand the effect of the different parameters on the proposed approach, we conduct a set of experiments on the synthetic datasets, the first synthetic dataset of each dimensionality (i.e.,
The effect of the parameter s, presented in the first plot from the top, is relatively clear; the larger the value of
The parameter
The impact of the parameter
Impact of the parameters on the detection performance in different d-dimensional datasets.
Regarding the characterization, the key parameter is the threshold
Our proposed solution is compared to six algorithms, namely, the well-established LOF [22], HiCS [10], SOD [13], GLOSS [5],4
1) Synthetic data
Results of the comparison of the proposed approach with the other algorithms on synthetic data are presented in Fig. 4. The figure depicts the average AUC of the different algorithms, the average obtained from the three datasets of each dimensionality. As can be seen, our DLOF-based outlier ensemble outperforms the other approaches almost in all dimensionalities. The AUC obtained with the algorithms LOF, PFSOE and SOD is low even if SOD is a detection strategy for outliers in subspaces, the LOF doesn’t explore the subsapces, while the PFSOE explores only randomly generated subspaces. The MO-GAAL performance is also low, this can be due to the fact that the generated artificial outliers are not sufficiently diverse. The GLOSS algorithm outperforms our solution on both the 10 and 20-dimensional data, however, its performance diminishes in the higher dimensional data. The HiCS results are slightly better than those of the rest of the algorithms; however, they are not sufficiently high. We can explain these results by the fact that our solution tries to find the outlierness degree in the appropriate cluster into which the considered point belongs, i.e. its local density is compared to the local density of points that have the same generating mechanism. Furthermore, the final score of each data point is obtained by averaging the scores from all one-dimensional subspaces and not from subspaces with different dimensionlities. Indeed, if a subspace with a certain dimensionality is retained, as it is the case in the other algorithms, but it is incomplete or too large, then the degree of outlierness will be erroneous.
AUC comparison on real-world data
AUC comparison on synthetic data.
2) Real-world data
The real-world datasets presented previously have been used for the comparison of the proposed approach with the other algorithms. The results are presented in Table 3. The table depicts the obtained AUC. As can be seen, competitive results have been obtained with our approach, including on high dimensional datsets such as Arrhythmia, Musk and Ionosphere. High-dimensional datasets allow one to validate the necessity to mine for outliers in data subspaces. For example, the AUC obtained with LOF for the Musk dataset is low, while good results have been obtained with the subspace-based algorithms, including ours. Some of the datasets contain several classes,and they are labeled for classification purposes; and the anomalies are obtained by subsampling one or more classes, e.g., in Wine, Glass and Opdigits; therefore, these anomalies can be semantically meaningless, they can also form small clusters and become difficult to distinguish; as such, some resulting AUC are not high enough. Satisfactory results are also obtained with our approach on datasets with a large size (e.g., Annthyroid and Mammography). In order to assess the statistical significance, the Friedman and Nemenyi tests were performed. The statistic obtained with the Friedman test is approximately equal to 29.23, while the
Characterization comparison on synthetic data.
In order to evaluate the characterization performance of our approach, in this subsection we present the experiments that have been conducted on real-world and synthetic data. In both cases, the threshold
Same-type outliers finding through DLOFs clustering
Same-type outliers finding through DLOFs clustering
Characterizing features of the anomalies in the WBC dataset
The obtained U-matices from the outlier clustering with the Self Organizing Maps.
Visual comparison of the characterization results of the anomalies in the bcwd dataset, Subplots (a), (c), (e) and (g) have been obtained with our approach, Subplots (b), (d), (f) and (h) have been obtained in [30].
Heat maps of the DLOFs obtained for the first synthetic datasets of dimension 10, 20, 30, 40, 50 and 75.
1) Synthetic data
In order to assess and compare the performance of the approach and because in these datasets the ground truth concerning the relevant subspaces of the anomalies is available, the precision and the Jaccard index metrics have been used. Our results are compared to those obtained with SHAP [26]7
In addition to outlier characterization through reporting relevant subspaces, we also present the anomalies that might be of the same type. In the synthetic datasets, the outliers are labeled depending on the feature subspace in which they have been generated, i.e., two outliers belonging to the same subspace have the same label. We refer to these labels as types. In order to obtain the same-type outliers, we apply a clustering algorithm on the DLOFs, we use the Self Organizing Maps (SOM). Two outliers belonging to the same neuron are considered as being of the same type, indeed, the neuron’s centroid reflects the relevant attributes of those outliers. Table 5 and Fig. 6 present the results of clustering the DLOFs of the outliers in the first datasets with dimensions 10, 20 and 30. As can be noticed, in each dataset all the same-type anomalies or some of them could be grouped in the same neurons. For example, the outliers of type 2 in the 30-dimensional dataset, {69,460,568,753,869}, have the same relevant subsapce and they have all been assigned to the same neuron 8. The advantage of using the SOM lies in the fact that the obtained map’s neurons have a neighborhood relationship which helps in preserving the topological properties of the input data, thus, misclassified outliers are generally assigned to the neighboring neurons. Furthermore, the resulting map can be updated and reused in the future in order to detect the same anomalies. Indeed, the discovered properties of the outliers will help the analyst, on the one hand, to understand the reason behind the abnormal behavior of the data observation, and on the other hand, to reuse these characterizations to build better detection strategies.
2) Real-world data
In order to compare the characterization performance of our method on real-world data, we use the wisconsin-breast cancer diagnostic (WBC) dataset, this dataset has been used in [30]. The set of characterizing features obtained with our approach alongside those obtained in [30] are presented in Table 6. In order to compare the quality of these subspaces, we have conducted a visual comparison. Figure 7 shows the plots of the dataset points in the different discovered subspaces for each anomaly. These plots allow one to see in which data projection the considered anomaly separates well from the remaining data. In the same figure, Subplots (a), (c), (e) and (g) present the subspaces obtained with our approach for the anomalies 865423, 911296202, 873592, 8810703, respectively; while Subplots (b), (d), (f) and (h) present the subspaces obtained in [30]. As can be seen from these results, in the data projections obtained with the discovered subspaces of our approach, there is a clear separation between the considered anomalies and the rest of the data.
The detection with our approach can also be achieved through visualization tools. As can be seen from Fig. 8 which presents the heat maps generated for the DLOFs obtained from the first of each synthetic dataset of dimension 10, 20, 30, 40, 50 and 75, the different outliers are visible in the red color. The rows indicate the index of the anomalies in the dataset while the column position helps in figuring out the valuable features for each anomaly. This can be specially useful when we have a small dataset with a small number of anomalies.
Conclusion and discussion
Both contextual outlier detection and outlier characteriation are challenging tasks and of great importance. Solutions have been proposed in the literature for contextual outlier detection or for outlier characterization; however, few works have tackled both issues at the same time. Thus, in this article, we have presented our feature relevance scoring function DLOF. This function has been used to simultaneously detect and characterize contextual outliers. With our solution, several challenges have been addressed. Indeed, the appropriate neighborhood has been retained to compute the outlierness degree, the vertical data partitioning has been adopted in order to obtain a scalable function w.r.t. data dimensionality. Furthermore, the DLOF is computed for each feature dimension individually; avoiding in this way exploring the different feature combinations and the problem of subspaces comparability. An ensemble approach with an averaging combination function has been adopted for outlier detection; thus both binary scores and outlier ranking are made possible. In addition, the DLOF is a modified LOF; however, DLOF is still advantageous regarding the threshold choice facility. Moreover, the presented solution has been appropriately evaluated on synthetic and real-world data in terms of both detection and characterization efficacy. From the experiments, we could see the good performance of our approach compared to existing solutions. Nonetheless, further improvements can be brought. For instance, in the detection part, other combination strategies can be explored in order to obtain the final results. Furthermore, for some applications the severity of an anomaly can be quantified with the obtained DLOF, making, in this way, more interpretations. The discovered characteristics can be reused in order to reinforce detection methods, by targeting the same types of outliers with their characterizations.
Footnotes
Acknowledgments
This work was funded by the National Natural Science Foundation of China, Grant number 61773206.
