Abstract
Analysis of high-dimensional data often suffers from the curse of dimensionality and the complicated correlation among dimensions. Dimension reduction methods often are used to alleviate these problems. Existing outlier detection methods based on dimension reduction usually only rely on reconstruction error to detect outlier or apply conventional outlier detection methods to the reduced data, which could deteriorate the performance of outlier detection as only considering part of the information from data. Few studies have been done to combine these two strategies to do outlier detection. In this paper, we proposed an outlier detection method based on Variational Autoencoder (VAE), which combines low-dimensional representation and reconstruction error to detect outliers. Specifically, we first model the data use VAE, then extract four outlier scores from VAE model, finally propose an ensemble method to combine the four outlier scores. The experiments conducted on six real-world datasets show that the proposed method performs better than or at least comparable to state of the art methods.
Introduction
Outlier detection is an important data mining task with many applications, such as fraud detection, network intrusion detection, environmental monitoring, etc. Conventional full-space outlier detection methods often assume that distance is reliable, or low-density area in data space can be easily detected, which is not the case for high-dimensional data [26]. With so many dimensions, distance in high-dimensional data becomes unreliable, which also known as the curse of dimensionality. Also, the sparsity of high-dimensional data makes it difficult to detect low-density area accurately.
Another challenge for high-dimensional data is the correlation among dimensions. Because of the correlation, outlier detection for high-dimensional data often needs synthetically analyse many dimension to make a judgment about outliers. The Correlation also means that the intrinsic dimensionality is lower than original dimension, which leads to dimension reduction based methods for outlier detection. There are two common strategies for dimension reduction based outlier detection. The first one uses dimension reduction as a tool of feature extraction, after mapping the data to lower dimensions, conventional full-space outlier detection method is applied [10, 5]. With lower dimensions, the problems caused by the curse of dimensionality can usually be alleviated. The other one uses dimension reduction models to capture normal patterns, then the reconstruction error which can be explained as the deviation from the normal patterns is used to measure the outlier degree of the data [7, 22, 24]. Apparently, relying on only one strategy may make some outliers undetectable. Figure 1 shows that when mapping these 2-d points to the solid line: if only use the first strategy, point
A synthetic outlier detection dataset.
One way to combine low dimension information and reconstruction error is directly using probabilistic dimension reduction models. Probabilistic dimension reduction models often have a measure related to probability density that can be viewed as a combination of low dimension information and reconstruction error, and it can be used to detect outliers. The authors of [25] exploit energy-based models to model the data, then use the energy score, which is proportional to the probability density, to recognize outliers. They confirmed that energy score performs better than reconstruction error. The other probabilistic models such as Variational Autoencoders (VAE) [15] also can be used. However, this näive combination method often cannot accurately distinguish between low-density normal samples and outliers, as in the learning process outliers also tend to give large density values.
In order to accurately detect outlier of high-dimensional data, we proposed an outlier detection method based on VAE, which can better combine low dimension information and the reconstruction error of dimension reduction. We choose VAE because it is one of the most advanced probabilistic models that can be used to reduce dimension and can be directly trained by Stochastic Gradient Descent (SGD) algorithm. To the best of our knowledge, this is the first attempt to apply VAE to outlier detection. Specifically, we first use VAE to model the data. Then, we extract four outlier scores from the loss function of VAE, low dimension representation, and the reconstruction error. Finally, we propose a selectively weighted outlier ensemble algorithm to integrate the four outlier scores to detect outliers. We conduct experiments on six real-world datasets to validate the effectiveness of the method. The experiments show that the proposed method performs better or comparable with state of the art methods The remainder of the paper is organized as follows. In Section 2, we briefly review previous related work. Section 3 introduces the notation and problem formulation. In Section 4, we describe the proposed method. We empirically evaluate the method in Section 5. Finally, we conclude this paper in Section 6.
Outlier detection has been extensively studied [6]. According to the different assumption about outliers, conventional full-space outlier detection methods can be classified as statistical methods, nearest neighbor-based methods, clustering-based methods, one-class classification methods and so on. Statistical methods assume that outliers locate in low-density area, and often involve with density estimation. Density estimation methods include model-based methods and non-parametric methods. Model-based methods first estimate the parameters of the model, then compute density through the model. Non-parametric methods do not need to estimate parameters and can directly compute density from data. Common used non-parametric density estimation methods are histogram and kernel density estimation (KDE). Actually, for outlier detection, there is no need to estimate density accurately, the approximation of density or some measure that has a monotonic relationship with density is enough. Following this idea, Isolation Forest (iForest) [18] use tree depth of random space partition as the measure of outlier degree, and the tree depth is monotonic with respect to density. Nearest neighbor-based (NN-based) methods assume that normal data has dense neighbors, and outliers have sparse neighbors. NN-based methods can also be explained as using nearest neighbors to approximate density in statistics. In order to handle the case that normal data instances have different density, Local Outlier Factor (LOF) [4] is proposed to alleviate this problem by comparing density with the density of neighbors. The density in LOF also uses nearest neighbors to approximate. Clustering-based methods assume that normal data instances form clusters, the data instances that are far from clusters or belong to little clusters can be considered as outliers. As outliers are only the by-product of clustering methods, and the used clustering methods are not optimized for outlier detection, the efficiency and effectiveness of clustering-based methods are relatively low. A typical one-class classification method is one-class support vector machine (OC-SVM) [23], which constructs a hyperplane to separate most of data instances from others.
All the above methods are full space methods, which can not handle high-dimensional data well. Existing outlier detection methods for high-dimensional data mainly include feature selection-based methods and feature transformation-based method. Feature selection-based methods, also known as subspace outlier detection [3], aim to detect outliers in some feature subset. Feature selection-based outlier detection methods usually include two components subspace selection and outlier degree computation. As confronting with the exponential order number of subspaces, this kind of methods actually is not feasible for datasets with a large number of features.
Feature transformation-based methods first learn a transformation to map the data to a simple representation (often with much lower dimension). Then there are two strategies to do outlier detection: the first is using reconstruction error of the transformation to measure the outlier degree of the data [7, 22, 24], the second is applying conventional outlier detection method to the transformed data [10, 5]. The feature transformation method used in [7] is kernel PCA, [22, 24] Autoencoders. In order to improve outlier detection performance, the training method of Autoencoders in [24] is modified to consider the separation between potential normal data instances and outliers using binary clustering method according to the reconstruction error. In some extent, the method of [24] can be considered as a one-class classification method. The authors of [10] first use Deep Belief Network (DBN) to transform data to a low dimension representation, then apply OC-SVM to detection outliers. They found that OC-SVM with linear kernel can obtain good outlier detection performance. The method in [5] first use Autoencoders to do dimension reduction, then use KDE to detect outliers. The method in [25] is an exception that can be considered as a combination of the two strategies, which is most similar to ours. Their method is based on Energy-Based Model. EBM is a probabilistic graph model that learns the distribution of data and transformed representation. The authors of [5] use the energy score of EBM to detect outliers and confirm that energy score performs better than information loss. Compared with [25], our method is based on VAE, VAE can be directly trained by SGD, while EBM can not. Moreover, our method explicitly combines the reconstruction error and transformed representation, while [25] uses the approximation of density provided by the model, which is much easily affected by overfitting.
Notation and problem formulation
We follow the notation in [11]. Matrices and vectors are written as boldface uppercase letters and boldface lowercase letters, respectively. Random variables are set in roman; the others are italic.
Let
The other notations are used including
Proposed method
In this section, we first give a brief introduction of VAE, then describe the outlier scores induced from VAE. This is then followed by the weighted outlier ensemble algorithm.
Preliminary
VAE is one of the most promising techniques for unsupervised learning, which can be considered as the combination of probabilistic graphical model (PGM) and neural network and has the advantages of the two. Essentially, VAE is a PGM with latent variables, aiming to learn the dependence between data and latent variables. VAEs use directed structure, so the joint probability density can factor as:
where
Neural networks in VAE are used to approximate the parameters of
VAE optimizes the following loss function with respect to data point
This transformation does not change the distribution of
VAE structure.
The structure of VAE we used is shown in Fig. 2. We choose one hidden layer, as it can guarantee the universal approximation of neural networks [13]. We assume
After learning a probabilistic model, the most straightforward way to do outlier detection is using probability density as outlier score. However, there are two reasons that make this way not accurate enough. First, accurately estimate the density of high-dimensional data is quite difficult. Second, maximum likelihood tends to give large density value to every point in the training set, which will make the outliers in training set hard to detect. In order to more accurately detect outliers, we extract four outlier scores based on VAE from different perspectives.
The first two is computed directly from the loss function of VAE. The first one is the sampling estimation of the first term of Eq. (2):
where
The third outlier score is the histogram density estimation in reduced space. VAE actually learns the posterior distribution, the most commonly used method to get features from distribution is using the mean value of the distribution, and we also choose this method. We denote the reduced data as
Now the problem is reduced to estimate the density of one-dimensional data. There are many methods to tackle this problem, such as Kernel Density Estimation and Histogram. We choose histogram because it is simple and effective. Considering that not every dimension of
The last outlier score is the reconstruction error when
Given several outlier scores, simple average ensemble method often cannot perform well, because they are often not equally important. So we propose a Weighted Outlier Ensemble (WOE) method to aggregate the four outlier scores. The pseudo-code of WOE is shown in Algorithm 4.3. The input to Algorithm 4.3 is the outlier scores of all data points, i.e.
[1] Four outlier score list
The basic idea of WOE is that we first construct a pseudo-target, then assign weight according to the correlation between pseudo-target and outlier scores. As for unsupervised outlier detection, there is no ground truth. We follow the method in [20] using simple average ensemble as pseudo-target. This pseudo-target is used as comparison benchmark to sort outlier scores in descending order according to Spearman’s rank correlation coefficient. We use Spearman’s rank correlation coefficient because it can measure non-linear correlation, and the exact value of score does not matter. The rank of the sort is used to assign weights, with lower rank lower weight. The weights we used is the reciprocal of rank, which is also used in [20]. However, they used it for data points.
Empirical evaluation
In this section, we will briefly introduce the datasets and the compared methods which are used in the experiments. Afterwards, experimental results are evaluated and analyzed.
Datasets
To test our method, we use six real-world datasets: (1) Cardio; (2) Satellite; (3) Epileptic Seizure Recognition (ESR); (4) Human Activity Recognition (HAR); (5) MNIST [16]; (6) Internet Advertisements (IAds). Except for MNIST, other five datasets are all from UCI Machine Learning Repository [17]. As all the datasets are used for classification, we apply the following transformation to generate data for outlier detection:
Cardio contains three classes (normal, suspect and pathologic). For outlier detection, we use the data generated by Outlier Detection DataSets (ODDS) [1]. The generation method is that the normal class formed the inliers; the pathologic (outlier) class is downsampled to 176 points; the suspect class is discarded. Satellite. The data of three large classes (1, 3, 7) are used as inliers, the other three small classes (2, 4, 5) is outliers. This setting is also used in [18]. ESR The data of classes 2, 3, 4, and 5 are from subjects who did not have the epileptic seizure, so we choose these classes as normal data. The data of class 1 is from who have the epileptic seizure, which is chosen as outliers. HAR. We choose three motional classes of activity (walking, up-stairs, and down-stairs) to generate outlier detection dataset. The data of walking forms the inliers, then we subsample data from the other two classes to form outliers. MNIST. We choose the data of two similar digits 4 and 9 to form the inliers, then subsample the data of other digits to get the outliers. IAds. We use the original data to do outlier detection. Non-advertisement instances are used as inliers, and advertisements are used as outliers. We first discard the first three continuous attributes as the proportion of the missing values is over 10%, then discard 15 instances as they have missing values in other attributes. The last attributes are all binary.
Table 1 summaries the main characteristics of the used datasets. All the data are scaled to [0, 1]. The labels are used only for evaluation.
Datasets summary
We use leaky rectified linear units (relu)
The baseline methods includes LOF [4], Isolation Forest (iForest) [18], one-class Support Vector Machine (OC-SVM) [23], PCA, AE, DRAE [24] and EBM based method [25]. The first three are conventional full-space methods, the others are dimension reduction based methods.
The experimental setting for baseline methods are as follow: The neighbor number of LOF is set to 10, which is a commonly used value in outlier detection. For OC-SVM, We use Radial Basis Function (RBF) kernel with parameter
For LOF, iForest, OC-SVM, and PCA, we use the implementations in Scikit-Learn [19]. For the other three methods, we implement them using Tensorflow.
For all experiments, we use the area under the receiver operating characteristic curve (AUC) as the metric to evaluate the performance of outlier detection. ROC curves plot the true positive rate against the false positive rate. As outlier detection methods often generate a score to indicate the degree of abnormality, and the score can also be used as the priority of data processing for the human analysts. Intuitively, AUC measures the rank accuracy of putting outliers ahead of normal data, which is extensively used to evaluate outlier detection methods [6].
Results and discussions
The experimental results are shown in Table 2. Intuitively, there are no methods perform well on all datasets. In order to find significant differences among the results, statistical analysis is carried out. We first use Freidman’s test to determine if there are significant differences among these methods. If there are statistically significant differences, then we use Holm as a post hoc test, which is used to compare the control method v.s. the remaining ones. These statistical tools are often used to compared machine learning methods [8, 12], and the authors in [10] also carry out statistical analysis in the area of outlier detection. Following [10], the significance level
The AUC values of the experiments. For stochastic methods, we run experiments for ten times, and report mean and standard deviation of the AUC values. The best results are marked as bold
The AUC values of the experiments. For stochastic methods, we run experiments for ten times, and report mean and standard deviation of the AUC values. The best results are marked as
Tables 3–5 show the results of statistical analysis, which are generated by the tool provided in [21]. Table 3 reports the average ranking of these methods. The proposed VAE based method ranks first. The results of Friedman’s test are shown in Table 4. The null hypothesis is rejected, which also means that there are significant differences among these outlier detection methods. Therefore, we carry out the Holm’s method as a post hoc test using VAE as the control method. The results of Holm’s method, which is shown in Table 5, reveals that the control method (VAE) is statistically better than LOF, OC-SVM, PCA, AE, and DRAE. Although there is no significant difference for iForest and EBM based on the Holm’s method results, VAE performs better than iForest for all datasets, especially for the high dimensional datasets; and VAE outperforms EBM by a large margin in 4 of 6 datasets, with the other two datasets have comparable performance.
Average ranking of outlier detection methods based on the AUC values
Results of Friedman’s tests based on the AUC values
Results of the Holm’s method based on the AUC values (Using VAE as the control method)
Compared with three full-space methods, sophisticated dimension reduction-based methods such as EBM-based method and VAE-based method outperforms by a large margin. Among these dimension reduction based method, our VAE-based method and EBM-based method perform better than others, as they combine low dimension information and reconstruction error of dimension reduction to detect outliers, while other methods only rely on reconstruction error.
We also compare different combination methods based on VAE: sampling estimation of ELBO (ELBO), the average of the proposed four outlier scores (AVG) and our ensemble method (WOE). The experimental results are shown in Fig. 3. As shown in Fig. 3, WOE performs better or comparable in five of six datasets, except IAds, which only slightly deteriorates the performance. From the results, we can also see that ELBO is the worst combination method in most cases, we argue that this is because the loss function is used as outlier score, during optimization, outliers tend to give a high value as a result of overfitting.
Comparison of ensemble methods. The experiments are run ten times; we plot the boxplots for every method and dataset.
We test the sensitivity of
The sensitivity of the number of hidden layer units. The experiments are run ten times; we plot the mean AUC value for every setting and dataset. The dimension of 
In this paper, we have proposed an outlier detection method based VAE. The method first uses VAE to model the data, then distills four outlier scores form VAE, which includes two part of the loss function of VAE, low-dimension representation induced from VAE, and the reconstruction error of dimension reduction. At last, we propose an ensemble method to combine the four outlier scores to detect outliers. We conduct experiments on six real-world datasets. The experiments show that the proposed method performs better or it is comparable with state of the art methods. In the future, we will apply the idea of combining low-dimension information and the information loss to other dimension reduction method, such as EBM, Bidirectional Generative Adversarial Networks (BiGAN) [9].
Footnotes
Acknowledgments
This work was supported by the National Key Research and Development Program of China (Grant No.2016YFB1000101), the National Natural Science Foundation of China (Grant No.61379052), the Science Foundation of Ministry of Education of China (Grant No.2018A02002), the Natural Science Foundation for Distinguished Young Scholars of Hunan Province (Grant No.14JJ1026).
