Variational autoencoder-based outlier detection for high-dimensional data

Abstract

Analysis of high-dimensional data often suffers from the curse of dimensionality and from complicated correlations among dimensions. Dimension reduction methods are often used to alleviate these problems. Existing outlier detection methods based on dimension reduction usually either rely only on reconstruction error to detect outliers or apply conventional outlier detection methods to the reduced data. Either choice can degrade detection performance, because it considers only part of the information in the data. Few studies have combined the two strategies. In this paper, we propose an outlier detection method based on the Variational Autoencoder (VAE), which combines the low-dimensional representation and the reconstruction error to detect outliers. Specifically, we first model the data using a VAE, then extract four outlier scores from the VAE model, and finally propose an ensemble method to combine the four outlier scores. Experiments conducted on six real-world datasets show that the proposed method performs better than, or at least comparably to, state-of-the-art methods.

Keywords

Variational autoencoders, outlier detection, high-dimensional data

1. Introduction

Outlier detection is an important data mining task with many applications, such as fraud detection, network intrusion detection, and environmental monitoring. Conventional full-space outlier detection methods often assume that distance is reliable, or that low-density areas in the data space can be easily detected, which is not the case for high-dimensional data [26]. With so many dimensions, distances in high-dimensional data become unreliable, a phenomenon known as the curse of dimensionality. In addition, the sparsity of high-dimensional data makes it difficult to detect low-density areas accurately.

Another challenge for high-dimensional data is the correlation among dimensions. Because of this correlation, outlier detection for high-dimensional data often needs to analyse many dimensions jointly to make a judgment about outliers. The correlation also means that the intrinsic dimensionality is lower than the original dimension, which motivates dimension reduction-based methods for outlier detection. There are two common strategies. The first uses dimension reduction as a feature extraction tool: after mapping the data to lower dimensions, a conventional full-space outlier detection method is applied [10, 5]. With fewer dimensions, the problems caused by the curse of dimensionality can usually be alleviated. The second uses dimension reduction models to capture normal patterns; the reconstruction error, which can be interpreted as the deviation from the normal patterns, is then used to measure the outlier degree of the data [7, 22, 24]. Relying on only one strategy may leave some outliers undetected. Figure 1 illustrates this by mapping 2-d points to the solid line: with only the first strategy, point $A$ mixes with the other points after projection; with only the second strategy, point $B$ is ignored because it has low information loss. A better approach is therefore to combine the two strategies.

Figure 1.

A synthetic outlier detection dataset.

One way to combine low-dimensional information and reconstruction error is to use probabilistic dimension reduction models directly. Such models often provide a measure related to probability density that can be viewed as a combination of low-dimensional information and reconstruction error, and this measure can be used to detect outliers. The authors of [25] use energy-based models to model the data, then use the energy score, which is proportional to the probability density, to recognize outliers. They confirmed that the energy score performs better than the reconstruction error. Other probabilistic models, such as Variational Autoencoders (VAE) [15], can also be used. However, this naive combination often cannot accurately distinguish between low-density normal samples and outliers, because outliers present during learning also tend to be assigned large density values.

In order to accurately detect outliers in high-dimensional data, we propose an outlier detection method based on VAE, which better combines low-dimensional information and the reconstruction error of dimension reduction. We choose VAE because it is one of the most advanced probabilistic models that can be used for dimension reduction and can be trained directly by the Stochastic Gradient Descent (SGD) algorithm. To the best of our knowledge, this is the first attempt to apply VAE to outlier detection. Specifically, we first use a VAE to model the data. Then, we extract four outlier scores from the loss function of the VAE, the low-dimensional representation, and the reconstruction error. Finally, we propose a selectively weighted outlier ensemble algorithm to integrate the four outlier scores. We conduct experiments on six real-world datasets to validate the effectiveness of the method. The experiments show that the proposed method performs better than or comparably to state-of-the-art methods.

The remainder of the paper is organized as follows. In Section 2, we briefly review previous related work. Section 3 introduces the notation and problem formulation. In Section 4, we describe the proposed method. We empirically evaluate the method in Section 5. Finally, we conclude the paper in Section 6.

2. Related work

Outlier detection has been extensively studied [6]. According to their assumptions about outliers, conventional full-space outlier detection methods can be classified as statistical methods, nearest neighbor-based methods, clustering-based methods, one-class classification methods, and so on. Statistical methods assume that outliers lie in low-density areas, and they often involve density estimation. Density estimation methods include model-based methods and non-parametric methods. Model-based methods first estimate the parameters of a model, then compute the density through the model. Non-parametric methods do not need to estimate parameters and compute the density directly from the data. Commonly used non-parametric density estimation methods are the histogram and kernel density estimation (KDE). For outlier detection, there is actually no need to estimate the density accurately; an approximation of the density, or some measure that has a monotonic relationship with density, is enough. Following this idea, Isolation Forest (iForest) [18] uses the tree depth of a random space partition as the measure of outlier degree, and the tree depth is monotonic with respect to density. Nearest neighbor-based (NN-based) methods assume that normal data have dense neighborhoods and outliers have sparse neighborhoods. NN-based methods can also be interpreted as using nearest neighbors to approximate the density. To handle the case in which normal data instances have different densities, Local Outlier Factor (LOF) [4] compares the density of a point with the densities of its neighbors, where the density is again approximated using nearest neighbors. Clustering-based methods assume that normal data instances form clusters; data instances that are far from clusters or belong to small clusters are considered outliers. As outliers are only a by-product of clustering, and the clustering methods used are not optimized for outlier detection, the efficiency and effectiveness of clustering-based methods are relatively low. A typical one-class classification method is the one-class support vector machine (OC-SVM) [23], which constructs a hyperplane to separate most of the data instances from the others.

All the above methods are full-space methods, which cannot handle high-dimensional data well. Existing outlier detection methods for high-dimensional data mainly include feature selection-based methods and feature transformation-based methods. Feature selection-based methods, also known as subspace outlier detection [3], aim to detect outliers in some feature subset. They usually include two components: subspace selection and outlier degree computation. Because the number of subspaces grows exponentially with the number of features, this kind of method is not feasible for datasets with a large number of features.

Feature transformation-based methods first learn a transformation that maps the data to a simple representation (often with much lower dimension). There are then two strategies for outlier detection: the first uses the reconstruction error of the transformation to measure the outlier degree of the data [7, 22, 24]; the second applies a conventional outlier detection method to the transformed data [10, 5]. The feature transformation method used in [7] is kernel PCA, while [22, 24] use Autoencoders. To improve outlier detection performance, the training method of the Autoencoders in [24] is modified to consider the separation between potential normal data instances and outliers, using a binary clustering method on the reconstruction error. To some extent, the method of [24] can be considered a one-class classification method. The authors of [10] first use a Deep Belief Network (DBN) to transform the data to a low-dimensional representation, then apply OC-SVM to detect outliers. They found that OC-SVM with a linear kernel can obtain good outlier detection performance. The method in [5] first uses Autoencoders for dimension reduction, then uses KDE to detect outliers. The method in [25] is an exception that can be considered a combination of the two strategies, and it is the most similar to ours. It is based on the Energy-Based Model (EBM), a probabilistic graphical model that learns the distribution of the data and the transformed representation. The authors of [25] use the energy score of the EBM to detect outliers and confirm that the energy score performs better than information loss. In contrast to [25], our method is based on VAE, which can be trained directly by SGD, while EBM cannot. Moreover, our method explicitly combines the reconstruction error and the transformed representation, while [25] uses the approximation of density provided by the model, which is much more easily affected by overfitting.

3. Notation and problem formulation

We follow the notation in [11]. Matrices and vectors are written as boldface uppercase letters and boldface lowercase letters, respectively. Random variables are set in roman; the others are italic.

Let $\mathcal{X}\subseteq\mathbb{R}^{d}$ denote the input space, where $d$ is the input dimension. Given an unlabeled dataset $\bm{X}=\{\bm{x}^{(1)},\ldots,\bm{x}^{(n)}\}$, the outlier detection task in this paper is to learn an outlier scoring function $s:\mathcal{X}\to\mathbb{R}$. The value of $s(\bm{x})$ indicates the outlier degree of data point $\bm{x}$: the larger the value of $s(\bm{x})$, the more likely $\bm{x}$ is an outlier. We argue that an outlier score provides more information than a binary label, as it gives the priority of data processing, which is important for outlier detection applications. An outlier score can also be transformed into a binary label by setting an appropriate threshold, but not vice versa.

Other notation is as follows: $\mathbf{z}\in\mathbb{R}^{m}$ denotes the latent variable, where $m$ is the dimension of $\mathbf{z}$ and is much lower than $d$; $p$ and $q$ denote probability density functions; $D_{\text{KL}}[p\|q]$ denotes the Kullback-Leibler (KL) divergence between two distributions; and $r_{s}(\bm{s}_{1},\bm{s}_{2})$ denotes the Spearman correlation coefficient between two lists $\bm{s}_{1}$ and $\bm{s}_{2}$.

4. Proposed method

In this section, we first give a brief introduction of VAE, then describe the outlier scores induced from VAE. This is then followed by the weighted outlier ensemble algorithm.

4.1 Preliminary

VAE is one of the most promising techniques for unsupervised learning. It can be considered a combination of a probabilistic graphical model (PGM) and a neural network, and it has the advantages of both. Essentially, a VAE is a PGM with latent variables that aims to learn the dependence between data and latent variables. VAEs use a directed structure, so the joint probability density factorizes as:

$\displaystyle p(\bm{x},\bm{z})=p(\bm{z})p(\bm{x}|\bm{z})$ (1)

where $p(\bm{z})$ is the prior distribution of $\bm{z}$, which is set to $\mathcal{N}(0,\bm{I})$. The choice of $p(\bm{x}|\bm{z})$ depends on the data: for continuous data, $p(\bm{x}|\bm{z})$ is set to a Gaussian distribution; for binary data, it is set to a Bernoulli distribution. Since $\bm{x}$ can be inferred given $\bm{z}$, $p(\bm{x}|\bm{z})$ can be considered a decoder. In many cases we need the posterior distribution $p(\bm{z}|\bm{x})$, which is the truly hard part of a PGM. Following the paradigm of variational inference, VAE uses an easily computed distribution $q(\bm{z}|\bm{x})$ to approximate the true posterior distribution, and $q(\bm{z}|\bm{x})$ is also set to a Gaussian distribution. Since $\bm{z}$ can be inferred given $\bm{x}$, $q(\bm{z}|\bm{x})$ can be considered an encoder.

Neural networks in VAE are used to approximate the parameters of $p(\bm{x}|\bm{z})$ and $q(\bm{z}|\bm{x})$. Let $\phi$ denote the parameters of the neural network used in $p(\bm{x}|\bm{z})$ and $\theta$ the parameters of the neural network used in $q(\bm{z}|\bm{x})$. We denote the mean and covariance of $q(\bm{z}|\bm{x})$ as $\mu(\bm{x};\theta)$ and $\Sigma(\bm{x};\theta)$, respectively.

VAE optimizes the following loss function with respect to data point $\bm{x}$ :

$\displaystyle\mathcal{L}(\bm{x};\theta,\phi)=\mathbb{E}_{\bm{z}\sim q(\bm{z}|\bm{x})}\log p(\bm{x}|\bm{z})-D_{\text{KL}}[q(\mathbf{z}|\mathbf{x})\|p(\mathbf{z})]$ (2)

$\displaystyle\leqslant\log p(\bm{x})$ (3)

$\mathcal{L}(\bm{x};\theta,\phi)$ is a lower bound of the log likelihood, also called the evidence lower bound (ELBO). The first term of Eq. (2) can be viewed as a (negative) reconstruction error, but it is computed as an expectation over the approximate posterior rather than by collapsing to a single point. This property ensures that the space of $\bm{z}$ has better continuity. The second term of Eq. (2) keeps the approximate posterior distribution $q$ close to the prior distribution $p(\bm{z})$. The second term is easy to compute, as the KL divergence between two Gaussian distributions has a closed-form solution. The first term has to be computed by sampling. However, once $\bm{z}$ has been sampled, $\mathcal{L}$ no longer depends explicitly on $\theta$, so we cannot compute the gradient with respect to $\theta$. To solve this problem, VAE introduces the reparameterization trick: first sample $\bm{\epsilon}\sim\mathcal{N}(0,\bm{I})$, then apply the following transformation:

$\displaystyle\bm{z}=\mu(\bm{x};\theta)+\Sigma^{1/2}(\bm{x};\theta)\bm{\epsilon}$ (4)

This transformation does not change the distribution of $\bm{z}$ and keeps the dependency on $\theta$ . Finally, we can optimize $\theta$ and $\phi$ together using SGD algorithm or other first-order optimization algorithms.
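As an illustration of Eq. (4), the following minimal sketch draws samples for the diagonal-covariance case using only the Python standard library. The function name `reparameterize` and the toy values of the mean and variance are assumptions made for this example, not part of the original implementation.

```python
import math
import random

def reparameterize(mu, var, rng):
    """Draw z = mu + sqrt(var) * eps with eps ~ N(0, I) (diagonal covariance)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]

rng = random.Random(0)
mu, var = [1.0, -2.0], [0.25, 4.0]   # hypothetical encoder outputs for one x
samples = [reparameterize(mu, var, rng) for _ in range(20000)]

# The empirical moments should be close to mu and var: the transformation
# leaves the distribution of z unchanged while keeping mu and var explicit.
n = len(samples)
emp_mean = [sum(s[j] for s in samples) / n for j in range(2)]
emp_var = [sum((s[j] - emp_mean[j]) ** 2 for s in samples) / n for j in range(2)]
```

Because the randomness enters only through $\bm{\epsilon}$, the sample is a deterministic function of the encoder outputs, which is what allows gradients to reach $\theta$ in an automatic differentiation framework.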

Figure 2.

VAE structure.

The structure of the VAE we use is shown in Fig. 2. We choose one hidden layer, as it is enough to guarantee the universal approximation property of neural networks [13]. We assume that the components of $\bm{z}$ are mutually independent, and likewise the components of $\hat{\bm{x}}$. Although this limits the power of the model, it brings several benefits. First, the independence assumption keeps the model simple, as we can put the complexity into the neural network rather than explicitly using a complex PGM structure. Second, the number of parameters to optimize is dramatically reduced: for a Gaussian distribution, the off-diagonal elements of the covariance can be set to zero. Third, with the independence assumption the latent representation becomes simple, which benefits outlier detection. When $p(\bm{x}|\bm{z})$ is a Gaussian distribution, we cannot actually learn the covariance, because maximum likelihood would drive the covariance arbitrarily close to zero. The covariance of $p(\bm{x}|\bm{z})$ can therefore only be specified as a hyper-parameter, and the output of the decoder is the mean of the Gaussian distribution. When $p(\bm{x}|\bm{z})$ is a Bernoulli distribution, the output of the decoder is also the mean of the distribution.

4.2 Outlier scores

After learning a probabilistic model, the most straightforward way to do outlier detection is to use the probability density as the outlier score. However, there are two reasons why this is not accurate enough. First, accurately estimating the density of high-dimensional data is quite difficult. Second, maximum likelihood tends to give a large density value to every point in the training set, which makes the outliers in the training set hard to detect. To detect outliers more accurately, we extract four outlier scores from the VAE, each from a different perspective.

The first two are computed directly from the loss function of the VAE. The first one is the sampling estimate of the first term of Eq. (2):

$\displaystyle s_{1}(\bm{x})=-\frac{1}{k}\sum_{i=1}^{k}\log p(\bm{x}|\bm{z}_{i})$ (5)

where $\bm{z}_{i}\sim q(\bm{z}|\bm{x})$ , $k$ is the number of samples. The second outlier score is the second term of Eq. (2):

$\displaystyle s_{2}(\bm{x})=D_{\text{KL}}[q(\mathbf{z}|\bm{x})\|p(\mathbf{z})]$ (6)

$\displaystyle=\frac{1}{2}\Big(\operatorname{tr}(\Sigma(\bm{x}))+\|\mu(\bm{x})\|_{2}^{2}-m-\log|\Sigma(\bm{x})|\Big)$ (7)
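The closed form of the second score can be checked with a short sketch for the diagonal-covariance case, where the trace is the sum of the variances and the log-determinant is the sum of their logarithms. The mean enters through its squared Euclidean norm, as in the standard closed form of the KL divergence between a Gaussian and $\mathcal{N}(0,\bm{I})$. The function name and the toy inputs are assumptions for illustration.

```python
import math

def kl_to_standard_normal(mu, var):
    """KL[N(mu, diag(var)) || N(0, I)] = 0.5 * (sum(var) + ||mu||^2 - m - sum(log var))."""
    m = len(mu)
    trace = sum(var)                          # tr(Sigma) for a diagonal Sigma
    sq_norm = sum(x * x for x in mu)          # squared Euclidean norm of the mean
    log_det = sum(math.log(v) for v in var)   # log|Sigma| for a diagonal Sigma
    return 0.5 * (trace + sq_norm - m - log_det)

# A posterior equal to the prior gives zero divergence; shifting the mean
# by 3 in one coordinate adds 0.5 * 3^2 = 4.5.
s2_prior = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
s2_shifted = kl_to_standard_normal([3.0, 0.0], [1.0, 1.0])
```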

The third outlier score is the histogram density estimate in the reduced space. VAE actually learns the posterior distribution; the most commonly used way to obtain features from a distribution is to take its mean, and we follow this choice. We denote the reduced data as $\bm{z}_{\mu}=\mu(\bm{x})$, so the reduced dataset can be denoted as $\bm{Z}=\{\bm{z}_{\mu}^{(1)},\ldots,\bm{z}_{\mu}^{(n)}\}$. As described in the previous section, when modeling the data we assume that the components of $\bm{z}$ are mutually independent. This makes the correlation between different dimensions of $\bm{z}$ weak, so we can use the following approximation to estimate the density:

$\displaystyle p_{\bm{Z}}(\bm{z})\approx p(z_{1})p(z_{2})\ldots p(z_{m})$ (8)

The problem is now reduced to estimating the density of one-dimensional data. There are many methods for this, such as kernel density estimation and the histogram. We choose the histogram because it is simple and effective. Considering that not every dimension of $\bm{z}$ is informative, we weight the density computed by the histogram with the standard deviation of each dimension. Because every dimension of $\bm{z}$ is optimized to approximate $\mathcal{N}(0,1)$, the standard deviation is a good measure of the information captured by that dimension, i.e., the extent to which a change in this dimension affects the data. The third outlier score can thus be written as:

$\displaystyle s_{3}(\bm{x})=-\sum_{i=1}^{m}\sigma_{z_{i}}\log p_{\text{hist}}(z_{i}(\bm{x}))$ (9)
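A minimal sketch of Eq. (9) is given below. It builds one equal-width histogram per latent dimension and weights each log-density by the standard deviation of that dimension. The number of bins, the small floor that avoids taking the logarithm of zero, and the synthetic latent means are assumptions of this sketch; the paper does not specify them.

```python
import math
import random

def histogram_density(values, bins=10):
    """Return a function giving the equal-width histogram density of 1-D data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0           # guard against a constant dimension
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    n = len(values)

    def density(v):
        idx = min(max(int((v - lo) / width), 0), bins - 1)
        # floor avoids log(0) for empty bins (an assumption of this sketch)
        return max(counts[idx] / (n * width), 1e-12)

    return density

def s3_scores(Z, bins=10):
    """Z: list of latent means (each of length m). Returns one score per row."""
    m = len(Z[0])
    cols = [[z[j] for z in Z] for j in range(m)]
    sigmas = []
    for c in cols:
        mean = sum(c) / len(c)
        sigmas.append(math.sqrt(sum((v - mean) ** 2 for v in c) / len(c)))
    dens = [histogram_density(c, bins) for c in cols]
    return [-sum(sigmas[j] * math.log(dens[j](z[j])) for j in range(m)) for z in Z]

rng = random.Random(1)
Z = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
Z.append([6.0, 6.0])                          # an isolated point in latent space
scores = s3_scores(Z)
```

In this toy example the isolated point falls in a nearly empty bin in both dimensions, so it receives the largest score.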

The last outlier score is the reconstruction error when $\bm{z}=\mu(\bm{x})$ :

$\displaystyle s_{4}(\bm{x})=-\log p(\bm{x}|\mu(\bm{x}))$ (10)

$s_{4}$ can be considered the complement of $s_{3}$. It is worth noting that $s_{4}$ is different from $s_{1}$: $s_{1}$ is an expectation over the distribution, which measures the fit of the model, while $s_{4}$ collapses to a single point.
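The difference between $s_{1}$ and $s_{4}$ can be illustrated with a toy Gaussian decoder of fixed isotropic variance. The linear `decode` function below is a hypothetical stand-in for a trained decoder network, and all numeric values are assumptions chosen so that the input is reconstructed exactly at the posterior mean.

```python
import math
import random

def neg_log_gaussian(x, x_hat, sigma2):
    """-log N(x; x_hat, sigma2 * I) for a decoder with fixed isotropic variance."""
    d = len(x)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return sq_err / (2.0 * sigma2) + 0.5 * d * math.log(2.0 * math.pi * sigma2)

def decode(z):
    """Hypothetical decoder mean; a real model would use a trained network."""
    return [z[0], 2.0 * z[0], -z[0]]

def s4(x, mu, sigma2):
    """Reconstruction error at the single point z = mu (Eq. 10)."""
    return neg_log_gaussian(x, decode(mu), sigma2)

def s1(x, mu, var, sigma2, k, rng):
    """Average reconstruction error over k samples z_i ~ q(z|x) (Eq. 5)."""
    total = 0.0
    for _ in range(k):
        z = [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]
        total += neg_log_gaussian(x, decode(z), sigma2)
    return total / k

rng = random.Random(0)
x, mu, var = [1.0, 2.0, -1.0], [1.0], [0.04]
score_point = s4(x, mu, sigma2=0.1)
score_expect = s1(x, mu, var, sigma2=0.1, k=1000, rng=rng)
```

Here the point estimate has zero squared error, while the sampled estimate also pays for the spread of the posterior; with these toy values the expected gap is $6\cdot 0.04/(2\cdot 0.1)=1.2$.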

4.3 Ensemble method

Given several outlier scores, a simple average ensemble often does not perform well, because the scores are often not equally important. We therefore propose a Weighted Outlier Ensemble (WOE) method to aggregate the four outlier scores. The pseudo-code of WOE is shown in Algorithm 4.3. The input to Algorithm 4.3 is the outlier scores of all data points, i.e. $\bm{s}_{1}=[s_{1}(\bm{x}^{(1)}),\ldots,s_{1}(\bm{x}^{(n)})]$. $\bm{s}_{2}$, $\bm{s}_{3}$ and $\bm{s}_{4}$ are defined in the same way with the corresponding outlier score functions.

Algorithm 4.3 Weighted Outlier Ensemble (WOE)
Input: four outlier score lists $\bm{s}_{1},\bm{s}_{2},\bm{s}_{3},\bm{s}_{4}$
Output: aggregated final outlier score list $\bm{s}$
$\textsf{scores}\leftarrow[\bm{s}_{1},\bm{s}_{2},\bm{s}_{3},\bm{s}_{4}]$
scale scores to $[0,1]$
$\textsf{pseudo\_target}\leftarrow\textsf{sum}(\textsf{scores})$
sort scores by $r_{s}$ to pseudo_target // descending order
$\bm{s}\leftarrow\bm{0}$
$i\leftarrow 0$
while $i<\textsf{length}(\textsf{scores})$ do
    $\bm{s}\leftarrow\bm{s}+\frac{1}{i+1}*\textsf{scores}[i]$
    $i\leftarrow i+1$
end while
return $\bm{s}$

The basic idea of WOE is to first construct a pseudo-target, then assign weights according to the correlation between the pseudo-target and each outlier score. In unsupervised outlier detection there is no ground truth, so we follow the method in [20] and use the simple average ensemble as the pseudo-target. This pseudo-target serves as the benchmark against which the outlier scores are sorted in descending order of Spearman's rank correlation coefficient. We use Spearman's rank correlation coefficient because it can measure non-linear correlation, and the exact values of the scores do not matter. The position of each score in the sorted order is used to assign its weight: the lower a score ranks, the lower its weight. The weight we use is the reciprocal of the rank, which is also used in [20], although there it is applied to data points rather than to score lists.
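The steps of Algorithm 4.3 can be sketched in plain Python as follows. This is an illustrative reading of the pseudo-code, not the original implementation: the helper names, the tie handling in the rank computation, and the toy score lists are assumptions.

```python
def minmax(v):
    """Scale a score list to [0, 1]."""
    lo, hi = min(v), max(v)
    span = (hi - lo) or 1.0
    return [(x - lo) / span for x in v]

def average_ranks(v):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa > 0 and sb > 0 else 0.0

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

def woe(score_lists):
    scores = [minmax(s) for s in score_lists]
    n = len(scores[0])
    pseudo_target = [sum(s[i] for s in scores) for i in range(n)]
    # Most correlated score list first, so it receives the largest weight.
    scores.sort(key=lambda s: spearman(s, pseudo_target), reverse=True)
    final = [0.0] * n
    for i, s in enumerate(scores):
        w = 1.0 / (i + 1)                     # reciprocal-of-rank weight
        final = [f + w * x for f, x in zip(final, s)]
    return final

a = [0.1, 0.2, 0.3, 0.9]
b = [1.0, 2.0, 2.5, 8.0]
c = [0.5, 0.4, 0.6, 0.7]
d = [3.0, 1.0, 2.0, 0.0]                      # a score that disagrees with the others
combined = woe([a, b, c, d])
```

In this toy input the dissenting list `d` is the least correlated with the pseudo-target and therefore receives the smallest weight, so the last point, which the other three lists agree on, keeps the highest combined score.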

5. Empirical evaluation

In this section, we will briefly introduce the datasets and the compared methods which are used in the experiments. Afterwards, experimental results are evaluated and analyzed.

5.1 Datasets

To test our method, we use six real-world datasets: (1) Cardio; (2) Satellite; (3) Epileptic Seizure Recognition (ESR); (4) Human Activity Recognition (HAR); (5) MNIST [16]; (6) Internet Advertisements (IAds). Except for MNIST, the other five datasets are all from the UCI Machine Learning Repository [17]. As all the datasets were designed for classification, we apply the following transformations to generate data for outlier detection:

• Cardio. The dataset contains three classes (normal, suspect, and pathologic). For outlier detection, we use the data generated by Outlier Detection DataSets (ODDS) [1]: the normal class forms the inliers, the pathologic (outlier) class is downsampled to 176 points, and the suspect class is discarded.
• Satellite. The data of the three large classes (1, 3, 7) are used as inliers, and the data of the three small classes (2, 4, 5) are used as outliers. This setting is also used in [18].
• ESR. The data of classes 2, 3, 4, and 5 are from subjects who did not have an epileptic seizure, so we choose these classes as normal data. The data of class 1 are from subjects who had an epileptic seizure, and are chosen as outliers.
• HAR. We choose three motion classes of activity (walking, up-stairs, and down-stairs) to generate the outlier detection dataset. The walking data form the inliers, and we subsample data from the other two classes to form the outliers.
• MNIST. We choose the data of two similar digits, 4 and 9, to form the inliers, then subsample the data of the other digits to obtain the outliers.
• IAds. We use the original data for outlier detection. Non-advertisement instances are used as inliers, and advertisements are used as outliers. We first discard the first three continuous attributes, as their proportion of missing values is over 10%, then discard 15 instances that have missing values in other attributes. The remaining attributes are all binary.

Table 1 summarizes the main characteristics of the datasets used. All the data are scaled to [0, 1]. The labels are used only for evaluation.

Table 1
Datasets summary

Dataset	$d$	$n$	#outliers (%)
Cardio	21	1831	176 (9.6%)
Satellite	36	6435	2036 (31.6%)
ESR	178	11500	2300 (20.0%)
HAR	561	1812	90 (5.0%)
MNIST	784	14507	725 (5.0%)
IAds	1555	3264	454 (13.9%)

5.2 Experimental settings

We use leaky rectified linear units (leaky ReLU), $\max(x,0.1x)$, in the hidden layers of the VAE. The covariance of $p(\bm{x}|\bm{z})$ is set to the square of the mean value of $\{\sigma(x_{1}),\ldots,\sigma(x_{d})\}$, since we argue that the output deviation should be close to the input deviation. We set the dimension of $\bm{z}$ to $\lceil\sqrt{d}\rceil$, and the dimension of $\bm{h}$ to twice that of $\bm{z}$. As the architecture is important for a neural network, we also perform experiments to test the sensitivity to the dimensionality of $\bm{z}$. We train the VAEs using the Adam [14] optimizer with a mini-batch size of 64 for 500 epochs; the other parameters use their default values. We implement our method with Python and Tensorflow [2].
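For concreteness, the layer sizes implied by this rule for the six datasets in Table 1 can be computed as in the sketch below. The function names are assumptions for illustration; `math.isqrt` is used to obtain $\lceil\sqrt{d}\rceil$ without floating-point rounding.

```python
import math

def layer_sizes(d):
    """Latent size m = ceil(sqrt(d)) and hidden size h = 2 * m."""
    m = math.isqrt(d - 1) + 1      # integer form of ceil(sqrt(d)) for d >= 1
    return m, 2 * m

def leaky_relu(x, slope=0.1):
    """Leaky ReLU max(x, slope * x) with the slope 0.1 used in the text."""
    return max(x, slope * x)

# Input dimensions of Cardio, Satellite, ESR, HAR, MNIST, IAds (Table 1).
sizes = {d: layer_sizes(d) for d in (21, 36, 178, 561, 784, 1555)}
```

For example, MNIST with $d=784$ gives a 28-dimensional latent space and 56 hidden units.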

The baseline methods include LOF [4], Isolation Forest (iForest) [18], one-class Support Vector Machine (OC-SVM) [23], PCA, AE, DRAE [24], and the EBM-based method [25]. The first three are conventional full-space methods; the others are dimension reduction-based methods.

The experimental settings for the baseline methods are as follows. The number of neighbors of LOF is set to 10, which is a commonly used value in outlier detection. For OC-SVM, we use the Radial Basis Function (RBF) kernel with parameter $\gamma=0.1$, and $\nu=0.1$, which controls the upper bound on the fraction of outliers. For iForest, we use the configuration recommended in [18]. For the sake of comparability, AE, DRAE, and EBM use the same number of hidden layer units as the VAE, and the number of components of PCA is the same as the dimension of $\bm{z}$. The activation function of the hidden layer for AE and DRAE is also leaky ReLU with the same parameter as in the VAE, and the output layer is sigmoid. The outlier score function is the reconstruction error for PCA, AE, and DRAE, and the energy score for EBM. The training method and parameters are also the same as for the VAE, except that the mini-batch size of DRAE is set to 100, as recommended by [24].

For LOF, iForest, OC-SVM, and PCA, we use the implementations in Scikit-Learn [19]. For the other three methods, we implement them using Tensorflow.

For all experiments, we use the area under the receiver operating characteristic curve (AUC) as the metric to evaluate the performance of outlier detection. ROC curves plot the true positive rate against the false positive rate. Outlier detection methods often generate a score that indicates the degree of abnormality, and this score can also be used as the priority of data processing for human analysts. Intuitively, AUC measures the rank accuracy of putting outliers ahead of normal data, and it is extensively used to evaluate outlier detection methods [6].
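The ranking interpretation of AUC can be made concrete with a small sketch that computes it as the fraction of outlier-inlier pairs in which the outlier receives the higher score, counting ties as one half. The function name and the toy scores are assumptions for illustration; in practice a library routine would be used.

```python
def auc(scores, labels):
    """AUC as the probability that a random outlier outscores a random inlier."""
    pos = [s for s, y in zip(scores, labels) if y == 1]   # outliers
    neg = [s for s, y in zip(scores, labels) if y == 0]   # inliers
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5                               # ties count one half
    return wins / (len(pos) * len(neg))

auc_perfect = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])     # every outlier ranked first
auc_mixed = auc([0.9, 0.3, 0.5, 0.1], [1, 1, 0, 0])       # one of four pairs inverted
```

This pairwise form is quadratic in the number of points and is meant only to show the definition.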

5.3 Results and discussions

The experimental results are shown in Table 2. No single method performs best on all datasets. To find significant differences among the results, we carry out a statistical analysis. We first use Friedman's test to determine whether there are significant differences among these methods. If there are statistically significant differences, we then use Holm's method as a post hoc test to compare the control method with the remaining ones. These statistical tools are often used to compare machine learning methods [8, 12], and the authors of [10] also carry out statistical analysis in the area of outlier detection. Following [10], the significance level $\alpha$ in this paper is set to 0.1.

Table 2
The AUC values of the experiments. For stochastic methods, we run the experiments ten times and report the mean and standard deviation of the AUC values. The best results are marked in bold.

	AUC
	LOF	iForest	OC-SVM	PCA	AE	DRAE	EBM	VAE
Cardio	0.6011	0.9301 $\pm$ 0.0100	0.8647	0.8327	0.6726 $\pm$ 0.0589	0.9396 $\pm$ 0.0117	0.9422 $\pm$ 0.0010	0.9384 $\pm$ 0.0037
Satellite	0.5137	0.7047 $\pm$ 0.0123	0.5003	0.6705	0.6872 $\pm$ 0.0140	0.7172 $\pm$ 0.0405	0.6375 $\pm$ 0.0060	0.7600 $\pm$ 0.0080
ESR	0.8648	0.9884 $\pm$ 0.0007	0.9889	0.9810	0.9820 $\pm$ 0.0006	0.9876 $\pm$ 0.0019	0.9892 $\pm$ 0.0000	0.9893 $\pm$ 0.0004
HAR	0.9803	0.8447 $\pm$ 0.0098	0.8172	0.9206	0.9343 $\pm$ 0.0122	0.8662 $\pm$ 0.0232	0.8642 $\pm$ 0.0087	0.9562 $\pm$ 0.0046
MNIST	0.6476	0.8481 $\pm$ 0.0106	0.7776	0.8947	0.8200 $\pm$ 0.0042	0.8640 $\pm$ 0.0047	0.9164 $\pm$ 0.0056	0.9336 $\pm$ 0.0083
IAds	0.5908	0.6868 $\pm$ 0.0280	0.6760	0.6535	0.6575 $\pm$ 0.0750	0.5544 $\pm$ 0.1492	0.7192 $\pm$ 0.0030	0.8048 $\pm$ 0.0069

Tables 3–5 show the results of the statistical analysis, which are generated by the tool provided in [21]. Table 3 reports the average ranking of these methods. The proposed VAE-based method ranks first. The results of Friedman's test are shown in Table 4. The null hypothesis is rejected, which means that there are significant differences among these outlier detection methods. Therefore, we carry out Holm's method as a post hoc test using VAE as the control method. The results of Holm's method, which are shown in Table 5, reveal that the control method (VAE) is statistically better than LOF, OC-SVM, PCA, AE, and DRAE. Although Holm's method shows no significant difference for iForest and EBM, VAE performs better than iForest on all datasets, especially on the high-dimensional ones, and VAE outperforms EBM by a large margin on 4 of the 6 datasets, with comparable performance on the other two.

Table 3

Average ranking of outlier detection methods based on the AUC values

Method	VAE	EBM	iForest	DRAE	AE	PCA	OC-SVM	LOF
Ranking	1.71	2.86	4.00	4.57	5.00	5.14	6.00	6.71

Table 4

Results of Friedman’s tests based on the AUC values

Statistic	$p$ -value	Result
4.73739	0.00055	$H_{0}$ is rejected

Table 5

Results of the Holm’s method based on the AUC values (Using VAE as the control method)

Comparison	Statistic	$p$ -value	Result
VAE vs LOF	3.81881	0.00094	$H_{0}$ is rejected
VAE vs OC-SVM	3.27327	0.00638	$H_{0}$ is rejected
VAE vs PCA	2.61861	0.04414	$H_{0}$ is rejected
VAE vs AE	2.50951	0.04836	$H_{0}$ is rejected
VAE vs DRAE	2.18218	0.08729	$H_{0}$ is rejected
VAE vs iForest	1.74574	0.16171	$H_{0}$ is accepted
VAE vs EBM	0.87287	0.38273	$H_{0}$ is accepted
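The $p$-values in Table 5 are consistent with a Holm step-down correction applied to two-sided normal $p$-values of the listed statistics. The sketch below reproduces them under that assumption; treating the Statistic column as a standard normal $z$-score is our reading of the table rather than something stated in the text, and the function names are illustrative.

```python
import math

def two_sided_p(z):
    """Two-sided p-value of a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running = max(running, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running
    return adjusted

# Statistics copied from Table 5 (VAE as the control method).
z_stats = {"LOF": 3.81881, "OC-SVM": 3.27327, "PCA": 2.61861, "AE": 2.50951,
           "DRAE": 2.18218, "iForest": 1.74574, "EBM": 0.87287}
names = list(z_stats)
adj = dict(zip(names, holm_adjust([two_sided_p(z_stats[n]) for n in names])))
rejected = [n for n in names if adj[n] < 0.1]   # significance level alpha = 0.1
```

With $\alpha=0.1$, the hypotheses for LOF, OC-SVM, PCA, AE, and DRAE are rejected, matching the Result column.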

Compared with the three full-space methods, sophisticated dimension reduction-based methods such as the EBM-based method and the VAE-based method perform better by a large margin. Among the dimension reduction-based methods, our VAE-based method and the EBM-based method perform better than the others, as they combine low-dimensional information and the reconstruction error of dimension reduction to detect outliers, while the other methods rely only on reconstruction error.

We also compare different combination methods based on VAE: the sampling estimate of the ELBO (ELBO), the average of the proposed four outlier scores (AVG), and our ensemble method (WOE). The experimental results are shown in Fig. 3. WOE performs better or comparably on five of the six datasets; the exception is IAds, where its performance is only slightly worse. The results also show that ELBO is the worst combination method in most cases. We argue that this is because the loss function itself is used as the outlier score: during optimization, outliers tend to obtain a high ELBO value as a result of overfitting.

Figure 3.

Comparison of ensemble methods. The experiments are run ten times; we plot the boxplots for every method and dataset.

We test the sensitivity to the dimensionality of $\bm{z}$ by halving and doubling it, then comparing the AUC of outlier detection. The results are shown in Fig. 4. Halving the dimensionality may cause performance degradation, while doubling it has little or no effect on the final results. This suggests that the values used in the previous experiments are appropriate.

Figure 4.

The sensitivity of the number of hidden layer units. The experiments are run ten times; we plot the mean AUC value for every setting and dataset. The dimension of $\bm{z}$ is set to $0.5\times\lceil\sqrt{d}\rceil$, $1\times\lceil\sqrt{d}\rceil$, and $2\times\lceil\sqrt{d}\rceil$ respectively. The dimension of $\bm{h}$ is kept at twice that of $\bm{z}$.

6. Conclusion

In this paper, we have proposed an outlier detection method based on VAE. The method first uses a VAE to model the data, then distills four outlier scores from the VAE: the two parts of the VAE loss function, the low-dimensional representation induced by the VAE, and the reconstruction error of dimension reduction. Finally, we propose an ensemble method that combines the four outlier scores to detect outliers. We conduct experiments on six real-world datasets. The experiments show that the proposed method performs better than, or at least comparably to, state-of-the-art methods. In the future, we will apply the idea of combining low-dimensional information with the information loss to other dimension reduction methods, such as EBM and Bidirectional Generative Adversarial Networks (BiGAN) [9].

Acknowledgments

This work was supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000101), the National Natural Science Foundation of China (Grant No. 61379052), the Science Foundation of Ministry of Education of China (Grant No. 2018A02002), and the Natural Science Foundation for Distinguished Young Scholars of Hunan Province (Grant No. 14JJ1026).

References

1. Outlier detection datasets (ODDS), 2016. Accessed: 12 July 2017.

2. Abadi, Barham, Chen, Davis, Dean, Devin, Ghemawat, Irving, Isard, Kudlur, Levenberg, Monga, Moore, Murray D.G., Steiner, Tucker, Vasudevan, Warden, Wicke and Zheng, TensorFlow: A system for large-scale machine learning, In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, Berkeley, CA, USA, 2016, USENIX Association, pp. 265–283.

3. Aggarwal C.C., High-dimensional outlier detection: The subspace method, In Outlier Analysis, Springer, 2013, pp. 135–167.

4. Breunig M.M., Kriegel H.-P., Ng R.T. and Sander, LOF: Identifying density-based local outliers, In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD ’00, New York, NY, USA, 2000, ACM, pp. 93–104.

5. Cao V.L., Nicolau and McDermott, A hybrid autoencoder and density estimation model for anomaly detection, In Handl, Hart, Lewis P.R., López-Ibáñez, Ochoa and Paechter, editors, Parallel Problem Solving from Nature – PPSN XIV, Cham, 2016, Springer International Publishing, pp. 717–726.

6. Chandola, Banerjee and Kumar, Anomaly detection: A survey, ACM Comput Surv 41(3) (July 2009), 15:1–15:58.

7. Chapel and Friguet, Anomaly detection with score functions based on the reconstruction error of the kernel PCA, In Calders, Esposito, Hüllermeier and Meo, editors, Machine Learning and Knowledge Discovery in Databases, Berlin, Heidelberg, 2014, Springer Berlin Heidelberg, pp. 227–241.

8. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (Jan 2006), 1–30.

9. Donahue, Krähenbühl and Darrell, Adversarial feature learning, In International Conference on Learning Representations (ICLR), 2017.

10. Erfani S.M., Rajasegarar, Karunasekera and Leckie, High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning, Pattern Recognition 58 (2016), 121–134.

11. Goodfellow, Bengio and Courville, Deep learning, MIT Press, 2016.

12. Hatamlou, Black hole: A new heuristic optimization approach for data clustering, Information Sciences 222 (2013), 175–184. Including Special Section on New Trends in Ambient Intelligence and Bio-inspired Systems.

13. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks 4(2) (1991), 251–257.

14. Kingma and Ba, Adam: A method for stochastic optimization, In International Conference on Learning Representations (ICLR), 2015.

15. Kingma D.P. and Welling, Auto-encoding variational Bayes, In International Conference on Learning Representations (ICLR), 2014.

16. LeCun, Bottou, Bengio and Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11) (Nov 1998), 2278–2324.

17. Lichman, UCI machine learning repository, 2013.

18. Liu F.T., Ting K.M. and Zhou Z.-H., Isolation-based anomaly detection, ACM Trans Knowl Discov Data 6(1) (Mar 2012), 3:1–3:39.

19. Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot and Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011), 2825–2830.

20. Rayana and Akoglu, Less is more: Building selective anomaly ensembles, ACM Trans Knowl Discov Data 10(4) (May 2016), 42:1–42:33.

21. Rodríguez-Fdez, Canosa, Mucientes and Bugarín, STAC: a web platform for the comparison of algorithms using statistical tests, In Proceedings of the 2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2015.

22. Sakurada and Yairi, Anomaly detection using autoencoders with nonlinear dimensionality reduction, In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, MLSDA’14, New York, NY, USA, 2014, ACM, pp. 4:4–4:11.

23. Tax D.M. and Duin R.P., Support vector data description, Machine Learning 54(1) (Jan 2004), 45–66.

24. Xia, Cao, Wen, Hua and Sun, Learning discriminative reconstructions for unsupervised outlier removal, In 2015 IEEE International Conference on Computer Vision (ICCV), Dec 2015, pp. 1511–1519.

25. Zhai, Cheng and Zhang, Deep structured energy based models for anomaly detection, In Balcan M.F. and Weinberger K.Q., editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, New York, New York, USA, 20–22 Jun 2016, PMLR, pp. 1100–1109.

26. Zimek, Schubert and Kriegel