A robust anomaly detection algorithm based on principal component analysis

Abstract

Quantifying how abnormal each instance in a data set is, in order to detect outlying instances, is a central issue in unsupervised anomaly detection research. In this paper, we propose a robust anomaly detection method based on principal component analysis (PCA). Traditional PCA-based detection algorithms commonly suffer from a high false alarm rate. The main reason is that they ignore the differences in location and scale among the components of the outlier score, so the cumulated outlier score deviates from its true value. To address this issue, we introduce the median and the Median Absolute Deviation (MAD) to rescale each outlier score that is mapped onto the corresponding principal direction. The outlier score of an instance is then obtained as a weighted sum of the rescaled squared scores. We also address how the weight of each outlier score should be assigned. The main advantages of our approach are that it is easy to build from unlabeled data and that its recognition performance is better than that of the classical PCA-based methods. In our experimental analysis, we compare our method with five anomaly detection techniques, including two traditional PCA-based methods. The experimental results show that the proposed method performs well in terms of effectiveness, efficiency, and robustness.

Keywords

Anomaly detection, principal component analysis (PCA), location and scale, median absolute deviation (MAD)

1. Introduction

The detection of outlying instances within data sets has received much attention in machine learning and data mining research. The process of identifying abnormalities, which have different patterns from the majority of instances in a data set, is usually referred to as anomaly detection or outlier detection. An early and common application of anomaly detection is removing contaminated instances. This is important because many machine learning algorithms are so sensitive to abnormalities that errors may arise in pattern recognition tasks. This procedure is commonly known as data cleaning. However, extended applications of anomaly detection, such as fraud detection, intrusion detection, and fault diagnosis, have also received much attention. In these applications, the abnormalities usually contain interesting information, unknown patterns, or novel records. Existing anomaly detection algorithms fall roughly into two major categories: those that identify anomalies in univariate data (such as the box plot rule and Gaussian model-based methods) [1] and those that identify them in multivariate data sets [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. In practice, univariate or even bivariate data are rare in real applications. Thus, multivariate detection techniques have received far more attention. In this paper, we limit ourselves to multivariate anomaly detection techniques.

Many multivariate detection methods have been proposed in recent decades, including neighbor-based methods (e.g. $k$-NN [2, 3], LOF [4], and LOCI [5]), cluster-based methods [6, 7], PCA-based methods [8, 9, 10, 11, 12, 13, 14, 15, 16], information-theoretic methods [18, 19], Isolation Forest ($i$Forest) [20, 21, 22, 23, 24], classification-based methods (e.g. one-class SVM [17]), and deep learning-based methods (e.g. GANomaly [26]). Classification-based and deep learning-based methods are usually trained on normal instances. These models learn only the distribution of the normal instances, so they cannot be applied in unsupervised scenarios [27]. For neighbor-based methods, the execution time is dominated by the nearest-neighbor search, which is often time-consuming in high-dimensional spaces [27, 28, 29]. Information-theoretic methods rely on information measures such as entropy, conditional entropy, information gain, and mutual information to build an anomaly detection model. However, these information-theoretic definitions can only be applied to discrete variables; continuous or numerical variables must be discretized first. Owing to its high efficiency and effectiveness, $i$Forest is often considered the preferred choice for anomaly detection. Still, $i$Forest has a high time cost on data sets with high-dimensional features. Classical $i$Forest also suffers from the ‘blind spot’ effect [22]: abnormalities that lie in empty regions of the space surrounded by normal instances are identified as normal. The hybrid isolation forest [22] was proposed to address this issue. Although several extensions of $i$Forest have been proposed recently [23, 24] to improve the discrimination of outliers, the extra operations increase the time cost. Because PCA-based methods are simple and effective, many researchers have started to pay attention to them.

A new definition of abnormalities based on PCA was introduced in [8]. The definition computes the abnormal degree as the sum of weighted squares of the principal component scores, so the top $n$ instances with the largest cumulative values are considered abnormalities. The definition is interesting but overlooks the importance of rescaling each component of the outlier score, and consequently the approach often leads to deviating decision results. Here, we give an example to show the problem intuitively. Consider a group of observation points D in a 3-dimensional feature space, and suppose we want to determine whether an observation $\bm{x}$ in D is an abnormality. Suppose the outlier score of $\bm{x}$ is [5.78, 0.56, 0.79]${}^{\text{T}}$, which is calculated by projecting $\bm{x}$ onto the three principal components. The first principal component score has the maximum value, i.e. 5.78 $>$ 0.79 $>$ 0.56. Since the other two scores are far smaller than the first, the first principal component plays the critical role in deciding whether the observation $\bm{x}$ is an outlier. In other words, the effect of the other two component scores is masked. But if the outlier score matrix of D is

$\displaystyle\left[\begin{array}{ccc}5.41&0.90&0.51\\ 5.98&0.83&0.49\\ 6.32&0.74&0.54\end{array}\right]_{3\times 3},$

the z-scores of $\bm{x}$ obtained by mean removal and variance scaling are [$-$0.243, $-$1.34, 1.48]${}^{\text{T}}$. Clearly, the last two principal components, rather than the first, play the critical role in anomaly detection.
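The z-scores quoted above can be checked numerically. The following minimal sketch rests on two assumptions that the text does not state: that $\bm{x}$ is standardized together with the three rows of the score matrix, and that the sample standard deviation (divisor $n-1$) is used. Under these assumptions the quoted values are reproduced to the stated precision.

```python
from statistics import mean, stdev

# Outlier scores of the three observations in D (rows) on the three
# principal directions (columns), as given in the text.
D_scores = [
    [5.41, 0.90, 0.51],
    [5.98, 0.83, 0.49],
    [6.32, 0.74, 0.54],
]
x_scores = [5.78, 0.56, 0.79]

# Assumption: x is standardized together with the rows of D.
rows = D_scores + [x_scores]

z = []
for k in range(3):
    column = [row[k] for row in rows]
    # stdev() is the sample standard deviation (divisor n - 1).
    z.append((x_scores[k] - mean(column)) / stdev(column))

print([round(v, 3) for v in z])
```

After standardization the first component is the least extreme of the three, even though its raw score is by far the largest.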

To overcome the above problem, we propose a novel PCA-based approach that standardizes each outlier score. Our approach builds on the method in [8]. Its main contribution is to introduce the median and the MAD to rescale each principal component score. We also resolve the issue of how to assign the weight of each outlier score. The new PCA-based approach is a very simple mechanism, and we show in the following that it is both effective and efficient.

In the remainder of this paper, we first review related work on PCA-based anomaly detection methods in Section 2. In Section 3, we introduce two traditional PCA-based anomaly detection algorithms and their drawbacks, and discuss our proposed algorithm in detail. In Section 4, we perform extensive experimental evaluations to demonstrate the advantages of our method. They cover the performance of identifying abnormalities, the time cost of the different methods, the robustness on high-dimensional data, and the applicability when the training data set contains normal instances only. In Section 5, we conclude the paper.

2. Related work

In this section, we review the previous PCA-based anomaly detection work and illustrate the advantages of our method.

In recent decades, many anomaly detection methods based on PCA have been proposed [8, 9, 10, 11, 12, 13, 14, 15, 16]. The classical PCA approach has several advantages: it is a simple, non-parametric technique with low time consumption, and it does not need any assumptions on the distribution of the data [8]. However, traditional PCA also has some problems, so many PCA-based anomaly detection algorithms focus on developing robust PCA methods to overcome its shortcomings. For instance, typical PCA is generally applied when the principal components have a linear relation. In the non-linear case, kernel PCA has been used. In [12, 13], the principal components of the data distribution were extracted by kernel PCA, and the distance between each data point and the corresponding principal component was then computed to measure anomalies. To handle high-dimensional data, some new robust PCA methods have been proposed [15, 16]. Typical PCA is also sensitive to deviated instances, which leads to incorrect principal directions being extracted. Two approaches that develop robust covariance estimation were introduced in [10] and [11], respectively. To measure the degree of deviation, these methods use the outlier score model proposed in [8]. Besides, to meet computation and memory requirements in big data scenarios, an online oversampling PCA method was developed in [14] to identify deviated instances; the anomalies are spotted by the variation of the principal directions.

In this paper, we focus on the use of classical PCA for anomaly detection and establish a new computational model of the outlier score. The major advantages of our method are its simple calculation and good anomaly detection performance. In particular, our method can determine anomalies accurately when the correct principal directions are extracted by improved PCA methods.

3. PCA-based methods

3.1 Principal component analysis

PCA is a statistical analysis method that reduces the dimension of data to a few comprehensive principal components. The reduction retains as much of the important information in the original variables as possible, where the information is usually measured by the variance. Principal components are linear combinations of the original variables that aim to explain the multivariate data structure. They are easy to compute through eigenanalysis of the covariance matrix of the original variables.

From now on, $\bm{\Sigma}$ is a $p\times p$ covariance matrix obtained from $n$ observations of $p$ random variables $X_{1},X_{2},\cdots,X_{p}$. $\lambda_{1},\lambda_{2},\cdots,\lambda_{p}$ are the $p$ distinct eigenvalues of $\bm{\Sigma}$, and $\bm{e}_{1},\bm{e}_{2},\cdots,\bm{e}_{p}$, $\bm{e}_{i}\in\mathbb{R}^{p}$, are the corresponding orthonormal eigenvectors. The $i^{\text{th}}$ principal component $z_{i}$ of an observation $\bm{x}\in\mathbb{R}^{p}$ can then be computed by $z_{i}=\bm{e}_{i}^{\text{T}}\bm{x}$. In practical applications, different variables are measured on different scales, which may distort the principal components. To solve this problem, the mean and the standard deviation (see Eq. (1)) are often used to rescale the variables.

$\displaystyle X^{\prime}_{i}=\frac{X_{i}-\mu_{i}}{\delta_{i}}$ (1)

where $\mu_{i}$ is the mean of the $i^{\text{th}}$ variable and $\delta_{i}$ denotes its standard deviation.

Classical PCA is sensitive to deviated data instances because the mean and the variance are attracted toward outlying observations. One solution is to develop robust PCA approaches. In past decades, various robust PCA methods have been proposed [10, 11, 12, 13, 14, 15, 16, 17]. These methods help obtain the correct principal components, and robust PCA helps improve the discrimination of outliers [30]. In our method, we use the original PCA to transform the feature space of the data, but the proposed method also improves anomaly detection performance when the correct principal components are extracted. We design an experiment to check this situation (see Section 4.5).

Figure 1.

A scatter plot of 200 observations in the two-dimensional plane, where the red points denote the normal observations, which are drawn from a standard normal distribution. The blue point is clearly an outlier.

3.2 Classical PCA-based methods and the proposed method

PCA is usually regarded as an effective data reduction method for machine learning or data mining tasks rather than as an anomaly detection tool. However, because the PCA algorithm is easy to understand and does not impose many restrictions on the data, more and more studies identify outlying instances based on PCA. In some cases, the abnormalities are immediately revealed in the principal component space. For instance, Fig. 1 shows a scatter plot of several observations in 2-dimensional space. The point in the lower-left corner is an obvious outlier, but it is not outlying in either of the univariate dimensions. The outlier is therefore easily identified if all observations are projected onto the NE-SW diagonal. In practice, the direction of the NE-SW diagonal is the direction of the second principal component of the observations. Hence, mapping the data onto the right principal direction may make it easier to identify outliers. For PCA-based anomaly detection, modeling the outlier score function in the principal component space is a key issue. The authors of [8] compute the outlier score as the sum of weighted squares of the principal components of an observation $\bm{x}$, i.e.

$\displaystyle\sum_{i=1}^{p}\frac{z_{i}^{2}}{\lambda_{i}}=\frac{z_{1}^{2}}{\lambda_{1}}+\frac{z_{2}^{2}}{\lambda_{2}}+\cdots+\frac{z_{p}^{2}}{\lambda_{p}}.$ (2)

Here $z_{i}$ denotes the magnitude of the projection of $\bm{x}$ onto the $i^{\text{th}}$ principal direction, so the absolute value of $z_{i}$ can be used to quantify the abnormal degree of $\bm{x}$ in the $i^{\text{th}}$ principal direction. Accumulating the abnormal degrees over the chosen principal directions then gives the outlier score of $\bm{x}$. Alternatively, the outlier score can be given by the sum of the distances between $\bm{x}$ and the respective principal components, i.e.,

$\displaystyle\sum_{i=1}^{p}\frac{\text{dist}(\bm{x},\bm{e}_{i})^{2}}{\lambda_{i}}=\frac{\text{dist}(\bm{x},\bm{e}_{1})^{2}}{\lambda_{1}}+\frac{\text{dist}(\bm{x},\bm{e}_{2})^{2}}{\lambda_{2}}+\cdots+\frac{\text{dist}(\bm{x},\bm{e}_{p})^{2}}{\lambda_{p}}.$ (3)

In general, the Euclidean distance is used for $\text{dist}({\bm{x},\bm{e}_{i}})$. The larger the outlier score computed by Eq. (2) or (3), the more likely the corresponding instance is abnormal. However, these two methods may produce erroneous results because they neglect to scale each component of the outlier score. The z-score can be used to solve the problem. Assume $\bm{S}=[{\bm{s}_{1},\bm{s}_{2},\cdots,\bm{s}_{p}}]\in\mathbb{R}^{n\times p}$ is the outlier score matrix given by PCA, where $\bm{s}_{i}=[{z_{i1}^{2},z_{i2}^{2},\cdots,z_{in}^{2}}]^{\text{T}}$, $i=1,2,\cdots,p$, is the $i^{\text{th}}$ component of the outlier score. The z-score $s^{\prime}_{ij}$ $(j=1,2,\cdots,n)$ can then be computed by the equation

$\displaystyle s^{\prime}_{ij}=\frac{z_{ij}^{2}-\text{E}(\bm{s}_{i})}{\sqrt{\text{Var}(\bm{s}_{i})}},$ (4)

where $\text{E}({\bm{s}_{i}})=\frac{1}{n}\sum_{j=1}^{n}z_{ij}^{2}$ denotes the mean, and $\text{Var}({\bm{s}_{i}})$ is the variance.

However, the mean is not robust; it is often attracted toward outlying observations. As a result, the outlier scores standardized with Eq. (4) will be small. An example is presented in [30], and we refer to it here to reveal the non-robustness of the mean intuitively. Consider a simple univariate sample with five measurements {5.27, 5.34, 5.25, 5.31, 5.28}. Suppose that the last measurement has been contaminated, so the measurements become {5.27, 5.34, 5.25, 5.31, 52.8}. The z-scores of the contaminated sample are {$-$0.4483, $-$0.4450, $-$0.4492, $-$0.4464, 1.7889}. The largest value is only 1.7889, which is not much larger in magnitude than the other rescaled values. The reason is that the mean is attracted toward the contaminated value, which also inflates the standard deviation. To improve the precision of anomaly detection, our method uses Eq. (5), with the median and the MAD, to rescale the outlier scores.

$\displaystyle s^{\prime}_{ij}=\frac{z_{ij}^{2}-\text{median}(\bm{s}_{i})}{\text{MAD}}$ (5)

where the MAD is computed from the median of all absolute deviations from the median:

$\displaystyle\text{MAD}=1.4826\times\mathop{\text{median}}\limits_{j=1,\cdots,n}\left|z_{ij}^{2}-\text{median}(\bm{s}_{i})\right|.$ (6)

In the above formula, the constant 1.4826 is a correction factor that makes the MAD unbiased at the normal distribution [30]. For the contaminated measurements above, the values rescaled by Eq. (5) are {$-$0.6745, 0.5059, $-$1.0117, 0, 800.79}. Clearly, the last rescaled value, which belongs to the contaminated measurement, is far greater than the others.
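The contrast between the two standardizations can be reproduced on the five contaminated measurements. The sketch below is illustrative only: the function names are ours, and the classical z-score is assumed to use the sample standard deviation, which matches the values quoted in the text.

```python
from statistics import mean, median, stdev

def z_scores(values):
    """Classical standardization: subtract the mean, divide by the sample std."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def robust_z_scores(values, c=1.4826):
    """Robust standardization with the median and the MAD, as in Eqs. (5)-(6)."""
    med = median(values)
    # MAD = c * median(|v - median|); assumed non-zero here.
    mad = c * median([abs(v - med) for v in values])
    return [(v - med) / mad for v in values]

contaminated = [5.27, 5.34, 5.25, 5.31, 52.8]
print([round(v, 4) for v in z_scores(contaminated)])
print([round(v, 4) for v in robust_z_scores(contaminated)])
```

The robust scores are signed. The contaminated measurement receives a score of about 800.79, whereas its classical z-score stays below 2.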

Algorithm 1 Robust anomaly detection based on PCA
Inputs: training data set $\textbf{D}_{\text{train}}=[{\bm{x}_{1},\cdots,\bm{x}_{n}}]^{\text{T}}\in\mathbb{R}^{n\times p}$, test data set $\textbf{D}_{\text{test}}=[{\bm{y}_{1},\cdots,\bm{y}_{m}}]^{\text{T}}\in\mathbb{R}^{m\times p}$
1. Calculate the principal directions $\bm{e}_{1},\bm{e}_{2},\cdots,\bm{e}_{p}$ and the corresponding eigenvalues $\lambda_{1},\lambda_{2},\cdots,\lambda_{p}$ of $\textbf{D}_{\text{train}}$
2. Let $\textbf{M}=\emptyset$, $\textbf{MAD}=\emptyset$
   for $j=1$ to $p$ do
      Let $\bm{S}=\emptyset$
      for $i=1$ to $n$ do
         Calculate the projection of $\bm{x}_{i}$ onto the $j^{\text{th}}$ principal direction, i.e., $z_{j}=\bm{e}_{j}^{\text{T}}\bm{x}_{i}$
         Let $\bm{S}=\bm{S}\cup z_{j}^{2}$
      end for
      Compute the median $m_{j}$ and the MAD $\textit{mad}_{j}$ of $\bm{S}$
      Let $\textbf{M}=\textbf{M}\cup m_{j}$, $\textbf{MAD}=\textbf{MAD}\cup\textit{mad}_{j}$
      for $i=1$ to $m$ do
         Calculate $\hat{z}_{j}=\bm{e}_{j}^{\text{T}}\bm{y}_{i}$, and then use Eq. (5) to rescale the square of $\hat{z}_{j}$
      end for
   end for
   Obtain the outlier score of each $\bm{y}_{i}$ by Eq. (7) or (9)
3. Output: the outlier scores of $\bm{y}_{1},\cdots,\bm{y}_{m}$

Figure 2.

Left: pairwise scatterplots of the first three components of the outlier scores (top: SATIMAGE-2 data set; bottom: WINE data set). Non-outliers are blue filled circles and outliers are yellow filled circles. Right: pairwise scatterplots of the last three components of the outlier scores.

Algorithm 1 presents the pseudo-code of our method. Unlike Eqs (2) and (3), we multiply the standardized principal component scores by the corresponding eigenvalues $\lambda_{1},\lambda_{2},\cdots,\lambda_{p}$. The reasons are twofold. First, in the case of classification, the larger the variance of a principal component, the larger the between-class scatter. In other words, the classes are more separable in the space of the top few principal components, so deviated instances are easier to identify there. Second, the first few principal components carry more of the information that matters for anomaly detection, whereas the principal components that carry little information should be regarded as noise.

An intuitive example is shown in Fig. 2. Normal and abnormal observations are easier to discriminate in the space of the first three components of the outlier scores. In contrast, when projected onto the last three components, the observations overlap. This phenomenon exists in most data sets from the Outlier Detection Data Sets (ODDS) [31]. Thus, we propose a new way of computing the outlier score:

$\displaystyle\sum_{i=1}^{p}\lambda_{i}s^{\prime}_{i}=\lambda_{1}s^{\prime}_{1}+\lambda_{2}s^{\prime}_{2}+\cdots+\lambda_{p}s^{\prime}_{p},$ (7)

where $\{{s^{\prime}_{1},s^{\prime}_{2},\cdots,s^{\prime}_{p}}\}$ are the outlier scores standardized by the median and the MAD.
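The scoring part of Algorithm 1 with Eq. (7) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the eigenvectors and eigenvalues are assumed to be precomputed from centered training data, and every MAD is assumed to be non-zero.

```python
from statistics import median

def project_sq(data, e):
    """Squared projections (e^T x)^2 of every row x onto the direction e."""
    return [sum(ek * xk for ek, xk in zip(e, x)) ** 2 for x in data]

def pca_mad_scores(train, test, eigvecs, eigvals, c=1.4826):
    """Eq. (7): eigenvalue-weighted sum of median/MAD-rescaled squared scores.

    eigvecs[j] is the j-th unit eigenvector and eigvals[j] its eigenvalue;
    both are assumed to be precomputed from the training data.
    """
    scores = [0.0] * len(test)
    for e, lam in zip(eigvecs, eigvals):
        s_train = project_sq(train, e)
        med = median(s_train)
        mad = c * median([abs(s - med) for s in s_train])  # assumed non-zero
        # Rescale the test scores with the training median and MAD (Eq. (5)).
        for i, s in enumerate(project_sq(test, e)):
            scores[i] += lam * (s - med) / mad
    return scores

# Toy, already centered data whose principal directions are the coordinate axes.
train = [[-2.0, 0.1], [-1.0, -0.2], [0.0, 0.0], [1.0, 0.2], [2.0, -0.1]]
test = [[0.5, 0.1], [6.0, 0.0]]
eigvecs = [[1.0, 0.0], [0.0, 1.0]]
eigvals = [2.5, 0.025]  # sample variances of the two training columns
scores = pca_mad_scores(train, test, eigvecs, eigvals)
print(scores)  # the second test point receives the far larger score
```

In practice the eigenpairs would come from an eigendecomposition of the training covariance matrix; they are hard-coded here only to keep the sketch self-contained.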

It is well known that the eigenvalues reflect the dispersion of the original data in the corresponding principal directions. The eigenvalues corresponding to the first principal components are greater than the others, that is, $\lambda_{1}>\lambda_{2}>\cdots>\lambda_{p}>0$. However, in terms of dispersion, the standardized outlier scores $\bm{S}^{\prime}=[\bm{s}^{\prime}_{1},\bm{s}^{\prime}_{2},\ldots,\bm{s}^{\prime}_{p}]$, $\bm{s}^{\prime}_{i}=[s^{\prime}_{i1},s^{\prime}_{i2},\ldots,s^{\prime}_{in}]^{\text{T}}$, may not satisfy $\text{Var}({\bm{s}^{\prime}_{1}})>\text{Var}({\bm{s}^{\prime}_{2}})>\cdots>\text{Var}({\bm{s}^{\prime}_{p}})$. The assignment of the weights to the outlier scores therefore needs to be corrected. Let $\lambda_{i}^{\ast}$ be the new weight of the $i^{\text{th}}$ standardized outlier score, defined as:

$\displaystyle\lambda_{i}^{\ast}=\lambda_{k},\quad k=\mathop{\psi}\limits_{\text{Var}(\bm{s}^{\prime}_{i})\in\bm{\Lambda}}(\text{Var}(\bm{s}^{\prime}_{i})),$ (8)

where $\bm{\Lambda}$ denotes the set $\{{\text{Var}({\bm{s}^{\prime}_{1}}),\cdots,\text{Var}({\bm{s}^{\prime}_{p}})}\}$ sorted from large to small, and $\psi(\cdot)$ returns the index of the given value in $\bm{\Lambda}$. The cumulated outlier score is then given by

$\displaystyle\sum_{i=1}^{p}\lambda_{i}^{\ast}s^{\prime}_{i}=\lambda_{1}^{\ast}s^{\prime}_{1}+\lambda_{2}^{\ast}s^{\prime}_{2}+\cdots+\lambda_{p}^{\ast}s^{\prime}_{p}.$ (9)
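The weight reassignment of Eq. (8) and the cumulated score of Eq. (9) can be illustrated with a short sketch. The function names and the toy numbers are hypothetical; ties between variances are broken by component order, which the text does not specify.

```python
from statistics import pvariance

def reassign_weights(eigvals, rescaled_columns):
    """Eq. (8): give the k-th largest eigenvalue to the component whose
    rescaled scores have the k-th largest variance."""
    sorted_eigvals = sorted(eigvals, reverse=True)
    variances = [pvariance(col) for col in rescaled_columns]
    # Component indices ordered from largest to smallest variance.
    order = sorted(range(len(variances)), key=lambda i: variances[i], reverse=True)
    weights = [0.0] * len(eigvals)
    for rank, i in enumerate(order):
        weights[i] = sorted_eigvals[rank]
    return weights

def cumulated_scores(weights, rescaled_columns):
    """Eq. (9): weighted sum of the rescaled scores for every instance."""
    n = len(rescaled_columns[0])
    return [sum(w * col[j] for w, col in zip(weights, rescaled_columns))
            for j in range(n)]

# Three components, four instances; column i holds the rescaled scores s'_i.
eigvals = [4.0, 2.0, 1.0]
columns = [[0, 1, -1, 0], [0, 5, -5, 0], [0, 2, -2, 0]]
weights = reassign_weights(eigvals, columns)
print(weights)                              # [1.0, 4.0, 2.0]
print(cumulated_scores(weights, columns))   # [0.0, 25.0, -25.0, 0.0]
```

The second component has the most dispersed rescaled scores, so it receives the largest eigenvalue even though it is not the first principal component.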

In real applications, usually only some of the principal components are examined for outliers; Eq. (9) can therefore be rewritten as:

$\displaystyle\sum_{i=1}^{q}\lambda_{i}^{\ast}s^{\prime}_{i}=\lambda_{1}^{\ast}s^{\prime}_{1}+\lambda_{2}^{\ast}s^{\prime}_{2}+\cdots+\lambda_{q}^{\ast}s^{\prime}_{q},\quad(q<p).$ (10)

How to choose $q$ is a challenging question for our method. It is difficult to give a theoretical selection rule, so $q$ is usually chosen based on empirical rules. For instance, $q$ can be chosen as the smallest value such that the ratio of the average squared projection error to the total variation in the data is less than one percent. However, an outlier score model based on such a rule may not improve the discrimination of outliers in practical applications. Thus, our method is limited by the choice of $q$.
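The empirical rule mentioned above can be sketched as follows. The sketch relies on the standard PCA identity that, for centered data, the average squared projection error of the first $q$ components equals the sum of the discarded eigenvalues; the function name and the example eigenvalues are hypothetical.

```python
def choose_q(eigvals, max_residual=0.01):
    """Smallest q whose discarded eigenvalues carry less than max_residual
    of the total variance."""
    ordered = sorted(eigvals, reverse=True)
    total = sum(ordered)
    for q in range(1, len(ordered) + 1):
        # ordered[q:] are the eigenvalues of the discarded components.
        if sum(ordered[q:]) / total < max_residual:
            return q
    return len(ordered)

q = choose_q([6.0, 3.0, 0.95, 0.04, 0.01])
print(q)  # 3: the last two components carry 0.5% of the variance
```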

4. Empirical evaluation

4.1 Data sets and measurements

Here, we compare the discrimination ability of our methods with that of five effective methods on several benchmark data sets. These data sets are included in ODDS and are publicly available (they can be downloaded at http://odds.cs.stonybrook.edu/). Table 1 gives their details. As Table 1 shows, the deviated instances are fewer than the normal ones in all data sets, and the proportion of outliers is below 10% in most data sets (BREASTW, SATELLITE, and PIMA are special cases that exceed 30%). Each data set is split into 60 percent for training and 40 percent for testing. Each experiment is repeated 10 times, and the mean over the 10 trials is taken as the final result. All experiments are repeated independently with different random seeds. For each experiment, two common evaluation metrics are used to quantify the performance: the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) and Precision @ rank N (P@N) [31]. P@N is the precision of anomaly detection among the top $n$ instances. In this paper, we use all instances to compute this value, that is, $n$ is equal to the number of instances.
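For reference, both metrics can be computed directly from outlier scores and binary labels. The sketch below is a plain illustration with hypothetical function names: the AUC is computed from its pairwise-ranking definition, which is quadratic in the number of instances, and $n$ is left as a free parameter of P@N.

```python
def auc(scores, labels):
    """AUC as the probability that a random outlier (label 1) outscores a
    random normal instance (label 0); ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def precision_at_n(scores, labels, n):
    """Fraction of true outliers among the n highest-scoring instances."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    return sum(y for _, y in ranked[:n]) / n

scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
print(auc(scores, labels))                # 5 of 6 outlier/normal pairs are ordered correctly
print(precision_at_n(scores, labels, 2))  # one of the top two is a true outlier
```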

The benchmark algorithms include Histogram-based Outlier Score (HBOS) [32], Local Outlier Factor (LOF), $i$Forest, and the two classical PCA-based methods. LOF is designed to identify local outliers, while the other methods are common global anomaly detection methods. For HBOS, the outlier scores are computed by combining univariate histograms. For LOF, the outlier score of each instance is given by comparing the local density of the instance with that of its $k$ nearest neighbors. $i$Forest is a non-parametric method for multivariate anomaly detection that works by isolating instances [20, 21]. Isolation means ‘separating an instance from the rest of the instances’ [20]. In brief, $i$Forest isolates instances by randomly partitioning on a randomly selected feature, with the split value chosen at random between the maximum and minimum values of that feature. In [20, 21], the authors point out that anomalies are more susceptible to isolation, so they require fewer partitions than normal instances.

4.2 ROC and P@N performance

In this subsection, we evaluate the anomaly detectors on the nineteen benchmark data sets presented in Table 1. They include continuous-valued data sets, such as ANNTHYROID, THYROID, MAMMOGRAPHY, VOWELS, WINE, GLASS, and WBC; the others are discrete-valued data sets. The hardware specification of our experiments is given in Table 2. Note that the hardware environment affects only the execution time, not the anomaly detection performance. To distinguish between the two classical PCA-based algorithms in subsequent experiments, we name the PCA-based anomaly detector that relies on Eq. (3) PCA-1 and the one based on Eq. (2) PCA-2. For our methods, the evaluation models built with Eqs (7) and (9) are referred to as PCA-MAD and PCA-MAD++, respectively.

Table 1
The properties of the data used in the experiments

Data	Features	Instances	Outliers (%)
ARRHYTHMIA	274	452	14.6%
CARDIO	21	1831	9.6%
ANNTHYROID	6	7200	7.42%
BREASTW	9	683	35%
LETTER	32	1600	6.25%
THYROID	6	3772	2.5%
MAMMOGRAPHY	6	11183	2.32%
PIMA	8	768	35%
MUSK	166	3062	3.2%
OPTDIGITS	64	5216	3%
PENDIGITS	16	6870	2.27%
MNIST	100	7603	9.2%
SHUTTLE	9	49097	7%
SATELLITE	36	6435	32%
SATIMAGE-2	36	5803	1.2%
VOWELS	12	1456	3.4%
WINE	12	129	7.7%
GLASS	9	214	4.2%
WBC	30	148	4.1%

The AUCs of the seven anomaly detectors on all data sets are presented in Table 3. In terms of average performance, the PCA-MAD++ method is slightly worse than $i$Forest but better than the other anomaly detection methods. In particular, our methods perform better than the two classical PCA-based methods. For a rigorous comparison, the paired Wilcoxon rank-sum test is employed to evaluate the statistical significance of the results on each data set. In Table 3, the best result and those results that are not significantly worse than it at the 0.05 significance level are highlighted in bold and italics. The row “WIN” indicates how often the detection algorithm performs significantly better than the competing algorithms. In this rigorous comparison, we observe that PCA-MAD++, $i$Forest, and HBOS perform better than the two classical PCA-based methods and LOF. The results show that our algorithms can effectively identify abnormalities.

Table 2

The hardware specification used in the experiments

Specification	Value
Platform	PC
OS	Microsoft Windows 10 Home Premium
CPU	Intel i5-7200U @ 2.50GHz
RAM	16GB
Core	Double core
Software	Spyder 3.2.4
Python	Python 3.6.2

Note that the hardware influences the execution time only; the ROC and P@N results are not affected.

Table 3

ROC performance of seven different anomaly detectors

Data	PCA-MAD++	PCA-MAD	HBOS	IFOREST	PCA-1	LOF	PCA-2
ARRHYTHMIA	0.783	0.786	0.822	0.817	0.765	0.779	0.782
CARDIO	0.923	0.942	0.835	0.921	0.884	0.574	0.950
ANNTHYROID	0.673	0.677	0.613	0.818	0.636	0.716	0.668
BREASTW	0.981	0.982	0.983	0.985	0.972	0.470	0.959
LETTER	0.649	0.645	0.593	0.627	0.808	0.859	0.528
THYROID	0.947	0.959	0.955	0.978	0.933	0.760	0.955
MAMMOGRAPHY	0.869	0.867	0.836	0.866	0.862	0.731	0.887
PIMA	0.655	0.659	0.700	0.680	0.670	0.627	0.648
MUSK	1.000	1.000	1.000	1.000	0.994	0.529	1.000
OPTDIGITS	0.650	0.502	0.873	0.718	0.494	0.450	0.509
PENDIGITS	0.873	0.891	0.924	0.945	0.764	0.470	0.935
MNIST	0.845	0.854	0.574	0.796	0.788	0.716	0.853
SHUTTLE	0.994	0.994	0.985	0.997	0.982	0.526	0.990
SATELLITE	0.713	0.691	0.758	0.708	0.637	0.557	0.599
SATIMAGE-2	0.999	0.999	0.980	0.995	0.984	0.458	0.982
WINE	0.825	0.776	0.900	0.818	0.556	0.905	0.801
VOWELS	0.817	0.817	0.673	0.756	0.908	0.941	0.603
GLASS	0.731	0.721	0.739	0.754	0.684	0.864	0.675
WBC	0.918	0.917	0.952	0.928	0.933	0.935	0.916
MEAN	0.834	0.825	0.826	0.848	0.803	0.677	0.802
WIN	7	5	8	9	0	5	5

Further, P@N is used to evaluate the detection precision for the abnormalities. As shown in Table 4, $i$Forest outperforms the other competing algorithms, and our proposed algorithms come second. In the significance test, $i$Forest and PCA-MAD++ perform similarly and outperform the other competing algorithms. According to Tables 3 and 4, PCA-MAD++ outperforms PCA-MAD. Besides, the PCA-2 detection algorithm performs better than PCA-1 in terms of P@N, although the two have similar average AUCs over all benchmark data sets. This indicates that the PCA-1 detector suffers from a high false alarm rate.

Table 4

P@N performance of seven different anomaly detectors

Data	PCA-MAD++	PCA-MAD	HBOS	IFOREST	PCA-1	LOF	PCA-2
ARRHYTHMIA	0.462	0.470	0.511	0.482	0.355	0.433	0.461
CARDIO	0.508	0.570	0.448	0.503	0.431	0.154	0.609
ANNTHYROID	0.242	0.250	0.286	0.331	0.204	0.228	0.243
BREASTW	0.919	0.919	0.930	0.920	0.910	0.251	0.931
LETTER	0.155	0.157	0.072	0.088	0.297	0.364	0.087
THYROID	0.328	0.354	0.511	0.558	0.265	0.144	0.343
MAMMOGRAPHY	0.233	0.204	0.120	0.231	0.256	0.208	0.249
PIMA	0.482	0.487	0.542	0.507	0.489	0.456	0.494
MUSK	1.000	1.000	0.978	0.951	0.766	0.170	0.980
OPTDIGITS	0.000	0.000	0.219	0.018	0.000	0.023	0.000
PENDIGITS	0.225	0.227	0.298	0.337	0.050	0.065	0.319
MNIST	0.430	0.409	0.119	0.291	0.343	0.334	0.385
SHUTTLE	0.938	0.907	0.955	0.958	0.863	0.142	0.950
SATELLITE	0.566	0.550	0.569	0.575	0.456	0.389	0.478
SATIMAGE-2	0.866	0.923	0.694	0.885	0.395	0.056	0.804
WINE	0.250	0.175	0.340	0.195	0.000	0.225	0.155
VOWELS	0.229	0.229	0.130	0.206	0.413	0.355	0.136
GLASS	0.123	0.073	0.000	0.073	0.173	0.148	0.073
WBC	0.542	0.551	0.582	0.500	0.456	0.519	0.477
MEAN	0.447	0.445	0.437	0.453	0.375	0.246	0.430
WIN	11	9	10	12	3	5	10

In sum, the results indicate that our methods perform better than the two classical PCA-based methods and LOF, and perform similarly to the state-of-the-art method, $i$Forest. However, $i$Forest randomly selects features to compute the outlier scores, so some important features may not be used. This means $i$Forest can fail to detect abnormalities in data sets with high-dimensional features. The proposed method is insensitive to this problem, and in this respect our methods perform more stably than $i$Forest in high-dimensional feature spaces. We design an experiment to verify this in Section 4.4.

4.3 Execution time

In this subsection, we focus on the execution time of the different methods on each data set, measured as CPU time. In particular, we consider the computational expense of the algorithms in both the training and test phases. The results are summarized in Fig. 3. We observe that the execution times of HBOS, the two classical PCA-based methods, and our algorithms are low on all data sets in both the training and test phases. The execution times of $i$Forest and LOF are low on most data sets, except for those with a large number of instances, such as SHUTTLE, MUSK, and OPTDIGITS. In conclusion, our algorithm can detect outliers quickly enough to meet the requirements of real applications.

Figure 3.

Execution time of seven outlier detectors on both training data sets and test data sets.

4.4 High dimensional data

Here, we evaluate how well our algorithms discriminate abnormalities compared with $i$Forest in a high-dimensional space. This is a challenging setting for anomaly detection algorithms, because increasing dimensionality makes the data sparse, an issue known as the curse of dimensionality. Most detection algorithms (e.g., the distance-based and density-based methods) fail to detect outliers in this scenario. Although our algorithms also suffer from this well-known problem, their performance can be improved by removing irrelevant features. We use a kurtosis-based feature selector [33] to remove the noise features and check whether our method improves the detection accuracy.
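The exact selection criterion is not spelled out in this section, so the following is only a minimal sketch of one plausible kurtosis-based selector. It assumes that features are ranked by their sample excess kurtosis (the moment estimator $m_4/m_2^2-3$) and that the $k$ highest-ranked features are kept. The function names `excess_kurtosis` and `select_by_kurtosis` are illustrative, not taken from the paper.

```python
import numpy as np

def excess_kurtosis(X):
    """Per-feature sample excess kurtosis, m4 / m2**2 - 3."""
    centered = X - X.mean(axis=0)
    m2 = np.mean(centered ** 2, axis=0)
    m4 = np.mean(centered ** 4, axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        g2 = m4 / m2 ** 2 - 3.0
    # Constant features (m2 == 0) carry no information: rank them last.
    return np.where(m2 > 0, g2, -np.inf)

def select_by_kurtosis(X, k):
    """Indices of the k features with the largest excess kurtosis (assumed criterion)."""
    return np.argsort(excess_kurtosis(X))[::-1][:k]

# Toy data: five Gaussian features, one of which contains a few extreme values.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
X[:10, 2] = 15.0  # contaminate feature 2 with outlying values

selected = select_by_kurtosis(X, k=2)
print(selected)  # feature 2 is ranked first because of its heavy tail
```

Pure Gaussian noise features have excess kurtosis close to zero, so under this assumption they tend to be ranked below features whose tails are inflated by outlying instances.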

To compare the robustness of the three methods as the number of noise features increases, we choose two data sets, CARDIO and SATIMAGE-2, on which the methods obtain similar AUCs. For each data set, 500 random features are simulated from ${\mathcal{N}}({0,\bm{\Sigma}})$, where the covariance matrix $\bm{\Sigma}=({\rho^{\left|{k-j}\right|}})_{1\le j,k\le 500}$ with $\rho=0.5$. Therefore, there are a total of 521 and 536 features in CARDIO and SATIMAGE-2, respectively.
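The noise-feature simulation described above can be sketched as follows. The code builds the covariance matrix with entries $\rho^{|k-j|}$, draws zero-mean Gaussian noise through its Cholesky factor, and appends the noise columns to the data. The function names and the random 21-feature placeholder matrix are assumptions made for illustration; the real CARDIO data are not loaded here.

```python
import numpy as np

def ar1_covariance(p, rho):
    """Return the p x p matrix whose (j, k) entry is rho ** |k - j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def append_noise_features(X, p=500, rho=0.5, seed=0):
    """Append p correlated Gaussian noise features drawn from N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    sigma = ar1_covariance(p, rho)
    chol = np.linalg.cholesky(sigma)  # sigma = chol @ chol.T
    # Rows of standard-normal draws times chol.T have covariance sigma.
    noise = rng.standard_normal((X.shape[0], p)) @ chol.T
    return np.hstack([X, noise])

# Placeholder with 21 features, standing in for a data set such as CARDIO.
X = np.random.default_rng(1).normal(size=(200, 21))
X_noisy = append_noise_features(X)
print(X_noisy.shape)  # (200, 521)
```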

Table 5
ROC performance of the proposed methods on training sets that contain both normal instances and anomalies

Data	PCA-MAD	PCA-MAD++
ARRHYTHMIA	0.813	0.810
CARDIO	0.955	0.937
ANNTHYROID	0.878	0.903
BREASTW	0.990	0.989
LETTER	0.655	0.659
THYROID	0.982	0.982
MAMMOGRAPHY	0.884	0.884
PIMA	0.713	0.713
MUSK	1.000	1.000
OPTDIGITS	0.586	0.717
PENDIGITS	0.900	0.883
MNIST	0.910	0.905
SHUTTLE	0.998	0.997
SATELLITE	0.783	0.784
SATIMAGE-2	0.999	0.999
WINE	0.949	0.936
VOWELS	0.838	0.838
GLASS	0.728	0.740
WBC	0.938	0.938
MEAN	0.868	0.874
WIN	15	15

Figure 4.

The ROC performance of our methods versus $i$Forest. The results show that our methods are insensitive to the irrelevant features. The x-axis denotes the number of features selected by the kurtosis-based method, and the y-axis denotes the AUC.

Table 6

P@N performance of the proposed methods on training sets that contain both normal instances and anomalies

Data	PCA-MAD	PCA-MAD++
ARRHYTHMIA	0.507	0.503
CARDIO	0.649	0.579
ANNTHYROID	0.451	0.520
BREASTW	0.933	0.933
LETTER	0.154	0.150
THYROID	0.595	0.640
MAMMOGRAPHY	0.295	0.333
PIMA	0.558	0.556
MUSK	1.000	1.000
OPTDIGITS	0.000	0.001
PENDIGITS	0.283	0.271
MNIST	0.562	0.535
SHUTTLE	0.965	0.961
SATELLITE	0.640	0.641
SATIMAGE-2	0.932	0.929
WINE	0.480	0.410
VOWELS	0.241	0.241
GLASS	0.123	0.123
WBC	0.551	0.562
MEAN	0.522	0.520
WIN	15	15

Figure 4 gives the detailed ROC performance of the three anomaly detectors. The ROC performance peaks when the subspace size is close to the original number of features. Thus, feature selection is useful for anomaly detection with high-dimensional features. The detection results of our methods are promising: as the dimensionality increases, their performance degrades less than that of $i$Forest. Hence, our methods are more robust than $i$Forest in a high-dimensional space.

4.5 Training using normal instances only

In real-world applications, abnormalities are often rare, so detection algorithms are usually trained with normal instances only. Thus, we further verify that our methods work well when the training set contains normal instances only. To this end, we remove the anomalies from the training sets shown in Table 1 and then evaluate the anomaly detectors on both abnormal and normal instances. The average AUCs and P@Ns are presented in Tables 5 and 6. We can see that our methods, when trained with normal instances only, achieve better detection performance than when trained with both normal instances and anomalies.
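The protocol above (train with or without anomalies, then evaluate on a test set containing both classes) can be sketched on synthetic data. This is a simplified illustration, not the paper's implementation. It assumes PCA via an SVD of the mean-centered training data, median/MAD rescaling of each projected score, and equal weights in the sum of squares, whereas the paper's weight assignment is not reproduced here. P@N is assumed to be the precision among the N highest-scored instances, with N the number of true anomalies.

```python
import numpy as np

def fit_pca_mad(X_train):
    """Fit principal directions, then the median and MAD of each projected score."""
    mean = X_train.mean(axis=0)
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    proj = (X_train - mean) @ vt.T
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0)
    return mean, vt, med, np.where(mad > 0, mad, 1.0)

def outlier_scores(model, X):
    """Sum of squared rescaled scores (equal weights: an assumption)."""
    mean, vt, med, mad = model
    z = ((X - mean) @ vt.T - med) / mad
    return np.sum(z ** 2, axis=1)

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney); assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def precision_at_n(scores, labels):
    """Fraction of true anomalies among the N top-scored instances."""
    n = int(labels.sum())
    top = np.argsort(scores)[::-1][:n]
    return labels[top].mean()

# Synthetic data: normal instances around 0, anomalies shifted to 6 (illustrative only).
rng = np.random.default_rng(0)
normal_train = rng.normal(size=(500, 5))
anomalies_train = rng.normal(loc=6.0, size=(25, 5))
X_test = np.vstack([rng.normal(size=(200, 5)), rng.normal(loc=6.0, size=(20, 5))])
y_test = np.r_[np.zeros(200, dtype=int), np.ones(20, dtype=int)]

training_sets = {
    "normal + anomalies": np.vstack([normal_train, anomalies_train]),
    "normal only": normal_train,  # anomalies removed from the training set
}
results = {}
for name, X_train in training_sets.items():
    scores = outlier_scores(fit_pca_mad(X_train), X_test)
    results[name] = (auc(scores, y_test), precision_at_n(scores, y_test))
    print(f"{name}: AUC={results[name][0]:.3f}, P@N={results[name][1]:.3f}")
```

On such well-separated toy data both training regimes are likely to score similarly; the sketch only shows the mechanics of the comparison and makes no claim about the size of the gap on the benchmark data sets.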

Other anomaly detection methods also perform better when the anomalies are removed from the training data set. This is expected, as the anomaly detectors build a more accurate outlier score model from training data without abnormalities. For instance, the PCA-based methods are very sensitive to anomalies because standard PCA is unable to extract the correct principal axes from a data set with abnormalities.

In sum, the results of the above experiment indicate that our methods achieve better anomaly detection performance when the correct principal directions are extracted. One way to extract the correct principal directions is to develop a robust PCA, which is meaningful research but beyond the scope of this paper.

5. Conclusion

In this paper, we propose a novel unsupervised anomaly detection approach that computes outlier scores based on PCA. The classical PCA-based methods fail to identify abnormalities with high accuracy because they do not take the location and scale of each score into account. In contrast to these traditional methods, we introduce the median and MAD to robustly rescale the outlier scores projected onto each principal direction, so that the true outlier score can be computed. We also present a new method to assign the weights of the univariate scores. The empirical comparison with five classical anomaly detection techniques on the benchmark data sets indicates that our approaches are superior in terms of runtime (especially on large data sets) and detection effectiveness. The proposed approach also performs well when the training data set contains normal instances only. Besides, our approach works better than $i$Forest on high-dimensional data with many irrelevant features. Future work will explore two topics: how to extend our model to robust PCA and how to select a few principal directions for better anomaly detection.

Acknowledgments

The authors thank the associate editor and two anonymous referees for their valuable suggestions. They are also very grateful to Xiaoguang Wei for the comments on a preliminary version of this paper. This research is partially supported by the Fundamental Research Funds for the Central Universities under Grants No. 2682017CX046 and No. A0920502052820-21, the Equipment Development Department Funds under grant 61403120304, and the Science and Technology Major Project of the Science and Technology Department of Sichuan Province (2018GZDZX0043).

References

1. Chandola, Banerjee and Kumar, Anomaly detection: a survey, ACM Computing Surveys 41(3) (2009), 1–72.

2. Ramaswamy, Rastogi, S.K. and Korea, Efficient algorithms for mining outliers from large data sets, in: Proc. 2000 ACM Int. Conf. Management of Data (SIGMOD), pp. 427–438.

3. Angiulli and Pizzuti, Fast outlier detection in high dimensional spaces, in: Proc. 6th Eur. Conf. Principles of Data Mining and Knowledge Discovery (PKDD), pp. 15–26.

4. Breunig, M.M., Kriegel, H.P., R.T. and Sander, LOF: Identifying density-based local outliers, in: Proc. 2000 ACM Int. Conf. Management of Data (SIGMOD), pp. 93–104.

5. Papadimitriou, Kitagawa, Gibbons, P.B., et al., LOCI: Fast outlier detection using the local correlation integral, in: Proc. 19th Int. Conf. on Data Engineering (ICDE).

6. and Deng, Discovering cluster-based local outliers, Pattern Recognition Letters 24(9–10) (2003), 1641–1650.

7. Singh and Kaur, Unsupervised anomaly detection in network intrusion detection using clusters, in: Proc. 28th Australasian Conference on Computer Science (ACSC), pp. 333–342.

8. Shyu, M.L., Chen, S.C., Sarinnapakorn and Chang, L.W., A novel anomaly detection scheme based on a principal component classifier, in: Proc. 3rd IEEE Int. Conf. on Data Mining (ICDM) Workshop.

9. Filzmoser, Hron and Reimann, Principal component analysis for compositional data with outliers, Environmetrics 20(6) (2009), 621–632.

10. Hubert, Rousseeuw, P.J. and Vanden Branden, ROBPCA: a new approach to robust principal component analysis, Technometrics 47(1) (2005), 64–79.

11. Kwitt and Hofmann, Robust methods for unsupervised PCA-based anomaly detection, in: Proc. 2006 IEEE/IST Workshop on Monitoring, Attack Detection, and Mitigation, pp. 1–3.

12. Hoffmann, Kernel PCA for novelty detection, Pattern Recognition 40(3) (2007), 863–874.

13. Das, Golatkar and Awate, Sparse Kernel PCA for Outlier Detection, in: 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 152–157.

14. Lee, Y.J., Yeh, Y.R. and Wang, Y.C.F., Anomaly detection via online oversampling principal component analysis, IEEE Transactions on Knowledge & Data Engineering 25(7) (2013), 1460–1470.

15. Caramanis and Mannor, Outlier-robust PCA: the high-dimensional case, IEEE Transactions on Information Theory 59(1) (2012), 546–572.

16. Ding and Kolaczyk, E.D., A compressed PCA subspace method for anomaly detection in high-dimensional data, IEEE Transactions on Information Theory 59(11) (2013), 7419–7433.

17. Croux, Filzmoser and Oliveira, M.R., Algorithms for projection-pursuit robust principal component analysis, Chemometrics and Intelligent Laboratory Systems 87(2) (2007), 218–225.

18. Fogla, Dagon, Lee and Skoric, Towards an information-theoretic framework for analyzing intrusion detection systems, in: Proc. 2006 European Symposium on Research in Computer Security (ESORICS), pp. 527–546.

19. Lee and Xiang, Information-theoretic measures for anomaly detection, in: Proc. 2001 IEEE Symposium on Security and Privacy, pp. 130–143.

20. Liu, F.T., Ting, K.M. and Zhou, Z.H., Isolation forest, in: Proc. 8th IEEE Int. Conf. on Data Mining (ICDM), pp. 413–422.

21. Liu, F.T., Ting, K.M. and Zhou, Z.H., Isolation-based anomaly detection, ACM Transactions on Knowledge Discovery from Data 6(1) (2012), 1–39.

22. Marteau, P.F., Soheily-Khah and Béchet, Hybrid Isolation Forest-Application to Intrusion Detection, arXiv:1705.03800, 2017.

23. Liu, F.T., Ting, K.M. and Zhou, Z.H., On Detecting Clustered Anomalies Using SCiForest, in: Proc. 2010 Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 274–290.

24. Shen, Liu, Wang, Chen and Sun, A novel isolation-based outlier detection method, in: Proc. 2016 Pacific Rim International Conference on Artificial Intelligence, pp. 446–456.

25. Schölkopf, Platt, J.C., Taylor, J.S., Smola, A.J. and Williamson, R.C., Estimating the support of a high-dimensional distribution, Neural Computation 13(7) (2001), 1443–1471.

26. Akcay, Abarghouei, A.A. and Breckon, T.P., GANomaly: Semi-supervised anomaly detection via adversarial training, in: 14th Asian Conference on Computer Vision (ACCV), pp. 622–637.

27. Goldstein and Uchida, A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data, PLOS One 11(4) (2016).

28. Hodge, V.J. and Austin, A survey of outlier detection methodologies, in: Proc. 2004 ACM Int. Conf. on Knowledge Discovery and Data Mining, pp. 85–126.

29. Hadi, A.S., Imon, A.H.M.R. and Werner, Detection of outliers, Wiley Interdisciplinary Reviews: Computational Statistics 1(1) (2009), 57–70.

30. Rousseeuw, P.J. and Hubert, Robust statistics for outlier detection, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(1) (2011), 73–79.

31. Zhao, Nasrullah and Li, PyOD: a python toolbox for scalable outlier detection, Journal of Machine Learning Research 20(96) (2019), 1–7.

32. Goldstein and Dengel, Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm, in: KI-2012: Poster and Demo Track, pp. 59–63.

33. Joanes, D.N. and Gill, C.A., Comparing measures of sample skewness and kurtosis, Journal of the Royal Statistical Society: Series D (The Statistician) 47(1) (1998), 183–189.
