An outlier detection algorithm based on an integrated outlier factor

Abstract

Outlier detection is an important task in data mining. In this paper, a novel outlier detection algorithm is proposed that integrates local density with global distance. In the proposed method, an integrated outlier factor is used to measure the degree to which each object is an outlier. A comprehensive experimental study on both synthetic and real-life datasets shows that the proposed method is more effective than several typical outlier detection methods, including Relative Density-based Outlier Score (RDOS), INFLuenced Outlierness (INFLO), Local Outlier Factor (LOF) and Local Distance-based Outlier detection Factor (LDOF).

Keywords

Outlier detection, local density, global distance, double parameter

1. Introduction

Outlier detection has become an important research topic in data mining and has attracted the attention of a growing number of researchers. Related findings have been applied to fields such as network intrusion detection, financial fraud detection, identification of non-performing bank loans and detection of potential terrorist activity [1]. Outliers are data objects that deviate so greatly from the others as to arouse suspicion that they were generated by a different mechanism. Outliers can cause significant losses to users. For example, policy makers could make bad decisions in business analysis. Handling outliers is therefore more valuable than handling conventional objects in some specific areas [2].

In recent decades, many outlier detection methods have been proposed. They can be divided into four categories: statistical learning-based, clustering-based, density-based and distance-based. Statistical learning-based methods can be configured for different data distributions, parameters, thresholds and models. In clustering-based methods, outliers are generally regarded as by-products of clustering. These two categories are seldom used: for statistical learning-based methods it is difficult to build the model, and clustering-based methods are inefficient on large datasets. In contrast, density-based and distance-based methods have long been the mainstream. Density-based algorithms such as Local Outlier Factor (LOF) [3], INFLuenced Outlierness (INFLO) [4], Relative Density-based Outlier Score (RDOS) [5] and Connectivity-based Outlier Factor (COF) [6] are all popular. In these methods, a data point is considered an outlier when its local density differs from that of its neighborhood. LOF was the first method to propose a relative outlier factor, which is based on the K nearest neighbors (KNN); it uses the KNN of an object to obtain its reachability density [3]. INFLO improves on LOF by considering both the neighbors and the reverse neighbors of an object when estimating its relative density distribution [4]. RDOS is similar to INFLO, but additionally uses shared neighbors [5]. COF is a local density-based approach for data that exhibit patterns such as lines and spheres [6]. The advantage of this kind of method is that all outliers can be detected, including both global and local outliers. However, several parameters must be determined in advance.

In distance-based methods, outliers are detected by computing the distances among all objects. Among them, the most popular are Distance Based detect Outlier (DB-Outlier) [7], Nested-Loop (NL) [8] and Local Distance-based Outlier detection Factor (LDOF) [9]. DB-Outlier is the earliest distance-based outlier detection algorithm; it calculates the distance between each pair of objects and compares it with a threshold to determine whether an object is an outlier. NL adds a mesh model and calculates the distances between the current object and the other objects within each mesh cell. Whenever a distance is less than the specified threshold, a neighbor counter is incremented by one [8]; if the counter exceeds a given threshold, the object is not an outlier. LDOF uses the relative distance from an object to its neighbors and is usually used for outlier detection in scattered datasets [9]. Distance-based methods are widely used for their validity and portability. However, they do not take the variations of local outliers into account, so they can detect only global outliers and are incapable of detecting local ones.

In this paper, we propose a robust Outlier Detection Algorithm based on Density and Distance double Parameters Outlier factor Scores (ODA-DDPOS). The method consists of three steps. 1) The local density of an object is estimated from its neighborhood, and the global distance is obtained from the resulting density set. 2) An integrated outlier evaluation factor is computed from the results of step 1). 3) The top-n outliers are determined by the evaluation factor. The higher the DDPOS value of an object, the more likely it is to be an outlier. Experimental results on both synthetic and real-life datasets verify the superior performance of the proposed ODA-DDPOS method. In summary, this paper makes the following two contributions.

• A new outlier detection method is presented, together with its theoretical analysis and a discussion of the parameters it requires in practical applications.
• A new outlier factor is proposed, which is based on two parameters: density and distance.

The paper is organized as follows. In Section 2, we give a brief overview of related work. In Section 3, we present a detailed description of ODA-DDPOS. In Section 4, we give the evaluation criteria, experimental results and analysis. Finally, conclusions are given in Section 5.

2. Related work

The essence of outlier detection is to separate a core of regular observations from a large collection of data. Outlier detection methods can be divided into four types: statistical learning-based, clustering-based, density-based and distance-based [1, 10]. Statistical learning-based methods find outliers by building a statistical learning model. Their disadvantage is that the fitted model cannot be applied to unknown datasets [11, 12]. Moreover, the data distribution assumed by the model is usually inconsistent with that in real applications [6, 13]. Clustering-based methods detect outliers in the process of finding clusters, and a data point that does not belong to any cluster is considered an outlier. However, their computational cost is too high for large datasets [6, 7, 14]. Besides, they cannot be optimized further for outlier detection because their main goal is clustering. For these reasons, the density-based and distance-based approaches are combined in the proposed method, since they are simpler and more commonly used. LOF, INFLO and RDOS are all density-based algorithms. A density-based method examines the density of an object and that of its neighbors, and identifies the object as an outlier if its density is lower than that of its neighbors. Given a dataset $D$, let $r$ be the density threshold used to define the neighborhood of an object. For an object $o$, we use $\textit{dist}_{r}(o)$ to denote the reachable distance of its $r$-neighborhood, and $\textit{dist}(o,p)$ to denote its distance to another object $p$ ($p\in D$, $p\neq o$). Every object $p$ in the $r$-neighborhood of $o$ satisfies $\textit{dist}(o,p)\leqslant\textit{dist}_{r}(o)$. The density neighborhood $N_{r}(o)$ of an object $o$ is given by Eq. (1).

$\displaystyle N_{r}(o)=\left\{p\mid p\in D,\ p\neq o,\ \textit{dist}(o,p)\leqslant\textit{dist}_{r}(o)\right\}$ (1)
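
As a minimal sketch of Eq. (1), assuming Euclidean distance and taking $\textit{dist}_{r}(o)$ to be the distance from $o$ to its $r$-th nearest neighbor, the neighborhood can be collected as follows. The function name `r_neighborhood` and the toy points are illustrative and not part of the original method description.

```python
import math

def r_neighborhood(points, o_idx, r):
    """Indices of all objects p != o with dist(o, p) <= dist_r(o)."""
    o = points[o_idx]
    # Distances from o to every other object in D, smallest first.
    dists = sorted((math.dist(o, p), j) for j, p in enumerate(points) if j != o_idx)
    dist_r = dists[r - 1][0]  # assumed reachable distance: r-th smallest distance
    # Keep every object at or below dist_r, so ties at the boundary are included.
    return [j for d, j in dists if d <= dist_r]

# Toy data: three points lie at distance 1 from the origin.
D = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (5.0, 5.0)]
neighbors = r_neighborhood(D, 0, 2)
print(neighbors)
```

With $r=2$ the toy call returns three indices rather than two, because three objects share the boundary distance.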

Note that $N_{r}(o)$ may contain more than $r$ objects because multiple objects may lie at the same distance from $o$. The fourth type is the distance-based method, which examines the neighborhood of an object defined by a given radius, using a distance measure such as the Euclidean or Manhattan distance. An object is considered an outlier if it does not have enough neighbors. Given a dataset $D$, let $r$ be the distance threshold used to define a reasonable neighborhood of an object. For an object $o$, the distance-based method examines the number of other objects in its $r$-neighborhood. If most objects in $D$ are sufficiently far from $o$, i.e. they are not in its $r$-neighborhood, $o$ can be regarded as an outlier. Formally, $o$ is an outlier if it satisfies Eq. (2).

$\displaystyle\frac{||\{o^{\prime}\mid\textit{dist}(o,o^{\prime})\leqslant r\}||}{||D||}\leqslant\Pi$ (2)

Here $o^{\prime}$ is an object in $D$ other than $o$, $\textit{dist}(o,o^{\prime})$ is the distance between the objects $o$ and $o^{\prime}$, and $\Pi$ is a fraction threshold. The distance-based method can thus determine whether an object $o$ is an outlier by checking the distances between $o$ and the objects in its $r$-neighborhood.
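
The test in Eq. (2) can be sketched as below, assuming Euclidean distance. The function name `is_distance_outlier`, the radius, the fraction threshold and the toy data are all illustrative choices rather than values from the paper.

```python
import math

def is_distance_outlier(points, o_idx, r, pi):
    """Eq. (2): o is an outlier if the fraction of other objects within radius r is at most pi."""
    o = points[o_idx]
    within = sum(
        1
        for j, p in enumerate(points)
        if j != o_idx and math.dist(o, p) <= r
    )
    return within / len(points) <= pi

# Four tightly grouped points and one isolated point.
D = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]
flags = [is_distance_outlier(D, i, r=0.5, pi=0.1) for i in range(len(D))]
print(flags)
```

Only the isolated point has no other object within the radius, so it is the only one flagged in this toy setting.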

The basic models described above are based on density and distance. Both LOF [3] and LDOF [9] use KNN to compute their characteristic subspaces [15, 16], and they depend too heavily on the kernel density [7]. INFLO introduces the Influence Space (IS) [4], and RDOS introduces the relative density-based outlier factor [5]. The concepts of INFLO and RDOS are very similar, but RDOS adds the shared neighbors to the IS to form the new set of nearest neighbors. A disadvantage of INFLO and RDOS is that the density score boundary between a point and two clusters becomes blurred, which makes it difficult to determine which cluster the instance belongs to [17]. Further disadvantages of RDOS are that it has manually set parameters and that the algorithm is too subjective [18].

Therefore, we propose an integrated outlier detection algorithm based on density and distance after comprehensive analysis.

3. Proposed algorithm

3.1 Local density and global distance

Outliers have the following characteristics in the real spatial datasets.

• The mechanism that generates an outlier differs from that of the normal objects. Outliers are usually distributed separately or very sparsely, and accordingly most of them have very low densities.
• Outliers are relatively far from the clusters of normal points, which makes them isolated.

3.1.1 Local density

Gaussian Kernel Density Estimation (GKDE) can be used to get the object local density. Given a dataset of objects $\chi=\{X_{1},X_{2},\ldots X_{N}\}$ ( $X_{j}\in R^{d}$ for $j=1,2\ldots N$ ), the kernel density of a point $X_{i}$ is defined as follows.

$\displaystyle\rho(X_{i})=\sum\limits_{j=1,\,j\neq i}^{N}K\left(\frac{X_{i}-X_{j}}{d_{c}}\right)$ (3)

where $X_{i}$ ($i=1,2,\ldots,N$) is a point in the dataset and $d_{c}$ is the truncation distance of the object $X_{i}$. First, the Euclidean distances between $X_{i}$ and the remaining points in the dataset are calculated and sorted in ascending order. Then the first $K$ distances are selected, and the largest of them is defined as the truncation distance. $K\left(\frac{X_{i}-X_{j}}{d_{c}}\right)$ is a kernel function. The commonly used Gaussian kernel is given in Eq. (4).

$\displaystyle K\left(\frac{X_{i}-X_{j}}{d_{c}}\right)_{\textit{Gaussian}}=\exp\left(-\frac{||X_{i}-X_{j}||^{2}}{2d_{c}^{2}}\right)$ (4)

where $||X_{i}-X_{j}||$ is the Euclidean distance between $X_{i}$ and $X_{j}$. The estimate in Eq. (3) has several useful properties, such as being non-parametric and continuous. Being non-parametric, it makes no assumptions about the probability distribution of the data and can be used to estimate the data density [19, 20]. Being continuous, it yields continuous density values, which reduces the probability that two objects share the same local density [20]. Besides, GKDE is an asymptotically unbiased estimate of the density. There are two main reasons for using this approach. First, it avoids tedious manual work, since the parameters involved need not be determined beforehand. Second, the outlier algorithm calculates a score for every point, and using the full dataset would lead to a high computational cost. Concretely, the time complexity is kept within $O(N^{2})$, where $N$ is the total number of elements in the dataset.

To estimate the density of an object $X_{i}$, only its neighbors are used as kernels. Our analysis shows that the truncation distance $d_{c}$ is a key factor influencing the local density. The truncation distance is in fact derived from LOF. In this way, we obtain the local Gaussian kernel density estimate in Eq. (5).

$\displaystyle\rho_{i}=\rho(X_{i})=\sum\limits_{X_{j}\in\chi,\,X_{j}\neq X_{i}}e^{-\left(\frac{d_{X_{i},X_{j}}}{\sqrt{2}d_{c}}\right)^{2}}$ (5)

where $d_{X_{i},X_{j}}$ denotes the Euclidean distance between $X_{i}$ and $X_{j}$, and $\rho_{i}$ represents the density of $X_{i}$. The local density of each point is calculated by Eq. (5), and the points are then re-sorted by local density.
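
A minimal sketch of the truncation distance and of the local density for a single object is given below. It assumes that the sum is restricted to the $K$ nearest neighbors of $X_{i}$, following the remark that only the neighbors act as kernels; Eq. (5) as printed sums over the whole dataset, which would correspond to replacing the slice `dists[:k]` by `dists`. The function names and the toy data are illustrative.

```python
import math

def truncation_distance(points, i, k):
    """Largest of the K smallest Euclidean distances from points[i] to the other points."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]

def local_density(points, i, k):
    """Gaussian-kernel density of points[i] over its K nearest neighbors (assumption)."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    d_c = truncation_distance(points, i, k)  # assumed to be strictly positive
    # Each term is exp(-(d / (sqrt(2) * d_c))^2), the exponent used in Eq. (5).
    return sum(math.exp(-((d / (math.sqrt(2.0) * d_c)) ** 2)) for d in dists[:k])

# Four equally spaced points on a line; X[0] has neighbors at distances 1 and 2.
X = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
d_c = truncation_distance(X, 0, 2)
rho_0 = local_density(X, 0, 2)
print(d_c, rho_0)
```

For the first point the truncation distance is 2, so its density is $e^{-1/8}+e^{-1/2}$.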

Note that $\{\rho_{i}\}_{i=1}^{N}$ denotes the density set re-sorted in descending order of local density. It can be expressed as Eq. (6).

$\displaystyle\{\rho_{i}\}_{i=1}^{N}=\{\rho_{1},\rho_{2},\rho_{3},\ldots,\rho_{N}\}$ (6)

where $\rho_{1}\geqslant\rho_{2}\geqslant\rho_{3}\geqslant\ldots\geqslant\rho_{N}$. Most density-based outlier detection algorithms depend on one or more parameters. The proposed algorithm uses only one parameter, which reduces the influence of subjective manual tuning.

3.1.2 Global distance

The kernel density score can be used on its own as an evaluation factor for outliers in some density-based methods. Here, the global distance is also considered in order to improve the detection accuracy. Following the definition of an outlier, the global distance is defined as the average farthest distance between the object $X_{i}$ and the first $K$ objects that are denser than $X_{i}$. It is calculated according to Eq. (7).

$\displaystyle d_{X_{i}}=\left\{\begin{array}{ll}\varepsilon,&i=1\\ \frac{1}{K}\sum\limits_{X_{j}\in S_{i}}^{K}||X_{i}-X_{j}||,&2\leqslant i\leqslant N,\ \rho_{j}\leqslant\rho_{i}\end{array}\right.$ (7)

where $S_{i}$ denotes the set of the $K$ farthest neighbors of the object $X_{i}$, and $d_{X_{i}}$ denotes the average of the first $K$ distances between $X_{i}$ and the objects $X_{j}$, taken when the density of $X_{i}$ is higher than that of $X_{j}$.
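
The following is a sketch of one possible reading of the global distance, not a definitive implementation. It assumes that $S_{i}$ holds the $K$ nearest objects among those that are denser than $X_{i}$, that an object with fewer than $K$ denser objects receives a global distance of 0, and that the densest object receives $\varepsilon$. The wording above admits other readings of $S_{i}$, and the function name, the density values and the data are illustrative.

```python
import math

def global_distance(points, rho, k, eps=0.0):
    """Average distance from each point to its K nearest denser points (assumed reading)."""
    n = len(points)
    d = [0.0] * n
    densest = max(range(n), key=lambda i: rho[i])
    for i in range(n):
        if i == densest:
            d[i] = eps  # first case of Eq. (7): the highest-density object
            continue
        denser = sorted(
            math.dist(points[i], points[j]) for j in range(n) if rho[j] > rho[i]
        )
        if len(denser) >= k:
            d[i] = sum(denser[:k]) / k
        # Otherwise fewer than K denser objects exist and d[i] stays 0.
    return d

# Densities are given directly here; in the method they come from Eq. (5).
X = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
rho = [3.0, 4.0, 2.0, 1.0]
d = global_distance(X, rho, k=2)
print(d)
```

Under these assumptions the isolated, low-density point at (10, 0) receives by far the largest global distance.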

The global distance can eliminate the interference of boundary objects and avoid falsely detected outliers. For example, a boundary point has a lower density, and its distance to the opposite edge of the normal cluster is larger, but a boundary point is not necessarily an outlier. Using the distances between a point and its $K$ high-density points excludes boundary points from the final outlier set; the global distance is the average of these $K$ distances. In this way, the characteristics of outliers are preserved: the higher the density of an object, the less likely it is to be an outlier. If a point has fewer than $K$ such neighbors, its global distance defaults to 0. We also remove the highest-density point from the neighbor set $S$ because it most likely belongs to a normal cluster. The global distance of the highest-density point is therefore close to 0, i.e. $\varepsilon\to 0$. The global distances associated with a point are collected in a farthest-distance set covering all investigated points, which is expressed as follows.

$\displaystyle\{d_{ik}\}_{k=1}^{N}=\{d_{i1},d_{i2},d_{i3},\ldots,d_{iN}\}$ (8)

where $\{d_{ik}\}_{k=1}^{N}$ is the result of sorting $d_{i\ast i}$, and represents the influence of the different points on $X_{i}$ from high to low.

The global distance of each object $X_{i}$ is calculated according to the density set $\{{\rho_{{}_{i}}}\}_{i=1}^{N}$ and the global distance of $X_{i}$ is stored in the global distance set $\{{d_{i}}\}_{i=1}^{N}$ . Thus it can be expressed as Eq. (9).

$\displaystyle\{{d_{i}}\}_{i=1}^{N}=\{{d_{1},d_{2},d_{3}\ldots d_{N}}\}$ (9)

3.2 Outlier factor score

Based on the local density and global distance sets, the outlier factor score can be defined. DDPOS is a comprehensive factor with a strong corrective effect, and it is defined as follows.

$\displaystyle\{\textit{DDPOS}(X_{i})\}_{i=1}^{N}=\left\{\frac{d_{X_{i}}}{\rho_{X_{i}}}\right\}$ (10)

DDPOS is the ratio of the global distance to the local density of the object of interest $X_{i}$. If $\textit{DDPOS}(X_{i})$ is much larger than 1, $X_{i}$ lies outside the clusters of normal points, which indicates that $X_{i}$ is likely an outlier. If $\textit{DDPOS}(X_{i})$ is approximately equal to or smaller than 1, $X_{i}$ is surrounded by neighbors with the same characteristics, which indicates that $X_{i}$ is not an outlier. The larger the density of $X_{i}$, the closer $\textit{DDPOS}(X_{i})$ is to 0; in this case, $X_{i}$ is more likely to be a cluster center. In practice, we rank the objects by DDPOS and report the top-n as outliers.
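
The ratio in Eq. (10) and the top-n ranking can be sketched as follows. The helper names `ddpos_scores` and `top_n` and the numeric values are illustrative only; they are not results from the experiments.

```python
def ddpos_scores(d, rho):
    """Eq. (10): ratio of global distance to local density for every object."""
    return [d_i / rho_i for d_i, rho_i in zip(d, rho)]

def top_n(scores, n):
    """Indices of the n objects with the largest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]

# Illustrative values: the last object is far from denser points and has low density.
d = [0.0, 0.5, 0.75, 3.0]
rho = [4.0, 2.0, 1.5, 0.25]
scores = ddpos_scores(d, rho)
print(scores, top_n(scores, 1))
```

In this toy input the densest object scores 0, the two intermediate objects score below 1, and the last object scores well above 1, matching the interpretation given above.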

3.3 Proposed algorithm

Our algorithm is given in Algorithm 1, which takes the KNN graph as input. The KNN graph is a directed graph in which each object corresponds to a vertex and is connected to its $K$ nearest neighbors by outbound edges. The output is the top-n outliers in the dataset.

Algorithm 1: Density and Distance double Parameters Outlier factor Score
INPUT: $K$, $\chi$, KNN-graph
OUTPUT: Outliers
ALGORITHM:
for each object $X_{i}\in\chi$, $1\leqslant i\leqslant N$ do
    for each neighbor $X_{j}$ of $X_{i}$, $1\leqslant j\leqslant K$ do
        $d_{c}=\text{Max}(\textit{KNN-G}(X_{i},X_{j}))$, i.e. the truncation distance $d_{c}$ is the largest of the top-$K$ Euclidean distances between $X_{i}$ and $X_{j}$ stored in the KNN-graph;
    end
    $\rho(X_{i})=$ getKernelDensity($X_{i},d_{c}$), which estimates the local kernel density at the location of $X_{i}$;
    $\{\rho_{i}\}_{i=1}^{N}.\textit{push}(X_{i},\rho(X_{i}))$, which stores the kernel density of each object $X_{i}$ in the density set; $\{\rho_{i}\}_{i=1}^{N}$ is the local density set, re-sorted in descending order;
end
for each object $X_{i}\in\chi$ with $\rho(X_{i})\in\{\rho_{i}\}_{i=1}^{N}$ do
    $d(X_{i})=$ getDistance($X_{i},K,\rho(X_{i})$), which calculates the global distance of $X_{i}$;
    $\{d_{i}\}_{i=1}^{N}.\textit{push}(X_{i},d(X_{i}))$, which stores the global distance of each object $X_{i}$ in the distance set;
end
for each object $X_{i}\in\chi$ with $\rho(X_{i})\in\{\rho_{i}\}_{i=1}^{N}$ and $d(X_{i})\in\{d_{i}\}_{i=1}^{N}$ do
    Calculate $\textit{DDPOS}(X_{i})$ for $X_{i}$ according to Eq. (10);
end
Sort DDPOS in descending order and output the top-n objects.
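
The three loops of Algorithm 1 can be sketched end to end as below. This is an illustration under stated assumptions rather than the authors' implementation: the KNN graph is built by brute force, the density sums over the $K$ nearest neighbors only, the global distance averages the distances to the $K$ nearest denser objects (0 if fewer than $K$ exist), and all points are assumed distinct so that $d_{c}>0$. The names `build_knn_graph` and `oda_ddpos` and the toy data are hypothetical.

```python
import math

def build_knn_graph(points, k):
    """Directed KNN graph: vertex i -> list of (distance, j) for its k nearest neighbors."""
    graph = {}
    for i, x_i in enumerate(points):
        dists = sorted((math.dist(x_i, x_j), j) for j, x_j in enumerate(points) if j != i)
        graph[i] = dists[:k]
    return graph

def oda_ddpos(points, k, n, eps=0.0):
    """Return the indices of the top-n objects by DDPOS, together with all scores."""
    graph = build_knn_graph(points, k)
    size = len(points)

    # Loop 1: truncation distance and local kernel density of every object.
    rho = []
    for i in range(size):
        d_c = max(dist for dist, _ in graph[i])
        rho.append(sum(math.exp(-(dist * dist) / (2.0 * d_c * d_c)) for dist, _ in graph[i]))

    # Loop 2: global distance, visiting objects in descending order of density.
    order = sorted(range(size), key=lambda i: rho[i], reverse=True)
    glob = [0.0] * size
    glob[order[0]] = eps
    for i in order[1:]:
        denser = sorted(math.dist(points[i], points[j]) for j in range(size) if rho[j] > rho[i])
        if len(denser) >= k:
            glob[i] = sum(denser[:k]) / k

    # Loop 3: DDPOS as in Eq. (10), then rank in descending order.
    scores = [glob[i] / rho[i] for i in range(size)]
    ranked = sorted(range(size), key=lambda i: scores[i], reverse=True)
    return ranked[:n], scores

# Five evenly spaced points and one isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (20.0, 0.0)]
outliers, scores = oda_ddpos(points, k=2, n=1)
print(outliers)
```

In this toy run the isolated point is ranked first mainly because of its large global distance; its kernel density alone is not the lowest, which illustrates why the score combines both quantities.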

4. Experimental results and analysis

4.1 Evaluation criteria

Although there are many evaluation metrics, such as precision, Adjusted Index and Adjusted AP [21, 22], the most effective and popular one is the well-known F1 criterion, which is widely used for unsupervised outlier detection algorithms. In this paper, we use the F1 curve to evaluate the experimental results.

Definition 1 (F1). F1 is a comprehensive evaluation criterion using precision and recall. Its definition is given below.

$\displaystyle F1=\frac{2\times P\times R}{P+R}=\frac{2\times TP}{M+TP-TN}$ (11)

Here, $P$ is precision and $R$ is recall. Precision is the fraction of the objects reported as outliers by the detection algorithm that are true outliers ($P=TP/(TP+FP)$). Recall is the fraction of the true outliers that are reported by the algorithm ($R=TP/(TP+FN)$). The larger the values of $P$ and $R$, the better the performance of the algorithm. However, $P$ and $R$ are a pair of conflicting measures. When F1 is used, samples are divided into True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) according to their real and predicted classes. Here $TP+TN+FP+FN=M$, where $M$ is the total number of samples; substituting $P$ and $R$ gives $F1=2TP/(2TP+FP+FN)$, and $2TP+FP+FN=M+TP-TN$. The confusion matrix is shown in Table 1.

Table 1

Confusion matrix

True condition \ Predicted condition	Positive	Negative
Positive	TP	FN
Negative	FP	TN

TP is the number of true outliers that are predicted as outliers. FN is the number of true outliers that are predicted as normal. FP is the number of normal objects that are predicted as outliers. TN is the number of normal objects that are predicted as normal.
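
As a quick check of Eq. (11), the two forms of F1 can be compared numerically. The confusion-matrix counts below are illustrative and are not taken from the experiments; both functions assume $TP>0$ so that no division by zero occurs.

```python
def f1_from_precision_recall(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_closed_form(tp, fp, tn, fn):
    """Second form in Eq. (11): 2TP / (M + TP - TN), with M the total number of samples."""
    m = tp + fp + tn + fn
    return 2 * tp / (m + tp - tn)

# Illustrative counts: 100 samples, 5 true outliers, 4 reported outliers.
tp, fp, tn, fn = 3, 1, 94, 2
f1_a = f1_from_precision_recall(tp, fp, fn)
f1_b = f1_closed_form(tp, fp, tn, fn)
print(f1_a, f1_b)
```

Both forms give $6/9\approx 0.667$ for these counts, since $M+TP-TN=2TP+FP+FN$.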

4.2 Synthetic datasets

First, we test ODA-DDPOS on a synthetic dataset. The dataset includes two Gaussian clusters centered at (0.5, 0.6) and (3.0, 3.5), respectively [23]. One hundred samples, including 4 outliers, are drawn from it. In consideration of the computational cost and the detection accuracy, $K$ is set to 5. Fig. 1 shows the distribution of the synthetic Gaussian samples. Here, the solid blue diamonds inside the two ellipses are normal points and the red hollow rectangles represent the outliers.

Figure 1.

Outlier distribution.

As is shown in Fig. 1, the data points corresponding to A, B, C and D are outliers. Meanwhile, the top 5 data samples with the largest DDPOS are given in Table 2.

Table 2

Top 5 data samples with the largest DDPOS

Synthetic dataset numbers	Data	DDPOS
1	(10.0, 8.0)	26.64
2	(1.0, 7.0)	12.56
3	(4.0, 10.0)	11.85
4	(5.0, 9.0)	10.34
5	(1.9, 5.8)	1.46

According to Eq. (10), the DDPOS values are calculated and shown in Table 2. As an example, the DDPOS value of the object $X_{1}$ is calculated as follows.

$\displaystyle\textit{DDPOS}(X_{1}(10.0,8.0))=\frac{d_{X_{1}}}{\rho_{X_{1}}}=\frac{\frac{1}{5}\sum\limits_{X_{j}\in S_{1}}^{5}||X_{1}-X_{j}||}{\sum\limits_{X_{j}\in\chi,\,X_{j}\neq X_{1}}e^{-\left(\frac{d_{X_{1},X_{j}}}{\sqrt{2}d_{c}}\right)^{2}}}\approx\frac{3.3399}{0.1254}\approx 26.64.$

The DDPOS value of object $X_{2}$ is calculated as follows.

$\displaystyle\textit{DDPOS}(X_{2}(1.0,7.0))=\frac{d_{X_{2}}}{\rho_{X_{2}}}=\frac{\frac{1}{5}\sum\limits_{X_{j}\in S_{2}}^{5}||X_{2}-X_{j}||}{\sum\limits_{X_{j}\in\chi,\,X_{j}\neq X_{2}}e^{-\left(\frac{d_{X_{2},X_{j}}}{\sqrt{2}d_{c}}\right)^{2}}}\approx\frac{3.0608}{0.2437}\approx 12.56.$

The DDPOS values of the remaining $X_{i}$ ($3\leqslant i\leqslant 100$) in the synthetic dataset are calculated similarly.

Obviously, the outlier distributions are variable. The DDPOS values of the first four outliers are very large. The fifth data point is the boundary point, and its corresponding value is not as large as the first four ones.

4.3 Real-life datasets

We also conduct experiments on real-life datasets to evaluate the performance of the algorithms. All datasets are from the UCI repository [24]. Before the experiments, we preprocessed the datasets for performance evaluation in two steps.

(1) Normalization

It is used to standardize each attribute to the range from 0 to 1.

(2) Transformation

It is used to convert all of the classification attributes to numerical ones [21].
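
The two preprocessing steps can be sketched as below. Min-max scaling is one common way to map an attribute to the range from 0 to 1, and integer codes in order of first appearance are only one possible way to make a categorical attribute numerical; the text does not specify the encoding, so both helper functions and the sample columns are assumptions of this sketch.

```python
def min_max_normalize(column):
    """Scale a numeric attribute to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant attribute carries no information; map it to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def encode_categories(column):
    """Map each distinct category to an integer code in order of first appearance."""
    codes = {}
    for value in column:
        codes.setdefault(value, len(codes))
    return [codes[value] for value in column]

numeric = min_max_normalize([2.0, 4.0, 6.0])
encoded = encode_categories(["a", "b", "a", "c"])
print(numeric, encoded)
```

An encoded categorical attribute can then be passed through `min_max_normalize` as well, so that all attributes share the same range before distances are computed.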

The 13 datasets listed in Table 3 are commonly used to evaluate classification methods. For the purpose of outlier detection, one or more classes in each dataset are treated as outliers. We do this on the basis of the characteristics of outliers described in Section 3.1. Background on these datasets can be found in [21].

Table 3

Characteristics of 13 datasets

Dataset	# of features	# of outliers	# of data
IONOSPHERE	31	126	351
KDDCUP99	40	200	48113
LYMPHOGRAPHY	19	6	148
SHUTTLE	9	13	1013
WAVEFORM	21	100	3443
WBC	10	10	454
WDBC	32	10	367
WPBC	32	47	198
GLASS	7	9	214
HEARTDISEASE	13	15	166
PAGEBLOCKS	10	256	5139
PARKINSON	22	15	195
WILT	5	93	4655

As can be seen from Table 3, the dimensionality and the number of outliers vary across these datasets. The # of features is the number of attributes of each object and represents the dimension of the dataset, and the # of data is the number of samples in the dataset.

To illustrate the distributions, Fig. 2 shows the first two principal components of four selected datasets. Here, normal data are denoted by solid blue diamonds and outliers by red hollow rectangles. It can be seen that the outliers mostly differ from the normal data.

Figure 2.

Data distribution.

We have analyzed the background of each dataset to show the distributions of normal and outlier data. Our evaluation uses 13 datasets, which contain 166 $\sim$ 48113 samples and 5 $\sim$ 40 features. The IONOSPHERE dataset contains measurements from high-frequency antennas detecting free electrons in the ionosphere. The normal cluster consists of measurements showing evidence of structure and has a total of 225 data points. The rest are considered outliers because they show no evidence of structure in the ionosphere. In KDDCUP99, the U2R attacks are used as outliers in the preprocessing because of the wide variety of network intrusions, and the remaining network connections are marked as normal. For LYMPHOGRAPHY, we selected the classes labeled ‘one’ and ‘four’ (less than 5% of the points) as the outlier classes and used the classes labeled ‘two’ and ‘three’ as the normal class. SHUTTLE was preprocessed into a dataset of 1013 records with 9 attributes each. The largest four clusters contain the majority of normal points, accounting for 98.7% of the total, and the other three clusters are regarded as outliers. In WAVEFORM, we downsampled class ‘0’ to 100 objects as the outliers and treated the remaining objects as normal points. The WBC dataset contains 454 records with 9 attributes after preprocessing. It is divided into two classes labeled ‘benign’ and ‘malignant’, and the objects in ‘malignant’ are treated as outliers [25]. WDBC and WPBC are derived from WBC. Their class structure is similar to that of WBC, but the numbers of records and attributes differ. The GLASS dataset contains 2 classes: the ‘window’ class, with 205 records, is treated as normal, and the ‘non-window’ class, with 9 records (4% of the data), is treated as outliers.

The HEARTDISEASE dataset contains medical data on heart problems; affected patients, with 15 records, are considered outliers, and healthy persons, with 166 records, are considered inliers. The PAGEBLOCKS dataset describes the types of blocks in document pages and is divided into two categories: blocks whose content is graphic are regarded as outliers, and all others as normal. PARKINSON is also a medical dataset with two classes: healthy persons are considered normal, and persons suffering from Parkinson’s disease are labeled as outliers. The WILT dataset covers the task of differentiating diseased trees from other land cover; the diseased trees, with 93 records, are treated as outliers. In effect, the outliers can be considered as one or more sparse clusters. Finally, heatmaps of the attribute correlations for six datasets (GLASS, HEARTDISEASE, PAGEBLOCKS, SHUTTLE, WBC and WILT) are shown in Fig. 3 to illustrate the characteristics of the datasets.

Figure 3.

Heatmap of dataset.

We compare our approach with four widely used outlier detection approaches: LOF [3], INFLO [4], RDOS [5] and LDOF [9]. All of these methods are based on nearest neighbors, so the parameter $K$ is varied from 1 to 100. In the RDOS method, the bandwidth of the kernel function has to be determined manually [26, 27, 28, 29]. In addition, the dimensionality of a dataset also affects the performance of RDOS. We summarize the F1 values on all 13 datasets in Figs 4–10. From these, we can see that the proposed ODA-DDPOS approach exhibits superior performance in terms of F1.

Figure 4.

F1 comparison.

Figure 5.

F1 comparison.

Figure 6.

F1 comparison.

Figure 7.

F1 comparison.

Figure 8.

F1 comparison.

Figure 9.

F1 comparison.

Figure 10.

F1 comparison.

Figure 11.

Average TPR preference.

We give F1 curves on the IONSPHERE and KDDCUP99 in Fig. 4. As shown in Fig. 4a, it is not a big difference and the positive rate is slightly lower when $K$ is in the initial stage. But as its value increases, this discrepancy is becoming more apparent. Especially when the value of $K$ is greater than 30, the superiority of the ODA-DDPOS method becomes increasingly obvious. With the increase of $K$ value, ODA-DDPOS method gradually becomes steady and F1 maintains a relatively high-value. The positive rate is rising with the increase of $K$ value, as shown in Fig. 4b. In fact, this trend is relatively stable in the whole algorithm. But when the value of $K$ is larger than 73, our proposed method reveals more and more apparent merits.

Figure 5 is F1 curves on the LYMPHOGRAPHY and SHUTTLE. As is shown in Fig. 5a, the performance of ODA-DDPOS is better than that of the other algorithms. And as is shown in Fig. 5b, the INFLO behaves better than ODA-DDPOS when $K$ is less than 13. But F1 positive rate of INFLO is dropped sharply when $K$ is more than 13. INFLO shows a sharp instability in the whole experiment process, but ODA-DDPOS is steady. When $K$ is less than 13, the performance of INFLO is the best. Besides that, the performance of ODA-DDPOS is the best.

Figure 6 is F1 curves on the WAVEFORM and WBC. As is shown in Fig. 6a, the trend of ODA-DDPOS is almost similar to that of LOF algorithm. But the positive rate of ODA-DDPOS is exceeded that of LOF algorithm. It can be seen from Fig. 6b, ODA-DDPOS is displayed an excellent F1 positive rate on the beginning. And it is only a little bit lower when the value of $K$ is between 40 and 60. However, ODA-DDPOS still remains a relatively ideal F1 positive rate and it is much more stable than other algorithms.

Figure 7 is F1 curves on the WDBC and WPBC. In Fig. 7a, the positive rate of ODA-DDPOS is excellent and the trend is going pretty smoothly than all other algorithms. It also has the highest average positive rate because WDBC and WPBC are the deformable datasets of WBC, its performance behaves almost stable as a whole. The trend of ODA-DDPOS in Fig. 7b is basically same as that of WDBC. The trend of ODA-DDPOS is more stable compared with other algorithms. And it does not produce very drastic changes affecting the stability of the whole algorithm.

Figure 8 is F1 curves on the GLASS and HEARTDISEASE. In Fig. 8a, the positive rate of ODA-DDPOS is the best when the value of $K$ exceeds 20, and the height positive rate has been maintained since then. The trend of ODA-DDPOS maintains the highest positive rate from the start and it is relatively reposeful throughout all the process, as shown in Fig. 8b. Yet it is just surpassed by RDOS when the value of $K$ is 77.

Figure 9 is F1 curves on the PAGEBLOCKS and PARKSION. In Fig. 9a, the positive rate of ODA-DDPOS rises gradually with increasing $K$ value. However, it has a little bit of a slight drop when the value of $K$ is greater than 40. Finally, the positive rate of ODA-DDPOS is gradually stabilized during the second half and still maintains the high positive rate. The trend of ODA-DDPOS is very stable and it does not produce violent fluctuations like other algorithms at the beginning and end, as shown in Fig. 9b.

Figure 10 shows the F1 curve on WILT. In Fig. 10a, the positive rates of ODA-DDPOS and RDOS are very similar, with ODA-DDPOS maintaining a narrow lead throughout. Notably, its advantage is most obvious when $K$ is 13.
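The F1 curves discussed above are obtained by sweeping $K$ and scoring the detected outliers against the ground truth. The following Python sketch illustrates such a sweep under stated assumptions: the scoring function `knn_distance_scores` is a simple mean k-nearest-neighbour distance used only as a stand-in (it is not the ODA-DDPOS factor), the cut-off `n` (flag the `n` highest-scoring points) is a hypothetical thresholding choice, and the data are a toy example rather than any of the 13 datasets.

```python
import math

def knn_distance_scores(points, k):
    """Score each point by its mean distance to its k nearest neighbours.
    NOTE: a stand-in outlier score for illustration, not the ODA-DDPOS factor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

def f1_at_top_n(scores, labels, n):
    """Flag the n highest-scoring points as outliers and return the F1 value.
    labels[i] is 1 for a true outlier and 0 otherwise."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n])
    tp = sum(1 for i in flagged if labels[i] == 1)
    fp = len(flagged) - tp
    fn = sum(labels) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_curve(points, labels, k_values, n):
    """Return {K: F1} for each K in k_values."""
    return {k: f1_at_top_n(knn_distance_scores(points, k), labels, n)
            for k in k_values}

# Toy data (assumed): a tight cluster plus two distant points labelled as outliers.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (5.0, 5.0), (-4.0, 6.0)]
labels = [0, 0, 0, 0, 0, 1, 1]
curve = f1_curve(points, labels, k_values=[1, 2, 3], n=2)
print(curve)
```

Plotting `curve` for each algorithm on the same axes gives one panel of the kind shown in Figs. 5 to 10; a flat curve corresponds to the stability with respect to $K$ that the discussion emphasizes.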

As shown in Fig. 11, ODA-DDPOS is superior overall. In particular, ODA-DDPOS is clearly the most effective on nine datasets: IONSPHERE, KDDCUP99, LYMPHOGRAPHY, WBC, WDBC, WPBC, GLASS, HEARTDISEASE and PAGEBLOCKS. On SHUTTLE, WAVEFORM, PARKSION and WILT, ODA-DDPOS performs comparably to the other algorithms. The average performance over all experiments can be ranked as follows: ODA-DDPOS $>$ LOF $>$ RDOS $>$ INFLO $>$ LDOF. Here, “ $>$ ” means “performs better than”.
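A ranking of this kind can be derived by averaging each algorithm's F1 values over the tested range of $K$, while the spread of those values reflects stability. The sketch below is one possible way to compute such a summary; the function name `summarize_curves`, the tie-breaking rule and the numbers in `toy_curves` are all assumptions made for illustration and are not results from the experiments.

```python
from statistics import mean, pstdev

def summarize_curves(curves):
    """curves maps an algorithm name to its list of F1 values over K.
    Returns (name, mean F1, standard deviation) tuples, best first:
    higher mean wins, and a lower spread breaks ties."""
    rows = [(name, mean(vals), pstdev(vals)) for name, vals in curves.items()]
    return sorted(rows, key=lambda r: (-r[1], r[2]))

# Toy curves (assumed values, not taken from the paper).
toy_curves = {
    "method_A": [0.80, 0.82, 0.81, 0.80],  # high and stable
    "method_B": [0.90, 0.40, 0.85, 0.30],  # occasionally best, but unstable
    "method_C": [0.70, 0.70, 0.70, 0.70],  # perfectly flat, lower level
}
summary = summarize_curves(toy_curves)
ranking = [name for name, _, _ in summary]
print(ranking)
```

In this toy case the unstable method is ranked last despite having the single highest F1 value, which mirrors the way INFLO is described on SHUTTLE: strong for small $K$ but not on average.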

5. Conclusion

Identifying outliers is an essential task in data mining. In this paper, a new outlier detection method based on local density and global distance estimation is proposed. In ODA-DDPOS, two parameters are used as the conditions for determining the outlier factor. The results on both synthetic and real-life datasets verify that ODA-DDPOS is an effective outlier detection algorithm.

In ODA-DDPOS, the choice of the parameter $K$ is of vital importance [30, 31, 32, 33, 34]. How to determine it will be the subject of our future research.

Acknowledgments

The corresponding author gratefully acknowledges the support of the National Natural Science Foundation of China under Grant 61402363, the Education Department of Shaanxi Province Key Laboratory Project under Grant 15JS079, the Xi’an Science Program Project under Grant 2017080CG/RC043 (XALG017), the Ministry of Education of Shaanxi Province Research Project under Grant 17JK0534, and the Beilin District of Xi’an Science and Technology Project under Grant GX1625.

References

1. Han, Kamber and Pei, Data mining concepts and techniques third edition, Morgan Kaufmann (2011), 9–22, 251–374.

2. Laurikkala, Juhola and Kentala, Informal identification of outliers in medical data, in: Intelligent Data Analysis in Medicine and Pharmacology, 2000, pp. 20–24.

3. Breunig M.M., Kriegel H.-P., Ng R.T. and Sander, LOF: identifying density-based local outliers, 29(2) (2000), 93–104.

4. Suman, Improving Influenced Outlierness (INFLO) Outlier Detection Method, 2013.

5. Tang and He, A local density-based approach for outlier detection, Neurocomputing 241 (2017), 171–180.

6. Tang, Chen, Fu A.W. and Cheung, A Robust Outlier Detection Scheme for Large Data Sets, in: Pacific-Asia Conf on Knowledge Discovery & Data Mining, 2002, pp. 6–8.

7. Knorr E.M. and Ng R.T., Algorithms for Mining Distance-Based Outliers in Large Datasets, in: International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc, 1998, pp. 392–403.

8. Knorr E.M., Ng R.T. and Tucakov, Distance-based outliers: Algorithms and applications, VLDB Journal 8(3-4) (2000), 237–253.

9. Zhang, Hutter and Jin, A New Local Distance-Based Outlier Detection Approach for Scattered Real-World Data, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, 2009, pp. 813–822.

10. Witten I.H., Frank, Hall M.A. and Pal C.J., Data Mining: Practical Machine Learning Tools and Techniques, third edition, Morgan Kaufmann Publishers Inc, 2011, 134–210.

11. Zhao and Wu, Integration of ANN and Statistical Method for Outlier Detection in Complex System, in: 8th International Conference on Neural Information Processing, 2001.

12. Goel and Montgomery, Statistics and Machine Learning based Outlier Detection Techniques for Exoplanets, IAU General Assembly, 2015, 22.

13. Zhou H.F., J.H., Zhang F.C. and Cui Y.A., A graph clustering method for community detection in complex networks, Physica A: Statistical Mechanics & Its Applications 469 (2017), 551–562.

14. Zhou, Guo, Wang and Zhao, A feature selection approach based on interclass and intraclass relative contributions of terms, Computational Intelligence and Neuroscience (17) (2016), 1–8.

15. Kontaki, Gounaris, Papadopoulos A.N., Tsichlas and Manolopoulos, Efficient and flexible algorithms for monitoring distance-based outliers over data streams, Information Systems 55(C) (2016), 37–53.

16. Zou, Wang and Yin, Outlier detection for high dimensional data, ACM SIGMOD International Conference on Management of Data 30(2) (2015), 37–46.

17. Huang, Qin, Chen and Wang, Density-Based Spatial Outliers Detecting, in: Computational Science-ICCS 2005, International Conference, Atlanta, GA, USA, Proceedings, DBLP, Vol. 3514, May 22–25, 2005, pp. 979–986.

18. Jin, Tung A.K.H., Han and Wang, Ranking outliers using symmetric neighborhood relationship, Lecture Notes in Computer Science 3918 (2006), 577–593.

19. Tang and He, KernelADASYN: Kernel based adaptive synthetic data generation for imbalanced learning, in: Evolutionary Computation, IEEE, 2015, pp. 664–671.

20. Schubert, Zimek and Kriegel H.-P., Generalized Outlier Detection with Flexible Kernel Density Estimates, in: SIAM International Conference on Data Mining, 2014, pp. 542–550.

21. Campos G.O., Zimek, Sander, Campello R.J.G.B., Micenkova, Schubert, Assent and Houle M.E., On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study, Data Mining & Knowledge Discovery 30(4) (2016), 891–927.

22. Huang, Zhu, Yang, Cheng D.D. and Wu, A novel outlier cluster detection algorithm without top-n parameter, Knowledge-Based Systems 121 (2017), 32–40.

23. Fukunaga and Flick T.E., A test of the gaussian-ness of a data set using clustering, IEEE Transactions on Pattern Analysis & Machine Intelligence 8(2) (1986), 240–247.

24. Lichman, UCI Machine Learning Repository, School of Information and Computer Science, University of California, Irvine, CA, 2013. http://archive.ics.uci.edu/ml/.

25. He and Garcia E.A., Learning from imbalanced data, IEEE Transactions on Knowledge & Data Engineering 21(9) (2008), 1263–1284.

26. Pal and Vepakomma, Optimal bandwidth estimation for a fast manifold learning algorithm to detect circular structure in high-dimensional data, 2016.

27. Zhou, Guo and Wang, A feature selection approach based on term distributions, SpringerPlus 5(1) (2016), 249.

28. Sheather S.J. and Jones M.C., A reliable data-based bandwidth selection method for kernel density estimation, Journal of the Royal Statistical Society 53(3) (1991), 683–690.

29. Zhou, Zhang and Liu, A Global-Relationship Dissimilarity Measure for the k-Modes Clustering Algorithm, Computational Intelligence and Neuroscience, 2017, 1–7.

30. Zhang, Hutter and Jin, A New Local Distance-Based Outlier Detection Approach for Scattered Real-World Data, 5476 (2009), 813–822.

31. Zhu, Kitagawa, Papadimitriou and Faloutsos, Outlier detection by example, Journal of Intelligent Information Systems 36(2) (2011), 217–247.

32. Elgammal, Duraiswami, Harwood and Davis L.S., Background and foreground modeling using nonparametric kernel density estimation for visual surveillance, Proc IEEE 90(7) (2002), 1151–1163.

33. Zhou H.F., Liu J.H. and Duan W.C., A density-based approach for detecting complexes in weighted PPI networks by semantic similarity, PLoS ONE 12(7) (2017).

34. Zhou, Zhao and Wang, An effective ensemble pruning algorithm based on frequent patterns, Elsevier Science Publishers B. V. 56(3) (2014), 79–85.
