Fast local outlier detection algorithm using K kernel space

Abstract

Outlier detection extracts the small portion of data that carries valuable information from a large volume of data, and it has become a hot topic in data mining. This article proposes a fast local outlier detection algorithm using $K$ kernel space. It addresses two problems: density-based outlier detection algorithms perform poorly when the density distribution is uneven, and introducing the reverse $K$ nearest neighbors algorithm noticeably increases the running time. By introducing the $K$ kernel space, the algorithm divides the objects in a data set into near $K$ neighborhood points and far $K$ neighborhood points and reduces the number of data points whose reverse $K$ neighborhood must be computed, which reduces the running time. The accuracy of outlier detection is improved by introducing the reachable distance and reachable density, which reduce the statistical fluctuation of distance. Finally, the effectiveness of the proposed algorithm is demonstrated on simulated and real data sets.

Keywords

Data mining, outlier detection, K kernel space, reachable density

1. Introduction

With the continuous advancement of technology, more and more data is generated, and people are overwhelmed by various kinds of data (astronomical data, financial data, scientific computing data, etc.). These data are huge in quantity, contain a large amount of information, and are usually chaotic, so it is very difficult to find useful information in them. Data mining technology emerged to meet this need [1]. Outlier data is often treated as noise during data mining, but it may carry real and valuable information that is easily overlooked [2]. We therefore cannot simply ignore the impact of outlier data on the whole data set, and outlier mining has become one of the important tasks of data mining [3]. Outlier mining is commonly applied to credit card fraud detection, medical treatment, public security, image processing, network intrusion detection and so on [4, 5, 6, 7].

There are many definitions of an outlier. The most classical one was put forward by Hawkins in 1980, and most current definitions are refinements of it. Hawkins defines an outlier as a data object that differs so significantly from other data objects that it appears to have been produced by a different mechanism [8]. Existing methods for outlier mining include distribution-based methods, distance-based methods, clustering-based methods and density-based methods.

The distribution-based methods [9, 10] can detect outliers in a data set whose distribution is known. However, the distribution of real data sets is often unknown, so distribution-based methods perform poorly on such data sets, and they are not suitable for high-dimensional data sets [11]. To overcome the limitations of distribution-based detection, the literature [12, 13] proposed clustering-based methods. These methods detect outliers by generating clusters, but their main purpose is to generate clusters rather than to mine outliers, so they have high time complexity and low efficiency [14]. Distance-based methods [15] can handle data with an unknown distribution, and they can process high-dimensional data sets effectively and easily. They can also detect global outliers effectively, but they are less effective at detecting local outliers [16].

In 2000, Breunig et al. proposed the concept of the Local Outlier Factor (LOF) [17]. LOF computes the ratio of the average density of the objects in the $K$ neighborhood of a data object to the density of the object itself, and assigns this outlier factor to each object in the data set to indicate its degree of outlierness. The local outlier factor extracts local outliers well and is applicable to high-dimensional data sets. However, it also has a defect: it does not perform well on data sets with an uneven density distribution. To address this problem, the literature [18] put forward the concept of reverse $K$ neighbors ($RN_{k}$). By combining $K$ nearest neighbors and reverse $K$ neighbors, the low detection quality of the LOF algorithm on abnormally distributed data is effectively resolved. In its computation, however, [18] uses the average distance from the objects in $N_{k}(p)$ and $RN_{k}(p)$ to $p$ as a measure of the local density of $p$. This simple measure has a problem: the statistical fluctuations in the distance measurements may be unexpectedly high [19]. This article describes the problem and its cause in detail in the following sections and proposes an improved method. In addition, computing the reverse $K$ neighbors of every point in the data increases the running time of the algorithm and reduces its efficiency, even though most of the data in a data set do not need their reverse $K$ neighbors computed. To address these two problems, this paper proposes a fast local outlier detection algorithm, FLOD_KKS, using $K$ kernel space, and verifies the effectiveness of the algorithm through experiments.

The rest of this article is structured as follows. Section 2 discusses existing outlier mining algorithms and analyzes their advantages and disadvantages. Section 3 proposes the new algorithm and analyzes its advantages. Section 4 presents experimental results showing that the accuracy of the proposed algorithm is higher than that of the existing algorithms and that its running time is lower than that of the original algorithm. Finally, Section 5 summarizes the proposed algorithm and outlines future directions.

2. Related work

2.1 LOF algorithm

Definition 1. The $K$ distance of object $o$ ($\textit{k-dist}(o)$).

Let $D$ be a given data set, let $o$, $p$, $q$ be objects in $D$, and let $\textit{dist}(p,q)$ denote the distance between two objects. The $K$ distance $\textit{k-dist}(o)$ of object $o$ is the distance $\textit{dist}(o,p)$ between $o$ and an object $p\in D$ such that:

(I) For at least $k$ objects $o^{\prime}\in D-\{o\}$, it holds that $\textit{dist}(o,o^{\prime})\leqslant\textit{dist}(o,p)$.
(II) For at most $k-1$ objects $o^{\prime\prime}\in D-\{o\}$, it holds that $\textit{dist}(o,o^{\prime\prime})<\textit{dist}(o,p)$.

Definition 2. The $K$ neighborhood of object $o$ ($N_{k}(o)$).

The $K$ neighborhood of object $o$ consists of all objects in the data set whose distance to $o$ is not greater than $\textit{k-dist}(o)$, and is defined as:

$\displaystyle N_{k}(o)=\left\{o^{\prime}\mid o^{\prime}\in D,\ \textit{dist}(o,o^{\prime})\leqslant\textit{k-dist}(o)\right\}$ (1)

The number of objects in $N_{k}(o)$ may be more than $k$ , because there may be more than one object with equal distances to $o$ .
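Definitions 1 and 2 can be illustrated with a minimal brute-force sketch. The function names `k_distance` and `k_neighborhood` and the sample points are illustrative assumptions, and the object itself is excluded from its own neighborhood, following the $D-\{o\}$ convention of Definition 1.

```python
from math import dist

def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest other object (Definition 1)."""
    others = sorted(dist(points[i], q) for j, q in enumerate(points) if j != i)
    return others[k - 1]

def k_neighborhood(points, i, k):
    """Indices of all other objects within k-dist of points[i] (Eq. (1))."""
    kd = k_distance(points, i, k)
    return [j for j, q in enumerate(points) if j != i and dist(points[i], q) <= kd]

# Hypothetical sample: three objects at distance 1 from the origin, one far away.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (5.0, 5.0)]
print(k_distance(points, 0, 2))      # 1.0
print(k_neighborhood(points, 0, 2))  # [1, 2, 3]
```

With $k=2$ the neighborhood of the origin holds three objects, because three objects tie at distance 1.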

Definition 3. The reachable distance from object $o$ to $o^{\prime}$ ($\textit{reach-dist}(o,o^{\prime})$).

If $\textit{dist}(o,o^{\prime})>\textit{k-dist}(o^{\prime})$, the reachable distance is $\textit{dist}(o,o^{\prime})$; otherwise it is $\textit{k-dist}(o^{\prime})$. The reachable distance is defined as:

$\displaystyle\textit{reach-dist}(o,o^{\prime})=\max\left\{\textit{k-dist}(o^{\prime}),\ \textit{dist}(o,o^{\prime})\right\}$ (2)

Definition 4. The local reachable density of the object $o$ ($\textit{lrd}(o)$).

The local reachable density is the reciprocal of the average reachable distance from $o$ to the objects in its $K$ neighborhood, and is defined as:

$\displaystyle\textit{lrd}(o)=\frac{1}{\frac{\sum\nolimits_{o^{\prime}\in N_{k}(o)}\textit{reach-dist}(o,o^{\prime})}{\left|N_{k}(o)\right|}}$ (3)

Calculating the density of $o$ directly from the average distance between $o$ and the objects in $N_{k}(o)$ is simple and feasible, but it has a problem: if $o$ has a neighbor $o^{\prime}$ that is very close to it, $\textit{dist}(o,o^{\prime})$ is very small, and the statistical fluctuations in the distance measurements are likely to be unexpectedly high. The density value can be understood as follows. The higher the density of an object, the more likely it belongs to the same cluster as its neighbors; the lower the density, the more likely it is an outlier. Because the density of $o$ is measured by the average distance between $o$ and the objects of its neighborhood, large fluctuations in the distance statistics cause detection errors. Introducing the reachable density therefore adds a smoothing effect and improves the precision of the mining.
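The smoothing effect of Eqs. (2) and (3) can be seen in a small sketch. The helper names and the one-dimensional sample are illustrative assumptions, and the sketch assumes no duplicate objects (otherwise the sum of reachable distances could be zero).

```python
from math import dist

def k_distance(points, i, k):
    return sorted(dist(points[i], q) for j, q in enumerate(points) if j != i)[k - 1]

def k_neighborhood(points, i, k):
    kd = k_distance(points, i, k)
    return [j for j, q in enumerate(points) if j != i and dist(points[i], q) <= kd]

def reach_dist(points, i, j, k):
    """Eq. (2) with o = points[i] and o' = points[j]."""
    return max(k_distance(points, j, k), dist(points[i], points[j]))

def lrd(points, i, k):
    """Eq. (3): reciprocal of the mean reachable distance over N_k(o)."""
    nbrs = k_neighborhood(points, i, k)
    return len(nbrs) / sum(reach_dist(points, i, j, k) for j in nbrs)

# Hypothetical 1-D sample: object 0 has one very close neighbor (0.25 away).
points = [(0.0,), (0.25,), (1.0,), (2.0,), (3.0,)]
print(reach_dist(points, 0, 1, 2))  # 0.75: the raw distance 0.25 is lifted to k-dist(o')
print(lrd(points, 0, 2))            # 2 / 1.75, about 1.14
```

A density based on the raw average distance would be $1/0.625=1.6$ here; the reachable distance damps the influence of the single very close neighbor.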

Definition 5. The local outlier factor of the object $o$ ($\textit{LOF}(o)$).

The local outlier factor of an object represents the degree to which the object is an outlier. The local outlier factor of the object $o$ is defined as:

$\displaystyle\textit{LOF}(o)=\frac{\sum\nolimits_{p\in N_{k}(o)}\frac{\textit{lrd}(p)}{\textit{lrd}(o)}}{\left|N_{k}(o)\right|}$ (4)

The closer the LOF value is to 1, the closer the density of $o$ is to the density of the objects in its neighborhood, and the more likely $o$ belongs to the same cluster as its neighbors. If the value is less than 1, the density of $o$ is higher than that of the points in its neighborhood, and $o$ is a dense point. If the value is greater than 1, the density of $o$ is lower than that of its neighborhood, and $o$ is more likely to be an outlier.
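The interpretation of Eq. (4) can be checked with a brute-force sketch. All function names and the sample data are illustrative assumptions; this is not an optimized implementation.

```python
from math import dist

def k_distance(pts, i, k):
    return sorted(dist(pts[i], q) for j, q in enumerate(pts) if j != i)[k - 1]

def k_neighborhood(pts, i, k):
    kd = k_distance(pts, i, k)
    return [j for j, q in enumerate(pts) if j != i and dist(pts[i], q) <= kd]

def lrd(pts, i, k):
    # Eq. (3), with the reachable distance of Eq. (2) inlined.
    nbrs = k_neighborhood(pts, i, k)
    return len(nbrs) / sum(max(k_distance(pts, j, k), dist(pts[i], pts[j])) for j in nbrs)

def lof(pts, i, k):
    """Eq. (4): mean ratio of the neighbors' lrd to the object's own lrd."""
    nbrs = k_neighborhood(pts, i, k)
    return sum(lrd(pts, j, k) for j in nbrs) / (len(nbrs) * lrd(pts, i, k))

# Hypothetical sample: a unit square (one cluster) plus one distant object.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(lof(points, 0, 2))  # 1.0: same density as its neighbors
print(lof(points, 4, 2))  # well above 1: likely an outlier
```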

But this method also has a problem. As shown in Fig. 1, the point $p$ is located in a sparse cluster $C_{2}$ and is very close to the boundary of a dense cluster $C_{1}$, while $q$ is closer to $C_{1}$. In this case the LOF algorithm finds that the outlier degree of $p$ is larger than that of $q$, which is obviously wrong. As shown in Fig. 2, the $K$ neighborhood of $r$ contains two normal points and one outlier. The density of object $r$ is relatively low, but the density of its neighborhood is also low. The LOF algorithm therefore concludes that the outlier degree of $r$ is lower than that of $p$, which is also wrong. To avoid these problems, the literature [18] introduced reverse $K$ neighbors and proposed the INFLO algorithm.

Figure 1.
Outlier $p, q, r$ schematic.

Figure 2.
Reverse $K$ nearest neighbor $RN_{k}$ schematic.

2.2 INFLO algorithm

Definition 6. Reverse $K$ neighborhood of object $o$ ($RN_{k}(o)$).

Let $D$ be a data set and let $o$, $p$ be objects in $D$. If the $K$ neighborhood of the object $p$ includes the object $o$, then $p$ is a reverse $K$ neighbor of $o$. The reverse $K$ neighborhood of object $o$ is defined as:

$\displaystyle RN_{k}(o)=\{p\mid p\in D,\ o\in N_{k}(p)\}$ (5)

Definition 7. The density of the object $o$ ($\textit{den}(o)$).

Let $o$ be an object in the data set $D$. The density of $o$ is defined as:

$\displaystyle\textit{den}(o)=\frac{1}{\textit{k-dist}(o)}$ (6)

Definition 8. Outlier factor of object $o$ ($\textit{INFLO}(o)$).

Let $o$ be an object in the data set $D$. The outlier factor of $o$ is defined as:

$\displaystyle\textit{INFLO}(o)=\frac{\sum\limits_{p\in N_{k}(o)\cup RN_{k}(o)}\textit{den}(p)}{\left|N_{k}(o)\cup RN_{k}(o)\right|\,\textit{den}(o)}$ (7)

If the INFLO value of an object is far greater than 1, then the object is more likely to be an outlier.
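Definitions 6 to 8 can be sketched by brute force as follows. The function names and the sample points are illustrative assumptions; the reverse neighborhood is found by scanning every object, which is exactly the cost the remainder of this section discusses.

```python
from math import dist

def k_distance(pts, i, k):
    return sorted(dist(pts[i], q) for j, q in enumerate(pts) if j != i)[k - 1]

def k_neighborhood(pts, i, k):
    kd = k_distance(pts, i, k)
    return {j for j, q in enumerate(pts) if j != i and dist(pts[i], q) <= kd}

def reverse_k_neighborhood(pts, i, k):
    """Eq. (5): objects whose K neighborhood contains pts[i]."""
    return {j for j in range(len(pts)) if j != i and i in k_neighborhood(pts, j, k)}

def inflo(pts, i, k):
    """Eq. (7), with den(o) = 1 / k-dist(o) from Eq. (6)."""
    space = k_neighborhood(pts, i, k) | reverse_k_neighborhood(pts, i, k)
    den_sum = sum(1.0 / k_distance(pts, j, k) for j in space)
    return den_sum * k_distance(pts, i, k) / len(space)

# Hypothetical sample: a unit square plus one distant object.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(sorted(reverse_k_neighborhood(points, 3, 2)))  # [1, 2, 4]
print(inflo(points, 0, 2))  # 1.0
print(inflo(points, 4, 2))  # far greater than 1
```

Note that the distant object 4 has an empty reverse neighborhood, while it appears in the reverse neighborhood of object 3.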

The INFLO method overcomes the inability of the LOF algorithm to judge outliers correctly when the data distribution is abnormal, but it still has the following shortcomings:

1) The algorithm must query not only the $K$ nearest neighborhood of each data point $p$ but also the reverse $K$ nearest neighborhood (RkNN) of $p$. Frequent RkNN queries have a great impact on the performance of the algorithm. Although R-trees and other index structures are used in [18] to improve query efficiency, they cannot fundamentally solve the efficiency problem. Moreover, when the dimension of the data set is high, index structures such as the R-tree are less efficient than sequential traversal queries and fail to improve efficiency.

2) The algorithm must analyze and calculate the outlier degree of every data point to determine whether it may be an outlier, which results in a large time overhead. The large number of reverse $K$ nearest neighbor queries exacerbates this burden. In an actual data set, abnormally distributed points account for only a small part of the data. In Fig. 1, for example, such points are only the marginal points of clusters $C_{1}$ and $C_{2}$ and abnormally distributed data points such as $q$ and $r$. For most data points in the data set (the data points of $C_{1}$ and $C_{2}$), a correct outlier judgment can be obtained without analyzing their RkNN.

Compared with the LOF algorithm, INFLO has to calculate the reverse $K$ neighbors of all points in the data set, which increases the running time of the algorithm. Since not all points require their reverse $K$ neighbors to be calculated, a new algorithm, the Fast Local Outlier Detection Algorithm Using $K$ Kernel Space (FLOD_KKS), is proposed.

3. FLOD_KKS algorithm

3.1 Basic idea of the algorithm

In this paper, the number of reverse $K$ neighborhood computations is reduced by defining far and near $K$ neighborhood points (Definition 10) and $K$ kernel points (Definition 11). At the same time, the reachable density of LOF is used to represent the density of objects and of their neighbors, which reduces the statistical fluctuation of the distance measure and improves the accuracy of the algorithm.

3.2 Related definitions

Definition 9. $K$ kernel space (K-KS).

Let $D$ be a data set and let $o$ and $p$ be objects in $D$. If object $o$ lies in the $K$ neighborhood of $p$ and the $K$ neighborhood of $o$ also contains $p$, then $o$ belongs to the $K$ kernel space of $p$. The $K$ kernel space is defined as:

$\displaystyle\textit{K-KS}(p)=\{o\mid o\in N_{k}(p)\wedge p\in N_{k}(o)\}$ (8)

Definition 10. Far and near $K$ neighborhood points.

For an object $p$ in the data set, if $\frac{|\textit{K-KS}(p)|}{|N_{k}(p)|}>\lambda$, then $p$ is called a near $K$ neighborhood point; otherwise it is called a far $K$ neighborhood point. Here $\lambda$ $(0<\lambda\leqslant 1)$ is the threshold of the neighborhood point.

Figure 3.

$K$ Kernel space schematic of object $p$ .

According to Definition 10, if a point $p$ is a near $K$ neighborhood point, then more than $\lambda\times 100\%$ of the data points in the $K$ neighborhood of $p$ also contain $p$ in their own $K$ neighborhoods. Consequently, $p$, $q$, and $r$ in Fig. 2 cannot be near $K$ neighborhood points. In fact, any $\lambda>0$ suffices to avoid the misjudgment caused by unevenly distributed data such as the case in Fig. 2. When $0.5<\lambda\leqslant 1$, the density distribution around a large proportion of the data points in the $K$ neighborhood of $p$ tends to the density distribution around $p$ itself. Therefore, when deciding whether such a near $K$ neighborhood point $p$ is an outlier, it is unnecessary to consider the reverse $K$ neighborhood of $p$.
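Definitions 9 and 10 can be sketched as follows. The names `k_kernel_space` and `is_near_point` and the sample points are illustrative assumptions.

```python
from math import dist

def k_distance(pts, i, k):
    return sorted(dist(pts[i], q) for j, q in enumerate(pts) if j != i)[k - 1]

def k_neighborhood(pts, i, k):
    kd = k_distance(pts, i, k)
    return {j for j, q in enumerate(pts) if j != i and dist(pts[i], q) <= kd}

def k_kernel_space(pts, i, k):
    """Eq. (8): neighbors of pts[i] that also have pts[i] as a neighbor."""
    return {j for j in k_neighborhood(pts, i, k) if i in k_neighborhood(pts, j, k)}

def is_near_point(pts, i, k, lam):
    """Definition 10: near K neighborhood point if |K-KS| / |N_k| > lambda."""
    return len(k_kernel_space(pts, i, k)) / len(k_neighborhood(pts, i, k)) > lam

# Hypothetical sample: a unit square plus one distant object.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(sorted(k_kernel_space(points, 0, 2)))  # [1, 2]
print(is_near_point(points, 0, 2, 0.5))      # True
print(is_near_point(points, 4, 2, 0.5))      # False: no neighbor points back at it
```

Only mutual neighbors count, so the test needs nothing beyond the $K$ neighborhoods. Because the inequality is strict, $\lambda=1$ makes every object a far $K$ neighborhood point.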

Definition 11. $K$ kernel points.

Let the object $o$ in the data set be a near $K$ neighborhood point, and let $p$ range over the objects in the $K$ neighborhood of $o$. If $o$ satisfies:

$\displaystyle\textit{k-dist}(o)<\frac{\sum\nolimits_{p\in N_{k}(o)}\textit{k-dist}(p)}{\left|N_{k}(o)\right|}$ (9)

then $o$ is called a $K$ kernel point.
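The kernel point test can be sketched as below, reading Eq. (9) as a comparison between the $K$ distance of an object and the mean $K$ distance of its neighbors (the reading used in the proof of Theorem 1). The function names and the one-dimensional sample are illustrative assumptions.

```python
from math import dist

def k_distance(pts, i, k):
    return sorted(dist(pts[i], q) for j, q in enumerate(pts) if j != i)[k - 1]

def k_neighborhood(pts, i, k):
    kd = k_distance(pts, i, k)
    return {j for j, q in enumerate(pts) if j != i and dist(pts[i], q) <= kd}

def is_k_kernel_point(pts, i, k, lam):
    """Definition 11: a near K neighborhood point that satisfies Eq. (9)."""
    nk = k_neighborhood(pts, i, k)
    kks = {j for j in nk if i in k_neighborhood(pts, j, k)}
    if len(kks) / len(nk) <= lam:          # far K neighborhood point
        return False
    mean_kdist = sum(k_distance(pts, j, k) for j in nk) / len(nk)
    return k_distance(pts, i, k) < mean_kdist

# Hypothetical 1-D sample: the object at 4.0 sits in the middle of a small cluster.
points = [(0.0,), (3.0,), (4.0,), (5.0,), (8.0,)]
print(is_k_kernel_point(points, 2, 2, 0.5))  # True: k-dist 1 < mean neighbor k-dist 2
print(is_k_kernel_point(points, 1, 2, 0.5))  # False: near point, but Eq. (9) fails
print(is_k_kernel_point(points, 0, 2, 0.5))  # False: far K neighborhood point
```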

Definition 12. $K$ influence space of object $p$ (K_IS).

The union of the $K$ neighborhood of the object $p$ and the reverse $K$ neighborhood of $p$ is referred to as the $K$ influence space of $p$, and is defined as:

$\displaystyle\textit{K\_IS}(p)=\{o\mid o\in N_{k}(p)\cup RN_{k}(p)\}$ (10)

As shown in Fig. 4, $r, s, q, t, w$ are objects in data set $D$, and the $K$ influence space of object $p$ is $\{r, s, q, t, w\}$.

Figure 4.

K_IS schematic for $K=4$.

From Definition 11, the following theorem can be drawn.

Theorem 1. If the object $p$ in the data set is a $K$ kernel point, $p$ must not be an outlier.

Proof. If $p$ is a kernel point, then the $K$ distance of $p$ is less than the average of the $K$ distances of all the objects in the $K$ neighborhood of $p$. This shows that the points in the $K$ neighborhood of $p$ are close to $p$ and, on the whole, gather around $p$. Moreover, because $p$ is a near $K$ neighborhood point, the uneven density distribution of Fig. 1 does not arise (the point $q$ in Fig. 2, for example, is not a near $K$ neighborhood point), so we do not need to consider the reverse $K$ neighborhood. According to Eq. (4), the outlier factor of $p$ is not greater than 1, so $p$ is not an outlier.

Definition 13. Local Outlier Degree (LOD).

The local outlier degree of the object $p$ in the data set is given by Eq. (11).

$\displaystyle\textit{LOD}(p)=\begin{cases}\displaystyle\frac{\sum\limits_{o\in N_{k}(p)}\frac{\textit{lrd}(o)}{\textit{lrd}(p)}}{|N_{k}(p)|},&\frac{|\textit{K-KS}(p)|}{|N_{k}(p)|}>\lambda\\ \displaystyle\frac{\sum\limits_{o\in\textit{K\_IS}(p)}\frac{\textit{lrd}(o)}{\textit{lrd}(p)}}{|\textit{K\_IS}(p)|},&\text{else}\end{cases}$ (11)

Here $\lambda$ is the threshold parameter of Definition 10.
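The two branches of Eq. (11) can be sketched as follows. The function names and the sample points are illustrative assumptions, and duplicate objects are assumed absent so that no reachable density is undefined.

```python
from math import dist

def k_distance(pts, i, k):
    return sorted(dist(pts[i], q) for j, q in enumerate(pts) if j != i)[k - 1]

def k_neighborhood(pts, i, k):
    kd = k_distance(pts, i, k)
    return {j for j, q in enumerate(pts) if j != i and dist(pts[i], q) <= kd}

def lrd(pts, i, k):
    # Eq. (3), with the reachable distance of Eq. (2) inlined.
    nbrs = k_neighborhood(pts, i, k)
    return len(nbrs) / sum(max(k_distance(pts, j, k), dist(pts[i], pts[j])) for j in nbrs)

def lod(pts, i, k, lam):
    """Eq. (11): use N_k for near points and K_IS = N_k | RN_k for far points."""
    nk = k_neighborhood(pts, i, k)
    kks = {j for j in nk if i in k_neighborhood(pts, j, k)}
    if len(kks) / len(nk) > lam:
        space = nk                      # near point: no reverse query needed
    else:
        rnk = {j for j in range(len(pts)) if j != i and i in k_neighborhood(pts, j, k)}
        space = nk | rnk                # far point: K influence space, Eq. (10)
    return sum(lrd(pts, j, k) for j in space) / (len(space) * lrd(pts, i, k))

# Hypothetical sample: a unit square plus one distant object.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(lod(points, 0, 2, 0.5))  # 1.0
print(lod(points, 4, 2, 0.5))  # well above 1
```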

3.3 Algorithmic description

The points of the data set are divided into near $K$ neighborhood points and far $K$ neighborhood points. A near $K$ neighborhood point that is also a $K$ kernel point is directly marked as a normal point. If a near $K$ neighborhood point is not a $K$ kernel point, then according to the analysis of Definition 10 its reverse $K$ neighborhood need not be considered. Its LOD value is computed by the first case of Eq. (11) in Definition 13, which applies when $\frac{|\textit{K-KS}(p)|}{|N_{k}(p)|}>\lambda$; if the LOD value is greater than a set threshold, the point is marked as an outlier, and otherwise it is marked as a normal point. If the point is a far $K$ neighborhood point, its reverse $K$ neighborhood needs to be considered in order to handle abnormal distributions. Its LOD value is computed by the second case of Eq. (11); if it is greater than the set threshold, the point is marked as an outlier, and otherwise it is marked as a normal point.

Based on the idea of the algorithm in Section 3.1 and the related definitions in Section 3.2, the FLOD_KKS algorithm proposed in this paper is presented as Algorithm 1.

Algorithm 1: Fast local outlier mining algorithm using $K$ kernel space (FLOD_KKS)

input: Data set $D$, parameters $k$, $\lambda$ and $t$
output: Outlier collection $S_{\textit{outliers}}$

FLOD_KKS($D$, $k$, $\lambda$, $t$)
BEGIN
1) Initialize the outlier set $S_{\textit{outliers}}=\emptyset$
2) FOR each object $p$ in data set $D$
3)   IF ($p$ does not have any marking) then
4)     Compute the $K$ neighborhood $N_{k}(p)$ and the $K$ kernel space K-KS($p$);
5)     IF $\left(\frac{|\textit{K-KS}(p)|}{|N_{k}(p)|}>\lambda\right)$ then // $p$ is a near $K$ neighborhood point
6)       IF $\textit{k-dist}(p)<\frac{\sum\nolimits_{o\in N_{k}(p)}\textit{k-dist}(o)}{|N_{k}(p)|}$ then // $p$ is a $K$ kernel point
7)         $p$ is marked as a normal point;
8)       ELSE
9)         Calculate the reachable distances and the reachable density of the data points;
10)        IF $\left(\frac{\sum\limits_{o\in N_{k}(p)}\frac{\textit{lrd}(o)}{\textit{lrd}(p)}}{|N_{k}(p)|}>t\right)$ then
11)          $p$ is marked as an outlier, $S_{\textit{outliers}}=S_{\textit{outliers}}\cup\{p\}$;
12)        ELSE
13)          $p$ is marked as a normal point;
14)        END IF
15)      END IF
16)    ELSE // $p$ is a far $K$ neighborhood point
17)      Get $RN_{k}(p)$ and K_IS($p$);
18)      IF $\left(\frac{\sum\limits_{o\in\textit{K\_IS}(p)}\frac{\textit{lrd}(o)}{\textit{lrd}(p)}}{|\textit{K\_IS}(p)|}>t\right)$ then
19)        $p$ is marked as an outlier, $S_{\textit{outliers}}=S_{\textit{outliers}}\cup\{p\}$;
20)      ELSE
21)        $p$ is marked as a normal point;
22)      END IF
23)    END IF
24)  END IF
25) END FOR
26) END
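The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal brute-force illustration rather than the authors' Java implementation: the function name `flod_kks`, the precomputed distance matrix and the sample data are assumptions, and duplicate objects are assumed absent.

```python
from math import dist

def flod_kks(pts, k, lam, t):
    """Return the indices of the objects flagged as outliers (sketch of Algorithm 1)."""
    n = len(pts)
    d = [[dist(p, q) for q in pts] for p in pts]          # pairwise distances
    kdist = [sorted(d[i][j] for j in range(n) if j != i)[k - 1] for i in range(n)]
    nk = [{j for j in range(n) if j != i and d[i][j] <= kdist[i]} for i in range(n)]

    def lrd(i):
        # Eq. (3): reciprocal of the mean reachable distance over N_k(i).
        return len(nk[i]) / sum(max(kdist[j], d[i][j]) for j in nk[i])

    outliers = []
    for i in range(n):
        kks = {j for j in nk[i] if i in nk[j]}            # K kernel space, Eq. (8)
        if len(kks) / len(nk[i]) > lam:                   # near K neighborhood point
            if kdist[i] < sum(kdist[j] for j in nk[i]) / len(nk[i]):
                continue                                  # K kernel point: normal
            space = nk[i]
        else:                                             # far K neighborhood point
            rnk = {j for j in range(n) if j != i and i in nk[j]}  # Eq. (5)
            space = nk[i] | rnk                           # K influence space, Eq. (10)
        lod = sum(lrd(j) for j in space) / (len(space) * lrd(i))  # Eq. (11)
        if lod > t:
            outliers.append(i)
    return outliers

# Hypothetical sample: a unit square plus one distant object.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(flod_kks(points, k=2, lam=0.5, t=1.9))  # [4]
```

The reverse $K$ neighborhood is built only in the branch for far $K$ neighborhood points, which is where the saving over INFLO comes from. The threshold $t=1.9$ is the value used in the experiments of Section 4.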

3.4 Algorithm analysis

According to Definition 10, the points in the data set can be divided into near $K$ neighborhood points and far $K$ neighborhood points. When the value of $\lambda$ is small, most of the data are classified as near $K$ neighborhood points, but this does not affect the determination of outliers, because the purpose of near $K$ neighborhood points is to avoid the effect of the uneven density distribution in Fig. 1. Even if some data points with sparse neighborhoods, or even some outliers, are judged to be near $K$ neighborhood points, the algorithm still has to determine whether they are $K$ kernel points or calculate their local outlier degree to decide whether they are outliers. Therefore, the value of $\lambda$ has no effect on the judgment of outliers.

In the proposed algorithm, the time complexity of finding the $K$ neighbors is $O(n(n-1)/2)$. The time complexity of finding the reverse $K$ neighbors is $O((n-m)^{2})$, where $m$ is the number of near $K$ neighborhood points. The time complexity of computing the $K$ kernel space is $O(nk^{2})$, where $k$ is the number of $K$ neighbors. The time complexity of determining the far and near $K$ neighborhood points, the kernel points and the $K$ influence space is $O(n)$. The running time of the proposed algorithm is therefore dominated by finding the $K$ neighbors and the reverse $K$ neighbors. By comparison, the time complexity of finding the reverse $K$ neighbors in the INFLO algorithm is $O(n^{2})$. Compared with INFLO, the proposed algorithm does not introduce new temporary variables to store data, so it does not increase the space complexity. At the same time, it uses the reachable distance of the LOF algorithm to calculate the reachable density, which reduces the impact of statistical fluctuations and improves the accuracy of the algorithm.

4. Experimental results and performance analysis

4.1 Experimental environment

The algorithm proposed in this paper was implemented and verified in MyEclipse. The experimental environment consists of two parts: the hardware configuration and the software configuration.

1) Hardware environment configuration: 3.9 GHz CPU, 8.00 GB of memory, 1 TB hard disk.
2) Software environment configuration: 64-bit Windows 10 operating system. MyEclipse and a MySQL database were used as development tools; the experimental program was written in Java and the compilation environment was jdk1.6.0.

The algorithm detects anomalies in numerical data. Several data sets are used to verify its effectiveness: UCI data sets and simulation data sets. From UCI we use the Iris data set, the Abalone data set and the Shuttle data set. The Iris data set records four attributes of the genus Iris; it contains 150 objects of dimension 4 that form three clusters. We modify part of the data so that the number of outliers is 9. The Abalone data set, which is used to predict the age of abalone from physical measurements, has dimension 8 and contains 4177 objects; we modify part of the data so that the number of outliers is 17. The Shuttle data set consists of 9 attributes and 58000 data objects in total, eighty percent of which belong to the first class. We select 10000 data objects from the first class as normal data and randomly select 600 data objects from the other classes as outliers.

For the simulation data, we use three data sets, data1, data2 and data3, each of dimension 2, as shown in Fig. 5. Figure 5a shows data1, which contains 1007 objects, 7 of which are outliers. Figure 5b shows data2, which contains 2213 objects forming two clusters and 13 outliers. Figure 5c shows data3, which contains 831 objects forming two large clusters and three small clusters; the points in the three small clusters are outliers, giving 28 outliers in total.

Figure 5.
Simulation data sets. (a) Data1 set; (b) Data2 set; (c) Data3 set.

To evaluate the performance of the algorithms, we use two metrics, accuracy (Acc) and rank-power (RP). The accuracy is measured as $\textit{Acc}=S_{\textit{outliers}}/T_{\textit{outliers}}$, where $S_{\textit{outliers}}$ is the number of detected outliers and $T_{\textit{outliers}}$ is the number of real outliers. If, for a given detection method, the true outliers occupy the top positions relative to the non-outliers among the $m$ most suspicious instances, the rank-power (RP) of the method is said to be high [20]. If $n$ denotes the number of outliers found within the top $m$ instances and $\Re_{i}$ denotes the rank of the $i$th true outlier, the rank-power is given by:

$\displaystyle{\rm RP}=\frac{n(n+1)}{2\sum\nolimits_{i=1}^{n}{\Re_{i}}}$

Rank-power can have a maximum value of 1 when all $n$ true outliers are in top $n$ positions. For a fixed value of $m$ , larger values of these metrics imply better performance.
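Both metrics can be sketched in a few lines. The function names are illustrative assumptions. Rank-power is normalized with a factor 2 in the denominator so that its maximum is 1 when the $n$ true outliers occupy the top $n$ ranks, as stated above, and accuracy is read here as the fraction of real outliers that were detected.

```python
def rank_power(ranks):
    """RP = n(n+1) / (2 * sum of ranks), where ranks are the 1-based
    positions of the n true outliers found among the top m instances."""
    n = len(ranks)
    return n * (n + 1) / (2 * sum(ranks))

def accuracy(detected, real_outliers):
    """Acc: fraction of the real outliers that appear among the detected ones."""
    return len(set(detected) & set(real_outliers)) / len(set(real_outliers))

print(rank_power([1, 2, 3]))                 # 1.0: all true outliers ranked first
print(rank_power([1, 3, 5]))                 # 12 / 18, about 0.67
print(accuracy({1, 2, 9}, {1, 2, 3, 4}))     # 0.5
```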

4.2 Results and analysis

To verify the time efficiency of the algorithm, we compare it with the INFLO algorithm and the CLOF algorithm [21]. As described above, the INFLO algorithm needs to calculate the reverse $K$ neighbors of every point in the data set. By introducing far and near $K$ neighborhood points, the proposed algorithm effectively reduces the number of points whose reverse $K$ neighbors must be calculated, so its time efficiency is higher than that of INFLO. The running time on the same data set varies with $\lambda$: the larger $\lambda$ is, the longer the algorithm runs. Figure 6 compares the running time of INFLO, CLOF and the proposed algorithm. In the experiments, the parameter $t$ is set to 1.9, and the value of $K$ differs between data sets; for each data set we report the running time at the $K$ value that gives the highest accuracy. Figure 6a–f shows that when $\lambda$ is greater than 0.6, the running time of the proposed algorithm increases with $\lambda$, and when $\lambda$ equals 1 it is basically the same as that of INFLO. The reason is that the closer $\lambda$ is to 1, the fewer points in the data set qualify as near $K$ neighborhood points, so more data points need their reverse $K$ neighbors calculated and more time is needed. Dividing the data set into near and far $K$ neighborhood points by means of $\lambda$ thus effectively reduces the running time of the algorithm. The proposed algorithm is also compared with CLOF, a clustering-based outlier detection algorithm. Because clustering-based outlier detection must first cluster the data and the outliers are only by-products of clustering, its time efficiency is low, as Fig. 6 also shows.

Figure 6.

Running time of FLOD_KKS, INFLO and CLOF on different data sets for different values of $\lambda$. (a) Iris data set; (b) Abalone data set; (c) Shuttle data set; (d) Data1 set; (e) Data2 set; (f) Data3 set.

Table 1

Accuracy and rank-power of five algorithms under different data sets

Data set	Number of data	Number of outliers	LOF RP	LOF Acc%	INFLO RP	INFLO Acc%	LDBO RP	LDBO Acc%	FLOD_KKS RP	FLOD_KKS Acc%	CLOF RP	CLOF Acc%
Iris	150	9	0.59	85	0.66	90	0.78	90	0.85	98	0.94	97
Abalone	4177	17	0.78	90	0.74	92	0.89	93	0.98	99	0.76	85
Shuttle	10000	600	0.45	70	0.84	82	0.85	80	0.87	87	1	100
Data1	1007	7	1	100	1	100	1	100	1	100	1	100
Data2	2213	13	0.73	88	0.86	91	0.89	91	0.95	97	1	100
Data3	813	28	0.54	87	0.79	91	0.72	91	0.45	98	1	100

Figure 7.

Comparison of accuracy of five algorithms in different data sets.

Figure 8.

Comparison of rank-power of five algorithms in different data sets.

To verify the accuracy (Acc) and rank-power (RP) of the algorithm, we compared it with the LOF, INFLO, LDBO [22] and CLOF algorithms. Table 1 and Fig. 7 show that the accuracy of the proposed algorithm is higher than that of the three other density-based algorithms (LOF, INFLO and LDBO) on the three real data sets Iris, Abalone and Shuttle and on the two simulation data sets data2 and data3. On the simulation data set data1 the algorithms achieve the same accuracy and detect all the outliers. This happens because all outliers in data1 are far away from the normal data and the normal data are concentrated, so every algorithm detects them well. As noted above, the closer the RP value is to 1, the better the algorithm ranks the true outliers. Table 1 and Fig. 8 show that the RP values of the proposed algorithm and the CLOF algorithm are generally higher across the data sets, but Fig. 6 shows that the running time of CLOF is longer than that of the proposed algorithm. Taking both into consideration, the proposed algorithm has the advantage in efficiency. It can also be seen that the value of $\lambda$ does not affect the accuracy of the algorithm. The accuracy of the proposed algorithm is higher than that of LOF, INFLO and LDBO because it uses the reachable density when calculating the density of the data, which reduces the statistical fluctuations. Compared with the clustering-based outlier detection algorithm CLOF, the proposed algorithm achieves comparable detection accuracy. However, clustering-based outlier detection first needs to cluster the data set, and the outliers are only by-products of clustering, so its time efficiency is low. This conclusion can also be drawn from Figs 6 and 7.

The FLOD_KKS method overcomes the inability of the LOF algorithm to handle outliers in abnormally distributed data. It removes the need of the INFLO algorithm to calculate the reverse $K$ neighbors of all the points in the data set, and by introducing the reachable density it reduces the fluctuation of the distance statistics in the LDBO algorithm and increases the accuracy of outlier detection.

5. Summary

This paper proposes a fast local outlier detection algorithm, FLOD_KKS, using $K$ kernel space, which effectively solves the problem of uneven density distribution in density-based methods. The $K$ kernel space is introduced to divide the data into near $K$ neighborhood points and far $K$ neighborhood points, which reduces the number of objects whose reverse $K$ neighbors must be calculated and thus the running time of the algorithm. Using the reachable distance and reachable density proposed in the LOF algorithm effectively reduces the statistical fluctuation of the distance metric. According to the experimental results, the proposed algorithm detects outliers in the data set quickly and effectively. However, the following aspects need further study and improvement:

1) The parameter $K$ has a great influence on the algorithm. In the future, we will consider how to optimize the parameter $K$ to further improve the efficiency of the algorithm.

2) This paper effectively reduces the number of objects whose reverse $K$ nearest neighbors need to be computed. However, the reverse $K$ nearest neighbors of some data still have to be computed. In the future, some attributes of outliers could be used to avoid calculating the reverse $K$ nearest neighbors, which would make the algorithm more efficient.

3) For different data sets, the parameter $t$ must be entered manually. In the future, the parameter $t$ could be made to adapt to different data sets.

References

1. Salehi, Leckie, Bezdek J.C., et al., Fast memory efficient local outlier detection in data streams (extended abstract), IEEE International Conference on Data Engineering 28(12) (2017), 3246–3260.

2. Nakamura, Kamidoi, Wakabayashi, et al., A decision method of attribute importance for classification by outlier detection, International Conference on Data Engineering Workshops, 2006.

3. Jin, Tung A.K. and Han, Mining top-n local outliers in large databases, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 293–298.

4. Liu, Yang, et al., Local outlier mining algorithm using subspace partitioning, Journal of Chinese Computer System 32(8) (2011), 1628–1632.

5. Tan P.N., Steinbach and Kumar, Introduction to Data Mining (First Edition), Addison-Wesley Longman Publishing Co., 2005, pp. 133–134.

6. Tong, Zhu and Wu, NOSEP: Nonoverlapping sequence pattern mining with gap constraints, IEEE Transactions on Cybernetics 48(10) (2018), 2809–2822.

7. Hido, Tsuboi, Kashima, et al., Inlier-based outlier detection via direct density ratio estimation, Eighth IEEE International Conference on Data Mining, IEEE Computer Society, 2008, pp. 223–232.

8. Hawkins D.M., Identification of outliers, Biometrics 37(4) (1980), 860–861.

9. Hido, Tsuboi, Kashima, et al., Inlier-based outlier detection via direct density ratio estimation, Eighth IEEE International Conference on Data Mining, IEEE Computer Society, 2008, pp. 223–232.

10. Otey M.E., Ghoting and Parthasarathy, Fast distributed outlier detection in mixed attribute data sets, Data Mining and Knowledge Discovery 12(2-3) (2006), 203–228.

11. Zhang, Hamm N.A.S., Meratnia, et al., Statistics based outlier detection for wireless sensor networks, International Journal of Geographical Information Science 26(8) (2012), 1373–1392.

12. Guha, Rastogi, Shim, et al., CURE: An efficient clustering algorithm for large databases, Information Systems 26(1) (1998), 35–58.

13. Sheikholeslami and Zhang, FindOut: Finding outliers in very large datasets, Knowledge and Information Systems 4(4) (2002), 387–412.

14. Liu and Wang, Outlier detection based on local minima density, Information Technology, Networking, Electronic and Automation Control Conference, IEEE, 2016, pp. 718–723.

15. Angiulli, Basta and Pizzuti, Distance-based detection and prediction of outliers, IEEE Transactions on Knowledge and Data Engineering 18(2) (2005), 145–160.

16. C.P. and Qin X.L., A density-based local outlier detecting algorithm, Journal of Computer Research and Development 47(12) (2006), 2110–2116.

17. Breunig M.M., LOF: Identifying density-based local outliers, ACM SIGMOD Record 29(2) (2000), 93–104.

18. Jin, Tung A.K.H., Han, et al., Ranking outliers using symmetric neighborhood relationship, Lecture Notes in Computer Science 3918 (2006), 577–593.

19. Han, Kamber, Jian, et al., Data Mining: Concepts and Techniques, China Machine Press, 2012, pp. 364–365.

20. Tang, Chen W.C., et al., Enhancing effectiveness of outlier detections for low density patterns, in: Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD), Taipei, Taiwan, 2002, pp. 45–84.

21. Tao, Outlier detection method based on clustering and density, South China University of Technology, 2014.

22. Zou, Zhang, Song and Ni, Fast outlier detection algorithm based on local density, Journal of Computer Applications 37(10) (2017), 2932–2937.