Abstract
Outlier detection can detect a small amount of data which containing valuable information from a large number of data, and it has become a hot topic in data mining. In this article, a new algorithm is proposed, which is fast local outlier detection algorithm using
Introduction
With the continuous advancement of technology, more and more data is generated, and people are overwhelmed by various kinds of data (astronomical data, financial data, scientific computing data, etc.). These data are huge in quantity, contain a large amount of information, and are usually chaotic. It is very difficult for people to find useful information in them. Therefore, data mining technology came into being [1]. Outlier data is often treated as noise points during the process of data mining, but outlier data may have real and valuable data information that can easily be ignored by people [2]. Therefore, we can not directly ignore the impact of outlier data on the whole data set, and outlier mining has became one of the important tasks of data mining [3]. Outlier mining is usually used in credit card fraud, medical treatment, public security, image processing, network intrusion and so on [4, 5, 6, 7].
There are many definitions of outlier. The most classical definition was put forward by Hawkins in 1980. At present, most of the definitions of outlier were improved on the basis of Hawkins. Hawkins defines outlier as a data object, which is significantly different from other data objects, as if it was produced by different mechanisms [8]. So far, there are many methods for outlier mining, including distribution-based methods, distance-based methods, clustering-based methods and density-based methods.
The distribution-based methods [9, 10] can detect outlier in a known distributed data set. However, the distribute of data sets in real data is often unknown. So the detection effect of the distribution-based method for such a data set is not obvious, and the method is not suitable for high dimensional data sets [11]. In order to solve the problem of distribution based detection method, the literature [12, 13] proposed a method based on clustering. The method detects outliers by generating clusters, but the main purpose of this method is to generate clusters instead of mining outliers, so it has high time complexity and low efficiency [14]. Distance-based method [15] can deal with the problem of the unknown distributed, and it can process high-dimensional data sets effectively and easily. In addition, the method can also effectively detect global outliers. But for local outliers, the efficiency of the method is not very good [16].
In 2000, Breunig et al. proposed the concept of Local Outlier Factor (LOF) [17] by calculating the ratio of the average density of the objects in the
The structure of this article is described below. In Section 2, existing outlier mining algorithms are discussed, and their advantages and disadvantages are analyzed. In Section 3, a new algorithm is proposed and the advantages of the proposed algorithm are analyzed. In Section 4, the experimental results show that the accuracy of the proposed algorithm is significantly higher than that of the existing algorithm, and the running time is lower than that of the original algorithm. Finally, In Section 5, it is a summary of the proposed algorithm in this paper and the future directions.
Related work
LOF algorithm
For a given data set
For at least For at least
All objects that the distance between all data sets and
The number of objects in
If
The reciprocal of the average value of the reachable distance in the
If we calculate the density of
The local outlier factor of an object represents the degree of the object’s outlier, and the local outlier factor of the object
The closer the LOF value is to 1, indicating that the density of objects in the
But this method also has a problem. As shown in Fig. 1 the point
Outlier 
Reverse 
If the INFLO value of an object is far greater than 1, then the object is more likely to be an outlier.
INFLO method can solve the defect of outlier determination when LOF algorithm is not suitable for abnormal data distribution, but there are still the following shortcomings:
The algorithm not only needs to query The algorithm needs to analyze and calculate the outlier degree of each data point to determine whether it may be an outlier, which results in a large time overhead. A large number of reverse
Compared with the LOF algorithm, INFLO requires to calculate reverse
Algorithm thought
In this paper, the number of reverse
Related definition
For the object

According to Definition 10, if a point
If the object
The union of the set of
As shown in Fig. 4,

From the definition of kernel point of Definition 11, the following theorem can be drawn.
Proof If
The local outlier of the object
The points for the data set can be divided into near
According to the description of algorithm thought in Section 3.1 and related definitions in Section 3.2, the description of FLOD_KKS algorithm proposed in this paper is presented as Algorithm 1.
Algorithm analysis
According to Definition 10, the points in the data set can be divided into near
In the proposed algorithm in this paper, the time complexity of finding
Experimental results and performance analysis
Experimental environment
The algorithm proposed in this paper is mainly verified on the MyEclipse. The experimental environment configuration of the algorithm consists of two parts: hardware environment configuration and software environment configuration.
Hardware environment configuration: CPU 3.9 GHZ, memory capacity is 8.00 GB, hard disk capacity is 1 TB. Software environment configuration: 64-bit Windows 10 operating system. Myeclipse, MySQL database were used as development tools; the experimental program was written in Java and the compilation environment was jdk1.6.0.
The algorithm can detect abnormal numerical data. In this paper, several data sets are selected to carry out experiments to verify the effectiveness of the algorithm. The experimental data uses the UCI data sets and simulation data sets. We use the Iris data set, the Abalone data set and the shuttle set respectively. The Iris data set is the four attributes of the genus Iris. It contains 150 data with a dimension of 4, which includes three types of clusters. Process the data set and modify a part of the data so that the numbers of outliers are 9. This Abalone data set is to physically predict the the age of abalone, dimension is 8, contains 4177 data, the data set is processed, modify part of the data so that the number of outliers are 17. The Shuttle dataset consists of 9 attributes, a total of 58000 data objects. Eighty percent of them belong to the first classification. This paper selects 10000 data objects in the first classification of the dataset as normal data. Random selection of 600 data objects in other classifications as outliers. For the simulation data sets, this paper uses three simulation data sets data1, data2, data3, and the dimension are also 2, as shown in Fig. 5, Fig. 5a for the data1, including 1007 data, including the numbers of outliers are 7. Figure 5b is data2, it contains 2213 data, two clusters and 13 outliers. Figure 5c represents the data3, which contains 831 data with two large clusters and three small clusters. The points in the three small clusters are outliers, and the data set data3 contains 28 outliers.
(a) Data1 set; (b) Data2 set; (c) Data3 set. Simulation data sets.
For performance evaluation of the algorithms, we have used two metrics, namely accuracy (Acc) and rank-power (RP). The accuracy of the algorithm is measured using: Acc
Rank-power can have a maximum value of 1 when all
In order to verify the time efficiency of the algorithm, we compare it with the INFLO algorithm and CLOF algorithm [21]. From the above introduction, we can know that the INFLO algorithm needs to calculate reverse
(a) Iris data set of different 
Accuracy and rank-power of five algorithms under different
Comparison of accuracy of five algorithms in different data sets.
Comparison of rank-power of five algorithms in different data sets.
In order to verify accuracy (Acc) and rank-power (RP) of the algorithm, we compared it with the LOF, INFLO, LDBO [22], and CLOF algorithm. From Table 1 and Fig. 7, we can see that the accuracy of the proposed algorithm is higher than that of the other three algorithms in the three actual data sets Iris, Abalone and Shuttle data sets, and in the two simulation data sets data2 and data3. The accuracy of the four algorithms in the simulated data set data1 are the same, and they can detect all the outliers. This result occurs because all outliers in the data1 data set are far away from normal data, and the normal data distribution is concentrated, so the detection efficiency of several algorithms is high. According to the previous analysis, we can know that the closer the RP value is to 1, the better the accuracy of the algorithm is. From Table 1 and Fig. 8, we can see that the RP values of the proposed algorithm and CLOF algorithm are higher on different data sets, but according to Fig. 6, we can see that the running time of CLOF algorithm is longer than that of the proposed algorithm in this paper. So after comprehensive consideration, the proposed algorithm in this paper has more advantages in efficiency. It can also be seen that the value of
FLOD_KKS method can solve the defect that LOF algorithm can not adapt to outliers in data distribution anomalies. It can solve the defect that the INFLO algorithm requires calculating reverse
In this paper a fast local outlier detection algorithm FLOD_KKS using
The parameter This paper effectively reduces the number of objects that need to compute reverse For different data set the parameters
