An efficient local outlier detection optimized by rough clustering

Abstract

Outlier detection is an active topic in data mining with many real-world applications. LOF (Local Outlier Factor) can capture the abnormal degree of objects in datasets with different density levels, and many extended algorithms have been proposed in recent years. However, LOF needs to search the nearest neighborhood of each object over the whole dataset, which greatly increases the time cost. Moreover, most of the extended algorithms consider only the distance between an object and its neighborhood and ignore the local distribution of the object within its neighborhood, resulting in a high false-positive rate. To improve the running speed, a rough clustering based on triple fusion is proposed, which divides a dataset into several subsets so that outlier detection is performed only within each subset. Then, considering the local distribution of an object within its neighborhood, a new local outlier factor is constructed to estimate the abnormal degree of each object. Experimental results indicate that the proposed algorithm achieves better detection performance and lower running time than the compared algorithms.

Keywords

Outlier detection, local outlier factor, rough clustering

1 Introduction

Outliers form a small subset of objects that deviate from normal objects in spatial distribution. Outlier detection is an important branch of data mining, aiming at automatically mining outliers from a dataset. It has been successfully applied in many practical fields, including network intrusion detection [1–4], medical monitoring [5–7], and industrial testing [8–11].

After long-term research and development, many effective methods of outlier detection have been proposed, which can be broadly divided into the following categories: statistical-based, clustering-based, distance-based, and density-based methods. As a representative density-based approach, the LOF [12] algorithm has received much attention, and a series of extended methods have been designed. Their main idea can be summarized as follows: the local reachable density is constructed by calculating the distance between an object and its neighborhood, and the ratio of an object’s reachable density to the mean of those of its neighborhood is used to quantify the abnormal score. Finally, according to the top-n principle, the objects with the largest abnormal scores are regarded as outliers.

1.1 Motivation

Although the LOF has been supported by many scholars, it still has some shortcomings:

It is necessary to calculate the distance among all objects, which has high time complexity.

LOF only considers the neighborhood distance and ignores the local distribution of an object within its neighborhood.

LOF relies on the distances among all objects when searching the nearest neighborhood. This sub-problem, with time complexity O (n²), may cause LOF to fail on large-scale datasets. Scholars have done a lot of effective work to improve efficiency through data structures and pruning strategies. At present, to speed up the nearest neighbor search, tree-shaped data structures are adopted, including KD Tree, Ball Tree, BK Tree, VP Tree, and M Tree. At the same time, some scholars employ pruning strategies to reduce the amount of data participating in outlier detection. Their main approach is to design pre-processing algorithms that extract a small candidate outlier subset, and outlier detection is then performed only on this candidate subset. However, this approach is sensitive to the scale of the candidate subset: a large candidate subset still incurs high time complexity, whereas if too many objects are deleted, real outliers may be removed during pre-processing and the accuracy will decrease.

1.2 Objective

As we know, if n_i ≥ 0 (i = 1, 2, . . . , c) and $\sum_{i = 1}^{c} n_{i} = n$ , then $\sum_{i = 1}^{c} n_{i}^{2} \leq n^{2}$ . Inspired by this inequality, a rough clustering based on triple fusion (RCBTF) algorithm is proposed to divide the dataset into several subsets. Outlier detection is then performed only within each subset to decrease the time complexity. Besides, RCBTF can also be directly applied to the extended LOF algorithms, and the time cost of outlier detection preprocessed by RCBTF is lower than that of the standard extended algorithms. To address the second shortcoming of LOF mentioned above, the module of the neighborhood vector’s sum is introduced to describe the local distribution of an object within its neighborhood, and a new local outlier factor (NLOF) is defined to estimate the abnormal degree of objects. In short, the main contributions of this study can be summarized as follows:

The RCBTF algorithm is proposed to divide the dataset into several subsets.

The NLOF is redefined for estimating the abnormal degree of each object.

The rest of this paper is organized as follows. Section 2 introduces the variants of LOF. Section 3 describes RCBTF and NLOF in detail. Section 4 presents the experimental results. Section 5 summarizes the study.

2 Related work

LOF is a pioneering work of local outlier detection. With the expansion of related work, many variants have been successfully applied in engineering practice. Their main works focus on efficiency and accuracy.

Many scholars have studied how to improve running efficiency. Two key technologies have attracted particular interest. One is to use tree-shaped data structures to conduct efficient nearest neighborhood search, and the other is to reduce unnecessary search scope in advance by utilizing a pruning strategy. KD Tree was often employed to optimize the nearest neighborhood search for outlier detection on large-scale data [13]. Likewise, efficient search was also performed by using other tree-shaped data structures [14–17]. To further increase the running speed, Su et al. [18] proposed an approach of eliminating security objects, which eliminated normal objects from the dataset and extracted a few candidate outliers to participate in outlier detection. Although the running speed could be raised, the approach may confuse candidate outliers with normal objects, so the accuracy might decrease after the pre-processing. In [19], a dataset was divided into core objects (normal objects), boundary objects (normal objects or outliers) and outliers through DBSCAN [20] clustering, and then the core objects were filtered out before outlier detection was performed. Similarly, for this strategy of removing normal objects from the dataset, it is hard to define the boundary between normal objects and outliers. In addition, from the perspective of the calculation rule of the local outlier factor, SimplifiedLOF [21] simplified the density estimation of LOF and improved the efficiency of the algorithm to some extent. Other studies [22–26] have also contributed to improving efficiency and are of reference value.

To improve accuracy, scholars have done extended works for the calculation rule of local outlier factor, such as connectivity-based outlier factor (COF) [27], local distance-based outlier factor (LDOF) [28], INFLuenced Outlierness (INFLO) [29]. Zhang et al. proposed LDOF, which used the relative distance between an object and its neighborhood to measure the degree of deviation. It was less sensitive to hyper-parameter. INFLO modified the selection rule of neighborhood by replacing KNN with the union of KNN and inverse KNN. However, many variants only depend on the neighborhood distance to measure the abnormal degree, without considering the local distribution of an object within its neighborhood. Recently, Su et al. [18] redefined a local deviation coefficient called LDC, which first integrated the variance and expectation of the neighborhood distance into the LOF. It took the local distribution of an object within its neighborhood into account. In [30], the relative k-distance nearest neighborhood was used to replace the k-distance nearest neighborhood, and the relative k-nearest neighbor entropy was used to redefine the local outlier factor.

In summary, all of the above variants of LOF judge whether the object is an outlier by employing the neighborhood distance to calculate the local outlier factor of each object. Only a few papers, such as [18] and [30], have considered the variation of neighborhood distance, but the rules of them can’t fully reflect the local distribution of an object within its neighborhood. So, a new local outlier factor, which fully takes advantage of the local distribution, is defined in this study. The details will be given in Section 3.2.

3 Enhanced outlier detection

As can be seen from Section 1, if $\sum_{i = 1}^{c} n_{i} = n$ and n_i ≥ 0 (i = 1, 2, . . . , c), then $\sum_{i = 1}^{c} n_{i}^{2} \leq n^{2}$ . So, if a dataset containing n objects is divided into c subsets and outliers are mined in these subsets, then a problem with time complexity O (n²) can be solved optimally by breaking it into c sub-problems with time complexity $O (n_{i}^{2})$ . Adopting this strategy is beneficial to speed up the search of neighborhood. The framework is shown in Fig. 1.
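The inequality follows from expanding the square of the sum. The short derivation below is added for completeness and uses only the symbols already defined; the cross terms are non-negative because every n_i ≥ 0.

```latex
n^{2} = \Bigl(\sum_{i=1}^{c} n_{i}\Bigr)^{2}
      = \sum_{i=1}^{c} n_{i}^{2} + 2 \sum_{1 \le i < j \le c} n_{i} n_{j}
      \;\ge\; \sum_{i=1}^{c} n_{i}^{2}
```

For instance, splitting n = 3000 objects into c = 3 clusters of 1000 objects each replaces 9 × 10⁶ pairwise comparisons with 3 × 10⁶.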

Fig. 1

Framework of proposed algorithm.

3.1 Rough clustering

In unsupervised learning, the purpose of the clustering algorithm is to partition a dataset into multiple clusters, that is, to divide the dataset into multiple subsets. Although many industry-recognized clustering algorithms have been proposed at present, such as KMeans, KMeans++ and DPC [31], all of them face the challenge of low efficiency. So, an efficient rough clustering based on triple fusion (RCBTF) is presented, which doesn’t need to iterate dataset repeatedly. The related definitions of the RCBTF are as follows.

Definition 1. (Core cluster, core object and non-core object): Let ɛ (x, r) be the subset of dataset X inside a hypersphere with radius r and center x. Let ρ (x) stand for the local density [31], calculated by Eqs. (1) and (2). Given a user-specified parameter delta, if ρ (x) > delta, then ɛ (x, r) and x are called a core cluster and a core object, respectively. If object x doesn’t belong to any core cluster, then it is called a non-core object.

$ρ (x) = \sum_{o \in X} χ (d (x, o) - r)$ (1)

$χ (z) = \begin{cases} 1, & z \leq 0 \\ 0, & \text{otherwise} \end{cases}$ (2)

where d (x, o) denotes the Euclidean distance between x and o, and r is the cutoff distance.
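Eqs. (1) and (2) amount to counting the objects that lie within distance r of x. A minimal Python sketch follows; the function name `local_density` and the toy points are illustrative assumptions, not part of the original method description. Because χ(0) = 1, an object that belongs to X counts itself.

```python
import math

def local_density(x, points, r):
    """Eqs. (1)-(2): number of objects o in X with d(x, o) <= r."""
    return sum(1 for o in points if math.dist(x, o) <= r)

# Hypothetical toy data: three nearby points and one isolated point.
points = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.08), (1.0, 1.0)]
print([local_density(p, points, r=0.1) for p in points])
```

This brute-force count costs O(n) per object; it is meant to make the definition concrete rather than to be efficient.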

According to Definition 1, we can obtain the core objects and core clusters in the dataset. In order to reduce the number of computing units and improve the speed of operation, some core objects and core clusters will be merged by a rough rule. This process is called the first mergence, and its rule is as follows. Let C ={ C₁, C₂, . . . , C_m } be a set of core clusters obtained by Definition 1, and CP ={ cp₁, cp₂, . . . , cp_m } be a set of core objects. If there exist cp_i, cp_j ∈ CP, such that d (cp_i, cp_j) ≤ 1.5r, then C_i and C_j will be merged into one. The sets of core clusters and core points after the first mergence are denoted as CF ={ CF₁, CF₂, . . . , CF_q } and CFP ={ fp₁, fp₂, . . . , fp_q }, respectively. Obviously, we have q ≤ m, and CF_i and fp_i may contain several core clusters and core objects, respectively.
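The first mergence can be sketched with a union-find structure over the core objects. This sketch assumes that merging is transitive (if cp_i is within 1.5r of cp_j, and cp_j is within 1.5r of cp_k, all three end up together), which is one reading of the rule above; the function name `first_mergence` and the sample coordinates are hypothetical.

```python
import math

def first_mergence(core_objects, r, factor=1.5):
    """Group indices of core objects linked by distances <= factor * r."""
    parent = list(range(len(core_objects)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(core_objects)):
        for j in range(i + 1, len(core_objects)):
            if math.dist(core_objects[i], core_objects[j]) <= factor * r:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(core_objects)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical core objects: a chain of three close points and one far away.
cores = [(0.0, 0.0), (0.12, 0.0), (0.24, 0.0), (1.0, 1.0)]
print(first_mergence(cores, r=0.1))
```

Each returned group lists the indices of the core objects that form one fp_i; the corresponding core clusters C_i would be unioned in the same way.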

Definition 2. (Inter-cluster distance between CF_i and CF_j): Let CF_i, CF_j ∈ CF, then the inter-cluster distance between CF_i and CF_j is defined as:

$d (CF_{i}, CF_{j}) = \min_{p_{i} \in fp_{i}, p_{j} \in fp_{j}} ∥ p_{i} - p_{j} ∥ .$ (3)

The clusters after the first mergence have not reached the final structure, so the core clusters need to be merged further. The rule is as follows: first, the inter-cluster distance of any two core clusters in CF is calculated using Definition 2. Then, the two core clusters with the smallest inter-cluster distance are merged into one cluster, and the merged core cluster still participates in the next mergence; this is repeated until c clusters remain. The process mentioned above is called the second mergence. The sets of core clusters and core points after the second mergence are denoted as CS ={ CS₁, CS₂, . . . , CS_c } and CSP ={ sp₁, sp₂, . . . , sp_c }, respectively.
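The second mergence behaves like single-linkage agglomeration restricted to core objects. The sketch below is a straightforward, unoptimized rendering of that reading; `inter_cluster_distance`, `second_mergence` and the sample data are hypothetical names chosen for illustration, and each cluster is represented only by its set of core objects.

```python
import math

def inter_cluster_distance(fp_i, fp_j):
    """Eq. (3): smallest distance between core objects of the two clusters."""
    return min(math.dist(p, q) for p in fp_i for q in fp_j)

def second_mergence(core_point_sets, c):
    """Repeatedly merge the two closest clusters until only c remain."""
    clusters = [list(s) for s in core_point_sets]
    while len(clusters) > c:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = inter_cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # the merged cluster stays in play
        del clusters[j]
    return clusters

# Hypothetical core-object sets after the first mergence.
fp_sets = [[(0.0, 0.0)], [(0.3, 0.0)], [(2.0, 2.0)], [(2.2, 2.0)]]
print(second_mergence(fp_sets, c=2))
```

Since only core objects take part, the cost depends on the number of core objects rather than on n.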

All core clusters have been allocated after the first and second mergence, but there are some non-core objects that have not been allocated. To complete the allocation of these objects, Definition 3, Definition 4 and Definition 5 are presented.

Definition 3. (Relative distance similarity): Object x_i is called a non-core object if it doesn’t belong to any core cluster. Take $o_{ij} = \underset{p_{j} \in sp_{j}}{\arg\min} ∥ x_{i} - p_{j} ∥$ . Assume that $dis\_sim (i, j) = e^{-∥ x_{i} - o_{ij} ∥}$ stands for the distance similarity between object x_i and CS_j, then the relative distance similarity is defined as:

$rdis\_sim (i, j) = e^{dis\_sim (i, j) - \max_{j' \neq j} dis\_sim (i, j')} .$ (4)

Definition 4. (Relative density similarity): Object x_i is called a non-core object if it doesn’t belong to any core cluster. Take $o_{ij} = \underset{p_{j} \in sp_{j}}{\arg\min} ∥ x_{i} - p_{j} ∥$ . Assume that $den\_sim (i, j) = e^{-| ρ (x_{i}) - ρ (o_{ij}) |}$ represents the density similarity between object x_i and CS_j, then the relative density similarity is defined as:

$rden\_sim (i, j) = e^{den\_sim (i, j) - \max_{j' \neq j} den\_sim (i, j')} .$ (5)

The distance similarity and density similarity can reflect the membership degree of an object belonging to a cluster, but they ignore the influence of other clusters. In this paper, the distance similarity and density similarity are transformed by the concept of relativity. It can be found that the relative distance similarity and relative density similarity take into account the competitiveness of each cluster for an object, so that they can better reflect the membership degree of an object belonging to a cluster.

Definition 5. (Similarity): By combining the two measures of density and distance, similarity between object x_i and CS_j is defined as:

$sim (i, j) = (1 - λ) \cdot \frac{rdis\_sim (i, j)}{\sum_{j' = 1}^{c} rdis\_sim (i, j')} + λ \cdot \frac{rden\_sim (i, j)}{\sum_{j' = 1}^{c} rden\_sim (i, j')},$ (6)

where the weighting factor λ is between 0 and 1. In this study, the default value λ = 0.5 is used.

According to Definition 3, Definition 4 and Definition 5, these non-core objects will be allocated to the cluster with the greatest similarity to themselves. The above process is called the third mergence.
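Definitions 3 to 5 can be combined into one small routine that scores a non-core object against every cluster. The sketch below assumes c ≥ 2 (otherwise the maximum over the other clusters is undefined) and that densities of core objects are available in a dictionary; the names `similarity`, `rho` and the toy values are hypothetical.

```python
import math

def similarity(x, rho_x, core_point_sets, rho, lam=0.5):
    """Return [sim(i, j) for every cluster j], following Eqs. (4)-(6).

    core_point_sets[j] holds the core objects sp_j of cluster CS_j and
    rho maps each core object (a tuple) to its local density.
    """
    dis_sim, den_sim = [], []
    for sp in core_point_sets:
        o = min(sp, key=lambda p: math.dist(x, p))  # nearest core object o_ij
        dis_sim.append(math.exp(-math.dist(x, o)))
        den_sim.append(math.exp(-abs(rho_x - rho[o])))

    def relative(values):
        # Compare each cluster with its strongest competitor (needs c >= 2).
        return [math.exp(v - max(values[:j] + values[j + 1:]))
                for j, v in enumerate(values)]

    rdis, rden = relative(dis_sim), relative(den_sim)
    return [(1 - lam) * rdis[j] / sum(rdis) + lam * rden[j] / sum(rden)
            for j in range(len(core_point_sets))]

# Hypothetical example: x is closer to, and denser like, the first cluster.
sp_sets = [[(0.0, 0.0)], [(1.0, 0.0)]]
rho = {(0.0, 0.0): 12, (1.0, 0.0): 4}
sims = similarity((0.2, 0.0), 11, sp_sets, rho)
print(sims, sims.index(max(sims)))
```

The third mergence then assigns the object to the cluster with the largest entry. Because both normalized terms sum to one over the clusters, the similarities of one object also sum to one.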

RCBTF aims to improve the efficiency of outlier detection by partitioning the dataset into several subsets, so its own time cost also needs to be as low as possible. As shown in Algorithm 1, RCBTF needs to traverse the dataset only once when extracting all the core clusters and core objects, and the subsequent mergences are based only on the core objects. Therefore, few objects participate in the clustering process, which keeps the time cost of RCBTF low.

Algorithm 1 RCBTF
input: dataset X, the number of clusters c, r, delta
output: clusters TC
1: Let n be the size of dataset X;
2: Initialize a list visited [1 . . . n] = false;
3: Initialize three empty sets c_objects, c_clusters and nc_objects;
4: for i = 1 : n do
5:   if not visited [i] then
6:     Calculate ρ (x_i) using Eqs. (1) and (2);
7:     if ρ (x_i) ≥ delta then
8:       Add x_i to c_objects;
9:       Add ɛ (x_i, r) to c_clusters;
10:      Mark all elements of ɛ (x_i, r) as true in visited;
11:    end if
12:  end if
13: end for
14: Add the objects whose visited flag is false to nc_objects;
15: Merge c_clusters and nc_objects according to the rules of the first, second and third mergence;
16: Let TC stand for the result of the third mergence;
17: return TC;
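Steps 1 to 14 of Algorithm 1 (the single pass that extracts core objects, core clusters and non-core objects) can be sketched in Python as follows. This is an illustrative brute-force version that follows the ≥ comparison of step 7; the function name `extract_cores` and the toy data are assumptions, and the three mergences are left out.

```python
import math

def extract_cores(X, r, delta):
    """One pass over X following steps 1-14 of Algorithm 1."""
    n = len(X)
    visited = [False] * n
    core_objects, core_clusters = [], []
    for i in range(n):
        if visited[i]:
            continue
        # Indices of the objects inside the hypersphere around x_i.
        ball = [j for j in range(n) if math.dist(X[i], X[j]) <= r]
        if len(ball) >= delta:  # rho(x_i) >= delta
            core_objects.append(i)
            core_clusters.append(ball)
            for j in ball:
                visited[j] = True
    non_core = [i for i in range(n) if not visited[i]]
    return core_objects, core_clusters, non_core

# Hypothetical data: a dense group of four points and a sparse pair.
X = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05), (0.05, 0.05), (1.0, 1.0), (1.05, 1.0)]
print(extract_cores(X, r=0.1, delta=3))
```

Objects already covered by a core cluster are skipped, which is what keeps the number of density evaluations well below n on dense data.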

3.2 Method of outlier detection

According to the analysis in Section 2, the LOF can describe the abnormal degree of objects, but most of the existing variants are still inadequate. For example, Fig. 2 shows two different local distributions of the object x₀, where Fig. 2(a) and Fig. 2(b) have the same neighborhood distance and the same variation of neighborhood distance for object x₀. The two datasets in Fig. 2 are called DS1 and DS2, respectively. Apparently, the abnormal degree of x₀ is different in the two plots. Therefore, considering only the neighborhood distance and its variation is not sufficient to describe the local distribution within the nearest neighbors. To address this deficiency of the existing rules for calculating the local outlier factor, we take the local distribution fully into account and introduce the module of the k-neighborhood vector’s sum, so as to define a new enhanced local outlier factor. The specific rules are defined as follows.

Fig. 2

x₀ and its neighborhood.

Definition 6. (k-Neighborhood of object x) Let N_k (x) be a subset of X that contains k objects. N_k (x) is called k-neighborhood of the object x if it satisfies:

$\begin{matrix} ∥ x - o ∥ \leq ∥ x - p ∥, \forall o \in N_{k} (x), \\ \forall p \notin N_{k} (x), x \notin N_{k} (x) . \end{matrix}$ (7)

Definition 7. (k-Neighborhood average distance of object x): The k-neighborhood average distance of the object x is defined as:

$N_{avg\_dist} (x) = \frac{1}{k} \sum_{o \in N_{k} (x)} d (x, o) .$ (8)

Definition 8. (Module of k-Neighborhood vector’s sum of object x): Let N_k (x) ={ o₁, o₂, . . . , o_k }; then the set of k-neighborhood vectors is represented as $\{\vec{x o_{1}}, \vec{x o_{2}}, . . ., \vec{x o_{k}}\}$ . The module of the k-neighborhood vector’s sum of the object x is defined as:

$N_{vector\_m} (x) = | \sum_{o \in N_{k} (x)} \vec{xo} | .$ (9)

In general, a normal object has a higher density than an outlier in the same cluster, so the metric N_{vector_m} (x) has the following properties:

If x has high density, its k-neighborhood is close to x and distributed uniformly around it. Thus, we have $\sum_{o \in N_{k} (x)} \vec{xo} \approx 0$ and $| \vec{x o_{1}} | \approx | \vec{x o_{2}} | \approx . . . \approx | \vec{x o_{k}} |$ .

Conversely, if x is an outlier, the values in the set $\{| \vec{x o_{l}} |\}_{l = 1}^{k}$ have large variability, and $\sum_{o \in N_{k} (x)} \vec{xo}$ is a non-zero vector with a large module.

So, an outlier usually has a larger value of N_{vector_m} (x) in Eq. (9) than a normal object.
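A small numerical illustration of Eq. (9): two neighborhoods whose points are all at distance 1 from x, so the neighborhood distances and their variation are identical, yet the module of the vector sum differs sharply. The helper names and the two configurations are hypothetical and chosen only to mirror the situation described for Fig. 2.

```python
import math

def vector_sum_module(x, neighbors):
    """Eq. (9): Euclidean norm of the sum of the vectors from x to its neighbors."""
    total = [sum(o[d] - x[d] for o in neighbors) for d in range(len(x))]
    return math.hypot(*total)

def on_unit_circle(angles):
    return [(math.cos(a), math.sin(a)) for a in angles]

x0 = (0.0, 0.0)
# Eight neighbors spread evenly around x0.
surrounding = on_unit_circle([2 * math.pi * t / 8 for t in range(8)])
# Eight neighbors squeezed into a 60-degree arc on one side of x0.
one_sided = on_unit_circle([-math.pi / 6 + t * (math.pi / 3) / 7 for t in range(8)])

print(vector_sum_module(x0, surrounding))  # close to 0
print(vector_sum_module(x0, one_sided))    # between 6.9 and 8
```

A distance-only score cannot separate these two cases, whereas the module of the vector sum does.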

Definition 9. (Density of object x): The density of the object x is defined as:

$N_{density} (x) = \frac{\exp (N_{avg\_dist} (x))}{(N_{vector\_m} (x) + 1) \cdot N_{avg\_dist} (x)} .$ (10)

To further illustrate the properties of Eq. (10), we make the following analysis. If N_{avg_dist} (x) is treated as the independent variable, N_{vector_m} (x) as a constant, and N_density (x) as the dependent variable, then this function is strictly monotonically decreasing. This conforms to the idea that the density decreases as the neighborhood average distance increases. Likewise, if N_density (x) is treated as a function of N_{vector_m} (x) with N_{avg_dist} (x) held constant, then this function is strictly monotonically decreasing. This conforms to the idea that the density decreases as the module of the neighborhood vector’s sum increases.
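Both monotonicity statements can be checked by differentiating Eq. (10) exactly as it is printed. The shorthand a = N_{avg_dist} (x) > 0 and m = N_{vector_m} (x) ≥ 0 is introduced here only for readability.

```latex
\frac{\partial N_{density}}{\partial m} = -\frac{e^{a}}{(m + 1)^{2}\, a} < 0,
\qquad
\frac{\partial N_{density}}{\partial a} = \frac{e^{a}\,(a - 1)}{(m + 1)\, a^{2}}
```

The derivative with respect to m is negative everywhere. The derivative with respect to a is negative for 0 < a < 1, so for the formula as printed the decreasing behaviour in the average distance relies on neighborhood average distances below 1. That this holds for the data at hand (for example, features scaled to a unit range) is an assumption made here, not something the text states.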

Definition 10. (New local outlier factor of object x): The new local outlier factor (NLOF) of object x is defined as:

$\begin{matrix} NLOF (x) = \frac{\sum_{p \in N_{k} (x)} N_{density} (p)}{k \cdot N_{density} (x)} \end{matrix}$ (11)

According to Definition 10, NLOF (x) is the ratio of the average density of the k-neighborhood of x to the density of x. It has the following property: the greater the NLOF (x) of an object, the more likely x is to be an outlier; conversely, the smaller the value, the more likely x is to be a normal object. The details of outlier detection are shown in Algorithm 2.

Algorithm 2 Outlier detection
input: parameters k and α // α stands for the number of outliers
output: outliers
1: j = 1;
2: Initialize a list score;
3: Obtain clusters TC according to Algorithm 1;
4: for i = 1 : c do
5:   for x in TC [i] do
6:     Search k-neighbors N_k (x) in TC [i];
7:     Calculate N_{avg_dist} (x) using Eq. (8);
8:     Calculate N_{vector_m} (x) using Eq. (9);
9:     Calculate N_density (x) using Eq. (10);
10:    Calculate NLOF (x) using Eq. (11);
11:    score [j ++] = NLOF (x);
12:  end for
13: end for
14: return the first α objects with the highest score values;
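Algorithm 2 can be sketched in plain Python as below. This is a minimal brute-force illustration under stated assumptions: clusters are given as lists of coordinate tuples, every cluster has more than k objects, and no two objects coincide (so the average distance is never zero). The function names and the toy grids are hypothetical.

```python
import math

def nlof_scores(cluster, k):
    """NLOF (Eqs. 7-11) for every object of one cluster, brute-force k-NN."""
    n = len(cluster)
    knn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(cluster[i], cluster[j]))
        knn.append(order[:k])
    density = []
    for i in range(n):
        x = cluster[i]
        avg_dist = sum(math.dist(x, cluster[j]) for j in knn[i]) / k      # Eq. (8)
        vec = [sum(cluster[j][d] - x[d] for j in knn[i]) for d in range(len(x))]
        module = math.hypot(*vec)                                          # Eq. (9)
        density.append(math.exp(avg_dist) / ((module + 1) * avg_dist))    # Eq. (10)
    return [sum(density[j] for j in knn[i]) / (k * density[i])            # Eq. (11)
            for i in range(n)]

def detect_outliers(clusters, k, alpha):
    """Score each cluster separately, then return the alpha top-scoring objects."""
    scored = []
    for cluster in clusters:
        scored.extend(zip(nlof_scores(cluster, k), cluster))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [obj for _, obj in scored[:alpha]]

# Hypothetical data: two regular grids, one of them with a distant extra point.
grid = [(0.1 * a, 0.1 * b) for a in range(5) for b in range(5)]
shifted = [(x + 5.0, y + 5.0) for x, y in grid]
print(detect_outliers([grid + [(1.0, 1.0)], shifted], k=4, alpha=1))
```

Neighbors are searched only inside the object's own cluster, which is where the reduction from n² to the sum of n_i² comparisons comes from.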

Theorem 1. After outlier detection is optimized by RCBTF, the maximum reduction in time complexity is n² (c - 1)/c.

Proof. Let n be the number of objects in the dataset, and let n_i (i = 1, 2, . . . , c) denote the number of objects in each cluster obtained by RCBTF. It is not difficult to find that the time complexities of LOF and NLOF are O (n²) and $O (\sum_{i = 1}^{c} n_{i}^{2})$ , respectively. Let the function $f = n^{2} - \sum_{i = 1}^{c} n_{i}^{2}$ denote the reduction in time complexity. The objective function is defined as follows:

$f = n^{2} - \sum_{i = 1}^{c} n_{i}^{2} \quad \text{s.t.} \quad \sum_{i = 1}^{c} n_{i} = n$

Using the Lagrangian multiplier method, the objective function is transformed to

$f = n^{2} - \sum_{i = 1}^{c} n_{i}^{2} + λ (n - \sum_{i = 1}^{c} n_{i}) .$

Taking the derivatives of f with respect to n_i and setting them to zero gives $-2 n_{i} - λ = 0$ for every i, so we have

$n_{1} = n_{2} = . . . = n_{c} = n / c .$

Thus

$\max f = n^{2} - \sum_{i = 1}^{c} {(n / c)}^{2} = \frac{c - 1}{c} n^{2} .$ □

It can be seen from Theorem 1 that the RCBTF algorithm can reduce the time complexity of outlier detection. The maximum reduction, n² (c - 1)/c, is reached if and only if all subsets contain the same number of samples. Although the RCBTF algorithm brings additional time cost, the total time cost is still smaller than that of the original algorithm. This is examined further in the experiments.

4 Experiments

To verify the effectiveness and efficiency of the proposed algorithms, a series of experiments are performed on synthetic datasets and a real-world dataset. The proposed algorithm is compared with LOF, SimplifiedLOF and LDC in terms of precision and running time. In the experiments, Precision (Pr) and Coefficient of Variation (CV) are adopted to evaluate the detection performance. They are defined as follows:

$Pr = \frac{TP}{TP + FP},$ (12)

$CV = \frac{σ}{μ},$ (13)

where TP is the number of outliers that are correctly detected, FP is the number of normal objects that are wrongly reported as outliers, and σ and μ stand for the standard deviation and mean of the Pr values obtained with different k, respectively.

4.1 Experimental datasets

In this study, to illustrate the precision and time cost of the proposed method, we compare it with three other algorithms (LOF, SimplifiedLOF, LDC) on five datasets. Four of them are the synthetic datasets shown in Figs. 2 and 3, and the fifth is Shuttle, obtained from UCI [32].

Fig. 3

Synthetic datasets.

For convenience of description, the two datasets in Fig. 3 are called DS3 and DS4, respectively. In Fig. 2, x₀ is the center of a circle, while the other objects are located on the circle. In Fig. 3, DS3 and DS4 contain 3 and 2 clusters, respectively, which have different shapes, sizes, and densities. Shuttle is usually used for classification. In this paper, the fourth and fifth categories of Shuttle are regarded as normal objects, while the second, third, sixth and seventh categories are considered outliers. Detailed descriptions of the datasets are shown in Table 1.

Table 1

Descriptions of datasets

Name	Instances	Clusters	Dimension	Outliers
DS3	3021	3	2	30
DS4	3560	2	2	36
Shuttle	3022	2	9	58

4.2 Experiments of detection precision

In DS1 and DS2, it is not difficult to find that the object x₀ has the lowest deviation in Fig. 2(a), whereas it has the highest deviation in Fig. 2(b). The experimental results on DS1 and DS2 are shown in Fig. 4. The abnormal scores obtained by LOF are almost the same on DS1, which indicates that the task of outlier detection fails. On DS2, LOF can accurately measure the abnormal degree of x₀, but the abnormal scores of the other objects are not accurate enough. In particular, it is completely ineffective when k = 6. SimplifiedLOF and LDC are effective on DS1 for different values of k, but their abnormal scores are not accurate enough on DS2, and there are still failures (Fig. 4(e), Fig. 4(f)). The abnormal scores obtained by NLOF accurately describe the abnormal degree of each object, which makes them effective for mining outliers with the top-n scheme.

Fig. 4

Results of different k values on DS1 and DS2.

To further demonstrate the superiority of our method, precision (Pr) and coefficient of variation (CV) are adopted to evaluate the four algorithms on DS3, DS4 and Shuttle. In this experiment, we take r = 0.1, delta = 10. The parameters α (the number of outliers) and c (the number of clusters) are set according to Table 1. The experimental results with different values of k are shown in Table 2. Meanwhile, to reflect the detection results intuitively, the experimental results of four algorithms on DS3 and DS4 are visualized in Fig. 5.

Table 2

Precision of four algorithms on DS3, DS4 and Shuttle

k	LOF			SimplifiedLOF			LDC			RCBTF+NLOF
	DS3	DS4	Shuttle	DS3	DS4	Shuttle	DS3	DS4	Shuttle	DS3	DS4	Shuttle
50	0.97	1.00	0.81	0.93	1.00	0.83	0.82	0.58	0.77	0.93	1.00	0.83
100	0.83	0.97	0.71	0.80	0.97	0.64	0.55	0.62	0.67	0.80	1.00	0.72
150	0.37	0.97	0.64	0.37	0.94	0.62	0.37	0.71	0.66	0.63	0.97	0.71
200	0.13	0.97	0.47	0.17	0.94	0.50	0.17	0.83	0.49	0.60	0.97	0.71
250	0.00	0.94	0.40	0.03	0.86	0.43	0.06	0.94	0.50	0.60	0.97	0.57
300	0.00	0.89	0.34	0.00	0.78	0.26	0.00	0.94	0.40	0.57	0.97	0.53
350	0.00	0.78	0.26	0.00	0.69	0.26	0.00	0.92	0.25	0.57	0.94	0.50
CV	1.17	0.07	0.37	1.10	0.12	0.38	1.03	0.18	0.31	0.19	0.02	0.17

Fig. 5

Results of four algorithms on DS3 and DS4 (k = 50). The outliers are marked in red.

In Table 2, all algorithms have high Pr values when k = 50 and k = 100. However, as k increases, the performance of LOF, SimplifiedLOF, and LDC shows a downward trend, and only the proposed algorithm maintains high Pr values. On DS3, the precision of our method is lower than that of LOF only when k = 50 and k = 100, and even then it is very close to that of LOF. As can be seen from the CV values in Table 2, LOF, SimplifiedLOF, and LDC are stable only on DS4. As k increases, their precision gradually decreases on DS3 and Shuttle, and their CV values are greater than those of our method. In contrast, the proposed method not only maintains high Pr values, but also has low CV values across different k values. The experimental results show that, firstly, LOF, SimplifiedLOF and LDC are sensitive to the value of parameter k. Secondly, compared with the other three algorithms, the proposed algorithm has not only high precision, but also strong robustness to parameter k. In outlier detection, it’s hard to accurately determine the value of k without prior knowledge. Therefore, our method is more suitable for outlier detection than the others.
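The two metrics of Eqs. (12) and (13) are easy to reproduce from Table 2. The sketch below assumes the population standard deviation for σ, an assumption that happens to reproduce the rounded CV entries of the two DS3 columns used here; the function names are hypothetical.

```python
from statistics import mean, pstdev

def precision(tp, fp):
    """Eq. (12): share of reported outliers that are true outliers."""
    return tp / (tp + fp)

def coefficient_of_variation(values):
    """Eq. (13), using the population standard deviation (an assumption)."""
    return pstdev(values) / mean(values)

# DS3 columns of Table 2 for k = 50, 100, ..., 350.
lof_ds3 = [0.97, 0.83, 0.37, 0.13, 0.00, 0.00, 0.00]
ours_ds3 = [0.93, 0.80, 0.63, 0.60, 0.60, 0.57, 0.57]

print(round(coefficient_of_variation(lof_ds3), 2))
print(round(coefficient_of_variation(ours_ds3), 2))
print(round(precision(29, 1), 2))  # e.g. 29 correct hits among 30 reported objects
```

A CV close to zero means the precision barely changes as k varies, which is the sense in which the text calls a method robust to k.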

4.3 Experiments of time efficiency

To fully illustrate the superiority of our method in time efficiency, the following experiments are designed to analyze the time efficiency of the four algorithms. Case 1: compare RCBTF+NLOF with NLOF; Case 2: compare RCBTF+NLOF with the other three algorithms; Case 3: combine RCBTF with LOF, SimplifiedLOF, and a simplified version of LDC (LDC without the approach of eliminating security objects, named SV_LDC), respectively, and compare the time efficiency before and after the combination; Case 4: combine NLOF with different clustering algorithms (KMeans, KMeans++, and RCBTF). Note: the purpose of Case 3 is to verify the effectiveness of combining RCBTF with extended LOF algorithms. LDC contains a pre-processing algorithm that eliminates part of the objects and destroys the cluster-shaped structure of the dataset, which conflicts with RCBTF. For these two reasons, we only combine SV_LDC with RCBTF in Case 3. The experimental results of the above four cases are shown in Figs. 6, 7, 8 and 9, respectively. The number of objects in each cluster obtained by RCBTF is shown in Table 3.

Fig. 6

Running time of NLOF only and RCBTF+NLOF on DS3, DS4 and Shuttle.

Fig. 7

Running time of three comparative algorithms and RCBTF+NLOF on different datasets.

Fig. 8

Running time of three comparative algorithms combined with RCBTF on DS3, DS4 and Shuttle.

Fig. 9

Running time of NLOF combined with different clustering algorithms on DS3, DS4 and Shuttle.

Table 3

Size of each cluster obtained by RCBTF algorithm

Name	Instances	1st cluster	2nd cluster	3rd cluster
DS3	3021	2014	213	794
DS4	3560	2024	1536	-
Shuttle	3022	2792	230	-

From Fig. 6, it can be seen that RCBTF improves the speed of NLOF on both the synthetic datasets and the real-world dataset. On DS3, DS4 and Shuttle, the time cost is reduced by 32.80%, 37.93% and 15.33%, respectively. The reduction differs across datasets because the proportion of each cluster in the whole dataset differs. It follows from the inequality in Section 1.2 that the more balanced the cluster sizes obtained by RCBTF, the larger the reduction in time cost. Table 3 shows the number of samples in each cluster obtained by RCBTF. The cluster sizes are the most balanced in DS4, followed by DS3, and Shuttle is the most unbalanced. Therefore, the decrease in time cost is the largest on DS4, followed by DS3, and the smallest on Shuttle.
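The effect of cluster balance can be quantified directly from Table 3 by evaluating the relative reduction (n² − Σ n_i²)/n² from Theorem 1. The sketch below counts only the quadratic neighbor-search term, so it is not expected to match the measured percentages, which also include the cost of RCBTF itself and of the scoring; the function name is hypothetical.

```python
def comparison_reduction(sizes):
    """Relative drop in pairwise comparisons, (n^2 - sum n_i^2) / n^2."""
    n = sum(sizes)
    return (n * n - sum(s * s for s in sizes)) / (n * n)

# Cluster sizes taken from Table 3.
table3 = {"DS3": [2014, 213, 794], "DS4": [2024, 1536], "Shuttle": [2792, 230]}
for name, sizes in table3.items():
    bound = (len(sizes) - 1) / len(sizes)  # Theorem 1 maximum, reached for equal sizes
    print(name, round(comparison_reduction(sizes), 3), round(bound, 3))
```

The resulting ordering (DS4 largest, then DS3, then Shuttle) agrees with the ordering of the measured reductions reported above.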

As shown in Fig. 7, we compare the time cost of the four algorithms for different k values. SimplifiedLOF simplifies the density estimation of the LOF algorithm. Therefore, compared with LOF, its time cost is decreased to some extent, but it still needs to compute the abnormal scores on the whole dataset for outlier detection. The RCBTF+NLOF algorithm only calculates abnormal scores within each cluster, which reduces the scale of calculation, so its time cost is lower than that of SimplifiedLOF. LDC has the least time cost among the four algorithms, because it removes the security objects during pre-processing and detects the outliers on a candidate outlier set containing only a few objects. However, two phenomena can be observed. First, when k = 200, k = 250, k = 300 and k = 350, the running time of LDC increases dramatically on DS4. Second, the running time of LDC doesn’t increase with the value of k on DS3 and Shuttle. That’s because the value of k influences the number of security objects eliminated by the pre-processing algorithm of LDC: the more security objects the pre-processing eliminates, the more the efficiency can be improved. In terms of both precision and efficiency, the performance of our approach is competitive and promising. Overall, our method has better performance in the task of outlier detection.

Fig. 8 shows the experimental results of Case 3 on the DS3, DS4 and Shuttle datasets. The proposed RCBTF can not only be applied to NLOF, but can also be combined with other extended LOF algorithms. To illustrate the efficiency of RCBTF, three different clustering algorithms are combined with NLOF, respectively. Fig. 9 shows the average time cost of NLOF combined with different clustering algorithms over different k values. The time cost of RCBTF is less than that of KMeans and KMeans++. Moreover, the total time cost is lower than the others when RCBTF is used as the pre-processing algorithm. Therefore, RCBTF, which can effectively reduce the time cost of outlier detection, is an effective pre-processing approach.

5 Conclusions

In this study, an efficient local outlier detection method optimized by rough clustering is proposed.

To improve the efficiency of outlier detection, an efficient rough clustering method based on triple fusion (RCBTF) is designed. The main idea of RCBTF is to design three merging steps that combine core clusters and non-core objects. During merging, both distance and density are used to complete the clustering. The first and second steps merge the core clusters, and the third step allocates the non-core objects to the core clusters.

In addition, a new local outlier factor (NLOF) is constructed from the neighborhood distance and the modulus of the sum of the neighborhood vectors, which takes full advantage of the local distribution of an object within its neighborhood. The experimental results show that, first, RCBTF not only effectively reduces the time cost of NLOF but can also be directly applied to other extended LOF algorithms. Second, RCBTF+NLOF has higher precision than the other algorithms on the synthetic datasets and the real-world dataset. In the future, the proposed approach will be further optimized and applied to various practical applications.
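To illustrate why the modulus of the sum of the neighborhood vectors reflects the local distribution of an object, the following minimal sketch computes that quantity for an object surrounded by its neighbors and for an object lying to one side of them. It is not the NLOF formula itself: the helper names, the toy 2-D data and the choice k = 4 are assumptions made only for this illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def k_nearest(p: Point, data: List[Point], k: int) -> List[Point]:
    """Return the k objects of data closest to p (excluding p itself)."""
    others = [q for q in data if q != p]
    return sorted(others, key=lambda q: math.dist(p, q))[:k]


def neighbor_vector_sum_norm(p: Point, data: List[Point], k: int) -> float:
    """Modulus of the sum of the vectors pointing from p to its k nearest neighbors."""
    neighbors = k_nearest(p, data, k)
    sx = sum(q[0] - p[0] for q in neighbors)
    sy = sum(q[1] - p[1] for q in neighbors)
    return math.hypot(sx, sy)


# Toy data (hypothetical): a 3 x 3 grid plus one object off to the side.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
outside = (5.0, 1.0)
data = grid + [outside]
center = (1.0, 1.0)

# The neighbors of the grid center surround it, so the vectors cancel out.
print(neighbor_vector_sum_norm(center, data, 4))
# All neighbors of the outside object lie on one side, so the vectors add up.
print(neighbor_vector_sum_norm(outside, data, 4))
```

In this toy example the modulus is zero for the object at the center of the grid and large for the object outside it, although both have the same number of neighbors; a distance-only score would not capture this directional difference.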

Acknowledgments

This work was supported by Chongqing University Innovation Research Group funding (No. CXQT20015), the Key Science and Technology Research Program of Chongqing Municipal Education Commission (No. KJZD-K201900505), and Research Project of Chongqing Normal University (No. YKC20032).

References

1. Umer M.F., Sher and Bi, A two-stage flow-based intrusion detection model for next-generation networks, PLoS One 13(1) (2018).

2. , Wang and Xie, An improved content-based outlier detection method for ICS intrusion detection, EURASIP Journal on Wireless Communications and Networking 2020(103) (2020).

3. Kumar and Kumar, Anomaly-based network intrusion detection: an outlier detection techniques, Proceedings of International Conference on Soft Computing and Pattern Recognition, Springer, Cham (2019), 262–269.

4. Beulah J.R. and Punithavathani D.S., Applying outlier detection techniques in anomaly-based network intrusion systems–a theoretical analysis, IJCA Proceedings on International Seminar on Computer Vision (2013), 6–9.

5. Hauskrecht, Batal, Valko, Visweswaran, Cooper G.F. and Clermont, Outlier detection for patient monitoring and alerting, Journal of Biomedical Informatics 46(1) (2013), 47–55.

6. Hauskrecht, Batal, Hong, Nguyen, Cooper G.F., Visweswaran and Clermont, Outlier-based detection of unusual patient-management actions: An ICU study, Journal of Biomedical Informatics 64 (2016), 211–221.

7. Presbitero, Quax, Krzhizhanovskaya and Sloot, Anomaly detection in clinical data of patients undergoing heart surgery, Procedia Computer Science 108 (2017), 99–108.

8. Cai, Thornhill N.F., Kuenzel and Pal B.C., Real-time detection of power system disturbances based on k-nearest neighbor analysis, IEEE Access 5 (2017), 5631–5639.

9. Anagnostou, Boem, Kuenzel, Pal B.C. and Parisini, Observer-based anomaly detection of synchronous generators for power systems monitoring, IEEE Transactions on Power Systems 33(4) (2019), 4228–4237.

10. Zhang, Wan, Wang, Gao D.W. and Ma, Anomaly detection based on random matrix theory for industrial power systems, Journal of Systems Architecture 95 (2019), 67–74.

11. Zhang, A novel outlier detection method for improving industrial process monitoring, Proceedings of 2018 Chinese Control And Decision Conference (CCDC), IEEE, Shenyang (2018), 1155–1159.

12. Breunig M.M., Kriegel H.P., Ng R.T. and Sander, LOF: Identifying density-based local outliers, Proceedings of ACM SIGMOD International Conference on Management of Data, ACM, New York (2000), 93–104.

13. Kim, Cho N.W., Kang and Kang S.H., Fast outlier detection for very large log data, Expert Systems with Applications 38(8) (2011), 9587–9596.

14. Shen, Liu, Zhao and Lin, A Kd-Tree-based outlier detection method for airborne lidar point clouds, Proceedings of International Symposium on Image and Data Fusion, IEEE, Tengchong (2011), 1–4.

15. Zhang, Yin and Huang, An optimized LOF algorithm based on tree structure, Proceedings of 2020 3rd International Conference on Artificial Intelligence and Big Data (ICAIBD), IEEE, Chengdu (2020), 167–171.

16. , Luo and Liu, VDOD: Distributed outlier detection algorithm based on KD-tree, Computer & Digital Engineering 46(3) (2018), 419–423+428 (in Chinese).

17. Sun, Bao, Zhao, Yu and Wang, CD-Trees: An efficient index structure for outlier detection, Proceedings of International Conference on Web-Age Information Management, Springer, Heidelberg (2004), 600–609.

18. , Xiao, Ruan, Gu, Li, Wang and Xu, An efficient density-based local outlier detection approach for scattered data, IEEE Access 7 (2019), 1006–1020.

19. Wang, Wang X.L. and Wilkes D.M., New developments in unsupervised outlier detection, Springer (2020).

20. Ester, Kriegel H.P., Sander and Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’96), ACM, Portland (1996), 226–231.

21. Schubert, Zimek and Kriegel H.P., Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection, Data Mining and Knowledge Discovery 28 (2014), 190–237.

22. Wang Y.F., Jiong, Su G.P. and Qian Y.R., A new outlier detection method based on OPTICS, Sustainable Cities and Society 45 (2019), 197–212.

23. Meziati M.E. and Ziyati, Fast outlier detection method based on rough set, Proceedings of 2018 9th International Symposium on Signal, Image, Video and Communications (ISIVC), IEEE, Rabat (2018), 60–66.

24. Cai, Sun, Hao, Li and Yuan, An efficient outlier detection approach on weighted data stream based on minimal rare pattern mining, China Communications 16(10) (2019), 83–99.

25. , Ye, Sun, Liu and Xu, FAST-ODT: A lightweight outlier detection scheme for categorical data sets, IEEE Transactions on Network Science and Engineering 8(1) (2021), 13–24.

26. Goldstein, FastLOF: An expectation-maximization based local outlier detection algorithm, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), IEEE, Tsukuba (2012), 2282–2285.

27. Tang, Chen, Fu A.W. and Cheung D.W., Enhancing effectiveness of outlier detections for low density patterns, Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Taipei (2002), 535–548.

28. Zhang, Hutter and Jin, A new local distance-based outlier detection approach for scattered real-world data, Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Heidelberg (2009), 813–822.

29. Jin, Tung A.K.H., Han and Wang, Ranking outliers using symmetric neighborhood relationship, Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, Singapore (2006), 577–593.

30. Yang, Wang, Wei, Du and Li, An outlier detection approach based on improved Self-Organizing feature map clustering algorithm, IEEE Access 7 (2019), 115914–115925.

31. Rodriguez and Laio, Clustering by fast search and find of density peaks, Science 344(6191) (2014), 1492–1496.

32. Dua and Graff, UCI machine learning repository, University of California, Irvine, School of Information and Computer Science, http://archive.ics.uci.edu/ml/datasets.php.