A novel approach for anomaly detection in data streams: Fuzzy-statistical detection mode

Abstract

Anomalies are subsequences that exhibit departures from normal state of operation. In this paper, to solve the problems of unknown data distribution, control limit determination, multiple parameters, training data and fuzziness of ‘anomaly’, a self-adaptive and unsupervised model is developed for finding anomalies in data streams. A salient feature is a synergistic combination of both statistical and fuzzy set-based techniques. Anomaly detection problem is viewed as a certain statistical hypothesis testing which is realized in an unsupervised mode. At the same time, ‘anomaly’ is a much more complex concept and as such can be described with fuzzy set theory. Fuzzy sets bring a facet of robustness to the overall scheme and play an important role in the successive step of hypothesis testing. Because of the fuzzification, parameters determination is self-adaptive and no parameter needs to be specified by the user, what’s more, there is no need to consider the data distribution in statistical hypothesis testing in this paper. The approach is validated with a number of experiments, which help to quantify the performance of constructed algorithm.

Keywords

Anomaly detection statistical test self-adaptive fuzzy set theory unsupervised

1 Introduction

Anomaly detection in time series provides significant information for numerous applications. For example, it can be used to detect heart arrhythmia in electrocardiogram [19], incident faults in industrial process [5] and intrusions in network data [1]. For example, it is an anomaly that is a premature ventricular contraction in electrocardiogram (ECG) signals in Fig. 1. Anomalies are subsequences that are least similar and depart from the state of statistical control which exists when certain critical process variables remain close to their target values and do not change perceptibly. Subsequences that stay in a state of statistical control are called in-control data (normal data), otherwise, out-of-control data (anomaly). In statistical process control, there are following key features which are shown in Fig. 2.

Points representing a statistic of a quality characteristic in samples taken from the process at different times or different data.

The mean of these statistics using all the samples at which a center line (CL) is drawn.

Upper control limits (UCL) and lower control limits (LCL) that indicate the threshold at which the process output is considered statistically ‘unlikely’.

Anomaly detection in time series is more challenging due to several reasons. First, it makes control limits very important decision aids. Control limits provide information about the process behavior and have no intrinsic relationship to any specification targets. In practice, the process design cannot simply deliver the process characteristic at the desired level. It is also a key challenge to select control limits. Second, ‘anomaly’ is a more complex concept. For example, if one sample’s characteristic is equal to UCL-ɛ (ɛ is an infinitesimal positive number), it is normal. But if one sample’s characteristic is equal to UCL+ɛ, it becomes difficult to determine whether it is normal or abnormal. Third, many other algorithms require several parameters whose values are to be determined [20]. This requires large amounts of training data, therefore, most of algorithms are realized in a supervised mode. Fourth, statistical features are unknown and various, what’s more, they may not coincide with the specified value of the process characteristic.

To address these challenges, a synergistic combination of statistical and fuzzy set-based technique is proposed in this paper. Anomaly detection is viewed as a statistical hypothesis testing and a definition based on statistical process control is introduced. Because the process mean may not coincide with the specified value of the quality, the mean of samples’ characteristics isn’t adopted, but a threshold. ‘Anomaly’ could be a more complex concept, so the threshold should be fuzzy. Fuzzy set theory is taken into account to provide a better characterization of the boundary between ‘normal’ and ‘abnormal’. What’s more, the inequalities (>, ≤) in statistical test are treated as fuzzy predicates (the degree of inclusion). Intensive fuzzification process is adopted to realize related parameters determination which is self-adaptive. Therefore, values of parameters are not required to be specified by the user. Due to the use of fuzzy set theory, there is no need to consider the data distribution in statistical test and the algorithm is a totally unsupervised model. A number of studies that show the effectiveness of our algorithm to detect anomalies in time series are conducted.

2 Related works

The broad categories of anomaly detection techniques are: classification-based techniques [22, 37], nearest neighbor-based techniques [4, 10], clustering-based techniques [27, 45], statistical techniques [7, 38], information theoretic techniques [2, 25], spectral techniques [15]. There are numerous methods to realize anomaly detection in time series. Cheng [11] presented a robust graph-based algorithm for detecting anomalies in noisy multivariate time series. SAX [33] which is a symbolic representation was employed to speed up the proposed method. However, classification-based techniques [43, 46] cannot detect unknown and emerging anomalous patterns. Although many clustering methods [14, 36] show good performance in the laboratory, there are only few techniques gaining popularity in practical applications. There are three main reasons, including sensitivity to initialization, trapping into local minima and lack of prior knowledge for parameters of clustering methods.

One of commonly used techniques for anomaly detection in time series is to analyze their neighborhoods [40]. Shuchita [41] focused on nearest neighbor based outlier detection techniques and identified the advantages and disadvantages of various nearest neighbor based techniques. Keogh [19] proposed a largest nearest neighbor approach to detect maximal different subsequences within a longer sequence. Some techniques computed the outlier score of a data instance as the sum of its distances from its K nearest neighbors [44]. The nearest neighbor-based technique is a nonparametric method. Despite its simplicity, its critical shortcomings are the low tolerance to noise and the fact that it relies exclusively on the existing data, assuming that the training set defines the decision boundaries among the classes perfectly, which is not always the case [13].

There are extensive studies [31, 39] in statistics community regarding anomaly detection. They fit a statistical model to given data, then apply a statistical test to determine whether an instance belongs to this model. Federico [38] proposed a method to detect anomalies in network traffic, based on a non-restricted α-stable first-order model and hypothesis testing. The proposal in [8] was based on Markov chains, co-occurrence matrices, and compression algorithms, for modeling TCP connections, in terms of statistical analysis of some packet header fields. The approach in [9] was based on different families of Markovian models for modeling network traffic running over TCP. For statistical methods [3 , 29], statistical features, such as the signals’ mean value, mean square deviation and distribution, are taken into account. Statistical methods fit a statistical model to given data, then apply a statistical inference test to determine whether an instance belongs to this model. The main drawbacks of statistical methods are statistical features are unknown and various, what’s more, statistical features may not coincide with the specified value of the process characteristic and several parameters need to be determined [16, 23]. However, it has the ability to detect unseen patterns.

The reason that ‘anomaly’ itself includes fuzziness makes fuzzy set theories have been successfully applied [6]. One systems which was based on fuzzy adaptive resonance theory and evolving fuzzy neural networks was proposed in [32]. The approach [28] was to detect new operation modes of a process such as operation point changes and faults. The approach relied upon an incremental clustering procedure to generate fuzzy rules. In [24], unsupervised data partitioning was performed adapting fuzzy c-means clustering in an incremental model. Fuzzy logic theory [12 , 42] can avoid hard threshold, thereby increase the tolerance towards contradictions in the data. However, fuzzy rules and membership functions are difficult to determine. If the information is simple and insufficient, the performance will become worse.

3 Anomaly detection as a statistical test

Now, the reason why any anomaly detection scheme can be viewed as a statistical test is illustrated. A hypothesis testing is to discover whether a sample is or not significantly different from the present in the population. The testing is evaluated by the fundamental tradeoff between the false positive and false negative rates. Hypothesis testing explains how to pick thresholds decision when one problem is faced with balancing this particular tradeoff. In fact, any anomaly detection method requires a threshold. So hypothesis testing is available and useful to discover whether the current subsequence is or not significantly different from the whole information. Any anomaly detection method will at some point use a statistical test to verify whether a hypothesis is true or false. The problem is to choose between a single hypothesis H0, which is associated to the estimated stochastic model and the composite hypothesis H1, which represents all the otherpossibilities.

3.1 Anomaly score

Considering a time series X = X₁, …, X_n, which is an ordered sequence of measurements taken for a real-valued variable X at timestamps 1, …, n, where n is the number of the timestamps and X_i is the ith observation (subsequence) of the time series at timestamp i, X_i may be a subsequence with length q, or a collection of q features at timestamp i. Consequently, the time series can be represented in the form: $\begin{matrix} X_{1} = x_{11}, & x_{12}, & \dots, & x_{1 q} \\ X_{2} = x_{21}, & x_{22}, & \dots, & x_{2 q} \\ ⋮ \\ X_{n} = x_{n 1}, & x_{n 2}, & \dots, & x_{nq} \end{matrix}$ (1)

A key component of an anomaly detection algorithm is the similarity used to determine how closely two given observations are matched. Let X_i and X_j denote a pair of subsequences at timestamps i and j. Each timestamp is associated with a set of measurements for q features. The similarity is measured using Euclidean distance defined as follows: $Distance (X_{i}, X_{j}) = \sqrt{\sum_{h = 1}^{q} {(x_{ih} - x_{jh})}^{2}}$ (2)

For each subsequence, the degree to which the subsequence differs from others in the same time series is defined, termed anomaly score. It is defined as the average of the distances between a subsequence and its K nearest neighbors. Formally speaking, the anomaly score of a subsequence P is computed in the form: $Score (P) = \frac{1}{K} \sum_{t = 1}^{K} Distance (P, P_{t})$ (3)

Where P_t ∈ KNNSet (P), KNNSet (P) denotes the set composed by K nearest neighbors of P in time series.

3.2 Statistical hypothesis testing

Traditional hypothesis testing relies heavily on prior knowledge of the data distribution and is not suitable for arbitrary or unknown data. What’s more, hypothesis testing is based on statistical estimation of the distribution parameters. Because the distribution is unknown, it is impossible to estimate its parameters. In practice, the process mean may not coincide with the specified value of the feature. So a threshold θ is considered to do hypothesis testing to determine whether one subsequence is abnormal. Statistical control state exists when anomaly score ≤ θ. Following hypothesis H0 for the case when there is an anomaly, and an alternate hypothesis H1 for the case when there is no anomaly are formed. $\begin{matrix} H_{0} : anomaly score > θ \\ H_{1} : anomaly score \leq θ \end{matrix}$ (4)

Neither the distribution nor the values of its parameters is known, such as mean and standard deviation. But according to the definition of anomaly score, the average of the distances between a subsequence and its K nearest neighbors can be determined. Meanwhile, standard deviation of this subsequence can be calculated. T-test is a hypothesis test in which the test statistic follows a Student’s t distribution if null hypothesis is supported. It can be used to determine if two data sets are significantly different from each other, and is most commonly applied when the test statistic would follow a normal distribution. When the scaling term is unknown and is replaced by an estimate, the test statistic (under certain conditions) follows a Student’s t-test. Student’s t-test doesn’t need known distribution parameters. The subsequence mean and standard deviation can be used to replace distribution parameters. In this case, the constructed test statistic and rejection region come in Equations.(5) and (6). If P satisfies Equation.(6), it is normal and H0 is rejected. $t (P) = \frac{Score (P) - θ}{s / \sqrt{K}}$ (5) $t (P) \leq t_{1 - α} (K - 1)$ (6)

Where Score (P) is the anomaly score of subsequence P, s is the standard deviation between a subsequence and its K nearest neighbors, and α is significance level. There are other two parameters to be determined, which are θ and K. As mentioned above, ‘anomaly’ itself includes fuzziness, so related parameters include fuzziness. Correspondingly, threshold determination is a problem of fuzzy information processing. In this sense, fuzzy sets arise naturally in this study. In traditional hypothesis test, the constructed test statistic in Equation.(5) has same rejection region which is Equation.(6) for different data. However, different data possesses different characteristics, what’s more, the inequalities (>, ≤) in statistical hypothesis test are treated as fuzzy predicates (implying a certain degree of inclusion). Therefore, intensive fuzzification is adopted to determine the threshold and rejection region. In following section, the intensive fuzzification will be illustrated, what’s more, because of fuzzification, there is no need to care whether the test statistic would follow a normal distribution or not.

To evaluate the performance, a Receiver Operating Characteristic (ROC) curve is used, which plots true positive rate (TPR) (i.e. detection rate) versus false positive rate (FPR) (i.e. false alarm rate) as shown in Fig. 3. ROC curve establishes a trade-off between detection rate and false alarm rate. An algorithm performs well if its ROC curve climbs rapidly towards the upper left corner of the graph. This means a high fraction of true anomalies are detected with only a few false positives. To quantify how quickly the curve rises to the upper left corner, the area under the curve (AUC) is determined. The larger the area, the better the algorithm. To make the curve climbing towards the upper left corner, K selection is viewed as an optimization problem which is defined as follows: $K = arg max_{K} AUC$ (7)

Optimal ROC curve can be derived by optimizing K whose value is estimated based on training set. Furthermore, only K optimization needs training set in our algorithm. Keogh [19] proposed to use a Kth nearest neighbor (K = 1) approach to detect anomalies. In this paper, a predefined value of K which assumes any value can be considered. In this case, the algorithm is of unsupervised character.

4 Fuzzy-statistical detection mode

As mentioned earlier, ‘anomaly’ is a complex concept, therefore parameters determination can be realized by engaging fuzzy sets. In this section, parameters and rejection region determination is shown. The aim of fuzzification is to achieve optimized results. The first fuzzification is to determine the threshold θ in Equations.(4) and (5). The second fuzzification is to determine the rejection region and regards the inequalities (>, ≤) as fuzzy predicates. It is noted that because of fuzzification, the algorithm is self-adaptive and there is no need to consider the data distribution in statistical test.

4.1 Fuzzification

As the objective is to realize anomaly detection, two fuzzy sets are selected, namely ‘normal’ and ‘abnormal’. Because of their smoothness, infinite support, and concise notation, Gaussian membership functions are commonly encountered models, which are described in the form: $normal (x) = {\begin{matrix} 1, & x≤ a \\ \exp (- {(\frac{x - a}{σ})}^{2}), & x> a \end{matrix}$ (8) $abnormal (x) = {\begin{matrix} \exp (- {(\frac{x - b}{σ})}^{2}), & x≤ b \\ 1, & x> b \end{matrix}$ (9)

Each membership function is characterized by two important parameters. a (or b) represents the function centre, whereas σ denotes the spread. Higher values of σ correspond to larger spreads of fuzzy sets. The same value of σ is selected for these two functions. In this case, σ can be regarded as a constant. So it is a or b, not σ, which determines an overlap between these two functions. When abnormal (x) > normal (x), there is an anomaly. The threshold is given by the condition abnormal (x) = normal (x).

Now, how to determine the values of parameters a and b is illustrated. In principle, a sound fuzzy membership function not only can reflect the fuzzy characteristics of a fuzzy concept, but also can describe the objective content as clear as possible. The clarity can be measured by the fuzziness degree of one fuzzy set. When the fuzziness degree is smaller, the fuzzy set expresses the problem more clearly. So the principle of minimum fuzziness degree is used to determine the values of related parameters. Fuzzy entropy is selected to express the fuzziness degree, leading to the optimization model in the following form, where N is the number of samples in the whole dataset,

$\begin{matrix} min H (a, b) \\ = \frac{1}{N ln 2} \sum_{i = 1}^{N} {s (normal (x_{i})) + s (abnormal (x_{i}))} \end{matrix}$ (10) $s (x) = {\begin{matrix} - x ln x - (1 - x) ln (1 - x), & x \in (0, 1) \\ 0, & x = 0, 1 \end{matrix}$ (11)

4.2 Intensive fuzzification

The fuzzy set-based scheme which is an intensive fuzzification process is proposed in this paper. It is shown in Fig. 4. The scheme consists of five major components as follows:

Fuzziness degree inference: Membership functions are expressed by Equations. (8) and (9) whose parameters are unknown. Then, fuzziness degree is expressed by Equations. (10) and (11).

Parameters optimization: Optimized a and b are achieved by principle of minimum fuzziness degree in Equations. (10). Differential evolution is used to optimize the fuzziness degree and obtain related parameters.

Fuzzy sets building: Fuzzy sets and their membership functions are constructed according to optimized a, b and predefined σ.

Threshold determination: θ in Equations. (4) and (5) exists when two membership degrees are the same. θ is the value of x when abnormal (x) = normal (x).

Result estimation: P is an anomaly and H0 is accepted when abnormal (t (P)) > normal (t (P)). K optimization is carried out by the model in Equation. (7).

As shown in Figure 4, the process of fuzzification includes three major components, which are fuzziness degree inference, parameters optimization and fuzzy sets building. The solid line represents the first fuzzification in which anomaly score is the data object. The aim of the first fuzzification is to obtain an optimized threshold. The dashed line shows the second fuzzification. According to Equation. (5), the value of t (P) is calculated, which is the data object for the second fuzzification. Because the optimized threshold is a numeric value and inequality (≤) is a fuzzy predicate, a second fuzzification is conducted to determine the rejection region and degree of inclusion. The proposed scheme is developed in two stages: training and testing, which are almost the same. Both of them are intensive fuzzification scheme. Their only difference is K optimization. Training is needed only when an optimized K is required. Then, optimized K is used at testing stage. Otherwise, training and K optimization are not necessary. K will be predefined by users when testing. The proposed detection algorithm based on intensive fuzzification is described in the form of Algorithm 1.

Algorithm 1 Fuzzy-statistical detection algorithm

Input: The time series X = X₁, …, X_n

Output:The set of anomalies

1: for all X_ido

2: Calculate anomaly score Score (X_i) according to Equation. (3)

3: Membership functions normal (Score (X_i ) ) and abnormal (Score (X_i ) ) of Score (X_i) are expressed by Equations. (8) and (9), where a and b are unknown, and σ is predefined

4: end for

5: Fuzziness degree H (a, b) of anomaly score Score (X_i) is expressed by Equations. (10) and (11)

6: a and b in membership functions of anomaly score Score (X_i) are optimized according to the principle of minimum fuzziness degree in Equation. (10) which is carried out by differential evolution

7: θ in Equations. (4) and (5) is the value of x when abnormal (x) = normal (x)∥

8: for all X_ido

9: Calculate t (X_i) according to Equation. (5)

10: Membership functions normal (t (X_i ) ) and abnormal (t (X_i ) ) of t (X_i ) are expressed by Equations. (8) and (9), where a and b are unknown, and σ is predefined

11: end for

12: Fuzziness degree H (a, b) of t (X_i) is expressed by Equations. (10) and (11)

13: a and b in membership functions of t (X_i) are optimized according to the principle of minimum fuzziness degree in Equation. (10) which is carried out by differential evolution

14: for all t (X_i) do

15: ifabnormal (t (X_i)) > normal (t (X_i)) then

16: X_i is an anomaly and accept H0

17: else

18: X_i is normal and accept H1

19: end if

20: end for

4.3 Data distribution in statistical test

Now the reason why there is no need to concern about the data distribution is illustrated. Only in the second fuzzification, the data distribution need be considered. As shown in Equations. (5) and (6), t (P) and t_1-α (K - 1) have to be calculated. If K and α are known, t_1-α (K - 1) will be a constant. The predicted result of t (P) will be the same as that of t (P) - t_1-α (K - 1). Now the reason is illustrated. For example, x and y are two variables, and y = x + h, h is a constant. If membership functions of variable x are normal (x) and abnormal (x), membership functions of y are normal (y - h) and abnormal (y - h). abnormal (y - h) and normal (y - h) are on the right of abnormal (x) and normal (x) as shown in Fig. 5. The distance between them is h and shapes of their membership functions are the same. CrossX and CrossY are crossing points of their membership functions and the distance between them is h. It means CrossY = CrossX + h. So the hypothesis x < CrossX is equivalent to the hypothesis y < CrossY. Additionally, if their membership functions can be determined respectively, their crossing points can be determined without caring the value of h. Therefore, t_1-α (K - 1) in Equation. (6) can be ignored, and only t (P) is regarded as the test object in the second fuzzification. Meanwhile, the determination of rejection region is regarded as the processing of fuzzy information. In the second fuzzification, rejection region is determined according to the membership functions of t (P), not Equation.(6) directly. It means that Equation.(6) is changed to abnormal (t (P)) ≤ normal (t (P)). In this case, the determination of rejection region has nothing to do with t_1-α (K - 1) and the critical value of rejection region is no longer t_1-α (K - 1). So there is no need to care whether the data follows a normal distribution or not. The most important thing is that there is no need to care the data distribution.

5 Experiments and discussions

5.1 Performance analysis

The necessary of some techniques in our algorithm is analyzed now. Because every data set may have its own characteristics, a predefined and same K for all datasets is not used. An optimization model is constructed to select K. Of course, if there is limited training data, one can use a predefined and same K for all datasets. In this case, the algorithm will be a totally unsupervised model. According to characteristics of our algorithm, four experiments are performed listed as follows:

The algorithm based on statistical testing and a predefined K is called SP.

The algorithm based on statistical testing and an optimized K is called SO.

The algorithm based on statistical testing, fuzzy set theory and a predefined K is called SFP.

The algorithm based on statistical testing, fuzzy set theory and an optimized K is called SFO.

These four experiments are performed using datasets obtained from university of California Riverside time series repository [21]. Because of space limit, results of the first 10 datasets in the first part of the repository are merely listed. Information of every dataset is listed in Table 1. The class with most instances is selected as the normal class, and the class with fewest instances as the anomaly class. Each dataset is fairly organized so that the resulting dataset consists less than 20% anomalies. The performance is analyzed from two points of view: K selection and fuzzification. In the first (SP) and third (SFP) experiments, a fixed K(9) is selected for the experiments. In other two experiments, optimized K is used. Of course, optimized K may be different for different datasets. All results of 10 datasets are listed in Tables 2–4.

Obviously, when K is different, the performance may be much different. This conclusion can be come up with from performance comparison of SO with SP(9) and comparison of SFO with SFP(9) in Tables 2–4. Taking the comparison of SO with SP(9) for example, better results with K optimization are marked in bold. All AUC values of SO are better than those of SP(9). Almost all TPR values and FPR values remain unchanged or are slightly better. The performance is improved greatly. The same conclusions occur in the performance comparison of SFO with SFP(9). That means, if training data is sufficient, K optimization is necessary for performance improvement. If there is no training data, K can be any value. In this case, detection rate and false alarm rate almost are not affected by different K.

The performance analysis of fuzzification is shown by comparing SFP(9) with SP(9) and comparing SFO with SO in Tables 2–4. Taking the comparison of SFO with SO for example, better results with fuzzification are marked in bold. Most AUC values of SFO are better than those obtained by the SO. All FPR values are smaller and most TPR values stay the same. There is the largest improvement in false alarm rate. This means that the fuzzification can effectively reduce false alarm rate. The SFO algorithm outperforms other three algorithms on most of datasets. Fuzzification is workable and effective for performance improvement.

5.2 Performance comparison

Our algorithm are compared against 4 competing methods, which are LOF [26], Brute force [30], Random Walk [34] and K-distance [35] using 10 datasets described in Table 1. All 4 competing methods are unsupervised. LOF and K-distance use a threshold called minimum points (K) to define the neighborhood. All baseline methods need parameter N to determine the number of predicted anomalies. Selection of K and N for these approaches must be specified by the user. Our algorithm is self-adaptive and does not require any predefined parameters.

5.2.1 Results with predefined parameters

Different N may result in different conclusions. What’s more, sizes of different datasets are different or unknown. This leads to the difficulty in N selection. ROC curve is generated by plotting the cumulative distribution function of detection probability and false alarm probability. AUC is merely determined by anomaly score. Therefore, different N may lead to different TPR and FPR, but no effect on AUC. In general, when N is larger, TPR is better and FPR is worse. Otherwise, TPR becomes worse and FPR gets better. For comparison, same N is selected, which is automatically generated by our algorithm, for four baseline algorithms. K can be any value for SFP, LOF and K-distance. The results when K is 6 are only shown in Tables 5–7. Best results is marked in bold. More best results mean the performance of this algorithm is better. SFP(6) has 7 best TPR, 7 best FPR and 6 best AUC. K-distance has 6 best TPR, 6 best FPR and 4 best AUC. LOF has 4 best TPR, 4 best FPR and 1 best AUC. Brute force has 5 best TPR, 5 best FPR and 2 best AUC. Random walk has 6 best TPR, 5 best FPR and 3 best AUC.

Because of fuzzification, the proposed algorithm is self-adaptive. It means that the analysis procedure can process parameters according to the characteristic of the data and algorithm automatically, so that the algorithm coincides with the data and best performance is gained. If same K and N are predefined for all algorithms, the number of anomalies in every dataset and the characteristic of every algorithm may be different. If K is too small, it is difficult to show the difference between data, so the abnormal data may be regarded as normal data, leading to bad FPR. If N is too small, too many abnormal data will not be detected, leading to bad TPR. Otherwise, if K or N is too large when the number of data is limited, the detection may be out-of-range. If K is too large, the boundary between ‘normal’ and ‘abnormal’ will become indistinct. Meanwhile, the performance is sensitive to N, therefore, large N will cause bad FPR. Therefore, when K is predefined and N is the same for all algorithms, the performance of our algorithm is the best.

5.2.2 Results with optimized parameters

In previous section, predefined K and same N are used for all algorithms. K can be optimized in our algorithm. In this section, the best performance of all algorithms is compared. N has a significant influence on TPR and FPR which can’t reach their maximum values at the same N. As we all know, when N is larger, TPR is better and FPR is worse. Therefore, there is no meaning to compare best TPR and best FPR. Best AUC values are just shown for all methods using the 10 datasets. Various K and N have been experimented, then best K and N that maximize the performance for 4 competing methods are selected. Best results are marked in bold in Table 8. The results indicate that our algorithm significantly outperforms other existing methods for the majority of datasets.

As mentioned above, different N may result in different TPR and FPR, but there is no effect on AUC. Brute force and Random walk just need N and don’t need K, therefore, AUC of these two algorithms remains unchanged. To estimate the connectivity value of an instance, Random Walk views anomaly detection as performing a random walk on the Markov chain with a transition matrix. There are some correlations for all instances, but Markov property may not exist. Therefore, there is serious limitation in Random Walk. SFO, K-distance and Brute Force are distance-based methods. LOF is density-based, in some sense, it is based on the distance. Distance-based methods are simple and intelligible, but the performance is not stable, especially for sparse data. The advantage of a local instead of a global view is detailed in LOF, which depends on how isolated the object is with respect to the surrounding neighborhood. However, if the densities of two different instances are similar, LOF is not effective. K-distance and Brute Force just consider one distance, that is the Kth largest distance to the nearest neighbor, therefore they are not thoughtful. Overall, the proposed algorithm in this paper is better than four competing algorithms.

5.2.3 Time complexity

One of critical performance aspects of any algorithm is the speed. Creating real-time requirements is challenging. Time complexity of our algorithm and 4 compared algorithms is O (n²), where n is the number of instances. The experiments are conducted on an Intel i5-3470 PC with 8.00GB main memory under windows 8 and implemented in Visual studio 2008. Total detection time of every algorithm for all 10 datasets is shown in Fig. 6. Brute force takes every possible subsequence, and only one anomaly can be found per call. Therefore, it is rather time-consuming. Local reachability densities of an object’s K-distance neighborhoods need to be obtained in LOF. A transition matrix is needed to form in Random walk. Runtime trendlines against increasing dataset sizes are shown in Fig. 7. When data size is smaller, the difference of runtime is slight, not more than 1s. The difference of runtime becomes gradually significant with increasing sizes. Brute force and LOF show a sharp rise. It can be seen that the runtime of our algorithm and K-distance is the shortest and their trendlines stay steadier than others as shown in Fig. 6 and Fig. 7. They are almost the same. According to the analysis, our method can outperform other existing methods in both AUC value and detection time. It means that our algorithm is workable and much effective.

6 Conclusion

In this work, a novel algorithm has been first developed for finding anomalies in data streams, which is viewed as a statistical hypothesis testing. Exploiting fuzzy set theory, hypothesis testing is a self-adaptive model and there is no need to consider the data distribution. The algorithm can be an unsupervised model. K optimization is necessary for AUC improvement. In this case, detection rate and false alarm rate almost are not affected by different K. False alarm rate has been significantly improved by intensive fuzzification process. A careful performance comparison confirms the effectiveness of the proposed algorithm. Furthermore, different proportion of anomaly data may significantly affect the performance, which will be further studied.

Footnotes

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant No. 61173073 and International S&T Cooperation Program of China under Grant No. 2014DFA11350.

References

Abadeh

M.S.

, Mohamadi

and Habibi

, Design and analysis of genetic fuzzy systems for intrusion detection in computer networks, Expert Systems with Applications38(6) (2011), 7067–7075.

Ando

, Clustering needles in a haystack: An information theoretic analysis of minority and outlier detection, Proceedings of the 7th IEEE International Conference on Data Mining, 2007, pp. 13–22.

Bakhshi

A.D.

, Bashir

and Maud

M.A.

, An improved statistical representation for ECG electrode movement and muscular activity noises in the context of T-wave alternan estimation, Biomedical Signal Processing and Control8 (2013), 297–301.

Boriah

, Chandola

and Kumar

, Similarity measures for categorical data: A comparative evaluation, Proceedings of the 8th SIAM International Conference on Data Mining, 2008, pp. 243–254.

Brighenti

and Sanz-Bobi

M.A.

, Auto-regressive processes explained by self-organized maps: Application to the detection of abnormal behavior in industrial processes, IEEE Transactions on Neural Networks22(12) (2011), 2078–2090.

Cateni

, Colla

and Nastasi

, A multivariate fuzzy system applied for outliers detection, Journal of Intelligent and Fuzzy Systems24 (2013), 889–903.

Bruaene

A.V.D.

, Moons

, Belmans

, Post

M.C.

, Luermans

J.G.

, Delcroix

, Pasquet

, Backer

J.D.

, Paelinck

, Morissens

and Budts

, Predictive model for late atrial arrhythmia after closure of an atrial septal defect, International Journal of Cardiology164 (2013), 318–322.

Callegari

, Giordano

and Pagano

, New statistical approaches for anomaly detection, Security and Communication Networks2 (2009), 611–634.

Callegari

, Vaton

and Pagano

, A new statistical approach to network anomaly detection, International Symposium on Performance Evaluation of Computer and Telecommunication Systems (2008), 441–447.

10.

Chandola

, Boriah

and Kumar

, Technical Report, University of Minnesota, Understanding Categorical Similarity Measures for Outlier Detection, 2008.

11.

Cheng

, Tan

P.N.

, Potter

and Klooster

, A robust graphbased algorithm for detection and characterization of anomalies in noisy multivariate time series, Proceedings of the IEEE International Conference on Data Mining Workshops, 2008, pp. 349–358.

12.

Chua

T.W.

and Tan

W.W.

, Non-singleton genetic fuzzy logic system for arrhythmias classification, Engineering Applications of Artificial Intelligence24 (2011), 251–259.

13.

Derrac

, García

, Herrera

Francisco

, Fuzzy nearest neighbour algorithms: Taxonomy, experimental analysis and prospects, Information Sciences260 (2014), 98–119.

14.

Doğan

and Korürek

, A new ECG beat clustering method based on kernelized fuzzy c-means and hybrid ant colony optimization for continuous domains, Applied Soft Computing12 (2012), 3442–3451.

15.

Dutta

, Giannella

, Borne

and Kargupta

, Distributed top-k outlier detection from astronomy catalogs using the DEMAC system, Proceedings of the SIAM International Conference on Data Mining, 2007, pp. 473–478.

16.

Fei

S.W.

, Diagnostic study on arrhythmia cordis based on particle swarm optimization-based support vector machine, Expert Systems with Applications37 (2010), 6748–6752.

17.

Huang

Y.P.

, Huang

C.Y.

and Liu

S.I.

, Hybrid intelligent methods for arrhythmia detection and geriatric depression diagnosis, Applied Soft Computing14 (2014), 38–46.

18.

Huang

, James

C.A.

, Tichnell

, Murray

, Xue

, Calkins

and Tereshchenko

L.G.

, Statistical evaluation of reproducibility of automated ECG measurements: An example from arrhythmogenic right ventricular dysplasia/cardiomyopathy clinic, Biomedical Signal Processing and Control13 (2014), 23–30.

19.

Keogh

, Lin

, Fu

and Herle

H.V.

, Finding unusual medical time-Series subsequences: Algorithms and applications, IEEE Transactions on Information Technology in Biomedicine10(3) (2006), 429–439.

20.

Keogh

, Lonardi

and Ratanamahatana

C.A.

, Towards parameter-free data mining, Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 206–215.

21.

Keogh

, Zhu

, Hu

, Hao

, Xi

, Wei

and Ratanamahatana

C.A.

, The UCR time series classification/clustering homepage: http://www.cs.ucr.edu/_eamonn/time_series_data/, 2011.

22.

Koc

, Mazzuchi

T.A.

and Sarkani

, A network intrusion detection system based on a Hidden Naïve Bayes multiclass classifier, Expert Systems with Applications39(18) (2012), 13492–13500.

23.

Koulaouzidis

, Das

, Cappiello

, Mazomenos

E.B.

, Maharatna

, Puddu

P.E.

and Morgan

J.M.

, Prompt and accurate diagnosis of ventricular arrhythmias with a novel index based on phase space reconstruction of ECG, International Journal of Cardiology182 (2015), 38–43.

24.

Kumarage

, Khalil

, Tari

and Zomaya

, Distributed anomaly detection for industrial wireless sensor networks based on fuzzy data modelling, Journal of Parallel and Distributed Computing73(6) (2013), 790–806.

25.

Lanatá

, Valenza

, Mancuso

and Scilingo

E.P.

, Robust multiple cardiac arrhythmia detection through bispectrum analysis, Expert Systems with Applications38 (2011), 6798–6804.

26.

Lee

, Kang

and Kang

, Integrating independent component analysis and local outlier factor for plant-wide process monitoring, Journal of Process Control21(7) (2011), 1011–1021.

27.

Lee

, Kim

and Kim

, Self-adaptive and dynamic clustering for online anomaly detection, Expert Systems with Applications38(12) (2011), 14891–14898.

28.

Lemos

, Caminhas

and Gomide

, Adaptive fault detection and diagnosis using an evolving fuzzy classifier, Information Sciences220 (2013), 64–85.

29.

Lewis

G.F.

, Furman

S.A.

, McCool

M.F.

and Porges

S.W.

, Statistical strategies to quantify respiratory sinus arrhythmia Are commonly used metrics equivalent?Biological Psychology89 (2012), 349–364.

30.

Z.M.

, Xiong

and Liu

Y.C.

, Mining blackhole and volcano patterns in directed graphs: A general approach, Data Mining and Knowledge Discovery25(3) (2012), 577–602.

31.

and Yang

Y.P.

, Statistical process monitoring based on modified nonnegative matrix factorization, Journal of Intelligent and Fuzzy Systems28 (2015), 1359–1370.

32.

Liao

Y.H.

, Vemuri

V.R.

and Pasos

, Adaptive anomaly detection with evolving connectionist systems, Journal of Network and Computer Applications30(1) (2007), 60–80.

33.

Lin

, Keogh

, Li

and Lonardi

, Experiencing SAX: A novel symbolic representation of time series, Data Mining and Knowledge Discovery15(2) (2007), 107–144.

34.

Moonesinghe

H.D.K.

and Tan

P.N.

, Outrank: A graph-based outlier detection framework using random walks, International Journal on Artificial Intelligence Tools17(1) (2008), 19–36.

35.

Nurunnabi

, West

and Belton

, Outlier detection and robust normal-curvature estimation in mobile laser scanning 3D point cloud data, Pattern Recognition48(4) (2015), 1404–1419.

36.

Özbay

, Ceylan

and Karlik

, Integration of type-2 fuzzy clustering and wavelet transform in a neural network based ECG classifier, Expert Systems with Applications38 (2011), 1004–1010.

37.

Özcift

, Random forests ensemble classifier trained with data resampling strategy to improve cardiac arrhythmia diagnosis, Computers in Biology and Medicine41 (2011), 265–271.

38.

Simmross-Wattenberg

, Asensio-Pérez

J.I.

, Casaseca-de-la-Higuera

, Martin-Fernández

, Dimitriadis

I.A.

and Alberola-López

, Anomaly detection in network traffic based on statistical inference and α-stable modeling, IEEE Transactions on Dependable and Secure Computing8(4) (2011), 494–509.

39.

Subramaniam

, Palpanas

, Papadopoulos

, Kalogeraki

and Gunopulos

, Online outlier detection in sensor data using non-parametric models, Proceedings of the 32nd International Conference on Very Large Data Bases, 2006, pp. 187–198.

40.

Shubair

, Ramadass

and Altyeb

A.A.

, kENFIS: kNN-based evolving neuro-fuzzy inference system for computer worms detection, Journal of Intelligent and Fuzzy Systems26 (2014), 1893–1908.

41.

Upadhyaya

and Singh

, Nearest neighbour based outlier detection techniques, International Journal of Computer Trends and Technology3(2) (2012), 299–303.

42.

Vafaie

M.H.

, Ataei

and Koofigar

H.R.

, Heart diseases prediction based on ECG signals’ classification using a genetic-fuzzy system and dynamical model of ECG signals, Biomedical Signal Processing and Control14 (2014), 291–296.

43.

Wang

J.S.

, Chiang

W.C.

, Hsu

Y.L.

and Yang

Y.T.C.

, ECG arrhythmia classification using a probabilistic neural network with a feature reduction method, Neurocomputing116 (2013), 38–45.

44.

Zhang

and Wang

, Detecting outlying subspaces for high-dimensional data: The new task, algorithms, and performance, Knowledge and Information Systems10(3) (2006), 333–355.

45.

Zhu

B.H.

, Ding

Y.S.

and Hao

K.R.

, Multiclass maximum margin clustering via immune evolutionary algorithm for automatic diagnosis of electrocardiogram arrhythmias, Applied Mathematics and Computation227 (2014), 428–436.

46.

Zhu

J.J.

, He

L.S.

and Gao

Z.Q.

, Feature extraction from a novel ECG model for arrhythmia diagnosis, Bio-Medical Materials and Engineering24 (2014), 2883–2891.