Abstract
Supervised anomaly detection methods for time series data require labeled data. Existing semi-supervised methods rely on outlier factors whose range varies with the data, the model and time, so the threshold for deciding abnormality is difficult to obtain; in addition, calculating an outlier factor from the other data points in the data set is computationally expensive. These drawbacks make such methods difficult to apply in practice. This paper proposes a framework named LSTM-VE, which uses clustering combined with visualization to roughly label normal data and then uses the normal data to train a long short-term memory (LSTM) neural network for semi-supervised anomaly detection. The variance error (VE) of the classification probability sequence over the normal data categories is used as the outlier factor. The framework makes anomaly detection based on deep learning practically applicable, and VE avoids the shortcomings of existing outlier factors and achieves better performance. The framework is also easy to extend, because the LSTM neural network can be replaced with other classification models. Experiments on labeled data sets and on a real unlabeled data set show that the framework outperforms replicator neural networks with reconstruction error (RNN-RS) and has good scalability and practicability.
Introduction
With the development and application of Internet of Things (IoT) technology, large numbers of sensors collect large amounts of data that contain a wealth of diverse information to mine. Such data must be cleaned to remove abnormal records and improve accuracy, and in many applications, such as fraud detection in financial data, mining the anomalies is more important than mining other information. Anomaly detection uses various methods to find the small amount of abnormal data within a large amount of normal data. In statistics, anomalies are defined as residuals or deviations from a regression or density model of the data: “An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism” [13].
The data collected by the IoT are usually time series and, in practical applications, unlabeled. To mine anomalies from such data, a popular approach is to use a model to capture the characteristics of normal data and then use the deviation between the actual value and the expected value as the outlier factor. Usually the outlier factor is fitted to a distribution (e.g., Gaussian [19]) to determine a threshold for anomaly detection. This approach uses normal data to find anomalies and thus circumvents the annotation of anomalous data. However, the threshold is difficult to determine, because it depends largely on domain knowledge and may change as the amount of data increases or time passes. Moreover, the fluctuation range of the outlier factor varies with the data and the model, which limits the versatility of these techniques. In addition, most existing unsupervised methods [16, 2, 3] calculate outlier factors based on distance or density. These methods have huge computational requirements and are not suitable for today’s massive data.
This paper proposes a semi-supervised anomaly detection framework for unlabeled time series data that combines clustering and deep learning. It uses LSTM with VE as the outlier factor and is named LSTM-VE. The method consists of two phases: the rough labeling phase and the semi-supervised anomaly detection phase. In the first phase, an optimized clustering algorithm clusters the unlabeled data, and the clustering result is combined with visualization techniques to select the normal data categories. The second phase uses the normal data categories to train an LSTM-based deep neural network and uses the VE, which is the variance of the classification probability sequence over the normal data categories, as the outlier factor. Unlike existing outlier factors, which are calculated from other data in the data set, VE greatly reduces the computational cost and can be applied to massive data. Its range depends only on the number of normal data categories and does not change with data or time, so it is stable and can be estimated in advance.
In this paper, experiments are carried out on labeled public data sets and on an unlabeled data set generated by a real application. The experiments on the labeled public data sets verify that the method is more effective and more scalable than the baseline method. The outlier detection results on the unlabeled application data are verified manually, which shows that the method can be applied in practice. Our research differs significantly from previous work in the following aspects:
(1) The framework combines clustering with deep learning to achieve semi-supervised anomaly detection, which enables anomaly detection based on deep learning to be practically applied. (2) The framework uses the VE of the classification probability sequence for normal data categories as the outlier factor, whose range can be calculated from the number of classification categories and does not change with data or model. (3) The framework is easier to expand because the LSTM neural network can be replaced with other deep learning classification models. (4) More importantly, the framework provides an idea for anomaly detection in real-world applications.
The paper is organized as follows. Section 2 introduces related anomaly detection methods for time series data. Section 3 introduces the proposed LSTM-VE framework: Section 3.1 introduces the optimized clustering method and the use of visual methods to select normal data, and Section 3.2 introduces the LSTM-based neural network and demonstrates the feasibility and advantages of using VE as the outlier factor. Section 4 evaluates the proposed model on public labeled data sets and on an unlabeled data set from a practical application. Finally, Section 5 presents conclusions, discussion and suggestions for future research.
Related work
Existing anomaly detection algorithms can be divided into three categories: unsupervised methods, semi-supervised methods, and supervised methods. Unsupervised methods include various clustering methods and statistical methods. The main drawback of statistical methods [27, 4, 10] is that they are only applicable to one-dimensional data. For multivariate data, dimensionality reduction techniques such as principal component analysis (PCA) [24] may cause information loss. The performance of clustering methods depends heavily on the design of the similarity function [16, 2, 3], and these methods have high time and computational complexity. The semi-supervised method [14] trains a model with normal data, uses the model to output a quantitative indicator representing the likelihood of being abnormal, and then determines whether the data is abnormal through a threshold. However, the quantitative indicator changes with the data source and the model, and the threshold depends on domain knowledge, which makes it difficult to determine. The supervised method [23, 1, 22] requires labeled data and trains a classifier on the training data to determine directly whether the data is abnormal.
For time series anomaly detection using deep learning, the initial method uses an auto-encoder with the reconstruction error (RS) [14] as the outlier factor to perform anomaly detection in a semi-supervised manner. Since then, LSTM has been widely used because of its advantages in handling time series data. However, the basic idea is the same: a semi-supervised method is adopted, and the outlier factor is fitted to a Gaussian distribution to determine the threshold that maximizes the F-score [19].
Existing methods have three main defects: (1) The performance of the supervised method depends heavily on the completeness of the abnormal data, and an abnormal pattern that does not appear in the training set cannot be correctly discriminated. Therefore, we adopt a semi-supervised method that trains the classification model on normal data, based on the main assumption that data not belonging to the known normal data is abnormal. (2) Distance-based or density-based unsupervised methods use outlier factors calculated over the whole data set. Similarly, calculating RS requires the input and output data to have the same dimension. This results in a huge computational cost or a complex model, which challenges the application of these algorithms to high-dimensional massive data. (3) In the semi-supervised method, the range of the outlier factor, such as RS, varies with the data set and the model. Although it can be normalized to [0, 1], such an operation might blur the distinction between normal data and anomalies. On this basis, this paper uses the VE as the outlier factor to measure the abnormal degree of a data instance. When the number of normal data categories is known, the variance has a computable fluctuation range, and this range depends only on the number of normal data categories, regardless of the model.
In addition, to obtain normal data and the number of categories quickly and accurately from unlabeled data in different fields, this paper uses a cluster-based rough labeling method to label normal data in batches. Furthermore, the method is easy to scale, because any suitable classification model can be adopted, which means that the method is applicable to univariate and multivariate data.
LSTM-VE framework
In this paper, LSTM-VE, which combines clustering with semi-supervised deep learning, is used to realize anomaly detection on unlabeled time series data (shown in Fig. 1). Specifically, an optimized clustering method clusters the raw data into several categories, and several simple visualization techniques are then used to select the normal data clusters, which are used to train an LSTM-based deep neural network to classify data. The output is the probability that the data instance belongs to each of the normal categories. On this basis, this paper uses the VE, whose fluctuation range is computable, as the outlier factor to measure the abnormal degree of the data instance. The improved clustering algorithm and the LSTM-based neural network are described below.
The composition of the framework.
Current clustering algorithms for anomaly detection are applied directly after data cleaning, and their time and space requirements are both high. The detection performance also depends heavily on the similarity function, which is the core of a clustering algorithm, and designing the similarity function requires professional domain knowledge, which limits the generality of the algorithm. Despite these shortcomings, clustering has the advantage of processing data without supervision: similar data are grouped into one class by measuring similarity.
For unlabeled data collected through the IoT, clustering can be used as a labeling preprocessing step. Because it is much easier to select the abundant normal data than the relatively rare abnormal data, several simple visualization methods can be used to select the normal clusters. This converts unlabeled data into labeled normal data, on which anomaly detection is then performed using a semi-supervised deep learning approach.
The clustering method used in this paper mainly improves universality. When using the k-means algorithm, we need to determine the parameter
Pseudo code of canopy algorithm[1]
Stage1. Most of the computational cost of clustering lies in computing object similarity. In the first stage, the canopy method uses a simple, low-cost measure of object similarity and places similar objects in a subset called a canopy (see Algorithm 3.1 for the pseudo code). Canopies can overlap, but every object belongs to at least one canopy. This stage can be seen as data preprocessing. Canopy is a partition-based method. In the algorithm, two parameters
Stage2. We use the k-means algorithm within each canopy and do not perform similarity calculations between objects that do not belong to the same canopy. This brings two major advantages: first, as long as the canopies are not too large and do not overlap too much, the number of subsequent similarity computations is greatly reduced; second, the clustering method k-means requires artificially pointing out the value of
The system thus divides the raw data into several groups, from which the normal data clusters must be selected. In anomaly detection, this selection is complex because it depends on domain knowledge, and visualization makes efficient use of the domain knowledge of experts. We use simple visualization techniques to select the normal data categories from the clustering results (the visualization techniques employed are shown in Section 4.2).
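The first stage can be sketched as follows. This is a minimal illustration of the commonly described canopy procedure, not the authors' implementation: the threshold names `t1` (loose) and `t2` (tight), the Euclidean distance and the sample points are assumptions made for the example. In this variant, membership is drawn only from the points that have not yet been removed.

```python
import math


def canopy(points, t1, t2):
    """Group points into possibly overlapping canopies.

    t1 is the loose threshold and t2 the tight one (0 <= t2 < t1); both
    names are assumptions for this sketch. Returns a list of canopies,
    each a list of indices into `points`.
    """
    if not 0 <= t2 < t1:
        raise ValueError("thresholds must satisfy 0 <= t2 < t1")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = points[remaining[0]]
        # Cheap distance from the new center to every point still in the pool.
        dists = {i: math.dist(center, points[i]) for i in remaining}
        # Loose threshold decides membership, so canopies may overlap.
        canopies.append([i for i in remaining if dists[i] <= t1])
        # Points inside the tight threshold leave the pool for good.
        remaining = [i for i in remaining if dists[i] > t2]
    return canopies


points = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (10.0, 0.0), (10.4, 0.0)]
canopies = canopy(points, t1=2.0, t2=1.0)
# One common use, assumed here: take the canopy count as a candidate k.
k = len(canopies)
```

In this toy run the point at index 2 falls inside the loose threshold of the first canopy but outside its tight threshold, so it also starts a canopy of its own, which illustrates the overlap described above.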
Using an outlier factor to determine anomalies is a widely used approach. Different outlier factors have different ranges and criteria. The outlier factors in the kNN-based method [6] lie in the range [0, 1], while the range of LOF depends on the parameters of the algorithm [3], and anomalies have higher outlier factors. The outlier factor of ABOD [16] has no fixed range, and the lower the outlier factor, the greater the probability of being abnormal. Methods based on outlier factors therefore face the problem that the threshold for determining abnormality is difficult to set, because it depends largely on domain knowledge and changes with the domain and over time. In addition, these methods compute outlier factors from other data in the data set, which makes the computational cost enormous. Although some partition-based methods such as FastLOF [11] make improvements, this shortcoming still hinders the application of these algorithms to today’s large data sets.
With replicator neural network (RNN), Hawkins et al. take the input
With the labeled data from the chosen normal clusters, we can train an LSTM [18] based model (see Fig. 2). The LSTM network is an improved recurrent neural network. A detailed and thorough description of the LSTM network can be found in [9].
The principle of using the variance of the classification probability sequence as the anomaly factor, 
The internal processes of LSTM cell are as follows:
where
There are three main phases within LSTM:
Forgetting stage. This stage is mainly to selectively forget the input, which is controlled by
Selective memory stage. This stage is mainly to selectively remember the current input
Output stage. This phase determines which ones will be treated as the output of the current cell, mainly through
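For orientation, the three stages correspond to the forget, input and output gates of the commonly used LSTM formulation, written out below. This is the textbook form given as an assumption: the symbols, and the use of tanh, may differ from the equations and the activation function used in this paper.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forgetting stage)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(selective memory stage)} \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output stage)} \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

Here $x_t$ is the current input, $h_{t-1}$ the previous hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid and $\odot$ the element-wise product.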
Because the model is trained using normal data, we add a softmax layer at the end of the model to predict data classification,
With well trained LSTM network, we use the variance (Eq. (5)) of
Compared to the RS, when we know the number of data classifications, the range of VE can be calculated, which is convenient for determining the threshold. For example, if the number of categories is n, then for normal data, the perfect output is that the probability corresponding to the correct normal category is 1, and the others are 0, then the VE reaches the maximum value as
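The bound on VE can be checked numerically with a short sketch. It assumes that VE is the population variance of the n softmax probabilities; the function names and the sample logits are hypothetical, and a different variance convention would change the constant. Under this assumption, a one-hot output has mean 1/n and variance (n - 1)/n^2, while a uniform output has variance 0.

```python
import math
import statistics


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def variance_error(probs):
    """VE, assumed here to be the population variance of the probabilities."""
    return statistics.pvariance(probs)


def max_ve(n):
    """VE of a one-hot output over n categories: (n - 1) / n**2."""
    return (n - 1) / n ** 2


n = 4
confident = softmax([8.0, 0.0, 0.0, 0.0])  # looks like a known normal class
unsure = softmax([0.0, 0.0, 0.0, 0.0])     # fits no normal class: uniform

ve_confident = variance_error(confident)   # close to max_ve(n)
ve_unsure = variance_error(unsure)         # lowest possible value
upper_bound = max_ve(n)
```

Since normal instances push VE toward the upper bound, an instance would be flagged when its VE falls below a threshold chosen inside the interval [0, (n - 1)/n^2]; the threshold itself is not specified here.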
Experiments
The first part of the experiment is conducted on several public labeled data sets, and several quantitative indicators are used to measure the anomaly detection ability of LSTM-VE. The second part is conducted on unlabeled data from a real-world setting to verify that the method can be used for anomaly detection in practical applications; the results are judged by visual methods and compared with an existing method. The experiments demonstrate the practicability and effectiveness of the method.
Experiment on public labeled datasets
In this part of the experiment we demonstrate the effectiveness and versatility of LSTM-VE to detect anomalies by experimenting with several labeled public data sets.
Datasets
We use 4 data sets from the University of California at Riverside (UCR) collection [5], as shown in Table 1. UCR is the largest public archive for time series classification. The selected data sets cover multiple domains and each consists of more than 2 clusters of different sizes, shapes, and densities, because we randomly select one category as the abnormal data and treat the others as normal data. The data sets are z-normalized and come divided into training and test sets. However, the size of the training set has a great influence on the performance of a deep learning model. To obtain a larger training set, we merge the training and test sets and randomly re-divide them into 70% for training, 10% for validation and the remaining 20% for testing. The data sets are labeled with the clustering ground truth, so we can measure the quality of the algorithm with a variety of quantitative indicators.
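The preparation steps above can be sketched as follows. The helper names, the fixed random seed, the use of the population standard deviation for z-normalization and the toy data are assumptions for illustration; the real UCR files are not read here.

```python
import random
import statistics


def z_normalize(series):
    """Rescale one series to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        return [0.0] * len(series)  # constant series: nothing to scale
    return [(x - mean) / std for x in series]


def merge_and_split(train, test, seed=0):
    """Merge the two archive splits, shuffle, and cut 70% / 10% / 20%."""
    merged = list(train) + list(test)
    random.Random(seed).shuffle(merged)
    n_train = len(merged) * 7 // 10
    n_val = len(merged) // 10
    return (merged[:n_train],
            merged[n_train:n_train + n_val],
            merged[n_train + n_val:])


# Toy stand-in for a UCR data set: (series, class_label) pairs.
train = [([float(i), float(i + 1), float(i + 3)], i % 3) for i in range(30)]
test = [([float(i), float(i + 2), float(i + 5)], i % 3) for i in range(70)]
train_set, val_set, test_set = merge_and_split(train, test)
normalized = z_normalize([1.0, 2.0, 3.0])
```

Integer arithmetic is used for the split sizes so that the three parts always add up to the full data set.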
Basic situation of the data set
The baseline method we choose is the replicator neural network (RNN) with RS for anomaly detection; the RNN is in fact an auto-encoder. To avoid confusion with recurrent neural networks, we refer to it as auto-encoder with reconstruction error (AE-RS). We search for the best parameters for LSTM-VE and AE-RS through experiments. We found that a 3-layer network structure for LSTM-VE and a network structure of 4 or fewer layers for AE-RS (not counting the input layer, whose number of units equals the input dimension) work best. Adding layers does not improve the results but increases the training time. In addition, for LSTM-VE, other activation functions such as tanh bring no significant improvement, while the softsign activation function makes the network converge more quickly. For AE-RS, the tanh activation function gives the best performance. The network hyperparameters that achieve the best performance on each data set are shown in Table 2.
Hyperparameters of the best-performing networks. In ResNet, each residual block contains three convolutional layers, and every convolutional layer in one residual block has the same number of filters. Therefore, the table records only the number of filters of one convolutional layer.
ROC curves of methods on the datasets.
Here the ability of the model to discover anomalies is measured by the Area Under the ROC Curve (AUC), which intuitively reflects the classification ability of the model (for details of AUC refer to [8]). The AUC lies in [0, 1]. A larger score means that the model has better classification ability, and a score of 0.5 means that the model classifies in an approximately random manner. To calculate the AUC, we first apply Eq. (6) to the test set labels to convert the problem into a binary classification task; then the outlier factors (OF) output by the model are normalized to [0, 1] in Eq. (7), where the
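The evaluation procedure can be illustrated with a small sketch. Because Eq. (6) and Eq. (7) are not reproduced here, the label mapping (anomaly class to 1, all others to 0) and the min-max normalization are assumptions, as are the function names and the toy scores. The AUC is computed from its pairwise definition: the fraction of (anomaly, normal) pairs in which the anomaly receives the higher score, with ties counted as one half.

```python
def binarize(labels, anomaly_class):
    """Map the chosen anomaly class to 1 and every other class to 0."""
    return [1 if y == anomaly_class else 0 for y in labels]


def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def auc(binary_labels, scores):
    """AUC as the probability that an anomaly outranks a normal instance."""
    pos = [s for y, s in zip(binary_labels, scores) if y == 1]
    neg = [s for y, s in zip(binary_labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


labels = ["a", "a", "b", "b"]      # class "b" plays the anomaly
raw_scores = [0.1, 0.4, 0.35, 0.8]  # higher means more anomalous
y = binarize(labels, anomaly_class="b")
score = auc(y, min_max(raw_scores))
```

The pairwise form assumes that a higher score means more anomalous. For an outlier factor such as VE, where normal data yield the larger values, the orientation would have to be flipped first, for example by using one minus the normalized value; min-max normalization itself does not change the AUC because it preserves the ranking.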
Experimental results of four methods on the datasets, AUC of anomaly detection
Outlier factors distribution of two methods, red squares represent abnormal data and green circles represent normal data.
The performance of LSTM-VE depends on the performance of the classification model. However, the VE used as the outlier factor in LSTM-VE is more versatile because its range does not change with the model. When RS is used as the outlier factor and the model is changed, RS may have a different range, so the most appropriate threshold must be found again. We therefore also evaluate the scalability of VE as an outlier factor by replacing LSTM with other classification models. We use fully convolutional networks (FCN) and the deep residual network (ResNet) [26] with VE to conduct experiments, referred to as FCN-VE and ResNet-VE respectively. The experimental results are shown in Table 3 and Fig. 3.
In the experimental results, the AUC scores of the methods using VE are consistently higher than those of the baseline method AE-RS, which indicates that using the variance improves anomaly detection. FCN-VE and ResNet-VE also achieve satisfactory results, which reflects the scalability of the proposed method. However, the methods using VE differ in performance: ResNet-VE performs best, LSTM-VE second and FCN-VE third. The reason for this difference lies in the accuracy of the classification network. ResNet has a more complex structure than the LSTM network, so it handles the time series classification task better, while FCN has achieved good results in image processing but is less suited to time series classification. This part of the experiment therefore supports the scalability of VE.
Figure 4 shows the distribution of the outlier factors obtained by AE-RS and LSTM-VE on the test set of ElectricDevices. For AE-RS, the outlier factors of normal and abnormal data are uniformly distributed, with no obvious distinction. For our method, the outlier factors of normal and abnormal data are clustered to a certain extent, with a relatively clear distinction. Therefore, our method detects the anomalies better.
Because the effect of LSTM-VE depends to some extent on the clustering method in the first process, the evaluation of clustering methods is necessary. In this experiment, we verify the effectiveness of the canopy
Dataset
The dataset is gas-oil-refueling data collected from refueling stations in a certain province of China. Every refueling record contains information about a driver (
We extract the data of each gas station from the refueling data. The data of each day at each gas station form one time series, in which each item is the amount of fuel sold at that station during one half-hour period. The outlier detection process is then based on these time series. The records for each gas station cover approximately 150 days. We use the data of about 120 days in 2016 for clustering, selecting normal data and training the model, and use the remaining 2017 data for verification.
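The construction of the daily series can be sketched as follows. The record layout `(station_id, timestamp, fuel_amount)`, the function name and the sample values are hypothetical, since the actual field names of the refueling records are not given here. Each day at each station becomes a vector of 48 half-hour totals.

```python
from collections import defaultdict
from datetime import datetime

SLOTS_PER_DAY = 48  # 24 hours split into half-hour periods


def daily_series(records):
    """Sum fuel amounts per (station, day) into 48 half-hour slots.

    `records` is an iterable of (station_id, timestamp, fuel_amount)
    tuples; this layout is an assumption for the sketch.
    """
    series = defaultdict(lambda: [0.0] * SLOTS_PER_DAY)
    for station, ts, amount in records:
        slot = ts.hour * 2 + ts.minute // 30
        series[(station, ts.date())][slot] += amount
    return dict(series)


records = [
    ("stationA", datetime(2016, 5, 1, 0, 10), 30.0),
    ("stationA", datetime(2016, 5, 1, 0, 40), 20.0),
    ("stationA", datetime(2016, 5, 1, 0, 50), 5.0),
    ("stationA", datetime(2016, 5, 2, 23, 59), 12.5),
]
per_day = daily_series(records)
first_day = per_day[("stationA", datetime(2016, 5, 1).date())]
```

Half-hour periods without any refueling keep the value 0.0, so every daily series has the same length and can be fed to the clustering step directly.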
In this paper, a canopy
Here we show the results for two gas stations: stationA and stationB. To facilitate visualization, the data are reduced to two dimensions by PCA. Figure 5 shows the canopy results, in which the number of canopies is 6 and 13 respectively. We also test different
Canopy results.
DBI with different 
Visualization of clustering results for Station A.
Refueling data detail visualization.
We use the visualization method shown in Fig. 7 to visualize the clustering results and select normal data in batches. The visualization shows that clusters 1, 3 and 5 follow a more regular trend, while clusters 2, 4 and 6 are more volatile and contain fewer data. In general, abnormal data are far fewer than normal data, so we select clusters 1, 3 and 5 as the normal data to train the classification model.
We also use the method in Fig. 8 to check whether the data in each cluster is normal. The left half of the figure visualizes the distribution of the data over time in two forms: the ring view represents the continuity of data collection from 0:00 to 24:00 in one day, and the area view represents the trend in the number of records acquired in each time interval. The right half of the figure is a scatter plot view [21] in a Cartesian coordinate system that shows the details of all the refueling records of the station within one day. The x-axis represents the time within the selected period, and the y-axis represents the refueling amount. Analysts can filter the data in both dimensions. In this graph, curves represent relations: red curves connect records of the same vehicle, and black curves connect records of the same driver.
Figure 8a depicts the refueling data of a gas station from 00:00 to 22:33 on a certain day. On the left side of the figure, we can see that there is no refueling activity from 02:00 to 09:00. On the right side, several data points lie at the bottom, which indicates refueling activities with a very small amount of fuel. Such refueling behavior is very unlikely, and the data may be caused by manual input errors. In addition, several data points are connected by lines, which means that these refueling records belong to the same driver or the same vehicle. Multiple refuelings at the same gas station within roughly one day are very unlikely, so these may be abnormal refueling data.
Figure 8b shows the refueling data of a gas station from 16:38 to 21:00 on a certain day. The left side of the figure shows that, on this day, refueling activities occurred at this gas station only in the period selected for detailed display. In the detailed display on the right, the points representing refueling activities almost gather into several lines, and the refueling activities on the same line have the same refueling amount. In addition, many data points are connected, meaning that these refueling activities share the same driver or vehicle. Refueling records with the same amount of fuel for the same vehicle or person may be caused by repeated transmission and storage during data transmission, and are abnormal refueling data.
Visualization of anomaly detection results on test data from Station A. The shaded part is the model training data, that is, the normal data mode, the green point dotted line is the normal data, and the red dotted line is the abnormal data.
Figure 9 demonstrates that the model can detect abnormalities well in complex situations. Comparing the two images, we find that LSTM-VE is better than AE-RS at mining abnormal data with a small number of peaks or troughs. This is because the outlier factor RS used in AE-RS accumulates the anomaly over the whole time series, so it is not sensitive to a single peak, while LSTM-VE uses LSTM to extract the temporal features of the whole normal time series, so it can detect a single peak that does not conform to the normal temporal features. For example, for a time series data
Conclusion
This paper proposes a semi-supervised anomaly detection framework named LSTM-VE, which is based on an LSTM neural network classification model and uses the variance error of the classification probability sequence of normal data as the outlier factor. By using VE, the framework avoids the huge computational costs of methods such as LOF and ABOD. VE also does not require the output layer of the model to have the same dimension as the input layer, a requirement that makes a model too complex when dealing with high-dimensional data. Furthermore, the range of VE is stable and can be estimated in advance, because it depends only on the number of roughly clustered groups, which makes the threshold easy to set. In the experiments, the framework achieves strong results on several public data sets. In terms of practicality, this paper proposes a rough labeling method for labeling normal data in batches in practical applications and then uses LSTM-VE for anomaly detection, which gives satisfactory results in the experiment. The framework is more universal and easier to expand for two reasons: first, the classification model, which is an LSTM neural network in LSTM-VE, can be substituted; second, the fluctuation range of the outlier factor is not related to the data. In addition, the framework provides a practical method for anomaly detection in real-world situations.
The performance of anomaly detection depends on the accuracy of the classification model as well as on the quantity and quality of the normal data. In future research, we will therefore look for a more efficient method to obtain high-quality normal data from a large amount of raw data and build a better classification model.
Acknowledgments
Regional key projects of science and technology service network plan (STS plan) of Chinese Academy of Sciences. West Light Foundation of The Chinese Academy of Sciences No. 2016-QNXZ-A-3, Youth Innovation Promotion Association CAS, No. Y9290802, The Xinjiang Tianshan Excellent Young Scholars, No. Y9390801.
Supplementary Materials
Our source code has been uploaded to GitHub (https://github.com/wangbaoquan520/VE_Anomaly-Detection).
