Abstract
In the era of data technology, data are growing at an unprecedented scale, and business data and information are among the most valuable assets. Massive data analysis now drives nearly every aspect of society and can facilitate informed decision-making by businesses. Fully automated detection of anomalies in data flows plays a crucial role in maintaining data service stability and preventing malicious attacks. This paper presents an extensible and generic real-time monitoring system framework (EGRTMS) for large-scale time-series data. EGRTMS employs a prediction module and an anomaly detection module within an anomaly filtering layer for the accurate identification of anomalies. Moreover, the alarm module and anomaly handling module within an anomaly trace processing layer enable the system to respond swiftly to detected threats. Our solution does not rely on the labelling of anomalies; instead, a prediction module with a deep learning attention-based mechanism learns the normal behaviour of the data, and an anomaly detection module determines the dynamic alarm threshold by utilizing a sliding window. The results of this study demonstrate that our framework significantly outperforms other anomaly detection systems on most real and synthetic datasets.
Introduction
With rapid advances in computing hardware and software, a new technological revolution represented by cloud computing, big data and artificial intelligence is reaching all areas of life [1–3]. Data, as a new source of energy, have revitalized the development of enterprises. Massive user log data are extremely valuable for raising business productivity and enabling breakthroughs in scientific disciplines, and they provide opportunities for significant progress in many fields. Future competition in business productivity and technology will inevitably converge on data exploration. From the perspective of data analysis, a mainstream trend is to provide users with personalised services by modelling and analysing user behaviour log data to increase user stickiness for products. The objective is to enable enterprises to gain a competitive advantage by exploiting the ever-growing amount of data that are collected and stored in corporate databases and files to realize better and faster decision making. Big data increasingly include information that has been provided by various data sources and is of varying reliability. Uncertainty, errors, and missing values are universal. Data quality must be guaranteed to provide reliable support for strategic analysis; thus, continuous monitoring of these data is critical.
By analysing the real-time information that is extracted from the continuous inflow of massive data streams and detecting anomalies when they occur, instant alarms regarding potential threats can be issued. An anomaly, or an outlier, is a data point or a set of consecutive data points that differ significantly from the remaining data. An anomaly may signify a negative change in the system. Examples include a change in the data format by the data engineering team without informing a downstream organization, which could result in incorrect data parsing, and logic errors in the program source code that cause an unexpected fluctuation in user behaviour. An anomaly can also be positive. For example, if existing business revenue has plateaued, the product development strategy team must examine new growth strategies. The innovation of new products is risky because enterprises do not know how consumers will respond to the changes. After a new product is brought to market, if an abnormally large number of daily active users (DAU) visit the new product page, the new product development strategy is regarded as effective. Regardless of whether an anomaly reflects an adverse change or a positive change, enterprises can obtain meaningful feedback from it.
Streaming applications
Most enterprises spend a substantial amount of time managing information about their business elements. Strategic departments are divided according to product services, and such information is often scattered among disparate product lines; hence, aggregating valuable information for decision-making support can be difficult. A business intelligence (BI) system can help enterprises solve this problem. Effective BI systems give decision-makers timely access to quality information, thereby enabling them to accurately identify where the company has been, along with its opportunities and challenges [4]. The core component of a BI system is the data warehouse (DW), which must be constructed. A DW is a collection of decision support information that supports online analytical processing [5, 6]. From the perspective of high-level strategic decision-making, historical, summarized and consolidated data are more important than detailed, individual records. The DW contains consolidated data from various operational databases; hence, it has a larger storage scale. Since data alone cannot directly provide enterprise executives and managers with meaningful information on the performance of their companies, information visualization and visual data mining are critical for the discovery of vital information and for decision making [7]. Visualizations of the data enable analysts to gain insight into data trends and to formulate new hypotheses. However, these benefits depend on reliable data quality. A robust real-time monitoring system can monitor streaming data from operational systems to detect malicious intrusions or critical business events.
In most instances, the collected data are streaming time-series data, which are periodic and seasonal and exhibit trends, so it is difficult to detect anomalies in a timely and accurate manner. In practical scenarios, it is also difficult to obtain large amounts of labelled data. Additionally, abnormal instances are far less frequent than normal instances; hence, the class imbalance problem affects the performance of the classifier, which limits the applicability of traditional supervised anomaly detection technologies. Although many unsupervised methods can solve the class imbalance problem, they cannot capture the intrinsic characteristics of streaming time-series data.
To address this problem, this paper presents EGRTMS, a deep-learning-based extensible generic real-time monitoring system framework for streaming data. EGRTMS can detect data flow anomalies at an early stage and plays vital roles in maintaining user data consistency, protecting enterprises from malicious attacks, and grasping the direction of industry hotspots. EGRTMS consists of four main components: a prediction module (PM), an anomaly detection module (ADM), an alarm module (AM), and an anomaly handling module (AHM). Note that this paper mainly focuses on the former two components. The approach detects anomalies without labels: it uses the proposed C-transformer as the prediction module to learn the normal behaviour of the data and to predict the values at the next window timestamp. Subsequently, the predicted values are fed to the anomaly detection module, which utilizes a sliding window to calculate a dynamic threshold that is fitted to each time series. If an anomaly is present, it is passed to the alarm module and the anomaly handling module. In tests on publicly available anomaly detection datasets, this approach outperforms other advanced anomaly detection methods in most of the considered cases.
The main contributions of this article are as follows. EGRTMS is a deep-learning-based comprehensive anomaly detection framework that is flexible, accurate and scalable, and it enables users to add their own models to any of the components. EGRTMS implements a complete closed loop of the intelligent monitoring process. The EGRTMS framework discovers anomalies without the need for labels while maintaining a low rate of false positives in an unsupervised setting, so the approach can be directly applied to practical scenarios. To the best of our knowledge, EGRTMS is the first such approach to be based on a deep learning attention mechanism. Unlike the family of RNN-based sequence models, the C-transformer-based EGRTMS can capture long-range dependencies in time series in the forward pass and can be computed in parallel; it more easily learns strong periodic patterns that correspond to daily or weekly human activities, and it reduces the training time of the model. Instead of setting a fixed threshold, the framework generates dynamic thresholds by utilizing a sliding window and can automatically adapt to new time series. Experimental results demonstrate that our framework outperforms the other anomaly detection methods on the Yahoo Webscope Benchmark and the Numenta Anomaly Benchmark.
Literature review
Anomalies are typically categorized as point anomalies, contextual anomalies, and collective anomalies. The vast number of detection techniques can be divided into supervised, semi-supervised, and unsupervised techniques. In previous studies, many distance-based anomaly detection methods for time series have been proposed. K-nearest neighbours (K-NN) utilizes a distance measure to rank each point on the basis of its distance to its k-th nearest neighbour and declares the top n points in this ranking to be outliers [8]. The local outlier factor (LOF) is an unsupervised technique for local density-based anomaly detection, which detects outliers according to a numerical score that indicates how isolated an object is from its surrounding neighbourhood [9]. However, LOF may miss outliers that lie close to a low-density pattern of non-outliers. The connectivity-based outlier factor (COF) can improve the performance if a pattern has a neighbourhood density similar to that of an outlier [10]. Lee et al. integrated independent component analysis (ICA) and the local outlier factor for plant-wide process monitoring [11]. The ICA transformation is conducted, and the control limit of the LOF value is obtained based on the dataset. Then, the LOF value of the current observation is computed in the monitoring phase. The observation is judged to be a fault if its LOF value exceeds the control limit.
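The K-NN ranking described above can be illustrated with a minimal sketch. It assumes Euclidean distance and a brute-force pairwise computation; the function names and the toy data are hypothetical and are not taken from [8].

```python
import numpy as np

def knn_outlier_scores(points: np.ndarray, k: int) -> np.ndarray:
    """Return, for every point, the distance to its k-th nearest neighbour."""
    # Pairwise Euclidean distances, shape (n, n).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the zero distance of a point to
    # itself, so column k holds the distance to the k-th nearest neighbour.
    return np.sort(dists, axis=1)[:, k]

def top_n_outliers(points: np.ndarray, k: int, n: int) -> np.ndarray:
    """Indices of the n points with the largest k-th-neighbour distance."""
    scores = knn_outlier_scores(points, k)
    return np.argsort(scores)[::-1][:n]

# Toy data (assumed): a tight cluster plus one isolated point at index 5.
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                       [0.1, 0.1], [0.05, 0.05], [5.0, 5.0]])
outlier_indices = top_n_outliers(toy_points, k=2, n=1)
```

The brute-force distance matrix costs O(n^2) memory, so a streaming deployment would need an index structure or a bounded window instead.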
Principal component analysis (PCA) was proposed for traffic anomaly detection in [12]; it is based on a separation of the occupied high-dimensional space into disjoint subspaces that correspond to normal and anomalous network conditions. However, PCA is highly sensitive to its parameter settings. D. Brauckhoff et al. proposed a method [13] to overcome the failure of PCA to capture temporal correlation. This method correlates the data across different metrics instead of correlating the data across different spatial measurement points, and the anomaly detection results are significantly improved.
The autoregressive integrated moving average (ARIMA) model is a statistical regression model [14] that combines three components: (i) the autoregressive (AR) part uses the dependent relationship between an observation and a specified number of lagged values; (ii) the integrated (I) part differences the raw values to make the data stationary; (iii) the moving average (MA) part uses the dependency between an observation and the residual errors of a moving average model applied to lagged observations. The ARIMA model is widely used in time series forecasting and anomaly detection. Reference [15] simulates network traffic by using the ARIMA model, and the nonstationary behaviour and the outliers, with trends and attacks, can be detected in the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots.
Neural network approaches have realized substantial success in a wide range of domains, such as image analysis, natural language processing, and time series analysis. Deep learning architectures have realized superior performance in feature and representation learning from massive data. Long short-term memory (LSTM) networks can solve time-series anomaly detection tasks. Malhotra et al. proposed stacked LSTM networks that are trained on non-anomalous data for prediction over the time steps; then, they computed the prediction error distribution. The error vectors are modelled to fit a multivariate Gaussian distribution, and the anomalies are identified based on the error threshold [16]. An LSTM-based encoder-decoder scheme for anomaly detection in multi-sensor time series was proposed in [17]. The encoder learns a vector representation of the input time series, and the decoder decodes it to reconstruct the time series. The approach uses only normal time series during training and learns to reconstruct them. Hence, an anomalous sequence leads to higher reconstruction errors than a normal sequence. The convolutional neural network (CNN) has become one of the most influential networks in the field of computer vision, and we regard it as a superior black box for feature extraction. C-LSTM was proposed as an anomaly detection method in [18]. The CNN layer is used to reduce the frequency variation in the spatial information. The output of this CNN layer is used as the input for several LSTM layers to reduce temporal variations. Combining models in this way is a simple and effective method for improving performance.
System architecture
Within a company, big data architecture is the foundation for data monitoring and analytics. The overall integrated architecture of EGRTMS is illustrated in Fig. 1, and the monitoring implementation process is detailed in the following subsections.

Integrated proposed EGRTMS architecture.
With massive data, the volume of captured information is large and the data formats vary widely. The data source layer is where the data arrive at the organization. Data from various heterogeneous sources are obtained and loaded into the system for further analysis. The user log files account for a large proportion of the raw data sources, and they include a substantial amount of user behaviour information. The relational database management system (RDBMS) stores internal business data. Another type of data can be extracted from external sources through an interface that is provided by a partner, such as HTTP or FTP data.
Data integration and storage
The integration and storage layer is important because the raw source data may not be directly consumable. Hence, the data can be processed before being loaded into the next layer, for example with extract-transform-load (ETL) technologies. ETL consists of three main processes: extraction, transformation, and loading. Extraction is the process of collecting the relevant data from the data sources, as described above. Transformation is the process of converting the data into a consistent format according to the business rules that are used for analytical processing. The final step is the loading of the data from the staging area into a target repository. Behavioural data that have been collected from heterogeneous sources are integrated, converted into a structured format and loaded into the data warehouse. The data undergo sorting, consolidating, and summarizing to render them more coordinated and easier to use.
The operational data store (ODS) isolation layer is designed according to the operational requirements of business process queries, whereas a data warehouse is typically used for complex queries against summary-level or aggregated data. The ODS layer can alleviate the pressure of access to the DW and provides a snapshot of the latest data spanning multiple transactional systems for business reporting. Typically, the data in the ODS are structured similarly to the source systems, although during integration, the data can be cleaned and normalized and business rules can be applied to ensure data integrity.
Because the departments of an enterprise conduct different business, their demand priorities differ. Separate business units can build their own data marts (DM) based on their unique requirements. A DM is the access layer of the data warehouse and can be created from an existing data warehouse. A DM is a subset of the data warehouse that is typically oriented to a specified business line or team. A multidimensional data model that is constructed based on a data mart is subject-oriented. It contains only the data that are applicable to a specified business area, so it is a cost-effective method for gaining significant information quickly.
Data monitoring
The main steps of the real-time workflow are shown in Fig. 2. First, the structure of the indicator data is normalized and unified according to the training data format of the prediction model. Data are stored in bulk on the distributed file system. The PM is trained on historical data, and the resulting models are stored in the server cluster. Then, crontab timing tasks call the ADM to evaluate incoming data points based on the prediction models stored in the server cluster. Based on the rules, if the anomaly is an alert event, it is stored in an anomaly status database. The AM applies configuration rules to send the alert to the suitable support staff in various ways, such as email, cell phone messages, and internal communication messages. The responsible teams include the data team and the product strategy team. The AHM automatically establishes an alarm information tracking card by invoking the application programming interface (API) of the project management platform. The AHM supports real-time tracking, retrospective analysis, and continuous improvement; at the same time, it effectively connects personnel with various responsibilities, such as R&D personnel, testers, and the product manager.

Real-time monitoring service.
The application layer provides access to massive data through a user-friendly interface with a capacity for rapid calculation, which enables executives to manage organizational performance more efficiently and to observe the current status of the business by conducting further analysis at a suitable level of detail based on the results of the monitoring indicators. The experimenter conducts experimental analysis through HQL queries and presents the results in the form of business reports. By carefully analysing the data characteristics, researchers can then employ the corresponding analytical methods to assess the intended impact. Decision-makers can formulate overall business objectives for effectively tracking and measuring user behaviour.
Prediction module
The prediction module occupies a central position in the real-time monitoring system. Traditional prediction methods include time-series analysis methods, multi-linear regression, and the autoregressive integrated moving average model. Most of these methods are linear models, and their prediction errors are large.
In recent years, deep learning algorithms have been used to mine the regularities in historical data, with strong prediction performance. A recurrent neural network (RNN) is a sequence-based model that has a short-term memory advantage [19, 20]. Therefore, the RNN is the preferred neural network for training on time-series data. The standard RNN structure suffers from the vanishing gradient problem and the exploding gradient problem; hence, it cannot effectively use long-sequence historical information. Reference [21] compares ARIMA with the Seq2Seq neural network, and the latter shows superior performance. However, the predictive performance degrades as the sequence length increases. The attention mechanism overcomes the problem of Seq2Seq that the encoder compresses all input information into a fixed-length vector [22], but the model takes a long time to train. This paper proposes a new method for addressing this issue that is based entirely on the attention mechanism.
Preliminaries
In this section, we briefly describe other essential methods that are compared with the proposed prediction component of the EGRTMS: the RNN network, the Seq2Seq network, and the typical attention mechanism network.
Recurrent neural networks
Given a sequence of information {x_1, x_2, …, x_t}, where
The Seq2Seq model, which typically consists of an encoder and a decoder, uses the GRU gating unit. Given a sequence of input information from an initial time point to some end time point t, the encoder generates a fixed-size context vector that contains all the information that has been obtained from the input sequence, and this vector is transmitted to the decoder. The decoder generates predicted values that follow the encoded input, from time t + 1 to an end point in the future, which can be pre-defined.
Typical attention mechanism
The attention mechanism is typically applied to the Seq2Seq model [23, 24]. It overcomes the problem that the encoder can only send a fixed-size context vector regardless of the size of the input sequence. The encoder hidden states are H = {h_1, h_2, …, h_{T_x}}, and the decoder hidden states are S = {s_1, s_2, …, s_{T_x}}. The weight α_{ij} of each annotation h_j is computed as follows:
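As an illustration of how such weights are commonly obtained, the following minimal sketch normalises alignment scores between the previous decoder state and each encoder state with a softmax. The dot-product scoring function, the array sizes and the function names are assumptions made for this sketch and may differ from the formulation in [23, 24].

```python
import numpy as np

def attention_weights(decoder_state: np.ndarray,
                      encoder_states: np.ndarray) -> np.ndarray:
    """Softmax-normalised weights alpha_ij over the encoder states h_j."""
    # Alignment scores e_ij; a plain dot product is assumed here.
    scores = encoder_states @ decoder_state
    # Subtract the maximum before exponentiating for numerical stability.
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

def context_vector(decoder_state: np.ndarray,
                   encoder_states: np.ndarray) -> np.ndarray:
    """Weighted sum of the encoder hidden states."""
    alpha = attention_weights(decoder_state, encoder_states)
    return alpha @ encoder_states

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))    # T_x = 6 encoder states of size 4 (assumed)
s_prev = rng.normal(size=4)    # previous decoder state (assumed)
alpha = attention_weights(s_prev, H)
c = context_vector(s_prev, H)
```

Because the context vector is recomputed for every decoder step, its content is no longer limited to a single fixed summary of the input sequence.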
Previous work focused mainly on changing the network architecture by applying typical attention mechanisms to RNNs. Due to the inherently sequential nature of RNN-based models, it is difficult to model sequences with very long-term correlations and to compute in parallel. In this paper, we propose a C-transformer model, as illustrated in Fig. 3, which discards the traditional recurrent neural network and relies entirely on an attention mechanism. Unlike the RNN-based methods, C-transformer can access any part of the history regardless of distance, which makes it more suitable for capturing repeating patterns with long-term dependencies.

C-transformer model.
There is a collection of observed N related univariate time series
Convolutional layer
Convolutional neural network models have realized outstanding results in image recognition, where they successfully extract local and shift-invariant features from input images. Accordingly, the first layer of C-transformer employs a 1-D convolutional neural network (CNN), which aims to capture local dependencies among variables. First, the constructed multi-dimensional feature vectors are forwarded to the convolutional layer. Then, spatial features are extracted from the preprocessed window via convolution and pooling operations. A vector of length n, where n represents the input feature dimension, is obtained, which is expressed as:
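The convolution and pooling operations can be sketched as follows. This is a minimal NumPy illustration, assuming no padding, stride 1, a ReLU activation and max pooling over time; the shapes and names are hypothetical and do not reproduce the exact layer configuration of C-transformer.

```python
import numpy as np

def conv1d_valid(window: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """1-D convolution along time (no padding, stride 1) followed by ReLU.

    window:  (T, d)     T time steps, d input features
    kernels: (m, w, d)  m filters of width w
    returns: (T - w + 1, m)
    """
    T, _ = window.shape
    m, w, _ = kernels.shape
    out = np.empty((T - w + 1, m))
    for t in range(T - w + 1):
        patch = window[t:t + w]                       # (w, d)
        # Cross-correlation, as in most deep learning frameworks.
        out[t] = (kernels * patch).sum(axis=(1, 2))   # one value per filter
    return np.maximum(out, 0.0)

def max_pool_time(features: np.ndarray) -> np.ndarray:
    """Max pooling over the time axis, one value per filter."""
    return features.max(axis=0)

rng = np.random.default_rng(0)
window = rng.normal(size=(8, 3))       # 8 time steps, 3 features (assumed)
kernels = rng.normal(size=(4, 3, 3))   # 4 filters of width 3 (assumed)
features = conv1d_valid(window, kernels)
pooled = max_pool_time(features)
```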
These extracted features are passed to the stacked layers. Each layer includes two key sublayers: a self-attention mechanism and a fully connected feed-forward network. First, the inputs flow through the self-attention layer. The self-attention layer looks at the actual values of the indicators at different times of the day and determines how much attention is paid to each of them at the current moment. Then, the outputs are fed to the feed-forward network. We use a residual connection around each of the two sublayers, which is followed by a normalization layer.
Attention mechanism
An attention mechanism can be described as mapping a query and a set of key-value pairs, which are created from each of the layer's input vectors, to an output. The attention is calculated as a score, which determines how much focus is placed on other parts of the input as we encode a value at a specified position. The query matrix Q, key matrix K, and value matrix V are created by multiplying the embedding by three matrices that are learned during the training process. The attention function of the outputs is expressed as:
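For illustration, the widely used scaled dot-product form of this function, softmax(Q K^T / sqrt(d_k)) V, can be sketched as follows. It is assumed here that the attention function takes this standard form; the matrix sizes and the random projection matrices are hypothetical stand-ins for trained parameters.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the attention weight matrix."""
    d_k = Q.shape[-1]
    # Each row of `scores` compares one query with every key.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # 5 positions, embedding size 8 (assumed)
# Random stand-ins for the three trained projection matrices.
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Each row of `weights` sums to one, so every output position is a convex combination of the value vectors.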
In our experiments, multi-head (MH) attention is utilized instead of a single attention mechanism. The queries, keys and values have dimension d_model. This mechanism expands the model's ability to focus on various positions. With multi-head attention, not one but multiple sets of Q, K, and V weight matrices are calculated, and each of these sets is randomly initialized. It is advantageous to linearly project the queries, keys, and values h times to d_k, d_k and d_v dimensions, respectively. We concatenate all the attention heads and multiply them by an additional weight matrix W^O that is trained jointly with the model, as follows:
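The projection, per-head attention, concatenation and output projection can be sketched as follows. The value h = 8 follows the experimental setup reported later in the paper; d_model = 64, d_k = d_v = d_model / h, the scaled dot-product attention inside each head and the random weights are assumptions made for this sketch.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (T, d_model); W_q, W_k: (h, d_model, d_k);
    W_v: (h, d_model, d_v); W_o: (h * d_v, d_model)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        # Project the inputs into the subspace of this head.
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(weights @ V)
    # Concatenate the heads and apply the output projection W^O.
    return np.concatenate(heads, axis=-1) @ W_o

h, d_model = 8, 64             # d_model is an assumed value
d_k = d_v = d_model // h
rng = np.random.default_rng(0)
X = rng.normal(size=(10, d_model))
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_v))
W_o = rng.normal(size=(h * d_v, d_model))
mh_output = multi_head_attention(X, W_q, W_k, W_v, W_o)
```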
Since the model described so far contains no sequence-order information, we inject information that represents the relative positions of the tokens in the sequence. We concatenate the extracted feature vectors and the learned positional encodings to form the final input vectors.
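A minimal sketch of this concatenation is given below. The sequence length, the feature and encoding sizes, and the random initialisation of the positional table are illustrative assumptions; in a trained model the table would be updated together with the other parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_feat, d_pos = 96, 32, 8    # illustrative sizes, not from the paper

# Stand-in for the features produced by the convolutional layer.
features = rng.normal(size=(seq_len, d_feat))
# One encoding vector per position; trainable in a real model.
pos_table = rng.normal(scale=0.02, size=(seq_len, d_pos))

def add_positions(features: np.ndarray, pos_table: np.ndarray) -> np.ndarray:
    """Concatenate each feature vector with the encoding of its position."""
    return np.concatenate([features, pos_table[: len(features)]], axis=-1)

model_input = add_positions(features, pos_table)
```

Concatenation keeps the feature values unchanged and widens the input, in contrast to the additive positional encodings that are also common in attention models.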
Experiments and results
Dataset description
In this section, to evaluate the effectiveness and generalization performance of the proposed model, we compare the C-transformer model with other predictive models on the publicly available PJM electricity datasets and the Baidu user search behaviour datasets. The collected data often exhibit strong periodic patterns that reflect daily or weekly human activities: a daily pattern (every 24 hours) and a weekly pattern (every 7 days).
Electricity: the PJM electricity data cover the eastern part of the United States. The hourly power consumption data, in megawatts (MW), come from PJM's website and were recorded every hour from 2015 to 2017.
The user search behaviour datasets consist of the search volume (SV) and the recommended search click-through rate (CTR). SV: the search volume is a measure of the total number of searches that are conducted through mobile and desktop search engines. CTR: in this case, the click-through rate is the total number of clicks that the recommended search receives divided by the SV.
These datasets consist of real-world data that were collected from November 2016 to May 2017 and from January to July in 2016, respectively, which contain both linear and non-linear interdependencies. Each data file contains timestamps and value attributes. The lengths of the time series are 17471 and 17531, and periodic timestamps appear at a consistent frequency of every 15 minutes.
Feature enhancement
For all datasets, we can incorporate suitable prior knowledge to accelerate the convergence of the model and improve the performance when building a neural network model. This paper uses the prior knowledge of corresponding user behaviour as input. According to user behaviour influence factors, we construct a multi-feature vector:
U(t) = U_n(t) ⊕ U_d(t) ⊕ U_w(t) ⊕ U_s(t)
where ⊕ is the concatenation operator. U(t) denotes the multi-feature vector at time t. U_n(t) represents the normal value of the user behaviour data at time t and follows a normal distribution. U_d(t) denotes the date and time factors, such as the day of the week and the hour of the day. U_w(t) corresponds to the weather-sensitive part of the user behaviour data at time t, which is tightly coupled to the season of the year. U_s(t) represents special events, such as holidays, festivals, and elections.
We consider several factors that can affect user behaviour and construct multidimensional input feature vectors.
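The construction of such a vector can be sketched as follows. The one-hot encoding of the date and time factors, the single weather value and the binary special-event flag are assumptions chosen for illustration; the paper does not specify how each component is encoded.

```python
import numpy as np

def one_hot(index: int, size: int) -> np.ndarray:
    v = np.zeros(size)
    v[index] = 1.0
    return v

def build_feature_vector(value, day_of_week, hour, weather, is_special_event):
    """Concatenate the four components U_n, U_d, U_w and U_s."""
    u_n = np.array([value])                                  # observed value
    u_d = np.concatenate([one_hot(day_of_week, 7),           # day of the week
                          one_hot(hour, 24)])                # hour of the day
    u_w = np.asarray(weather, dtype=float)                   # weather factors
    u_s = np.array([1.0 if is_special_event else 0.0])       # holiday flag
    return np.concatenate([u_n, u_d, u_w, u_s])

# Hypothetical sample: Wednesday, 21:00, 18.5 degrees, no special event.
u_t = build_feature_vector(value=0.42, day_of_week=2, hour=21,
                           weather=[18.5], is_special_event=False)
```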
For all datasets, we use the bidirectional gated recurrent unit (Bi-GRU) as the basic unit for the traditional RNN-based sequence structure. In the training task, each time series in all datasets has been split into training (60%), validation (20%), and test (20%) sets. The training set is used to train our model, the validation set is used to fine-tune the hyperparameters, and the test set is used to evaluate the performance of the model. The mean squared error (MSE), which is defined in (Eq. 14), is used as the loss function in our model, and the learning rate is set to 10^-3. By reducing the error between the actual value y_i and the predicted value
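The data split and the loss function described in this paragraph can be sketched as follows. The split is assumed to be chronological, which the text does not state explicitly, and the MSE follows its standard definition as the mean of the squared differences between actual and predicted values; the series used here is a placeholder.

```python
import numpy as np

def chronological_split(series, train_frac=0.6, val_frac=0.2):
    """Split a series into training, validation and test parts, in time order."""
    n = len(series)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return series[:train_end], series[train_end:val_end], series[val_end:]

def mse(y_true, y_pred) -> float:
    """Mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Placeholder series with the length of one of the Baidu datasets.
series = np.arange(17471, dtype=float)
train, val, test = chronological_split(series)
example_loss = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

A chronological split avoids leaking future observations into the training set, which matters for time-series forecasting.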
For the PJM electricity datasets, we conduct a grid search over all tunable hyperparameters on the validation set for each method. For ARIMA(p, d, q), we build the different models from p ∈ {0, 1, ..., 4}, d ∈ {1, 2, 3} and q ∈ {0, 1, 2}. The RBF kernel is used for the SVR model, and the SVR parameters C and γ are varied over {2^-8, 2^-7, …, 2^8}. For XGBoost, max_depth is varied from 5 to 15 in increments of 1 and min_child_weight is varied from 1 to 6 in increments of 1. For the neural network models, the hidden dimensions of the recurrent and convolutional layers are chosen from {64, 128, 200} and {20, 50, 100}, respectively, and the filter length is chosen from {3, 4, 5}. For C-transformer, we employ h = 8 parallel attention layers with 64 neurons.
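The search spaces listed above can be enumerated with a short sketch. The grids follow the values stated in the text, whereas the scoring function is a hypothetical stand-in for fitting a model on the training set and measuring its validation error.

```python
from itertools import product

# Candidate settings taken from the ranges stated in the text.
arima_grid = list(product(range(0, 5), (1, 2, 3), (0, 1, 2)))         # (p, d, q)
svr_grid = list(product([2.0 ** e for e in range(-8, 9)], repeat=2))  # (C, gamma)
xgb_grid = list(product(range(5, 16), range(1, 7)))  # (max_depth, min_child_weight)

def grid_search(candidates, validation_error):
    """Return the candidate with the lowest validation error."""
    return min(candidates, key=validation_error)

def toy_validation_error(order):
    """Hypothetical score; a real run would fit and validate a model here."""
    p, d, q = order
    return (p - 2) ** 2 + (d - 1) ** 2 + (q - 1) ** 2

best_order = grid_search(arima_grid, toy_validation_error)
```

Under these ranges the ARIMA, SVR and XGBoost grids contain 45, 289 and 66 candidates, respectively.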
For the Baidu user search behaviour datasets, we utilize the following architectures: (i) Bi-LSTM and Bi-GRU: the number of hidden units is chosen from {64, 128}; (ii) Seq2Seq: a double-layer encoder and decoder with 128 and 64 neurons, respectively; (iii) the typical attention mechanism: an encoder and an attention decoder with 128 neurons and two stacked recurrent cells; (iv) C-transformer: 8 heads with 64 neurons, filter windows of size 3, 4, and 5 with 128, 64, and 32 feature maps each, and fully connected layers with 64 and 128 neurons; and (v) ARIMA: we build the different models from p = {0, 1}, d = {1}, and q = {1, 2}.
To measure the effectiveness of the various methods on the PJM electricity datasets, we choose the best model on the validation set and report RMSE, MAE, and R-Square on the test set. A lower value is better for RMSE and MAE, while a higher value is better for R-Square. We set the look-back window q = {48, 96, 168} and the horizon h = {1, 12, 24}. Due to space limitations, we report results only for q = {96, 168} and h = {1}, that is, look-back windows of 4 days and 7 days.
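The three metrics and the pairing of look-back windows with forecast targets can be sketched as follows. The metric definitions are the standard ones; the windowing helper and the sample values are illustrative assumptions rather than the authors' preprocessing code.

```python
import numpy as np

def rmse(y, y_hat) -> float:
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat) -> float:
    return float(np.mean(np.abs(y - y_hat)))

def r_square(y, y_hat) -> float:
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return float(1.0 - ss_res / ss_tot)

def make_windows(series, lookback, horizon):
    """Pair each look-back window with the value `horizon` steps after it."""
    X, y = [], []
    for start in range(len(series) - lookback - horizon + 1):
        X.append(series[start:start + lookback])
        y.append(series[start + lookback + horizon - 1])
    return np.array(X), np.array(y)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
scores = (rmse(y_true, y_pred), mae(y_true, y_pred), r_square(y_true, y_pred))

# Hourly data with a 96-step look-back window (4 days) and horizon 1.
X_win, y_win = make_windows(np.arange(200.0), lookback=96, horizon=1)
```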
Table 1 summarizes the evaluation results of all the methods, with the best result highlighted in boldface. For the RNN-based models, the larger the window, the harder the prediction task: the prediction results become worse under larger windows due to the vanishing gradient problem. The proposed model consistently improves over the state of the art on the datasets, especially in the settings with a large history window. C-transformer employs a multi-head attention mechanism. The different attention heads not only learn to perform different tasks, but many also appear to select relevant information across multiple time steps. In addition, Fig. 5 plots the actual values against the predicted values of a time series. Ideally, when the actual and predicted values are close to each other, the points form a smooth diagonal line. In this visualized result, no data point lies far from the diagonal line, which indicates that C-transformer captures the patterns in the training set and fits the test set well.
Errors measured for different models on electricity dataset

Sample time series from user search behavior test datasets.

The actual time series values are plotted against the time series predictions to show the accuracy of C-Transformer on electricity testing set.
On the Baidu datasets, the following evaluation metrics are considered: RMSE, MAE, MAPE, SMAPE, and R-Square. We choose the best model on the validation set and report these metrics on the test set. The numerical results are tabulated in Table 2. According to the results, the proposed C-transformer outperforms the other methods. In Fig. 4, the blue lines represent the actual streaming data, while the red lines represent the predicted values. Plots (a) and (b) show time-series samples from the user SV dataset. The search volume peaks at 11 am, 4 pm, and 9 pm, and a significant drop is observed late at night. Plots (c) and (d) show user behaviour results on the recommended search CTR dataset. Late at night, people prefer to click on recommended search information instead of actively searching. Since the proposed C-transformer model has an attention mechanism, it more easily learns strong periodic patterns, which correspond to daily or weekly human activities.
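The two percentage-based metrics can be sketched as follows. MAPE follows its usual definition; SMAPE has several variants in the literature, and the form with the factor 2 in the numerator is assumed here, which may differ from the one used for Table 2. The sample values are hypothetical.

```python
import numpy as np

def mape(y, y_hat) -> float:
    """Mean absolute percentage error, in percent (requires y != 0)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(100.0 * np.mean(np.abs((y - y_hat) / y)))

def smape(y, y_hat) -> float:
    """Symmetric MAPE, in percent (assumed variant, bounded by 200)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(100.0 * np.mean(2.0 * np.abs(y - y_hat)
                                 / (np.abs(y) + np.abs(y_hat))))

actual = np.array([100.0, 200.0])
predicted = np.array([110.0, 180.0])
mape_value = mape(actual, predicted)
smape_value = smape(actual, predicted)
```

MAPE is undefined when an actual value is zero, which is one reason why SMAPE is often reported alongside it for series such as CTR that can approach zero.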
Errors measured for different models on Baidu datasets
The objective of EGRTMS is to produce accurate and timely alerts. The predictions are passed to ADM, and it can classify each time-stamp as normal or abnormal based on the forecasted and the actual time-series data. The method is discussed in detail in this section.
Dynamic threshold selection
The selected threshold significantly affects the evaluation results. Instead of using a fixed threshold, the alert threshold follows the trend of the actual values through the calculated deviation metrics. When the forecasted time series are output, a dynamic threshold is determined as follows: for each predicted value, a queue is maintained that represents the sliding window of the previous discrepancies, as expressed in (Eq. 15). Min-max scaling is applied to the past discrepancies in the sliding window. Assuming a Gaussian distribution, the well-known 68-95-99.7 rule, which is also referred to as the 3-sigma rule, states that 99.73% of all samples lie within three standard deviations of the mean. We use K times the standard deviation of the errors as a dynamic threshold, where K is a parameter that must be selected for each benchmark. The main steps are described in Algorithm 1.
Subsequently, if the scaled absolute difference between the actual and predicted values exceeds K times the standard deviation of the previous scaled errors, the ADM labels the instance as anomalous. Therefore, the previous discrepancies and the threshold are dynamic and can adapt to new values.
scaler ← MinMaxScaler(feature_range=(0,1));
scaledErrors ← scaler.fit_transform(rootSquaredErrors[-slidingWindow:]);
dynamic_threshold ← K · numpy.std(scaledErrors);
crtError ← sqrt((actualValue − predictedValue)^2);
scaled_crtError ← scaler.transform(crtError);
if scaled_crtError > dynamic_threshold then
    return True.
else
    rootSquaredErrors ← rootSquaredErrors.put(crtError);
    return False.
end if
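The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading rather than the authors' implementation: the function name `is_anomaly`, its arguments, and the sample history are hypothetical, plain NumPy replaces the scaler object, and the current error is assumed to be scaled with the minimum and maximum of the sliding window. The history is assumed to be seeded with at least one past error.

```python
import numpy as np

def is_anomaly(errors, actual, predicted, window=70, k=3.0):
    """Check one time stamp against a dynamic threshold.

    errors: list of past errors of points judged normal (mutated in place).
    """
    recent = np.asarray(errors[-window:], dtype=float)
    lo, hi = recent.min(), recent.max()
    span = (hi - lo) if hi > lo else 1.0    # guard against a constant window
    scaled_errors = (recent - lo) / span    # min-max scaling to [0, 1]
    threshold = k * scaled_errors.std()     # K times the standard deviation
    crt_error = abs(actual - predicted)     # equals sqrt((actual - predicted)**2)
    scaled_crt = (crt_error - lo) / span    # reuse the scaling of the window
    if scaled_crt > threshold:
        return True                         # anomalous: history is left unchanged
    errors.append(crt_error)                # normal: the window slides forward
    return False

# Hypothetical history of past errors, followed by a normal point and a spike.
history = [1.0, 2.0, 1.5, 2.5, 1.0, 2.0]
normal = is_anomaly(history, actual=10.0, predicted=10.5)
spike = is_anomaly(history, actual=10.0, predicted=20.0)
```

Only errors of points judged normal are appended in this sketch, so a detected anomaly does not inflate the threshold that is applied to the following time stamps.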
We use two public datasets to verify the effectiveness of our monitoring framework.
The Yahoo Webscope Benchmark is an open-source time-series anomaly detection benchmark [26]. The datasets consist of 367 real and synthetic time series with anomaly labels that are suitable for testing various anomaly types, such as outliers and change points. This benchmark is divided into four sub-benchmarks. A1 and A2 are based on the real production traffic to Yahoo properties and on synthetic time series, respectively. They contain 67 and 100 data files, respectively, and each data file contains approximately 1400 data points with labelled anomalies. The timestamps of the A1 benchmark are replaced by integers with an increment of 1, where each data point represents 1 hour of data. A3 and A4 are based on synthetic time series. Each of these benchmarks includes 100 data files, and each data file contains approximately 1600 data points. In each synthetic data file, the outliers are inserted at random positions. The synthetic time series exhibit trends, noise, and seasonality. The A2 benchmark includes only Boolean labels of outliers, while the A4 benchmark includes Boolean labels of change-point anomalies.
The Numenta Anomaly Benchmark (NAB) is a publicly available streaming anomaly detection benchmark [27] that spans multiple domains. Each data file contains timestamps and scalar values. The anomaly labels of each data file are given in a separate set of JSON files. The NAB datasets consist of 58 data files, which contain both real and artificial time series. The data files range in length from 1000 to 22,000 data points, for a total of 365,558 data points.
Sample time series from the NAB datasets are shown in Fig. 6. The actual values are shown in blue, and each anomaly window is delimited by two dotted lines. Although a window contains only a small number of anomalous data points (1-2), the whole window is marked as anomalous. N. Singh et al. illustrated several challenges in the NAB datasets [28]. The NAB framework provides a scoring system that is designed to reward early anomaly detection; in most applications, however, it is important to detect the correct number of anomalies.

(a) Exchange-3_cpm_results, (b) Ec2_cpu_utilization, (c) Ec2_request_latency_system_failure, (d) Occupancy_t4013. Snippets of NAB time series from multiple domains are plotted.
To evaluate the effectiveness of the proposed framework, we compare EGRTMS with other anomaly detection methods in terms of the F-score (Eq. 16). As discussed in Section 4, EGRTMS uses C-transformer as its predictor.
We compare EGRTMS with the following anomaly detection models on the Yahoo datasets.
EGADS: This anomaly detection framework was proposed by N. Laptev et al. (2015) [26]. EGADS uses a collection of time-series modelling modules (TMM) and anomaly detection modules (ADM) with an anomaly filtering layer for anomaly detection. We use the Olympic model in TMM and the density-based ExtremeLowDensityModel in ADM.
DeepAD: T. S. Buda et al. (2018) [29] proposed this model integration framework, which leverages a plethora of time-series forecasting models to detect anomalies more accurately. In the merging prediction (MP) phase, we adopt a single-step merging strategy that combines the outputs of multiple models to obtain a more accurate forecast for a single dataset.
DeepAnT: M. Munir et al. (2018) [30] presented this deep-learning-based anomaly detection approach, which utilizes a deep convolutional neural network (CNN) as a predictor to predict the next time stamp on the defined horizon. DeepAnT can be trained on a relatively small dataset and still generalize satisfactorily because of the effective parameter sharing of the CNN.
For comparisons with other anomaly detection approaches, the default values of all parameters are used. According to Fig. 7, the following combinations of the parameter k and the sliding window size w yield the best F-score on the four benchmarks: for the A1 and A2 benchmarks, w = 70 and k = 3 yield the highest average F-score, whereas for the A3 benchmark, w = 70 and k = 4 yield the highest score. Finally, for the A4 benchmark, w = 24 and k = 3 yield the highest score.

The average F-score of each Yahoo sub-benchmark is plotted for each sliding window size and for different values of the parameter k.
In this section, we apply several time-series anomaly detection methods to 8 NAB time series from multiple domains. The anomaly detection methods include ContextOSE [31], HTM (named Numenta and NumentaTM in Table 4) [27], Skyline [31], and AdVec [32]. We use the same parameter settings as in [31] because, to the best of our knowledge, they are the optimal parameters for the NAB datasets. In this work, we report the evaluation results of each anomaly detection algorithm in terms of standard metrics, namely precision and recall, instead of the NAB score, because we want to pay more attention to the number of detected anomalies.
Results
For Yahoo datasets, the average F-score per sub-benchmark is reported for each method. Table 3 compares EGRTMS with EGADS, DeepAD, CNN, and LSTM (DeepAnT using CNN and LSTM as predictors) on the whole dataset. The bold F-score is the best score on the corresponding Yahoo sub-benchmark. We observe that EGADS outperforms other methods on three sub-benchmarks (A1-A3) according to the metric. However, DeepAD realizes a higher score on the A4 benchmark. Typically, each time series has unique characteristics. We conclude that the single-step merge strategy of the combined multiple models is more suitable for the Yahoo dataset with change-point anomalies.
Average F-Score of Yahoo EGADS, DeepAD, DeepAnT, and EGRTMS on Yahoo sub-benchmark
For the NAB datasets, the precision and recall of each anomaly detection method are reported in Table 4. In most cases, high precision is accompanied by low recall, mainly because of the labelling mechanism. As shown in Fig. 6, an anomaly window contains only a small number of anomalous data points, but all the data points in the window are labelled as anomalous. For the other methods, the precision is close to 1, but the recall stays between 0.001 and 0.04, whereas EGRTMS achieves better recall (0.02-0.12) together with high precision. In general, EGRTMS outperforms the other methods in most real-life scenarios.
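The effect of window-level labels on point-wise recall can be illustrated with a small sketch. The series length, the position of the window, and the two flagged time stamps are hypothetical values chosen for illustration and are not taken from the NAB data.

```python
def precision_recall(labels, detections):
    """Point-wise precision and recall for Boolean sequences of equal length."""
    tp = sum(1 for l, d in zip(labels, detections) if l and d)
    fp = sum(1 for l, d in zip(labels, detections) if d and not l)
    fn = sum(1 for l, d in zip(labels, detections) if l and not d)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

n = 1000
# A 100-point anomaly window in which every point is labelled anomalous.
labels = [400 <= i < 500 for i in range(n)]
# A detector that flags only the two points that actually deviate.
detections = [i in (450, 451) for i in range(n)]
precision, recall = precision_recall(labels, detections)
```

Under these assumptions, both detections fall inside the window, so the precision is 1, while the recall is only 2/100 = 0.02, which is the same order of magnitude as the recall values reported above.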
Evaluation results in terms of precision, recall on NAB datasets
Traditional statistical models are widely used for the detection of anomalies in streaming data, but they produce a high rate of false positives. This paper presented a real-time monitoring framework that is based on deep learning technology. In the operation of an enterprise, fully automated anomaly detection in data flows is extremely difficult to implement because of the large scale and the diverse use cases encountered in practical settings. A deep learning attention-based algorithm is incorporated into the prediction module to realize accurate prediction. This mechanism can explicitly take into account the different contributions of daily-periodic and weekly-periodic segments to the prediction. The anomaly detection module can automatically adapt to new time series by using a sliding window. Furthermore, abnormal events are tracked, and risk assessment is conducted by the alarm module and the anomaly handling module. Together, these features create a powerful real-time monitoring framework that is versatile, configurable, and extensible.
However, dynamic thresholds also have limitations when anomalies take the form of minor fluctuations or slow, continuously decreasing trends. We are working on extending the model and combining it with cumulative sum (CUSUM) sequential analysis for anomaly detection. In the future, EGRTMS can be further improved in terms of threshold selection and predictive performance.
Acknowledgments
This research is supported by the National Natural Science Foundation of China (61872288) and the National Key R&D Program of China (2018YFC0809001).
