Abstract
Large deployment of wireless sensor networks in various fields bring great benefits. With the increasing volume of sensor data, traditional data collection and processing schemes gradually become unable to meet the requirements in actual scenarios. As data quality is vital to data mining and value extraction, this paper presents a distributed anomaly detection framework which combines cloud computing and edge computing. The framework consists of three major components: k-nearest neighbors, locality sensitive hashing, and cosine similarity. The traditional k-nearest neighbors algorithm is improved by locality sensitive hashing in terms of computation cost and processing time. An initial anomaly detection result is given by the combination of k-nearest neighbors and locality sensitive hashing. To further improve the accuracy of anomaly detection, a second test for anomaly is provided based on cosine similarity. Extensive experiments are conducted to evaluate the performance of our proposal. Six popular methods are used for comparison. Experimental results show that our model has advantages in the aspects of accuracy, delay, and energy consumption.
Keywords
Introduction
With the rapid development of Internet of Things (IoT), the number of IoT devices involved in manufacturing processes of a smart factory keeps increasing. Meanwhile, the amount of sensor data generated by various sensors deployed in a smart factory has exploded in recent years. The sensor data fall into two major categories: 1) state data which represent real-time status at a specific timestamp (e.g., temperature, humidity) and 2) accumulated data which denote accumulated amount during a time period (e.g., energy consumption, uptime). The above two types of data possess a strict sequential order. Thus, they can be declared as time series [1]. Typically, time series is able to demonstrate both real-time changes and trends over time for a variable.
In general, anomaly in time series falls into the following three categories [2].
Point anomaly [3]: point anomaly refers to a point which is different from other points. This may be the result of sensor malfunction or noise caused by external factors. Point anomaly is also called outlier. Pattern anomaly [4]: pattern anomaly refers to a significant difference between a segment pattern and other segment patterns. In other words, pattern anomaly is caused by a sudden change. Sequence anomaly [5]: sequence anomaly refers to the non-compliance of a subsequence to other subsequences, such as subsequences generated by different DoS attack mechanisms.
Both point anomaly and pattern anomaly are abnormal behaviors appeared in an individual time series, while sequence anomaly is abnormal behaviors appeared between sequences. In addition, there are other descriptions and terms which convey anomaly (e.g, concept drift). In [6], the authors propose a concept drift detection model for data streams. The model is based on ensemble classifiers. The authors point out that “change in the underlying distribution that data points come from”, namely concept drift, is inevitable in data streams.
Traditionally, the collected sensor data in a smart factory are transmitted to a cloud computing center for anomaly detection. Then, the detection result and relevant decision are sent back to the factory. However, with the explosion of data, cloud-based data processing methods inevitably suffer from performance degradation caused by several crucial factors (e.g., latency, bandwidth consumption, and jitter).
As a complement of cloud computing, edge computing [7] selectively offloads computing and storage tasks to the edge of a network. Namely, data can be processed on edge nodes. This new computing paradigm possesses several advantages [8]: 1) the latency of data transmission is reduced, 2) the amount of data transmitted to the cloud is reduced, and 3) data security and privacy are enhanced. As edge computing is able to meet the demand of mobility, real-time capability, reliability, security and privacy for a network, it has extensive application scenarios [9].
For IoT devices in a smart factory, the collected sensor data show the system status in real time. The monitoring, control, and decision-making for manufacturing processes can be achieved by analyzing the data. However, data anomaly often leads to improper instructions and wrong operations. The validity and quality of the data are of great importance. Thus, a timely and accurate anomaly detection for the data plays a vital role in enhancing the stability and security of manufacturing processes [10]. In general, anomaly is inevitable during the collection and transmission of the sensor data. When the original data are transmitted to the cloud, the reliability of data analysis cannot be guaranteed. Based on the concept of edge computing, anomaly detection of the original data can be performed on edge nodes. In this case, the quality of data transmitted to the cloud is improved. In addition, the amount of data which need to be transmitted is reduced. Thus, the network performance metrics latency and bandwidth consumption are improved.
For anomaly detection, the comparison of edge computing and cloud computing is depicted in Fig. 1. The traditional cloud computing mode transmits the collected data to the cloud via wireless networks and backbone networks. The result of anomaly detection accorded by the cloud is sent back to the smart factory. In general, the transmission delay is much larger than that of the calculation delay. Thus, the real-time requirements of a smart factory could hardly be met. In contrast, the edge computing mode is able to selectively offload certain tasks which are originally performed in the cloud to the edge nodes, such as data preprocessing, anomaly detection, real-time decision-making, and privacy policy enforcement.
Representative literatures in five categories
Comparison of edge computing and cloud computing.
In this paper, we propose an improved k-NN anomaly detection framework based on locality sensitive hashing (LSH). The adoption of LSH provides a data preprocessing operation which facilitates the subsequent operation performed by the k-nearest neighbors algorithm. In specific, the innovation point is that the finding of neighboring elements in a large set
The reminder of this paper is structured as follows. Section 2 reviews representative anomaly detection methods in five categories. The advantages and disadvantages of these methods are summarized. Section 3 presents our improved k-NN anomaly detection framework which is enhanced by locality sensitive hashing. Three major components k-NN, LSH, and CS are elaborated, respectively. Section 4 evaluates the proposed model with extensive experiments. The analysis of numerical results is conducted based on the comparison of our method and six popular algorithms. Finally, conclusions and future work are presented in Section 5.
This section presents a brief review of existing anomaly detection methods. Table 1 shows representative literatures in five categories.
Statistical-based methods
The basic idea of a statistical-based method roots in hypothesis. It is assumed that the collected data are in certain distribution model (e.g., Gaussian distribution, Poisson distribution). If a data sample is in the distribution model, it is considered to be normal. When the data sample is not in the distribution model, it is declared as an outlier. The advantage of this type of methods is that there are mature theoretical models which are ready to use. In general, high availability and accuracy can be achieved.
In [11], a Markov-based online anomaly detection method is proposed. The anomaly detection is performed by a model with quantized amplitudes which are given by Markov chain. In [12], a local statistical scan for excessive communication of dynamic network subdomain is used to detect abnormal activities. In [13], a histogram-based anomaly detection method is developed. For stream data, a histogram of different characteristics is generated. The univariate characteristic density is modeled using histogram with fixed or dynamic facet width. Then, the anomaly detection is performed by calculating the anomaly score of a data sample. In [14], the authors propose to perform anomaly detection by using median and median absolute deviation. In addition, a prior dimension reduction is conducted based on principal component analysis. In [15], the authors develop a deep autoencoding Gaussian mixture model to perform unsupervised anomaly detection. A low-dimensional representation of an input data sample is fed into the Gaussian mixture model.
Drawback: Most statistical-based methods require multiple experiments to obtain prior information (e.g., data distribution, parameter estimation, and confidence interval). In general, these prior information could hardly be obtained in advance. In addition, this type of methods is not suitable for high-dimensional data.
Clustering-based methods
The fundamental idea of a clustering-based method is as follows. The collected data is divided into different clusters based on a certain criterion (e.g., distance). The similarity of data within the same cluster is made as high as possible, while the similarity of data between different clusters is made as low as possible [45]. Finally, a data sample which lies in no clusters is declared as an anomaly. The advantage of this type of methods is simplicity of algorithm, fast processing, and no need for class label or prior knowledge.
Hierarchical clustering-based methods
Hierarchical clustering-based methods divide data into different hierarchies. Thus, a tree-shaped clustering structure is obtained. The common strategies are bottom-up aggregation and top-down split. In [46], the authors make a comparison of similarity measures for categorical data in hierarchical clustering. Criteria are provided to determine in which situations a certain similarity measure is recommended for use. In [16], the authors propose an efficient data clustering method called BIRCH. Data are processed based on a tree structure. Each leaf node stores one cluster which is denoted by a cluster center and a radius. A data sample is assigned to the nearest leaf node. At last, data which are not assigned to any leaf node are declared as anomalies. In [18], the authors mainly deal with code anomalies, namely code fragments which are not typical. The acquisition of code vector representation and anomaly detection based on the vectorized data are discussed. In [19], the authors point out that traditional non-incremental hierarchical clustering algorithms for anomaly detection possess the disadvantages of low effectiveness and instability. The proposal in [19] dynamically adjusts the optimum clustering number based on the self-defined criterion. The manual picking of clustering number is avoided. This mitigates the above problems of traditional non-incremental hierarchical clustering methods. In [20], the authors propose a hybrid grid-based hierarchical clustering method based on Hausdorff distance [47]. Different versions of hierarchical clustering algorithms are used to cluster the grid-based trajectories based on pairwise the Hausdorff distances. The proposal is suitable for measuring the similarity between trajectories of different lengths.
Partitional clustering-based methods
Partitional clustering-based methods divide the collected data into several clusters. Initial centers of the above clusters are randomly selected. Then, heuristic algorithms perform iterative relocations for the collected data until a clear clustering result is achieved. The celebrated k-means algorithm was initially proposed in [21]. It randomly selects
Drawback: The performance of most clustering-based methods completely relies on the quality of clustering result. In addition, the time complexity of high-dimensional data processing is quite high.
Distance-based methods
Distance-based methods calculate the distance between data samples by using a distance function. For data sample
In [24], the authors propose a linear regressive model for time series anomaly detection. Several algorithms related to the Huber-skip and least trimmed squares estimators are developed to perform anomaly detection. In [25], non-parametric methods for accurate periodicity detection are developed. Periodic distance measures for time series are used to perform anomaly detection. Among various distance-based methods, the most widely used idea is the k-nearest neighbors [26]. Methods based on this idea perform anomaly detection based on the proximity between data samples. It is simple and does not require any training. However, the computation cost of the k-NN is high due to the calculation of distance between data samples. In [27], a k-NN-SVM method is proposed for anomaly detection. The indexing of training set is conducted based on the idea of R*-tree. Labels of data chosen by the k-NN determine whether a further training based on the SVM is needed. Although the training time is reduced, this proposal is not efficient for processing high-dimensional data due to large amount of computation cost introduced by both the k-NN and the SVM. In [28], the authors develop an efficient method to detect anomalies in the log data. N-gram and frequent pattern mining (FPM) are used. A labeled log data sample set is obtained from historical logs by using clustering and self-training method.
Drawback: The performance of most distance-based methods is influenced by parameter setting. And parameter selection is not an easy task.
Density-based methods
As the above mentioned distance-based methods take no account of the local sparsity of data, density-based methods which involve both proximity and local sparsity of data are developed. The advantage of this type of methods is the imbalance of local sparsity of data could be nicely handled. This contributes to more accurate detection result of local anomaly.
In [30], a local outlier factor (LOF) model is proposed. The local density of a data sample is compared with the densities of its neighboring data samples. The density is calculated based on the distance between data samples. The larger the distance is, the smaller the density is, and vice versa. As LOF-based models possess high time complexity, they are not suitable for processing large amount of high-dimensional data [49]. In [31], an unsupervised density-based anomaly detection method is developed. The local anomaly score of a data sample is used to represent the degree of its deviation from other data samples. In [32], a connectivity-based outlier factor (COF) model is proposed. It improves the effectiveness of the original LOF model when a pattern itself has a similar neighborhood density of an outlier. The detection accuracy of COF is superior to that of LOF. However, the execution time of the COF algorithm is often larger than that of LOF. In [33], the authors study multiple types of traffic data for the purpose of extracting different features. A grid-based LOF algorithm is developed to detect abnormal areas in the city of Beijing. In [34], the authors propose a fast outlier detection algorithm for data streams. The proposal is able to reduce the computation cost of the LOF by using Z-score pruning. In [35], the authors develop a flexible parametric probability measure for large scale anomaly detection in mixed numerical and categorical input space. Low likelihood values are identified as anomalies.
Drawback: For most density-based methods, the time complexity of an algorithm significantly increases with the dimension increment of data.
Classification-based methods
Based on different core components, classification-based methods can be divided into two categories: multi-class classifier and one-class classifier. The former type gives different labels to data which belong to different classes. The latter type sets an identifiable perimeter which encloses normal data. The advantage of classification-based methods is that there are multiple mature mathematical packages which can be employed to perform anomaly detection. The accuracy of anomaly detection is considered to be with a fairly strong stability.
Neural network-based methods
Neural network-based methods simulate the human nervous system to conduct computing and self-learning. Anomaly detection is performed based on considerable prior training. The advantage of this type of methods is that no prior information is needed. No statistical assumption for the original data is needed, thus this type of methods is resistant to distractors. In [36], an anomaly detection model based on convolutional neural network (CNN) is proposed. The extracted features are converted to a binary vector using one-hot encoding. Then, the CNN model is trained for the purpose of anomaly detection. In [37], different deep neural networks are used (e.g., CNN, recurrent neural network) in the training phase. The acoustic emission training is used for dimension reduction which reduces computation cost. In [38], several most advanced deep learning techniques for anomaly detection in cyber security are demonstrated. In specific, the discussion is focused on the multilayer perceptron (MLP) deep learning method.
SVM-based methods
The core component of SVM-based methods is a linear classifier which is defined in feature space. An SVM equipped with a kernel trick can be considered as a non-linear classifier. In general, an SVM-based method generates an area which covers normal data. A data sample lies outside the area is declared as an anomaly. As the performance of SVM-based methods is sensitive to the result of parameter selection, considerable improvements have been proposed. For instance, a data-driven hyperparameter optimization of one-class SVM for anomaly detection is proposed in [39]. In [40], the authors develop a linear one-class SVM based on an unsupervised deep belief network. The proposal is aimed at anomaly detection of high-dimensional and large-scale data. In [41], the authors propose an anomaly detection scheme based on a generalized support vector data description. The basic idea is that hyperspheres are created to accommodate data samples in different classes. Hence, outliers can be identified.
Isolation-based methods
The basic idea of isolation-based methods is that anomalies are few and different, they are more susceptible to be isolated with normal data. The advantage of this type of methods is that the intrinsic characteristics of an outlier are taken into full account during the construction of an anomaly detector [42]. In [43], an isolation-based distributed outlier detection framework using the nearest neighbor ensembles is proposed. The proposal is mainly based on isolation forest (iForest) [44] and LOF.
Drawback: Most classification-based methods are not sensitive enough to anomalies which are close to normal data.
An improved k-NN anomaly detection framework based on LSH
Basic model
Our model is a distributed anomaly detection framework based on edge computing. As shown in Fig. 2, the cloud server and the edge nodes cooperate with each other. In specific, there are three major components: k-NN, LSH, and CS. The traditional k-NN algorithm is enhanced by locality sensitive hashing (LSH) [50]. The cloud is responsible for both the generation and update of hash table. The latest hash table is sent to the edge nodes. The anomaly detection of data collected by various sensors is performed on edge nodes with k-NN. In this anomaly detection process, the hash table is used to reduce the volume of data which are used by k-NN. Finally, the anomaly detection results are further tested based on cosine similarity.
k-NN anomaly detection framework based on LSH.
The interaction of the above mentioned three major components is illustrated in Fig. 3. The brief flow chart is aimed at presenting the relation between the three algorithms. Thus, only key operations are depicted. For a test data sample
Algorithm flow chart.
Anomaly detection based on the k-NN algorithm measures the proximity between data samples. For data sample
An
In specific, data sample
The distance between two data samples
The
where
Based on the above analysis, the detailed k-NN algorithm is shown in Algorithm 3.3, where
Locality sensitive hashing is another nearest neighbors-based algorithm for large amount of high-dimensional data. A distinct feature of this hashing algorithm is that it is sensitive to location. The basic idea of LSH is that two neighboring data samples in the original sample space are still neighboring in high probability after being hashed. As LSH possesses rapid indexing ability, it is suitable for the incremental indexing of dynamic dataset. The cost of index update is considered to be low.
For a given set
where
where
The projection of all data samples in set
In Algorithm 3.4, the maximum value of all hashes in set
Though the above combination of k-NN and locality sensitive hashing can efficiently accord an anomaly detection result, the performance of anomaly detection significantly relies on the parameter
where
In Algorithm 3.5, parameter
Parameter setting and performance metrics
In our experiments, we employ the hardware and software settings used in one of our previous works [51]. Though the edge-cloud collaboration architecture proposed in [51] is aimed at detecting pattern anomaly of time series in wireless sensor networks, the experimental environment can be used for the evaluation of our model in this paper. In [51], experiments were focused on the performance analysis of the proposed edge-cloud collaboration scheme, where the number of edge nodes and cloud node are 15 and 1, respectively. In this paper, to facilitate the performance analysis of edge computing, we use 30 edge nodes and 1 cloud node. In specific, the 30 edge nodes are implemented on the MSP430 single chip computer equipped with the nRF905 wireless module. The MSP430 platform is able to work in 25 MHz and provide 100 KB RAM. The cloud node is an HP Z6 G4 workstation with 32 2.3 GHz cores and 32 GB RAM, and it runs a Debian Stretch 9.4.0 [52]. For the 30 edge nodes, the MSP430 single chip computers are installed with the open source operating system FreeRTOS [53]. For simplicity, the messaging protocols, message formats, and related requirements are based on the TCP/IP protocols. In specific, the communication between edge nodes and the cloud node is implemented based on the HTTP protocol. There is an Apache HTTP server [54] running on the cloud node.
Different values of parameter
Accuracy of anomaly detection for different 
The experimental results illustrated in Figs 5–8 correspond to the experiments conducted with the KDD CUP 99 dataset. The dataset has been widely used for evaluating the performance of anomaly detection algorithms. In our experiments, a pre-operational normalization for the dataset is performed to facilitate the comparison between data of different dimensions. The performance of our proposal is evaluated with three other algorithms: fixed-width clustering (FWC) [57], k-NN [58], and one-class SVM [59]. To facilitate the presentation, we denote our model by LKC. The width of the fixed-width clustering algorithm is set to
For confidence level
TPR, FPR, and AUC with different confidence levels
TPR, FPR, and AUC with different confidence levels
ROC curves of the four algorithms.
The precision, recall, and F1-score of the four algorithms are shown in Fig. 6. The overall performance of one-class SVM is the worst one. The performance of FWC is good in terms of precision, but it performs poorly in terms of both recall and F1-score. Though the performance of the traditional k-NN is similar to LKC in terms of the above three metrics, LKC is still the best one. In addition, the traditional k-NN possesses large delay, which is shown in Fig. 8 later in this section.
Precision, recall, and F1-score of the four algorithms.
Delay of anomaly detection for LKC and FWC in pure edge computing and cloud computing.
Moreover, LKC and FWC are further investigated in terms of delay. Experiments are conducted for
For Fig. 7, the corresponding experiments are conducted for pure edge computing and cloud computing. In other words, the operations of both LKC and FWC are performed either on edge nodes or in the cloud. In our proposal, hash tables are constructed in the cloud. Thus, further experiments of the four algorithms are conducted based on the collaboration of edge nodes and cloud. As shown in Fig. 8, when
Delay of anomaly detection for the four algorithms in collaboration of edge computing and cloud computing.
The experimental results illustrated in Fig. 9 correspond to the experiments conducted with the IBRL dataset. The dataset contains sensor data collected from 54 sensors deployed in the Intel Berkeley Research Lab, including humidity, temperature, light, and voltage values. Like the KDD CUP 99 dataset, the IBRL dataset is also an important dataset in the field of anomaly detection. To further investigate the performance of our proposal, we consider another three recent models (KNN-DK [60], Deep k-NN [61], and kNN-TSAD [62]) which are based on the idea of k-NN. In [60], a modified k-NN classifier which uses dynamic
Energy consumption and accuracy for different 
As shown in Fig. 9a, the values of the average energy consumption are normalized. The baseline of the normalization is the genuine average energy consumption of edge nodes for LKC when
As shown in Fig. 9b, the overall trends of the four curves are similar to each other. When
This paper proposed an improved k-NN anomaly detection method based on locality sensitive hashing. The anomaly detection is conducted on edge nodes. The data preprocessing performed by locality sensitive hash significantly reduces the amount of data which is handled by k-NN. This also speeds up the execution of k-NN. In addition, the adoption of edge computing reduces the delay of anomaly detection and the bandwidth consumption of data transmission. Moreover, the second test which is performed based on cosine similarity further improves the accuracy of anomaly detection. Compared to six existing methods, our proposal possesses higher accuracy, lower latency, and less energy consumption.
However, there is still room for improvement. Multi-source heterogeneous data generated in a smart factory often possess spatial-temporal correlativity. In the future, the optimization of our proposal should incorporate the analysis of spatial-temporal correlativity. In addition, though our proposal works well for univariate and multivariate datasets, the situation becomes a little complicated for a multi-modal dataset. As a multi-modal dataset often possesses more than one mode of distribution/anomaly, a normal data sample in one mode may become an anomaly in another mode. Similarly, an abnormal data sample in one mode may become normal in another mode. In our proposal, the LSH algorithm is able to classify training data into different buckets based on the hash values of the data. The subsequent k-NN algorithm works within an individual bucket. Thus, we hold that if the modes contained in a dataset can be clearly separated, the result obtained by the LSH algorithm handles the “multi-modal” feature gracefully. In this case, such a multi-modal dataset is not a problem in our proposal. On the contrary, if the modes contained in a dataset cannot be distinctly separated, the effectiveness of the LSH algorithm may be observably compromised, as well as the subsequent k-NN algorithm. In this case, such a multi-modal dataset can be considered as unmanageable for our proposal. Thus, an extra component which handles the multi-modal feature of a dataset should be developed for our proposal in the future.
Footnotes
Acknowledgments
The authors would like to thank the anonymous reviewers whose comments and suggestions greatly helped them improve the quality and presentation of this article. Cong Gao wants to thank his beloved mother, Miling Shen and family for their endless support and encouragement.
This work is partly supported by the Scientific Research Program of the Science and Technology Department of Shaanxi Province, China (2023-YBGY-211), the Scientific Research Program of Shaanxi Provincial Education Department, China (21JP115), the Scientific Research Program of the Science and Technology Bureau of Xi’an, China (22GXFW0129), the Scientific Research Program of the Science and Technology Bureau of Yulin, China (CXY-2022-162), and the Key Research and Development Programs of the Science and Technology Department of Shaanxi Province, China (2021ZDLGY06-03).
