Anomaly Detection in Automatic Meter Intelligence System Using Positive Unlabeled Learning and Multiple Symbolic Aggregate Approximation

Abstract

With the spread of automatic metering devices in smart grids, the volume of data generated and transmitted over time is too large for humans to monitor consumption manually. Detecting anomalies in power consumption is therefore crucial for monitoring and controlling smart grids. This article proposes a method for detecting anomalous electric meters by identifying abnormal consumption patterns and learning from unlabeled data. Furthermore, a big data and machine learning-based anomaly detection framework is introduced. The experimental results show that the proposed time series anomaly detection for electric meters achieves better accuracy and time than the expert alternatives.

Introduction

In recent years, the old power grid infrastructure that supplies power to users has been progressively replaced by a succession of digital systems known as smart grids. Consumers and utility providers benefit from the upgraded grid's ability to monitor, control, and forecast energy consumption. Energy consumption data take the form of time series, and learning temporal patterns in time series remains a complex problem. When detecting anomalies in time series, it is critical to understand the underlying structure of a system's normal behavior.^1–3

The electrical consumption data are first decomposed to expose the seasonal, trend, and random components. The seasonal component captures the fluctuation of electrical consumption within each day, a pattern that repeats throughout the series. The trend component captures the slowly varying level of consumption, whereas the random component displays the deviations of the observed data from the sum of the trend and seasonal components. For the electrical data, the trend component represents the magnitude of the consumption, 340 kWh. The trend component appears to move a lot on the graph, although its variability over the week interval is less than 2%.^3–5

To handle the challenge of anomaly detection with time series data, a variety of anomaly detection techniques have been developed. Table 1 outlines some of the specific approaches and case studies in anomaly detection with time series. One of the most challenging problems in anomaly detection is limited data: labeled data are scarce, whereas unlabeled data are abundant. Positive unlabeled learning (PUL) is a recommended method for this problem. Table 2 summarizes some of the research that has used PUL to handle data imbalance and missing data issues. Symbolic aggregate approximation (SAX) is a method that transforms a time series into discrete symbolic sequences. It is widely used in fields such as pattern recognition and anomaly detection. Some research on SAX is summarized in Table 3. The study by González-Vidal et al.⁶ used the Heuristically Order Time series using SAX algorithm to detect anomalies in water management. Trend symbolic aggregate approximation⁷ and symbolic aggregate approximation - change points (which captures the trends in a time series based on abrupt Change Points [CP])⁸ are newer SAX techniques that improve classification performance.

Table 1.

Research on anomaly detection in time series

Objective	Methodology	Case studies	Year	References
Propose a model for detecting irregular electricity usage and resolving the problem of data imbalance	K-means SMOTE, AdaBoost	Electricity consumption data of State Grid Zhejiang Electric Power Corporation	2020	¹⁶
Detecting anomalies in conventional meters using ensemble method	LGB and multivariate Gaussian distribution	The data set of 135,000 consumers of Tunisian Company of Electricity and Gas	2021	¹⁷
Provides a feature-engineering solution for fraud detection in an advanced metering infrastructure that is both practical and model-agnostic	Finite Mixture Model, Genetic Programming, Gradient Boosting	Data on electricity consumption have been collected from more than 4000 households in 18 months	2019	¹⁸

LGB, Light Gradient Boosting.

Table 2.

Research on positive unlabeled learning

Objective	Methodology	Case studies	Year	References
Propose a model for detecting irregular electricity usage and resolving the problem of data imbalance	Using LP-PUL	About 17 text data sets from different domains	2021	^16,19
Provide a unique non-negative risk estimation approach for PUL based on the cross-domain robustness of reconstruction-based features	PURE	Several digit picture data sets are frequently used to study domain adaptability as: MNIST, MNIST-M, USPS, SVHN, etc.	2020	²⁰
Using PUL to create a prospective map for Fe polymetallic deposits in southwestern Fujian province, China	PUL, OCSVM, ANN, LR	Data on Fe ore mines in China's Fujian province	2021	²¹

ANN, artificial neural networks; LP-PUL, Label Propagation for Positive Unlabeled Learning; LR, logistic regression; MNIST, Modified National Institute of Standards and Technology; OCSVM, one-class support vector machine; PUL, positive unlabeled learning; PURE, Positive-Unlabeled Reconstruction Encoding; SVHN, street view house numbers; USPS, U.S. postal service.

Table 3.

Research on symbolic aggregate approximation in time series

Objective	Methodology	Case studies	Year	Reference
Propose a trend symbolized method (TSAX) to detect the anomaly heart signals	TSAX	ECG data set	2019	⁷
Compare performance of ARIMA and HOT-SAX method for anomaly detection	HOT-SAX	Water management system	2019	⁶
Propose trend base SAX reduction techniques	SAX-CP	UCR time series data set	2019	⁸

ARIMA, Autoregressive Integrated Moving Average; ECG, electrocardiography; HOT-SAX, Heuristically Order Time series using SAX; SAX-CP, symbolic aggregate approximation - Change Points; TSAX, trend symbolic aggregate approximation.

This article aims to detect anomalous electric meters automatically by using abnormal behavior pattern detection together with an unlabeled learning method. Specifically, two methods, multiple symbolic aggregate approximation and time series decomposition, are combined to analyze abnormal behavior. Because most of the data are unlabeled, an unlabeled learning method is needed to increase the accuracy of the anomaly detection model.

The remainder of the article is organized as follows: The Methodology section presents the anomaly detection problem, multiple SAX, the handling of imbalanced and unlabeled data, and the proposed model. The experiments and results on smart meter sensors are shown in the Experiments and Results section. Finally, the conclusions and discussion are provided in the Conclusions section. Some acronyms used in the manuscript are given in Table 4.

Table 4.

Table of nomenclature

Name	Description
AR	Autoregressive
MA	Moving average
ARMA	Autoregressive Moving Average
ARIMA	Autoregressive Integrated Moving Average
DT	Decision Tree
IoT	Internet of Things
PAA	Piecewise Aggregate Approximation
SARIMA	Seasonal Autoregressive Integrated Moving Average
SAX	Symbolic aggregate approximation
PUL	Positive Unlabeled Learning
SVM	Support Vector Machine
ReNe	Reliable Negative
NB	Naive Bayes

Methodology

Anomaly detection formula problem

In this article, an energy meter is a sensor that automatically measures a consumer's electricity load over time.⁹ Each sensor is classified into one of two classes, anomaly or normal. A sensor flagged as anomalous is then inspected and checked.¹⁰

Each sensor is described by an input with N features: $x = \{x^{1}, x^{2}, \dots, x^{N}\}$. The aim of the classification problem is to assign each sensor to one of the two classes $\{\mathrm{anomaly}, \mathrm{normal}\}$ as correctly as possible.

Consider the sensor data set $S = \{(x_{i}, y_{i})\}_{i = 1}^{n}$, where $x_{i} \in \mathbb{R}^{N}$ and $y_{i} \in \{\mathrm{anomaly}, \mathrm{normal}\}$ is the label of sensor i. The objective is to find a function $f(\cdot)$ such that $y = f(x)$.

Overview anomaly detection system

The research procedure of time series anomaly detection system contains four phases:

Phase 1: Integrating Data: Gathering data from various sources and combining it into one data set.

Phase 2: Preprocessing Data: Raw data collected in the previous step can be in various formats and may be inconsistent. To overcome this problem, this step performs data cleaning, data normalization, and feature generation.

Phase 3: Modeling: Processed data set will be split into two sets: a training set and a test set. The training set will be used to train a classifier.

Phase 4: Evaluation: This step assists in selecting the best model that meets a set of criteria and in evaluating the model's future performance. If the model's performance is poor, return to the data preprocessing step to clean the data and generate more predictive features, or to the modeling step to tune the model's hyperparameters.

Decompose time series

The data $X = \{x_{t_{1}}, x_{t_{2}}, \dots, x_{t_{N}}\}$ are decomposed into three components, namely trend, seasonal, and residual components:

$X_{t} = m_{t} + s_{t} + r_{t}$ (1)

where

$m_{t}$ is the trend component, identified by a moving average;

$s_{t}$ is the seasonal component with period d, so that $s_{t} = s_{t + d}, \forall t = 1, 2, \dots, N - d$;

$r_{t}$ is the residual component.

The decomposed components of the time series create new features for the data.
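The additive decomposition in Equation (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `moving_average` and `decompose` are hypothetical, the trend uses a centered moving average, the seasonal component is taken as the average detrended value at each phase of the period, and the series is assumed to cover at least two full periods.

```python
from statistics import mean

def moving_average(x, d):
    """Centered moving average over one period d; None where the window does not fit."""
    half = d // 2
    trend = [None] * len(x)
    for t in range(half, len(x) - half):
        window = x[t - half:t + half + 1]
        if d % 2:   # odd period: plain centered window of d values
            trend[t] = sum(window) / d
        else:       # even period: d + 1 values with half weight on both ends
            trend[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / d
    return trend

def decompose(x, d):
    """Additive decomposition X_t = m_t + s_t + r_t with seasonal period d."""
    m = moving_average(x, d)
    detrended = [xi - mi if mi is not None else None for xi, mi in zip(x, m)]
    # Average the detrended values that share the same phase within the period.
    phase_means = [mean(v for v in detrended[k::d] if v is not None) for k in range(d)]
    offset = mean(phase_means)  # center the seasonal component around zero
    s = [phase_means[t % d] - offset for t in range(len(x))]
    r = [xi - mi - si if mi is not None else None for xi, mi, si in zip(x, m, s)]
    return m, s, r

# Toy series: constant level 11.5 plus a repeating 4-step pattern.
series = [10 + (t % 4) for t in range(16)]
trend, seasonal, residual = decompose(series, 4)
```

By construction the seasonal list satisfies $s_{t} = s_{t + d}$, and the three lists can be appended to the feature set of a meter.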

Multiple SAX

The approach of SAX converts a time series into discrete symbolic sequences.¹¹ It is frequently employed in a variety of fields, including pattern recognition and anomaly detection.^7,12

Using SAX, a time series of arbitrary length N can be reduced to symbolic strings for several window sizes $w_{1}, \dots, w_{q}$. Let $S = \{1, \dots, q\}$ be the index set of the window sizes.

A new method to find the normal pattern for multiple windows of $w_{1}, \dots, w_{q}$ time steps is described below:

Step 1: Consider a time series of length N: $X = \{x_{t_{1}}, x_{t_{2}}, \dots, x_{t_{N}}\}$ in the observed time period $T = [t_{1}, t_{N}]$.

Step 2: Select a fixed window size $w_{i}, i \in S$, and divide the time series into $M_{i} = \lfloor \frac{N}{w_{i}} \rfloor, \forall i \in S$ equal parts, where $\lfloor x \rfloor$ denotes the greatest integer not exceeding x:

$P(1, i) = \{x_{t_{1}}, x_{t_{2}}, \dots, x_{t_{w_{i}}}\}$

$P(2, i) = \{x_{t_{w_{i} + 1}}, x_{t_{w_{i} + 2}}, \dots, x_{t_{2 w_{i}}}\}$ (2)

$\vdots$

$P(M_{i}, i) = \{x_{t_{(M_{i} - 1) w_{i} + 1}}, x_{t_{(M_{i} - 1) w_{i} + 2}}, \dots, x_{t_{M_{i} w_{i}}}\}$

Thus, for each window size $w_{i}$, the original data are represented by $M_{i}$ windows:

$X = \{P(1, 1), P(2, 1), \dots, P(M_{1}, 1)\}$ (3)

$\vdots$

$X = \{P(1, q), P(2, q), \dots, P(M_{q}, q)\}$

The normal symbolic pattern for data in window $w_{i}, i \in S$ is defined in the next step.

Step 3: For a given window size w with M windows, let $P^{i} = \{x_{t_{j}}\}_{j = 1}^{w}$, $i = 1, \dots, M$, be the list of data in period i. The values of this list are then separated into l quantiles. Consider the thresholds $q_{k}^{i} = quantile(P^{i}, \frac{k \cdot 100}{l}\%)$, $k = 1, \dots, l$, which are the quantile values of the set $P^{i}$. In period i, the l split thresholds $q_{1}^{i}, q_{2}^{i}, \dots, q_{l}^{i}$ therefore depend on the set $P^{i}$ and the parameter l. The alphabet $A = \{a_{1}, \dots, a_{l}\}$ is used to encode the data in the next step.

Step 4: For the time series period $P^{i} = \{x_{t_{j}}\}_{j = 1}^{w}$, the value $x_{t_{j}}$ is encoded by the symbol $a_{k} \in A$ if $x_{t_{j}} \in [q_{k}^{i}, q_{k + 1}^{i}]$, where $k \in \{1, \dots, l\}$. Then, each part $P^{i}$ can be represented as $Symbol(i) = (a_{1}^{i}, \dots, a_{w}^{i})$, $i = 1, \dots, M$, with $a_{j}^{i} \in A, \forall j = \overline{1, w}$, as follows:

$Symbol(1) = \{a_{1}^{1}, \dots, a_{w}^{1}\}$ (4)

$\vdots$

$Symbol(M) = \{a_{1}^{M}, \dots, a_{w}^{M}\}$

Each symbolic vector has w positions, indexed from 1 to w. The set of all symbols at position j of the M symbolic vectors has the form $\mathcal{B}_{j} = \{a_{j}^{1}, \dots, a_{j}^{M}\}$, and $a_{j}^{mode} = mode(\mathcal{B}_{j})$ is the most frequent symbol at position j. The vector of modes, which collects the most frequent symbol at each position, is called the normal pattern. The normal behavior of the time series, $Symbol_{normal}$, is defined by the formula:

$Symbol_{normal} = \{a_{1}^{mode}, \dots, a_{w}^{mode}\}.$ (5)

$Symbol_{normal}$ is thus the pattern that represents the most frequent behavior of the sensor in window w, learned from the M windows of the time series data and the labels.

Step 5: After the $Symbol_{normal}$ pattern is found, the features that differ between the two classes, normal and anomaly sensors, are analyzed. The SAX distance between a new time series and the normal behavior is calculated using the following formula:

$d_{SAX} = d(Symbol_{normal}, Symbol_{new}) = symdist(Symbol_{normal}, Symbol_{new})$ (6)

where $mode(\cdot)$ is the statistical operator that returns the symbol with the highest frequency in a given set, and $symdist(\cdot)$ is the Jaccard measure defined for two symbolic vectors A and B:

$symdist(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$ (7)

Note that the ratio in Equation (7) equals $|A \cap B| / |A \cup B|$, the Jaccard similarity coefficient, so larger values indicate closer agreement with the normal pattern; the conventional Jaccard distance is $1 - symdist(A, B)$.
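Steps 2 to 5 can be illustrated with a short Python sketch. It is a minimal reading of the procedure, not the authors' implementation: the function names are hypothetical, the quantile thresholds are taken directly from the sorted values of each window, only the interior cut points are used to assign symbols, and two symbolic vectors are compared as sets of (position, symbol) pairs when evaluating Equation (7).

```python
from collections import Counter

def split_windows(x, w):
    """Step 2: cut x into floor(N / w) consecutive windows of length w."""
    return [x[k * w:(k + 1) * w] for k in range(len(x) // w)]

def quantile_thresholds(values, l):
    """Step 3: l - 1 interior cut points that split the window values into l bins."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, (k * n) // l)] for k in range(1, l)]

def encode(window, alphabet):
    """Step 4: replace each value by the symbol of the quantile bin it falls in."""
    cuts = quantile_thresholds(window, len(alphabet))
    return [alphabet[sum(v >= c for c in cuts)] for v in window]

def normal_pattern(symbol_vectors):
    """Eq. (5): position-wise mode over all symbolic vectors."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*symbol_vectors)]

def symdist(a, b):
    """Eq. (7) as printed, evaluated on (position, symbol) pairs (assumption)."""
    sa, sb = set(enumerate(a)), set(enumerate(b))
    inter = len(sa & sb)
    return inter / (len(sa) + len(sb) - inter)

def multi_sax_patterns(x, window_sizes, alphabet):
    """Normal pattern for every window size w_1, ..., w_q."""
    return {
        w: normal_pattern([encode(p, alphabet) for p in split_windows(x, w)])
        for w in window_sizes
    }

# Toy consumption series with three windows of four readings each.
readings = [1, 5, 9, 2, 1, 6, 8, 2, 9, 1, 1, 9]
symbols = [encode(p, "abc") for p in split_windows(readings, 4)]
normal = normal_pattern(symbols)
scores = [symdist(normal, s) for s in symbols]
```

In this toy series the first two windows share the same shape and the third does not, so the third window obtains the lowest value of `symdist` against the normal pattern.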

Figure 1 shows an example of using SAX to encode time series. The subsequences of raw time series data in window w are encoded into a symbolic vector using SAX. Figure 2 shows the symbolic anomaly vector using SAX. The anomaly vector is very different from the normal behavior of sensors.

FIG. 1.

SAX encodes subsequence time series in window w into symbolic vector. SAX, symbolic aggregate approximation.

FIG. 2.

Extract symbolic anomaly vector using SAX.

This distance is used to analyze the new features of the two classes, normal and anomaly. The advantage of SAX is that it tolerates noise in time series data and has small variance, but it cannot detect an abnormal subsequence within a window of w time steps.

Positive unlabeled learning

For classification, PUL is a method of learning from a data set labeled positive (Po) and an unlabeled data set (U), where U includes both positive and negative samples; that is, U also contains positive samples that have not been labeled.^13,14

There are three approaches to tackling the PUL problem: the two-step strategy, the direct approach using the biased support vector machine (SVM) algorithm, and probability estimation.

Two-step technique

Step 1: From the unlabeled set U, create a Reliable Negative set (denoted ReNe). Spying techniques, 1-dynamic noise filter (DNF) procedures, and other methods are used to determine the reliable negative set.

Step 2: Build a classifier using Po and ReNe, and apply it to the remaining set U − ReNe. Classifiers such as Iterative SVM, Random Forest (RF), and others are available as options.

Techniques for Spying:

- Randomly sample $s\%$ of the set Po and put these samples into the set U, where they act as spies

- Build a classifier with the new set Po and the new set U

- Extract the reliable negative set using the above classifier
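The spying steps above can be sketched as follows. This is a hedged illustration only: a nearest-centroid score stands in for the classifier (the source does not prescribe one at this step), the threshold is set to the lowest spy score (one common choice, assumed here), and the names `reliable_negatives`, `centroid`, and `score` are hypothetical. The sketch also assumes that Po holds at least two samples so that some positives remain after the spies are removed.

```python
import random

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def score(x, pos_center, unl_center):
    """Higher means x is closer to the positive centroid than to the unlabeled one."""
    d_pos = sum((a - b) ** 2 for a, b in zip(x, pos_center))
    d_unl = sum((a - b) ** 2 for a, b in zip(x, unl_center))
    return d_unl - d_pos

def reliable_negatives(po, u, s=0.15, seed=0):
    """Return the samples of U that score below every spy."""
    rng = random.Random(seed)
    n_spies = max(1, round(s * len(po)))
    spy_idx = set(rng.sample(range(len(po)), n_spies))
    spies = [x for i, x in enumerate(po) if i in spy_idx]
    po_rest = [x for i, x in enumerate(po) if i not in spy_idx]
    pos_center = centroid(po_rest)
    unl_center = centroid(u + spies)  # the spies are hidden inside U
    threshold = min(score(x, pos_center, unl_center) for x in spies)
    return [x for x in u if score(x, pos_center, unl_center) < threshold]

# Toy data: positives cluster near (5, 5); U mixes negatives near (0, 0) with hidden positives.
po = [[5.0, 5.0], [5.2, 4.9], [4.8, 5.1], [5.1, 5.2], [4.9, 4.8], [5.0, 5.1]]
u = [[0.0, 0.0], [0.1, -0.1], [-0.2, 0.1], [0.2, 0.2], [5.0, 4.9], [5.1, 5.0]]
rene = reliable_negatives(po, u, s=0.3, seed=1)
```

Because the spies are known positives, any unlabeled sample that scores below all of them is unlikely to be positive, which is what makes it a candidate for ReNe.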

Technique 1-DNF:

- Find the features that appear more frequently in the set Po than in the set U

- Extract the documents in the set U that do not contain any of the features found above. Those documents are highly reliable negative documents (ReNe).
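A minimal sketch of the 1-DNF technique is given below. It assumes that each document is represented as a set of features and that "more frequently" is measured by relative document frequency; both choices, as well as the function name `one_dnf_reliable_negatives` and the toy feature names, are illustrative assumptions rather than details given in the source.

```python
from collections import Counter

def one_dnf_reliable_negatives(po_docs, u_docs):
    """po_docs and u_docs are lists of feature sets; returns (positive features, ReNe)."""
    po_freq = Counter(f for doc in po_docs for f in doc)
    u_freq = Counter(f for doc in u_docs for f in doc)
    # Features whose share of documents is higher in Po than in U.
    positive_features = {
        f for f, n in po_freq.items()
        if n / len(po_docs) > u_freq[f] / len(u_docs)
    }
    # Unlabeled documents that contain none of the positive features.
    rene = [doc for doc in u_docs if not (doc & positive_features)]
    return positive_features, rene

# Hypothetical feature names describing consumption patterns.
po_docs = [{"spike", "night"}, {"spike", "drop"}]
u_docs = [{"flat"}, {"spike", "flat"}, {"night", "flat"}, {"flat", "drop"}]
features, rene = one_dnf_reliable_negatives(po_docs, u_docs)
```

Here only the first unlabeled document shares no feature with the positive set, so it is the only one kept as a reliable negative.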

In the second step, a classifier is built to split the set $Q = U - ReNe$. Many different algorithms can be used, but Iterative SVM is often chosen and gives good results.

Algorithm Iterative SVM:

- Run the SVM algorithm iteratively using the sets Po, ReNe, and $Q = U - ReNe$ until no more documents in Q are classified as negative. ReNe and Q are updated after each loop.

Direct approach: biased-SVM

Given the training data set $\{(x_{1}, y_{1}), \dots, (x_{n}, y_{n})\}$, where $x_{i}$ is an input feature vector and $y_{i} \in \{-1, 1\}$ is the label of the corresponding class. Assume that the first $l - 1$ samples are the positive samples P, labeled 1, and that the remaining samples are the unlabeled samples U, which are treated as negative and labeled −1. The negative labels therefore contain errors, because U also contains positive samples. We consider two cases:

No noise: The set P has no noise, but the set U does. In this case, according to the SVM formulation, we have:

Minimize: $\frac{1}{2} b^{T} b + C \sum_{i = l}^{n} \delta_{i}$

Constraints:

$b^{T} x_{i} + b_{0} \geq 1, i = 1, 2, \dots, l - 1$

$-1 (b^{T} x_{i} + b_{0}) \geq 1 - \delta_{i}, i = l, l + 1, \dots, n$

$\delta_{i} \geq 0, i = l, l + 1, \dots, n$

where $C \geq 0$ is a parameter that controls the amount of noise allowed.

In case of noise: In practice, the set P may also contain some negative samples. To allow noise in the set P, a soft-margin SVM is used for biased-SVM, where the two parameters $C_{+}$ and $C_{-}$ are the weights of the positive and negative errors, respectively. The formulation is as follows:

Minimize: $\frac{1}{2} b^{T} b + C_{+} \sum_{i = 1}^{l - 1} \delta_{i} + C_{-} \sum_{i = l}^{n} \delta_{i}$

Constraints:

$y_{i} (b^{T} x_{i} + b_{0}) \geq 1 - \delta_{i}, i = 1, 2, \dots, n$

$\delta_{i} \geq 0, i = 1, 2, \dots, n$

$C_{+}$ and $C_{-}$ are two parameters that are tuned to obtain the best results.

The learning process of the biased-SVM algorithm consists of solving this optimization problem to find b and $b_{0}$.
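To make the role of the two weights concrete, the sketch below evaluates the biased-SVM objective for a given candidate $(b, b_{0})$. It does not solve the optimization problem. For a fixed $(b, b_{0})$, the smallest slack that satisfies the constraints is $\max(0, 1 - y_{i}(b^{T} x_{i} + b_{0}))$, which is what the code uses; the function name `biased_svm_objective` and the toy numbers are illustrative assumptions.

```python
def biased_svm_objective(b, b0, xs, ys, c_pos, c_neg):
    """Value of (1/2) b^T b + C_+ * (positive slacks) + C_- * (negative slacks)."""
    margin_term = 0.5 * sum(w * w for w in b)
    pos_slack = 0.0
    neg_slack = 0.0
    for x, y in zip(xs, ys):
        f = sum(w * v for w, v in zip(b, x)) + b0
        slack = max(0.0, 1.0 - y * f)  # smallest delta_i allowed by the constraints
        if y == 1:
            pos_slack += slack
        else:
            neg_slack += slack
    return margin_term + c_pos * pos_slack + c_neg * neg_slack

# Two labeled positives followed by two unlabeled samples treated as negative.
xs = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 0.0]]
ys = [1, 1, -1, -1]
value = biased_svm_objective([1.0, 0.0], 0.0, xs, ys, c_pos=10.0, c_neg=1.0)
```

Choosing $C_{+}$ larger than $C_{-}$, as in this toy call, penalizes errors on the labeled positives more heavily than errors on the unlabeled samples, which may in fact be positive.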

Building a classifier using probability estimation

This is a probability-based approach. Let $x$ be the feature vector and $y \in \{-1, 1\}$ the corresponding label. Let $s = 1$ if $x$ is labeled and $s = 0$ if $x$ is not labeled. Only positive samples are labeled, so $y = 1$ is certain if $s = 1$, whereas if $s = 0$, either $y = 1$ or $y = -1$ is possible. The fact that only positive samples are labeled can be expressed as:

$P(s = 1 | x, y = -1) = 0$ (8)

The main goal is to find a classifier function $f(x)$ that is as close as possible to $P(y = 1 | x)$. To achieve this goal, a completely random selection assumption is used, which states that the labeled positive samples are chosen at random from all positive samples. This means that, regardless of $x$, the probability that a positive sample is labeled is the same:

$P(s = 1 | x, y = 1) = P(s = 1 | y = 1)$ (9)

Here, $c = P(s = 1 | y = 1)$ is the constant probability that a positive sample is labeled.

For learning, the training set consists of two subsets, the labeled set P $(s = 1)$ and the unlabeled set U $(s = 0)$, which are randomly drawn. If these two sets are fed to a standard learning algorithm, it yields a function $g(x)$ that approximates $P(s = 1 | x)$. $f(x)$ is then calculated from $g(x)$ as follows:

$f(x) = \frac{g(x)}{c}, c = P(s = 1 | y = 1)$ (10)

The constant $c = P(s = 1 | y = 1)$ can be estimated using a classifier g (e.g., SVM, Naive Bayes, RF) and a validation set. Let V be the validation set and $V_{p}$ the subset of V containing the labeled samples. Then c is estimated as the mean of $g(x)$ over all $x \in V_{p}$:

$c \approx \frac{1}{|V_{p}|} \sum_{x \in V_{p}} g(x)$ (11)
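The estimate of c as a mean of $g(x)$ over labeled validation samples, and the rescaling $f(x) = g(x) / c$ from Equation (10), can be sketched as below. The scorer `g` is a placeholder for any trained model that returns an estimate of $P(s = 1 | x)$; the function names, the toy scorer, and the clipping of $f(x)$ at 1 are assumptions made for illustration.

```python
def estimate_c(g, labeled_validation):
    """Mean of g(x) over the labeled (hence positive) validation samples."""
    return sum(g(x) for x in labeled_validation) / len(labeled_validation)

def positive_probability(g, x, c):
    """f(x) = g(x) / c, clipped so that the result stays a valid probability."""
    return min(1.0, g(x) / c)

# Hypothetical scorer standing in for a trained classifier g(x) ~ P(s = 1 | x).
def g(x):
    return 0.1 * x

c = estimate_c(g, [2, 4])             # labeled validation samples
p = positive_probability(g, 1.5, c)   # estimated P(y = 1 | x) for a new sample
```

Since g only estimates the probability of being labeled, dividing by c is what converts its output into an estimate of the probability of being positive.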

Proposed model

The data set of sensors consists of cumulative time series, which represent power consumption per unit time. The proposed model consists of five main parts.

Data ingestion: Integrates the input data collected from electricity meters.

Preprocessing: Includes components that remove extraneous data, fill in missing data, and bring the data into a common format. The data are then normalized based on the time period per day.

Feature engineering: Transforms the data using the multiple SAX method and analyzes the time series by decomposing them into trend, multiseasonal, and residual components. These components make up the features of the electric meter.^1,15

Data exploration: The data are divided into train and test sets. Imbalance processing techniques are applied to the train set, and the model is trained using the PUL method with selected hyperparameters. The trained model with the best hyperparameters is then chosen.

Evaluation: Evaluates the results achieved. If the stopping condition is satisfied, the learned model is kept. Otherwise, the model is retrained with other parameters. Finally, the results are reported.

The output is visualized, and the final model is tested using evaluation metrics. The overview of this approach is shown in Algorithm 5 and Figure 3, respectively.

FIG. 3.

Proposed model anomaly detection using machine learning and pattern recognition.

Algorithm 4. Preprocessing
Input: Cumulative time series data $TS$;
Output: Features set $F$;
Transform $TS$ to consumption time series $TS'$;
Normalize $TS'$ to normalized time series $TS_{0}$;
$F_{0} \leftarrow gen(TS_{0})$;
$F_{SAX} \leftarrow MultiSAX(F_{0})$;
$F_{stat} \leftarrow stat(F_{0})$;
$F \leftarrow F_{0} \cup F_{SAX} \cup F_{stat}$;
return $F$;
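The first steps of Algorithm 4 can be sketched as below. The source does not specify how the normalization or the statistical features are computed, so the min-max scaling and the particular statistics used here are assumptions, as are the function names `to_consumption`, `min_max_normalize`, and `stat_features`.

```python
from statistics import mean, pstdev

def to_consumption(cumulative):
    """Turn cumulative meter readings into per-interval consumption by differencing."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]

def min_max_normalize(values):
    """Scale values to [0, 1]; a constant series maps to all zeros (assumed convention)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def stat_features(values):
    """A few summary statistics that could serve as the stat(.) step (assumption)."""
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
    }

cumulative_kwh = [100, 102, 105, 105, 109]   # hypothetical Import_KWH readings
consumption = to_consumption(cumulative_kwh)
normalized = min_max_normalize(consumption)
features = stat_features(normalized)
```

Differencing is needed because the meters report cumulative energy, whereas the pattern analysis operates on the consumption within each interval.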

Experiments and Results

Data set description

The data set contains time series of cumulative energy from 1067 sensors, recorded from January 1, 2017, to March 31, 2018, in a province of Vietnam. The data comprise about 2.5 million rows with the 3 attributes described in Table 5:

Table 5.

Description of the data set

Attribute	Description	Data type
Meter ID	The identification of the sensors	String
Timestamp	The datetime to record data	Datetime
Import_KWH	The value of cumulative data at timestamp	Float

KWH, kilowatt hours.

Evaluation metrics

Four measures were used in our experiments to evaluate the four methods: classification accuracy, Precision, Recall, and F₁ score. These criteria are calculated by the following formulas:

$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$ (12)

$Precision = \frac{TP}{TP + FP}$ (13)

$Recall = \frac{TP}{TP + FN}$ (14)

$F_{1} = 2 \frac{Precision \times Recall}{Precision + Recall}$ (15)

where:

True Positive (TP) and True Negative (TN) are the numbers of positive and negative class data points, respectively, that are classified correctly.

False Positive (FP) and False Negative (FN) are the numbers of data points incorrectly classified as positive and as negative, respectively.
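For reference, Equations (12) to (15) translate directly into code. The sketch below is illustrative: the function name `classification_metrics`, the convention of returning 0.0 when a denominator is zero, and the confusion counts in the example are assumptions, not values from the experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, Precision, Recall, and F1 from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for an imbalanced test set with 13 anomalous meters out of 100.
metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```

With these counts the accuracy is high while the recall is noticeably lower, which illustrates why the F₁ score is reported alongside accuracy on imbalanced data.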

Receiver operating characteristic curve

A receiver operating characteristic (ROC) curve is built from two components: the true positive rate (TPR) and the false positive rate (FPR). The ROC curve is a graph that displays the performance of a classification model across all classification thresholds.

TPR, or Recall, is defined as follows:

$TPR = \frac{TP}{TP + FN}$ (16)

FPR is defined as:

$FPR = \frac{FP}{FP + TN}$ (17)

An ROC curve plots TPR against FPR at multiple classification thresholds. As the classification threshold is lowered, more items are classified as positive, leading to a rise in both False Positives and True Positives.

Area under the ROC curve

The area under the ROC curve (AUC) measures the entire two-dimensional area beneath the complete ROC curve. The ROC is a probability curve, whereas the AUC is a measure of separability: it indicates how well the model discriminates between the classes. The higher the AUC, the better the model predicts class 0 as 0 and class 1 as 1. An illustration of the AUC is shown in Figure 4.

FIG. 4.

AUC (area under the ROC curve). ROC, receiver operating characteristic.
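The construction of the ROC curve and the AUC can be sketched in plain Python. The sketch sweeps the threshold over the distinct scores, collects the (FPR, TPR) points, and integrates them with the trapezoidal rule; the function names and the toy labels and scores are illustrative assumptions, and the data are assumed to contain at least one sample of each class.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) points obtained by lowering the threshold over the distinct scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )

labels = [1, 1, 0, 1, 0, 0]                  # 1 = anomaly, 0 = normal (toy data)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]      # hypothetical model scores
area = auc(roc_points(labels, scores))
```

The resulting area equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, which is the sense in which the AUC measures separability.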

Results

We use two scenarios for the experiment, as follows:

Scenario 1: We use algorithms such as RF, Decision Tree (DT), LightGBM, and XGBoost to build classifiers that classify consumption data with the features extracted by multiple symbolic aggregate approximation.

Scenario 2: We use the proposed model, in which multiple symbolic aggregate approximation extracts the features and PUL is combined with classifiers such as RF, DT, LightGBM, and XGBoost to classify the data.

The two scenarios are used to compare the performance of our proposed method with the base classifiers. We run each model 30 times and evaluate the outcomes using the mean ($\mu$) and standard deviation ($\sigma$) of the AUC and of the F₁ score. The results are shown in Table 6.

We use the SAX algorithm to calculate the normal behavior of electricity consumption by week for each meter. Figure 5 shows electricity consumption by week of normal and fraud meters. Figure 6 shows electricity consumption by day of normal and fraud meters. The thick red line in each figure represents normal behavior. The figures show clearly that in both cases, the normal behavior of normal meters is periodic and stable. By contrast, the normal behavior of fraud meters is unstable.

FIG. 5.

(a, b) Graphs of electricity consumption by the week of normal meters. (c, d) Graphs of electricity consumption by the week of fraud meters.

FIG. 6.

(a, b) Graphs of electricity consumption by the day of normal meters. (c, d) Graphs of electricity consumption by the day of fraud meters.

Table 6.

Comparison with state-of-the-art using the original imbalanced data set

Classifiers	Scenario 1: AUC, % ($\mu \pm \sigma$)	Scenario 1: F₁ score, % ($\mu \pm \sigma$)	Scenario 2: AUC, % ($\mu \pm \sigma$)	Scenario 2: F₁ score, % ($\mu \pm \sigma$)
Decision Tree	$61.8 \pm 1.2$	$43.3 \pm 2.9$	$\mathbf{62.4 \pm 1.4}$	$\mathbf{47.7 \pm 2.7}$
Random Forest	$60.6 \pm 3.4$	$40.7 \pm 9.8$	$\mathbf{64.3 \pm 5.9}$	$\mathbf{61.5 \pm 3.6}$
LightGBM	$65.9 \pm 4.3$	$55.7 \pm 7.4$	$\mathbf{66.3 \pm 3.6}$	$\mathbf{60.4 \pm 2.6}$
XGBoost	$63.8 \pm 3.2$	$51.7 \pm 3.4$	$\mathbf{65.4 \pm 2.5}$	$\mathbf{59.1 \pm 2.0}$

Bold values indicate the best performance.

AUC, area under the ROC curve; ROC, receiver operating characteristic.

The results of the scenarios are given in Table 6. Overall, combining PUL with the classifiers gives better results than using the classifiers alone. When PUL is combined with DT, LightGBM, XGBoost, and RF, the mean F₁ score in Table 6 increases by about 4.4, 4.7, 7.4, and 20.8 percentage points, respectively, so the largest gain is obtained with RF. Table 6 also shows that the F₁ scores obtained with PUL are more stable than those of the separate classifiers: the standard deviation of the F₁ score is smaller for all four classifiers. Figures 7–10 show box plots of the AUC and F₁ score over the 30 distinct runs for the two scenarios.

FIG. 7.

Evaluation metric AUC score of four classifiers.

FIG. 8.

Evaluation metric F₁ score of four classifiers.

FIG. 9.

Evaluation metric AUC score of proposed model with four classifiers.

FIG. 10.

Evaluation metric F₁ score of proposed model with four classifiers.

The training time of the models in the two scenarios is shown in Table 7. When PUL is combined with the classifiers, the mean training time is about 20 to 24 times higher than with the separate classifiers.

Table 7.

Training time of the two scenarios

	Scenario 1	Scenario 2
Classifiers	Mean time (seconds)	Mean time (seconds)
Decision Tree	32.22	656.06
Random Forest	485.66	9765.96
LightGBM	1.93	47.07
XGBoost	32.33	675.39
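The gains and slowdowns discussed in this section can be rechecked from the published means with a few lines of arithmetic. The values below are copied from Tables 6 and 7; the dictionary layout and variable names are illustrative, and the results are rounded to one decimal, so they may differ slightly from figures computed on unrounded data.

```python
# Mean F1 score (%) per classifier: (Scenario 1, Scenario 2), from Table 6.
f1_mean = {
    "Decision Tree": (43.3, 47.7),
    "Random Forest": (40.7, 61.5),
    "LightGBM": (55.7, 60.4),
    "XGBoost": (51.7, 59.1),
}

# Mean training time (seconds) per classifier: (Scenario 1, Scenario 2), from Table 7.
train_seconds = {
    "Decision Tree": (32.22, 656.06),
    "Random Forest": (485.66, 9765.96),
    "LightGBM": (1.93, 47.07),
    "XGBoost": (32.33, 675.39),
}

# Absolute F1 gain in percentage points when PUL is added.
f1_gain = {name: round(s2 - s1, 1) for name, (s1, s2) in f1_mean.items()}

# How many times longer training takes when PUL is added.
slowdown = {name: round(t2 / t1, 1) for name, (t1, t2) in train_seconds.items()}

for name in f1_mean:
    print(f"{name}: +{f1_gain[name]} F1 points, {slowdown[name]}x training time")
```

Read together, the two dictionaries show the trade-off of the proposed model: the largest F₁ gain comes with RF, which is also the classifier with the longest absolute training time.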

Conclusions

This research proposes a new model that combines SAX, data imbalance handling, and RF for anomaly detection in sensor systems. Concretely, the contributions of our article are as follows:

Successfully using multiple SAX on time series data to find complex and dynamic anomaly patterns.

Applying the extracted anomaly patterns in machine learning models.

Proposing a model that combines multiple SAX, an imbalance technique, and RF for anomaly detection.

Applying the proposed model in an automatic meter intelligence system in Vietnam.

According to the experimental results, our proposed model performs better than well-known machine learning models used alone. The better results are attributed to the complex and dynamic anomaly patterns selected in meter intelligence systems.

In future work, more sensor anomaly detection applications will be researched based on the proposed model. Advanced SAX variants for finding new patterns will be investigated. The detection of anomalies in subsequences of the symbolic normal patterns also needs to be studied.

Footnotes

Authors' Contributions

N.T.N.A. contributed to conceptualization (lead), methodology (lead), software, and writing. V.H.T. was in charge of writing—original draft, software development, formal analysis, visualization, and validation. D.K. and D.M.T. were in charge of conceptualization (supporting), writing—original draft (supporting), writing—review and editing, and investigation. L.A.N. was in charge of methodology, investigation, supervision, and writing—review and editing.

Author Disclosure Statement

No competing financial interests exist.

Funding Information

This research was supported by Energy Cloud R&D Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, Information and Communication Technologies (2019M3F2A1073387). In addition, this work is partly supported by the research project number DHFPT/2022/27 granted by the FPT University.

References

1. Ray, Dash. IoT-edge anomaly detection for covariate shifted and point time series health data. J King Saud Univ Comp Inform Sci. 2021; 34:9608–9621.

2. Thill, Konen, Wang, Bäck. Temporal convolutional autoencoder for unsupervised anomaly detection in time series. Appl Soft Comp. 2021; 112:107751.

3. …, Sun. Policy-based reinforcement learning for time series anomaly detection. Eng Appl Artif Intell. 2020; 95:103919.

4. Tung, Anh NTN, Anh NHQ. Feature selection using genetic algorithm and bayesian hyper-parameter optimization for LSTM in short-term load forecasting. In: Tran DT, Jeon G, Nguyen TDL, et al. (Eds.): Intelligent Systems and Networks. Singapore: Springer, 2021.

5. Dat, Ngoc Anh, Nhat Anh, Solanki. Hybrid online model based multi seasonal decompose for short-term electricity load forecasting using ARIMA and online RNN. J Intell Fuzzy Syst. 2021; 41:5639–5652.

6. González-Vidal, Cuenca-Jara, Skarmeta. IoT for water management: Towards intelligent anomaly detection. In: 2019 IEEE 5th World Forum on Internet of Things (WF-IoT), 2019.

7. Zhang, Chen, Yin, Wang. Anomaly detection in ECG based on trend symbolic aggregate approximation. Math Biosci Eng. 2019; 16:2154–2167.

8. Yahyaoui, Al-Daihani. A novel trend based SAX reduction technique for time series. Expert Syst Appl. 2019; 130:113–123.

9. Zhou, Ren, Li, Pedrycz. An anomaly detection framework for time series data: An interval-based approach. Knowl Based Syst. 2021; 228:107153.

10. Liu, Lin, Xiao, et al. Self-adversarial variational autoencoder with spectral residual for time series anomaly detection. Neurocomputing. 2021; 458:349–363.

11. Lin, Keogh, Wei, Lonardi. Experiencing SAX: A novel symbolic representation of time series. Data Mining Knowl Discov. 2007; 15:107–144.

12. Baldini, Giuliani, Steri, Gentile. The application of the symbolic aggregate approximation algorithm (SAX) to radio frequency fingerprinting of IoT devices. In: 2017 IEEE Symposium on Communications and Vehicular Technology (SCVT), Leuven, Belgium, 2017. pp. 1–6.

13. Hernández Fusilier, Montes y Gómez, Rosso, Guzmán Cabrera. Detecting positive and negative deceptive opinions using PU-learning. Inform Process Manag. 2015; 51:433–443.

14. …, Lee, Hwang, et al. PUMAD: PU metric learning for anomaly detection. Inform Sci. 2020; 523:167–183.

15. …, Gao, Jin, et al. A seasonal-trend decomposition-based dendritic neuron model for financial time series prediction. Appl Soft Comp. 2021; 108:107488.

16. Qin, Zhou, Cao. Imbalanced learning algorithm based intelligent abnormal electricity consumption detection. Neurocomputing. 2020; 402:112–123.

17. Oprea S-V, BaBara. Machine learning classification algorithms and anomaly detection in conventional meters and Tunisian electricity consumption large datasets. Comput Electr Eng. 2021; 94:107329.

18. Razavi, Gharipour, Fleury, Akpan. A practical feature-engineering framework for electricity theft detection in smart grids. Appl Energy. 2019; 238:481–494.

19. Carnevali, Rossi, Milios, de Andrade Lopes. A graph-based approach for positive and unlabeled learning. Inform Sci. 2021; 580:655–672.

20. Reza Loghmani, Vincze, Tommasi. Positive unlabeled learning for open set domain adaptation. Pattern Recogn Lett. 2020; 136:198–204.

21. Xiong, Zuo. A positive and unlabeled learning algorithm for mineral prospectivity mapping. Comp Geosci. 2020; 147:104667.