Abstract
Outlier detection is an important branch of data mining. Based on the characteristics of big data, this paper proposes an improved fast density peak outlier detection algorithm built on the density peak clustering algorithm. Although the method follows a clustering idea, it avoids the clustering process itself, which reduces the time complexity of cluster-based outlier detection, and it retains the advantages of neighbor-based outlier detection, such as insensitivity to data dimensionality. In the power industry, outlier detection can be used in areas such as grid fault detection, equipment fault detection, and power abnormality detection. A simulation experiment of outlier detection based on the daily load curves of single and multiple transformers in a certain province shows that the improved algorithm can effectively detect outliers in the data.
Introduction
With the construction and development of smart grids, a large number of data collection devices have been installed and deployed in the major links of use and dispatch, and corresponding information management systems have been built [1]. These information systems generate and manage large volumes of data with diverse structures, which are the main source of big data on electricity. Big data can be applied to all aspects of the power grid, such as power grid planning, new energy grid connection, and demand side management [2]. Effective mining of big data on power distribution and consumption can promote the transformation of the power grid from a business model based on traditional physical models to a business model based on data.
Due to different data sources, different statistical calibers for the same data, abnormal behaviors, and the lack of a corresponding data quality control system, abnormal data are often generated [3]. Abnormal data contain information about abnormal conditions in the system, so they have considerable research value and can support practical applications. Outlier detection has important applications in the power industry. Early anomaly detection, such as equipment failure detection and power usage anomaly detection, relied on on-site investigation by technicians [4]. This approach is inefficient and wastes manpower and material resources. Data-driven anomaly detection helps to locate abnormal events automatically, increase the detection rate of abnormal events, reduce inspection costs, and reduce the economic losses of companies.
Research on outlier detection began in the field of mathematical statistics [5]. Machine learning, data mining, and other disciplines subsequently took up the problem. At present, there are many methods for outlier detection, such as methods based on statistics, on clustering, on classification, and on the neighbor model.
Statistics-based outlier detection methods assume that the data set follows a certain probability distribution model; sample points that do not conform to the model are judged to be outliers [6]. Statistics-based outlier detection is generally divided into two phases, the training phase and the detection phase. The training phase fits the statistical model to the data, and the detection phase determines the outlier degree of each data point under test. Both parametric and non-parametric statistical outlier detection methods have been introduced in the literature. However, statistics-based outlier detection methods have a common shortcoming: they often fail to give good results on high-dimensional data [7].
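The two phases can be illustrated with a minimal sketch. It assumes a single Gaussian model, and the function names `fit_gaussian` and `outlier_degree`, the sample values, and the threshold of three standard deviations are illustrative choices rather than anything taken from the cited literature.

```python
import statistics

def fit_gaussian(train):
    """Training phase: estimate the mean and standard deviation of the data."""
    return statistics.fmean(train), statistics.stdev(train)

def outlier_degree(x, model):
    """Detection phase: distance of x from the mean, in standard deviations."""
    mean, std = model
    return abs(x - mean) / std

# Hypothetical one-dimensional measurements used only for illustration.
train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
model = fit_gaussian(train)

# A point is flagged when its outlier degree exceeds the assumed threshold of 3.
flags = [outlier_degree(x, model) > 3.0 for x in [10.05, 14.0]]
```

With these values the first test point lies well within the fitted distribution and the second lies far outside it, so only the second is flagged.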
The cluster-based outlier detection method regards sample points that do not belong to any cluster, or points far away from a cluster center, as abnormal points. Some clustering algorithms, such as the common DBSCAN algorithm, take the impact of outliers into account in their design and introduce a mechanism that yields the outliers together with the clustering results [8]. For other clustering algorithms, after clustering is completed it is necessary to calculate the distance between each sample point and the cluster center to determine the abnormal points. According to the clustering principle, cluster-based outlier detection methods can be divided into those based on partition clustering, hierarchical clustering, density clustering, and grid-based clustering. However, cluster-based outlier detection methods usually share a shortcoming: their time complexity is mostly O(n^2), and they are not specifically optimized for outlier detection, so their detection efficiency is not high.
This paper proposes a fast density peak outlier detection algorithm based on the characteristics of big data. The algorithm builds on the fast density peak clustering algorithm proposed by Huang Y et al. [9]. At present, few studies use the fast density peak clustering algorithm for outlier detection. Literature [10] regards points whose distance is greater than the average distance as outliers, but this approach does not give good results when applied to outlier detection in power big data. This paper addresses two weaknesses of the original algorithm: it does not consider the local characteristics of the data, and it depends on the cutoff distance. Although the method follows a clustering idea, it avoids the clustering process and thus reduces the time complexity of cluster-based outlier detection, while also drawing on the advantages of neighbor-based outlier detection, such as insensitivity to data dimensionality. A simulation experiment of outlier detection based on the daily load curves of single and multiple transformers in a certain province shows that the improved algorithm can effectively detect outliers in the data.
Improved density peak algorithm
One of the core tasks of power big data applications is in-depth analysis, because the value of data comes from its analysis [11]. Electric power big data is massive, complex, diverse, and rapidly changing. It is commonly analyzed and mined with statistical analysis methods and some emerging methods. User electricity consumption behavior analysis is a typical scenario: it analyzes and mines the information reflected by the user's electricity load curve to better understand the user's consumption behavior. This information can be applied in many ways. User classification based on consumption behavior can provide relevant information for medium- and short-term load forecasting, thereby improving forecasting accuracy. The results of user classification can also be used to study changes in user energy consumption [12] and, as the basis for user billing, to formulate tariff rates for the energy trading market. The load model obtained through load curve clustering can improve the accuracy of grid simulation compared with the traditional average load model, thereby better supporting the planning and operation of the grid [13]. Cluster analysis of user load curves is the basis of many tasks, such as load forecasting, load model calculation, and user segmentation. With the gradual implementation of a new round of electricity reform, the retail side has opened up and the status of electricity users has further improved. Analyzing users' consumption behavior from the perspective of their load and consumption patterns, so as to provide energy-saving suggestions and personalized service plans, is an important tool for power grid companies to win users.
Electric power big data is also high-dimensional. Analysis and mining usually require dimensionality reduction, and therefore feature selection algorithms. The quality of feature extraction directly affects the analysis results. In addition to business domain knowledge, feature extraction requires deep background knowledge of statistics and machine learning modeling [15]. As the diversity and complexity of data continue to increase, the manpower and time required for feature extraction will continue to grow. Therefore, it is of great significance to study feature extraction algorithms for power big data.
The classification-based outlier detection method classifies unlabeled samples when enough labeled samples are available. It is usually divided into two stages: the training phase learns a classification model of normal and abnormal samples, and the test phase classifies the objects to be examined. Classification-based outlier detection has obvious shortcomings. Its effectiveness depends on the prediction accuracy of the classification algorithm and on whether there are enough training samples with correct category labels.
Outlier detection based on the neighbor model determines whether a sample point is an outlier from the difference between that point and its neighbors. If a sample point is very different from its neighboring sample points, it is considered an outlier. The advantage of the neighbor-based approach is that it does not need to know the distribution of the data, does not need pre-labeled training samples, and is not sensitive to the dimensionality of the data. Its disadvantage is that the nearest neighbor search adds to the time complexity, and for data with a complex structure it is difficult to select a suitable distance function. The big data anomaly detection system architecture is shown in Fig. 1.

Fig. 1. Big data anomaly detection system architecture.
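As a rough illustration of the neighbor-based idea, the sketch below scores each point by its mean Euclidean distance to its k nearest neighbors, so that a point unlike its neighbors receives a large score. The scoring rule, the function name `knn_score`, and the sample points are assumptions made for this example only.

```python
import math

def knn_score(points, k):
    """Outlier score of each point: mean distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Four points forming a unit square plus one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
scores = knn_score(points, k=2)

# Index of the point that differs most from its neighbors.
most_abnormal = max(range(len(points)), key=scores.__getitem__)
```

No distribution model or labeled sample is needed, but every point is compared with every other point, which is the search cost mentioned above.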
The density-based method assumes that the density around normal data points is higher than that around abnormal points. To address problems encountered in actual detection, scholars have proposed an outlier detection method based on the Local Outlier Factor (LOF) [13]. For a given data point A, $D_K(A)$ is the distance from A to its K-th nearest neighbor, and $L_K(A)$ is the set of the K nearest neighbors of A. The density peak algorithm is shown in Fig. 2.

Fig. 2. Density peak algorithm.
The reachable distance $D_i(A, B)$ from data point A to data point B is defined as the larger of the distance between the two points and the K-nearest-neighbor distance of B:

$$D_i(A, B) = \max\{\operatorname{dist}(A, B),\; D_K(B)\} \tag{1}$$

In the formula, $\operatorname{dist}(A, B)$ is the distance between the two points; the reachable distance from A to B is asymmetric. Intuitively, when A is far from B, the reachable distance is simply the actual distance between them. When the distance between the two points is small, the reachable distance is smoothed out by the K-neighbor distance of B, so the reachable distances between different points become more similar. The average reachable distance $D_i(A)$ from data point A to all its K neighbors is defined as:

$$D_i(A) = \frac{1}{|L_K(A)|} \sum_{B \in L_K(A)} D_i(A, B) \tag{2}$$

The local reachable density of data point A can be quantitatively expressed as the reciprocal of the average reachable distance:

$$\frac{1}{D_i(A)} = \frac{|L_K(A)|}{\sum_{B \in L_K(A)} D_i(A, B)} \tag{3}$$

The average ratio of the reachable density of the K nearest neighbors to the reachable density of data point A is taken as the local anomaly factor of A, namely:

$$G_i(A) = \frac{1}{|L_K(A)|} \sum_{B \in L_K(A)} \frac{1 / D_i(B)}{1 / D_i(A)} \tag{4}$$
A value of $G_i(A)$ clearly greater than 1 indicates that A lies in a sparser region than its neighbors and may be an outlier; the factor has no fixed range. When the data set is large and its internal structure is complex, the neighbors obtained for a point may belong to different clusters, which distorts the calculation of the nearest neighbor reachability and can lead to a large difference between the obtained result and the actual value [15].
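The local anomaly factor described above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance, distinct points, and exactly K neighbors per point (ties at the K-th distance are not expanded); the function name `lof_scores` and the sample points are hypothetical.

```python
import math

def lof_scores(points, k):
    """Local anomaly factor of every point, computed from pairwise distances."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # neighbors[a]: indices of the k nearest neighbors of a
    # k_dist[a]:    distance from a to its k-th nearest neighbor
    neighbors, k_dist = [], []
    for a in range(n):
        order = sorted((b for b in range(n) if b != a), key=lambda b: dist[a][b])
        neighbors.append(order[:k])
        k_dist.append(dist[a][order[k - 1]])

    # Local reachable density: reciprocal of the mean reachable distance.
    density = []
    for a in range(n):
        reach = [max(dist[a][b], k_dist[b]) for b in neighbors[a]]
        density.append(k / sum(reach))

    # Factor: mean ratio of the neighbors' density to the point's own density.
    return [sum(density[b] for b in neighbors[a]) / (k * density[a]) for a in range(n)]

# Four points forming a unit square plus one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
scores = lof_scores(points, k=2)
```

In this example the four corner points have a factor of 1, because each has the same density as its neighbors, while the isolated point has a much larger factor.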
The fast density peak clustering algorithm is mainly based on two assumptions: first, a cluster center is surrounded by neighbors with lower density; second, the distance between a cluster center and any other point with higher density is relatively large. For each sample point, two parameters therefore need to be measured: the local density $\alpha_i$ and the distance $\beta_i$. The calculation of the local density $\alpha_i$ also depends on another parameter, the cutoff distance $\delta_k$. Generally, all pairwise distances between sample points are sorted in ascending order and the value at the 2% position is selected as $\delta_k$. The local density $\alpha_i$ of any sample point $x_i$ in the data set is defined as:

$$\alpha_i = \sum_{j \neq i} \chi(\delta_{ij} - \delta_k), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \geq 0 \end{cases} \tag{5}$$

where $\delta_{ij}$ is the distance between $x_i$ and $x_j$, so that $\alpha_i$ counts the sample points that are closer to $x_i$ than the cutoff distance. The distance $\beta_i$ is defined as:

$$\beta_i = \min_{j:\, \alpha_j > \alpha_i} \delta_{ij} \tag{6}$$
$\beta_i$ represents the minimum distance between the sample point $x_i$ and any sample point with greater density; for the sample with the highest density in the data set, $\beta_i = \max_j(\delta_{ij})$ [16]. Data points with smaller local density and larger distance will be identified as outliers.
Compute the pair $(\alpha_i, \beta_i)$ for every sample in the data set X, then draw the two-dimensional plot of $\alpha_i$ against $\beta_i$, which is called a decision diagram. Points with both a large $\alpha_i$ value and a large $\beta_i$ value can be found in the decision diagram, and these points can be used as the cluster centers of the data set X. From the perspective of outlier detection, points with a small $\alpha_i$ value and a large $\beta_i$ value can also be seen intuitively in the decision diagram, and these points can be initially identified as abnormal points. The algorithm can easily identify cluster centers when the number of clusters is uncertain, and it can also identify outliers. However, the algorithm is suited to data sets with a specific structure and performs poorly on data sets with relatively sparse cluster centers. In addition, determining cluster centers and abnormal points involves the subjective judgment of the researcher: different researchers choose different $\alpha_i$ and $\beta_i$ values, so the results obtained also differ.
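The two parameters can be computed directly from the pairwise distances. The sketch below assumes Euclidean distance and the cutoff (counting) form of the local density; the function name `density_peak_parameters`, the sample points, and the cutoff value 1.2 are illustrative assumptions.

```python
import math

def density_peak_parameters(points, cutoff):
    """Return (alpha, beta): local density and distance to the nearest denser point."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # alpha_i: number of other points closer to point i than the cutoff distance.
    alpha = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
             for i in range(n)]

    beta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if alpha[j] > alpha[i]]
        # The densest point has no denser neighbor, so its largest distance is used.
        beta.append(min(denser) if denser else max(dist[i]))
    return alpha, beta

# A small cluster (square corners plus its center) and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (8.0, 8.0)]
alpha, beta = density_peak_parameters(points, cutoff=1.2)
```

Here the center of the square has the highest density and a large distance, which marks it as a cluster center, while the isolated point combines the lowest density with a large distance, which marks it as a candidate outlier.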
In power big data, abnormal data often result from different data sources, different statistical calibers for the same data, data entry by frontline personnel, abnormal behaviors, and the lack of a corresponding data quality control system. Abnormal data contain information about abnormal conditions in the system, so they have considerable research value and can support practical applications such as equipment failure detection and power abnormality detection. This section applies the fast density peak clustering algorithm to outlier detection and, to address the shortcomings of its local density definition, uses the K-nearest neighbors (KNN) idea to redefine the local density. The resulting KNN-based fast density peak outlier detection algorithm achieves more accurate outlier detection.
Outlier detection with the fast density peak clustering algorithm is affected neither by the clusters nor by the number of initial clusters [17]. However, the accuracy of the algorithm depends on the selection of the cutoff distance, and the local characteristics of the data are not considered when calculating the local density, so its effect is not very satisfactory on some data sets.
The core of the proposed algorithm is to use KNN to calculate the local density and distance of each sample. The KNN-based local density and distance take into account both the global and the local characteristics of the data set, and outlier judgment rules are given on this basis, so the result is more satisfactory [18].
In the original data set, calculate the Euclidean distance $D(x_i, x_j)$ between any sample $x_i$ and every other sample, and arrange the results in ascending order. Record the sample corresponding to the k-th distance as $\mathrm{KNN}_k(x_i)$. The K nearest neighbors of $x_i$ are:

$$\mathrm{KNN}(x_i) = \{\, x_j \mid j \neq i,\; D(x_i, x_j) \leq D(x_i, \mathrm{KNN}_k(x_i)) \,\} \tag{7}$$
$\mathrm{KNN}_k(x_i)$ is used to calculate the local density of $x_i$:
In the formula, K is determined by the parameter q, the percentage of the number of samples N, so that $K = q \cdot N$; the greater the value of the local density, the denser the neighborhood of $x_i$. On the basis of formula (8), the definition of the KNN distance is given as:
After calculating $(\alpha_i, \beta_i)$ for every sample, determine the outliers in the data set. Points with smaller density and larger distance are likely to be outliers, because they have fewer neighbors around them and lie farther from the other samples [19].
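One plausible way to compute KNN-based parameters is sketched below. The specific forms used here are assumptions chosen for illustration and may differ from formulas (8) and (9): the density is taken as the reciprocal of the mean distance to the K nearest neighbors, and the distance is taken as the distance to the nearest sample of higher density. The function name `knn_density_and_distance` and the sample points are hypothetical, and the points are assumed to be distinct.

```python
import math

def knn_density_and_distance(points, q):
    """Assumed KNN-based alpha (local density) and beta (distance) per sample."""
    n = len(points)
    k = max(1, round(q * n))  # K = q * N, with at least one neighbor
    dist = [[math.dist(p, r) for r in points] for p in points]

    alpha = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        # Assumption: density is the reciprocal of the mean K-nearest-neighbor distance.
        alpha.append(len(nearest) / sum(nearest))

    beta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if alpha[j] > alpha[i]]
        # The densest sample has no denser neighbor, so its largest distance is used.
        beta.append(min(denser) if denser else max(dist[i]))
    return alpha, beta

# A small cluster (square corners plus its center) and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (8.0, 8.0)]
alpha, beta = knn_density_and_distance(points, q=0.4)
```

Because the density is computed from each sample's own neighbors rather than from a global cutoff distance, no cutoff parameter has to be chosen; only the percentage q is needed.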
This article holds that an abnormal sample should meet the following conditions: $\alpha_i < \alpha_k$ and $\beta_i > \beta_k$. That is, when the local density is below its threshold and the distance is above its threshold, the sample point can be judged to be an abnormal value. The local density threshold $\alpha_k$ is defined as:
The distance threshold $\beta_k$ is defined as:
In the formula, N is the total number of samples in the data set, and $\varphi_\alpha$ and $\varphi_\beta$ are empirical parameters [20].
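The judgment rule, low density combined with large distance, can be sketched as follows. The threshold form used here is an assumption for illustration only: each threshold is taken as the sample mean scaled by an empirical parameter. The function name `detect_outliers`, the parameter values, and the sample values are hypothetical.

```python
def detect_outliers(alpha, beta, phi_alpha, phi_beta):
    """Return indices of samples with density below alpha_k and distance above beta_k."""
    n = len(alpha)
    # Assumption: each threshold is the sample mean scaled by an empirical parameter.
    alpha_k = phi_alpha * sum(alpha) / n
    beta_k = phi_beta * sum(beta) / n
    return [i for i in range(n) if alpha[i] < alpha_k and beta[i] > beta_k]

# Hypothetical densities and distances: sample 4 looks like a cluster center
# (high density, large distance) and sample 5 like an outlier (low density, large distance).
alpha = [3.0, 3.0, 3.0, 3.0, 4.0, 0.1]
beta = [0.7, 0.7, 0.7, 0.7, 10.6, 9.9]
outliers = detect_outliers(alpha, beta, phi_alpha=0.5, phi_beta=1.0)
```

Requiring both conditions keeps cluster centers from being flagged: they also have a large distance, but their density is above the threshold.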
Overall architecture
Consumer electricity behavior analysis based on discrete wavelet transform (DWT) feature extraction and the improved fast density peak clustering algorithm consists of three parts [21]. The first part performs DWT feature extraction on the load data, decomposing the load curve into spectral components at different time scales; the second part takes the components obtained in the first part as input and clusters each of them to obtain typical curves at different time scales; the third part reconstructs the typical load curves from the typical curves at different scales and analyzes them. The specific implementation process is shown in Fig. 3.
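The decomposition and reconstruction steps can be sketched with a three-level Haar transform. The choice of the Haar wavelet is an assumption made to keep the example self-contained, since the wavelet family is not specified here; the function names and the synthetic curve are hypothetical.

```python
import math

def haar_step(signal):
    """One DWT level: return (approximation, detail) coefficients."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal, levels=3):
    """Return [a_L, d_L, ..., d_1]; the length must be divisible by 2**levels."""
    coeffs = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        coeffs.insert(0, detail)  # coarser details go in front of finer ones
    coeffs.insert(0, approx)
    return coeffs

def haar_reconstruct(coeffs):
    """Invert haar_decompose, rebuilding the signal from coarse to fine."""
    s = math.sqrt(2.0)
    approx = coeffs[0]
    for detail in coeffs[1:]:
        merged = []
        for a, d in zip(approx, detail):
            merged += [(a + d) / s, (a - d) / s]
        approx = merged
    return approx

# Synthetic 96-point daily curve used only for illustration.
curve = [math.sin(2 * math.pi * t / 96) for t in range(96)]
coeffs = haar_decompose(curve, levels=3)
restored = haar_reconstruct(coeffs)
```

For a 96-point curve the three levels give a 12-point approximation and detail components of 12, 24, and 48 points. Each component could be clustered separately, and typical curves could then be rebuilt with the inverse transform.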

Fig. 3. User electricity consumption behavior analysis architecture diagram.
The data used in this section are user power load data from a certain province. The collection interval is 15 min, so each daily load curve has 96 data points. Because of problems such as collection equipment failures and communication failures, data collection sometimes fails, which leaves dirty data in the original smart meter data. To ensure the accuracy of the results, the original recorded data were first given a simple preprocessing step to eliminate the large number of empty values [22].
For the typical load curve analysis of a single user, the data are the daily load data of a randomly selected user for the whole year of 2019. For the typical load curve analysis of users in a certain industry, the data are the 2019 daily load data of all 54 users of that industry in the province; 16,748 records remain after excluding null values. For the typical load curve analysis of multiple high-energy users, the data are the 2019 daily load data of the province's 65 high-energy users, covering industries such as electrolytic aluminum, iron and steel, and ferroalloys, which are the largest electricity consumers in the province; after simple null removal, 15,353 records remain. Data normalization is very important for distance-based algorithms and can speed up training. Therefore, this paper normalizes the 96-point load data according to the formula.
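A common choice for this kind of normalization is min-max scaling of each daily curve. The sketch below assumes that form, which may differ from the normalization actually used; the function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(curve):
    """Scale one daily load curve to the range [0, 1]."""
    low, high = min(curve), max(curve)
    if high == low:
        # A constant curve has no variation to scale.
        return [0.0] * len(curve)
    return [(v - low) / (high - low) for v in curve]

normalized = min_max_normalize([2.0, 4.0, 6.0])
```

After scaling, curves from users with very different load magnitudes become comparable in shape, which is what a distance-based algorithm needs.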
Typical load curve and analysis of a single user
The classification of users is based on different user types, and different user types are described by different typical load curves [23]. Therefore, this article first conducts cluster analysis on the load curve of a single user to find a typical load curve that can describe the user.
Data standardization
In actual work, abnormal values are found in electric power big data, and their existence adversely affects the results of feature extraction and cluster analysis. Therefore, this paper proposes a KNN-based fast density peak outlier detection algorithm suited to the large volume, rapid change, and high dimensionality of electric power data. The algorithm uses the idea of KNN to redefine the two parameters of the original fast density peak clustering algorithm, local density and distance, optimizing it from the perspective of outlier detection and making up for the original algorithm's failure to consider the local characteristics of the data and its dependence on the cutoff distance. In terms of outlier detection algorithm classification, this algorithm is a cluster-based detection algorithm, but it avoids the clustering process and reduces the outlier detection time. It also retains the advantages of nearest-neighbor-based outlier detection: it is not sensitive to the data dimension and does not need labeled samples. The experimental results based on a single transformer and multiple transformers show the effectiveness of the algorithm. They also show that the algorithm is suitable for outlier detection not only on small data sets but also on large data sets.
After the data of a certain user are standardized, the daily load data at 96 points throughout the year are shown in Fig. 4.

Fig. 4. A user's daily point data curve.
KNN-based fast density peak clustering is performed on each component at the different time scales. The cluster centers at each scale are shown in Fig. 5. The method therefore describes the electricity consumption behavior of users more comprehensively and provides a new idea for the analysis of users' electricity consumption behavior. The analysis method was tested on the electricity load data of individual users and industrial users, and good results were obtained.

Fig. 5. Component cluster center.
From the clustering of the third-level approximation components, the user's curves can be divided into two categories in terms of overall trend. The blue curves in Fig. 5 are the first category, with a total of 150 records. The first type of curve has three peaks, and the second type has two peaks. At other time scales, one category contains far more records than the other; the minority category can be regarded as abnormal points of the data, and abnormal sample points are not considered in the clustering. Therefore, the typical load curve of this user can be divided into two categories. Table 1 shows the results of the cluster analysis of this user.
Table 1. Clustering results
According to the cluster centers at each time scale, the user's typical load curve is reconstructed. It can be seen from Table 1 that two typical load curves can be reconstructed for the user, and the values of the two curves can be obtained by formula. The results are shown in Fig. 6.

Fig. 6. Reconstruction of typical user load curve.
On the basis of using the KNN idea to improve the local density and distance parameters of the fast density peak clustering algorithm, and because the original algorithm requires possible cluster centers to be identified manually in the decision graph, the automatic selection of cluster centers is realized according to the outward statistical testing method. At the same time, according to the characteristics of the electricity load data, the discrete wavelet transform is used to extract features of the load curve at multiple time scales; the features at different time scales are clustered, the typical curve is reconstructed, and the electricity consumption behavior analysis is then carried out. The feature extraction of this electricity consumption behavior analysis model is more complete, and less information is lost in going from the original data to the characteristic parameters.
From the results obtained, feature extraction by discrete wavelet transform combined with the KNN-based fast density peak clustering algorithm clusters the user's load curves effectively. Judging from the two typical load curves of the user, the first peak of the blue curve appears from 3 am to 6 am, and the second peak appears from 10 am to 12 noon. By combining the user's file information and comprehensively considering the user's power consumption characteristics and type as well as external market, environmental, and seasonal factors, peak-shifting and energy-saving suggestions can be proposed to the user.
In this case, in addition to the simple null elimination applied to users in multiple high-energy-consuming industries, the original data also include a certain number of all-zero records and records that do not change at all throughout the day. These data are not considered for the time being and are therefore eliminated. The number of data records changed from 18009 to 18624. The eliminated records account for 1.1% of the original data, which can be considered not to affect the subsequent analysis results.
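The elimination of null, all-zero, and unchanging records can be sketched as a simple filter. The record layout (one list of 96 values per day, with `None` marking a missing value) and the function name `clean_records` are assumptions made for this example.

```python
def clean_records(records, points_per_day=96):
    """Keep only complete daily curves that actually vary during the day."""
    kept = []
    for curve in records:
        if len(curve) != points_per_day or any(v is None for v in curve):
            continue  # incomplete record or record containing null values
        if max(curve) == min(curve):
            continue  # all-zero or otherwise constant throughout the day
        kept.append(curve)
    return kept

# Hypothetical records: all-zero, constant, containing a null, and valid.
records = [
    [0.0] * 96,
    [1.0] * 96,
    [None] + [1.0] * 95,
    [float(t) for t in range(96)],
]
cleaned = clean_records(records)
```

Only the last record survives the filter; the other three are dropped for the reasons described above.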
Outlier detection
In order to get more accurate results, this case first uses the KNN-based fast density peak outlier detection algorithm described in Section 2 to detect and eliminate outliers. Figure 7 is the decision diagram for outlier detection. According to the outlier judgment rule, the outliers in the data are screened out. The curves of the outliers are shown in Fig. 7.

Fig. 7. Outliers of high energy consumption industry.
It can be seen from the outlier curves of high-energy-consuming industries that most curves identified as outliers have only a few points that are much larger than the other points, forming a step-like curve. This abnormal phenomenon may be related to abnormal signals, failure of acquisition equipment, abnormality of production equipment, etc. The statistics of clustering results are shown in Table 2. There are a total of 49 abnormal curves. These abnormal values are temporarily filtered out, and the remaining 18,624 records are used as the original data for subsequent electricity consumption behavior analysis.
Table 2. Statistics of clustering results
Analyzing the clustering results from the perspective of users, Table 3 shows some of the results. It can be seen from the table that some users can be represented by one typical curve and some users can be expressed by two curves (due to space limitations, not all results are listed), which shows that users in high-energy-consuming industries use electricity very regularly. According to the characteristics of the different typical curves of users, the times at which peaks and valleys appear can be analyzed, as can the times of year at which different typical curves appear, from which the seasonal characteristics of the industry can be learned. The changing trends of the curves can also be used to analyze the prosperity of the user or industry, providing a basis for social and economic forecasts [24].
Table 3. User categories
This paper takes electric power big data as the research object and explores methods for its abnormal value detection and electricity behavior analysis. First, feature extraction and clustering analysis algorithms are studied according to the characteristics of big data; the experimental results show that the discrete wavelet transform and the fast density peak clustering algorithm have advantages when applied to electric power big data. In addition, the outliers found in actual electric power big data affect the accuracy of data feature extraction and clustering. Therefore, a KNN-based fast density peak outlier detection algorithm is proposed, and good results are obtained on the actual data set.
Determine the feature extraction algorithm and clustering algorithm suitable for power big data.
Based on a full study of the basic principles of the discrete wavelet transform and the Gaussian mixture model, this paper compares the advantages and disadvantages of the two feature extraction algorithms on large power data through experiments on actual data, and gives the reason for choosing the discrete wavelet transform as the feature extraction algorithm in the user electricity consumption behavior analysis. For the clustering analysis algorithm, the reason for finally choosing the fast density peak clustering algorithm as the theoretical basis for the subsequent outlier detection and electricity consumption behavior analysis is also given, which lays the foundation for the subsequent work of the paper.
Propose an outlier detection algorithm based on density peaks.
The algorithm uses the idea of KNN to redefine the two parameters of the original fast density peak clustering algorithm, local density and distance, optimizing it from the perspective of outlier detection and making up for the original algorithm's failure to consider the local characteristics of the data and its dependence on the cutoff distance. In terms of outlier detection algorithm classification, this algorithm is a cluster-based algorithm, but it avoids the clustering process and reduces the time complexity of cluster-based outlier detection. It also retains the advantages of nearest-neighbor-based outlier detection: it is not sensitive to the data dimension and does not need labeled samples. The experimental results based on a single transformer and multiple transformers show the effectiveness of the algorithm. They also show that the algorithm is suitable for outlier detection not only on small data sets but also on large data sets.
Conclusion
This study proposes a user electricity behavior analysis method that combines wavelet transform feature extraction with a KNN-based fast density peak clustering algorithm. First, a three-layer wavelet transform is performed on the user load curve according to the difference of electricity consumption patterns; then a KNN-based fast density peak clustering algorithm is proposed, in which the local density and distance are defined using the idea of KNN and the automatic selection of cluster centers is implemented according to the outward statistical test method; finally, the algorithm is used for cluster analysis of load curves in different time dimensions, the typical load curves are reconstructed from the characteristic curves of the different time dimensions, and the method is verified on actual data. The multi-level and multi-dimensional analysis and reasonable division of load curves is basic work that will play an increasingly important role in load forecasting, setting time-of-use electricity prices, detecting abnormal electricity usage, and load control.
Acknowledgments
Supported by the National Natural Science Foundation of China (61070015), the Guangdong Natural Science Foundation Team Project (10351806001000000), Guangdong Frontier and Key Technology Innovation Project (2014B010110004), Key Research Projects of Universities in Guangdong Province (Natural Science) (2019GZDXM020) and the Science and Technology Program of Guangzhou (201804010402).
