Customer characteristics analysis method based on the selection of electricity consumption characteristics and behavioral portraits of different groups of people

Abstract

Traditional methods do not select an optimal feature set during feature analysis, so the accuracy and effectiveness of user feature clustering and the accuracy of user feature classification are low. This paper therefore proposes a customer feature analysis method based on power consumption feature selection and behavior portraits of different groups of people. The optimal feature set is obtained according to the maximum correlation and minimum redundancy criterion, and the user portrait task is described. The spatial feature domain classification method is used to classify the user portrait information, and a user label database is constructed from the classification results. The AP clustering algorithm is then used to cluster the power user portrait information and complete the customer feature analysis. Experimental results show that the method improves the accuracy and effectiveness of user feature clustering and achieves a user feature classification accuracy of up to 0.95.

Keywords

Power consumption characteristics; behavioral portraits; customer characteristics; AP clustering algorithm; information classification

1 Introduction

With the deepening application of power informatization, power information data presents an explosive growth trend, and the era of big data in the power industry has come [1]. The traditional method has certain guiding value for statistical analysis of customers’ power consumption behavior under the condition of small-scale data, but the application effect is not very ideal under the condition of massive data [2]. As power distribution is the nerve terminal of smart grid, it is directly oriented to users. The network architecture is complex and there are many business types. Businesses need to be refined and distinguished according to different clustering methods, so it is difficult to analyze user behavior [3]. With the rise of big data analysis technology, it is possible to deeply analyze user behavior and extract users’ potential power consumption habits and power consumption trends [4]. Under this background, relevant researchers and scholars have put forward some customer feature analysis methods.

Reference [5] addresses the classification of smart power users’ consumption behavior in a big data setting and proposes a classification method based on the extreme learning machine (ELM) algorithm. First, a feature optimization strategy is used to extract the best feature set of the load curve and to classify and analyze the user power consumption data. Then, taking the optimized feature set as the input, the input parameters of the ELM algorithm suitable for user power consumption behavior analysis are tuned by comparing the accuracy on the training set and test set under different hidden layer activation functions and numbers of hidden layer nodes. Finally, the user’s power consumption behavior is classified. However, this method does not extract features, resulting in low accuracy. Reference [6] puts forward a characteristic analysis method of differentiated power consumption behavior based on the segmentation of power consumption customer groups. Through the segmentation management of the differentiated power consumption behavior of customer groups, a segmentation structure model is established to improve the service quality of power supply enterprises. Based on power big data, a model structure that supports the segmentation of power customer groups is established through data mining technology. First, according to the actual operation of power customers, a model evaluation index based on customer power supply reliability requirements, customer behavior and customer value is established. Then, for a large data group, the K-means clustering algorithm is used for data cleaning and preprocessing to obtain refined detailed data. Finally, the differentiated evaluation and management of power customers is realized. However, data preprocessing in this method takes a long time, resulting in low effectiveness.
Reference [7] proposes an AP clustering analysis method of user power consumption behavior based on optimized SAX and weighted load characteristic index. First, the SAX algorithm is used to reduce the dimension of the load curve and extract features, and the simulated annealing particle swarm optimization algorithm is used to determine a reasonable number of characters and states. Then, combined with the load characteristic index, the improved AP clustering algorithm is used to cluster the load curves. In the clustering process, the entropy weight method is used to weight the load characteristic index objectively and avoid the subjectivity of index setting. Finally, based on the clustering results, the power consumption behavior and demand response potential of various users are analyzed. However, the accuracy of this method in user classification is low, which leads to poor applicability. Therefore, this paper proposes a customer feature analysis method based on power consumption feature selection and behavior portraits of different groups of people. Compared with the traditional analysis of customers’ electricity consumption behavior, this method pays more attention to mining the value of customers’ electricity consumption and can realize quantitative analysis of the electricity consumption of a large number of customers. The best feature set is determined based on the maximum correlation and minimum redundancy criterion, the user portrait task is described, and the described information is classified. Finally, the AP clustering algorithm is used to cluster the portrait information of power users and complete the customer feature analysis. With this method, the classification accuracy of user characteristics reaches up to 0.95, which indicates that the method is accurate and effective and can serve customers more effectively, improve customer satisfaction, reduce operational risk and provide a decision-making reference.

2 Customer characteristics analysis method

2.1 Determination of the optimal feature set based on the maximum correlation and minimum redundancy criterion

The maximum correlation and minimum redundancy criterion is a filtering feature selection method. Its core idea is to maximize the correlation between features and categorical variables and minimize the redundancy between features and features. This paper applies it to the selection of users’ electricity consumption characteristics, and obtains the feature set with the strongest correlation and lowest redundancy, which is used to characterize the electricity consumption characteristics of users. The correlation between feature and classification variable takes the mutual information value between feature and classification variable as the measurement index, which represents the reduction degree of category uncertainty when the feature is known. In the process of solving, in order to make each characteristic variable more statistically significant, it is necessary to discretize each variable, that is, convert the numerical sequence of each variable into a probability distribution interval. In this paper, the features are normalized, and then the variable interval is evenly dispersed to obtain the probability distribution of each feature variable, and then the mutual information between each feature quantity and user category is calculated.

In information theory [8], entropy is used as a measure of information uncertainty. The greater the entropy, the higher the degree of information uncertainty. The formula for calculating the entropy of K is: $X (K) = \int_{i} \frac{{| ψ_{i} |}^{2}}{ω_{i}} \times {log}_{2} h (k)$ (1) In the formula, h (k) represents the probability of occurrence of a specific event k; i represents a feature item; ψ_i represents a set of possible events; ω_i represents a set of all possible events. After discretization, the formula for calculating the entropy of the i-th feature g_i can be obtained as: $X (g_{i}) = \frac{1}{\sqrt{| d |}} - X (K) (\frac{a - b}{a})$ (2)

In the formula, d represents the number of intervals of feature g_i; b represents the number of samples in which feature g_i falls in a certain interval; a represents the total number of samples.

The information entropy of user category l is: $X (l) = {[F^{'} (X (l))]}^{- 1} F (X (l))$ (3)

In the formula, F (X (l)) represents the total number of categories of users; F′ (X (l)) represents the number of samples in which users belong to a category.

The joint information entropy of the i-th feature g_i and the user category l is: $X (g_{i}, l) = \sum_{i = 1}^{n} {[F^{'} (X (l))]}^{- 1} F (X (l))$ (4)

In the formula, n represents the number of samples where g_i falls in a certain interval and the user category is l. Then the mutual information between the i-th feature g_i and the user category l is defined as: $S (g_{i}, l) = X (g_{i}) + X (l) - X (g_{i}, l)$ (5)

According to formula (5), the mutual information between each feature and user category can be obtained separately.
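The calculation in formula (5) can be illustrated with a minimal Python sketch. It assumes the standard Shannon entropy (the negative sum of p·log₂p over equal-width bins) rather than the exact expressions in formulas (1) to (4), and the function names `entropy`, `discretize` and `mutual_information` as well as the bin count `n_bins` are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def discretize(values, n_bins):
    """Min-max normalize to [0, 1], then cut into n_bins equal-width intervals."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros(len(values), dtype=int)
    scaled = (values - values.min()) / span
    # Clip so that the maximum value falls into the last bin.
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def mutual_information(feature, category, n_bins=4):
    """S(g, l) = X(g) + X(l) - X(g, l), following the structure of formula (5)."""
    g = discretize(feature, n_bins)
    _, l_codes = np.unique(np.asarray(category), return_inverse=True)
    # Encode each (bin, category) pair as one joint symbol.
    joint = g * (l_codes.max() + 1) + l_codes
    return entropy(g) + entropy(l_codes) - entropy(joint)
```

A feature whose bins coincide with the user categories yields a mutual information equal to the category entropy, while a feature that is independent of the categories yields a value close to zero.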

Based on the obtained mutual information, the maximum correlation index between the optimal feature set and category l is defined as S′ (l), and the calculation formula is: $S^{'} (l) = \frac{σ_{i} + σ_{j} - 2 σ_{il}}{N}$ (6)

In the formula, σ_i and σ_j respectively represent the i-th and j-th features in the optimal feature set; σ_{il} represents the correlation between feature i in the optimal feature set and the user category l; N represents the number of features contained in the optimal feature set.

The redundancy of information between two features can be measured by indicators such as information gain, Gini coefficient, and correlation coefficient. This article uses correlation coefficient to measure, and its calculation formula is as follows: $C (t) = \frac{\sum_{i = 1}^{n} | σ_{i}, σ_{j} |}{cov (σ_{i}, σ_{j})} + θ_{i} θ_{j} \times φ_{σ_{i}, σ_{j}}$ (7)

In the formula, cov (σ_i, σ_j) represents the covariance of the two features; θ_i and θ_j represent the standard deviations of features σ_i and σ_j, respectively; φ_{σ_i,σ_j} represents the correlation coefficient of the two features, the value range is [-1,1], the closer the absolute value is to 1, the greater the correlation, the closer the absolute value is to 0, the smaller the correlation.

The minimum redundancy index is S [C (t)], and its calculation formula is: $S [C (t)] = \sum_{i = 1}^{N} (t - t_{s})$ (8)

In the formula, t represents the correlation between features and categorical variables; t_s represents the redundancy between features and features.

Combining the two indicators in formula (7) and formula (8) to obtain the maximum correlation and minimum redundancy criterion, the corresponding formula is as follows: $I_{s} = max [C (t) - S [C (t)]]$ (9)

In the formula, I_s represents the maximum correlation and minimum redundancy index. The feature set that satisfies the maximum correlation and minimum redundancy criterion is the optimal feature set.
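The selection described in this subsection can be sketched as a greedy search. This is an assumption: the text does not state how the maximization is carried out, and greedy incremental selection is only one common way to apply a maximum correlation and minimum redundancy criterion. The sketch takes precomputed relevance scores (for example, the mutual information between each feature and the user category) and measures redundancy by the absolute correlation coefficient between features; the name `mrmr_select` and its parameters are hypothetical.

```python
import numpy as np

def mrmr_select(X, relevance, k):
    """Greedily pick k column indices of X by relevance minus mean redundancy."""
    X = np.asarray(X, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    k = min(k, X.shape[1])
    # Absolute Pearson correlation between every pair of feature columns.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]  # start from the most relevant feature
    while len(selected) < k:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        # Mean redundancy of each candidate with the already selected features.
        redundancy = corr[np.ix_(remaining, selected)].mean(axis=1)
        scores = relevance[remaining] - redundancy
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```

With this scoring, a feature that is highly relevant but almost a copy of an already selected feature is passed over in favor of a less relevant but less redundant one.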

2.2 User portrait task description

User portrait construction generally has three steps. The first step is to clarify the target of the user portrait, that is, to determine the actual situation of the user portrait and analyze its main data composition. The second step is to collect the relevant data. The third and most important step is to build a user tag library. User tags are easy for people to understand and also convenient for computer processing tasks such as calculation.

A tag is a symbolic identification of user characteristics. A tag has two obvious characteristics. First, it covers a certain population and can abstract the characteristics of things to a certain extent. Second, a certain type of user characteristic can be identified by a symbol, which can be an English or Chinese word, a number, or a symbol that is given a special meaning. The tag library manages tags centrally and is ultimately used to mark user behaviors and attributes. The main task of the user portrait is to label the user based on the user information, that is, to build the user’s tag library. Through the tag library, the meaning of each tag can be clearly understood, and the user’s portrait can be given practical meaning. The definition of the user portrait description tags in the user tag library is shown in Table 1.

Table 1 covers the tag library for tags such as employment, population, housing type, occupancy, cooking type, and presence or absence of children. On this basis, users can be better labeled.

Table 1
User portrait description label definition table

Serial number	User portrait	User portrait description	Category	Tags
1	Employment	Employment of the main household earner	Employed	1
			Not employed	2
2	Population	Number of family members	Less (≤2)	1
			Many (≥3)	2
3	Type of Housing	Property Type	Freestyle	1
			Connected	2
4	Occupancy	Is the house unused for more than 6 hours per day?	Yes	1
			No	2
5	Type of Cooking	Type of cooking facility	Electricity	1
			Non-electric	2
6	Presence of Children	Are there any children in the home	Yes	1
			No	2
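The tag library in Table 1 can be represented as a simple lookup structure. The sketch below is only an illustration: the dictionary keys (`employment`, `housing_type` and so on), the lower-case category names, and the helper names `population_category` and `label_user` are hypothetical normalizations of the table entries, not identifiers from the paper.

```python
# Tag library from Table 1: portrait dimension -> {category: tag code}.
TAG_LIBRARY = {
    "employment": {"employed": 1, "not employed": 2},
    "population": {"less": 1, "many": 2},
    "housing_type": {"freestyle": 1, "connected": 2},
    "occupancy": {"yes": 1, "no": 2},
    "cooking_type": {"electricity": 1, "non-electric": 2},
    "children": {"yes": 1, "no": 2},
}

def population_category(n_members):
    """Table 1 thresholds: 'less' for <= 2 family members, 'many' for >= 3."""
    return "less" if n_members <= 2 else "many"

def label_user(profile):
    """Map a dict of portrait dimension -> category to its tag codes."""
    return {dim: TAG_LIBRARY[dim][cat] for dim, cat in profile.items()}

# Example: a four-person household whose main earner is employed.
example_tags = label_user({
    "employment": "employed",
    "population": population_category(4),
    "children": "yes",
})
```

Keeping the codes in one dictionary means an unknown dimension or category raises a `KeyError` instead of silently producing a wrong tag.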

According to the above analysis, and referring to the general research process framework of data mining, this paper builds user portraits from customer characteristic behavior log data in stages. The overall framework is shown in Fig. 1 and includes three key stages: feature engineering, classification engineering and clustering engineering.

Fig. 1

Overall framework of user portrait.

Feature engineering obtains the information features (information entropy) from the original data after eliminating redundancy, for use by subsequent algorithms and models. Classification engineering classifies features according to the feature classification model. Clustering engineering groups objects into multiple classes according to their similarity.

2.3 Classification and processing of power user portrait information

Different types of single features are combined in pairs to obtain binary features, and the multi-index fusion feature selection method is then used to select features with larger correlation coefficients for subsequent model training. The spatial feature domain classification method is used to classify the user portrait information, the multi-scale layer-by-layer analysis method is used to accurately classify the fuzzy power user portrait, and the power consumption category feature quantity of the user portrait is extracted [9]. The similarity feature of the user portrait in a low-dimensional space is s (A, B), and the template matching equation is: $s (A, B) = \frac{P (A, B)}{R (A, B)} + \frac{U (A, B)}{E (A, B)}$ (10)

In the formula, P (A, B) represents the power consumption information feature component in the power consumption information subset A of the user portrait; R (A, B) represents the statistical feature quantity of the multi-dimensional power consumption category of the user portrait. Assuming that the power consumption information set of the power user portrait is (c, v), take this as the power consumption information center, and use the sharpening template classification method to obtain the feature classification model of the power user portrait as: $W (c, v) = E {d (t) - I^{'} {(t)}^{2}}$ (11)

In the formula, d (t) represents the power user portrait training sample; E represents the main feature vector space; I′ (t) represents the feature information collection.

According to a set of power user portrait training samples, the main feature vector space is constructed, that is, the feature subspace (feature information collection), and the feature domain distribution model of the user portrait multi-dimensional measurement is established in the local feature domain of the power distribution feature domain, which is described as follows: $S_{i} (f) = \frac{\int_{- \infty}^{+ \infty} X_{T} (f_{i} + α / - 2)}{X_{T}^{*} (f_{i} - α / - 2)}$ (12)

In the formula, W_i+1 represents the user portrait feature domain; f_i represents the template matching value of the feature domain; α represents the overall multi-dimensional information feature components presented by the user portrait management information; w_i represents the statistical feature value of the power user portrait.

The activity category lasso model is established to extract the category features of power user portraits and information classification processing, and the optimized detection model for user portraits is obtained as: $S_{i} (f) = \frac{\int_{- \infty}^{+ \infty} X_{T} (f_{i} + α / - 2)}{X_{T}^{*} (f_{i} - α / - 2)}$ (13)

In the feature domain, the user portrait is fused with the feature domain template of the electricity information value to realize the classification of the user portrait information. In order to construct a three-dimensional power user portrait, it is necessary to use the statistical feature decomposition method to calculate the distance between each power consumption information and the clustering center, construct the power consumption histogram of the power user portrait, and construct the information layer by layer analysis result under multi-dimensional dynamic scanning. The accurate positioning of the power user portrait is achieved, and the accurate feature value is extracted based on this to obtain a three-dimensional portrait.

In summary, the process of the classification algorithm for power user profile information proposed in this paper is shown in Fig. 2.

Fig. 2

Flow chart of the classification algorithm for power user profile information.

The template matching equation is composed of the feature components of power consumption information and the statistical feature quantities of the multi-dimensional power consumption categories of user portraits. The sharpening template classification method is used to obtain the classification model of power user portrait features. A main feature vector space is constructed from a group of power user portrait training samples. In the local feature domain of the power distribution feature domain, the feature domain distribution model of the multi-dimensional measurement of the user portrait is established. Finally, the category features of the power user portrait are extracted by the lasso model and the information is classified to obtain the optimized detection model of the user portrait.

2.4 Power user portrait information clustering processing

According to the above classification results of power user information, user similarity is measured by Euclidean distance, and the power user profile information is clustered based on the measurement results. The classification algorithm for power user profile information has already extracted the highly relevant features and formed the features for constructing the user tag library. In this way, tags that effectively represent users are found in the original massive network access log data, which provides data support for the construction of user portrait models. To analyze customer characteristics further, the power user profile information needs to be clustered. This paper mainly uses the AP clustering algorithm [10] for this purpose.

Different from traditional clustering algorithms, AP clustering algorithm does not need to specify the number of clustering categories and clustering centers in advance, but iteratively competes for clustering centers through the message mechanism between sample points to achieve the optimization of clustering results. Its advantage is that it has a certain degree of objectivity and good applicability. The main process of the AP clustering algorithm is to iteratively calculate the support B (μ_i, μ_j) and fitness D (μ_i, μ_j) in the data sample, where B (μ_i, μ_j) represents the degree to which sample μ_i supports μ_j as its cluster center, and D (μ_i, μ_j) represents the degree to which μ_j is suitable to become the cluster center of sample μ_i. Define the similarity between any two sample points μ_i and μ_j to be measured by the Euclidean distance S (μ_i, μ_j) [11], then the iterative formula is: $S (μ_{i}, μ_{j}) = B (μ_{i}, μ_{j}) - max D (μ_{i}, μ_{j})$ (14) $D (μ_{i}, μ_{j}) = \sum_{i, j = 1}^{n} max S (μ_{i}, μ_{j})$ (15)

In the iterative process, each sample point μ_i competes to become a cluster center. The goal is to find the point t_k that attains the maximum value in S (μ_i, μ_j); t_k is then the cluster center. Since some sample points competing to be cluster centers are eliminated in each iteration, the number of cluster centers that remain is the number of cluster categories.
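The message-passing idea can be sketched with the commonly published form of affinity propagation, in which the two messages are called responsibility and availability. This is an assumption and not a transcription of formulas (14) and (15): the sketch uses the negative squared Euclidean distance as the similarity, the median similarity as the default preference, and a damping factor, and the function name `affinity_propagation` and its parameters are hypothetical.

```python
import numpy as np

def affinity_propagation(points, damping=0.5, n_iter=200, preference=None):
    """Return (exemplar indices, cluster label per point)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Similarity: negative squared Euclidean distance between sample points.
    diff = points[:, None, :] - points[None, :, :]
    S = -np.sum(diff ** 2, axis=2)
    if preference is None:
        preference = np.median(S[~np.eye(n, dtype=bool)])
    np.fill_diagonal(S, preference)
    R = np.zeros((n, n))  # responsibility messages
    A = np.zeros((n, n))  # availability messages
    rows = np.arange(n)
    for _ in range(n_iter):
        # Responsibility: r(i,k) = s(i,k) - max over k' != k of a(i,k') + s(i,k').
        AS = A + S
        best = np.argmax(AS, axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new
        # Availability: a(i,k) = min(0, r(k,k) + sum of positive r(i',k), i' not in {i,k}).
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()  # a(k,k) is not clipped at zero
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new
    exemplars = np.flatnonzero((A + R).diagonal() > 0)
    if len(exemplars) == 0:
        return exemplars, np.full(n, -1)
    labels = np.argmax(S[:, exemplars], axis=1)
    labels[exemplars] = np.arange(len(exemplars))
    return exemplars, labels
```

The number of clusters is not an input: it is the number of points whose combined availability and responsibility for themselves is positive after the iterations, which is what makes the method attractive when the number of user categories is unknown.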

In order to verify the advantages of the AP clustering algorithm, a user in a certain area was selected to perform a case analysis. After the data was standardized with the help of MATLAB software [12], AP clustering was performed. The results are shown in Fig. 3.

Fig. 3

AP clustering effect verification.

It can be seen from Fig. 3 that the users in the calculation example are automatically divided into four categories (categories 1 to 4). This indicates that the AP clustering algorithm clusters the power user profile information well and can accurately separate the same or similar types of user profile information.

2.5 Precise positioning of power user portraits

Based on the above-mentioned construction of the user portrait information processing model, a user portrait construction method based on multi-dimensional category feature recognition and corner identification is proposed, and the multi-dimensional feature distribution feature domain of power user portraits is constructed in the smart electricity management system. The statistical feature decomposition method is used to calculate the distance between each power consumption information and the cluster center, construct the power consumption histogram of the power user portrait, and scan the power user portrait with multi-dimensional scanning technology. It is obtained that the feature quantity of the power user portrait center is H ={ h₁, h₂, . . . , h_n }, the power user portrait classification line is marked as H_x = - jϑ_x|g_i| and H_y = - jϑ_y|g_i|, and the power consumption information value output of the user portrait management information map meets the following requirements: $η = \sum_{i = 1}^{n} \hat{W} ({\vec{m}}_{i} - \vec{m}) {({\vec{m}}_{i} - {\vec{m}}_{j})}^{T}$ (16)

In the formula, T = 1, 2, . . . , M, i < j. By calculating the displacement vector F_i (i = 1, 2, . . . , M) in the gradient direction of the user portrait collection, using the power user portrait distribution power consumption information matching method, two vectors of power demand are obtained, ϖ_i and ϖ_j, then the block feature domain classification control parameters of the user portrait ∂_i is selected as follows: $\partial_{i} = \sum_{i, j = 1}^{n} u (x_{ij}) ξ_{ij}$ (17)

In the formula, x_ij represents the multi-dimensional scale information of the power user portrait; ξ_ij represents the overall feature sampling feature distribution sequence of the power user portrait.

The feature extraction result of the electricity demand of the user portrait is: $∥ W_{ij} ∥ = \frac{Gh (t)}{p_{ij} + σ^{2}} (\frac{1}{1 + γ_{i}})$ (18)

In the formula, h (t) represents the overall multi-dimensional information feature components presented by the user portrait management information; G represents the iconic sequence of the power user portrait; σ² represents the power consumption weight value; γ_i represents the power consumption information intensity.

The feature matching method by feature domain is used to detect the block fusion of the two-dimensional user portrait and the feature block matching, and the output of the subset of power consumption information of the user portrait is as follows: $z (t) = Y_{k} + x (t) + [e {(t)}_{max} - e {(t)}_{min}]$ (19)

In the formula, e (t) _max and e (t) _min represent the maximum attribute value and the minimum attribute value of the power user portrait feature. The layer-by-layer analysis results of building information under multi-dimensional dynamic scanning are described as: $h (x) = \begin{cases} 1, & X_{j} \leq 0 \ (j = 1, 2, \dots, M) \\ 0, & X_{j} > 0 \ (j = 1, 2, \dots, M) \end{cases}$ (20)

Based on the above analysis, a precise positioning model of power user portraits is constructed to achieve precise positioning of power user portraits.

According to the above parameter setting, the power customer stereo portrait is constructed, and the power customer stereo portrait database is obtained, as shown in Fig. 4.

Fig. 4

Portraits of power users.

Analyzing Fig. 4, it can be seen that the method in this paper can effectively realize the multi-dimensional construction and feature point calibration of the power customer’s three-dimensional portrait. Based on the selection of electricity consumption characteristics of different groups of people and the construction of behavior portraits, users’ electricity consumption behavior is analyzed by clustering optimization, which mainly includes feature selection on the user electricity load curve and cluster optimization of user electricity consumption behavior. Clustering the raw load curves directly requires a large amount of calculation and is therefore not suitable for large amounts of data. This paper instead adopts the feature optimization strategy to extract the optimal feature set of the load curve. After optimization, four characteristics are taken as the preferred features of user power consumption data: daily average load, valley power coefficient, power consumption percentage of the flat section, and peak-hour power consumption rate.
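The four preferred features can be computed from a single 48-point daily load curve as sketched below. The paper does not define the features or the tariff periods, so both are assumptions here: the three share features are taken to be the fraction of daily consumption in the valley, flat and peak periods, and the period boundaries in `PEAK`, `VALLEY` and `FLAT` are hypothetical.

```python
import numpy as np

# Hypothetical half-hour index ranges for the tariff periods (48 points per day).
PEAK = np.r_[16:24, 36:44]     # 08:00-12:00 and 18:00-22:00
VALLEY = np.r_[0:14, 46:48]    # 23:00-07:00
FLAT = np.setdiff1d(np.arange(48), np.union1d(PEAK, VALLEY))

def load_features(curve):
    """Return the four preferred features for one daily load curve."""
    curve = np.asarray(curve, dtype=float)
    total = curve.sum()
    return {
        "daily_average_load": curve.mean(),
        "valley_coefficient": curve[VALLEY].sum() / total,
        "flat_percentage": curve[FLAT].sum() / total,
        "peak_rate": curve[PEAK].sum() / total,
    }
```

Reducing each curve from 48 samples to four values is what keeps the later clustering step tractable on large customer populations.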

According to the optimized feature set, the power consumption characteristics of all user power load curves are extracted [13] and used for cluster analysis. The number of clusters is varied, the accuracy and effectiveness under different cluster numbers are compared, clustering stops when the set threshold is reached, and the best cluster number is finally selected according to the clustering optimization strategy [14] proposed in this paper. This strategy can effectively make up for the deficiency of the dynamic clustering algorithm in power consumption behavior analysis and realize the effective analysis of user power consumption behavior [15]. The overall idea of this paper is as follows:

First, the best feature set is obtained according to the maximum correlation and minimum redundancy criterion, and the user portrait task is described. Then, the spatial feature domain classification method is used to classify the user portrait information, and the user label database is constructed according to the classification results. Finally, the AP clustering algorithm is used to cluster the portrait information of power users and complete the analysis of customer characteristics.

3 Calculation example analysis

To verify the effectiveness of the customer feature analysis method based on power consumption feature selection and behavior portraits of different groups of people, a calculation example is analyzed. The differentiated power consumption behavior analysis method based on power consumption customer segmentation proposed in reference [6] and the AP clustering analysis method of user power consumption behavior based on optimized SAX and weighted load characteristic index proposed in reference [7] are selected as comparison methods and compared with the method proposed in this paper.

3.1 Experimental data

This paper uses the load data of a power company’s main network and all of its power customers from January 1 to December 31, 2019. The sampling interval is 30 minutes, so each daily load curve has 48 sampling points. Cluster analysis is first performed on the daily load curve data of the main network, and the daily load data of the main network are divided into three categories according to the clustering results.
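As an illustration of the data layout, a year of 30-minute readings can be arranged into one row per day before clustering. The sketch below is a minimal example under stated assumptions: the function names `to_daily_curves` and `normalize_curves` are hypothetical, and scaling each curve by its own maximum is an assumed normalization that the paper does not specify.

```python
import numpy as np

POINTS_PER_DAY = 48  # 30-minute sampling interval

def to_daily_curves(series):
    """Reshape a flat series of readings into an (n_days, 48) array."""
    series = np.asarray(series, dtype=float)
    n_days, remainder = divmod(series.size, POINTS_PER_DAY)
    if remainder:
        raise ValueError("series length is not a whole number of days")
    return series.reshape(n_days, POINTS_PER_DAY)

def normalize_curves(curves):
    """Scale each daily curve by its own maximum so curve shapes are comparable."""
    peak = curves.max(axis=1, keepdims=True)
    return curves / np.where(peak == 0, 1.0, peak)
```

For 2019, which has 365 days, this gives 365 × 48 = 17,520 readings per customer and a 365-row matrix of daily curves.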

3.2 Analysis of experimental results

3.2.1 Effectiveness of user feature classification

The data of 1,000 residential users of the selected power company are analyzed. The typical and sub-typical power consumption curves of users are clustered, and the clustering results are compared to judge the clustering accuracy using formula (21) and the validity using formula (22).

Define the accuracy of user feature clustering as C_τ: $C_{τ} = \frac{1}{τ} \sum_{i = 1}^{n} (\frac{2}{n_{i} (n_{i} - 1)})$ (21)

In the formula, τ represents the number of clusters; n_i represents the number of within-class load curves of type i features.

Define the effectiveness of user feature clustering as E_τ: $E_{τ} = \frac{1}{τ} \sum_{i = 1}^{n} p_{i τ}$ (22)

C_τ and E_τ are used to measure the clustering effect of user characteristics. The larger the values, the better the clustering effect; the smaller the values, the worse the clustering effect.

The change trend of accuracy and validity obtained by the above calculation formula is shown in Fig. 6.

Fig. 6

Changes in accuracy and validity of user feature clustering.

Fig. 5

Overall thinking.

It can be seen from the accuracy curve that, before the number of clusters reaches 10, the accuracy fluctuates steadily around 88%. When the number of clusters exceeds 10, the accuracy begins to decrease significantly. The validity curve shows that the effectiveness continues to rise before the number of clusters reaches 10 and then tends to fluctuate steadily. It is therefore reasonable to choose 9 or 10 as the optimal number of clusters.

The accuracy and validity data obtained by the above calculation formula are shown in Table 2.

Table 2

Numerical results of accuracy and validity of user feature clustering

Number of clusters	C_τ	E_τ
2	0.883	0.700
4	0.885	0.782
6	0.884	0.802
8	0.886	0.810
10	0.878	0.820
12	0.868	0.832
14	0.858	0.828
16	0.828	0.823
18	0.822	0.822

The data in Table 2 are consistent with the trends of user feature clustering accuracy and validity described above, indicating that the clustering results of the proposed method have a certain degree of reliability.

According to the data of 1000 residential users selected above, the validity of user feature classification is verified by randomly sampling user samples. The specific calculation process is as follows:

N samples are randomly selected, and k of them are misclassified in the classifier test. k is a random variable that obeys the binomial distribution, that is, $P (k) = C_{N}^{k} {P (e)}^{k} {[1 - P (e)]}^{N - k}$, where the binomial coefficient is: $C_{N}^{k} = \frac{N!}{k! (N - k)!}$ (23) Setting the derivative of the log-likelihood with respect to P(e) to zero gives: $\frac{\partial \ln P (k)}{\partial P (e)} = \frac{k}{P (e)} - \frac{N - k}{1 - P (e)} = 0$ (24)

Here, P(e) represents the true error rate and P(k) represents the likelihood function. Solving formula (24) gives $\hat{P} (e) = \frac{k}{N}$, the maximum likelihood estimate of the error rate, which is an unbiased estimate of P(e). If (ɛ₁, ɛ₂) is the confidence interval of $\hat{P} (e)$, 0 < α < 1 is the test level, and (1 - α) is the confidence, then the relationship between the probability of P(e) falling in the confidence interval and the test level α is: $P (ɛ_{1} \leq P (e) \leq ɛ_{2}) = 1 - α$ (25)

If the sample number N is larger, the confidence interval is smaller, and P(e) is closer to the true error.
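The relationship between sample size and interval width can be sketched numerically. The paper does not state how the bounds (ɛ₁, ɛ₂) are computed, so the Wilson score interval used below is an assumption, and the function name `error_rate_interval` is hypothetical. The point estimate is the maximum likelihood estimate K/N described above.

```python
from math import sqrt
from statistics import NormalDist


def error_rate_interval(n_samples, n_errors, alpha=0.05):
    """Return (p_hat, lower, upper) for a classifier error rate.

    p_hat = K / N is the maximum likelihood estimate. The bounds are a
    Wilson score interval at confidence level 1 - alpha (an assumed
    construction; the paper does not name one).
    """
    if n_samples <= 0 or not 0 <= n_errors <= n_samples:
        raise ValueError("need n_samples > 0 and 0 <= n_errors <= n_samples")
    p_hat = n_errors / n_samples
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    denom = 1 + z * z / n_samples
    centre = (p_hat + z * z / (2 * n_samples)) / denom
    half_width = (z / denom) * sqrt(
        p_hat * (1 - p_hat) / n_samples + z * z / (4 * n_samples**2)
    )
    return p_hat, centre - half_width, centre + half_width


# Same observed error rate (5%) at increasing sample sizes N.
for n in (100, 1000, 10000):
    p_hat, low, high = error_rate_interval(n, n_errors=n // 20)
    print(f"N={n:5d}  p_hat={p_hat:.3f}  interval=({low:.4f}, {high:.4f})")
```

For a fixed observed error rate, the printed interval shrinks as N grows, which is the behavior the text describes.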

3.2.2 Accuracy of user feature classification

To verify the effectiveness of the proposed method, the accuracy of user feature classification is used as the experimental index, and the proposed method is compared with the methods of literature [6] and literature [7]. The comparison results are shown in Table 3.

Table 3
Comparison results of user characteristic classification accuracy

Number of iterations	The proposed method	Literature [6] method	Literature [7] method
2	0.86	0.78	0.75
4	0.88	0.78	0.76
6	0.89	0.79	0.77
8	0.91	0.79	0.79
10	0.91	0.83	0.80
12	0.91	0.83	0.81
14	0.93	0.85	0.83
16	0.93	0.87	0.85
18	0.95	0.88	0.87

From the comparison in Table 3, it can be seen that, as the number of iterations increases, the user feature classification accuracy of the proposed method reaches 0.95, while that of the literature [6] method peaks at 0.88 and that of the literature [7] method peaks at 0.87. The accuracy of the proposed method is therefore clearly higher than that of the comparison methods. Because the original feature set contains all features, it carries redundant information that interferes with classification and lowers accuracy. The proposed method addresses this problem by selecting an optimal feature set. It is therefore necessary to select appropriate feature sets for different sample sets, and the feature set optimized by the proposed method is better suited to the user feature analysis in this paper.
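The effect of redundancy on feature selection can be illustrated with a minimal greedy sketch of the maximum correlation and minimum redundancy criterion. This is not the paper's implementation: the absolute Pearson correlation stands in for the relevance and redundancy measures, the data are synthetic, and the names `pearson` and `mrmr_select` are hypothetical.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mrmr_select(columns, target, n_select):
    """Greedy selection: high relevance to the target, low redundancy with
    already selected features. Assumption: |Pearson correlation| is used
    as a simple proxy for both quantities."""
    relevance = [abs(pearson(col, target)) for col in columns]
    selected = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < n_select:

        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                abs(pearson(columns[j], columns[s])) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic data: feature 2 is a near copy of feature 0, feature 3 is noise.
rng = random.Random(0)
n = 500
f0 = [rng.gauss(0, 1) for _ in range(n)]
f1 = [rng.gauss(0, 1) for _ in range(n)]
f2 = [a + 0.01 * rng.gauss(0, 1) for a in f0]
f3 = [rng.gauss(0, 1) for _ in range(n)]
label = [a + b for a, b in zip(f0, f1)]

selected_features = mrmr_select([f0, f1, f2, f3], label, n_select=2)
print(selected_features)
```

In this toy setting the selection keeps feature 1 together with only one of the two near-identical features 0 and 2, so the redundant copy and the noise feature are left out of the optimized set.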

3.2.3 Error rate of user feature classification

To verify the validity of the proposed method, the error rate of user feature classification is taken as the experimental index, and the proposed method, the method of literature [6], and the method of literature [7] are tested. The comparison results are shown in Table 4.

Table 4
User feature classification error rate comparison table

Number of experiments	The proposed method	Literature [6] method	Literature [7] method
10	0.05	0.12	0.19
20	0.05	0.14	0.18
30	0.05	0.16	0.17
40	0.04	0.17	0.17
50	0.05	0.12	0.18
60	0.03	0.14	0.16
70	0.05	0.17	0.19
80	0.04	0.18	0.18
90	0.03	0.12	0.16
100	0.05	0.15	0.17

From the comparison in Table 4, it can be seen that the user feature classification error rate of the proposed method is at most 0.05, while that of the literature [6] method reaches 0.18 and that of the literature [7] method reaches 0.19. The error rate of the proposed method is therefore significantly lower than that of the traditional methods, which shows that the feature set optimized by the proposed method is better suited to the user feature classification in this paper.

3.2.4 Time comparison test of user feature classification

To verify the validity of the proposed method, the time of user feature classification is taken as the experimental index, and the proposed method, the method of literature [6], and the method of literature [7] are tested. The test results are shown in Table 5.

Table 5
Comparison of user characteristics classification time

Number of experiments	The proposed method	Literature [6] method	Literature [7] method
10	0.96s	1.43s	1.52s
20	0.95s	1.46s	1.56s
30	0.89s	1.44s	1.49s
40	0.87s	1.39s	1.46s
50	0.94s	1.40s	1.48s
60	0.99s	1.44s	1.57s
70	0.87s	1.36s	1.55s
80	0.93s	1.32s	1.43s
90	0.87s	1.49s	1.47s
100	0.92s	1.44s	1.56s

From the comparison in Table 5, it can be seen that, across the experiments, the user feature classification time of the proposed method is at most 0.99 s, while the method of literature [6] takes up to 1.49 s and the method of literature [7] takes up to 1.57 s, indicating that the proposed method is significantly more efficient than the traditional methods. The above results indicate that the proposed method is efficient, accurate and effective, and suggest that it has good technical and application value.
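The timing comparison can be summarized with a short calculation over the values in Table 5. This is only an arithmetic sketch: the dictionary keys and the derived quantity `saving` (the fraction of time saved by the proposed method relative to each method's mean) are illustrative choices, not metrics reported in the paper.

```python
from statistics import mean

# Classification times in seconds, copied from Table 5 (10 to 100 experiments).
TIMES = {
    "proposed": [0.96, 0.95, 0.89, 0.87, 0.94, 0.99, 0.87, 0.93, 0.87, 0.92],
    "literature_6": [1.43, 1.46, 1.44, 1.39, 1.40, 1.44, 1.36, 1.32, 1.49, 1.44],
    "literature_7": [1.52, 1.56, 1.49, 1.46, 1.48, 1.57, 1.55, 1.43, 1.47, 1.56],
}

# Per method: (mean time, maximum time).
summary = {name: (mean(values), max(values)) for name, values in TIMES.items()}
proposed_mean = summary["proposed"][0]

for name, (avg, worst) in summary.items():
    saving = 1 - proposed_mean / avg  # illustrative: time saved by the proposed method
    print(f"{name:13s} mean={avg:.3f}s  max={worst:.2f}s  saving={saving:.1%}")
```

The maxima printed here match the figures quoted in the text, and the means give a steadier basis for comparison than single worst-case runs.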

4 Conclusion

To address the low accuracy and effectiveness of user feature clustering and the low accuracy of user feature classification in traditional methods, this paper proposes a customer feature analysis method based on power consumption feature selection and behavior portraits of different groups of people. The experimental results show that, compared with traditional methods for analyzing users' power consumption behavior, the proposed method achieves higher accuracy and effectiveness of user feature clustering and higher accuracy of user feature classification. The customer feature analysis effect of this method is therefore better than that of the traditional methods. However, the method still has some shortcomings: the algorithm process is complex, which leads to a long operation time, so the user feature classification time does not yet reach the expected level. In future research, the algorithm will be improved to reduce the operation time and improve the classification efficiency. This technology can be applied to industrial, medical and other scenarios, and it is hoped that it will be applied more widely in practice, provide more effective support for customer service, and improve customer satisfaction.

References

1. Yang C.T., Chen T.Y., Kristiani and Wu S.F., The implementation of data storage and analytics platform for big data lake of electricity usage with Spark, The Journal of Supercomputing 77(6) (2021), 5934–5959.

2. Ghazvini M.F., Ramos, Soares, Castro and Vale, Liberalization and customer behavior in the Portuguese residential retail electricity market, Utilities Policy 59(8) (2019), 100919.

3. Tsaousoglou, Steriotis, Efthymiopoulos, Makris and Varvarigos, Truthful, practical and privacy-aware demand response in the smart grid via a distributed and optimal mechanism, IEEE Transactions on Smart Grid 11(4) (2020), 3119–3130.

4. Honda, Ozawa and Wakamatsu, Profitability assessment of residential photovoltaic battery systems in Japan using electric power big data, Sustainability 13(10) (2021), 5370.

5. Chen Z.M., Gong G.J., Xu Z.Q. and Qi, Classification analysis method for electricity consumption behavior based on extreme learning machine algorithm, Automation of Electric Power Systems 43(02) (2019), 97–104.

6. Luo J.G., Chen, Lin and Huang, Analysis of the characteristics of the differentiated electricity consumption behavior based on the subdivision of the customer group, Advances of Power System & Hydroelectric Engineering 255(10) (2020), 72–76.

7. C.Y., Cai W.Y., Zhao R.S., Yu C.Q. and Zhang, Customer behavior analysis based on affinity propagation algorithm with optimized SAX and weighted load characteristic indices, Transactions of China Electrotechnical Society 34(1) (2019), 368–377.

8. Oladyshkin, Mohammadi, Kroeker and Nowak, Bayesian³ active learning for the Gaussian process emulator using information theory, Entropy 22(8) (2020), 890.

9. Cunha, Canuto, Viegas, Salles and Rocha, Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling, Information Processing & Management 57(4) (2020), 102263.

10. Shi, Chen, Zhang, Yang and Xu, Dynamic AP clustering and precoding for user-centric virtual cell networks, IEEE Transactions on Communications 67(3) (2019), 2504–2516.

11. Sahu H.K., SSK modulated WPCN with Euclidean distance based selection combining receiver, Wireless Networks 27(11) (2021), 1–8.

12. Terven J.R. and Córdova-Esparza D.M., KinZ an Azure Kinect toolkit for Python and Matlab, Science of Computer Programming 211(9) (2021), 102702.

13. Kamel, Li, Bu and Wu, A generalized voltage stability indicator based on the tangential angles of PV and load curves considering voltage dependent load models, International Journal of Electrical Power & Energy Systems 127(2) (2021), 106624.

14. Blumenberg and Ruggles K.V., Hypercluster: a flexible tool for parallelized unsupervised clustering optimization, BMC Bioinformatics 21(1) (2020), 428.

15. Guo, Wang, Ji J.W. and Ma F.J., Model simulation of electricity characteristics in power user data based on data mining, Computer Simulation 33(5) (2016), 447–450.