Automatic classification method of power user’s requirements text based on parallel naive Bayesian algorithm

Abstract

To ensure the efficiency of processing power user’s requirements, an automatic classification method for the demand text of power users based on a parallel naive Bayesian algorithm is proposed. The polynomial naive Bayes model is selected, a Hadoop cluster is built, and the feature words of power user’s requirements are selected through the chi square test. The weight of each feature item is calculated by the term frequency-inverse document frequency (TFIDF) method, and the weight sum of each category is calculated. The weight sum is input into the naive Bayes algorithm, which outputs the text classification results of power user’s requirements. At the same time, the naive Bayes classification algorithm is parallelized and encapsulated to reduce the cost of data movement and exchange in the classification process and to improve the operation efficiency of demand text classification for power users. The experimental results show that this method can accurately extract the feature words of power user’s requirements, effectively realize the automatic classification of power user’s requirements text, and achieve a more accurate classification effect. The average fitness value of the proposed method tends to be stable after more than 20 training times, and the number of network convergence steps is 7. When the ratio of energy function is about 0.4 and 0.6, the average IU value is the highest. When the required number of texts ranges from 500 to 1500, the delay time of text classification is 0.02 s, and the peak signal-to-noise ratio is more than 33, among which the highest peak signal-to-noise ratio is 42.52, and the normalization coefficient is 1.

Keywords

MapReduce; Naive Bayes; power user’s requirements; automatic text classification; parallel processing

1 Introduction

In fierce market competition, customer service has become one of the important problems faced by enterprises. With the rapid development of recent years, many companies have realized the importance of customer service: satisfying customers and meeting their needs is the goal and center of all work. Establishing a customer service center that suits the actual conditions of the enterprise is an important issue for all enterprises [1]. Good customer service can strengthen the relationship between enterprises and customers, maintain and create a good social image of enterprises, and ultimately achieve the long-term goal of cultivating consumers’ loyalty to enterprises and brands. Electric power is an important basic industry related to the national economy and the people’s livelihood. Electric power enterprises have the characteristics of economies of scale, similar to gas, tap water and telecommunications enterprises, and are highly representative of general public service enterprises. As an important business activity of power enterprises, customer service is related not only to the vital interests of power customers but also to the operating efficiency of power enterprises [2]. Solutions to the customer service problems of electric power enterprises are therefore widely applicable to the customer service problems of the whole industry.

To better serve power users, it is necessary to ensure the quality of the power user’s requirements text. With the continuous development of computer technology, and especially of big data, the volume of demand text from power users has increased exponentially. How to effectively classify and retrieve a large number of texts by category has become a research focus in the fields of data mining and information retrieval [3]. Today, with the wave of artificial intelligence sweeping the world, text mining technology has been widely used in natural language processing fields such as text review, advertisement filtering, emotion analysis and pornographic content recognition [4]. As an effective method in text mining, text classification has been widely applied and studied in the fields of information retrieval, information filtering, text databases and search engines [5], providing technical support and solutions for in-depth analysis.

Naive Bayes algorithm is a recognized classical classification algorithm, which is generally used for text classification. It is based on the principle of statistical classification, and calculates the probability that a given object belongs to different categories by assuming that the influence of a single attribute value of the classification object on a given class is independent of the value of other attributes. The main idea is to predict the class membership relationship under the assumption that the features are independent of each other, and the posterior probability is calculated with the value of the prior probability based on the Bayesian formula.

On this basis, Zhen Xue et al. combined the chi square value with document frequency to express the importance of feature words, obtained the weights of the feature words, and established a naive Bayes classifier accelerator with good real-time performance [6]. Liu Peng et al. proposed a parallel naive Bayes algorithm for large-scale Chinese text classification based on Spark. The algorithm takes into account the distribution of feature items within and between classes and combines the correlation between feature items to adjust the calculated weights [7]. The above algorithms improve the performance of text classification to a certain extent, but they have two limitations. Firstly, most words in a language are low-frequency words, which easily causes the problem of sparse data in text classification. Secondly, because of the limited scalability and computing capacity of a centralized platform, it is difficult to ensure the efficiency of data processing when such a platform runs the traditional naive Bayes text classification algorithm.

Yuan et al. proposed an unsupervised feature selection method based on orthogonal feature representation. In this method, the original data is represented by an autoencoder as a vector in a low-dimensional feature space. PCA is then used to represent the feature space as an orthogonal coordinate system. Finally, the most representative features are preserved by selecting the axes with the highest variance contribution. Compared with other unsupervised feature selection methods, the features obtained by this method have better robustness and interpretability, and the method can effectively reduce feature dimensions while maintaining classification performance. Wang et al. proposed a double regularization unsupervised feature selection method based on matrix decomposition and minimum redundancy and applied it to the problem of gene selection. In this method, the data matrix is decomposed into the product of two low-rank matrices to achieve dimension reduction and feature selection. One low-rank matrix describes the implicit attributes of the samples, and the other represents the importance of the features. The method also uses the minimum redundancy criterion to obtain more independent and representative features, which further improves the classification performance. Mehrpooya proposed the use of matrix decomposition (MF) for reducing high-dimensional data in systems pharmacology. That work proposed three new feature selection methods using the mathematical concept of feature basis and applied them, along with three other MF methods, to eight different gene expression data sets to study and compare their performance in feature selection. The results show that these methods are able to reduce the feature space and find predictive features in phenotypic determination.

To solve the above two problems, this paper applies text classification technology to the field of power user’s requirements, proposes a text classification algorithm based on naive Bayes, and uses the MapReduce programming model to parallelize the algorithm on the Hadoop cloud computing platform. MapReduce is a programming model for parallel operation on large-scale data sets (greater than 1 TB). The concepts “Map” and “Reduce” and their main ideas are borrowed from functional programming languages, together with features borrowed from vector programming languages. The model makes it much easier for programmers to run their own programs on distributed systems without expertise in distributed parallel programming. In current software implementations, a Map function is specified to map a set of key value pairs into a new set of key value pairs, and a concurrent Reduce function is specified to merge all mapped key value pairs that share the same key. The naive Bayesian algorithm is implemented on MapReduce, so that the parallel Bayesian algorithm has better execution efficiency and higher scalability than the traditional Bayesian algorithm when the amount of data is large. The parallel naive Bayes algorithm can quickly and accurately classify the power user’s requirements text and mine the hidden power demand of customers, thus providing data support for improving power service quality and predicting potential service risks. The novelty of this paper lies in applying text classification technology to the field of user demand and, by exploiting the parallelization characteristics of the Hadoop cloud computing platform, proposing a parallel text classification algorithm based on the naive Bayes algorithm in order to mine the hidden demand of power users and improve the quality of power service. Compared with the traditional method, the algorithm has better execution efficiency and scalability when the amount of data is large. At the same time, it can accurately classify users’ demands and provide strong data support for predicting potential service risks. In addition, the implementation of this algorithm can also serve as a reference for data analysis in other fields.

2 Automatic classification of power user’s requirements text

2.1 Text classification of power user’s requirements based on naive Bayes

Text classification algorithm based on Naive Bayesian is a learning method based on probability statistics [8]. The commonly used models are multivariate Bernoulli model and polynomial model. The calculation granularity of the two is different. The polynomial model takes words as the granularity and the Bernoulli model takes files as the granularity [9]. Therefore, the calculation methods of the prior probability and class conditional probability of the two are different. When calculating the posterior probability, for a power user’s requirements text d, in the polynomial model, only the words that appear in d will participate in the posterior probability calculation, while in the Bernoulli model, the words that do not appear in d but appear globally will also participate in the calculation, but participate as the “negative side” [10]. This paper uses polynomial model.

Let the power user’s requirements text d be represented by the feature words it contains, d = (t₁, t₂, ⋯ , t_|n|), where t_k is a feature word, k = 1, 2, ⋯ , |n|, n is the set of power user’s requirements feature words, and |n| is the number of feature words. Meanwhile, let C be the target class set, C = {C₁, C₂, ⋯ , C_|C|}, where C_j is a class label. According to the Bayesian formula, the probability that the power user’s requirements text d belongs to the category C_j is: $P (C_{j} | d) = \frac{P (C_{j}) P (d | C_{j})}{P (d)}$ (1)

In the formula, the denominator P (d) is independent of the category, so for the purpose of comparing categories formula (1) can be simplified to: $P (C_{j} | d) \propto P (C_{j}) P (d | C_{j})$ (2)

In the formula, both the class prior probability P (C_j) and the class conditional probability P (d|C_j) can be obtained through training set learning [11], and the maximum likelihood estimation is generally used as their estimation value. P (C_j) can be estimated from equation (3): $P (C_{j}) = \frac{N_{j}}{N}$ (3)

Where N_j is the total number of the feature words for power user’s requirements of category C_j, and N is the total number of feature words of the training set.

Since the power user’s requirements text d can be represented by the feature words contained therein, P (d|C_j) can be rewritten as: $P (d | C_{j}) = P ((t_{1}, t_{2}, \dots, t_{| n |}) | C_{j})$ (4)

Naive Bayes assumes that the influences of any two different feature words on the category are independent of each other [12], so formula (4) can be rewritten as: $P (d | C_{j}) = \prod_{k = 1}^{| n |} P (t_{k} | C_{j})$ (5)

Substituting formula (5) into formula (2) gives: $P (C_{j} | d) \propto P (C_{j}) \prod_{k = 1}^{| n |} P (t_{k} | C_{j})$ (6)

Based on the naive Bayes independence assumption, the class conditional probability of a Chinese text in formula (5) is converted into the class conditional probabilities of its feature words. The class conditional probability P (t_k|C_j) of a feature word of power user’s requirements is calculated as: $P (t_{k} | C_{j}) = \frac{w_{jk}}{\sum_{i = 1}^{| n |} w_{ji}}$ (7)

Where w_jk is the weight of the feature word t_k in the training set in category C_j, and $\sum_{i = 1}^{| n |} w_{ji}$ is the sum of the weights of all power user’s requirements feature words in category C_j.
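As an illustration only, formulas (6) and (7) can be sketched in Python as follows. The function names, the toy weights and the additive smoothing term `alpha` are assumptions made for this sketch and are not part of the method described above. Smoothing is added only so that a feature word unseen in a class does not produce a zero probability, and logarithms are summed instead of multiplying probabilities to avoid numerical underflow.

```python
import math


def class_conditional(weights, word, vocab_size, alpha=1.0):
    """Formula (7) with additive smoothing (alpha is an assumption of this sketch)."""
    total = sum(weights.values())  # sum of w_ji over the class
    return (weights.get(word, 0.0) + alpha) / (total + alpha * vocab_size)


def classify(priors, class_weights, doc_words, alpha=1.0):
    """Formula (6) in log space: log P(C_j) + sum_k log P(t_k | C_j)."""
    vocab = {w for weights in class_weights.values() for w in weights}
    scores = {}
    for label, prior in priors.items():
        log_score = math.log(prior)
        for word in doc_words:
            log_score += math.log(
                class_conditional(class_weights[label], word, len(vocab), alpha)
            )
        scores[label] = log_score
    # The predicted class is the one with the largest (log) posterior score.
    return max(scores, key=scores.get), scores


# Hypothetical priors and per-class feature word weights w_jk.
priors = {"arrears": 0.5, "meter": 0.5}
class_weights = {
    "arrears": {"bill": 3.0, "overdue": 2.0, "meter": 0.5},
    "meter": {"meter": 4.0, "reading": 2.0, "bill": 0.5},
}
label, scores = classify(priors, class_weights, ["meter", "reading"])
```

In this toy example the text containing "meter" and "reading" receives the higher score for the hypothetical class "meter", because both words carry more weight in that class.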

2.2 Feature selection of power user’s requirements based on Chi square statistics

Because the power user’s requirements text screened in the preprocessing stage has a high dimension, feature selection is required [13] to obtain a set of power user’s requirements feature words with high discrimination and small dimension [14]. In this paper, the χ² (chi square) statistic is used for feature selection. This method assumes that the feature item and the category are independent of each other [15], and the chi square value measures the degree of deviation from this assumption. The larger the chi square value is, the more representative the feature item is of the category. The calculation formula of this method is as follows: $\chi^{2} (k, c) = \frac{N \times {(AD - BE)}^{2}}{(A + E) \times (A + B) \times (B + D) \times (E + D)}$ (8)

Where N is the number of power user’s requirements texts, k represents the feature item, and c represents the category. B is the total number of power user’s requirements texts including feature item k in non-category c, E is the total number of power user’s requirements texts excluding feature item k in category c, A is the total number of power user’s requirements texts including feature item k in category c, and D is the total number of power user’s requirements texts excluding feature item k in non-category c.
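A minimal sketch of formula (8) follows, assuming that N equals A + B + E + D, that is, every text falls into exactly one of the four cells. The function names and the example counts are hypothetical and serve only to show how the statistic ranks candidate feature words.

```python
def chi_square(A, B, E, D):
    """Chi square statistic of formula (8) for one feature item and one category."""
    N = A + B + E + D
    denominator = (A + E) * (A + B) * (B + D) * (E + D)
    if denominator == 0:
        return 0.0  # degenerate table: the statistic is undefined, treat as no signal
    return N * (A * D - B * E) ** 2 / denominator


def select_features(tables, k):
    """Keep the k feature words with the largest chi square values."""
    scored = {word: chi_square(*cells) for word, cells in tables.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Hypothetical (A, B, E, D) counts for two candidate words and one category.
tables = {
    "overdue": (40, 10, 20, 30),  # concentrated in the category
    "please": (25, 25, 25, 25),   # evenly spread, carries no class information
}
top = select_features(tables, k=1)
```

The evenly spread word scores 0, while the concentrated word scores about 16.67, so only the latter is kept.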

2.3 Word frequency statistics of power user’s requirements based on TFIDF algorithm

2.3.1 TFIDF algorithm

TFIDF is a statistical method used to evaluate the importance of a power user’s requirements word to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency of its appearance in the corpus [16]. TFIDF combines two quantities: term frequency (TF) and inverse document frequency (IDF) [17]. TF represents the number of times that a word s_i of the segmented word sequence S = (s₁, s₂, ⋯ , s_i) appears in the text. DF represents the text frequency of s_i, that is, the frequency of texts containing s_i in the power user’s requirements text set. IDF represents the inverse text frequency of s_i, and the formula is as follows: ${IDF}_{i} = \ln (N / \eta_{{DF}_{i}})$ (9)

Where η represents the proportion.

The weight of each power user’s requirements word is then calculated according to its importance, and the calculation formula is as follows: $\omega_{i, j} = {TF}_{i, j} \times {IDF}_{i} = {TF}_{i, j} \times \ln (N / \eta_{{DF}_{i}})$ (10)

In practical application, TF_i,j needs to be normalized as $\lambda_{{TF}_{i, j}} = \ln (1 + {TF}_{i, j})$. At this time: $\omega_{i, j} = \ln (1 + {TF}_{i, j}) \times \ln (N / \eta_{{DF}_{i}})$ (11)
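The weight of formula (11) can be sketched as follows. This is a sketch under a stated assumption: the quantity η_DF_i is read here as the number of texts that contain the word s_i, which is the usual document frequency and is an interpretation rather than something the text states explicitly. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """docs: list of token lists. Returns one {word: weight} dict per text."""
    n_docs = len(docs)
    # Document frequency: number of texts containing each word (assumed reading of DF_i).
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency TF_ij
        weights.append(
            {w: math.log(1 + tf[w]) * math.log(n_docs / df[w]) for w in tf}
        )
    return weights


# Hypothetical segmented texts.
docs = [
    ["meter", "meter", "fault"],
    ["bill", "fault"],
    ["bill", "overdue"],
]
w = tfidf_weights(docs)
```

In the first text, "meter" appears twice and in no other text, so its weight is ln(3) × ln(3), while "fault" appears in two of the three texts and receives the smaller weight ln(2) × ln(1.5).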

2.3.2 Improvement of TFIDF algorithm

The TFIDF algorithm is improved so that the weights of the feature words of power user’s requirements are sensitive to the discrimination of categories. The improved inverse document frequency is: $IDFCF = \log (\frac{m | N |}{m + \lambda o}) = \log (\frac{| N |}{1 + \lambda o / m})$ (12)

Where m is the number of power user’s requirements texts in class C_j that contain the term s_i, and o is the number of power user’s requirements texts outside class C_j that contain the term s_i. λ is the classification weight coefficient of the power user’s requirements words. A larger λ weakens the contribution of the power user’s requirements words to the weight in class discrimination, while a smaller λ emphasizes that contribution. If there are a large number of proper nouns in the power user’s requirements corpus, it is better to take a smaller λ to enhance the discrimination weight between words [18]. In the improved IDF formula, the function value increases with m. A larger m indicates that the texts of class C_j containing s_i account for a larger proportion of all texts containing s_i, so the calculated IDF value is larger and words that appear frequently within a class have a larger weight. Therefore, the improved TFIDFCF algorithm gives a larger weight to words that appear more frequently in one class than in other classes [19].
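The behaviour of formula (12) can be checked numerically with a short sketch. The function name, the use of the natural logarithm and the example counts are assumptions made for illustration, since the base of the logarithm is not specified above.

```python
import math


def idfcf(m, o, n_texts, lam=1.0):
    """Improved inverse document frequency of formula (12).

    m: texts in class C_j containing the term, o: texts outside C_j containing it,
    n_texts: |N|, lam: the classification weight coefficient lambda.
    """
    if m <= 0:
        raise ValueError("m must be positive: the term does not occur in the class")
    return math.log(m * n_texts / (m + lam * o))


base = idfcf(m=10, o=10, n_texts=100)            # ln(50)
more_in_class = idfcf(m=20, o=10, n_texts=100)   # larger m gives a larger value
larger_lambda = idfcf(m=10, o=10, n_texts=100, lam=2.0)  # larger lambda gives a smaller value
```

With these hypothetical counts the value grows as m increases and shrinks as λ increases, and both forms of formula (12) give the same number, ln(50), for the base case.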

2.4 Automatic text classification of power user’s requirements based on parallel naive Bayes

2.4.1 Introduction to MapReduce programming framework

MapReduce runs on clusters of ordinary PCs and adopts the idea of “divide and conquer”. It distributes operations on large-scale text sets of power user’s requirements to the sub nodes managed by a master node, and then integrates the intermediate results of each node to obtain the final results [20]. The main idea is to automatically decompose the processing of the power user’s requirements text to be classified into Map tasks and Reduce tasks. The MapReduce framework operates on key value pairs 〈key, value〉. That is to say, the framework regards the input of power user’s requirements text as a group of 〈key, value〉 pairs, and also generates a group of key value pairs as the output. The types of these two groups of key value pairs may be different. During processing, each node reads and processes its locally stored text of power user’s requirements (map), then combines the processed results and distributes them to the Reduce nodes, which avoids the transmission of a large number of texts and improves processing efficiency. The input and output types of a MapReduce job on power user’s requirements text are as follows: (input) 〈key1, value1〉 ⟶ map ⟶ 〈key2, value2〉 ⟶ combine ⟶ 〈key2, value2〉 ⟶ reduce ⟶ 〈key3, value3〉 (output).

MapReduce distributed processing framework can not only process large-scale power user’s requirements text, but also hide many cumbersome details, such as automatic parallelization and load balancing. Moreover, MapReduce has very good scalability. Every time a server is added, the computing capacity of the server can be connected to the cluster, which can greatly save costs.
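The flow of key value pairs through map, combine and reduce can be imitated on a single machine with a few lines of Python. This is only a sketch of the data flow, not Hadoop code: the function names are hypothetical, each inner list stands for the texts stored on one node, and word counting is used as the simplest possible job.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """<key1, value1> = (doc_id, text)  ->  list of <key2, value2> = (word, 1)."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Local pre-aggregation on one node, so fewer pairs are sent over the network."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle(pairs):
    """Group all values that share the same key before they reach the reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """<key2, [value2, ...]>  ->  <key3, value3> = (word, total count)."""
    return key, sum(values)


def run_job(node_inputs):
    """node_inputs: one list of (doc_id, text) records per simulated node."""
    combined = []
    for records in node_inputs:
        mapped = [pair for doc_id, text in records for pair in map_phase(doc_id, text)]
        combined.extend(combine(mapped))
    return dict(reduce_phase(key, values) for key, values in shuffle(combined).items())


result = run_job([[(1, "meter fault meter")], [(2, "bill fault")]])
```

Here the first simulated node already merges its two occurrences of "meter" in the combine step, so only one pair for that word reaches the reduce step.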

2.4.2 MapReduce execution process

As a programming model, MapReduce is used to carry out parallel computing on large-scale power user’s requirements text sets. It mainly consists of four independent parts: Client, JobTracker, TaskTracker and HDFS, which together cover the whole process from task submission to task completion. The Client writes and deploys programs and submits program tasks to the next level. JobTracker controls the overall computing framework of parallel processing, allocates the resources of the entire node group, and schedules tasks. TaskTracker receives and executes instructions from JobTracker and reports the execution status to JobTracker through the heartbeat communication mechanism. The Hadoop distributed file system (HDFS) has high data throughput, which makes it suitable for storing the programs and data to be accessed. HDFS also allows the data of the power center to be accessed in the form of streams.

Figure 1 describes the execution process of MapReduce, which is mainly divided into task submission, initialization, task allocation, task execution and task completion. In task submission, the relevant information of the task (Map method, Reduce method and transport channel) is first configured and the task is submitted (Fig. 1 ①). An ID is then obtained through the allocation node (Fig. 1 ②), the method and channel configuration of the task is checked, the task resources are organized and copied to the distributed file system (Fig. 1 ③), and the task execution node is notified to start the task (Fig. 1 ⑤). In initialization, JobTracker obtains the task input information through the distributed file system (Fig. 1 ⑥) and initializes the Map and Reduce programs according to the task resource information. In task allocation, TaskTracker contacts JobTracker through the heartbeat communication mechanism and obtains its assigned task from the task information scheduled by JobTracker (Fig. 1 ⑦). In task execution, after obtaining its own task, TaskTracker fetches the configuration information, code and data related to the task resources and copies them to the system folder (Fig. 1 ⑧). After a Map or Reduce task is completed, its output is written to HDFS and saved (Fig. 1 ⑨). In task completion, the task allocation node receives the completion signal of the task execution node through the heartbeat communication mechanism, registers the completed task and reports it to the client, where the completion status is finally presented. In addition to the above elements, the MapReduce implementation also contains the following important complementary mechanisms:

1. Data fragmentation: During task submission, data is divided into multiple fragments and stored in the HDFS. Each fragment is 64MB or 128MB in size.

2. Data localization: In the task allocation phase, tasks are allocated as far as possible to the nodes that store the data fragments they require, to reduce data transfer and network overhead.

3. Shuffle and Sort: In the Reduce task execution phase, Shuffle and Sort are performed on the intermediate results generated by Map tasks, that is, values with the same key are aggregated to reduce the data volume and time consumption of Reduce tasks.

4. Combiner: A Combiner function is added to further reduce the amount of output data of the Map tasks and improve performance.

The interaction between these elements is as follows: After a job is submitted, Map and Reduce jobs are initialized and assigned to the TaskTracker nodes. Each TaskTracker node performs the corresponding Map or Reduce jobs, performs the Shuffle, Sort, or Combiner operations, and outputs the final result to the HDFS. Throughout the process, JobTracker is responsible for scheduling and monitoring tasks, communicating with the TaskTracker nodes through the heartbeat mechanism and assigning task information to them. In addition, HDFS serves as the storage system of MapReduce and is responsible for storing the input and output data of jobs.

Fig. 1

Main process of MapReduce.

2.4.3 Improved classification algorithm model based on parallel naive Bayesian

It can be seen from the above analysis that the power user’s requirements text can be classified with the naive Bayes algorithm by accurately extracting its feature words to construct feature vectors and comparing the class membership of each vector. Because the amount of power user’s requirements text is large, a centralized server processing mode can hardly meet the requirements of real-time monitoring of power user’s requirements text. Hadoop is both a distributed application framework and a distributed file system for information storage. Through distributed storage and parallel processing of information, Hadoop can effectively reduce the cost of information movement and improve the efficiency of system operation. Therefore, the naive Bayes algorithm is combined with the Hadoop big data processing platform to parallelize the naive Bayes classification algorithm and build a MapReduce-based parallel text classification method for power user’s requirements, which improves the capacity of the algorithm to process power user’s requirements. With the naive Bayes classification algorithm encapsulated in MapReduce, classification tasks can run concurrently and automatically in a cluster environment composed of multiple computers. The feature extraction and mapping operations on the original data are completed in the Map process. After shuffling, the text of power user’s requirements is merged in the Reduce process to calculate the final classification result. This reduces the overhead of data movement and exchange, effectively improves the running speed of the classification algorithm, and solves the scalability problem at the system level. At the same time, by localizing the text storage and feature extraction of power user’s requirements, the number of local feature words extracted and stored can be increased, and the representativeness and discrimination of the feature vectors can be improved. In the classification of different subject categories, different feature words and different weights are used to construct the feature vectors, so as to improve the classification accuracy of power user’s requirements text. In addition, the naive Bayes algorithm, as a machine learning algorithm, has strong self-learning ability. It can guarantee the classification performance of the system through sample training. Moreover, the trained classification system can continuously improve its classification and data mining ability by learning new samples during use. The improved naive Bayes parallel classification algorithm based on MapReduce is as follows:

(1) Map function

Input: the training text set of power user’s requirements

Output: key value pair 〈Key’, Value’〉, where Key’ is the combination of category label, attribute name and attribute value, and Value’ is the frequency.

Map(Key, Value)

{

for (i = 0; i < total number of samples; i++)

{

Assign a category label and calculate the value of each attribute;

Assign the label value to Key and the attribute value to Value;

Output key value pair 〈Key, Value〉;

for (each attribute value)

{

Search for properties consistent with the test sample;

Construct a link string of the label, the attribute name and the attribute value;

Assign the string to Key’ and set Value’ to 1;

Output key value pair 〈Key’, Value’〉;

}

}

}

(2) Reduce function

Input: Key’ and Value’ (the Key’ and Value’ output by the Map function)

Output: 〈Key”, Value”〉 key value pair, where Key” is the combination of label, attribute name and value, and Value” is the frequency.

Reduce(Key’, Value’)

{

Initialize the counter Sum = 0 to record the current statistical frequency of Key’;

while (pKey’) // pKey’ is the pointer of variable Key’

{

Sum += Value’.get();

pKey’ = pKey’ + 1;

}

Assign Key’ value to Key” and Sum value to Value”;

Output key value pair 〈Key”, Value”〉;

}
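For readers who prefer executable code, the Map and Reduce functions above can be approximated by the following single-machine Python sketch. It is not the Hadoop implementation: the function names, the `|` separator used in the link string and the sample data are hypothetical, and the first pair emitted by the Map function is simplified to 〈label, 1〉 so that class frequencies can be counted by the same Reduce function.

```python
from collections import defaultdict


def map_sample(label, attributes):
    """Emit <label, 1> and one <"label|name|value", 1> pair per attribute."""
    yield label, 1
    for name, value in attributes.items():
        # Link string of the label, the attribute name and the attribute value.
        yield f"{label}|{name}|{value}", 1


def reduce_counts(key, values):
    """Sum all values that share the same key to obtain its frequency."""
    total = 0
    for value in values:
        total += value
    return key, total


def count_frequencies(samples):
    """Run map, group by key (shuffle) and reduce over all training samples."""
    grouped = defaultdict(list)
    for label, attributes in samples:
        for key, value in map_sample(label, attributes):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical training samples: (category label, {attribute name: attribute value}).
samples = [
    ("arrears", {"word": "overdue"}),
    ("arrears", {"word": "bill"}),
    ("meter", {"word": "bill"}),
]
counts = count_frequencies(samples)
```

The resulting frequencies are the raw material from which prior and class conditional probabilities can then be estimated.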

The process of the improved classification algorithm model based on the MapReduce naive Bayes algorithm generally includes the following steps:

1. Data preprocessing: Divide and mark the data to be classified, and store the data in the HDFS.

2. Calculate the conditional probability: Calculate the conditional probability of each feature in the training data using the Map task and store the results in the HDFS.

3. Calculate prior probabilities: Calculate prior probabilities of each category of training data through Reduce tasks and store the results in the HDFS.

4. Predict unknown data classification: Use the Map task to classify unknown data and export the result to the HDFS.

Steps 2 and 3 are time-consuming computing processes that can be accelerated by parallel computing. The Map tasks in step 2 can compute in parallel on the divided data blocks. Each Map task calculates the conditional probability for one piece of data, and the conditional probability of the entire data set is finally obtained by combining the results in the Reduce task. The Reduce tasks in step 3 can also run in parallel. Each Reduce task calculates the prior probability for a part of the data, and the results are finally combined to obtain the prior probability of the whole data set. Parallel computing thus improves the computing efficiency of the algorithm without sacrificing its accuracy and benefits from the advantages of MapReduce. By constructing a MapReduce based parallel processing algorithm for power user’s requirements text, the extraction and merging of the feature words of power user’s requirements are completed. The feature words are sorted by the information gain algorithm according to their occurrence frequency in the power user’s requirements text, the number of texts and other characteristic information. According to the different classification topics, the feature words are given different weights, and the specified number of feature words is selected to construct the text feature vector, which is used as the input of the naive Bayesian classification algorithm for training. The text classifier constructed in this way realizes the classification of the power user’s requirements text.
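The claim that parallel computation does not change the result can be illustrated with a small sketch in which the prior probabilities of formula (3) are computed from partial counts of separate data blocks. Python threads stand in for cluster nodes here. The function names, the block layout and the sample data are assumptions made for illustration only.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_block(block):
    """Partial N_j for one data block: feature words counted per category."""
    counts = Counter()
    for label, words in block:
        counts[label] += len(words)
    return counts


def merge(partials):
    """Combine the partial counts of all blocks into global counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def priors_parallel(blocks, workers=4):
    """Formula (3), P(C_j) = N_j / N, computed from blocks processed concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_block, blocks))
    counts = merge(partials)
    n_total = sum(counts.values())
    return {label: n_j / n_total for label, n_j in counts.items()}


# Hypothetical data blocks, each a list of (category label, feature words).
blocks = [
    [("arrears", ["overdue", "bill"]), ("meter", ["meter"])],
    [("arrears", ["bill"])],
]
parallel = priors_parallel(blocks)
sequential = priors_parallel([[sample for block in blocks for sample in block]], workers=1)
```

Because counting is additive, merging the per-block counts gives exactly the same priors as processing all samples in a single block, which is why the split into blocks affects only the running time.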

3 Experimental analysis

An electric power company is taken as the experimental object to test the performance of this method. Four virtual machines are created using VMware Workstation Pro 14. Each virtual machine has one CPU core, 1 GB of memory, a 20 GB hard disk and one virtual network card. A Hadoop distributed cluster is built, and Anaconda3, Python 3.7 and PyCharm are used as the development environment. The power user’s requirements text obtained from the power center is used as the experimental corpus, and its categories include: electricity recovery in arrears, electricity meter abnormality, electrical equipment loss, electricity demand coordination of customer side, abnormality of meter reading data and others. A total of 18000 power user’s requirements texts are included, and the ratio of training data to test data is 2 : 1, that is, 12000 training texts and 6000 test texts.

Through the power center, the power purchase and sales of a power enterprise in the area covered by the power company can be queried. The work order processing interface is shown in Fig. 2.

Fig. 2

Processing interface diagram of purchase and sale of electrical orders by power enterprises.

It can be seen from Fig. 2 that the power purchase and sales of various users in the power coverage area can be browsed directly as needed. Clearly dividing the user categories effectively improves the efficiency of power problem inquiry and processing. The operation is convenient, and the quality of users’ power service can be effectively improved. The integrated control ability of the power company is improved, and the exchange of information between users and enterprises becomes convenient and widely applicable.

Fig. 3 shows the proportions of electricity recovery in arrears, electricity meter abnormality, electrical equipment loss, electricity demand coordination of customer side, abnormality of meter reading data and others in the power user’s requirements text, as classified by the method in this paper.

Fig. 3

Distribution proportion of various problems in power user’s requirements text.

It can be seen from Fig. 3 that electricity recovery in arrears, electricity meter abnormality, electrical equipment loss, electricity demand coordination of customer side, abnormality of meter reading data and others account for 35%, 15%, 10%, 10%, 5% and 25% of the power user’s requirements text, respectively. Electricity recovery in arrears, electricity meter abnormality and electrical equipment loss account for a large part of the demand of power users and are representative. When staff divide users’ power consumption problems into business types, they are often influenced by their own professional level and by enterprise management policies, so the results of manual classification deviate from the real problems reported by users. With the method in this paper, the demand of power users is analyzed from the acceptance content of the power user’s requirements text, which avoids the influence of subjective factors such as the professional level of staff, enterprise management policies and the accuracy of power business categories.

After repeated learning and training, the average fitness value and the absolute error of network training are shown in Fig. 4.

Fig. 4

Change of average fitness and absolute error of network training.

It can be seen from Fig. 4 that the average fitness value of the method in this paper becomes stable once the number of training iterations exceeds 20, and the number of network convergence steps is 7. This indicates that the method has good average fitness and a fast network convergence speed, can quickly classify power user’s requirements text, and has high classification accuracy.

Mean IU is used as the evaluation index to test the influence of the proportions α and β of the energy function on the classification effect of this method on power user’s requirements text. The experimental results are shown in Fig. 5.

Fig. 5

Mean IU value results of different item proportions.

It can be seen from Fig. 5 that the mean IU value is highest when the proportions α and β in the energy function are about 0.4 and 0.6, respectively. At these values, the classification effect of power user’s requirements text is the best.
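The paper does not give a formula for mean IU. The sketch below assumes the common definition, in which the intersection over union of each class is the number of correctly classified texts divided by the number of texts that either belong to the class or are predicted as the class, and the per-class values are averaged. The function name `mean_iu` and the example matrix are hypothetical.

```python
def mean_iu(conf):
    """Mean intersection over union from a square confusion matrix.

    conf[i][j] is the number of texts of actual class i predicted as class j.
    """
    k = len(conf)
    ious = []
    for c in range(k):
        correct = conf[c][c]
        actual = sum(conf[c])                           # texts that belong to class c
        predicted = sum(conf[r][c] for r in range(k))   # texts assigned to class c
        union = actual + predicted - correct
        if union:  # skip classes absent from both labels and predictions
            ious.append(correct / union)
    return sum(ious) / len(ious)


conf = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
print(round(mean_iu(conf), 3))  # 0.826
```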

The learning factor is set to 0.4, 0.8 and 1.2 in turn, and the linearity of the method in this paper in positive and negative classification is analyzed for each value. The results are shown in Fig. 6.

Fig. 6

Classification linearity with different learning factors.

It can be seen from Fig. 6(a) that the linearity of positive classification increases with the number of training iterations and, once it reaches a certain value, remains unchanged for every learning factor. When the learning factor is 0.8, the linearity is low before about 20 training iterations and increases rapidly after 20 iterations. When the learning factor is 0.4 or 1.2, the two curves differ considerably before about 60 training iterations; at about 60 iterations, the linearity values for all three learning factors coincide. The results show that, for different learning factors and after different numbers of training iterations, the linearity of positive classification reaches 0.99, so the positive classification ability is good. It can be seen from Fig. 6(b) that, as the number of training iterations increases, the linearity curves for the different learning factors are relatively smooth, with little fluctuation. When the learning factor is 0.8, the linearity of negative classification is about 0.19; when the learning factor is 0.4 and the number of training iterations is about 30, it is about 0.6. In summary, the method in this paper achieves high linearity in both positive and negative classification under different learning factors, and the accuracy of power user’s requirements text classification is good.

The delay time of text classification is taken as an indicator to measure the real-time performance of text classification of power user’s requirements, and the results are shown in Fig. 7.

Fig. 7

Delay time of text classification.

It can be seen from Fig. 7 that the delay time of the proposed method increases with the number of power user’s requirements texts. When the number of texts is between 500 and 1500, the delay time of text classification is 0.02 s. When the number of texts exceeds 1500, the delay time rises slightly and slowly; when the number of texts is 4000, the classification delay time is 0.055 s longer than the minimum delay time. A difference of 0.055 s is very small, so the method in this paper has good real-time performance in the classification of power user’s requirements text.

The ability of the method in this paper to extract the feature words of power user’s requirements is also tested. For texts of different lengths and contents, the extraction ability is analyzed by calculating the peak signal-to-noise ratio and the normalization coefficient. The results are shown in Table 1.

Table 1

Extraction ability test of feature words

Text category	Text length	Peak signal-to-noise ratio	Normalization coefficient
Text	50	42.52	1
	100	39.78	1
	230	37.51	1
	1030	35.32	1
Digital	50	39.45	1
	140	36.45	1
	310	33.01	1
	1030	33.11	1

It can be seen from Table 1 that, when the feature words of power user’s requirements are extracted from texts of different contents and lengths, the peak signal-to-noise ratio always exceeds 33, with a maximum of 42.52, and the normalization coefficient is always 1. The accuracy of the feature words extracted by the method in this paper therefore reaches 100%, and its feature word extraction ability is excellent.

The requirement texts of 1800 advanced users are classified with the proposed method, and the results are compared with those of the method in Ref. [6]. The classification results are shown in Fig. 8.

Fig. 8

Classification results of power user’s requirements text.

According to Fig. 8, the method proposed in this paper can effectively classify user requirement text. Only a few classification errors occur, in the texts on electricity recovery in arrears and on electricity meter abnormality, and the overall classification effect on power user’s requirements text is good. In contrast, the method of Ref. [6] cannot classify the user requirement text accurately: it produces many classification errors in the texts on electricity recovery in arrears and on electricity meter abnormality, resulting in a poor classification effect. The experimental results show that the proposed method classifies user requirement text accurately.

The confusion matrix is used to analyze the classification effect of the power user’s requirements text in the method in this paper. 500 texts of each type of power user’s requirements are randomly selected. The horizontal axis of the confusion matrix is the classification result, the vertical axis is the actual category, and the diagonal value is the number of texts accurately classified. The confusion matrix of the power user’s requirements text classified in the proposed method is shown in Fig. 9.

Fig. 9

Confusion matrix of power user’s requirements text classification.

It can be seen from Fig. 9 that, when the method in this paper classifies power user’s requirements text, the largest classification error occurs for electricity recovery in arrears and is only 0.06. The experiment shows that the classification error of this method is low.
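The confusion matrix analysis above can be sketched as follows, with rows as actual categories and columns as classification results, as in Fig. 9. The per-category error is taken here as the share of texts of that category that are assigned to another category; this reading of the reported error, the function names and the toy labels are assumptions of the sketch.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual categories, columns are classification results."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def per_class_error(matrix):
    """Share of each actual category that is assigned to another category."""
    return [1 - row[i] / sum(row) for i, row in enumerate(matrix)]


labels = ["arrears", "meter", "other"]
actual = ["arrears"] * 4 + ["meter"] * 4 + ["other"] * 2
predicted = (["arrears"] * 3 + ["meter"]   # one arrears text misclassified
             + ["meter"] * 4
             + ["other"] * 2)

matrix = confusion_matrix(actual, predicted, labels)
print(matrix)                   # [[3, 1, 0], [0, 4, 0], [0, 0, 2]]
print(per_class_error(matrix))  # [0.25, 0.0, 0.0]
```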

The Kappa coefficient and ROC-AUC are used to evaluate the performance of the classification model. The Kappa coefficient measures the performance of a classification model when the category distribution is unbalanced. Its value range is [−1, 1]: a value of 0 means that the model agrees with the actual categories no more often than random guessing, a value of 1 means that the model classifies without error, and negative values down to −1 mean that the model performs worse than random guessing. The area under the ROC curve (AUC) represents the ability of the classifier to distinguish between positive and negative samples. The closer the AUC is to 1, the better the classifier performs; the closer the AUC is to 0.5, the worse it performs. The results are shown in Figs. 10 and 11. According to Figs. 10 and 11, the Kappa coefficient of the proposed method is close to 1, and its ROC curve bulges toward the upper left corner with a large area under the curve, which indicates that the proposed method classifies user requirement text well.

Fig. 10

Results of kappa coefficient.

Fig. 11

ROC-AUC curve results.
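Both indices can be computed with a few lines of code. The sketch below assumes the standard definition of Cohen's Kappa (observed agreement corrected for chance agreement) and computes the AUC for a binary problem as the share of positive-negative pairs that the classifier ranks correctly, counting ties as one half. How the paper extends ROC-AUC to its six categories is not stated; a one-versus-rest treatment would be one option. The function names and example values are hypothetical.

```python
def cohen_kappa(conf):
    """Cohen's Kappa from a square confusion matrix (rows actual, columns predicted)."""
    k = len(conf)
    n = sum(sum(row) for row in conf)
    observed = sum(conf[i][i] for i in range(k)) / n
    expected = sum(
        sum(conf[i]) * sum(conf[r][i] for r in range(k)) for i in range(k)
    ) / n ** 2
    return (observed - expected) / (1 - expected)


def roc_auc(labels, scores):
    """Binary AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


print(cohen_kappa([[5, 0], [0, 5]]))      # 1.0  (no classification error)
print(cohen_kappa([[25, 25], [25, 25]]))  # 0.0  (no better than random guessing)
print(cohen_kappa([[0, 5], [5, 0]]))      # -1.0 (complete disagreement)
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```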

4 Conclusion

This paper briefly analyzes the naive Bayes algorithm and the MapReduce framework, and builds an automatic classification method for power user’s requirements text based on a parallel naive Bayes algorithm on the Hadoop platform. Taking advantage of the simplicity, efficiency and applicability of the naive Bayes algorithm to unstructured data classification, and combining the distributed storage of the cloud computing platform with the parallel processing capability of MapReduce, the naive Bayes classification algorithm is parallelized and encapsulated, which reduces the cost of data movement and exchange in the classification process and improves the operation efficiency of the classification algorithm. The local storage characteristic of power user’s requirements text is used to increase the number of extracted feature words, and the information gain algorithm is used to give different weights to the specially added words to enhance the discrimination of feature vectors. At the same time, the self-learning ability of the naive Bayesian algorithm is used to continuously improve the classification ability. The experiments show that the proposed method can classify power user’s requirements text quickly and accurately, and is of great significance for online monitoring of power user’s requirements, timely control of users’ power consumption and maintenance of power stability.

Although this method has achieved some success in the automatic classification of power user’s requirements text, it still has limitations that point to future research directions. First, it depends on a labeled training data set, and the quality and quantity of the training data have an important impact on the performance of the classifier; how to obtain sufficient high-quality training data is therefore still a challenge. Second, the information gain algorithm is currently the main means of weighting features, but other feature selection methods may be more suitable for some fields and scenarios, so the choice of feature selection method needs further study. Third, for complex power system models, more domain knowledge and prior information need to be combined to improve the naive Bayes classifier and thus its applicability and accuracy. Finally, this method is limited to the classification of power user’s requirements text, and other types of text classification may require adjustments and improvements, such as adding other feature extraction methods (for example, word vector models) and considering the context of the text. In conclusion, future research can start from several aspects, such as better use of feature selection algorithms, optimization of the performance of the naive Bayes classifier, and combination with other machine learning algorithms, so as to expand its application to different fields and scenarios.

References

1. Tandon, Kiran and Sah A.N., Customer satisfaction as mediator between website service quality and repurchase intention: an emerging economy case, Operations Research 59(1-2) (2019), 155–156.

2. Mortaz and Valenzuela, Evaluating the impact of renewable generation on transmission expansion planning, Electric Power Systems Research 169(APR.) (2019), 35–44.

3. Franz, Kaiser, Karen and Alim, Order parameter allows classification of planar graphs based on balanced fixed points in the Kuramoto model, Physical Review E 99(5) (2019), 52308–52308.

4. Lopes, Agnelo, Teixeira C.A., Laranjeiro and Bernardino, Automating orthogonal defect classification using machine learning algorithms, Future Generation Computer Systems 102(Jan.) (2020), 932–947.

5. Zhou, Ma and Li, Feature selection based on term frequency deviation rate for text classification, Applied Intelligence 51(1) (2021), 1–20.

6. Xue, Wei and Guo, A real-time naive Bayes classifier accelerator on FPGA, IEEE Access PP(99) (2020), 1–1.

7. Liu, Zhao H.H., Teng J.Y., Yang Y.Y. and Zhu Z.W., Parallel naive Bayes algorithm for large-scale Chinese text classification based on Spark, Journal of Central South University 26(1) (2019), 1–12.

8. Aridas K.C., Karlos, Kanas G.V., Fazakis and Kotsiantis S.B., Uncertainty Based Under-Sampling for Learning Naive Bayes Classifiers Under Imbalanced Data Sets, IEEE Access PP(99) (2019), 1–1.

9. Arvor, Betbeder, Daher, Blossier and Junior, Towards user-adaptive remote sensing: knowledge-driven automatic classification of Sentinel-2 time series, Remote Sensing of Environment 264(17) (2021), 112615.

10. Eken, Menhour and Koksal, Doca: a content-based automatic classification system over digital documents, IEEE Access 7(99) (2019), 97996–98004.

11. Wang, Xu and Che, Power quality disturbance classification based on compressed sensing and deep convolution neural networks, IEEE Access PP(99) (2019), 1–1.

12. Strack J.L., Carugati, Orallo C.M., Maestri S.O., Donato P.G. and Funes M.A., Three-phase voltage events classification algorithm based on an adaptive threshold, Electric Power Systems Research 172(JUL.) (2019), 167–176.

13. Lin, Wang, Zhao, Chen and Huang, Power quality disturbance feature selection and pattern recognition based on image enhancement techniques, IEEE Access PP(99) (2019), 1–1.

14. Babakmehr, Sartipizadeh and Simes M.G., Compressive informative sparse representation-based power quality events classification, IEEE Transactions on Industrial Informatics 16(2) (2020), 909–921.

15. , An, Yan and Ji, Incremental attribute reduction method based on chi-square statistics and information entropy, IEEE Access PP(99) (2020), 1–1.

16. Lopes, Agnelo, Teixeira C.A., Laranjeiro and Bernardino, Automating orthogonal defect classification using machine learning algorithms, Future Generation Computer Systems 102(Jan.) (2020), 932–947.

17. , Wu and Xue, Transfer naive Bayes algorithm with group probabilities, Applied Intelligence 50(1) (2020), 61–73.

18. Shi, Zurada, Karwowski, Guan and Cakit, Batch and data streaming classification models for detecting adverse events and understanding the influencing factors, Engineering Applications of Artificial Intelligence 85(Oct.) (2019), 72–84.

19. Mohamed M.R., Nasr A.A., Tarrad I.F. and Abdulmageed S.R., Exploiting incremental classifiers for the training of an adaptive intrusion detection model, International Journal of Network Security 21(2) (2019), 275–289.

20. Wei X.Z. and Zhao H.N., Research on Multi-source and multi-modal big Data Retrieval Based on MapReduce, Computer Simulation 38(04) (2021), 422–426.