Abstract
To address the low mining accuracy and high data noise of existing privacy protection data mining methods in blockchain, a privacy protection data mining algorithm based on decision tree classification is proposed. First, the privacy protection data in the blockchain is extracted. Then the distances between the data in the data set to be denoised are calculated and updated, and the updated data is denoised. Finally, starting from the root of the decision tree, the information gain of the privacy protection data is calculated, the attribute probability of the data is determined, and in-depth mining of the privacy protection data in the blockchain is completed by calculating the decision leaf density. The experimental results show that the mining accuracy of the proposed algorithm stays above 90% and the data noise remains below 0.6 dB.
Introduction
With the advent of the network era, the network has become a key link in economic development and daily life, and it has been integrated into every corner of society [1]. As the volume of data in the network grows, all industries continue to collect private data from the network in pursuit of economic benefit [2]. Blockchain is the underlying information processing technology represented by Bitcoin. It stores, collects and mines data through a chained block data structure [15]. A great deal of business value is hidden in the potential information contained in privacy protection data. Mining this data therefore supports continued economic development and plays a key guiding role in the positioning and development strategy of enterprises [12]. Researchers in related fields have accordingly carried out extensive work and achieved some results.
Literature [8] proposed a privacy protection data mining method based on homomorphic encryption. This method first extracts data samples, maps them into a correlation space and analyzes the nonlinear problems. It then re-matches the sample data, constructs a kernel space expression mode, scores the constructed mode, determines the characteristics of the data through a clustering prevention method, and finally mines the data according to a fitness function. The mining route of this method is clear and its accuracy is good, but the mined data still contains some noise and needs further improvement. Literature [13] proposed a sensitive data mining method based on differential privacy. This method first analyzes the advantages of differential privacy, then analyzes the frequent items of the privacy data in depth, and completes the mining by extracting and preprocessing the privacy data. The method is fast, but its mining accuracy is poor. Literature [14] proposed a clustering-based method for mining cloud privacy data. This method mines according to the characteristics of existing cloud privacy data, measures the similarity between cloud privacy data with the cosine method, and sets the rules of the privacy data set with the K-means clustering algorithm to complete the mining. The method identifies the mined data effectively, but mining takes a long time.
To address the shortcomings of the above methods, this paper designs a new privacy protection data mining algorithm in blockchain based on decision tree classification. The specific route of this paper is as follows:
Firstly, all data in the blockchain is divided into clusters, the relationship between the data in each cluster is determined, and the dissimilarity of the data in all clusters is calculated to preliminarily extract the privacy protection data. The centroid cluster is then calculated, and the extraction of privacy protection data in the blockchain is completed by comparing the data with the centroid cluster;
Then, the centroid vector of the privacy protection data in the blockchain is set, the distances between the data in the data set to be denoised are calculated and updated, the updated data is denoised, and the high-dimensional data is processed with the help of a universe, which completes the preprocessing of privacy protection data in the blockchain;
Finally, the basic principle of the decision tree classification method is analyzed. Starting from the root of the decision tree, the information gain of the privacy protection data is calculated to determine the attribute probability of the data. On this basis, the purity of the mined data is improved by determining the information entropy of the privacy protection data along the decision tree path. The in-depth mining of privacy protection data in the blockchain is then completed by calculating the decision leaf density.
Privacy protection data extraction and preprocessing in blockchain
Privacy protection data extraction in blockchain
The blockchain holds a large amount of data of many kinds, and it contains privacy protection data that is highly similar and subject to some interference. This study therefore first extracts the privacy protection data in the blockchain. Because blockchain is an underlying technology, its data resides at the bottom layer of the whole system [9]. This paper uses a clustering algorithm to extract the privacy protection data. The clustering algorithm divides all data in the blockchain into different clusters and identifies the privacy protection data by comparing the clusters to which the data belong. Before the clustering algorithm is applied, all data in the blockchain, including general data and privacy protection data, is divided into a number of different clusters. It is assumed that all data sets in the blockchain are:
Specifically,
In formula, A represents clusters,
According to the determined different data clusters, the relationship between the data in each data cluster is expressed as a matrix, that is:
In the formula, n represents the amount of data in a data cluster in the blockchain and q represents the attribute value of the data in that data cluster.
Based on the relationship between the data determined above, the dissimilarity of the data in all clusters is calculated. This calculation gives an initial extraction of the privacy protection data in the blockchain and removes the general blockchain data [11]. All data is arranged in pairs, and the proximity of the two items in each pair is compared. The resulting two-mode matrix [7] is:
In formula,
Data pairs whose proximity differs are taken as the objects of further extraction, and an iterative clustering algorithm is applied to them. In each iteration, the nearest centroid cluster in the data group [3] is selected using the squared Euclidean distance, so that the distance from a sample to the centroid is:
In the formula,
Once this value is determined, the centroid distance of the data in each data group is calculated, and the data that falls within a reasonable range is identified as the privacy protection data in the blockchain. This completes the extraction of privacy protection data in the blockchain.
In summary, the extraction step divides all data in the blockchain into different clusters [6], determines the relationship between the data in each cluster, calculates the dissimilarity of the data in all clusters to preliminarily extract the privacy protection data, calculates the centroid cluster, and completes the extraction by comparing the data with the centroid cluster.
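The extraction steps above can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes numeric feature vectors, uses plain k-means with squared Euclidean distance as the iterative clustering algorithm, and treats the index of the privacy cluster (`private_cluster`) and the "reasonable range" (`max_sq_dist`) as hypothetical inputs, since the source does not specify how they are chosen.

```python
import numpy as np

def pairwise_dissimilarity(X):
    """n x n matrix of Euclidean distances between the rows of the n x q matrix X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def squared_distances(X, centroids):
    """Squared Euclidean distance from every sample to every centroid."""
    diff = X[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=2)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means: assign each sample to its nearest centroid, then update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = squared_distances(X, centroids).argmin(axis=1)
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Recompute labels so they always match the returned centroids.
    return squared_distances(X, centroids).argmin(axis=1), centroids

def extract_near_centroid(X, labels, centroids, private_cluster, max_sq_dist):
    """Indices of samples in the assumed privacy cluster that lie within
    max_sq_dist (squared distance) of that cluster's centroid."""
    d2 = ((X - centroids[labels]) ** 2).sum(axis=1)
    return np.flatnonzero((labels == private_cluster) & (d2 <= max_sq_dist))

# Toy data: two well-separated groups of three records each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
D = pairwise_dissimilarity(X)
labels, centroids = kmeans(X, k=2)
idx = extract_near_centroid(X, labels, centroids, labels[0], max_sq_dist=0.3)
```

In this toy run only the record closest to its cluster centroid passes the threshold; in practice the threshold would have to be tuned to the data.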
Privacy protection data preprocessing in blockchain
To mine the extracted privacy protection data accurately and in depth, the data needs to be preprocessed further. Because of the special position of the data in the blockchain, the privacy protection data suffers from high noise and high dimensionality. If these problems are not handled in time, they will affect the subsequent in-depth mining of the privacy protection data.
Firstly, the privacy protection data extracted above is denoised [4]. Set the centroid vector of privacy protection data in the blockchain as:
And set the centroid vector data set as:
Where e represents the attribute value of the data to be denoised. At this point, calculate the distance between the data in the data set to be denoised, that is:
According to the distances determined between the data in the data set to be denoised, the distances are updated to limit data variation during noise reduction. Set the updated data set to be denoised as:
On this basis, the updated data to be denoised is processed to complete the noise reduction of the data, and the following results are obtained:
Where,
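The denoising step can be sketched in Python as follows. The paper's exact update formula is not reproduced here, so the rule used below is an assumption made purely for illustration: a sample whose distance to the centroid exceeds the mean distance plus `k` standard deviations is treated as noisy and pulled back onto that limit. The function name and the parameter `k` are hypothetical.

```python
import numpy as np

def denoise_by_centroid(X, k=2.0):
    """Shrink samples that lie unusually far from the centroid (assumed rule).

    Returns the denoised data and a boolean mask of the samples that were moved.
    """
    centroid = X.mean(axis=0)
    offsets = X - centroid
    dist = np.linalg.norm(offsets, axis=1)
    limit = dist.mean() + k * dist.std()   # assumed noise threshold
    far = dist > limit
    scale = np.ones_like(dist)
    scale[far] = limit / dist[far]         # pull outliers back onto the limit
    return centroid + offsets * scale[:, None], far

# Ten identical records and one distant, noisy record.
X = np.array([[0.0, 0.0]] * 10 + [[110.0, 0.0]])
Y, far = denoise_by_centroid(X)
```

Clipping rather than deleting the noisy sample keeps the data set size unchanged, which is one way to read the paper's goal of reducing data variation during noise reduction.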
The privacy protection data in the blockchain is not only noisy. Its high dimensionality is also a key factor that hinders normal deep mining of the data. The dimension of the privacy protection data therefore needs to be reduced.
In this preprocessing step, the high-dimensional data set in the privacy protection data is regarded as a universe [10], and the dimension reduction of the data is completed within this universe.
Set the high-dimensional privacy protection data set in the blockchain as:
The universe of the data set is set as:
In the above formula, B represents the set of attributes, V represents the set of attribute values, and F represents the mapping value of
According to the determined universe, the dimension of the high-dimensional data is reduced to obtain:
In formula,
In summary, the preprocessing step sets the centroid vector of the privacy protection data in the blockchain, calculates and updates the distances between the data in the data set to be denoised, denoises the updated data, and then processes the high-dimensional data with the help of the universe to complete the preprocessing of privacy protection data in the blockchain.
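The universe-based dimension reduction can be illustrated with a small rough-set style sketch in Python. The source does not give its reduction rule, so the code assumes a standard one: an attribute is dropped when removing it leaves the partition of the universe into indiscernibility classes unchanged. It also assumes the attribute values have already been discretized. The function names are hypothetical.

```python
def partition(rows, attrs):
    """Indiscernibility classes: group row indices by their values on attrs."""
    groups = {}
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in attrs)
        groups.setdefault(key, []).append(i)
    return sorted(groups.values())

def reduce_attributes(rows, attrs):
    """Greedily drop attributes that do not change the partition of the universe."""
    kept = list(attrs)
    target = partition(rows, kept)
    for a in list(attrs):
        trial = [b for b in kept if b != a]
        if trial and partition(rows, trial) == target:
            kept = trial
    return kept

# Toy universe: attribute "c" is constant, so it carries no information.
rows = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 1, "c": 1},
    {"a": 1, "b": 0, "c": 1},
    {"a": 1, "b": 1, "c": 1},
]
reduced = reduce_attributes(rows, ["a", "b", "c"])
```

The greedy pass depends on attribute order and does not guarantee a minimal reduct; it is meant only to show how a universe with an attribute set B and a value mapping F can drive dimension reduction.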
Privacy protection data mining algorithm in blockchain based on decision tree classification
Decision tree classification is a method that classifies research objects using a tree-shaped structure. The tree consists of a root node, internal nodes and leaf nodes. Starting from the root and extending node by node, the data sets are examined according to the characteristics of the research object at the current node, and the final result is obtained through successive splits [5]. Depending on the splitting criterion, decision trees can be divided into information entropy decision trees and Gini index decision trees. The basic model of decision tree classification is shown in Fig. 1.

Fig. 1. Schematic diagram of decision tree classification model.
In this paper, the decision tree classification method is applied to privacy protection data mining in the blockchain. The privacy protection data obtained above is used as the input of the decision tree classification model and is mined in depth according to its attributes.
Let S denote the set of privacy protection data at this point. Assuming each item of privacy protection data has a class attribute, set the privacy protection data of the M different classes to
In the formula,
Then, given the probability values determined from the attributes of the privacy data, let attribute A take V different values and use it to divide S into V subsets, where the samples in each subset share the same value of A in the decision tree. The information entropy of the privacy protection data in the decision tree is then:
In the formula,
In the formula, I represents the information gain value on the corresponding branch of the decision tree path.
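The entropy and information gain calculations can be sketched in Python as follows. Because the paper's formulas are not reproduced here, the code uses the standard ID3 definitions: the entropy of S over its M classes, and the gain of attribute A as that entropy minus the size-weighted entropy of the V subsets produced by splitting on A. The attribute names and class labels in the example are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy of the whole set minus the weighted entropy after splitting on attr."""
    n = len(labels)
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(y)
    expected = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - expected

# Toy set S: attribute "x" separates the classes perfectly, "y" not at all.
rows = [{"x": 0, "y": 0}, {"x": 0, "y": 1}, {"x": 1, "y": 0}, {"x": 1, "y": 1}]
labels = ["private", "private", "general", "general"]
gain_x = information_gain(rows, labels, "x")
gain_y = information_gain(rows, labels, "y")
```

A tree builder would split on the attribute with the largest gain at each node, here "x", and repeat on the resulting subsets.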
Finally, the deep mining of privacy protection data in the blockchain is carried out through the leaves of the decision tree, which play a key role in decision tree data mining. Because the density of the leaves affects the mining of privacy protection data, this part of the mining is realized mainly through the calculation of leaf density. Suppose the set of privacy protection data in the decision leaf is:
Any vector is: T represents the attribute value of the leaf point.
Based on the attribute value of the privacy protection data in the leaf, the local density of the leaf data can be expressed as:
In formula,
From the local density of the nodes in the decision leaf, the global density is calculated and used for the final mining of privacy data in the blockchain, which gives the following result:
In formula,
In summary, the mining step analyzes the basic principle of the decision tree classification method, calculates the information gain of the privacy protection data starting from the root of the decision tree, and determines the attribute probability of the data. On this basis, the purity of the mined data is improved by determining the information entropy of the privacy protection data along the decision tree path. Finally, the in-depth mining of privacy protection data in the blockchain is completed by calculating the decision leaf density.
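The leaf density calculation can be illustrated with a short Python sketch. The paper's density formulas are not reproduced here, so the definitions below are assumptions: the local density of a point is the number of other points in the same leaf within a hypothetical `radius`, and the global density of the leaf is the mean of the local densities.

```python
import numpy as np

def local_density(points, radius):
    """Number of other points in the leaf within `radius` of each point."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return (dist <= radius).sum(axis=1) - 1   # subtract 1 to exclude the point itself

def global_density(points, radius):
    """Mean local density over all points in the leaf (assumed definition)."""
    return local_density(points, radius).mean()

# Toy leaf: three records close together and one isolated record.
leaf = np.array([[0, 0], [0, 1], [1, 0], [10, 10]], dtype=float)
local = local_density(leaf, radius=1.5)
overall = global_density(leaf, radius=1.5)
```

Under these assumptions a dense leaf contains mutually similar records, so the isolated record with local density 0 would be the natural candidate to exclude from the mined result.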
Experimental scheme
To verify the feasibility of the mining algorithm designed in this paper, an experimental analysis is carried out. The blockchain data in a local network is taken as the research object. The data in the selected sample blockchain includes two types: general data and privacy protection data. The effectiveness of the method is verified by running the designed mining algorithm on this data. The specific experimental parameters are shown in Table 1.
Table 1. Details of data parameters of sample blockchain
According to the experimental environment and experimental parameters determined in Section 3.1, the experimental indicators are the accuracy of privacy data mining in the sample blockchain, the privacy protection data noise, and the mining time. The experiment compares the algorithm in this paper with the algorithms in literature [8] and literature [13]. The experimental results are obtained through multiple calculations and meet the requirements of experimental standards.

Fig. 2. Comparison of accuracy of privacy protection data mining of samples with different algorithms.
Accuracy analysis of sample privacy protection data mining with different algorithms
In the experiment, the algorithm in this paper, the algorithm in literature [8] and the algorithm in literature [13] are compared over repeated runs to analyze the accuracy of sample privacy protection data mining. The experimental results are shown in Fig. 2.
The results in Fig. 2 show that the three algorithms differ in the accuracy of sample privacy protection data mining. The highest accuracy of the proposed algorithm is about 97%, that of the literature [8] algorithm is about 79%, and that of the literature [13] algorithm is about 81%. The mining accuracy of the proposed method is therefore better. This is because the method uses decision tree classification to determine the attributes and information entropy of the privacy protection data, which improves the mining accuracy of the algorithm.
Analysis of privacy preserving data noise in sample mining with different algorithms
In the experiment, the noise processing results of the proposed algorithm, the literature [8] algorithm and the literature [13] algorithm on sample privacy protection data mining are compared over repeated runs. The experimental results are shown in Fig. 3.

Fig. 3. Analysis of privacy protection data noise of sample mining with different algorithms.
The data in Fig. 3 show that the three algorithms differ in their noise processing results for sample privacy protection data mining. The noise processing results of the literature [8] and literature [13] algorithms fluctuate considerably, and their noise suppression is poor. The noise processing results of the proposed algorithm are relatively stable, and the noise is kept within a reasonable range, which verifies the effectiveness of this method.
In the experiment, the time consumed by the proposed algorithm, the literature [8] algorithm and the literature [13] algorithm in sample privacy protection data mining is compared over repeated runs. The experimental results are shown in Fig. 4.

Fig. 4. Time consuming analysis of privacy protection data mining of samples with different algorithms.
The results in Fig. 4 show that the three algorithms differ in the time they take to mine the sample privacy protection data. At 40 iterations, the proposed algorithm takes about 3.8 s, the literature [8] algorithm about 7.8 s, and the literature [13] algorithm about 12.1 s. At 100 iterations, the proposed algorithm takes about 4.2 s, the literature [8] algorithm about 8.3 s, and the literature [13] algorithm about 14 s. The mining time of the proposed method is therefore shorter. This is because the method preprocesses the data before mining, which reduces the time required for data filtering during mining and so improves the mining efficiency of the algorithm.
To solve the problems of large mining error and high data noise in privacy protection data mining methods in blockchain, this paper proposes a privacy protection data mining algorithm in blockchain based on decision tree classification. All data in the blockchain is divided into clusters, the relationship between the data in each cluster is determined, and the dissimilarity of the data in all clusters is calculated to preliminarily extract the privacy protection data, which is then refined by comparing the data with the centroid cluster. The centroid vector of the privacy protection data is set, the distances between the data in the data set to be denoised are calculated and updated, the updated data is denoised, and the high-dimensional data is processed with the help of the universe. The basic principle of the decision tree classification method is then analyzed: the information gain of the privacy protection data is calculated from the root of the decision tree, the attribute probability of the data is determined, the information entropy of the data along the decision tree path is determined, and the density of the decision leaves is calculated to complete the in-depth mining of privacy protection data in the blockchain. Compared with traditional methods, this method has the following advantages:
When the proposed algorithm is used to mine privacy protection data in the blockchain, the highest mining accuracy is about 97%.
The proposed algorithm suppresses the noise in privacy protection data in the blockchain well.
The proposed algorithm takes the shortest time to mine the privacy protection data in the blockchain among the compared algorithms, which improves mining efficiency.
