Abstract
The selection of big data attributes plays a positive role in the development of the network. At present, attribute selection for big data is completed by detecting the attributes of the data, which cannot guarantee the accuracy of the selection. In this paper, a big data attribute selection method based on support vector machine (SVM) is proposed for the distributed network fault diagnosis database. The method mines the big data in the distributed network fault diagnosis database and calculates its attribute weights, according to which attribute classification is completed, so as to complete the selection of big data attributes. Experiments show that the proposed method improves the efficiency of big data attribute selection and has certain practical value.
Keywords
Introduction
The development of computer and Internet technology has brought convenience to people’s work and life, and the rapid development of the network has also brought about the problem of network security [17, 18]. Distributed network fault diagnosis is an effective means to prevent virus intrusion [15, 21]. The speed and accuracy of big data attribute selection in the database directly affect the effect of distributed network fault diagnosis [5, 6]. There are many research methods addressing the real-time nature of distributed network fault diagnosis. Related researchers have proposed a big data attribute selection method for distributed network fault diagnosis databases based on rough sets [19]. This method converts a distributed network fault diagnosis database into a sequence of objects, randomly extracts data objects from the database without replacement, and determines whether the current candidate reduction can distinguish the extracted object from the objects in the current object sequence [20, 26]. If it can, the object is placed in the current object sequence; otherwise, an appropriate attribute is added to the candidate reduction, until all the data objects are distinguished [1, 4]. The final candidate reduction is then taken as the big data reduction of the distributed network fault diagnosis database, which completes the attribute selection. Although this method is fast, its accuracy needs to be improved. It has become a focus of discussion among relevant experts and scholars [28, 30], and as the research has deepened, many research results have been produced [23, 27].
The related literature proposes a method for selecting big data attributes in a distributed network fault diagnosis database based on local time-scale feature extraction of a decision tree. By constructing a distributed network fault diagnosis data transmission model, a decision tree model database for big data attribute selection is established [22]. A signal processing method is then used to extract the big data attributes in the database, according to which the big data attribute selection in the distributed network fault diagnosis database is completed. The drawback of this method is that the selection of attributes is quite slow. Literature [16] proposed a big data attribute selection method for distributed network fault diagnosis databases based on IPv6. By analyzing the bottleneck in the process of distributed network fault diagnosis, a distributed network device driver based on NAPI (New Application Program Interface) optimized for multi-core processors is proposed. It then analyzes the necessity of IP fragmentation reorganization in the IPv6 network behavior management system, the business process of big data attribute selection in the distributed network fault diagnosis database, and the characteristics of IPv6 session flow, and proposes a session flow maintenance scheme based on IPv6 five-tuples. The method searches the session flow to complete the big data attribute selection in the distributed network fault diagnosis database. However, this method has a small range of applications, and for a large distributed network fault diagnosis database it can increase the load of big data attribute selection. In the literature [2], a big data attribute selection method based on depth representation of the distributed network fault diagnosis database is proposed. First, the artificial neural network algorithm is analyzed and implemented. On this basis, the big data attributes are detected through the learned deep features. Then a DRBM expansion structure is adopted to construct the big data attribute detection model, and thus the big data attribute selection in the distributed network fault diagnosis database is completed. But this method is complicated and difficult to generalize.
Aiming at the above problems, a method for selecting big data attributes of the distributed network fault diagnosis database based on support vector machine is proposed. First, the decision tree method is used to mine the big data in the distributed network fault diagnosis database, and the attributes of the big data are obtained. Then, the big data attribute classification is completed through four steps applied to the big data attributes in the distributed network fault diagnosis database: attribute subset generation, subset evaluation, the stop criterion and result validity verification. According to the degree of similarity of the data attribute spaces, the calculation methods of attribute similarity and attribute weight are obtained. The loss function is analyzed to improve the feature selection algorithm for big data attributes and to calculate the weights of the big data attributes. The gradient ascent method is used to solve for the saddle point, and thereby to realize the big data attribute selection in the distributed network fault diagnosis database. The proposed method can effectively improve the accuracy of big data attribute selection in the distributed network fault diagnosis database, reduce the amount of calculation and the energy and time consumption, and has good practical value.
Research on big data attribute selection method in distributed network fault diagnosis database
Big data attribute selection is the basis of the classification, mining and processing of big data in the distributed network fault diagnosis database [3]. With the development and application of big data mining and processing technology [6, 9], the problem of big data attribute selection in the distributed network fault diagnosis database has gradually become a focus of relevant experts and scholars [13].
Collection and analysis of big data attribute selection method in distributed network fault diagnosis database
Collection of big data attribute selection method in distributed network fault diagnosis database
In order to realize the selection of big data attributes in the distributed network fault diagnosis database, it is necessary to mine data in the distributed network fault diagnosis database and then calculate its attributes to realize big data attribute analysis.
There are many methods of data mining. Among them, the decision tree method is the most commonly used method [14, 29]. By using the tree structure to show the result of data mining [7], the method is simple and intuitive, and therefore it is suitable for this paper [25]. The specific process is shown in Fig. 1.

Big data attribute analysis process in distributed network fault diagnosis database.
U is the big data set in distributed network fault diagnosis database, F1 and F2 are two big data attributes on node N of the decision tree. The information gain of F1 is greater than that of F2, so the big data attribute F1 on node N is selected as a classification attribute.
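The choice between F1 and F2 at node N can be illustrated with a minimal sketch. It assumes the usual Shannon-entropy definition of information gain; the records, the attribute values and the `fault` label are hypothetical and are not taken from the paper's data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical records at node N: two candidate attributes F1, F2 and a fault label.
U = [
    {"F1": "high", "F2": "a", "fault": "yes"},
    {"F1": "high", "F2": "b", "fault": "yes"},
    {"F1": "low",  "F2": "a", "fault": "no"},
    {"F1": "low",  "F2": "b", "fault": "no"},
]
gains = {name: information_gain(U, name, "fault") for name in ("F1", "F2")}
selected = max(gains, key=gains.get)  # attribute with the larger information gain
print(gains, selected)
```

In this toy set F1 separates the fault labels perfectly while F2 carries no information, so F1 is selected as the classification attribute.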
Assume E1 and E2 are the information entropy of F1 and F2 respectively, we get
Let M be a line recording the reduction of big data in distributed network fault diagnosis database, which belongs to the range of attribute j on node N. When the record is not reduced, the information entropy of the node attribute can be described as
In Equation (2), m represents the range count of the big data attribute in the given distributed network fault diagnosis database, $n_i$ and $p_i$ are the information entropy of the i-th value segment of the big data attribute in the database [8], and ɛ represents the value segment of a big data attribute in the given database [24].
By reducing the centralized records of the big data in the distributed network fault diagnosis database, we can get the attribute information entropy of the big data attribute node in the new database as follows [11]:
According to Equations (2) and (3), we obtain:
A represents a finite nonempty set of all the attributes of the big data in the database. Let $x = p_j$ and $y = n_j$; subtracting Equation (2) from Equation (3), we get Equation (6)
Depending on the nature of A(x, y), if y = 0, we conclude that A(x, 0) = 0 for x ∈ [1, +∞), and:
In the above formula, A (1, y) ≥ - ɛ - log(1 + y) since
Let
According to ɛ ≤ p + n, and substituting Equations (7) and (8) into Equation (6), we calculate the maximum and minimum value of ΔE as follows:
The minimum ΔE can be calculated if A = 0 and ɛ = p + n
The maximum ΔE can be calculated if A = 1 - log ɛ - log(x + y - 1) and ɛ ≠ 0
Through the above method, the attribute collection of big data in distributed network database based on decision tree is completed.
The attribute selection of big data in the distributed network fault diagnosis database is to select a subset of m attributes from the N big data attributes, such that the m attributes can distinguish each data object in the database as accurately as the original N attributes. In the general distributed network fault diagnosis database [12], the analysis of big data attributes can be divided into four steps: big data attribute subset generation, big data attribute subset evaluation, stop criterion and result validity verification.
Subset generation is a search process that produces subsets of attributes for evaluation. N represents the number of big data attributes in the distributed network fault diagnosis database, so $2^N$ represents the number of all candidate subsets. Even if the number of original dataset attributes N is small, the number of its subsets is substantial. Therefore, it is impractical to complete an exhaustive search of the big data attribute space in the entire distributed network fault diagnosis database.
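The growth of the candidate space can be made concrete with a short sketch. The attribute names are hypothetical; the enumeration uses the standard library only and is feasible only for very small N.

```python
from itertools import combinations

def candidate_subsets(attributes):
    """Yield every non-empty subset of the attribute list (exhaustive search)."""
    for size in range(1, len(attributes) + 1):
        yield from combinations(attributes, size)

attrs = ["a1", "a2", "a3", "a4"]          # hypothetical attribute names
subsets = list(candidate_subsets(attrs))
print(len(subsets))                        # 2**4 - 1 non-empty subsets

# The count doubles with every additional attribute.
for n in (10, 20, 40):
    print(n, 2 ** n)
```

With 40 attributes there are already more than 10^12 candidate subsets, which is why a heuristic search with a stop criterion is used instead of enumeration.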
Rough set theory is a mathematical tool, usually used to characterize data incompleteness and uncertainty, so as to analyze and process imprecise, inconsistent and incomplete information, discover the implied knowledge, and reveal the underlying laws [10, 28]. Rough set theory is applied to the selection of big data attributes in the distributed network fault diagnosis database, and the specific process is shown below.
In database I = 〈U, A, V, f〉, U represents a finite nonempty set of big data in the database, U = {1, 2, ⋯, n}, where n indicates the number of big data objects. A represents a finite nonempty set of all the attributes of the big data in the database, and A = C ∪ D, wherein C and D respectively represent the set of conditional attributes and the set of decision attributes. C = {a1, a2, ⋯, am}, where m indicates the number of conditional attributes. Let D = {d}, where d represents the decision attribute. V represents the collection of all possible values for all attributes of big data in the distributed network fault diagnosis database, V = ∪ vi. f represents a distributed network fault diagnosis database.
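As a rough illustration of such a database, the sketch below stores a small decision table and groups the objects that cannot be told apart by a chosen set of condition attributes. The table contents are hypothetical, and `f` is read here as the usual rough-set information function that returns the value of an attribute for an object, which is an assumption about the notation.

```python
from collections import defaultdict

# Hypothetical decision table: objects U = {1..4}, condition attributes C, decision d.
table = {
    1: {"a1": 0, "a2": 1, "d": "fault"},
    2: {"a1": 0, "a2": 1, "d": "fault"},
    3: {"a1": 1, "a2": 0, "d": "ok"},
    4: {"a1": 1, "a2": 1, "d": "ok"},
}
C = ["a1", "a2"]

def f(obj, attr):
    """Assumed information function: value of attribute `attr` for object `obj`."""
    return table[obj][attr]

def partition(objects, attrs):
    """Group objects that take identical values on every attribute in `attrs`."""
    classes = defaultdict(set)
    for x in objects:
        classes[tuple(f(x, a) for a in attrs)].add(x)
    return sorted(classes.values(), key=min)

print(partition(table, C))        # objects 1 and 2 are indistinguishable on C
print(partition(table, ["a1"]))   # a1 alone already separates the two decisions
```

In this toy table the single attribute a1 distinguishes the decision classes as well as the full set C does, which is the kind of reduction that attribute selection looks for.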
Assume R represents a binary relation on U that is reflexive, symmetric and transitive, that is, an equivalence relation. x, y are any objects in U, and R(x) is the set of objects y satisfying xRy. The relation meets the requirements ∀x ∈ U, xRx; ∀x, y ∈ U, xRy ⇒ yRx; and ∀x, y, z ∈ U, xRy and yRz ⇒ xRz. The big data attribute selection error cloud formula is expressed as
In the above function, E represents the sum of squared errors for all big data attributes, p represents an object of the dataset, $o_i$ is the mean of class $C_i$, $C_i$ is the distributed network fault diagnosis database, and $N_i$ indicates the number of data objects in class $C_i$. Use formula (12) to calculate the distance from each p in the data set to the k cluster centers:
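The two quantities just described, the sum of squared errors E and the distance from each object p to the cluster centers, can be sketched as follows. The Euclidean distance is an assumption, since the exact form of formula (12) is not reproduced here, and the sample points are hypothetical.

```python
import math

def distance(p, center):
    """Euclidean distance between object p and a cluster center (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, center)))

def mean(points):
    """Component-wise mean o_i of the objects in one class C_i."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def sum_squared_error(classes):
    """E: squared distance of every object p to the mean of its own class, summed."""
    return sum(distance(p, mean(c)) ** 2 for c in classes for p in c)

# Two hypothetical classes of two-dimensional objects.
classes = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (10.0, 12.0)]]
centers = [mean(c) for c in classes]
E = sum_squared_error(classes)
# Index of the nearest center for every object, in input order.
nearest = [min(range(len(centers)), key=lambda i: distance(p, centers[i]))
           for c in classes for p in c]
print(centers, E, nearest)
```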
Then the attribute extraction is completed through the attribute similarity. $H_i, H_j \in R^D$ are two object spaces, where $R^D$ represents a distributed network fault diagnosis database, $d(H_i, H_j)$ represents the distance between the two object spaces, and $d(H_{ik}, H_{jk})$ represents the spatial distance of the k-th dimension of the two object spaces.
According to the calculation, the above formula is converted into:
In the above equation, $E_{X_{ik}}$ and $E_{X_{jk}}$ respectively represent the spatial centers of the k-th dimension of the two object spaces, $\sigma_{ik}$ and $\sigma_{jk}$ represent the entropy of the two attribute spaces, and
Assuming $s(H_{ik}, H_{jk})$ is the similarity of the k-th dimension of the two object spaces, we get
Let the distance of m (m > 2) D-dimensional object spaces be
The distance of its k -dimensional attribute space is
For the classification of several object spaces with D attributes, the attribute with the smallest degree of similarity contributes most to the classification, so the similarity of an attribute can be used to calculate the weight of that attribute in the classification [31, 32]. From the classification of each attribute and the similarity of the data object spaces, we obtain the calculation methods of attribute similarity and attribute weight.
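The relation between similarity and weight can be illustrated with a small sketch. The weighting rule used here, weight proportional to one minus the similarity, is an assumption chosen only to reproduce the stated ordering (smallest similarity, largest weight); it is not the paper's formula, and the similarity values are hypothetical.

```python
def attribute_weights(similarities):
    """Map per-attribute similarities in [0, 1] to weights that sum to 1.

    Assumed rule (not from the paper): weight is proportional to 1 - similarity,
    so the least similar attribute receives the largest weight.
    """
    dissimilarity = [1.0 - s for s in similarities]
    total = sum(dissimilarity)
    if total == 0:
        # All attributes are fully similar: fall back to equal weights.
        return [1.0 / len(similarities)] * len(similarities)
    return [d / total for d in dissimilarity]

s = [0.9, 0.5, 0.6]           # hypothetical similarities of three attributes
w = attribute_weights(s)
print(w)
```

Here the second attribute, which has the smallest similarity, receives the largest weight.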
Consider an m-class classification problem with n training samples. The i-th data attribute $x_i$ ($i = 1, 2, \cdots, n$) is a p-dimensional vector $x_i = (x_{i1}, x_{i2}, \cdots, x_{ip})$. $y_i$ indicates its corresponding category mark, and $y_i \in \{1, 2, \cdots, m\}$. Let $f(x) = (f_1(x), \cdots, f_m(x))$ represent the decision vector of big data attributes, wherein $f_c(x)$, $c = 1, 2, \cdots, m$, represents the attribute discriminant function, and $f_c(x) = x\beta_c + \beta_{0c}$. $\beta_c$ represents the weight of the c-class data of the distributed network fault diagnosis database, and $\beta_{0c}$ represents the offset of the c-class data. The final attribute discrimination rule is
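The discriminant functions can be sketched in a few lines. The decision rule used below, choosing the class with the largest discriminant value, is an assumption that follows the usual convention for multiclass SVMs; the weights, offsets and input vector are hypothetical.

```python
def decision_vector(x, betas, offsets):
    """f(x) = (f_1(x), ..., f_m(x)) with f_c(x) = x . beta_c + beta_0c."""
    return [sum(xi * bi for xi, bi in zip(x, beta_c)) + beta_0c
            for beta_c, beta_0c in zip(betas, offsets)]

def predict(x, betas, offsets):
    """Assumed rule: pick the class (1-based) whose discriminant value is largest."""
    scores = decision_vector(x, betas, offsets)
    return scores.index(max(scores)) + 1

betas = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # hypothetical beta_c, m = 3, p = 2
offsets = [0.0, 0.5, -0.5]                        # hypothetical beta_0c
x = [2.0, 1.0]
print(decision_vector(x, betas, offsets), predict(x, betas, offsets))
```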
According to the above discussion, the loss function of the big data attribute in the distributed network fault diagnosis database is:
In the above equation, $(f(x_i) - y_i)_+ = ((f_1(x_i) - y_{i1})_+, \cdots, (f_m(x_i) - y_{im})_+)$. The class mark $y_i$ is encoded as $y_i = (y_{i1}, \cdots, y_{im})$, an m-dimensional vector. Assuming that the corresponding big data attribute diagnostic type is j, the j-th component of $y_i$ is 1, and the remaining components are $-1/(m - 1)$. $L(y_i)$ is also an m-dimensional vector, with 0 as its j-th component and 1 as its remaining components. For better representation, the expression of the loss function can be rewritten as formula (19)
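The class encoding and the truncated difference $(\cdot)_+$ can be illustrated with a minimal sketch. The way the two vectors are combined, as the inner product of $L(y_i)$ with $(f(x_i) - y_i)_+$, is an assumption modeled on the standard multicategory SVM loss; the regularization term and the averaging over samples are omitted.

```python
def encode_label(j, m):
    """y_i: 1 in position j (1-based), -1/(m-1) elsewhere."""
    return [1.0 if c == j else -1.0 / (m - 1) for c in range(1, m + 1)]

def cost_vector(j, m):
    """L(y_i): 0 in position j, 1 elsewhere."""
    return [0.0 if c == j else 1.0 for c in range(1, m + 1)]

def sample_loss(f_x, j):
    """Assumed combination: L(y_i) . (f(x_i) - y_i)_+ for one sample of class j."""
    m = len(f_x)
    y = encode_label(j, m)
    cost = cost_vector(j, m)
    return sum(c * max(f - t, 0.0) for c, f, t in zip(cost, f_x, y))

print(encode_label(2, 3), cost_vector(2, 3))
# A decision vector equal to the encoding of the true class incurs no loss,
# while one that favors a wrong class is penalized.
print(sample_loss([-0.5, 1.0, -0.5], 2), sample_loss([1.0, -0.5, -0.5], 2))
```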
In order to ensure that each attribute belongs to only one category, $f_c(x)$ needs to meet the condition:
Since the above condition is satisfied for any of the data attributes x, it can be converted into
Combining formula (20) and formula (21), the SVM-based supervised big data attribute feature selection algorithm is equivalent to the following optimization problem:
Assume that
In the above function, M represents a positive number, and j and k represent the features of the big data attribute in the distributed network fault diagnosis database. The normalization of big data attribute eigenvalues is:
Let
In the above equation, $a_c = (a_{ci}, \cdots, a_{cn_c})$, and $1_c$ represents the $n_c$-dimensional vector with all components equal to 1. The Lagrangian function of the above equation is
In the formula, $u_{ic}$, $\mu_{2c}$, $\mu_3$, and $\mu_4$ represent the big data attribute parameters in the distributed network fault diagnosis database. One saddle point
In order to solve for the saddle point, the gradient ascent method is used to solve the dual problem:
Let
For the above equation, multiple iterations are required until the convergence condition is satisfied, and finally the solution $(\beta_c, \beta_{0c}, a_c, t_c)$ is obtained. Because the objective function of the big data attribute problem in the distributed network fault diagnosis database is differentiable, it can be transformed into a linear system of equations:
In the above formula, I represents the p × p identity matrix, and
In the above discussion, the big data attribute selection algorithm is improved by analyzing the loss function, the weights of the big data attributes are calculated, and the saddle point is solved by the gradient ascent method, so as to realize the selection of big data attributes in the distributed network fault diagnosis database based on support vector machine. The process is shown in Fig. 2.

Big data attribute selection process in distributed network fault diagnosis database.
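The iterative solution of a saddle point by gradient ascent on the dual problem, as used in the method above, can be illustrated on a deliberately small example. The toy problem (minimize x² subject to x ≥ 1), the step size and the tolerance are all assumptions chosen for illustration; this is not the paper's objective function.

```python
def dual_gradient_ascent(step=0.5, tol=1e-10, max_iter=10_000):
    """Find the saddle point of L(x, mu) = x**2 - mu * (x - 1) with mu >= 0.

    Toy problem (min x**2 subject to x >= 1) chosen for illustration only.
    """
    mu = 0.0
    for _ in range(max_iter):
        x = mu / 2.0                            # minimizer of L over x for fixed mu
        grad = 1.0 - x                          # derivative of the dual function at mu
        new_mu = max(0.0, mu + step * grad)     # ascent step, projected onto mu >= 0
        converged = abs(new_mu - mu) < tol      # convergence condition
        mu = new_mu
        if converged:
            break
    return mu / 2.0, mu

x_star, mu_star = dual_gradient_ascent()
print(x_star, mu_star)
```

The iterates approach the saddle point x = 1, μ = 2, where the constraint is active and the multiplier is positive.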
Experimental environment
In order to verify the validity of the big data attribute selection method in the distributed network fault diagnosis database based on support vector machine, a simulation experiment is performed on the MATLAB 2008a platform with an Intel P4 2 G processor. The main frequency is Dual P2.16, the memory is 3 GB, the hard disk is 250 GB, and the operating system is Windows 7.
Results and analysis
The experiment selects three data sets from the network database, analyzes the data attributes of the three experimental data sets, and compares the time consumption on each of them. On the first data set, the method proposed in this paper is compared with the data attribute selection methods proposed in [7] and [8], and the comparison result is shown below.
First, the time consumed (min) by the three methods for big data attribute selection, calculated through formula (35), is compared.
In the above equation, σ is the response time parameter when the big data attribute is selected, and the average response time of the big data attribute selection is obtained according to the above three methods. The comparison results are shown in Table 1.
Time-consuming comparison of three methods
According to formula (36), the average time consumption of the three methods on the second data set is calculated. To ensure the accuracy of the experiment, 500 experiments were carried out, with every 50 experiments forming one group of data, so as to complete the average time calculation. The time unit is seconds (s), and the formula is:
In the above formula, T′ is the time spent in other work in the experiment. The average time of three big data attribute selection methods was calculated. The comparison results are shown in Table 2.
Average time-consuming comparison of three methods for large data attribute selection
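The grouping of 500 runs into sets of 50 can be sketched as below. The exact form of formula (36) is not reproduced here, so the computation, the mean of the measured time minus the overhead T′ within each group, is an assumption, and the measurements are hypothetical.

```python
def batch_average_times(times, overhead, batch_size=50):
    """Average net time per batch, where net time = measured time - overhead T'."""
    if len(times) % batch_size != 0:
        raise ValueError("number of runs must be a multiple of batch_size")
    averages = []
    for start in range(0, len(times), batch_size):
        batch = times[start:start + batch_size]
        averages.append(sum(t - overhead for t in batch) / batch_size)
    return averages

# Hypothetical measurements: 500 runs of 16.0 s each, including 1.0 s of other work.
runs = [16.0] * 500
avg = batch_average_times(runs, overhead=1.0)
print(len(avg), avg[0])
```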
On the third data set, the average time consumption of the three methods is compared, and the results are shown in Fig. 3. The dots on the straight line in Fig. 3 represent the theoretical time consumption of the three methods. As the number of experiments increases, the time consumed by the method in this paper remains below 16 s, that of the method in reference [7] is greater than 22 s, and that of the method in reference [8] is between 17 and 22 s. It can be seen that the difference between the actual and theoretical time consumption of the proposed method is smaller than that of the methods in [7] and [8]. The average time-consumption polyline of the proposed method is close to a straight line with small fluctuation, which indicates that the proposed method is stable in big data attribute selection.

The average time-consuming comparison of the three methods.
Next, the energy consumption of the three methods is compared, taking N as the unit of energy consumption:
According to the above formula, the energy consumption of the three selection methods is compared, and the results are shown in Fig. 4.
It can be seen from Fig. 4 that as the running time increases, the energy consumption of the three methods also increases, and the energy consumption of the proposed method in big data attribute selection fluctuates less than that of the methods in [7] and [8]. When the running time is 50 h, the energy consumption of the method in this paper is 227 N, that of the method in reference [7] is 359 N, and that of the method in reference [8] is 562 N. The proposed method can therefore effectively reduce the energy consumption of the big data attribute selection process and is stable in big data attribute selection.

The energy consumption comparison results of the three methods.
Finally, the three methods are used to select the data attributes of seven arbitrary databases in the network, and the accuracy of the three methods in the process of selecting big data attributes is compared. Since the number of attributes in each database is large, "accuracy" denotes the number of big data attributes that are selected correctly, and "error" denotes the number of big data attributes that are not selected correctly. The results are shown in Table 3.
The accuracy of the three methods for big data attribute selection
The accuracy ratio is the ratio of the number of correctly selected attributes to the total quantity. The error rate is the ratio of the number of errors to the total quantity. The formulas are given in (38).
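The two ratios can be computed directly from the counts. The sketch assumes that the total quantity equals the number of correct selections plus the number of errors; the counts are hypothetical and are not taken from Table 3.

```python
def accuracy_and_error_rate(correct, incorrect):
    """Return (accuracy ratio, error rate) from counts of selected attributes."""
    total = correct + incorrect          # assumed definition of the total quantity
    if total == 0:
        raise ValueError("no attributes were evaluated")
    return correct / total, incorrect / total

acc, err = accuracy_and_error_rate(correct=180, incorrect=20)
print(acc, err)
```

Under this assumption the two ratios always sum to 1, so the method with the highest accuracy ratio also has the lowest error rate.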

Comparison of accuracy of three methods.

Comparison of error rate of three methods.
Through Figs. 5 and 6 we can see that the method proposed in this article has the highest accuracy and the lowest error rate.
In summary, the method proposed in this paper can effectively reduce the energy consumption and cost of big data attribute selection in the distributed network fault diagnosis database, improve the efficiency of big data attribute selection, ensure the real-time performance of network fault diagnosis, and has great practical value.
Reducing the time consumed by big data attribute selection can improve the fault diagnosis capability of distributed networks. The big data attribute selection method based on support vector machine proposed in this paper can effectively reduce the energy consumption of the selection process, and its energy consumption fluctuates less than that of traditional methods. The time consumed by the method is less than 16 s, its average time-consumption polyline is close to a straight line with small fluctuation, and it has the lowest error rate among the compared methods. These results indicate that the proposed method is stable in big data attribute selection and has good practical value. In future work, we will work to further reduce the error rate, improve the stability of the proposed method, and try to apply the proposed method to other types of data selection.
