Abstract
To overcome the low classification accuracy of traditional methods, this paper proposes a new classification method for complex attribute big data based on an iterative fuzzy clustering algorithm. First, principal component analysis and kernel local Fisher discriminant analysis are used to reduce the dimensionality of complex attribute big data. Then, the Bloom Filter data structure is introduced to eliminate redundancy in the dimension-reduced data. Next, the de-redundant complex attribute big data is classified in parallel by the iterative fuzzy clustering algorithm, which completes the classification. Finally, simulation results show that the accuracy, the normalized mutual information index and the Richter’s index of the proposed method are all close to 1 and that the RDV value is low, which indicates that the proposed method has high classification effectiveness and fast convergence speed.
Introduction
Scientific research continues to advance, and information technology and Internet technology are maturing in many fields. The development of emerging technologies not only drives the progress of the industry itself, but also expands the market demand for software and hardware [6, 23]. At the same time, a variety of big data with complex attributes is emerging in the information age. Such data is complex in volume, type and dimension, which makes it difficult to classify and has attracted extensive attention from scholars. Therefore, it is of great significance to study a new classification method for complex attribute big data.
Sun et al. proposed a classification method for complex attribute big data based on a multi-stream multi-task network. The method uses crawler technology to collect complex attribute data, constructs a new loss function to balance the relationship between different attributes, builds a multi-stream multi-task network (MSMT) that combines multi-scale features with multiple attribute classification streams, and uses this network to classify complex attribute data. However, this method has low classification accuracy and falls short of practical application requirements [21]. Boulle et al. proposed a classification method for complex attribute big data based on a Bayesian classifier. The method first observes that current relational databases mainly consist of mixed, numerical and categorical data, and that relational data is characterized by one-to-many relationships. A stable Bayesian classifier is then built according to the relationships in the relational data in order to classify complex attribute data in a relational database. However, this method converges slowly and falls short of the ideal application effect [4]. Liao et al. proposed a complex attribute data classification modeling method based on SVM-NP. The method addresses the problem that traditional classifiers often ignore the overlap of sample attributes during classification, which lowers classifier accuracy. The SVM-BP algorithm is improved on the basis of the SVM-RFE algorithm, and boundary samples with a high degree of overlap are divided into multiple planes for training and testing. Compared with the traditional method, this model reduces the overlap of sample attributes twice. Theoretical analysis and experimental results show that the model has good classification performance under appropriate parameters, but the method converges slowly and falls short of the ideal application effect [14].
Complex attribute big data is large in volume and highly redundant, and these characteristics adversely affect classification results. To solve the above problems of existing methods and to improve the classification accuracy and convergence speed for complex attribute big data, this paper introduces an iterative fuzzy clustering algorithm and designs a new classification method for complex attribute big data. The method is intended to offer high classification accuracy, high classification effectiveness and fast convergence speed. The overall technical route of this method is as follows:
(1) Principal component analysis and kernel local Fisher discriminant analysis are used to reduce the dimensionality of complex attribute big data and to bring the dimensions to the same level. The Bloom Filter data structure is then introduced to eliminate redundancy in the dimension-reduced data and to improve data quality.
(2) The de-redundant complex attribute big data is classified in parallel by the iterative fuzzy clustering algorithm, which completes the classification of complex attribute big data.
(3) The classification accuracy, normalized mutual information index, Richter’s index and RDV value of different methods are compared through experiments.
Big data classification of complex attributes
Dimensionality reduction processing for big data of complex attributes
Network data contains a large amount of complex attribute data. Before massive complex attribute data is classified, it needs to be prefetched in real time [1, 7] to reduce the computational overhead of classification, so the dimensionality reduction method must be efficient. To further improve the efficiency of dimensionality reduction, the data is normalized so that the data length is smaller than the maximum data length that a Map task can process.
The massive data in the network involves many tasks. Assume that one of the inputs is the data from a Map task, and use a set threshold to trigger the prefetching of complex attribute big data. The threshold formula of the prefetching operation is as follows:
In formula (1), F represents the number of completed map tasks; A represents the total number of map tasks.
When complex attribute big data is prefetched, many nodes may prefetch the same output results and compete for resources, which increases data processing time. A conflict detection method is therefore used for massive data prefetching: when a task is prefetching data and another task starts to prefetch the same data, that data is discarded [13, 15] and data on other nodes is prefetched instead. When all nodes are occupied, the prefetching operation waits for a fixed time. The waiting time formula is as follows:
In the above formula,
Network data is massive, complex and high-dimensional, so the feature extraction stage of a classification algorithm for complex attribute big data must also reduce dimensionality. For complex attribute data, traditional dimensionality reduction methods cannot determine the target dimension required for each kind of data. Because complex attribute data is large and costly to process, a feature extraction method with high computational efficiency is needed [22].
The incremental orthogonal component analysis (IOCA) method is selected to reduce the dimension of network data and extract features, so as to reduce the time complexity of massive network data classification. The IOCA method does not need a fixed target dimension; the target dimension can be adjusted according to changes in the input data during learning [5]. Preprocessing massive data with this method yields good orthogonal components, avoids data redundancy and compresses the dimension effectively.
The IOCA method can obtain the orthogonal component space
(1) Let
(2) The adaptive threshold is set to judge whether
The specific calculation process is as follows:
(1) The initial dimension
(2) Use
(3) The eigenvector is represented by
(4) Calculate
(5) Calculation
(6) Calculate
(7) K is the original data dimension. When
(8) Let
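The idea of growing the target dimension with the input can be illustrated by a minimal sketch. It is not the IOCA procedure of reference [5]: it assumes a plain incremental Gram-Schmidt scheme, and the fixed ratio `threshold` is a hypothetical stand-in for the adaptive threshold mentioned above.

```python
import math

def incremental_orthogonal_basis(samples, threshold=0.1):
    """Grow an orthonormal basis one sample at a time.

    A new component is added only when the part of a sample that the current
    basis cannot represent (the residual) is large relative to the sample.
    `threshold` is an assumed fixed ratio, not the paper's adaptive threshold.
    """
    basis = []
    for x in samples:
        residual = list(x)
        for b in basis:
            # Remove the projection of the residual onto each existing component.
            coeff = sum(r * bi for r, bi in zip(residual, b))
            residual = [r - coeff * bi for r, bi in zip(residual, b)]
        res_norm = math.sqrt(sum(r * r for r in residual))
        x_norm = math.sqrt(sum(v * v for v in x))
        if x_norm > 0 and res_norm / x_norm > threshold:
            basis.append([r / res_norm for r in residual])
    return basis

def project(x, basis):
    """Coordinates of x in the learned orthogonal component space."""
    return [sum(xi * bi for xi, bi in zip(x, b)) for b in basis]

# Three-dimensional samples that actually lie in a two-dimensional subspace.
data = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [3.0, 2.0, 0.0]]
basis = incremental_orthogonal_basis(data)
print(len(basis))                        # 2
print(project([3.0, 2.0, 0.0], basis))   # [3.0, 2.0]
```

In this sketch the number of components is determined by the data rather than fixed in advance, which is the property the text attributes to IOCA.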
The dimensions of complex attribute big data are diverse, and these differences make the data harder to process. Therefore, this paper combines principal component analysis with kernel local Fisher discriminant analysis to reduce the dimension of complex attribute big data and bring its dimensions to the same level [9, 16].
The original data set of complex attribute big data is set as X, the mathematical model of dimension reduction method is
(1) Principal component analysis is used to reduce complex attribute big data from u dimension to n dimension, and the linear mapping method is
(2) The kernel function
(3) Using kernel local Fisher discriminant analysis method, the high-dimensional data in the dimension space K of complex attribute big data is reduced to v dimension, and a complex attribute big data matrix Y with significant characteristics is obtained. At this time, the linear mapping method is
The detailed process of the above steps is as follows:
(1) The data set Z of complex attribute big data is constructed;
(2) In order to prevent the interference of high-dimensional data features on low dimensional data features, Z is normalized to obtain
(3) Principal component analysis is performed on
(4) The kernel parameter β of Gaussian kernel is set, and
(5) The kernel local Fisher discriminant analysis method is used to train the high-dimensional complex attribute big data
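Steps (2) to (4) can be sketched in plain Python under several assumptions. The normalization is taken to be a column-wise z-score, principal component analysis is reduced to extracting only the leading axis by power iteration, and the Gaussian kernel is written in one common form, exp(-||x - y||^2 / (2β^2)); the paper's exact parameterization is not given here. The kernel local Fisher discriminant step (5) is omitted, and all function names are hypothetical.

```python
import math

def zscore(rows):
    """Normalize each column to zero mean and unit variance (step 2)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def first_principal_axis(rows, iters=200):
    """Leading eigenvector of the covariance of centered rows (step 3)."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
           for i in range(d)]
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def gaussian_kernel_matrix(rows, beta=1.0):
    """Gaussian kernel matrix with an assumed width parameter beta (step 4)."""
    def k(x, y):
        sq = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-sq / (2.0 * beta ** 2))
    return [[k(x, y) for y in rows] for x in rows]

Z = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
Z_norm = zscore(Z)
axis = first_principal_axis(Z_norm)
# Project every normalized sample onto the leading principal axis.
scores = [[sum(v * a for v, a in zip(row, axis))] for row in Z_norm]
K = gaussian_kernel_matrix(scores, beta=1.0)
print(axis)
print(K[0])
```

For real workloads a numerical library would normally replace the hand-written linear algebra; the pure-Python form is used only to keep the sketch self-contained.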
Data elimination algorithm based on Bloom filter data structure
The complexity of big data comes not only from its structure but also from its content, which includes manually entered data and therefore often contains duplicates. Duplicate data adversely affects the classification of complex attribute big data. Based on the dimensionality reduction results, a data redundancy elimination algorithm based on the Bloom filter data structure is therefore designed. Repeated and redundant data is removed to improve data quality and thus the classification efficiency for complex attribute big data [19].
The Bloom filter is a highly efficient data structure. It can describe the feature values of complex attribute big data and represent them with a group of hash mapping functions and an array [3]. Compared with other de-redundancy algorithms, the Bloom Filter algorithm is better suited to de-redundancy processing of complex attribute big data.
The random data structure of Bloom filter contains a variety of hash function mapping to compress the parameter space. The dimension reduced complex attribute big data X is described as U vector. Bloom filter algorithm can describe the subordinate attributes of a complex attribute big data element and a complex attribute big data set with high precision. Using Bloom filter structure to calculate hash value needs to use a consistent number of hash functions
To query a string, the same h hash functions are applied to it, and the corresponding positions of the n-bit array are checked. If any of these bits is 0, the string is definitely not in the complex attribute big data set. If all of them are 1, the string is considered to be in the set, subject to a small false-positive probability.
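The insertion and query behaviour of a Bloom filter can be sketched as follows. This is a generic textbook Bloom filter, not the paper's implementation; deriving the h hash functions by salting SHA-256 with an index is an assumption made only for illustration.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over an n-bit array with h salted hash functions."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = [0] * n_bits  # every position starts at 0

    def _positions(self, item):
        # Assumed construction: salt SHA-256 with the index of the hash function.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Any 0 bit proves absence; all 1 bits means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(n_bits=1024, n_hashes=3)
bf.add("record-42")
print("record-42" in bf)                     # True
print("anything" in BloomFilter(1024, 3))    # False: an empty filter holds nothing
```

A positive answer can be a false positive, whereas a negative answer is always correct, which is why the filter is suitable for screening candidate duplicates.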
Initially, the Bloom filter is an array with n positions, and every position is set to 0. When the Bloom filter describes a set of m data elements, the data element set is L,
In the complex attribute big data set, a contiguous subsequence of a data segment M is defined as a shingle. All contiguous subsequences of size c in the data segment M form a set, and the shingle set is
(1) After dimension reduction, the structure of complex attribute big data X is described as
(2) The mapping functions are set to two hash functions, Hash1 and Hash2;
(3) The Hash1 and Hash2 functions are used to calculate the value of each shingle, and the corresponding bit of
(4) The eigenvalue of complex attribute big data is the output
The above four steps turn the comparison of multi-dimensional complex attribute big data into a similarity computation over multiple Bloom filters. When two pieces of complex attribute big data are highly similar, their Bloom filters share many positions set to 1, and the data is treated as redundant [11, 18, 20].
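The shingle-based signature described in the four steps can be sketched as follows. The shingle size `c`, the array length `n_bits` and the way Hash1 and Hash2 are derived (SHA-256 with two different salts) are assumptions for illustration, since the paper's concrete hash functions are not specified here.

```python
import hashlib

def shingles(segment, c):
    """All contiguous subsequences of length c in the data segment."""
    return {segment[i:i + c] for i in range(len(segment) - c + 1)}

def _salted_hash(text, salt, n_bits):
    # Assumed stand-in for Hash1 / Hash2: SHA-256 with a distinguishing salt.
    digest = hashlib.sha256((salt + text).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits

def signature(segment, c=3, n_bits=64):
    """Bloom filter bit array that serves as the feature value of the segment."""
    bits = [0] * n_bits
    for s in shingles(segment, c):
        bits[_salted_hash(s, "hash1:", n_bits)] = 1
        bits[_salted_hash(s, "hash2:", n_bits)] = 1
    return bits

print(sorted(shingles("abcd", 3)))   # ['abc', 'bcd']
sig = signature("abcd")
print(sum(sig))                      # number of bits set by the two shingles
```

Segments that share most of their shingles set mostly the same bits, so their signatures can be compared bit by bit.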
Hamming distance is used to judge the similarity of complex attribute big data. The Hamming distance is the number of positions at which two binary sequences differ [17]. The similarity is calculated as follows:
Among them, p and q represent different dimensions of complex attribute big data in turn;
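A small worked example makes the Hamming comparison concrete. The Hamming distance itself is standard; converting it to a similarity as one minus the normalized distance is an assumption, because the paper's similarity formula is not reproduced in this text.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length bit sequences differ."""
    if len(a) != len(b):
        raise ValueError("bit sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def similarity(a, b):
    """Assumed similarity: 1 minus the Hamming distance divided by the length."""
    return 1.0 - hamming_distance(a, b) / len(a)

p = [1, 0, 1, 1, 0, 0, 1, 0]
q = [1, 0, 0, 1, 0, 1, 1, 0]
print(hamming_distance(p, q))  # 2
print(similarity(p, q))        # 0.75
```

Here the two sequences differ at the third and sixth positions, so two of eight bits disagree.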
Choosing an appropriate number of hash functions can improve the redundancy elimination accuracy for complex attribute big data. The optimal number of hash functions m is set as follows:
where w is the probability that an array position is 0 and H represents the data dimension.
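For orientation, the standard textbook relations for a Bloom filter can be evaluated numerically. These are not the paper's formula in terms of w and H, which is not reproduced here; the sketch assumes the usual result that the number of hash functions is best set to (bits / items) · ln 2, and all parameter names are hypothetical.

```python
import math

def optimal_hash_count(n_bits, n_items):
    """Textbook optimum: (bits per item) * ln 2, rounded to an integer >= 1."""
    return max(1, round(n_bits / n_items * math.log(2)))

def zero_bit_probability(n_bits, n_items, n_hashes):
    """Probability that a given array position is still 0 after all insertions."""
    return (1.0 - 1.0 / n_bits) ** (n_hashes * n_items)

def false_positive_rate(n_bits, n_items, n_hashes):
    """Probability that all probed positions of an absent item are 1."""
    return (1.0 - zero_bit_probability(n_bits, n_items, n_hashes)) ** n_hashes

k = optimal_hash_count(n_bits=9600, n_items=1000)
print(k)  # 7
print(zero_bit_probability(9600, 1000, k))
print(false_positive_rate(9600, 1000, k))
```

At the optimum, roughly half of the array positions remain 0, which is consistent with describing the choice in terms of the probability that a position is 0.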
The specific implementation steps of the data elimination algorithm are as follows:
(1) The hash table is initialized, the complex attribute big data of a single dimension is set as a single data file, the redundant data in the complex attribute big data is mined, and the hash value of the complex attribute big data file is calculated;
(2) The complex attribute big data block is input into the complex attribute big data stream, and the bloom filter data structure is established and initialized;
(3) The hash function without correlation is selected to map the data elements in the reduced complex attribute big data set to interval
(4) Because there are data file extraction attributes in the selected complex attribute big data set, the hash value of each complex attribute big data block can be included in the complex attribute big data sequence as an eigenvalue, and its corresponding
(5) Set the eigenvalue of complex attribute big data file to the output
(6) Calculate the similarity of big data with complex attributes, judge whether there is redundant data, and if there is, eliminate it [2].
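The elimination steps can be sketched as a single pass over the data. The sketch assumes that each record already has a bit signature (for example a Bloom filter array), that exact duplicates are caught with a hash table of SHA-256 digests, and that near duplicates are those whose signature similarity reaches a hypothetical `threshold`; none of these choices is specified by the paper.

```python
import hashlib

def eliminate_redundant(records, signatures, threshold=0.8):
    """Keep the first copy of each record and drop exact or near duplicates."""
    seen_digests = set()   # hash table of record digests, as in step (1)
    kept = []              # indices of records that survive
    for idx, (record, sig) in enumerate(zip(records, signatures)):
        digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue       # exact duplicate
        seen_digests.add(digest)
        is_redundant = any(
            1.0 - sum(a != b for a, b in zip(sig, signatures[k])) / len(sig)
            >= threshold
            for k in kept)
        if not is_redundant:
            kept.append(idx)
    return [records[i] for i in kept]

records = ["alpha", "alpha", "alpha!", "beta"]
signatures = [
    [1, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 1, 1],   # one differing bit: similarity 0.875
    [0, 0, 1, 1, 0, 1, 0, 1],   # every bit differs: similarity 0.0
]
print(eliminate_redundant(records, signatures))  # ['alpha', 'beta']
```

Comparing each record against all kept records is quadratic in the worst case, so a production system would need an index or blocking strategy on top of this sketch.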
Big data classification of complex attributes based on iterative fuzzy clustering algorithm
Complex attribute big data has massive features, abundant label information and a complex data structure. By establishing multi-dimensional features and classifying the data, data types can be subdivided and the data features under different attributes can be identified.
Because the iterative fuzzy clustering algorithm is efficient and produces intuitive results, this paper applies it to the classification of complex attribute big data. The algorithm takes the dimension-reduced, de-redundant data as input and classifies it in order to improve the final classification result. To ensure that the algorithm terminates, a termination condition is set: when the number of iterations reaches the maximum, the iteration stops and the final classification result is output.
The iterative fuzzy clustering algorithm can distribute complex attribute big data across multiple computing nodes and coordinate the iterative operation through mechanisms such as local aggregation and global aggregation [8]. It organizes multi-dimensional complex attribute big data in the form of vertices and edges. Each vertex is a four-tuple (quad) structure:
The iterative fuzzy clustering algorithm models complex attribute big data as a graph. The data is input into MaxCompute and modeled with vertices and directed edges, state updates are carried out through communication between nodes, and each iteration is expressed as a superstep. The data nodes are distributed among multiple computing nodes to realize parallel iterative computation. The classification process of complex attribute big data based on the iterative fuzzy clustering algorithm is as follows:
(1) Algorithm input.
(2) Graph loading. The iterative fuzzy clustering algorithm uses a custom Graph Loader to load the input complex attribute big data and describes it as vertices and directed edges. The identifier of each vertex is set to Recordnum, and the content of each record is transformed and saved in a vertex of the quad structure.
(3) The clustering set of the quad structure is established and initialized.
(4) Iterative operation. The algorithm processes the complex attribute big data in the quad structure. The centroids of the current iteration are initialized from the centroids computed in the previous iteration, and the nearest-point sets are reset to empty. Each data node performs the local iterative operation within the quad structure by calculating the distance between the node and each centroid, which realizes local aggregation. Each quad structure then integrates the local results, recalculates its centroid and checks whether the stopping condition is met. If it is, the centroid is written to the MaxCompute table to realize global aggregation.
(5) Based on the global aggregation results, the complex attribute big data is classified according to its data attributes to obtain the final classification results.
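The iterative step can be illustrated on a single machine with the standard fuzzy c-means updates. The paper does not state its update formulas in this text, so the membership and centroid rules below are the textbook ones, and the fuzzifier `m`, the tolerance `tol` and the initial centroids are assumptions.

```python
import math

def fuzzy_c_means(points, centers, m=2.0, max_iter=100, tol=1e-6):
    """Standard fuzzy c-means; returns (centers, memberships).

    Iteration stops when no centroid moves more than `tol` or when
    `max_iter` is reached, mirroring the termination rule in the text.
    The returned memberships are those computed in the final iteration.
    """
    dim = len(points[0])
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(max_iter):
        # Membership update: closer centroids receive larger membership.
        memberships = []
        for p in points:
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            memberships.append(
                [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
                 for d_i in dists])
        # Centroid update: membership-weighted mean of all points.
        new_centers = []
        for i in range(len(centers)):
            weights = [u[i] ** m for u in memberships]
            total = sum(weights)
            new_centers.append(
                [sum(w * p[d] for w, p in zip(weights, points)) / total
                 for d in range(dim)])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, memberships = fuzzy_c_means(points, [[0.0, 0.0], [10.0, 10.0]])
# Hard labels: the cluster with the largest membership for each point.
labels = [max(range(len(u)), key=u.__getitem__) for u in memberships]
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each point keeps a graded membership in every cluster, and a hard class is obtained only at the end by taking the largest membership.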
To sum up, the algorithm steps are implemented in a distributed parallel mode in a cluster environment, which not only uses the parallel operation mode inside the node, but also uses the global information iterative operation mechanism to obtain and fuse the local operation results, so as to analyze whether they conform to the convergence criteria [10, 12]. The schematic diagram of iterative fuzzy clustering algorithm is shown in Fig. 1.

Fig. 1. Schematic diagram of iterative fuzzy clustering algorithm.
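The interplay of local and global aggregation can be sketched as follows, assuming that each computing node holds one partition of the data together with the current fuzzy memberships, and that the quantity being aggregated is the membership-weighted centroid. The two function names are hypothetical, and the sketch runs the nodes sequentially rather than on a cluster.

```python
def local_aggregate(partition, memberships, m=2.0):
    """Per-node partial sums for each cluster: weighted coordinates and weights."""
    n_clusters = len(memberships[0])
    dim = len(partition[0])
    sums = [[0.0] * dim for _ in range(n_clusters)]
    weights = [0.0] * n_clusters
    for point, u in zip(partition, memberships):
        for i in range(n_clusters):
            w = u[i] ** m
            weights[i] += w
            for d in range(dim):
                sums[i][d] += w * point[d]
    return sums, weights

def global_aggregate(partials):
    """Merge the partial sums from all nodes into global centroids."""
    n_clusters = len(partials[0][1])
    dim = len(partials[0][0][0])
    centroids = []
    for i in range(n_clusters):
        total_w = sum(weights[i] for _, weights in partials)
        centroids.append(
            [sum(sums[i][d] for sums, _ in partials) / total_w
             for d in range(dim)])
    return centroids

# Two nodes, two clusters; crisp memberships keep the arithmetic easy to follow.
node_a = local_aggregate([[0.0, 0.0], [2.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
node_b = local_aggregate([[10.0, 10.0], [12.0, 10.0]], [[0.0, 1.0], [0.0, 1.0]])
print(global_aggregate([node_a, node_b]))  # [[1.0, 0.0], [11.0, 10.0]]
```

Only the partial sums travel between nodes, so the communication cost per iteration depends on the number of clusters and dimensions rather than on the number of records.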
With the classification of complex attribute big data based on the iterative fuzzy clustering algorithm in place, the next step is to test the method in application in order to verify its reliability and soundness.
Experimental scheme
In order to test the effectiveness of the big data classification method of complex attributes based on iterative fuzzy clustering algorithm designed in this paper, the specific experimental scheme is as follows:
(1) Experimental environment: the PC used in the experiment has an Intel Pentium G460 processor (3.0 GHz), 16 GB of memory and the 64-bit Windows 7 operating system. Matlab simulation software was used to test the application effect of the algorithm in this paper.
(2) Experimental data: the MNIST data set and the Object data set were selected as experimental test data sets. The MNIST data set contains five categories of web page tags, namely engineering, college, faculty, course and activity. Each type of web page contains three types of data, namely the links within the web page, the bag-of-words features of the web page text and the number of reference links between web pages. The total number of samples is 1729. The Object data set contains image data of 15 kinds, including trees, birds, sunrises and buildings. The image data of each kind has a total of 68 dimensions (including a color histogram, a color correlogram of 132 dimensions, an edge direction histogram of 73 dimensions, color moments of 230 dimensions and a texture feature histogram of 120 dimensions), and the total number of samples is 3786. The 20 types of multi-source data from the two data sets were mixed, and different methods were used to classify the complex attribute big data. In this process, the optimal simulation parameters were used as the initial simulation parameters, so as to improve the accuracy and reliability of the simulation experiment.
(3) Experimental methods: the method of reference [21], the method of reference [4] and the method of this paper were compared.
(4) The experimental indexes are as follows:
In order to intuitively display the classification effect of complex attribute big data, the accuracy of different methods is compared.
Among them, the total number of clusters obtained after classification is M; the number of samples in the cluster i with complex attribute big data is
The classification effect of this algorithm for complex attribute big data is reflected by the normalized mutual information
Among them, the total number of clusters is K; in the j cluster of classification results, the number of samples belonging to the i cluster is
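As an illustration of this index, the sketch below computes normalized mutual information from two label lists. Several normalizations are in use; dividing by the geometric mean of the two entropies is assumed here and may differ from the formula used in the paper.

```python
import math
from collections import Counter

def normalized_mutual_information(true_labels, pred_labels):
    """Mutual information divided by sqrt(H(true) * H(pred)) (assumed variant)."""
    n = len(true_labels)
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    joint_counts = Counter(zip(true_labels, pred_labels))
    mutual_info = sum(
        (n_ij / n) * math.log(n * n_ij / (true_counts[i] * pred_counts[j]))
        for (i, j), n_ij in joint_counts.items())
    h_true = -sum((c / n) * math.log(c / n) for c in true_counts.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in pred_counts.values())
    if h_true == 0.0 or h_pred == 0.0:
        # Degenerate case: at least one labeling puts everything in one group.
        return 1.0 if h_true == h_pred else 0.0
    return mutual_info / math.sqrt(h_true * h_pred)

# Identical partitions with swapped cluster names score 1.
nmi_perfect = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
# Statistically independent partitions score 0.
nmi_independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(nmi_perfect, nmi_independent)
```

The index does not depend on how clusters are numbered, which makes it suitable for comparing clustering output with reference classes.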
The classification effect of this algorithm for complex attribute big data is reflected by Richter’s index
Among them, the number of the same kind of complex attribute big data samples clustered into the same cluster is
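The description in terms of samples of the same kind falling into the same cluster resembles a pair-counting agreement index. The sketch below assumes the classical Rand index, that is, the fraction of sample pairs on which the reference classes and the clustering agree; whether this is exactly the index used in the paper cannot be confirmed from this text.

```python
from itertools import combinations

def rand_index(true_labels, pred_labels):
    """Fraction of sample pairs treated consistently by both labelings."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        # A pair agrees if it is together in both or separated in both.
        agree += same_true == same_pred
        total += 1
    return agree / total

# Six pairs in total; only the pair (2, 3) is split by the clustering.
ri = rand_index([0, 0, 1, 1], [0, 0, 1, 2])
print(ri)  # 5/6
```

Enumerating all pairs is quadratic in the number of samples, which is acceptable for an illustration but not for very large data sets.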
RDV (relative difference value) is selected as an important evaluation index to measure the convergence speed of the algorithm. The lower the RDV value of the algorithm, the better the convergence speed of the algorithm. The calculation formula of RDV value is as follows:
In formula (9),
Analysis of experimental results
Accuracy
In order to intuitively verify the effectiveness of this method for cross source classification of complex attribute big data, this paper selects the reference [21] method and reference [4] method as experimental comparison methods, and verifies the effectiveness of this method through the comparison results.
The classification results of the three methods are shown in Table 1.
Classification results of different methods
The experimental results in Table 1 show that, with the proposed method, the classification results for multi-source data in the different data sets are very close to the actual results, whereas the classification results of the other two methods differ considerably from the actual results. This verifies the effectiveness of the proposed method for cross-source classification of complex attribute big data.
The classification accuracy test results of the three methods are shown in Fig. 2.

Fig. 2. Comparison of classification accuracy of different methods.
The experimental results in Fig. 2 show that classification accuracy does not vary linearly with the number of samples. With the proposed method, the classification accuracy for complex attribute big data is higher than 99% for all sample numbers tested, which shows that the method maintains high accuracy under different sample numbers. The main reason is that the method introduces the Bloom Filter data structure to eliminate redundancy in the dimension-reduced complex attribute big data and then classifies the data in parallel with the iterative fuzzy clustering algorithm, which effectively improves the classification accuracy.
The normalized mutual information

Fig. 3. Normalized mutual information index test results of classification results.
As shown in Fig. 3, after the method in this paper clusters complex attribute big data from multiple data sets, the minimum normalized mutual information index value of the classification results is 0.97, and the value range is
The test results of Richter’s index

Fig. 4. Test results of the Richter’s index of classification results.
As shown in Fig. 4, after the proposed method clusters complex attribute big data from multiple data sets, the Richter’s index values of the classification results are all higher than 0.95 and close to 1, which indicates a good classification effect. The reason is that the method combines principal component analysis and kernel local Fisher discriminant analysis to reduce the dimensionality of complex attribute big data, then introduces the Bloom Filter data structure to eliminate redundancy in the dimension-reduced data, and finally classifies the de-redundant data in parallel with the iterative fuzzy clustering algorithm, which effectively improves the classification effect.
RDV values of big data with complex attributes of different methods are shown in Fig. 5.

Fig. 5. Comparison of RDV values of different methods.
As can be seen from the experimental results in Fig. 5, the proposed method has a low RDV value for different numbers of computations, which indicates that it converges quickly while maintaining high classification accuracy. This again verifies that the proposed method has high classification effectiveness and a fast convergence rate for complex attribute big data. The reason is that the method uses the iterative fuzzy clustering algorithm to classify the de-redundant complex attribute big data in parallel, and the parallelization effectively improves both classification effectiveness and convergence speed.
With the rapid development of big data technology, the classification of complex attribute big data has become a difficulty in big data management. To improve the classification effect, this paper proposes a classification algorithm for complex attribute big data based on an iterative fuzzy clustering algorithm. The method combines principal component analysis and kernel local Fisher discriminant analysis to reduce the dimension of complex attribute big data. The Bloom Filter data structure is then introduced to eliminate redundancy in the dimension-reduced data. Finally, the iterative fuzzy clustering algorithm classifies the de-redundant data in parallel, which completes the classification of complex attribute big data. The simulation results show that the classification accuracy of this method is higher than 99%, and the normalized mutual information index value range is
