Abstract
To overcome the inaccurate recognition of redundant data in current networks, a new method based on attributes and relations is proposed for evaluating the redundancy of massive heterogeneous data in the Internet of Things. The method constructs a global database system and preliminarily obtains the redundant data. Angle variance is used for anomaly recognition and detection of Internet of Things data. A weighted network based on attribute distance is then constructed, in which attributes and relations are used to calculate the comprehensive redundancy, realizing the redundancy evaluation of heterogeneous Internet of Things data. The experimental results show that the redundancy calculation accuracy and the recall rate of the method are both above 95%, which is clearly higher than those of the other methods. This indicates that the proposed method has high accuracy and strong anomaly recognition performance, and that the integrated database of the Internet of Things is secure.
Introduction
The Internet of Things covers many domains, such as commerce and sports. Within these domains, large-scale heterogeneous data sources provide redundant data [3,5,11]. Such redundant data rapidly increase the power consumption of an Internet of Things system during normal operation and inconvenience users, which reduces the availability of the system. Therefore, it is necessary to detect redundant data in multi-source heterogeneous data correctly and efficiently. The primary keys of the same entity differ between data sources, and the data contained in the records of the same entity are partly redundant and partly complementary [9,10,14]. The purpose of this paper is to detect redundant records in the massive heterogeneous data of the Internet of Things, so as to make the Internet of Things run faster.
In Reference [6], against the background of the continuous development of the Internet of Things, the data, data-feature and decision hierarchy from the definition of data fusion is applied to Internet of Things data monitoring on the basis of mutual information, and a data fusion method based on a multi-tier mode is designed and constructed. The mutual information method measures the degree of correlation and redundancy between conditional attributes and decision attributes in order to extract rules and form relevant knowledge. In the process, the degree of association within the massive data is identified by the mutual information method, and the association features are selected during data preprocessing. Decision fusion is then implemented on the data with a multi-layer feed-forward network. The method is combined with the MapReduce model for parallel computing on massive data sets, and a 3M framework is constructed to fuse the massive data of the Internet of Things. In Reference [4], in view of the adverse effects of the large amount of redundant data in data streams, a data redundancy recognition and filtering method based on a pre-installed flow table (PFFR) is proposed. It uses pre-loaded transfer flow table items to restrict UDP and thus control the rate of packet initialization, and it installs the flow table and a redundant grouping algorithm to control redundant packets according to the path. Reference [1] points out that, with the continuous development of big data, redundant data occupy a high proportion of storage, and that redundant data identification and filtering is a reliable technology for optimizing storage systems. Redundant data can be deleted by comparing data fingerprints, which ensures data uniqueness. During recognition, data whose fingerprint has no match are judged to be new data and are stored.
However, it is found that there are still large-scale redundant data without matching indicators. If the redundant data in unmatched blocks can be identified, the recognition rate of data redundancy can be improved to a certain extent.
Thus, a redundant data detection method based on regression detection is proposed. Regression detection is performed on the matched data blocks generated by the traditional slider method. The type of data operation is judged by comparing the changes in the integration structure of the unmatched blocks, and redundancy detection is then performed according to the operation type, so that redundant data in unmatched blocks can be detected. Anomaly detection keeps the time and energy consumption of redundancy evaluation as low as possible, with the aim of improving the accuracy and recall rate of redundant data detection in the Internet of Things.
At present, there are many methods for detecting redundant data in the Internet of Things, but the accuracy and recall of the above research results are poor. To this end, a method based on attributes and relations is proposed for evaluating the redundancy of massive heterogeneous data in the Internet of Things. The overall components of the method are as follows:
(1) The importance of the Internet of Things to people’s lives and the poor accuracy of existing redundant data identification are introduced.
(2) Heterogeneous data of the Internet of Things are integrated, a global database system is built, and redundant data are obtained initially.
(3) Angular variance is introduced to detect data anomalies, including dimensionality reduction and anomaly recognition, which minimizes time and energy consumption and, to a certain extent, ensures the security of the massive heterogeneous data.
(4) Based on attributes and relations, a weighted network based on attribute distance is constructed to calculate the redundancy of heterogeneous data.
(5) The redundancy calculation accuracy and the recall of different methods are compared.
(6) Conclusion.
Design of the redundancy evaluation method for massive heterogeneous data in the Internet of Things
Preliminary acquisition of redundant data
In order to effectively evaluate the redundancy of massive heterogeneous data in the Internet of Things, the data to be evaluated are integrated into a database, and the degree of data redundancy can be preliminarily obtained in the integration process, which is conducive to improving the evaluation efficiency.
Because the subsystems are both cooperative and independent, their data usually need to be centralized in a global database without disturbing the normal operation of the original systems. Applications can then be built on the global database to mine, analyze and share the data [7,12,13]. In this paper, the database behind each subsystem is called a local database, and the database composed of the local databases is called the global database.
The heterogeneous data integration database of the Internet of Things consists of a client and a server. The client is mainly responsible for extracting incremental data from the local database. It calls its sub-function modules to carry out a series of preprocessing steps, stores the processed data in the corresponding directory of the FTP server, and sends a message to notify the server to start downloading. The server connects to the FTP server, downloads the latest data, calls its own sub-function modules to process the data, and then stores the processed data in the target database to complete the data integration.
The client program contains an incremental data identification and extraction unit, an XML conversion unit, a primary key conversion unit, an encryption and signature unit, and a data compression unit. The XML conversion unit converts the incremental data to a standard XML format. The primary key conversion unit eliminates the primary key differences between the local and global databases. The encryption and signature unit encrypts and signs the data, which guarantees the security of the integrated data [2,17]. The data compression unit compresses the data to reduce their size and improve transmission efficiency.
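As a minimal sketch of the XML conversion and data compression units, the following example serializes a batch of incremental rows to XML and compresses the result. The element names (`increment`, `record`, `field`), the table name and the use of zlib are illustrative assumptions; the paper does not specify the XML layout or the compression algorithm.

```python
import xml.etree.ElementTree as ET
import zlib


def rows_to_xml(table, rows):
    """Serialize incremental rows (a list of dicts) into one XML document."""
    root = ET.Element("increment", {"table": table})
    for row in rows:
        rec = ET.SubElement(root, "record")
        for column, value in row.items():
            ET.SubElement(rec, "field", {"name": column}).text = str(value)
    return ET.tostring(root, encoding="utf-8")


def pack(table, rows):
    """XML conversion followed by compression, as a client might do before upload."""
    return zlib.compress(rows_to_xml(table, rows))


def unpack(blob):
    """Inverse step: decompress and parse the records back into dicts of strings."""
    root = ET.fromstring(zlib.decompress(blob))
    return [{f.get("name"): f.text for f in rec} for rec in root]


# Hypothetical incremental rows from a local database.
rows = [{"id": 1, "temp": 21.5}, {"id": 2, "temp": 22.0}]
blob = pack("sensor_reading", rows)
restored = unpack(blob)
```

All values come back as strings after the round trip, so type restoration would have to be handled by the heterogeneous conversion step on the server side.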
The server-side program contains a data decompression unit, a decryption and signature verification unit, a heterogeneous conversion unit, an SQL statement generation unit, a primary key conversion unit and a metadata unit. The data decompression unit decompresses the compressed data files and information transmitted by the client. The decryption and signature verification unit verifies whether the data are secure, that is, whether they were sent by a legitimate user and have not been tampered with. The heterogeneous conversion unit eliminates the heterogeneity of the data from the local databases and uses a uniform format standard to store and update the global database in real time. The SQL statement generation unit generates the corresponding SQL statements from the obtained data and updates the corresponding database by executing them. The metadata unit stores the heterogeneous conversion information for the local and global databases, which supports the heterogeneous conversion unit. The primary key conversion unit maintains primary key comparison tables between the global and local databases to remove primary key differences. The design model of the system is shown in Fig. 1.

Integration framework for massive heterogeneous data in the Internet of Things.
The key sub-modules in Fig. 1 are constructed to ensure the security and evaluation efficiency of heterogeneous data in the Internet of Things.
(1) Server-side programs
The program is actually an integrated unit. After receiving the data from the client, it updates the data to the corresponding database by calling the data decompression unit, decryption verification signature unit and heterogeneous transformation unit. The schematic diagram of the process of the server-side program is shown in Fig. 2.

Schematic diagram of server-side program processing.
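The primary key conversion and SQL statement generation steps of the server-side program can be sketched as follows. The comparison table `KEY_MAP`, the table `item` and the use of SQLite are hypothetical choices made only so the example runs on its own; they are not taken from the paper.

```python
import sqlite3

# Hypothetical primary key comparison table: (source, local key) -> global key.
KEY_MAP = {("shop_a", 7): 1001, ("shop_b", 7): 1002}


def build_upsert(table, record):
    """Generate a parameterized SQL statement and its values for one record."""
    columns = sorted(record)
    placeholders = ", ".join("?" for _ in columns)
    sql = (
        f"INSERT OR REPLACE INTO {table} "
        f"({', '.join(columns)}) VALUES ({placeholders})"
    )
    return sql, [record[c] for c in columns]


def apply_record(conn, source, table, record):
    """Convert the local primary key, then generate and execute the statement."""
    converted = dict(record, id=KEY_MAP[(source, record["id"])])
    sql, params = build_upsert(table, converted)
    conn.execute(sql, params)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
# Two sources reuse local key 7; the comparison table keeps them apart.
apply_record(conn, "shop_a", "item", {"id": 7, "name": "kettle"})
apply_record(conn, "shop_b", "item", {"id": 7, "name": "lamp"})
stored = conn.execute("SELECT id, name FROM item ORDER BY id").fetchall()
```

Values are passed as parameters rather than interpolated into the statement. Table and column names are interpolated, so in this sketch they are assumed to come from trusted metadata.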
(2) Client program
The program is mainly responsible for data identification and extraction. The incremental data in the local database are extracted by the incremental extraction unit. They are then processed by the XML conversion unit, the encryption and signature unit and the data compression unit, stored in the corresponding target object, and the server program is notified to download the data. If the transmitted data file is too large, the transmission is likely to fail, and retransmitting the whole file after a failure is time-consuming. A threshold is therefore set: if the data file is larger than the threshold, the file is split according to certain rules and transmitted in segments. When a transmission fails, only the failed segment needs to be retransmitted, which enhances the stability of the system [8,15,20].
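The threshold-based splitting and segment-wise retransmission can be illustrated with the short sketch below. The threshold, the segment size, the retry limit and the simulated channel `flaky_send` are all assumptions introduced for the example; the paper only states that large files are split according to certain rules.

```python
def split_file(data, threshold, segment_size):
    """Return the data as one segment if it fits the threshold, else fixed-size segments."""
    if len(data) <= threshold:
        return [data]
    return [data[i:i + segment_size] for i in range(0, len(data), segment_size)]


def transmit(segments, send, max_attempts=3):
    """Send segments in order; after a failure, retry only the failed segment."""
    attempts = 0
    for index, segment in enumerate(segments):
        for _ in range(max_attempts):
            attempts += 1
            if send(index, segment):
                break
        else:
            raise ConnectionError(f"segment {index} failed {max_attempts} times")
    return attempts


# Simulated channel: segment 1 fails on its first attempt, everything else succeeds.
received = {}
failed_once = set()


def flaky_send(index, segment):
    if index == 1 and index not in failed_once:
        failed_once.add(index)
        return False
    received[index] = segment
    return True


payload = bytes(range(256)) * 4  # 1024 bytes, larger than the threshold
segments = split_file(payload, threshold=512, segment_size=300)
total_attempts = transmit(segments, flaky_send)
reassembled = b"".join(received[i] for i in sorted(received))
```

In this run the file is cut into four segments and five sends are made in total, because only the one failed segment is repeated.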
(3) Elimination of heterogeneous patterns
In the massive heterogeneous distributed database of the Internet of Things, each local database is built by different users at different times and locations with different data models, so there are various differences and conflicts between them. In this paper, each local database schema is defined as a local schema, and the target database schema is defined as the global schema. To integrate heterogeneous data into the global database, the local data must be transformed into the global schema. The mapping relationship between the global and local schemas is stored in XML. After incremental data are collected from the local databases, the schema transformation is completed through an XML conversion template, which yields a globally unified schema.
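A minimal sketch of such an XML-driven schema transformation is given below. The mapping document, its element and attribute names, and the column names are hypothetical; the paper does not reproduce its conversion template.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping document: one <map> element per local column.
MAPPING_XML = """
<schema local="shop_a.orders" global="orders">
  <map local="ord_no" global="order_id" type="int"/>
  <map local="amt" global="amount" type="float"/>
  <map local="cust" global="customer" type="str"/>
</schema>
"""

CASTS = {"int": int, "float": float, "str": str}


def load_mapping(xml_text):
    """Parse the mapping into {local_column: (global_column, cast_function)}."""
    root = ET.fromstring(xml_text)
    return {
        m.get("local"): (m.get("global"), CASTS[m.get("type")])
        for m in root.iter("map")
    }


def to_global(record, mapping):
    """Rename and cast the fields of a local record; unmapped fields are dropped."""
    return {
        target: cast(record[source])
        for source, (target, cast) in mapping.items()
        if source in record
    }


mapping = load_mapping(MAPPING_XML)
local_record = {"ord_no": "42", "amt": "19.90", "cust": "Li", "tmp": "x"}
global_record = to_global(local_record, mapping)
```

Keeping the mapping in a separate document means a new local schema can be supported by adding a mapping file rather than changing the conversion code.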
(4) Data signature and encryption
Encryption: A DES key is generated by the symmetric key generator, and the data file is encrypted with DES. The DES key is then encrypted with the public key of the receiving end, and the encrypted data file and the encrypted key are transmitted to the receiving end. The detailed encryption schematic is shown in Fig. 3.

Encryption schematic.
Decryption: The encrypted DES key is decrypted with the private key of the receiver to recover the DES symmetric key. The ciphertext is then decrypted with the DES key to obtain the plaintext. The detailed decryption schematic is shown in Fig. 4.

Decryption schematic.
Signature: Signing a message involves three key steps. First, the sender hashes the original data to produce a digest. Second, the sender encrypts the digest with its private key. Third, the original data are merged with the encrypted digest and transmitted to the receiver. The signature schematic is shown in Fig. 5.

Signature schematic.
Signature verification: A hash operation is performed on the plaintext obtained by decrypting the ciphertext, and the resulting message digest is compared with the received message digest. If the two digests are the same, the original text has not been tampered with; otherwise it has been tampered with, and the received data file is not the original. The signature is then verified. If the verification succeeds, the sender can be identified as the owner of the public key. Otherwise, the identity of the sender cannot be determined, and the data are regarded as having been transmitted by an illegal user. Figure 6 is a schematic diagram of signature verification.

Schematic of signature verification.
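The hash-then-sign and verification steps can be illustrated with a toy example. The sketch below uses textbook RSA without padding, with a key built from two publicly known Mersenne primes, and SHA-256 as the hash function. All of these are assumptions made for illustration: the paper does not name its hash or public key algorithm, and this key is not secure.

```python
import hashlib

# Toy RSA key from two publicly known Mersenne primes: illustration only, NOT secure.
P, Q = 2**127 - 1, 2**89 - 1
N = P * Q
E = 65537                           # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (needs Python 3.8+)


def digest_int(data):
    """Hash the data and map the digest into the range of the RSA modulus."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def sign(data):
    """Sender: hash the original data, then transform the digest with the private key."""
    return pow(digest_int(data), D, N)


def verify(data, signature):
    """Receiver: recover the digest with the public key and compare with a fresh hash."""
    return pow(signature, E, N) == digest_int(data)


message = b"incremental batch 17"
signature = sign(message)           # transmitted together with the message
ok = verify(message, signature)
tampered_ok = verify(b"incremental batch 18", signature)
```

A real deployment would rely on a vetted cryptographic library with a standard padding scheme rather than raw modular exponentiation.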
Based on the above data integration, the redundancy degree of massive heterogeneous data in the Internet of Things is preliminarily obtained. Angular variance is then introduced to detect data anomalies, which includes dimensionality reduction and anomaly recognition, so that the time and energy consumed by the redundancy assessment are kept as low as possible. In the operation of the algorithm, the inputs are the latest data element The data to be detected in the database of Section 2.1 are selected to obtain the high-dimensional data sample X containing certain elements. Based on the initialization of the samples and the optimal thresholds, an optimal data set network is constructed. The information entropy of each dimension as The polynomial coefficient The matrix eigenvalues and eigenvectors are computed by using formula (2) and formula (3):
the data set The new data element Anomaly factor On the contrary, the data grid is updated according to the first-in-first-out method and the best data set threshold. Go to step (7) and detect the next data point.
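The idea behind angle-variance anomaly detection can be sketched as follows. This is a simplified, unweighted variant written for illustration: it scores each point by the variance of the cosines of the angles it forms with all pairs of other points, and it omits the entropy-based dimensionality reduction, the thresholds and the first-in-first-out update described above. The sample points are hypothetical.

```python
import math
from itertools import combinations
from statistics import pvariance


def angle_variance(point, others):
    """Variance of the cosines of the angles `point` forms with each pair of other points."""
    cosines = []
    for a, b in combinations(others, 2):
        va = [x - p for x, p in zip(a, point)]
        vb = [x - p for x, p in zip(b, point)]
        na, nb = math.hypot(*va), math.hypot(*vb)
        if na == 0 or nb == 0:  # skip exact duplicates of the point itself
            continue
        cosines.append(sum(x * y for x, y in zip(va, vb)) / (na * nb))
    return pvariance(cosines) if len(cosines) > 1 else 0.0


def anomaly_scores(points):
    """Score every point against the rest; a small variance suggests an outlier."""
    return [
        angle_variance(p, points[:i] + points[i + 1:])
        for i, p in enumerate(points)
    ]


# Five clustered points and one distant point (index 5).
points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (10, 10)]
scores = anomaly_scores(points)
most_anomalous = min(range(len(points)), key=scores.__getitem__)
```

A point far from the rest sees all other points within a narrow cone, so its angles barely vary, whereas a point inside a cluster sees the others in widely different directions. The pairwise enumeration is cubic in the number of points, which is one reason a sliding data window and dimensionality reduction are useful in practice.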
Evaluation method for data redundancy
In the evaluation of data redundancy based on attributes and relations, it mainly includes the following aspects:
(1) Weighted network based on attribute distance
The most intuitive description of the Internet of Things is a network, in which the nodes correspond to the research objects and the edges correspond to the relationships between the objects. In this paper, the attribute distances of related objects are fused into the object-relational network to form a weighted network containing comprehensive information [16,19].
Supposing there are
Where S stands for positive integers.
For each edge of the network, the attribute distance between the objects it connects is calculated, and the attribute distances are marked in the network to form a weighted network of Internet of Things data based on attribute distance, as shown in Fig. 7:

Weighted network of Internet of Things data based on attribute distance.
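As a minimal sketch of how such a weighted network might be built, the following example labels each relation with the distance between the attribute vectors of the two objects. The objects, their attribute values and the choice of Euclidean distance are assumptions; the attribute distance formula of the paper is not reproduced here.

```python
import math

# Hypothetical objects with numeric attribute vectors, and the relations between them.
attributes = {1: (1.0, 2.0), 2: (1.0, 3.0), 3: (4.0, 6.0), 4: (4.0, 2.0)}
relations = [(1, 2), (2, 3), (3, 4), (1, 4)]


def attribute_distance(u, v):
    """Euclidean distance between two attribute vectors (an assumed metric)."""
    return math.dist(attributes[u], attributes[v])


def build_weighted_network(edges):
    """Undirected adjacency map: node -> {neighbor: attribute distance}."""
    graph = {}
    for u, v in edges:
        w = attribute_distance(u, v)
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


network = build_weighted_network(relations)
```

The adjacency-map form keeps both the relation structure and the attribute information in one object, which is what the later shortest-distance and neighbor computations need.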
(2) Minimum distance of weighted network under attribute distance
The minimum distance, that is, the shortest path, is defined over the sequences of connected edges that join two target objects in the network; there may be more than one path between two target objects. For example, in Fig. 7, there are two paths between Object 1 and Object 4.
Assuming that the path between
In the formula,
Assuming that there are
The shorter the shortest distance between two objects, the higher the similarity between them, that is, the higher the redundancy.
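The shortest distance in a weighted network of this kind can be computed with Dijkstra's algorithm, sketched below. The graph and its edge weights are hypothetical and are chosen so that, as in the example above, two paths connect object 1 and object 4.

```python
import heapq


def shortest_distance(graph, source, target):
    """Dijkstra's algorithm on an adjacency map with non-negative edge weights."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbor, weight in graph[node].items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf")  # target not reachable from source


# Hypothetical attribute distances on the edges.
graph = {
    1: {2: 1.0, 3: 2.5},
    2: {1: 1.0, 4: 3.0},
    3: {1: 2.5, 4: 1.0},
    4: {2: 3.0, 3: 1.0},
}
d14 = shortest_distance(graph, 1, 4)
```

Here the path through object 2 has length 4.0 and the path through object 3 has length 3.5, so the shortest distance between objects 1 and 4 is 3.5.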
(3) Shared neighbors
An object in the network that is directly connected to a target object q is a neighbor object of q. The set of all neighbor objects of object q is marked as
In Fig. 7, the neighbor objects of target object 5 are 4, 6, 7 and 9, which can be recorded as
If an object q in the network is directly connected to both object u and object v, then q is called a shared neighbor of u and v. The set of shared neighbors of u and v is marked as
In Fig. 7, objects 5 and 7 have the shared neighbors 6 and 9, which can be represented as
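The neighbor and shared-neighbor sets described above reduce to simple set operations, as the following sketch shows. The edge list is hypothetical: it was chosen only to be consistent with the neighbor sets quoted in the text, since the figure itself is not reproduced here.

```python
from collections import defaultdict

# Hypothetical edge list, consistent with the neighbor sets quoted in the text.
edges = [(4, 5), (5, 6), (5, 7), (5, 9), (6, 7), (7, 9), (7, 8)]

adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)


def neighbors(q):
    """All objects directly connected to q."""
    return set(adjacency[q])


def shared_neighbors(u, v):
    """Objects directly connected to both u and v."""
    return adjacency[u] & adjacency[v]


n5 = neighbors(5)
s57 = shared_neighbors(5, 7)
```

With this edge list, object 5 has the neighbors 4, 6, 7 and 9, and objects 5 and 7 share the neighbors 6 and 9.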
(4) Comprehensive redundancy based on attributes and relations
In the calculation of comprehensive redundancy based on attributes and relations between target objects, it is assumed that the objects
In the formula, α represents the redundancy judgement parameter.
When the shortest distance between two objects is short, they have many shared neighbors, and the shortest paths between the two objects and their shared neighbors are not long, the two objects have the highest probability of being redundant [18,21].
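One way to turn this qualitative rule into a number is sketched below. The scoring function is an illustrative assumption and is not the comprehensive redundancy formula of the paper: it simply combines a closeness term derived from the shortest distance with the fraction of shared neighbors, using a hypothetical mixing parameter `weight`.

```python
def comprehensive_redundancy(shortest_dist, shared, union_size, weight=0.5):
    """Illustrative score in [0, 1]: higher for short distances and many shared neighbors.

    shortest_dist: shortest attribute distance between the two objects
    shared:        set of their shared neighbors
    union_size:    number of distinct neighbors of either object
    weight:        hypothetical balance between the attribute and relation terms
    """
    closeness = 1.0 / (1.0 + shortest_dist)
    overlap = len(shared) / union_size if union_size else 0.0
    return weight * closeness + (1.0 - weight) * overlap


# A close pair with two of four neighbors shared, and a distant pair sharing none.
near = comprehensive_redundancy(0.5, {6, 9}, 4)
far = comprehensive_redundancy(6.0, set(), 5)
```

Under these assumptions the close pair scores higher than the distant pair, which matches the direction of the rule stated above. Any threshold for declaring two objects redundant would have to be chosen separately.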
In summary, in the process of calculating the comprehensive redundancy based on attributes and relations between classes, assuming that
In the formula,
To verify the effectiveness of the attribute- and relation-based redundancy evaluation method for massive heterogeneous data of the Internet of Things, four sets of comparative experiments were designed. Under the same objective conditions, the proposed method is compared with the methods of Reference [6], Reference [4] and Reference [1] in terms of redundancy calculation accuracy and recall. The higher the accuracy and the recall rate, the better the method performs and the more applicable it is in practice. The experimental environment is as follows: the CPU is an AMD Sempron 3000+, the memory is 512 MB, the operating system is Windows XP, and the algorithm is written in Visual C++.
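For reference, the two evaluation measures can be computed as in the sketch below. It assumes the conventional definitions, with accuracy as the fraction of records classified correctly and recall as the fraction of truly redundant records that are detected; the paper does not state its definitions explicitly. The record identifiers are hypothetical.

```python
def accuracy_and_recall(predicted, actual, all_items):
    """Accuracy and recall of a redundancy detector over a finite set of items.

    predicted: items flagged as redundant by the method
    actual:    items that are truly redundant
    all_items: every item that was evaluated
    """
    tp = len(predicted & actual)   # redundant and flagged
    fp = len(predicted - actual)   # flagged but not redundant
    fn = len(actual - predicted)   # redundant but missed
    tn = len(all_items) - tp - fp - fn
    accuracy = (tp + tn) / len(all_items)
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, recall


# Hypothetical record identifiers.
all_items = set(range(10))
actual = {1, 2, 3, 4}
predicted = {1, 2, 3, 7}
accuracy, recall = accuracy_and_recall(predicted, actual, all_items)
```

In this toy case one redundant record is missed and one record is flagged wrongly, so eight of the ten records are classified correctly and three of the four redundant records are found.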
First, the performance of anomaly detection is verified. The experimental data come from the KDD Cup 1999 data set. To ensure the integrity and accuracy of the experimental data, the data were preprocessed and the records to be examined were checked for omissions. A total of 20,000 records were tested.

Abnormal recognition rate of the proposed method.

Accuracy comparison of different research results.
In Fig. 8, the redundancy evaluation method of massive heterogeneous data in the Internet of Things based on attributes and relations can effectively detect the abnormal data in the massive heterogeneous data of the Internet of Things. In the process of data integration, the method uses encryption and signature function modules to ensure data security to a certain extent, and then uses angle variance method to realize data anomaly detection, which further guarantees data security and improves the rate of anomaly data recognition.
As for the verification of the accuracy and recall of redundancy calculation and evaluation, the experimental data comes from an e-commerce platform in the Internet of Things. In order to ensure the validity of the experimental results, the data were preprocessed. Data collected randomly within 12 hours were selected as experimental data. In the above experimental environment, the experimental results are as follows:
Fig. 9 shows that, with the same amount of data and the same number of experiments, the redundancy calculation accuracy of the methods of Reference [6], Reference [4] and Reference [1] is below 90% in every case, whereas that of the proposed method is over 95%. This demonstrates the accuracy of the proposed method. The reason is that the angle variance method effectively detects abnormal data and greatly reduces time and energy consumption, which improves the accuracy and helps ensure data security.

Comparisons of recall rates of different research results.

Fig. 10 shows that, with the same amount of data and the same number of experiments, the recall rate of the reference methods is below 95%, while that of the proposed method is above 96%. The redundancy recognition recall of the proposed method is therefore better. The reason is that the method first integrates the heterogeneous data sets of the Internet of Things to evaluate data redundancy preliminarily, and then uses attributes and relations to calculate data redundancy further. It can thus identify redundant data in the network more comprehensively, which effectively enhances the performance of the method.
Given the importance of the Internet of Things in daily life, a method based on attributes and relations is proposed for evaluating the redundancy of massive heterogeneous data in the Internet of Things. In the process of data integration, the method ensures data security by using encryption and signature function modules. Because the angle variance method can effectively detect abnormal data, the redundancy calculation accuracy is improved. The method uses the heterogeneous data sets of the Internet of Things to evaluate data redundancy preliminarily and then calculates data redundancy further with attributes and relations, which improves its redundancy recognition and recall performance. The experimental results show that, compared with the traditional methods, the proposed method improves both the accuracy of redundancy calculation and the recall of redundancy identification, and that it can better evaluate the redundancy of massive heterogeneous data in the Internet of Things. Future work will focus on incremental data extraction and on further improving the evaluation efficiency.
