Abstract
To address the low feature extraction accuracy and large classification error of existing methods for classifying library archives data, an automatic classification method for library archives data based on data mining is designed. First, the linear relationship between the characteristic variables of library archives data is determined, and the linear coefficients of the archives data characteristics are calculated. Then, the characteristic states of library archives data are divided into three states, the characteristic data are normalized, and an adaptive differential evolution algorithm is used to remove the noise from the characteristics of library archives data. Finally, the mapping relationship in data mining is used to form the feature training set that is input to the training model, the archives data features are labeled according to their weights, and the automatic data classification model is established. The experimental results show that the feature extraction accuracy of this method peaks at about 97%.
Keywords
Introduction
With the continued development of modern electronic technology, computer technology has been widely adopted in many fields, and libraries are no exception. Managing books and documents by computer improves the efficiency of library management [2]. Libraries hold a large amount of archival data, which records important library information and is of great significance to the cultural development of society [1]. However, as time passes and the volume of knowledge and cultural data keeps growing, the storage and retrieval of archival data have become a focus of library management. Classifying library archives data is essential to solving the problem of archives data management [4], and continuously improving the classification efficiency improves the management efficiency of library archives data [8]. Researchers have therefore studied data classification algorithms extensively and achieved some results.
Literature [11] proposed a classification method for imbalanced data based on mixed sampling, which was applied to library archives data classification. The method mainly targets the weak handling of small-sample data by the SVM algorithm. It processes the data that lie far from the hyperplane, oversamples the samples that are close to the true class, and processes the data that are offset from the hyperplane to realize the classification. The algorithm classifies the data according to their authenticity and achieves high classification accuracy, but it still suffers from the small-sample problem. Literature [13] proposed a data stream classification algorithm based on adaptive random forest. In this method, multiple classifiers are set in the data stream to be classified, the data obtained by the classifiers are extracted and trained synchronously, and the data are then trained through the multiple trees of the random forest. The data are thereby preprocessed, and the processed data are classified according to their attributes. The algorithm classifies effectively, but the feature attributes of the classified data carry some uncertainty, which degrades the classification. Literature [6] proposed an imbalanced data classification algorithm that combines ADASYN and SMOTE. The algorithm addresses the low classification accuracy of traditional classification algorithms. First, the k-nearest neighbor algorithm is used to classify the data, then data of different sizes are classified, and unqualified data are eliminated to achieve accurate classification. This method classifies accurately according to the attributes of the data, but its classification range is limited. Its performance on library archives data classification is poor and needs to be improved.
To address the problems of the above classification methods, this paper designs a new method for classifying library archives data, which improves classification effectiveness by introducing data mining methods. The technical route of the method is as follows:
Step 1: Determine the linear relationship between the characteristic variables of library archives data, calculate the linear coefficients of the archives data characteristics, construct the feature extraction matrix, and complete the feature extraction of library archives data.
Step 2: Divide the characteristic states of library archives data into three states, normalize the characteristic data, remove the noise from the characteristics with an adaptive differential evolution algorithm, and complete the feature preprocessing of library archives data.
Step 3: Use the mapping relationship in data mining to form the feature training set that is input to the training model, calculate the weights of the data in the training set with the help of information entropy, and label the archives data features according to their weights. On this basis, construct the automatic classification model of library archives data. Each library archives datum is input into the model according to its label, and the output is the automatic classification result of library archives data.
Feature extraction and preprocessing of library archives data
Feature extraction of library archives data
To classify library archives data effectively, the characteristics of all archives data in the library must be extracted before classification, because the classification proceeds according to the different characteristics of the data. Because the characteristics of library archives data show a nonlinear trend and the feature data have multiple states, this paper adopts multiple regression for feature extraction [3].
In multiple linear regression, the research object is usually described as a linear relationship between one dependent variable and several independent variables. In this study, the characteristics of library archives data are therefore first described by such a linear relationship, as follows:
In the formula, y represents the characterization variable of the library archives data, n represents an arbitrary variable of the library archives data, and x represents the corresponding variable value.
The corresponding variable value in the library archives data can be expressed as:
In the formula,
In feature extraction by multiple linear regression, archives data features exist not in one state only but in many states [10]. Therefore, to extract the data features in depth, the features of the library archives data must be extracted further.
Assuming that there are b samples in the library archives data features, the feature data to be extracted are expressed as:
The features in the above archives data are then extracted through a feature extraction matrix, namely:
In the formula, u represents the characteristic sample of the library archives data and r represents the data from the independent variables of the library archives data.
In summary, feature extraction determines the linear relationship between the characteristic variables of library archives data, calculates the linear coefficients of the archives data characteristics, and constructs the feature extraction matrix, which completes the feature extraction of library archives data.
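The calculation of linear coefficients can be illustrated with a minimal sketch. It assumes an ordinary least-squares fit of y = b0 + b1*x1 + ... + bn*xn solved through the normal equations; the function name `fit_linear_coefficients` and the toy data are hypothetical and are not taken from this paper, and the sketch does not reproduce the paper's feature extraction matrix.

```python
def fit_linear_coefficients(rows, targets):
    """Least-squares fit of y = b0 + b1*x1 + ... + bn*xn (normal equations)."""
    # Prepend a constant 1.0 so the first coefficient is the intercept b0.
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])
    # Build the augmented system [X^T X | X^T y].
    aug = []
    for i in range(k):
        lhs = [sum(r[i] * r[j] for r in design) for j in range(k)]
        rhs = sum(r[i] * t for r, t in zip(design, targets))
        aug.append(lhs + [rhs])
    # Solve it by Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("feature variables are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The last column now holds the coefficients [b0, b1, ..., bn].
    return [aug[i][k] for i in range(k)]


# Hypothetical toy data generated from y = 1 + 2*x1 + 3*x2.
samples = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
values = [1, 3, 4, 6, 8]
coefficients = fit_linear_coefficients(samples, values)
print([round(c, 6) for c in coefficients])
```

Because the toy targets are generated exactly from a linear rule, the recovered coefficients should match that rule up to floating-point error; real archives features would leave a residual.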
Preprocessing of library archives data characteristics
Library archives data are large in scale, and the extracted features contain many ambiguous feature data as well as data that interfere with classification [9]. Therefore, the characteristic data must be preprocessed before classification. First, the states of the characteristic data determined above are divided into the following three categories:
When the sample size of the library archives feature data is
When the sample size of the library archives feature data is
In the formula, p represents the residual null vector of the data features.
When the sample size of the library archives feature data is
These three forms show that the characteristics of library archives data are not fixed, so they must be preprocessed according to the three states. First, the characteristics are normalized so that the trends of the feature vectors stay consistent [7]. The normalized data feature matrix is as follows:
In the formula, σ represents the normalization coefficient and
Based on the normalized library archives data features, an adaptive differential evolution algorithm is used to remove the noise from the features, that is:
In the formula, μ represents the random noise in the archives data features and θ represents the vector eigenvalues after noise removal.
In summary, preprocessing divides the feature states of library archives data into three states, normalizes the feature data, and removes the noise from the features with the adaptive differential evolution algorithm, which lays a foundation for the subsequent classification.
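The normalization step alone can be illustrated with a minimal sketch. It assumes column-wise min-max scaling to the range [0, 1], which is one common choice and is not confirmed as the normalization used in this paper; the normalization coefficient σ and the adaptive differential evolution denoising step are not reproduced. The function name `min_max_normalize` and the sample matrix are hypothetical.

```python
def min_max_normalize(matrix):
    """Scale every feature column of a row-major matrix to the range [0, 1]."""
    columns = list(zip(*matrix))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in matrix:
        normalized.append([
            # A constant column carries no information, so map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return normalized


# Hypothetical feature matrix: three archive records, two features each.
features = [[1, 10], [3, 10], [5, 30]]
print(min_max_normalize(features))
```

Scaling each column separately keeps features with large numeric ranges from dominating features with small ranges in later steps.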
Automatic classification method of library archives data based on data mining
Data mining analysis
Based on the feature extraction and preprocessing of library archives data described above, this paper introduces a data mining algorithm for the final classification of archives data. There are four main methods of data mining: association analysis, sequential pattern analysis, classification analysis and cluster analysis. Association analysis finds rules with statistical correlation between itemsets across many data sources. Sequential pattern analysis mines patterns that occur with high frequency relative to time or other patterns, in order to analyze the causal relationships between data [12]. Classification analysis builds a model that describes and distinguishes data categories or concepts, so that the model can predict the category of objects whose class is unknown; that is, it constructs a classification model or classification function (a classifier) that maps a data item in the database to a given category. Cluster analysis divides the data into several categories by similarity using a clustering algorithm, observes the characteristics of each category, and focuses on specific clusters for further analysis.
Design of automatic classification method for library archives data
For the automatic classification of library archives data, this paper therefore designs an automatic classification model. The library archives data features are input into the model to form a training set. Each training set includes several attribute data features, and each feature datum is then labeled to realize data classification.
First, the feature training set of library archives data that is input to the training model is expressed through the mapping relationship in data mining.
The feature vector set of the trained library archives data is expressed as:
In the formula, V represents the attribute value of the archives data and
Assume that the data tuple of the library archives data characteristics is:
The collection of data categories for the library archive data characteristics is:
The classification problem of library archives data is described as a mapping in data mining, that is:
Each library archives data feature corresponds to a class and its corresponding tuple. The corresponding mapping set, that is, the training set of the model, is then:
In the formula, ρ represents the proportional relationship of the data feature mapping.
Then, based on the feature training set constructed above, the trained archives data features are labeled to provide ordered data for input to the subsequent automatic classification model.
Before the data in the feature training set are labeled, the criticality of each datum in the training set must be determined, because criticality is the basis for the data labels. This paper calculates the weights of the data in the training set with the help of information entropy. Assuming a random variable in the library archives characteristic data,
In the formula, l represents the value of the random variable for the characteristics of the library archives data.
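The entropy-based weighting can be illustrated with a minimal sketch. It assumes Shannon entropy with a base-2 logarithm, empirical frequencies as probabilities, and weights obtained by dividing each entropy by the total; these choices, the function names `shannon_entropy` and `entropy_weights`, and the sample features are hypothetical and are not confirmed by this paper.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_weights(features):
    """Weight each named feature by its share of the total entropy."""
    entropies = {name: shannon_entropy(vals) for name, vals in features.items()}
    total = sum(entropies.values())
    if total == 0:
        # All features are constant: fall back to equal weights.
        return {name: 1.0 / len(features) for name in features}
    return {name: h / total for name, h in entropies.items()}


# Hypothetical feature columns for four archive records.
archive_features = {
    "genre": ["classical", "classical", "novel", "novel"],
    "language": ["zh", "zh", "zh", "zh"],
}
weights = entropy_weights(archive_features)
print(weights)
```

Under these assumptions a feature that takes the same value in every record has zero entropy and therefore zero weight, which matches the idea that higher entropy indicates higher criticality.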
According to the information entropy of the library archives data characteristics determined by formula (14), the greater the entropy, the higher the criticality of the data [5]. In this process, the entropy probability remains consistent. When the weight of the library archives data characteristics cannot be determined directly, it is determined by calculating the probability distribution: the higher the uncertainty factor, the higher the criticality. In this case:
The data are labeled according to the criticality determined above, and the results are as follows:
In the formula,
Finally, a library archives data classification model based on data mining is constructed to complete rapid classification. In the model construction, high-dimensional storage is used to distinguish the characteristics of each library archives datum by label and to determine the order in which the data enter the model [14]. The label of the library archives data characteristics is regarded as its attribute node, and the node is defined as:
In the formula,
Different node data are searched according to different pheromones. The data features that differ during the classification process are expressed as follows:
In the formula, k represents the difference factor and σ represents the key property of the difference data.
On this basis, the pheromone kernel of the library archives data features and the pheromone synthesis of the difference data are calculated to determine the pheromone basis for the final classification of archives data features, namely:
In the formula, h represents repulsion,
Finally, the final automatic classification model of library archives data based on data mining is constructed, and the following results are obtained:
In the formula, m represents the number of classification categories and f represents the number of classified data items.
In summary, the mapping relationship in data mining is used to form the feature training set that is input to the training model, the weights of the data in the training set are calculated with the help of information entropy, and the archives data features are labeled according to their weights. On this basis, the automatic classification model of library archives data is constructed. Each library archives datum is input into the model according to its label, and the output is the automatic classification result of library archives data [14]. The automatic classification process of library archives data is shown in Fig. 1.

Fig. 1. Automatic classification process of library archives data.
Experimental scheme
To verify the effectiveness of the proposed method, the archival data of a large public library were taken as the research object. The archival data of books and documents from the most recent three years were selected. According to the attributes of the archival data, 3 GB of data were determined, covering four types: classical literature, modern novel literature, foreign poetry literature, and Chinese and foreign classical translation literature. These archival data were mixed and then classified. The classification results were analyzed statistically with SPSS 13.0.
Experimental index design
The experiment compares the method of this paper with the methods of literature [11], literature [13] and literature [6]. The feature extraction accuracy and the classification error on the sample library archives data are taken as the research indexes. The accuracy of feature extraction affects the subsequent classification, so the higher its value, the better the basis for classification.
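The classification error index can be illustrated with a minimal sketch. It assumes that the error is the fraction of records whose predicted category differs from the true category; the paper does not state its exact formula, and the function name `classification_error` and the sample labels are hypothetical.

```python
def classification_error(true_labels, predicted_labels):
    """Fraction of records whose predicted category differs from the true one."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / len(true_labels)


# Hypothetical labels drawn from the four archive types used in the experiment.
truth = ["classical", "novel", "poetry", "translation"]
predicted = ["classical", "novel", "poetry", "poetry"]
error = classification_error(truth, predicted)
print(f"classification error: {error:.2%}")
```

Under this definition, a classification error below 2% corresponds to fewer than 2 misclassified records per 100.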
Analysis of experimental results
Accuracy analysis of feature extraction of library archives data with different methods
In the test, the feature extraction accuracy of the method of this paper and of the methods of literature [11], literature [13] and literature [6] on the sample library archives data is analyzed. The results are shown in Fig. 2.

Fig. 2. Analysis of feature extraction accuracy of library archives data with different methods.
The results in Fig. 2 show that the feature extraction accuracy on the sample library archives data differs among the method of this paper and the methods of literature [11], literature [13] and literature [6]. The highest feature extraction accuracy of the method of this paper is about 97%, and its accuracy is always higher than 90%. Although the accuracy of the methods of literature [11], literature [13] and literature [6] stays within a reasonable range, it is always lower than that of the method of this paper, which verifies the effectiveness of the proposed method.
Building on the feature extraction accuracy above, the classification error of the method of this paper and of the methods of literature [11], literature [13] and literature [6] on the sample library archives data is tested. The results are shown in Fig. 3.

Fig. 3. Error analysis of library archives data classification with different methods.
Fig. 3 shows that the classification error on the sample library archives data differs among the method of this paper and the methods of literature [11], literature [13] and literature [6]. The classification error of the method of this paper is always less than 2% and remains stable, while the classification errors of the methods of literature [11], literature [13] and literature [6] are always higher. This verifies that the proposed method classifies library archives data effectively and improves the classification results.
To improve the effectiveness of library archives data classification, an automatic classification method for library archives data based on data mining is designed. The method determines the linear relationship between the characteristic variables of library archives data, calculates the linear coefficients of the archives data characteristics, divides the characteristic states of library archives data into three states, normalizes the characteristic data, and removes the noise from the characteristics with an adaptive differential evolution algorithm. It then uses the mapping relationship in data mining to form the feature training set that is input to the training model, calculates the weights of the data in the training set with the help of information entropy, constructs the automatic classification model of library archives data, inputs the library archives data into the model according to their labels, and outputs the automatic classification results. This method has the following advantages:
The feature extraction accuracy of this method on library archives data peaks at about 97% and is always higher than 90%;
The classification error of this method on library archives data is always less than 2%, which indicates that the classification is reliable.
