A classification and extraction method of attribute hybrid big data based on Naive Bayes algorithm

Abstract

In the identification of network text information, the existing technology is difficult to accurately extract and classify text information with high propagation speed and high update speed. In order to solve this problem, the research combines the Naive Bayes algorithm with the feature two-dimensional information gain weighting method, uses the feature weighting method to optimize the Naive Bayes algorithm, and calculates the dimension of different documents and data categories through a new feature operation method. The data gain between them can improve its classification performance, and the classification models are compared and analyzed in the actual Chinese and English databases. The research results show that the classification accuracy rates of the IGDC-DWNB model in the Sogou database, 20-newsgroup database, Fudan database and Ruster21578 database are 0.89, 0.89, 0.93, and 0.88, respectively, which are higher than other classification models in the same environment. It can be seen that the model designed in the research has higher classification accuracy, stronger overall performance, and stronger reliability and robustness in practical applications, which can provide a new development idea for big data classification technology.

Keywords

Naive Bayes big data classification model attribute mixing

1. Introduction

Under the development trend of the digital age, the Internet has become the main means for human beings to obtain information from the outside world. With the increasing dependence of human beings on this means of information acquisition, the information on the Internet is gradually enriched and increased to human beings. Perception is difficult to fully process directly [1]. The vast majority of data in the Internet is embodied in the form of text data. On the basis of text data, multimedia information will be derived, and all information is guided by text information. Therefore, efficient identification and classification of text data is an important basis for efficient data processing in the information age [2]. As one of the important tools for the classification of Internet data and texts, the Bayesian algorithm system can form a certain classification effect, but its basic assumptions differentiate the features between different texts, so the consistent classification performance cannot be effectively improved [3]. Naive Bayes (NB) has been widely used in the field of data classification by scholars due to its rich probability expression ability and the incremental learning characteristics of comprehensive prior knowledge. However, the assumption that features are independent of each other is not in line with reality. Most scholars improve the algorithm from feature weighting, algorithm improvement and feature selection, but these improvement methods are relatively single and do not comprehensively consider the factors affecting feature weight. In the current research results, few feature selection methods can significantly improve the classification accuracy while maintaining the indirectness and low time complexity of the model. Therefore, the research uses the feature two-dimensional information gain weighting technology to optimize the Bayesian classification model, thereby improving the text data classification performance of the classification model and improving the classification effect. On the one hand, this research enriches the application practice cases of Naive Bayesian algorithm in other fields, and on the other hand, it also provides a more robust and stable performance strategy for the field of hybrid big data classification, which is widely used and more suitable for practical applications.

2. Related work

NB classification model is widely used in data classification because of its simplicity, efficiency and effectiveness. The principle of the algorithm is to first calculate the prior probability of each category, then calculate the posterior probability of each feature belonging to a certain category by using Bayesian theorem, select the category with the largest posterior probability estimate, and then determine the final category. In terms of algorithm improvement, scholars mostly use depth weighting to improve the Nb model, but few feature selection methods in the current research results can significantly improve its classification accuracy while maintaining the indirectness and low time complexity of the model. Prahartiwi and Dari [4] combined naive Bayesian algorithm, support vector machine and decision tree algorithm and applied it to breast cancer data prediction. The results showed that the prediction accuracy of the method designed in the study was higher than the traditional prediction method. Isa [5] proposed a potential debtor evaluation system based on the Naive Bayes algorithm, which can predict the debtor’s repayment according to the credit status of the potential debtor and provide suggestions. The results show that the accuracy of the method is 86%. Has application significance. Cao et al. [6]proposed a naive Bayesian model-based post-chemotherapy emesis risk prediction model. The study conducted cross-validation on 3,000 patients to test the performance of the model. The results showed that the model had good performance in predicting the risk of emesis. Resti et al. [7] applied polynomial naive Bayesian to diabetes status detection. The research conducted cross-validation after training the model. The results showed that the polynomial naive Bayesian model could obtain accurate prediction results in diabetes prediction.

In addition, researchers in the field of big data classification are constantly trying new technologies. Brahmane and Krishna [8] proposed a big data classification technology based on the Spark framework. The results show that this method can find suitable features in massive data and classify them. Selvi and Valarmathi [9] developed a big data classification model based on the Firefly Slave algorithm based feature selection technology. The model can select the best features of the data and use neural networks for classification. The results show that the classification efficiency of this method is higher than other methods. Lakhwani [10] conducted a comprehensive analysis of big data classification technologies, conducted a comparative study of different technologies, and re-analyzed important methods in this field. Yoanita et al. [11] used NB algorithm to establish an automatic classification model of comment emotion, which will classify positive and negative comments, and then use chi square analysis to analyze the characteristics that affect the number of subscribers. Nugroho et al. [12] used the NB classification model to classify the texts of multiple tags, and used particle swarm optimization (PSO) to optimize them. The results show that the optimization model can be used to improve the effectiveness of social electronic services. Muhajir and Chotijah [13] created an application based on Web browser to diagnose the damage type of the laptop, which uses the NB algorithm to diagnose the problems on the laptop and provide solutions. The results show that the diagnosis of web browser based applications using NB algorithm is effective and reliable. Alfianti et al. [14] used the PSO algorithm to optimize the support vector machine (SVM) and NB algorithm to improve their accuracy and provide a more accurate and effective solution to the comment classification problem. The results show that the accuracy of the optimized algorithm for comment classification has been improved. Mahmudah et al. [15] used the NB algorithm to classify driving activities, and divide the data set into training set and test set. The NB algorithm only needs a few training data. By calculating the average and variance of each feature, various probability values can be obtained, and then categories can be effectively divided. Rahutomo et al. [16] proposed a news classification model based on the NB algorithm, which automatically classifies news in the media into predetermined categories at a specific time every day, and presents its classification results in the form of pictures.

Since the NB algorithm needs to assume that each feature is independent to be true, the condition does not conform to reality. In order to weaken the method of naive Bayesian feature condition independent assumption, scholars have improved it from feature weighting, algorithm improvement and feature selection. Some scholars use the information of each feature relative to the category as the weight, and some scholars use the improved information gain rate to weight. However, the improvement of these feature weights is relatively simple, which cannot fully consider the factors affecting the feature weight. In terms of algorithm improvement, most scholars use depth weighting to improve the naive Bayesian model, and the improvement effect is more significant. In feature extraction, there are mainly two feature selection methods: filter and wrapper. However, in the current improvement results, it is difficult for feature selection methods to significantly improve the classification accuracy of the model while maintaining simplicity and low time complexity. This research starts with the depth weighting method, combines the Naive Bayesian algorithm with the feature two-dimensional information gain weighting method, and improves the classification performance of the model through the depth weighting method to achieve better big data classification effect.

3. Classification and extraction of attribute mixed big data based on Naive Bayes algorithm

3.1 Selection of naive Bayesian model based on data

Bayesian classification models are generally fixed or learned, and classification learning models can be divided into two types, supervised and unsupervised [17]. Two learning classification models are shown in Fig. 1.

Figure 1.

Supervised Bayesian classification learning model.

The Bayesian classification model has two parts, one part is the training data set, which is stored by the joint probability distribution table or conditional probability distribution table; the other part is the graph structure, which consists of a special two-layer structure [18]. This kind of graph structure has two layers. The first layer is class variables, and the second layer is attribute variables. The directed edges between the two layers are all pointed by class variables to attribute variables. Any attribute variables in the second layer come from class variables. A directed edge of the variable. A one-dimensional Bayesian classification model consists of a training dataset and a structure map [19]. Common one-dimensional Bayesian classification models include the NB classification model that assumes that attribute variables are independent of each other, and the Selective Naive Bayes classification model (SNB) that removes irrelevant variables on the basis of the former. is the NB classification model [20]. The essence of the multi-dimensional Bayesian classifier is to use the Bayesian classification method to classify the unclassified data. The schematic diagram of the classification is shown in Fig. 2.

Figure 2.

Schematic diagram of multidimensional Bayesian classifier.

The NB algorithm is a classification method using knowledge of probability and statistics. This method is simple, has high classification accuracy, and is fast. Its main idea is to first calculate the prior probability of each category, and then use Bayes’ theorem to calculate whether each feature belongs to a certain category. The posterior probability of each category is compared, and the category with the largest posterior probability estimate is obtained, which is the final required category [21]. The data classification problem targeted by NB needs to consider the independence between data features, which is a discrete problem, and there are two discrete NB models, namely the Bernoulli Naive Bayes model based on binomial distribution (Bernoulli Naïve Bayes, BNB) and multinomial Naive Bayes model (Multinomial Naive Bayes, MNB) based on multinomial distribution.

The principle of the BNB model is to think that there are two possibilities for an event to occur or not to occur. When the Bernoulli experiment is repeated n times independently, a new binomial distribution will be generated [22]. In the dataset, each feature attribute may or may not appear in the data, so the presence or absence of the attribute in the data can be regarded as a Bernoulli experiment, and all feature attributes are binomial distributions. For a data set $r$ , its vector form is $r=\{{t_{1},t_{2},\cdots,t_{n}}\}$ , $t_{k}\in\{{1,0}\}$ , when $t_{k}=$ 1, it indicates that the attribute appears in the data, otherwise it does not appear, $n$ indicating the scale of the data. For data processing, NB assumes that under the condition that the data category is fixed, the conditional probability calculation of each feature attribute is independent of each other and does not affect each other. Under this condition, $r$ the category of the predicted data is shown in Eq. (1).

$\displaystyle a(b)=\arg\mathop{\max}\limits_{a\in A}p({A_{j}})\prod_{i=1}^{n}(% {t_{k}p({Bt_{k}|{A_{j}}})+({1-t_{k}})({1-p({Bt_{k}|{A_{j}}})})})$ (1)

In Eq. (1), $p({A_{j}})$ is the prior probability and $p({Bt_{k}|{A_{j}}})$ the conditional probability, both of which can be calculated by the approximate estimation method of frequency counting, and the prior probability is calculated as shown in Eq. (2).

$\displaystyle p({A_{j}})=\frac{tf({B,A_{j}})}{\sum\limits_{j=1}^{C}{tf({B,A_{j% }})}}$ (2)

The conditional probability calculation is shown in Eq. (3).

$\displaystyle p({Bt_{k}|{A_{j}}})=\frac{tf({Bt_{k},A_{j}})}{tf({B,A_{j}})}$ (3)

Equations (2) and (3) represent the number $tf({Bt_{k},A_{j}})$ of data $A_{j}$ with attributes $t_{k}$ that appear in the category, which $tf({B,A_{j}})$ is the number of all data in the category. For the situation that $A_{j}$ occurs during calculation $p({Bt_{k}|{A_{j}}})=0$ , Laplace smoothing is used to process it., as in Eq. (4).

$\displaystyle P({Bt_{k}|{A_{j}}})=\frac{\sum{{}_{i=1}^{n}t_{ki}\delta({A_{j},A% })}+1}{\sum{{}_{i=1}^{n}\delta({A_{j},A})+2}}$ (4)

In Eq. (4), $\delta$ represents a binary function such as Eq. (5).

$\displaystyle\delta({x,y})=\left\{{\begin{array}[]{l}1,x=y\\ 0,x\neq y\\ \end{array}}\right.$ (5)

MNB model is to regard the data as an attribute bag model, and the frequency of attributes appearing in the data will affect the judgment of the data category. When MNB calculates the conditional probability, it needs to count the frequency of the attribute [23]. Let the attribute category be $A=\{{A_{1},A_{2},\cdots,A_{j}}\}$ , among them $j=1,2,\cdots,C$ , $B_{i}$ is any piece of data, it contains m attributes $B_{i}=\{{t_{1},t_{2},\cdots,t_{n}}\}$ , and the category of the corresponding maximum posterior probability is the category to which it $B_{i}$ belongs, and the calculation equation of the posterior probability is shown in Eq. (6).

$\displaystyle P({A_{j}|{B_{i}}})=\frac{P({B_{i}|{A_{j}}})P({A_{j}})}{P({B_{i}})}$ (6)

In Eq. (6), $P({A_{j}})$ is the probability of category $A_{j}$ appearance, is $P({B_{i}|{A_{j}}})$ the conditional probability that $P({B_{i}})=P({t_{1},t_{2},\cdots,t_{n}})$ the attribute $B_{i}$ belongs to the category, and $A_{j}$ is the joint probability of all features. The essence of the process of Bayesian classification is to solve $P({B_{i}|{A_{j}}})$ the maximum value. as given training data $P({B_{i}})$ is a constant. The solution process is shown in Eq. (7).

$\displaystyle A_{m}=\mathop{\max}\limits_{C_{j}\in C}P({A_{j}|{B_{i}}})=% \mathop{\max}\limits_{C_{j}\in C}P({C_{j}})({B_{i}|{A_{j}}})$ (7)

In Eq. (7), $A_{m}$ is the final classification result. Due to the conditional independence assumption of the NB algorithm, the equation for the solution process can be simplified as Eq. (8).

$\displaystyle A_{m}=\mathop{\max}\limits_{A_{j}\in A}P({A_{j}})\mathop{\prod}% \limits_{k=1}^{n}P({t_{k}|{A_{j}}})$ (8)

Since the probability obtained by each calculation may have a small value, for such an underflow situation, the logarithm of the decision rule is usually taken, and the discriminant equation is as shown in Eq. (9).

$\displaystyle A_{m}=\mathop{\max}\limits_{A_{j}\in A}\left[{\ln P({A_{j}})+% \sum\limits_{k=1}^{n}{\ln P({t_{k}|{A_{j}}})\times W_{k}\times TF_{t_{k}}}}\right]$ (9)

In Eq. (9), $n$ is the number of feature attributes, which $t_{k}({k=1,2,\cdots,n})$ represents $B_{i}$ the $k$ th feature attribute in the data. The calculation of prior probability and conditional probability is shown in Eq. (10).

$\displaystyle\left\{{\begin{array}[]{l}P({A_{j}})=\frac{\sum_{i=1}^{m}\delta({% A_{i},A_{j}})+1}{m+C}\\ P({t_{k}|{A_{j}}})=\frac{\sum_{i=1}^{m}TF_{it_{k}}\delta({A_{i},A_{j}})+1}{% \sum_{k=1}^{n}\sum_{i=1}^{m}TF_{it_{k}}\delta({A_{i},A_{j}})+n}\\ \end{array}}\right.$ (10)

In Eq. (10), $m$ it represents the total number of training data, the number of $C$ categories, and the category of $A_{i}$ the first $i$ data, which $TF_{it_{k}}$ is the frequency of feature attributes $t_{k}$ appearing $\delta$ in the data, and $B_{i}$ is a binary function.

Since the BNB model can only simply consider the presence or absence of feature attributes in statistical data, and does not consider the frequency of feature attributes, the classification accuracy of the model is not high, and the polynomial NB model has better performance on large data sets. Therefore, the study adopts the MNB model [24]. Data classification is an important research field in machine learning. Under a fixed classification standard, data sets without category labels are divided into set categories according to the characteristic attributes of the data. The basic classification process of data classification is shown in Fig. 3.

Figure 3.

Basic classification process of data classification.

3.2 Naive Bayesian classification model based on feature two-dimensional information gain deeply weighting

Since the performance of the classifier is affected by the feature conditional independence assumption of the NB algorithm, in order to suppress this assumption, the algorithm needs to use feature weighting. The main principle of calculating the weight of feature attributes is to first use the feature attribute selection algorithm to select the feature attributes, and then use the weight calculation method to assign weights to them. The feature attributes that have little effect on classification enable the feature attributes to better distinguish the categories of the data, thereby achieving a higher accuracy rate. In order to overcome the deficiency of this algorithm ignoring the influence of the distribution information of feature attributes within and between classes on the data classification results and improve the accuracy of classification, a NB algorithm weighted by Information Gain based on 2D Characteristic (IGDC) is proposed. The conditional probability calculation equation of is linked to the total number of features. The traditional weighting algorithm only considers the distribution of feature attributes among the class columns, and ignores the frequency of feature attributes in each category of data, which affects the weight [25]. In order to improve the accuracy of the algorithm, a weight calculation function IGDC function is defined. The improvement of data classification by feature attributes increases with the increase of information gain. Therefore, the information gain performance of features in terms of data and categories is combined to understand the degree of improvement of feature categories and data on classification. Find the feature category probability as shown in Eq. (11).

$\displaystyle P({t,A_{j}})=\frac{tf({t,A_{j}})+U}{\sum_{j=1}^{C}tf({t,A_{j}})+% C*U}$ (11)

In Eq. (11), $tf({t,A_{j}})$ represents the frequency of feature attributes $A_{j}$ appearing in the category, and $P({t,A_{j}})$ represents the frequency of the feature attributes appearing in the category $A_{j}$ in the total number of the same feature attributes in the training set. $U$ is the smoothing factor, its role is to suppress the case where the probability is 0, take $U=$ 0.01, $C$ is the number of categories. The calculation of the probability that the number of data that the feature attribute appears in each category appears in the total number of data that the corresponding feature attribute appears in the training set is shown in Eq. (12).

$\displaystyle P({B_{t},A_{j}})=\frac{tf({B_{t},A_{j}})+U}{\sum_{j=1}^{C}tf({B_% {t},A_{j}})+C*U}$ (12)

Equation (12) represents the $tf({B_{i},A_{j}})$ number of data containing characteristic attributes $t$ in the category. $A_{j}$ In the past, the traditional methods for finding information gain of feature data only considered the relationship between features and data, but did not consider the relationship between data and data categories. Therefore, when defining the new equation for finding the information gain of feature data, the relationship between feature and data and the relationship between data and data category are considered together, and the new feature category information gain and feature data information gain are obtained as Eq. (13).

$\displaystyle\left\{{\begin{array}[]{l}\textit{IGC}=\sum\limits_{j=1}^{C}{P({t% ,A_{j}})}*\textit{lbP}({t,A_{j}})-\sum\limits_{j=1}^{C}{P({A_{j}})*\textit{lbP% }({C_{j}})}\\ \textit{IGD}=\sum\limits_{j=1}^{C}{P({B_{i},A_{j}})}*\textit{lbP}({B_{i},A_{j}% })-\sum\limits_{j=1}^{C}{P({A_{j}})*\textit{lbP}({C_{j}})}\\ \end{array}}\right.$ (13)

In Eq. (13), IGC is the feature category information gain, which represents the relationship between the feature and the category, and $P({t,A_{j}})$ represents the probability of finding the feature category. where IGD is the feature data information gain, which represents the relationship between the feature and the data. $P({B_{i},A_{j}})$ represents the probability of finding feature data. The feature category information gain calculated in the equation can reflect the influence of each feature attribute on each category, and the feature data information gain can reflect the influence of each feature attribute on each category of data, and has a certain accuracy. The feature attribute category information gain and the data information gain are combined, and the linear normalization method is used to process it, and the weight is obtained as shown in Eq. (14).

$\displaystyle W_{\textit{IGDC}}=\frac{\textit{IGD}*\textit{IDC}-\min({\textit{% IGD}-\textit{IGC}})}{\max({\textit{IGD}*\textit{IDC}})-\min({\textit{IGD}*% \textit{IGC}})}$ (14)

In Eq. (14), $W_{\textit{IGDC}}$ represents the IGDC obtained after two kinds of information gain, linear normalization can make the data change proportionally without affecting the overall weight adjustment. IGDC can more accurately reflect the weight of features, so IGDC is used as the weight of deeply weighting, the calculation equation of conditional probability of NB algorithm is linked with the total number of features, and the deeply weighting equation is improved, as shown in Eq. (15).

$\displaystyle C_{m}=\mathop{\max}\limits_{A_{j}\in A}P({A_{j}|{B_{i}}})=% \mathop{\max}\limits_{A_{j}\in A}\left[{\ln P({A_{j}})+\sum\limits_{k=1}^{n}{% \ln}P({t_{k}|{A_{j},W_{k}}})\times W_{k}\times TF_{t_{k}}}\right]$ (15)

In Eq. (15), $W_{k}$ represents $t_{k}$ the weight of the $TF_{tk}$ feature $t_{k}$ , the frequency of the feature appearing in, and $P({t_{k}|{A_{j},W_{k}}})$ the $A_{j}$ conditional probability of the deeply-weighted feature. The research will use a new deeply-weighting method such as Eq. (16).

$\displaystyle P({t_{k}|{A_{j},W_{k}}})=\frac{\sum{{}_{i=1}^{m}W_{k}TF_{it_{k}}% \delta({A_{i},A_{j}})+1}}{\sum_{k=1}^{n}\sum_{i=1}^{m}TF_{it_{k}}({W_{k}+1})% \delta({A_{i},A_{j}})+n}$ (16)

Equation (16) is $n$ the number of feature attributes. Since the conditional probability of the NB algorithm represents the ratio of the frequency of features appearing in each category to the total number of features in the category, to balance the equation, the weight needs to be added by one. The weight in the $W_{k}$ equation adopts the IGDC weighting equation, and associates the weight with the conditional probability equation of the NB algorithm for deeply weighting. The deeply-weighted NB model is defined as DWNB, and the IGDC-DWNB model is obtained by combining the IGDC weighting method. The classification process of this model is shown in Fig. 4.

4. Experimental analysis of big data classification and extraction based on Naive Bayesian algorithm

4.1 The index performance of classification models in Chinese databases

In order to explore the classification effect of the attribute hybrid big data classification and extraction method of the Naive Bayesian algorithm, the research will start from the comparison of algorithms and experimental data sets, and analyze the classification effect and index gap of different algorithms in different data set environments, and then measure the performance difference between different algorithms. The research experiment software environment uses Matlab6.5 programming language and Bayesian toolbox, while the hardware equipment uses Pentium PC, with 1 GB memory specification, 2.93 GHz main frequency specification and 120 GB hard disk specification.

During the experiment, the research mainly selected four main databases: Sogou experimental environment database, 20-newsgroup experimental environment database, Fudan experimental environment database and Ruster21578 experimental environment database, and collected the corresponding test data and training data from them. According to the training characteristics of the algorithm, 65% of the data in the database is used as the training set data, and 35% of the data is used as the test set data. The specific conditions of the four test databases are shown in Table 1.

Table 1
Database classification

Kind of database	Sogou experimental environment database	20-newsgroup experimental environment database	Fudan experimental environment database	Ruster21578 experimental environment database
Number of categories	6	6	6	6
Number of document data	1200	1200	1200	900
Number of training sets	780	780	780	585
Number of test sets	420	420	420	315

Figure 4.

Classification process of igdc-dwnb model.

When comparing different algorithms, the research mainly used IGDC-DWNB (IGDC Deeply Weighted Naive Bayesian Classification Model) and DFWNB (TFIDF Deeply Weighted Naive Bayesian Classification Model), IGDCNB (feature two-dimensional information The gain weighted naive Bayes classification model), OFWNB (ordinary weighted naive Bayes classification model) three types of Bayesian weighted models are compared. And analyze the differences in the corresponding accuracy index, recall index F1 index and macro F1 index. Model indicator pairs in the Sogou database environment are shown in Fig. 5.

Table 2

Model comparison in Fudan database environment

Model type	Indicator type	Document data
		Surroundings	Transportation	Educate	Military	Economy	Physical education	Mean
IGDC-DWNB	Accuracy (P)	0.96	0.96	0.98	0.95	0.95	0.9	0.95
	Recall (R)	0.88	0.96	0.97	0.88	0.99	0.98	0.94
	F1 value	0.91	0.95	0.98	0.91	0.97	0.94	0.94
IGDCNB	Accuracy (P)	0.94	0.93	0.93	0.93	0.92	0.94	0.93
	Recall (R)	0.87	0.94	0.96	0.91	0.93	0.96	0.93
	F1 value	0.91	0.94	0.95	0.91	0.93	0.95	0.93
DFWNB	Accuracy (P)	0.88	0.96	0.96	0.91	0.86	0.84	0.90
	Recall (R)	0.87	0.81	0.95	0.83	0.91	0.97	0.89
	F1 value	0.88	0.86	0.95	0.87	0.88	0.91	0.89
OFWNB	Accuracy (P)	0.95	0.87	0.95	0.81	0.81	0.74	0.86
	Recall (R)	0.85	0.85	0.94	0.86	0.86	0.97	0.89
	F1 value	0.88	0.86	0.94	0.84	0.82	0.84	0.86

Figure 5.

Model comparison in Sogou database environment.

As can be seen from Fig. 5, in the Chinese database environment of Sogou database, the average accuracy of the IGDC-DWNB model is 0.89, the average recall rate is 0.90, the average value of F1 is 0.89, and the average value of the indicators is distributed around 0.90. As a comparison, the average precision, recall, and F1 value of IGDCNB are 0.89, 0.89, and 0.89, respectively. Although the average accuracy of the two models is equal to the average value of F1, the average recall rate of the IGDC-DWNB model is 0.90, which is relatively high; at the same time, the average precision rate of the DFWNB model is 0.87, the average recall rate is 0.86, and the average value of F1 is 0.85. The IGDC-DWNB model is lower, in addition, the average accuracy of the OFWNB model is 0.83, the average recall rate is 0.84, and the average value of the F1 value is 0.83. The average distribution of the indicators is about 0.83, and the overall indicator level is at the lowest level among the four models. It can be seen that in the Chinese database environment of Sogou database, the overall performance level of the IGDC-DWNB model is the highest. The model comparison results in the Fudan database environment are shown in Table 2.

As can be seen from Table 2, under the environment of Fudan database, which is also a Chinese database, the average accuracy of the IGDC-DWNB model is 0.95, the average recall rate is 0.94, and the average F1 value is 0.94, which is maintained at around 0.94 as a whole; The average accuracy of the IGDCNB model is 0.93, the average recall rate is 0.93, and the average F1 value is also 0.93. The values of the three indicators are all stable at 0.93. Compared with the IGDC-DWNB model, the overall performance level has an overall The average accuracy of the FWNB model is 0.90, the average recall rate is 0.89, and the average value of the F1 value is also 0.89. The values of the three indicators are all stable at 0.89. Compared with the first two models, the overall performance The average level of the OFWNB model showed a further decline, and the average value of two indicators failed to reach 0.90; the average accuracy of the OFWNB model was 0.86, the average recall rate was 0.89, and the average value of the F1 value was also 0.86. Steady at around 0.87, the overall performance level is the lowest among the four models. It can be seen that in the Fudan database environment, which also belongs to the Chinese database, the IGDC-DWNB model has obvious advantages in the three indicators of accuracy, recall, and F1 value. In the two database environments, Fudan database and Sogou database, the macro F1 value comparison of the four models is shown in Fig. 6.

Figure 6.

Macro F1 values of four models in two database environments.

As can be seen from Fig. 6, in the Fudan database environment, the macro F1 value of the IGDC-DWNB model is relatively higher in the six feature dimensions, and this difference is more significant at feature dimension 500 and feature dimension 600, and the difference between the feature dimension 800 and the feature dimension 1000 is relatively slight. It can be seen that the performance of the IGDC-DWNB model in low dimensions is more obvious. In the Sogou database environment, the macro F1 value of the IGDC-DWNB model in the six feature dimensions is relatively higher, and the more obvious gaps appear at feature dimensions 500, 700, and 800, while the slighter gaps appear at feature dimensions 900 and 1000, it can be seen that the performance of the IGDC-DWNB model in this environment is also more obvious in low dimensions, which shows that this is the characteristic of the Chinese data environment.

4.2 The index performance of classification models in English databases

The model comparison results in the English database environment are shown in Table 3.

Table 3
Model comparison in English database environment

Database type	Indicator comparison
	Model type	Indicator type	Document data
			Alt. atheism	Comp. graphics	Misc. forsale	Rec. autos	Sci. crypt	Talk. politics	Mean
20-newsgroup	IGDC-DWNB	Accuracy (P)	0.98	0.94	0.78	0.95	0.98	0.97	0.93
experimental		Recall (R)	0.97	0.83	0.95	0.93	0.94	0.98	0.93
environment		F1 value	0.98	0.88	0.84	0.88	0.97	0.98	0.92
database	IGDCNB	Accuracy (P)	0.95	0.9	0.81	0.9	0.97	0.91	0.91
		Recall (R)	0.96	0.86	0.92	0.95	0.93	0.98	0.93
		F1 value	0.96	0.88	0.86	0.85	0.95	0.95	0.91
	DFWNB	Accuracy (P)	0.86	0.86	0.86	0.91	0.91	0.81	0.87
		Recall (R)	0.96	0.96	0.86	0.93	0.92	0.96	0.93
		F1 value	0.91	0.82	0.87	0.74	0.91	0.88	0.86
	OFWNB	Accuracy (P)	0.97	0.88	0.71	0.81	0.96	0.93	0.88
		Recall (R)	0.92	0.81	0.91	0.87	0.87	0.94	0.89
		F1 value	0.95	0.84	0.79	0.82	0.91	0.93	0.87
	Model type	Indicator type	Document data
			Crude	Acq	Interest	Wheat	Money	Earn	Mean
Ruster21578	IGDC-DWNB	Accuracy (P)	0.98	0.93	0.65	0.98	0.82	0.93	0.88
experimental		Recall (R)	0.93	0.98	0.88	1	0.51	0.96	0.88
environment		F1 value	0.95	0.95	0.75	0.99	0.62	0.94	0.87
database	IGDCNB	Accuracy (P)	0.91	0.91	0.63	0.95	0.71	0.98	0.85
		Recall (R)	0.95	0.97	0.74	0.99	0.55	0.88	0.85
		F1 value	0.92	0.93	0.68	0.98	0.62	0.93	0.84
	DFWNB	Accuracy (P)	0.94	0.73	0.66	0.97	0.61	0.96	0.81
		Recall (R)	0.81	0.94	0.57	0.93	0.72	0.81	0.80
		F1 value	0.96	0.81	0.61	0.95	0.65	0.87	0.81
	OFWNB	Accuracy (P)	0.97	0.73	0.61	0.94	0.72	0.94	0.82
		Recall (R)	0.81	0.94	0.78	0.97	0.48	0.84	0.80
		F1 value	0.88	0.83	0.68	0.95	0.58	0.88	0.80

As can be seen from Table 3, in the 20-newsgroup experimental database environment, the average accuracy of the IGDC-DWNB model is 0.93, the average recall rate is 0.93, the average F1 value is 0.92, and the overall maintenance is around 0.93; IGDCNB The average accuracy of the model is 0.91, the average recall rate is 0.93, and the average F1 value is also 0.91. The values of the three indicators are all stable at around 0.91. Compared with the IGDC-DWNB model, the overall The performance level is reduced by about 0.02 numerical points; the average precision of the FWNB model is 0.87, the average recall is 0.93, and the average F1 value is 0.86. The numerical differences of the three indicators are relatively large, but Basically maintained at 0.88, compared with the previous two models, the overall performance level showed a further decline, and the average value of two indicators failed to reach 0.90; the average accuracy rate of OFWNB model was 0.88, and the average recall rate was 0.89, the average value of F1 is also 0.87, the values of the three indicators are maintained at around 0.88, and the overall performance level is similar to that of DFWNB. In addition, in the Ruster 21578 experimental database environment, the average accuracy of the IGDC-DWNB model is 0.88, the average recall rate is 0.88, and the average F1 value is 0.87, which is maintained at about 0.87 as a whole; the accuracy of the IGDCNB model is The average value of the recall rate is 0.85, the average recall rate is 0.85, and the average value of the F1 value is also 0.84. The values of the three indicators are all stable at around 0.84. Compared with the IGDC-DWNB model, The overall performance level is reduced by about 0.03 numerical points; the average accuracy of the FWNB model is 0.81, the average recall rate is 0.80, the average F1 value is 0.81, and the values of the three indicators are basically maintained at around 0.80, compared with the first two models, the overall performance level shows a cliff-like decline; the average accuracy of the OFWNB model is 0.82, the average recall rate is 0.80, and the average F1 value is also 0.80. The values of the three indicators remain stable at around 0.81, and the overall performance level is similar to that of DFWNB. It can be seen that the IGDC-DWNB model also reflects its own more excellent model performance in the English database environment, and can perform higher accuracy and more efficient classification in practical applications. In the English database environment, the macro F1 value comparison of the four models is shown in Fig. 7.

Figure 7.

Macro F1 values of four models in two English database environments.

As can be seen from Fig. 6, in the 20-newsgroup environment, the macro F1 value of the IGDC-DWNB model is relatively higher in the six feature dimensions, and this difference is more significant when the feature dimension is 800 and the feature dimension. 1000, while the difference in feature dimension 500, 600, and 700 is relatively slight. It can be seen that the performance of the IGDC-DWNB model in this environment is more obvious in high dimensions. In the Ruster21578 environment, the macro F1 values of the six feature dimensions of the IGDC-DWNB model are relatively higher, and the more obvious gaps appear in the feature dimensions 800, 900 and 1000, while the slighter gaps appear in the features Dimension 500 and 700. It can be seen that the performance of the IGDC-DWNB model in this environment is also more obvious in high dimensions. It can be seen that in the English data environment, the performance of the model in high dimensions is stronger. Compared with other researches, the model designed in this paper also has higher universality and performance stability. Xing and Bei [26] proposed an intelligent classification model for medical and health big data based on improved KNN algorithm. The model uses weight classification for each classification to improve classification efficiency. Research results show that the model has good classification effect in medical data. The classification model designed in this study is not aimed at the subdivision field, but the mixed type big data classification in the universal field. The research results show that the model designed in this study has higher accuracy in the advanced models of the same type, which can ensure the stability of classification performance while maintaining universality. Boyapati et al. [27] designed a big data classification model based on support vector machine, which can play a strong classification ability in both commercial and medical data sets. This research selects the Bayesian algorithm which is more robust than the support vector machine and has more stable classification efficiency as the basis, and improves the design to IGDC-DWNB model. The results show that the model designed in the research is at F1 value. Both accuracy and recall rate have strong advantages. Lakshmanaprabu et al. [28] designed a health big data classifier of the Internet of Things based on a random forest network, studied and improved the random forest algorithm using listening algorithm, and the designed model passed the validation. This research also uses the improved method, and uses the feature two-dimensional information gain weighting method to improve the naive Bayesian algorithm. The improved big data classification model has stronger classification ability and is more stable than the advanced models of the same type. In general, the IGDC-DWNB model studied and designed has stronger classification performance in data classification, and has stronger robustness and stability in practical applications.

5. Conclusion

Accurately extract and classify Internet text data information with high propagation speed and high update speed, the research starts with the depth weighting method, combines the naive Bayesian algorithm with the feature two-dimensional information gain weighting method, and improves the classification performance of the model through the depth weighting method to achieve better big data classification effect. Finally, the model is applied to both Chinese and English. Compare and analyze data sets with different characteristics with other classification models. The research results show that the classification accuracy rates of the IGDC-DWNB model in the Sogou experimental environment database, the 20-newsgroup experimental environment database, the Fudan experimental environment database and the Ruster21578 experimental environment data are 0.89, 0.89, 0.93, and 0.88 respectively, and the recall rate is 0.89, 0.89, 0.93, and 0.88. were 0.90, 0.93, 0.93, and 0.88, respectively, and the F1 values were 0.89, 0.94, 0.92, and 0.87, respectively. Compared with the other three classification models, the model designed in the study was stable and obvious in classification accuracy, recall, and F1 value. This shows that the model designed in the study has stronger classification ability in big data classification, can achieve accurate classification and high-quality classification in different data set environments, and has stronger robustness and stability in practical applications. Compared with other weighting algorithms, the depth weighting method proposed in the study has been greatly improved. However, the research is based on the assumption that the data is complete, and there will inevitably be missing and unavoidable redundant information in the actual application. Therefore, it is necessary to further solve the classification of incomplete data and effectively remove redundant information. In the experiment, only three datasets are used for comparative analysis of the experimental results. In order to prove the effectiveness and feasibility of the model, more datasets should be used for experiments and comparison with other methods.

References

Rghioui

Lloret

Oumnad

. Big data classification and Internet of Things in healthcare. Int J E-Health Med C. 2020; 11(2): 20-37.

Uma

Rao

SUM

Rao

Reddy

. Comprehensive survey of classification, streaming techniques in big data analytics. Des Eng (Toronto). 2021; 2021(5): 643-663.

Nurmalasari

Kusrini

Sudarmawan

. Komparasi algoritma Naive Bayes dan k-nearest neighbor untuk membangun pengetahuan diagnosa penyakit diabetes. Jurnal Komtika (Komputasi dan Informatika). 2021; 5(1): 52-59.

Prahartiwi

Dari

. Komparasi algoritma Naive Bayes, decision tree dan support vector machine untuk prediksi penyakit kanker payudara. Jurnal Teknik Komputer. 2021; 7(1): 51-54.

Isa

. Aplikasi asesmen calon debitur menggunakan Naive Bayes di koperasi mitra sejahtera SMK negeri 1 kota sukabumi. Jurnal Sisfokom (Sistem Informasi dan Komputer). 2021; 10(1): 31-39.

Cao

Xiong

Yang

. Establishment of Naive Bayes classifier-based risk prediction model for chemotherapyinduced nausea and vomiting. J South Med Univ. 2021; 41(4): 607-612.

Resti

Kresnawati

Dewi

Zayanti

. Diagnosis of diabetes mellitus in women of reproductive age using the prediction methods of Naive Bayes, discriminant analysis, and logistic regression. Sci Technol Indones. 2021; 6(2): 96-104.

Brahmane

Krishna

. Rider chaotic biography optimization-driven deep stacked auto-encoder for big data classification using spark architecture: Rider chaotic biography optimization. Int J Web Serv Res. 2021; 18(3): 42-62.

Selvi

Valarmathi

. Optimal feature selection for big data classification: Firefly with lion-assisted model. Big Data. 2020; 8(2): 125-146.

10.

Lakhwani

. Big data classification techniques: A systematic literature review. J Nat Rem. 2020; 21(2): 972-5547.

11.

Yoanita

Setiawan

Irawan

. Analisis fitur-fitur yang mempengaruhi jumlah subscribers youtube menggunakan algoritma Naive Bayes classifier. Smatika Jurnal. 2020; 10(1): 36-40.

12.

Nugroho

Istiadi

Marisa

. Optimasi naive Bayes classifier untuk klasifikasi teks pada e-government menggunakan particle swarm optimization. Jurnal Teknologi dan Sistem Komputer. 2020; 8(1): 21-26.

13.

Muhajir

Chotijah

. Aplikasi berbasis web browser untuk mendiagnosa kerusakan laptop dengan metode Naive Bayes. JIPI (Jurnal Ilmiah Penelitian dan Pembelajaran Informatika). 2020; 5(2): 112-122.

14.

Alfianti

Gunawan

Amin

. Sentiment analysis of cosmetic review using Naive Bayes and Support Vector Machine method based on Particle Swarm Optimization. Jurnal Riset Informatika. 2020; 2(3): 169-178.

15.

Mahmudah

Puspitorini

Siswandari

Wijayanti

. Metode Naive Bayes classifier – Smoothing pada sensor smartphone untuk klasifikasi aktivitas pengendara. Jurnal Nasional Teknik Elektro dan Teknologi Informasi (JNTETI). 2020; 9(3): 268-277.

16.

Rahutomo

Firdausi

Rochmanshah

. Pengembangan sistem analisa keberpihakan media online berdasarkan trend waktu menggunakan Naive Bayes classifier. Jurnal Informatika Polinema. 2020; 6(1): 33-40.

17.

Elik

. The comparison of the model performances of Naive Bayes, C4.5 and C5.0 algorithms: Implementation on fish consumption habits. J Adv Res Appl Math. 2021; 7(1): 17-30.

18.

Yulia

Solecha

. Implementasi particle swarm optimization (PSO) pada analysis sentiment review Aplikasi Trafi menggunakan algoritma Naive Bayes (NB). Jurnal Teknik Komputer. 2021; 7(1): 25-29.

19.

Saraswati

Riminarsih

. Analisis sentimen terhadap pelayanan krl commuterline berdasarkan data twitter menggunakan algortima bernoulli Naive Bayes. Jurnal Ilmiah Informatika Komputer. 2020; 25(3): 225-238.

20.

Alotaibi

Hamdaoui

. Improvement of semi-supervised document classification based on fine tuning Naive Bayesian classifier. Eur J Sci Res. 2021; 158(3): 181-190.

21.

Liu

. Optimization of architectural art teaching model based on Naive Bayesian classification algorithm and fuzzy model. J Intell Fuzzy Syst. 2020; 39(2): 1965-1976.

22.

. E-Commerce data classification in the cloud environment based on Bayesian algorithm. J Intell Fuzzy Syst. 2021; 40(4): 5819-5826.

23.

Wijaya

. The classification of documents in Malay and Indonesian using the naive Bayesian method uses words and phrases as a training set. Mendel. 2020; 26(2): 23-28.

24.

Zhuo

. Gaussian discriminative analysis aided GAN for imbalanced big data augmentation and fault classification. J Process Control. 2020; 92: 271-287.

25.

Long

. Multimodal information gain in Bayesian design of experiments. Computation Stat. 2022; 37(2): 865-885.

26.

Xing

Bei

. Medical health big data classification based on KNN classification algorithm. IEEE Access. 2019; 8: 28808-28819.

27.

Boyapati

Swarna

Dutt

Vyas

. Big data approach for medical data classification: a review study. In: 2020 3rd international conference on intelligent sustainable systems (ICISS). 2020 December 03–05. Thoothukudi, India. New York: IEEE. 2020. pp. 762-766.

28.

Lakshmanaprabu

Shankar

Ilayaraja

Nasir

Vijayakumar

Chilamkurti

. Random forest for big data classification in the internet of things using optimal features. Int J Mach Learn Cyb. 2019; 10(10): 2609-2618.