Abstract
To handle imprecise and uncertain data and knowledge, this paper proposes a novel prior assumption based on rough set theory, which improves the performance of the classical Bayesian classifier. We apply the approximation operations to represent imprecise knowledge accurately, and the concept of approximation quality is applied in this setting for the first time. The paper thus provides a novel rough set theory based prior probability for the classical Bayesian classifier, together with the corresponding rough set prior Bayesian classifier. We chose 18 public datasets to evaluate the proposed model against the classical Bayesian classifier and the Bayesian classifier with the Dirichlet prior assumption. The experimental results verify the effectiveness of the proposed method. The main contributions of the proposed method are as follows: firstly, it provides a novel methodology that combines rough set theory with classical probability theory; secondly, it improves the accuracy of prior assumptions; thirdly, it provides an appropriate prior probability for the classical Bayesian classifier, improving its performance solely by improving the accuracy of the prior assumption and without any effect on the likelihood probability; fourthly, it provides a novel and effective way to deal with imprecise and uncertain data; last but not least, the methodology can be extended and applied to other concepts of classical probability theory, providing a novel methodology for probability theory.
Introduction
Prior probability, also known as the prior, is the probability distribution of an uncertain quantity; it expresses the belief about this quantity before the actual occurrence takes place. The prior plays an important role in many machine learning models such as naive Bayesian classifiers (NBC) [1], spike-and-slab regression [2], Bayesian networks [3], Bayesian Structural Time Series (BSTS) [4], Markov chain Monte Carlo [5], Probabilistic Neural Networks (PNN) [6, 7], etc. However, the prior itself has not yet been studied extensively, and research in this area remains far from sufficient. Therefore, in this paper, we study and propose a novel approach to improve the prior assumption.
Priors can be generated using a number of approaches [8]. Normally, priors are obtained from previous information. They can also be obtained from the subjective evaluation of expert experience. When no usable information is available, an uninformative prior can be set up to reflect a balance among the outcomes. Priors can also be chosen on the basis of some principle, for example, maximum entropy or symmetry under the given constraints [9, 10]. In this paper, we will concentrate on the former one.
Rough set theory is a foundation of knowledge representation and processing. It was first proposed in the early 1980s by Pawlak [11] and is based on the indiscernibility relation. The rough set method is a mathematical technique for analyzing indeterminate and fuzzy data. It is an extension of set theory that is used to study intelligent systems with incomplete and insufficient information. It can therefore serve as a constituent part of hybrid solutions in data mining and machine learning, and it is particularly effective in feature selection and rule induction. The concepts of the lower approximation and the upper approximation in rough set theory can uncover the knowledge hidden in information and express it in the form of decision rules [12, 13]. In rough set theory, each decision rule has two conditional probabilities, known as the coverage factor and the certainty factor, which are closely related to the upper and lower approximations. Research results indicate that these two factors fulfil the Bayesian rule [14]. In Pawlak rough sets, it is assumed that an object classified by means of an equivalence relation is classified correctly with complete certainty [11, 12]. Rough set theory is becoming of fundamental importance to artificial intelligence and cognitive science, in areas such as data mining [15, 16], expert systems [17], knowledge discovery [18–20], attribute reduction [21], pattern recognition [22] and decision analysis [23–26]. It has also been used in many fields such as market analysis, finance, medicine and pharmacology. Therefore, this paper applies rough set theory to explore a more accurate prior probability assumption.
In this paper, we concentrate on the prior probability in naive Bayesian classifiers. The naive Bayesian classifier is an effective classification model that performs quite well in a wide variety of domains. It is fast and simple, and, surprisingly, its prediction accuracy is quite decent compared with other, more complex classification algorithms [27]. Additionally, the prior probability plays a significant role in the Bayesian theorem, which calculates the posterior probability distribution from the prior distribution and the likelihood function. As Berthold and Hand (1999) [28] point out, 'The result of the Bayesian data analysis process is the posterior distribution that represents a revision of the prior distribution in the light of the evidence provided by the data'. Thus, it is meaningful to study the prior assumption of naive Bayesian classifiers.
In this paper, we propose a novel prior probability based on rough set theory, which provides a novel methodology that combines rough set theory with classical probability theory. The accuracy of the prior assumption is improved by this method. We introduce it into the naive Bayesian classifier as an example to analyze the feasibility of our method. The method provides an appropriate prior probability for the classical Bayesian classifier, improving its performance solely by improving the accuracy of the prior assumption and without any effect on the likelihood probability. We conduct experiments on datasets from the UCI machine learning repository [29] to evaluate the performance of our method, and we find that it obtains a more accurate prior assumption. Moreover, real-world data usually has a certain degree of uncertainty, and the proposed method is well suited to dealing with this kind of uncertain data.
The rest of the paper is organized as follows. Section 2 gives a short review of the work related to this paper. In section 3, we describe the definitions and notations used in the paper. Section 4 proposes the rough set theory based prior assumption and combines it with the naive Bayesian classifier. In section 5, we conduct two experiments on various real-world data sets and evaluate the performance of each model. Finally, we conclude the paper in section 6.
Related works
To our knowledge, no existing study applies rough set theory to improve the prior assumption as we do. Our work is clearly related to rough set theory and the Bayesian theorem. There has been a surge of recent work on rough set theory. Although not directly related, there are other studies that combine rough set methods with probabilistic approaches.
Wong et al. and Pawlak et al. first introduced the probabilistic rough set (PRS) model in 1985 and 1988, respectively [30, 31]. They combined rough set theory with probabilistic approaches and defined the probabilistic upper approximation and probabilistic lower approximation using a threshold of 0.5. In 1993, Ziarko presented a variable precision rough set (VPRS) model [32], which applies a pair of thresholds based on the graded set. Pawlak then introduced rough set approaches into Bayesian methods in 1999 and 2002 [33, 34], combining the Bayesian methods with rough set theory and representing some of the Bayesian data analysis using rough sets. Slezak et al. proposed a Bayesian rough set (BRS) model in 2002 [35]. They applied the prior probability as the threshold to define the positive region, negative region and boundary region. In 2004, Greco et al. proposed a study of decision algorithms and decision rules which combined rough set decision rules with Bayesian confirmation measures [36, 37]. Next, in 2005, Slezak presented a rough Bayesian model [38], which used the ratio of the prior probability and the posterior probability as the threshold to define the probabilistic approximation.
In 2010, Yao and Zhou proposed a naive Bayesian decision-theoretic rough set (NBRS) model [39]. It defined the conditional probability of rough set theory based on the naive probabilistic independence assumption and the Bayesian theorem. Then, in 2015, Yao et al. introduced some new results about probabilistic rough sets: the explanation and calculation of the thresholds, the estimation of the conditional probabilities, and the explanation and applications of probabilistic rough set approximations [40]. In 2015, Zhan et al. proposed a theory of rough soft hemirings, which studies roughness in soft hemirings with respect to Pawlak approximation spaces [41]. In 2016, Yao and Zhou introduced two different classes of Bayesian approaches, namely the Bayesian classification rough set and the Bayesian confirmation rough set [42]. In 2017, Liang et al. introduced the intuitionistic fuzzy point operator (IFPO) into decision-theoretic rough sets and explored three-way decisions [43]. In 2018, Liu et al. proposed decision-theoretic rough set approaches to multi-covering approximation spaces based on a fuzzy probability measure [44]. In the same year, Sun et al. introduced a three-way decisions approach to multiple attribute group decision making with a linguistic information-based decision-theoretic rough fuzzy set [45].
However, even though the Bayesian classifier has been known for several decades, to our knowledge, the study put forward in this paper has so far not been explored.
There also exist some studies on the improvement of Bayesian prior probabilities. In 1995, Williams proposed a study on Bayesian regularization and pruning using the Laplace prior [46]. In 2007, Li and Bilmes presented a Bayesian divergence prior for generic classifier adaptation [47]. In 2009, Wong proposed the generalized Dirichlet prior and the Liouville prior for the naive Bayesian classifier [48]. These methods are inspiring work and are good attempts to improve the prior probabilities. However, in improving the prior probabilities, they also make corresponding changes to the likelihood probabilities of the naive Bayesian classifier. In contrast, in this paper we put forward a concise and effective alternative prior probability, which improves the accuracy of the prior probability only, without affecting the likelihood probability.
Preliminaries
In this section, we will introduce some basic concepts of Pawlak rough set theory and the prior probability in naive Bayesian classifiers.
Pawlak rough sets
An information system is the starting point of the Pawlak rough set. Basically, it is a data table whose columns and rows are labelled by attributes and objects of interest, respectively, and whose entries are attribute values [39].
Formally, given a Pawlak approximation space (U, R) on an information system, U is a finite and nonempty set called the universe, while R is an equivalence relation on U. Further, the attributes of an information system can be divided into two disjoint classes: condition attributes and decision attributes. The system S = (U, C, D) is then called a decision table, where C refers to the condition attributes and D refers to the decision attributes. Some conditions have to be satisfied when making a decision according to the decision table.
The disjoint subsets of U induced by R are called equivalence classes. Any two elements in the same equivalence class are indiscernible. The equivalence classes of an equivalence relation are also referred to as granules or elementary sets. An arbitrary set X ⊆ U may not be precisely describable in terms of the equivalence classes of R. Hence, the concepts of the lower and the upper approximation have to be introduced [11].
The operations of approximations are used to represent the imprecise knowledge accurately. They are the basic operations of rough set theory. These two operations on rough sets are defined as follows:
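Given an equivalence relation R on U, every X ⊆ U is assigned two sets $R_*(X)$ and $R^*(X)$, called the lower and the upper approximation of X, respectively [49]:

$$R_*(X) = \bigcup_{x \in U} \{ R(x) : R(x) \subseteq X \} \tag{1}$$

$$R^*(X) = \bigcup_{x \in U} \{ R(x) : R(x) \cap X \neq \emptyset \} \tag{2}$$

where $R(x)$ denotes the equivalence class (R-granule) that contains x.
The lower approximation of a concept is the union of all R-granules that are wholly included in the concept, while the upper approximation is the union of all R-granules that have a non-empty intersection with the concept.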
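The two approximation operations can be illustrated with a minimal Python sketch. The helper names (`granules`, `lower_approximation`, `upper_approximation`) and the toy table are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import defaultdict


def granules(table, attrs):
    """Partition object ids into equivalence classes (R-granules): two objects
    are indiscernible when they agree on every attribute in `attrs`."""
    classes = defaultdict(set)
    for obj_id, row in table.items():
        key = tuple(row[a] for a in attrs)
        classes[key].add(obj_id)
    return list(classes.values())


def lower_approximation(parts, concept):
    """Union of all granules wholly included in the concept."""
    out = set()
    for g in parts:
        if g <= concept:
            out |= g
    return out


def upper_approximation(parts, concept):
    """Union of all granules with a non-empty intersection with the concept."""
    out = set()
    for g in parts:
        if g & concept:
            out |= g
    return out


# Toy information system (illustrative values only).
table = {
    1: {"colour": "red", "size": "small"},
    2: {"colour": "red", "size": "small"},
    3: {"colour": "blue", "size": "small"},
    4: {"colour": "blue", "size": "large"},
    5: {"colour": "blue", "size": "large"},
}
parts = granules(table, ["colour", "size"])  # granules {1, 2}, {3}, {4, 5}
X = {1, 2, 4}
lower = lower_approximation(parts, X)
upper = upper_approximation(parts, X)
print(sorted(lower), sorted(upper))
```

In this toy table, object 4 cannot be told apart from object 5, so the granule {4, 5} enters the upper approximation of X but not the lower one.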
The boundary region of X is defined as follows:

$$BN_R(X) = R^*(X) - R_*(X) \tag{3}$$

The relations among the upper approximation $R^*(X)$, the lower approximation $R_*(X)$, the boundary region $BN_R(X)$ and the universe U are shown in Fig. 1:

Fig. 1. Relations in the rough set theory.
As can be seen from Fig. 1, the whole set represents the universe U, and the disjoint cells in U are the equivalence classes into which U is partitioned by R. The inside of the blue circle is the real region of the set X. The inside of the outer green circle is the upper approximation $R^*(X)$, which is the union of all cells that have a non-empty intersection with the blue circle. The inside of the inner green circle is the lower approximation $R_*(X)$, which is the union of all cells that are wholly included in the blue circle. The range between the two green circles is the boundary region $BN_R(X)$, which is the union of all cells that lie neither wholly inside nor wholly outside the set X.
Because of the boundary region, some cells can be classified neither into a subset of U nor into its complement; these cells are assigned to the boundary region. The indeterminacy of a set is caused by the existence of its boundary region: the larger the boundary region of a set, the lower its accuracy. To be more precise, the concept of approximation quality is introduced as follows:
The approximation quality $d_R(X)$ reflects how completely we understand the knowledge of the set X. Clearly, for any R and X ⊆ U, $0 \le d_R(X) \le 1$. If $d_R(X) = 1$, which means $BN_R(X) = \emptyset$, the set X is R-definable; on the contrary, if $d_R(X) < 1$, which means $BN_R(X) \neq \emptyset$, the set X is R-undefinable.
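The behaviour of the boundary region and of $d_R(X)$ can be sketched as follows. This is only an illustration under a stated assumption: the sketch takes $d_R(X)$ to be the ratio of the sizes of the lower and upper approximations (Pawlak's accuracy of approximation), which satisfies the properties listed above but may differ from the exact definition used in the paper. The function names are hypothetical.

```python
def approximations(parts, concept):
    """Return (lower, upper) approximations of `concept` for granules `parts`."""
    lower, upper = set(), set()
    for g in parts:
        if g <= concept:
            lower |= g
        if g & concept:
            upper |= g
    return lower, upper


def approximation_quality(parts, concept):
    """ASSUMPTION: d_R(X) = |lower| / |upper|; the paper's own formula
    may be defined differently."""
    lower, upper = approximations(parts, concept)
    if not upper:  # empty concept: nothing to approximate, treat as definable
        return 1.0
    return len(lower) / len(upper)


parts = [{1, 2}, {3}, {4, 5}]  # granules of a toy universe
rough = {1, 2, 4}              # cuts through the granule {4, 5}
exact = {1, 2, 3}              # a union of whole granules

lower, upper = approximations(parts, rough)
boundary = upper - lower
d_rough = approximation_quality(parts, rough)
d_exact = approximation_quality(parts, exact)
print(sorted(boundary), d_rough, d_exact)
```

Under this assumption the value is 1 exactly when the boundary region is empty, and it decreases as the boundary region grows relative to the upper approximation.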
The naive Bayesian classifier model is based on the Bayesian rule and works as follows. Assume the input space $\mathcal{X} \subseteq \mathbb{R}^n$ is the set of n-dimensional vectors, and the output space is the set of class labels $\mathcal{Y} = \{c_1, c_2, \cdots, c_K\}$. Feature vectors $x \in \mathcal{X}$ are the inputs whereas class labels $y \in \mathcal{Y}$ are the outputs. X and Y are the random variables defined on the input space $\mathcal{X}$ and the output space $\mathcal{Y}$, respectively. $P(X, Y)$ is the joint probability distribution of X and Y. The training dataset $T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}$ is generated independently and identically distributed (IID) from $P(X, Y)$.
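The prior probability distribution in the naive Bayesian classifier is simply generated by computing the class probabilities:

$$P(Y = c_k) = \frac{1}{N} \sum_{i=1}^{N} I(y_i = c_k), \quad k = 1, 2, \cdots, K \tag{5}$$

where $I(\cdot)$ is the indicator function.
By the Bayesian theorem, the posterior probability of the naive Bayesian classifier is a conditional probability. The naive Bayesian classifier calculates the posterior probability $P(Y = c_k \mid X = x)$ for all classes $Y = c_k$ and chooses the class with the largest posterior probability as the predicted class of $X = x$. The posterior probability can be rewritten as:

$$P(Y = c_k \mid X = x) = \frac{P(X = x \mid Y = c_k)\, P(Y = c_k)}{\sum_{k} P(X = x \mid Y = c_k)\, P(Y = c_k)} \tag{6}$$

Thus, the naive Bayesian classifier can be expressed as:

$$y = f(x) = \arg\max_{c_k} \frac{P(Y = c_k)\, P(X = x \mid Y = c_k)}{\sum_{k} P(Y = c_k)\, P(X = x \mid Y = c_k)} \tag{7}$$

Further, the denominator in Eq. (7) can be ignored, since it is the same for all classes and does not affect the final result of the classifier. Thus the naive Bayesian classifier can be simplified as:

$$y = \arg\max_{c_k} P(Y = c_k)\, P(X = x \mid Y = c_k) \tag{8}$$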
Generally, it is difficult or unreliable to estimate $P(X^{(j)} = x^{(j)} \mid Y = c_k)$ for all possible $X^{(j)} = x^{(j)}$ and $Y = c_k$ from the available data unless the data set is very large.
The conditional independence assumption of the naive Bayesian model assumes that the attributes are independent of each other given the class. This assumption greatly improves computational efficiency. Though it appears to be unreasonable, many studies have found that it is not as impractical as most people imagine [40, 51]. The critical factor for the practicability of this assumption is that prediction is evaluated by the "zero-one loss". No matter how a prediction is generated, if it is correct, it attains the minimum "zero-one loss". Therefore, the naive Bayesian classification model can usually achieve a decent result even though few real datasets truly fulfil the conditional independence assumption, and it may even be superior to other classification models.
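The classification rule described above can be sketched for categorical attributes as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy weather-style data are hypothetical, the likelihoods are plain relative frequencies without any smoothing, and the per-attribute product relies on the conditional independence assumption.

```python
from collections import Counter, defaultdict


def fit(rows, labels):
    """Estimate class priors and per-attribute conditional frequencies."""
    n = len(labels)
    class_counts = Counter(labels)
    prior = {c: class_counts[c] / n for c in class_counts}
    # cond[c][j][v] = number of samples of class c whose j-th attribute is v
    cond = defaultdict(lambda: defaultdict(Counter))
    for x, y in zip(rows, labels):
        for j, v in enumerate(x):
            cond[y][j][v] += 1

    def likelihood(c, j, v):
        return cond[c][j][v] / class_counts[c]

    return prior, likelihood


def scores(x, prior, likelihood):
    """Unnormalised score prior * product of likelihoods for each class."""
    out = {}
    for c, p in prior.items():
        s = p
        for j, v in enumerate(x):
            s *= likelihood(c, j, v)
        out[c] = s
    return out


rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot"),
        ("sunny", "hot"), ("rainy", "mild"), ("sunny", "mild"), ("rainy", "hot")]
labels = ["no", "no", "yes", "yes", "no", "yes", "yes", "no"]

prior, likelihood = fit(rows, labels)
unnormalised = scores(("sunny", "hot"), prior, likelihood)
total = sum(unnormalised.values())
posterior = {c: s / total for c, s in unnormalised.items()}

predicted = max(unnormalised, key=unnormalised.get)
print(predicted, posterior)
```

Dividing every score by the same total changes the values but not their ordering, which is why the class chosen from `unnormalised` is the same as the class chosen from `posterior`.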
In this section, we introduce our rough set theory based prior assumption and incorporate it into the naive Bayesian classifier. The rough set theory based prior assumption is defined as follows:

$$P_R(Y = c_k) = P(R_*(Y = c_k)) + d_R(Y = c_k)\, P(BN_R(Y = c_k)) \tag{10}$$
The probability of the elements in the lower approximation $R_*(Y = c_k)$ is the probability of the union of all R-granules that are included in the set. The probability of the elements in the boundary region $BN_R(Y = c_k)$ is the probability of the union of all R-granules that are not included in the set but have a non-empty intersection with it, which means that the elements in this part belong to the set only with a certain possibility. We therefore bring in the approximation quality $d_R(Y = c_k)$ to weight that possibility. As mentioned above, $d_R(Y = c_k)$ represents how completely we understand the knowledge of the set, so here it reflects the certainty with which the elements in $BN_R(Y = c_k)$ belong to the set. The sum of the above two parts is the prior probability we obtain.
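The two-part prior described above can be sketched in Python. Several points are assumptions made only for this illustration and are not confirmed by the paper: the granules are induced by equality on all condition attributes, probabilities are relative frequencies over the training set, and $d_R$ is taken as the ratio of the sizes of the lower and upper approximations. The function name `rough_set_prior` is hypothetical.

```python
from collections import defaultdict


def rough_set_prior(rows, labels):
    """Prior per class as P(lower) + d * P(boundary).

    ASSUMPTIONS: granules are groups of identical condition-attribute rows,
    probabilities are relative frequencies, and d = |lower| / |upper|.
    """
    n = len(labels)
    groups = defaultdict(set)
    for i, row in enumerate(rows):
        groups[tuple(row)].add(i)
    parts = list(groups.values())

    priors = {}
    for c in sorted(set(labels)):
        concept = {i for i, y in enumerate(labels) if y == c}
        lower, upper = set(), set()
        for g in parts:
            if g <= concept:
                lower |= g
            if g & concept:
                upper |= g
        boundary = upper - lower
        d = len(lower) / len(upper)  # upper is non-empty because concept is
        priors[c] = len(lower) / n + d * len(boundary) / n
    return priors


# Rough case: the granule ("b",) contains both a "yes" and a "no" object.
rows = [("a",), ("a",), ("a",), ("b",), ("b",), ("c",)]
labels = ["yes", "yes", "yes", "yes", "no", "no"]
priors = rough_set_prior(rows, labels)

# Exact case: every granule lies inside a single class.
exact_priors = rough_set_prior([("a",), ("a",), ("b",)], ["yes", "yes", "no"])
print(priors, exact_priors)
```

In the exact case the boundary region is empty, so the value coincides with the plain class frequency. In the rough case the values generally do not sum to one; this does not matter for an arg max over classes, but the values would need normalising if proper probabilities were required.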
According to Eq. (4), obviously, we have:
According to Eq. (7), the rough set prior Bayesian classifier can be expressed as:
Further, according to Eq. (9) and Eq. (11), the rough set prior Bayesian classifier is defined as follows:
If the boundary region is the empty set, the lower approximation is equal to the upper approximation. In this case, the data set can be regarded as an exact set, and the classification result of the rough set prior Bayesian classifier is exactly the same as that of the naive Bayesian classifier. That is:
We chose 18 public datasets from the UCI machine learning repository [29] to evaluate how our prior assumption improves the performance of the naive Bayesian classifier. Detailed information on these datasets is listed in Table 1.
Table 1. The information of datasets.
Three important problems need to be addressed: first, the pre-processing of numeric attributes; second, the handling of missing values; third, the management of zero counts. We discuss each of them in turn.
Firstly, we discretized the numeric attributes into several classes, applying a Canopy-K-means clustering algorithm [52] to discretize them dynamically. Next, we simply ignored missing values in the training and testing phases. Finally, the zero-count problem arises when a given class and attribute value never occur together in the training set. We applied the Laplace correction to handle this issue [53].
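The handling of missing values and zero counts can be sketched as below. This is a hedged illustration rather than the authors' code: the function name `laplace_likelihood`, the use of `None` to mark a missing value, and the smoothing constant `lam = 1` are assumptions made for the example.

```python
def laplace_likelihood(values, labels, value, cls, n_values, lam=1.0):
    """Estimate P(attribute = value | class = cls) with additive smoothing.

    `values` holds one attribute column; entries equal to None are treated as
    missing and skipped (an assumed encoding). `n_values` is the number of
    distinct values the attribute can take.
    """
    in_class = [v for v, y in zip(values, labels) if y == cls and v is not None]
    return (in_class.count(value) + lam) / (len(in_class) + lam * n_values)


values = ["a", "a", "b", None, "a"]
labels = ["yes", "yes", "yes", "yes", "no"]

p_a = laplace_likelihood(values, labels, "a", "yes", n_values=3)
p_b = laplace_likelihood(values, labels, "b", "yes", n_values=3)
p_c = laplace_likelihood(values, labels, "c", "yes", n_values=3)  # never seen with "yes"
print(p_a, p_b, p_c)
```

The value "c" never occurs together with the class "yes", yet its estimate stays strictly positive, so a single unseen combination no longer forces the whole product of likelihoods to zero. The smoothed estimates over the three possible values still sum to one.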
Experiments with the 10-fold cross validation method were carried out to evaluate the performance of the Bayesian classification model with the proposed prior assumption and to compare the results with the naive Bayesian classification model as our benchmark. We applied some common metrics to evaluate the performance of our method: % classification accuracy (Acc), % precision (Pr), % recall and % F1-score.
Accuracy: accuracy is the most common evaluation metric. It is measured by the proportion of correctly classified samples.
Precision: precision refers to the proportion of the samples judged as positive by the model that are real positive samples. Let the positive sample set output by the model be A and the real positive sample set be B; then we have:

$$\mathrm{Precision} = \frac{|A \cap B|}{|A|}$$
Recall: recall refers to the proportion of the real positive samples that are judged as positive samples by the model. With A and B as above:

$$\mathrm{Recall} = \frac{|A \cap B|}{|B|}$$
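F1-score: sometimes precision and recall need to be weighed against each other, and the $F_\beta$-score provides a weighted harmonic mean of the two:

$$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$$

The F1-score is the special case $\beta = 1$.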
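The four metrics can be computed with a short Python sketch. The function names and the toy index sets are hypothetical; A denotes the set of samples the model outputs as positive and B the set of real positive samples, as in the definitions above.

```python
def accuracy(y_true, y_pred):
    """Proportion of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall_fbeta(predicted_positive, real_positive, beta=1.0):
    """Precision, recall and F-beta from the sets A (predicted) and B (real)."""
    a, b = set(predicted_positive), set(real_positive)
    hits = len(a & b)
    precision = hits / len(a) if a else 0.0
    recall = hits / len(b) if b else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


A = {1, 2, 3, 4}     # indices the model labels positive
B = {2, 3, 4, 5, 6}  # indices that are truly positive
precision, recall, f1 = precision_recall_fbeta(A, B)
acc = accuracy(["p", "n", "p", "n"], ["p", "n", "n", "n"])
print(precision, recall, f1, acc)
```

With `beta=1.0` the last value is the ordinary harmonic mean of precision and recall; a larger `beta` gives more weight to recall.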
The comparison between the rough set prior Bayesian classifier (RSBC) and the naive Bayesian classifier (NBC) on the testing metrics under 10-fold cross validation is listed in Table 2. The higher values are marked in bold.
Table 2. Experimental result 1.
It is clear from the experimental results that the average classification performance of the proposed model is better than that of the naive Bayesian classifier under 10-fold cross validation. As Table 2 shows, the proposed rough set based prior assumption clearly improves the classification performance of the classical naive Bayesian classifier.
As can be seen from Table 2, almost all the evaluation metrics of RSBC on the 18 datasets are better than those of NBC. For the car dataset, the recall of RSBC is lower than that of NBC, but the values of the other three metrics are significantly higher than those of NBC; in particular, the precision and F1-score of RSBC exceed those of NBC by 26.12% and 39.36%, respectively. For the tictactoe dataset, only the precision of RSBC is slightly lower than that of NBC, by about 1.41%. Overall, the rough set prior Bayesian classifier performs better than the naive Bayesian classifier.
Next, we chose the continuous datasets from Table 1 (No. 1-12) to compare the performance of our rough set prior assumption with that of the Dirichlet prior assumption on Bayesian classifiers [48]. Here, we used the accuracy (Acc) to compare the performance of these two prior assumptions. The higher values are marked in bold.
According to Table 3, the proposed rough set prior assumption shows an obvious advantage over the Dirichlet prior assumption.
Therefore, these two experimental results indicate that the proposed method can efficiently increase the precision of the prior assumption.
Table 3. Experimental result 2.
A novel prior assumption is proposed in this paper, together with a new rough set theory based prior probability for the classical Bayesian classifier and the corresponding rough set prior Bayesian classifier. We combine rough set theory with classical probability theory and apply the concept of approximation quality from rough set theory in this setting for the first time. The method does not require any prior information other than the original dataset, so it is relatively objective in describing and dealing with uncertainty. The performance of the classical Bayesian classifier is improved through our study, and the proposed method improves the accuracy of the prior probability only, without any effect on the likelihood probability. Experimental results on 18 UCI public datasets verify that the proposed method can be an appropriate prior for the classical Bayesian classifier and indicate that it efficiently increases the precision of the prior assumption. The theoretical analysis and experimental verification show that the proposed method is of great significance in dealing with imprecise and uncertain data. Further, this methodology can be extended and applied to other concepts of probability theory, providing a novel methodology to improve the accuracy of probability theory.
