Abstract
Classical rough set theory is based on the conventional indiscernibility relation. It is not suitable for analyzing incomplete information. Some successful extended rough set models based on different non-equivalence relations have been proposed. The data-driven valued tolerance relation is such a non-equivalence relation. However, when predicting the unknown attribute value of an object, it regards the frequency of an attribute value approximately as the probability of appearance of this value, without considering the effects of other known attribute values of this object on predicting the unknown attribute value. In this paper, considering both the frequency of the known attribute values and the influence weight to predict the unknown attribute values. Modified data-driven valued tolerance relation (MDVT) is defined. On this basis, an extended rough set model based on modified data-driven valued tolerance relation is proposed. Some properties of the new model are analyzed. Experimental results show that the MDVT can get better classification results than other generalized indiscernibility relations.
Introduction
Classical rough set theory developed by Pawlak in 1982 [1], as a powerful soft computing tool to handle imprecise, uncertain, and vague information, has been extensively and successfully applied in the field of machine learning, data mining and decision analysis [2–8]. The classical rough set theory based on the conventional indiscernibility relation, is not suitable for analyzing incomplete information systems (IIS), where some attribute values are unknown or missing. But in practice, because of the errors of data measuring, the limitation of acquiring data, some human factors, etc., IIS often occur in knowledge acquisition. The semantics of missing attribute values in IIS has been analyzed by many researchers. Generally, there are four kinds of missing attribute values: "lost values" (the values that were recorded but currently are unavailable), "do not care" conditions (the original values were irrelevant), restricted "do not care" conditions (similar to ordinary "do not care" conditions but interpreted differently, these missing attribute values may occur when in the same data set where there are "lost values" and "do not care" conditions), and "attribute-concept values" (these missing attribute values may be replaced by any attribute value limited to the same concept) [9–15]. Therefore, many researchers have paid attention to IIS in recent years and endeavored to find out solutions. There are usually two methods in rough set theory to deal with IIS: data reparation [16–20] and model extension [21–27]. Compared with data reparation, model extension does not change the original information of IIS. At present, various extensions of classical rough set model have been proposed, in which the indiscernibility relation was extended to some non-equivalence relations (or called generalized indiscernibility relations) such as tolerance relation [9], non-symmetric similarity relation [10], limited tolerance relation [11], valued tolerance relation [12] and characteristic relation [13] etc. The generalized indiscernibility relation is a basic and core concept of rough set in incomplete information system. Attribute reduction and rule extraction are both based on the generalized indiscernibility relation. The classification performance of the generalized indiscernibility relation will affect the results of attribute reduction and rule extraction. However, these existing generalized indiscernibility relations have their own limitations, so proposing a new generalized indiscernibility relation, which can get better classification results, is essential and important.
The tolerance relation is too relaxed. In the tolerance relation, two objects that do not have any known same attribute values may be considered as indiscernible, and classified in the same class. The non-symmetric similarity relation is too strict. In the non symmetric similarity relation, two objects having a lot of known same attribute values may be separated, and classified in the different classes. In the limited tolerance relation, two objects may be considered as similar when they have only one known same attribute value. This condition is too lenient for an IIS with a large number of attributes. The characteristic relation, which is a generalization of both tolerance relation and non-symmetric similarity relation, cannot avoid their limitations. In the valued tolerance relation, we must know the prior probability distribution of the IIS in advance. Unfortunately, this is very difficult for a new IIS with some missing attribute values. So, it is usually supposed that there exists a uniform distribution among the attribute values. Obviously, this is too subjective and unreasonable. In addition, the threshold of tolerance degree is usually selected based on the prior domain knowledge, and it is also difficult to select a suitable threshold for different IIS.
Based on the idea of data-driven data mining [28], data-driven valued tolerance relation was proposed by Wang [29]. The frequency of an attribute value is approximately regarded as the probability of appearance of this value, and then, an approximate probability distribution among the attribute values can be derived. This calculation method of tolerance degree does not require any prior knowledge except the data set. An auto-selection algorithm of tolerance degree threshold is proposed, which does not require any prior domain knowledge except the data set. The data-driven valued tolerance relation can resolve the problems of the valued tolerance relation. However, when predicting probability of the unknown attribute value of an object, it regards the frequency of an attribute value approximately as the probability of appearance of this value, without considering the effects of other known attribute values of this object on predicting unknown attribute value. In order to solve the defect of data-driven valued tolerance relation, In this paper, considering both the frequency of the known attribute values and the influence weight to predict the unknown attribute values. Modified data-driven valued tolerance relation (MDVT) is defined, which can make full use of the prior knowledge of information system. On this basis, an extended rough set model based on modified data-driven valued tolerance relation is proposed. Some properties of the new model are analyzed. Experimental results show that the MDVT can get better classification results than other generalized indiscernibility relations.
The rest of the paper is organized as follows. In Section 2, we review several kinds of typical generalized indiscernibility relations as our foundations for the subsequent discussion. In Section 3, modified data-driven valued tolerance relation is defined. In Section 4, extended rough set model based on modified data-driven valued tolerance relation is proposed. Some properties of the new model are analyzed. Some simulation experimental results are discussed in section 5. At last, in section 6, we conclude this paper.
Typical generalized indiscernibility relations
In this section, we review relevant definitions of tolerance relation, non-symmetric similarity relation, limited tolerance relation, and data-driven valued tolerance relation as our foundations for the subsequent discussion.
Formally, an information system (IS) is a quadruple (U, AT, V, f), where U is a non-empty finite set of objects, called the universe; AT is a non-empty finite set of attributes; V is the union of attribute domains V = ∪ a∈ATV a , V a where is the value set of attribute a; f : U × AT → V is an information function which assigns particular values from domains of attribute to objects, such as ∀a ∈ AT, x ∈ U, f (x, a) ∈ V a , where f (x, a) denotes the value of attribute for object. If some attribute values are missing, the IS is an incomplete information system (IIS), otherwise it is a complete information system (CIS). In an IIS, the missing attribute value is denoted by "*".
Tolerance relation
where B is a subset of attribute set AT. Clearly T
B
is a reflexive and symmetric relation, but not necessarily transitive. The tolerance class for any object x ∈ U is defined as:
where B is a subset of attribute set AT. Clearly S
B
is a reflexive and transitive relation, but not necessarily symmetric. Two similarity classes for any object x ∈ U are defined as:
Obviously predecessor set S
B
(x) of x is the set of objects similar to x, and successor set
where Q B (x) ={ b ∈ B|b (x) ≠ * }.
Obviously, L
B
is a reflexive and symmetric relation, but not necessarily transitive. The limited tolerance class for any object x ∈ U is defined as:
In [28], Wang indicated that the frequency of an attribute value can be approximately regarded as the probability of appearance of this value, and then, an approximate probability distribution among the attribute values can be derived. This calculation method does not require any prior knowledge except the data set, and is objective.
On this basis, the probability that two objects are similar with respect to attribute set B is calculated as:
where I U ={ (x, x) |x ∈ U } is the identity relation on U, tolerance degree P B (x, y) is defined by Eq. (9), λ is the threshold of tolerance degree.
The data-driven valued tolerance class for any object x ∈ U is defined as:
It is clear that
An IIS (U, AT, V, f)
According to Definition 4, the probability of b (x) = b
i
is
The above calculation concerns only the known attribute values on the attribute a4, ignoring the effects of the other known attribute values of all object. Although the frequency of appearance of 1 and 2 on attribute a4 are equal to 1/2, however
For object x5, [a1 (x5) , a2 (x5) , a3 (x5)] = [1, 1, 1], which is same as object x1 and x2. Obviously, when predict the unknown attribute value a4 (x5), from the perspective of the known attribute value similarity, a4 (x1) and a4 (x2) should have the greater influence on a4 (x5) than a4 (x3) and a4 (x4), That is to say, the influence weight of a4 (x1) and a4 (x2) is greater than that of a4 (x3) and a4 (x4). Due to a4 (x1) = a4 (x2) = 1 and a4 (x3) = a4 (x4) = 2, the probability of a4 (x5) = 1 should greater than that of a4 (x5) = 2. Similarly, the probability of a4 (x6) = 2 should greater than that of a4 (x6) = 1.
In this section, to solve the limitations of the data-driven valued tolerance relation, considering both the frequency of the known attribute values and the influence weight to predict the unknown attribute values, and then, a modified data-driven valued tolerance relation is defined.
where
Then the probability that two objects are possibly same with respect to attribute set B is calculated as:
where I U ={ (x, x) |x ∈ U } is the identity relation on U, λ is the threshold of tolerance degree. In this paper, λ > 0 is calculated as follows.
Consider IIS (U, AT, V, f) and tolerance relation T B with B ⊆ AT. For any x ∈ U, tolerance class T B (x) can be calculated. Based on modified data-driven valued tolerance relation, according to the calculation method of tolerance degree, for any y in T B (x), tolerance degree P B (x, y) can be calculated. Therefore, the threshold in the modified data-driven valued tolerance relation can be calculated as [28]:
Obviously,
The modified data-driven valued tolerance class for any object x ∈ U is defined as:
From Table 1 we can see that a4 (x5) =*, a4 (x6) =*, set
First, we calculate the frequency of appearance of 1 and 2 on the attribute a4. According to Definition 6, we have
Second, we calculate the influence weight of 1 and 2 on a4 (x5), a4 (x6). According to Definition 7, we have
Finally, we calculate the probability of a4 (x6) = 1, a4 (x6) = 2, a4 (x7) = 1, a4 (x7) = 2. According to Definition 8, we have
From Table 1, According to the analysis of Example 1, the probability of a4 (x5) = 1 should greater than that of a4 (x5) = 2, whereas the probability of a4 (x6) = 2 should greater than that of a4 (x6) = 1. From our calculation results, we can see that P (a4 (x5) = 1) > P (a4 (x5) = 2), P (a4 (x6) = 1) < P (a4 (x6) = 2). which is consistent with the truth.
By comparing the tolerance relation, non-symmetric similarity relation, limited tolerance relation, and data-driven valued tolerance relation, we can get the following Properties.
The proof of Property 6 is shown in the following Example 3.
An IIS (U, AT, V, f)
According to Eq. (17), we have
So there are P
B
(x1, x2) < P
C
(x1, x2) < P
A
(x1, x2), therefore, ∀ (x, y) ∈ U × U, if A ⊆ B, based on MDVT, the relationship between P
A
(x, y) and P
B
(x, y) is uncertain. By the definition
Property 6 shows that the modified data-driven valued tolerance relation and its tolerance class are not necessarily monotonically decreasing with the growth of attribute set. Furthermore, it is easy to get the following Property 7 from Property 6.
Property 7 shows that in the modified data-driven valued tolerance relation, with the increase of attribute set, the lower approximation is not necessarily monotonically increasing, and the upper approximation is not necessarily monotonically decreasing. These properties are not consistent with the corresponding properties based on the indiscernibility relation in CIS.
As MDVT is an reflexive and symmetric generalized indiscernibility relation, in order to demonstrate the effectiveness of MDVT proposed in this paper, we compare its classification accuracy with that of other three typical reflexive and symmetric generalized indiscernibility relations, which are tolerance relation, limited tolerance relation and data-driven valued tolerance relation. The detailed introductions of these typical relations are shown in Section 2. As non-symmetric similarity relation is a reflexive and transitive relation, but not necessarily symmetric, we do not compare classification accuracy of MDVT with that of non-symmetric similarity relation.
The classification accuracy of the generalized indiscernibility relation is defined as follows [29, 30]. Firstly, for a given CIS S1 = (U, AT, V, f), let E ={ (x, y) ∈ U × U| ∀ b ∈ AT, b (x) = b (y) } be an equivalence relation on U, and [x]
E
be the equivalence class of x in U, then the classification U/E ={ [x]
E
|x ∈ U } can be got. Secondly, CIS S1 is modified by introducing some percentage of randomly chosen missing attribute values, and then IIS S2 = (U°, AT, V, f) is derived. Let R be a generalized indiscernibility relation on U°, R (x) be the generalized indiscernibility class of x in U, then the classification U/R ={ R (x) |x ∈ U } can be derived. Finally, the classification accuracy can be defined as:
In our experiments, six complete data sets from UCI Machine Learning Repository [31] are used as shown in Table 3, which are Zoo, Monk, Spect, Hayes, TicTacToe and Chess. In Table 3, the first column is the name of data sets, the second column is the number of condition attributes and the third column is the number of decision attributes. All these six data sets are complete. Each complete data set is modified by introducing 10%, 30% and 50% randomly chosen missing attribute values, and then 18 incomplete data sets (X-a% represents data set X missing a% data) are derived. For each incomplete data set, we run the experiments 100 times and get the average result, which is used as the final classification accuracy. The experimental results are shown in Table 4, where T, L, DVT and MDVT denote tolerance relation, limited tolerance relation, data-driven valued tolerance relation and modified data-driven valued tolerance relation, respectively.
Figure 1 plots the classification accuracy from Table 4. In these plots, the X-axis represents the percentage of randomly chosen missing attribute values, and the Y-axis represents the classification accuracy. From Table 4 and Fig.1, we can see that the classification accuracy of MDVT is higher than that of T, L on all test data sets. The classification accuracy of MDVT is generally higher than that of DVT on all test data sets except the Monk-50% and Hayes-50%. Note that with the increasing of missing attribute values, the classification accuracy deteriorate for all test data sets. However, the classification accuracies of T and L decrease quickly with the increasing of the incomplete degree of data set, whereas the classification accuracies of DVT and MDVT is relatively stable. The experimental results show that the modified data-driven valued tolerance relation can get better classification accuracy than that of other generalized indiscernibility relations.

Classification accuracies for the data sets used in the experiment.
Four complete data sets in UCI
Six complete data sets in UCI
Incomplete information is one of the most difficult problems of data mining. In order to deal with incomplete information, various extended models of classical rough set theory have been proposed. The datadriven valued tolerance relation does not require any prior knowledge except the data set, which can solve the problems of the valued tolerance relation. However, when predicting probability of the unknown attribute value of an object, it regards the frequency of an attribute value approximately as the probability of appearance of this value, without considering the effects of other known attribute values of this object on predicting probability of unknown attribute value. In this paper, to solve the limitations of data-driven valued tolerance relation, modified data-driven valued tolerance relation is defined, which not only considering the frequency of an attribute value, but also considering the effects of other known attribute values of an object on predicting probability of unknown attribute value of this object. On this basis, the extended rough set model based on modified data-driven valued tolerance relation is proposed. Some properties of the new model are analyzed. Experimental results show that the modified data-driven valued tolerance relation can get better classification accuracy than that of other generalized indiscernibility relations. In the future, we will investigate the attribute reduction and rule extraction based on the new generalized indiscernibility relation. Its effectiveness in dealing with incomplete information system will be further inspected.
Footnotes
Acknowledgments
The authors would like to thank the editors and the anonymous reviewers for their valuable comments and suggestions to improve the paper. This work was supported by the National Natural Science Foundation of China (No. 61402005), the Natural Science Foundation of Anhui Province, China (No. 1308085QF114), the Higher Education Natural Science Foundation of Anhui Province, China (No. KJ2013A015), and the Open Foundation of Key Laboratory of Intelligent Computing and Signal Processing at Anhui University of Ministry of Education of China (2014).
